Saturday, February 3, 2007

Winter of Code (part 3)

I am going to talk today about some design criteria for the file explorer. See here and here for the previous posts in the series.

Up to this point I have stated that speed is a primary concern. So how I am going to beat explorer in this area? Two strategies I have in mind:
  • Focus only on NTFS and Windows XP

  • Be less flexible but more efficient

Let’s talk about each one.


Focus only on NTFS

I only use NTFS volumes. This has been the case for about 5 years now (even for external USB/Firewire disks). The only case where I use FAT is for key-sized memory sticks. All the people I know do the same but more importantly, most people out there also do. For example, a survey I trust is done by Steam (http://www.steampowered.com/status/survey.html) (which began on Nov 15, 2006): it had 942,000 unique PCs from which 93.66 % have NTFS as a mounted file system. The same survey shows also that the vast majority of people are running some flavor of Windows XP.

By targeting NTFS and XP I hope I can use efficient techniques and unique APIs that are simply not available for FAT32 volumes or Win2k. In some cases the difference is so dramatic that you can understand why Microsoft wants so dearly to get rid of the FAT legacy in Vista and beyond.

Be less flexible but more efficient

The amount of queries and miscellaneous operations that happen undercover related to COM and to the registry to render a folder view is truly frightening. This is what makes explorer a very extensible host but at the same time it makes it very slow. Not supporting complex extensibility models works well for me because I disagree with the overhead that they cause and I do not have the time or resources to provide them. Besides, I don't want to get rid of explorer, i just don't want it as the primary interface to manage my files.

Now, enough preliminaries, and before I bore myself to tears (as I am sure I have no readers), let's get down to business:


Some NTFS background
Let’s delve into some basic technical terms related to NTFS to be used on this series.

Sectors: While the physical construction of a disk dictates that it has several platters each divided in cylinders (which are akin to circular tracks) and so on, the built in microcontroller in the disk abstracts all that and presents to the computer a linear addressable space starting from sector 0 to sector N. A sector is thus the minimal addressable unit of I/O on a disk; or in other words, a sector is to a disk what a word is to RAM: There is no way to just read or write just 2 bits. Typically, a sector is 512 bytes and for example in my home PC I have 60115229 sectors in my boot partition.

Partition: A disk can be partitioned among several filesystems with the condition that each filesystem has to span a linear range of sectors. For example, you can have a Linux filesystem between sector m to sector m + p where p times 512 bytes is the size of the Linux partition and then have a NTFS filesystem starting at sector m+p+1 and ending on a sector farther down.

Volume: A volume is the OS representation of a partition that has been formatted with a filesystem that they can operate on so a single physical disk can be represented by several volumes. From the perspective of this project we only care about volumes and in particular NTFS volumes.

Cluster/LCN: The minimal addressable unit in NTFS is called a cluster. A cluster size is always an integer multiple of the sector size; in most cases being 32, 16, 8 or 4 sectors. This is a parameter that can be specified when a partition is formatted with NTFS. Every cluster in NTFS has a LCN (Logical Cluster Number). The cluster number starts with 0 at the first cluster of the file system. By the way the cluster 0 always contains the beginning of a special file known as $Boot, aka the boot record. The boot record is important because it contains the general parameters of the volume, such as total number of sectors, sectors per cluster and where the MFT is located.

MFT: Everything is a file in a NTFS volume. Even the very on-disk structures that control the operation of NTFS. The index to all and every file is the Master File Table ($Mft). The MFT file is really the heart of the volume and is located not too far from the beginning of the volume. There is, of course, a mirror copy of this important file somewhere in the middle of the volume, in case it gets corrupted. By the way, the MFT file is never fragmented and thus at format time it gets a contiguous cluster range reserved that can shrink up to 12.5% of the volume size.

MFT Record: Each entry in MFT is known as a file record and its default size is 1024 bytes. The first 42 bytes is fixed for header and the rest of the entry stores attribute, an example of an attribute is the file name or another one the modified date.

$ Files: The key files that control the operation of NTFS have as prefix the $ (dollar) sign and they are hidden; We have been introduced to two of them $Boot and $Mft but there are others that we'll meet in due time.

The image below is a screenshot of a third party tool showing a partial decoding of the boot record ($Boot) showing the key parameters of my XP boot volume:


Ok, that is all for this post, next time we'll get deep in enemy territory.

No comments: