The LogFS Flash Filesystem
==========================

Specification
=============

Superblocks
-----------

Two superblocks exist at the beginning and end of the filesystem.
Each superblock is 256 bytes in size, with another 3840 bytes reserved
for future purposes, making a total of 4096 bytes.

Superblock locations may differ for MTD and block devices.  On MTD,
the first non-bad block contains a superblock in its first 4096 bytes
and the last non-bad block contains a superblock in its last 4096
bytes.  On block devices, the first 4096 bytes of the device contain
the first superblock and the last aligned 4096-byte block contains the
second superblock.

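For a block device, both locations follow from the device size alone.
A minimal sketch (the helper name is hypothetical, not the actual
logfs or mkfs code):

#include <stdint.h>

#define SB_SIZE 4096	/* 256 bytes used, 3840 bytes reserved */

/* The first superblock sits at offset 0.  The second occupies the
 * last 4096-byte-aligned block that still fits completely on the
 * device; a trailing partial block, if any, is skipped. */
static uint64_t second_sb_offset(uint64_t device_bytes)
{
	return (device_bytes & ~(uint64_t)(SB_SIZE - 1)) - SB_SIZE;
}
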
For the most part, the superblocks can be considered read-only.  They
are written only to correct errors detected within the superblocks, to
move the journal, and to change the filesystem parameters through
tunefs.  As a result, the superblock does not contain any fields that
require constant updates, such as the amount of free space.

Segments
--------

The space in the device is split up into equal-sized segments.
Segments are the primary write unit of LogFS.  Within each segment,
writes happen from front (low addresses) to back (high addresses).  If
only a partial segment has been written, the segment number, the
current position within it, and optionally a write buffer are stored
in the journal.

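The saved write state might be modeled as the following record; the
field names are illustrative, not the actual on-disk journal format:

#include <stdint.h>

/* State of a partially written segment, as remembered across a
 * remount (illustrative layout). */
struct open_segment {
	uint32_t segno;		/* segment number */
	uint32_t write_ofs;	/* current write position within it */
	uint32_t buf_bytes;	/* valid bytes in the write buffer */
	uint8_t  buf[];		/* optional write buffer contents */
};
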
Segments are erased as a whole.  Therefore, garbage collection may be
required to completely free a segment before doing so.

Journal
-------

The journal contains all global information about the filesystem that
is subject to frequent change.  At mount time, it has to be scanned
for the most recent commit entry, which contains a list of pointers to
all currently valid entries.

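Conceptually, the scan just keeps the newest commit entry it
encounters.  A sketch with hypothetical types - the "version" field is
an assumption standing in for however logfs actually orders commits:

#include <stddef.h>
#include <stdint.h>

struct commit_entry {
	uint64_t version;	/* assumed: increases with every commit */
	/* ... pointers to all currently valid entries ... */
};

/* Pick the most recent of the n commit entries found while scanning
 * the journal area (sketch only). */
static struct commit_entry *latest_commit(struct commit_entry **c, size_t n)
{
	struct commit_entry *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (!best || c[i]->version > best->version)
			best = c[i];
	return best;
}
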
Object Store
------------

All space except for the superblocks and journal is part of the object
store.  Each segment contains a segment header and a number of
objects, each consisting of an object header and its payload.
Objects are either inodes, directory entries (dentries), file data
blocks or indirect blocks.

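An object might be pictured as follows.  The layout is illustrative
only; it does not match the actual on-disk logfs structures field for
field:

#include <stdint.h>

enum object_type {	/* the four object types named above */
	OBJ_INODE,
	OBJ_DENTRY,
	OBJ_DATA_BLOCK,
	OBJ_INDIRECT_BLOCK,
};

struct object_header {	/* immediately followed by the payload */
	uint8_t  type;		/* enum object_type */
	uint8_t  level;		/* see "Levels" below */
	uint16_t len;		/* payload length in bytes */
	uint64_t ino;		/* owning inode number */
	uint64_t bix;		/* block index within that inode */
};
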
Levels
------

Garbage collection (GC) may fail if all data is written
indiscriminately.  One requirement of GC is that data is separated
roughly according to the distance between the tree root and the data.
Effectively that means all file data is on level 0; indirect blocks
are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect
blocks, respectively.  Inode file data is on level 6 for the inodes
and on levels 7-11 for its indirect blocks.

Each segment contains objects of a single level only.  As a result,
each level requires its own separate segment to be open for writing.

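The mapping described above is simple enough to state as code; a
sketch, with names that are illustrative rather than the real logfs
identifiers:

#include <stdbool.h>

#define LEVEL_INODE_OFFSET 6	/* levels 6-11 belong to the inode file */

/* File data has indirection 0, an Nx indirect block indirection N. */
static int object_level(int indirection, bool inode_file)
{
	int level = indirection;	/* 0-5 for regular files */

	if (inode_file)			/* 6-11 for the inode file */
		level += LEVEL_INODE_OFFSET;
	return level;
}
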
Inode File
----------

All inodes are stored in a special file, the inode file.  The single
exception is the inode file's own inode (the master inode), which for
obvious reasons is stored in the journal instead.  Instead of data
blocks, the leaf nodes of the inode file are inodes.

Aliases
-------

Writes in LogFS are done by means of a wandering tree.  A naïve
implementation would require that for each write of a block, all
parent blocks are written as well, since the block pointers have
changed.  Such an implementation would not be very efficient.

In LogFS, block pointer changes are cached in the journal by means of
alias entries.  Each alias consists of its logical address - inode
number, block index, level and child number (index into the block) -
and the changed data.  Any 8-byte word can be changed in this manner.

Currently, aliases are used for block pointers, the file size, the
number of used bytes and the height of an inode's indirect tree.

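An alias entry thus carries a logical address plus the new value; an
illustrative layout, not the actual journal format:

#include <stdint.h>

/* One cached change to a single 8-byte word of some block. */
struct alias_entry {
	uint64_t ino;		/* inode number */
	uint64_t bix;		/* block index */
	uint8_t  level;		/* tree level of the block */
	uint16_t child_no;	/* word index within the block */
	uint64_t val;		/* the changed 8-byte word */
};
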
Segment Aliases
---------------

Related to regular aliases, these are used to handle bad blocks.
Initially, bad blocks are handled by moving the affected segment
content to a spare segment and noting this move in the journal with a
segment alias, a simple (to, from) tuple.  GC will later empty this
segment and the alias can be removed again.  This mechanism is used on
MTD only.

Vim
---

By cleverly predicting the lifetime of data, it is possible to
separate long-living data from short-living data and thereby reduce
the GC overhead later.  Each type of distinct life expectancy (vim)
can have a separate segment open for writing.  Each (level, vim) tuple
can be open just once.  If an open segment with an unknown vim is
encountered at mount time, it is closed and ignored henceforth.

Indirect Tree
-------------

Inodes in LogFS are similar to those of FFS-style filesystems, with
direct and indirect block pointers.  One difference is that LogFS uses
a single indirect pointer that can be either a 1x, 2x, etc. indirect
pointer.  A height field in the inode defines the height of the
indirect tree and thereby the indirection of the pointer.

Another difference is the addressing of indirect blocks.  In LogFS,
the first 16 pointers in the first indirect block are left empty,
corresponding to the 16 direct pointers in the inode.  In ext2 (and
maybe others as well) the first pointer in the first indirect block
corresponds to logical block 12, skipping the 12 direct pointers.
So where ext2 uses arithmetic to better utilize space, LogFS keeps
the arithmetic simple and uses compression to save space.

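The difference shows up when a logical block index is turned into a
slot within a 1x indirect block; a sketch, not the real code of either
filesystem:

/* ext2-style: indirect slots start right after the direct pointers,
 * so the logical block index has to be rebased. */
static int ext2_style_slot(int logical_block)
{
	return logical_block - 12;	/* ext2 has 12 direct pointers */
}

/* LogFS-style: slots 0-15 mirror the 16 direct pointers and stay
 * empty, so the index is used unchanged.  The empty slots cost
 * little because indirect blocks are compressed. */
static int logfs_style_slot(int logical_block)
{
	return logical_block;
}
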
Compression
-----------

Both file data and metadata can be compressed.  Compression for file
data can be enabled with chattr +c and disabled with chattr -c.  Doing
so has no effect on existing data, but new data will be stored
accordingly.  New inodes will inherit the compression flag of the
parent directory.

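chattr +c is a thin wrapper around the standard inode-flags ioctl, so
the same switch can be flipped from a program; a userspace sketch:

#include <linux/fs.h>
#include <sys/ioctl.h>

/* Equivalent of "chattr +c" on an already opened file. */
static int enable_compression(int fd)
{
	int flags;

	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
		return -1;
	flags |= FS_COMPR_FL;	/* clear the bit for chattr -c */
	return ioctl(fd, FS_IOC_SETFLAGS, &flags);
}
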
Metadata is always compressed.  However, the space accounting ignores
this and charges for the uncompressed size.  Failing to do so could
result in GC failures when, after moving some data, indirect blocks
compress worse than previously.  Even on a 100% full medium, GC may
not consume any extra space, so the compression gains are lost space
to the user.

However, they are not lost space to the filesystem internals.  By
cheating the user out of those bytes, the filesystem gains some slack
space and GC will run less often and faster.

Garbage Collection and Wear Leveling
------------------------------------

Garbage collection is invoked whenever the number of free segments
falls below a threshold.  The best (known) candidate is picked based
on the least amount of valid data contained in the segment.  All
remaining valid data is copied elsewhere, thereby invalidating it.

The GC code also checks for aliases and writes them back if their
number gets too large.

Wear leveling is done by occasionally picking a suboptimal segment for
garbage collection.  If a stale segment's erase count is significantly
lower than the active segments' erase counts, it will be picked.  Wear
leveling is rate limited, so it will never monopolize the device for
more than one segment's worth at a time.

The values for "occasionally" and "significantly lower" are compile
time constants.

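Combining the two policies, candidate selection could look like the
sketch below.  The structure and both constants are stand-ins for the
compile time constants mentioned above, not the real logfs values:

#include <stdint.h>

#define WL_RATIO	16	/* try wear leveling every 16th GC pass */
#define WL_DELTA	50	/* "significantly lower" erase-count gap */

struct seg_info {
	uint32_t valid_bytes;	/* still-valid data in the segment */
	uint32_t erase_count;
};

static uint32_t pick_segment(const struct seg_info *s, uint32_t nseg,
			     uint32_t gc_pass, uint32_t avg_erase_count)
{
	uint32_t i, best = 0;

	/* occasionally prefer a stale, rarely erased segment; tying
	 * this to every Nth pass also rate-limits wear leveling */
	if (gc_pass % WL_RATIO == 0)
		for (i = 0; i < nseg; i++)
			if (s[i].erase_count + WL_DELTA < avg_erase_count)
				return i;

	/* otherwise pick the segment with the least valid data */
	for (i = 0; i < nseg; i++)
		if (s[i].valid_bytes < s[best].valid_bytes)
			best = i;
	return best;
}
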
Hashed directories
------------------

To allow efficient lookup(), directory entries are hashed and located
based on the hash.  In order to both support large directories and
not be overly inefficient for small directories, several hash tables
of increasing size are used.  For each table, the hash value modulo
the table size gives the table index.

Table sizes are chosen to limit the number of indirect blocks with a
fully populated table to 0, 1, 2 or 3, respectively.  So the first
table contains 16 entries, the second 512-16, etc.

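A lookup probes each table in turn until the entry is found.  In the
sketch below, only the first two boundaries (16 and 512) are taken
from the text; the later ones merely continue the pattern and are
assumptions, and the final 2^31-sized table with its 16-entry buckets
is left out:

#include <stdint.h>

/* Cumulative end of each table, in directory slots. */
static const uint64_t table_end[] = { 16, 512, 1ull << 18, 1ull << 27 };

/* Slot to probe in table t: hash modulo the table size, offset by
 * the slot where the table begins. */
static uint64_t table_slot(uint32_t hash, unsigned int t)
{
	uint64_t start = t ? table_end[t - 1] : 0;

	return start + hash % (table_end[t] - start);
}
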
The last table is special in several ways.  First, its size depends on
the effective 32bit limit on telldir/seekdir cookies.  Since logfs
uses the upper half of the address space for indirect blocks, the size
is limited to 2^31.  Secondly, the table contains hash buckets with 16
entries each.

Using single-entry buckets would result in birthday "attacks".  At
just 2^16 used entries, hash collisions would be likely (P >= 0.5).
My math skills are insufficient to do the combinatorics for the 17x
collisions necessary to overflow a bucket, but testing showed that in
10,000 runs the lowest directory fill before a bucket overflow was
188,057,130 entries, with an average of 315,149,915 entries.  So for
directory sizes of up to a million, bucket overflows should be
virtually impossible under normal circumstances.

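For the single-collision figure, the standard birthday approximation
is consistent with the 2^16 quoted above: with n entries hashed into
N = 2^31 values,

	P(collision) ~= 1 - e^(-n(n-1)/2N)

and setting P = 1/2 gives n ~= 1.18 * sqrt(2^31), about 54,600
entries, just below 2^16.
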
With carefully chosen filenames, it is obviously possible to cause an
overflow with just 21 entries (4 higher tables + 16 entries + 1).  So
there may be a security concern if a malicious user has write access
to a directory.

Open For Discussion
===================

Device Address Space
--------------------

A device address space is used for caching.  Both block devices and
MTD provide functions to either read a single page or write a segment.
Partial segments may be written for data integrity, but where possible
complete segments are written for performance on simple block device
flash media.

Meta Inodes
-----------

Inodes are stored in the inode file, which is just a regular file for
most purposes.  At umount time, however, the inode file needs to
remain open until all dirty inodes are written.  So
generic_shutdown_super() may not close this inode, but it shouldn't
complain about remaining inodes due to the inode file either.  The
same goes for the mapping inode of the device address space.

Currently logfs uses a hack that essentially copies part of the
fs/inode.c code over.  A general solution would be preferred.

Indirect block mapping
----------------------

With compression, the block device (or mapping inode) cannot be used
to cache indirect blocks.  Some other place is required.  Currently
logfs uses the top half of each inode's address space.  The low 8TB
(on 32bit) are filled with file data, the high 8TB are used for
indirect blocks.

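On a 32bit machine the page cache can index 2^32 pages of 4096 bytes,
i.e. 16TB per inode; splitting that in half amounts to the following
sketch (names are hypothetical):

#include <stdbool.h>
#include <stdint.h>

#define INDIRECT_BIT (1ull << 31)	/* top half of the page index */

/* Page-cache index for caching one block of an inode: file data
 * keeps its block index, indirect blocks go into the high 8TB. */
static uint64_t cache_index(uint64_t bix, bool indirect)
{
	return indirect ? (bix | INDIRECT_BIT) : bix;
}
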
One problem is that 16TB files created on 64bit systems actually have
data in the top 8TB.  But files >16TB would cause problems anyway, so
only the limit has changed.