The LogFS Flash Filesystem
==========================

Specification
=============

Superblocks
-----------

Two superblocks exist at the beginning and end of the filesystem.
Each superblock is 256 Bytes large, with another 3840 Bytes reserved
for future purposes, making a total of 4096 Bytes.

Superblock locations may differ for MTD and block devices.  On MTD the
first non-bad block contains a superblock in the first 4096 Bytes and
the last non-bad block contains a superblock in the last 4096 Bytes.
On block devices, the first 4096 Bytes of the device contain the first
superblock and the last aligned 4096-Byte block contains the second
superblock.
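
As an illustration only, the block device case reduces to simple
offset arithmetic.  The sketch below assumes nothing beyond the
4096-Byte areas described above and is not taken from the logfs
sources; the reading of "last aligned block" is one plausible
interpretation.

    #include <stdint.h>

    #define SB_AREA_SIZE 4096    /* 256 Bytes used + 3840 Bytes reserved */

    /* First superblock: the first 4096 Bytes of the device. */
    static uint64_t first_sb_offset(void)
    {
        return 0;
    }

    /* Second superblock: the last 4096-Byte-aligned block that fits
     * entirely on the device (assumed interpretation). */
    static uint64_t second_sb_offset(uint64_t device_size)
    {
        return (device_size & ~(uint64_t)(SB_AREA_SIZE - 1)) - SB_AREA_SIZE;
    }
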
For the most part, the superblocks can be considered read-only.  They
are written only to correct errors detected within the superblocks,
move the journal and change the filesystem parameters through tunefs.
As a result, the superblock does not contain any fields that require
constant updates, like the amount of free space, etc.

Segments
--------

The space in the device is split up into equal-sized segments.
Segments are the primary write unit of LogFS.  Within each segment,
writes happen from front (low addresses) to back (high addresses).  If
only a partial segment has been written, the segment number, the
current position within it and optionally a write buffer are stored in
the journal.
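
Conceptually, the journal record for a partially written segment holds
just the three items named above.  A hypothetical in-memory form, with
invented field names, could be:

    #include <stddef.h>
    #include <stdint.h>

    struct open_segment_state {
        uint32_t segno;       /* segment currently being written */
        uint32_t write_pos;   /* next write offset within the segment */
        void *wbuf;           /* optional write buffer, NULL if unused */
        size_t wbuf_len;      /* valid Bytes in the write buffer */
    };
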
Segments are erased as a whole.  Therefore Garbage Collection may be
required to completely free a segment before it can be erased.

Journal
-------

The journal contains all global information about the filesystem that
is subject to frequent change.  At mount time, it has to be scanned
for the most recent commit entry, which contains a list of pointers to
all currently valid entries.

Object Store
------------

All space except for the superblocks and journal is part of the object
store.  Each segment contains a segment header and a number of
objects, each consisting of an object header and the payload.
Objects are either inodes, directory entries (dentries), file data
blocks or indirect blocks.
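
To make the layout concrete, a segment is therefore a header followed
by a sequence of (header, payload) pairs.  The structures below are
purely illustrative guesses, not the real on-disk format:

    #include <stdint.h>

    /* Illustrative only -- not the actual logfs on-disk structures. */
    struct example_segment_header {
        uint32_t segno;       /* segment number */
        uint8_t  level;       /* level of all objects in this segment */
    };

    struct example_object_header {
        uint64_t ino;         /* owning inode number */
        uint64_t bix;         /* block index within the inode */
        uint16_t len;         /* payload length in Bytes */
        uint8_t  type;        /* inode, dentry, data or indirect block */
        /* 'len' Bytes of payload follow */
    };
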
Levels
------

Garbage collection (GC) may fail if all data is written
indiscriminately.  One requirement of GC is that data is separated
roughly according to the distance between the tree root and the data.
Effectively that means all file data is on level 0, indirect blocks
are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
respectively.  Inode file data is on level 6 for the inodes and on
levels 7-11 for its indirect blocks.
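
A minimal sketch of this rule, where depth is 0 for data blocks and
1-5 for 1x-5x indirect blocks:

    /* Level assignment as described above: regular files use levels
     * 0-5, the inode file uses levels 6-11. */
    static int object_level(int depth, int is_inode_file)
    {
        return depth + (is_inode_file ? 6 : 0);
    }
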
Each segment contains objects of a single level only.  As a result,
each level requires its own separate segment to be open for writing.

Inode File
----------

All inodes are stored in a special file, the inode file.  The single
exception is the inode file's inode (master inode), which for obvious
reasons is stored in the journal instead.  Instead of data blocks, the
leaf nodes of the inode file are inodes.

Aliases
-------

Writes in LogFS are done by means of a wandering tree.  A naïve
implementation would require that for each write of a block, all
parent blocks are written as well, since the block pointers have
changed.  Such an implementation would not be very efficient.

In LogFS, the block pointer changes are cached in the journal by means
of alias entries.  Each alias consists of its logical address - inode
number, block index, level and child number (index into the block) -
and the changed data.  Any 8-byte word can be changed in this manner.

Currently aliases are used for block pointers, file size, file used
bytes and the height of an inode's indirect tree.
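
An alias is thus a fully qualified pointer to a single 8-byte word
plus its new value.  A hypothetical in-memory representation, with
invented names:

    #include <stdint.h>

    struct example_alias {
        uint64_t ino;        /* inode number */
        uint64_t bix;        /* block index */
        uint8_t  level;      /* level of the affected block */
        uint16_t child_no;   /* index of the word within the block */
        uint64_t val;        /* the changed 8-byte word */
    };
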
Segment Aliases
---------------

Related to regular aliases, these are used to handle bad blocks.
Initially, bad blocks are handled by moving the affected segment
content to a spare segment and noting this move in the journal with a
segment alias, a simple (to, from) tuple.  GC will later empty this
segment and the alias can be removed again.  This is used on MTD only.

Vim
---

By cleverly predicting the lifetime of data, it is possible to
separate long-living data from short-living data and thereby reduce
the GC overhead later.  Each type of distinct life expectancy (vim)
can have a separate segment open for writing.  Each (level, vim) tuple
can be open just once.  If an open segment with an unknown vim is
encountered at mount time, it is closed and ignored henceforth.
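
Since each (level, vim) tuple can have at most one open segment, the
set of open segments can be pictured as a small two-dimensional table.
A sketch with made-up constants:

    #include <stdint.h>

    #define EXAMPLE_NO_LEVELS 12   /* levels 0-11, see above */
    #define EXAMPLE_NO_VIMS    2   /* e.g. short-living vs. long-living */

    /* At most one open segment per (level, vim); 0 means none is open. */
    static uint32_t open_segment[EXAMPLE_NO_LEVELS][EXAMPLE_NO_VIMS];
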
Indirect Tree
-------------

Inodes in LogFS are similar to FFS-style filesystems with direct and
indirect block pointers.  One difference is that LogFS uses a single
indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
A height field in the inode defines the height of the indirect tree
and thereby the indirection of the pointer.

Another difference is the addressing of indirect blocks.  In LogFS,
the first 16 pointers in the first indirect block are left empty,
corresponding to the 16 direct pointers in the inode.  In ext2 (and
maybe others as well) the first pointer in the first indirect block
corresponds to logical block 12, skipping the 12 direct pointers.
So where ext2 uses arithmetic to better utilize space, LogFS keeps
the arithmetic simple and uses compression to save space.
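
As a worked example for a file with a 1x indirect tree (names
invented, counts taken from the text above):

    /* LogFS: slot in the first indirect block equals the logical block
     * index; slots 0-15 stay empty because those blocks sit behind the
     * 16 direct pointers in the inode.
     * ext2:  slot = logical block index - 12, so no slot is wasted. */
    static int logfs_style_slot(int logical_block)
    {
        return logical_block;          /* slots 0-15 unused */
    }

    static int ext2_style_slot(int logical_block)
    {
        return logical_block - 12;     /* ext2 has 12 direct pointers */
    }
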
Compression
-----------

Both file data and metadata can be compressed.  Compression for file
data can be enabled with chattr +c and disabled with chattr -c.  Doing
so has no effect on existing data, but new data will be stored
accordingly.  New inodes will inherit the compression flag of the
parent directory.
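
chattr manipulates the per-inode flags through the generic flags
ioctl.  A minimal C sketch of what "chattr +c <file>" amounts to
(error handling trimmed; whether the flag is honoured is up to the
filesystem):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    /* Set FS_COMPR_FL on an inode, roughly what 'chattr +c' does. */
    static int enable_compression(const char *path)
    {
        int fd = open(path, O_RDONLY);
        int flags, err;

        if (fd < 0)
            return -1;
        err = ioctl(fd, FS_IOC_GETFLAGS, &flags);
        if (!err) {
            flags |= FS_COMPR_FL;
            err = ioctl(fd, FS_IOC_SETFLAGS, &flags);
        }
        close(fd);
        return err;
    }
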
Metadata is always compressed.  However, the space accounting ignores
this and charges for the uncompressed size.  Failing to do so could
result in GC failures when, after moving some data, indirect blocks
compress worse than previously.  Even on a 100% full medium, GC may
not consume any extra space, so the compression gains are lost space
to the user.

However, they are not lost space to the filesystem internals.  By
charging the user for those bytes, the filesystem gains some slack
space and GC will run less often and faster.

Garbage Collection and Wear Leveling
------------------------------------

Garbage collection is invoked whenever the number of free segments
falls below a threshold.  The best (known) candidate is picked based
on the least amount of valid data contained in the segment.  All
remaining valid data is copied elsewhere, thereby invalidating it.

The GC code also checks for aliases and writes them back if their
number gets too large.

Wear leveling is done by occasionally picking a suboptimal segment for
garbage collection.  If a stale segment's erase count is significantly
lower than the active segments' erase counts, it will be picked.  Wear
leveling is rate limited, so it will never monopolize the device for
more than one segment worth at a time.

Values for "occasionally" and "significantly lower" are compile-time
constants.
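
Put together, candidate selection can be sketched roughly as below.
All names and the threshold are invented for illustration; the real
policy lives in the logfs GC code.

    #include <stdbool.h>
    #include <stdint.h>

    struct seg_info {                 /* invented, illustrative summary */
        uint32_t segno;
        uint32_t valid_bytes;         /* valid data still in the segment */
        uint32_t erase_count;
    };

    /* Normally pick the segment with the least valid data; on an
     * occasional wear-leveling turn, prefer a stale segment whose
     * erase count lags far behind the average. */
    static struct seg_info *pick_gc_candidate(struct seg_info *segs, int n,
                                              uint32_t avg_erase_count,
                                              bool wear_level_turn)
    {
        struct seg_info *best = &segs[0];
        int i;

        for (i = 0; i < n; i++) {
            if (wear_level_turn &&
                segs[i].erase_count + 50 < avg_erase_count)  /* 50 is made up */
                return &segs[i];
            if (segs[i].valid_bytes < best->valid_bytes)
                best = &segs[i];
        }
        return best;
    }
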
Hashed directories
------------------

To allow efficient lookup(), directory entries are hashed and
located based on the hash.  In order to both support large directories
and not be overly inefficient for small directories, several hash
tables of increasing size are used.  For each table, the hash value
modulo the table size gives the table index.

Table sizes are chosen to limit the number of indirect blocks with a
fully populated table to 0, 1, 2 or 3 respectively.  So the first
table contains 16 entries, the second 512-16, etc.
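
A sketch of the resulting slot computation; the table sizes shown
match only the first two figures given above and the helper is
invented:

    #include <stdint.h>

    /* Illustrative table sizes: 16 entries, then 512-16; further tables
     * follow the indirect-block rule described above. */
    static const uint64_t table_size[] = { 16, 512 - 16 /* , ... */ };

    /* Slot of a hash within table t: the table's base offset plus
     * the hash modulo the table size. */
    static uint64_t dir_slot(int t, uint32_t hash)
    {
        uint64_t base = 0;
        int i;

        for (i = 0; i < t; i++)
            base += table_size[i];
        return base + hash % table_size[t];
    }
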
The last table is special in several ways.  First, its size depends on
the effective 32bit limit on telldir/seekdir cookies.  Since logfs
uses the upper half of the address space for indirect blocks, the size
is limited to 2^31.  Secondly, the table contains hash buckets with 16
entries each.

Using single-entry buckets would result in birthday "attacks".  At
just 2^16 used entries, hash collisions would be likely (P >= 0.5).
My math skills are insufficient to do the combinatorics for the 17x
collisions necessary to overflow a bucket, but testing showed that in
10,000 runs the lowest directory fill before a bucket overflow was
188,057,130 entries with an average of 315,149,915 entries.  So for
directory sizes of up to a million, bucket overflows should be
virtually impossible under normal circumstances.

With carefully chosen filenames, it is obviously possible to cause an
overflow with just 21 entries (4 higher tables + 16 entries + 1).  So
there may be a security concern if a malicious user has write access
to a directory.

Open For Discussion
===================

Device Address Space
--------------------

A device address space is used for caching.  Both block devices and
MTD provide functions to either read a single page or write a segment.
Partial segments may be written for data integrity, but where possible
complete segments are written for performance on simple block device
flash media.

Meta Inodes
-----------

Inodes are stored in the inode file, which is just a regular file for
most purposes.  At umount time, however, the inode file needs to
remain open until all dirty inodes are written.  So
generic_shutdown_super() may not close this inode, but shouldn't
complain about remaining inodes due to the inode file either.  The
same goes for the mapping inode of the device address space.

Currently logfs uses a hack that essentially copies part of fs/inode.c
code over.  A general solution would be preferred.

Indirect block mapping
----------------------

With compression, the block device (or mapping inode) cannot be used
to cache indirect blocks.  Some other place is required.  Currently
logfs uses the top half of each inode's address space.  The low 8TB
(on 32bit) are filled with file data, the high 8TB are used for
indirect blocks.
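
For scale: with 32bit page indices and 4096-Byte pages, an inode's
address space covers 2^32 * 2^12 Bytes = 16TB, so each half is 8TB.
A sketch of the split; the helper and its exact mapping are invented
for illustration:

    #include <stdint.h>

    #define HALF_SPACE_PAGES (1ULL << 31)   /* 8TB of 4096-Byte pages */

    /* File data uses page indices 0 .. 2^31-1 (the low 8TB); cached
     * indirect blocks are mapped somewhere into the upper half. */
    static uint64_t indirect_cache_index(uint64_t block_index)
    {
        return HALF_SPACE_PAGES + block_index;
    }
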
One problem is that 16TB files created on 64bit systems actually have
data in the top 8TB.  But files >16TB would cause problems anyway, so
only the limit has changed.