| 
									
										
										
										
											2009-02-09 17:02:34 +03:00
										 |  |  | POHMELFS: Parallel Optimized Host Message Exchange Layered File System. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		Evgeniy Polyakov <zbr@ioremap.net> | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Homepage: http://www.ioremap.net/projects/pohmelfs | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | POHMELFS first began as a network filesystem with coherent local data and | 
					
						
							|  |  |  | metadata caches but is now evolving into a parallel distributed filesystem. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Main features of this FS include: | 
					
						
							|  |  |  |  * Locally coherent cache for data and metadata with (potentially) byte-range locks. | 
					
						
							|  |  |  | 	Since all Linux filesystems lock the whole inode during writing, algorithm | 
					
						
							|  |  |  | 	is very simple and does not use byte-ranges, although they are sent in | 
					
						
							|  |  |  | 	locking messages. | 
					
						
							|  |  |  |  * Completely async processing of all events except creation of hard and symbolic | 
					
						
							|  |  |  | 	links, and rename events. | 
					
						
							|  |  |  | 	Object creation and data reading and writing are processed asynchronously. | 
					
						
							|  |  |  |  * Flexible object architecture optimized for network processing. | 
					
						
							|  |  |  | 	Ability to create long paths to objects and remove arbitrarily huge | 
					
						
							|  |  |  | 	directories with a single network command. | 
					
						
							|  |  |  | 	(like removing the whole kernel tree via a single network command). | 
					
						
							|  |  |  |  * Very high performance. | 
					
						
							|  |  |  |  * Fast and scalable multithreaded userspace server. Being in userspace it works | 
					
						
							|  |  |  | 	with any underlying filesystem and still is much faster than async in-kernel NFS one. | 
					
						
							|  |  |  |  * Client is able to switch between different servers (if one goes down, client | 
					
						
							|  |  |  | 	automatically reconnects to second and so on). | 
					
						
							|  |  |  |  * Transactions support. Full failover for all operations. | 
					
						
							|  |  |  | 	Resending transactions to different servers on timeout or error. | 
					
						
							|  |  |  |  * Read request (data read, directory listing, lookup requests) balancing between multiple servers. | 
					
						
							|  |  |  |  * Write requests are replicated to multiple servers and completed only when all of them are acked. | 
					
						
							|  |  |  |  * Ability to add and/or remove servers from the working set at run-time. | 
					
						
							|  |  |  |  * Strong authentification and possible data encryption in network channel. | 
					
						
							|  |  |  |  * Extended attributes support. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | POHMELFS is based on transactions, which are potentially long-standing objects that live | 
					
						
							|  |  |  | in the client's memory. Each transaction contains all the information needed to process a given | 
					
						
							|  |  |  | command (or set of commands, which is frequently used during data writing: single transactions | 
					
						
							|  |  |  | can contain creation and data writing commands). Transactions are committed by all the servers | 
					
						
							|  |  |  | to which they are sent and, in case of failures, are eventually resent or dropped with an error. | 
					
						
							|  |  |  | For example, reading will return an error if no servers are available. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | POHMELFS uses a asynchronous approach to data processing. Courtesy of transactions, it is | 
					
						
							|  |  |  | possible to detach replies from requests and, if the command requires data to be received, the | 
					
						
							|  |  |  | caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different | 
					
						
							|  |  |  | servers and async threads will pick up replies in parallel, find appropriate transactions in the | 
					
						
							|  |  |  | system and put the data where it belongs (like the page or inode cache). | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | The main feature of POHMELFS is writeback data and the metadata cache. | 
					
						
							|  |  |  | Only a few non-performance critical operations use the write-through cache and | 
					
						
							|  |  |  | are synchronous: hard and symbolic link creation, and object rename. Creation, | 
					
						
							|  |  |  | removal of objects and data writing are asynchronous and are sent to | 
					
						
							|  |  |  | the server during system writeback. Only one writer at a time is allowed for any | 
					
						
							|  |  |  | given inode, which is guarded by an appropriate locking protocol. | 
					
						
							|  |  |  | Because of this feature, POHMELFS is extremely fast at metadata intensive | 
					
						
							|  |  |  | workloads and can fully utilize the bandwidth to the servers when doing bulk | 
					
						
							|  |  |  | data transfers. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | POHMELFS clients operate with a working set of servers and are capable of balancing read-only | 
					
						
							| 
									
										
										
										
											2009-03-27 15:04:29 +03:00
										 |  |  | operations (like lookups or directory listings) between them according to IO priorities. | 
					
						
							| 
									
										
										
										
											2009-02-09 17:02:34 +03:00
										 |  |  | Administrators can add or remove servers from the set at run-time via special commands (described | 
					
						
							| 
									
										
										
										
											2011-08-15 02:02:26 +02:00
										 |  |  | in Documentation/filesystems/pohmelfs/info.txt file). Writes are replicated to all servers, which | 
					
						
							|  |  |  | are connected with write permission turned on. IO priority and permissions can be changed in | 
					
						
							|  |  |  | run-time. | 
					
						
							| 
									
										
										
										
											2009-02-09 17:02:34 +03:00
										 |  |  | 
 | 
					
						
							|  |  |  | POHMELFS is capable of full data channel encryption and/or strong crypto hashing. | 
					
						
							|  |  |  | One can select any kernel supported cipher, encryption mode, hash type and operation mode | 
					
						
							|  |  |  | (hmac or digest). It is also possible to use both or neither (default). Crypto configuration | 
					
						
							|  |  |  | is checked during mount time and, if the server does not support it, appropriate capabilities | 
					
						
							|  |  |  | will be disabled or mount will fail (if 'crypto_fail_unsupported' mount option is specified). | 
					
						
							|  |  |  | Crypto performance heavily depends on the number of crypto threads, which asynchronously perform | 
					
						
							|  |  |  | crypto operations and send the resulting data to server or submit it up the stack. This number | 
					
						
							|  |  |  | can be controlled via a mount option. |