| 
									
										
										
										
											2007-06-12 09:07:21 -04:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Copyright (C) 2007 Oracle.  All rights reserved. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * This program is free software; you can redistribute it and/or | 
					
						
							|  |  |  |  * modify it under the terms of the GNU General Public | 
					
						
							|  |  |  |  * License v2 as published by the Free Software Foundation. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * This program is distributed in the hope that it will be useful, | 
					
						
							|  |  |  |  * but WITHOUT ANY WARRANTY; without even the implied warranty of | 
					
						
							|  |  |  |  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU | 
					
						
							|  |  |  |  * General Public License for more details. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * You should have received a copy of the GNU General Public | 
					
						
							|  |  |  |  * License along with this program; if not, write to the | 
					
						
							|  |  |  |  * Free Software Foundation, Inc., 59 Temple Place - Suite 330, | 
					
						
							|  |  |  |  * Boston, MA 02111-1307, USA. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-04-02 10:50:19 -04:00
										 |  |  | #ifndef __BTRFS_I__
 | 
					
						
							|  |  |  | #define __BTRFS_I__
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												Btrfs: improve inode hash function/inode lookup
Currently the hash value used for adding an inode to the VFS's inode
hash table consists of the plain inode number, which is a 64-bit
integer. This results in hash table buckets (hlist_head lists) with
too many elements in at least 2 important scenarios:
1) When we have many subvolumes. Each subvolume has its own btree
   to which its files and directories are added, and each has its
   own objectid (inode number) namespace. This means that if we have
   N subvolumes, and all have inode number X associated with a file or
   directory, the corresponding inodes all map to the same hash table
   entry, resulting in a bucket (hlist_head list) with N elements;
2) On 32-bit machines. The VFS hash values are unsigned longs, which
   are 32 bits wide on 32-bit machines, and the inode (objectid)
   numbers are 64-bit unsigned integers. We simply cast the inode
   numbers to hash values, which means that all inodes with the
   same lower 32 bits use the same hash bucket. For example, the
   inodes numbered (objectid) 0x0000_0000_ffff_ffff and
   0xffff_ffff_ffff_ffff share the lower half 0xffff_ffff and end
   up in the same hash table bucket.
This change ensures the inode's hash value depends both on the
objectid (inode number) and its subvolume's (btree root) objectid.
On 32-bit machines, this change gives better entropy by making
the hash value depend on both the upper and lower 32 bits of the
64-bit hash previously computed.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
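The hashing idea described in the commit message above can be sketched in user space as follows. This is only an illustration under stated assumptions: the function names and the multiplier constant are hypothetical stand-ins, not the kernel's actual helper. It mixes the subvolume (root) objectid into the inode number, then folds the 64-bit value so both halves contribute when unsigned long is 32 bits wide.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Assumed odd 64-bit multiplier, chosen for illustration only. The commit
 * includes <linux/hash.h>; the constant it really uses is not shown here.
 */
#define SKETCH_MULTIPLIER 0x9e37fffffffc0001ULL

/* Mix the subvolume (root) objectid into the inode number (hypothetical). */
static uint64_t sketch_mix64(uint64_t ino, uint64_t root_objectid)
{
	return ino ^ (root_objectid * SKETCH_MULTIPLIER);
}

/* Fold 64 bits into 32 so the upper and lower halves both contribute. */
static uint32_t sketch_fold32(uint64_t h)
{
	return (uint32_t)(h >> 32) ^ (uint32_t)h;
}

/* Produce a hash value of the platform's unsigned long width. */
static unsigned long sketch_inode_hash(uint64_t ino, uint64_t root_objectid)
{
	uint64_t h = sketch_mix64(ino, root_objectid);

	if (sizeof(unsigned long) * CHAR_BIT == 32)
		h = sketch_fold32(h);
	return (unsigned long)h;
}
```

With an odd multiplier, two subvolumes that both contain inode number X get different mixed values, and the fold separates inode numbers that differ only in their upper 32 bits.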
											
										 
											2013-10-06 22:22:33 +01:00
										 |  |  | #include <linux/hash.h>
 | 
					
						
							| 
									
										
										
										
											2007-08-27 16:49:44 -04:00
										 |  |  | #include "extent_map.h"
 | 
					
						
							| 
									
										
										
										
											2008-01-24 16:13:08 -05:00
										 |  |  | #include "extent_io.h"
 | 
					
						
							| 
									
										
										
										
											2008-07-17 12:53:50 -04:00
										 |  |  | #include "ordered-data.h"
 | 
					
						
							| 
									
										
											  
											
												btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix OOM under high memory load by storing the delayed nodes in the
  root's radix tree and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race when adding the delayed node to the inode, spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix the deadlock between readdir() and memory fault, reported by
  Itaru Kitayama.
Changelog V3 -> V4:
- Fix the nested lock reported by Itaru Kitayama by updating the space cache
  inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task that balances the
  delayed items, reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the CPU recursion spinlock bug reported by Chris Mason.
Changelog V1 -> V2:
- Break up the global rb-tree and use a list to manage the delayed nodes.
  A delayed node is created for every directory and file and manages the
  delayed directory name index items and the delayed inode item.
- Introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, file creation and deletion on btrfs perform very
poorly. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, the directory name item, the directory name index
and so on. If we delay some of the b+ tree insertions or deletions, we can
improve performance, so this patch implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem that uses two lists to
  manage the delayed nodes, which are created for every file/directory.
  One list holds all the delayed nodes that have delayed items. The other
  holds the delayed nodes that are waiting to be dealt with by the work
  thread.
- Every delayed node has two rb-trees: one manages the directory name
  indexes that are going to be inserted into the b+ tree, and the other
  manages the directory name indexes that are going to be deleted from it.
- Introduce a worker to deal with the delayed operations: the delayed
  directory name index insertions and deletions and the delayed inode
  updates.
  When the number of delayed items exceeds the lower limit, we create works
  for some delayed nodes, insert them into the worker's work queue, and
  then return.
  When the number of delayed items exceeds the upper bound, we create works
  for all the delayed nodes that haven't been dealt with, insert them into
  the worker's work queue, and then wait until the number of untreated
  items drops below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
  the information to the delayed inserting rb-tree.
  Then we check the number of delayed items and balance them.
  (The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we first
  search for it in the inserting rb-tree. If we find it, we just drop it.
  If not, we add its key to the delayed deleting rb-tree.
  As with the delayed inserting rb-tree, we also check the number of
  delayed items and balance them.
  (The same as for insertion.)
- When we want to update the metadata of some inode, we cache the inode's
  data in the delayed node. The worker flushes it into the b+ tree after
  dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access it.
  This way, we can cache more delayed items and merge more inode updates.
- When we commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.
I ran a quick test with the benchmark tool [1] and found that the patch
improves the performance of file creation by ~15% and file deletion by ~20%.
Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030
After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
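The balance policy described in the commit message above (a lower limit that queues some work and returns, an upper bound that queues everything and waits) can be sketched as a small decision function. The enum, the function name and the threshold parameters are hypothetical; the commit message does not give the actual limits or the names used in the patch.

```c
/* Possible outcomes of a balance check (names are illustrative only). */
enum sketch_balance_action {
	SKETCH_DO_NOTHING,         /* few delayed items: keep caching them */
	SKETCH_QUEUE_SOME,         /* above the lower limit: queue some nodes, return */
	SKETCH_QUEUE_ALL_AND_WAIT  /* above the upper bound: queue all nodes, wait */
};

/*
 * Decide what to do after a delayed item was added. "lower" and "upper"
 * are assumed thresholds with lower <= upper; their real values are not
 * stated in the commit message.
 */
static enum sketch_balance_action
sketch_balance(unsigned int nr_delayed_items, unsigned int lower,
	       unsigned int upper)
{
	if (nr_delayed_items > upper)
		return SKETCH_QUEUE_ALL_AND_WAIT;
	if (nr_delayed_items > lower)
		return SKETCH_QUEUE_SOME;
	return SKETCH_DO_NOTHING;
}
```

Checking the upper bound first keeps the two cases from overlapping: a count above both limits always takes the blocking path.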
											
										 
											2011-04-22 18:12:22 +08:00
										 |  |  | #include "delayed-inode.h"
 | 
					
						
							| 
									
										
										
										
											2007-08-27 16:49:44 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-23 14:13:11 -04:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * ordered_data_close is set by truncate when a file that used | 
					
						
							|  |  |  |  * to have good data has been truncated to zero.  When it is set | 
					
						
							|  |  |  |  * the btrfs file release call will add this inode to the | 
					
						
							|  |  |  |  * ordered operations list so that we make sure to flush out any | 
					
						
							|  |  |  |  * new data the application may have written before commit. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | #define BTRFS_INODE_ORDERED_DATA_CLOSE		0
 | 
					
						
							|  |  |  | #define BTRFS_INODE_ORPHAN_META_RESERVED	1
 | 
					
						
							|  |  |  | #define BTRFS_INODE_DUMMY			2
 | 
					
						
							|  |  |  | #define BTRFS_INODE_IN_DEFRAG			3
 | 
					
						
							|  |  |  | #define BTRFS_INODE_DELALLOC_META_RESERVED	4
 | 
					
						
							| 
									
										
										
										
											2012-05-23 14:26:42 -04:00
										 |  |  | #define BTRFS_INODE_HAS_ORPHAN_ITEM		5
 | 
					
						
							| 
									
										
										
										
											2012-06-08 15:26:47 -04:00
										 |  |  | #define BTRFS_INODE_HAS_ASYNC_EXTENT		6
 | 
					
						
							| 
									
										
											  
											
												Btrfs: turbo charge fsync
At least for the vm workload.  Currently on fsync we will
1) Truncate all items in the log tree for the given inode if they exist
and
2) Copy all items for a given inode into the log
The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worse yet you may have
only modified a few extents, not the entire thing.  This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them.  We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already.  Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write
		Original	Patched
SATA drive	82KB/s		140KB/s
Fusion drive	431KB/s		2532KB/s
So around 2-6 times faster depending on your hardware.  There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok.  This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get.  All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.
The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok.  I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
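The core of the approach in the commit message above is to remember which transaction last modified each extent and, at fsync time, pick out only the extents touched in the current transaction and sort them before logging. A minimal user-space sketch follows; the struct layout and function names are assumptions for illustration and do not reflect the kernel's extent map structures.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory extent record carrying the modifying transid. */
struct sketch_extent {
	uint64_t start;      /* file offset of the extent */
	uint64_t len;        /* length in bytes */
	uint64_t generation; /* transid that last modified this extent */
};

/* qsort comparator: order extents by file offset. */
static int sketch_cmp_start(const void *a, const void *b)
{
	const struct sketch_extent *x = a;
	const struct sketch_extent *y = b;

	if (x->start < y->start)
		return -1;
	return x->start > y->start;
}

/*
 * Copy the extents modified in transaction "transid" into "out" (which must
 * have room for n entries), sort them by start offset, and return how many
 * were collected. Only these would need to be written to the log.
 */
static size_t sketch_collect_modified(const struct sketch_extent *all, size_t n,
				      uint64_t transid,
				      struct sketch_extent *out)
{
	size_t count = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (all[i].generation == transid)
			out[count++] = all[i];
	}
	qsort(out, count, sizeof(*out), sketch_cmp_start);
	return count;
}
```

For a file with many extents of which only a few changed, the amount of work scales with the changed extents rather than with the whole file.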
											
										 
											2012-08-17 13:14:17 -04:00
										 |  |  | #define BTRFS_INODE_NEEDS_FULL_SYNC		7
 | 
					
						
							| 
									
										
										
										
											2012-10-11 15:53:56 -04:00
										 |  |  | #define BTRFS_INODE_COPY_EVERYTHING		8
 | 
					
						
							| 
									
										
										
										
											2013-01-29 10:11:59 +00:00
										 |  |  | #define BTRFS_INODE_IN_DELALLOC_LIST		9
 | 
					
						
							| 
									
										
										
										
											2013-02-08 07:01:08 +00:00
										 |  |  | #define BTRFS_INODE_READDIO_NEED_LOCK		10
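The BTRFS_INODE_* values above are bit numbers rather than masks: each one names a bit position in the unsigned long runtime_flags field of struct btrfs_inode. The kernel manipulates such words with its atomic bit operations; the sketch below uses plain, non-atomic helpers with hypothetical names purely to show how a bit number maps onto the flags word.

```c
#include <stdbool.h>

/* Two of the bit numbers from the header, repeated so the sketch stands alone. */
#define BTRFS_INODE_ORDERED_DATA_CLOSE	0
#define BTRFS_INODE_NEEDS_FULL_SYNC	7

/* Set bit "nr" in the flags word (non-atomic, illustration only). */
static void sketch_set_flag(unsigned long *flags, unsigned int nr)
{
	*flags |= 1UL << nr;
}

/* Report whether bit "nr" is set. */
static bool sketch_test_flag(const unsigned long *flags, unsigned int nr)
{
	return (*flags >> nr) & 1UL;
}

/* Clear bit "nr" and report whether it was set beforehand. */
static bool sketch_test_and_clear_flag(unsigned long *flags, unsigned int nr)
{
	bool was_set = sketch_test_flag(flags, nr);

	*flags &= ~(1UL << nr);
	return was_set;
}
```

Because each flag is a bit index, up to BITS_PER_LONG independent flags fit into the single runtime_flags word without widening the structure.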
 | 
					
						
							| 
									
										
										
										
											2012-05-23 14:13:11 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-06-13 16:18:26 -04:00
										 |  |  | /* in memory btrfs inode */ | 
					
						
							| 
									
										
										
										
											2007-04-02 10:50:19 -04:00
										 |  |  | struct btrfs_inode { | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 	/* which subvolume this inode belongs to */ | 
					
						
							| 
									
										
										
										
											2007-04-06 15:37:36 -04:00
										 |  |  | 	struct btrfs_root *root; | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/* key used to find this inode on disk.  This is used by the code
 | 
					
						
							|  |  |  | 	 * to read in roots of subvolumes | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2007-04-06 15:37:36 -04:00
										 |  |  | 	struct btrfs_key location; | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-07-15 15:16:44 +00:00
										 |  |  | 	/* Lock for counters */ | 
					
						
							|  |  |  | 	spinlock_t lock; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 	/* the extent_tree has caches of all the extent mappings to disk */ | 
					
						
							| 
									
										
										
										
											2007-08-27 16:49:44 -04:00
										 |  |  | 	struct extent_map_tree extent_tree; | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/* the io_tree does range state (DIRTY, LOCKED etc) */ | 
					
						
							| 
									
										
										
										
											2008-01-24 16:13:08 -05:00
										 |  |  | 	struct extent_io_tree io_tree; | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/* special utility tree used to record which mirrors have already been
 | 
					
						
							|  |  |  | 	 * tried when checksums fail for a given block | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2008-04-09 16:28:12 -04:00
										 |  |  | 	struct extent_io_tree io_failure_tree; | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/* held while logging the inode in tree-log.c */ | 
					
						
							| 
									
										
										
										
											2008-09-05 16:13:11 -04:00
										 |  |  | 	struct mutex log_mutex; | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-13 12:09:22 -05:00
										 |  |  | 	/* held while doing delalloc reservations */ | 
					
						
							|  |  |  | 	struct mutex delalloc_mutex; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 	/* used to order data wrt metadata */ | 
					
						
							| 
									
										
										
										
											2008-07-17 12:53:50 -04:00
										 |  |  | 	struct btrfs_ordered_inode_tree ordered_tree; | 
					
						
							| 
									
										
										
										
											2007-08-10 16:22:09 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 	/* list of all the delalloc inodes in the FS.  There are times we need
 | 
					
						
							|  |  |  | 	 * to write all the delalloc pages to disk, and this list is used | 
					
						
							|  |  |  | 	 * to walk them all. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2008-08-04 23:17:27 -04:00
										 |  |  | 	struct list_head delalloc_inodes; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-03-31 13:27:11 -04:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * list for tracking inodes that must be sent to disk before a | 
					
						
							|  |  |  | 	 * rename or truncate commit | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	struct list_head ordered_operations; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one.  At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root.  This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, its level and the
tree in which the pointer lives. This information allows us to find the
pointer by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces.  But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
											
										 
											2009-06-10 10:45:14 -04:00
										 |  |  | 	/* node for the red-black tree that links inodes in subvolume root */ | 
					
						
							|  |  |  | 	struct rb_node rb_node; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-23 14:13:11 -04:00
										 |  |  | 	unsigned long runtime_flags; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-04-15 00:44:02 +00:00
										 |  |  | 	/* Keep track of who's O_SYNC/fsyncing currently */ | 
					
						
							| 
									
										
										
										
											2012-11-16 13:56:32 -05:00
										 |  |  | 	atomic_t sync_writers; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 	/* full 64 bit generation number, struct inode doesn't have a big
 | 
					
						
							|  |  |  | 	 * enough field for this. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2008-09-05 16:13:11 -04:00
										 |  |  | 	u64 generation; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-08-10 16:22:09 -04:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * transid of the trans_handle that last modified this inode | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	u64 last_trans; | 
					
						
							| 
									
										
										
										
											2009-10-13 13:21:08 -04:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * log transid when this inode was last modified | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	u64 last_sub_trans; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-09-05 16:13:11 -04:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * transid that last logged this inode | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	u64 logged_trans; | 
					
						
							| 
									
										
										
										
											2008-09-11 15:53:12 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 	/* total number of bytes pending delalloc, used by stat to calc the
 | 
					
						
							|  |  |  | 	 * real block usage of the file | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2008-02-08 13:49:28 -05:00
										 |  |  | 	u64 delalloc_bytes; | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * the size of the file stored in the metadata on disk.  data=ordered | 
					
						
							|  |  |  | 	 * means the in-memory i_size might be larger than the size on disk | 
					
						
							|  |  |  | 	 * because not all the blocks are written yet. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2008-07-17 12:54:05 -04:00
										 |  |  | 	u64 disk_i_size; | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-07-24 12:12:38 -04:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * if this is a directory then index_cnt is the counter for the index | 
					
						
							|  |  |  | 	 * number for new files that are created | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	u64 index_cnt; | 
					
						
							| 
									
										
										
										
											2008-09-29 15:18:18 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-03-24 10:24:20 -04:00
										 |  |  | 	/* the fsync log has some corner cases that mean we have to check
 | 
					
						
							|  |  |  | 	 * directories to see if any unlinks have been done before | 
					
						
							|  |  |  | 	 * the directory was logged.  See tree-log.c for all the | 
					
						
							|  |  |  | 	 * details | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	u64 last_unlink_trans; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-08-04 10:25:02 -04:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Number of bytes outstanding that are going to need csums.  This is | 
					
						
							|  |  |  | 	 * used in ENOSPC accounting. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	u64 csum_bytes; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-07-14 14:28:08 -04:00
										 |  |  | 	/* flags field from the on disk inode */ | 
					
						
							|  |  |  | 	u32 flags; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-08-29 01:07:55 -06:00
										 |  |  | 	/* a local copy of root's last_log_commit */ | 
					
						
							|  |  |  | 	unsigned long last_log_commit; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-09-11 16:12:44 -04:00
										 |  |  | 	/*
 | 
					
						
							| 
									
										
										
										
											2009-10-08 13:34:05 -04:00
										 |  |  | 	 * Counters to keep track of the number of extent items we may use due | 
					
						
							|  |  |  | 	 * to delalloc and such.  outstanding_extents is the number of extent | 
					
						
							|  |  |  | 	 * items we think we'll end up using, and reserved_extents is the number | 
					
						
							|  |  |  | 	 * of extent items we've reserved metadata for. | 
					
						
							| 
									
										
										
										
											2009-09-11 16:12:44 -04:00
										 |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2011-07-15 15:16:44 +00:00
										 |  |  | 	unsigned outstanding_extents; | 
					
						
							|  |  |  | 	unsigned reserved_extents; | 
					
						
							| 
									
										
										
										
											2009-09-11 16:12:44 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-03-11 09:42:04 -05:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * always compress this one file | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2012-05-23 14:13:11 -04:00
										 |  |  | 	unsigned force_compress; | 
					
						
							| 
									
										
										
										
											2010-03-11 09:42:04 -05:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem that uses two lists to
  manage the delayed nodes, which are created for every file/directory.
  One list holds all the delayed nodes that have delayed items. The other
  holds the delayed nodes that are waiting to be dealt with by the work
  thread.
- Every delayed node has two rb-trees: one manages the directory name index
  items that are going to be inserted into the b+ tree, and the other manages
  the directory name index items that are going to be deleted from the b+ tree.
- Introduce a worker to deal with the delayed operations. This worker handles
  the delayed insertion and deletion of directory name index items and the
  delayed inode update.
  When the number of delayed items exceeds the lower limit, we create works for
  some delayed nodes, insert them into the work queue of the worker, and then
  return.
  When the number of delayed items exceeds the upper bound, we create works for
  all the delayed nodes that haven't been dealt with, insert them into the work
  queue of the worker, and then wait until the number of untreated items drops
  below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
  the information to the delayed inserting rb-tree.
  Then we check the number of delayed items and balance the delayed items.
  (The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we first
  search for it in the inserting rb-tree. If we find it, we just drop it. If
  not, we add its key to the delayed deleting rb-tree.
  As with the delayed inserting rb-tree, we also check the number of delayed
  items and balance the delayed items (the same as for insertion).
- When we want to update the metadata of an inode, we cache the inode's data
  in the delayed node. The worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access it. This
  way, we can cache more delayed items and merge more inode updates.
- When we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.
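The balance policy and the delete-cancels-insert rule from the list above can be sketched in user space as follows. This is only an illustration: the threshold values, the `demo_` names, and the fixed-size arrays standing in for the two rb-trees are assumptions, not taken from the patch.

```c
#include <assert.h>

/* Hypothetical thresholds; the commit message does not state the real values. */
#define DEMO_LOWER_LIMIT 16
#define DEMO_UPPER_BOUND 64
#define DEMO_MAX_ITEMS   8

enum demo_balance_action {
	DEMO_BALANCE_NONE,	/* below the lower limit: nothing to do */
	DEMO_BALANCE_ASYNC,	/* queue some nodes for the worker and return */
	DEMO_BALANCE_SYNC,	/* queue all nodes and wait for the backlog to shrink */
};

/* Decide what the caller should do for a given number of delayed items. */
static enum demo_balance_action demo_balance_policy(unsigned int delayed_items)
{
	if (delayed_items > DEMO_UPPER_BOUND)
		return DEMO_BALANCE_SYNC;
	if (delayed_items > DEMO_LOWER_LIMIT)
		return DEMO_BALANCE_ASYNC;
	return DEMO_BALANCE_NONE;
}

/* Arrays stand in for the per-node inserting and deleting rb-trees. */
struct demo_delayed_node {
	unsigned long long ins[DEMO_MAX_ITEMS];
	unsigned int nr_ins;
	unsigned long long del[DEMO_MAX_ITEMS];
	unsigned int nr_del;
};

/*
 * Delete a directory index: if the insertion is still pending, drop it and
 * never touch the b+ tree; otherwise record a delayed deletion.
 */
static void demo_delayed_delete(struct demo_delayed_node *n,
				unsigned long long index)
{
	for (unsigned int i = 0; i < n->nr_ins; i++) {
		if (n->ins[i] == index) {
			/* Overwrite the cancelled slot with the last entry. */
			n->ins[i] = n->ins[--n->nr_ins];
			return;
		}
	}
	assert(n->nr_del < DEMO_MAX_ITEMS);
	n->del[n->nr_del++] = index;
}
```

Creating a file and deleting it before the worker runs therefore costs no b+ tree operation for the name index in this model, which is where the benefit of delaying comes from.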
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030
After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
											
										 
											2011-04-22 18:12:22 +08:00
	struct btrfs_delayed_node *delayed_node;

	struct inode vfs_inode;
};
					
						
							| 
									
										
										
										
											2008-07-17 12:54:05 -04:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2011-04-22 18:12:22 +08:00
extern unsigned char btrfs_filetype_table[];
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-04-02 10:50:19 -04:00
static inline struct btrfs_inode *BTRFS_I(struct inode *inode)
{
	return container_of(inode, struct btrfs_inode, vfs_inode);
}
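BTRFS_I() works because the VFS inode is embedded inside struct btrfs_inode, so the enclosing structure can be recovered by pointer arithmetic. A minimal user-space sketch of the same pattern is shown here; the `container_of` macro is a simplified stand-in for the kernel's, and the `demo_` structures are hypothetical, heavily trimmed versions of the real ones.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's container_of(): step back from a
 * pointer to a member to the start of the structure that contains it.
 */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, trimmed stand-in for the VFS inode. */
struct inode {
	unsigned long i_ino;
};

/* Hypothetical, trimmed stand-in for struct btrfs_inode. */
struct demo_btrfs_inode {
	unsigned long long disk_i_size;	/* filesystem-private field */
	struct inode vfs_inode;		/* embedded VFS inode */
};

static inline struct demo_btrfs_inode *DEMO_BTRFS_I(struct inode *inode)
{
	return container_of(inode, struct demo_btrfs_inode, vfs_inode);
}

/* Hand out the embedded inode, then recover the container from it. */
static int demo_round_trip(void)
{
	struct demo_btrfs_inode bi = {
		.disk_i_size = 4096,
		.vfs_inode = { .i_ino = 257 },
	};
	struct inode *vfs = &bi.vfs_inode;

	return DEMO_BTRFS_I(vfs) == &bi &&
	       DEMO_BTRFS_I(vfs)->disk_i_size == 4096;
}
```

The conversion involves no lookup and no extra pointer stored in the inode, which is why the VFS can pass plain struct inode pointers around while the filesystem still reaches its private fields.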
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												Btrfs: improve inode hash function/inode lookup
Currently the hash value used for adding an inode to the VFS's inode
hash table consists of the plain inode number, which is a 64-bit
integer. This results in hash table buckets (hlist_head lists) with
too many elements for at least 2 important scenarios:
1) When we have many subvolumes. Each subvolume has its own btree
   where its files and directories are added to, and each has its
   own objectid (inode number) namespace. This means that if we have
   N subvolumes, and all have inode number X associated to a file or
   directory, the corresponding inodes all map to the same hash table
   entry, resulting in a bucket (hlist_head list) with N elements;
2) On 32-bit machines. The VFS hash values are unsigned longs, which
   are 32 bits wide on 32-bit machines, and the inode (objectid)
   numbers are 64-bit unsigned integers. We simply cast the inode
   numbers to hash values, which means that for all inodes with the
   same lower 32 bits, the same hash bucket is used for all of
   them. For example, all inodes whose number (objectid) has
   0xffff_ffff as its lower 32 bits, from 0x0000_0000_ffff_ffff to
   0xffff_ffff_ffff_ffff, will end up in the same hash table bucket.
This change ensures the inode's hash value depends both on the
objectid (inode number) and its subvolume's (btree root) objectid.
For 32-bit machines, this change gives better entropy by making
the hash value depend on both the upper and lower 32 bits of the
64-bit hash previously computed.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
											
										 
											2013-10-06 22:22:33 +01:00
static inline unsigned long btrfs_inode_hash(u64 objectid,
					     const struct btrfs_root *root)
{
	u64 h = objectid ^ (root->objectid * GOLDEN_RATIO_PRIME);

#if BITS_PER_LONG == 32
	h = (h >> 32) ^ (h & 0xffffffff);
#endif

	return (unsigned long)h;
}

static inline void btrfs_insert_inode_hash(struct inode *inode)
{
	unsigned long h = btrfs_inode_hash(inode->i_ino, BTRFS_I(inode)->root);

	__insert_inode_hash(inode, h);
}
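The effect of the two hash changes can be checked with a small user-space model of the 32-bit case. The constant below is assumed to match the 64-bit GOLDEN_RATIO_PRIME from linux/hash.h of that era; the `demo_` function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: value of the 64-bit GOLDEN_RATIO_PRIME in linux/hash.h. */
#define DEMO_GOLDEN_RATIO_PRIME 0x9e37fffffffc0001ULL

/* Old behaviour on a 32-bit machine: the inode number is just truncated. */
static uint32_t demo_old_hash32(uint64_t objectid)
{
	return (uint32_t)objectid;
}

/*
 * New behaviour as it would run with BITS_PER_LONG == 32: mix in the
 * subvolume's root objectid, then fold the upper half into the lower half.
 */
static uint32_t demo_new_hash32(uint64_t objectid, uint64_t root_objectid)
{
	uint64_t h = objectid ^ (root_objectid * DEMO_GOLDEN_RATIO_PRIME);

	h = (h >> 32) ^ (h & 0xffffffff);
	return (uint32_t)h;
}
```

With truncation, 0x0000_0000_ffff_ffff and 0x0000_0001_ffff_ffff collide. With the folded hash they do not, and the same inode number in two different subvolumes (for example roots 5 and 256) also hashes to different values.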
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-04-20 10:31:50 +08:00
static inline u64 btrfs_ino(struct inode *inode)
{
	u64 ino = BTRFS_I(inode)->location.objectid;

	/*
	 * !ino: btree_inode
	 * type == BTRFS_ROOT_ITEM_KEY: subvol dir
	 */
	if (!ino || BTRFS_I(inode)->location.type == BTRFS_ROOT_ITEM_KEY)
		ino = inode->i_ino;
	return ino;
}
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-07-17 12:54:05 -04:00
static inline void btrfs_i_size_write(struct inode *inode, u64 size)
{
	i_size_write(inode, size);
	BTRFS_I(inode)->disk_i_size = size;
}
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-10 05:28:39 -06:00
static inline bool btrfs_is_free_space_inode(struct inode *inode)
{
	struct btrfs_root *root = BTRFS_I(inode)->root;

	if (root == root->fs_info->tree_root &&
	    btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
		return true;
	if (BTRFS_I(inode)->location.objectid == BTRFS_FREE_INO_OBJECTID)
		return true;
	return false;
}
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 16:57:49 -04:00
static inline int btrfs_inode_in_log(struct inode *inode, u64 generation)
{
	if (BTRFS_I(inode)->logged_trans == generation &&
	    BTRFS_I(inode)->last_sub_trans <=
	    BTRFS_I(inode)->last_log_commit &&
	    BTRFS_I(inode)->last_sub_trans <=
	    BTRFS_I(inode)->root->last_log_commit)
		return 1;
	return 0;
}
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-25 19:22:34 +08:00
struct btrfs_dio_private {
	struct inode *inode;
	u64 logical_offset;
	u64 disk_bytenr;
	u64 bytes;
	void *private;

	/* number of bios pending for this dio */
	atomic_t pending_bios;

	/* IO errors */
	int errors;

	/* orig_bio is our btrfs_io_bio */
	struct bio *orig_bio;

	/* dio_bio came from fs/direct-io.c */
	struct bio *dio_bio;
	u8 csum[0];
};
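The trailing `csum[0]` member means the checksum bytes live in the same allocation as the structure itself. The following user-space sketch shows how such a structure might be sized and allocated; `demo_dio_private`, `demo_dio_alloc`, and the `csum_len` field are hypothetical, and the real allocation path in btrfs is not shown in this header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed stand-in for struct btrfs_dio_private. */
struct demo_dio_private {
	uint64_t bytes;
	size_t csum_len;	/* bookkeeping added for this sketch only */
	uint8_t csum[];		/* C99 flexible array; the header uses GNU csum[0] */
};

/*
 * Allocate the header and the checksum area in one block. The caller
 * releases both with a single free().
 */
static struct demo_dio_private *demo_dio_alloc(uint64_t bytes,
					       size_t nr_csums,
					       size_t csum_size)
{
	size_t csum_len = nr_csums * csum_size;
	struct demo_dio_private *dip = malloc(sizeof(*dip) + csum_len);

	if (!dip)
		return NULL;
	dip->bytes = bytes;
	dip->csum_len = csum_len;
	memset(dip->csum, 0, csum_len);
	return dip;
}
```

Keeping the checksums inline avoids a second allocation and a second failure path per direct I/O request, at the cost of the array having to be the last member.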
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-08 07:01:08 +00:00
/*
 * Disable DIO read nolock optimization, so new dio readers will be forced
 * to grab i_mutex. It is used to avoid the endless truncate due to
 * nonlocked dio read.
 */
static inline void btrfs_inode_block_unlocked_dio(struct inode *inode)
{
	set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &BTRFS_I(inode)->runtime_flags);
	smp_mb();
}

static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
{
	smp_mb__before_clear_bit();
	clear_bit(BTRFS_INODE_READDIO_NEED_LOCK,
		  &BTRFS_I(inode)->runtime_flags);
}
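The block/resume pair can be approximated in user space with C11 atomics. This is a rough sketch only: the fences are a loose analogue of the kernel barriers rather than an exact equivalent, the flag is modelled as a mask instead of a bit number, and all `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mask standing in for BTRFS_INODE_READDIO_NEED_LOCK. */
#define DEMO_READDIO_NEED_LOCK (1UL << 0)

struct demo_inode {
	atomic_ulong runtime_flags;
};

/* Set the flag, then fence so later accesses are ordered after the store. */
static void demo_block_unlocked_dio(struct demo_inode *di)
{
	atomic_fetch_or_explicit(&di->runtime_flags, DEMO_READDIO_NEED_LOCK,
				 memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/* Fence first so earlier accesses are ordered before the flag is cleared. */
static void demo_resume_unlocked_dio(struct demo_inode *di)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_fetch_and_explicit(&di->runtime_flags, ~DEMO_READDIO_NEED_LOCK,
				  memory_order_relaxed);
}

/* What a DIO reader would check before skipping the lock. */
static bool demo_dio_read_needs_lock(struct demo_inode *di)
{
	return atomic_load(&di->runtime_flags) & DEMO_READDIO_NEED_LOCK;
}
```

The barrier placement mirrors the header: after the set in the blocking path, before the clear in the resuming path, so that work done while the flag is set stays inside the window readers observe.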
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-04-02 10:50:19 -04:00
#endif