Merge branch 'master' into assistant

commit abe5a73d3f

5 changed files with 122 additions and 60 deletions
37 doc/design/assistant/blog/day_43__simple_scanner.mdwn Normal file
@@ -0,0 +1,37 @@
Milestone: I can run `git annex assistant`, plug in a USB drive, and it
automatically transfers files to get the USB drive and current repo back in
sync.

I decided to implement the naive scan, to find files needing to be
transferred. So it walks through `git ls-files` and checks each file
in turn. I've deferred less expensive, more sophisticated approaches to later.
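As a rough sketch (the names here are illustrative stand-ins, not git-annex's actual internals), the naive scan amounts to walking the `git ls-files` output and running a per-file check on each entry:

[[!format haskell """
-- Hypothetical sketch of the naive scan; `checkFile` stands in for
-- whatever per-file check decides whether a transfer is needed.
import System.Process (readProcess)

naiveScan :: (FilePath -> IO ()) -> IO ()
naiveScan checkFile = do
        out <- readProcess "git" ["ls-files"] ""
        mapM_ checkFile (lines out)
"""]]

This is O(number of files in the working copy) per remote, which is why the blog calls it naive.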

I did some work on the TransferQueue, which now keeps track of the length
of the queue, and can block attempts to add Transfers to it if it gets too
long. This was a nice use of STM, which let me implement that without using
any locking.

[[!format haskell """
atomically $ do
        sz <- readTVar (queuesize q)
        if sz <= wantsz
                then enqueue schedule q t (stubInfo f remote)
                else retry -- blocks until queuesize changes
"""]]

Anyway, the point was that, as the scan finds Transfers to do,
it doesn't build up a really long TransferQueue, but instead is blocked
from running further until some of the files get transferred. The resulting
interleaving of the scan thread with transfer threads means that transfers
start fairly quickly upon a USB drive being plugged in, and kind of hides
the inefficiencies of the scanner, which will most of the time be
swamped out by the IO-bound large data transfers.

---

At this point, the assistant should do a good job of keeping repositories
in sync, as long as they're all interconnected, or on removable media
like USB drives. There's lots more work to be done to handle use cases
where repositories are not well-connected, but since the assistant's
[[syncing]] now covers at least a couple of use cases, I'm ready to move
on to the next phase. [[Webapp]], here we come!
@@ -5,14 +5,72 @@ all the other git clones, at both the git level and the key/value level.

* At startup, and possibly periodically, or when the network connection
  changes, or some heuristic suggests that a remote was disconnected from
  us for a while, queue remotes for processing by the TransferScanner.
* Ensure that when a remote receives content, and updates its location log,
  it syncs that update back out. Prerequisite for:
* After git sync, identify new content that we don't have that is now available
  on remotes, and transfer. (Needed when we have a uni-directional connection
  to a remote, so it won't be uploading content to us.) Note: Does not
  need to use the TransferScanner, if we get and check a list of the changed
  files.

## longer-term TODO

* Test MountWatcher on LXDE.
* git-annex needs a simple speed control knob, which can be plumbed
  through to, at least, rsync. A good job for an hour in an
  airport somewhere.
* Find a way to probe available outgoing bandwidth, to throttle so
  we don't bufferbloat the network to death.
* Investigate the XMPP approach like dvcs-autosync does, or other ways of
  signaling a change out of band.
* Add a hook, so when there's a change to sync, a program can be run
  and do its own signaling.
* --debug will show often unnecessary work being done. Optimise.
* This assumes the network is connected. It's often not, so the
  [[cloud]] needs to be used to bridge between LANs.
* Configurability, including only enabling git syncing but not data transfer;
  only uploading new files but not downloading, and only downloading
  files in some directories and not others. See for use cases:
  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
* Speed up git syncing by using the cached ssh connection for it too.
  (Will need to use `GIT_SSH`, which needs to point to a command to run,
  not a shell command line.)
* Map the network of git repos, and use that map to calculate
  optimal transfers to keep the data in sync. Currently a naive flood fill
  is done instead.
* Find a more efficient way for the TransferScanner to find the transfers
  that need to be done to sync with a remote. Currently it walks the git
  working copy and checks each file.

## misc todo

* --debug will show often unnecessary work being done. Optimise.
* It would be nice if, when a USB drive is connected,
  syncing starts automatically. Use dbus on Linux?

## data syncing

There are two parts to data syncing. First, map the network and second,
decide what to sync when.

Mapping the network can reuse code in `git annex map`. Once the map is
built, we want to find paths through the network that reach all nodes
eventually, with the least cost. This is a minimum spanning tree problem,
except with a directed graph, so really an arborescence problem.

With the map, we can determine which nodes to push new content to. Then we
need to control those data transfers, sending to the cheapest nodes first,
and with appropriate rate limiting and control facilities.

This probably will need lots of refinements to get working well.

### first pass: flood syncing

Before mapping the network, the best we can do is flood all files out to every
reachable remote. This is worth doing first, since it's the simplest way to
get the basic functionality of the assistant to work. And we'll need this
anyway.
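The flood-fill first pass can be sketched in a few lines (purely illustrative; `Remote` and `upload` are hypothetical stand-ins, not git-annex's types): with no network map, every file is simply sent to every reachable remote.

[[!format haskell """
-- Hypothetical stand-in for a reachable remote.
type Remote = String

-- Stand-in for queueing an upload of one file to one remote.
upload :: Remote -> FilePath -> IO ()
upload r f = putStrLn ("queue upload of " ++ f ++ " to " ++ r)

-- Flood syncing: with no map of the network, push every file
-- to every reachable remote.
floodSync :: [Remote] -> [FilePath] -> IO ()
floodSync remotes files =
        mapM_ (\r -> mapM_ (upload r) files) remotes
"""]]

The later, map-based approach would replace the outer loop with a walk of the computed arborescence, so each node receives content from only its cheapest upstream peer.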

## TransferScanner

@@ -21,6 +79,8 @@ to a remote, or Downloaded from it.

How to find the keys to transfer? I'd like to avoid potentially
expensive traversals of the whole git working copy if I can.
(Currently, the TransferScanner does do the naive and possibly expensive
scan of the git working copy.)

One way would be to do a git diff between the (unmerged) git-annex branches
of the git repo, and its remote. Parse that for lines that add a key to
@@ -53,58 +113,6 @@ one. Probably worth handling the case where a remote is connected
while in the middle of such a scan, so part of the scan needs to be
redone to check it.

## done

1. Can use `git annex sync`, which already handles bidirectional syncing.
10 doc/forum/Fixing_up_corrupt_annexes.mdwn Normal file

@@ -0,0 +1,10 @@
I was wondering how one recovers from...

<pre>
(Recording state in git...)
error: invalid object 100644 8f154c946adc039af5240cc650a0a95c840e6fa6 for '041/5a4/SHA256-s6148--7ddcf853e4b16e77ab8c3c855c46867e6ed61c7089c334edf98bbdd3fb3a89ba.log'
fatal: git-write-tree: error building trees
git-annex: failed to read sha from git write-tree
</pre>

The above was caught when I ran a "git annex fsck --fast" to check my stash of files.
@@ -0,0 +1,7 @@
[[!comment format=mdwn
 username="http://joeyh.name/"
 subject="comment 1"
 date="2012-07-24T22:00:35Z"
 content="""
This is a corrupt git repository. See [[tips/what_to_do_when_a_repository_is_corrupted]].
"""]]
@@ -16,4 +16,4 @@ are present.

If you later found the drive, you could let git-annex know it's found
like so:

	git annex semitrust usbdrive
Joey Hess