From 47ad76d1105eda38ccc9917ce7bc1886d5aa89d0 Mon Sep 17 00:00:00 2001
From: "https://www.google.com/accounts/o8/id?id=AItOawkSq2FDpK2n66QRUxtqqdbyDuwgbQmUWus"
Date: Tue, 24 Jul 2012 19:08:11 +0000
Subject: [PATCH 1/5]

---
 doc/forum/Fixing_up_corrupt_annexes.mdwn | 10 ++++++++++
 1 file changed, 10 insertions(+)
 create mode 100644 doc/forum/Fixing_up_corrupt_annexes.mdwn

diff --git a/doc/forum/Fixing_up_corrupt_annexes.mdwn b/doc/forum/Fixing_up_corrupt_annexes.mdwn
new file mode 100644
index 0000000000..be6beeca8f
--- /dev/null
+++ b/doc/forum/Fixing_up_corrupt_annexes.mdwn
@@ -0,0 +1,10 @@
+I was wondering how one recovers from...
+
+<pre>
+(Recording state in git...)
+error: invalid object 100644 8f154c946adc039af5240cc650a0a95c840e6fa6 for '041/5a4/SHA256-s6148--7ddcf853e4b16e77ab8c3c855c46867e6ed61c7089c334edf98bbdd3fb3a89ba.log'
+fatal: git-write-tree: error building trees
+git-annex: failed to read sha from git write-tree
+</pre>
+
+The above was caught when I ran a "git annex fsck --fast" to check a stash of files.

From b5b0ae6f3e1c9edead1e697f401f7670a93d710a Mon Sep 17 00:00:00 2001
From: "http://joeyh.name/"
Date: Tue, 24 Jul 2012 22:00:36 +0000
Subject: [PATCH 2/5] Added a comment

---
 .../comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment | 7 +++++++
 1 file changed, 7 insertions(+)
 create mode 100644 doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment

diff --git a/doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment b/doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment
new file mode 100644
index 0000000000..335cbb51d2
--- /dev/null
+++ b/doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment
@@ -0,0 +1,7 @@
+[[!comment format=mdwn
+ username="http://joeyh.name/"
+ subject="comment 1"
+ date="2012-07-24T22:00:35Z"
+ content="""
+This is a corrupt git repository. See [[tips/what_to_do_when_a_repository_is_corrupted]]
+"""]]

From bd2b388fd8c668ed6fd031d0ed8a7edf3c7b67ee Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 25 Jul 2012 15:07:41 -0400
Subject: [PATCH 3/5] update

---
 doc/design/assistant/syncing.mdwn | 114 ++++++++++++++++--------------
 1 file changed, 61 insertions(+), 53 deletions(-)

diff --git a/doc/design/assistant/syncing.mdwn b/doc/design/assistant/syncing.mdwn
index f04f20218b..3aeb76afc1 100644
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -5,14 +5,66 @@ all the other git clones, at both the git level and the key/value level.
 * At startup, and possibly periodically, or when the network connection
   changes, or some heuristic suggests that a remote was disconnected from
-  us for a while, queue remotes for processing by the TransferScanner,
-  to queue Transfers of files it or we're missing.
-* After git sync, identify content that we don't have that is now available
+  us for a while, queue remotes for processing by the TransferScanner.
+* Ensure that when a remote receives content, and updates its location log,
+  it syncs that update back out. Prerequisite for:
+* After git sync, identify new content that we don't have that is now available
   on remotes, and transfer. (Needed when we have a uni-directional connection
-  to a remote, so it won't be uploading content to us.)
-  But first, need to ensure that when a remote
-  receives content, and updates its location log, it syncs that update
-  out.
+  to a remote, so it won't be uploading content to us.) Note: Does not
+  need to use the TransferScanner, if we get and check a list of the changed
+  files.
+
+## longer-term TODO
+
+* Test MountWatcher on LXDE.
+* git-annex needs a simple speed control knob, which can be plumbed
+  through to, at least, rsync. A good job for an hour in an
+  airport somewhere.
+* Find a way to probe available outgoing bandwidth, to throttle so
+  we don't bufferbloat the network to death.
+* Investigate the XMPP approach like dvcs-autosync does, or other ways of
+  signaling a change out of band.
+* Add a hook, so when there's a change to sync, a program can be run
+  and do its own signaling.
+* --debug will show often unnecessary work being done. Optimise.
+* This assumes the network is connected. It's often not, so the
+  [[cloud]] needs to be used to bridge between LANs.
+* Configurability, including only enabling git syncing but not data transfer;
+  only uploading new files but not downloading, and only downloading
+  files in some directories and not others. See for use cases:
+  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
+* speed up git syncing by using the cached ssh connection for it too
+  (will need to use `GIT_SSH`, which needs to point to a command to run,
+  not a shell command line)
+* Map the network of git repos, and use that map to calculate
+  optimal transfers to keep the data in sync. Currently a naive flood fill
+  is done instead.
+* Find a more efficient way for the TransferScanner to find the transfers
+  that need to be done to sync with a remote. Currently it walks the git
+  working copy and checks each file.
+
+## data syncing
+
+There are two parts to data syncing. First, map the network and second,
+decide what to sync when.
+
+Mapping the network can reuse code in `git annex map`. Once the map is
+built, we want to find paths through the network that reach all nodes
+eventually, with the least cost. This is a minimum spanning tree problem,
+except with a directed graph, so really an Arborescence problem.
+
+With the map, we can determine which nodes to push new content to. Then we
+need to control those data transfers, sending to the cheapest nodes first,
+and with appropriate rate limiting and control facilities.
+
+This probably will need lots of refinements to get working well.
+
+### first pass: flood syncing
+
+Before mapping the network, the best we can do is flood all files out to every
+reachable remote. This is worth doing first, since it's the simplest way to
+get the basic functionality of the assistant to work. And we'll need this
+anyway.
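+
+A minimal sketch of what that flood fill amounts to (placeholder types and
+helper functions, not the real implementation) is below:
+
+[[!format haskell """
+import Control.Monad (forM_, when, unless)
+
+-- Placeholder types; the real assistant has richer Key and Remote types.
+type Key = String
+type Remote = String
+
+-- Stand-ins for git-annex's real location tracking and transfer queue.
+haveKey :: Key -> IO Bool
+haveKey _ = return True
+
+remoteHasKey :: Remote -> Key -> IO Bool
+remoteHasKey _ _ = return False
+
+queueUpload :: Remote -> Key -> IO ()
+queueUpload r k = putStrLn ("queue upload of " ++ k ++ " to " ++ r)
+
+-- Naive flood fill: offer every key we have to every reachable remote
+-- that does not already have it. No cost model, no arborescence.
+floodSync :: [Key] -> [Remote] -> IO ()
+floodSync keys remotes =
+    forM_ keys $ \key -> do
+        here <- haveKey key
+        when here $
+            forM_ remotes $ \remote -> do
+                there <- remoteHasKey remote key
+                unless there $ queueUpload remote key
+"""]]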

 ## TransferScanner

@@ -21,6 +73,8 @@ to a remote, or Downloaded from it.

 How to find the keys to transfer? I'd like to avoid potentially expensive
 traversals of the whole git working copy if I can.
+(Currently, the TransferScanner does do the naive and possibly expensive
+scan of the git working copy.)

 One way would be to do a git diff between the (unmerged) git-annex branches
 of the git repo, and its remote. Parse that for lines that add a key to
@@ -53,52 +107,6 @@ one.
 Probably worth handling the case where a remote is connected while in the
 middle of such a scan, so part of the scan needs to be redone to check it.

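+
+As a rough sketch of the git diff idea above (hypothetical code: it assumes
+the key can be read back out of the location log's file name, and it ignores
+what the log lines actually say, so it cannot yet pick a transfer direction):
+
+[[!format haskell """
+import Data.List (isSuffixOf)
+import System.FilePath (takeFileName, dropExtension)
+import System.Process (readProcess)
+
+-- Keys whose location logs differ between our git-annex branch and the
+-- remote's. Each location log in the git-annex branch is a file named
+-- <key>.log, so the key is recovered from the file name.
+changedKeys :: String -> IO [String]
+changedKeys remote = do
+    out <- readProcess "git"
+        ["diff", "--name-only", "git-annex", remote ++ "/git-annex"] ""
+    return [ dropExtension (takeFileName f)
+           | f <- lines out
+           , ".log" `isSuffixOf` f ]
+"""]]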
-## longer-term TODO
-
-* Test MountWatcher on LXDE.
-* git-annex needs a simple speed control knob, which can be plumbed
-  through to, at least, rsync. A good job for an hour in an
-  airport somewhere.
-* Find a way to probe available outgoing bandwidth, to throttle so
-  we don't bufferbloat the network to death.
-* Investigate the XMPP approach like dvcs-autosync does, or other ways of
-  signaling a change out of band.
-* Add a hook, so when there's a change to sync, a program can be run
-  and do its own signaling.
-* --debug will show often unnecessary work being done. Optimise.
-* This assumes the network is connected. It's often not, so the
-  [[cloud]] needs to be used to bridge between LANs.
-* Configurablity, including only enabling git syncing but not data transfer;
-  only uploading new files but not downloading, and only downloading
-  files in some directories and not others. See for use cases:
-  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
-* speed up git syncing by using the cached ssh connection for it too
-  (will need to use `GIT_SSH`, which needs to point to a command to run,
-  not a shell command line)
-
-## data syncing
-
-There are two parts to data syncing. First, map the network and second,
-decide what to sync when.
-
-Mapping the network can reuse code in `git annex map`. Once the map is
-built, we want to find paths through the network that reach all nodes
-eventually, with the least cost. This is a minimum spanning tree problem,
-except with a directed graph, so really a Arborescence problem.
-
-With the map, we can determine which nodes to push new content to. Then we
-need to control those data transfers, sending to the cheapest nodes first,
-and with appropriate rate limiting and control facilities.
-
-This probably will need lots of refinements to get working well.
-
-### first pass: flood syncing
-
-Before mapping the network, the best we can do is flood all files out to every
-reachable remote. This is worth doing first, since it's the simplest way to
-get the basic functionality of the assistant to work. And we'll need this
-anyway.
-
 ## done

 1. Can use `git annex sync`, which already handles bidirectional syncing.

From 2e085c6383f096a58d1e9b52ae457f9491850c7f Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 25 Jul 2012 15:31:26 -0400
Subject: [PATCH 4/5] blog for the day

---
 .../blog/day_43__simple_scanner.mdwn | 37 +++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 doc/design/assistant/blog/day_43__simple_scanner.mdwn

diff --git a/doc/design/assistant/blog/day_43__simple_scanner.mdwn b/doc/design/assistant/blog/day_43__simple_scanner.mdwn
new file mode 100644
index 0000000000..11ee3cca49
--- /dev/null
+++ b/doc/design/assistant/blog/day_43__simple_scanner.mdwn
@@ -0,0 +1,37 @@
+Milestone: I can run `git annex assistant`, plug in a USB drive, and it
+automatically transfers files to get the USB drive and current repo back in
+sync.
+
+I decided to implement the naive scan, to find files needing to be
+transferred. So it walks through `git ls-files` and checks each file
+in turn. I've deferred less expensive, more sophisticated approaches to later.
+
+I did some work on the TransferQueue, which now keeps track of the length
+of the queue, and can block attempts to add Transfers to it if it gets too
+long. This was a nice use of STM, which let me implement that without using
+any locking.
+
+[[!format haskell """
+atomically $ do
+    sz <- readTVar (queuesize q)
+    if sz <= wantsz
+        then enqueue schedule q t (stubInfo f remote)
+        else retry -- blocks until queuesize changes
+"""]]
+
+Anyway, the point was that, as the scan finds Transfers to do,
+it doesn't build up a really long TransferQueue, but instead is blocked
+from running further until some of the files get transferred. The resulting
+interleaving of the scan thread with transfer threads means that transfers
+start fairly quickly upon a USB drive being plugged in, and kind of hides
+the inefficiencies of the scanner, which will most of the time be
+swamped out by the IO-bound large data transfers.
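+
+The other side of this is whatever takes Transfers off the queue. A rough
+sketch (`queuechan` is a hypothetical field, not the real structure): the
+reader shrinks `queuesize` in the same transaction, and that is what lets a
+blocked writer's `retry` proceed.
+
+[[!format haskell """
+next <- atomically $ do
+    t <- readTChan (queuechan q)
+    sz <- readTVar (queuesize q)
+    writeTVar (queuesize q) (sz - 1)
+    return t
+"""]]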
+
+---
+
+At this point, the assistant should do a good job of keeping repositories
+in sync, as long as they're all interconnected, or on removable media
+like USB drives. There's lots more work to be done to handle use cases
+where repositories are not well-connected, but since the assistant's
+[[syncing]] now covers at least a couple of use cases, I'm ready to move
+on to the next phase. [[Webapp]], here we come!

From 3a02c7b635fc1017c05874b8a6f54a91a587651d Mon Sep 17 00:00:00 2001
From: jtang
Date: Wed, 25 Jul 2012 20:12:16 +0000
Subject: [PATCH 5/5] fix example to match current command in git-annex semitrust

---
 doc/tips/what_to_do_when_you_lose_a_repository.mdwn | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/tips/what_to_do_when_you_lose_a_repository.mdwn b/doc/tips/what_to_do_when_you_lose_a_repository.mdwn
index 3be13b8abd..363eeea4e0 100644
--- a/doc/tips/what_to_do_when_you_lose_a_repository.mdwn
+++ b/doc/tips/what_to_do_when_you_lose_a_repository.mdwn
@@ -16,4 +16,4 @@ are present.
 If you later found the drive, you could let git-annex know it's found
 like so:

-    git annex semitrusted usbdrive
+    git annex semitrust usbdrive