From 47ad76d1105eda38ccc9917ce7bc1886d5aa89d0 Mon Sep 17 00:00:00 2001
From: "https://www.google.com/accounts/o8/id?id=AItOawkSq2FDpK2n66QRUxtqqdbyDuwgbQmUWus"
Date: Tue, 24 Jul 2012 19:08:11 +0000
Subject: [PATCH 1/5]

---
 doc/forum/Fixing_up_corrupt_annexes.mdwn | 10 ++++++++++
 1 file changed, 10 insertions(+)
 create mode 100644 doc/forum/Fixing_up_corrupt_annexes.mdwn

diff --git a/doc/forum/Fixing_up_corrupt_annexes.mdwn b/doc/forum/Fixing_up_corrupt_annexes.mdwn
new file mode 100644
index 0000000000..be6beeca8f
--- /dev/null
+++ b/doc/forum/Fixing_up_corrupt_annexes.mdwn
@@ -0,0 +1,10 @@
+I was wondering how one recovers from...
+
+<pre>
+(Recording state in git...)
+error: invalid object 100644 8f154c946adc039af5240cc650a0a95c840e6fa6 for '041/5a4/SHA256-s6148--7ddcf853e4b16e77ab8c3c855c46867e6ed61c7089c334edf98bbdd3fb3a89ba.log'
+fatal: git-write-tree: error building trees
+git-annex: failed to read sha from git write-tree
+</pre>
+
+The above was caught when I ran a "git annex fsck --fast" to check a stash of files.

From b5b0ae6f3e1c9edead1e697f401f7670a93d710a Mon Sep 17 00:00:00 2001
From: "http://joeyh.name/"
Date: Tue, 24 Jul 2012 22:00:36 +0000
Subject: [PATCH 2/5] Added a comment

---
 .../comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment | 7 +++++++
 1 file changed, 7 insertions(+)
 create mode 100644 doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment

diff --git a/doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment b/doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment
new file mode 100644
index 0000000000..335cbb51d2
--- /dev/null
+++ b/doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment
@@ -0,0 +1,7 @@
+[[!comment format=mdwn
+ username="http://joeyh.name/"
+ subject="comment 1"
+ date="2012-07-24T22:00:35Z"
+ content="""
+This is a corrupt git repository. See [[tips/what_to_do_when_a_repository_is_corrupted]]
+"""]]

From bd2b388fd8c668ed6fd031d0ed8a7edf3c7b67ee Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 25 Jul 2012 15:07:41 -0400
Subject: [PATCH 3/5] update

---
 doc/design/assistant/syncing.mdwn | 114 ++++++++++++++++--------------
 1 file changed, 61 insertions(+), 53 deletions(-)

diff --git a/doc/design/assistant/syncing.mdwn b/doc/design/assistant/syncing.mdwn
index f04f20218b..3aeb76afc1 100644
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -5,14 +5,66 @@ all the other git clones, at both the git level and the key/value level.
 * At startup, and possibly periodically, or when the network connection
   changes, or some heuristic suggests that a remote was disconnected from
-  us for a while, queue remotes for processing by the TransferScanner,
-  to queue Transfers of files it or we're missing.
-* After git sync, identify content that we don't have that is now available
+  us for a while, queue remotes for processing by the TransferScanner.
+* Ensure that when a remote receives content, and updates its location log,
+  it syncs that update back out. Prerequisite for:
+* After git sync, identify new content that we don't have that is now available
   on remotes, and transfer. (Needed when we have a uni-directional connection
-  to a remote, so it won't be uploading content to us.)
-  But first, need to ensure that when a remote
-  receives content, and updates its location log, it syncs that update
-  out.
+  to a remote, so it won't be uploading content to us.) Note: Does not
+  need to use the TransferScanner, if we get and check a list of the changed
+  files.
+
+## longer-term TODO
+
+* Test MountWatcher on LXDE.
+* git-annex needs a simple speed control knob, which can be plumbed
+  through to, at least, rsync. A good job for an hour in an
+  airport somewhere.
+* Find a way to probe available outgoing bandwidth, to throttle so
+  we don't bufferbloat the network to death.
+* Investigate the XMPP approach like dvcs-autosync does, or other ways of
+  signaling a change out of band.
+* Add a hook, so when there's a change to sync, a program can be run
+  and do its own signaling.
+* --debug will show often unnecessary work being done. Optimise.
+* This assumes the network is connected. It's often not, so the
+  [[cloud]] needs to be used to bridge between LANs.
+* Configurability, including only enabling git syncing but not data transfer;
+  only uploading new files but not downloading, and only downloading
+  files in some directories and not others. See for use cases:
+  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
+* speed up git syncing by using the cached ssh connection for it too
+  (will need to use `GIT_SSH`, which needs to point to a command to run,
+  not a shell command line)
+* Map the network of git repos, and use that map to calculate
+  optimal transfers to keep the data in sync. Currently a naive flood fill
+  is done instead.
+* Find a more efficient way for the TransferScanner to find the transfers
+  that need to be done to sync with a remote. Currently it walks the git
+  working copy and checks each file.
+
+## data syncing
+
+There are two parts to data syncing. First, map the network and second,
+decide what to sync when.
+
+Mapping the network can reuse code in `git annex map`. Once the map is
+built, we want to find paths through the network that reach all nodes
+eventually, with the least cost. This is a minimum spanning tree problem,
+except with a directed graph, so really an Arborescence problem.
+
+With the map, we can determine which nodes to push new content to. Then we
+need to control those data transfers, sending to the cheapest nodes first,
+and with appropriate rate limiting and control facilities.
+
+This probably will need lots of refinements to get working well.
+
+### first pass: flood syncing
+
+Before mapping the network, the best we can do is flood all files out to every
+reachable remote. This is worth doing first, since it's the simplest way to
+get the basic functionality of the assistant to work. And we'll need this
+anyway.
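+
+A minimal sketch of what that flood fill amounts to (placeholder types and
+helper functions, not the real implementation) is below:
+
+[[!format haskell """
+import Control.Monad (forM_, when, unless)
+
+-- Placeholder types; the real assistant has richer Key and Remote types.
+type Key = String
+type Remote = String
+
+-- Stand-ins for git-annex's real location tracking and transfer queue.
+haveKey :: Key -> IO Bool
+haveKey _ = return True
+
+remoteHasKey :: Remote -> Key -> IO Bool
+remoteHasKey _ _ = return False
+
+queueUpload :: Remote -> Key -> IO ()
+queueUpload r k = putStrLn ("queue upload of " ++ k ++ " to " ++ r)
+
+-- Naive flood fill: offer every key we have to every reachable remote
+-- that does not already have it. No cost model, no arborescence.
+floodSync :: [Key] -> [Remote] -> IO ()
+floodSync keys remotes =
+    forM_ keys $ \key -> do
+        here <- haveKey key
+        when here $
+            forM_ remotes $ \remote -> do
+                there <- remoteHasKey remote key
+                unless there $ queueUpload remote key
+"""]]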

 ## TransferScanner

@@ -21,6 +73,8 @@ to a remote, or Downloaded from it.

 How to find the keys to transfer? I'd like to avoid potentially expensive
 traversals of the whole git working copy if I can.
+(Currently, the TransferScanner does do the naive and possibly expensive
+scan of the git working copy.)

 One way would be to do a git diff between the (unmerged) git-annex branches
 of the git repo, and its remote. Parse that for lines that add a key to
@@ -53,52 +107,6 @@ one.
 Probably worth handling the case where a remote is connected while in the
 middle of such a scan, so part of the scan needs to be redone to check it.

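+
+As a rough sketch of the git diff idea above (hypothetical code: it assumes
+the key can be read back out of the location log's file name, and it ignores
+what the log lines actually say, so it cannot yet pick a transfer direction):
+
+[[!format haskell """
+import Data.List (isSuffixOf)
+import System.FilePath (takeFileName, dropExtension)
+import System.Process (readProcess)
+
+-- Keys whose location logs differ between our git-annex branch and the
+-- remote's. Each location log in the git-annex branch is a file named
+-- <key>.log, so the key is recovered from the file name.
+changedKeys :: String -> IO [String]
+changedKeys remote = do
+    out <- readProcess "git"
+        ["diff", "--name-only", "git-annex", remote ++ "/git-annex"] ""
+    return [ dropExtension (takeFileName f)
+           | f <- lines out
+           , ".log" `isSuffixOf` f ]
+"""]]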
-## longer-term TODO
-
-* Test MountWatcher on LXDE.
-* git-annex needs a simple speed control knob, which can be plumbed
-  through to, at least, rsync. A good job for an hour in an
-  airport somewhere.
-* Find a way to probe available outgoing bandwidth, to throttle so
-  we don't bufferbloat the network to death.
-* Investigate the XMPP approach like dvcs-autosync does, or other ways of
-  signaling a change out of band.
-* Add a hook, so when there's a change to sync, a program can be run
-  and do its own signaling.
-* --debug will show often unnecessary work being done. Optimise.
-* This assumes the network is connected. It's often not, so the
-  [[cloud]] needs to be used to bridge between LANs.
-* Configurablity, including only enabling git syncing but not data transfer;
-  only uploading new files but not downloading, and only downloading
-  files in some directories and not others. See for use cases:
-  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
-* speed up git syncing by using the cached ssh connection for it too
-  (will need to use `GIT_SSH`, which needs to point to a command to run,
-  not a shell command line)
-
-## data syncing
-
-There are two parts to data syncing. First, map the network and second,
-decide what to sync when.
-
-Mapping the network can reuse code in `git annex map`. Once the map is
-built, we want to find paths through the network that reach all nodes
-eventually, with the least cost. This is a minimum spanning tree problem,
-except with a directed graph, so really a Arborescence problem.
-
-With the map, we can determine which nodes to push new content to. Then we
-need to control those data transfers, sending to the cheapest nodes first,
-and with appropriate rate limiting and control facilities.
-
-This probably will need lots of refinements to get working well.
-
-### first pass: flood syncing
-
-Before mapping the network, the best we can do is flood all files out to every
-reachable remote. This is worth doing first, since it's the simplest way to
-get the basic functionality of the assistant to work. And we'll need this
-anyway.
-
 ## done

 1. Can use `git annex sync`, which already handles bidirectional syncing.

From 2e085c6383f096a58d1e9b52ae457f9491850c7f Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 25 Jul 2012 15:31:26 -0400
Subject: [PATCH 4/5] blog for the day

---
 .../blog/day_43__simple_scanner.mdwn | 37 +++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 doc/design/assistant/blog/day_43__simple_scanner.mdwn

diff --git a/doc/design/assistant/blog/day_43__simple_scanner.mdwn b/doc/design/assistant/blog/day_43__simple_scanner.mdwn
new file mode 100644
index 0000000000..11ee3cca49
--- /dev/null
+++ b/doc/design/assistant/blog/day_43__simple_scanner.mdwn
@@ -0,0 +1,37 @@
+Milestone: I can run `git annex assistant`, plug in a USB drive, and it
+automatically transfers files to get the USB drive and current repo back in
+sync.
+
+I decided to implement the naive scan, to find files needing to be
+transferred. So it walks through `git ls-files` and checks each file
+in turn. I've deferred less expensive, more sophisticated approaches to later.
+
+I did some work on the TransferQueue, which now keeps track of the length
+of the queue, and can block attempts to add Transfers to it if it gets too
+long. This was a nice use of STM, which let me implement that without using
+any locking.
+
+[[!format haskell """
+atomically $ do
+    sz <- readTVar (queuesize q)
+    if sz <= wantsz
+        then enqueue schedule q t (stubInfo f remote)
+        else retry -- blocks until queuesize changes
+"""]]
+
+Anyway, the point was that, as the scan finds Transfers to do,
+it doesn't build up a really long TransferQueue, but instead is blocked
+from running further until some of the files get transferred. The resulting
+interleaving of the scan thread with transfer threads means that transfers
+start fairly quickly upon a USB drive being plugged in, and kind of hides
+the inefficiencies of the scanner, which will most of the time be
+swamped out by the IO-bound large data transfers.
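+
+The other side of this is whatever takes Transfers off the queue. A rough
+sketch (`queuechan` is a hypothetical field, not the real structure): the
+reader shrinks `queuesize` in the same transaction, and that is what lets a
+blocked writer's `retry` proceed.
+
+[[!format haskell """
+next <- atomically $ do
+    t <- readTChan (queuechan q)
+    sz <- readTVar (queuesize q)
+    writeTVar (queuesize q) (sz - 1)
+    return t
+"""]]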
+
+---
+
+At this point, the assistant should do a good job of keeping repositories
+in sync, as long as they're all interconnected, or on removable media
+like USB drives. There's lots more work to be done to handle use cases
+where repositories are not well-connected, but since the assistant's
+[[syncing]] now covers at least a couple of use cases, I'm ready to move
+on to the next phase. [[Webapp]], here we come!

From 3a02c7b635fc1017c05874b8a6f54a91a587651d Mon Sep 17 00:00:00 2001
From: jtang
Date: Wed, 25 Jul 2012 20:12:16 +0000
Subject: [PATCH 5/5] fix example to match current command in git-annex semitrust

---
 doc/tips/what_to_do_when_you_lose_a_repository.mdwn | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/tips/what_to_do_when_you_lose_a_repository.mdwn b/doc/tips/what_to_do_when_you_lose_a_repository.mdwn
index 3be13b8abd..363eeea4e0 100644
--- a/doc/tips/what_to_do_when_you_lose_a_repository.mdwn
+++ b/doc/tips/what_to_do_when_you_lose_a_repository.mdwn
@@ -16,4 +16,4 @@ are present.
 If you later found the drive, you could let git-annex know it's found
 like so:

-    git annex semitrusted usbdrive
+    git annex semitrust usbdrive