Merge branch 'master' into newchunks

This commit is contained in:
Joey Hess 2014-08-03 15:04:10 -04:00
commit c653e80829
13 changed files with 272 additions and 17 deletions


@@ -0,0 +1,12 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.112"
subject="comment 3"
date="2014-08-02T23:08:44Z"
content="""
hS3's author seems to have abandoned it and it has other problems. I should try to switch to a different S3 library.
There is now a workaround; S3 special remotes can be configured to use [[chunking]]. A max of one chunk will then be buffered in memory at a time.
For example, to reconfigure an existing mys3 remote: `enableremote mys3 chunk=1MiB`
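To put a number on the memory bound (a sketch with a hypothetical `to_bytes` helper; git-annex parses these size suffixes itself):

```sh
# Hypothetical helper showing what the chunk size notation works out to;
# with chunk=1MiB, at most about this many bytes are buffered at once.
to_bytes() {
  case $1 in
    *KiB) echo $(( ${1%KiB} * 1024 )) ;;
    *MiB) echo $(( ${1%MiB} * 1024 * 1024 )) ;;
    *GiB) echo $(( ${1%GiB} * 1024 * 1024 * 1024 )) ;;
    *)    echo "$1" ;;
  esac
}
to_bytes 1MiB   # 1048576 bytes
```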
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.112"
subject="comment 4"
date="2014-08-03T18:40:26Z"
content="""
Beginning work on a `s3-aws` branch using the aws library instead of hS3.
"""]]


@@ -0,0 +1,14 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.112"
subject="comment 3"
date="2014-08-02T23:13:41Z"
content="""
There is now a workaround; S3 special remotes can be configured to use [[chunking]].
For example, to reconfigure an existing mys3 remote: `enableremote mys3 chunk=1MiB`
I'm leaving this bug open because chunking is not the default (although the assistant does enable it by default), and because this chunking operates at a higher, less efficient level than S3's own multipart upload API. In particular, AWS charges a fee for each HTTP request made for a chunk.
Adding proper multipart support will probably require switching to a different S3 library.
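To put rough numbers on the request overhead (file and chunk sizes here are hypothetical; actual per-request pricing varies):

```sh
# A hypothetical 5 GiB file uploaded with chunk=1MiB needs one PUT
# request per chunk at the S3 API level.
filesize=$((5 * 1024 * 1024 * 1024))
chunksize=$((1024 * 1024))
requests=$(( (filesize + chunksize - 1) / chunksize ))
echo "$requests"   # 5120 requests; multipart upload would need far fewer
```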
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.112"
subject="comment 4"
date="2014-08-03T18:22:58Z"
content="""
The aws library does not support multipart yet either; here's the bug report requesting it: <https://github.com/aristidb/aws/issues/94>
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.112"
subject="comment 5"
date="2014-08-03T18:27:32Z"
content="""
However, I don't think that multipart upload actually allows exceeding the S3 limit of 5 GB per object. Configuring the remote with `chunk=100MiB` *does* allow bypassing whatever S3's maximum object size happens to be.
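For example (sizes hypothetical), a file far past the limit splits into objects that each stay well below it:

```sh
# A hypothetical 20 GiB file stored with chunk=100MiB: each S3 object is
# at most 100 MiB, far under the 5 GB per-object limit.
filesize=$((20 * 1024 * 1024 * 1024))
chunksize=$((100 * 1024 * 1024))
nchunks=$(( (filesize + chunksize - 1) / chunksize ))
echo "$nchunks"   # 205 chunk objects
```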
"""]]


@@ -0,0 +1,57 @@
### Please describe the problem.
While the docs say that WORM keys are a function of a file's basename,
when doing «git annex add .», the generated keys will actually contain
the relative path (with slashes escaped). Not sure whether this is by
design or a bug in its own right. I suppose that to minimize the chance
of collisions on WORM, having the path within the key is preferable.
A problem about this, however, is that the path in the key is not
stable, but varies with the working dir when doing the «git annex
add». So, when a file is added from one working dir (say, the repo
base), later unlocked, and readded from another working dir (say,
somewhere below the repo base), this will generate a different key
even when the file has not been touched.
Is there a rationale for this variability, or should «add» canonicalize
the encoded paths to the repo root?
### What steps will reproduce the problem?
[[!format sh """
# Init
$ git init /tmp/foo
$ cd /tmp/foo && git annex init
$ mkdir baz
$ touch baz/quux
# Add file with working dir at repo root.
$ git annex add --backend=WORM baz
$ git commit -m "first"
# Key includes relative path.
$ readlink baz/quux
../.git/annex/objects/8x/8V/WORM-s0-m1406981486--baz%quux/WORM-s0-m1406981486--baz%quux
# Unlock and readd with working dir at path below repo root.
$ cd baz
$ git annex unlock quux
$ git annex add quux
$ git commit -m "second"
# Relative path is anchored to working dir instead of repo root.
$ readlink quux
../.git/annex/objects/9G/72/WORM-s0-m1406981486--quux/WORM-s0-m1406981486--quux
# End of transcript or log.
"""]]
### What version of git-annex are you using? On what operating system?
Linux 3.15.8
git-annex 5.20140716


@@ -0,0 +1,32 @@
Have started converting lots of special remotes to the new API. Today, S3
and hook got chunking support. I also converted several remotes to the new
API without supporting chunking: bup, ddar, and glacier (which should
support chunking, but there were complications).
This removed 110 lines of code while adding features! And,
I seem to be able to convert them faster than `testremote` can test them. :)
Now that S3 supports chunks, they can be used to work around several
problems with S3 remotes, including file size limits, and a memory leak in
the underlying S3 library.
The S3 conversion included caching of the S3 connection when
storing/retrieving chunks. [Update: Actually, it turns out it didn't;
the hS3 library doesn't support persistent connections. Another reason I
need to switch to a better S3 library!]
But the API doesn't yet support caching
when removing or checking if chunks are present. I should probably expand
the API, but got into some type checker messes when using generic enough
data types to support everything. Should probably switch to `ResourceT`.
Also, I tried, but failed to make `testremote` check that storing a key
is done atomically. The best I could come up with was a test that stored a
key and had another thread repeatedly check if the object was present on
the remote, logging the results and timestamps. It then becomes a
statistical problem -- somewhere toward the end of the log it's ok if the key
has become present -- but too early might indicate that it wasn't stored
atomically. Perhaps it's my poor knowledge of statistics, but I could not
find a way to analyze the log that reliably detected non-atomic storage.
If someone would like to try to work on this, see the `atomic-store-test`
branch.
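One deterministic signal would at least be easy to check when analyzing such a log (a sketch over a toy log, not the `atomic-store-test` code): if the object ever flips from present back to absent during a store, that store certainly wasn't atomic.

```sh
# Scan a presence log (one 'present'/'absent' per line) for a
# present -> absent flip, which indicates a non-atomic store.
nonatomic=0
prev=absent
while read -r state; do
  if [ "$prev" = present ] && [ "$state" = absent ]; then
    nonatomic=1
  fi
  prev=$state
done <<'EOF'
absent
absent
present
present
EOF
echo "$nonatomic"   # 0 for this log: no flip seen
```

Of course this only catches one failure mode; the too-early-presence case is the statistical one.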


@@ -0,0 +1,25 @@
I've noticed something odd when inspecting the history of the
git-annex branch today. Apparently, the branch had some merge
conflicts during sync that involved two alternative location tracking
entries that both were for one and the same remote. Both entries only
differed in their timestamps, and the union merge kept both, so that I
now have .log files in the annex branch that contain duplicate parts
like this.
<pre>
1404838274.151066s 1 a2401cfd-1f58-4441-a2b3-d9bef06220ad
1406978406.24838s 1 a2401cfd-1f58-4441-a2b3-d9bef06220ad
</pre>
The UUID here is my local repository.
The duplication also occurred in the uuid.log:
<pre>
4316c3dc-5b6d-46eb-b780-948c717b7be5 server timestamp=1404839228.113473s
4316c3dc-5b6d-46eb-b780-948c717b7be5 server timestamp=1404847241.863051s
</pre>
Is this something to be concerned about? The situation somehow arose
in relation to unannexing a bunch of files and rebasing the master
branch.
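My working assumption (sketched here with standard tools; this is not git-annex code) is that the newest timestamp per uuid wins when such a log is read, so the duplicate lines would be redundant rather than conflicting:

```sh
# Keep only the newest line per uuid (field 3), mimicking a
# newest-timestamp-wins reading of the location log (my assumption).
dedup_log() {
  sort -k1,1n | awk '{ last[$3] = $0 } END { for (u in last) print last[u] }'
}
result=$(printf '%s\n' \
  '1404838274.151066s 1 a2401cfd-1f58-4441-a2b3-d9bef06220ad' \
  '1406978406.24838s 1 a2401cfd-1f58-4441-a2b3-d9bef06220ad' | dedup_log)
echo "$result"   # only the newer 1406978406... line survives
```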


@@ -0,0 +1,77 @@
Sorry for putting all of this in the same thread, but I don't know what happened or how it is related.
I have just a simple setup: git-annex client with assistant (Windows 7) and on a server (Debian, no assistant).
Suddenly, weird things started to happen:
1.) On Windows, when I start the assistant, it writes "Attempting to repair THINKTANK:c:\data\annex [here]", but it runs forever and never stops.
2.) On Windows, I get "Pusher crashed: failed to read sha from git write-tree [Restart Thread]". When I click "Restart Thread" nothing happens, and the message from (1) persists.
3.) When I run "git annex fsck" on the client, I get thousands of messages like:
<pre>
fsck Fotos/2014/DSC_0303.JPG
** No known copies exist of Fotos/2014/DSC_0303.JPG
failed
</pre>
The same with whereis:
<pre>
$ git annex whereis "Fotos/2014/DSC_0303.JPG"
whereis Fotos/2014/DSC_0303.JPG (0 copies) failed
git-annex: whereis: 1 failed
</pre>
4.) When I do "git annex status", a whole bunch of files are displayed as "M" (modified) although they are not modified; they are not even checked out and should only be on the server ...
5.) On the server, files that should ALWAYS be on the server (configured as "full backup") suddenly lost their data, including data that was also made available on the client. The symlinks are dangling and contain just binary garbage as targets:
<pre>
ls -l
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0011.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0012.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0013.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0014.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0015.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0018.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0019.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0020.JPG -> ????
</pre>
6.) "git annex fsck" on the server is still successful, returning no errors!
7.) Manually executing "git annex sync --content" on both sides does not change anything and does not output any error messages.
8.) On the client:
<pre>
$ git annex group here
error: invalid object 100644 3b3767ae65e5c6d2e3835af3d55fbf2f9e145c8b for '000/0e6/SHA256Es193806--b6d4689fba8e15acd6497f9a7e584c93ea0c8c2199ad32eadac79d59b9f49814.JPG.log'
fatal: git-write-tree: error building trees
manual
(Recording state in git...)
git-annex: failed to read sha from git write-tree
$ git annex wanted here
error: invalid object 100644 3b3767ae65e5c6d2e3835af3d55fbf2f9e145c8b for '000/0e6/SHA256Es193806--b6d4689fba8e15acd6497f9a7e584c93ea0c8c2199ad32eadac79d59b9f49814.JPG.log'
fatal: git-write-tree: error building trees
exclude="*" and present
git-annex: failed to read sha from git write-tree
</pre>
9.) OK, I don't know what happened; I did nothing special, but it seems that the repository is broken :( :(
<pre>
$ git annex --verbose --debug repair
[...]
[2014-08-02 13:27:38 Pacific Daylight Time] read: git ["--git-dir=C:\\Data\\annex\\.git","--work-tree=C:\\Data\\annex","-c","core.bare=false","show","ef3fe549f457783dbbd877b467b4e54b0ebc813c"]
Running git fsck ...
git-annex: DeleteFile "C:\\Data\\annex\\.git\\objects\\2a\\54bb281c80c91ea7a732c0d48db0c5acc0ca2c": permission denied (Access is denied.)
failed
git-annex: repair: 1 failed
</pre>
But this file exists; I can read, write, and delete it manually. There is definitely no "permission denied" ...
Oh no, so desperate :-( Any ideas?
It seems the client repository is broken, but how can it be that files which shouldn't be deleted also got deleted on the server repository?
And how can it be that there are not only broken symlinks, but symlinks whose targets are just binary garbage, while "fsck" still returns success?
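For reference, the dangling links from (5) can be inventoried with standard `find` (a diagnostic sketch; the demo directory here is hypothetical):

```sh
# List symlinks whose targets do not resolve (dangling links).
demo=$(mktemp -d)
ln -s /nonexistent-target "$demo/DSC_0011.JPG"
dangling=$(find "$demo" -type l ! -exec test -e {} \; -print)
echo "$dangling"   # prints the path of the dangling DSC_0011.JPG link
rm -rf "$demo"
```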
(I am happy to share all log files privately but I do not want to publish them here because they contain sensitive data)


@@ -0,0 +1,13 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawmkuFJVGp6WVvJtIV5JYb8IqN8mRvSGQdI"
nickname="Emilio Jesús"
subject="Would you accept a patch?"
date="2014-08-03T01:18:54Z"
content="""
Dear Joey,
I am also interested in using git-annex as a Haskell library, would you accept a patch to the .cabal file then?
Thanks,
Emilio
"""]]


@@ -1,6 +1,8 @@
# Introduction
I want to relate a usability story that happens fairly regularly when I show git-annex to people. The story goes like this.
----
# The story
Antoine sat down at his computer saying, "i have this great movie collection I want to share with you, my friend, because the fair use provisions allow for that, and I use this great git-annex tool that allows me to sync my movie collection between different places". His friend Charlie, a Linux user only vaguely familiar with the internals of how his operating system or legal system actually works, reads this as "yay free movies" and wholeheartedly agrees to lend himself to the experiment.
@@ -10,7 +12,7 @@ Charlie logs into Antoine's computer, named `marcos`. Antoine shows Charlie wher
Antoine then has no solution but to convert the git-annex repository into direct mode, something which takes a significant amount of time and is actually [[designated as "untrusted"|direct_mode]] in the documentation. In fact, so much so that he actually did [[screw up his repository magnificently|bugs/direct_command_leaves_repository_inconsistent_if_interrupted]] because he freaked out when `git-annex direct` started and interrupted it, thinking it would take too long.
----
# Technical analysis
Now I understand it is not necessarily `git-annex`'s responsibility if Thunar (or Nautilus, for that matter) doesn't know how to deal properly with symlinks (hint: just dereference the damn thing already). Maybe I should file a bug about this against Thunar? I also understand that symlinks are useful to ensure the security of the data hosted in `git-annex`, and that I could have used direct mode in the first place. But I like to track changes to those files in git, and direct mode makes that really difficult.
@@ -19,3 +21,9 @@ I didn't file this as a bug because I want to start the conversation, but maybe
(The other being "how do i actually use git annex to sync those files instead of just copying them by hand", but that's for another story!)
-- [[anarcat]]
# Followup
Here is a bug report filed against Thunar, with a patch to fix this behavior: https://bugzilla.xfce.org/show_bug.cgi?id=11065
Similar bugs would need to be filed against Nautilus, at the very least, but probably other file managers, which makes this task a little daunting, to say the least. -- [[anarcat]]


@@ -1,15 +0,0 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.64"
subject="comment 21"
date="2013-11-24T15:58:30Z"
content="""
@Bence the closest I have is some tests of particular special remotes inside Test.hs. The shell equivalent of that code is:
[[!format sh \"\"\"
set -e
git annex copy file --to remote # tests store
git annex drop file # tests checkpresent when remote has file
git annex move file --from remote # tests retrieve and remove
\"\"\"]]
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="zardoz"
ip="78.48.163.229"
subject="comment 2"
date="2014-08-02T14:29:26Z"
content="""
This could be achieved in a generic way by allowing filter binaries in expressions, which are run on the filename and signal match or no match via exit status 0 or 1.
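A sketch of what such a filter binary could look like (hypothetical `myfilter`; the expression syntax for invoking it would still need designing):

```sh
# Hypothetical filter: exit status 0 means the filename matches,
# exit status 1 means it does not.
myfilter() {
  case $1 in
    *.jpg|*.JPG|*.jpeg) return 0 ;;
    *)                  return 1 ;;
  esac
}
myfilter DSC_0303.JPG && echo match      # match
myfilter notes.txt    || echo no-match   # no-match
```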
"""]]