Merge branch 'master' into newchunks

This commit is contained in:
Joey Hess 2014-08-03 15:04:10 -04:00
commit c653e80829
13 changed files with 272 additions and 17 deletions


@@ -0,0 +1,12 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.112"
subject="comment 3"
date="2014-08-02T23:08:44Z"
content="""
hS3's author seems to have abandoned it and it has other problems. I should try to switch to a different S3 library.
There is now a workaround; S3 special remotes can be configured to use [[chunking]]. A max of one chunk will then be buffered in memory at a time.
For example, to reconfigure an existing mys3 remote: `enableremote mys3 chunk=1MiB`
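To put a number on the memory bound (a sketch with a hypothetical `to_bytes` helper; git-annex parses these size suffixes itself):

```sh
# Hypothetical helper showing what the chunk size notation works out to;
# with chunk=1MiB, at most about this many bytes are buffered at once.
to_bytes() {
  case $1 in
    *KiB) echo $(( ${1%KiB} * 1024 )) ;;
    *MiB) echo $(( ${1%MiB} * 1024 * 1024 )) ;;
    *GiB) echo $(( ${1%GiB} * 1024 * 1024 * 1024 )) ;;
    *)    echo "$1" ;;
  esac
}
to_bytes 1MiB   # 1048576 bytes
```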
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.112"
subject="comment 4"
date="2014-08-03T18:40:26Z"
content="""
Beginning work on a `s3-aws` branch using the aws library instead of hS3.
"""]]


@@ -0,0 +1,14 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.112"
subject="comment 3"
date="2014-08-02T23:13:41Z"
content="""
There is now a workaround; S3 special remotes can be configured to use [[chunking]].
For example, to reconfigure an existing mys3 remote: `enableremote mys3 chunk=1MiB`
I'm leaving this bug open because chunking is not the default (although the assistant does enable it by default), and because this chunking operates at a higher, less efficient level than S3's own multipart upload API. In particular, AWS charges a fee for each HTTP request made for a chunk.
Adding proper multipart support will probably require switching to a different S3 library.
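To put rough numbers on the request overhead (file and chunk sizes here are hypothetical; actual per-request pricing varies):

```sh
# A hypothetical 5 GiB file uploaded with chunk=1MiB needs one PUT
# request per chunk at the S3 API level.
filesize=$((5 * 1024 * 1024 * 1024))
chunksize=$((1024 * 1024))
requests=$(( (filesize + chunksize - 1) / chunksize ))
echo "$requests"   # 5120 requests; multipart upload would need far fewer
```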
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.112"
subject="comment 4"
date="2014-08-03T18:22:58Z"
content="""
The aws library does not support multipart yet either; here's the bug report requesting it: <https://github.com/aristidb/aws/issues/94>
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.112"
subject="comment 5"
date="2014-08-03T18:27:32Z"
content="""
However, I don't think that multipart upload actually allows exceeding the S3 limit of 5 GB per object. Configuring the remote with `chunk=100MiB` *does* allow bypassing whatever S3's maximum object size happens to be.
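For example (sizes hypothetical), a file far past the limit splits into objects that each stay well below it:

```sh
# A hypothetical 20 GiB file stored with chunk=100MiB: each S3 object is
# at most 100 MiB, far under the 5 GB per-object limit.
filesize=$((20 * 1024 * 1024 * 1024))
chunksize=$((100 * 1024 * 1024))
nchunks=$(( (filesize + chunksize - 1) / chunksize ))
echo "$nchunks"   # 205 chunk objects
```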
"""]]


@@ -0,0 +1,57 @@
### Please describe the problem.
While the docs say that WORM keys are a function of a file's basename,
when doing «git annex add .», the generated keys will actually contain
the relative path (with slashes escaped). Not sure whether this is by
design or a bug in its own right. I suppose that to minimize the chance
of collisions on WORM, having the path within the key is preferable.
A problem about this, however, is that the path in the key is not
stable, but varies with the working dir when doing the «git annex
add». So, when a file is added from one working dir (say, the repo
base), later unlocked, and readded from another working dir (say,
somewhere below the repo base), this will generate a different key
even when the file has not been touched.
Is there a rationale for this variability, or should «add» canonicalize
the encoded paths to the repo root?
### What steps will reproduce the problem?
[[!format sh """
# Init
$ git init /tmp/foo
$ cd /tmp/foo && git annex init
$ mkdir baz
$ touch baz/quux
# Add file with working dir at repo root.
$ git annex add --backend=WORM baz
$ git commit -m "first"
# Key includes relative path.
$ readlink baz/quux
../.git/annex/objects/8x/8V/WORM-s0-m1406981486--baz%quux/WORM-s0-m1406981486--baz%quux
# Unlock and readd with working dir at path below repo root.
$ cd baz
$ git annex unlock quux
$ git annex add quux
$ git commit -m "second"
# Relative path is anchored to working dir instead of repo root.
$ readlink quux
../.git/annex/objects/9G/72/WORM-s0-m1406981486--quux/WORM-s0-m1406981486--quux
# End of transcript or log.
"""]]
### What version of git-annex are you using? On what operating system?
Linux 3.15.8
git-annex 5.20140716


@@ -0,0 +1,32 @@
Have started converting lots of special remotes to the new API. Today, S3
and hook got chunking support. I also converted several remotes to the new
API without supporting chunking: bup, ddar, and glacier (which should
support chunking, but there were complications).
This removed 110 lines of code while adding features! And,
I seem to be able to convert them faster than `testremote` can test them. :)
Now that S3 supports chunks, they can be used to work around several
problems with S3 remotes, including file size limits, and a memory leak in
the underlying S3 library.
The S3 conversion included caching of the S3 connection when
storing/retrieving chunks. [Update: Actually, it turns out it didn't;
the hS3 library doesn't support persistent connections. Another reason I
need to switch to a better S3 library!]
But the API doesn't yet support caching
when removing or checking if chunks are present. I should probably expand
the API, but got into some type checker messes when using generic enough
data types to support everything. Should probably switch to `ResourceT`.
Also, I tried, but failed to make `testremote` check that storing a key
is done atomically. The best I could come up with was a test that stored a
key and had another thread repeatedly check if the object was present on
the remote, logging the results and timestamps. It then becomes a
statistical problem -- somewhere toward the end of the log it's ok if the key
has become present -- but too early might indicate that it wasn't stored
atomically. Perhaps it's my poor knowledge of statistics, but I could not
find a way to analyze the log that reliably detected non-atomic storage.
If someone would like to try to work on this, see the `atomic-store-test`
branch.
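One deterministic signal would at least be easy to check when analyzing such a log (a sketch over a toy log, not the `atomic-store-test` code): if the object ever flips from present back to absent during a store, that store certainly wasn't atomic.

```sh
# Scan a presence log (one 'present'/'absent' per line) for a
# present -> absent flip, which indicates a non-atomic store.
nonatomic=0
prev=absent
while read -r state; do
  if [ "$prev" = present ] && [ "$state" = absent ]; then
    nonatomic=1
  fi
  prev=$state
done <<'EOF'
absent
absent
present
present
EOF
echo "$nonatomic"   # 0 for this log: no flip seen
```

Of course this only catches one failure mode; the too-early-presence case is the statistical one.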


@@ -0,0 +1,25 @@
I've noticed something odd when inspecting the history of the
git-annex branch today. Apparently, the branch had some merge
conflicts during sync that involved two alternative location tracking
entries that both were for one and the same remote. Both entries only
differed in their timestamps, and the union merge kept both, so that I
now have .log files in the annex branch that contain duplicate parts
like this.
<pre>
1404838274.151066s 1 a2401cfd-1f58-4441-a2b3-d9bef06220ad
1406978406.24838s 1 a2401cfd-1f58-4441-a2b3-d9bef06220ad
</pre>
The UUID here is my local repository.
The duplication also occurred in the uuid.log:
<pre>
4316c3dc-5b6d-46eb-b780-948c717b7be5 server timestamp=1404839228.113473s
4316c3dc-5b6d-46eb-b780-948c717b7be5 server timestamp=1404847241.863051s
</pre>
Is this something to be concerned about? The situation somehow arose
in relation to unannexing a bunch of files and rebasing the master
branch.
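My working assumption (sketched here with standard tools; this is not git-annex code) is that the newest timestamp per uuid wins when such a log is read, so the duplicate lines would be redundant rather than conflicting:

```sh
# Keep only the newest line per uuid (field 3), mimicking a
# newest-timestamp-wins reading of the location log (my assumption).
dedup_log() {
  sort -k1,1n | awk '{ last[$3] = $0 } END { for (u in last) print last[u] }'
}
result=$(printf '%s\n' \
  '1404838274.151066s 1 a2401cfd-1f58-4441-a2b3-d9bef06220ad' \
  '1406978406.24838s 1 a2401cfd-1f58-4441-a2b3-d9bef06220ad' | dedup_log)
echo "$result"   # only the newer 1406978406... line survives
```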


@@ -0,0 +1,77 @@
Sorry for putting all of this in the same thread, but I don't know what happened or how it is related.
I have just a simple setup: git-annex client with assistant (Windows 7) and on a server (Debian, no assistant).
Suddenly, weird things started to happen:
1.) On Windows, when I start the assistant, it writes "Attempting to repair THINKTANK:c:\data\annex [here]", but it runs forever and never stops.
2.) On Windows, I get "Pusher crashed: failed to read sha from git write-tree [Restart Thread]". When I click "Restart Thread" nothing happens, and the message from (1) persists.
3.) When I run "git annex fsck" on the client, I get thousands of messages like:
<pre>
fsck Fotos/2014/DSC_0303.JPG
** No known copies exist of Fotos/2014/DSC_0303.JPG
failed
</pre>
The same with whereis:
<pre>
$ git annex whereis "Fotos/2014/DSC_0303.JPG"
whereis Fotos/2014/DSC_0303.JPG (0 copies) failed
git-annex: whereis: 1 failed
</pre>
4.) When I do "git annex status", a whole bunch of files are displayed as "M" (modified) although they are not modified; they are not even checked out and should only be on the server ...
5.) On the server, files that should ALWAYS be on the server (configured as "full backup") suddenly lost their data, including data that was also made available on the client. The symlinks are dangling and contain just binary garbage as targets:
<pre>
ls -l
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0011.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0012.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0013.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0014.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0015.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0018.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0019.JPG -> ????
lrwxrwxrwx 1 4 Aug 2 08:55 DSC_0020.JPG -> ????
</pre>
6.) "git annex fsck" on the server is still successful, returning no errors!
7.) Manually executing "git annex sync --content" on both sides does not change anything and does not output any error messages.
8.) On the client:
<pre>
$ git annex group here
error: invalid object 100644 3b3767ae65e5c6d2e3835af3d55fbf2f9e145c8b for '000/0e6/SHA256Es193806--b6d4689fba8e15acd6497f9a7e584c93ea0c8c2199ad32eadac79d59b9f49814.JPG.log'
fatal: git-write-tree: error building trees
manual
(Recording state in git...)
git-annex: failed to read sha from git write-tree
$ git annex wanted here
error: invalid object 100644 3b3767ae65e5c6d2e3835af3d55fbf2f9e145c8b for '000/0e6/SHA256Es193806--b6d4689fba8e15acd6497f9a7e584c93ea0c8c2199ad32eadac79d59b9f49814.JPG.log'
fatal: git-write-tree: error building trees
exclude="*" and present
git-annex: failed to read sha from git write-tree
</pre>
9.) OK, I don't know what happened; I did nothing special, but it seems that the repository is broken :( :(
<pre>
$ git annex --verbose --debug repair
[...]
[2014-08-02 13:27:38 Pacific Daylight Time] read: git ["--git-dir=C:\\Data\\annex\\.git","--work-tree=C:\\Data\\annex","-c","core.bare=false","show","ef3fe549f457783dbbd877b467b4e54b0ebc813c"]
Running git fsck ...
git-annex: DeleteFile "C:\\Data\\annex\\.git\\objects\\2a\\54bb281c80c91ea7a732c0d48db0c5acc0ca2c": permission denied (Access is denied.)
failed
git-annex: repair: 1 failed
</pre>
But this file exists; I can read, write, and delete it manually. There is definitely no "permission denied" ...
Oh no, so desperate :-( Any ideas?
It seems the client repository is broken, but how can it be that files which shouldn't be deleted also got deleted on the server repository?
And how can it be that there are not only broken symlinks, but symlinks whose targets are just binary garbage, while "fsck" still returns success?
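For reference, the dangling links from (5) can be inventoried with standard `find` (a diagnostic sketch; the demo directory here is hypothetical):

```sh
# List symlinks whose targets do not resolve (dangling links).
demo=$(mktemp -d)
ln -s /nonexistent-target "$demo/DSC_0011.JPG"
dangling=$(find "$demo" -type l ! -exec test -e {} \; -print)
echo "$dangling"   # prints the path of the dangling DSC_0011.JPG link
rm -rf "$demo"
```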
(I am happy to share all log files privately but I do not want to publish them here because they contain sensitive data)


@@ -0,0 +1,13 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawmkuFJVGp6WVvJtIV5JYb8IqN8mRvSGQdI"
nickname="Emilio Jesús"
subject="Would you accept a patch?"
date="2014-08-03T01:18:54Z"
content="""
Dear Joey,
I am also interested in using git-annex as a Haskell library, would you accept a patch to the .cabal file then?
Thanks,
Emilio
"""]]


@@ -1,6 +1,8 @@
# Introduction
I want to relate a usability story that happens fairly regularly when I show git-annex to people. The story goes like this.
----
# The story
Antoine sat down at his computer saying, "i have this great movie collection I want to share with you, my friend, because the fair use provisions allow for that, and I use this great git-annex tool that allows me to sync my movie collection between different places". His friend Charlie, a Linux user only vaguely familiar with the internals of how his operating system or legal system actually works, reads this as "yay free movies" and wholeheartedly agrees to lend himself to the experiment.
@@ -10,7 +12,7 @@ Charlie logs into Antoine's computer, named `marcos`. Antoine shows Charlie wher
Antoine then has no solution but to convert the git-annex repository into direct mode, something which takes a significant amount of time and is actually [[designated as "untrusted"|direct_mode]] in the documentation. In fact, so much so that he actually did [[screw up his repository magnificently|bugs/direct_command_leaves_repository_inconsistent_if_interrupted]] because he freaked out when `git-annex direct` started and interrupted it, thinking it would take too long.
----
# Technical analysis
Now I understand it is not necessarily `git-annex`'s responsibility if Thunar (or Nautilus, for that matter) doesn't know how to deal properly with symlinks (hint: just dereference the damn thing already). Maybe I should file a bug about this against Thunar? I also understand that symlinks are useful to ensure the security of the data hosted in `git-annex`, and that I could have used direct mode in the first place. But I like to track changes to those files in git, and direct mode makes that really difficult.
@@ -19,3 +21,9 @@ I didn't file this as a bug because I want to start the conversation, but maybe
(The other being "how do i actually use git annex to sync those files instead of just copying them by hand", but that's for another story!)
-- [[anarcat]]
# Followup
Here is a bug report filed against Thunar, with a patch to fix this behavior: https://bugzilla.xfce.org/show_bug.cgi?id=11065
Similar bugs would need to be filed against Nautilus, at the very least, but probably other file managers, which makes this task a little daunting, to say the least. -- [[anarcat]]


@@ -1,15 +0,0 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.64"
subject="comment 21"
date="2013-11-24T15:58:30Z"
content="""
@Bence the closest I have is some tests of particular special remotes inside Test.hs. The shell equivalent of that code is:
[[!format sh \"\"\"
set -e
git annex copy file --to remote # tests store
git annex drop file # tests checkpresent when remote has file
git annex move file --from remote # tests retrieve and remove
\"\"\"]]
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="zardoz"
ip="78.48.163.229"
subject="comment 2"
date="2014-08-02T14:29:26Z"
content="""
This could be achieved in a generic way by allowing filter binaries in expressions, which are run on the filename and signal match or no match via exit status 0 or 1.
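A sketch of what such a filter binary could look like (hypothetical `myfilter`; the expression syntax for invoking it would still need designing):

```sh
# Hypothetical filter: exit status 0 means the filename matches,
# exit status 1 means it does not.
myfilter() {
  case $1 in
    *.jpg|*.JPG|*.jpeg) return 0 ;;
    *)                  return 1 ;;
  esac
}
myfilter DSC_0303.JPG && echo match      # match
myfilter notes.txt    || echo no-match   # no-match
```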
"""]]