Merge branch 'master' into newchunks

This commit is contained in:
Joey Hess 2014-08-02 17:25:50 -04:00
commit 0c7c39840d
32 changed files with 535 additions and 19 deletions

View file

@ -51,7 +51,7 @@ dropDead f content trustmap = case getLogVariety f of
Just OtherLog -> PreserveFile
Nothing -> PreserveFile
dropDeadFromMapLog :: TrustMap -> (k -> UUID) -> M.Map k v -> M.Map k v
dropDeadFromMapLog :: Ord k => TrustMap -> (k -> UUID) -> M.Map k v -> M.Map k v
dropDeadFromMapLog trustmap getuuid = M.filterWithKey $ \k _v -> notDead trustmap getuuid k
{- Presence logs can contain UUIDs or other values. Any line that matches

View file

@ -106,7 +106,8 @@ runTransfer t file shouldretry a = do
v <- tryAnnex run
case v of
Right b -> return b
Left _ -> do
Left e -> do
warning (show e)
b <- getbytescomplete metervar
let newinfo = oldinfo { bytesComplete = Just b }
if shouldretry oldinfo newinfo

debian/changelog
View file

@ -15,6 +15,7 @@ git-annex (5.20140718) UNRELEASED; urgency=medium
Fix this, including support for fixing up repositories that
were incompletely repaired before.
* Fix cost calculation for non-encrypted remotes.
* Display exception message when a transfer fails due to an exception.
* WebDAV: Dropped support for DAV before 0.6.1.
* testremote: New command to test uploads/downloads to a remote.

View file

@ -0,0 +1,15 @@
I get the following error message upon starting git-annex in a second user account on Android:
Falling back to hardcoded app location: cannot find expected files in /data/app-lib
git annex webapp
lib/lib.runshell.so: line 133: git: Permission denied
[Terminal session finished]
The same version of git-annex works just fine for the primary user.
(The primary user has root access which unfortunately can't be enabled for other user accounts.)
### What version of git-annex are you using? On what operating system?
* git-annex: 5.20140710
* OS: CyanogenMod 10.1.3-p3110

View file

@ -0,0 +1,43 @@
### Please describe the problem.
When `git annex direct` is interrupted (either through a power outage or deliberate `control-c`) it may leave the repository in an inconsistent state.
A typical situation is `git-annex` believing that the repo is in `indirect` mode while the files are not symlinks anymore.
I believe I have described this problem here before, but the bug report was deleted as part of the May 29th purge (222f78e9eadd3d2cc40ec94ab22241823a7d50d9, [[bugs/git_annex_indirect_can_fail_catastrophically]]).
### What steps will reproduce the problem?
`git annex direct` on a large repository, `control-c` before it finishes.
Observe how a lot of files are now considered to be in the famous [[typechange status|forum/git-status_typechange_in_direct_mode/]] in git.
### What version of git-annex are you using? On what operating system?
5.20140717 on Debian Jessie, ext4 filesystem.
### Please provide any additional information below.
I wish I could resume the `git annex direct` command, but this will do a `git commit -a` and therefore commit all those files to git directly. It still seems to me that `git annex` should never run `git commit -a`, for exactly this kind of situation.
I think that's it for now. -- [[anarcat]]
Update: I was able to get rid of the `typechange` situation by running `git annex lock` on the repository, but then all files are found to be missing by `git annex fsck`:
[[!format txt """
fsck films/God Hates Cartoons/VIDEO_TS/VTS_15_0.BUP (fixing location log)
** Based on the location log, films/God Hates Cartoons/VIDEO_TS/VTS_15_0.BUP
** was expected to be present, but its content is missing.
Only 1 of 2 trustworthy copies exist of films/God Hates Cartoons/VIDEO_TS/VTS_15_0.BUP
Back it up with git-annex copy.
"""]]
Oddly enough, the repo still uses hundreds of gigs, because all the files ended up in `.git/annex/misctmp`. Not sure I remember what happened there.
Similar issues and discussions:
* [[bugs/direct_mode_merge_interrupt/]]
* [[forum/Cleaning_up_after_aborted_sync_in_direct_mode/]]
* [[bugs/failure_to_return_to_indirect_mode_on_usb/]]
* [[forum/git-status_typechange_in_direct_mode/]]

View file

@ -0,0 +1,19 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawm8BAEUyzYhORZmMuocRTk4M-3IumDm5VU"
nickname="luciusf0"
subject="Bug still valid"
date="2014-07-31T08:35:29Z"
content="""
The bug is still valid. A lot of German users had to use the @googlemail.com address as Google couldn't get the gmail.com domain in Germany.
So it might be bothering not just a few people, but a whole country! Now, if that doesn't count ...
Mac OSX 10.9.4
Version: 5.20140717-g5a7d4ff
Build flags: Assistant Webapp Webapp-secure Pairing Testsuite S3 WebDAV FsEvents XMPP DNS Feeds Quvi TDFA CryptoHash
This is the message I get
Unable to connect to the Jabber server. Maybe you entered the wrong password? (Error message: host xmpp.l.google.com.:5222 failed: AuthenticationFailure (Element {elementName = Name {nameLocalName = \"failure\", nameNamespace = Just \"urn:ietf:params:xml:ns:xmpp-sasl\", namePrefix = Nothing}, elementAttributes = [], elementNodes = [NodeElement (Element {elementName = Name {nameLocalName = \"not-authorized\", nameNamespace = Just \"urn:ietf:params:xml:ns:xmpp-sasl\", namePrefix = Nothing}, elementAttributes = [], elementNodes = []})]}); host alt2.xmpp.l.google.com.:5222 failed: AuthenticationFailure (Element {elementName = Name {nameLocalName = \"failure\", nameNamespace = Just \"urn:ietf:params:xml:ns:xmpp-sasl\", namePrefix = Nothing}, elementAttributes = [], elementNodes = [NodeElement (Element {elementName = Name {nameLocalName = \"not-authorized\", nameNamespace = Just \"urn:ietf:params:xml:ns:xmpp-sasl\", namePrefix = Nothing}, elementAttributes = [], elementNodes = []})]}); host alt1.xmpp.l.google.com.:5222 failed: AuthenticationFailure (Element {elementName = Name {nameLocalName = \"failure\", nameNamespace = Just \"urn:ietf:params:xml:ns:xmpp-sasl\", namePrefix = Nothing}, elementAttributes = [], elementNodes = [NodeElement (Element {elementName = Name {nameLocalName = \"not-authorized\", nameNamespace = Just \"urn:ietf:params:xml:ns:xmpp-sasl\", namePrefix = Nothing}, elementAttributes = [], elementNodes = []})]}); host alt4.xmpp.l.google.com.:5222 failed: AuthenticationFailure (Element {elementName = Name {nameLocalName = \"failure\", nameNamespace = Just \"urn:ietf:params:xml:ns:xmpp-sasl\", namePrefix = Nothing}, elementAttributes = [], elementNodes = [NodeElement (Element {elementName = Name {nameLocalName = \"not-authorized\", nameNamespace = Just \"urn:ietf:params:xml:ns:xmpp-sasl\", namePrefix = Nothing}, elementAttributes = [], elementNodes = []})]}); host alt3.xmpp.l.google.com.:5222 failed: AuthenticationFailure (Element {elementName = Name {nameLocalName = \"failure\", nameNamespace = Just \"urn:ietf:params:xml:ns:xmpp-sasl\", namePrefix = Nothing}, elementAttributes = [], elementNodes = [NodeElement (Element {elementName = Name {nameLocalName = \"not-authorized\", nameNamespace = Just \"urn:ietf:params:xml:ns:xmpp-sasl\", namePrefix = Nothing}, elementAttributes = [], elementNodes = []})]}))
"""]]

View file

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawmwjQzWgiD7_I3zw-_91rMRf_6qoThupis"
nickname="Mike"
subject="comment 8"
date="2014-07-30T20:33:44Z"
content="""
Great work Joeyh :-) I will install the new version soon. It is fantastic that you fixed this so thoroughly.
"""]]

View file

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://svario.it/gioele"
nickname="gioele"
subject="comment 1"
date="2014-07-29T14:25:19Z"
content="""
For the record, the solution Joey suggested works, but the correct option to pass to `sync` is `-c annex.alwayscommit=true`.
"""]]

View file

@ -0,0 +1,84 @@
### Please describe the problem.
`git annex whereis` says that there are no copies of any of the files annexed in repositories running in direct mode.
This is the error received:
$ git annex whereis
whereis fileA (0 copies) failed
whereis fileB (0 copies) failed
git-annex: whereis: 2 failed
### What steps will reproduce the problem?
The following script (available at <https://gist.github.com/gioele/dde462df89edfe17c5e3>) will reproduce this problem.
[[!format sh """
#!/bin/sh -x
set -e ; set -u
export LC_ALL=C
direct=true # set to false to make the problem disappear
h=${h:-localhost}
dr="/tmp/annex"
sync='sync -c annex.alwayscommit=true'
chmod a+rwx -R pc1 pc2 || true
rm -Rf pc1 pc2
# create central git repo
ssh $h "chmod a+rwx -R ${dr}/Docs.git" || true
ssh $h "rm -Rf ${dr}/Docs.git"
ssh $h "mkdir -p ${dr}/Docs.git"
ssh $h "cd ${dr}/Docs.git ; git init --bare"
d=$(pwd)
# populate repo in PC1
mkdir -p pc1/Docs
cd pc1/Docs
echo AAA > fileA
echo BBB > fileB
git init
git remote add origin $h:$dr/Docs.git
git fetch --all
# simulate a host without git-annex
git config remote.origin.annex-ignore true
git annex init "pc1"
git annex info
$direct && git annex direct
git annex add .
git annex $sync origin
# re-create repo on PC2
cd $d
mkdir -p pc2
cd pc2
git clone $h:$dr/Docs.git
cd Docs
git config remote.origin.annex-ignore true
git annex init "pc2"
git annex info
git annex whereis || true
echo "I was expecting location info to be available after info (press Enter)" ; read enter
git annex $sync origin
git annex whereis || true
echo "Why isn't location info available even after sync? (press Enter)"
"""]]
### What version of git-annex are you using? On what operating system?
git-annex version: 5.20140708-g42df533

View file

@ -10,7 +10,7 @@ Note that git-annex has to buffer chunks in memory before they are sent to
a remote. So, using a large chunk size will make it use more memory.
To enable chunking, pass a `chunk=nnMiB` parameter to `git annex
initremote, specifying the chunk size.
initremote`, specifying the chunk size.
Good chunk sizes will depend on the remote, but a good starting place
is probably `1MiB`. Very large chunks are problematic, both because

View file

@ -160,17 +160,11 @@ Instead of storing the chunk count in the special remote, store it in
the git-annex branch.
The location log does not record locations of individual chunk keys
(too space-inefficient).
Instead, look at git-annex:aaa/bbb/SHA256-s12345--xxxxxxx.log.cnk to get
the chunk count and size for a key.
(too space-inefficient). Instead, look at a chunk log in the
git-annex branch to get the chunk count and size for a key.
Note that a given remote uuid might have multiple chunk sizes logged, if a
key was stored on it twice using different chunk sizes. Also note that even
when this file exists for a key, the object may be stored non-chunked on
the remote too.
`hasKey` would check if any one (chunksize, chunkcount) is satisfied by
the files on the remote. It would also check if the non-chunked key is
`hasKey` would check if any of the logged sets of chunks is
present on the remote. It would also check if the non-chunked key is
present, as a fallback.
When dropping a key from the remote, drop all logged chunk sizes.
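To make the `hasKey` fallback above concrete, here is a minimal sketch (not git-annex's actual code; `Key` is a placeholder type and `checkPresent` stands in for the remote's per-key presence check):

[[!format haskell """
type Key = String  -- placeholder for illustration

-- Check each logged set of chunk keys; if none is complete on the remote,
-- fall back to checking for the non-chunked object.
hasKeyChunked :: (Key -> IO Bool) -> Key -> [[Key]] -> IO Bool
hasKeyChunked checkPresent key chunksets = do
    complete <- mapM (fmap and . mapM checkPresent) chunksets
    if or complete
        then return True
        else checkPresent key
"""]]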
@ -185,6 +179,31 @@ remote doesn't know anything about chunk sizes. It uses a little more
data in the git-annex branch, although with care (using the same timestamp
as the location log), it can compress pretty well.
## chunk log
Stored in the git-annex branch, this provides a mapping `Key -> [[Key]]`.
Note that a given remote uuid might have multiple sets of chunks (with
different sizes) logged, if a key was stored on it twice using different
chunk sizes. Also note that even when the log indicates a key is chunked,
the object may be stored non-chunked on the remote too.
For fixed size chunks, there's no need to store the list of chunk keys;
instead the log only records the number of chunks (needed because the size
of the parent Key may not be known), and the chunk size.
Example:
1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55:10240 9
Later, might want to support other kinds of chunks, for example ones made
using a rsync-style rolling checksum. It would probably not make sense to
store the full [Key] list for such chunks in the log. Instead, it might be
stored in a file on the remote.
To support such future developments, when updating the chunk log,
git-annex should preserve unparsable values (the part after the colon).
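As an illustration of the line format above (a sketch only, not git-annex's parser; the names here are invented), such a line could be read while keeping an unparsable part after the colon verbatim:

[[!format haskell """
data ChunkMethod = FixedSizeChunks Integer | UnknownChunks String
    deriving (Show)

-- (timestamp, remote uuid, chunking method, chunk count)
parseChunkLogLine :: String -> Maybe (String, String, ChunkMethod, Integer)
parseChunkLogLine l = case words l of
    [ts, field, n] -> do
        (u, rest) <- breakColon field
        count <- readMaybe n
        -- An unrecognized chunking scheme is preserved rather than rejected.
        let method = maybe (UnknownChunks rest) FixedSizeChunks (readMaybe rest)
        Just (ts, u, method, count)
    _ -> Nothing
  where
    breakColon s = case break (== ':') s of
        (u, ':':rest) -> Just (u, rest)
        _ -> Nothing
    readMaybe s = case reads s of
        [(v, "")] -> Just v
        _ -> Nothing

-- parseChunkLogLine "1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55:10240 9"
"""]]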
## chunk then encrypt
Rather than encrypting the whole object 1st and then chunking, chunk and
@ -239,3 +258,14 @@ checking hasKey.
Note that this is safe to do only as long as the Key being transferred
cannot possibly have 2 different contents in different repos. Notably not
necessarily the case for the URL keys generated for quvi.
Both **done**.
## parallel
If 2 remotes both support chunking, uploading could upload different chunks
to them in parallel. However, the chunk log does not currently allow
representing the state where some chunks are on one remote and others on
another remote.
Parallel downloading of chunks from different remotes is a bit more doable.

View file

@ -4,6 +4,24 @@ One simple way is to find the key of the old version of a file that's
being transferred, so it can be used as the basis for rsync, or any
other similar transfer protocol.
For remotes that don't use rsync, a poor man's version could be had by
chunking each object into multiple parts. Only modified parts need be
transferred. Sort of sub-keys to the main key being stored.
For remotes that don't use rsync, use a rolling checksum based chunker,
such as BuzHash. This will produce [[chunks]], which can be stored on the
remote as regular Keys -- where unlike the fixed size chunk keys, the
SHA256 part of these keys is the checksum of the chunk they contain.
Once that's done, it's easy to avoid uploading chunks that have been sent
to the remote before.
When retrieving a new version of a file, there would need to be a way to get
the list of chunk keys that constitute the new version. Probably best to
store this list on the remote. Then there needs to be a way to find which
of those chunks are available in locally present files, so that the locally
available chunks can be extracted, and combined with the chunks that need
to be downloaded, to reconstitute the file.
To find which chunks are locally available, here are 2 ideas:
1. Use a single basis file, eg an old version of the file. Re-chunk it, and
use its chunks. Slow, but simple.
2. Some kind of database of locally available chunks. Would need to be kept
up-to-date as files are added, and as files are downloaded.
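A rough sketch of that flow using idea 1 (names are hypothetical; a fixed-size `chunksOf` stands in for a rolling-checksum chunker, and chunks are compared by content rather than by their chunk Keys):

[[!format haskell """
import qualified Data.ByteString.Lazy as L
import qualified Data.Set as S

-- Stand-in chunker; a real one would cut at rolling-checksum boundaries.
chunksOf :: Int -> L.ByteString -> [L.ByteString]
chunksOf n b
    | L.null b = []
    | otherwise = let (c, rest) = L.splitAt (fromIntegral n) b in c : chunksOf n rest

-- Split the wanted chunks of the new version into those already present in
-- the basis file (an old version) and those that still need downloading.
planTransfer :: FilePath -> [L.ByteString] -> IO ([L.ByteString], [L.ByteString])
planTransfer basis wanted = do
    local <- fmap (S.fromList . chunksOf 65536) (L.readFile basis)
    return (filter (`S.member` local) wanted, filter (`S.notMember` local) wanted)
"""]]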

View file

@ -14,5 +14,10 @@ Now in the
* Month 8 [[!traillink git-remote-daemon]]
* Month 9 Brazil!, [[!traillink assistant/sshpassword]]
* Month 10 polish [[assistant/Windows]] port
* **Month 11 [[!traillink assistant/chunks]], [[!traillink assistant/deltas]], [[!traillink assistant/gpgkeys]] (pick 2?)**
* Month 12 [[!traillink assistant/telehash]]
* Month 11 [[!traillink assistant/chunks]]
* **Month 12** user-driven features and polishing
Deferred until later:
* Month XX [[!traillink assistant/deltas]], [[!traillink assistant/gpgkeys]]
* Month XX [[!traillink assistant/telehash]]

View file

@ -0,0 +1,83 @@
Zap! ... My internet gateway was [destroyed by lightning](https://identi.ca/joeyh/note/xogvXTFDR9CZaCPsmKZipA).
Limping along regardless, and replacement ordered.
Got resuming of uploads to chunked remotes working. Easy!
----
Next I want to convert the external special remotes to have these nice
new features. But there is a wrinkle: The new chunking interface works
entirely on ByteStrings containing the content, but the external special
remote interface passes content around in files.
I could just make it write the ByteString to a temp file, and pass the temp
file to the external special remote to store. But then, when chunking is
not being used, it would pointlessly read a file's content, only to write
it back out to a temp file.
Similarly, when retrieving a key, the external special remote saves it to a
file. But we want a ByteString. Except, when not doing chunking or
encryption, letting the external special remote save the content directly
to a file is optimal.
One approach would be to change the protocol for external special
remotes, so that the content is sent over the protocol rather than in temp
files. But I think this would not be ideal for some kinds of external
special remotes, and it would probably be quite a lot slower and more
complicated.
Instead, I am playing around with some type class trickery:
[[!format haskell """
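-- (Sketch as posted; it assumes git-annex's Key and MeterUpdate types,
-- withTmpFile from its utility code, and Data.ByteString.Lazy imported as L.)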
{-# LANGUAGE Rank2Types, TypeSynonymInstances, FlexibleInstances, MultiParamTypeClasses #-}

type Storer p = Key -> p -> MeterUpdate -> IO Bool

-- For Storers that want to be provided with a file to store.
type FileStorer a = Storer (ContentPipe a FilePath)

-- For Storers that want to be provided with a ByteString to store
type ByteStringStorer a = Storer (ContentPipe a L.ByteString)

class ContentPipe src dest where
    contentPipe :: src -> (dest -> IO a) -> IO a

instance ContentPipe L.ByteString L.ByteString where
    contentPipe b a = a b

-- This feels a lot like I could perhaps use pipes or conduit...
instance ContentPipe FilePath FilePath where
    contentPipe f a = a f

instance ContentPipe L.ByteString FilePath where
    contentPipe b a = withTmpFile "tmpXXXXXX" $ \f h -> do
        L.hPut h b
        hClose h
        a f

instance ContentPipe FilePath L.ByteString where
    contentPipe f a = a =<< L.readFile f
"""]]
The external special remote would be a FileStorer, so when a non-chunked,
non-encrypted file is provided, it just runs on the FilePath with no extra
work. While when a ByteString is provided, it's swapped out to a temp file
and the temp file provided. And many other special remotes are ByteStringStorers,
so they will just pass the provided ByteString through, or read in the
content of a file.
I think that would work. Though it is not optimal for external special
remotes that are chunked but not encrypted. For that case, it might be worth
extending the special remote protocol with a way to say "store a chunk of
this file from byte N to byte M".
---
Also, talked with ion about what would be involved in using rolling checksum
based chunks. That would allow for rsync or zsync like behavior, where
when a file changed, git-annex uploads only the chunks that changed, and the
unchanged chunks are reused.
I am not ready to work on that yet, but I made some changes to the parsing
of the chunk log, so that additional chunking schemes like this can be added
to git-annex later without breaking backwards compatibility.

View file

@ -0,0 +1,34 @@
It took 9 hours, but I finally got to make [[!commit c0dc134cded6078bb2e5fa2d4420b9cc09a292f7]],
which both removes 35 lines of code, and adds chunking support to all
external special remotes!
The groundwork for that commit involved taking the type scheme I sketched
out yesterday, completely failing to make it work with such high-ranked
types, and falling back to a simpler set of types that both I and GHC seem
better at getting our heads around.
Then I also had more fun with types, when it turned out I needed to
run encryption in the Annex monad. So I had to go convert several parts of
the utility libraries to use MonadIO and exception lifting. Yurk.
The final and most fun stumbling block caused git-annex to crash when
retrieving a file from an external special remote that had neither
encryption nor chunking. Amusingly it was because I had not put in an
optimisation (namely, just renaming the file that was retrieved in this case,
rather than unnecessarily reading it in and writing it back out). It's
not often that a lack of an optimisation causes code to crash!
So, fun day, great result, and it should now be very simple to convert
the bup, ddar, gcrypt, glacier, rsync, S3, and WebDAV special remotes
to the new system. Fingers crossed.
But first, I will probably take half a day or so and write a
`git annex testremote` that can be run in a repository and does live
testing of a special remote including uploading and downloading files.
There are quite a lot of cases to test now, and it seems best to get
that in place before I start changing a lot of remotes without a way to
test everything.
----
Today's work was sponsored by Daniel Callahan.

View file

@ -0,0 +1,10 @@
Built `git annex testremote` today.
That took a little bit longer than expected, because it actually found
several fence post bugs in the chunking code.
It also found a bug in the sample external special remote script.
I am very pleased with this command. Being able to run 640 tests against
any remote, without any possibility of damaging data already stored in the
remote, is awesome. Should have written it a looong time ago!

View file

@ -0,0 +1 @@
Is there a way to configure a central git repository that keeps track of large files with git annex so that multiple users can clone the repository, but no repository clone can drop files from the server? Essentially, I'm looking for a way to have one repository that is always populated with at least one copy of each file. Other users shouldn't be able to tell that repository to drop any files (but would be able to add files to it). The term "user" in that last sentence really refers to other clones...

View file

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawn-TDneVW-8kwb1fyTRAJfH3l1xs2VSEmk"
nickname="James"
subject="comment 1"
date="2014-07-30T20:37:27Z"
content="""
It might not suit all your needs but you could try using gitolite and set permissions on the git-annex branch of your repository
http://gitolite.com/gitolite/conf.html#write-types
"""]]

View file

@ -0,0 +1,3 @@
Firefox has the nasty habit of force-dereferencing symlinks when locally opening files (i.e., opening an annexed document will cause it to be opened in .git/annex/objects/…). Since this will break relative links within HTML files, it would make Firefox pretty useless when working with a git annex containing HTML files. (Apparently this behavior is [desired](https://bugzilla.mozilla.org/show_bug.cgi?id=803999) upstream and might not be fixed.)
Seems I'm not the only one who would like to work with annexed HTML files, though. On the [Debian bugtracker](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=691099), another user shared a handy shim which can be used in LD_PRELOAD and which will force Firefox to open symlinks in-place. Thought I'd share this here in case it's of use to anyone.

View file

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="zardoz"
ip="78.48.163.229"
subject="comment 1"
date="2014-07-31T11:43:16Z"
content="""
Sorry, it escaped my attention there's a dedicated tips forum. Maybe this should be moved there.
"""]]

View file

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="24.159.78.125"
subject="comment 1"
date="2014-07-30T14:55:16Z"
content="""
Known bug: [[bugs/Git_annexed_files_symlink_are_wrong_when_submodule_is_not_in_the_same_path]]
I don't think there's much likelihood of a fix though. Using direct mode probably works around the problem. Or you can use something like myrepos instead of git subtrees.
"""]]

View file

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawkftzaCvV7EDKVDfJhsQZ3E1Vn-0db516w"
nickname="Edward"
subject="One snag"
date="2014-07-28T19:37:04Z"
content="""
I set up a non-bare repo on a server by following the above steps (git init, git annex init, then add it as a Remote Server from elsewhere and combine repos). It worked, but I hit a snag and needed to add another step.
After git init, you're not sitting on any branch yet, and that seems to have prevented the assistant from doing anything to synchronize the working tree on the server. After I did \"git checkout synced/master\", it started working.
"""]]

View file

@ -0,0 +1,20 @@
I am trying to use S3 as a file store for git annex. I have set up the remote via the following command:
git annex initremote xxx-s3 type=S3 encryption=shared embedcreds=yes datacenter=EU bucket=xxx-git-annex fileprefix=test/
The remote gets set up correctly and creates the directory I want, and adds a annex-uuid file.
Now when I try to copy a file to the xxx-s3 remote, I get the following error:
$ git annex add ssl-success-and-failure-with-tl-logs.log
add ssl-success-and-failure-with-tl-logs.log ok
(Recording state in git...)
$ git annex copy ssl-success-and-failure-with-tl-logs.log --to xxx-s3
copy ssl-success-and-failure-with-tl-logs.log (gpg) gpg: no valid OpenPGP data found.
gpg: decrypt_message failed: eof
git-annex: user error (gpg ["--batch","--no-tty","--use-agent","--quiet","--trust-model","always","--batch","--passphrase-fd","10","--decrypt"] exited 2)
failed
git-annex: copy: 1 failed
Any ideas what might be wrong? Is shared cipher broken somehow?

View file

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawmAINLSovhWM_4_KrbngOcxduIbBuKv8ZA"
nickname="Nuutti"
subject="comment 1"
date="2014-08-01T09:28:21Z"
content="""
Sorry, this should probably be in bugs.
"""]]

View file

@ -0,0 +1,21 @@
I want to relate a usability story that happens fairly regularly when I show git-annex to people. The story goes like this.
----
Antoine sat down at his computer saying, "i have this great movie collection I want to share with you, my friend, because the fair use provisions allow for that, and I use this great git-annex tool that allows me to sync my movie collection between different places". His friend Charlie, a Linux user only vaguely familiar with the internals of how his operating system or legal system actually works, reads this as "yay free movies" and wholeheartedly agrees to lend himself to the experiment.
Antoine creates a user account for Charlie on his home computer, because he doesn't want to have to do everything himself. "That way you can choose which movies you want, because you probably don't want my complete movie collection!" Charlie emphatically responds: "right, I only have my laptop and this USB key here, so I don't think I can get it all".
Charlie logs into Antoine's computer, named `marcos`. Antoine shows Charlie where the movies are located (`/srv/video`) through the file browser (Thunar, for the record). Charlie inserts his USB key into `marcos` and a new icon for the USB key shows up. Then Charlie finds a video he likes, copies and pastes it into the USB key. But instead of a familiar progress bar, Charlie is prompted with a dialog that says "Le système de fichiers ne gère pas les liens symboliques." (Antoine is French, so excuse him; this weird message says that the filesystem doesn't support symbolic links.) Puzzled, Charlie tries to copy the file to his home directory instead. This works better, but the file has a little arrow on it, which seems odd to Charlie. He then asks Antoine for advice.
Antoine then has no solution but to convert the git-annex repository into direct mode, something which takes a significant amount of time and is actually [[designated as "untrusted"|direct_mode]] in the documentation. In fact, so much so that he actually did [[screw up his repository magnificently|bugs/direct_command_leaves_repository_inconsistent_if_interrupted]] because he freaked out when `git-annex direct` started and interrupted it because he thought it would take too long.
----
Now I understand it is not necessarily `git-annex`'s responsibility if Thunar (or Nautilus, for that matter) doesn't know how to properly deal with symlinks (hint: just dereference the damn thing already). Maybe I should file a bug about this against thunar? I also understand that symlinks are useful to ensure the security of the data hosted in `git-annex`, and that I could have used direct mode in the first place. But I like to track changes in git to those files, and direct mode makes that really difficult.
I didn't file this as a bug because I want to start the conversation, but maybe it should qualify as a usability bug. As things stand, this is one of the biggest hurdles in teaching people about git annex.
(The other being "how do I actually use git annex to sync those files instead of just copying them by hand", but that's for another story!)
-- [[anarcat]]

View file

@ -81,3 +81,21 @@ on special remotes, instead use `git annex unused --from`. Example:
(To remove unwanted data: git-annex dropunused --from mys3 NUMBER)
$ git annex dropunused --from mys3 1
dropunused 12948 (from mys3...) ok
## Testing special remotes
To make sure that a special remote is working correctly, you can use the
`git annex testremote` command. This expects you to have set up the remote
as usual, and it then runs a lot of tests, using random data. It's
particularly useful to test new implementations of special remotes.
By default it will upload and download files of around 1MiB to the remote
it tests; the `--size` parameter can adjust it to test using smaller files.
It's safe to use this command even when you're already storing data in a
remote; it won't touch your existing files stored on the remote.
For most remotes, it also won't bloat the remote with any data, since
it cleans up the stuff it uploads. However, the bup, ddar, and tahoe
special remotes don't support removal of uploaded files, so be careful
with those.

View file

@ -6,7 +6,7 @@ keeps track of.
One nice way to use the metadata is through **views**. You can ask
git-annex to create a view of files in the currently checked out branch
that have certian metadata. Once you're in a view, you can move and copy
that have certain metadata. Once you're in a view, you can move and copy
files to adjust their metadata further. Rather than the traditional
hierarchical directory structure, views are dynamic; you can easily
refine or reorder a view.

View file

@ -0,0 +1,7 @@
I'm currently in the process of gutting old (some broken) git-annex's and cleaning out download directories from before I started using git-annex.
To do this, I am running `git annex import --clean-duplicates $PATH` on the directories I want to clear out, but sometimes this takes an unnecessarily long time.
For example, git-annex will calculate the digest for a huge file (30GB+) in $TARGET, even though there are no files in the annex of that size.
It's a common shortcut to check for duplicate sizes first to eliminate definite non-matches really quickly. Can this be added to git-annex's `import` in some way or is this a no-go due to the constant memory constraint?

View file

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="24.159.78.125"
subject="interesting idea"
date="2014-07-30T15:03:46Z"
content="""
This could be done in constant space using a bloom filter of known file sizes. Files with wrong sizes would sometimes match, but no problem, it would then just do the work it does now.
However, to build such a filter, git-annex would need to do a scan of all keys it knows about. This would take approximately as long to run as `git annex unused` does. It might make sense to only build the filter if it runs into a fairly large file. Alternatively, a bloom filter of file sizes could be cached and updated on the fly as things change (but this gets pretty complex).
"""]]

View file

@ -0,0 +1,21 @@
[[!comment format=mdwn
username="Xyem"
ip="81.111.193.130"
subject="comment 2"
date="2014-08-01T09:05:45Z"
content="""
Could be tested out with an additional flag `--with-size-bloom` on import?
It would then build a bloom (and use a cached one with --fast) and do the usual import.
So I could do this:
    # Bloom is created and the import is done using it
    git annex import --clean-duplicates --with-size-bloom $TARGET

    # Previously created bloom is used
    git annex import --clean-duplicates --with-size-bloom --fast $TARGET2
    git annex import --clean-duplicates --with-size-bloom --fast $TARGET3
I can implement this behaviour in Perl with Bloom::Filter and let you know how it performs if that would be useful to you..?
"""]]

View file

@ -0,0 +1,7 @@
I have data that has accompanying parity files. This is supposed to add some
protection for file integrity; however, it only works as long as the files are
available unencrypted. In case of encrypted special remotes the existing parity files
won't be of any use if the encrypted versions of files get corrupted in the remote location.
Would it be worthwhile for git-annex to generate its own
parity files for the encrypted data in encrypted special remotes?

View file

@ -0,0 +1,3 @@
The commit messages made by git-annex are quite spartan, especially in direct mode, where one cannot enter one's own commit messages. This means that all the messages say is "branch created", "git-annex automatic sync", "update", "merging", or little more.
It would be nice if git-annex could add at least the name of the repository/remote to the commit message. This would make the log a lot more clear, especially when dealing with problems or bugs.