Merge branch 'master' of ssh://git-annex.branchable.com

Joey Hess 2021-01-29 15:39:02 -04:00
commit a7eff71cff
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
7 changed files with 49 additions and 20 deletions

@@ -0,0 +1,11 @@
[[!comment format=mdwn
username="seanl@fe5df935169a5440a52bdbfc5fece85cdd002d68"
nickname="seanl"
avatar="http://cdn.libravatar.org/avatar/082ffc523e71e18c45395e6115b3b373"
subject="Thanks! And yes, no data loss."
date="2021-01-29T06:55:42Z"
content="""
I should have realized it wasn't a fundamental issue, because I was using it this way a couple years back without problems. Indeed there was no data loss; I just checked the files out from the commit before the merge.

Thanks for fixing it so quickly!
"""]]

@@ -0,0 +1,9 @@
Bup, ddar, and borg all use content-sensitive chunking, but git and git-annex do not. Git uses delta compression with heuristics for finding similar files, while in git-annex, which stores each key as a single whole file, deduplication beyond keeping one copy of identical content is only possible in special remotes. Anyway, I'm posting this here not because I think git-annex should directly support content-sensitive chunking, but because it's an idea I can't get out of my head, and I think there are likely to be people in this group who are interested in it.

With content-sensitive chunking, there's always a trade-off between chunk size and deduplication. Bigger chunks mean more duplication, while smaller chunks mean more overhead for storing the chunks themselves. Even if chunks are concatenated into larger files to minimize filesystem overhead, you still have to store a collision-resistant hash of each chunk. My idea is this: because you have the chunks available, you can always recompute the hash if you need to. So instead of indexing on the full hash, index on only part of it, and identify the chunks themselves by compact base-128 integer IDs.

Given that you're only ever looking up the hash of a chunk in order to decide whether you need to store a new chunk, one possible approach is to use a trie holding only enough bytes of each hash to make it unique in a given repository. Then, any time there's a false collision, add enough additional bytes to make it unique in the presence of the new chunk. Use a cache to avoid repeated hashing of commonly matched chunks. (A sketch of this is at the end of this comment.)

One problem with this technique is that the per-chunk overhead increases with the number of chunks, but it grows logarithmically, so it grows slowly, and for many types of content the savings from increased deduplication should grow similarly. For a trie, the average number of hash bytes stored would be roughly the base-256 log of the number of chunks: 2 bytes for up to 65536 chunks, 3 for up to 16.8 million chunks, 4 for up to 4 billion, and so on. The space needed for the chunk IDs grows the same way, but an ID is stored each time a chunk is referenced, not just once per chunk.

It occurs to me that I should propose this to the Datashards folks :)
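
Here's a rough Python sketch of that lookup rule, just to make it concrete. The names (ChunkIndex, Node) are made up, it assumes SHA-256 chunk hashes, sequential integer chunk IDs, and an in-memory list standing in for real chunk storage, and it leaves out the cache for commonly matched chunks.

    import hashlib

    class Node:
        # Trie node keyed on successive hash bytes; a leaf carries a chunk ID.
        def __init__(self):
            self.children = {}   # next hash byte -> Node
            self.chunk_id = None

    class ChunkIndex:
        def __init__(self):
            self.root = Node()
            self.chunks = []     # chunk ID -> chunk bytes (stand-in for real storage)

        def add(self, data):
            # Return the chunk's integer ID, storing the chunk only if it is new.
            h = hashlib.sha256(data).digest()
            node, depth = self.root, 0
            while h[depth] in node.children:
                node = node.children[h[depth]]
                depth += 1
                if node.chunk_id is not None:
                    # A stored chunk shares this prefix; rehash it to compare.
                    stored = hashlib.sha256(self.chunks[node.chunk_id]).digest()
                    if stored == h:
                        return node.chunk_id   # true duplicate, reuse its ID
                    # False collision: push the stored ID deeper until the two
                    # hashes diverge, then hang the new chunk beside it.
                    old_id, node.chunk_id = node.chunk_id, None
                    while stored[depth] == h[depth]:
                        child = Node()
                        node.children[h[depth]] = child
                        node, depth = child, depth + 1
                    old_leaf = Node()
                    old_leaf.chunk_id = old_id
                    node.children[stored[depth]] = old_leaf
                    break
            # Store the new chunk one byte past the deepest shared prefix.
            self.chunks.append(data)
            leaf = Node()
            leaf.chunk_id = len(self.chunks) - 1
            node.children[h[depth]] = leaf
            return leaf.chunk_id

Adding the same content twice returns the same ID, and the second call only costs one rehash of the already-stored chunk; the full hash of a chunk is never stored anywhere.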

@@ -0,0 +1,11 @@
[[!comment format=mdwn
username="seanl@fe5df935169a5440a52bdbfc5fece85cdd002d68"
nickname="seanl"
avatar="http://cdn.libravatar.org/avatar/082ffc523e71e18c45395e6115b3b373"
subject="Different use case from backend"
date="2021-01-29T17:35:43Z"
content="""
One of the requirements of the backend is that a collision means the content is identical, so collisions are trivial to handle: it doesn't matter which copy you keep. For dealing with \"near duplicates\", I'd suggest adding a field with `git annex metadata -s lsh=$(lsh $filename)` or something like that. The metadata is attached to the file's content rather than its name, which ensures that the LSH has to be recomputed when the content changes and never gets computed more than once for identical content.

I think the main drawback of this method is that it's a little more complicated to print metadata en masse than it is to print the key, because `git annex find` doesn't support printing metadata fields. It's certainly possible to construct a command to do it; it's just a little more involved than the commands for finding duplicates.
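
Something along these lines would do it (a rough, untested sketch, where `lsh` is the same hypothetical hashing helper as above, the function names are made up, and the only git-annex commands used are `git annex find --format`, `git annex metadata -s`, and `git annex metadata --get`):

    import subprocess

    def run(*args):
        # Run a command and return its stdout as text.
        return subprocess.run(args, check=True, capture_output=True,
                              text=True).stdout

    def annexed_files():
        # Lists annexed files whose content is present, which is what we
        # need, since the LSH is computed from the content itself.
        out = run("git", "annex", "find", "--format=${file}\\n")
        return [f for f in out.splitlines() if f]

    def tag(path):
        # Attach the LSH to the key (the content), not the file name.
        lsh = run("lsh", path).strip()
        run("git", "annex", "metadata", "-s", "lsh=" + lsh, path)

    def stored_lsh(path):
        return run("git", "annex", "metadata", "--get", "lsh", path).strip()

    if __name__ == "__main__":
        files = annexed_files()
        for path in files:
            tag(path)
        # Print lsh<TAB>file sorted, so near duplicates land on adjacent lines.
        for lsh, path in sorted((stored_lsh(p), p) for p in files):
            print(lsh + "\t" + path)

A real version would skip files that already carry the field and could probably use `git annex metadata --batch` instead of one process per file, but this shows the shape of it.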
"""]]

@@ -0,0 +1,9 @@
[[!comment format=mdwn
username="seanl@fe5df935169a5440a52bdbfc5fece85cdd002d68"
nickname="seanl"
avatar="http://cdn.libravatar.org/avatar/082ffc523e71e18c45395e6115b3b373"
subject="I use git-annex with FAT32"
date="2021-01-29T17:10:01Z"
content="""
I'm using git-annex for precisely this use case. I manage files on my Sansa Clip Zip, Odroid Go, and Pi1541 with it. I use a directory special remote for each device pointing at its mount point, with `remote.<name>.annex-ignore-command` set to `lsblk -no uuid,mountpoint | egrep -qx '<uuid>\s+<mountpoint>'` to make sure git-annex only tries to touch the remote when it's mounted. I just rename any files in the repo that have invalid names.
"""]]

@@ -0,0 +1,9 @@
[[!comment format=mdwn
username="chocolate.camera@ec2ecab153906be21ac5f36652c33786ad0e0b60"
nickname="chocolate.camera"
avatar="http://cdn.libravatar.org/avatar/4f00dfc3ad590ef7492788b854ceba78"
subject="comment 6"
date="2021-01-29T18:23:04Z"
content="""
@seanl What repo version are those? What about “if I understood correctly, v7 repos are required for being able to hide missing files, but at the same time, on v7 and FAT32 filesystems, files that are present take up double their file size”?
"""]]

@@ -1,12 +0,0 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawmTNrhkVQ26GBLaLD5-zNuEiR8syTj4mI8"
nickname="Juan"
subject="comment 10"
date="2013-08-31T18:20:58Z"
content="""
I'm already spreading the word. Handling scientific papers, data, simulations and code has been quite a challenge during my academic career. While code was solved long ago, the first three items remained a huge problem.
I'm sure many of my colleagues will be happy to use it.
Is there any hashtag or twitter account? I've seen that you collected some of my tweets, but I don't know how you did it. Did you search for git-annex?
Best,
Juan
"""]]

@@ -1,8 +0,0 @@
[[!comment format=mdwn
username="rashi.k0306@46b3566bf776802cd7adb39707f018beec8b0f26"
nickname="rashi.k0306"
subject="Duplicate Files Deleter"
date="2016-01-14T11:42:10Z"
content="""
I use Duplicate Files Deleter as it is very effective. It is 100% accurate and performs the scan quickly.
"""]]