Merge branch 'master' into v8

commit 029c883713
456 changed files with 6341 additions and 1085 deletions
@@ -1 +1,3 @@
Since there is no generic 'fuse' mode, I would like to request a `--get` (or `--auto-get`) option for diffdriver. I am trying to compare files across two branches of a repo I just cloned. I cannot download all the files, and downloading differing keys across branches for the same file is a bit painful. So I felt it would be super nice if git-annex could auto-get those files from somewhere (well -- the original clone).

[[!tag confirmed]]
@@ -15,3 +15,6 @@ Apologies for the brevity, I've already typed this out once..
    git annex import --mode=Ns $src # (just creates symlinks for new)
    git annex import --mode=Nsd $src # (invalid mode due to data loss)
    git annex import --mode=Nid $src # (invalid or require --force)

> Current thinking is in [[remove_legacy_import_directory_interface]].

> This old todo is redundant, so [[wontfix|done]] --[[Joey]]
@@ -19,3 +19,5 @@ There are other situations this is useful (and I use), for example, when I conve
    git annex metadata --parentchild original.svg compressed.png

and this would set 'parent' and 'child' metadata respectively.

[[!tag needsthought]]
@@ -13,4 +13,5 @@ You may ask why it is useful? I have several usecases:
Does git-annex provide such functionality? If not, do you think it could be implementable?

Thanks!

[[!tag unlikely]]
@@ -0,0 +1 @@
I've implemented true resumable upload in git-annex-remote-googledrive, which means that uploads can, just like downloads, be resumed at any point, even within one chunk. However, it currently does not work with encrypted files (or chunks) due to the non-deterministic nature of GPG. In order to make this feature usable on encrypted files, I propose to not overwrite encrypted files which are already present inside the `tmp` directory.
@@ -0,0 +1,53 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2020-02-17T15:33:31Z"
 content="""
@lykos, what happens when git-annex-remote-googledrive tries to resume in this situation and git-annex has written a different tmp file than what it partially uploaded before?

I imagine it might resume after the last byte it sent before, and so the uploaded file gets corrupted?

If so, there are two hard problems with this idea:

1. If git-annex changes to reuse the same tmp file, then git-annex-remote-googledrive will work with the new git-annex, but corrupt files when used with an old git-annex.
2. If someone has two clones, and starts an upload in one, but it's interrupted and then started later in the second clone, it would again corrupt the file that gets uploaded. (This would also happen, with a single clone, if git-annex unused gets used in between upload attempts, and cleans up the tmp file.)

The first could be dealt with by some protocol flag, but the second seems rather intractable, if git-annex-remote-googledrive behaves as I hypothesize it might. And even if git-annex-remote-googledrive somehow behaves better than that, it's certainly likely that some other remote would behave that way at some point.

----

As to implementation details, I started investigating before thinking about the above problem, so am leaving some notes here:

This would first require that the tmp file is written atomically, otherwise an interruption in the wrong place would resume with a partial file. (File size can't be used since gpg changes the file size with compression etc.) Seems easy to implement: Make Remote.Helper.Special.fileStorer write to a different tmp file and rename it into place.

Internally, git-annex pipes the content from gpg, so it is only written to a temp file when using a remote that operates on files, as the external remotes do. Some builtin remotes don't. So resuming an upload to an encrypted remote past the chunk level can't work in general.

There would need to be some way for the code that encrypts chunks (or whole objects) to detect that it's being used with a remote that operates on files, and then check if the tmp file already exists, and avoid re-writing it. This would need some way to examine a `Storer` and tell if it operates on files, which is not currently possible, so would need some change to the data type.
"""]]
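The "write to a different tmp file and rename it into place" fix described above is a standard atomic-write pattern. Here is an illustrative Python sketch of that pattern (not git-annex code; the helper name is made up), showing why readers never observe a partial file:

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path atomically: stage the bytes in a sibling temp
    file in the same directory, fsync, then rename into place.  A reader
    (or a resuming uploader) either sees the old file or the complete new
    one, never a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".atomic-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # rename is atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)  # clean up the staging file on any failure
        raise
```

Staging in the same directory matters: `os.replace` is only atomic when source and destination are on the same filesystem.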
@@ -28,3 +28,5 @@ This problem comes up surprisingly often due to:
5. Some repos being too large for a machine (e.g., repacking fails due to low memory), but which can still act like a dumb file-store.

The problem gets worse when you have a lot of remotes or a lot of repos to manage (I have both). My impression is that this feature would require a syntax addition for git-annex-sync only. I like '!' because it behaves the same in GNU find and sh.

[[!tag needsthought]]
@@ -0,0 +1,8 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 4"""
 date="2020-01-30T19:13:25Z"
 content="""
git-annex sync does support remote groups, so that might also help with this use case without needing additional syntax?
"""]]
@@ -1,3 +1,5 @@
Would it be hard to support MD5E keys that omit the -sSIZE part, the way this is allowed for URL keys? I have a use case where I have the MD5 hashes and filenames of files stored in the cloud, but not their sizes, and want to construct keys for these files to use with setpresentkey and registerurl. I could construct URL keys, but then I lose the error-checking and have to set annex.security.allow-unverified-downloads. Or maybe, extend URL keys to permit an -hMD5 hash to be part of the key?

Another (and more generally useful) solution would be [[todo/alternate_keys_for_same_content/]]. Then one could start with a URL-based key but then attach an MD5 to it as metadata, and have the key treated as a checksum-containing key, without needing to migrate the contents to a new key.

[[!tag moreinfo]]
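For context, a normal MD5E key encodes both the file size and the extension in its name, and the proposal above is to allow the `-s<size>` field to be omitted. A small Python sketch of constructing the standard form:

```python
import hashlib
import os

def md5e_key(path):
    """Construct a git-annex MD5E key name for a local file, in the
    standard form MD5E-s<size>--<md5hex><extension>.  (The todo above
    asks for a variant where the -s<size> field can be left out when
    the size is unknown.)"""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    ext = os.path.splitext(path)[1]  # includes the leading dot, may be ""
    return f"MD5E-s{size}--{md5}{ext}"
```

A key built this way can then be fed to plumbing such as `git annex registerurl` and `git annex setpresentkey`, as described in the post.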
@ -24,3 +24,4 @@ git-annex version: 6.20180913+git33-g2cd5a723f-1~ndall+1

[[!meta author=yoh]]
[[!tag projects/datalad]]
[[!tag moreinfo unlikely]]
@ -12,3 +12,4 @@ If needed example, here is http://datasets.datalad.org/allen-brain-observatory/v

[[!meta author=yoh]]
[[!tag projects/dandi]]
[[!tag needsthought]]
@@ -1,3 +1,5 @@
S3 lets you [redirect](https://docs.aws.amazon.com/AmazonS3/latest/dev/how-to-page-redirect.html) requests for an object to another object, or to a URL. This could be used to export a git branch, in the manner of [[`git-annex-export`|git-annex-export]], but with annexed objects redirecting to a key-value S3 remote in the same bucket.

Related: [[todo/simpler__44___trusted_export_remotes]]; [[forum/Using_hashdirlower_layout_for_S3_special_remote]].

[[!tag needsthought unlikely]]
@@ -1,29 +0,0 @@
I tried to use the S3 special remote to access DigitalOcean's Spaces API. [Their docs](https://developers.digitalocean.com/documentation/spaces/) suggest that this should be possible. However, it doesn't work.

The command I ran, with key removed:

    git annex --debug initremote xamsi-everything type=S3 protocol=https host=sfo2.digitaloceanspaces.com datacenter=sfo2 chunk=64MiB encryption=hybrid keyid=XXX

The non-debug output, in full, with key removed:

    initremote xamsi-everything (encryption setup) (to gpg keys: XXX) (checking bucket...) (creating bucket in sfo2...)
    git-annex: XmlException {xmlErrorMessage = "Missing error Message"}
    failed
    git-annex: initremote: 1 failed

The debug output of the part that breaks, again with key material removed:

    (creating bucket in sfo2...) [2019-10-15 08:40:41.119524792] String to sign: "PUT\n\n\nTue, 15 Oct 2019 15:40:41 GMT\n/xamsi-everything-a36e2044-07ac-4d85-8450-e5760c897a9b/"
    [2019-10-15 08:40:41.119586065] Host: "xamsi-everything-a36e2044-07ac-4d85-8450-e5760c897a9b.sfo2.digitaloceanspaces.com"
    [2019-10-15 08:40:41.119639648] Path: "/"
    [2019-10-15 08:40:41.119683721] Query string: ""
    [2019-10-15 08:40:41.11972899] Header: [("Date","Tue, 15 Oct 2019 15:40:41 GMT"),("Authorization","AWS XXX")]
    [2019-10-15 08:40:41.119846915] Body: "<?xml version=\"1.0\" encoding=\"UTF-8\"?><CreateBucketConfiguration xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\"><LocationConstraint>sfo2</LocationConstraint></CreateBucketConfiguration>"
    [2019-10-15 08:40:41.174450718] Response status: Status {statusCode = 403, statusMessage = "Forbidden"}
    [2019-10-15 08:40:41.174566002] Response header 'Content-Length': '190'
    [2019-10-15 08:40:41.174627301] Response header 'x-amz-request-id': 'tx0000000000001c5b175eb-005da5e879-23e283-sfo2a'
    [2019-10-15 08:40:41.174685597] Response header 'Accept-Ranges': 'bytes'
    [2019-10-15 08:40:41.174730858] Response header 'Content-Type': 'application/xml'
    [2019-10-15 08:40:41.174776256] Response header 'Date': 'Tue, 15 Oct 2019 15:40:41 GMT'
    [2019-10-15 08:40:41.174821726] Response header 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains; preload'
    [2019-10-15 08:40:41.174984394] Response metadata: S3: request ID=tx0000000000001c5b175eb-005da5e879-23e283-sfo2a, x-amz-id-2=<none>
@@ -63,3 +63,5 @@ Thankfully, we already have a technology that can fill in elegantly here: parity

This would also enhance the data-checking capabilities of git-annex, as data loss could be fixed and new parity files generated from the recovered files transparently, self-healing the archive.

[[!tag unlikely]]
@@ -4,3 +4,5 @@ I have a bunch of files I want to track with `git-annex` that are sitting in an
    git annex import --to=s3-remote /mnt/usb-drive/myfiles

The proposed `--to=remote` option would add the files to my repo as `import` normally does, but it wouldn't ever keep the content in the repo; the only copy would now sit in `s3-remote`. As little disk space as possible would be used temporarily in `~/my-laptop-repo`. Perhaps the easiest option would be to import a file normally, but then immediately do a `move` to `s3-remote`? Ideally, though, larger files would be streamed directly from `/mnt/usb-drive/myfiles` to `s3-remote` without ever being staged at `~/my-laptop-repo`.

[[!tag unlikely needsthought]]
@@ -12,3 +12,5 @@ I often transfer files via mediums that have transfer limits, but I am eventuall

Currently, I've been using tricks to select a subset of the files, such as a range of file-sizes.

[[!tag needsthought]]
@@ -21,3 +21,5 @@ repeatedly (though ssh connection caching helps some with that).
> exposes this, when available. Some sftp servers can be locked down
> so that the user can't run git-annex on them, so that could be the only
> way to get diskreserve working for such a remote. --[[Joey]]

[[!tag confirmed]]
@@ -1 +1,3 @@
To [[git-annex-test]] and [[git-annex-testremote]], add an option to run tests under concurrency (-J). Many possible bugs are unique to the concurrent case, and it's the case I often use. While any bugs detected may be hard to reproduce, it's important to know _whether_ there are concurrency-related bugs. Much of the trust in git-annex comes from its extensive test suite, but it's somewhat concerning to trust it with important data when the concurrency case is not tested at all.

[[!tag unlikely]]
@@ -1 +1,3 @@
From https://cyan4973.github.io/xxHash/ , xxHash seems much faster than md5 with comparable quality. There's a Haskell implementation.

[[!tag moreinfo]]
@@ -0,0 +1,8 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2020-01-06T19:30:43Z"
 content="""
I looked at xxHash recently. I can't seem to find benchmarks of it compared with other fast hashes like Blake2.
"""]]
@@ -0,0 +1,8 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 2"""
 date="2020-01-09T20:52:25Z"
 content="""
Let alone blake3, which is 5-6 times as fast as blake2 while still apparently being a cryptographically secure hash.
"""]]
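xxHash and blake3 aren't in the Python standard library, but hashlib's blake2b is, and a quick throughput probe gives a rough feel for how the hashes named in this thread compare on one machine. This is an illustrative sketch, not a real benchmark (no warmup, wall-clock timing, single run):

```python
import hashlib
import timeit

def mbps(name, data, number=5):
    """Rough single-threaded throughput for a hashlib algorithm, in MB/s:
    hash the same buffer `number` times and divide bytes by elapsed time."""
    algo = getattr(hashlib, name)
    elapsed = timeit.timeit(lambda: algo(data).digest(), number=number)
    return len(data) * number / elapsed / 1e6

payload = b"\x00" * (8 * 1024 * 1024)  # 8 MiB buffer
for name in ("md5", "sha256", "blake2b"):
    print(f"{name:8s} {mbps(name, payload):10.0f} MB/s")
```

Relative numbers vary a lot by CPU (and by whether the build uses SIMD), which is presumably why published benchmarks are hard to compare.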
@@ -0,0 +1,83 @@
[[!comment format=mdwn
 username="yarikoptic"
 avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
 subject="what am I doing wrong?"
 date="2020-01-13T20:05:38Z"
 content="""
I have tried to use this but I do not see it in effect:

[[!format sh \"\"\"
$> mkdir repo && cd repo && git init && git annex init && git annex config --set addunlocked anything && git show git-annex:config.log && touch 1 2 && git add 1 && git annex add 2 && git commit -m 'committing' && ls -l && git show
Initialized empty Git repository in /tmp/repo/.git/
init (scanning for unlocked files...)
ok
(recording state in git...)
addunlocked anything ok
(recording state in git...)
1578945668.466039639s addunlocked anything
add 2
ok
(recording state in git...)
[master (root-commit) e428211] committing
 2 files changed, 1 insertion(+)
 create mode 100644 1
 create mode 120000 2
total 4
-rw------- 1 yoh yoh   0 Jan 13 15:01 1
lrwxrwxrwx 1 yoh yoh 178 Jan 13 15:01 2 -> .git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
commit e428211fe0c64e67cf45d8c92165c866db5ba75f (HEAD -> master)
Author: Yaroslav Halchenko <debian@onerussian.com>
Date:   Mon Jan 13 15:01:08 2020 -0500

    committing

diff --git a/1 b/1
new file mode 100644
index 0000000..e69de29
diff --git a/2 b/2
new file mode 120000
index 0000000..ea46194
--- /dev/null
+++ b/2
@@ -0,0 +1 @@
+.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
\"\"\"]]

so I have tried to say that \"anything\" (all files) should be added unlocked. But it seems that neither file (`1` added via `git add` nor `2` added via `git annex add`) was added unlocked.

<details>
<summary>Here is some info on version/config: (click to expand)</summary>

[[!format sh \"\"\"
(git-annex)lena:/tmp/repo[master]
$> cat .git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
[annex]
	uuid = f220cc03-1510-4e23-acb5-b95723ecf9fc
	version = 7
[filter \"annex\"]
	smudge = git-annex smudge -- %f
	clean = git-annex smudge --clean -- %f
(dev3) 1 17256.....................................:Mon 13 Jan 2020 03:03:30 PM EST:.
(git-annex)lena:/tmp/repo[master]
$> git annex version
git-annex version: 7.20191230+git2-g2b9172e98-1~ndall+1
build flags: Assistant Webapp Pairing S3 WebDAV Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite
dependency versions: aws-0.20 bloomfilter-2.0.1.0 cryptonite-0.25 DAV-1.3.3 feed-1.0.1.0 ghc-8.6.5 http-client-0.5.14 persistent-sqlite-2.9.3 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs hook external
operating system: linux x86_64
supported repository versions: 7
upgrade supported from repository versions: 0 1 2 3 4 5 6
local repository version: 7
\"\"\"]]

</details>
"""]]
@@ -0,0 +1,8 @@
[[!comment format=mdwn
 username="kyle"
 avatar="http://cdn.libravatar.org/avatar/7d6e85cde1422ad60607c87fa87c63f3"
 subject="re: what am I doing wrong?"
 date="2020-01-14T03:19:19Z"
 content="""
I believe that should be `git annex config --set annex.addunlocked anything` (i.e. an \"annex.\" in front of the name).
"""]]
@@ -8,7 +8,7 @@ When an external special remote tells git-annex a fuller URL for a given file, g

It would be better if, in the above log, the URL key was based on dx://file-FJZjVx001pB2BQPVKY4zX8kk/A4.assembly1-trinity.fasta , which would preserve the .fasta extension in the key and therefore in the symlink target.

-> [fixed|done]] --[[Joey]]
+> [[fixed|done]] --[[Joey]]

Also, it would be good if the external special remote could return an etag
for the URL, which would be a value guaranteed to change if the URL's
@@ -9,3 +9,4 @@ Also, sometimes one can determine the MD5 from the URL without downloading the f
or because an MD5 was computed by a workflow manager that produced the file (Cromwell does this). The special remote's "CHECKURL" implementation could record an MD5E key in the alt_keys metadata field of the URL key. Then 'addurl --fast' could check alt_keys, and store in git an MD5E key rather than a URL key, if available.

[[!tag unlikely]]
@@ -0,0 +1,14 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2020-01-30T18:36:17Z"
 content="""
This would mean that, every time something about a key is looked up in the git-annex branch, it would also need to look at the metadata to see if this `alt_keys` field is set.

So it doubles the time of every single query of the git-annex branch.

I don't think that's a good idea; querying the git-annex branch is already often a bottleneck to commands.
"""]]
@@ -0,0 +1,8 @@
[[!comment format=mdwn
 username="Ilya_Shlyakhter"
 avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
 subject="alternate keys"
 date="2020-01-31T19:23:35Z"
 content="""
\"every time something about a key is looked up in the git-annex branch, it would also need to look at the metadata to see if this alt_keys field is set\" -- not every time, just when checking if the key is checksum-based, and if content matches the checksum. Also, isn't metadata [[cached in a database|design/caching_database]]?
"""]]
@@ -0,0 +1,15 @@
[[!comment format=mdwn
 username="https://christian.amsuess.com/chrysn"
 nickname="chrysn"
 avatar="http://christian.amsuess.com/avatar/c6c0d57d63ac88f3541522c4b21198c3c7169a665a2f2d733b4f78670322ffdc"
 subject="Re: comment 1 "
 date="2020-01-31T19:47:59Z"
 content="""
The proposed implementation may be inefficient, but the idea has merit.

What if that information is stored in a place where it can be used to verify migrations?

For example, when entering that the migrating remote dropped the data into `git-annex:aaa/bbb/SHA1-s1234--somehash.log`, somewhere near there a record could be added that this was migrated to SHA512-s1234--longerhash. When then all the other remotes are asked to drop that file, they can actually do that, because they see that it has been migrated, can verify the migration, and are free to drop the file.

Even later, when a remote wants to get an old name (eg. because it checked out an old version of master), it can look up the key, find where it was migrated to, and make the data available under its own name (by copying, or maybe by placing a symlink pointing from `.git/annex/objects/Aa/Bb/SHA1-s1234--somehash/SHA1-s1234--somehash` to the new one).
"""]]
@@ -0,0 +1,8 @@
[[!comment format=mdwn
 username="Ilya_Shlyakhter"
 avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
 subject="comment 4"
 date="2020-01-31T20:32:00Z"
 content="""
\"can be used to verify migrations\" -- my hope was to *avoid* migrations, i.e. to get the benefit you'd get from migrating to a checksum-based key, without doing the migration.
"""]]
@@ -0,0 +1,12 @@
[[!comment format=mdwn
 username="Ilya_Shlyakhter"
 avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
 subject="simpler proposal"
 date="2020-01-31T21:46:57Z"
 content="""
So, to fully and properly implement what the title of this todo suggests -- \"alternate keys for same content\" -- might be hard. But simply enabling the addition of checksums to WORM/URL keys, stored separately on the git-annex branch rather than encoded in the key's name, is simpler. This would let some WORM/URL keys be treated as checksum-based keys when getting contents from untrusted remotes or when checking integrity with `git-annex-fsck`. But this isn't really \"alternate keys for same content\": the content would be stored under only the WORM/URL key under which it was initially recorded. The corresponding MD5 key would not be recorded in [[location_tracking]] as present.

Checking whether a WORM/URL key has an associated checksum could be sped up by keeping a Bloom filter representing the set of WORM/URL keys for which `alt_keys` is set.

In the `addurl --fast` case for special remotes, where the remote can determine a file's checksum without downloading, a checksum-based key would be recorded to begin with, as happens with `addurl` without `--fast`. Currently I do this by manually calling plumbing commands like `git-annex-setpresentkey`, but having `addurl` do it seems better.
"""]]
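The Bloom-filter speedup suggested above fits in a few lines. A minimal illustrative Python sketch (parameters are hypothetical, not tuned for any real key population): a negative answer is definite, so the expensive `alt_keys` metadata lookup can be skipped for most keys.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: set membership with a small false-positive
    rate but no false negatives.  'key not in filter' means the key
    definitely has no alt_keys entry, so no metadata lookup is needed."""

    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for i in range(self.k):
            h = hashlib.sha256(b"%d:%s" % (i, item.encode())).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))
```

With 2^20 bits and k=4 the false-positive rate stays tiny until the filter holds tens of thousands of keys; a real implementation would size it from the number of WORM/URL keys in the branch.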
@@ -0,0 +1,19 @@
[[!comment format=mdwn
 username="Chel"
 avatar="http://cdn.libravatar.org/avatar/a42feb5169f70b3edf7f7611f7e3640c"
 subject="comment 6"
 date="2020-02-01T02:32:01Z"
 content="""
There is also `aaa/bbb/*.log.cid` in the git-annex branch for \"per-remote content identifiers for keys\". It could be another place to store alternate keys, but it is per-remote, so... no.

As for the metadata field `alt_keys` — it is another case of \"[setting a metadata field to a key](/todo/Bidirectional_metadata/#comment-788380998b25267c5b99c4a865277102)\" in [[Bidirectional metadata]].

Also, there is an interesting idea of [[git-annex-migrate using git-replace]].

By the way, as far as I know (maybe things have changed since then), ipfs has a similar problem of different identifiers for the same content, because it encodes how things are stored, and hash functions can also be changed.
"""]]
@@ -0,0 +1,10 @@
[[!comment format=mdwn
 username="Ilya_Shlyakhter"
 avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
 subject="potential security issues?"
 date="2020-02-06T21:00:55Z"
 content="""
I wonder if storing checksums in a general-purpose mutable metadata field may cause security issues. Someone could use the [[`git-annex-metadata`|git-annex-metadata]] command to overwrite the checksum. It should be stored in a read-only field written only by `git-annex` itself, like the `field-lastchanged` metadata already is.

Of course, if someone is able to write to the [[git-annex branch|internals#The_git-annex_branch]] directly, or get the user to pull merges to it, they could alter the checksum stored there. Maybe only trust stored checksums if `merge.verifySignatures=true`?
"""]]
@@ -9,3 +9,5 @@ would effectively build up a file match expression. So it might then follow
that the git config should also be a file match expression, with "true"
being the same as "anything" and "false" the same as "nothing" for
back-compat. --[[Joey]]

> This got accomplished by other means, [[done]] --[[Joey]]
@@ -13,3 +13,5 @@ need a git hook run before checkout to rescue such files.
Also some parts of git-annex's code, including `withObjectLoc`, assume
that the .annex/objects is present, and so it would need to be changed
to look at the work tree file. --[[Joey]]

[[!tag needsthought]]
@@ -2,3 +2,4 @@ ATM 'annex merge' does not accept any parameter to specify which remotes to cons

[[!meta author=yoh]]
[[!tag projects/datalad]]
[[!tag moreinfo]]
@@ -0,0 +1,12 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 4"""
 date="2020-01-29T15:10:09Z"
 content="""
Based on my last comment, I think, if you still need this, you should try configuring remote.name.fetch to avoid fetching the git-annex branches you don't want to merge.

If that's not sufficient, follow up and we can think about the other options I discussed earlier.
"""]]
@@ -11,3 +11,6 @@ autobuilder? --[[Joey]]

Currently running release builds for arm64 on my phone, but it's not practical to run an autobuilder there. --[[Joey]]

>> [[done]]; the current qemu-based autobuilder is not ideal, often gets stuck, but there's no point leaving this todo open. --[[Joey]]
@@ -3,3 +3,5 @@ I think it would be useful if the assistant (when monitoring a repo) could autom
If I then add each repo as a remote of the other (from the command-line), the assistant will still not sync files between the repos until I stop all the running assistants and then restart them. Presumably the assistant checks the list of remotes only on launch?

I think this is perhaps causing issues not just for command-line users but also for users who create multiple local remotes from the webapp and then combine them, since the webapp is perhaps not restarting the assistant daemons after the combine operation? I'm not sure about this…

[[!tag confirmed]]
@@ -9,3 +9,5 @@ This would involve:
* The assistant ought to update the adjusted branch at some point after
  downloads, but it's not clear when. Perhaps this will need to be deferred
  until it can be done more cheaply, so it can do it after every file.

[[!tag confirmed]]
@@ -14,3 +14,5 @@ At least for the built in special remotes (not external) this should be possible

[[!meta author=yoh]]
[[!tag projects/dandi]]

> [[done]] --[[Joey]]
@@ -0,0 +1,30 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2020-01-06T16:12:01Z"
 content="""
There's a subtle backwards compatibility issue here: The stored config of a special remote is used when enabling it, so if an older version of git-annex is used to enable a remote, there might be a setting that it does not know about, or a value it doesn't understand. If that caused it to fail to enable the remote, it wouldn't be possible to use it, at least w/o changing/removing the config.

For example, autoenable=true did not used to be a config setting, but older versions of git-annex can still use remotes that have that.

Another example is chunk=. While older versions of git-annex don't understand that, and so won't use chunks when storing/retrieving, the newer git-annex falls back to getting the unchunked object. So things stored by the old git-annex can be retrieved by the new, but not vice-versa.

Another example is S3's storageclass=. Older git-annex doesn't understand it, so uses the default storage class, but that behavior is interoperable with the new behavior.

So the stored config of a remote should not be checked every time the remote is instantiated, but only the new settings passed to initremote/enableremote. That will complicate the API, since currently the old and new config are combined together by enableremote.
"""]]
@@ -0,0 +1,37 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 2"""
 date="2020-01-07T17:59:35Z"
 content="""
I was thinking about implementing this today, but the shattered attack got in the way. Anyway, it seems like most of a plan:

* Make RemoteConfig contain Accepted or Proposed values. enableremote and initremote
  set Proposed values; Accepted values are anything read from git-annex:remote.log.
  (update: done)
* When a RemoteConfig value fails to parse, it may make sense to use a
  default instead when it's Accepted, and error out when it's Proposed. This could
  be used when parsing foo=yes/no to avoid treating foo=true the same as
  foo=no, which some things currently do
  (eg importtree, exporttree, embedcreds).
  (update: Done for most yes/no and true/false parsers, surely missed a
  few though (including autoenable).)
* Add a Remote method that returns a list of all RemoteConfig fields it
  uses. This is the one part I'm not sure about, because it violates DRY.
  It would be nicer to have a parser that can also generate a list of the
  fields it parses.
* Before calling Remote setup, see if there is any Proposed value in
  RemoteConfig whose field is not in the list. If so, error out.
* For external special remotes, add a LISTCONFIG message. The program
  responds with a list of all the fields it may want to later GETCONFIG.
  If the program responds with UNSUPPORTED-REQUEST, then it needs to be
  treated as if any and all fields are allowed.
* External special remotes are responsible for parsing the content of
  GETCONFIG, as they do now, and can error out if there's a problem.

Having a method return a list of fields will also allow implementing
<https://git-annex.branchable.com/todo/some_way_to_get_a_list_of_options_for_a_special_remote_of_a_given_type/>.
It may be worthwhile to add, along with the field name, a human-readable
description of its value.
"""]]
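The Accepted/Proposed distinction in the plan above can be illustrated with a small sketch (Python with hypothetical names; the real implementation is Haskell): a value typed by the user at initremote/enableremote time is Proposed and parsed strictly, while a value read back from remote.log is Accepted and falls back to a default for backwards compatibility.

```python
from enum import Enum

class Source(Enum):
    PROPOSED = "proposed"   # given on the initremote/enableremote command line
    ACCEPTED = "accepted"   # read back from git-annex:remote.log

def parse_yes_no(field, value, source):
    """Parse a yes/no style setting.  A bad Proposed value is a user
    error and is rejected; a bad Accepted value (perhaps written by a
    different version) falls back to a default instead of failing."""
    mapping = {"yes": True, "true": True, "no": False, "false": False}
    parsed = mapping.get(value.lower())
    if parsed is None:
        if source is Source.PROPOSED:
            raise ValueError(f"bad value for {field}={value!r} (expected yes or no)")
        return False  # lenient default for stored config
    return parsed
```

This is exactly the behavior the comment describes: strict up-front validation for new settings, leniency for stored ones, so an old remote.log never makes a remote impossible to enable.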
@@ -0,0 +1,11 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 3"""
 date="2020-01-15T17:52:07Z"
 content="""
Unknown fields will now result in an error message. And values like yes/no and true/false get parsed upfront.

External special remotes currently still accept all fields, so work still needs to be done to extend the protocol to list acceptable fields.
"""]]
@ -0,0 +1,13 @@
|
|||
[[!comment format=mdwn
|
||||
username="joey"
|
||||
subject="""comment 4"""
|
||||
date="2020-01-17T21:15:16Z"
|
||||
content="""
|
||||
Added LISTCONFIGS to external special remote protocol, and once your
|
||||
special remotes implement it, initremote will notice if the user provides
|
||||
any setting with the wrong name.
|
||||
|
||||
(external special remotes could already verify the values of settings using
|
||||
GETCONFIG at the INITREMOTE stage, and use INITREMOTE-FAILURE to inform the
|
||||
user of bad or missing values)
|
||||
"""]]
|
|
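For remote authors, here is a minimal sketch of answering the new message. It assumes the LISTCONFIGS reply shape (CONFIG lines followed by CONFIGEND) and invents the field names, so check the external special remote protocol page before relying on it:

```python
# Sketch of how an external special remote might answer LISTCONFIGS.
# Field names below are only examples.

CONFIGS = {
    "directory": "store files in this directory",
    "readonly": "yes to treat the remote as read-only",
}

def handle(line, out):
    # Minimal dispatcher for the one protocol message discussed above;
    # a real remote would handle many more messages.
    if line == "LISTCONFIGS":
        for name, desc in sorted(CONFIGS.items()):
            out.append("CONFIG %s %s" % (name, desc))
        out.append("CONFIGEND")
    else:
        out.append("UNSUPPORTED-REQUEST")
    return out
```

With this in place, initremote can reject any user-supplied setting whose name is not in the advertised list.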
@ -1,3 +1,5 @@
Can an option be added to unlock a file in such a way that the next time it gets committed, it is automatically re-locked? Or to just have this done for all unlocked files?

It's a common use case to just do one edit / re-generation of a locked file. If you forget to lock it (or a script that was supposed to lock it after modification fails in the middle), you end up with a permanently unlocked file, which can cause [[performance issues|bugs/git_status_extremely_slow_with_v7]] downstream, and also [[look odd when missing|todo/symlinks_for_not-present_unlocked_files]], lead to multiple copies when present (or risk [[annex.thin issues|bugs/annex.thin_can_cause_corrupt___40__not_just_missing__41___data]]), and leave the file open to inadvertent/unintended modification. Also, locking the file manually litters the git log with commits that don't really change repo contents.

[[!tag needsthought]]

@ -1 +1,3 @@
The current special remote protocol works on one file at a time. With some remotes, a batch operation can be more efficient, e.g. querying the status of many URLs in one API call. It would be good if special remotes could optionally implement batch versions of their operations, and these versions were used by batch-mode git-annex commands. Or maybe, keep the current set of commands but let the remote read multiple requests and then send multiple replies?

[[!tag moreinfo]]

@ -9,3 +9,5 @@ object in it.
This should be fixable by eg, catching all exceptions when running Annex
operations on a remote, adding its path to the message and rethrowing.
--[[Joey]]

[[!tag confirmed]]

@ -32,3 +32,5 @@ Two open questions:
objects over time. So leave the update up to the user to run the command
when they want it? But then the user may get confused, why did it
download files and they didn't appear?

[[!tag needsthought]]

@ -13,3 +13,5 @@ backups, and git-annex would then be aware of what was backed up in borg,
and could do things like count that as a copy.

--[[Joey]]

[[!tag needsthought]]

@ -3,3 +3,5 @@
Changing the default would also let one [[repeatedly re-import a directory while keeping original files in place|bugs/impossible__40____63____41___to_continuously_re-import_a_directory_while_keeping_original_files_in_place]].

I realize this would be a breaking change for some workflows; warning of it [[like git does|todo/warn_of_breaking_changes_same_way_git_does]] would mitigate the breakage.

[[!tag unlikely]]

@ -0,0 +1,7 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2020-01-30T17:09:00Z"
content="""
See [[todo/remove_legacy_import_directory_interface]].
"""]]

@ -6,3 +6,5 @@ Thanks in advance for considering

[[!meta author=yoh]]
[[!tag projects/datalad]]

> [[done]]

@ -0,0 +1,17 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2020-01-30T16:23:55Z"
content="""
Occurs to me that any git-annex command could result in an automatic
init, and since v7 is default, will enter an adjusted branch when on a
crippled filesystem.

I don't think it makes sense to add --progress to every single
git-annex command.

I suppose, if your code always runs git-annex init after clone, then it
would be good enough to have git-annex init be the only thing that
supports --progress. If something else needs it (maybe the view commands),
we can treat that separately.
"""]]

@ -0,0 +1,12 @@
[[!comment format=mdwn
username="joey"
subject="""comment 5"""
date="2020-01-30T16:38:30Z"
content="""
Heh, looking at the code, [[!commit
24838547e2475b37d7e910361f9b6e087a1a0648]] in 2018
made --progress be unconditionally passed when entering an adjusted branch.

That was done for unrelated reasons, but I don't think there's anything more
to do on this now.
"""]]

doc/todo/confirmed.mdwn (Normal file)

@ -0,0 +1,8 @@
This tag is for todo items that have an agreed upon plan of action, but
have not been implemented yet.

[[!inline pages="todo/* and !todo/*/* and !todo/done and !link(todo/done)
and link(todo/confirmed)
and !*/Discussion and !todo/moreinfo and !todo/confirmed
and !todo/needsthought and !todo/unlikely" show=0 feedlimit=10
archive=yes template=buglist]]

@ -33,3 +33,5 @@ be useful to speed up checks on larger files. The license is a

I know it might sound like a conflict of interest, but I *swear* I am
not bringing this up only as an oblique feline reference. ;) -- [[anarcat]]

> Let's concentrate on [[xxhash|todo/add_xxHash_backend]] or other new hashes that are getting general
> adoption, not niche hashes like meow. [[done]] --[[Joey]]

@ -0,0 +1,11 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2020-01-06T19:36:32Z"
content="""
xxhash seems to fill a similar niche and is getting a lot more use from
what I can see.

Meow seems to claim a faster gb/s rate than xxhash does, but
it's hard to tell if the benchmarks are really equivalent.
"""]]

@ -32,3 +32,5 @@ surprise users... I suggest using a logic similar to
[[git-annex-import]] for consistency reasons.

Thanks! -- [[anarcat]]

[[!tag unlikely]]

@ -1 +1,3 @@
If an external special remote is implemented as a Docker container, it can be safely autoenabled and run in a sandboxed way. So the distributor of a repo that has annex files fetchable with a given special remote, could have the docker tag for the special remote configured on the git-annex branch, and users could then clone and use the repo without needing to install anything.

[[!tag needsthought]]

@ -1 +1,3 @@
It would help to document, in one place, the external programs and libraries on which git-annex depends for various functionalities, including optional ones. Ones I know: curl, gpg, bup. But there are also references in places to lsof, rsync, nocache. For reliable packaging, it would be useful to have an authoritative list of dependencies and which functionality each supports.

[[!tag unlikely]]

@ -1,3 +1,5 @@
If a spec of the [[sqlite database schemas|todo/sqlite_database_improvements]] could be added to the [[internals]] docs, this would open some possibilities for third-party tools based on this info. E.g. one could write some sqlite3 queries to get aggregate info on the number (and total size?) of keys present in specific combinations of repos. It would of course be understood that this is internal info subject to frequent change.

Also, if [[Sometimes the databases are used for data that has not yet been committed to git|devblog/day_607__v8_is_done]], this would improve [[future_proofing]].

[[!tag needsthought unlikely]]

@ -0,0 +1,11 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2020-01-06T18:07:00Z"
content="""
There are not any situations where after losing the sqlite databases
git-annex can't recover the information that was stored in them by other
means. I know because the v8 upgrade deletes all the old sqlite databases
and then recovers the information by other means. So no future-proofing
impact here.
"""]]

@ -0,0 +1,26 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2020-01-06T18:16:13Z"
content="""
It's easy enough to dump the database and see its schema.

    joey@darkstar:~/lib/big>sqlite3 .git/annex/keys/db
    sqlite> .dump
    CREATE TABLE IF NOT EXISTS "associated"("id" INTEGER PRIMARY KEY,"key" VARCHAR NOT NULL,"file" VARCHAR NOT NULL,CONSTRAINT "key_file_index" UNIQUE ("key","file"),CONSTRAINT "file_key_index" UNIQUE ("file","key"));
    CREATE TABLE IF NOT EXISTS "content"("id" INTEGER PRIMARY KEY,"key" VARCHAR NOT NULL,"cache" VARCHAR NOT NULL,CONSTRAINT "key_cache_index" UNIQUE ("key","cache"));

Or the fully typed schema can be looked up in the haskell code
(Database/Keys/Sql.hs)

I think that how the information in the databases relates to the state of the
repository, and how it's updated from the git-annex branch etc is just as
important as the schema. For example, if you wanted to use this database to
query files using a key, you'd need to know this database only gets
populated for unlocked files not locked files. And that the database may not
reflect recent changes to the working tree, and there's a complicated process
that can be used to update it to reflect any recent changes.

That's rather deep into the implementation details to be documenting
outside the code.
"""]]

@ -0,0 +1,21 @@
[[!comment format=mdwn
username="https://christian.amsuess.com/chrysn"
nickname="chrysn"
avatar="http://christian.amsuess.com/avatar/c6c0d57d63ac88f3541522c4b21198c3c7169a665a2f2d733b4f78670322ffdc"
subject="Summary; Application: shared thumbnails"
date="2020-01-10T08:41:18Z"
content="""
There are two conflicting approaches to mtimes:

* Treat them as local artifacts

    This works great with Make, and generally with any software that works on \"is newer than\" properties.

* Treat them as preservation-worthy file attributes

    This is generally required by tools that compare time stamps by identity.

Both approaches break tools that expect the other, and no single out-of-the-box choice will make all users happy. Tools like metastore, a bespoke solution like etckeeper's generated mkdir/chmod file or a git-annex solution like [[storing the full mtime at genmetadata time|bugs/file_modification_time_should_be_stored_in_exactly_one_metadata_field/]] with a (local or repository-wide) option to set the mtime at annex-get time would be convenient.

One more application where this would be relevant is sharing generated thumbnails among clones of repositories (to eventually maybe even have them available when the full files are not present) following the [XDG specification on shared thumbnail repositories](https://specifications.freedesktop.org/thumbnail-spec/thumbnail-spec-latest.html#SHARED). Not only does that design rely on the mtimes of the thumbnail and the file to match, it even encodes the mtime again inside the thumbnail, practically requiring all checkouts to not only have consistent mtimes between thumbnails and files, but identical ones.
"""]]

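To make the thumbnail constraint concrete, here is a rough Python sketch of the shared-thumbnail lookup described in the XDG spec. The helper names are made up; only the md5-of-URI path scheme and the exact-mtime comparison are taken from the spec:

```python
# Sketch of why mtimes matter for XDG shared thumbnails: the cache path
# is derived from the file URI, and validity is checked by comparing the
# file's mtime against the Thumb::MTime value embedded in the PNG.

import hashlib
import os

def thumbnail_path(file_uri, size="normal"):
    # Per the XDG thumbnail spec: md5 of the URI, stored under
    # $XDG_CACHE_HOME/thumbnails/<size>/<md5>.png
    digest = hashlib.md5(file_uri.encode("utf-8")).hexdigest()
    cache = os.environ.get("XDG_CACHE_HOME",
                           os.path.expanduser("~/.cache"))
    return os.path.join(cache, "thumbnails", size, digest + ".png")

def thumbnail_valid(file_mtime, thumb_mtime_attr):
    # The spec compares exact mtimes, so a clone whose checkout has a
    # different mtime cannot reuse (or trust) the shared thumbnail.
    return int(file_mtime) == int(thumb_mtime_attr)
```

A one-second difference in checkout mtime is enough to invalidate every shared thumbnail, which is why "consistent" is not enough and the comment asks for identical mtimes.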
@ -1 +1,3 @@
Is it possible to add an option, for initremote/enableremote, to encrypt the credentials but not the contents? Then it would be possible to have an exporttree remote while using embedcreds. It would also be good if locally stored credentials could be stored in encrypted form, and decrypted for use as needed. I'm uneasy about keeping credentials accessible without a passphrase.

[[!tag confirmed]]

@ -7,3 +7,4 @@ store files under paths like s3://mybucket/randomstring/myfile ; the URL is "pub
If the URLs could be stored encrypted in the git-annex branch, one could track such files using the ordinary web remote. One could use an S3 export-tree
remote to share a directory with specific recipient(s), without them needing either AWS credentials or git-annex.

[[!tag unlikely moreinfo]]

@ -0,0 +1,12 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2020-01-30T18:25:58Z"
content="""
Is this about SETURLPRESENT in an external special remote, or is addurl
also supposed to encrypt an url? And how would addurl know if the user
wants to encrypt it, and using what gpg keys?

If your git-annex repo contains information about files you want to remain
private, why not just keep that repo private?
"""]]

@ -8,3 +8,5 @@ Perhaps: Find pairs of renames that swap content between two files.
Run each pair in turn. Then run the current rename code. Although this
still probably misses cases, where eg, content cycles among 3 files, and
the same content among 3 other files. Is there a general algorithm?

[[!tag needsthought]]

@ -3,3 +3,5 @@
@joey pointed out a potential problem: "needing to deal with the backend being missing or failing to work could have wide repurcussions in the code base." I wonder if there are ways around that. Suppose you specified a default backend to use in case a custom one was unavailable? Then you could always compute a key from a file, even if it's not in the right backend. And once a key is stored in git-annex, most of git-annex treats the key as just a string. If the custom backend supports checksum verification, without the backend's implementation, keys from that backend would be treated like WORM/URL keys that do not support checksum checking.

Thoughts?

[[!tag needsthought]]

@ -48,3 +48,5 @@ subsequent WHEREIS, which may complicate its code slightly.

Note that the protocol does allow querying with GETCONFIG etc before
responding to a WHEREIS request.

[[!tag confirmed]]

@ -3,3 +3,5 @@ It would be useful to have a [[`git-annex-cat`|forum/Is_there_a___34__git_annex_
If file is not present, or `remote.here.cost` is higher than `remote.someremote.cost` where file is present, `someremote` would get a `TRANSFER` request where the `FILE` argument is a named pipe, and a `cat` of that named pipe would be started.

If file is not annexed, for uniformity `git-annex-cat file` would just call `cat file`.

[[!tag needsthought]]

@ -0,0 +1,15 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2020-01-01T18:44:37Z"
content="""
@Ilya_Shlyakhter, I'd assume:

* some remotes would write to the named pipe
* some remotes would overwrite it with a file
* some remotes would open it, try to seek around as they do non-sequential
receives, and hang or something
* some remotes would maybe open and write to it, but would no longer be
able to resume interrupted transfers, since they would I guess see its
size as 0
"""]]

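The failure modes listed above stem from FIFO semantics. This toy Python sketch (not git-annex code; the function names are invented) shows both the happy path of streaming through a named pipe and why a remote that stats or seeks `FILE` would be confused:

```python
# Toy demonstration of the named-pipe idea behind `git-annex-cat`: a FIFO
# stats as size 0 and only supports sequential reads, so remotes that
# check size or seek would misbehave, while sequential writers work fine.

import os
import tempfile
import threading

def transfer_to(path, content):
    # Stands in for a remote writing the requested object to FILE.
    # open() blocks until a reader (the "cat") opens the other end.
    with open(path, "wb") as f:
        f.write(content)

def cat_via_fifo(content):
    d = tempfile.mkdtemp()
    fifo = os.path.join(d, "file")
    os.mkfifo(fifo)  # POSIX only
    t = threading.Thread(target=transfer_to, args=(fifo, content))
    t.start()
    # A size-checking or resuming remote would be confused here:
    assert os.stat(fifo).st_size == 0
    with open(fifo, "rb") as f:  # reader sees bytes as they arrive
        data = f.read()
    t.join()
    return data
```

A remote that only ever opens `FILE` for writing and streams sequentially would work unchanged; the other three failure modes correspond to stat, seek, and resume on this size-0 pipe.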
@ -5,3 +5,5 @@ Now I followed the documentation about the special remote adb and created that r
Which is caused by the fact that I didn't have the files checked out on my workstation. I don't need the files on this pc, so it would be pointless to check out the partially huge files there; in other words, I don't need the files at that place. I don't get why the export command doesn't have a --from option where it can get the files?

Is there a reason that does not exist, and if so, what would be a way to send files to the android device without ssh-ing into my server?

[[!tag unlikely]]

@ -1 +1,3 @@
Can git-annex-get be extended so that "git-annex-get --batch --key" fetches the keys (rather than filenames) given in the input?

[[!tag needsthought]]

@ -1 +1,3 @@
`git diff` for annexed files, especially unlocked annexed files, is currently uninformative. It would help if [[`git-annex-init`|git-annex-init]] configured a [git diff driver](https://git-scm.com/docs/gitattributes#_generating_diff_text) to diff the contents of the annexed files, rather than the pointer files.

> [[wontfix|done]], see comment

@ -0,0 +1,14 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2020-01-06T18:39:27Z"
content="""
Normally annexed files are huge binary files. Line-by-line diff of such
files is unlikely to be useful.

So you would need some domain-specific diff for the kind of binary files
you are storing in git-annex. If you have one, you can use
[[git-annex-diffdriver]] to make git use it when diffing annexed files.

Not seeing anything more I can do here, so I'm going to close this todo.
"""]]

@ -1,3 +1,5 @@
Currently, git-annex-migrate leads to content (and metadata) being stored under both old and new keys. git-annex-unused can drop the contents under the old key, but then you can't access the content if you check out an older commit. Maybe, an option can be added to migrate keys using [git-replace](https://git-scm.com/docs/git-replace) ? You'd git-replace the blob .git/annex/objects/old_key with the blob .git/annex/objects/new_key, the blob ../.git/annex/objects/old_key with the blob ../.git/annex/objects/new_key , etc. You could then also have a setting to auto-migrate non-checksum keys to checksum keys whenever the content gets downloaded.

More generally, git-annex-replace could be implemented this way, doing what git-replace does, but for git-annex keys rather than git hashes. [[git-annex-pre-commit]] might need to be changed to implement replacement of keys added later.

[[!tag needsthought]]

@ -0,0 +1,16 @@
[[!comment format=mdwn
username="Chel"
avatar="http://cdn.libravatar.org/avatar/a42feb5169f70b3edf7f7611f7e3640c"
subject="comment 1"
date="2020-02-01T02:55:03Z"
content="""
Very interesting idea! But some problems:

- As mentioned, not only `.git/annex/<...>` blobs need to be replaced for every key, but also `/annex/<...>`
and all `../.git/annex/<...>`, `../../.git/annex/<...>`, etc.

- In big repositories it can create a giant amount of *refs/replace/* refs.
I don't know how it affects the performance if they are stored in .git/packed-refs,
but it can interfere with the normal operation on a repo.
For example `git show-ref` will not work without ` | grep` or something.
"""]]

@ -5,3 +5,5 @@ I think it would be better if `git annex reinject --known` would ignore the file
This problem does not affect `git annex reinject` without `--known`.

--spwhitton

> mentioned this on the git-annex reinject man page; [[done]] --[[Joey]]

@ -0,0 +1,20 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2020-01-06T17:11:58Z"
content="""
I can't think of a reasonable way to implement this.

It would need to hash and then look for a known SHA256E key that uses the
hash. But the layout of the git-annex branch doesn't provide any way to do
that, except for iterating over every filename in the branch. Which
would be prohibitively slow when reinjecting many files. (N times git
ls-tree -r) So it would need to build a data structure to map from SHA256
to known SHA256E key. That can't be stored in memory, git-annex doesn't
let the content of the repo cause it to use arbitrary amounts of memory
(hopefully).

All I can think of is to traverse the git-annex branch and build a sqlite
database and then query that, but that would add quite a lot of setup
overhead to the command.
"""]]

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="spwhitton"
avatar="http://cdn.libravatar.org/avatar/9c3f08f80e67733fd506c353239569eb"
subject="comment 2"
date="2020-01-07T12:29:47Z"
content="""
Thank you for your reply. Makes sense. If that's the only way to do it then it might as well be a helper script rather than part of git-annex.

Leaving this bug open because it would be good to have the limitation documented in git-annex-reinject(1).
"""]]

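As a sketch of the helper-script route mentioned above, something like this hypothetical Python could build the SHA256-to-SHA256E map in sqlite. The key parsing is simplified, and feeding it real keys (e.g. from `git ls-tree -r git-annex`) is left out:

```python
# Sketch of a helper that maps a plain SHA256 digest to the full known
# SHA256E key, so content could be looked up by hash without iterating
# the whole git-annex branch on every file.

import sqlite3

def digest_of(sha256e_key):
    # Keys look roughly like SHA256E-s<size>--<digest>.<ext>;
    # keep only the digest part, dropping any extension.
    digest = sha256e_key.split("--", 1)[1]
    return digest.split(".", 1)[0]

def build_map(known_keys, db=":memory:"):
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE IF NOT EXISTS keymap "
                 "(digest TEXT PRIMARY KEY, key TEXT)")
    conn.executemany(
        "INSERT OR REPLACE INTO keymap VALUES (?, ?)",
        ((digest_of(k), k) for k in known_keys
         if k.startswith("SHA256E-")))
    conn.commit()
    return conn

def lookup(conn, digest):
    row = conn.execute(
        "SELECT key FROM keymap WHERE digest = ?", (digest,)).fetchone()
    return row[0] if row else None
```

The one-time build pays the traversal cost once, after which each reinjected file is a single indexed query.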
@ -26,3 +26,5 @@
Obviously this wasn't actually a file known to git-annex. But I get the same error in a non-dummy bare repo I am trying to reinject.

A workaround is to use `git worktree add` and run `git annex reinject` from there.

> [[fixed|done]] --[[Joey]]

@ -1,50 +0,0 @@
Hello.

What does this log mean? It seems to tell "success", then "openDirStream" fails, then "1 failed". What failed?

Context is in [todo/git annex repair: performance can be abysmal, huge improvements possible](https://git-annex.branchable.com/ikiwiki.cgi?do=goto&page=todo%2Fgit_annex_repair__58___performance_can_be_abysmal__44___huge_improvements_possible)

    fatal: bad object refs/heads/git-annex
    fatal: bad object refs/heads/git-annex
    fatal: bad object refs/heads/git-annex
    error: Could not read somehashA
    fatal: Failed to traverse parents of commit somehashB
    error: Could not read somehashA
    fatal: Failed to traverse parents of commit somehashB
    error: Could not read somehashA
    fatal: Failed to traverse parents of commit somehashB
    error: Could not read somehashA
    fatal: Failed to traverse parents of commit somehashB
    Deleted these local branches, which could not be recovered due to missing objects:
    refs/heads/master
    refs/heads/git-annex
    You currently have refs/heads/master checked out. You may have staged changes in the index that can be committed to recover the lost state of this branch!
    Successfully recovered repository!
    Please carefully check that the changes mentioned above are ok..

    git-annex: .git/annex/journal/: openDirStream: does not exist (No such file or directory)
    failed
    git-annex: repair: 1 failed

The fact is: this repo is a plain git clone of a git annex repository.

There is no `.git/annex` directory there before `git-annex-repair` is run.

After it ran, there is a `.git/annex` directory with that content:

    total 24
    drwxrwxr-x 3 4096 Jul 22 15:41 .
    drwxrwxr-x 9 4096 Jul 23 07:24 ..
    -rw-rw-r-- 1 65 Jul 20 11:59 index
    -rw-rw-r-- 1 41 Jul 20 11:59 index.lck
    -rw-rw-r-- 1 0 Jul 22 15:41 journal.lck
    -rw-rw-r-- 1 211 Jul 20 11:59 mergedrefs
    drwxrwxr-x 2 4096 Jul 22 15:41 misctmp

Perhaps git-annex-repair gets confused when recovering a repository that is a plain git clone of a git annex repository?

I did that because annexed objects are 1.7TB big here, so I wanted a local copy of just the pure git part to perform repair of the repo, and somehow propagate the objects at a later stage.

I'll keep the repo lying around for a few days, maybe weeks, if some experiment or further feedback is needed.

Thank you for your attention.

@ -1 +1,3 @@
When using [[linked worktrees|tips/Using_git-worktree_with_annex]], the main tree is currently handled differently from the linked trees: "if there is change in the tree then syncing doesn't update git worktrees and their indices, but updates the checked out branches. This is different to the handling of the main working directory as it's either got updated or left behind with its branch if there is a conflict." Is there a reason for this? Could linked worktrees be treated the same as the main one?

[[!tag moreinfo]]

@ -0,0 +1,18 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2020-01-30T17:12:40Z"
content="""
That tip was written by leni536, and I don't really understand what it's
talking about with a difference in sync behavior. I'm not sure it's
accurate or describes what happens clearly.

To me it seems really simple, no matter if you have a regular work tree, or
are using git-worktree or whatever: sync fetches, merges, and pushes. Merging
updates the current work tree, and AFAIK not whatever other work trees might
be using the same .git repository. In any case, sync should behave the same
as git pull as far as updating work trees goes.

Can you please show an example of whatever problem you may have with the
current behavior?
"""]]

@ -1 +1,3 @@
git-annex-test failures sometimes reflect failures not of git-annex but of external utils on which it relies. E.g. when my installation or configuration of gpg has problems, the git-annex test suite fails due to the tests that rely on gpg. (And there doesn't seem to be a simple way to skip tests that match a regexp.) git-annex could avoid that by running some simple sanity checks (beyond just existence) on gpg or other optional dependencies, and skipping tests if these checks fail. E.g. if simple test commands to encrypt/sign a small file with gpg fail, then skip gpg-based tests (and warn the user).

[[!tag unlikely]]

@ -26,3 +26,5 @@ I would be willing to contribute some patches and although I have a respectable
As a sidenote, I don't know how a repo containing about 300k files jumped to 1400k git objects within the last 2 months.

Any feedback welcome, thanks.

[[!tag needsthought unlikely]]

@ -6,3 +6,5 @@ A few possibilities:
- Create branches or tags in an annex that collect a set of version-compatible checkouts for related projects. The commit/tag messages provide a natural place for meta-commentary
- Save and version files that aren't quite junk but don't belong *in* a repo (logs, dumps, backups, editor project/workspace files, notes/to-do lists, build-artifacts, test-coverage/linter stat databases, shell history) alongside the repo, making it easier to have a consistent environment for working on one project across multiple systems.
- Make separate system-specific "master" branches for the main projects directory on each system, then edit and push changes from any other. For example, prep the projects directory on an infrequently-used laptop from your desktop and push/pull the changes.

[[!tag unlikely moreinfo]]

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2020-01-30T18:20:42Z"
content="""
This seems, at first glance, entirely out of scope for git-annex.

There are other things that manage lots of git repositories. I've written one
even (myrepos).
"""]]

@ -109,3 +109,5 @@ The best fix would be to improve git's smudge/clean interface:

* Allow clean filter to read work tree files itself, to avoid overhead of
sending huge files through a pipe.

[[!tag confirmed]]

@ -9,3 +9,6 @@ use restagePointerFile, but that did not help; git update-index does then
smudge it during the `git annex unlock`, which is no faster (but at least
doing it then would avoid the surprise of a slow `git status` or `git
commit -a`). Afterwards, `git status` then smudged it again, unsure why!
--[[Joey]]

[[!tag confirmed]]

@ -3,3 +3,4 @@ Decided to ask before jumping into trying to implement it (not that I have any g

[[!meta author=yoh]]
[[!tag projects/repronim]]
[[!tag moreinfo]]

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="joey"
subject="""comment 6"""
date="2020-01-29T15:02:27Z"
content="""
I'm not clear how the answer to that question would impact git-annex.

Assuming this is built with external special remotes and/or plain git
remotes, is there something lacking in git-annex to implement it now?
"""]]

@ -268,3 +268,5 @@ decreases as it goes?
---

See also, [[adb_special_remote]]

[[!tag confirmed]]

@ -36,3 +36,5 @@ importtree, but there are several roadblocks:

So, it seems that, importtree would need to be able to run commands
other than rsync on the server. --[[Joey]]

[[!tag needsthought]]

@ -1 +1,3 @@
The documentation for the new import remote command says, "Importing from a special remote first downloads all new content from it". For many special remotes -- such as Google Cloud Storage or DNAnexus -- checksums and sizes of files can be determined without downloading the files. For other special remotes, data files might have associated checksum files (e.g. md5) stored next to them in the remote. In such cases, it would help to be able to import the files without downloading (which can be costly, especially from cloud provider egress charges), similar to addurl --fast.

[[!tag needsthought]]

@ -12,3 +12,5 @@ An attempt at making it stream via unsafeInterleaveIO failed miserably
and that is not the right approach. This would be a good place to use
ResourceT, but it might need some changes to the Annex monad to allow
combining the two. --[[Joey]]

[[!tag confirmed]]

@ -1 +1,3 @@
Currently, the git-annex branch is not checked out, but is accessed as needed with commands like git-cat. Could git-annex work faster if it kept the git-annex branch checked out? Especially if one could designate a fast location (like a ramdisk) for keeping the checked-out copy. Maybe git-worktree could be used to tie the separate checkout to the repository.

[[!tag unlikely]]

@ -1 +1,3 @@
Would it be hard to add a variation to checksumming [[backends]], that would change how the checksum is computed: instead of computing it on the whole file, it would first be computed on file chunks of given size, and then the final checksum computed on the concatenation of the chunk checksums? You'd add a new [[key field|internals/key_format]], say cNNNNN, specifying the chunking size (the last chunk might be shorter). Then (1) for large files, checksum computation could be parallelized (there could be a config option specifying the default chunk size for newly added files); (2) I often have large files on a remote, for which I have md5 for each chunk, but not for the full file; this would enable me to register the location of these files with git-annex without downloading them, while still using a checksum-based key.

[[!tag needsthought]]

@ -0,0 +1,18 @@
[[!comment format=mdwn
username="Chel"
avatar="http://cdn.libravatar.org/avatar/a42feb5169f70b3edf7f7611f7e3640c"
subject="comment 4"
date="2020-01-26T22:48:07Z"
content="""
Another theoretical use case (not available for now, but maybe for the future):
verify with checksums parts of the file and re-download only those parts/chunks that are bad.
For this you need a checksum for each chunk and a \"global\" checksum in key, that somehow incorporates all these chunk checksums.
An example of this is Tiger Tree Hash in file sharing.

When I used the SHA256 backend in my downloads, I often wondered that the long process of checksumming a movie
or an OS installation .iso is not ideal. Because if the file download is not finished, I get the wrong checksum,
and the whole process needs to be repeated.

And in the future git-annex can integrate a FUSE filesystem and literally store just chunks of files,
but represent files as a whole in this virtual filesystem view.
"""]]

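The proposed construction is easy to sketch. This hypothetical Python (not an existing git-annex backend) shows the chunk-then-combine checksum; a key would also need to record the chunk size, the proposed cNNNNN field, for the result to be reproducible:

```python
# Sketch of the proposed chunked checksum: hash fixed-size chunks, then
# hash the concatenation of the per-chunk digests. Keeping the per-chunk
# digests would also allow verifying and re-downloading individual chunks.

import hashlib

def chunked_sha256(data, chunk_size):
    chunk_digests = []
    for i in range(0, len(data), chunk_size):
        chunk_digests.append(
            hashlib.sha256(data[i:i + chunk_size]).digest())
    # Final checksum over the concatenated chunk checksums; note the
    # result depends on chunk_size, hence recording it in the key.
    final = hashlib.sha256(b"".join(chunk_digests)).hexdigest()
    return final, chunk_digests
```

Per-chunk hashing is what makes both parallel computation and registering remote files from pre-existing per-chunk md5 lists possible.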