This could in theory allow importing subsets of files with less memory
use. Rather than building up a big import list and then filtering it to
a smaller list of wanted files, support optionally filtering wanted
files first.
So far, the directory special remote implements it and will probably use
less memory. (Since dirContentsRecursiveSkipping does lazy streaming.)
Implementation in Remote.S3 is incomplete and fails to compile. Bit of a
mess with ResourceT needing to use Annex.
Also, in Remote.S3, filtering is not done for old versions.
And mkImportableContentsUnversioned now does work that is redundant
with filterwanted.
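To illustrate the idea, a standalone sketch (not git-annex's actual
code) of a lazily streamed recursive listing, which lets a wanted
predicate discard entries as they are produced, instead of
materializing the full list and filtering afterwards:

    import System.Directory (listDirectory, doesDirectoryExist)
    import System.FilePath ((</>))
    import System.IO.Unsafe (unsafeInterleaveIO)

    -- Lazily streamed recursive directory listing. Files after the
    -- first are yielded as the consumer demands them, so the filter
    -- below does not retain the whole listing in memory.
    lazyDirContents :: FilePath -> IO [FilePath]
    lazyDirContents top = go [top]
      where
        go [] = return []
        go (p:ps) = do
            isdir <- doesDirectoryExist p
            if isdir
                then do
                    entries <- map (p </>) <$> listDirectory p
                    go (entries ++ ps)
                else do
                    rest <- unsafeInterleaveIO (go ps)
                    return (p : rest)

    -- Filter wanted files while streaming, rather than afterwards.
    importWanted :: (FilePath -> Bool) -> FilePath -> IO [FilePath]
    importWanted wanted dir = filter wanted <$> lazyDirContents dir
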
When importing from a special remote, support preferred content expressions
that use terms that match on keys (eg "present", "copies=1"). Such terms
are ignored when importing, since the key is not known yet.
When "standard" or "groupwanted" is used, the terms in those
expressions also get pruned accordingly.
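The pruning amounts to roughly this (a simplified sketch with
hypothetical types, not git-annex's actual preferred content
matcher):

    type Key = String

    -- Simplified matcher terms: some need a key, some only a file.
    data Term = NeedsKey (Key -> Bool) | NeedsFile (FilePath -> Bool)

    -- When importing, the key is not known yet, so terms that match
    -- on keys are pruned, ie treated as always matching.
    matchImporting :: FilePath -> [Term] -> Bool
    matchImporting f = all match
      where
        match (NeedsKey _) = True
        match (NeedsFile p) = p f
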
This does allow setting preferred content to "not (copies=1)" to make a
special remote into a "source" type of repository. Importing from it will
import all files. Then exporting to it will drop all files from it.
In the case of setting preferred content to "present", it's pruned on
import, so everything gets imported from it. Then on export, it's applied,
and everything in it is left on it, and no new content is exported to it.
Since the old behavior on these preferred content expressions was for
importtree to error out, there's no backwards compatibility to worry about.
Except that sync/pull/etc will now import where before it errored out.
This can reduce the size of the branch by up to 8%. My test was
running git-annex add 1000 times, adding one file each time.
Previously, lots of different high-resolution timestamps were
recorded; after eliminating those and packing, the git repo was
8% smaller.
Due to the use of vector clocks, high resolution timestamps are
not necessary to make clear which information is most recent when,
eg, a value is changed repeatedly in the same second. In such a
case, the vector clock will be advanced to the next second after
the last modification. For example, when running:

    git-annex numcopies 1; git-annex numcopies 2

the first command records the current second, while the second
command records the second after that, even if both run within the
same second.
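A simplified sketch of that advancing behavior (not the actual
implementation):

    -- Vector clocks here are whole seconds.
    newtype VectorClock = VectorClock Integer
        deriving (Eq, Ord, Show)

    -- When writing a new value, use the current second, unless the
    -- previous write was in the same (or a later) second; then
    -- advance to the second after the last modification.
    advance :: Integer -> Maybe VectorClock -> VectorClock
    advance now Nothing = VectorClock now
    advance now (Just (VectorClock prev))
        | now > prev = VectorClock now
        | otherwise = VectorClock (prev + 1)
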
As for conflicting information written to two different clones of the
repository, this will make git-annex sometimes pick information that
was written earlier in a second over information written later in the
same second. Usually git-annex does not write conflicting information,
but there are some cases where it could. Eg, storing an object on a remote
can update the remote state log with some state. If two repos both store the
same object, and end up storing different remote state for some reason,
this can result in the one that ran a tiny bit earlier winning. Such a situation
seems unlikely to be user visible. And a small amount of clock skew could
already result in such things.
The only case I can think of where this might be a user visible change
is if a configuration command like git-annex numcopies is being run
in 2 clones of a repository on the same machine at very
close to the same time. Then the user will know which they ran last,
and git-annex won't.
If that did become a problem, this could be dialed back to logging,
eg, milliseconds, with still some space saving.
migrate: Support adding size to URL keys that were added with --relaxed, by
running eg: git-annex migrate --backend=URL foo
Since URL keys cannot be generated, that used to fail. Make it notice that
the backend is not changed, and just get the size of the content.
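Conceptually (using a simplified stand-in for the real Key type),
migrating to the same backend just fills in the missing size:

    -- Simplified stand-in for git-annex's Key type.
    data Key = Key
        { keyBackend :: String
        , keyName :: String
        , keySize :: Maybe Integer
        } deriving (Show)

    -- When the backend is unchanged, a migration can just fill in
    -- the size from the content, rather than generating a new key
    -- (which is not possible for URL keys).
    addSize :: Integer -> Key -> Key
    addSize sz k = k { keySize = Just sz }
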
Sponsored-by: Brock Spratlen on Patreon
pull, sync: When operating on content, automatically hard link objects
that have been migrated.
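The hard linking itself is conceptually just this (an illustrative
sketch with hypothetical object paths; real code also has to handle
locking and fall back to copying on filesystems without hard link
support):

    import System.Posix.Files (createLink)

    -- When a migration maps an old key to a new key and the old
    -- object file is present, hard link it into place as the new
    -- object file, avoiding a re-download.
    linkMigrated :: FilePath -> FilePath -> IO ()
    linkMigrated oldobj newobj = createLink oldobj newobj
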
Added annex.syncmigrations config that can be set to false to prevent
pull and sync from migrating object content.
I think that true is a good default for this config, because it avoids
users having to re-download migrated content or learning about migration.
But, some users will surely not like it, whether because it does take some
time (especially for the first git-annex branch scan when there is a long
history), or because they want to deal with it manually, or because their
filesystem doesn't support hard links and they don't want it to copy
objects.
Sponsored-by: k0ld on Patreon
And avoid migrate --update/--apply migrating when the new key was already
present in the repository, and got dropped. Luckily, the location log
allows distinguishing that from the new key never having been present!
That is mostly useful for --apply because otherwise dropped files would
keep coming back until the old objects were reaped as unused. But it
seemed to make sense to also do it for --update, for consistency in edge
cases if nothing else. One case where --update can use it is when one
branch got migrated earlier, and we dropped the file, and now another
branch has migrated the same file.
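The distinction is roughly this (a simplified sketch; the real
location log records per-repository lines with timestamps):

    data LogStatus = InfoPresent | InfoMissing
        deriving (Eq)

    -- Nothing ever logged: the new key was never present.
    -- All entries missing: it was present once and got dropped.
    wasDropped :: [LogStatus] -> Bool
    wasDropped [] = False
    wasDropped ls = all (== InfoMissing) ls
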
Sponsored-by: Jack Hill on Patreon
This only avoids extra work and a warning message. It seems likely that
in such a situation, the user does not want migrations to insecure
hashes, and so it's best to ignore them as much as possible. If
the user merges a branch that switches annexed files to an insecure
hash, they will notice that the file contents are unavailable,
and git-annex get will tell them the problem then. So it does not seem
useful to have migrate --update also complain about it.
Could use some more testing.
When the old key is not present, Command.ReKey.linkKey' will return
False, so this handles that case ok.
But, I do wonder if distributed migration may need to deal with the old
key getting copied into the repository later. In that situation,
re-running migrate --update won't link it to the new key. It may be that
some users will need that. They can delete .git/annex/migrate.log and
run it again, but that is not a good user interface. Maybe either have
a way to re-run all distributed migrations, or record migrations
in a database and scan the db to find migrations to do in a future run?
Sponsored-by: Kevin Mueller on Patreon
The git log is outputting the diff, but this only looks at the new
files. When we have a new file, we can get the old filename by just
replacing "new" with "old". And then use branchFileRef to refer to it
allows catting the old key.
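That rename is mechanical; a sketch, assuming the migration tree
lays files out under top-level old/ and new/ directories (other
paths pass through unchanged):

    import System.FilePath (splitDirectories, joinPath)

    -- Map a filename under the "new" directory to its counterpart
    -- under "old".
    oldFromNew :: FilePath -> FilePath
    oldFromNew f = case splitDirectories f of
        ("new":rest) -> joinPath ("old":rest)
        ds -> joinPath ds
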
While this does have to skip past the old files in the diff, it's still
faster than calling git diff separately.
Sponsored-by: Nicholas Golder-Manning on Patreon
This is most of the way there, but not quite working.
The layout of migrate.tree/ needs to be changed to follow this approach.
git log will list all the files in tree order, so the new layout needs
to alternate old and new keys. Can that be done? git may not document
tree order, or may not preserve it here.
Alternatively, change to using git log --format=raw and extract
the tree header from that, then use
git diff --raw $tree:migrate.tree/old $tree:migrate.tree/new
That will be a little more expensive, but only when there are lots of
migrations.
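Extracting the tree header would be straightforward, since each
commit record in git log --format=raw includes an unindented
"tree <sha>" line. A sketch:

    import Data.List (isPrefixOf)

    -- Pull the tree shas out of git log --format=raw output.
    treeShas :: String -> [String]
    treeShas = map (drop 5) . filter ("tree " `isPrefixOf`) . lines
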
Sponsored-by: Joshua Antonishen on Patreon