1acdd18ea8
* Deal with clock skew, both forwards and backwards, when logging information to the git-annex branch. * GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1) rather than needing to be advanced each time a new change is made. * Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex. When changing a file in the git-annex branch, the vector clock to use is now determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK when set), and comparing it to the newest vector clock already in use in that file. If a newer time stamp was already in use, advance it forward by a second instead. When the clock is set to a time in the past, this avoids logging with an old timestamp, which would risk that log line later being ignored in favor of "newer" line that is really not newer. When a log entry has been made with a clock that was set far ahead in the future, this avoids newer information being logged with an older timestamp and so being ignored in favor of that future-timestamped information. Once all clocks get fixed, this will result in the vector clocks being incremented, until finally enough time has passed that time gets back ahead of the vector clock value, and then it will return to usual operation. (This latter situation is not ideal, but it seems the best that can be done. The issue with it is, since all writers will be incrementing the last vector clock they saw, there's no way to tell when one writer made a write significantly later in time than another, so the earlier write might arbitrarily be picked when merging. This problem is why git-annex uses timestamps in the first place, rather than pure vector clocks.) Advancing forward by 1 second is somewhat arbitrary. setDead advances a timestamp by just 1 picosecond, and the vector clock could too. But then it would interfere with setDead, which wants to be overrulled by any change. So it could use 2 picoseconds or something, but that seems weird. It could just as well advance it forward by a minute or whatever, but then it would be harder for real time to catch up with the vector clock when forward clock slew had happened. A complication is that many log files contain several different peices of information, and it may be best to only use vector clocks for the same peice of information. For example, a key's location log file contains InfoPresent/InfoMissing for each UUID, and it only looks at the vector clocks for the UUID that is being changed, and not other UUIDs. Although exactly where the dividing line is can be hard to determine. Consider metadata logs, where a field "tag" can have multiple values set at different times. Should it advance forward past the last tag? Probably. What about when a different field is set, should it look at the clocks of other fields? Perhaps not, but currently it does, and this does not seems like it will cause any problems. Another one I'm not entirely sure about is the export log, which is keyed by (fromuuid, touuid). So if multiple repos are exporting to the same remote, different vector clocks can be used for that remote. It looks like that's probably ok, because it does not try to determine what order things occurred when there was an export conflict. Sponsored-by: Jochen Bartl on Patreon
230 lines
10 KiB
Markdown
230 lines
10 KiB
Markdown
This page's purpose is to collect and explore plans for a future
|
|
annex.version.
|
|
|
|
There are two major possible changes that could go in a new repo
|
|
version that would require a hard migration of git-annex repositories:
|
|
|
|
1. Changing .git/annex/objects/ paths, as appear in the git-annex symlinks.
|
|
|
|
2. Changing the layout of the git-annex branch in a substantial way.
|
|
|
|
## object path changes
|
|
|
|
Any change in this area requires the user make changes to their master
|
|
branch, any other active branches. Old un-converted tags and other
|
|
historical trees in git would also be broken. This is a pretty bad user
|
|
experience. (And it bloats history with a commit that rewrites everything
|
|
too.
|
|
|
|
For this reason, any changes in this area have been avoided, going all the
|
|
way back to v2 (2011).
|
|
|
|
> git-annex had approximately 3 users at the
|
|
> time of that migration, and as one of them, I can say it was a total PITA.
|
|
--[[Joey]]
|
|
|
|
So, there would need to be significant payoffs to justify this change.
|
|
|
|
Note that changing the hash directories might also change where objects are
|
|
stored in special remotes. Because repos can be offline or expensive to
|
|
migrate (or both -- Glacier!) any such changes need to keep looking in the
|
|
old locations for backwards compatability.
|
|
|
|
Possible reasons to make changes:
|
|
|
|
* It's annoyingly inconsistent that git-annex uses a different hash
|
|
directory layout for non-bare repository (on a non-crippled filesystem)
|
|
than is used for bare repositories and some special remotes.
|
|
|
|
Users occasionally stumble over this difference when messing with
|
|
internals. The code is somewhat complicated by it. In some cases,
|
|
git-annex checks both locations (eg, a bare repo defaults to xxx/yyy
|
|
but really old ones might use xX/yY for some keys).
|
|
|
|
The mixed case hash directories have caused trouble on case-insensative
|
|
filesystems, although that has mostly been papered over to avoid
|
|
problems. One remaining problem users can stuble on occurs
|
|
when [[moving a repository from OSX to Linux|bugs/OSX_case_insensitive_filesystem]].
|
|
|
|
* The hash directories, and also the per-key directories
|
|
can slow down using a repository on a disk (both SSD and spinning).
|
|
|
|
<https://github.com/datalad/datalad/issues/32>
|
|
|
|
Initial benchmarks suggest that going from xX/yY/KEY/OBJ to xX/yY/OBJ
|
|
directories would improve speed 3x.
|
|
|
|
Presumably, removing the yY would also speed it up, unless there are too
|
|
many objects and the filesystem gets slow w/o the hash directories.
|
|
|
|
* Removing a directory level would also reduce disk usage, see [[forum/scalability_with_lots_of_files/]] for more info.
|
|
|
|
## git-annex branch changes
|
|
|
|
This might involve, eg, rethinking the xxx/yyy/ hash directories used
|
|
in the git-annex branch.
|
|
|
|
Would this require a hard version transition? It might be possible to avoid
|
|
one, but then git-annex would have to look in both the old and the new
|
|
place. And if a un-transitioned repo was merged into a transitioned one,
|
|
git-annex would have to look in *both* places, and union merge the two sets
|
|
of data on the fly. This doubles the git-cat-file overhead of every
|
|
operation involving the git-annex branch. So a hard transition would
|
|
probably be best.
|
|
|
|
Also, note that w/o a hard transition, there's the risk that a old
|
|
git-annex version gets ahold of a git-annex branch created by a new
|
|
git-annex version, and sees only half of the story (the un-transitioned
|
|
files). This could be a very confusing failure mode. It doesn't help that
|
|
the git-annex branch does not currently have any kind of
|
|
version number embedded in it, so the old version of git-annex doesn't even
|
|
have a way to check if it can handle the branch.
|
|
|
|
Possible reasons to make changes:
|
|
|
|
* There is a discussion of some possible changes to the hash directories here
|
|
<https://github.com/datalad/datalad/issues/17#issuecomment-68558319> with a
|
|
goal of reducing the overhead of the git-annex branch in the overall size
|
|
of the git-annex repository.
|
|
|
|
Removing the second-level hash directories might improve performance.
|
|
It doesn't save much space when a repository is having incremental changes
|
|
made to it. However, if millions of annexed objects are being added
|
|
in a single commit, removing the second-level hash directories does save
|
|
space; it halves the number of tree
|
|
objects[1](https://github.com/datalad/datalad/issues/17#issuecomment-68759754).
|
|
|
|
Also,
|
|
<https://github.com/datalad/datalad/issues/17#issuecomment-68569727>
|
|
suggests using xxx/yyy.log, where one log contains information for
|
|
multiple keys. This would probably improve performance too due to
|
|
caching, although in some cases git-annex would have to process extra
|
|
information to get to the info about the key it wants, which hurts
|
|
performance. The disk usage change of this method has not yet been
|
|
quantified.
|
|
|
|
* While not a sufficient reason on its own, the best practices for file
|
|
formats in the git-annex branch has evolved over time, and there are some
|
|
files that have unusual formats for historical reasons. Other files have
|
|
modern formats, but their parsers have to cope with old versions that
|
|
have other formats. A hard transition would provide an opportunity to
|
|
clean up a lot of that.
|
|
|
|
## living on the edge
|
|
|
|
Rather than a hard transition, git-annex could add a mode
|
|
that could be optionally enabled when initing a repo for the first time.
|
|
|
|
Users who know they need that mode could then turn it one, and get the
|
|
benefits, while everyone else avoids a transition that doesn't benefit them
|
|
much.
|
|
|
|
There could even be multiple modes, with different tradeoffs depending on
|
|
how the repo will be used, its size, etc. Of course that adds complexity.
|
|
|
|
But the main problem with this idea is, how to avoid the foot shooting
|
|
result of merging repo A(v5) into repo B(vNG)? This seems like it would be
|
|
all to easy for a user to do.
|
|
|
|
As far as git-annex branch changes go, it might be possible for git-annex
|
|
to paper over the problem by handling both versions in the merged git-annex
|
|
branch, as discussed earlier. But for .git/annex/objects/ changes, there
|
|
does not seem to be a reasonable thing for git-annex to do. When it's
|
|
receiving an object into a mixed v5 and vNG repo, it can't know which
|
|
location that repo expects the object file to be located in. Different
|
|
files in the repo might point to the same object in different locations!
|
|
Total mess. Must avoid this.
|
|
|
|
Currently, annex.version is a per-local-repo setting. git-annex can't tell
|
|
if two repos that it's merging have different annex.version's.
|
|
|
|
It would be possible to add a git-annex:version file, which would work for
|
|
git-annex branch merging. Ie, `git-annex merge` could detect if different
|
|
git-annex branches have different versions, and refuse to merge them (or
|
|
upgrade the old one before merging it).
|
|
|
|
Also, that file could be used by git-annex, to automatically set
|
|
annex.version when auto-initing a clone of a repo that was initted with
|
|
a newer than default version.
|
|
|
|
But git-anex:version won't prevent merging B/master into A's master.
|
|
That merge can be done by git; nothing in git-annex can prevent it.
|
|
|
|
What we could do is have a .annex-version flag file in the root of the
|
|
repo. Then git merge would at least have a merge conflict. Note that this
|
|
means inflicting the file on all git-annex repos, even ones used by people
|
|
with no intention of living on the edge. And, it would take quite a while
|
|
until all such repos get updated to contain such a file.
|
|
|
|
Or, we could just document that if you initialize a repo with experimental
|
|
annex.version, you're living on the edge and you can screw up your repo
|
|
by merging with a repo from an old version.
|
|
|
|
git-annex fsck could also fix up any broken links that do result from the
|
|
inevitable cases where users ignore the docs.
|
|
|
|
## version numbers vs configuration
|
|
|
|
A particular annex.version like 5 encompasses a number of somewhat distinct
|
|
things
|
|
|
|
* git-annex branch layout
|
|
* .git/annex/objects/ layout
|
|
* other git stuff (like eg, the name of the HEAD branch in direct mode)
|
|
|
|
If the user is specifying at `git annex init` time some nonstandard things
|
|
they want to make the default meet their use case better, that is more
|
|
a matter of configuration than of picking a version.
|
|
|
|
For example, we could say that the user is opting out of the second-level
|
|
object hash directories. Or we could say the user is choosing to use vNG,
|
|
which is like v5 except with different object hash directory structure.
|
|
|
|
git annex init --config annex.objects.hashdirectories 1
|
|
--config annex.objects.hashlower true
|
|
git annex init --version 6
|
|
|
|
The former would be more flexible. The latter is simpler.
|
|
|
|
The former also lets the user chose *no* hash directories, or
|
|
choose 2 levels of hash directories while using the (v5 default) mixed
|
|
case hashing.
|
|
|
|
## concrete design
|
|
|
|
Make git-annex:difference.log be used by newer git-annex versions than v5,
|
|
and by nonstandard configurations.
|
|
|
|
The file contents will be "timestamp uuid [value, ..]", where value is a
|
|
serialized data type that describes divergence from v5 (since v5 and older
|
|
don't have the git-annex:difference.log file).
|
|
|
|
So, for example, "[Version 6]" could indicate that v6 is being used. Or,
|
|
"[ObjectHashLower True, ObjectHashDirectories 1, BranchHashDirectories 1]"
|
|
indicate a nonstandard configuration on top of v5 (this might turn out to
|
|
be identical to v6; just make the compare equal and no problem).
|
|
|
|
git-annex merge would check if it's merging in a git-annex:difference.log from
|
|
another repo that doesn't match the git-annex:difference.log of the local repo,
|
|
and abort. git-annex sync (and the assistant) would check the same, but
|
|
before merging master branches either, to avoid a bad merge there.
|
|
|
|
The git-annex:difference.log of a local repo could be changed by an upgrade
|
|
or some sort of transition. When this happens, the new value is written
|
|
for the uuid of the local repo. git-annex merge would then refuse to merge
|
|
with remote repos until they were also transitioned.
|
|
|
|
(There's perhaps some overlap here with the existing
|
|
git-annex:transitions.log, however the current transitions involve
|
|
forgetting history/dead remotes and so can be done repeatedly on a
|
|
repository. Also, the current transitions can be performed on remote
|
|
branches before merging them in; that wouldn't work well for version
|
|
changes since those require other changes in the remote repo.)
|
|
|
|
Not covered:
|
|
|
|
* git-merge of other branches, such as master (can be fixed by `git annex
|
|
fix` or `fsck`)
|
|
* Old versions of git-annex will ignore the version file of course,
|
|
and so merging such repos using them can result in pain.
|
|
|