Files in the git-annex branch use timestamps to ensure that the most
recently recorded state wins. This is unsatisfying, because it requires
accurate clocks among all users. It would be better to use vector clocks,
where possible, but it is not possible to use vector clocks for all
information in the branch.

To see why vector clocks can't be used for some information in the branch,
consider location log files. They are meant to reflect the actual state of
an external resource. Vector clocks can ensure that a consistent state is
agreed on by distributed users, but there's no way to guarantee that state
matches the actual state.

For example, let's assume there's a vector clock consisting of an
integer, and an object is being added and removed from a remote by multiple
parties. First Alice logs (present, 1), and then some time later, Alice
logs (missing, 2). Meanwhile, Bob merges (present, 1) from Alice
and then logs (missing, 2), followed by (present, 3). At some later point,
they merge back up, and the winning state is (present, 3) as it has the
highest vector clock. Is the content really present on the remote?
Well, we don't know: Alice could have removed it before Bob stored it,
or afterwards.

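The Alice and Bob scenario can be replayed as a minimal Python sketch.
The `(status, clock)` tuples and the `merged_winner` helper are assumptions
made for illustration; they are not git-annex's real log format or merge
code.

```python
# Hypothetical log entries: (status, clock). Not git-annex's real format.
alice_log = [("present", 1), ("missing", 2)]
bob_log = [("present", 1), ("missing", 2), ("present", 3)]

def merged_winner(*logs):
    """Union the logs and return the entry with the highest clock."""
    entries = {entry for log in logs for entry in log}
    return max(entries, key=lambda entry: entry[1])

# Everyone agrees on the same winner, but the clock values say nothing
# about whether Alice's removal happened before or after Bob's store.
print(merged_winner(alice_log, bob_log))  # ('present', 3)
```
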
But, other information in the branch could use vector clocks. Consider
the numcopies setting. It's fine if the winner of a conflict over that is not
the one who set it most recently, as long as a value can be consistently
determined. So, the numcopies setting, and other similar configuration, is not
trying to track an external state, and so it could use vector clocks.

How would these vector clocks work, and how to transition to using them
without confusing old versions of git-annex that expect timestamps? A
change to a log could simply increment the clock from the previous
version of the log. This would make the new git-annex normally lose
when a conflicting change was written by an old git-annex, but the result
would be consistent, so that's acceptable.

Files that are related to external state need to continue to use
timestamps. But this could still be improved. Currently, if the clock is
wrongly set far in the future, logs using those timestamps will win over
other logs for a long time. This could break git-annex badly, as there
would be no way to correct wrong information.

Experimenting with `GIT_ANNEX_VECTOR_CLOCK`, it looks like `git annex fsck`
is able to recover from wrong location information being recorded with a
far future timestamp. It replaces that timestamp with the current one.
However, if that then gets union merged with a change to the same location
log made in another repository, fsck's correction can be lost in the merge.
Re-running the fsck will eventually get the information corrected, once a
non-union merge happens. However, `git annex fsck` can't correct other
logs, like remote state logs, if they end up with bad information with
a far future timestamp.

There's a mirror problem of information being recorded with a timestamp
in the past and being ignored. But, at least in that case, re-recording
good information with the right timestamp will fix the problem.

Consider making git-annex ignore future timestamps
(with some amount of allowance for minor clock skew). There are two
problems. One is that currently valid information gets ignored, until it's
able to be re-recorded. The second is that when the timestamp slips
into the past, the old, invalid information suddenly starts being taken
into account.

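Both problems with ignoring future timestamps show up in a minimal Python
sketch. The `(value, timestamp)` entries, the `ALLOWED_SKEW` value, and the
`current_value` helper are assumptions made for illustration, not how
git-annex actually reads its logs.

```python
ALLOWED_SKEW = 60  # seconds of tolerated clock skew (hypothetical value)

def current_value(entries, now, allowed_skew=ALLOWED_SKEW):
    """Return the newest value whose timestamp is not in the future."""
    usable = [e for e in entries if e[1] <= now + allowed_skew]
    if not usable:
        return None
    return max(usable, key=lambda e: e[1])[0]

# Problem one: valid information written by a fast clock is ignored.
print(current_value([("valid", 5000)], now=1000))  # None

# Problem two: a wrong entry from the future is ignored at first, but
# wins again once the real time catches up with its timestamp.
log = [("wrong", 5000), ("corrected", 1000)]
print(current_value(log, now=1000))  # corrected
print(current_value(log, now=5000))  # wrong
```
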
---

A better idea: When writing new information, check if the old
value for the log has a timestamp `>=` the current timestamp. If so, don't
use the current timestamp for the new information; instead increment the old
timestamp. So when there's clock skew (forwards or backwards), this makes
it fall back, effectively, to vector clocks.

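The rule can be sketched in a few lines of Python. The `next_clock` name
and the use of integer seconds are assumptions made for illustration; this
is not the code from git-annex itself.

```python
import time

def next_clock(old_clock, now=None):
    """Pick the clock value for a new log entry.

    old_clock: clock of the existing entry, or None if there is none.
    now: current time in seconds; defaults to the local clock.
    """
    if now is None:
        now = int(time.time())
    if old_clock is not None and old_clock >= now:
        # The old entry is not older than the local clock, so there is
        # clock skew somewhere: fall back to incrementing.
        return old_clock + 1
    return now

print(next_clock(100, now=200))  # 200, no skew, the timestamp is used
print(next_clock(500, now=200))  # 501, skew, the old value is incremented
```
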
This would work for both kinds of logs. For configuration changes,
it's kind of better than using only vector clocks, because in the absence
of clock skew, the most recent change to a configuration wins. For state
changes, it keeps the benefits of timestamps except when there's clock
skew, in which case there are not any benefits of timestamps anymore,
so vector clocks are the best that can be done. --[[Joey]]

(How would `GIT_ANNEX_VECTOR_CLOCK` interact with this? Maybe, when that's
set to a low number, it would be treated as the current time. So this would
let it be used, or not, without issues, and also would let it be set to a
low number once, and not need to be changed, since git-annex would
increment as necessary.)

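A rough Python sketch of that interaction follows. It assumes that
`GIT_ANNEX_VECTOR_CLOCK` holds a plain integer and stands in for the
current time; `current_clock` and `write_entries` are hypothetical names,
and none of this is taken from git-annex's code.

```python
import os
import time

def current_clock(environ=None):
    """Use GIT_ANNEX_VECTOR_CLOCK as the current time when it is set.

    Assumes the variable holds a plain integer.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("GIT_ANNEX_VECTOR_CLOCK")
    return int(override) if override is not None else int(time.time())

def write_entries(count, environ=None):
    """Return the clocks used for `count` successive writes to one log."""
    clocks = []
    old = None
    for _ in range(count):
        now = current_clock(environ)
        # Increment when the old clock is not older than the current one.
        old = old + 1 if old is not None and old >= now else now
        clocks.append(old)
    return clocks

# Set once to a low number, the variable never needs to be changed.
print(write_entries(3, {"GIT_ANNEX_VECTOR_CLOCK": "1"}))  # [1, 2, 3]
```
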
> The `vectorclock` branch has this mostly implemented. --[[Joey]]