git-annex/doc/todo/git-annex_branch_clocks.mdwn
2021-08-03 17:05:50 -04:00

Files in the git-annex branch use timestamps to ensure that the most
recently recorded state wins. This is unsatisfying, because it requires
accurate clocks among all users. It would be better to use vector clocks,
where possible, but it is not possible to use vector clocks for all
information in the branch.

To see why vector clocks can't be used for some information in the branch,
consider location log files. They are meant to reflect the actual state of
an external resource. Vector clocks can ensure that a consistent state is
agreed on by distributed users, but there's no way to guarantee that state
matches the actual state.

For example, let's assume there's a vector clock consisting of an
integer, and an object is being added and removed from a remote by multiple
parties. First Alice logs (present, 1), and then some time later, Alice
logs (missing, 2). Meanwhile, Bob merges (present, 1) from Alice
and then logs (missing, 2), followed by (present, 3). At some later point,
they merge back up, and the winning state is (present, 3) as it has the
highest vector clock. Is the content really present on the remote?
Well, we don't know: Alice could have removed it before Bob stored it,
or afterwards.
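
The example above can be replayed in a small Python sketch. It is only an illustration under assumptions: the `(status, clock)` tuples and the helper names `merge` and `replay` are hypothetical, and they do not reflect the actual log format or git-annex code.

```python
def merge(entries):
    """Pick the winning log entry: the one with the highest clock value."""
    return max(entries, key=lambda entry: entry[1])


def replay(actions):
    """Apply store/drop actions in real-world order and report whether
    the content ends up present on the remote."""
    present = False
    for _who, action in actions:
        present = (action == "store")
    return present


# The logs each party ends up with, as (status, clock) pairs.
alice_log = [("present", 1), ("missing", 2)]
bob_log = [("present", 1), ("missing", 2), ("present", 3)]

# Two real-world orderings that both produce exactly those logs.
alice_drops_first = [("alice", "store"), ("alice", "drop"),
                     ("bob", "drop"), ("bob", "store")]
alice_drops_last = [("alice", "store"), ("bob", "drop"),
                    ("bob", "store"), ("alice", "drop")]

print(merge(alice_log + bob_log))   # ('present', 3)
print(replay(alice_drops_first))    # True
print(replay(alice_drops_last))     # False
```

The merge agrees on (present, 3) either way, while the remote's real state differs between the two orderings.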

But other information in the branch could use vector clocks. Consider the
numcopies setting. It's fine if the winner of a conflict over that is not
the one who set it most recently, as long as a value can be consistently
determined. The numcopies setting, and other similar configuration, is not
trying to track an external state, and so it could use vector clocks.

How would these vector clocks work, and how could git-annex transition to
using them without confusing old versions that expect timestamps? A
change to a log could simply increment the clock from the previous
version of the log. This would make the new git-annex normally lose
when a conflicting change was written by an old git-annex, but the result
would be consistent, so that's acceptable.
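
A minimal sketch of that transition, assuming clocks and timestamps are compared as plain numbers. The function names and the example values are hypothetical, not taken from git-annex.

```python
def new_style_clock(previous_clock):
    """New git-annex (as proposed): increment the clock found in the
    previous version of the log."""
    return previous_clock + 1


def old_style_clock(wall_clock_now):
    """Old git-annex: use the current wall-clock time as the clock."""
    return wall_clock_now


# Hypothetical values: the previous log version carries a timestamp
# written some time ago by an old git-annex.
previous = 1_600_000_000
new_write = new_style_clock(previous)        # 1_600_000_001
old_write = old_style_clock(1_627_000_000)   # a later wall-clock time

# The old version's current timestamp is normally the larger value, so
# its change wins; every reader still picks the same winner.
print(max(new_write, old_write) == old_write)  # True
```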

Files that are related to external state need to continue to use
timestamps. But this could still be improved. Currently, if the clock is
wrongly set far in the future, logs using those timestamps will win over
other logs for a long time. This could break git-annex badly, as there
would be no way to correct wrong information.

Experimenting with `GIT_ANNEX_VECTOR_CLOCK`, it looks like `git annex fsck`
is able to recover from wrong location information being recorded with a
far future timestamp. It replaces that timestamp with the current one.
However, if that then gets union merged with a change to the same location
log made in another repository, fsck's correction can be lost in the merge.
Re-running the fsck will eventually get the information corrected, once a
non-union merge happens. However, `git annex fsck` can't correct other
logs, like remote state logs, if they end up with bad information with
a far future timestamp.

There's a mirror problem of information being recorded with a timestamp
in the past and being ignored. But, at least in that case, re-recording
good information with the right timestamp will fix the problem.

Consider making git-annex ignore future timestamps
(with some amount of allowance for minor clock skew). There are two
problems. One is that currently valid information gets ignored until it
can be re-recorded. The second is that when the timestamp slips
into the past, the old, invalid information suddenly starts being taken
into account.

---

A better idea: When writing new information, check if the old
value for the log has a timestamp `>=` current timestamp. If so, don't use the
current timestamp for the new information, instead increment the old
timestamp. So when there's clock skew (forwards or backwards), this makes
it fall back, effectively, to vector clocks.
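
The rule above can be sketched in Python as follows. This is an illustration under assumptions rather than git-annex's implementation: the function name `next_timestamp`, its parameters, and the use of plain numbers for timestamps are all hypothetical.

```python
import time


def next_timestamp(old_timestamp, now=None):
    """Choose the timestamp to record with new information.

    old_timestamp: timestamp of the old value in the log, or None if the
                   log has no value yet (an assumption of this sketch).
    now:           current time; defaults to the wall clock.
    """
    if now is None:
        now = time.time()
    if old_timestamp is not None and old_timestamp >= now:
        # Clock skew: the old entry is not in the past, so fall back to
        # incrementing it, which behaves like a vector clock.
        return old_timestamp + 1
    # Normal case: the current time is newer than the old entry.
    return now


print(next_timestamp(1000, now=2000))  # 2000 (no skew, use current time)
print(next_timestamp(5000, now=2000))  # 5001 (old entry is in the future)
print(next_timestamp(2000, now=2000))  # 2001 (equal timestamps also increment)
```

Either way the new entry gets a strictly larger timestamp than the old one, so it wins over the value it replaces.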

This would work for both kinds of logs. For configuration changes,
it's kind of better than using only vector clocks, because in the absence
of clock skew, the most recent change to a configuration wins. For state
changes, it keeps the benefits of timestamps except when there's clock
skew, in which case timestamps no longer have any benefits,
so vector clocks are the best that can be done. --[[Joey]]

(How would `GIT_ANNEX_VECTOR_CLOCK` interact with this? Maybe, when that's
set to a low number, it would be treated as the current time. That would
let it be used in some runs and not in others, without issues, and also
would let it be set to a low number once, and not need to be changed, since
git-annex would increment as necessary.)

> The `vectorclock` branch has this mostly implemented. --[[Joey]]