Check just before running update-index if the worktree file's content is
still the same, don't update it when it's been modified. This narrows
the race window a lot, from possibly minutes or hours, to seconds or
less.
(Use replaceFile so that the worktree update happens atomically,
allowing the InodeCache of the new worktree file to itself be gathered
w/o any other race.)
This doesn't eliminate the race; it can still occur in the window before
update-index runs. When annex.queue is large, a lot of files will be
statted by the checks, and so the window may still be large enough to be a
problem.
When only a few files are being processed, the window is as small as it
is in the race where a modification gets overwritten by git-annex when
it updates the worktree. Or maybe as small as whatever race git
checkout/pull/merge may have when the worktree gets modified during it.
Still, I've kept a todo about this race.
This commit was supported by the NSF-funded DataLad project.
Use git update-index --refresh, since it's a little bit more
efficient and the user can be told to run it if a locked index prevents
git-annex from running it.
This also fixes the problem where an annexed file was deleted in the index
and a get of another file that uses the same key caused the index update to
add back the deleted file. update-index will not add back the deleted file.
Documented in tips/unlocked_files.mdwn the gotcha that the index update
may conflict with other operations. I can't see any way to possibly avoid
that conflict.
One new todo about a race that causes a modification to be accidentially
staged.
Note that the assistant only flushes the git command queue when it
commits a modification. I have not tested the assistant with v6 unlocked
files, but assume most users of the assistant won't care if the index
shows a file as modified for a while.
This commit was supported by the NSF-funded DataLad project.
After updating the worktree for an add/drop, update git's index, so git
status will not show the files as modified.
What actually happens is that the index update removes the inode
information from the index. The next git status (or similar) run
then has to do some work. It runs the clean filter.
So, this depends on the clean filter being reasonably fast and on git
not leaking memory when running it. Both problems were fixed in
a96972015d, but only for git 2.5. Anyone
using an older git will see very expensive git status after an add/drop.
This uses the same git update-index queue as other parts of git-annex, so
the actual index update is fairly efficient. Of course, updating the index
does still have some overhead. The annex.queuesize config will control how
often the index gets updated when working on a lot of files.
This is an imperfect workaround... Added several todos about new
problems this workaround causes. Still, this seems a lot better than the
old behavior.
This commit was supported by the NSF-funded DataLad project.
v6 add: Take advantage of improved SIGPIPE handler in git 2.5 to speed up
the clean filter by not reading the file content from the pipe. This also
avoids git buffering the whole file content in memory.
When built with an older git, still consumes stdin. If built with a newer
git and used with an older one, it breaks, but that's acceptable --
checking the git version every time would make repeated smudge runs slow.
This commit was supported by the NSF-funded DataLad project.
gmane's disk crashed, I found one thread in another archive, but could
not find my whole patch set in any archive (perhaps some of the messages
were too long), so pulled it out of my personal mail archives.
This commit was supported by the NSF-funded DataLad project.
The benchmark shows that the database access is quite fast indeed!
And, it scales linearly to the number of keys, with one exception,
getAssociatedKey.
Based on this benchmark, I don't think I need worry about optimising
for cases where all files are locked and the database is mostly empty.
In those cases, database access will be misses, and according to this
benchmark, should add only 50 milliseconds to runtime.
(NB: There may be some overhead to getting the database opened and locking
the handle that this benchmark doesn't see.)
joey@darkstar:~/src/git-annex>./git-annex benchmark
setting up database with 1000
setting up database with 10000
benchmarking keys database/getAssociatedFiles from 1000 (hit)
time 62.77 μs (62.70 μs .. 62.85 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 62.81 μs (62.76 μs .. 62.88 μs)
std dev 201.6 ns (157.5 ns .. 259.5 ns)
benchmarking keys database/getAssociatedFiles from 1000 (miss)
time 50.02 μs (49.97 μs .. 50.07 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 50.09 μs (50.04 μs .. 50.17 μs)
std dev 206.7 ns (133.8 ns .. 295.3 ns)
benchmarking keys database/getAssociatedKey from 1000 (hit)
time 211.2 μs (210.5 μs .. 212.3 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 211.0 μs (210.7 μs .. 212.0 μs)
std dev 1.685 μs (334.4 ns .. 3.517 μs)
benchmarking keys database/getAssociatedKey from 1000 (miss)
time 173.5 μs (172.7 μs .. 174.2 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 173.7 μs (173.0 μs .. 175.5 μs)
std dev 3.833 μs (1.858 μs .. 6.617 μs)
variance introduced by outliers: 16% (moderately inflated)
benchmarking keys database/getAssociatedFiles from 10000 (hit)
time 64.01 μs (63.84 μs .. 64.18 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 64.85 μs (64.34 μs .. 66.02 μs)
std dev 2.433 μs (547.6 ns .. 4.652 μs)
variance introduced by outliers: 40% (moderately inflated)
benchmarking keys database/getAssociatedFiles from 10000 (miss)
time 50.33 μs (50.28 μs .. 50.39 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 50.32 μs (50.26 μs .. 50.38 μs)
std dev 202.7 ns (167.6 ns .. 252.0 ns)
benchmarking keys database/getAssociatedKey from 10000 (hit)
time 1.142 ms (1.139 ms .. 1.146 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.142 ms (1.140 ms .. 1.144 ms)
std dev 7.142 μs (4.994 μs .. 10.98 μs)
benchmarking keys database/getAssociatedKey from 10000 (miss)
time 1.094 ms (1.092 ms .. 1.096 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.095 ms (1.095 ms .. 1.097 ms)
std dev 4.277 μs (2.591 μs .. 7.228 μs)
Several tricky parts:
* When the conflict is just between the same key being locked and unlocked,
the unlocked version wins, and the file is not renamed in this case.
* Need to update associated file map when conflict resolution renames
an unlocked file.
* git merge runs the smudge filter on the conflicting file, and actually
overwrites the file with the same content it had before, and so
invalidates its inode cache. This makes it difficult to know when it's
safe to remove such files as conflict cruft, without going so far as to
compare their entire contents.
Dealt with this by preventing the smudge filter from populating the file
when a merge is run. However, that also prevents the smudge filter being
run for non-conflicting files, so eg moving a file won't put its new
content into place.
* Ideally, if a merge or a merge conflict resolution renames an unlocked
file, the file in the work tree can just be moved, rather than copying
the content to a new worktree file.
This is attempted to be done in merge conflict resolution, but
due to git merge's behavior of running smudge filters, what actually
seems to happen is the old worktree file with the content is deleted and
rewritten as a pointer file, so doesn't get reused.
So, this is probably not as efficient as it optimally could be.
If that becomes a problem, could look into running the merge in a separate
worktree and updating the real worktree more efficiently, similarly to the
direct mode merge. However, the direct mode merge had a lot of bugs, and
I'd rather not use that more error-prone method unless really needed.
Decided it's too scary to make v6 unlocked files have 1 copy by default,
but that should be available to those who need it. This is consistent with
git-annex not dropping unused content without --force, etc.
* Added annex.thin setting, which makes unlocked files in v6 repositories
be hard linked to their content, instead of a copy. This saves disk
space but means any modification of an unlocked file will lose the local
(and possibly only) copy of the old version.
* Enable annex.thin by default on upgrade from direct mode to v6, since
direct mode made the same tradeoff.
* fix: Adjusts unlocked files as configured by annex.thin.
Writes are optimised by queueing up multiple writes when possible.
The queue is flushed after the Annex monad action finishes. That makes it
happen on program termination, and also whenever a nested Annex monad action
finishes.
Reads are optimised by checking once (per AnnexState) if the database
exists. If the database doesn't exist yet, all reads return mempty.
Reads also cause queued writes to be flushed, so reads will always be
consistent with writes (as long as they're made inside the same Annex monad).
A future optimisation path would be to determine when that's not necessary,
which is probably most of the time, and avoid flushing unncessarily.
Design notes for this commit:
- separate reads from writes
- reuse a handle which is left open until program
exit or until the MVar goes out of scope (and autoclosed then)
- writes are queued
- queue is flushed periodically
- immediate queue flush before any read
- auto-flush queue when database handle is garbage collected
- flush queue on exit from Annex monad
(Note that this may happen repeatedly for a single database connection;
or a connection may be reused for multiple Annex monad actions,
possibly even concurrent ones.)
- if database does not exist (or is empty) the handle
is not opened by reads; reads instead return empty results
- writes open the handle if it was not open previously