In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.
writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.
I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.
This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.
Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.
Sponsored-by: Joshua Antonishen on Patreon
Minor optimisation, but a win in every case, except for a couple where
it's a wash.
Note that replaceFile still takes a FilePath, because it needs to
operate on Chars to truncate unicode filenames properly.
Fix a hang that occasionally occurred during commands such as move.
(A bug introduced in 10.20220927, in
commit 6a3bd283b8)
The restage.log was kept locked while running a complex index refresh
action. In an unusual situation, that action could need to write to the
restage log, which caused a deadlock.
The solution is a two-stage process. First the restage.log is moved to a
work file, which is done with the lock held. Then the content of the work
file is read and processed, which happens without the lock being held.
This is all done in a crash-safe manner.
Note that streamRestageLog may not be fully safe to run concurrently
with itself. That's ok, because restagePointerFiles uses it with the
index lock held, so only one can be run at a time.
streamRestageLog does delete the restage.old file at the end without
locking. If a calcRestageLog is run concurrently, it will either see the
file content before it was deleted, or will see it's missing. Either is
ok, because at most this will cause calcRestageLog to report more
work remains to be done than there is.
Sponsored-by: Dartmouth College's Datalad project
move: Fix openFile crash with -J
This does make them a bit slower, although usually the log file is not
very big, so even when it's being rewritten, they will not block for
long taking the lock. Still, little slowdowns may add up when moving a lot
file files.
A less expensive fix would be to use something lower level than openFile
that does not check if the file is already open for write by another
thread. But GHC does not seem to provide anything convenient; even mkFD
checks for a writing thread.
fullLines is no longer necessary since these functions no longer will
read the file while it's being written.
Sponsored-by: Dartmouth College's DANDI project
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.
Currently, this lets a git-annex get/drop command be interrupted and
then re-ran, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.
Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes, when eg
dropping a large number of files, currently the git queue flushes
periodically, and so it restages incrementally, rather than all at the
end.
In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.
Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok, it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.
Sponsored-by: Dartmouth College's DANDI project
Like the comment says, this works without locking. It looks like I
originally copied another function and forgot to remove the locking.
Sponsored-by: Dartmouth College's DANDI project
WIP: This is mostly complete, but there is a problem: createDirectoryUnder
throws an error when annex.dbdir is set to outside the git repo.
annex.dbdir is a workaround for filesystems where sqlite does not work,
due to eg, the filesystem not properly supporting locking.
It's intended to be set before initializing the repository. Changing it
in an existing repository can be done, but would be the same as making a
new repository and moving all the annexed objects into it. While the
databases get recreated from the git-annex branch in that situation, any
information that is in the databases but not stored in the branch gets
lost. It may be that no information ever gets stored in the databases
that cannot be reconstructed from the branch, but I have not verified
that.
Sponsored-by: Dartmouth College's Datalad project
Since it was used on both worktree and .git/annex files, split into
multiple functions.
In passing, this also improves permissions of created directories in
.git/annex, using createAnnexDirectory on those.
Prevents merging the import from deleting the non-preferred files from
the branch it's merged into.
adjustTree previously appended the new list of items to the old, which
could result in it generating a tree with multiple files with the same
name. That is not good and confuses some parts of git. Gave it a
function to resolve such conflicts.
That allowed dealing with the problem of what happens when the import
contains some files (or subtrees) with the same name as files that were
filtered out of the export. The files from the import win.
The smuge filter no longer provides git with annexed file content, to
avoid a git memory leak, and because that did not honor annex.thin.
git annex smudge --update has to be run after a checkout to update
unlocked files in the working tree with annexed file contents.
No hooks yet to run it.
This commit was sponsored by Nick Piper on Patreon.
Fix more places where files in .git/annex/ were written with modes that
did not take the core.sharedRepository config into account.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
git grep writeFile finds some more that might also be problems, but
for now I've concentrated on .git/annex/ log files. There are certianly
cases where writeFile is not a problem too.
This commit was sponsored by mo on Patreon.