Testing b9ac585454, it didn't find the
optimal union merge, the second sha was the one to use, at least in
the case I tried. Let's just try all shas to see if any can be reused.
I stopped using the expensive nub, so despite the use of sets to
sort/uniq file contents, this is probably as fast or faster than it
was before.
Always merge the git-annex branch into .git/annex/index before making a
commit from the index.
This ensures that, when the branch has been changed in any way
(by a push being received, or changes pulled directly into it, or
even by the user checking it out, and committing a change), the index
reflects those changes.
This is much too slow; it needs to be optimised to only update the
index when the branch has really changed, not every time.
Also, there is an unhandled race, when a change is made to the branch
right after the index gets updated. I left it in for now because it's
unlikely and I didn't want to complicate things with additional locking
yet.
Added files don't have to be committed before they can be unannexed.
unannex no longer commits existing staged changes
unannex of the last file in a directory now works, before it failed because
git rm deleted the directory out from under it,
There are several places where it's assumed a key can be written on one
line. One is in the format of the .git/annex/unused files. The difficult
one is that filenames derived from keys are fed into git cat-file --batch,
which has a line based input. (And no -z option.)
So, for now it's best to block such keys being created.
When storing content in bare repositories, use the hashDirLower
directories. Bare repositories can be on USB drives, which might
use the FAT filesystem, and fall afoul of recent bugs in linux's handling
of mixed case on FAT. Using hashDirLower avoids that.
The only fully supported thing is to have the main repository on one disk,
and .git/annex on another. Only commands that move data in/out of the annex
will need to copy it across devices.
There is only partial support for putting arbitrary subdirectories of
.git/annex on different devices. For one thing, but this can require more
copies to be done. For example, when .git/annex/tmp is on one device, and
.git/annex/journal on another, every journal write involves a call to
mv(1). Also, there are a few places that make hard links between various
subdirectories of .git/annex with createLink, that are not handled.
In the common case without cross-device, the new moveFile is actually
faster than renameFile, avoiding an unncessary stat to check that a file
(not a directory) is being moved. Of course if a cross-device move is
needed, it is as slow as mv(1) of the data.
The bug was that with --json, output lines were sometimes doubled. For
example, git annex init --json would output two lines, despite only running
one thing. Adding to the weirdness, this only occurred when the output
was redirected to a pipe or a file.
Strace showed two processes outputting the same buffered output.
The second process was this writer process (only needed to work around
bug #624389):
_ <- forkProcess $ do
hPutStr toh $ unlines paths
hClose toh
exitSuccess
The doubled output occurs when this process exits, and ghc flushes the
inherited stdout buffer. Why only when piping? I don't know, but ghc may
be behaving differently when stdout is not a terminal.
While this is quite possibly a ghc bug, there is a nice fix in git-annex.
Explicitly flushing after each chunk of json is output works around the
problem, and as a side effect, json is streamed rather than being output
all at the end when performing an expensive operaition.
However, note that this means all uses of putStr in git-annex must be
explicitly flushed. The others were, already.
It would be nice if command-specific options were supported. The first
difficulty is that which command is being called is not known until after
getopt; but that could be worked around by finding the first non-dashed
parameter. Storing the settings without putting them in the annex monad is
the next difficulty; it could perhaps be handled by making the seek stage
pass applicable settings into the start stage (and from there on to perform
as needed). But that still leaves a problem, what data type to use to
represent the options between getopt and seek?
Left out the backend usage graph for now, and bad/temp directory sizes
are only displayed when present. Also, disk usage is returned as a string
with units, which I can see changing later.
This is actually tricky, 45bbf210a1 added
the escaping because it's needed for rsync that does go over ssh.
So I had to detect whether the remote's rsync url will use ssh or not,
and vary the escaping.
Before, a merge was first calculated, by running various actions that
called git and built up a list of lines, which were at the end sent
to git update-index. This necessarily used space proportional to the size
of the diff between the trees being merged.
Now, lines are streamed into git update-index from each of the actions in
turn.
Runtime size of git-annex merge when merging 50000 location log files
drops from around 100 mb to a constant 4 mb.
Presumably it runs quite a lot faster, too.
Avoids doing auto-merging in commands that don't need fully current
information from the git-annex branch. In particular, git annex add no
longer needs to auto-merge. Affected commands: Anything that doesn't
look up data from the branch, but does write a change to it.
It might seem counterintuitive that we can change a value without first
making sure we have the current value. This optimisation works because
these two sequences are equivilant:
1. pull from remote
2. union merge
3. read file from branch
4. modify file and write to branch
vs.
1. read file from branch
2. modify file and write to branch
3. pull from remote
4. union merge
After either sequence, the git-annex branch contains the same logical content
for the modified file. (Possibly with lines in a different order or
additional old lines of course).
More accurately, it was supported already when map uses git-annex-shell,
but not when it does not.
Note that the user name cannot be shell escaped using git-annex's current
approach for shell escaping. I tried and some shells like dash cannot
cd ~'joey'. Rest of directory is still shell escaped, not for security but
in case a directory has a space or other weird character.
This is my own damn fault for not making UUID a real type, and then relying
on the type checker to ensure my refactoring was correct -- which it wasn't!
I should probably add code to clean up bogus entries in the uuid.log, but
right now I want to get the fix out there to prevent people experiencing
this bug.
I should also make UUID a real data type.
Thanks Valentin Haenel for a test case showing how non-fast-forward merges
could result in an ongoing pull/merge/push cycle.
While the git-annex branch is fast-forwarded, git-annex's index file is still
updated using the union merge strategy as before. There's no other way to
update the index that would be any faster.
It is possible that a union merge and a fast-forward result in different file
contents: Files should have the same lines, but a union merge may change
their order. If this happens, the next commit made to the git-annex branch
will have some unnecessary changes to line orders, but the consistency
of data should be preserved.
Note that when the journal contains changes, a fast-forward is never attempted,
which is fine, because committing those changes would be vanishingly unlikely
to leave the git-annex branch at a commit that already exists in one of
the remotes.
The real difficulty is handling the case where multiple remotes have all
changed. git-annex does find the best (ie, newest) one and fast forwards
to it. If the remotes are diverged, no fast-forward is done at all. It would
be possible to pick one, fast forward to it, and make a merge commit to
the rest, I see no benefit to adding that complexity.
Determining the best of N changed remotes requires N*2+1 calls to git-log, but
these are fast git-log calls, and N is typically small. Also, typically
some or all of the remote refs will be the same, and git-log is not called to
compare those. In the real world I expect this will almost always add only
1 git-log call to the merge process. (Which already makes N anyway.)