git-annex test hung, at varying points depending
on when git decided to run the smudge clean filter.
Recent changes to reconcileStaged caused a deadlock when git write-tree,
for some reason, decides to run the smudge clean filter. The filter tries
to open the keys db, and blocks waiting for the lock file that its
grandparent has locked.
I don't know why git write-tree does that. It's supposed to only write a
tree from the index which needs no smudge/clean filtering.
I've verified that, in a situation where git write-tree runs the clean
filter, disabling the filter results in a tree being written that
contains the annex link, not, e.g., the worktree file content. So it seems
safe to disable the clean filter. However, this is likely working around
a bug in git, since git seems to be running the clean filter in a
situation where the object has already been cleaned.
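
As a rough illustration of what "disabling the filter" for a single
write-tree call could look like, here is a minimal sketch. It assumes the
filter driver is named "annex" and that overriding its settings with empty
values via git -c turns it off for that one invocation. The helper names
are hypothetical, and overriding smudge and process in addition to clean is
an assumption, not something stated above.

```python
import subprocess
from typing import List


def write_tree_argv(filter_name: str = "annex") -> List[str]:
    """Build a `git write-tree` command line with the filter driver's
    commands overridden to empty strings (assumed to disable them)."""
    overrides: List[str] = []
    for setting in ("clean", "smudge", "process"):
        overrides += ["-c", f"filter.{filter_name}.{setting}="]
    return ["git", *overrides, "write-tree"]


def write_tree(repo_dir: str) -> str:
    """Run write-tree in repo_dir and return the resulting tree id."""
    result = subprocess.run(
        write_tree_argv(),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Because the override is passed on the command line, it only affects this
one git process and leaves the repository's configuration untouched.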
Sponsored-by: Dartmouth College's Datalad project
When the keys db was opened for read and did not exist yet, it used to
skip creating it and return mempty values. But that prevents
reconcileStaged from populating associated files information in time for
the read. This fixes the one remaining case I know of where
the fix in a56b151f90 didn't work.
Note that, when there is a permissions error, it still avoids creating
the db and returns mempty for all queries. This does mean that
reconcileStaged does not run and so it may want to drop files that it
should not. However, presumably a permissions error on the keys database
also means that the user does not have permission to delete annex
objects, so they won't be able to drop the files anyway.
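
The open-for-read behaviour described above can be sketched roughly as
follows. This is not the actual implementation: the KeysDb class, the table
layout, and the use of sqlite3 from Python are all assumptions made for
illustration. The point is only the control flow: create the database when
it is missing, and fall back to empty results when it cannot be opened.

```python
import sqlite3
from typing import List, Optional


class KeysDb:
    """Hypothetical handle; conn is None when the db could not be opened."""

    def __init__(self, conn: Optional[sqlite3.Connection]) -> None:
        self.conn = conn

    def associated_files(self, annex_key: str) -> List[str]:
        # Without a usable database, every query returns an empty result.
        if self.conn is None:
            return []
        rows = self.conn.execute(
            "SELECT path FROM associated_files WHERE annex_key = ? ORDER BY path",
            (annex_key,),
        )
        return [row[0] for row in rows]


def open_for_read(db_path: str) -> KeysDb:
    """Open the db for reading, creating it first if it does not exist."""
    try:
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS associated_files (annex_key TEXT, path TEXT)"
        )
        conn.commit()
    except (sqlite3.OperationalError, OSError):
        # For example a permissions error: do not create anything.
        return KeysDb(None)
    return KeysDb(conn)
```

With a handle created this way, a reconcile step can populate the database
before the first read, while the failure case still answers every query
with an empty list.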
Sponsored-by: Dartmouth College's Datalad project
* drop: When two files have the same content, and a preferred content
expression matches one but not the other, do not drop the file.
* sync --content, assistant: Fix an edge case where a file that is not
preferred content did not get dropped.
The sync --content edge case is that handleDropsFrom loaded associated files
and used them without verifying that the information from the database was
not stale.
It seemed best to avoid changing --want-drop's behavior. This way, when
debugging a preferred content expression with it, the files matched will
still reflect the expression. So added a note to the --want-drop documentation
to make clear it may not behave identically to git-annex drop --auto.
While it would be possible to introspect the preferred content
expression to see if it matches on filenames, and only look up the
associated files when it does, it's generally fairly rare for 2 files to
have the same content, and the database lookup is already avoided when
there's only 1 file, so I did not implement that further optimisation.
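
The drop rule described above can be sketched like this. It is a
simplified model, not the real code: safe_to_auto_drop and wanted_in_keep
are hypothetical names, and the preferred content expression is reduced to
a predicate on a file name.

```python
from typing import Callable, Iterable


def safe_to_auto_drop(
    associated_files: Iterable[str],
    wanted: Callable[[str], bool],
) -> bool:
    """Content may be dropped only if no file sharing it is wanted."""
    return not any(wanted(path) for path in associated_files)


def wanted_in_keep(path: str) -> bool:
    """Stand-in for a preferred content expression matching on file name."""
    return path.startswith("keep/")
```

For example, safe_to_auto_drop(["tmp/a.bin", "keep/a.bin"], wanted_in_keep)
is False, because the second file with the same content is still wanted.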
Note that there are still some situations where the associated files
database does not get locked files recorded in it, which will prevent
this fix from working.
Sponsored-by: Dartmouth College's Datalad project
All code that uses associated files already deals with this problem,
which used to be worse. Unfortunately I was not able to entirely
eliminate it, although it happens in fewer cases now.
E.g., when git commit runs the smudge filter.
Commit 428c91606b introduced the crash,
as write-tree fails in those situations. Now it will work, and git-annex
always gets up-to-date information even in those situations. It does
need to do a bit more work, each time git-annex is run with the index
locked. Although if the index is unmodified from the last time
write-tree succeeded, that work is avoided.
Before, only unlocked files were included.
The initial scan now scans for locked as well as unlocked files. This
does mean it gets a little bit slower, although I optimised it as well
as I think it can be.
reconcileStaged changed to diff from the current index to the tree of
the previous index. This lets it handle deletions as well, removing
associated files for both locked and unlocked files, which did not
always happen before.
On upgrade, there will be no recorded previous tree, so it will diff
from the empty tree to current index, and so will fully populate the
associated files, as well as removing any stale associated files
that were present due to them not being removed before.
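
A toy model of this reconciliation, under several assumptions: trees are
represented as plain dicts from path to key, the associated files table is
a dict from key to a set of paths, and adding a path replaces any stale
association that path had with another key. None of these names or
structures come from the actual code.

```python
from typing import Dict, Optional, Set

Tree = Dict[str, str]  # path -> key


def reconcile(
    previous: Optional[Tree],
    current: Tree,
    assoc: Dict[str, Set[str]],
) -> None:
    """Update assoc in place from the diff between previous and current."""
    # With no recorded previous tree, diff from the empty tree.
    old: Tree = previous or {}

    # Paths deleted or changed since the previous tree lose their old key.
    for path, key in old.items():
        if current.get(path) != key:
            assoc.get(key, set()).discard(path)

    # Paths added or changed gain their new key.
    for path, key in current.items():
        if old.get(path) != key:
            # Assumption: a path belongs to one key, so clear stale entries.
            for other_key, paths in assoc.items():
                if other_key != key:
                    paths.discard(path)
            assoc.setdefault(key, set()).add(path)

    # Forget keys that no longer have any associated file.
    for key in [k for k, paths in assoc.items() if not paths]:
        del assoc[key]
```

Passing None as the previous tree models the upgrade case: every path in
the current index shows up as an addition, so the table is fully populated.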
reconcileStaged now does a bit more work. Most of the time, this will
just be due to running more often, after some change is made to the
index, and since there will be few changes since the last time, it will
not be a noticeable overhead. What may turn out to be a noticeable
slowdown is after changing to a branch: it has to go through the diff
from the previous index to the new one, and if there are lots of
changes, that could take a long time. The same applies after adding a
lot of files, deleting a lot of files, moving a large subdirectory, etc.
Command.Lock used removeAssociatedFile, but now that's wrong because a
newly locked file still needs to have its associated file tracked.
Command.Rekey used removeAssociatedFile when the file was unlocked.
It could remove it also when it's locked, but it is not really
necessary, because it changes the index, and so the next time git-annex
runs and accesses the keys db, reconcileStaged will run and update it.
There are probably several other places that use addAssociatedFile and
don't need to any more for similar reasons. But there's no harm in
keeping them, and it probably is a good idea to, if only to support
mixing this with older versions of git-annex.
However, mixing this and older versions does risk reconcileStaged not
running, if the older version already ran it on a given index state. So
it's not a good idea to mix versions. This problem could be dealt with
by changing the name of the gitAnnexKeysDbIndexCache, but that would
leave the old file dangling, or it would need to keep trying to remove
it.
They're only needed to cover a gc edge case, and it's better someone
gets caught by that edge case than that someone who does not know about
them ends up with a filtered git-annex branch that contains such a tree
when some of the files listed in it are ones they wanted to *remove*
from the repository.
It's not currently possible to exclude a sameas repo using its
annex-config-uuid (Remote.nameToUUID rejects them).
Since there's no real documented way to learn those, this seems ok, at
least for now. Also it avoids the problem of someone excluding the
parent but including the sameas, which would probably make the sameas
repo not usable when using the filtered branch.
Added a note to man page about what happens to information that is
recorded in the private journal. Since it uses Branch.get, that
information will be copied when options allow. It seemed better to allow
it and document it than not allow it, since the options allow excluding
repositories and so can be used to exclude private repos if desired.
init: When annex.commitmessage is set, use that message for the commit
that creates the git-annex branch.
This will be used by filter-branch too, and it seems to make sense to let
annex.commitmessage affect it.