Eg, before with a .gitattributes like:
*.2 annex.numcopies=2
*.1 annex.numcopies=1
And foo.1 and foo.2 having the same content and key, git-annex drop foo.1 foo.2
would succeed, leaving just 1 copy, despite foo.2 needing 2 copies.
It dropped foo.1 first and then skipped foo.2 since its content was gone.
Now that the keys database includes locked files, this longstanding wart
can be fixed.
Sponsored-by: Noam Kremen on Patreon
Most of this is just refactoring. But, handleDropsFrom
did not verify that associated files from the keys db were still
accurate, and has now been fixed to.
A minor improvement to this would be to avoid calling catKeyFile
twice on the same file, when getting the numcopies and mincopies value,
in the common case where the same file has the highest value for both.
But, it avoids checking every associated file, so it will scale well to
lots of dups already.
Sponsored-by: Kevin Mueller on Patreon
This was an old problem when the files were being added unlocked,
so the changelog mentions that being fixed. However, recently it's also
affected locked files.
The fix for locked files is kind of stupidly simple. moveAnnex already
handles populating unlocked files, and only does it when the object file
was not already present. So remove the redundant populateUnlockedFiles
call. (That call was added all the way back in
cfaac52b88, and has always been
unncessary.)
Sponsored-by: Dartmouth College's Datalad project
moveAnnex only gets to that check if the object file was not present
before. So in the case where dup files are being added repeatedly,
it will only run the first time, and so there's no significant speedup
from doing it; all it avoids is a single sqlite lookup. Since MVar
accesses do have overhead, it's better to optimise for the common case,
where unlocked files are supported.
removeAnnex is less clear cut, but I think mostly is skipped running on
keys when the object has already been dropped, so similar reasoning
applies.
When the log has an activity that is not known, eg added by a future
version of git-annex, it used to be treated as no activity at all,
which would make git-annex expire think it should expire the repository,
despite it having some kind of recent activity.
Hopefully there will be no reason to add a new activity until enough
time has passed that this commit is in use everywhere.
Sponsored-by: Jake Vosloo on Patreon
This will mostly just avoid a DB lookup, so things get marginally
faster. But in cases where there are many files using the same key, it
can be a more significant speedup.
Added overhead is one MVar lookup per call, which should be small
enough, since this happens after transferring or ingesting a file,
which is always a lot more work than that. It would be nice, though,
to move getGitConfig to AnnexRead, which there is an open todo about.
Clear visible progress bar first.
Removed showSideActionAfter because it can't be used in reconcileStaged
(import loop). Instead, it counts the number of files it
processes and displays it after it's seen a sufficient to know it's
taking a while.
Sponsored-by: Dartmouth College's Datalad project
This makes git checkout and git merge hooks do the work to catch up with
changes that they made to the tree. Rather than doing it at some later
point when the user is not thinking about that past operation.
Sponsored-by: Dartmouth College's Datalad project
reconcileStaged was doing a redundant scan to scannAnnexedFiles.
It would probably make sense to move the body of scannAnnexedFiles
into reconcileStaged, the separation does not really serve any purpose.
Sponsored-by: Dartmouth College's Datalad project
Commit 428c91606b made it need to do more
work in situations like switching between very different branches.
Compare with seekFilteredKeys which has a similar optimisation. Might be
possible to factor out the common part from these?
Sponsored-by: Dartmouth College's Datalad project
This is quite a subtle edge case, see the bug report for full details.
The second git diff is needed only when there's a merge conflict.
It would be possible to speed it up marginally by using
--diff-filter=Unmerged, but probably not enough to bother with.
Sponsored-by: Graham Spencer on Patreon
When this option is not used, there should be effectively no added
overhead, thanks to the optimisation in
b3cd0cc6ba.
When an action fails on a file, the size of the file still counts toward
the size limit. This was necessary to support concurrency, but also
generally seems like the right choice.
Most commands that operate on annexed files support the option.
export and import do not, and I don't know if it would make sense for
export to.. Why would you want an incomplete export? sync doesn't, and
while it would be easy to make it support it for transferring files,
it's not clear if dropping files should also take the size limit into
account. Commands like add that don't operate on annexed files don't
support the option either.
Exiting 101 not yet implemented.
Sponsored-by: Denis Dzyubenko on Patreon
Avoids users thinking this scan is a big deal, when it's not in the
majority of repos.
showSideActionAfter has some ugly caveats, since it has to display in
the background of another action. I could not see a better way to do it
and it works fine in this particular case. It also doesn't really belong
in Annex.Concurrent, but cannot go in Messages due to an import loop.
Sponsored-by: Dartmouth College's Datalad project
It makes sense to keep the key used by the old version of an
associated file, until the merge conflict is resolved.
Note that, since in this case git diff is being run with --index, it's
not possible to use -1 or -3, which would let the keys
associated with the new versions of the file also be added. That would
be better, because it's possible that the local modification to the file
that caused the merge conflict has not yet gotten its new key recorded
in the db.
Opened a bug about a case this is thus not able to address.
Sponsored-by: Boyd Stephen Smith Jr. on Patreon
There seems to be no reason to check the time here. I think it was
inherited from code in Database.Fsck, which does have a reason to commit
every few minutes. Removing that syscall speeds up a git-annex init
in a repo with 100000 annexed files by about 3 seconds.
Sponsored-by: Dartmouth College's Datalad project
Streaming through git this way speeds it up by around 25%. This is
similar to the optimisations of seeking annexed files.
Sponsored-by: Dartmouth College's Datalad project
The normalisation of filenames turns out to be the tricky part here,
because the associated files coming out of the keys db may look like
"./foo/bar" or "../bar". For the former to match a glob like "foo/*",
it needs to be normalised.
Note that, on windows, normalise "./foo/bar" = "foo\\bar"
which a glob like "foo/*" won't match. So the glob is matched a second
time, on the toInternalGitPath, so allowing the user to provide a glob
with the slashes in either direction. However, this still won't support
some wacky edge cases like the user providing a glob of "foo/bar\\*"
Sponsored-by: Dartmouth College's Datalad project
* drop: When two files have the same content, and a preferred content
expression matches one but not the other, do not drop the file.
* sync --content, assistant: Fix an edge case where a file that is not
preferred content did not get dropped.
The sync --content edge case is that handleDropsFrom loaded associated files
and used them without verifying that the information from the database was
not stale.
It seemed best to avoid changing --want-drop's behavior, this way when
debugging a preferred content expression with it, the files matched will
still reflect the expression. So added a note to the --want-drop documentation,
to make clear it may not behave identically to git-annex drop --auto.
While it would be possible to introspect the preferred content
expression to see if it matches on filenames, and only look up the
associated files when it does, it's generally fairly rare for 2 files to
have the same content, and the database lookup is already avoided when
there's only 1 file, so I did not implement that further optimisation.
Note that there are still some situations where the associated files
database does not get locked files recorded in it, which will prevent
this fix from working.
Sponsored-by: Dartmouth College's Datalad project
Before only unlocked files were included.
The initial scan now scans for locked as well as unlocked files. This
does mean it gets a little bit slower, although I optimised it as well
as I think it can be.
reconcileStaged changed to diff from the current index to the tree of
the previous index. This lets it handle deletions as well, removing
associated files for both locked and unlocked files, which did not
always happen before.
On upgrade, there will be no recorded previous tree, so it will diff
from the empty tree to current index, and so will fully populate the
associated files, as well as removing any stale associated files
that were present due to them not being removed before.
reconcileStaged now does a bit more work. Most of the time, this will
just be due to running more often, after some change is made to the
index, and since there will be few changes since the last time, it will
not be a noticable overhead. What may turn out to be a noticable
slowdown is after changing to a branch, it has to go through the diff
from the previous index to the new one, and if there are lots of
changes, that could take a long time. Also, after adding a lot of files,
or deleting a lot of files, or moving a large subdirectory, etc.
Command.Lock used removeAssociatedFile, but now that's wrong because a
newly locked file still needs to have its associated file tracked.
Command.Rekey used removeAssociatedFile when the file was unlocked.
It could remove it also when it's locked, but it is not really
necessary, because it changes the index, and so the next time git-annex
run and accesses the keys db, reconcileStaged will run and update it.
There are probably several other places that use addAssociatedFile and
don't need to any more for similar reasons. But there's no harm in
keeping them, and it probably is a good idea to, if only to support
mixing this with older versions of git-annex.
However, mixing this and older versions does risk reconcileStaged not
running, if the older version already ran it on a given index state. So
it's not a good idea to mix versions. This problem could be dealt with
by changing the name of the gitAnnexKeysDbIndexCache, but that would
leave the old file dangling, or it would need to keep trying to remove
it.
They're only needed to cover a gc edge case, and it's better someone
gets caught by that edge case than that someone who does not know about
them ends up with a filtered git-annex branch that contains such a tree
when some of the files listed in it are ones they wanted to *remove*
from the repository.
It's not currently possible to exclude a sameas repo using its
annex-config-uuid. (Remote.nameToUUID rejects them).
Since there's no real documented way to learn those, this seems ok, at
least for now. Also it avoids the problem of someone excluding the
parent but including the sameas, which would probably make the sameas
repo not usable when using the filtered branch.
Added a note to man page about what happens to information that is
recorded in the private journal. Since it uses Branch.get, that
information will be copied when options allow. It seemed better to allow
it and document it than not allow it, since the options allow excluding
repositories and so can be used to exclude private repos if desired.
This is less erorr-prone, and easier for the user to reason about; it
preserves the man page's promise that only explicitly included
information will be copied.
ghc 8.8.4 seems to have changed something that broke code that has been
successfully using forkProcess since 2012. Likely a change to GC internals.
Since forkProcess has never had clear documentation about how to
use it safely, avoid using it at all. Instead, when git-annex needs to
daemonize itself, re-run the git-annex command, in a new process group
and session.
This commit was sponsored by Luke Shumaker on Patreon.