The old code traversed the list of addtreeitems once per subdirectory in
the tree, so could get quite slow. Converting to Map lookups sped it up
significantly.
In my test case, git-annex import used to take about 2 minutes, when
calling adjustTree to add back excluded files to the imported tree. This
dropped it down to 6 seconds. Of which 4 seconds are the actual
enumeration of the contents of the remote, so really only 2 seconds for
this.
The path prefix map is a bit suboptimal memory-wise, since items get
stored in the map once per subdirectory on the path to the item. It
would perhaps be better to use a tree data structure.
Also it's suboptimal memory-wise that it builds two maps, as well
as retaining a reference to addtreeitems. I could not see a way around
that though.
This is a fixed version of commit 2c86651180.
It fixes a test suite reversion.
Sponsored-by: Jack Hill on Patreon
The old code traversed the list of addtreeitems once per subdirectory in
the tree, so could get quite slow. Converting to Map lookups sped it up
significantly.
In my test case, git-annex import used to take about 2 minutes, when
calling adjustTree to add back excluded files to the imported tree. This
dropped it down to 6 seconds. Of which 4 seconds are the actual
enumeration of the contents of the remote, so really only 2 seconds for
this.
The path prefix map is a bit suboptimal memory-wise, since items get
stored in the map once per subdirectory on the path to the item. It
would perhaps be better to use a tree data structure.
Also it's suboptimal memory-wise that it builds two maps, as well
as retaining a reference to addtreeitems. I could not see a way around
that though.
Sponsored-by: Luke T. Shumaker on Patreon
This will allow distributed migration: Start a migration in one clone of
a repo, and then update other clones.
commitMigration is a bit of a bear.. There is some inversion of control
that needs some TMVars. Also streamLogFile's finalizer does not handle
recording the trees, so an interrupt at just the wrong time can cause
migration.log to be emptied but the git-annex branch not updated.
Sponsored-by: Graham Spencer on Patreon
This gets the trees built, but it does not use them. Next step will be
to remember the tree for next time an import is done, and diff between
old and new trees to find the files that have changed.
Added --missing to the mktree parameters. That only disables a check, so
it's ok to do everywhere mktree is used. It probably also speeds up
mktree to disable the check.
Note that git fsck does not complain about the resulting tree objects
that point to shas that are not in the repository. Even with --strict.
A quick benchmark, importing 10000 files, this slowed it down
from 2:04.06 to 2:04.28. So it will more than pay for itself.
Sponsored-by: Luke Shumaker on Patreon
giveup changed to filter out control characters. (It is too low level to
make it use StringContainingQuotedPath.)
error still does not, but it should only be used for internal errors,
where the message is not attacker-controlled.
Changed a lot of existing error to giveup when it is not strictly an
internal error.
Of course, other exceptions can still be thrown, either by code in
git-annex, or a library, that include some attacker-controlled value.
This does not guard against those.
Sponsored-by: Noam Kremen on Patreon
This improves the borg special remote memory usage, by
letting it only load one archive's worth of filenames into memory at a
time, and building up a larger tree out of the chunks.
When a borg repository has many archives, git-annex could easily OOM
before. Now, it will use only memory proportional to the number of
annexed keys in an archive.
Minor implementation wart: Each new chunk re-opens the content
identifier database, and also a new vector clock is used for each chunk.
This is a minor innefficiency only; the use of continuations makes
it hard to avoid, although putting the database handle into a Reader
monad would be one way to fix it.
It may later be possible to extend the ImportableContentsChunkable
interface to remotes that are not third-party populated. However, that
would perhaps need an interface that does not use continuations.
The ImportableContentsChunkable interface currently does not allow
populating the top of the tree with anything other than subtrees. It
would be easy to extend it to allow putting files in that tree, but borg
doesn't need that so I left it out for now.
Sponsored-by: Noam Kremen on Patreon
Which generated unusual git trees that could confuse git merge,
since they incorrectly had 2 subtrees with the same name.
Root of the bug was a) not testing that at all! but also
b) confusing graftdirs, which contains eg "foo/bar" with
non-recursively read trees, which would contain eg "bar"
when reading a subtree of "foo".
It's worth noting that Annex.Import uses graftTree, but it really
shouldn't have needed to. Eg, when importing into foo/bar from a remote,
it's enough to generate a tree of foo/bar/x, foo/bar/y, and does not
include other files that are at the top of the master branch. It uses
graftTree, so it does include the other files, as well as the foo/bar
tree. git merge will do the same thing for both trees. With that said,
switching it away from graftTree would result in another import
generating a new commit that seems to delete files that were there in a
previous commit, so it probably has to keep using graftTree since it
used it before.
This commit was sponsored by Kevin Mueller on Patreon.
Not yet used, but allows getting the size of items in the tree fairly
cheaply.
I noticed that CmdLine.Seek uses ls-tree and the feeds the files into
another long-running process to check their size. That would be an
example of a place that might be sped up by using this. Although in that
particular case, it only needs to know the size of unlocked files, not
locked. And since enabling --long probably doubles the ls-tree runtime
or more, the overhead of using it there may outwweigh the benefit.
I don't think this actually fixes any buggy behavior in git-annex, I
just noticed that using treeItemToLsTreeItem and then serializing it
resulted in something starting with "160000 blob" rather than
"160000 commit"
In addition to regular file deletions, the removefiles argument passed
to adjustTree may contain removed submodules. When making the new
tree, filter these out in the same way that is done for regular files
so that the deletion is propagated.
When the recorded submodule commit changes on an adjusted branch, the
change is carried in the function that reverseAdjustedCommit passes
for adjustTree's adjusttreeitem parameter. Update the CommitObject
handling in adjustTree to consider adjusttreeitem so that a submodule
change is synced back.
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.
Git.Repo also changed to use RawFilePath for the path to the repo.
This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
Goal is to make git-annex faster by using ByteString for all the
worktree traversal. For now, this is focusing on Command.Find,
in order to benchmark how much it helps. (All other commands are
temporarily disabled)
Currently in a very bad unbuildable in-between state.
Prevents merging the import from deleting the non-preferred files from
the branch it's merged into.
adjustTree previously appended the new list of items to the old, which
could result in it generating a tree with multiple files with the same
name. That is not good and confuses some parts of git. Gave it a
function to resolve such conflicts.
That allowed dealing with the problem of what happens when the import
contains some files (or subtrees) with the same name as files that were
filtered out of the export. The files from the import win.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
Added graftTree but it's buggy.
Should use graftTree in Annex.Branch.graftTreeish; it will be faster
than the current implementation there.
Started Annex.Import, but untested and it doesn't yet handle tree
grafting.
So it will be available later and elsewhere, even after GC.
I first though to use git update-index to do this, but feeding it a line
with a tree object seems to always cause it to generate a git subtree
merge. So, fell back to using the Git.Tree interface to maniupulate the
trees, and not involving the git-annex branch index file at all.
This commit was sponsored by Andreas Karlsson.
Restarting a crashing git process could result in filename encoding issues
when not in a unicode locale, as the restarted processes's handles were not
read in raw mode.
Since rawMode is always used when starting a coprocess, didn't bother
to parameterise it and just always enable it for simplicity.
This commit was sponsored by Jake Vosloo on Patreon.
When adding a tree like a/b/c/d when a/b already exists, fixes the bug that
the tree that got created was a/b/a/b/c/d
Just need to flatten out the top N directories of the tree that's being
grafted in, so we get the c/d part. This was complicated by the Tree
data type being a rose tree rather than a regular tree.
This commit was sponsored by Nick Daly on Patreon.
The modification flag was not being set when making modifications deep
in a tree, so parent trees were not updated to contain the modified tree.
Seems to have exposed another bug where the wrong filename gets grafted in.
This commit was sponsored by Brock Spratlen on Patreon.
Only reverse adjust the changes in the commit, which means that adjustments
do not need to be generally cleanly reversable.
For example, an adjustment can unlock all locked files, but does not need
to worry about files that were originally unlocked when reversing, because
it will only ever be run on files that have been changed. So, it's ok
if it locks all files when reversed, or even leaves all files as-is when
reversed.
extractTree has to parse the whole input list in order to generate a tree,
so convert interface to non-streaming.
Some quick memory benchmarks in a repo with 60k files
don't look too bad despite not streaming.
To stream, without building up a whole tree object, one way would
be a new interface:
adjustTree :: MonadIO m :: (TreeItem -> m (Maybe TreeItem)) -> Ref -> Repo -> m Sha
This would only need to buffer tree objects from the current one down
to the root, in order to update trees when a TreeItem is changed.
But, while it supports changing items in the tree, and removing items,
it does not support adding new items, or moving items from one directory to
another.