Commit graph

50 commits

Author SHA1 Message Date
Joey Hess
c15fa17635
optimise adjustTree when adding many TreeItems (take 2)
The old code traversed the list of addtreeitems once per subdirectory in
the tree, so could get quite slow. Converting to Map lookups sped it up
significantly.

In my test case, git-annex import used to take about 2 minutes, when
calling adjustTree to add back excluded files to the imported tree. This
dropped it down to 6 seconds. Of which 4 seconds are the actual
enumeration of the contents of the remote, so really only 2 seconds for
this.

The path prefix map is a bit suboptimal memory-wise, since items get
stored in the map once per subdirectory on the path to the item. It
would perhaps be better to use a tree data structure.

Also it's suboptimal memory-wise that it builds two maps, as well
as retaining a reference to addtreeitems. I could not see a way around
that though.

This is a fixed version of commit 2c86651180.
It fixes a test suite reversion.

Sponsored-by: Jack Hill on Patreon
2024-01-16 11:53:57 -04:00
Joey Hess
d9f36085c6
Revert "optimise adjustTree when adding many TreeItems"
This reverts commit 2c86651180.

That commit caused a test failure and problably wrong trees to be
imported, so revert until that is fixed.
2024-01-10 16:36:44 -04:00
Joey Hess
2c86651180
optimise adjustTree when adding many TreeItems
The old code traversed the list of addtreeitems once per subdirectory in
the tree, so could get quite slow. Converting to Map lookups sped it up
significantly.

In my test case, git-annex import used to take about 2 minutes, when
calling adjustTree to add back excluded files to the imported tree. This
dropped it down to 6 seconds. Of which 4 seconds are the actual
enumeration of the contents of the remote, so really only 2 seconds for
this.

The path prefix map is a bit suboptimal memory-wise, since items get
stored in the map once per subdirectory on the path to the item. It
would perhaps be better to use a tree data structure.

Also it's suboptimal memory-wise that it builds two maps, as well
as retaining a reference to addtreeitems. I could not see a way around
that though.

Sponsored-by: Luke T. Shumaker on Patreon
2024-01-03 15:07:49 -04:00
Joey Hess
d06aee7ce0
make commitMigration interuption safe
Fixed inversion of control issue, so the tree is recorded
in streamLogFile finalizer.

Sponsored-by: Leon Schuermann on Patreon
2023-12-06 16:29:58 -04:00
Joey Hess
0bd8b17b59
log migration trees to git-annex branch
This will allow distributed migration: Start a migration in one clone of
a repo, and then update other clones.

commitMigration is a bit of a bear.. There is some inversion of control
that needs some TMVars. Also streamLogFile's finalizer does not handle
recording the trees, so an interrupt at just the wrong time can cause
migration.log to be emptied but the git-annex branch not updated.

Sponsored-by: Graham Spencer on Patreon
2023-12-06 15:40:03 -04:00
Joey Hess
7298123520
build git trees using ContentIdentifier to speed up import
This gets the trees built, but it does not use them. Next step will be
to remember the tree for next time an import is done, and diff between
old and new trees to find the files that have changed.

Added --missing to the mktree parameters. That only disables a check, so
it's ok to do everywhere mktree is used. It probably also speeds up
mktree to disable the check.

Note that git fsck does not complain about the resulting tree objects
that point to shas that are not in the repository. Even with --strict.

A quick benchmark, importing 10000 files, this slowed it down
from 2:04.06 to 2:04.28. So it will more than pay for itself.

Sponsored-by: Luke Shumaker on Patreon
2023-05-31 12:46:54 -04:00
Joey Hess
cd544e548b
filter out control characters in error messages
giveup changed to filter out control characters. (It is too low level to
make it use StringContainingQuotedPath.)

error still does not, but it should only be used for internal errors,
where the message is not attacker-controlled.

Changed a lot of existing error to giveup when it is not strictly an
internal error.

Of course, other exceptions can still be thrown, either by code in
git-annex, or a library, that include some attacker-controlled value.
This does not guard against those.

Sponsored-by: Noam Kremen on Patreon
2023-04-10 13:50:51 -04:00
Joey Hess
69f8e6c7c0
ImportableContentsChunkable
This improves the borg special remote memory usage, by
letting it only load one archive's worth of filenames into memory at a
time, and building up a larger tree out of the chunks.

When a borg repository has many archives, git-annex could easily OOM
before. Now, it will use only memory proportional to the number of
annexed keys in an archive.

Minor implementation wart: Each new chunk re-opens the content
identifier database, and also a new vector clock is used for each chunk.
This is a minor innefficiency only; the use of continuations makes
it hard to avoid, although putting the database handle into a Reader
monad would be one way to fix it.

It may later be possible to extend the ImportableContentsChunkable
interface to remotes that are not third-party populated. However, that
would perhaps need an interface that does not use continuations.

The ImportableContentsChunkable interface currently does not allow
populating the top of the tree with anything other than subtrees. It
would be easy to extend it to allow putting files in that tree, but borg
doesn't need that so I left it out for now.

Sponsored-by: Noam Kremen on Patreon
2021-10-08 13:15:22 -04:00
Joey Hess
1dc82f177f
use bytestring filepaths more
This should be more efficient, and allocate less.

Sponsored-by: Graham Spencer on Patreon
2021-10-05 15:44:02 -04:00
Joey Hess
5712a7ef93
fix incomplete pattern match warning
There was not really a bug here, because the 2 lists are always the same
length, but the compiler does not know that.
2021-03-30 12:59:53 -04:00
Joey Hess
4611813ef1
Fix bug importing from a special remote into a subdirectory more than one level deep
Which generated unusual git trees that could confuse git merge,
since they incorrectly had 2 subtrees with the same name.

Root of the bug was a) not testing that at all! but also
b) confusing graftdirs, which contains eg "foo/bar" with
non-recursively read trees, which would contain eg "bar"
when reading a subtree of "foo".

It's worth noting that Annex.Import uses graftTree, but it really
shouldn't have needed to. Eg, when importing into foo/bar from a remote,
it's enough to generate a tree of foo/bar/x, foo/bar/y, and does not
include other files that are at the top of the master branch. It uses
graftTree, so it does include the other files, as well as the foo/bar
tree. git merge will do the same thing for both trees. With that said,
switching it away from graftTree would result in another import
generating a new commit that seems to delete files that were there in a
previous commit, so it probably has to keep using graftTree since it
used it before.

This commit was sponsored by Kevin Mueller on Patreon.
2021-03-26 16:04:36 -04:00
Joey Hess
a8b837aaef
add git ls-tree --long parser
Not yet used, but allows getting the size of items in the tree fairly
cheaply.

I noticed that CmdLine.Seek uses ls-tree and the feeds the files into
another long-running process to check their size. That would be an
example of a place that might be sped up by using this. Although in that
particular case, it only needs to know the size of unlocked files, not
locked. And since enabling --long probably doubles the ls-tree runtime
or more, the overhead of using it there may outwweigh the benefit.
2021-03-23 12:47:00 -04:00
Joey Hess
ed717cf646
fix handling of subtree
I don't think this actually fixes any buggy behavior in git-annex, I
just noticed that using treeItemToLsTreeItem and then serializing it
resulted in something starting with "160000 blob" rather than
"160000 commit"
2021-03-12 13:24:19 -04:00
Joey Hess
4b57e1c0ad
allow adjusttreeitem to remove submodules 2021-03-12 13:19:23 -04:00
Joey Hess
33bcee86f1
avoid using wildcard near bug kyle fixed 2021-01-07 13:44:23 -04:00
Kyle Meyer
fd161da2c2
adjustTree: Consider submodule deletions
In addition to regular file deletions, the removefiles argument passed
to adjustTree may contain removed submodules.  When making the new
tree, filter these out in the same way that is done for regular files
so that the deletion is propagated.
2021-01-07 13:43:09 -04:00
Joey Hess
6c81e0c8f1
ByteString Ref continued
Several nice speed wins I think.

At 340/633 files converted.
2020-04-07 13:27:11 -04:00
Kyle Meyer
376e69ec65
adjust: Propagate submodule changes back to original branch
When the recorded submodule commit changes on an adjusted branch, the
change is carried in the function that reverseAdjustedCommit passes
for adjustTree's adjusttreeitem parameter.  Update the CommitObject
handling in adjustTree to consider adjusttreeitem so that a submodule
change is synced back.
2020-03-26 15:16:08 -04:00
Joey Hess
bdec7fed9c
convert TopFilePath to use RawFilePath
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.

Git.Repo also changed to use RawFilePath for the path to the repo.

This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
2019-12-09 15:07:21 -04:00
Joey Hess
6a97ff6b3a
wip RawFilePath
Goal is to make git-annex faster by using ByteString for all the
worktree traversal. For now, this is focusing on Command.Find,
in order to benchmark how much it helps. (All other commands are
temporarily disabled)

Currently in a very bad unbuildable in-between state.
2019-11-25 16:18:19 -04:00
Joey Hess
bbdeb1a1a8
sync: Fix crash when there are submodules and an adjusted branch is checked out
Reverse adjusting the branch uses treeItemToTreeContent, which was missed
when adding submodule support earlier.
2019-10-23 11:52:56 -04:00
Joey Hess
97fd9da6e7
add back non-preferred files to imported tree
Prevents merging the import from deleting the non-preferred files from
the branch it's merged into.

adjustTree previously appended the new list of items to the old, which
could result in it generating a tree with multiple files with the same
name. That is not good and confuses some parts of git. Gave it a
function to resolve such conflicts.

That allowed dealing with the problem of what happens when the import
contains some files (or subtrees) with the same name as files that were
filtered out of the export. The files from the import win.
2019-05-20 16:43:52 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
bab6c570b0
buildImportTrees is fully working
buildImportCommit not yet tested
2019-02-22 12:41:17 -04:00
Joey Hess
1580ff3866
graphTree now works properly in all cases
(That I could think of.)
2019-02-21 22:25:42 -04:00
Joey Hess
8fdea8f444
WIP
Added graftTree but it's buggy.

Should use graftTree in Annex.Branch.graftTreeish; it will be faster
than the current implementation there.

Started Annex.Import, but untested and it doesn't yet handle tree
grafting.
2019-02-21 17:32:59 -04:00
Joey Hess
5483ea90ec
graft exported tree into git-annex branch
So it will be available later and elsewhere, even after GC.

I first though to use git update-index to do this, but feeding it a line
with a tree object seems to always cause it to generate a git subtree
merge. So, fell back to using the Git.Tree interface to maniupulate the
trees, and not involving the git-annex branch index file at all.

This commit was sponsored by Andreas Karlsson.
2017-08-31 18:06:49 -04:00
Joey Hess
a13c0ce66c
adjust: Fix behavior when used in a repository that contains submodules.
Also fixed the LsFiles parser to not assume its output has a fixed width
type field.
2017-02-20 13:44:55 -04:00
Joey Hess
e23028d19b
restart coprocess in raw mode
Restarting a crashing git process could result in filename encoding issues
when not in a unicode locale, as the restarted processes's handles were not
read in raw mode.

Since rawMode is always used when starting a coprocess, didn't bother
to parameterise it and just always enable it for simplicity.

This commit was sponsored by Jake Vosloo on Patreon.
2016-11-01 14:03:59 -04:00
Joey Hess
3f25317ad5
fix tree graft-in bug
When adding a tree like a/b/c/d when a/b already exists, fixes the bug that
the tree that got created was a/b/a/b/c/d

Just need to flatten out the top N directories of the tree that's being
grafted in, so we get the c/d part. This was complicated by the Tree
data type being a rose tree rather than a regular tree.

This commit was sponsored by Nick Daly on Patreon.
2016-10-11 15:36:40 -04:00
Joey Hess
b82c3e0783
sync: Fix bug in adjusted branch merging that could cause recently added files to be lost when updating the adjusted branch.
The modification flag was not being set when making modifications deep
in a tree, so parent trees were not updated to contain the modified tree.

Seems to have exposed another bug where the wrong filename gets grafted in.

This commit was sponsored by Brock Spratlen on Patreon.
2016-10-10 15:00:45 -04:00
Joey Hess
066f5bcdcb
more windows path fixes
Let git-style filepaths be looked up in the removeset, even though
windows-style filepaths are probably being fed into it.
2016-05-04 12:42:05 -04:00
Joey Hess
2cdfe33a4c
more windows path fixes
beneathSubTree can be called with both windows-style and git-style paths,
so needs to normalize to windows-style.
2016-05-04 12:36:50 -04:00
Joey Hess
db9269712f
avoid hardcoded slashes; broke on windows 2016-05-03 19:09:27 -04:00
Joey Hess
56dee9af10
fix build with ghc 7.6.3 2016-04-08 16:09:00 -04:00
Joey Hess
6c023e14ef
grafting new items into existing tree 2016-03-11 19:29:43 -04:00
Joey Hess
ad04550055
refactor 2016-03-11 16:45:40 -04:00
Joey Hess
f3b9c48a09
fixme 2016-03-11 16:37:31 -04:00
Joey Hess
ba1ef156a2
fix deletion of files in adjustTree 2016-03-11 16:30:06 -04:00
Joey Hess
b9184f69a7
improve propigation of commits from adjusted branches
Only reverse adjust the changes in the commit, which means that adjustments
do not need to be generally cleanly reversable.

For example, an adjustment can unlock all locked files, but does not need
to worry about files that were originally unlocked when reversing, because
it will only ever be run on files that have been changed. So, it's ok
if it locks all files when reversed, or even leaves all files as-is when
reversed.
2016-03-11 16:05:06 -04:00
Joey Hess
3c4ad3eeca
indent 2016-03-11 14:46:54 -04:00
Joey Hess
fed8fcb99f
allow adding new items via adjustTree 2016-03-11 14:08:06 -04:00
Joey Hess
e5dd91b189
better encapsulation 2016-02-23 22:22:22 -04:00
Joey Hess
4ea36b8c63
few strictness improvemnets 2016-02-23 22:03:47 -04:00
Joey Hess
85b05a29df
refactor 2016-02-23 21:56:08 -04:00
Joey Hess
e08bebf0eb
add adjustTree (low-level) interface that avoids buffering much in memory
Using getTree and recordTree in my big repo takes 594 mb ram.
Using adjustTree takes 73 mb.
2016-02-23 21:35:16 -04:00
Joey Hess
123f823ef7
no streaming
extractTree has to parse the whole input list in order to generate a tree,
so convert interface to non-streaming.

Some quick memory benchmarks in a repo with 60k files
don't look too bad despite not streaming.

To stream, without building up a whole tree object, one way would
be a new interface:

adjustTree :: MonadIO m :: (TreeItem -> m (Maybe TreeItem)) -> Ref -> Repo -> m Sha

This would only need to buffer tree objects from the current one down
to the root, in order to update trees when a TreeItem is changed.

But, while it supports changing items in the tree, and removing items,
it does not support adding new items, or moving items from one directory to
another.
2016-02-23 20:25:31 -04:00
Joey Hess
e266a6ec78
use getSha 2016-02-23 18:30:11 -04:00
Joey Hess
fc072699b7
minor improvements 2016-02-23 17:21:42 -04:00
Joey Hess
ae76cfde7d
add mktree interface 2016-02-23 16:36:38 -04:00