40017089f2
Large speed up to importing trees from special remotes that contain a lot of files, by only processing changed files. Benchmarks: Importing from a special remote that has 10000 files, that have all been imported before, and 1 new file sped up from 26.06 to 2.59 seconds. An import with no change and 10000 unchanged files sped up from 24.3 to 1.99 seconds. Going up to 20000 files, an import with no changes sped up from 125.95 to 3.84 seconds. Sponsored-by: k0ld on Patreon
Users sometimes expect `git-annex import --from remote` to be faster than
it is, when importing hundreds of thousands of files, particularly
from a directory special remote.

I think generally, they're expecting something that is not achievable.
It is always going to be slower than using git in a repository with that
many files, because git operates at a lower level of abstraction (the
filesystem), so has more optimisations available to it. (Also git has its
own scalability limits with many files.)

Still, it would be good to find some ways to speed it up.

In particular, speeding up repeated imports from the same special remote,
when only a few files have changed, would make it much more useful. It's ok
to pay a somewhat expensive price to import a lot of new files, if updates
are quick after that.

---

A major thing that makes it slow, when a remote contains
many files, is converting from ContentIdentifiers to Keys.
It does a cidsdb lookup for every file, before it knows if the file has
changed or not, which gets slow with a lot of files.

What if it generated a git tree, where each file in the tree is
a sha1 hash of the ContentIdentifier? The tree can just be recorded locally
somewhere. It's ok if it gets garbage collected; it's only an optimisation.
On the next sync, diff from the old to the new tree. It only needs to
import the changed files, and can avoid the cidsdb lookup for the
unchanged files!

(That is assuming that ContentIdentifiers don't tend to sha1 collide.
If there was a collision it would fail to import the new file. But it seems
reasonable, because git loses data on sha1 collisions anyway, and ContentIdentifiers
are no more likely to collide than the content of files, and probably less
likely overall.)

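
The idea above can be sketched roughly as follows. This is only an illustration in Python under stated assumptions, not git-annex's actual code: a plain dict stands in for the locally recorded git tree, and `snapshot`, `changed_paths`, `import_changed` and `lookup_key` are hypothetical names, with `lookup_key` standing in for the expensive cidsdb lookup.

```python
import hashlib


def snapshot(listing):
    """Map each path to the sha1 of its ContentIdentifier.

    `listing` is a dict of path -> ContentIdentifier bytes, standing in
    for what the special remote reports. The result stands in for the
    git tree that would be recorded locally.
    """
    return {path: hashlib.sha1(cid).hexdigest() for path, cid in listing.items()}


def changed_paths(old, new):
    """Paths that are new, or whose hashed ContentIdentifier differs."""
    return sorted(p for p, h in new.items() if old.get(p) != h)


def import_changed(old, listing, lookup_key):
    """Run the expensive per-file lookup only for changed paths.

    `lookup_key` is a hypothetical stand-in for the cidsdb lookup.
    Returns the looked-up keys and the snapshot to record for next time.
    """
    new = snapshot(listing)
    keys = {p: lookup_key(listing[p]) for p in changed_paths(old, new)}
    return keys, new


# Usage: 10000 previously imported files plus 1 new file.
listing = {f"file{i}": f"cid-{i}".encode() for i in range(10000)}
old = snapshot(listing)
listing["newfile"] = b"cid-new"

lookups = []


def lookup_key(cid):
    # Record each call so the number of lookups can be inspected.
    lookups.append(cid)
    return "KEY-" + cid.decode()


keys, new = import_changed(old, listing, lookup_key)
print(len(lookups))  # 1
```

If the old snapshot is lost (for example, garbage collected), passing an empty dict as `old` makes every path count as changed, so the import falls back to the slow path rather than producing a wrong result. The sketch ignores deleted files, which a real tree diff would also report.
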
> I implemented this optimisation. Importing from a special remote that
> has 10000 files, that have all been imported before, and 1 new file
> sped up from 26.06 to 2.59 seconds. An import with no changes sped
> up from 24.3 to 1.99 seconds. Going up to 20000 files, an import with
> no changes sped up from 125.95 to 3.84 seconds.
> (All measured with warm cache.)

> (Note that I have only implemented this optimisation for imports that
> do not include History. So importing from versioned S3 buckets will
> still be slow. It would be possible to do a similar optimisation for
> History, but it seemed complicated so I punted.) --[[Joey]]
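
As a quick check on the figures quoted above, the following Python snippet computes the speedup ratios and how the run time grew when the file count doubled. The timings are the ones reported above; the `speedup` helper and the variable names are only for illustration.

```python
# (description, seconds before, seconds after), as reported above
benchmarks = [
    ("10000 files, 1 new", 26.06, 2.59),
    ("10000 files, no changes", 24.3, 1.99),
    ("20000 files, no changes", 125.95, 3.84),
]


def speedup(before, after):
    """Ratio of old run time to new run time."""
    return before / after


for name, before, after in benchmarks:
    print(f"{name}: {speedup(before, after):.1f}x faster")

# Growth in run time when going from 10000 to 20000 unchanged files.
growth_before = 125.95 / 24.3
growth_after = 3.84 / 1.99
print(f"before: {growth_before:.2f}x, after: {growth_after:.2f}x")
```

With these numbers, the speedups come out at roughly 10x, 12x and 33x. Doubling the file count made the old code a bit more than 5 times slower, while the optimised import took a little under twice as long. That is only two data points, so it hints at the scaling behaviour rather than establishing it.
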