git-annex/doc/todo/speed_up_import_tree.mdwn
Joey Hess 40017089f2
use importChanges optimisation
Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.

Benchmarks:

Importing from a special remote that has 10000 files, that have all been
imported before, and 1 new file sped up from 26.06 to 2.59 seconds.

An import with no change and 10000 unchanged files sped up from 24.3 to
1.99 seconds.

Going up to 20000 files, an import with no changes sped up from
125.95 to 3.84 seconds.

Sponsored-by: k0ld on Patreon
2023-06-01 13:47:00 -04:00


Users sometimes expect `git-annex import --from remote` to be faster than
it is, when importing hundreds of thousands of files, particularly
from a directory special remote.

I think generally, they're expecting something that is not achievable.
It is always going to be slower than using git in a repository with that
many files, because git operates at a lower level of abstraction (the
filesystem), so it has more optimisations available to it. (Also, git has its
own scalability limits with many files.)

Still, it would be good to find some ways to speed it up.

In particular, speeding up repeated imports from the same special remote,
when only a few files have changed, would make it much more useful. It's ok
to pay a somewhat expensive price to import a lot of new files, if updates
are quick after that.

---

A major thing that makes it slow, when a remote contains
many files, is converting from ContentIdentifiers to Keys.
The import does a cidsdb lookup for every file, before it knows if the file
has changed or not, which gets slow with a lot of files.

What if it generated a git tree, where each file in the tree is
a sha1 hash of the ContentIdentifier? The tree can just be recorded locally
somewhere. It's ok if it gets garbage collected; it's only an optimisation.
On the next sync, diff from the old to the new tree. It only needs to
import the changed files, and can avoid the cidsdb lookup for the
unchanged files!

(That is assuming that ContentIdentifiers don't tend to sha1 collide.
If there were a collision, it would fail to import the new file. But it seems
reasonable, because git loses data on sha1 collisions anyway, and ContentIdentifiers
are no more likely to collide than the content of files, and probably less
likely overall.)

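
The idea above can be sketched roughly as follows. This is only an illustration in Python, not git-annex's code: plain dicts stand in for the recorded git trees, and `cid_tree`, `diff_trees`, `import_changes` and the `lookup_key` callback (standing in for the cidsdb lookup) are hypothetical names made up for the example.

```python
import hashlib

def cid_tree(listing):
    """Map each path to the sha1 hex digest of its ContentIdentifier bytes.

    `listing` is a dict of path -> ContentIdentifier (bytes), as a remote
    might report it. The resulting dict stands in for the git tree.
    """
    return {path: hashlib.sha1(cid).hexdigest() for path, cid in listing.items()}

def diff_trees(old, new):
    """Return (changed_or_added, removed) paths between two path -> sha1 maps."""
    changed = sorted(p for p, h in new.items() if old.get(p) != h)
    removed = sorted(p for p in old if p not in new)
    return changed, removed

def import_changes(old_tree, listing, lookup_key):
    """Do the expensive ContentIdentifier -> Key lookup only for changed paths.

    `lookup_key` is an assumed callback; unchanged paths never reach it.
    Returns the new tree (to record for the next import), the keys of the
    changed paths, and the list of removed paths.
    """
    new_tree = cid_tree(listing)
    changed, removed = diff_trees(old_tree, new_tree)
    keys = {p: lookup_key(listing[p]) for p in changed}
    return new_tree, keys, removed
```

If the recorded tree has been lost (for example, garbage collected), passing an empty dict as `old_tree` makes every path count as changed, so the import falls back to the slow path and remains correct.
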
> I implemented this optimisation. Importing from a special remote that
> has 10000 files, that have all been imported before, and 1 new file
> sped up from 26.06 to 2.59 seconds. An import with no changes sped
> up from 24.3 to 1.99 seconds. Going up to 20000 files, an import with
> no changes sped up from 125.95 to 3.84 seconds.
> (All measured with warm cache.)
>
> (Note that I have only implemented this optimisation for imports that
> do not include History. So importing from versioned S3 buckets will
> still be slow. It would be possible to do a similar optimisation for
> History, but it seemed complicated so I punted.) --[[Joey]]
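
For a sense of scale, the timings quoted above can be turned into speedup factors. The snippet below is just arithmetic on those numbers; the `benchmarks` list and `speedup` helper are made up for illustration.

```python
# (files on remote, scenario, seconds before, seconds after), copied from the text
benchmarks = [
    (10000, "1 new file", 26.06, 2.59),
    (10000, "no changes", 24.3, 1.99),
    (20000, "no changes", 125.95, 3.84),
]

def speedup(before, after):
    """How many times faster the import became."""
    return before / after

for files, scenario, before, after in benchmarks:
    print(f"{files} files, {scenario}: {speedup(before, after):.1f}x faster")
```

By these figures the improvement is roughly 10x, 12x and 33x respectively. Doubling the file count with no changes made the old import take about 5.2 times as long (24.3 to 125.95 seconds), while the optimised import took about 1.9 times as long (1.99 to 3.84 seconds).
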