2023-05-30 19:42:34 +00:00

Users sometimes expect `git-annex import --from remote` to be faster than
it is, when importing hundreds of thousands of files, particularly
from a directory special remote.

I think generally, they're expecting something that is not achievable.
It is always going to be slower than using git in a repository with that
many files, because git operates at a lower level of abstraction (the
filesystem), so has more optimisations available to it. (Also git has its
own scalability limits with many files.)

Still, it would be good to find some ways to speed it up.

2023-06-01 17:46:16 +00:00

In particular, speeding up repeated imports from the same special remote,
when only a few files have changed, would make it much more useful. It's ok
to pay a somewhat expensive price to import a lot of new files, if updates
are quick after that.

2023-05-30 19:49:52 +00:00

---

A major thing that makes it slow, when a remote contains
many files, is converting from ContentIdentifiers to Keys.
It does a cidsdb lookup for every file, before it knows if the file has
changed or not, which gets slow with a lot of files.

What if it generated a git tree, where each file in the tree is
a sha1 hash of the ContentIdentifier? The tree can just be recorded locally
somewhere. It's ok if it gets garbage collected; it's only an optimisation.
On the next sync, diff from the old to the new tree. It only needs to
import the changed files, and can avoid the cidsdb lookup for the
unchanged files!
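
The idea above can be sketched in a few lines. This is only an illustration in Python, not git-annex's implementation (which is written in Haskell and records a real git tree). A plain dict stands in for the tree, the ContentIdentifier byte strings are made up, and `cid_digest`, `snapshot`, and `changed_paths` are hypothetical names.

```python
import hashlib


def cid_digest(cid: bytes) -> str:
    """SHA-1 of a ContentIdentifier, standing in for a tree entry."""
    return hashlib.sha1(cid).hexdigest()


def snapshot(listing: dict[str, bytes]) -> dict[str, str]:
    """Map each remote path to the digest of its ContentIdentifier."""
    return {path: cid_digest(cid) for path, cid in listing.items()}


def changed_paths(old: dict[str, str], new: dict[str, str]) -> tuple[set[str], set[str]]:
    """Diff two snapshots.

    Returns (paths that are new or modified, paths that disappeared).
    Only the first set would need a cidsdb lookup and an import.
    """
    to_import = {path for path, digest in new.items() if old.get(path) != digest}
    removed = set(old) - set(new)
    return to_import, removed


# Hypothetical listings from two successive imports of the same remote.
previous = snapshot({
    "a.txt": b"inode1 size10 mtime100",
    "b.txt": b"inode2 size20 mtime200",
})
current = snapshot({
    "a.txt": b"inode1 size10 mtime100",  # unchanged: skipped entirely
    "b.txt": b"inode2 size25 mtime300",  # modified
    "c.txt": b"inode3 size5 mtime400",   # new
})

to_import, removed = changed_paths(previous, current)
print(sorted(to_import), sorted(removed))
```

If the recorded snapshot is lost, the diff is simply taken against an empty one and every file is looked up again, which matches the point that the tree is only an optimisation.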

(That is assuming that ContentIdentifiers don't tend to sha1 collide.
If there was a collision it would fail to import the new file. But it seems
reasonable, because git loses data on sha1 collisions anyway, and ContentIdentifiers
are no more likely to collide than the content of files, and probably less
likely overall.)
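
To put a rough number on that risk, the usual birthday-style union bound can be evaluated. This sketch assumes sha1 behaves like a random 160-bit function on ContentIdentifiers, so it covers accidental collisions only, not deliberately constructed ones. The function name and the file counts are illustrative.

```python
from fractions import Fraction


def birthday_bound(n_items: int, bits: int = 160) -> Fraction:
    """Upper bound on the chance that any two of n_items digests collide.

    Union bound: (number of pairs) / (number of possible digests).
    """
    pairs = n_items * (n_items - 1) // 2
    return Fraction(pairs, 2 ** bits)


for n in (10_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} files: p <= {float(birthday_bound(n)):.3e}")
```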

> I implemented this optimisation. Importing from a special remote that
> has 10000 files, that have all been imported before, and 1 new file
> sped up from 26.06 to 2.59 seconds. An import with no changes sped
> up from 24.3 to 1.99 seconds. Going up to 20000 files, an import with
> no changes sped up from 125.95 to 3.84 seconds.
> (All measured with warm cache.)

> (Note that I have only implemented this optimisation for imports that
> do not include History. So importing from versioned S3 buckets will
> still be slow. It would be possible to do a similar optimisation for
> History, but it seemed complicated so I punted.) --[[Joey]]
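
The timings quoted above can be reduced to speedup factors with a little arithmetic. The figures are taken directly from the note; the labels are only descriptive.

```python
# (description, seconds before, seconds after), as reported in the note.
timings = [
    ("10000 files, 1 new file", 26.06, 2.59),
    ("10000 files, no changes", 24.3, 1.99),
    ("20000 files, no changes", 125.95, 3.84),
]

speedups = {label: before / after for label, before, after in timings}
for label, factor in speedups.items():
    print(f"{label}: {factor:.1f}x faster")
```

That works out to roughly 10x, 12x, and 33x. Doubling the file count multiplied the unoptimised no-change import time by about 5, but the optimised one by only about 2.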