importfeed: Made the known urls checking step around 10% faster.

This was a bit disappointing, I was hoping for a 2x speedup. But, I think
the metadata lookup is wasting a lot of time and also needs to be made to
stream.

The changes to catObjectStreamLsTree were benchmarked to also speed
up --all around 3% more. Seems I managed to make it polymorphic after all.
commit 535cdc8d48 (parent a6afa62a60)
Joey Hess 2020-07-14 12:44:35 -04:00
GPG key ID: DB12DB0FF05F8F38 (no known key found for this signature in database)
6 changed files with 58 additions and 42 deletions


@@ -0,0 +1,11 @@
git-annex tries to run in a constant amount of memory; however, `knownUrls`
loads every url ever seen into a list, so the more urls there are, the more
memory `git annex importfeed` will need.
This is probably not a big problem in practice, but seems worth doing
something about if somehow possible.
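One way to avoid holding every known url in memory at once would be to stream them: the set of candidate urls from a single feed is small, so the known urls could be scanned once and matched off against it. A minimal sketch of that idea in Python (hypothetical names, not git-annex's actual Haskell code):

```python
def new_items(candidates, known_url_stream):
    """Return the candidate urls that are not already known.

    Extra memory is bounded by the (small) candidate set; the stream
    of known urls is consumed one url at a time instead of being
    materialised into a list.
    """
    remaining = set(candidates)
    for url in known_url_stream:
        remaining.discard(url)
        if not remaining:  # every candidate is already known; stop early
            break
    return remaining
```

The trade-off is that each importfeed run re-reads the known urls once, rather than paying memory proportional to their number.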
Unfortunately, a bloom filter can't be used, because a false positive would
prevent importing an url that has not been imported before. A sqlite
database would work, but would need to be updated whenever the git-annex
branch is changed. --[[Joey]]
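To make the false-positive problem concrete: with a bloom filter, once enough urls have been added, an url that was never imported can test as present, and importfeed would then wrongly skip it. A deliberately tiny toy filter (illustrative only, nothing like git-annex's code):

```python
import hashlib

class BloomFilter:
    """Deliberately undersized bloom filter, for illustration only."""

    def __init__(self, size_bits=64, num_hashes=2):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from a sha256 of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return True for items never added: a false positive.
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```

Once the filter's bits saturate, a url that was never added tests as present, so its feed item would never be imported. That is why a structure with no false positives, such as the sqlite database mentioned above, would be needed.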


@@ -61,3 +61,6 @@ looked up efficiently. (Before these changes, the same key lookup was done
speedup when such limits are used. What that optimisation needs is a way to
tell if the current limit needs the key or not. If it does, then match on
it after getting the key, otherwise before getting the key.
Also, importfeed could probably be sped up more if knownItems streamed
through cat-file --buffer.