importfeed: Made the known urls checking step around 10% faster.

This was a bit disappointing, I was hoping for a 2x speedup. But, I think
the metadata lookup is wasting a lot of time and also needs to be made to
stream.

The changes to catObjectStreamLsTree were benchmarked to also speed
up --all around 3% more. Seems I managed to make it polymorphic after all.
commit 535cdc8d48 (parent a6afa62a60)
Joey Hess 2020-07-14 12:44:35 -04:00
GPG key ID: DB12DB0FF05F8F38 (no known key found for this signature in database)
6 changed files with 58 additions and 42 deletions


@@ -0,0 +1,11 @@
git-annex tries to run in a constant amount of memory; however, `knownUrls`
loads every url ever seen into a list, so the more urls there are, the more
memory `git annex importfeed` will need.
This is probably not a big problem in practice, but seems worth doing
something about if somehow possible.
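One way to avoid holding every known url in memory at once would be to stream them: the set of candidate urls from a single feed is small, so the known urls could be scanned once and matched off against it. A minimal sketch of that idea in Python (hypothetical names, not git-annex's actual Haskell code):

```python
def new_items(candidates, known_url_stream):
    """Return the candidate urls that are not already known.

    Extra memory is bounded by the (small) candidate set; the stream
    of known urls is consumed one url at a time instead of being
    materialised into a list.
    """
    remaining = set(candidates)
    for url in known_url_stream:
        remaining.discard(url)
        if not remaining:  # every candidate is already known; stop early
            break
    return remaining
```

The trade-off is that each importfeed run re-reads the known urls once, rather than paying memory proportional to their number.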
Unfortunately, a bloom filter can't be used, because a false positive would
prevent importing an url that has not been imported before. A sqlite
database would work, but would need to be updated whenever the git-annex
branch is changed. --[[Joey]]
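To make the false-positive problem concrete: with a bloom filter, once enough urls have been added, an url that was never imported can test as present, and importfeed would then wrongly skip it. A deliberately tiny toy filter (illustrative only, nothing like git-annex's code):

```python
import hashlib

class BloomFilter:
    """Deliberately undersized bloom filter, for illustration only."""

    def __init__(self, size_bits=64, num_hashes=2):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from a sha256 of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return True for items never added: a false positive.
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```

Once the filter's bits saturate, a url that was never added tests as present, so its feed item would never be imported. That is why a structure with no false positives, such as the sqlite database mentioned above, would be needed.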


@@ -61,3 +61,6 @@ looked up efficiently. (Before these changes, the same key lookup was done
speedup when such limits are used. What that optimisation needs is a way to
tell if the current limit needs the key or not. If it does, then match on
it after getting the key, otherwise before getting the key.
Also, importfeed could probably be sped up more if knownItems streamed
through cat-file --buffer.