unused: Reduce memory usage significantly.
Much of the memory bloat turned out to be due to getKeysReferenced containing a mapM, which is strict and buffered the whole list rather than streaming it.

The other half of the bloat was due to building a temporary Set in order to call S.difference. While that is more cpu efficient, I switched to successive S.delete, since with it, I can run a whole git annex unused in less than 8 mb of memory.

The whole Set of keys with content available is still stored in memory, so running unused in a repo with a whole lot of file content will still use more memory. In a repo containing 6000 files, it needed 40 mb.

Note that the status command still uses the bloatful getKeysReferenced.
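The trade-off between S.difference and successive S.delete described above can be sketched roughly as follows. This is an illustration only, not the code from this commit: `Key` is a hypothetical stand-in for git-annex's key type, both function names are made up, and `S` is assumed to be `Data.Set` from the containers package.

```haskell
import qualified Data.Set as S
import Data.List (foldl')

-- Hypothetical stand-in for git-annex's Key type.
type Key = String

-- Delete each referenced key from the Set of present keys, one at a
-- time. If the referenced list is produced lazily and not retained
-- elsewhere, only the `present` Set has to stay in memory.
unusedKeys :: S.Set Key -> [Key] -> S.Set Key
unusedKeys = foldl' (flip S.delete)

-- The alternative: build a temporary Set of all referenced keys first,
-- then take the difference. Less CPU work, but a second Set in memory.
unusedKeysViaDifference :: S.Set Key -> [Key] -> S.Set Key
unusedKeysViaDifference present referenced =
    present `S.difference` S.fromList referenced
```

Both functions return the same Set; they differ only in whether a second Set is materialized along the way.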
parent a13949bf37
commit b086e32c63
3 changed files with 49 additions and 22 deletions
@@ -1,12 +1,20 @@
 `git-annex unused` has to compare large sets of data
 (all keys with content present in the repository,
 with all keys used by files in the repository), and so
-uses more memory than git-annex typically needs; around
-50 mb when run in a repository with 80 thousand files.
+uses more memory than git-annex typically needs.
 
-(Used to be 80 mb, but implementation improved.)
+It used to be a lot worse (hundreds of megabytes).
 
-I would like to reduce this. One idea is to use a bloom filter.
+Now it only needs enough memory to store a Set of all Keys that currently
+have content in the annex. On a lightly populated repository, it runs in
+quite low memory use (like 8 mb) even if the git repo has 100 thousand
+files. On a repository with lots of file contents, it will use more.
+
+Still, I would like to reduce this to a purely constant memory use,
+as running in constant memory no matter the repo size is a git-annex design
+goal.
+
+One idea is to use a bloom filter.
 For example, construct a bloom filter of all keys used by files in
 the repository. Then for each key with content present, check if it's
 in the bloom filter. Since there can be false positives, this might
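
The bloom filter idea in the hunk above could look roughly like the sketch below. It is a toy, not git-annex code: the `Bloom` type, the hash function, the three seeds, and the 4096-bit size are all assumptions chosen for illustration, and `Key` is a hypothetical stand-in for the real key type. A Bloom filter can report false positives but never false negatives, so any key it reports as absent is definitely not referenced, while an unused key may occasionally be missed.

```haskell
import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl')

-- Hypothetical stand-in for git-annex's Key type.
type Key = String

-- Toy Bloom filter: an Integer used as a bit array of `bloomSize` bits.
data Bloom = Bloom { bloomSize :: Int, bloomBits :: Integer }

emptyBloom :: Int -> Bloom
emptyBloom size = Bloom size 0

-- Seeded polynomial string hash; illustrative only, not a tuned hash.
hashWith :: Int -> Key -> Int
hashWith = foldl' step
  where step h c = (h * 31 + ord c) `mod` 1000000007

-- Bit positions for a key, one per (arbitrarily chosen) seed.
positions :: Bloom -> Key -> [Int]
positions b k = [ hashWith seed k `mod` bloomSize b | seed <- [7, 131, 8191] ]

insertKey :: Bloom -> Key -> Bloom
insertKey b k = b { bloomBits = foldl' setBit (bloomBits b) (positions b k) }

-- True if the key may have been inserted; False means it never was.
mayContain :: Bloom -> Key -> Bool
mayContain b k = all (testBit (bloomBits b)) (positions b k)

-- Keys with content present that are definitely not referenced.
definitelyUnused :: [Key] -> [Key] -> [Key]
definitelyUnused referenced = filter (not . mayContain bloom)
  where bloom = foldl' insertKey (emptyBloom 4096) referenced
```

The filter's size is fixed up front, so its memory use does not grow with the number of keys, which is what makes it attractive for the constant-memory goal; the cost is a false positive rate that rises as more keys are inserted.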