2011-04-07 18:45:10 +00:00
|
|
|
`git-annex unused` has to compare large sets of data
|
|
|
|
(all keys with content present in the repository,
|
|
|
|
with all keys used by files in the repository), and so
|
2012-03-11 19:19:07 +00:00
|
|
|
uses more memory than git-annex typically needs.
|
2011-11-08 05:27:06 +00:00
|
|
|
|
2012-03-11 19:19:07 +00:00
|
|
|
It used to be a lot worse (hundreds of megabytes).
|
2011-04-07 18:45:10 +00:00
|
|
|
|
2012-03-11 19:19:07 +00:00
|
|
|
Now it only needs enough memory to store a Set of all Keys that currently
|
|
|
|
have content in the annex. On a lightly populated repository, it runs in
|
|
|
|
quite low memory use (like 8 mb) even if the git repo has 100 thousand
|
|
|
|
files. On a repository with lots of file contents, it will use more.
|
|
|
|
|
|
|
|
Still, I would like to reduce this to a purely constant memory use,
|
|
|
|
as running in constant memory no matter the repo size is a git-annex design
|
|
|
|
goal.
|
|
|
|
|
|
|
|
One idea is to use a bloom filter.
|
2011-04-07 18:45:10 +00:00
|
|
|
For example, construct a bloom filter of all keys used by files in
|
|
|
|
the repository. Then for each key with content present, check if it's
|
2011-11-08 05:27:06 +00:00
|
|
|
in the bloom filter. Since there can be false positives, this might
|
2011-04-07 18:45:10 +00:00
|
|
|
miss finding some unused keys. The probability/size of filter
|
|
|
|
could be tunable.
|
|
|
|
|
|
|
|
Another way might be to scan the git log for files that got removed
|
|
|
|
or changed what key they pointed to. Correlate with keys with content
|
|
|
|
currently present in the repository (possibly using a bloom filter again),
|
|
|
|
and that would yield a shortlist of keys that are probably not used.
|
|
|
|
Then scan thru all files in the repo to make sure that none point to keys
|
|
|
|
on the shortlist.
|