git-annex/doc/todo/git-annex_unused_eats_memory.mdwn

`git-annex unused` has to compare large sets of data
(all keys with content present in the repository,
with all keys used by files in the repository), and so
uses more memory than git-annex typically needs.

It used to be a lot worse (hundreds of megabytes).

Now it only needs enough memory to store a Set of all Keys that currently
have content in the annex. On a lightly populated repository, its memory
use is quite low (around 8 MB), even if the git repo has 100 thousand
files. On a repository with lots of file contents, it will use more.
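
The approach described above can be sketched roughly as follows. This is an
illustrative Python sketch, not git-annex's actual Haskell code; `find_unused`
and the key names are hypothetical. It assumes the keys with content present
fit in one set, while the keys used by files are consumed as a stream, so
memory grows only with the number of present keys.

```python
from typing import Iterable, Set


def find_unused(present_keys: Iterable[str], referenced_keys: Iterable[str]) -> Set[str]:
    """Return keys whose content is present but that no file refers to."""
    # The only large structure held in memory: all keys with content present.
    unused = set(present_keys)
    # Referenced keys are consumed one at a time (for example from a
    # generator), so they are never collected into a second set.
    for key in referenced_keys:
        unused.discard(key)
    return unused


def referenced_keys_stream():
    # Stand-in for walking every file in the repository (made-up keys).
    yield from ["KEY-a", "KEY-b", "KEY-a"]


if __name__ == "__main__":
    print(sorted(find_unused(["KEY-a", "KEY-c"], referenced_keys_stream())))
```

Deleting keys one by one trades some CPU time for not having to build a
second set just to take a set difference.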

Still, I would like to reduce this to a purely constant memory use,
as running in constant memory no matter the repo size is a git-annex design
goal.

One idea is to use a bloom filter.
For example, construct a bloom filter of all keys used by files in
the repository. Then for each key with content present, check if it's
in the bloom filter. Since there can be false positives (an unused key
that appears to be in the filter), this might miss finding some unused
keys, but it would never report a used key as unused. The false positive
probability and the size of the filter could be tunable.
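
To make the bloom filter idea concrete, here is a minimal Python sketch. It is
not git-annex code: the `BloomFilter` class, `probably_unused`, and the
`capacity` and `error_rate` parameters are all hypothetical names, and the
sizing uses the standard bloom filter formulas. The filter's size is fixed up
front from the expected number of keys and the acceptable false positive
rate, which is what would make the memory use tunable.

```python
import hashlib
import math
from typing import Iterable, Iterator


class BloomFilter:
    """Fixed-size set approximation: false positives possible, no false negatives."""

    def __init__(self, capacity: int, error_rate: float) -> None:
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def probably_unused(present_keys: Iterable[str], referenced_keys: Iterable[str],
                    capacity: int, error_rate: float = 0.001) -> Iterator[str]:
    """Yield present keys that are definitely not referenced by any file."""
    bloom = BloomFilter(capacity, error_rate)
    for key in referenced_keys:  # streamed; only the filter is kept in memory
        bloom.add(key)
    for key in present_keys:  # streamed as well
        if key not in bloom:
            yield key


if __name__ == "__main__":
    # Made-up keys; a false positive would cause an unused key to be skipped.
    print(list(probably_unused(["KEY-a", "KEY-z"], ["KEY-a", "KEY-b"], capacity=2)))
```

Every key this yields is truly unused, because a bloom filter has no false
negatives. The cost is that a small, tunable fraction of unused keys may go
unreported.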

Another way might be to scan the git log for files that got removed
or changed what key they pointed to. Correlate with keys with content
currently present in the repository (possibly using a bloom filter again),
and that would yield a shortlist of keys that are probably not used.
Then scan through all files in the repo to make sure that none point to keys
on the shortlist.
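
The two passes of this idea could look roughly like the following Python
sketch. Everything here is an assumption for illustration: `dropped_keys`
stands in for whatever a scan of the git log would produce, `is_present` is
any membership test for keys with content present (a set's `__contains__`, or
a bloom filter), and `current_keys` stands in for a walk over the files
currently in the repo.

```python
from typing import Callable, Iterable, Set


def unused_via_history(dropped_keys: Iterable[str],
                       is_present: Callable[[str], bool],
                       current_keys: Iterable[str]) -> Set[str]:
    """Shortlist keys that history says were dropped, then verify them."""
    # Pass 1: keys that some file stopped pointing to, and whose content
    # still appears to be present. This shortlist is expected to be small.
    shortlist = {key for key in dropped_keys if is_present(key)}
    # Pass 2: stream over the keys used by current files and remove any
    # shortlisted key that is still in use after all.
    for key in current_keys:
        if not shortlist:
            break  # nothing left to verify
        shortlist.discard(key)
    return shortlist


if __name__ == "__main__":
    present = {"KEY-1", "KEY-2", "KEY-9"}  # made-up keys
    print(unused_via_history(["KEY-1", "KEY-2", "KEY-3"],
                             present.__contains__,
                             ["KEY-2", "KEY-5"]))
```

Memory use here depends on the size of the shortlist rather than on the
total number of keys with content present.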

Revision history of this page:

* 2011-04-07 18:45:10 +00:00: add
* 2011-11-08 05:27:06 +00:00: update
* 2012-03-11 19:19:07 +00:00: unused: Reduce memory usage significantly. Much of the memory bloat turned out to be due to getKeysReferenced containing a mapM, which is strict and buffered the whole list rather than streaming it. The other half of the bloat was due to building a temporary Set in order to call S.difference. While that is more cpu efficient, I switched to successive S.delete, since with it, I can run a whole git annex unused in less than 8 mb of memory. The whole Set of keys with content available is still stored in memory, so running unused in a repo with a whole lot of file content will still use more memory. In a repo containing 6000 files, it needed 40 mb. Note that the status command still uses the bloatful getKeysReferenced.