From 4ea0b7c28850eb703562cd9dc84a02c49b5fda00 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Thu, 7 Apr 2011 14:45:10 -0400
Subject: [PATCH] add

---
 doc/todo/git-annex_unused_eats_memory.mdwn | 25 ++++++++++++++++++++++
 1 file changed, 25 insertions(+)
 create mode 100644 doc/todo/git-annex_unused_eats_memory.mdwn

diff --git a/doc/todo/git-annex_unused_eats_memory.mdwn b/doc/todo/git-annex_unused_eats_memory.mdwn
new file mode 100644
index 0000000000..6ce7140045
--- /dev/null
+++ b/doc/todo/git-annex_unused_eats_memory.mdwn
@@ -0,0 +1,25 @@
+`git-annex unused` has to compare large sets of data
+(all keys with content present in the repository,
+with all keys used by files in the repository), and so
+uses more memory than git-annex typically needs; around
+60-80 MB when run in a repository with 80 thousand files.
+
+I would like to reduce this. One idea is to use a bloom filter.
+For example, construct a bloom filter of all keys used by files in
+the repository. Then for each key with content present, check if it's
+in the bloom filter. Since there can be false positives, this might
+miss finding some unused keys. The false positive probability (and so
+the size of the filter) could be tunable.
+
+Another way might be to scan the git log for files that got removed
+or changed what key they pointed to. Correlate with keys with content
+currently present in the repository (possibly using a bloom filter again),
+and that would yield a shortlist of keys that are probably not used.
+Then scan through all files in the repo to make sure that none point to keys
+on the shortlist.
+
+----
+
+`git annex unused --from remote` is much worse, using hundreds of MB of
+memory. It has not been profiled at all yet, and can probably be improved
+somewhat by fixing whatever memory leak it (probably) has.
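
The bloom filter idea in the todo item could look roughly like the sketch below. It is written in Python purely for illustration and is not git-annex code: `BloomFilter`, `find_unused`, and the default rate of 0.001 are all hypothetical, and keys are assumed to be plain strings. The filter is built from the keys used by files, then each key with content present is tested against it. A used key is never reported as unused; a false positive only means an unused key is occasionally missed.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        n = max(1, expected_items)
        bits = -n * math.log(false_positive_rate) / math.log(2) ** 2
        self.num_bits = max(8, math.ceil(bits))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def find_unused(used_keys, present_keys, expected_used, false_positive_rate=0.001):
    """Yield keys with content present that are definitely not used by any file.

    Both inputs can be lazy iterators, so only the filter is held in memory.
    """
    bloom = BloomFilter(expected_used, false_positive_rate)
    for key in used_keys:
        bloom.add(key)
    for key in present_keys:
        if key not in bloom:  # a negative answer from the filter is always correct
            yield key
```

With these sizing formulas, a false positive rate of 0.001 costs roughly 14.4 bits per key, or about 144 kB for 80 thousand keys. Lowering the rate finds more of the unused keys at the cost of a larger filter.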