todo
This commit is contained in:
parent
11cc9f1933
commit
6a8672d756
1 changed files with 60 additions and 0 deletions
60
doc/todo/info_--size-history.mdwn
Normal file
60
doc/todo/info_--size-history.mdwn
Normal file
|
@ -0,0 +1,60 @@
|
||||||
|
Support eg `git-annex info --size-history=30d` which would display
|
||||||
|
the combined size of all repositories every 30 days throughout the history
|
||||||
|
of the git-annex branch. This would allow graphing, analysis, etc of repo
|
||||||
|
growth patterns.
|
||||||
|
|
||||||
|
Also, `git-annex info somerepo --size-history=30d` would display the size
|
||||||
|
of only the selected repository.
|
||||||
|
|
||||||
|
Maybe also a way to get the size of each repository plus total size in a
|
||||||
|
single line of output?
|
||||||
|
|
||||||
|
----
|
||||||
|
|
||||||
|
Implementation of this is fairly subtle. My abandoned first try just went
|
||||||
|
through `git log` and updated counters as the location logs were updated.
|
||||||
|
That resulted in bad numbers. (The size went negative eventually in fact!)
|
||||||
|
The problem is that the git-annex branch is often updated both locally and
|
||||||
|
on a remote, eg when copying a file to a remote. And that results in 2
|
||||||
|
changes to the git-annex branch that both record the same data. So it gets
|
||||||
|
counted twice by my naive implementation.
|
||||||
|
|
||||||
|
I think it is not possible for an accumulation based approach to work in
|
||||||
|
constant memory and fast. In the worst case, there is a fork of the branch
|
||||||
|
that diverges hugely over a long period of time. So that divergence either
|
||||||
|
needs to be buffered in memory, or recalculated repeatedly.
|
||||||
|
|
||||||
|
What I think needs to be done is use `git log --reverse --date-order git-annex`.
|
||||||
|
Feed the changed annex log file refs into catObjectStream to get the log
|
||||||
|
files. (Or use --patch and parse the diff to extract log file lines,
|
||||||
|
might be faster?) Parse the log files, and update a simple data structure:
|
||||||
|
|
||||||
|
Map Key [UUIDPointer]
|
||||||
|
|
||||||
|
Where UUIDPointer is a number that points to the UUID in a Map. This
|
||||||
|
avoids storing copies of the uuids in the map.
|
||||||
|
|
||||||
|
This is essentially union merging all forks of the git-annex branch at
|
||||||
|
each commit, but far faster and in memory. Since union merging a git-annex
|
||||||
|
branch can be done at any point and always results in a consistent view of
|
||||||
|
the data, this will be consistent as well.
|
||||||
|
|
||||||
|
And when updating the data structure, then it can update a counter when
|
||||||
|
something changed, and avoid updating it when a redundant log was logged.
|
||||||
|
|
||||||
|
This approach will use an amount of memory that scales with
|
||||||
|
the number of keys and numbers of copies. I mocked it up using my big
|
||||||
|
repository. Storing every key in it in such a map, with 64 UUIDPointers
|
||||||
|
in the list (many more than the usual number of copies) took 2 gb of
|
||||||
|
memory. Which is a lot but also most users have that much if necessary.
|
||||||
|
With a more usual 5 copies, memory use was only 0.5 gb. So I think this is
|
||||||
|
an acceptable exception to git-annex's desire to use a constant amount of
|
||||||
|
memory.
|
||||||
|
|
||||||
|
(I considered a bloom filter, but a false positive would wreck the
|
||||||
|
statistics. An in-memory sqlite db might be more efficient, but probably
|
||||||
|
not worth the bother.)
|
||||||
|
|
||||||
|
[[!tag confirmed]]
|
||||||
|
|
||||||
|
--[[Joey]]
|
Loading…
Reference in a new issue