From 6a8672d75683a3906deacd2b382e3c41f66727bb Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 8 Nov 2023 14:14:35 -0400 Subject: [PATCH] todo --- doc/todo/info_--size-history.mdwn | 60 +++++++++++++++++++++++++++++++ 1 file changed, 60 insertions(+) create mode 100644 doc/todo/info_--size-history.mdwn diff --git a/doc/todo/info_--size-history.mdwn b/doc/todo/info_--size-history.mdwn new file mode 100644 index 0000000000..024f4c5c9c --- /dev/null +++ b/doc/todo/info_--size-history.mdwn @@ -0,0 +1,60 @@ +Support eg `git-annex info --size-history=30d` which would display +the combined size of all repositories every 30 days throughout the history +of the git-annex branch. This would allow graphing, analysis, etc of repo +growth patterns. + +Also, `git-annex info somerepo --size-history=30d` would display the size +of only the selected repository. + +Maybe also a way to get the size of each repository plus total size in a +single line of output? + +---- + +Implementation of this is fairly subtle. My abandoned first try just went +through `git log` and updated counters as the location logs were updated. +That resulted in bad numbers. (The size went negative eventually in fact!) +The problem is that the git-annex branch is often updated both locally and +on a remote, eg when copying a file to a remote. And that results in 2 +changes to the git-annex branch that both record the same data. So it gets +counted twice by my naive implementation. + +I think it is not possible for an accumulation based approach to work in +constant memory and fast. In the worst case, there is a fork of the branch +that diverges hugely over a long period of time. So that divergence either +needs to be buffered in memory, or recalculated repeatedly. + +What I think needs to be done is use `git log --reverse --date-order git-annex`. +Feed the changed annex log file refs into catObjectStream to get the log +files. (Or use --patch and parse the diff to extract log file lines, +might be faster?) Parse the log files, and update a simple data structure: + + Map Key [UUIDPointer] + +Where UUIDPointer is a number that points to the UUID in a Map. This +avoids storing copies of the uuids in the map. + +This is essentially union merging all forks of the git-annex branch at +each commit, but far faster and in memory. Since union merging a git-annex +branch can be done at any point and always results in a consistent view of +the data, this will be consistent as well. + +And when updating the data structure, then it can update a counter when +something changed, and avoid updating it when a redundant log was logged. + +This approach will use an amount of memory that scales with +the number of keys and numbers of copies. I mocked it up using my big +repository. Storing every key in it in such a map, with 64 UUIDPointers +in the list (many more than the usual number of copies) took 2 gb of +memory. Which is a lot but also most users have that much if necessary. +With a more usual 5 copies, memory use was only 0.5 gb. So I think this is +an acceptable exception to git-annex's desire to use a constant amount of +memory. + +(I considered a bloom filter, but a false positive would wreck the +statistics. An in-memory sqlite db might be more efficient, but probably +not worth the bother.) + +[[!tag confirmed]] + +--[[Joey]]