design for preferred content numcopies check

Joey Hess 2014-01-20 14:28:33 -04:00
parent 5130bfdff3
commit e7f8c1911a
2 changed files with 66 additions and 7 deletions

@@ -7,12 +7,10 @@ for i in `git remote`; do git copy -to $i --auto; done
The use case is this:
I have a very large repo (300.000 files) in three places. Now I want the fastest possible way to ensure that every file exists in annex.numcopies. This should scan every file one time and then get it or copy it to other repos as needed. Right now, I make one "git annex get --auto" in every repo, which is a waste of time, since most of the files never change anyway!
> The closest we have to this is the (new) `git annex sync --content`.
> It does effectively just what the shown for loop does.
> Now `git annex sync --content` does effectively just what the shown for
> loop does. [[done]]
>
> But, that actually satisfies preferred content settings, which default
> to preferring every repo have a copy, and even if configured will
> typically be more than numcopies.
>
> Numcopies is more of a lower bound (though not a hard bound).
> The only difference is that copy --auto proactively downloads otherwise
> unwanted files to satisfy numcopies, and sync --content does not.
> We need a [[preferred_content_numcopies_check]] to solve that.
> --[[Joey]]

@@ -0,0 +1,61 @@
The assistant and git annex sync --content do not try to proactively
download content that is not otherwise wanted in order to get numcopies
satisfied. (Unlike get --auto, which does take numcopies into account.)
Should these automated systems try to proactively satisfy numcopies? I
don't feel they should. It could lead to surprising results. For example,
a transfer repository, which is of limited size, could start being filled
up with lots of content that all clients have, just because numcopies was
set to a larger number than the total number of clients. As another example,
a source repository on, e.g., an Android phone should never have content in it
that was not created on that device.
However, it would make sense for some specific
types of repositories to proactively get content to satisfy numcopies.
Currently some types of repositories use "or (not copies=semitrusted+:1)",
to ensure that if the only copy of a file is on a dead repository, they
will try to get that file before the repo goes away. This is done
by client, backup, and archive repositories. It probably makes sense for
the same set to proactively satisfy numcopies.
So, a new type of preferred content expression is called for, such as
"numcopiesneeded=1", which indicates that at least 1 more copy
is needed to satisfy numcopies.
(Note that it should only count semitrusted and higher trust
level repos as satisfying numcopies.)
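
A minimal sketch of how such an expression might be evaluated, assuming the
inputs are the known locations of a file and the trust level of each repo,
and that a repo with no recorded trust level counts as semitrusted.
The names `Trust`, `numcopies_needed` and `matches_numcopiesneeded` are
hypothetical and only illustrate the counting rule; this is not git-annex's
implementation.

```python
from enum import IntEnum


class Trust(IntEnum):
    """Trust levels, ordered from least to most trusted."""
    DEAD = 0
    UNTRUSTED = 1
    SEMITRUSTED = 2
    TRUSTED = 3


def numcopies_needed(locations, trust_levels, numcopies):
    """Return how many more copies are needed to satisfy numcopies.

    locations: repo ids believed to hold the file's content.
    trust_levels: dict mapping repo id to Trust; repos missing from it
        are assumed to be semitrusted (an assumption of this sketch).
    Only semitrusted and higher repos count toward numcopies.
    """
    counted = sum(
        1
        for repo in set(locations)
        if trust_levels.get(repo, Trust.SEMITRUSTED) >= Trust.SEMITRUSTED
    )
    return max(0, numcopies - counted)


def matches_numcopiesneeded(n, locations, trust_levels, numcopies):
    """True when at least n more copies are needed."""
    return numcopies_needed(locations, trust_levels, numcopies) >= n
```

For example, with copies on repos A, B and C, where B is untrusted and C is
dead, only A counts, so with numcopies set to 2 the hypothetical
`matches_numcopiesneeded(1, ...)` would be true.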
But, preferred content expressions can only operate on info stored in the
git repo, or they will fail to be stable. Ie, repo A needs to be able to
calculate whether a file is preferred content by repo B and get the same
result as when repo B calculates that.
numcopies is currently configured in 3 places:
* .git/config `annex.numcopies` (global, stored only locally)
* .gitattributes `annex.numcopies` (per file, stored in git repo)
* --numcopies (not relevant)
So, a global numcopies setting that is stored in the git repo needs to be
added. That could either be a file in the git-annex branch, or just
`* annex.numcopies=2` in the toplevel .gitattributes. Note that the
assistant needs to be able to query and set it, which I think argues
against using .gitattributes for it. Also arguing against that is that the
.git/config numcopies value applies even to objects with no file in the
work tree, which gitattributes settings do not.
Conclusion:
* Add to the git-annex branch a numcopies file that holds the global
  numcopies default if present.
* Modify the assistant to use it when configuring numcopies.
* To deprecate .git/config's annex.numcopies, only make it take effect
  when there is no numcopies file in the git-annex branch.
* Add a "numcopiesneeded=N" preferred content expression using the git-annex
  branch numcopies setting, overridden by any .gitattributes numcopies setting
  for a particular file. It should ignore the other ways to specify
  numcopies.
* Make the repo groups that currently end with "or (not copies=semitrusted+:1)"
  instead end with "or (not numcopiesneeded=1)"
--[[Joey]]
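
The lookup order described in the conclusion could be sketched roughly as
follows. This is only an illustration under stated assumptions: the function
names are hypothetical, each setting is passed in as an already-parsed
integer or `None` when unset, and the fallback of 1 when nothing is
configured is an assumption of the sketch rather than something this design
specifies.

```python
def global_numcopies_default(branch_value, gitconfig_value, fallback=1):
    """Global default: the git-annex branch numcopies file wins;
    .git/config annex.numcopies only takes effect when that file is absent.
    The fallback value is an assumption of this sketch."""
    if branch_value is not None:
        return branch_value
    if gitconfig_value is not None:
        return gitconfig_value
    return fallback


def numcopies_for_preferred_content(gitattributes_value, branch_value, fallback=1):
    """Value used when evaluating numcopiesneeded=N for one file.

    Uses only settings stored in the git repo, so every repo computes the
    same answer: a per-file .gitattributes setting overrides the git-annex
    branch setting. .git/config and --numcopies are deliberately ignored.
    """
    if gitattributes_value is not None:
        return gitattributes_value
    if branch_value is not None:
        return branch_value
    return fallback
```

Keeping the two functions separate mirrors the split in the design: the
first may consult local-only configuration, while the second must not,
since a preferred content expression has to give the same result in every
repository.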