git-annex/doc/todo/preferred_content_numcopies_check.mdwn
Joey Hess f2713a3bb9 benchmarked numcopies .gitattributes in preferred content
Checking .gitattributes adds a full minute to a git annex find looking for
files that don't have enough copies. 2:25 increases to 3:27. I feel this is
too much of a slowdown to justify making it the default. So, exposed two
versions of the preferred content expression, a slow one and a fast but
approximate one.

I'm using the approximate one in the default preferred content expressions
to avoid slowing down the assistant.
2014-01-21 18:49:25 -04:00

The assistant and git annex sync --content do not try to proactively
download content that is not otherwise wanted in order to get numcopies
satisfied. (Unlike get --auto, which does take numcopies into account.)
Should these automated systems try to proactively satisfy numcopies? I
don't feel they should. It could lead to surprising results. For example,
a transfer repository, which is of limited size, could start being filled
up with lots of content that all clients have, just because numcopies was
set to a larger number than the total number of clients. Another example,
a source repository on eg an Android phone, should never have content in it
that was not created on that device.

However, it would make sense for some specific
types of repositories to proactively get content to satisfy numcopies.
Currently some types of repositories use "or (not copies=semitrusted+:1)",
to ensure that if the only copy of a file is on a dead repository, they
will try to get that file before the repo goes away. This is done
by client repositories, and backup, and archive. Probably the same set
would make sense to proactively satisfy numcopies.
So, a new type of preferred content expression is called for, such as
"numcopiesneeded=1", which indicates that at least 1 more copy
is needed to satisfy numcopies.
(Note that it should only count semitrusted and higher trust
level repos as satisfying numcopies.)
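
A rough sketch of how such an expression could be evaluated, assuming it matches when the number of missing copies is at least N. The `Repo` type and the function names are made up for illustration and are not git-annex's actual implementation; only the trust level names come from git-annex.

```python
from dataclasses import dataclass
from typing import Sequence

# Trust levels ordered from least to most trusted.
TRUST_ORDER = ["dead", "untrusted", "semitrusted", "trusted"]


@dataclass(frozen=True)
class Repo:
    """Hypothetical record of a repository that holds a copy of a file."""
    name: str
    trust: str = "semitrusted"


def copies_needed(numcopies: int, locations: Sequence[Repo]) -> int:
    """How many more copies are needed to satisfy numcopies.

    Only repos at semitrusted or higher trust count as copies.
    """
    floor = TRUST_ORDER.index("semitrusted")
    counted = sum(1 for r in locations if TRUST_ORDER.index(r.trust) >= floor)
    return max(0, numcopies - counted)


def numcopiesneeded_matches(n: int, numcopies: int,
                            locations: Sequence[Repo]) -> bool:
    """Assumed semantics of numcopiesneeded=N: at least N more copies needed."""
    return copies_needed(numcopies, locations) >= n
```

With numcopies set to 2 and a file present on one semitrusted repo, one dead repo and one untrusted repo, only the semitrusted copy counts, so `numcopiesneeded_matches(1, ...)` holds in this sketch.
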
But, preferred content expressions can only operate on info stored in the
git repo, or they will fail to be stable. Ie, repo A needs to be able to
calculate whether a file is preferred content by repo B and get the same
result as when repo B calculates that.

numcopies is currently configured in 3 places:

* .git/config `annex.numcopies` (global, stored only locally)
* .gitattributes `annex.numcopies` (per file, stored in git repo)
* --numcopies (not relevant)

So, need to add a global numcopies setting that is stored in the git repo.
That could either be a file in the git-annex branch, or just
`* annex.numcopies=2` in the toplevel .gitattributes. Note that the
assistant needs to be able to query and set it, which I think argues
against using .gitattributes for it. Also arguing against that is that the
.git/config numcopies value applies even to objects with no file in the
work tree, which gitattributes settings do not.

Conclusion:

* Add to the git-annex branch a numcopies file that holds the global
numcopies default if present. **done**
* Modify the assistant to use it when configuring numcopies. **done**
* To deprecate .git/config's annex.numcopies, only make it take effect
when there is no numcopies file in the git-annex branch. **done**
* Add "numcopiesneeded=N" preferred content expression using the git-annex
branch numcopies setting, overridden by any .gitattributes numcopies setting
for a particular file. It should ignore the other ways to specify
numcopies, particularly git config annex.numcopies. **done**
* Make the repo groups that currently end with "or (not copies=semitrusted+:1)"
instead end with "or numcopiesneeded=1". **done**
* See if "numcopiesneeded=N" can check .gitattributes without getting
a lot slower. If not, perhaps add a "numcopiesneededaccurate=N" that
checks it. **done**
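
The precedence rules in the list above can be sketched roughly as follows. The function names are hypothetical, and the fallback of 1 when nothing is configured is an assumption, not something stated on this page.

```python
from typing import Optional

DEFAULT_NUMCOPIES = 1  # assumed fallback when nothing is configured


def global_numcopies(branch_value: Optional[int],
                     git_config_value: Optional[int]) -> int:
    """Global default used by ordinary commands.

    .git/config's annex.numcopies only takes effect when there is no
    numcopies file in the git-annex branch.
    """
    if branch_value is not None:
        return branch_value
    if git_config_value is not None:
        return git_config_value
    return DEFAULT_NUMCOPIES


def numcopies_for_expression(branch_value: Optional[int],
                             gitattributes_value: Optional[int]) -> int:
    """Value used when evaluating numcopiesneeded=N.

    Only inputs stored in the git repo are consulted, so every repo
    computes the same answer. A per-file .gitattributes setting overrides
    the git-annex branch default; git config is deliberately ignored.
    """
    if gitattributes_value is not None:
        return gitattributes_value
    if branch_value is not None:
        return branch_value
    return DEFAULT_NUMCOPIES
```

Passing `None` as the .gitattributes value models a variant of the expression that skips that lookup for speed.
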
[[done]]

## Stability analysis
If a remote prefers eg, "blah or numcopiesneeded=1", and
file $foo does not match blah, but needs more copies, then the
expression will match.
So, git-annex will get $foo, adding a copy. Which means that the
numcopiesneeded=1 will no longer match, so the file is no longer wanted
now that it has been downloaded.

Now there are two cases for what can happen:

* git-annex tries to drop $foo, but fails because it cannot find enough
other copies
* git-annex copies $foo to some other remote that wants it, and then
manages to drop $foo from the local repo.

This seems ok. Files flow through repos and they act like transfer
repos when there are not enough copies.
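
The flow can be modelled with a toy simulation. This is only a sketch under simplifying assumptions: the file never matches "blah", `copies` counts every copy including the local one, a drop is allowed only when numcopies other copies remain, and all names are hypothetical.

```python
def wanted(matches_blah: bool, numcopies: int, copies: int) -> bool:
    """Evaluate "blah or numcopiesneeded=1" for one file."""
    return matches_blah or (numcopies - copies) >= 1


def can_drop(numcopies: int, copies: int) -> bool:
    """A drop must leave at least numcopies other copies behind."""
    return copies - 1 >= numcopies


def trace(numcopies: int = 2, copies: int = 1) -> list:
    """Walk $foo through one repo and record what happens."""
    events = []
    if wanted(False, numcopies, copies):
        copies += 1  # the repo downloads the file
        events.append("get")
    if not wanted(False, numcopies, copies):
        if can_drop(numcopies, copies):
            copies -= 1
            events.append("drop")
        else:
            # First case: not enough other copies, so the drop fails.
            events.append("drop refused")
            # Second case: another remote that wants the file gets a copy.
            copies += 1
            events.append("copied to another remote")
            if can_drop(numcopies, copies):
                copies -= 1
                events.append("drop")
    events.append("wanted again" if wanted(False, numcopies, copies) else "stable")
    return events
```

In this model the file is not wanted again after the final drop, because numcopies stays satisfied by the copies held elsewhere.
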
--[[Joey]]