Make sanity checker run git annex unused daily, and queue up transfers
of unused files to any remotes that will have them. The transfer retrying
code works for us here, so eg when a backup disk remote is plugged in,
any transfers to it are done. Once the unused files reach a remote,
they'll be removed locally as unwanted.
If the setup does not cause unused files to go to a remote, they'll pile
up, and the sanity checker detects this using some heuristics that are
pretty good -- 1000 unused files, or 10% of disk used by unused files,
or more disk wasted by unused files than is left free. Once it detects
this, it pops up an alert in the webapp, with a button to take action.
TODO: Webapp UI to configure this, and also the ability to launch an
immediate cleanup of all unused files.
This commit was sponsored by Simon Michael.
Checking .gitattributes adds a full minute to a git annex find looking for
files that don't have enough copies. 2:25 increasts to 3:27. I feel this is
too much of a slowdown to justify making it the default. So, exposed two
versions of the preferred content expression, a slow one and a fast but
approximate one.
I'm using the approximate one in the default preferred content expressions
to avoid slowing down the assistant.
* Add numcopiesneeded preferred content expression.
* Client, transfer, incremental backup, and archive repositories
now want to get content that does not yet have enough copies.
This means the asssistant will make copies of files that don't yet
meet the configured numcopies, even to places that would not normally want
the file.
For example, if numcopies is 4, and there are 2 client repos and
2 transfer repos, and 2 removable backup drives, the file will be sent
to both transfer repos in order to make 4 copies. Once a removable drive
get a copy of the file, it will be dropped from one transfer repo or the
other (but not both).
Another example, numcopies is 3 and there is a client that has a backup
removable drive and two small archive repos. Normally once one of the small
archives has a file, it will not be put into the other one. But, to satisfy
numcopies, the assistant will duplicate it into the other small archive
too, if the backup repo is not available to receive the file.
I notice that these examples are fairly unlikely setups .. the old behavior
was not too bad, but it's nice to finally have it really correct.
.. Almost. I have skipped checking the annex.numcopies .gitattributes
out of fear it will be too slow.
This commit was sponsored by Florian Schlegel.
Similar to the assistant, this honors any configured preferred content
expressions.
I am not entirely happpy with the implementation. It would be nicer if
the seek function returned a list of actions which included the individual
file gets and copies and drops, rather than the current list of calls to
syncContent. This would allow getting rid of the somewhat reundant display
of "sync file [ok|failed]" after the get/put display.
But, do that, withFilesInGit would need to somehow be able to construct
such a mixed action list. And it would be less efficient than the current
implementation, which is able to reuse several values between eg get and
drop.
Note that currently this does not try to satisfy numcopies when
getting/putting files (numcopies are of course checked when dropping
files!) This makes it like the assistant, and unlike get --auto
and copy --auto, which do duplicate files when numcopies is not yet
satisfied. I don't know if this is the right decision; it only seemed to
make sense to have this parallel the assistant as far as possible to start
with, since I know the assistant works.
This commit was sponsored by Øyvind Andersen Holm.
Let's make semitrusted+:1 mean that, since it cannot be expressed easily
with the current syntax (could use (semitrusted:1 or trusted:1), but that
does not scale to higher values than 2 copy, and also fails if I add more
trust levels).
Thanks to Andy for spotting this bug by just reading my blog.
in= was problimatic in two ways. First, it referred to a remote by name,
but preferred content expressions can be evaluated elsewhere, where that
remote doesn't exist, or a different remote has the same name. This name
lookup code could error out at runtime. Secondly, in= seemed pretty useless.
in=here did not cause content to be gotten, but it did let present content
be dropped.
present is more useful, although "not present" is unstable and should be
avoided.
When there's just 1 client repo, and a transfer repo is created, its
preferred content will now make it prefer all content in the client,
even though there's no other client yet to transfer it to. Presumably,
another client will be created eventually. It might even already exist,
and the transfer repo will be used to connect up with it.
I've designed these to work well together, I hope. If I get it wrong,
I can just change the code in one place, since these expressions
won't be stored in the git-annex branch.