preferred content stability analysis

Joey Hess 2014-01-22 15:55:44 -04:00
parent ae3cd632bd
commit 02896ee15d
2 changed files with 49 additions and 2 deletions


@ -0,0 +1,21 @@
The [[preferred_content]] expressions didn't have a design document, but
they form a small, non-Turing-complete DSL for expressing which objects a
repository prefers to contain.

One thing that does need to be written down, though, is the stability
analysis that must be done of preferred content expressions.

It's important that when a set of repositories all look at one another's
preferred content expressions, and copy/move/drop objects to satisfy them,
they end up at a steady state. So, a given preferred content expression
should ideally evaluate to the same answer for each key, from the
perspective of each repository.

The best way to ensure that is to only use terms in preferred
content expressions that rely on state that is shared between all
repositories. So, state in the git-annex branch, or the master branch
(assuming all repositories have master checked out).

Since git is eventually consistent, there might be temporary disagreements
about which object belongs where, but once consistency is reached, things
will settle down.
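
The stability requirement above can be illustrated with a toy model. This is only a sketch, not git-annex code: `SharedState`, `RepoView`, `wants_more_copies`, `thinks_unused` and `agree` are hypothetical names invented for the illustration, and the "fewer than 2 copies" rule is an arbitrary stand-in for a term that reads only synced state.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Hypothetical stand-in for state every repo sees once git has synced."""
    copies: dict  # key -> number of known copies


@dataclass
class RepoView:
    """What one repository knows: the shared state plus private, local state."""
    name: str
    shared: SharedState
    local_unused: set = field(default_factory=set)  # not shared with others


def wants_more_copies(view, key):
    # Term that reads only shared state, so every repo computes the same answer.
    return view.shared.copies.get(key, 0) < 2


def thinks_unused(view, key):
    # Term that reads private state, so answers can differ between repos.
    return key in view.local_unused


def agree(expr, views, key):
    """True if every repository evaluates expr to the same answer for key."""
    return len({expr(view, key) for view in views}) == 1


shared = SharedState(copies={"K1": 1})
repo_a = RepoView("A", shared, local_unused={"K1"})  # A has scanned for unused keys
repo_b = RepoView("B", shared)                       # B has not scanned yet

stable = agree(wants_more_copies, [repo_a, repo_b], "K1")
unstable = agree(thinks_unused, [repo_a, repo_b], "K1")
```

In this model a term is safe to use when `agree` holds for every key once the repositories have synced; a term backed by per-repository state offers no such guarantee.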


@ -42,7 +42,8 @@ Finally, how to specify a feature request for git-annex?
> to hang on to unused content.
> Something like "unused=true" I suppose, because not having a parameter
> would complicate preferred content parsing, and I cannot think
> of a useful parameter. (It cannot be a timestamp, because there's
> no way repos can agree on when a key became unused.)
> * In order to quickly match that terminal, the Annex monad will need
> to keep a Set of unused Keys. This should only be loaded on demand.
> NB: There is some potential for a great many unused Keys to cause
@ -57,7 +58,7 @@ Finally, how to specify a feature request for git-annex?
> for most repos. Note that the assistant could also notice on the
> fly when files are removed and mark their keys as unused if that was
> the last associated file. (Only currently possible in direct mode.)
> * After scanning for unused files, it makes sense for the
> assistant to queue transfers of unused files to any remotes that
> do want them (eg, backup remotes). If the files can successfully be
> sent to a remote, that will lead to them being dropped locally as
@ -70,6 +71,7 @@ Finally, how to specify a feature request for git-annex?
> time stamp of the object; we could use the mtime of the .map file,
> but that's direct mode only and may be replaced with a database
> later. Seems best to just keep an unused log file with timestamps.
> **done**
> * After the assistant scans for unused files, if annex.expireunused
> is not set, and there is some significant quantity of unused files
> (eg, more than 1000, or more than 1 gb, or more than the amount of
@ -87,3 +89,27 @@ Finally, how to specify a feature request for git-annex?
> might be. For example, if a file is replicated to 2 clients, and one
> client directly edits it, or deletes it, it loses the old version,
> but the other client will still be storing that old version.
>
> ## Stability analysis for unused= in preferred content expressions
>
> This is tricky, because two repos that are otherwise entirely
> in sync may have differing opinions about whether a key is unused,
> depending on when each last scanned for unused keys.
>
> So, this preferred content terminal is *not stable*.
> It may be possible to write preferred content expressions
> that constantly move such keys around without reaching a steady state.
>
> Example:
>
> A and B are clients directly connected, and both also connected
> to BACKUP.
>
> A deletes F. B syncs with A, and runs an unused check; it decides F
> is unused. B sends F to BACKUP. B will then think A doesn't want F,
> and will drop F from A. Next time A runs a full transfer scan, it will
> *not* find F (because the file was deleted!). So it won't get F back from
> BACKUP.
>
> So, the fact that the full transfer scan does not look for unused
> files seems to make this work out ok.
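
The A / B / BACKUP example can be walked through with a small simulation. This is a sketch under stated assumptions, not how git-annex is implemented: `full_transfer_scan` and `simulate` are hypothetical names, each repository is modelled as a set of work-tree files plus a set of objects it holds, a key counts as unused when its object is present but no file in the tree refers to it, and B is assumed to also drop its own copy once F is safely on BACKUP.

```python
def full_transfer_scan(tree_files, present, available_elsewhere):
    """Keys a repo would fetch: only keys of files still in its work tree."""
    return {k for k in tree_files if k not in present and k in available_elsewhere}


def simulate():
    # Work-tree files and stored objects per repository (toy model).
    tree = {"A": {"F"}, "B": {"F"}}
    present = {"A": {"F"}, "B": {"F"}, "BACKUP": set()}

    # A deletes F; after B syncs with A, F is gone from both work trees.
    tree["A"].discard("F")
    tree["B"].discard("F")

    # B's unused check: objects held that no file in the tree refers to.
    unused_b = present["B"] - tree["B"]

    # B sends the unused keys to BACKUP, which wants them.
    present["BACKUP"] |= unused_b

    # B concludes A does not want F and drops it from A.
    # Assumption of this sketch: B drops its own copy too.
    present["A"] -= unused_b
    present["B"] -= unused_b

    # A's next full transfer scan only considers files still in its tree,
    # so the deleted file's key is never requested back from BACKUP.
    fetched = full_transfer_scan(tree["A"], present["A"], present["BACKUP"])
    return present, fetched


present, fetched = simulate()
```

In this model the scan fetches nothing and F ends up only on BACKUP, matching the steady state the example reaches. If the full transfer scan did look at unused keys, A could fetch F again and the cycle could repeat.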