Merge remote branch 'branchable/master'

Joey Hess 2011-02-21 15:01:31 -04:00
commit 7b586f0833
9 changed files with 70 additions and 20 deletions

5
doc/forum/chrysn.mdwn Normal file

@@ -0,0 +1,5 @@
* **name**: chrysn
* **website**: <http://christian.amsuess.com/>
* **uses git-annex for**: managing the family's photos (and possibly videos and music in the future)
* **likes git-annex because**: it adds a layer of commit semantics over a regular file system without keeping everything in duplicate locally
* **would like git-annex to**: not be required any more as git itself learns to use cow filesystems to avoid abundant disk usage and gets better with sparser checkouts (git-annex might then still be a simpler tool that watches over what can be safely dropped for a sparser checkout)


@@ -0,0 +1,45 @@
This is a rough sketch of a modification of git-annex to rely more on git commit semantics. It might be flawed due to my lack of understanding of git-annex internals. --[[chrysn]]
Summary
=========
Currently, the location tracking is only used for informational purposes unless a repository is [[trust]]ed, in which case there is no checking at all. It is proposed to use the location tracking information as a commitment to keep a file, with the proviso that the file may still be dropped once another repository takes over responsibility for it.
git's atomic commit semantics are proposed as the mechanism: they make sure that, before files are actually deleted, another repository has accepted the deletion.
Modified git-annex-drop behavior
==========================
The most important (if not the only) git-annex command affected by this is `git annex drop`. Currently, when dropping a large number of files, each file is checked against another host (or several, if so configured) to see whether it is safe to delete.
The new behavior would be to
* decrement the location tracking counter for all files to be dropped,
* commit that change,
* try to push it to enough repositories that the numcopies constraint is met,
* revert if that fails,
* otherwise really drop the files from the backend.
Unlike explicit checking, this never asks the remote backend whether the file is really present. On the other hand, git-annex already relies on the files in the backend not being touched by anyone but git-annex, and git-annex would only drop them if they were dereferenced and that change committed, in which case git would not accept the push. (git by itself would accept a merged push, but even if the reverting step failed due to a power outage or similar, git-annex would, before really deleting files from the backend, check again whether the numcopies constraint is still met, and revert its own delete commit, as the files are still present anyway.)
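The steps listed above can be sketched as a small simulation. This is a minimal sketch, not git-annex code: the `Repo` class, `accept_push`, and `drop` are hypothetical names, the location log is modelled as a plain dictionary, and a "push" is reduced to copying that dictionary to a reachable remote.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Hypothetical stand-in for a repository; not git-annex's real data model."""
    name: str
    reachable: bool = True
    # Location log: key -> names of repositories committed to keeping it.
    log: dict = field(default_factory=dict)
    # Keys whose content is physically present in this repository's backend.
    content: set = field(default_factory=set)

    def accept_push(self, new_log):
        """Take over the pushed location log; refuse if unreachable."""
        if not self.reachable:
            return False
        self.log = {key: set(names) for key, names in new_log.items()}
        return True


def drop(local, remotes, keys, numcopies=1):
    """Drop keys from local, deleting content only after the push succeeded."""
    old_log = {key: set(names) for key, names in local.log.items()}
    # Steps 1 and 2: record ("commit") that local no longer keeps the keys.
    for key in keys:
        local.log.setdefault(key, set()).discard(local.name)
    # Step 3: push, then count accepting remotes that still keep each key.
    accepted = [remote for remote in remotes if remote.accept_push(local.log)]
    enough = all(
        sum(1 for remote in accepted if remote.name in local.log[key]) >= numcopies
        for key in keys
    )
    if not enough:
        # Step 4: revert locally and on every remote that took the commit.
        local.log = old_log
        for remote in accepted:
            remote.accept_push(old_log)
        return False
    # Step 5: only now remove the content from the local backend.
    local.content -= set(keys)
    return True
```

With a laptop and a reachable server that both hold key `k`, `drop(laptop, [server], ["k"])` returns `True` and removes `k` from the laptop's content. If the server is unreachable, the call returns `False` and nothing is deleted.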
Implications for trust
==============
The proposed change also changes the semantics of trust. Trust can now be controlled in a finer-grained way between untrusted and semi-trusted, as best illustrated by a use case:
> Alice takes her netbook with her on a trip through Spain, and will fill most of its disk up with pictures she takes. As she expects to meet some old friends during the first days, she wants to take older pictures with her, which are safely backed up at home.
>
> She tells her netbook's repository to dereference the old images (but not other parts of the repository she has not copied anywhere yet) and pushes to the server before leaving. When she adds pictures from her camera to the repository, git-annex can now free up space as needed.
Dereferencing could be implemented as `git annex drop --not-yet`; freeing space would be similar to `dropunused`.
Under the new semantics, a trusted repository would be one that never accepts dropping anything, just as before.
Advantages / Disadvantages
=====================
The advantage of this proposal is that the round trips required for dropping something could be greatly reduced.
There should also be simplifications in the `git annex drop` command as it doesn't need to take care of locking any more (git should already do that between checking if HEAD is a parent of the pushed commit and replacing HEAD).
Besides being a major change in git-annex (with the requirement to track hosts' git-annex versions for migration, as the new trust system is incompatible with the old one), no disadvantages of that strategy are known to the author (hoping for discussion below).


@@ -219,7 +219,7 @@ Many git-annex commands will stage changes for later `git commit` by you.
* fromkey file
-This can be used to maually set up a file to link to a specified key
+This can be used to manually set up a file to link to a specified key
in the key-value backend. How you determine an existing key in the backend
varies. For the URL backend, the key is just a URL to the content.
@@ -244,7 +244,7 @@ Many git-annex commands will stage changes for later `git commit` by you.
* setkey file
-This plumbing-level command sets the annxed data for a key to the content of
+This plumbing-level command sets the annexed data for a key to the content of
the specified file, and then removes the file.
A backend will typically need to be specified with --backend. If none
@@ -380,7 +380,7 @@ These files are used by git-annex, in your git repository:
available. Annexed files in your git repository symlink to that content.
`.git-annex/uuid.log` is used to map between repository UUID and
-decscriptions.
+descriptions.
`.git-annex/trust.log` is used to indicate which repositories are trusted
and untrusted.


@@ -51,7 +51,7 @@ files with git.
* [[git-annex man page|git-annex]]
* [[key-value backends|backends]] for data storage
* [[location_tracking]] reminds you where git-annex has seen files
-* git-annex prevents accidential data loss by [[tracking copies|copies]]
+* git-annex prevents accidental data loss by [[tracking copies|copies]]
of your files
* [[what git annex is not|not]]
* git-annex is Free Software, licensed under the [[GPL]].


@@ -27,5 +27,5 @@ descriptions to help you with finding them:
c0a28e06-d7ef-11df-885c-775af44f8882 -- USB archive drive 1
e1938fee-d95b-11df-96cc-002170d25c55
-In certian cases you may want to configure git-annex to [[trust]]
+In certain cases you may want to configure git-annex to [[trust]]
that location tracking information is always correct for a repository.


@@ -26,7 +26,7 @@
I only learned of git-media after writing git-annex, but I probably
would have still written git-annex instead of using it. Currently,
git-media has the advantage of using git smudge filters rather than
-git-annex's pile of symlinks, and it may be a tighter fit for certian
+git-annex's pile of symlinks, and it may be a tighter fit for certain
situations. It lacks git-annex's support for widely distributed storage,
using only a single backend data store. It also does not support
partial checkouts of file contents, like git-annex does.


@@ -11,12 +11,12 @@ information. When removing content, it will directly check
that other repositories have enough [[copies]].
Generally that explicit checking is a good idea. Consider that the current
-[[location_tracking]] information for a remote may not yet have propigated
+[[location_tracking]] information for a remote may not yet have propagated
out. Or, a remote may have suffered a catastrophic loss of data, or itself
been lost.
There is still some trust involved here. A semitrusted repository is
-dependended on to retain a copy of the file content; possibly the only
+depended on to retain a copy of the file content; possibly the only
[[copy|copies]].
(Being semitrusted is the default. The `git annex semitrust` command


@@ -6,13 +6,13 @@ safe place.
With git-annex, Bob has a single directory tree that includes all
his files, even if their content is being stored offline. He can
reorganize his files using that tree, committing new versions to git,
-without worry about accidentially deleting anything.
+without worry about accidentally deleting anything.
When Bob needs access to some files, git-annex can tell him which drive(s)
they're on, and easily make them available. Indeed, every drive knows what
is on every other drive.
-Run in a cron job, git-annex adds new files to achival drives at night. It
+Run in a cron job, git-annex adds new files to archival drives at night. It
also helps Bob keep track of intentional, and unintentional copies of
files, and logs information he can use to decide when it's time to duplicate
the content of old drives.


@@ -84,7 +84,7 @@ can get them.
## transferring files: When things go wrong
-After a while, you'll have serveral annexes, with different file contents.
+After a while, you'll have several annexes, with different file contents.
You don't have to try to keep all that straight; git-annex does
[[location_tracking]] for you. If you ask it to get a file and the drive
or file server is not accessible, it will let you know what it needs to get
@@ -146,7 +146,7 @@ That's a good thing, because it might be the only copy, you wouldn't
want to lose it in a fumblefingered mistake.
# echo oops > my_cool_big_file
-bash: my_cool_big_file: Permission deined
+bash: my_cool_big_file: Permission denied
In order to modify a file, it should first be unlocked.
@@ -176,7 +176,7 @@ There is one problem with using `git commit` like this: Git wants to first
stage the entire contents of the file in its index. That can be slow for
big files (sorta why git-annex exists in the first place). So, the
automatic handling on commit is a nice safety feature, since it prevents
-the file content being accidentially commited into git. But when working with
+the file content being accidentally committed into git. But when working with
big files, it's faster to explicitly add them to the annex yourself
before committing.
@@ -267,12 +267,12 @@ that the URL is stable; no local backup is kept.
Another handy alternative to the default [[backend|backends]] is the
SHA1 backend. This backend provides more git-style assurance that your data
-has not been damanged. And the checksum means that when you add the same
+has not been damaged. And the checksum means that when you add the same
content to the annex twice, only one copy need be stored in the backend.
The only reason it's not the default is that it needs to checksum
files when they're added to the annex, and this can slow things down
-significantly for really big files. To make SHA1 the detault, just
+significantly for really big files. To make SHA1 the default, just
add something like this to `.gitattributes`:
* annex.backend=SHA1
@@ -292,7 +292,7 @@ files will be skipped.
After migrating a file to a new backend, the old content in the old backend
will still be present. That is necessary because multiple files
-can point to the same content. The `git annex unused` sucommand can be
+can point to the same content. The `git annex unused` subcommand can be
used to clear up that detritus later. Note that hard links are used,
to avoid wasting disk space.
@@ -342,7 +342,7 @@ setting is satisfied for all files.
fsck my_cool_big_file (checksum...) ok
...
-You can also specifiy the files to check. This is particularly useful if
+You can also specify the files to check. This is particularly useful if
you're using sha1 and don't want to spend a long time checksumming everything.
# git annex fsck my_cool_big_file
@@ -367,7 +367,7 @@ might say about a badly messed up annex:
## backups
git-annex can be configured to require more than one copy of a file exists,
-as a simple backup for your data. This is controled by the "annex.numcopies"
+as a simple backup for your data. This is controlled by the "annex.numcopies"
setting, which defaults to 1 copy. Let's change that to require 2 copies,
and send a copy of every file to a USB drive.
@@ -394,9 +394,9 @@ For more details about the numcopies setting, see [[copies]].
## untrusted repositories
-Suppose you have a USB thunb drive and are using it as a git annex
+Suppose you have a USB thumb drive and are using it as a git annex
repository. You don't trust the drive, because you could lose it, or
-accidentially run it through the laundry. Or, maybe you have a drive that
+accidentally run it through the laundry. Or, maybe you have a drive that
you know is dying, and you'd like to be warned if there are any files
on it not backed up somewhere else. Maybe the drive has already died
or been lost.