The assistant should help the user recover their repository when things go
wrong.

[[!toc ]]

## dangling lock files

There are a few ways a git repository can get broken that are easily fixed.
One is leftover index.lock files. When a commit to a repository fails
because of one, check that nothing else is using the repository, remove
the stale lock file, and redo the commit.

* **done** for .git/annex/index.lock, can be handled safely and automatically.
* **done** for .git/index.lock, only when the assistant is starting up.
* What about local remotes, eg removable drives? git-annex does attempt
  to commit to the git-annex branch of those. It will use the automatic
  fix if any are dangling. It does not commit to the master branch; indeed
  a removable drive typically has a bare repository. So I think nothing to
  do here.
* What about git-annex-shell? If the ssh remote has the assistant running,
  it can take care of it, and if not, it's a server, and perhaps the user
  should be required to fix up if it crashes during a commit. This should
  not affect the assistant anyway.
* **done** Seems that refs can also have stale lock files, for example
  '/storage/emulated/legacy/DCIM/.git/refs/remotes/flick_phonecamera/synced/git-annex.lock'

All git lock files are now handled (except gc lock files).
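
The lock file cleanup described above can be sketched roughly as follows. This is only an illustration, not the assistant's actual code: the function names, the `in_use` callback, and the assumption that gc lock files are named `gc.pid` / `gc.pid.lock` are all hypothetical. Deciding whether something else is still using the repository is the hard part and is left to the caller here.

```python
import os

# Assumption: these are the gc lock file names; they are deliberately skipped,
# matching the "except gc lock files" note.
GC_LOCK_NAMES = {"gc.pid", "gc.pid.lock"}


def find_lock_files(git_dir):
    """Return sorted paths of *.lock files under git_dir, skipping gc locks."""
    found = []
    for root, _dirs, files in os.walk(git_dir):
        for name in files:
            if name.endswith(".lock") and name not in GC_LOCK_NAMES:
                found.append(os.path.join(root, name))
    return sorted(found)


def remove_stale_locks(git_dir, in_use):
    """Remove each lock file unless in_use(path) reports it is still held.

    in_use is a hypothetical caller-supplied check; this sketch does not
    try to detect other processes itself.
    """
    removed = []
    for path in find_lock_files(git_dir):
        if not in_use(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Walking the whole `.git` directory, rather than checking a fixed list of paths, is what picks up lock files under `refs/` like the one in the example above.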

## incremental fsck

Add webapp UI to enable incremental fsck **done**

Of course, incremental fsck will run as a niced (and ioniced) background
job. There will need to be a button in the webapp to stop it, in case it's
annoying. **done**

When fsck finds a damaged file, queue a download of the file from a
remote. **done**

Detect when a removable drive is connected in the Cronner, and check
for and try to run its remote fsck jobs. **done** (Same mechanism will work for
network remotes becoming connected.)

TODO: If no accessible remote has a file that fsck reported missing,
prompt the user to, eg, connect a drive containing it. Or perhaps this is a
special case of a general problem, and the webapp should prompt the user
when any desired file is available on a remote that's not mounted?
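
A minimal sketch of the niced and ioniced background job mentioned above. The helper names are hypothetical, and it assumes the `nice` and `ionice` programs are on the PATH where available; it is not how the assistant itself launches fsck.

```python
import shutil
import subprocess


def niced_command(argv, niceness=19):
    """Prefix argv so it runs at low CPU priority and, where ionice exists,
    in the idle I/O scheduling class."""
    prefix = []
    if shutil.which("nice"):
        prefix += ["nice", "-n", str(niceness)]
    if shutil.which("ionice"):
        # Class 3 is "idle": disk access only when nothing else wants it.
        prefix += ["ionice", "-c", "3"]
    return prefix + list(argv)


def start_incremental_fsck(repo_dir):
    """Start the fsck in the background and return the process handle.

    A stop button would call terminate() on the returned handle.
    """
    cmd = niced_command(["git", "annex", "fsck", "--incremental"])
    return subprocess.Popen(cmd, cwd=repo_dir)
```

Keeping the process handle around is what makes the stop button cheap to implement.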

## git-annex-shell remote fsck

TODO: git-annex-shell fsck support, which would allow cheap fast fscks
of ssh remotes.

Would be nice; otherwise remote fsck is too expensive (downloads
everything) to have the assistant do.

Note that Remote.Git already tries to use this, but the assistant does not
call it for non-local remotes.

## git fsck

Have the sanity checker run git fsck periodically (it's fairly inexpensive,
but still not too often, and should be ioniced and niced).

If committing to the repository fails, after resolving any dangling lock
files (see above), it should run git fsck.

If git fsck finds problems, launch git repository repair.

## git repository repair

There are several ways git repositories can get damaged.

The most common is empty files in .git/annex/objects and commits that refer
to those objects, when the objects have not yet been pushed anywhere.
I've several times recovered from this manually by
removing the bad files and resetting to before the commits that referred to
them, then re-staging any divergence in the working tree. This could
perhaps be automated.

As long as the git repository has at least one remote, another method is to
clone the remote, sync from all other remotes, move over .git/config and
.git/annex/objects, and tar up the old broken git repo and `git annex add`
it. This should be automatable and get the user back on their feet. User
could just click a button and have this be done.

This is useful outside git-annex as well, so make it a
git-recover-repository command.

### detailed design

Run `git fsck` and parse output to find bad objects. Note that
fsck may fall over and fail to print out all bad objects, when
files are corrupt. So if the fsck exits nonzero, need to collect all
bad objects it did find, and:

1. If the local repository contains packs, the packs may be corrupt.
   So, start by using `git unpack-objects` to unpack all
   packs it can handle (which may include parts of corrupt packs)
   back to loose objects. And delete all packs.
2. Delete all loose corrupt objects.

Repeat until fsck finds no new problems.
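
Parsing the fsck output could look something like this sketch. It assumes object ids are 40-character SHA-1 hex strings and that lines starting with `dangling` or `unreachable` describe intact objects; the function name and the sample lines in the usage below are illustrative, not copied from a real run. It deliberately over-collects (a line that mentions both a good and a bad object yields both), so the result is a set of candidates that would still need checking one by one.

```python
import re

# Assumption: object ids are full 40-character SHA-1 hex strings.
SHA_RE = re.compile(r"\b[0-9a-f]{40}\b")


def bad_object_candidates(fsck_output):
    """Collect every object id mentioned on a non-informational fsck line."""
    candidates = set()
    for line in fsck_output.splitlines():
        # Dangling or unreachable objects are unreferenced, not corrupt.
        if line.startswith(("dangling ", "unreachable ")):
            continue
        candidates.update(SHA_RE.findall(line))
    return candidates


sample = "\n".join([
    "missing blob " + "a" * 40,
    "dangling commit " + "b" * 40,
    "error: " + "c" * 40 + ": object corrupt or missing",
])
print(sorted(bad_object_candidates(sample)))
```

Since fsck may stop early, the same function would be rerun on each pass of the repeat loop and the candidate sets merged.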

Check if there's a remote. If so, and if the bad objects are all
present on it, can simply get all bad objects from the remote,
and inject them back into .git/objects to recover:

3. Make a new (bare) clone from the remote.
   (Note: git does not seem to provide a way to fetch specific missing
   objects from the remote. Also, cannot use `--reference` against
   a repository with missing refs. So this seems unavoidably
   network-expensive.)
5. Use git-cat-file in raw mode on the clone to dump each missing object,
   and feed it into git-hash-object in the corrupt repo. (This avoids
   needing to unpack packs in the clone.)
6. If each bad object was able to be repaired this way, we're done!
   (If not, can reuse the clone for getting objects from the next remote.)
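
When injecting recovered objects, it seems worth checking that the bytes obtained from the clone really hash to the id that was missing. The sketch below does that in plain Python instead of calling git-cat-file and git-hash-object, so it is a stand-in for the steps above rather than the same mechanism. It assumes a SHA-1 repository and the usual loose object layout; the function names are hypothetical.

```python
import hashlib
import os
import zlib


def git_object_id(obj_type, content):
    """SHA-1 id git gives to content (bytes) of the given object type."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def inject_object(git_dir, expected_id, obj_type, content):
    """Write content as a loose object, but only if it hashes to expected_id.

    Returns True when the object was written, False on a hash mismatch.
    """
    if git_object_id(obj_type, content) != expected_id:
        return False
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    path = os.path.join(git_dir, "objects", expected_id[:2], expected_id[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        # Loose objects are the zlib-compressed header plus content.
        f.write(zlib.compress(header + content))
    return True
```

Refusing to write on a mismatch means a remote that is itself corrupt cannot make the local repository worse.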

If some missing objects cannot be recovered from remotes, find commits in each
local branch that are broken by any of the remaining missing objects. Some of this can
be parsed from git fsck output, but for, eg, blobs, the commits and their
trees need to be walked to find trees that refer to the blobs.

For each branch that is affected, look in the reflog and/or
`git log $branch` to find the last good commit that predates all broken commits. (If
the head commit of a branch is broken, git log is not going to show
anything useful, but the reflog can be used to find past refs for the
branch -- have to first delete the .git/HEAD file if it points to the
broken ref.)
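
The search for the last good commit can be sketched over plain data structures. Everything here is hypothetical scaffolding: `reflog` stands for a branch's past values, newest first, `parents` maps each commit to its parent commits, and `broken` is the set of commits found to depend on missing objects. Getting those out of git is not shown.

```python
def ancestors(commit, parents):
    """All commits reachable from commit (inclusive) via the parents map."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return seen


def last_good_commit(reflog, parents, broken):
    """Newest reflog entry whose whole history avoids every broken commit.

    Returns None if no such entry exists.
    """
    for candidate in reflog:
        if not (ancestors(candidate, parents) & broken):
            return candidate
    return None


parents = {"d": ["c"], "c": ["b"], "b": ["a"], "a": []}
print(last_good_commit(["d", "c", "b", "a"], parents, {"c"}))
```

With commit `c` broken, both `d` and `c` are rejected and the search settles on `b`. A `None` result is the case where a dummy empty commit would be needed.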

The basic idea then is to reset the branch to the last good commit
that was found for it.

* For the HEAD branch, can just reset it. (If no last good commit was found
  for the HEAD branch, reset it to a dummy empty commit.) This will
  leave git showing any changes made since then as staged in the index and
  uncommitted. Or if the index is missing/corrupt, any files in the tree will
  show as modified and uncommitted. User (or git-annex assistant) can then
  commit as appropriate. Print appropriate warning message.
* Special handling for git-annex branch: Reset to last good commit
  (or to dummy empty commit if there is not one), and
  then commit `.git/annex/index` over top of that, and then run a
  `git annex fsck --fast` to fix up any object location info.
* Remote tracking branches can just be removed, and then `git fetch`
  from the remote, which will re-download missing objects from it and
  reinstate the tracking branch.
* For other branches (or tags), it's best to not rewrite them, because
  that could get really confusing. Instead, delete the old broken branch,
  and make a "recovered/$branch" that holds the last good commit (if one
  was found).
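
The per-ref rules above amount to a small dispatch on the ref name, sketched here. The action labels and the function name are made up for illustration. Where recovered tags should live is not spelled out above, so putting them under `refs/tags/recovered/` is purely an assumption.

```python
def recovery_action(ref, head_ref):
    """Map a broken ref to (action, target_ref) following the rules above.

    ref and head_ref are full ref names such as "refs/heads/master".
    """
    if ref == head_ref:
        return ("reset", ref)
    if ref == "refs/heads/git-annex":
        return ("reset-and-commit-annex-index", ref)
    if ref.startswith("refs/remotes/"):
        return ("delete-and-fetch", ref)
    if ref.startswith("refs/heads/"):
        name = ref[len("refs/heads/"):]
        return ("delete-and-recreate", "refs/heads/recovered/" + name)
    if ref.startswith("refs/tags/"):
        # Assumption: tags get the same recovered/ treatment as branches.
        name = ref[len("refs/tags/"):]
        return ("delete-and-recreate", "refs/tags/recovered/" + name)
    return ("leave-alone", ref)


print(recovery_action("refs/heads/topic", "refs/heads/master"))
```

Returning a description of the action, rather than performing it, would let the tool print the warning messages before touching anything.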