Turns out sqlite does not like having its database deleted out from
underneath it. It might suffice to empty the table, but I would rather
start each fsck over with a new database, so I added a lock file, and
running incremental fscks use a shared lock.
This leaves one concurrency bug left; running two concurrent fsck --more
will lead to: "SQLite3 returned ErrorBusy while attempting to perform step."
and one or both will fail. This is a concurrent writers problem.
Did not keep backwards compat for sticky bit records. An incremental fsck
that is already in progress will start over on upgrade to this version.
This is not yet ready for merging. The autobuilders need to have sqlite
installed.
Also, interrupting a fsck --incremental does not commit the database.
So, resuming with fsck --more restarts from beginning.
Memory: Constant during a fsck of tens of thousands of files.
(But, it does seem to buffer whole transation in memory, so
may really scale with number of files.)
CPU: ?
Eliminated complexity and future proofed. The most important change is that
all functions over Difference are now total; any Difference that can be
expressed should be handled. Avoids needs for sanity checking of inputs,
and version skew with the future.
Also, the difference.log now serializes a [Difference], not a Differences.
This saves space and keeps it simpler.
Note that [Difference] might contain conflicting differences (eg,
[Version5, Version6]. In this case, one of them needs to consistently win
over the others, probably based on Ord.
* init: Repository tuning parameters can now be passed when initializing a
repository for the first time. For details, see
http://git-annex.branchable.com/tuning/
* merge: Refuse to merge changes from a git-annex branch of a repo
that has been tuned in incompatable ways.
This reverts commit a7f05c007b.
Consider: relPathDirToFile (absPathFrom "/tmp/repo/xxx" "y/bar") "/tmp/repo/.git/annex/objects/xxx"
This needs to always yield "../../../.git/annex/objects/xxx" but on
Windows, it is "..\\..\\/tmp/repo/.git/annex/objects/xxx"
Reverts 965e106f24
Unfortunately, this caused breakage on Windows, and possibly elsewhere,
because parentDir and takeDirectory do not behave the same when there is a
trailing directory separator.
parentDir is less safe than takeDirectory, especially when working
with relative FilePaths. It's really only useful in loops that
want to terminate at /
This commit was sponsored by Audric SCHILTKNECHT.
This allows the git repository to be moved while git-annex is running in
it, with fewer problems.
On Windows, this avoids some of the problems with the absurdly small
MAX_PATH of 260 bytes. In particular, git-annex repositories should
work in deeper/longer directory structures than before. See
http://git-annex.branchable.com/bugs/__34__git-annex:_direct:_1_failed__34___on_Windows/
There are several possible ways this change could break git-annex:
1. If it changes its working directory while it's running, that would
be Bad News. Good news everyone! git-annex never does so. It would also
break thread safety, so all such things were stomped out long ago.
2. parentDir "." -> "" which is not a valid path. I had to fix one
instace of this, and I should probably wipe all calls to parentDir out
of the git-annex code base; it was never a good idea.
3. Things like relPathDirToFile require absolute input paths,
and code assumes that the git repo path is absolute and passes it to it
as-is. In the case of relPathDirToFile, I converted it to not make
this assumption.
Currently, the test suite has 16 failures.
This fixes all instances of " \t" in the code base. Most common case
seems to be after a "where" line; probably vim copied the two space layout
of that line.
Done as a background task while listening to episode 2 of the Type Theory
podcast.
This avoids a proliferation of hash directories when using new-style
chunking, and should improve performance since chunks are accessed
in sequence and so should have a common locality.
Of course, when a chunked key is encrypted, its hash directories have no
relation to the parent key.
This commit was sponsored by Christian Kellermann.
This allows eg, putting .git/annex/tmp on a ram disk, if the disk IO
of temp object files is too annoying (and if you don't want to keep
partially transferred objects across reboots).
.git/annex/misctmp must be on the same filesystem as the git work tree,
since files are moved to there in a way that will not work cross-device,
as well as symlinked into there.
I first wanted to put the tmp objects in .git/annex/objects/tmp, but
that would pose transition problems on upgrade when partially transferred
objects existed.
git annex info does not currently show the size of .git/annex/misctemp,
since it should stay small. It would also be ok to make something clean it
out, periodically.
(And a vpop command, which is still a bit buggy.)
Still need to do vadd and vrm, though this also adds their documentation.
Currently not very happy with the view log data serialization. I had to
lose the TDFA regexps temporarily, so I can have Read/Show instances of
View. I expect the view log format will change in some incompatable way
later, probably adding last known refs for the parent branch to View
or something like that.
Anyway, it basically works, although it's a bit slow looking up the
metadata. The actual git branch construction is about as fast as it can be
using the current git plumbing.
This commit was sponsored by Peter Hogg.
Seems that locking of annexed objects when they're being dropped was broken
in direct mode:
* When taking the lock before dropping, it created the .git/annex/objects
file, as an empty file. It seems that the dropping code deleted that,
but that is not right, and for all I know could in some situation cause
a corrupted object to leak out.
* When the lock was checked, it actually tried to open each direct mode
file, and checked if it was locked. Not the same lock used above, and
could also fail if some consumer of the file locked it.
Fixed this, and added windows support by switching direct mode to lock a
.lck file.
A file named "foo-" or "foo-bar" was taken as a key's file, with a backend
of "foo", and an empty keyName. This led to various problems, especially
because converting that key back to a file did not yeild the same filename.
Extends the index.lock handling to other git lock files. I surveyed
all lock files used by git, and found more than I expected. All are
handled the same in git; it leaves them open while doing the operation,
possibly writing the new file content to the lock file, and then closes
them when done.
The gc.pid file is excluded because it won't affect the normal operation
of the assistant, and waiting for a gc to finish on startup wouldn't be
good.
All threads except the webapp thread wait on the new startup sanity checker
thread to complete, so they won't try to do things with git that fail
due to stale lock files. The webapp thread mostly avoids doing that kind of
thing itself. A few configurators might fail on lock files, but only if the
user is explicitly trying to run them. The webapp needs to start
immediately when the user has opened it, even if there are stale lock
files.
Arranging for the threads to wait on the startup sanity checker was a bit
of a bear. Have to get all the NotificationHandles set up before the
startup sanity checker runs, or they won't see its signal. Perhaps
the NotificationBroadcaster is not the best interface to have used for
this. Oh well, it works.
This commit was sponsored by Michael Jakl
FAT has a lot of characters it does not allow in filenames, like ? and *
It's probably the worst offender, but other filesystems also have
limitiations.
In 2011, I made keyFile escape : to handle FAT, but missed the other
characters. It also turns out that when I did that, I was also living
dangerously; any existing keys that contained a : had their object
location change. Oops.
So, adding new characters to escape to keyFile is out. Well, it would be
possible to make keyFile behave differently on a per-filesystem basis, but
this would be a real nightmare to get right. Consider that a rsync special
remote uses keyFile to determine the filenames to use, and we don't know
the underlying filesystem on the rsync server..
Instead, I have gone for a solution that is backwards compatable and
simple. Its only downside is that already generated URL and WORM keys
might not be able to be stored on FAT or some other filesystem that
dislikes a character used in the key. (In this case, the user can just
migrate the problem keys to a checksumming backend. If this became a big
problem, fsck could be made to detect these and suggest a migration.)
Going forward, new keys that are created will escape all characters that
are likely to cause problems. And if some filesystem comes along that's
even worse than FAT (seems unlikely, but here it is 2013, and people are
still using FAT!), additional characters can be added to the set that are
escaped without difficulty.
(Also, made WORM limit the part of the filename that is embedded in the key,
to deal with filesystem filename length limits. This could have already
been a problem, but is more likely now, since the escaping of the filename
can make it longer.)
This commit was sponsored by Ian Downes
This may not strictly be needed -- the transition code bypasses the
journal. However, this ensures that the git-annex branch is only
committed with the journal locked. This will allow for further
improvements.
* since this is a crippled filesystem anyway, git-annex doesn't use
symlinks on it
* so there's no reason to use the mixed case hash directories that we're
stuck using to avoid breaking everyone's symlinks to the content
* so we can do what is already done for all bare repos, and make non-bare
repos on crippled filesystems use the all-lower case hash directories
* which are, happily, all 3 letters long, so they cannot conflict with
mixed case hash directories
* so I was able to 100% fix this and even resuming `git annex add` in the
test case will recover and it will all just work.
This avoids commit churn by the assistant when eg,
replacing a file with a symlink.
But, just as importantly, it prevents the working tree being left with a
deleted file if git-annex, or perhaps the whole system, crashes at the
wrong time.
(It also probably avoids confusing displays in file managers.)
Two fixes. First, and most importantly, relax the isLinkToAnnex check
to only look for /annex/objects/, not [^|/].git/annex/objects. If
GIT_DIR is used with a detached work tree, the git directory is
not necessarily named .git.
There are important caveats with doing that at all, since git-annex will
make symlinks that point at GIT_DIR, which means that the relative path
between GIT_DIR and GIT_WORK_TREE needs to remain stable across all clones
of the repository.
----
The other fix is just fixing crazy and wrong code that, when GIT_DIR is
set, expects to still find a git repository in the path below the work
tree, and uses some of its configuration, and some of GIT_DIR. What was I
thinking, and why can't I seem to get this code right?
Also for dropping objects in direct mode.
Checking presence reliably needs a cache of mtime, size, and inode.
This way, if a file is modified, keys that point to it are no longer
present.
Also, the code for restoring the symlink when removing objects is
unnecessarily messy. calcGitLink was generating links starting with
"../../remote/.git/", when running "git annex move --from remote".
I put in a workaround, but calcGitLink should probably be fixed.
There is not yet support for getting objects from repositories in direct
mode; it still looks for content in .git/annex/objects, and there's no
once place I can change to fix that.
Also, getting objects from direct mode repositories is problematic since
the can be changed while the object is being transferred. It probably needs
to quarantine it first.