The type checker should have noticed this, but the changes to mapM
that make it accept any Traversable hid the fact that it was not being
passed a list at all. Thus, what should have returned an empty list most
of the time instead returned [""] which was treated as the name of the
associated file, with disasterout consequences.
When I have time, I should add a test case checking what sync --content
drops. I should also consider replacing mapM with one re-specialized to
lists.
Previously, it only flushed when the queue got larger than 1.
Also, make the queue auto-flush when items are added, rather than needing
to be flushed as a separate step. This simplifies the code and make it more
efficient too, as it avoids needing to read the queue out of the state to
check if it should be flushed.
The homomorphs are back, just encoded such that it doesn't crash in LANG=C
However, I noticed a bug in the old escaping; [pseudoSlash] was escaped the
same as ['/','/']. Fixed by using '%' to escape pseudoSlash. Which requires
doubling '%' to escape it, but that's already done in the escaping of
worktree filenames in a view, so is probably ok.
Linking the file to the tmp dir was not necessary in the clean
filter, and it caused the ctime to change, which caused git to think
the file was changed. This caused git status to get slow as it kept
re-cleaning unchanged files.
Fixes several bugs with updates of pointer files. When eg, running
git annex drop --from localremote
it was updating the pointer file in the local repository, not the remote.
Also, fixes drop ../foo when run in a subdir, and probably lots of other
problems. Test suite drops from ~30 to 11 failures now.
TopFilePath is used to force thinking about what the filepath is relative
to.
The data stored in the sqlite db is still just a plain string, and
TopFilePath is a newtype, so there's no overhead involved in using it in
DataBase.Keys.
WorkTree.lookupFile was finding a key for a file that's deleted from the
work tree, which is different than the v5 behavior (though perhaps the same
as the direct mode behavior). Fix by checking that the work tree file exists
before catting its key.
Hopefully this won't slow down much, probably the catKey is much more expensive.
I can't see any way to optimise this, except perhaps to make Command.Unused
check if work tree files exist before/after calling lookupFile. But,
it seems better to make lookupFile really only find keys for worktree files;
that's what it's intended to do.
Else, queued file stages won't have reached the index, and it won't find
everthing.
This evidently fixes a reversion in my work today, although I don't see how
I broke it. It didn't use to flush the queue first, before, and worked
somehow.
Test suite for v5 is back to 100% green now.
The smudge filter does need to be run, because if the key is in the local
annex already (due to renaming, or a copy of a file added, or a new file
added and its content has already arrived), git merge smudges the file and
this should provide its content.
This does probably mean that in merge conflict resolution, git smudges the
existing file, re-copying all its content to it, and then the file is
deleted. So, not efficient.
This is a behavior change for merge conflicts between locked files
that both pointed to the same key, in different ways.
Before, the conflict was resolved, but the file was renamed to .variant.
This was unnecessary, because there was only one variant.
Of course, this also handles conflicts between unlocked and locked, or even
two unlocked files with different pointer contents.
Since the file was present and locked, its annex object was not in the
inode cache. So, despite not needing to update the annex object when the
clean filter is run on the content by git merge, it does need to record the
inode cache of the annex object. Otherwise, the annex object will be
assumed to be bad, since its inode is not cached.
Several tricky parts:
* When the conflict is just between the same key being locked and unlocked,
the unlocked version wins, and the file is not renamed in this case.
* Need to update associated file map when conflict resolution renames
an unlocked file.
* git merge runs the smudge filter on the conflicting file, and actually
overwrites the file with the same content it had before, and so
invalidates its inode cache. This makes it difficult to know when it's
safe to remove such files as conflict cruft, without going so far as to
compare their entire contents.
Dealt with this by preventing the smudge filter from populating the file
when a merge is run. However, that also prevents the smudge filter being
run for non-conflicting files, so eg moving a file won't put its new
content into place.
* Ideally, if a merge or a merge conflict resolution renames an unlocked
file, the file in the work tree can just be moved, rather than copying
the content to a new worktree file.
This is attempted to be done in merge conflict resolution, but
due to git merge's behavior of running smudge filters, what actually
seems to happen is the old worktree file with the content is deleted and
rewritten as a pointer file, so doesn't get reused.
So, this is probably not as efficient as it optimally could be.
If that becomes a problem, could look into running the merge in a separate
worktree and updating the real worktree more efficiently, similarly to the
direct mode merge. However, the direct mode merge had a lot of bugs, and
I'd rather not use that more error-prone method unless really needed.
Decided it's too scary to make v6 unlocked files have 1 copy by default,
but that should be available to those who need it. This is consistent with
git-annex not dropping unused content without --force, etc.
* Added annex.thin setting, which makes unlocked files in v6 repositories
be hard linked to their content, instead of a copy. This saves disk
space but means any modification of an unlocked file will lose the local
(and possibly only) copy of the old version.
* Enable annex.thin by default on upgrade from direct mode to v6, since
direct mode made the same tradeoff.
* fix: Adjusts unlocked files as configured by annex.thin.
This optimisation was not necessary, and didn't work for v6 unlocked files.
Typically only a small number of files will be changed by a commit, so just
catKey them all.
This can happen when ingesting a new file in either locked or unlocked
mode, when some unlocked files in the repo use the same key, and the
content was not locally available before.
This fixes a race where the modified file ended up in annex/objects, and
the InodeCache stored in the database was for the modified version, so
git-annex didn't know it had gotten modified.
The race could occur when the smudge filter was running; now it gets the
InodeCache before generating the Key, which avoids the race.
In v5, lookupFile is supposed to only look at symlinks on disk (except when
in direct mode).
Note that v6 also has a bug when a locked file's symlink is deleted and is
replaced with a new file. It sees that a link is staged and gets that
key.
The problem is that shutdown is not always called, particularly in the test
suite. So, a database connection would be opened, possibly some changes
queued, and then not shut down.
One way this can happen is when using Annex.eval or Annex.run with a new
state. A better fix might be to make both of them call Keys.shutdown
(and be sure to do it even if the annex action threw an error).
Complication: Sometimes they're run reusing an existing state, so shutting
down a database connection could cause problems for other users of that
same state. I think this would need a MVar holding the database handle,
so it could be emptied once shut down, and another user of the database
connection could then start up a new one if it got shut down. But, what if
2 threads were concurrently using the same database handle and one shut it
down while the other was writing to it? Urgh.
Might have to go that route eventually to get the database access to run
fast enough. For now, a quick fix to get the test suite happier, at the
expense of speed.
This covers the case where multiple files have the same content and are
added with git add. Previously only the one that was linked to the annex
got its inode cached; now both are.
When a v6 unlocked files is removed from the work tree,
unused doesn't show it. When it gets removed from the index,
unused does show it. This is the same as a locked file.
If multiple files point to the same annex object, the user may want to
modify them independently, so don't use a hard link.
Also, check diskreserve when copying.
Before the smudge filter added a trailing newline, but other things that
wrote formatPointer to a file did not.
also some new pointer staging code to use later
This avoids querying the database when the content file doen't exist
(or otherwise fails the provided check). However, it does add overhead of
querying the database, and will certianly impact performance.
The Keys database can hold multiple inode caches for a given key. One for
the annex object, and one for each pointer file, which may not be hard
linked to it.
Inode caches for a key are recorded when its content is added to the annex,
but only if it has known pointer files. This is to avoid the overhead of
maintaining the database when not needed.
When the smudge filter outputs a file's content, the inode cache is not
updated, because git's smudge interface doesn't let us write the file. So,
dropping will fall back to doing an expensive verification then. Ideally,
git's interface would be improved, and then the inode cache could be
updated then too.
Renamed the db to keys, since it is various info about a Keys.
Dropping a key will update its pointer files, as long as their content can
be verified to be unmodified. This falls back to checksum verification, but
I want it to use an InodeCache of the key, for speed. But, I have not made
anything populate that cache yet.
This removes ambiguity, because while someone might have "WORM--foo" in a
file that's not intended to be a git-annex pointer file,
"annex/objects/WORM--foo" is less likely.
Also, 664cc987e8 had a caveat about symlink
targets being parsed as pointer files, and now the same parser is used for
both.
I did not include any hash directories before the key in the pointer file,
as they're not needed. However, if they were included, the parser would
still work ok.
Backend.lookupFile is changed to always fall back to catKey when
operating on a file that's not a symlink.
catKey is changed to understand pointer files, as well as annex symlinks.
Before, catKey needed a file mode witness, to be sure it was looking at a
symlink. That was complicated stuff. Now, it doesn't actually care if a
file in git is a symlink or not; in either case asking git for the content
of the file will get the pointer to the key.
This does mean that git-annex will treat a link
foo -> WORM--bar as a git-annex file, and also treats
a regular file containing annex/objects/WORM--bar as a git-annex file.
Calling catKey could make git-annex commands need to do more work than
before. This would especially be the case if a repo contained many regular
files, and only a few annexed files, as now git-annex will need to ask
git about the contents of the regular files.
Was not putting it inside the temp dir, but next to it!
This was just wrong, and it led to a longer filename that desired being
used, leading to some bug reports.
Since all places where a repo is used in direct mode need to have git-annex
upgraded before the repo can safely be converted to v6, the upgrade needs
to be manual for now.
I suppose that at some point I'll want to drop all the direct mode support
code. At that point, will stop supporting v5, and will need to auto-upgrade
any remaining v5 repos. If possible, I'd like to carry the direct mode
support for say, a year or so, to give people plenty of time to upgrade and
avoid disruption.
When core.sharedRepository is set, annex object files are not made mode
444, since that prevents a user other than the file owner from locking
them. Instead, a mode such as 664 is used in this case.
replaceFile created a temp file, which was guaranteed to not overlap with
another temp file. However, makeAnnexLink then deleted that file, in
preparation for making the symlink in its place. This caused a race, since
some other replaceFile could create a temp file, using the same name!
I was able to reproduce the race easily running git-annex add -J10 in a
directory with 100 files (all with different contents). Some files would
get ingested into the annex, but their annex links would fail to be added.
There could be other situations where this same problem could occur.
Perhaps when the assistant is adding a file, if the user manually also ran
git-annex add. Perhaps in cases not involving adding a file.
The new replaceFile makes a temprary directory, which is guaranteed to be
unique, and doesn't make a temp file in there. makeAnnexLink can thus
create the symlink without problem and the race is avoided.
Audited all calls to replaceFile to make sure that the old behavior of
providing an empty temp file was not relied on.
The general problem of asking for a temp file and deleting it as part of
the process of using it could reach beyond replaceFile. Did some quick
audits and didn't find other cases of it. Probably only symlink creation
stuff would tend to make that mistake, mostly.
Instead, only display transport error if the configlist output doesn't
include an annex.uuid line, even an empty one.
A recent change made git-annex init try to get all the remote uuids, and so
the transport error would be displayed by it. It was also displayed when
eg, copying files to a remote that had no uuid yet.
sync, merge, assistant: When git merge failed for a reason other than a
conflicted merge, such as a crippled filesystem not allowing particular
characters in filenames, git-annex would make a merge commit that could
omit such files or otherwise be bad. Fixed by aborting the whole merge
process when git merge fails for any reason other than a merge conflict.
Implemented with no additional overhead of compares etc.
This is safe to do for presence logs because of their locality of change;
a given repo's presence logs are only ever changed in that repo, or in a
repo that has just been actively changing the content of that repo.
So, we don't need to worry about a split-brain situation where there'd
be disagreement about the location of a key in a repo. And so, it's ok to
not update the timestamp when that's the only change that would be made
due to logging presence info.
sideAction is for things not generally related to the current action being
performed. And, it adds a newline after the side action. This was not the
right thing to use for stuff like "checksum", where doing a checksum is
part of the git annex get process, and indeed we want it to display
"(checksum...) ok"
By definition, a trusted repository is trusted to always have its location
tracking log accurate. Thus, it should never be in a position where content
is being dropped from it concurrently, as that would result in the location
tracking log not being accurate.
This avoids a failure where eg, we start with RecentlyVerifiedCopies
for all remotes, and so didn't do any active verification, which is
required.
Also, dedup the list of VerifiedCopies when checking if we have enough,
in case 2 copies of a UUID slip in.
See doc/bugs/concurrent_drop--from_presence_checking_failures.mdwn for
discussion about why 1 locked copy is all we can require, and how this
fixes concurrent dropping bugs.
Note that, since nothing yet generates a VerifiedCopyLock yet, this commit
breaks dropping temporarily.
There should be no behavior changes in this commit, it just adds a more
expressive data type and adjusts code that had been passing around a [UUID]
or sometimes a Maybe Remote to instead use [VerifiedCopy].
Although, since some functions were taking two different [UUID] lists,
there's some potential for me to have gotten it horribly wrong.
Also, rename lockContent to lockContentExclusive
inAnnexSafe should perhaps be eliminated, and instead use
`lockContentShared inAnnex`. However, I'm waiting on that, as there are
only 2 call sites for inAnnexSafe and it's fiddly.
In c6632ee5c8, it actually only handled
uploading objects to a shared repository. To avoid verification when
downloading objects from a shared repository, was a lot harder.
On the plus side, if the process of downloading a file from a remote
is able to verify its content on the side, the remote can indicate this
now, and avoid the extra post-download verification.
As of yet, I don't have any remotes (except Git) using this ability.
Some more work would be needed to support it in special remotes.
It would make sense for tahoe to implicitly verify things downloaded from it;
as long as you trust your tahoe server (which typically runs locally),
there's cryptographic integrity. OTOH, despite bup being based on shas,
a bup repo under an attacker's control could have the git ref used for an
object changed, and so a bup repo shouldn't implicitly verify. Indeed,
tahoe seems unique in being trustworthy enough to implicitly verify.
* When annex objects are received into git repositories, their checksums are
verified then too.
* To get the old, faster, behavior of not verifying checksums, set
annex.verify=false, or remote.<name>.annex-verify=false.
* setkey, rekey: These commands also now verify that the provided file
matches the key, unless annex.verify=false.
* reinject: Already verified content; this can now be disabled by
setting annex.verify=false.
recvkey and reinject already did verification, so removed now duplicate
code from them. fsck still does its own verification, which is ok since it
does not use getViaTmp, so verification doesn't happen twice when using fsck
--from.
I couldn't find a good way to make an *empty* index file (zero byte file
won't do), so I punted and just don't make index.lock when there's no index
yet. This means some other git process could race and write an index file
at the same time as the merge is ongoing, in theory. Only happens in new
repos though.
Oh boy, not again. So, another place that the filesystem encoding needs to
be applied. Yay.
In passing, I changed decodeBS so if a NUL is embedded in the input, the
resulting FilePath doesn't get truncated at that NUL. This was needed to
make prop_b64_roundtrips pass, and on reviewing the callers of decodeBS, I
didn't see any where this wouldn't make sense. When a FilePath is used to
operate on the filesystem, it'll get truncated at a NUL anyway, whereas if
a String is being used for something else, it might conceivably have a NUL
in it, and we wouldn't want it to get truncated when going through
decodeBS.
(NB: There may be a speed impact from this change.)
I think that the problem was caused by windows not having a concept of an
env var that is set, but to the empty string. So, GIT_ANNEX_SSHOPTION
got set to "" and was not seen as set at all.
Easy fix, which also makes git-annex sync a little faster is to not set
GIT_SSH, when GIT_ANNEX_SSHOPTION has no options. Might as well let git use
ssh per usual in this case, no need to run git-annex as the proxy ssh
command..
* proxy: Fix proxy git commit of non-annexed files in direct mode.
* proxy: If a non-proxied git command, such as git revert
would normally fail because of unstaged files in the work tree,
make the proxied command fail the same way.
* Perform a clean shutdown when --time-limit is reached.
This includes running queued git commands, and cleanup actions normally
run when a command is finished.
* fsck: Commit incremental fsck database when --time-limit is reached.
Previously, some of the last files fscked did not make it into the
database when using --time-limit.
Note that this changes Annex.addCleanup hooks, to run after --time-limit
expires. Fsck was using such a hook to clean up after a
--incremental-schedule, and that shouldn't run when --time-limit exipires
it. So, instead, moved that cleanup code to be run by cleanupIncremental.
Resulted in some data type juggling.
This is needed because when preferred content matches on files,
the second pass would otherwise want to drop all keys. Using a bloom filter
avoids this, and in the case of a false positive, a key will be left
undropped that preferred content would allow dropping. Chances of that
happening are a mere 1 in 1 million.
This makes git annex unused use around 48 mb more memory than it did before,
but the massive increase in accuracy makes this worthwhile for all but the
smallest systems.
Also, I want to use the bloom filter for sync --all --content, to avoid
dropping files that the preferred content doesn't want, and 1/1000
false positives would be far too many in that use case, even if it were
acceptable for unused.
Actual memory use numbers:
1000: 21.06user 3.42system 0:26.40elapsed 92%CPU (0avgtext+0avgdata 501552maxresident)k
1000000: 21.41user 3.55system 0:26.84elapsed 93%CPU (0avgtext+0avgdata 549496maxresident)k
10000000: 21.84user 3.52system 0:27.89elapsed 90%CPU (0avgtext+0avgdata 549920maxresident)k
Based on these numbers, 10 million seemed a better pick than 1 million.
This removes a bit of complexity, and should make things faster
(avoids tokenizing Params string), and probably involve less garbage
collection.
In a few places, it was useful to use Params to avoid needing a list,
but that is easily avoided.
Problems noticed while doing this conversion:
* Some uses of Params "oneword" which was entirely unnecessary
overhead.
* A few places that built up a list of parameters with ++
and then used Params to split it!
Test suite passes.
The content file may not be owned by the user running git-annex, in which
case, setting the owner write bit was not enough to let lockContent
act on the file. However, with some core.sharedRepository configs, the file
should be writable by the user's group. So, the thing to do is to call
thawContent on it.
It was returning Just False in this situation, which differed from indirect
mode behavior. I don't think this led to any actual problems; things that
checked if the file being dropped was present just failed to fail, and
instead reported it wasn't present, possibly incorrectly.
Hmm, it's possible that this could have made git annex fsck --from remote
update the location log wrongly, if a remote was in direct mode, and was in
the middle of trying to drop a key, and the drop later failed.
Also cleaned up the code, avoiding creating a lock file if we're going to
open it for create later anyway.
And, if there's an exception while preparing to lock the file, but not at
the point of actually taking the lock, throw an exception, instead of
silently not locking and pretending to succeed.
And, on Windows, always use lock file, even if the repo somehow got into
indirect mode (maybe with cygwin git..)
The one exception is in Utility.Daemon. As long as a process only
daemonizes once, which seems reasonable, and as long as it avoids calling
checkDaemon once it's already running as a daemon, the fcntl locking
gotchas won't be a problem there.
Annex.LockFile has it's own separate lock pool layer, which has been
renamed to LockCache. This is a persistent cache of locks that persist
until closed.
This is not quite done; lockContent stil needs to be converted.
Should be no behavior changes, just simplified code.
The only actual difference is it doesn't truncate the lock file.
I think that was a holdover from when transfer info was written to the lock
file.
Only the assistant uses these, and only the assistant cleans them up, so
make only git annex transferkeys write them,
There is one behavior change from this. If glacier is being used, and a
manual git annex get --from glacier fails because the file isn't available
yet, the assistant will no longer later see that failed transfer file and
retry the get. Hope no-one depended on that old behavior.
The setDifferences that got added to initialize turns out to make a git
commit, and before ensureCommit has been used. Thus, repo init can fail
when the system has a broken hostname etc.
Move the ensureCommit to the very first thing to avoid this kind of breakage.
This works, and seems fairly robust. Clean get of 20 files at -J3. At -J10,
there are some messages about ssh multiplexing, probably due to a race
spinning up the ssh connection cacher. But, it manages to get all the files
ok regardless.
The progress bars are a scrambled mess though, due to bugs in
ascii-progress, which I've already filed. Particularly this one:
https://github.com/yamadapc/haskell-ascii-progress/issues/8
Came up with a generic way to filter out progress messages while keeping
errors, for commands that use stderr for both.
--json mode will disable command outputs too.
This might be overkill; I only know I need it in ls-files, but other git
commands can also do their own globbing, it turns out, and I am pretty sure
I never want them too when git-annex is using them as plumbing.
Test suite still passes and it looks ok.
This was introduced by commit 450ee53ab6
However, the same problem could affect other calls to programPath,
specifically some on the assistant. So, I fixed it at a deeper level.
Seems to work, but still experimental until it's been tested more.
When repositories are on filesystems not supporting symlinks, the .git dir
symlink trick cannot be used. Since we're going to be in direct mode
anyway, the .git dir symlink is not strictly needed.
However, I have not fixed the code that creates new annex symlinks to
handle this case -- the committed symlinks will be wrong.
git annex sync happens to currently fail in a submodule using direct mode,
because there's no HEAD ref. That also needs to be dealt with to get
this fully working in crippled filesystems.
Leaving http://github.com/datalad/datalad/issues/44 open until these issues
are dealt with.
Most of the time, there will be no discreprancy between programPath and
readProgramFile.
But, the programFile might have been written by an old version of git-annex
that is still installed, while a newer one is currently running. In this
case, we want to run the same one that's currently running.
This is especially important for things like the GIT_SSH=git-annex used for
ssh connection caching.
The only code that still uses readProgramFile directly is the upgrade code,
which needs to know where the standalone git-annex was installed, in order to
upgrade it.
Turns out sqlite does not like having its database deleted out from
underneath it. It might suffice to empty the table, but I would rather
start each fsck over with a new database, so I added a lock file, and
running incremental fscks use a shared lock.
This leaves one concurrency bug left; running two concurrent fsck --more
will lead to: "SQLite3 returned ErrorBusy while attempting to perform step."
and one or both will fail. This is a concurrent writers problem.
* sync: Use the ssh-options git config when doing git pull and push.
* remotedaemon: Use the ssh-options git config.
Note that the rename env var means that if a new git-annex calls an old one
for git-annex ssh, or a new calls an old, nothing much will go wrong;
just ssh caching won't happen.
I hope this doesn't impact speed much -- it does have to pull out a value
from Annex state every time it accesses the branch now.
The test case I dropped has never caught any problems that I can remember,
and would have been rather difficult to convert.