Turns out sqlite does not like having its database deleted out from
underneath it. It might suffice to empty the table, but I would rather
start each fsck over with a new database, so I added a lock file, and
running incremental fscks use a shared lock.
This leaves one concurrency bug left; running two concurrent fsck --more
will lead to: "SQLite3 returned ErrorBusy while attempting to perform step."
and one or both will fail. This is a concurrent writers problem.
Database.Handle can now be given a CommitPolicy, making it easy to specify
transaction granularity.
Benchmarking the old git-annex incremental fsck that flips sticky bits
to the new that uses sqlite, running in a repo with 37000 annexed files,
both from cold cache:
old: 6m6.906s
new: 6m26.913s
This commit was sponsored by TasLUG.
Did not keep backwards compat for sticky bit records. An incremental fsck
that is already in progress will start over on upgrade to this version.
This is not yet ready for merging. The autobuilders need to have sqlite
installed.
Also, interrupting a fsck --incremental does not commit the database.
So, resuming with fsck --more restarts from beginning.
Memory: Constant during a fsck of tens of thousands of files.
(But, it does seem to buffer whole transation in memory, so
may really scale with number of files.)
CPU: ?
* sync: Use the ssh-options git config when doing git pull and push.
* remotedaemon: Use the ssh-options git config.
Note that the rename env var means that if a new git-annex calls an old one
for git-annex ssh, or a new calls an old, nothing much will go wrong;
just ssh caching won't happen.
Note that while the assistant detects changes made to remote names, I left
the commit message fixed rather than calculating it after every commit. It
doesn't seem worth the CPU to do the latter.
--deduplicate, --skip-duplicates, and --clean-duplicates still checksum the
file twice, the first time to determine if it's a duplicate. This cannot be
easily merged with the checksumming done to add the file, since the file
needs to be locked down before that second checksum is taken.
This reverts commit 0700fbc477.
This broke unit and test suite cleanup. The difference is that
dirContentsRecursive only returns files, but this needs to also operate on
directories.
I hope this doesn't impact speed much -- it does have to pull out a value
from Annex state every time it accesses the branch now.
The test case I dropped has never caught any problems that I can remember,
and would have been rather difficult to convert.
* init: Repository tuning parameters can now be passed when initializing a
repository for the first time. For details, see
http://git-annex.branchable.com/tuning/
* merge: Refuse to merge changes from a git-annex branch of a repo
that has been tuned in incompatable ways.
This is necessary for interop between inode caches created on unix and
windows. Which is more important than supporting inodecaches for large keys
with the wrong size, which are broken anyway.
There should be no slowdown from this change, except on Windows.
Avoid using fileSize which maxes out at just 2 gb on Windows.
Instead, use hFileSize, which doesn't have a bounded size.
Fixes support for files > 2 gb on Windows.
Note that the InodeCache code only needs to compare a file size,
so it doesn't matter it the file size wraps. So it has been
left as-is. This was necessary both to avoid invalidating existing inode
caches, and because the code passed FileStatus around and would have become
more expensive if it called getFileSize.
This commit was sponsored by Christian Dietrich.
* info: Can now display info about a given uuid.
* Added to remote/uuid info: Count of the number of keys present
on the remote, and their size. This is rather expensive to calculate,
so comes last and --fast will disable it.
* Git remote info now includes the date of the last sync with the remote.
Reverts 965e106f24
Unfortunately, this caused breakage on Windows, and possibly elsewhere,
because parentDir and takeDirectory do not behave the same when there is a
trailing directory separator.
I think this is the last problimatic setCurrentDirectory. I also audited
for extrnal commands that git-annex might run with cwd = foo, and did not
find any that were passed any FilePath that might be absolute.
parentDir is less safe than takeDirectory, especially when working
with relative FilePaths. It's really only useful in loops that
want to terminate at /
This commit was sponsored by Audric SCHILTKNECHT.
This allows the git repository to be moved while git-annex is running in
it, with fewer problems.
On Windows, this avoids some of the problems with the absurdly small
MAX_PATH of 260 bytes. In particular, git-annex repositories should
work in deeper/longer directory structures than before. See
http://git-annex.branchable.com/bugs/__34__git-annex:_direct:_1_failed__34___on_Windows/
There are several possible ways this change could break git-annex:
1. If it changes its working directory while it's running, that would
be Bad News. Good news everyone! git-annex never does so. It would also
break thread safety, so all such things were stomped out long ago.
2. parentDir "." -> "" which is not a valid path. I had to fix one
instace of this, and I should probably wipe all calls to parentDir out
of the git-annex code base; it was never a good idea.
3. Things like relPathDirToFile require absolute input paths,
and code assumes that the git repo path is absolute and passes it to it
as-is. In the case of relPathDirToFile, I converted it to not make
this assumption.
Currently, the test suite has 16 failures.
In the case where a remote of the bare repo has a fetch = configuration,
refs/remotes/origin/master will exist, and so the merge code path tried to
run in the bare repo.
The url log could have an url for a key, while the location log thinks it's
not present in the web. In this case, addurl --file url would not do
anything. Fixed it to re-add the web as a location.
I don't know how this situation could arise, but I saw it in the wild in
the conference_proceedings repo, affecting key
URL-s17806003--http://mirror.linux.org.au/pub/linux.conf.au/2014/Wednesday/53-Building_Effective_Alliances_around_the_Trans-Pacific_Partnershi-c0505b631127ccc67e38e637344d988e
Investigating the presence log, it looked like that key
was originally listed as present in the web, then in commit
56abf9e9f3e691ed9d83513037d4019313321ca3 someone else's git-annex
set it and some other things to not present in the web. It would be
interesting to know what that user did, but I doubt I'll be able to find
out. All I can tell from this investigation is that the inconsistency was
not introduced when originally addurl-ing the url.
This allows bypassing the direct mode guard in a safe way to do all sorts
of things including git revert, git mv, git checkout ...
This commit was sponsored by the WikiMedia Foundation.
I had hoped that the git devs could change git's handling of partial
commits to not use a false index file, but seems not.
So, this relies on some git internals to detect that case. The test suite
has a test case added to catch it if changes to git break it.
This commit was sponsored by Paul Tagliamonte.
Now `git annex info $remote` shows info specific to the type of the remote,
for example, it shows the rsync url.
Remote types that support encryption or chunking also include that in their
info.
This commit was sponsored by Ævar Arnfjörð Bjarmason.
This avoids making the parameters column quite wide, which caused
descriptions of other commands to not fit in 80 cols in the usage display.
FIELD=VALUE is a simplification, but so was the old display. The man page
gives more detail.
This is not a complete fix. For one, git remote will happily go add a
remote that has the same name as an existing special remote. For another,
enableremote will enable a special remote over top of an existing git
remote. And, also, the webapp might.
This reverts commit dd667844b6
and commit e6eff0e951.
Those commits were fine, except the android autobuilder currently has a bit
of a mess of yesod versions and broke. Better to wait on this.
Added a Default instance for TrustLevel, and was able to use that to clear
up several other parts of the code too.
This commit was sponsored by Stephan Schulz
Found these with:
git grep "^ " $(find -type f -name \*.hs) |grep -v ': where'
Unfortunately there is some inline hamlet that cannot use tabs for
indentation.
Also, Assistant/WebApp/Bootstrap3.hs is a copy of a module and so I'm
leaving it as-is.
This fixes all instances of " \t" in the code base. Most common case
seems to be after a "where" line; probably vim copied the two space layout
of that line.
Done as a background task while listening to episode 2 of the Type Theory
podcast.
This avoids cp -a overriding the default mode acls that the user might have
set in a git repository.
With GNU cp, this behavior change should not be a breaking change, because
git-anex also uses rsync sometimes in the same situation, and has only ever
preserved timestamps when using rsync.
Systems without GNU cp will no longer use cp -a, but instead just cp.
So, timestamps will no longer be preserved. Preserving timestamps when
copying between repos is not guaranteed anyway.
Closes: #729757
This fixed one bug where it needed to be and wasn't (in Assistant.Unused).
And also found one place where lockContent was used unnecessarily (by
drop --from remote).
A few other places like uninit probably don't really need to lockContent,
but it doesn't hurt to do call it anyway.
This commit was sponsored by David Wagner.
The nice refactoring in ec7dd0446a
highlighted a bug in lockContent -- when the content is not present,
this incorrectly created an empty lock file, using the same filename
as the content file.
This seems like it could result in empty objects, which fsck would detect
and complain about. Both drop and move --to call lockContent, as does
Remote.Git.dropKey -- I think we got lucky and this bug didn't show up
because both all of those only operate on files that are present. So
this bug could only manifest if there was a race, and a file's content
was dropped at just the wrong time, just as another process was about to
drop it. (And then only if the other process's dropping failed, otherwise
it'd delete the empty object file.)
Hmm, move --from also called lockContent. Unnecessarily, since the content
is not being removed from the local annex. In this case, the combination of
the 2 bugs could result in an empty lock file being written, and then if
the download of the content failed, left in the object directory as the
content.
This commit also optimises lockContent, avoiding an unncessary
doesFileExist test and instead just catching the exception that's thrown
when the file doesn't exist.
This commit was sponsored by Justine Lam.
Added a convenience Utility.LockFile that is not a windows/posix
portability shim, but still manages to cut down on the boilerplate around
locking.
This commit was sponsored by Johan Herland.
(With the exception of daemon pid locking.)
This fixes at part of #758630. I reproduced the assistant locking eg, a
removable drive's annex journal lock file and forking a long-running
git-cat-file process that inherited that lock.
This did not affect Windows.
Considered doing a portable Utility.LockFile layer, but git-annex uses
posix locks in several special ways that have no direct Windows equivilant,
and it seems like it would mostly be a complication.
This commit was sponsored by Protonet.
Added a mkUnavailable method, which a Remote can use to generate a version
of itself that is not available. Implemented for several, but not yet all
remotes.
This allows testing that checkPresent properly throws an exceptions when
it cannot check if a key is present or not. It also allows testing that the
other methods don't throw exceptions in these circumstances.
This immediately found several bugs, which this commit also fixes!
* git remotes using ssh accidentially had checkPresent return
an exception, rather than throwing it
* The chunking code accidentially returned False rather than
propigating an exception when there were no chunks and
checkPresent threw an exception for the non-chunked key.
This commit was sponsored by Carlo Matteo Capocasa.
Removed old extensible-exceptions, only needed for very old ghc.
Made webdav use Utility.Exception, to work after some changes in DAV's
exception handling.
Removed Annex.Exception. Mostly this was trivial, but note that
tryAnnex is replaced with tryNonAsync and catchAnnex replaced with
catchNonAsync. In theory that could be a behavior change, since the former
caught all exceptions, and the latter don't catch async exceptions.
However, in practice, nothing in the Annex monad uses async exceptions.
Grepping for throwTo and killThread only find stuff in the assistant,
which does not seem related.
Command.Add.undo is changed to accept a SomeException, and things
that use it for rollback now catch non-async exceptions, rather than
only IOExceptions.
Now git-annex-shell recvkey, when the key is already present, allows
another copy to be rsynced up, and just throws it away.
This same behavior could have already happened before, when eg, two repos
tried to upload the same object at the same time. So this makes the test
suite pass, and should not add any bad behavior, other than slightly more
work being done in a rather edge case.
This relies on moveAnnex's behavior of keeping the current version of an
object.
And fixed a bug found by these tests; retrieveKeyFile would fail
when the dest file was already complete.
This commit was sponsored by Bradley Unterrheiner.
This only performs some basic tests so far; no testing of chunking or
resuming. Also, the existing encryption type of the remote is used; it
would be good later to derive an encrypted and a non-encrypted version of
the remote and test them both.
This commit was sponsored by Joseph Liu.
Support users who have set commit.gpgsign, by disabling gpg signatures for
git-annex branch commits and commits made by the assistant.
The thinking here is that a user sets commit.gpgsign intending the commits
that they manually initiate to be gpg signed. But not commits made in the
background, whether by a deamon or implicitly to the git-annex branch.
gpg signing those would be at best a waste of CPU and at worst would fail,
or flood the user with gpg passphrase prompts, or put their signature on
changes they did not directly do.
See Debian bug #753720.
Also makes all commits done by git-annex go through a few central control
points, to make such changes easier in future.
Also disables commit.gpgsign in the test suite.
This commit was sponsored by Antoine Boegli.
When annex.genmetadata is set, metadata from the feed is added to files
that are imported from it.
Reused the same feedtitle and itemtitle, feedauthor, itemauthor, etc names
that are used in --template.
Also added title and author, which are the item title/author if available,
falling back to the feed title/author. These are more likely to be common
metadata fields.
(There is a small bit of dupication here, but once git gets
around to packing the object, it will compress it away.)
The itempubdate field is not included in the metadata as a string; instead
it is used to generate year and month fields, same as is done when adding
files with annex.genmetadata set.
This commit was sponsored by Amitai Schlair, who cooincidentially
is responsible for ikiwiki generating nice feed metadata!
I had thought that this was already done, but apparently not. There may
have been a reversion around version 5.20140606. Anna's laptop had its
desktop menu file etc having that version despite having upgraded git-annex
to a newer version. However, I could not find any commits that removed a
call to ensureInstalled.
When in direct mode, update the master branch after committing to the
annex/direct/master branch. Also, update the synced/master branch.
This fixes a topology A->B where both A and B are in direct mode and
running the assistant, and a change is made to B. Before this fix, A pulled
the changes from B, but since they were only on the annex/direct/master
branch, it did not merge them.
Note that I considered making the assistant merge the
remotes/B/annex/direct/master, but decided to keep it simple and only merge
the sync branches as before.
Rather than calculating the TSDelta once, and caching it, this now
reads the inode sential file's InodeCache file once, and then each time a
new InodeCache is generated, looks at the sentinal file to get the current
delta.
This way, if the time zone changes while git-annex is running, it will
adapt.
This adds some inneffiency, but only on Windows, and only 1 stat per new
file added. The worst innefficiency is that `git annex status` and
`git annex sync` will now (on Windows) stat the inode sentinal file once per
file in the repo.
It would be more efficient to use getCurrentTimeZone, rather than needing
to stat the sentinal file. This should be easy to do, once the time
package gets my bugfix patch.
This commit was sponsored by Jürgen Lüters.
On Windows, changing the time zone causes the apparent mtime of files to
change. This confuses git-annex, which natually thinks this means the files
have actually been modified (since THAT'S WHAT A MTIME IS FOR, BILL <sheesh>).
Work around this stupidity, by using the inode sentinal file to detect if
the timezone has changed, and calculate a TSDelta, which will be applied
when generating InodeCaches.
This should add no overhead at all on unix. Indeed, I sped up a few
things slightly in the refactoring.
Seems to basically work! But it has a big known problem:
If the timezone changes while the assistant (or a long-running command)
runs, it won't notice, since it only checks the inode cache once, and
so will use the old delta for all new inode caches it generates for new
files it's added. Which will result in them seeming changed the next time
it runs.
This commit was sponsored by Vincent Demeester.
This avoids a potential slowdown when using lots of views.
I think that it makes sense for unused to ignore (local) view branches,
since these are by definition supposed to be views of an existing branch,
so looking at the branch should be sufficient (and if the view is out of
date and has files that have since been deleted from the branch, the user's
intent is not to preserve those from unused reaping).
Note that TransferInfo does not always contain the Remote, although
any transfer added to the TransferQueue does have a Remote in its
TransferInfo. The transferkeys command still accepts a UUID, which is
useful to handle upgrades, where an old assistant version runs the new
transferkeys.
This commit was sponsored by Kalle Svensson.
It is useful to be able to specify an alternative git-annex-shell
program to execute on the remote, e.g., to run a version not on the
PATH. Use remote.<name>.annex-shell if specified, instead of the
default "git-annex-shell" i.e., first so-named executable on the
PATH.
Only fsck and reinject and the test suite used the Backend, and they can
look it up as needed from the Key. This simplifies the code and also speeds
it up.
There is a small behavior change here. Before, all commands would warn when
acting on an annexed file with an unknown backend. Now, only fsck and
reinject show that warning.
To do so, I slightly changed the behavior of unannex. Now in fast mode, it
only makes a hard link when the annexed file's link count is 1. This avoids
unannexing 2 files with the same content in fast mode from hard linking
them together. (One will end up hard linked to the annex, which the docs
warn about.)
With that change, uninit can simply always run unannex in fast mode. Since
.git/annex/objects is being blown away anyway, there's no worry in this
case about a hard link pointing into it causing an annexed object to be
modified.
For sync, saves 1 ssh connection per remote. For remotedaemon, the same
ssh connection that is already open to run git-annex-shell notifychanges
is reused to pull from the remote.
Only potential problem is that this also enables connection caching
when the assistant syncs with a ssh remote. Including the sync it does
when a network connection has just come up. In that case, cached ssh
connections are likely to be stale, and so using them would hang.
Until I'm sure such problems have been dealt with, this commit needs to
stay on the remotecontrol branch, and not be merged to master.
This commit was sponsored by Alexandre Dupas.
So far, handling connecting to git-annex-shell notifychanges, and
pulling immediately when a change is pushed to a remote.
A little bit buggy (crashes after the first pull), but it already works!
This commit was sponsored by Mark Sheppard.
This will be used by the remote-daemon to quickly tell when changes have
been pushed from some other repository into a ssh remote.
Adjusted the remote-daemon protocol to communicate changed shas, rather
than git branch refs. This way, it can easily check if a sha is new.
This commit was sponsored by Carlos Trijueque Albarran.
This includes checking when dropping files that any required content
configuration is satisfied. However, it does not yet include an active
check on the required content; the location log is trusted when checking
the required content expression.
Motivation: Hook scripts for nautilus or other file managers
need to provide the user with feedback that a file is being downloaded.
This commit was sponsored by THM Schoemaker.
Note that this is a nearly entirely free feature. The data was already
stored in the metadata log in an easily accessible way, and already was
parsed to a time when parsing the log. The generation of the metadata
fields may even be done lazily, although probably not entirely (the map
has to be evaulated to when queried).
unused: In direct mode, files that are deleted from the work tree are no longer incorrectly detected as unused.
Direct mode `git annex info` slows down a bit due to more stringent
checking, but not by a lot.
This is a new feature, it was not handled before, since it's a bit of an
edge case. However, it can be handled exactly the same as a file/dir
conflict, just leave the non-annexed item alone.
While implementing this, the core resolveMerge' function got a lot simpler
and clearer. Note especially that where before there was an asymetric call to
stagefromdirectmergedir, now graftin is called symmetrically in both cases.
And, in order to add that `graftin us`, the current branch needed to be
known (if there is no current branch, there cannot be a merge conflict).
This led to some cleanups of how autoMergeFrom behaved when there is no
current branch.
This commit was sponsored by Philippe Gauthier.
I think that 751f496c11 didn't quite manage
to actually fix the bug, although I have not checked since its "fix" got
redone.
The test suite now actually checks the file staged in git is a symlink,
rather than relying on the bug casing a later sync failure. This seems a
more reliable way to detect it, and probably avoids a heisenbug in the test
suite.
Added test cases for both ways this can happen, with a conflict involving a
file, or a directory.
Cleaned up resolveMerge to not touch the work tree in direct mode, which
turned out to be the only way to handle things.. And makes it much nicer.
Still need to run test suite on windows.
Using the extract(1) program to do the heavy lifting.
Decided to make git-annex run pre-commit-annex when committing. Since
git-annex pre-commit also runs it, it'll be run when git commit is run too,
via the pre-commit hook. This basically gives back the pre-commit hook
that git-annex took away. The implementation avoids repeatedly looking
for the hook script when the assistant is running and committing
repeatedly; only checks if the hook is available once.
To make the script simpler, made git-annex metadata -s field?=value
only set a field when it's not already got a value.
This commit was sponsored by bak.
Note that negated globs are not supported. Would have complicated the code
to add them, without changing the data type serialization in a
non-backwards-compatable way.
This commit was sponsored by Denver Gingerich.
This allows eg, putting .git/annex/tmp on a ram disk, if the disk IO
of temp object files is too annoying (and if you don't want to keep
partially transferred objects across reboots).
.git/annex/misctmp must be on the same filesystem as the git work tree,
since files are moved to there in a way that will not work cross-device,
as well as symlinked into there.
I first wanted to put the tmp objects in .git/annex/objects/tmp, but
that would pose transition problems on upgrade when partially transferred
objects existed.
git annex info does not currently show the size of .git/annex/misctemp,
since it should stay small. It would also be ok to make something clean it
out, periodically.
Performance impact: When adding a large tree of new files, this needs
to do some git cat-file queries to check if any of the files already
existed and might need a metadata copy. I tried a benchmark in a copy
of my sound repository (so there was already a significant git tree
to check against.
Adding 10000 small files, with a cold cache:
before: 1m48.539s
after: 1m52.791s
So, impact is 0.0004 seconds per file added. Which seems acceptable, so did
not add some kind of configuration to enable/disable this.
This commit was sponsored by Lisa Feilen.
When constructing views, metadata is available about the location of the
file in the view's reference branch. Allows incorporating parts of the
directory hierarchy in a view.
For example `git annex view tag=* podcasts/=*` makes a view in the form
tag/showname.
Performance impact: I benchmarked git annex view tag=* in the conference
proceedings repo to take 6.459s before this change, and 6.544s after.
FWIW, I considered making the syntax for this be podcasts/*, which might
be easier for the user to learn. However, I think it's not as good:
* The user has to then juggle two different syntaxes, and podcasts/* will
be expanded by the shell so they also need to quote it, while podcasts/=*
is unlikely to be expanded by the shell.
* It would allow for things like podcasts/*/* and *.mp3 which do not
map well into views.
This commit was sponsored by Aurélien Pinceaux.
While writing this documentation, I realized that there needed to be a way
to stay in a view like tag=* while adding a filter like tag=work that
applies to the same field.
So, there are really two ways a view can be refined. It can have a new
"field=explicitvalue" filter added to it, which does not change the
"shape" of the view, but narrows the files it shows.
Or, it can have a new view added, which adds another level of
subdirectories.
So, added a vfilter command, which takes explicit values to add to the
filter, and rejects changes that would change the shape of the view.
And, made vadd only accept changes that change the shape of the view.
And, changed the View data type slightly; now components that can match
multiple metadata values can be visible, or not visible.
This commit was sponsored by Stelian Iancu.
So the user can now switch to a view and then move files around within it
to manage metadata. For example, moving a file into a new directory
when in the tags=* view adds a tag to it.
Implementation is fairly efficient. One diff-index, which is no more
expensive than the first stage of a git commit, followed by possibly
some cat-file --batch traffic to find the key (when deleting a file).
Very similar to what's done in direct mode when committing. And like
direct mode when updating the WC after a merge, it has to buffer the
diff-tree values in order to make 2 passes over them.
When not in a view, pre-commit now does one extra git symbolic-ref,
which is tiny overhead.
This commit was sponsored by Andrew Eskridge.
Removed instance, got it all to build using fromRef. (With a few things
that really need to show something using a ref for debugging stubbed out.)
Then added back Read instance, and made Logs.View use it for serialization.
This changes the view log format.
(And a vpop command, which is still a bit buggy.)
Still need to do vadd and vrm, though this also adds their documentation.
Currently not very happy with the view log data serialization. I had to
lose the TDFA regexps temporarily, so I can have Read/Show instances of
View. I expect the view log format will change in some incompatable way
later, probably adding last known refs for the parent branch to View
or something like that.
Anyway, it basically works, although it's a bit slow looking up the
metadata. The actual git branch construction is about as fast as it can be
using the current git plumbing.
This commit was sponsored by Peter Hogg.
Adds metadata log, and command.
Note that unsetting field values seems to currently be broken.
And in general this has had all of 2 minutes worth of testing.
This commit was sponsored by Julien Lefrique.
Fix bug in automatic merge conflict resolution code when used
on a filesystem not supporting symlinks, which resulted in it losing
track of the symlink bit of annexed files.
This was the underlying bug that was causing another test to fail,
which got worked around in 1c997fd08c.
I've chosen to keep 2 separate test cases since the old test case only
detected the problem accidentially.
Test suite passes on FAT & in windows, as well as on proper unix systems.
This commit was sponsored by Ellis Whitehead.
Potentially fixes some FD leak if an action on an opened file handle fails
for some reason. There have been some hard to reproduce reports of
git-annex leaking FDs, and this may solve them.
Need to include the uuid of the local repo in the list of belived locations
of a key after getting it, in order for the drop from remote to include it
in the numcopies calculation.
* sync --content: Honor annex-ignore configuration.
* sync: Don't try to sync with xmpp remotes, which are only currently
supported when using the assistant.
Several places assumed this would not happen, and when the AssociatedFile
was Nothing, did nothing.
As part of this, preferred content checks pass the Key around.
Note that checkMatcher is sometimes now called with Just Key and Just File.
It currently constructs a FileMatcher, ignoring the Key. However, if it
constructed a FileKeyMatcher, which contained both, then it might be
possible to speed up parts of Limit, which currently call the somewhat
expensive lookupFileKey to get the Key.
I have not made this optimisation yet, because I am not sure if the key is
always the same. Will need some significant checking to satisfy myself
that's the case..
This will be used in expiring old unused objects. The timestamp is when it
was first noticed it was unused.
Backwards compatability: It supports reading old format unused log files.
The old version of git-annex will ignore lines in log files written by the
new version, so the worst interop problem would be git annex dropunused not
knowing some numbers that git-annex unused reported.
* numcopies: New command, sets global numcopies value that is seen by all
clones of a repository.
* The annex.numcopies git config setting is deprecated. Once the numcopies
command is used to set the global number of copies, any annex.numcopies
git configs will be ignored.
* assistant: Make the prefs page set the global numcopies.
This global numcopies setting is needed to let preferred content
expressions operate on numcopies.
It's also convenient, because typically if you want git-annex to preserve N
copies of files in a repo, you want it to do that no matter which repo it's
running in. Making it global avoids needing to warn the user about gotchas
involving inconsistent annex.numcopies settings.
(See changes to doc/numcopies.mdwn.)
Added a new variety of git-annex branch log file, that holds only 1 value.
Will probably be useful for other stuff later.
This commit was sponsored by Nicolas Pouillard.
I've been disliking how the command seek actions were written for some
time, with their inversion of control and ugly workarounds.
The last straw to fix it was sync --content, which didn't fit the
Annex [CommandStart] interface well at all. I have not yet made it take
advantage of the changed interface though.
The crucial change, and probably why I didn't do it this way from the
beginning, is to make each CommandStart action be run with exceptions
caught, and if it fails, increment a failure counter in annex state.
So I finally remove the very first code I wrote for git-annex, which
was before I had exception handling in the Annex monad, and so ran outside
that monad, passing state explicitly as it ran each CommandStart action.
This was a real slog from 1 to 5 am.
Test suite passes.
Memory usage is lower than before, sometimes by a couple of megabytes, and
remains constant, even when running in a large repo, and even when
repeatedly failing and incrementing the error counter. So no accidental
laziness space leaks.
Wall clock speed is identical, even in large repos.
This commit was sponsored by an anonymous bitcoiner.
Similar to the assistant, this honors any configured preferred content
expressions.
I am not entirely happpy with the implementation. It would be nicer if
the seek function returned a list of actions which included the individual
file gets and copies and drops, rather than the current list of calls to
syncContent. This would allow getting rid of the somewhat reundant display
of "sync file [ok|failed]" after the get/put display.
But, do that, withFilesInGit would need to somehow be able to construct
such a mixed action list. And it would be less efficient than the current
implementation, which is able to reuse several values between eg get and
drop.
Note that currently this does not try to satisfy numcopies when
getting/putting files (numcopies are of course checked when dropping
files!) This makes it like the assistant, and unlike get --auto
and copy --auto, which do duplicate files when numcopies is not yet
satisfied. I don't know if this is the right decision; it only seemed to
make sense to have this parallel the assistant as far as possible to start
with, since I know the assistant works.
This commit was sponsored by Øyvind Andersen Holm.
Noticed that it was possible for add to move a file to .git/annex/objects
and not make the link if the disk was full. This happened because the
location log update failed, and so addLink never got a chance to run.
Running addLink first fixes it; on error it will unwind by moving the file
back to where it was originally.
This adds a http HEAD before the download is done. That was already the
case when the assistant was running, and it seems worth it to avoid filling
up the whole disk, like happened to my server today.
Fixed up a number of things that had worked around there not being a way to
get that.
Most notably, transfer info files on windows now include the process id,
since no locking is currently done. This means the file format varies
between windows and unix.
transferkeys had used special FDs for communication, but that would be
quite annoying to do in Windows.
Instead, use stdin and stdout. But, to avoid commands like rsync stomping
on them and messing up the communications channel, they're duplicated to a
different handle; stdin is replaced with a null handle, and stdout is
replaced with a copy of stderr. This should all work in windows too.
Stopping in progress transfers may work on windows.. if the types unify
anyway. ;) May need some more porting.
Fixes a test case I received where a corrupted repo was repaired, but the
git-annex branch was not. The root of the problem was that the
MissingObject returned by the repair code was not necessarily a complete
set of all objects that might have been deleted during the repair.
So, stop trying to return that at all, and instead make the index file
checking code explicitly verify that each object the index uses is present.
Previous test did not notice if there is a dangling symlink.
Also, if a directory exists with the same name as the imported file, that
cannot work, so don't let --force have an effect.
Note that the hash backends were made to stop printing a (checksum..)
message as part of this, since it showed up without a file when deciding
whether to act on a file. Should have probably removed that message a while
ago anyway, I suppose.
The assistant's commit code also always avoids git commit, for simplicity.
Indirect mode sync still does a git commit -a to catch unstaged changes.
Note that this means that direct mode sync no longer runs the pre-commit
hook or any other hooks git commit might call. The git annex pre-commit
hook action for direct mode is however explicitly run. (The assistant
already ran git commit with hooks disabled, so no change there.)
Option parsing for commands that run outside git repos is still screwy,
as there is no Annex monad and so the flags cannot be passed in. But,
any remaining parameters can be, which is enough for this fix.
Because that allowed writing to symlinks of files that are not present,
which followed the link and put bad content in an object location.
fsck: Fix up .git/annex/object directory permissions.
This commit was sponsored by an anonymous bitcoin donor.
Complicated by such repositories potentially being repos that should have
an annex.uuid, but it failed to be gotten, perhaps due to the past ssh repo
setup bugs. This is handled now by an Upgrade Repository button.
Adding the file moved it to the annex, and then tried to set the mode.
Error unwind then moved the file back, and so the watcher saw the file get
deleted and then added back, and so tried again..
This works for both direct and indirect mode.
It may need some performance tuning.
Note that unlike git status, it only shows the status of the work tree, not
the status of the index. So only one status letter, not two .. and since
files that have been added and not yet committed do not differ between the
work tree and the index, they are not shown. Might want to add display of
the index vs the last commit eventually.
This commit was sponsored by an unknown bitcoin contributor, whose
contribution as been going up lately! ;)
Now that direct mode sets core.bare=true, git's normal prohibition about
pushing into the currently checked out branch doesn't work.
A simple fix for this would be an update hook which blocks the pushes..
but git hooks must be executable, and git-annex needs to be usable on eg,
FAT, which lacks x bits.
Instead, enabling direct mode switches the branch (eg master) to a special
purpose branch (eg annex/direct/master). This branch is not pushed when
syncing; instead any changes that git annex sync commits get written to
master, and it's pushed (along with synced/master) to the remote.
Note that initialization has been changed to always call setDirect,
even if it's just setDirect False for indirect mode. This is needed because
if the user has just cloned a direct mode repo, that nothing has synced
with before, it may have no master branch, and only a annex/direct/master.
Resulting in that branch being checked out locally too. Calling setDirect False
for indirect mode moves back out of this branch, to a new master branch,
and ensures that a manual "git push" doesn't push changes directly to
the annex/direct/master of the remote. (It's possible that the user
makes a commit w/o using git-annex and pushes it, but nothing I can do
about that really.)
This commit was sponsored by Jonathan Harrington.
Copies files out of the annex. This avoids an unannex of one file breaking
other files that link to the same content. Also, it means that the content
remains in the annex using up space until cleaned up with "git annex
unused".
(The behavior of unannex --fast has not changed; it still hard
links to content in the annex. --fast was not made the default because it
is potentially unsafe; editing such a hard linked file can unexpectedly
change content stored in the annex.)
This used to work, but now hsc2hs is failing with a usage message.
Since I have not changed my windows build environment at all, it must be
some change due to a change in the cabal file. Perhaps too make flags are
causing it to hit a windows command line length limit?
Anyway, these hsc files did nothing on Windows, so can be omitted and not
built to work around yet another epic windows weirdness.
This actually fixes a bug; if pre-commit was run in a subdir, it would pass
relative files when updating the associated file maps, and so the maps
wouldn't update.
I don't think this bug happened in practice, due to the way pre-commit is
called by the hook. It happened to chdir to the top of the work tree.
Note that this case is only fully automatically resolved in direct mode.
In indirect mode, git merge moves the file to file~HEAD, and replaces it
with the directory, and leaves the file in unmerged state, and sync doesn't
yet change that.
addurl: Improve message when adding url with wrong size to existing file.
Before the message suggested the url didn't exist.
Fixed handling of URL keys that have no recorded size. Before, if the key
has no size, the url also had to not declare any size, which was unlikely
and wrong, or it was taken to not exist. This probably would mostly affect
keys that were added to the annex with addurl --relaxed.
Thought was that this would be faster than a map, since a vector can be
updated more efficiently. It turns out to not seem to matter; runtime and
memory usage are basically identical.
recvkey was told it was receiving a HMAC key from a direct mode repo,
and that confused it into rejecting the transfer, since it has no way to
verify a key using that backend, since there is no HMAC backend.
I considered making recvkey skip verification in the case of an unknown
backend. However, that could lead to bad results; a key can legitimately be
in the annex with a backend that the remote git-annex-shell doesn't know
about. Better to keep it rejecting if it cannot verify.
Instead, made the gcrypt special remote not set the direct mode flag when
sending (and receiving) files.
Also, added some recvkey messages when its checks fail, since otherwise
all that is shown is a confusing error message from rsync when the remote
git-annex-shell exits nonzero.
Overridable with --user-agent option.
Not yet done for S3 or WebDAV due to limitations of libraries used --
nether allows a user-agent header to be specified.
This commit sponsored by Michael Zehrer.
I forgot I had <$$> hidden away in Utility.Applicative.
It allows doing the same kind of currying as does >=*>
and I found using it made the code more readable for me.
(*>=> was not used)
Done using a mode witness, which ensures it's fixed everywhere.
Fixing catFileKey was a bear, because git cat-file does not provide a
nice way to query for the mode of a file and there is no other efficient
way to do it. Oh, for libgit2..
Note that I am looking at tree objects from HEAD, rather than the index.
Because I cat-file cannot show a tree object for the index.
So this fix is technically incomplete. The only cases where it matters
are:
1. A new large file has been directly staged in git, but not committed.
2. A file that was committed to HEAD as a symlink has been staged
directly in the index.
This could be fixed a lot better using libgit2.
Note that it would be possible to extend the display to show all
repositories. But there can be a lot of repositories that are not set up as
remotes, and it would significantly clutter the display to show them all.
Since we're not showing all repositories, it's not worth trying to show
numcopies count either.
I decided to embrace these limitations and call the command remotes.
This is a git-remote-gcrypt encrypted special remote. Only sending files
in to the remote works, and only for local repositories.
Most of the work so far has involved making initremote work. A particular
problem is that remote setup in this case needs to generate its own uuid,
derivied from the gcrypt-id. That required some larger changes in the code
to support.
For ssh remotes, this will probably just reuse Remote.Rsync's code, so
should be easy enough. And for downloading from a web remote, I will need
to factor out the part of Remote.Git that does that.
One particular thing that will need work is supporting hot-swapping a local
gcrypt remote. I think it needs to store the gcrypt-id in the git config of the
local remote, so that it can check it every time, and compare with the
cached annex-uuid for the remote. If there is a mismatch, it can change
both the cached annex-uuid and the gcrypt-id. That should work, and I laid
some groundwork for it by already reading the remote's config when it's
local. (Also needed for other reasons.)
This commit was sponsored by Daniel Callahan.
The second commit had some bad refs which resulted in the race detection
code running. But that commit was unnecessary anyway, it only was there to
merge in the other refs.
Wrote nice pure transition calculator, and ugly code to stage its results
into the git-annex branch. Also had to split up several Log modules
that Annex.Branch needed to use, but that themselves used Annex.Branch.
The transition calculator is limited to looking at and changing one file at
a time. While this made the implementation relatively easy, it precludes
transitions that do stuff like deleting old url log files for keys that are
being removed because they are no longer present anywhere.
Having one module that knows about all the filenames used on the branch
allows working back from an arbitrary filename to enough information about
it to implement dropping dead remotes and doing other log file compacting
as part of a forget transition.
Works, more or less. --dead is not implemented, and so far a new branch
is made, but keys no longer present anywhere are not scrubbed.
git annex sync fails to push the synced/git-annex branch after a forget,
because it's not a fast-forward of the existing synced branch. Could be
fixed by making git-annex sync use assistant-style sync branches.
The reversion was that, if a file was git rm'd, but still in branches, it
would not be seen as used. Looking at both the added and the removed (or
changed) files from the diff-index is a cheap way to fix that.
Instead of populating the second-level Bloom filter with every key
referenced in every Git reference, consider only those which differ
from what's referenced in the index.
Incidentaly, unlike with its old behavior, staged
modifications/deletion/... will now be detected by 'unused'.
Credits to joeyh for the algorithm. :-)
When quvi is installed, git-annex addurl automatically uses it to detect
when an page is a video, and downloads the video file.
web special remote: Also support using quvi, for getting files,
or checking if files exist in the web.
This commit was sponsored by Mark Hepburn. Thanks!
This is a simple approach for setting up a mirroring repository.
It will work with any type of remotes.
Mirror --from is more expensive than mirror --to in general.
OTOH, mirror --from will get the file from any remote that has it, not only
the named mirror remote. And if the named mirror remote is not the fastest
available remote with a file, that can speed things up.
It would be possible to make the assistant or watch command do a more
dynamic mirroring, that didn't need to scan every time.
Note that --deduplicate currently checksums each file twice,
once to see if it's a known key, and once when importing it.
Perhaps this could be revisited and the extra checksum gotten rid of,
at the cost of not locking down the file when adding it.
Started with a problem when running addurl on a really long url,
because the whole url is munged into the filename. Ended up doing
a fairly extensive review for places where filenames could get too large,
although it's hard to say I'm not missed any..
Backend.Url had a 128 character limit, which is fine when the limit is 255,
but not if it's a lot shorter on some systems. So check the pathconf()
limit. Note that this could result in fromUrl creating different keys
for the same url, if run on systems with different limits. I don't see
this is likely to cause any problems. That can already happen when using
addurl --fast, or if the content of an url changes.
Both Command.AddUrl and Backend.Url assumed that urls don't contain a
lot of multi-byte unicode, and would fail to truncate an url that did
properly.
A few places use a filename as the template to make a temp file.
While that's nice in that the temp file name can be easily related back to
the original filename, it could lead to `git annex add` failing to add a
filename that was at or close to the maximum length.
Note that in Command.Add.lockdown, the template is still derived from the
filename, just with enough space left to turn it into a temp file.
This is an important optimisation, because the assistant may lock down
a bunch of files all at once, and using the same template for all of them
would cause openTempFile to iterate through the same set of names,
looking for an unused temp file. I'm not very happy with the relatedTemplate
hack, but it avoids that slowdown.
Backend.WORM does not limit the filename stored in the key.
I have not tried to change that; so git annex add will fail on really long
filenames when using the WORM backend. It seems better to preserve the
invariant that a WORM key always contains the complete filename, since
the filename is the only unique material in the key, other than mtime and
size. Since nobody has complained about add failing (I think I saw it
once?) on WORM, probably it's ok, or nobody but me uses it.
There may be compatability problems if using git annex addurl --fast
or the WORM backend on a system with the 255 limit and then trying to use
that repo in a system with a smaller limit. I have not tried to deal with
those.
This commit was sponsored by Alexander Brem. Thanks!
When there's no extension, don't use "none", but "".
When there is an extension, it starts with a dot, so don't put a redundant
dot in the default format.
This was the last place in git-annex that could remove data referred to by
the git history, without being forced.
Like drop, dropunused checks remotes, and honors the global annex.numcopies
setting. (However, .gitattributes settings cannot apply to unused files.)
In direct mode, it's best to whenever possible not move direct mode files
out of the way, and so I made unannex avoid touching the direct mode file at
all.
That actually turns out to be easy, because in direct mode, unlike indirect
mode, the pre-commit hook won't get confused if the unannexed file later
gets added back by git add. So there's no need to commit the unannex right
away; it can be staged for the user to commit later. This also means that
unannex in direct mode is a lot faster than in indirect mode!
Another subtle bit is the bookkeeping that is done when unannexing a direct
mode file. The inode cache needs to be removed so that when uninit runs
getKeysPresent, it doesn't see the cache and think the key is still
present and crash when it's not.
This commit is sponsored by Douglas Butts. Thanks!
A common failure mode for direct mode has been for files to end up still
stored in indirect mode. While I hope that doesn't happen anymore, fsck
should deal with it.
This write permission frobbing is very appropriate in indirect mode,
since annexed objects are stored as immutably as can be managed. But not
in direct mode, where files should be able to be modified at any time.
There are already sufficient guards that there's no need to prevent a file
being written to while it's being ingested, in direct mode. The inode cache
will detect (most) types of modifications, and the add will fail. Then a
re-add should be done. The assistant should get another inotify change
event, and automatically add the new version of the file.
Ie, when there'a a conflicted merge we may get foo.variant-xxxx
created in a merge. If a second merge conflict occurs on that new file,
it was not falling back to putting in the whole key (which should stop
the merge conflicts happening for good, but is ugly).
As seen in this bug report, the lifted exception handling using the StateT
monad throws away state changes when an action throws an exception.
http://git-annex.branchable.com/bugs/git_annex_fork_bombs_on_gpg_file/
.. Which can result in cached values being redundantly calculated, or other
possibly worse bugs when the annex state gets out of sync with reality.
This switches from a StateT AnnexState to a ReaderT (MVar AnnexState).
All changes to the state go via the MVar. So when an Annex action is
running inside an exception handler, and it makes some changes, they
immediately go into affect in the MVar. If it then throws an exception
(or even crashes its thread!), the state changes are still in effect.
The MonadCatchIO-transformers change is actually only incidental.
I could have kept on using lifted-base for the exception handling.
However, I'd have needed to write a new instance of MonadBaseControl
for the new monad.. and I didn't write the old instance.. I begged Bas
and he kindly sent it to me. Happily, MonadCatchIO-transformers is
able to derive a MonadCatchIO instance for my monad.
This is a deep level change. It passes the test suite! What could it break?
Well.. The most likely breakage would be to code that runs an Annex action
in an exception handler, and *wants* state changes to be thrown away.
Perhaps the state changes leaves the state inconsistent, or wrong. Since
there are relatively few places in git-annex that catch exceptions in the
Annex monad, and the AnnexState is generally just used to cache calculated
data, this is unlikely to be a problem.
Oh yeah, this change also makes Assistant.Types.ThreadedMonad a bit
redundant. It's now entirely possible to run concurrent Annex actions in
different threads, all sharing access to the same state! The ThreadedMonad
just adds some extra work on top of that, with its own MVar, and avoids
such actions possibly stepping on one-another's toes. I have not gotten
rid of it, but might try that later. Being able to run concurrent Annex
actions would simplify parts of the Assistant code.
This fixes a bug with git annex add in direct mode. If some files already
existed in the tree pointing at the same key as a file that was just added,
and their content was not present, add neglected to copy the content to
those files.
I also changed the behavior of moveAnnex slightly: When content is moved
into the annex in direct mode, it does not overwrite any content already
present in direct mode files. That content may be modified after all.
That's needed in files used to build the configure program.
For the other files, I'm keeping my __WINDOWS__ define, as I find that much easier to type.
I may search and replace it to use the mingw32_HOST_OS thing later.
A content directory can be frozen in direct mode. One way this can happen
is if the content is transferred before direct mode has a mapping for it,
so it's stored in the content directory.
So, we need to thaw the content directory before doing things with it.
Due to add using withFilesMaybeModified, it will get files that have been
deleted but are still in the index. So catch the IO error that results when
trying to stat such a file.
There may be already staged changes from a prior `git annex add`,
so always commit.
Also, suppressed the commit output, since it contains noise due to
typechanged files in direct mode.
This is so git remotes on servers without git-annex installed can be used
to keep clients' git repos in sync.
This is a behavior change, but since annex-sync can be set to disable
syncing with a remote, I think it's acceptable.
Most remotes have meters in their implementations of retrieveKeyFile
already. Simply hooking these up to the transfer log makes that information
available. Easy peasy.
This is particularly valuable information for encrypted remotes, which
otherwise bypass the assistant's polling of temp files, and so don't have
good progress bars yet.
Still some work to do here (see progressbars.mdwn changes), but this
is entirely an improvement from the lack of progress bars for encrypted
downloads.
* addurl: Register transfer so the webapp can see it.
* addurl: Automatically retry downloads that fail, as long as some
additional content was downloaded.
Fixed by storing a list of cached inodes for a key, instead of just one.
Backwards compatability note: An old git-annex version will fail to parse
an inode cache file that has been written by a new version, and has
multiple items. It will succees if just one. So old git-annexes will have
even worse behavior when there are duplicated files, if that is possible.
I don't think it will be a problem. (Famous last words.)
Also, note that it doesn't expire old and unused inode caches for a key.
It would be possible to add this if needed; just look through the
associated files for a key and if there are more cached inodes, throw out
any not corresponding to associated files. Unless a file is being copied
repeatedly and the old copy deleted, this lack of expiry should not be a
problem.
* since this is a crippled filesystem anyway, git-annex doesn't use
symlinks on it
* so there's no reason to use the mixed case hash directories that we're
stuck using to avoid breaking everyone's symlinks to the content
* so we can do what is already done for all bare repos, and make non-bare
repos on crippled filesystems use the all-lower case hash directories
* which are, happily, all 3 letters long, so they cannot conflict with
mixed case hash directories
* so I was able to 100% fix this and even resuming `git annex add` in the
test case will recover and it will all just work.
This avoids commit churn by the assistant when eg,
replacing a file with a symlink.
But, just as importantly, it prevents the working tree being left with a
deleted file if git-annex, or perhaps the whole system, crashes at the
wrong time.
(It also probably avoids confusing displays in file managers.)
I would have sort of liked to put this in .gitattributes, but it seems
it does not support multi-word attribute values. Also, making this a single
config setting makes it easy to only parse the expression once.
A natural next step would be to make the assistant `git add` files that
are not annex.largefiles. OTOH, I don't think `git annex add` should
`git add` such files, because git-annex command line tools are
not in the business of wrapping git command line tools.
There was confusion in different parts of the progress bar code about
whether an update contained the total number of bytes transferred, or the
number of bytes transferred since the last update. One way this bug
showed up was progress bars that seemed to stick at zero for a long time.
In order to fix it comprehensively, I add a new BytesProcessed data type,
that is explicitly a total quantity of bytes, not a delta.
Note that this doesn't necessarily fix every problem with progress bars.
Particularly, buffering can now cause progress bars to seem to run ahead
of transfers, reaching 100% when data is still being uploaded.
Needed to send a trailing NUL to end a request, and set the read handle
non-blocking.
Also, set fileSystemEncoding on all handles, since there's a filename in
there.
There are two types of equality here, and which one is right varies,
so this forces me to consider and choose between them.
Based on this, I learned that the commit in git anex sync was
always doing a strong comparison, even when in a repository where
the inodes had changed. Fixed that.
The comments correctly noted that the remote could drop the key and
yet False be returned due to some problem that occurred afterwards.
For example, if it's a network remote, it could drop the key just
as the network goes down, and so things timeout and a nonzero exit
from ssh is propigated through and False returned.
However... Most of the time, this scenario will not have happened.
False will mean the remote was not available or could not drop the key
at all.
So, instead of assuming the worst, just trust the status we have.
If we get it wrong, and the scenario above happened, our location
log will think the remote has the key. But the remote's location
log (assuming it has one) will know it dropped it, and the next sync
will regain consistency.
For a special remote, with no location log, our location log will be wrong,
but this is no different than the situation where someone else dropped
the key from the remote and we've not synced with them. The standard
paranoia about not trusting the location log to be the last word about
whether a remote has a key will save us from these situations. Ie,
if we try to drop the file, we'll actively check the remote,
and determine the inconsistency then.
Clean up from 9769235d6b.
In some cases, looking up a remote by name even though it has no UUID is
desirable. This includes git annex sync, which can operate on remotes
without an annex, and XMPP pairing, which runs addRemote (with calls
byName) before the UUID of the XMPP remote has been configured in git.
Pass subcommand as a regular param, which allows passing git parameters
like -c before it. This was already done in the pipeing set of functions,
but not the command running set.
I have seen some other programs do this, and think it's pretty cool. Means
you can test wherever it's deployed, as well as at build time.
My other reason for doing it is less happy. Cabal's handling of test suites
sucks, requiring duplicated info, and even when that's done, it fails to
preprocess hsc files here. Building it in avoids that and avoids having
to explicitly tell cabal to enable test suites, which would then make it
link the test executable every time, which is unnecessarily slow.
This also has the benefit that now "make fast test" does a max speed build
and tests it.
The only thing lost is ./ghci
Speed: make fast used to take 20 seconds here, when rebuilding from
touching Command/Unused.hs. With cabal, it's 29 seconds.
Adding a file that is already annexed, but has been modified, was broken in
direct mode.
This fix makes the new content be added. It does have the problem that
re-running `git annex add` will checksum and re-add the content repeatedly,
until it's committed. This happens because the key associated with the file
does not change until the new one gets committed, so it keeps thinking the
file has changed.
Refactored annex link code into nice clean new library.
Audited and dealt with calls to createSymbolicLink.
Remaining calls are all safe, because:
Annex/Link.hs: ( liftIO $ createSymbolicLink linktarget file
only when core.symlinks=true
Assistant/WebApp/Configurators/Local.hs: createSymbolicLink link link
test if symlinks can be made
Command/Fix.hs: liftIO $ createSymbolicLink link file
command only works in indirect mode
Command/FromKey.hs: liftIO $ createSymbolicLink link file
command only works in indirect mode
Command/Indirect.hs: liftIO $ createSymbolicLink l f
refuses to run if core.symlinks=false
Init.hs: createSymbolicLink f f2
test if symlinks can be made
Remote/Directory.hs: go [file] = catchBoolIO $ createSymbolicLink file f >> return True
fast key linking; catches failure to make symlink and falls back to copy
Remote/Git.hs: liftIO $ catchBoolIO $ createSymbolicLink loc file >> return True
ditto
Upgrade/V1.hs: liftIO $ createSymbolicLink link f
v1 repos could not be on a filesystem w/o symlinks
Audited and dealt with calls to readSymbolicLink.
Remaining calls are all safe, because:
Annex/Link.hs: ( liftIO $ catchMaybeIO $ readSymbolicLink file
only when core.symlinks=true
Assistant/Threads/Watcher.hs: ifM ((==) (Just link) <$> liftIO (catchMaybeIO $ readSymbolicLink file))
code that fixes real symlinks when inotify sees them
It's ok to not fix psdueo-symlinks.
Assistant/Threads/Watcher.hs: mlink <- liftIO (catchMaybeIO $ readSymbolicLink file)
ditto
Command/Fix.hs: stopUnless ((/=) (Just link) <$> liftIO (catchMaybeIO $ readSymbolicLink file)) $ do
command only works in indirect mode
Upgrade/V1.hs: getsymlink = takeFileName <$> readSymbolicLink file
v1 repos could not be on a filesystem w/o symlinks
Audited and dealt with calls to isSymbolicLink.
(Typically used with getSymbolicLinkStatus, but that is just used because
getFileStatus is not as robust; it also works on pseudolinks.)
Remaining calls are all safe, because:
Assistant/Threads/SanityChecker.hs: | isSymbolicLink s -> addsymlink file ms
only handles staging of symlinks that were somehow not staged
(might need to be updated to support pseudolinks, but this is
only a belt-and-suspenders check anyway, and I've never seen the code run)
Command/Add.hs: if isSymbolicLink s || not (isRegularFile s)
avoids adding symlinks to the annex, so not relevant
Command/Indirect.hs: | isSymbolicLink s -> void $ flip whenAnnexed f $
only allowed on systems that support symlinks
Command/Indirect.hs: whenM (liftIO $ not . isSymbolicLink <$> getSymbolicLinkStatus f) $ do
ditto
Seek.hs:notSymlink f = liftIO $ not . isSymbolicLink <$> getSymbolicLinkStatus f
used to find unlocked files, only relevant in indirect mode
Utility/FSEvents.hs: | Files.isSymbolicLink s = runhook addSymlinkHook $ Just s
Utility/FSEvents.hs: | Files.isSymbolicLink s ->
Utility/INotify.hs: | Files.isSymbolicLink s ->
Utility/INotify.hs: checkfiletype Files.isSymbolicLink addSymlinkHook f
Utility/Kqueue.hs: | Files.isSymbolicLink s = callhook addSymlinkHook (Just s) change
all above are lower-level, not relevant
Audited and dealt with calls to isSymLink.
Remaining calls are all safe, because:
Annex/Direct.hs: | isSymLink (getmode item) =
This is looking at git diff-tree objects, not files on disk
Command/Unused.hs: | isSymLink (LsTree.mode l) = do
This is looking at git ls-tree, not file on disk
Utility/FileMode.hs:isSymLink :: FileMode -> Bool
Utility/FileMode.hs:isSymLink = checkMode symbolicLinkMode
low-level
Done!!
Now getKeysPresent checks that the key's content, not only its directory,
exists. In direct mode, the inode cache file is used as a standin for the
content.
removeAnnex always removes the inode cache file, and drop and move --from
always call removeAnnex, even if the object does not seem to be inAnnex,
to ensure it's always deleted.
This reverts commit 57780cb3a4.
This was buggy, it caused the direct mode cache to be lost when dropping
keys, so when the file is gotten back, it's stored in indirect mode.
Note to self: Do not attempt bug fixes at 6 am!
In indirect mode, now checks the inode cache to detect changes to a file.
Note that a file can still be changed if a process has it open for write,
after landing in the annex.
In direct mode, some checking of the inode cache was done before, but
from a much later point, so fewer modifications could be detected. Now it's
as good as indirect mode.
On crippled filesystems, no lock down is done before starting to add a
file, so checking the inode cache is the only protection we have.
git annex init probes for crippled filesystems, and sets direct mode, as
well as `annex.crippledfilesystem`.
Avoid manipulating permissions of files on crippled filesystems.
That would likely cause an exception to be thrown.
Very basic support in Command.Add for cripped filesystems; avoids the lock
down entirely since doing it needs both permissions and hard links.
Will make this better soon.
Various things that don't work on Android are just ifdefed out.
* the webapp (needs template haskell for arm)
* --include and --exclude globbing (needs libpcre, which is not ported;
probably I'll make it use the pure haskell glob library instead)
* annex.diskreserve checking (missing sys/statvfs.h)
* timestamp preservation support (yawn)
* S3
* WebDAV
* XMPP
The resulting 17mb binary has been tested on Android, and it is able to,
at least, print its usage message.
These files were left behind, and made getKeysPresent find keys that were
not present. It would be expensive to make getKeysPresent check that the
actual key files are present (it just lists the directories). But that's not
needed if we just clean up the stale cache and mapping files.
To handle systems that were in direct mode and got switched back with stale
direct mode files, made cleanObjectLoc remove all files in the key's directory.
git annex unused will still list keys that are gone but for which the stale
direct mode files exists. To deal with that, made dropunused remove the key's
directory even if the key does not seem to be present.
Making the pre-commit hook look at git diff-index to find changed direct
mode files and update the mappings works pretty well.
One case where it does not work is when a file is git annex added, and then
git rmed, and then this is committed. That's a no-op commit, so the hook
probably doesn't even run, and it certianly never notices that the file
was deleted, so the mapping will still have the original filename in it.
For this and other reasons, it's important that the mappings still be
treated as possibly inconsistent.
Also, the assistant now allows the pre-commit hook to run when in direct
mode, so the mappings also get updated there.
It used to not log to daemon.log when a repository was first created, and
when starting the webapp. Now both do. Redirecting stdout and stderr to the
log is tricky when starting the webapp, because the web browser may want to
communicate with the user. (Either a console web browser, or web.browser = echo)
This is handled by restoring the original fds when running the browser.
An earlier commit (mislabeled) made direct mode fsck check file checksums.
While it's expected for files to change at any time in direct mode, and so
fsck cannot complain every time there's a checksum mismatch, it is possible
for it to detect when a file does not *seem* to have changed, then check
its checksum, and so detect disk corruption or other problems.
This commit improves that, by checking a second time, if the checksum
fails, that the file is still not modified, before taking action. This way,
a direct mode file can be modified while being fscked.
Now there's a Config type, that's extracted from the git config at startup.
Note that laziness means that individual config values are only looked up
and parsed on demand, and so we get implicit memoization for all of them.
So this is not only prettier and more type safe, it optimises several
places that didn't have explicit memoization before. As well as getting rid
of the ugly explicit memoization code.
Not yet done for annex.<remote>.* configuration settings.
This avoids some small overhead by only running the check once per command;
it also ensures that, even if the command doesn't find anything to run on,
it still fails to run when in a bare repo.
* Bugfix: Remove leading \ from checksums output by sha*sum commands,
when the filename contains \ or a newline. Closes: #696384
* fsck: Still accept checksums with a leading \ as valid, now that
above bug is fixed.
* migrate: Remove leading \ in checksums
However, I don't yet have a reliable way to deal with files being modified
while they're being transferred. I have code that detects it on the sending
side, but the receiver is still free to move the wrong content into its
annex, and record that it has the content. So that's not acceptable, and
I'll need to work on it some more.
However, at this point I can use a direct mode repository as a remote and
transfer files from and to it.
* get/copy --auto: Transfer data even if it would exceed numcopies,
when preferred content settings want it.
* drop --auto: Fix dropping content when there are no preferred content
settings.
Files are now written to a tmp directory in the remote, and once all
chunks are written, etc, it's moved into the final place atomically.
For now, checkpresent still checks every single chunk of a file, because
the old method could leave partially transferred files with some chunks
present and others not.
Inject the required git-remote-xmpp into PATH when running xmpp git push.
Rest of the time it will not be in PATH, and git won't be able to talk to
xmpp remotes.