Make git-annex sync and the assistant skip trying to drop from appendonly
remotes since it's just going to fail.
git-annex drop and similar commands will still try to drop from
appendonly, so the user will see failure messages when they try to do
that. To do otherwise would be confusing since the user has explicitly
asked for a drop with those commands.
This commit was supported by the NSF-funded DataLad project.
Probably not noticed until now because the queue is large enough that two
threads each filling theirs at the same time and flushing is unlikely to
happen.
Also made explicit that each worker thread gets its own queue.
I think that was the case before, but if something was put in the queue
before worker threads were forked off, they could have each inherited the
same queue.
Could have gone with a single shared queue, but per-worker queues is more
efficient, because a worker can add lots of stuff to its own queue without
any locking.
This commit was sponsored by Ole-Morten Duesund on Patreon.
Avoids annex.largefiles inconsitency and also avoids a lot of
unneccessary calls to the clean filter when a large repo's clone
is being initialized.
This commit was supported by the NSF-funded DataLad project.
v6: When annex.largefiles is not configured for a file, running git add or
git commit, or otherwise using git to stage a file will add it to the annex
if the file was in the annex before, and to git otherwise. This is to avoid
accidental conversion.
Note that git-annex add's behavior has not changed, for reasons explained
in the added comment.
Performance: No added overhead when annex.largefiles is configured.
When not configured, there is an added call to catObjectMetaData,
which involves a round trip through git cat-file --batch.
However, the earlier catKeyFile primes the cache for it.
This commit was supported by the NSF-funded DataLad project.
If a pointer file is being populated and something modifies it at the
same time, there was a race there the modified file's InodeCache
could get added into the keys database.
Note that replaceFile normally renames the temp file into place, so the
inode cache caculated for the temp file will still be good. If it has to
fall back to a copy, the worktree file won't be put in the inode cache.
This has the same result as if the worktree file gets touched, and will
be handled the same way. Eg, when dropping, isUnmodified will do an
expensive comparison and notice that the worktree file does have the
same content, and so drop it.
This commit was supported by the NSF-funded DataLad project.
This would be better if getInternalFiles were
more polymorphic, but I can't see a good
way to accomplish that without messing with Data.Typeable,
which seemed like overkill.
Reverted CommandAction back to the simpler version.
This commit was sponsored by Eric Drechsel on Patreon.
Test suite found a case where this is necessary.
And the man page says this, although current behavior is not as
documented..
Note that files beginning with . are discarded.
This includes ./file and dir/./file. If you don’t want
this, then use cleaner names.
This may hit path length limits on Windows. shrug
This commit was supported by the NSF-funded DataLad project.
Check just before running update-index if the worktree file's content is
still the same, don't update it when it's been modified. This narrows
the race window a lot, from possibly minutes or hours, to seconds or
less.
(Use replaceFile so that the worktree update happens atomically,
allowing the InodeCache of the new worktree file to itself be gathered
w/o any other race.)
This doesn't eliminate the race; it can still occur in the window before
update-index runs. When annex.queue is large, a lot of files will be
statted by the checks, and so the window may still be large enough to be a
problem.
When only a few files are being processed, the window is as small as it
is in the race where a modification gets overwritten by git-annex when
it updates the worktree. Or maybe as small as whatever race git
checkout/pull/merge may have when the worktree gets modified during it.
Still, I've kept a todo about this race.
This commit was supported by the NSF-funded DataLad project.
Use git update-index --refresh, since it's a little bit more
efficient and the user can be told to run it if a locked index prevents
git-annex from running it.
This also fixes the problem where an annexed file was deleted in the index
and a get of another file that uses the same key caused the index update to
add back the deleted file. update-index will not add back the deleted file.
Documented in tips/unlocked_files.mdwn the gotcha that the index update
may conflict with other operations. I can't see any way to possibly avoid
that conflict.
One new todo about a race that causes a modification to be accidentially
staged.
Note that the assistant only flushes the git command queue when it
commits a modification. I have not tested the assistant with v6 unlocked
files, but assume most users of the assistant won't care if the index
shows a file as modified for a while.
This commit was supported by the NSF-funded DataLad project.
After updating the worktree for an add/drop, update git's index, so git
status will not show the files as modified.
What actually happens is that the index update removes the inode
information from the index. The next git status (or similar) run
then has to do some work. It runs the clean filter.
So, this depends on the clean filter being reasonably fast and on git
not leaking memory when running it. Both problems were fixed in
a96972015d, but only for git 2.5. Anyone
using an older git will see very expensive git status after an add/drop.
This uses the same git update-index queue as other parts of git-annex, so
the actual index update is fairly efficient. Of course, updating the index
does still have some overhead. The annex.queuesize config will control how
often the index gets updated when working on a lot of files.
This is an imperfect workaround... Added several todos about new
problems this workaround causes. Still, this seems a lot better than the
old behavior.
This commit was supported by the NSF-funded DataLad project.
Added getStaged, to get the versions of git-annex branch files staged in its
index, and use during transitions so the result of merging sibling branches
is used.
The catFileStop in performTransitionsLocked is absolutely necessary,
without that the bug still occurred, because git cat-file was already
running and was looking at the old index file.
Note that getLocal still has cat-file look at the git-annex branch, not the
index. It might be faster if it looked at the index, but probably only
marginally so, and I've not benchmarked it to see if it's faster at all. I
didn't want to change unrelated behavior as part of this bug fix. And as
the need for catFileStop shows, using the index file has added
complications.
Anyway, it still seems fine for getLocal to look at the git-annex branch,
because normally the index file is updated just before the git-annex branch
is committed, and so they'll contain the same information. It's only during
a transition that the two diverge.
This commit was sponsored by Paul Walmsley in honor of Mark Phillips.
It was sorting by uuid, rather than cost!
Avoid future bugs of this kind by changing the Ord to primarily compare
by cost, with uuid only used when the cost is the same.
This commit was supported by the NSF-funded DataLad project.
Added annex.commitmessage config that can specify a commit message for the
git-annex branch instead of the usual "update".
This commit was supported by the NSF-funded DataLad project.
Support working trees set up by git-worktree, by setting up some symlinks
such that git-annex links work right.
Also improved support for repositories created with --separate-git-dir.
At least recent git makes a .git file for those (older may have used a
symlink?), so that also needs to be converted to a symlink.
This commit was sponsored by Nick Piper on Patreon.
addurl: When security configuration prevents downloads with youtube-dl,
still check if the url is one that it supports, and fail downloading it,
instead of downloading the raw web page.
No point in keeping an empty tmp workdir around.
The associated tmp object file is retained even if empty, didn't want to
deal with any possible races with something else downloading to that
file at the same time this would check if it's empty. Anyhow, temp
object files are normally retained, and this will get cleaned out the
same way those do, by dropunused.
Leveraged the existing verification code by making it also check the
retrievalSecurityPolicy.
Also, prevented getViaTmp from running the download action at all when the
retrievalSecurityPolicy is going to prevent verifying and so storing it.
Added annex.security.allow-unverified-downloads. A per-remote version
would be nice to have too, but would need more plumbing, so KISS.
(Bill the Cat reference not too over the top I hope. The point is to
make this something the user reads the documentation for before using.)
A few calls to verifyKeyContent and getViaTmp, that don't
involve downloads from remotes, have RetrievalAllKeysSecure hard-coded.
It was also hard-coded for P2P.Annex and Command.RecvKey,
to match the values of the corresponding remotes.
A few things use retrieveKeyFile/retrieveKeyFileCheap without going
through getViaTmp.
* Command.Fsck when downloading content from a remote to verify it.
That content does not get into the annex, so this is ok.
* Command.AddUrl when using a remote to download an url; this is new
content being added, so this is ok.
This commit was sponsored by Fernando Jimenez on Patreon.
A local http proxy would bypass the security configuration. So,
the security configuration has to be applied when choosing whether to
use the proxy.
While http rebinding attacks against the dns lookup of the proxy IP
address seem very unlikely, this implementation does prevent them, since
it resolves the IP address once, checks it, and then reconfigures
http-client's proxy using the resolved address.
This commit was sponsored by Ole-Morten Duesund on Patreon.
Security fix!
* git-annex will refuse to download content from http servers on
localhost, or any private IP addresses, to prevent accidental
exposure of internal data. This can be overridden with the
annex.security.allowed-http-addresses setting.
* Since curl's interface does not have a way to prevent it from accessing
localhost or private IP addresses, curl defaults to not being used
for url downloads, even if annex.web-options enabled it before.
Only when annex.security.allowed-http-addresses=all will curl be used.
Since S3 and WebDav use the Manager, the same policies apply to them too.
youtube-dl is not handled yet, and a http proxy configuration can bypass
these checks too. Those cases are still TBD.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
Security fix! Allowing any schemes, particularly file: and
possibly others like scp: allowed file exfiltration by anyone who had
write access to the git repository, since they could add an annexed file
using such an url, or using an url that redirected to such an url,
and wait for the victim to get it into their repository and send them a copy.
* Added annex.security.allowed-url-schemes setting, which defaults
to only allowing http and https URLs. Note especially that file:/
is no longer enabled by default.
* Removed annex.web-download-command, since its interface does not allow
supporting annex.security.allowed-url-schemes across redirects.
If you used this setting, you may want to instead use annex.web-options
to pass options to curl.
With annex.web-download-command removed, nearly all url accesses in
git-annex are made via Utility.Url via http-client or curl. http-client
only supports http and https, so no problem there.
(Disabling one and not the other is not implemented.)
Used curl --proto to limit the allowed url schemes.
Note that this will cause git annex fsck --from web to mark files using
a disallowed url scheme as not being present in the web. That seems
acceptable; fsck --from web also does that when a web server is not available.
youtube-dl already disabled file: itself (probably for similar
reasons). The scheme check was also added to youtube-dl urls for
completeness, although that check won't catch any redirects it might
follow. But youtube-dl goes off and does its own thing with other
protocols anyway, so that's fine.
Special remotes that support other domain-specific url schemes are not
affected by this change. In the bittorrent remote, aria2c can still
download magnet: links. The download of the .torrent file is
otherwise now limited by annex.security.allowed-url-schemes.
This does not address any external special remotes that might download
an url themselves. Current thinking is all external special remotes will
need to be audited for this problem, although many of them will use
http libraries that only support http and not curl's menagarie.
The related problem of accessing private localhost and LAN urls is not
addressed by this commit.
This commit was sponsored by Brett Eisenberg on Patreon.
Since 3dd43df9c2, the socket warmup does
not run git-annex-shell on the remote host, and the point of this check
was to avoid error messages running git-annex-shell when it was not
installed. So the check is not needed any longer.
Also, this is one of only two uses of remoteGitConfig, which
I want to get rid of for reasons explained in
fc5888300f.
This commit was sponsored by Fernando Jimenez on Patreon.
This fixes a crash when a git submodule has a name starting with a dot.
Such a submodule might contain dotfiles that are intended to be used when
inside the view (since a dot-directory that's not a submodule was already
preserved when entering a view). So, rather than eliminating the submodule
from the view, its git ls-files --stage hash is copied over into the view.
dotfiles/dirs have their git ls-files --stage hashes similarly copied over
to the view. This is more efficient and simpler than the old method,
and also won't break if git ever adds a new type of tree item, like was
done with submodules.
Since the content of dotfiles in the working tree is no longer hashed
when entering a view, when there are unstaged modifications, they are
not included in the view branch. Entering the view branch still works,
but git checkout shows "M .dotfile", and git diff will show the unstaged
changes. This seems like an improvement over the old behavior.
Also made Command.View not delete empty directories that are submodules
when entering a view, while still deleting other empty directories.
This commit was supported by the NSF-funded DataLad project.
This was badly named, it's a not a blob necessarily, but anything that a
tree can refer to.
Also removed the Show instance which was used for serialization to git
format, instead use fmtTreeItemType.
This commit was supported by the NSF-funded DataLad project.
* Display error message when http download fails.
There's nothing in the http-client library to nicely format a http
exception, so in some cases it has to fall back to using show on it.
Seems better than just saying "it failed" or only showing the http
status code.
* Avoid forward retry when 0 bytes were received.
forwardRetry was comparing Nothing to Just 0, and so thought there had
been progress made when 0 bytes were received.
This commit was supported by the NSF-funded DataLad project.
Fix regression in last release that crashes when using --all or running
git-annex in a bare repository. May have also affected git-annex unused and
git-annex info.
Reversed the order of the (++) in Annex.Branch.files so --all will stream
lazily still when there are not a bunch of uncommitted journal files.
Added a todo to maybe improve this later.
This commit was sponsored by Trenton Cronholm on Patreon.
In Annex.Branch.branch, the (++) was killing laziness.
Rewrote so it streams lazily.
filterM also kills laziness, so made loggedKeys use a Unchecked type,
and check if the key is dead in the seek loop.
Note that loggedKeysFor still buffers, so git-annex info <remote> and
git-annex unused --from remote still use more memory than necessary.
Also removed some unused functions from Annex.Journal.
Flipped all comparisons. When a TrustLevel list was wanted from Trusted
downwards, used Down to compare it in that order.
This commit was sponsored by mo on Patreon.
This commit removes the Ord and Enum instances, commenting out all code
that depends on them, to make sure that all code effected by the
inversion fix has been identified.
(Assuming no ifdefs involve TrustLevel.)
The next commit will fix up all the identified code.
move: Added --safe option, which makes move honor numcopies settings.
Also --unsafe enables the default behavior, anticipating that the
default may one day change.
This commit was sponsored by Ethan Aubin.
* For url downloads, git-annex now defaults to using a http library,
rather than wget or curl. But, if annex.web-options is set, it will
use curl. To use the .netrc file, run:
git config annex.web-options --netrc
* git-annex no longer uses wget (and wget is no longer shipped with
git-annex builds).
Note that curl is always run in silent mode, since the new API for
download has a MeterUpdate and doesn't make way for curl progress
output. It might be worth writing a parser for curl's progress output
to update the meter when using it, but I didn't bother with this edge
case for now.
This commit was supported by the NSF-funded DataLad project.
Enable HTTP connection reuse across multiple files, when git-annex
uses http-conduit. Before, a new Manager was created each time
Utility.Url used it. Now, a single Manager gets created the first time,
so connections are reused.
Doesn't help when external programs are used for url download,
but does speed up addurl --fast, fsck --from web, etc.
Testing fsck --fast --from web with 3 files, over high-latency
satellite internet, it sped up from 19.37s to 14.96s.
This commit was supported by the NSF-funded DataLad project.
When adding a new version of a file, and annex.genmetadata is enabled,
don't copy the data metadata from the old version of the file, instead use
the mtime of the file. Rationalle being that the user has requested to
generate metadata and so would expect to get the new mtime into metadata.
Also, avoid warning about copying metadata when all the old metadata is
date metadata. Which was rather the harder part.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
I think this used to be the case, but it was accidentially lost way back in
commit 3887432c54. Normally, transfers do not
throw exceptions, so probably forward retrying was rarely done due to that
oversight.
This also affects the new annex.retry etc configuration. If a transfer
fails, without making any progress, eg because the file is not present on
the remote or the remote is not accessible, it will now retry when
configuration calls for it. In some cases such a retry is not desirable,
for example the remote could be accessible and not have a copy of the file
that the local repo thinks it has. I see no way to distinguish such cases
from cases where a retry should really be done. So, it'll be up to the user
to configure it to work for them.
Added annex.retry, annex.retry-delay, and per-remote versions to configure
transfer retries.
This commit was supported by the NSF-funded DataLad project.
Fix race condition in ssh warmup that caused git-annex to get stuck and
never process some while when run with high levels of concurrency.
So far, I've isolated the problem to processTranscript, which hangs
reading output from ssh in this situation. I don't yet understand why
processTranscript behaves that way.
Since here we don't care about the ssh output, and only want to /dev/null
it, changed to not use processTranscript, avoiding its problem.
This commit was supported by the NSF-funded DataLad project.
Avoid creating transfer info file before transfer lock is created and
locked.
The wrong order for one thing caused transfer info to be overwritten
when a transfer was already in progress.
But worse, it caused checkTransfer to see the transfer info,
and so lock the transfer lock in order to verify the transfer was not in
progress. Which in a concurrent situation, prevented the transferrer
from locking the transfer lock, so it failed with "transfer already in
progress".
Note that the transferinfo command does not lock the transfer lock
before creating the transfer info. But, that's only run after
recvkey is running, and recvkey does lock the transfer lock, so that
seems more or less ok. (Other than being a super complicated legacy mess
that the P2P code has mostly obsoleted now.)
This commit was supported by the NSF-funded DataLad project.
When resuming a download and not using a rolling checksummer like rsync,
the partial file we start with might contain garbage, in the case where a
file changed as it was being downloaded. So, disabling verification on
resumes risked a bad object being put into the annex.
Even downloads with rsync are currently affected. It didn't seem worth the
added complexity to special case those to prevent verification, especially
since git-annex is using rsync less often now.
This commit was sponsored by Brock Spratlen on Patreon.
P2P protocol version 1 adds VALID|INVALID after DATA; INVALID means the
file was detected to change content while it was being sent and so we
may not have received the valid content of the file.
Added new MustVerify constructor for Verification, which forces
verification even when annex.verify=false etc. This is used when INVALID
and in protocol version 0.
As well as changing git-annex-shell p2psdio, this makes git-annex tor
remotes always force verification, since they don't yet use protocol
version 1. Previously, annex.verify=false could skip verification when
using tor remotes, and let bad data into the repository.
This commit was sponsored by Jack Hill on Patreon.
lockContentShared had a screwy caveat that it didn't verify that the content
was present when locking it, but in the most common case, eg indirect mode,
it failed to lock when the content is not present.
That led to a few callers forgetting to check inAnnex when using it,
but the potential data loss was unlikely to be noticed because it only
affected direct mode I think.
Fix data loss bug when the local repository uses direct mode, and a
locally modified file is dropped from a remote repsitory. The bug
caused the modified file to be counted as a copy of the original file.
(This is not a severe bug because in such a situation, dropping
from the remote and then modifying the file is allowed and has the same
end result.)
And, in content locking over tor, when the remote repository is
in direct mode, it neglected to check that the content was actually
present when locking it. This could cause git annex drop to remove
the only copy of a file when it thought the tor remote had a copy.
So, make lockContentShared do its own inAnnex check. This could perhaps
be optimised for direct mode, to avoid the check then, since locking
the content necessarily verifies it exists there, but I have not bothered
with that.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
Added annex.merge-annex-branches config setting which can be used to
disable automatic merge of git-annex branches.
I wonder if git-annex merge/sync/assistant should disable this
setting? Not sure yet, so have not done so. May be that users will not set
it in git config, but pass it via -c to commands that need it.
Checking the config setting adds a very small overhead, but it's
only checked once per command so should be insignificant.
This commit was supported by the NSF-funded DataLad project.
Repositories that are upgraded from before that version to this
one will not break, but will just not see the benefit of the mergedrefs log
speeding things up, until one new ref gets merged in.
And for tab completion, by not unnessessarily statting paths to remotes,
which used to cause eg, spin-up of removable drives.
Got rid of the remotes member of Git.Repo. This was a bit painful.
Remote.Git modifies the list of remotes as it reads their configs,
so still need a persistent list of remotes. So, put it in as
Annex.gitremotes. It's only populated by getGitRemotes, so commands
like examinekey that don't care about remotes won't do so.
This commit was sponsored by Jake Vosloo on Patreon.
Fix more places where files in .git/annex/ were written with modes that
did not take the core.sharedRepository config into account.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
git grep writeFile finds some more that might also be problems, but
for now I've concentrated on .git/annex/ log files. There are certianly
cases where writeFile is not a problem too.
This commit was sponsored by mo on Patreon.
Fourth or fifth try at this and finally found a way to make it work.
Absurd amount of busy-work forced on me by change in cabal's behavior.
Split up Utility modules that need posix stuff out of ones used by
Setup. Various other hacks around inability for Setup to use anything
that ifdefs a use of unix.
Probably lost a full day of my life to this.
This is how build systems make their users hate them. Just saying.
And also now in non-fast mode, since it was just changed to query for the
filename separately.
And avoid processTranscript which mixed up stdout and stderr and could have
led to weirdness if there were warnings that didn't get suppressed.
addurl: When the file youtube-dl will download is already an annexed file,
don't download it again and fail to overwrite it, instead just do nothing,
like it used to when quvi was used.
This commit was sponsored by Anthony DeRobertis on Patreon.
This reverts commit 51228c2306.
No, still doesn't work when built with cabal. It did with stack; stack
must somehow make the unix package implicitly available.
With cabal, System.Posix.Process and System.Posix.Env are both missing.
Seems I had all the work in past commits to make this build, at least on
linux. I'm actually surprised it does, without a unix dep, Utility.Env
still builds ok somehow despite using System.Posix.Env.
This commit was sponsored by Fernando Jimenez on Patreon.
This avoids warnings from stack about the module not being listed in the
cabal file. So, the generated file is also renamed to Build/SysConfig.
Note that the setup program seems to be cached despite these changes; I
had to cabal clean to get cabal to update it so that Build/SysConfig was
written.
This commit was sponsored by Jochen Bartl on Patreon.
A top-level .noannex file will prevent git-annex init from being used in a
repository. This is useful for repositories that have a policy reason not
to use git-annex. The content of the file will be displayed to the user who
tries to run git-annex init.
This also affects git annex reinit and initialization via the webapp.
It does not affect automatic inits, when there's a sibling git-annex branch
already.
This commit was supported by the NSF-funded DataLad project.
Similar to c6e4bc0a22 but another code
path. As well as using youtube-dl unecessarily, it used the filename it
comes up with, which while nice for youtube videos, is not right for
other files.
This means more work is done for urls that youtube-dl does support,
but is probably more efficient for other urls, since it only downloads
the first chunk of content, while youtube-dl probably downloads more.
This commit was supported by the NSF-funded DataLad project.
Now youtubeDlCheck downloads the beginning of the url's content and
checks if it's html, only when it is does it pass it off the youtube-dl
to check if it supports it.
This means more work is done for urls that youtube-dl does support,
but is probably more efficient for other urls, since it only downloads
the first chunk of content, while youtube-dl probably downloads more.
As well as the reported bug, this also fixes behavior when an url
was added with youtube-dl, but the url content has now changed from
a html page to something else. Remote.Web.checkKey used to wrongly
succeed in that situation, since youtube-dl said sure it can download
that something else.
This commit was supported by the NSF-funded DataLad project.
Better to make it not be surprising and slow, than surprising and fast.
--raw can be used when it needs to be really fast.
Implemented adding a youtube-dl supported url to an existing file.
This commit was sponsored by andrea rota.
Decided not to --ignore-config by default. It the user has something in
their youtube-dl config files that breaks git-annex they can configure
it to use that option.
Fully working, including --fast/--relaxed.
Note that, while git-annex addurl --relaxed is not going to check
youtube-dl, I kept git annex importfeed --relaxed checking it.
Thinking is that, let's not break people's importfeed cron jobs, and
importfeed does not typically have to check a large number of new items,
so it's ok if it's a little bit slower when used with youtube playlist
feeds.
importfeed's behavior is also improved (?) when a feed has links in it
to non-media files. Before, those were skipped. Now, the content of the
link is downloaded. This had to be done, because trying to use
youtube-dl is slow, and if those were skipped, it would have to check
every time importfeed was run. While this behavior change may not be
desirable for some feeds, that intersperse links to web pages with
enclosures, it will be desirable for other feeds, that have
non-enclosure directy links to media files.
Remove old quvi modules.
This commit was sponsored by Øyvind Andersen Holm.
Including resuming and cleanup of incomplete downloads.
Still todo: --fast, --relaxed, importfeed, disk reserve checking,
quvi code cleanup.
This commit was sponsored by Anthony DeRobertis on Patreon.
Needed to run youtube-dl in, but could also be useful for other stuff.
The tricky part of this was making the workdir be cleaned up whenever the
tmp object file is cleaned up.
This commit was sponsored by Ole-Morten Duesund on Patreon.
Needs ghc 7.6.1, so minimum base version increased slightly. All builds
are well above this version of ghc, and debian oldstable is as well.
Code that could use lambdacase can be found by running:
git grep -B 1 'case ' | less
and searching in less for "<-"
This commit was sponsored by andrea rota.
They need unix on non-windows, for Utility.Env, which Build.Configure uses,
but cabal can't express that in a custom-setup stanza.
To avoid this problem, Utility.Env would need to be moved into
unix-compat..
Windows needs the setenv package in custom-setup, but I don't want to
pull it in on unix, which would probably break some builds and need more
work. Instead, split out setEnv to a separate module.
Quite likely, unix-compat will get a portable environment layer, and
then both modules can be removed from here.
This commit was sponsored by Øyvind Andersen Holm.
That version has my patches for the problems that Utility.PosixFiles
was working around, so am able to get rid of that module now.
This will later allow bringing back the custom-setup stanza in the cabal
file. It will need to depend on unix-compat 0.5 on all OS's, which I'm
not ready to do yet.
This commit was sponsored by Nick Daly on Patreon.
Actual problem is the keyName was set to "Ref \"sha\"", which led to
this follow-on failure since it contained a space.
The bad data would also get into the export database when exporting to a
non-external special remote. Looking briefly at that, I don't think the bad
data will lead to anything more than a re-upload of the file content
now that the problem has been fixed.
This commit was sponsored by Peter Hogg on Patreon.
This avoids all the complication about redundant work discussed in
the previous try at fixing this. At the expense of needing each command
that could have the problem to be patched to simply wrap the action in
onlyActionOn once the key is known. But there do not seem to be many
such commands.
onlyActionOn' should not be used with a CommandStart (or CommandPerform),
although the types do allow it. onlyActionOn handles running the whole
CommandStart chain. I couldn't immediately see a way to avoid mistken
use of onlyActionOn'.
This commit was supported by the NSF-funded DataLad project.
After a false start, I found a fairly non-intrusive way to deal with it.
Although it only handles transfers -- there may be issues with eg
concurrent dropping of the same key, or other operations.
There is no added overhead when -J is not used, other than an added
inAnnex check. When -J is used, it has to maintain and check a small
Set, which should be negligible overhead.
It could output some message saying that the transfer is being done by
another thread. Or it could even display the same progress info for both
files that are being downloaded since they have the same content. But I
opted to keep it simple, since this is rather an edge case, so it just
doesn't say anything about the transfer of the file until the other
thread finishes.
Since the deferred transfer action still runs, actions that do more than
transfer content will still get a chance to do their other work. (An
example of something that needs to do such other work is P2P.Annex,
where the download always needs to receive the content from the peer.)
And, if the first thread fails to complete a transfer, the second thread
can resume it.
But, this unfortunately means that there's a risk of redundant work
being done to transfer a key that just got transferred.
That's not ideal, but should never cause breakage; the same
thing can occur when running two separate git-annex processes.
The get/move/copy/mirror --from commands had extra inAnnex checks added,
inside the download actions. Without those checks, the first thread
downloaded the content, and then the second thread woke up and
downloaded the same content redundantly.
move/copy/mirror --to is left doing redundant uploads for now. It
would need a second checkPresent of the remote inside the upload
to avoid them, which would be expensive. A better way to avoid
redundant work needs to be found..
This commit was supported by the NSF-funded DataLad project.
Before, there was a window where interrupting an add could result in the
file being moved into the annex, with no symlink yet created.
This commit was supported by the NSF-funded DataLad project.
Fix process and file descriptor leak that was exposed when git-annex was
built with ghc 8.2.1. Apparently ghc has changed its behavior of GC
of open file handles that are pipes to running processes. That
broke git-annex test on OSX due to running out of FDs.
Audited for all uses of Annex.new and made stopCoProcesses be called
once it's done with the state. Fixed several places that might have
leaked in other situations than running the test suite.
This commit was sponsored by Ewen McNeill.
Using annexeval to run probeCrippledFileSystem' caused Git.CurrentRepo.get
to be run. Fixed easily since probeCrippledFileSystem' had no need to use
the Annex monad.
This commit was sponsored by Ethan Aubin.
Also deletes any tagged pushes that the assistant might have done,
since those would also prevent resetting a branch back.
This commit was sponsored by andrea rota.
Now when one repository has exported a tree, another repository can get
files from the export, after syncing.
There's a bug: While the database update works, somehow the database on
disk does not get updated, and so the database update is run the next
time, etc. Wasn't able to figure out why yet.
This commit was sponsored by Ole-Morten Duesund on Patreon.
New table needed to look up what filenames are used in the currently
exported tree, for reasons explained in export.mdwn.
Also, added smart constructors for ExportLocation and ExportDirectory to
make sure they contain filepaths with the right direction slashes.
And some code refactoring.
This commit was sponsored by Francois Marier on Patreon.
It was not getting old lines removed, because the tree graft confused
the updater, so it union merged from the previous git-annex branch,
which still contained the old lines. Fixed by carefully using setIndexSha.
This commit was supported by the NSF-funded DataLad project.
Don't allow "exporttree=yes" to be set when the special remote
does not support exports. That would be confusing since the user would
set up a special remote for exports, but `git annex export` to it would
later fail.
This commit was supported by the NSF-funded DataLad project.
Straightforward enough, except for the needed belt-and-suspenders sanity
checks to avoid foot shooting due to exports not being key/value stores.
* Even when annex.verify=false, always verify from exports.
* Only get files from exports that use a backend that supports
checksum verification.
* Never trust exports, even if the user says to, because then
`git annex drop` would drop content if the export seemed to contain
a copy.
This commit was supported by the NSF-funded DataLad project.
Went with a separate db per export remote, rather than a single export
database. Mostly because there will probably not be a lot of separate
export remotes, and it might be convenient to be able to delete a given
remote's export database.
This commit was supported by the NSF-funded DataLad project.
* Only export to remotes that were initialized to support it.
* Prevent storing key/value on export remotes.
* Prevent enabling exporttree=yes and encryption in the same remote.
SetupStage Enable was changed to take the old RemoteConfig.
This allowed only setting exporttree when initially setting up a
remote, and not configuring it later after stuff might already be stored
in the remote.
Went with =yes rather than =true for consistency with other parts of
git-annex. Changed docs accordingly.
This commit was supported by the NSF-funded DataLad project.
So it will be available later and elsewhere, even after GC.
I first though to use git update-index to do this, but feeding it a line
with a tree object seems to always cause it to generate a git subtree
merge. So, fell back to using the Git.Tree interface to maniupulate the
trees, and not involving the git-annex branch index file at all.
This commit was sponsored by Andreas Karlsson.
Security fix: Disallow hostname starting with a dash, which would get
passed to ssh and be treated an option. This could be used by an attacker
who provides a crafted ssh url (for eg a git remote) to execute arbitrary
code via ssh -oProxyCommand.
No CVE has yet been assigned for this hole.
The same class of security hole recently affected git itself,
CVE-2017-1000117.
Method: Identified all places where ssh is run, by git grep '"ssh"'
Converted them all to use a SshHost, if they did not already, for
specifying the hostname.
SshHost was made a data type with a smart constructor, which rejects
hostnames starting with '-'.
Note that git-annex already contains extensive use of Utility.SafeCommand,
which fixes a similar class of problem where a filename starting with a
dash gets passed to a program which treats it as an option.
This commit was sponsored by Jochen Bartl on Patreon.
To work around the problem that the external special remote protocol does
not support keys containing spaces.
This commit was sponsored by Denis Dzyubenko on Patreon.
Added remote configuration settings annex-ignore-command and
annex-sync-command, which are dynamic equivilants of the annex-ignore
and annex-sync configurations.
For this I needed a new DynamicConfig infrastructure. Its implementation
should be as fast as before when there is no dynamic config, and it caches
so shell commands are only run once.
Note that annex-ignore-command exits nonzero when the remote should be ignored.
While that may seem backwards, it allows using the same command for it as
for annex-sync-command when you want to disable both.
This commit was sponsored by Trenton Cronholm on Patreon.
Can be used to override the default timestamps used in log files in the
git-annex branch. This is a dangerous environment variable; use with
caution.
Note that this only affects writing to the logs on the git-annex branch.
It is not used for metadata in git commits (other env vars can be set for
that).
There are many other places where timestamps are still used, that don't
get committed to git, but do touch disk. Including regular timestamps
of files, and timestamps embedded in some files in .git/annex/, including
the last fsck timestamp and timestamps in transfer log files.
A good way to find such things in git-annex is to get for getPOSIXTime and
getCurrentTime, although some of the results are of course false positives
that never hit disk (unless git-annex gets swapped out..)
So this commit does NOT necessarily make git-annex comply with some HIPPA
privacy regulations; it's up to the user to determine if they can use it in
a way compliant with such regulations.
Benchmarking: It takes 0.00114 milliseconds to call getEnv
"GIT_ANNEX_VECTOR_CLOCK" when that env var is not set. So, 100 thousand log
files can be written with an added overhead of only 0.114 seconds. That
should be by far swamped by the actual overhead of writing the log files
and making the commit containing them.
This commit was supported by the NSF-funded DataLad project.
* Added annex.resolvemerge configuration, which can be set to false to
disable the usual automatic merge conflict resolution done by git-annex
sync and the assistant.
* sync: Added --no-resolvemerge option.
Note that disabling merge conflict resolution is probably not a good idea
in a direct mode repo or adjusted branch. Since updates to both are done
outside the usual work tree, if it fails the tree is not left in a
conflicted state, and it would be hard to manually resolve the conflict.
Still, made annex.resolvemerge be supported in those cases for consistency.
This commit was sponsored by Riku Voipio.
orElse is great, but was not the right thing to use here because
waitTakeLock could retry for other reasons than the lock being held,
which made tryTakeLock fail when it shouldn't.
Instead, move the code to tryTakeLock and implement waitTakeLock using
tryTakeLock and retry.
(Also, in runTransfer, when checkSaneLock fails, dropLock to avoid leaking a
lock handle.)
This commit was supported by the NSF-funded DataLad project.
When built with concurrent-output 1.9, ssh password prompts will no longer
interfere with the -J display.
To avoid flicker, only done when ssh actually does need to prompt;
ssh is first run in batch mode and if that succeeds the connection is up
and no need to clear regions.
This commit was supported by the NSF-funded DataLad project.
Might want to remove this when it gets fixed, in case adjusted branches are
used in a repo with a great many refs, which would become unnecessarily
slow.
This commit was supported by the NSF-funded DataLad project.
Removed dependency on MissingH, instead depending on the split
library.
After laying groundwork for this since 2015, it
was mostly straightforward. Added Utility.Tuple and
Utility.Split. Eyeballed System.Path.WildMatch while implementing
the same thing.
Since MissingH's progress meter display was being used, I re-implemented
my own. Bonus: Now progress is displayed for transfers of files of
unknown size.
This commit was sponsored by Shane-o on Patreon.
Cryptonite is faster and allocates less, and I want to get rid of
MissingH use.
Note that the new dependency on memory is free; it's a dependency of
cryptonite.
This commit was supported by the NSF-funded DataLad project.
Moving toward dropping MissingH dep.
I think I've addressed the problem identified earlier in
09a66f702d. On Windows,
absPathFrom "/tmp/repo/xxx" "y/bar" would be "/tmp/repo/xxx\\y/bar",
which then confuses relPathDirToFile. Fixed by converting to unix (git)
style paths.
Also, relPathDirToFile was splitting only on \\ on windows and not /
which broke the example in 09a66f702d of
relPathDirToFile (absPathFrom "/tmp/repo/xxx" "y/bar") "/tmp/repo/.git/annex/objects/xxx"
Now, on windows, that will yield "..\\..\\..\\.git/annex/objects/xxx"
which once converted to unix style paths is what we want.
When ssh connection caching is enabled (and when GIT_ANNEX_USE_GIT_SSH is
not set), only one ssh password prompt will be made per host, and only one
ssh password prompt will be made at a time.
This also fixes a race in prepSocket's stale ssh connection stopping
when run with -J. It was possible for one thread to start a cached ssh
connection, and another thread to immediately stop it, resulting in excess
connections being made.
This commit was supported by the NSF-funded DataLad project.
This is necessary because as feared, the extra -n parameter that git-annex
passes breaks uses of these environment variables that expect exactly the
parameters that git passes.
For example, see https://github.com/datalad/datalad/issues/1456
It would of course be possible to pre-close stdin before running ssh so not
needing the -n, and I think that would not even break ssh's password
caching. But it would probably involve a lot of work, possibly would need
to deal with some layering violations, and would be error-prone. The really
clean fix would be to make all the ssh stuff return a CreateProcess, which
could have the handle closed when appropriate, but that would be a large
reworing of the code base.
This commit was supported by the NSF-funded DataLad project.
They are handled close the same as they are by git. However, unlike git,
git-annex sometimes needs to pass the -n parameter when using these.
So, this has the potential for breaking some setup, and perhaps there ought
to be a ANNEX_USE_GIT_SSH=1 needed to use these. But I'd rather avoid that
if possible, so let's see if anyone complains.
Almost all places where "ssh" was run have been changed to support the env
vars. Anything still calling sshOptions does not support them. In
particular, rsync special remotes don't. Seems that annex-rsync-transport
already gives sufficient control there.
(Fixed in passing: Remote.Helper.Ssh.toRepo used to extract
remoteAnnexSshOptions and pass them to sshOptions, which was redundant
since sshOptions also extracts those.)
This commit was sponsored by Jeff Goeke-Smith on Patreon.
It was distributing jobs to remotes that were not being used by any other
job. But, suppose that there are only 2 remotes, and -J10. In such a case,
the first 2 downloads would be distributed amoung the 2 remotes, but
the other 8 would all go to remote #1. Improved by keeping a counter
of how many jobs are assigned to a remote, and prefer remotes with fewer
jobs.
Note use of Data.Map.Strict to avoid blowing up space. I kept the
bang-patterns as-is, although probably not needed with Data.Map.Strict.
This commit was sponsored by Jack Hill on Patreon.
* init: When annex.securehashesonly has been set with git-annex config,
copy that value to the annex.securehashesonly git config.
* config --set: As well as setting value in git-annex branch,
set local gitconfig. This is needed especially for
annex.securehashesonly, which is read only from local gitconfig and not
the git-annex branch.
doc/todo/sha1_collision_embedding_in_git-annex_keys.mdwn has the
rationalle for doing it this way. There's no perfect solution; this
seems to be the least-bad one.
This commit was supported by the NSF-funded DataLad project.
This avoids sending all the data to a remote, only to have it reject it
because it has annex.securehashesonly set. It assumes that local and
remote will have the same annex.securehashesonly setting in most cases.
If a remote does not have that set, and local does, the remote won't get
some content it would otherwise accept.
Also avoids downloading data that will not be added to the local object
store due to annex.securehashesonly.
Note that, while encrypted special remotes use a GPGHMAC key variety,
which is not collisiton resistent, Transfers are not used for such
keys, so this check is avoided. Which is what we want, so encrypted
special remotes still work.
This commit was sponsored by Ewen McNeill.
Added --securehash option to match files using a secure hash function, and
corresponding securehash preferred content expression.
This commit was sponsored by Ethan Aubin.
Cryptographically secure hashes can be forced to be used in a repository,
by setting annex.securehashesonly. This does not prevent the git repository
from containing files with insecure hashes, but it does prevent the content
of such files from being pulled into .git/annex/objects from another
repository.
We want to make sure that at no point does git-annex accept content into
.git/annex/objects that is hashed with an insecure key. Here's how it
was done:
* .git/annex/objects/xx/yy/KEY/ is kept frozen, so nothing can be
written to it normally
* So every place that writes content must call, thawContent or modifyContent.
We can audit for these, and be sure we've considered all cases.
* The main functions are moveAnnex, and linkToAnnex; these were made to
check annex.securehashesonly, and are the main security boundary
for annex.securehashesonly.
* Most other calls to modifyContent deal with other files in the KEY
directory (inode cache etc). The other ones that mess with the content
are:
- Annex.Direct.toDirectGen, in which content already in the
annex directory is moved to the direct mode file, so not relevant.
- fix and lock, which don't add new content
- Command.ReKey.linkKey, which manually unlocks it to make a
copy.
* All other calls to thawContent appear safe.
Made moveAnnex return a Bool, so checked all callsites and made them
deal with a failure in appropriate ways.
linkToAnnex simply returns LinkAnnexFailed; all callsites already deal
with it failing in appropriate ways.
This commit was sponsored by Riku Voipio.
Where before the "name" of a key and a backend was a string, this makes
it a concrete data type.
This is groundwork for allowing some varieties of keys to be disabled
in file2key, so git-annex won't use them at all.
Benchmarks ran in my big repo:
old git-annex info:
real 0m3.338s
user 0m3.124s
sys 0m0.244s
new git-annex info:
real 0m3.216s
user 0m3.024s
sys 0m0.220s
new git-annex find:
real 0m7.138s
user 0m6.924s
sys 0m0.252s
old git-annex find:
real 0m7.433s
user 0m7.240s
sys 0m0.232s
Surprising result; I'd have expected it to be slower since it now parses
all the key varieties. But, the parser is very simple and perhaps
sharing KeyVarieties uses less memory or something like that.
This commit was supported by the NSF-funded DataLad project.
sync: When syncing with a local repository located on a crippled
filesystem, run the post-receive hook there, since it wouldn't get run
otherwise. This makes pushing to repos on FAT-formatted removable drives
update them when receive.denyCurrentBranch=updateInstead.
Made Remote.Git export onLocal, which was cleaned up to not have so many
caveats about its use.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
* Added post-recieve hook, which makes updateInstead work with direct
mode and adjusted branches.
* init: Set up the post-receive hook.
This commit was sponsored by Fernando Jimenez on Patreon.
... to avoid it consuming stdin that it shouldn't.
This fixes git-annex-checkpresentkey --batch remote, which didn't output
results for all keys passed into it.
Other git-annex commands that communicate with a remote over ssh may also
have been consuming stdin that they shouldn't have, which could have
impacted using them in eg, shell scripts. For example, a shell script
reading files from stdin and passing them to git annex drop would be
impacted by this bug, whenever git annex drop ran git-annex-shell
checkpresent, it would consume part/all of the stdin that the shell script
was supposed to consume.
Fixed by adding a ConsumeStdin parameter to Annex.Ssh.sshOptions, which
is used throughout git-annex to run ssh (in order for ssh connection
caching to work). Every call site was checked to see if it used
CreatePipe for stdin, and if not was marked NoConsumeStdin.
import: --deduplicate and --skip-duplicates were implemented inneficiently;
they unncessarily hashed each file twice. They have been improved to only
hash once.
The new approach is to lock down (minimally) and hash files, and then
reuse that information when importing them.
This was rather tricky, especially in detecting changes to files while
they are being imported.
The output of import changed slightly. While before it silently skipped
over files with eg --skip-duplicates, now it shows each file as it starts
to act on it. Since every file is hashed first thing, it would otherwise
not be clear what file import is chewing on. (Actually, it wasn't clear
before when any of the duplicates switches were used.)
This commit was sponsored by Alexander Thompson on Patreon.
Most remotes have an idempotent setup that can be reused for
enableremote, but in a few cases, it needs to tell which, and whether
a UUID was provided to setup was used.
This is groundwork for making initremote be able to provide a UUID.
It should not change any behavior.
Note that it would be nice to make the UUID always be provided to setup,
and make setup not need to generate and return a UUID. What prevented
this simplification is Remote.Git.gitSetup, which needs to reuse the
UUID of the git remote when setting it up, and so has to return that
UUID.
This commit was sponsored by Thom May on Patreon.
Turns out that Data.List.Utils.split is slow and makes a lot of
allocations. Here's a much simpler single character splitter that behaves
the same (even in wacky corner cases) while running in half the time and
75% the allocations.
As well as being an optimisation, this helps move toward eliminating use of
missingh.
(Data.List.Split.splitOn is nearly as slow as Data.List.Utils.split and
allocates even more.)
I have not benchmarked the effect on git-annex, but would not be surprised
to see some parsing of eg, large streams from git commands run twice as
fast, and possibly in less memory.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Any config names can be set using this; git-annex commands will only look
at specific ones that make sense and are worth the overhead of querying the
branch.
This might also be useful for storing whatever other config-type stuff the
user might want to shove into the git-annex branch.
This commit was sponsored by Jochen Bartl on Patreon.
Revert ServerAliveInterval change in 6.20161111, which caused problems
with too many old versions of ssh and unusual ssh configurations.
It should have not been needed anyway since ssh is supposted to
have TCPKeepAlive enabled by default.
Added to change notification to P2P protocol.
Switched to a TBChan so that a single long-running thread can be
started, and serve perhaps intermittent requests for change
notifications, without buffering all changes in memory.
The P2P runner currently starts up a new thread each times it waits
for a change, but that should allow later reusing a thread. Although
each connection from a peer will still need a new watcher thread to run.
The dependency on stm-chans is more or less free; some stuff in yesod
uses it, so it was already indirectly pulled in when building with the
webapp.
This commit was sponsored by Francois Marier on Patreon.
ReadContent can't update the log, since it reads lazily. This part of
the P2P monad will need to be rethought.
Associated files are heavily sanitized when received from a peer;
they could be an exploit vector.
This commit was sponsored by Jochen Bartl on Patreon.
ghc 8 added backtraces on uncaught errors. This is great, but git-annex was
using error in many places for a error message targeted at the user, in
some known problem case. A backtrace only confuses such a message, so omit it.
Notably, commands like git annex drop that failed due to eg, numcopies,
used to use error, so had a backtrace.
This commit was sponsored by Ethan Aubin.
When used with an older version of ssh, any ServerAliveInterval in
~/.ssh/config will be overridden by .git/annex/ssh.config.
This commit was sponsored by Josh Taylor on Patreon.
So that stalled transfers will be noticed within about 3 minutes,
even if TCPKeepAlive is disabled or doesn't work.
Rather than setting with -o, use -F with another config file,
so that any settings in ~/.ssh/config or /etc/ssh/ssh_config overrides this.
Closes https://github.com/datalad/datalad/issues/1020
The use of runWriter in scanUnlockedFiles broke due to this change;
it failed with blocked indefinitely in mvar, because the database write
handle was taken while linkFromAnnex needed to also write to it (to update
the inode cache). So, switched to using a separate runWriter for each call
to addAssociatedFileFast. A little less efficient, but not greatly; the
writes should all still be cached.
In the case where the pointer file is in place, and not the content
of the object, lock's performNew was called with filemodified=True,
which caused it to try to repopulate the object from an unmodified
associated file, of which there were none. So, the content of the object
got thrown away incorrectly. This was the cause (although not the root
cause) of data loss in https://github.com/datalad/datalad/issues/1020
The same problem could also occur when the work tree file is modified,
but the object is not, and lock is called with --force. Added a test case
for this, since it's excercising the same code path and is easier to set up
than the problem above.
Note that this only occurred when the keys database did not have an inode
cache recorded for the annex object. Normally, the annex object would be in
there, but there are of course circumstances where the inode cache is out
of sync with reality, since it's only a cache.
Fixed by checking if the object is unmodified; if so we don't need to
try to repopulate it. This does add an additional checksum to the unlock
path, but it's already checksumming the worktree file in another case,
so it doesn't slow it down overall.
Further investigation found a similar problem occurred when smudge --clean
is called on a file and the inode cache is not populated. cleanOldKeys
deleted the unmodified old object file in this case. This was also
fixed by checking if the object is unmodified.
In general, use of getInodeCaches and sameInodeCache is potentially
dangerous if the inode cache has not gotten populated for some reason.
Better to use isUnmodified. I breifly auited other places that check the
inode cache, and did not see any immediate problems, but it would be easy
to miss this kind of problem.
An easy change now that supportedVersions is a list. Since v3 and v5 are
identical other than version number, just add v3 to the list.
This commit was sponsored by andrea rota.
Fixes a bug introduced with v6 mode that I didn't notice until now.
Probably not many v3 repos left out there, and upgrading them to v6 mode
is not disastrous, only a little premature.
This commit was sponsored by Riku Voipio
.. and have to be checked to see if they are a pointed to an annexed file.
Cases where such memory use could occur included, but were not limited to:
- git commit -a of a large unlocked file (in v5 mode)
- git-annex adjust when a large file was checked into git directly
Generally, any use of catKey was a potential problem.
Fix by using git cat-file --batch-check to check size before catting.
This adds another git batch process, which is included in the CatFileHandle
for simplicity.
There could be performance impact, anywhere catKey is used. Particularly
likely to affect adjusted branch generation speed, and operations on
unlocked files in v6 mode. Hopefully since the --batch-check and
--batch read the same data, disk buffering will avoid most overhead.
Leaving only the overhead of talking to the process over the pipe and
whatever computation --batch-check needs to do.
This commit was sponsored by Bruno BEAUFILS on Patreon.
Speeds up commands like "git-annex find --in remote" by over 50%.
Profiling showed that adjustGitEnv was 21% of the time and 37% of the
allocations of that command. It copied the environment each time with
getEnvironment.
The only repeated use of adjustGitEnv is in withIndexFile, which tends to
be run at least once per file. So, it was optimised by keeping a cache of
the environment, which can be reused.
There could be other better ways to optimise this. Maybe get the while
environment once at startup. But, then it would have to be serialized back
out each time running a child process, so I doubt that would be a net win.
It might be better to cache a version of the environment that is
pre-modified to use .git-annex/index. But, profiling doesn't show that
modifying the enviroment is taking any significant time.
key2file and file2key were top cost centers according to profiling.
The repeated use of replace was not efficient. This new approach is quite a
lot more efficient.
This commit was sponsored by Denis Dzyubenko on Patreon.
* sync: Previously, when run in a branch with a slash in its name,
such as "foo/bar", the sync branch was "synced/bar". That conflicted
with the sync branch used for branch "bar", so has been changed to
"synced/foo/bar".
* adjust: Previously, when adjusting a branch with a slash in its name,
such as "foo/bar", the adjusted branch was "adjusted/bar(unlocked)".
That conflicted with the adjusted branch used for branch "bar",
so has been changed to "adjusted/foo/bar(unlocked)"
* Also, running sync in an adjusted branch did not correctly sync
changes back to the parent branch when it had a slash in its name.
This bug has been fixed.
Eliminate use of Git.Ref.under and Git.Ref.basename; using
Git.Ref.underBase and Git.Ref.base make everything handle deep branches
correctly.
Probably noone was adjusting deep branches, and v6 is still experimental
anyway, so I'm not going to worry about the mess that was left by that bug.
In the case of git-annex sync, using a fixed git-annex with an old unfixed
one will mean they use different sync branches for a deep branch, and so
they may stop syncing until the old one is upgraded. However, that's only
a problem when syncing between repositories without going via a central
bare repository. Added a warning about this to the CHANGELOG, but it's
probably not going to affect many people at all.
This commit was sponsored by Riku Voipio.
This makes -Jn work with --json and --quiet, where before
setting -Jn disabled those options.
Concurrent json output is currently a mess though since threads output
chunks over top of one-another.
Only done in -J mode because only if there's concurrency can downloading
from two remotes be faster. Without concurrency, it's likely the case that
sequential downloads from the same remote are faster than switching back
and forth between two remotes.
There is some hairy MVar code here, but basically it just keeps
the activeremotes MVar full except when deciding which remote to assign
to a thread.
Also affects gets by sync --content -J
This commit was sponsored by Jochen Bartl.
This was disabled in commit 61ccf95004,
because only the assistant used them, and they were clutter. But, now
--failed also uses them.
Remove the failure log files after successful transfers. Should avoid
most of the clutter problems.
Commit 61ccf95004 mentions a subtle behavior
change, which has now been reverted:
There is one behavior change from this. If glacier is being used, and a
manual git annex get --from glacier fails because the file isn't available
yet, the assistant will no longer later see that failed transfer file and
retry the get.
Note that get --from foo --failed will get things that a previous get --from bar
tried and failed to get, etc. I considered making --failed only retry
transfers from the same remote, but it was easier, and seems more useful,
to not have the same remote requirement.
Noisy due to some refactoring into Types/
Use nextRandom to generate the random UUID, rather than using randomIO.
This gets fixes for the following two bugs in the uuid library.
However, this did not impact git-annex much, so a hard depedency has
not been added on uuid-1.3.12.
https://github.com/aslatter/uuid/issues/15
"v4 UUIDs are not that random"
This doesn't greatly affect git-annex, because even with only
2^64 possible UUIDs, the chance that two git-annex repositories
that are clones of the same git repo get the same UUID is miniscule.
And, git-annex generates only one UUID per run, so preducting
subsequent UUIDs is not a problem.
https://github.com/aslatter/uuid/issues/16
"Remove Random instance for UUID, or mark it as deprecated"
git-annex was using that instance; let's stop before it gets
deprecated or removed.
Show branch:file that is being operated on.
I had to make ActionItem a type and not a type class because
withKeyOptions' passed two different types of values when using the type
class, and I could not get the type checker to accept that.
This bug caused broken tree objects to get built by a later git annex sync.
This is a somewhat unlikely but not impossible situation, and the test
suite's union_merge_regression test tickled it when it was run on FAT.
Mostly the username is only used for the git committer or other display
purposes, and we can just fall back to a dummy value in these cases.
The only remaining place where an error is thrown is when starting local
pairing, which needs the username to be known.
The queue could potentially contain changes from before withAltRepo, and
get flushed inside the call, which would apply the changes to the modified
repo.
Or, changes could be queued in withAltRepo that were intended to affect
the modified repo, but don't get flushed until later.
I don't know of any cases where either happens, but better safe than sorry.
Note that this affect withIndexFile, which is used in git-annex branch
updates. So, it potentially makes things slower. Should not be by much;
the overhead consists only of querying the current queue a couple of times,
and potentially flushing changes queued within withAltRepo earlier, that
could have maybe been bundled with other later changes.
Notice in particular that the existing queue is not flushed when calling
withAltRepo. So eg when git annex add needs to stage files in the index,
it will still bundle them together efficiently.
Added guard in Annex.Transfer to prevent this problem at a deeper level.
I'm unhappy ith NoUUID, but having Maybe UUID instead wouldn't help either
if nothing checked that there was a UUID. Since there legitimately need to
be Remotes that do not have a UUID, I can't see a way to fix it at the type
level, short making there be two separate types of Remotes.
Removed the instance LensGpgEncParams RemoteConfig because it encouraged
code that does not take the RemoteGitConfig into account.
RemoteType's setup was changed to take a RemoteGitConfig,
although the only place that is able to provide a non-empty one is
enableremote, when it's changing an existing remote. This led to several
folow-on changes, and got RemoteGitConfig plumbed through.
This is actually worse than I thought; when git is being run with a
detached work tree, GIT_INDEX_FILE is treated as a path relative to CWD,
instead of the normal behavior of relative the top of the work tree.
This seems to make it basically impossible for any program that wants to
use GIT_INDEX_FILE to use anything other than an absolute path to it; there
are too many configurations to keep straight that can change how git
interprets what should be a simple relative path to a file.
(I have complained to the git developers.)
This affected git annex view. It turns out that some other places
that use GIT_INDEX_FILE were already working around the bug. I removed the
workaround from Annex.Branch since the new workaround will do.
Could not think of a foolproof way to detect if the old adjusted branch was
just behind the current branch. It's possible that the user amended the
adjusting commit at the head of the adjusted branch, for example.
I decided to bail in this situation, instead of just entering the old
branch, so that if git annex adjust succeeds the user is always in a
*current* adjusted branch, not some old and out of date one.
What could perhaps be done is enter the old branch and then update it. But
that seems too magical; the user may have rebased master or something or
may not want to propigate the changes from the old branch. Best to error
out.
It started exporting a isSymbolicLink which supports windows. But,
git-annex does no use symlinks on windows yet and this conflicts with the
function by the same name from unix-compat, so hide it.
git 2.8.1 (or perhaps 2.9.0) is going to prevent git merge from merging in
unrelated branches. Since the webapp's pairing etc features often combine
together repositories with unrelated histories, work around this behavior
change by setting GIT_MERGE_ALLOW_UNRELATED_HISTORIES when the assistant
merges.
Note though that this is not done for git annex sync's merges, so
it will follow git's default or configured behavior.
When git-annex is used with a git version older than 2.2.0, disable support for
adjusted branches, since GIT_COMMON_DIR is needed to update them and was first
added in that version of git.
Made all Annex.Perms file mode changing functions ignore errors when
core.sharedRepository is set, because the file might be owned by someone
else. I don't fancy getting bug reports about crashes due to set modes in
this configuration, which is a very foot-shooty configuration in the first
place.
The fsck warning is necessary because old repos kept files mode 444, which
doesn't allow locking them, and so if the mode remains 444 due to the file
being owned by someone else, the user should be told about it.
When annex.thin is set, adding an object will add the execute bits to the
work tree file, and this does mean that the annex object file ends up
executable.
This doesn't add any complexity that wasn't already present, because git
annex add of an executable file has always ingested it so that the annex
object ends up executable.
But, since an annex object file can be executable or not, when populating
an unlocked file from one, the executable bit is always added or removed
to match the mode of the pointer file.
This is how direct mode does it too, and somehow, for reasons that
currently escape me, this makes git merge not care if it's run with an
empty work tree.
Was using L.readFile, so the Handle would remain open until the garbage
collector got around to it. Changed to explicit open and close, so we know
it's always closed when the function returns.
This makes the direct mode to v6 upgrade able to be performed in one clone
of a repository without affecting other clones, which can continue using v5
and direct mode.
This does mean that it has to write out temp files containing updated
objects for the merge. So may use more disk space, and disk IO, but that
should generally win out over needing to launch N separate
git hash-object processes.
An unlocked present file does not have a pointer file in the worktree, so
info skipped counting it.
It may be that unused was also affected by the problem, but it seemed not
to be in my tests. I think because of the use of the associatedFilesFilter.
This fix slows down both info and unused a little bit, since they have to
query the contents of files from git, but only when handling unlocked files.
Only reverse adjust the changes in the commit, which means that adjustments
do not need to be generally cleanly reversable.
For example, an adjustment can unlock all locked files, but does not need
to worry about files that were originally unlocked when reversing, because
it will only ever be run on files that have been changed. So, it's ok
if it locks all files when reversed, or even leaves all files as-is when
reversed.
Using adjusted/unlocked/master made lots of git stuff dealing with "master"
complain that it was ambiguous. This new appoach is more like view branch
names, and shows the adjustment right there in the branch display even if
only the basename of the branch is shown.
There's a race here, but entering an adjusted branch for the first time is
not something to do when a commit is being made at the same time. Although,
may want to prevent the assistant from committing while entering the
adjusted branch.
So, it will pull and push the original branch, not the adjusted one.
And, for merging, it will use updateAdjustedBranch (not implemented yet).
Note that remaining uses of Git.Branch.current need to be checked too;
for things that should act on the original branch, and not the adjusted
branch.
"git annex adjust" may be a temporary interface, but works for a proof of
concept.
It is pretty fast at creating the adjusted branch. The main overhead is
injecting pointer files. It might be worth optimising that by reusing the
symlink target as the pointer file content. When I tried to do that,
the problem was that the clean filter doesn't use that same format, and so
git thought files had changed. Could be dealt with, perhaps make the clean
filter use symlink format for pointer files when on an adjusted branch?
But the real overhead is in checking out the branch, when git runs the
smudge filter once per file. That is perhaps too slow to be usable,
although it may only affect initial checkout of the branch, and not
updates. TBD.
* add, addurl, import, importfeed: When in a v6 repository on a crippled
filesystem, add files unlocked.
* annex.addunlocked: New configuration setting, makes files always be
added unlocked. (v6 only)
The problem with having the slashes unescaped is, it broke parsing, since
the parser takes the filename to get the part containing the key.
That particularly affected URL keys.
This makes the format be the same as symlinks point to, which keeps things
simple.
Existing pointer files will continue to work ok.
Before, the call to mkProgressUpdater created the directory as a
side-effect, but since that ignored failure to create it, this led to
a "does not exist" exception when the transfer lock file was created,
rather than a permissions error.
So, make sure the directory exists before trying to lock the file in it.
When a PermissionDenied exception is caught, skip making the transfer lock.
This lets downloads from readonly remotes happen.
If an upload is being tried, and the lock file can't be written due to
permissions, then probably the actual transfer will fail for the same
reason, so I think it's ok that it continues w/o taking the lock in that
case.
The type checker should have noticed this, but the changes to mapM
that make it accept any Traversable hid the fact that it was not being
passed a list at all. Thus, what should have returned an empty list most
of the time instead returned [""] which was treated as the name of the
associated file, with disasterout consequences.
When I have time, I should add a test case checking what sync --content
drops. I should also consider replacing mapM with one re-specialized to
lists.
Previously, it only flushed when the queue got larger than 1.
Also, make the queue auto-flush when items are added, rather than needing
to be flushed as a separate step. This simplifies the code and make it more
efficient too, as it avoids needing to read the queue out of the state to
check if it should be flushed.
The homomorphs are back, just encoded such that it doesn't crash in LANG=C
However, I noticed a bug in the old escaping; [pseudoSlash] was escaped the
same as ['/','/']. Fixed by using '%' to escape pseudoSlash. Which requires
doubling '%' to escape it, but that's already done in the escaping of
worktree filenames in a view, so is probably ok.
Linking the file to the tmp dir was not necessary in the clean
filter, and it caused the ctime to change, which caused git to think
the file was changed. This caused git status to get slow as it kept
re-cleaning unchanged files.
Fixes several bugs with updates of pointer files. When eg, running
git annex drop --from localremote
it was updating the pointer file in the local repository, not the remote.
Also, fixes drop ../foo when run in a subdir, and probably lots of other
problems. Test suite drops from ~30 to 11 failures now.
TopFilePath is used to force thinking about what the filepath is relative
to.
The data stored in the sqlite db is still just a plain string, and
TopFilePath is a newtype, so there's no overhead involved in using it in
DataBase.Keys.
WorkTree.lookupFile was finding a key for a file that's deleted from the
work tree, which is different than the v5 behavior (though perhaps the same
as the direct mode behavior). Fix by checking that the work tree file exists
before catting its key.
Hopefully this won't slow down much, probably the catKey is much more expensive.
I can't see any way to optimise this, except perhaps to make Command.Unused
check if work tree files exist before/after calling lookupFile. But,
it seems better to make lookupFile really only find keys for worktree files;
that's what it's intended to do.
Else, queued file stages won't have reached the index, and it won't find
everthing.
This evidently fixes a reversion in my work today, although I don't see how
I broke it. It didn't use to flush the queue first, before, and worked
somehow.
Test suite for v5 is back to 100% green now.
The smudge filter does need to be run, because if the key is in the local
annex already (due to renaming, or a copy of a file added, or a new file
added and its content has already arrived), git merge smudges the file and
this should provide its content.
This does probably mean that in merge conflict resolution, git smudges the
existing file, re-copying all its content to it, and then the file is
deleted. So, not efficient.
This is a behavior change for merge conflicts between locked files
that both pointed to the same key, in different ways.
Before, the conflict was resolved, but the file was renamed to .variant.
This was unnecessary, because there was only one variant.
Of course, this also handles conflicts between unlocked and locked, or even
two unlocked files with different pointer contents.
Since the file was present and locked, its annex object was not in the
inode cache. So, despite not needing to update the annex object when the
clean filter is run on the content by git merge, it does need to record the
inode cache of the annex object. Otherwise, the annex object will be
assumed to be bad, since its inode is not cached.
Several tricky parts:
* When the conflict is just between the same key being locked and unlocked,
the unlocked version wins, and the file is not renamed in this case.
* Need to update associated file map when conflict resolution renames
an unlocked file.
* git merge runs the smudge filter on the conflicting file, and actually
overwrites the file with the same content it had before, and so
invalidates its inode cache. This makes it difficult to know when it's
safe to remove such files as conflict cruft, without going so far as to
compare their entire contents.
Dealt with this by preventing the smudge filter from populating the file
when a merge is run. However, that also prevents the smudge filter being
run for non-conflicting files, so eg moving a file won't put its new
content into place.
* Ideally, if a merge or a merge conflict resolution renames an unlocked
file, the file in the work tree can just be moved, rather than copying
the content to a new worktree file.
This is attempted to be done in merge conflict resolution, but
due to git merge's behavior of running smudge filters, what actually
seems to happen is the old worktree file with the content is deleted and
rewritten as a pointer file, so doesn't get reused.
So, this is probably not as efficient as it optimally could be.
If that becomes a problem, could look into running the merge in a separate
worktree and updating the real worktree more efficiently, similarly to the
direct mode merge. However, the direct mode merge had a lot of bugs, and
I'd rather not use that more error-prone method unless really needed.
Decided it's too scary to make v6 unlocked files have 1 copy by default,
but that should be available to those who need it. This is consistent with
git-annex not dropping unused content without --force, etc.
* Added annex.thin setting, which makes unlocked files in v6 repositories
be hard linked to their content, instead of a copy. This saves disk
space but means any modification of an unlocked file will lose the local
(and possibly only) copy of the old version.
* Enable annex.thin by default on upgrade from direct mode to v6, since
direct mode made the same tradeoff.
* fix: Adjusts unlocked files as configured by annex.thin.
This optimisation was not necessary, and didn't work for v6 unlocked files.
Typically only a small number of files will be changed by a commit, so just
catKey them all.
This can happen when ingesting a new file in either locked or unlocked
mode, when some unlocked files in the repo use the same key, and the
content was not locally available before.
This fixes a race where the modified file ended up in annex/objects, and
the InodeCache stored in the database was for the modified version, so
git-annex didn't know it had gotten modified.
The race could occur when the smudge filter was running; now it gets the
InodeCache before generating the Key, which avoids the race.
In v5, lookupFile is supposed to only look at symlinks on disk (except when
in direct mode).
Note that v6 also has a bug when a locked file's symlink is deleted and is
replaced with a new file. It sees that a link is staged and gets that
key.
The problem is that shutdown is not always called, particularly in the test
suite. So, a database connection would be opened, possibly some changes
queued, and then not shut down.
One way this can happen is when using Annex.eval or Annex.run with a new
state. A better fix might be to make both of them call Keys.shutdown
(and be sure to do it even if the annex action threw an error).
Complication: Sometimes they're run reusing an existing state, so shutting
down a database connection could cause problems for other users of that
same state. I think this would need a MVar holding the database handle,
so it could be emptied once shut down, and another user of the database
connection could then start up a new one if it got shut down. But, what if
2 threads were concurrently using the same database handle and one shut it
down while the other was writing to it? Urgh.
Might have to go that route eventually to get the database access to run
fast enough. For now, a quick fix to get the test suite happier, at the
expense of speed.
This covers the case where multiple files have the same content and are
added with git add. Previously only the one that was linked to the annex
got its inode cached; now both are.
When a v6 unlocked files is removed from the work tree,
unused doesn't show it. When it gets removed from the index,
unused does show it. This is the same as a locked file.
If multiple files point to the same annex object, the user may want to
modify them independently, so don't use a hard link.
Also, check diskreserve when copying.
Before the smudge filter added a trailing newline, but other things that
wrote formatPointer to a file did not.
also some new pointer staging code to use later
This avoids querying the database when the content file doen't exist
(or otherwise fails the provided check). However, it does add overhead of
querying the database, and will certianly impact performance.
The Keys database can hold multiple inode caches for a given key. One for
the annex object, and one for each pointer file, which may not be hard
linked to it.
Inode caches for a key are recorded when its content is added to the annex,
but only if it has known pointer files. This is to avoid the overhead of
maintaining the database when not needed.
When the smudge filter outputs a file's content, the inode cache is not
updated, because git's smudge interface doesn't let us write the file. So,
dropping will fall back to doing an expensive verification then. Ideally,
git's interface would be improved, and then the inode cache could be
updated then too.
Renamed the db to keys, since it is various info about a Keys.
Dropping a key will update its pointer files, as long as their content can
be verified to be unmodified. This falls back to checksum verification, but
I want it to use an InodeCache of the key, for speed. But, I have not made
anything populate that cache yet.
This removes ambiguity, because while someone might have "WORM--foo" in a
file that's not intended to be a git-annex pointer file,
"annex/objects/WORM--foo" is less likely.
Also, 664cc987e8 had a caveat about symlink
targets being parsed as pointer files, and now the same parser is used for
both.
I did not include any hash directories before the key in the pointer file,
as they're not needed. However, if they were included, the parser would
still work ok.
Backend.lookupFile is changed to always fall back to catKey when
operating on a file that's not a symlink.
catKey is changed to understand pointer files, as well as annex symlinks.
Before, catKey needed a file mode witness, to be sure it was looking at a
symlink. That was complicated stuff. Now, it doesn't actually care if a
file in git is a symlink or not; in either case asking git for the content
of the file will get the pointer to the key.
This does mean that git-annex will treat a link
foo -> WORM--bar as a git-annex file, and also treats
a regular file containing annex/objects/WORM--bar as a git-annex file.
Calling catKey could make git-annex commands need to do more work than
before. This would especially be the case if a repo contained many regular
files, and only a few annexed files, as now git-annex will need to ask
git about the contents of the regular files.
Was not putting it inside the temp dir, but next to it!
This was just wrong, and it led to a longer filename that desired being
used, leading to some bug reports.
Since all places where a repo is used in direct mode need to have git-annex
upgraded before the repo can safely be converted to v6, the upgrade needs
to be manual for now.
I suppose that at some point I'll want to drop all the direct mode support
code. At that point, will stop supporting v5, and will need to auto-upgrade
any remaining v5 repos. If possible, I'd like to carry the direct mode
support for say, a year or so, to give people plenty of time to upgrade and
avoid disruption.
When core.sharedRepository is set, annex object files are not made mode
444, since that prevents a user other than the file owner from locking
them. Instead, a mode such as 664 is used in this case.
replaceFile created a temp file, which was guaranteed to not overlap with
another temp file. However, makeAnnexLink then deleted that file, in
preparation for making the symlink in its place. This caused a race, since
some other replaceFile could create a temp file, using the same name!
I was able to reproduce the race easily running git-annex add -J10 in a
directory with 100 files (all with different contents). Some files would
get ingested into the annex, but their annex links would fail to be added.
There could be other situations where this same problem could occur.
Perhaps when the assistant is adding a file, if the user manually also ran
git-annex add. Perhaps in cases not involving adding a file.
The new replaceFile makes a temprary directory, which is guaranteed to be
unique, and doesn't make a temp file in there. makeAnnexLink can thus
create the symlink without problem and the race is avoided.
Audited all calls to replaceFile to make sure that the old behavior of
providing an empty temp file was not relied on.
The general problem of asking for a temp file and deleting it as part of
the process of using it could reach beyond replaceFile. Did some quick
audits and didn't find other cases of it. Probably only symlink creation
stuff would tend to make that mistake, mostly.
Instead, only display transport error if the configlist output doesn't
include an annex.uuid line, even an empty one.
A recent change made git-annex init try to get all the remote uuids, and so
the transport error would be displayed by it. It was also displayed when
eg, copying files to a remote that had no uuid yet.
sync, merge, assistant: When git merge failed for a reason other than a
conflicted merge, such as a crippled filesystem not allowing particular
characters in filenames, git-annex would make a merge commit that could
omit such files or otherwise be bad. Fixed by aborting the whole merge
process when git merge fails for any reason other than a merge conflict.
Implemented with no additional overhead of compares etc.
This is safe to do for presence logs because of their locality of change;
a given repo's presence logs are only ever changed in that repo, or in a
repo that has just been actively changing the content of that repo.
So, we don't need to worry about a split-brain situation where there'd
be disagreement about the location of a key in a repo. And so, it's ok to
not update the timestamp when that's the only change that would be made
due to logging presence info.
sideAction is for things not generally related to the current action being
performed. And, it adds a newline after the side action. This was not the
right thing to use for stuff like "checksum", where doing a checksum is
part of the git annex get process, and indeed we want it to display
"(checksum...) ok"
By definition, a trusted repository is trusted to always have its location
tracking log accurate. Thus, it should never be in a position where content
is being dropped from it concurrently, as that would result in the location
tracking log not being accurate.
This avoids a failure where eg, we start with RecentlyVerifiedCopies
for all remotes, and so didn't do any active verification, which is
required.
Also, dedup the list of VerifiedCopies when checking if we have enough,
in case 2 copies of a UUID slip in.
See doc/bugs/concurrent_drop--from_presence_checking_failures.mdwn for
discussion about why 1 locked copy is all we can require, and how this
fixes concurrent dropping bugs.
Note that, since nothing yet generates a VerifiedCopyLock yet, this commit
breaks dropping temporarily.
There should be no behavior changes in this commit, it just adds a more
expressive data type and adjusts code that had been passing around a [UUID]
or sometimes a Maybe Remote to instead use [VerifiedCopy].
Although, since some functions were taking two different [UUID] lists,
there's some potential for me to have gotten it horribly wrong.
Also, rename lockContent to lockContentExclusive
inAnnexSafe should perhaps be eliminated, and instead use
`lockContentShared inAnnex`. However, I'm waiting on that, as there are
only 2 call sites for inAnnexSafe and it's fiddly.
In c6632ee5c8, it actually only handled
uploading objects to a shared repository. To avoid verification when
downloading objects from a shared repository, was a lot harder.
On the plus side, if the process of downloading a file from a remote
is able to verify its content on the side, the remote can indicate this
now, and avoid the extra post-download verification.
As of yet, I don't have any remotes (except Git) using this ability.
Some more work would be needed to support it in special remotes.
It would make sense for tahoe to implicitly verify things downloaded from it;
as long as you trust your tahoe server (which typically runs locally),
there's cryptographic integrity. OTOH, despite bup being based on shas,
a bup repo under an attacker's control could have the git ref used for an
object changed, and so a bup repo shouldn't implicitly verify. Indeed,
tahoe seems unique in being trustworthy enough to implicitly verify.
* When annex objects are received into git repositories, their checksums are
verified then too.
* To get the old, faster, behavior of not verifying checksums, set
annex.verify=false, or remote.<name>.annex-verify=false.
* setkey, rekey: These commands also now verify that the provided file
matches the key, unless annex.verify=false.
* reinject: Already verified content; this can now be disabled by
setting annex.verify=false.
recvkey and reinject already did verification, so removed now duplicate
code from them. fsck still does its own verification, which is ok since it
does not use getViaTmp, so verification doesn't happen twice when using fsck
--from.
I couldn't find a good way to make an *empty* index file (zero byte file
won't do), so I punted and just don't make index.lock when there's no index
yet. This means some other git process could race and write an index file
at the same time as the merge is ongoing, in theory. Only happens in new
repos though.
Oh boy, not again. So, another place that the filesystem encoding needs to
be applied. Yay.
In passing, I changed decodeBS so if a NUL is embedded in the input, the
resulting FilePath doesn't get truncated at that NUL. This was needed to
make prop_b64_roundtrips pass, and on reviewing the callers of decodeBS, I
didn't see any where this wouldn't make sense. When a FilePath is used to
operate on the filesystem, it'll get truncated at a NUL anyway, whereas if
a String is being used for something else, it might conceivably have a NUL
in it, and we wouldn't want it to get truncated when going through
decodeBS.
(NB: There may be a speed impact from this change.)
I think that the problem was caused by windows not having a concept of an
env var that is set, but to the empty string. So, GIT_ANNEX_SSHOPTION
got set to "" and was not seen as set at all.
Easy fix, which also makes git-annex sync a little faster is to not set
GIT_SSH, when GIT_ANNEX_SSHOPTION has no options. Might as well let git use
ssh per usual in this case, no need to run git-annex as the proxy ssh
command..
* proxy: Fix proxy git commit of non-annexed files in direct mode.
* proxy: If a non-proxied git command, such as git revert
would normally fail because of unstaged files in the work tree,
make the proxied command fail the same way.
* Perform a clean shutdown when --time-limit is reached.
This includes running queued git commands, and cleanup actions normally
run when a command is finished.
* fsck: Commit incremental fsck database when --time-limit is reached.
Previously, some of the last files fscked did not make it into the
database when using --time-limit.
Note that this changes Annex.addCleanup hooks, to run after --time-limit
expires. Fsck was using such a hook to clean up after a
--incremental-schedule, and that shouldn't run when --time-limit exipires
it. So, instead, moved that cleanup code to be run by cleanupIncremental.
Resulted in some data type juggling.
This is needed because when preferred content matches on files,
the second pass would otherwise want to drop all keys. Using a bloom filter
avoids this, and in the case of a false positive, a key will be left
undropped that preferred content would allow dropping. Chances of that
happening are a mere 1 in 1 million.
This makes git annex unused use around 48 mb more memory than it did before,
but the massive increase in accuracy makes this worthwhile for all but the
smallest systems.
Also, I want to use the bloom filter for sync --all --content, to avoid
dropping files that the preferred content doesn't want, and 1/1000
false positives would be far too many in that use case, even if it were
acceptable for unused.
Actual memory use numbers:
1000: 21.06user 3.42system 0:26.40elapsed 92%CPU (0avgtext+0avgdata 501552maxresident)k
1000000: 21.41user 3.55system 0:26.84elapsed 93%CPU (0avgtext+0avgdata 549496maxresident)k
10000000: 21.84user 3.52system 0:27.89elapsed 90%CPU (0avgtext+0avgdata 549920maxresident)k
Based on these numbers, 10 million seemed a better pick than 1 million.
This removes a bit of complexity, and should make things faster
(avoids tokenizing Params string), and probably involve less garbage
collection.
In a few places, it was useful to use Params to avoid needing a list,
but that is easily avoided.
Problems noticed while doing this conversion:
* Some uses of Params "oneword" which was entirely unnecessary
overhead.
* A few places that built up a list of parameters with ++
and then used Params to split it!
Test suite passes.
The content file may not be owned by the user running git-annex, in which
case, setting the owner write bit was not enough to let lockContent
act on the file. However, with some core.sharedRepository configs, the file
should be writable by the user's group. So, the thing to do is to call
thawContent on it.
It was returning Just False in this situation, which differed from indirect
mode behavior. I don't think this led to any actual problems; things that
checked if the file being dropped was present just failed to fail, and
instead reported it wasn't present, possibly incorrectly.
Hmm, it's possible that this could have made git annex fsck --from remote
update the location log wrongly, if a remote was in direct mode, and was in
the middle of trying to drop a key, and the drop later failed.
Also cleaned up the code, avoiding creating a lock file if we're going to
open it for create later anyway.
And, if there's an exception while preparing to lock the file, but not at
the point of actually taking the lock, throw an exception, instead of
silently not locking and pretending to succeed.
And, on Windows, always use lock file, even if the repo somehow got into
indirect mode (maybe with cygwin git..)
The one exception is in Utility.Daemon. As long as a process only
daemonizes once, which seems reasonable, and as long as it avoids calling
checkDaemon once it's already running as a daemon, the fcntl locking
gotchas won't be a problem there.
Annex.LockFile has it's own separate lock pool layer, which has been
renamed to LockCache. This is a persistent cache of locks that persist
until closed.
This is not quite done; lockContent stil needs to be converted.
Should be no behavior changes, just simplified code.
The only actual difference is it doesn't truncate the lock file.
I think that was a holdover from when transfer info was written to the lock
file.
Only the assistant uses these, and only the assistant cleans them up, so
make only git annex transferkeys write them,
There is one behavior change from this. If glacier is being used, and a
manual git annex get --from glacier fails because the file isn't available
yet, the assistant will no longer later see that failed transfer file and
retry the get. Hope no-one depended on that old behavior.
The setDifferences that got added to initialize turns out to make a git
commit, and before ensureCommit has been used. Thus, repo init can fail
when the system has a broken hostname etc.
Move the ensureCommit to the very first thing to avoid this kind of breakage.
This works, and seems fairly robust. Clean get of 20 files at -J3. At -J10,
there are some messages about ssh multiplexing, probably due to a race
spinning up the ssh connection cacher. But, it manages to get all the files
ok regardless.
The progress bars are a scrambled mess though, due to bugs in
ascii-progress, which I've already filed. Particularly this one:
https://github.com/yamadapc/haskell-ascii-progress/issues/8
Came up with a generic way to filter out progress messages while keeping
errors, for commands that use stderr for both.
--json mode will disable command outputs too.
This might be overkill; I only know I need it in ls-files, but other git
commands can also do their own globbing, it turns out, and I am pretty sure
I never want them too when git-annex is using them as plumbing.
Test suite still passes and it looks ok.
This was introduced by commit 450ee53ab6
However, the same problem could affect other calls to programPath,
specifically some on the assistant. So, I fixed it at a deeper level.
Seems to work, but still experimental until it's been tested more.
When repositories are on filesystems not supporting symlinks, the .git dir
symlink trick cannot be used. Since we're going to be in direct mode
anyway, the .git dir symlink is not strictly needed.
However, I have not fixed the code that creates new annex symlinks to
handle this case -- the committed symlinks will be wrong.
git annex sync happens to currently fail in a submodule using direct mode,
because there's no HEAD ref. That also needs to be dealt with to get
this fully working in crippled filesystems.
Leaving http://github.com/datalad/datalad/issues/44 open until these issues
are dealt with.
Most of the time, there will be no discreprancy between programPath and
readProgramFile.
But, the programFile might have been written by an old version of git-annex
that is still installed, while a newer one is currently running. In this
case, we want to run the same one that's currently running.
This is especially important for things like the GIT_SSH=git-annex used for
ssh connection caching.
The only code that still uses readProgramFile directly is the upgrade code,
which needs to know where the standalone git-annex was installed, in order to
upgrade it.
Turns out sqlite does not like having its database deleted out from
underneath it. It might suffice to empty the table, but I would rather
start each fsck over with a new database, so I added a lock file, and
running incremental fscks use a shared lock.
This leaves one concurrency bug left; running two concurrent fsck --more
will lead to: "SQLite3 returned ErrorBusy while attempting to perform step."
and one or both will fail. This is a concurrent writers problem.
* sync: Use the ssh-options git config when doing git pull and push.
* remotedaemon: Use the ssh-options git config.
Note that the rename env var means that if a new git-annex calls an old one
for git-annex ssh, or a new calls an old, nothing much will go wrong;
just ssh caching won't happen.
I hope this doesn't impact speed much -- it does have to pull out a value
from Annex state every time it accesses the branch now.
The test case I dropped has never caught any problems that I can remember,
and would have been rather difficult to convert.
Eliminated complexity and future proofed. The most important change is that
all functions over Difference are now total; any Difference that can be
expressed should be handled. Avoids needs for sanity checking of inputs,
and version skew with the future.
Also, the difference.log now serializes a [Difference], not a Differences.
This saves space and keeps it simpler.
Note that [Difference] might contain conflicting differences (eg,
[Version5, Version6]. In this case, one of them needs to consistently win
over the others, probably based on Ord.
* init: Repository tuning parameters can now be passed when initializing a
repository for the first time. For details, see
http://git-annex.branchable.com/tuning/
* merge: Refuse to merge changes from a git-annex branch of a repo
that has been tuned in incompatable ways.
This is necessary for interop between inode caches created on unix and
windows. Which is more important than supporting inodecaches for large keys
with the wrong size, which are broken anyway.
There should be no slowdown from this change, except on Windows.
Avoid using fileSize which maxes out at just 2 gb on Windows.
Instead, use hFileSize, which doesn't have a bounded size.
Fixes support for files > 2 gb on Windows.
Note that the InodeCache code only needs to compare a file size,
so it doesn't matter it the file size wraps. So it has been
left as-is. This was necessary both to avoid invalidating existing inode
caches, and because the code passed FileStatus around and would have become
more expensive if it called getFileSize.
This commit was sponsored by Christian Dietrich.
Reverts 965e106f24
Unfortunately, this caused breakage on Windows, and possibly elsewhere,
because parentDir and takeDirectory do not behave the same when there is a
trailing directory separator.
parentDir is less safe than takeDirectory, especially when working
with relative FilePaths. It's really only useful in loops that
want to terminate at /
This commit was sponsored by Audric SCHILTKNECHT.
This fixes 9 test suite failures. There are some tricky things going on
with the paths to the index file, and git's working directory, which
are hard to get right with relative paths. So, I switched back to absolute
here, at least for now.
Only 2 test suite failures remain on this branch, but there are other
potential problems the test suite doesn't catch. Including some calls to
setCurrentDirectory -- I was wrong and git-annex does do that in a few
places, like when generating a view.
This allows the git repository to be moved while git-annex is running in
it, with fewer problems.
On Windows, this avoids some of the problems with the absurdly small
MAX_PATH of 260 bytes. In particular, git-annex repositories should
work in deeper/longer directory structures than before. See
http://git-annex.branchable.com/bugs/__34__git-annex:_direct:_1_failed__34___on_Windows/
There are several possible ways this change could break git-annex:
1. If it changes its working directory while it's running, that would
be Bad News. Good news everyone! git-annex never does so. It would also
break thread safety, so all such things were stomped out long ago.
2. parentDir "." -> "" which is not a valid path. I had to fix one
instace of this, and I should probably wipe all calls to parentDir out
of the git-annex code base; it was never a good idea.
3. Things like relPathDirToFile require absolute input paths,
and code assumes that the git repo path is absolute and passes it to it
as-is. In the case of relPathDirToFile, I converted it to not make
this assumption.
Currently, the test suite has 16 failures.
This allows bypassing the direct mode guard in a safe way to do all sorts
of things including git revert, git mv, git checkout ...
This commit was sponsored by the WikiMedia Foundation.
Didn't know that this library existed!
This includes making git-annex not re-exec itself on start on windows, and
making the test suite on Windows run tests without forking.
Added a Default instance for TrustLevel, and was able to use that to clear
up several other parts of the code too.
This commit was sponsored by Stephan Schulz
Found these with:
git grep "^ " $(find -type f -name \*.hs) |grep -v ': where'
Unfortunately there is some inline hamlet that cannot use tabs for
indentation.
Also, Assistant/WebApp/Bootstrap3.hs is a copy of a module and so I'm
leaving it as-is.
This fixes all instances of " \t" in the code base. Most common case
seems to be after a "where" line; probably vim copied the two space layout
of that line.
Done as a background task while listening to episode 2 of the Type Theory
podcast.
* New annex.hardlink setting. Closes: #758593
* init: Automatically detect when a repository was cloned with --shared,
and set annex.hardlink=true, as well as marking the repository as
untrusted.
Had to reorganize Logs.Trust a bit to avoid a cycle between it and
Annex.Init.
This avoids cp -a overriding the default mode acls that the user might have
set in a git repository.
With GNU cp, this behavior change should not be a breaking change, because
git-anex also uses rsync sometimes in the same situation, and has only ever
preserved timestamps when using rsync.
Systems without GNU cp will no longer use cp -a, but instead just cp.
So, timestamps will no longer be preserved. Preserving timestamps when
copying between repos is not guaranteed anyway.
Closes: #729757
This fixed one bug where it needed to be and wasn't (in Assistant.Unused).
And also found one place where lockContent was used unnecessarily (by
drop --from remote).
A few other places like uninit probably don't really need to lockContent,
but it doesn't hurt to do call it anyway.
This commit was sponsored by David Wagner.
Also fixes a test suite failures introduced in recent commits, where
inAnnexSafe failed in indirect mode, since it tried to open the lock file
ReadWrite. This is why the new checkLocked opens it ReadOnly.
This commit was sponsored by Chad Horohoe.
The nice refactoring in ec7dd0446a
highlighted a bug in lockContent -- when the content is not present,
this incorrectly created an empty lock file, using the same filename
as the content file.
This seems like it could result in empty objects, which fsck would detect
and complain about. Both drop and move --to call lockContent, as does
Remote.Git.dropKey -- I think we got lucky and this bug didn't show up
because both all of those only operate on files that are present. So
this bug could only manifest if there was a race, and a file's content
was dropped at just the wrong time, just as another process was about to
drop it. (And then only if the other process's dropping failed, otherwise
it'd delete the empty object file.)
Hmm, move --from also called lockContent. Unnecessarily, since the content
is not being removed from the local annex. In this case, the combination of
the 2 bugs could result in an empty lock file being written, and then if
the download of the content failed, left in the object directory as the
content.
This commit also optimises lockContent, avoiding an unncessary
doesFileExist test and instead just catching the exception that's thrown
when the file doesn't exist.
This commit was sponsored by Justine Lam.
Added a convenience Utility.LockFile that is not a windows/posix
portability shim, but still manages to cut down on the boilerplate around
locking.
This commit was sponsored by Johan Herland.
This was a bug, but it was only used for ssh locks and by the hook special
remote locking. At least in the case of ssh locks, the lock files happened
to already exist before this tried to use them, so the bug didn't cause
anything to break.
This does mean that eg, copying multiple files to a local remote will
become slightly slower, since it now restarts git-cat-file after each copy.
Should not be significant slowdown.
The reason git-cat-file is run on the remote at all is to update its
location log. In order to add an item to it, it needs to get the current
content of the log. Finding a way to avoid needing to do that would be a
good path to avoiding this slowdown if it does become a problem somehow.
This commit was sponsored by Evan Deaubl.
(With the exception of daemon pid locking.)
This fixes at part of #758630. I reproduced the assistant locking eg, a
removable drive's annex journal lock file and forking a long-running
git-cat-file process that inherited that lock.
This did not affect Windows.
Considered doing a portable Utility.LockFile layer, but git-annex uses
posix locks in several special ways that have no direct Windows equivilant,
and it seems like it would mostly be a complication.
This commit was sponsored by Protonet.
replaceFileOr was broken and ran the rollback action always.
Luckily, for replaceFile, the rollback action was safe to run, since it
just nuked a temp file that had already been moved into place.
However, when `git annex direct` used replaeFileOr, its rollback printed a
scary message:
/home/joey/tmp/rrrr/.git/annex/misctmp/tmp32268: rename: does not exist (No such file or directory)
There was actually no bad result though.
Removed old extensible-exceptions, only needed for very old ghc.
Made webdav use Utility.Exception, to work after some changes in DAV's
exception handling.
Removed Annex.Exception. Mostly this was trivial, but note that
tryAnnex is replaced with tryNonAsync and catchAnnex replaced with
catchNonAsync. In theory that could be a behavior change, since the former
caught all exceptions, and the latter don't catch async exceptions.
However, in practice, nothing in the Annex monad uses async exceptions.
Grepping for throwTo and killThread only find stuff in the assistant,
which does not seem related.
Command.Add.undo is changed to accept a SomeException, and things
that use it for rollback now catch non-async exceptions, rather than
only IOExceptions.
For example, I had a copy to a remote that was failing for an unknown
reason. This let me see the exception was createDirectory: permission
denied; the underlying problem being a permissions issue.
Putting a callback in the Retriever type allows for the callback to
remove the retrieved file when it's done with it.
I did not really want to make Retriever be fixed to Annex Bool,
but when I tried to use Annex a, I got into some type of type mess.
Needed for eg, Remote.External.
Generally, any Retriever that stores content in a file is responsible for
updating the meter, while ones that procude a lazy bytestring cannot update
the meter, so are not asked to.
Slightly tricky as they are not normal UUIDBased logs, but are instead maps
from (uuid, chunksize) to chunkcount.
This commit was sponsored by Frank Thomas.
I think this is a git behavior change, but have not checked to be sure.
Conflict cruft used to look like $foo~HEAD, but now just $foo is left
behind as conflict cruft.
With test case.
Running `git annex direct` would cause loss of data, because the object
was moved to a temp file, which it then tried to replace the work tree file
with, and on failure, the temp file got deleted. Now it's instead moved
back into the annex object location.
Minor because normally only 1 FD is leaked per git-annex run. However,
the test suite leaks a few hundred FDs, and this broke it on the Debian
autobuilders, which seem to have a tigher than usual ulimit.
The leak was introduced by the lazy getDirectoryContents' that was
introduced in e6330988dd in order to scale to
millions of journal files -- if the lazy list was never fully consumed, the
directory handle did not get closed.
Instead, pull in openDirectory/readDirectory/closeDirectory code that I
already developed and submitted in a patch to the haskell directory library
earlier. Using this in journalDirty avoids the place that the lazy list
caused a problem. And using it in stageJournal eliminates the need for
getDirectoryContents'.
The getJournalFiles* functions are switched back to using the regular
strict getDirectoryContents. I'm not sure if those always consume the whole
list, so this avoids any leak. And the things that call those are things
like git annex unused, which also look at every file committed to the
git-annex branch, so would need more work to scale to insane numbers of
files anyway.
When one side is an annexed symlink, and the other side is a non-annexed symlink.
In this case, git-merge does not replace the annexed symlink in the work
tree with the non-annexed symlink, which is different from it's handling of
conflicts between annexed symlinks and regular files or directories.
So, while git-annex generated the correct merge commit, the work tree
didn't get updated to reflect it.
See comments on bug for additional analysis.
Did not add this to the test suite yet; just unloaded a truckload of firewood
and am feeling lazy.
This commit was sponsored by Adam Spiers.
Eg after git-annex add has run on 2 million files in one go.
Slightly unhappy with the neeed to use a temp file here, but I cannot see
any other alternative (see comments on the bug report).
This commit was sponsored by Hamish Coleman.
Support users who have set commit.gpgsign, by disabling gpg signatures for
git-annex branch commits and commits made by the assistant.
The thinking here is that a user sets commit.gpgsign intending the commits
that they manually initiate to be gpg signed. But not commits made in the
background, whether by a deamon or implicitly to the git-annex branch.
gpg signing those would be at best a waste of CPU and at worst would fail,
or flood the user with gpg passphrase prompts, or put their signature on
changes they did not directly do.
See Debian bug #753720.
Also makes all commits done by git-annex go through a few central control
points, to make such changes easier in future.
Also disables commit.gpgsign in the test suite.
This commit was sponsored by Antoine Boegli.
When annex.genmetadata is set, metadata from the feed is added to files
that are imported from it.
Reused the same feedtitle and itemtitle, feedauthor, itemauthor, etc names
that are used in --template.
Also added title and author, which are the item title/author if available,
falling back to the feed title/author. These are more likely to be common
metadata fields.
(There is a small bit of dupication here, but once git gets
around to packing the object, it will compress it away.)
The itempubdate field is not included in the metadata as a string; instead
it is used to generate year and month fields, same as is done when adding
files with annex.genmetadata set.
This commit was sponsored by Amitai Schlair, who cooincidentially
is responsible for ikiwiki generating nice feed metadata!
http://marc.info/?l=git&m=140262402204212&w=2
This git bug manifested on FAT and Windows as the test suite failing in 3
places. All involved merge conflict resolution. It turned out that the
associated file mappings were getting messed up, and that happened because
this git bug lost track of what files were supposed to be symlinks.
This commit was sponsored by Eric Kidd.
Rather than calculating the TSDelta once, and caching it, this now
reads the inode sential file's InodeCache file once, and then each time a
new InodeCache is generated, looks at the sentinal file to get the current
delta.
This way, if the time zone changes while git-annex is running, it will
adapt.
This adds some inneffiency, but only on Windows, and only 1 stat per new
file added. The worst innefficiency is that `git annex status` and
`git annex sync` will now (on Windows) stat the inode sentinal file once per
file in the repo.
It would be more efficient to use getCurrentTimeZone, rather than needing
to stat the sentinal file. This should be easy to do, once the time
package gets my bugfix patch.
This commit was sponsored by Jürgen Lüters.
On Windows, changing the time zone causes the apparent mtime of files to
change. This confuses git-annex, which natually thinks this means the files
have actually been modified (since THAT'S WHAT A MTIME IS FOR, BILL <sheesh>).
Work around this stupidity, by using the inode sentinal file to detect if
the timezone has changed, and calculate a TSDelta, which will be applied
when generating InodeCaches.
This should add no overhead at all on unix. Indeed, I sped up a few
things slightly in the refactoring.
Seems to basically work! But it has a big known problem:
If the timezone changes while the assistant (or a long-running command)
runs, it won't notice, since it only checks the inode cache once, and
so will use the old delta for all new inode caches it generates for new
files it's added. Which will result in them seeming changed the next time
it runs.
This commit was sponsored by Vincent Demeester.
It was possible for a interrupted sync or merge in direct mode to
leave the work tree out of sync with the last recorded commit.
This would result in the next commit seeing files missing from the work
tree, and committing their removal.
Now, a direct mode merge happens not only in a throwaway work tree, but using
a temporary index file, and without any commits or index changes
being made until the real work tree has been updated. If the merge is
interrupted, the work tree may have some updated files, but worst case a
commit will redundantly commit changes that come from the merge.
This commit was sponsored by Tony Cantor.