When git-annex-shell p2pstdio fails with 255, it's because the ssh
server is not reachable. Avoid running the fallback action in this case,
since it would just try a second time to connect, and presumably fail.
Note that the closed P2PSshConnection will not be stored in the pool,
so the next request tries again to connect. This is just the right
behavior; when the remote becomes reachable again, the same git-annex
process will start using it.
This commit was sponsored by Ole-Morten Duesund on Patreon.
Unfortunately ReceiveMessage didn't handle unknown messages the way it
was documented to; client sending VERSION would cause the server to
return an ERROR and hang up. Fixed that, but old releases of git-annex
use the P2P protocol for tor and will still have that behavior.
So, version is not negotiated for Remote.P2P connections, only for
Remote.Git connections, which will support VERSION from their first
release. There will need to be a later flag day to change Remote.P2P;
left a commented out line that is the only thing that will need to be
changed then.
Version 1 of the P2P protocol is not implemented yet, but updated
the docs for the DATA change that will be allowed by that version.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
Note that, due to not using rsync to transfer files to ssh remotes
any longer, permissions and other file metadata of annexed files
will no longer be preserved when copying them to ssh remotes.
Other remotes never supported preserving that information, so
this is not considered a regression. Added NEWS item about this.
Another significant side effect of this is that, even when rsync is run to
retrieve a file, its progress display will no longer be shown, and
instead the native git-annex progress display will appear. It would be
possible to use the rsync process display when rsync is used (old
git-annex-shell and also retrieval from a local repository), but it
would have complicated the code unncessarily, and been inconsistent
behavior.
(I'd been thinking for a while about eliminating the rsync progress
display, since it's got some annoying verbosities, including display of
the key and the "(xfr#1, to-chk=0/1)" bit and was already somewhat
inconsistent.)
retrieveKeyFileCheap still uses rsync, since that ensures that it gets
the actual file content from the remote. Using the P2P protocol would
use the local content, as long as the local and remote size are the
same.
This commit was sponsored by John Pellman on Patreon.
Not yet used for everything else, but this is enough to
verify that it works, and do some benchmarking.
Some bugfixes included, which got it working. Also fallback to old
actions has been verified to work correctly.
Benchmarked dropping one thousand files from a ssh remote on localhost.
Using the old git-annex 40.867 seconds.
With the P2P protocol 9.905 seconds!
This commit was sponsored by Jochen Bartl on Patreon.
Not yet used by git-annex, but this will allow faster transfers etc than
using individual ssh connections and rsync.
Not called git-annex-shell p2p, because git-annex p2p does something
else and I don't want two subcommands with the same name between the two
for sanity reasons.
This commit was sponsored by Øyvind Andersen Holm.
Renaming is not supported; it might be possible to use --fuzzy to get rsync
to notice the file is being renamed, but that is a bit ..fuzzy.
On the other hand, interrupted transfers of an exported file are resumed,
since rsync is great at that. Had to adjust the exporttree docs, which
said interrupted transfers would restart.
Note that remove no longer makes the empty directory dummy, instead
sending the top-level empty directory. This works just as well and I
noticed the dummy was unncessary when refactoring it into removeGeneric.
Verified that behavior of remove is not changed, and git annex
testremote does pass.
This commit was sponsored by Brock Spratlen on Patreon.
It's left up to the special remote to detect when git-annex is new enough
to support the message; an old git-annex will blow up.
This commit was supported by the NSF-funded DataLad project.
gmane's disk crashed, I found one thread in another archive, but could
not find my whole patch set in any archive (perhaps some of the messages
were too long), so pulled it out of my personal mail archives.
This commit was supported by the NSF-funded DataLad project.
A top-level .noannex file will prevent git-annex init from being used in a
repository. This is useful for repositories that have a policy reason not
to use git-annex. The content of the file will be displayed to the user who
tries to run git-annex init.
This also affects git annex reinit and initialization via the webapp.
It does not affect automatic inits, when there's a sibling git-annex branch
already.
This commit was supported by the NSF-funded DataLad project.
This will be used in youtube-dl integration, to tell when a html page has
been downloaded by addurl, in which case it is worth running youtube-dl
to see if it can extract media from it.
tagsoup is an almost free dependency, because yesod depends on it.
So, this only really adds a dep when git-annex is built without the
webapp.
I'd like this to as closely as possible match how browsers decide if a
page is html or not. Unfortunately, that is fairly heuristic, in order
to support malformed html. And, we don't want to falsely detect
something as html just because it has something that looks like a html
tag embedded somewhere in it. Probably any major video hosting site is
going to be serving html documents that at least start with a <html>
tag, so requiring that or a DOCTYPE should be good enough.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
As long as the class of remotes supports exporting, it's tested whether
or not the remote is configured with exporttree=yes.
Also, made testremote of a remote configured with exporttree=yes
disable that configuration for testing non-export storage.
This commit was supported by the NSF-funded DataLad project.
Remove closed bugs and todos that were last edited or commented before 2017.
Command line used:
for f in $(grep -l '|done\]\]' -- *.mdwn); do d="$(echo "$f" | sed 's/.mdwn$//')"; if [ -z "$(git log --since=01-01-2017 --pretty=oneline -- "$f")" -a -z "$(git log --since=01-01-2017 --pretty=oneline -- "$d")" ]; then git rm -- "./$f" ; git rm -rf "./$d"; fi; done
for f in $(grep -l '\[\[done\]\]' -- *.mdwn); do d="$(echo "$f" | sed 's/.mdwn$//')"; if [ -z "$(git log --since=01-01-2017 --pretty=oneline -- "$f")" -a -z "$(git log --since=01-01-2017 --pretty=oneline -- "$d")" ]; then git rm -- "./$f" ; git rm -rf "./$d"; fi; done
This is similar to the pusher thread, but a separate thread because git
pushes can be done in parallel with exports, and updating a big export
should not prevent other git pushes going out in the meantime.
The exportThread only runs at most every 30 seconds, since updating an
export is more expensive than pushing. This may need to be tuned.
Added a separate channel for export commits; the committer records a
commit in that channel.
Also, reconnectRemotes records a dummy commit, to make the exporter
thread wake up and make sure all exports are up-to-date. So,
connecting a drive with a directory special remote export will
immediately update it, and getting online will automatically
update S3 and WebDAV exports.
The transfer queue is not involved in exports. Instead, failed
exports are retried much like failed pushes.
This commit was sponsored by Ewen McNeill.
The bug occurred when closeDb was not called, and garbage collection of
the DbHandle didn't give the workerThread time to shut down. Fixed by
exiting the runSqlite action when a commit is made.
(MultiWriter mode already forked off a runSqlite action, so avoided the
problem.)
This commit was sponsored by Brock Spratlen on Patreon.
Now when one repository has exported a tree, another repository can get
files from the export, after syncing.
There's a bug: While the database update works, somehow the database on
disk does not get updated, and so the database update is run the next
time, etc. Wasn't able to figure out why yet.
This commit was sponsored by Ole-Morten Duesund on Patreon.
New table needed to look up what filenames are used in the currently
exported tree, for reasons explained in export.mdwn.
Also, added smart constructors for ExportLocation and ExportDirectory to
make sure they contain filepaths with the right direction slashes.
And some code refactoring.
This commit was sponsored by Francois Marier on Patreon.
There does not seem to be a use case for supporting that, and it would
need a lot of complication to support it in a way that allows eventual
consistency when two repositories are updating the same export.
This commit was sponsored by Henrik Riomar on Patreon.
Apparently box.com renaming is just buggy. I tried a couple of fixes:
* In case the http Manager was opening multiple connections and reaching
different backend servers, I tried limiting the number of connections
to 1. Didn't help.
* To make sure it was not a http connection reuse problem, I tried
rewriting how exportAction works, so that the same http connection
is clearly open. Didn't help.
So, disable renaming of exports for box.com. It would be good to test it
with some other webdav server.
This commit was sponsored by John Peloquin on Patreon.
It was not getting old lines removed, because the tree graft confused
the updater, so it union merged from the previous git-annex branch,
which still contained the old lines. Fixed by carefully using setIndexSha.
This commit was supported by the NSF-funded DataLad project.
This breaks backwards compatibility, but only with unreleased versions of
git-annex, which I think is acceptable.
This commit was supported by the NSF-funded DataLad project.
Removal works, only derives are a potential issue, so allow removing
with a warning. This way, unexporting a file works, and behavior is
consistent with IA remotes whether or not exporttree=yes.
Also tested exporting filenames containing unicode, spaces, underscores.
All worked, despite the IA's faq saying it doesn't.
This commit was sponsored by Trenton Cronholm on Patreon.
It opens a http connection per file exported, but then so does git
annex copy --to s3.
Decided not to munge exported filenames for IA. Too large a chance of
the munging having confusing results. Instead, export of files not
supported by IA, eg with spaces in their name, will fail.
This commit was supported by the NSF-funded DataLad project.
Only rename when actually ncessary.
The diff gets buffered in memory. Probably git has to buffer a diff in
memory when generating it as well, so this memory usage should not be a
problem, even when the diff is very large. I hope.
This commit was supported by the NSF-funded DataLad project.
Don't allow "exporttree=yes" to be set when the special remote
does not support exports. That would be confusing since the user would
set up a special remote for exports, but `git annex export` to it would
later fail.
This commit was supported by the NSF-funded DataLad project.
The export database has writes made to it and then expects to read back
the same data immediately. But, the way that Database.Handle does
writes, in order to support multiple writers, makes that not work, due
to caching issues. This resulted in export re-uploading files it had
already successfully renamed into place.
Fixed by allowing databases to be opened in MultiWriter or SingleWriter
mode. The export database only needs to support a single writer; it does
not make sense for multiple exports to run at the same time to the same
special remote.
All other databases still use MultiWriter mode. And by inspection,
nothing else in git-annex seems to be relying on being able to
immediately query for changes that were just written to the database.
This commit was supported by the NSF-funded DataLad project.
Straightforward enough, except for the needed belt-and-suspenders sanity
checks to avoid foot shooting due to exports not being key/value stores.
* Even when annex.verify=false, always verify from exports.
* Only get files from exports that use a backend that supports
checksum verification.
* Never trust exports, even if the user says to, because then
`git annex drop` would drop content if the export seemed to contain
a copy.
This commit was supported by the NSF-funded DataLad project.
Removed uncorrect UniqueKey key in db schema; a key can appear multiple
times with different files.
The database has to be flushed after each removal. But when adding files
to the export, lots of changes are able to be queued up w/o flushing.
So it's still fairly efficient.
If large removals of files from exports are too slow, an alternative
would be to make two passes over the diff, one pass queueing deletions
from the database, then a flush and the a second pass updating the
location log. But that would use more memory, and need to look up
exportKey twice per removed file, so I've avoided such optimisation yet.
This commit was supported by the NSF-funded DataLad project.
* Only export to remotes that were initialized to support it.
* Prevent storing key/value on export remotes.
* Prevent enabling exporttree=yes and encryption in the same remote.
SetupStage Enable was changed to take the old RemoteConfig.
This allowed only setting exporttree when initially setting up a
remote, and not configuring it later after stuff might already be stored
in the remote.
Went with =yes rather than =true for consistency with other parts of
git-annex. Changed docs accordingly.
This commit was supported by the NSF-funded DataLad project.
Make a pass over the whole exported tree, and upload anything that has
not yet reached the export. Update location log when exporting.
Note that the synthesized keys for non-annexed files are stored in the
location log too.
Some cases involving files in the tree with the same content are not
handled correctly yet.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Cryptonite is faster and allocates less, and I want to get rid of
MissingH use.
Note that the new dependency on memory is free; it's a dependency of
cryptonite.
This commit was supported by the NSF-funded DataLad project.
Unlike git add -u, git annex add -u does not update the index for files
removed from the working tree. But then, "git add ." stages removals,
and "git annex add ." does not, so that's an existing divergence.
Seems that --update --batch would need to run git ls-files once per line of
batch input, which would surely be too slow, so just throw an error for
that.
This commit was supported by the NSF-funded DataLad project.
* init: When annex.securehashesonly has been set with git-annex config,
copy that value to the annex.securehashesonly git config.
* config --set: As well as setting value in git-annex branch,
set local gitconfig. This is needed especially for
annex.securehashesonly, which is read only from local gitconfig and not
the git-annex branch.
doc/todo/sha1_collision_embedding_in_git-annex_keys.mdwn has the
rationalle for doing it this way. There's no perfect solution; this
seems to be the least-bad one.
This commit was supported by the NSF-funded DataLad project.
Yesterday's SHA1 collision attack could be used to generate eg:
SHA256-sfoo--whatever.good
SHA256-sfoo--whatever.bad
Such that they collide. A repository with the good one could have the
bad one swapped in and signed commits would still verify.
I've already mitigated this.
I am not happy that I had to put backend-specific code in file2key. But
it would be very difficult to avoid this layering violation.
Most of the time, when parsing a Key from a symlink target, git-annex
never looks up its Backend at all, so adding this check to a method of
the Backend object would not work.
The Key could be made to contain the appropriate
Backend, but since Backend is parameterized on an "a" that is fixed to
the Annex monad later, that would need Key to change to "Key a".
The only way to clean this up that I can see would be to have the Key
contain a LowlevelBackend, and put the validation in LowlevelBackend.
Perhaps later, but that would be an extensive change, so let's not do
it in this commit which may want to cherry-pick to backports.
This commit was sponsored by Ethan Aubin.