Commit graph

1241 commits

Author SHA1 Message Date
Joey Hess
ec12537774
defer write permissions checking in import until after copy to repo
This should complete the fix started in
6329997ac4, fixing the actual cause of the
test suite failure this time.

Sponsored-by: Dartmouth College's Datalad project
2021-09-02 13:45:21 -04:00
Joey Hess
4f42292b13
improve url download failure display
* When downloading urls fail, explain which urls failed for which
  reasons.
* web: Avoid displaying a warning when downloading one url failed
  but another url later succeeded.

Some other uses of downloadUrl use urls that are effectively internal use,
and should not all be displayed to the user on failure. Eg, Remote.Git
tries different urls where content could be located depending on how the
remote repo is set up. Exposing those urls to the user would lead to wild
goose chases. So had to parameterize it to control whether it displays urls
or not.

A side effect of this change is that when there are some youtube urls
and some regular urls, it will try regular urls first, even if the
youtube urls are listed first. This seems like an improvement if
anything, but in any case there's no defined order of urls that it's
supposed to use.

Sponsored-by: Dartmouth College's Datalad project
2021-09-01 15:33:38 -04:00
Joey Hess
837116ef1e
Fix support for readonly git remotes
Boolean blindness oops.

(Reversion in version 8.20210621)

Sponsored-by: Dartmouth College's Datalad project
2021-08-30 12:34:19 -04:00
Joey Hess
a99a84f342
add: Detect when xattrs or perhaps ACLs prevent locking down a file's content
And fail with an informative message.

I don't think ACLs can prevent removing the write bit, but I'm not sure,
so kept it mentioning them as a possibility.

Should git-annex lock also check if the write bits are able to be removed?
Maybe, but the case I know about with xattrs involves cp -a copying NFS
xattrs, and it's the copy of the file that is the problem. So when locking
a file, I guess it will not be the copy.

Sponsored-by: Dartmouth College's Datalad project
2021-08-27 14:33:01 -04:00
Joey Hess
e17342b2a0
Run cp -a with --no-preserve=xattr, to avoid problems with copied xattrs
Including them breaking permissions setting on some NFS servers.

Sponsored-by: Dartmouth College's Datalad project
2021-08-27 13:09:34 -04:00
Joey Hess
6d4a728455
Added annex.youtube-dl-command config
This can be used to run some forks of youtube-dl.

Sponsored-by: Brett Eisenberg on Patreon
2021-08-27 09:44:23 -04:00
Joey Hess
ab7b5a492c
--batch-keys
New --batch-keys option added to these commands:  get, drop, move, copy, whereis

git-annex-matching-options had to be reworded since some of its options
can be used to match on keys, not only files.

Sponsored-by: Luke Shumaker on Patreon
2021-08-25 14:21:12 -04:00
Joey Hess
4ed36b2634
Fix test suite failure on Windows
It would be better if the Arbitrary instance avoided generating impossible
filenames like "foo/c:bar", but proably this is the only place that splits
the file from the directory and then uses the file without the directory..
At least on the quickcheck properties.

Sponsored-by: Svenne Krap on Patreon
2021-08-24 14:03:29 -04:00
Joey Hess
f9b92c81f6
unused: Skip the refs/annex/last-index ref that git-annex recently started creating
This was unlikely to cause any problem, but it is unsightly to mention
normally hidden refs, and it might have done a bit of unnecessary work to
check that ref.

Sponsored-by: Noam Kremen on Patreon
2021-08-24 12:58:14 -04:00
Joey Hess
53744e132d
incremental verification for gitlfs and httpalso
And that should be all the special remotes supporting it on linux now,
except for in the odd edge case here and there.

Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:17:10 -04:00
Joey Hess
f5e09a1dbe
incremental verification for S3
Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:07:00 -04:00
Joey Hess
d154e7022e
incremental verification for web special remote
Except when configuration makes curl be used. It did not seem worth
trying to tail the file when curl is downloading.

But when an interrupted download is resumed, it does not read the whole
existing file to hash it. Same reason discussed in
commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long
time with no progress being displayed. And also there's an open http
request, which needs to be consumed; taking a long time to hash the file
might cause it to time out.

Also in passing implemented it for git and external special remotes when
downloading from the web. Several others like S3 are within striking
distance now as well.

Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:02:22 -04:00
Joey Hess
8613770b06
incremental verify for webdav special remote
Sponsored-by: Dartmouth College's DANDI project
2021-08-16 17:29:32 -04:00
Joey Hess
b1622eb932
incremental verify for directory special remote
Added fileRetriever', which will let the remaining special remotes
eventually also support incremental verify.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 16:51:33 -04:00
Joey Hess
ec82299730
status update
I was wrong about S3 supporting tailVerify.
2021-08-16 15:15:32 -04:00
Joey Hess
dadbb510f6
incremental hashing for fileRetriever
It uses tailVerify to hash the file while it's being written.

This is able to sometimes avoid a separate checksum step. Although
if the file gets written quickly enough, tailVerify may not see it
get created before the write finishes, and the checksum still happens.

Testing with the directory special remote, incremental checksumming did
not happen. But then I disabled the copy CoW probing, and it did work.
What's going on with that is the CoW probe creates an empty file on
failure, then deletes it, and then the file is created again. tailVerify
will open the first, empty file, and so fails to read the content that
gets written to the file that replaces it.

The directory special remote really ought to be able to avoid needing to
use tailVerify, and while other special remotes could do things that
cause similar problems, they probably don't. And if they do, it just
means the checksum doesn't get done incrementally.

Sponsored-by: Dartmouth College's DANDI project
2021-08-13 15:43:29 -04:00
Joey Hess
7eb3742e4b
incremental verify for chunked remotes
Simply feed each chunk in turn to the incremental verifier.

When resuming an interrupted retrieve, it does not do incremental
verification. That would need to read the file, up to the resume point,
and feed it to the incremental verifier. That seems easy to get wrong.
Also it would mean extra work done before the transfer can start. Which
would complicate displaying progress, and would perhaps not appear to the
user as if it was resuming from where it left off. Instead, in that
situation, return UnVerified, and let the verification be done in a
separate pass.

Granted, Annex.CopyFile does manage all that, but it's not complicated
by dealing with chunks too.

Sponsored-by: Dartmouth College's DANDI project
2021-08-11 14:42:49 -04:00
Joey Hess
c20358b671
incremental verify for byteRetriever special remotes
Several special remotes verify content while it is being retrieved,
avoiding a separate checksum pass. They are: S3, bup, ddar, and
gcrypt (with a local repository).

Not done when using chunking, yet.

Complicated by Retriever needing to change to be polymorphic. Which in turn
meant RankNTypes is needed, and also needed some code changes. The
change in Remote.External does not change behavior at all but avoids
the type checking failing because of a "rigid, skolem type" which
"would escape its scope". So I refactored slightly to make the type
checker's job easier there.

Unfortunately, directory uses fileRetriever (except when chunked),
so it is not amoung the improved ones. Fixing that would need a way for
FileRetriever to return a Verification. But, since the file retrieved
may be encrypted or chunked, it would be extra work to always
incrementally checksum the file while retrieving it. Hm.

Some other special remotes use fileRetriever, and so don't get incremental
verification, but could be converted to byteRetriever later. One is
GitLFS, which uses downloadConduit, which writes to the file, so could
verify as it goes. Other special remotes like web could too, but don't
use Remote.Helper.Special and so will need to be addressed separately.

Sponsored-by: Dartmouth College's DANDI project
2021-08-11 14:20:38 -04:00
Joey Hess
f1176f82a5
rsync special remote: Stop displaying rsync progress, and use git-annex's own progress display
Reasons are same as in commit cee14f147a.
(It was already done when using -J.)

Sponsored-by: Mark Reidenbach on Patreon
2021-08-09 12:06:10 -04:00
Joey Hess
1acdd18ea8
deal better with clock skew situations, using vector clocks
* Deal with clock skew, both forwards and backwards, when logging
  information to the git-annex branch.
* GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1)
  rather than needing to be advanced each time a new change is made.
* Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex.

When changing a file in the git-annex branch, the vector clock to use is now
determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK
when set), and comparing it to the newest vector clock already in use in
that file. If a newer time stamp was already in use, advance it forward by
a second instead.

When the clock is set to a time in the past, this avoids logging with
an old timestamp, which would risk that log line later being ignored in favor
of "newer" line that is really not newer.

When a log entry has been made with a clock that was set far ahead in the
future, this avoids newer information being logged with an older timestamp
and so being ignored in favor of that future-timestamped information.
Once all clocks get fixed, this will result in the vector clocks being
incremented, until finally enough time has passed that time gets back ahead
of the vector clock value, and then it will return to usual operation.

(This latter situation is not ideal, but it seems the best that can be done.
The issue with it is, since all writers will be incrementing the last
vector clock they saw, there's no way to tell when one writer made a write
significantly later in time than another, so the earlier write might
arbitrarily be picked when merging. This problem is why git-annex uses
timestamps in the first place, rather than pure vector clocks.)

Advancing forward by 1 second is somewhat arbitrary. setDead
advances a timestamp by just 1 picosecond, and the vector clock could
too. But then it would interfere with setDead, which wants to be
overrulled by any change. So it could use 2 picoseconds or something,
but that seems weird. It could just as well advance it forward by a
minute or whatever, but then it would be harder for real time to catch
up with the vector clock when forward clock slew had happened.

A complication is that many log files contain several different peices of
information, and it may be best to only use vector clocks for the same peice
of information. For example, a key's location log file contains
InfoPresent/InfoMissing for each UUID, and it only looks at the vector
clocks for the UUID that is being changed, and not other UUIDs.

Although exactly where the dividing line is can be hard to determine.
Consider metadata logs, where a field "tag" can have multiple values set
at different times. Should it advance forward past the last tag?
Probably. What about when a different field is set, should it look at
the clocks of other fields? Perhaps not, but currently it does, and
this does not seems like it will cause any problems.

Another one I'm not entirely sure about is the export log, which is
keyed by (fromuuid, touuid). So if multiple repos are exporting to the
same remote, different vector clocks can be used for that remote.
It looks like that's probably ok, because it does not try to determine
what order things occurred when there was an export conflict.

Sponsored-by: Jochen Bartl on Patreon
2021-08-04 12:33:46 -04:00
Joey Hess
899983058f
add: When adding a dotfile, avoid treating its name as an extension. 2021-08-03 12:22:58 -04:00
Joey Hess
9cae7c5bbf
releasing package git-annex version 8.20210803 2021-08-03 12:20:45 -04:00
Joey Hess
b3c4579c79
work around strange auto-init bug
git-annex get when run as the first git-annex command in a new repo did not
populate unlocked files. (Reversion in version 8.20210621)

I am not entirely happy with this, because I don't understand how
428c91606b caused the problem in the first
place, and I don't fully understand how skipping calling scanAnnexedFiles
during autoinit avoids the problem.

Kept the explicit call to scanAnnexedFiles during git-annex init,
so that when reconcileStaged is expensive, it can be made to run then,
rather than at some later point when the information is needed.

Sponsored-by: Brock Spratlen on Patreon
2021-07-30 18:36:03 -04:00
Joey Hess
66089e97de
Fix a rounding bug in display of data sizes
Eg, showImprecise 1 1.99 returned "1.1" rather than "2". The 9 rounded
upward to 10, and that was wrongly used as the decimal, rather than
carrying the 1.

Sponsored-by: Jack Hill on Patreon
2021-07-30 09:56:04 -04:00
Joey Hess
d2aead67bd
fsck: Detect and correct stale or missing inode caches for object files
An easy way to see this in action is to have an unlocked file, and touch the
object file.

While all code that compares inode caches for object files needs to be
prepared for this kind of problem and fall back to verification, having
fsck notice it and correct it is cheap (as long as fsck is being run
anyway) and ensures that if it happens for some unusual reason, there's a
way for the user to notice that it's happening.

Not that, when annex.thin is in use, the earlier call to isUnmodified
(and also potentially earlier calls to inAnnex in eg, verifyLocationLog)
will fix up the same problem silently. That might prevent the warning
being displayed, although probably it still will be, because the
Database.Keys write of the InodeCache will be queued but will not have
happened yet. I can't see a way to improve this, but it's not great.

Sponsored-by: Dartmouth College's Datalad project
2021-07-29 14:06:42 -04:00
Joey Hess
73e0cbbb19
fix problem populating pointer files
This is a result of an audit of every use of getInodeCaches,
to find places that misbehave when the annex object is not in the inode
cache, despite pointer files for the same key being in the inode cache.

Unfortunately, that is the case for objects that were in v7 repos that
upgraded to v8. Added a note about this gotcha to getInodeCaches.

Database.Keys.reconcileStaged, then annex.thin is set, would fail to
populate pointer files in this situation. Changed it to check if the
annex object is unmodified the same way inAnnex does, falling back to a
checksum if the inode cache is not recorded.

Sponsored-by: Dartmouth College's Datalad project
2021-07-27 14:26:49 -04:00
Joey Hess
3b5a3e168d
check if object is modified before starting to send it
Fix bug that caused some transfers to incorrectly fail with "content
changed while it was being sent", when the content was not changed.

While I don't know how to reproduce the problem that several people
reported, it is presumably due to the inode cache somehow being stale.
So check isUnmodified', and if it's not modified, include the file's
current inode cache in the set to accept, when checking for modification
after the transfer.

That seems like the right thing to do for another reason: The failure
says the file changed while it was being sent, but if the object file was
changed before the transfer started, that's wrong. So it needs to check
before allowing the transfer at all if the file is modified.

(Other calls to sameInodeCache or elemInodeCaches, when operating on inode
caches from the database, could also be problimatic if the inode cache is
somehow getting stale. This does not address such problems.)

Sponsored-by: Dartmouth College's Datalad project
2021-07-26 17:33:49 -04:00
Joey Hess
3d50b47ded
sync, merge: Added --allow-unrelated-histories option
Which is the same as the git merge option.

After last commit, this turns out to be needed in the test suite, and when
doing git-annex import from special remote, followed by a git-annex merge.

Sponsored-by: Svenne Krap on Patreon
2021-07-19 12:14:26 -04:00
Joey Hess
b6bea0d3f2
remove direct mode remnant of merging unrelated histories
sync, merge, post-receive: Avoid merging unrelated histories, which used to
be allowed only to support direct mode repositories.

(However, sync does still merge unrelated histories when importing trees
from special remotes, and the assistant still merges unrelated histories
always.)

See 556b2ded2b for why this was added
back in 2016, for direct mode.

This is a behavior change, which might break something that was relying
on sync merging unrelated histories, but git had a good reason to
prevent it, since it's easy to foot shoot with it, and git-annex should
follow suit.

Sponsored-by: Noam Kremen on Patreon
2021-07-19 11:41:26 -04:00
Joey Hess
33a80d083a
sync --quiet
* sync: When --quiet is used, run git commit, push, and pull without
  their ususual output.
* merge: When --quiet is used, run git merge without its usual output.

This might also make --quiet work better for some other commands
that make commits, like git-annex adjust.

Sponsored-by: Kevin Mueller on Patreon
2021-07-19 11:28:47 -04:00
Joey Hess
c952c485c8
Fix retrieval of content from borg repos accessed over ssh
It was making the borgrepo path absolute.. even when it was a ssh
repository.

Made BorgRepo a newtype, to guard against accidentially treating it like a
FilePath.

Sponsored-by: Graham Spencer on Patreon
2021-07-15 12:39:24 -04:00
Joey Hess
dd31fe7b9e
fall back to checking lower case hash directories in normal repo
Fix a bug that prevented getting content from a repository that started out
as a bare repository, or had annex.crippledfilesystem set, and was
converted to a non-bare repository.

This unfortunately means that inAnnex check gets slowed down by a stat call
in normal repos when the content is not present. Oh well, such is the cost
of backwards compatability with old mistakes.

Sponsored-by: Mark Reidenbach on Patreon
2021-07-15 12:16:31 -04:00
Joey Hess
47d3dccf19
whereused implemented
except --historical

Sponsored-by: Jack Hill on Patreon
2021-07-14 14:27:21 -04:00
Joey Hess
065db484e0
releasing package git-annex version 8.20210714 2021-07-14 12:23:24 -04:00
Joey Hess
8885bd3c5b
addistant: honor annex.delayadd for non-large files
assistant: When adding non-large files to git, honor annex.delayadd
configuration.

Also, don't add non-large files to git when they are still
being written to. This came for free, since the changes to non-large
files get queued up with the ones to large files, and run through the lsof
check.

Sponsored-by: Luke Shumaker on Patreon
2021-07-13 12:17:00 -04:00
Joey Hess
a6767ca81f
close bug
and mention another aspect of the reversion in changelog
2021-07-12 10:45:57 -04:00
Joey Hess
6a581f8b8b
fix init reversion when core.sharedRepository = group
init: Fix misbehavior when core.sharedRepository = group that caused it to
enter an adjusted branch. (Reversion in version 8.20210630)

Commit 4b1b9d7a83 made init call
freezeContent in case there was a hook that could prevent writing in
situations where perms don't. But with the above git config, freezeContent
does not prevent write at all. So init needs to do what freezeContent does
with a non-shared git config.

Or init could check for that config, and skip the probing, since it
won't actually be preventing write to any files. But that would make init
too aware if details of Annex.Perms, and also would break if the git config
were changed after init.

Sponsored-by: Dartmouth College's Datalad project
2021-07-12 10:15:49 -04:00
Joey Hess
b885007f0e
--debug output goes to stderr again, not stdout
Reversion in version 8.20210428

Sponsored-by: Dartmouth College's Datalad project
2021-07-12 09:40:38 -04:00
Joey Hess
b9db859221
addurl: Avoid crashing when used on beegfs.
Sponsored-by: Dartmouth College's DANDI project
2021-07-05 13:02:40 -04:00
Joey Hess
d2c48404a8
assistant: Avoid unncessary git repository repair
In a situation where git fsck gets confused about a commit that is made
while it's running.

Sponsored-by: Graham Spencer on Patreon
2021-06-30 18:00:16 -04:00
Joey Hess
fd99ce6c95
releasing package git-annex version 8.20210630 2021-06-30 11:48:33 -04:00
Joey Hess
73ccf34763
closing 2021-06-30 11:47:39 -04:00
Joey Hess
6b0d732746
repair: Fix reversion in version 8.20200522 that prevented fetching missing objects from remotes
In commit dfc4e641b5 git repair was changed
to use remote name, not url, when fetching. But it fetches into a temporary
git repo, which doesn't have remotes configured. Oops.

(In my defense, that commit was made just as covid lockdown started. But
testing? Urk.)

Sponsored-by: Mark Reidenbach on Patreon
2021-06-29 13:15:15 -04:00
Joey Hess
199391befe
make repair interruption safe
Fixed bug that interrupting git-annex repair (or assistant) while it was
fixing repository corruption would lose objects that were contained in pack
files.

Unpack all pack files and move objects into place *before* deleting the
pack files. The old approach moved the pack files to a temp directory
before unpacking them, which was not interruption safe.

Sponsored-By: Jochen Bartl on Patreon
2021-06-29 13:14:28 -04:00
Joey Hess
b8e32e200e
addurl, importfeed: Added --no-raw option
Forces eg, download with youtube-dl without falling back to raw download.

Since youtube-dl failing due to an url not being supported is difficult to
distinguish from it failing due to being blocked in some way, this can be
useful to avoid the fallback of git-annex downloading the raw web page and
adding that.

Since --raw also prevents using special remotes, --no-raw also
allows special remote downloads. Although it's always possible that some
special remote may claim an url and fall back to raw download of the
content, which --no-raw cannot prevent.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-06-27 11:14:51 -04:00
Joey Hess
3a14648142
dropping unused marks as dead
Dropping an object with drop --unused or dropunused will mark it as
dead, preventing fsck --all from complaining about it after it's been
dropped from all repositories.

If another repository still has a copy, it won't be treated as dead
until it's also dropped from there.

The drop has to use --unused, can't be --key or something else, because
this indicates that the user has recently ran git-annex unused. If it
checked the unused log on every drop, bad things would happen when the
unused log was out of date, eg a file used to be unused but then got
re-added. Marking such a file as dead could be confusing. When the user
uses --unused/dropunused, they must consider the unused information to be
up-to-date.

The particular workflow this enables is:

	git annex add foo
	git annex unannex foo
	git annex unused
	git annex drop --unused / dropunused
	git annex fsck --all # no warnings

The docs for git-annex unannex say to use git-annex unused and dropunused,
so the user should be pointed in this direction when they want to undo an
accidental add.

Sponsored-by: Brock Spratlen on Patreon
2021-06-25 15:22:26 -04:00
Joey Hess
df2001aa88
Improve display of errors when transfers fail
Transfers from or to a local git repo could fail without a reason being
given, if the content failed to verify, or if the object file's stat
changed while it was being copied. Now display messages in these cases.

Sponsored-by: Jack Hill on Patreon
2021-06-25 13:17:04 -04:00
Joey Hess
4b1b9d7a83
Added annex.freezecontent-command and annex.thawcontent-command configs
Freeze first sets the file perms, and then runs
freezecontent-command. Thaw runs thawcontent-command before
restoring file permissions. This is in case the freeze command
prevents changing file perms, as eg setting a file immutable does.
Also, changing file perms tends to mess up previously set ACLs.

git-annex init's probe for crippled filesystem uses them, so if file perms
don't work, but freezecontent-command manages to prevent write to a file,
it won't treat the filesystem as crippled.

When the the filesystem has been probed as crippled, the hooks are not
used, because there seems to be no point then; git-annex won't be relying
on locking annex objects down. Also, this avoids them being run when the
file perms have not been changed, in case they somehow rely on
git-annex's setting of the file perms in order to work.

Sponsored-by: Dartmouth College's Datalad project
2021-06-21 14:40:52 -04:00
Joey Hess
1cc7b2661e
push synced/master before synced/git-annex
sync: Partly work around github behavior that first branch to be pushed to
a new repository is assumed to be the head branch, by not pushing
synced/git-annex first.

github expects master (or whatever the name is) to be pushed first, but
git-annex sync can't, because it's got to also support pushes to non-bare
repos where pushing master fails, as explained in the big comment. So
pushing synced/master is not entirely a fix, but at least it makes github
default to a branch with the stuff the user expects in it, not a bunch of
annex log files.

Aside from fixing github to not make this assumption, or improving
the git push protocol to include what the current HEAD is, the only other
approach I can think of is to identify git push's progress messages and
display those when pushing master, while filtering out error messages
about non-fast-forward etc. But git doesn't provide a way to separate out
or identify its progress messages.

Sponsored-by: Luke Shumaker on Patreon
2021-06-21 12:32:21 -04:00
Joey Hess
a6e281e008
releasing package git-annex version 8.20210621 2021-06-21 12:17:46 -04:00