Commit graph

1742 commits

Author SHA1 Message Date
Joey Hess
e853ef3095
decorate openTempFile errors with the template name
This is to track down what file in .git/annex/ is being written to via a
temp file when the repository is read-only.

Sponsored-by: Dartmouth College's Datalad project
2021-08-30 13:05:02 -04:00
Joey Hess
a99a84f342
add: Detect when xattrs or perhaps ACLs prevent locking down a file's content
And fail with an informative message.

I don't think ACLs can prevent removing the write bit, but I'm not sure,
so kept it mentioning them as a possibility.

Should git-annex lock also check if the write bits are able to be removed?
Maybe, but the case I know about with xattrs involves cp -a copying NFS
xattrs, and it's the copy of the file that is the problem. So when locking
a file, I guess it will not be the copy.

Sponsored-by: Dartmouth College's Datalad project
2021-08-27 14:33:01 -04:00
Joey Hess
6d4a728455
Added annex.youtube-dl-command config
This can be used to run some forks of youtube-dl.

Sponsored-by: Brett Eisenberg on Patreon
2021-08-27 09:44:23 -04:00
Joey Hess
4ed36b2634
Fix test suite failure on Windows
It would be better if the Arbitrary instance avoided generating impossible
filenames like "foo/c:bar", but proably this is the only place that splits
the file from the directory and then uses the file without the directory..
At least on the quickcheck properties.

Sponsored-by: Svenne Krap on Patreon
2021-08-24 14:03:29 -04:00
Joey Hess
492036622a
fix OSX build 2021-08-18 16:35:26 -04:00
Joey Hess
d154e7022e
incremental verification for web special remote
Except when configuration makes curl be used. It did not seem worth
trying to tail the file when curl is downloading.

But when an interrupted download is resumed, it does not read the whole
existing file to hash it. Same reason discussed in
commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long
time with no progress being displayed. And also there's an open http
request, which needs to be consumed; taking a long time to hash the file
might cause it to time out.

Also in passing implemented it for git and external special remotes when
downloading from the web. Several others like S3 are within striking
distance now as well.

Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:02:22 -04:00
Joey Hess
88b63a43fa
distinguish between incremental verification failing and not being done
Sponsored-by: Dartmouth College's DANDI project
2021-08-18 14:38:02 -04:00
Joey Hess
325bfda12d
refactor 2021-08-18 13:37:00 -04:00
Joey Hess
449851225a
refactor
IncrementalVerifier moved to Utility.Hash, which will let Utility.Url
use it later.

It's perhaps not really specific to hashing, but making a separate
module just for the data type seemed unncessary.

Sponsored-by: Dartmouth College's DANDI project
2021-08-18 13:19:02 -04:00
Joey Hess
f0754a61f5
plumb VerifyConfig into retrieveKeyFile
This fixes the recent reversion that annex.verify is not honored,
because retrieveChunks was passed RemoteVerify baser, but baser
did not have export/import set up.

Sponsored-by: Dartmouth College's DANDI project
2021-08-17 12:43:13 -04:00
Joey Hess
b1622eb932
incremental verify for directory special remote
Added fileRetriever', which will let the remaining special remotes
eventually also support incremental verify.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 16:51:33 -04:00
Joey Hess
a644f729ce
refactor fileCopier
Sponsored-by: Dartmouth College's DANDI project
2021-08-16 15:56:24 -04:00
Joey Hess
d889ae0c01
move comment 2021-08-16 15:25:06 -04:00
Joey Hess
aac0654ff4
handle AlreadyInUseError
As happens when using the directory special remote, gitlfs, webdav, and
S3. But not external, adb, gcrypt, hook, or rsync.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 15:03:48 -04:00
Joey Hess
c4aba8e032
better handling of finishing up incomplete incremental verify
Now it's run in VerifyStage.

I thought about keeping the file handle open, and resuming reading where
tailVerify left off. But that risks leaking open file handles, until the
GC closes them, if the deferred verification does not get resumed. Since
that could perhaps happen if there's an exception somewhere, I decided
that was too unsafe.

Instead, re-open the file, seek, and resume.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 14:52:59 -04:00
Joey Hess
e0b7f391bd
improve tailVerify
Wait for the file to get modified, not only opened. This way, if a
remote does not support resuming, and opens a new file over top of the
existing file, it will wait until that remote starts writing, and open
the file it's writing to, not the old file.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 14:47:37 -04:00
Joey Hess
e46a7dff6f
fix windows build 2021-08-13 16:36:33 -04:00
Joey Hess
16dd3dd4ca
catch more exceptions
I saw this:

  .git/annex/tmp/SHA256E-s1234376--5ba8e06e0163b217663907482bbed57684d7188024155ddc81da0710dfd2687d: openBinaryFile: resource busy (file is locked)

 guess catching IO exceptions did not catch that one.
2021-08-13 16:16:46 -04:00
Joey Hess
ff2dc5eb18
INotify.removeWatch can crash
Unsure why, possibly if the file has been replaced by another file.
2021-08-13 15:35:18 -04:00
Joey Hess
7503b8448b
inotify reports paths relative to directory being watched
Sponsored-by: Dartmouth College's DANDI project
2021-08-13 14:51:15 -04:00
Joey Hess
e07625df8a
convert tailVerify to not finalize the verification
Added failIncremental so it can force failure to verify.

Sponsored-by: Dartmouth College's DANDI project
2021-08-13 13:39:02 -04:00
Joey Hess
9d533b347f
tailVerify: return deferred action when it gets behind
Sponsored-by: Dartmouth College's DANDI project
2021-08-13 12:32:01 -04:00
Joey Hess
b6efba8139
add tailVerify
Not yet used, but this will let all remotes verify incrementally if it's
acceptable to pay the performance price. See comment for details of when
it will perform badly. I anticipate using this for all special remotes
that use fileRetriever. Except perhaps for a few like GitLFS that could
feed the incremental verifier themselves despite using that.

Sponsored-by: Dartmouth College's DANDI project
2021-08-12 14:38:02 -04:00
Joey Hess
fa62c98910
simplify and speed up Utility.FileSystemEncoding
This eliminates the distinction between decodeBS and decodeBS', encodeBS
and encodeBS', etc. The old implementation truncated at NUL, and the
primed versions had to do extra work to avoid that problem. The new
implementation does not truncate at NUL, and is also a lot faster.
(Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the
primed versions.)

Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation,
and upgrading to it will speed up to/fromRawFilePath.

AFAIK, nothing relied on the old behavior of truncating at NUL. Some
code used the faster versions in places where I was sure there would not
be a NUL. So this change is unlikely to break anything.

Also, moved s2w8 and w82s out of the module, as they do not involve
filesystem encoding really.

Sponsored-by: Shae Erisson on Patreon
2021-08-11 12:13:31 -04:00
Joey Hess
1acdd18ea8
deal better with clock skew situations, using vector clocks
* Deal with clock skew, both forwards and backwards, when logging
  information to the git-annex branch.
* GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1)
  rather than needing to be advanced each time a new change is made.
* Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex.

When changing a file in the git-annex branch, the vector clock to use is now
determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK
when set), and comparing it to the newest vector clock already in use in
that file. If a newer time stamp was already in use, advance it forward by
a second instead.

When the clock is set to a time in the past, this avoids logging with
an old timestamp, which would risk that log line later being ignored in favor
of "newer" line that is really not newer.

When a log entry has been made with a clock that was set far ahead in the
future, this avoids newer information being logged with an older timestamp
and so being ignored in favor of that future-timestamped information.
Once all clocks get fixed, this will result in the vector clocks being
incremented, until finally enough time has passed that time gets back ahead
of the vector clock value, and then it will return to usual operation.

(This latter situation is not ideal, but it seems the best that can be done.
The issue with it is, since all writers will be incrementing the last
vector clock they saw, there's no way to tell when one writer made a write
significantly later in time than another, so the earlier write might
arbitrarily be picked when merging. This problem is why git-annex uses
timestamps in the first place, rather than pure vector clocks.)

Advancing forward by 1 second is somewhat arbitrary. setDead
advances a timestamp by just 1 picosecond, and the vector clock could
too. But then it would interfere with setDead, which wants to be
overrulled by any change. So it could use 2 picoseconds or something,
but that seems weird. It could just as well advance it forward by a
minute or whatever, but then it would be harder for real time to catch
up with the vector clock when forward clock slew had happened.

A complication is that many log files contain several different peices of
information, and it may be best to only use vector clocks for the same peice
of information. For example, a key's location log file contains
InfoPresent/InfoMissing for each UUID, and it only looks at the vector
clocks for the UUID that is being changed, and not other UUIDs.

Although exactly where the dividing line is can be hard to determine.
Consider metadata logs, where a field "tag" can have multiple values set
at different times. Should it advance forward past the last tag?
Probably. What about when a different field is set, should it look at
the clocks of other fields? Perhaps not, but currently it does, and
this does not seems like it will cause any problems.

Another one I'm not entirely sure about is the export log, which is
keyed by (fromuuid, touuid). So if multiple repos are exporting to the
same remote, different vector clocks can be used for that remote.
It looks like that's probably ok, because it does not try to determine
what order things occurred when there was an export conflict.

Sponsored-by: Jochen Bartl on Patreon
2021-08-04 12:33:46 -04:00
Joey Hess
6111958440
fix test suite
14683da9eb caused a test suite failure.
When the content of a key is not present, a LinkAnnexFailed is returned,
but replaceFile then tried to move the file into place, and since it was
not written, that crashed.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-08-02 13:59:23 -04:00
Joey Hess
b3c4579c79
work around strange auto-init bug
git-annex get when run as the first git-annex command in a new repo did not
populate unlocked files. (Reversion in version 8.20210621)

I am not entirely happy with this, because I don't understand how
428c91606b caused the problem in the first
place, and I don't fully understand how skipping calling scanAnnexedFiles
during autoinit avoids the problem.

Kept the explicit call to scanAnnexedFiles during git-annex init,
so that when reconcileStaged is expensive, it can be made to run then,
rather than at some later point when the information is needed.

Sponsored-by: Brock Spratlen on Patreon
2021-07-30 18:36:03 -04:00
Joey Hess
748addbe05
remove second pass in scanAnnexedFiles
The pass was needed to populate files when annex.thin was set,
but in commit 73e0cbbb19,
reconcileStaged started to do that. So, this second pass is not needed
any longer.
2021-07-30 17:46:11 -04:00
Joey Hess
817ccbbc47
split verifyKeyContent
This avoids it calling enteringStage VerifyStage when it's used in
places that only fall back to verification rarely, and which might be
called while in TransferStage and be going to perform a transfer after
the verification.
2021-07-29 13:58:40 -04:00
Joey Hess
897fd5c104
add note 2021-07-29 13:14:03 -04:00
Joey Hess
067a9c70c7
simplify code 2021-07-29 12:28:13 -04:00
Joey Hess
3e0b210039
remove unncessary debugs
Keeping the ones in Annex.InodeSentinal
2021-07-29 12:19:37 -04:00
Joey Hess
73e0cbbb19
fix problem populating pointer files
This is a result of an audit of every use of getInodeCaches,
to find places that misbehave when the annex object is not in the inode
cache, despite pointer files for the same key being in the inode cache.

Unfortunately, that is the case for objects that were in v7 repos that
upgraded to v8. Added a note about this gotcha to getInodeCaches.

Database.Keys.reconcileStaged, then annex.thin is set, would fail to
populate pointer files in this situation. Changed it to check if the
annex object is unmodified the same way inAnnex does, falling back to a
checksum if the inode cache is not recorded.

Sponsored-by: Dartmouth College's Datalad project
2021-07-27 14:26:49 -04:00
Joey Hess
de482c7eeb
move verifyKeyContent to Annex.Verify
The goal is that Database.Keys be able to use it; it can't use
Annex.Content.Presence due to an import loop.

Several other things also needed to be moved to Annex.Verify as a
conseqence.
2021-07-27 14:07:23 -04:00
Joey Hess
14683da9eb
fix potential race in updating inode cache
Some uses of linkFromAnnex are inside replaceWorkTreeFile, which was
already safe, but others use it directly on the work tree file, which
was race-prone. Eg, if the work tree file was first removed, then
linkFromAnnex called to populate it, the user could have re-written it in
the interim.

This came to light during an audit of all calls of addInodeCaches,
looking for such races. All the other uses of it seem ok.

Sponsored-by: Brett Eisenberg on Patreon
2021-07-27 13:08:08 -04:00
Joey Hess
e4b2a067e0
fix potential race in updating inode cache
In Annex.Content, the object file was statted after pointer files were
populated. But if annex.thin is set, once the pointer files are
populated, the object file can potentially be modified via the hard
link. So, it was possible, though seemingly very unlikely, for the inode
of the modified object file to be cached.

Command.Fix and Command.Fsck had similar problems, statting the work
tree files after they were in place. Changed them to stat the temp file
that gets moved into place. This does rely on .git/annex being on the
same filesystem. If it's not, the cached inode will not be the same as
the one that the temp file gets moved to. Result will be that git-annex
will later need to do an expensive verification of the content of the
worktree files. Note that the cross-filesystem move of the temp file
already is a larger amount of extra work, so this seems acceptable.

Sponsored-by: Luke Shumaker on Patreon
2021-07-27 12:29:10 -04:00
Joey Hess
3b5a3e168d
check if object is modified before starting to send it
Fix bug that caused some transfers to incorrectly fail with "content
changed while it was being sent", when the content was not changed.

While I don't know how to reproduce the problem that several people
reported, it is presumably due to the inode cache somehow being stale.
So check isUnmodified', and if it's not modified, include the file's
current inode cache in the set to accept, when checking for modification
after the transfer.

That seems like the right thing to do for another reason: The failure
says the file changed while it was being sent, but if the object file was
changed before the transfer started, that's wrong. So it needs to check
before allowing the transfer at all if the file is modified.

(Other calls to sameInodeCache or elemInodeCaches, when operating on inode
caches from the database, could also be problimatic if the inode cache is
somehow getting stale. This does not address such problems.)

Sponsored-by: Dartmouth College's Datalad project
2021-07-26 17:33:49 -04:00
Joey Hess
f195f3b541
more inode cache debugging 2021-07-26 12:57:35 -04:00
Joey Hess
0073384850
add debugging in sameInodeCache 2021-07-26 10:58:07 -04:00
Joey Hess
33a80d083a
sync --quiet
* sync: When --quiet is used, run git commit, push, and pull without
  their ususual output.
* merge: When --quiet is used, run git merge without its usual output.

This might also make --quiet work better for some other commands
that make commits, like git-annex adjust.

Sponsored-by: Kevin Mueller on Patreon
2021-07-19 11:28:47 -04:00
Joey Hess
635e7f3e26
split annexLocations
To avoid mistakes like commit 0ccbed4f6f,
be explicit about the two variants of this.

Incidentially avoids a small amount of overhead in calling reverse.

Sponsored-by: Shae Erisson on Patreon
2021-07-16 14:17:56 -04:00
Joey Hess
0ccbed4f6f
fix oops
dd31fe7b9e broke non-bare repos by using
bare hash dirs first, oops
2021-07-15 21:01:07 -04:00
Joey Hess
dd31fe7b9e
fall back to checking lower case hash directories in normal repo
Fix a bug that prevented getting content from a repository that started out
as a bare repository, or had annex.crippledfilesystem set, and was
converted to a non-bare repository.

This unfortunately means that inAnnex check gets slowed down by a stat call
in normal repos when the content is not present. Oh well, such is the cost
of backwards compatability with old mistakes.

Sponsored-by: Mark Reidenbach on Patreon
2021-07-15 12:16:31 -04:00
Joey Hess
6a581f8b8b
fix init reversion when core.sharedRepository = group
init: Fix misbehavior when core.sharedRepository = group that caused it to
enter an adjusted branch. (Reversion in version 8.20210630)

Commit 4b1b9d7a83 made init call
freezeContent in case there was a hook that could prevent writing in
situations where perms don't. But with the above git config, freezeContent
does not prevent write at all. So init needs to do what freezeContent does
with a non-shared git config.

Or init could check for that config, and skip the probing, since it
won't actually be preventing write to any files. But that would make init
too aware if details of Annex.Perms, and also would break if the git config
were changed after init.

Sponsored-by: Dartmouth College's Datalad project
2021-07-12 10:15:49 -04:00
Joey Hess
9905ec19a7
add pointer to annex.security.allowed-url-schemes
Sponsored-by: Kevin Mueller on Patreon
2021-07-02 10:53:45 -04:00
Joey Hess
3a14648142
dropping unused marks as dead
Dropping an object with drop --unused or dropunused will mark it as
dead, preventing fsck --all from complaining about it after it's been
dropped from all repositories.

If another repository still has a copy, it won't be treated as dead
until it's also dropped from there.

The drop has to use --unused, can't be --key or something else, because
this indicates that the user has recently ran git-annex unused. If it
checked the unused log on every drop, bad things would happen when the
unused log was out of date, eg a file used to be unused but then got
re-added. Marking such a file as dead could be confusing. When the user
uses --unused/dropunused, they must consider the unused information to be
up-to-date.

The particular workflow this enables is:

	git annex add foo
	git annex unannex foo
	git annex unused
	git annex drop --unused / dropunused
	git annex fsck --all # no warnings

The docs for git-annex unannex say to use git-annex unused and dropunused,
so the user should be pointed in this direction when they want to undo an
accidental add.

Sponsored-by: Brock Spratlen on Patreon
2021-06-25 15:22:26 -04:00
Joey Hess
df2001aa88
Improve display of errors when transfers fail
Transfers from or to a local git repo could fail without a reason being
given, if the content failed to verify, or if the object file's stat
changed while it was being copied. Now display messages in these cases.

Sponsored-by: Jack Hill on Patreon
2021-06-25 13:17:04 -04:00
Joey Hess
51c696679f
avoid using temp file size when deciding whether to retry failed transfer
When stall detection is enabled, and a transfer is in progress,
it would display a doubled message:

(transfer already in progress, or unable to take transfer lock) (transfer already in progress, or unable to take transfer lock)

That happened because the forward retry decider had a start size of 0,
and an end size of whatever amount of the object the other process had
downloaded. So it incorrectly thought that the transferrer process had
made progress, when it had in fact immediately given up with that
message.

Instead, use the reported value from the progress meter. If a remote
does not report progress, this will mean it doesn't forward retry, in a
situation where it used to. But most remotes do report progress, and any
remote that does not can be fixed to, by using watchFileSize when
downloading. Also, some remotes might preallocate the temp file (eg
bittorrent), so relying on statting its size at this level to get
progress is dubious.

The same change was made to Annex/Transfer.hs, although only
Annex/TransferrerPool.hs needed to be changed to avoid the duplicate
message.

(An alternate fix would have been to start the retry decider with the
size of the object file before downloading begins, rather than 0.)

Sponsored-by: Brett Eisenberg on Patreon
2021-06-25 12:04:23 -04:00
Joey Hess
0fe550af75
fix windows build 2021-06-22 09:46:06 -04:00
Joey Hess
4b1b9d7a83
Added annex.freezecontent-command and annex.thawcontent-command configs
Freeze first sets the file perms, and then runs
freezecontent-command. Thaw runs thawcontent-command before
restoring file permissions. This is in case the freeze command
prevents changing file perms, as eg setting a file immutable does.
Also, changing file perms tends to mess up previously set ACLs.

git-annex init's probe for crippled filesystem uses them, so if file perms
don't work, but freezecontent-command manages to prevent write to a file,
it won't treat the filesystem as crippled.

When the the filesystem has been probed as crippled, the hooks are not
used, because there seems to be no point then; git-annex won't be relying
on locking annex objects down. Also, this avoids them being run when the
file perms have not been changed, in case they somehow rely on
git-annex's setting of the file perms in order to work.

Sponsored-by: Dartmouth College's Datalad project
2021-06-21 14:40:52 -04:00