Commit graph

1767 commits

Author SHA1 Message Date
Joey Hess
a0758bdd10
dynamically disable filter-process in restagePointerFile when it would be slower
Based on my earlier benchmark, I have a rough cost model for how
expensive it is for git-annex smudge to be run on a file, vs
how expensive it is for a gigabyte of a file's content to be read and
piped through to filter-process.

So, using that cost model, it can decide if using filter-process will
be more or less expensive than running the smudge filter on the files to
be restaged.

It turned out to be *really* annoying to temporarily disable
filter-process. I did find a way, but urk, this is horrible. Notice
that, if it's interrupted with it disabled, it will remain disabled
until the next time restagePointerFile runs. Which could be some time
later. If the user runs `git add` or `git checkout` on a lot of small
files before that, they will see slower than expected performance.

(This commit also deletes where I wrote down the benchmark results
earlier.)

Sponsored-by: Noam Kremen on Patreon
2021-11-08 16:20:34 -04:00
Joey Hess
837025b14f
Revert "disable filter.annex.process in restagePointerFile"
This reverts commit afe327ac49.

Unfortunately, disabling it by setting it to "" does not work, git
then ignores filter.annex.smudge/clean, and does not pass files through
git-annex at all.

I don't think there is a way to temporarily disable this git config
from the git command line. Which seems like a bug in git.

So, it may be more expensive than anticipated to enable
filter.annex.process, since git checkout etc will pipe all annexed files
being checked out through it.
2021-11-05 12:43:33 -04:00
Joey Hess
afe327ac49
disable filter.annex.process in restagePointerFile
This means git will run git-annex smudge --clean once per file that is
restaged, which can be slow. But probably *not* as slow as git feeding
all the content of annexed files you've gotten through a pipe to
git-annex filter-process.

The only time this is probably not ideal is after a drop of a bunch of
files, when filter-process would be faster.
2021-11-04 15:20:26 -04:00
Joey Hess
a3cdff3fd5
add a comment about checkSaneLock
See commit 8c2dd7d8ee for original
introduction of it, but needing to spelunk that far back to understand
the code is not good.
2021-10-27 14:55:30 -04:00
Joey Hess
55bfa414b3
move transfer already in progress message to warning
This makes it be displayed in the error-messages field with
--json-error-messages. And with --quiet, it will let it be displayed,
which makes sense because it's telling the user why what they requested
to do has failed to happen.
2021-10-27 14:46:21 -04:00
Joey Hess
669037862a
avoid redundant freezeContent call
This opens the potential for the object file to be in place but
git-annex is interrupted before it can freeze it. git-annex fsck already
fixes that situation, which can also occur when lockContentForRemoval
thaws content.

Also improve comment to not be Windows-specific.
2021-10-27 14:18:10 -04:00
Reiko Asakura
0db7297f00
Call freezeContent after move into annex
This change better supports Windows ACL management using
annex.freezecontent-command and annex.thawcontent-command and matches
the behaviour of adding an unlocked file.

By calling freezeContent after the file has moved into the annex,
the file's delete permission can be denied. If the file's delete
permission is denied before moving into the annex, the file cannot
be moved or deleted. If the file's delete permission is not denied after
moving into the annex, it will likely inherit a grant for the delete
permission which allows it to be deleted irrespective of the permissions
of the parent directory.
2021-10-27 14:05:57 -04:00
Joey Hess
5a9e6b1fd4
when private journal file exists, still read from git-annex branch
Fix bug that caused stale git-annex branch information to read when
annex.private or remote.name.annex-private is set.

The private journal file should not prevent reading more current
information from the git-annex branch, but used to.

Note that, overBranchFileContents has to do additional work now, when
there's a private journal file, it reads from the branch redundantly
and more slowly.

Sponsored-by: Jack Hill on Patreon
2021-10-26 13:43:50 -04:00
Joey Hess
0f38ad9a69
close keys db to possibly work around WSL1 issue 2021-10-19 13:07:49 -04:00
Joey Hess
887edeb1ad
avoid warning when built with unix-compat 0.5.3
It re-exports modificationTimeHiRes, and provides a windows version.

Might be worth using that windows version eventually, but I have not
tested it.
2021-10-18 16:25:28 -04:00
Joey Hess
69f8e6c7c0
ImportableContentsChunkable
This improves the borg special remote memory usage, by
letting it only load one archive's worth of filenames into memory at a
time, and building up a larger tree out of the chunks.

When a borg repository has many archives, git-annex could easily OOM
before. Now, it will use only memory proportional to the number of
annexed keys in an archive.

Minor implementation wart: Each new chunk re-opens the content
identifier database, and also a new vector clock is used for each chunk.
This is a minor innefficiency only; the use of continuations makes
it hard to avoid, although putting the database handle into a Reader
monad would be one way to fix it.

It may later be possible to extend the ImportableContentsChunkable
interface to remotes that are not third-party populated. However, that
would perhaps need an interface that does not use continuations.

The ImportableContentsChunkable interface currently does not allow
populating the top of the tree with anything other than subtrees. It
would be easy to extend it to allow putting files in that tree, but borg
doesn't need that so I left it out for now.

Sponsored-by: Noam Kremen on Patreon
2021-10-08 13:15:22 -04:00
Joey Hess
19e78816f0
convert Key to ShortByteString
This adds the overhead of a copy when serializing and deserializing keys.
I have not benchmarked much, but runtimes seem barely changed at all by that.

When a lot of keys are in memory, it improves memory use.

And, it prevents keys sometimes getting PINNED in memory and failing to GC,
which is a problem ByteString has sometimes. In particular, git-annex sync
from a borg special remote had that problem and this improved its memory
use by a large amount.

Sponsored-by: Shae Erisson on Patreon
2021-10-05 20:20:08 -04:00
Joey Hess
9012fa0187
reinject: Fix crash when reinjecting a file from outside the repository
Commit 4bf7940d6b introduced this
problem, but was otherwise doing a good thing. Problem being
that fileRef "/foo" used to return ":./foo", which was actually wrong,
but as long as there was no foo in the local repository, catKey
could operate on it without crashing. After that fix though, fileRef
would return eg "../../foo", resulting in fileRef returning
":./../../foo", which will make git cat-file crash since that's
not a valid path in the repo.

Fix is simply to make fileRef detect paths outside the repo and return
Nothing. Then catKey can be skipped. This needed several bugfixes to
dirContains as well, in previous commits.

In Command.Smudge, this led to needing to check for Nothing. That case
should actually never happen, because the fileoutsiderepo check will
detect it earlier.

Sponsored-by: Brock Spratlen on Patreon
2021-10-01 14:06:34 -04:00
Joey Hess
b9aa2ce8d1
resume properly when copying a file to/from a local git remote is interrupted (take 2)
This method avoids breaking test_readonly. Just check if the dest file
exists, and avoid CoW probing when it does, so when CoW probing fails,
it can resume where the previous non-CoW copy left off.

If CoW has been probed already to work, delete the dest file
since a CoW copy will presumably work. It seems like it would be almost
as good to just skip CoW copying in this case too, but consider that the
dest file might have started to be copied from some other remote, not
using CoW, but CoW has been probed to work to copy from the current
place.

Sponsored-by: Dartmouth College's Datalad project
2021-09-27 16:03:01 -04:00
Joey Hess
7ccf642863
revert change that broke test_readonly
commit 63d508e885 broke test_readonly.
When a local git remote is readonly, tryCopyCoW run to copy a file
from it failed at withOtherTmp.

Sponsored-by: Dartmouth College's Datalad project
2021-09-27 16:02:41 -04:00
Joey Hess
e47b4badb3
separate handles for cat-file and cat-file --batch-check
This avoids starting one process when only the other one is needed.
Eg in git-annex smudge --clean, this reduces the total number of
cat-file processes that are started from 4 to 2.

The only performance penalty is that when both are needed, it has to do
twice as much work to maintain the two Maps. But both are very small,
consisting of 1 or 2 items, so that work is negligible.

Sponsored-by: Dartmouth College's Datalad project
2021-09-24 13:16:13 -04:00
Joey Hess
798b33ba3d
simplify annex.bwlimit handling
RemoteGitConfig parsing looks for annex.bwlimit when a remote
does not have a per-remote config for it, so no need for a separate
gobal config.

Sponsored-by: Svenne Krap on Patreon
2021-09-22 10:52:01 -04:00
Joey Hess
05a097cde8
Merge branch 'master' into bwlimit 2021-09-22 10:48:27 -04:00
Joey Hess
4fef94d764
simplify annex.stalldetection handling
RemoteGitConfig parsing looks for annex.stalldetection when a remote
does not have a per-remote config for it, so no need for a separate
gobal config.

Sponsored-by: Noam Kremen on Patreon
2021-09-22 10:46:10 -04:00
Joey Hess
63d508e885
resume properly when copying a file to/from a local git remote is interrupted
Probably this fixes a reversion, but I don't know what version broke it.

This does use withOtherTmp for a temp file that could be quite large.
Though albeit a reflink copy that will not actually take up any space
as long as the file it was copied from still exists. So if the copy cow
succeeds but git-annex is interrupted just before that temp file gets
renamed into the usual .git/annex/tmp/ location, there is a risk that
the other temp directory ends up cluttered with a larger temp file than
later. It will eventually be cleaned up, and the changes of this being
a problem are small, so this seems like an acceptable thing to do.

Sponsored-by: Shae Erisson on Patreon
2021-09-21 17:43:35 -04:00
Joey Hess
18e00500ce
bwlimit
Added annex.bwlimit and remote.name.annex-bwlimit config that works for git
remotes and many but not all special remotes.

This nearly works, at least for a git remote on the same disk. With it set
to 100kb/1s, the meter displays an actual bandwidth of 128 kb/s, with
occasional spikes to 160 kb/s. So it needs to delay just a bit longer...
I'm unsure why.

However, at the beginning a lot of data flows before it determines the
right bandwidth limit. A granularity of less than 1s would probably improve
that.

And, I don't know yet if it makes sense to have it be 100ks/1s rather than
100kb/s. Is there a situation where the user would want a larger
granularity? Does granulatity need to be configurable at all? I only used that
format for the config really in order to reuse an existing parser.

This can't support for external special remotes, or for ones that
themselves shell out to an external command. (Well, it could, but it
would involve pausing and resuming the child process tree, which seems
very hard to implement and very strange besides.) There could also be some
built-in special remotes that it still doesn't work for, due to them not
having a progress meter whose displays blocks the bandwidth using thread.
But I don't think there are actually any that run a separate thread for
downloads than the thread that displays the progress meter.

Sponsored-by: Graham Spencer on Patreon
2021-09-21 16:58:10 -04:00
Joey Hess
ec12537774
defer write permissions checking in import until after copy to repo
This should complete the fix started in
6329997ac4, fixing the actual cause of the
test suite failure this time.

Sponsored-by: Dartmouth College's Datalad project
2021-09-02 13:45:21 -04:00
Joey Hess
bd5494bb9c
fix windows build 2021-09-02 12:21:25 -04:00
Joey Hess
4f42292b13
improve url download failure display
* When downloading urls fail, explain which urls failed for which
  reasons.
* web: Avoid displaying a warning when downloading one url failed
  but another url later succeeded.

Some other uses of downloadUrl use urls that are effectively internal use,
and should not all be displayed to the user on failure. Eg, Remote.Git
tries different urls where content could be located depending on how the
remote repo is set up. Exposing those urls to the user would lead to wild
goose chases. So had to parameterize it to control whether it displays urls
or not.

A side effect of this change is that when there are some youtube urls
and some regular urls, it will try regular urls first, even if the
youtube urls are listed first. This seems like an improvement if
anything, but in any case there's no defined order of urls that it's
supposed to use.

Sponsored-by: Dartmouth College's Datalad project
2021-09-01 15:33:38 -04:00
Joey Hess
6329997ac4
init: check for filesystem where write bit cannot be removed
This fixes a reversion caused by a99a84f342,
when git-annex init is run as root on a FAT filesystem mounted with
hdiutil on OSX. Such a mount point has file mode 777 for everything and
it cannot be changed. The existing crippled filesystem test tried to
write to a file after removing write bit, but that test does not run as
root (since root can write to unwritable files). So added a check of the
write permissions of the file, after attempting to remove them.

Sponsored-by: Dartmouth College's Datalad project
2021-09-01 10:27:28 -04:00
Joey Hess
e853ef3095
decorate openTempFile errors with the template name
This is to track down what file in .git/annex/ is being written to via a
temp file when the repository is read-only.

Sponsored-by: Dartmouth College's Datalad project
2021-08-30 13:05:02 -04:00
Joey Hess
a99a84f342
add: Detect when xattrs or perhaps ACLs prevent locking down a file's content
And fail with an informative message.

I don't think ACLs can prevent removing the write bit, but I'm not sure,
so kept it mentioning them as a possibility.

Should git-annex lock also check if the write bits are able to be removed?
Maybe, but the case I know about with xattrs involves cp -a copying NFS
xattrs, and it's the copy of the file that is the problem. So when locking
a file, I guess it will not be the copy.

Sponsored-by: Dartmouth College's Datalad project
2021-08-27 14:33:01 -04:00
Joey Hess
6d4a728455
Added annex.youtube-dl-command config
This can be used to run some forks of youtube-dl.

Sponsored-by: Brett Eisenberg on Patreon
2021-08-27 09:44:23 -04:00
Joey Hess
4ed36b2634
Fix test suite failure on Windows
It would be better if the Arbitrary instance avoided generating impossible
filenames like "foo/c:bar", but proably this is the only place that splits
the file from the directory and then uses the file without the directory..
At least on the quickcheck properties.

Sponsored-by: Svenne Krap on Patreon
2021-08-24 14:03:29 -04:00
Joey Hess
492036622a
fix OSX build 2021-08-18 16:35:26 -04:00
Joey Hess
d154e7022e
incremental verification for web special remote
Except when configuration makes curl be used. It did not seem worth
trying to tail the file when curl is downloading.

But when an interrupted download is resumed, it does not read the whole
existing file to hash it. Same reason discussed in
commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long
time with no progress being displayed. And also there's an open http
request, which needs to be consumed; taking a long time to hash the file
might cause it to time out.

Also in passing implemented it for git and external special remotes when
downloading from the web. Several others like S3 are within striking
distance now as well.

Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:02:22 -04:00
Joey Hess
88b63a43fa
distinguish between incremental verification failing and not being done
Sponsored-by: Dartmouth College's DANDI project
2021-08-18 14:38:02 -04:00
Joey Hess
325bfda12d
refactor 2021-08-18 13:37:00 -04:00
Joey Hess
449851225a
refactor
IncrementalVerifier moved to Utility.Hash, which will let Utility.Url
use it later.

It's perhaps not really specific to hashing, but making a separate
module just for the data type seemed unncessary.

Sponsored-by: Dartmouth College's DANDI project
2021-08-18 13:19:02 -04:00
Joey Hess
f0754a61f5
plumb VerifyConfig into retrieveKeyFile
This fixes the recent reversion that annex.verify is not honored,
because retrieveChunks was passed RemoteVerify baser, but baser
did not have export/import set up.

Sponsored-by: Dartmouth College's DANDI project
2021-08-17 12:43:13 -04:00
Joey Hess
b1622eb932
incremental verify for directory special remote
Added fileRetriever', which will let the remaining special remotes
eventually also support incremental verify.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 16:51:33 -04:00
Joey Hess
a644f729ce
refactor fileCopier
Sponsored-by: Dartmouth College's DANDI project
2021-08-16 15:56:24 -04:00
Joey Hess
d889ae0c01
move comment 2021-08-16 15:25:06 -04:00
Joey Hess
aac0654ff4
handle AlreadyInUseError
As happens when using the directory special remote, gitlfs, webdav, and
S3. But not external, adb, gcrypt, hook, or rsync.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 15:03:48 -04:00
Joey Hess
c4aba8e032
better handling of finishing up incomplete incremental verify
Now it's run in VerifyStage.

I thought about keeping the file handle open, and resuming reading where
tailVerify left off. But that risks leaking open file handles, until the
GC closes them, if the deferred verification does not get resumed. Since
that could perhaps happen if there's an exception somewhere, I decided
that was too unsafe.

Instead, re-open the file, seek, and resume.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 14:52:59 -04:00
Joey Hess
e0b7f391bd
improve tailVerify
Wait for the file to get modified, not only opened. This way, if a
remote does not support resuming, and opens a new file over top of the
existing file, it will wait until that remote starts writing, and open
the file it's writing to, not the old file.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 14:47:37 -04:00
Joey Hess
e46a7dff6f
fix windows build 2021-08-13 16:36:33 -04:00
Joey Hess
16dd3dd4ca
catch more exceptions
I saw this:

  .git/annex/tmp/SHA256E-s1234376--5ba8e06e0163b217663907482bbed57684d7188024155ddc81da0710dfd2687d: openBinaryFile: resource busy (file is locked)

 guess catching IO exceptions did not catch that one.
2021-08-13 16:16:46 -04:00
Joey Hess
ff2dc5eb18
INotify.removeWatch can crash
Unsure why, possibly if the file has been replaced by another file.
2021-08-13 15:35:18 -04:00
Joey Hess
7503b8448b
inotify reports paths relative to directory being watched
Sponsored-by: Dartmouth College's DANDI project
2021-08-13 14:51:15 -04:00
Joey Hess
e07625df8a
convert tailVerify to not finalize the verification
Added failIncremental so it can force failure to verify.

Sponsored-by: Dartmouth College's DANDI project
2021-08-13 13:39:02 -04:00
Joey Hess
9d533b347f
tailVerify: return deferred action when it gets behind
Sponsored-by: Dartmouth College's DANDI project
2021-08-13 12:32:01 -04:00
Joey Hess
b6efba8139
add tailVerify
Not yet used, but this will let all remotes verify incrementally if it's
acceptable to pay the performance price. See comment for details of when
it will perform badly. I anticipate using this for all special remotes
that use fileRetriever. Except perhaps for a few like GitLFS that could
feed the incremental verifier themselves despite using that.

Sponsored-by: Dartmouth College's DANDI project
2021-08-12 14:38:02 -04:00
Joey Hess
fa62c98910
simplify and speed up Utility.FileSystemEncoding
This eliminates the distinction between decodeBS and decodeBS', encodeBS
and encodeBS', etc. The old implementation truncated at NUL, and the
primed versions had to do extra work to avoid that problem. The new
implementation does not truncate at NUL, and is also a lot faster.
(Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the
primed versions.)

Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation,
and upgrading to it will speed up to/fromRawFilePath.

AFAIK, nothing relied on the old behavior of truncating at NUL. Some
code used the faster versions in places where I was sure there would not
be a NUL. So this change is unlikely to break anything.

Also, moved s2w8 and w82s out of the module, as they do not involve
filesystem encoding really.

Sponsored-by: Shae Erisson on Patreon
2021-08-11 12:13:31 -04:00
Joey Hess
1acdd18ea8
deal better with clock skew situations, using vector clocks
* Deal with clock skew, both forwards and backwards, when logging
  information to the git-annex branch.
* GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1)
  rather than needing to be advanced each time a new change is made.
* Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex.

When changing a file in the git-annex branch, the vector clock to use is now
determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK
when set), and comparing it to the newest vector clock already in use in
that file. If a newer time stamp was already in use, advance it forward by
a second instead.

When the clock is set to a time in the past, this avoids logging with
an old timestamp, which would risk that log line later being ignored in favor
of "newer" line that is really not newer.

When a log entry has been made with a clock that was set far ahead in the
future, this avoids newer information being logged with an older timestamp
and so being ignored in favor of that future-timestamped information.
Once all clocks get fixed, this will result in the vector clocks being
incremented, until finally enough time has passed that time gets back ahead
of the vector clock value, and then it will return to usual operation.

(This latter situation is not ideal, but it seems the best that can be done.
The issue with it is, since all writers will be incrementing the last
vector clock they saw, there's no way to tell when one writer made a write
significantly later in time than another, so the earlier write might
arbitrarily be picked when merging. This problem is why git-annex uses
timestamps in the first place, rather than pure vector clocks.)

Advancing forward by 1 second is somewhat arbitrary. setDead
advances a timestamp by just 1 picosecond, and the vector clock could
too. But then it would interfere with setDead, which wants to be
overrulled by any change. So it could use 2 picoseconds or something,
but that seems weird. It could just as well advance it forward by a
minute or whatever, but then it would be harder for real time to catch
up with the vector clock when forward clock slew had happened.

A complication is that many log files contain several different peices of
information, and it may be best to only use vector clocks for the same peice
of information. For example, a key's location log file contains
InfoPresent/InfoMissing for each UUID, and it only looks at the vector
clocks for the UUID that is being changed, and not other UUIDs.

Although exactly where the dividing line is can be hard to determine.
Consider metadata logs, where a field "tag" can have multiple values set
at different times. Should it advance forward past the last tag?
Probably. What about when a different field is set, should it look at
the clocks of other fields? Perhaps not, but currently it does, and
this does not seems like it will cause any problems.

Another one I'm not entirely sure about is the export log, which is
keyed by (fromuuid, touuid). So if multiple repos are exporting to the
same remote, different vector clocks can be used for that remote.
It looks like that's probably ok, because it does not try to determine
what order things occurred when there was an export conflict.

Sponsored-by: Jochen Bartl on Patreon
2021-08-04 12:33:46 -04:00