Commit graph

1276 commits

Author SHA1 Message Date
Joey Hess
68257e9076
add git-annex filter-process
filter-process: New command that can make git add/checkout faster when
there are a lot of unlocked annexed files or non-annexed files, but that
also makes git add of large annexed files slower.

Use it by running: git
config filter.annex.process 'git-annex filter-process'

Fully tested and working, but I have not benchmarked it at all.
And, incremental hashing is not done when git add uses it, so extra work is
done in that case.

Sponsored-by: Mark Reidenbach on Patreon
2021-11-04 15:02:36 -04:00
Joey Hess
438e5b56aa
tighter --json parsing for metadata
metadata --batch --json: Reject input whose "fields" does not consist of
arrays of strings. Such invalid input used to be silently ignored.

Used to be that parseJSON for a JSONActionItem ran parseJSON separately
for the itemAdded, and if that failed, did not propagate the error. That
allowed different items with differently named fields to be parsed.
But it was actually only used to parse "fields" for metadata, so that
flexability is not needed.

The fix is just to parse "fields" as-is. AddJSONActionItemFields is needed
only because of the wonky way Command.MetaData adds onto the started json
object.

Note that this line got a dummy type signature added,
just because the type checker needs it to be some type.
itemFields = Nothing :: Maybe Bool
Since it's Nothing, it doesn't really matter what type it is,
and the value gets turned into json and is then thrown away.

Sponsored-by: Kevin Mueller on Patreon
2021-11-01 14:42:37 -04:00
Joey Hess
80f1354685
metadata --batch: Avoid crashing when a non-annexed file is input
Turns out that CommandStart actions do not have their exceptions caught,
which is why the giveup was causing a crash. Mostly these actions
do not do very much work on their own, but it does seem possible there
are other commands whose CommandStart also throws an exception.

So, my first attempt at a fix was to catch those exceptions. But,
--json-error-messages then causes a difficulty, because in order to output
a json error message, an action needs to have been started; that sets up
the json object that the error message will be included in a field of.

While it would be possible to output an object with just an error field,
this would be json output of a format that the user has no reason to
expect, that happens only in an exceptional circumstance. That is something
I have always wanted to avoid with the json output; while git-annex man
pages don't document what the json looks like, the output has always
been made to be self-describing. Eg, it includes "error-messages":[]
even when there's no errors.

With that ruled out, it doesn't seem a good idea to catch CommandStart
exceptions and display the error to stderr when --json-error-messages
is set. And so I don't know if it makes sense to catch exceptions from that
at all. Maybe I'd have a different opinion if --json-error-messages did not
exist though.

So instead, output a blank line like other batch commands do.
This also leaves open the possibility of implementing support for matching
object with metadata --json, which would also want to output a blank line
when the input didn't match.

Sponsored-by: Dartmouth College's DANDI project
2021-11-01 13:40:43 -04:00
Joey Hess
c260833a6b
releasing package git-annex version 8.20211028 2021-10-28 12:00:56 -04:00
Joey Hess
eb95ed4863
fix addurl concurrency issue
addurl: Support adding the same url to multiple files at the same time when
using -J with --batch --with-files.

Implementation was easier than expected, was able to reuse OnlyActionOn.

While it will download the url's content multiple times, that seems like
the best thing to do; see my comment for why.

Sponsored-by: Dartmouth College's DANDI project
2021-10-27 16:15:41 -04:00
Joey Hess
669037862a
avoid redundant freezeContent call
This opens the potential for the object file to be in place but
git-annex is interrupted before it can freeze it. git-annex fsck already
fixes that situation, which can also occur when lockContentForRemoval
thaws content.

Also improve comment to not be Windows-specific.
2021-10-27 14:18:10 -04:00
Joey Hess
0756625e1b
update, bugfix also fixed git-annex info 2021-10-27 12:22:02 -04:00
Joey Hess
b2c48fb86b
Fix using lookupkey inside a subdirectory
Caused by dirContains ".." "foo" being incorrectly False.

Also added a test of dirContains, which includes all the previous bug fixes
I could find and some obvious cases.

Reversion in version 8.20211011

Sponsored-by: Brett Eisenberg on Patreon
2021-10-26 15:00:45 -04:00
Joey Hess
5a9e6b1fd4
when private journal file exists, still read from git-annex branch
Fix bug that caused stale git-annex branch information to read when
annex.private or remote.name.annex-private is set.

The private journal file should not prevent reading more current
information from the git-annex branch, but used to.

Note that, overBranchFileContents has to do additional work now, when
there's a private journal file, it reads from the branch redundantly
and more slowly.

Sponsored-by: Jack Hill on Patreon
2021-10-26 13:43:50 -04:00
Joey Hess
2801528eb2
oops, I misread, still happens for adjusted branches 2021-10-20 13:45:56 -04:00
Joey Hess
f7b5a5c9ed
changelog
A user tested 0f38ad9a69 on WSL, and it
seems to have fixed the problem.
2021-10-20 13:26:01 -04:00
Joey Hess
f4bdecc4ec
improve sqlite MultiWriter handling of read after write
This removes a messy caveat that was easy to forget and caused at least one
bug. The price paid is that, after a write to a MultiWriter db, it has to
close the db connection that it had been using to read, and open a new
connection. So it might be a little bit slower. But, writes are usually
batched together, so there's often only a single write, and so there should
not be much of a slowdown. Notice that SingleWriter already closed the db
connection after a write, so paid the same overhead.

This is the second try at fixing a bug: git-annex get when run as the first
git-annex command in a new repo did not populate all unlocked files.
(Reversion in version 8.20210621)

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-10-19 15:13:29 -04:00
Joey Hess
29d687dce9
When retrival from a chunked remote fails, display the error that occurred when downloading the chunk
Rather than the error that occurred when trying to download the unchunked
content, which is less likely to actually be stored in the remote.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-10-14 12:45:05 -04:00
Joey Hess
b36cc0320e
avoid crashing tilde expansion on user who does not exist
git does not crash when there's a remote configured for a user who does
not exist, and this prevents git-annex from crashing too. Consider that
a user might exist on one system but not another, and the git repo be
moved between systems. So not crashing is desirable.

Note that git fetch seems to mishandle a remote path like ~foo/bar
when the user does not exist. While it does access ./~foo/bar,
and gets as far as running git-upload-pack on the path,
it then complains there is no such repo. So different parts of git seem
to be doing different things in that edge case. Anyway, git-annex does
not need to be bug-for-bug compatible with git.

Sponsored-by: Jack Hill on Patreon
2021-10-13 09:16:36 -04:00
Joey Hess
c2a44eab50
move gpg tmp home to system temp dir
test: Put gpg temp home directory in system temp directory, not filesystem
being tested.

Since I've found indications gpg can fail talking to the agent when the
socket ends up on eg, fat. And to hopefully fix this bug report I've
followed up on.

The main risk in using the system temp dir is that TMPDIR could be set to a
long directory path, which is too long to put a unix socket in. To
partially amelorate that risk, it uses either an absolute or a relative
path, whichever is shorter. (Hopefully gpg will not convert it to a longer
form of the path..)

If the user sets TMPDIR to something so long a path to it +
"S.gpg-agent" is too long, I suppose that's their issue to deal with.

Sponsored-by: Dartmouth College's Datalad project
2021-10-12 13:29:56 -04:00
Joey Hess
17a0fa3dbc
negotiate P2P protocol version for tor remotes
This negotiation is not supported by versions of git-annex older
than 6.20180312. Well, maybe really 6.20180227 or so, but using that in
the changelog simplifies things since it was the version for the other
changes as well.

See commit c81768d425 for the back story.

As well as allowing for future protocol improvements, this will result
in negoatiating protocol version 1, which is an improvement over default
version 0.

In fact, it looks like no supported version of git-annex will use
protocol version 0, since version 1 was introduced in 6.20180227.
Still, removing the code for version 0 seems unncessary.
See commit 31e1adc005.

Sponsored-by: Brett Eisenberg on Patreon.
2021-10-11 15:58:51 -04:00
Joey Hess
f8816d2b92
remove list of removed commands
the list was wrong and also users shouldn't need to know
2021-10-11 15:43:19 -04:00
Joey Hess
7bdc7350a5
remove git-annex-shell compat code
* Removed support for accessing git remotes that use versions of
  git-annex older than 6.20180312.
* git-annex-shell: Removed several commands that were only needed to
  support git-annex versions older than 6.20180312.
  (lockcontent, recvkey, sendkey, transferinfo, commit)

The P2P protocol was added in that version, and used ever since, so
this code was only needed for interop with older versions.

"git-annex-shell commit" is used by newer git-annex versions, though
unnecessarily so, because the p2pstdio command makes a single commit at
shutdown. Luckily, it was run with stderr and stdout sent to /dev/null,
and non-zero exit status or other exceptions are caught and ignored. So,
that was able to be removed from git-annex-shell too.

git-annex-shell inannex, recvkey, sendkey, and dropkey are still used by
gcrypt special remotes accessed over ssh, so those had to be kept.
It would probably be possible to convert that to using the P2P protocol,
but it would be another multi-year transition.

Some git-annex-shell fields were able to be removed. I hoped to remove
all of them, and the very concept of them, but unfortunately autoinit
is used by git-annex sync, and gcrypt uses remoteuuid.

The main win here is really in Remote.Git, removing piles of hairy fallback
code.

Sponsored-by: Luke Shumaker
2021-10-11 15:36:51 -04:00
Joey Hess
e28cf82b45
releasing package git-annex version 8.20211011 2021-10-11 12:53:17 -04:00
Joey Hess
022bb6174c
Merge branch 'borgchunks' 2021-10-08 13:26:45 -04:00
Joey Hess
69f8e6c7c0
ImportableContentsChunkable
This improves the borg special remote memory usage, by
letting it only load one archive's worth of filenames into memory at a
time, and building up a larger tree out of the chunks.

When a borg repository has many archives, git-annex could easily OOM
before. Now, it will use only memory proportional to the number of
annexed keys in an archive.

Minor implementation wart: Each new chunk re-opens the content
identifier database, and also a new vector clock is used for each chunk.
This is a minor innefficiency only; the use of continuations makes
it hard to avoid, although putting the database handle into a Reader
monad would be one way to fix it.

It may later be possible to extend the ImportableContentsChunkable
interface to remotes that are not third-party populated. However, that
would perhaps need an interface that does not use continuations.

The ImportableContentsChunkable interface currently does not allow
populating the top of the tree with anything other than subtrees. It
would be easy to extend it to allow putting files in that tree, but borg
doesn't need that so I left it out for now.

Sponsored-by: Noam Kremen on Patreon
2021-10-08 13:15:22 -04:00
Joey Hess
1c11dd4793
avoid cursor jitter when updating progress display
When the progress display gets longer, and then shorter again, it causes
the cursor to jitter back and forth. Somehow I never noticed this until
this morning, but then it became intolerable to watch.

To fix it, pad the progress display to the maximum length it's occupied.

Sponsored-by: Svenne Krap on Patreon
2021-10-07 11:16:41 -04:00
Joey Hess
45dfddd33f
convert ExportLocation to ShortByteString to avoid PINNED memory fragmentation
This adds the overhead of a copy whenever converting to/from ExportLocation and
ImportLocation.

borg: Some improvements to memory use when importing a lot of archives.
(It's still pretty bad.)

Sponsored-by: Mark Reidenbach on Patreon
2021-10-05 14:51:55 -04:00
Joey Hess
9012fa0187
reinject: Fix crash when reinjecting a file from outside the repository
Commit 4bf7940d6b introduced this
problem, but was otherwise doing a good thing. Problem being
that fileRef "/foo" used to return ":./foo", which was actually wrong,
but as long as there was no foo in the local repository, catKey
could operate on it without crashing. After that fix though, fileRef
would return eg "../../foo", resulting in fileRef returning
":./../../foo", which will make git cat-file crash since that's
not a valid path in the repo.

Fix is simply to make fileRef detect paths outside the repo and return
Nothing. Then catKey can be skipped. This needed several bugfixes to
dirContains as well, in previous commits.

In Command.Smudge, this led to needing to check for Nothing. That case
should actually never happen, because the fileoutsiderepo check will
detect it earlier.

Sponsored-by: Brock Spratlen on Patreon
2021-10-01 14:06:34 -04:00
Joey Hess
b9a1cc512d
avoid uncessary call to inAnnex
sync --content: Avoid a redundant checksum of a file that was
incrementally verified, when used on NTFS and perhaps other filesystems.

When sync has just gotten the content, it does not need to check inAnnex a
second time. On NTFS, for some reason the write of the inode cache after
it gets the content is not immediately able to be read, and with an
empty/non-matching inode cache due to that stale data, inAnnex falls back
to hashing the whole object to determine if it's present.

Sponsored-by: Brock Spratlen on Patreon
2021-10-01 12:02:35 -04:00
Joey Hess
b9aa2ce8d1
resume properly when copying a file to/from a local git remote is interrupted (take 2)
This method avoids breaking test_readonly. Just check if the dest file
exists, and avoid CoW probing when it does, so when CoW probing fails,
it can resume where the previous non-CoW copy left off.

If CoW has been probed already to work, delete the dest file
since a CoW copy will presumably work. It seems like it would be almost
as good to just skip CoW copying in this case too, but consider that the
dest file might have started to be copied from some other remote, not
using CoW, but CoW has been probed to work to copy from the current
place.

Sponsored-by: Dartmouth College's Datalad project
2021-09-27 16:03:01 -04:00
Joey Hess
7ccf642863
revert change that broke test_readonly
commit 63d508e885 broke test_readonly.
When a local git remote is readonly, tryCopyCoW run to copy a file
from it failed at withOtherTmp.

Sponsored-by: Dartmouth College's Datalad project
2021-09-27 16:02:41 -04:00
Joey Hess
9ea8106bb0
sped up git-annex smudge --clean by 25%
Disabling git-annex branch update for this command is
ok, because it does not use any information from the branch,
but only logs the location when it adds a key.

Sponsored-by: Dartmouth College's Datalad project
2021-09-24 14:15:20 -04:00
Joey Hess
e8496d62e4
improved bwrate limiting implementation
New method is much better. Avoids unrestrained transfer at the beginning
(except for the first block. Keeps right at or a few kb/s below the
configured limit, with very little varation in the actual reported bandwidth.

Removed the /s part of the config as it's not needed.

Ready to merge.

Sponsored-by: Luke Shumaker on Patreon
2021-09-22 15:27:16 -04:00
Joey Hess
05a097cde8
Merge branch 'master' into bwlimit 2021-09-22 10:48:27 -04:00
Joey Hess
55b405a965
fix remote git config vs global git config order
Bug fix: Git configs such as annex.verify were incorrectly overriding
per-remote git configs such as remote.name.annex-verify.

This dates all the way back to 2013,
commit 8a5b397ac4,
where hlint apparently somehow confused me into parsing in the wrong
order. Before that it was correct.

Amazing noone has noticed until now.

Sponsored-by: Kevin Mueller on Patreon
2021-09-22 10:41:56 -04:00
Joey Hess
63d508e885
resume properly when copying a file to/from a local git remote is interrupted
Probably this fixes a reversion, but I don't know what version broke it.

This does use withOtherTmp for a temp file that could be quite large.
Though albeit a reflink copy that will not actually take up any space
as long as the file it was copied from still exists. So if the copy cow
succeeds but git-annex is interrupted just before that temp file gets
renamed into the usual .git/annex/tmp/ location, there is a risk that
the other temp directory ends up cluttered with a larger temp file than
later. It will eventually be cleaned up, and the changes of this being
a problem are small, so this seems like an acceptable thing to do.

Sponsored-by: Shae Erisson on Patreon
2021-09-21 17:43:35 -04:00
Joey Hess
18e00500ce
bwlimit
Added annex.bwlimit and remote.name.annex-bwlimit config that works for git
remotes and many but not all special remotes.

This nearly works, at least for a git remote on the same disk. With it set
to 100kb/1s, the meter displays an actual bandwidth of 128 kb/s, with
occasional spikes to 160 kb/s. So it needs to delay just a bit longer...
I'm unsure why.

However, at the beginning a lot of data flows before it determines the
right bandwidth limit. A granularity of less than 1s would probably improve
that.

And, I don't know yet if it makes sense to have it be 100ks/1s rather than
100kb/s. Is there a situation where the user would want a larger
granularity? Does granulatity need to be configurable at all? I only used that
format for the config really in order to reuse an existing parser.

This can't support for external special remotes, or for ones that
themselves shell out to an external command. (Well, it could, but it
would involve pausing and resuming the child process tree, which seems
very hard to implement and very strange besides.) There could also be some
built-in special remotes that it still doesn't work for, due to them not
having a progress meter whose displays blocks the bandwidth using thread.
But I don't think there are actually any that run a separate thread for
downloads than the thread that displays the progress meter.

Sponsored-by: Graham Spencer on Patreon
2021-09-21 16:58:10 -04:00
Joey Hess
9f38ecac1e
borg: Avoid trying to extract xattrs, ACLS, and bsdflags
when retrieving from a borg repository

That broke restoring on linux from a borg backup made on OSX.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-09-03 12:10:14 -04:00
Joey Hess
1a586e473b
releasing package git-annex version 8.20210903 2021-09-03 12:01:12 -04:00
Joey Hess
ec12537774
defer write permissions checking in import until after copy to repo
This should complete the fix started in
6329997ac4, fixing the actual cause of the
test suite failure this time.

Sponsored-by: Dartmouth College's Datalad project
2021-09-02 13:45:21 -04:00
Joey Hess
4f42292b13
improve url download failure display
* When downloading urls fail, explain which urls failed for which
  reasons.
* web: Avoid displaying a warning when downloading one url failed
  but another url later succeeded.

Some other uses of downloadUrl use urls that are effectively internal use,
and should not all be displayed to the user on failure. Eg, Remote.Git
tries different urls where content could be located depending on how the
remote repo is set up. Exposing those urls to the user would lead to wild
goose chases. So had to parameterize it to control whether it displays urls
or not.

A side effect of this change is that when there are some youtube urls
and some regular urls, it will try regular urls first, even if the
youtube urls are listed first. This seems like an improvement if
anything, but in any case there's no defined order of urls that it's
supposed to use.

Sponsored-by: Dartmouth College's Datalad project
2021-09-01 15:33:38 -04:00
Joey Hess
837116ef1e
Fix support for readonly git remotes
Boolean blindness oops.

(Reversion in version 8.20210621)

Sponsored-by: Dartmouth College's Datalad project
2021-08-30 12:34:19 -04:00
Joey Hess
a99a84f342
add: Detect when xattrs or perhaps ACLs prevent locking down a file's content
And fail with an informative message.

I don't think ACLs can prevent removing the write bit, but I'm not sure,
so kept it mentioning them as a possibility.

Should git-annex lock also check if the write bits are able to be removed?
Maybe, but the case I know about with xattrs involves cp -a copying NFS
xattrs, and it's the copy of the file that is the problem. So when locking
a file, I guess it will not be the copy.

Sponsored-by: Dartmouth College's Datalad project
2021-08-27 14:33:01 -04:00
Joey Hess
e17342b2a0
Run cp -a with --no-preserve=xattr, to avoid problems with copied xattrs
Including them breaking permissions setting on some NFS servers.

Sponsored-by: Dartmouth College's Datalad project
2021-08-27 13:09:34 -04:00
Joey Hess
6d4a728455
Added annex.youtube-dl-command config
This can be used to run some forks of youtube-dl.

Sponsored-by: Brett Eisenberg on Patreon
2021-08-27 09:44:23 -04:00
Joey Hess
ab7b5a492c
--batch-keys
New --batch-keys option added to these commands:  get, drop, move, copy, whereis

git-annex-matching-options had to be reworded since some of its options
can be used to match on keys, not only files.

Sponsored-by: Luke Shumaker on Patreon
2021-08-25 14:21:12 -04:00
Joey Hess
4ed36b2634
Fix test suite failure on Windows
It would be better if the Arbitrary instance avoided generating impossible
filenames like "foo/c:bar", but proably this is the only place that splits
the file from the directory and then uses the file without the directory..
At least on the quickcheck properties.

Sponsored-by: Svenne Krap on Patreon
2021-08-24 14:03:29 -04:00
Joey Hess
f9b92c81f6
unused: Skip the refs/annex/last-index ref that git-annex recently started creating
This was unlikely to cause any problem, but it is unsightly to mention
normally hidden refs, and it might have done a bit of unnecessary work to
check that ref.

Sponsored-by: Noam Kremen on Patreon
2021-08-24 12:58:14 -04:00
Joey Hess
53744e132d
incremental verification for gitlfs and httpalso
And that should be all the special remotes supporting it on linux now,
except for in the odd edge case here and there.

Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:17:10 -04:00
Joey Hess
f5e09a1dbe
incremental verification for S3
Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:07:00 -04:00
Joey Hess
d154e7022e
incremental verification for web special remote
Except when configuration makes curl be used. It did not seem worth
trying to tail the file when curl is downloading.

But when an interrupted download is resumed, it does not read the whole
existing file to hash it. Same reason discussed in
commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long
time with no progress being displayed. And also there's an open http
request, which needs to be consumed; taking a long time to hash the file
might cause it to time out.

Also in passing implemented it for git and external special remotes when
downloading from the web. Several others like S3 are within striking
distance now as well.

Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:02:22 -04:00
Joey Hess
8613770b06
incremental verify for webdav special remote
Sponsored-by: Dartmouth College's DANDI project
2021-08-16 17:29:32 -04:00
Joey Hess
b1622eb932
incremental verify for directory special remote
Added fileRetriever', which will let the remaining special remotes
eventually also support incremental verify.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 16:51:33 -04:00
Joey Hess
ec82299730
status update
I was wrong about S3 supporting tailVerify.
2021-08-16 15:15:32 -04:00