Commit graph

1610 commits

Author SHA1 Message Date
Joey Hess
6d83bcff0f
Fix behavior of onlyingroup
Sponsored-by: k0ld on Patreon
2023-08-07 13:05:11 -04:00
Joey Hess
d19139a10d
releasing package git-annex version 10.20230802 2023-08-02 16:09:14 -04:00
Joey Hess
6da6449fff
stack.yaml: Update to build with ghc-9.6.2 and aws-0.24
This enables some new features that need the new aws.

Use http-client-restricted-0.1.0 because it uses the crypton side of the
cryptonite/crypton fork, which seems to be needed for ghc-9.6.2.

Dependency on connection removed because of the cryptonite/crypton fork.
This avoids needing a build flag. It was only used to throw a typed
exception in Utility.Url, which nothing depended on.

Used a fork of bloomfilter because it's not being maintained and no longer
builds as-of this ghc version. (I have been trying to contact its
maintainer about it, and emailed him today suggesting I take over the
package.)

Sponsored-by: Brock Spratlen on Patreon
2023-08-01 18:53:26 -04:00
Joey Hess
68c9b08faf
fix build with unix-2.8.0
Changed the parameters to openFd. So needed to add a small wrapper
library to keep supporting older versions as well.
2023-08-01 18:41:27 -04:00
Joey Hess
fb640bc2f4
support building with unix-compat 0.7
It removed System.PosixCompat.User.
2023-08-01 15:17:43 -04:00
Joey Hess
393275c105
Setup.hs: Stop installing man pages, desktop files, and the git-annex-shell and git-remote-tor-annex symlinks
Anything still relying on that, eg via cabal v1-install will need to
change to using make install-home. Which was added back in 2019 in
6491b62614 because cabal new-build
(now the default) already didn't use Setup in a way that let its
installation of those things work.

Notably this means Setup does not need to depend on unix-compat, which is
useful because in 0.7 it removed System.PosixCompat.User, which Setup
needed to determine where to install the desktop files. See
https://github.com/haskell-pkg-janitors/unix-compat/issues/3
2023-08-01 15:08:56 -04:00
Joey Hess
fa92383993
onlyingroup
* Support "onlyingroup=" in preferred content expressions.
* Support --onlyingroup= matching option.

Sponsored-by: Jack Hill on Patreon
2023-07-31 14:43:58 -04:00
Joey Hess
518a51a8a0
--explain for preferred/required content matching
And annex.largefiles and annex.addunlocked.

Also git-annex matchexpression --explain explains why its input
expression matches or fails to match.

When there is no limit, avoid explaining why the lack of limit
matches. This is also done when no preferred content expression is set,
although in a few cases it defaults to a non-empty matcher, which will
be explained.

Sponsored-by: Dartmouth College's DANDI project
2023-07-26 14:50:04 -04:00
Joey Hess
f25eeedeac
initial implementation of --explain
Currently it only displays explanations of options like --in and --copies.

In the future, it should explain preferred content expression evaluation
and other decisions.

The explanations of a few things could be better. In particular,
"standard" will just appear as-is (or as "!standard" if it doesn't
match), rather than explaining why the standard preferred content expression
for the group matches or not.

Currently as implemented, it goes to stdout, and so commands like
git-annex find that have custom output will not display --explain
information. Perhaps that should change, dunno.

Sponsored-by: Dartmouth College's DANDI project
2023-07-25 16:52:57 -04:00
Joey Hess
2807ab0a09
gcrypt: Remove empty hash directories when dropping content
As was recently done with the directory special remote.

Note that the top directory passed to removeDirGeneric was changed to
avoid deleting .git/annex or .git/annex/objects if they ended up empty.

Sponsored-by: Brett Eisenberg on Patreon
2023-07-21 16:04:11 -04:00
Joey Hess
3b34266e9e
typo 2023-07-21 15:36:01 -04:00
Joey Hess
b15366494a
directory: Remove empty hash directories when dropping content
Failure to remove is not treated as a problem, and no permissions
modifications are done, to avoid unexpected states.

Sponsored-by: Luke Shumaker on Patreon
2023-07-21 14:57:29 -04:00
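The cleanup described in b15366494a (and repeated for gcrypt in 2807ab0a09) can be illustrated with a minimal Python sketch. It is not git-annex's Haskell implementation; `remove_empty_parents` and `demo` are hypothetical names. The sketch assumes the behavior stated in the commit messages: empty directories are removed from the hash directory upward, the top directory itself is never removed, failures are ignored, and no permissions are changed.

```python
import os
import tempfile

def remove_empty_parents(path, top):
    """Remove empty directories from `path` upward, stopping before `top`.

    A failed removal (directory not empty, permission denied, ...) is not
    treated as a problem: the walk simply stops. Permissions are never changed.
    """
    top = os.path.abspath(top)
    current = os.path.abspath(path)
    # Only continue while `current` is strictly inside `top`.
    while current != top and os.path.commonpath([current, top]) == top:
        try:
            os.rmdir(current)  # raises OSError if the directory is not empty
        except OSError:
            break
        current = os.path.dirname(current)

def demo():
    """Drop the only file in top/abc/def, then clean up the hash directories."""
    with tempfile.TemporaryDirectory() as top:
        hashdir = os.path.join(top, "abc", "def")
        os.makedirs(hashdir)
        keyfile = os.path.join(hashdir, "KEY")
        open(keyfile, "w").close()
        os.remove(keyfile)  # stands in for dropping the content
        remove_empty_parents(hashdir, top)
        return os.path.exists(os.path.join(top, "abc")), os.path.isdir(top)
```

Passing the top directory explicitly is what keeps the walk from deleting a parent such as .git/annex/objects when it happens to end up empty.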
Joey Hess
7f38355860
dropunused: Support --jobs
Sponsored-by: Kevin Mueller on Patreon
2023-07-21 14:03:34 -04:00
Joey Hess
33ba537728
deal with Amazon S3 breaking change for public=yes
* S3: Amazon S3 buckets created after April 2023 do not support ACLs,
  so public=yes cannot be used with them. Existing buckets configured
  with public=yes will keep working.
* S3: Allow setting publicurl=yes without public=yes, to support
  buckets that are configured with a Bucket Policy that allows public
  access.

Sponsored-by: Joshua Antonishen on Patreon
2023-07-21 13:59:07 -04:00
Joey Hess
7fc6503812
fix waiting for all started feed downloads with -J
importfeed bug fix: When -J was used with multiple feeds, some feeds did
not get their items downloaded.

In my case, I had added a feed to the end of the list, and no items from it
were ever downloaded.

Sponsored-by: Leon Schuermann on Patreon
2023-07-11 22:08:35 -04:00
Joey Hess
e82823d448
nub list of files
yt-dlp when resumed was observed having written the same filename twice
into the file list. Perhaps once by the first download and once by the
resumed one?
2023-07-09 14:18:25 -04:00
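"nub" in e82823d448 refers to removing duplicates from a list while keeping the first occurrence of each item, as Haskell's `Data.List.nub` does. A rough Python equivalent, with made-up file names, might look like this:

```python
def nub(items):
    """Return `items` without duplicates, keeping first occurrences in order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

# A resumed download that recorded the same (hypothetical) filename twice.
file_list = ["video.mp4", "video.mp4", "video.info.json"]
unique_files = nub(file_list)
```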
Joey Hess
51b24aac91
importfeed: Add feedurl to the metadata
(And allow it to be used in the --template although that seems unlikely to
be very useful.)

My use case for this is that one of the podcast feeds I subscribe to is
sometimes leaking episodes of some other podcast. The other podcast is also
very close to spam, so this may be a form of intentional spamming. I have
not been able to catch the podcast feed containing those episodes, so I
don't know which one is at fault. So putting this in the metadata will let
me eventually catch it.
2023-07-06 00:11:38 -04:00
Joey Hess
adb09117f1
propigateAdjustedCommits: avoid overwriting diverged original branch
Bug fix: Re-running git-annex adjust or sync when in an adjusted branch
would overwrite the original branch, losing any commits that had been made
to it since the adjusted branch was created.

When git-annex adjust is run in this situation, it will display a warning
about the diverged branches.

When git-annex sync is run in this situation, mergeToAdjustedBranch
will merge the changes from the original branch to the adjusted branch.
So it does not need to display the divergence warning.

Note that for some reason, I'm needing to run sync twice for that to
happen. The first run does not do the merge and the second does. I'm unsure
why and so am not fully done with this bug.

Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
2023-07-05 17:09:49 -04:00
Joey Hess
a05bc6a314
Fix breakage when git is configured with safe.bareRepository = explicit
Running git config --list inside .git then fails, so better to only
do that when --git-dir was specified explicitly. Otherwise, when the
repository is not bare, run the command inside the working tree.

Also make init detect when the uuid it just set cannot be read and fail
with an error, in case git changes something that breaks this later.

I still don't actually understand why git-annex add/assist -J2 was
affected but -J1 was not. But I did show that it was skipping writing to
the location log, because the uuid was NoUUID.

Sponsored-by: Graham Spencer on Patreon
2023-07-05 14:43:14 -04:00
Joey Hess
3c1d18cb3b
assist: With --jobs, parallelize transferring content to/from remotes
Command.Add.seek starts concurrency with CommandStages. And for
Command.Sync, it needs TransferStages. So, to get both types of concurrency
for the two different parts, it either needs to change the type of
concurrency in between, or just call startConcurrency once for each.

It seems safe enough to call startConcurrency twice, because it does shut
down concurrency (mostly) at the end, and eg the old Annex.workers get
emptied.

Sponsored-by: unqueued on Patreon
2023-07-05 12:47:30 -04:00
Joey Hess
e1fc9e204e
added git-annex satisfy
This ended up having an interface like sync, rather than like get/copy/drop.
That let it be implemented in terms of sync, which took a lot less code.
Also, it lets it handle many of the edge cases that sync does, such as
getting files that are not visible in a --hide-missing branch, and sending
files to exporttree remotes.

As well as being easier to implement, `git-annex satisfy myremote` makes
sense as it satisfies the preferred content settings of the remote.
`git-annex satisfy somefile` does not form a sentence that makes sense. So
while -C can be a little bit annoying, it still makes sense to have this
syntax.

Note that, while I initially thought this would also satisfy numcopies, it
does not. Arguably it ought to. But, sync does not send files in order to
satisfy numcopies, it only sends files to satisfy preferred content. And
it's important that this transfer the same files as sync does, because
it will probably be used in a workflow where the user sometimes syncs and
sometimes satisfies, and does not expect satisfy to do things that sync
would not do.

(Also opened a new bug that also affects sync et al, not only this command.)

Sponsored-by: Nicholas Golder-Manning on Patreon
2023-06-29 15:34:53 -04:00
Joey Hess
1b9958f4fd
document git-annex satisfy 2023-06-29 14:15:01 -04:00
Joey Hess
d5c6197791
diffdriver: Added --text option for easy diffing of the contents of annexed text files
This was already possible, but it was rather hard to come up with the
complex shell command needed.

Note that the diff output starts with "diff a/... b/...".
I left off the "--git" because it's not a git format diff.
2023-06-28 15:27:16 -04:00
Joey Hess
fbd4dbaafe
fix some typos
Anarcat fixed these in the news file, so transferred it over
2023-06-28 13:15:06 -04:00
Joey Hess
d98aa35b3b
reinject: Added --guesskeys option
Sponsored-by: Noam Kremen on Patreon
2023-06-26 14:05:31 -04:00
Joey Hess
a8779f4c2a
prep release 2023-06-26 10:41:36 -04:00
Joey Hess
928b2a4839
create journal directory in withJournalHandle
Fixes a crash by git-annex repair when .git/annex/journal/ does not exist.

Normally the journal directory is created before withJournalHandle gets
run, but git-annex repair can be run in a situation where it does not
exist.
2023-06-21 15:23:59 -04:00
Joey Hess
bad444342e
reorder and condense 2023-06-21 13:48:31 -04:00
Joey Hess
3cec932bb5
changelog 2023-06-21 12:51:33 -04:00
Joey Hess
a861d56428
httpalso: Support being used with special remotes that use chunking.
Sponsored-by: k0ld on Patreon
2023-06-20 13:35:28 -04:00
Joey Hess
958c2fa6d2
Improve resuming interrupted download when using yt-dlp or youtube-dl
Fixes a failure like this:

curl: (33) HTTP server doesn't seem to support byte ranges. Cannot resume.

That happens because the whole web page has already been downloaded
previously, and kept, so now addurl tries to download it, and curl asks the
server to resume from the last byte. And youtube.com can't, for whatever
stupid reason.

So, delete the temp file after determining that youtube-dl can be used.
2023-06-19 15:01:47 -04:00
Joey Hess
a36a81dea3
Improve resuming interrupted download when using yt-dlp
Sometimes resuming an interrupted download will fail to resume and download
more files with different names. That resulted in the workdir having
multiple files at the end, which causes git-annex to give up because it
does not know what was downloaded.

To fix this, use a yt-dlp feature, which appends to a file the name of each
file after it's finished downloading it. So the presence of other cruft in
the workdir will not confuse git-annex.
2023-06-19 14:39:08 -04:00
Joey Hess
217a6abb19
assistant: Fix a crash when a small file is deleted immediately after being created
git add will fail if the file got deleted in the meantime. And since it was
queued, there was a window until the queue flushed where a deletion of the
file would cause a crash.

Instead, reuse Command.Add.addFile, which sha1 hashes the file itself
immediately, and then queues the index update. Ignore exceptions that will
happen if the file got deleted already.

Sponsored-by: k0ld on Patreon
2023-06-19 12:44:56 -04:00
Joey Hess
114a2d7504
Fix display when run with -J1
Commit b6642dde8a broke it by enabling
non-concurrent display mode while leaving concurrency set in the config
and having already started concurrency earlier.

(I don't actually know if that commit was a good idea.)

Sponsored-By: Brett Eisenberg on Patreon
2023-06-15 10:07:54 -04:00
Joey Hess
64738ea157
config: Added the --show-origin and --for-file options
* config: Added the --show-origin and --for-file options.
* config: Support annex.numcopies and annex.mincopies.

There is a little bit of redundancy here with other code elsewhere that
combines the various configs and selects which to use. But really only
for the special case of annex.numcopies, which is a git config that does
not override the annex branch setting and for annex.mincopies, which does
not have a git config but does have gitattributes settings as well as the
annex branch setting.

That seems small enough, and unlikely enough to grow into a mess that it was
worth supporting annex.numcopies and annex.mincopies in git-annex config
--show-origin. Because these settings are a prime thing that someone might
get confused about and want to know where they were configured.

And, it followed that git-annex config might as well support those two
for --set and --get as well. While this is redundant with the specialized
commands, it's only a little code and it makes it more consistent.

Note that --set does not have as nice output as numcopies/mincopies
commands in some special cases like setting to 0 or a negative number.
It does avoid setting to a bad value thanks to the smart
constructors (eg configuredNumCopies).

As for other git-annex branch configurations that are not set by git-annex
config, things like trust and wanted that are specific to a repository
don't map to a git config name, so don't really fit into git-annex config.
And they are only configured in the git-annex branch with no local override
(at least so far), so --show-origin would not be useful for them.

Sponsored-by: Dartmouth College's DANDI project
2023-06-12 16:24:31 -04:00
Joey Hess
38153ad340
assistant: Add dotfiles to git by default, unless annex.dotfiles is configured
The same as git-annex add does.

Sponsored-by: Luke Shumaker on Patreon
2023-06-12 13:25:04 -04:00
Joey Hess
c33c226abd
fixed 2023-06-09 16:13:52 -04:00
Joey Hess
a0ab425c95
add ContentIndentifiersCidRemoteKeyIndex
Optimise database to further speed up importing large trees from special
remotes.

See comment for details of why the other index didn't help cid queries.

It would probably be better to manually create an index on only cid, rather
than adding a second uniqueness constraint that is a larger index. But
persistent does not support creating indexes, and an attempt to manually add
it to the migration failed.

Sponsored-by: Nicholas Golder-Manning on Patreon
2023-06-09 15:12:33 -04:00
Joey Hess
6821ba8dab
sync: use log to track adjusted branch needs updating
Speeds up sync in an adjusted branch by avoiding re-adjusting the branch
unnecessarily, particularly when it is adjusted with --hide-missing or
--unlock-present.

When there are a lot of files, that was the majority of the time of a
--no-content sync.

Uses a log file, which is updated when content presence changes. This
adds a little bit of overhead to every file get/drop when on such an
adjusted branch. The overhead is minimal for get of any size of file,
but might be noticeable for drop in some cases. It seems like a reasonable
trade-off. It would be possible to update the log file only at the end, but
then it would not happen if the command is interrupted.

When not in an adjusted branch, there should be no additional overhead.
(getCurrentBranch is an MVar read, and it avoids the MVar read of
getGitConfig.)

Note that this does not deal with situations such as:
git checkout master, git-annex get, git checkout adjusted branch,
git-annex sync. The sync won't know that the adjusted branch needs to be
updated. Dealing with that would add overhead to operation in non-adjusted
branches, which I don't like. Also, there are other situations like having
two adjusted branches that both need to be updated like this, and switching
between them and sync not updating.

This does mean a behavior change to sync, since it did previously deal
with those situations. But, the documentation did not say that it did.
The man pages only talk about sync updating the adjusted branch after
it transfers content.

I did consider making sync keep track of content it transferred (and
dropped) and only update the adjusted branch then, not to catch up to other
changes made previously. That would perform better. But it seemed rather
hard to implement, and also it would have problems with races with a
concurrent get/drop, which this implementation avoids.

And it seemed pretty likely someone had gotten used to get/drop followed by
sync updating the branch. It seems much less likely someone is switching
branches, doing get/drop, and then switching back and expecting sync to update
the branch.

Re-running git-annex adjust still does a full re-adjusting of the branch,
for anyone who needs that.

Sponsored-by: Leon Schuermann on Patreon
2023-06-08 14:35:41 -04:00
Joey Hess
3c15e0f7a0
cache negative lookups of global numcopies and mincopies
Speeds up eg git-annex sync --content by up to 50%. When it does not need
to transfer or drop anything, it now noops a lot more quickly.

I didn't see anything else in sync --content noop loop that could really
be sped up. It has to cat git objects to keys, stat object files, etc.

Sponsored-by: unqueued on Patreon
2023-06-06 14:43:25 -04:00
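The idea behind 3c15e0f7a0, caching negative lookups, can be sketched in Python as follows. This is only an illustration under assumed names (`CachedSetting`, `load_global_numcopies`), not the actual git-annex code: the point is that "not configured" is remembered just like a real value, so the expensive lookup runs at most once.

```python
class CachedSetting:
    """Cache the result of a lookup, including a negative (None) result."""

    _NOT_LOADED = object()  # sentinel distinct from None ("not configured")

    def __init__(self, load):
        self._load = load
        self._cached = self._NOT_LOADED

    def get(self):
        if self._cached is self._NOT_LOADED:
            # None is stored too, so a missing setting is not looked up again.
            self._cached = self._load()
        return self._cached

lookups = []

def load_global_numcopies():
    """Hypothetical expensive lookup; returns None when nothing is set."""
    lookups.append(1)
    return None

numcopies = CachedSetting(load_global_numcopies)
first = numcopies.get()
second = numcopies.get()
```

Using a separate sentinel instead of `None` for "not loaded yet" is what makes the negative result cacheable.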
Joey Hess
cfad0def18
wrap 2023-06-05 15:15:20 -04:00
Joey Hess
fe1b2dfb4b
speed up very first tree import by 25%
Reading from the cidsdb is responsible for about 25% of the runtime of
an import. Since the cidmap is used to store the same information in
ram, the cidsdb is not written to during an import any longer. And so,
if it started off empty (and updateFromLog wasn't needed), those reads
can just be skipped.

This is kind of a cheesy optimisation, since after any import from any
special remote, the database will no longer be empty, so it's a single
use optimisation. But it's probably not uncommon to start by importing a
lot of files, and it can save a lot of time then.

Sponsored-by: Brock Spratlen on Patreon
2023-06-02 13:30:30 -04:00
Joey Hess
40017089f2
use importChanges optimisation
Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.

Benchmarks:

Importing from a special remote that has 10000 files, that have all been
imported before, and 1 new file sped up from 26.06 to 2.59 seconds.

An import with no change and 10000 unchanged files sped up from 24.3 to
1.99 seconds.

Going up to 20000 files, an import with no changes sped up from
125.95 to 3.84 seconds.

Sponsored-by: k0ld on Patreon
2023-06-01 13:47:00 -04:00
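Processing only changed files, as 40017089f2 describes, amounts to diffing the previous listing of the remote against the current one. The following Python sketch shows the general technique; it assumes each listing is a mapping from file name to a content identifier, and `changed_files` is a hypothetical helper rather than git-annex's importChanges.

```python
def changed_files(previous, current):
    """Compare two listings (file name -> content identifier).

    Returns the entries that are new or modified, plus the names that
    were removed. Unchanged files are skipped entirely.
    """
    added_or_modified = {
        name: cid for name, cid in current.items()
        if previous.get(name) != cid
    }
    removed = sorted(set(previous) - set(current))
    return added_or_modified, removed

# Listings with made-up names and identifiers.
previous = {"a.txt": "cid1", "b.txt": "cid2", "c.txt": "cid3"}
current = {"a.txt": "cid1", "b.txt": "cid9", "d.txt": "cid4"}
to_import, to_remove = changed_files(previous, current)
```

With 10000 unchanged files and 1 new file, only the single new entry would reach the expensive import step, which is consistent with the kind of speedup the benchmarks above report.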
Joey Hess
f6aa097a39
avoid import writing to cidsdb initially
Speed up importing trees from special remotes somewhat by avoiding
redundant writes to sqlite database.

Before, import would write to both the git-annex branch and also to the
sqlite database. But then the next time it was run, needsUpdateFromLog
would see the branch had changed, so run updateFromLog, which would make
the same writes to the sqlite database a second time.

Now import writes only to the git-annex branch. The next time it's run,
needsUpdateFromLog sees that the branch has changed and so calls
updateFromLog, which updates the sqlite database.

Why defer the write to the sqlite database like this? It seems that it
could write to the database as it goes, and at the end call
recordAnnexBranchTree to indicate that the information in the git-annex
branch has all been written to the cidsdb. That would avoid the second
import doing extra work.

But, there could be other processes running at the same time, and one of
them may update the git-annex branch, eg merging a remote git-annex branch
into it. Any cids logs on that merged git-annex branch would not be
reflected in the cidsdb yet. If the import then called
recordAnnexBranchTree, the cidsdb would never get updated with that merged
information.

I don't think there's a good way to prevent, or to detect that situation.
So, it can't call recordAnnexBranchTree at the end. So it might as well
wait until the next run and do updateFromLog then. It could instead do
updateFromLog at the end, but it's going to check needsUpdateFromLog
at the beginning anyway.

Note that the database writes were queued, so there is already a cidmap
that is used to remember changes that the current process has made.
So, omitting database writes can't change the behavior of the current
process.

Also note that thirdpartypopulatedimport uses recordcidkeyindb, which
reflects what it already did. That code path does not use the cidmap,
but does not need to query it either. It might be possible to make that
code path also only update the git-annex branch and not the db, but I
haven't checked.

Sponsored-by: Noam Kremen on Patreon
2023-05-30 17:05:28 -04:00
Joey Hess
5070087a63
repair: Fix handling of git ref names on Windows
Sponsored-by: Kevin Mueller on Patreon
2023-05-30 16:09:13 -04:00
Joey Hess
f2db6da938
default to yt-dlp and fix progress parsing bugs
I noticed git-annex was using a lot of CPU when downloading from youtube,
and was not displaying progress. Turns out that yt-dlp (and I think also
youtube-dl) sometimes only knows an estimated size, not the actual size,
and displays the progress output slightly differently for that. That broke
the parser. And, the parser was feeding chunks that failed to parse back
as a remainder, which caused it to try to re-parse the entire output each
time, so it got slower and slower.

Using --progress-template like this should avoid parsing problems as well
as future proof against output changes. But it will work with only yt-dlp.

So, this seemed like the right time to deprecate youtube-dl, and default
to yt-dlp when available.

git-annex will still use youtube-dl if that's all that's available.
However, since the progress parser for youtube-dl was buggy, and I don't
want to maintain two different progress parsers (especially since
youtube-dl is no longer in debian unstable having been replaced by
yt-dlp), made git-annex no longer try to parse youtube-dl's progress.

Also, updated docs for yt-dlp being default. It did not seem worth
renaming annex.youtube-dl-options and annex.youtube-dl-command.

Note that yt-dlp does not seem to document the fields available in the
progress template. I found them by reading the source and looking at
the templates it uses internally. Also note that the use of "i" (rather
than "s") in progressTemplate makes it display floats rounded to integers;
particularly the estimated total size can be a float. That also does not
seem to be documented but I assume is a python thing?

Sponsored-by: Joshua Antonishen on Patreon
2023-05-27 13:04:53 -04:00
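The slowdown described in f2db6da938 came from feeding unparseable chunks back as the remainder, so the parser re-read an ever-growing buffer. A minimal Python sketch of the safer pattern follows. It assumes a made-up newline-delimited "done/total" progress format, not yt-dlp's real output or git-annex's progressTemplate: only the incomplete trailing line is carried over, and complete lines that fail to parse are dropped.

```python
def feed(remainder, chunk):
    """Parse progress lines from `remainder + chunk`.

    Returns (parsed, new_remainder). Complete lines that fail to parse
    are discarded rather than re-queued, so the buffer cannot grow
    without bound.
    """
    data = remainder + chunk
    *complete, new_remainder = data.split("\n")
    parsed = []
    for line in complete:
        try:
            done, total = line.split("/")
            parsed.append((int(done), int(total)))
        except ValueError:
            pass  # unparseable line: skip it, do not feed it back
    return parsed, new_remainder

progress1, rest = feed("", "10/100\ngarbage\n20/1")
progress2, rest2 = feed(rest, "00\n")
```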
Joey Hess
0f89d221bd
version: Avoid error message when entire output is not read
Sponsored-by: Dartmouth College's Datalad project
2023-05-19 15:00:57 -04:00
Joey Hess
c4ad9b1446
Fix bug in -z handling of trailing NUL in input
The obvious way to fix this would be to adapt lines to split on null.

However, it's actually nontrivial to rewrite lines. In particular it has a
weird implementation to avoid a space leak. See:
https://gitlab.haskell.org/ghc/ghc/-/issues/4334

Also, while that is a small amount of code, it's covered by a rather
complex copyright and I'd have to include that copyright in git-annex.

So, I opted to filter out the trailing empty string instead.

Sponsored-by: Dartmouth College's Datalad project
2023-05-19 14:34:02 -04:00
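The approach chosen in c4ad9b1446, splitting on NUL and then filtering out the trailing empty string, can be sketched in Python like this. `split_nul` is a hypothetical helper, not the Haskell code from the commit.

```python
def split_nul(data):
    """Split NUL-delimited input, ignoring one trailing NUL terminator."""
    parts = data.split("\0")
    # "a\0b\0".split("\0") yields ["a", "b", ""]; drop that final empty item.
    if parts and parts[-1] == "":
        parts.pop()
    return parts

terminated = split_nul("foo\0bar\0")
unterminated = split_nul("foo\0bar")
```

Input with and without the trailing NUL yields the same list, so a terminator is no longer mistaken for an extra empty item.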
Joey Hess
e955912ad0
git-annex assist
assist: New command, which is the same as git-annex sync but with
new files added and content transferred by default.

(Also this fixes another reversion in git-annex sync: --commit,
--no-commit, and --message were not enabled, oops.)

See added comment for why git-annex assist does commit staged
changes elsewhere in the work tree, but only adds files under
the cwd.

Note that it does not support --no-commit, --no-push, --no-pull
like sync does. My thinking is, why should it? If you want that
level of control, use git commit, git annex push, git annex pull.
Sync only got those options because pull and push were not split
out.

Sponsored-by: k0ld on Patreon
2023-05-18 14:37:43 -04:00
Joey Hess
f93a7fce1d
sync: Started transition to --content being enabled by default
When used without --content or --no-content, warn about the upcoming
transition, and suggest using one of the options, or setting
annex.synccontent.

Sponsored-by: Brett Eisenberg on Patreon
2023-05-17 13:23:42 -04:00