Commit graph

1712 commits

Author SHA1 Message Date
Joey Hess
a11d6e0baf
avoid sync pushing view branches to remotes, and better view branch names
* sync: Avoid pushing view branches to remotes.
* Changed the name of view branches to include the parent branch.
  Existing view branches checked out using an old name will still work.

It does not seem useful for sync to push view branches around, because
the information in a view branch can entirely be derived from other
information in git. And sync doesn't push adjusted branches around either.

The better view branch names make it more in line with adjusted branch
names, but were also needed to make fromViewBranch be able to return the
original branch name.

Kept the old view branch names still working. But, when those branches
exist in a repo, sync will still try to push them as before. Avoiding
that would need more complicated and/or expensive changes to sync.

Sponsored-By: Boyd Stephen Smith Jr. on Patreon
2023-02-08 13:57:48 -04:00
Joey Hess
aa0350ff49
add directory to views for files that lack specified metadata
* view: New field?=glob and ?tag syntax that includes a directory "_"
  in the view for files that do not have the specified metadata set.
* Added annex.viewunsetdirectory git config to change the name of the
  "_" directory in a view.

When in a view using the new syntax, old git-annex will fail to parse the
view log. It errors with "Not in a view.", which is not ideal. But that
only affects view commands.

annex.viewunsetdirectory is included in the View for a couple of reasons.
One is to avoid needing to warn the user that it should not be changed when
in a view, since that would confuse git-annex. Another reason is that it
helped with plumbing the value through to some pure functions.

annex.viewunsetdirectory is actually mangled the same as any other view
directory. So if it's configured to something like "N/A", there won't be
multiple levels of directories, which would also confuse git-annex.

Sponsored-By: Jack Hill on Patreon
2023-02-07 16:28:46 -04:00
Joey Hess
04ec726d3b
S3 region=
S3: Support a region= configuration useful for some non-Amazon S3
implementations. This feature needs git-annex to be built with aws-0.24.

datacenter= sets both the AWS hostname and region in one setting, which is
easy when using AWS, but not useful for other hosts. So kept datacenter
as-is, but added this additional config.

Sponsored-By: Brett Eisenberg on Patreon
2023-02-06 14:08:45 -04:00
Joey Hess
65167463aa
releasing package git-annex version 10.20230126 2023-01-26 15:27:32 -04:00
Joey Hess
3a08c80dd8
improve wording 2023-01-24 14:11:32 -04:00
Joey Hess
a6c1d9752b
move/copy: option parsing for --from with --to
Allowing --from and --to as an alternative to --from or --to
is hard to do with optparse-applicative!

The obvious approach of (pfrom <|> pto <|> pfromandto) does not work
when pfromandto uses the same option names as pfrom and pto do.
It compiles but the generated parser does not work for all desired
combinations.

Instead, have to parse optionally from and optionally to. When neither
is provided, the parser succeeds, but it's a result that can't be
handled. So, have to giveup after option parsing. There does not seem to
be a way to make an optparse-applicative Parser give up internally
either.

Also, need seek' because I first tried making fto be a where binding,
but that resulted in a hang when git-annex move was run without --from
or --to. I think because startConcurrency was not expecting the stages
value to contain an exception and so ended up blocking.

Sponsored-by: Dartmouth College's DANDI project
2023-01-18 14:42:39 -04:00
Joey Hess
f8bc208e89
findkeys: New command, very similar to git-annex find but operating on keys
I've long been asked for `git-annex find --all` or something like that,
but pushed back on it because I feel that the command is analagous to
find(1) and so it would be surprising for it to list keys rather than
files. So instead, add a new findkeys subcommand.

Note that the use of withKeyOptions is rather strange because usually
that is used to fall back to --all rather than listing files, but here
it's made to default to --all like behavior and never list files.

A performance thing that could be improved is that withKeyOptions
always reads and caches location logs. But findkeys with no options does
not need them, so it could be made faster. That caching does speed up
options like --in though. This is really just a subset of a more general
performance thing that --all reads location logs sometimes unncessarily.
Anyway, it needs to read the location log in order to checkDead,
and it seems good that findkeys does skip dead keys.

Also, cleaned up comments on git-annex-find man page asking for --all
option.

Sponsored-by: Dartmouth College's DANDI project
2023-01-17 14:51:57 -04:00
Joey Hess
cfaae7e931
added an optional cost= configuration to all special remotes
Note that when this is specified and an older git-annex is used to
enableremote such a special remote, it will simply ignore the cost= field
and use whatever the default cost is.

In passing, fixed adb to support the remote.name.cost and
remote.name.cost-command configs.

Sponsored-by: Dartmouth College's DANDI project
2023-01-12 13:42:28 -04:00
Joey Hess
6fa166e1fc
web: Add urlinclude and urlexclude configuration settings
Sponsored-by: Dartmouth College's DANDI project
2023-01-09 17:16:53 -04:00
Joey Hess
8d06930c88
web special remote is no longer a singleton
Allow initremote of additional special remotes with type=web, in addition
to the default web special remote.

When --sameas=web is used, these provide additional names for the web
special remote, and may also have their own additional configuration
(once there is any for the web special remote) and cost.

Sponsored-by: Dartmouth College's DANDI project
2023-01-09 15:49:20 -04:00
Joey Hess
f316b7f105
Revert "Removed the vendored git-lfs and the GitLfs build flag"
This reverts commit efda811404.

Turns out that datalad is building git-annex against debian bullseye.
https://github.com/datalad/git-annex/issues/149
2023-01-04 17:33:29 -04:00
Joey Hess
7919b9d57a
changelog for cf892f4256 2023-01-02 12:36:10 -04:00
Joey Hess
efda811404
Removed the vendored git-lfs and the GitLfs build flag
AFAICS all git-annex builds are using the git-lfs library not the vendored
copy.

Debian stable does have a too old haskell-git-lfs package to be able to
build git-annex from source, but there is not currently a backport of a
recent git-annex to Debian stable. And if they update the backport at some
point, they should be able to backport the library too.

Sponsored-by: Svenne Krap on Patreon
2022-12-26 12:49:53 -04:00
Joey Hess
d475f82c62
Added libgcc_s.so.1 to the linux standalone build so pthread_cancel will work
In Makefile, listed additional deps of Build/Standalone. Without that,
it does not get updated for the change to Utility/LinuxMkLibs.hs when
compiling incrementally.

Sponsored-by: Dartmouth College's DANDI project
2022-12-22 15:15:25 -04:00
Joey Hess
eb8e0594bb
use status --ignore-submodules in configureSmudgeFilter
Speed up git-annex upgrade (from v5) and init in a repository that has
submodules. Setting the config does not affect the submodules, so avoid
the work of getting status in them, which may involve using the smudge
filter etc.

Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
2022-12-20 16:02:42 -04:00
Joey Hess
0b2dd374d8
--anything and --nothing
Added --anything (and --nothing). Eg, git-annex find --anything will list
all annexed files whether or not the content is present. This is slightly
faster and clearer than --include=* or --exclude=*

While I can't imagine how --nothing will be used, preferred content
expressions already had anything and nothing, so might as well support both
as matching options as well.

Sponsored-by: Dartmouth College's Datalad project
2022-12-20 15:44:09 -04:00
Joey Hess
9d60385001
convert renameFile to moveFile to support cross-device moves
Improve handling of some .git/annex/ subdirectories being on other
filesystems, in the bittorrent special remote, and youtube-dl integration,
and git-annex addurl.

The only one of these that I've confirmed to be a problem is in the
bittorrent special remote when .git/annex/tmp and .git/annex/othertmp are
on different filesystems.

As well as auditing for renameFile, also audited for createLink, all of
those are ok as are the other remaining renameFile calls. Also audited all
code paths that use .git/annex/othertmp, and did not find any other
cross-device problems. So, removing mention of othertmp needing to be on
the same device.

Sponsored-by: Dartmouth College's Datalad project
2022-12-20 15:17:50 -04:00
Joey Hess
aa6919737c
--metadata lexicographical comparisons
Change --metadata comparisons < > <= and >= to fall back to lexicographical
comparisons when one or both values being compared are not numbers.

Sponsored-by: Erik Bjäreholt on Patreon
2022-12-12 13:33:24 -04:00
Joey Hess
ab11fd70e2
releasing package git-annex version 10.20221212 2022-12-12 12:51:59 -04:00
Joey Hess
65f9e7a3c7
fix deadlock in restagePointerFiles
Fix a hang that occasionally occurred during commands such as move.
(A bug introduced in 10.20220927, in
commit 6a3bd283b8)

The restage.log was kept locked while running a complex index refresh
action. In an unusual situation, that action could need to write to the
restage log, which caused a deadlock.

The solution is a two-stage process. First the restage.log is moved to a
work file, which is done with the lock held. Then the content of the work
file is read and processed, which happens without the lock being held.
This is all done in a crash-safe manner.

Note that streamRestageLog may not be fully safe to run concurrently
with itself. That's ok, because restagePointerFiles uses it with the
index lock held, so only one can be run at a time.

streamRestageLog does delete the restage.old file at the end without
locking. If a calcRestageLog is run concurrently, it will either see the
file content before it was deleted, or will see it's missing. Either is
ok, because at most this will cause calcRestageLog to report more
work remains to be done than there is.

Sponsored-by: Dartmouth College's Datalad project
2022-12-08 14:36:11 -04:00
Joey Hess
2b5e6ff20a
test: Add --test-debug option
This work is supported by the NIH-funded NICEMAN (ReproNim TR&D3) project.
2022-11-28 15:12:53 -04:00
Joey Hess
43f681d4c1
Support parsing yt-dpl output to display download progress
Before this fix, no progress was displayed when yt-dpl was used.

Sponsored-by: Graham Spencer on Patreon
2022-11-21 15:04:36 -04:00
Joey Hess
5256be61c1
When youtube-dl is not available in PATH, use yt-dlp instead
Debian is going to drop youtube-dl which is not active upstream, and yt-dlp
is the replacement. This will make it be used if youtube-dl gets removed.

If an old version of youtube-dl remains installed, git-annex will still use
it. That might not be desirable, but changing git-annex to use yt-dlp in
preference to youtube-dl when both are installed risks breaking when
the user has annex.youtube-dl-options set to something that is supported
by youtube-dl, but not by yt-dlp.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2022-11-21 14:40:33 -04:00
Joey Hess
2b014f1a8b
don't frontload reconcileStaged in git-annex init
init: Avoid scanning for annexed files, which can be lengthy in a
large repository. Instead that scan is done on demand. This lets git-annex
init be run and some query commands be used in a repository without
waiting.

Note that autoinit already behaved this way, so while this will mean some
commands like git-annex get/unlock/add will do the scan the first time run,
that is not really a significant behavior change.

And, it's really better to have a consistent behavior. The reason for
the inconsistency was a strange bug discussed in
b3c4579c79. Avoiding reconcileStaged in
init will keep avoiding whatever that was.

Sponsored-by: Dartmouth College's DANDI project
2022-11-18 13:58:47 -04:00
Joey Hess
c834d2025a
queue more changes to keys db
Increasing the size of the queue 10x makes git-annex init 7% faster in a
repository with 86000 annexed files.

The memory use goes up, from 70876 kb to 85376 kb.
2022-11-18 13:29:34 -04:00
Joey Hess
8fcee4ac9d
Sped up the initial scanning for annexed files by 15%
Avoids database querying overhead when the database is newly created.

In the large repository where git-annex init took 24 seconds, this sped it
up to 20.47 seconds, a speedup of around 15%.

Sponsored-by: Dartmouth College's DANDI project
2022-11-18 13:16:57 -04:00
Joey Hess
a3e9a0ae27
update changelog 2022-11-18 12:58:13 -04:00
Joey Hess
def779b250
revert change to use lts-19.32
This reverts commit 15dd7fe84b.

aws 0.23 is not used any longer, so read-only S3 import won't be
supported yet when building with stack.

That commit broke the build on windows, because the new version of Win32 that
was included (because the old one does not work with this lts version)
needs a version of filepath that is newer than the one bundled with the
ghc in that lts version. It is not possible to override that to a newer
filepath.

Seems that the only solution to get aws 0.23 will be to wait for a ghc that
contains filepath 1.4.100.0. No ghc yet contains it. (Backporting the Win32
fix to a point release version that does not include this bleeding edge
filepath would also resolve it, but seems unlikely to happen.)

Sponsored-by: Jarkko Kniivilä on Patreon
2022-11-14 11:56:55 -04:00
Joey Hess
b2cc63d5bf
export: fix multi-file delete bug
export: Fix a bug that left a file on a special remote when two files with
the same content were both deleted in the exported tree.

Case of the wrong data structure leading to the wrong result.
The DiffMap now contains all the old filenames, and all the new filenames.

Note that, when 2 files with the same content are both renamed,
it only renames the first, but deletes and re-exports the second.
Improving that is possible, but it would need to use a different temporary
filename. Anyway, that is an unusual case, and there are known to be other
unusual cases where export does not rename with maximum efficiency, IIRC.
(Or maybe this is the case that I remember?)

Sponsored-by: Dartmouth College's OpenNeuro project
2022-11-09 16:24:37 -04:00
Joey Hess
15dd7fe84b
stack.yaml: Updated to lts-19.32
This allows building with aws-0.23

Win32-2.13.4.0 contains a function that is not in lts-19.32 yet. Adding
it to stack.yaml does not seem to cause problems when building on linux.
2022-11-09 14:29:00 -04:00
Joey Hess
e100993935
complete support for S3 signature=anonymous
aws-0.23 has been released.

When built with an older aws, initremote will error out when run
with signature=anonymous. And when a remote has been initialized with
that by a version of git-annex that does support it, older versions will
fail when the remote is accessed, with a useful error message.

Sponsored-by: Dartmouth College's DANDI project
2022-11-04 16:20:28 -04:00
Joey Hess
de1e8201a6
Merge branch 'master' into anons3 2022-11-04 15:08:29 -04:00
Joey Hess
d22bd53310
releasing package git-annex version 10.20221103 2022-11-03 14:07:53 -04:00
Joey Hess
14f7a386f0
Make git-annex enable-tor work when using the linux standalone build
Clean the standalone environment before running the su command
to run "sh". Otherwise, PATH leaked through, causing it to run
git-annex.linux/bin/sh, but GIT_ANNEX_DIR was not set,
which caused that script to not work:

[2022-10-26 15:07:02.145466106] (Utility.Process) process [938146] call: pkexec ["sh","-c","cd '/home/joey/tmp/git-annex.linux/r' && '/home/joey/tmp/git-annex.linux/git-annex' 'enable-tor' '1000'"]
/home/joey/tmp/git-annex.linux/bin/sh: 4: exec: /exe/sh: not found

Changed programPath to not use GIT_ANNEX_PROGRAMPATH,
but instead run the scripts at the top of GIT_ANNEX_DIR.
That works both when the standalone environment is set up, and when it's
not.

Sponsored-by: Kevin Mueller on Patreon
2022-10-26 15:45:08 -04:00
Joey Hess
731e806c96
use lookupKeyStaged in --batch code paths
Make --batch mode handle unstaged annexed files consistently whether the
file is unlocked or not. Before this, a unstaged locked file
would have the symlink on disk examined and operated on in --batch mode,
while an unstaged unlocked file would be skipped.

Note that, when not in batch mode, unstaged files are skipped over too.
That is actually somewhat new behavior; as late as 7.20191114 a
command like `git-annex whereis .` would operate on unstaged locked
files and skip over unstaged unlocked files. That changed during
optimisation of CmdLine.Seek with apparently little fanfare or notice.

Turns out that rmurl still behaved that way when given an unstaged file
on the command line. It was changed to use lookupKeyStaged to
handle its --batch mode. That also affected its non-batch mode, but
since that's just catching up to the change earlier made to most
other commands, I have not mentioed that in the changelog.

It may be that other uses of lookupKey should also change to
lookupKeyStaged. But it may also be that would slow down some things,
or lead to unwanted behavior changes, so I've kept the changes minimal
for now.

An example of a place where the use of lookupKey is better than
lookupKeyStaged is in Command.AddUrl, where it looks to see if the file
already exists, and adds the url to the file when so. It does not matter
there whether the file is staged or not (when it's locked). The use of
lookupKey in Command.Unused likewise seems good (and faster).

Sponsored-by: Nicholas Golder-Manning on Patreon
2022-10-26 14:43:06 -04:00
Joey Hess
cde2e61105
improve sqlite retrying behavior
Avoid hanging when a suspended git-annex process is keeping a sqlite
database locked.

Sponsored-by: Dartmouth College's Datalad project
2022-10-18 15:47:20 -04:00
Joey Hess
3149a1e2fe
More robust handling of ErrorBusy when writing to sqlite databases
While ErrorBusy and other exceptions were caught and the write retried for
up to 10 seconds, it was still possible for git-annex to eventually
give up and error out without writing to the database. Now it will retry
as long as necessary.

This does mean that, if one git-annex process is suspended just as sqlite
has locked the database for writing, another git-annex that tries to write
it it might get stuck retrying forever. But, that could already happen when
opening the sqlite database, which retries forever on ErrorBusy. This is an
area where git-annex is known to not behave well, there's a todo about the
general case of it.

Sponsored-by: Dartmouth College's Datalad project
2022-10-17 15:56:19 -04:00
Joey Hess
6fbd337e34
avoid uncessary keys db writes; doubled speed!
When running eg git-annex get, for each file it has to read from and
write to the keys database. But it's reading exclusively from one table,
and writing to a different table. So, it is not necessary to flush the
write to the database before reading. This avoids writing the database
once per file, instead it will buffer 1000 changes before writing.

Benchmarking getting 1000 small files from a local origin,
git-annex get now takes 13.62s, down from 22.41s!
git-annex drop now takes 9.07s, down from 18.63s!
Wowowowowowowow!

(It would perhaps have been better if there were separate databases for
the two tables. At least it would have avoided this complexity. Ah well,
this is better than splitting the table in a annex.version upgrade.)

Sponsored-by: Dartmouth College's Datalad project
2022-10-12 15:33:16 -04:00
Joey Hess
c2ad84b423
all keys are still present on versioned remote after import of a tree
When importing from versioned remotes, fix tracking of the content of
deleted files.

Only S3 supports versioning so far, so only it was affected.

But, the draft import/export interface for external remotes also seemed to
need a change, so that versionedExport could be set.
2022-10-11 13:05:40 -04:00
Joey Hess
b4305315b2
S3: pass fileprefix into getBucket calls
S3: Speed up importing from a large bucket when fileprefix= is set by only
asking for files under the prefix.

getBucket still returns the files with the prefix included, so the rest of
the fileprefix stripping still works unchanged.

Sponsored-by: Dartmouth College's DANDI project
2022-10-10 17:37:26 -04:00
Joey Hess
ca91c3ba91
S3: Support signature=anonymous to access a S3 bucket anonymously
This can be used, for example, with importtree=yes to import from a public
bucket.

This needs a patch that has not yet landed in the aws library, and will
need to be adjusted to support compiling with old versions of the library,
so is not yet suitable for merging.
See https://github.com/aristidb/aws/pull/281

The stack.yaml changes are provided to show how to build against the aws
fork and will need to be reverted as well.

Sponsored-by: Dartmouth College's DANDI project
2022-10-10 17:02:45 -04:00
Joey Hess
4a42c69092
take lock in checkLogFile and calcLogFile
move: Fix openFile crash with -J

This does make them a bit slower, although usually the log file is not
very big, so even when it's being rewritten, they will not block for
long taking the lock. Still, little slowdowns may add up when moving a lot
file files.

A less expensive fix would be to use something lower level than openFile
that does not check if the file is already open for write by another
thread. But GHC does not seem to provide anything convenient; even mkFD
checks for a writing thread.

fullLines is no longer necessary since these functions no longer will
read the file while it's being written.

Sponsored-by: Dartmouth College's DANDI project
2022-10-07 13:19:17 -04:00
Joey Hess
15f9fcbcb1
avoid combining multiple words provided to trust/untrust/dead
* trust, untrust, semitrust, dead: Fix behavior when provided with
  multiple repositories to operate on.
* trust, untrust, semitrust, dead: When provided with no parameters,
  do not operate on a repository that has an empty name.

The man page and usage already indicated that multiple repos could be
provided to these commands, but they actually used unwords to combine
everything into string, and found a repo matching that string. This was
especially bad when no parameters resulted in the empty string and some
repo happened to have an empty description.

This does change the behavior, and it's possible someone relied on the
current behavior to eg, trust a repo by name with the name not quoted into
a single parameter. But fixing the empty string bug and matching the
documentation are worth breaking that usage.

Note that git-annex init/reinit do still unwords multiple parameters when
provided to them. That is inconsistent behavior, but it certianly seems
possible that something does run git-annex init with an unquoted
description, and I don't think it's worth breaking that just to make it more
consistent with these other commands.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2022-10-03 13:48:40 -04:00
Joey Hess
32a44c3813
releasing package git-annex version 10.20221003 2022-10-03 13:24:21 -04:00
Joey Hess
1328be2013
applied a patch 2022-09-30 14:04:10 -04:00
Joey Hess
49ee07f93d
fix flush of a closed file handle
Avoids displaying warning about git-annex restage needing to be run in
situations where it does not.

Closing a handle flushes it anyway, so no need for an explict flush. The
handle does get closed twice, but that's fine, the second one does nothing.

Sponsored-by: Dartmouth College's DANDI project
2022-09-30 14:02:31 -04:00
Joey Hess
02fca53cb2
releasing package git-annex version 10.20220927 2022-09-27 13:31:55 -04:00
Joey Hess
7059322a6c
Support "inbackend" in preferred content expressions
Well, actually, fix a typo that has always been in the implementation of
that. "inbacked" used to work, but let's not tell users about that; they
might try to use it and expect git-annex to keep supporting the typo..

Sponsored-by: Jack Hill on Patreon
2022-09-26 16:06:49 -04:00
Joey Hess
17129fed66
fix wormhole --appid option position
p2p: Pass wormhole the --appid option before the receive/send command, as
it does not accept that option after the command

I'm left wondering, did I get this wrong from the beginning, or did
wormhole change its option parser? I'm reminded of the change in 0.8.2
where it silently changed what FD the pairing code was output to.
But, looking at the wormhole source, it was at least putting --appid before
send in its test suite from the introduction of the option.

So I think probably this has always been broken. On 2021-12-31 the --appid
option was enabled, and it took until now for someone to try
git-annex p2p --pair and notice that flag day broke it..

Sponsored-by: Svenne Krap on Patreon
2022-09-26 15:11:38 -04:00
Joey Hess
ce65f11de0
enable-tor: Fix breakage caused by git's fix for CVE-2022-24765
This relies on bfa451fc4e and is a bit of an
ugly hack.

Sponsored-by: Noam Kremen on Patreon
2022-09-26 14:48:58 -04:00
Joey Hess
bfa451fc4e
pass --git-dir when reading git config when it was specified explicitly
Let GIT_DIR and --git-dir override git's protection against operating in a
repository owned by another user.

This is the same behavior other git commands have.

Sponsored-by: Jarkko Kniivilä on Patreon
2022-09-26 14:38:34 -04:00
Joey Hess
79aadf63d4
changelog and close 2022-09-26 13:11:23 -04:00
Joey Hess
8230f4a6f1
changelog for problem fixed by earlier revert 2022-09-26 12:27:32 -04:00
Joey Hess
2478e9e03a
restage: New git-annex command, handles restaging unlocked files
This is much easier and less failure-prone than having the user run
git update-index --refresh themselves.

Sponsored-by: Dartmouth College's DANDI project
2022-09-23 16:29:59 -04:00
Joey Hess
f64eff9355
test: Added --test-with-git-config option
Sponsored-by: Dartmouth College's DANDI project
2022-09-22 15:58:45 -04:00
Joey Hess
6e3c9bea2e
drain transferrer read handle when shutting it down
Fixes updating git index file after getting an unlocked file when
annex.stalldetection is set.

The transferrer may want to send additional protocol messages when it's
shut down. Closing the read handle prevented it from doing that, and caused
it to crash rather than cleanly shutting down.

Draining the handle without processing the protocol seemed ok to do,
because anything it outputs is going to be some side message displayed
at shutdown. Displaying those once per transferrer process that is running
seems unncessary.

Sponsored-by: Dartmouth College's DANDI project
2022-09-22 14:39:39 -04:00
Joey Hess
66bd4f80b3
Improved handling of --time-limit when combined with -J
When concurrency is enabled, there can be worker threads still running
when the time limit is checked. Exiting right there does not
give those threads time to finish what they're doing. Instead, the seeking
is wrapped up, and git-annex then shuts down cleanly.

The whole point of --time-limit existing, rather than using timeout(1)
when running git-annex is to let git-annex finish the action(s) it is
working on when the time limit is reached, and shut down cleanly.

I noticed this problem when investigating why restagePointerFile might
not have run after get/drop of an unlocked file. With --time-limit -J,
a worker thread may have finished updating a work tree file, and be killed
by the time limit check before it can run restagePointerFile. So despite
--time-limit running the shutdown actions, the work tree file didn't get
restaged.

Sponsored-by: Dartmouth College's DANDI project
2022-09-22 12:54:52 -04:00
Joey Hess
34e313f786
annex.diskreserve default increased from 1 mb to 100 mb
It's hard to know what's a good default for this. But 1 mb seems way too
small, because it's very easy for a git pull or some similar operation
that we don't think of as using much space to use up 1 mb of space.

Most people would want to free up some space if a filesystem only had 100
mb free. But on a small VPS, it's probably not uncommon to have only 1 gb
free. So 1 gb is too large for annex.diskreserve.

While old 1 gb USB keys are around, it's unlikely that anyone is
relying on them to shuttle annex data around; it would be worth anyone's
time to upgrade to a 32 gb or larger cheap modern USB key ($5).

Sponsored-by: Kevin Mueller on Patreon
2022-09-21 15:00:13 -04:00
Joey Hess
24016090b3
wording 2022-09-20 14:55:34 -04:00
Joey Hess
8d26fdd670
skip checkRepoConfigInaccessible when git directory specified explicitly
Fix a reversion that prevented git-annex from working in a repository when
--git-dir or GIT_DIR is specified to relocate the git directory to
somewhere else. (Introduced in version 10.20220525)

checkRepoConfigInaccessible could still run git config --list, just passing
--git-dir. It seems not necessary, because I know that passing --git-dir
bypasses git's check for repo ownership. I suppose it might be that git
eventually changes to check something about the ownership of the working
tree, so passing --git-dir without --work-tree would still be worth doing.
But for now this is the simple fix.

Sponsored-by: Nicholas Golder-Manning on Patreon
2022-09-20 14:52:43 -04:00
Joey Hess
0756f4453d
try retrieval from more than one export location when the first fails
Combined with commit 0ffc59d341, this
fixes the case where there are duplicate files on the special remote,
and the first gets modified/deleted, while the second is still present.

directory, adb: Fixed a bug when importtree=yes, and multiple files in the
special remote have the same content, that caused it to refuse to get a
file from the special remote, incorrectly complaining that it had changed,
due to only accepting the inode+mtime of one file (that was since modified
or deleted) and not accepting the inode+mtime of other duplicate files.

Sponsored-by: Max Thoursie on Patreon
2022-09-20 13:33:57 -04:00
Joey Hess
1fe9cf7043
deal with ignoreinode config setting
Improve handling of directory special remotes with importtree=yes whose
ignoreinode setting has been changed. (By either enableremote or by
upgrading to commit 3e2f1f73cbc5fc10475745b3c3133267bd1850a7.)

When getting a file from such a remote, accept the content that would have
been accepted with the previous ignoreinode setting.

After a change to ignoreinode, importing a tree from the remote will
re-import and generate new content identifiers using the new config. So
when ignoreinode has changed to no, the inodes will be learned, and after
that point, a change in an inode will be detected as a change. Before
re-importing, a change in an inode will be ignored, as it was before the
ignoreinode change. This seems acceptble, because the user can re-import
immediately if they urgently need to add inodes. And if not, they'll
do it sometime, presumably, and the change will take effect then.

Sponsored-by: Erik Bjäreholt on Patreon
2022-09-16 14:11:25 -04:00
Joey Hess
1ed90cb75e
improve wording 2022-09-16 12:52:00 -04:00
Joey Hess
451a7ce77f
vicfg: Include mincopies configuration
Sponsored-by: k0ld on Patreon
2022-09-15 15:11:59 -04:00
Joey Hess
eefc026370
fix reversion on skipping dead keys in --all/bare
Fix a reversion that made dead keys not be skipped when operating on all
keys via --all or in a bare repo. (Introduced in version 8.20200720)

Also, improved the documentation of git-annex-dead, it does not only apply
to fsck --all.

Also, made git-annex fsck, when run on a file whose key is dead, display
that. Before, it displayed that only when run with --all, but with this
fix, it skips dead keys with --all. But it can still be run on a file that
uses a dead key, and displaying "This key is dead" explains to the user
why it does not consider missing content for it to be a problem.

Sponsored-by: k0ld on Patreon
2022-09-13 14:38:13 -04:00
Joey Hess
d2c842e9a1
don't force use of conduit in withUrlOptionsPromptingCreds
Use curl for downloads from git remotes when annex.url-options and other
git configs are set.

If the url needs a password, curl will fail, and git credential will not be
used to prompt for it. But the user can set --netrc in url-options and
put the password in the netrc file.

This also means that url-options settings like -4 will take effect.
That was the case before commit 1883f7ef8f
forced conduit to be used.
2022-09-09 16:07:32 -04:00
Joey Hess
9621beabc4
cache credentials in memory when doing http basic auth to a git remote
When accessing a git remote over http needs a git credential prompt for a
password, cache it for the lifetime of the git-annex process, rather than
repeatedly prompting.

The git-lfs special remote already caches the credential when discovering
the endpoint. And presumably commands like git pull do as well, since they
may download multiple urls from a remote.

The TMVar CredentialCache is read, so two concurrent calls to
getBasicAuthFromCredential will both prompt for a credential.
There would already be two concurrent password prompts in such a case,
and existing uses of `prompt` probably avoid it. Anyway, it's no worse
than before.
2022-09-09 14:20:32 -04:00
Joey Hess
8a4cfd4f2d
use getSymbolicLinkStatus not getFileStatus to avoid crash on broken symlink
Fix crash importing from a directory special remote that contains a broken
symlink.

The crash was in listImportableContentsM but some other places in
Remote.Directory also seemed like they could have the same problem.

Also audited for other places that have such a problem. Not all calls
to getFileStatus are bad, in some cases it's better to crash on something
unexpected. For example, `git-annex import path` when the path is a broken
symlink should crash, the same as when it does not exist. Many of the
getFileStatus calls are like that, particularly when they involve
.git/annex/objects which should never have a broken symlink in it.

Fixed a few other possible cases of the problem.

Sponsored-by: Lawrence Brogan on Patreon
2022-09-05 13:46:32 -04:00
Joey Hess
a93163d6f7
optimise linker in linux standalone tarballs
Trick the linker into not doing unncessary work searching for optimised
libraries that are not present, by symlinking the directories where
optimised libs would be to the main lib dir.

This reduces the ENOENT of git-annex init by about 1/2. The linker always
finds the files where it looks first time now. I have not looked at what
the wall clock speedup might be, it's probably rather small.

If a x86-64-v5 comes to be, the list will need to be extended. And there
may be other directories used on some machines that I have missed. Not done
for arm64 yet, or any uncommon architectures.

Sponsored-by: Dartmouth College's Datalad project
2022-08-30 15:20:04 -04:00
Joey Hess
78440ca37d
move assistant and webapp build-depends into main build-depends
For some reason, cabal 3.4.1.0 builds w/o the assistant and webapp,
even when the flag is explicitly turned on. Moving the build-depends from
inside the if flag section to the main build-depends somehow fixes this.

Since the webapp build deps are thus always available, there is no reason
not to build the webapp when building the assistant. So, got rid of the
webapp build flag. Kept the assistant build flag for now, since building
without it does at least still speed up the build.

Sponsored-by: Brock Spratlen on Patreon
2022-08-29 15:23:49 -04:00
Joey Hess
cbac6c680b
remove changelog entry about reverted stack.yaml change 2022-08-22 12:02:22 -04:00
Joey Hess
e801634875
prep release 2022-08-22 12:02:04 -04:00
Yaroslav Halchenko
0151976676
Typo fix unncessary -> unnecessary.
Detected while reading recent CHANGELOG entry but then decided to apply
to entire codebase and docs since why not?
2022-08-20 09:40:19 -04:00
Joey Hess
ed39979ac8
import: Avoid following symbolic links inside directories being imported
Too big a footgun.

This does not prevent attackers who can write to the directory being
imported from racing the check. But they can cause anything to be imported
anyway, so would be limited to getting the legacy import to follow into a
directory they do not write to, and move files out of it into the annex.
(The directory special remote does not have that problem since it does not
move files.)

Sponsored-by: Jack Hill on Patreon
2022-08-19 13:31:16 -04:00
Joey Hess
94029995fa
fix git-annex add regression on deleted file
Fix a regression in 10.20220624 that caused git-annex add to crash when
there was an unstaged deletion.

Sponsored-by: Dartmouth College's Datalad project
2022-08-19 12:55:49 -04:00
Joey Hess
840bd50390
make it easier to use curl for unusual url schemes
Use curl when annex.security.allowed-url-schemes includes an url scheme not
supported by git-annex internally, as long as
annex.security.allowed-ip-addresses is configured to allow using curl.

Sponsored-by: Luke Shumaker on Patreon
2022-08-15 12:22:13 -04:00
Joey Hess
e60766543f
add annex.dbdir (WIP)
WIP: This is mostly complete, but there is a problem: createDirectoryUnder
throws an error when annex.dbdir is set to outside the git repo.

annex.dbdir is a workaround for filesystems where sqlite does not work,
due to eg, the filesystem not properly supporting locking.

It's intended to be set before initializing the repository. Changing it
in an existing repository can be done, but would be the same as making a
new repository and moving all the annexed objects into it. While the
databases get recreated from the git-annex branch in that situation, any
information that is in the databases but not stored in the branch gets
lost. It may be that no information ever gets stored in the databases
that cannot be reconstructed from the branch, but I have not verified
that.

Sponsored-by: Dartmouth College's Datalad project
2022-08-11 16:58:53 -04:00
Joey Hess
2530012fa3
fix wording 2022-08-10 12:32:49 -04:00
Joey Hess
abd417d4fe
Avoid running multiple bup split processes concurrently
Since bup split is not concurrency safe.

Used a lock file so that 2 git-annex processes only run one bup split
between them (per bup repo).

(Concurrent writes from different git-annex repository clones to the same
bup repo could still have concurrency problems.)

Sponsored-by: Noam Kremen on Patreon
2022-08-08 18:54:06 -04:00
Joey Hess
5bc70e2da5
When bup split fails, display its stderr
It seems worth noting here that I emailed bup's author about bup split
being noisy on stderr even with -q in approximately 2011. That never got
fixed. Its current repo on github only accepts pull requests, not bug
reports. Needing to add such complexity to deal with such a longstanding
unfixed issue is not fun.

Sponsored-by: Kevin Mueller on Patreon
2022-08-05 13:57:20 -04:00
Joey Hess
f94908f2a6
improve output when storing to bup
bup split outputs to stderr even with -q. This was discarded when using -J,
but it was still outputting when not using -J, and so was git-annex.

Sponsored-by: Nicholas Golder-Manning on Patreon
2022-08-05 12:29:33 -04:00
Joey Hess
a23fd7349f
work around git segfault
Work around bug in git 2.37 that causes a segfault when when
core.untrackedCache is set, and broke git-annex init.

Depending on when git gets fixed and how widely the buggy versions are
used, this could be reverted quite soon, or need to linger for a long time.
It only makes git-annex init a tiny bit slower in a new repo.

Sponsored-by: Max Thoursie on Patreon
2022-08-04 14:20:57 -04:00
Joey Hess
3a513cfe73
add --dry-run: New option
This is intended for users who want to see what it would output in order to
eg, check if a file would be added to git or the annex. It is not intended
as a way for scripts to get information.

Sponsored-by: Dartmouth College's Datalad project
2022-08-03 11:16:04 -04:00
Joey Hess
570b1aa6a1
Allow find --branch to be used in a bare repository, the same as the deprecated findref can be
This will allow later fully deprecating and removing findref.

Sponsored-by: Erik Bjäreholt on Patreon
2022-07-29 12:52:12 -04:00
Joey Hess
be19a68276
new matching options --want-get-by and --want-drop-by
Sponsored-by: Graham Spencer on Patreon
2022-07-28 13:26:03 -04:00
Joey Hess
b5dc04099e
stack.yaml: Updated to lts-19.16
Last try at this broke on windows with a problem installing ghc, but I
wanted to try again.

Also this has a version of aws that allows using aeson 2.0, which has a
potential security fix.
2022-07-26 16:04:49 -04:00
Joey Hess
d905232842
use ResourcePool for hash-object handles
Avoid starting an unncessary number of git hash-object processes when
concurrency is enabled.

Sponsored-by: Dartmouth College's DANDI project
2022-07-25 17:32:39 -04:00
Joey Hess
63cef2ae0b
v8 repositories automatically upgrade to v9
(And v9 later on to v10.)

When v9/v10 were added, making v8 automatically upgrade was deferred
"for a few months" to prevent interoperability problems if users also
have an old version of git-annex. Of course that could still be the
case, but there has been a good amount of time and this can't be put off
forever.

Allow setting annex.autoupgraderepository to false to avoid this upgrade.
Previously, that only prevented upgrades from no longer supported git-annex
versions, but v8 is still supported, and users may want to keep on v8 to
interoperate with an old git-annex version.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2022-07-25 16:20:04 -04:00
Joey Hess
a0e788c94a
releasing package git-annex version 10.20220724 2022-07-25 14:07:20 -04:00
Joey Hess
4e88137a28
prevent appends except when annex.alwayscompact=false
I would like for a new repo version to enable appends, but to do so
safely would need a v11 followed by a 1 year delay followed by a v12
that does it. Since a similar v9 and v10 transition is currently
happening, and is less than 6 months along in most repos, it does not
feel wise to stack up another year-long transition behind that. What if
I need to hurry up a new repo version for some other change?

Added todo so I remember to make this change at some time when a v11
and probably v12 repo version do make sense.

Sponsored-by: Dartmouth College's DANDI project
2022-07-20 13:23:55 -04:00
Joey Hess
36f0bdcd57
add annex.alwayscompact
Added annex.alwayscompact setting which can be unset to speed up writes to
the git-annex branch in some cases.

Sponsored-by: Dartmouth College's DANDI project
2022-07-18 16:39:19 -04:00
Joey Hess
a2b1f369d1
disable journalIgnorable in enableInteractiveBranchAccess
Fix a reversion that prevented --batch commands (and the assistant)
from noticing data written to the journal by other commands.

I have not identified which commit broke this for sure,
but probably it was aeca7c2207

--batch commands that wrote to the journal avoided the problem since
journalIgnorable sets unset on write. It's a little bit surprising that
nobody noticed that query --batch commands did not see data written by
other commands.

Sponsored-by: Dartmouth College's DANDI project
2022-07-15 13:48:41 -04:00
Joey Hess
093ad89ead
S3: Avoid writing or checking the uuid file in the S3 bucket when importtree=yes or exporttree=yes
It does not make sense for either; importing from an existing bucket should
not write to it. And the user may not have write access at all. And exporting to
a bucket should not write other files.

Also this prevents the uuid file being imported after being written.

Sponsored-by: Dartmouth College's DANDI project
2022-07-14 15:05:51 -04:00
Joey Hess
50c2cac7e7
adb: Added configuration setting oldandroid=true
To avoid using find -printf, which was first supported in Android around
2019-2020.

Probing seems too fragile, and execing stat once per file is too slow to do
when there's a faster way available, which brought me to an option...

Sponsored-by: Brett Eisenberg on Patreon
2022-07-13 18:00:47 -04:00
Joey Hess
fbc3c223a6
filter-process: Fix protocol for empty files
This caused git to complain that filter-process failed and kill it with
signal 15. Because it wrote an extra flushPkt for an empty file, which
git did not expect, and so git saw an unexpected response to the next
request.

Luckily, filter-process is only used by default in v9 and up, and v8 is
still the default. Also, git had to be updating an empty file, followed
by another file, which is a fairly unlikely situation. And git restarts
filter-process after this happens and uses it to filter the rest of the
files. So this isn't a crippling bug.

Sponsored-by: Luke Shumaker on Patreon
2022-07-13 17:13:54 -04:00
Joey Hess
201e41cffd
add: Fix reversion when adding an annex link that has been moved to another directory
Fixes commit f259be7f39

Sponsored-by: Dartmouth College's Datalad project
2022-07-05 16:22:41 -04:00
Joey Hess
d01530ac21
Revert "lts-19.13 (ghc 9.0.2)"
This reverts commit d2bc268317.

That seemed to break building on windows, before it starts building
git-annex at all, it tried to install ghc and something blew up:

Processing archive: C:\Users\runneradmin\AppData\Local\Programs\stack\x86_64-windows\ghc-9.0.2.tar.xz
Extracting  ghc-9.0.2.tar
...
Extracted total of 11790 files from ghc-9.0.2.tar
C:\Users\runneradmin\AppData\Local\Programs\stack\x86_64-windows\ghc-9.0.2-tmp-6d0fbe7f3b29e56c\ghc-9.0.2\: renameDirectory:pathIsDirectory:CreateFile "\\\\?\\C:\\Users\\runneradmin\\AppData\\Local\\Programs\\stack\\x86_64-windows\\ghc-9.0.2-tmp-6d0fbe7f3b29e56c\\ghc-9.0.2\\": does not exist (The system cannot find the file specified.)

Hopefully a newer ghc version or updated stackage version will fix this
at some point, in the meantime revert it.
2022-07-05 13:13:25 -04:00
Joey Hess
02ef3d6a64
fix build with assistant disabled and webapp enabled
The webapp modules cannot build with the assistant disabled, so make the
webapp be under the assistant build flag.

Sponsored-by: Jarkko Kniivilä on Patreon
2022-06-29 14:19:18 -04:00
Joey Hess
b223988e22
remove --backend from global options
--backend is no longer a global option, and is only accepted by commands
that actually need it.

Three commands that used to support backend but don't any longer are
watch, webapp, and assistant. It would be possible to make them support it,
but I doubt anyone used the option with these. And in the case of webapp
and assistant, the option was handled inconsistently, only taking affect
when the command is run with an existing git-annex repo, not when it
creates a new one.

Also, renamed GlobalOption etc to AnnexOption. Because there are many
options of this type that are not actually global (any more) and get
added to commands that need them.

Sponsored-by: Kevin Mueller on Patreon
2022-06-29 13:33:25 -04:00
Joey Hess
21c50c0f72
fix parallel copy from/to a local git repo
Improve handling of parallelization with -J when copying content from/to a
git remote that is a local path.

Sponsored-by: Nicholas Golder-Manning on Patreon
2022-06-29 12:40:12 -04:00
Joey Hess
d2bc268317
lts-19.13 (ghc 9.0.2) 2022-06-28 14:49:33 -04:00
Joey Hess
c1b9ea2759
The 23 never happened release.
It's 24 somewhere..
2022-06-23 13:55:54 -04:00
Joey Hess
57d088e9c2
fix release version 2022-06-23 13:35:14 -04:00
Joey Hess
86968a4047
releasing package git-annex version 10.20220526 2022-06-23 13:31:36 -04:00
Joey Hess
f259be7f39
fix overwrite race with small file that got large
When adding a small file, it does not get locked down, so can be modified
after git-annex checks that it's small. The use of queued git add made the
race window nice and wide too.

Fixed by checking if the file has changed, and by not using git add.
Instead, have to recapitulate git add's handling of things like symlinks
and executable files.

Sponsored-by: Jochen Bartl on Patreon
2022-06-14 16:38:56 -04:00
Joey Hess
78a3d44ea0
get rid of racy addLink
The remaining callers all did not rely on it checking gitignore, so were
easy to convert.

They were susceptable to the same overwrite race as add and fix,
although less likely to have it and a narrower window than add's race.

Command.Rekey in passing got an unncessary call to removeFile deleted.
addSymlink handles deleting any existing worktree file.
2022-06-14 14:47:15 -04:00
Joey Hess
64c7f60f7a
fixed overwrite race with git-annex fix
Similar to git-annex add, git-annex fix queued git add, so if a file
got modified before git add ran, the wrong content would be staged,
perhaps a large file content.

Sponsored-by: Brock Spratlen on Patreon
2022-06-14 14:19:58 -04:00
Joey Hess
dd6dec4eb1
fix add overwrite race with git-annex add to annex
This is not a complete fix for all such races, only the one where a
large file gets changed while adding and gets added to git rather than
to the annex.

addLink needs to go away, any caller of it is probably subject to the
same kind of race. (Also, addLink itself fails to check gitignore when
symlinks are not supported.)

ingestAdd no longer checks gitignore. (It didn't check it consistently
before either, since there were cases where it did not run git add!)
When git-annex import calls it, it's already checked gitignore itself
earlier. When git-annex add calls it, it's usually on files found
by withFilesNotInGit, which handles checking ignores.

There was one other case, when git-annex add --batch calls it. In that
case, old git-annex behaved rather badly, it would seem to add the file,
but git add would later fail, leaving the file as an unstaged annex symlink.
That behavior has also been fixed.

Sponsored-by: Brett Eisenberg on Patreon
2022-06-14 13:37:19 -04:00
Joey Hess
6d0b243d9d
avoid cleaning up move log when drop from remote fails
move: Improve resuming a move that succeeded in transferring the content,
but where dropping failed due to eg a network problem, in cases where
numcopies checks prevented the resumed move from dropping the object from
the source repository.

This was earlier done for moves that got interrupted during the drop stage.

Sponsored-by: Svenne Krap on Patreon
2022-06-09 15:26:25 -04:00
Joey Hess
13fc6a9b6a
fix to use 1 chunk for empty file
Fix retrival of an empty file that is stored in a special remote with
chunking enabled.

The speculative chunk stuff caused a reversion by adding an empty list for
the empty file. Which is just wrong; the empty file is still stored on the
remote, and should be retrieved like any other file. It uses 1 chunk, so
`max 1` is the simple fix.

Sponsored-by: Noam Kremen on Patreon
2022-06-09 14:24:56 -04:00
Joey Hess
14584e7a38
initremote type=git probe uuid
rather than matching path of an existing remote to find the uuid.

The main benefit of this is that locations not using ssh:// will work
now, including both paths and host:/path

The other benefit is that it's a simpler interface, no need to have an
existing remote with the same url and some other name. Although that
will still work of course.

This does rely on tryGitConfigRead working when given a Git.Repo that is
not a remote. Luckily, it works fine that way.

Also, tryGitConfigRead will auto-init a local repo that has a git-annex
branch. I did not enable auto-init of ssh repos though.

The uuid discovery actually happens twice; initremote discovers it,
and uses it to store the special remote config, but does not set it in the
git remote it creates. So the next run of git-annex does uuid discovery
again, and caches it that time. This could be improved for a tiny
speedup, but I didn't want to complicate things for that in this
commit.

Sponsored-by: Dartmouth College's DANDI project
2022-06-09 13:16:50 -04:00
Joey Hess
c59ea5b1ca
info: Added --autoenable option
Use cases include using git-annex init --no-autoenable and then going back
and enabling the special remotes that have autoenable configured. As well
as just querying to remember which ones have it enabled.

It lists all special remotes that have autoenable=yes whether currently
enabled or not. And it can be used with --json.

I pondered making this "git-annex info autoenable", but that seemed wrong
because then if the use has a directory named "autoenable", it's unclear
what they are asking for. (Although "git-annex info remote" may be
similarly unclear.) Making it an option does mean that it can't be provided
via --batch though.

Sponsored-by: Dartmouth College's Datalad project
2022-06-01 14:20:38 -04:00
Joey Hess
0d50c90794
init: Added --no-autoenable option
Someone may disagree with what repositories are set to autoenable and
it's good to have local overrides.

See https://github.com/datalad/datalad/issues/6634

Sponsored-by: Dartmouth College's Datalad project
2022-06-01 13:27:49 -04:00
Joey Hess
b60d85c4c0
releasing package git-annex version 10.20220525 2022-05-25 14:01:31 -04:00
Joey Hess
85f9193167
fix git-annex test -p
test: When limiting tests to run with -p, work around tasty limitation by
automatically including dependent tests.

This fixes a reversion because it didn't used to use dependencies and
forced tasty to run the init tests first. That changed when parallelizing
the test suite.

It will sometimes do a little more work than strictly required,
because it adds init tests deps when limited to eg quickcheck tests,
which don't depend on them. But this only adds a few seconds work.

Sponsored-by: Dartmouth College's Datalad project
2022-05-23 14:24:54 -04:00
Joey Hess
af0d854460
deal with git's changes for CVE-2022-24765
Deal with git's recent changes to fix CVE-2022-24765, which prevent using
git in a repository owned by someone else.

That makes git config --list not list the repo's configs, only global
configs. So annex.uuid and annex.version are not visible to git-annex.
It displayed a message about that, which is not right for this situation.
Detect the situation and display a better message, similar to the one other
git commands display.

Also, git-annex init when run in that situation would overwrite annex.uuid
with a new one, since it couldn't see the old one. Add a check to prevent
it running too in this situation. It may be that this fix has security
implications, if a config set by the malicious user who owns the repo
causes git or git-annex to run code. I don't think any git-annex configs
get run by git-annex init. It may be that some git config of a command
does get run by one of the git commands that git-annex init runs. ("git
status" is the command that prompted the CVE-2022-24765, since
core.fsmonitor can cause it to run a command). Since I don't know how
to exploit this, I'm not treating it as a security fix for now.

Note that passing --git-dir makes git bypass the security check. git-annex
does pass --git-dir to most calls to git, which it does to avoid needing
chdir to the directory containing a git repository when accessing a remote.
So, it's possible that somewhere in git-annex it gets as far as running git
with --git-dir, and git reads some configs that are unsafe (what
CVE-2022-24765 is about). This seems unlikely, it would have to be part of
git-annex that runs in git repositories that have no (visible) annex.uuid,
and git-annex init is the only one that I can think of that then goes on to
run git, as discussed earlier. But I've not fully ruled out there being
others..

The git developers seem mostly worried about "git status" or a similar
command implicitly run by a shell prompt, not an explicit use of git in
such a repository. For example, Ævar Arnfjörð Bjarma wrote:
> * There are other bits of config that also point to executable things,
>   e.g. core.editor, aliases etc, but nothing has been found yet that
>   provides the "at a distance" effect that the core.fsmonitor vector
>   does.
>
>   I.e. a user is unlikely to go to /tmp/some-crap/here and run "git
>   commit", but they (or their shell prompt) might run "git status", and
>   if you have a /tmp/.git ...

Sponsored-by: Jarkko Kniivilä on Patreon
2022-05-20 14:38:27 -04:00
Joey Hess
aa414d97c9
make fsck normalize object locations
The purpose of this is to fix situations where the annex object file is
stored in a directory structure other than where annex symlinks point to.

But it will also move object files from the hashdirmixed back to
hashdirlower if the repo configuration makes that the normal location.
It would have been more work to avoid that than to let it do it.

Sponsored-by: Dartmouth College's Datalad project
2022-05-16 15:38:06 -04:00
Joey Hess
54809e9eb3
fix untrustworthiness of import/export remotes
Commit 36133f27c0 had a boolean flip in it,
aaargh.

Special remotes with importtree=yes or exporttree=yes are once again
treated as untrusted, since files stored in them can be deleted or modified
at any time.

Sponsored-by: Kevin Mueller on Patreon
2022-05-09 15:53:23 -04:00
Joey Hess
e8a601aa24
incremental verification for retrieval from import remotes
Sponsored-by: Dartmouth College's Datalad project
2022-05-09 15:39:43 -04:00
Joey Hess
d1cce869ed
implement dataUnits finally
Added support for "megabit" and related bandwidth units in
annex.stalldetection and everywhere else that git-annex parses data units.

Note that the short form is "Mbit" not "Mb" because that differs from "MB"
only in case, and git-annex parses units case-insensitively. It would be
horrible if two different versions of git-annex parsed the same value
differently, so I don't think "Mb" can be supported.

See comment for bonus sad story from my childhood.

Sponsored-by: Nicholas Golder-Manning
2022-05-05 15:25:11 -04:00
Joey Hess
4e4c44ed8e
hah, I mean 0504 of course 2022-05-04 11:47:40 -04:00
Joey Hess
cb0e89bf77
releasing package git-annex version 10.20220404 2022-05-04 11:46:56 -04:00
Joey Hess
0406c33f58
fix git-annex repair false positive
Avoid treating refs/annex/last-index or other refs that are not commit
objects as evidence of repository corruption.

The repair code checks to find bad refs by trying to run `git log` on
them, and assumes that no output means something is broken.  But git log
on a tree object is empty.

This was worth fixing generally, not as a special case, since it's certainly
possible that other things store tree or other objects in refs.

Sponsored-by: Max Thoursie on Patreon
2022-05-04 11:32:23 -04:00
Joey Hess
43701759a3
disable shellescape for rsync 3.2.4
rsync 3.2.4 broke backwards-compatability by preventing exposing filenames
to the shell. Made the rsync and gcrypt special remotes detect this and
disable shellescape.

An alternative fix would have been to always set RSYNC_OLD_ARGS=1.
Which would avoid the overhead of probing rsync --help for each affected
remote. But that is really very fast to run, and it seemed better to switch
to the modern code path rather than keeping on using the bad old code path.

Sponsored-by: Tobias Ammann on Patreon
2022-05-03 12:12:41 -04:00
Joey Hess
280d41b58f
Fix a build failure with ghc 9.2.2
Thanks, gnezdo for the patch.
2022-05-02 14:21:48 -04:00
Joey Hess
17b20a2450
Fix test failure on NFS when cleaning up gpg temp directory
Using removePathForcibly avoids concurrent removal problems.

The i386ancient build still uses an old version of ghc and directory that
do not include removePathForcibly though.

Sponsored-by: Dartmouth College's Datalad project
2022-04-19 13:33:33 -04:00
Joey Hess
fd65de0eb9
multicast: Support uftp 5.0 by switching from aes256-cbc to aes256-gcm
aes256-gcm is supported by both 4.x and 5.x, while 5.x dropped aes256-cbc.

Sponsored-by: Graham Spencer on Patreon
2022-04-19 12:02:10 -04:00
Joey Hess
ff6b36c706
assistant prompt pushing of manual commits to remotes
assistant: When annex.autocommit is set, notice commits that the user makes
manually, and push them out to remotes promptly.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2022-03-31 13:02:16 -04:00
Joey Hess
d266a41f8d
prevent numcopies or mincopies being configured to 0
Ignore annex.numcopies set to 0 in gitattributes or git config, or by
git-annex numcopies or by --numcopies, since that configuration would make
git-annex easily lose data. Same for mincopies.

This is a continuation of the work to make data only be able to be lost
when --force is used. It earlier led to the --trust option being disabled,
and similar reasoning applies here.

Most numcopies configs had docs that strongly discouraged setting it to 0
anyway. And I can't imagine a use case for setting to 0. Not that there
might not be one, but it's just so far from the intended use case of
git-annex, of managing and storing your data, that it does not seem like
it makes sense to cater to such a hypothetical use case, where any
git-annex drop can lose your data at any time.

Using a smart constructor makes sure every place avoids 0. Note that this
does mean that NumCopies is for the configured desired values, and not the
actual existing number of copies, which of course can be 0. The name
configuredNumCopies is used to make that clear.

Sponsored-by: Brock Spratlen on Patreon
2022-03-28 15:20:34 -04:00
Joey Hess
959beeea9f
releasing package git-annex version 10.20220322 2022-03-22 13:56:45 -04:00
Joey Hess
a460aa8b70
Removed the NetworkBSD build flag
Debian stable and the i386ancient build both have a new enough network
to not need this flag any longer.

Sponsored-by: Svenne Krap on Patreon
2022-03-22 11:52:52 -04:00
Joey Hess
982eb7ed0d
remove vendored http-client-restricted
Removed vendored copy of http-client-restricted, and removed the
HttpClientRestricted build flag that avoided that dependency.

http-client-restricted is in Debian stable, and the i386ancient build also
uses it, so I think this vendored copy is no longer needed.

Sponsored-by: Noam Kremen on Patreon
2022-03-22 11:50:06 -04:00
Joey Hess
42b6f24e67
reorder 2022-03-21 16:02:24 -04:00
Joey Hess
6079b0c72c
fix reversion
add: Avoid unncessarily converting a newly unlocked file to be stored
in git when it is not modified, even when annex.largefiles does not
match it.

This fixes a reversion in version 10.20220222, where git-annex unlock
followed by git-annex add, followed by git commit file could result in
git thinking the file was modified after the commit.

I do have half a mind to remove the withUnmodifiedUnlockedPointers part
of git-annex add. It seems weird, despite that old bug report arguing
a case of consistency that it ought to behave that way. When git-annex
add surpises me, it seems likely it's wrong.. But for now, this is the
smallest possible fix.

Sponsored-by: Dartmouth College's Datalad project
2022-03-21 15:54:04 -04:00
Joey Hess
3e2f1f73cb
add back inode to directory special remote ContentIdentifier
Directory special remotes with importtree=yes have changed to once more
take inodes into account. This will cause extra work when importing from a
directory on a FAT filesystem that changes inodes on every mount.

To avoid that extra work, set ignoreinodes=yes when initializing a new
directory special remote, or change the configuration of your existing
remote: git-annex enableremote foo ignoreinodes=yes

This will mean a one-time re-import of all contents from every directory
special remote due to the changed setting.

73df633a62 thought
it was too unlikely that there would be modifications that the inode number
was needed to notice. That was probably right; it's very unlikely that a
file will get modified and end up with the same size and mtime as before.
But, what was not considered is that a program like NextCloud might write
two files with different content so closely together that they share the
mtime. The inode is necessary to detect that situation.

Sponsored-by: Max Thoursie on Patreon
2022-03-21 13:12:02 -04:00
Joey Hess
025c18128b
test: Added --jobs option
Default to the number of CPU cores, which seems about optimal
on my laptop. Using one more saves me 2 seconds actually.

Better packing of workers improves speed significantly.

In 2 tests runs, I saw segfaulting workers despite my attempt
to work around that issue. So detect when a worker does, and re-run it.

Removed installSignalHandlers again, because I was seeing an
error "lost signal due to full pipe", which I guess was somehow caused
by using it.

Sponsored-by: Dartmouth College's Datalad project
2022-03-16 14:42:07 -04:00
Joey Hess
b1934cc794
changelog 2022-03-02 18:27:20 -04:00
Joey Hess
2fc46e1871
git-annex test from standalone speedup
Avoid git-annex test being very slow when run from within the standalone
linux tarball or OSX app.

It may not really be necessary to add to PATH the directory where the
git-annex binary resides, but it can't hurt. Most places where the test
suite or git-annex run git-annex, they use programPath, so won't need
a modified PATH. But I'm not sure if that's always the case.

Sponsored-by: Dartmouth College's Datalad project
2022-03-01 16:08:55 -04:00
Joey Hess
ce91f10132
fix annex.skipunknown false error propagation
Propagate nonzero exit status from git ls-files when a specified file does
not exist, or a specified directory does not contain any files checked into
git.

The recent completion of the annex.skipunknown transition exposed this
bug, that has unfortunately been lurking all along.

It is also possible that git ls-files errors out for some other reason
-- perhaps a permission problem -- and this will also fix error propagation
in such situations.

Sponsored-by: Dartmouth College's Datalad project
2022-02-28 12:54:56 -04:00
Joey Hess
f4b046252a
Run annex.thawcontent-command before deleting an object file
In case annex.freezecontent-command did something that would prevent
deletion.

Sponsored-by: Dartmouth College's Datalad project
2022-02-24 14:11:02 -04:00
Joey Hess
28bc5ce232
ignore write bits being set when there is a freeze hook
When annex.freezecontent-command is set, and the filesystem does not
support removing write bits, avoid treating it as a crippled filesystem.

The hook may be enough to prevent writing on its own, and some filesystems
ignore attempts to remove write bits.

Sponsored-by: Dartmouth College's Datalad project
2022-02-24 13:28:31 -04:00
Joey Hess
64ccb4734e
smudge: Warn when encountering a pointer file that has other content appended to it
It will then proceed to add the file the same as if it were any other
file containing possibly annexable content. Usually the file is one that
was annexed before, so the new, probably corrupt content will also be added
to the annex. If the file was not annexed before, the content will be added
to git.

It's not possible for the smudge filter to throw an error here, because
git then just adds the file to git anyway.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 15:17:08 -04:00
Joey Hess
67245ae00f
fully specify the pointer file format
This format is designed to detect accidental appends, while having some
room for future expansion.

Detect when an unlocked file whose content is not present has gotten some
other content appended to it, and avoid treating it as a pointer file, so
that appended content will not be checked into git, but will be annexed
like any other file.

Dropped the max size of a pointer file down to 32kb, it was around 80 kb,
but without any good reason and certianly there are no valid pointer files
anywhere that are larger than 8kb, because it's just been specified what it
means for a pointer file with additional data even looks like.

I assume 32kb will be good enough for anyone. ;-) Really though, it needs
to be some smallish number, because that much of a file in git gets read
into memory when eg, catting pointer files. And since we have no use cases
for the extra lines of a pointer file yet, except possibly to add
some human-visible explanation that it is a git-annex pointer file, 32k
seems as reasonable an arbitrary number as anything. Increasing it would be
possible, eg to 64k, as long as users of such jumbo pointer files didn't
mind upgrading all their git-annex installations to one that supports the
new larger size.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 14:20:31 -04:00
Joey Hess
1c4b0b4c2b
releasing package git-annex version 10.20220222 2022-02-22 13:33:45 -04:00
Joey Hess
ce1b3a9699
info: Allow using matching options in more situations
File matching options like --include will be rejected in situations where
there is no filename to match against. (Or where there is a filename but
it's not relative to the cwd, or otherwise seemed too bothersome to match
against.)

The addition of listKeys' was necessary to avoid using more memory in the
common case of "git-annex info". Adding a filterM would have caused the
list to buffer in memory and not stream. This is an ugly hack, but listKeys
had previously run Annex operations inside unafeInterleaveIO (for direct
mode). And matching against a matcher should hopefully not change any Annex
state.

This does allow for eg `git-annex info somefile --include=*.ext`
although why someone would want to do that I don't really know. But it
seems to make sense to allow it.
But, consider: `git-annex info ./somefile --include=somefile`
This does not match, so will not display info about somefile.
If the user really wants to, they can `--include=./somefile`.

Using matching options like --copies or --in=remote seems likely to be
slower than git-annex find with those options, because unlike such
commands, info does not have optimised streaming through the matcher.

Note that `git-annex info remote` is not the same as
`git-annex info --in remote`. The former shows info about all files in
the remote. The latter shows local keys that are also in that remote.
The output should make that clear, but this still seems like a point
where users could get confused.

Sponsored-by: Jochen Bartl on Patreon
2022-02-21 14:46:07 -04:00
Joey Hess
faf84aa5c2
Avoid git status taking a long time after git-annex unlock of many files.
Implemented by making Git.Queue have a FlushAction, which can accumulate
along with another action on files, and runs only once the other action has
run.

This lets git-annex unlock queue up git update-index actions, without
conflicting with the restagePointerFiles FlushActions.

In a repository with filter-process enabled, git-annex unlock will
often not take any more time than before, though it may when the files are
large. Either way, it should always slow down less than git-annex status
speeds up.

When filter-process is not enabled, git-annex unlock will slow down as much
as git status speeds up.

Sponsored-by: Jochen Bartl on Patreon
2022-02-18 15:06:40 -04:00
Joey Hess
07215cfeb5
complete annex.skipunknown transition
annex.skipunknown now defaults to false, so commands like `git annex get foo*`
will not silently skip over files/dirs that are not checked into git.

Sponsored-by: Brock Spratlen on Patreon
2022-02-18 13:18:05 -04:00
Joey Hess
0edf01d7d4
registerurl,unregisterurl: rework output and support --json
* registerurl, unregisterurl: Improved output when reading from stdin
  to be more like other batch commands.
* registerurl, unregisterurl: Added --json and --json-error-messages options.

Note that this did change the --batch output in a way that could possibly
break something that expected the old output to never change. I think it's
acceptable to break that because there has never been a guarantee of
unchanging output format except with --batch for most commands. The old
output was just really weird too!

One possible wart is that "git-annex registerurl" with no options now
seems to just hang, since it's waiting for stdin input. Before, it said
"registerurl (stdin)" which was clearer about what's happenening. But this
is a deprecated mode anyway, --batch makes clear what's happening. If
anything, this problem would be a reason to eventually remove the support
for reading from stdin w/o --batch.

Sponsored-by: Dartmouth College's Datalad project
2022-02-14 13:29:20 -04:00
Joey Hess
6992250d63
fix obviously wrong attoparsec parser
takeByteString can only be used at the end of a parser, not before other
input. This was a dumb enough mistake that I audited the rest of the
code base for similar mistakes. Pity that attoparsec cannot avoid it at
the type level.

Fixes git-annex forget propagation between repositories. (reversion
introduced in version 7.20190122)

Sponsored-by: Brock Spratlen on Patreon
2022-02-07 14:15:17 -04:00
Joey Hess
46d5098ff4
Pass --no-textconv when running git diff internally
Seems that --no-ext-diff and -c diff.external= are not enough to disable
external diff command when gitattributes textconv specifies it.

I'm pretty sure that --no-ext-diff and -c diff.external= are not both
needed, but not 100%. Something about -G may need the latter to fully
disable diffs in some cases. So kept that part as it was.

Sponsored-by: Dartmouth College's Datalad project
2022-02-01 13:43:18 -04:00
Joey Hess
a32ff6cef0
adb: Avoid find failing with "Argument list too long"
The "+" argument only runs the command once, so is not safe to use. Using
";" instead would have been the simplest fix, but also the slowest.

Since my phone has an xargs that supports -0, I piped find to xargs
instead. Unsure how portable this will be, perhaps some android's don't
have xargs -0 or find -printf to send null terminated output.

The business with pipefail is necessary to make a failure of find cause the
import to fail. Probably this works on all androids, but if not, it will
probably just result in a failure of find being ignored. It would be
possible to make ignorefinderror just disable setting pipefail, but then
if some android has a shell that has pipefail enabled by default, ignorefinderror
would not work, so I kept the || true approach for that.

Sponsored-by: Max Thoursie on Patreon
2022-01-31 13:19:09 -04:00
Joey Hess
e6e60b644b
releasing package git-annex version 10.20220127 2022-01-27 14:53:22 -04:00
Joey Hess
835c50966a
reject batch options combined with non-batch options
Reject combinations of --batch (or --batch-keys) with options like --all or
--key or with filenames.

Most commands ignored the non-batch items when batch mode was enabled.

For some reason, addurl and dropkey both processed first the specified
non-batch items, followed by entering batch mode. Changed them to also
error out, for consistency.

Sponsored-by: Dartmouth College's Datalad project
2022-01-26 13:00:19 -04:00
Joey Hess
213185c788
clarify 2022-01-24 15:01:08 -04:00
Joey Hess
47084b8a1d
enable filter.annex.process in v9
This has tradeoffs, but is generally a win, and users who it causes git add to
slow down unacceptably for can just disable it again.

It needed to happen in an upgrade, since there are git-annex versions
that do not support it, and using such an old version with a v8
repository with filter.annex.process set will cause bad behavior.
By enabling it in v9, it's guaranteed that any git-annex version that
can use the repository does support it. Although, this is not a perfect
protection against problems, since an old git-annex version, if it's
used with a v9 repository, will cause git add to try to run
git-annex filter-process, which will fail. But at least, the user is
unlikely to have an old git-annex in path if they are using a v9
repository, since it won't work in that repository.

Sponsored-by: Dartmouth College's Datalad project
2022-01-21 13:11:18 -04:00
Joey Hess
7e7a7140ce
update for v10
Sponsored-by: Dartmouth College's Datalad project
2022-01-21 12:32:44 -04:00
Joey Hess
f54c58f0df
Avoid crashing when run in a bare git repo that somehow contains an index file
Do not populate the keys database with associated files,
because a bare repo has no working tree, and so it does not make sense to
populate it.

Queries of associated files in the keys database always return empty lists
in a bare repo, even if it's somehow populated. One way it could be
populated is if a user converts a non-bare repo to a bare repo.

Note that Git.Config.isBare does a string comparison, so this is not free!
But, that string comparison is very small compared to a sqlite query.

Sponsored-by: Erik Bjäreholt on Patreon
2022-01-11 13:01:49 -04:00
Joey Hess
525473aa5a
adb: Added ignorefinderror configuration parameter
On a phone with Calyxos, adb find in /sdcard complains:

find: ./Android/data/com.android.providers.downloads.ui: Permission denied

But otherwise works, so this option makes import and export work ok, except
for that one app's data.

Sponsored-by: Graham Spencer
2022-01-10 21:17:00 -04:00
Joey Hess
e95747a149
fix handling of corrupted data received from git remote
Recover from corrupted content being received from a git remote due eg to a
wire error, by deleting the temporary file when it fails to verify. This
prevents a retry from failing again.

Reversion introduced in version 8.20210903, when incremental verification
was added.

Only the git remote seems to be affected, although it is certianly
possible that other remotes could later have the same issue. This only
affects things passed to getViaTmp that return (False, UnVerified) due to
verification failing. As far as getViaTmp can tell, that could just as well
mean that the transfer failed in a way that would resume, so it cannot
delete the temp file itself. Remote.Git and P2P.Annex use getViaTmp internally,
while other remotes do not, which is why only it seems affected.

A better fix perhaps would be to improve the types of the callback
passed to getViaTmp, so that some other value could be used to indicate
the state where the transfer succeeded but verification failed.

Sponsored-by: Boyd Stephen Smith Jr.
2022-01-07 13:25:33 -04:00
Joey Hess
21c0d5be6e
comment 2022-01-07 12:27:19 -04:00
Joey Hess
e416635021
renameremote: Better handling of case where there are multiple special remotes with a name
Instead of renaming one at random, error out and ask that a uuid be
specified.

Sponsored-by: Brett Eisenberg on Patreon
2022-01-05 15:24:02 -04:00
Joey Hess
58afb00f6e
enableremote: Better handling of the unusual case where multiple special remotes have been initialized with the same name
Before it would pick one at random, though preferring ones that were not
dead over dead ones.

Now, if one is dead and the other not, it will use the non-dead one. But if
both are not dead, or both dead, it will error out, suggesting the user
clarify what they want to enable.

Sponsored-by: Luke Shumaker on Patreon
2022-01-05 15:12:11 -04:00
Joey Hess
7e2f5edd68
avoid exporting non-annexed symlinks
So that importing does not replace them with plain files.

This works similarly to how the previous handling of submodules and
matchers did, except that annexed symlinks still get exported as plain
files of course, it's only non-annexed symlinks that it does not make sense
to export.

When symlinks have previously been exported, updating the export will
unexport them after upgrading to this commit.

Sponsored-by: Kevin Mueller on Patreon
2022-01-03 14:21:50 -04:00
Joey Hess
479ec0d533
releasing package git-annex version 8.20211231 2021-12-31 15:11:50 -04:00
Joey Hess
6d7ecd9e5d
merge git-annex branch in memory in read-only repository
Improved support for using git-annex in a read-only repository, git-annex
branch information from remotes that cannot be merged into the git-annex
branch will now not crash it, but will be merged in memory.

To avoid this making git-annex behave one way in a read-only repository,
and another way when it can write, it's important that Annex.Branch.get
return the same thing (modulo log file compaction) in both cases.

This manages that mostly. There are some exceptions:

- When there is a transition in one of the remote git-annex branches
  that has not yet been applied to the local or other git-annex branches.
  Transitions are not handled.
- `git-annex log` runs git log on the git-annex branch, and so
  it will not be able to show information coming from the other, not yet
  merged branches.
- Annex.Branch.files only looks at files in the git-annex branch and not
  unmerged branches. This affects git-annex info output.
- Annex.Branch.hs.overBranchFileContents ditto. Affects --all and
  also importfeed (but importfeed cannot work in a read-only repo
  anyway).
- CmdLine.Seek.seekFilteredKeys when precaching location logs.
  Note use of Annex.Branch.fullname
- Database.ContentIdentifier.needsUpdateFromLog and updateFromLog

These warts make this not suitable to be merged yet.

This readonly code path is more expensive, since it has to query several
branches. The value does get cached, but still large queries will be
slower in a read-only repository when there are unmerged git-annex
branches.

When annex.merge-annex-branches=false, updateTo skips doing anything,
and so the read-only repository code does not get triggered. So a user who
is bothered by the extra work can set that.

Other writes to the repository can still result in permissions errors.
This includes the initial creation of the git-annex branch, and of course
any writes to the git-annex branch.

Sponsored-by: Dartmouth College's Datalad project
2021-12-27 13:21:15 -04:00
Joey Hess
5ff55f622d
improve sync message in export edge case
sync: Better error message when unable to export to a remote because
remote.name.annex-tracking-branch is configured to a ref that does not
exist.

It does not suggest how to fix the problem because there are several
possible solutions: Change the git config to point to something that does
exist, git add some files, or put files on the special remote that will be
imported and so populate the ref.

I considered just silently not doing anything, which is what it does
when annex-tracking-branch = master and nothing has been committed to
master yet. But it seems better to be explicit about it, since this is a
fairly confusing situation to find yourself in.

Sponsored-By: Max Thoursie on Patreon
2021-12-23 14:45:01 -04:00
Joey Hess
c2e46f4707
improve git command queue flushing with time limit
So that eg, addurl of several large files that take time to download will
update the index for each file, rather than deferring the index updates to
the end.

In cases like an add of many smallish files, where a new file is being
added every few seconds. In that case, the queue will still build up a
lot of changes which are flushed at once, for best performance. Since
the default queue size is 10240, often it only gets flushed once at the
end, same as before. (Notice that updateQueue updated _lastchanged
when adding a new item to the queue without flushing it; that is
necessary to avoid it flushing the queue every 5 minutes in this case.)

But, when it takes more than a 5 minutes to add a file, the overhead of
updating the index immediately is probably small, so do it after each
file. This avoids git-annex potentially taking a very very long time
indeed to stage newly added files, which can be annoying to the user who
would like to get on with doing something with the files it's already
added, eg using git mv to rename them to a better name.

This is only likely to cause a problem if it takes say, 30 seconds to
update the index; doing an extra 30 seconds of work after every 5
minute file add would be less optimal. Normally, updating the index takes
significantly less time than that. On a SSD with 100k files it takes
less than 1 second, and the index write time is bound by disk read and
write so is not too much worse on a hard drive. So I hope this will not
impact users, although if it does turn out to, the time limit could be
made configurable.

A perhaps better way to do it would be to have a background worker
thread that wakes up every 60 seconds or so and flushes the queue.
That is made somewhat difficult because the queue can contain Annex
actions and so this would add a new source of concurrency issues.
So I'm trying to avoid that approach if possible.

Sponsored-by: Erik Bjäreholt on Patreon
2021-12-14 12:23:19 -04:00
Joey Hess
dbba231e06
Improve error message display when autoinit fails
Due to eg, a permissions problem.
2021-12-09 14:38:12 -04:00
Joey Hess
4b19626a36
Fix build with ghc 9.0.1
Continuing along the same lines as commit
2739adc258, it seems that
while Remote -> Retriever expands to the same data type this changes
it to, ghc 9.0.1 refuses to consider them equiviant. I guess it has
something to do with the forall?

The rest of the build all succeeds, although the stack build then crashes:
Linking .stack-work/dist/x86_64-linux-tinfo6/Cabal-3.4.0.0/build/git-annex/git-annex ...
Completed 233 action(s).
Prelude.chr: bad argument: 2214592520
This issue seems likely to be about it:
https://github.com/commercialhaskell/stack/pull/5508
I'm building with stack from debian, version 2.3.3, so a newer stack
probably avoids that. Anyway, despite that stack problem,
the git-annex binary is built, and works.

The stack.yaml I used for this build was patched as follows:

diff --git a/stack.yaml b/stack.yaml
index 8dac87c15..62c4b5b9d 100644
--- a/stack.yaml
+++ b/stack.yaml
@@ -1,6 +1,6 @@
 flags:
   git-annex:
-    production: true
+    production: false
     assistant: true
     pairing: true
     torrentparser: true
@@ -14,7 +14,7 @@ flags:
     httpclientrestricted: true
 packages:
 - '.'
-resolver: lts-18.13
+resolver: nightly-2021-09-07
 extra-deps:
 - IfElse-0.85
 - aws-0.22

Sponsored-by: Graham Spencer on Patreon
2021-12-08 15:08:02 -04:00
Joey Hess
ae4c56b28a
Revert "fix too early close of shared lock file"
This reverts commit 66b2536ea0.

I misunderstood commit ac56a5c2a0
and caused a FD leak when pid locking is not used.

A LockHandle contains an action that will close the underlying lock
file, and that action is run when it is closed. In the case of a shared
lock, the lock file is opened once for each LockHandle, and only
the one for the LockHandle that is being closed will be closed.
2021-12-06 12:51:28 -04:00
Joey Hess
ed0afbc36b
avoid concurrent threads trying to take pid lock at same time
Seem there are several races that happen when 2 threads run PidLock.tryLock
at the same time. One involves checkSaneLock of the side lock file, which may
be deleted by another process that is dropping the lock, causing checkSaneLock
to fail. And even with the deletion disabled, it can still fail, Probably due
to linkToLock failing when a second thread overwrites the lock file.

The same can happen when 2 processes do, but then one process just fails
to take the lock, which is fine. But with 2 threads, some actions where failing
even though the process as a whole had the pid lock held.

Utility.LockPool.PidLock already maintains a STM lock, and since it uses
LockShared, 2 threads can hold the pidlock at the same time, and when
the first thread drops the lock, it will remain held by the second
thread, and so the pid lock file should not get deleted until the last
thread to hold it drops the lock. Which is the right behavior, and why a
LockShared STM lock is used in the first place.

The problem is that each time it takes the STM lock, it then also calls
PidLock.tryLock. So that was getting called repeatedly and concurrently.

Fixed by noticing when the shared lock is already held, and stop calling
PidLock.tryLock again, just use the pid lock that already exists then.

Also, LockFile.PidLock.tryLock was deleting the pid lock when it failed
to take the lock, which was entirely wrong. It should only drop the side
lock.

Sponsored-by: Dartmouth College's Datalad project
2021-12-01 17:14:39 -04:00
Joey Hess
66b2536ea0
fix too early close of shared lock file
This fixes a reversion introduced in commit
ac56a5c2a0.

I didn't notice there that it was handling the case of a shared lock
file that was still open elsewhere by not running the close action.

This was especially deadly when annex.pidlock is set, as it caused early
deletion of the pid lock file.

Sponsored-by: Dartmouth College's Datalad project
2021-12-01 17:06:28 -04:00
Joey Hess
567f63ba47
export: Avoid unncessarily re-exporting non-annexed files that were already exported
Commit b6e4ed9aa7 made non-annexed files
be re-uploaded every time, since they're not tracked in the location log,
and it made it check the location log. Don't do that for non-annexed files.

Sponsored-by: Brock Spratlen on Patreon
2021-11-29 14:02:38 -04:00
Joey Hess
01a5ee6998
addurl, youtube-dl: When --check-raw prevents downloading an url, still continue with any downloads that come after it, rather than erroring out
Sponsored-By: Mark Reidenbach on Patreon
2021-11-28 19:40:06 -04:00
Joey Hess
1d513540e9
Fix build with old versions of feed library 2021-11-23 16:06:51 -04:00
Joey Hess
74fcc389d8
releasing package git-annex version 8.20211123 2021-11-23 15:20:24 -04:00
Joey Hess
5a7f253974
support git 2.34.0's handling of merge conflict between annexed and non-annexed file
This version of git -- or its new default "ort" resolver -- handles such
a conflict by staging two files, one with the original name and the other
named file~ref. Use unmergedSiblingFile when the latter is detected.

(It doesn't do that when the conflict is between a directory and a file
or symlink though, so see previous commit for how that case is handled.)

The sibling file has to be deleted separately, because cleanConflictCruft
may not delete it -- that only handles files that are annex links,
but the sibling file may be the non-annexed file side of the conflict.

The graftin code had assumed that, when the other side of a conclict
is a symlink, the file in the work tree will contain the non-annexed
content that we want it to contain. But that is not the case with the new
git; the file may be the annex link and needs to be replaced with the
content, while the annex link will be written as a -variant file.

(The weird doesDirectoryExist check in graftin turns out to still be
needed, test suite failed when I tried to remove it.)

Test suite passes with new git with ort resolver default. Have not tried it
with old git or other defaults.

Sponsored-by: Noam Kremen on Patreon
2021-11-22 16:10:24 -04:00
Joey Hess
766720d093
soften language in changelog
This bug mostly would happen when the downloads ran very fast or were
all failing (how I reproduced it), because there have to be two
downloads that finish very close to the same time to trigger the race.

So most users of -J probably would not see much impact from the bug.
2021-11-19 12:52:22 -04:00
Joey Hess
623a775609
fix cat-file leak in get with -J
Bugfix: When -J was enabled, getting files leaked a ever-growing number of
git cat-file processes.

(Since commit dd39e9e255)

The leak happened when mergeState called stopNonConcurrentSafeCoProcesses.
While stopNonConcurrentSafeCoProcesses usually manages to stop everything,
there was a race condition where cat-file processes were leaked. Because
catFileStop modifies Annex.catfilehandles in a non-concurrency safe way,
and could clobber modifications made in between. Which should have been ok,
since originally catFileStop was only used at shutdown.

Note the comment on catFileStop saying it should only be used when nothing
else is using the handles. It would be possible to make catFileStop
race-safe, but it should just not be used in a situation where a race is
possible. So I didn't bother.

Instead, the fix is just not to stop any processes in mergeState. Because
in order for mergeState to be called, dupState must have been run, and it
enables concurrency mode, stops any non-concurrent processes, and so all
processes that are running are concurrency safea. So there is no need to
stop them when merging state. Indeed, stopping them would be extra work,
even if there was not this bug.

Sponsored-by: Dartmouth College's Datalad project
2021-11-19 12:51:08 -04:00
Joey Hess
31be0770a5
importfeed: Display url before starting youtube-dl download
It was displaying a blank line before.
2021-11-17 13:23:55 -04:00
Joey Hess
c3af94eff4
releasing package git-annex version 8.20211117 2021-11-17 12:20:29 -04:00
Joey Hess
2bd778a46e
importfeed: Fix a crash when used in a non-unicode locale
See comment for analysis.

At first I thought I'd need to convert all T.unpack in git-annex, but
luckily not -- so long as the Text is read from a file, the filesystem
encoding is applied and T.unpack is fine. It's only when using Feed
that the filesystem encoding is not applied.

While this fixes the crash, it does result in some mojibake, eg:
itemid=http://www.manager-tools.com/2014/01/choosing-a-company-work-chapter-7-���-questions/

Have not tracked that down, but it must be unrelated, because
I've verified that it roundtrips when using encodeUf8:

joey@darkstar:~/src/git-annex>LANG=C ghci  Utility/FileSystemEncoding.hs
ghci> useFileSystemEncoding
ghci> Just f <- Text.Feed.Import.parseFeedFromFile "/home/joey/tmp/career_tools_podcasts.xml"
ghci> Just (_, x) = Text.Feed.Query.getItemId (Text.Feed.Query.feedItems f !! 0)
ghci> decodeBS (Data.Text.Encoding.encodeUtf8 x)
"http://www.manager-tools.com/2014/01/choosing-a-company-work-chapter-7-\56546\56448\56467-questions/"
ghci> writeFile "foo" $ decodeBS (Data.Text.Encoding.encodeUtf8 x)
Writes a file containing the ENDASH character.

Sponsored-by: Jochen Bartl on Patreon
2021-11-15 15:04:21 -04:00
Joey Hess
aa6e54ac6e
Fix a typo in the name of youtube-dl (reversion introduced in version 8.20210903) 2021-11-13 08:58:36 -04:00
Joey Hess
51b73ea1fc
migrate: New --remove-size option
While intended for converting URL keys added by addurl --fast to be
as if added by addurl --relaxed, it can also be used to remove size
from other types of keys. Although that is not likely to be useful
for checksummed keys, I suppose it could be used for WORM or other
non-checksum keys.

Specifying the --remove-size option does not prevent other migrations
from taking effect if there's a key upgrade to perform, or if the
backend has changed. So --backend=URL needs to be used to prevent
migrating an URL key to the default backend.

Note that it's not possible to use git-annex migrate to convert from a
non-URL key to an URL key, as URL keys cannot be generated, except by
addurl. So while this can get the same effect as --relaxed would have
when addurl --fast was used, when --fast was not used, it won't work, or
if --backend=URL is not used will remove the size but not prevent
checksum verification, which is not useful. Due to this complexity, I
decided not to mention it in the git-annex addurl man page.

Sponsored-by: Jochen Bartl on Patreon
2021-11-12 13:28:28 -04:00
Joey Hess
f3326b8b5a
git-lfs gitlab interoperability fix
git-lfs: Fix interoperability with gitlab's implementation of the git-lfs
protocol, which requests Content-Encoding chunked.

Sponsored-by: Dartmouth College's Datalad project
2021-11-10 13:51:11 -04:00
Joey Hess
9d3ce224e3
uninit edge cases
* uninit: Avoid error message when no commits have been made to the
  repository yet.
* uninit: Avoid error message when there is no git-annex branch.

Sponsored-by: Svenne Krap on Patreon
2021-11-08 16:47:00 -04:00
Joey Hess
68257e9076
add git-annex filter-process
filter-process: New command that can make git add/checkout faster when
there are a lot of unlocked annexed files or non-annexed files, but that
also makes git add of large annexed files slower.

Use it by running: git
config filter.annex.process 'git-annex filter-process'

Fully tested and working, but I have not benchmarked it at all.
And, incremental hashing is not done when git add uses it, so extra work is
done in that case.

Sponsored-by: Mark Reidenbach on Patreon
2021-11-04 15:02:36 -04:00
Joey Hess
438e5b56aa
tighter --json parsing for metadata
metadata --batch --json: Reject input whose "fields" does not consist of
arrays of strings. Such invalid input used to be silently ignored.

Used to be that parseJSON for a JSONActionItem ran parseJSON separately
for the itemAdded, and if that failed, did not propagate the error. That
allowed different items with differently named fields to be parsed.
But it was actually only used to parse "fields" for metadata, so that
flexability is not needed.

The fix is just to parse "fields" as-is. AddJSONActionItemFields is needed
only because of the wonky way Command.MetaData adds onto the started json
object.

Note that this line got a dummy type signature added,
just because the type checker needs it to be some type.
itemFields = Nothing :: Maybe Bool
Since it's Nothing, it doesn't really matter what type it is,
and the value gets turned into json and is then thrown away.

Sponsored-by: Kevin Mueller on Patreon
2021-11-01 14:42:37 -04:00
Joey Hess
80f1354685
metadata --batch: Avoid crashing when a non-annexed file is input
Turns out that CommandStart actions do not have their exceptions caught,
which is why the giveup was causing a crash. Mostly these actions
do not do very much work on their own, but it does seem possible there
are other commands whose CommandStart also throws an exception.

So, my first attempt at a fix was to catch those exceptions. But,
--json-error-messages then causes a difficulty, because in order to output
a json error message, an action needs to have been started; that sets up
the json object that the error message will be included in a field of.

While it would be possible to output an object with just an error field,
this would be json output of a format that the user has no reason to
expect, that happens only in an exceptional circumstance. That is something
I have always wanted to avoid with the json output; while git-annex man
pages don't document what the json looks like, the output has always
been made to be self-describing. Eg, it includes "error-messages":[]
even when there's no errors.

With that ruled out, it doesn't seem a good idea to catch CommandStart
exceptions and display the error to stderr when --json-error-messages
is set. And so I don't know if it makes sense to catch exceptions from that
at all. Maybe I'd have a different opinion if --json-error-messages did not
exist though.

So instead, output a blank line like other batch commands do.
This also leaves open the possibility of implementing support for matching
object with metadata --json, which would also want to output a blank line
when the input didn't match.

Sponsored-by: Dartmouth College's DANDI project
2021-11-01 13:40:43 -04:00
Joey Hess
c260833a6b
releasing package git-annex version 8.20211028 2021-10-28 12:00:56 -04:00
Joey Hess
eb95ed4863
fix addurl concurrency issue
addurl: Support adding the same url to multiple files at the same time when
using -J with --batch --with-files.

Implementation was easier than expected, was able to reuse OnlyActionOn.

While it will download the url's content multiple times, that seems like
the best thing to do; see my comment for why.

Sponsored-by: Dartmouth College's DANDI project
2021-10-27 16:15:41 -04:00
Joey Hess
669037862a
avoid redundant freezeContent call
This opens the potential for the object file to be in place but
git-annex is interrupted before it can freeze it. git-annex fsck already
fixes that situation, which can also occur when lockContentForRemoval
thaws content.

Also improve comment to not be Windows-specific.
2021-10-27 14:18:10 -04:00
Joey Hess
0756625e1b
update, bugfix also fixed git-annex info 2021-10-27 12:22:02 -04:00
Joey Hess
b2c48fb86b
Fix using lookupkey inside a subdirectory
Caused by dirContains ".." "foo" being incorrectly False.

Also added a test of dirContains, which includes all the previous bug fixes
I could find and some obvious cases.

Reversion in version 8.20211011

Sponsored-by: Brett Eisenberg on Patreon
2021-10-26 15:00:45 -04:00
Joey Hess
5a9e6b1fd4
when private journal file exists, still read from git-annex branch
Fix bug that caused stale git-annex branch information to read when
annex.private or remote.name.annex-private is set.

The private journal file should not prevent reading more current
information from the git-annex branch, but used to.

Note that, overBranchFileContents has to do additional work now, when
there's a private journal file, it reads from the branch redundantly
and more slowly.

Sponsored-by: Jack Hill on Patreon
2021-10-26 13:43:50 -04:00
Joey Hess
2801528eb2
oops, I misread, still happens for adjusted branches 2021-10-20 13:45:56 -04:00
Joey Hess
f7b5a5c9ed
changelog
A user tested 0f38ad9a69 on WSL, and it
seems to have fixed the problem.
2021-10-20 13:26:01 -04:00
Joey Hess
f4bdecc4ec
improve sqlite MultiWriter handling of read after write
This removes a messy caveat that was easy to forget and caused at least one
bug. The price paid is that, after a write to a MultiWriter db, it has to
close the db connection that it had been using to read, and open a new
connection. So it might be a little bit slower. But, writes are usually
batched together, so there's often only a single write, and so there should
not be much of a slowdown. Notice that SingleWriter already closed the db
connection after a write, so paid the same overhead.

This is the second try at fixing a bug: git-annex get when run as the first
git-annex command in a new repo did not populate all unlocked files.
(Reversion in version 8.20210621)

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-10-19 15:13:29 -04:00
Joey Hess
29d687dce9
When retrival from a chunked remote fails, display the error that occurred when downloading the chunk
Rather than the error that occurred when trying to download the unchunked
content, which is less likely to actually be stored in the remote.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-10-14 12:45:05 -04:00
Joey Hess
b36cc0320e
avoid crashing tilde expansion on user who does not exist
git does not crash when there's a remote configured for a user who does
not exist, and this prevents git-annex from crashing too. Consider that
a user might exist on one system but not another, and the git repo be
moved between systems. So not crashing is desirable.

Note that git fetch seems to mishandle a remote path like ~foo/bar
when the user does not exist. While it does access ./~foo/bar,
and gets as far as running git-upload-pack on the path,
it then complains there is no such repo. So different parts of git seem
to be doing different things in that edge case. Anyway, git-annex does
not need to be bug-for-bug compatible with git.

Sponsored-by: Jack Hill on Patreon
2021-10-13 09:16:36 -04:00
Joey Hess
c2a44eab50
move gpg tmp home to system temp dir
test: Put gpg temp home directory in system temp directory, not filesystem
being tested.

Since I've found indications gpg can fail talking to the agent when the
socket ends up on eg, fat. And to hopefully fix this bug report I've
followed up on.

The main risk in using the system temp dir is that TMPDIR could be set to a
long directory path, which is too long to put a unix socket in. To
partially amelorate that risk, it uses either an absolute or a relative
path, whichever is shorter. (Hopefully gpg will not convert it to a longer
form of the path..)

If the user sets TMPDIR to something so long a path to it +
"S.gpg-agent" is too long, I suppose that's their issue to deal with.

Sponsored-by: Dartmouth College's Datalad project
2021-10-12 13:29:56 -04:00
Joey Hess
17a0fa3dbc
negotiate P2P protocol version for tor remotes
This negotiation is not supported by versions of git-annex older
than 6.20180312. Well, maybe really 6.20180227 or so, but using that in
the changelog simplifies things since it was the version for the other
changes as well.

See commit c81768d425 for the back story.

As well as allowing for future protocol improvements, this will result
in negoatiating protocol version 1, which is an improvement over default
version 0.

In fact, it looks like no supported version of git-annex will use
protocol version 0, since version 1 was introduced in 6.20180227.
Still, removing the code for version 0 seems unncessary.
See commit 31e1adc005.

Sponsored-by: Brett Eisenberg on Patreon.
2021-10-11 15:58:51 -04:00
Joey Hess
f8816d2b92
remove list of removed commands
the list was wrong and also users shouldn't need to know
2021-10-11 15:43:19 -04:00
Joey Hess
7bdc7350a5
remove git-annex-shell compat code
* Removed support for accessing git remotes that use versions of
  git-annex older than 6.20180312.
* git-annex-shell: Removed several commands that were only needed to
  support git-annex versions older than 6.20180312.
  (lockcontent, recvkey, sendkey, transferinfo, commit)

The P2P protocol was added in that version, and used ever since, so
this code was only needed for interop with older versions.

"git-annex-shell commit" is used by newer git-annex versions, though
unnecessarily so, because the p2pstdio command makes a single commit at
shutdown. Luckily, it was run with stderr and stdout sent to /dev/null,
and non-zero exit status or other exceptions are caught and ignored. So,
that was able to be removed from git-annex-shell too.

git-annex-shell inannex, recvkey, sendkey, and dropkey are still used by
gcrypt special remotes accessed over ssh, so those had to be kept.
It would probably be possible to convert that to using the P2P protocol,
but it would be another multi-year transition.

Some git-annex-shell fields were able to be removed. I hoped to remove
all of them, and the very concept of them, but unfortunately autoinit
is used by git-annex sync, and gcrypt uses remoteuuid.

The main win here is really in Remote.Git, removing piles of hairy fallback
code.

Sponsored-by: Luke Shumaker
2021-10-11 15:36:51 -04:00
Joey Hess
e28cf82b45
releasing package git-annex version 8.20211011 2021-10-11 12:53:17 -04:00
Joey Hess
022bb6174c
Merge branch 'borgchunks' 2021-10-08 13:26:45 -04:00
Joey Hess
69f8e6c7c0
ImportableContentsChunkable
This improves the borg special remote memory usage, by
letting it only load one archive's worth of filenames into memory at a
time, and building up a larger tree out of the chunks.

When a borg repository has many archives, git-annex could easily OOM
before. Now, it will use only memory proportional to the number of
annexed keys in an archive.

Minor implementation wart: Each new chunk re-opens the content
identifier database, and also a new vector clock is used for each chunk.
This is a minor innefficiency only; the use of continuations makes
it hard to avoid, although putting the database handle into a Reader
monad would be one way to fix it.

It may later be possible to extend the ImportableContentsChunkable
interface to remotes that are not third-party populated. However, that
would perhaps need an interface that does not use continuations.

The ImportableContentsChunkable interface currently does not allow
populating the top of the tree with anything other than subtrees. It
would be easy to extend it to allow putting files in that tree, but borg
doesn't need that so I left it out for now.

Sponsored-by: Noam Kremen on Patreon
2021-10-08 13:15:22 -04:00
Joey Hess
1c11dd4793
avoid cursor jitter when updating progress display
When the progress display gets longer, and then shorter again, it causes
the cursor to jitter back and forth. Somehow I never noticed this until
this morning, but then it became intolerable to watch.

To fix it, pad the progress display to the maximum length it's occupied.

Sponsored-by: Svenne Krap on Patreon
2021-10-07 11:16:41 -04:00
Joey Hess
45dfddd33f
convert ExportLocation to ShortByteString to avoid PINNED memory fragmentation
This adds the overhead of a copy whenever converting to/from ExportLocation and
ImportLocation.

borg: Some improvements to memory use when importing a lot of archives.
(It's still pretty bad.)

Sponsored-by: Mark Reidenbach on Patreon
2021-10-05 14:51:55 -04:00
Joey Hess
9012fa0187
reinject: Fix crash when reinjecting a file from outside the repository
Commit 4bf7940d6b introduced this
problem, but was otherwise doing a good thing. Problem being
that fileRef "/foo" used to return ":./foo", which was actually wrong,
but as long as there was no foo in the local repository, catKey
could operate on it without crashing. After that fix though, fileRef
would return eg "../../foo", resulting in fileRef returning
":./../../foo", which will make git cat-file crash since that's
not a valid path in the repo.

Fix is simply to make fileRef detect paths outside the repo and return
Nothing. Then catKey can be skipped. This needed several bugfixes to
dirContains as well, in previous commits.

In Command.Smudge, this led to needing to check for Nothing. That case
should actually never happen, because the fileoutsiderepo check will
detect it earlier.

Sponsored-by: Brock Spratlen on Patreon
2021-10-01 14:06:34 -04:00
Joey Hess
b9a1cc512d
avoid uncessary call to inAnnex
sync --content: Avoid a redundant checksum of a file that was
incrementally verified, when used on NTFS and perhaps other filesystems.

When sync has just gotten the content, it does not need to check inAnnex a
second time. On NTFS, for some reason the write of the inode cache after
it gets the content is not immediately able to be read, and with an
empty/non-matching inode cache due to that stale data, inAnnex falls back
to hashing the whole object to determine if it's present.

Sponsored-by: Brock Spratlen on Patreon
2021-10-01 12:02:35 -04:00
Joey Hess
b9aa2ce8d1
resume properly when copying a file to/from a local git remote is interrupted (take 2)
This method avoids breaking test_readonly. Just check if the dest file
exists, and avoid CoW probing when it does, so when CoW probing fails,
it can resume where the previous non-CoW copy left off.

If CoW has been probed already to work, delete the dest file
since a CoW copy will presumably work. It seems like it would be almost
as good to just skip CoW copying in this case too, but consider that the
dest file might have started to be copied from some other remote, not
using CoW, but CoW has been probed to work to copy from the current
place.

Sponsored-by: Dartmouth College's Datalad project
2021-09-27 16:03:01 -04:00
Joey Hess
7ccf642863
revert change that broke test_readonly
commit 63d508e885 broke test_readonly.
When a local git remote is readonly, tryCopyCoW run to copy a file
from it failed at withOtherTmp.

Sponsored-by: Dartmouth College's Datalad project
2021-09-27 16:02:41 -04:00
Joey Hess
9ea8106bb0
sped up git-annex smudge --clean by 25%
Disabling git-annex branch update for this command is
ok, because it does not use any information from the branch,
but only logs the location when it adds a key.

Sponsored-by: Dartmouth College's Datalad project
2021-09-24 14:15:20 -04:00
Joey Hess
e8496d62e4
improved bwrate limiting implementation
New method is much better. Avoids unrestrained transfer at the beginning
(except for the first block. Keeps right at or a few kb/s below the
configured limit, with very little varation in the actual reported bandwidth.

Removed the /s part of the config as it's not needed.

Ready to merge.

Sponsored-by: Luke Shumaker on Patreon
2021-09-22 15:27:16 -04:00
Joey Hess
05a097cde8
Merge branch 'master' into bwlimit 2021-09-22 10:48:27 -04:00
Joey Hess
55b405a965
fix remote git config vs global git config order
Bug fix: Git configs such as annex.verify were incorrectly overriding
per-remote git configs such as remote.name.annex-verify.

This dates all the way back to 2013,
commit 8a5b397ac4,
where hlint apparently somehow confused me into parsing in the wrong
order. Before that it was correct.

Amazing noone has noticed until now.

Sponsored-by: Kevin Mueller on Patreon
2021-09-22 10:41:56 -04:00
Joey Hess
63d508e885
resume properly when copying a file to/from a local git remote is interrupted
Probably this fixes a reversion, but I don't know what version broke it.

This does use withOtherTmp for a temp file that could be quite large.
Though albeit a reflink copy that will not actually take up any space
as long as the file it was copied from still exists. So if the copy cow
succeeds but git-annex is interrupted just before that temp file gets
renamed into the usual .git/annex/tmp/ location, there is a risk that
the other temp directory ends up cluttered with a larger temp file than
later. It will eventually be cleaned up, and the changes of this being
a problem are small, so this seems like an acceptable thing to do.

Sponsored-by: Shae Erisson on Patreon
2021-09-21 17:43:35 -04:00
Joey Hess
18e00500ce
bwlimit
Added annex.bwlimit and remote.name.annex-bwlimit config that works for git
remotes and many but not all special remotes.

This nearly works, at least for a git remote on the same disk. With it set
to 100kb/1s, the meter displays an actual bandwidth of 128 kb/s, with
occasional spikes to 160 kb/s. So it needs to delay just a bit longer...
I'm unsure why.

However, at the beginning a lot of data flows before it determines the
right bandwidth limit. A granularity of less than 1s would probably improve
that.

And, I don't know yet if it makes sense to have it be 100ks/1s rather than
100kb/s. Is there a situation where the user would want a larger
granularity? Does granulatity need to be configurable at all? I only used that
format for the config really in order to reuse an existing parser.

This can't support for external special remotes, or for ones that
themselves shell out to an external command. (Well, it could, but it
would involve pausing and resuming the child process tree, which seems
very hard to implement and very strange besides.) There could also be some
built-in special remotes that it still doesn't work for, due to them not
having a progress meter whose displays blocks the bandwidth using thread.
But I don't think there are actually any that run a separate thread for
downloads than the thread that displays the progress meter.

Sponsored-by: Graham Spencer on Patreon
2021-09-21 16:58:10 -04:00
Joey Hess
9f38ecac1e
borg: Avoid trying to extract xattrs, ACLS, and bsdflags
when retrieving from a borg repository

That broke restoring on linux from a borg backup made on OSX.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-09-03 12:10:14 -04:00
Joey Hess
1a586e473b
releasing package git-annex version 8.20210903 2021-09-03 12:01:12 -04:00
Joey Hess
ec12537774
defer write permissions checking in import until after copy to repo
This should complete the fix started in
6329997ac4, fixing the actual cause of the
test suite failure this time.

Sponsored-by: Dartmouth College's Datalad project
2021-09-02 13:45:21 -04:00
Joey Hess
4f42292b13
improve url download failure display
* When downloading urls fail, explain which urls failed for which
  reasons.
* web: Avoid displaying a warning when downloading one url failed
  but another url later succeeded.

Some other uses of downloadUrl use urls that are effectively internal use,
and should not all be displayed to the user on failure. Eg, Remote.Git
tries different urls where content could be located depending on how the
remote repo is set up. Exposing those urls to the user would lead to wild
goose chases. So had to parameterize it to control whether it displays urls
or not.

A side effect of this change is that when there are some youtube urls
and some regular urls, it will try regular urls first, even if the
youtube urls are listed first. This seems like an improvement if
anything, but in any case there's no defined order of urls that it's
supposed to use.

Sponsored-by: Dartmouth College's Datalad project
2021-09-01 15:33:38 -04:00
Joey Hess
837116ef1e
Fix support for readonly git remotes
Boolean blindness oops.

(Reversion in version 8.20210621)

Sponsored-by: Dartmouth College's Datalad project
2021-08-30 12:34:19 -04:00
Joey Hess
a99a84f342
add: Detect when xattrs or perhaps ACLs prevent locking down a file's content
And fail with an informative message.

I don't think ACLs can prevent removing the write bit, but I'm not sure,
so kept it mentioning them as a possibility.

Should git-annex lock also check if the write bits are able to be removed?
Maybe, but the case I know about with xattrs involves cp -a copying NFS
xattrs, and it's the copy of the file that is the problem. So when locking
a file, I guess it will not be the copy.

Sponsored-by: Dartmouth College's Datalad project
2021-08-27 14:33:01 -04:00
Joey Hess
e17342b2a0
Run cp -a with --no-preserve=xattr, to avoid problems with copied xattrs
Including them breaking permissions setting on some NFS servers.

Sponsored-by: Dartmouth College's Datalad project
2021-08-27 13:09:34 -04:00
Joey Hess
6d4a728455
Added annex.youtube-dl-command config
This can be used to run some forks of youtube-dl.

Sponsored-by: Brett Eisenberg on Patreon
2021-08-27 09:44:23 -04:00
Joey Hess
ab7b5a492c
--batch-keys
New --batch-keys option added to these commands:  get, drop, move, copy, whereis

git-annex-matching-options had to be reworded since some of its options
can be used to match on keys, not only files.

Sponsored-by: Luke Shumaker on Patreon
2021-08-25 14:21:12 -04:00
Joey Hess
4ed36b2634
Fix test suite failure on Windows
It would be better if the Arbitrary instance avoided generating impossible
filenames like "foo/c:bar", but proably this is the only place that splits
the file from the directory and then uses the file without the directory..
At least on the quickcheck properties.

Sponsored-by: Svenne Krap on Patreon
2021-08-24 14:03:29 -04:00
Joey Hess
f9b92c81f6
unused: Skip the refs/annex/last-index ref that git-annex recently started creating
This was unlikely to cause any problem, but it is unsightly to mention
normally hidden refs, and it might have done a bit of unnecessary work to
check that ref.

Sponsored-by: Noam Kremen on Patreon
2021-08-24 12:58:14 -04:00
Joey Hess
53744e132d
incremental verification for gitlfs and httpalso
And that should be all the special remotes supporting it on linux now,
except for in the odd edge case here and there.

Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:17:10 -04:00
Joey Hess
f5e09a1dbe
incremental verification for S3
Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:07:00 -04:00
Joey Hess
d154e7022e
incremental verification for web special remote
Except when configuration makes curl be used. It did not seem worth
trying to tail the file when curl is downloading.

But when an interrupted download is resumed, it does not read the whole
existing file to hash it. Same reason discussed in
commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long
time with no progress being displayed. And also there's an open http
request, which needs to be consumed; taking a long time to hash the file
might cause it to time out.

Also in passing implemented it for git and external special remotes when
downloading from the web. Several others like S3 are within striking
distance now as well.

Sponsored-by: Dartmouth College's DANDI project
2021-08-18 15:02:22 -04:00
Joey Hess
8613770b06
incremental verify for webdav special remote
Sponsored-by: Dartmouth College's DANDI project
2021-08-16 17:29:32 -04:00
Joey Hess
b1622eb932
incremental verify for directory special remote
Added fileRetriever', which will let the remaining special remotes
eventually also support incremental verify.

Sponsored-by: Dartmouth College's DANDI project
2021-08-16 16:51:33 -04:00
Joey Hess
ec82299730
status update
I was wrong about S3 supporting tailVerify.
2021-08-16 15:15:32 -04:00
Joey Hess
dadbb510f6
incremental hashing for fileRetriever
It uses tailVerify to hash the file while it's being written.

This is able to sometimes avoid a separate checksum step. Although
if the file gets written quickly enough, tailVerify may not see it
get created before the write finishes, and the checksum still happens.

Testing with the directory special remote, incremental checksumming did
not happen. But then I disabled the copy CoW probing, and it did work.
What's going on with that is the CoW probe creates an empty file on
failure, then deletes it, and then the file is created again. tailVerify
will open the first, empty file, and so fails to read the content that
gets written to the file that replaces it.

The directory special remote really ought to be able to avoid needing to
use tailVerify, and while other special remotes could do things that
cause similar problems, they probably don't. And if they do, it just
means the checksum doesn't get done incrementally.

Sponsored-by: Dartmouth College's DANDI project
2021-08-13 15:43:29 -04:00
Joey Hess
7eb3742e4b
incremental verify for chunked remotes
Simply feed each chunk in turn to the incremental verifier.

When resuming an interrupted retrieve, it does not do incremental
verification. That would need to read the file, up to the resume point,
and feed it to the incremental verifier. That seems easy to get wrong.
Also it would mean extra work done before the transfer can start. Which
would complicate displaying progress, and would perhaps not appear to the
user as if it was resuming from where it left off. Instead, in that
situation, return UnVerified, and let the verification be done in a
separate pass.

Granted, Annex.CopyFile does manage all that, but it's not complicated
by dealing with chunks too.

Sponsored-by: Dartmouth College's DANDI project
2021-08-11 14:42:49 -04:00
Joey Hess
c20358b671
incremental verify for byteRetriever special remotes
Several special remotes verify content while it is being retrieved,
avoiding a separate checksum pass. They are: S3, bup, ddar, and
gcrypt (with a local repository).

Not done when using chunking, yet.

Complicated by Retriever needing to change to be polymorphic. Which in turn
meant RankNTypes is needed, and also needed some code changes. The
change in Remote.External does not change behavior at all but avoids
the type checking failing because of a "rigid, skolem type" which
"would escape its scope". So I refactored slightly to make the type
checker's job easier there.

Unfortunately, directory uses fileRetriever (except when chunked),
so it is not amoung the improved ones. Fixing that would need a way for
FileRetriever to return a Verification. But, since the file retrieved
may be encrypted or chunked, it would be extra work to always
incrementally checksum the file while retrieving it. Hm.

Some other special remotes use fileRetriever, and so don't get incremental
verification, but could be converted to byteRetriever later. One is
GitLFS, which uses downloadConduit, which writes to the file, so could
verify as it goes. Other special remotes like web could too, but don't
use Remote.Helper.Special and so will need to be addressed separately.

Sponsored-by: Dartmouth College's DANDI project
2021-08-11 14:20:38 -04:00
Joey Hess
f1176f82a5
rsync special remote: Stop displaying rsync progress, and use git-annex's own progress display
Reasons are same as in commit cee14f147a.
(It was already done when using -J.)

Sponsored-by: Mark Reidenbach on Patreon
2021-08-09 12:06:10 -04:00
Joey Hess
1acdd18ea8
deal better with clock skew situations, using vector clocks
* Deal with clock skew, both forwards and backwards, when logging
  information to the git-annex branch.
* GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1)
  rather than needing to be advanced each time a new change is made.
* Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex.

When changing a file in the git-annex branch, the vector clock to use is now
determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK
when set), and comparing it to the newest vector clock already in use in
that file. If a newer time stamp was already in use, advance it forward by
a second instead.

When the clock is set to a time in the past, this avoids logging with
an old timestamp, which would risk that log line later being ignored in favor
of "newer" line that is really not newer.

When a log entry has been made with a clock that was set far ahead in the
future, this avoids newer information being logged with an older timestamp
and so being ignored in favor of that future-timestamped information.
Once all clocks get fixed, this will result in the vector clocks being
incremented, until finally enough time has passed that time gets back ahead
of the vector clock value, and then it will return to usual operation.

(This latter situation is not ideal, but it seems the best that can be done.
The issue with it is, since all writers will be incrementing the last
vector clock they saw, there's no way to tell when one writer made a write
significantly later in time than another, so the earlier write might
arbitrarily be picked when merging. This problem is why git-annex uses
timestamps in the first place, rather than pure vector clocks.)

Advancing forward by 1 second is somewhat arbitrary. setDead
advances a timestamp by just 1 picosecond, and the vector clock could
too. But then it would interfere with setDead, which wants to be
overrulled by any change. So it could use 2 picoseconds or something,
but that seems weird. It could just as well advance it forward by a
minute or whatever, but then it would be harder for real time to catch
up with the vector clock when forward clock slew had happened.

A complication is that many log files contain several different peices of
information, and it may be best to only use vector clocks for the same peice
of information. For example, a key's location log file contains
InfoPresent/InfoMissing for each UUID, and it only looks at the vector
clocks for the UUID that is being changed, and not other UUIDs.

Although exactly where the dividing line is can be hard to determine.
Consider metadata logs, where a field "tag" can have multiple values set
at different times. Should it advance forward past the last tag?
Probably. What about when a different field is set, should it look at
the clocks of other fields? Perhaps not, but currently it does, and
this does not seems like it will cause any problems.

Another one I'm not entirely sure about is the export log, which is
keyed by (fromuuid, touuid). So if multiple repos are exporting to the
same remote, different vector clocks can be used for that remote.
It looks like that's probably ok, because it does not try to determine
what order things occurred when there was an export conflict.

Sponsored-by: Jochen Bartl on Patreon
2021-08-04 12:33:46 -04:00
Joey Hess
899983058f
add: When adding a dotfile, avoid treating its name as an extension. 2021-08-03 12:22:58 -04:00
Joey Hess
9cae7c5bbf
releasing package git-annex version 8.20210803 2021-08-03 12:20:45 -04:00
Joey Hess
b3c4579c79
work around strange auto-init bug
git-annex get when run as the first git-annex command in a new repo did not
populate unlocked files. (Reversion in version 8.20210621)

I am not entirely happy with this, because I don't understand how
428c91606b caused the problem in the first
place, and I don't fully understand how skipping calling scanAnnexedFiles
during autoinit avoids the problem.

Kept the explicit call to scanAnnexedFiles during git-annex init,
so that when reconcileStaged is expensive, it can be made to run then,
rather than at some later point when the information is needed.

Sponsored-by: Brock Spratlen on Patreon
2021-07-30 18:36:03 -04:00
Joey Hess
66089e97de
Fix a rounding bug in display of data sizes
Eg, showImprecise 1 1.99 returned "1.1" rather than "2". The 9 rounded
upward to 10, and that was wrongly used as the decimal, rather than
carrying the 1.

Sponsored-by: Jack Hill on Patreon
2021-07-30 09:56:04 -04:00
Joey Hess
d2aead67bd
fsck: Detect and correct stale or missing inode caches for object files
An easy way to see this in action is to have an unlocked file, and touch the
object file.

While all code that compares inode caches for object files needs to be
prepared for this kind of problem and fall back to verification, having
fsck notice it and correct it is cheap (as long as fsck is being run
anyway) and ensures that if it happens for some unusual reason, there's a
way for the user to notice that it's happening.

Not that, when annex.thin is in use, the earlier call to isUnmodified
(and also potentially earlier calls to inAnnex in eg, verifyLocationLog)
will fix up the same problem silently. That might prevent the warning
being displayed, although probably it still will be, because the
Database.Keys write of the InodeCache will be queued but will not have
happened yet. I can't see a way to improve this, but it's not great.

Sponsored-by: Dartmouth College's Datalad project
2021-07-29 14:06:42 -04:00
Joey Hess
73e0cbbb19
fix problem populating pointer files
This is a result of an audit of every use of getInodeCaches,
to find places that misbehave when the annex object is not in the inode
cache, despite pointer files for the same key being in the inode cache.

Unfortunately, that is the case for objects that were in v7 repos that
upgraded to v8. Added a note about this gotcha to getInodeCaches.

Database.Keys.reconcileStaged, then annex.thin is set, would fail to
populate pointer files in this situation. Changed it to check if the
annex object is unmodified the same way inAnnex does, falling back to a
checksum if the inode cache is not recorded.

Sponsored-by: Dartmouth College's Datalad project
2021-07-27 14:26:49 -04:00
Joey Hess
3b5a3e168d
check if object is modified before starting to send it
Fix bug that caused some transfers to incorrectly fail with "content
changed while it was being sent", when the content was not changed.

While I don't know how to reproduce the problem that several people
reported, it is presumably due to the inode cache somehow being stale.
So check isUnmodified', and if it's not modified, include the file's
current inode cache in the set to accept, when checking for modification
after the transfer.

That seems like the right thing to do for another reason: The failure
says the file changed while it was being sent, but if the object file was
changed before the transfer started, that's wrong. So it needs to check
before allowing the transfer at all if the file is modified.

(Other calls to sameInodeCache or elemInodeCaches, when operating on inode
caches from the database, could also be problimatic if the inode cache is
somehow getting stale. This does not address such problems.)

Sponsored-by: Dartmouth College's Datalad project
2021-07-26 17:33:49 -04:00
Joey Hess
3d50b47ded
sync, merge: Added --allow-unrelated-histories option
Which is the same as the git merge option.

After last commit, this turns out to be needed in the test suite, and when
doing git-annex import from special remote, followed by a git-annex merge.

Sponsored-by: Svenne Krap on Patreon
2021-07-19 12:14:26 -04:00
Joey Hess
b6bea0d3f2
remove direct mode remnant of merging unrelated histories
sync, merge, post-receive: Avoid merging unrelated histories, which used to
be allowed only to support direct mode repositories.

(However, sync does still merge unrelated histories when importing trees
from special remotes, and the assistant still merges unrelated histories
always.)

See 556b2ded2b for why this was added
back in 2016, for direct mode.

This is a behavior change, which might break something that was relying
on sync merging unrelated histories, but git had a good reason to
prevent it, since it's easy to foot shoot with it, and git-annex should
follow suit.

Sponsored-by: Noam Kremen on Patreon
2021-07-19 11:41:26 -04:00
Joey Hess
33a80d083a
sync --quiet
* sync: When --quiet is used, run git commit, push, and pull without
  their ususual output.
* merge: When --quiet is used, run git merge without its usual output.

This might also make --quiet work better for some other commands
that make commits, like git-annex adjust.

Sponsored-by: Kevin Mueller on Patreon
2021-07-19 11:28:47 -04:00
Joey Hess
c952c485c8
Fix retrieval of content from borg repos accessed over ssh
It was making the borgrepo path absolute.. even when it was a ssh
repository.

Made BorgRepo a newtype, to guard against accidentially treating it like a
FilePath.

Sponsored-by: Graham Spencer on Patreon
2021-07-15 12:39:24 -04:00
Joey Hess
dd31fe7b9e
fall back to checking lower case hash directories in normal repo
Fix a bug that prevented getting content from a repository that started out
as a bare repository, or had annex.crippledfilesystem set, and was
converted to a non-bare repository.

This unfortunately means that inAnnex check gets slowed down by a stat call
in normal repos when the content is not present. Oh well, such is the cost
of backwards compatability with old mistakes.

Sponsored-by: Mark Reidenbach on Patreon
2021-07-15 12:16:31 -04:00
Joey Hess
47d3dccf19
whereused implemented
except --historical

Sponsored-by: Jack Hill on Patreon
2021-07-14 14:27:21 -04:00
Joey Hess
065db484e0
releasing package git-annex version 8.20210714 2021-07-14 12:23:24 -04:00
Joey Hess
8885bd3c5b
addistant: honor annex.delayadd for non-large files
assistant: When adding non-large files to git, honor annex.delayadd
configuration.

Also, don't add non-large files to git when they are still
being written to. This came for free, since the changes to non-large
files get queued up with the ones to large files, and run through the lsof
check.

Sponsored-by: Luke Shumaker on Patreon
2021-07-13 12:17:00 -04:00
Joey Hess
a6767ca81f
close bug
and mention another aspect of the reversion in changelog
2021-07-12 10:45:57 -04:00
Joey Hess
6a581f8b8b
fix init reversion when core.sharedRepository = group
init: Fix misbehavior when core.sharedRepository = group that caused it to
enter an adjusted branch. (Reversion in version 8.20210630)

Commit 4b1b9d7a83 made init call
freezeContent in case there was a hook that could prevent writing in
situations where perms don't. But with the above git config, freezeContent
does not prevent write at all. So init needs to do what freezeContent does
with a non-shared git config.

Or init could check for that config, and skip the probing, since it
won't actually be preventing write to any files. But that would make init
too aware if details of Annex.Perms, and also would break if the git config
were changed after init.

Sponsored-by: Dartmouth College's Datalad project
2021-07-12 10:15:49 -04:00
Joey Hess
b885007f0e
--debug output goes to stderr again, not stdout
Reversion in version 8.20210428

Sponsored-by: Dartmouth College's Datalad project
2021-07-12 09:40:38 -04:00
Joey Hess
b9db859221
addurl: Avoid crashing when used on beegfs.
Sponsored-by: Dartmouth College's DANDI project
2021-07-05 13:02:40 -04:00
Joey Hess
d2c48404a8
assistant: Avoid unncessary git repository repair
In a situation where git fsck gets confused about a commit that is made
while it's running.

Sponsored-by: Graham Spencer on Patreon
2021-06-30 18:00:16 -04:00
Joey Hess
fd99ce6c95
releasing package git-annex version 8.20210630 2021-06-30 11:48:33 -04:00
Joey Hess
73ccf34763
closing 2021-06-30 11:47:39 -04:00
Joey Hess
6b0d732746
repair: Fix reversion in version 8.20200522 that prevented fetching missing objects from remotes
In commit dfc4e641b5 git repair was changed
to use remote name, not url, when fetching. But it fetches into a temporary
git repo, which doesn't have remotes configured. Oops.

(In my defense, that commit was made just as covid lockdown started. But
testing? Urk.)

Sponsored-by: Mark Reidenbach on Patreon
2021-06-29 13:15:15 -04:00
Joey Hess
199391befe
make repair interruption safe
Fixed bug that interrupting git-annex repair (or assistant) while it was
fixing repository corruption would lose objects that were contained in pack
files.

Unpack all pack files and move objects into place *before* deleting the
pack files. The old approach moved the pack files to a temp directory
before unpacking them, which was not interruption safe.

Sponsored-By: Jochen Bartl on Patreon
2021-06-29 13:14:28 -04:00
Joey Hess
b8e32e200e
addurl, importfeed: Added --no-raw option
Forces eg, download with youtube-dl without falling back to raw download.

Since youtube-dl failing due to an url not being supported is difficult to
distinguish from it failing due to being blocked in some way, this can be
useful to avoid the fallback of git-annex downloading the raw web page and
adding that.

Since --raw also prevents using special remotes, --no-raw also
allows special remote downloads. Although it's always possible that some
special remote may claim an url and fall back to raw download of the
content, which --no-raw cannot prevent.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-06-27 11:14:51 -04:00
Joey Hess
3a14648142
dropping unused marks as dead
Dropping an object with drop --unused or dropunused will mark it as
dead, preventing fsck --all from complaining about it after it's been
dropped from all repositories.

If another repository still has a copy, it won't be treated as dead
until it's also dropped from there.

The drop has to use --unused, can't be --key or something else, because
this indicates that the user has recently ran git-annex unused. If it
checked the unused log on every drop, bad things would happen when the
unused log was out of date, eg a file used to be unused but then got
re-added. Marking such a file as dead could be confusing. When the user
uses --unused/dropunused, they must consider the unused information to be
up-to-date.

The particular workflow this enables is:

	git annex add foo
	git annex unannex foo
	git annex unused
	git annex drop --unused / dropunused
	git annex fsck --all # no warnings

The docs for git-annex unannex say to use git-annex unused and dropunused,
so the user should be pointed in this direction when they want to undo an
accidental add.

Sponsored-by: Brock Spratlen on Patreon
2021-06-25 15:22:26 -04:00
Joey Hess
df2001aa88
Improve display of errors when transfers fail
Transfers from or to a local git repo could fail without a reason being
given, if the content failed to verify, or if the object file's stat
changed while it was being copied. Now display messages in these cases.

Sponsored-by: Jack Hill on Patreon
2021-06-25 13:17:04 -04:00
Joey Hess
4b1b9d7a83
Added annex.freezecontent-command and annex.thawcontent-command configs
Freeze first sets the file perms, and then runs
freezecontent-command. Thaw runs thawcontent-command before
restoring file permissions. This is in case the freeze command
prevents changing file perms, as eg setting a file immutable does.
Also, changing file perms tends to mess up previously set ACLs.

git-annex init's probe for crippled filesystem uses them, so if file perms
don't work, but freezecontent-command manages to prevent write to a file,
it won't treat the filesystem as crippled.

When the the filesystem has been probed as crippled, the hooks are not
used, because there seems to be no point then; git-annex won't be relying
on locking annex objects down. Also, this avoids them being run when the
file perms have not been changed, in case they somehow rely on
git-annex's setting of the file perms in order to work.

Sponsored-by: Dartmouth College's Datalad project
2021-06-21 14:40:52 -04:00
Joey Hess
1cc7b2661e
push synced/master before synced/git-annex
sync: Partly work around github behavior that first branch to be pushed to
a new repository is assumed to be the head branch, by not pushing
synced/git-annex first.

github expects master (or whatever the name is) to be pushed first, but
git-annex sync can't, because it's got to also support pushes to non-bare
repos where pushing master fails, as explained in the big comment. So
pushing synced/master is not entirely a fix, but at least it makes github
default to a branch with the stuff the user expects in it, not a bunch of
annex log files.

Aside from fixing github to not make this assumption, or improving
the git push protocol to include what the current HEAD is, the only other
approach I can think of is to identify git push's progress messages and
display those when pushing master, while filtering out error messages
about non-fast-forward etc. But git doesn't provide a way to separate out
or identify its progress messages.

Sponsored-by: Luke Shumaker on Patreon
2021-06-21 12:32:21 -04:00
Joey Hess
a6e281e008
releasing package git-annex version 8.20210621 2021-06-21 12:17:46 -04:00
Joey Hess
d2be68907c
drop, move, mirror: when two files have the same content, honor the max numcopies and requiredcopies
Eg, before with a .gitattributes like:

*.2 annex.numcopies=2
*.1 annex.numcopies=1

And foo.1 and foo.2 having the same content and key, git-annex drop foo.1 foo.2
would succeed, leaving just 1 copy, despite foo.2 needing 2 copies.
It dropped foo.1 first and then skipped foo.2 since its content was gone.

Now that the keys database includes locked files, this longstanding wart
can be fixed.

Sponsored-by: Noam Kremen on Patreon
2021-06-15 11:38:44 -04:00
Joey Hess
af9fdf5dba
verify associated files when checking numcopies
Most of this is just refactoring. But, handleDropsFrom
did not verify that associated files from the keys db were still
accurate, and has now been fixed to.

A minor improvement to this would be to avoid calling catKeyFile
twice on the same file, when getting the numcopies and mincopies value,
in the common case where the same file has the highest value for both.
But, it avoids checking every associated file, so it will scale well to
lots of dups already.

Sponsored-by: Kevin Mueller on Patreon
2021-06-15 11:14:52 -04:00
Joey Hess
3af4c9a29a
fix exponential blowup when adding lots of identical files
This was an old problem when the files were being added unlocked,
so the changelog mentions that being fixed. However, recently it's also
affected locked files.

The fix for locked files is kind of stupidly simple. moveAnnex already
handles populating unlocked files, and only does it when the object file
was not already present. So remove the redundant populateUnlockedFiles
call. (That call was added all the way back in
cfaac52b88, and has always been
unncessary.)

Sponsored-by: Dartmouth College's Datalad project
2021-06-15 09:45:55 -04:00
Joey Hess
78da00c7a6
Future proof activity log parsing
When the log has an activity that is not known, eg added by a future
version of git-annex, it used to be treated as no activity at all,
which would make git-annex expire think it should expire the repository,
despite it having some kind of recent activity.

Hopefully there will be no reason to add a new activity until enough
time has passed that this commit is in use everywhere.

Sponsored-by: Jake Vosloo on Patreon
2021-06-14 14:18:19 -04:00
Joey Hess
771a122c9e
add --size-limit option
When this option is not used, there should be effectively no added
overhead, thanks to the optimisation in
b3cd0cc6ba.

When an action fails on a file, the size of the file still counts toward
the size limit. This was necessary to support concurrency, but also
generally seems like the right choice.

Most commands that operate on annexed files support the option.
export and import do not, and I don't know if it would make sense for
export to.. Why would you want an incomplete export? sync doesn't, and
while it would be easy to make it support it for transferring files,
it's not clear if dropping files should also take the size limit into
account. Commands like add that don't operate on annexed files don't
support the option either.

Exiting 101 not yet implemented.

Sponsored-by: Denis Dzyubenko on Patreon
2021-06-04 16:16:53 -04:00
Joey Hess
189fb05ffb
Added annex.adviceNoSshCaching config.
Sponsored-by: Brock Spratlen on Patreon
2021-05-27 12:37:49 -04:00
Joey Hess
b5f5475ed6
New matching options --excludesamecontent and --includesamecontent
The normalisation of filenames turns out to be the tricky part here,
because the associated files coming out of the keys db may look like
"./foo/bar" or "../bar". For the former to match a glob like "foo/*",
it needs to be normalised.

Note that, on windows, normalise "./foo/bar" = "foo\\bar"
which a glob like "foo/*" won't match. So the glob is matched a second
time, on the toInternalGitPath, so allowing the user to provide a glob
with the slashes in either direction. However, this still won't support
some wacky edge cases like the user providing a glob of "foo/bar\\*"

Sponsored-by: Dartmouth College's Datalad project
2021-05-25 13:08:18 -04:00
Joey Hess
cedc28a783
prevent dropping required content of other file using same content
When two files have the same content, and a required content expression
matches one but not the other, dropping the latter file will fail as it
would also remove the content of the required file.

This will slow down drop (w/o --auto), dropunused, mirror, and move, by one
keys db lookup per file. But I did include an optimisation to avoid a
double db lookup in the drop --auto / sync --content case. I suspect that
dropunused could also use PreferredContentChecked True, but haven't
entirely thought it through and it's rarely used with enough files for the
optimisation to matter.

Sponsored-by: Dartmouth College's Datalad project
2021-05-25 11:34:06 -04:00
Joey Hess
7029ef1c3d
improve changelog 2021-05-25 10:08:29 -04:00
Joey Hess
5d18994736
clearer language 2021-05-24 14:54:51 -04:00
Joey Hess
a56b151f90
fix longstanding indeterminite preferred content for duplicated file problem
* drop: When two files have the same content, and a preferred content
  expression matches one but not the other, do not drop the file.
* sync --content, assistant: Fix an edge case where a file that is not
  preferred content did not get dropped.

The sync --content edge case is that handleDropsFrom loaded associated files
and used them without verifying that the information from the database was
not stale.

It seemed best to avoid changing --want-drop's behavior, this way when
debugging a preferred content expression with it, the files matched will
still reflect the expression. So added a note to the --want-drop documentation,
to make clear it may not behave identically to git-annex drop --auto.

While it would be possible to introspect the preferred content
expression to see if it matches on filenames, and only look up the
associated files when it does, it's generally fairly rare for 2 files to
have the same content, and the database lookup is already avoided when
there's only 1 file, so I did not implement that further optimisation.

Note that there are still some situations where the associated files
database does not get locked files recorded in it, which will prevent
this fix from working.

Sponsored-by: Dartmouth College's Datalad project
2021-05-24 14:07:05 -04:00
Joey Hess
c525d18cf7
filter-branch: New command, useful to produce a filtered version of the git-annex branch, eg when splitting a repository 2021-05-17 14:16:46 -04:00
Joey Hess
8b6dad11a2
add createMessage
init: When annex.commitmessage is set, use that message for the commit
that creates the git-annex branch.

This will be used by filter-branch too, and it seems to make sense to let
annex.commitmessage affect it.
2021-05-17 13:07:47 -04:00
Joey Hess
947d2a10bc
assistant: Fix a crash on startup by avoiding using forkProcess
ghc 8.8.4 seems to have changed something that broke code that has been
successfully using forkProcess since 2012. Likely a change to GC internals.

Since forkProcess has never had clear documentation about how to
use it safely, avoid using it at all. Instead, when git-annex needs to
daemonize itself, re-run the git-annex command, in a new process group
and session.

This commit was sponsored by Luke Shumaker on Patreon.
2021-05-12 15:08:03 -04:00
Joey Hess
675556fd9a
smudge: check for known annexed inodes before checking annex.largefiles
smudge: Fix a case where an unlocked annexed file that annex.largefiles
does not match could get its unchanged content checked into git, due to git
running the smudge filter unecessarily.

When the file has the same inodecache as an already annexed file,
we can assume that the user is not intending to change how it's stored in
git.

Note that checkunchangedgitfile already handled the inverse case, where the
file was added to git previously. That goes further and actually sha1
hashes the new file and checks if it's the same hash in the index.

It would be possible to generate a key for the file and see if it's the
same as the old key, however that could be considerably more expensive than
sha1 of a small file is, and it is not necessary for the case I have, at
least, where the file is not modified or touched, and so its inode will
match the cache.

git-annex add was changed, when adding a small file, to remove the inode
cache for it. This is necessary to keep the recipe in
doc/tips/largefiles.mdwn for converting from annex to git working.
It also avoids bugs/case_where_using_pathspec_with_git-commit_leaves_s.mdwn
which the earlier try at this change introduced.
2021-05-10 13:20:10 -04:00
Joey Hess
72a8bbce12
Revert "smudge: check for known annexed inodes before checking annex.largefiles"
This reverts commit 424bef6b6f.

This commit caused other buggy behavior unfortunately.
2021-05-10 12:20:13 -04:00
Joey Hess
921753ac44
reinject: Error out when run on a file that is not annexed
rather than silently skipping it
2021-05-07 13:31:03 -04:00
Joey Hess
4bf7940d6b
fileRef: make paths relative and simplified
Fix behavior of several commands, including reinject, addurl, and rmurl
when given an absolute path to an unlocked file, or a relative path that
leaves and re-enters the repository.

To avoid slowing down all the cases where the paths are already ok
with an unncessary call to getCurrentDirectory, put in an optimisation
in relPathCwdToFile. That will probably also speed up other parts of
git-annex by some small amount, but I have not benchmarked.

Note that I did not convert branchFileRef, because it seems likely that
it will be used with a file that is not provided by the user, so is already
in a sane format. This is certainly true for the way git-annex uses it,
though maybe arguable to the extent Git.Ref is a reusable library.
2021-05-07 13:25:59 -04:00
Joey Hess
424bef6b6f
smudge: check for known annexed inodes before checking annex.largefiles
smudge: Fix a case where an unlocked annexed file that annex.largefiles
does not match could get its unchanged content checked into git, due to git
running the smudge filter unecessarily.

When the file has the same inodecache as an already annexed file,
we can assume that the user is not intending to change how it's stored in
git.

Note that checkunchangedgitfile already handled the inverse case, where the
file was added to git previously. That goes further and actually sha1
hashes the new file and checks if it's the same hash in the index.

It would be possible to generate a key for the file and see if it's the
same as the old key, however that could be considerably more expensive than
sha1 of a small file is, and it is not necessary for the case I have, at
least, where the file is not modified or touched, and so its inode will
match the cache.
2021-05-03 13:26:32 -04:00
Joey Hess
4588668a12
fromkey unlocked files support
fromkey: Create an unlocked file when used in an adjusted branch where the
file should be unlocked, or when configured by annex.addunlocked.

There is some overlap with code in Annex.Ingest, however it's not quite the
same because ingesting has a temp file with the content, where here the
content, if any, is in the annex object file. So it eg, makes sense for
Annex.Ingest to copy the execute mode of the content file, but it does not make
sense for fromkey to do that.

Also changed in passing to stage the file in git directly, rather than
using git add. One consequence of that is that if the file is gitignored,
it will still get added, rather than the old behavior:

The following paths are ignored by one of your .gitignore files:
ignored
hint: Use -f if you really want to add them.
hint: Turn this message off by running
hint: "git config advice.addIgnoredFile false"
git-annex: user error (xargs ["-0","git","--git-dir=.git","--work-tree=.","--literal-pathspecs","add","--"] exited 123)

That old behavior was a surprise to me, and so I consider it a bug, and doubt
anyone would have relied on it.

Note that, when on an --hide-missing branch, it is possible to fromkey a key
that is not present (needs --force). The annex link or pointer file still gets
written in this case. It doesn't seem to make any sense not to write it,
because then fromkey would not do anything useful in this case, and this way
the file can be committed and synced to master, and the branch re-adjusted to
hide the new missing file.

This commit was sponsored by Noam Kremen on Patreon.
2021-05-03 11:26:18 -04:00
Joey Hess
27e5f3cd52
releasing package git-annex version 8.20210428 2021-04-28 12:16:45 -04:00
Joey Hess
0f73b6d03a
Avoid more than 1 gpg password prompt at the same time
Which could happen occasionally before when concurrency is enabled.
While not much of a problem when it did happen, better to avoid it. Also,
since it seems likely the gpg-agent sometimes fails in such a situation,
this makes it not happen when running a single git-annex command with
concurrency enabled.

This commit was sponsored by Jake Vosloo on Patreon.
2021-04-27 16:36:44 -04:00
Joey Hess
a166d2520b
check mincopies is satisfied even when numcopies is known to be satisfied
I had been assuming that numcopies would be a larger or at most equal to
mincopies, so no need to check both. But users get confused and use configs
that don't really make sense, so make sure to handle mincopies being larger
than numcopies.

Also add something to the mincopies man page to discourage this
misconfiguration.

This commit was sponsored by Denis Dzyubenko on Patreon.
2021-04-27 13:37:18 -04:00
Joey Hess
d3e49b210a
git-annex-config: Allow setting annex.securehashesonly
Which has otherwise been supported since 2019, but was missing from the
list of allowed repo-global configs.

Reordered the list to match the order in the git-annex-config man page, to
make them easy to cross-compare.

This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
2021-04-26 13:50:37 -04:00
Joey Hess
8e24fb3507
update 2021-04-26 13:12:51 -04:00
Joey Hess
2b264b3edf
initremote --private 2021-04-23 14:47:46 -04:00
Joey Hess
0547884eb2
importfeed: fix bug while also speeding up 12x!
* Fix bug that could make git-annex importfeed not see recently recorded
  state when configured with annex.alwayscommit=false.
* importfeed: Made "checking known urls" phase run 12 times faster.

The massive speedup is because it no longer queries for metadata
accompanying each url. Instead it processes the whole git-annex branch and
checks all metadata files for feed item ids, and uses any it finds.

This could result in a behavior change, in an unlikely situation: If a feed
id is recorded in a key's metadata, but the url gets removed, the old code
would not see that item id and would re-download it if it finds an url for
it in a feed, while the new code will see the item id. I don't think
the old behavior was intentional, and it may be that the new behavior is
better. Not gonna worry about this.
2021-04-23 12:36:56 -04:00
Joey Hess
6eb3c0a6b4
fix branch precacheing bug by checking journal
Fix bug caused by recent optimisations that could make git-annex not see
recently recorded status information when configured with
annex.alwayscommit=false.

When not using --all, precaching only gets triggered when the
command actually needs location logs, and so there's no speed hit there.

This is a minor speed hit for --all, because it precaches even when the
location log is not actually going to be used, and so checking the journal
is not necessary. It would have been possible to defer checking the journal
until the cache gets used. But that would complicate the usual Branch.get
code path with two different kinds of caches, and the speed hit is really
minimal. A better way to speed up --all, later, would be to avoid
precaching at all when the location log is not going to be used.
2021-04-21 14:02:15 -04:00
Joey Hess
e1a9b79fa6
fix hardcoded origin name in checkAdjustedClone
init: Fix a crash when the repo's was cloned from a repo that had an
adjusted branch checked out, and the origin remote is not named "origin".

The only other hardcoding of the name of origin is in:

- Upgrade.V2, which can be ignored probably
- Annex.Branch, which doesn't fail if it has some other name, but just
  doesn't set up the git-annex branch with quite as linear a history in
  that case.
2021-04-14 18:53:27 -04:00