Commit graph

2034 commits

Joey Hess
a5709dcc22
Copy with a reflink when exporting a tree to a directory special remote
Remote.Directory makes a temp file, then calls this, and since the temp
file already exists, that prevented probing whether CoW works.

Note that deleting the empty file does mean there's a small window for a
race. If another process is also exporting to the remote, that could let it
make the same temp file. However, the temp filename actually has the
process's pid in it, which avoids that being a problem.

This may have been a reversion caused by commits around
63d508e885, but I haven't gone back and
tested to be sure. The directory special remote had supposedly supported
CoW for this going back to about half a year before that.

Sponsored-by: Graham Spencer on Patreon
2023-03-28 13:09:14 -04:00
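
The probe pattern this commit fixes can be sketched roughly as follows, assuming
a hypothetical probeCoW helper built on GNU cp's --reflink=always (not
git-annex's actual Remote.Directory or CopyFile code). Mirroring the fix, a
pre-existing empty destination file is removed first so it no longer prevents
the probe.

    import Control.Exception (IOException, try)
    import Control.Monad (when)
    import System.Directory (doesFileExist, removeFile)
    import System.Exit (ExitCode(..))
    import System.Process (readProcessWithExitCode)

    -- Hypothetical CoW probe: try a reflink copy with GNU cp and see if it
    -- succeeds. As in the fix above, an already-existing (empty) destination
    -- file is removed first so its presence does not prevent the probe.
    probeCoW :: FilePath -> FilePath -> IO Bool
    probeCoW src dest = do
        exists <- doesFileExist dest
        when exists (removeFile dest)
        (code, _, _) <- readProcessWithExitCode "cp" ["--reflink=always", src, dest] ""
        -- Clean up the probe's destination file; ignore failure if it is gone.
        _ <- try (removeFile dest) :: IO (Either IOException ())
        return (code == ExitSuccess)
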
Joey Hess
24ae4b291c
addurl, importfeed: Fix failure when annex.securehashesonly is set
The temporary URL key used for the download, before the real key is
generated, was blocked by annex.securehashesonly.

Fixed by passing the Backend that will be used for the final key into
runTransfer. When a Backend is provided, have preCheckSecureHashes
check that, rather than the key being transferred.

Sponsored-by: unqueued on Patreon
2023-03-27 15:10:46 -04:00
Joey Hess
cb6cb61ca1
avoid build warning on windows 2023-03-27 12:20:35 -04:00
Joey Hess
291ad8f6b2
avoid build warning on windows 2023-03-27 12:19:26 -04:00
Joey Hess
2b5fa091e2
annex.maxextensionlength for view
view: Support annex.maxextensionlength when generating filenames for the
view branch.

Note that refining an existing view will reuse the extension length that was
configured when initially constructing the view. This is necessarily the case
because it reuses the filenames.

Also, view filenames used to have all of the file's extensions at the end, no
matter how many there were. Since annex.maxextensionlength's documentation
says that it's limited to 2 extensions, I made this consistent with that.

Sponsored-by: k0ld on Patreon
2023-03-24 14:01:38 -04:00
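
A rough sketch of the kind of extension limiting described above, with made-up
helper names rather than git-annex's real view code:

    -- Split "foo.tar.gz" into ["foo","tar","gz"].
    splitOnDots :: String -> [String]
    splitOnDots = foldr go [""]
      where
        go '.' acc  = "" : acc
        go c (x:xs) = (c:x) : xs
        go c []     = [[c]]

    -- Pick the extension suffix (e.g. ".tar.gz") to carry over into a
    -- generated view filename: at most two trailing extensions, each no
    -- longer than maxLen characters.
    keptExtension :: Int -> FilePath -> String
    keptExtension maxLen f = concatMap ('.':) (reverse kept)
      where
        exts = drop 1 (splitOnDots f)
        kept = take 2 (takeWhile ok (reverse exts))
        ok e = not (null e) && length e <= maxLen

For example, keptExtension 4 "data.2023.tar.gz" gives ".tar.gz": the
two-extension cap applies even though "2023" is short enough on its own.
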
Joey Hess
038a2600f4
Avoid leaving repo with a detached head when there is a failure checking out an updated adjusted branch
I don't know of scenarios where that can happen (besides the bug
fixed by the parent commit), but there probably are some.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2023-03-23 16:36:43 -04:00
Joey Hess
cb4d9f7b1f
run restagePointerFiles in adjustedBranchRefreshFull
Avoid failure to update adjusted branch --unlock-present after git-annex
drop when annex.adjustedbranchrefresh=1

At higher values, it did flush the queue, which ran restagePointerFiles.
But at 1, adjustedBranchRefreshFull gets added to the queue, and while
restagePointerFiles is also in the queue, it runs after that.

Sponsored-by: Brock Spratlen on Patreon
2023-03-23 16:25:45 -04:00
Joey Hess
e822df2a09
fix build warnings on windows 2023-03-21 18:41:23 -04:00
Yaroslav Halchenko
84b0a3707a
Apply codespell -w throughout 2023-03-17 15:14:58 -04:00
Yaroslav Halchenko
0ae5ff797f
Typo: sansative -> sensitive 2023-03-17 15:14:50 -04:00
Yaroslav Halchenko
e018ae1125
Fix ambiguous typos 2023-03-17 15:14:47 -04:00
Joey Hess
54ad1b4cfb
Windows: Support long filenames in more (possibly all) of the code
Works around this bug in unix-compat:
https://github.com/jacobstanley/unix-compat/issues/56
getFileStatus and other FilePath using functions in unix-compat do not do
UNC conversion on Windows.

Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the
necessary conversion on windows to support long filenames.

Audited all imports of System.PosixCompat.Files to make sure that no
functions that operate on FilePath were imported from it. Instead, use
the equivalents from Utility.RawFilePath. In particular, the
re-export of that module in Common had to be removed, which led to lots
of other changes throughout the code.

The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig
make Utility.Directory not needed to build setup, and so let it use
Utility.RawFilePath, which depends on unix, which cannot be in
setup-depends.

Sponsored-by: Dartmouth College's Datalad project
2023-03-01 15:55:58 -04:00
Joey Hess
bb54c8a633
support --hide-missing adjustment of view branches
I had thought this would not make sense to combine with view branches,
since removing files from a view changes metadata.

However, that's committing removal of files. With --hide-missing, the
files get removed when git-annex updates the branch itself, so there is
no conflict.

It does not seem likely to be very useful, but it does work! And that's
nice because it means all types of adjusted branches can be combined with
view branches.

Sponsored-by: Max Thoursie on Patreon
2023-02-27 15:39:58 -04:00
Joey Hess
1c4f4b449a
support --unlock-present adjustment of view branches
When generating the view, check if the key is present.

When syncing in a view branch with an adjustment, run adjustedBranchRefreshFull
the same as is done when syncing in other adjusted branches. This is
needed because the docs for git-annex adjust --unlock-present suggest
using git-annex sync to update the branch when annex.adjustedbranchrefresh
is not set.

Note that, with annex.adjustedbranchrefresh set, it just works! The
adjusted branch gets updated in the usual way and it doesn't matter that
there's a view branch underneath.

And of course, re-running git-annex adjust --unlock-present also works,
as suggested in the docs.

Sponsored-by: Erik Bjäreholt on Patreon
2023-02-27 15:37:57 -04:00
Joey Hess
7d839176c3
support generation of unlocked views
Just make pointer files rather than symlinks, easy.

As for the other adjustments:
--lock is the default for views
--fix happens automatically in views
--hide-missing probably does not make sense when combined with views,
because deleting a file from a view removes metadata
--unlock-present will need a bit more work
2023-02-27 15:07:36 -04:00
Joey Hess
f09e299156
rawfilepath conversion 2023-02-27 15:06:32 -04:00
Joey Hess
cc32e31161
understand adjusted view branch names
An adjusted view branch has a name like
"refs/heads/adjusted/views/master(author=_)(unlocked)", so it is a view
branch that has been converted to an adjusted branch.

Made Logs.View support such branch names. So now git-annex sync and
pre-commit handle updating metadata on commit in such a branch.

Much remains to be done to fully support adjusted view branches,
including actually applying the adjustment when updating the view branch.

Sponsored-by: Graham Spencer on Patreon
2023-02-27 14:57:58 -04:00
Joey Hess
2a966f49f2
overwrite old adjusted view branch
When git-annex adjust is run in a view branch, and the adjusted branch
already exists, overwrite the old adjusted branch with the new one
without being forced.

Usually overwriting an adjusted branch is avoided because it could lose
data. But when a view branch has been adjusted, there is no data to lose
in the adjusted branch, because the only changes that can be made of
significance are to move files between directories. Which changes
metadata on commit. And the old branch has already been committed.

Sponsored-by: Lawrence Brogan on Patreon
2023-02-27 14:35:27 -04:00
Joey Hess
9b1fe37818
improve adjusted branch name parsing to support adjusted view branches
An adjusted view branch has a name like
"adjusted/views/master(author=_)(unlocked)"
and so the adjustment starts at the last open paren, not the first open
paren.

Note that git-annex sync still does not do anything useful when run in
such a branch, because it does not realize that it is a view branch.
This is only groundwork for adjusted view branches.

This also fixes adjusted branches when the basis branch name contains
parens for some other reason, though that is not common in a git branch
name.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2023-02-27 14:09:05 -04:00
Joey Hess
da61d564f1
fix view reversion caused by optimisation
view: Fix a reversion in 10.20230214 that omitted a file from a view when
the file had no metadata set, but the view only used path fields.

Sponsored-by: Jack Hill on Patreon
2023-02-16 15:18:17 -04:00
Joey Hess
826b225ca8
Sped up view branch construction by 50%
A benchmark in my sound repository with `git-annex view feedtitle=*`
took 2:52 wall clock time before and 1:58 after. Though it still only used
130% of CPU.

This is the same kind of optimisation that is in seekFilteredKeys, though
that precaches location logs, while this streams the metadata logs
directly into the parser.

seekFilteredKeys contains more streaming, to find the annexed files, and
this could be further sped up with similar streaming.

Sponsored-by: Nicholas Golder-Manning on Patreon
2023-02-13 13:29:57 -04:00
Joey Hess
bb4550c7c1
sync: Warn when the adjusted basis ref cannot be found
As happens eg when the user has renamed branches.

Sponsored-by: Graham Spencer on Patreon
2023-02-10 14:33:21 -04:00
Joey Hess
5f9bf51438
sync in view branch updates the view branch
* sync: When run in a view branch, refresh the view branch to reflect any
    changes that have been made to the parent branch or metadata.

This is basically working, but probably needs some more work to deal with
all the edge cases of things sync does.

Sponsored-by: Lawrence Brogan on Patreon
2023-02-08 15:37:28 -04:00
Joey Hess
aa0350ff49
add directory to views for files that lack specified metadata
* view: New field?=glob and ?tag syntax that includes a directory "_"
  in the view for files that do not have the specified metadata set.
* Added annex.viewunsetdirectory git config to change the name of the
  "_" directory in a view.

When in a view using the new syntax, old git-annex will fail to parse the
view log. It errors with "Not in a view.", which is not ideal. But that
only affects view commands.

annex.viewunsetdirectory is included in the View for a couple of reasons.
One is to avoid needing to warn the user that it should not be changed when
in a view, since that would confuse git-annex. Another reason is that it
helped with plumbing the value through to some pure functions.

annex.viewunsetdirectory is actually mangled the same as any other view
directory. So if it's configured to something like "N/A", there won't be
multiple levels of directories, which would also confuse git-annex.

Sponsored-By: Jack Hill on Patreon
2023-02-07 16:28:46 -04:00
Joey Hess
579d9b60c1
improve concurrency of move/copy --from --to
Use separate stages for download and upload. In the common case where
it downloads the file from one remote and then uploads to the other,
those are by far the most expensive operations, and there's a decent
chance the two remotes bottleneck on different resources.

Suppose it's being run with -J2 and a bunch of 10 MB files. Two threads
will be started both downloading from the src remote. They will probably
finish at the same time. Then two threads will be started uploading to
the dst remote. They will probably take the same time as well. Before
this change, it would alternate back and forth, bottlenecking on src and dst.
With this change, as soon as the two threads start uploading to dst, two
more threads are able to start, downloading from src. So bandwidth to
both remotes is saturated more often.

Other commands that use transferStages only send in one direction at a
time. So the worker threads for the other direction will sit idle, and
there will be no change in their behavior.

Sponsored-by: Dartmouth College's DANDI project
2023-01-24 13:59:39 -04:00
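
An illustrative sketch (not git-annex's actual worker-stage machinery) of why
separate download and upload stages keep both remotes busy: n downloader
threads feed a bounded queue that n uploader threads drain, so downloads of
later files overlap with uploads of earlier ones. The download and upload
parameters are hypothetical stand-ins for the per-remote transfers.

    import Control.Concurrent.Async (concurrently_, replicateConcurrently_)
    import Control.Concurrent.STM

    pipeline :: Int -> (FilePath -> IO ()) -> (FilePath -> IO ()) -> [FilePath] -> IO ()
    pipeline n download upload files = do
        input <- newTQueueIO
        handoff <- newTBQueueIO (fromIntegral n)
        atomically $ do
            mapM_ (writeTQueue input . Just) files
            -- One end-of-input marker per downloader.
            mapM_ (writeTQueue input) (replicate n Nothing)
        let downloader = do
                mf <- atomically (readTQueue input)
                case mf of
                    Nothing -> atomically (writeTBQueue handoff Nothing)
                    Just f -> do
                        download f
                        atomically (writeTBQueue handoff (Just f))
                        downloader
            uploader = do
                mf <- atomically (readTBQueue handoff)
                case mf of
                    Nothing -> return ()
                    Just f -> upload f >> uploader
        concurrently_
            (replicateConcurrently_ n downloader)
            (replicateConcurrently_ n uploader)
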
Joey Hess
acc3f6211f
finishing up move --from --to
Lock the local content for drop after getting it from src, to prevent another
process from using the local content as a copy and dropping it from src,
which would prevent dropping the local content after sending it to dest.

Support resuming an interrupted move that downloaded the content from
src, leaving the local content populated. In this case, the location log
has not been updated to say the content is present locally, so we can
assume that it's resuming and go ahead and drop the local content after
sending it to dest.

Note that if a `git-annex get` is being run at the same time as a
`git-annex move --from --to`, it may get a file just before the move
processes it. So the location log has not been updated yet, and the move
thinks it's resuming. Resulting in local copy being dropped after it's
sent to the dest. This race is something we'll just have to live with,
it seems.

I also gave up on the idea of checking if the location log had been updated
by a `git-annex get` that is run at the same time. That wouldn't work, because
the location log is precached in the seek stage, so reading it again after
sending the content to dest would not notice changes made to it, unless the cache
were invalidated, which would slow it down a lot. That idea anyway was subject
to races where it would not detect the concurrent `git-annex get`.

So concurrent `git-annex get` will have results that may be surprising.
To make that less surprising, updated the documentation of this feature to
be explicit that it downloads content to the local repository
temporarily.

Sponsored-by: Dartmouth College's DANDI project
2023-01-23 17:43:48 -04:00
Joey Hess
1abd457e98
push location log updating up to callers of download
Prep for move --to --from, which needs to download from a src repo
without updating the location log for the local repo, before sending the
content on to the dest repo.

Note that callers of download' already update the log themselves.
See previous commit a422a056f2
that pushed it up to download from getViaTmpFrom.

(Also removed in passing a debug print + readline that I accidentally
committed last week on this branch.)

Sponsored-by: Dartmouth College's DANDI project
2023-01-23 13:47:41 -04:00
Joey Hess
cfaae7e931
added an optional cost= configuration to all special remotes
Note that when this is specified and an older git-annex is used to
enableremote such a special remote, it will simply ignore the cost= field
and use whatever the default cost is.

In passing, fixed adb to support the remote.name.cost and
remote.name.cost-command configs.

Sponsored-by: Dartmouth College's DANDI project
2023-01-12 13:42:28 -04:00
Joey Hess
2fa7656627
switch to readMaybe to handle values with leading number followed by non-number
readish ignores a trailing string after a number, but to support values
like "YYYY:MM:DD" which it makes sense to compare lexographically,
require the whole string to be parsed as a number in order to enable
numeric comparison.

Sponsored-by: Max Thoursie on Patreon
2022-12-22 14:33:47 -04:00
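
A minimal sketch of that comparison fallback, using readMaybe so a value like
"2022:12:22" is compared lexicographically rather than as the number 2022
(hypothetical helper, not git-annex's actual code):

    import Text.Read (readMaybe)

    -- Compare two metadata values: numerically when both parse entirely as
    -- numbers, otherwise lexicographically. readMaybe (unlike readish)
    -- rejects values that merely start with a number.
    compareValues :: String -> String -> Ordering
    compareValues a b =
        case (readMaybe a :: Maybe Double, readMaybe b :: Maybe Double) of
            (Just x, Just y) -> compare x y
            _                -> compare a b
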
Joey Hess
9d60385001
convert renameFile to moveFile to support cross-device moves
Improve handling of some .git/annex/ subdirectories being on other
filesystems, in the bittorrent special remote, and youtube-dl integration,
and git-annex addurl.

The only one of these that I've confirmed to be a problem is in the
bittorrent special remote when .git/annex/tmp and .git/annex/othertmp are
on different filesystems.

As well as auditing for renameFile, I also audited for createLink; all of
those are ok, as are the other remaining renameFile calls. Also audited all
code paths that use .git/annex/othertmp, and did not find any other
cross-device problems. So, removing mention of othertmp needing to be on
the same device.

Sponsored-by: Dartmouth College's Datalad project
2022-12-20 15:17:50 -04:00
Joey Hess
aa6919737c
--metadata lexicographical comparisons
Change --metadata comparisons < > <= and >= to fall back to lexicographical
comparisons when one or both values being compared are not numbers.

Sponsored-by: Erik Bjäreholt on Patreon
2022-12-12 13:33:24 -04:00
Joey Hess
65f9e7a3c7
fix deadlock in restagePointerFiles
Fix a hang that occasionally occurred during commands such as move.
(A bug introduced in 10.20220927, in
commit 6a3bd283b8)

The restage.log was kept locked while running a complex index refresh
action. In an unusual situation, that action could need to write to the
restage log, which caused a deadlock.

The solution is a two-stage process. First the restage.log is moved to a
work file, which is done with the lock held. Then the content of the work
file is read and processed, which happens without the lock being held.
This is all done in a crash-safe manner.

Note that streamRestageLog may not be fully safe to run concurrently
with itself. That's ok, because restagePointerFiles uses it with the
index lock held, so only one can be run at a time.

streamRestageLog does delete the restage.old file at the end without
locking. If a calcRestageLog is run concurrently, it will either see the
file content before it was deleted, or will see that it's missing. Either is
ok, because at most this will cause calcRestageLog to report more
work remains to be done than there is.

Sponsored-by: Dartmouth College's Datalad project
2022-12-08 14:36:11 -04:00
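
The two-stage pattern described above can be sketched like this, holding the
lock only while the log is moved aside; the lock file, paths, and use of the
filelock package are assumptions made for the sketch, and the crash-safety
details are omitted:

    import Control.Monad (when)
    import qualified Data.ByteString.Char8 as B8
    import System.Directory (doesFileExist, removeFile, renameFile)
    import System.FileLock (SharedExclusive(Exclusive), withFileLock)

    restageSketch :: FilePath -> FilePath -> FilePath -> (B8.ByteString -> IO ()) -> IO ()
    restageSketch lockfile logfile workfile process = do
        -- Stage 1: with the lock held, quickly move the log out of the way,
        -- so the lock is never held across the expensive work.
        moved <- withFileLock lockfile Exclusive $ \_ -> do
            e <- doesFileExist logfile
            when e (renameFile logfile workfile)
            return e
        -- Stage 2: without the lock, stream the work file and process each
        -- line. Anything that now appends to the original log is not blocked,
        -- avoiding the deadlock described above.
        when moved $ do
            mapM_ process . B8.lines =<< B8.readFile workfile
            -- Deleted without the lock; a concurrent reader either sees the
            -- old content or finds it missing, both of which are harmless.
            removeFile workfile
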
Joey Hess
43f681d4c1
Support parsing yt-dlp output to display download progress
Before this fix, no progress was displayed when yt-dlp was used.

Sponsored-by: Graham Spencer on Patreon
2022-11-21 15:04:36 -04:00
Joey Hess
5256be61c1
When youtube-dl is not available in PATH, use yt-dlp instead
Debian is going to drop youtube-dl, which is not active upstream, and yt-dlp
is the replacement. This makes yt-dlp be used if youtube-dl gets removed.

If an old version of youtube-dl remains installed, git-annex will still use
it. That might not be desirable, but changing git-annex to use yt-dlp in
preference to youtube-dl when both are installed risks breaking when
the user has annex.youtube-dl-options set to something that is supported
by youtube-dl, but not by yt-dlp.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2022-11-21 14:40:33 -04:00
Joey Hess
2b014f1a8b
don't frontload reconcileStaged in git-annex init
init: Avoid scanning for annexed files, which can be lengthy in a
large repository. Instead that scan is done on demand. This lets git-annex
init be run and some query commands be used in a repository without
waiting.

Note that autoinit already behaved this way, so while this will mean some
commands like git-annex get/unlock/add will do the scan the first time run,
that is not really a significant behavior change.

And, it's really better to have a consistent behavior. The reason for
the inconsistency was a strange bug discussed in
b3c4579c79. Avoiding reconcileStaged in
init will keep avoiding whatever that was.

Sponsored-by: Dartmouth College's DANDI project
2022-11-18 13:58:47 -04:00
Joey Hess
14f7a386f0
Make git-annex enable-tor work when using the linux standalone build
Clean the standalone environment before running the su command
to run "sh". Otherwise, PATH leaked through, causing it to run
git-annex.linux/bin/sh, but GIT_ANNEX_DIR was not set,
which caused that script to not work:

[2022-10-26 15:07:02.145466106] (Utility.Process) process [938146] call: pkexec ["sh","-c","cd '/home/joey/tmp/git-annex.linux/r' && '/home/joey/tmp/git-annex.linux/git-annex' 'enable-tor' '1000'"]
/home/joey/tmp/git-annex.linux/bin/sh: 4: exec: /exe/sh: not found

Changed programPath to not use GIT_ANNEX_PROGRAMPATH,
but instead run the scripts at the top of GIT_ANNEX_DIR.
That works both when the standalone environment is set up, and when it's
not.

Sponsored-by: Kevin Mueller on Patreon
2022-10-26 15:45:08 -04:00
Joey Hess
731e806c96
use lookupKeyStaged in --batch code paths
Make --batch mode handle unstaged annexed files consistently whether the
file is unlocked or not. Before this, an unstaged locked file
would have the symlink on disk examined and operated on in --batch mode,
while an unstaged unlocked file would be skipped.

Note that, when not in batch mode, unstaged files are skipped over too.
That is actually somewhat new behavior; as late as 7.20191114 a
command like `git-annex whereis .` would operate on unstaged locked
files and skip over unstaged unlocked files. That changed during
optimisation of CmdLine.Seek with apparently little fanfare or notice.

Turns out that rmurl still behaved that way when given an unstaged file
on the command line. It was changed to use lookupKeyStaged to
handle its --batch mode. That also affected its non-batch mode, but
since that's just catching up to the change earlier made to most
other commands, I have not mentioned that in the changelog.

It may be that other uses of lookupKey should also change to
lookupKeyStaged. But it may also be that would slow down some things,
or lead to unwanted behavior changes, so I've kept the changes minimal
for now.

An example of a place where the use of lookupKey is better than
lookupKeyStaged is in Command.AddUrl, where it looks to see if the file
already exists, and if so adds the url to the file. It does not matter
there whether the file is staged or not (when it's locked). The use of
lookupKey in Command.Unused likewise seems good (and faster).

Sponsored-by: Nicholas Golder-Manning on Patreon
2022-10-26 14:43:06 -04:00
Joey Hess
b2ee2496ee
remove whenAnnexed and ifAnnexed
In preparation for adding a new variation on lookupKey.

Sponsored-by: Max Thoursie on Patreon
2022-10-26 14:06:32 -04:00
Joey Hess
6fbd337e34
avoid unnecessary keys db writes; doubled speed!
When running eg git-annex get, for each file it has to read from and
write to the keys database. But it's reading exclusively from one table,
and writing to a different table. So, it is not necessary to flush the
write to the database before reading. This avoids writing the database
once per file, instead it will buffer 1000 changes before writing.

Benchmarking getting 1000 small files from a local origin,
git-annex get now takes 13.62s, down from 22.41s!
git-annex drop now takes 9.07s, down from 18.63s!
Wowowowowowowow!

(It would perhaps have been better if there were separate databases for
the two tables. At least it would have avoided this complexity. Ah well,
this is better than splitting the table in an annex.version upgrade.)

Sponsored-by: Dartmouth College's Datalad project
2022-10-12 15:33:16 -04:00
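
A toy sketch of the buffering this describes: queue changes in memory and only
write them out once a threshold is reached or the queue is explicitly flushed,
instead of flushing before every read of the other table. The names are made
up and it assumes a single writer; git-annex's real queue lives in its
database layer.

    import Control.Monad (unless, when)
    import Data.IORef

    data WriteQueue a = WriteQueue
        { queued :: IORef [a]
        , limit :: Int
        , commitBatch :: [a] -> IO ()  -- stand-in for the actual database write
        }

    newWriteQueue :: Int -> ([a] -> IO ()) -> IO (WriteQueue a)
    newWriteQueue n w = do
        r <- newIORef []
        return (WriteQueue r n w)

    -- Queue a change; only hit the database once 'limit' changes accumulate.
    queueChange :: WriteQueue a -> a -> IO ()
    queueChange q x = do
        len <- atomicModifyIORef' (queued q) (\xs -> (x : xs, length xs + 1))
        when (len >= limit q) (flushQueue q)

    flushQueue :: WriteQueue a -> IO ()
    flushQueue q = do
        xs <- atomicModifyIORef' (queued q) (\xs -> ([], xs))
        unless (null xs) (commitBatch q (reverse xs))
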
Joey Hess
ba7ecbc6a9
avoid flushing keys db queue after each Annex action
The flush was only done in Annex.run' to make sure that the queue was flushed
before git-annex exits. But, doing it there means that as soon as one
change gets queued, it gets flushed soon after, which contributes to
excessive writes to the database, slowing git-annex down.
(This does not yet speed git-annex up, but it is a stepping stone to
doing so.)

Database queues do not autoflush when garbage collected, so have to
be flushed explicitly. I don't think it's possible to make them
autoflush (except perhaps if git-annex switched to using ResourceT..).
The comment in Database.Keys.closeDb used to be accurate, since the
automatic flushing did mean that all writes reached the database even
when closeDb was not called. But now, closeDb or flushDb needs to be
called before stopping using an Annex state. So, removed that comment.

In Remote.Git, change to using quiesce everywhere that it used to use
stopCoProcesses. This means that uses of onLocal in there are just as
slow as before. I considered only calling closeDb on the local git remotes
when git-annex exits. But, the reason that Remote.Git calls stopCoProcesses
in each onLocal is so as not to leave git processes running that have files
open on the remote repo, when it's on removable media. So, it seemed to make
sense to also closeDb after each one, since sqlite may also keep files
open. Although that has not seemed to cause problems with removable
media so far. It was also just easier to quiesce in each onLocal than
once at the end. This does likely leave performance on the floor, so
could be revisited.

In Annex.Content.saveState, there was no reason to close the db,
flushing it is enough.

The rest of the changes are from auditing for Annex.new, and making
sure that quiesce is called, after any action that might possibly need
it.

After that audit, I'm pretty sure that the change to Annex.run' is
safe. The only concern might be that this does let more changes get
queued for write to the db, and if git-annex is interrupted, those will be
lost. But interrupting git-annex can obviously already prevent it from
writing the most recent change to the db, so it must recover from such
lost data... right?

Sponsored-by: Dartmouth College's Datalad project
2022-10-12 14:12:23 -04:00
Joey Hess
c2ad84b423
all keys are still present on versioned remote after import of a tree
When importing from versioned remotes, fix tracking of the content of
deleted files.

Only S3 supports versioning so far, so only it was affected.

But, the draft import/export interface for external remotes also seemed to
need a change, so that versionedExport could be set.
2022-10-11 13:05:40 -04:00
Joey Hess
7059322a6c
Support "inbackend" in preferred content expressions
Well, actually, fix a typo that has always been in the implementation of
that. "inbacked" used to work, but let's not tell users about that; they
might try to use it and expect git-annex to keep supporting the typo..

Sponsored-by: Jack Hill on Patreon
2022-09-26 16:06:49 -04:00
Joey Hess
b411a1ce74
remove unnecessary do block
Left over from Reiko Asakura's patch
2022-09-26 13:10:25 -04:00
Reiko Asakura
1d48153bb8
Run freeze and thaw hooks on crippled filesystems
The user sets these hooks deliberately so they should always be run. For
example this allows hooks to be used to manage file permissions on NTFS
volumes in WSL1.
2022-09-26 13:05:39 -04:00
Joey Hess
98eb5ff84f
fix windows build 2022-09-26 12:08:04 -04:00
Joey Hess
e62e4eaaf2
refactor for legibility 2022-09-23 18:53:06 -04:00
Joey Hess
2478e9e03a
restage: New git-annex command, handles restaging unlocked files
This is much easier and less failure-prone than having the user run
git update-index --refresh themselves.

Sponsored-by: Dartmouth College's DANDI project
2022-09-23 16:29:59 -04:00
Joey Hess
f7146c153b
fix restaging of transferred files after stalldetection kicks in
Sponsored-by: Dartmouth College's DANDI project
2022-09-23 15:55:40 -04:00
Joey Hess
6a3bd283b8
add restage log
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.

Currently, this lets a git-annex get/drop command be interrupted and
then re-ran, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.

Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes, when eg
dropping a large number of files, currently the git queue flushes
periodically, and so it restages incrementally, rather than all at the
end.

In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.

Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok, it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.

Sponsored-by: Dartmouth College's DANDI project
2022-09-23 15:47:24 -04:00
Joey Hess
8718125ae4
refactor the restage runner
Sponsored-by: Dartmouth College's DANDI project
2022-09-23 13:12:17 -04:00
Joey Hess
6e3c9bea2e
drain transferrer read handle when shutting it down
Fixes updating git index file after getting an unlocked file when
annex.stalldetection is set.

The transferrer may want to send additional protocol messages when it's
shut down. Closing the read handle prevented it from doing that, and caused
it to crash rather than cleanly shutting down.

Draining the handle without processing the protocol seemed ok to do,
because anything it outputs is going to be some side message displayed
at shutdown. Displaying those once per transferrer process that is running
seems unnecessary.

Sponsored-by: Dartmouth College's DANDI project
2022-09-22 14:39:39 -04:00
Joey Hess
0ffc59d341
change retrieveExportWithContentIdentifier to take a list of ContentIdentifier
This partly fixes an issue where there are duplicate files in the
special remote, and the first file gets swapped with another duplicate,
or deleted. The swap case is fixed by this, the deleted case will need
other changes.

This makes retrieveExportWithContentIdentifier take a list of allowed
ContentIdentifier, same as storeExportWithContentIdentifier,
removeExportWithContentIdentifier, and
checkPresentExportWithContentIdentifier.

Of the special remotes that support importtree, borg is a special case
and does not use content identifiers, S3 I assume can't get mixed up
like this, directory certainly has the problem, and adb also appears to
have had the problem.

Sponsored-by: Graham Spencer on Patreon
2022-09-20 13:19:42 -04:00
Joey Hess
d2c842e9a1
don't force use of conduit in withUrlOptionsPromptingCreds
Use curl for downloads from git remotes when annex.url-options and other
git configs are set.

If the url needs a password, curl will fail, and git credential will not be
used to prompt for it. But the user can set --netrc in url-options and
put the password in the netrc file.

This also means that url-options settings like -4 will take effect.
That was the case before commit 1883f7ef8f
forced conduit to be used.
2022-09-09 16:07:32 -04:00
Joey Hess
c62fe5e9a8
avoid redundant prompt for http password in git-annex get that does autoinit
autoEnableSpecialRemotes runs a subprocess, and if the uuid for a git
remote has not been probed yet, that will do a http get that will prompt
for a password. And then the parent process will subsequently prompt
for a password when getting annexed files from the remote.

So the solution is for autoEnableSpecialRemotes to run remoteList before
the subprocess, which will probe for the uuid for the git remote in the
same process that will later be used to get annexed files.

But, Remote.Git imports Annex.Init, and Remote.List imports Remote.Git,
so Annex.Init cannot import Remote.List. Had to pass remoteList into
functions in Annex.Init to get around this dependency loop.
2022-09-09 14:43:43 -04:00
Joey Hess
9621beabc4
cache credentials in memory when doing http basic auth to a git remote
When accessing a git remote over http needs a git credential prompt for a
password, cache it for the lifetime of the git-annex process, rather than
repeatedly prompting.

The git-lfs special remote already caches the credential when discovering
the endpoint. And presumably commands like git pull do as well, since they
may download multiple urls from a remote.

The TMVar CredentialCache is only read, not taken, so two concurrent calls to
getBasicAuthFromCredential will both prompt for a credential.
There would already be two concurrent password prompts in such a case,
and existing uses of `prompt` probably avoid it. Anyway, it's no worse
than before.
2022-09-09 14:20:32 -04:00
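
A condensed sketch of that kind of in-memory cache: a TMVar holding a Map from
url to credential, read with readTMVar, so two concurrent misses both prompt,
matching the caveat above. The names and the prompt action are hypothetical.

    import Control.Concurrent.STM
    import qualified Data.Map as M

    type Url = String
    type Credential = (String, String)  -- user, password

    newCredentialCache :: IO (TMVar (M.Map Url Credential))
    newCredentialCache = newTMVarIO M.empty

    -- Look up a cached credential, prompting (and caching) on a miss.
    -- Because the TMVar is only read here, two concurrent misses for the
    -- same url will both prompt, as noted in the commit message.
    getCredential :: TMVar (M.Map Url Credential) -> (Url -> IO Credential) -> Url -> IO Credential
    getCredential cache promptFor url = do
        m <- atomically (readTMVar cache)
        case M.lookup url m of
            Just c -> return c
            Nothing -> do
                c <- promptFor url
                atomically $ do
                    m' <- takeTMVar cache
                    putTMVar cache (M.insert url c m')
                return c
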
Joey Hess
d4fd966396
avoid dup check of guardSafeToUseRepo
Speeds up init slightly, and reduces the number of syscalls by the
dynamic linker.

Sponsored-by: Dartmouth College's Datalad project
2022-08-29 13:52:58 -04:00
Yaroslav Halchenko
0151976676
Typo fix unncessary -> unnecessary.
Detected while reading recent CHANGELOG entry but then decided to apply
to entire codebase and docs since why not?
2022-08-20 09:40:19 -04:00
Joey Hess
b801812660
init: probe if sqlite works
Help the user get annex.dbdir configured when their filesystem is not
one that sqlite works on.

The change in Database.Handle makes an error from sqlite no longer be
ignored after being displayed, as it was before. I can't see any reason
git-annex would want to ignore these errors.

I chose to use the fsck database rather than the keys database because
opening the keys database populates it, and see commit
b3c4579c79.

The placement of the call to checkSqliteWorks inside checkInitializeAllowed
avoids annex.uuid getting set before it's called.

Sponsored-by: Dartmouth College's Datalad project
2022-08-17 13:12:26 -04:00
Joey Hess
840bd50390
make it easier to use curl for unusual url schemes
Use curl when annex.security.allowed-url-schemes includes an url scheme not
supported by git-annex internally, as long as
annex.security.allowed-ip-addresses is configured to allow using curl.

Sponsored-by: Luke Shumaker on Patreon
2022-08-15 12:22:13 -04:00
Joey Hess
4cfe17a9e8
use a subdirectory of annex.dbdir
This allows annex.dbdir to be set globally or always set to the same
value when needed. Each repository uses a subdirectory of it.

Sponsored-by: Dartmouth College's Datalad project
2022-08-12 13:18:15 -04:00
Joey Hess
a335c1e46e
annex.dbdir fully working
Completes work started in e60766543f

I've verified that all the sqlite databases get stored in annex.dbdir
and are created successfully. If annex.dbdir does not exist, it will be
created; its parent directory must already exist though.

Sponsored-by: Dartmouth College's Datalad project
2022-08-12 13:06:58 -04:00
Joey Hess
23c6e350cb
improve createDirectoryUnder to allow alternate top directories
This should not change the behavior of it, unless there are multiple top
directories, and then it should behave the same as if there was a single
top directory that was actually above the directory to be created.

Sponsored-by: Dartmouth College's Datalad project
2022-08-12 12:52:37 -04:00
Joey Hess
e60766543f
add annex.dbdir (WIP)
WIP: This is mostly complete, but there is a problem: createDirectoryUnder
throws an error when annex.dbdir is set to outside the git repo.

annex.dbdir is a workaround for filesystems where sqlite does not work,
due to eg, the filesystem not properly supporting locking.

It's intended to be set before initializing the repository. Changing it
in an existing repository can be done, but would be the same as making a
new repository and moving all the annexed objects into it. While the
databases get recreated from the git-annex branch in that situation, any
information that is in the databases but not stored in the branch gets
lost. It may be that no information ever gets stored in the databases
that cannot be reconstructed from the branch, but I have not verified
that.

Sponsored-by: Dartmouth College's Datalad project
2022-08-11 16:58:53 -04:00
Joey Hess
a23fd7349f
work around git segfault
Work around a bug in git 2.37 that causes a segfault when
core.untrackedCache is set, and broke git-annex init.

Depending on when git gets fixed and how widely the buggy versions are
used, this could be reverted quite soon, or need to linger for a long time.
It only makes git-annex init a tiny bit slower in a new repo.

Sponsored-by: Max Thoursie on Patreon
2022-08-04 14:20:57 -04:00
Joey Hess
be19a68276
new matching options --want-get-by and --want-drop-by
Sponsored-by: Graham Spencer on Patreon
2022-07-28 13:26:03 -04:00
Joey Hess
d905232842
use ResourcePool for hash-object handles
Avoid starting an unnecessary number of git hash-object processes when
concurrency is enabled.

Sponsored-by: Dartmouth College's DANDI project
2022-07-25 17:32:39 -04:00
Joey Hess
63cef2ae0b
v8 repositories automatically upgrade to v9
(And v9 later on to v10.)

When v9/v10 were added, making v8 automatically upgrade was deferred
"for a few months" to prevent interoperability problems if users also
have an old version of git-annex. Of course that could still be the
case, but there has been a good amount of time and this can't be put off
forever.

Allow setting annex.autoupgraderepository to false to avoid this upgrade.
Previously, that only prevented upgrades from no longer supported git-annex
versions, but v8 is still supported, and users may want to keep on v8 to
interoperate with an old git-annex version.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2022-07-25 16:20:04 -04:00
Joey Hess
cbe12b9bc3
force fully strict read of journal file again
I was thinking that discardIncompleteAppend would make it strict, since
it looks at the end of the bytestring. But, it's applied lazily..

This probably fixes windows, which was failing:

      git-annex.exe: .git\annex\journal\trust.log: DeleteFile "\\\\?\\C:\\Users\\runneradmin\\.t\\5\\tmprepo22\\.git\\annex\\journal\\trust.log": permission denied (The process cannot access the file because it is being used by another process.)
2022-07-22 11:36:21 -04:00
Joey Hess
4e88137a28
prevent appends except when annex.alwayscompact=false
I would like for a new repo version to enable appends, but to do so
safely would need a v11 followed by a 1 year delay followed by a v12
that does it. Since a similar v9 and v10 transition is currently
happening, and is less than 6 months along in most repos, it does not
feel wise to stack up another year-long transition behind that. What if
I need to hurry up a new repo version for some other change?

Added todo so I remember to make this change at some time when a v11
and probably v12 repo version do make sense.

Sponsored-by: Dartmouth College's DANDI project
2022-07-20 13:23:55 -04:00
Joey Hess
d275874e6c
handling of interrupted appends
An append that is interrupted and writes part of a line is now dealt
with by subsequent reads and appends. This also handles a read that
happens at the same time as an append to the file.

Old versions of git-annex will still see a partially written line,
and could get confused. Since appends are currently done for url logs
and location logs, the confusion is limited to a substring of the actual
url or UUID of the remote being read. This will not affect writes, since
the journal file is locked when reading in preparation for writing.
However, the bad data can be output by git-annex and used by other
things, or could cause surprising behavior by git-annex. Including eg,
downloading the content of the wrong url.

So, something needs to be done to prevent old versions of git-annex from
running in a repository where this appending is being done..

Sponsored-by: Dartmouth College's DANDI project
2022-07-20 12:40:49 -04:00
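
A reader can tolerate such an interrupted append by discarding anything after
the last newline, roughly like this hypothetical helper (not necessarily the
real discardIncompleteAppend):

    import qualified Data.ByteString as B
    import Data.Word (Word8)

    newline :: Word8
    newline = 10

    -- Drop a trailing partial line left by an interrupted append. If the
    -- data does not end with a newline, everything after the last newline
    -- is discarded; a later append can rewrite that line in full.
    dropIncompleteAppend :: B.ByteString -> B.ByteString
    dropIncompleteAppend b
        | B.null b = b
        | B.last b == newline = b
        | otherwise = case B.elemIndexEnd newline b of
            Just i -> B.take (i + 1) b
            Nothing -> B.empty
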
Joey Hess
6f1fd3abdd
no locking of journal on read after all
Finally have a final design, and it turns out not to need locking on read.
2022-07-20 10:57:28 -04:00
Joey Hess
d0860b7f0e
fix build
After 28b0aaea54
2022-07-18 16:44:32 -04:00
Joey Hess
28b0aaea54
re-add lock journal before reading journal files
This reverts commit 2e6e9876e3.

This is gonna be needed after all.. The append will only be atomic if
the journal is locked, because the file being appended will have to be
moved out of the way to avoid an old version of git-annex seeing an
incomplete write to it. When git-annex finds that the file is not in the
journal, and checks the append location, locking will be needed to avoid
a race causing it to miss it in the append location too due to it being
moved back to the journal.
2022-07-18 16:40:25 -04:00
Joey Hess
36f0bdcd57
add annex.alwayscompact
Added annex.alwayscompact setting which can be unset to speed up writes to
the git-annex branch in some cases.

Sponsored-by: Dartmouth College's DANDI project
2022-07-18 16:39:19 -04:00
Joey Hess
ccff639651
Merge branch 'master' into append 2022-07-18 14:17:15 -04:00
Joey Hess
de18d92de6
efficient but unsafe journal file append
This is only for checking performance, it's not safe.

Sponsored-by: Dartmouth College's DANDI project
2022-07-18 14:17:12 -04:00
Joey Hess
1c40b927aa
minor optimisation
Avoid re-writing the file when the journal directory did not
exist.
2022-07-18 13:50:35 -04:00
Joey Hess
2e6e9876e3
Revert "lock journal before reading journal files"
This reverts commit 47358a6f95.

This added overhead, and will not be needed, because appends are going
to have to be made atomic for other reasons than avoiding incomplete
reads of data being appended.

In particular, when git-annex is interrupted in the middle of an append,
it must not leave the file with a partially written line. So appending
has to somehow be made fully atomic.
2022-07-18 13:38:12 -04:00
Joey Hess
ce455223df
split out appending to journal from writing, high level only
Currently this is not an improvement, but it allows for optimising
appendJournalFile later. With an optimised appendJournalFile, this will
greatly speed up access patterns like git-annex addurl of a lot of urls
to the same key, where the log file can grow rather large. Appending
rather than re-writing the journal file for each line can save a lot of
disk writes.

It still has to read the current journal or branch file, to check
if it can append to it, and so when the journal file does not exist yet,
it can write the old content from the branch to it. Probably the re-reads
are better cached by the filesystem than repeated writes. (If the
re-reads turn out to keep performance bad, they could be eliminated, at
the cost of not being able to compact the log when replacing old
information in it. That could be enabled by a switch.)

While the immediate need is to affect addurl writes, it was implemented
at the level of presence logs, so will also perhaps speed up location logs.
The only added overhead is the call to isNewInfo, which only needs to
compare ByteStrings. Helping to balance that out, it avoids compactLog
when it's able to append.

Sponsored-by: Dartmouth College's DANDI project
2022-07-18 13:22:50 -04:00
Joey Hess
47358a6f95
lock journal before reading journal files
This is not currently necessary; journal files are updated atomically.

However, for faster appends to large journal files, locking on read will
be needed, because appends are not atomic.

Sponsored-by: Dartmouth College's DANDI project
2022-07-15 14:43:29 -04:00
Joey Hess
a2b1f369d1
disable journalIgnorable in enableInteractiveBranchAccess
Fix a reversion that prevented --batch commands (and the assistant)
from noticing data written to the journal by other commands.

I have not identified which commit broke this for sure,
but probably it was aeca7c2207

--batch commands that wrote to the journal avoided the problem since
journalIgnorable gets unset on write. It's a little bit surprising that
nobody noticed that query --batch commands did not see data written by
other commands.

Sponsored-by: Dartmouth College's DANDI project
2022-07-15 13:48:41 -04:00
Joey Hess
91abd872d3
complete a comment 2022-07-15 12:59:59 -04:00
Joey Hess
ad467791c1
optimise journal writes to not mkdir journal directory when it already exists
Sponsored-by: Dartmouth College's DANDI project
2022-07-14 12:29:39 -04:00
Joey Hess
1b680d330b
revert accidental change 2022-07-13 15:17:08 -04:00
Joey Hess
68e9b7f987
comment 2022-07-13 13:44:43 -04:00
Joey Hess
f58fb6a79a
fix build when dbus is enabled
Broken in commit 8040ecf9b8
2022-07-05 13:06:45 -04:00
Joey Hess
8040ecf9b8
final readonly values moves to AnnexRead
At this point I've checked all AnnexState values and these were all that
remained that could move.

Pity that Annex.repo can't move, but it gets modified sometimes..

A couple of AnnexState values are set by options and could be AnnexRead,
but happen to use Annex when being set.

Sponsored-by: Max Thoursie on Patreon
2022-06-28 16:04:58 -04:00
Joey Hess
cb9cf30c48
move several readonly values to AnnexRead
This improves performance to a small extent in several places.

Sponsored-by: Tobias Ammann on Patreon
2022-06-28 15:40:19 -04:00
Joey Hess
debcf86029
use RawFilePath version of rename
Some small wins, almost certainly swamped by the system calls, but still
worthwhile progress on the RawFilePath conversion.

Sponsored-by: Erik Bjäreholt on Patreon
2022-06-22 16:47:34 -04:00
Joey Hess
d00e23cac9
RawFilePath optimisations 2022-06-22 16:20:08 -04:00
Joey Hess
224a57f9ed
RawFilePath optimisation 2022-06-22 16:11:03 -04:00
Joey Hess
95a04920cf
remove objectDir' 2022-06-22 16:08:49 -04:00
Joey Hess
f80ec74128
RawFilePath optimisation 2022-06-22 16:08:26 -04:00
Joey Hess
78a3d44ea0
get rid of racy addLink
The remaining callers all did not rely on it checking gitignore, so were
easy to convert.

They were susceptible to the same overwrite race as add and fix,
although less likely to have it, and with a narrower window than add's race.

In passing, an unnecessary call to removeFile was deleted from Command.Rekey;
addSymlink handles deleting any existing worktree file.
2022-06-14 14:47:15 -04:00
Joey Hess
7ace804d8e
avoid writing same symlink twice in a row
Oddly, the second write did not cause it to lose the mtime inherited
from the file being added, although the mtime was not provided to that
write but only to the first. I don't quite know why that worked before!
2022-06-14 14:30:12 -04:00
Joey Hess
5ef79125ad
fix overwrite race with git-annex add of annex symlink
In the unlikely case where git-annex add is run on an annex symlink that
is not already added, and while it's processing it, the annex symlink is
overwritten with something else, avoid git-annex overwriting that with
the symlink again.

Sponsored-by: Jack Hill on Patreon
2022-06-14 14:00:13 -04:00
Joey Hess
dd6dec4eb1
fix add overwrite race with git-annex add to annex
This is not a complete fix for all such races, only the one where a
large file gets changed while adding and gets added to git rather than
to the annex.

addLink needs to go away, any caller of it is probably subject to the
same kind of race. (Also, addLink itself fails to check gitignore when
symlinks are not supported.)

ingestAdd no longer checks gitignore. (It didn't check it consistently
before either, since there were cases where it did not run git add!)
When git-annex import calls it, it's already checked gitignore itself
earlier. When git-annex add calls it, it's usually on files found
by withFilesNotInGit, which handles checking ignores.

There was one other case, when git-annex add --batch calls it. In that
case, old git-annex behaved rather badly: it would seem to add the file,
but git add would later fail, leaving the file as an unstaged annex symlink.
That behavior has also been fixed.

Sponsored-by: Brett Eisenberg on Patreon
2022-06-14 13:37:19 -04:00
Joey Hess
c59ea5b1ca
info: Added --autoenable option
Use cases include using git-annex init --no-autoenable and then going back
and enabling the special remotes that have autoenable configured. As well
as just querying to remember which ones have it enabled.

It lists all special remotes that have autoenable=yes whether currently
enabled or not. And it can be used with --json.

I pondered making this "git-annex info autoenable", but that seemed wrong
because then if the user has a directory named "autoenable", it's unclear
what they are asking for. (Although "git-annex info remote" may be
similarly unclear.) Making it an option does mean that it can't be provided
via --batch though.

Sponsored-by: Dartmouth College's Datalad project
2022-06-01 14:20:38 -04:00
Joey Hess
f35c551d35
make path absolute for display
Avoid suggesting the user add "." to safe.directory.
2022-05-31 12:17:27 -04:00
Joey Hess
478ed28f98
revert windows-specific locking changes that broke tests
This reverts windows-specific parts of 5a98f2d509
There were no code paths in common between windows and unix, so this
will return Windows to the old behavior.

The problem that the commit talks about has to do with multiple different
locations where git-annex can store annex object files, but that is not
too relevant to Windows anyway, because on windows the filesystem is always
treated as crippled and/or symlinks are not supported, so it will only
use one object location. It would probably need to be using a repo populated
on another OS to have the other object location in use.
Then a drop and get could possibly lead to a dangling lock file.

And, I was not able to actually reproduce that situation happening
before making that commit, even when I forced a race. So making these
changes on windows was just begging for trouble..

I suspect that the change that caused the reversion is in
Annex/Content/Presence.hs. It checks if the content file exists,
and then called modifyContentDirWhenExists, which seems like it would
not fail, but if something deleted the content file at that point,
that call would fail. Which would result in an exception being thrown,
which should not normally happen from a call to inAnnexSafe. That was a
windows-specific change; the unix side did not have an equivalent
change.

Sponsored-by: Dartmouth College's Datalad project
2022-05-23 13:21:26 -04:00
Joey Hess
63624c40a0
fix typo in comment 2022-05-23 12:53:55 -04:00
Joey Hess
af0d854460
deal with git's changes for CVE-2022-24765
Deal with git's recent changes to fix CVE-2022-24765, which prevent using
git in a repository owned by someone else.

That makes git config --list not list the repo's configs, only global
configs. So annex.uuid and annex.version are not visible to git-annex.
It displayed a message about that, which is not right for this situation.
Detect the situation and display a better message, similar to the one other
git commands display.

Also, git-annex init when run in that situation would overwrite annex.uuid
with a new one, since it couldn't see the old one. Add a check to prevent
it running too in this situation. It may be that this fix has security
implications, if a config set by the malicious user who owns the repo
causes git or git-annex to run code. I don't think any git-annex configs
get run by git-annex init. It may be that some git config of a command
does get run by one of the git commands that git-annex init runs. ("git
status" is the command that prompted the CVE-2022-24765, since
core.fsmonitor can cause it to run a command). Since I don't know how
to exploit this, I'm not treating it as a security fix for now.

Note that passing --git-dir makes git bypass the security check. git-annex
does pass --git-dir to most calls to git, which it does to avoid needing
chdir to the directory containing a git repository when accessing a remote.
So, it's possible that somewhere in git-annex it gets as far as running git
with --git-dir, and git reads some configs that are unsafe (what
CVE-2022-24765 is about). This seems unlikely, it would have to be part of
git-annex that runs in git repositories that have no (visible) annex.uuid,
and git-annex init is the only one that I can think of that then goes on to
run git, as discussed earlier. But I've not fully ruled out there being
others..

The git developers seem mostly worried about "git status" or a similar
command implicitly run by a shell prompt, not an explicit use of git in
such a repository. For example, Ævar Arnfjörð Bjarmason wrote:
> * There are other bits of config that also point to executable things,
>   e.g. core.editor, aliases etc, but nothing has been found yet that
>   provides the "at a distance" effect that the core.fsmonitor vector
>   does.
>
>   I.e. a user is unlikely to go to /tmp/some-crap/here and run "git
>   commit", but they (or their shell prompt) might run "git status", and
>   if you have a /tmp/.git ...

Sponsored-by: Jarkko Kniivilä on Patreon
2022-05-20 14:38:27 -04:00
Joey Hess
aa414d97c9
make fsck normalize object locations
The purpose of this is to fix situations where the annex object file is
stored in a directory structure other than where annex symlinks point to.

But it will also move object files from the hashdirmixed back to
hashdirlower if the repo configuration makes that the normal location.
It would have been more work to avoid that than to let it do it.

Sponsored-by: Dartmouth College's Datalad project
2022-05-16 15:38:06 -04:00
Joey Hess
6b5029db29
fix hardcoding of number of hash directories
It can be changed to 1 via a tuning, rather than the 2 this assumed. So
it would have tried to rmdir .git/annex/objects in that case, which
would not hurt anything, but is not what it is supposed to do.

Sponsored-by: Dartmouth College's Datalad project
2022-05-16 15:08:42 -04:00
Joey Hess
5a98f2d509
avoid creating content directory when locking content
If the content directory does not exist, then it does not make sense to
lock the content file, as it also does not exist, and so it's ok for the
lock operation to fail.

This avoids potential races where the content file exists but is then
deleted/renamed, while another process sees that it exists and goes to
lock it, resulting in a dangling lock file in an otherwise empty object
directory.

Also renamed modifyContent to modifyContentDir since it is not only
necessarily used for modifying content files, but also other files in
the content directory.

Sponsored-by: Dartmouth College's Datalad project
2022-05-16 12:34:56 -04:00
Joey Hess
e8a601aa24
incremental verification for retrieval from import remotes
Sponsored-by: Dartmouth College's Datalad project
2022-05-09 15:39:43 -04:00
Joey Hess
2f2701137d
incremental verification for retrieval from all export remotes
Only for export remotes so far, not export/import.

Sponsored-by: Dartmouth College's Datalad project
2022-05-09 13:49:33 -04:00
Joey Hess
90950a37e5
support incremental verification when retrieving from export/import remotes
None of the special remotes do it yet, but this lays the groundwork.

Added MustFinishIncompleteVerify so that, when an incremental verify is
started but not complete, it can be forced to finish it. Otherwise, it
would have skipped doing it when verification is disabled, but
verification must always be done when retrieving from export remotes
since files can be modified during retrieval.

Note that retrieveExportWithContentIdentifier doesn't support incremental
verification yet. And I'm not sure if it can -- it doesn't know the Key
before it downloads the content. It seems a new API call would need to
be split out of that, which is provided with the key.

Sponsored-by: Dartmouth College's Datalad project
2022-05-09 12:25:04 -04:00
Joey Hess
8675b2b075
rename memoryUnits
It's not just used for memory sizes.
2022-05-05 15:35:11 -04:00
Joey Hess
d266a41f8d
prevent numcopies or mincopies being configured to 0
Ignore annex.numcopies set to 0 in gitattributes or git config, or by
git-annex numcopies or by --numcopies, since that configuration would make
git-annex easily lose data. Same for mincopies.

This is a continuation of the work to make data only be able to be lost
when --force is used. It earlier led to the --trust option being disabled,
and similar reasoning applies here.

Most numcopies configs had docs that strongly discouraged setting it to 0
anyway. And I can't imagine a use case for setting to 0. Not that there
might not be one, but it's just so far from the intended use case of
git-annex, of managing and storing your data, that it does not seem like
it makes sense to cater to such a hypothetical use case, where any
git-annex drop can lose your data at any time.

Using a smart constructor makes sure every place avoids 0. Note that this
does mean that NumCopies is for the configured desired values, and not the
actual existing number of copies, which of course can be 0. The name
configuredNumCopies is used to make that clear.

Sponsored-by: Brock Spratlen on Patreon
2022-03-28 15:20:34 -04:00
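
A minimal sketch of the smart-constructor approach, simplified from whatever
the real type looks like:

    -- The constructor is not exported, so every NumCopies value goes
    -- through configuredNumCopies, which refuses to go below 1. Note that
    -- this is the configured desired value, not the actual number of
    -- existing copies, which of course can be 0.
    newtype NumCopies = NumCopies Int
        deriving (Show, Eq, Ord)

    configuredNumCopies :: Int -> NumCopies
    configuredNumCopies n
        | n < 1 = NumCopies 1
        | otherwise = NumCopies n

    fromNumCopies :: NumCopies -> Int
    fromNumCopies (NumCopies n) = n
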
Joey Hess
982eb7ed0d
remove vendored http-client-restricted
Removed vendored copy of http-client-restricted, and removed the
HttpClientRestricted build flag that avoided that dependency.

http-client-restricted is in Debian stable, and the i386ancient build also
uses it, so I think this vendored copy is no longer needed.

Sponsored-by: Noam Kremen on Patreon
2022-03-22 11:50:06 -04:00
Joey Hess
952664641a
turn off PackageImports in cabal file
This makes it easier to build eg benchmarks of individual modules.

May be that most of these PackageImports are not really necessary,
dunno.
2022-02-25 13:16:36 -04:00
Joey Hess
51c528980c
avoid accidentally thawing git-annex symlink
It did nothing, since at this point the link is dangling. But when there
is a thaw hook, it would probably not be happy to be asked to run on a
symlink, or might do something unexpected.

Sponsored-by: Dartmouth College's Datalad project
2022-02-24 14:21:23 -04:00
Joey Hess
f4b046252a
Run annex.thawcontent-command before deleting an object file
In case annex.freezecontent-command did something that would prevent
deletion.

Sponsored-by: Dartmouth College's Datalad project
2022-02-24 14:11:02 -04:00
Joey Hess
346007a915
add debugging of freeze and thaw 2022-02-24 14:01:29 -04:00
Joey Hess
28bc5ce232
ignore write bits being set when there is a freeze hook
When annex.freezecontent-command is set, and the filesystem does not
support removing write bits, avoid treating it as a crippled filesystem.

The hook may be enough to prevent writing on its own, and some filesystems
ignore attempts to remove write bits.

Sponsored-by: Dartmouth College's Datalad project
2022-02-24 13:28:31 -04:00
Joey Hess
64ccb4734e
smudge: Warn when encountering a pointer file that has other content appended to it
It will then proceed to add the file the same as if it were any other
file containing possibly annexable content. Usually the file is one that
was annexed before, so the new, probably corrupt content will also be added
to the annex. If the file was not annexed before, the content will be added
to git.

It's not possible for the smudge filter to throw an error here, because
git then just adds the file to git anyway.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 15:17:08 -04:00
Joey Hess
67245ae00f
fully specify the pointer file format
This format is designed to detect accidental appends, while having some
room for future expansion.

Detect when an unlocked file whose content is not present has gotten some
other content appended to it, and avoid treating it as a pointer file, so
that appended content will not be checked into git, but will be annexed
like any other file.

Dropped the max size of a pointer file down to 32kb. It was around 80kb
before, without any good reason, and certainly there are no valid pointer
files anywhere that are larger than 8kb, because what a pointer file with
additional data even looks like has only just been specified.

I assume 32kb will be good enough for anyone. ;-) Really though, it needs
to be some smallish number, because that much of a file in git gets read
into memory when eg, catting pointer files. And since we have no use cases
for the extra lines of a pointer file yet, except possibly to add
some human-visible explanation that it is a git-annex pointer file, 32k
seems as reasonable an arbitrary number as anything. Increasing it would be
possible, eg to 64k, as long as users of such jumbo pointer files didn't
mind upgrading all their git-annex installations to one that supports the
new larger size.

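A sketch of the bounded read-and-validate approach (this is not
git-annex's real parser; the prefix check stands in for the full format
validation, and the real format also constrains what may follow the
first line so that appended data is detected):

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Char8 as B8

    maxPointerSz :: Int
    maxPointerSz = 32768

    -- Reject anything too large to possibly be a pointer file, then only
    -- accept content whose first line looks like a link into the annex.
    parsePointer :: B.ByteString -> Maybe B.ByteString
    parsePointer b
        | B.length b > maxPointerSz = Nothing
        | otherwise = case B8.lines b of
            (l1:_) | B8.pack "annex/objects/" `B.isInfixOf` l1 -> Just l1
            _ -> Nothing
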
Sponsored-by: Dartmouth College's Datalad project
2022-02-23 14:20:31 -04:00
Joey Hess
5b373a9dd2
read a consistent amount from pointer file
A few places were reading the max symlink size of a pointer file,
then passing it to parseLinkTargetOrPointer. Which is fine currently, but
to support pointer files with lines of data after the pointer, enough
has to be read that parseLinkTargetOrPointer can be assured of seeing
enough of that data to know if it's correctly formatted.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 12:52:34 -04:00
Joey Hess
4cd9325c2c
fold parseLinkTarget into parseLinkTargetOrPointer
Only one place remained that differentiated between them.

It is the case that a symlink target that happens to contain a newline
somehow will be treated as a link to a key truncated at the newline.
This is super unlikely to happen, and since a key cannot actually
contain a newline, it's as good a behavior as any. Anyway, this commit
does not change the behavior there, although arguably it should be
changed. Note that getAnnexLinkTarget does prevent a symlink target
containing a newline.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 12:30:32 -04:00
Joey Hess
ce1b3a9699
info: Allow using matching options in more situations
File matching options like --include will be rejected in situations where
there is no filename to match against. (Or where there is a filename but
it's not relative to the cwd, or otherwise seemed too bothersome to match
against.)

The addition of listKeys' was necessary to avoid using more memory in the
common case of "git-annex info". Adding a filterM would have caused the
list to buffer in memory and not stream. This is an ugly hack, but listKeys
had previously run Annex operations inside unsafeInterleaveIO (for direct
mode). And matching against a matcher should hopefully not change any Annex
state.

This does allow for eg `git-annex info somefile --include=*.ext`
although why someone would want to do that I don't really know. But it
seems to make sense to allow it.
But, consider: `git-annex info ./somefile --include=somefile`
This does not match, so will not display info about somefile.
If the user really wants to, they can `--include=./somefile`.

Using matching options like --copies or --in=remote seems likely to be
slower than git-annex find with those options, because unlike such
commands, info does not have optimised streaming through the matcher.

Note that `git-annex info remote` is not the same as
`git-annex info --in remote`. The former shows info about all files in
the remote. The latter shows local keys that are also in that remote.
The output should make that clear, but this still seems like a point
where users could get confused.

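For illustration, the difference between filterM and interleaving, as a
toy standalone filter (not listKeys' itself):

    import System.IO.Unsafe (unsafeInterleaveIO)

    -- filterM in IO runs every check before returning anything, forcing
    -- the whole list into memory. Interleaving the recursion keeps the
    -- list streaming, so matches can be consumed as they are produced.
    streamFilter :: (a -> IO Bool) -> [a] -> IO [a]
    streamFilter _ [] = return []
    streamFilter p (x:xs) = do
        keep <- p x
        rest <- unsafeInterleaveIO (streamFilter p xs)
        return (if keep then x : rest else rest)
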
Sponsored-by: Jochen Bartl on Patreon
2022-02-21 14:46:07 -04:00
Joey Hess
faf84aa5c2
Avoid git status taking a long time after git-annex unlock of many files.
Implemented by making Git.Queue have a FlushAction, which can accumulate
along with another action on files, and runs only once the other action has
run.

This lets git-annex unlock queue up git update-index actions, without
conflicting with the restagePointerFiles FlushActions.

In a repository with filter-process enabled, git-annex unlock will
often not take any more time than before, though it may when the files are
large. Either way, it should always slow down less than git status
speeds up.

When filter-process is not enabled, git-annex unlock will slow down as much
as git status speeds up.

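A toy model of the idea (Git.Queue's real type tracks command length
limits and more; this only shows the ordering and run-once behaviour):

    import qualified Data.Map.Strict as M

    data Queue = Queue
        { fileActions :: [IO ()]               -- accumulated per-file work
        , flushActions :: M.Map String (IO ()) -- keyed so each runs only once
        }

    emptyQueue :: Queue
    emptyQueue = Queue [] M.empty

    addFileAction :: IO () -> Queue -> Queue
    addFileAction a q = q { fileActions = a : fileActions q }

    -- Adding the same named FlushAction twice keeps only one copy of it.
    addFlushAction :: String -> IO () -> Queue -> Queue
    addFlushAction name a q = q
        { flushActions = M.insertWith (\_ old -> old) name a (flushActions q) }

    -- The per-file actions run first, in the order queued, and only then
    -- does each FlushAction run, exactly once.
    flushQueue :: Queue -> IO Queue
    flushQueue q = do
        sequence_ (reverse (fileActions q))
        sequence_ (M.elems (flushActions q))
        return emptyQueue
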
Sponsored-by: Jochen Bartl on Patreon
2022-02-18 15:06:40 -04:00
Joey Hess
21e40b86d8
have v9 autoupgrade to v10
This was right before commit a27776f602,
which made v6 and v7 autoupgrade to v8 but not yet to v10.

Sponsored-by: Dartmouth College's Datalad project
2022-01-26 13:16:06 -04:00
Joey Hess
a27776f602
init --version=6 upgrade to 8 not yet 10
autoUpgradeableVersions had latestVersion (10), but it did not make
sense that asking for old version 6 got version 10, while asking for
version 8 got version 8. So use defaultVersion (8) instead.

Sponsored-by: Dartmouth College's Datalad project
2022-01-25 13:52:42 -04:00
Joey Hess
3618746a85
fix failing readonly test case
The problem is that withContentLockFile, in a v8 repo, has to take a shared
lock of `.git/annex/content.lck`. But, in a readonly repository, if that
file does not yet exist, it cannot lock it. And while it will sometimes
work to `chmod +r .git/annex`, the repository might be readonly due to
being owned by another user, or due to being mounted readonly.

So, it seems that the only solution is to use some other file than
`.git/annex/content.lck` as the lock file. The inode sentinel file
was almost the only option that should always exist. (And if it somehow
does not exist, creating an empty one for locking will be ok.)

Wow, what a hack!

Sponsored-by: Dartmouth College's Datalad project
2022-01-21 13:49:31 -04:00
Joey Hess
47084b8a1d
enable filter.annex.process in v9
This has tradeoffs, but is generally a win, and users for whom it causes
git add to slow down unacceptably can just disable it again.

It needed to happen in an upgrade, since there are git-annex versions
that do not support it, and using such an old version with a v8
repository with filter.annex.process set will cause bad behavior.
By enabling it in v9, it's guaranteed that any git-annex version that
can use the repository does support it. Although, this is not a perfect
protection against problems, since an old git-annex version, if it's
used with a v9 repository, will cause git add to try to run
git-annex filter-process, which will fail. But at least, the user is
unlikely to have an old git-annex in path if they are using a v9
repository, since it won't work in that repository.

Sponsored-by: Dartmouth College's Datalad project
2022-01-21 13:11:18 -04:00
Joey Hess
dc14221bc3
detect v10 upgrade while running
Capstone of the v10 upgrade process.

Tested with a git-annex drop in a v8 repo that had a local v8 remote.
Upgrading the repo to v10 (with --force) immediately caused it to notice
and switch over to v10 locking. Upgrading the remote also caused it to
switch over when operating on the remote.

The InodeCache makes this fairly efficient, just an added stat call per
lock of an object file. After the v10 upgrade, there is no more
overhead.

Sponsored-by: Dartmouth College's Datalad project
2022-01-21 12:56:38 -04:00
Joey Hess
76e365769e
fix crash after drop in v10
After cleaning up the lock file, the content directory is gone, so
freezing it failed.

Sponsored-by: Dartmouth College's Datalad project
2022-01-20 14:03:27 -04:00
Joey Hess
d0a5714409
continue to use v8 by default for now, unless upgraded
Since it's easy to keep supporting v8, using it for a while (eg a few
months) will give users time to upgrade git-annex installations, before
it upgrades their repository to v9.

This commit should be reverted once ready to start upgrading
repositories by default.

Sponsored-by: Dartmouth College's Datalad project
2022-01-20 11:56:05 -04:00
Joey Hess
0904eac8b4
automatic upgrade from v8 to v9
Sponsored-by: Dartmouth College's Datalad project
2022-01-20 11:39:36 -04:00
Joey Hess
cea6f6db92
v10 upgrade locking
The v10 upgrade should almost be safe now. What remains to be done is
notice when the v10 upgrade has occurred, while holding the shared lock,
and switch to using v10 lock files.

Sponsored-by: Dartmouth College's Datalad project
2022-01-20 11:33:14 -04:00
Joey Hess
9d5db6a09a
add upgrade.log
The upgrade from v9 uses this to avoid an automatic upgrade until 1 year
after the v9 upgrade. It can also be used in future such situations.

Sponsored-by: Dartmouth College's Datalad project
2022-01-19 15:52:29 -04:00
Joey Hess
856ce5cf5f
split upgrade into v9 and v10
The v10 upgrade will run 1 year after the upgrade to v9, to give time for
any v8 processes to die. Until that point, the v10 upgrade will be tried by
every process but deferred, so added support for deferring upgrades.

The upgrade prevention lock file that will be used by v10 is not yet
implemented, so it does not yet defer.

Sponsored-by: Dartmouth College's Datalad project
2022-01-19 13:09:33 -04:00
Joey Hess
4f7b8ce09d
fix spelling of upgradeable 2022-01-19 12:14:50 -04:00
Joey Hess
538d02d397
delete content lock file safely after shared lock
Upgrade the shared lock to an exclusive lock, and then delete the
lock file. If there is another process still holding the shared lock,
the first process will fail taking the exclusive lock, and not delete
the lock file; then the other process will later delete it.

Note that, in the time period where the exclusive lock is held, other
attempts to lock the content in place would fail. This is unlikely to be
a problem since it's a short period.

Other attempts to lock the content for removal would also fail in that
time period, but that's no different than a removal failing because
content is locked to prevent removal.

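At the fcntl level the pattern looks roughly like this (plain unix
package calls, not git-annex's own lock modules):

    import Control.Exception (IOException, try)
    import System.IO (SeekMode(AbsoluteSeek))
    import System.Posix.Files (removeLink)
    import System.Posix.IO (LockRequest(WriteLock), closeFd, setLock)
    import System.Posix.Types (Fd)

    -- Called while still holding a shared lock on fd. Only the process
    -- that manages the non-blocking upgrade to an exclusive lock deletes
    -- the lock file; any other holder leaves it to be deleted later.
    dropContentLockFile :: Fd -> FilePath -> IO ()
    dropContentLockFile fd lockfile = do
        r <- try (setLock fd (WriteLock, AbsoluteSeek, 0, 0))
            :: IO (Either IOException ())
        case r of
            Right () -> removeLink lockfile -- no other holders
            Left _ -> return ()             -- still locked elsewhere
        closeFd fd
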
Sponsored-by: Dartmouth College's Datalad project
2022-01-13 14:54:57 -04:00
Joey Hess
86e5ffe34a
clean empty object directories after deleting content lock file
When dropping content, this was already done after deleting the content
file, but the lock file prevents deleting the directories. So, try the
deletion again.

This does mean there's a small added overhead of a failed rmdir().

Sponsored-by: Dartmouth College's Datalad project
2022-01-13 14:22:37 -04:00
Joey Hess
e28d1d0325
fix logic that was not inverted after all
oops
2022-01-13 14:11:36 -04:00
Joey Hess
a3b6b3499b
delete content lock file safely on drop, keep after shared lock
This seems to be the best that can be done to avoid forever accumulating
the new content lock files, while being fully safe.

This is fixing code paths that have lingered unused since direct mode!
And direct mode seems to have been buggy in this area, since the content
lock file was deleted on unlock. But with a shared lock, there could be
another process that also had the lock file locked, and deleting it
invalidates that lock.

So, the lock file cannot be deleted after a shared lock. At least, not
without taking an exclusive lock first, which I have not pursued yet but may.

After an exclusive lock, the lock file can be deleted. But there is
still a potential race, where the exclusive lock is held, and another
process gets the file open, just as the exclusive lock is dropped and
the lock file is deleted. That other process would be left with a file
handle it can take a shared lock of, but with no effect since the file
is deleted. Annex.Transfer also deletes lock files, and deals with this
same problem by using checkSaneLock, which is how I've dealt with it
here.

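The checkSaneLock idea, reduced to its core (again plain posix calls
rather than git-annex's lock abstraction):

    import Control.Exception (IOException, try)
    import System.Posix.Files
        (FileStatus, deviceID, fileID, getFdStatus, getFileStatus)
    import System.Posix.Types (Fd)

    -- After taking the lock, confirm the lock file path still names the
    -- same inode as the locked fd. If another process deleted (and maybe
    -- recreated) the file in the meantime, the lock held is meaningless.
    checkSaneLock :: FilePath -> Fd -> IO Bool
    checkSaneLock lockfile fd = do
        fdstat <- getFdStatus fd
        r <- try (getFileStatus lockfile)
            :: IO (Either IOException FileStatus)
        return $ case r of
            Left _ -> False -- lock file is gone
            Right pathstat -> fileID pathstat == fileID fdstat
                && deviceID pathstat == deviceID fdstat
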
Sponsored-by: Dartmouth College's Datalad project
2022-01-13 13:58:58 -04:00
Joey Hess
3d7933f124
fix inverted logic
Now the content lock files are used in v9. However, I am not yet certain
they are correct. In particular, lockContentUsing deletes
the content lock file on unlock. But what if there's a shared lock
by another process? That seems like it would discard that lock too!

(Windows seems like it would not have the same problem, because as the
comment in there says, "Can't delete a locked file on Windows".
So if another process has a shared lock, removing it presumably fails.)

Sponsored-by: Dartmouth College's Datalad project
2022-01-13 13:58:31 -04:00
Joey Hess
731b1ecf87
v9 upgrade implemented
Seems to work ok. Unsure yet about the actual locking changes being
correct.

This is not the end of the story with upgrades, because it is unsafe for
this upgrade as implemented to run in a repository where an old
git-annex process is already running. The old process would use the old
locking method, and not notice files locked by the new, and this could
result in data loss. This problem will need to be dealt with before this
branch is suitable for merging.

Sponsored-by: Dartmouth College's Datalad project
2022-01-13 13:25:10 -04:00
Joey Hess
3936599885
move code from Command.Fsck
Sponsored-by: Dartmouth College's Datalad project
2022-01-13 13:24:50 -04:00
Joey Hess
3c042606c2
use separate lock from content file in v9
Windows has always used a separate lock file, but on unix, the content
file itself was locked, and in v9 that changes to also use a separate
lock file.

This needs to be tested more. Eg, what happens after dropping a file;
does the content lock file get deleted too, or does it linger around?

Sponsored-by: Dartmouth College's Datalad project
2022-01-11 17:03:14 -04:00
Joey Hess
43f9d967ff
shared repository content file permissions for v9
v9 will not need to write to annex content files in order to lock them,
so freezeContent removes the write bit in a shared repository, the same
as in any other repository.

checkContentWritePerm makes sure that the write perm is not set, which
will let git-annex fsck fix up the permissions. Upgrading to v9
will need to fix the permissions as well, but it seems likely there will
be situations where the user that git-annex is running the upgrade as
cannot, so it will have to leave the write bit set. In such a case, git-annex
fsck can fix it later.

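At the permissions level this is just a chmod; a minimal sketch (the
real freezeContent also consults core.sharedRepository and the
annex.freezecontent-command hook):

    import Data.Bits (complement)
    import System.Posix.Files

    -- Drop all write bits from an annexed content file, leaving the
    -- other mode bits alone.
    freezeContentFile :: FilePath -> IO ()
    freezeContentFile f = do
        st <- getFileStatus f
        setFileMode f (fileMode st `intersectFileModes` complement writebits)
      where
        writebits = ownerWriteMode `unionFileModes` groupWriteMode
            `unionFileModes` otherWriteMode
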
Sponsored-by: Dartmouth College's Datalad project
2022-01-11 16:50:50 -04:00
Joey Hess
ff570ad363
add v9 annex.version, not yet the default
This is the start of v9, but it's currently identical to v8, and v8 is
not upgraded to it. git-annex upgrade will upgrade to v9 with this
change.

Sponsored-by: Dartmouth College's Datalad project
2022-01-11 14:59:39 -04:00
Joey Hess
e95747a149
fix handling of corrupted data received from git remote
Recover from corrupted content being received from a git remote due eg to a
wire error, by deleting the temporary file when it fails to verify. This
prevents a retry from failing again.

Reversion introduced in version 8.20210903, when incremental verification
was added.

Only the git remote seems to be affected, although it is certainly
possible that other remotes could later have the same issue. This only
affects things passed to getViaTmp that return (False, UnVerified) due to
verification failing. As far as getViaTmp can tell, that could just as well
mean that the transfer failed in a way that a retry could resume, so it
cannot delete the temp file itself. Remote.Git and P2P.Annex use getViaTmp
internally, while other remotes do not, which is why only the git remote
seems affected.

A better fix perhaps would be to improve the types of the callback
passed to getViaTmp, so that some other value could be used to indicate
the state where the transfer succeeded but verification failed.

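That suggested type improvement might look something like this (purely
illustrative, not the actual git-annex types):

    -- Distinguishing "failed, may resume" from "completed but corrupt"
    -- tells the caller when the temp file must be deleted.
    data TmpResult
        = TransferFailed        -- keep the temp file; a retry may resume it
        | TransferredUnverified -- delete the temp file; it failed to verify
        | TransferredVerified   -- move the temp file into the annex
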
Sponsored-by: Boyd Stephen Smith Jr.
2022-01-07 13:25:33 -04:00
Joey Hess
21c0d5be6e
comment 2022-01-07 12:27:19 -04:00
Joey Hess
e416635021
renameremote: Better handling of case where there are multiple special remotes with a name
Instead of renaming one at random, error out and ask that a uuid be
specified.

Sponsored-by: Brett Eisenberg on Patreon
2022-01-05 15:24:02 -04:00
Joey Hess
58afb00f6e
enableremote: Better handling of the unusual case where multiple special remotes have been initialized with the same name
Before it would pick one at random, though preferring ones that were not
dead over dead ones.

Now, if one is dead and the other not, it will use the non-dead one. But if
both are not dead, or both dead, it will error out, suggesting the user
clarify what they want to enable.

Sponsored-by: Luke Shumaker on Patreon
2022-01-05 15:12:11 -04:00
Joey Hess
b1d719f9d2
handle transitions with read-only unmerged git-annex branches
Capstone to this feature. Any transitions that have been performed on an
unmerged remote ref but not on the local git-annex branch, or vice-versa,
have to be applied on the fly when reading files.

Sponsored-by: Dartmouth College's Datalad project
2021-12-28 13:23:32 -04:00
Joey Hess
720baf820e
refactoring 2021-12-28 12:15:51 -04:00