Commit graph

1835 commits

Author SHA1 Message Date
Joey Hess
478ed28f98
revert windows-specific locking changes that broke tests
This reverts windows-specific parts of 5a98f2d509
There were no code paths in common between windows and unix, so this
will return Windows to the old behavior.

The problem that the commit talks about has to do with multiple different
locations where git-annex can store annex object files, but that is not
too relevant to Windows anyway, because on windows the filesystem is always
treated as crippled and/or symlinks are not supported, so it will only
use one object location. It would need to be using a repo populated
in another OS to have the other object location in use probably.
Then a drop and get could possibly lead to a dangling lock file.

And, I was not able to actually reproduce that situation happening
before making that commit, even when I forced a race. So making these
changes on windows was just begging for trouble..

I suspect that the change that caused the reversion is in
Annex/Content/Presence.hs. It checks if the content file exists,
and then calls modifyContentDirWhenExists, which seems like it would
not fail, but if something deleted the content file at that point,
that call would fail. Which would result in an exception being thrown,
which should not normally happen from a call to inAnnexSafe. That was a
windows-specific change; the unix side did not have an equivalent
change.

Sponsored-by: Dartmouth College's Datalad project
2022-05-23 13:21:26 -04:00
Joey Hess
63624c40a0
fix typo in comment 2022-05-23 12:53:55 -04:00
Joey Hess
af0d854460
deal with git's changes for CVE-2022-24765
Deal with git's recent changes to fix CVE-2022-24765, which prevent using
git in a repository owned by someone else.

That makes git config --list not list the repo's configs, only global
configs. So annex.uuid and annex.version are not visible to git-annex.
It displayed a message about that, which is not right for this situation.
Detect the situation and display a better message, similar to the one other
git commands display.

Also, git-annex init when run in that situation would overwrite annex.uuid
with a new one, since it couldn't see the old one. Add a check to prevent
it running too in this situation. It may be that this fix has security
implications, if a config set by the malicious user who owns the repo
causes git or git-annex to run code. I don't think any git-annex configs
get run by git-annex init. It may be that some git config of a command
does get run by one of the git commands that git-annex init runs. ("git
status" is the command that prompted the CVE-2022-24765, since
core.fsmonitor can cause it to run a command). Since I don't know how
to exploit this, I'm not treating it as a security fix for now.

Note that passing --git-dir makes git bypass the security check. git-annex
does pass --git-dir to most calls to git, which it does to avoid needing
to chdir to the directory containing a git repository when accessing a remote.
So, it's possible that somewhere in git-annex it gets as far as running git
with --git-dir, and git reads some configs that are unsafe (what
CVE-2022-24765 is about). This seems unlikely, it would have to be part of
git-annex that runs in git repositories that have no (visible) annex.uuid,
and git-annex init is the only one that I can think of that then goes on to
run git, as discussed earlier. But I've not fully ruled out there being
others..

The git developers seem mostly worried about "git status" or a similar
command implicitly run by a shell prompt, not an explicit use of git in
such a repository. For example, Ævar Arnfjörð Bjarmason wrote:
> * There are other bits of config that also point to executable things,
>   e.g. core.editor, aliases etc, but nothing has been found yet that
>   provides the "at a distance" effect that the core.fsmonitor vector
>   does.
>
>   I.e. a user is unlikely to go to /tmp/some-crap/here and run "git
>   commit", but they (or their shell prompt) might run "git status", and
>   if you have a /tmp/.git ...

Sponsored-by: Jarkko Kniivilä on Patreon
2022-05-20 14:38:27 -04:00
Joey Hess
aa414d97c9
make fsck normalize object locations
The purpose of this is to fix situations where the annex object file is
stored in a directory structure other than where annex symlinks point to.

But it will also move object files from the hashdirmixed back to
hashdirlower if the repo configuration makes that the normal location.
It would have been more work to avoid that than to let it do it.

Sponsored-by: Dartmouth College's Datalad project
2022-05-16 15:38:06 -04:00
Joey Hess
6b5029db29
fix hardcoding of number of hash directories
It can be changed to 1 via a tuning, rather than the 2 this assumed. So
it would have tried to rmdir .git/annex/objects in that case, which
would not hurt anything, but is not what it is supposed to do.

Sponsored-by: Dartmouth College's Datalad project
2022-05-16 15:08:42 -04:00
Joey Hess
5a98f2d509
avoid creating content directory when locking content
If the content directory does not exist, then it does not make sense to
lock the content file, as it also does not exist, and so it's ok for the
lock operation to fail.

This avoids potential races where the content file exists but is then
deleted/renamed, while another process sees that it exists and goes to
lock it, resulting in a dangling lock file in an otherwise empty object
directory.

Also renamed modifyContent to modifyContentDir since it is not
necessarily used only for modifying content files, but also other files in
the content directory.

Sponsored-by: Dartmouth College's Datalad project
2022-05-16 12:34:56 -04:00
Joey Hess
e8a601aa24
incremental verification for retrieval from import remotes
Sponsored-by: Dartmouth College's Datalad project
2022-05-09 15:39:43 -04:00
Joey Hess
2f2701137d
incremental verification for retrieval from all export remotes
Only for export remotes so far, not export/import.

Sponsored-by: Dartmouth College's Datalad project
2022-05-09 13:49:33 -04:00
Joey Hess
90950a37e5
support incremental verification when retrieving from export/import remotes
None of the special remotes do it yet, but this lays the groundwork.

Added MustFinishIncompleteVerify so that, when an incremental verify is
started but not complete, it can be forced to finish it. Otherwise, it
would have skipped doing it when verification is disabled, but
verification must always be done when retrieving from export remotes
since files can be modified during retrieval.

Note that retrieveExportWithContentIdentifier doesn't support incremental
verification yet. And I'm not sure if it can -- it doesn't know the Key
before it downloads the content. It seems a new API call would need to
be split out of that, which is provided with the key.

Sponsored-by: Dartmouth College's Datalad project
2022-05-09 12:25:04 -04:00
Joey Hess
8675b2b075
rename memoryUnits
It's not just used for memory sizes.
2022-05-05 15:35:11 -04:00
Joey Hess
d266a41f8d
prevent numcopies or mincopies being configured to 0
Ignore annex.numcopies set to 0 in gitattributes or git config, or by
git-annex numcopies or by --numcopies, since that configuration would make
git-annex easily lose data. Same for mincopies.

This is a continuation of the work to make data only be able to be lost
when --force is used. It earlier led to the --trust option being disabled,
and similar reasoning applies here.

Most numcopies configs had docs that strongly discouraged setting it to 0
anyway. And I can't imagine a use case for setting to 0. Not that there
might not be one, but it's just so far from the intended use case of
git-annex, of managing and storing your data, that it does not seem like
it makes sense to cater to such a hypothetical use case, where any
git-annex drop can lose your data at any time.

Using a smart constructor makes sure every place avoids 0. Note that this
does mean that NumCopies is for the configured desired values, and not the
actual existing number of copies, which of course can be 0. The name
configuredNumCopies is used to make that clear.
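
The smart constructor idea can be sketched roughly as below. This is a
minimal Python illustration, not the git-annex code (which is Haskell).
The fallback to 1 when a value is rejected is an assumption made for the
sketch, and configured_num_copies is only a Python-style spelling of the
name mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumCopies:
    """A configured, desired number of copies (not the number that exist)."""
    value: int


def configured_num_copies(n: int) -> NumCopies:
    """Build a NumCopies from a configured value, never yielding 0.

    Assumption for this sketch: a value below 1 falls back to 1.
    """
    return NumCopies(n if n > 0 else 1)
```

If every config source goes through configured_num_copies, no code path
can end up holding a configured value of 0.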

Sponsored-by: Brock Spratlen on Patreon
2022-03-28 15:20:34 -04:00
Joey Hess
982eb7ed0d
remove vendored http-client-restricted
Removed vendored copy of http-client-restricted, and removed the
HttpClientRestricted build flag that avoided that dependency.

http-client-restricted is in Debian stable, and the i386ancient build also
uses it, so I think this vendored copy is no longer needed.

Sponsored-by: Noam Kremen on Patreon
2022-03-22 11:50:06 -04:00
Joey Hess
952664641a
turn of PackageImports in cabal file
This makes it easier to build eg benchmarks of individual modules.

May be that most of these PackageImports are not really necessary,
dunno.
2022-02-25 13:16:36 -04:00
Joey Hess
51c528980c
avoid accidentally thawing git-annex symlink
It did nothing, since at this point the link is dangling. But when there
is a thaw hook, it would probably not be happy to be asked to run on a
symlink, or might do something unexpected.

Sponsored-by: Dartmouth College's Datalad project
2022-02-24 14:21:23 -04:00
Joey Hess
f4b046252a
Run annex.thawcontent-command before deleting an object file
In case annex.freezecontent-command did something that would prevent
deletion.

Sponsored-by: Dartmouth College's Datalad project
2022-02-24 14:11:02 -04:00
Joey Hess
346007a915
add debugging of freeze and thaw 2022-02-24 14:01:29 -04:00
Joey Hess
28bc5ce232
ignore write bits being set when there is a freeze hook
When annex.freezecontent-command is set, and the filesystem does not
support removing write bits, avoid treating it as a crippled filesystem.

The hook may be enough to prevent writing on its own, and some filesystems
ignore attempts to remove write bits.

Sponsored-by: Dartmouth College's Datalad project
2022-02-24 13:28:31 -04:00
Joey Hess
64ccb4734e
smudge: Warn when encountering a pointer file that has other content appended to it
It will then proceed to add the file the same as if it were any other
file containing possibly annexable content. Usually the file is one that
was annexed before, so the new, probably corrupt content will also be added
to the annex. If the file was not annexed before, the content will be added
to git.

It's not possible for the smudge filter to throw an error here, because
git then just adds the file to git anyway.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 15:17:08 -04:00
Joey Hess
67245ae00f
fully specify the pointer file format
This format is designed to detect accidental appends, while having some
room for future expansion.

Detect when an unlocked file whose content is not present has gotten some
other content appended to it, and avoid treating it as a pointer file, so
that appended content will not be checked into git, but will be annexed
like any other file.

Dropped the max size of a pointer file down to 32kb, it was around 80 kb,
but without any good reason and certainly there are no valid pointer files
anywhere that are larger than 8kb, because it's just been specified what it
means for a pointer file with additional data even looks like.

I assume 32kb will be good enough for anyone. ;-) Really though, it needs
to be some smallish number, because that much of a file in git gets read
into memory when eg, catting pointer files. And since we have no use cases
for the extra lines of a pointer file yet, except possibly to add
some human-visible explanation that it is a git-annex pointer file, 32k
seems as reasonable an arbitrary number as anything. Increasing it would be
possible, eg to 64k, as long as users of such jumbo pointer files didn't
mind upgrading all their git-annex installations to one that supports the
new larger size.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 14:20:31 -04:00
Joey Hess
5b373a9dd2
read a consistent amount from pointer file
A few places were reading the max symlink size of a pointer file,
then passing to parseLinkTargetOrPointer. Which is fine currently, but
to support pointer files with lines of data after the pointer, enough
has to be read that parseLinkTargetOrPointer can be assured of seeing
enough of that data to know if it's correctly formatted.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 12:52:34 -04:00
Joey Hess
4cd9325c2c
fold parseLinkTarget into parseLinkTargetOrPointer
Only one place remained that differentiated between them.

It is the case that a symlink target that happens to contain a newline
somehow will be treated as a link to a key truncated at the newline.
This is super unlikely to happen, and since a key cannot actually
contain a newline, it's as good a behavior as any. Anyway, this commit
does not change the behavior there, although arguably it should be
changed. Note that getAnnexLinkTarget does prevent a symlink target
containing a newline.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 12:30:32 -04:00
Joey Hess
ce1b3a9699
info: Allow using matching options in more situations
File matching options like --include will be rejected in situations where
there is no filename to match against. (Or where there is a filename but
it's not relative to the cwd, or otherwise seemed too bothersome to match
against.)

The addition of listKeys' was necessary to avoid using more memory in the
common case of "git-annex info". Adding a filterM would have caused the
list to buffer in memory and not stream. This is an ugly hack, but listKeys
had previously run Annex operations inside unsafeInterleaveIO (for direct
mode). And matching against a matcher should hopefully not change any Annex
state.

This does allow for eg `git-annex info somefile --include=*.ext`
although why someone would want to do that I don't really know. But it
seems to make sense to allow it.
But, consider: `git-annex info ./somefile --include=somefile`
This does not match, so will not display info about somefile.
If the user really wants to, they can `--include=./somefile`.

Using matching options like --copies or --in=remote seems likely to be
slower than git-annex find with those options, because unlike such
commands, info does not have optimised streaming through the matcher.

Note that `git-annex info remote` is not the same as
`git-annex info --in remote`. The former shows info about all files in
the remote. The latter shows local keys that are also in that remote.
The output should make that clear, but this still seems like a point
where users could get confused.

Sponsored-by: Jochen Bartl on Patreon
2022-02-21 14:46:07 -04:00
Joey Hess
faf84aa5c2
Avoid git status taking a long time after git-annex unlock of many files.
Implemented by making Git.Queue have a FlushAction, which can accumulate
along with another action on files, and runs only once the other action has
run.

This lets git-annex unlock queue up git update-index actions, without
conflicting with the restagePointerFiles FlushActions.

In a repository with filter-process enabled, git-annex unlock will
often not take any more time than before, though it may when the files are
large. Either way, it should always slow down less than git status
speeds up.

When filter-process is not enabled, git-annex unlock will slow down as much
as git status speeds up.

Sponsored-by: Jochen Bartl on Patreon
2022-02-18 15:06:40 -04:00
Joey Hess
21e40b86d8
have v9 autoupgrade to v10
This was right before commit a27776f602,
which made v6 v7 autoupgrade to v8 but not yet to v10.

Sponsored-by: Dartmouth College's Datalad project
2022-01-26 13:16:06 -04:00
Joey Hess
a27776f602
init --version=6 upgrade to 8 not yet 10
autoUpgradeableVersions had latestVersion (10), but it did not make
sense for asking for old version 6 to get version 10, while asking for
version 8 got version 8. So use defaultVersion (8) instead.

Sponsored-by: Dartmouth College's Datalad project
2022-01-25 13:52:42 -04:00
Joey Hess
3618746a85
fix failing readonly test case
The problem is that withContentLockFile, in a v8 repo, has to take a shared
lock of `.git/annex/content.lck`. But, in a readonly repository, if that
file does not yet exist, it cannot lock it. And while it will sometimes
work to `chmod +r .git/annex`, the repository might be readonly due to
being owned by another user, or due to being mounted readonly.

So, it seems that the only solution is to use some other file than
`.git/annex/content.lck` as the lock file. The inode sentinel file
was almost the only option that should always exist. (And if it somehow
does not exist, creating an empty one for locking will be ok.)

Wow, what a hack!

Sponsored-by: Dartmouth College's Datalad project
2022-01-21 13:49:31 -04:00
Joey Hess
47084b8a1d
enable filter.annex.process in v9
This has tradeoffs, but is generally a win, and users for whom it makes
git add unacceptably slow can just disable it again.

It needed to happen in an upgrade, since there are git-annex versions
that do not support it, and using such an old version with a v8
repository with filter.annex.process set will cause bad behavior.
By enabling it in v9, it's guaranteed that any git-annex version that
can use the repository does support it. Although, this is not a perfect
protection against problems, since an old git-annex version, if it's
used with a v9 repository, will cause git add to try to run
git-annex filter-process, which will fail. But at least, the user is
unlikely to have an old git-annex in path if they are using a v9
repository, since it won't work in that repository.

Sponsored-by: Dartmouth College's Datalad project
2022-01-21 13:11:18 -04:00
Joey Hess
dc14221bc3
detect v10 upgrade while running
Capstone of the v10 upgrade process.

Tested with a git-annex drop in a v8 repo that had a local v8 remote.
Upgrading the repo to v10 (with --force) immediately caused it to notice
and switch over to v10 locking. Upgrading the remote also caused it to
switch over when operating on the remote.

The InodeCache makes this fairly efficient, just an added stat call per
lock of an object file. After the v10 upgrade, there is no more
overhead.

Sponsored-by: Dartmouth College's Datalad project
2022-01-21 12:56:38 -04:00
Joey Hess
76e365769e
fix crash after drop in v10
After cleaning up the lock file, the content directory is gone, so
freezing it failed.

Sponsored-by: Dartmouth College's Datalad project
2022-01-20 14:03:27 -04:00
Joey Hess
d0a5714409
continue to use v8 by default for now, unless upgraded
Since it's easy to keep supporting v8, using it for a while (eg a few
months) will give users time to upgrade git-annex installations, before
it upgrades their repository to v9.

This commit should be reverted once ready to start upgrading
repositories by default.

Sponsored-by: Dartmouth College's Datalad project
2022-01-20 11:56:05 -04:00
Joey Hess
0904eac8b4
automatic upgrade from v8 to v9
Sponsored-by: Dartmouth College's Datalad project
2022-01-20 11:39:36 -04:00
Joey Hess
cea6f6db92
v10 upgrade locking
The v10 upgrade should almost be safe now. What remains to be done is
notice when the v10 upgrade has occurred, while holding the shared lock,
and switch to using v10 lock files.

Sponsored-by: Dartmouth College's Datalad project
2022-01-20 11:33:14 -04:00
Joey Hess
9d5db6a09a
add upgrade.log
The upgrade from V9 uses this to avoid an automatic upgrade until 1 year
after the V9 update. It can also be used in future such situations.

Sponsored-by: Dartmouth College's Datalad project
2022-01-19 15:52:29 -04:00
Joey Hess
856ce5cf5f
split upgrade into v9 and v10
v10 will run 1 year after the upgrade to v9, to give time for any v8
processes to die. Until that point, the v10 upgrade will be tried by
every process but deferred, so added support for deferring upgrades.

The upgrade prevention lock file that will be used by v10 is not yet
implemented, so it does not yet defer.
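
The deferral described above can be sketched roughly as follows. This is
a minimal Python illustration, not the git-annex implementation. The
names UpgradeResult and try_v10_upgrade are hypothetical, and treating
"1 year" as 365 days is an assumption.

```python
from datetime import datetime, timedelta
from enum import Enum


class UpgradeResult(Enum):
    DONE = "done"
    DEFERRED = "deferred"


# Assumption: "1 year" is taken to be 365 days.
V10_WAIT = timedelta(days=365)


def try_v10_upgrade(v9_upgraded_at: datetime, now: datetime) -> UpgradeResult:
    """Every process attempts the upgrade; it is deferred until the wait is over."""
    if now - v9_upgraded_at < V10_WAIT:
        # Old v8 processes may still be running, so do nothing yet.
        return UpgradeResult.DEFERRED
    return UpgradeResult.DONE
```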

Sponsored-by: Dartmouth College's Datalad project
2022-01-19 13:09:33 -04:00
Joey Hess
4f7b8ce09d
fix spelling of upgradeable 2022-01-19 12:14:50 -04:00
Joey Hess
538d02d397
delete content lock file safely after shared lock
Upgrade the shared lock to an exclusive lock, and then delete the
lock file. If there is another process still holding the shared lock,
the first process will fail taking the exclusive lock, and not delete
the lock file; then the other process will later delete it.

Note that, in the time period where the exclusive lock is held, other
attempts to lock the content in place would fail. This is unlikely to be
a problem since it's a short period.

Other attempts to lock the content for removal would also fail in that
time period, but that's no different than a removal failing because
content is locked to prevent removal.
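
The upgrade-then-delete step can be sketched roughly as below. This is a
minimal Python illustration, not the git-annex code. It assumes
flock(2)-style locks through Python's fcntl.flock, which may differ from
the locking primitive git-annex uses, and release_shared_lock is a
hypothetical name.

```python
import fcntl
import os


def release_shared_lock(fd: int, lock_path: str) -> bool:
    """Drop a shared lock held on fd, deleting lock_path if nobody else holds it.

    Returns True when this caller deleted the lock file.
    """
    try:
        # Non-blocking upgrade to exclusive; fails while any other
        # open file description still holds a shared lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another holder remains; it will delete the file when it is done.
        os.close(fd)
        return False
    try:
        os.unlink(lock_path)
    finally:
        os.close(fd)  # closing the descriptor releases the exclusive lock
    return True
```

Whichever holder releases last wins the exclusive lock and removes the
file, so the lock file does not accumulate.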

Sponsored-by: Dartmouth College's Datalad project
2022-01-13 14:54:57 -04:00
Joey Hess
86e5ffe34a
clean empty object directories after deleting content lock file
When dropping content, this was already done after deleting the content
file, but the lock file prevents deleting the directories. So, try the
deletion again.

This does mean there's a small added overhead of a failed rmdir().

Sponsored-by: Dartmouth College's Datalad project
2022-01-13 14:22:37 -04:00
Joey Hess
e28d1d0325
fix logic that was not inverted after all
oops
2022-01-13 14:11:36 -04:00
Joey Hess
a3b6b3499b
delete content lock file safely on drop, keep after shared lock
This seems to be the best that can be done to avoid forever accumulating
the new content lock files, while being fully safe.

This is fixing code paths that have lingered unused since direct mode!
And direct mode seems to have been buggy in this area, since the content
lock file was deleted on unlock. But with a shared lock, there could be
another process that also had the lock file locked, and deleting it
invalidates that lock.

So, the lock file cannot be deleted after a shared lock. At least, not
without taking an exclusive lock first.. which I have not pursued yet but may.

After an exclusive lock, the lock file can be deleted. But there is
still a potential race, where the exclusive lock is held, and another
process gets the file open, just as the exclusive lock is dropped and
the lock file is deleted. That other process would be left with a file
handle it can take a shared lock of, but with no effect since the file
is deleted. Annex.Transfer also deletes lock files, and deals with this
same problem by using checkSaneLock, which is how I've dealt with it
here.

Sponsored-by: Dartmouth College's Datalad project
2022-01-13 13:58:58 -04:00
Joey Hess
3d7933f124
fix inverted logic
Now the content lock files are used in v9. However, I am not yet certain
they are correct. In particular, lockContentUsing deletes
the content lock file on unlock. But what if there's a shared lock
by another process? That seems like it would discard that lock too!

(Windows seems like it would not have the same problem, because as the
comment in there says, "Can't delete a locked file on Windows".
So if another process has a shared lock, removing it presumably fails.)

Sponsored-by: Dartmouth College's Datalad project
2022-01-13 13:58:31 -04:00
Joey Hess
731b1ecf87
v9 upgrade implemented
Seems to work ok. Unsure yet about the actual locking changes being
correct.

This is not the end of the story with upgrades, because it is unsafe for
this upgrade as implemented to run in a repository where an old
git-annex process is already running. The old process would use the old
locking method, and not notice files locked by the new, and this could
result in data loss. This problem will need to be dealt with before this
branch is suitable for merging.

Sponsored-by: Dartmouth College's Datalad project
2022-01-13 13:25:10 -04:00
Joey Hess
3936599885
move code from Command.Fsck
Sponsored-by: Dartmouth College's Datalad project
2022-01-13 13:24:50 -04:00
Joey Hess
3c042606c2
use separate lock from content file in v9
Windows has always used a separate lock file, but on unix, the content
file itself was locked, and in v9 that changes to also use a separate
lock file.

This needs to be tested more. Eg, what happens after dropping a file;
does the content lock file get deleted too, or linger around?

Sponsored-by: Dartmouth College's Datalad project
2022-01-11 17:03:14 -04:00
Joey Hess
43f9d967ff
shared repository content file permissions for v9
v9 will not need to write to annex content files in order to lock them,
so freezeContent removes the write bit in a shared repository, the same
as in any other repository.

checkContentWritePerm makes sure that the write perm is not set, which
will let git-annex fsck fix up the permissions. Upgrading to v9
will need to fix the permissions as well, but it seems likely there will
be situations where the user git-annex is running an upgrade as cannot,
so it will have to leave the write bit set. In such a case, git-annex
fsck can fix it later.

Sponsored-by: Dartmouth College's Datalad project
2022-01-11 16:50:50 -04:00
Joey Hess
ff570ad363
add v9 annex.version, not yet the default
This is the start of v9, but it's currently identical to v8, and v8 is
not upgraded to it. git-annex upgrade will upgrade to v9 with this
change.

Sponsored-by: Dartmouth College's Datalad project
2022-01-11 14:59:39 -04:00
Joey Hess
e95747a149
fix handling of corrupted data received from git remote
Recover from corrupted content being received from a git remote due eg to a
wire error, by deleting the temporary file when it fails to verify. This
prevents a retry from failing again.

Reversion introduced in version 8.20210903, when incremental verification
was added.

Only the git remote seems to be affected, although it is certainly
possible that other remotes could later have the same issue. This only
affects things passed to getViaTmp that return (False, UnVerified) due to
verification failing. As far as getViaTmp can tell, that could just as well
mean that the transfer failed in a way that would resume, so it cannot
delete the temp file itself. Remote.Git and P2P.Annex use getViaTmp internally,
while other remotes do not, which is why only it seems affected.

A better fix perhaps would be to improve the types of the callback
passed to getViaTmp, so that some other value could be used to indicate
the state where the transfer succeeded but verification failed.
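
That distinction can be sketched roughly as below. This is a minimal
Python illustration, not the git-annex code. fetch_verified and its
parameters are hypothetical names, and checking a SHA-256 digest stands
in for whatever verification the key's backend would do.

```python
import hashlib
import os
from typing import Callable


def fetch_verified(tmp_path: str, dest_path: str, expected_sha256: str,
                   transfer: Callable[[str], bool]) -> bool:
    """Run transfer into tmp_path, verify it, then move it into place.

    transfer returns True only when the whole transfer completed.
    """
    if not transfer(tmp_path):
        # Incomplete transfer: keep the temp file so a retry can resume.
        return False
    with open(tmp_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_sha256:
        # Transfer completed but the content is corrupt; delete the temp
        # file so a retry starts over instead of failing again.
        os.unlink(tmp_path)
        return False
    os.replace(tmp_path, dest_path)
    return True
```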

Sponsored-by: Boyd Stephen Smith Jr.
2022-01-07 13:25:33 -04:00
Joey Hess
21c0d5be6e
comment 2022-01-07 12:27:19 -04:00
Joey Hess
e416635021
renameremote: Better handling of case where there are multiple special remotes with a name
Instead of renaming one at random, error out and ask that a uuid be
specified.

Sponsored-by: Brett Eisenberg on Patreon
2022-01-05 15:24:02 -04:00
Joey Hess
58afb00f6e
enableremote: Better handling of the unusual case where multiple special remotes have been initialized with the same name
Before it would pick one at random, though preferring ones that were not
dead over dead ones.

Now, if one is dead and the other not, it will use the non-dead one. But if
both are not dead, or both dead, it will error out, suggesting the user
clarify what they want to enable.
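
The selection rule can be sketched roughly as below. This is a minimal
Python illustration, not the git-annex code. SpecialRemote,
AmbiguousRemote and pick_remote are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpecialRemote:
    uuid: str
    name: str
    dead: bool


class AmbiguousRemote(Exception):
    """More than one remote matches and none can be preferred."""


def pick_remote(name: str, remotes: List[SpecialRemote]) -> SpecialRemote:
    matches = [r for r in remotes if r.name == name]
    if not matches:
        raise LookupError(f"no special remote named {name!r}")
    if len(matches) == 1:
        return matches[0]
    # Several remotes share the name: prefer the only non-dead one, if any.
    live = [r for r in matches if not r.dead]
    if len(live) == 1:
        return live[0]
    # All dead, or more than one not dead: refuse to guess.
    uuids = ", ".join(r.uuid for r in matches)
    raise AmbiguousRemote(f"multiple remotes named {name!r}: {uuids}")
```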

Sponsored-by: Luke Shumaker on Patreon
2022-01-05 15:12:11 -04:00
Joey Hess
b1d719f9d2
handle transitions with read-only unmerged git-annex branches
Capstone to this feature. Any transitions that have been performed on an
unmerged remote ref but not on the local git-annex branch, or vice-versa
have to be applied on the fly when reading files.

Sponsored-by: Dartmouth College's Datalad project
2021-12-28 13:23:32 -04:00