Commit graph

769 commits

Author SHA1 Message Date
Joey Hess
7c7c9912c1
migrate --update gets keys
The git log is outputting the diff, but this only looks at the new
files. When we have a new file, we can get the old filename by just
replacing "new" with "old". And then use branchFileRef to refer to it
allows catting the old key.

While this does have to skip past the old files in the diff, it's still
faster than calling git diff separately.

Sponsored-by: Nicholas Golder-Manning on Patreon
2023-12-07 17:25:56 -04:00
Joey Hess
f1ce15036f
started migrate --update
This is most of the way there, but not quite working.

The layout of migrate.tree/ needs to be changed to follow this approach.
git log will list all the files in tree order, so the new layout needs
to alternate old and new keys. Can that be done? git may not document
tree order, or may not preserve it here.

Alternatively, change to using git log --format=raw and extract
the tree header from that, then use
git diff --raw $tree:migrate.tree/old $tree:migrate.tree/new
That will be a little more expensive, but only when there are lots of
migrations.

Sponsored-by: Joshua Antonishen on Patreon
2023-12-07 15:50:52 -04:00
Joey Hess
d06aee7ce0
make commitMigration interuption safe
Fixed inversion of control issue, so the tree is recorded
in streamLogFile finalizer.

Sponsored-by: Leon Schuermann on Patreon
2023-12-06 16:29:58 -04:00
Joey Hess
0bd8b17b59
log migration trees to git-annex branch
This will allow distributed migration: Start a migration in one clone of
a repo, and then update other clones.

commitMigration is a bit of a bear.. There is some inversion of control
that needs some TMVars. Also streamLogFile's finalizer does not handle
recording the trees, so an interrupt at just the wrong time can cause
migration.log to be emptied but the git-annex branch not updated.

Sponsored-by: Graham Spencer on Patreon
2023-12-06 15:40:03 -04:00
Joey Hess
561c036664
split out generic git log parser
Sponsored-By: Jack Hill on Patreon
2023-11-10 15:40:03 -04:00
Joey Hess
4e35067325
windows hook scripts newlines without CR
Windows: When git-annex init is installing hook scripts, it will
avoid ending lines with CR for portability.

Existing hook scripts that do have CR line endings will not be changed.
While it would be possible to have git-annex init upgrade them, users would
need to know to use that command to do that, and it would add complexity
that does not seem warranted for the portability benefit alone.

Sponsored-by: Luke T. Shumaker on Patreon
2023-11-02 13:37:04 -04:00
Joey Hess
e4ef688b98
RawFilePath conversion
May slightly speed up startup.
2023-10-26 13:58:49 -04:00
Joey Hess
ef7c867238
fix some build warnings from ghc 9.4.6
It now notices that a RepoLocation may not be Local, in which case
pattern matching on Local wouldn't do.
2023-09-21 13:40:22 -04:00
Joey Hess
e03e907705
fix some build warnings from ghc 9.4.6
It now notices that a RepoLocation may not be Local, in which case
pattern matching on Local wouldn't do.

However, in these cases, I think it always is a Local. In particular,
Git.Config.read is only run on local repos and upgrades LocalUnknown to
Local.
2023-09-21 12:11:01 -04:00
Joey Hess
baf8e4f6ed
Override safe.bareRepository for git remotes
Fix using git remotes that are bare when git is configured
with safe.bareRepository = explicit

Sponsored-by: Dartmouth College's DANDI project
2023-09-07 14:56:26 -04:00
Joey Hess
cbfd214993
set safe.directory when getting config for git-annex-shell or git remotes
Fix more breakage caused by git's fix for CVE-2022-24765, this time
involving a remote (either local or ssh) that is a repository not owned by
the current user.

Sponsored-by: Dartmouth College's DANDI project
2023-09-07 14:40:50 -04:00
Joey Hess
10b5f79e2d
fix empty tree import when directory does not exist
Fix behavior when importing a tree from a directory remote when the
directory does not exist. An empty tree was imported, rather than the
import failing. Merging that tree would delete every file in the
branch, if those files had been exported to the directory before.

The problem was that dirContentsRecursive returned [] when the directory
did not exist. Better for it to throw an exception. But in commit
74f0d67aa3 back in 2012, I made it never
theow exceptions, because exceptions throw inside unsafeInterleaveIO become
untrappable when the list is being traversed.

So, changed it to list the contents of the directory before entering
unsafeInterleaveIO. So exceptions are thrown for the directory. But still
not if it's unable to list the contents of a subdirectory. That's less of a
problem, because the subdirectory does exist (or if not, it got removed
after being listed, and it's ok to not include it in the list). A
subdirectory that has permissions that don't allow listing it will have its
contents omitted from the list still.

(Might be better to have it return a type that includes indications of
errors listing contents of subdirectories?)

The rest of the changes are making callers of dirContentsRecursive
use emptyWhenDoesNotExist when they relied on the behavior of it not
throwing an exception when the directory does not exist. Note that
it's possible some callers of dirContentsRecursive that used to ignore
permissions problems listing a directory will now start throwing exceptions
on them.

The fix to the directory special remote consisted of not making its
call in listImportableContentsM use emptyWhenDoesNotExist. So it will
throw an exception as desired.

Sponsored-by: Joshua Antonishen on Patreon
2023-08-15 12:57:41 -04:00
Joey Hess
be028f10e5
split out Utility.Url.Parse
This is mostly for git-repair which can't include all of Utility.Url
without adding many dependencies that are not really necessary.
2023-08-14 12:28:10 -04:00
Joey Hess
68c9b08faf
fix build with unix-2.8.0
Changed the parameters to openFd. So needed to add a small wrapper
library to keep supporting older versions as well.
2023-08-01 18:41:27 -04:00
Joey Hess
a05bc6a314
Fix breakage when git is configured with safe.bareRepository = explicit
Running git config --list inside .git then fails, so better to only
do that when --git-dir was specified explicitly. Otherwise, when the
repository is not bare, run the command inside the working tree.

Also make init detect when the uuid it just set cannot be read and fail
with an error, in case git changes something that breaks this later.

I still don't actually understand why git-annex add/assist -J2 was
affected but -J1 was not. But I did show that it was skipping writing to
the location log, because the uuid was NoUUID.

Sponsored-by: Graham Spencer on Patreon
2023-07-05 14:43:14 -04:00
Joey Hess
c6acf574c7
implement importChanges optimisaton (not used yet)
For simplicity, I've not tried to make it handle History yet, so when
there is a history, a full import will still be done. Probably the right
way to handle history is to first diff from the current tree to the last
imported tree. Then, diff from the current tree to each of the
historical trees, and recurse through the history diffing from child tree
to parent tree.

I don't think that will need a record of the previously imported
historical trees, and so Logs.Import doesn't store them. Although I did
leave room for future expansion in that log just in case.

Next step will be to change importTree to importChanges and modify
recordImportTree et all to handle it, by using adjustTree.

Sponsored-by: Brett Eisenberg on Patreon
2023-05-31 16:01:34 -04:00
Joey Hess
7298123520
build git trees using ContentIdentifier to speed up import
This gets the trees built, but it does not use them. Next step will be
to remember the tree for next time an import is done, and diff between
old and new trees to find the files that have changed.

Added --missing to the mktree parameters. That only disables a check, so
it's ok to do everywhere mktree is used. It probably also speeds up
mktree to disable the check.

Note that git fsck does not complain about the resulting tree objects
that point to shas that are not in the repository. Even with --strict.

A quick benchmark, importing 10000 files, this slowed it down
from 2:04.06 to 2:04.28. So it will more than pay for itself.

Sponsored-by: Luke Shumaker on Patreon
2023-05-31 12:46:54 -04:00
Joey Hess
5070087a63
repair: Fix handling of git ref names on Windows
Sponsored-by: Kevin Mueller on Patreon
2023-05-30 16:09:13 -04:00
Joey Hess
0da0e2efcc
add git config debugging
(and process cwd debugging)

Sponsored-by: Dartmouth College's Datalad project
2023-05-15 15:35:29 -04:00
Joey Hess
67f8268b3f
Support core.sharedRepository=0xxx at long last
Sponsored-by: Brett Eisenberg on Patreon
2023-04-26 17:03:29 -04:00
Joey Hess
7af75a59be
Warn about unsupported core.sharedRepository=0xxx when set
This spams the user with a lot of messages, but it seems like busywork to
avoid that and only warn once, since this warning will go away when it gets
implemented.

Also fix parsing of the octal value.

Sponsored-by: Kevin Mueller on Patreon
2023-04-26 13:25:29 -04:00
Joey Hess
fe5e586b72
rename Git.Filename to Git.Quote 2023-04-12 17:22:03 -04:00
Joey Hess
4a5f18a8ec
IsString StringContainingQuotedPath optimisation
This causes an encodeBS thunk, and the first evaluation of the string
forces it. From then on, further uses operate on a ByteString. This
avoids converting repeatedly.
2023-04-11 15:29:04 -04:00
Joey Hess
8b6c7bdbcc
filter out control characters in all other Messages
This does, as a side effect, make long notes in json output not
be indented. The indentation is only needed to offset them
underneath the display of the file they apply to, so that's ok.

Sponsored-by: Brock Spratlen on Patreon
2023-04-11 12:58:01 -04:00
Joey Hess
007e302637
use safeOutput when quoting UnquotedString
UnquotedString does not need to be quoted, but still it's possible
it contains something attacker-controlled, which could have an
escape sequence or control character in it. This is a convenient
place to filter out such things, since quoting alrready handles
those in filenames.

Sponsored-by: Luke Shumaker on Patreon
2023-04-10 14:46:17 -04:00
Joey Hess
cd544e548b
filter out control characters in error messages
giveup changed to filter out control characters. (It is too low level to
make it use StringContainingQuotedPath.)

error still does not, but it should only be used for internal errors,
where the message is not attacker-controlled.

Changed a lot of existing error to giveup when it is not strictly an
internal error.

Of course, other exceptions can still be thrown, either by code in
git-annex, or a library, that include some attacker-controlled value.
This does not guard against those.

Sponsored-by: Noam Kremen on Patreon
2023-04-10 13:50:51 -04:00
Joey Hess
063c00e4f7
git style filename quoting for giveup
When the filenames are part of the git repository or other files that
might have attacker-controlled names, quote them in error messages.

This is fairly complete, although I didn't do the one in
Utility.DirWatcher.INotify.hs because that doesn't have access to
Git.Filename or Annex.

But it's also quite possible I missed some. And also while scanning for
these, I found giveup used with other things that could be attacker
controlled to contain control characters (eg Keys). So, I'm thinking
it would also be good for giveup to just filter out control characters.
This commit is then not the only line of defence, but just good
formatting when git-annex displays a filename in an error message.

Sponsored-by: Kevin Mueller on Patreon
2023-04-10 12:56:45 -04:00
Joey Hess
1c21ce17d4
avoid unncessary nested lists for combineing StringContainingQuotedPath 2023-04-09 12:53:13 -04:00
Joey Hess
2ba1559a8e
git style quoting for ActionItemOther
Added StringContainingQuotedPath, which is used for ActionItemOther.

In the process, checked every ActionItemOther for those containing
filenames, and made them use quoting.

Sponsored-by: Graham Spencer on Patreon
2023-04-08 16:30:01 -04:00
Joey Hess
d689a5b338
git style filename quoting controlled by core.quotePath
This is by no means complete, but escaping filenames in actionItemDesc does
cover most commands.

Note that for ActionItemBranchFilePath, the value is branch:file, and I
choose to only quote the file part (if necessary). I considered quoting the
whole thing. But, branch names cannot contain control characters, and while
they can contain unicode, git coes not quote unicode when displaying branch
names. So, it would be surprising for git-annex to quote unicode in a
branch name.

The find command is the most obvious command that still needs to be
dealt with. There are probably other places that filenames also get
displayed, eg embedded in error messages.

Some other commands use ActionItemOther with a filename, I think that
ActionItemOther should either be pre-sanitized, or should explicitly not
be used for filenames, so that needs more work.

When --json is used, unicode does not get escaped, but control
characters were already escaped in json.

(Key escaping may turn out to be needed, but I'm ignoring that for now.)

Sponsored-by: unqueued on Patreon
2023-04-08 14:52:26 -04:00
Joey Hess
c5b017e55b
full emulation of git filename escaping
Not yet used, but the plan is to make git-annex use this when displaying
filenames similar to how git does.

Sponsored-by: Lawrence Brogan on Patreon
2023-04-07 17:17:31 -04:00
Joey Hess
d9b6be7782
convert encode_c to ByteString
This turns out to be possible after all, because the old one decomposed
a unicode Char to multiple Word8s and encoded those. It should be faster
in some places, particularly in Git.Filename.encodeAlways.

The old version encoded all unicode by default as well as ascii control
characters and also '"'. The new one only encodes ascii control
characters by default.

That old behavior was visible in Utility.Format.format, which did escape
'"' when used in eg git-annex find --format='${escaped_file}\n'
So made sure to keep that working the same. Although the man page only
says it will escape "unusual" characters, so it might be able to be
changed.

Git.Filename.encodeAlways also needs to escape '"' ; that was the
original reason that was escaped.

Types.Transferrer I judge is ok to not escape '"', because the escaped
value is sent in a line-based protocol, which is decoded at the other
end by decode_c. So old git-annex and new will be fine whether that is
escaped or not, the result will be the same.

Note that when asked to escape a double quote, it is escaped to \"
rather than to \042. That's the same behavior as git has. It's
perhaps somehow more of a special case than it needs to be.

Sponsored-by: k0ld on Patreon
2023-04-07 17:10:49 -04:00
Joey Hess
371d4f8183
decode_c converted to ByteString
This speeds up a few things, notably CmdLine.Seek using Git.Filename
which uses decode_c and this avoids a conversion to String and back,
and probably the ByteString implementation of decode_c is also faster
for simple cases at least than the string version.

encode_c cannot be converted to ByteString (or if it did, it would have
to convert right back to String in order to handle unicode).

Sponsored-by: Brock Spratlen on Patreon
2023-04-07 14:44:19 -04:00
Joey Hess
cd076cd085
Windows: Support urls like "file:///c:/path"
That is a legal url, but parseUrl parses it to "/c:/path"
which is not a valid path on Windows. So as a workaround, use
parseURIPortable everywhere, which removes the leading slash when
run on windows.

Note that if an url is parsed like this and then serialized back
to a string, it will be different from the input. Which could
potentially be a problem, but is probably not in practice.

An alternative way to do it would be to have an uriPathPortable
that fixes up the path after parsing. But it would be harder to
make sure that is used everywhere, since uriPath is also used
when constructing an URI.

It's also worth noting that System.FilePath.normalize "/c:/path"
yields "c:/path". The reason I didn't use it is that it also
may change "/" to "\" in the path and I wanted to keep the url
changes minimal. Also noticed that convertToWindowsNativeNamespace
handles "/c:/path" the same as "c:/path".

Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
2023-03-27 13:38:02 -04:00
Joey Hess
a0badc5069
sync: Fix parsing of gcrypt::rsync:// urls that use a relative path
Such an url is not valid; parseURI will fail on it. But git-annex doesn't
actually need to parse the url, because all it needs to do to support
syncing with it is know that it's not a local path, and use git pull and
push.

(Note that there is no good reason for the user to use such an url. An
absolute url is valid and I patched git-remote-gcrypt to support them
years ago. Still, users gonna do anything that tools allow, and
git-remote-gcrypt still supports them.)

Sponsored-by: Jack Hill on Patreon
2023-03-23 15:20:00 -04:00
Joey Hess
e822df2a09
fix build warnings on windows 2023-03-21 18:41:23 -04:00
Yaroslav Halchenko
84b0a3707a
Apply codespell -w throughout 2023-03-17 15:14:58 -04:00
Yaroslav Halchenko
e018ae1125
Fix ambigous typos 2023-03-17 15:14:47 -04:00
Joey Hess
a6bebe3c0f
make hashFile support paths with newlines
git hash-object --stdin-paths is a newline protocol so it cannot
support them. It would help to not use absPath, when the problem
is that the repository itself is in a path with a newline. But,
there's a reason it used absPath, which is that
git hash-object --stdin-paths actually chdirs to the top of the
repository on startup! That is not documented, and I think is a bug
in git.

I considered making the path relative to the top of the repo, but
then what if this is a git bug and gets fixed? git-annex would break
horribly.

So instead, keep the absPath, but when the path contains a newline,
fall back to running git hash-object once per file, which avoids
the problem with newlines and --stdin-paths. It will be slower,
but this is an edge case. (Similar slow code paths are already used
elsewhere when dealing with filenames with newlines and other parts
of git that use line-based protocols.)

Sponsored-by: Dartmouth College's Datalad project
2023-03-13 13:43:40 -04:00
Joey Hess
54ad1b4cfb
Windows: Support long filenames in more (possibly all) of the code
Works around this bug in unix-compat:
https://github.com/jacobstanley/unix-compat/issues/56
getFileStatus and other FilePath using functions in unix-compat do not do
UNC conversion on Windows.

Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the
necessary conversion on windows to support long filenames.

Audited all imports of System.PosixCompat.Files to make sure that no
functions that operate on FilePath were imported from it. Instead, use
the equvilants from Utility.RawFilePath. In particular the
re-export of that module in Common had to be removed, which led to lots
of other changes throughout the code.

The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig
make Utility.Directory not be needed to build setup. And so let it use
Utility.RawFilePath, which depends on unix, which cannot be in
setup-depends.

Sponsored-by: Dartmouth College's Datalad project
2023-03-01 15:55:58 -04:00
Joey Hess
f09e299156
rawfilepath conversion 2023-02-27 15:06:32 -04:00
Joey Hess
672258c8f4
Revert "revert recent bug fix temporarily for release"
This reverts commit 16f1e24665.
2023-02-14 14:11:23 -04:00
Joey Hess
16f1e24665
revert recent bug fix temporarily for release
Decided this bug is not severe enough to delay the release until
tomorrow, so this will be re-applied after the release.
2023-02-14 14:06:29 -04:00
Joey Hess
c1ef4a7481
Avoid Git.Config.updateLocation adding "/.git" to the end of the repo
path to a bare repo when git config is not allowed to list the configs
due to the CVE-2022-24765 fix.

That resulted in a confusing error message, and prevented the nice
message that explains how to mark the repo as safe to use.

Made isBare a tristate so that the case where core.bare is not returned can
be handled.

The handling in updateLocation is to check if the directory
contains config and objects and if so assume it's bare.
Note that if that heuristic is somehow wrong, it would construct a repo
that thinks it's bare but is not. That could cause follow-on problems,
but since git-annex then checks checkRepoConfigInaccessible, and skips
using the repo anyway, a wrong guess should not be a problem.

Sponsored-by: Luke Shumaker on Patreon
2023-02-14 14:00:36 -04:00
Joey Hess
c1f4d536b2
fix comment 2023-02-14 13:28:02 -04:00
Joey Hess
49ee07f93d
fix flush of a closed file handle
Avoids displaying warning about git-annex restage needing to be run in
situations where it does not.

Closing a handle flushes it anyway, so no need for an explict flush. The
handle does get closed twice, but that's fine, the second one does nothing.

Sponsored-by: Dartmouth College's DANDI project
2022-09-30 14:02:31 -04:00
Joey Hess
bfa451fc4e
pass --git-dir when reading git config when it was specified explicitly
Let GIT_DIR and --git-dir override git's protection against operating in a
repository owned by another user.

This is the same behavior other git commands have.

Sponsored-by: Jarkko Kniivilä on Patreon
2022-09-26 14:38:34 -04:00
Joey Hess
6a3bd283b8
add restage log
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.

Currently, this lets a git-annex get/drop command be interrupted and
then re-ran, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.

Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes, when eg
dropping a large number of files, currently the git queue flushes
periodically, and so it restages incrementally, rather than all at the
end.

In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.

Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok, it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.

Sponsored-by: Dartmouth College's DANDI project
2022-09-23 15:47:24 -04:00
Joey Hess
9c76e503cf
generalize refreshIndex to MonadIO
Sponsored-by: Dartmouth College's DANDI project
2022-09-23 14:28:52 -04:00
Joey Hess
8d26fdd670
skip checkRepoConfigInaccessible when git directory specified explicitly
Fix a reversion that prevented git-annex from working in a repository when
--git-dir or GIT_DIR is specified to relocate the git directory to
somewhere else. (Introduced in version 10.20220525)

checkRepoConfigInaccessible could still run git config --list, just passing
--git-dir. It seems not necessary, because I know that passing --git-dir
bypasses git's check for repo ownership. I suppose it might be that git
eventually changes to check something about the ownership of the working
tree, so passing --git-dir without --work-tree would still be worth doing.
But for now this is the simple fix.

Sponsored-by: Nicholas Golder-Manning on Patreon
2022-09-20 14:52:43 -04:00