This reverts commit cee12f6a2f.
This commit broke git-annex init run in a repo that was cloned from a
repo with an adjusted branch checked out.
The problem is that findAdjustingCommit was not able to identify the
commit that created the adjusted branch. It seems that there is an extra
"\n" at the end of the commit message that it does not expect.
Since backwards compatability needs to be maintained, cannot just make
findAdjustingCommit accept it with the "\n". Will have to instead
have one commitTree variant that uses the old method, and use it for
adjusted branch committing.
sync, assist, import: Allow -m option to be specified multiple times, to
provide additional paragraphs for the commit message.
The option parser didn't allow multiple -m before, so there is no risk of
behavior change breaking something that was for some reason using multiple
-m already.
Pass through to git commands, so that the method used to assemble the
paragrahs is whatever git does. Which might conceivably change in the
future.
Note that git commit-tree has supported -m since git 1.7.7. commitTree
was probably not using it since it predates that version. Since the
configure script prevents building git-annex with git older than 2.1,
there is no risk that it's not supported now.
Sponsored-by: Nicholas Golder-Manning on Patreon
The old code traversed the list of addtreeitems once per subdirectory in
the tree, so could get quite slow. Converting to Map lookups sped it up
significantly.
In my test case, git-annex import used to take about 2 minutes, when
calling adjustTree to add back excluded files to the imported tree. This
dropped it down to 6 seconds. Of which 4 seconds are the actual
enumeration of the contents of the remote, so really only 2 seconds for
this.
The path prefix map is a bit suboptimal memory-wise, since items get
stored in the map once per subdirectory on the path to the item. It
would perhaps be better to use a tree data structure.
Also it's suboptimal memory-wise that it builds two maps, as well
as retaining a reference to addtreeitems. I could not see a way around
that though.
This is a fixed version of commit 2c86651180.
It fixes a test suite reversion.
Sponsored-by: Jack Hill on Patreon
The old code traversed the list of addtreeitems once per subdirectory in
the tree, so could get quite slow. Converting to Map lookups sped it up
significantly.
In my test case, git-annex import used to take about 2 minutes, when
calling adjustTree to add back excluded files to the imported tree. This
dropped it down to 6 seconds. Of which 4 seconds are the actual
enumeration of the contents of the remote, so really only 2 seconds for
this.
The path prefix map is a bit suboptimal memory-wise, since items get
stored in the map once per subdirectory on the path to the item. It
would perhaps be better to use a tree data structure.
Also it's suboptimal memory-wise that it builds two maps, as well
as retaining a reference to addtreeitems. I could not see a way around
that though.
Sponsored-by: Luke T. Shumaker on Patreon
The git log is outputting the diff, but this only looks at the new
files. When we have a new file, we can get the old filename by just
replacing "new" with "old". And then use branchFileRef to refer to it
allows catting the old key.
While this does have to skip past the old files in the diff, it's still
faster than calling git diff separately.
Sponsored-by: Nicholas Golder-Manning on Patreon
This is most of the way there, but not quite working.
The layout of migrate.tree/ needs to be changed to follow this approach.
git log will list all the files in tree order, so the new layout needs
to alternate old and new keys. Can that be done? git may not document
tree order, or may not preserve it here.
Alternatively, change to using git log --format=raw and extract
the tree header from that, then use
git diff --raw $tree:migrate.tree/old $tree:migrate.tree/new
That will be a little more expensive, but only when there are lots of
migrations.
Sponsored-by: Joshua Antonishen on Patreon
This will allow distributed migration: Start a migration in one clone of
a repo, and then update other clones.
commitMigration is a bit of a bear.. There is some inversion of control
that needs some TMVars. Also streamLogFile's finalizer does not handle
recording the trees, so an interrupt at just the wrong time can cause
migration.log to be emptied but the git-annex branch not updated.
Sponsored-by: Graham Spencer on Patreon
Windows: When git-annex init is installing hook scripts, it will
avoid ending lines with CR for portability.
Existing hook scripts that do have CR line endings will not be changed.
While it would be possible to have git-annex init upgrade them, users would
need to know to use that command to do that, and it would add complexity
that does not seem warranted for the portability benefit alone.
Sponsored-by: Luke T. Shumaker on Patreon
It now notices that a RepoLocation may not be Local, in which case
pattern matching on Local wouldn't do.
However, in these cases, I think it always is a Local. In particular,
Git.Config.read is only run on local repos and upgrades LocalUnknown to
Local.
Fix more breakage caused by git's fix for CVE-2022-24765, this time
involving a remote (either local or ssh) that is a repository not owned by
the current user.
Sponsored-by: Dartmouth College's DANDI project
Fix behavior when importing a tree from a directory remote when the
directory does not exist. An empty tree was imported, rather than the
import failing. Merging that tree would delete every file in the
branch, if those files had been exported to the directory before.
The problem was that dirContentsRecursive returned [] when the directory
did not exist. Better for it to throw an exception. But in commit
74f0d67aa3 back in 2012, I made it never
theow exceptions, because exceptions throw inside unsafeInterleaveIO become
untrappable when the list is being traversed.
So, changed it to list the contents of the directory before entering
unsafeInterleaveIO. So exceptions are thrown for the directory. But still
not if it's unable to list the contents of a subdirectory. That's less of a
problem, because the subdirectory does exist (or if not, it got removed
after being listed, and it's ok to not include it in the list). A
subdirectory that has permissions that don't allow listing it will have its
contents omitted from the list still.
(Might be better to have it return a type that includes indications of
errors listing contents of subdirectories?)
The rest of the changes are making callers of dirContentsRecursive
use emptyWhenDoesNotExist when they relied on the behavior of it not
throwing an exception when the directory does not exist. Note that
it's possible some callers of dirContentsRecursive that used to ignore
permissions problems listing a directory will now start throwing exceptions
on them.
The fix to the directory special remote consisted of not making its
call in listImportableContentsM use emptyWhenDoesNotExist. So it will
throw an exception as desired.
Sponsored-by: Joshua Antonishen on Patreon
Running git config --list inside .git then fails, so better to only
do that when --git-dir was specified explicitly. Otherwise, when the
repository is not bare, run the command inside the working tree.
Also make init detect when the uuid it just set cannot be read and fail
with an error, in case git changes something that breaks this later.
I still don't actually understand why git-annex add/assist -J2 was
affected but -J1 was not. But I did show that it was skipping writing to
the location log, because the uuid was NoUUID.
Sponsored-by: Graham Spencer on Patreon
For simplicity, I've not tried to make it handle History yet, so when
there is a history, a full import will still be done. Probably the right
way to handle history is to first diff from the current tree to the last
imported tree. Then, diff from the current tree to each of the
historical trees, and recurse through the history diffing from child tree
to parent tree.
I don't think that will need a record of the previously imported
historical trees, and so Logs.Import doesn't store them. Although I did
leave room for future expansion in that log just in case.
Next step will be to change importTree to importChanges and modify
recordImportTree et all to handle it, by using adjustTree.
Sponsored-by: Brett Eisenberg on Patreon
This gets the trees built, but it does not use them. Next step will be
to remember the tree for next time an import is done, and diff between
old and new trees to find the files that have changed.
Added --missing to the mktree parameters. That only disables a check, so
it's ok to do everywhere mktree is used. It probably also speeds up
mktree to disable the check.
Note that git fsck does not complain about the resulting tree objects
that point to shas that are not in the repository. Even with --strict.
A quick benchmark, importing 10000 files, this slowed it down
from 2:04.06 to 2:04.28. So it will more than pay for itself.
Sponsored-by: Luke Shumaker on Patreon
This spams the user with a lot of messages, but it seems like busywork to
avoid that and only warn once, since this warning will go away when it gets
implemented.
Also fix parsing of the octal value.
Sponsored-by: Kevin Mueller on Patreon
This causes an encodeBS thunk, and the first evaluation of the string
forces it. From then on, further uses operate on a ByteString. This
avoids converting repeatedly.
This does, as a side effect, make long notes in json output not
be indented. The indentation is only needed to offset them
underneath the display of the file they apply to, so that's ok.
Sponsored-by: Brock Spratlen on Patreon
UnquotedString does not need to be quoted, but still it's possible
it contains something attacker-controlled, which could have an
escape sequence or control character in it. This is a convenient
place to filter out such things, since quoting alrready handles
those in filenames.
Sponsored-by: Luke Shumaker on Patreon
giveup changed to filter out control characters. (It is too low level to
make it use StringContainingQuotedPath.)
error still does not, but it should only be used for internal errors,
where the message is not attacker-controlled.
Changed a lot of existing error to giveup when it is not strictly an
internal error.
Of course, other exceptions can still be thrown, either by code in
git-annex, or a library, that include some attacker-controlled value.
This does not guard against those.
Sponsored-by: Noam Kremen on Patreon
When the filenames are part of the git repository or other files that
might have attacker-controlled names, quote them in error messages.
This is fairly complete, although I didn't do the one in
Utility.DirWatcher.INotify.hs because that doesn't have access to
Git.Filename or Annex.
But it's also quite possible I missed some. And also while scanning for
these, I found giveup used with other things that could be attacker
controlled to contain control characters (eg Keys). So, I'm thinking
it would also be good for giveup to just filter out control characters.
This commit is then not the only line of defence, but just good
formatting when git-annex displays a filename in an error message.
Sponsored-by: Kevin Mueller on Patreon
Added StringContainingQuotedPath, which is used for ActionItemOther.
In the process, checked every ActionItemOther for those containing
filenames, and made them use quoting.
Sponsored-by: Graham Spencer on Patreon
This is by no means complete, but escaping filenames in actionItemDesc does
cover most commands.
Note that for ActionItemBranchFilePath, the value is branch:file, and I
choose to only quote the file part (if necessary). I considered quoting the
whole thing. But, branch names cannot contain control characters, and while
they can contain unicode, git coes not quote unicode when displaying branch
names. So, it would be surprising for git-annex to quote unicode in a
branch name.
The find command is the most obvious command that still needs to be
dealt with. There are probably other places that filenames also get
displayed, eg embedded in error messages.
Some other commands use ActionItemOther with a filename, I think that
ActionItemOther should either be pre-sanitized, or should explicitly not
be used for filenames, so that needs more work.
When --json is used, unicode does not get escaped, but control
characters were already escaped in json.
(Key escaping may turn out to be needed, but I'm ignoring that for now.)
Sponsored-by: unqueued on Patreon
This turns out to be possible after all, because the old one decomposed
a unicode Char to multiple Word8s and encoded those. It should be faster
in some places, particularly in Git.Filename.encodeAlways.
The old version encoded all unicode by default as well as ascii control
characters and also '"'. The new one only encodes ascii control
characters by default.
That old behavior was visible in Utility.Format.format, which did escape
'"' when used in eg git-annex find --format='${escaped_file}\n'
So made sure to keep that working the same. Although the man page only
says it will escape "unusual" characters, so it might be able to be
changed.
Git.Filename.encodeAlways also needs to escape '"' ; that was the
original reason that was escaped.
Types.Transferrer I judge is ok to not escape '"', because the escaped
value is sent in a line-based protocol, which is decoded at the other
end by decode_c. So old git-annex and new will be fine whether that is
escaped or not, the result will be the same.
Note that when asked to escape a double quote, it is escaped to \"
rather than to \042. That's the same behavior as git has. It's
perhaps somehow more of a special case than it needs to be.
Sponsored-by: k0ld on Patreon
This speeds up a few things, notably CmdLine.Seek using Git.Filename
which uses decode_c and this avoids a conversion to String and back,
and probably the ByteString implementation of decode_c is also faster
for simple cases at least than the string version.
encode_c cannot be converted to ByteString (or if it did, it would have
to convert right back to String in order to handle unicode).
Sponsored-by: Brock Spratlen on Patreon
That is a legal url, but parseUrl parses it to "/c:/path"
which is not a valid path on Windows. So as a workaround, use
parseURIPortable everywhere, which removes the leading slash when
run on windows.
Note that if an url is parsed like this and then serialized back
to a string, it will be different from the input. Which could
potentially be a problem, but is probably not in practice.
An alternative way to do it would be to have an uriPathPortable
that fixes up the path after parsing. But it would be harder to
make sure that is used everywhere, since uriPath is also used
when constructing an URI.
It's also worth noting that System.FilePath.normalize "/c:/path"
yields "c:/path". The reason I didn't use it is that it also
may change "/" to "\" in the path and I wanted to keep the url
changes minimal. Also noticed that convertToWindowsNativeNamespace
handles "/c:/path" the same as "c:/path".
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Such an url is not valid; parseURI will fail on it. But git-annex doesn't
actually need to parse the url, because all it needs to do to support
syncing with it is know that it's not a local path, and use git pull and
push.
(Note that there is no good reason for the user to use such an url. An
absolute url is valid and I patched git-remote-gcrypt to support them
years ago. Still, users gonna do anything that tools allow, and
git-remote-gcrypt still supports them.)
Sponsored-by: Jack Hill on Patreon
git hash-object --stdin-paths is a newline protocol so it cannot
support them. It would help to not use absPath, when the problem
is that the repository itself is in a path with a newline. But,
there's a reason it used absPath, which is that
git hash-object --stdin-paths actually chdirs to the top of the
repository on startup! That is not documented, and I think is a bug
in git.
I considered making the path relative to the top of the repo, but
then what if this is a git bug and gets fixed? git-annex would break
horribly.
So instead, keep the absPath, but when the path contains a newline,
fall back to running git hash-object once per file, which avoids
the problem with newlines and --stdin-paths. It will be slower,
but this is an edge case. (Similar slow code paths are already used
elsewhere when dealing with filenames with newlines and other parts
of git that use line-based protocols.)
Sponsored-by: Dartmouth College's Datalad project
Works around this bug in unix-compat:
https://github.com/jacobstanley/unix-compat/issues/56
getFileStatus and other FilePath using functions in unix-compat do not do
UNC conversion on Windows.
Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the
necessary conversion on windows to support long filenames.
Audited all imports of System.PosixCompat.Files to make sure that no
functions that operate on FilePath were imported from it. Instead, use
the equvilants from Utility.RawFilePath. In particular the
re-export of that module in Common had to be removed, which led to lots
of other changes throughout the code.
The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig
make Utility.Directory not be needed to build setup. And so let it use
Utility.RawFilePath, which depends on unix, which cannot be in
setup-depends.
Sponsored-by: Dartmouth College's Datalad project