git-annex

Author	SHA1	Message	Date
Joey Hess	c6acf574c7	implement importChanges optimisation (not used yet) For simplicity, I've not tried to make it handle History yet, so when there is a history, a full import will still be done. Probably the right way to handle history is to first diff from the current tree to the last imported tree. Then, diff from the current tree to each of the historical trees, and recurse through the history diffing from child tree to parent tree. I don't think that will need a record of the previously imported historical trees, and so Logs.Import doesn't store them. Although I did leave room for future expansion in that log just in case. Next step will be to change importTree to importChanges and modify recordImportTree et al to handle it, by using adjustTree. Sponsored-by: Brett Eisenberg on Patreon	2023-05-31 16:01:34 -04:00
Joey Hess	7298123520	build git trees using ContentIdentifier to speed up import This gets the trees built, but it does not use them. Next step will be to remember the tree for next time an import is done, and diff between old and new trees to find the files that have changed. Added --missing to the mktree parameters. That only disables a check, so it's ok to do everywhere mktree is used. It probably also speeds up mktree to disable the check. Note that git fsck does not complain about the resulting tree objects that point to shas that are not in the repository. Even with --strict. A quick benchmark, importing 10000 files, this slowed it down from 2:04.06 to 2:04.28. So it will more than pay for itself. Sponsored-by: Luke Shumaker on Patreon	2023-05-31 12:46:54 -04:00
Joey Hess	5070087a63	repair: Fix handling of git ref names on Windows Sponsored-by: Kevin Mueller on Patreon	2023-05-30 16:09:13 -04:00
Joey Hess	0da0e2efcc	add git config debugging (and process cwd debugging) Sponsored-by: Dartmouth College's Datalad project	2023-05-15 15:35:29 -04:00
Joey Hess	67f8268b3f	Support core.sharedRepository=0xxx at long last Sponsored-by: Brett Eisenberg on Patreon	2023-04-26 17:03:29 -04:00
Joey Hess	7af75a59be	Warn about unsupported core.sharedRepository=0xxx when set This spams the user with a lot of messages, but it seems like busywork to avoid that and only warn once, since this warning will go away when it gets implemented. Also fix parsing of the octal value. Sponsored-by: Kevin Mueller on Patreon	2023-04-26 13:25:29 -04:00
Joey Hess	fe5e586b72	rename Git.Filename to Git.Quote	2023-04-12 17:22:03 -04:00
Joey Hess	4a5f18a8ec	IsString StringContainingQuotedPath optimisation This causes an encodeBS thunk, and the first evaluation of the string forces it. From then on, further uses operate on a ByteString. This avoids converting repeatedly.	2023-04-11 15:29:04 -04:00
Joey Hess	8b6c7bdbcc	filter out control characters in all other Messages This does, as a side effect, make long notes in json output not be indented. The indentation is only needed to offset them underneath the display of the file they apply to, so that's ok. Sponsored-by: Brock Spratlen on Patreon	2023-04-11 12:58:01 -04:00
Joey Hess	007e302637	use safeOutput when quoting UnquotedString UnquotedString does not need to be quoted, but still it's possible it contains something attacker-controlled, which could have an escape sequence or control character in it. This is a convenient place to filter out such things, since quoting already handles those in filenames. Sponsored-by: Luke Shumaker on Patreon	2023-04-10 14:46:17 -04:00
Joey Hess	cd544e548b	filter out control characters in error messages giveup changed to filter out control characters. (It is too low level to make it use StringContainingQuotedPath.) error still does not, but it should only be used for internal errors, where the message is not attacker-controlled. Changed a lot of existing error to giveup when it is not strictly an internal error. Of course, other exceptions can still be thrown, either by code in git-annex, or a library, that include some attacker-controlled value. This does not guard against those. Sponsored-by: Noam Kremen on Patreon	2023-04-10 13:50:51 -04:00
Joey Hess	063c00e4f7	git style filename quoting for giveup When the filenames are part of the git repository or other files that might have attacker-controlled names, quote them in error messages. This is fairly complete, although I didn't do the one in Utility.DirWatcher.INotify.hs because that doesn't have access to Git.Filename or Annex. But it's also quite possible I missed some. And also while scanning for these, I found giveup used with other things that could be attacker controlled to contain control characters (eg Keys). So, I'm thinking it would also be good for giveup to just filter out control characters. This commit is then not the only line of defence, but just good formatting when git-annex displays a filename in an error message. Sponsored-by: Kevin Mueller on Patreon	2023-04-10 12:56:45 -04:00
Joey Hess	1c21ce17d4	avoid unnecessary nested lists for combining StringContainingQuotedPath	2023-04-09 12:53:13 -04:00
Joey Hess	2ba1559a8e	git style quoting for ActionItemOther Added StringContainingQuotedPath, which is used for ActionItemOther. In the process, checked every ActionItemOther for those containing filenames, and made them use quoting. Sponsored-by: Graham Spencer on Patreon	2023-04-08 16:30:01 -04:00
Joey Hess	d689a5b338	git style filename quoting controlled by core.quotePath This is by no means complete, but escaping filenames in actionItemDesc does cover most commands. Note that for ActionItemBranchFilePath, the value is branch:file, and I chose to only quote the file part (if necessary). I considered quoting the whole thing. But, branch names cannot contain control characters, and while they can contain unicode, git does not quote unicode when displaying branch names. So, it would be surprising for git-annex to quote unicode in a branch name. The find command is the most obvious command that still needs to be dealt with. There are probably other places that filenames also get displayed, eg embedded in error messages. Some other commands use ActionItemOther with a filename, I think that ActionItemOther should either be pre-sanitized, or should explicitly not be used for filenames, so that needs more work. When --json is used, unicode does not get escaped, but control characters were already escaped in json. (Key escaping may turn out to be needed, but I'm ignoring that for now.) Sponsored-by: unqueued on Patreon	2023-04-08 14:52:26 -04:00
Joey Hess	c5b017e55b	full emulation of git filename escaping Not yet used, but the plan is to make git-annex use this when displaying filenames similar to how git does. Sponsored-by: Lawrence Brogan on Patreon	2023-04-07 17:17:31 -04:00
Joey Hess	d9b6be7782	convert encode_c to ByteString This turns out to be possible after all, because the old one decomposed a unicode Char to multiple Word8s and encoded those. It should be faster in some places, particularly in Git.Filename.encodeAlways. The old version encoded all unicode by default as well as ascii control characters and also '"'. The new one only encodes ascii control characters by default. That old behavior was visible in Utility.Format.format, which did escape '"' when used in eg git-annex find --format='${escaped_file}\n' So made sure to keep that working the same. Although the man page only says it will escape "unusual" characters, so it might be able to be changed. Git.Filename.encodeAlways also needs to escape '"' ; that was the original reason that was escaped. Types.Transferrer I judge is ok to not escape '"', because the escaped value is sent in a line-based protocol, which is decoded at the other end by decode_c. So old git-annex and new will be fine whether that is escaped or not, the result will be the same. Note that when asked to escape a double quote, it is escaped to \" rather than to \042. That's the same behavior as git has. It's perhaps somehow more of a special case than it needs to be. Sponsored-by: k0ld on Patreon	2023-04-07 17:10:49 -04:00
Joey Hess	371d4f8183	decode_c converted to ByteString This speeds up a few things, notably CmdLine.Seek using Git.Filename which uses decode_c and this avoids a conversion to String and back, and probably the ByteString implementation of decode_c is also faster for simple cases at least than the string version. encode_c cannot be converted to ByteString (or if it did, it would have to convert right back to String in order to handle unicode). Sponsored-by: Brock Spratlen on Patreon	2023-04-07 14:44:19 -04:00
Joey Hess	cd076cd085	Windows: Support urls like "file:///c:/path" That is a legal url, but parseUrl parses it to "/c:/path" which is not a valid path on Windows. So as a workaround, use parseURIPortable everywhere, which removes the leading slash when run on windows. Note that if an url is parsed like this and then serialized back to a string, it will be different from the input. Which could potentially be a problem, but is probably not in practice. An alternative way to do it would be to have an uriPathPortable that fixes up the path after parsing. But it would be harder to make sure that is used everywhere, since uriPath is also used when constructing an URI. It's also worth noting that System.FilePath.normalize "/c:/path" yields "c:/path". The reason I didn't use it is that it also may change "/" to "\" in the path and I wanted to keep the url changes minimal. Also noticed that convertToWindowsNativeNamespace handles "/c:/path" the same as "c:/path". Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project	2023-03-27 13:38:02 -04:00
Joey Hess	a0badc5069	sync: Fix parsing of gcrypt::rsync:// urls that use a relative path Such an url is not valid; parseURI will fail on it. But git-annex doesn't actually need to parse the url, because all it needs to do to support syncing with it is know that it's not a local path, and use git pull and push. (Note that there is no good reason for the user to use such an url. An absolute url is valid and I patched git-remote-gcrypt to support them years ago. Still, users gonna do anything that tools allow, and git-remote-gcrypt still supports them.) Sponsored-by: Jack Hill on Patreon	2023-03-23 15:20:00 -04:00
Joey Hess	e822df2a09	fix build warnings on windows	2023-03-21 18:41:23 -04:00
Yaroslav Halchenko	84b0a3707a	Apply codespell -w throughout	2023-03-17 15:14:58 -04:00
Yaroslav Halchenko	e018ae1125	Fix ambiguous typos	2023-03-17 15:14:47 -04:00
Joey Hess	a6bebe3c0f	make hashFile support paths with newlines git hash-object --stdin-paths is a newline protocol so it cannot support them. It would help to not use absPath, when the problem is that the repository itself is in a path with a newline. But, there's a reason it used absPath, which is that git hash-object --stdin-paths actually chdirs to the top of the repository on startup! That is not documented, and I think is a bug in git. I considered making the path relative to the top of the repo, but then what if this is a git bug and gets fixed? git-annex would break horribly. So instead, keep the absPath, but when the path contains a newline, fall back to running git hash-object once per file, which avoids the problem with newlines and --stdin-paths. It will be slower, but this is an edge case. (Similar slow code paths are already used elsewhere when dealing with filenames with newlines and other parts of git that use line-based protocols.) Sponsored-by: Dartmouth College's Datalad project	2023-03-13 13:43:40 -04:00
Joey Hess	54ad1b4cfb	Windows: Support long filenames in more (possibly all) of the code Works around this bug in unix-compat: https://github.com/jacobstanley/unix-compat/issues/56 getFileStatus and other FilePath using functions in unix-compat do not do UNC conversion on Windows. Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the necessary conversion on windows to support long filenames. Audited all imports of System.PosixCompat.Files to make sure that no functions that operate on FilePath were imported from it. Instead, use the equivalents from Utility.RawFilePath. In particular the re-export of that module in Common had to be removed, which led to lots of other changes throughout the code. The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig make Utility.Directory not be needed to build setup. And so let it use Utility.RawFilePath, which depends on unix, which cannot be in setup-depends. Sponsored-by: Dartmouth College's Datalad project	2023-03-01 15:55:58 -04:00
Joey Hess	f09e299156	rawfilepath conversion	2023-02-27 15:06:32 -04:00
Joey Hess	672258c8f4	Revert "revert recent bug fix temporarily for release" This reverts commit `16f1e24665`.	2023-02-14 14:11:23 -04:00
Joey Hess	16f1e24665	revert recent bug fix temporarily for release Decided this bug is not severe enough to delay the release until tomorrow, so this will be re-applied after the release.	2023-02-14 14:06:29 -04:00
Joey Hess	c1ef4a7481	Avoid Git.Config.updateLocation adding "/.git" to the end of the repo path to a bare repo when git config is not allowed to list the configs due to the CVE-2022-24765 fix. That resulted in a confusing error message, and prevented the nice message that explains how to mark the repo as safe to use. Made isBare a tristate so that the case where core.bare is not returned can be handled. The handling in updateLocation is to check if the directory contains config and objects and if so assume it's bare. Note that if that heuristic is somehow wrong, it would construct a repo that thinks it's bare but is not. That could cause follow-on problems, but since git-annex then checks checkRepoConfigInaccessible, and skips using the repo anyway, a wrong guess should not be a problem. Sponsored-by: Luke Shumaker on Patreon	2023-02-14 14:00:36 -04:00
Joey Hess	c1f4d536b2	fix comment	2023-02-14 13:28:02 -04:00
Joey Hess	49ee07f93d	fix flush of a closed file handle Avoids displaying warning about git-annex restage needing to be run in situations where it does not. Closing a handle flushes it anyway, so no need for an explicit flush. The handle does get closed twice, but that's fine, the second one does nothing. Sponsored-by: Dartmouth College's DANDI project	2022-09-30 14:02:31 -04:00
Joey Hess	bfa451fc4e	pass --git-dir when reading git config when it was specified explicitly Let GIT_DIR and --git-dir override git's protection against operating in a repository owned by another user. This is the same behavior other git commands have. Sponsored-by: Jarkko Kniivilä on Patreon	2022-09-26 14:38:34 -04:00
Joey Hess	6a3bd283b8	add restage log When pointer files need to be restaged, they're first written to the log, and then when the restage operation runs, it reads the log. This way, if the git-annex process is interrupted before it can do the restaging, a later git-annex process can do it. Currently, this lets a git-annex get/drop command be interrupted and then re-ran, and as long as it gets/drops additional files, it will clean up after the interrupted command. But more changes are needed to make it easier to restage after an interrupted process. Kept using the git queue to run the restage action, even though the list of files that it builds up for that action is not actually used by the action. This could perhaps be simplified to make restaging a cleanup action that gets registered, rather than using the git queue for it. But I wasn't sure if that would cause visible behavior changes, when eg dropping a large number of files, currently the git queue flushes periodically, and so it restages incrementally, rather than all at the end. In restagePointerFiles, it reads the restage log twice, once to get the number of files and size, and a second time to process it. This seemed better than reading the whole file into memory, since potentially a huge number of files could be in there. Probably the OS will cache the file in memory and there will not be much performance impact. It might be better to keep running tallies in another file though. But updating that atomically with the log seems hard. Also note that it's possible for calcRestageLog to see a different file than streamRestageLog does. More files may be added to the log in between. That is ok, it will only cause the filterprocessfaster heuristic to operate with slightly out of date information, so it may make the wrong choice for the files that got added and be a little slower than ideal. Sponsored-by: Dartmouth College's DANDI project	2022-09-23 15:47:24 -04:00
Joey Hess	9c76e503cf	generalize refreshIndex to MonadIO Sponsored-by: Dartmouth College's DANDI project	2022-09-23 14:28:52 -04:00
Joey Hess	8d26fdd670	skip checkRepoConfigInaccessible when git directory specified explicitly Fix a reversion that prevented git-annex from working in a repository when --git-dir or GIT_DIR is specified to relocate the git directory to somewhere else. (Introduced in version 10.20220525) checkRepoConfigInaccessible could still run git config --list, just passing --git-dir. It seems not necessary, because I know that passing --git-dir bypasses git's check for repo ownership. I suppose it might be that git eventually changes to check something about the ownership of the working tree, so passing --git-dir without --work-tree would still be worth doing. But for now this is the simple fix. Sponsored-by: Nicholas Golder-Manning on Patreon	2022-09-20 14:52:43 -04:00
Joey Hess	9621beabc4	cache credentials in memory when doing http basic auth to a git remote When accessing a git remote over http needs a git credential prompt for a password, cache it for the lifetime of the git-annex process, rather than repeatedly prompting. The git-lfs special remote already caches the credential when discovering the endpoint. And presumably commands like git pull do as well, since they may download multiple urls from a remote. The TMVar CredentialCache is read, so two concurrent calls to getBasicAuthFromCredential will both prompt for a credential. There would already be two concurrent password prompts in such a case, and existing uses of `prompt` probably avoid it. Anyway, it's no worse than before.	2022-09-09 14:20:32 -04:00
Joey Hess	23c6e350cb	improve createDirectoryUnder to allow alternate top directories This should not change the behavior of it, unless there are multiple top directories, and then it should behave the same as if there was a single top directory that was actually above the directory to be created. Sponsored-by: Dartmouth College's Datalad project	2022-08-12 12:52:37 -04:00
Joey Hess	fbc3c223a6	filter-process: Fix protocol for empty files This caused git to complain that filter-process failed and kill it with signal 15. Because it wrote an extra flushPkt for an empty file, which git did not expect, and so git saw an unexpected response to the next request. Luckily, filter-process is only used by default in v9 and up, and v8 is still the default. Also, git had to be updating an empty file, followed by another file, which is a fairly unlikely situation. And git restarts filter-process after this happens and uses it to filter the rest of the files. So this isn't a crippling bug. Sponsored-by: Luke Shumaker on Patreon	2022-07-13 17:13:54 -04:00
Joey Hess	debcf86029	use RawFilePath version of rename Some small wins, almost certainly swamped by the system calls, but still worthwhile progress on the RawFilePath conversion. Sponsored-by: Erik Bjäreholt on Patreon	2022-06-22 16:47:34 -04:00
Joey Hess	dca6e96e31	debug output of git security check probe This is so, if there's some other failure that triggers it, --debug will show what went wrong. See https://github.com/datalad/datalad/issues/6708 Sponsored-by: Dartmouth College's Datalad project	2022-05-31 12:25:11 -04:00
Joey Hess	af0d854460	deal with git's changes for CVE-2022-24765 Deal with git's recent changes to fix CVE-2022-24765, which prevent using git in a repository owned by someone else. That makes git config --list not list the repo's configs, only global configs. So annex.uuid and annex.version are not visible to git-annex. It displayed a message about that, which is not right for this situation. Detect the situation and display a better message, similar to the one other git commands display. Also, git-annex init when run in that situation would overwrite annex.uuid with a new one, since it couldn't see the old one. Add a check to prevent it running too in this situation. It may be that this fix has security implications, if a config set by the malicious user who owns the repo causes git or git-annex to run code. I don't think any git-annex configs get run by git-annex init. It may be that some git config of a command does get run by one of the git commands that git-annex init runs. ("git status" is the command that prompted the CVE-2022-24765, since core.fsmonitor can cause it to run a command). Since I don't know how to exploit this, I'm not treating it as a security fix for now. Note that passing --git-dir makes git bypass the security check. git-annex does pass --git-dir to most calls to git, which it does to avoid needing chdir to the directory containing a git repository when accessing a remote. So, it's possible that somewhere in git-annex it gets as far as running git with --git-dir, and git reads some configs that are unsafe (what CVE-2022-24765 is about). This seems unlikely, it would have to be part of git-annex that runs in git repositories that have no (visible) annex.uuid, and git-annex init is the only one that I can think of that then goes on to run git, as discussed earlier. But I've not fully ruled out there being others. The git developers seem mostly worried about "git status" or a similar command implicitly run by a shell prompt, not an explicit use of git in such a repository. For example, Ævar Arnfjörð Bjarmason wrote: > * There are other bits of config that also point to executable things, > e.g. core.editor, aliases etc, but nothing has been found yet that > provides the "at a distance" effect that the core.fsmonitor vector > does. > > I.e. a user is unlikely to go to /tmp/some-crap/here and run "git > commit", but they (or their shell prompt) might run "git status", and > if you have a /tmp/.git ... Sponsored-by: Jarkko Kniivilä on Patreon	2022-05-20 14:38:27 -04:00
Joey Hess	0406c33f58	fix git-annex repair false positive Avoid treating refs/annex/last-index or other refs that are not commit objects as evidence of repository corruption. The repair code checks to find bad refs by trying to run `git log` on them, and assumes that no output means something is broken. But git log on a tree object is empty. This was worth fixing generally, not as a special case, since it's certainly possible that other things store tree or other objects in refs. Sponsored-by: Max Thoursie on Patreon	2022-05-04 11:32:23 -04:00
Joey Hess	faf84aa5c2	Avoid git status taking a long time after git-annex unlock of many files. Implemented by making Git.Queue have a FlushAction, which can accumulate along with another action on files, and runs only once the other action has run. This lets git-annex unlock queue up git update-index actions, without conflicting with the restagePointerFiles FlushActions. In a repository with filter-process enabled, git-annex unlock will often not take any more time than before, though it may when the files are large. Either way, it should always slow down less than git-annex status speeds up. When filter-process is not enabled, git-annex unlock will slow down as much as git status speeds up. Sponsored-by: Jochen Bartl on Patreon	2022-02-18 15:06:40 -04:00
Joey Hess	a03e9107cb	wording	2021-12-14 13:53:36 -04:00
Joey Hess	681d8611be	fix flush order reversion commit `c2e46f4707` caused the queue to possibly be flushed in the wrong order when it contained a mix of different actions.	2021-12-14 13:51:00 -04:00
Joey Hess	c2e46f4707	improve git command queue flushing with time limit So that eg, addurl of several large files that take time to download will update the index for each file, rather than deferring the index updates to the end. In cases like an add of many smallish files, where a new file is being added every few seconds, the queue will still build up a lot of changes which are flushed at once, for best performance. Since the default queue size is 10240, often it only gets flushed once at the end, same as before. (Notice that updateQueue updated _lastchanged when adding a new item to the queue without flushing it; that is necessary to avoid it flushing the queue every 5 minutes in this case.) But, when it takes more than 5 minutes to add a file, the overhead of updating the index immediately is probably small, so do it after each file. This avoids git-annex potentially taking a very very long time indeed to stage newly added files, which can be annoying to the user who would like to get on with doing something with the files it's already added, eg using git mv to rename them to a better name. This is only likely to cause a problem if it takes say, 30 seconds to update the index; doing an extra 30 seconds of work after every 5 minute file add would be less optimal. Normally, updating the index takes significantly less time than that. On a SSD with 100k files it takes less than 1 second, and the index write time is bound by disk read and write so is not too much worse on a hard drive. So I hope this will not impact users, although if it does turn out to, the time limit could be made configurable. A perhaps better way to do it would be to have a background worker thread that wakes up every 60 seconds or so and flushes the queue. That is made somewhat difficult because the queue can contain Annex actions and so this would add a new source of concurrency issues. So I'm trying to avoid that approach if possible. Sponsored-by: Erik Bjäreholt on Patreon	2021-12-14 12:23:19 -04:00
Joey Hess	a62f2e141b	convert some error to giveup error has a backtrace, but these are non-internal errors, so a backtrace is unlikely to be useful	2021-12-09 14:36:54 -04:00
Joey Hess	5a7f253974	support git 2.34.0's handling of merge conflict between annexed and non-annexed file This version of git -- or its new default "ort" resolver -- handles such a conflict by staging two files, one with the original name and the other named file~ref. Use unmergedSiblingFile when the latter is detected. (It doesn't do that when the conflict is between a directory and a file or symlink though, so see previous commit for how that case is handled.) The sibling file has to be deleted separately, because cleanConflictCruft may not delete it -- that only handles files that are annex links, but the sibling file may be the non-annexed file side of the conflict. The graftin code had assumed that, when the other side of a conflict is a symlink, the file in the work tree will contain the non-annexed content that we want it to contain. But that is not the case with the new git; the file may be the annex link and needs to be replaced with the content, while the annex link will be written as a -variant file. (The weird doesDirectoryExist check in graftin turns out to still be needed, test suite failed when I tried to remove it.) Test suite passes with new git with ort resolver default. Have not tried it with old git or other defaults. Sponsored-by: Noam Kremen on Patreon	2021-11-22 16:10:24 -04:00
Joey Hess	a0758bdd10	dynamically disable filter-process in restagePointerFile when it would be slower Based on my earlier benchmark, I have a rough cost model for how expensive it is for git-annex smudge to be run on a file, vs how expensive it is for a gigabyte of a file's content to be read and piped through to filter-process. So, using that cost model, it can decide if using filter-process will be more or less expensive than running the smudge filter on the files to be restaged. It turned out to be really annoying to temporarily disable filter-process. I did find a way, but urk, this is horrible. Notice that, if it's interrupted with it disabled, it will remain disabled until the next time restagePointerFile runs. Which could be some time later. If the user runs `git add` or `git checkout` on a lot of small files before that, they will see slower than expected performance. (This commit also deletes where I wrote down the benchmark results earlier.) Sponsored-by: Noam Kremen on Patreon	2021-11-08 16:20:34 -04:00
Joey Hess	483e82ae0e	update	2021-11-05 10:53:11 -04:00
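
The git-style filename quoting discussed in commits c5b017e55b, d9b6be7782 and d689a5b338 above can be illustrated with a small Python sketch. This is not git-annex's Haskell code: the function name `quote_path`, the `quote_high_bytes` parameter (standing in for core.quotePath) and the escape table are hypothetical names chosen for illustration. It assumes git's documented rules: control characters, double quote and backslash are always escaped, bytes of 0x80 and above are escaped in octal only when core.quotePath is on, and a name is wrapped in double quotes only if something in it was escaped.

```python
# Illustrative sketch only; names here are hypothetical, not git-annex's API.

# Bytes that get a short C-style escape instead of an octal one.
_NAMED = {
    0x07: b"a", 0x08: b"b", 0x09: b"t", 0x0A: b"n",
    0x0B: b"v", 0x0C: b"f", 0x0D: b"r",
    0x22: b'"',   # double quote becomes \" rather than \042
    0x5C: b"\\",  # backslash becomes \\
}


def quote_path(name: bytes, quote_high_bytes: bool = True) -> bytes:
    """Quote a raw filename roughly the way git displays it.

    quote_high_bytes plays the role of core.quotePath: when False,
    bytes >= 0x80 (eg UTF-8 encoded unicode) pass through unchanged.
    """
    out = bytearray()
    quoted = False
    for b in name:
        if b in _NAMED:
            out += b"\\" + _NAMED[b]
            quoted = True
        elif b < 0x20 or b == 0x7F or (quote_high_bytes and b >= 0x80):
            out += b"\\%03o" % b  # three-digit octal escape
            quoted = True
        else:
            out.append(b)
    # Only names that needed escaping are wrapped in double quotes.
    return b'"' + bytes(out) + b'"' if quoted else bytes(out)
```

With this sketch, a name containing a terminal escape sequence such as `b"\x1b[31m"` comes out with the ESC byte rendered as `\033`, which is the kind of attacker-controlled output the quoting commits aim to neutralise.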


754 commits