Avoids displaying a warning about git-annex restage needing to be run in
situations where it does not need to be.
Closing a handle flushes it anyway, so there is no need for an explicit
flush. The handle does get closed twice, but that's fine; the second close
does nothing.
Sponsored-by: Dartmouth College's DANDI project
Let GIT_DIR and --git-dir override git's protection against operating in a
repository owned by another user.
This is the same behavior other git commands have.
Sponsored-by: Jarkko Kniivilä on Patreon
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.
Currently, this lets a git-annex get/drop command be interrupted and
then re-run, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.
Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes: when eg
dropping a large number of files, the git queue currently flushes
periodically, so it restages incrementally rather than all at the
end.
In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.
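As a rough sketch of that two-pass approach (simplified; the real
calcRestageLog and streamRestageLog use git-annex's own streaming
primitives and a richer log format):

    import qualified Data.ByteString.Lazy.Char8 as L

    -- First pass: tally the log without retaining all of it in memory.
    calcRestageLog :: FilePath -> IO Int
    calcRestageLog f = length . L.lines <$> L.readFile f

    -- Second pass: stream entries one at a time to the action.
    streamRestageLog :: FilePath -> (L.ByteString -> IO ()) -> IO ()
    streamRestageLog f a = L.readFile f >>= mapM_ a . L.lines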
Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok; it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.
Sponsored-by: Dartmouth College's DANDI project
Fix a reversion that prevented git-annex from working in a repository when
--git-dir or GIT_DIR is specified to relocate the git directory to
somewhere else. (Introduced in version 10.20220525)
checkRepoConfigInaccessible could still run git config --list, just passing
--git-dir. That seems unnecessary, since I know that passing --git-dir
bypasses git's check for repo ownership. It could be that git eventually
changes to check something about the ownership of the working tree, in
which case passing --git-dir without --work-tree would still be worth
doing. But for now this is the simple fix.
Sponsored-by: Nicholas Golder-Manning on Patreon
When accessing a git remote over http needs a git credential prompt for a
password, cache it for the lifetime of the git-annex process, rather than
repeatedly prompting.
The git-lfs special remote already caches the credential when discovering
the endpoint. And presumably commands like git pull do as well, since they
may download multiple urls from a remote.
The TMVar CredentialCache is read, not emptied, so two concurrent calls to
getBasicAuthFromCredential will both prompt for a credential.
There would already be two concurrent password prompts in such a case,
and existing uses of `prompt` probably avoid it. Anyway, it's no worse
than before.
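A hedged sketch of the caching pattern (CredentialCache's contents and the
helper name here are invented, not the real git-annex API):

    import Control.Concurrent.STM
    import qualified Data.Map as M

    newtype CredentialCache = CredentialCache (M.Map String String)

    getCachedCredential :: TMVar CredentialCache -> String -> IO String -> IO String
    getCachedCredential cachevar url prompt = do
        CredentialCache m <- atomically (readTMVar cachevar)
        case M.lookup url m of
            Just c -> return c
            Nothing -> do
                -- Two threads that both miss the cache will both
                -- prompt; whichever finishes last wins the store.
                c <- prompt
                atomically $ do
                    CredentialCache m' <- takeTMVar cachevar
                    putTMVar cachevar (CredentialCache (M.insert url c m'))
                return c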
This should not change the behavior of it, unless there are multiple top
directories, and then it should behave the same as if there was a single
top directory that was actually above the directory to be created.
Sponsored-by: Dartmouth College's Datalad project
This caused git to complain that filter-process failed and kill it with
signal 15, because git-annex wrote an extra flushPkt for an empty file.
Git did not expect that, and so saw an unexpected response to its next
request.
Luckily, filter-process is only used by default in v9 and up, and v8 is
still the default. Also, git had to be updating an empty file, followed
by another file, which is a fairly unlikely situation. And git restarts
filter-process after this happens and uses it to filter the rest of the
files. So this isn't a crippling bug.
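To illustrate the framing problem, a sketch of the response side of the
protocol (writePkt and flushPkt here are approximations of pkt-line
framing, not git-annex's real functions):

    import qualified Data.ByteString.Char8 as B
    import System.IO (Handle)
    import Text.Printf (printf)

    -- pkt-line framing: a 4-byte hex length header, then the payload.
    writePkt :: Handle -> B.ByteString -> IO ()
    writePkt h b = do
        B.hPut h (B.pack (printf "%04x" (B.length b + 4)))
        B.hPut h b

    -- "0000" is the flush packet that terminates a response.
    flushPkt :: Handle -> IO ()
    flushPkt h = B.hPut h (B.pack "0000")

    -- Send a filtered file back to git: one packet per chunk of
    -- content, terminated by exactly one flushPkt. The bug amounted
    -- to also sending a flushPkt when chunks was empty, so git read
    -- two flushes and misparsed the reply to its next request.
    sendContent :: Handle -> [B.ByteString] -> IO ()
    sendContent h chunks = do
        mapM_ (writePkt h) chunks
        flushPkt h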
Sponsored-by: Luke Shumaker on Patreon
Some small wins, almost certainly swamped by the system calls, but still
worthwhile progress on the RawFilePath conversion.
Sponsored-by: Erik Bjäreholt on Patreon
This is so, if there's some other failure that triggers it, --debug will
show what went wrong. See https://github.com/datalad/datalad/issues/6708
Sponsored-by: Dartmouth College's Datalad project
Deal with git's recent changes to fix CVE-2022-24765, which prevent using
git in a repository owned by someone else.
That makes git config --list not list the repo's configs, only global
configs. So annex.uuid and annex.version are not visible to git-annex.
It displayed a message about that, which is not right for this situation.
Detect the situation and display a better message, similar to the one other
git commands display.
Also, git-annex init when run in that situation would overwrite annex.uuid
with a new one, since it couldn't see the old one. Add a check to prevent
it from running in this situation too. It may be that this fix has security
implications, if a config set by the malicious user who owns the repo
causes git or git-annex to run code. I don't think any git-annex configs
get run by git-annex init. It may be that some git config that specifies a
command does get run by one of the git commands that git-annex init runs. ("git
status" is the command that prompted the CVE-2022-24765, since
core.fsmonitor can cause it to run a command). Since I don't know how
to exploit this, I'm not treating it as a security fix for now.
Note that passing --git-dir makes git bypass the security check. git-annex
does pass --git-dir to most calls to git, which it does to avoid needing
chdir to the directory containing a git repository when accessing a remote.
So, it's possible that somewhere in git-annex it gets as far as running git
with --git-dir, and git reads some configs that are unsafe (what
CVE-2022-24765 is about). This seems unlikely; it would have to be part of
git-annex that runs in git repositories that have no (visible) annex.uuid,
and git-annex init is the only one that I can think of that then goes on to
run git, as discussed earlier. But I've not fully ruled out there being
others..
The git developers seem mostly worried about "git status" or a similar
command implicitly run by a shell prompt, not an explicit use of git in
such a repository. For example, Ævar Arnfjörð Bjarmason wrote:
> * There are other bits of config that also point to executable things,
> e.g. core.editor, aliases etc, but nothing has been found yet that
> provides the "at a distance" effect that the core.fsmonitor vector
> does.
>
> I.e. a user is unlikely to go to /tmp/some-crap/here and run "git
> commit", but they (or their shell prompt) might run "git status", and
> if you have a /tmp/.git ...
Sponsored-by: Jarkko Kniivilä on Patreon
Avoid treating refs/annex/last-index or other refs that are not commit
objects as evidence of repository corruption.
The repair code checks to find bad refs by trying to run `git log` on
them, and assumes that no output means something is broken. But git log
on a tree object is empty.
This was worth fixing generally, not as a special case, since it's certainly
possible that other things store tree or other objects in refs.
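A sketch of the more general check (not the actual repair code): classify
the ref by object type instead of inferring corruption from empty git log
output:

    import System.Process (readProcess)

    -- A ref pointing at a tree or blob is fine; only a ref that is
    -- supposed to be a commit but is not indicates a problem. (A real
    -- implementation would also catch the exception readProcess
    -- throws when the object is missing entirely.)
    refIsCommit :: String -> IO Bool
    refIsCommit ref = do
        t <- readProcess "git" ["cat-file", "-t", ref] ""
        return (takeWhile (/= '\n') t == "commit")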
Sponsored-by: Max Thoursie on Patreon
Implemented by making Git.Queue have a FlushAction, which can accumulate
along with another action on files, and runs only once the other action has
run.
This lets git-annex unlock queue up git update-index actions, without
conflicting with the restagePointerFiles FlushActions.
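Roughly, the shape of it (simplified; the real Git.Queue types carry more
state):

    -- A FlushAction accumulates files like any queued action, but its
    -- action only runs after the other queued actions have run.
    data QueuedAction m
        = UpdateIndexAction [FilePath]
        | FlushAction ([FilePath] -> m ()) [FilePath]

    -- Flushing runs the batched actions first, then the FlushActions,
    -- so eg git update-index completes before restaging re-reads the
    -- index.
    flushQueue :: Monad m => ([FilePath] -> m ()) -> [QueuedAction m] -> m ()
    flushQueue runUpdateIndex q = do
        mapM_ runMain q
        mapM_ runLast q
      where
        runMain (UpdateIndexAction fs) = runUpdateIndex fs
        runMain _ = return ()
        runLast (FlushAction a fs) = a fs
        runLast _ = return ()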
In a repository with filter-process enabled, git-annex unlock will
often not take any more time than before, though it may when the files are
large. Either way, it should always slow down less than git-annex status
speeds up.
When filter-process is not enabled, git-annex unlock will slow down as much
as git status speeds up.
Sponsored-by: Jochen Bartl on Patreon
So that eg, addurl of several large files that take time to download will
update the index for each file, rather than deferring the index updates to
the end.
In cases like an add of many smallish files, where a new file is being
added every few seconds, the queue will still build up a
lot of changes which are flushed at once, for best performance. Since
the default queue size is 10240, often it only gets flushed once at the
end, same as before. (Notice that updateQueue updates _lastchanged
when adding a new item to the queue without flushing it; that is
necessary to avoid flushing the queue every 5 minutes in this case.)
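A sketch of the timing rule (field and function names invented):

    import Data.Time.Clock.POSIX (POSIXTime, getPOSIXTime)

    data Queue = Queue
        { items :: [FilePath]
        , _lastchanged :: POSIXTime  -- updated on every add, so steady
                                     -- additions never trip the timer
        }

    maxAge :: POSIXTime
    maxAge = 300 -- 5 minutes

    -- Flush when the queue is full, or when nothing has been added
    -- for long enough that the pending work should just be done now.
    shouldFlush :: Int -> Queue -> IO Bool
    shouldFlush maxsize q
        | length (items q) >= maxsize = return True
        | otherwise = do
            now <- getPOSIXTime
            return (now - _lastchanged q >= maxAge)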
But, when it takes more than 5 minutes to add a file, the overhead of
updating the index immediately is probably small, so do it after each
file. This avoids git-annex potentially taking a very very long time
indeed to stage newly added files, which can be annoying to the user who
would like to get on with doing something with the files it's already
added, eg using git mv to rename them to a better name.
This is only likely to cause a problem if it takes, say, 30 seconds to
update the index; doing an extra 30 seconds of work after every 5-minute
file add would be less than optimal. Normally, updating the index takes
significantly less time than that. On an SSD with 100k files it takes
less than 1 second, and the index write time is bound by disk read and
write so is not too much worse on a hard drive. So I hope this will not
impact users, although if it does turn out to, the time limit could be
made configurable.
A perhaps better way to do it would be to have a background worker
thread that wakes up every 60 seconds or so and flushes the queue.
That is made somewhat difficult because the queue can contain Annex
actions and so this would add a new source of concurrency issues.
So I'm trying to avoid that approach if possible.
Sponsored-by: Erik Bjäreholt on Patreon
This version of git -- or its new default "ort" resolver -- handles such
a conflict by staging two files, one with the original name and the other
named file~ref. Use unmergedSiblingFile when the latter is detected.
(It doesn't do that when the conflict is between a directory and a file
or symlink though, so see previous commit for how that case is handled.)
The sibling file has to be deleted separately, because cleanConflictCruft
may not delete it -- that only handles files that are annex links,
but the sibling file may be the non-annexed file side of the conflict.
The graftin code had assumed that, when the other side of a conflict
is a symlink, the file in the work tree will contain the non-annexed
content that we want it to contain. But that is not the case with the new
git; the file may be the annex link and needs to be replaced with the
content, while the annex link will be written as a -variant file.
(The weird doesDirectoryExist check in graftin turns out to still be
needed; the test suite failed when I tried to remove it.)
Test suite passes with new git with ort resolver default. Have not tried it
with old git or other defaults.
Sponsored-by: Noam Kremen on Patreon
Based on my earlier benchmark, I have a rough cost model for how
expensive it is for git-annex smudge to be run on a file, vs
how expensive it is for a gigabyte of a file's content to be read and
piped through to filter-process.
So, using that cost model, it can decide if using filter-process will
be more or less expensive than running the smudge filter on the files to
be restaged.
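Schematically, the decision looks something like this, with made-up
constants (the real filterprocessfaster heuristic and its numbers live in
git-annex's source):

    data RestageBatch = RestageBatch
        { numFiles :: Integer   -- files to restage
        , totalBytes :: Integer -- their total content size
        }

    smudgeCostPerFile, filterProcessCostPerGb :: Double
    smudgeCostPerFile = 1.0       -- invented relative cost of one smudge run
    filterProcessCostPerGb = 20.0 -- invented relative cost of piping 1 GB

    -- filter-process wins when piping all the content through it is
    -- cheaper than starting one smudge filter per file.
    filterProcessFaster :: RestageBatch -> Bool
    filterProcessFaster b =
        filterProcessCostPerGb * gb < smudgeCostPerFile * fromIntegral (numFiles b)
      where
        gb = fromIntegral (totalBytes b) / 1e9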
It turned out to be *really* annoying to temporarily disable
filter-process. I did find a way, but urk, this is horrible. Notice
that, if it's interrupted with it disabled, it will remain disabled
until the next time restagePointerFile runs. Which could be some time
later. If the user runs `git add` or `git checkout` on a lot of small
files before that, they will see slower than expected performance.
(This commit also deletes where I wrote down the benchmark results
earlier.)
Sponsored-by: Noam Kremen on Patreon
filter-process: New command that can make git add/checkout faster when
there are a lot of unlocked annexed files or non-annexed files, but that
also makes git add of large annexed files slower.
Use it by running: git config filter.annex.process 'git-annex filter-process'
Fully tested and working, but I have not benchmarked it at all.
And, incremental hashing is not done when git add uses it, so extra work is
done in that case.
Sponsored-by: Mark Reidenbach on Patreon
This module is not used yet, but the plan is to use it for smudge/clean
filtering, at least as an option. In some circumstances, using this
interface may perform better than the interface git-annex is currently
using.
Sponsored-by: Brock Spratlen on Patreon
git does not crash when there's a remote configured for a user who does
not exist, and this prevents git-annex from crashing too. Consider that
a user might exist on one system but not another, and the git repo be
moved between systems. So not crashing is desirable.
Note that git fetch seems to mishandle a remote path like ~foo/bar
when the user does not exist. While it does access ./~foo/bar,
and gets as far as running git-upload-pack on the path,
it then complains there is no such repo. So different parts of git seem
to be doing different things in that edge case. Anyway, git-annex does
not need to be bug-for-bug compatible with git.
Sponsored-by: Jack Hill on Patreon
This improves the borg special remote memory usage, by
letting it only load one archive's worth of filenames into memory at a
time, and building up a larger tree out of the chunks.
When a borg repository has many archives, git-annex could easily OOM
before. Now, it will use only memory proportional to the number of
annexed keys in an archive.
Minor implementation wart: Each new chunk re-opens the content
identifier database, and also a new vector clock is used for each chunk.
This is a minor inefficiency only; the use of continuations makes
it hard to avoid, although putting the database handle into a Reader
monad would be one way to fix it.
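A simplified sketch of the continuation-based interface (the real
ImportableContentsChunkable types are richer than this):

    -- One chunk holds a single archive's worth of filenames; the
    -- continuation pulls the next archive only when demanded, so
    -- memory use is bounded by the largest chunk.
    data ImportChunk m = ImportChunk
        { chunkSubTree :: [String]               -- eg the borg archive name
        , chunkContents :: [(FilePath, Integer)] -- files and sizes in this chunk
        , nextChunk :: Maybe (m (ImportChunk m)) -- continuation for the next one
        }

    -- Fold over all chunks, one archive in memory at a time.
    foldChunks :: Monad m => (a -> ImportChunk m -> m a) -> a -> ImportChunk m -> m a
    foldChunks f a c = do
        a' <- f a c
        maybe (return a') (>>= foldChunks f a') (nextChunk c)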
It may later be possible to extend the ImportableContentsChunkable
interface to remotes that are not third-party populated. However, that
would perhaps need an interface that does not use continuations.
The ImportableContentsChunkable interface currently does not allow
populating the top of the tree with anything other than subtrees. It
would be easy to extend it to allow putting files in that tree, but borg
doesn't need that so I left it out for now.
Sponsored-by: Noam Kremen on Patreon
Commit 4bf7940d6b introduced this
problem, but was otherwise doing a good thing. Problem being
that fileRef "/foo" used to return ":./foo", which was actually wrong,
but as long as there was no foo in the local repository, catKey
could operate on it without crashing. After that fix though, the path
would be made relative, eg "../../foo", resulting in fileRef returning
":./../../foo", which will make git cat-file crash since that's
not a valid path in the repo.
Fix is simply to make fileRef detect paths outside the repo and return
Nothing. Then catKey can be skipped. This needed several bugfixes to
dirContains as well, in previous commits.
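A sketch of the fix with simplified types (the real fileRef operates on
RawFilePath; this assumes both paths are absolute or share a base):

    import Data.List (isPrefixOf)
    import System.FilePath (makeRelative, normalise, splitDirectories)

    -- True when b is inside directory a, comparing path components.
    dirContains :: FilePath -> FilePath -> Bool
    dirContains a b = splitDirectories (normalise a)
        `isPrefixOf` splitDirectories (normalise b)

    -- Only paths inside the repository yield a ref; anything outside
    -- returns Nothing, and catKey can then be skipped.
    fileRef :: FilePath -> FilePath -> Maybe String
    fileRef repotop f
        | repotop `dirContains` f = Just (":./" ++ makeRelative repotop f)
        | otherwise = Nothing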
In Command.Smudge, this led to needing to check for Nothing. That case
should actually never happen, because the fileoutsiderepo check will
detect it earlier.
Sponsored-by: Brock Spratlen on Patreon
This avoids starting one process when only the other one is needed.
Eg in git-annex smudge --clean, this reduces the total number of
cat-file processes that are started from 4 to 2.
The only performance penalty is that when both are needed, it has to do
twice as much work to maintain the two Maps. But both are very small,
consisting of 1 or 2 items, so that work is negligible.
Sponsored-by: Dartmouth College's Datalad project
This eliminates the distinction between decodeBS and decodeBS', encodeBS
and encodeBS', etc. The old implementation truncated at NUL, and the
primed versions had to do extra work to avoid that problem. The new
implementation does not truncate at NUL, and is also a lot faster.
(Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the
primed versions.)
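The NUL-safe approach looks roughly like this, using length-aware
CStringLen conversions via the filesystem encoding (a sketch, not
necessarily the exact implementation):

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Unsafe as BU
    import qualified GHC.Foreign as GHC
    import GHC.IO.Encoding (getFileSystemEncoding)
    import System.IO.Unsafe (unsafePerformIO)

    -- peekCStringLen/withCStringLen carry an explicit length, so
    -- embedded NULs survive instead of truncating the string.
    decodeBS :: B.ByteString -> String
    decodeBS b = unsafePerformIO $ do
        enc <- getFileSystemEncoding
        BU.unsafeUseAsCStringLen b (GHC.peekCStringLen enc)

    encodeBS :: String -> B.ByteString
    encodeBS s = unsafePerformIO $ do
        enc <- getFileSystemEncoding
        GHC.withCStringLen enc s B.packCStringLen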
Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation,
and upgrading to it will speed up to/fromRawFilePath.
AFAIK, nothing relied on the old behavior of truncating at NUL. Some
code used the faster versions in places where I was sure there would not
be a NUL. So this change is unlikely to break anything.
Also, moved s2w8 and w82s out of the module, as they do not involve
filesystem encoding really.
Sponsored-by: Shae Erisson on Patreon
* sync: When --quiet is used, run git commit, push, and pull without
  their usual output.
* merge: When --quiet is used, run git merge without its usual output.
This might also make --quiet work better for some other commands
that make commits, like git-annex adjust.
Sponsored-by: Kevin Mueller on Patreon
Does not check the reflog, but otherwise works.
It's possible for it to display something that is not an annexed file,
if a non-annexed file somehow ends up containing something that looks
like the key's name. This seems very unlikely to happen, and it would
add a lot of complexity to detect it and somehow skip over that file,
since the git log would need to either be run again, or not limited to 1
result and canceled once enough results have been read.
Also, it kind of seems ok, if a file refers to a key, to consider that
as a place the key was used, for some definition of used. So, I punted
on dealing with that. May revisit later.
Sponsored-by: Brock Spratlen on Patreon
If .git/config symlinks are ever a thing, this will handle them better.
But mostly, this is to avoid git-repair needing to include CopyFile,
which would complicate it unduly.
In commit dfc4e641b5 git repair was changed
to use remote name, not url, when fetching. But it fetches into a temporary
git repo, which doesn't have remotes configured. Oops.
(In my defense, that commit was made just as covid lockdown started. But
testing? Urk.)
Sponsored-by: Mark Reidenbach on Patreon
Fixed bug where interrupting git-annex repair (or assistant) while it was
fixing repository corruption would lose objects that were contained in pack
files.
Unpack all pack files and move objects into place *before* deleting the
pack files. The old approach moved the pack files to a temp directory
before unpacking them, which was not interruption safe.
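A sketch of the safe ordering (simplified; the real code also deals with
pack .idx files and with unpacking failures):

    import System.Directory (removeFile)
    import System.IO (IOMode(ReadMode), withFile)
    import System.Process

    -- Explode every pack into loose objects first; only delete the
    -- packs once all unpacking is done, so an interruption at any
    -- point leaves the objects recoverable.
    explodePacks :: [FilePath] -> IO ()
    explodePacks packs = do
        mapM_ unpack packs
        mapM_ removeFile packs
      where
        -- git unpack-objects reads a pack from stdin; -r recovers as
        -- many objects as possible from a corrupt pack.
        unpack p = withFile p ReadMode $ \h -> do
            (_, _, _, pid) <- createProcess
                (proc "git" ["unpack-objects", "-r"]) { std_in = UseHandle h }
            _ <- waitForProcess pid
            return ()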
Sponsored-By: Jochen Bartl on Patreon
Eg, when git commit runs the smudge filter.
Commit 428c91606b introduced the crash,
as write-tree fails in those situations. Now it will work, and git-annex
always gets up-to-date information even in those situations. It does
need to do a bit more work, each time git-annex is run with the index
locked. Although if the index is unmodified from the last time
write-tree succeeded, that work is avoided.
Added a note to man page about what happens to information that is
recorded in the private journal. Since it uses Branch.get, that
information will be copied when options allow. It seemed better to allow
it and document it than not allow it, since the options allow excluding
repositories and so can be used to exclude private repos if desired.
Fix behavior of several commands, including reinject, addurl, and rmurl
when given an absolute path to an unlocked file, or a relative path that
leaves and re-enters the repository.
To avoid slowing down all the cases where the paths are already ok
with an unnecessary call to getCurrentDirectory, put in an optimisation
in relPathCwdToFile. That will probably also speed up other parts of
git-annex by some small amount, but I have not benchmarked.
Note that I did not convert branchFileRef, because it seems likely that
it will be used with a file that is not provided by the user, so is already
in a sane format. This is certainly true for the way git-annex uses it,
though maybe arguable to the extent Git.Ref is a reusable library.
The slightly unusual parsing in Types.GitConfig avoids the need to look
at the remote list to get configs of remotes. annexPrivateRepos combines
all the configs, and will only be calculated once, so it's nice and
fast.
privateUUIDsKnown and regardingPrivateUUID now need to read from the
annex mvar, so are not entirely free. But that overhead can be optimised
away, as seen in getJournalFileStale. The other call sites didn't seem
worth optimising to save a single MVar access. The feature should have
imperceptible speed overhead when not being used.
If it's passed a ConfigKey such as annex.version, avoid returning
an empty remote name and return Nothing instead. Also, foo.bar.baz is
not treated as a remote named "bar".
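A sketch of the stricter parsing (the function and helper names here are
invented, and dotted remote names are left aside for simplicity):

    import Data.List (intercalate)

    splitc :: Char -> String -> [String]
    splitc c s = case break (== c) s of
        (a, _:rest) -> a : splitc c rest
        (a, [])     -> [a]

    -- Only a key of the form "remote.<name>.<setting>" yields a
    -- name; "annex.version" and "foo.bar.baz" both yield Nothing,
    -- as does an empty name like "remote..url".
    remoteNameFromKey :: String -> Maybe String
    remoteNameFromKey k = case splitc '.' k of
        ("remote" : rest@(_:_:_)) ->
            let name = intercalate "." (init rest)
            in if null name then Nothing else Just name
        _ -> Nothing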
This generated unusual git trees that could confuse git merge,
since they incorrectly had 2 subtrees with the same name.
Root of the bug was a) not testing that at all! but also
b) confusing graftdirs, which contains eg "foo/bar", with
non-recursively read trees, which would contain eg "bar"
when reading a subtree of "foo".
It's worth noting that Annex.Import uses graftTree, but it really
shouldn't have needed to. Eg, when importing into foo/bar from a remote,
it's enough to generate a tree of foo/bar/x, foo/bar/y, which does not
include other files that are at the top of the master branch. It uses
graftTree, so it does include the other files, as well as the foo/bar
tree. git merge will do the same thing for both trees. With that said,
switching it away from graftTree would result in another import
generating a new commit that seems to delete files that were there in a
previous commit, so it probably has to keep using graftTree since it
used it before.
This commit was sponsored by Kevin Mueller on Patreon.
Seems that hasOrigin was never finding origin's git-annex branch, so a new
one got created each time. And so then it later needed to merge the two
branches, which is expensive.
Added --no-track to git branch to avoid it displaying a message about
setting up tracking branches. Of course there's no reason to make the
git-annex branch a tracking branch since git-annex auto-merges it.
Not yet used, but allows getting the size of items in the tree fairly
cheaply.
I noticed that CmdLine.Seek uses ls-tree and then feeds the files into
another long-running process to check their size. That would be an
example of a place that might be sped up by using this. Although in that
particular case, it only needs to know the size of unlocked files, not
locked. And since enabling --long probably doubles the ls-tree runtime
or more, the overhead of using it there may outweigh the benefit.
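For reference, a sketch of parsing ls-tree --long output, whose documented
format is "<mode> SP <type> SP <object> SP <size> TAB <path>", with the
size shown as "-" for subtrees:

    import Text.Read (readMaybe)

    -- Split at the tab, then at the (space-padded) metadata fields.
    -- readMaybe maps the "-" size of a subtree to Nothing.
    parseLsTreeLong :: String -> Maybe (FilePath, Maybe Integer)
    parseLsTreeLong l = case break (== '\t') l of
        (meta, '\t' : path) -> case words meta of
            [_mode, _objtype, _sha, size] -> Just (path, readMaybe size)
            _ -> Nothing
        _ -> Nothing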