git-annex

Author	SHA1	Message	Date
Joey Hess	cb9cf30c48	move several readonly values to AnnexRead This improves performance to a small extent in several places. Sponsored-by: Tobias Ammann on Patreon	2022-06-28 15:40:19 -04:00
Joey Hess	debcf86029	use RawFilePath version of rename Some small wins, almost certainly swamped by the system calls, but still worthwhile progress on the RawFilePath conversion. Sponsored-by: Erik Bjäreholt on Patreon	2022-06-22 16:47:34 -04:00
Joey Hess	d00e23cac9	RawFilePath optimisations	2022-06-22 16:20:08 -04:00
Joey Hess	224a57f9ed	RawFilePath optimisation	2022-06-22 16:11:03 -04:00
Joey Hess	95a04920cf	remove objectDir'	2022-06-22 16:08:49 -04:00
Joey Hess	f80ec74128	RawFilePath optimisation	2022-06-22 16:08:26 -04:00
Joey Hess	78a3d44ea0	get rid of racy addLink The remaining callers all did not rely on it checking gitignore, so were easy to convert. They were susceptible to the same overwrite race as add and fix, although less likely to have it and a narrower window than add's race. Command.Rekey in passing got an unnecessary call to removeFile deleted. addSymlink handles deleting any existing worktree file.	2022-06-14 14:47:15 -04:00
Joey Hess	7ace804d8e	avoid writing same symlink twice in a row Oddly, the second write did not cause it to lose the mtime inherited from the file being added, although the mtime was not provided to that write but only to the first. I don't quite know why that worked before!	2022-06-14 14:30:12 -04:00
Joey Hess	5ef79125ad	fix overwrite race with git-annex add of annex symlink In the unlikely case where git-annex add is run on an annex symlink that is not already added, and while it's processing it, the annex symlink is overwritten with something else, avoid git-annex overwriting that with the symlink again. Sponsored-by: Jack Hill on Patreon	2022-06-14 14:00:13 -04:00
Joey Hess	dd6dec4eb1	fix add overwrite race with git-annex add to annex This is not a complete fix for all such races, only the one where a large file gets changed while adding and gets added to git rather than to the annex. addLink needs to go away, any caller of it is probably subject to the same kind of race. (Also, addLink itself fails to check gitignore when symlinks are not supported.) ingestAdd no longer checks gitignore. (It didn't check it consistently before either, since there were cases where it did not run git add!) When git-annex import calls it, it's already checked gitignore itself earlier. When git-annex add calls it, it's usually on files found by withFilesNotInGit, which handles checking ignores. There was one other case, when git-annex add --batch calls it. In that case, old git-annex behaved rather badly, it would seem to add the file, but git add would later fail, leaving the file as an unstaged annex symlink. That behavior has also been fixed. Sponsored-by: Brett Eisenberg on Patreon	2022-06-14 13:37:19 -04:00
Joey Hess	c59ea5b1ca	info: Added --autoenable option Use cases include using git-annex init --no-autoenable and then going back and enabling the special remotes that have autoenable configured. As well as just querying to remember which ones have it enabled. It lists all special remotes that have autoenable=yes whether currently enabled or not. And it can be used with --json. I pondered making this "git-annex info autoenable", but that seemed wrong because then if the user has a directory named "autoenable", it's unclear what they are asking for. (Although "git-annex info remote" may be similarly unclear.) Making it an option does mean that it can't be provided via --batch though. Sponsored-by: Dartmouth College's Datalad project	2022-06-01 14:20:38 -04:00
Joey Hess	f35c551d35	make path absolute for display Avoid suggesting the user add "." to safe.directory.	2022-05-31 12:17:27 -04:00
Joey Hess	478ed28f98	revert windows-specific locking changes that broke tests This reverts windows-specific parts of `5a98f2d509` There were no code paths in common between windows and unix, so this will return Windows to the old behavior. The problem that the commit talks about has to do with multiple different locations where git-annex can store annex object files, but that is not too relevant to Windows anyway, because on windows the filesystem is always treated as crippled and/or symlinks are not supported, so it will only use one object location. It would need to be using a repo populated in another OS to have the other object location in use probably. Then a drop and get could possibly lead to a dangling lock file. And, I was not able to actually reproduce that situation happening before making that commit, even when I forced a race. So making these changes on windows was just begging trouble.. I suspect that the change that caused the reversion is in Annex/Content/Presence.hs. It checks if the content file exists, and then called modifyContentDirWhenExists, which seems like it would not fail, but if something deleted the content file at that point, that call would fail. Which would result in an exception being thrown, which should not normally happen from a call to inAnnexSafe. That was a windows-specific change; the unix side did not have an equivalent change. Sponsored-by: Dartmouth College's Datalad project	2022-05-23 13:21:26 -04:00
Joey Hess	63624c40a0	fix typo in comment	2022-05-23 12:53:55 -04:00
Joey Hess	af0d854460	deal with git's changes for CVE-2022-24765 Deal with git's recent changes to fix CVE-2022-24765, which prevent using git in a repository owned by someone else. That makes git config --list not list the repo's configs, only global configs. So annex.uuid and annex.version are not visible to git-annex. It displayed a message about that, which is not right for this situation. Detect the situation and display a better message, similar to the one other git commands display. Also, git-annex init when run in that situation would overwrite annex.uuid with a new one, since it couldn't see the old one. Add a check to prevent it running too in this situation. It may be that this fix has security implications, if a config set by the malicious user who owns the repo causes git or git-annex to run code. I don't think any git-annex configs get run by git-annex init. It may be that some git config of a command does get run by one of the git commands that git-annex init runs. ("git status" is the command that prompted the CVE-2022-24765, since core.fsmonitor can cause it to run a command). Since I don't know how to exploit this, I'm not treating it as a security fix for now. Note that passing --git-dir makes git bypass the security check. git-annex does pass --git-dir to most calls to git, which it does to avoid needing chdir to the directory containing a git repository when accessing a remote. So, it's possible that somewhere in git-annex it gets as far as running git with --git-dir, and git reads some configs that are unsafe (what CVE-2022-24765 is about). This seems unlikely, it would have to be part of git-annex that runs in git repositories that have no (visible) annex.uuid, and git-annex init is the only one that I can think of that then goes on to run git, as discussed earlier. But I've not fully ruled out there being others.. 
The git developers seem mostly worried about "git status" or a similar command implicitly run by a shell prompt, not an explicit use of git in such a repository. For example, Ævar Arnfjörð Bjarma wrote: > * There are other bits of config that also point to executable things, > e.g. core.editor, aliases etc, but nothing has been found yet that > provides the "at a distance" effect that the core.fsmonitor vector > does. > > I.e. a user is unlikely to go to /tmp/some-crap/here and run "git > commit", but they (or their shell prompt) might run "git status", and > if you have a /tmp/.git ... Sponsored-by: Jarkko Kniivilä on Patreon	2022-05-20 14:38:27 -04:00
Joey Hess	aa414d97c9	make fsck normalize object locations The purpose of this is to fix situations where the annex object file is stored in a directory structure other than where annex symlinks point to. But it will also move object files from the hashdirmixed back to hashdirlower if the repo configuration makes that the normal location. It would have been more work to avoid that than to let it do it. Sponsored-by: Dartmouth College's Datalad project	2022-05-16 15:38:06 -04:00
Joey Hess	6b5029db29	fix hardcoding of number of hash directories It can be changed to 1 via a tuning, rather than the 2 this assumed. So it would have tried to rmdir .git/annex/objects in that case, which would not hurt anything, but is not what it is supposed to do. Sponsored-by: Dartmouth College's Datalad project	2022-05-16 15:08:42 -04:00
Joey Hess	5a98f2d509	avoid creating content directory when locking content If the content directory does not exist, then it does not make sense to lock the content file, as it also does not exist, and so it's ok for the lock operation to fail. This avoids potential races where the content file exists but is then deleted/renamed, while another process sees that it exists and goes to lock it, resulting in a dangling lock file in an otherwise empty object directory. Also renamed modifyContent to modifyContentDir since it is not only necessarily used for modifying content files, but also other files in the content directory. Sponsored-by: Dartmouth College's Datalad project	2022-05-16 12:34:56 -04:00
Joey Hess	e8a601aa24	incremental verification for retrieval from import remotes Sponsored-by: Dartmouth College's Datalad project	2022-05-09 15:39:43 -04:00
Joey Hess	2f2701137d	incremental verification for retrieval from all export remotes Only for export remotes so far, not export/import. Sponsored-by: Dartmouth College's Datalad project	2022-05-09 13:49:33 -04:00
Joey Hess	90950a37e5	support incremental verification when retrieving from export/import remotes None of the special remotes do it yet, but this lays the groundwork. Added MustFinishIncompleteVerify so that, when an incremental verify is started but not complete, it can be forced to finish it. Otherwise, it would have skipped doing it when verification is disabled, but verification must always be done when retrieving from export remotes since files can be modified during retrieval. Note that retrieveExportWithContentIdentifier doesn't support incremental verification yet. And I'm not sure if it can -- it doesn't know the Key before it downloads the content. It seems a new API call would need to be split out of that, which is provided with the key. Sponsored-by: Dartmouth College's Datalad project	2022-05-09 12:25:04 -04:00
Joey Hess	8675b2b075	rename memoryUnits It's not just used for memory sizes.	2022-05-05 15:35:11 -04:00
Joey Hess	d266a41f8d	prevent numcopies or mincopies being configured to 0 Ignore annex.numcopies set to 0 in gitattributes or git config, or by git-annex numcopies or by --numcopies, since that configuration would make git-annex easily lose data. Same for mincopies. This is a continuation of the work to make data only be able to be lost when --force is used. It earlier led to the --trust option being disabled, and similar reasoning applies here. Most numcopies configs had docs that strongly discouraged setting it to 0 anyway. And I can't imagine a use case for setting to 0. Not that there might not be one, but it's just so far from the intended use case of git-annex, of managing and storing your data, that it does not seem like it makes sense to cater to such a hypothetical use case, where any git-annex drop can lose your data at any time. Using a smart constructor makes sure every place avoids 0. Note that this does mean that NumCopies is for the configured desired values, and not the actual existing number of copies, which of course can be 0. The name configuredNumCopies is used to make that clear. Sponsored-by: Brock Spratlen on Patreon	2022-03-28 15:20:34 -04:00
Joey Hess	982eb7ed0d	remove vendored http-client-restricted Removed vendored copy of http-client-restricted, and removed the HttpClientRestricted build flag that avoided that dependency. http-client-restricted is in Debian stable, and the i386ancient build also uses it, so I think this vendored copy is no longer needed. Sponsored-by: Noam Kremen on Patreon	2022-03-22 11:50:06 -04:00
Joey Hess	952664641a	turn of PackageImports in cabal file This makes it easier to build eg benchmarks of individual modules. May be that most of these PackageImports are not really necessary, dunno.	2022-02-25 13:16:36 -04:00
Joey Hess	51c528980c	avoid accidentally thawing git-annex symlink It did nothing, since at this point the link is dangling. But when there is a thaw hook, it would probably not be happy to be asked to run on a symlink, or might do something unexpected. Sponsored-by: Dartmouth College's Datalad project	2022-02-24 14:21:23 -04:00
Joey Hess	f4b046252a	Run annex.thawcontent-command before deleting an object file In case annex.freezecontent-command did something that would prevent deletion. Sponsored-by: Dartmouth College's Datalad project	2022-02-24 14:11:02 -04:00
Joey Hess	346007a915	add debugging of freeze and thaw	2022-02-24 14:01:29 -04:00
Joey Hess	28bc5ce232	ignore write bits being set when there is a freeze hook When annex.freezecontent-command is set, and the filesystem does not support removing write bits, avoid treating it as a crippled filesystem. The hook may be enough to prevent writing on its own, and some filesystems ignore attempts to remove write bits. Sponsored-by: Dartmouth College's Datalad project	2022-02-24 13:28:31 -04:00
Joey Hess	64ccb4734e	smudge: Warn when encountering a pointer file that has other content appended to it It will then proceed to add the file the same as if it were any other file containing possibly annexable content. Usually the file is one that was annexed before, so the new, probably corrupt content will also be added to the annex. If the file was not annexed before, the content will be added to git. It's not possible for the smudge filter to throw an error here, because git then just adds the file to git anyway. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 15:17:08 -04:00
Joey Hess	67245ae00f	fully specify the pointer file format This format is designed to detect accidental appends, while having some room for future expansion. Detect when an unlocked file whose content is not present has gotten some other content appended to it, and avoid treating it as a pointer file, so that appended content will not be checked into git, but will be annexed like any other file. Dropped the max size of a pointer file down to 32kb, it was around 80 kb, but without any good reason and certainly there are no valid pointer files anywhere that are larger than 8kb, because it's just been specified what it means for a pointer file with additional data even looks like. I assume 32kb will be good enough for anyone. ;-) Really though, it needs to be some smallish number, because that much of a file in git gets read into memory when eg, catting pointer files. And since we have no use cases for the extra lines of a pointer file yet, except possibly to add some human-visible explanation that it is a git-annex pointer file, 32k seems as reasonable an arbitrary number as anything. Increasing it would be possible, eg to 64k, as long as users of such jumbo pointer files didn't mind upgrading all their git-annex installations to one that supports the new larger size. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 14:20:31 -04:00
Joey Hess	5b373a9dd2	read a consistent amount from pointer file A few places were reading the max symlink size of a pointer file, then passing to parseLinkTargetOrPointer. Which is fine currently, but to support pointer files with lines of data after the pointer, enough has to be read that parseLinkTargetOrPointer can be assured of seeing enough of that data to know if it's correctly formatted. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 12:52:34 -04:00
Joey Hess	4cd9325c2c	fold parseLinkTarget into parseLinkTargetOrPointer Only one place remained that differentiated between them. It is the case that a symlink target that happens to contain a newline somehow will be treated as a link to a key truncated at the newline. This is super unlikely to happen, and since a key cannot actually contain a newline, it's as good a behavior as any. Anyway, this commit does not change the behavior there, although arguably it should be changed. Note that getAnnexLinkTarget does prevent a symlink target containing a newline. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 12:30:32 -04:00
Joey Hess	ce1b3a9699	info: Allow using matching options in more situations File matching options like --include will be rejected in situations where there is no filename to match against. (Or where there is a filename but it's not relative to the cwd, or otherwise seemed too bothersome to match against.) The addition of listKeys' was necessary to avoid using more memory in the common case of "git-annex info". Adding a filterM would have caused the list to buffer in memory and not stream. This is an ugly hack, but listKeys had previously run Annex operations inside unsafeInterleaveIO (for direct mode). And matching against a matcher should hopefully not change any Annex state. This does allow for eg `git-annex info somefile --include=*.ext` although why someone would want to do that I don't really know. But it seems to make sense to allow it. But, consider: `git-annex info ./somefile --include=somefile` This does not match, so will not display info about somefile. If the user really wants to, they can `--include=./somefile`. Using matching options like --copies or --in=remote seems likely to be slower than git-annex find with those options, because unlike such commands, info does not have optimised streaming through the matcher. Note that `git-annex info remote` is not the same as `git-annex info --in remote`. The former shows info about all files in the remote. The latter shows local keys that are also in that remote. The output should make that clear, but this still seems like a point where users could get confused. Sponsored-by: Jochen Bartl on Patreon	2022-02-21 14:46:07 -04:00
Joey Hess	faf84aa5c2	Avoid git status taking a long time after git-annex unlock of many files. Implemented by making Git.Queue have a FlushAction, which can accumulate along with another action on files, and runs only once the other action has run. This lets git-annex unlock queue up git update-index actions, without conflicting with the restagePointerFiles FlushActions. In a repository with filter-process enabled, git-annex unlock will often not take any more time than before, though it may when the files are large. Either way, it should always slow down less than git-annex status speeds up. When filter-process is not enabled, git-annex unlock will slow down as much as git status speeds up. Sponsored-by: Jochen Bartl on Patreon	2022-02-18 15:06:40 -04:00
Joey Hess	21e40b86d8	have v9 autoupgrade to v10 This was right before commit `a27776f602`, which made v6 v7 autoupgrade to v8 but not yet to v10. Sponsored-by: Dartmouth College's Datalad project	2022-01-26 13:16:06 -04:00
Joey Hess	a27776f602	init --version=6 upgrade to 8 not yet 10 autoUpgradeableVersions had latestVersion (10), but it did not make sense for asking for old version 6 to get version 10, while asking for version 8 got version 8. So use defaultVersion (8) instead. Sponsored-by: Dartmouth College's Datalad project	2022-01-25 13:52:42 -04:00
Joey Hess	3618746a85	fix failing readonly test case The problem is that withContentLockFile, in a v8 repo, has to take a shared lock of `.git/annex/content.lck`. But, in a readonly repository, if that file does not yet exist, it cannot lock it. And while it will sometimes work to `chmod +r .git/annex`, the repository might be readonly due to being owned by another user, or due to being mounted readonly. So, it seems that the only solution is to use some other file than `.git/annex/content.lck` as the lock file. The inode sentinel file was almost the only option that should always exist. (And if it somehow does not exist, creating an empty one for locking will be ok.) Wow, what a hack! Sponsored-by: Dartmouth College's Datalad project	2022-01-21 13:49:31 -04:00
Joey Hess	47084b8a1d	enable filter.annex.process in v9 This has tradeoffs, but is generally a win, and users who it causes git add to slow down unacceptably for can just disable it again. It needed to happen in an upgrade, since there are git-annex versions that do not support it, and using such an old version with a v8 repository with filter.annex.process set will cause bad behavior. By enabling it in v9, it's guaranteed that any git-annex version that can use the repository does support it. Although, this is not a perfect protection against problems, since an old git-annex version, if it's used with a v9 repository, will cause git add to try to run git-annex filter-process, which will fail. But at least, the user is unlikely to have an old git-annex in path if they are using a v9 repository, since it won't work in that repository. Sponsored-by: Dartmouth College's Datalad project	2022-01-21 13:11:18 -04:00
Joey Hess	dc14221bc3	detect v10 upgrade while running Capstone of the v10 upgrade process. Tested with a git-annex drop in a v8 repo that had a local v8 remote. Upgrading the repo to v10 (with --force) immediately caused it to notice and switch over to v10 locking. Upgrading the remote also caused it to switch over when operating on the remote. The InodeCache makes this fairly efficient, just an added stat call per lock of an object file. After the v10 upgrade, there is no more overhead. Sponsored-by: Dartmouth College's Datalad project	2022-01-21 12:56:38 -04:00
Joey Hess	76e365769e	fix crash after drop in v10 After cleaning up the lock file, the content directory is gone, so freezing it failed. Sponsored-by: Dartmouth College's Datalad project	2022-01-20 14:03:27 -04:00
Joey Hess	d0a5714409	continue to use v8 by default for now, unless upgraded Since it's easy to keep supporting v8, using it for a while (eg a few months) will give users time to upgrade git-annex installations, before it upgrades their repository to v9. This commit should be reverted once ready to start upgrading repositories by default. Sponsored-by: Dartmouth College's Datalad project	2022-01-20 11:56:05 -04:00
Joey Hess	0904eac8b4	automatic upgrade from v8 to v9 Sponsored-by: Dartmouth College's Datalad project	2022-01-20 11:39:36 -04:00
Joey Hess	cea6f6db92	v10 upgrade locking The v10 upgrade should almost be safe now. What remains to be done is notice when the v10 upgrade has occurred, while holding the shared lock, and switch to using v10 lock files. Sponsored-by: Dartmouth College's Datalad project	2022-01-20 11:33:14 -04:00
Joey Hess	9d5db6a09a	add upgrade.log The upgrade from V9 uses this to avoid an automatic upgrade until 1 year after the V9 update. It can also be used in future such situations. Sponsored-by: Dartmouth College's Datalad project	2022-01-19 15:52:29 -04:00
Joey Hess	856ce5cf5f	split upgrade into v9 and v10 v10 will run 1 year after the upgrade to v9, to give time for any v8 processes to die. Until that point, the v10 upgrade will be tried by every process but deferred, so added support for deferring upgrades. The upgrade prevention lock file that will be used by v10 is not yet implemented, so it does not yet defer. Sponsored-by: Dartmouth College's Datalad project	2022-01-19 13:09:33 -04:00
Joey Hess	4f7b8ce09d	fix spelling of upgradeable	2022-01-19 12:14:50 -04:00
Joey Hess	538d02d397	delete content lock file safely after shared lock Upgrade the shared lock to an exclusive lock, and then delete the lock file. If there is another process still holding the shared lock, the first process will fail taking the exclusive lock, and not delete the lock file; then the other process will later delete it. Note that, in the time period where the exclusive lock is held, other attempts to lock the content in place would fail. This is unlikely to be a problem since it's a short period. Other attempts to lock the content for removal would also fail in that time period, but that's no different than a removal failing because content is locked to prevent removal. Sponsored-by: Dartmouth College's Datalad project	2022-01-13 14:54:57 -04:00
Joey Hess	86e5ffe34a	clean empty object directories after deleting content lock file When dropping content, this was already done after deleting the content file, but the lock file prevents deleting the directories. So, try the deletion again. This does mean there's a small added overhead of a failed rmdir(). Sponsored-by: Dartmouth College's Datalad project	2022-01-13 14:22:37 -04:00
Joey Hess	e28d1d0325	fix logic that was not inverted after all oops	2022-01-13 14:11:36 -04:00
Joey Hess	a3b6b3499b	delete content lock file safely on drop, keep after shared lock This seems to be the best that can be done to avoid forever accumulating the new content lock files, while being fully safe. This is fixing code paths that have lingered unused since direct mode! And direct mode seems to have been buggy in this area, since the content lock file was deleted on unlock. But with a shared lock, there could be another process that also had the lock file locked, and deleting it invalidates that lock. So, the lock file cannot be deleted after a shared lock. At least, not without taking an exclusive lock first.. which I have not pursued yet but may. After an exclusive lock, the lock file can be deleted. But there is still a potential race, where the exclusive lock is held, and another process gets the file open, just as the exclusive lock is dropped and the lock file is deleted. That other process would be left with a file handle it can take a shared lock of, but with no effect since the file is deleted. Annex.Transfer also deletes lock files, and deals with this same problem by using checkSaneLock, which is how I've dealt with it here. Sponsored-by: Dartmouth College's Datalad project	2022-01-13 13:58:58 -04:00
Joey Hess	3d7933f124	fix inverted logic Now the content lock files are used in v9. However, I am not yet certain they are correct. In particular, lockContentUsing deletes the content lock file on unlock. But what if there's a shared lock by another process? That seems like it would discard that lock too! (Windows seems like it would not have the same problem, because as the comment in there says, "Can't delete a locked file on Windows". So if another process has a shared lock, removing it presumably fails.) Sponsored-by: Dartmouth College's Datalad project	2022-01-13 13:58:31 -04:00
Joey Hess	731b1ecf87	v9 upgrade implemented Seems to work ok. Unsure yet about the actual locking changes being correct. This is not the end of the story with upgrades, because it is unsafe for this upgrade as implemented to run in a repository where an old git-annex process is already running. The old process would use the old locking method, and not notice files locked by the new, and this could result in data loss. This problem will need to be dealt with before this branch is suitable for merging. Sponsored-by: Dartmouth College's Datalad project	2022-01-13 13:25:10 -04:00
Joey Hess	3936599885	move code from Command.Fsck Sponsored-by: Dartmouth College's Datalad project	2022-01-13 13:24:50 -04:00
Joey Hess	3c042606c2	use separate lock from content file in v9 Windows has always used a separate lock file, but on unix, the content file itself was locked, and in v9 that changes to also use a separate lock file. This needs to be tested more. Eg, what happens after dropping a file; does the content lock file get deleted too, or linger around? Sponsored-by: Dartmouth College's Datalad project	2022-01-11 17:03:14 -04:00
Joey Hess	43f9d967ff	shared repository content file permissions for v9 v9 will not need to write to annex content files in order to lock them, so freezeContent removes the write bit in a shared repository, the same as in any other repository. checkContentWritePerm makes sure that the write perm is not set, which will let git-annex fsck fix up the permissions. Upgrading to v9 will need to fix the permissions as well, but it seems likely there will be situations where the user git-annex is running an upgrade as cannot, so it will have to leave the write bit set. In such a case, git-annex fsck can fix it later. Sponsored-by: Dartmouth College's Datalad project	2022-01-11 16:50:50 -04:00
Joey Hess	ff570ad363	add v9 annex.version, not yet the default This is the start of v9, but it's currently identical to v8, and v8 is not upgraded to it. git-annex upgrade will upgrade to v9 with this change. Sponsored-by: Dartmouth College's Datalad project	2022-01-11 14:59:39 -04:00
Joey Hess	e95747a149	fix handling of corrupted data received from git remote Recover from corrupted content being received from a git remote due eg to a wire error, by deleting the temporary file when it fails to verify. This prevents a retry from failing again. Reversion introduced in version 8.20210903, when incremental verification was added. Only the git remote seems to be affected, although it is certainly possible that other remotes could later have the same issue. This only affects things passed to getViaTmp that return (False, UnVerified) due to verification failing. As far as getViaTmp can tell, that could just as well mean that the transfer failed in a way that would resume, so it cannot delete the temp file itself. Remote.Git and P2P.Annex use getViaTmp internally, while other remotes do not, which is why only it seems affected. A better fix perhaps would be to improve the types of the callback passed to getViaTmp, so that some other value could be used to indicate the state where the transfer succeeded but verification failed. Sponsored-by: Boyd Stephen Smith Jr.	2022-01-07 13:25:33 -04:00
Joey Hess	21c0d5be6e	comment	2022-01-07 12:27:19 -04:00
Joey Hess	e416635021	renameremote: Better handling of case where there are multiple special remotes with a name Instead of renaming one at random, error out and ask that a uuid be specified. Sponsored-by: Brett Eisenberg on Patreon	2022-01-05 15:24:02 -04:00
Joey Hess	58afb00f6e	enableremote: Better handling of the unusual case where multiple special remotes have been initialized with the same name Before it would pick one at random, though preferring ones that were not dead over dead ones. Now, if one is dead and the other not, it will use the non-dead one. But if both are not dead, or both dead, it will error out, suggesting the user clarify what they want to enable. Sponsored-by: Luke Shumaker on Patreon	2022-01-05 15:12:11 -04:00
Joey Hess	b1d719f9d2	handle transitions with read-only unmerged git-annex branches Capstone to this feature. Any transitions that have been performed on an unmerged remote ref but not on the local git-annex branch, or vice-versa have to be applied on the fly when reading files. Sponsored-by: Dartmouth College's Datalad project	2021-12-28 13:23:32 -04:00
Joey Hess	720baf820e	refactoring	2021-12-28 12:15:51 -04:00
Joey Hess	23a485498f	handle Annex.Branch.files with read-only unmerged git-annex branches It would be difficult to make Annex.Branch.files query the unmerged git-annex branches. Might be possible, similar to what was discussed in `7f6b2ca49c` but again I decided to make it not do anything in that situation to start with before adding such a complicated thing. git-annex info uses it when getting info about a repository. The choices were to make that fail with an error, or display the info it can, and change the output slightly for the bits of info it cannot access. While that is a behavior change, and I want to avoid any behavior changes due to unmerged git-annex branches in a read-only repo, displaying a message that is not a number seems unlikely to break anything that was consuming a number, any worse than throwing an exception would. Probably. Also git-annex unused --from origin is made to throw an error, but it would fail later anyway when trying to write to the unused log files. Sponsored-by: Dartmouth College's Datalad project	2021-12-27 15:28:31 -04:00
Joey Hess	7f6b2ca49c	handle overBranchFileContents with read-only unmerged git-annex branches This makes --all error out in that situation. Which is better than ignoring information from the branches. To really handle the branches right, overBranchFileContents would need to both query all the branches and union merge file contents (or perhaps not provide any file content), as well as diffing between branches to find files that are only present in the unmerged branches. And also, it would need to handle transitions.. Sponsored-by: Dartmouth College's Datalad project	2021-12-27 14:30:51 -04:00
Joey Hess	d9d0fe5fa4	disable precaching git-annex branch when there are unmerged branches in a read-only repo The way precaching works, it can't merge in information from those branches efficiently, so just disable it and fall back to Annex.Branch.get in order to get the correct information. Sponsored-by: Dartmouth College's Datalad project	2021-12-27 14:08:50 -04:00
Joey Hess	1e09cf661e	remove git-annex branch ref from unmerged refs list It's queried separately so it was causing extra work to include it.	2021-12-27 13:33:27 -04:00
Joey Hess	6d7ecd9e5d	merge git-annex branch in memory in read-only repository Improved support for using git-annex in a read-only repository, git-annex branch information from remotes that cannot be merged into the git-annex branch will now not crash it, but will be merged in memory. To avoid this making git-annex behave one way in a read-only repository, and another way when it can write, it's important that Annex.Branch.get return the same thing (modulo log file compaction) in both cases. This manages that mostly. There are some exceptions: - When there is a transition in one of the remote git-annex branches that has not yet been applied to the local or other git-annex branches. Transitions are not handled. - `git-annex log` runs git log on the git-annex branch, and so it will not be able to show information coming from the other, not yet merged branches. - Annex.Branch.files only looks at files in the git-annex branch and not unmerged branches. This affects git-annex info output. - Annex.Branch.hs.overBranchFileContents ditto. Affects --all and also importfeed (but importfeed cannot work in a read-only repo anyway). - CmdLine.Seek.seekFilteredKeys when precaching location logs. Note use of Annex.Branch.fullname - Database.ContentIdentifier.needsUpdateFromLog and updateFromLog These warts make this not suitable to be merged yet. This readonly code path is more expensive, since it has to query several branches. The value does get cached, but still large queries will be slower in a read-only repository when there are unmerged git-annex branches. When annex.merge-annex-branches=false, updateTo skips doing anything, and so the read-only repository code does not get triggered. So a user who is bothered by the extra work can set that. Other writes to the repository can still result in permissions errors. This includes the initial creation of the git-annex branch, and of course any writes to the git-annex branch. 
Sponsored-by: Dartmouth College's Datalad project	2021-12-27 13:21:15 -04:00
Joey Hess	c2e46f4707	improve git command queue flushing with time limit So that eg, addurl of several large files that take time to download will update the index for each file, rather than deferring the index updates to the end. In cases like an add of many smallish files, where a new file is being added every few seconds, the queue will still build up a lot of changes which are flushed at once, for best performance. Since the default queue size is 10240, often it only gets flushed once at the end, same as before. (Notice that updateQueue updated _lastchanged when adding a new item to the queue without flushing it; that is necessary to avoid it flushing the queue every 5 minutes in this case.) But, when it takes more than 5 minutes to add a file, the overhead of updating the index immediately is probably small, so do it after each file. This avoids git-annex potentially taking a very very long time indeed to stage newly added files, which can be annoying to the user who would like to get on with doing something with the files it's already added, eg using git mv to rename them to a better name. This is only likely to cause a problem if it takes say, 30 seconds to update the index; doing an extra 30 seconds of work after every 5 minute file add would be less optimal. Normally, updating the index takes significantly less time than that. On a SSD with 100k files it takes less than 1 second, and the index write time is bound by disk read and write so is not too much worse on a hard drive. So I hope this will not impact users, although if it does turn out to, the time limit could be made configurable. A perhaps better way to do it would be to have a background worker thread that wakes up every 60 seconds or so and flushes the queue. That is made somewhat difficult because the queue can contain Annex actions and so this would add a new source of concurrency issues. So I'm trying to avoid that approach if possible. 
Sponsored-by: Erik Bjäreholt on Patreon	2021-12-14 12:23:19 -04:00
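The flushing rule from the queue commit above can be sketched as below. This is a simplified Python model and not the Haskell implementation: `CommandQueue`, `add`, and `flush` are hypothetical names, times are passed in explicitly so the behaviour is easy to follow, and starting the timer when the queue is created is an assumption. Only the 10240 size limit and the 5 minute limit come from the commit message.

```python
class CommandQueue:
    """Toy model: flush when full, or when the previous change is old."""

    def __init__(self, now, max_size=10240, time_limit=300.0):
        self.max_size = max_size
        self.time_limit = time_limit   # seconds; 5 minutes per the commit
        self.items = []
        self.last_changed = now        # assumption: timer starts at creation
        self.flushes = 0

    def add(self, item, now):
        # Stale means no change was queued for longer than the time limit,
        # which is what happens when a single file takes very long to add.
        stale = now - self.last_changed > self.time_limit
        self.items.append(item)
        if stale or len(self.items) >= self.max_size:
            self.flush()
        # Updated on every add, flushed or not, so a steady stream of
        # quick adds never triggers the time-based flush.
        self.last_changed = now

    def flush(self):
        self.items.clear()
        self.flushes += 1
```

With many quick adds a few seconds apart, only the size limit ever triggers a flush; a slow add that arrives more than five minutes after the previous change is flushed immediately.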
Joey Hess	6242b35c33	fix error message Was "failed to generate a key" when key generation did not fail (it never does anymore) but the actual problem was it failed to stat the source file, perhaps due to it being deleted while the key was being generated. A user reported this, in a comment I followed up on in `262400fe04`, although I don't know what they did to trigger the error message.	2021-12-09 15:25:59 -04:00
Joey Hess	dbba231e06	Improve error message display when autoinit fails Due to eg, a permissions problem.	2021-12-09 14:38:12 -04:00
Joey Hess	ef3ab0769e	close pid lock only once no threads use it This fixes a FD leak when annex.pidlock is set and -J is used. Also, it fixes bugs where the pid lock file got deleted because one thread was done with it, while another thread was still holding it open. The LockPool now has two distinct types of resources, one is per-LockHandle and is used for file Handles, which get closed when the associated LockHandle is closed. The other one is per lock file, and gets closed when no more LockHandles use that lock file, including other shared locks of the same file. That latter kind is used for the pid lock file, so it's opened by the first thread to use a lock, and closed when the last thread closes a lock. In practice, this means that eg git-annex get of several files opens and closes the pidlock file a few times per file. While with -J5 it will open the pidlock file, process a number of files, until all the threads happen to finish together, at which point the pidlock file gets closed, and then that repeats. So in either case, another process still gets a chance to take the pidlock. registerPostRelease has a rather intricate dance, there are fine-grained STM locks, a STM lock of the pidfile itself, and the actual pidlock file on disk that are all resolved in stages by it. Sponsored-by: Dartmouth College's Datalad project	2021-12-06 15:01:39 -04:00
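The per-lock-file resource described in the commit above behaves like a reference count: the file is opened by the first user and closed by the last. A minimal Python sketch of that idea follows; `SharedLockFile` and its callbacks are hypothetical, a single mutex stands in for the fine-grained STM locks the commit mentions, and the on-disk pid lock protocol is not modelled.

```python
import threading
from collections import Counter


class SharedLockFile:
    """Hypothetical ref-counted holder for per-lock-file resources."""

    def __init__(self, open_file, close_file):
        self._open_file = open_file      # called when the first user arrives
        self._close_file = close_file    # called when the last user leaves
        self._users = Counter()
        self._guard = threading.Lock()

    def acquire(self, path):
        with self._guard:
            if self._users[path] == 0:
                self._open_file(path)
            self._users[path] += 1

    def release(self, path):
        with self._guard:
            if self._users[path] == 0:
                raise RuntimeError(f"{path} is not held")
            self._users[path] -= 1
            if self._users[path] == 0:
                del self._users[path]
                self._close_file(path)
```

Closing only at a count of zero is what prevents one thread from deleting a lock file that another thread still holds open.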
Joey Hess	e5ca67ea1c	fine-grained locking when annex.pidlock is enabled This locking has been missing from the beginning of annex.pidlock. It used to be possible, when two threads are doing conflicting things, for both to run at the same time despite using locking. Seems likely that nothing actually had a problem, but it was possible, and this eliminates that possible source of failure. Sponsored-by: Dartmouth College's Datalad project	2021-12-03 17:20:21 -04:00
Joey Hess	4703ad3e7f	remove unused import	2021-11-23 16:15:57 -04:00
Joey Hess	5a7f253974	support git 2.34.0's handling of merge conflict between annexed and non-annexed file This version of git -- or its new default "ort" resolver -- handles such a conflict by staging two files, one with the original name and the other named file~ref. Use unmergedSiblingFile when the latter is detected. (It doesn't do that when the conflict is between a directory and a file or symlink though, so see previous commit for how that case is handled.) The sibling file has to be deleted separately, because cleanConflictCruft may not delete it -- that only handles files that are annex links, but the sibling file may be the non-annexed file side of the conflict. The graftin code had assumed that, when the other side of a conflict is a symlink, the file in the work tree will contain the non-annexed content that we want it to contain. But that is not the case with the new git; the file may be the annex link and needs to be replaced with the content, while the annex link will be written as a -variant file. (The weird doesDirectoryExist check in graftin turns out to still be needed, test suite failed when I tried to remove it.) Test suite passes with new git with ort resolver default. Have not tried it with old git or other defaults. Sponsored-by: Noam Kremen on Patreon	2021-11-22 16:10:24 -04:00
Joey Hess	623a775609	fix cat-file leak in get with -J Bugfix: When -J was enabled, getting files leaked an ever-growing number of git cat-file processes. (Since commit `dd39e9e255`) The leak happened when mergeState called stopNonConcurrentSafeCoProcesses. While stopNonConcurrentSafeCoProcesses usually manages to stop everything, there was a race condition where cat-file processes were leaked. Because catFileStop modifies Annex.catfilehandles in a non-concurrency safe way, and could clobber modifications made in between. Which should have been ok, since originally catFileStop was only used at shutdown. Note the comment on catFileStop saying it should only be used when nothing else is using the handles. It would be possible to make catFileStop race-safe, but it should just not be used in a situation where a race is possible. So I didn't bother. Instead, the fix is just not to stop any processes in mergeState. Because in order for mergeState to be called, dupState must have been run, and it enables concurrency mode, stops any non-concurrent processes, and so all processes that are running are concurrency safe. So there is no need to stop them when merging state. Indeed, stopping them would be extra work, even if there was not this bug. Sponsored-by: Dartmouth College's Datalad project	2021-11-19 12:51:08 -04:00
Joey Hess	15d617f7e1	have setConcurrency stop any running git coprocesses When non-concurrent git coprocesses have been started, setConcurrency used to not stop them, and so could leak processes when enabling concurrency, eg when forkState is called. I do not think that ever actually happened, given where setConcurrency is called. And it probably would only leak one of each process, since it never downgrades from concurrent to non-concurrent.	2021-11-19 12:00:39 -04:00
Joey Hess	8c756d5a27	fix comment typo	2021-11-17 13:03:37 -04:00
Joey Hess	aa6e54ac6e	Fix a typo in the name of youtube-dl (reversion introduced in version 8.20210903)	2021-11-13 08:58:36 -04:00
Joey Hess	8034f2e9bb	factor out IncrementalHasher from IncrementalVerifier	2021-11-09 12:33:22 -04:00
Joey Hess	a0758bdd10	dynamically disable filter-process in restagePointerFile when it would be slower Based on my earlier benchmark, I have a rough cost model for how expensive it is for git-annex smudge to be run on a file, vs how expensive it is for a gigabyte of a file's content to be read and piped through to filter-process. So, using that cost model, it can decide if using filter-process will be more or less expensive than running the smudge filter on the files to be restaged. It turned out to be really annoying to temporarily disable filter-process. I did find a way, but urk, this is horrible. Notice that, if it's interrupted with it disabled, it will remain disabled until the next time restagePointerFile runs. Which could be some time later. If the user runs `git add` or `git checkout` on a lot of small files before that, they will see slower than expected performance. (This commit also deletes where I wrote down the benchmark results earlier.) Sponsored-by: Noam Kremen on Patreon	2021-11-08 16:20:34 -04:00
Joey Hess	837025b14f	Revert "disable filter.annex.process in restagePointerFile" This reverts commit `afe327ac49`. Unfortunately, disabling it by setting it to "" does not work, git then ignores filter.annex.smudge/clean, and does not pass files through git-annex at all. I don't think there is a way to temporarily disable this git config from the git command line. Which seems like a bug in git. So, it may be more expensive than anticipated to enable filter.annex.process, since git checkout etc will pipe all annexed files being checked out through it.	2021-11-05 12:43:33 -04:00
Joey Hess	afe327ac49	disable filter.annex.process in restagePointerFile This means git will run git-annex smudge --clean once per file that is restaged, which can be slow. But probably not as slow as git feeding all the content of annexed files you've gotten through a pipe to git-annex filter-process. The only time this is probably not ideal is after a drop of a bunch of files, when filter-process would be faster.	2021-11-04 15:20:26 -04:00
Joey Hess	a3cdff3fd5	add a comment about checkSaneLock See commit `8c2dd7d8ee` for original introduction of it, but needing to spelunk that far back to understand the code is not good.	2021-10-27 14:55:30 -04:00
Joey Hess	55bfa414b3	move transfer already in progress message to warning This makes it be displayed in the error-messages field with --json-error-messages. And with --quiet, it will let it be displayed, which makes sense because it's telling the user why what they requested to do has failed to happen.	2021-10-27 14:46:21 -04:00
Joey Hess	669037862a	avoid redundant freezeContent call This opens the potential for the object file to be in place but git-annex is interrupted before it can freeze it. git-annex fsck already fixes that situation, which can also occur when lockContentForRemoval thaws content. Also improve comment to not be Windows-specific.	2021-10-27 14:18:10 -04:00
Reiko Asakura	0db7297f00	Call freezeContent after move into annex This change better supports Windows ACL management using annex.freezecontent-command and annex.thawcontent-command and matches the behaviour of adding an unlocked file. By calling freezeContent after the file has moved into the annex, the file's delete permission can be denied. If the file's delete permission is denied before moving into the annex, the file cannot be moved or deleted. If the file's delete permission is not denied after moving into the annex, it will likely inherit a grant for the delete permission which allows it to be deleted irrespective of the permissions of the parent directory.	2021-10-27 14:05:57 -04:00
Joey Hess	5a9e6b1fd4	when private journal file exists, still read from git-annex branch Fix bug that caused stale git-annex branch information to be read when annex.private or remote.name.annex-private is set. The private journal file should not prevent reading more current information from the git-annex branch, but used to. Note that, overBranchFileContents has to do additional work now, when there's a private journal file, it reads from the branch redundantly and more slowly. Sponsored-by: Jack Hill on Patreon	2021-10-26 13:43:50 -04:00
Joey Hess	0f38ad9a69	close keys db to possibly work around WSL1 issue	2021-10-19 13:07:49 -04:00
Joey Hess	887edeb1ad	avoid warning when built with unix-compat 0.5.3 It re-exports modificationTimeHiRes, and provides a windows version. Might be worth using that windows version eventually, but I have not tested it.	2021-10-18 16:25:28 -04:00
Joey Hess	69f8e6c7c0	ImportableContentsChunkable This improves the borg special remote memory usage, by letting it only load one archive's worth of filenames into memory at a time, and building up a larger tree out of the chunks. When a borg repository has many archives, git-annex could easily OOM before. Now, it will use only memory proportional to the number of annexed keys in an archive. Minor implementation wart: Each new chunk re-opens the content identifier database, and also a new vector clock is used for each chunk. This is a minor inefficiency only; the use of continuations makes it hard to avoid, although putting the database handle into a Reader monad would be one way to fix it. It may later be possible to extend the ImportableContentsChunkable interface to remotes that are not third-party populated. However, that would perhaps need an interface that does not use continuations. The ImportableContentsChunkable interface currently does not allow populating the top of the tree with anything other than subtrees. It would be easy to extend it to allow putting files in that tree, but borg doesn't need that so I left it out for now. Sponsored-by: Noam Kremen on Patreon	2021-10-08 13:15:22 -04:00
Joey Hess	19e78816f0	convert Key to ShortByteString This adds the overhead of a copy when serializing and deserializing keys. I have not benchmarked much, but runtimes seem barely changed at all by that. When a lot of keys are in memory, it improves memory use. And, it prevents keys sometimes getting PINNED in memory and failing to GC, which is a problem ByteString has sometimes. In particular, git-annex sync from a borg special remote had that problem and this improved its memory use by a large amount. Sponsored-by: Shae Erisson on Patreon	2021-10-05 20:20:08 -04:00
Joey Hess	9012fa0187	reinject: Fix crash when reinjecting a file from outside the repository Commit `4bf7940d6b` introduced this problem, but was otherwise doing a good thing. Problem being that fileRef "/foo" used to return ":./foo", which was actually wrong, but as long as there was no foo in the local repository, catKey could operate on it without crashing. After that fix though, fileRef would return eg "../../foo", resulting in fileRef returning ":./../../foo", which will make git cat-file crash since that's not a valid path in the repo. Fix is simply to make fileRef detect paths outside the repo and return Nothing. Then catKey can be skipped. This needed several bugfixes to dirContains as well, in previous commits. In Command.Smudge, this led to needing to check for Nothing. That case should actually never happen, because the fileoutsiderepo check will detect it earlier. Sponsored-by: Brock Spratlen on Patreon	2021-10-01 14:06:34 -04:00
Joey Hess	b9aa2ce8d1	resume properly when copying a file to/from a local git remote is interrupted (take 2) This method avoids breaking test_readonly. Just check if the dest file exists, and avoid CoW probing when it does, so when CoW probing fails, it can resume where the previous non-CoW copy left off. If CoW has been probed already to work, delete the dest file since a CoW copy will presumably work. It seems like it would be almost as good to just skip CoW copying in this case too, but consider that the dest file might have started to be copied from some other remote, not using CoW, but CoW has been probed to work to copy from the current place. Sponsored-by: Dartmouth College's Datalad project	2021-09-27 16:03:01 -04:00
Joey Hess	7ccf642863	revert change that broke test_readonly commit `63d508e885` broke test_readonly. When a local git remote is readonly, tryCopyCoW run to copy a file from it failed at withOtherTmp. Sponsored-by: Dartmouth College's Datalad project	2021-09-27 16:02:41 -04:00
Joey Hess	e47b4badb3	separate handles for cat-file and cat-file --batch-check This avoids starting one process when only the other one is needed. Eg in git-annex smudge --clean, this reduces the total number of cat-file processes that are started from 4 to 2. The only performance penalty is that when both are needed, it has to do twice as much work to maintain the two Maps. But both are very small, consisting of 1 or 2 items, so that work is negligible. Sponsored-by: Dartmouth College's Datalad project	2021-09-24 13:16:13 -04:00
Joey Hess	798b33ba3d	simplify annex.bwlimit handling RemoteGitConfig parsing looks for annex.bwlimit when a remote does not have a per-remote config for it, so no need for a separate global config. Sponsored-by: Svenne Krap on Patreon	2021-09-22 10:52:01 -04:00
Joey Hess	05a097cde8	Merge branch 'master' into bwlimit	2021-09-22 10:48:27 -04:00
Joey Hess	4fef94d764	simplify annex.stalldetection handling RemoteGitConfig parsing looks for annex.stalldetection when a remote does not have a per-remote config for it, so no need for a separate global config. Sponsored-by: Noam Kremen on Patreon	2021-09-22 10:46:10 -04:00
Joey Hess	63d508e885	resume properly when copying a file to/from a local git remote is interrupted Probably this fixes a reversion, but I don't know what version broke it. This does use withOtherTmp for a temp file that could be quite large. Though it is a reflink copy that will not actually take up any space as long as the file it was copied from still exists. So if the copy cow succeeds but git-annex is interrupted just before that temp file gets renamed into the usual .git/annex/tmp/ location, there is a risk that the other temp directory ends up cluttered with a larger temp file than later. It will eventually be cleaned up, and the chances of this being a problem are small, so this seems like an acceptable thing to do. Sponsored-by: Shae Erisson on Patreon	2021-09-21 17:43:35 -04:00
Joey Hess	18e00500ce	bwlimit Added annex.bwlimit and remote.name.annex-bwlimit config that works for git remotes and many but not all special remotes. This nearly works, at least for a git remote on the same disk. With it set to 100kb/1s, the meter displays an actual bandwidth of 128 kb/s, with occasional spikes to 160 kb/s. So it needs to delay just a bit longer... I'm unsure why. However, at the beginning a lot of data flows before it determines the right bandwidth limit. A granularity of less than 1s would probably improve that. And, I don't know yet if it makes sense to have it be 100kb/1s rather than 100kb/s. Is there a situation where the user would want a larger granularity? Does granularity need to be configurable at all? I only used that format for the config really in order to reuse an existing parser. This can't be supported for external special remotes, or for ones that themselves shell out to an external command. (Well, it could, but it would involve pausing and resuming the child process tree, which seems very hard to implement and very strange besides.) There could also be some built-in special remotes that it still doesn't work for, due to them not having a progress meter whose display blocks the bandwidth-using thread. But I don't think there are actually any that run a separate thread for downloads than the thread that displays the progress meter. Sponsored-by: Graham Spencer on Patreon	2021-09-21 16:58:10 -04:00
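The bwlimit commit above does not show how the delay is computed, so the following is only generic throttling arithmetic in Python, not git-annex's code. `bwlimit_delay` and its parameters are hypothetical; the limit is expressed as an amount of data per duration, mirroring the "100kb/1s" config format.

```python
def bwlimit_delay(bytes_sent, elapsed, limit_bytes, per_seconds=1.0):
    """Seconds a transfer thread should sleep to stay under the limit.

    bytes_sent  -- total bytes transferred so far
    elapsed     -- seconds since the transfer started
    limit_bytes -- allowed bytes per `per_seconds` window (hypothetical units)
    """
    # Time the transfer should have taken if it ran exactly at the limit.
    target_elapsed = bytes_sent * per_seconds / limit_bytes
    # Sleep off the difference; never return a negative delay.
    return max(0.0, target_elapsed - elapsed)
```

For example, after sending 200000 bytes in one second under a limit of 100000 bytes per second, the sketch asks for a further one second pause before the next write.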
Joey Hess	ec12537774	defer write permissions checking in import until after copy to repo This should complete the fix started in `6329997ac4`, fixing the actual cause of the test suite failure this time. Sponsored-by: Dartmouth College's Datalad project	2021-09-02 13:45:21 -04:00
Joey Hess	bd5494bb9c	fix windows build	2021-09-02 12:21:25 -04:00
Joey Hess	4f42292b13	improve url download failure display * When downloading urls fail, explain which urls failed for which reasons. * web: Avoid displaying a warning when downloading one url failed but another url later succeeded. Some other uses of downloadUrl use urls that are effectively internal use, and should not all be displayed to the user on failure. Eg, Remote.Git tries different urls where content could be located depending on how the remote repo is set up. Exposing those urls to the user would lead to wild goose chases. So had to parameterize it to control whether it displays urls or not. A side effect of this change is that when there are some youtube urls and some regular urls, it will try regular urls first, even if the youtube urls are listed first. This seems like an improvement if anything, but in any case there's no defined order of urls that it's supposed to use. Sponsored-by: Dartmouth College's Datalad project	2021-09-01 15:33:38 -04:00
Joey Hess	6329997ac4	init: check for filesystem where write bit cannot be removed This fixes a reversion caused by `a99a84f342`, when git-annex init is run as root on a FAT filesystem mounted with hdiutil on OSX. Such a mount point has file mode 777 for everything and it cannot be changed. The existing crippled filesystem test tried to write to a file after removing write bit, but that test does not run as root (since root can write to unwritable files). So added a check of the write permissions of the file, after attempting to remove them. Sponsored-by: Dartmouth College's Datalad project	2021-09-01 10:27:28 -04:00
Joey Hess	e853ef3095	decorate openTempFile errors with the template name This is to track down what file in .git/annex/ is being written to via a temp file when the repository is read-only. Sponsored-by: Dartmouth College's Datalad project	2021-08-30 13:05:02 -04:00
Joey Hess	a99a84f342	add: Detect when xattrs or perhaps ACLs prevent locking down a file's content And fail with an informative message. I don't think ACLs can prevent removing the write bit, but I'm not sure, so kept it mentioning them as a possibility. Should git-annex lock also check if the write bits are able to be removed? Maybe, but the case I know about with xattrs involves cp -a copying NFS xattrs, and it's the copy of the file that is the problem. So when locking a file, I guess it will not be the copy. Sponsored-by: Dartmouth College's Datalad project	2021-08-27 14:33:01 -04:00
Joey Hess	6d4a728455	Added annex.youtube-dl-command config This can be used to run some forks of youtube-dl. Sponsored-by: Brett Eisenberg on Patreon	2021-08-27 09:44:23 -04:00
Joey Hess	4ed36b2634	Fix test suite failure on Windows It would be better if the Arbitrary instance avoided generating impossible filenames like "foo/c:bar", but probably this is the only place that splits the file from the directory and then uses the file without the directory.. At least on the quickcheck properties. Sponsored-by: Svenne Krap on Patreon	2021-08-24 14:03:29 -04:00
Joey Hess	492036622a	fix OSX build	2021-08-18 16:35:26 -04:00
Joey Hess	d154e7022e	incremental verification for web special remote Except when configuration makes curl be used. It did not seem worth trying to tail the file when curl is downloading. But when an interrupted download is resumed, it does not read the whole existing file to hash it. Same reason discussed in commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long time with no progress being displayed. And also there's an open http request, which needs to be consumed; taking a long time to hash the file might cause it to time out. Also in passing implemented it for git and external special remotes when downloading from the web. Several others like S3 are within striking distance now as well. Sponsored-by: Dartmouth College's DANDI project	2021-08-18 15:02:22 -04:00
Joey Hess	88b63a43fa	distinguish between incremental verification failing and not being done Sponsored-by: Dartmouth College's DANDI project	2021-08-18 14:38:02 -04:00
Joey Hess	325bfda12d	refactor	2021-08-18 13:37:00 -04:00
Joey Hess	449851225a	refactor IncrementalVerifier moved to Utility.Hash, which will let Utility.Url use it later. It's perhaps not really specific to hashing, but making a separate module just for the data type seemed unnecessary. Sponsored-by: Dartmouth College's DANDI project	2021-08-18 13:19:02 -04:00
Joey Hess	f0754a61f5	plumb VerifyConfig into retrieveKeyFile This fixes the recent reversion that annex.verify is not honored, because retrieveChunks was passed RemoteVerify baser, but baser did not have export/import set up. Sponsored-by: Dartmouth College's DANDI project	2021-08-17 12:43:13 -04:00
Joey Hess	b1622eb932	incremental verify for directory special remote Added fileRetriever', which will let the remaining special remotes eventually also support incremental verify. Sponsored-by: Dartmouth College's DANDI project	2021-08-16 16:51:33 -04:00
Joey Hess	a644f729ce	refactor fileCopier Sponsored-by: Dartmouth College's DANDI project	2021-08-16 15:56:24 -04:00
Joey Hess	d889ae0c01	move comment	2021-08-16 15:25:06 -04:00
Joey Hess	aac0654ff4	handle AlreadyInUseError As happens when using the directory special remote, gitlfs, webdav, and S3. But not external, adb, gcrypt, hook, or rsync. Sponsored-by: Dartmouth College's DANDI project	2021-08-16 15:03:48 -04:00
Joey Hess	c4aba8e032	better handling of finishing up incomplete incremental verify Now it's run in VerifyStage. I thought about keeping the file handle open, and resuming reading where tailVerify left off. But that risks leaking open file handles, until the GC closes them, if the deferred verification does not get resumed. Since that could perhaps happen if there's an exception somewhere, I decided that was too unsafe. Instead, re-open the file, seek, and resume. Sponsored-by: Dartmouth College's DANDI project	2021-08-16 14:52:59 -04:00
Joey Hess	e0b7f391bd	improve tailVerify Wait for the file to get modified, not only opened. This way, if a remote does not support resuming, and opens a new file over top of the existing file, it will wait until that remote starts writing, and open the file it's writing to, not the old file. Sponsored-by: Dartmouth College's DANDI project	2021-08-16 14:47:37 -04:00
Joey Hess	e46a7dff6f	fix windows build	2021-08-13 16:36:33 -04:00
Joey Hess	16dd3dd4ca	catch more exceptions I saw this: .git/annex/tmp/SHA256E-s1234376--5ba8e06e0163b217663907482bbed57684d7188024155ddc81da0710dfd2687d: openBinaryFile: resource busy (file is locked) guess catching IO exceptions did not catch that one.	2021-08-13 16:16:46 -04:00
Joey Hess	ff2dc5eb18	INotify.removeWatch can crash Unsure why, possibly if the file has been replaced by another file.	2021-08-13 15:35:18 -04:00
Joey Hess	7503b8448b	inotify reports paths relative to directory being watched Sponsored-by: Dartmouth College's DANDI project	2021-08-13 14:51:15 -04:00
Joey Hess	e07625df8a	convert tailVerify to not finalize the verification Added failIncremental so it can force failure to verify. Sponsored-by: Dartmouth College's DANDI project	2021-08-13 13:39:02 -04:00
Joey Hess	9d533b347f	tailVerify: return deferred action when it gets behind Sponsored-by: Dartmouth College's DANDI project	2021-08-13 12:32:01 -04:00
Joey Hess	b6efba8139	add tailVerify Not yet used, but this will let all remotes verify incrementally if it's acceptable to pay the performance price. See comment for details of when it will perform badly. I anticipate using this for all special remotes that use fileRetriever. Except perhaps for a few like GitLFS that could feed the incremental verifier themselves despite using that. Sponsored-by: Dartmouth College's DANDI project	2021-08-12 14:38:02 -04:00
Joey Hess	fa62c98910	simplify and speed up Utility.FileSystemEncoding This eliminates the distinction between decodeBS and decodeBS', encodeBS and encodeBS', etc. The old implementation truncated at NUL, and the primed versions had to do extra work to avoid that problem. The new implementation does not truncate at NUL, and is also a lot faster. (Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the primed versions.) Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation, and upgrading to it will speed up to/fromRawFilePath. AFAIK, nothing relied on the old behavior of truncating at NUL. Some code used the faster versions in places where I was sure there would not be a NUL. So this change is unlikely to break anything. Also, moved s2w8 and w82s out of the module, as they do not involve filesystem encoding really. Sponsored-by: Shae Erisson on Patreon	2021-08-11 12:13:31 -04:00
Joey Hess	1acdd18ea8	deal better with clock skew situations, using vector clocks * Deal with clock skew, both forwards and backwards, when logging information to the git-annex branch. * GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1) rather than needing to be advanced each time a new change is made. * Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex. When changing a file in the git-annex branch, the vector clock to use is now determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK when set), and comparing it to the newest vector clock already in use in that file. If a newer time stamp was already in use, advance it forward by a second instead. When the clock is set to a time in the past, this avoids logging with an old timestamp, which would risk that log line later being ignored in favor of a "newer" line that is really not newer. When a log entry has been made with a clock that was set far ahead in the future, this avoids newer information being logged with an older timestamp and so being ignored in favor of that future-timestamped information. Once all clocks get fixed, this will result in the vector clocks being incremented, until finally enough time has passed that time gets back ahead of the vector clock value, and then it will return to usual operation. (This latter situation is not ideal, but it seems the best that can be done. The issue with it is, since all writers will be incrementing the last vector clock they saw, there's no way to tell when one writer made a write significantly later in time than another, so the earlier write might arbitrarily be picked when merging. This problem is why git-annex uses timestamps in the first place, rather than pure vector clocks.) Advancing forward by 1 second is somewhat arbitrary. setDead advances a timestamp by just 1 picosecond, and the vector clock could too. But then it would interfere with setDead, which wants to be overruled by any change. 
So it could use 2 picoseconds or something, but that seems weird. It could just as well advance it forward by a minute or whatever, but then it would be harder for real time to catch up with the vector clock when forward clock slew had happened. A complication is that many log files contain several different pieces of information, and it may be best to only use vector clocks for the same piece of information. For example, a key's location log file contains InfoPresent/InfoMissing for each UUID, and it only looks at the vector clocks for the UUID that is being changed, and not other UUIDs. Although exactly where the dividing line is can be hard to determine. Consider metadata logs, where a field "tag" can have multiple values set at different times. Should it advance forward past the last tag? Probably. What about when a different field is set, should it look at the clocks of other fields? Perhaps not, but currently it does, and this does not seem like it will cause any problems. Another one I'm not entirely sure about is the export log, which is keyed by (fromuuid, touuid). So if multiple repos are exporting to the same remote, different vector clocks can be used for that remote. It looks like that's probably ok, because it does not try to determine what order things occurred when there was an export conflict. Sponsored-by: Jochen Bartl on Patreon	2021-08-04 12:33:46 -04:00
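The clock selection rule described in the commit above can be sketched as follows. This is a minimal illustration, not git-annex's code: the `Clock` type (whole seconds as an `Integer`) and the function name `chooseClock` are hypothetical, and treating an equal timestamp the same as a newer one is an assumption made so that a fixed GIT_ANNEX_VECTOR_CLOCK still advances on each change.

```haskell
-- Hypothetical sketch; names and types are not taken from git-annex.

-- | A clock value in whole seconds (an assumption; real timestamps are finer).
type Clock = Integer

-- | Choose the clock for a new log line.
-- The first argument is the current time (or GIT_ANNEX_VECTOR_CLOCK when set).
-- The second is the newest clock already in use for the same piece of
-- information, if there is one.
chooseClock :: Clock -> Maybe Clock -> Clock
chooseClock now Nothing = now
chooseClock now (Just newest)
    -- An existing entry is at or ahead of our clock: advance past it by
    -- one second so the new line is not ignored as older.
    | newest >= now = newest + 1
    -- Otherwise the current time is already the newest value.
    | otherwise = now
```

With a clock set in the past, `chooseClock 100 (Just 500)` yields a value one second past the existing entry rather than logging with the stale time.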
Joey Hess	6111958440	fix test suite `14683da9eb` caused a test suite failure. When the content of a key is not present, a LinkAnnexFailed is returned, but replaceFile then tried to move the file into place, and since it was not written, that crashed. Sponsored-by: Boyd Stephen Smith Jr. on Patreon	2021-08-02 13:59:23 -04:00
Joey Hess	b3c4579c79	work around strange auto-init bug git-annex get when run as the first git-annex command in a new repo did not populate unlocked files. (Reversion in version 8.20210621) I am not entirely happy with this, because I don't understand how `428c91606b` caused the problem in the first place, and I don't fully understand how skipping calling scanAnnexedFiles during autoinit avoids the problem. Kept the explicit call to scanAnnexedFiles during git-annex init, so that when reconcileStaged is expensive, it can be made to run then, rather than at some later point when the information is needed. Sponsored-by: Brock Spratlen on Patreon	2021-07-30 18:36:03 -04:00
Joey Hess	748addbe05	remove second pass in scanAnnexedFiles The pass was needed to populate files when annex.thin was set, but in commit `73e0cbbb19`, reconcileStaged started to do that. So, this second pass is not needed any longer.	2021-07-30 17:46:11 -04:00
Joey Hess	817ccbbc47	split verifyKeyContent This avoids it calling enteringStage VerifyStage when it's used in places that only fall back to verification rarely, and which might be called while in TransferStage and be going to perform a transfer after the verification.	2021-07-29 13:58:40 -04:00
Joey Hess	897fd5c104	add note	2021-07-29 13:14:03 -04:00
Joey Hess	067a9c70c7	simplify code	2021-07-29 12:28:13 -04:00
Joey Hess	3e0b210039	remove unnecessary debugs Keeping the ones in Annex.InodeSentinal	2021-07-29 12:19:37 -04:00
Joey Hess	73e0cbbb19	fix problem populating pointer files This is a result of an audit of every use of getInodeCaches, to find places that misbehave when the annex object is not in the inode cache, despite pointer files for the same key being in the inode cache. Unfortunately, that is the case for objects that were in v7 repos that upgraded to v8. Added a note about this gotcha to getInodeCaches. Database.Keys.reconcileStaged, when annex.thin is set, would fail to populate pointer files in this situation. Changed it to check if the annex object is unmodified the same way inAnnex does, falling back to a checksum if the inode cache is not recorded. Sponsored-by: Dartmouth College's Datalad project	2021-07-27 14:26:49 -04:00
Joey Hess	de482c7eeb	move verifyKeyContent to Annex.Verify The goal is that Database.Keys be able to use it; it can't use Annex.Content.Presence due to an import loop. Several other things also needed to be moved to Annex.Verify as a consequence.	2021-07-27 14:07:23 -04:00
Joey Hess	14683da9eb	fix potential race in updating inode cache Some uses of linkFromAnnex are inside replaceWorkTreeFile, which was already safe, but others use it directly on the work tree file, which was race-prone. Eg, if the work tree file was first removed, then linkFromAnnex called to populate it, the user could have re-written it in the interim. This came to light during an audit of all calls of addInodeCaches, looking for such races. All the other uses of it seem ok. Sponsored-by: Brett Eisenberg on Patreon	2021-07-27 13:08:08 -04:00
Joey Hess	e4b2a067e0	fix potential race in updating inode cache In Annex.Content, the object file was statted after pointer files were populated. But if annex.thin is set, once the pointer files are populated, the object file can potentially be modified via the hard link. So, it was possible, though seemingly very unlikely, for the inode of the modified object file to be cached. Command.Fix and Command.Fsck had similar problems, statting the work tree files after they were in place. Changed them to stat the temp file that gets moved into place. This does rely on .git/annex being on the same filesystem. If it's not, the cached inode will not be the same as the one that the temp file gets moved to. Result will be that git-annex will later need to do an expensive verification of the content of the worktree files. Note that the cross-filesystem move of the temp file already is a larger amount of extra work, so this seems acceptable. Sponsored-by: Luke Shumaker on Patreon	2021-07-27 12:29:10 -04:00
Joey Hess	3b5a3e168d	check if object is modified before starting to send it Fix bug that caused some transfers to incorrectly fail with "content changed while it was being sent", when the content was not changed. While I don't know how to reproduce the problem that several people reported, it is presumably due to the inode cache somehow being stale. So check isUnmodified', and if it's not modified, include the file's current inode cache in the set to accept, when checking for modification after the transfer. That seems like the right thing to do for another reason: The failure says the file changed while it was being sent, but if the object file was changed before the transfer started, that's wrong. So it needs to check before allowing the transfer at all if the file is modified. (Other calls to sameInodeCache or elemInodeCaches, when operating on inode caches from the database, could also be problematic if the inode cache is somehow getting stale. This does not address such problems.) Sponsored-by: Dartmouth College's Datalad project	2021-07-26 17:33:49 -04:00
Joey Hess	f195f3b541	more inode cache debugging	2021-07-26 12:57:35 -04:00
Joey Hess	0073384850	add debugging in sameInodeCache	2021-07-26 10:58:07 -04:00
Joey Hess	33a80d083a	sync --quiet * sync: When --quiet is used, run git commit, push, and pull without their usual output. * merge: When --quiet is used, run git merge without its usual output. This might also make --quiet work better for some other commands that make commits, like git-annex adjust. Sponsored-by: Kevin Mueller on Patreon	2021-07-19 11:28:47 -04:00
Joey Hess	635e7f3e26	split annexLocations To avoid mistakes like commit `0ccbed4f6f`, be explicit about the two variants of this. Incidentally avoids a small amount of overhead in calling reverse. Sponsored-by: Shae Erisson on Patreon	2021-07-16 14:17:56 -04:00
Joey Hess	0ccbed4f6f	fix oops `dd31fe7b9e` broke non-bare repos by using bare hash dirs first, oops	2021-07-15 21:01:07 -04:00
Joey Hess	dd31fe7b9e	fall back to checking lower case hash directories in normal repo Fix a bug that prevented getting content from a repository that started out as a bare repository, or had annex.crippledfilesystem set, and was converted to a non-bare repository. This unfortunately means that inAnnex check gets slowed down by a stat call in normal repos when the content is not present. Oh well, such is the cost of backwards compatibility with old mistakes. Sponsored-by: Mark Reidenbach on Patreon	2021-07-15 12:16:31 -04:00
Joey Hess	6a581f8b8b	fix init reversion when core.sharedRepository = group init: Fix misbehavior when core.sharedRepository = group that caused it to enter an adjusted branch. (Reversion in version 8.20210630) Commit `4b1b9d7a83` made init call freezeContent in case there was a hook that could prevent writing in situations where perms don't. But with the above git config, freezeContent does not prevent write at all. So init needs to do what freezeContent does with a non-shared git config. Or init could check for that config, and skip the probing, since it won't actually be preventing write to any files. But that would make init too aware of details of Annex.Perms, and also would break if the git config were changed after init. Sponsored-by: Dartmouth College's Datalad project	2021-07-12 10:15:49 -04:00
Joey Hess	9905ec19a7	add pointer to annex.security.allowed-url-schemes Sponsored-by: Kevin Mueller on Patreon	2021-07-02 10:53:45 -04:00
Joey Hess	3a14648142	dropping unused marks as dead Dropping an object with drop --unused or dropunused will mark it as dead, preventing fsck --all from complaining about it after it's been dropped from all repositories. If another repository still has a copy, it won't be treated as dead until it's also dropped from there. The drop has to use --unused, can't be --key or something else, because this indicates that the user has recently run git-annex unused. If it checked the unused log on every drop, bad things would happen when the unused log was out of date, eg a file used to be unused but then got re-added. Marking such a file as dead could be confusing. When the user uses --unused/dropunused, they must consider the unused information to be up-to-date. The particular workflow this enables is: git annex add foo git annex unannex foo git annex unused git annex drop --unused / dropunused git annex fsck --all # no warnings The docs for git-annex unannex say to use git-annex unused and dropunused, so the user should be pointed in this direction when they want to undo an accidental add. Sponsored-by: Brock Spratlen on Patreon	2021-06-25 15:22:26 -04:00
Joey Hess	df2001aa88	Improve display of errors when transfers fail Transfers from or to a local git repo could fail without a reason being given, if the content failed to verify, or if the object file's stat changed while it was being copied. Now display messages in these cases. Sponsored-by: Jack Hill on Patreon	2021-06-25 13:17:04 -04:00
Joey Hess	51c696679f	avoid using temp file size when deciding whether to retry failed transfer When stall detection is enabled, and a transfer is in progress, it would display a doubled message: (transfer already in progress, or unable to take transfer lock) (transfer already in progress, or unable to take transfer lock) That happened because the forward retry decider had a start size of 0, and an end size of whatever amount of the object the other process had downloaded. So it incorrectly thought that the transferrer process had made progress, when it had in fact immediately given up with that message. Instead, use the reported value from the progress meter. If a remote does not report progress, this will mean it doesn't forward retry, in a situation where it used to. But most remotes do report progress, and any remote that does not can be fixed to, by using watchFileSize when downloading. Also, some remotes might preallocate the temp file (eg bittorrent), so relying on statting its size at this level to get progress is dubious. The same change was made to Annex/Transfer.hs, although only Annex/TransferrerPool.hs needed to be changed to avoid the duplicate message. (An alternate fix would have been to start the retry decider with the size of the object file before downloading begins, rather than 0.) Sponsored-by: Brett Eisenberg on Patreon	2021-06-25 12:04:23 -04:00
Joey Hess	0fe550af75	fix windows build	2021-06-22 09:46:06 -04:00
Joey Hess	4b1b9d7a83	Added annex.freezecontent-command and annex.thawcontent-command configs Freeze first sets the file perms, and then runs freezecontent-command. Thaw runs thawcontent-command before restoring file permissions. This is in case the freeze command prevents changing file perms, as eg setting a file immutable does. Also, changing file perms tends to mess up previously set ACLs. git-annex init's probe for crippled filesystem uses them, so if file perms don't work, but freezecontent-command manages to prevent write to a file, it won't treat the filesystem as crippled. When the filesystem has been probed as crippled, the hooks are not used, because there seems to be no point then; git-annex won't be relying on locking annex objects down. Also, this avoids them being run when the file perms have not been changed, in case they somehow rely on git-annex's setting of the file perms in order to work. Sponsored-by: Dartmouth College's Datalad project	2021-06-21 14:40:52 -04:00
Joey Hess	ba62c3467b	remove dead code	2021-06-21 13:54:12 -04:00
Joey Hess	4eb3778aec	remove unused import	2021-06-21 12:32:36 -04:00
Joey Hess	694fe3702c	fix 2 build warnings	2021-06-21 11:27:18 -04:00
Joey Hess	d2be68907c	drop, move, mirror: when two files have the same content, honor the max numcopies and requiredcopies Eg, before with a .gitattributes like: .2 annex.numcopies=2 .1 annex.numcopies=1 And foo.1 and foo.2 having the same content and key, git-annex drop foo.1 foo.2 would succeed, leaving just 1 copy, despite foo.2 needing 2 copies. It dropped foo.1 first and then skipped foo.2 since its content was gone. Now that the keys database includes locked files, this longstanding wart can be fixed. Sponsored-by: Noam Kremen on Patreon	2021-06-15 11:38:44 -04:00
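The rule in the commit above, that a key shared by several files must honor the largest numcopies among them, can be sketched as follows. This is a simplified illustration only: `NumCopies`, `effectiveNumCopies`, and `dropAllowed` are hypothetical names, and counting copies as a plain integer is an assumption (git-annex verifies copies held by other repositories rather than trusting a count).

```haskell
-- Hypothetical sketch; names and types are not taken from git-annex.

type NumCopies = Int

-- | Effective numcopies for a key used by several files: the largest
-- per-file setting, or the given default when no file uses the key.
effectiveNumCopies :: NumCopies -> [NumCopies] -> NumCopies
effectiveNumCopies def [] = def
effectiveNumCopies _ settings = maximum settings

-- | Dropping one copy is allowed only if the copies that remain still
-- satisfy the effective numcopies of every file using the key.
-- The first argument is the per-file numcopies settings, the second is
-- the number of copies that exist before the drop.
dropAllowed :: [NumCopies] -> Int -> Bool
dropAllowed settings copiesNow =
    copiesNow - 1 >= effectiveNumCopies 1 settings
```

In the commit's example, foo.1 wants 1 copy and foo.2 wants 2, so with 2 copies in existence `dropAllowed [1, 2] 2` is `False`, whereas considering foo.1 alone, `dropAllowed [1] 2` is `True`.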
Joey Hess	0ed1369dcd	remove unused import	2021-06-15 11:31:59 -04:00
Joey Hess	af9fdf5dba	verify associated files when checking numcopies Most of this is just refactoring. But, handleDropsFrom did not verify that associated files from the keys db were still accurate, and has now been fixed to. A minor improvement to this would be to avoid calling catKeyFile twice on the same file, when getting the numcopies and mincopies value, in the common case where the same file has the highest value for both. But, it avoids checking every associated file, so it will scale well to lots of dups already. Sponsored-by: Kevin Mueller on Patreon	2021-06-15 11:14:52 -04:00
Joey Hess	0b91afb57d	avoid warning	2021-06-15 11:11:55 -04:00
Joey Hess	77517ab506	avoid nub It's O(N^2) which could matter when there are many dup files using the same key.	2021-06-15 10:48:11 -04:00
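The commit above does not say what replaced `nub`. One common alternative, sketched here under that caveat, is an order-preserving deduplication that tracks seen elements in a `Data.Set`, which costs O(N log N) at the price of requiring an `Ord` instance instead of only `Eq`. The name `dedup` is hypothetical.

```haskell
import qualified Data.Set as S

-- | Remove duplicates while keeping the first occurrence of each element.
-- Unlike Data.List.nub, which compares every element against all earlier
-- ones, this looks each element up in a Set of those already seen.
dedup :: Ord a => [a] -> [a]
dedup = go S.empty
  where
    go _ [] = []
    go seen (x:xs)
        | x `S.member` seen = go seen xs            -- already emitted, skip
        | otherwise = x : go (S.insert x seen) xs   -- first occurrence, keep
```

For a list of associated files where many entries repeat, `dedup` returns the same result as `nub` whenever the `Ord` and `Eq` instances agree.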
Joey Hess	3af4c9a29a	fix exponential blowup when adding lots of identical files This was an old problem when the files were being added unlocked, so the changelog mentions that being fixed. However, recently it's also affected locked files. The fix for locked files is kind of stupidly simple. moveAnnex already handles populating unlocked files, and only does it when the object file was not already present. So remove the redundant populateUnlockedFiles call. (That call was added all the way back in `cfaac52b88`, and has always been unnecessary.) Sponsored-by: Dartmouth College's Datalad project	2021-06-15 09:45:55 -04:00
Joey Hess	e147ae07f4	remove supportUnlocked check that is not worth its overhead moveAnnex only gets to that check if the object file was not present before. So in the case where dup files are being added repeatedly, it will only run the first time, and so there's no significant speedup from doing it; all it avoids is a single sqlite lookup. Since MVar accesses do have overhead, it's better to optimise for the common case, where unlocked files are supported. removeAnnex is less clear cut, but I think mostly is skipped running on keys when the object has already been dropped, so similar reasoning applies.	2021-06-15 09:28:56 -04:00
Joey Hess	dcd2c95249	fix windows build	2021-06-14 12:43:26 -04:00
Joey Hess	014dc63a55	avoid sometimes expensive operations when annex.supportunlocked = false This will mostly just avoid a DB lookup, so things get marginally faster. But in cases where there are many files using the same key, it can be a more significant speedup. Added overhead is one MVar lookup per call, which should be small enough, since this happens after transferring or ingesting a file, which is always a lot more work than that. It would be nice, though, to move getGitConfig to AnnexRead, which there is an open todo about.	2021-06-14 12:40:41 -04:00
Joey Hess	c4f1465a81	check symlink before reading file This is faster because when multiple files are in a directory, it gets cached.	2021-06-14 11:53:51 -04:00
Joey Hess	26a9ea12d1	handle edge case of symlink to something that is not really a pointer file That seems very unlikely to happen, but still, it's possible it could. And with the recent addition of locked files to the keys db, this could be called by places that did not call it before, so it seems even more important it's correct. Adds an extra stat of the file, and is potentially racy, but both problems are fixed by the unix-2.8.0 path. I have not tested that path builds because that package is not yet released and it would be difficult to install it since it's tightly tied to a ghc version.	2021-06-14 11:35:52 -04:00
Joey Hess	673b2feaf3	rename for clarity Associated files are recorded now also for locked files, but this is only needed to populate unlocked files.	2021-06-14 10:55:24 -04:00
Joey Hess	7b6deb1109	display scanning message whenever reconcileStaged has enough files to chew on Clear visible progress bar first. Removed showSideActionAfter because it can't be used in reconcileStaged (import loop). Instead, it counts the number of files it processes and displays it after it's seen a sufficient number to know it's taking a while. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 12:48:30 -04:00
Joey Hess	13b9a288d3	scanAnnexedFiles in smudge --update This makes git checkout and git merge hooks do the work to catch up with changes that they made to the tree. Rather than doing it at some later point when the user is not thinking about that past operation. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 11:37:47 -04:00
Joey Hess	7f742589f9	claw back annexed file scan speedup Following commit `c941ab6f5b`, this avoids the second, redundant scan when annex.thin is not set. The benchmark now runs in 35.5 seconds, down from 40 seconds. Note that the inode cache of the annex object has to be passed to addInodeCaches now, because it might not already be in the inode caches, unlike previously. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 11:09:15 -04:00
Joey Hess	c941ab6f5b	avoid double work in git-annex init, second try reconcileStaged populates the db, so scanAnnexedFiles does not need to do it again. It still makes a pass over the HEAD tree, but populating the db was most of the expensive part. Benchmarking with 100,000 files, git-annex init now takes 40 seconds, vs 37 seconds with the old, buggy version of this fix. It should be possible to win those 3 precious seconds per 100k files back, in the case when annex.thin is not set, with improvements to reconcileStaged that avoid needing this second pass. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 09:36:53 -04:00
Joey Hess	2cb7b7b336	Revert "avoid double work in git-annex init" This reverts commit `0f10f208a7`. The implementation of this turns out to be unsafe; it can lead to a keys db deadlock. scanAnnexedFiles injects a call to inAnnex into reconcileStaged, but inAnnex sometimes needs to read from the keys db, which will try to re-open it when it's in the process of being opened. The exclusive lock of gitAnnexKeysDbLock will then deadlock. This needs to be done in some other way...	2021-06-08 09:11:24 -04:00
Joey Hess	0f10f208a7	avoid double work in git-annex init reconcileStaged was doing a redundant scan to scanAnnexedFiles. It would probably make sense to move the body of scanAnnexedFiles into reconcileStaged, the separation does not really serve any purpose. Sponsored-by: Dartmouth College's Datalad project	2021-06-07 16:50:14 -04:00
Joey Hess	0434674c85	avoid displaying the scanning annexed files message when repo is not large Avoids users thinking this scan is a big deal, when it's not in the majority of repos. showSideActionAfter has some ugly caveats, since it has to display in the background of another action. I could not see a better way to do it and it works fine in this particular case. It also doesn't really belong in Annex.Concurrent, but cannot go in Messages due to an import loop. Sponsored-by: Dartmouth College's Datalad project	2021-06-04 13:16:48 -04:00
Joey Hess	0f54e5e0ae	speed up initial scanning for annexed files Streaming through git this way speeds it up by around 25%. This is similar to the optimisations of seeking annexed files. Sponsored-by: Dartmouth College's Datalad project	2021-05-31 14:29:34 -04:00
Joey Hess	aa00e171cb	annex.supportunlocked should not prevent scan for annexed files That scan used to be only for unlocked files, but no longer..	2021-05-31 10:51:39 -04:00
Joey Hess	189fb05ffb	Added annex.adviceNoSshCaching config. Sponsored-by: Brock Spratlen on Patreon	2021-05-27 12:37:49 -04:00
Joey Hess	cedc28a783	prevent dropping required content of other file using same content When two files have the same content, and a required content expression matches one but not the other, dropping the latter file will fail as it would also remove the content of the required file. This will slow down drop (w/o --auto), dropunused, mirror, and move, by one keys db lookup per file. But I did include an optimisation to avoid a double db lookup in the drop --auto / sync --content case. I suspect that dropunused could also use PreferredContentChecked True, but haven't entirely thought it through and it's rarely used with enough files for the optimisation to matter. Sponsored-by: Dartmouth College's Datalad project	2021-05-25 11:34:06 -04:00
Joey Hess	f46e4c9b7c	fix case where keys db was not initialized in time When the keys db is opened for read, and did not exist yet, it used to skip creating it, and return mempty values. But that prevents reconcileStaged from populating associated files information in time for the read. This fixes the one remaining case I know of where the fix in `a56b151f90` didn't work. Note that, when there is a permissions error, it still avoids creating the db and returns mempty for all queries. This does mean that reconcileStaged does not run and so it may want to drop files that it should not. However, presumably a permissions error on the keys database also means that the user does not have permission to delete annex objects, so they won't be able to drop the files anyway. Sponsored-by: Dartmouth College's Datalad project	2021-05-24 14:46:59 -04:00
Joey Hess	a56b151f90	fix longstanding indeterminate preferred content for duplicated file problem * drop: When two files have the same content, and a preferred content expression matches one but not the other, do not drop the file. * sync --content, assistant: Fix an edge case where a file that is not preferred content did not get dropped. The sync --content edge case is that handleDropsFrom loaded associated files and used them without verifying that the information from the database was not stale. It seemed best to avoid changing --want-drop's behavior, this way when debugging a preferred content expression with it, the files matched will still reflect the expression. So added a note to the --want-drop documentation, to make clear it may not behave identically to git-annex drop --auto. While it would be possible to introspect the preferred content expression to see if it matches on filenames, and only look up the associated files when it does, it's generally fairly rare for 2 files to have the same content, and the database lookup is already avoided when there's only 1 file, so I did not implement that further optimisation. Note that there are still some situations where the associated files database does not get locked files recorded in it, which will prevent this fix from working. Sponsored-by: Dartmouth College's Datalad project	2021-05-24 14:07:05 -04:00
Joey Hess	428c91606b	include locked files in the keys database associated files Before only unlocked files were included. The initial scan now scans for locked as well as unlocked files. This does mean it gets a little bit slower, although I optimised it as well as I think it can be. reconcileStaged changed to diff from the current index to the tree of the previous index. This lets it handle deletions as well, removing associated files for both locked and unlocked files, which did not always happen before. On upgrade, there will be no recorded previous tree, so it will diff from the empty tree to current index, and so will fully populate the associated files, as well as removing any stale associated files that were present due to them not being removed before. reconcileStaged now does a bit more work. Most of the time, this will just be due to running more often, after some change is made to the index, and since there will be few changes since the last time, it will not be a noticeable overhead. What may turn out to be a noticeable slowdown is after changing to a branch, it has to go through the diff from the previous index to the new one, and if there are lots of changes, that could take a long time. Also, after adding a lot of files, or deleting a lot of files, or moving a large subdirectory, etc. Command.Lock used removeAssociatedFile, but now that's wrong because a newly locked file still needs to have its associated file tracked. Command.Rekey used removeAssociatedFile when the file was unlocked. It could remove it also when it's locked, but it is not really necessary, because it changes the index, and so the next time git-annex runs and accesses the keys db, reconcileStaged will run and update it. There are probably several other places that use addAssociatedFile and don't need to any more for similar reasons. But there's no harm in keeping them, and it probably is a good idea to, if only to support mixing this with older versions of git-annex. 
However, mixing this and older versions does risk reconcileStaged not running, if the older version already ran it on a given index state. So it's not a good idea to mix versions. This problem could be dealt with by changing the name of the gitAnnexKeysDbIndexCache, but that would leave the old file dangling, or it would need to keep trying to remove it.	2021-05-21 16:24:37 -04:00
Joey Hess	8b6dad11a2	add createMessage init: When annex.commitmessage is set, use that message for the commit that creates the git-annex branch. This will be used by filter-branch too, and it seems to make sense to let annex.commitmessage affect it.	2021-05-17 13:07:47 -04:00
Joey Hess	1da9fe5bd8	implemented filter-branch for key info Not tested yet but should work. Noted a possible optimisation, which should probably be added, to speed it up in cases where there is no uuid filtering being done. It would need Annex.Branch to add a function like getRef that uses catFileDetails, so the sha is also returned. The difficulty would be making it support the precached file content; if it didn't it would probably not be any faster and could even be slower. So probably the precaching would need to be changed to also cache the sha.	2021-05-17 11:11:39 -04:00
Joey Hess	4ff8a1ae2b	refactoring filterBranch should be reusable for copy-branch command. Changed LogVariety to differentiate between LocationLog and UrlLog; only location logs contain uuids and need to be filtered by uuid, while url logs do not. This does not change current behavior, but it will let filterBranch be reused without filtering url logs incorrectly.	2021-05-13 14:43:25 -04:00
Joey Hess	947d2a10bc	assistant: Fix a crash on startup by avoiding using forkProcess ghc 8.8.4 seems to have changed something that broke code that has been successfully using forkProcess since 2012. Likely a change to GC internals. Since forkProcess has never had clear documentation about how to use it safely, avoid using it at all. Instead, when git-annex needs to daemonize itself, re-run the git-annex command, in a new process group and session. This commit was sponsored by Luke Shumaker on Patreon.	2021-05-12 15:08:03 -04:00
Joey Hess	4bf7940d6b	fileRef: make paths relative and simplified Fix behavior of several commands, including reinject, addurl, and rmurl when given an absolute path to an unlocked file, or a relative path that leaves and re-enters the repository. To avoid slowing down all the cases where the paths are already ok with an unnecessary call to getCurrentDirectory, put in an optimisation in relPathCwdToFile. That will probably also speed up other parts of git-annex by some small amount, but I have not benchmarked. Note that I did not convert branchFileRef, because it seems likely that it will be used with a file that is not provided by the user, so is already in a sane format. This is certainly true for the way git-annex uses it, though maybe arguable to the extent Git.Ref is a reusable library.	2021-05-07 13:25:59 -04:00
Joey Hess	4588668a12	fromkey unlocked files support fromkey: Create an unlocked file when used in an adjusted branch where the file should be unlocked, or when configured by annex.addunlocked. There is some overlap with code in Annex.Ingest, however it's not quite the same because ingesting has a temp file with the content, where here the content, if any, is in the annex object file. So it eg, makes sense for Annex.Ingest to copy the execute mode of the content file, but it does not make sense for fromkey to do that. Also changed in passing to stage the file in git directly, rather than using git add. One consequence of that is that if the file is gitignored, it will still get added, rather than the old behavior: The following paths are ignored by one of your .gitignore files: ignored hint: Use -f if you really want to add them. hint: Turn this message off by running hint: "git config advice.addIgnoredFile false" git-annex: user error (xargs ["-0","git","--git-dir=.git","--work-tree=.","--literal-pathspecs","add","--"] exited 123) That old behavior was a surprise to me, and so I consider it a bug, and doubt anyone would have relied on it. Note that, when on an --hide-missing branch, it is possible to fromkey a key that is not present (needs --force). The annex link or pointer file still gets written in this case. It doesn't seem to make any sense not to write it, because then fromkey would not do anything useful in this case, and this way the file can be committed and synced to master, and the branch re-adjusted to hide the new missing file. This commit was sponsored by Noam Kremen on Patreon.	2021-05-03 11:26:18 -04:00
Joey Hess	4edde98709	improve message Pluralize copies appropriately. This commit was sponsored by Mark Reidenbach on Patreon.	2021-04-27 13:44:08 -04:00
Joey Hess	a166d2520b	check mincopies is satisfied even when numcopies is known to be satisfied I had been assuming that numcopies would be a larger or at most equal to mincopies, so no need to check both. But users get confused and use configs that don't really make sense, so make sure to handle mincopies being larger than numcopies. Also add something to the mincopies man page to discourage this misconfiguration. This commit was sponsored by Denis Dzyubenko on Patreon.	2021-04-27 13:37:18 -04:00
Joey Hess	32138b8cd8	implement annex.privateremote and remote.name.private configs The slightly unusual parsing in Types.GitConfig avoids the need to look at the remote list to get configs of remotes. annexPrivateRepos combines all the configs, and will only be calculated once, so it's nice and fast. privateUUIDsKnown and regardingPrivateUUID now need to read from the annex mvar, so are not entirely free. But that overhead can be optimised away, as seen in getJournalFileStale. The other call sites didn't seem worth optimising to save a single MVar access. The feature should have imperceptible speed overhead when not being used.	2021-04-23 14:21:57 -04:00
Joey Hess	d5a05655b4	Merge branch 'master' into hiddenannex	2021-04-23 13:06:33 -04:00
Joey Hess	657d55c401	convert withKnownUrls to use overBranchFileContents This only partly fixes importfeed to see journalled files, since it separately cats metadata directly from the branch. Held off on a changelog for a bug fix until that's dealt with.	2021-04-23 11:32:25 -04:00
Joey Hess	c687eae80b	got private repos really working This new TODO will need private indexes to resolve; until then the private journal has to be checked when private UUIDs are known.	2021-04-21 16:26:23 -04:00
Joey Hess	d0c5f6d2f0	optimisation Avoid trying to read private journal files when no private uuids are known.	2021-04-21 16:02:56 -04:00
Joey Hess	24eeacdba8	adapt recent bug fixes to support private journal At this point, private repos should mostly work, except for a few commands that directly read from the git-annex branch and will not see the private journal. Private index not yet implemented.	2021-04-21 16:01:13 -04:00
Joey Hess	0bb57702e1	Merge branch 'master' into hiddenannex	2021-04-21 15:45:12 -04:00
Joey Hess	653b719472	fix --all to include not yet committed files from the journal Fix bug caused by recent optimisations that could make git-annex not see recently recorded status information when configured with annex.alwayscommit=false. This does mean that --all can end up processing the same key more than once, but before the optimisations that introduced this bug, it used to also behave that way. So I didn't try to fix that; it's an edge case and anyway git-annex behaves well when run on the same key repeatedly. I am not too happy with the use of an MVar to buffer the list of files in the journal. I guess it doesn't defeat lazy streaming of the list, if that list is actually generated lazily, and anyway the size of the journal is normally capped and small, so if configs are changed to make it huge and this code path fires, git-annex using enough memory to buffer it all is not a large problem.	2021-04-21 15:40:32 -04:00

1997 commits