git-annex

Author	SHA1	Message	Date
Joey Hess	cbe12b9bc3	force fully strict read of journal file again I was thinking that discardIncompleteAppend would make it strict, since it looks at the end of the bytestring. But, it's applied lazily.. This probably fixes windows, which was failing: git-annex.exe: .git\annex\journal\trust.log: DeleteFile "\\\\?\\C:\\Users\\runneradmin\\.t\\5\\tmprepo22\\.git\\annex\\journal\\trust.log": permission denied (The process cannot access the file because it is being used by another process.)	2022-07-22 11:36:21 -04:00
Joey Hess	4e88137a28	prevent appends except when annex.alwayscompact=false I would like for a new repo version to enable appends, but to do so safely would need a v11 followed by a 1 year delay followed by a v12 that does it. Since a similar v9 and v10 transition is currently happening, and is less than 6 months along in most repos, it does not feel wise to stack up another year-long transition behind that. What if I need to hurry up a new repo version for some other change? Added todo so I remember to make this change at some time when a v11 and probably v12 repo version do make sense. Sponsored-by: Dartmouth College's DANDI project	2022-07-20 13:23:55 -04:00
Joey Hess	d275874e6c	handling of interrupted appends An append that is interrupted and writes part of a line is now dealt with by subsequent reads and appends. This also handles a read that happens at the same time as an append to the file. Old versions of git-annex will still see a partially written line, and could get confused. Since appends are currently done for url logs and location logs, the confusion is limited to a substring of the actual url or UUID of the remote being read. This will not affect writes, since the journal file is locked when reading in preparation for writing. However, the bad data can be output by git-annex and used by other things, or could cause surprising behavior by git-annex. Including eg, downloading the content of the wrong url. So, something needs to be done to prevent old versions of git-annex from running in a repository where this appending is being done.. Sponsored-by: Dartmouth College's DANDI project	2022-07-20 12:40:49 -04:00
Joey Hess	6f1fd3abdd	no locking of journal on read after all Finally have a final design, and it turns out not to need locking on read.	2022-07-20 10:57:28 -04:00
Joey Hess	d0860b7f0e	fix build After `28b0aaea54`	2022-07-18 16:44:32 -04:00
Joey Hess	28b0aaea54	re-add lock journal before reading journal files This reverts commit `2e6e9876e3`. This is gonna be needed after all.. The append will only be atomic if the journal is locked, because the file being appended will have to be moved out of the way to avoid an old version of git-annex seeing an incomplete write to it. When git-annex finds that the file is not in the journal, and checks the append location, locking will be needed to avoid a race causing it to miss it in the append location too due to it being moved back to the journal.	2022-07-18 16:40:25 -04:00
Joey Hess	36f0bdcd57	add annex.alwayscompact Added annex.alwayscompact setting which can be unset to speed up writes to the git-annex branch in some cases. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 16:39:19 -04:00
Joey Hess	ccff639651	Merge branch 'master' into append	2022-07-18 14:17:15 -04:00
Joey Hess	de18d92de6	efficient but unsafe journal file append This is only for checking performance, it's not safe. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 14:17:12 -04:00
Joey Hess	1c40b927aa	minor optimisation Avoid re-writing the file when the journal directory did not exist.	2022-07-18 13:50:35 -04:00
Joey Hess	2e6e9876e3	Revert "lock journal before reading journal files" This reverts commit `47358a6f95`. This added overhead, and will not be needed, because appends are going to have to be made atomic for other reasons than avoiding incomplete reads of data being appended. In particular, when git-annex is interrupted in the middle of an append, it must not leave the file with a partially written line. So appending has to somehow be made fully atomic.	2022-07-18 13:38:12 -04:00
Joey Hess	ce455223df	split out appending to journal from writing, high level only Currently this is not an improvement, but it allows for optimising appendJournalFile later. With an optimised appendJournalFile, this will greatly speed up access patterns like git-annex addurl of a lot of urls to the same key, where the log file can grow rather large. Appending rather than re-writing the journal file for each line can save a lot of disk writes. It still has to read the current journal or branch file, to check if it can append to it, and so when the journal file does not exist yet, it can write the old content from the branch to it. Probably the re-reads are better cached by the filesystem than repeated writes. (If the re-reads turn out to keep performance bad, they could be eliminated, at the cost of not being able to compact the log when replacing old information in it. That could be enabled by a switch.) While the immediate need is to affect addurl writes, it was implemented at the level of presence logs, so will also perhaps speed up location logs. The only added overhead is the call to isNewInfo, which only needs to compare ByteStrings. Helping to balance that out, it avoids compactLog when it's able to append. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 13:22:50 -04:00
Joey Hess	47358a6f95	lock journal before reading journal files This is not currently necessary; journal files are updated atomically. However, for faster appends to large journal files, locking on read will be needed, because appends are not atomic. Sponsored-by: Dartmouth College's DANDI project	2022-07-15 14:43:29 -04:00
Joey Hess	a2b1f369d1	disable journalIgnorable in enableInteractiveBranchAccess Fix a reversion that prevented --batch commands (and the assistant) from noticing data written to the journal by other commands. I have not identified which commit broke this for sure, but probably it was `aeca7c2207`. --batch commands that wrote to the journal avoided the problem since journalIgnorable gets unset on write. It's a little bit surprising that nobody noticed that query --batch commands did not see data written by other commands. Sponsored-by: Dartmouth College's DANDI project	2022-07-15 13:48:41 -04:00
Joey Hess	91abd872d3	complete a comment	2022-07-15 12:59:59 -04:00
Joey Hess	ad467791c1	optimise journal writes to not mkdir journal directory when it already exists Sponsored-by: Dartmouth College's DANDI project	2022-07-14 12:29:39 -04:00
Joey Hess	1b680d330b	revert accidental change	2022-07-13 15:17:08 -04:00
Joey Hess	68e9b7f987	comment	2022-07-13 13:44:43 -04:00
Joey Hess	f58fb6a79a	fix build when dbus is enabled Broken in commit `8040ecf9b8`	2022-07-05 13:06:45 -04:00
Joey Hess	8040ecf9b8	final readonly values moves to AnnexRead At this point I've checked all AnnexState values and these were all that remained that could move. Pity that Annex.repo can't move, but it gets modified sometimes.. A couple of AnnexState values are set by options and could be AnnexRead, but happen to use Annex when being set. Sponsored-by: Max Thoursie on Patreon	2022-06-28 16:04:58 -04:00
Joey Hess	cb9cf30c48	move several readonly values to AnnexRead This improves performance to a small extent in several places. Sponsored-by: Tobias Ammann on Patreon	2022-06-28 15:40:19 -04:00
Joey Hess	debcf86029	use RawFilePath version of rename Some small wins, almost certainly swamped by the system calls, but still worthwhile progress on the RawFilePath conversion. Sponsored-by: Erik Bjäreholt on Patreon	2022-06-22 16:47:34 -04:00
Joey Hess	d00e23cac9	RawFilePath optimisations	2022-06-22 16:20:08 -04:00
Joey Hess	224a57f9ed	RawFilePath optimisation	2022-06-22 16:11:03 -04:00
Joey Hess	95a04920cf	remove objectDir'	2022-06-22 16:08:49 -04:00
Joey Hess	f80ec74128	RawFilePath optimisation	2022-06-22 16:08:26 -04:00
Joey Hess	78a3d44ea0	get rid of racy addLink The remaining callers all did not rely on it checking gitignore, so were easy to convert. They were susceptible to the same overwrite race as add and fix, although less likely to have it and a narrower window than add's race. Command.Rekey in passing got an unnecessary call to removeFile deleted. addSymlink handles deleting any existing worktree file.	2022-06-14 14:47:15 -04:00
Joey Hess	7ace804d8e	avoid writing same symlink twice in a row Oddly, the second write did not cause it to lose the mtime inherited from the file being added, although the mtime was not provided to that write but only to the first. I don't quite know why that worked before!	2022-06-14 14:30:12 -04:00
Joey Hess	5ef79125ad	fix overwrite race with git-annex add of annex symlink In the unlikely case where git-annex add is run on an annex symlink that is not already added, and while it's processing it, the annex symlink is overwritten with something else, avoid git-annex overwriting that with the symlink again. Sponsored-by: Jack Hill on Patreon	2022-06-14 14:00:13 -04:00
Joey Hess	dd6dec4eb1	fix add overwrite race with git-annex add to annex This is not a complete fix for all such races, only the one where a large file gets changed while adding and gets added to git rather than to the annex. addLink needs to go away, any caller of it is probably subject to the same kind of race. (Also, addLink itself fails to check gitignore when symlinks are not supported.) ingestAdd no longer checks gitignore. (It didn't check it consistently before either, since there were cases where it did not run git add!) When git-annex import calls it, it's already checked gitignore itself earlier. When git-annex add calls it, it's usually on files found by withFilesNotInGit, which handles checking ignores. There was one other case, when git-annex add --batch calls it. In that case, old git-annex behaved rather badly, it would seem to add the file, but git add would later fail, leaving the file as an unstaged annex symlink. That behavior has also been fixed. Sponsored-by: Brett Eisenberg on Patreon	2022-06-14 13:37:19 -04:00
Joey Hess	c59ea5b1ca	info: Added --autoenable option Use cases include using git-annex init --no-autoenable and then going back and enabling the special remotes that have autoenable configured. As well as just querying to remember which ones have it enabled. It lists all special remotes that have autoenable=yes whether currently enabled or not. And it can be used with --json. I pondered making this "git-annex info autoenable", but that seemed wrong because then if the use has a directory named "autoenable", it's unclear what they are asking for. (Although "git-annex info remote" may be similarly unclear.) Making it an option does mean that it can't be provided via --batch though. Sponsored-by: Dartmouth College's Datalad project	2022-06-01 14:20:38 -04:00
Joey Hess	f35c551d35	make path absolute for display Avoid suggesting the user add "." to safe.directory.	2022-05-31 12:17:27 -04:00
Joey Hess	478ed28f98	revert windows-specific locking changes that broke tests This reverts windows-specific parts of `5a98f2d509`. There were no code paths in common between windows and unix, so this will return Windows to the old behavior. The problem that the commit talks about has to do with multiple different locations where git-annex can store annex object files, but that is not too relevant to Windows anyway, because on windows the filesystem is always treated as crippled and/or symlinks are not supported, so it will only use one object location. It would need to be using a repo populated in another OS to have the other object location in use probably. Then a drop and get could possibly lead to a dangling lock file. And, I was not able to actually reproduce that situation happening before making that commit, even when I forced a race. So making these changes on windows was just begging trouble.. I suspect that the change that caused the reversion is in Annex/Content/Presence.hs. It checks if the content file exists, and then called modifyContentDirWhenExists, which seems like it would not fail, but if something deleted the content file at that point, that call would fail. Which would result in an exception being thrown, which should not normally happen from a call to inAnnexSafe. That was a windows-specific change; the unix side did not have an equivalent change. Sponsored-by: Dartmouth College's Datalad project	2022-05-23 13:21:26 -04:00
Joey Hess	63624c40a0	fix typo in comment	2022-05-23 12:53:55 -04:00
Joey Hess	af0d854460	deal with git's changes for CVE-2022-24765 Deal with git's recent changes to fix CVE-2022-24765, which prevent using git in a repository owned by someone else. That makes git config --list not list the repo's configs, only global configs. So annex.uuid and annex.version are not visible to git-annex. It displayed a message about that, which is not right for this situation. Detect the situation and display a better message, similar to the one other git commands display. Also, git-annex init when run in that situation would overwrite annex.uuid with a new one, since it couldn't see the old one. Add a check to prevent it running too in this situation. It may be that this fix has security implications, if a config set by the malicious user who owns the repo causes git or git-annex to run code. I don't think any git-annex configs get run by git-annex init. It may be that some git config of a command does get run by one of the git commands that git-annex init runs. ("git status" is the command that prompted the CVE-2022-24765, since core.fsmonitor can cause it to run a command). Since I don't know how to exploit this, I'm not treating it as a security fix for now. Note that passing --git-dir makes git bypass the security check. git-annex does pass --git-dir to most calls to git, which it does to avoid needing chdir to the directory containing a git repository when accessing a remote. So, it's possible that somewhere in git-annex it gets as far as running git with --git-dir, and git reads some configs that are unsafe (what CVE-2022-24765 is about). This seems unlikely, it would have to be part of git-annex that runs in git repositories that have no (visible) annex.uuid, and git-annex init is the only one that I can think of that then goes on to run git, as discussed earlier. But I've not fully ruled out there being others.. 
The git developers seem mostly worried about "git status" or a similar command implicitly run by a shell prompt, not an explicit use of git in such a repository. For example, Ævar Arnfjörð Bjarma wrote: > * There are other bits of config that also point to executable things, > e.g. core.editor, aliases etc, but nothing has been found yet that > provides the "at a distance" effect that the core.fsmonitor vector > does. > > I.e. a user is unlikely to go to /tmp/some-crap/here and run "git > commit", but they (or their shell prompt) might run "git status", and > if you have a /tmp/.git ... Sponsored-by: Jarkko Kniivilä on Patreon	2022-05-20 14:38:27 -04:00
Joey Hess	aa414d97c9	make fsck normalize object locations The purpose of this is to fix situations where the annex object file is stored in a directory structure other than where annex symlinks point to. But it will also move object files from the hashdirmixed back to hashdirlower if the repo configuration makes that the normal location. It would have been more work to avoid that than to let it do it. Sponsored-by: Dartmouth College's Datalad project	2022-05-16 15:38:06 -04:00
Joey Hess	6b5029db29	fix hardcoding of number of hash directories It can be changed to 1 via a tuning, rather than the 2 this assumed. So it would have tried to rmdir .git/annex/objects in that case, which would not hurt anything, but is not what it is supposed to do. Sponsored-by: Dartmouth College's Datalad project	2022-05-16 15:08:42 -04:00
Joey Hess	5a98f2d509	avoid creating content directory when locking content If the content directory does not exist, then it does not make sense to lock the content file, as it also does not exist, and so it's ok for the lock operation to fail. This avoids potential races where the content file exists but is then deleted/renamed, while another process sees that it exists and goes to lock it, resulting in a dangling lock file in an otherwise empty object directory. Also renamed modifyContent to modifyContentDir since it is not only necessarily used for modifying content files, but also other files in the content directory. Sponsored-by: Dartmouth College's Datalad project	2022-05-16 12:34:56 -04:00
Joey Hess	e8a601aa24	incremental verification for retrieval from import remotes Sponsored-by: Dartmouth College's Datalad project	2022-05-09 15:39:43 -04:00
Joey Hess	2f2701137d	incremental verification for retrieval from all export remotes Only for export remotes so far, not export/import. Sponsored-by: Dartmouth College's Datalad project	2022-05-09 13:49:33 -04:00
Joey Hess	90950a37e5	support incremental verification when retrieving from export/import remotes None of the special remotes do it yet, but this lays the groundwork. Added MustFinishIncompleteVerify so that, when an incremental verify is started but not complete, it can be forced to finish it. Otherwise, it would have skipped doing it when verification is disabled, but verification must always be done when retrieving from export remotes since files can be modified during retrieval. Note that retrieveExportWithContentIdentifier doesn't support incremental verification yet. And I'm not sure if it can -- it doesn't know the Key before it downloads the content. It seems a new API call would need to be split out of that, which is provided with the key. Sponsored-by: Dartmouth College's Datalad project	2022-05-09 12:25:04 -04:00
Joey Hess	8675b2b075	rename memoryUnits It's not just used for memory sizes.	2022-05-05 15:35:11 -04:00
Joey Hess	d266a41f8d	prevent numcopies or mincopies being configured to 0 Ignore annex.numcopies set to 0 in gitattributes or git config, or by git-annex numcopies or by --numcopies, since that configuration would make git-annex easily lose data. Same for mincopies. This is a continuation of the work to make data only be able to be lost when --force is used. It earlier led to the --trust option being disabled, and similar reasoning applies here. Most numcopies configs had docs that strongly discouraged setting it to 0 anyway. And I can't imagine a use case for setting to 0. Not that there might not be one, but it's just so far from the intended use case of git-annex, of managing and storing your data, that it does not seem like it makes sense to cater to such a hypothetical use case, where any git-annex drop can lose your data at any time. Using a smart constructor makes sure every place avoids 0. Note that this does mean that NumCopies is for the configured desired values, and not the actual existing number of copies, which of course can be 0. The name configuredNumCopies is used to make that clear. Sponsored-by: Brock Spratlen on Patreon	2022-03-28 15:20:34 -04:00
Joey Hess	982eb7ed0d	remove vendored http-client-restricted Removed vendored copy of http-client-restricted, and removed the HttpClientRestricted build flag that avoided that dependency. http-client-restricted is in Debian stable, and the i386ancient build also uses it, so I think this vendored copy is no longer needed. Sponsored-by: Noam Kremen on Patreon	2022-03-22 11:50:06 -04:00
Joey Hess	952664641a	turn off PackageImports in cabal file This makes it easier to build eg benchmarks of individual modules. May be that most of these PackageImports are not really necessary, dunno.	2022-02-25 13:16:36 -04:00
Joey Hess	51c528980c	avoid accidentally thawing git-annex symlink It did nothing, since at this point the link is dangling. But when there is a thaw hook, it would probably not be happy to be asked to run on a symlink, or might do something unexpected. Sponsored-by: Dartmouth College's Datalad project	2022-02-24 14:21:23 -04:00
Joey Hess	f4b046252a	Run annex.thawcontent-command before deleting an object file In case annex.freezecontent-command did something that would prevent deletion. Sponsored-by: Dartmouth College's Datalad project	2022-02-24 14:11:02 -04:00
Joey Hess	346007a915	add debugging of freeze and thaw	2022-02-24 14:01:29 -04:00
Joey Hess	28bc5ce232	ignore write bits being set when there is a freeze hook When annex.freezecontent-command is set, and the filesystem does not support removing write bits, avoid treating it as a crippled filesystem. The hook may be enough to prevent writing on its own, and some filesystems ignore attempts to remove write bits. Sponsored-by: Dartmouth College's Datalad project	2022-02-24 13:28:31 -04:00
Joey Hess	64ccb4734e	smudge: Warn when encountering a pointer file that has other content appended to it It will then proceed to add the file the same as if it were any other file containing possibly annexable content. Usually the file is one that was annexed before, so the new, probably corrupt content will also be added to the annex. If the file was not annexed before, the content will be added to git. It's not possible for the smudge filter to throw an error here, because git then just adds the file to git anyway. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 15:17:08 -04:00
Joey Hess	67245ae00f	fully specify the pointer file format This format is designed to detect accidental appends, while having some room for future expansion. Detect when an unlocked file whose content is not present has gotten some other content appended to it, and avoid treating it as a pointer file, so that appended content will not be checked into git, but will be annexed like any other file. Dropped the max size of a pointer file down to 32kb, it was around 80 kb, but without any good reason and certainly there are no valid pointer files anywhere that are larger than 8kb, because it's just been specified what it means for a pointer file with additional data even looks like. I assume 32kb will be good enough for anyone. ;-) Really though, it needs to be some smallish number, because that much of a file in git gets read into memory when eg, catting pointer files. And since we have no use cases for the extra lines of a pointer file yet, except possibly to add some human-visible explanation that it is a git-annex pointer file, 32k seems as reasonable an arbitrary number as anything. Increasing it would be possible, eg to 64k, as long as users of such jumbo pointer files didn't mind upgrading all their git-annex installations to one that supports the new larger size. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 14:20:31 -04:00
Joey Hess	5b373a9dd2	read a consistent amount from pointer file A few places were reading the max symlink size of a pointer file, then passing to parseLinkTargetOrPointer. Which is fine currently, but to support pointer files with lines of data after the pointer, enough has to be read that parseLinkTargetOrPointer can be assured of seeing enough of that data to know if it's correctly formatted. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 12:52:34 -04:00
Joey Hess	4cd9325c2c	fold parseLinkTarget into parseLinkTargetOrPointer Only one place remained that differentiated between them. It is the case that a symlink target that happens to contain a newline somehow will be treated as a link to a key truncated at the newline. This is super unlikely to happen, and since a key cannot actually contain a newline, it's as good a behavior as any. Anyway, this commit does not change the behavior there, although arguably it should be changed. Note that getAnnexLinkTarget does prevent a symlink target containing a newline. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 12:30:32 -04:00
Joey Hess	ce1b3a9699	info: Allow using matching options in more situations File matching options like --include will be rejected in situations where there is no filename to match against. (Or where there is a filename but it's not relative to the cwd, or otherwise seemed too bothersome to match against.) The addition of listKeys' was necessary to avoid using more memory in the common case of "git-annex info". Adding a filterM would have caused the list to buffer in memory and not stream. This is an ugly hack, but listKeys had previously run Annex operations inside unsafeInterleaveIO (for direct mode). And matching against a matcher should hopefully not change any Annex state. This does allow for eg `git-annex info somefile --include=*.ext` although why someone would want to do that I don't really know. But it seems to make sense to allow it. But, consider: `git-annex info ./somefile --include=somefile` This does not match, so will not display info about somefile. If the user really wants to, they can `--include=./somefile`. Using matching options like --copies or --in=remote seems likely to be slower than git-annex find with those options, because unlike such commands, info does not have optimised streaming through the matcher. Note that `git-annex info remote` is not the same as `git-annex info --in remote`. The former shows info about all files in the remote. The latter shows local keys that are also in that remote. The output should make that clear, but this still seems like a point where users could get confused. Sponsored-by: Jochen Bartl on Patreon	2022-02-21 14:46:07 -04:00
Joey Hess	faf84aa5c2	Avoid git status taking a long time after git-annex unlock of many files. Implemented by making Git.Queue have a FlushAction, which can accumulate along with another action on files, and runs only once the other action has run. This lets git-annex unlock queue up git update-index actions, without conflicting with the restagePointerFiles FlushActions. In a repository with filter-process enabled, git-annex unlock will often not take any more time than before, though it may when the files are large. Either way, it should always slow down less than git-annex status speeds up. When filter-process is not enabled, git-annex unlock will slow down as much as git status speeds up. Sponsored-by: Jochen Bartl on Patreon	2022-02-18 15:06:40 -04:00
Joey Hess	21e40b86d8	have v9 autoupgrade to v10 This was right before commit `a27776f602`, which made v6 v7 autoupgrade to v8 but not yet to v10. Sponsored-by: Dartmouth College's Datalad project	2022-01-26 13:16:06 -04:00
Joey Hess	a27776f602	init --version=6 upgrade to 8 not yet 10 autoUpgradeableVersions had latestVersion (10), but it did not make sense for asking for old version 6 to get version 10, while asking for version 8 got version 8. So use defaultVersion (8) instead. Sponsored-by: Dartmouth College's Datalad project	2022-01-25 13:52:42 -04:00
Joey Hess	3618746a85	fix failing readonly test case The problem is that withContentLockFile, in a v8 repo, has to take a shared lock of `.git/annex/content.lck`. But, in a readonly repository, if that file does not yet exist, it cannot lock it. And while it will sometimes work to `chmod +r .git/annex`, the repository might be readonly due to being owned by another user, or due to being mounted readonly. So, it seems that the only solution is to use some other file than `.git/annex/content.lck` as the lock file. The inode sentinel file was almost the only option that should always exist. (And if it somehow does not exist, creating an empty one for locking will be ok.) Wow, what a hack! Sponsored-by: Dartmouth College's Datalad project	2022-01-21 13:49:31 -04:00
Joey Hess	47084b8a1d	enable filter.annex.process in v9 This has tradeoffs, but is generally a win, and users who it causes git add to slow down unacceptably for can just disable it again. It needed to happen in an upgrade, since there are git-annex versions that do not support it, and using such an old version with a v8 repository with filter.annex.process set will cause bad behavior. By enabling it in v9, it's guaranteed that any git-annex version that can use the repository does support it. Although, this is not a perfect protection against problems, since an old git-annex version, if it's used with a v9 repository, will cause git add to try to run git-annex filter-process, which will fail. But at least, the user is unlikely to have an old git-annex in path if they are using a v9 repository, since it won't work in that repository. Sponsored-by: Dartmouth College's Datalad project	2022-01-21 13:11:18 -04:00
Joey Hess	dc14221bc3	detect v10 upgrade while running Capstone of the v10 upgrade process. Tested with a git-annex drop in a v8 repo that had a local v8 remote. Upgrading the repo to v10 (with --force) immediately caused it to notice and switch over to v10 locking. Upgrading the remote also caused it to switch over when operating on the remote. The InodeCache makes this fairly efficient, just an added stat call per lock of an object file. After the v10 upgrade, there is no more overhead. Sponsored-by: Dartmouth College's Datalad project	2022-01-21 12:56:38 -04:00
Joey Hess	76e365769e	fix crash after drop in v10 After cleaning up the lock file, the content directory is gone, so freezing it failed. Sponsored-by: Dartmouth College's Datalad project	2022-01-20 14:03:27 -04:00
Joey Hess	d0a5714409	continue to use v8 by default for now, unless upgraded Since it's easy to keep supporting v8, using it for a while (eg a few months) will give users time to upgrade git-annex installations, before it upgrades their repository to v9. This commit should be reverted once ready to start upgrading repositories by default. Sponsored-by: Dartmouth College's Datalad project	2022-01-20 11:56:05 -04:00
Joey Hess	0904eac8b4	automatic upgrade from v8 to v9 Sponsored-by: Dartmouth College's Datalad project	2022-01-20 11:39:36 -04:00
Joey Hess	cea6f6db92	v10 upgrade locking The v10 upgrade should almost be safe now. What remains to be done is notice when the v10 upgrade has occurred, while holding the shared lock, and switch to using v10 lock files. Sponsored-by: Dartmouth College's Datalad project	2022-01-20 11:33:14 -04:00
Joey Hess	9d5db6a09a	add upgrade.log The upgrade from V9 uses this to avoid an automatic upgrade until 1 year after the V9 update. It can also be used in future such situations. Sponsored-by: Dartmouth College's Datalad project	2022-01-19 15:52:29 -04:00
Joey Hess	856ce5cf5f	split upgrade into v9 and v10 v10 will run 1 year after the upgrade to v9, to give time for any v8 processes to die. Until that point, the v10 upgrade will be tried by every process but deferred, so added support for deferring upgrades. The upgrade prevention lock file that will be used by v10 is not yet implemented, so it does not yet defer. Sponsored-by: Dartmouth College's Datalad project	2022-01-19 13:09:33 -04:00
Joey Hess	4f7b8ce09d	fix spelling of upgradeable	2022-01-19 12:14:50 -04:00
Joey Hess	538d02d397	delete content lock file safely after shared lock Upgrade the shared lock to an exclusive lock, and then delete the lock file. If there is another process still holding the shared lock, the first process will fail taking the exclusive lock, and not delete the lock file; then the other process will later delete it. Note that, in the time period where the exclusive lock is held, other attempts to lock the content in place would fail. This is unlikely to be a problem since it's a short period. Other attempts to lock the content for removal would also fail in that time period, but that's no different than a removal failing because content is locked to prevent removal. Sponsored-by: Dartmouth College's Datalad project	2022-01-13 14:54:57 -04:00
Joey Hess	86e5ffe34a	clean empty object directories after deleting content lock file When dropping content, this was already done after deleting the content file, but the lock file prevents deleting the directories. So, try the deletion again. This does mean there's a small added overhead of a failed rmdir(). Sponsored-by: Dartmouth College's Datalad project	2022-01-13 14:22:37 -04:00
Joey Hess	e28d1d0325	fix logic that was not inverted after all oops	2022-01-13 14:11:36 -04:00
Joey Hess	a3b6b3499b	delete content lock file safely on drop, keep after shared lock This seems to be the best that can be done to avoid forever accumulating the new content lock files, while being fully safe. This is fixing code paths that have lingered unused since direct mode! And direct mode seems to have been buggy in this area, since the content lock file was deleted on unlock. But with a shared lock, there could be another process that also had the lock file locked, and deleting it invalidates that lock. So, the lock file cannot be deleted after a shared lock. At least, not without taking an exclusive lock first.. which I have not pursued yet but may. After an exclusive lock, the lock file can be deleted. But there is still a potential race, where the exclusive lock is held, and another process gets the file open, just as the exclusive lock is dropped and the lock file is deleted. That other process would be left with a file handle it can take a shared lock of, but with no effect since the file is deleted. Annex.Transfer also deletes lock files, and deals with this same problem by using checkSaneLock, which is how I've dealt with it here. Sponsored-by: Dartmouth College's Datalad project	2022-01-13 13:58:58 -04:00
Joey Hess	3d7933f124	fix inverted logic Now the content lock files are used in v9. However, I am not yet certain they are correct. In particular, lockContentUsing deletes the content lock file on unlock. But what if there's a shared lock by another process? That seems like it would discard that lock too! (Windows seems like it would not have the same problem, because as the comment in there says, "Can't delete a locked file on Windows". So if another process has a shared lock, removing it presumably fails.) Sponsored-by: Dartmouth College's Datalad project	2022-01-13 13:58:31 -04:00
Joey Hess	731b1ecf87	v9 upgrade implemented Seems to work ok. Unsure yet about the actual locking changes being correct. This is not the end of the story with upgrades, because it is unsafe for this upgrade as implemented to run in a repository where an old git-annex process is already running. The old process would use the old locking method, and not notice files locked by the new, and this could result in data loss. This problem will need to be dealt with before this branch is suitable for merging. Sponsored-by: Dartmouth College's Datalad project	2022-01-13 13:25:10 -04:00
Joey Hess	3936599885	move code from Command.Fsck Sponsored-by: Dartmouth College's Datalad project	2022-01-13 13:24:50 -04:00
Joey Hess	3c042606c2	use separate lock from content file in v9 Windows has always used a separate lock file, but on unix, the content file itself was locked, and in v9 that changes to also use a separate lock file. This needs to be tested more. Eg, what happens after dropping a file; does the content lock file get deleted too, or linger around? Sponsored-by: Dartmouth College's Datalad project	2022-01-11 17:03:14 -04:00
Joey Hess	43f9d967ff	shared repository content file permissions for v9 v9 will not need to write to annex content files in order to lock them, so freezeContent removes the write bit in a shared repository, the same as in any other repository. checkContentWritePerm makes sure that the write perm is not set, which will let git-annex fsck fix up the permissions. Upgrading to v9 will need to fix the permissions as well, but it seems likely there will be situations where the user git-annex is running an upgrade as cannot, so it will have to leave the write bit set. In such a case, git-annex fsck can fix it later. Sponsored-by: Dartmouth College's Datalad project	2022-01-11 16:50:50 -04:00
Joey Hess	ff570ad363	add v9 annex.version, not yet the default This is the start of v9, but it's currently identical to v8, and v8 is not upgraded to it. git-annex upgrade will upgrade to v9 with this change. Sponsored-by: Dartmouth College's Datalad project	2022-01-11 14:59:39 -04:00
Joey Hess	e95747a149	fix handling of corrupted data received from git remote Recover from corrupted content being received from a git remote due eg to a wire error, by deleting the temporary file when it fails to verify. This prevents a retry from failing again. Reversion introduced in version 8.20210903, when incremental verification was added. Only the git remote seems to be affected, although it is certainly possible that other remotes could later have the same issue. This only affects things passed to getViaTmp that return (False, UnVerified) due to verification failing. As far as getViaTmp can tell, that could just as well mean that the transfer failed in a way that would resume, so it cannot delete the temp file itself. Remote.Git and P2P.Annex use getViaTmp internally, while other remotes do not, which is why only it seems affected. A better fix perhaps would be to improve the types of the callback passed to getViaTmp, so that some other value could be used to indicate the state where the transfer succeeded but verification failed. Sponsored-by: Boyd Stephen Smith Jr.	2022-01-07 13:25:33 -04:00
Joey Hess	21c0d5be6e	comment	2022-01-07 12:27:19 -04:00
Joey Hess	e416635021	renameremote: Better handling of case where there are multiple special remotes with a name Instead of renaming one at random, error out and ask that a uuid be specified. Sponsored-by: Brett Eisenberg on Patreon	2022-01-05 15:24:02 -04:00
Joey Hess	58afb00f6e	enableremote: Better handling of the unusual case where multiple special remotes have been initialized with the same name Before it would pick one at random, though preferring ones that were not dead over dead ones. Now, if one is dead and the other not, it will use the non-dead one. But if both are not dead, or both dead, it will error out, suggesting the user clarify what they want to enable. Sponsored-by: Luke Shumaker on Patreon	2022-01-05 15:12:11 -04:00
Joey Hess	b1d719f9d2	handle transitions with read-only unmerged git-annex branches Capstone to this feature. Any transitions that have been performed on an unmerged remote ref but not on the local git-annex branch, or vice-versa have to be applied on the fly when reading files. Sponsored-by: Dartmouth College's Datalad project	2021-12-28 13:23:32 -04:00
Joey Hess	720baf820e	refactoring	2021-12-28 12:15:51 -04:00
Joey Hess	23a485498f	handle Annex.Branch.files with read-only unmerged git-annex branches It would be difficult to make Annex.Branch.files query the unmerged git-annex branches. Might be possible, similar to what was discussed in `7f6b2ca49c` but again I decided to make it not do anything in that situation to start with before adding such a complicated thing. git-annex info uses it when getting info about a repository. The choices were to make that fail with an error, or display the info it can, and change the output slightly for the bits of info it cannot access. While that is a behavior change, and I want to avoid any behavior changes due to unmerged git-annex branches in a read-only repo, displaying a message that is not a number seems unlikely to break anything that was consuming a number, any worse than throwing an exception would. Probably. Also git-annex unused --from origin is made to throw an error, but it would fail later anyway when trying to write to the unused log files. Sponsored-by: Dartmouth College's Datalad project	2021-12-27 15:28:31 -04:00
Joey Hess	7f6b2ca49c	handle overBranchFileContents with read-only unmerged git-annex branches This makes --all error out in that situation. Which is better than ignoring information from the branches. To really handle the branches right, overBranchFileContents would need to both query all the branches and union merge file contents (or perhaps not provide any file content), as well as diffing between branches to find files that are only present in the unmerged branches. And also, it would need to handle transitions.. Sponsored-by: Dartmouth College's Datalad project	2021-12-27 14:30:51 -04:00
Joey Hess	d9d0fe5fa4	disable precaching git-annex branch when there are unmerged branches in a read-only repo The way precaching works, it can't merge in information from those branches efficiently, so just disable it and fall back to Annex.Branch.get in order to get the correct information. Sponsored-by: Dartmouth College's Datalad project	2021-12-27 14:08:50 -04:00
Joey Hess	1e09cf661e	remove git-annex branch ref from unmerged refs list It's queried separately so it was causing extra work to include it.	2021-12-27 13:33:27 -04:00
Joey Hess	6d7ecd9e5d	merge git-annex branch in memory in read-only repository Improved support for using git-annex in a read-only repository, git-annex branch information from remotes that cannot be merged into the git-annex branch will now not crash it, but will be merged in memory. To avoid this making git-annex behave one way in a read-only repository, and another way when it can write, it's important that Annex.Branch.get return the same thing (modulo log file compaction) in both cases. This manages that mostly. There are some exceptions: - When there is a transition in one of the remote git-annex branches that has not yet been applied to the local or other git-annex branches. Transitions are not handled. - `git-annex log` runs git log on the git-annex branch, and so it will not be able to show information coming from the other, not yet merged branches. - Annex.Branch.files only looks at files in the git-annex branch and not unmerged branches. This affects git-annex info output. - Annex.Branch.hs.overBranchFileContents ditto. Affects --all and also importfeed (but importfeed cannot work in a read-only repo anyway). - CmdLine.Seek.seekFilteredKeys when precaching location logs. Note use of Annex.Branch.fullname - Database.ContentIdentifier.needsUpdateFromLog and updateFromLog These warts make this not suitable to be merged yet. This readonly code path is more expensive, since it has to query several branches. The value does get cached, but still large queries will be slower in a read-only repository when there are unmerged git-annex branches. When annex.merge-annex-branches=false, updateTo skips doing anything, and so the read-only repository code does not get triggered. So a user who is bothered by the extra work can set that. Other writes to the repository can still result in permissions errors. This includes the initial creation of the git-annex branch, and of course any writes to the git-annex branch. Sponsored-by: Dartmouth College's Datalad project	2021-12-27 13:21:15 -04:00
Joey Hess	c2e46f4707	improve git command queue flushing with time limit So that eg, addurl of several large files that take time to download will update the index for each file, rather than deferring the index updates to the end. In cases like an add of many smallish files, where a new file is being added every few seconds, the queue will still build up a lot of changes which are flushed at once, for best performance. Since the default queue size is 10240, often it only gets flushed once at the end, same as before. (Notice that updateQueue updated _lastchanged when adding a new item to the queue without flushing it; that is necessary to avoid it flushing the queue every 5 minutes in this case.) But, when it takes more than 5 minutes to add a file, the overhead of updating the index immediately is probably small, so do it after each file. This avoids git-annex potentially taking a very very long time indeed to stage newly added files, which can be annoying to the user who would like to get on with doing something with the files it's already added, eg using git mv to rename them to a better name. This is only likely to cause a problem if it takes say, 30 seconds to update the index; doing an extra 30 seconds of work after every 5 minute file add would be less optimal. Normally, updating the index takes significantly less time than that. On a SSD with 100k files it takes less than 1 second, and the index write time is bound by disk read and write so is not too much worse on a hard drive. So I hope this will not impact users, although if it does turn out to, the time limit could be made configurable. A perhaps better way to do it would be to have a background worker thread that wakes up every 60 seconds or so and flushes the queue. That is made somewhat difficult because the queue can contain Annex actions and so this would add a new source of concurrency issues. So I'm trying to avoid that approach if possible. Sponsored-by: Erik Bjäreholt on Patreon	2021-12-14 12:23:19 -04:00
Joey Hess	6242b35c33	fix error message Was "failed to generate a key" when key generation did not fail (it never does anymore) but the actual problem was it failed to stat the source file, perhaps due to it being deleted while the key was being generated. A user reported this, in a comment I followed up on in `262400fe04`, although I don't know what they did to trigger the error message.	2021-12-09 15:25:59 -04:00
Joey Hess	dbba231e06	Improve error message display when autoinit fails Due to eg, a permissions problem.	2021-12-09 14:38:12 -04:00
Joey Hess	ef3ab0769e	close pid lock only once no threads use it This fixes a FD leak when annex.pidlock is set and -J is used. Also, it fixes bugs where the pid lock file got deleted because one thread was done with it, while another thread was still holding it open. The LockPool now has two distinct types of resources, one is per-LockHandle and is used for file Handles, which get closed when the associated LockHandle is closed. The other one is per lock file, and gets closed when no more LockHandles use that lock file, including other shared locks of the same file. That latter kind is used for the pid lock file, so it's opened by the first thread to use a lock, and closed when the last thread closes a lock. In practice, this means that eg git-annex get of several files opens and closes the pidlock file a few times per file. While with -J5 it will open the pidlock file, process a number of files, until all the threads happen to finish together, at which point the pidlock file gets closed, and then that repeats. So in either case, another process still gets a chance to take the pidlock. registerPostRelease has a rather intricate dance, there are fine-grained STM locks, a STM lock of the pidfile itself, and the actual pidlock file on disk that are all resolved in stages by it. Sponsored-by: Dartmouth College's Datalad project	2021-12-06 15:01:39 -04:00
Joey Hess	e5ca67ea1c	fine-grained locking when annex.pidlock is enabled This locking has been missing from the beginning of annex.pidlock. It used to be possible, when two threads are doing conflicting things, for both to run at the same time despite using locking. Seems likely that nothing actually had a problem, but it was possible, and this eliminates that possible source of failure. Sponsored-by: Dartmouth College's Datalad project	2021-12-03 17:20:21 -04:00
Joey Hess	4703ad3e7f	remove unused import	2021-11-23 16:15:57 -04:00
Joey Hess	5a7f253974	support git 2.34.0's handling of merge conflict between annexed and non-annexed file This version of git -- or its new default "ort" resolver -- handles such a conflict by staging two files, one with the original name and the other named file~ref. Use unmergedSiblingFile when the latter is detected. (It doesn't do that when the conflict is between a directory and a file or symlink though, so see previous commit for how that case is handled.) The sibling file has to be deleted separately, because cleanConflictCruft may not delete it -- that only handles files that are annex links, but the sibling file may be the non-annexed file side of the conflict. The graftin code had assumed that, when the other side of a conflict is a symlink, the file in the work tree will contain the non-annexed content that we want it to contain. But that is not the case with the new git; the file may be the annex link and needs to be replaced with the content, while the annex link will be written as a -variant file. (The weird doesDirectoryExist check in graftin turns out to still be needed, test suite failed when I tried to remove it.) Test suite passes with new git with ort resolver default. Have not tried it with old git or other defaults. Sponsored-by: Noam Kremen on Patreon	2021-11-22 16:10:24 -04:00
Joey Hess	623a775609	fix cat-file leak in get with -J Bugfix: When -J was enabled, getting files leaked an ever-growing number of git cat-file processes. (Since commit `dd39e9e255`) The leak happened when mergeState called stopNonConcurrentSafeCoProcesses. While stopNonConcurrentSafeCoProcesses usually manages to stop everything, there was a race condition where cat-file processes were leaked. Because catFileStop modifies Annex.catfilehandles in a non-concurrency safe way, and could clobber modifications made in between. Which should have been ok, since originally catFileStop was only used at shutdown. Note the comment on catFileStop saying it should only be used when nothing else is using the handles. It would be possible to make catFileStop race-safe, but it should just not be used in a situation where a race is possible. So I didn't bother. Instead, the fix is just not to stop any processes in mergeState. Because in order for mergeState to be called, dupState must have been run, and it enables concurrency mode, stops any non-concurrent processes, and so all processes that are running are concurrency safe. So there is no need to stop them when merging state. Indeed, stopping them would be extra work, even if there was not this bug. Sponsored-by: Dartmouth College's Datalad project	2021-11-19 12:51:08 -04:00
Joey Hess	15d617f7e1	have setConcurrency stop any running git coprocesses When non-concurrent git coprocesses have been started, setConcurrency used to not stop them, and so could leak processes when enabling concurrency, eg when forkState is called. I do not think that ever actually happened, given where setConcurrency is called. And it probably would only leak one of each process, since it never downgrades from concurrent to non-concurrent.	2021-11-19 12:00:39 -04:00
Joey Hess	8c756d5a27	fix comment typo	2021-11-17 13:03:37 -04:00
Joey Hess	aa6e54ac6e	Fix a typo in the name of youtube-dl (reversion introduced in version 8.20210903)	2021-11-13 08:58:36 -04:00
Joey Hess	8034f2e9bb	factor out IncrementalHasher from IncrementalVerifier	2021-11-09 12:33:22 -04:00
Joey Hess	a0758bdd10	dynamically disable filter-process in restagePointerFile when it would be slower Based on my earlier benchmark, I have a rough cost model for how expensive it is for git-annex smudge to be run on a file, vs how expensive it is for a gigabyte of a file's content to be read and piped through to filter-process. So, using that cost model, it can decide if using filter-process will be more or less expensive than running the smudge filter on the files to be restaged. It turned out to be really annoying to temporarily disable filter-process. I did find a way, but urk, this is horrible. Notice that, if it's interrupted with it disabled, it will remain disabled until the next time restagePointerFile runs. Which could be some time later. If the user runs `git add` or `git checkout` on a lot of small files before that, they will see slower than expected performance. (This commit also deletes where I wrote down the benchmark results earlier.) Sponsored-by: Noam Kremen on Patreon	2021-11-08 16:20:34 -04:00
Joey Hess	837025b14f	Revert "disable filter.annex.process in restagePointerFile" This reverts commit `afe327ac49`. Unfortunately, disabling it by setting it to "" does not work, git then ignores filter.annex.smudge/clean, and does not pass files through git-annex at all. I don't think there is a way to temporarily disable this git config from the git command line. Which seems like a bug in git. So, it may be more expensive than anticipated to enable filter.annex.process, since git checkout etc will pipe all annexed files being checked out through it.	2021-11-05 12:43:33 -04:00
Joey Hess	afe327ac49	disable filter.annex.process in restagePointerFile This means git will run git-annex smudge --clean once per file that is restaged, which can be slow. But probably not as slow as git feeding all the content of annexed files you've gotten through a pipe to git-annex filter-process. The only time this is probably not ideal is after a drop of a bunch of files, when filter-process would be faster.	2021-11-04 15:20:26 -04:00
Joey Hess	a3cdff3fd5	add a comment about checkSaneLock See commit `8c2dd7d8ee` for original introduction of it, but needing to spelunk that far back to understand the code is not good.	2021-10-27 14:55:30 -04:00
Joey Hess	55bfa414b3	move transfer already in progress message to warning This makes it be displayed in the error-messages field with --json-error-messages. And with --quiet, it will let it be displayed, which makes sense because it's telling the user why what they requested to do has failed to happen.	2021-10-27 14:46:21 -04:00
Joey Hess	669037862a	avoid redundant freezeContent call This opens the potential for the object file to be in place but git-annex is interrupted before it can freeze it. git-annex fsck already fixes that situation, which can also occur when lockContentForRemoval thaws content. Also improve comment to not be Windows-specific.	2021-10-27 14:18:10 -04:00
Reiko Asakura	0db7297f00	Call freezeContent after move into annex This change better supports Windows ACL management using annex.freezecontent-command and annex.thawcontent-command and matches the behaviour of adding an unlocked file. By calling freezeContent after the file has moved into the annex, the file's delete permission can be denied. If the file's delete permission is denied before moving into the annex, the file cannot be moved or deleted. If the file's delete permission is not denied after moving into the annex, it will likely inherit a grant for the delete permission which allows it to be deleted irrespective of the permissions of the parent directory.	2021-10-27 14:05:57 -04:00
Joey Hess	5a9e6b1fd4	when private journal file exists, still read from git-annex branch Fix bug that caused stale git-annex branch information to read when annex.private or remote.name.annex-private is set. The private journal file should not prevent reading more current information from the git-annex branch, but used to. Note that, overBranchFileContents has to do additional work now, when there's a private journal file, it reads from the branch redundantly and more slowly. Sponsored-by: Jack Hill on Patreon	2021-10-26 13:43:50 -04:00
Joey Hess	0f38ad9a69	close keys db to possibly work around WSL1 issue	2021-10-19 13:07:49 -04:00
Joey Hess	887edeb1ad	avoid warning when built with unix-compat 0.5.3 It re-exports modificationTimeHiRes, and provides a windows version. Might be worth using that windows version eventually, but I have not tested it.	2021-10-18 16:25:28 -04:00
Joey Hess	69f8e6c7c0	ImportableContentsChunkable This improves the borg special remote memory usage, by letting it only load one archive's worth of filenames into memory at a time, and building up a larger tree out of the chunks. When a borg repository has many archives, git-annex could easily OOM before. Now, it will use only memory proportional to the number of annexed keys in an archive. Minor implementation wart: Each new chunk re-opens the content identifier database, and also a new vector clock is used for each chunk. This is a minor inefficiency only; the use of continuations makes it hard to avoid, although putting the database handle into a Reader monad would be one way to fix it. It may later be possible to extend the ImportableContentsChunkable interface to remotes that are not third-party populated. However, that would perhaps need an interface that does not use continuations. The ImportableContentsChunkable interface currently does not allow populating the top of the tree with anything other than subtrees. It would be easy to extend it to allow putting files in that tree, but borg doesn't need that so I left it out for now. Sponsored-by: Noam Kremen on Patreon	2021-10-08 13:15:22 -04:00
Joey Hess	19e78816f0	convert Key to ShortByteString This adds the overhead of a copy when serializing and deserializing keys. I have not benchmarked much, but runtimes seem barely changed at all by that. When a lot of keys are in memory, it improves memory use. And, it prevents keys sometimes getting PINNED in memory and failing to GC, which is a problem ByteString has sometimes. In particular, git-annex sync from a borg special remote had that problem and this improved its memory use by a large amount. Sponsored-by: Shae Erisson on Patreon	2021-10-05 20:20:08 -04:00
Joey Hess	9012fa0187	reinject: Fix crash when reinjecting a file from outside the repository Commit `4bf7940d6b` introduced this problem, but was otherwise doing a good thing. Problem being that fileRef "/foo" used to return ":./foo", which was actually wrong, but as long as there was no foo in the local repository, catKey could operate on it without crashing. After that fix though, fileRef would return eg "../../foo", resulting in fileRef returning ":./../../foo", which will make git cat-file crash since that's not a valid path in the repo. Fix is simply to make fileRef detect paths outside the repo and return Nothing. Then catKey can be skipped. This needed several bugfixes to dirContains as well, in previous commits. In Command.Smudge, this led to needing to check for Nothing. That case should actually never happen, because the fileoutsiderepo check will detect it earlier. Sponsored-by: Brock Spratlen on Patreon	2021-10-01 14:06:34 -04:00
Joey Hess	b9aa2ce8d1	resume properly when copying a file to/from a local git remote is interrupted (take 2) This method avoids breaking test_readonly. Just check if the dest file exists, and avoid CoW probing when it does, so when CoW probing fails, it can resume where the previous non-CoW copy left off. If CoW has been probed already to work, delete the dest file since a CoW copy will presumably work. It seems like it would be almost as good to just skip CoW copying in this case too, but consider that the dest file might have started to be copied from some other remote, not using CoW, but CoW has been probed to work to copy from the current place. Sponsored-by: Dartmouth College's Datalad project	2021-09-27 16:03:01 -04:00
Joey Hess	7ccf642863	revert change that broke test_readonly commit `63d508e885` broke test_readonly. When a local git remote is readonly, tryCopyCoW run to copy a file from it failed at withOtherTmp. Sponsored-by: Dartmouth College's Datalad project	2021-09-27 16:02:41 -04:00
Joey Hess	e47b4badb3	separate handles for cat-file and cat-file --batch-check This avoids starting one process when only the other one is needed. Eg in git-annex smudge --clean, this reduces the total number of cat-file processes that are started from 4 to 2. The only performance penalty is that when both are needed, it has to do twice as much work to maintain the two Maps. But both are very small, consisting of 1 or 2 items, so that work is negligible. Sponsored-by: Dartmouth College's Datalad project	2021-09-24 13:16:13 -04:00
Joey Hess	798b33ba3d	simplify annex.bwlimit handling RemoteGitConfig parsing looks for annex.bwlimit when a remote does not have a per-remote config for it, so no need for a separate global config. Sponsored-by: Svenne Krap on Patreon	2021-09-22 10:52:01 -04:00
Joey Hess	05a097cde8	Merge branch 'master' into bwlimit	2021-09-22 10:48:27 -04:00
Joey Hess	4fef94d764	simplify annex.stalldetection handling RemoteGitConfig parsing looks for annex.stalldetection when a remote does not have a per-remote config for it, so no need for a separate global config. Sponsored-by: Noam Kremen on Patreon	2021-09-22 10:46:10 -04:00
Joey Hess	63d508e885	resume properly when copying a file to/from a local git remote is interrupted Probably this fixes a reversion, but I don't know what version broke it. This does use withOtherTmp for a temp file that could be quite large. Though albeit a reflink copy that will not actually take up any space as long as the file it was copied from still exists. So if the copy cow succeeds but git-annex is interrupted just before that temp file gets renamed into the usual .git/annex/tmp/ location, there is a risk that the other temp directory ends up cluttered with a larger temp file than later. It will eventually be cleaned up, and the chances of this being a problem are small, so this seems like an acceptable thing to do. Sponsored-by: Shae Erisson on Patreon	2021-09-21 17:43:35 -04:00
Joey Hess	18e00500ce	bwlimit Added annex.bwlimit and remote.name.annex-bwlimit config that works for git remotes and many but not all special remotes. This nearly works, at least for a git remote on the same disk. With it set to 100kb/1s, the meter displays an actual bandwidth of 128 kb/s, with occasional spikes to 160 kb/s. So it needs to delay just a bit longer... I'm unsure why. However, at the beginning a lot of data flows before it determines the right bandwidth limit. A granularity of less than 1s would probably improve that. And, I don't know yet if it makes sense to have it be 100kb/1s rather than 100kb/s. Is there a situation where the user would want a larger granularity? Does granularity need to be configurable at all? I only used that format for the config really in order to reuse an existing parser. This can't be supported for external special remotes, or for ones that themselves shell out to an external command. (Well, it could, but it would involve pausing and resuming the child process tree, which seems very hard to implement and very strange besides.) There could also be some built-in special remotes that it still doesn't work for, due to them not having a progress meter whose display blocks the bandwidth using thread. But I don't think there are actually any that run a separate thread for downloads than the thread that displays the progress meter. Sponsored-by: Graham Spencer on Patreon	2021-09-21 16:58:10 -04:00
Joey Hess	ec12537774	defer write permissions checking in import until after copy to repo This should complete the fix started in `6329997ac4`, fixing the actual cause of the test suite failure this time. Sponsored-by: Dartmouth College's Datalad project	2021-09-02 13:45:21 -04:00
Joey Hess	bd5494bb9c	fix windows build	2021-09-02 12:21:25 -04:00
Joey Hess	4f42292b13	improve url download failure display * When downloading urls fail, explain which urls failed for which reasons. * web: Avoid displaying a warning when downloading one url failed but another url later succeeded. Some other uses of downloadUrl use urls that are effectively internal use, and should not all be displayed to the user on failure. Eg, Remote.Git tries different urls where content could be located depending on how the remote repo is set up. Exposing those urls to the user would lead to wild goose chases. So had to parameterize it to control whether it displays urls or not. A side effect of this change is that when there are some youtube urls and some regular urls, it will try regular urls first, even if the youtube urls are listed first. This seems like an improvement if anything, but in any case there's no defined order of urls that it's supposed to use. Sponsored-by: Dartmouth College's Datalad project	2021-09-01 15:33:38 -04:00
Joey Hess	6329997ac4	init: check for filesystem where write bit cannot be removed This fixes a reversion caused by `a99a84f342`, when git-annex init is run as root on a FAT filesystem mounted with hdiutil on OSX. Such a mount point has file mode 777 for everything and it cannot be changed. The existing crippled filesystem test tried to write to a file after removing write bit, but that test does not run as root (since root can write to unwritable files). So added a check of the write permissions of the file, after attempting to remove them. Sponsored-by: Dartmouth College's Datalad project	2021-09-01 10:27:28 -04:00
Joey Hess	e853ef3095	decorate openTempFile errors with the template name This is to track down what file in .git/annex/ is being written to via a temp file when the repository is read-only. Sponsored-by: Dartmouth College's Datalad project	2021-08-30 13:05:02 -04:00
Joey Hess	a99a84f342	add: Detect when xattrs or perhaps ACLs prevent locking down a file's content And fail with an informative message. I don't think ACLs can prevent removing the write bit, but I'm not sure, so kept it mentioning them as a possibility. Should git-annex lock also check if the write bits are able to be removed? Maybe, but the case I know about with xattrs involves cp -a copying NFS xattrs, and it's the copy of the file that is the problem. So when locking a file, I guess it will not be the copy. Sponsored-by: Dartmouth College's Datalad project	2021-08-27 14:33:01 -04:00
Joey Hess	6d4a728455	Added annex.youtube-dl-command config This can be used to run some forks of youtube-dl. Sponsored-by: Brett Eisenberg on Patreon	2021-08-27 09:44:23 -04:00
Joey Hess	4ed36b2634	Fix test suite failure on Windows It would be better if the Arbitrary instance avoided generating impossible filenames like "foo/c:bar", but probably this is the only place that splits the file from the directory and then uses the file without the directory.. At least on the quickcheck properties. Sponsored-by: Svenne Krap on Patreon	2021-08-24 14:03:29 -04:00
Joey Hess	492036622a	fix OSX build	2021-08-18 16:35:26 -04:00
Joey Hess	d154e7022e	incremental verification for web special remote Except when configuration makes curl be used. It did not seem worth trying to tail the file when curl is downloading. But when an interrupted download is resumed, it does not read the whole existing file to hash it. Same reason discussed in commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long time with no progress being displayed. And also there's an open http request, which needs to be consumed; taking a long time to hash the file might cause it to time out. Also in passing implemented it for git and external special remotes when downloading from the web. Several others like S3 are within striking distance now as well. Sponsored-by: Dartmouth College's DANDI project	2021-08-18 15:02:22 -04:00
Joey Hess	88b63a43fa	distinguish between incremental verification failing and not being done Sponsored-by: Dartmouth College's DANDI project	2021-08-18 14:38:02 -04:00
Joey Hess	325bfda12d	refactor	2021-08-18 13:37:00 -04:00
Joey Hess	449851225a	refactor IncrementalVerifier moved to Utility.Hash, which will let Utility.Url use it later. It's perhaps not really specific to hashing, but making a separate module just for the data type seemed unnecessary. Sponsored-by: Dartmouth College's DANDI project	2021-08-18 13:19:02 -04:00
Joey Hess	f0754a61f5	plumb VerifyConfig into retrieveKeyFile This fixes the recent reversion that annex.verify is not honored, because retrieveChunks was passed RemoteVerify baser, but baser did not have export/import set up. Sponsored-by: Dartmouth College's DANDI project	2021-08-17 12:43:13 -04:00
Joey Hess	b1622eb932	incremental verify for directory special remote Added fileRetriever', which will let the remaining special remotes eventually also support incremental verify. Sponsored-by: Dartmouth College's DANDI project	2021-08-16 16:51:33 -04:00
Joey Hess	a644f729ce	refactor fileCopier Sponsored-by: Dartmouth College's DANDI project	2021-08-16 15:56:24 -04:00
Joey Hess	d889ae0c01	move comment	2021-08-16 15:25:06 -04:00
Joey Hess	aac0654ff4	handle AlreadyInUseError As happens when using the directory special remote, gitlfs, webdav, and S3. But not external, adb, gcrypt, hook, or rsync. Sponsored-by: Dartmouth College's DANDI project	2021-08-16 15:03:48 -04:00
Joey Hess	c4aba8e032	better handling of finishing up incomplete incremental verify Now it's run in VerifyStage. I thought about keeping the file handle open, and resuming reading where tailVerify left off. But that risks leaking open file handles, until the GC closes them, if the deferred verification does not get resumed. Since that could perhaps happen if there's an exception somewhere, I decided that was too unsafe. Instead, re-open the file, seek, and resume. Sponsored-by: Dartmouth College's DANDI project	2021-08-16 14:52:59 -04:00
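
The "re-open the file, seek, and resume" approach in the message above can be sketched in Python as follows. This is a hedged illustration rather than git-annex's Haskell code: `resume_verify`, its parameters, and the use of `hashlib` are assumptions made for the example.

```python
import hashlib


def resume_verify(path: str, hasher, offset: int, chunk_size: int = 65536) -> str:
    """Finish an incomplete incremental hash of the file at path.

    hasher has already been fed the first offset bytes. Rather than
    keeping an earlier file handle open (which could leak if the deferred
    step never runs), the file is re-opened here and closed on exit.
    Hypothetical helper, for illustration only.
    """
    with open(path, "rb") as f:
        f.seek(offset)  # skip the part that was already hashed
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.hexdigest()
```

A caller would pass in the partially fed hasher, for example a `hashlib.sha256()` object, along with the number of bytes it has already consumed, and compare the returned digest against the expected one.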
Joey Hess	e0b7f391bd	improve tailVerify Wait for the file to get modified, not only opened. This way, if a remote does not support resuming, and opens a new file over top of the existing file, it will wait until that remote starts writing, and open the file it's writing to, not the old file. Sponsored-by: Dartmouth College's DANDI project	2021-08-16 14:47:37 -04:00
Joey Hess	e46a7dff6f	fix windows build	2021-08-13 16:36:33 -04:00
Joey Hess	16dd3dd4ca	catch more exceptions I saw this: .git/annex/tmp/SHA256E-s1234376--5ba8e06e0163b217663907482bbed57684d7188024155ddc81da0710dfd2687d: openBinaryFile: resource busy (file is locked) guess catching IO exceptions did not catch that one.	2021-08-13 16:16:46 -04:00
Joey Hess	ff2dc5eb18	INotify.removeWatch can crash Unsure why, possibly if the file has been replaced by another file.	2021-08-13 15:35:18 -04:00
Joey Hess	7503b8448b	inotify reports paths relative to directory being watched Sponsored-by: Dartmouth College's DANDI project	2021-08-13 14:51:15 -04:00
Joey Hess	e07625df8a	convert tailVerify to not finalize the verification Added failIncremental so it can force failure to verify. Sponsored-by: Dartmouth College's DANDI project	2021-08-13 13:39:02 -04:00
Joey Hess	9d533b347f	tailVerify: return deferred action when it gets behind Sponsored-by: Dartmouth College's DANDI project	2021-08-13 12:32:01 -04:00
Joey Hess	b6efba8139	add tailVerify Not yet used, but this will let all remotes verify incrementally if it's acceptable to pay the performance price. See comment for details of when it will perform badly. I anticipate using this for all special remotes that use fileRetriever. Except perhaps for a few like GitLFS that could feed the incremental verifier themselves despite using that. Sponsored-by: Dartmouth College's DANDI project	2021-08-12 14:38:02 -04:00
Joey Hess	fa62c98910	simplify and speed up Utility.FileSystemEncoding This eliminates the distinction between decodeBS and decodeBS', encodeBS and encodeBS', etc. The old implementation truncated at NUL, and the primed versions had to do extra work to avoid that problem. The new implementation does not truncate at NUL, and is also a lot faster. (Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the primed versions.) Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation, and upgrading to it will speed up to/fromRawFilePath. AFAIK, nothing relied on the old behavior of truncating at NUL. Some code used the faster versions in places where I was sure there would not be a NUL. So this change is unlikely to break anything. Also, moved s2w8 and w82s out of the module, as they do not involve filesystem encoding really. Sponsored-by: Shae Erisson on Patreon	2021-08-11 12:13:31 -04:00
Joey Hess	1acdd18ea8	deal better with clock skew situations, using vector clocks * Deal with clock skew, both forwards and backwards, when logging information to the git-annex branch. * GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1) rather than needing to be advanced each time a new change is made. * Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex. When changing a file in the git-annex branch, the vector clock to use is now determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK when set), and comparing it to the newest vector clock already in use in that file. If a newer time stamp was already in use, advance it forward by a second instead. When the clock is set to a time in the past, this avoids logging with an old timestamp, which would risk that log line later being ignored in favor of a "newer" line that is really not newer. When a log entry has been made with a clock that was set far ahead in the future, this avoids newer information being logged with an older timestamp and so being ignored in favor of that future-timestamped information. Once all clocks get fixed, this will result in the vector clocks being incremented, until finally enough time has passed that time gets back ahead of the vector clock value, and then it will return to usual operation. (This latter situation is not ideal, but it seems the best that can be done. The issue with it is, since all writers will be incrementing the last vector clock they saw, there's no way to tell when one writer made a write significantly later in time than another, so the earlier write might arbitrarily be picked when merging. This problem is why git-annex uses timestamps in the first place, rather than pure vector clocks.) Advancing forward by 1 second is somewhat arbitrary. setDead advances a timestamp by just 1 picosecond, and the vector clock could too. But then it would interfere with setDead, which wants to be overruled by any change. So it could use 2 picoseconds or something, but that seems weird. It could just as well advance it forward by a minute or whatever, but then it would be harder for real time to catch up with the vector clock when forward clock skew had happened. A complication is that many log files contain several different pieces of information, and it may be best to only use vector clocks for the same piece of information. For example, a key's location log file contains InfoPresent/InfoMissing for each UUID, and it only looks at the vector clocks for the UUID that is being changed, and not other UUIDs. Although exactly where the dividing line is can be hard to determine. Consider metadata logs, where a field "tag" can have multiple values set at different times. Should it advance forward past the last tag? Probably. What about when a different field is set, should it look at the clocks of other fields? Perhaps not, but currently it does, and this does not seem like it will cause any problems. Another one I'm not entirely sure about is the export log, which is keyed by (fromuuid, touuid). So if multiple repos are exporting to the same remote, different vector clocks can be used for that remote. It looks like that's probably ok, because it does not try to determine what order things occurred when there was an export conflict. Sponsored-by: Jochen Bartl on Patreon	2021-08-04 12:33:46 -04:00
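
The timestamp selection rule described in the vector clock message above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual git-annex code: the function name `choose_vector_clock` is hypothetical, timestamps are modelled as plain seconds, and treating an equal timestamp the same as a newer one is an assumption the message does not spell out.

```python
from typing import Optional

ADVANCE_SECONDS = 1.0  # the message calls this choice "somewhat arbitrary"


def choose_vector_clock(now: float, newest_in_file: Optional[float]) -> float:
    """Pick the timestamp for a new log line.

    now is the current time (or a fixed GIT_ANNEX_VECTOR_CLOCK value);
    newest_in_file is the newest clock already used for the same piece
    of information, or None if there is none. Hypothetical sketch.
    """
    if newest_in_file is None or now > newest_in_file:
        # The usual case: the wall clock is ahead of everything logged so far.
        return now
    # Clock skew (or a fixed clock value): step past the newest existing
    # entry so the new line is not ignored in favor of older information.
    # Assumption: an equal timestamp is advanced as well.
    return newest_in_file + ADVANCE_SECONDS
```

For example, with the local clock at 100 and an existing entry stamped 500 by a machine whose clock ran ahead, the new line is stamped 501. Once real time passes the logged value, the function returns the wall clock again, matching the "return to usual operation" behaviour the message describes.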

1967 commits