git-annex

Author	SHA1	Message	Date
Joey Hess	928b2a4839	create journal directory in withJournalHandle Fixes a crash by git-annex repair when .git/annex/journal/ does not exist. Normally the journal directory is created before withJournalHandle gets run, but git-annex repair can be run in a situation where it does not exist.	2023-06-21 15:23:59 -04:00
Joey Hess	72715845a1	display destination file before youtube-dl download Rather than after it, which can leave one wondering what file it's downloading. youtubeDl should not ever return Right Nothing in normal operation, because it's already asked youtube-dl if it supports the url. So it would have to succeed at that, then not download any file, but also exit successfully, in order for the new error message to display. Also display the name of yt-dlp when using it.	2023-06-20 14:55:25 -04:00
Joey Hess	a861d56428	httpalso: Support being used with special remotes that use chunking. Sponsored-by: k0ld on Patreon	2023-06-20 13:35:28 -04:00
Joey Hess	a36a81dea3	Improve resuming interrupted download when using yt-dlp Sometimes resuming an interrupted download will fail to resume and download more files with different names. That resulted in the workdir having multiple files at the end, which causes git-annex to give up because it does not know what was downloaded. To fix this, use a yt-dlp feature, which appends to a file the name of each file after it's finished downloading it. So the presence of other cruft in the workdir will not confuse git-annex.	2023-06-19 14:39:08 -04:00
Joey Hess	64738ea157	config: Added the --show-origin and --for-file options * config: Added the --show-origin and --for-file options. * config: Support annex.numcopies and annex.mincopies. There is a little bit of redundancy here with other code elsewhere that combines the various configs and selects which to use. But really only for the special case of annex.numcopies, which is a git config that does not override the annex branch setting and for annex.mincopies, which does not have a git config but does have gitattributes settings as well as the annex branch setting. That seems small enough, and unlikely enough to grow into a mess that it was worth supporting annex.numcopies and annex.mincopies in git-annex config --show-origin. Because these settings are a prime thing that someone might get confused about and want to know where they were configured. And, it followed that git-annex config might as well support those two for --set and --get as well. While this is redundant with the specialized commands, it's only a little code and it makes it more consistent. Note that --set does not have as nice output as numcopies/mincopies commands in some special cases like setting to 0 or a negative number. It does avoid setting to a bad value thanks to the smart constructors (eg configuredNumCopies). As for other git-annex branch configurations that are not set by git-annex config, things like trust and wanted that are specific to a repository don't map to a git config name, so don't really fit into git-annex config. And they are only configured in the git-annex branch with no local override (at least so far), so --show-origin would not be useful for them. Sponsored-by: Dartmouth College's DANDI project	2023-06-12 16:24:31 -04:00
Joey Hess	ae98fb1b31	move unspecifiedAttr check to checkAttr It just so happens that everywhere that checks attrs other than annex.largefiles parses the value further, and failed to parse unspecifiedAttr in a way that behaved the same as if nothing was set. So this is not a bug fix or behavior change. What it does do is prevent future uses of checkAttr from needing to remember to handle checking for this edge case. Sponsored-by: Dartmouth College's DANDI project	2023-06-12 14:37:42 -04:00
Joey Hess	532b227086	update exportdb tree in getImportableContents This avoids bottlenecking on git check-ignore in a particular situation. Also, there may have been a correctness issue with it not having updated it. When the exportdb is already up-to-date, this is not expensive. And the exportdb is updated elsewhere, so usually it is up-to-date. Sponsored-by: Joshua Antonishen on Patreon	2023-06-08 18:36:24 -04:00
Joey Hess	6821ba8dab	sync: use log to track adjusted branch needs updating Speeds up sync in an adjusted branch by avoiding re-adjusting the branch unnecessarily, particularly when it is adjusted with --hide-missing or --unlock-present. When there are a lot of files, that was the majority of the time of a --no-content sync. Uses a log file, which is updated when content presence changes. This adds a little bit of overhead to every file get/drop when on such an adjusted branch. The overhead is minimal for get of any size of file, but might be noticeable for drop in some cases. It seems like a reasonable trade-off. It would be possible to update the log file only at the end, but then it would not happen if the command is interrupted. When not in an adjusted branch, there should be no additional overhead. (getCurrentBranch is an MVar read, and it avoids the MVar read of getGitConfig.) Note that this does not deal with situations such as: git checkout master, git-annex get, git checkout adjusted branch, git-annex sync. The sync won't know that the adjusted branch needs to be updated. Dealing with that would add overhead to operation in non-adjusted branches, which I don't like. Also, there are other situations like having two adjusted branches that both need to be updated like this, and switching between them and sync not updating. This does mean a behavior change to sync, since it did previously deal with those situations. But, the documentation did not say that it did. The man pages only talk about sync updating the adjusted branch after it transfers content. I did consider making sync keep track of content it transferred (and dropped) and only update the adjusted branch then, not to catch up to other changes made previously. That would perform better. But it seemed rather hard to implement, and also it would have problems with races with a concurrent get/drop, which this implementation avoids. And it seemed pretty likely someone had gotten used to get/drop followed by sync updating the branch. It seems much less likely someone is switching branches, doing get/drop, and then switching back and expecting sync to update the branch. Re-running git-annex adjust still does a full re-adjusting of the branch, for anyone who needs that. Sponsored-by: Leon Schuermann on Patreon	2023-06-08 14:35:41 -04:00
Joey Hess	637f19bebb	fix adjusted branch update breakage Introduced recently in commit `64fc34b3da`. adjustBranch changes the sha that is recorded for the current branch (eg the adjusted branch). So, have to get the original sha before calling it. Sponsored-by: Jack Hill on Patreon	2023-06-08 13:33:58 -04:00
Joey Hess	64fc34b3da	narrow window where HEAD is detached Updating an adjusted branch can take a while when there are a lot of files. HEAD was detached at the start, so if eg git-annex sync was interrupted at the wrong point, there was a possibly wide window where it would leave the repo with HEAD detached. There's still a window, just much narrower. I don't know if it's possible to close the window entirely. While git can clearly update the currently checked out branch in eg git merge, it doesn't seem to provide another way to do it. Sponsored-by: Graham Spencer on Patreon	2023-06-07 11:10:54 -04:00
Joey Hess	fe1b2dfb4b	speed up very first tree import by 25% Reading from the cidsdb is responsible for about 25% of the runtime of an import. Since the cidmap is used to store the same information in ram, the cidsdb is not written to during an import any longer. And so, if it started off empty (and updateFromLog wasn't needed), those reads can just be skipped. This is kind of a cheesy optimisation, since after any import from any special remote, the database will no longer be empty, so it's a single use optimisation. But it's probably not uncommon to start by importing a lot of files, and it can save a lot of time then. Sponsored-by: Brock Spratlen on Patreon	2023-06-02 13:30:30 -04:00
Joey Hess	40017089f2	use importChanges optimisation Large speed up to importing trees from special remotes that contain a lot of files, by only processing changed files. Benchmarks: Importing from a special remote that has 10000 files, that have all been imported before, and 1 new file sped up from 26.06 to 2.59 seconds. An import with no change and 10000 unchanged files sped up from 24.3 to 1.99 seconds. Going up to 20000 files, an import with no changes sped up from 125.95 to 3.84 seconds. Sponsored-by: k0ld on Patreon	2023-06-01 13:47:00 -04:00
Joey Hess	c6acf574c7	implement importChanges optimisation (not used yet) For simplicity, I've not tried to make it handle History yet, so when there is a history, a full import will still be done. Probably the right way to handle history is to first diff from the current tree to the last imported tree. Then, diff from the current tree to each of the historical trees, and recurse through the history diffing from child tree to parent tree. I don't think that will need a record of the previously imported historical trees, and so Logs.Import doesn't store them. Although I did leave room for future expansion in that log just in case. Next step will be to change importTree to importChanges and modify recordImportTree et al to handle it, by using adjustTree. Sponsored-by: Brett Eisenberg on Patreon	2023-05-31 16:01:34 -04:00
Joey Hess	7298123520	build git trees using ContentIdentifier to speed up import This gets the trees built, but it does not use them. Next step will be to remember the tree for next time an import is done, and diff between old and new trees to find the files that have changed. Added --missing to the mktree parameters. That only disables a check, so it's ok to do everywhere mktree is used. It probably also speeds up mktree to disable the check. Note that git fsck does not complain about the resulting tree objects that point to shas that are not in the repository. Even with --strict. A quick benchmark, importing 10000 files, this slowed it down from 2:04.06 to 2:04.28. So it will more than pay for itself. Sponsored-by: Luke Shumaker on Patreon	2023-05-31 12:46:54 -04:00
Joey Hess	f6aa097a39	avoid import writing to cidsdb initially Speed up importing trees from special remotes somewhat by avoiding redundant writes to sqlite database. Before, import would write to both the git-annex branch and also to the sqlite database. But then the next time it was run, needsUpdateFromLog would see the branch had changed, so run updateFromLog, which would make the same writes to the sqlite database a second time. Now import writes only to the git-annex branch. The next time it's run, needsUpdateFromLog sees that the branch has changed and so calls updateFromLog, which updates the sqlite database. Why defer the write to the sqlite database like this? It seems that it could write to the database as it goes, and at the end call recordAnnexBranchTree to indicate that the information in the git-annex branch has all been written to the cidsdb. That would avoid the second import doing extra work. But, there could be other processes running at the same time, and one of them may update the git-annex branch, eg merging a remote git-annex branch into it. Any cids logs on that merged git-annex branch would not be reflected in the cidsdb yet. If the import then called recordAnnexBranchTree, the cidsdb would never get updated with that merged information. I don't think there's a good way to prevent, or to detect that situation. So, it can't call recordAnnexBranchTree at the end. So it might as well wait until the next run and do updateFromLog then. It could instead do updateFromLog at the end, but it's going to check needsUpdateFromLog at the beginning anyway. Note that the database writes were queued, so there is already a cidmap that is used to remember changes that the current process has made. So, omitting database writes can't change the behavior of the current process. Also note that thirdpartypopulatedimport uses recordcidkeyindb, which reflects what it already did. That code path does not use the cidmap, but does not need to query it either. It might be possible to make that code path also only update the git-annex branch and not the db, but I haven't checked. Sponsored-by: Noam Kremen on Patreon	2023-05-30 17:05:28 -04:00
Joey Hess	f2db6da938	default to yt-dlp and fix progress parsing bugs I noticed git-annex was using a lot of CPU when downloading from youtube, and was not displaying progress. Turns out that yt-dlp (and I think also youtube-dl) sometimes only knows an estimated size, not the actual size, and displays the progress output slightly differently for that. That broke the parser. And, the parser was feeding chunks that failed to parse back as a remainder, which caused it to try to re-parse the entire output each time, so it got slower and slower. Using --progress-template like this should avoid parsing problems as well as future proof against output changes. But it will work with only yt-dlp. So, this seemed like the right time to deprecate youtube-dl, and default to yt-dlp when available. git-annex will still use youtube-dl if that's all that's available. However, since the progress parser for youtube-dl was buggy, and I don't want to maintain two different progress parsers (especially since youtube-dl is no longer in debian unstable having been replaced by yt-dlp), made git-annex no longer try to parse youtube-dl's progress. Also, updated docs for yt-dlp being default. It did not seem worth renaming annex.youtube-dl-options and annex.youtube-dl-command. Note that yt-dlp does not seem to document the fields available in the progress template. I found them by reading the source and looking at the templates it uses internally. Also note that the use of "i" (rather than "s") in progressTemplate makes it display floats rounded to integers; particularly the estimated total size can be a float. That also does not seem to be documented but I assume is a python thing? Sponsored-by: Joshua Antonishen on Patreon	2023-05-27 13:04:53 -04:00
Joey Hess	aff37fc208	avoid annexFileMode special case This makes annexFileMode be just an application of setAnnexPerm', which avoids having 2 functions that do different versions of the same thing. Fixes some buggy behavior for some combinations of core.sharedRepository and umask. Sponsored-by: Jack Hill on Patreon	2023-04-27 15:58:37 -04:00
Joey Hess	67f8268b3f	Support core.sharedRepository=0xxx at long last Sponsored-by: Brett Eisenberg on Patreon	2023-04-26 17:03:29 -04:00
Joey Hess	0aa98aa09b	fix perms for core.sharedRepository These two missed setting it. It rarely matters that the journal gets the right perm. But, when using annex.alwayscommit=false, someone else may come along later and want to append to the journal file. It probably never matters what the sentinel perms are, but for completeness.. Sponsored-by: Luke Shumaker on Patreon	2023-04-26 16:29:11 -04:00
Joey Hess	7af75a59be	Warn about unsupported core.sharedRepository=0xxx when set This spams the user with a lot of messages, but it seems like busywork to avoid that and only warn once, since this warning will go away when it gets implemented. Also fix parsing of the octal value. Sponsored-by: Kevin Mueller on Patreon	2023-04-26 13:25:29 -04:00
Joey Hess	9155ed1072	configremote New command, currently limited to changing autoenable= setting of a special remote. It will probably never be used for more than that given the limitations on it. Sponsored-by: Brock Spratlen on Patreon	2023-04-18 15:30:49 -04:00
Joey Hess	fe5e586b72	rename Git.Filename to Git.Quote	2023-04-12 17:22:03 -04:00
Joey Hess	a576fc3b12	fix mojibake reversion in display of utf8 When displaying a ByteString like "💕", safeOutput operates on individual bytes like "\240\159\146\149" and isControl '\146' = True, so it got truncated to just "\240". So, only treat the low control characters, and DEL, as control characters. Also split Utility.Terminal out of Utility.SafeOutput. The latter needs win32, but Utility.SafeOutput is used by Control.Exception, which is used by Setup. Sponsored-by: Nicholas Golder-Manning on Patreon	2023-04-12 13:53:30 -04:00
Joey Hess	c50aa21d5f	init: Avoid autoenabling special remotes that have control characters in their names I'm on the fence about this. Notice that pulling from a git remote can pull branches that have escape sequences in their names. Git will display those as-is. Arguably git should try harder to avoid that. But, names of remotes are usually up to the local user, and autoenable changes that, and so it makes sense that git chooses to display control characters in names of remotes, and so autoenable needs to guard against it. Sponsored-by: Graham Spencer on Patreon	2023-04-12 12:37:12 -04:00
Joey Hess	de68e3dd4f	allow tab in controlCharacterInFilePath Seems unlikely to have a tab in a path, but it's not a control character that needs to be prevented either. Left \n \r \v and \a as other non-threatening control characters that are still obnoxious to have in a filepath because of how it causes issues with display and/or with shell scripting.	2023-04-12 12:31:16 -04:00
Joey Hess	8b6c7bdbcc	filter out control characters in all other Messages This does, as a side effect, make long notes in json output not be indented. The indentation is only needed to offset them underneath the display of the file they apply to, so that's ok. Sponsored-by: Brock Spratlen on Patreon	2023-04-11 12:58:01 -04:00
Joey Hess	3290a09a70	filter out control characters in warning messages Converted warning and similar to use StringContainingQuotedPath. Most warnings are static strings, some do refer to filepaths that need to be quoted, and others don't need quoting. Note that, since quote filters out control characters of even UnquotedString, this makes all warnings safe, even when an attacker sneaks in a control character in some other way. When json is being output, no quoting is done, since json gets its own quoting. This does, as a side effect, make warning messages in json output not be indented. The indentation is only needed to offset warning messages underneath the display of the file they apply to, so that's ok. Sponsored-by: Brett Eisenberg on Patreon	2023-04-10 15:55:44 -04:00
Joey Hess	cd544e548b	filter out control characters in error messages giveup changed to filter out control characters. (It is too low level to make it use StringContainingQuotedPath.) error still does not, but it should only be used for internal errors, where the message is not attacker-controlled. Changed a lot of existing error to giveup when it is not strictly an internal error. Of course, other exceptions can still be thrown, either by code in git-annex, or a library, that include some attacker-controlled value. This does not guard against those. Sponsored-by: Noam Kremen on Patreon	2023-04-10 13:50:51 -04:00
Joey Hess	063c00e4f7	git style filename quoting for giveup When the filenames are part of the git repository or other files that might have attacker-controlled names, quote them in error messages. This is fairly complete, although I didn't do the one in Utility.DirWatcher.INotify.hs because that doesn't have access to Git.Filename or Annex. But it's also quite possible I missed some. And also while scanning for these, I found giveup used with other things that could be attacker controlled to contain control characters (eg Keys). So, I'm thinking it would also be good for giveup to just filter out control characters. This commit is then not the only line of defence, but just good formatting when git-annex displays a filename in an error message. Sponsored-by: Kevin Mueller on Patreon	2023-04-10 12:56:45 -04:00
Joey Hess	da83652c76	addurl --preserve-filename: reject control characters As well as escape sequences, control characters seem unlikely to be desired when doing addurl, and likely to trip someone up. So disallow them as well. I did consider going the other way and allowing filenames with control characters and escape sequences, since git-annex is in the process of escaping display of all filenames. Might still be a better idea? Also display the illegal filename git quoted when it rejects it. Sponsored-by: Nicholas Golder-Manning on Patreon	2023-04-10 12:18:25 -04:00
Joey Hess	2ba1559a8e	git style quoting for ActionItemOther Added StringContainingQuotedPath, which is used for ActionItemOther. In the process, checked every ActionItemOther for those containing filenames, and made them use quoting. Sponsored-by: Graham Spencer on Patreon	2023-04-08 16:30:01 -04:00
Joey Hess	ac0345aa42	improve comments	2023-04-04 15:23:39 -04:00
Joey Hess	e3f5bd4ca6	Revert "override rather than setting user.name and user.email" This reverts commit `66eb63dd82`. git-annex init is the only thing that uses ensureCommit. So overriding there will make later commits to the git-annex branch or by git-annex sync fail. It's ugly that git-annex init sets user.name and user.email, but it only does it on systems that are badly configured.	2023-04-04 15:15:02 -04:00
Joey Hess	e91bf784cd	Support user.useConfigOnly git config When it's set and git cannot determine user.name or user.email, this will result in git-annex init failing when committing to create the git-annex branch. Other git-annex commands that commit can also fail. Sponsored-by: Jack Hill on Patreon	2023-04-04 15:12:52 -04:00
Joey Hess	66eb63dd82	override rather than setting user.name and user.email Avoid setting user.name and user.email in the git config when git is unable to detect them. git-annex has good reason to want to ensure git commit succeeds when eg committing to the git-annex branch. But it's not playing nice to set these values where other commands can see them. Sponsored-by: Brett Eisenberg on Patreon	2023-04-04 14:56:44 -04:00
Joey Hess	3eb51ee929	readFileStrict to avoid laziness bug Fix laziness bug introduced in last release that breaks use of --unlock-present and --hide-missing adjusted branches. Since there is a writeFile of the same file immediately after readFile, it may still have the file open for read (or may have happened to read it already and closed it). I was not able to reproduce the problem in brief testing, but this seems obvious. Sponsored-by: Luke Shumaker on Patreon	2023-04-04 14:25:01 -04:00
Joey Hess	22091d4765	fix comment	2023-03-28 13:40:17 -04:00
Joey Hess	a5709dcc22	Copy with a reflink when exporting a tree to a directory special remote Remote.Directory makes a temp file, then calls this, and since the temp file exists, it prevented probing if CoW works. Note that deleting the empty file does mean there's a small window for a race. If another process is also exporting to the remote, that could let it make the same temp file. However, the temp filename actually has the processes's pid in it, which avoids that being a problem. This may have been a reversion caused by commits around `63d508e885`, but I haven't gone back and tested to be sure. The directory special remote had supposedly supported CoW for this going back to about half a year before that. Sponsored-by: Graham Spencer on Patreon	2023-03-28 13:09:14 -04:00
Joey Hess	24ae4b291c	addurl, importfeed: Fix failure when annex.securehashesonly is set The temporary URL key used for the download, before the real key is generated, was blocked by annex.securehashesonly. Fixed by passing the Backend that will be used for the final key into runTransfer. When a Backend is provided, have preCheckSecureHashes check that, rather than the key being transferred. Sponsored-by: unqueued on Patreon	2023-03-27 15:10:46 -04:00
Joey Hess	cb6cb61ca1	avoid build warning on windows	2023-03-27 12:20:35 -04:00
Joey Hess	291ad8f6b2	avoid build warning on windows	2023-03-27 12:19:26 -04:00
Joey Hess	2b5fa091e2	annex.maxextensionlength for view view: Support annex.maxextensionlength when generating filenames for the view branch. Note that refining an existing view will reuse the extension length that was configured when initially constructing the view. This is necessarily the case because it reuses the filenames. Also view files used to have all extensions at the end, no matter how many there were. Since annex.maxextensionlength's documentation includes that it's limited to 2 extensions, I made it consistent with that. Sponsored-by: k0ld on Patreon	2023-03-24 14:01:38 -04:00
Joey Hess	038a2600f4	Avoid leaving repo with a detached head when there is a failure checking out an updated adjusted branch I don't know of scenarios where that can happen (besides the bug fixed by the parent commit), but there probably are some. Sponsored-by: Boyd Stephen Smith Jr. on Patreon	2023-03-23 16:36:43 -04:00
Joey Hess	cb4d9f7b1f	run restagePointerFiles in adjustedBranchRefreshFull Avoid failure to update adjusted branch --unlock-present after git-annex drop when annex.adjustedbranchrefresh=1 At higher values, it did flush the queue, which ran restagePointerFiles. But at 1, adjustedBranchRefreshFull gets added to the queue, and while restagePointerFiles is also in the queue, it runs after that. Sponsored-by: Brock Spratlen on Patreon	2023-03-23 16:25:45 -04:00
Joey Hess	e822df2a09	fix build warnings on windows	2023-03-21 18:41:23 -04:00
Yaroslav Halchenko	84b0a3707a	Apply codespell -w throughout	2023-03-17 15:14:58 -04:00
Yaroslav Halchenko	0ae5ff797f	Typo: sansative -> sensitive	2023-03-17 15:14:50 -04:00
Yaroslav Halchenko	e018ae1125	Fix ambiguous typos	2023-03-17 15:14:47 -04:00
Joey Hess	54ad1b4cfb	Windows: Support long filenames in more (possibly all) of the code Works around this bug in unix-compat: https://github.com/jacobstanley/unix-compat/issues/56 getFileStatus and other FilePath using functions in unix-compat do not do UNC conversion on Windows. Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the necessary conversion on windows to support long filenames. Audited all imports of System.PosixCompat.Files to make sure that no functions that operate on FilePath were imported from it. Instead, use the equivalents from Utility.RawFilePath. In particular the re-export of that module in Common had to be removed, which led to lots of other changes throughout the code. The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig make Utility.Directory not be needed to build setup. And so let it use Utility.RawFilePath, which depends on unix, which cannot be in setup-depends. Sponsored-by: Dartmouth College's Datalad project	2023-03-01 15:55:58 -04:00
Joey Hess	bb54c8a633	support --hide-missing adjustment of view branches I had thought this would not make sense to combine with view branches, since removing files from a view changes metadata. However, that's committing removal of files. With --hide-missing, the files get removed when git-annex updates the branch itself, so there is no conflict. It does not seem likely to be very useful, but it does work! And that's nice because it means all types of adjusted branches can be combined with view branches. Sponsored-by: Max Thoursie on Patreon	2023-02-27 15:39:58 -04:00
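
The mojibake fix described in commit a576fc3b12 can be illustrated with a small sketch. This is not git-annex's code (which is Haskell); it is a Python approximation, and the function names `is_control_like_haskell`, `is_low_control_or_del`, and `strip_bytes` are hypothetical. It assumes the old behaviour amounted to dropping every byte that falls in the ranges Haskell's `isControl` accepts when each byte is treated as a character, and that the new behaviour drops only the low control characters and DEL.

```python
def is_control_like_haskell(b: int) -> bool:
    # Approximates isControl applied to a single byte treated as a Char:
    # C0 controls (0-31), DEL (127), and C1 controls (128-159).
    return b < 0x20 or 0x7F <= b <= 0x9F


def is_low_control_or_del(b: int) -> bool:
    # The narrower test from the commit message: only the low control
    # characters and DEL count as control characters.
    return b < 0x20 or b == 0x7F


def strip_bytes(data: bytes, is_control) -> bytes:
    # Drop every byte that the given predicate classifies as a control.
    return bytes(b for b in data if not is_control(b))


heart = "💕".encode("utf-8")  # the bytes 240, 159, 146, 149

# Byte-wise filtering with the wide predicate removes the UTF-8
# continuation bytes and leaves only the lead byte.
mangled = strip_bytes(heart, is_control_like_haskell)

# The narrow predicate leaves multi-byte UTF-8 sequences alone while
# still removing ESC (27) and DEL (127).
intact = strip_bytes(heart, is_low_control_or_del)
escaped = strip_bytes(b"a\x1b[31mb\x7f", is_low_control_or_del)
```

Every UTF-8 continuation byte lies in the range 128-191, so a byte-wise filter that treats 128-159 as controls will damage some multi-byte characters; restricting the filter to bytes below 32 plus 127 avoids that without letting terminal escape sequences through.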


1971 commits