git-annex

Author	SHA1	Message	Date
Joey Hess	5cc8d9d03b	replace removeLink with removeFile removeFile calls unlink so removes anything not a directory. So these are replaceable in order to convert to OsPath.	2025-02-02 14:16:58 -04:00
Joey Hess	71195cce13	more OsPath conversion Sponsored-by: k0ld	2025-02-01 14:06:38 -04:00
Joey Hess	474cf3bc8b	more OsPath conversion Sponsored-by: Brock Spratlen	2025-02-01 11:54:19 -04:00
Joey Hess	9b79f0f43d	use file-io for readFile/writeFile/appendFile on ByteStrings These are all straightforward, and easy small performance wins. Sponsored-by: Nicholas Golder-Manning	2025-01-22 14:30:25 -04:00
Joey Hess	793ddecd4b	use openTempFile from file-io And follow-on changes. Note that relatedTemplate was changed to operate on a RawFilePath, and so when it counts the length, it is now the number of bytes, not the number of code points. This will just make it truncate shorter strings in some cases, the truncation is still unicode aware. When not building with the OsPath flag, toOsPath . fromRawFilePath and fromRawFilePath . fromOsPath do extra conversions back and forth between String and ByteString. That overhead could be avoided, but that's the non-optimised build mode, so didn't bother. Sponsored-by: unqueued	2025-01-22 11:41:43 -04:00
Joey Hess	c7cca43ab0	RawFilePath conversion for Utility.Directory.Stream	2025-01-20 19:25:52 -04:00
Joey Hess	1ceece3108	RawFilePath conversion of System.Directory By using System.Directory.OsPath, which takes and returns OsString, which is a ShortByteString. So, things like dirContents currently have the overhead of copying that to a ByteString, but that should be less than the overhead of using Strings which often in turn were converted to RawFilePaths. Added Utility.OsString and the OsString build flag. That flag is turned on in the stack.yaml, and will be turned on automatically by cabal when built with new enough libraries. The stack.yaml change is a bit ugly, and that could be reverted for now if it causes any problems. Note that Utility.OsString.toOsString on windows is avoiding only a check of encoding that is documented as being unlikely to fail. I don't think it can fail in git-annex; if it could, git-annex didn't contain such an encoding check before, so at worst that should be a wash.	2025-01-20 19:17:33 -04:00
Joey Hess	a73fa77417	added hooks corresponding to annex.*-command * Added freezecontent-annex and thawcontent-annex hooks that correspond to the git configs annex.freezecontent and annex.thawcontent. * Added secure-erase-annex hook that corresponds to the git config annex.secure-erase-command. * Added commitmessage-annex hook that corresponds to the git config annex.commitmessage-command. * Added http-headers-annex hook that corresponds to the git config annex.http-headers-command. that correspond to the post-update-annex and pre-commit-annex hooks. The use case for these is eg, setting up a git repository that is run in a container, where the easiest way to provide a script is by putting it in .git/hooks/, rather than copying it into the container in a way that puts it in PATH. This is all the ones that make sense to add for annex.*-config git configs. annex.youtube-dl-command is not a hook, it's telling git-annex what command to run. So is annex.shared-sop-command. So omitted those. May later also want to add hooks corresponding to `remote.<name>.annex-cost-command` etc. Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project	2025-01-10 14:54:42 -04:00
Joey Hess	5df1b2b36e	configs annex.post-update-command and annex.pre-commit-command Added git configs annex.post-update-command and annex.pre-commit-command that correspond to the git-annex hook scripts post-update-annex and pre-commit-annex. Note that the hook files take precedence over the git config, since the git config can include global config which should be overridden by local config. These new git configs are probably not super useful. Especially the pre-commit-annex hook is there to install scripts to instead of the pre-commit hook, since git-annex installs that hook itself. So why would someone want to use a git config for that? Only reason I can think of would be in a global git config. Or possibly because it's easier to set a git config than write a hook script, on an OS like Windows. The real reason I'm adding these is as groundwork for making other annex.*-command git configs also be available as hook scripts. I want to avoid having some things available as only git hooks and others as both gitconfigs and git hooks. (It seems that some annex.*-command configs don't translate to git hooks though.) In the man page, moved documentation of the hooks to be next to the documentation of the git configs. This is to avoid repetition.	2025-01-10 13:27:51 -04:00
Joey Hess	db89e39df6	partially fix concurrency issue in updating the rollingtotal It's possible for two processes or threads to both be doing the same operation at the same time. Eg, both dropping the same key. If one finishes and updates the rollingtotal, then the other one needs to be prevented from later updating the rollingtotal as well. And they could finish at the same time, or with some time in between. Addressed this by making updateRepoSize be called with the journal locked, and only once it's been determined that there is an actual location change to record in the log. updateRepoSize waits for the database to be updated. When there is a redundant operation, updateRepoSize won't be called, and the redundant LiveUpdate will be removed from the database on garbage collection. But: There will be a window where the redundant LiveUpdate is still visible in the db, and processes can see it, combine it with the rollingtotal, and arrive at the wrong size. This is a small window, but it still ought to be addressed. Unsure if it would always be safe to remove the redundant LiveUpdate? Consider the case where two drops and a get are all running concurrently somehow, and the order they finish is [drop, get, drop]. The second drop seems redundant to the first, but it would not be safe to remove it. While this seems unlikely, it's hard to rule out that a get and drop at different stages can both be running at the same time.	2024-08-26 09:43:32 -04:00
Joey Hess	06064f897c	update Annex.reposizes when changing location logs The live update is only needed when Annex.reposizes has already been populated.	2024-08-15 13:27:14 -04:00
Joey Hess	63a3cedc45	slightly improve hairy types	2024-08-14 16:04:18 -04:00
Joey Hess	3e6eb2a58d	implement journalledRepoSizes Plan is to run this when populating Annex.reposizes on demand. So Annex.reposizes will be up-to-date with the journal, including crucially journal entries for private repositories. But also anything that has been written to the journal by another process, especially if the process was run with annex.alwayscommit=false. From there, Annex.reposizes can be kept up to date with changes made by the running process.	2024-08-14 13:53:24 -04:00
Joey Hess	8ac2685b33	calcBranchRepoSizes without journal files This will be used to prime the RepoSizes database, which will always contain values that correspond to information in the git-annex branch, so without anything from journal files. Factored out overJournalFileContents which will later be used to update Annex.reposizes to include information from journal files. This will be particularly important to support private UUIDs which only ever get to journal files and not to the branch.	2024-08-14 03:19:30 -04:00
Joey Hess	5afbea25e7	avoid counting size of keys that are in the journal twice In calcRepoSizes and also git-annex info, when a key was in the journal, it was passed to the callback twice, so the calculated size was wrong.	2024-08-13 13:23:39 -04:00
Joey Hess	467d80101a	improve handling of unmerged git-annex branches in readonly repo git-annex info was displaying a message that didn't make sense in context. In calcRepoSizes, it seems better to return the information from the git-annex branch, rather than giving up. Especially since balanced preferred content uses it, and we can't just give up evaluating a preferred content expression if git-annex is to be usable in such a readonly repo. Commit `6d7ecd9e5d` nobly wanted git-annex to behave the same with such unmerged branches as it does when it can merge them. But for the purposes of preferred content, it seems to me there's a sense that such an unmerged branch is the same as a remote we have not pulled from. The balanced preferred content will either way operate under outdated information, and so make not the best choices.	2024-08-13 13:13:12 -04:00
Joey Hess	bd3d327d8a	smarter BranchState cache invalidation Only invalidate a just-written file in the cache, not the whole cache. This will avoid the possible performance impact of cache invalidation mentioned in commit `770aac97a7`.	2024-07-28 12:33:32 -04:00
Joey Hess	25a6ab6f11	Avoid grafting in export tree objects that are missing They could be missing due to an interrupted git-annex at just the wrong time during a prior graft, after which the tree objects got garbage collected. Or they could be missing because of manual messing with the git-annex branch, eg resetting it to back before the graft commit. Sponsored-by: Dartmouth College's OpenNeuro project	2024-06-07 16:51:50 -04:00
Joey Hess	b32c4c2e98	atomic git-annex branch update when regrafting in transition Fix a bug where interrupting git-annex while it is updating the git-annex branch could lead to git fsck complaining about missing tree objects. Interrupting git-annex while regraftexports is running in a transition that is forgetting git-annex branch history would leave the repository with a git-annex branch that did not contain the tree shas listed in export.log. That lets those trees be garbage collected. A subsequent run of the same transition then regrafts the trees listed in export.log into the git-annex branch. But those trees have been lost. Note that both sides of `if neednewlocalbranch` are atomic now. I had thought only the True side needed to be, but I do think there may be cases where the False side needs to be as well. Sponsored-by: Dartmouth College's OpenNeuro project	2024-06-07 16:34:10 -04:00
Joey Hess	adcebbae47	clean up git-remote-annex git-annex branch handling Implemented alternateJournal, which git-remote-annex uses to avoid any writes to the git-annex branch while setting up a special remote from an annex:: url. That prevents the remote.log from being overwritten with the special remote configuration from the url, which might not be 100% the same as the existing special remote configuration. And it prevents an overwrite deleting of other stuff that was already in the remote.log. Also, when the branch was created by git-remote-annex, only delete it at the end if nothing else has been written to it by another command. This fixes the race condition described in `797f27ab05`, where git-remote-annex set up the branch and git-annex init and other commands were run at the same time and their writes to the branch were lost.	2024-05-15 17:33:38 -04:00
Joey Hess	2c73845d90	multiple -m second try Test suite passes this time. When committing the adjusted branch, use the old method to make a message that old git-annex can consume. Also made the code accept the new message, so that eventually commitTreeExactMessage can be removed. Sponsored-by: Kevin Mueller on Patreon	2024-04-09 12:56:47 -04:00
Joey Hess	a8dd85ea5a	Revert "multiple -m" This reverts commit `cee12f6a2f`. This commit broke git-annex init run in a repo that was cloned from a repo with an adjusted branch checked out. The problem is that findAdjustingCommit was not able to identify the commit that created the adjusted branch. It seems that there is an extra "\n" at the end of the commit message that it does not expect. Since backwards compatibility needs to be maintained, cannot just make findAdjustingCommit accept it with the "\n". Will have to instead have one commitTree variant that uses the old method, and use it for adjusted branch committing.	2024-04-02 17:29:07 -04:00
Joey Hess	cee12f6a2f	multiple -m sync, assist, import: Allow -m option to be specified multiple times, to provide additional paragraphs for the commit message. The option parser didn't allow multiple -m before, so there is no risk of behavior change breaking something that was for some reason using multiple -m already. Pass through to git commands, so that the method used to assemble the paragraphs is whatever git does. Which might conceivably change in the future. Note that git commit-tree has supported -m since git 1.7.7. commitTree was probably not using it since it predates that version. Since the configure script prevents building git-annex with git older than 2.1, there is no risk that it's not supported now. Sponsored-by: Nicholas Golder-Manning on Patreon	2024-03-27 15:58:27 -04:00
Joey Hess	a69871491f	avoid build warning on windows Since append was only exported by Annex.Common on unix, excluding it from import caused a build warning on windows.	2024-03-26 13:16:33 -04:00
Joey Hess	68e99513f0	added annex.commitmessage-command config Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project	2024-02-12 14:35:22 -04:00
Joey Hess	2114253eaf	update comment The segfault seems to be fixed with git 2.43, I'm not sure what the affected range was.	2024-01-20 11:25:22 -04:00
Joey Hess	f1ce15036f	started migrate --update This is most of the way there, but not quite working. The layout of migrate.tree/ needs to be changed to follow this approach. git log will list all the files in tree order, so the new layout needs to alternate old and new keys. Can that be done? git may not document tree order, or may not preserve it here. Alternatively, change to using git log --format=raw and extract the tree header from that, then use git diff --raw $tree:migrate.tree/old $tree:migrate.tree/new That will be a little more expensive, but only when there are lots of migrations. Sponsored-by: Joshua Antonishen on Patreon	2023-12-07 15:50:52 -04:00
Joey Hess	be6b56df4c	remove unused import	2023-11-01 13:14:39 -04:00
Joey Hess	eb42935e58	Windows: Fix CRLF handling in some log files In particular, the mergedrefs file was written with CR added to each line, but read without CRLF handling. This resulted in each update of the file adding CR to each line in it, growing the number of lines, while also preventing the optimisation from working, so it remerged unnecessarily. writeFile and readFile do NewlineMode translation on Windows. But the ByteString conversion prevented that from happening any longer. I've audited for other cases of this, and found three more (.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All of those also only prevent optimisations from working. Some other files are currently both read and written with ByteString, but old git-annex may have written them with NewlineMode translation. Other files are at risk for breakage later if the reader gets converted to ByteString. This is a minimal fix, but should be enough, as long as I remember to use fileLines when splitting a ByteString into lines. This leaves files written using ByteString without CR added, but that's ok because old git-annex has no difficulty reading such files. When the mergedrefs file has gotten lines that end with "\r\r\r\n", this will eventually clean it up. Each update will remove a single trailing CR. Note that S8.lines is still used in eg Command.Unused, where it is parsing git show-ref, and similar in Git/*. git commands don't include CR in their output so that's ok. Sponsored-by: Joshua Antonishen on Patreon	2023-10-30 14:23:23 -04:00
Joey Hess	0da1d40cd4	Improve memory use of --all when using annex.private This does not improve Annex.Branch.files at all, since it still uses ++ to combine the lists, so forcing all but the last one. But when there are a lot of files in the private journal, it does avoid --all (or a bare repo) from buffering the filenames in memory. See commit `653b719472` for prior discussion of this buffering. Sponsored-by: Graham Spencer on Patreon	2023-10-24 13:20:55 -04:00
Joey Hess	8bde6101e3	sqlite database for importfeed importfeed: Use caching database to avoid needing to list urls on every run, and avoid using too much memory. Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster, and memory use dropped from 203000k to 59408k. Database.ImportFeed is Database.ContentIdentifier with the serial number filed off. There is a bit of code duplication I would like to avoid, particularly recordAnnexBranchTree, and getAnnexBranchTree. But these use the persistent sqlite tables, so despite the code being the same, they cannot be factored out. Since this database includes the contentidentifier metadata, it will be slightly redundant if a sqlite database is ever added for metadata. I did consider making such a generic database and using it for this. But, that would then need importfeed to update both the url database and the metadata database, which is twice as much work diffing the git-annex branch trees. Or would entangle updating two databases in a complex way. So instead it seems better to optimise the database that importfeed needs, and if the metadata database is used by another command, use a little more disk space and do a little bit of redundant work to update it. Sponsored-by: unqueued on Patreon	2023-10-23 16:46:22 -04:00
Joey Hess	c268dc5878	only stage regular files from the journal git-annex only writes regular files there, but other things may drop junk like empty .DAV directories around the tree. And trying to hash such things can have weird and hard to understand effects. So it seems best to do a small amount of work in statting the journal file to make sure it's a regular file. Sponsored-by: Jack Hill on Patreon	2023-10-10 13:22:02 -04:00
Joey Hess	adda6c1088	Add git-annex remote refs that are not newer to the merged refs list Significant startup speed increase by avoiding repeatedly checking if some remote git-annex branch refs need to be merged when it is not newer. One way this could happen is when there are 2 remotes that are themselves connected. The git-annex branch on the first remote gets updated. Then the second remote pulls from the first, and merges in its git-annex branch. Then the local repo pulls from the second remote, and merges its git-annex branch. At this point, a pull from the first remote will get a git-annex branch that is not newer, but is not on the merged refs list. In my big repo, git-annex startup time dropped from 4 seconds to 0.1 seconds. There were 5 to 10 such remote refs out of 18 remotes. Sponsored-by: Graham Spencer on Patreon	2023-08-09 13:31:36 -04:00
Joey Hess	8b6c7bdbcc	filter out control characters in all other Messages This does, as a side effect, make long notes in json output not be indented. The indentation is only needed to offset them underneath the display of the file they apply to, so that's ok. Sponsored-by: Brock Spratlen on Patreon	2023-04-11 12:58:01 -04:00
Joey Hess	cd544e548b	filter out control characters in error messages giveup changed to filter out control characters. (It is too low level to make it use StringContainingQuotedPath.) error still does not, but it should only be used for internal errors, where the message is not attacker-controlled. Changed a lot of existing error to giveup when it is not strictly an internal error. Of course, other exceptions can still be thrown, either by code in git-annex, or a library, that include some attacker-controlled value. This does not guard against those. Sponsored-by: Noam Kremen on Patreon	2023-04-10 13:50:51 -04:00
Yaroslav Halchenko	84b0a3707a	Apply codespell -w throughout	2023-03-17 15:14:58 -04:00
Joey Hess	f09e299156	rawfilepath conversion	2023-02-27 15:06:32 -04:00
Joey Hess	a23fd7349f	work around git segfault Work around bug in git 2.37 that causes a segfault when core.untrackedCache is set, and broke git-annex init. Depending on when git gets fixed and how widely the buggy versions are used, this could be reverted quite soon, or need to linger for a long time. It only makes git-annex init a tiny bit slower in a new repo. Sponsored-by: Max Thoursie on Patreon	2022-08-04 14:20:57 -04:00
Joey Hess	d905232842	use ResourcePool for hash-object handles Avoid starting an unnecessary number of git hash-object processes when concurrency is enabled. Sponsored-by: Dartmouth College's DANDI project	2022-07-25 17:32:39 -04:00
Joey Hess	4e88137a28	prevent appends except when annex.alwayscompact=false I would like for a new repo version to enable appends, but to do so safely would need a v11 followed by a 1 year delay followed by a v12 that does it. Since a similar v9 and v10 transition is currently happening, and is less than 6 months along in most repos, it does not feel wise to stack up another year-long transition behind that. What if I need to hurry up a new repo version for some other change? Added todo so I remember to make this change at some time when a v11 and probably v12 repo version do make sense. Sponsored-by: Dartmouth College's DANDI project	2022-07-20 13:23:55 -04:00
Joey Hess	6f1fd3abdd	no locking of journal on read after all Finally have a final design, and it turns out not to need locking on read.	2022-07-20 10:57:28 -04:00
Joey Hess	d0860b7f0e	fix build After `28b0aaea54`	2022-07-18 16:44:32 -04:00
Joey Hess	28b0aaea54	re-add lock journal before reading journal files This reverts commit `2e6e9876e3`. This is gonna be needed after all.. The append will only be atomic if the journal is locked, because the file being appended will have to be moved out of the way to avoid an old version of git-annex seeing an incomplete write to it. When git-annex finds that the file is not in the journal, and checks the append location, locking will be needed to avoid a race causing it to miss it in the append location too due to it being moved back to the journal.	2022-07-18 16:40:25 -04:00
Joey Hess	36f0bdcd57	add annex.alwayscompact Added annex.alwayscompact setting which can be unset to speed up writes to the git-annex branch in some cases. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 16:39:19 -04:00
Joey Hess	de18d92de6	efficient but unsafe journal file append This is only for checking performance, it's not safe. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 14:17:12 -04:00
Joey Hess	2e6e9876e3	Revert "lock journal before reading journal files" This reverts commit `47358a6f95`. This added overhead, and will not be needed, because appends are going to have to be made atomic for other reasons than avoiding incomplete reads of data being appended. In particular, when git-annex is interrupted in the middle of an append, it must not leave the file with a partially written line. So appending has to somehow be made fully atomic.	2022-07-18 13:38:12 -04:00
Joey Hess	ce455223df	split out appending to journal from writing, high level only Currently this is not an improvement, but it allows for optimising appendJournalFile later. With an optimised appendJournalFile, this will greatly speed up access patterns like git-annex addurl of a lot of urls to the same key, where the log file can grow rather large. Appending rather than re-writing the journal file for each line can save a lot of disk writes. It still has to read the current journal or branch file, to check if it can append to it, and so when the journal file does not exist yet, it can write the old content from the branch to it. Probably the re-reads are better cached by the filesystem than repeated writes. (If the re-reads turn out to keep performance bad, they could be eliminated, at the cost of not being able to compact the log when replacing old information in it. That could be enabled by a switch.) While the immediate need is to affect addurl writes, it was implemented at the level of presence logs, so will also perhaps speed up location logs. The only added overhead is the call to isNewInfo, which only needs to compare ByteStrings. Helping to balance that out, it avoids compactLog when it's able to append. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 13:22:50 -04:00
Joey Hess	47358a6f95	lock journal before reading journal files This is not currently necessary; journal files are updated atomically. However, for faster appends to large journal files, locking on read will be needed, because appends are not atomic. Sponsored-by: Dartmouth College's DANDI project	2022-07-15 14:43:29 -04:00
Joey Hess	1b680d330b	revert accidental change	2022-07-13 15:17:08 -04:00
Joey Hess	68e9b7f987	comment	2022-07-13 13:44:43 -04:00

263 commits