Fix bug in handling of linked worktrees on filesystems not supporting
symlinks, which caused annexed file content to be stored in the wrong
location inside the git directory, and also caused pointer files to not get
populated.
This parameterizes functions in Annex.Locations with a GitLocationMaker.
The uses of standardGitLocationMaker are in cases where the path returned
by a function should not change when in a linked worktree. For example,
gitAnnexLink uses standardGitLocationMaker because symlink targets should
always be to ".git/annex/objects" paths, even when in a linked worktree.
Hopefully I have gotten all uses of standardGitLocationMaker right.
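To illustrate, here is a minimal sketch of the parameterization. Only
GitLocationMaker and standardGitLocationMaker are names from this
commit; the Repo fields and the location function are hypothetical
stand-ins.

    import System.FilePath ((</>))

    data Repo = Repo
        { mainGitDir :: FilePath      -- the main repository's git directory
        , worktreeGitDir :: FilePath  -- .git/worktrees/<name> when linked
        }

    -- Chooses which git directory a location is constructed under.
    type GitLocationMaker = Repo -> FilePath

    -- Always uses the main git directory, so eg symlink targets point
    -- at ".git/annex/objects" even inside a linked worktree.
    standardGitLocationMaker :: GitLocationMaker
    standardGitLocationMaker = mainGitDir

    -- A location function parameterized over the GitLocationMaker.
    gitAnnexObjectDir :: GitLocationMaker -> Repo -> FilePath
    gitAnnexObjectDir mk r = mk r </> "annex" </> "objects"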
This also assumes that all path construction to the annex directory
is done via the functions in Annex.Locations, and that there is no
ad-hoc construction elsewhere. Thankfully, Annex.Locations has been around
since the beginning, and has been used consistently. I think.
---
In fixupUnusualRepos, when symlinks are supported, the .git file is replaced
with a symlink to the linked worktree git directory. And in that directory,
an "annex" symlink points to the main annex directory. In that case,
it's not necessary to set mainWorkTreePath. It would be ok to set it,
but leaving it unset allows an optimisation: avoiding reading
the "commondir" file.
The change to make fixupUnusualRepos set mainWorkTreePath when the
repository is not initialized yet is done in case the initialization itself
writes to the annex directory. If that were the case, without setting
mainWorkTreePath, the annex symlink would not be set up yet, and so
it might have created the annex directory in the wrong place. That did
not actually happen before, but now that mainWorkTreePath is available,
using it here avoids any such problem later.
---
This commit does not deal with the mess of a worktree that has
experienced this bug before. In particular, if `git-annex get` were
run in such a worktree, it would have stored the object files in the
linked worktree's git directory, rather than in the main git directory.
Such misplaced object files need to be dealt with; the plan is to make
git-annex fsck notice and fix them.
A worktree that has experienced this bug before will contain unpopulated
pointer files. Those may eventually get fixed up in regular usage of
git-annex, but git-annex fsck will also fix them up.
---
Finally, this has me pondering if all of git-annex's state files should
really be stored in one common place across all linked worktrees. Should
state files that are specific to the worktree perhaps be stored
per-worktree?
That has not been the case when using git-annex on filesystems supporting
symlinks, but it *has* been the case on filesystems not supporting
symlinks. Perhaps this leads to some other buggy behavior in some cases.
Or perhaps to extra work being done.
For example, the keys database has an associated files table, which
depends on the worktree. But reconcileStaged updates that table, so when
git-annex is used first in one worktree and then in another, it will
update the table to reflect the current worktree. That is extra work
each time a different worktree is used. But also, what if two git-annex
processes are running at the same time, in separate worktrees? Probably
this needs more thought and investigation.
So there is a risk that this commit exposes such buggy behavior in a
situation where it didn't happen before, due to the filesystem not
supporting symlinks. But, given how much this bug crippled using linked
worktrees in such a situation, I doubt that many people have been doing
that.
The test suite was failing because of a bug in the Database/* modules.
I had replaced doesPathExist with doesDirectoryExist, but the path
being checked was the database file, so the check always failed.
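The fix is just to use the existence check that matches the kind of
path; a trivial sketch with a hypothetical name:

    import System.Directory (doesFileExist)

    -- The database is a file, so doesDirectoryExist was always False
    -- for it; doesFileExist is the right check.
    databaseExists :: FilePath -> IO Bool
    databaseExists = doesFileExist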
I have audited commit f1ba21d698 for
other changes to doesPathExist, and checked that doesDirectoryExist and
doesFileExist were used correctly.
The only change I found is in youtubeDl', where it used to return
directories that might have been created by youtube-dl. But it was
supposed to return media files, so changing it to use doesFileExist is
actually an improvement. Although only of theoretical benefit.
Note that it would actually be possible to keep using doesPathExist;
there is a version of it for OsPath as well. But the rest of these
changes seem safe.
Sponsored-by: Nicholas Golder-Manning
However, filepath-bytestring is still in Setup-Depends.
That's because Utility.OsPath uses it when not built with OsPath.
It might be possible to make Utility.OsPath fall back to using
filepath, and eliminate that dependency too, but it would mean either
wrapping all of System.FilePath's functions, or using `type OsPath = FilePath`.
Annex.Import uses ifdefs to avoid converting back to FilePath when not
on Windows. On Windows it's a bit slower due to that conversion.
Utility.Path.Windows.convertToWindowsNativeNamespace got a bit
slower too, but it's not really worth optimising, I think.
Note that importing Utility.FileSystemEncoding at the same time as
System.Posix.ByteString will result in conflicting definitions for
RawFilePath. filepath-bytestring avoids that by importing RawFilePath
from System.Posix.ByteString, but that's not possible in
Utility.FileSystemEncoding, since Setup-Depends does not include unix.
This turned out not to affect any code in git-annex though.
Sponsored-by: Leon Schuermann
And follow-on changes.
Note that relatedTemplate was changed to operate on a RawFilePath, and
so when it counts the length, it is now the number of bytes, not the
number of code points. This will just make it truncate shorter strings
in some cases; the truncation is still unicode aware.
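A sketch of what unicode aware truncation at a byte length looks like;
relatedTemplate's actual code differs, and the name here is
hypothetical:

    import qualified Data.ByteString as B
    import Data.Word (Word8)

    truncateBytesUtf8 :: Int -> B.ByteString -> B.ByteString
    truncateBytesUtf8 n b
        | B.length b <= n = b
        | otherwise = B.take (adjust (max n 0)) b
      where
        -- Back up over UTF-8 continuation bytes (0b10xxxxxx) so the
        -- cut falls on a character boundary.
        adjust i
            | i <= 0 = 0
            | iscontinuation (B.index b i) = adjust (i - 1)
            | otherwise = i
        iscontinuation :: Word8 -> Bool
        iscontinuation w = w >= 0x80 && w < 0xC0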
When not building with the OsPath flag, toOsPath . fromRawFilePath and
fromRawFilePath . fromOsPath do extra conversions back and forth between
String and ByteString. That overhead could be avoided, but that's the
non-optimised build mode, so I didn't bother.
Sponsored-by: unqueued
* Removed the i386ancient standalone tarball build for linux, which
was increasingly unable to support new git-annex features.
* Removed support for building with ghc older than 9.0.2,
and with versions of haskell libraries older than those in current
Debian stable.
* stack.yaml: Update to lts-23.2.
Note that i386ancient was targeting linux 2.6.32, which has been EOL for
over 9 years now. Any old system still using such a kernel is certainly highly
insecure. And I suspect i386ancient had its own insecurities due to haskell
libraries and C libraries not having been updated.
The test suite flagged that git-annex info in a readonly repository was
no longer working.
.git/annex/journal.lck: openFd: permission denied
This fixes it. However, in a case where .git/annex/reposize/ is
writable but .git/annex/journal/ is not, there will still be a
permission denied error. The solution would just be to use consistent
permissions, I suppose.
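A sketch of tolerating a lock file that cannot be opened, with
hypothetical names (opening the file stands in for taking the real
lock; the actual fix may differ):

    import Control.Exception (IOException, finally, try)
    import System.IO (Handle, IOMode(ReadWriteMode), hClose, openFile)

    withLockFileIfPossible :: FilePath -> IO a -> IO a
    withLockFileIfPossible lck action = do
        r <- try (openFile lck ReadWriteMode) :: IO (Either IOException Handle)
        case r of
            -- Opened it; close when done.
            Right h -> action `finally` hClose h
            -- Permission denied in a readonly repository:
            -- proceed without the lock.
            Left _ -> action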
When a live update is removing a key, it might fail. So only count those
once they have succeeded. When a live update is adding a key, count it
immediately to avoid over-filling a repo.
This also makes the 1 minute delay between stale live changes checks
more defensible, because a stale live change can only cause us to err
more on the side of caution.
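A hypothetical sketch of that asymmetric accounting (the real types in
this commit differ):

    data SizeChange = AddingKey | RemovingKey

    -- An add is counted into the apparent repo size immediately, to
    -- avoid over-filling; a remove only once it has succeeded, since
    -- it might fail.
    applyLiveChange
        :: SizeChange
        -> Bool     -- has the change succeeded?
        -> Integer  -- current size
        -> Integer  -- key size
        -> Integer
    applyLiveChange AddingKey _succeeded sz keysz = sz + keysz
    applyLiveChange RemovingKey True sz keysz = sz - keysz
    applyLiveChange RemovingKey False sz _keysz = sz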
Reorganized the reposize database directory, and split up a column.
checkStaleSizeChanges needs to run before needLiveUpdate,
otherwise the process won't be holding a lock on its pid file, and
another process could go in and expire the live update it records. It
just so happens that they do get called in the correct order, since
checking balanced preferred content calls getLiveRepoSizes before
needLiveUpdate.
The 1 minute delay between checks is arbitrary, but will avoid excess
work. The downside of it is that, if a process is dropping a file and
gets interrupted, for 1 minute another process can expect a repository
will soon be smaller than it is. And so a process might send data to a
repository when a file is not really going to be dropped from it. But
note that this can already happen if a drop takes some time in eg locking and
then fails. So it seems possible that live updates should only be
allowed to increase, rather than decrease, the size of a repository.
This makes sure that two threads don't check balanced preferred content at the
same time, so each thread always sees a consistent picture of what is
happening.
This does add a fairly expensive file level lock to every check of
preferred content, in commands that use prepareLiveUpdate. It would
be good to only do that when live updates are actually needed, eg when
the preferred content expression uses balanced preferred content.
Fixed successfullyFinishedLiveSizeChange to not update the rolling total
when a redundant change is in RecentChanges.
Made setRepoSizes clear RecentChanges that are no longer needed.
It might be possible to clear those earlier; this is only a convenient
point to do it.
The reason it's safe to clear RecentChanges here is that, in order for a
live update to call successfullyFinishedLiveSizeChange, a change must be
made to a location log. If a RecentChange gets cleared, and just after
that a new live update is started, making the same change, the location
log has already been changed (since the RecentChange existed), and
so when the live update succeeds, it won't call
successfullyFinishedLiveSizeChange. The reason it doesn't
clear RecentChanges when there is a redundant live update is that
I didn't want to think through whether or not all races are avoided in
that case.
The rolling total in SizeChanges is never cleared. Instead,
calcJournalledRepoSizes gets the initial value of it, and then
getLiveRepoSizes subtracts that initial value from the current value.
Since the rolling total can only be updated by updateRepoSize,
which is called with the journal locked, locking the journal in
calcJournalledRepoSizes ensures that the database does not change while
reading the journal.
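As a small worked sketch of that subtraction (names hypothetical):

    liveRepoSize
        :: Integer  -- size from calcJournalledRepoSizes
        -> Integer  -- rolling total at the time of that calculation
        -> Integer  -- current rolling total
        -> Integer
    liveRepoSize journalledsize total0 totalnow =
        journalledsize + (totalnow - total0)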
When a live size change completes successfully, the same transaction
that removes it from the database updates the rolling total for its
repository.
The idea is that when RepoSizes is read, SizeChanges will be as
well, and cached locally. Any time a change is made, the local cache
will be updated. So by comparing the local cache with the current
SizeChanges, it can learn about size changes that were made by other
processes. Then read the LiveSizeChanges, and add that in to get a live
picture of the current sizes.
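A hypothetical sketch of that comparison; the delta between the current
SizeChanges and the cached copy is what other processes changed:

    import qualified Data.Map.Strict as M

    otherProcessChanges
        :: Ord repo
        => M.Map repo Integer  -- SizeChanges now
        -> M.Map repo Integer  -- SizeChanges as cached at read time
        -> M.Map repo Integer
    otherProcessChanges now cached =
        M.unionWith (+) now (M.map negate cached)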
Also added a SizeChangeId. This allows 2 different threads, or
processes, to both record a live size change for the same repo and key,
and update their own information without stepping on one another's toes.
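Sketched hypothetically, the keying looks something like this (the real
SizeChangeId differs):

    import qualified Data.Map.Strict as M

    newtype UUID = UUID String deriving (Eq, Ord, Show)
    newtype Key = Key String deriving (Eq, Ord, Show)
    newtype SizeChangeId = SizeChangeId Int deriving (Eq, Ord, Show)

    -- Keyed per (repo, key, id): each thread or process updates only
    -- its own entries, so concurrent live changes for the same repo
    -- and key do not clobber each other.
    type LiveSizeChanges = M.Map (UUID, Key, SizeChangeId) Integer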
Each command that first checks preferred content (and/or required
content) and then does something that can change the sizes of
repositories needs to call prepareLiveUpdate, and plumb it through the
preferred content check and the location log update.
So far, only Command.Drop is done. Many other commands that don't need
to do this have been updated to keep working.
There may be some calls to NoLiveUpdate in places where a live update
should actually be plumbed through. All will need to be double-checked.
Not currently in a compilable state.
The use of catObjectStream is optimally fast. Although it might be
possible to combine this with git-annex branch merge to avoid some
redundant work.
Benchmarking, a git-annex branch that had 100000 files changed
took less than 1.88 seconds to run through this.
Including locking on creation, handling of permissions errors, and
setting repo sizes.
I'm confident that locking is not needed while using this database,
since writes happen in a single transaction. When there are two writers
that are recording sizes based on different git-annex branch commits,
one will overwrite what the other one recorded. Which is fine; it's only
necessary that the database stays consistent with the content of a
git-annex branch commit.
Plan is to run this when populating Annex.reposizes on demand.
So Annex.reposizes will be up-to-date with the journal, including
crucially journal entries for private repositories. But also
anything that has been written to the journal by another process,
especially if the process was run with annex.alwayscommit=false.
From there, Annex.reposizes can be kept up to date with changes made
by the running process.
This will be used to prime the RepoSizes database, which will always
contain values that correspond to information in the git-annex branch, so
without anything from journal files.
Factored out overJournalFileContents which will later be used to
update Annex.reposizes to include information from journal files.
This will be particularly important to support private UUIDs, which only
ever get to journal files and not to the branch.
git-annex info was displaying a message that didn't make sense in
context.
In calcRepoSizes, it seems better to return the information from the
git-annex branch, rather than giving up. Especially since balanced
preferred content uses it, and we can't just give up evaluating a
preferred content expression if git-annex is to be usable in such a
readonly repo.
Commit 6d7ecd9e5d nobly wanted git-annex
to behave the same with such unmerged branches as it does when it can
merge them. But for the purposes of preferred content, it seems to me
there's a sense that such an unmerged branch is the same as a remote we
have not pulled from. Either way, balanced preferred content will
operate under outdated information, and so may not make the best choices.
The idea is that upon a merge of the git-annex branch, or a commit to
the git-annex branch, the reposize database will be updated. So it
should always accurately reflect the location log sizes, but it will
often be behind the actual current sizes.
Annex.reposizes will start with the value from the database, and get
updated with each transfer, so it will reflect a process's best
understanding of the current sizes.
When there are multiple processes all transferring to the same repo,
Annex.reposizes will not reflect transfers made by the other processes
since the current process started. So when using balanced preferred
content, it may make suboptimal choices, including trying to transfer
content to the repo when another process has already filled it up.
But this is the same as if there are multiple processes running on
different machines, so it is acceptable. The reposize will eventually
get an accurate value reflecting changes made by other processes or in
other repos.
Fix a crash opening sqlite databases when run in a non-unicode locale,
with a remote that uses a non-unicode filepath. In that situation
converting to Text fails.
The fix needs git-annex to be built with persistent-sqlite 2.13.3.
Building against older versions still works, but that version is used when
building with stack.
Database.RawFilePath is a lot of code copied from persistent-sqlite and
lightly modified, since only 1 function in persistent-sqlite was made to
support RawFilePath. This is a bit of a pain, and I hope that
persistent-sqlite will eventually switch to using OsPath, allowing this
module to be removed from git-annex.
Sponsored-by: k0ld on Patreon
Avoid conversion from ByteString to String for urls that will just be
converted right back to ByteString to go into the database.
Also setTempUrl is not used by importfeed, so avoid checking for temp
urls in this code path.
This benchmarks as only a small improvement: from 2.99s to 2.78s
when populating a database with 33k urls.
Note that it does not seem worth replacing URLString with URLByteString
generally, because the ways urls are used all entail either parseURI,
which takes a string, or passing a parameter to eg curl, which also is
currently a string.
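For example, the boundary where bytes must become a String anyway
(parseURI is from the network-uri package; the wrapper name is
hypothetical):

    import qualified Data.ByteString.Char8 as B8
    import Network.URI (URI, parseURI)

    parseUrlBytes :: B8.ByteString -> Maybe URI
    parseUrlBytes = parseURI . B8.unpack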
Sponsored-by: Leon Schuermann on Patreon
importfeed: Use caching database to avoid needing to list urls on every
run, and avoid using too much memory.
Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster,
and memory use dropped from 203000k to 59408k.
Database.ImportFeed is Database.ContentIdentifier with the serial number
filed off. There is a bit of code duplication I would like to avoid,
particularly recordAnnexBranchTree, and getAnnexBranchTree. But these use
the persistent sqlite tables, so despite the code being the same, they
cannot be factored out.
Since this database includes the contentidentifier metadata, it will be
slightly redundant if a sqlite database is ever added for metadata. I
did consider making such a generic database and using it for this. But,
that would then need importfeed to update both the url database and the
metadata database, which is twice as much work diffing the git-annex
branch trees. Or it would entangle updating two databases in a complex way.
So instead it seems better to optimise the database that
importfeed needs, and if the metadata database is used by another command,
use a little more disk space and do a little bit of redundant work to
update it.
Sponsored-by: unqueued on Patreon
Currently it only displays explanations of options like --in and --copies.
In the future, it should explain preferred content expression evaluation
and other decisions.
The explanations of a few things could be better. In particular,
"standard" will just appear as-is (or as "!standard" if it doesn't
match), rather than explaining why the standard preferred content expression
for the group matches or not.
Currently as implemented, it goes to stdout, and so commands like
git-annex find that have custom output will not display --explain
information. Perhaps that should change, dunno.
Sponsored-by: Dartmouth College's DANDI project
Optimise database to further speed up importing large trees from special
remotes.
See comment for details of why the other index didn't help cid queries.
It would probably be better to manually create an index on only cid, rather
than adding a second uniqueness constraint that is a larger index. But
persistent does not support creating indexes, and an attempt to manually add
it to the migration failed.
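For illustration, a manually created cid-only index would look
something like this via rawExecute (the table and column names here are
hypothetical, and as noted, wiring this into the migration failed):

    {-# LANGUAGE OverloadedStrings #-}
    import Database.Persist.Sql (rawExecute)
    import Database.Persist.Sqlite (runSqlite)

    main :: IO ()
    main = runSqlite "contentids.db" $
        rawExecute
            "CREATE INDEX IF NOT EXISTS cid_only ON content_identifiers (cid)"
            []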
Sponsored-by: Nicholas Golder-Manning on Patreon