git-annex

Author	SHA1	Message	Date
Joey Hess	da3c0115cb	make cluster UUIDs distinguishable from any other repository UUID A cluster UUID is a version 8 UUID, with first octets 'a' and 'c'. The rest of the content will be random. This avoids a class of attack where the UUID of a repository is used as the UUID of a cluster, which will prevent git-annex from updating location logs for that repository. I don't know why someone would want to do that, but let's prevent it. Also, isClusterUUID make it easy to filter out cluster UUIDs when writing the location logs.	2024-06-14 11:11:09 -04:00
Joey Hess	aa56d433d5	implement cluster.log Not used yet. (Or tested.) I did consider making the log start with the uuid of the node, followed by the cluster uuid (or uuids). That would perhaps mean a smaller write to the git-annex branch when adding a node, but overall the log file would be larger, and it will be read and cached near to startup on most git-annex runs.	2024-06-13 16:00:58 -04:00
Joey Hess	5b668f9ef1	add missing spaces	2024-06-12 13:06:14 -04:00
Joey Hess	649b87bedd	Merge branch 'master' into proxy	2024-06-10 14:26:18 -04:00
Joey Hess	f5532be954	graft in exported tree before updating the export log It was possible for the export.log to get written and then git-annex was interrupted, before it could graft in the exported tree. Which could result in export.log referencing a tree that got garbage collected.	2024-06-07 15:25:02 -04:00
Joey Hess	b43c835def	instantiate remotes that are behind a proxy remote Untested, but this should be close to working. The proxied remotes have the same url but a different uuid. When talking to current git-annex-shell, it will fail due to a uuid mismatch. Once it supports proxies, it will know that the presented uuid is for a remote that it proxies for. The check for any git config settings for a remote with the same name as the proxied remote is there for several reasons. One is security: Writing a name to the proxy log should not cause changes to how an existing, configured git remote operates in a different clone of the repo. It's possible that the user has been using a proxied remote, and decides to set a git config for it. We can't tell the difference between that scenario and an evil remote trying to eg, intercept a file upload by replacing their remote with a proxied remote. Also, if the user sets some git config, does it override the config inherited from the proxy remote? Seems a difficult question. Luckily, the above means we don't need to think through it. This does mean though, that in order for a user to change the config of a proxy remote, they have to manually set its annex-uuid and url, as well as the config they want to change. They may also have to set any of the inherited configs that they were relying on.	2024-06-06 17:15:32 -04:00
Joey Hess	a72d0f69d0	filter out illegal remote names when reading proxy log	2024-06-06 12:51:30 -04:00
Joey Hess	f97f4b8bdb	Added updateproxy command and remote.name.annex-proxy configuration So far this only records proxy information on the git-annex branch.	2024-06-04 14:52:03 -04:00
Joey Hess	8dbee28963	reorder imports for consistency	2024-06-04 13:25:48 -04:00
Yaroslav Halchenko	87e2ae2014	run codespell throughout fixing typos automagically === Do not change lines below === { "chain": [], "cmd": "codespell -w", "exit": 0, "extra_inputs": [], "inputs": [], "outputs": [], "pwd": "." } ^^^ Do not change lines above ^^^	2024-05-01 15:46:21 -04:00
Joey Hess	07baa7ffcf	fix build on windows deletestale renamed from cleanstale	2024-03-26 13:12:58 -04:00
Joey Hess	f04d9574d6	fix transfer lock file for Download to not include uuid While redundant concurrent transfers were already prevented in most cases, it failed to prevent the case where two different repositories were sending the same content to the same repository. By removing the uuid from the transfer lock file for Download transfers, one repository sending content will block the other one from also sending the same content. In order to interoperate with old git-annex, the old lock file is still locked, as well as locking the new one. That added a lot of extra code and work, and the plan is to eventually stop locking the old lock file, at some point in time when an old git-annex process is unlikely to be running at the same time. Note that in the case of 2 repositories both doing eg `git-annex copy foo --to origin` the output is not that great: copy b (to origin...) transfer already in progress, or unable to take transfer lock git-annex: transfer already in progress, or unable to take transfer lock 97% 966.81 MiB 534 GiB/s 0sp2pstdio: 1 failed Lost connection (fd:14: hPutBuf: resource vanished (Broken pipe)) Transfer failed Perhaps that output could be cleaned up? Anyway, it's a lot better than letting the redundant transfer happen and then failing with an obscure error about a temp file, which is what it did before. And it seems users don't often try to do this, since nobody ever reported this bug to me before. (The "97%" there is actually how far along the other transfer is.) Sponsored-by: Joshua Antonishen on Patreon	2024-03-25 14:47:46 -04:00
Joey Hess	310a49a76d	nub the equivilant key list If somehow the same key gets listed in the log twice, avoid doing double checksumming.	2024-03-01 14:39:01 -04:00
Joey Hess	55bf01b788	add equivilant key log for VURL keys When downloading a VURL from the web, make sure that the equivilant key log is populated. Unfortunately, this does not hash the content while it's being downloaded from the web. There is not an interface in Backend currently for incrementally hash generation, only for incremental verification of an existing hash. So this might add a noticiable delay, and it has to show a "(checksum...") message. This could stand to be improved. But, that separate hashing step only has to happen on the first download of new content from the web. Once the hash is known, the VURL key can have its hash verified incrementally while downloading except when the content in the web has changed. (Doesn't happen yet because verifyKeyContentIncrementally is not implemented yet for VURL keys.) Note that the equivilant key log file is formatted as a presence log. This adds a tiny bit of overhead (eg "1 ") per line over just listing the urls. The reason I chose to use that format is it seems possible that there will need to be a way to remove an equivilant key at some point in the future. I don't know why that would be necessary, but it seemed wise to allow for the possibility. Downloads of VURL keys from other special remotes that claim urls, like bittorrent for example, does not popilate the equivilant key log. So for now, no checksum verification will be done for those. Sponsored-by: Nicholas Golder-Manning on Patreon	2024-02-29 16:01:49 -04:00
Joey Hess	9a67ed0f10	importtree: support preferred content expressions needing keys When importing from a special remote, support preferred content expressions that use terms that match on keys (eg "present", "copies=1"). Such terms are ignored when importing, since the key is not known yet. When "standard" or "groupwanted" is used, the terms in those expressions also get pruned accordingly. This does allow setting preferred content to "not (copies=1)" to make a special remote into a "source" type of repository. Importing from it will import all files. Then exporting to it will drop all files from it. In the case of setting preferred content to "present", it's pruned on import, so everything gets imported from it. Then on export, it's applied, and everything in it is left on it, and no new content is exported to it. Since the old behavior on these preferred content expressions was for importtree to error out, there's no backwards compatability to worry about. Except that sync/pull/etc will now import where before it errored out.	2023-12-18 16:27:59 -04:00
Joey Hess	4ed71b34de	migrate --apply And avoid migrate --update/--aply migrating when the new key was already present in the repository, and got dropped. Luckily, the location log allows distinguishing from the new key never having been present! That is mostly useful for --apply because otherwise dropped files would keep coming back until the old objects were reaped as unused. But it seemed to make sense to also do it for --update. for consistency in edge cases if nothing else. One case where --update can use it is when one branch got migrated earlier, and we dropped the file, and now another branch has migrated the same file. Sponsored-by: Jack Hill on Patreon	2023-12-08 13:23:46 -04:00
Joey Hess	7c7c9912c1	migrate --update gets keys The git log is outputting the diff, but this only looks at the new files. When we have a new file, we can get the old filename by just replacing "new" with "old". And then use branchFileRef to refer to it allows catting the old key. While this does have to skip past the old files in the diff, it's still faster than calling git diff separately. Sponsored-by: Nicholas Golder-Manning on Patreon	2023-12-07 17:25:56 -04:00
Joey Hess	f1ce15036f	started migrate --update This is most of the way there, but not quite working. The layout of migrate.tree/ needs to be changed to follow this approach. git log will list all the files in tree order, so the new layout needs to alternate old and new keys. Can that be done? git may not document tree order, or may not preserve it here. Alternatively, change to using git log --format=raw and extract the tree header from that, then use git diff --raw $tree:migrate.tree/old $tree:migrate.tree/new That will be a little more expensive, but only when there are lots of migrations. Sponsored-by: Joshua Antonishen on Patreon	2023-12-07 15:50:52 -04:00
Joey Hess	d06aee7ce0	make commitMigration interuption safe Fixed inversion of control issue, so the tree is recorded in streamLogFile finalizer. Sponsored-by: Leon Schuermann on Patreon	2023-12-06 16:29:58 -04:00
Joey Hess	0bd8b17b59	log migration trees to git-annex branch This will allow distributed migration: Start a migration in one clone of a repo, and then update other clones. commitMigration is a bit of a bear.. There is some inversion of control that needs some TMVars. Also streamLogFile's finalizer does not handle recording the trees, so an interrupt at just the wrong time can cause migration.log to be emptied but the git-annex branch not updated. Sponsored-by: Graham Spencer on Patreon	2023-12-06 15:40:03 -04:00
Joey Hess	38b9ebc5fd	newtype MapLog Noticed that Semigroup instance of Map is not suitable to use for MapLog. For example, it behaved like this: ghci> parseTrustLog "foo 1 timestamp=10\nfoo 2 timestamp=11" <> parseTrustLog "foo X timestamp=12" fromList [(UUID "foo",LogEntry {changed = VectorClock 11s, value = SemiTrusted})] Which was wrong, it lost the newer DeadTrusted value. Luckily, nothing used that Semigroup when operating on a MapLog. And this provides a safe instance. Sponsored-by: Graham Spencer on Patreon	2023-11-13 14:37:22 -04:00
Joey Hess	574514545c	git-annex log --sizesof This can take a lot of memory. I decided to violate the usual rule in git-annex that it operate in constant memory no matter how many annexed objects. In this case, it would be hard to be fast without using a big map of the location logs. The main difficulty here is that there can be many git-annex branches and it needs to display a consistent view at a point in time, which means merging information from multiple git-annex branches. I have not checked if there are any laziness leaks in this code. It takes 1 gb to run in my big repo, which is around what I estimated before writing it. 2 options that are documented are not yet implemented. Small bug: With eg --when=1h, it will display at 12:00 then 1:10 if the next change after 12:59 is then. Then it waits until after 2:10 to display the next change. It ought to wait until after 2:00. Sponsored-by: Brock Spratlen on Patreon	2023-11-10 17:26:10 -04:00
Joey Hess	11cc9f1933	info: Added calculation of combined annex size of all repositories Factored out overLocationLogs from CmdLine.Seek, which can calculate this pretty fast even in a large repo. In my big repo, the time to run git-annex info went up from 1.33s to 8.5s. Note that the "backend usage" stats are for annexed files in the working tree only, not all annexed files. This new data source would let that be changed, but that would be a confusing behavior change. And I cannot retitle it either, out of fear something uses the current title (eg parsing the json). Also note that, while time says "402108maxresident" in my big repo now, up from "54092maxresident", top shows the RES constant at 64mb, and it was 48mb before. So I don't think there is a memory leak. I tried using deepseq to force full evaluation of addKeyCopies and memory use didn't change, which also says no memory leak. And indeed, not even calling addKeyCopies resulted in the same memory use. Probably the increased memory usage is buffering the stream of data from git in overLocationLogs. Sponsored-by: Brett Eisenberg on Patreon	2023-11-08 13:35:11 -04:00
Joey Hess	eb42935e58	Windows: Fix CRLF handling in some log files In particular, the mergedrefs file was written with CR added to each line, but read without CRLF handling. This resulted in each update of the file adding CR to each line in it, growing the number of lines, while also preventing the optimisation from working, so it remerged unncessarily. writeFile and readFile do NewlineMode translation on Windows. But the ByteString conversion prevented that from happening any longer. I've audited for other cases of this, and found three more (.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All of those also only prevent optimisations from working. Some other files are currently both read and written with ByteString, but old git-annex may have written them with NewlineMode translation. Other files are at risk for breakage later if the reader gets converted to ByteString. This is a minimal fix, but should be enough, as long as I remember to use fileLines when splitting a ByteString into lines. This leaves files written using ByteString without CR added, but that's ok because old git-annex has no difficulty reading such files. When the mergedrefs file has gotten lines that end with "\r\r\r\n", this will eventually clean it up. Each update will remove a single trailing CR. Note that S8.lines is still used in eg Command.Unused, where it is parsing git show-ref, and similar in Git/*. git commands don't include CR in their output so that's ok. Sponsored-by: Joshua Antonishen on Patreon	2023-10-30 14:23:23 -04:00
Joey Hess	d9fd205cbb	push RawFilePath down into Annex.ReplaceFile Minor optimisation, but a win in every case, except for a couple where it's a wash. Note that replaceFile still takes a FilePath, because it needs to operate on Chars to truncate unicode filenames properly.	2023-10-26 13:36:49 -04:00
Joey Hess	3742263c99	simplify base64 to only use ByteString Note the use of fromString and toString from Data.ByteString.UTF8 dated back to commit `9b93278e8a`. Back then it was using the dataenc package for base64, which operated on Word8 and String. But with the switch to sandi, it uses ByteString, and indeed fromB64' and toB64' were already using ByteString without that complication. So I think there is no risk of such an encoding related breakage. I also tested the case that `9b93278e8a` fixed: git-annex metadata -s foo='a …' x git-annex metadata x metadata x foo=a … In Remote.Helper.Encryptable, it was avoiding using Utility.Base64 because of that UTF8 conversion. Since that's no longer done, it can just use it now.	2023-10-26 13:10:05 -04:00
Joey Hess	c9866d2164	speed up populating the importfeed database Avoid conversion from ByteString to String for urls that will just be converted right back to ByteString to go into the database. Also setTempUrl is not used by importfeed, so avoid checking for temp urls in this code path. This benchmarks as only a small improvement. From 2.99s to 2.78s when populating a database with 33k urls. Note that it does not seem worth replacing URLString with URLByteString generally, because the ways urls are used all entails either parseURI, which takes a string, or passing a parameter to eg curl, which also is currently a string. Sponsored-by: Leon Schuermann on Patreon	2023-10-25 13:00:17 -04:00
Joey Hess	10b5f79e2d	fix empty tree import when directory does not exist Fix behavior when importing a tree from a directory remote when the directory does not exist. An empty tree was imported, rather than the import failing. Merging that tree would delete every file in the branch, if those files had been exported to the directory before. The problem was that dirContentsRecursive returned [] when the directory did not exist. Better for it to throw an exception. But in commit `74f0d67aa3` back in 2012, I made it never theow exceptions, because exceptions throw inside unsafeInterleaveIO become untrappable when the list is being traversed. So, changed it to list the contents of the directory before entering unsafeInterleaveIO. So exceptions are thrown for the directory. But still not if it's unable to list the contents of a subdirectory. That's less of a problem, because the subdirectory does exist (or if not, it got removed after being listed, and it's ok to not include it in the list). A subdirectory that has permissions that don't allow listing it will have its contents omitted from the list still. (Might be better to have it return a type that includes indications of errors listing contents of subdirectories?) The rest of the changes are making callers of dirContentsRecursive use emptyWhenDoesNotExist when they relied on the behavior of it not throwing an exception when the directory does not exist. Note that it's possible some callers of dirContentsRecursive that used to ignore permissions problems listing a directory will now start throwing exceptions on them. The fix to the directory special remote consisted of not making its call in listImportableContentsM use emptyWhenDoesNotExist. So it will throw an exception as desired. Sponsored-by: Joshua Antonishen on Patreon	2023-08-15 12:57:41 -04:00
Joey Hess	518a51a8a0	--explain for preferred/required content matching And annex.largefiles and annex.addunlocked. Also git-annex matchexpression --explain explains why its input expression matches or fails to match. When there is no limit, avoid explaining why the lack of limit matches. This is also done when no preferred content expression is set, although in a few cases it defaults to a non-empty matcher, which will be explained. Sponsored-by: Dartmouth College's DANDI project	2023-07-26 14:50:04 -04:00
Joey Hess	6821ba8dab	sync: use log to track adjusted branch needs updating Speeds up sync in an adjusted branch by avoiding re-adjusting the branch unncessarily, particularly when it is adjusted with --hide-missing or --unlock-present. When there are a lot of files, that was the majority of the time of a --no-content sync. Uses a log file, which is updated when content presence changes. This adds a little bit of overhead to every file get/drop when on such an adjusted branch. The overhead is minimal for get of any size of file, but might be noticable for drop in some cases. It seems like a reasonable trade-off. It would be possible to update the log file only at the end, but then it would not happen if the command is interrupted. When not in an adjusted branch, there should be no additional overhead. (getCurrentBranch is an MVar read, and it avoids the MVar read of getGitConfig.) Note that this does not deal with situations such as: git checkout master, git-annex get, git checkout adjusted branch, git-annex sync. The sync won't know that the adjusted branch needs to be updated. Dealing with that would add overhead to operation in non-adjusted branches, which I don't like. Also, there are other situations like having two adjusted branches that both need to be updated like this, and switching between them and sync not updating. This does mean a behavior change to sync, since it did previously deal with those situations. But, the documentation did not say that it did. The man pages only talk about sync updating the adjusted branch after it transfers content. I did consider making sync keep track of content it transferred (and dropped) and only update the adjusted branch then, not to catch up to other changes made previously. That would perform better. But it seemed rather hard to implement, and also it would have problems with races with a concurrent get/drop, which this implementation avoids. And it seemed pretty likely someone had gotten used to get/drop followed by sync updating the branch. It seems much less likely someone is switching branches, doing get/drop, and then switching back and expecting sync to update the branch. Re-running git-annex adjust still does a full re-adjusting of the branch, for anyone who needs that. Sponsored-by: Leon Schuermann on Patreon	2023-06-08 14:35:41 -04:00
Joey Hess	3c15e0f7a0	cache negative lookups of global numcopies and mincopies Speeds up eg git-annex sync --content by up to 50%. When it does not need to transfer or drop anything, it now noops a lot more quickly. I didn't see anything else in sync --content noop loop that could really be sped up. It has to cat git objects to keys, stat object files, etc. Sponsored-by: unqueued on Patreon	2023-06-06 14:43:25 -04:00
Joey Hess	c6acf574c7	implement importChanges optimisaton (not used yet) For simplicity, I've not tried to make it handle History yet, so when there is a history, a full import will still be done. Probably the right way to handle history is to first diff from the current tree to the last imported tree. Then, diff from the current tree to each of the historical trees, and recurse through the history diffing from child tree to parent tree. I don't think that will need a record of the previously imported historical trees, and so Logs.Import doesn't store them. Although I did leave room for future expansion in that log just in case. Next step will be to change importTree to importChanges and modify recordImportTree et all to handle it, by using adjustTree. Sponsored-by: Brett Eisenberg on Patreon	2023-05-31 16:01:34 -04:00
Joey Hess	fe5e586b72	rename Git.Filename to Git.Quote	2023-04-12 17:22:03 -04:00
Joey Hess	8b6c7bdbcc	filter out control characters in all other Messages This does, as a side effect, make long notes in json output not be indented. The indentation is only needed to offset them underneath the display of the file they apply to, so that's ok. Sponsored-by: Brock Spratlen on Patreon	2023-04-11 12:58:01 -04:00
Joey Hess	d689a5b338	git style filename quoting controlled by core.quotePath This is by no means complete, but escaping filenames in actionItemDesc does cover most commands. Note that for ActionItemBranchFilePath, the value is branch:file, and I choose to only quote the file part (if necessary). I considered quoting the whole thing. But, branch names cannot contain control characters, and while they can contain unicode, git coes not quote unicode when displaying branch names. So, it would be surprising for git-annex to quote unicode in a branch name. The find command is the most obvious command that still needs to be dealt with. There are probably other places that filenames also get displayed, eg embedded in error messages. Some other commands use ActionItemOther with a filename, I think that ActionItemOther should either be pre-sanitized, or should explicitly not be used for filenames, so that needs more work. When --json is used, unicode does not get escaped, but control characters were already escaped in json. (Key escaping may turn out to be needed, but I'm ignoring that for now.) Sponsored-by: unqueued on Patreon	2023-04-08 14:52:26 -04:00
Yaroslav Halchenko	84b0a3707a	Apply codespell -w throughout	2023-03-17 15:14:58 -04:00
Joey Hess	38e9ea8497	one-way escaping of newlines in uuid.log A repository can have a newline in its description due to being in a directory containing a newline, or due to git-annex describe being passed a string with a newline in it for some reason. Putting that newline in uuid.log breaks its format. So, escape the newline when it enters uuid.log, to \n This is a one-way escaping, it is not converted back to a newline when reading the log. If it were, commands like git-annex info and whereis would display a multi-line description, which could be confusing to read. And, implementing roundtripping would necessarily cause problems if an old version of git-annex were used to set a description that contained whatever special character is used to escape the \n. Eg, a \ or if it used the ! prefix before base64 data that is used in some other logs, the ! character. Then the description set by the old git-annex would not roundtrip. There just doesn't seem to be any benefit of roundtripping newlines through, so why bother? And, git often displays \n for newline when a filename contains a newline, so git-annex doing it in this case seems sorta ok by analogy to git. (Some other git-annex logs can also have newlines put into them if the user really wants to break git-annex. For example: git-annex config annex.largefiles "foo bar" The full list is probably config.log, remote.log, group.log, preferred-content.log, required-content.log, group-preferred-content.log, schedule.log. Probably there is no good reason to use a newline in any of these, and the breakage is probably limited to the bad data the user put in not coming back out. And users can write any garbage to log files themselves manually in any case. So, I am not going to address all of those at this time. If a problem such as this one with the newline in the repository path comes up, it can be dealt with on a case by case basis.) Sponsored-by: Dartmouth College's Datalad project	2023-03-13 14:19:32 -04:00
Joey Hess	cc32e31161	understand adjusted view branch names An adjusted view branch has a name like "refs/heads/adjusted/views/master(author=_)(unlocked)", so it is a view branch that has been converted to an adjusted branch. Made Logs.View support such branch names. So now git-annex sync and pre-commit handle updating metadata on commit in such a branch. Much remains to be done to fully support adjusted view branches, including actually applying the adjustment when updating the view branch. Sponsored-by: Graham Spencer on Patreon	2023-02-27 14:57:58 -04:00
Joey Hess	a11d6e0baf	avoid sync pushing view branches to remotes, and better view branch names * sync: Avoid pushing view branches to remotes. * Changed the name of view branches to include the parent branch. Existing view branches checked out using an old name will still work. It does not seem useful for sync to push view branches around, because the information in a view branch can entirely be derived from other information in git. And sync doesn't push adjusted branches around either. The better view branch names make it more in line with adjusted branch names, but were also needed to make fromViewBranch be able to return the original branch name. Kept the old view branch names still working. But, when those branches exist in a repo, sync will still try to push them as before. Avoiding that would need more complicated and/or expensive changes to sync. Sponsored-By: Boyd Stephen Smith Jr. on Patreon	2023-02-08 13:57:48 -04:00
Joey Hess	aa0350ff49	add directory to views for files that lack specified metadata * view: New field?=glob and ?tag syntax that includes a directory "_" in the view for files that do not have the specified metadata set. * Added annex.viewunsetdirectory git config to change the name of the "_" directory in a view. When in a view using the new syntax, old git-annex will fail to parse the view log. It errors with "Not in a view.", which is not ideal. But that only affects view commands. annex.viewunsetdirectory is included in the View for a couple of reasons. One is to avoid needing to warn the user that it should not be changed when in a view, since that would confuse git-annex. Another reason is that it helped with plumbing the value through to some pure functions. annex.viewunsetdirectory is actually mangled the same as any other view directory. So if it's configured to something like "N/A", there won't be multiple levels of directories, which would also confuse git-annex. Sponsored-By: Jack Hill on Patreon	2023-02-07 16:28:46 -04:00
Joey Hess	65f9e7a3c7	fix deadlock in restagePointerFiles Fix a hang that occasionally occurred during commands such as move. (A bug introduced in 10.20220927, in commit `6a3bd283b8`) The restage.log was kept locked while running a complex index refresh action. In an unusual situation, that action could need to write to the restage log, which caused a deadlock. The solution is a two-stage process. First the restage.log is moved to a work file, which is done with the lock held. Then the content of the work file is read and processed, which happens without the lock being held. This is all done in a crash-safe manner. Note that streamRestageLog may not be fully safe to run concurrently with itself. That's ok, because restagePointerFiles uses it with the index lock held, so only one can be run at a time. streamRestageLog does delete the restage.old file at the end without locking. If a calcRestageLog is run concurrently, it will either see the file content before it was deleted, or will see it's missing. Either is ok, because at most this will cause calcRestageLog to report more work remains to be done than there is. Sponsored-by: Dartmouth College's Datalad project	2022-12-08 14:36:11 -04:00
Joey Hess	4a42c69092	take lock in checkLogFile and calcLogFile move: Fix openFile crash with -J This does make them a bit slower, although usually the log file is not very big, so even when it's being rewritten, they will not block for long taking the lock. Still, little slowdowns may add up when moving a lot file files. A less expensive fix would be to use something lower level than openFile that does not check if the file is already open for write by another thread. But GHC does not seem to provide anything convenient; even mkFD checks for a writing thread. fullLines is no longer necessary since these functions no longer will read the file while it's being written. Sponsored-by: Dartmouth College's DANDI project	2022-10-07 13:19:17 -04:00
Joey Hess	26dea5641e	avoid name shadowing warning	2022-09-26 16:20:56 -04:00
Joey Hess	6a3bd283b8	add restage log When pointer files need to be restaged, they're first written to the log, and then when the restage operation runs, it reads the log. This way, if the git-annex process is interrupted before it can do the restaging, a later git-annex process can do it. Currently, this lets a git-annex get/drop command be interrupted and then re-ran, and as long as it gets/drops additional files, it will clean up after the interrupted command. But more changes are needed to make it easier to restage after an interrupted process. Kept using the git queue to run the restage action, even though the list of files that it builds up for that action is not actually used by the action. This could perhaps be simplified to make restaging a cleanup action that gets registered, rather than using the git queue for it. But I wasn't sure if that would cause visible behavior changes, when eg dropping a large number of files, currently the git queue flushes periodically, and so it restages incrementally, rather than all at the end. In restagePointerFiles, it reads the restage log twice, once to get the number of files and size, and a second time to process it. This seemed better than reading the whole file into memory, since potentially a huge number of files could be in there. Probably the OS will cache the file in memory and there will not be much performance impact. It might be better to keep running tallies in another file though. But updating that atomically with the log seems hard. Also note that it's possible for calcRestageLog to see a different file than streamRestageLog does. More files may be added to the log in between. That is ok, it will only cause the filterprocessfaster heuristic to operate with slightly out of date information, so it may make the wrong choice for the files that got added and be a little slower than ideal. Sponsored-by: Dartmouth College's DANDI project	2022-09-23 15:47:24 -04:00
Joey Hess	b17e328175	avoid unncessary locking by checkLogFile Like the comment says, this works without locking. It looks like I originally copied another function and forgot to remove the locking. Sponsored-by: Dartmouth College's DANDI project	2022-09-23 14:01:43 -04:00
Joey Hess	ed8afacc39	add a finalizer to streamLogFile Sponsored-by: Dartmouth College's DANDI project	2022-09-23 13:49:01 -04:00
Joey Hess	8718125ae4	refactor the restage runner Sponsored-by: Dartmouth College's DANDI project	2022-09-23 13:12:17 -04:00
Joey Hess	e60766543f	add annex.dbdir (WIP) WIP: This is mostly complete, but there is a problem: createDirectoryUnder throws an error when annex.dbdir is set to outside the git repo. annex.dbdir is a workaround for filesystems where sqlite does not work, due to eg, the filesystem not properly supporting locking. It's intended to be set before initializing the repository. Changing it in an existing repository can be done, but would be the same as making a new repository and moving all the annexed objects into it. While the databases get recreated from the git-annex branch in that situation, any information that is in the databases but not stored in the branch gets lost. It may be that no information ever gets stored in the databases that cannot be reconstructed from the branch, but I have not verified that. Sponsored-by: Dartmouth College's Datalad project	2022-08-11 16:58:53 -04:00
Joey Hess	36f0bdcd57	add annex.alwayscompact Added annex.alwayscompact setting which can be unset to speed up writes to the git-annex branch in some cases. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 16:39:19 -04:00
Joey Hess	ce455223df	split out appending to journal from writing, high level only Currently this is not an improvement, but it allows for optimising appendJournalFile later. With an optimised appendJournalFile, this will greatly speed up access patterns like git-annex addurl of a lot of urls to the same key, where the log file can grow rather large. Appending rather than re-writing the journal file for each line can save a lot of disk writes. It still has to read the current journal or branch file, to check if it can append to it, and so when the journal file does not exist yet, it can write the old content from the branch to it. Probably the re-reads are better cached by the filesystem than repeated writes. (If the re-reads turn out to keep performance bad, they could be eliminated, at the cost of not being able to compact the log when replacing old information in it. That could be enabled by a switch.) While the immediate need is to affect addurl writes, it was implemented at the level of presence logs, so will also perhaps speed up location logs. The only added overhead is the call to isNewInfo, which only needs to compare ByteStrings. Helping to balance that out, it avoids compactLog when it's able to append. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 13:22:50 -04:00

1 2 3 4 5 ...

527 commits