git-annex

Author	SHA1	Message	Date
Joey Hess	1acdd18ea8	deal better with clock skew situations, using vector clocks * Deal with clock skew, both forwards and backwards, when logging information to the git-annex branch. * GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1) rather than needing to be advanced each time a new change is made. * Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex. When changing a file in the git-annex branch, the vector clock to use is now determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK when set), and comparing it to the newest vector clock already in use in that file. If a newer time stamp was already in use, advance it forward by a second instead. When the clock is set to a time in the past, this avoids logging with an old timestamp, which would risk that log line later being ignored in favor of "newer" line that is really not newer. When a log entry has been made with a clock that was set far ahead in the future, this avoids newer information being logged with an older timestamp and so being ignored in favor of that future-timestamped information. Once all clocks get fixed, this will result in the vector clocks being incremented, until finally enough time has passed that time gets back ahead of the vector clock value, and then it will return to usual operation. (This latter situation is not ideal, but it seems the best that can be done. The issue with it is, since all writers will be incrementing the last vector clock they saw, there's no way to tell when one writer made a write significantly later in time than another, so the earlier write might arbitrarily be picked when merging. This problem is why git-annex uses timestamps in the first place, rather than pure vector clocks.) Advancing forward by 1 second is somewhat arbitrary. setDead advances a timestamp by just 1 picosecond, and the vector clock could too. But then it would interfere with setDead, which wants to be overrulled by any change. So it could use 2 picoseconds or something, but that seems weird. It could just as well advance it forward by a minute or whatever, but then it would be harder for real time to catch up with the vector clock when forward clock slew had happened. A complication is that many log files contain several different peices of information, and it may be best to only use vector clocks for the same peice of information. For example, a key's location log file contains InfoPresent/InfoMissing for each UUID, and it only looks at the vector clocks for the UUID that is being changed, and not other UUIDs. Although exactly where the dividing line is can be hard to determine. Consider metadata logs, where a field "tag" can have multiple values set at different times. Should it advance forward past the last tag? Probably. What about when a different field is set, should it look at the clocks of other fields? Perhaps not, but currently it does, and this does not seems like it will cause any problems. Another one I'm not entirely sure about is the export log, which is keyed by (fromuuid, touuid). So if multiple repos are exporting to the same remote, different vector clocks can be used for that remote. It looks like that's probably ok, because it does not try to determine what order things occurred when there was an export conflict. Sponsored-by: Jochen Bartl on Patreon	2021-08-04 12:33:46 -04:00
Joey Hess	05989556a2	start implementing hidden git-annex repositories This adds a separate journal, which does not currently get committed to an index, but is planned to be committed to .git/annex/index-private. Changes that are regarding a UUID that is private will get written to this journal, and so will not be published into the git-annex branch. All log writing should have been made to indicate the UUID it's regarding, though I've not verified this yet. Currently, no UUIDs are treated as private yet, a way to configure that is needed. The implementation is careful to not add any additional IO work when privateUUIDsKnown is False. It will skip looking at the private journal at all. So this should be free, or nearly so, unless the feature is used. When it is used, all branch reads will be about twice as expensive. It is very lucky -- or very prudent design -- that Annex.Branch.change and maybeChange are the only ways to change a file on the branch, and Annex.Branch.set is only internal use. That let Annex.Branch.get always yield any private information that has been recorded, without the risk that Annex.Branch.set might be called, with a non-private UUID, and end up leaking the private information into the git-annex branch. And, this relies on the way git-annex union merges the git-annex branch. When reading a file, there can be a public and a private version, and they are just concacenated together. That will be handled the same as if there were two diverged git-annex branches that got union merged.	2021-04-20 15:04:53 -04:00
Joey Hess	6280af2901	generate more compact git-annex branch for imports Especially from borg, where the content identifier logs all end up being the same identical file! But also, for other imports, the location tracking logs can, in some cases, be identical files. Bonus optimisation: Avoid looking up (and parsing when set) GIT_ANNEX_VECTOR_CLOCK env var every time a log is written to. Although the lookup does happen at startup even when no log will be written now.	2020-12-23 15:25:16 -04:00
Joey Hess	7b2d236556	importfeed: stream metadata for 5% speedup On top of the 10% speedup from streaming url logs.	2020-07-14 14:35:26 -04:00
Joey Hess	067aabdd48	wip RawFilePath 2x git-annex find speedup Finally builds (oh the agoncy of making it build), but still very unmergable, only Command.Find is included and lots of stuff is badly hacked to make it compile. Benchmarking vs master, this git-annex find is significantly faster! Specifically: num files old new speedup 48500 4.77 3.73 28% 12500 1.36 1.02 66% 20 0.075 0.074 0% (so startup time is unchanged) That's without really finishing the optimization. Things still to do: * Eliminate all the fromRawFilePath, toRawFilePath, encodeBS, decodeBS conversions. * Use versions of IO actions like getFileStatus that take a RawFilePath. * Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy. * Use ByteString for parsing git config to speed up startup. It's likely several of those will speed up git-annex find further. And other commands will certianly benefit even more.	2019-11-26 16:01:58 -04:00
Joey Hess	9828f45d85	add RemoteStateHandle This solves the problem of sameas remotes trampling over per-remote state. Used for: * per-remote state, of course * per-remote metadata, also of course * per-remote content identifiers, because two remote implementations could in theory generate the same content identifier for two different peices of content While chunk logs are per-remote data, they don't use this, because the number and size of chunks stored is a common property across sameas remotes. External special remote had a complication, where it was theoretically possible for a remote to send SETSTATE or GETSTATE during INITREMOTE or EXPORTSUPPORTED. Since the uuid of the remote is typically generate in Remote.setup, it would only be possible to pass a Maybe RemoteStateHandle into it, and it would otherwise have to construct its own. Rather than go that route, I decided to send an ERROR in this case. It seems unlikely that any existing external special remote will be affected. They would have to make up a git-annex key, and set state for some reason during INITREMOTE. I can imagine such a hack, but it doesn't seem worth complicating the code in such an ugly way to support it. Unfortunately, both TestRemote and Annex.Import needed the Remote to have a new field added that holds its RemoteStateHandle.	2019-10-14 13:51:42 -04:00
Joey Hess	40ecf58d4b	update licenses from GPL to AGPL This does not change the overall license of the git-annex program, which was already AGPL due to a number of sources files being AGPL already. Legally speaking, I'm adding a new license under which these files are now available; I already released their current contents under the GPL license. Now they're dual licensed GPL and AGPL. However, I intend for all my future changes to these files to only be released under the AGPL license, and I won't be tracking the dual licensing status, so I'm simply changing the license statement to say it's AGPL. (In some cases, others wrote parts of the code of a file and released it under the GPL; but in all cases I have contributed a significant portion of the code in each file and it's that code that is getting the AGPL license; the GPL license of other contributors allows combining with AGPL code.)	2019-03-13 15:48:14 -04:00
Joey Hess	cb375977a6	follow-on changes from MetaData type changes Including writing and parsing the metadata log files with bytestring-builder and attoparsec.	2019-01-07 15:51:05 -04:00
Joey Hess	bfc9039ead	convert git-annex branch access to ByteStrings and Builders Most of the individual logs are not converted yet, only presense logs have an efficient ByteString Builder implemented so far. The rest convert to and from String.	2019-01-03 13:21:48 -04:00
Joey Hess	2e9f128dea	moved module and relicensed	2018-10-29 23:13:36 -04:00
Joey Hess	0a7c5a9982	dropdead per-remote metadata Had to refactor pure code into separate modules so it is accessible inside Annex.Branch.Transitions. This commit was sponsored by Peter on Patreon.	2018-09-05 13:52:46 -04:00
Joey Hess	b3d42283ad	use per-remote metadata storage for S3 version ID Since the same key can be stored in a versioned S3 bucket multiple times with different version IDs, this allows tracking them all. Not currently needed, but if we ever want to drop from a versioned S3 bucket, we'll need to know them all. This commit was supported by the NSF-funded DataLad project.	2018-08-31 13:27:29 -04:00
Joey Hess	5c99f6247e	per-remote metadata storage Actually very straightforward reuse of the metadata log file code. Although I had to add a todo item as git-annex forget won't clean up dead remote's metadata yet. This would be worth adding to the external special remote interface sometime. Have not opened a todo though, guess I'll wait until something needs it. This commit was supported by the NSF-funded DataLad project.	2018-08-31 12:23:22 -04:00
Joey Hess	4d0e522b72	Warn when metadata is inherited from a previous version of a file to avoid the user being surprised in cases where that behavior is not desired or expected This commit was supported by the NSF-funded DataLad project.	2017-09-28 12:56:35 -04:00
Joey Hess	2cecc8d2a3	Added GIT_ANNEX_VECTOR_CLOCK environment variable Can be used to override the default timestamps used in log files in the git-annex branch. This is a dangerous environment variable; use with caution. Note that this only affects writing to the logs on the git-annex branch. It is not used for metadata in git commits (other env vars can be set for that). There are many other places where timestamps are still used, that don't get committed to git, but do touch disk. Including regular timestamps of files, and timestamps embedded in some files in .git/annex/, including the last fsck timestamp and timestamps in transfer log files. A good way to find such things in git-annex is to get for getPOSIXTime and getCurrentTime, although some of the results are of course false positives that never hit disk (unless git-annex gets swapped out..) So this commit does NOT necessarily make git-annex comply with some HIPPA privacy regulations; it's up to the user to determine if they can use it in a way compliant with such regulations. Benchmarking: It takes 0.00114 milliseconds to call getEnv "GIT_ANNEX_VECTOR_CLOCK" when that env var is not set. So, 100 thousand log files can be written with an added overhead of only 0.114 seconds. That should be by far swamped by the actual overhead of writing the log files and making the commit containing them. This commit was supported by the NSF-funded DataLad project.	2017-08-14 14:19:58 -04:00
Joey Hess	737e45156e	remove 163 lines of code without changing anything except imports	2016-01-20 16:36:33 -04:00
Joey Hess	6383d22ffa	remove back-compat code for old version of containers Already b-d on a newer version.	2015-06-06 15:23:53 -04:00
Joey Hess	6c2d5b5e41	more time-1.5 fixes	2015-05-10 15:36:58 -04:00
Joey Hess	b0575c621f	implement annex.tune.branchhash1 I hope this doesn't impact speed much -- it does have to pull out a value from Annex state every time it accesses the branch now. The test case I dropped has never caught any problems that I can remember, and would have been rather difficult to convert.	2015-01-28 17:17:26 -04:00
Joey Hess	afc5153157	update my email address and homepage url	2015-01-21 12:50:09 -04:00
Joey Hess	7b50b3c057	fix some mixed space+tab indentation This fixes all instances of " \t" in the code base. Most common case seems to be after a "where" line; probably vim copied the two space layout of that line. Done as a background task while listening to episode 2 of the Type Theory podcast.	2014-10-09 15:09:11 -04:00
Joey Hess	d0c1a22e7c	import metadata from feeds When annex.genmetadata is set, metadata from the feed is added to files that are imported from it. Reused the same feedtitle and itemtitle, feedauthor, itemauthor, etc names that are used in --template. Also added title and author, which are the item title/author if available, falling back to the feed title/author. These are more likely to be common metadata fields. (There is a small bit of dupication here, but once git gets around to packing the object, it will compress it away.) The itempubdate field is not included in the metadata as a string; instead it is used to generate year and month fields, same as is done when adding files with annex.genmetadata set. This commit was sponsored by Amitai Schlair, who cooincidentially is responsible for ikiwiki generating nice feed metadata!	2014-07-03 14:15:00 -04:00
Joey Hess	f64c2d6138	toplevel lastchanged field	2014-03-19 19:10:55 -04:00
Joey Hess	6848f09a12	better timestamp format	2014-03-18 19:01:50 -04:00
Joey Hess	caa97d1271	Each for each metadata field, there's now an automatically maintained "$field-lastchanged" that gives the timestamp of the last change to that field. Note that this is a nearly entirely free feature. The data was already stored in the metadata log in an easily accessible way, and already was parsed to a time when parsing the log. The generation of the metadata fields may even be done lazily, although probably not entirely (the map has to be evaulated to when queried).	2014-03-18 18:55:43 -04:00
Joey Hess	8d5158fa31	Preserve metadata when staging a new version of an annexed file. Performance impact: When adding a large tree of new files, this needs to do some git cat-file queries to check if any of the files already existed and might need a metadata copy. I tried a benchmark in a copy of my sound repository (so there was already a significant git tree to check against. Adding 10000 small files, with a cold cache: before: 1m48.539s after: 1m52.791s So, impact is 0.0004 seconds per file added. Which seems acceptable, so did not add some kind of configuration to enable/disable this. This commit was sponsored by Lisa Feilen.	2014-02-24 14:41:33 -04:00
Joey Hess	7498c5dd96	annex.genmetadata can be set to make git-annex automatically set metadata (year and month) when adding files	2014-02-23 00:08:29 -04:00
Joey Hess	bdfc8e1f44	fix build with old version of Data.Set that lacks toDescList	2014-02-21 11:30:31 -04:00
Joey Hess	39ebfa1a2e	pre-commit: Update metadata when committing changes to annexed files within a view. So the user can now switch to a view and then move files around within it to manage metadata. For example, moving a file into a new directory when in the tags=* view adds a tag to it. Implementation is fairly efficient. One diff-index, which is no more expensive than the first stage of a git commit, followed by possibly some cat-file --batch traffic to find the key (when deleting a file). Very similar to what's done in direct mode when committing. And like direct mode when updating the WC after a merge, it has to buffer the diff-tree values in order to make 2 passes over them. When not in a view, pre-commit now does one extra git symbolic-ref, which is tiny overhead. This commit was sponsored by Andrew Eskridge.	2014-02-19 14:17:58 -04:00
Joey Hess	a18eae9a0f	nice git ack space optimisation when setting the same metadata value for multiple files	2014-02-13 01:57:43 -04:00
Joey Hess	361aee0470	avoid churning in git to no benefit when optimising metadata log I think this is now optimal.	2014-02-12 23:24:04 -04:00
Joey Hess	8076530284	improve simplifier	2014-02-12 22:50:41 -04:00
Joey Hess	a05ac13e92	fix metadata log simplifier and additional quickcheck tests	2014-02-12 22:27:55 -04:00
Joey Hess	9f7e76130e	add metadata command to get/set metadata Adds metadata log, and command. Note that unsetting field values seems to currently be broken. And in general this has had all of 2 minutes worth of testing. This commit was sponsored by Julien Lefrique.	2014-02-12 21:30:33 -04:00

34 commits