git-annex

Author	SHA1	Message	Date
Joey Hess	ce455223df	split out appending to journal from writing, high level only Currently this is not an improvement, but it allows for optimising appendJournalFile later. With an optimised appendJournalFile, this will greatly speed up access patterns like git-annex addurl of a lot of urls to the same key, where the log file can grow rather large. Appending rather than re-writing the journal file for each line can save a lot of disk writes. It still has to read the current journal or branch file, to check if it can append to it, and so when the journal file does not exist yet, it can write the old content from the branch to it. Probably the re-reads are better cached by the filesystem than repeated writes. (If the re-reads turn out to keep performance bad, they could be eliminated, at the cost of not being able to compact the log when replacing old information in it. That could be enabled by a switch.) While the immediate need is to affect addurl writes, it was implemented at the level of presence logs, so will also perhaps speed up location logs. The only added overhead is the call to isNewInfo, which only needs to compare ByteStrings. Helping to balance that out, it avoids compactLog when it's able to append. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 13:22:50 -04:00
Joey Hess	d3d5d2b4ec	fix test suite failure when run with LANG=C	2021-08-18 17:36:00 -04:00
Joey Hess	1acdd18ea8	deal better with clock skew situations, using vector clocks * Deal with clock skew, both forwards and backwards, when logging information to the git-annex branch. * GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1) rather than needing to be advanced each time a new change is made. * Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex. When changing a file in the git-annex branch, the vector clock to use is now determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK when set), and comparing it to the newest vector clock already in use in that file. If a newer time stamp was already in use, advance it forward by a second instead. When the clock is set to a time in the past, this avoids logging with an old timestamp, which would risk that log line later being ignored in favor of "newer" line that is really not newer. When a log entry has been made with a clock that was set far ahead in the future, this avoids newer information being logged with an older timestamp and so being ignored in favor of that future-timestamped information. Once all clocks get fixed, this will result in the vector clocks being incremented, until finally enough time has passed that time gets back ahead of the vector clock value, and then it will return to usual operation. (This latter situation is not ideal, but it seems the best that can be done. The issue with it is, since all writers will be incrementing the last vector clock they saw, there's no way to tell when one writer made a write significantly later in time than another, so the earlier write might arbitrarily be picked when merging. This problem is why git-annex uses timestamps in the first place, rather than pure vector clocks.) Advancing forward by 1 second is somewhat arbitrary. setDead advances a timestamp by just 1 picosecond, and the vector clock could too. But then it would interfere with setDead, which wants to be overrulled by any change. So it could use 2 picoseconds or something, but that seems weird. It could just as well advance it forward by a minute or whatever, but then it would be harder for real time to catch up with the vector clock when forward clock slew had happened. A complication is that many log files contain several different peices of information, and it may be best to only use vector clocks for the same peice of information. For example, a key's location log file contains InfoPresent/InfoMissing for each UUID, and it only looks at the vector clocks for the UUID that is being changed, and not other UUIDs. Although exactly where the dividing line is can be hard to determine. Consider metadata logs, where a field "tag" can have multiple values set at different times. Should it advance forward past the last tag? Probably. What about when a different field is set, should it look at the clocks of other fields? Perhaps not, but currently it does, and this does not seems like it will cause any problems. Another one I'm not entirely sure about is the export log, which is keyed by (fromuuid, touuid). So if multiple repos are exporting to the same remote, different vector clocks can be used for that remote. It looks like that's probably ok, because it does not try to determine what order things occurred when there was an export conflict. Sponsored-by: Jochen Bartl on Patreon	2021-08-04 12:33:46 -04:00
Joey Hess	40ecf58d4b	update licenses from GPL to AGPL This does not change the overall license of the git-annex program, which was already AGPL due to a number of sources files being AGPL already. Legally speaking, I'm adding a new license under which these files are now available; I already released their current contents under the GPL license. Now they're dual licensed GPL and AGPL. However, I intend for all my future changes to these files to only be released under the AGPL license, and I won't be tracking the dual licensing status, so I'm simply changing the license statement to say it's AGPL. (In some cases, others wrote parts of the code of a file and released it under the GPL; but in all cases I have contributed a significant portion of the code in each file and it's that code that is getting the AGPL license; the GPL license of other contributors allows combining with AGPL code.)	2019-03-13 15:48:14 -04:00
Joey Hess	936aee6a60	quickcheck property for parsing of content identifier logs	2019-02-21 13:17:43 -04:00
Joey Hess	928e08904d	avoid two test failures with LANG=C Log.Remote.prop_parse_show_Config failed on an input of fromList [("\28162","")] in LANG=C, encodeBS "\28162" == "\STX=", while in UTF-8 locale, encodeBS "\28162" == "\230\184\130". So in the C locale, the String that's the parsed Map key ends up being encoded differently than it was in the input Map. Logs.Presence.Pure.prop_parse_build_log was failing in LANG=C because the Arbitrary LogLine for some reason sometimes generated LogInfo values containing \n or \r, despite using suchThat to prevent that. I don't understand why at all, but switching the suchThat to filter the ByteString instead of the String before conversion with encodeBS somehow avoids the problem. Both of these suggest something wonky with encodeBS in LANG=C, but I think it's not a problem except for with test data generated by Arbitrary.	2019-01-18 13:33:48 -04:00
Joey Hess	7e54c215b4	Remove redundant endOfInput parseLogLines consumes till endOfInput	2019-01-10 12:42:59 -04:00
Joey Hess	de4980ef85	simplify Show instance by deriving	2019-01-09 13:13:31 -04:00
Joey Hess	11f4d993b5	fix format of show instance It's only used to show test suite failures..	2019-01-09 13:12:25 -04:00
Joey Hess	2d46038754	converting more log files to use Builder Probably not any particular speedup in this, since most of these logs are not written to often. Possibly chunk log writing is sped up, but writes to chunk logs are interleaved with expensive data transfers to remotes, so unlikely to be a noticiable speedup.	2019-01-09 13:06:37 -04:00
Joey Hess	ef8ddaa713	attoparsec parser for presence logs	2019-01-03 15:27:29 -04:00
Joey Hess	bfc9039ead	convert git-annex branch access to ByteStrings and Builders Most of the individual logs are not converted yet, only presense logs have an efficient ByteString Builder implemented so far. The rest convert to and from String.	2019-01-03 13:21:48 -04:00
Joey Hess	0b307f43e1	avoid accidental Show of VectorClock Removed its Show instance.	2017-08-14 14:51:54 -04:00
Joey Hess	2cecc8d2a3	Added GIT_ANNEX_VECTOR_CLOCK environment variable Can be used to override the default timestamps used in log files in the git-annex branch. This is a dangerous environment variable; use with caution. Note that this only affects writing to the logs on the git-annex branch. It is not used for metadata in git commits (other env vars can be set for that). There are many other places where timestamps are still used, that don't get committed to git, but do touch disk. Including regular timestamps of files, and timestamps embedded in some files in .git/annex/, including the last fsck timestamp and timestamps in transfer log files. A good way to find such things in git-annex is to get for getPOSIXTime and getCurrentTime, although some of the results are of course false positives that never hit disk (unless git-annex gets swapped out..) So this commit does NOT necessarily make git-annex comply with some HIPPA privacy regulations; it's up to the user to determine if they can use it in a way compliant with such regulations. Benchmarking: It takes 0.00114 milliseconds to call getEnv "GIT_ANNEX_VECTOR_CLOCK" when that env var is not set. So, 100 thousand log files can be written with an added overhead of only 0.114 seconds. That should be by far swamped by the actual overhead of writing the log files and making the commit containing them. This commit was supported by the NSF-funded DataLad project.	2017-08-14 14:19:58 -04:00
Joey Hess	176cd98293	remove \r from Arbitrary for log tests	2016-05-27 12:04:49 -04:00
Joey Hess	eba68572dc	Split lines in the git-annex branch on \r as well as \n, to deal with \r\n terminated lines written by some versions of git-annex on Windows. This fixes strange displays in some cases, including whereis showing many duplicate locations, and showing more total copies than actually exist. It's unknown if that lead to data loss when eg, dropping. At the moment, it seems unlikely it could, since the UUID with \r's appended is not the same as a UUID without, and so no remote matches it. It's also unknown if \r's can leak in on windows, perhaps when merging the git-annex branch.	2016-05-27 11:45:13 -04:00
Joey Hess	737e45156e	remove 163 lines of code without changing anything except imports	2016-01-20 16:36:33 -04:00
Joey Hess	f9adb905fc	Avoid unncessary write to the location log when a file is unlocked and then added back with unchanged content. Implemented with no additional overhead of compares etc. This is safe to do for presence logs because of their locality of change; a given repo's presence logs are only ever changed in that repo, or in a repo that has just been actively changing the content of that repo. So, we don't need to worry about a split-brain situation where there'd be disagreement about the location of a key in a repo. And so, it's ok to not update the timestamp when that's the only change that would be made due to logging presence info.	2015-10-12 14:46:47 -04:00
Joey Hess	53ede1a10e	parse X in location log file as indicating a dead key A dead key is both not present at the location that thinks it has a copy, and also is assumed to probably not be present anywhere else. Although there may be lurking disconnected repos that somehow still have a copy. Suprisingly few changes needed for this! This is because the presence log code only really concerns itself with keys that are present, and dead keys are not present. Note that both the location and web log can be parsed as having a dead key. I don't see any value to having keys listed as dead in the web log, but since it doesn't change any behavior, there was no point in not parsing it.	2015-06-09 13:28:30 -04:00
Joey Hess	6cf62a9bde	support time-1.5.0 This no longer uses old-locale's defaultTimeLocale, but provides one of its own. Factored out a Logs.TimeStamp.	2015-05-10 15:21:35 -04:00
Joey Hess	afc5153157	update my email address and homepage url	2015-01-21 12:50:09 -04:00
Joey Hess	43dc7f678f	setpresentkey: A new plumbing-level command.	2014-12-29 15:16:40 -04:00
Joey Hess	0831e18372	forget --drop-dead: Completely removes mentions of repositories that have been marked as dead from the git-annex branch. Wrote nice pure transition calculator, and ugly code to stage its results into the git-annex branch. Also had to split up several Log modules that Annex.Branch needed to use, but that themselves used Annex.Branch. The transition calculator is limited to looking at and changing one file at a time. While this made the implementation relatively easy, it precludes transitions that do stuff like deleting old url log files for keys that are being removed because they are no longer present anywhere.	2013-08-31 17:51:13 -04:00

23 commits