Commit graph

60 commits

Author SHA1 Message Date
Joey Hess
770aac97a7
share single BranchState amoung all threads
This fixes a problem when git-annex testremote is run against a cluster
accessed via the http server. Annex.Cluster uses the location log
to find nodes that contain a key when checking if the key is present or getting
it. Just after a key was stored to a cluster node, reading the location log
was not getting the UUID of that node.

Apparently the Annex action that wrote to the location log, and the one
that read from it were run with two different Annex states. The http server
does use several different Annex threads.

BranchState was part of the AnnexState, and so two threads could have
different BranchStates.

Moved BranchState to the AnnexRead, so all threads will see the common state.

This might possibly impact performance. If one thread is writing changes to the
branch, and another thread is reading from the branch, the writing thread will
now invalidate the BranchState's cache, which will cause the reading thread to
need to do extra work. But correctness is surely more important. If did is
found to have impacted performance, it could probably be dealt with by doing
smarter BranchState cache invalidation.

Another way this might impact performance is that the BranchState has a small
cache. If several threads were reading from the branch and relying on the value
they just read still being in the case, now a cache miss will be more likely.
Increasing the BranchState cache to the number of jobs might be a good
idea to amelorate that. But the cache is currently an innefficient list,
so making it large would need changes to the data types.

(Commit 4304f1b6ae dealt with a follow-on
effect of the bug fixed here.)
2024-07-28 12:30:27 -04:00
Joey Hess
adcebbae47
clean up git-remote-annex git-annex branch handling
Implemented alternateJournal, which git-remote-annex
uses to avoid any writes to the git-annex branch while setting up
a special remote from an annex:: url.

That prevents the remote.log from being overwritten with the special
remote configuration from the url, which might not be 100% the same as
the existing special remote configuration.

And it prevents an overwrite deleting of other stuff that was
already in the remote.log.

Also, when the branch was created by git-remote-annex, only delete it
at the end if nothing else has been written to it by another command.
This fixes the race condition described in
797f27ab05, where git-remote-annex
set up the branch and git-annex init and other commands were
run at the same time and their writes to the branch were lost.
2024-05-15 17:33:38 -04:00
Joey Hess
928b2a4839
create journal directory in withJournalHandle
Fixes a crash by git-annex repair when .git/annex/journal/ does not exist.

Normally the journal directory is created before withJournalHandle gets
run, but git-annex repair can be run in a situation where it does not
exist.
2023-06-21 15:23:59 -04:00
Joey Hess
0aa98aa09b
fix perms for core.sharedRepository
These two missed setting it.

It rarely matters that the journal gets the right perm. But, when using
annex.alwayscommit=false, someone else may come along later and want
to append to the journal file.

It probably never matters what the sentinal perms are, but for
completeness..

Sponsored-by: Luke Shumaker on Patreon
2023-04-26 16:29:11 -04:00
Yaroslav Halchenko
84b0a3707a
Apply codespell -w throughout 2023-03-17 15:14:58 -04:00
Joey Hess
e60766543f
add annex.dbdir (WIP)
WIP: This is mostly complete, but there is a problem: createDirectoryUnder
throws an error when annex.dbdir is set to outside the git repo.

annex.dbdir is a workaround for filesystems where sqlite does not work,
due to eg, the filesystem not properly supporting locking.

It's intended to be set before initializing the repository. Changing it
in an existing repository can be done, but would be the same as making a
new repository and moving all the annexed objects into it. While the
databases get recreated from the git-annex branch in that situation, any
information that is in the databases but not stored in the branch gets
lost. It may be that no information ever gets stored in the databases
that cannot be reconstructed from the branch, but I have not verified
that.

Sponsored-by: Dartmouth College's Datalad project
2022-08-11 16:58:53 -04:00
Joey Hess
cbe12b9bc3
force fully strict read of journal file again
I was thinking that discardIncompleteAppend would make it strict, since
it looks at the end of the bytestring. But, it's applied lazily..

This probably fixes windows, which was failing:

      git-annex.exe: .git\annex\journal\trust.log: DeleteFile "\\\\?\\C:\\Users\\runneradmin\\.t\\5\\tmprepo22\\.git\\annex\\journal\\trust.log": permission denied (The process cannot access the file because it is being used by another process.)
2022-07-22 11:36:21 -04:00
Joey Hess
d275874e6c
handling of interrupted appends
An append that is interrupted and writes part of a line is now dealt
with by subsequent reads and appends. This also handles a read that
happens at the same time as an append to the file.

Old versions of git-annex will still see a partially written line,
and could get confused. Since appends are currently done for url logs
and location logs, the confusion is limited to a substring of the actual
url or UUID of the remote being read. This will not affect writes, since
the journal file is locked when reading in preparation for writing.
However, the bad data can be output by git-annex and used by other
things, or could cause surprising behavior by git-annex. Including eg,
downloading the content of the wrong url.

So, something needs to be done to prevent old versions of git-annex from
running in a repository where this appending is being done..

Sponsored-by: Dartmouth College's DANDI project
2022-07-20 12:40:49 -04:00
Joey Hess
36f0bdcd57
add annex.alwayscompact
Added annex.alwayscompact setting which can be unset to speed up writes to
the git-annex branch in some cases.

Sponsored-by: Dartmouth College's DANDI project
2022-07-18 16:39:19 -04:00
Joey Hess
ccff639651
Merge branch 'master' into append 2022-07-18 14:17:15 -04:00
Joey Hess
de18d92de6
efficient but unsafe journal file append
This is only for checking performance, it's not safe.

Sponsored-by: Dartmouth College's DANDI project
2022-07-18 14:17:12 -04:00
Joey Hess
1c40b927aa
minor optimisation
Avoid re-writing the file when the journal directory did not
exist.
2022-07-18 13:50:35 -04:00
Joey Hess
ce455223df
split out appending to journal from writing, high level only
Currently this is not an improvement, but it allows for optimising
appendJournalFile later. With an optimised appendJournalFile, this will
greatly speed up access patterns like git-annex addurl of a lot of urls
to the same key, where the log file can grow rather large. Appending
rather than re-writing the journal file for each line can save a lot of
disk writes.

It still has to read the current journal or branch file, to check
if it can append to it, and so when the journal file does not exist yet,
it can write the old content from the branch to it. Probably the re-reads
are better cached by the filesystem than repeated writes. (If the
re-reads turn out to keep performance bad, they could be eliminated, at
the cost of not being able to compact the log when replacing old
information in it. That could be enabled by a switch.)

While the immediate need is to affect addurl writes, it was implemented
at the level of presence logs, so will also perhaps speed up location logs.
The only added overhead is the call to isNewInfo, which only needs to
compare ByteStrings. Helping to balance that out, it avoids compactLog
when it's able to append.

Sponsored-by: Dartmouth College's DANDI project
2022-07-18 13:22:50 -04:00
Joey Hess
ad467791c1
optimise journal writes to not mkdir journal directory when it already exists
Sponsored-by: Dartmouth College's DANDI project
2022-07-14 12:29:39 -04:00
Joey Hess
debcf86029
use RawFilePath version of rename
Some small wins, almost certianly swamped by the system calls, but still
worthwhile progress on the RawFilePath conversion.

Sponsored-by: Erik Bjäreholt on Patreon
2022-06-22 16:47:34 -04:00
Joey Hess
5a9e6b1fd4
when private journal file exists, still read from git-annex branch
Fix bug that caused stale git-annex branch information to read when
annex.private or remote.name.annex-private is set.

The private journal file should not prevent reading more current
information from the git-annex branch, but used to.

Note that, overBranchFileContents has to do additional work now, when
there's a private journal file, it reads from the branch redundantly
and more slowly.

Sponsored-by: Jack Hill on Patreon
2021-10-26 13:43:50 -04:00
Joey Hess
32138b8cd8
implement annex.privateremote and remote.name.private configs
The slightly unusual parsing in Types.GitConfig avoids the need to look
at the remote list to get configs of remotes. annexPrivateRepos combines
all the configs, and will only be calculated once, so it's nice and
fast.

privateUUIDsKnown and regardingPrivateUUID now need to read from the
annex mvar, so are not entirely free. But that overhead can be optimised
away, as seen in getJournalFileStale. The other call sites didn't seem
worth optimising to save a single MVar access. The feature should have
impreceptable speed overhead when not being used.
2021-04-23 14:21:57 -04:00
Joey Hess
d0c5f6d2f0
optimisation
Avoid trying to read private journal files when no private uuids are
known.
2021-04-21 16:02:56 -04:00
Joey Hess
24eeacdba8
adapt recent bug fixes to support private journal
At this point, private repos should mostly work, except for a few
commands that directly read from the git-annex branch and will not see
the private journal.

Private index not yet implemented.
2021-04-21 16:01:13 -04:00
Joey Hess
05989556a2
start implementing hidden git-annex repositories
This adds a separate journal, which does not currently get committed to
an index, but is planned to be committed to .git/annex/index-private.

Changes that are regarding a UUID that is private will get written to
this journal, and so will not be published into the git-annex branch.

All log writing should have been made to indicate the UUID it's
regarding, though I've not verified this yet.

Currently, no UUIDs are treated as private yet, a way to configure that
is needed.

The implementation is careful to not add any additional IO work when
privateUUIDsKnown is False. It will skip looking at the private journal
at all. So this should be free, or nearly so, unless the feature is
used. When it is used, all branch reads will be about twice as expensive.

It is very lucky -- or very prudent design -- that Annex.Branch.change
and maybeChange are the only ways to change a file on the branch,
and Annex.Branch.set is only internal use. That let Annex.Branch.get
always yield any private information that has been recorded, without
the risk that Annex.Branch.set might be called, with a non-private UUID,
and end up leaking the private information into the git-annex branch.

And, this relies on the way git-annex union merges the git-annex branch.
When reading a file, there can be a public and a private version, and
they are just concacenated together. That will be handled the same as if
there were two diverged git-annex branches that got union merged.
2021-04-20 15:04:53 -04:00
Joey Hess
b2222e4639
optimisation
Avoid unnecessary conversion to/from String.
2021-04-20 13:13:45 -04:00
Joey Hess
c30557594e
remove now redundant function 2021-04-20 12:42:57 -04:00
Joey Hess
f45ad178cb
more RawFilePath conversion
At 318/645 after 4k lines of changes

This commit was sponsored by Jake Vosloo on Patreon.
2020-10-29 12:03:50 -04:00
Joey Hess
3d38ec9585
fix fileJournal
My ByteString rewrite oversimplified it, resulting in any _ in a journal
file turning into a / in the git-annex branch, which was often the wrong
filename, or sometimes (//) an invalid filename that git
refused to add.
2019-12-18 11:29:34 -04:00
Joey Hess
cee0d738fc
match also / path separator on windows 2019-12-11 17:08:08 -04:00
Joey Hess
c19211774f
use filepath-bytestring for annex object manipulations
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.

Benchmarks indicate around 30% speedup in both commands.

Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
2019-12-11 15:25:07 -04:00
Joey Hess
067aabdd48
wip RawFilePath 2x git-annex find speedup
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.

Benchmarking vs master, this git-annex find is significantly faster!
Specifically:

	num files	old	new	speedup
	48500		4.77	3.73	28%
	12500		1.36	1.02	66%
	20		0.075	0.074	0% (so startup time is unchanged)

That's without really finishing the optimization. Things still to do:

* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
  decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.

It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
2019-11-26 16:01:58 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
d5f2463702
misctmp cleanup
* Switch to using .git/annex/othertmp for tmp files other than partial
  downloads, and make stale files left in that directory when git-annex
  is interrupted be cleaned up promptly by subsequent git-annex processes.
* The .git/annex/misctmp directory is no longer used and git-annex will
  delete anything lingering in there after it's 1 week old.

Also, in Annex.Ingest, made the filename it uses in the tmp dir be
prefixed with "ingest-" to avoid potentially using a filename used by
some other code.
2019-01-17 16:02:22 -04:00
Joey Hess
bfc9039ead
convert git-annex branch access to ByteStrings and Builders
Most of the individual logs are not converted yet, only presense logs
have an efficient ByteString Builder implemented so far. The rest
convert to and from String.
2019-01-03 13:21:48 -04:00
Joey Hess
d1961e4498
back out incorrect IO interleaving change
Fix regression in last release that crashes when using --all or running
git-annex in a bare repository. May have also affected git-annex unused and
git-annex info.

Reversed the order of the (++) in Annex.Branch.files so --all will stream
lazily still when there are not a bunch of uncommitted journal files.
Added a todo to maybe improve this later.

This commit was sponsored by Trenton Cronholm on Patreon.
2018-05-08 13:54:42 -04:00
Joey Hess
bea0ad220a
avoid --all buffering list of all keys
In Annex.Branch.branch, the (++) was killing laziness.
Rewrote so it streams lazily.

filterM also kills laziness, so made loggedKeys use a Unchecked type,
and check if the key is dead in the seek loop.

Note that loggedKeysFor still buffers, so git-annex info <remote> and
git-annex unused --from remote still use more memory than necessary.

Also removed some unused functions from Annex.Journal.
2018-04-26 16:00:20 -04:00
Joey Hess
25703e1413
finally really add back custom-setup stanza
Fourth or fifth try at this and finally found a way to make it work.

Absurd amount of busy-work forced on me by change in cabal's behavior.
Split up Utility modules that need posix stuff out of ones used by
Setup. Various other hacks around inability for Setup to use anything
that ifdefs a use of unix.

Probably lost a full day of my life to this.
This is how build systems make their users hate them. Just saying.
2017-12-31 16:36:39 -04:00
Joey Hess
8484c0c197
Always use filesystem encoding for all file and handle reads and writes.
This is a big scary change. I have convinced myself it should be safe. I
hope!
2016-12-24 14:46:31 -04:00
Joey Hess
737e45156e
remove 163 lines of code without changing anything except imports 2016-01-20 16:36:33 -04:00
Joey Hess
afc5153157 update my email address and homepage url 2015-01-21 12:50:09 -04:00
Joey Hess
c784ef4586 unify exception handling into Utility.Exception
Removed old extensible-exceptions, only needed for very old ghc.

Made webdav use Utility.Exception, to work after some changes in DAV's
exception handling.

Removed Annex.Exception. Mostly this was trivial, but note that
tryAnnex is replaced with tryNonAsync and catchAnnex replaced with
catchNonAsync. In theory that could be a behavior change, since the former
caught all exceptions, and the latter don't catch async exceptions.

However, in practice, nothing in the Annex monad uses async exceptions.
Grepping for throwTo and killThread only find stuff in the assistant,
which does not seem related.

Command.Add.undo is changed to accept a SomeException, and things
that use it for rollback now catch non-async exceptions, rather than
only IOExceptions.
2014-08-07 22:03:29 -04:00
Joey Hess
26ee27915a refactor locking 2014-07-10 00:32:23 -04:00
Joey Hess
e5b88713a1 refactor 2014-07-10 00:16:53 -04:00
Joey Hess
d9d76cf98b Fix minor FD leak in journal code.
Minor because normally only 1 FD is leaked per git-annex run. However,
the test suite leaks a few hundred FDs, and this broke it on the Debian
autobuilders, which seem to have a tigher than usual ulimit.

The leak was introduced by the lazy getDirectoryContents' that was
introduced in e6330988dd in order to scale to
millions of journal files -- if the lazy list was never fully consumed, the
directory handle did not get closed.

Instead, pull in openDirectory/readDirectory/closeDirectory code that I
already developed and submitted in a patch to the haskell directory library
earlier. Using this in journalDirty avoids the place that the lazy list
caused a problem. And using it in stageJournal eliminates the need for
getDirectoryContents'.

The getJournalFiles* functions are switched back to using the regular
strict getDirectoryContents. I'm not sure if those always consume the whole
list, so this avoids any leak. And the things that call those are things
like git annex unused, which also look at every file committed to the
git-annex branch, so would need more work to scale to insane numbers of
files anyway.
2014-07-09 23:36:53 -04:00
Joey Hess
c90e4e8778
work around getDirectoryContents not streaming lazily 2014-07-04 17:59:26 -04:00
Joey Hess
1ab3d7c810 Windows: Fix bug introduced in last release that caused files in the git-annex branch to have lines teminated with \r. 2014-06-05 14:57:01 -04:00
Joey Hess
95ca3bb022 Fix encoding of data written to git-annex branch. Avoid truncating unicode characters to 8 bits.
Allow any encoding to be used, as with filenames (but utf8 is the sane
choice). Affects metadata and repository descriptions, and preferred
content expressions.

The question of what's the right encoding for the git-annex branch is a
vexing one. utf-8 would be a nice choice, but this leaves the possibility
of bad data getting into a git-annex branch somehow, and this resulting in
git-annex crashing with encoding errors, which is a failure mode I want to
avoid.

(Also, preferred content expressions can refer to filenames, and filenames
can have any encoding, so limiting to utf-8 would not be ideal.)

The union merge code already took care to not assume any encoding for a
file. Except it assumes that any \n is a literal newline, and not part of
some encoding of a character that happens to contain a newline. (At least
utf-8 avoids using newline for anything except liternal newlines.)
Adapted the git-annex branch code to use this same approach.

Note that there is a potential interop problem with Windows, since
FileSystemEncoding doesn't work there, and instead things are always
decoded as utf-8. If someone uses non-utf8 encoding for data on the
git-annex branch, this can lead to an encoding error on windows. However,
this commit doesn't actually make that any worse, because the union merge
code would similarly fail with an encoding error on windows in that
situation.

This commit was sponsored by Kyle Meyer.
2014-05-27 14:16:33 -04:00
Joey Hess
a1432bce2f Put non-object tmp files in .git/annex/misctmp, leaving .git/annex/tmp for only partially transferred objects.
This allows eg, putting .git/annex/tmp on a ram disk, if the disk IO
of temp object files is too annoying (and if you don't want to keep
partially transferred objects across reboots).

.git/annex/misctmp must be on the same filesystem as the git work tree,
since files are moved to there in a way that will not work cross-device,
as well as symlinked into there.

I first wanted to put the tmp objects in .git/annex/objects/tmp, but
that would pose transition problems on upgrade when partially transferred
objects existed.

git annex info does not currently show the size of .git/annex/misctemp,
since it should stay small. It would also be ok to make something clean it
out, periodically.
2014-02-26 16:52:56 -04:00
Joey Hess
891c85cd88 use locking on Windows
This is all the easy cases, where there was already a separate lock file.
2014-01-28 14:42:03 -04:00
Joey Hess
56c3f68a53 use types to partially prove correctness of journal locking code
My implementation does not guard against double locking of the journal. But
it does ensure that the journal is always locked when operated on, by using
a type that is only produced by lockJournal, and which is required as a
parameter of all functions that operate on the journal.

Note that I had to add the fooStale functions for cases where it does not
make sense to lock the journal when querying it. I was more concerned about
ensuring that anything that modifies the journal is locked.
setJournalFile's implementation ensures that any query of the journal will
get one value or the other atomically, even if the journal is being changed
at the time.
2013-10-03 14:41:57 -04:00
Joey Hess
a3224ce35b avoid more build warnings on Windows 2013-08-04 14:05:36 -04:00
Joey Hess
93f2371e09 get rid of __WINDOWS__, use mingw32_HOST_OS
The latter is harder for me to remember, but avoids build failures in code
used by the configure program.
2013-08-02 12:27:32 -04:00
Joey Hess
bf86b5ca16 improve robustness of fromDirect and replaceFile
Made fromDirect check that a file in the tree has good content (and is not
a broken symlink either) before copying it to another file that has the
same key.

Made replaceFile clean up the temp file if the action that creates it, or
the file replacement action fails.
2013-05-25 15:06:02 -04:00
Joey Hess
03e8594369 fix the day's windows permissions damage 2013-05-12 19:09:48 -04:00