Added annex.fastcopy and remote.name.annex-fastcopy config settings. When
set, this allows the copy_file_range syscall to be used, which can, eg,
allow server-side copies on NFS. (For fastest copying, also disable
annex.verify or remote.name.annex-verify.)
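For example (a sketch; "nfsremote" is a placeholder remote name, and the
boolean values assume these behave like other git-annex boolean configs):

    # allow copy_file_range for all remotes
    git config annex.fastcopy true
    # or enable it for a single remote
    git config remote.nfsremote.annex-fastcopy true
    # optionally, for fastest copying
    git config remote.nfsremote.annex-verify false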
This is a simple implementation that does not handle resuming as well as
it possibly could.
It can be used with both local git remotes (including on NFS), and
directory special remotes. Other types of remotes could in theory also
support it, so I've left the config documented as a general thing.
Added a per-remote version of the annex.web-options config.
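Assuming it follows the usual remote.<name>.annex-* naming, usage might look
like this (the remote name and curl option are illustrative only):

    # pass extra options to curl for just this remote
    git config remote.origin.annex-web-options "--max-time 60"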
Had to plumb RemoteGitConfig through to getUrlOptions. In cases where a
special remote does not use curl, there was no need to do that and I used
Nothing instead.
In the case of the addurl and importfeed commands, it seemed best to say
that running these commands is not using the web special remote per se,
so the config is not used for those commands.
And require this for enable as well as autoenable.
It seemed like asking for trouble for `git-annex enable foo` to use whatever
compute program is stored in the git config, without verifying that the
user wants that program to be used.
Note that it would be good to allow `git-annex enable foo program=...`
to be used without the program being in the git config. Not implemented yet
though.
Added annex.security.autoenable-compute-programs and only allow
autoenabling special remotes that use compute programs on that list.
The reason this is needed is that a user might have some compute programs
that are less safe to use than others. They might want to use an unsafe
one only with one repository, where they are the only committer or the
other committers are trusted. They might be ok with other programs being
used by any repository, and if so they can add them to the list.
Another reason would be a user who has installed a compute program by
accident. Eg, it might be included with git-annex at some point, or
pulled in by some dependency. That user doesn't necessarily want that
compute program to be used in an autoenabled special remote.
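A sketch of what setting the list might look like (the program names and
the space-separated list syntax here are both assumptions):

    git config annex.security.autoenable-compute-programs "git-annex-compute-foo git-annex-compute-bar"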
However, filepath-bytestring is still in Setup-Depends.
That's because Utility.OsPath uses it when not built with OsPath.
It might be possible to make Utility.OsPath fall back to using
filepath, and eliminate that dependency too, but it would mean either
wrapping all of System.FilePath's functions, or using `type OsPath = FilePath`.
Annex.Import uses ifdefs to avoid converting back to FilePath when not
on Windows. On Windows it's a bit slower due to that conversion.
Utility.Path.Windows.convertToWindowsNativeNamespace got a bit
slower too, but not really worth optimising I think.
Note that importing Utility.FileSystemEncoding at the same time as
System.Posix.ByteString will result in conflicting definitions for
RawFilePath. filepath-bytestring avoids that by importing RawFilePath
from System.Posix.ByteString, but that's not possible in
Utility.FileSystemEncoding, since Setup-Depends does not include unix.
This turned out not to affect any code in git-annex though.
Sponsored-by: Leon Schuermann
keyFile has a nice improvement; since a Key is a ShortByteString, it can
be converted to an OsPath without needing the copy that was done before.
Unfortunately, fileKey has to convert from a ShortByteString to a
ByteString in order to use attoparsec, and then the results get
converted back to an OsPath, so there are now 2 copies.
Maybe attoparsec will eventually get a ShortByteString API,
see https://github.com/haskell/attoparsec/issues/225
Sponsored-by: Joshua Antonishen
Decent win in exportDirectories, since it operates on ShortByteString
end to end now without needing conversion. That made it worth
implementing an OsPath specific code path there.
And ExportLocation already being a ShortByteString is a good example of why
it's a good thing that OsPath uses that!
Sponsored-by: k0ld on Patreon
This removes readFileStrict, using file-io readFile' instead.
Had to deal with newline conversion, which readFileStrict does on
Windows. In a few cases, that was pretty ugly to deal with.
Sponsored-by: Kevin Mueller
By using System.Directory.OsPath, which takes and returns OsString,
which is a ShortByteString. So, things like dirContents currently have the
overhead of copying that to a ByteString, but that should be less than
the overhead of using Strings which often in turn were converted to
RawFilePaths.
Added Utility.OsString and the OsString build flag. That flag is turned
on in the stack.yaml, and will be turned on automatically by cabal when
built with new enough libraries. The stack.yaml change is a bit ugly,
and that could be reverted for now if it causes any problems.
Note that Utility.OsString.toOsString on Windows avoids only a
check of encoding that is documented as being unlikely to fail. I don't
think it can fail in git-annex; if it could, git-annex didn't contain
such an encoding check before, so at worst that should be a wash.
Added annex.pre-init-command git config and pre-init-annex hook that is run
before git-annex repository initialization.
This can block initialization. Or it can perform pre-initialization
configuration or tweaking.
I left stdio connected while it's running, so it could conceivably also be
used for interactive prompting, although that would probably want to use
/dev/tty in order to not pollute the stdout of a command when automatic
initialization is done.
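For illustration, a hypothetical setup (the script path is a placeholder,
and presumably a nonzero exit is what blocks initialization):

    # run a policy check before any initialization of this repository
    git config annex.pre-init-command /usr/local/bin/annex-init-policy
    # or, assuming it installs like the other git-annex hook scripts
    cp /usr/local/bin/annex-init-policy .git/hooks/pre-init-annex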
Sponsored-by: Dartmouth College's OpenNeuro project
Added git configs annex.post-update-command and annex.pre-commit-command
that correspond to the git-annex hook scripts post-update-annex and
pre-commit-annex.
Note that the hook files take precedence over the git config, since the git
config can include global config, which should be overridden by local config.
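Hypothetical examples (both script paths are placeholders):

    # corresponds to the post-update-annex hook script
    git config annex.post-update-command /path/to/notify-mirror
    # corresponds to the pre-commit-annex hook script
    git config annex.pre-commit-command /path/to/fixup-staged-files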
These new git configs are probably not super useful. The pre-commit-annex
hook especially exists so that scripts can be installed to it instead of to
the pre-commit hook, since git-annex installs that hook itself. So why would
someone want to use a git config for that? The only reasons I can think of
would be in a global git config, or because it's easier to set a git config
than to write a hook script on an OS like Windows.
The real reason I'm adding these is as groundwork for making other
annex.*-command git configs also be available as hook scripts. I want
to avoid having some things available as only git hooks and others as
both git configs and git hooks. (It seems that some annex.*-command configs
don't translate to git hooks though.)
In the man page, moved documentation of the hooks to be next to the
documentation of the git configs. This is to avoid repetition.
Added config `url.<base>.annexInsteadOf` corresponding to git's
`url.<base>.pushInsteadOf`, to configure the urls to use for accessing the
git-annex repositories on a server without needing to configure
remote.name.annexUrl in each repository.
While one use case for this would be rewriting urls to use annex+http,
I decided not to add any kind of special case for that. So while
git-annex p2phttp, when serving multiple repositories, needs a url
of eg "annex+http://example.com/git-annex/" for each of them, rewriting a
url like "https://example.com/git/foo/bar" with this config set to
"https://example.com/git/" will result in eg
"annex+http://example.com/git-annex/foo/bar", which p2phttp does not
support.
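Rendered as config, that rewriting example would look something like this
(and would produce exactly the unsupported url above):

    [url "annex+http://example.com/git-annex/"]
        annexInsteadOf = https://example.com/git/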
That seems better dealt with in either git-annex p2phttp or a http
middleware, rather than complicating the config with a special case for
annex+http.
Anyway, there are other use cases for this that don't involve annex+http.
I anticipate lots of external special remote programs will neglect
implementing this. Still, it's the right thing to do to assume that some
of them may write files out of order. Probably most external special
remotes will not be used with a proxy. When someone is using one with a
proxy, they can always get it fixed to send ORDERED.
The problem was that when the proxy requests a key be retrieved to its
own temp file, fileRetriever was retrieving it to the key's temp
location, and then moving it at the end, which broke streaming.
So, plumb through the path where the key is being retrieved to.
Had to add Read instances to Key and NumCopies and some other similar
types. I only expect to use those in serializing a sim. Of course, this
risks that implementation changes break reading old data. For a sim,
that would not be a big problem.
Not tested yet. This emulates the same checking that is done when
dropping. Note that when dropping from a special remote it is not able
to make a locked copy.
Detect when a preferred content expression contains "not present", which
would lead to repeatedly getting and then dropping files, and make it never
match. This also applies to "not balanced" and "not sizebalanced".
--explain will tell the user when this happens.
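For example (a sketch; "here" is the current repository):

    git-annex wanted here "not present"
    # --auto consults preferred content; --explain will report that the
    # expression contains "not present" and so never matches
    git-annex get --auto --explain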
Note that getMatcher calls matchMrun' and does not check for unstable
negated limits. While there is no --present anyway, if there were,
it would not make sense for --not --present to complain about
instability and fail to match.
Reorganized the reposize database directory, and split up a column.
checkStaleSizeChanges needs to run before needLiveUpdate,
otherwise the process won't be holding a lock on its pid file, and
another process could go in and expire the live update it records. It
just so happens that they do get called in the correct order, since
checking balanced preferred content calls getLiveRepoSizes before
needLiveUpdate.
The 1 minute delay between checks is arbitrary, but will avoid excess
work. The downside of it is that, if a process is dropping a file and
gets interrupted, for 1 minute another process can expect a repository
will soon be smaller than it is. And so a process might send data to a
repository when a file is not really going to be dropped from it. But
note that can already happen if a drop takes some time in eg locking and
then fails. So it seems possible that live updates should only be
allowed to increase, rather than decrease the size of a repository.
Only when the preferred content expression being matched uses balanced
preferred content is this overhead needed.
It might be possible to eliminate the locking entirely. Eg, check the
live changes before and after the action and re-run if they are not
stable. For now, this is good enough, it avoids existing preferred
content getting slow. If balanced preferred content turns out to be too
slow to check, that could be tried later.
Fixed successfullyFinishedLiveSizeChange to not update the rolling total
when a redundant change is in RecentChanges.
Made setRepoSizes clear RecentChanges that are no longer needed.
It might be possible to clear those earlier, this is only a convenient
point to do it.
The reason it's safe to clear RecentChanges here is that, in order for a
live update to call successfullyFinishedLiveSizeChange, a change must be
made to a location log. If a RecentChange gets cleared, and just after
that a new live update is started, making the same change, the location
log has already been changed (since the RecentChange exists), and
so when the live update succeeds, it won't call
successfullyFinishedLiveSizeChange. The reason it doesn't
clear RecentChanges when there is a redundant live update is that
I didn't want to think through whether or not all races are avoided in
that case.
The rolling total in SizeChanges is never cleared. Instead,
calcJournalledRepoSizes gets the initial value of it, and then
getLiveRepoSizes subtracts that initial value from the current value.
Since the rolling total can only be updated by updateRepoSize,
which is called with the journal locked, locking the journal in
calcJournalledRepoSizes ensures that the database does not change while
reading the journal.
When finishedLiveUpdate was run on a different key than expected, it
blocked forever waiting for an indication the database had been updated.
Since the journal is locked when finishedLiveUpdate runs, this could
also have caused other git-annex commands to hang.
When a live size change completes successfully, the same transaction
that removes it from the database updates the rolling total for its
repository.
The idea is that when RepoSizes is read, SizeChanges will be as
well, and cached locally. Any time a change is made, the local cache
will be updated. So by comparing the local cache with the current
SizeChanges, it can learn about size changes that were made by other
processes. Then read the LiveSizeChanges, and add that in to get a live
picture of the current sizes.
Also added a SizeChangeId. This allows 2 different threads, or
processes, to both record a live size change for the same repo and key,
and update their own information without stepping on one another's toes.
I've tested the behavior of the thread that waits for the LiveUpdate to
be finished, and it does get signaled and exit cleanly when the
LiveUpdate is GCed instead.
Made finishedLiveUpdate wait for the thread to finish updating the
database.
There is a case where GC doesn't happen in time and the database is left
with a live update recorded in it. This should not be a problem, as such
stale data can also happen when a process is interrupted, and will need to
be detected when loading the database.
Balanced preferred content expressions now call startLiveUpdate.
Each command that first checks preferred content (and/or required
content) and then does something that can change the sizes of
repositories needs to call prepareLiveUpdate, and plumb it through the
preferred content check and the location log update.
So far, only Command.Drop is done. Many other commands that don't need
to do this have been updated to keep working.
There may be some calls to NoLiveUpdate in places where that should be
done. All will need to be double checked.
Not currently in a compilable state.