This removes that function, using file-io readFile' instead.
Had to deal with newline conversion, which readFileStrict does on
Windows. In a few cases, that was pretty ugly to deal with.
Sponsored-by: Kevin Mueller
And follow-on changes.
Note that relatedTemplate was changed to operate on a RawFilePath, and
so when it counts the length, it is now the number of bytes, not the
number of code points. This will just make it truncate to shorter strings
in some cases; the truncation is still unicode aware.
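
To illustrate the difference between counting bytes and counting code points, here is a minimal sketch in Python (not the actual Haskell code). The name truncate_utf8 is hypothetical, and how relatedTemplate really truncates is not shown in this commit; the sketch only assumes UTF-8 and that a code point must never be split.

```python
def truncate_utf8(s: str, max_bytes: int) -> str:
    """Truncate s so its UTF-8 encoding fits in max_bytes.

    Hypothetical helper: the limit is counted in bytes, but a multi-byte
    code point is never cut in half.
    """
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    # Cutting at a byte offset may leave an incomplete sequence at the end;
    # errors="ignore" drops that partial code point, so the result is valid.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

With a limit of 3, "hello" keeps 3 characters, while "héllo" keeps only 2, because "é" takes 2 bytes.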
When not building with the OsPath flag, toOsPath . fromRawFilePath and
fromRawFilePath . fromOsPath do extra conversions back and forth between
String and ByteString. That overhead could be avoided, but that's the
non-optimised build mode, so didn't bother.
Sponsored-by: unqueued
By using System.Directory.OsPath, which takes and returns OsString,
which is a ShortByteString. So, things like dirContents currently have the
overhead of copying that to a ByteString, but that should be less than
the overhead of using Strings which often in turn were converted to
RawFilePaths.
Added Utility.OsString and the OsString build flag. That flag is turned
on in the stack.yaml, and will be turned on automatically by cabal when
built with new enough libraries. The stack.yaml change is a bit ugly,
and that could be reverted for now if it causes any problems.
Note that Utility.OsString.toOsString on Windows is avoiding only a
check of encoding that is documented as being unlikely to fail. I don't
think it can fail in git-annex; if it could, git-annex didn't contain
such an encoding check before, so at worst that should be a wash.
Added checks that repo names are ones that have been added to the sim.
Implemented preferred content etc setting. It does not need to parse the
expression in applySimCommand, instead that can be done when running the
sim. This keeps it pure.
But, it can't be entirely pure because of CommandAddTree. So made it
return an Annex action when necessary.
Moved makeMatcher into Annex.FileMatcher in preparation for using it,
but it's not used yet. Also moved checkPreferredContentExpression.
It's possible for two processes or threads to both be doing the same
operation at the same time. Eg, both dropping the same key. If one
finishes and updates the rollingtotal, then the other one needs to be
prevented from later updating the rollingtotal as well. And they could
finish at the same time, or with some time in between.
Addressed this by making updateRepoSize be called with the journal
locked, and only once it's been determined that there is an actual
location change to record in the log. updateRepoSize waits for the
database to be updated.
When there is a redundant operation, updateRepoSize won't be called,
and the redundant LiveUpdate will be removed from the database on
garbage collection.
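
A rough sketch of the idea in Python, under stated assumptions: RepoSizeTracker and record_change are hypothetical names, a threading.Lock stands in for the journal lock, and an in-memory set stands in for the location log. The database wait and the LiveUpdate garbage collection are left out.

```python
import threading


class RepoSizeTracker:
    """Hypothetical model: the total only changes when the log changes."""

    def __init__(self):
        self._journal_lock = threading.Lock()  # stands in for the journal lock
        self._present = set()  # keys the location log records as present
        self.rolling_total = 0

    def record_change(self, key, size, present):
        """Record that key is now present or absent. Returns True only if
        this was an actual location change and the total was updated."""
        with self._journal_lock:
            if (key in self._present) == present:
                # Redundant operation, eg a second concurrent drop of the
                # same key. The total must not be updated again.
                return False
            if present:
                self._present.add(key)
                self.rolling_total += size
            else:
                self._present.discard(key)
                self.rolling_total -= size
            return True
```

Two concurrent drops of the same key both call record_change, but only the first one to take the lock changes rolling_total.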
But: There will be a window where the redundant LiveUpdate is still
visible in the db, and processes can see it, combine it with the
rollingtotal, and arrive at the wrong size. This is a small window, but
it still ought to be addressed. Unsure if it would always be safe to
remove the redundant LiveUpdate? Consider the case where two drops and a
get are all running concurrently somehow, and the order they finish is
[drop, get, drop]. The second drop seems redundant to the first, but
it would not be safe to remove it. While this seems unlikely, it's hard
to rule out that a get and drop at different stages can both be running
at the same time.
Each command that first checks preferred content (and/or required
content) and then does something that can change the sizes of
repositories needs to call prepareLiveUpdate, and plumb it through the
preferred content check and the location log update.
So far, only Command.Drop is done. Many other commands that don't need
to do this have been updated to keep working.
There may be some calls to NoLiveUpdate in places where a live update
should be done. All will need to be double checked.
Not currently in a compilable state.
The use of catObjectStream is optimally fast. Although it might be
possible to combine this with git-annex branch merge to avoid some
redundant work.
In benchmarking, a git-annex branch with 100000 changed files
took less than 1.88 seconds to run through this.
updateRepoSize is only called on the UUID of a repository, not any
cluster it might be a node of. But overLocationLogs and overLocationLogsJournal
were including cluster UUIDs. So it was inconsistent.
Currently I don't see any reason to calculate RepoSize for a cluster.
It's not even clear what it should mean, the total size of all nodes, or
the amount of information stored in the cluster in total?
Plan is to run this when populating Annex.reposizes on demand.
So Annex.reposizes will be up-to-date with the journal, including
crucially journal entries for private repositories. But also
anything that has been written to the journal by another process,
especially if the process was run with annex.alwayscommit=false.
From there, Annex.reposizes can be kept up to date with changes made
by the running process.
This will be used to prime the RepoSizes database, which will always
contain values that correspond to information in the git-annex branch, so
without anything from journal files.
Factored out overJournalFileContents which will later be used to
update Annex.reposizes to include information from journal files.
This will be particularly important to support private UUIDs which only
ever get to journal files and not to the branch.
git-annex info was displaying a message that didn't make sense in
context.
In calcRepoSizes, it seems better to return the information from the
git-annex branch, rather than giving up. Especially since balanced
preferred content uses it, and we can't just give up evaluating a
preferred content expression if git-annex is to be usable in such a
readonly repo.
Commit 6d7ecd9e5d nobly wanted git-annex
to behave the same with such unmerged branches as it does when it can
merge them. But for the purposes of preferred content, it seems to me
there's a sense that such an unmerged branch is the same as a remote we
have not pulled from. The balanced preferred content will either way
operate under outdated information, and so not make the best choices.
The idea is that upon a merge of the git-annex branch, or a commit to
the git-annex branch, the reposize database will be updated. So it
should always accurately reflect the location log sizes, but it will
often be behind the actual current sizes.
Annex.reposizes will start with the value from the database, and get
updated with each transfer, so it will reflect a process's best
understanding of the current sizes.
When there are multiple processes all transferring to the same repo,
Annex.reposizes will not reflect transfers made by the other processes
since the current process started. So when using balanced preferred
content, it may make suboptimal choices, including trying to transfer
content to the repo when another process has already filled it up.
But this is the same as if there are multiple processes running on
different machines, so is acceptable. The reposize will eventually
get an accurate value reflecting changes made by other processes or in
other repos.
This deals with the possible security problem that someone could make an
unusually low UUID and generate keys that are all constructed to hash to
a number that, mod the number of repositories in the group, == 0.
So balanced preferred content would always put those keys in the
repository with the low UUID as long as the group contains the
number of repositories that the attacker anticipated.
Presumably the attacker then holds the data for ransom? Dunno.
Anyway, the partial solution is to use HMAC (sha256) with all the UUIDs
combined together as the "secret", and the key as the "message". Now any
change in the set of UUIDs in a group will invalidate the attacker's
constructed keys from hashing to anything in particular.
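
A minimal sketch of that scheme in Python, not the actual balancedPicker. Assumptions that the text above does not specify: the UUIDs are sorted and concatenated to form the secret, the digest is read as a big-endian integer, and the group is non-empty. balanced_pick is a hypothetical name.

```python
import hashlib
import hmac


def balanced_pick(uuids, key):
    """Pick the repository in the group that should hold key.

    The secret is derived from every UUID in the group, so adding or
    removing any member changes where all keys hash to.
    """
    ordered = sorted(uuids)  # assumption: a stable order for combining
    secret = "".join(ordered).encode("utf-8")
    digest = hmac.new(secret, key.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest, "big") % len(ordered)
    return ordered[index]
```

Since the secret depends on the whole group, keys constructed for one set of UUIDs land unpredictably once that set changes.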
Given that there are plenty of other things someone can do if they can
write to the repository -- including modifying preferred content so only
their repository wants files, and numcopies so other repositories drop
them -- this seems like safeguard enough.
Note that, in balancedPicker, combineduuids is memoized.
This fixes a problem with datalad's test suite, where loading the cluster
log happened to cause the git-annex branch commits to take a different
shape, with an additional commit.
It's also faster though, since many commands don't need the cluster log.
Just fill Annex.clusters with a thunk.
Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
Avoid `git-annex sync --content` etc from operating on cluster nodes by default
since syncing with a cluster implicitly syncs with its nodes. This avoids a
lot of unnecessary work when a cluster has a lot of nodes just in checking
if each node's preferred content is satisfied. And it avoids content
being sent to nodes individually, so instead syncing with a cluster always
fans out uploads to its nodes.
The downside is that there are situations where a cluster's preferred content
settings can be met, but those of its nodes are not. Or where a node does not
contain a key, but the cluster does, and there are not enough copies of the key
yet, so it would be desirable to send it there. I think that's an acceptable
tradeoff. These kinds of situations are ones where the cluster itself should
probably be responsible for copying content to the node. Which it can do much
less expensively than a client can. Part of the balanced preferred content
design that I will be working on in a couple of months involves rebalancing
clusters, so I expect to revisit this.
The use of annex-sync config does allow running git-annex sync with a specific
node, or nodes, and it will sync with it. And it's also possible to set
annex-sync git configs to make it sync with a node by default. (Although that
will require setting up an explicit git remote for the node rather than relying
on the proxied remote.)
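
The selection rule described above could be sketched like this in Python. It is an illustration only: Remote, is_cluster_node, annex_sync and remotes_to_sync are hypothetical names, and the real logic in git-annex sync has more inputs than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Remote:
    name: str
    is_cluster_node: bool = False
    annex_sync: Optional[bool] = None  # None: annex-sync is not configured


def remotes_to_sync(remotes, named=()):
    """Choose which remotes a sync operates on.

    Remotes named explicitly are always used. Otherwise an explicit
    annex-sync setting wins, and cluster nodes are skipped by default.
    """
    if named:
        return [r.name for r in remotes if r.name in named]

    def syncs_by_default(r):
        if r.annex_sync is not None:
            return r.annex_sync
        return not r.is_cluster_node

    return [r.name for r in remotes if syncs_by_default(r)]
```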
Logs.Cluster.Basic is needed because Remote.Git cannot import Logs.Cluster
due to a cycle. And the Annex.Startup load of clusters happens
too late for Remote.Git to use that. This does mean one redundant load
of the cluster log, though only when there is a proxy.
Works down to P2P protocol.
The question now is, how to handle protocol version negotiation for
clusters? Connecting to each node to find their protocol versions and
using the lowest would be too expensive with a lot of nodes. So it seems
that the cluster needs to pick its own protocol version to use with the
client.
Then it can either negotiate that same version with the nodes when
it comes time to use them, or it can translate between multiple protocol
versions. That seems complicated. Thinking it would be ok to refuse to
use a node if it is not able to negotiate the same protocol version with
it as with the client. That will mean that nodes sometimes need to be
upgraded when upgrading the cluster's proxy. But protocol versions
rarely change.
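
A small Python sketch of the "refuse mismatched nodes" option. It assumes that negotiating with a peer yields the lower of the requested version and the peer's maximum, which is an assumption of this sketch and not stated above; negotiate and usable_nodes are hypothetical names.

```python
def negotiate(requested, peer_max):
    """Assumed negotiation rule: settle on the lower of the two versions."""
    return min(requested, peer_max)


def usable_nodes(client_version, node_max_versions):
    """Names of the nodes that negotiate exactly the version the cluster
    already agreed on with the client. Other nodes are refused."""
    return [
        name
        for name, peer_max in node_max_versions.items()
        if negotiate(client_version, peer_max) == client_version
    ]
```

For example, with version 2 agreed with the client, a node that only supports version 1 is left out until it is upgraded.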
This is to avoid inserting a cluster uuid into the location log when
only dead nodes in the cluster contain the content of a key.
One reason why this is necessary is Remote.keyLocations, which excludes
dead repositories from the list. But there are probably many more.
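
The rule can be sketched in Python as follows. with_cluster_uuids is a hypothetical name, and the real code works on the location log and Annex.clusters, not plain sets and dicts.

```python
def with_cluster_uuids(locations, clusters, dead):
    """Add a cluster's UUID to the locations of a key only when at least
    one node of that cluster that is not dead contains the key.

    locations: UUIDs the location log lists for the key
    clusters:  maps each cluster UUID to the set of its node UUIDs
    dead:      UUIDs of repositories marked dead
    """
    live = set(locations) - set(dead)
    result = set(locations)
    for cluster_uuid, nodes in clusters.items():
        if live & set(nodes):
            result.add(cluster_uuid)
    return result
```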
Implementing this was challenging, because Logs.Location importing
Logs.Cluster which imports Logs.Trust which imports Remote.List resulted
in an import cycle through several other modules.
Resorted to making Logs.Location not import Logs.Cluster, and instead
it assumes that Annex.clusters gets populated when necessary before it's
called.
That's done in Annex.Startup, which is run by the git-annex command
(but not other commands) at early startup in initialized repos. Or, it
is run after initialization.
Note that in Remote.Git, it is unable to import Annex.Startup, because
Remote.Git importing Logs.Cluster leads to the same import cycle.
So ensureInitialized is not passed annexStartup in there.
Other commands, like git-annex-shell, currently don't run annexStartup
either.
So there are cases where Logs.Location will not see clusters. So it won't add
any cluster UUIDs when loading the log. That's ok, the only reason to do
that is to make display of where objects are located include clusters,
and to make commands like git-annex get --from treat keys as being located
in a cluster. git-annex-shell certainly does not do anything like that,
and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo)
don't either.
A cluster UUID is a version 8 UUID, with first octets 'a' and 'c'.
The rest of the content will be random.
This avoids a class of attack where the UUID of a repository is used as
the UUID of a cluster, which will prevent git-annex from updating
location logs for that repository. I don't know why someone would want
to do that, but let's prevent it.
Also, isClusterUUID makes it easy to filter out cluster UUIDs when
writing the location logs.
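
A Python sketch of how such a UUID could be built and recognised. It assumes 'a' and 'c' mean the ASCII octets 0x61 and 0x63, which is an interpretation on my part, and gen_cluster_uuid and is_cluster_uuid are hypothetical stand-ins for the Haskell functions.

```python
import os
import uuid


def gen_cluster_uuid():
    """Build a version 8 UUID whose first two octets are 'a' and 'c'."""
    b = bytearray(os.urandom(16))
    b[0] = ord("a")  # assumption: ASCII octet 0x61
    b[1] = ord("c")  # assumption: ASCII octet 0x63
    b[6] = (b[6] & 0x0F) | 0x80  # high nibble of octet 6 is the version: 8
    b[8] = (b[8] & 0x3F) | 0x80  # RFC variant bits: 10xxxxxx
    return uuid.UUID(bytes=bytes(b))


def is_cluster_uuid(u):
    """True when u has the cluster prefix and is a version 8 UUID."""
    b = u.bytes
    return b[0] == ord("a") and b[1] == ord("c") and (b[6] >> 4) == 8
```

A repository UUID generated as a random version 4 UUID can never pass is_cluster_uuid, since its version nibble is 4.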
Not used yet. (Or tested.)
I did consider making the log start with the uuid of the node, followed
by the cluster uuid (or uuids). That would perhaps mean a smaller write
to the git-annex branch when adding a node, but overall the log file
would be larger, and it will be read and cached near to startup on most
git-annex runs.
It was possible for the export.log to get written and then git-annex was
interrupted, before it could graft in the exported tree. Which could
result in export.log referencing a tree that got garbage collected.
Untested, but this should be close to working. The proxied remotes have
the same url but a different uuid. When talking to current
git-annex-shell, it will fail due to a uuid mismatch. Once it supports
proxies, it will know that the presented uuid is for a remote that it
proxies for.
The check for any git config settings for a remote with the same name as
the proxied remote is there for several reasons. One is security:
Writing a name to the proxy log should not cause changes to
how an existing, configured git remote operates in a different clone of
the repo.
It's possible that the user has been using a proxied remote, and decides
to set a git config for it. We can't tell the difference between that
scenario and an evil remote trying to eg, intercept a file upload
by replacing their remote with a proxied remote.
Also, if the user sets some git config, does it override the config
inherited from the proxy remote? Seems a difficult question. Luckily,
the above means we don't need to think through it.
This does mean though, that in order for a user to change the config of
a proxy remote, they have to manually set its annex-uuid and url, as
well as the config they want to change. They may also have to set any of
the inherited configs that they were relying on.
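
The check amounts to something like this Python sketch. proxied_remotes_to_add is a hypothetical name, and the git config is modelled as a plain dict of lowercase keys.

```python
def proxied_remotes_to_add(proxy_names, git_config):
    """Return the proxied remote names that may be set up automatically.

    A name is skipped when any remote.<name>.* setting already exists in
    the git config, whoever put it there.
    """

    def has_config(name):
        prefix = "remote." + name + "."
        return any(k.startswith(prefix) for k in git_config)

    return [n for n in proxy_names if not has_config(n)]
```

So a single setting such as remote.origin-foo.annex-ignore is enough to stop origin-foo from being treated as a proxied remote.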
While redundant concurrent transfers were already prevented in most
cases, it failed to prevent the case where two different repositories were
sending the same content to the same repository. By removing the uuid
from the transfer lock file for Download transfers, one repository
sending content will block the other one from also sending the same
content.
In order to interoperate with old git-annex, the old lock file is still
locked, as well as locking the new one. That added a lot of extra code
and work, and the plan is to eventually stop locking the old lock file,
at some point in time when an old git-annex process is unlikely to be
running at the same time.
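
As a sketch of the locking scheme in Python: the directory layout and the name transfer_lock_files are made up for illustration, and only the presence or absence of the uuid in the path matters here.

```python
import posixpath


def transfer_lock_files(direction, uuid, key, base="transfer"):
    """Lock files to take for a transfer, in the order they are locked.

    Hypothetical layout. The old lock file includes the uuid of the other
    repository. The new lock file for downloads omits it, so senders of
    the same key from different repositories contend for one file.
    """
    old = posixpath.join(base, direction, uuid, "lck." + key)
    if direction == "download":
        new = posixpath.join(base, direction, "lck." + key)
        # The old file is still locked, for interop with old git-annex.
        return [new, old]
    return [old]
```

Two downloads of the same key from different uuids share the first lock file, so the second one is blocked, while their old lock files differ.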
Note that in the case of 2 repositories both doing eg
`git-annex copy foo --to origin`
the output is not that great:
    copy b (to origin...)
    transfer already in progress, or unable to take transfer lock
    git-annex: transfer already in progress, or unable to take transfer lock
    97% 966.81 MiB 534 GiB/s 0sp2pstdio: 1 failed
    Lost connection (fd:14: hPutBuf: resource vanished (Broken pipe))
    Transfer failed
Perhaps that output could be cleaned up? Anyway, it's a lot better than letting
the redundant transfer happen and then failing with an obscure error about
a temp file, which is what it did before. And it seems users don't often
try to do this, since nobody ever reported this bug to me before.
(The "97%" there is actually how far along the *other* transfer is.)
Sponsored-by: Joshua Antonishen on Patreon