avoided the strangeness of the cluster's proxy location tracking being wrong

Joey Hess 2024-06-13 10:34:19 -04:00
parent ffd7c745ff
commit 90e3b8b44f
GPG key ID: DB12DB0FF05F8F38


@@ -208,8 +208,13 @@ For this we need a UUID for the cluster. But it is not like a usual UUID.
 It does not need to actually be recorded in the location tracking logs, and
 it is not counted as a copy for numcopies purposes. The only point of this
 UUID is to make commands like `git-annex drop --from cluster` and
-`git-annex get --from cluster` talk to the cluster's frontend proxy, which
-has as its UUID the cluster's UUID.
+`git-annex get --from cluster` talk to the cluster's frontend proxy.
+
+The proxy log contains the cluster UUID (with a remote name like
+"cluster"), as well as the UUIDs of the nodes of the cluster.
+This makes the client access the cluster using the proxy. Note that more
+than one proxy can be in front of the same cluster, and multiple clusters
+can be accessed via the same proxy.
 
 The cluster UUID is recorded in the git-annex branch, along with a list of
 the UUIDs of nodes of the cluster (which can change at any time).
@@ -220,11 +225,11 @@ of the cluster, the cluster's UUID is added to the list of UUIDs.
 When writing a location log, the cluster's UUID is filtered out of the list
 of UUIDs.
 
-The cluster's frontend proxy fans out uploads to nodes according to
-preferred content. And `storeKey` is extended to be able to return a list
-of additional UUIDs where the content was stored. So an upload to the
-cluster will end up writing to the location log the actual nodes that it
-was fanned out to.
+When proxying an upload to the cluster's UUID, git-annex-shell fans out
+uploads to nodes according to preferred content. And `storeKey` is extended
+to be able to return a list of additional UUIDs where the content was
+stored. So an upload to the cluster will end up writing to the location log
+the actual nodes that it was fanned out to.
 
 Note that to support clusters that are nodes of clusters, when a cluster's
 frontend proxy fans out an upload to a node, and `storeKey` returns
@@ -232,45 +237,29 @@ additional UUIDs, it should pass those UUIDs along. Of course, no cluster
 can be a node of itself, and cycles have to be broken (as described in a
 section below).
 
-When a file is requested from the cluster's frontend proxy, it can send its
-own local copy if it has one, but otherwise it will proxy to one of its
-nodes. (How to pick which node to use? Load balancing?) This behavior will
-need to be added to git-annex-shell, and to Remote.Git for local paths to a
-cluster.
+When a file is requested from the cluster's UUID, git-annex-shell picks one
+of the nodes that has the content, and proxies to that one.
+(How to pick which node to use? Load balancing?)
+And, if the proxy repository itself contains the requested key, it can send
+it directly. This allows the proxy repository to be primed with frequently
+accessed files when it has the space.
 
-The cluster's frontend proxy also fans out drops to all nodes, attempting
-to drop content from the whole cluster, and only indicating success if it
-can. Also needs changes to git-annex-shell and Remote.Git.
+When a drop is requested from the cluster's UUID, git-annex-shell drops
+from all nodes, as well as from the proxy itself. Only indicating success
+if it is able to delete all copies from the cluster.
 
 It does not fan out lockcontent, instead the client will lock content
 on specific nodes. In fact, the cluster UUID should probably be omitted
 when constructing a drop proof, since trying to lockcontent on it will
-usually fail.
+always fail.
 
 Some commands like `git-annex whereis` will list content as being stored in
-the cluster, as well as on whicheven of its nodes, and whereis currently
+the cluster, as well as on whichever of its nodes, and whereis currently
 says "n copies", but since the cluster doesn't count as a copy, that
 display should probably be counted using the numcopies logic that excludes
 cluster UUIDs.
 
-No other protocol extensions or special cases should be needed. Except for
-the strange case of content stored in the cluster's frontend proxy.
-
-Running `git-annex fsck --fast` on the cluster's frontend proxy will look
-weird: For each file, it will read the location log, and if the file is
-present on any node it will add the frontend proxy's UUID. So fsck will
-expect the content to be present. But it probably won't be. So it will fix
-the location log... which will make no changes since the proxy's UUID will
-be filtered out on write. So probably fsck will need a special case to
-avoid this behavior. (Also for `git-annex fsck --from cluster --fast`)
-
-And if a key does get stored on the cluster's frontend proxy, it will not
-be possible to tell from looking at the location log that the content is
-really present there. So that won't be counted as a copy. In some cases,
-a cluster's frontend proxy may want to keep files, perhaps some files are
-worth caching there for speed. But if a file is stored only on the
-cluster's frontend proxy and not in any of its nodes, it will not count as
-a copy.
 
 ## speed
 
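Review note on the hunks above: the revised cluster behavior (location log handling, upload fan-out, retrieval, and drop) can be summarized in a small model. It is a sketch under stated assumptions, not git-annex code: the UUID strings, function names, and callback signatures are hypothetical, and the node selection rule is a placeholder because the design text leaves that question open.

```python
# Hypothetical UUIDs, for illustration only.
CLUSTER = "cluster-uuid"
NODES = {"node-1", "node-2", "node-3"}


def read_location_log(recorded):
    """If any node holds the key, also present the cluster UUID as a location."""
    locations = set(recorded)
    if locations & NODES:
        locations.add(CLUSTER)
    return locations


def write_location_log(uuids):
    """The cluster UUID is filtered out and never written to the log."""
    return set(uuids) - {CLUSTER}


def fan_out_upload(key, wants):
    """Store to each node whose preferred content wants the key.

    Returns the node UUIDs actually written, mirroring the idea of
    `storeKey` reporting additional UUIDs where content was stored.
    """
    return {node for node in NODES if wants(node, key)}


def pick_source(key, proxy_has, node_has):
    """Serve from the proxy's own copy if present, else from some node.

    Taking the first node in sorted order is a placeholder policy only.
    """
    if proxy_has:
        return "proxy"
    candidates = sorted(n for n in NODES if node_has(n, key))
    return candidates[0] if candidates else None


def drop_from_cluster(key, holders, try_drop):
    """Attempt the drop on every holder; succeed only if all of them do."""
    results = [try_drop(holder, key) for holder in sorted(holders)]
    return all(results)
```

Building the full `results` list before calling `all` means every holder is attempted even after one failure, which matches the stated intent of trying to delete all copies and reporting success only when that worked.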