In some situations it can be useful for a repository to not store its
state on the git-annex branch. One example is a temporary clone that's
going to be deleted. One way is just to not push the git-annex branch
from the repository to anywhere, but that limits what can be done in the
repository. For example, files might be added, and copied to public
repositories, but then the git-annex branch would need to be pushed to
publish them.

There could be a git config to hide the current repository.

## location tracking effects

The main logs this would affect are for location tracking.

git-annex will mostly work the same without location tracking information
being recorded for the local repo. Often, git-annex uses inAnnex to
directly check if an annex object is present, rather than looking at
location tracking. For example, --in=here uses inAnnex, so it would still work.

Of course, `git annex whereis` reports on location tracking info, so if a
file were added to such a repo, whereis on it would report no copies. And
I said "of course", but this may not be obvious to all users.

And there are parts of git-annex that do look at location tracking for
the current repo, even though it's generally slower than inAnnex. Since the
two are normally equivalent now, general-purpose code that looks at
locations has no real need to use inAnnex. One example of this
is --copies.

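
The difference between the two kinds of check can be shown with a toy model.
This is only a sketch, not git-annex's actual code: `Repo`, `presentLocally`
(standing in for inAnnex), `loggedCopies` (standing in for a location log
query like --copies), and `addKey` are all made-up names, and the location
log is reduced to a map from keys to sets of uuids.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

type Key = String
type UUID = String

-- Toy model of one repository (hypothetical, not git-annex's real types).
data Repo = Repo
  { localUUID   :: UUID
  , objects     :: S.Set Key                -- what is really present on disk
  , locationLog :: M.Map Key (S.Set UUID)   -- what the git-annex branch records
  }

-- Stand-in for inAnnex: ask the object store directly.
presentLocally :: Repo -> Key -> Bool
presentLocally r k = S.member k (objects r)

-- Stand-in for a location log query such as --copies.
loggedCopies :: Repo -> Key -> Int
loggedCopies r k = S.size (M.findWithDefault S.empty k (locationLog r))

-- Add a key; a hidden repo stores the object but does not log its location.
addKey :: Bool -> Key -> Repo -> Repo
addKey hidden k r = r
  { objects = S.insert k (objects r)
  , locationLog =
      if hidden
        then locationLog r
        else M.insertWith S.union k (S.singleton (localUUID r)) (locationLog r)
  }

main :: IO ()
main = do
  let k = "example-key"
      hiddenRepo = addKey True k (Repo "uuid-here" S.empty M.empty)
  print (presentLocally hiddenRepo k)  -- True
  print (loggedCopies hiddenRepo k)    -- 0
```

In a hidden repo the two answers disagree, which is exactly the case that
code choosing between inAnnex and the location log would have to get right.
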
One thing that would certainly need to be changed is git-annex
fsck, which notices when the location tracking information is wrong/missing and
corrects it. (Note that unsetting the git config followed by a fsck would
update the location logs, which could be useful to stop hiding the repo,
but if other stuff like annex.uuid is also affected, fsck would not do
anything about that stuff.)

git-annex info is a bit of a mess from this perspective. Its repo list
would not include the repo (if it was also hidden from uuid.log), but it
would report on the number of locally present objects, while other info
like numcopies stats and combined size of repositories are based on
location tracking, so would not include the current repo.

Looks like git-annex drop --from remote relies on the location log
to see if there's a local copy, so a hidden repo would not be treated as a
copy. This could be changed, but checking inAnnex here would actually slow
it down. It could also be argued that this is a good thing since dropping
from a remote could leave the only copy in a hidden repo. But then move
--from should also prevent that, and I think it might not.

So the question is, would adding this feature complicate git-annex too
much, in needing to pick the right choice of inAnnex or location log
querying, or in the user's mental model of how git-annex works?

## uuid.log

To really hide a repository, it needs to not be written to uuid.log.

So the config would need to be set before git-annex init.

If a repository is also hidden from uuid.log, it follows that this option
should not be given a name specific to location tracking. Eg annex.hidden rather
than annex.omit-location-logs. But that does raise the
question about all the other places a repo's uuid could crop up in the
git-annex branch.

## everything else

* remote.log: A special remote is not usable without this, and this does
not seem to be a config that affects what is stored about remotes, but only
the current repo.

* trust.log: If the user sets this config, are things
like `git-annex trust here` supposed to refuse to work? Seems limiting,
and a significant source of scope creep. Maybe it would be better to
let the uuid be written to these, if the user chooses to set a trust.
After all, having some uuids in these logs, that are not described in
uuid.log, does not tell anyone else much, except that someone had a
hidden repository.

* group.log, preferred-content.log, required-content.log: Same as trust.log.
Names in group.log do hint about how a hidden repo might be used, but if
the user is concerned about that they can avoid adding their repo to groups
that expose information.

* export.log: Same as remote.log

* `*.log.rmt`, `*.log.rmet`, `*.log.cid`, `*.log.cnk`: Same as remote.log

* schedule.log: Same as trust.log

* activity.log: A user might be surprised that fscking a hidden repo
mentions its uuid here. Also it seems unnecessary info to log for a
hidden repo. Should be special cased if uuid.log is.

* multicast.log: This includes the uuid of the current repo, when using
git-annex multicast. That could be surprising to a user, so probably
git-annex multicast would need to refuse to run in hidden repos.

* difference.log: Surprisingly to me, in a clone of a repo that was initialized
with tunings, git-annex init adds the new repo's uuid to this log file.
Should be special cased if uuid.log is. Unsure yet if it will be possible
to avoid writing it, or if tunings and hidden repos need to be
incompatible features.

That seems to be all. --[[Joey]]

---

## alternative: public/private git-annex branch separation

This would need a separate index file in addition to .git/annex/index,
call it index-private. Any logging of information that includes a hidden
uuid would modify index-private instead. When index-private exists,
git-annex branch queries have to read from both index and index-private,
and can just concatenate the two. (But writes are more complicated, see below.)

index-private could be committed to a git-annex-private branch, but this is
not necessary for the single repo case. But doing that could allow for a
group of repos that are all hidden but share information via that branch.

Performance overhead seems like it would mostly be in determining if
index-private exists and needs to be read from. And that should be
cacheable for the life of git-annex.

Or, it could only read from index-private when the git config has the
feature enabled. Then turning off the feature would need the user to do
something to merge index-private into index, in order to keep using the
repo with the info stored there. This approach also means that the user can
temporarily disable the feature and see how the git-annex info will appear
to others, eg see that whereis and info do not list their hidden repository.

In a lot of ways this seems simpler than the approach of not writing to the
branch. Main complication is log writes. All the log modules need to
indicate when writes are adding information that needs to be kept private.

Often a log file will be read, and then written with a new line added
and perhaps an old line removed. Eg:

    addLog file line = Annex.Branch.change file $ \b ->
        buildLog $ compactLog (line : parseLog b)

This seems like it would need Annex.Branch.change to be passed a parameter
indicating if it's changing the public or private log. Luckily, I think
all reads followed by writes do go via Annex.Branch.change, so Annex.Branch.get
can just concatenate the two without worrying about it leaking back out in a
later write.

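
A minimal sketch of that read/write split, as a pure model rather than
anything resembling the real Annex.Branch module. Everything here is an
assumption made for illustration: the `Visibility` parameter, the `Branch`
record with two maps standing in for index and index-private, and the names
`getLog`, `changeLog` and `addLine`.

```haskell
import Data.Maybe (fromMaybe)
import qualified Data.Map as M

type LogFile = FilePath
type Content = String

-- Hypothetical parameter saying which log a write is changing.
data Visibility = Public | Private
  deriving (Eq, Show)

-- Two maps standing in for .git/annex/index and index-private.
data Branch = Branch
  { publicIndex  :: M.Map LogFile Content
  , privateIndex :: M.Map LogFile Content
  } deriving (Show)

emptyBranch :: Branch
emptyBranch = Branch M.empty M.empty

-- Reads see the public and private content concatenated.
getLog :: LogFile -> Branch -> Content
getLog f b = look (publicIndex b) ++ look (privateIndex b)
  where look = M.findWithDefault "" f

-- A read followed by a write only ever sees the index it writes to,
-- so private lines cannot leak into the public index.
changeLog :: Visibility -> LogFile -> (Content -> Content) -> Branch -> Branch
changeLog vis f modify b = case vis of
  Public  -> b { publicIndex = update (publicIndex b) }
  Private -> b { privateIndex = update (privateIndex b) }
  where update = M.alter (Just . modify . fromMaybe "") f

-- Simplified analogue of addLog: prepend a line to the chosen log.
addLine :: Visibility -> LogFile -> String -> Branch -> Branch
addLine vis f l = changeLog vis f (\old -> unlines (l : lines old))

main :: IO ()
main = do
  let b = addLine Private "uuid.log" "hidden-uuid temp clone"
        $ addLine Public "uuid.log" "public-uuid server" emptyBranch
  putStr (getLog "uuid.log" b)                 -- both lines, public first
  print (M.lookup "uuid.log" (publicIndex b))  -- only the public line
```

The design point is that `changeLog` hands the modifier only the content of
the index being written, while `getLog` is free to combine both.
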
## networks of hidden repos

There are a lot of complications involving using hidden repos as remotes.
It may be best to not support doing it at all. Some of the complications
are discussed below.

If location tracking and other data is not stored for a hidden remote, it
will be hard to do much useful with that remote. For example, git-annex get
from that remote would not work. (Without setting annex-speculate-present.)
Also some remotes depend on state being stored, like chunking data,
encryption, etc. So setting up networks of repos that know about
one-another but are hidden from the wider world would need some kind of
public/private git-annex branch separation.

If there's a branch such as git-annex-private that should only be pushed
to remotes that know about the hidden repo, it invites mistakes. git push
of matching branches would push it to any remote that happens to have a
branch by the same name, which could even be done maliciously. Avoiding
that would need to avoid using a named branch, and only let
git-annex sync push the information, to remotes it knows should receive it.

A related problem is, git-annex move will update location tracking for the
remote. If the remote is hidden, that would expose its uuid, unless
git-annex move knew that it was and either avoided doing that, or wrote to
the private git-annex branch. So, setting up a network of hidden repos
would need some way to tell each of them which of the others were also
hidden. A per-remote git config is one way, but the user would need to
remember to set them up when setting up the remotes.

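
To make the move problem concrete, here is a small sketch of routing
location updates once a set of hidden uuids is known, however that set ends
up being obtained. The names `Target`, `logTarget` and `moveUpdates` are
invented for this example and do not correspond to git-annex code.

```haskell
import qualified Data.Set as S

type UUID = String

-- Where a location log update should be written (hypothetical).
data Target = PublicBranch | PrivateBranch
  deriving (Eq, Show)

-- Updates about a hidden uuid must stay off the public branch.
logTarget :: S.Set UUID -> UUID -> Target
logTarget hidden u
  | S.member u hidden = PrivateBranch
  | otherwise = PublicBranch

-- The updates a move from src to dest would record:
-- (uuid, whether the content is now present there, where to log it).
moveUpdates :: S.Set UUID -> UUID -> UUID -> [(UUID, Bool, Target)]
moveUpdates hidden src dest =
  [ (src, False, logTarget hidden src)
  , (dest, True, logTarget hidden dest)
  ]

main :: IO ()
main = mapM_ print
  (moveUpdates (S.fromList ["hidden-remote"]) "here" "hidden-remote")
```

If the set of hidden uuids is incomplete, the update for the remote falls
through to `PublicBranch`, which is the exposure described above.
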

Would storing a list of the uuids of hidden repos be acceptable? If there
were a list in the git-annex branch, then it would be easier to support
networks of hidden repos. The only information exposed would be the number
of hidden repos and when they were added. Space used would generally be
small, although in situations where temporary repos are being created
frequently, and are hidden to avoid bloating the branch, there would be a
small amount of bloat.

Alternatively, the UUID of a hidden repo could somehow encode the fact that
it's hidden. Although this would make it hard to convert a repo from/to
being hidden.

None of the above allows for a network of hidden repos, one of which is
part of a *different* network of hidden repos. Supporting that would be a
major complication.

Things other than the git-annex branch that can expose the existence of the
repository:

* The p2p protocol has an AUTH that includes the repository that is
connecting. This should be ok, since links between repositories have to be
set up explicitly.
* git-annex-shell configlist will list the UUID. User has to know/guess
the repo exists and have an accepted ssh key.

## alternative: git-annex branch filtering

Different angle on this: Let the git-annex branch grow as usual. But
provide a way to filter uuids out of the git-annex branch, producing a new
branch.

Then the user can push the filtered branch back to origin or whatever they
want to do with it. It would be up to them to avoid making a mistake and
letting git push automatically send git-annex to origin/git-annex.
Maybe git has sufficient configs to let it be configured to avoid such
mistakes, dunno. (git-annex sync would certainly be a foot shooting
opportunity too.)

> Setting push.default = simple would avoid accidental pushes.
> But if the user wanted to otherwise push matching branches, they would
> not be able to express that with a git config. Also, `git push origin :`
> would override that config.
>
> Using a different branch name than git-annex when branch filtering is
> enabled would avoid most accidental pushes. And then the filtering
> could produce the git-annex branch.

The filtering would need to go back from the top commit to the last commit
that was filtered, and remove all mentions of the uuid. The transition
code (mostly) knows how to do that, but it doesn't preserve the history of
commits currently, and filtering would need to preserve that.

Any commits that were made elsewhere or that don't contain the UUIDs would
keep the same trees, and should keep the same commit hashes too, as long
as their parents are the same.

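
The per-tree part of such a filter could look roughly like the following.
It is a simplified sketch under stated assumptions, not the transition code:
a tree is modeled as a map from log file name to content, a line "mentions"
a uuid if the uuid appears anywhere in it, and `filterLog` and `filterTree`
are hypothetical names. Rewriting commits and preserving history is left out
entirely.

```haskell
import Data.List (isInfixOf)
import qualified Data.Map as M

type UUID = String
type Tree = M.Map FilePath String

-- Drop every line that mentions a hidden uuid. Content with no such line
-- is returned untouched, so unaffected files stay byte-for-byte identical.
filterLog :: [UUID] -> String -> String
filterLog hidden content
  | any mentions ls = unlines (filter (not . mentions) ls)
  | otherwise = content
  where
    ls = lines content
    mentions l = any (`isInfixOf` l) hidden

-- Filter a whole tree, removing files that filtering left empty.
filterTree :: [UUID] -> Tree -> Tree
filterTree hidden = M.mapMaybe keep
  where
    keep c =
      let c' = filterLog hidden c
      in if null c' && not (null c) then Nothing else Just c'

main :: IO ()
main = do
  let tree = M.fromList
        [ ("aaa.log", "1.0s 1 public-uuid\n2.0s 1 hidden-uuid\n")
        , ("bbb.log", "3.0s 1 public-uuid\n")
        , ("ccc.log", "4.0s 1 hidden-uuid\n")
        ]
  mapM_ print (M.toList (filterTree ["hidden-uuid"] tree))
```

A tree with no mention of the uuid comes back equal to its input, which is
the property that lets such commits keep their trees.
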

This would support any networks of hidden repos that might be wanted.
And it's *clean*. Except it punts the potential foot shooting of
keeping the unfiltered branch private and unpushed to the user, and it
adds a step of needing to do the filtering before pushing.

[[!tag projects/datalad]]

> [[done]]; see [[tips/cloning_a_repository_privately]] --[[Joey]]