git-annex/doc/todo/hiding_a_repository.mdwn

In some situations it can be useful for a reposistory to not store its
state on the git-annex branch. One example is a temporary clone that's
going to be deleted. One way is just to not push the git-annex branch
from the repository to anywhere, but that limits what can be done in the
repository. For example, files might be added, and copied to public
repositories, but then the git-annex branch would need to be pushed to
publish them.

There could be a git config to hide the current repository.

## location tracking effects

The main logs this would affect are for location tracking. 

git-annex will mostly work the same without location tracking information
being recorded for the local repo. Often, git-annex uses inAnnex to
directly check if an annex object is present, rather than looking at
location tracking. For example --in=here uses inAnnex so would still work;

Of course, `git annex whereis`reports on location tracking info, so if a
file were added to such a repo, whereis on it would report no copies. And
I said "of course", but this may not be obvious to all users. 

And there are parts of git-annex that do look at location tracking for
the current repo, even though it's generally slower than inAnnex. Since the
two are generally equivilant now, some general-purpose code that looks at
locations generally has no real need to use inAnnex. One example of this
is --copies.

One thing that would certainly need to be changed is git-annex
fsck, which notices when the location tracking information is wrong/missing and
corrects it. (Note that unsetting the git config followed by a fsck would
update the location logs, which could be useful to stop hiding the repo,
but if other stuff like annex.uuid is also affected, fsck would not do
anything about that stuff.)

git-annex info is a bit of a mess from this perspecitve. Its repo list
would not include the repo (if it was also hidden from uuid.log), but it
would report on the number of locally present objects, while other info
like numcopies stats and combined size of repositories are based on
location tracking, so would not include the current repo.

Looks like git-annex drop --from remote relies on the location log
to see if there's a local copy, so a hidden repo would not be treated as a
copy. This could be changed, but checking inAnnex here would actually slow
it down. It could also be argued that this is a good thing since dropping
from a remote could leave the only copy in a hidden repo. But then move
--from should also prevent that, and I think it might not.

So the question is, would adding this feature complicate git-annex too
much, in needing to pick the right choice of inAnnex or location log
querying, or in the user's mental model of how git-annex works? 

## uuid.log

To really hide a repository, it needs to not be written to uuid.log.

So the config would need to be set before git-annex init.

If a repository is also hidden from uuid.log, it follows that this option
is not given a name specific to location tracking. Eg annex.hidden rather
than annex.omit-location-logs. But that does raise the
question about all the other places a repo's uuid could crop up in the
git-annex branch.

## everything else

* remote.log: A special remote is not usable without this, and this does
  not seem to be a config that affects what is stored about remotes, but only
  the current repo.

* trust.log: If the user sets this config, are things
  like `git-annex trust here` supposed to refuse to work? Seems limiting,
  and a significant source of scope creep. Maybe it would be better to
  let the uuid be written to these, if the user chooses to set a trust.
  After all, having some uuids in these logs, that are not described in
  uuid.log, does not tell anyone else much, except that someone had a
  hidden repository.

* group.log, preferred-content.log, required-content.log: Same as trust.log;
  Names in group.log do hint about how a hidden repo might be used, but if
  the user is concerned about that they can not add their repo to groups
  that expose information.

* export.log: Same as remote.log

* `*.log.rmt`, `*.log.rmet`, `*.log.cid`, `*.log.cnk`: Same as remote.log

* schedule.log: Same as trust.log

* activity.log: A user might be surprised that fscking a hidden repo
  mentions its uuid here. Also it seems unnecessary info to log for a
  hidden repo. Should be special cased if uuid.log is.

* multicast.log: This includes the uuid of the current repo, when using
  git-annex multicast. That could be surprising to a user, so probably
  git-annex multicast would need to refuse to run in hidden repos.

* difference.log: Surprisingly to me, in a clone of a repo that was initialized
  with tunings, git-annex init adds the new repo's uuid to this log file.
  Should be special cased if uuid.log is. Unsure yet if it will be possible
  to avoid writing it, or if tunings and hidden repos need to be
  incompatible features.

That seems to be all. --[[Joey]]

---

## alternative: public/private git-annex branch separation

This would need a separate index file in addition to .git/annex/index,
call it index-private. Any logging of information that includes a hidden
uuid would modify index-private instead. When index-private exists,
git-annex branch queries have to read from both index and index-private,
and can just concacentate the two. (But writes are more complicated, see below.)

index-private could be committed to a git-annex-private branch, but this is
not necessary for the single repo case. But doing that could allow for a
group of repos that are all hidden but share information via that branch.

Performance overhead seems like it would mostly be in determining if
index-private exists and needs to be read from. And that should be
cacheable for the life of git-annex. 

Or, it could only read from index-private when the git config has the
feature enabled. Then turning off the feature would need the user to do
something to merge index-private into index, in order to keep using the
repo with the info stored there. This approach also means that the user can
temporarily disable the feature and see how the git-annex info will appear
to others, eg see that whereis and info do not list their hidden repository.

In a lot of ways this seems simpler than the approach of not writing to the
branch. Main complication is log writes. All the log modules need to
indicate when writes are adding information that needs to be kept private.

Often a log file will be read, and then written with a new line added
and perhaps an old line removed. Eg:

	addLog file line = Annex.Branch.change file $ \b ->
	        buildLog $ compactLog (line : parseLog b)

This seems like it would need Annex.Branch.change to be passed a parameter
indicating if it's changing the public or private log. Luckily, I think 
all reads followed by writes do go via Annex.Branch.change, so Annex.Branch.get
can just concacenate the two without worrying about it leaking back out in a
later write.

## networks of hidden repos

There are a lot of complications involving using hidden repos as remotes.
It may be best to not support doing it at all. Some of the complications
are discussed below.

If location tracking and other data is not stored for a hidden remote, it
will be hard to do much useful with that remote. For example, git-annex get
from that remote would not work. (Without setting annex-speculate-present.)
Also some remotes depend on state being stored, like chunking data,
encryption, etc. So setting up networks of repos that know about
one-another but are hidden from the wider world would need some kind of
public/private git-annex branch separation.

If there's a branch such as git-annex-private that should only be pushed
to remotes that know about the hidden repo, it invites mistakes. git push
of matching branches would push it to any remote that happens to have a
branch by the same name, which could even be done maliciously. Avoiding
that would need to avoid using a named branch, and only let 
git-annex sync push the information, to remotes it knows should receive it.

A related problem is, git-annex move will update location tracking for the
remote. If the remote is hidden, that would expose its uuid, unless
git-annex move knew that it was and either avoided doing that, or wrote to
the private git-annex branch. So, setting up a network of hidden repos
would need some way to tell each of them which of the others were also
hidden. A per-remote git config is one way, but the user would need to
remember to set them up when setting up the remotes.

Would storing a list of the uuids of hidden repos be acceptable? If there
were a list in the git-annex branch, then it would be easier to support
networks of hidden repos. The only information exposed would be the number
of hidden repos and when they were added. Space used would generally be
small, although in situations where temporary repos are being created
frequently, and are hidden to avoid bloating the branch, there would be a
small amount of bloat. 

Alternatively, the UUID of a hidden repo could somehow encode the fact that
it's hidden. Although this would make it hard to convert a repo from/to
being hidden.

None of the above allows for a network of hidden repos, one of which is
part of a *different* network of hidden repos. Supporting that would be a
major complication.

Things other than the git-annex branch that can expose the existence of the
repository:

* The p2p protocol has an AUTH that includes the repository that is
  connecting. This should be ok, since links between repositories have to be
  set up explicitly.
* git-annex-shell configlist will list the UUID. User has to know/guess
  the repo exists and have an accepted ssh key.

# alternative: git-annex branch filtering

Different angle on this: Let the git-annex branch grow as usual. But
provide a way to filter uuids out of the git-annex branch, producing a new
branch.

Then the user can push the filtered branch back to origin or whatever they
want to do with it. It would be up to them to avoid making a mistake and
letting git push automatically send git-annex to origin/git-annex.
Maybe git has sufficient configs to let it be configured to avoid such
mistakes, dunno. (git-annex sync would certianly be a foot shooting
opportunity too.)

> Setting remote.name.push = simple would avoid accidental pushes.
> But if the user wanted to otherwise push matching branches, they would
> not be able to express that with a git config. Also, `git push origin :`
> would override that config.
> 
> Using a different branch name than git-annex when branch filtering is
> enabled would be avoid most accidental pushes. And then the filtering
> could produce the git-annex branch.

The filtering would need to go back from the top commit to the last commit
that was filtered, and remove all mentions of the uuid. The transition
code (mostly) knows how to do that, but it doesn't preserve the history of
commits currently, and filtering would need to preserve that.

Any commits that were made elsewhere or that don't contain the UUIDs would
keep the same trees, and should keep the same commit hashes too, as long
as their parents are the same.

This would support any networks of hidden repos that might be wanted.
And it's *clean*.. Except it punts the potential foot shooting of
keeping the unfiltered branch private and unpushed to the user, and it
adds a step of needing to do the filtering before pushing.

[[!tag projects/datalad]]

> [[done]]; see [[tips/cloning_a_repository_privately]] --[[Joey]]
idea 2021-04-16 17:30:23 +00:00			`In some situations it can be useful for a reposistory to not store its`
			`state on the git-annex branch. One example is a temporary clone that's`
			`going to be deleted. One way is just to not push the git-annex branch`
			`from the repository to anywhere, but that limits what can be done in the`
			`repository. For example, files might be added, and copied to public`
			`repositories, but then the git-annex branch would need to be pushed to`
			`publish them.`

			`There could be a git config to hide the current repository.`

			`## location tracking effects`

			`The main logs this would affect are for location tracking.`

			`git-annex will mostly work the same without location tracking information`
			`being recorded for the local repo. Often, git-annex uses inAnnex to`
			`directly check if an annex object is present, rather than looking at`
			`location tracking. For example --in=here uses inAnnex so would still work;`

			Of course, `git annex whereis`reports on location tracking info, so if a
			`file were added to such a repo, whereis on it would report no copies. And`
			`I said "of course", but this may not be obvious to all users.`

			`And there are parts of git-annex that do look at location tracking for`
			`the current repo, even though it's generally slower than inAnnex. Since the`
			`two are generally equivilant now, some general-purpose code that looks at`
			`locations generally has no real need to use inAnnex. One example of this`
			`is --copies.`

			`One thing that would certainly need to be changed is git-annex`
			`fsck, which notices when the location tracking information is wrong/missing and`
			`corrects it. (Note that unsetting the git config followed by a fsck would`
			`update the location logs, which could be useful to stop hiding the repo,`
			`but if other stuff like annex.uuid is also affected, fsck would not do`
			`anything about that stuff.)`

			`git-annex info is a bit of a mess from this perspecitve. Its repo list`
			`would not include the repo (if it was also hidden from uuid.log), but it`
			`would report on the number of locally present objects, while other info`
			`like numcopies stats and combined size of repositories are based on`
			`location tracking, so would not include the current repo.`

			`Looks like git-annex drop --from remote relies on the location log`
			`to see if there's a local copy, so a hidden repo would not be treated as a`
			`copy. This could be changed, but checking inAnnex here would actually slow`
			`it down. It could also be argued that this is a good thing since dropping`
			`from a remote could leave the only copy in a hidden repo. But then move`
			`--from should also prevent that, and I think it might not.`

			`So the question is, would adding this feature complicate git-annex too`
			`much, in needing to pick the right choice of inAnnex or location log`
			`querying, or in the user's mental model of how git-annex works?`

			`## uuid.log`

			`To really hide a repository, it needs to not be written to uuid.log.`

			`So the config would need to be set before git-annex init.`

			`If a repository is also hidden from uuid.log, it follows that this option`
			`is not given a name specific to location tracking. Eg annex.hidden rather`
			`than annex.omit-location-logs. But that does raise the`
			`question about all the other places a repo's uuid could crop up in the`
			`git-annex branch.`

			`## everything else`

			`* remote.log: A special remote is not usable without this, and this does`
			`not seem to be a config that affects what is stored about remotes, but only`
			`the current repo.`

			`* trust.log: If the user sets this config, are things`
			like `git-annex trust here` supposed to refuse to work? Seems limiting,
			`and a significant source of scope creep. Maybe it would be better to`
			`let the uuid be written to these, if the user chooses to set a trust.`
			`After all, having some uuids in these logs, that are not described in`
			`uuid.log, does not tell anyone else much, except that someone had a`
			`hidden repository.`

			`* group.log, preferred-content.log, required-content.log: Same as trust.log;`
			`Names in group.log do hint about how a hidden repo might be used, but if`
			`the user is concerned about that they can not add their repo to groups`
			`that expose information.`

			`* export.log: Same as remote.log`

			* `.log.rmt`, `.log.rmet`, `.log.cid`, `.log.cnk`: Same as remote.log

			`* schedule.log: Same as trust.log`

			`* activity.log: A user might be surprised that fscking a hidden repo`
			`mentions its uuid here. Also it seems unnecessary info to log for a`
			`hidden repo. Should be special cased if uuid.log is.`

			`* multicast.log: This includes the uuid of the current repo, when using`
			`git-annex multicast. That could be surprising to a user, so probably`
			`git-annex multicast would need to refuse to run in hidden repos.`

			`* difference.log: Surprisingly to me, in a clone of a repo that was initialized`
			`with tunings, git-annex init adds the new repo's uuid to this log file.`
			`Should be special cased if uuid.log is. Unsure yet if it will be possible`
			`to avoid writing it, or if tunings and hidden repos need to be`
			`incompatible features.`

			`That seems to be all. --[[Joey]]`

thoughts 2021-04-16 18:41:12 +00:00			`---`

			`## alternative: public/private git-annex branch separation`

			`This would need a separate index file in addition to .git/annex/index,`
			`call it index-private. Any logging of information that includes a hidden`
			`uuid would modify index-private instead. When index-private exists,`
			`git-annex branch queries have to read from both index and index-private,`
			`and can just concacentate the two. (But writes are more complicated, see below.)`

			`index-private could be committed to a git-annex-private branch, but this is`
			`not necessary for the single repo case. But doing that could allow for a`
			`group of repos that are all hidden but share information via that branch.`

			`Performance overhead seems like it would mostly be in determining if`
			`index-private exists and needs to be read from. And that should be`
			`cacheable for the life of git-annex.`

			`Or, it could only read from index-private when the git config has the`
			`feature enabled. Then turning off the feature would need the user to do`
			`something to merge index-private into index, in order to keep using the`
			`repo with the info stored there. This approach also means that the user can`
			`temporarily disable the feature and see how the git-annex info will appear`
			`to others, eg see that whereis and info do not list their hidden repository.`

			`In a lot of ways this seems simpler than the approach of not writing to the`
			`branch. Main complication is log writes. All the log modules need to`
			`indicate when writes are adding information that needs to be kept private.`

			`Often a log file will be read, and then written with a new line added`
			`and perhaps an old line removed. Eg:`

			`addLog file line = Annex.Branch.change file $ \b ->`
			`buildLog $ compactLog (line : parseLog b)`

			`This seems like it would need Annex.Branch.change to be passed a parameter`
			`indicating if it's changing the public or private log. Luckily, I think`
			`all reads followed by writes do go via Annex.Branch.change, so Annex.Branch.get`
			`can just concacenate the two without worrying about it leaking back out in a`
			`later write.`
yoh asked me to tag this datalad 2021-04-19 17:20:07 +00:00
thoughts 2021-04-19 17:58:43 +00:00			`## networks of hidden repos`

			`There are a lot of complications involving using hidden repos as remotes.`
			`It may be best to not support doing it at all. Some of the complications`
			`are discussed below.`

			`If location tracking and other data is not stored for a hidden remote, it`
			`will be hard to do much useful with that remote. For example, git-annex get`
			`from that remote would not work. (Without setting annex-speculate-present.)`
			`Also some remotes depend on state being stored, like chunking data,`
			`encryption, etc. So setting up networks of repos that know about`
			`one-another but are hidden from the wider world would need some kind of`
			`public/private git-annex branch separation.`

			`If there's a branch such as git-annex-private that should only be pushed`
			`to remotes that know about the hidden repo, it invites mistakes. git push`
			`of matching branches would push it to any remote that happens to have a`
			`branch by the same name, which could even be done maliciously. Avoiding`
			`that would need to avoid using a named branch, and only let`
			`git-annex sync push the information, to remotes it knows should receive it.`

			`A related problem is, git-annex move will update location tracking for the`
			`remote. If the remote is hidden, that would expose its uuid, unless`
			`git-annex move knew that it was and either avoided doing that, or wrote to`
			`the private git-annex branch. So, setting up a network of hidden repos`
			`would need some way to tell each of them which of the others were also`
			`hidden. A per-remote git config is one way, but the user would need to`
			`remember to set them up when setting up the remotes.`

			`Would storing a list of the uuids of hidden repos be acceptable? If there`
			`were a list in the git-annex branch, then it would be easier to support`
			`networks of hidden repos. The only information exposed would be the number`
			`of hidden repos and when they were added. Space used would generally be`
			`small, although in situations where temporary repos are being created`
			`frequently, and are hidden to avoid bloating the branch, there would be a`
			`small amount of bloat.`

			`Alternatively, the UUID of a hidden repo could somehow encode the fact that`
			`it's hidden. Although this would make it hard to convert a repo from/to`
			`being hidden.`

			`None of the above allows for a network of hidden repos, one of which is`
			`part of a different network of hidden repos. Supporting that would be a`
			`major complication.`

start implementing hidden git-annex repositories This adds a separate journal, which does not currently get committed to an index, but is planned to be committed to .git/annex/index-private. Changes that are regarding a UUID that is private will get written to this journal, and so will not be published into the git-annex branch. All log writing should have been made to indicate the UUID it's regarding, though I've not verified this yet. Currently, no UUIDs are treated as private yet, a way to configure that is needed. The implementation is careful to not add any additional IO work when privateUUIDsKnown is False. It will skip looking at the private journal at all. So this should be free, or nearly so, unless the feature is used. When it is used, all branch reads will be about twice as expensive. It is very lucky -- or very prudent design -- that Annex.Branch.change and maybeChange are the only ways to change a file on the branch, and Annex.Branch.set is only internal use. That let Annex.Branch.get always yield any private information that has been recorded, without the risk that Annex.Branch.set might be called, with a non-private UUID, and end up leaking the private information into the git-annex branch. And, this relies on the way git-annex union merges the git-annex branch. When reading a file, there can be a public and a private version, and they are just concacenated together. That will be handled the same as if there were two diverged git-annex branches that got union merged. 2021-04-20 18:32:41 +00:00			`Things other than the git-annex branch that can expose the existence of the`
			`repository:`

			`* The p2p protocol has an AUTH that includes the repository that is`
			`connecting. This should be ok, since links between repositories have to be`
			`set up explicitly.`
			`* git-annex-shell configlist will list the UUID. User has to know/guess`
			`the repo exists and have an accepted ssh key.`

alternative 2021-04-21 21:18:47 +00:00			`# alternative: git-annex branch filtering`

			`Different angle on this: Let the git-annex branch grow as usual. But`
			`provide a way to filter uuids out of the git-annex branch, producing a new`
			`branch.`

update 2021-04-22 03:42:00 +00:00			`Then the user can push the filtered branch back to origin or whatever they`
alternative 2021-04-21 21:18:47 +00:00			`want to do with it. It would be up to them to avoid making a mistake and`
			`letting git push automatically send git-annex to origin/git-annex.`
			`Maybe git has sufficient configs to let it be configured to avoid such`
update 2021-04-22 03:42:00 +00:00			`mistakes, dunno. (git-annex sync would certianly be a foot shooting`
			`opportunity too.)`

			`> Setting remote.name.push = simple would avoid accidental pushes.`
			`> But if the user wanted to otherwise push matching branches, they would`
			> not be able to express that with a git config. Also, `git push origin :`
			`> would override that config.`
			`>`
			`> Using a different branch name than git-annex when branch filtering is`
			`> enabled would be avoid most accidental pushes. And then the filtering`
			`> could produce the git-annex branch.`
alternative 2021-04-21 21:18:47 +00:00
			`The filtering would need to go back from the top commit to the last commit`
			`that was filtered, and remove all mentions of the uuid. The transition`
			`code (mostly) knows how to do that, but it doesn't preserve the history of`
			`commits currently, and filtering would need to preserve that.`

			`Any commits that were made elsewhere or that don't contain the UUIDs would`
			`keep the same trees, and should keep the same commit hashes too, as long`
			`as their parents are the same.`

			`This would support any networks of hidden repos that might be wanted.`
update 2021-04-22 03:42:00 +00:00			`And it's clean.. Except it punts the potential foot shooting of`
			`keeping the unfiltered branch private and unpushed to the user, and it`
			`adds a step of needing to do the filtering before pushing.`
alternative 2021-04-21 21:18:47 +00:00
yoh asked me to tag this datalad 2021-04-19 17:20:07 +00:00			`[[!tag projects/datalad]]`
done 2021-04-23 18:48:45 +00:00
			`> [[done]]; see [[tips/cloning_a_repository_privately]] --[[Joey]]`