In the world of git, we're not scared about internal implementation
details, and sometimes we like to dive in and tweak things by hand. Here's
some documentation to that end.

[[!toc ]]

## The .git/ directory

### `.git/annex/objects/aa/bb/*/*`

This is where locally available file contents are actually stored.
Files added to the annex get a symlink or [[pointer_file]] checked into git,
that points to the file content.

First there are two levels of directories used for hashing, to prevent
too many things ending up in any one directory.
See [[hashing]] for details.

Each subdirectory has the [[name_of_a_key|key_format]] in one of the
[[key-value_backends|backends]]. The file inside also has the name of the key.
This two-level structure is used because it allows the write bit to be removed
from the subdirectories as well as from the files. That prevents accidentally
deleting or changing the file contents. See [[lockdown]] for details.

### `.git/annex/tmp/`

This directory contains partially transferred objects.

### `.git/annex/othertmp/`

This is a temp directory for miscellaneous other temp files.

### `.git/annex/bad/`

git-annex fsck puts any bad objects it finds in here.

### `.git/annex/transfers/`

Contains information files for uploads and downloads that are in progress,
as well as any that have failed. Used especially by the assistant.
It is safe to delete these files.

### `.git/annex/ssh/`

ssh connection caching files are written in here. It is safe to delete
these files.

### `.git/annex/index`

This is a git index file which git-annex uses to stage files
when preparing commits to the git-annex branch.
It's pretty safe to delete this file if git-annex is not currently running.
It will be re-created as necessary.

### `.git/annex/journal/`

git-annex uses this to journal changes to the git-annex branch,
before committing a set of changes.

<a name="The_git-annex_branch"></a>

## The git-annex branch

This branch is managed by git-annex, with the contents listed below.

This branch is not connected to your master, etc branches. It is used for
internal tracking of information about git-annex repositories and annexed
objects.

The files stored in this branch are all designed to be auto-merged
using git's [[union merge driver|git-union-merge]]. So each line
has a timestamp, to allow the most recent information to be identified.

### `uuid.log`

Records the UUIDs of known repositories, and associates them with a
description of the repository. This allows git-annex to display something
more useful than a UUID when it refers to a repository that does not have
a configured git remote pointing at it.

The file format is simply one line per repository, with the uuid followed by a
space and then the description, followed by a timestamp. Example:

    e605dca6-446a-11e0-8b2a-002170d25c55 laptop timestamp=1317929189.157237s
    26339d22-446b-11e0-9101-002170d25c55 usb disk timestamp=1317929330.769997s
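
As a rough illustration of the line format above, here is a minimal Python sketch that splits one `uuid.log` line into its parts. The function and type names are made up for this example and are not part of git-annex. It assumes every line ends with a `timestamp=...s` field, as in the examples.

```python
from typing import NamedTuple


class UuidLogEntry(NamedTuple):
    uuid: str
    description: str
    timestamp: float


def parse_uuid_log_line(line: str) -> UuidLogEntry:
    """Parse one line of the form: UUID DESCRIPTION timestamp=SECONDSs"""
    # Split off the trailing timestamp field first, because the
    # description itself may contain spaces (eg "usb disk").
    body, _, stamp = line.rstrip("\n").rpartition(" timestamp=")
    uuid, _, description = body.partition(" ")
    # The timestamp is a decimal number of seconds with an "s" suffix.
    return UuidLogEntry(uuid, description, float(stamp.rstrip("s")))
```

Splitting from the right keeps multi-word descriptions intact.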

### `numcopies.log`

Records the global numcopies setting.

The file format is simply a timestamp followed by a number.

### `mincopies.log`

Records the global mincopies setting.

The file format is simply a timestamp followed by a number.

### `config.log`

Records global configuration settings, which can be overridden by values
in `.git/config`.

The file format is a timestamp, followed by the name of the configuration,
followed by the value. For example:

    1317929189.157237s annex.autocommit false
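
Since the files in this branch are union merged, a setting can appear on several lines, and the timestamp identifies the most recent one. The following Python sketch shows one way to read `config.log` with that rule. The function name is hypothetical, and it assumes that fields are separated by single spaces and that the value is everything after the name.

```python
def parse_config_log(text: str) -> dict[str, str]:
    """Return the most recent value of each setting in a config.log."""
    latest: dict[str, tuple[float, str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Assumption: "TIMESTAMP NAME VALUE", where VALUE may contain spaces.
        stamp, name, value = line.split(" ", 2)
        when = float(stamp.rstrip("s"))
        # Keep only the entry with the newest timestamp for each name.
        if name not in latest or when >= latest[name][0]:
            latest[name] = (when, value)
    return {name: value for name, (_, value) in latest.items()}
```

The order of lines in the file does not matter here, only their timestamps.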

### `remote.log`

Holds persistent configuration settings for [[special_remotes]] such as
Amazon S3.

The file format is one line per remote, starting with the uuid of the
remote, followed by a space, and then a series of var=value pairs,
each separated by whitespace, and finally a timestamp.

Special remotes that are autoenabled have autoenable=true here.

Encrypted special remotes store their encryption key here,
in the "cipher" value. It is base64 encoded, and unless shared [[encryption]]
is used, is encrypted to one or more gpg keys. The first 256 bytes of
the cipher are used as the HMAC SHA1 encryption key, to encrypt filenames
stored on the special remote. The remainder of the cipher is used as a gpg
symmetric encryption key, to encrypt the content of files stored on the special
remote.
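
To make the var=value layout concrete, here is a small Python sketch that parses one `remote.log` line. The function name and the sample line in the usage note are invented for illustration. It assumes that values contain no whitespace and that the timestamp uses the same `timestamp=...s` form as the other UUID-based logs.

```python
from typing import Optional


def parse_remote_log_line(line: str) -> tuple[str, dict[str, str], Optional[float]]:
    """Split a remote.log line into (uuid, settings, timestamp)."""
    fields = line.split()
    uuid, pairs = fields[0], fields[1:]
    # Split each pair on the first "=" only, so that values which
    # themselves contain "=" (such as base64 padding) stay intact.
    settings = dict(pair.split("=", 1) for pair in pairs)
    stamp = settings.pop("timestamp", None)
    when = float(stamp.rstrip("s")) if stamp is not None else None
    return uuid, settings, when
```

For a made-up line such as `e605dca6-446a-11e0-8b2a-002170d25c55 autoenable=true cipher=QUJD timestamp=1317929189.157237s`, this returns the uuid, the settings `autoenable` and `cipher`, and the timestamp as a float.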

### `trust.log`

Records the [[trust]] information for repositories. Does not exist unless
[[trust]] values are configured.

The file format is one line per repository, with the uuid followed by a
space, then one of `1` (trusted), `0` (untrusted), `?` (semi-trusted),
or `X` (dead), and finally a timestamp.

Example:

    e605dca6-446a-11e0-8b2a-002170d25c55 1 timestamp=1317929189.157237s
    26339d22-446b-11e0-9101-002170d25c55 ? timestamp=1317929330.769997s

Repositories not listed are semi-trusted.
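
The trust codes and the semi-trusted default can be sketched in Python as follows. The names used here are hypothetical. When a repository appears on several lines, the sketch keeps the line with the newest timestamp, which is an assumption based on how the branch's logs are merged.

```python
TRUST_CODES = {"1": "trusted", "0": "untrusted", "?": "semi-trusted", "X": "dead"}


def parse_trust_log(text: str) -> dict[str, str]:
    """Map each listed repository uuid to its most recent trust level."""
    latest: dict[str, tuple[float, str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        uuid, code, stamp = line.split()
        # stamp looks like "timestamp=1317929189.157237s"
        when = float(stamp.split("=", 1)[1].rstrip("s"))
        if uuid not in latest or when >= latest[uuid][0]:
            latest[uuid] = (when, TRUST_CODES[code])
    return {uuid: level for uuid, (_, level) in latest.items()}


def trust_level(levels: dict[str, str], uuid: str) -> str:
    # Repositories not listed in trust.log are semi-trusted.
    return levels.get(uuid, "semi-trusted")
```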

### `group.log`

Used to group repositories together.

The file format is one line per repository, with the uuid followed by a space,
and then a space-separated list of groups this repository is part of,
and finally a timestamp.

### `preferred-content.log`

Used to indicate which repositories prefer to contain which file contents.

The file format is one line per repository, with the uuid followed by a space,
then a boolean expression, and finally a timestamp.

Files matching the expression are preferred to be retained in the
repository, while files not matching it are preferred to be stored
somewhere else.

### `required-content.log`

Used to indicate which repositories are required to contain which file
contents.

File format is identical to preferred-content.log.

### `group-preferred-content.log`

Contains standard preferred content settings for groups. (Overriding or
supplementing the ones built into git-annex.)

The file format is one line per group, starting with a timestamp, then a
space, then the group name followed by a space and then the preferred
content expression.

### `export.log`

Tracks what trees have been exported to special remotes by
[[git-annex-export]](1).

Each line starts with a timestamp, then the uuid of the repository
that exported to the special remote, followed by a colon (`:`) and
the uuid of the special remote. Then, separated by spaces,
the SHA of the tree that was exported, and optionally any number of
subsequent SHAs, of trees that have started to be exported but whose
export is not yet complete.

In order to record the beginning of the first export, where nothing
has been exported yet, the SHA of the exported tree can be
the empty tree (eg 4b825dc642cb6eb9a060e54bf8d69288fbee4904).

For example:

    1317929100.012345s e605dca6-446a-11e0-8b2a-002170d25c55:26339d22-446b-11e0-9101-002170d25c55 4b825dc642cb6eb9a060e54bf8d69288fbee4904 bb08b1abd207aeecccbc7060e523b011d80cb35b
    1317929100.012345s e605dca6-446a-11e0-8b2a-002170d25c55:26339d22-446b-11e0-9101-002170d25c55 bb08b1abd207aeecccbc7060e523b011d80cb35b
    1317929189.157237s e605dca6-446a-11e0-8b2a-002170d25c55:26339d22-446b-11e0-9101-002170d25c55 bb08b1abd207aeecccbc7060e523b011d80cb35b 7c7af825782b7c8706039b855c72709993542be4
    1317923000.251111s e605dca6-446a-11e0-8b2a-002170d25c55:26339d22-446b-11e0-9101-002170d25c55 7c7af825782b7c8706039b855c72709993542be4

(The trees are also grafted into the git-annex branch, at
`export.tree`, to prevent git from garbage collecting them. However, the head
of the git-annex branch should never contain such a grafted in tree;
the grafted tree is removed in the same commit that updates `export.log`.)
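
The fields of an `export.log` line can be pulled apart with a sketch like the one below. The type and function names are illustrative only. It assumes each line contains at least one tree SHA and that fields are separated by whitespace.

```python
from typing import List, NamedTuple

# SHA of git's empty tree, used to mark the start of a first export.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


class ExportLogEntry(NamedTuple):
    timestamp: float
    exporter_uuid: str      # repository that performed the export
    remote_uuid: str        # special remote that was exported to
    exported_tree: str      # tree whose export completed
    incomplete_trees: List[str]  # trees whose export has started only


def parse_export_log_line(line: str) -> ExportLogEntry:
    stamp, participants, *trees = line.split()
    # The two uuids are joined by a colon; uuids contain no colons.
    exporter_uuid, remote_uuid = participants.split(":", 1)
    return ExportLogEntry(
        float(stamp.rstrip("s")),
        exporter_uuid,
        remote_uuid,
        trees[0],
        trees[1:],
    )
```

An entry whose `exported_tree` equals `EMPTY_TREE` and whose `incomplete_trees` is non-empty corresponds to a first export that is still in progress.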

### `aaa/bbb/*.log`

These log files record [[location_tracking]] information
for file contents. These are placed in two levels of subdirectories
for hashing. See [[hashing]] for details.

The name of the key is the filename, and the content
consists of a timestamp, either 1 (present) or 0 (not present) or X (dead),
and the UUID of the repository that has or lacks the file content.

Example:

    1287290776.765152s 1 e605dca6-446a-11e0-8b2a-002170d25c55
    1287290767.478634s 0 26339d22-446b-11e0-9101-002170d25c55
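
Here is a short Python sketch of how such a presence log could be reduced to the set of repositories that currently have the content. The function name is made up for this example. It assumes the newest line for each UUID wins, and it treats both `0` and `X` as not present.

```python
def present_locations(text: str) -> set[str]:
    """Return the uuids whose most recent log line says the key is present."""
    latest: dict[str, tuple[float, str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, status, uuid = line.split()
        when = float(stamp.rstrip("s"))
        # Later timestamps override earlier ones for the same repository.
        if uuid not in latest or when >= latest[uuid][0]:
            latest[uuid] = (when, status)
    # "1" means present; "0" (not present) and "X" (dead) are excluded.
    return {uuid for uuid, (_, status) in latest.items() if status == "1"}
```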

### `aaa/bbb/*.log.web`

These log files record urls used by the
[[web_special_remote|special_remotes/web]] and sometimes by other remotes.
Their format is similar to the location tracking files, but with urls
rather than UUIDs.

### `aaa/bbb/*.log.ek`

These log files record other keys that are equivalent to the key
used in the filename. This is currently used for the `VURL` backend.
Their format is similar to the location tracking files, but with keys
rather than UUIDs.

### `aaa/bbb/*.log.rmt`

These log files are used by remotes that need to record their own state
about keys. Each remote can store one line of data about a key, in
its own format.

Note that only the most recently set state about a key is seen
by remotes using this. The `log.rmet` documented below does not have this
limitation.

Example:

    1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55 blah blah
    1287290767.478634s 26339d22-446b-11e0-9101-002170d25c55 foo=bar

### `aaa/bbb/*.log.met`

These log files are used to store arbitrary [[design/metadata]] about keys.
Each key can have any number of metadata fields. Each field has a set of
values.

Lines are timestamped, and record when values are added (`field +value`),
but also when values are removed (`field -value`). Removed values
are retained in the log so that when merging an old line that sets a value
that was later unset, the value is not accidentally added back.

For example:

    1287290776.765152s tag +foo +bar author +joey
    1291237510.141453s tag -bar +baz

The value can be completely arbitrary data, although it's typically
reasonably short. If the value contains any whitespace
(including \r or \n), it will be base64 encoded. Base64 encoded values
are indicated by prefixing them with "!".
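
The add and remove semantics can be illustrated with a simplified replay in Python. This is only a sketch with invented function names. It assumes that applying lines in timestamp order is a good enough approximation of how the log is interpreted, that field names never begin with `+` or `-`, and that decoded values are UTF-8 text.

```python
import base64
from collections import defaultdict


def decode_value(token: str) -> str:
    """Decode a value, which is base64 encoded when prefixed with "!"."""
    if token.startswith("!"):
        return base64.b64decode(token[1:]).decode("utf-8", errors="replace")
    return token


def replay_metadata_log(text: str) -> dict[str, set[str]]:
    """Return the current set of values for each metadata field."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, *tokens = line.split()
        entries.append((float(stamp.rstrip("s")), tokens))

    fields: dict[str, set[str]] = defaultdict(set)
    # Apply the oldest lines first so that later removals take effect.
    for _, tokens in sorted(entries, key=lambda entry: entry[0]):
        field = None
        for token in tokens:
            if field is not None and token[0] in "+-":
                value = decode_value(token[1:])
                if token[0] == "+":
                    fields[field].add(value)
                else:
                    fields[field].discard(value)
            else:
                field = token  # a new field name starts here
    return {name: values for name, values in fields.items() if values}
```

Replaying the two example lines above leaves `tag` with the values `foo` and `baz`, and `author` with `joey`.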

### `aaa/bbb/*.log.rmet`

These log files store per-remote metadata about keys. This metadata
is only used by the remote.

Format is the same as the metadata log files above, but each metadata key
is prefixed with "uuid:" to indicate the remote it belongs to.

For example:

    1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55:foo +bar
    1287290776.765152s 26339d22-446b-11e0-9101-002170d25c55:x +1
    1291237510.141453s 26339d22-446b-11e0-9101-002170d25c55:x -1 26339d22-446b-11e0-9101-002170d25c55:x +2

### `aaa/bbb/*.log.cid`

These log files store per-remote content identifiers for keys.
A given key may have any number of content identifiers.

The format is a timestamp, followed by the uuid of the remote,
followed by the content identifiers which are separated by colons.
If a content identifier contains a colon or \r or \n, it will be base64
encoded. Base64 encoded values are indicated by prefixing them with "!".

    1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55 5248916:5250378
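
A possible way to read one such line is sketched below in Python. The function name is hypothetical. The sketch assumes single spaces between the three fields and decodes base64 identifiers as UTF-8 text, which may not hold for identifiers that are arbitrary bytes.

```python
import base64


def parse_cid_log_line(line: str) -> tuple[float, str, list[str]]:
    """Split a .log.cid line into (timestamp, remote uuid, identifiers)."""
    # Only split twice: the description above says colons, \r and \n
    # trigger base64 encoding, so spaces may remain in an identifier.
    stamp, uuid, joined = line.rstrip("\n").split(" ", 2)
    identifiers = []
    for item in joined.split(":"):
        if item.startswith("!"):
            # "!" marks a base64 encoded identifier.
            item = base64.b64decode(item[1:]).decode("utf-8", errors="replace")
        identifiers.append(item)
    return float(stamp.rstrip("s")), uuid, identifiers
```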

### `aaa/bbb/*.log.cnk`

These log files are used when objects are stored in chunked form on
remotes. They record the size(s) of the chunks, and the number of chunks.

For example, this logs that a remote has an object stored using both
9 chunks of 1 mb size, and 1 chunk of 10 mb size.

    1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55:10240 9
    1287290776.765153s e605dca6-446a-11e0-8b2a-002170d25c55:102400 1

(When those chunks are removed from the remote, the 9 is changed to 0.)
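
The chunk log can be read with a sketch along these lines. The function name is invented, and the sketch makes two assumptions: the newest line for each (uuid, chunk size) pair is the one that counts, and a count of 0 means no chunks of that size remain on the remote.

```python
def parse_chunk_log(text: str) -> dict[tuple[str, int], int]:
    """Map (remote uuid, chunk size) to the number of chunks stored."""
    latest: dict[tuple[str, int], tuple[float, int]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, key, count = line.split()
        # The key is "UUID:CHUNKSIZE"; split on the last colon.
        uuid, size = key.rsplit(":", 1)
        when = float(stamp.rstrip("s"))
        entry = (uuid, int(size))
        if entry not in latest or when >= latest[entry][0]:
            latest[entry] = (when, int(count))
    # Drop entries whose chunks have since been removed (count 0).
    return {entry: count for entry, (_, count) in latest.items() if count > 0}
```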

### `schedule.log`

Used to record scheduled events, such as periodic fscks.

The file format is simply one line per repository, with the uuid followed by a
space and then its schedule, followed by a timestamp.

There can be multiple events in the schedule, separated by "; ".

The format of the scheduled events is the same as described in
[[git-annex-schedule]].

Example:

    42bf2035-0636-461d-a367-49e9dfd361dd fsck self 30m every day at any time; fsck 4b3ebc86-0faf-4892-83c5-ce00cbe30f0a 1h every year at any time timestamp=1385646997.053162s
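
Splitting a `schedule.log` line into its events might look like this in Python. The function name is made up for illustration, and the sketch does not interpret the events themselves, whose syntax is documented in [[git-annex-schedule]]. It assumes every line ends with a `timestamp=...s` field.

```python
def parse_schedule_log_line(line: str) -> tuple[str, list[str], float]:
    """Split a schedule.log line into (uuid, events, timestamp)."""
    # Take the trailing timestamp off first; events contain spaces.
    body, _, stamp = line.rstrip("\n").rpartition(" timestamp=")
    uuid, _, schedule = body.partition(" ")
    # Multiple events are separated by "; ".
    events = [event for event in schedule.split("; ") if event]
    return uuid, events, float(stamp.rstrip("s"))
```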

### `activity.log`

Used to record the times of activities, such as fscks.

Example:

    42bf2035-0636-461d-a367-49e9dfd361dd Fsck timestamp=1422387398.30395s

### `transitions.log`

Used to record transitions, eg by `git annex forget`.

Each line of the file is a transition, followed by a timestamp.

Example:

    ForgetGitHistory 1387325539.685136s
    ForgetDeadRemotes 1387325539.685136s

### `difference.log`

Used when a repository has fundamental differences from other repositories,
that should prevent merging.

Example:

    e605dca6-446a-11e0-8b2a-002170d25c55 [ObjectHashLower] timestamp=1422387398.30395s

### `multicast.log`

Records uftp public key fingerprints, for use by [[git-annex-multicast]].

### `migrate.tree/old` and `migrate.tree/new`

These are used to record migrations done by `git-annex migrate`. By diffing
between the two, the old and new keys can be determined. This lets
migrations be recorded while using a minimum of space in the git
repository. The filenames in these trees have no connection to the names
of actual annexed files.

These trees are recorded in history of the git-annex branch, but the
head of the git-annex branch will never contain them.

## Other internals documentation

* [[git-remote-annex]] documents how git repositories are stored
  on special remotes when using git with "annex::" urls.