git-annex/doc/scalability.mdwn
Joey Hess 7189dfd77d git-annex (5.20131127) unstable; urgency=low
  * webapp: Detect when upgrades are available, and upgrade if the user
    desires.
    (Only when git-annex is installed using the prebuilt binaries
    from git-annex upstream, not from eg Debian.)
  * assistant: Detect when the git-annex binary is modified or replaced,
    and either prompt the user to restart the program, or automatically
    restart it.
  * annex.autoupgrade configures both the above upgrade behaviors.
  * Added support for quvi 0.9. Slightly suboptimal due to limitations in its
    interface compared with the old version.
  * Bug fix: annex.version did not get set on automatic upgrade to v5 direct
    mode repo, so the upgrade was performed repeatedly, slowing commands down.
  * webapp: Fix bug that broke switching between local repositories
    that use the new guarded direct mode.
  * Android: Fix stripping of the git-annex binary.
  * Android: Make terminal app show git-annex version number.
  * Android: Re-enable XMPP support.
  * reinject: Allow to be used in direct mode.
  * Further improvements to git repo repair. Has now been tested in tens
    of thousands of intentionally damaged repos, and successfully
    repaired them all.
  * Allow use of --unused in bare repository.

# imported from the archive
2013-11-27 18:41:44 -04:00


git-annex is designed for scalability. The key points are:

* Arbitrarily large files can be managed. The only constraint
on file size is how large a file your filesystem can hold.
While git-annex does checksum files by default, there
is a [[WORM_backend|backends]] available that avoids the checksumming
overhead, so you can add new, enormous files, very fast. This also
allows it to be used on systems with very slow disk IO.
* Memory usage should be constant. This is a "should", because there
can sometimes be leaks (and this is one of Haskell's weak spots),
but git-annex is designed so that it does not need to hold all
the details about your repository in memory.
The one exception is that [[todo/git-annex_unused_eats_memory]],
because it *does* need to hold the whole repo state in memory. But
that is still considered a bug, and hoped to be solved one day.
Luckily, that command is not often used.
* Many files can be managed. The limiting factor is git's own
difficulty scaling to repositories with a lot of files, and as git
improves this will improve. Scaling to hundreds of thousands of files
is not a problem; beyond that, git will start to get slow.
To some degree, git-annex works around inefficiencies in git; for
example it batches input sent to certain git commands that are slow
when run in an enormous repository.
* It can use as much or as little bandwidth as is available. In
particular, any interrupted file transfer can be resumed by git-annex.
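
To illustrate why a WORM-style backend is fast on enormous files, here is a minimal Python sketch. It is not git-annex's actual implementation (which is written in Haskell). It assumes that a WORM-style key can be derived from file metadata alone (size, modification time, name), while a checksumming key requires reading every byte. The function names and the exact key layouts are illustrative assumptions.

```python
import hashlib
import os


def worm_style_key(path):
    """Build a key from metadata only; the file contents are never read."""
    st = os.stat(path)
    # Hypothetical layout: backend name, size, mtime, then the file name.
    return "WORM-s{}-m{}--{}".format(
        st.st_size, int(st.st_mtime), os.path.basename(path)
    )


def checksum_style_key(path, chunk_size=1 << 20):
    """Build a key from a SHA-256 digest; the cost grows with file size."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in fixed-size chunks so memory use stays constant
        # no matter how large the file is.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return "SHA256-s{}--{}".format(size, digest.hexdigest())
```

The first function costs a single `stat` call regardless of file size. The second has to stream the whole file through the hash, which is where slow disk IO hurts.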

## scalability tips

* If the files are so big that checksumming becomes a bottleneck, consider
using the [[WORM_backend|backends]]. You can always `git annex migrate`
files to a checksumming backend later on.
* If you're adding a huge number of files at once (hundreds of thousands),
you'll soon notice that git-annex periodically stops and says
"Recording state in git" while it runs a `git add` command that
becomes increasingly expensive. Consider setting `annex.queuesize`
to a higher value, at the expense of git-annex using more memory.
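
The trade-off behind `annex.queuesize` can be sketched with a toy model in Python. This is not git-annex's code. It assumes only that pending changes are buffered and recorded in one batch whenever the buffer reaches a limit, so a larger limit means fewer batches but more entries held in memory at once. `BatchQueue`, `flush_action`, and `count_flushes` are hypothetical names.

```python
class BatchQueue:
    """Buffer items and hand them to flush_action in batches."""

    def __init__(self, limit, flush_action):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.flush_action = flush_action
        self.pending = []
        self.flushes = 0

    def add(self, item):
        self.pending.append(item)
        # Flush as soon as the buffer is full.
        if len(self.pending) >= self.limit:
            self.flush()

    def flush(self):
        # Skip empty flushes so only real batches are counted.
        if self.pending:
            self.flush_action(self.pending)
            self.flushes += 1
            self.pending = []


def count_flushes(num_files, limit):
    """Return how many batches are needed to record num_files items."""
    queue = BatchQueue(limit, flush_action=lambda batch: None)
    for i in range(num_files):
        queue.add("file{}".format(i))
    queue.flush()  # record whatever is left over
    return queue.flushes
```

In this model, `count_flushes(100000, 10000)` needs 10 batches while `count_flushes(100000, 50000)` needs 2, at the cost of buffering up to five times as many entries between flushes.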