Commit graph

98 commits

Author SHA1 Message Date
Joey Hess
b325694645 getKeysPresent is now fully lazy
.. Allowing it to be used by things in constant space!

Random statistics: git annex status has gone from taking 239 mb
of memory and 26 seconds in a repo, to 8 mb and 13 seconds.

The trick here is the unsafeInterleaveIO, and the form of the function's
recursion, which I cribbed heavily from System.IO.HVFS.Utils.recurseDirStat.
The difference is, this one goes to a limited depth and avoids statting
everything.
2012-03-11 18:04:58 -04:00
Joey Hess
ff3644ad38 status: Fixed to run in nearly constant space.
Before, it leaked space due to caching lists of keys. Now all necessary
data about keys is calculated as they stream in.

The "nearly constant" is due to getKeysPresent, which builds up a lot
of [] thunks as it traverses .git/annex/objects/. Will deal with it later.
2012-03-11 17:15:58 -04:00
Joey Hess
d08ee1a9d2 syscall optimisation 2012-03-06 13:56:20 -04:00
Joey Hess
12b89a3eb8 configure: Check if ssh connection caching is supported by the installed version of ssh and default annex.sshcaching accordingly. 2012-02-25 19:15:29 -04:00
Joey Hess
1f73db3469 improve alwayscommit=false mode
Now changes are staged into the branch's index, but not committed,
which avoids growing a large journal. And sync and merge always
explicitly commit, ensuring that even when they do nothing else,
they commit the staged changes.

Added a flag file to indicate that the branch's journal contains
uncommitted changes. (Could use git ls-files, but don't want to run
that every time.)

In the future, this ability to have uncommitted changes staged in the
journal might be used on remotes after a series of oneshot commands.
2012-02-25 16:18:55 -04:00
Joey Hess
b49c0c2633 add annex.alwayscommit option
To avoid commits of data to the git-annex branch after each command
is run, set annex.alwayscommit=false. Its data will then be committed
less frequently, when a merge or sync is done.
2012-02-25 15:31:42 -04:00
Joey Hess
bd66f962d3 Deal with NFS problem that caused a failure to remove a directory when removing content from the annex.
I was able to reproduce this on linux using the kernel's nfs server and
mounting localhost:/. Determined that removing the directory fails when
the just-deleted file in it was locked. Considered dropping the lock
before removing the directory, but this would complicate parts of the code
that should not need to worry about locking. So instead, ignore the failure
to remove the directory in this case.

While I was at it, made it attempt to remove both levels of hash
directories, in case they're empty.
2012-02-24 16:30:47 -04:00
Joey Hess
a1e52f0ce5 hlint 2012-02-16 00:44:51 -04:00
Joey Hess
52c5b164d8 Added a annex.queuesize setting
useful when adding hundreds of thousands of files on a system with plenty
of memory.

git add gets quite slow in such a large repository, so if the system has
more than the ~32 mb of memory the queue can use by default, it's a useful
optimisation to increase the queue size, in order to decrease the number
of times git add is run.
2012-02-15 11:14:19 -04:00
Joey Hess
03c559f8d6 tweak 2012-02-14 14:51:26 -04:00
Joey Hess
7ebd98d8d8 fix memory leak when staging the journal
The list of files had to be retained until the end so it could be deleted.
Also, a list of update-index lines was generated and only then fed into it.
Now everything streams in constant space.
2012-02-14 14:37:59 -04:00
Joey Hess
a40ec5e03e Fixed a memory leak due to excessive strictness when committing journal files.
When hashing the files, the entire list of shas was read strictly.
That was entirely unnecessary, since there's a cleanup action run
after they're consumed.
2012-02-14 11:20:34 -04:00
Joey Hess
cbaebf538a rework git check-attr interface
Now gitattributes are looked up, efficiently, in only the places that
really need them, using the same approach used for cat-file.

The old CheckAttr code seemed very fragile, in the way it streamed files
through git check-attr.
I actually found that cad8824852
was still deadlocking with ghc 7.4, at the end of adding a lot of files.
This should fix that problem, and avoid future ones.

The best part is that this removes withAttrFilesInGit and withNumCopies,
which were complicated Seek methods, as well as simplfying the types
for several other Seek methods that had a Backend tupled in.
2012-02-13 23:52:21 -04:00
Joey Hess
d55f3c0716 Fix teardown of stale cached ssh connections. 2012-02-09 21:49:46 -04:00
Joey Hess
146c36ca54 IO exception rework
ghc 7.4 comaplains about use of System.IO.Error to catch exceptions.
Ok, use Control.Exception, with variants specialized to only catch IO
exceptions.
2012-02-03 16:47:24 -04:00
Joey Hess
b81d662cbf Avoid repeated location log commits when a remote is receiving files.
Done by adding a oneshot mode, in which location log changes are written to
the journal, but not committed. Taking advantage of git-annex's existing
ability to recover in this situation.

This is used by git-annex-shell and other places where changes are made to
a remote's location log.
2012-01-28 15:41:52 -04:00
Joey Hess
ba6088b249 rename readMaybe to readish
a stricter (but also partial) readMaybe is getting added to base
2012-01-23 17:00:10 -04:00
Joey Hess
eb9001044f order user provided params after connection caching params
So the user can override them.
2012-01-20 17:32:32 -04:00
Joey Hess
6ef82665de add annex.sshcaching config setting 2012-01-20 17:15:46 -04:00
Joey Hess
47250a153a ssh connection caching
Ssh connection caching is now enabled automatically by git-annex. Only one
ssh connection is made to each host per git-annex run, which can speed some
things up a lot, as well as avoiding repeated password prompts. Concurrent
git-annex processes also share ssh connections. Cached ssh connections are
shut down when git-annex exits.

Note: The rsync special remote does not yet participate in the ssh
connection caching.
2012-01-20 17:14:56 -04:00
Joey Hess
61dbad505d fsck --from remote --fast
Avoids expensive file transfers, at the expense of checking file size
and/or contents.

Required some reworking of the remote code.
2012-01-20 13:23:11 -04:00
Joey Hess
effaa298fa optimise fsck --from normal git remotes
For a local git remote, can symlink the file.
For a git remote using rsync, can preseed any local content.

There are a few reasons to use fsck --from on a normal git remote.
One is if it's using gitosis or similar, and you don't have shell access
to run git annex locally. Another reason could be if you just want to
fsck certian files of a bare remote.
2012-01-19 17:10:44 -04:00
Joey Hess
81856c3175 add a configure check for StatFS
This way, the build log will indicate whether StatFS can be relied on.
I've tested all the failing architectures now, and on all of them,
the StatFS code now returns Nothing, rather than Just nonsense.

Also, if annex.diskreserve is set on a platform where StatFS is not
working, git-annex will complain.

Also, the Makefile was missing the sources target used when building with
cabal.
2012-01-15 13:49:32 -04:00
Joey Hess
a3d97e0c85 tweak 2012-01-14 14:31:16 -04:00
Joey Hess
5e2b4e16ba avoid multiple unnecessary stats of the index file
Up to one per file processed.
2012-01-14 12:07:36 -04:00
Joey Hess
abdacf58ed tweaks 2012-01-11 00:06:54 -04:00
Joey Hess
16e7178f20 reorg 2012-01-10 15:29:10 -04:00
Joey Hess
a3a9f87047 log: New command that displays the location log for file, showing each repository they were added to and removed from.
This needs to run git log on the location log files to get at all past
versions of the file, which tends to be a bit slow.

It would be possible to make a version optimised for showing the location
logs for every key. That would only need to run git log once, so would be
faster, but it would need to process an enormous amount of data, so
would not speed up the individual file case.

In the future it would be nice to support log --format. log --json also
doesn't work right yet.
2012-01-06 15:40:07 -04:00
Joey Hess
aa0882691b Added remote.name.annex-web-options configuration setting, which can be used to provide parameters to whichever of wget or curl git-annex uses (depends on which is available, but most of their important options suitable for use here are the same). 2012-01-02 14:20:20 -04:00
Joey Hess
252376d639 Merge branch 'master' into autosync 2011-12-30 20:38:59 -04:00
Joey Hess
52104dae6f refactor 2011-12-30 18:36:40 -04:00
Joey Hess
925b6390aa add forceUpdate
This code is picked from my tweak-fetch branch, which already did the needed
refactoring.
2011-12-30 15:57:28 -04:00
Joey Hess
6d4382a89e Merge branch 'new-monad-control' 2011-12-24 23:02:42 -04:00
Joey Hess
ee3b5b2a42 use Common in a few more modules 2011-12-20 14:37:53 -04:00
Joey Hess
ef28b3fef7 split out Git/Command.hs 2011-12-14 15:56:11 -04:00
Joey Hess
02f1bd2bf4 split more stuff out of Git.hs 2011-12-14 15:43:13 -04:00
Joey Hess
25b2cc4148 move commit to Git.Branch 2011-12-13 15:08:44 -04:00
Joey Hess
13fff71f20 split out three modules from Git
Constructors and configuration make sense in separate modules.
A separate Git.Types is needed to avoid cycles.
2011-12-13 15:06:49 -04:00
Joey Hess
46588674b0 avoid closing pipe before all the shas are read from it
Could have just used hGetContentsStrict here, but that would require
storing all the shas in memory. Since this is called at the end of a
git-annex run, it may have created a *lot* of shas, so I avoid that memory
use and stream them out like before.
2011-12-12 21:41:37 -04:00
Joey Hess
0e45b762a0 broke out Git/HashObject.hs 2011-12-12 21:24:55 -04:00
Joey Hess
31a0c07ee9 broke out Git/Branch.hs and reorganized 2011-12-12 21:12:51 -04:00
Joey Hess
543d0d2501 split out Git/Ref.hs 2011-12-12 18:30:33 -04:00
Joey Hess
da95cbadca split out Annex/Journal.hs 2011-12-12 18:03:28 -04:00
Joey Hess
98dfc0c9b0 split out Annex/BranchState.hs 2011-12-12 17:38:46 -04:00
Joey Hess
b2f934e07a update comment 2011-12-12 17:24:12 -04:00
Joey Hess
79345ad5fc optimisation
avoids a redundant call to git show-ref
2011-12-12 03:30:47 -04:00
Joey Hess
f9cd3f6ad1 optimisation
avoids a useless diff from git-annex..refs/heads/git-annex
2011-12-12 02:31:07 -04:00
Joey Hess
2332afb4bc cleanup 2011-12-12 02:04:48 -04:00
Joey Hess
29b88ad657 avoid redundant call to updateIndex
commitBranch calls updateIndex
2011-12-11 21:46:21 -04:00
Joey Hess
c4c965d602 detect and recover from branch push/commit race
Dealing with a race without using locking is exceedingly difficult and tricky.
Fully tested, I hope.

There are three places left where the branch can be updated, that are not
covered by the race recovery code. Let's prove they're all immune to the
race:

1. tryFastForwardTo checks to see if a fast-forward can be done,
   and then does git-update-ref on the branch to fast-forward it.

   If a push comes in before the check, then either no fast-forward
   will be done (ok), or the push set the branch to a ref that can
   still be fast-forwarded (also ok)

   If a push comes in after the check, the git-update-ref will
   undo the ref change made by the push. It's as if the push did not come
   in, and the next git-push will see this, and try to re-do it.
   (acceptable)

2. When creating the branch for the very first time, an empty index
   is created, and a commit of it made to the branch. The commit's ref
   is recorded as the current state of the index. If a push came in
   during that, it will be noticed the next time a commit is made to the
   branch, since the branch will have changed. (ok)

3. Creating the branch from an existing remote branch involves making
   the branch, and then getting its ref, and recording that the index
   reflects that ref.

   If a push creates the branch first, git-branch will fail (ok).

   If the branch is created and a racing push is then able to change it
   (highly unlikely!) we're still ok, because it first records the ref into
   the index.lck, and then updating the index. The race can cause the
   index.lck to have the old branch ref, while the index has the newly pushed
   branch merged into it, but that only results in an unnecessary update of
   the index file later on.
2011-12-11 20:41:35 -04:00