Commit graph

183 commits

Author SHA1 Message Date
Joey Hess
6c0155efb7 refactor 2012-02-20 15:22:21 -04:00
Joey Hess
f0f07db01d reorder prams and put -- after atrributes, for compatability with old git
(cherry picked from commit c8ec0e233e)
2012-02-15 14:01:06 -04:00
Joey Hess
52c5b164d8 Added a annex.queuesize setting
useful when adding hundreds of thousands of files on a system with plenty
of memory.

git add gets quite slow in such a large repository, so if the system has
more than the ~32 mb of memory the queue can use by default, it's a useful
optimisation to increase the queue size, in order to decrease the number
of times git add is run.
2012-02-15 11:14:19 -04:00
Joey Hess
7ebd98d8d8 fix memory leak when staging the journal
The list of files had to be retained until the end so it could be deleted.
Also, a list of update-index lines was generated and only then fed into it.
Now everything streams in constant space.
2012-02-14 14:37:59 -04:00
Joey Hess
a40ec5e03e Fixed a memory leak due to excessive strictness when committing journal files.
When hashing the files, the entire list of shas was read strictly.
That was entirely unnecessary, since there's a cleanup action run
after they're consumed.
2012-02-14 11:20:34 -04:00
Joey Hess
8f76d66f32 set fileEncoding on CheckAttr handles
Seemed to work without it, but this is correct.
2012-02-14 04:31:39 -04:00
Joey Hess
a2f241d503 fix LsFiles.typeChanged paths
Passing absolute paths to Command.Add used to work, but after recent
changes doesn't. All LsFiles should use relative paths anyway, so fix it
there.
2012-02-14 00:22:42 -04:00
Joey Hess
cbaebf538a rework git check-attr interface
Now gitattributes are looked up, efficiently, in only the places that
really need them, using the same approach used for cat-file.

The old CheckAttr code seemed very fragile, in the way it streamed files
through git check-attr.
I actually found that cad8824852
was still deadlocking with ghc 7.4, at the end of adding a lot of files.
This should fix that problem, and avoid future ones.

The best part is that this removes withAttrFilesInGit and withNumCopies,
which were complicated Seek methods, as well as simplfying the types
for several other Seek methods that had a Backend tupled in.
2012-02-13 23:52:21 -04:00
Joey Hess
d35a8d85b5 another place hGetBoth was used without a writer thread 2012-02-13 20:23:45 -04:00
Joey Hess
cad8824852 thinko
I removed the now unnecessary forkProcess, but forgot to change back to
pipeBoth, so there was no writer thread.
2012-02-13 20:01:37 -04:00
Joey Hess
3ac2677e00 comment typo 2012-02-13 16:58:26 -04:00
Joey Hess
e4d0923544 wording 2012-02-09 17:35:36 -04:00
Joey Hess
dc682e53a2 use fileEncoding for git-update-index input handle 2012-02-04 13:03:33 -04:00
Joey Hess
586be39952 fix file encoding of HashObject 2012-02-04 13:01:00 -04:00
Joey Hess
d8fb97806c support all filename encodings with ghc 7.4
Under ghc 7.4, this seems to be able to handle all filename encodings
again. Including filename encodings that do not match the LANG setting.
I think this will not work with earlier versions of ghc, it uses some ghc
internals.

Turns out that ghc 7.4 has a special filesystem encoding that it uses when
reading/writing filenames (as FilePaths). This encoding is documented
to allow  "arbitrary undecodable bytes to be round-tripped through it".

So, to get FilePaths from eg, git ls-files, set the Handle that is reading
from git to use this encoding. Then things basically just work.

However, I have not found a way to make Text read using this encoding.
Text really does assume unicode. So I had to switch back to using String
when reading/writing data to git. Which is a pity, because it's some
percent slower, but at least it works.

Note that stdout and stderr also have to be set to this encoding, or
printing out filenames that contain undecodable bytes causes a crash.
IMHO this is a misfeature in ghc, that the user can pass you a filename,
which you can readFile, etc, but that default, putStr of filename may
cause a crash!

Git.CheckAttr gave me special trouble, because the filenames I got back
from git, after feeding them in, had further encoding breakage.
Rather than try to deal with that, I just zip up the input filenames
with the attributes. Which must be returned in the same order queried
for this to work.

Also of note is an apparent GHC bug I worked around in Git.CheckAttr. It
used to forkProcess and feed git from the child process.  Unfortunatly,
after this forkProcess, accessing the `files` variable from the parent
returns []. Not the value that was passed into the function. This screams
of a bad bug, that's clobbering a variable, but for now I just avoid
forkProcess there to work around it. That forkProcess was itself only added
because of a ghc bug, #624389. I've confirmed that the test case for that
bug doesn't reproduce it with ghc 7.4. So that's ok, except for the new ghc
bug I have not isolated and reported. Why does this simple bit of code
magnet the ghc bugs? :)

Also, the symlink touching code is currently broken, when used on utf-8
filenames in a non-utf-8 locale, or probably on any filename containing
undecodable bytes, and I temporarily commented it out.
2012-02-03 16:23:20 -04:00
Joey Hess
3d49258e5b attempt at a quick, utf-8 only fix to the ghc 7.4 problem
If you have only utf-8 filenames, and need to build git-annex with ghc 7.4,
this will work. But, it will crash on non-utf-8 filenames.
2012-02-01 16:16:08 -04:00
Joey Hess
a964012fc3 switch to the strict state monad
I had not realized what a memory leak the lazy state monad could be,
although I have not seen much evidence of actual leaking in git-annex.
However, if running git-annex on a great many files, this could matter.

The additional Utility.State.changeState adds even more strictness,
avoiding a problem I saw in github-backup where repeatedly modifying
state built up a huge pile of thunks.
2012-01-29 22:55:06 -04:00
Joey Hess
97209ac08d fix error message 2012-01-25 20:43:01 -04:00
Joey Hess
3ca7cf5db1 export fromPath
Not used in git-annex, but I am using it in git-backup
2012-01-25 20:42:05 -04:00
Joey Hess
ce5637498f remove Utility.Conditional and use IfElse
This drops the >>! and >>? with the nice low fixity. IfElse does have
undocumented >>=>>! and >>=>>? operators, but I deem that too fishy.
Anyway, using whenM and unlessM is easier; I sometimes mixed the operators
up.
2012-01-24 16:22:07 -04:00
Joey Hess
ba6088b249 rename readMaybe to readish
a stricter (but also partial) readMaybe is getting added to base
2012-01-23 17:00:10 -04:00
Joey Hess
8c87293b48 avoid unnecessary stats when traversing to parent 2012-01-14 11:48:10 -04:00
Joey Hess
92a4af8b20 avoid unnecessary chdir 2012-01-14 11:42:51 -04:00
Joey Hess
1f66af2b53 optimize away 3 stats 2012-01-14 11:28:49 -04:00
Joey Hess
ff5703ce77 tweak 2012-01-13 21:06:00 -04:00
Joey Hess
66aac77467 support relative GIT_DIR 2012-01-13 14:40:36 -04:00
Joey Hess
1ae780ee79 git-annex, git-union-merge: Support GIT_DIR and GIT_WORK_TREE.
Note that GIT_WORK_TREE cannot influence GIT_DIR; that is necessary for
git-fake-bare and vcsh type things to work.
2012-01-13 12:52:09 -04:00
Joey Hess
0d5c402210 Add annex-trustlevel configuration settings, which can be used to override the trust level of a remote.
This overrides the trust.log, and is overridden by the command-line trust
parameters.

It would have been nicer to have Logs.Trust.trustMap just look up the
configuration for all remotes, but a dependency loop prevented that
(Remotes depends on Logs.Trust in several ways). So instead, look up
the configuration when building remotes, storing it in the same forcetrust
field used for the command-line trust parameters.
2012-01-09 23:31:44 -04:00
Joey Hess
9fb5f3edc7 log --after=date 2012-01-06 17:24:03 -04:00
Joey Hess
0b27e6baa0 Support unescaped repository urls, like git does.
Turns out that git will accept a .git/config containing an url with eg,
spaces in its name. Handle this by escaping the url if it's not valid.

This also fixes support for urls containing escaped characters like %20
for space. Before, the path from the url was not unescaped properly.
2012-01-05 14:32:20 -04:00
Joey Hess
f0957426c5 skip local remotes that are not available (ie, not mounted)
With --fast, unavailable local remotes are filtered out of the fast set.
This way, if there are local remotes, --fast always acts only on them,
and if none are mounted, acts on nothing. This consistency is better
than --fast acting on different remotes depending on what's mounted.
2011-12-31 04:50:39 -04:00
Joey Hess
a2ec2d3760 refactor and check for a detached HEAD 2011-12-31 03:38:58 -04:00
Joey Hess
52104dae6f refactor 2011-12-30 18:36:40 -04:00
Joey Hess
26040d6419 add base, under
The describe function was only intended to generate a human-visible
description of a branch, but taking the base of a branch is a useful
operation to be able to do no matter the human-visible representation.

Converting a branch like refs/heads/master to refs/heads/origin/master
is also a useful operation, and under can do that.
2011-12-30 16:48:26 -04:00
Joey Hess
5287d1dc3f fixed behavior when multiple insteadOf configs are provided for the same url base
Consider this git config --list case:

url.git+ssh://git@example.com/.insteadOf=gl
url.git+ssh://git@example.com/.insteadOf=shared

Since config is stored in a Map, only the last of the values for this key
was stored and available for use by the insteadOf code. But that
is wrong; git allows either "gl" or "shared" to be used in an url and
the insteadOf value to be substituted in.

To support this, it seems best to keep the existing config map as-is,
and add a second map that accumulates a list of multiple values for
config keys. This new fullconfig map can be used in the rare places where
multiple values for a key make sense, without needing to complicate
everything else.

Haskell's laziness and data sharing keep the overhead of adding
this second map low.
2011-12-30 14:07:46 -04:00
Joey Hess
cba3ce08df handle C-style escapes in Format
I was happily able to repurpose some code from Git.Filename to handle this.

I remember writing that code... a whole afternoon at a coffee shop, after
which I felt I'd struggled with Haskell and git, and sorta lost, in needing
to write this nasty peice of code. But was also pleased at the use of a
pair of functions and quickcheck that allowed me to get it 100% right.
So, turns out I not only got it right, but the code wasn't as special-purpose
as I'd feared. Yay!
2011-12-23 01:05:16 -04:00
Joey Hess
5a275a3f5d Can now be built with older git versions (before 1.7.7); the resulting binary should only be used with old git.
Remove git old version check from configure, and use the git version
it was built against in the git check-attr code.
2011-12-22 15:01:13 -04:00
Joey Hess
6bffe509d7 Add --include, which is the same as --not --exclude. 2011-12-22 14:00:17 -04:00
Joey Hess
ee3b5b2a42 use Common in a few more modules 2011-12-20 14:37:53 -04:00
Joey Hess
95d2391f58 more partial function removal
Left a few Prelude.head's in where it was checked not null and too hard to
remove, etc.
2011-12-15 18:19:36 -04:00
Joey Hess
fbc3d32f7d avoid partial function, and parse git-ref output better
It's possible that a ref name might contain a space, this properly
preserves the space.
2011-12-15 16:58:04 -04:00
Joey Hess
eb132a854e avoid partial head function
(although it was used safely)
2011-12-15 16:04:08 -04:00
Joey Hess
111b6937ec avoid partial functions, and added check for correct sha content 2011-12-15 15:57:47 -04:00
Joey Hess
a8643ca44c refactor 2011-12-15 13:05:47 -04:00
Joey Hess
09cd042775 Properly handle multiline git config values.
A crash on parsing was fixed a while ago. This adds support for fully
correctly parsing multiline git config values, using git config --null.

Since git-annex-shell configlist uses normal git config output, I left in
support for that too; the two forms of config output can be easily
identified by the parser. Since configlist only prints the annex.uuid
config, there's no risk of multiline values there, so no need to change it.
2011-12-15 12:48:27 -04:00
Joey Hess
ef28b3fef7 split out Git/Command.hs 2011-12-14 15:56:11 -04:00
Joey Hess
02f1bd2bf4 split more stuff out of Git.hs 2011-12-14 15:43:13 -04:00
Joey Hess
9db8ec210f split out two more Git modules 2011-12-13 15:24:23 -04:00
Joey Hess
25b2cc4148 move commit to Git.Branch 2011-12-13 15:08:44 -04:00
Joey Hess
13fff71f20 split out three modules from Git
Constructors and configuration make sense in separate modules.
A separate Git.Types is needed to avoid cycles.
2011-12-13 15:06:49 -04:00
Joey Hess
46588674b0 avoid closing pipe before all the shas are read from it
Could have just used hGetContentsStrict here, but that would require
storing all the shas in memory. Since this is called at the end of a
git-annex run, it may have created a *lot* of shas, so I avoid that memory
use and stream them out like before.
2011-12-12 21:41:37 -04:00
Joey Hess
0e45b762a0 broke out Git/HashObject.hs 2011-12-12 21:24:55 -04:00
Joey Hess
31a0c07ee9 broke out Git/Branch.hs and reorganized 2011-12-12 21:12:51 -04:00
Joey Hess
543d0d2501 split out Git/Ref.hs 2011-12-12 18:30:33 -04:00
Joey Hess
acd7a52dfd always find optimal merge
Testing b9ac585454, it didn't find the
optimal union merge, the second sha was the one to use, at least in
the case I tried. Let's just try all shas to see if any can be reused.

I stopped using the expensive nub, so despite the use of sets to
sort/uniq file contents, this is probably as fast or faster than it
was before.
2011-12-12 01:59:29 -04:00
Joey Hess
0cbab5de65 refactor 2011-12-12 00:48:25 -04:00
Joey Hess
b9ac585454 more efficient union merges
Tries to avoid generating a new object when the merged content has the same
lines that were in the old object.

I've noticed some merge commits that only move lines around, like this:

- 1323478057.181191s 1 be23c3ac-0ee5-11e0-b185-3b0f9b5b00c5
  1323204972.062151s 1 87e06c7a-7388-11e0-ba07-03cdf300bd87
++1323478057.181191s 1 be23c3ac-0ee5-11e0-b185-3b0f9b5b00c5

Unsure if this will really save anything in practice, since it only looks
at one of the two old objects, and maybe I didn't pick the best one.
2011-12-11 23:02:25 -04:00
Joey Hess
d64132a43a hslint 2011-12-09 01:57:13 -04:00
Joey Hess
9290095fc2 improve type signatures with a Ref newtype
In git, a Ref can be a Sha, or a Branch, or a Tag. I added type aliases for
those. Note that this does not prevent mixing up of eg, refs and branches
at the type level. Since git really doesn't care, except rare cases like
git update-ref, or git tag -d, that seems ok for now.

There's also a tree-ish, but let's just use Ref for it. A given Sha or Ref
may or may not be a tree-ish, depending on the object type, so there seems
no point in trying to represent it at the type level.
2011-11-16 02:41:46 -04:00
Joey Hess
272a67921c better name 2011-11-16 01:46:46 -04:00
Joey Hess
e83b966eb5 cleanup 2011-11-15 23:51:24 -04:00
Joey Hess
21a925dcf1 merge: Now runs in constant space.
Before, a merge was first calculated, by running various actions that
called git and built up a list of lines, which were at the end sent
to git update-index. This necessarily used space proportional to the size
of the diff between the trees being merged.

Now, lines are streamed into git update-index from each of the actions in
turn.

Runtime size of git-annex merge when merging 50000 location log files
drops from around 100 mb to a constant 4 mb.

Presumably it runs quite a lot faster, too.
2011-11-15 23:28:01 -04:00
Joey Hess
922e9af528 cleanup 2011-11-15 22:40:40 -04:00
Joey Hess
b76dc2d210 avoid space leak writing merge
This reduces the memory use of a merge by 1/3rd. The space leak was
apparently because the whole update-index input was generated strictly, not
lazily.

I wondered if the change to ByteStrings contributed to this, due to the
need to convert with L.pack here. But going back to the old code, I still
see a much similar leak, and worse performance besides due to it not using
ByteStrings.

The fix is to just hPutStr the lines repeatedly. (Note the \0 is written
separately, to avoid allocation overheads in adding it to the string.)
The Git.pipeWrite interface is probably just wrong for any large inputs to
git. This was the only place using it for input of any size.

There is still at least one other space leak in the merge code.
2011-11-15 22:19:12 -04:00
Joey Hess
04edae6791 Optimised union merging; now only runs git cat-file once. 2011-11-12 17:45:12 -04:00
Joey Hess
637b5feb45 lint 2011-11-11 01:52:58 -04:00
Joey Hess
bf460a0a98 reorder repo parameters last
Many functions took the repo as their first parameter. Changing it
consistently to be the last parameter allows doing some useful things with
currying, that reduce boilerplate.

In particular, g <- gitRepo is almost never needed now, instead
use inRepo to run an IO action in the repo, and fromRepo to get
a value from the repo.

This also provides more opportunities to use monadic and applicative
combinators.
2011-11-08 16:27:20 -04:00
Joey Hess
3acdba3995 faster union merge of multiple branches into index
only write index once
2011-10-07 13:36:48 -04:00
Joey Hess
7ff89ccfee convert all git read/write functions to use ByteStrings
This yields a second or so speedup in unused, find, etc. Seems that even
when the ByteString is immediately split and then converted to Strings,
it's faster.

I may try to push ByteStrings out into more of git-annex gradually,
although I suspect most of the time-critical parts are already covered
now, and many of the rest rely on libraries that only support Strings.
2011-09-29 23:48:57 -04:00
Joey Hess
949ef94d5e layout 2011-09-29 22:31:20 -04:00
Joey Hess
67f2b7cb3e use ByteStrings when reading content of files
didn't bother to benchmark this
2011-09-29 19:19:28 -04:00
Joey Hess
a91c8a15d5 Sped up unused.
Added Git.ByteString which replaces Git IO methods with ones using lazy
ByteStrings. This can be more efficient when large quantities of data are
being read from git.

In Git.LsTree, parse git ls-tree output more efficiently, thanks
to ByteString. This benchmarks 25% faster, in a benchmark that includes
(probably predominately) the run time for git ls-tree itself.

In real world numbers, this makes git annex unused 2 seconds faster for
each branch it needs to check, in my usual large repo.
2011-09-29 19:04:24 -04:00
Joey Hess
297bc648b9 make unused check branches and tags too
needs time and space optimisation
2011-09-28 16:43:10 -04:00
Joey Hess
ad245a6375 refactor catfile code
split into generic IO code, and a thin Annex wrapper
2011-09-28 15:17:36 -04:00
Joey Hess
a3cb5c47e5 use FileMode 2011-09-28 14:14:52 -04:00
Joey Hess
93807564d0 add ls-tree interface
This parser should be fast. I hope.
2011-09-28 14:03:59 -04:00
Joey Hess
7724f895a8 tweak 2011-09-25 14:37:13 -04:00
Joey Hess
203148363f split groups of related functions out of Utility 2011-08-22 16:14:12 -04:00
Joey Hess
e784757376 hlint tweaks
Did all sources except Remotes/* and Command/*
2011-07-15 03:12:05 -04:00
Joey Hess
ded2591124 unannex: Clean up use of git commit -a.
This was more complex than would be expected. unannex has to use git commit -a
since it's removing files from git; git commit filelist won't do.

Allow commands to be added to the Git queue that have no associated files,
and run such commands once.
2011-07-14 17:15:37 -04:00
Joey Hess
896726cde4 rename GitUnionMerge to Git.UnionMerge
Also, moved commit function into Git proper, it's not union merge specific.
2011-06-30 13:32:47 -04:00
Joey Hess
f0497312a7 rename GitQueue to Git.Queue 2011-06-30 13:25:37 -04:00
Joey Hess
f6063a094e renamed GitRepo to Git
It was always imported qualified as Git anyway
2011-06-30 13:21:39 -04:00