The filename sent to git cat-file needs to be sent on a File encoded handle.
Also set the read handle to use the File encoding, so that any error
message mentioning the filename is received properly.
The actual file content is read using Data.ByteString.Char8, which
will ignore the read handle's encoding, so this won't change that.
(Whether that is entirely correct remains to be seen.)
Increasing the queue size is useful when adding hundreds of thousands of
files on a system with plenty of memory.
git add gets quite slow in such a large repository, so if the system has
more than the ~32 MB of memory the queue can use by default, it's a useful
optimisation to increase the queue size, in order to decrease the number
of times git add is run.
Previously, the list of files had to be retained until the end so the files
could be deleted. Also, the whole list of update-index lines was generated
and only then fed into git update-index.
Now everything streams in constant space.
When hashing the files, the entire list of shas was read strictly.
That was entirely unnecessary, since there's a cleanup action run
after they're consumed.
Now gitattributes are looked up, efficiently, in only the places that
really need them, using the same approach used for cat-file.
The old CheckAttr code seemed very fragile, in the way it streamed files
through git check-attr.
I actually found that cad8824852 was still deadlocking with ghc 7.4, at the
end of adding a lot of files.
This should fix that problem, and avoid future ones.
The best part is that this removes withAttrFilesInGit and withNumCopies,
which were complicated Seek methods, as well as simplifying the types
for several other Seek methods that had a Backend tupled in.
Under ghc 7.4, this seems to be able to handle all filename encodings
again, including filename encodings that do not match the LANG setting.
I think this will not work with earlier versions of ghc, since it uses some
ghc internals.
Turns out that ghc 7.4 has a special filesystem encoding that it uses when
reading/writing filenames (as FilePaths). This encoding is documented
to allow "arbitrary undecodable bytes to be round-tripped through it".
So, to get FilePaths from, e.g., git ls-files, set the Handle that is reading
from git to use this encoding. Then things basically just work.
However, I have not found a way to make Text read using this encoding.
Text really does assume unicode. So I had to switch back to using String
when reading/writing data to git. That is a pity, because it's somewhat
slower, but at least it works.
Note that stdout and stderr also have to be set to this encoding, or
printing out filenames that contain undecodable bytes causes a crash.
IMHO this is a misfeature in ghc: the user can pass you a filename,
which you can readFile, etc, but by default a putStr of that filename may
cause a crash!
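
A rough sketch of the handle setup described above, assuming base 4.5 or
later (ghc 7.4), where GHC.IO.Encoding exports getFileSystemEncoding. The
helper names useFileSystemEncoding and setConsoleEncoding are made up for
illustration and are not claimed to match the real code:

```haskell
import GHC.IO.Encoding (getFileSystemEncoding)
import System.IO

-- Hypothetical helper: make a handle decode/encode text the same way ghc
-- does for FilePaths, so undecodable bytes survive a round trip.
useFileSystemEncoding :: Handle -> IO ()
useFileSystemEncoding h = hSetEncoding h =<< getFileSystemEncoding

-- Hypothetical helper: apply the same encoding to stdout and stderr, so
-- that printing such a filename does not raise an encoding error.
setConsoleEncoding :: IO ()
setConsoleEncoding = mapM_ useFileSystemEncoding [stdout, stderr]
```

The same useFileSystemEncoding call would be applied to the handle reading
from git before any filenames are read from it.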
Git.CheckAttr gave me special trouble, because the filenames I got back
from git, after feeding them in, had further encoding breakage.
Rather than try to deal with that, I just zip up the input filenames
with the attributes. For this to work, the attributes must be returned in
the same order the files were queried.
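
The zipping idea can be sketched roughly like this. It assumes one attribute
is queried per file, so there is exactly one "path: attribute: value" output
line per input filename; attrValue and pairAttrs are illustrative names, not
the actual Git.CheckAttr functions:

```haskell
import Data.List (isPrefixOf, tails)

-- Take the text after the last ": " in a line of check-attr style output,
-- ignoring the (possibly mangled) path that git echoed back.
attrValue :: String -> String
attrValue l = case [drop 2 t | t <- tails l, ": " `isPrefixOf` t] of
    [] -> l
    vs -> last vs

-- Pair each filename that was fed in with the value from the output line
-- at the same position. Only correct if the output order matches the
-- input order.
pairAttrs :: [FilePath] -> [String] -> [(FilePath, String)]
pairAttrs files outputLines = zip files (map attrValue outputLines)
```

Since the echoed path is never parsed, any encoding breakage in it does not
matter; only the ordering assumption does.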
Also of note is an apparent GHC bug I worked around in Git.CheckAttr. It
used to forkProcess and feed git from the child process. Unfortunately,
after this forkProcess, accessing the `files` variable from the parent
returns []. Not the value that was passed into the function. This screams
of a bad bug, that's clobbering a variable, but for now I just avoid
forkProcess there to work around it. That forkProcess was itself only added
because of a ghc bug, #624389. I've confirmed that the test case for that
bug doesn't reproduce it with ghc 7.4. So that's ok, except for the new ghc
bug I have not isolated and reported. Why is this simple bit of code such
a magnet for ghc bugs? :)
Also, the symlink touching code is currently broken, when used on utf-8
filenames in a non-utf-8 locale, or probably on any filename containing
undecodable bytes, and I temporarily commented it out.
I had not realized what a memory leak the lazy state monad could be,
although I have not seen much evidence of actual leaking in git-annex.
However, if running git-annex on a great many files, this could matter.
The additional Utility.State.changeState adds even more strictness,
avoiding a problem I saw in github-backup where repeatedly modifying
state built up a huge pile of thunks.
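
A strict state update of this kind might look something like the following.
This is only a sketch written against Control.Monad.Trans.State.Strict from
the transformers package, not a copy of Utility.State; countTo is a made-up
example:

```haskell
import Control.Monad.Trans.State.Strict (State, execState, get, put)

-- Strict state update: the new state is forced (to weak head normal form)
-- before it is stored, so repeated updates cannot pile up a chain of thunks.
changeState :: (s -> s) -> State s ()
changeState f = do
    x <- get
    put $! f x

-- Made-up example: bump a counter n times without building n nested thunks.
countTo :: Int -> Int
countTo n = execState (mapM_ (const (changeState (+ 1))) [1 .. n]) 0
```

Note that $! only forces the outermost constructor. If the state is a record,
its fields can still hold thunks unless they are strict too.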
This drops the >>! and >>? with the nice low fixity. IfElse does have
undocumented >>=>>! and >>=>>? operators, but I deem that too fishy.
Anyway, using whenM and unlessM is easier; I sometimes mixed the operators
up.
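
For reference, whenM and unlessM behave like the following local
definitions. They are spelled out here as an illustration so the sketch does
not depend on any package:

```haskell
import Control.Monad (unless, when)

-- Run the action only if the monadic condition yields True.
whenM :: Monad m => m Bool -> m () -> m ()
whenM c a = c >>= \b -> when b a

-- Run the action only if the monadic condition yields False.
unlessM :: Monad m => m Bool -> m () -> m ()
unlessM c a = c >>= \b -> unless b a
```

With named functions there is no fixity to remember, and the condition
always comes first.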
This overrides the trust.log, and is overridden by the command-line trust
parameters.
It would have been nicer to have Logs.Trust.trustMap just look up the
configuration for all remotes, but a dependency loop prevented that
(Remotes depends on Logs.Trust in several ways). So instead, look up
the configuration when building remotes, storing it in the same forcetrust
field used for the command-line trust parameters.
Turns out that git will accept a .git/config containing a url with, e.g.,
spaces in its name. Handle this by escaping the url if it's not valid.
This also fixes support for urls containing escaped characters like %20
for space. Before, the path from the url was not unescaped properly.
With --fast, unavailable local remotes are filtered out of the fast set.
This way, if there are local remotes, --fast always acts only on them,
and if none are mounted, acts on nothing. This consistency is better
than --fast acting on different remotes depending on what's mounted.
The describe function was only intended to generate a human-visible
description of a branch, but taking the base of a branch is a useful
operation to be able to do no matter the human-visible representation.
Converting a branch like refs/heads/master to refs/heads/origin/master
is also a useful operation, and the new under function can do that.
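
A simplified sketch of the two operations, using plain Strings instead of a
real ref type. Which prefixes base strips is an assumption made for this
example:

```haskell
import Data.List (stripPrefix)
import Data.Maybe (mapMaybe)

-- The base of a branch: its name without a leading refs/heads/ or
-- refs/remotes/ (assumed prefixes). Other refs are returned unchanged.
base :: String -> String
base ref = case mapMaybe (`stripPrefix` ref) ["refs/heads/", "refs/remotes/"] of
    (b : _) -> b
    [] -> ref

-- Re-root a branch under a directory, keeping only its base name.
under :: String -> String -> String
under dir ref = dir ++ "/" ++ base ref
```

For example, under "refs/heads/origin" "refs/heads/master" gives
refs/heads/origin/master.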
Consider this git config --list case:
url.git+ssh://git@example.com/.insteadOf=gl
url.git+ssh://git@example.com/.insteadOf=shared
Since config is stored in a Map, only the last of the values for this key
was stored and available for use by the insteadOf code. But that
is wrong; git allows either "gl" or "shared" to be used in an url and
the insteadOf value to be substituted in.
To support this, it seems best to keep the existing config map as-is,
and add a second map that accumulates a list of multiple values for
config keys. This new fullconfig map can be used in the rare places where
multiple values for a key make sense, without needing to complicate
everything else.
Haskell's laziness and data sharing keep the overhead of adding
this second map low.
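
The two maps can be sketched roughly as follows, using Data.Map from the
containers package. The function names here are illustrative and the parsing
is deliberately naive (split each line at the first '='):

```haskell
import qualified Data.Map as M

-- Split a "key=value" line at the first '='.
parseLine :: String -> (String, String)
parseLine l = let (k, v) = break (== '=') l in (k, drop 1 v)

-- One value per key; for duplicate keys the last value wins.
config :: [String] -> M.Map String String
config = M.fromList . map parseLine

-- Every value per key, in the order the lines appeared.
fullconfig :: [String] -> M.Map String [String]
fullconfig = M.fromListWith (flip (++)) . map (fmap (: []) . parseLine)
```

With the two url lines above, config holds only "shared" for that key, while
fullconfig holds ["gl", "shared"].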
I was happily able to repurpose some code from Git.Filename to handle this.
I remember writing that code... a whole afternoon at a coffee shop, after
which I felt I'd struggled with Haskell and git, and sorta lost, in needing
to write this nasty piece of code. But was also pleased at the use of a
pair of functions and quickcheck that allowed me to get it 100% right.
So, turns out I not only got it right, but the code wasn't as special-purpose
as I'd feared. Yay!
A crash on parsing was fixed a while ago. This adds support for fully
correctly parsing multiline git config values, using git config --null.
Since git-annex-shell configlist uses normal git config output, I left in
support for that too; the two forms of config output can be easily
identified by the parser. Since configlist only prints the annex.uuid
config, there's no risk of multiline values there, so no need to change it.
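
A parser accepting both forms could look something like this. It is a sketch
only: telling the formats apart by the presence of a NUL byte is an
assumption of this example, and splitOn is a small local helper. With
--null, git ends each entry with NUL and separates key from value with a
newline:

```haskell
-- Split a string on a separator character, dropping a trailing empty piece.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
    (piece, []) -> [piece | not (null piece)]
    (piece, _ : rest) -> piece : splitOn c rest

-- Parse either "key\nvalue\0..." (--null) or "key=value\n..." output.
parseConfig :: String -> [(String, String)]
parseConfig s
    | '\0' `elem` s = map (pair '\n') (splitOn '\0' s)
    | otherwise = map (pair '=') (lines s)
  where
    -- Only the first separator splits key from value, so a multiline
    -- value in the --null form is kept intact.
    pair sep entry = let (k, v) = break (== sep) entry in (k, drop 1 v)
```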