Commit graph

52 commits

Author SHA1 Message Date
Joey Hess
067aabdd48
wip RawFilePath 2x git-annex find speedup
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.

Benchmarking vs master, this git-annex find is significantly faster!
Specifically:

	num files	old	new	speedup
	48500		4.77	3.73	28%
	12500		1.36	1.02	66%
	20		0.075	0.074	0% (so startup time is unchanged)

That's without really finishing the optimization. Things still to do:

* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
  decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.

It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
2019-11-26 16:01:58 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
5d98cba923
use ByteStrings when reading annex symlinks and pointers
Now there's a ByteString used all the way from disk to Key.

The main complication in this conversion was the use of fromInternalGitPath
in several places to munge things on Windows. The things that used that
were changed to parse the ByteString using either path separator.

Also some code that had read from files to a String lazily was changed
to read a minimal strict ByteString.
2019-01-14 15:37:08 -04:00
Joey Hess
7d51b0c109
import Utility.FileSystemEncoding in Common 2019-01-03 11:37:02 -04:00
Joey Hess
b3c69eaaf8
strict bytestring encoders and decoders
Only had lazy ones before.

Already sped up a few parts of the code.
2019-01-01 14:55:15 -04:00
Joey Hess
4a788fbb3b
sync --content now supports --hide-missing adjusted branches
This relies on git ls-files --with-tree, which I'm using in a way that
its man page does not document. Hm. I emailed the git list to try to get
the docs improved, but at least the git test suite does test the same
kind of use case I'm using here.

Performance impact when not in an adjusted branch is limited to some
additional MVar accesses, and a single git call to determine the name of
the current branch. So very minimal.

When in an adjusted branch, the performance impact is
in Annex.WorkTree.lookupFile, which starts doing an equal amount of work
for files that didn't exist as it already did for files that were
unlocked.

This commit was sponsored by Jochen Bartl on Patreon.
2018-10-19 17:51:25 -04:00
Joey Hess
10138056dc
v6: avoid accidental conversion when annex.largefiles is not configured
v6: When annex.largefiles is not configured for a file, running git add or
git commit, or otherwise using git to stage a file will add it to the annex
if the file was in the annex before, and to git otherwise. This is to avoid
accidental conversion.

Note that git-annex add's behavior has not changed, for reasons explained
in the added comment.

Performance: No added overhead when annex.largefiles is configured.
When not configured, there is an added call to catObjectMetaData,
which involves a round trip through git cat-file --batch.
However, the earlier catKeyFile primes the cache for it.

This commit was supported by the NSF-funded DataLad project.
2018-08-27 14:51:10 -04:00
Joey Hess
8484c0c197
Always use filesystem encoding for all file and handle reads and writes.
This is a big scary change. I have convinced myself it should be safe. I
hope!
2016-12-24 14:46:31 -04:00
Joey Hess
34530e59d9
Avoid using a lot of memory when large objects are present in the git repository
.. and have to be checked to see if they are a pointed to an annexed file.

Cases where such memory use could occur included, but were not limited to:
  - git commit -a of a large unlocked file (in v5 mode)
  - git-annex adjust when a large file was checked into git directly
Generally, any use of catKey was a potential problem.

Fix by using git cat-file --batch-check to check size before catting.
This adds another git batch process, which is included in the CatFileHandle
for simplicity.

There could be performance impact, anywhere catKey is used. Particularly
likely to affect adjusted branch generation speed, and operations on
unlocked files in v6 mode. Hopefully since the --batch-check and
--batch read the same data, disk buffering will avoid most overhead.
Leaving only the overhead of talking to the process over the pipe and
whatever computation --batch-check needs to do.

This commit was sponsored by Bruno BEAUFILS on Patreon.
2016-10-05 15:24:13 -04:00
Joey Hess
1cd02762bf
Optimisations to git-annex branch query and setting, avoiding repeated copies of the environment.
Speeds up commands like  "git-annex find --in remote" by over 50%.

Profiling showed that adjustGitEnv was 21% of the time and 37% of the
allocations of that command. It copied the environment each time with
getEnvironment.

The only repeated use of adjustGitEnv is in withIndexFile, which tends to
be run at least once per file. So, it was optimised by keeping a cache of
the environment, which can be reused.

There could be other better ways to optimise this. Maybe get the while
environment once at startup. But, then it would have to be serialized back
out each time running a child process, so I doubt that would be a net win.

It might be better to cache a version of the environment that is
pre-modified to use .git-annex/index. But, profiling doesn't show that
modifying the enviroment is taking any significant time.
2016-09-29 13:36:48 -04:00
Joey Hess
e91037a38b
use indexEnv 2016-05-17 13:38:04 -04:00
Joey Hess
04d4830ac3
add catCommit 2016-02-25 15:34:46 -04:00
Joey Hess
737e45156e
remove 163 lines of code without changing anything except imports 2016-01-20 16:36:33 -04:00
Joey Hess
f324ad24c1
improve comment 2015-12-26 13:47:36 -04:00
Joey Hess
78a6b8ce05
refactor and improve pointer file handling code 2015-12-09 14:27:43 -04:00
Joey Hess
712c9fc590
require "annex/objects/" before key in pointer files
This removes ambiguity, because while someone might have "WORM--foo" in a
file that's not intended to be a git-annex pointer file,
"annex/objects/WORM--foo" is less likely.

Also, 664cc987e8 had a caveat about symlink
targets being parsed as pointer files, and now the same parser is used for
both.

I did not include any hash directories before the key in the pointer file,
as they're not needed. However, if they were included, the parser would
still work ok.
2015-12-07 15:45:08 -04:00
Joey Hess
664cc987e8
support pointer files
Backend.lookupFile is changed to always fall back to catKey when
operating on a file that's not a symlink.

catKey is changed to understand pointer files, as well as annex symlinks.

Before, catKey needed a file mode witness, to be sure it was looking at a
symlink. That was complicated stuff. Now, it doesn't actually care if a
file in git is a symlink or not; in either case asking git for the content
of the file will get the pointer to the key.

This does mean that git-annex will treat a link
foo -> WORM--bar as a git-annex file, and also treats
a regular file containing annex/objects/WORM--bar as a git-annex file.

Calling catKey could make git-annex commands need to do more work than
before. This would especially be the case if a repo contained many regular
files, and only a few annexed files, as now git-annex will need to ask
git about the contents of the regular files.
2015-12-07 15:35:36 -04:00
Joey Hess
afc5153157 update my email address and homepage url 2015-01-21 12:50:09 -04:00
Joey Hess
7b50b3c057 fix some mixed space+tab indentation
This fixes all instances of " \t" in the code base. Most common case
seems to be after a "where" line; probably vim copied the two space layout
of that line.

Done as a background task while listening to episode 2 of the Type Theory
podcast.
2014-10-09 15:09:11 -04:00
Joey Hess
96dc423e39 When accessing a local remote, shut down git-cat-file processes afterwards, to ensure that remotes on removable media can be unmounted. Closes: #758630
This does mean that eg, copying multiple files to a local remote will
become slightly slower, since it now restarts git-cat-file after each copy.
Should not be significant slowdown.

The reason git-cat-file is run on the remote at all is to update its
location log. In order to add an item to it, it needs to get the current
content of the log. Finding a way to avoid needing to do that would be a
good path to avoiding this slowdown if it does become a problem somehow.

This commit was sponsored by Evan Deaubl.
2014-08-20 12:07:57 -04:00
Joey Hess
ba42b67c70 Fix bug in automatic merge conflict resolution
When one side is an annexed symlink, and the other side is a non-annexed symlink.

In this case, git-merge does not replace the annexed symlink in the work
tree with the non-annexed symlink, which is different from it's handling of
conflicts between annexed symlinks and regular files or directories.
So, while git-annex generated the correct merge commit, the work tree
didn't get updated to reflect it.
See comments on bug for additional analysis.

Did not add this to the test suite yet; just unloaded a truckload of firewood
and am feeling lazy.

This commit was sponsored by Adam Spiers.
2014-07-08 13:55:11 -04:00
Joey Hess
1052eeface Windows: Fix some filename encoding bugs.
http://git-annex.branchable.com/bugs/Unicode_file_names_ignored_on_Windows/

Not a complete fix yet.
2014-03-19 15:57:56 -04:00
Joey Hess
1192d98721 sync: Fix bug in direct mode that caused a file not checked into git to be deleted when merging with a remote that added a file by the same name. (Thanks, jkt) 2014-03-03 14:57:16 -04:00
Joey Hess
8d5158fa31 Preserve metadata when staging a new version of an annexed file.
Performance impact: When adding a large tree of new files, this needs
to do some git cat-file queries to check if any of the files already
existed and might need a metadata copy. I tried a benchmark in a copy
of my sound repository (so there was already a significant git tree
to check against.

Adding 10000 small files, with a cold cache:
  before: 1m48.539s
  after:  1m52.791s

So, impact is 0.0004 seconds per file added. Which seems acceptable, so did
not add some kind of configuration to enable/disable this.

This commit was sponsored by Lisa Feilen.
2014-02-24 14:41:33 -04:00
Joey Hess
7d5b25515c Add plumbing-level lookupkey command. 2013-12-15 14:02:23 -04:00
Joey Hess
59ecc804cd add new status command
This works for both direct and indirect mode.

It may need some performance tuning.

Note that unlike git status, it only shows the status of the work tree, not
the status of the index. So only one status letter, not two .. and since
files that have been added and not yet committed do not differ between the
work tree and the index, they are not shown. Might want to add display of
the index vs the last commit eventually.

This commit was sponsored by an unknown bitcoin contributor, whose
contribution as been going up lately! ;)
2013-11-07 14:07:25 -04:00
Joey Hess
4f871f89ba git-recover-repository 1/2 done 2013-10-20 17:50:51 -04:00
Joey Hess
3588729f0d completely solve catKey memory leak
Since 006cf7976f was incomplete, not being
able to get the right mode of the file when the index differs from HEAD,
this is a final workaround. Only buffering the start of the file
in this case avoids leaking memory.

This does not prevent git-cat-file being asked to output the whole file,
which needs to be consumed, and can be slow. But this only happens in a
rare edge case.
2013-09-19 20:09:03 -04:00
Joey Hess
006cf7976f more completely solve catKey memory leak
Done using a mode witness, which ensures it's fixed everywhere.

Fixing catFileKey was a bear, because git cat-file does not provide a
nice way to query for the mode of a file and there is no other efficient
way to do it. Oh, for libgit2..

Note that I am looking at tree objects from HEAD, rather than the index.
Because I cat-file cannot show a tree object for the index.
So this fix is technically incomplete. The only cases where it matters
are:

1. A new large file has been directly staged in git, but not committed.
2. A file that was committed to HEAD as a symlink has been staged
   directly in the index.

This could be fixed a lot better using libgit2.
2013-09-19 16:41:21 -04:00
Joey Hess
412dcb8017 Fix bug that caused typechanged symlinks to be assumed to be unlocked files, so they were added to the annex by the pre-commit hook. 2013-08-22 13:57:07 -04:00
Joey Hess
729eab1f89 assistant: Work around git-cat-file's not reloading the index after files are staged.
Argh.
2013-05-25 00:37:41 -04:00
Joey Hess
aba49995b6 Merge branch 'master' into windows 2013-05-15 19:18:04 -04:00
Joey Hess
c62b54d80d start one git-cat-file per index file
This reverts 1c83b6c439 and properly fixes
the issue discussed there.

This makes git-annex behave much nicer in direct mode.
2013-05-15 18:46:38 -04:00
Joey Hess
03e8594369 fix the day's windows permissions damage 2013-05-12 19:09:48 -04:00
Joey Hess
73d2f8b280 deal with git using / internally, even on DOS 2013-05-12 17:29:49 -05:00
Joey Hess
1c83b6c439 work around a very strange git-cat-file behavior
Sometimes it seems that git-cat-file --batch stops getting info for
files in the current repo, when ":file" is fed to it. I have not reproduced
this at the command line, but only when using git annex whereis and git
annex move inside a direct mode repo. Those failed, because cat-file
returned "file missing". OTOH, git annex find works fine, despite passing
the same file to cat-file. It seems that the failing commands first asked
cat-file to show a file on the git-annex branch. Perhaps it got "stuck" on
that branch? But I cannot repoduce it running cat-file by hand. Most
strange. HEAD is a workaround for this extreme weirdness, since I spent a
good 2 hours struggling with it already.
2013-01-05 17:06:24 -04:00
Joey Hess
1cdf2b923d assistant: Make expensive transfer scan work fully in direct mode.
The expensive scan uses lookupFile, but in direct mode, that doesn't work
for files that are present. So the scan was not finding things that are
present that need to be uploaded. (It did find things not present that
needed to be downloaded.)

Now lookupFile also works in direct mode. Note that it still prefers
symlinks on disk to info committed to git, in direct mode. This is
necessary to make things like Assistant.Threads.Watcher.onAddSymlink
work correctly, when given a new symlink not yet checked into git (or
replacing a file checked into git).
2013-01-05 15:57:53 -04:00
Joey Hess
eb40227d15 assistant direct mode file add/change bookkeeping
When a file is changed in direct mode, the old content is probably lost
(at least from the local repo), and bookeeping needs to be updated to
reflect this.

Also, synthetic add events are generated at assistant startup, so
make it detect when the file has not really changed, and avoid re-adding
it.

This does add the overhead of querying the runing git cat-file for the
key that's recorded in git for the file, each time a file is added or
modified in direct mode.
2012-12-25 15:48:15 -04:00
Joey Hess
b080a58b76 Merge branch 'master' into desymlink
Conflicts:
	Annex/CatFile.hs
	Annex/Content.hs
	Git/LsFiles.hs
	Git/LsTree.hs
2012-12-13 00:29:06 -04:00
Joey Hess
f87a781aa6 finished where indentation changes 2012-12-13 00:24:19 -04:00
Joey Hess
e7b8cb0063 direct mode committing 2012-12-12 19:20:38 -04:00
Joey Hess
ca45cea113 Revert "add catFileIndex"
This interface is not a good idea, because a running git cat-file --batch
does not notice when existing files in the index are changed.
2012-09-15 18:30:53 -04:00
Joey Hess
e1baf48d88 add catFileIndex 2012-09-15 17:06:10 -04:00
Joey Hess
75b6ee81f9 avoid ByteString.Char8 where not needed
Its truncation behavior is a red flag, so avoid using it in these places
where only raw ByteStrings are used, without looking at the data inside.
2012-06-20 13:13:40 -04:00
Joey Hess
ca9ee21bd7 crazy optimisation
Crazy like a fox..
2012-06-10 19:58:34 -04:00
Joey Hess
c4c965d602 detect and recover from branch push/commit race
Dealing with a race without using locking is exceedingly difficult and tricky.
Fully tested, I hope.

There are three places left where the branch can be updated, that are not
covered by the race recovery code. Let's prove they're all immune to the
race:

1. tryFastForwardTo checks to see if a fast-forward can be done,
   and then does git-update-ref on the branch to fast-forward it.

   If a push comes in before the check, then either no fast-forward
   will be done (ok), or the push set the branch to a ref that can
   still be fast-forwarded (also ok)

   If a push comes in after the check, the git-update-ref will
   undo the ref change made by the push. It's as if the push did not come
   in, and the next git-push will see this, and try to re-do it.
   (acceptable)

2. When creating the branch for the very first time, an empty index
   is created, and a commit of it made to the branch. The commit's ref
   is recorded as the current state of the index. If a push came in
   during that, it will be noticed the next time a commit is made to the
   branch, since the branch will have changed. (ok)

3. Creating the branch from an existing remote branch involves making
   the branch, and then getting its ref, and recording that the index
   reflects that ref.

   If a push creates the branch first, git-branch will fail (ok).

   If the branch is created and a racing push is then able to change it
   (highly unlikely!) we're still ok, because it first records the ref into
   the index.lck, and then updating the index. The race can cause the
   index.lck to have the old branch ref, while the index has the newly pushed
   branch merged into it, but that only results in an unnecessary update of
   the index file later on.
2011-12-11 20:41:35 -04:00
Joey Hess
9290095fc2 improve type signatures with a Ref newtype
In git, a Ref can be a Sha, or a Branch, or a Tag. I added type aliases for
those. Note that this does not prevent mixing up of eg, refs and branches
at the type level. Since git really doesn't care, except rare cases like
git update-ref, or git tag -d, that seems ok for now.

There's also a tree-ish, but let's just use Ref for it. A given Sha or Ref
may or may not be a tree-ish, depending on the object type, so there seems
no point in trying to represent it at the type level.
2011-11-16 02:41:46 -04:00
Joey Hess
04edae6791 Optimised union merging; now only runs git cat-file once. 2011-11-12 17:45:12 -04:00
Joey Hess
637b5feb45 lint 2011-11-11 01:52:58 -04:00
Joey Hess
bf460a0a98 reorder repo parameters last
Many functions took the repo as their first parameter. Changing it
consistently to be the last parameter allows doing some useful things with
currying, that reduce boilerplate.

In particular, g <- gitRepo is almost never needed now, instead
use inRepo to run an IO action in the repo, and fromRepo to get
a value from the repo.

This also provides more opportunities to use monadic and applicative
combinators.
2011-11-08 16:27:20 -04:00