Commit graph

55 commits

Author SHA1 Message Date
Joey Hess
81d402216d cache the serialization of a Key
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.

Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:

Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.

The Eq instance is fixed.

Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.

Used git-annex benchmark:
  find is 7% faster
  whereis is 3% faster
  get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
2019-11-22 17:49:16 -04:00
Joey Hess
4a3f3a2cb5
make git add only annex when configured by annex.largefiles 2019-10-24 14:17:29 -04:00
Joey Hess
168f91efec
avoid warning over name 2019-10-24 11:46:40 -04:00
Joey Hess
bd197be3ad
annex.gitaddtoannex configuration
Added annex.gitaddtoannex configuration. Setting it to false prevents
git add from usually adding files to the annex.
(Unless the file was annexed before, or a renamed annexed file is detected.)

Currently left at true; some users are encouraging it be set to false.
2019-10-23 15:29:46 -04:00
Joey Hess
ec08b66bda
shouldAnnex: check isInodeKnown
Renamed unlocked files are now detected, and will always be
annexed, unless annex.largefiles disallows it.

This allows for git add's behavior to later be changed to otherwise
not annex files (whether by default or as a config option), without
worrying about the rename case.

This is not a major behavior change; annexing is still the default. But
there is one case where the behavior is changed, I think for the better:

	touch f
	git -c annex.largefiles=nothing add f
	git add bigfile
	git commit -m ...
	mv bigfile f
	git add f

Before, git-annex would see that f was previously not annexed,
and so the renamed bigfile content gets added to git. Now, it notices
that the inode is the one that bigfile used, and so it annexes it.

This potentially slows down git add a lot in some repositories because
of the poor performance of isInodeKnown when there are a lot of unlocked
files. Configuring annex.largefiles avoids the speed hit.
2019-10-23 14:49:45 -04:00
Joey Hess
8355dba5cc
plumb MeterUpdate into getKey
No behavior changes, but this shows everywhere that a progress meter
could be displayed when hashing a file to add to the annex.

Many of the places don't make sense to display a progress meter though,
eg when importing the copy of the file probably swamps the hashing of
the file.
2019-06-25 11:43:24 -04:00
Joey Hess
2d33122215
avoid ingest lockdown file escaping the withOtherTmp call
Fixes bug that caused git-annex to fail to add a file when another
git-annex process cleaned up the temp directory it was using.

Solution is just to push withOtherTmp out to a higher level, so that
the whole ingest process can be completed inside it.

But in the assistant, that was not practical to do, since withOtherTmp runs
in the Annex monad and the assistant does not. Worked around by introducing
a separate temp directory that only the assistant uses for lockdown.
Since only one assistant can run at a time, it's easy to clean up that
directory of old cruft at startup.
2019-05-07 13:04:57 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
5d98cba923
use ByteStrings when reading annex symlinks and pointers
Now there's a ByteString used all the way from disk to Key.

The main complication in this conversion was the use of fromInternalGitPath
in several places to munge things on Windows. The things that used that
were changed to parse the ByteString using either path separator.

Also some code that had read from files to a String lazily was changed
to read a minimal strict ByteString.
2019-01-14 15:37:08 -04:00
Joey Hess
71cc9cfaa2
improve smudge --clean behavior on outside work tree files
smudge: When passed a file located outside the working tree, eg by git
diff, avoid erroring out.

This commit was sponsored by Ewen McNeill on Patreon.
2018-11-15 13:04:40 -04:00
Joey Hess
5ab0f48ffb
high-res mtimes
Cache high-resolution mtimes for improved detection of modified files in v7
(and direct mode).

Including on Windows.

With back-compat support so old low-res mtimes won't break anything, and
so the new information also won't break old versions of git-annex.
2018-10-30 00:41:26 -04:00
Joey Hess
234842a347
v7
Install new git hooks in this version.

This does beg the question of what to do if git later gets eg a
post-smudge hook, that could run git-annex smudge --update. I think the
thing to do in that case would be to make git-annex smudge --update
install the new hooks. That way, as the user uses git-annex, the hook
would be created pretty quickly and without needing any extra syscalls
except for when git-annex smudge --update is called.

I considered doing something like that for installation of the
post-checkout and post-merge hooks, which would have avoided the need
for v7. But the only place it was cheap to do it would be in git-annex smudge
which could cheaply notice that smudge.log didn't exist yet and so know
the hooks needed to be installed. But since smudge used to populate pointer
files, it would be quite surprising if a single git checkout/merge failed
to update the work tree, and so that idea didn't work out.

The other reason for v7 is psychological -- users don't need to worry
about whether they might be running an old version of git-annex that
doesn't support their v7 repository very well. And bug reports about
"v6" have gotten a bit of a bad association in my head since they often
hit one of the known limitations and didn't realize it was experimental.

newtyped RepoVersion Int to avoid needing 2 comparisons in
versionSupportsUnlockedPointers etc. Also it's just nicer.

This commit was sponsored by John Pellman on Patreon.
2018-10-25 18:24:23 -04:00
Joey Hess
c28ca8294f
optimize smudge --clean of unmodified file
Usually, git won't run clean filter when a file is unmodified. But, when
git checkout runs git annex smudge --update, it populates the pointer
runs git update-index, which sees the file has changed and runs
git annex smudge --clean, which was checksumming the file unncessarily
as it re-ingested it.

With annex.thin set, this is the difference between git checkout of a
branch with a 1 gb file taking 30s and 0.1s.

This commit was sponsored by Brett Eisenberg on Patreon.
2018-10-25 16:46:46 -04:00
Joey Hess
ca7de61454
git post-checkout and post-merge hooks
* init, upgrade: Install git post-checkout and post-merge hooks that run
  git annex smudge --update.
* precommit: Run git annex smudge --update, because the post-merge
  hook is not run when there is a merge conflict. So the work tree will
  be updated when a commit is made to resolve the merge conflict.
* precommit: Run git annex smudge --update, because the post-merge
  hook is not run when there is a merge conflict. So the work tree will
  be updated when a commit is made to resolve the merge conflict.
* Note that git has no hooks run after git stash or git cherry-pick,
  so the user will have to manually run git annex smudge --update
  after such commands.

Nothing currently installs the hooks into v6 repos that already exist.
Something will need to be done about that, either move this behavior to v7,
or document that the user will need to manually fix up their v6 repos.

This commit was sponsored by Eric Drechsel on Patreon.
2018-10-25 15:59:51 -04:00
Joey Hess
917a2c6095
defer updating unlocked files until after smudge filter
The smuge filter no longer provides git with annexed file content, to
avoid a git memory leak, and because that did not honor annex.thin.

git annex smudge --update has to be run after a checkout to update
unlocked files in the working tree with annexed file contents.

No hooks yet to run it.

This commit was sponsored by Nick Piper on Patreon.
2018-10-25 15:08:20 -04:00
Joey Hess
f2a4db724c
remove redundant test
populatePointerFile checks the same thing
2018-10-25 14:31:45 -04:00
Joey Hess
6001b3cf45
fix build warning 2018-08-28 13:17:06 -04:00
Joey Hess
10138056dc
v6: avoid accidental conversion when annex.largefiles is not configured
v6: When annex.largefiles is not configured for a file, running git add or
git commit, or otherwise using git to stage a file will add it to the annex
if the file was in the annex before, and to git otherwise. This is to avoid
accidental conversion.

Note that git-annex add's behavior has not changed, for reasons explained
in the added comment.

Performance: No added overhead when annex.largefiles is configured.
When not configured, there is an added call to catObjectMetaData,
which involves a round trip through git cat-file --batch.
However, the earlier catKeyFile primes the cache for it.

This commit was supported by the NSF-funded DataLad project.
2018-08-27 14:51:10 -04:00
Joey Hess
98fd7ec6c9
recover from race between git mv+commit and git-annex get
Last of the known v6 races.

This also makes git add of a pointer file populate it when its content
is present in the annex. Which makes sense to do, I think.

This commit was supported by the NSF-funded DataLad project.
2018-08-22 16:01:50 -04:00
Joey Hess
48e9e12961
finally fixed v6 get/drop git status
After updating the worktree for an add/drop, update git's index, so git
status will not show the files as modified.

What actually happens is that the index update removes the inode
information from the index. The next git status (or similar) run
then has to do some work. It runs the clean filter.

So, this depends on the clean filter being reasonably fast and on git
not leaking memory when running it. Both problems were fixed in
a96972015d, but only for git 2.5. Anyone
using an older git will see very expensive git status after an add/drop.

This uses the same git update-index queue as other parts of git-annex, so
the actual index update is fairly efficient. Of course, updating the index
does still have some overhead. The annex.queuesize config will control how
often the index gets updated when working on a lot of files.

This is an imperfect workaround... Added several todos about new
problems this workaround causes. Still, this seems a lot better than the
old behavior.

This commit was supported by the NSF-funded DataLad project.
2018-08-14 16:23:58 -04:00
Joey Hess
a96972015d
massive v6 add speed/memory improvement
v6 add: Take advantage of improved SIGPIPE handler in git 2.5 to speed up
the clean filter by not reading the file content from the pipe. This also
avoids git buffering the whole file content in memory.

When built with an older git, still consumes stdin. If built with a newer
git and used with an older one, it breaks, but that's acceptable --
checking the git version every time would make repeated smudge runs slow.

This commit was supported by the NSF-funded DataLad project.
2018-08-09 18:17:46 -04:00
Joey Hess
9c4650358c
add KeyVariety type
Where before the "name" of a key and a backend was a string, this makes
it a concrete data type.

This is groundwork for allowing some varieties of keys to be disabled
in file2key, so git-annex won't use them at all.

Benchmarks ran in my big repo:

old git-annex info:

real	0m3.338s
user	0m3.124s
sys	0m0.244s

new git-annex info:

real	0m3.216s
user	0m3.024s
sys	0m0.220s

new git-annex find:

real	0m7.138s
user	0m6.924s
sys	0m0.252s

old git-annex find:

real	0m7.433s
user	0m7.240s
sys	0m0.232s

Surprising result; I'd have expected it to be slower since it now parses
all the key varieties. But, the parser is very simple and perhaps
sharing KeyVarieties uses less memory or something like that.

This commit was supported by the NSF-funded DataLad project.
2017-02-24 15:16:56 -04:00
Joey Hess
f617988a29
Make import --deduplicate and --skip-duplicates only hash once, not twice
import: --deduplicate and --skip-duplicates were implemented inneficiently;
they unncessarily hashed each file twice. They have been improved to only
hash once.

The new approach is to lock down (minimally) and hash files, and then
reuse that information when importing them.

This was rather tricky, especially in detecting changes to files while
they are being imported.

The output of import changed slightly. While before it silently skipped
over files with eg --skip-duplicates, now it shows each file as it starts
to act on it. Since every file is hashed first thing, it would otherwise
not be clear what file import is chewing on. (Actually, it wasn't clear
before when any of the duplicates switches were used.)

This commit was sponsored by Alexander Thompson on Patreon.
2017-02-09 15:32:22 -04:00
Joey Hess
b6b5a11601
Make git clean filter preserve the backend that was used for a file. 2016-06-09 15:17:08 -04:00
Joey Hess
0860731760
reorder associated file db update
There's a potential race where the smudge filter is run at the same time an
object is being downloaded. If the download finished after the inAnnex
check, and before the keys db was updated, the associated file would not
get updated with the downloaded content.

I'm not sure this closes the race; it may only narrow the window. Problem
is, the keys database needs to communicate between two different processes.
In the case of the assistant, the transferkeys command is the other
process, and it closes the db handle after getting the file. So, it should
re-open the database and so see the update that the smudge filter has
written to it. But, what if the smudge filter takes a while to update the
database?
2016-05-16 14:55:05 -04:00
Joey Hess
617f2834fc
Fix bug that sometimes prevented git-annex smudge --clean from consuming all its input, which resulted in git add bypassing git-annex. 2016-05-02 10:53:24 -04:00
Joey Hess
402e5adab8
smudge: Print a warning when annex.thin is set, as git's smudge interface does not allow honoring that configuration. 2016-04-13 13:34:24 -04:00
Joey Hess
737e45156e
remove 163 lines of code without changing anything except imports 2016-01-20 16:36:33 -04:00
Joey Hess
4b819bee2b
avoid confusing git with a modified ctime in clean filter
Linking the file to the tmp dir was not necessary in the clean
filter, and it caused the ctime to change, which caused git to think
the file was changed. This caused git status to get slow as it kept
re-cleaning unchanged files.
2016-01-07 17:48:04 -04:00
Joey Hess
b3d60ca285
use TopFilePath for associated files
Fixes several bugs with updates of pointer files. When eg, running
git annex drop --from localremote
it was updating the pointer file in the local repository, not the remote.
Also, fixes drop ../foo when run in a subdir, and probably lots of other
problems. Test suite drops from ~30 to 11 failures now.

TopFilePath is used to force thinking about what the filepath is relative
to.

The data stored in the sqlite db is still just a plain string, and
TopFilePath is a newtype, so there's no overhead involved in using it in
DataBase.Keys.
2016-01-05 17:22:19 -04:00
Joey Hess
43b1333216
switch to using main ingest code
Fixes at least one bug, in populating existing worktree files that use the
same key that's ingested.
2016-01-01 14:16:40 -04:00
Joey Hess
996ae9b172
don't disable smudge filter while merging
The smudge filter does need to be run, because if the key is in the local
annex already (due to renaming, or a copy of a file added, or a new file
added and its content has already arrived), git merge smudges the file and
this should provide its content.

This does probably mean that in merge conflict resolution, git smudges the
existing file, re-copying all its content to it, and then the file is
deleted. So, not efficient.
2015-12-29 16:36:21 -04:00
Joey Hess
b6b34f4916
automatic conflict resolution for v6 unlocked files
Several tricky parts:

* When the conflict is just between the same key being locked and unlocked,
  the unlocked version wins, and the file is not renamed in this case.

* Need to update associated file map when conflict resolution renames
  an unlocked file.

* git merge runs the smudge filter on the conflicting file, and actually
  overwrites the file with the same content it had before, and so
  invalidates its inode cache. This makes it difficult to know when it's
  safe to remove such files as conflict cruft, without going so far as to
  compare their entire contents.

  Dealt with this by preventing the smudge filter from populating the file
  when a merge is run. However, that also prevents the smudge filter being
  run for non-conflicting files, so eg moving a file won't put its new
  content into place.

* Ideally, if a merge or a merge conflict resolution renames an unlocked
  file, the file in the work tree can just be moved, rather than copying
  the content to a new worktree file.

  This is attempted to be done in merge conflict resolution, but
  due to git merge's behavior of running smudge filters, what actually
  seems to happen is the old worktree file with the content is deleted and
  rewritten as a pointer file, so doesn't get reused.

So, this is probably not as efficient as it optimally could be.
If that becomes a problem, could look into running the merge in a separate
worktree and updating the real worktree more efficiently, similarly to the
direct mode merge. However, the direct mode merge had a lot of bugs, and
I'd rather not use that more error-prone method unless really needed.
2015-12-29 15:41:09 -04:00
Joey Hess
121f5d5b0c
annex.thin
Decided it's too scary to make v6 unlocked files have 1 copy by default,
but that should be available to those who need it. This is consistent with
git-annex not dropping unused content without --force, etc.

* Added annex.thin setting, which makes unlocked files in v6 repositories
  be hard linked to their content, instead of a copy. This saves disk
  space but means any modification of an unlocked file will lose the local
  (and possibly only) copy of the old version.
* Enable annex.thin by default on upgrade from direct mode to v6, since
  direct mode made the same tradeoff.
* fix: Adjusts unlocked files as configured by annex.thin.
2015-12-27 15:59:59 -04:00
Joey Hess
7c02f070b1
lost some bookkeeping info
I forgot to convert this to use Annex.Ingest, todo later.
2015-12-24 13:15:26 -04:00
Joey Hess
d8a8c77a8f
move cleanOldKey into ingest 2015-12-22 16:55:49 -04:00
Joey Hess
4392140946
make linkAnnex detect when the file changes as it's being copied/linked in
This fixes a race where the modified file ended up in annex/objects, and
the InodeCache stored in the database was for the modified version, so
git-annex didn't know it had gotten modified.

The race could occur when the smudge filter was running; now it gets the
InodeCache before generating the Key, which avoids the race.
2015-12-22 15:20:03 -04:00
Joey Hess
71e2050f8f
have clean filter check if the filename was already in use by an old key
The annex object for it may have been modified due to hard link, and
that should be cleaned up when the new version is added. If another
associated file has the old key's content, that's linked into the annex
object. Otherwise, update location log to reflect that content has been
lost.
2015-12-15 13:06:52 -04:00
Joey Hess
42caf42857
avoid smudge filter returning invalid content
1. git add file
2. git commit
3. modify file
4. git commit
5. git reset HEAD^

Before this fix, that resulted in git saying the file was modified. And
indeed, it didn't have the content it should in the just checked out ref,
because step 3 modified the object file for the old key.
2015-12-11 18:01:50 -04:00
Joey Hess
50e83b606c
only make 1 hardlink max between pointer file and annex object
If multiple files point to the same annex object, the user may want to
modify them independently, so don't use a hard link.

Also, check diskreserve when copying.
2015-12-11 14:00:21 -04:00
Joey Hess
06a8256bf6
always format pointer file with a trailing newline
Before the smudge filter added a trailing newline, but other things that
wrote formatPointer to a file did not.

also some new pointer staging code to use later
2015-12-10 16:06:58 -04:00
Joey Hess
ce73a96e4e
use InodeCache when dropping a key to see if a pointer file can be safely reset
The Keys database can hold multiple inode caches for a given key. One for
the annex object, and one for each pointer file, which may not be hard
linked to it.

Inode caches for a key are recorded when its content is added to the annex,
but only if it has known pointer files. This is to avoid the overhead of
maintaining the database when not needed.

When the smudge filter outputs a file's content, the inode cache is not
updated, because git's smudge interface doesn't let us write the file. So,
dropping will fall back to doing an expensive verification then. Ideally,
git's interface would be improved, and then the inode cache could be
updated then too.
2015-12-09 17:54:54 -04:00
Joey Hess
5e8c628d2e
add inode cache to the db
Renamed the db to keys, since it is various info about a Keys.

Dropping a key will update its pointer files, as long as their content can
be verified to be unmodified. This falls back to checksum verification, but
I want it to use an InodeCache of the key, for speed. But, I have not made
anything populate that cache yet.
2015-12-09 17:00:37 -04:00
Joey Hess
ba39f993f5
avoid clean filter trying to annex a pointer file 2015-12-09 15:24:32 -04:00
Joey Hess
05b598a057
stash DbHandle in Annex state 2015-12-09 14:55:47 -04:00
Joey Hess
78a6b8ce05
refactor and improve pointer file handling code 2015-12-09 14:27:43 -04:00
Joey Hess
712c9fc590
require "annex/objects/" before key in pointer files
This removes ambiguity, because while someone might have "WORM--foo" in a
file that's not intended to be a git-annex pointer file,
"annex/objects/WORM--foo" is less likely.

Also, 664cc987e8 had a caveat about symlink
targets being parsed as pointer files, and now the same parser is used for
both.

I did not include any hash directories before the key in the pointer file,
as they're not needed. However, if they were included, the parser would
still work ok.
2015-12-07 15:45:08 -04:00
Joey Hess
664cc987e8
support pointer files
Backend.lookupFile is changed to always fall back to catKey when
operating on a file that's not a symlink.

catKey is changed to understand pointer files, as well as annex symlinks.

Before, catKey needed a file mode witness, to be sure it was looking at a
symlink. That was complicated stuff. Now, it doesn't actually care if a
file in git is a symlink or not; in either case asking git for the content
of the file will get the pointer to the key.

This does mean that git-annex will treat a link
foo -> WORM--bar as a git-annex file, and also treats
a regular file containing annex/objects/WORM--bar as a git-annex file.

Calling catKey could make git-annex commands need to do more work than
before. This would especially be the case if a repo contained many regular
files, and only a few annexed files, as now git-annex will need to ask
git about the contents of the regular files.
2015-12-07 15:35:36 -04:00
Joey Hess
2cbcb4f1a8
update associated files database on smudge and clean 2015-12-07 14:41:22 -04:00
Joey Hess
fb6ebdaae7
refactor 2015-12-04 17:18:26 -04:00