Commit graph

52 commits

Author SHA1 Message Date
Yaroslav Halchenko
84b0a3707a
Apply codespell -w throughout 2023-03-17 15:14:58 -04:00
Joey Hess
6a3bd283b8
add restage log
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.

Currently, this lets a git-annex get/drop command be interrupted and
then re-ran, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.

Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes, when eg
dropping a large number of files, currently the git queue flushes
periodically, and so it restages incrementally, rather than all at the
end.

In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.

Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok, it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.

Sponsored-by: Dartmouth College's DANDI project
2022-09-23 15:47:24 -04:00
Joey Hess
faf84aa5c2
Avoid git status taking a long time after git-annex unlock of many files.
Implemented by making Git.Queue have a FlushAction, which can accumulate
along with another action on files, and runs only once the other action has
run.

This lets git-annex unlock queue up git update-index actions, without
conflicting with the restagePointerFiles FlushActions.

In a repository with filter-process enabled, git-annex unlock will
often not take any more time than before, though it may when the files are
large. Either way, it should always slow down less than git-annex status
speeds up.

When filter-process is not enabled, git-annex unlock will slow down as much
as git status speeds up.

Sponsored-by: Jochen Bartl on Patreon
2022-02-18 15:06:40 -04:00
Joey Hess
a03e9107cb
wording 2021-12-14 13:53:36 -04:00
Joey Hess
681d8611be
fix flush order reversion
commit c2e46f4707 caused
the queue to possibly be flushed in the wrong order when
it contained a mix of different actions.
2021-12-14 13:51:00 -04:00
Joey Hess
c2e46f4707
improve git command queue flushing with time limit
So that eg, addurl of several large files that take time to download will
update the index for each file, rather than deferring the index updates to
the end.

In cases like an add of many smallish files, where a new file is being
added every few seconds. In that case, the queue will still build up a
lot of changes which are flushed at once, for best performance. Since
the default queue size is 10240, often it only gets flushed once at the
end, same as before. (Notice that updateQueue updated _lastchanged
when adding a new item to the queue without flushing it; that is
necessary to avoid it flushing the queue every 5 minutes in this case.)

But, when it takes more than a 5 minutes to add a file, the overhead of
updating the index immediately is probably small, so do it after each
file. This avoids git-annex potentially taking a very very long time
indeed to stage newly added files, which can be annoying to the user who
would like to get on with doing something with the files it's already
added, eg using git mv to rename them to a better name.

This is only likely to cause a problem if it takes say, 30 seconds to
update the index; doing an extra 30 seconds of work after every 5
minute file add would be less optimal. Normally, updating the index takes
significantly less time than that. On a SSD with 100k files it takes
less than 1 second, and the index write time is bound by disk read and
write so is not too much worse on a hard drive. So I hope this will not
impact users, although if it does turn out to, the time limit could be
made configurable.

A perhaps better way to do it would be to have a background worker
thread that wakes up every 60 seconds or so and flushes the queue.
That is made somewhat difficult because the queue can contain Annex
actions and so this would add a new source of concurrency issues.
So I'm trying to avoid that approach if possible.

Sponsored-by: Erik Bjäreholt on Patreon
2021-12-14 12:23:19 -04:00
Joey Hess
a0758bdd10
dynamically disable filter-process in restagePointerFile when it would be slower
Based on my earlier benchmark, I have a rough cost model for how
expensive it is for git-annex smudge to be run on a file, vs
how expensive it is for a gigabyte of a file's content to be read and
piped through to filter-process.

So, using that cost model, it can decide if using filter-process will
be more or less expensive than running the smudge filter on the files to
be restaged.

It turned out to be *really* annoying to temporarily disable
filter-process. I did find a way, but urk, this is horrible. Notice
that, if it's interrupted with it disabled, it will remain disabled
until the next time restagePointerFile runs. Which could be some time
later. If the user runs `git add` or `git checkout` on a lot of small
files before that, they will see slower than expected performance.

(This commit also deletes where I wrote down the benchmark results
earlier.)

Sponsored-by: Noam Kremen on Patreon
2021-11-08 16:20:34 -04:00
Joey Hess
1c5fc8f047
Git.Queue: allow providing git common options like -c 2021-01-04 12:51:55 -04:00
Joey Hess
cd776ecb2e
avoid combining queued commands with different params
I don't think this affected git-annex currently, but if the same command
was queued twice with different params, one set of params was thrown
away, and the files going with those were run with the other set of
params.
2021-01-04 12:41:19 -04:00
Joey Hess
804808d569
squash build warnings on windows 2020-11-23 14:00:17 -04:00
Joey Hess
681b44236a
more RawFilePath conversion
at 377/645

This commit was sponsored by Svenne Krap on Patreon.
2020-10-29 14:20:57 -04:00
Joey Hess
05703893af
use right handle 2020-06-05 16:38:11 -04:00
Joey Hess
2670890b17
convert to withCreateProcess for async exception safety
This handles all createProcessSuccess callers, and aside from process
pools, the complete conversion of all process running to async exception
safety should be complete now.

Also, was able to remove from Utility.Process the old API that I now
know was not a good idea. And proof it was bad: The code size went *down*,
despite there being a fair bit of boilerplate for some future API to
reduce.
2020-06-04 15:45:52 -04:00
Joey Hess
99536e3a0b
remove one more warningIO
Had to generalize Git.Queue so it can run an Annex action, yipes.

Only remaining warningIO are in the legacy chunk code.
2019-11-12 10:45:52 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
82c5dd8a01
queueing of internal IO actions on files
This would be better if getInternalFiles were
more polymorphic, but I can't see a good
way to accomplish that without messing with Data.Typeable,
which seemed like overkill.

Reverted CommandAction back to the simpler version.

This commit was sponsored by Eric Drechsel on Patreon.
2018-08-17 13:28:21 -04:00
Joey Hess
c5a8abb130
comment 2018-08-17 13:19:37 -04:00
Joey Hess
6a445dc086
support conditionally excluding queued files
Switched code to use a for loop to avoid a filterM that would have
doubled the memory used.

This commit was supported by the NSF-funded DataLad project.
2018-08-16 14:38:37 -04:00
Joey Hess
256d8f07e8
avoid insertWith' depreaction warning
Switch to Data.Map.Strict everywhere that used it.

There are still lots of lazy maps in git-annex. I think switching these
is safe. The risk is that there might be a map that is used in a way
that relies on the values not being evaluated to WHNF, and switching to
strict might result in bad performance or memory use. So, I have not
switched everything.
2018-04-22 13:28:31 -04:00
Joey Hess
8484c0c197
Always use filesystem encoding for all file and handle reads and writes.
This is a big scary change. I have convinced myself it should be safe. I
hope!
2016-12-24 14:46:31 -04:00
Yaroslav Halchenko
64e844e1fe
minor typo fixes throughout
problematic
flexibility
2016-06-02 11:22:18 -04:00
Joey Hess
b52cf5697b
immediate queue flushing when annex.queuesize=1
Previously, it only flushed when the queue got larger than 1.

Also, make the queue auto-flush when items are added, rather than needing
to be flushed as a separate step. This simplifies the code and make it more
efficient too, as it avoids needing to read the queue out of the state to
check if it should be flushed.
2016-01-13 14:55:01 -04:00
Joey Hess
31472161e4
merge git command queue when joining with concurrent thread 2015-11-05 18:21:48 -04:00
Joey Hess
afc5153157 update my email address and homepage url 2015-01-21 12:50:09 -04:00
Joey Hess
986bf1d6f6 Fix bug in annex.queuesize calculation that caused much more queue flushing than necessary.
The bug caused the size of the queue to be miscalculted; it was doubled
each time an item was added. Commands run after approx 140 items rather
than the intended 10240!
2014-06-18 17:23:36 -04:00
Joey Hess
fbd5a67cba fix a test suite reversion on Windows
Forgot to pass gitEnv when running commands in the git queue on windows.
2014-06-12 18:37:12 -04:00
Joey Hess
a44fd2c019 export CreateProcess fields from Utility.Process
update code to avoid cwd and env redefinition warnings
2014-06-10 19:20:14 -04:00
Joey Hess
f8cfcd4e44 couple more warning fixes 2014-02-25 14:53:43 -04:00
Joey Hess
3f6e4b8c7c fix all remaining -Wall warnings on Windows 2014-02-25 14:48:50 -04:00
Joey Hess
aff125ddab try working around windows xargs problem 2013-10-17 15:56:56 -04:00
Joey Hess
ebd778c519 Escape ':' in file/directory names to avoid it being treated as a pathspec by some git commands
A git pathspec is a filename, except when it starts with ':', it's taken
to refer to a branch, etc. Rather than special case ':', any filename
starting with anything unusual is prefixed with "./"

This could have been a real mess to deal with, but luckily SafeCommand
is already extensively used and so we know at the type level the difference
between parameters that are files, and parameters that are command options.

Testing did show that Git.Queue was not using SafeCommand on
filenames fed to xargs. (Filenames starting with '-' worked before only
because -- was used to separate filenames from options when calling eg git
add.)

The test suite now passes with filenames starting with ':'. However, I did
not keep that change to it, because such filenames are probably not legal
on windows, and I have enough ugly windows ifdefs in there as it is.

This commit was sponsored by Otavio Salvador. Thanks!
2013-08-01 15:15:49 -04:00
Joey Hess
8a2d1988d3 expose Control.Monad.join
I think I've been looking for that function for some time.
Ie, I remember wanting to collapse Just Nothing to Nothing.
2013-04-22 20:24:53 -04:00
Joey Hess
f87a781aa6 finished where indentation changes 2012-12-13 00:24:19 -04:00
Joey Hess
c9b3b8829d thread safe git-annex index file use 2012-08-24 20:50:39 -04:00
Joey Hess
1db7d27a45 add back debug logging
Make Utility.Process wrap the parts of System.Process that I use,
and add debug logging to them.

Also wrote some higher-level code that allows running an action
with handles to a processes stdin or stdout (or both), and checking
its exit status, all in a single function call.

As a bonus, the debug logging now indicates whether the process
is being run to read from it, feed it data, chat with it (writing and
reading), or just call it for its side effect.
2012-07-19 00:46:52 -04:00
Joey Hess
d1da9cf221 switch from System.Cmd.Utils to System.Process
Test suite now passes with -threaded!

I traced back all the hangs with -threaded to System.Cmd.Utils. It seems
it's just crappy/unsafe/outdated, and should not be used. System.Process
seems to be the cool new thing, so converted all the code to use it
instead.

In the process, --debug stopped printing commands it runs. I may try to
bring that back later.

Note that even SafeSystem was switched to use System.Process. Since that
was a modified version of code from System.Cmd.Utils, it needed to be
converted too. I also got rid of nearly all calls to forkProcess,
and all calls to executeFile, which I'm also doubtful about working
well with -threaded.
2012-07-18 18:00:24 -04:00
Joey Hess
da62edb42a optimisation and memory leak fix 2012-06-12 21:13:15 -04:00
Joey Hess
c5707c84d3 queue size fix
Increase queue size for update-index actions, because otherwise they'll
never be flushed.
2012-06-10 13:56:04 -04:00
Joey Hess
d45a9a7831 refactor and function name cleanup
(oops, I had a calcMerge and a calc_merge!)
2012-06-08 00:29:39 -04:00
Joey Hess
20f425be19 make watch use the queue
May not work. Certianly needs to flush the queue from time to time
when only symlink changes are being made.
2012-06-07 15:40:44 -04:00
Joey Hess
0a11b35d89 extend Git.Queue to be able to queue more than simple git commands
While I was in there, I noticed and fixed a bug in the queue size
calculations. It was never encountered only because Queue.add was
only ever run with 1 file in the list.
2012-06-07 15:19:44 -04:00
Joey Hess
7a6fb8ae4e flush the git queue when a new type of action is being added to it
This allows the queue to be used in a single process for multiple possibly
conflicting commands, like add and rm, without running them out of order.

This assumes that running the same git subcommand with different parameters
cannot itself conflict.
2012-06-04 20:41:22 -04:00
Joey Hess
52c5b164d8 Added a annex.queuesize setting
useful when adding hundreds of thousands of files on a system with plenty
of memory.

git add gets quite slow in such a large repository, so if the system has
more than the ~32 mb of memory the queue can use by default, it's a useful
optimisation to increase the queue size, in order to decrease the number
of times git add is run.
2012-02-15 11:14:19 -04:00
Joey Hess
d8fb97806c support all filename encodings with ghc 7.4
Under ghc 7.4, this seems to be able to handle all filename encodings
again. Including filename encodings that do not match the LANG setting.
I think this will not work with earlier versions of ghc, it uses some ghc
internals.

Turns out that ghc 7.4 has a special filesystem encoding that it uses when
reading/writing filenames (as FilePaths). This encoding is documented
to allow  "arbitrary undecodable bytes to be round-tripped through it".

So, to get FilePaths from eg, git ls-files, set the Handle that is reading
from git to use this encoding. Then things basically just work.

However, I have not found a way to make Text read using this encoding.
Text really does assume unicode. So I had to switch back to using String
when reading/writing data to git. Which is a pity, because it's some
percent slower, but at least it works.

Note that stdout and stderr also have to be set to this encoding, or
printing out filenames that contain undecodable bytes causes a crash.
IMHO this is a misfeature in ghc, that the user can pass you a filename,
which you can readFile, etc, but that default, putStr of filename may
cause a crash!

Git.CheckAttr gave me special trouble, because the filenames I got back
from git, after feeding them in, had further encoding breakage.
Rather than try to deal with that, I just zip up the input filenames
with the attributes. Which must be returned in the same order queried
for this to work.

Also of note is an apparent GHC bug I worked around in Git.CheckAttr. It
used to forkProcess and feed git from the child process.  Unfortunatly,
after this forkProcess, accessing the `files` variable from the parent
returns []. Not the value that was passed into the function. This screams
of a bad bug, that's clobbering a variable, but for now I just avoid
forkProcess there to work around it. That forkProcess was itself only added
because of a ghc bug, #624389. I've confirmed that the test case for that
bug doesn't reproduce it with ghc 7.4. So that's ok, except for the new ghc
bug I have not isolated and reported. Why does this simple bit of code
magnet the ghc bugs? :)

Also, the symlink touching code is currently broken, when used on utf-8
filenames in a non-utf-8 locale, or probably on any filename containing
undecodable bytes, and I temporarily commented it out.
2012-02-03 16:23:20 -04:00
Joey Hess
3d49258e5b attempt at a quick, utf-8 only fix to the ghc 7.4 problem
If you have only utf-8 filenames, and need to build git-annex with ghc 7.4,
this will work. But, it will crash on non-utf-8 filenames.
2012-02-01 16:16:08 -04:00
Joey Hess
ee3b5b2a42 use Common in a few more modules 2011-12-20 14:37:53 -04:00
Joey Hess
ef28b3fef7 split out Git/Command.hs 2011-12-14 15:56:11 -04:00
Joey Hess
bf460a0a98 reorder repo parameters last
Many functions took the repo as their first parameter. Changing it
consistently to be the last parameter allows doing some useful things with
currying, that reduce boilerplate.

In particular, g <- gitRepo is almost never needed now, instead
use inRepo to run an IO action in the repo, and fromRepo to get
a value from the repo.

This also provides more opportunities to use monadic and applicative
combinators.
2011-11-08 16:27:20 -04:00
Joey Hess
203148363f split groups of related functions out of Utility 2011-08-22 16:14:12 -04:00
Joey Hess
e784757376 hlint tweaks
Did all sources except Remotes/* and Command/*
2011-07-15 03:12:05 -04:00