{- git-annex command queue
 -
 - Copyright 2011-2021 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU AGPL version 3 or higher.
 -}

{-# LANGUAGE BangPatterns #-}

module Annex.Queue (
	addCommand,
	addFlushAction,
	addUpdateIndex,
	flush,
	flushWhenFull,
	size,
	get,
	mergeFrom,
) where

import Annex.Common
import Annex hiding (new)
import Annex.LockFile
import qualified Git.Queue
import qualified Git.UpdateIndex

{- Adds a git command to the queue. -}
addCommand :: [CommandParam] -> String -> [CommandParam] -> [FilePath] -> Annex ()
addCommand commonparams command params files = do
	q <- get
	store =<< flushWhenFull =<<
		(Git.Queue.addCommand commonparams command params files q =<< gitRepo)
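
{- A minimal usage sketch (hypothetical call site, not code from this
 - module): queue a "git add" of a file so it is batched with other
 - queued commands, instead of running git once per file:
 -
 - > Annex.Queue.addCommand [] "add" [Param "--"] [file]
 -}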

{- Adds an action to the queue, to be run when the queue is flushed,
 - along with files for it to act on. Running such work via the queue
 - lets it happen incrementally as the queue periodically fills and
 - flushes (eg when restaging pointer files after dropping many files),
 - rather than all at the end. -}
addFlushAction :: Git.Queue.FlushActionRunner Annex -> [RawFilePath] -> Annex ()
addFlushAction runner files = do
	q <- get
	store =<< flushWhenFull =<<
		(Git.Queue.addFlushAction runner files q =<< gitRepo)
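
{- A hedged usage sketch: restageRunner stands in for a
 - Git.Queue.FlushActionRunner Annex value (none is defined in this
 - module), and fs for the pointer files it should act on:
 -
 - > addFlushAction restageRunner fs
 -}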

{- Adds an update-index stream to the queue. -}
addUpdateIndex :: Git.UpdateIndex.Streamer -> Annex ()
addUpdateIndex streamer = do
	q <- get
	store =<< flushWhenFull =<<
		(Git.Queue.addUpdateIndex streamer q =<< gitRepo)

{- Runs the queue if it is full. -}
flushWhenFull :: Git.Queue.Queue Annex -> Annex (Git.Queue.Queue Annex)
flushWhenFull q
	| Git.Queue.full q = flush' q
	| otherwise = return q

{- Runs (and empties) the queue. -}
flush :: Annex ()
flush = do
	q <- get
	unless (0 == Git.Queue.size q) $ do
		store =<< flush' q
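
{- For example, a command might end with an explicit flush (hypothetical
 - call site; git-annex also flushes the queue at shutdown):
 -
 - > doWork >> Annex.Queue.flush
 -}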

{- When there are multiple worker threads, each has its own queue.
 - And of course multiple git-annex processes may be running, each with
 - its own queue.
 -
 - But, flushing two queues at the same time could lead to failures due
 - to git locking files. So, only one queue is allowed to flush at a
 - time.
 -}
flush' :: Git.Queue.Queue Annex -> Annex (Git.Queue.Queue Annex)
flush' q = do
	lck <- fromRepo gitAnnexGitQueueLock
	withExclusiveLock lck $ do
		showStoringStateAction
		Git.Queue.flush q =<< gitRepo

{- Gets the size of the queue. -}
size :: Annex Int
size = Git.Queue.size <$> get

{- Gets the queue, creating a new one on first use. -}
get :: Annex (Git.Queue.Queue Annex)
get = maybe new return =<< getState repoqueue

new :: Annex (Git.Queue.Queue Annex)
new = do
	-- The annex.queuesize git config sets how many changes are
	-- batched up before a flush (the default is 10240). Git.Queue
	-- also flushes on a time limit, so eg an addurl of several
	-- large, slow downloads still updates the index after each
	-- file, rather than deferring all index updates to the end;
	-- the Nothing below leaves that time limit at its default.
	sz <- annexQueueSize <$> getGitConfig
	q <- liftIO $ Git.Queue.new sz Nothing
	store q
	return q
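
{- The batching threshold is user-tunable; eg, to batch up twice the
 - default number of changes before each flush (annex.queuesize is a
 - real git-annex setting; the value shown is just an example):
 -
 - > git config annex.queuesize 20480
 -}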

store :: Git.Queue.Queue Annex -> Annex ()
store q = changeState $ \s -> s { repoqueue = Just q }

{- Merges the queue of another AnnexState (eg a worker thread's) into
 - the current queue. -}
mergeFrom :: AnnexState -> Annex ()
mergeFrom st = case repoqueue st of
	Nothing -> noop
	Just newq -> do
		q <- get
		-- The bang pattern forces the merged queue right away,
		-- rather than building up a chain of unevaluated merges.
		let !q' = Git.Queue.merge q newq
		store =<< flushWhenFull q'
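
{- A hedged sketch of the intended use: when a worker thread finishes,
 - the concurrency machinery can fold the worker's queued changes back
 - into the current state (workerst is a hypothetical AnnexState taken
 - from such a worker):
 -
 - > mergeFrom workerst
 -}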