Significantly sped up processing of large numbers of directories passed to a single git-annex command. (try 2)

The new approach is to do it the expensive way for the first 100 paths
on the command line, and then assume the user doesn't care too much about
the order of the rest, falling back to the cheap way that does not preserve order.
Joey Hess 2015-04-02 01:44:32 -04:00
parent f79502d377
commit 294991dacb
3 changed files with 18 additions and 5 deletions
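
For illustration, here is a minimal standalone sketch of the two segmenting strategies this commit chooses between. The names segmentOrdered, segmentCheap, and the simplified dirContains below are hypothetical stand-ins used only for this example; git-annex's real dirContains normalises paths before comparing.

import Data.List (partition, isPrefixOf)

-- Simplified stand-in for git-annex's dirContains: a directory contains
-- itself and anything under it.
dirContains :: FilePath -> FilePath -> Bool
dirContains d p = d == p || (d ++ "/") `isPrefixOf` p

-- Expensive, order-preserving strategy: scan the whole expanded list
-- once per original path.
segmentOrdered :: [FilePath] -> [FilePath] -> [[FilePath]]
segmentOrdered [] new = [new]
segmentOrdered [_] new = [new]
segmentOrdered (l:ls) new = found : segmentOrdered ls rest
  where
    (found, rest) = partition (l `dirContains`) new

-- Cheap strategy: only take the leading run of items contained in the
-- current path, so the expanded list is walked roughly once in total.
segmentCheap :: [FilePath] -> [FilePath] -> [[FilePath]]
segmentCheap [] new = [new]
segmentCheap [_] new = [new]
segmentCheap (l:ls) new = found : segmentCheap ls rest
  where
    (found, rest) = break (\p -> not (l `dirContains` p)) new

main :: IO ()
main = do
    let dirs = ["a", "b"]
        expanded = ["a/1", "a/2", "b/1"]
    print (segmentOrdered dirs expanded)  -- [["a/1","a/2"],["b/1"]]
    print (segmentCheap dirs expanded)    -- same result while expanded stays grouped in order

The actual change in the diff below picks between these two strategies based on how many of the original paths remain to be segmented.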


@@ -170,17 +170,26 @@ prop_relPathDirToFile_regressionTest = same_dir_shortcurcuits_at_difference
 	== joinPath ["..", "..", "..", "..", ".git", "annex", "objects", "18", "gk", "SHA256-foo", "SHA256-foo"]
 
 {- Given an original list of paths, and an expanded list derived from it,
- - generates a list of lists, where each sublist corresponds to one of the
- - original paths. When the original path is a directory, any items
- - in the expanded list that are contained in that directory will appear in
- - its segment.
+ - which may be arbitrarily reordered, generates a list of lists, where
+ - each sublist corresponds to one of the original paths.
+ -
+ - When the original path is a directory, any items in the expanded list
+ - that are contained in that directory will appear in its segment.
+ -
+ - The order of the original list of paths is attempted to be preserved in
+ - the order of the returned segments. However, doing so has a O^NM
+ - growth factor. So, if the original list has more than 100 paths on it,
+ - we stop preserving ordering at that point. Presumably a user passing
+ - that many paths in doesn't care too much about order of the later ones.
 -}
 segmentPaths :: [FilePath] -> [FilePath] -> [[FilePath]]
 segmentPaths [] new = [new]
 segmentPaths [_] new = [new] -- optimisation
 segmentPaths (l:ls) new = found : segmentPaths ls rest
   where
-	(found, rest)=partition (l `dirContains`) new
+	(found, rest) = if length ls < 100
+		then partition (l `dirContains`) new
+		else break (\p -> not (l `dirContains` p)) new
 
 {- This assumes that it's cheaper to call segmentPaths on the result,
  - than it would be to run the action separately with each path. In
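
To make the ordering trade-off concrete, here is what the two hypothetical strategies from the sketch above do when the expanded list does not come back grouped in the same order as the original paths (reusing the segmentOrdered and segmentCheap definitions, which again are illustrations rather than git-annex's own code):

demo :: IO ()
demo = do
    let dirs = ["b", "a"]                 -- order given on the command line
        expanded = ["a/1", "a/2", "b/1"]  -- expansion returned in a different order
    print (segmentOrdered dirs expanded)  -- [["b/1"],["a/1","a/2"]]
    print (segmentCheap dirs expanded)    -- [[],["a/1","a/2","b/1"]]

Nothing is dropped in the cheap case, but items are no longer grouped under the original path that contains them, which is the loss of ordering the comment above accepts once the path list gets large.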

debian/changelog

@@ -23,6 +23,8 @@ git-annex (5.20150328) UNRELEASED; urgency=medium
   * fsck: Added --distributed and --expire options,
     for distributed fsck.
   * Fix truncation of parameters that could occur when using xargs git-annex.
+  * Significantly sped up processing of large numbers of directories
+    passed to a single git-annex command.
 
  -- Joey Hess <id@joeyh.name>  Fri, 27 Mar 2015 16:04:43 -0400


@@ -11,3 +11,5 @@ Feeding git-annex a long list off directories, eg with xargs can have
 git-ls-files results. There is probably an exponential blowup in the time
 relative to the number of parameters. Some of the stuff being done to
 preserve original ordering etc is likely at fault.
+
+> [[fixed|done]] --[[Joey]]