Added a comment
[[!comment format=mdwn
 username="http://adamspiers.myopenid.com/"
 nickname="Adam"
 subject="comment 10"
 date="2011-12-23T17:22:11Z"
 content="""
> Your perl script is not O(n). Inserting into perl hash tables has
> overhead of minimum O(n log n).

What's your source for this assertion? I would expect an amortized
average of `O(1)` per insertion, i.e. `O(n)` for full population.
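
(The usual back-of-the-envelope argument, assuming the table doubles
whenever it fills up: the rehashes performed while growing to n entries
touch roughly n/2 + n/4 + n/8 + ... < n elements in total, so populating
the whole table is still `O(n)` work and each insertion is `O(1)`
amortized, whatever Perl's internal details are.)
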
> Not counting the overhead of resizing hash tables,
> the grievous slowdown if the bucket size is overcome by data (it
> probably falls back to a linked list or something then), and the
> overhead of traversing the hash tables to get data out.

None of which necessarily change the algorithmic complexity. However,
real benchmarks are far more useful here than complexity analysis, and
[the dangers of premature optimization](http://c2.com/cgi/wiki?PrematureOptimization)
should not be forgotten.
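
For what it's worth, here is the kind of benchmark I have in mind: a
minimal standalone sketch (hypothetical, not part of git-annex or my
script) which times bulk insertions into a plain Perl hash at a few
sizes, so the growth rate is visible directly.

    #!/usr/bin/perl
    # Timing sketch: if hash insertion is amortized O(1), the total time
    # to insert n keys should grow roughly linearly with n.
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    for my $n (1_000_000, 2_000_000, 4_000_000, 8_000_000) {
        my %seen;
        my $t0 = [gettimeofday];
        $seen{"key$_"} = 1 for 1 .. $n;
        printf "%9d insertions: %.2fs\n", $n, tv_interval($t0);
    }
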
> Your memory size calculations ignore the overhead of a hash table or
> other data structure to store the data in, which will tend to be
> more than the actual data size it's storing. I estimate your 50
> million number is off by at least one order of magnitude, and more
> likely two;

Sure, I was aware of that, but my point still stands. Even 500k keys
per 1GB of RAM does not sound expensive to me.
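
If anyone wants a measured number rather than estimates, a quick sketch
along these lines would report the per-key overhead directly; it assumes
the CPAN module Devel::Size is installed and uses 40-character keys as a
stand-in for SHA1 hex digests:

    #!/usr/bin/perl
    # Rough measurement of Perl's per-key hash memory overhead.
    use strict;
    use warnings;
    use Devel::Size qw(total_size);   # CPAN module, not in core

    my $n = 100_000;
    my %h;
    $h{ sprintf("%040d", $_) } = 1 for 1 .. $n;   # 40-char keys

    my $bytes = total_size(\%h);
    printf "%d keys: %d bytes, about %.0f bytes/key\n", $n, $bytes, $bytes / $n;
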
> in any case I don't want git-annex to use 1 gb of ram.

Why not? What's the maximum it should use? 512MB? 256MB?
32MB? I don't see the sense in the author of a program
dictating thresholds which are entirely dependent on the context
in which the program is *run*, not the context in which it's *written*.
That's why systems have files such as `/etc/security/limits.conf`.

You said you want git-annex to scale to enormous repositories. If you
impose an arbitrary memory restriction such as the above, that means
avoiding implementing *any* kind of functionality which requires `O(n)`
memory or worse. Isn't it reasonable to assume that many users use
git-annex on repositories which are *not* enormous? Even when they do
work with enormous repositories, just like with any other program,
they would naturally expect certain operations to take longer or
become impractical without sufficient RAM. That's why I say that this
restriction amounts to throwing out the baby with the bathwater.
It just means that those who need the functionality would have to
reimplement it themselves, assuming they are able, which is likely
to result in more wheel reinventions. I've already shared
[my implementation](https://github.com/aspiers/git-config/blob/master/bin/git-annex-finddups),
but how many people are likely to find it, let alone get it working?

> Little known fact: sort(1) will use a temp file as a buffer if too
> much memory is needed to hold the data to sort.

Interesting. Presumably you are referring to some undocumented
behaviour, rather than `--batch-size`, which only applies when merging
multiple files, not when simply sorting STDIN.

> It's also written in the most efficient language possible and has
> been ruthlessly optimised for 30 years, so I would be very surprised
> if it was not the best choice.

It's the best choice for sorting. But sorting purely to detect
duplicates is a dismally bad choice: it does `O(n log n)` work and has
to consume all of its input before it can report a single duplicate,
whereas a hash-based scan is a single `O(n)` pass.
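
To make that concrete, here is a stripped-down sketch of the hash-based
approach (assuming, for illustration, one key per line on STDIN rather
than whatever the script linked above actually reads): one pass, no
sorting, duplicates reported as soon as they are seen.

    #!/usr/bin/perl
    # Report any key that appears more than once on STDIN, in one pass.
    use strict;
    use warnings;

    my %seen;
    while (my $key = <STDIN>) {
        chomp $key;
        print "duplicate: $key\n" if $seen{$key}++;
    }
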
"""]]