Merge remote-tracking branch 'origin/master' into reorg

This commit is contained in:
Joey Hess 2011-03-16 00:09:35 -04:00
commit 539083b847
15 changed files with 162 additions and 0 deletions

View file

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawl9sYlePmv1xK-VvjBdN-5doOa_Xw-jH4U"
nickname="Richard"
subject="comment 1"
date="2011-03-15T14:11:27Z"
content="""
Keep in mind that lots of small files may have significant overhead, so a warning that it's not possible to make sure there's enough space would make sense for certain corner cases. Actually finding out the exact overhead is beyond git-annex's scope and, given transparent compression etc., beyond its ability, but a warning, optionally with a \"do you want to continue\" prompt, can't hurt.
-- RichiH
"""]]

View file

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://joey.kitenet.net/"
nickname="joey"
subject="comment 2"
date="2011-03-16T03:04:50Z"
content="""
Right. You probably don't want git-annex to fill up your entire drive anyway, so if it tries to reserve 10 MB or 1% or whatever (probably configurable) for overhead, that should be good enough.
"""]]
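
The reserve idea in the comment above can be sketched roughly as follows. This is only an illustration, not how git-annex implements it: the function names, the default values, and the choice to keep the larger of the fixed and percentage reserves are all assumptions made for the example.

```python
import shutil

MB = 1024 * 1024


def fits_with_reserve(file_size, free_bytes, total_bytes,
                      reserve_bytes=10 * MB, reserve_fraction=0.01):
    """Return True if file_size fits while the reserve stays free.

    Assumption: the reserve is the larger of a fixed amount and a
    fraction of the whole disk. Both defaults are illustrative.
    """
    reserve = max(reserve_bytes, int(total_bytes * reserve_fraction))
    return file_size <= free_bytes - reserve


def can_store(path, file_size):
    """Check the filesystem holding `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return fits_with_reserve(file_size, usage.free, usage.total)
```

As noted in comment 1, a check like this cannot account for per-file overhead or transparent compression, so it is a heuristic and a warning may still be appropriate.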

View file

@ -0,0 +1,27 @@
[[!comment format=mdwn
username="praet"
ip="81.240.159.215"
subject="Use variable symlinks, relative to the repo's root ?"
date="2011-03-10T16:50:28Z"
content="""
It all boils down to the fact that the path to a relative symlink's target is determined relative to the symlink itself.
Now, if we define the symlink's target relative to the git repo's root (eg. using the $GIT_DIR environment variable, which can be a relative or absolute path itself), this unfortunately results in an absolute symlink, which would -for obvious reasons- only be usable locally:
user@host:~$ mkdir -p tmp/{.git/annex,somefolder}
user@host:~$ export GIT_DIR=~/tmp
user@host:~$ touch $GIT_DIR/.git/annex/realfile
user@host:~$ ln -s $GIT_DIR/.git/annex/realfile $GIT_DIR/somefolder/file
user@host:~$ ls -al $GIT_DIR/somefolder/
total 12
drwxr-x--- 2 user group 4096 2011-03-10 16:54 .
drwxr-x--- 4 user group 4096 2011-03-10 16:53 ..
lrwxrwxrwx 1 user group 33 2011-03-10 16:54 file -> /home/user/tmp/.git/annex/realfile
user@host:~$
So, what we need is the ability to record the actual variable name (instead of its value) in our symlinks.
It *is* possible, using [variable/variant symlinks](http://en.wikipedia.org/wiki/Symbolic_link#Variable_symbolic_links), yet I'm unsure as to whether or not this is available on Linux systems, and even if it is, it would introduce compatibility issues in multi-OS environments.
Thoughts on this?
"""]]
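
For contrast with the absolute link in the shell session above, here is a minimal sketch of computing the target relative to the symlink's own directory, which is the behaviour the first sentence of the comment describes. The helper name `make_relative_symlink` and the scratch directory layout are assumptions for illustration; this does not provide variable symlinks, and a link created this way still has to be rewritten if it moves to a different directory depth.

```python
import os
import tempfile


def make_relative_symlink(target, link):
    """Create `link` pointing at `target` via a path relative to link's directory."""
    link_dir = os.path.dirname(os.path.abspath(link))
    rel_target = os.path.relpath(os.path.abspath(target), start=link_dir)
    os.symlink(rel_target, link)
    return rel_target


# Recreate the layout from the shell session in a scratch directory.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, ".git", "annex"))
os.makedirs(os.path.join(root, "somefolder"))
realfile = os.path.join(root, ".git", "annex", "realfile")
open(realfile, "w").close()

link = os.path.join(root, "somefolder", "file")
rel = make_relative_symlink(realfile, link)
print(rel)  # the stored target no longer contains the repository's absolute path
```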

View file

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://joey.kitenet.net/"
nickname="joey"
subject="comment 3"
date="2011-03-16T03:03:19Z"
content="""
Interesting, I had not heard of variable symlinks before. AFAIK Linux does not have them.
"""]]

doc/comments.mdwn Normal file
View file

@ -0,0 +1,9 @@
[[!sidebar content="""
[[!inline pages="comment_pending(*)" feedfile=pendingmoderation
description="comments pending moderation" show=-1]]
Comments in the [[!commentmoderation desc="moderation queue"]]:
[[!pagecount pages="comment_pending(*)"]]
"""]]
Recent comments posted to this site:
[[!inline pages="comment(*)" template="comment"]]

View file

@ -0,0 +1,19 @@
[[!comment format=mdwn
username="http://dieter-be.myopenid.com/"
nickname="dieter"
subject="comment 2"
date="2011-02-16T21:32:04Z"
content="""
thanks Joey,
is it possible to run some git annex command that tells me, for a specific directory, which files are available in another remote? (and which remote, and which filenames?)
I guess I could run that, do my own policy thingie, and run `git annex get` for the files I want.
For your podcast use case (and some of my use cases) don't you think git [annex] might actually be overkill? For example your podcasts use case, what value does git annex give over a simple rsync/rm script?
such a script wouldn't even need a data store to store its state, unlike git. it seems simpler and cleaner to me.
for the mpd thing, check http://alip.github.com/mpdcron/ (bad project name, it's a plugin based \"event handler\")
you should be able to write a simple plugin for mpdcron that does what you want (or even interface with mpd yourself from perl/python/.. to use its idle mode to get events)
Dieter
"""]]

View file

@ -0,0 +1,12 @@
[[!comment format=mdwn
username="http://joey.kitenet.net/"
nickname="joey"
subject="comment 3"
date="2011-03-16T03:01:17Z"
content="""
Whups, the comment above got stuck in the moderation queue for 27 days. I will try to check that more frequently.
In the meantime, I've implemented \"git annex whereis\" -- enjoy!
I find keeping my podcasts in the annex useful because it allows me to download individual episodes or podcasts easily when low bandwidth is available (ie, dialup), or over sneakernet. And generally keeps everything organised.
"""]]

View file

@ -0,0 +1,14 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawl9sYlePmv1xK-VvjBdN-5doOa_Xw-jH4U"
nickname="Richard"
subject="comment 2"
date="2011-03-15T13:52:16Z"
content="""
Can't you just use an underscore instead of a colon?
Would it be feasible to split directories dynamically? I.e. start with SHA1_123456789abcdef0123456789abcdef012345678/SHA1_123456789abcdef0123456789abcdef012345678 and, at a certain cut-off point, switch to shorter directory names? This could even be done per subdirectory and based purely on a locally-configured number. Different annexes on different file systems or with different file subsets might even have different thresholds. This would ensure scale while not forcing you to segment from the start. Also, while segmenting with longer directory names means a flatter tree, segments longer than four characters might not make too much sense. Segmenting too often could lead to some directories becoming too populated, bringing us back to the dynamic segmentation.
All of the above would make merging annexes by hand a _lot_ harder, but I don't know if this is a valid use case. And if all else fails, one could merge everything with the unsegmented directory names and start again from there.
-- RichiH
"""]]

View file

@ -0,0 +1,12 @@
[[!comment format=mdwn
username="http://joey.kitenet.net/"
nickname="joey"
subject="comment 3"
date="2011-03-16T03:13:39Z"
content="""
It is unfortunately not possible to do system-dependent hashing, so long as git-annex stores symlinks to the content in git.
It might be possible to start without hashing, and add hashing for new files after a cutoff point. It would add complexity.
I'm currently looking at a 2 character hash directory segment, based on an md5sum of the key, which splits it into 1024 buckets. git uses just 256 buckets for its object directory, but then its objects tend to get packed away. I sorta hope that one level is enough, but guess I could go to 2 levels (objects/ab/cd/key), which would provide 1048576 buckets, probably plenty, as if you are storing more than a million files, you are probably using a modern enough system to have a filesystem that doesn't need hashing.
"""]]
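
The bucket arithmetic in the comment above can be illustrated with a short sketch. It only assumes what the comment states: an md5 of the key, two characters per directory segment, 1024 buckets per level. The 32-character alphabet, the choice of which digest bits to use, and the names `bucket` and `object_path` are hypothetical, picked so that two characters give 32 * 32 = 1024 buckets; the encoding git-annex actually uses may differ.

```python
import hashlib

# Hypothetical 32-character alphabet: 32 * 32 = 1024 two-character buckets.
ALPHABET = "0123456789abcdefghijklmnopqrstuv"


def bucket(key, levels=1):
    """Map a key to `levels` two-character directory segments."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    segments = []
    for _ in range(levels):
        # Consume 10 bits per level: two 5-bit characters.
        hi = (value >> 5) & 31
        lo = value & 31
        segments.append(ALPHABET[hi] + ALPHABET[lo])
        value >>= 10
    return "/".join(segments)


def object_path(key, levels=1):
    """Build a path of the form objects/ab/key or objects/ab/cd/key."""
    return "objects/" + bucket(key, levels) + "/" + key


key = "SHA1_123456789abcdef0123456789abcdef012345678"
print(object_path(key))            # one level: 1024 buckets
print(object_path(key, levels=2))  # two levels: 1024 * 1024 = 1048576 buckets
```

Because the bucket depends only on the key, every clone computes the same path, which is consistent with the point that the hashing cannot be system-dependent while symlinks to the content are stored in git.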

View file

@ -10,6 +10,7 @@ To get a feel for it, see the [[walkthrough]].
* [[bugs]]
* [[todo]]
* [[forum]]
* [[comments]]
* [[contact]]
[[News]]:

View file

@ -6,6 +6,8 @ all users, so this should be the *last* reorg in the foreseeable future.
2. Add hashing, since some filesystems do suck (like er, fat at least :)
[[forum/hashing_objects_directories]]
(Also, may as well hash .git-annex/* while at it -- that's what
really gets big.)
3. Add filesize metadata for [[bugs/free_space_checking]]. (Currently only
present in WORM, and in an ad-hoc way.)

View file

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawl9sYlePmv1xK-VvjBdN-5doOa_Xw-jH4U"
nickname="Richard"
subject="comment 1"
date="2011-03-16T01:16:48Z"
content="""
If you support generic meta-data, keep in mind that you will need to do conflict resolution. Timestamps may not be synched across all systems, so keeping a log of old metadata could be used, sorting by history and using the latest. Which leaves the situation of two incompatible changes. This would probably mean manual conflict resolution. You will probably have thought of this already, but I still wanted to make sure this is recorded. -- RichiH
"""]]

View file

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawl9sYlePmv1xK-VvjBdN-5doOa_Xw-jH4U"
nickname="Richard"
subject="comment 2"
date="2011-03-16T01:19:25Z"
content="""
Hmm, I added quite a few comments at work, but they are stuck in moderation. Maybe I forgot to log in before adding them. I am surprised this one appeared immediately. -- RichiH
"""]]

View file

@ -0,0 +1,12 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawl9sYlePmv1xK-VvjBdN-5doOa_Xw-jH4U"
nickname="Richard"
subject="comment 1"
date="2011-03-15T14:08:41Z"
content="""
What is the potential time-frame for this change? As I am not using git-annex for production yet, I can see myself waiting to avoid any potential hassle.
Supporting generic metadata seems like a great idea. Though if you are going down this path, wouldn't it make sense to avoid metastore for mtime etc and support this natively without outside dependencies?
-- RichiH
"""]]

View file

@ -0,0 +1,12 @@
[[!comment format=mdwn
username="http://joey.kitenet.net/"
nickname="joey"
subject="comment 4"
date="2011-03-16T03:22:45Z"
content="""
Well, I spent a few hours playing this evening in the 'reorg' branch in git. It seems to be shaping up pretty well; type-based refactoring in haskell makes this kind of big systematic change a matter of editing until it compiles. And it compiles and the test suite passes. But, so far I've only covered 1., 3., and 4. on the list, and have yet to deal with upgrades.
I'd recommend you not wait before using git-annex. I am committed to providing upgradability between annexes created with all versions of git-annex, going forward. This is important because we can have offline archival drives that sit unused for years. Git-annex will upgrade a repository to the current standard the first time it sees it, and I hope the upgrade will be pretty smooth. It was not bad for the annex.version 0 to 1 upgrade earlier. The only annoyance with upgrades is that it will result in some big commits to git, as every symlink in the repo gets changed, and log files get moved to new names.
(The metadata being stored with keys is data that a particular backend can use, and is static to a given key, so there are no merge issues (and it won't be used to preserve mtimes, etc).)
"""]]