From e1b88531749e491bdc05ebd829a8224b567f37ea Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Thu, 26 Mar 2015 11:44:20 -0400
Subject: [PATCH] improve import duplicate docs

---
 doc/git-annex-import.mdwn                     | 28 +++++++++++-------
 ..._205ecbc7401f99fc83719acbf5da174e._comment | 26 +++++++++++++++++
 2 files changed, 43 insertions(+), 11 deletions(-)
 create mode 100644 doc/todo/inject_on_import/comment_2_205ecbc7401f99fc83719acbf5da174e._comment

diff --git a/doc/git-annex-import.mdwn b/doc/git-annex-import.mdwn
index 4d2c05547e..43e619607f 100644
--- a/doc/git-annex-import.mdwn
+++ b/doc/git-annex-import.mdwn
@@ -13,11 +13,18 @@ the annex. Individual files to import can be specified.
 If a directory is specified, the entire directory is imported.
 
 	git annex import /media/camera/DCIM/*
-
-By default, importing two files with the same contents from two different
-locations will result in both files being added to the repository.
-(With all checksumming backends, including the default SHA256E,
-only one copy of the data will be stored.)
+
+When importing files, there's a possibility of importing a duplicate
+of a file that is already known to git-annex -- its content is either
+present in the local repository already, or git-annex knows of another
+repository that contains it.
+
+By default, importing a duplicate of a known file will result in
+a new filename being added to the repository, so the duplicate file
+is present in the repository twice. (With all checksumming backends,
+including the default SHA256E, only one copy of the data will be stored.)
+
+Several options can be used to adjust handling of duplicate files.
 
 # OPTIONS
 
@@ -32,19 +39,18 @@ only one copy of the data will be stored.)
 
 * `--deduplicate`
 
-  Only import files whose content has not been seen before by git-annex.
-
-  Duplicate files will be deleted from the import location.
+  Only import files that are not duplicates;
+  duplicate files will be deleted from the import location.
 
 * `--skip-duplicates`
 
-  Only import files whose content has not been seen before by git-annex,
-  but avoid deleting duplicate files.
+  Only import files that are not duplicates; and avoid deleting
+  duplicate files from the import location.
 
 * `--clean-duplicates`
 
   Does not import any files, but any files found in the import location
-  that are duplicates of content in the annex are deleted.
+  that are duplicates are deleted.
 
 * file matching options
 
diff --git a/doc/todo/inject_on_import/comment_2_205ecbc7401f99fc83719acbf5da174e._comment b/doc/todo/inject_on_import/comment_2_205ecbc7401f99fc83719acbf5da174e._comment
new file mode 100644
index 0000000000..acd661feb0
--- /dev/null
+++ b/doc/todo/inject_on_import/comment_2_205ecbc7401f99fc83719acbf5da174e._comment
@@ -0,0 +1,26 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 2"""
+ date="2015-03-26T15:28:45Z"
+ content="""
+Well, you've found an edge case here.
+
+It behaves as documented as long as the file being imported is located in some
+repository known to git-annex. The file content does not have to be present in
+the local repository for it to behave as documented.
+
+In your case, the file being imported has a symlink in the git repo, but
+git-annex knows about 0 annexed copies of the file, so it's treated as
+if it's a new file and not a duplicate.
+
+Since import is working at the key level, there's not a good way to look up
+that there are some symlinks in the git repo even though the content is
+gone. And even if there was, I think I'd be uncomfortable with it deleting
+the file as "duplicate" when its content is not available in any known
+repository. The only behavior improvement might be to import the content
+but not make a redundant symlink in this case.
+
+I think it's best to change the documentation. I've added a new
+paragraph that more exactly and clearly explains what duplicate files
+are for the purposes of importing.
+"""]]
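
The duplicate handling documented by this patch can be read as a small decision table. The Python sketch below is illustrative only and is not git-annex code: `Mode`, `Outcome`, `is_duplicate`, and `plan` are hypothetical names, and `known_copies` is an assumed stand-in for the number of repositories git-annex knows to contain the file's content. It models only what the revised text and the comment state, including the edge case where 0 known copies means the file is treated as new.

```python
from enum import Enum
from typing import NamedTuple


class Mode(Enum):
    """Hypothetical labels for the documented import modes."""
    DEFAULT = "default"
    DEDUPLICATE = "--deduplicate"
    SKIP_DUPLICATES = "--skip-duplicates"
    CLEAN_DUPLICATES = "--clean-duplicates"


class Outcome(NamedTuple):
    imported: bool              # a new filename is added to the repository
    deleted_as_duplicate: bool  # removed from the import location without being imported


def is_duplicate(known_copies: int) -> bool:
    # Assumption: a file counts as a duplicate only when at least one known
    # repository (local or not) contains its content.
    return known_copies > 0


def plan(mode: Mode, known_copies: int) -> Outcome:
    """Return what the documented behavior implies for a single file."""
    dup = is_duplicate(known_copies)
    if mode is Mode.DEFAULT:
        # Duplicates are imported too, under a new filename.
        return Outcome(imported=True, deleted_as_duplicate=False)
    if mode is Mode.DEDUPLICATE:
        return Outcome(imported=not dup, deleted_as_duplicate=dup)
    if mode is Mode.SKIP_DUPLICATES:
        # Duplicates are neither imported nor deleted.
        return Outcome(imported=not dup, deleted_as_duplicate=False)
    if mode is Mode.CLEAN_DUPLICATES:
        # Nothing is imported; only duplicates are deleted.
        return Outcome(imported=False, deleted_as_duplicate=dup)
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `plan(Mode.DEDUPLICATE, 0)` describes the reported edge case: with no known annexed copies, the file is imported as new and is not deleted as a duplicate.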