git-annex

Author	SHA1	Message	Date
Joey Hess	381f203d1a	refactor Avoiding using a callback simplifies this and should make it easier to implement incremental checksumming, which will need to happen partly in writeRetrievedContent and partly in retrieveChunks.	2021-02-16 16:03:28 -04:00
Joey Hess	48310f2d55	windows build fix from jwodder	2021-02-15 13:35:01 -04:00
Joey Hess	f44d4704c6	incremental checksum for local remotes This benchmarks only slightly faster than the old git-annex. Eg, for a 1 gb file, 14.56s vs 15.57s. (On a ram disk; there would certainly be more of an effect if the file was written to disk and didn't stay in cache.) Commenting out the updateIncremental calls makes the same run in 6.31s. May be that overhead in the implementation, other than the actual checksumming, is slowing it down. Eg, MVar access. (I also tried using 10x larger chunks, which did not change the speed.)	2021-02-10 16:05:24 -04:00
Joey Hess	48f63c2798	stop using rsync in fileCopier This is groundwork for calculating checksums while copying, rather than in a separate pass, but that's not done yet. For now, avoid using rsync (and cp on Windows), and instead read and write the file ourselves, with resume handling. Benchmarking vs old git-annex that used rsync, this is faster, at least once the file size is larger than a couple of MB.	2021-02-10 14:44:35 -04:00
Joey Hess	c4c9b99e22	refactoring	2021-02-10 13:38:45 -04:00
Joey Hess	e24ddb8946	Bugfix: fsck --from a ssh remote did not actually check that the content on the remote is not corrupted Changing to the P2P protocol broke this, because preseedTmp copies the local copy of the object to the temp file, and then the P2P transfer sees the right length file and uses it as-is. When git-annex-shell is too old and rsync is used, it did verify the content, and when the local repo does not have the object it did verify the content.	2021-02-10 13:29:12 -04:00
Joey Hess	1c75364eac	fix missing call to check after hard linking This could perhaps have caused a hard link to be made when the content of the object was modified. I don't think that actually happened, because the annexed file would have to be unlocked, with annex.thin, for the object to get modified, and in that case, a hard link is not made. However, to be sure, run the check. Note that it seemed best to run the check only once, although the current implementation is fast and safe to run repeatedly.	2021-02-10 13:07:38 -04:00
Joey Hess	62e152f210	incremental checksum on download from ssh or p2p Checksum as content is received from a remote git-annex repository, rather than doing it in a second pass. Not tested at all yet, but I imagine it will work! Not implemented for any special remotes, and also not implemented for copies from local remotes. It may be that, for local remotes, it will suffice to use rsync, rely on its checksumming, and simply return Verified. (It would still make a checksumming pass when cp is used for COW, I guess.)	2021-02-09 17:03:27 -04:00
Joey Hess	fa3d71d924	Tahoe: Avoid verifying hash after download, since tahoe does sufficient verification itself See my comment in the next commit for some details about why Verified needs a hash with preimage resistance. As far as tahoe goes, it's fully cryptographically secure. I think that bup could also return Verified. However, the Retriever interface does not currently support that.	2021-02-09 13:42:16 -04:00
Joey Hess	3a66cd715f	avoid making absolute git remote path relative When a git remote is configured with an absolute path, use that path, rather than making it relative. If it's configured with a relative path, use that. Git.Construct.fromPath changed to preserve the path as-is, rather than making it absolute. And Annex.new changed to not convert the path to relative. Instead, Git.CurrentRepo.get generates a relative path. A few things that used fromAbsPath unnecessarily were changed in passing to use fromPath instead. I'm seeing fromAbsPath as a security check, while before it was being used in some cases when the path was known absolute already. It may be that fromAbsPath is not really needed, but only git-annex-shell uses it now, and I'm not 100% sure that there's not some input that would cause a relative path to be used, opening a security hole, without the security check. So left it as-is. Test suite passes and strace shows the configured remote url is used unchanged in the path into it. I can't be 100% sure there's not some code somewhere that takes an absolute path to the repo and converts it to relative and uses it, but it seems pretty unlikely that the code paths used for a git remote would call such code. One place I know of is gitAnnexLink, but I'm pretty sure that git remotes never deal with annex symlinks. If that did get called, it generates a path relative to cwd, which would have been wrong before this change as well, when operating on a remote.	2021-02-08 13:18:01 -04:00
Joey Hess	dd39e9e255	suggest when user may want annex.stalldetection When annex.stalldetection is not enabled, and a likely stall is detected, display a suggestion to enable it. Note that the progress meter display is not taken down when displaying the message, so it will display like this: 0% 8 B 0 B/s Transfer seems to have stalled. To handle stalling transfers, configure annex.stalldetection 0% 10 B 0 B/s Although of course if it's really stalled, it will never update again after the message. Taking down the progress meter and starting a new one doesn't seem too necessary given how unusual this is, also this does help show the state it was at when it stalled. Use of uninterruptibleCancel here is ok, the thread it's canceling only does STM transactions and sleeps. The annex thread that gets forked off is separate to avoid it being canceled, so that it can be joined back at the end. A module cycle required moving from dupState the precaching of the remote list. Doing it at startConcurrency should cover all the cases where the remote list is used in concurrent actions. This commit was sponsored by Kevin Mueller on Patreon.	2021-02-03 15:57:19 -04:00
Joey Hess	1b63132ca3	add searchPathContents And rename related functions for consistency.	2021-02-02 19:06:15 -04:00
Joey Hess	b372d962ae	Added GETGITREMOTENAME to external special remote protocol	2021-01-26 12:42:47 -04:00
Joey Hess	b63e3118d7	fix export overwrite on FAT Don't accept the cid of the temp file that the content has just been written to as something we will accept if another file has that same content. There's no reason to, and on FAT, due to mtime resolution, the test suite hit just such a case. This fixes a reversion from `73df633a62` which removed inode from the ContentIdentifier.	2021-01-25 13:31:17 -04:00
Joey Hess	73df633a62	omit inode from ContentIdentifier for directory special remote Directory special remotes with importtree=yes now avoid unnecessary overhead when inodes of files have changed, as happens whenever a FAT filesystem gets remounted. A few unusual edge cases of modifications won't be detected and imported. I think they're unusual enough not to be a concern. It would be possible to add a config setting that controls whether to compare inodes too, but does not seem worth bothering the user about currently. I chose to continue to use the InodeCache serialization, just with the inode zeroed. This way, if I later change my mind or make it configurable, can parse it back to an InodeCache and operate on it. The overhead of storing a 0 in the content identifier log seems worth it. There is a one-time cost to this change; all directory special remotes with importtree=yes will re-hash all files once, and will update the content identifier logs with zeroed inodes. This commit was sponsored by Brett Eisenberg on Patreon.	2021-01-19 13:15:07 -04:00
Joey Hess	e7134ca1eb	avoid partial functions in Git.Url After the last commit, it was able to throw errors just due to an unparseable url. This avoids needing to worry about that, as long as the call site has already checked that it has a parseable url.	2021-01-18 15:07:23 -04:00
Joey Hess	2aa4fab62a	avoid crashing when there are remotes using unparseable urls Including the non-standard URI form that git-remote-gcrypt uses for rsync. Eg, "ook://foo:bar" cannot be parsed because "bar" is not a valid port number. But git could have a remote with that, it would try to run git-remote-ook to handle it. So, git-annex has to allow for such things, rather than crashing. This commit was sponsored by Luke Shumaker on Patreon.	2021-01-18 14:59:08 -04:00
Joey Hess	c8b1fa67b4	Behavior change: --trust-glacier option no longer overrides trust Since that can lead to data loss, which should never be enabled by an option other than --force. This commit was sponsored by Jake Vosloo on Patreon.	2021-01-07 10:37:43 -04:00
Joey Hess	e10855e723	don't support dropping from thirdPartyPopulated for now This code I'm reverting works. But it has a problem: The export db and log and the ContentIdentifier db and log still list the content as being stored in the remote. So when I ran borg create again and stored the content in borg again in a new archive, git-annex sync noticed that, but since it didn't update the tree for the old archives, it then thought the content that had been removed from them was still in them, and so git-annex get failed in an ugly way: Include pattern 'tmp/x/.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855' never matched. [2020-12-28 16:40:44.878952393] process [933616] done ExitFailure 1 user error (borg ["extract","/tmp/b::abs4","tmp/x/.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"] exited 1) It does not seem worth it to update the git tree for the export when dropping content, that would make drop of many files very expensive in git tree objects created. So, let's not support this I suppose..	2020-12-28 16:48:38 -04:00
Joey Hess	69d4b84501	support removing objects from borg	2020-12-28 16:36:52 -04:00
Joey Hess	bfdaee234f	support removal from thirdPartyPopulated also some other fixes to thirdPartyPopulated	2020-12-28 16:36:26 -04:00
Joey Hess	b16e6fb4e6	borg appendonly config	2020-12-28 16:23:38 -04:00
Joey Hess	0990d74574	revert recent lockContent change lockContent should only be done when it's versioned	2020-12-28 16:05:14 -04:00
Joey Hess	36133f27c0	move untrust forcing from Logs.Trust into Remote No behavior changes here, but this is groundwork for letting remotes such as borg vary untrust forcing depending on configuration.	2020-12-28 15:22:10 -04:00
Joey Hess	5ce7fce74a	simplify adjustExportImport' is never called with both isexport and isimport False.	2020-12-28 15:06:47 -04:00
Joey Hess	46059ab0e5	split off versionedExport from appendonly S3 uses versionedExport, while GitLFS uses appendonly. This is groundwork for later changes.	2020-12-28 14:37:15 -04:00
Joey Hess	2e72590a48	avoid using export method when the remote only supports import	2020-12-23 13:40:56 -04:00
Joey Hess	e3d356fe84	borg: add subdir= config Note that, after changing it with enableremote, syncing won't rescan known archives in the borg repo using the changed config. Probably not a problem? Also used File in some places where filenames that could theoretically start with - are passed to borg, to avoid it confusing them with options.	2020-12-23 13:12:11 -04:00
Joey Hess	4254e2297d	implement retrieveExportWithContentIdentifier Moved out an XXX to a todo This seems about ready to merge..	2020-12-22 16:16:48 -04:00
Joey Hess	a9d639c5b5	borg can prompt	2020-12-22 15:48:17 -04:00
Joey Hess	df4942e179	notice when an archive that was seen before gets deleted	2020-12-22 15:45:06 -04:00
Joey Hess	523b7143e0	implemented checkPresentExportWithContentIdentifier	2020-12-22 15:34:41 -04:00
Joey Hess	4f9969d0a1	optimisation for borg Skip needing to list importable contents when unchanged since last time.	2020-12-22 15:00:05 -04:00
Joey Hess	e1ac42be77	convert listImportableContents to throwing exceptions	2020-12-22 14:24:29 -04:00
Joey Hess	5d8e4a7c74	avoid borg list of archives that have been listed before This makes sync a lot faster in the common case where there's no new backup. There's still room for it to be faster. Currently the old imported tree has to be traversed, to generate the ImportableContents. Which then gets turned around to generate the new imported tree, which is identical. So, it would be possible to just return a "no new imports", or an ImportableContents that has a way to graft in a tree. The latter is probably too far to go to optimise this, unless other things need it. The former might be worth it, but it's already pretty fast, since git ls-tree is pretty fast.	2020-12-22 14:06:40 -04:00
Joey Hess	7f7094a7cb	include borg archive name in tree, use empty ContentIdentifier It's unusual to use a ContentIdentifier that is not semi-unique for different contents. Note that in importKeys, it checks if a content identifier is one that's known before, to avoid downloading the same content twice. But that's done in a code path not used for borg repos, because they are thirdpartypopulated.	2020-12-22 11:53:00 -04:00
Joey Hess	bcd55b365c	import from borg is basically working Still some issues to deal with, see TODO and XXX. Here's what gets logged, for each key: cid log: 1608582045.832799227s 6720ebad-b20e-4460-a8f2-2477361aea75 !MjAyMC0xMi0yMVQxMTozMzoxNw==:!MjAyMC0xMi0yMVQxMzowNzoyNg== The "!Mj" are base64 encoded borg archive names, since mine were dates and contained some characters not allowed in cid logs unescaped. There were archives that each contained the key. This list will grow as more borg backups are done and learned about. tree generated: 120000 blob 5ef6a4615c084819b44cd4e3a31657664ddf643b x/dotgit/annex/objects/06/mv/SHA256E-s30--a5d8532e64ec28f5491e25e7a6c1cb68f80507c1be6c1b35f8ec53d25413e5da/SHA256E-s30--a5d8532e64ec28f5491e25e7a6c1cb68f80507c1be6c1b35f8ec53d25413e5da 120000 blob 063a139d3021c8db60f5c576d29fada2b824d91c x/dotgit/annex/objects/72/PP/SHA256E-s30--e80b09a854b4e4d99a76caaa6983b34272480e0b4fdb95d04234a54b4849b893/SHA256E-s30--e80b09a854b4e4d99a76caaa6983b34272480e0b4fdb95d04234a54b4849b893 120000 blob b53b54916fd6abf21fedf796deca08d5ac7a75af x/dotgit/annex/objects/Ww/pk/SHA256E-s30--6aac072a8ebf02a5807c4f15e77ed585a6c87b3b333ba625a3c8d6b4dc50a9f2/SHA256E-s30--6aac072a8ebf02a5807c4f15e77ed585a6c87b3b333ba625a3c8d6b4dc50a9f2 This commit was sponsored by Denis Dzyubenko on Patreon.	2020-12-21 16:37:55 -04:00
Joey Hess	15000dee07	improve thirdpartypopulated support May actually work now. Note that, importKey now has to add the size to the key if it's supposed to have size. Remote.Directory relied on the importer adding the size, which is no longer done, so it was changed; it was the only one. This way, importKey does not need to behave differently between regular and thirdpartypopulated imports.	2020-12-21 16:19:44 -04:00
Joey Hess	706e2a63fb	fix logic error in thirdPartyPopulated handling	2020-12-21 13:24:07 -04:00
Joey Hess	ca31d7e54f	refactor That code was not borg specific, and I can see making more remotes for other backup software.	2020-12-18 17:08:44 -04:00
Joey Hess	1c054f1cf7	started borg special remote Still need to implement 3 methods, but importKeyM looks like it will work well to find annex object files.	2020-12-18 16:56:54 -04:00
Joey Hess	3207e8293b	start borg special remote Compiles, but unusable so far.	2020-12-18 16:03:51 -04:00
Joey Hess	9a2c8757f3	add thirdPartyPopulated interface This is to support, eg a borg repo as a special remote, which is populated not by running git-annex commands, but by using borg. Then git-annex sync lists the content of the remote, learns which files are annex objects, and treats those as present in the remote. So, most of the import machinery is reused, to a new purpose. While normally importtree maintains a remote tracking branch, this does not, because the files stored in the remote are annex object files, not user-visible filenames. But, internally, a git tree is still generated, of the files on the remote that are annex objects. This tree is used by retrieveExportWithContentIdentifier, etc. As with other import/export remotes, the tree is recorded in the export log, and gets grafted into the git-annex branch. importKey changed to be able to return Nothing, to indicate when an ImportLocation is not an annex object and so should be skipped from being included in the tree. It did not seem to make sense to have git-annex import do this, since from the user's perspective, it's not like other imports. So only git-annex sync does it. Note that, git-annex sync does not yet download objects from such remotes that are preferred content. importKeys is run with content downloading disabled, to avoid getting the content of all objects. Perhaps what's needed is for seekSyncContent to be run with these remotes, but I don't know if it will just work (in particular, it needs to avoid trying to transfer objects to them), so I skipped that for now. (Untested and unused as of yet.) This commit was sponsored by Jochen Bartl on Patreon.	2020-12-18 15:23:58 -04:00
Joey Hess	f930176d6e	change info from export=yes to exporttree=yes and same for import for consistency	2020-12-17 17:06:50 -04:00
Joey Hess	e9db382308	avoid redundant set of a S3 version ID that is already recorded I think this could cause unnecessary changes to the git-annex branch, and retrieveExportWithContentIdentifier is now also used for getting content from importtree=yes remotes, so it would happen more frequently so let's avoid.	2020-12-17 16:49:17 -04:00
Joey Hess	77aedbef8b	fix call to warnExportImportConflict That needs a Remote that has the right export/import set up, not the input Remote, which does not yet.	2020-12-17 16:25:02 -04:00
Joey Hess	f2ecc6e0da	import remotes use ContentIdentifier for getting and checking content This is better than using the equivalent actions for export remotes, especially for getting content, since the ContentIdentifier checking means we can be sure (enough) that the content is valid to not force verification of content. Which allows getting keys of types that cannot be verified. Also, reorganized the internals of adjustExportImport which was becoming very hard to follow. Now it's clear what each method does in each case.	2020-12-17 15:55:31 -04:00
Joey Hess	5946e7136e	force verification after getting file from export remote This way, if annex.verify is disabled, it's still checked, since this is not a key/value store, it has to be checked.	2020-12-17 15:31:22 -04:00
Joey Hess	ceda8c0066	refactor common code	2020-12-17 14:17:09 -04:00
Joey Hess	4d2cd58ee5	provide missing remote actions for importtree only remote Ah, it seemed too easy before when I was implementing importtree only, and it was because all the key-based actions needed to be handled too. Mostly copied from isexport, and this works. It does seem that an import remote could use retrieveExportWithContentIdentifier rather than retrieveExport, and checkPresentExportWithContentIdentifier rather than checkPresentExport, which would both be more accurate.	2020-12-17 13:46:34 -04:00

1371 commits