git-annex

Author	SHA1	Message	Date
Joey Hess	0f7143d226	support VURL backend Not yet implemented is recording hashes on download from web and verifying hashes. addurl --verifiable option added with -V short option because I expect a lot of people will want to use this. It seems likely that --verifiable will become the default eventually, and possibly rather soon. While old git-annex versions don't support VURL, that doesn't prevent using them with keys that use VURL. Of course, they won't verify the content on transfer, and fsck will warn that it doesn't know about VURL. So there's not much problem with starting to use VURL even when interoperating with old versions. Sponsored-by: Joshua Antonishen on Patreon	2024-02-29 13:48:51 -04:00
Joey Hess	6b38d0c427	addurl, importfeed: Added --raw-except option --raw-except=web allows using yt-dlp but not any other special remotes. Currently this option can only be used once, trying to use it repeatedly will make option parsing fail. Perhaps it ought to support being used more than once, but it seemed like an unlikely use case to need that. Note that getParsed is called repeatedly when the option is used with several urls. While implementing DeferredParseClass would avoid that inefficiency, it didn't seem worth the added boilerplate since getParsed only calls byNameWithUUID which does minimal work. Sponsored-by: Dartmouth College's DANDI project	2024-02-05 15:16:25 -04:00
Joey Hess	2f3fe4d904	fix importfeed --force skip behavior reversion importfeed --force: Don't treat it as a failure when an already downloaded file exists. (Fixes a behavior change introduced in 10.20230626.) `04ee6c4c6b` caused the reversion. Inside a CommandPerform, stop causes it to fail. Before that commit, it was inside a CommandStart, where stop causes it to skip.	2024-02-02 15:57:07 -04:00
Joey Hess	0c64cd30c2	compare urls irrespective of downloader importfeed --force: Avoid creating duplicates of existing already downloaded files when yt-dlp or a special remote was used.	2024-02-02 15:50:56 -04:00
Joey Hess	90db97d9a2	importfeed: Added --scrape option Which uses yt-dlp to screen scrape the equivalent of an RSS feed. Note that youtubedlscraped is a speed optimisation. Since yt-dlp found the urls, we know it can download them. That avoids calling youtubeDlSupported on each url, which makes --fast a lot faster. Almost all the same metadata fields and file formatting fields are populated, when yt-dlp is able to get the data. Note that yt-dlp has some additional useful metadata that could be exposed. But, much of it is specific to particular websites, and it would be hard to document on the git-annex importfeed man page. Sponsored-by: unqueued on Patreon	2024-01-30 15:37:29 -04:00
Joey Hess	d7949f8202	move Feed and Item out of ToDownload This is groundwork for producing ToDownload in other ways, that may not be entirely isomorphic with feeds. Eg by using yt-dlp.	2024-01-30 14:11:26 -04:00
Joey Hess	8bde6101e3	sqlite database for importfeed importfeed: Use caching database to avoid needing to list urls on every run, and avoid using too much memory. Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster, and memory use dropped from 203000k to 59408k. Database.ImportFeed is Database.ContentIdentifier with the serial number filed off. There is a bit of code duplication I would like to avoid, particularly recordAnnexBranchTree, and getAnnexBranchTree. But these use the persistent sqlite tables, so despite the code being the same, they cannot be factored out. Since this database includes the contentidentifier metadata, it will be slightly redundant if a sqlite database is ever added for metadata. I did consider making such a generic database and using it for this. But, that would then need importfeed to update both the url database and the metadata database, which is twice as much work diffing the git-annex branch trees. Or would entangle updating two databases in a complex way. So instead it seems better to optimise the database that importfeed needs, and if the metadata database is used by another command, use a little more disk space and do a little bit of redundant work to update it. Sponsored-by: unqueued on Patreon	2023-10-23 16:46:22 -04:00
Joey Hess	aa5e333cb7	fix whitespace Thanks to a compile warning from new ghc	2023-08-01 18:36:54 -04:00
Joey Hess	7fc6503812	fix waiting for all started feed downloads with -J importfeed bug fix: When -J was used with multiple feeds, some feeds did not get their items downloaded. In my case, I had added a feed to the end of the list, and no items from it were ever downloaded. Sponsored-by: Leon Schuermann on Patreon	2023-07-11 22:08:35 -04:00
Joey Hess	51b24aac91	importfeed: Add feedurl to the metadata (And allow it to be used in the --template although that seems unlikely to be very useful.) My use case for this is that one of the podcast feeds I subscribe to is sometimes leaking episodes of some other podcast. The other podcast is also very close to spam, so this may be a form of intentional spamming. I have not been able to catch the podcast feed containing those episodes, so I don't know which one is at fault. So putting this in the metadata will let me eventually catch it.	2023-07-06 00:11:38 -04:00
Joey Hess	39f3d783fe	consolidate	2023-06-20 15:10:11 -04:00
Joey Hess	2fdf0ae38d	include url in json output The input field is consistently the url of the feed, which makes sense as that is the user input, but to differentiate multiple urls downloaded from a feed when using --json-progress -J, need the url that is being downloaded too. Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project	2023-05-09 16:59:44 -04:00
Joey Hess	7919349cee	importfeed: Support --json and --json-error-messages and --json-progress Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project	2023-05-09 16:51:16 -04:00
Joey Hess	6b54ea69e3	importfeed: Move error to where --json-error-messages can capture it Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project	2023-05-09 16:27:23 -04:00
Joey Hess	04ee6c4c6b	importfeed: Support -J (and work toward supporting --json) Both -J and --json needed importfeed to be refactored to use commandAction. That was difficult, because of the interrelated nature of downloading feeds and then downloading files from feeds, both of which needed to use commandAction. And then checking for problems in feeds has to come after these actions, which may be run as background jobs. As for --json support, it's most of the way there, but still has some warts, so I didn't enable jsonOptions yet. The warts include: - An initial empty json record is displayed by getCache. - Input is not populated, should be feed url - feedProblem at end will not be captured by --json-error-messages (see FIXME) Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project	2023-05-09 16:13:56 -04:00
Joey Hess	8b6c7bdbcc	filter out control characters in all other Messages This does, as a side effect, make long notes in json output not be indented. The indentation is only needed to offset them underneath the display of the file they apply to, so that's ok. Sponsored-by: Brock Spratlen on Patreon	2023-04-11 12:58:01 -04:00
Joey Hess	a0e6fa18eb	eliminate showStart showStartOther These were not handling control characters and are redundant. Sponsored-by: Jack Hill on Patreon	2023-04-10 16:28:58 -04:00
Joey Hess	3290a09a70	filter out control characters in warning messages Converted warning and similar to use StringContainingQuotedPath. Most warnings are static strings, some do refer to filepaths that need to be quoted, and others don't need quoting. Note that, since quote filters out control characters of even UnquotedString, this makes all warnings safe, even when an attacker sneaks in a control character in some other way. When json is being output, no quoting is done, since json gets its own quoting. This does, as a side effect, make warning messages in json output not be indented. The indentation is only needed to offset warning messages underneath the display of the file they apply to, so that's ok. Sponsored-by: Brett Eisenberg on Patreon	2023-04-10 15:55:44 -04:00
Joey Hess	cd544e548b	filter out control characters in error messages giveup changed to filter out control characters. (It is too low level to make it use StringContainingQuotedPath.) error still does not, but it should only be used for internal errors, where the message is not attacker-controlled. Changed a lot of existing error to giveup when it is not strictly an internal error. Of course, other exceptions can still be thrown, either by code in git-annex, or a library, that include some attacker-controlled value. This does not guard against those. Sponsored-by: Noam Kremen on Patreon	2023-04-10 13:50:51 -04:00
Yaroslav Halchenko	84b0a3707a	Apply codespell -w throughout	2023-03-17 15:14:58 -04:00
Joey Hess	2323af3736	importfeed: Display feed title When importing a bunch of feeds, this makes it more clear what it's working on. Also, I sometimes want to delete a particular feed from a list of feeds but don't know which url belongs to the feed, and this solves that. Control characters are filtered out just to protect against some feed putting escape character stuff in the feed, which could be a security problem. (Control characters also get filtered out of importfeed filenames.) Sponsored-by: Luke Shumaker on Patreon	2023-03-11 13:52:45 -04:00
Joey Hess	54ad1b4cfb	Windows: Support long filenames in more (possibly all) of the code Works around this bug in unix-compat: https://github.com/jacobstanley/unix-compat/issues/56 getFileStatus and other FilePath using functions in unix-compat do not do UNC conversion on Windows. Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the necessary conversion on Windows to support long filenames. Audited all imports of System.PosixCompat.Files to make sure that no functions that operate on FilePath were imported from it. Instead, use the equivalents from Utility.RawFilePath. In particular the re-export of that module in Common had to be removed, which led to lots of other changes throughout the code. The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig make Utility.Directory not be needed to build setup. And so let it use Utility.RawFilePath, which depends on unix, which cannot be in setup-depends. Sponsored-by: Dartmouth College's Datalad project	2023-03-01 15:55:58 -04:00
Joey Hess	b2ee2496ee	remove whenAnnexed and ifAnnexed In preparation for adding a new variation on lookupKey. Sponsored-by: Max Thoursie on Patreon	2022-10-26 14:06:32 -04:00
Joey Hess	149d12f188	support --backend again in addurl and importfeed Missed these two when converting from a global option. Sponsored-by: Dartmouth College's Datalad project	2022-07-05 15:35:43 -04:00
Joey Hess	cb9cf30c48	move several readonly values to AnnexRead This improves performance to a small extent in several places. Sponsored-by: Tobias Ammann on Patreon	2022-06-28 15:40:19 -04:00
Joey Hess	7f6b2ca49c	handle overBranchFileContents with read-only unmerged git-annex branches This makes --all error out in that situation. Which is better than ignoring information from the branches. To really handle the branches right, overBranchFileContents would need to both query all the branches and union merge file contents (or perhaps not provide any file content), as well as diffing between branches to find files that are only present in the unmerged branches. And also, it would need to handle transitions. Sponsored-by: Dartmouth College's Datalad project	2021-12-27 14:30:51 -04:00
Joey Hess	01a5ee6998	addurl, youtube-dl: When --check-raw prevents downloading an url, still continue with any downloads that come after it, rather than erroring out Sponsored-By: Mark Reidenbach on Patreon	2021-11-28 19:40:06 -04:00
Joey Hess	1d513540e9	Fix build with old versions of feed library	2021-11-23 16:06:51 -04:00
Joey Hess	31be0770a5	importfeed: Display url before starting youtube-dl download It was displaying a blank line before.	2021-11-17 13:23:55 -04:00
Joey Hess	86fa460ce2	better wording	2021-11-17 12:48:28 -04:00
Joey Hess	332385a117	use parseFeedFromFile to avoid mojibake As mentioned in commit `2bd778a46e`, there was mojibake when LANG=C. Looking at parseFeedFromFile, it is very particular to read the file as unicode. parseFeedString looks like it will accept any old String, but a String that was read using the filesystem encoding will not in fact have the right encoding. I think this is a bug in the feed library and will file one. Sponsored-by: Svenne Krap on Patreon	2021-11-15 15:31:02 -04:00
Joey Hess	2bd778a46e	importfeed: Fix a crash when used in a non-unicode locale See comment for analysis. At first I thought I'd need to convert all T.unpack in git-annex, but luckily not -- so long as the Text is read from a file, the filesystem encoding is applied and T.unpack is fine. It's only when using Feed that the filesystem encoding is not applied. While this fixes the crash, it does result in some mojibake, eg: itemid=http://www.manager-tools.com/2014/01/choosing-a-company-work-chapter-7-��-questions/ Have not tracked that down, but it must be unrelated, because I've verified that it roundtrips when using encodeUtf8: joey@darkstar:~/src/git-annex>LANG=C ghci Utility/FileSystemEncoding.hs ghci> useFileSystemEncoding ghci> Just f <- Text.Feed.Import.parseFeedFromFile "/home/joey/tmp/career_tools_podcasts.xml" ghci> Just (_, x) = Text.Feed.Query.getItemId (Text.Feed.Query.feedItems f !! 0) ghci> decodeBS (Data.Text.Encoding.encodeUtf8 x) "http://www.manager-tools.com/2014/01/choosing-a-company-work-chapter-7-\56546\56448\56467-questions/" ghci> writeFile "foo" $ decodeBS (Data.Text.Encoding.encodeUtf8 x) Writes a file containing the ENDASH character. Sponsored-by: Jochen Bartl on Patreon	2021-11-15 15:04:21 -04:00
Joey Hess	889e771357	display error message if unable to run youtube-dl This would have made the typo of the command name that was just fixed obvious earlier, when --no-raw was used to force using it.	2021-11-13 09:07:43 -04:00
Joey Hess	d154e7022e	incremental verification for web special remote Except when configuration makes curl be used. It did not seem worth trying to tail the file when curl is downloading. But when an interrupted download is resumed, it does not read the whole existing file to hash it. Same reason discussed in commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long time with no progress being displayed. And also there's an open http request, which needs to be consumed; taking a long time to hash the file might cause it to time out. Also in passing implemented it for git and external special remotes when downloading from the web. Several others like S3 are within striking distance now as well. Sponsored-by: Dartmouth College's DANDI project	2021-08-18 15:02:22 -04:00
Joey Hess	b8e32e200e	addurl, importfeed: Added --no-raw option Forces eg, download with youtube-dl without falling back to raw download. Since youtube-dl failing due to an url not being supported is difficult to distinguish from it failing due to being blocked in some way, this can be useful to avoid the fallback of git-annex downloading the raw web page and adding that. Since --raw also prevents using special remotes, --no-raw also allows special remote downloads. Although it's always possible that some special remote may claim an url and fall back to raw download of the content, which --no-raw cannot prevent. Sponsored-by: Boyd Stephen Smith Jr. on Patreon	2021-06-27 11:14:51 -04:00
Joey Hess	0547884eb2	importfeed: fix bug while also speeding up 12x! * Fix bug that could make git-annex importfeed not see recently recorded state when configured with annex.alwayscommit=false. * importfeed: Made "checking known urls" phase run 12 times faster. The massive speedup is because it no longer queries for metadata accompanying each url. Instead it processes the whole git-annex branch and checks all metadata files for feed item ids, and uses any it finds. This could result in a behavior change, in an unlikely situation: If a feed id is recorded in a key's metadata, but the url gets removed, the old code would not see that item id and would re-download it if it finds an url for it in a feed, while the new code will see the item id. I don't think the old behavior was intentional, and it may be that the new behavior is better. Not gonna worry about this.	2021-04-23 12:36:56 -04:00
Joey Hess	b689f17062	refactoring	2021-04-23 11:44:10 -04:00
Joey Hess	13c090b37a	use fastDebug everywhere it can be used None of these are likely to yield a noticeable speedup though.	2021-04-06 15:41:24 -04:00
Joey Hess	1b645e1ace	added --debugfilter (and annex.debugfilter)	2021-04-05 15:31:10 -04:00
Joey Hess	aaba83795b	switch from hslogger to purpose-built Utility.Debug This uses a DebugSelector, rather than debug levels, which will allow for a later option like --debug-from=Process to only see debugging about running processes. The module name that contains the thing being debugged is used as the DebugSelector (in most cases; does not need to be a hard and fast rule). Debug calls were changed to add that. hslogger did not display that first parameter to debugM, but the DebugSelector does get displayed. Also fastDebug will allow doing debugging in places that are used in tight loops, with the DebugSelector coming from the Annex Reader essentially for free. Not done yet.	2021-04-05 13:40:31 -04:00
Joey Hess	ed68a2166d	importfeed: Avoid using youtube-dl when a feed does not contain an enclosure, but only a link to an url which youtube-dl does not support This is common in some feeds, which might mix some items with enclosures, with others that link to posts or whatever. Before this, it would try to use youtube-dl and fail, or if youtube-dl was not allowed, it would incorrectly complain that an url was supported by youtube-dl.	2020-12-15 01:13:21 -04:00
Joey Hess	5a1e73617d	finished this stage of the RawFilePath conversion Finally compiles again, and test suite passes. This commit was sponsored by Brock Spratlen on Patreon.	2020-11-04 14:20:37 -04:00
Joey Hess	4bcb4030a5	more RawFilePath conversion 580/645 This commit was sponsored by Jack Hill on Patreon.	2020-11-03 18:34:27 -04:00
Joey Hess	4c32499e82	Parse youtube-dl progress output Which lets progress be displayed when doing concurrent downloads. Among other things, like --json-progress etc. The youtube-dl output is no longer displayed, except for any errors. This commit was sponsored by Denis Dzyubenko on Patreon.	2020-09-29 17:53:48 -04:00
Joey Hess	1610d94776	addurl: Avoid a redundant git ignores check for speed Ensure that checkCanAdd is used everywhere a file is added to git, so git add is run with -f, presumably avoiding the work it would usually do to check ignores.	2020-09-29 13:00:41 -04:00
Joey Hess	d0b06c17c0	Added --no-check-gitignore option for finer grained control than using --force. add, addurl, importfeed, import: Added --no-check-gitignore option for finer grained control than using --force. (--force is used for too many different things, and at least one of these also uses it for something else. I would like to reduce --force's footprint until it only forces drops or a few other data losses. For now, --force still disables checking ignores too.) addunused: Don't check .gitignores when adding files. This is a behavior change, but I justify it by analogy with git add of a gitignored file adding it, asking to add all unused files back should add them all back, not skip some. The old behavior was surprising. In Command.Lock and Command.ReKey, CheckGitIgnore False does not change behavior, it only makes explicit what is done. Since these commands are run on annexed files, the file is already checked into git, so git add won't check ignores.	2020-09-18 13:19:13 -04:00
Joey Hess	fcf5d11c63	add "input" field to json output The use case of this field is mostly to support -J combined with --json. When that is implemented, a user will be able to look at the field to determine which of the requests they have sent it corresponds to. The field typically has a single value in its list, but in some cases multiple values (eg 2 command-line params) are combined together and the list will have more. Note that json parsing was already non-strict, so old git-annex metadata --json --batch can be fed json produced by the new git-annex and will not stumble over the new field.	2020-09-15 16:22:44 -04:00
Joey Hess	283d2f85d1	importfeed: Fix reversion that caused some '.' in filenames to be replaced with '_' sanitizeFilePath was changed to sanitize leading '.', but ImportFeed was running it on parts of the template. So eg the leading '.' in the extension got sanitized. Note the added case for sanitizeLeadingFilePathCharacter ('/':_) -- this was added because, if the template is title/episode and the title is not set, it would expand to "/episode". So this is another potential security fix.	2020-08-05 11:35:00 -04:00
Joey Hess	7b2d236556	importfeed: stream metadata for 5% speedup On top of the 10% speedup from streaming url logs.	2020-07-14 14:35:26 -04:00
Joey Hess	4229713e63	importfeed: Added some additional --template variables for date and time This commit was sponsored by Ethan Aubin.	2020-06-24 14:24:50 -04:00

147 commits