- Move processDocuments, a function for loading a DOM representation of a document or set of documents, to Scholar.Utilities.HTTP
- Add Scholar.Ingester.ingestURL, a simplified function to scrape a URL (closes#33)
- Make Amazon scraper work with multiple documents
- Fix bugs in processDocuments
- Make Scholar.Ingester.Utilities.getItemArray() willing to take an array of DOM nodes to search for links, and finally take advantage of the fact that objects have no length
- Multiple item detection code is now a part of the scraperJavaScript, rather than the scrapeDetectCode, and code to choose which items to add is part of Scholar.Ingester.Utilities, accessible from inside scrapers. The alternative approach would result in one request (or, in the case of JSTOR, three requests) per new item, while in some cases (e.g. Voyager) only one request is necessary to get all of the items.
- When possible, corporate creators/contributors are categorized with their own RDF types (prefixDummy + "corporateCreator/corporateContributor)
- Remove extraneous debug code in extensions
- Ingester lets callback function save items, rather than saving them itself.
- Better handling of multiple items in API, although no scrapers currently implement this.
- Removed localLastUpdated field from scrapers table and renamed centralLastUpdated to lastUpdated; updated scraper queries accordingly
- Added query in scrapers.sql to update version table 'repository' row to prevent immediate downloads of newly installed scrapers
- Get version property from extension manager in Scholar.init() and assign to Scholar.version
Assigned guids to scrapers, replaced INSERT queries with REPLACE queries, and removed table DELETE query at top -- this will allow scrapers to be updated without deleting any others that may exist (e.g. that someone is developing, third-party, etc.)
Integrated the scrapers with the schema update mechanism. Changed a bunch of schema methods to handle both schema.sql and scrapers.sql (or others, if need be) and altered the version table to track mu
ltiple versions for different files. This theoretically should detect that the version table has changed and force a reinitialization of the DB--let me know if there are problems.
- Implemented loadDocument API, for loading and parsing the DOMs of HTML documents in the background
- Added scraper code to SVN repository (now includes 12 scrapers, see Writeboard for details)
To update to the latest versions of all scrapers, ensure you have an up-to-date version of sqlite3, then run:
sqlite3 ~/Library/Application\ Support/Firefox/Profiles/profileName/scholar.sqlite < scrapers.sql