Commit graph

149 commits

Author SHA1 Message Date
Simon Kornblith
0c24beee3f closes #213, add in-text citation support to citation engine
fixes date and et al. handling bugs in citation engine
permits citation of multiple items in Word integration
2006-08-29 04:24:11 +00:00
Simon Kornblith
7385ba4df2 ABC-CLIO fixes 2006-08-26 21:57:02 +00:00
Simon Kornblith
d3fc9866b9 - add ABC-CLIO (America: History and Life) translator
- fix a potential issue with COinS support
2006-08-26 21:36:49 +00:00
Simon Kornblith
a457cdb493 added New York Times translator 2006-08-26 07:27:02 +00:00
Simon Kornblith
ddb754839c make google scholar translator turn on export automatically 2006-08-26 06:04:29 +00:00
Simon Kornblith
72da6e412e add Google Scholar translator 2006-08-26 05:51:41 +00:00
Simon Kornblith
53aae7751c support FirstSearch databases besides WorldCat 2006-08-26 04:59:30 +00:00
Simon Kornblith
f07cb5a5bc adds an InfoTrac OneFile translator
fixes a bug in ingester progress window handling
2006-08-26 03:50:15 +00:00
Simon Kornblith
0e63958f96 - make proquest work better behind proxies
- improved frame support
2006-08-24 18:00:48 +00:00
Simon Kornblith
04d05548b2 closes #103, figure out how to store captured pages in native export format
fixes ampersands in citation COinS
fixes tags and seeAlso in import/export (should now work for all items)
2006-08-20 04:35:04 +00:00
Simon Kornblith
a55b035761 update scrapers.sql version (oops) 2006-08-19 23:15:38 +00:00
Simon Kornblith
94bd2415da adds short roles to CSL (Ed. instead of Editor)
adds COinS to exported HTML
uses real lists in HTML output
fixes other small citation style issues
2006-08-19 23:14:27 +00:00
Simon Kornblith
26668a6e73 closes #194, EBSCO translator
closes #160, cache regular expressions
closes #188, rewrite MARC handling functions

MARC-based translators should now produce item types besides "book." right now, artwork, film, and manuscript are available. MARC also has codes for various types of audio (speech, music, etc.) and maps.
the EBSCO translator does not yet produce attachments. i sent them an email because their RIS export is invalid (the URLs come after the "end of record" field) and i'm waiting to see if they'll fix it before i try to fix it myself.
the EBSCO translator is unfortunately a bit slow, because it has to make 5 requests in order to get RIS export. the alternative (scraping individual item pages) would be even slower.
regular expression caching can be turned off by disabling extensions.scholar.cacheTranslatorData in about:config. if you leave it on, you'll have to restart Firefox after updating translators.
2006-08-19 18:58:09 +00:00
Simon Kornblith
20486d5053 addresses #103, figure out how to store captured pages in native export format
import/export of file data should work for all file types _except_ snapshots (in this situation, export is working, but import is not yet complete; see #193)
also, fixes a potential security issue that could have allowed malicious web translators to post local data to remote sites (although, given we maintain the central repository and there's no easy way to install a translator, the risk would have been minimal to begin with).
2006-08-18 05:58:14 +00:00
Simon Kornblith
10ba568ee8 closes #39, auto-ingest of associated files (as recognizable)
closes #3, Overflow metadata dumps into "extra" field

add "extra" data where such data is useful and conveniently accessible (not available for XML-based export or MARC formats yet)
add links to permanent URLs
download associated files from full text sources (if extensions.scholar.downloadAssociatedFiles preference is enabled)
fix WorldCat translator
improve InnoPAC translator (it now works on Georgetown search results pages, albeit slowly, because it must first realize the catalog is misconfigured)
tag items from SIRSI and WorldCat
return to putting the full lengths of books into "pages," because some citation styles require it
fix COinS (broken a few revisions ago)
2006-08-17 07:56:01 +00:00
Simon Kornblith
51108446e3 closes #187, make berkeley's library work
closes #186, stop translators from hanging

when a document loads inside a frameset, we now check whether we can scrape each individual frame.
all functions involving tabs have been vastly simplified, because in the process of figuring this out, i discovered Firefox 2's new tab events.
if a translator throws an exception inside loadDocument(), doGet(), doPost(), or processDocuments(), a translate error message will appear, and the translator will not hang
2006-08-15 19:46:42 +00:00
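Note on 51108446e3: the "will not hang" behavior amounts to making sure a completion signal is always sent, even when a translator step throws. A minimal sketch of that pattern follows; `run_translator_step`, `on_done`, and `on_error` are hypothetical names and do not correspond to Scholar's API.

```python
def run_translator_step(step, on_done, on_error):
    """Run one translator step; always signal completion, even on failure."""
    try:
        result = step()
    except Exception as exc:
        on_error(exc)   # surface a translate error to the user
        on_done(False)  # still report completion so the caller is not left waiting
        return None
    on_done(True)
    return result
```

The key design point is that the failure path calls `on_done` as well; without it, whatever is waiting for the translator to finish would wait indefinitely.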
Simon Kornblith
dac5bbb3f3 closes #183, export bibliography to RTF 2006-08-14 21:54:45 +00:00
Simon Kornblith
feff0aa531 closes #53, export to footnote or bibliography
closes #180, make all contextual menu export/create bibliography options work right

also:
- add Chicago Note style output
- unregister RDF data sources from cache after import
2006-08-14 20:34:13 +00:00
Simon Kornblith
bb07710b34 okay, i actually tested it this time, so i'm pretty sure i got all the bugs out. no, i am not just cheating to get closer to the elusive r500. 2006-08-14 05:23:39 +00:00
Simon Kornblith
45f84fb31f actually fix the bug properly this time 2006-08-14 05:15:52 +00:00
Simon Kornblith
a67b8c8b95 oops. saw a bug looking over my diff. 2006-08-14 05:14:16 +00:00
Simon Kornblith
3195a1c382 closes #112, ingested items should be automatically added to selected project
references #178, changes to various date fields

- updates CSL to work with the latest schema. we can now (almost) generate completely valid APA style. the only issue is that there's no syntax for specifying short forms for page and creator type labels.
- updates scrapers to use date field rather than year field.
- removes now-unnecessary translation engine code pertaining to year field.
2006-08-14 05:12:28 +00:00
Simon Kornblith
05edc2a08b rewrote citation support to support new version of CSL schema. bibliographic output is much improved. 2006-08-12 23:23:56 +00:00
Simon Kornblith
4284132db5 update scrapers.sql version 2006-08-12 04:29:59 +00:00
Simon Kornblith
ddb4fc872c remove the doStatus argument from Scholar.Utilities.HTTP 2006-08-12 04:27:49 +00:00
Simon Kornblith
36a402713c rename Scholar.Utilities.Ingester.HTTPUtilities to Scholar.Utilities.Ingester.HTTP for consistency 2006-08-11 16:34:22 +00:00
Simon Kornblith
064ecd17db removes unnecessary pieces of piggy bank API from utilities and updates translators to abide by current translator guidelines 2006-08-11 15:28:18 +00:00
Simon Kornblith
6efd6d2cc4 closes #99, add options for export 2006-08-08 23:00:33 +00:00
Simon Kornblith
3edb6e0286 closes #86, steal EndNote download links
Scholar should now attempt to process citation information from EndNote download links (MIME types application/x-endnote-refer and application/x-research-info-systems). in situations where Scholar cannot process the information, a standard helper app dialog will appear. this behavior is controlled by the preference extensions.scholar.parseEndNoteMIMETypes.
2006-08-08 21:17:07 +00:00
Simon Kornblith
504ebf8996 closes #162, do sniffing for import formats
import should now work regardless of file extensions. this should make #86 (steal EndNote download links) fairly easy to implement.
2006-08-08 02:46:52 +00:00
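Note on 504ebf8996: format sniffing means deciding the import format from the file's content instead of its extension. The sketch below is a rough illustration under stated assumptions: the function name `sniff_format` and the specific heuristics are invented for this example and are not taken from Scholar. It relies only on well-known markers: RIS records begin with a `TY  - ` line, RDF/XML has an `rdf:RDF` root, and MODS documents use `mods` or `modsCollection` elements.

```python
def sniff_format(text):
    """Guess an import format from the start of a file's content (heuristic sketch)."""
    # Ignore a byte-order mark and leading whitespace, and only look at the head.
    head = text.lstrip("\ufeff \t\r\n")[:2048]
    if head.startswith("TY  - "):
        return "RIS"
    if "<rdf:RDF" in head:
        return "RDF"
    if "<mods" in head:  # also matches <modsCollection
        return "MODS"
    return None  # unknown: let the caller fall back to asking the user
```

For example, `sniff_format("TY  - BOOK\nTI  - Example\nER  - \n")` is classified as RIS regardless of what the file is named.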
Simon Kornblith
216f0c7581 closes #83, figure out how to implement OpenURL
closes #76, implement extensible search/retrieval architecture for obtaining metadata

OpenURL COinS lookup is now implemented using a real search architecture system. at the moment, it works with Open WorldCat for books, CrossRef for journal articles (provided the COinS object contains a DOI or an ISSN), and PubMed when a PMID is available.
2006-08-08 01:06:33 +00:00
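Note on 216f0c7581: the choice of lookup service can be pictured as a small dispatch over the fields present in a COinS object. The sketch below is an assumption-laden illustration: `choose_lookup` is a hypothetical name, the priority order is a guess, and identifiers are assumed to arrive as `rft_id` values of the form `info:pmid/...` and `info:doi/...`.

```python
BOOK_FMT = "info:ofi/fmt:kev:mtx:book"
JOURNAL_FMT = "info:ofi/fmt:kev:mtx:journal"

def choose_lookup(fields):
    """Pick a lookup service name for a parsed COinS object, or None if none applies."""
    ids = fields.get("rft_id", [])
    fmt = fields.get("rft_val_fmt", "")
    has_pmid = any(i.startswith("info:pmid/") for i in ids)
    has_doi = any(i.startswith("info:doi/") for i in ids)

    if has_pmid:
        return "PubMed"
    if fmt == BOOK_FMT:
        return "Open WorldCat"
    # CrossRef is only usable when a DOI or an ISSN is present.
    if fmt == JOURNAL_FMT and (has_doi or fields.get("rft.issn")):
        return "CrossRef"
    return None
```

A journal article with neither a DOI nor an ISSN falls through to `None`, matching the limitation described in the commit message.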
Simon Kornblith
6626eba844 addresses #83, figure out how to implement OpenURL
OpenURL lookup now works for books. this means that all that's necessary to add scrapable book metadata to a page is an ISBN, as shown below:

<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info:ofi/fmt:kev:mtx:book&amp;rft.isbn=1579550088"></span>

also, we can now scrape Open WorldCat and Wikipedia Book Sources pages with no specialized code involved.

i'm still looking for a better way of looking up journal article metadata. it's currently implemented with CrossRef, but CrossRef simply will not work without a DOI, and is also incomplete (only holds the last name of the first author).
2006-08-07 05:15:30 +00:00
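Note on 6626eba844: the `title` attribute of a COinS span is an HTML-escaped, URL-encoded key/value string, so reading the ISBN out of it takes two decoding steps. A minimal sketch using only the Python standard library follows; `parse_coins_title` is a name made up for this example and is not how Scholar itself parses COinS.

```python
from html import unescape
from urllib.parse import parse_qs

def parse_coins_title(title_attr):
    """Decode a COinS title attribute into a dict of field -> first value."""
    # Step 1: undo HTML escaping (&amp; -> &). Step 2: split the query string.
    query = unescape(title_attr)
    # parse_qs returns lists; this sketch keeps only the first value per key,
    # which drops repeated fields such as multiple authors.
    return {key: values[0] for key, values in parse_qs(query).items()}

title = ("ctx_ver=Z39.88-2004&amp;rft_val_fmt=info:ofi/fmt:kev:mtx:book"
         "&amp;rft.isbn=1579550088")
fields = parse_coins_title(title)
```

Here `fields["rft.isbn"]` holds the ISBN from the span in the commit message, which is the only piece of data a book lookup needs.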
Simon Kornblith
e3d062a819 fix inappropriately truncated field values in InnoPAC 2006-08-07 01:49:56 +00:00
Simon Kornblith
2b5b65f4dd addresses #83, figure out how to implement OpenURL
adds preliminary support for COinS microformat data. does not yet support COinS where there is only a DOI or ISBN.
2006-08-07 00:30:36 +00:00
Simon Kornblith
c0bab22016 bring scrapers into sync with updated database schema 2006-08-06 17:34:41 +00:00
Simon Kornblith
fc589a37cf closes #131, make import/export symmetrical
all 4 import/export formats currently supported (MODS, Hybrid RDF, Unqualified Dublin Core, and RIS) now work as both import and export translators
2006-08-06 09:34:51 +00:00
Simon Kornblith
9144b56772 addresses #131, make import/export symmetrical
closes #163, make translator API allow creator types besides author

import and export in the multi-ontology RDF format should now work properly. collections, notes, and see also are all preserved. more extensive testing will be necessary later.
2006-08-05 20:58:45 +00:00
Simon Kornblith
b4c8dbe700 closes #157, add database infrastructure for different CSL styles
CSL is stored in a new "csl" table. only metadata relevant to updates and selection (ID, date updated, and title) is stored in columns.
2006-08-03 04:54:16 +00:00
Simon Kornblith
6305e4cada closes #55, export bibliography to printable version
closes #4, Make printable version

- moves functions for creating and deleting hidden browser objects to scholar.js (from ingester.js), since these are necessary for printing as well
- allows saving bibliography in HTML or printing bibliography. style support is not yet complete (pending finalization of 0.9 version of CSL specification).
2006-07-27 23:01:55 +00:00
Simon Kornblith
c64e5c841f closes #78, figure out import/export architecture
closes #100, migrate ingester to Scholar.Translate
closes #88, migrate scrapers away from RDF
closes #9, pull out LC subject heading tags
references #87, add fromArray() and toArray() methods to item objects

API changes:
all translation (import/export/web) now goes through Scholar.Translate
all Scholar-specific functions in scrapers start with "Scholar." rather than the jumbled up piggy bank un-namespaced confusion
scrapers no longer specify items through RDF (the beginning of an item.fromArray()-like function exists in Scholar.Translate.prototype._itemDone())
scrapers can be any combination of import, export, and web (type is the sum of 1/2/4 respectively)
scrapers now contain functions (doImport, doExport, doWeb) rather than loose code
scrapers can call functions in other scrapers or just call the function to translate itself
export accesses items item-by-item, rather than accepting a huge array of items
MARC functions are now in the MARC import translator, and accessed by the web translators

new features:
import now works
rudimentary RDF (unqualified dublin core only), RIS, and MARC import translators are implemented (although they are a little picky with respect to file extensions at the moment)
items appear as they are scraped
MARC import translator pulls out tags, although this seems to slow things down
no icon appears next to the URL when Scholar hasn't detected metadata, since this seemed somewhat confusing

apologies for the size of this diff. i figured if i was going to re-write the API, i might as well do it all at once and get everything working right.
2006-07-17 04:06:58 +00:00
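Note on c64e5c841f: "type is the sum of 1/2/4 respectively" describes a bit field, where import, export, and web each occupy one bit. A short sketch of encoding and decoding such a value follows; the constant and function names are chosen for this example and are not taken from the Scholar source.

```python
# Bit values as given in the commit message: import=1, export=2, web=4.
IMPORT, EXPORT, WEB = 1, 2, 4

def capabilities(translator_type):
    """List the capabilities encoded in a translator type value."""
    names = []
    if translator_type & IMPORT:
        names.append("import")
    if translator_type & EXPORT:
        names.append("export")
    if translator_type & WEB:
        names.append("web")
    return names

# A translator that both imports and exports has type 1 + 2 = 3.
import_export = IMPORT | EXPORT
```

Because each capability is a distinct power of two, summing the values and OR-ing them give the same result, and any combination from 1 to 7 decodes unambiguously.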
Simon Kornblith
d65328c830 adds Biblio/DC/FOAF/PRISM/VCard RDF export type. Bruce D'Arcus, author of CiteProc and co-lead on the OpenOffice bibliographic project, is currently using this as his ontology, and we can unambiguously encode all of our metadata with it.
caveats:
- it's not human readable. mozilla doesn't nest blank nodes, so everything's scattered throughout the file. it would be relatively easy to do post-processing with E4X or even regexps to correct this.
- there's no generic callNumber field, so all callNumbers are encoded as LCC.

adds container creation routines to dataMode rdf

changes Dublin Core export to Unqualified Dublin Core, and removes DC Terms qualifiers
2006-07-07 18:41:21 +00:00
Simon Kornblith
c02666fcd3 add an API for Mozilla's RDF data source, so that import/export translators will be able to create and parse RDF with minimal effort
convert Dublin Core export to new API
2006-07-06 21:55:46 +00:00
Simon Kornblith
b7124bd8c1 ack, update scrapers.sql version info 2006-07-06 03:41:18 +00:00
Simon Kornblith
2d8ed16d88 adds export of tags to MODS.
adds export of seeAlso info and project hierarchy to RDF. for now, this is embedded in the modsCollection root element.

uses nodeIDs for Dublin Core RDF.
2006-07-06 03:39:32 +00:00
Simon Kornblith
c0251085a9 Add export filters for RIS and Dublin Core RDF 2006-07-05 21:44:01 +00:00
Simon Kornblith
8b4a44be0f fixes a bug that made the Google Books translator not appear
adjusts the Google Books translator to work with the latest revision of the site

renames the MODS translator to just MODS, because "Metadata Object Description Schema (MODS)" was too long for the export dialog
2006-06-30 19:21:36 +00:00
Simon Kornblith
77282c3edc - fixes a bug that could result in scrapers using utilities.processDocuments malfunctioning
- fixes a bug that could result in the Scrape Progress chrome thingy sticking around forever
- makes chrome thingy disappear when URL changes or when tabs are switched
2006-06-29 03:22:10 +00:00
Simon Kornblith
cd25ecc034 I swear I've fixed this bug before, but make multiple item ingest work right for InnoPAC 2006-06-29 02:54:37 +00:00
Simon Kornblith
45b9234996 addresses #78, figure out import/export architecture
- changes scrapers table to translators table; all import/export/web translators now belong in this table
- adds Scholar.Translate to handle translation issues. eventually, Scholar.Ingester.Document will become part of this interface
- adds Scholar_File_Interface (in fileInterface.js) to handle UI for export and eventually import. (David, when you have time, please connect Scholar_File_Interface.exportFile to a button.)
- adds an export translator for MODS. all of our metadata, but not our hierarchy (projects, etc.) translates directly and unambiguously into valid MODS. eventually, we can use RDF or another format to handle hierarchy.
- adds utilities.getVersion() and utilities.inArray() for simplified scraper coding
- fixes minor interface issues with the nifty chrome scraping status window
2006-06-29 00:56:50 +00:00
Simon Kornblith
19504e6746 - closes #73, use chrome for "Scraping Progress..." indicator
- multiple and book icons were swapped for Voyager scraper
2006-06-27 02:03:10 +00:00
Simon Kornblith
f1cc809f76 Add a generic scraper that will scrape any website, although it may not always find very much information. It looks at META tags, both Dublin Core and otherwise.
When tags are ready, we can pull out META keywords.
2006-06-26 20:44:45 +00:00
Simon Kornblith
4242c62b1b - Fix redundancy in utilities.js (I accidentally copied and pasted a much larger block of code than I meant to)
- Move processDocuments, a function for loading a DOM representation of a document or set of documents, to Scholar.Utilities.HTTP
- Add Scholar.Ingester.ingestURL, a simplified function to scrape a URL (closes #33)
2006-06-26 20:02:30 +00:00
Simon Kornblith
4535b220db Closes #84, make type icon in toolbar match item about to be scraped. It's not perfect, since to get everything right, we'd need to scrape the page as soon as it appears, but it provides a pretty good indication. Multiple items get the folder icon. If there's a better icon out there, it's pretty straightforward to implement. 2006-06-26 18:05:23 +00:00
Simon Kornblith
a33b119dff grab ISBN from SIRSI 2003+ catalogs 2006-06-26 01:17:29 +00:00
Simon Kornblith
303c6ee68d closes #41, get library call number 2006-06-26 01:08:59 +00:00
Simon Kornblith
d73127b1b3 update modification times 2006-06-25 22:01:04 +00:00
Simon Kornblith
f6b0d9a541 search results scraping for InfoTrac. closes #15 2006-06-25 22:00:20 +00:00
Simon Kornblith
1ec834cef2 Search results scraping for Project MUSE 2006-06-25 21:12:14 +00:00
Simon Kornblith
6a627fad0a Search results scraping for LexisNexis 2006-06-25 20:09:27 +00:00
Simon Kornblith
a48ea7dabf Search results scraping for ProQuest 2006-06-25 19:32:49 +00:00
Simon Kornblith
7402577806 Add search results scraping for History Cooperative 2006-06-25 18:34:23 +00:00
Simon Kornblith
a9c79f6110 Search results scraping for JSTOR 2006-06-25 18:17:00 +00:00
Simon Kornblith
5e73dcdd2e - Search results scraping for WorldCat.
- Make scraperJavaScript run on reload again, because it makes debugging easier
- There's not actually a memory leak in the proxyMonitor code.
2006-06-25 16:13:47 +00:00
Simon Kornblith
9e78d62b13 Better handling of itemTypes, and improved date handling in PubMed scraper. 2006-06-25 05:03:01 +00:00
Simon Kornblith
fd2052e63c Search results scraping for PubMed and Google Books. This marks the end of what I can do with respect to #15 until I'm at home or CHNM, where I'll have access to the gated collections. 2006-06-24 17:33:35 +00:00
Simon Kornblith
260ce80086 - Search results scraping for TLC. This is the last of the library scrapers.
- Minor fixes to ingester utilities.
2006-06-24 15:38:53 +00:00
Simon Kornblith
06cf9e7853 Search results scraping for SIRSI (old versions) 2006-06-24 14:35:05 +00:00
Simon Kornblith
6f19b215f5 Search result scraping for GEAC catalogs 2006-06-23 21:27:32 +00:00
Simon Kornblith
2b58ead7aa Search results scraping for Dynix 2006-06-23 20:53:29 +00:00
Simon Kornblith
2a74e88416 - Make generalized function for finding search results case insensitive
- Scrape DRA search results
2006-06-23 20:09:48 +00:00
Simon Kornblith
8fe72b3e3c Search results scraping for VTLS 2006-06-23 19:22:24 +00:00
Simon Kornblith
641d7054cc - Fixed some bugs in the InnoPAC scraper (search results)
- Made an Aleph search results scraper that works correctly on most sites, and degrades nicely when it doesn't
2006-06-23 17:35:57 +00:00
Simon Kornblith
83c36f330d Scrapable search results for SIRSI 2003+ scraper 2006-06-23 16:17:53 +00:00
Simon Kornblith
9742283389 InnoPAC scraper now handles search results pages 2006-06-23 14:12:34 +00:00
Simon Kornblith
098078627c - Make events listening for DOMContentLoaded listen for load, because DOMContentLoaded does not seem ready for prime time (hey, it's undocumented, what can you expect)
- Make Amazon scraper work with multiple documents
- Fix bugs in processDocuments
- Make Scholar.Ingester.Utilities.getItemArray() willing to take an array of DOM nodes to search for links, and finally take advantage of the fact that objects have no length
2006-06-23 03:02:30 +00:00
Simon Kornblith
b4d65420f3 ...but I forgot to update the timestamp 2006-06-22 20:51:40 +00:00
Simon Kornblith
470f7c463f The Voyager scraper now actually works on the search results page. 2006-06-22 20:50:57 +00:00
Simon Kornblith
3890e5f122 - Made ingester automatically create hidden browser objects, given a window object. This should make things much easier for both David and me.
- Multiple item detection code is now a part of the scraperJavaScript, rather than the scrapeDetectCode, and code to choose which items to add is part of Scholar.Ingester.Utilities, accessible from inside scrapers. The alternative approach would result in one request (or, in the case of JSTOR, three requests) per new item, while in some cases (e.g. Voyager) only one request is necessary to get all of the items.
2006-06-22 15:50:46 +00:00
Simon Kornblith
1b74d0b04a Doh! Forgot to update scraper timestamp. 2006-06-22 02:46:30 +00:00
Simon Kornblith
ca3a0e6e5d Beginnings of search result scraping (does not yet actually do the scraping, but does present the menu) 2006-06-22 02:43:40 +00:00
Simon Kornblith
6d1e447154 - Remove load eventListener after it has been called once
- Capture editors from Google Books
2006-06-21 15:18:18 +00:00
Simon Kornblith
f753c1cc2f Add Google Books scraper 2006-06-21 14:28:51 +00:00
Simon Kornblith
7b08c94437 Remember to update modified dates on changed scrapers. 2006-06-21 13:55:55 +00:00
Simon Kornblith
7d3deb5b9f - Make Scholar.Ingester.Utilities.loadDocument() attach an event handler to load rather than DOMContentLoaded to resolve an issue with the Ex Libris/Aleph scraper (VCU)
- When possible, corporate creators/contributors are categorized with their own RDF types (prefixDummy + "corporateCreator/corporateContributor")
- Remove extraneous debug code in extensions
2006-06-21 01:41:07 +00:00
Simon Kornblith
09d79d6dd7 Fix overly optimistic JSTOR scraper 2006-06-20 17:06:41 +00:00
Simon Kornblith
968348a5d1 Add a scraper for Dublin Core metadata embedded in HTML/XHTML META tags 2006-06-20 16:08:13 +00:00
Simon Kornblith
4c34c592da - Better handling of InnoPAC records not returned by searches 2006-06-18 21:00:43 +00:00
Simon Kornblith
20369f41b3 - Move commonly used scraper functions to ingester.js, rather than re-defining them in each scraper. This breaks Piggy Bank compatibility in our scrapers, but we will still be able to export our scrapers in a Piggy Bank compatible form.
- Better handling of scraper RDF to item mapping.
- Improved date handling. All scrapers now return ISO-style dates when possible.
2006-06-18 19:04:32 +00:00
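Note on 20369f41b3: returning ISO-style dates "when possible" implies a normalizer that converts the formats it recognizes and passes everything else through untouched. The sketch below is illustrative only; `to_iso_date` and the two input formats it accepts are assumptions, not the formats the scrapers actually handle.

```python
import re

# Month lookup by three-letter prefix, so "June", "Jun", and "Jun." all match.
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def to_iso_date(text):
    """Return YYYY-MM-DD for recognized inputs; otherwise return the input unchanged."""
    text = text.strip()
    # Already numeric year-month-day: just zero-pad.
    m = re.fullmatch(r"(\d{4})-(\d{1,2})-(\d{1,2})", text)
    if m:
        year, month, day = (int(part) for part in m.groups())
        return f"{year:04d}-{month:02d}-{day:02d}"
    # "Month D, YYYY" with a full or abbreviated English month name.
    m = re.fullmatch(r"([A-Za-z]+)\.?\s+(\d{1,2}),\s*(\d{4})", text)
    if m:
        month = MONTHS.get(m.group(1)[:3].lower())
        if month:
            return f"{int(m.group(3)):04d}-{month:02d}-{int(m.group(2)):02d}"
    # Not recognized: leave the original string for a human to interpret.
    return text
```

Returning the original string on failure keeps information that a stricter parser would discard, such as "circa 1900".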
Simon Kornblith
3d881eec13 - Make scrapers return standard ISO-style YYYY-MM-DD dates. Still need to work on journal article scrapers.
- Ingester lets callback function save items, rather than saving them itself.
- Better handling of multiple items in API, although no scrapers currently implement this.
2006-06-17 21:21:15 +00:00
Dan Stillman
70216ea2c7 - Added automatic scraper update mechanism (more details on Basecamp: http://chnm.grouphub.com/C2687015)
- Removed localLastUpdated field from scrapers table and renamed centralLastUpdated to lastUpdated; updated scraper queries accordingly

- Added query in scrapers.sql to update version table 'repository' row to prevent immediate downloads of newly installed scrapers

- Get version property from extension manager in Scholar.init() and assign to Scholar.version
2006-06-15 06:13:02 +00:00
Dan Stillman
d42258b168 Changed schema of scrapers table to use single GUID for scraperID
Assigned guids to scrapers, replaced INSERT queries with REPLACE queries, and removed table DELETE query at top -- this will allow scrapers to be updated without deleting any others that may exist (e.g. that someone is developing, third-party, etc.)
2006-06-12 15:43:24 +00:00
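Note on d42258b168: keying the table on a GUID and using REPLACE instead of DELETE plus INSERT lets an update overwrite only the rows it ships. The sketch below demonstrates this with Python's built-in `sqlite3` module; the table is reduced to three columns, and the `label` column and the GUID strings are placeholders invented for this example.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a simplified scrapers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scrapers ("
        " scraperID TEXT PRIMARY KEY,"  # GUID, so REPLACE can match existing rows
        " label TEXT,"                  # hypothetical column for this sketch
        " lastUpdated TEXT)"
    )
    return conn

def install_scraper(conn, guid, label, last_updated):
    """Insert a scraper, or overwrite the row that has the same GUID."""
    conn.execute(
        "REPLACE INTO scrapers (scraperID, label, lastUpdated) VALUES (?, ?, ?)",
        (guid, label, last_updated),
    )

conn = make_db()
install_scraper(conn, "guid-a", "Example A", "2006-06-12")
install_scraper(conn, "guid-local", "Locally developed", "2006-06-01")
# Re-running the update replaces guid-a in place and leaves guid-local alone.
install_scraper(conn, "guid-a", "Example A v2", "2006-06-13")
rows = conn.execute(
    "SELECT scraperID, label FROM scrapers ORDER BY scraperID"
).fetchall()
```

After the second install of `guid-a`, the table still holds two rows: the updated central scraper and the untouched local one.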
Simon Kornblith
076ee0fad2 Add PubMed scraper, fix a few other small bugs 2006-06-08 01:26:40 +00:00
Simon Kornblith
f437917016 Add Project MUSE scraper 2006-06-07 21:26:55 +00:00
Simon Kornblith
cef0b19770 Add TLC/YouSeeMore scraper 2006-06-07 18:44:27 +00:00
Simon Kornblith
1e48189c3b Add SIRSI (old) scraper 2006-06-07 17:44:55 +00:00
Simon Kornblith
07dad8fae9 Add DRA, GEAC scrapers 2006-06-07 16:48:03 +00:00
Dan Stillman
393807b152 This isn't quite done (I'm discussing changing the scrapers schema with Simon to better handle scraper updates) but in the interest of getting the scrapers in for testing, I'll commit this now.
Integrated the scrapers with the schema update mechanism. Changed a bunch of schema methods to handle both schema.sql and scrapers.sql (or others, if need be) and altered the version table to track multiple versions for different files. This theoretically should detect that the version table has changed and force a reinitialization of the DB--let me know if there are problems.
2006-06-07 15:27:21 +00:00
Simon Kornblith
0753d78910 - Add VTLS scraper
- Fix loadDocument/processDocuments (broken by r145)
2006-06-06 21:35:23 +00:00
Simon Kornblith
152c9bf9e7 - Small changes to MARC record support
- Implemented loadDocument API, for loading and parsing the DOMs of HTML documents in the background
- Added scraper code to SVN repository (now includes 12 scrapers, see Writeboard for details)

To update to the latest versions of all scrapers, ensure you have an up-to-date version of sqlite3, then run:
sqlite3 ~/Library/Application\ Support/Firefox/Profiles/profileName/scholar.sqlite < scrapers.sql
2006-06-06 18:25:45 +00:00