There are currently two types of fulltext searching: an SQL-based word index and a file scanner. They each have their advantages and drawbacks.
The word index is very fast to search and is currently used for the find-as-you-type quicksearch. However, indexing files takes some time, so we should probably offer a preference to turn it off ("Index attachment content for quicksearch" or something). There's also an issue with Chinese characters (which are indexed by character rather than word, since there are no spaces to go by, so a search for a word with common characters could produce erroneous results). The quicksearch doesn't use a left-bound index (since that would probably upset German speakers searching for "musik" in "nachtmusik," though I don't know for sure how they think of words) but still seems pretty fast.
* Note: There will be a potentially long delay when you start Firefox with this revision as it builds a fulltext word index of your existing items. We obviously need a notification/option for this. *
The file scanner, used in the Attachment Content condition of the search dialog, offers phrase searching as well as regex support (both case-sensitive and not, and defaulting to multiline). It doesn't require an index, though it should probably be optimized to use the word index, if available, for narrowing the results when not in regex mode. (It does only scan files that pass all the other search conditions, which speeds it up considerably for multi-condition searches, and skips non-text files unless instructed otherwise, but it's still relatively slow.)
Both convert HTML to text before searching (with the exception of the binary file scanning mode).
There are some issues with which files get indexed and which don't that we can't do much about and that will probably confuse users immensely. Dan C. suggested some sort of indicator (say, a green dot) to show which files are indexed.
Also added (very ugly) charset detection (anybody want to figure out getCharsetFromString(str)?), a setTimeout() replacement in the XPCOM service, an arrayToHash() method, and a new header to timedtextarea.xml, since it's really not copyright CHNM (it's really just a few lines off from the toolkit timed-textbox binding--I tried to change it to extend timed-textbox and just ignore Return keypress events so that we didn't need to duplicate the Mozilla code, but timed-textbox's reliance on html:input instead of html:textarea made things rather difficult).
To do:
- Pref/buttons to disable/clear/rebuild fulltext index
- Hidden prefs to set maximum file size to index/scan
- Don't index words of fewer than 3 non-Asian characters
- MRU cache for saved searches
- Use word index if available to narrow search scope of fulltext scanner
- Cache attachment info methods
- Show content excerpt in search results (at least in advanced search window, when it exists)
- Notification window (a la scraping) to show when indexing
- Indicator of indexed status
- Context menu option to index
- Indicator that a file scanning search is in progress, if possible
- Find other ways to make it index the NYT front page in under 10 seconds
- Probably fix lots of bugs, which you will likely start telling me about...now.
- select() new collections and saved searches on creation
- Show all collections in search dialog Collection condition drop-down, not just top-level ones -- eventually there should probably be some sort of level indication to show hierarchy
- Don't allow a search to define itself as a savedSearchID condition. Right.
- Fixed a couple bugs in Collection condition creation that would've made such searches often not work
- Added saved search 'add' support to collectionTreeView notify(), and send the right trigger from Scholar.Search
Implemented isBefore and isAfter operators and added to date conditions -- currently have to type dates in SQL format, but we'll have a chooser or something by Beta 2 (this should probably be a known issue)
Added isLessThan and isGreaterThan to 'pages', 'section', 'accessionNumber', 'seriesNumber', and 'issue' fields, though somebody should probably check me on that list
Removed JS strict warning on empty search results
- Added 'fulltext' condition as shortcut to add an operator/value against all string-based conditions -- can be used for quicksearch within a view if collectionID/savedSearchID is added as a required condition along with it
Note that the 'fulltext' condition isn't stored internally and addCondition() doesn't return a searchConditionID for it, so it's not really meant to be used for saved searches.
Example:
var search = new Scholar.Search();
search.addCondition('collectionID', 'is', 6856, true);
search.addCondition('fulltext', 'contains', 'wellman');
Scholar.debug(search.search());
- Fixed isNot/doesNotContain for items table fields and collectionID
Conditions can now offer drop-down menus rather than freeform text fields -- implemented for collections/saved searches and item types
Special handling to combine collections and saved searches into a single "Collection" menu (can we get away with calling them "Smart Collections"?) -- internally, values are stored as C1234 or S5678 in the interface and converted to/from regular collectionID and savedSearchID conditions for search.js
Use localized strings for conditions (tries searchConditions.* first, then itemFields.*)
Alphabetize condition list
Operator menu now fixed length for all conditions
- Support for multiple id collection/search remove/modify notifications
- New method Scholar.Searches.get(id) to get 'id' and 'name' for a particular savedSearchID
Conditions in ANY queries can be made required by passing 'true' as an extra parameter to addCondition() and updateCondition() -- this can be used for limiting ANY queries to particular collections (in place of the removed 'context' condition), but if there was an elegant way to expose it to the user for all ANY queries, it's something users might find very useful.
- Implemented 'collectionID' and 'savedSearchID' conditions (a.k.a. search within a search) and removed special 'context' condition. Per my conversation with Dan, the 'recursive' flag is now a global flag that applies to all specified collectionIDs, which is less than ideal but probably better than the alternatives (e.g. having condition-specific recursive checkboxes). It does mean, however, that a "Search subfolders" checkbox is irrelevant if there are no collectionID conditions and should probably be greyed out until applicable.
Another side effect is that it's no longer possible to do an ANY search and return results only within a specific folder (though it can now be done by putting the ANY conditions in a subsearch). Since ANY searches are always annoying in this regard, what I might do is add a way to mark particular conditions as required even in ANY mode, which would allow for quite a lot of flexibility...
Note also that while 'collectionID' and 'savedSearchID' are standard conditions, they should probably be combined into a single condition on the interface side (like playlists and smart playlists under just 'Playlist' in iTunes).
- Now skips invalid/obsolete saved conditions in load()
- Remaining searchConditionIDs are no longer affected by removeCondition() (i.e. they now act like autoincrements), which should make interface code simpler
- Changed default join mode to ALL
- Fixed loading of saved searches with no search conditions
Implemented advanced/saved search architecture -- to use, you create a new search with var search = new Scholar.Search(), add conditions to it with addCondition(condition, operator, value), and run it with search(). The standard conditions with their respective operators can be retrieved with Scholar.SearchConditions.getStandardConditions(). Others are for special search flags and can be specified as follows (condition, operator, value):
'context', null, collectionIDToSearchWithin
'recursive', 'true'|'false' (as strings!--defaults to false if not specified, though, so should probably just be removed if not wanted), null
'joinMode', 'any'|'all', null
For standard conditions, currently only 'title' and the itemData fields are supported -- more coming soon.
Localized strings created for the standard search operators
API:
search.setName(name) -- must be called before save() on new searches
search.load(savedSearchID)
search.save() -- saves search to DB and returns a savedSearchID
search.addCondition(condition, operator, value)
search.updateCondition(searchConditionID, condition, operator, value)
search.removeCondition(searchConditionID)
search.getSearchCondition(searchConditionID) -- returns a specific search condition used in the search
search.getSearchConditions() -- returns search conditions used in the search
search.search() -- runs search and returns an array of item ids for results
search.getSQL() -- will be used by Dan for search-within-search
Scholar.Searches.getAll() -- returns an array of saved searches with 'id' and 'name', in alphabetical order
Scholar.Searches.erase(savedSearchID) -- deletes a given saved search from the DB
Scholar.SearchConditions.get(condition) -- get condition data (operators, etc.)
Scholar.SearchConditions.getStandardConditions() -- retrieve conditions for use in drop-down menu (as opposed to special search flags)
Scholar.SearchConditions.hasOperator() -- used by Dan for error-checking