`waitForDialog()` now returns a regular window, and
`window.document.documentElement.textContent` includes all form
elements, so this updates a test to include the checkbox label.
And fix magic numbers for content-type sniffing, which wrongly used the
Unicode replacement character (which likely just meant we were falling
back to file-extension-based detection)
Use the new PageData mechanism for character set detection, don't try to
index HTML files directly without properly detecting the charset, and
generally simplify the indexing code.
HTML files are now considered cached files that require indexing and
won't be indexed automatically in Zotero.FullText.findTextInItems(),
which breaks certain expectations, including in some tests. This will
need to be addressed.
Remove Zotero.Browser and add HiddenBrowser.jsm. Post-Fission, web/file
content loads in a separate process, so it's not possible (as best as I
can tell) to directly access the contents of a hidden browser -- it just
appears as about:blank in the parent process. We now use Mozilla's
JSWindowActor mechanism [1] to get page data, including character set
and body text for full-text indexing. We'll have to evaluate other uses
of hidden browsers to see how to handle them.
This also adds include.jsm for loading the Zotero object into a JSM.
[1] https://firefox-source-docs.mozilla.org/dom/ipc/jsactors.html
listbox is gone, but richlistbox is still here as a custom element and
works fine for cases where we don't need virtualization.
groupbox label and richlistitem styles should probably be copied to
somewhere global once tuned a bit.
Now it:
1. Strips punctuation at the beginning, no matter what it is.
2. Strips non-dash punctuation in other positions.
3. Trims the result.
This should better prevent numerical ranges from being joined into a
single number that ends up incorrectly being sorted to the very bottom.
Follow-up to 58f515058 with a better approach: if no full-text cache
file, just get text directly without indexing. In the one existing use
of `attachmentText`, attachment merging, this is better anyway, because
we might be deleting the file, so there's no point wasting time
inserting words into the database.
We follow a different merge procedure for each attachment type:
- For PDF attachments, compare by MD5. If no match, get the top 50 words
in the attachment's text and hash those, then check again for a match.
Update references to item keys in notes and annotations.
- For web (snapshot / link) attachments, compare by title and URL.
Prefer a title + URL match but accept a title-only match.
- For other attachment types, keep all attachments from all items being
merged.
Also:
- Move most merge tests from Duplicates to Items#merge(). It just doesn't
make sense to worry about the UI in these.