[PATCH] test: add known broken test for indexing html

David Bremner david at tethera.net
Sat Mar 18 08:04:07 PDT 2017


Jeffrey Stedfast <jestedfa at microsoft.com> writes:

> Hi David,
>
> Base64 encoded inline image data is always within the src attribute value of an <img> tag and will always begin with "data:" followed by the mime-type and then followed by ";base64," so it's pretty easy to spot.
>
> While on this topic, why index HTML attribute values at all, other than a few known ones such as the 'alt' value of <img> tags?
>
> I would argue that the only portion of any HTML that you should be indexing at all for searching is the character data between tags.

We're not currently parsing the HTML, so none of these distinctions are
really available to us. Maybe adding an HTML parser is the right
solution, but it's a bit non-trivial.
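
For illustration only, here is a rough sketch of what Jeffrey is
suggesting: index the character data between tags plus the 'alt' text
of <img> tags, and nothing else. It is not notmuch code. It uses
Python's standard html.parser module, and the names
IndexableTextExtractor and indexable_text are made up for this example.
Skipping the contents of <script> and <style> is an extra assumption on
my part.

```python
from html.parser import HTMLParser


class IndexableTextExtractor(HTMLParser):
    """Collect only the text a search index would plausibly want."""

    # Elements whose character data is code or styling, not readable text
    # (an assumption of this sketch, not something stated in the thread).
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.terms = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # Keep the 'alt' text. The 'src' value, which may be a
            # "data:<mime-type>;base64," URI, is never looked at.
            alt = dict(attrs).get("alt")
            if alt:
                self.terms.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Character data between tags, ignoring whitespace-only runs.
        if self._skip_depth == 0 and data.strip():
            self.terms.append(data.strip())


def indexable_text(html):
    """Return the indexable text of an HTML string, space-separated."""
    parser = IndexableTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return " ".join(parser.terms)
```

Because attribute values other than 'alt' are never collected, inline
base64 image data drops out without any special case for "data:" URIs.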

d

