[PATCH] test: add known broken test for indexing html

David Bremner david at tethera.net
Sat Mar 18 08:08:27 PDT 2017


Jeffrey Stedfast <jestedfa at microsoft.com> writes:

> Base64 encoded inline image data is always within the src attribute
> value of an <img> tag and will always begin with "data:" followed by
> the mime-type and then followed by ";base64," so it's pretty easy to
> spot.
>
> While on this topic, why index HTML attribute values at all? Other
> than perhaps some known ones like perhaps the 'alt' value of <img>
> tags?
>
> I would argue that the only portion of any HTML that you should be
> indexing at all for searching is the character data between tags.
>

I should mention that we also have a fair amount of base64 gunk from
inline PGP signatures. I'm not sure whether it's just ugly to look at
when dumping the database terms, or whether it actually makes a
measurable difference in time/space usage.
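
For illustration only, here is a rough Python sketch of the approach
suggested in the quoted text: keep the character data between tags,
optionally keep the 'alt' value of <img> tags, and ignore every other
attribute value, so "data:...;base64," payloads in src never become
terms. This is not notmuch's indexing code; the names
is_base64_data_uri, TextOnlyExtractor and indexable_text are made up
for the example, and it assumes well-formed enough HTML for Python's
standard html.parser.

```python
import re
from html.parser import HTMLParser

# Assumed shape of an inline image payload, per the description above:
# "data:" + mime-type (possibly with parameters) + ";base64,".
DATA_URI_RE = re.compile(r"^data:[^,]*;base64,", re.IGNORECASE)


def is_base64_data_uri(value):
    """Return True if an attribute value looks like a base64 data URI."""
    return bool(value) and DATA_URI_RE.match(value.strip()) is not None


class TextOnlyExtractor(HTMLParser):
    """Collect character data between tags, plus <img> alt text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # All attribute values are ignored except the alt text of <img>.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_data(self, data):
        # Character data between tags is the only other thing kept.
        text = data.strip()
        if text:
            self.chunks.append(text)


def indexable_text(html):
    """Return the text a hypothetical indexer would see for this HTML."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Since attribute values are never collected, the data URI check is not
needed by the extractor itself; it is included only to show how such
values could be recognized if some attributes were indexed after all.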

d

