[PATCH] test: add known broken test for indexing html
david at tethera.net
Sat Mar 18 08:08:27 PDT 2017
Jeffrey Stedfast <jestedfa at microsoft.com> writes:
> Base64 encoded inline image data is always within the src attribute
> value of an <img> tag and will always begin with "data:" followed by
> the mime-type and then followed by ";base64," so it's pretty easy to
> While on this topic, why index HTML attribute values at all? Other
> than perhaps some known ones like perhaps the 'alt' value of <img>
> I would argue that the only portion of any HTML that you should be
> indexing at all for searching is the character data between tags.
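For illustration only, the approach suggested above (index the character data between tags, plus a known attribute such as 'alt' on <img>) could be sketched roughly as below. This is not notmuch code: the names TextOnlyExtractor, indexable_text and is_base64_data_uri are hypothetical, it uses Python's standard html.parser, and skipping <script>/<style> content is an extra assumption not made in the thread.

```python
from html.parser import HTMLParser


def is_base64_data_uri(value):
    """True if an attribute value looks like 'data:<mime-type>;base64,...'."""
    head, sep, _ = value.partition(",")
    head = head.lower()
    return bool(sep) and head.startswith("data:") and head.endswith(";base64")


class TextOnlyExtractor(HTMLParser):
    """Collect character data between tags, plus <img alt="..."> values."""

    SKIP = {"script", "style"}  # assumption: their contents are not worth indexing

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img":
            # Only the 'alt' value is kept; 'src' (and any inline base64) is ignored.
            for name, value in attrs:
                if name == "alt" and value:
                    self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def indexable_text(html):
    """Return the text a hypothetical indexer would see for an HTML part."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With this sketch, attribute values never reach the indexer at all, so the "data:" check is only needed if one chose to index some attributes and wanted to filter inline image data out of them.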
I should mention that we also have a fair amount of base64 gunk from
inline PGP signatures. I'm not sure if it's just ugly to look at when
dumping the database terms, or if it actually makes a measurable
difference in time/space usage.
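
As a rough sketch of what skipping that signature gunk might look like before text is handed to an indexer: the helper below drops everything between the ASCII-armor "BEGIN PGP SIGNATURE" and "END PGP SIGNATURE" lines. The function name strip_pgp_signatures is hypothetical, and this is not how notmuch currently behaves; it only illustrates the idea.

```python
BEGIN_SIG = "-----BEGIN PGP SIGNATURE-----"
END_SIG = "-----END PGP SIGNATURE-----"


def strip_pgp_signatures(body):
    """Return body with inline ASCII-armored PGP signature blocks removed."""
    kept = []
    in_signature = False
    for line in body.splitlines():
        marker = line.strip()
        if marker == BEGIN_SIG:
            in_signature = True       # start dropping lines
        elif marker == END_SIG and in_signature:
            in_signature = False      # drop the end marker too, then resume
        elif not in_signature:
            kept.append(line)
    return "\n".join(kept)
```

Comparing database size and indexing time with and without such a filter would be one way to tell whether the difference is measurable or merely cosmetic.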