Drop HTML tags when indexing

David Bremner david at tethera.net
Sat Mar 25 05:59:20 PDT 2017

David Bremner <david at tethera.net> writes:

> Steven Allen pointed out [2] that the previous scanner [1] was a
> little too simplistic. This version handles (or claims to) quoted
> strings in attributes, which can apparently contain '>' and '<'
> characters. This required generalizing the state machine runner a bit
> [3] to handle states with out-degree more than two.

For what it is worth, this series shrank my index by about the same
amount as skipping HTML messages entirely: about 15% of my messages
have HTML parts, and this series made the index about 15% smaller.
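
To illustrate the kind of scanner described in the quoted text, here is a
minimal sketch in Python. It is not the code from the series: the state
names, the TRANSITIONS table and the strip_tags function are all
hypothetical, chosen only to show why quoted attribute values need a state
with more than two outgoing transitions.

```python
# Hypothetical sketch of a table-driven tag stripper; not the notmuch code.
TEXT, TAG, DQUOTE, SQUOTE = range(4)

# Transition table: state -> {character: next state}.
# Any character not listed leaves the state unchanged.
# TAG has three explicit outgoing edges, which a two-way
# "in tag / out of tag" scanner cannot express.
TRANSITIONS = {
    TEXT: {"<": TAG},
    TAG: {">": TEXT, '"': DQUOTE, "'": SQUOTE},
    DQUOTE: {'"': TAG},
    SQUOTE: {"'": TAG},
}


def strip_tags(html):
    """Return html with tags dropped, honouring quoted attribute values."""
    state = TEXT
    out = []
    for ch in html:
        next_state = TRANSITIONS[state].get(ch, state)
        # Keep a character only if we are in plain text and stay there,
        # so the '<' that opens a tag and the '>' that closes it are dropped.
        if state == TEXT and next_state == TEXT:
            out.append(ch)
        state = next_state
    return "".join(out)
```

With this table, a '>' or '<' inside a quoted attribute value does not end
the tag, so strip_tags('<a title="x > y">link</a> text') keeps only
'link text'. The sketch ignores comments, CDATA, entities and script
content, which a real indexer would also have to consider.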


