Drop HTML tags when indexing
david at tethera.net
Sat Mar 25 05:59:20 PDT 2017
David Bremner <david at tethera.net> writes:
> Steven Allen pointed out that the previous scanner was a
> little too simplistic. This version handles (or claims to) quoted
> strings in attributes, which can apparently contain '>' and '<'
> characters. This required generalizing the state machine runner a bit
> to handle states with out-degree more than two.
For what it is worth, this series shrank my index by about the same
amount as skipping HTML messages entirely: about 15% of my messages
have HTML parts, and this series made the index about 15% smaller.
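
For anyone following along, the idea described in the quoted text can be
sketched roughly as below. This is only an illustration in Python, not the
code from the series: the state names, the TRANSITIONS table, and the
strip_tags function are all made up for this example. It assumes that
handling single- and double-quoted attribute values is enough, and it
ignores comments, scripts, and character entities. Note that the TAG state
has three outgoing edges, which is the "out-degree more than two" case.

```python
# Hypothetical sketch of a table-driven tag scanner; not the actual patch.
TEXT, TAG, DQUOTE, SQUOTE = range(4)

# For each state, map an input character to the next state.
# Characters not listed leave the state unchanged.
TRANSITIONS = {
    TEXT: {"<": TAG},
    TAG: {">": TEXT, '"': DQUOTE, "'": SQUOTE},  # three outgoing edges
    DQUOTE: {'"': TAG},  # '<' and '>' inside quotes are ignored
    SQUOTE: {"'": TAG},
}


def strip_tags(html):
    """Return html with tags removed, leaving a space where each tag ended."""
    state = TEXT
    out = []
    for ch in html:
        next_state = TRANSITIONS[state].get(ch, state)
        if state == TEXT and next_state == TEXT:
            out.append(ch)  # ordinary text outside any tag
        elif state == TAG and next_state == TEXT:
            out.append(" ")  # tag closed: keep neighbouring words apart
        state = next_state
    return "".join(out)
```

Calling strip_tags on a fragment such as '<a title="x > y">link</a>' keeps
only the word "link" (plus whitespace), because the '>' inside the quoted
attribute does not end the tag.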