RFC: drop html tags

David Bremner david at tethera.net
Tue Mar 21 06:15:43 PDT 2017


Although HTML itself is not a regular language (and the latest
incarnations probably don't fit any sane class), well-formed tags
should be, as far as I know.  Here is a simple fix for the problem of
giant embedded images in HTML: drop all tags.  One caveat: an
unbalanced < or > could cause an HTML part not to be indexed.
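
To make the idea concrete, here is a minimal sketch of a two-state
scanner that drops tags. It is only an illustration and not the code
in the patch; the names (scan_state, drop_tags) are hypothetical. It
assumes input arrives in chunks, so the state is kept by the caller
between calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not the identifiers in the patch. */
enum scan_state { IN_TEXT, IN_TAG };

/*
 * Copy `len` bytes from `in` to `out`, skipping everything from a '<'
 * up to and including the next '>'.  `*state` persists between calls,
 * so a tag split across two chunks is still dropped.  `out` must have
 * room for at least `len` bytes.  Returns the number of bytes written.
 */
size_t
drop_tags (enum scan_state *state, const char *in, size_t len, char *out)
{
    size_t n = 0;

    assert (state != NULL && in != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        if (*state == IN_TEXT) {
            if (in[i] == '<')
                *state = IN_TAG;        /* start of a tag: stop copying */
            else
                out[n++] = in[i];       /* ordinary text: keep it */
        } else if (in[i] == '>') {
            *state = IN_TEXT;           /* end of the tag: resume copying */
        }
        /* any other byte inside a tag is discarded */
    }
    return n;
}
```

With this sketch an <img> tag carrying a large inline data: URI
disappears entirely, since the URI sits inside the tag. It also shows
the caveat: after an unmatched '<' the scanner stays in IN_TAG, so all
following text in that part is discarded.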

If the general approach seems sensible, then it can probably be tidied
up a bit, e.g. by storing a state table in the filter struct, rather
than creating a function to define the appropriate state table and
jumping through a function pointer.  On the other hand, in principle
the function pointer approach is more flexible, as it does not insist
that all scanners are automata-based.  I originally wanted to try a
real HTML parser, but I couldn't see how to get the one I looked at
(gumbo) working easily in "stream" mode.
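
For comparison, the two designs could look roughly like the sketch
below. All struct and function names here are hypothetical and are not
taken from the patch; the point is only to show a filter that stores a
state table directly next to one that calls through a function
pointer and so does not care how the scanner is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TEXT = 0, TAG = 1, N_STATES = 2 };

/* One row per state (hypothetical layout). */
struct scan_rule {
    char trigger;   /* byte that leaves this state */
    int next;       /* state entered when the trigger is seen */
    int keep;       /* nonzero: copy non-trigger bytes to the output */
};

/* Option 1: the filter stores the state table itself. */
struct table_filter {
    const struct scan_rule *rules;
    int state;
};

/* Option 2: the filter only knows a callback; the scanner behind it
 * need not be an automaton at all. */
struct callback_filter {
    size_t (*scan) (void *ctx, const char *in, size_t len, char *out);
    void *ctx;
};

/* Table for "drop everything between '<' and '>'". */
static const struct scan_rule html_rules[N_STATES] = {
    [TEXT] = { '<', TAG, 1 },
    [TAG] = { '>', TEXT, 0 },
};

/* Generic driver for option 1.  `out` needs room for `len` bytes. */
size_t
table_scan (struct table_filter *f, const char *in, size_t len, char *out)
{
    size_t n = 0;

    assert (f != NULL && f->rules != NULL);

    for (size_t i = 0; i < len; i++) {
        const struct scan_rule *r = &f->rules[f->state];

        if (in[i] == r->trigger)
            f->state = r->next;
        else if (r->keep)
            out[n++] = in[i];
    }
    return n;
}

/* Adapter so the table-driven scanner also fits option 2. */
size_t
table_scan_cb (void *ctx, const char *in, size_t len, char *out)
{
    return table_scan ((struct table_filter *) ctx, in, len, out);
}
```

In this sketch option 1 is less code per scanner, since a new scanner
is just another table. Option 2 costs an indirect call but would also
accept a scanner wrapping a real parser, if one could be made to work
on streamed input.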



More information about the notmuch mailing list