[PATCH v3 3/5] Add indexing for the mimetype term

Sat Jan 17 08:41:10 PST 2015

>>>>> "DB" == David Bremner <david at tethera.net> writes:

    DB> Todd <todd at electricoding.com> writes:
    >> Adds the indexing and removes the broken test flag
    >> ---
    >> lib/database.cc        |  1 +
    >> lib/index.cc           | 10 ++++++++++
    >> test/T190-multipart.sh |  4 ----
    >> 3 files changed, 11 insertions(+), 4 deletions(-)
    >>
    >> diff --git a/lib/database.cc b/lib/database.cc
    >> index 0d2c417..3974e2e 100644
    >> --- a/lib/database.cc
    >> +++ b/lib/database.cc
    >> @@ -254,6 +254,7 @@ static prefix_t PROBABILISTIC_PREFIX[]= {
    >> { "from",			"XFROM" },
    >> { "to",			"XTO" },
    >> { "attachment",		"XATTACHMENT" },
    >> +    { "mimetype",		"XMIMETYPE"},
    >> { "subject",		"XSUBJECT"},
    >> };

    DB> I think the commit message should articulate why we are indexing this as
    DB> a probabilistic prefix, rather than as a boolean prefix. In particular,
    DB> this gives people a last chance to complain.

    DB> The reference I know is http://xapian.org/docs/queryparser.html

    DB> If I understand correctly (it would be great if you could test this,
    DB> Todd), with a probabilistic prefix,

    DB>    mimetype:pdf

    DB> will match

    DB> application/pdf
    DB> image/pdf
    DB> application/x-pdf
    DB> application/x-ext-pdf

    DB> but not

    DB> application/x-bzpdf
    DB> application/x-gzpdf
    DB> application/x-xzpdf

    I just tested, and it does work this way with your examples.  From
    reading the docs, I *believe* that Xapian treats full MIME-type
    queries as phrase searches anyway, because of the embedded
    slashes.

    From http://xapian.org/docs/queryparser.html:

         A phrase surrounded with double quotes ("") matches documents
         containing that exact phrase. Hyphenated words are also treated
         as phrases, as are cases such as filenames and email addresses
         (e.g. /etc/passwd or president@whitehouse.gov).

    I think that we'll get good behavior from the types of queries that
    will typically be performed due to this automatic phrasing.
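
    The matching described above can be sketched with a simplified model.
    This is not Xapian or notmuch code: word_terms() and phrase_match() are
    hypothetical helpers that assume a value is split into lowercase word
    terms at non-alphanumeric characters and that a query with embedded
    punctuation is matched as a phrase (adjacent terms, in order).
    Stemming is ignored.

```python
import re


def word_terms(text):
    """Split text into lowercase word terms at non-alphanumeric characters.

    Assumption: a rough stand-in for how a probabilistic field is tokenized.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(query, indexed_value):
    """Return True if the query's terms occur adjacently, in order."""
    q = word_terms(query)
    d = word_terms(indexed_value)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# The MIME types from the examples quoted above.
types = [
    "application/pdf",
    "image/pdf",
    "application/x-pdf",
    "application/x-ext-pdf",
    "application/x-bzpdf",
    "application/x-gzpdf",
    "application/x-xzpdf",
]

matched = [t for t in types if phrase_match("pdf", t)]
print(matched)
```

    In this model "bzpdf" is a single term, so a query for "pdf" does not
    match it, while "x-pdf" splits into the terms "x" and "pdf".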


    DB> On the whole, this is probably more beneficial than harmful.  The
    DB> downside of probabilistic prefixes/fields is that they are not
    DB> "anchored", so there is no easy way to distinguish

    DB>       application/pdf

    DB> from

    DB>       pdf
    DB>       application/x-pdf

    DB> I guess in a perfect world this would also be explained in
    DB> notmuch-search-terms(7), but that's pretty much orthogonal to this
    DB> series.

    If separate messages with application/pdf and application/x-pdf are
    indexed, then:
    
    mimetype:application/x-pdf finds only the application/x-pdf message
    mimetype:application/pdf finds only the application/pdf message
    mimetype:pdf finds both messages

    I am fairly sure that this behaviour is a result of the automatic
    phrasing mentioned above.
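
    As a rough illustration of the two-message test, and of how an
    anchored (boolean) field would differ, here is a toy model. The index
    contents, the message ids, and both search functions are hypothetical.
    The boolean variant assumes the whole value is stored and compared as
    one exact term. Neither function reflects actual notmuch internals.

```python
import re


def terms(value):
    """Lowercase word terms, split at non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", value.lower())


def probabilistic_search(index, query):
    """Phrase-style match: query terms must be adjacent and in order."""
    q = terms(query)
    hits = []
    for msg_id, mimetype in index.items():
        d = terms(mimetype)
        if q and any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1)):
            hits.append(msg_id)
    return sorted(hits)


def boolean_search(index, query):
    """Anchored match: the whole value must equal the query."""
    return sorted(m for m, mt in index.items() if mt.lower() == query.lower())


# Two separately indexed messages, one attachment type each.
index = {"msg-a": "application/pdf", "msg-b": "application/x-pdf"}

for query in ("application/x-pdf", "application/pdf", "pdf"):
    print(query, probabilistic_search(index, query), boolean_search(index, query))
```

    Under these assumptions the anchored variant can tell application/pdf
    apart from application/x-pdf, but a bare "pdf" finds nothing, which is
    the trade-off discussed above.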

    - Todd
    
    DB> d