Integration with training-based Bayesian filters

Samium Gromoff _deepfire at feelingofgreen.ru
Mon Aug 16 08:38:43 PDT 2010


Good day folks,

My "+notmuch AND train" query on the local notmuch list archive didn't
yield anything relevant, so I've got at least one excuse if the
question I'm going to pose was already answered to death here.

So, how is a notmuch user supposed to integrate a train-based message
classifier like crm114[1], which operates as follows:

  - the filter->you information flow is established by prepending
    either "ADV: " or "UNS: " strings to the message subject, denoting,
    correspondingly, either "spam" or "please tell me if this is spam"
    categories.  The non-spam messages, naturally, have their subject
    lines unmodified.

  - the you->filter information flow is established by taking the
    message file whose status you want to pin down (mostly those marked
    as UNS, because after a while crm114 gets really, really good),
    and piping it to the classifier executable.
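
To make the subject convention above concrete, here is a minimal sketch
in elisp.  The function name is hypothetical and belongs to neither
notmuch nor crm114; it only mirrors the prefixes described in the list.

```elisp
;; Hypothetical helper: map a subject line to the category implied by
;; the "ADV: " / "UNS: " prefix convention.
(defun my-crm114-subject-category (subject)
  "Return `spam', `unsure' or `good' according to SUBJECT's prefix."
  (cond ((string-prefix-p "ADV: " subject) 'spam)
        ((string-prefix-p "UNS: " subject) 'unsure)
        (t 'good)))

(my-crm114-subject-category "UNS: cheap watches")  ; => unsure
```

Only the `unsure' messages normally need the you->filter step.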

One thing is certain -- we're talking elisp territory here.

Another thing is certain as well -- such questions appear at some point,
sooner or later, in the life of every mail user agent.  Again, sorry if
I failed the due diligence part of prior art discovery.

Now to some answers (the unexpected part):

The first part is handled easily by a combination of two things:
procmailing the "ADV: "-prefixed messages out of one's sight (which
becomes a plausible strategy once the classifier becomes clueful enough),
and adding a simple xapian "subject:" rule for the "UNS: "-prefixed ones.
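
A sketch of what such a "subject:" rule could look like when driven from
elisp.  Both function names are hypothetical, and the "train" tag is an
assumption (it is borrowed from the tag list used further down).  Note
that the search matches the word UNS anywhere in the subject, not only
as a prefix.

```elisp
;; Hypothetical helper: build the arguments for
;;   notmuch tag +TAG subject:UNS
(defun my-unsure-tag-args (tag)
  "Return the `notmuch' argument list that adds TAG to UNS-marked mail."
  (list "tag" (concat "+" tag) "subject:UNS"))

;; Hypothetical command; needs notmuch.el loaded when it is invoked.
(defun my-tag-unsure-messages ()
  "Tag messages the classifier was unsure about with \"train\"."
  (interactive)
  (apply 'notmuch-call-notmuch-process (my-unsure-tag-args "train")))
```

A saved search for tag:train then lists exactly the messages that still
need a verdict.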

The second part can be solved either in a way pleasant to the user,
or easily.

The easy way is to expect the user to enter the spam thread, which contains
exactly one message (never seen longer spam threads, still wondering
why...), and then press some key and confirm the destination, station
purple hell.  Then you exit the thread.  To enter another one...

So, after a couple of minutes of processing the backlog, it becomes
painfully clear that you don't want to spend more effort on these
one-message spam threads than pressing 's' and then confirming it with
'y', avoiding the painful, distracting, screen-redrawing thread
enter/exit sequence.

Note that this conveniently avoids the question of non-spam messages,
which often do land within threads, but I'd like to keep this aside;
sorry for the incomplete solution.

So, the crux is this: to pipe the file to the classifier you need the
filename, and the filename appears to be easily available only in the
'show' mode.

I've had to introduce some code to operate on single-message threads,
or actually, on threads with all messages ignored but the first one.
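
For reference, a sketch of how the first message is picked out of a query
result.  It assumes the result has the shape of parsed
`notmuch show --format=json' output; the variable name and all field
values are made up for illustration.

```elisp
;; Hand-written stand-in for a query result: a list of threads, each a
;; list of top-level (MESSAGE REPLIES) pairs, each MESSAGE a plist.
(defvar my-sample-result
  '((((:id "msg1@example.org"
       :filename "/tmp/mail/cur/msg1"
       :tags ("inbox" "unread"))
      nil))))

;; car   -> the first thread
;; caar  -> its first (MESSAGE REPLIES) pair
;; caaar -> the MESSAGE plist itself
(plist-get (caaar my-sample-result) :filename)  ; => "/tmp/mail/cur/msg1"
```

Any replies in the thread sit in the second element of the pair and are
never looked at, which is why only the first message gets piped.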

So, here goes, the solution modulo the conveniently avoided question
of non-spam messages:


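(require 'cl)       ; `intersection' and `lexical-let' are used below
(require 'notmuch)  ; mode maps and notmuch helpers are used below

(defun notmuch-pipe-file (filename command)
  "Run shell COMMAND asynchronously with FILENAME as its standard input."
  (start-process-shell-command "notmuch-pipe-command" "*notmuch-pipe*"
                               (concat command " < " (shell-quote-argument filename))))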

(defun notmuch-query (query)
  (notmuch-query-get-threads (append (list "\'") query (list "\'"))))

(defun notmuch-result-firstmsg-property (result property)
  (plist-get (caaar result) property))

(defun notmuch-result-backend-remove-tags (result tags)
  (apply 'notmuch-call-notmuch-process
         (append (cons "tag" (mapcar (lambda (s) (concat "-" s)) tags))
                 (cons (concat "id:" (notmuch-result-firstmsg-property result :id)) nil))))

(defun notmuch-search-result-remove-tags (result tags)
  "Remove TAGS from the first message of RESULT.  RESULT is not updated."
  (let ((current-tags (notmuch-result-firstmsg-property result :tags)))
    ;; only call notmuch if the message carries at least one of TAGS
    (when (intersection current-tags tags :test 'string=)
      ;; new result tags are (sort (set-difference current-tags tags :test 'string=) 'string<)
      ;; however, it's unlikely we'll need them, so no need to update
      (notmuch-result-backend-remove-tags result tags))))

(defun notmuch-search-query-current-thread ()
  (notmuch-query (list (notmuch-search-find-thread-id))))

(defun notmuch-show-pipe-current-message (command)
  "Pipe the message currently pointed at within the show mode,
through COMMAND."
  (interactive "sPipe message to command: ")
  (notmuch-pipe-file (notmuch-show-get-filename) command))

(defun notmuch-search-pipe-current-message (command)
  "Pipe the first message of the thread currently pointed at within
the search mode, through COMMAND."
  (interactive "sPipe message to command: ")
  (let* ((result (notmuch-search-query-current-thread))
         (filename (notmuch-result-firstmsg-property result :filename)))
    (notmuch-pipe-file filename command)
    result))

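(defvar mark-as-good-command "~/bin/stdin-is-good"
  "Shell command that reads a good message on stdin.")
(defvar mark-as-spam-command "~/bin/stdin-is-spam"
  "Shell command that reads a spam message on stdin.")
(defvar spam-tagdrop-list '("inbox" "unread" "sent" "train")
  "Tags removed from a message once it is marked as spam.")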

(defun make-mark-as-good (piper)
  "Return a command that, once confirmed, calls PIPER with `mark-as-good-command'."
  (lexical-let ((piper piper))
    (lambda ()
      (interactive)
      (if (y-or-n-p "Mark as good? ")
          (progn
            (funcall piper mark-as-good-command)
            (forward-line 1))))))

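(defun make-mark-as-spam (piper searchp)
  "Return a command that, once confirmed, calls PIPER with `mark-as-spam-command'.
If SEARCHP is non-nil, also drop `spam-tagdrop-list' tags and move to the next line."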
  (lexical-let ((piper piper)
                (searchp searchp))
    (lambda ()
      (interactive)
      (if (y-or-n-p "Mark as spam? ")
          (let ((maybe-result (funcall piper mark-as-spam-command)))
            (if searchp
                (progn
                  (notmuch-search-result-remove-tags maybe-result spam-tagdrop-list)
                  (forward-line 1))
                (notmuch-show-mark-read)))))))

(define-key notmuch-show-mode-map "g"    (make-mark-as-good 'notmuch-show-pipe-current-message))
(define-key notmuch-show-mode-map "s"    (make-mark-as-spam 'notmuch-show-pipe-current-message nil))
(define-key notmuch-search-mode-map "g"  (make-mark-as-good 'notmuch-search-pipe-current-message))
(define-key notmuch-search-mode-map "s"  (make-mark-as-spam 'notmuch-search-pipe-current-message t))


I'll leave it to the more qualified people to decide which part (and in
which form) is supposed to go into notmuch, and which is destined to
live in the end-user's init file.


-- 
regards,
  Samium Gromoff
--
1. http://crm114.sourceforge.net/

--
"Actually I made up the term 'object-oriented', and I can tell you I
did not have C++ in mind." - Alan Kay (OOPSLA 1997 Keynote)

