parallel tests broken on Debian stable

Daniel Kahn Gillmor dkg at fifthhorseman.net
Mon May 20 16:49:02 PDT 2019


On Mon 2019-05-20 13:27:03 -0400, Daniel Kahn Gillmor wrote:
>  c) we should avoid the timeout hanging :)

I dug into this today, and i'm reporting back my findings.

I have what appears to be a fix (see below), but i don't understand it,
so i'm not advocating for it.

To be clear: my two test cases are two KVM instances, one running
stretch (debian stable) and one running sid (debian unstable).  both
systems have 4 virtual CPUs (on a hardware platform that has 4 cores).

The two VMs are otherwise similarly configured.  Both have moreutils
installed, and GNU parallel is not installed.

on the stretch system, i can achieve this hang/failure with a simple
"make -j4 check".  on the "sid" system, i do not see the failure.

When i disable the use of timeout entirely (with NOTMUCH_TEST_TIMEOUT=0,
see id:20190520232535.4904-1-dkg at fifthhorseman.net), the problem goes
away on the stretch system.

When i inspect the state of the debian stretch system when the tests are
hanging, i see this (from "ps auwx"):

------------------------
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dkg       7980  1.8  0.5  10228  4348 pts/1    S+   18:49   0:00 make -j4 check
dkg       8001  0.0  0.3  11228  2884 pts/1    S+   18:49   0:00 bash /home/dkg/src/notmuch/notmuch/test/notmuch-test
dkg       8011  0.0  0.1  10092   804 pts/1    S    18:49   0:00 timeout 2m parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/
dkg       8012  0.0  0.0   4168   744 pts/1    T    18:49   0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg       8013  0.0  0.0   4168    96 pts/1    T    18:49   0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg       8014  0.0  0.0   4168    96 pts/1    T    18:49   0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg       8267  0.0  0.0   4168    96 pts/1    T    18:49   0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg       8268  0.0  0.0   4276   732 pts/1    T    18:49   0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T050-new.sh
dkg       8270  2.7  0.4  11772  3744 pts/1    T    18:49   0:00 bash /home/dkg/src/notmuch/notmuch/test/T050-new.sh
dkg       8320  0.0  0.0   4168    96 pts/1    T    18:49   0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg       8321  0.0  0.0   4276   748 pts/1    T    18:49   0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T060-count.sh
dkg       8322  0.7  0.4  11752  3556 pts/1    T    18:49   0:00 bash /home/dkg/src/notmuch/notmuch/test/T060-count.sh
dkg       8345  0.0  0.0   4168    96 pts/1    T    18:49   0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg       8346  0.0  0.0   4276   744 pts/1    T    18:49   0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T070-insert.sh
dkg       8347  1.7  0.4  11772  3764 pts/1    T    18:49   0:00 bash /home/dkg/src/notmuch/notmuch/test/T070-insert.sh
dkg       8425  0.0  0.0   4168    96 pts/1    T    18:49   0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg       8426  0.0  0.0   4276   740 pts/1    T    18:49   0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T080-search.sh
dkg       8427  1.5  0.4  11752  3664 pts/1    T    18:49   0:00 bash /home/dkg/src/notmuch/notmuch/test/T080-search.sh
dkg       8763  4.7  2.9  73960 22708 pts/1    T    18:49   0:00 gdb --batch-silent --return-child-result -x count-files.gdb --args notmuch count --output=files *
dkg       8914  0.0  0.8  68508  6228 pts/1    T    18:49   0:00 notmuch search --format=text0 --output=files --offset=1 --limit=1 *
dkg       8915  0.0  0.1   4484  1164 pts/1    T    18:49   0:00 xargs -0 -I {} mv {} /home/dkg/src/notmuch/notmuch/test/tmp.T050-new/mail/moved_messages
dkg       8916  0.0  0.5  68244  3824 pts/1    T    18:49   0:00 notmuch insert --folder=Drafts +draft -unread
dkg       8919  0.0  0.0  13012   704 pts/1    T    18:49   0:00 notmuch new
dkg       8920  0.0  0.0   1412     4 pts/1    t    18:49   0:00 /bin/bash -c exec /home/dkg/src/notmuch/notmuch/notmuch count --output=files \*
------------------------


As you can see in the "STAT" column, nearly all of the hanging processes
are marked with T ("stopped by job control signal" according to ps(1)).

I also note that "t" means "stopped by debugger during the tracing" --
maybe that final line (with "t") is the special one that triggers this?
i don't know.

When i try to connect to any of these stopped processes with "strace -p
$PID", strace reports:

    strace: Process 4204 attached
    --- stopped by SIGTTOU ---

SIGTTOU is novel to me, and i don't really understand why the test suite
would have this problem.  Skimming this guidance:

    http://curiousthing.org/sigttin-sigttou-deep-dive-linux

suggested that maybe if i just decoupled the processes from the terminal
"enough" i could get away with a functioning test suite.  redirecting
all of stdin, stdout, stderr to /dev/null worked!  then i tried pruning
out different pieces, and found that all i needed to do was to redirect
stdin from /dev/null and the test suite would run without problems in
parallel with moreutils parallel.  (it also works with GNU parallel, and
if i run the tests serially).

So the patch below is a "fix" but it's not a principled one.

the source for moreutils parallel.c doesn't appear to have changed at
all between stretch and buster.  I tried upgrading the version of
moreutils on this stretch system from 0.60-1 to 0.62-1, and i was able
to reproduce the same problem.  So i don't believe the problem is with
moreutils.

Some things that might be different between debian stable (stretch) and
testing (buster):

    package        provides               stretch           buster
    -------        --------               -------           ------
    GNU coreutils  /usr/bin/timeout       8.26-3            8.30-3
    GNU bash       /bin/bash              4.4-5             5.0-4
    GNU dash       /bin/sh (via symlink)  0.5.8-2.4         0.5.10.2-5
    Linux          the kernel             4.9.168-1+deb9u2  4.19.37-3
    GNU gdb        /usr/bin/gdb           7.12-6            8.2.1-2

I also tried changing the symlink for /bin/sh to point to bash instead
of dash, and was still able to replicate the problem, so i suspect dash
is not the culprit.

However, i tried selectively upgrading all the versions of all of these
packages *except for gdb* to the version in buster (or to the version
from backports, in the case of the kernel).  and i'm *still* seeing the
problem on the stretch system.

So perhaps it's some interaction between timeout and gdb?  I haven't
managed to test that particular combination yet.

I hope someone else will look into this further, as i'm out of my depth.

  --dkg

diff --git a/test/Makefile.local b/test/Makefile.local
index 47244e8f..3a57b6be 100644
--- a/test/Makefile.local
+++ b/test/Makefile.local
@@ -66,13 +66,13 @@ test-binaries: $(TEST_BINARIES)
 test:  all test-binaries
 ifeq ($V,)
        @echo 'Use "$(MAKE) V=1" to see the details for passing and known broken tests.'
-       @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS)
+       @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS) </dev/null
 else
 # The user has explicitly enabled quiet execution.
 ifeq ($V,0)
-       @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS)
+       @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS) </dev/null
 else
-       @$(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS)
+       @$(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS) </dev/null
 endif
 endif
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: <http://notmuchmail.org/pipermail/notmuch/attachments/20190520/434d205b/attachment.sig>


More information about the notmuch mailing list