parallel tests broken on Debian stable
Daniel Kahn Gillmor
dkg at fifthhorseman.net
Mon May 20 16:49:02 PDT 2019
On Mon 2019-05-20 13:27:03 -0400, Daniel Kahn Gillmor wrote:
> c) we should avoid the timeout hanging :)
I dug into this today, and i'm reporting back my findings.
I have what appears to be a fix (see below), but i don't understand it,
so i'm not advocating for it.
To be clear: my two test cases are two KVM instances, one running
stretch (debian stable) and one running sid (debian unstable). both
systems have 4 virtual CPUs (on a hardware platform that has 4 cores).
The two VMs are otherwise similarly configured. Both have moreutils
installed, and GNU parallel is not installed.
On the stretch system, i can trigger this hang/failure with a simple
"make -j4 check". On the sid system, i do not see the failure.
When i disable the use of timeout entirely (with NOTMUCH_TEST_TIMEOUT=0,
see id:20190520232535.4904-1-dkg@fifthhorseman.net), the problem goes
away on the stretch system.
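For readers unfamiliar with that knob, here is a minimal sketch of how a wrapper could honor it. The helper name run_with_optional_timeout is hypothetical, and the 2m default simply mirrors the "timeout 2m" invocation visible in the process listing in this message; the actual logic in notmuch-test may differ.

```shell
# Hypothetical helper (not taken from notmuch-test): run a command under
# timeout(1) unless NOTMUCH_TEST_TIMEOUT is set to 0.
run_with_optional_timeout() {
    # Assumed default of 2m, matching the "timeout 2m" seen in the ps output.
    _limit="${NOTMUCH_TEST_TIMEOUT:-2m}"
    if [ "$_limit" = "0" ]; then
        # Timeout disabled: run the command directly.
        "$@"
    else
        timeout "$_limit" "$@"
    fi
}

# Usage: run_with_optional_timeout some-command arg1 arg2
```

With NOTMUCH_TEST_TIMEOUT=0 in the environment, the command runs with no timeout process in between.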
When i inspect the state of the debian stretch system when the tests are
hanging, i see this (from "ps auwx"):
------------------------
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
dkg 7980 1.8 0.5 10228 4348 pts/1 S+ 18:49 0:00 make -j4 check
dkg 8001 0.0 0.3 11228 2884 pts/1 S+ 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/notmuch-test
dkg 8011 0.0 0.1 10092 804 pts/1 S 18:49 0:00 timeout 2m parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/
dkg 8012 0.0 0.0 4168 744 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8013 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8014 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8267 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8268 0.0 0.0 4276 732 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T050-new.sh
dkg 8270 2.7 0.4 11772 3744 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T050-new.sh
dkg 8320 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8321 0.0 0.0 4276 748 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T060-count.sh
dkg 8322 0.7 0.4 11752 3556 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T060-count.sh
dkg 8345 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8346 0.0 0.0 4276 744 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T070-insert.sh
dkg 8347 1.7 0.4 11772 3764 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T070-insert.sh
dkg 8425 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8426 0.0 0.0 4276 740 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T080-search.sh
dkg 8427 1.5 0.4 11752 3664 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T080-search.sh
dkg 8763 4.7 2.9 73960 22708 pts/1 T 18:49 0:00 gdb --batch-silent --return-child-result -x count-files.gdb --args notmuch count --output=files *
dkg 8914 0.0 0.8 68508 6228 pts/1 T 18:49 0:00 notmuch search --format=text0 --output=files --offset=1 --limit=1 *
dkg 8915 0.0 0.1 4484 1164 pts/1 T 18:49 0:00 xargs -0 -I {} mv {} /home/dkg/src/notmuch/notmuch/test/tmp.T050-new/mail/moved_messages
dkg 8916 0.0 0.5 68244 3824 pts/1 T 18:49 0:00 notmuch insert --folder=Drafts +draft -unread
dkg 8919 0.0 0.0 13012 704 pts/1 T 18:49 0:00 notmuch new
dkg 8920 0.0 0.0 1412 4 pts/1 t 18:49 0:00 /bin/bash -c exec /home/dkg/src/notmuch/notmuch/notmuch count --output=files \*
------------------------
As you can see in the "STAT" column, nearly all of the hanging processes
are marked with T ("stopped by job control signal" according to ps(1)).
I also note that "t" means "stopped by debugger during the tracing" --
maybe that final line (with "t") is the special one that triggers this?
i don't know.
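To pull just the stopped entries out of a listing like the one above, a small filter along these lines could be used. This is only a sketch: stopped_only is a hypothetical name, and the column holding the process state is passed as an argument (8 for "ps auwx" output, 2 by default for something like "ps -eo pid,stat,args").

```shell
# Hypothetical filter: keep only lines whose STAT column starts with
# "T" (stopped by job control signal) or "t" (stopped by debugger).
# $1 is the 1-based column number of the STAT field (default: 2).
stopped_only() {
    awk -v col="${1:-2}" '$col ~ /^[Tt]/ { print }'
}

# Usage:
#   ps auwx | stopped_only 8
#   ps -eo pid,stat,args | stopped_only
```

The header line is dropped automatically, since "STAT" does not start with "T" or "t".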
When i try to connect to any of these stopped processes with "strace -p
$PID", strace reports:
strace: Process 4204 attached
--- stopped by SIGTTOU ---
SIGTTOU is novel to me, and i don't really understand why the test suite
would have this problem. Skimming this guidance:
http://curiousthing.org/sigttin-sigttou-deep-dive-linux
suggested that maybe if i just decoupled the processes from the terminal
"enough" i could get away with a functioning test suite. Redirecting
all of stdin, stdout, and stderr to /dev/null worked! Then i tried pruning
out different pieces, and found that all i needed to do was to redirect
stdin from /dev/null and the test suite would run without problems in
parallel with moreutils parallel. (It also works with GNU parallel, and
when i run the tests serially.)
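To illustrate what the stdin redirection changes from a child process's point of view, here is a small sketch (stdin_kind is a hypothetical helper, not part of the test suite). It only shows that stdin stops being a terminal; whether that is the property that avoids the SIGTTOU stop here is an assumption, not something established above.

```shell
# Hypothetical helper: report whether file descriptor 0 is a terminal.
stdin_kind() {
    if [ -t 0 ]; then
        echo "stdin is a terminal"
    else
        echo "stdin is not a terminal"
    fi
}

# With stdin redirected, fd 0 refers to /dev/null rather than the tty.
stdin_kind </dev/null
```

Run from an interactive shell without the redirection, the same function reports a terminal instead.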
So the patch below is a "fix" but it's not a principled one.
The source for moreutils parallel.c doesn't appear to have changed at
all between stretch and buster. I tried upgrading the version of
moreutils on this stretch system from 0.60-1 to 0.62-1, and i was still
able to reproduce the problem. So i don't believe the problem is with
moreutils.
Some things that might be different between debian stable (stretch) and
testing (buster):
package        provides               stretch            buster
-------        --------               -------            ------
GNU coreutils  /usr/bin/timeout       8.26-3             8.30-3
GNU bash       /bin/bash              4.4-5              5.0-4
dash           /bin/sh (via symlink)  0.5.8-2.4          0.5.10.2-5
Linux          the kernel             4.9.168-1+deb9u2   4.19.37-3
GNU gdb        /usr/bin/gdb           7.12-6             8.2.1-2
I also tried changing the symlink for /bin/sh to point to bash instead
of dash, and was still able to replicate the problem, so i suspect dash
is not the culprit.
However, i tried selectively upgrading each of these packages
*except for gdb* to the version in buster (or to the version from
backports, in the case of the kernel), and i'm *still* seeing the
problem on the stretch system.
So perhaps it's some interaction between timeout and gdb? I haven't
managed to test that particular combination yet.
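One hint in that direction is already in the ps output above: make and notmuch-test show "S+" (the "+" marks the foreground process group), while timeout shows a plain "S". A sketch for checking which processes on a terminal sit outside its foreground process group, assuming procps ps with its pgid and tpgid columns (background_group_only is a hypothetical helper name):

```shell
# Hypothetical filter for `ps -o pid,pgid,tpgid,stat,args` output:
# print processes whose process group (column 2) differs from the
# terminal's foreground process group (column 3).
background_group_only() {
    # NR > 1 skips the header line.
    awk 'NR > 1 && $2 != $3 { print }'
}

# Usage (pts/1 is the terminal from the listing above):
#   ps -t pts/1 -o pid,pgid,tpgid,stat,args | background_group_only
```

coreutils timeout also documents a --foreground option for commands that need terminal access; whether it changes anything here is untested.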
I hope someone else will look into this further, as i'm out of my depth.
--dkg
diff --git a/test/Makefile.local b/test/Makefile.local
index 47244e8f..3a57b6be 100644
--- a/test/Makefile.local
+++ b/test/Makefile.local
@@ -66,13 +66,13 @@ test-binaries: $(TEST_BINARIES)
test: all test-binaries
ifeq ($V,)
@echo 'Use "$(MAKE) V=1" to see the details for passing and known broken tests.'
- @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS)
+ @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS) </dev/null
else
# The user has explicitly enabled quiet execution.
ifeq ($V,0)
- @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS)
+ @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS) </dev/null
else
- @$(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS)
+ @$(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS) </dev/null
endif
endif