30 Apr 2020

Linux dmesg --follow (-w) not working?

For a couple of months now, I have noticed that running dmesg -w on my workstation does not appear to print new kernel messages. In other words, dmesg --follow "hangs". Additionally, when running tail -f /var/log/kern.log to monitor new kernel messages picked up by sysklogd (part of syslog-ng), the latest messages do not come through until sysklogd periodically "reopens" the /dev/kmsg kernel message buffer.

Why is this a problem?

This is a problem because I use the dmesg log to monitor important hardware-related messages, such as the kernel recognizing a USB device, or when diagnosing bluetooth/wifi issues. When I plug in a USB drive, the first thing I do is check dmesg for the following messages:

[10701.359834] usb 2-4.4: new high-speed USB device number 8 using ehci-pci
[10701.394801] usb 2-4.4: New USB device found, idVendor=12f7, idProduct=0313, bcdDevice= 1.10
[10701.394807] usb 2-4.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[10701.394810] usb 2-4.4: Product: MerryGoRound
[10701.394813] usb 2-4.4: Manufacturer: Memorex
[10701.394816] usb 2-4.4: SerialNumber: AAAAAAAAAAAA
[10701.395182] usb-storage 2-4.4:1.0: USB Mass Storage device detected
[10701.398885] scsi host7: usb-storage 2-4.4:1.0
[10702.401161] scsi 7:0:0:0: Direct-Access     Memorex  MerryGoRound     PMAP PQ: 0 ANSI: 0 CCS
[10702.401710] sd 7:0:0:0: Attached scsi generic sg6 type 0
[10702.651720] sd 7:0:0:0: [sde] 15654912 512-byte logical blocks: (8.02 GB/7.46 GiB)
[10702.652341] sd 7:0:0:0: [sde] Write Protect is off
[10702.652346] sd 7:0:0:0: [sde] Mode Sense: 23 00 00 00
[10702.652961] sd 7:0:0:0: [sde] No Caching mode page found
[10702.652965] sd 7:0:0:0: [sde] Assuming drive cache: write through
[10702.681473]  sde: sde1 sde2
[10702.684869] sd 7:0:0:0: [sde] Attached SCSI removable disk

This output reports that a USB device was detected, where it is plugged in, the vendor/product information, the USB speed in use, the size of the storage, the device name (/dev/sde), and its partitions (/dev/sde1, /dev/sde2). Many other messages are written to dmesg as well, such as the kernel detecting a bad USB cable, segmentation faults, and so on.
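Since the fields I care about always appear in the same shape, they are easy to pull out with a script. Below is my own illustrative helper (nothing dmesg provides; the function name and returned keys are made up for this sketch):

```python
import re

# Hypothetical helper: extract the fields I look for from a dmesg excerpt
# like the one above -- vendor/product IDs, device node, size, partitions.
def summarize_usb_plug(log_text):
    summary = {}
    m = re.search(r"idVendor=([0-9a-f]{4}), idProduct=([0-9a-f]{4})", log_text)
    if m:
        summary["vendor_id"], summary["product_id"] = m.groups()
    m = re.search(r"\[(sd[a-z]+)\] (\d+) (\d+)-byte logical blocks", log_text)
    if m:
        name, blocks, block_size = m.groups()
        summary["device"] = "/dev/" + name
        summary["size_bytes"] = int(blocks) * int(block_size)
    # The partition line looks like "[10702.681473]  sde: sde1 sde2"
    m = re.search(r"^\s*\[[\d.]+\]\s+(sd[a-z]+): (.+)$", log_text, re.MULTILINE)
    if m:
        summary["partitions"] = ["/dev/" + p for p in m.group(2).split()]
    return summary
```

Feeding it the excerpt above yields the vendor/product pair 12f7:0313, /dev/sde at 15654912 × 512 bytes, and the two partitions.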

Given the importance of the above log output, I have developed a habit of running dmesg -w to monitor such kernel events. The -w tells dmesg to monitor for new messages. The long option is --follow.

In addition to dmesg -w not working as intended, syslog-ng log entries written to /var/log/kern.log are not written as they occur; instead the log is written in "bursts", which suggests sysklogd occasionally reopens /dev/kmsg, thereby reading in new log messages. Worse, all entries within a given "burst" share the same timestamp.

Which systems were affected?

I have two systems with virtually identical OS installations: one is a workstation named snowcrash with an AMD FX-8350 on an ASRock M5A97 R2.0 motherboard; the other is an HP EliteBook 820 G4 named cyberdemon with an Intel Core i5-7300U. Curiously enough, the strange dmesg -w hang does not occur on cyberdemon, but does occur on snowcrash. Both hosts run mainline Linux, currently 5.6.4. Looking through my /var/log/kern.log files, this behavior was already apparent on a 5.4.25 kernel. As we will see later, this coincides with the affected versions that others have reported.

Additionally, I asked my friend tyil who happens to also use an AMD FX-8350 with Gentoo to check for the bug; he also had the problem on 5.6.0.

Pinpointing the bug

The first thing I did was find a way to reproduce the issue, so I made an asciinema recording and shared it on IRC, hoping somebody would know of a solution. I got some helpful and encouraging feedback, but nobody knew of this particular bug. See the recording below, or click here:

The next step was to figure out if there was something wrong with /bin/dmesg. Running strace -o dmesg-strace.log dmesg -w shows the following pertinent lines:

openat(AT_FDCWD, "/dev/kmsg", O_RDONLY) = 3
lseek(3, 0, SEEK_DATA)                  = 0
read(3, "6,242,717857,-;futex hash table "..., 8191) = 79
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x16), ...}) = 0
openat(AT_FDCWD, "/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=26388, ...}) = 0
mmap(NULL, 26388, PROT_READ, MAP_SHARED, 4, 0) = 0x7fed92688000
close(4)                                = 0
futex(0x7fed925f9a14, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(1, "\33[32m[    0.717857] \33[0m\33[33mfut"..., 97) = 97
… SNIP …
read(3, "6,1853,137347289701,-;input: Mic"..., 8191) = 128
write(1, "\33[32m[137347.289701] \33[0m\33[33min"..., 140) = 140
read(3,

The last line indicates a pending read() that never completes. Note that file descriptor 3 refers to the /dev/kmsg device. Nothing out of the ordinary occurs, except that the read() simply hangs.
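As an aside, the records dmesg reads from /dev/kmsg (visible in the strace output above, e.g. "6,242,717857,-;futex hash table ...") have a stable format described in the kernel's Documentation/ABI/testing/dev-kmsg: a comma-separated prefix carrying the combined syslog priority/facility value, a sequence number, a timestamp in microseconds, and a flag field, followed by ';' and the message. A small parser, as a sketch:

```python
# Parse one /dev/kmsg record of the form
#   "<prio/facility>,<seq>,<usec>,<flags>;<message>"
def parse_kmsg_record(record):
    prefix, _, message = record.partition(";")
    prio_facility, seq, usec, flags = prefix.split(",")[:4]
    prio_facility = int(prio_facility)
    return {
        "priority": prio_facility & 7,   # low 3 bits: syslog severity
        "facility": prio_facility >> 3,  # remaining bits: syslog facility
        "seq": int(seq),
        "timestamp_usec": int(usec),
        "flags": flags,
        "message": message.rstrip("\n"),
    }
```

So the "6,242,717857,-" prefix in the strace above is severity 6 (info), facility 0 (kernel), sequence number 242, at 0.717857 seconds of uptime.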

I was at a loss to explain the hung read(). Honestly, I was really lost. So I went on and inspected the changes to /bin/dmesg shipped by util-linux, and found no sign of significant changes; I also ran dmesg built from master just to be sure. See the commit log of dmesg.c here. Additionally, I searched the util-linux bug tracker and found nothing relevant.
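Another way to rule out /bin/dmesg itself is to bypass it entirely: a few lines of Python blocking on reads from /dev/kmsg behave the same way. This is my own stand-in, not anything util-linux ships; it needs read access to /dev/kmsg (root, or kernel.dmesg_restrict=0), and on an affected kernel it hangs on the read exactly like dmesg -w does:

```python
# Minimal stand-in for `dmesg -w`: open /dev/kmsg and block reading records.
# Each iteration of the loop is one blocking read() returning one record.
def follow(path="/dev/kmsg"):
    with open(path, "r", errors="replace") as f:
        for record in f:
            yield record

if __name__ == "__main__":
    import sys
    # Opt in explicitly, since this blocks forever on an affected kernel.
    if "--follow" in sys.argv:
        for record in follow():
            sys.stdout.write(record)
```

Seeing this hang too points the finger at the kernel side of the read, not at the dmesg binary.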

Given I had no solution yet, I decided to resort to Googling things, hoping somebody had discussed this bug before. Keywords I tried are:

  • dmesg follow no longer working
  • dmesg kmsg no more messages
  • linux kmsg read hang
  • "/dev/kmsg" hang
  • "dmesg -w" hangs

None of these came up with anything useful. I was using DuckDuckGo mainly, with some Google queries sprinkled on top.

I then visited the torvalds/linux GitHub repository, searched for "kmsg", and did not find a commit that looked like a fix. Reading through commits, I picked up that /dev/kmsg is written to via the printk functions, so on a whim I decided to look at changes made to kernel/printk/printk.c. Reading through the commit log of printk.c, I realized the most recent commit was likely the fix:

commit ab6f762f0f53162d41497708b33c9a3236d3609e
Author: Sergey Senozhatsky <protected@email>
Date:   Tue Mar 3 20:30:02 2020 +0900

    printk: queue wake_up_klogd irq_work only if per-CPU areas are ready
    
    printk_deferred(), similarly to printk_safe/printk_nmi, does not
    immediately attempt to print a new message on the consoles, avoiding
    calls into non-reentrant kernel paths, e.g. scheduler or timekeeping,
    which potentially can deadlock the system.
    
    Those printk() flavors, instead, rely on per-CPU flush irq_work to print
    messages from safer contexts.  For same reasons (recursive scheduler or
    timekeeping calls) printk() uses per-CPU irq_work in order to wake up
    user space syslog/kmsg readers.
    
    However, only printk_safe/printk_nmi do make sure that per-CPU areas
    have been initialised and that it's safe to modify per-CPU irq_work.
    This means that, for instance, should printk_deferred() be invoked "too
    early", that is before per-CPU areas are initialised, printk_deferred()
    will perform illegal per-CPU access.
    
    Lech Perczak [0] reports that after commit 1b710b1b10ef ("char/random:
    silence a lockdep splat with printk()") user-space syslog/kmsg readers
    are not able to read new kernel messages.
    
    The reason is printk_deferred() being called too early (as was pointed
    out by Petr and John).
    
    Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU
    areas are initialized.
    
    Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/
    Reported-by: Lech Perczak <protected@email>
    Signed-off-by: Sergey Senozhatsky <protected@email>
    Tested-by: Jann Horn <protected@email>
    Reviewed-by: Petr Mladek <protected@email>
    Cc: Greg Kroah-Hartman <protected@email>
    Cc: Theodore Ts'o <protected@email>
    Cc: John Ogness <protected@email>
    Signed-off-by: Linus Torvalds <protected@email>

Unfortunately my understanding of the Linux kernel architecture is not comprehensive, let alone competent, but the commit message describes that:

  1. syslog/kmsg readers include programs such as dmesg and syslog-ng,
  2. certain printk() flavors do not immediately attempt to print new messages to the console, and
  3. as a result, syslog/kmsg readers might never be woken up.

Indeed, it's a bit hard to wrap my minimal kernel understanding around this; however, reading the linked mailing list thread clears things up significantly:

After upgrading kernel on our boards from v4.19.105 to v4.19.106 we found out that syslog fails to read the messages after ones read initially after opening /proc/kmsg just after booting. I also found out, that output of 'dmesg --follow' also doesn't react on new printks appearing for whatever reason - to read new messages, reopening /proc/kmsg or /dev/kmsg was needed. I bisected this down to commit 15341b1dd409749fa5625e4b632013b6ba81609b ("char/random: silence a lockdep splat with printk()"), and reverting it on top of v4.19.106 restored correct behaviour.

— Lech Perczak

Now that sounds like the issue I'm having! The thread also notes that the bug is present in 4.19.106 (fixed in 4.19.107; see this commit), and affects users of 5.5.9, 5.5.15, and 5.6.3 (see the PATCHv2 thread).
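The way I read the commit message and thread, the failure mode is roughly this (a loose user-space analogy of my own, NOT actual kernel code): an irq_work item guards itself with a "pending" flag, so queueing it again is a no-op until the work runs and clears the flag. If the item is queued before the per-CPU machinery that runs it is ready, the flag gets set but the work is lost, and every later wake-up attempt is silently dropped:

```python
# Loose analogy of the printk_deferred()/wake_up_klogd bug (not kernel code).
class IrqWork:
    def __init__(self, func):
        self.func = func
        self.pending = False  # set while queued; cleared when the work runs

class Cpu:
    def __init__(self):
        self.ready = False    # have the per-CPU areas been initialised?
        self.queue = []

    def queue_work(self, work):
        if work.pending:
            return            # looks queued already: nothing to do
        work.pending = True
        if self.ready:
            self.queue.append(work)
        # ...otherwise the item is simply lost, with pending stuck at True

    def run_pending(self):
        while self.queue:
            work = self.queue.pop(0)
            work.pending = False
            work.func()

woken = []
cpu = Cpu()
wake_readers = IrqWork(lambda: woken.append("klogd woken"))

cpu.queue_work(wake_readers)  # printk_deferred() called "too early"
cpu.ready = True              # per-CPU areas come up afterwards
cpu.queue_work(wake_readers)  # every later printk(): looks queued already
cpu.run_pending()
print(woken)                  # stays empty: readers are never woken
```

In this analogy, the fix corresponds to bailing out of queue_work() before touching the pending flag when the CPU is not ready, so the too-early call becomes a harmless no-op and later wake-ups queue normally.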

Applying the patch

Next step is to apply the patch in order to test and verify this fixes the issue.

Since fall last year, I have used sys-kernel/vanilla-kernel to compile, install, and create an initramfs for my two machines. This is a great ebuild because it uses a kernel .config based on Arch Linux's, so it is compatible with most machines. It is also streamlined in that it does all the work for you: no more manually configuring and remembering which make invocations are necessary to update the kernel. That process is not hard to get right, but it's not particularly interesting for my use case. Additionally, with sys-kernel/vanilla-kernel, the kernel and its modules are packaged and can be distributed to my other machine as a binpkg, which streamlines deployment significantly.

In order to apply the patch through this ebuild, I simply have to drop the patch file into /etc/portage/patches/sys-kernel/vanilla-kernel. In my case I chose /etc/portage/patches/sys-kernel/vanilla-kernel:5.6.4, because I would rather the patch only be applied to the kernel version I currently have installed than to all versions of sys-kernel/vanilla-kernel. This ensures that when I upgrade to the upcoming 5.7 release (which includes the fix), the patch won't be applied, and emerge won't fail because the patch no longer applies cleanly.

The commands (commit to my /etc/portage):

mkdir -p /etc/portage/patches/sys-kernel/vanilla-kernel:5.6.4
curl -o /etc/portage/patches/sys-kernel/vanilla-kernel:5.6.4/fix-dmesg--follow.patch \
        https://github.com/torvalds/linux/commit/ab6f762f0f53162d41497708b33c9a3236d3609e.patch
emerge -1av sys-kernel/vanilla-kernel:5.6.4

An hour later the kernel was installed. After a reboot, dmesg -w indeed works once again! And the log messages in /var/log/kern.log have timestamps that correctly reflect the kernel time!

Conclusion

Even kernels have regressions. As discussed on IRC, I was reminded that the kernel project is not responsible for the userland, so such test cases might not be on the radar of most kernel developers. Perhaps it is the distributions' responsibility to do integrated system testing to catch bugs like this. In any case, it is still a surprise to see such a regression occur. We like to think of the kernel as an infallible magical machine that doesn't break unless you do something patently wrong, but this isn't really the case. We're all human.

I want to thank Tyil, Sergey (the patch author), Lech (the bug reporter), and some folks from the #linux IRC channel for helping me pinpoint this issue. The reader may think this is a lot of effort to fix such a simple bug, but it's really important for the kernel to work: if the kernel misbehaves, anything is up for grabs. Unlike the bug-laden applications we have come to accept will crash on occasion, a kernel that crashes or misbehaves has ramifications almost as bad as failing hardware: you lose your application data, productivity, and trust in the operating system itself.

It is important to mention that LTS (long term support) kernels exist. Given the amount of trouble I went through to address this issue, and the fact that I'd rather not have things breaking, I don't think I should be running a mainline kernel at the moment. Perhaps I can install both side by side, then pick and choose which kernel to boot.

I am very interested to hear your suggestions, dear reader, for kernel maintenance and version-selection strategies. You can find my contact details at https://winny.tech/ . Thank you for reading.

Tags: linux computing

© Winston Weinert (winny) — CC-BY-SA-4.0