You want sudo -i or su -
You want to use sudo -i or su - to log into root. sudo su anything is superfluous, because you probably should be using sudo -i or sudo -s, which are roughly equivalent to the su forms, depending on whether you want to simulate a login (su - or sudo -i) or not (su or sudo -s).1
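A quick way to see the difference for yourself (a small sketch; exact behavior depends on your sudoers configuration):
sudo -s pwd   # runs in your current working directory
sudo -i pwd   # runs in root's home (/root), like a fresh login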
When to use su -?
You want to log into root using the root password. Typically you must be in the wheel group (check your PAM configuration). In Debian, you simply need to know the password, as there is no wheel group restriction enabled by default.2
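You can check your group membership with something like this (assuming the group is literally named wheel on your system):
id -nG | grep -qw wheel && echo "in wheel" || echo "not in wheel"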
When to use sudo -i?
You want to log into root using your sudo configuration, which will typically prompt for your login password or allow login without a password. sudo also logs all invocations by default. It is more flexible, but also prone to security concerns, such as the recent local user privilege escalation vulnerability. It's not crazy to consider using just su - or maybe some other tool like doas; sudo is a bit hard to pin down when assessing its security risks.
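For reference, doas configuration is pleasantly small. A minimal /etc/doas.conf granting root to the wheel group might look like this (illustrative; see doas.conf(5); persist caches credentials much like sudo's timestamp):
permit persist :wheel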
Why do I want to "simulate a login"?
The next few sections show some diffs of env(1) output. The diffs are generated using diff -U0 | grep -v '^@'. This shows a unified diff with no context and suppresses line markers. First I'll summarize what the output suggests, then show the diffs.
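Roughly, the diffs below can be produced like this (a sketch; the file names match the diff headers):
sudo -s env > sudo_-s
sudo -i env > sudo_-i
diff -U0 sudo_-s sudo_-i | grep -v '^@'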
TL;DR
In my case, if I don't use sudo -i or su - to simulate3 a login, the following things might not work correctly:
- Not exactly sure what the missing settings mean for nix, though I'm going to guess root won't be able to use nix without a login.
- the pager settings won't be configured correctly
- SBCL, Java, Dotnet, VBox, Fltk, OpenCL, Distcc might not work
- Plan9port might not work
sudo will allow XAUTHORITY through, which permits you to run graphical programs as another user, using the current user's desktop session. So using su may not pass this through by default. (Maybe su -m could do this as well?)
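To see what actually makes it through on a given setup (the results depend on the env_keep list in sudoers, and on PAM modules like pam_xauth for su):
sudo -s env | grep -E '^(DISPLAY|XAUTHORITY)='
sudo su - -c env | grep -E '^(DISPLAY|XAUTHORITY)='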
Simulating a login also sets your PWD to root's HOME (/root). This might seem inconvenient at first, but I wonder why one would want to touch their user files as root at all. The use-cases might be to (1) write a thumb drive, (2) grab some system configuration from your user's homedir, or (3) store system stuff in your home directory. Maybe logging into root's homedir is the saner default; then just specify an absolute path to be extra clear about what you want to do. This also means that if you do something dumb, it will not damage your user homedir, only root's, provided you didn't cd somewhere else.
Most of these settings are pulled in from my /etc/profile. Hence you probably want to simulate a login.
The env diffs
sudo -s vs sudo -i
--- sudo_-s	2021-02-14 17:26:26.912620999 -0600
+++ sudo_-i	2021-02-14 17:26:38.259214818 -0600
+PLAN9=/opt/plan9
+XDG_CONFIG_DIRS=/etc/xdg
+LESS=-R -M --shift 5
+JDK_HOME=/etc/java-config-2/current-system-vm
+CONFIG_PROTECT_MASK=/etc/sandbox.d /etc/fonts/fonts.conf /etc/gentoo-release /etc/gconf /etc/terminfo /etc/dconf /etc/ca-certificates.conf /etc/texmf/web2c /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/revdep-rebuild
+DISTCC_VERBOSE=0
+JAVA_HOME=/etc/java-config-2/current-system-vm
+DOTNET_ROOT=/opt/dotnet_core
+ANT_HOME=/usr/share/ant
-PWD=/home/winston
+EDITOR=/usr/bin/vi
+PWD=/root
+NIX_PROFILES=/nix/var/nix/profiles/default /root/.nix-profile
+CONFIG_PROTECT=/etc/stunnel/stunnel.conf /usr/share/maven-bin-3.6/conf /usr/share/gnupg/qualified.txt /usr/share/easy-rsa /usr/share/config /usr/lib64/libreoffice/program/sofficerc
+QT_QPA_PLATFORMTHEME=qt5ct
+DISTCC_TCP_CORK=
+MANPATH=/root/.nix-profile/share/man:/etc/java-config-2/current-system-vm/man:/usr/share/gcc-data/x86_64-pc-linux-gnu/9.3.0/man:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.35.1/man:/etc/java-config-2/current-system-vm/man/:/usr/local/share/man:/usr/share/man:/usr/lib/rust/man:/usr/lib/llvm/11/share/man:/opt/plan9/man
+NIX_PATH=nixpkgs=/nix/var/nix/profiles/per-user/root/channels/nixpkgs:/nix/var/nix/profiles/per-user/root/channels:/root/.nix-defexpr/channels
+OPENCL_PROFILE=nvidia
+UNCACHED_ERR_FD=
+FLTK_DOCDIR=/usr/share/doc/fltk-1.3.5-r4/html
+NIX_SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
+OPENGL_PROFILE=xorg-x11
+DISTCC_FALLBACK=1
+DCC_EMAILLOG_WHOM_TO_BLAME=
+INFOPATH=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.3.0/info:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.35.1/info:/usr/share/info:/usr/share/info/emacs-26
+JAVAC=/etc/java-config-2/current-system-vm/bin/javac
+LESSOPEN=|lesspipe %s
+MANPAGER=manpager
-PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/bin:/usr/lib/llvm/11/bin:/opt/plan9/bin
+DISTCC_SAVE_TEMPS=0
+PAGER=/usr/bin/less
+DISTCC_SSH=
+SBCL_HOME=/usr/lib64/sbcl
+GCC_SPECS=
+GSETTINGS_BACKEND=dconf
+DISTCC_ENABLE_DISCREPANCY_EMAIL=
+XDG_DATA_DIRS=/usr/local/share:/usr/share
+PATH=/root/.nix-profile/bin:/root/.nix-profile/bin:/nix/var/nix/profiles/default/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:/usr/lib/llvm/11/bin:/opt/plan9/bin:/usr/games/bin
+VBOX_APP_HOME=/usr/lib64/virtualbox
+LV2_PATH=/usr/lib64/lv2
-_=/bin/env
+SBCL_SOURCE_ROOT=/usr/lib64/sbcl/src
+LADSPA_PATH=/usr/lib64/ladspa
+_=/usr/bin/env
sudo su vs sudo su -
I don't have a root password set on my systems, so I will use sudo with su for example's sake.
--- sudo_su	2021-02-14 17:25:53.932832743 -0600
+++ sudo_su_-	2021-02-14 17:26:10.259394587 -0600
+PLAN9=/opt/plan9
-SUDO_GID=1000
-SUDO_COMMAND=/bin/su
-SUDO_USER=winston
-PWD=/home/winston
+XDG_CONFIG_DIRS=/etc/xdg
+LESS=-R -M --shift 5
+JDK_HOME=/etc/java-config-2/current-system-vm
+CONFIG_PROTECT_MASK=/etc/sandbox.d /etc/fonts/fonts.conf /etc/gentoo-release /etc/gconf /etc/terminfo /etc/dconf /etc/ca-certificates.conf /etc/texmf/web2c /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/revdep-rebuild
+DISTCC_VERBOSE=0
+JAVA_HOME=/etc/java-config-2/current-system-vm
+DOTNET_ROOT=/opt/dotnet_core
+ANT_HOME=/usr/share/ant
+EDITOR=/usr/bin/vi
+PWD=/root
+NIX_PROFILES=/nix/var/nix/profiles/default /root/.nix-profile
+CONFIG_PROTECT=/etc/stunnel/stunnel.conf /usr/share/maven-bin-3.6/conf /usr/share/gnupg/qualified.txt /usr/share/easy-rsa /usr/share/config /usr/lib64/libreoffice/program/sofficerc
-XAUTHORITY=/root/.xauthgcUQue
+QT_QPA_PLATFORMTHEME=qt5ct
+DISTCC_TCP_CORK=
+MANPATH=/root/.nix-profile/share/man:/etc/java-config-2/current-system-vm/man:/usr/share/gcc-data/x86_64-pc-linux-gnu/9.3.0/man:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.35.1/man:/etc/java-config-2/current-system-vm/man/:/usr/local/share/man:/usr/share/man:/usr/lib/rust/man:/usr/lib/llvm/11/share/man:/opt/plan9/man
+NIX_PATH=nixpkgs=/nix/var/nix/profiles/per-user/root/channels/nixpkgs:/nix/var/nix/profiles/per-user/root/channels:/root/.nix-defexpr/channels
+XAUTHORITY=/root/.xauthq2CYTL
+OPENCL_PROFILE=nvidia
+UNCACHED_ERR_FD=
+FLTK_DOCDIR=/usr/share/doc/fltk-1.3.5-r4/html
+NIX_SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
+OPENGL_PROFILE=xorg-x11
+DISTCC_FALLBACK=1
+DCC_EMAILLOG_WHOM_TO_BLAME=
+INFOPATH=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.3.0/info:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.35.1/info:/usr/share/info:/usr/share/info/emacs-26
+JAVAC=/etc/java-config-2/current-system-vm/bin/javac
+LESSOPEN=|lesspipe %s
+MANPAGER=manpager
-PATH=/sbin:/bin:/usr/sbin:/usr/bin
-SUDO_UID=1000
-MAIL=/var/mail/root
-_=/bin/env
+DISTCC_SAVE_TEMPS=0
+PAGER=/usr/bin/less
+DISTCC_SSH=
+SBCL_HOME=/usr/lib64/sbcl
+GCC_SPECS=
+GSETTINGS_BACKEND=dconf
+DISTCC_ENABLE_DISCREPANCY_EMAIL=
+XDG_DATA_DIRS=/usr/local/share:/usr/share
+PATH=/root/.nix-profile/bin:/root/.nix-profile/bin:/nix/var/nix/profiles/default/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:/usr/lib/llvm/11/bin:/opt/plan9/bin:/usr/games/bin
+VBOX_APP_HOME=/usr/lib64/virtualbox
+LV2_PATH=/usr/lib64/lv2
+SBCL_SOURCE_ROOT=/usr/lib64/sbcl/src
+LADSPA_PATH=/usr/lib64/ladspa
+_=/usr/bin/env
Conclusion
Use either sudo -i or su -. Don't mix sudo and su. Maybe don't use sudo (always good advice, though I don't follow it… yet).
Footnotes:
While writing this post, I found that sudo su - will erase SUDO_* environment variables. Maybe this is beneficial to a workflow, but in most cases I suggest fixing your software to not check for these vars. Looking at you, beep.
About my keyboard choices
Disclaimer: Dvorak, fancy keyboards, and the like do not replace good lifestyle habits such as computer breaks, good desk ergonomics, and balancing one's computer life with gasp real life. Dvorak and fancy keyboards can make your life better, but they cannot completely address RSI problems alone.
Since sometime in 2013 I have faced pain, numbness, and tightness in my hands due to computer overuse. At first I chose to ignore it, but it got so bad I'd take days if not weeks off from prolonged computer use. At my worst, I saw healthcare professionals who treated me for repetitive strain injury (RSI). It took six months to a year to manage my symptoms well enough to start participating in open source again. Shortly after, I landed a long-term freelance gig as a system administrator. The increased computer work led to a resurgence of RSI symptoms. I managed for a while by simply taking breaks, maintaining okay desk ergonomics, doing tendon & nerve glides, and using hot/cold therapy. At some point my symptoms were not improving, so I looked into using better input devices.
Strategic switch to Dvorak


Enter Dvorak. I chose to switch to the Dvorak keyboard layout because it requires less stretching to type the same English text. Dvorak is also designed such that each successive key is likely typed by the other hand. As such, each hand gets an opportunity to rest between keystrokes and to position itself where it needs to be for the next key. The most frequent letters are on the home row, right below one's fingertips. This includes all the vowels: A, E, I, O, and U; and common consonants: T, H, N, S, D. To type any of these common keys, I don't need to move a single finger to another location. I simply press the key. In reflection, this was a good move. While I don't believe it made me a faster typist, it certainly has made typing a lot more comfortable, as it takes less effort to type the same text. Dvorak helped, but there's more.
Shortly after switching to Dvorak in 2013-2014 (cold turkey, I should add), I decided to bite the bullet and acquire a more ergonomic keyboard. I shopped around and tried budget "split" ergonomic keyboards such as the Microsoft contour, and they felt the same; honestly, they actually felt worse. Something that did help me for a while was using low-impact keyboards, such as scissor-switch laptop keyboards. I also tried out a couple of mechanical keyboards. Nothing really seemed very comfortable.
Kinesis Advantage: The best board


Enter the Kinesis Advantage. It felt outside of my budget, with a price tag of around 275-350 USD. I immediately recognized the potential this keyboard had, given the keys offloaded to the thumbs, the concave "key wells" designed to contour my hands, and the built-in Dvorak layout for when I couldn't use the operating system's Dvorak setting. Additionally, it is a true split keyboard, so my forearms could be oriented more perpendicular to my body instead of bent inwards towards the keyboard's center (in other words, my arms can rest at my sides instead of being awkwardly placed in front of my chest). This layout also allows for less twisting of the forearms, thereby improving circulation and reducing keyboard strain.
Some things became immediately apparent when I switched to the Kinesis Advantage. It would not be pain-free. Using it for a couple months led to new soreness, but never any sharp or dull pains related to RSI. The manual mentions that a new keyboard like this exercises slightly different muscle groups, so your body will take some time to adjust and build up strength, even if the keyboard is ergonomically easier and more comfortable. Another thing that became apparent was that I had never learned to type properly. When typing on a normal keyboard, I would frequently roll my wrists to use my stronger fingers (minimizing use of my pinky, which felt intuitive as a new computer user). I would also frequently look for keys despite touch typing the majority of them. Simply put, I would type keys with the wrong fingers, sometimes with the wrong hand entirely, leading to a harder time getting used to a split keyboard like the Kinesis Advantage.
I persisted though. And here I am, 6+ years later, still using the Kinesis Advantage. I believe this keyboard has played a very important role in my studies, professional development, and mental health. It has provided me with a mechanism to minimize RSI due to the less-than-comfortable stretches and frequent movements demanded by normal keyboards. It has taught me how to type properly. It has mitigated frequent pain, thereby improving my mood and helping me maintain my own mental well-being.
A significant benefit of the Kinesis Advantage is the built-in piezoelectric speaker that actuates when a key circuit is closed. Instead of relying on the click of the key switches, I rely on the clicking sound produced by this speaker. It helps me reduce the amount of force I use when typing. I have always been a heavy typist, so an onlooker might think I am still very heavy on my keyboard, but alas, it is much better. Additionally, this speaker helps me maintain pace. Instead of being unsure whether I actually typed a key, the speaker unambiguously confirms the keystroke for me. Since it resides in the keyboard and is handled by a low-latency microcontroller, it is near-instantaneous in relation to the key switch actuation. If the computer were to handle this, there would be noticeable latency; it turns out desktop computers are really poor at low-latency sound without a lot of effort. It's just not worth the effort; let microcontrollers do what microcontrollers do best!
What about keyboard shortcuts?
I use vi keys with no adjustment. On Dvorak, j and k are on the bottom row under the left middle & ring fingers. h and l are on the right hand, at the left end of the home row and on the upper row of the pinky, respectively. This might seem awkward, but if typing jk or hl is causing you RSI problems, regardless of where the keys are located, you probably should not be using a keyboard. Sure, the keys might be easier to reach if they're still on the home row, but you shouldn't feel strain typing any key on your keyboard. If you do, you must address that problem: by switching keyboards, by not using a keyboard, and most importantly by changing keyboard habits (e.g. relearning how to type properly). The same applies to Emacs keys, though I think they're less awkward in Dvorak because p is not on the pinky; but that's a minor concern, as mentioned above, since one should be able to type all keys fine.
The other concern with keyboard shortcuts is how one types Control-c/Control-x/Control-v… or really any keyboard shortcut one is used to. First of all, you're probably typing it with one hand, which is a terrible sin unto your hands. Normal typing is very low impact on your hands because, in theory, you type at most a single key with each hand. This means if you type "X", you hold down shift with the opposite hand. That is why there are two shift keys. It also means you don't have to apply force on two ends of one hand to achieve a keyboard combination. In light of this, I strongly urge you, dear reader, to start "balancing" your keyboard shortcuts over two hands. If you need to type Control-c, hold down that Control with the opposite hand. Sure, it might seem slower at first, but if you're typing it frequently, it will cut down on strain. This is also why people "swear by vim": vim's modal user interface does not require many multi-key keystrokes. And this is why some Emacs users seem to get RSI: they don't discipline their modifier keys, mash their entire hand onto the keyboard, and then wonder why pain manifests from the needless micro-stresses they put on their hands.
Finally, maybe you thought I would answer the objection "But I know keystrokes by their location on the keyboard". I believe with a little bit of critical thinking, you may deduce this is patently false. Sure, you may have forgotten how you learned the keystrokes initially, but I am willing to bet you initially thought it through every time ("okay, I need to press Control and x: Control-x"). If you develop muscle memory for your keyboard, application shortcuts follow easily, and are very malleable in my experience. There may be some amount of relearning involved, but it's good for oneself, because it gives the user time to really recognize the keyboard shortcut and properly memorize it as a keyboard combination, instead of as a gesture of the hand that won't map well between keyboards that differ in the slightest way: shape, layout, or size.
Final thoughts
There is more to my RSI story. Even though I use a Kinesis Advantage keyboard most of the time, my endurance on a normal keyboard has improved significantly due to properly relearning how to type. Additionally, learning Dvorak also helps with endurance on any keyboard.
And no… there isn't a problem with using a weird keyboard layout. If you are doing any serious typing elsewhere, simply switch the layout, or bring a keyboard that can switch it in hardware, or bring your own computer. It's that simple. Life is too short to care about using QWERTY… it was never designed with ergonomics in mind. Just think about the word "ergonomics". On Dvorak, each successive key in "ergonomics" is on the other hand, so each hand has ample time to switch position (if necessary) between keystrokes. Take QWERTY, on the other hand: one types the word using only four digits across two hands, and each hand is responsible for typing a run of 3+ characters. Think about some common words, plot out how you would type them in QWERTY, then plot them out in basically any other keyboard layout. You'll see how god-awful QWERTY is for your hands if you want to minimize weird stretches and reaches.
When I moved to Milwaukee for university, I met several people who were also Dvorak users. It was amazing to pair-program with folks who not only know Dvorak, but are comfortable in vim and Emacs. I even lived with a Dvorak user. We could share each other's computers without dealing with trivial things like keyboard layouts. Nowadays I don't think anybody knows I type Dvorak until they wonder why I type slowly at their desk. I'm not self-conscious about it; because I'm a practiced typist, my hunting & pecking looks like a mildly slow touch typist using a new keyboard. You might wonder why I don't try to touch type QWERTY: no, I don't want to confuse myself. Learning a new keyboard layout takes dedication, and I can only truly touch type in Dvorak.
Thank you for reading my thoughts on using Dvorak and the Kinesis Advantage keyboard. I could talk about computer ergonomics all day, due to their importance in my lifestyle. Drop me a line; I'd love to hear from you. Interested in learning Dvorak? Stop. Give it a deep think. Maybe you don't need to learn a new keyboard layout. If you're not discouraged, load up keybr.com and practice. Go cold turkey; it worked for me, and it can work for you.
Linux dmesg --follow (-w) not working?
For a couple months now, I have noticed that running dmesg -w on my workstation does not appear to print out new kernel messages. In other words, dmesg --follow "hangs". Additionally, when running tail -f /var/log/kern.log to monitor new dmesg messages picked up by sysklogd (part of syslog-ng), the latest messages do not come through until sysklogd periodically "reopens" the /dev/kmsg kernel message buffer.
Why is this a problem?
This is a problem because I use the dmesg log to monitor important hardware-related messages, such as the kernel recognizing a USB device, or for diagnosing bluetooth/wifi issues. When I plug in a USB drive, the first thing I do is check dmesg for the following messages:
[10701.359834] usb 2-4.4: new high-speed USB device number 8 using ehci-pci
[10701.394801] usb 2-4.4: New USB device found, idVendor=12f7, idProduct=0313, bcdDevice= 1.10
[10701.394807] usb 2-4.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[10701.394810] usb 2-4.4: Product: MerryGoRound
[10701.394813] usb 2-4.4: Manufacturer: Memorex
[10701.394816] usb 2-4.4: SerialNumber: AAAAAAAAAAAA
[10701.395182] usb-storage 2-4.4:1.0: USB Mass Storage device detected
[10701.398885] scsi host7: usb-storage 2-4.4:1.0
[10702.401161] scsi 7:0:0:0: Direct-Access     Memorex  MerryGoRound     PMAP PQ: 0 ANSI: 0 CCS
[10702.401710] sd 7:0:0:0: Attached scsi generic sg6 type 0
[10702.651720] sd 7:0:0:0: [sde] 15654912 512-byte logical blocks: (8.02 GB/7.46 GiB)
[10702.652341] sd 7:0:0:0: [sde] Write Protect is off
[10702.652346] sd 7:0:0:0: [sde] Mode Sense: 23 00 00 00
[10702.652961] sd 7:0:0:0: [sde] No Caching mode page found
[10702.652965] sd 7:0:0:0: [sde] Assuming drive cache: write through
[10702.681473] sde: sde1 sde2
[10702.684869] sd 7:0:0:0: [sde] Attached SCSI removable disk
This output reports that a USB device was detected, where it is plugged in, its vendor/product information, what USB speed it is using, the size of the storage, the device name (/dev/sde), and its partitions (/dev/sde1, /dev/sde2). There are a lot of other messages written out to dmesg, such as the kernel detecting a bad USB cable, segmentation faults, and so on.
Given the importance of the above log output, I have developed a habit of running dmesg -w to monitor such kernel events. The -w tells dmesg to monitor for new messages. The long option is --follow.
In addition to dmesg -w not working as intended, syslog-ng log entries written to /var/log/kern.log are not written as they occur; instead the log is written in "bursts", which suggests sysklogd occasionally reopens /dev/kmsg, thereby reading in new log messages, but then the timestamps within each "burst" are all identical.
Which of my systems were affected?
I have two systems with a virtually identical OS installation: one is a workstation named snowcrash with an AMD FX-8350 on an ASRock M5A97 R2.0 motherboard; the other is an HP Elitebook 820 G4 named cyberdemon with an Intel Core i5-7300U. Curiously enough, the strange dmesg -w hang does not occur on cyberdemon, but does occur on snowcrash. Both hosts run mainline Linux, with both machines on 5.6.4. Looking through my /var/log/kern.log files, this behavior was already apparent on a 5.4.25 kernel. As we will see later, this coincides with the affected versions that others have reported.
Additionally, I asked my friend tyil, who happens to also use an AMD FX-8350 with Gentoo, to check for the bug; he also had the problem on 5.6.0.
Pinpointing the bug
First thing I did was find a way to reproduce the issue. I made an asciinema recording, which you can watch here. I then shared the recording on IRC, hoping somebody would know of a solution. I got some helpful and encouraging feedback, but nobody knew of this particular bug.
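The gist of the reproduction is simple; a sketch along these lines (not the exact steps from the recording):
dmesg -w &                                          # should print new messages as they arrive
sudo sh -c 'echo hello from userspace > /dev/kmsg'  # inject a message into the kernel ring buffer
# On an affected kernel, "hello from userspace" never shows up until dmesg is restarted.
kill %1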
The next step was to figure out if there was something wrong with /bin/dmesg. Running strace -o dmesg-strace.log dmesg -w shows the following pertinent lines:
openat(AT_FDCWD, "/dev/kmsg", O_RDONLY) = 3
lseek(3, 0, SEEK_DATA)                  = 0
read(3, "6,242,717857,-;futex hash table "..., 8191) = 79
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x16), ...}) = 0
openat(AT_FDCWD, "/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=26388, ...}) = 0
mmap(NULL, 26388, PROT_READ, MAP_SHARED, 4, 0) = 0x7fed92688000
close(4)                                = 0
futex(0x7fed925f9a14, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(1, "\33[32m[    0.717857] \33[0m\33[33mfut"..., 97) = 97
… SNIP …
read(3, "6,1853,137347289701,-;input: Mic"..., 8191) = 128
write(1, "\33[32m[137347.289701] \33[0m\33[33min"..., 140) = 140
read(3,
The last line indicates a pending read() that never completes. Note that file descriptor 3 refers to the /dev/kmsg device. Nothing out of the ordinary appears to occur, except that the read() simply hangs.
I was at a bit of a loss to explain the hanging read(). Honestly, I was really lost. So I went on and inspected the changes to /bin/dmesg shipped by util-linux, and did not find any sign of significant changes. I did run dmesg from master just to be sure. See the commit log of dmesg.c here. Additionally, I searched the util-linux bug tracker and found nothing relevant.
Given I had no solution yet, I decided to resort to Googling things, hoping somebody had discussed this bug before. Keywords I tried are:
- dmesg follow no longer working
- dmesg kmsg no more messages
- linux kmsg read hang
- "/dev/kmsg" hang
- "dmesg -w" hangs
None of these came up with anything useful. I was using DuckDuckGo mainly, with some Google queries sprinkled on top.
I then visited the torvalds/linux GitHub repository, searched for "kmsg", and did not find a commit that looked like a fix. I picked up from reading commits that /dev/kmsg is written to via the printk functions, so on a whim I decided to look at changes made to kernel/printk/printk.c. Reading through the commit log of printk.c, I realized the latest commit was likely the fix:
commit ab6f762f0f53162d41497708b33c9a3236d3609e
Author: Sergey Senozhatsky <protected@email>
Date:   Tue Mar 3 20:30:02 2020 +0900

    printk: queue wake_up_klogd irq_work only if per-CPU areas are ready

    printk_deferred(), similarly to printk_safe/printk_nmi, does not
    immediately attempt to print a new message on the consoles, avoiding
    calls into non-reentrant kernel paths, e.g. scheduler or timekeeping,
    which potentially can deadlock the system.

    Those printk() flavors, instead, rely on per-CPU flush irq_work to
    print messages from safer contexts. For same reasons (recursive
    scheduler or timekeeping calls) printk() uses per-CPU irq_work in
    order to wake up user space syslog/kmsg readers.

    However, only printk_safe/printk_nmi do make sure that per-CPU areas
    have been initialised and that it's safe to modify per-CPU irq_work.

    This means that, for instance, should printk_deferred() be invoked
    "too early", that is before per-CPU areas are initialised,
    printk_deferred() will perform illegal per-CPU access.

    Lech Perczak [0] reports that after commit 1b710b1b10ef ("char/random:
    silence a lockdep splat with printk()") user-space syslog/kmsg readers
    are not able to read new kernel messages.

    The reason is printk_deferred() being called too early (as was pointed
    out by Petr and John).

    Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU
    areas are initialized.

    Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/
    Reported-by: Lech Perczak <protected@email>
    Signed-off-by: Sergey Senozhatsky <protected@email>
    Tested-by: Jann Horn <protected@email>
    Reviewed-by: Petr Mladek <protected@email>
    Cc: Greg Kroah-Hartman <protected@email>
    Cc: Theodore Ts'o <protected@email>
    Cc: John Ogness <protected@email>
    Signed-off-by: Linus Torvalds <protected@email>
Unfortunately, my understanding of the Linux kernel architecture is not comprehensive, let alone competent. The commit message describes:
- syslog/kmsg readers — which include dmesg and syslog-ng,
- certain functions that don't immediately attempt to print a new message to the console,
- and syslog/kmsg readers that might not wake up.
Indeed, it's a bit hard for me to wrap my minimal kernel understanding around; however, reading the linked mailing list thread clears things up significantly:
After upgrading kernel on our boards from v4.19.105 to v4.19.106 we found out that syslog fails to read the messages after ones read initially after opening /proc/kmsg just after booting. I also found out, that output of 'dmesg --follow' also doesn't react on new printks appearing for whatever reason - to read new messages, reopening /proc/kmsg or /dev/kmsg was needed. I bisected this down to commit 15341b1dd409749fa5625e4b632013b6ba81609b ("char/random: silence a lockdep splat with printk()"), and reverting it on top of v4.19.106 restored correct behaviour.
— Lech Perczak
Now that sounds like the issue I'm having! The thread also notes the bug is present in 4.19.106 (fixed in 4.19.107 — see this commit), and affects users of 5.5.9, 5.5.15, and 5.6.3 (see the PATCHv2 thread).
Further reading related to the above commit
- The commit that broke things
- The commit that fixed things
- Regression in v4.19.106 breaking waking up of readers of /proc/kmsg and /dev/kmsg
- [PATCH] printk: queue wake_up_klogd irq_work only if per-CPU areas are ready
- [PATCHv2] printk: queue wake_up_klogd irq_work only if per-CPU areas are ready
- linux-4.19.y: Revert "char/random: silence a lockdep splat with printk()"
Applying the patch
The next step is to apply the patch, to test and verify that it fixes the issue.
Since fall of last year, I have used sys-kernel/vanilla-kernel to compile, install, and create an initramfs for my two machines. This is a great ebuild because it uses a kernel .config based off Arch Linux's, so it is compatible with most machines. It is also streamlined, in that it does all the work for you; no more manually configuring and remembering which make invocations are necessary to update the kernel. That's not hard to get right, but it's not particularly interesting in my use-case. Additionally, using sys-kernel/vanilla-kernel, the kernel & its modules are now packaged and can be distributed to my other machine as a binpkg. This streamlines deployment significantly.
In order to add the patch to this ebuild, I simply have to drop the patch file into /etc/portage/patches/sys-kernel/vanilla-kernel. In my case I chose to drop it in /etc/portage/patches/sys-kernel/vanilla-kernel:5.6.4, because I would rather the patch only be applied to the current kernel I have installed than to all versions of sys-kernel/vanilla-kernel. This ensures that when I upgrade to the upcoming 5.7 release (which has the fix included), the patch won't be applied, and emerge won't fail due to the patch not applying cleanly.
The commands (commit to my /etc/portage):
mkdir -p /etc/portage/patches/sys-kernel/vanilla-kernel:5.6.4
curl -o /etc/portage/patches/sys-kernel/vanilla-kernel:5.6.4/fix-dmesg--follow.patch \
    https://github.com/torvalds/linux/commit/ab6f762f0f53162d41497708b33c9a3236d3609e.patch
emerge -1av sys-kernel/vanilla-kernel:5.6.4
An hour later and the kernel is installed. After the reboot, indeed dmesg -w works once again! And the log messages in /var/log/kern.log have timestamps that correctly reflect the kernel time!
Conclusion
Even kernels have regressions. As discussed on IRC, I was reminded that the kernel project is not responsible for the userland, so it's possible such test cases might not be on the radar of most kernel developers. Perhaps it's the distros' responsibility to execute integrated system testing to catch bugs like this. In any case, it is still a surprise to see such a regression occur. We like to think of the kernel as an infallible magical machine that doesn't break unless you do something patently wrong, but this isn't really the case. We're all human.
I want to thank Tyil, Sergey (the patch author), Lech (the bug reporter), and some folks from the #linux IRC channel for helping me pinpoint this issue. The reader may think this is a lot of effort to go through to fix such a simple bug, but it's really important for the kernel to work; if the kernel misbehaves, anything is up for grabs. Unlike the crashing bugs we accept in bug-laden browser products, if the kernel crashes or misbehaves, the ramifications are almost as bad as failing hardware: you'll lose your application data, your productivity, and your trust in the operating system itself.
It is important to mention that LTS (long-term support) kernels exist. Given the amount of trouble I went through to address this issue, and the fact that I would rather not have things breaking, I don't think I should be running a mainline kernel at the moment. Perhaps I can install both side by side, then pick 'n choose which kernel to use du jour.
I am very interested to hear your suggestions, dear reader, for kernel maintenance and version selection strategies. You can find my contact details at https://winny.tech/. Thank you for reading.
Debugging Zathura, GTK (don't forget about seccomp)
Zathura is a fantastic PDF viewer. It also supports PostScript, DjVu, and comic book archives. In particular, it supports using mupdf as the backend, so it's rather fast (unlike poppler, which is used by evince and friends). Here is a screenshot of Zathura:

Now that I've introduced Zathura, I want to talk about a problem I had recently. I wanted to print a document a couple weeks ago, but found that whenever I issued a :print command in Zathura, the program would crash. I got this error in dmesg:
[94592.482544] zathura[26424]: segfault at 201 ip 00007f0bc27d0086 sp 00007ffeada0d0d8 error 4 in libc-2.29.so[7f0bc2752000+158000]
[94592.482557] Code: 0f 1f 40 00 66 0f ef c0 66 0f ef c9 66 0f ef d2 66 0f ef db 48 89 f8 48 89 f9 48 81 e1 ff 0f 00 00 48 81 f9 cf 0f 00 00 77 6a <f3> 0f 6f 20 66 0f 74 e0 66 0f d7 d4 85 d2 74 04 0f bc c2 c3 48 83
Let's get a crash dump
I spent a bunch of time trying to get crash dumps from Zathura, and was largely unsuccessful, until I realized the wonkiness I was dealing with (see below).
Try to run Zathura in GDB
First I tried getting a backtrace directly from gdb. It appears to run, but zathura does not create a window:
winston@snowcrash ~ $ gdb --args zathura ~/docs/uni/classes/cs-655/handouts/spim_documentation.pdf
Reading symbols from zathura...
(gdb) run
Starting program: /usr/bin/zathura /home/winston/docs/uni/classes/cs-655/handouts/spim_documentation.pdf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff5f26700 (LWP 18882)]
[New Thread 0x7ffff5725700 (LWP 18883)]
Cannot find user-level thread for LWP 18744: generic error
(gdb)
The error message Cannot find user-level thread for LWP 18744: generic error is mentioned on the Sourceware Wiki. The Wiki FAQ suggests I may have a mismatch between libthread_db.so.1 and libpthread.so.0, or that I am using a 64-bit debugger with a 32-bit program. Both zathura and gdb are amd64 programs on my box, and I only have one version of amd64 glibc installed. Given the facts, it seemed like I was dealing with a different problem.
What's more, I tested running a program in gdb (in my case, cat), and it worked fine:
winston@snowcrash ~ $ gdb cat
Reading symbols from cat...
(gdb) run
Starting program: /bin/cat
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7eb5cb5 in __GI___libc_read (fd=0, buf=0x7ffff7fb0000, nbytes=131072) at ../sysdeps/unix/sysv/linux/read.c:26
26        return SYSCALL_CANCEL (read, fd, buf, nbytes);
Try to attach a Zathura process in GDB
When attaching GDB to a process, make sure you have permission to do
so, out of the box most distros limit debuggers to either attach to
child processes or only if gdb is ran as root. In any case one can run
sysctl kernel.yama.ptrace_scope=0
to temporarily loosen restrictions
to allow attaching gdb to any process of the same user. See
ptrace(2)
and grep for ptrace_scope
.
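In practice that looks like this (the sysctl.d file name is illustrative, and only needed if you want the relaxed setting to persist across reboots):
sudo sysctl kernel.yama.ptrace_scope=0
echo 'kernel.yama.ptrace_scope = 0' | sudo tee /etc/sysctl.d/99-ptrace.conf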
Now that gdb can attach to any other processes I own, I tried to attach to zathura, without any success:
winston@snowcrash ~ $ gdb -p 3541 zathura
Reading symbols from zathura...
Attaching to program: /usr/bin/zathura, process 3541
ptrace: Operation not permitted.
(gdb)
Indeed, this also worked fine with cat:
winston@snowcrash ~ $ gdb -p 6885 cat
Reading symbols from cat...
Attaching to program: /bin/cat, process 6885
Reading symbols from /lib64/libc.so.6...
Reading symbols from /usr/lib/debug//lib64/libc-2.29.so.debug...
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug//lib64/ld-2.29.so.debug...
0x00007f93b5fa4cb5 in __GI___libc_read (fd=0, buf=0x7f93b609f000, nbytes=131072) at ../sysdeps/unix/sysv/linux/read.c:26
26        return SYSCALL_CANCEL (read, fd, buf, nbytes);
(gdb)
Try to get Zathura to dump core
I moved on to the next approach to get a backtrace — writing core files. First, I'll describe what that entails on my setup:
Enabling core dumps
On my setup, a relatively vanilla Gentoo with OpenRC, it is straightforward to enable this — just create /etc/security/limits.d/core.conf with the single line (see limits.conf(5)):
* soft core unlimited
And relogin. Verify that the output of ulimit -a shows unlimited core file size.
winston@snowcrash ~ $ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63422
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 63422
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
The second part is ensuring the sysctl kernel.core_pattern is set to something reasonable. If it's a pipeline (the first character is a |), make sure you understand what that pipeline does, or set it to a simple filename pattern. More information is in core(5). A good file pattern might be %e.%h.%t.core, which produces core files such as cat.snowcrash.1586300242.core. The time can be converted into a human-readable form with date -d@1586300242.
winston@snowcrash ~ $ sudo sysctl kernel.core_pattern=%e.%h.%t.core
kernel.core_pattern = %e.%h.%t.core
winston@snowcrash ~ $ cat
^\Quit (core dumped)
winston@snowcrash ~ $ ls *.core
cat.snowcrash.1586300242.core
winston@snowcrash ~ $ coretime() { date -d @"$(cut -d. -f3 <<<"$1")"; }
winston@snowcrash ~ $ coretime cat.snowcrash.1586300242.core
Tue 07 Apr 2020 05:57:22 PM CDT
Getting a core dump
I fired up Zathura for what felt like the tenth time and triggered the bug, but indeed, no core dump! I even tried running zathura and sending SIGQUIT (^\ — Control-Backslash in most terminals), which should cause the process to dump core, but to no avail.
In the above shell session, I demonstrated that I was able to dump core with cat, so indeed core dumps are enabled.
Investigating why I can't get a crash dump
This felt like madness. There was no obvious reason why I couldn't get a backtrace via any of the above techniques. So I took a deep breath and grabbed the source code, thinking they must be doing something a bit too clever for my liking.
Getting the source
On Gentoo I usually do something like the following to grab program source:
winston@snowcrash ~ $ ebuild $(equery w zathura) prepare
 * zathura-0.4.5.tar.gz BLAKE2B SHA512 size ;-) ...                      [ ok ]
 * checking ebuild checksums ;-) ...                                     [ ok ]
 * checking miscfile checksums ;-) ...                                   [ ok ]
>>> Unpacking source...
>>> Unpacking zathura-0.4.5.tar.gz to /var/tmp/portage/app-text/zathura-0.4.5/work
>>> Source unpacked in /var/tmp/portage/app-text/zathura-0.4.5/work
>>> Preparing source in /var/tmp/portage/app-text/zathura-0.4.5/work/zathura-0.4.5 ...
>>> Source prepared.
Scanning the source
A quick scan of the source tree yields some very interesting files — including some that will become more interesting as you read on:
winston@snowcrash .../work/zathura-0.4.5 $ grep -riF ptrace .
./zathura/seccomp-filters.c: /* prevent escape via ptrace */
./zathura/seccomp-filters.c:  DENY_RULE(ptrace);
./zathura/seccomp-filters.c: /* prevent escape via ptrace */
Notice the filename. It appears Zathura utilizes seccomp and somehow messes about with debuggers' use of ptrace(). Here is a tree of the files I'll be walking through:
winston@snowcrash .../work/zathura-0.4.5 $ tree -L 2 -F \
> -P 'meson*|README|AUTHORS|LICENSE|main.[ch]|*seccomp*.[ch]|zathura.[ch]|config.[ch]'
.
├── AUTHORS
├── data/
│   ├── icon-128/
│   ├── icon-16/
│   ├── icon-256/
│   ├── icon-32/
│   ├── icon-64/
│   └── meson.build
├── doc/
│   ├── api/
│   ├── configuration/
│   ├── installation/
│   ├── man/
│   ├── meson.build
│   └── usage/
├── LICENSE
├── meson.build
├── meson_options.txt
├── po/
│   └── meson.build
├── README
├── subprojects/
├── tests/
│   └── meson.build
└── zathura/
    ├── config.c
    ├── config.h
    ├── main.c
    ├── seccomp-filters.c
    ├── seccomp-filters.h
    ├── zathura.c
    └── zathura.h

16 directories, 16 files
Where Seccomp is used in the code
Indeed, if we look in seccomp-filters.c, it has a couple lines that suggest zathura prevents dumping core & using ptrace():
#define ADD_RULE(str_action, action, call, ...)                   \
  do {                                                            \
    seccomp_rule_add(ctx, action, SCMP_SYS(call), __VA_ARGS__);   \
  } while (0)

#define DENY_RULE(call) ADD_RULE("kill", SCMP_ACT_KILL, call, 0)

int
seccomp_enable_basic_filter(void)
{
  /* prevent escape via ptrace */
  if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0)) {
    girara_error("prctl PR_SET_DUMPABLE");
    return -1;
  }
}
Please note I tidied up the code for clarity. Looking in prctl(2), we can see that prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) not only prevents core dumps, but also prevents processes from attaching to Zathura to debug it.
Now to figure out how it's called. Take a look at zathura.c:
bool
zathura_init(zathura_t* zathura)
{
#ifdef WITH_SECCOMP
  /* initialize seccomp filters */
  switch (zathura->global.sandbox) {
    case ZATHURA_SANDBOX_NONE:
      girara_debug("Sandbox deactivated.");
      break;
    case ZATHURA_SANDBOX_NORMAL:
      girara_debug("Basic sandbox allowing normal operation.");
      if (seccomp_enable_basic_filter() != 0) {
        girara_error("Failed to initialize basic seccomp filter.");
        goto error_free;
      }
      break;
    case ZATHURA_SANDBOX_STRICT:
      girara_debug("Strict sandbox preventing write and network access.");
      if (seccomp_enable_strict_filter() != 0) {
        girara_error("Failed to initialize strict seccomp filter.");
        goto error_free;
      }
      break;
  }
#endif
}
In the zathura_init procedure, seccomp is conditionally compiled in using an #ifdef check. It becomes apparent there are three sandbox modes supported by Zathura. Next, let's see where zathura_init() is called in main.c:
static zathura_t*
init_zathura(const char* config_dir, const char* data_dir,
             const char* cache_dir, const char* plugin_path, char** argv,
             const char* synctex_editor, Window embed)
{
  /* create zathura session */
  zathura_t* zathura = zathura_create();
  if (zathura == NULL) {
    return NULL;
  }

  /* Init zathura */
  if (zathura_init(zathura) == false) {
    zathura_free(zathura);
    return NULL;
  }

  return zathura;
}

/* main function */
GIRARA_VISIBLE int
main(int argc, char* argv[])
{
  /* CLI parsing and initialization */

  /* Create zathura session */
  zathura_t* zathura = init_zathura(config_dir, data_dir, cache_dir,
                                    plugin_path, argv, synctex_editor, embed);

  /* More initialization logic */

  /* run zathura */
  gtk_main();

  /* free zathura */

  return ret;
}
The program's entry point, main(), calls init_zathura(), which itself calls zathura_init(), which then calls into seccomp_enable_*_filter(). This makes it clear that Zathura always initializes sandboxing on startup, unless zathura->global.sandbox is ZATHURA_SANDBOX_NONE.
If one looks in the top-level meson.build, we can see where the WITH_SECCOMP preprocessor definition comes from:
if seccomp.found()
  build_dependencies += seccomp
  defines += '-DWITH_SECCOMP'
  additional_sources += files('zathura/seccomp-filters.c')
endif
Now comes the matter of how one debugs this application. Initially I succeeded by configuring Gentoo to not use seccomp with Zathura. After a second look, there appears to be a sandbox configuration option as well. In the next few sections I explain how to manually disable seccomp, both with Gentoo USE flags and by configuring zathura at runtime.
Disabling Seccomp by USE flag
Taking a closer look at the app-text/zathura package in Gentoo's ebuild repository, there is a seccomp USE flag.
winston@snowcrash ~ $ eix -e app-text/zathura
[I] app-text/zathura
     Available versions:  0.4.3^t 0.4.4^t{tbz2} (~)0.4.5^t{tbz2} **9999*l^t {doc +magic seccomp sqlite synctex test}
     Installed versions:  0.4.5^t{tbz2}(05:21:00 PM 04/07/2020)(doc magic seccomp -sqlite -synctex -test)
     Homepage:            http://pwmt.org/projects/zathura/
     Description:         A highly customizable and functional document viewer
Let's disable this seccomp USE flag:
snowcrash ~ # echo 'app-text/zathura -seccomp' >> /etc/portage/package.use/zathura
snowcrash ~ # emerge -1av app-text/zathura

These are the packages that would be merged, in order:

Calculating dependencies ... done!
[ebuild   R   ~] app-text/zathura-0.4.5::gentoo  USE="doc magic -seccomp* -sqlite -synctex -test" 0 KiB

Total: 1 package (1 reinstall), Size of downloads: 0 KiB

Would you like to merge these packages? [Yes/No]
With Zathura rebuilt without seccomp support, I am able to attach a debugger. Success!
Disabling Seccomp via configuration option
After reviewing zathura's configuration code, I found there is a sandbox option that can be configured in one's zathurarc. It was not mentioned in the zathura(1) manpage, nor in its --help text. I discovered it in the README. Later I also found it mentioned in the zathurarc(5) manpage. As such, heed this friendly reminder: make sure to read the README, and make sure to read the related manpages listed in the SEE ALSO section of a given manpage!
Back to the matter at hand. Looking at config.c:
static void
cb_sandbox_changed(girara_session_t* session, const char* UNUSED(name),
                   girara_setting_type_t UNUSED(type), const void* value,
                   void* UNUSED(data))
{
  g_return_if_fail(value != NULL);
  g_return_if_fail(session != NULL);
  g_return_if_fail(session->global.data != NULL);
  zathura_t* zathura = session->global.data;

  const char* sandbox = value;
  if (g_strcmp0(sandbox, "none") == 0) {
    zathura->global.sandbox = ZATHURA_SANDBOX_NONE;
  } else if (g_strcmp0(sandbox, "normal") == 0) {
    zathura->global.sandbox = ZATHURA_SANDBOX_NORMAL;
  } else if (g_strcmp0(sandbox, "strict") == 0) {
    zathura->global.sandbox = ZATHURA_SANDBOX_STRICT;
  } else {
    girara_error("Invalid sandbox option");
  }
}

void
config_load_default(zathura_t* zathura)
{
  girara_session_t* gsession = zathura->ui.session;

  /* default to no sandbox when running in WSL */
  const char* string_value = running_under_wsl() ? "none" : "normal";
  girara_setting_add(gsession, "sandbox", string_value, STRING, true,
                     _("Sandbox level"), cb_sandbox_changed, NULL);
}
Now we know there is an event listener for the sandbox configuration option. I know I skipped a few steps, but the pattern is pretty clear for my purposes. After adding set sandbox none to my ~/.config/zathura/config, Zathura was able to start up without a sandbox, and I was able to attach a debugger.
Getting a more informative backtrace
Now, with seccomp disabled, I was able to get a crash dump:
winston@snowcrash ~ $ gdb --args zathura ~/docs/uni/classes/cs-655/handouts/spim_documentation.pdf
Reading symbols from zathura...
(gdb) run
Starting program: /usr/bin/zathura /home/winston/docs/uni/classes/cs-655/handouts/spim_documentation.pdf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff5f73700 (LWP 15633)]
[New Thread 0x7ffff5772700 (LWP 15634)]

(zathura:15629): dbind-WARNING **: 23:49:52.224: Couldn't register with accessibility bus: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[New Thread 0x7fffe52b7700 (LWP 15639)]
[New Thread 0x7fffe4ab6700 (LWP 15645)]
[New Thread 0x7fffcbfff700 (LWP 15646)]

Thread 1 "zathura" received signal SIGSEGV, Segmentation fault.
__strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
120             movdqu  (%rax), %xmm4
(gdb) bt
#0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
#1  0x00007ffff722753d in g_strjoinv () at /usr/lib64/libglib-2.0.so.0
#2  0x00007fffec65e31b in avahi_service_resolver_cb () at /usr/lib64/gtk-3.0/3.0.0/printbackends/libprintbackend-cups.so
#3  0x00007ffff73d4973 in g_task_return_now () at /usr/lib64/libgio-2.0.so.0
#4  0x00007ffff73d531d in g_task_return.part () at /usr/lib64/libgio-2.0.so.0
#5  0x00007ffff7429f0f in g_dbus_connection_call_done () at /usr/lib64/libgio-2.0.so.0
#6  0x00007ffff73d4973 in g_task_return_now () at /usr/lib64/libgio-2.0.so.0
#7  0x00007ffff73d49a9 in complete_in_idle_cb () at /usr/lib64/libgio-2.0.so.0
#8  0x00007ffff72064ef in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
#9  0x00007ffff72068c0 in g_main_context_iterate.isra () at /usr/lib64/libglib-2.0.so.0
#10 0x00007ffff7206bd3 in g_main_loop_run () at /usr/lib64/libglib-2.0.so.0
#11 0x00007ffff796a105 in gtk_main () at /usr/lib64/libgtk-3.so.0
#12 0x0000555555561871 in main ()
(gdb) frame 1
#1  0x00007ffff722753d in g_strjoinv () from /usr/lib64/libglib-2.0.so.0
(gdb) list
115     #ifdef AS_STRNLEN
116             andq    $-16, %rax
117             FIND_ZERO
118     #else
119             /* Test first 16 bytes unaligned.  */
120             movdqu  (%rax), %xmm4
121             PCMPEQ  %xmm0, %xmm4
122             pmovmskb        %xmm4, %edx
123             test    %edx, %edx
124             je      L(next48_bytes)
Notice how the frame's listing shows assembly instructions. It looks like we are missing debug symbols. Additionally, it would be nice to have the sources installed, because then the debugger can show us line-for-line backtraces, and it becomes easy to single-step to the crash.
Installing debug symbols on Gentoo
On Gentoo, one can use equery b to discover what package owns a particular file:
winston@snowcrash ~ $ for f in /usr/lib64/libglib-2.0.so.0 \
> /usr/lib64/gtk-3.0/3.0.0/printbackends/libprintbackend-cups.so \
> /usr/lib64/libgio-2.0.so.0 /usr/lib64/libglib-2.0.so.0 \
> /usr/lib64/libgtk-3.so.0; do
> equery -q b $f
> done | sort -u
dev-libs/glib-2.60.7-r2
x11-libs/gtk+-3.24.13
I came up with the following packages to install debug symbols for:
- dev-libs/glib
- x11-libs/gtk+:3
- and app-text/zathura for good measure.
Using /etc/portage/env/debugsyms and /etc/portage/env/installsources — Portage environment files loosely based off the Gentoo Wiki — I can simply add the following lines to my /etc/portage/package.env/:
dev-libs/glib debugsyms installsources
x11-libs/gtk+:3 debugsyms installsources
app-text/zathura debugsyms installsources
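Those environment files look roughly like this, following the Gentoo Wiki (a sketch; adjust to taste):
# /etc/portage/env/debugsyms
CFLAGS="${CFLAGS} -ggdb"
CXXFLAGS="${CXXFLAGS} -ggdb"
FEATURES="${FEATURES} splitdebug compressdebug"

# /etc/portage/env/installsources
FEATURES="${FEATURES} installsources"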
And then I manually re-emerged each package, because unfortunately Portage does not appear to consider environment files when determining when to rebuild packages.
snowcrash ~ # emerge -1av app-text/zathura dev-libs/glib x11-libs/gtk+:3

These are the packages that would be merged, in order:

Calculating dependencies ... done!
[ebuild   R    ] dev-libs/glib-2.60.7-r2:2::gentoo  USE="dbus debug* (mime) xattr -fam -gtk-doc (-selinux) -static-libs -systemtap -test -utils" ABI_X86="32 (64) (-x32)" 0 KiB
[ebuild   R    ] x11-libs/gtk+-3.24.13:3::gentoo  USE="X cups examples introspection xinerama (-aqua) -broadway -cloudprint -colord -gtk-doc -test -vim-syntax -wayland" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild   R   ~] app-text/zathura-0.4.5::gentoo  USE="doc magic -seccomp -sqlite -synctex -test" 0 KiB

Total: 3 packages (3 reinstalls), Size of downloads: 0 KiB

Would you like to merge these packages? [Yes/No]
Portage installs the source code under /usr/src/debug/${CATEGORY}/${PF}, where PF is the full package name, version, and revision, such as /usr/src/debug/x11-base/xorg-server-1.20.5-r2. Debug symbols are installed under /usr/lib/debug.
A better backtrace
After getting the debug symbols & sources installed, I now get the following backtrace:
winston@snowcrash ~ $ gdb --args zathura ~/docs/uni/classes/cs-655/handouts/spim_documentation.pdf
Reading symbols from zathura...
Reading symbols from /usr/lib/debug//usr/bin/zathura.debug...
(gdb) run
Starting program: /usr/bin/zathura /home/winston/docs/uni/classes/cs-655/handouts/spim_documentation.pdf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff5f62700 (LWP 12321)]
[New Thread 0x7ffff5761700 (LWP 12322)]
[New Thread 0x7fffe52b7700 (LWP 12329)]
[New Thread 0x7fffe4ab6700 (LWP 12333)]
[New Thread 0x7fffcbfff700 (LWP 12334)]

Thread 1 "zathura" received signal SIGSEGV, Segmentation fault.
__strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
120             movdqu  (%rax), %xmm4
(gdb) bt
#0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
#1  0x00007ffff721680d in g_strjoinv (separator=separator@entry=0x7fffec702546 "-", str_array=str_array@entry=0x555555e56ac0) at ../glib-2.60.7/glib/gstrfuncs.c:2585
#2  0x00007fffec6fd31b in avahi_service_resolver_cb (source_object=<optimized out>, res=<optimized out>, user_data=user_data@entry=0x555555e06040) at /usr/src/debug/x11-libs/gtk+-3.24.13/gtk+-3.24.13/modules/printbackends/cups/gtkprintbackendcups.c:3223
#3  0x00007ffff73caf79 in g_task_return_now (task=0x555555ea01a0 [GTask]) at ../glib-2.60.7/gio/gtask.c:1209
#4  0x00007ffff73cba9d in g_task_return (task=0x555555ea01a0 [GTask], type=<optimized out>) at ../glib-2.60.7/gio/gtask.c:1278
#5  0x00007ffff73cc00c in g_task_return (type=G_TASK_RETURN_SUCCESS, task=<optimized out>) at ../glib-2.60.7/gio/gtask.c:1678
#6  g_task_return_pointer (task=<optimized out>, result=<optimized out>, result_destroy=<optimized out>) at ../glib-2.60.7/gio/gtask.c:1683
#7  0x0000000000000000 in ()
If you feel inclined, here is the full backtrace (bt full). A lot more useful, huh?
Analyzing the crash
Knowing GDB is a powerful, useful skill. Nothing beats understanding your debugger. Not even printf debugging.
Let's start with the source of the crash:
(gdb) frame 0
#0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:120
120             movdqu  (%rax), %xmm4
(gdb) list
115     #ifdef AS_STRNLEN
116             andq    $-16, %rax
117             FIND_ZERO
118     #else
119             /* Test first 16 bytes unaligned.  */
120             movdqu  (%rax), %xmm4
121             PCMPEQ  %xmm0, %xmm4
122             pmovmskb        %xmm4, %edx
123             test    %edx, %edx
124             je      L(next48_bytes)
(gdb) info registers rax
rax            0x61                97
So strlen is trying to dereference address 0x61; that doesn't look right. Checking the output of info proc mappings shows zathura doesn't have mapped memory that corresponds to the value in rax.
(gdb) info proc mappings
process 12314
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
      0x555555554000     0x55555555f000     0xb000        0x0 /usr/bin/zathura
      0x55555555f000     0x555555581000    0x22000     0xb000 /usr/bin/zathura
…SNIP…
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x27000 /lib64/ld-2.29.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000        0x0
      0x7ffffffdd000     0x7ffffffff000    0x22000        0x0 [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000        0x0 [vsyscall]
Now let's carry on with the second frame.
(gdb) frame 1
#1  0x00007ffff721680d in g_strjoinv (separator=separator@entry=0x7fffec702546 "-", str_array=str_array@entry=0x555555e56ac0) at ../glib-2.60.7/glib/gstrfuncs.c:2585
2585          for (i = 1; str_array[i] != NULL; i++)
(gdb) info frame
Stack level 1, frame at 0x7fffffffd140:
 rip = 0x7ffff721680d in g_strjoinv (../glib-2.60.7/glib/gstrfuncs.c:2585); saved rip = 0x7fffec6fd31b
 called by frame at 0x7fffffffd200, caller of frame at 0x7fffffffd0f0
 source language c.
 Arglist at 0x7fffffffd0e8, args: separator=separator@entry=0x7fffec702546 "-", str_array=str_array@entry=0x555555e56ac0
 Locals at 0x7fffffffd0e8, Previous frame's sp is 0x7fffffffd140
 Saved registers:
  rbx at 0x7fffffffd108, rbp at 0x7fffffffd110, r12 at 0x7fffffffd118, r13 at 0x7fffffffd120, r14 at 0x7fffffffd128, r15 at 0x7fffffffd130, rip at 0x7fffffffd138
(gdb) list
2580      gsize separator_len;
2581
2582      separator_len = strlen (separator);
2583      /* First part, getting length */
2584      len = 1 + strlen (str_array[0]);
2585      for (i = 1; str_array[i] != NULL; i++)
2586        len += strlen (str_array[i]);
2587      len += separator_len * (i - 1);
2588
2589      /* Second part, building string */
(gdb) print *str_array
$1 = (gchar *) 0x555555cea670 "Canon"
(gdb) print str_array[1]
$2 = (gchar *) 0x555555dac870 "MF632C"
(gdb) print str_array[2]
$3 = (gchar *) 0x555555e87150 "634C"
(gdb) print str_array[3]
$4 = (gchar *) 0x61 <error: Cannot access memory at address 0x61>
Indeed, we cannot access memory at address 0x61. And looking at the source and documentation for g_strjoinv, the str_array argument should be a NULL-terminated array of strings.
Let's look at the third frame.
(gdb) frame 2
#2  0x00007fffec6fd31b in avahi_service_resolver_cb (source_object=<optimized out>, res=<optimized out>, user_data=user_data@entry=0x555555e06040) at /usr/src/debug/x11-libs/gtk+-3.24.13/gtk+-3.24.13/modules/printbackends/cups/gtkprintbackendcups.c:3223
3223          data->printer_name = g_strjoinv ("-", printer_name_compressed_strv);
(gdb) info frame
Stack level 2, frame at 0x7fffffffd200:
 rip = 0x7fffec6fd31b in avahi_service_resolver_cb (/usr/src/debug/x11-libs/gtk+-3.24.13/gtk+-3.24.13/modules/printbackends/cups/gtkprintbackendcups.c:3223); saved rip = 0x7ffff73caf79
 called by frame at 0x7fffffffd220, caller of frame at 0x7fffffffd140
 source language c.
 Arglist at 0x7fffffffd138, args: source_object=<optimized out>, res=<optimized out>, user_data=user_data@entry=0x555555e06040
 Locals at 0x7fffffffd138, Previous frame's sp is 0x7fffffffd200
 Saved registers:
  rbx at 0x7fffffffd1c8, rbp at 0x7fffffffd1d0, r12 at 0x7fffffffd1d8, r13 at 0x7fffffffd1e0, r14 at 0x7fffffffd1e8, r15 at 0x7fffffffd1f0, rip at 0x7fffffffd1f8
(gdb) list
3218                  printer_name_compressed_strv[j] = printer_name_strv[i];
3219                  j++;
3220                }
3221            }
3222
3223          data->printer_name = g_strjoinv ("-", printer_name_compressed_strv);
3224
3225          g_strfreev (printer_name_strv);
3226          g_free (printer_name_compressed_strv);
3227          g_free (printer_name);
(gdb) print printer_name_compressed_strv
$5 = (gchar **) 0x555555e56ac0
Note the value of printer_name_compressed_strv, 0x555555e56ac0, corresponds to the value of str_array in the previous frame (g_strjoinv()). The full definition of avahi_service_resolver_cb can be read on GNOME's GitLab.
As mentioned above, we found the string array was missing its NULL sentinel value. Looking at the following code, do you see the bug? I honestly didn't:
printer_name = g_strdup (name);
g_strcanon (printer_name, PRINTER_NAME_ALLOWED_CHARACTERS, '-');
printer_name_strv = g_strsplit_set (printer_name, "-", -1);
printer_name_compressed_strv = g_new0 (gchar *, g_strv_length (printer_name_strv));
for (i = 0, j = 0; printer_name_strv[i] != NULL; i++)
  {
    if (printer_name_strv[i][0] != '\0')
      {
        printer_name_compressed_strv[j] = printer_name_strv[i];
        j++;
      }
  }

data->printer_name = g_strjoinv ("-", printer_name_compressed_strv);
After spending some time refamiliarizing myself with glib and GTK+, and Googling, the bug only became apparent once I found the commit that fixed it. Let me preface that commit with a brief explanation.
- g_strcanon() replaces characters not in PRINTER_NAME_ALLOWED_CHARACTERS with a hyphen, i.e. "Canon MF632C/634C" becomes "Canon-MF632C-634C".
- g_strsplit_set() splits printer_name on "-", giving the following array:

  (gdb) print *printer_name_strv@g_strv_length(printer_name_strv)+1
  $12 = {0x555555eff250 "Canon", 0x555555ece120 "MF632C", 0x555555f397e0 "634C", 0x0}

- g_new0() allocates a zero-filled array of pointers of length 3, the number of elements returned by g_strsplit_set().
- The for loop copies over the contents of printer_name_strv, but skips empty elements, e.g. in the case the above string had two adjacent hyphens.
- Finally, g_strjoinv() joins the strings of the printer_name_compressed_strv array on "-".
The problem occurs because the call to g_new0 does not account for the extra array sentinel element; the fix is to allocate one more pointer, g_new0 (gchar *, g_strv_length (printer_name_strv) + 1), so the array stays NULL-terminated. Indeed that is what the GitLab commit discusses.
Best way to fix it?
In this case, I did what was best for my time and effort. GTK+ 3.24.14 had been out for a couple of months, and GTK+ 3.24.13 is not much older. So instead of dealing with backports, that is, making a patch for the older version of GTK+ and adding it to my install, I took the liberty of bumping my local GTK+ 3 install to GTK+ 3.24.14.
Either approach is not too tricky, in all honesty, given that adding patches to a Gentoo system is as easy as placing the patch in the correct path, and bumping the ebuild usually entails simply unmasking it via accepting the keyworded version (in my case ~amd64).
As such this is all I had to do to fix the issue:
snowcrash ~ # echo '~x11-libs/gtk+-3.24.14 ~amd64' >> /etc/portage/package.accept_keywords/gtk
snowcrash ~ # emerge -uDU -av --changed-deps --verbose-conflicts @world

These are the packages that would be merged, in order:

[ebuild     U ~] x11-libs/gtk+-3.24.14:3::gentoo [3.24.13:3::gentoo] USE="X cups examples introspection xinerama (-aqua) -broadway -cloudprint -colord -gtk-doc -test -vim-syntax -wayland" ABI_X86="(64) -32 (-x32)" 0 KiB

Total: 1 package (1 upgrade), Size of downloads: 0 KiB

Would you like to merge these packages? [Yes/No]
And voilà. I am able to print!
Conclusion
In this post, I described several related challenges:
- How to get a backtrace
- What happens when seccomp blocks ptrace
- How to install debug symbols and source code on Gentoo
- What it looks like to pick apart a backtrace
- And how I fixed this particular issue
In retrospect, I should have reported the bug to the Gentoo tracker, because this bug came from the selection of patches cherry-picked off the GTK git repository. Thankfully the affected versions of GTK are no longer in the official Gentoo ebuild repository. I'll be sure to report such bugs going forward! It pleases me how easy Gentoo makes it to debug stuff.
I found the entire experience informative, but also incredibly irritating. I'm no stranger to debugging crashes and grabbing debug symbols, but when something gets in the way of getting backtraces, things get really frustrating. The debugging part is the fun part; dealing with wildcards like seccomp preventing ptrace() with no meaningful error messages is a huge time waster.
The lack of literature about debugging seccomp-enabled applications was a factor in this frustration. I only figured out the issue by taking the time to read the source code, grepping for ptrace, and understanding seccomp as it's used in Zathura. Had I read the README I could have saved some time; it's important to read all the documentation.
If you made it this far, you have a lot of patience for ramblings and hobbyist computing. You're terrific! Next time you run into a segfault, put what you've learned to good use!
A week in the life of Winston
During these interesting times, I figured it would be a good idea to describe how I've been keeping myself busy, bugs I've fixed, and some of the daily tasks/routines that keep my day structured.
For context: I moved house on the weekend of March 21st, a couple of weeks before the Covid-19 fiasco became a front-and-center concern for my geographical region. I am finishing my undergrad in computer science — this is my last semester. The classes I am taking are Compilers, Compiler Implementation Laboratory, and Matrices and Applications. I am currently living in a very rural area, so I have been very successful in maintaining social distance in all aspects of my life.
Daily routine
- Eat the exact same thing every day: Eggs, Bacon, Corn Tortillas. Take any dietary supplements/vitamins.
- Make coffee — Aeropressed. Currently sourcing coffee from Ruby. I wish to see more Ethiopia/Kenya/Peru coffee but even a more earthy Colombia coffee is agreeable.
- Spend 30–60 minutes checking email, news, IRC.
- Wash up
- Spend about 2-3 hours on current tasks — schoolwork, bugfixing, packaging, or researching.
- Eat lunch, make some tea
- Spend another 2-3 hours on same tasks.
- Take a break, preferably away from computer
- Spend another couple hours
- Have dinner
- Spend some time with housemates, make sure we're all on the same page
- Spend another couple hours on tasks
- Wash up, go to bed
In retrospect, I think I should replace one of those task blocks with off-task things such as gaming, reading books, and so on. That's way too much time on task, and knowing myself, I end up being less productive that way.
Package work for this week
For a long time now, I've aimed to keep all my system-wide software packaged in the OS package manager. This allows me to easily rebuild my systems, or deploy the OS on new computers. This also means updates become a lot easier, because the package manager can track things such as rebuilding all library dependencies, and ensuring dependencies are installed and are the correct versions. Depending on your OS it's pretty easy. In my case Gentoo makes it extremely easy.
Alephone
I reintroduced Alephone packages1 to play the classic Bungie first-person shooters Marathon, Marathon 2: Durandal, and Marathon 3: Infinity. Thankfully I could base some of my work off my old Portage overlay, combined with some years-old commits from the official Gentoo repository.
After getting a show-stopping bug fixed, I added a prerelease package that includes fixes for memory corruption, flickering sprites, and (some of) the popping audio. See details in the Bugs Addressed section.
The toolchain used for my university course
An ongoing desire was to do all my university homework locally, without logging into servers with less software choice and an abnormal amount of network jitter/latency spikes. I finally made it happen with a combination of rsync invocations, a tar -czvf, and a Gentoo package.
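Roughly, the fetch-and-snapshot step looked like this (a sketch; the host name, paths, and version are made up, and the ebuild itself is not shown):

# Pull the course toolchain off the university server
rsync -av --delete user@uni-server:/opt/course-toolchain/ course-toolchain/
# Snapshot it into a distfile the local ebuild can reference in SRC_URI
tar -czvf course-toolchain-20200410.tar.gz course-toolchain/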
I find this very exciting. I invest very heavily into my computer environment, and try my best to avoid doing complicated work in unfamiliar environments. I can also do work offline now.
It is worth noting that the distfiles for this package are not publicly available, and as such you will have to be a student in the course to install it. This is intentional. I have zero interest in trying to make this toolchain public or open source; I merely want to use it locally.
Bugs addressed
Deal with issue making Emacs unresponsive
I am pleased to have discovered and fixed a longstanding bug that would leave my Emacs unresponsive after visiting files, then deleting the directories the visited files resided in. I wish I had documented the first time I noticed this problem; it may have been as soon as I introduced auto-virtualenvwrapper to my workflow. This package tells Emacs to automatically search for Python virtualenvs, for use in ansi-term (a terminal in Emacs), running Python code from Emacs, and getting accurate tab completion when writing Python.
This was one of those irksome issues that is difficult to debug unless one invests effort in reconfiguring Emacs to report error traces, which can't be done after the bug occurs. As a result, every time I've encountered this bug, I have given up and simply restarted the Emacs daemon, because a messed-up minibuffer precludes issuing an M-x toggle-debug-on-error RET.
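One escape hatch I have since realized (assuming the daemon is still responsive enough to evaluate forms; emacsclient does support this): enable the debugger from outside the broken minibuffer entirely:

$ emacsclient --eval '(setq debug-on-error t)'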
I want to thank the maintainer of auto-virtualenvwrapper for being very responsive to the pull request I made. Contributing to Emacs packages is hit-or-miss, because it seems some of the less commonly used Emacs packages are dead. Additionally there is a culture of disinterest in accepting PRs that don't directly improve the maintainer's quality of life.
Strange Alephone memory corruption
I was very excited to get Alephone packaged and installed. Then I started noticing weirdness on my workstation setup. It started with some graphics corruption, with sprites being rotated 90°, and severe visual corruption when interacting with the in-game text terminals. Invariably, on every exit the game segfaulted with corrupted size vs. prev_size.
I reported the issue, and after a lot of testing, it became apparent the issue only occurs when playing at my native resolution, 1440p (2560×1440), but not at 1080p (1920×1080). Thanks to my (overly) comprehensive testing and a couple of passionate project maintainers, someone was able to pinpoint the source of the bug: an out-of-bounds write. The writes were caused by a statically allocated buffer used to copy artifacts of the render trees onto the screen. Or something like that.
I wrote a quick and dirty patch, then later one of the maintainers helped write a more future-proof patch. After testing, it appears the problem is fixed. This was a fantastic experience: the discussion was on topic, there was no bikeshedding, and everybody treated each other with kindness.
Dropping Nvidia
In 2015 I purchased a used Nvidia GTX 760 for $50. It was a great investment. At the time, AMD driver quality was pretty poor. These were the post-fglrx horror years, and the drivers were still subpar compared to Nvidia's proprietary drivers. In 2015 you could not expect Windows-par graphics performance on an AMD card, while you could on an Nvidia card.
Why AMD and not Nvidia
The landscape has completely changed in the last 5 years. AMD has open sourced their graphics drivers and is actively helping maintain them. Nvidia, on the other hand, has inherent issues such as:
- upgrading the driver breaks currently running Xorg sessions' 3D acceleration, and requires a reboot;
- out-of-tree kernel drivers are usually a bad idea; though I appreciate how easy Gentoo makes it to deal with them (simply run emerge @module-rebuild), it is still a mild annoyance because it adds extra steps when upgrading or rebuilding kernels;
- no native-resolution modesetting is available on Nvidia, so your Linux consoles (tty1-tty6 on most installs) are stuck at a very low resolution, and look very chunky;
- you have to either use the Nvidia libGL or use Mesa, not both (libglvnd fixes this apparently);
- it's yet another piece of non-free software to install on my computer; if bugs occur I cannot contribute fixes, or solicit fixes from other users;
- OBS acts up with the Nvidia binary drivers: GZDoom skyboxes are not captured, and certain 3D applications are somewhat difficult to capture correctly;
- Nvidia's composition pipeline feature for reducing video tearing is pretty awful: it simply makes most animations look choppy/stuttery, and ruins the experience of most video playback;
- Nvidia is liable to drop support for my card in another year or so, forcing me to upgrade anyway; this is planned obsolescence at the driver level, whereas with AMD the driver will probably stay in tree and supported for a couple of decades;
- there is no way to track resource usage of my GPU; it's too old to support tracking resource usage in nvidia-smi, but radeontop has been able to do this on all AMD cards for a very long time;
- and there is a bug with Nvidia's HDMI ALSA drivers that prevents PulseAudio from redetecting most of my sound interfaces on S3 resume from suspend. The usual workaround is to either unplug my HDMI output or keep killing PulseAudio until it magically works.
With AMD, on the other hand, I don't foresee most of these issues. Presently I find S3 suspend-resume cycles take up to a minute, so I need to address that. Video tearing, on the other hand, is very minimal; I have been able to watch this YouTube tearing test and experience no video tearing. I did notice tearing in certain parts of Firewatch, though that is likely because Firewatch is not particularly well optimized.
Gotchas switching cards
The GPU arrived Thursday, and I got super excited, and neglected to run an emerge -uDU --changed-deps -av @world after an emerge --sync. The card installed fine, but X would segfault. I noticed in the Xorg logs it couldn't open the radeonsi driver. I thought I could simply add amdgpu to VIDEO_CARDS, but as the logs suggest, I need both radeonsi and amdgpu. The Gentoo Wiki also suggests this. Because I was both trying to update and reconfigure my installation, this led to problems with blockers. It seemed nvidia was the problem, as it was masking Xorg versions I needed. I nuked nvidia from my VIDEO_CARDS and was successful in updating and reconfiguring my graphics stack.
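In other words, the working configuration ended up being something like this (a sketch of the relevant make.conf line; other GPUs will want a different driver set):

# /etc/portage/make.conf
VIDEO_CARDS="amdgpu radeonsi"

# then update and rebuild what changed:
emerge --sync
emerge -uDU --changed-deps -av @world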
Additionally, it appears the vulkan USE flag must be enabled on media-libs/mesa for some Steam games to work, such as The Talos Principle. I think the Nvidia binary drivers support Vulkan out of the box, hence I never had to set a USE flag on the previous GPU driver.
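Enabling the flag is a one-liner (a sketch; the package.use file name is arbitrary):

echo 'media-libs/mesa vulkan' >> /etc/portage/package.use/mesa
emerge -1av media-libs/mesa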
Finally, I had to configure mpv to not use vdpau (I had forced mpv to use vdpau for my Nvidia card); otherwise mpv would give me a black screen.
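For illustration, the change amounts to something like this in ~/.config/mpv/mpv.conf (a sketch, not my exact config; double-check option names against mpv(1)):

# Before (forced for the Nvidia card):
#vo=vdpau
#hwdec=vdpau
# After: use VA-API on AMD, or drop hwdec entirely to let mpv decide
hwdec=vaapi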
Schoolwork
I found using a graphics tablet to be valuable for my math class. I can take notes in Xournal, and write out problems step by step. You might wonder what's wrong with paper, but when sitting in front of a computer watching lectures and interacting with online learning management systems, it is difficult to split attention between the computer and a notebook. As such I simply decided to digitize the notebook.
To make the experience more tolerable, I have been using
youtube-dl
to grab all the videos I can, and play them locally in
MPV. This ensures I have global multimedia shortcuts to control
video playback, have better control over frame advance, do not
require internet access 24/7, and have better control over playback
speed.
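A typical invocation looks like this (the playlist URL is a placeholder; the output template keeps lectures sorted in order):

youtube-dl -o '%(playlist_index)s - %(title)s.%(ext)s' \
    'https://www.youtube.com/playlist?list=EXAMPLE'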
As I finally packaged the software used for one of my classes I can do all that class's work locally except for submission. This is fantastic because I can use my Emacs 26 setup and do not require a 24/7 Internet connection.
My office is located in a room that can get down to the low 60s °F at night, and many times I'd want to do work but could barely focus because I was so unevenly cold: the floor would be 60-65 °F while the rest of the room was 70 °F. I feel like an old man complaining about this, but getting a space heater really did wonders for my productivity and focus. This matters most for schoolwork, because it's the most tedious sort of productivity.
Conclusion
This has been a rather long post. I really wanted to describe some of the things I've been up to, and some of the challenges I've been facing. I am very happy to have removed my workstation's Nvidia dependency. I am very excited about graduating soon; adjusting in this time, and keeping that in mind, has been a challenge. As usual, packaging software and fixing bugs keeps my computers usable and maintainable.
I hope to write more in the near future. I had started some posts on debugging a GTK bug, and some other topics, but the amount of material to cover kept growing, much like this post keeps growing. Stay tuned to read about seccomp madness.
Extending a wireless LAN with a bridged Ethernet LAN using Mikrotik RouterOS
I recently moved, and my new abode has an Ubiquiti Amplifi LAN. The rationale is this mesh-based WiFi network eliminates the need to install Ethernet between Wireless Access Points (APs). It works surprisingly well. In this post I document how I extended this network so I could place my networked devices all on the same Ethernet segment, without needing to wire it to the Amplifi base station.
The idea is the network should look like this:
[Diagram: the intended network topology — the Amplifi mesh LAN extended by the Routerboard's bridged Ethernet segment]
Multiple Subnets Gotcha: No static routes or additional IP addresses on the Amplifi Router
Initially I wanted to segment my devices onto their own RFC1918 network. The idea is the main Amplifi LAN would be on 192.168.182.0/24 and my network would be on 10.9.8.0/24. I know I can ensure an address from each subnet is on each router, and the routers should then be able to route between each other without any additional configuration.
Unfortunately Amplifi does not offer a facility to add additional IP addresses to the LAN IP configuration. Think about how absurd that is—a router that doesn't offer a facility to add multiple addresses on an interface or bridge. Yep, pretty silly. This leads me to the second approach: add static routes to make every segment on the network routable. Unfortunately, yet again, Ubiquiti Amplifi does not offer this standard router feature.
Given that Amplifi offers neither static routing nor multiple LAN IP addresses, the only approach left for maintaining an additional subnet would be NAT (Network Address Translation). That totally defeats the purpose of a second subnet, which should be able to host network services reachable from anywhere on the network without any port forwarding.
Configuring the Mikrotik Routerboard
My examples assume the user is logged into the CLI via SSH. Make sure to read through the RouterOS Wiki documentation and upcoming replacement wiki when things are not clear.
In particular check out the guides for scripting, using the console, first-time setup, and troubleshooting tools.
The steps are this:
1) Make a backup of your current RouterOS configuration
Read up on the /export and /import paths; it is also worth reading about /system backup.
[admin@MikroTik] > # Export to local file
[admin@MikroTik] > /export verbose file=2020-03-29
[admin@MikroTik] > # Confirm file exists
[admin@MikroTik] > /file print
 # NAME            TYPE      SIZE     CREATION-TIME
 0 2020-03-29.rsc  script    40.2KiB  mar/29/2020 12:44:43
 1 flash           disk               dec/31/1969 19:00:03
 2 flash/skins     directory          dec/31/1969 19:00:03
 3 flash/pub       directory          mar/20/2020 03:04:00
Then copy it over.
$ sftp admin@192.168.182.70
Connected to 192.168.182.70.
sftp> ls
2020-03-29.rsc   flash
sftp> get 2020-03-29.rsc
Fetching /2020-03-29.rsc to 2020-03-29.rsc
/2020-03-29.rsc                        100%   40KB   2.9MB/s   00:00
sftp> exit
2) Reconfigure the bridge
You probably want everything on the bridge, including the ether1
,
which is often used for WAN on a typical router setup. More
information on RouterOS bridges.
[admin@MikroTik] > /interface bridge port print
Flags: X - disabled, I - inactive, D - dynamic, H - hw-offload
 #     INTERFACE  BRIDGE  HW  PVID PRIORITY PATH-COST INTERNAL-PATH-COST HORIZON
 0   H ;;; defconf
       ether2     bridge  yes    1     0x80        10                 10 none
 1 I H ;;; defconf
       ether3     bridge  yes    1     0x80        10                 10 none
 2 I H ;;; defconf
       ether4     bridge  yes    1     0x80        10                 10 none
 3 I H ;;; defconf
       ether5     bridge  yes    1     0x80        10                 10 none
 4 I   ;;; defconf
       sfp1       bridge  yes    1     0x80        10                 10 none
 5 I H ether1     bridge  yes    1     0x80        10                 10 none
 6 I   wlan1      bridge         1     0x80        10                 10 none
 7     wlan2      bridge         1     0x80        10                 10 none
To add/remove devices just use /interface bridge port add
and
/interface bridge port remove
. Press TAB to complete stuff.
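For example, to ensure ether1 is a bridge port (on a default RouterOS config the bridge is simply named bridge):

/interface bridge port add bridge=bridge interface=ether1
# and to remove an entry, reference its number from the print output:
/interface bridge port remove 5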
3) Remove obsolete firewall rules
The following scripts can be copy-pasted into the terminal. More information on RouterOS scripting here.
I've included the following image because I think everyone should be able to enjoy the colorful nature of RouterOS's shell.
[Screenshot: the colorful RouterOS shell]
/ip firewall filter
As per a suggestion on IRC (thanks drmessano) it's best to drop all the firewall rules, since this device should be a L3 bridge, and should not be restricting traffic.
:foreach rule in=[/ip firewall filter find] do={ \
  :if ([/ip firewall filter get $rule dynamic]) \
  do={} \
  else={/ip firewall filter remove $rule} \
}; \
/ip firewall filter print
/ip firewall mangle
I didn't see a need for any mangle rules, so this should be empty as well.
:foreach rule in=[/ip firewall mangle find] do={ \
  :if ([/ip firewall mangle get $rule dynamic]) \
  do={} \
  else={/ip firewall mangle remove $rule} \
}; \
/ip firewall mangle print
/ip firewall nat
Everything here should be disabled or removed. This setup does not use a NAT.
:foreach rule in=[/ip firewall nat find] do={ \
  :if ([/ip firewall nat get $rule dynamic]) \
  do={} \
  else={/ip firewall nat remove $rule} \
}; \
/ip firewall nat print
4) Disable DHCP Server
You can probably just run:
/ip dhcp-server disable 0
This is necessary because this device shouldn't be doling out IP addresses. Only one router on this subnet should be running a DHCP server.
5) Configure device IP Address
Add an out-of-band management IP address
In general it's a good idea to add out-of-band management IP addresses to devices that don't have another way to log in. My particular device does not have an accessible serial console, so I need to take care to always have a way to address this Routerboard, even if the LAN and its DHCP server go down.
There are two ways to achieve this: use an IPv6 link-local address or manually add a static RFC1918 IPv4 address. I will use a static IPv4 address.
/ip address add address=10.128.0.1/24 interface=bridge
Make sure to write this address down; it will save a hard reset down the road. Maybe attach it to the unit with a printed label.
With my Linux box's Ethernet directly hooked up to the Routerboard, I can assign another IPv4 on the same subnet, then log into the router.
winston@snowcrash ~ $ sudo ip address add dev enp2s0 10.128.0.10/24
winston@snowcrash ~ $ ssh admin@10.128.0.1
The authenticity of host '10.128.0.1 (10.128.0.1)' can't be established.
RSA key fingerprint is SHA256:QhJryzCxFpT/wW4Mmg7R6QEnRDPeYsY2SAF/hlc7Mx4.
No matching host key fingerprint found in DNS.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.128.0.1' (RSA) to the list of known hosts.

  MMM      MMM       KKK                          TTTTTTTTTTT      KKK
  MMMM    MMMM       KKK                          TTTTTTTTTTT      KKK
  MMM MMMM MMM  III  KKK  KKK  RRRRRR     OOOOOO      TTT     III  KKK  KKK
  MMM  MM  MMM  III  KKKKK     RRR  RRR  OOO  OOO     TTT     III  KKKKK
  MMM      MMM  III  KKK KKK   RRRRRR    OOO  OOO     TTT     III  KKK KKK
  MMM      MMM  III  KKK  KKK  RRR  RRR   OOOOOO      TTT     III  KKK  KKK

  MikroTik RouterOS 6.46.4 (c) 1999-2020       http://www.mikrotik.com/

[?]             Gives the list of available commands
command [?]     Gives help on the command and list of arguments
[Tab]           Completes the command/word. If the input is ambiguous,
                a second [Tab] gives possible options
/               Move up to base level
..              Move up one level
/command        Use command at the base level
[admin@MikroTik] >
Provision an IPv4 on the LAN for easier management
I opted to use DHCP, but I'll show both ways to assign an IP address to the router. Before adding more DHCP clients or a static IP, make sure to disable or remove other DHCP clients.
# Change "remove" to "disable" to keep the configuration available for later use. foreach cl in=[/ip dhcp-client find] do={/ip dhcp-client remove $cl}; \ /ip dhcp-client print
To add a DHCP client, run a command like this:
/ip dhcp-client add interface=bridge disabled=no
To add a static IPv4 address, run commands like the following:
# Give the Routerboard an address
/ip address add address=192.168.182.70/24 interface=bridge

# The following two commands tell the Routerboard how to access the internet, so it
# can get updates or access cloud services.
#
# Where 192.168.182.1 is the default gateway (the main router)
/ip route add gateway=192.168.182.1

# Set the DNS servers preferring in order: the main router, Cloudflare, and Google
/ip dns set servers=192.168.182.1,1.1.1.1,8.8.8.8
6) Configure the WiFi
Here is my WiFi configuration:
[admin@MikroTik] > /interface wireless print
Flags: X - disabled, R - running
 0 X  name="wlan1" mtu=1500 l2mtu=1600 mac-address=CC:2D:E0:E1:3E:B9 arp=enabled
      interface-type=Atheros AR9300 mode=station-pseudobridge
      ssid="MyCoolWifiName" frequency=auto band=2ghz-b/g/n
      channel-width=20/40mhz-Ce secondary-channel="" scan-list=default
      wireless-protocol=802.11 vlan-mode=no-tag vlan-id=1 wds-mode=disabled
      wds-default-bridge=none wds-ignore-ssid=no bridge-mode=enabled
      default-authentication=yes default-forwarding=yes default-ap-tx-limit=0
      default-client-tx-limit=0 hide-ssid=no security-profile=default
      compression=no

 1 R  name="wlan2" mtu=1500 l2mtu=1600 mac-address=CC:2D:E0:E1:3E:B8 arp=enabled
      interface-type=Atheros AR9888 mode=station-pseudobridge
      ssid="MyCoolWifiName" frequency=auto band=5ghz-a/n/ac
      channel-width=20/40/80mhz-Ceee secondary-channel="" scan-list=default
      wireless-protocol=802.11 vlan-mode=no-tag vlan-id=1 wds-mode=disabled
      wds-default-bridge=none wds-ignore-ssid=no bridge-mode=enabled
      default-authentication=yes default-forwarding=yes default-ap-tx-limit=0
      default-client-tx-limit=0 hide-ssid=no security-profile=default
      compression=no
1) Take the WiFi offline
First, I recommend taking the WiFi offline until you're happy with the configuration.
/interface wireless disable numbers=0,1
2) Configure the SSID and put the Routerboard into a wireless client mode
Then make sure to set the ssid
to your existing WiFi's ESSID and
mode
to station-pseudobridge
.
/interface wireless set numbers=0,1 mode=station-pseudobridge ssid="MyCoolWifiName"
3) Configure the security-profile
Chances are your wireless device already has a security profile, so make
note of the security-profile
in the /interface wireless print
output. It is likely security-profile=default
.
Next configure the security-profile, mine looks like this:
[admin@MikroTik] > /interface wireless security-profiles print
Flags: * - default
 0 * name="default" mode=dynamic-keys authentication-types=wpa-psk,wpa2-psk
     unicast-ciphers=aes-ccm group-ciphers=aes-ccm
     wpa-pre-shared-key="Top secret password here"
     wpa2-pre-shared-key="Top secret password here"
     supplicant-identity="MikroTik" eap-methods=passthrough
     tls-mode=no-certificates tls-certificate=none mschapv2-username=""
     mschapv2-password="" disable-pmkid=no static-algo-0=none static-key-0=""
     static-algo-1=none static-key-1="" static-algo-2=none static-key-2=""
     static-algo-3=none static-key-3="" static-transmit-key=key-0
     static-sta-private-algo=none static-sta-private-key=""
     radius-mac-authentication=no radius-mac-accounting=no
     radius-eap-accounting=no interim-update=0s
     radius-mac-format=XX:XX:XX:XX:XX:XX radius-mac-mode=as-username
     radius-called-format=mac:ssid radius-mac-caching=disabled
     group-key-update=5m management-protection=disabled
     management-protection-key=""
In short, run the following command:
:local password "Top secret password here"; \
/interface wireless security-profiles set numbers=0 \
  wpa-pre-shared-key=$password wpa2-pre-shared-key=$password
4) Re-enable the WiFi and test
I know the Amplifi wireless LAN supports 802.11ac and the closest
mesh-point is sufficiently close, so I'll only enable the radio
capable of 802.11ac. Looking at the output of /interface wireless
print
I see only wlan2
's band
key contains support for
802.11ac (band=5ghz-a/n/ac
). So I'll only enable that radio.
/interface wireless enable wlan2
In any case once everything is set up, make sure one can ping the internet:
[admin@MikroTik] > /ping google.com count=3
  SEQ HOST                                     SIZE TTL TIME  STATUS
    0 172.217.9.46                               56  52 25ms
    1 172.217.9.46                               56  52 24ms
    2 172.217.9.46                               56  52 25ms
    sent=3 received=3 packet-loss=0% min-rtt=24ms avg-rtt=24ms max-rtt=25ms
Some gotchas in case things don't work (see the example pings below):
- Can you ping the router via its shared LAN IP?
- What parts of the network can be pinged from which devices?
- Is the WiFi SSID/password correct?
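For the first two checks, a couple of pings from a LAN host go a long way (the addresses are from my setup; substitute your own):

$ ping -c 3 192.168.182.70   # the Routerboard's LAN address
$ ping -c 3 192.168.182.1    # the Amplifi router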
Conclusion
Though a casual reader might consider this too many steps to configure a network device, it is only a handful of operations. A router/switch web configuration GUI might appear to streamline this, but that doesn't mean it is any simpler to configure; chances are one will have to fill in more boxes and tick more options in such a scenario, since this is a somewhat weird and awkward network topology.
As usual, the Mikrotik Wiki helped me get this project finished in little time; doubly so the Mikrotik IRC channel. I enjoy working with RouterOS because it feels stateless. When I specify a configuration, it is usually idempotent, every line of configuration feels relevant to the use-case, and in general it feels rather DWIM (Do What I Mean).
I hope this helps somebody, because it took some mild mental aerobics to figure out how this works. In particular, I knew that WiFi and Ethernet have different frame formats, and Layer 2 bridging between the two requires weird hacks unique to the vendor's software/hardware. In this case, using station-pseudobridge, the Routerboard does Layer 2 bridging for certain traffic, and falls back on Layer 3 for the rest. I don't fully understand it, but the results are satisfactory.
Edit: About the IRC channel
I'm a big fan of IRC however looking over the logs for
##mikrotik
, and my interactions, I was very lucky to never be a
target of abuse. Thankfully all I had to deal with are some low
intensity rudeness, and the repercussions of reminding the channel
maybe it's a good idea to act like adults (and that people remember
interactions like this). Indeed that is apparently a dangerous
discussion to bring up, I guess rude people don't like being told
they are being rude. I left that channel, and I think others should
avoid it.
Anyway I looked through my year's worth of logs, and found these comments about the abuse problem in the channel. I have anonymized the names because even trolls/rude users don't deserve personal attacks. What isn't included are actual insults and attacks on other users. Those are far too personal and explicit for my blog.
2019-12-19 16:39:03  PersonA  "##mikrotik You've got questions, we've got toxic mockery."
2019-12-19 16:42:14  PersonB  come for the advice, stay for the abuse
2019-12-20 09:20:23  PersonC  PersonB: it worked, tyvm
2019-12-20 09:20:24  <-- PersonC (~PersonC@unaffiliated/PersonC) has quit
2019-12-20 09:20:45  PersonD  wat
2019-12-20 09:21:16  PersonB  left without giving me a chance to insult him
2019-12-20 09:21:21  PersonD  damn :(
2019-12-20 09:21:55  PersonD  #mikrotik: Come for the help, stay for the abuse
2019-12-20 09:22:13  PersonB  i did that one already
2019-12-20 09:22:22  PersonB  [22:42:14] <PersonB> come for the advice, stay for the abuse
2019-12-28 15:21:22  PersonE  have you come for the abuse?
2019-12-28 15:30:43  PersonF  always
2019-12-28 16:03:58  PersonB  the abuse is the only way he can come
2020-01-10 16:21:21  PersonG  well you guys led the conversation in that direction
2020-01-10 16:21:27  PersonG  start calling names
2020-01-10 16:21:32  PersonG  I don't work networking
2020-01-10 16:21:41  PersonG  I am messing at home
...
2020-01-10 16:23:24  PersonG  nah, it is excuse to be rude and without any manners.
Indeed they seem to know how poorly they serve the community the channel is for. As always you can write me about this at the following email: hello AT winny DOT tech. I recommend seeking out another communication medium for questions related to Mikrotik.
Switching website to GitLab Pages
Previously I detailed how I set up blog.winny.tech using GitHub for source code hosting and Caddy’s git plugin for deployment. This works well and I used a similar setup with my homepage. The downside is I host the static web content and I am tied to using Caddy.1 I imagine simpler is better, so I opted to host my static sites — https://winny.tech/ & https://blog.winny.tech/ — with GitLab pages.
What’s wrong with Caddy?
Caddy is very easy to get started with, but it has its own set of trade-offs. Over the last few years, I’ve noticed multiple hard-to-isolate performance quirks, some of which were likely related to the official Docker image. In particular, I had built a Docker image of Caddy with webdav support, and the overall performance tanked to seconds per request, even with webdav disabled. I still have no clue what happened there; instrumenting Caddy through Docker appeared nontrivial, so I gave up on webdav support, reverted to my old Docker based setup, and everything was fast, once again.
There is a good amount of inflexibility in Caddy, such as the git plugin's limitations around deploying to a non-root folder of the web root. And its rewrite logic is usually what you want, but not nearly as flexible as nginx's.
Asking questions on their IRC is usually met with no response of any kind, which indicates to me that the project’s community isn’t very active.
The move to Caddy v2 is unwelcoming; I don't want to relearn yet another set of config files and quirks, especially weeding through the layer of configuration file format adapters and the abstracted-away configuration options, so I'd rather just use Certbot and some other HTTPD that won't change everything for the fun of it.2
Until recently Caddy experimented with a pretty dubious monetizing strategy. HackerNoon published an article detailing how it worked. In short: they plastered text all over their website claiming you “need to buy a license” to use Caddy commercially, though that claim was never true. Caddy was always covered by Apache License 2.0. Instead, you needed a commercial license in the narrow use-case that your organization wants to use Caddy’s prebuilt release binaries as offered on their website. It is good they stopped this scheme, but it leaves a bad taste with the community, and with me, and discourages me from relying on the project moving forward.
Why GitLab Pages instead of GitHub Pages?
I have used both GitHub Pages and GitLab Pages in the past. My experience with GitHub Pages is that it's relatively inflexible, it is difficult to see what is going to be published, and its CI/CD setup is only useful for certain Jekyll-based sites. GitLab Pages, on the other hand, lets you set up any old Docker-based CI/CD workflow, so it is possible to render a blog with GitLab CI using any static site generating software. The IEEE-CS student chapter I am a part of does just this: we use a combination of static redirect sites and a Pelican-powered static website. There are a large number of example repositories for most of the popular ways to publish a static website, including Gatsby, Hugo, and Sphinx. Needless to say, GitLab Pages puts GitHub Pages to shame in terms of flexibility.
Setting up GitLab Pages
There are two steps in setting up GitLab Pages. These are the most important ideas related to GitLab pages; how to navigate the site is something the reader must experience for oneself. Nothing beats experimentation and reading the docs. Make sure to refer to the official GitLab Pages documentation for further details.
1) Getting GitLab Pages to deploy your git repository
Before getting started, make sure GitLab Pages is activated for your project. Visit it via Settings → Pages on your project. Most of the Pages settings are rooted in that webpage.
How GitLab Pages CI/CD deploys your site is specific to your
software or lack of software. If you are simply setting up a static
website on GitLab Pages, a simple .gitlab-ci.yml
will work for
you:
pages:
  stage: deploy
  script:
    - mkdir .public
    - cp -rv -- * .public/  # Note the `--'
    - mv .public public
  artifacts:
    paths:
      - public
  only:
    - master
This simply tells GitLab CI/CD to copy everything not starting
with a .
into the public
folder. By the way, one cannot change
the public
folder path. It does not appear possible to use
something like artifacts: paths: ["."]
to deploy the entire git
repository.
There is a GitLab CI/CD YAML lint website3 (and web API). Additionally, there is reference documentation for the .gitlab-ci.yml schema. Please note, the linter will often yield confusing error messages. For example it is invalid to omit a script key, but the error message is Error: root config contains unknown keys: pages. Take the error messages with a grain of salt.
Once you have what seems like the .gitlab-ci.yml
that you want,
commit it to your git repository, and push to GitLab. Check
progress under CI/CD → Pipelines. If everything works out, you
should be able to view the website on GitLab Page’s website —
e.g. https://winny.tech.gitlab.io/blog.winny.tech. The format of
the above url (visible in Settings → Pages) is
https://<namespace>.gitlab.io/<project>
. If you can’t view your
website, check the CI/CD pipeline’s logs, and inspect the artifacts
ZIP — which is also available from the CI/CD pipelines page. Chances
are you need to edit the .gitlab-ci.yml
or tweak the scripts used
in the YAML file.
2) Hosting the GitLab Pages site on your (sub-)domain
All the tasks in this section use Settings → Pages using the “New Domain” or “Edit” webpages.
To set up GitLab pages on your domain, you need to first prove
ownership of that specific domain via a specially constructed TXT
record, then configure that specific domain to point to GitLab
Pages via a CNAME
or A
record. In general I recommend using an
A
record because you can stuff any other records you please on
the same domain.
Simply add an A record to your DNS setup like so: yourdomain.com. A 35.185.44.232.4 Once the DNS updates, propagation can take anywhere from seconds to the rest of your SOA TTL (Time-To-Live). Visiting your domain should then provide a GitLab Pages placeholder page with a 4xx error code.
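To check that the record is live before moving on, dig is handy (using the same placeholder domain):

$ dig +short yourdomain.com A
35.185.44.232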
Next prove to GitLab you own the domain. Create the TXT
record as
indicated in the GitLab Pages management website. The string to the
left of TXT
should be the name/subdomain, and the string to the
right of TXT
is the value. Alternatively you can put the entire string into the value field of a TXT record (?!).
Note, the above two sub-steps are independent; one can validate the domain before adding the record to point it to GitLab, and vice versa.
GitLab Pages Gotchas
There are a few gotchas about GitLab Pages. Some of them are related to GitLab Pages users not being familiar with all of the DNS RFCs. Others are simply because GitLab Pages has quirks too.
CNAME
on apex domain is a no-no
Make sure you do not use a CNAME
record on the apex domain. Use an
A
record instead. Paraphrasing from the ServerFault answer: RFC
2181 clarifies a CNAME
record can only coexist with records of
types SIG
, NXT
, and KEY RR
. Every apex domain must contain
NS
and SOA
records, hence a CNAME
on the apex domain will
break things.
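To make this concrete, here is a sketch in zone-file notation (example.com and mynamespace are placeholders):

; Broken: the apex must already have SOA and NS records,
; and a CNAME cannot coexist with them (RFC 2181).
example.com.  IN CNAME  mynamespace.gitlab.io.
; Fine: an A record coexists with the mandatory records.
example.com.  IN A      35.185.44.232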
CNAME
and TXT
cannot co-exist
The above also is true for TXT
and CNAME
on the
same subdomain. For example if one adds TXT somevaluehere
and
CNAME example.com
to the same domain, say hello.example.com
,
things will not behave correctly.
If we have a look at the GitLab Pages admin page, the language is mildly confusing, stating “To verify ownership of your domain, add the above key to a TXT record within to your DNS configuration.” At first, I thought “within your DNS configuration” meant “place this entire string as the right hand side of a TXT record on any subdomain in your configuration”. This does work; as such I have:
blog.winny.tech. IN A   35.185.44.232
blog.winny.tech. IN TXT "_gitlab-pages-verification-code.blog.winny.tech TXT gitlab-pages-verification-code=99da5843ab3eabe1288b3f8b3c3d8872"
But they probably didn’t mean that. Surely I should have this instead:
blog.winny.tech                                  IN A   35.185.44.232
_gitlab-pages-verification-code.blog.winny.tech  IN TXT gitlab-pages-verification-code=99da5843ab3eabe1288b3f8b3c3d8872
I feel a bit silly after realizing this is what the GitLab Pages folks intended for me to do, but it really was not clear to me, especially given that clicking in the TXT record’s text-box highlights the entire string, instead of allowing the user to copy the important bits (such as the TXT record’s key) into whatever web management UI they might be using for DNS.
The feedback loop for activation of the domain is slow
It can take a while for a domain to be activated by GitLab Pages after the initial deploy. Things to look for: you should get a GitLab Pages error page on your domain if you set up the CNAME or A record correctly. The error is usually “Unauthorized (401)”, but it can be other errors.
The other place to look is verify your domain is in the “Verified” state on the GitLab Pages admin website.
The feedback loop for activation of LetsEncrypt HTTPS is huge
Sometimes GitLab pages will seemingly never activate your
LetsEncrypt support for HTTPS access. If this happens, a discussion
suggests the best solution is to remove that domain from your
GitLab Pages setup, and add it again. You will likely have to edit
the TXT
record used to claim domain ownership. This also worked
for me, when experiencing the same issue.
Make sure to enable GitLab Pages for all users
See this ticket.
Conclusion
GitLab Pages isn’t perfect, but this should streamline what services my VPS hosts, and give me more freedom to fiddle with my VPS configuration and deployment. I look forward to rebuilding my VPS with cdist, ansible, or saltstack. While that happens, my website will stay up thanks to GitLab Pages. Also, I imagine GitLab Pages is a bit more resilient to downtime than a budget VPS provider.
The repositories with .gitlab-ci.yml
files for both this site, and
winny.tech are public on GitLab official hosting. Presently it is
the simplest setup possible, simply deploying pre-generated content
already checked into git, but the possibilities are endless.
Footnotes:
I could deploy my own webhook application server that GitHub/GitLab connects to, and have done so in the past, but every application I manage is another thing I have to well, ahem, manage (and fix bugs for).
There are some cool new features in Caddy 2, such as the ability to configure Caddy via a RESTful API and a sub-command driven CLI, but I don’t need additional features.
From the GitLab CI Linter’s old page “go to ‘CI/CD → Pipelines’
inside your project, and click on the ‘CI Lint’ button”. Or simply
visit https://gitlab.com/username/project/-/ci/lint
.
It’s a good idea to compare the mentioned IP address against what appears in the GitLab Pages Custom Domain management interface.
How to fix early framebuffer problems, or "Can I type my disk password yet??"
Most of my workstations & laptops require a passphrase typed in to open the encrypted root filesystem. So my steps to booting are as follows:
- Power on machine
- Wait for FDE passphrase prompt
- Type in FDE passphrase
- Wait for boot to complete and automatic XFCE session to start
Since I need to know when the computer is ready to accept the passphrase, it is important the framebuffer is usable during the early part of the boot. In the case of my HP Elitebook 820 G4, the EFI framebuffer does not appear to work, and I’d rather not boot in BIOS mode to get a functional VESA framebuffer. Making things more awkward, firmware is needed when the i915 driver is loaded, or the framebuffer will not work either. (It’s not always clear if firmware is needed, so one should run dmesg | grep -F firmware and check if firmware is being loaded.)
With this information, the problem is summarized to: “How do I ensure i915 is available at boot with the appropriate firmware?”. This question can be easily generalized to any framebuffer driver, as the steps are more-or-less the same.
Zeroth step: Do you need only a driver, or a driver with firmware?
It is a good idea to verify whether your kernel is missing a driver at boot, or missing firmware, or both. Boot up a Live USB with good hardware compatibility, such as GRML1 or Ubuntu’s, and let’s see what framebuffer driver our host is trying to use2:
$ dmesg | grep -i 'frame.*buffer'
[    4.790570] efifb: framebuffer at 0xe0000000, using 8128k, total 8128k
[    4.790611] fb0: EFI VGA frame buffer device
[    4.820637] Console: switching to colour frame buffer device 240x67
[    6.643895] i915 0000:00:02.0: fb1: i915drmfb frame buffer device
So we can see the efifb is initially used for a couple of seconds, then i915 is used for the rest of the computer’s uptime. Now let’s check whether firmware is necessary, first asking modinfo(8) if the driver knows of any firmware:
$ modinfo i915 -F firmware
i915/bxt_dmc_ver1_07.bin
i915/skl_dmc_ver1_27.bin
i915/kbl_dmc_ver1_04.bin
... SNIP ...
i915/kbl_guc_33.0.0.bin
i915/icl_huc_ver8_4_3238.bin
i915/icl_guc_33.0.0.bin
This indicates this driver will load firmware when available, and if necessary for the particular mode of operation or hardware.
Now let’s look at dmesg to see if any firmware is loaded:
[    0.222906] Spectre V2 : Enabling Restricted Speculation for firmware calls
[    5.511731] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[   25.579703] iwlwifi 0000:02:00.0: loaded firmware version 36.77d01142.0 op_mode iwlmvm
[   25.612759] Bluetooth: hci0: Minimum firmware build 1 week 10 2014
[   25.620251] Bluetooth: hci0: Found device firmware: intel/ibt-12-16.sfi
[   25.712793] iwlwifi 0000:02:00.0: Allocated 0x00400000 bytes for firmware monitor.
[   27.042080] Bluetooth: hci0: Waiting for firmware download to complete
Aha! So it appears we need i915/kbl_dmc_ver1_04.bin for i915. In the case one doesn’t need firmware, dmesg won’t show anything related to drm or a line with your driver name in it.
By the way, it is a good idea to check dmesg for hints about missing firmware or alternative drivers. For example, my trackpad is supported by both i2c and synaptics based trackpad drivers, and the kernel was kind enough to tell me.
First step: Obtain the firmware
On Gentoo install
sys-kernel/linux-firmware
. You will have to agree to some non-free
licenses; nothing too inane, but worth mentioning. Now just
run emerge -av sys-kernel/linux-firmware
. (On other distros it
might be this easy, or more difficult; for example—in my experience
Debian does not ship every single firmware like Gentoo does, so
YMMV.)
Second step, Option A: Compile firmware into your kernel
Since most of my systems run Gentoo, it is business as usual to deploy a kernel with most excess drivers disabled, except for common hot-swappable components such as USB network interfaces, audio devices, and so on. For example, this laptop’s config was originally derived from genkernel’s stock amd64 config with most extra drivers disabled, then augmented with support for an Acer ES1-111M-C7DE, and finally with support for this Elitebook.
I had compiled the kernel with i915 support built into the image, as opposed to an additional kernel module. Unfortunately this meant the kernel was unable to load the firmware from the filesystem, because it appears only kernel modules can do that. To work around this without resorting to making i915 a kernel module, we can include the firmware within the kernel image (vmlinuz).
Including both firmware and drivers in the vmlinuz has a couple of benefits. First, they will always be available: there is no need to figure out how to load the driver and firmware from the initrd, let alone getting whatever initrd generator one is using to cooperate. A downside is it makes the kernel very specific to the machine, because a different Intel machine would likely need a different firmware file compiled in.
To achieve including the firmware in kernel, I set the following
values in my kernel config (.config
in your kernel source tree).
CONFIG_EXTRA_FIRMWARE="i915/kbl_dmc_ver1_04.bin"
CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
Note, if you’re using menuconfig, you can type /EXTRA_FIRMWARE
(slash for search, then the text) followed by keyboard return to
find where these settings exist in the menu system.
Then I verified i915 is indeed not a kernel module, but built into
the kernel image (it would be m
if it’s a module):
CONFIG_DRM_I915=y
After compiling & installing the kernel (and generating a dracut initrd for cryptsetup/lvm), I was able to reboot and get an early pre-mounted-root framebuffer on this device.
Second step, Option B: A portable kernel approach (using sys-kernel/vanilla-kernel
)
I discovered the Gentoo devs have begun shipping an ebuild that builds and installs a kernel with a portable, livecd-friendly config. In addition this package will optionally generate an initrd with dracut as a pkg_postinst step, making it very suitable for users who just want a working kernel, and don’t mind excessive compatibility (at a cost of size and build time).
This presents a different challenge, because while this package does
allow the user to drop in their own .config, it is not very
multiple-machine-deployment friendly to hard-code each individual
firmware into the kernel. Instead we tell dracut to include our
framebuffer driver. As mentioned above I found this computer uses
the i915
kernel driver for framebuffer. Let’s tell dracut to
include the driver:
cat > /etc/dracut.conf.d/i915.conf <<EOF
add_drivers+=" i915 "
EOF
Dracut is smart enough to pick up the firmware the kernel module
needs, provided they are installed. To get an idea what firmware
dracut will include, run modinfo i915 -F firmware
which will print
out a bunch of firmware relative paths.
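To double check the generated image actually picked everything up, dracut ships lsinitrd (the image path below is a guess; adjust for your bootloader layout):

lsinitrd /boot/initramfs-$(uname -r).img | grep i915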
After applying this fix, just regenerate your initrd using dracut; in
my case I let portage do the work:
emerge -1av sys-kernel/vanilla-kernel
. Finally reboot.
Conclusion
Check dmesg. Always check dmesg. We found two ways to deploy firmware, in-kernel and in-initrd. The in-kernel technique is best for a device-specific kernel, the in-initrd is best for a portable kernel. I am a big fan of the second technique because it scales well to many machines.
I did not touch on the political side of using binary blobs. It would be nice to not use any non-free software, but I rather have a working system with a couple small non-free components, than a non-working system. Which is more valuable, your freedom, or reduced capacity of your tools?
Footnotes:
GRML is my favorite live media. It is simple, to the point, and has lots of little scripts to streamline tasks such as setting up a wireless AP, an iPXE netboot environment, or a router, installing Debian, and so on. Remastering is relatively straightforward. It also has a sane GUI suitable for any machine (Fluxbox).
The Danger of fuzzy matching over one's PATH
Awhile back I noticed my personal mnt/ directory, my (empty) personal tmp/ directory, and a few symbolic links disappeared from my home directory. I only noticed because I use unison1 to synchronize my desktop and laptop homedirs. The actual number of removed directories and symbolic links was staggering, and it cost me five minutes of extra effort to search through the unison UI to ignore files I didn’t want to synchronize. Repeat this a few times a day, with the problem occurring at seemingly random intervals, and you’ve wasted minutes out of every day, which adds up to hours every month.
For months I had not figured out what the problem was. By chance, while using my application launcher, I noticed I had accidentally run not links -g2 but cleanlinks. I wondered to myself what I had been running by accident, as I had done this before, but had not thought anything of it, assuming it was a program that would print usage or perform a no-operation by default.
I was wrong.
Turns out cleanlinks searches the current working directory for empty directories and broken symbolic links, and removes them. Both are useful to me. For example I keep empty directories in ~/mnt/ to mount sshfs stuff, and I prefer to use ~/tmp/ as a work directory because no system scripts will touch it.3 I had a few broken symbolic links scattered about, from weird git repositories’ working trees to some stale user-level systemd unit links from my Arch Linux install.
Making things more interesting, if you run cleanlinks --help, or with any flags at all, it ignores them and operates as usual. So it’s also a mistake to do cleanlinks /some/directory/i/want/to/clean: it will clean the current directory regardless. As a part of imake,4 the old X11 ecosystem build tools, cleanlinks is installed on many systems, and it’s not safe to run lest you enjoy random stuff being messed about with in your current directory.
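The behavior is easy to demonstrate in a scratch directory (a sketch of what I observed; everything here is throwaway):

$ mkdir -p /tmp/demo/empty-dir && cd /tmp/demo
$ ln -s /nonexistent broken-link
$ cleanlinks      # removes empty-dir/ and broken-link, no questions asked
$ ls -A           # nothing left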
How did I manage to run cleanlinks so many times? I did not have links installed on the affected machine. And even after I did install it, I forgot to remove cleanlinks from my rofi run cache, so it had higher matching precedence than links in certain cases. Hence I ran it a few times by accident even after installing links.
Therefore, I strongly recommend one doesn’t fuzzy match over their PATH. Who knows what other nasty tools ship on your system that will lay waste to your productivity, or worse, damage your personal files.
Regardless, I have yet to heed my own warning. Maybe I should just use
.desktop
files, but then again, maybe there exists a
cleanlinks.desktop
… Ideally, I’ll create a directory of symlinks to
programs I want to launch from rofi. Someday :)
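A sketch of that someday-setup (the paths are hypothetical; rofi’s -dmenu mode can read the whitelist from stdin):

# Build a directory of symlinks to only the programs I want to launch
mkdir -p ~/.local/launcher-bin
for prog in links mpv firefox; do
    ln -sf "$(command -v "$prog")" ~/.local/launcher-bin/
done
# Pick from only those entries, then run the selection
choice=$(ls ~/.local/launcher-bin | rofi -dmenu) &&
    exec ~/.local/launcher-bin/"$choice"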
About Unison
I should mention unison is a superb tool for synchronizing your data. It shows the user a list of changes to each directory being synchronized, waits for the user to decide which way each file should be synchronized:
- Send file from host A to B
- Send file from host B to A
- Ignore the file this time
- Ignore the file permanently
- Merge the files
Because unison doesn’t try to be fancy or automatic, it is easy to understand what is happening.
Footnotes:
Links 2 is the best web 1.0 browser. It even shows images and different text sizes. Screenshots on this page.
/var/tmp/ could also work, but this way I know nobody is going to mess with my files and I won’t accidentally mess up permissions on sensitive data.
imake on freedesktop’s GitLab. See also what packages depend on imake in Arch Linux. I use Gentoo across my laptop and workstation, so it’s necessary to have imake installed.
Open URL in existing Qutebrowser from Emacs Daemon on Gentoo
On my Gentoo desktops, I use Emacs Daemon via sys-emacs/emacs-daemon
1
to ensure an Emacs instance is ready to go and always available from
boot. This is done via creating a symbolic link like
/etc/init.d/emacs.winston
to /etc/init.d/emacs
which will start Emacs
for the given user. See the package README for more details.
A shortcoming of this setup is XDG_RUNTIME_DIR
2 is not set, as this is
set by my Desktop Session - maybe LightDM or consolekit set this? As a
result, when I open a URL from Emacs Daemon, it opens a fresh
qutebrowser session, loading the saved default session, and making a
mess of my workflow.
One approach to fix this might be to instead run Emacs daemon from my .xsession script, but I'd rather not supervise daemons at the user level; if I were to consider this, I'd be better off switching to systemd for user-level services anyway.
The solution I came up with is to add some lines to my init.el
to
ensure XDG_RUNTIME_DIR
is set to the expected value:
(defun winny/ensure-XDG_RUNTIME_DIR ()
  "Ensure XDG_RUNTIME_DIR is set. Used by qutebrowser and other utilities."
  (let ((rd (getenv "XDG_RUNTIME_DIR")))
    (when (or (not rd) (string-empty-p rd))
      (setenv "XDG_RUNTIME_DIR" (format "/run/user/%d" (user-uid))))))

(add-hook 'after-init-hook #'winny/ensure-XDG_RUNTIME_DIR)
A strange Emacs-ism: (user-uid) may return a float or an integer, despite the backing uid_t (on *nix) being guaranteed to be an integer type. I'll just assume this will never return a float. Please contact me otherwise; I'd love to hear about this.
Footnotes:
Blink Shell: First Thoughts
As a heavy user of SSH to manage computers and IRC via command line clients, the most used application on my phone besides the web browser is an SSH client. Previously I used Prompt, and it worked, but barely. My issues with Prompt include crashing on the emoji spam that is common in certain IRC channels, very slow terminal rendering (watching the output of compiling a large package will cause Prompt to lag uncontrollably for tens of seconds), and a relatively unintuitive UI. Please note I am referring to Prompt, not Prompt 21, which is another problem in itself (the idea that I need to pay for the same product, rewritten to actually work, is just ludicrous).

Another thing to note before I dig into Blink is that I'm not a big fan of mobile devices. They get the job done, but in most cases I'd rather crack open my netbook and not fight with the learn-by-fiddling mobile UI paradigms most devices have adopted. I am also not a fan of Apple's ecosystem, but I'm pretty content with my iPhone and iPad for my basic use cases: browsing the web, chat applications, making phone calls, casual mobile gaming, and logging into other computers via SSH.
Enter Blink2. This application does a few things differently from other mobile SSH clients. It offers a command-line-first user interface. At first I thought this would be painful, but the application's verbs are pretty simple and appear to match up with how one uses ssh and mosh in a terminal emulator. The terminal emulator is responsive, significantly more stable than Prompt, and looks good. I can even load up my favorite font: Droid Sans Mono3 (thanks to its support for loading webfonts via a URL to a CSS file).
Another thing Blink does differently is offer Mosh support4. Mosh is an alternative to SSH that can "roam" across networks and device power state changes. One thing to note is that Mosh requires SSH to make the initial login, then switches to Mosh, which uses a different protocol. When using mosh, I tend to log in once throughout the day. I can then switch networks and mosh will gracefully reconnect as soon as the server is reachable again. In addition, when I sleep my netbook, the session isn't lost (but don't go using this instead of tmux or screen); instead, when I wake my netbook, it gracefully reconnects to the server. The same is true of Blink's Mosh support. It even appears to save state between changing iOS application states, short of forcing the application to quit (or rebooting your device). This means I only log in once on my phone throughout the day, and the connection is gracefully dropped when the app becomes inactive and reconnects when the app is opened again.
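As a small illustration of that handoff (the port range here is an assumption for firewalled servers, not something Blink requires), a typical mosh invocation from a desktop looks like this:

# mosh authenticates over SSH first, then switches to its own UDP-based
# protocol; -p pins the server-side UDP port range so a firewall can
# allow it through. Roaming and sleep/wake survival come from the UDP side.
mosh -p 60000:60010 winston@worf.winny.tech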
I did notice I crashed Blink after writing non-ASCII binary data to the terminal followed by some other operations, but I have not managed to reproduce the crash yet. If I do, I am confident I can report the bug and get meaningful feedback too, as Blink has a bug tracker on GitHub5. The code is available under GPLv3 (but not to worry, as the single copyright holder can safely relicense it to deploy on the iOS App Store6). I could probably build it myself and install that, but I neither own a Mac (nor wish to install Mac OS X in a VM) nor ever want to open XCode again even if I had a way to. That life isn't for me. Again, I'm very practical about how I use my iOS devices and don't enjoy using them more than I have to, so I'm willing to pay $20 for an application I'll use every day of my life on this walled-garden platform. An added silver lining is that this investment goes towards the development of a great app that appears to have active maintainership.
Note: I did not explore some of Blink's features, such as using it as a local iOS shell or settings synchronization. YMMV.
Getting started
Here are the steps I used to get started with Blink, as it was relatively unclear where to find documentation and which settings are necessary to configure. Usage information can be found via the help command, the README on GitHub7, and fiddling with the UI.
- Run the config command.
  - Default User -> set to your preferred username on most of your servers.
  - Add a new font via: Appearance -> Add a new font. Open the gallery and grab the raw CSS file (you can get a raw URL after switching to GitHub desktop), then paste it into the URL Address Field. I prefer Droid Sans Mono for its readability at all sizes.
  - Keys -> + -> Create New. Please use ssh keypair authentication. Password authentication requires you to memorize a password to log in, which can be brute-forced or otherwise leaked when typing it in the wrong context. Also, be sure to use a unique key on each of your devices to make revocation easier when a device is no longer used. Note: it appears keys are always stored as plaintext, which may be acceptable for your uses, but it appears ssh-keygen can create a key with a passphrase. I wasn't able to get Blink to work with passphrase-protected keys, but I didn't try very hard. YMMV.
    - Type: RSA
    - Bits: 4096 (why not?)
    - Name: id_rsa
  - Hosts -> +
    - Host: the name you gave your host when running ssh host or mosh host. This is not related to the server's hostname, though it can be the same. I prefer simple names, usually the hostname before the first dot (e.g. worf.winny.tech becomes simply worf).
    - User: make sure this is correct.
    - Key: select a key to use.
- Now you need to install the SSH public key:
  - Run config again.
  - Keys -> id_rsa -> Copy Public Key. Install it some way. I usually prefer to visit https://ptpb.pw/f and paste the public key, hit Paste, then copy the URL with the text of "created". You can then curl https://ptpb.pw/PasteId on the server via another login setup (or ssh from a set-up machine) and add it to your ~/.ssh/authorized_keys. Make sure that file is mode 0600 (chmod 0600 ~/.ssh/authorized_keys). Also make sure the .ssh directory is 0700 (chmod 0700 ~/.ssh). OpenSSH refuses to use authorized_keys if these permissions are world readable. (See the sketch after this list.)
- Run ssh worf. Congratulations, you now have an SSH session on the best iOS client available at this time.
- To use mosh, ensure mosh is installed on the server.8 Then run mosh worf.
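For reference, here is a sketch of the server-side key installation described above; the paste URL is whatever ptpb.pw returned for your public key, and the rest is standard OpenSSH housekeeping:

# Append the uploaded public key to authorized_keys, then tighten
# permissions so OpenSSH will honor the file.
curl https://ptpb.pw/PasteId >> ~/.ssh/authorized_keys
chmod 0700 ~/.ssh
chmod 0600 ~/.ssh/authorized_keys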
Footnotes:
GNU C Style
No. Do not use it please! There are far easier-to-read and easier-to-use styles for C!1

Footnotes:
Publishing with org-static-blog
Criteria
After reviewing a list of org-mode1 capable static website generators2, I decided to see if org-static-blog3 could satisfy my simple needs. My criteria for choosing an org-mode static site generator were:
- it must be actively maintained,
- it must be simple to set up with customizations,
- and it must work with Emacs 26 and later.
This ruled out quite a few right away. I didn't attempt using org-publish, as it looked like a great deal of configuration to achieve a minimum viable web page for this project.
Configuration of org-static-blog
Following the org-static-blog README documentation, it is very straightforward to get a minimal viable website generated. I added the following to my init.el:
(add-to-list 'auto-mode-alist
             (cons (concat org-static-blog-posts-directory ".*\\.org\\'")
                   'org-static-blog-mode))
(setq org-static-blog-publish-title "blog.winny.tech")
(setq org-static-blog-publish-url "https://blog.winny.tech/")
(setq org-static-blog-publish-directory "~/projects/blog/")
(setq org-static-blog-posts-directory "~/projects/blog/posts/")
(setq org-static-blog-drafts-directory "~/projects/blog/drafts/")
(setq org-static-blog-enable-tags t)
(setq org-static-blog-page-header
      "<link href=\"static/style.css\" rel=\"stylesheet\" type=\"text/css\" />")
I opted to setq all the configuration variables; I will likely switch to using M-x customize-group RET org-static-blog RET in the future, however, as it's a better experience.
Then I simply create a new post with M-x org-static-blog-create-new-post RET <TITLE> RET, edit the buffer, save it, then run M-x org-static-blog-publish RET.
I also added some styling to my style.css4 based on the Tachyons CSS framework5. I had previously used Bootstrap6 for styling but was hoping to avoid adding frameworks and other extra tooling that shouldn't be necessary to generate a simple site like this one.
Deploying with Caddy and GitHub
I am a fan of the Caddy7 web server, which offers automatic HTTPS via LetsEncrypt with only a few lines of configuration. In addition, Caddy has a plugin named git, which offers the ability to automatically deploy content from git repositories with webhook support. To deploy, the following steps are taken:
- Run M-x org-static-blog-publish RET from Emacs to regenerate the static site,
- commit the changes in git,
- and finally push the git branch to GitHub.
After these steps, GitHub automatically sends an HTTP POST request to my Caddy server with information about the new git commits, and Caddy pulls the git repository. If everything went well and the webhook successfully fired, the website is now deployed.
Server Configuration
I switched most of my personal internet-related services to Docker8 in conjunction with docker-compose9 last year. The main rationale is that I can move my configuration between hosts without dealing with system package versions. I already had Caddy set up, so it was as simple as adding this to my Caddyfile:
blog.winny.tech {
root /srv/www/blog.winny.tech
gzip
log /logs/blog.winny.tech.log
git https://github.com/winny-/blog.winny.tech {
hook /webhook top-secret-password-redacted
}
}
The relevant lines of my docker-compose.yml look like this:
version: "2.1"
services:
  web:
    image: abiosoft/caddy:no-stats
    ports:
      # Expose the webserver ports to the internet
      - "80:80"
      - "443:443"
    environment:
      # This is where caddy places certs after ACME negotiation.
      CADDYPATH: "/etc/caddycerts"
      ACME_AGREE: "true"
    volumes:
      - /srv/caddy/certs:/etc/caddycerts
      - /srv/caddy/Caddyfile:/etc/Caddyfile  # Configuration
      - /srv/www:/srv/www                    # the websites
      - /srv/caddy/logs:/logs
A keen docker-compose-savvy reader will notice I did not specify a restart: always entry. I had Caddy configured to always restart; however, when requesting new HTTPS certificates from LetsEncrypt, there is a tendency to misconfigure the domain configuration or Caddyfile, and if Caddy requests too many HTTPS certificates in a short amount of time, LetsEncrypt will rate-limit my future requests. Usually this only requires an hour or two of waiting, but it is frustrating to deal with when trying to fix my configuration. Instead, I'd rather Caddy exit after failing to activate all the domains, so I can fix my configuration first.
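With that trade-off in mind, recovering from a failed start is manual. This is a stock docker-compose invocation, with the service name taken from the docker-compose.yml above:

# Bring the web service back up in the background; without restart: always,
# a failed ACME negotiation leaves it stopped until this is re-run.
docker-compose up -d web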
GitHub Configuration
Simply create the GitHub repository, then add a webhook. It is important to note that the webhook must send a JSON payload. By default, a newly created webhook will send an application/x-www-form-urlencoded payload and will not work.
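Out of curiosity, here is one way to poke the endpoint by hand. This is only a sketch of an assumed check, not something the plugin documents: the git plugin validates GitHub's event headers and hook secret, so an empty payload like this should be rejected rather than trigger a pull, but getting any response back at least confirms Caddy is routing requests to the hook path:

# Send a minimal JSON POST to the hook path from the Caddyfile above.
curl -i -X POST -H 'Content-Type: application/json' -d '{}' \
    https://blog.winny.tech/webhook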
Conclusion
With this simple setup I can write posts. I can easily move the configuration to a new host at will. In addition, my setup does not depend on future use of GitHub, as GitLab, Gogs, and other git hosts offer webhook support in the same way. Most importantly, I can author org-mode files and have a better balance between features and ease of use than what markdown offers.
Web dev is one of my least favorite programming exercises. Between all the testing necessary to ensure a simple site works across many platforms, the trend of using very complex systems such as webpack and many other tools to produce simple websites, and the perpetual flux-and-flow between vendors only partially implementing good features, web dev just doesn't do it for me. Hence, I am very pleased with how simple this project turned out to be.
Footnotes:
Toggle Redshift with Keyboard Shortcut
Redshift is a screen-tinting program that achieves similar goals to the popular f.lux1 program.
I perused the redshift man pages and noticed there is no documented way to toggle redshift. Of course, one can click the notification area icon when using redshift-gtk or SIGTERM the redshift process, but neither is very user friendly. (The mouse is not user friendly.) After some awkward DuckDuckGo-ing and Googling, I found an obvious solution on the redshift homepage2: simply send SIGUSR1 to the redshift or redshift-gtk process. When using redshift-gtk, one can choose to send SIGUSR1 to either redshift or redshift-gtk.
This is the script I came up with:
#!/bin/sh
set -eu
if ! pkill -x -SIGUSR1 redshift; then
    echo 'Could not find redshift process to toggle.' >&2
    exit 1
fi
After installing the script into my system's PATH, all I have to do is add a line to my Xbindkeys3 configuration file (~/.xbindkeysrc.scm) such as:
(xbindkey '(Mod4 F2) "toggle-redshift")
Now I can type Mod4-F2 and toggle Redshift.