The Soviet Union Lives On (Kind Of)
This post is a post-mortem regarding intermittent DNS failures on my home network.
My partner and I observed the following: from time to time, our internet service would become unusable for about a minute. During that window, pages in our mobile browsers would hang and never load, and the Wi-Fi section of the Settings app on both of our iPhones would report “No internet access” for the duration of the outage.
It’s always DNS
Figuring DNS was a likely culprit (because it always is), I checked the logs for my home network’s resolver, Knot Resolver. (I self-host a resolver so I can replace a portion of the public DNS tree and resolve the internal .home TLD.) Knot Resolver was reporting DNSSEC failures during the outages, and a look at the Grafana dashboard confirmed it was answering with SERVFAIL for as long as they lasted.
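For anyone chasing something similar, the quickest way I know to confirm the SERVFAILs directly, rather than trusting a dashboard, is to query the resolver by hand. This is only a sketch; 10.0.0.53 stands in for wherever your resolver actually listens:
# Print just the response header/comments; during an outage the status
# reads SERVFAIL instead of NOERROR. 10.0.0.53 is a placeholder address.
$ dig @10.0.0.53 example.com A +noall +comments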
Knot Resolver
I had already been frustrated with Knot Resolver because of its caching implementation. I had previously observed a strange “Resource not available” error on the cache file (an LMDB database), but figured it wasn’t a huge deal since DNS resolution was still working properly. Now, though, the cache was my primary debugging target. At the time, I was running 3 replicas of Knot Resolver with the LMDB cache stored in an emptyDir, so each replica had its own cache. I figured the cache might be filling up, so I set up a PersistentVolumeClaim and had the three replicas share a single cache. This did not resolve the issue, and in fact triggered even more of the “Resource not available” errors.
After further investigation into LMDB’s documentation and forums (and even into Knot Resolver’s codebase), I discovered that, being memory-mapped and coordinated through a shared lock file, LMDB does not play nicely with multiple processes opening the same database, especially when they have the same PID. Since this was exactly my use case (three processes opening the same database, all with PID 1 in their own containers), I decided that Knot Resolver was not the best solution.
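If you're curious about the PID part: as I understand it, LMDB's reader lock table tracks readers by PID, and every container sees its main process as PID 1 in its own PID namespace, so three "different" processes all look identical to the lock file. That part is easy to confirm (the pod name is a placeholder, and this assumes the image ships a shell):
# Show the command line of PID 1 inside one replica's container.
$ kubectl exec knot-resolver-0 -- sh -c 'tr "\0" " " </proc/1/cmdline; echo'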
CoreDNS
So, I replaced Knot Resolver with CoreDNS. Since CoreDNS is what Kubernetes itself uses, I anticipated that it would play much more nicely with being run in a containerized environment. I was correct; CoreDNS was actually super simple to set up, and it has been rock solid. However, much to my dismay, the intermittent DNS failures persisted.
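For reference, "super simple" here means the whole configuration is a single Corefile. Mine boils down to something like the following sketch, where the zone file path and upstream addresses are stand-ins rather than my exact config:
# Trimmed Corefile sketch: serve the internal zone locally, forward
# everything else upstream, and expose metrics for Grafana.
$ cat > Corefile <<'EOF'
home {
    file /etc/coredns/db.home
    log
    errors
}

. {
    forward . 1.1.1.1 1.0.0.1
    cache
    prometheus :9153
    log
    errors
}
EOF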
CoreDNS, though, gave me much more information than Knot Resolver. It logged the following:
2023-01-28T19:11:53.044235117Z [INFO] 10.244.0.1:23475 - 3032 "A IN spectrum.s3.amazonaws.com. udp 43 false 512" - - 0 2.000499727s
2023-01-28T19:11:53.044286533Z [ERROR] plugin/errors: 2 spectrum.s3.amazonaws.com. A: read udp 10.244.x.x:49428->1.0.0.1:53: i/o timeout
So, resolution itself was timing out. There are a few reasons why that might happen, of course. In order of increasing probability, my thinking was as follows (one way to poke at a couple of these directly is sketched after the list):
- Cloudflare DNS could be down
- My ISP uplink could be down
- My router’s network stack could be down
- Kubernetes’ iptables implementation could be messed up
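A couple of these can be narrowed down from inside the cluster by bypassing CoreDNS and querying the upstream from a throwaway pod. Again a sketch; the image choice and upstream address are just what I'd reach for:
# A direct query to the upstream exercises the same path minus CoreDNS:
# if it also times out during an outage, the problem sits somewhere
# between the cluster and Cloudflare rather than in CoreDNS itself.
$ kubectl run dig-test --rm -it --image=alpine -- \
    sh -c 'apk add -q bind-tools && dig @1.0.0.1 example.com +time=2 +tries=1'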
Though I doubted it was the issue, I checked Cloudflare’s status page anyway: no reported issues. Same for Verizon: no reported issues. The only thing between CoreDNS and the public Internet (other than the miscellaneous Kubernetes networking machinery) was my router, a Ubiquiti UDM Pro. I had actually had a downtime issue with it previously, when it auto-updated device firmware and took down my “backbone” switch; that setting has since been turned off.
I do use the UDM Pro’s “Threat Management” functionality, set to “Detect and Block”, so I figured that was my next most likely culprit. I checked the logs (“System Logs” > “Threats”), and there it was: a “Potentially Bad Traffic” detection from my Kubernetes host -> Cloudflare DNS:
Service: ET DNS Query for .su TLD (Soviet Union) Often Malware Related
Category: DNS
Activity: Potentially Bad Traffic
Source IP: 10.10.x.x:35457
Destination IP: 1.0.0.1:53
The Soviet Union?
An .su TLD?? Praying it wasn’t any of the IoT crap on my network, I figured the next most likely culprit was my self-hosted Mastodon server; people love having edgy domains, especially in the Fediverse. Luckily, this was indeed the case:
$ curl https://ezra.social/api/v1/instance/peers
["...","peertube.su","..."]
I --followed CoreDNS’s logs, visited peertube.su in my browser, and watched the logs report the exact same i/o timeout error as during an unexpected outage. This was it!
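The reproduction loop, roughly (the namespace and deployment names here are placeholders for whatever yours happen to be, and 10.0.0.53 again stands in for the resolver):
# Terminal 1: follow CoreDNS's logs.
$ kubectl logs -f -n dns deploy/coredns

# Terminal 2: trigger the lookup without involving a browser.
$ dig @10.0.0.53 peertube.su A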
However, I didn’t know whether .su was safe to unblock, or if I should manually defederate from all .su instances. I read a few articles, with the upshot being this:
“The reality is that most people that use the .SU TLD are legitimate websites with it obviously having some cache, particularly if you want to do a site that has anything to do with life in the old CCCP.”

“… SU domain registrations … cost 590 Roubles, which means the price changes a lot depending on the exchange rate. Currently though that makes them just over $4, which means they are a steal.”
“Users of .su reportedly include anti-Russian Ukrainians, who are adopting the domain of the former Soviet Union as a way to harken back to the ‘antifascism’ of the communist bloc. So, if you got a lot of time on your free hands, and some up-to-date antivirus software installed on your computer, go ahead and start plugging in .su website into the url bar, and take a stroll down one of the weirdest parts of the internet.”
— https://www.inverse.com/article/8672-the-bizarre-afterlife-of-su-the-domain-name-and-last-bastion-of-the-ussr (permalink)
I’ve got no issue with anti-Russian Ukrainians or weird parts of the Internet, so I decided it was safe to allow .su DNS resolution.
Resolution
I added a “Suppression” to allow “threats” with this signature through the UDM’s IPS stack. I then tested again by visiting peertube.su in my browser, and was able to browse their PeerTube instance without issue. I will continue to monitor, but I consider this issue resolved. ☭