Anycasting with Docker and Exabgp

Context

I’ve been playing with the idea of Anycasting some of my services for a while.

The ideal candidate for this is syslog (and that is what this blog post focuses on), because quite a few products I own support only a single syslog endpoint; Watchguard firewalls are a perfect example of this.

In my current architecture pretty much all non-Windows services run as Docker containers on a number of non-clustered CoreOS servers. There is no need to cluster them, as they either operate independently of each other
or replicate at the application layer (e.g. DNS master/slave replication).

Here is an example of what two independent logstash instances receiving syslog messages look like on these nodes.

ROH:

9525f3def0dd        logstash:latest             "/docker-entrypoin..."   9 days ago          Up 20 hours             0.0.0.0:1514->1514/tcp, 0.0.0.0:5044->5044/tcp, 0.0.0.0:9600->9600/tcp, 0.0.0.0:514->10514/udp, 0.0.0.0:32768->10514/tcp         logstash

MFC:

2a4f9fe37700        logstash:latest             "/docker-entrypoin..."   11 days ago          Up 8 hours             0.0.0.0:1514->1514/tcp, 0.0.0.0:5044->5044/tcp, 0.0.0.0:9600->9600/tcp, 0.0.0.0:514->10514/udp, 0.0.0.0:32768->10514/tcp         logstash

These two run in two geographically distributed sites, ROH being in Asia, and MFC in Europe.

Idea

As I previously mentioned, the idea is to allow for downtime of one of these nodes without losing (too many) syslog messages.

I initially looked at Kubernetes and Calico, as suggested in various places on the internet, but I gave up at the stage of reading the documentation, simply because this would mean redesigning the way I work with my containers.

So I instead created a loopback address on the CoreOS host and started experimenting with bird to announce that loopback to the neighbouring router.
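
“Creating a loopback address” here just means pinning the anycast /32 to the lo interface; something along these lines:

# the anycast /32 that gets announced later in this post
ip address add 10.255.255.1/32 dev lo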

I then realised that I needed to track the health of the containers, because the fact that the CoreOS node is up does not necessarily mean that the specific container, or more importantly the service within that container, is also fine.

While googling for a health check solution I came across this Reddit discussion where ExaBGP is mentioned.

So I replaced bird with ExaBGP, and tried a few quick tests:

Containers up in both locations

$ show ip bgp 10.255.255.1  
BGP routing table entry for 10.255.255.1/32
Paths: (2 available, best #1, table Default-IP-Routing-Table)
  Not advertised to any peer
  65001 65003
    10.255.254.21 from 10.255.254.21 (10.255.1.254)
      Origin IGP, metric 0, localpref 100, valid, external, best
      Last update: Thu Dec  28 20:01:48 2017
  65129 65131
    10.255.254.9 from 10.255.254.9 (10.255.129.254)
      Origin IGP, metric 0, localpref 100, valid, external
      Last update: Thu Dec  28 03:08:36 2017

One Container down

$ show ip bgp 10.255.255.1  
BGP routing table entry for 10.255.255.1/32
Paths: (1 available, best #1, table Default-IP-Routing-Table)
  Not advertised to any peer
  65129 65131
    10.255.254.9 from 10.255.254.9 (10.255.129.254)
      Origin IGP, metric 0, localpref 100, valid, external
      Last update: Thu Dec 28 03:08:36 2017

I then thought about this a little bit longer, and realised that every Anycasted service should have its own separate /32 address, so that if the underlying container is down, its route can be withdrawn without affecting other services running on the same CoreOS instance.

Luckily ExaBGP’s health check supports adding the loopback addresses on demand. I struggled to get this to work, but it boiled down to running exabgp with the NET_ADMIN capability AND as root (user = root in the exabgp.env, to be specific).
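
In other words, the relevant part of exabgp.env should end up looking roughly like this (the [exabgp.daemon] section name is the stock ExaBGP one; only the user line matters here):

[exabgp.daemon]
user = root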

My exabgp container mikenowak/exabgp already has this set up, so grab that and don’t forget to star it.

And here is the config used in the above example:

neighbor 10.255.3.254 {
        router-id 10.255.3.4;
        local-as 65003;
        peer-as 65001;
        md5-password 'PASSWORD';

        api services {
                processes [ watch-loghost ];
        }
}

process watch-loghost {
        encoder text;
        run python -m exabgp healthcheck --cmd "nc -z -w2 localhost 1514" --no-syslog --label loghost --withdraw-on-down --ip 10.255.255.1/32;
}
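
A second anycasted service would later just get its own process block with its own /32, plus an entry next to watch-loghost in the neighbour’s processes [] list above. A purely hypothetical example (service name, port and address made up for illustration):

process watch-beats {
        encoder text;
        run python -m exabgp healthcheck --cmd "nc -z -w2 localhost 5044" --no-syslog --label beats --withdraw-on-down --ip 10.255.255.2/32;
}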

Quite happy with the results, I’ve built a Docker container mikenowak/exabgp that wraps everything in one simple place.

This can now be run as follows:

docker run -d --name exabgp --restart always -p 10.255.3.4:179:179 --cap-add=NET_ADMIN --net=host -v exabgp_usr_etc_exabgp:/usr/etc/exabgp mikenowak/exabgp
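
Once the health check passes, two quick sanity checks on the CoreOS host (plain iproute2 plus the same nc used by the health check) confirm that the anycast address is on the loopback and that logstash answers on it:

# the health check should have added the anycast /32 to the loopback
ip addr show dev lo

# and logstash should be reachable on the anycast address locally
nc -z -w2 10.255.255.1 1514 && echo loghost reachable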

A few things to remember:

  • If you are receiving TCP+SSL syslog, make sure that the server certificate has the proper SAN; in my case I use loghost.domain.local.
  • Docker containers will listen on all IP addresses of the CoreOS host (primary + loopbacks) unless you specify `-p 1.2.3.4:514:514/udp` when starting them. I do that for all containers by default, but NOT for the Anycasted ones. The reason is that containers answering on the anycast addresses could easily confuse Nessus into producing inconsistent results when scanning those addresses, given that one container may be running a different version than the other. So I instead scan the CoreOS primary IP and call it a day.

Hope this helps somebody!

Mind your MTU. A tale of UniFi, EdgeRouter-X, IPSec and NPS.

As I previously wrote here, I’ve replaced one of the Watchguards with a UniFi AP and an EdgeRouter X. Everything was pretty much fine until we started converting wired computers to wireless in an effort to get rid of some obscure cabling.

To give you a bit of background: in this setup the domain-joined wireless clients authenticate to the network using EAP-TLS against an NPS RADIUS server.

I have this setup working perfectly fine behind Watchguards in other locations, so I basically replicated the settings on the UniFi controller, but the clients refused to join the network for some reason.

So there I was, looking at the incredibly difficult-to-read accounting logs on the NPS server, where it appeared that the clients were completing the authentication just fine. Well, at least <Reason-Code data_type="0">0</Reason-Code> was being logged. Anyway, my eyes got tired pretty fast looking at that stuff!

I then saw that others on the Internet had a bunch of NPS events in their Event Log while mine was pretty empty, so I spent a day trying to get NPS event logging to work.
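
For what it is worth, these NPS events land in the Security event log and are gated by the "Network Policy Server" audit subcategory, which a group policy can quietly turn off. Checking and enabling it from an elevated prompt on the NPS server looks like this:

auditpol /get /subcategory:"Network Policy Server"
auditpol /set /subcategory:"Network Policy Server" /success:enable /failure:enable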

When I finally got it to work, I saw this event being logged:

Authentication Details:
	Connection Request Policy Name:  MY-WIFI-NETWORK
	Network Policy Name:             -
	Authentication Provider:         Windows
	Authentication Server:           domain-controller.local
	Authentication Type:             -
	EAP Type:                        -
	Account Session Identifier:      -
	Reason Code:                     3
	Reason:                          The RADIUS Request message that Network Policy Server received from the network access server was malformed.

A malformed request, you say? Well OK, I accept the challenge!

So I ran the packet capture, and got this:

16:22:28.767060 IP unifi-ap.34381 > domain-controller.radius: RADIUS, Access-Request (1), id: 0x5a length: 198
16:22:28.802027 IP domain-controller.radius > unifi-ap.34381: RADIUS, Access-Challenge (11), id: 0x5a length: 90
16:22:28.811600 IP unifi-ap.34381 > domain-controller.radius: RADIUS, Access-Request (1), id: 0x5b length: 312
16:22:28.847224 IP domain-controller.radius > unifi-ap.34381: RADIUS, Access-Challenge (11), id: 0x5b length: 1472
16:22:28.847289 IP domain-controller > unifi-ap: udp
16:22:28.851982 IP unifi-ap.34381 > domain-controller.radius: RADIUS, Access-Request (1), id: 0x5c length: 213
16:22:28.884595 IP domain-controller.radius > unifi-ap.34381: RADIUS, Access-Challenge (11), id: 0x5c length: 1472
16:22:28.884655 IP domain-controller > unifi-ap: udp
16:22:28.889344 IP unifi-ap.34381 > domain-controller.radius: RADIUS, Access-Request (1), id: 0x5d length: 213
16:22:28.921571 IP domain-controller.radius > unifi-ap.34381: RADIUS, Access-Challenge (11), id: 0x5d length: 932
16:22:28.960512 IP unifi-ap.34381 > domain-controller.radius: RADIUS, Access-Request (1), id: 0x5e length: 1472
16:22:28.960530 IP unifi-ap > domain-controller: udp
16:22:31.960962 IP unifi-ap.34381 > domain-controller.radius: RADIUS, Access-Request (1), id: 0x5e length: 1472
16:22:31.960969 IP unifi-ap > domain-controller: udp
16:22:37.961414 IP unifi-ap.34381 > domain-controller.radius: RADIUS, Access-Request (1), id: 0x5e length: 1472
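
This capture (and the verbose one further down) is nothing fancier than tcpdump watching the RADIUS exchange between the AP and the NPS box; something along these lines does the job (interface and host names are placeholders), with the second command simply adding -v so the RADIUS attributes get decoded:

# the RADIUS conversation, fragments included
tcpdump -i eth0 host domain-controller and udp

# verbose variant: decodes IP header details and RADIUS attributes
tcpdump -v -i eth0 host domain-controller and udp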

That didn’t tell me much, so I tried again in verbose mode, and got this back:

15:01:14.454542 IP (tos 0x0, ttl 64, id 16451, offset 0, flags [+], proto UDP (17), length 1500)
    unifi-ap.32887 > domain-controller.radius: RADIUS, length: 1472
        Access-Request (1), id: 0xb0, Authenticator: fd09f3b2dcd8dd0d07d0cad52894ffa
          User-Name Attribute (1), length: 26, Value: host/windows7.local
          NAS-IP-Address Attribute (4), length: 6, Value: unifi-ap
          NAS-Identifier Attribute (32), length: 14, Value: f09fc229df71
          NAS-Port Attribute (5), length: 6, Value: 0
          Called-Station-Id Attribute (30), length: 29, Value: XX-XX-XX-XX-XX-XX:MY-WIFI-NETWORK
          Calling-Station-Id Attribute (31), length: 19, Value: XX-XX-XX-XX-XX-XX
          Framed-MTU Attribute (12), length: 6, Value: 1400
          NAS-Port-Type Attribute (61), length: 6, Value: Wireless - IEEE 802.11
          Connect-Info Attribute (77), length: 23, Value: CONNECT 0Mbps 802.11b
          EAP-Message Attribute (79), length: 255, Value: .F....
          EAP-Message Attribute (79), length: 255, Value: ..
          EAP-Message Attribute (79), length: 255, Value: A.2F2.l..0..9...zF?....
          EAP-Message Attribute (79), length: 255, Value: CA.crl0m..+........a0_0]..+.....0..Qhttp://pki.local/Enterprise%20Certificate%20Authority.crt0...*.H.......
          EAP-Message Attribute (79), length: 255, Value: c.I&....pBt.......6...b.......K&...."za...\.&.z..o.`^.O.k.x.Ox..b]{f........)U.L.+.&&f▒j..%.^Cw.\...z.~..$.........[7..A..g..0...L..4.{.z.LY....NY.O.o..B.XRLM6...>R!.E........a....... t.....0..,.a.u.l.Q..|..K..Q..4yz..M...K..H.......e;p'.wd..A..^...o~.>
          EAP-Message Attribute (79), length: 229 (bogus, goes past end of packet)

That “bogus, goes past end of packet” caught my eye immediately, and then I noticed the packet length, which appeared strangely big for an IPSec-protected GRE tunnel.

So I googled and googled and found that one way around this was to reduce the MTU on the GRE interfaces.

However, I also came across the MSS-Clamp, which appears less intrusive, as it puts the overhead of managing the packet size on the end device rather than the router.

My calculations for the MSS-Clamp are as follows:

1500 Ethernet MTU
– 20 TCP Header *
– 20 IP Header *
– 20 IPSec Header
– 52 ESP Header
– 24 GRE Header
= 1364

So I rounded it down to 1360 for good measure, committed, and… nothing happened!

Of course, I forgot: RADIUS traffic is UDP, and the MSS-Clamp applies to TCP only. I am leaving it in place anyway, as quite a few people on the Ubiquiti forums have complained about dodgy TCP traffic over IPSec on these devices, and now that I think of it, this might have been the root cause of another issue with flaky RDP to that site.

So the maths to get to the right MTU size is the same as above for the MSS-Clamp, minus the items marked with an asterisk (the IP and TCP headers), which gives 1404; but let’s round it down to 1400, as recommended by Cisco.

And this is set as follows on the EdgeRouter-X:

set firewall options mss-clamp interface-type tun
set firewall options mss-clamp mss 1360
set interfaces tunnel tun0 mtu 1400

Upon commit, the clients began authenticating successfully.
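
If you would rather verify the new tunnel MTU than take it on faith, a pair of DF-set pings across the tunnel does the trick: 1400 bytes of MTU minus 20 bytes of IP and 8 bytes of ICMP headers leaves 1372 bytes of payload.

ping -M do -s 1372 domain-controller   # exactly fills the 1400-byte tunnel MTU, should succeed
ping -M do -s 1373 domain-controller   # one byte too many, should be rejected as too long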

So here it is, and the lesson for today is – Mind your MTU.

Working around the read-only file systems in CoreOS with overlay

I had a specific use case for placing quiesce scripts on a CoreOS instance running in a VMware virtual machine, so that I could take a consistent backup with Veeam.

While I generally agree this is a bad idea, and I admit that I store most of the important stuff in git, there are times when I am lazy in development and just want to have a backup of any sort.

So right back to the subject, shall we?

Of course, building my own image and keeping it up to date is one of the options, but let’s call it plan Z for the moment.

Luckily, an overlay mount can be used to work around the fact that /usr is a read-only partition.

I decided to keep the scripts in /opt/sbin (as this location is read-write and persists across reboots).

It is as simple as:

mkdir /opt/sbin
mount -o "lowerdir=/usr/sbin:/opt/sbin" -t overlay overlay /usr/sbin
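
A quick check that the merged mount is in place:

# should report FSTYPE "overlay" with both lowerdirs in the options
findmnt /usr/sbin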

Also, in order to survive reboots, we need the following systemd mount unit:

[Unit]
Description=Overlay mount for /usr/sbin
Before=local-fs.target
ConditionPathExists=/opt/sbin

[Mount]
Type=overlay
What=overlay
Where=/usr/sbin
Options=lowerdir=/usr/sbin:/opt/sbin

[Install]
WantedBy=local-fs.target
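
One systemd detail worth remembering: mount units must be named after the mount point they control, so this file has to be saved as /etc/systemd/system/usr-sbin.mount, after which it is enabled in the usual way:

systemctl daemon-reload
systemctl enable usr-sbin.mount
systemctl start usr-sbin.mount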

Finally, here are the quiesce scripts that I use.

The /usr/sbin/pre-freeze-script script shuts down all the docker containers.

$ cat /usr/sbin/pre-freeze-script
#!/bin/bash
docker stop $(docker ps -aq) >/dev/null 2>&1

The /usr/sbin/post-thaw-script script restarts docker.service. This forces all containers to start up in the right order (think legacy links). I attempted to write logic to start the containers without a service restart, but that became pretty complex code with no added benefit, so I just gave up.

$ cat /usr/sbin/post-thaw-script
#!/bin/bash
systemctl restart docker.service >/dev/null 2>&1
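
Both scripts physically live in /opt/sbin (which the overlay then exposes under /usr/sbin, where the VMware tools look for them) and just need to be executable:

chmod 0755 /opt/sbin/pre-freeze-script /opt/sbin/post-thaw-script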