Proxmox 5 networking at OVH with systemd-networkd

OK, so that was an interesting one. I’ve had a dedicated server at OVH for years and years and I’ve always run VMware ESXi for virtualization there because I didn’t really know what I was doing and vSphere Client was a pretty nice GUI for managing things. Over the years as I’ve gotten more comfortable with Linux, the limitations of ESXi were more and more apparent. It’s really designed to be used as part of the whole VMware ecosystem and if you aren’t using their entire platform, then there’s really no reason to go with ESXi.

One of the biggest annoyances is the lack of visibility into the hardware. You need to use either SNMP or the VMware API to monitor anything and honestly, I just prefer a decent syslog.

Having migrated my home network to a Proxmox cluster some weeks ago, I figured a standalone instance at a hosting provider would be a piece of cake. There were definitely parts that went incredibly smoothly based on the experience from last time–converting the VMDKs to QCOW2s was as easy as could be. (qemu-img convert -f vmdk -O qcow2 disk.vmdk disk.qcow2) The converted images were not only smaller, but worked without any tweaking as soon as they were attached to their new KVM homes.

However, there was one serious speed bump that I didn’t anticipate at all: the /etc/network/interfaces file was missing on the host. Somehow this machine was getting its OVH IP address and I could not figure out where it was coming from. (Spoiler: If you guessed systemd-networkd, you win!)

Never having used systemd-networkd, I figured it couldn’t be that hard to get things up and running. And since it’s a massive deviation from the basic Proxmox installation, it seems like something OVH would’ve documented somewhere. Also note that Proxmox depends on /etc/network/interfaces for a lot of its GUI functionality related to networking, e.g., choosing a bridge during VM creation.

So the first thing I did was check out /etc/systemd/network/50-default.network. Looks more or less like a network interface config file with all of the relevant OVH details included. OK, but it’s also assigning the address to eth0 and if I know anything about Proxmox, it’s that it loves its bridges. It wants a bridge to connect those VMs to the outside world.

That gave me a place to start.

Setting up a bridge for VMs

Fortunately, plenty of people out there have also set up bridges in their various flavors of Linux with systemd-networkd, so there was enough documentation to put together the pieces.

  1. A netdev file creating the bridge device (vmbr0 in this case)
  2. A network file assigning eth0 to the bridge (think bridge_ports in /etc/network/interfaces terms)
  3. A network file configuring vmbr0
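
For reference, these end up as three separate files under /etc/systemd/network/. The exact filenames below are my own choice (systemd-networkd just reads every *.netdev and *.network file in that directory, in lexical order), so adjust to taste:

/etc/systemd/network/vmbr0.netdev           # creates the bridge device
/etc/systemd/network/vmbr0-slave.network    # assigns eth0 to the bridge
/etc/systemd/network/vmbr0.network          # configures the bridge itself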

Creating a netdev file for the bridge

[NetDev]
Name=vmbr0
Kind=bridge

Short and sweet.

Creating a network with the bridge

[Match]
Name=eth0

[Network]
Bridge=vmbr0

Also short and sweet.

Configuring the bridge interface

Everything from the [Network] section on was copied and pasted from OVH’s original /etc/systemd/network/50-default.network.

[Match]
Name=vmbr0

[Network]
DHCP=no
Address=xx.xxx.xx.xx/24
Gateway=xx.xxx.xxx.xxx
#IPv6AcceptRA=false
NTP=ntp.ovh.net
DNS=127.0.0.1
DNS=213.186.33.99
DNS=2001:41d0:3:163::1
Gateway=[IPv6 stuff]

[Address]
Address=[some IPv6 stuff]

[Route]
Destination=[more IPv6 stuff]
Scope=link

Seems simple enough… But wait, what do I do about the original configuration for eth0? Do I leave it in place? It could conflict with the bridge. Do I need to set it to the systemd equivalent of “manual” somehow? These are questions I could not find a simple answer to, so I just rebooted.

When the system came back up (with 50-default.network still in place), the bridge had been created, but it was DOWN according to ip addr. eth0 was UP and had the public IP address. Guess that answers that, need to remove 50-default.network.

# ln -sf /dev/null /etc/systemd/network/50-default.network

Another quick reboot and we should be back in business. But no, the server didn’t come back up. Rest in peace. I barely knew ye…as a Proxmox server. Any guesses as to what’s missing? If you said, “You needed to clone the MAC of eth0 onto vmbr0,” you are absolutely correct. Had to modify the netdev file accordingly:

[NetDev]
Name=vmbr0
Kind=bridge
MACAddress=00:11:22:33:44:55

OK, one more reboot…and this time we’re finally working. System came up and was still accessible, vmbr0 had the public IP, everything was looking good.

Creating a VM

Since the disk image doesn’t know anything about the specs of the VM, I had to recreate the VMs by hand. As I’m not a glutton for punishment, I used the Proxmox web interface to do this. But wait, remember how Proxmox uses /etc/network/interfaces and not systemd-networkd? When it came time to add a network adapter to the VM, there were no bridges available and the bridge is a required field. I ended up having to select No Network Device to get the web interface to create the VM. Then I logged into the server via SSH and edited the VM config file manually (while I was in there, I also pointed it at the converted QCOW2 instead of the disk created with the VM):

...
scsi0: local:100/old_disk.qcow2
net0: virtio=00:11:22:33:44:56,bridge=vmbr0

If you haven’t used OVH before, you might think, “OK, we’re done. Neat.” But wait, there’s more. OVH uses a system where your VMs can’t just start using IPs that are allocated to you, you need to register the MAC address for the IP address. And in most cases, this is super-simple and you don’t need to do anything more than paste the MAC address and be on your way. Unfortunately, in my case, these VMs are a relic of the old ESXi setup and there’s a dedicated router VM that had all of the IPs allocated and used iptables to forward traffic around to the VMs accordingly.

-----------------------------------------------
|   VMware           PUBLIC.IP.ADDRESS        |
|                                             |
|   [ Router VM ]              [ VM 1 ]       |
|   WAN: PUBLIC.IP.ADDRESSES   LAN: 10.2.0.11 |
|   LAN: 10.2.0.1                             |
|                              [ VM 2 ]       |
|                              LAN: 10.2.0.12 |
-----------------------------------------------

“What’s wrong with that?”, you ask, “Seems as though it should work just fine. Add the LAN IP to the bridge, put the VMs on the bridge, give them LAN IPs and everything’s good.” Yeah, sure, if I don’t mind my WAN traffic and LAN traffic going through the same bridge interface. But I do mind. I mind very much.

Solution: Creating a second bridge for the internal network

After already having created a bridge for the public address, this part was simple. Add another netdev for the new bridge and a minimal configuration for its network.

[NetDev]
Name=vmbr1
Kind=bridge

[Match]
Name=vmbr1

[Network]
Address=0.0.0.0

Rather than reboot, I tried a systemctl restart systemd-networkd to see if it’d pick up the bridge interface correctly and it did. One minor issue–systemd gave the bridge 10.0.0.1/255.0.0.0. Not that it matters in practice, but it was unexpected. Also at this point, I rebooted because I wanted to be sure everything came up working and it did.

Then it was just a matter of adding another network interface to the router VM manually and assigning it to vmbr1. It’s worth noting that you can’t issue a reboot to the VM and have it adopt configuration changes; it has to be stopped and started.

Epilogue

After that, I created the remainder of the VMs, added their old disk images, gave them network devices and everything was up and running just like before. Well, almost just like before–now I had better insight into my hardware.

One of the first things I did after updating the system was to install the RAID controller software and check the RAID status. One drive failed. Who knows how long that was going on? It was 1200 ATA errors deep at that point. Even though I’ve got great backups of the VMs (and their most recent disk images), a total storage failure would’ve been a much more stressful event. As it was, I reported the failed disk to OVH and they replaced it. Although annoyingly, they either pulled the wrong disk or shut the server down to do the swap despite me telling them which controller slot needed replacing. The RAID array is currently rebuilding and I’m running a shiny, new Proxmox instance. Feels good.

Making the lab transparently Tor-friendly

Why bother?

Just an interesting challenge, really. I was thinking about whether static routes could be pushed by a DHCP server and trying to think about any subnet I have that would require static routes. Then I remembered that Tor has an option to map resolved .onion addresses to private IP space, so why not?

First challenge: Forwarding Tor DNS

Per the usual Tor configuration, I set up a subnet for .onion addresses to map onto and told Tor to map them into that space when they’re resolved.

VirtualAddrNetwork 10.192.0.0/10
AutomapHostsOnResolve 1
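
The gateway also has to answer DNS queries and accept the transparently redirected TCP connections. I didn’t pull these lines from my actual torrc, but something along these lines is implied by the rest of this setup (BIND forwards to port 53 on the gateway below, and the iptables rule later redirects TCP to port 9040):

DNSPort 192.168.0.50:53
TransPort 192.168.0.50:9040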

That’s all well and good, but normally I’d be accessing Tor services from behind the Tor gateway, not in front of the gateway.

Normal case: 
                  [ Tor gateway @ 10.1.0.1 ]
                    /    |    \
    [ Sketchy .onion sites ] [ My VM @ 10.1.0.2 ]

This case:
 [ My network @ 192.168.0.0/24 ] -- [ Tor gateway @ 10.1.0.1 ]
                                      /    |    \
                        [ Sketchy .onion sites @ 10.192.0.0/10 ]

It could be pretty neat to have transparent access to Tor hidden services from any computer with no configuration. What’s particularly neat about this is that with the combination of DNS and a static route, we’re effectively recreating the transparent DNS/Routing setup from the Tor TransparentProxy documentation except for an entire network.

Since I’ve already got a BIND DNS server on my network, it was pretty trivial to add resolution via the Tor gateway (192.168.0.50) for .onion addresses:

zone "onion" {
        type forward;
        forward only;
        forwarders { 192.168.0.50; };
};

A simple forward for the .onion zone should solve it, right? Not so:

Feb 19 20:13:45 dhcp02 named[993]: NOTIMP unexpected RCODE resolving 'supersekritonion.onion/DS/IN': 192.168.0.50#53
Feb 19 20:13:45 dhcp02 named[993]: no valid DS resolving 'supersekritonion.onion/A/IN': 192.168.0.50#53

Yeah, that’s right, idiot BIND is trying to query DNSSEC for the .onion domain. What’s even more annoying is that you can’t set dnssec-validation for a single zone, it’s a server-wide option. At some point, I may try creating a view and using dnssec-must-be-secure but it’s not high on my list of priorities. I tossed dnssec-validation no; into /etc/bind/named.conf.options and away I went. A quick rndc reload and I could resolve .onion addresses from any system on my network.
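
For reference, the relevant bit of /etc/bind/named.conf.options ends up looking something like this:

options {
        ...
        dnssec-validation no;
        ...
};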

Second challenge: Adding a static route to DHCP

Apparently this functionality isn’t used very often because the configuration is atrocious in isc-dhcp-server. Note the top-level declaration of the type and then the in-subnet declaration of the static route.

...
option rfc3442-classless-static-routes code 121 = array of integer 8;
option ms-classless-static-routes code 249 = array of integer 8;
...
subnet 192.168.0.0 netmask 255.255.255.0 {
  option rfc3442-classless-static-routes 10, 10, 192, 192, 168, 0, 50;
  option ms-classless-static-routes 10, 10, 192, 192, 168, 0, 50;
  ...
}

If the static route in there doesn’t make a lot of sense to you, you aren’t the only one. TL;DR: 10.192.0.0/10 goes through 192.168.0.50. The first byte is the prefix length, the prefix length determines how many octets of the destination network follow (one octet per 8 bits of prefix, rounded up, so two octets for a /10), and then come the four octets of the gateway. If you need multiple static routes, just keep comma-separating those bytes.

Also, I had to add this iptables rule on my gateway to allow LAN traffic destined for the Tor IP space to be redirected appropriately:

gateway# iptables -t nat -A PREROUTING -i eth0 -d 10.192.0.0/10 -p tcp -m tcp -j REDIRECT --to-ports 9040

To test, I logged into a random test system, ran ifdown eth0 && ifup eth0 to renew the DHCP lease, and then checked the routing table:

# ip route
default via 192.168.0.3 dev eth0 proto static metric 100
10.192.0.0/10 via 192.168.0.50 dev eth0 proto dhcp metric 100

And there we have it, transparent access to .onion sites from my network thanks to DNS forwarding, DHCP static routes, and a DNS/TCP proxy on the gateway.

Update 2018-03-01

When converting my local GitLab server to a DHCP reservation, I discovered that it could no longer route externally. A quick glance at the routing table showed that it had no default route. But why? The routers option was clearly specified in dhcpd.conf. Turns out that when classless static routes are defined, the DHCP client ignores option routers (RFC 3442 actually requires this). The fix is simple: add the default route as a classless route in the configuration. In this case, 0, 192, 168, 0, 3.
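
Using the numbers from this post, the combined option lines (the 10.192.0.0/10 Tor route plus the zero-length default route via 192.168.0.3) would look something like this:

  option rfc3442-classless-static-routes 10, 10, 192, 192, 168, 0, 50, 0, 192, 168, 0, 3;
  option ms-classless-static-routes 10, 10, 192, 192, 168, 0, 50, 0, 192, 168, 0, 3;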

HA DHCP and incidentally PXE

Why bother?

My plan for this weekend had been to set up HA for my DHCP server given the cascading failures that resulted when I converted my DHCP server from VMware to KVM and the underlying Ceph storage filled up. Normally I’ve got a USB drive that I install from, but for whatever reason, the system that I wanted to perform the install on just sat at a blinking cursor when I tried to boot from the flash drive. I noticed that the LAN card had PXE capability, so I figured why not?

Adding PXE to the existing DHCP server

So PXE booting is going to require a TFTP server for serving the files. Most of the tutorials I’ve seen recommend installing inetd/xinetd as well to manage tftpd-hpa, but at least on Ubuntu 16.04, tftpd-hpa was immediately managed by systemd, so I didn’t see a need to install the other packages.

# apt install tftpd-hpa

This installs and starts tftpd-hpa with a server root at /var/lib/tftpboot. One thing to notice is the service status:

# systemctl status tftpd-hpa
 ● tftpd-hpa.service - LSB: HPA's tftp server
 Loaded: loaded (/etc/init.d/tftpd-hpa; bad; vendor preset: enabled)

Seeing that “bad” in status was concerning, but it doesn’t actually affect anything. Just to verify what we’re seeing:

# systemctl is-enabled tftpd-hpa
 tftpd-hpa.service is not a native service, redirecting to systemd-sysv-install

It’s just that tftpd-hpa doesn’t ship a native systemd unit file, so systemd reports it as bad. There’s nothing actually wrong; it’s just redirecting to an old-style init script that doesn’t support all of the systemd options.

The only change I had to make to my existing DHCP server was to add filename "pxelinux.0"; to the existing DHCP scope.
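
In other words, something like this in the subnet declaration (the subnet here matches the one used later in this post; next-server would only be needed if the TFTP server were a different host than the DHCP server):

subnet 192.168.0.0 netmask 255.255.255.0 {
  ...
  filename "pxelinux.0";
  # next-server would go here if TFTP lived on a separate host
}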

Setting up the PXE boot menu

First step is getting a bootable environment. The easiest way to do that is to grab Ubuntu’s netboot installer, especially since I had the ISO lying around for installs anyway. However, I’ve included the instructions to download the netboot installer and pull the files from it here.

# wget http://archive.ubuntu.com/ubuntu/dists/xenial/main/installer-amd64/current/images/netboot/netboot.tar.gz -O ubuntu-16.04-netboot.tar.gz
# mkdir ubuntu-16.04-netboot
# tar zxf ubuntu-16.04-netboot.tar.gz -C ubuntu-16.04-netboot
# cp -a ubuntu-16.04-netboot/ubuntu-installer /var/lib/tftpboot

Most of this wouldn’t be necessary if the PXE environment just needed to boot a single operating system. You could just throw the Ubuntu installer into /var/lib/tftpboot and off you go. However, the setup that follows creates a menu for selecting different OS options which gives me the flexibility to PXE boot different things in the future.

# cd /var/lib/tftpboot/
# mkdir boot-screens
# mkdir pxelinux.cfg
# cp ubuntu-installer/amd64/pxelinux.0 ./
# cp ubuntu-installer/amd64/boot-screens/ldlinux.c32 ./
# cp ubuntu-installer/amd64/boot-screens/*.c32 ./boot-screens/
# ln -s boot-screens/syslinux.cfg pxelinux.cfg/default

The PXE configuration itself is spread across boot-screens/syslinux.cfg (the file pxelinux.cfg/default points at) and the boot-screens/menu.cfg it includes:

path boot-screens
include boot-screens/menu.cfg
default boot-screens/vesamenu.c32
prompt 0
timeout 100
MENU WIDTH 48
MENU MARGIN 5
MENU TABMSG

MENU TITLE Boot menu
LABEL Boot local hard drive
  LOCALBOOT 0
MENU BEGIN ubuntu-16.04
  MENU TITLE Ubuntu Server 16.04
  LABEL mainmenu
    MENU LABEL ^Back...
    MENU exit
  INCLUDE ubuntu-installer/amd64/boot-screens/menu.cfg
MENU END

Note that there’s a quick PXE boot menu with the default option being to boot the local hard drive. This ensures that any systems that are configured to boot from LAN with higher priority than the local drive don’t suddenly stop working because they reboot into the Ubuntu installer. A better solution would probably be to put the PXE boot subnet on a different VLAN and then change the switchport configuration after the OS has been installed, but this is sufficient for now, especially since the PXE subnet is the same as the general client subnet.

I’m not doing any Kickstart/Preseed configurations at the moment, just a manual Ubuntu install since the focus here was supposed to be HA DHCP and not setting up a PXE environment. Also, since I tend to add VMs instead of physical machines to my network, I just clone a VM template instead of doing a PXE-based install. It’s much faster and less error-prone.

Adding an HA DHCP configuration into the mix

With the PXE server up and running, I was able to boot from LAN, get Ubuntu Server 16.04 installed on the new server, and off I went.

The HA configuration in isc-dhcp-server isn’t extremely difficult. You designate one server as primary and another as secondary, add a couple different attributes, and away you go. If you’ve done any HA before, one of the major annoyances is synchronizing state. In this case, I was less concerned about the leases (since I use address reservations for anything important) and more concerned about the configuration files. Primarily, I didn’t want to have to go in and edit dhcpd.conf on two servers every time another system needed a reservation. For that reason, I decided to put my DHCP server configuration into Ansible. That’ll also make it easy to re-deploy this setup if things break.

Starting with the configuration section for failover:

{% if dhcp_server is defined and 'role' in dhcp_server %}
failover peer "{{ dhcp_server['failover'] }}" {
  {{ dhcp_server['role'] }};
  address {{ dhcp_server['address'] }};
  port {{ dhcp_server['port'] }};
  peer address {{ dhcp_server['peer'] }};
  peer port {{ dhcp_server['peer_port'] }};
  {%- if dhcp_server['role'] == 'primary' %}
  mclt {{ dhcp_server['mclt'] }};
  split {{ dhcp_server['split'] }};
  {%- endif %}
  max-response-delay 60;
  max-unacked-updates 10;
  load balance max seconds 3;
}
{% endif %}

This is a pretty basic template to define the failover peer section in dhcpd.conf. The associated host_vars are:

# host_vars for dhcp01
dhcp_server:
  failover: dhcp-failover
  role: primary
  address: dhcp01.geek.cm
  port: 519
  peer: dhcp02.geek.cm
  peer_port: 520
  mclt: 2600
  split: 128

# host_vars for dhcp02
dhcp_server:
  failover: dhcp-failover
  role: secondary
  address: dhcp02.geek.cm
  port: 520
  peer: dhcp01.geek.cm
  peer_port: 519

One problem I ran into initially was trying to set the failover peer name to different things on each system, e.g., I wanted the failover peer to be “dhcp02” on dhcp01 and “dhcp01” on dhcp02. The syslog just tells you that you have an invalid peer name which isn’t particularly helpful. Fortunately, I found this old bug report describing a change where the failover peers have to share the same name. Fixed that up and all was good.

Converting the rest of the configuration to Ansible

There were only a couple more steps to get the configuration converted to Ansible–getting the subnets defined and adding the reservations. Fortunately, they’re both pretty straightforward. I started with group_vars for the DHCP servers and defined a data structure that would suit my needs:

subnets:
  - network: 192.168.0.0
    netmask: 255.255.255.0
    routers: 192.168.0.3
    domain_name: geek.cm
    dns_servers:
      - 192.168.0.10
      - 192.168.0.9
    pxe_filename: pxelinux.0
    pools:
      - peer: dhcp-failover
        range:
          bottom: 192.168.0.100
          top: 192.168.0.199

It should be pretty obvious: it defines a subnet and the relevant options, adds pxe_filename so that PXE booting is available, and specifies a DHCP range between 192.168.0.100 and 192.168.0.199. The template code that generates the config looks like:

{% for subnet in subnets %}
subnet {{ subnet['network'] }} netmask {{ subnet['netmask'] }} {
  option domain-name "{{ subnet['domain_name'] }}";
  option domain-name-servers {{ subnet['dns_servers']|join(',') }};
  option routers {{ subnet['routers'] }};
  {%- if 'pxe_filename' in subnet %}
  filename "{{ subnet['pxe_filename'] }}";
  {%- endif %}
  {%- for pool in subnet['pools'] %}
  pool {
    failover peer "{{ pool['peer'] }}";
    range {{ pool['range']['bottom'] }} {{ pool['range']['top'] }};
  }
  {%- endfor %}
}

It’s not an all-encompassing template that gives a ton of flexibility in the DHCP options, but it’s sufficient for my network at the moment. Finally, the reservations:

reservations:
  - name: netgear65
    mac: a1:63:92:c7:e2:34
    host: netgear65.geek.cm
    ip: 192.168.0.5
  - name: nginx
    mac: FE:2F:A0:A4:19:B4
    host: nginx.geek.cm
    ip: 192.168.0.11

Reservations are basically an array of hashes that have a name, a MAC address, a hostname, and an IP to reserve. And again, the template code to generate the reservations:

{% for reservation in reservations %}
host {{ reservation['name'] }} {
  hardware ethernet {{ reservation['mac'] }};
  server-name "{{ reservation['host'] }}";
  fixed-address {{ reservation['ip'] }};
}
{% endfor %}
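
As a sanity check, the nginx reservation above should render out to a plain host block in dhcpd.conf:

host nginx {
  hardware ethernet FE:2F:A0:A4:19:B4;
  server-name "nginx.geek.cm";
  fixed-address 192.168.0.11;
}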

Finally. There it is, a high-availability DHCP setup. Now I can fill up my Ceph cluster without destroying my entire internal network.

Adding a Proxmox compute node

Why bother?

This was mainly for testing. I was curious how Proxmox’s live migration would work between very different systems. My cluster consists of a SuperMicro 4-node server that all have the same processors and specs, but I wanted to know if I could join a random commodity system to the cluster and use it to host VMs.

Installing Proxmox and joining the cluster

Usual Proxmox installation and setup process; there was nothing special here. However, when I joined the cluster, I started getting SSL errors in the web interface for the new node: Permission denied (invalid ticket 401). The Proxmox forums suggested that this could be due to an issue with clock synchronization, but the clocks on my systems were fine. Refreshing the web interface took care of this, but now every time I tried to log in, it was just giving me the login dialog again. I checked the web interface for the rest of the cluster and viewing the new node gave the error tls_process_server_certificate: certificate verify failed (596). Here, the Proxmox forums suggest restarting the pveproxy and pvestatd services. Since the node wasn’t hosting anything yet, I rebooted it. That resolved the TLS certificate issue. In general, it’s probably a good idea to reboot a node after joining the cluster to let all of the configuration settings and services be applied.

The next issue I ran into is that the new node couldn’t access the Ceph storage. It’s not a storage node, so I didn’t go through the process of adding its disks as OSDs or anything. The web interface reported that there was no pveceph configuration. That was a pretty big clue that I needed to generate one. One pveceph install && pveceph init --network=10.250.0.0/24 later and that error was gone, but I still couldn’t access the Ceph cluster. The Proxmox web interface was showing a Communications failure error when trying to view the Ceph storage from the new node. That makes sense because I have a dedicated switch for Ceph that the 4-node is connected to but the new node is not. Rather than add another NIC to the node and run another cable to the Ceph switch, I took the easy way out.

proxmox1# sysctl -w net.ipv4.ip_forward=1 && sysctl -p
proxmox2# sysctl -w net.ipv4.ip_forward=1 && sysctl -p
proxmox3# sysctl -w net.ipv4.ip_forward=1 && sysctl -p
proxmox4# sysctl -w net.ipv4.ip_forward=1 && sysctl -p

proxmox5# vi /etc/network/interfaces
...
auto vmbr0
iface vmbr0 inet static
        ...
        post-up ip route add 10.250.0.0/24 dev vmbr0 via 192.168.0.30
        post-up ip route add 10.250.0.0/24 dev vmbr0 via 192.168.0.31
        post-up ip route add 10.250.0.0/24 dev vmbr0 via 192.168.0.32
        post-up ip route add 10.250.0.0/24 dev vmbr0 via 192.168.0.33
        pre-down ip route del 10.250.0.0/24 dev vmbr0 via 192.168.0.30
        pre-down ip route del 10.250.0.0/24 dev vmbr0 via 192.168.0.31
        pre-down ip route del 10.250.0.0/24 dev vmbr0 via 192.168.0.32
        pre-down ip route del 10.250.0.0/24 dev vmbr0 via 192.168.0.33

Yeah, I enabled packet forwarding on the hosts and then added a redundant static route on the new node to the Ceph network. Best way to do things? No. Does it work? Yes. With proxmox5 able to ping the Ceph network addresses of the other nodes, everything looked good.

The real test

Then it came time for the real test: Could I live-migrate a VM to the new node using Ceph storage? Yes. There were no problems at all, the VM continued running as normal.

A minor annoyance

It’s worth noting that the new node (proxmox5) is a laptop. I discovered that when the lid was closed, it was suspending. That makes it pretty terrible as a VM host. Fortunately, Proxmox is Debian, so disabling that behavior should be relatively easy. According to https://wiki.debian.org/Suspend, I should be able to mask a bunch of systemd services related to suspend and problem solved.

proxmox5# systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

And then I could close the lid, the system didn’t sleep, and I started moving VMs over. However, in looking at the system metrics, proxmox5 was using much more CPU than I expected (over 50%) with four mostly-idle VMs. top showed that the problem was systemd-logind. Oh, what? More idiotic behavior out of systemd? I would never have expected this. So I found this bug report over on Debian.org that described exactly this behavior with the guy from Debian saying it’s systemd working as expected. Color me surprised again. Apparently the issue is that with the suspend services masked per the Debian wiki, systemd enlists a bunch of other services into the perfect storm of horrible software and starts consuming CPU as it tries to write messages constantly about the lid being closed and it not being able to suspend because the service is masked.

Fortunately, it’s not too difficult to get around that by telling logind to ignore the lid in /etc/systemd/logind.conf:

[Login]
HandleLidSwitch=ignore
HandleLidSwitchDocked=ignore

and then a systemctl restart systemd-logind.service and magically the CPU usage goes back to normal.

Discovery of a networking issue

So after my compute node was up and running as normally as possible, I started wondering about the networking setup on my old VMs. Several of them were attached to a separate vSwitch in ESXi and ran on their own private network. How would that function in Proxmox when the gateway VM and the client VM were on separate hosts? Answer: Not well. Initially, I tried setting up a second bridge on each host and hoped that traffic would magically find its way between those bridges on the hosts, but no such luck. The easiest option, as far as I could tell, was to set up VLANs and use those.

I had hoped for an option that could be done solely on the Proxmox nodes and wouldn’t involve switch configuration, but no such luck. I started by enabling trunking on the Cisco switch for the ports with Proxmox nodes.

switchport trunk encapsulation dot1q
switchport mode trunk
switchport trunk native vlan <vlan>
switchport trunk allowed vlan <vlans>

Then I enabled bridge_vlan_aware in the network configuration for the bridge in Proxmox. (This is most easily done by double-clicking the bridge on the Network tab and checking the “VLAN aware” checkbox.) That was literally it. Once the node was rebooted to let the networking changes take effect, I could assign VMs to VLANs and everything just worked.
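
For reference, a minimal sketch of what the VLAN-aware bridge stanza in /etc/network/interfaces looks like once that checkbox is set (the addresses and physical interface name are placeholders, not copied from my actual config):

auto vmbr0
iface vmbr0 inet static
        address 192.168.0.30
        netmask 255.255.255.0
        gateway 192.168.0.3
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        bridge_vlan_aware yes

A VM then lands on a specific VLAN by adding a tag to its network device, e.g., net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,tag=20 in its config file (the MAC and VLAN ID here are made up for the example).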

Wherein breaking Ceph breaks everything

Why bother?

Well, to be honest, I wouldn’t have bothered, except this happened accidentally. You see, in the process of moving VMs to my Proxmox cluster, I filled up my Ceph storage. That’s bad. Perfectly reasonable mistake too, I think. Proxmox’s web interface doesn’t show what you’d expect for Ceph storage. Instead of seeing the pool used/pool total, you get a gauge with the physical storage used/physical storage total. Depending on your Ceph settings, the physical storage and the pool storage could be vastly different numbers–as they are in my case.

[Image: Cluster resource view in Proxmox]

I look at that image and I think, “Cool, I’ve got 3.5 TB of free space. Time to move some VMs.” Wrong wrong wrong wrong wrong. Check this out:

proxmox# ceph df
...
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS
    rbd      0      431G     40.75          627G      112663

Oh, what? I thought I had 3.5 TB free. Nope, it’s actually 627 GB.

So here’s what happened: Late Saturday night, I wanted to move the last VM off of one of the VMware servers. It’s the biggest one because it’s also the most important–my Windows development desktop with remote access configured to the rest of my network. It’s called “controlcenter” for a reason. Anyway, it has a large virtual disk to accommodate downloads, builds, and whatever else I’m doing. In fact, this virtual disk is so large that I can’t qemu-img convert it on the Proxmox host, I’m going to have to use my NFS storage to host it and then migrate it to Ceph.

No problem, I figure. After all, everything has gone swimmingly up until this point. I transfer the VMDK to the NFS server, convert the VMDK to raw, and then have Proxmox migrate the disk from NFS to Ceph. It’s a large disk, so it’s going to take a few hours. I go to sleep.

I wake up to IT hell. Everything is offline. I try to pull up the Proxmox web interface–no dice. Hold on a second, my laptop doesn’t have a network address. That’s odd. “Oh yeah,” I remember, “I just migrated the DHCP/DNS server VM over to Proxmox.” …And this is a serious problem. You see, I’ve been pretty gung-ho about putting all of my systems on the DHCP server and having reservations for hosts that need “static” addresses. No manual configuration on the servers themselves, just a nice, clean config file…until that config file isn’t available. My Proxmox servers no longer have IP addresses on my network which means that I can’t get on Proxmox to see what’s going on with the DHCP/DNS VM. Brilliant setup, me.

I still have no idea why everything broke in the first place, but my first task is to get things back online and usable. Fortunately, I have a separate VLAN and subnet on my switch for the IPMI for the Proxmox cluster hosts. Hop on the dedicated IPMI laptop, IPMIView over, and I see that Proxmox doesn’t have a network address. OK, first things first, convert the Proxmox hosts back to static addresses. I modify their respective /etc/network/interfaces to give them the addresses that they were getting from DHCP reservations and attempt to restart networking with systemctl restart networking. That doesn’t work; they still don’t have addresses. OK, I’ll do it manually then: ifdown eth0 && ifup eth0. Now it’s complaining about not being able to tell the DHCP server it’s cancelling its lease. Whatever, the static IP is assigned, things should be OK.

I give my laptop a static address on the main network and access the Proxmox web interface. Next step is to turn on the DHCP/DNS VM and get everything running. Weird, it’s still running. Open the console, everything looks fine. I’m able to log in and everything. This VM has a static address on the network (who ever heard of a DHCP server giving an address to itself), so the networking config looks normal. I go to check the syslog for DHCP entries to see why it’s not handing out addresses.

# less /var/log/<tab>

And then nothing. Literally just sits there. I can’t CTRL+C it. The VM has become unresponsive because I wanted to tab complete /var/log. That’s new and different. It finally dawns on me that this is not the average DHCP server outage.

Spoiler: It’s Ceph’s fault

If you can’t tab complete or list files, there’s a very good chance that your storage has fallen out. I’ve seen similar enough behavior with NFS-backed VMs on VMware that I had an idea of where to look. The obvious first step is to check the Ceph cluster health in Proxmox. It’s extremely unhappy. The health is in error state. The pool is full.

What?! I was only moving 1 TB and I had like 3.5 TB free, right? Yeah, not so much. That’s when I got a quick lesson in physical storage vs. pool storage. First order of business, deleting a few old VMs to free up some space. I use Proxmox’s web interface to remove the remnants of my ELK stack. That’s so two years ago anyway. After a brief period in which Ceph reorganizes some things, the cluster health is in a warning state for low space, but things are not outright broken. VMs (partially) come back to life in that I can start using their filesystems again.

Still, I’m impressed. The VMs just seemed paused, but were perfectly happy to resume–or so I thought.

Proxmox did not like my on-the-fly modifications to the network interface. I was getting what I could only describe as half-assed networking. Some hosts could ping some hosts on the same subnet, but not others. I resigned myself to rebooting the host nodes and letting the networking sort itself out.

One of the great things about the Proxmox cluster has been the ability to live-migrate VMs between nodes since they’re on shared storage. This means that the VMs can start and stop on a schedule independent of the hosts. Whenever Proxmox needs an update, I go through and migrate the VMs off of one node, update it, reboot it, rotate VMs off of another node, rinse and repeat. This process has been seamless for months now…until this incident.

Proxmox’s half-assed networking meant that despite a quorate cluster, with the node I was using seeing all four nodes as in the cluster and everything happy, everything was not happy. Several of the VMs outright disappeared during the live migration with Proxmox reporting that it could no longer find their configurations. Uh oh, hope I have backups. (Of course I do.)

Logging into the nodes manually shows that /etc/pve is in an inconsistent state. It’s supposed to be synchronized between the nodes, but there has clearly been some divergence. At this point, I give up on keeping VMs up throughout the process–I just don’t want to lose anything. Time to reboot the nodes. But wait, Proxmox literally won’t reboot with this failed live migration VM.

proxmox1# echo 1 > /proc/sys/kernel/sysrq 
proxmox1# echo b > /proc/sysrq-trigger

This is a reboot by force. It’s bad for everything. Note that this same functionality can be accessed through magic sysrq keys if your sysctl is set to allow it (or you echo 1 to /proc/sys/kernel/sysrq). I could’ve done the same thing through the special laptop with IPMI access, but since I was already logged into the node, this seemed faster.

After a reboot of the nodes, the networking issues are resolved. Proxmox is happier, I’ve got my backups ready in case the VM configs were somehow lost, and things are back on track. Proxmox actually (mostly) resolved its state issues and I didn’t have to restore anything from backup. The only problem remaining was that the VM that had failed live migration was locked and wouldn’t start.

# Unlock the VM with ID 100
proxmox1# qm unlock 100

And problem solved. Finally, everything is back to normal.

Lessons learned

  1. Don’t fill up your Ceph pool. Just don’t. It’s bad.
    1. This really underscores my need to get some monitoring in place. I’m sure there are some easy ways to monitor ceph df to keep track of pool size remaining (see the sketch after this list).
  2. Highly-available DHCP infrastructure
    1. Especially if, like me, you rely on DHCP for almost every host on your network. And highly-available means on different infrastructure so that something like a storage failure *cough* Ceph *cough* doesn’t bring down multiple DHCP servers.
  3. Backups. Have them.
    1. Storage is cheap. Your time is expensive. Save yourself the headache of rebuilding everything if the worst does happen.
  4. Ceph is pretty resilient.
    1. If the DHCP server hadn’t destroyed networking for everything, I think once I freed up the space in the Ceph cluster, everything would’ve resumed without issue. Generally speaking, I’ve been pretty happy with the Ceph backend for Proxmox and plan to add another Ceph cluster to my network at some point.
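
Since I mentioned monitoring pool usage above, here’s a rough sketch of the sort of check I have in mind. I haven’t deployed this exact script; it assumes jq is installed, the Ceph admin keyring is available on the host, and that the JSON field names (bytes_used, max_avail) match your Ceph release:

#!/bin/bash
# Warn when the rbd pool crosses a usage threshold; run it from cron and
# let cron mail the output.
THRESHOLD=80
read -r USED AVAIL < <(ceph df --format json | \
    jq -r '.pools[] | select(.name == "rbd") | "\(.stats.bytes_used) \(.stats.max_avail)"')
PCT=$(( 100 * USED / (USED + AVAIL) ))
if [ "$PCT" -ge "$THRESHOLD" ]; then
    echo "WARNING: rbd pool is ${PCT}% full"
fi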

Moving from VMware to Proxmox

Why bother?

I’ve always been a fan of VMware and how everything pretty much just works. However, one of the major downsides has been the lack of visibility into hardware monitoring. ESX used to be built on a RedHat-derived service console, so you could at least use tools built for Linux, but now it’s its own custom thing with its own set of command-line tools and it’s just generally unpleasant to use. My primary virtualization cluster has been Proxmox with a Ceph backend for a while now and I’m loving how easy it is to migrate VMs, how fast the storage backend is, how much I don’t have to use iSCSI, and how I can upgrade the underlying hosts without any downtime for the guests… But the one thing that has been holding me back is the networking.

VMware’s networking is extremely simple: I had multiple vSwitches with virtual machines assigned to specific switches to break up the network, etc. I know that Proxmox supports Open vSwitch and you can get similarly crazy with it, but it involves learning another configuration DSL as opposed to VMware’s networking, which is GUI-based and very easy point-and-click. That said, there are three VMware servers that I’ll be able to free up for other purposes by moving the VMs over to the Proxmox cluster, so…it’s time.

Migrating a Linux VM from VMware to Proxmox

By all accounts, it’s much easier to move Linux guests from VMware to Proxmox, so I figured that’s where I’d start. Most of the VMs that are still hosted on VMware are crucial for one reason or another, so downtime is an inconvenience. However, since I’m at home, my VPN virtual machine can probably serve as the test case.

# Shut down the VPN server.
vpn# poweroff

# Copy the VMware disk images to the Proxmox cluster.
vmware1# scp /vmfs/volumes/uuid/VPN/VPN.vmdk root@proxmox1:/tmp/
vmware1# scp /vmfs/volumes/uuid/VPN/VPN-flat.vmdk root@proxmox1:/tmp/

# Convert the disk image into RAW format so that it can be used in Proxmox.
proxmox1# qemu-img convert -f vmdk /tmp/VPN.vmdk -O raw /tmp/VPN.raw

Note that there are two VMDK files for the VMware guest. VPN-flat.vmdk is the disk image and VPN.vmdk is a disk configuration file/pointer to the flat image. If you attempt to qemu-img convert just the image, it’ll throw the following error: qemu-img: Could not open 'VPN.vmdk': invalid VMDK image descriptor

With the converted disk image, it’s time to recreate the virtual machine. I used the Proxmox web interface to create a VM with identical specs (1 core, 256 MB RAM in this case). I detached and removed the existing storage. The VM ID that Proxmox assigned–which we’ll need to put the new disk image in place–was 110.

# Move the RAW disk image into place.
proxmox1# mv /tmp/VPN.raw /var/lib/vz/images/110/vm-110-disk-1.raw

# Edit the VM config for the new VM to add the converted disk.
proxmox1# vi /etc/pve/qemu-server/110.conf

...
unused0: local:110/vm-110-disk-1.raw

Back in Proxmox, I refreshed the hardware list and saw that the disk was added. Double-clicked on the disk to add it as an in-use disk (added as SCSI), and booted the system. It came up and appeared to be fine. A few minor issues:

  1. The network interface name changed, so my /etc/network/interfaces file was wrong.
  2. The MAC address for the network interface changed, so my DHCP server didn’t recognize it for a lease.

Both of those were easily remedied. I tested the VPN connection from the outside world and everything was good.
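
For the first issue, the fix was just renaming the interface in /etc/network/interfaces on the guest. The new name depends on what the virtual hardware enumerates as (check ip link); assuming it comes up as ens18, the stanza ends up as simple as:

auto ens18
iface ens18 inet dhcp

The second issue was just a matter of updating the MAC address in the DHCP reservation.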

Migrating a Windows VM from VMware to Proxmox

The process for migrating a Windows VM wasn’t much different, except that the VM had to be prepared for the move in advance. Specifically, there’s a registry change that needs to be made to allow booting from an IDE device; otherwise you’ll end up with a blue screen at boot with the error “Stop 0x0000007B”. Proxmox provides a file (mergeide.zip) containing the registry changes that you can import in advance of moving the VM.

With the IDE registry change in place, I shut down the Windows VM, moved the disk over to the Proxmox server, converted it from VMDK, and added the disk to the VMID.conf file as an unused disk. (Basically the same steps as for a Linux VM.) Back in the Proxmox web interface, double-clicked the unused disk and added it to the system. Note that if you don’t add it as an IDE disk, you will continue to have problems. You can also install virtio drivers from RedHat that will let you use paravirtualized devices on Windows (and boot from a VirtIO device instead of an IDE device). After that, the Windows VM booted up as normal. Conversion complete.

Putting an nginx proxy behind Cloudflare

Why bother?

Since this is my home lab and it’s running on my home connection, I definitely prefer to cut down on the number of people able to poke at things. The first layer of defense is obviously a firewall (with a whitelist!) to only allow access to select services, i.e., the VPN and emergency SSH, but what about services that are intended for the public like the nginx server? That’s where a reverse proxy comes in. There are many reasons that you’d want to keep your site behind a reverse proxy: Internet scumbags, whitehats who scan the internet and then sell information on your open ports and services, DDoS protection, etc. In this case, it’s going to add a layer of obfuscation to my origin address.

How Cloudflare Works…and mediocre ASCII art diagrams

Cloudflare provides a reverse proxy–and various other security features–much like the nginx proxy that we’ve already set up. The difference is that their network can handle DDoS and do helpful things like serve HTTP sites over HTTPS. You point your DNS to their servers and they transparently proxy traffic to you.

Normally:
[ Alice ] <-> [ Your web server with public IP address ]

  1. Alice sends a DNS request for geek.cm.
  2. DNS resolves geek.cm to 1.2.3.4.
  3. Alice requests http://1.2.3.4:80 with Host: geek.cm
  4. Web server returns the content to Alice.

With Cloudflare (or similar reverse proxy service):
[ Alice ] <-> [ Cloudflare ] <-> [ Your web server ]

  1. Alice sends a DNS request for geek.cm.
  2. DNS resolves geek.cm to one of Cloudflare’s servers.
  3. Alice requests http://cloudflare_ip:80 with Host: geek.cm
  4. Cloudflare’s servers request http://1.2.3.4:80 with Host: geek.cm
  5. Web server returns the content to Cloudflare.
  6. Cloudflare returns the content to Alice.

The difference is that Alice sees a Cloudflare address instead of yours, thus hiding your origin address. There’s some other stuff Cloudflare can do like serve as a web application firewall, upgrade requests to HTTPS, and so on, but we’re focusing on the core functionality–protecting our home network from the internet.

Translating requestor IP addresses

If you’re familiar with running a web server, you’re probably asking yourself, “But if Cloudflare is requesting all of the pages, then aren’t my logs full of Cloudflare’s IP address? What about my analytics?” or “How do I know who’s sending all of these LFI/RFI/SQLi requests?” Fortunately, Cloudflare documents this process[1] and it’s basically a cut-and-paste job.[2] I’ve removed the IPv6 addresses because I don’t allow IPv6 requests past my firewall.

server {
    ...
    set_real_ip_from 103.21.244.0/22;
    set_real_ip_from 103.22.200.0/22;
    set_real_ip_from 103.31.4.0/22;
    set_real_ip_from 104.16.0.0/12;
    set_real_ip_from 108.162.192.0/18;
    set_real_ip_from 131.0.72.0/22;
    set_real_ip_from 141.101.64.0/18;
    set_real_ip_from 162.158.0.0/15;
    set_real_ip_from 172.64.0.0/13;
    set_real_ip_from 173.245.48.0/20;
    set_real_ip_from 188.114.96.0/20;
    set_real_ip_from 190.93.240.0/20;
    set_real_ip_from 197.234.240.0/22;
    set_real_ip_from 198.41.128.0/17;

    real_ip_header CF-Connecting-IP;
    ...
}

The set_real_ip_from lines indicate servers that we trust to send the real client IP address. The real_ip_header line tells nginx to read the CF-Connecting-IP header on requests coming from those trusted addresses and set the client address to the value contained in that header. Now our nginx logs show the real IP address of requests instead of Cloudflare’s servers.

Locking down nginx for Cloudflare

When you’re configuring a web service for security behind some sort of proxy (e.g., Cloudflare), you should always restrict the incoming connections at the firewall. There are countless sites that put up Cloudflare and expect that no one will be able to find their origin address. It’s certainly not easy to track down a misconfigured site behind Cloudflare, but it can be done, especially if the attacker is only looking for one or two domains. A simple brute force of the IPv4 space making requests with the appropriate Host header to each IP address will eventually reveal the origin address. Thus, it’s important to have a whitelist in place that only allows traffic from Cloudflare or other trusted hosts.

# firewall-cmd --permanent --new-ipset=cf --type=hash:net
# for cidr in $(curl https://www.cloudflare.com/ips-v4); do \
>     firewall-cmd --permanent --ipset=cf --add-entry=$cidr; \
> done

# firewall-cmd --permanent --add-rich-rule='rule source ipset=cf port port=80 protocol=tcp accept'
# firewall-cmd --permanent --add-rich-rule='rule source ipset=cf port port=443 protocol=tcp accept'
# firewall-cmd --permanent --add-rich-rule='rule source ipset=cf invert="True" port port=80 protocol=tcp drop'
# firewall-cmd --permanent --add-rich-rule='rule source ipset=cf invert="True" port port=443 protocol=tcp drop'
# firewall-cmd --reload

Although it’s rare, Cloudflare’s IP addresses can change, so having a daily cron job like the following may be useful:

#!/bin/bash
firewall-cmd --permanent --delete-ipset=cf
firewall-cmd --permanent --new-ipset=cf --type=hash:net
for cidr in $(curl https://www.cloudflare.com/ips-v4); do \
    firewall-cmd --permanent --ipset=cf --add-entry=$cidr; \
done
firewall-cmd --reload

With these rules in place, we don’t have to worry about ending up on Shodan or Censys since any traffic that doesn’t originate from Cloudflare’s reverse proxies will be dropped. The cron job ensures that if Cloudflare adds more reverse proxies or changes their IP ranges, we aren’t denying that traffic.

Configuring Cloudflare

Since we’re using Cloudflare, arguably we don’t even need a LetsEncrypt cert because Cloudflare can proxy HTTPS to an HTTP backend and they’ll issue a SAN cert for your domain. This is OK for testing, but not really acceptable for anything that requires any security: even though the end user’s connection to Cloudflare is encrypted, Cloudflare’s connection to your origin is still HTTP and that means plaintext. Ideally, you want the traffic encrypted on both legs–the end user to Cloudflare and Cloudflare to you. To do this, you can enable the Full SSL option, which proxies HTTPS to HTTPS. Cloudflare doesn’t validate the origin certificate in Full mode, so a self-signed cert works, your visitors see “the green lock”, and you get end-to-end encrypted traffic. However, the best option is Full (Strict) SSL mode, where Cloudflare requires a valid certificate on your origin.

Why does it matter if the cert is valid if everything’s still encrypted? I don’t know. I can’t think of a threat model where an attacker is stopped by Full vs. Full (Strict). However, testing and internal access work a lot more smoothly if you need to go around Cloudflare and not have your browser complain.

Update (2018-01-08): After talking to a friend at Cloudflare, there is a scenario where Full (Strict) could be valuable: If you already have a valid certificate for your domain and you enable Cloudflare’s Always use HTTPS option. If you allow HTTP, then someone MITMing the connection between Cloudflare and your server could request a valid certificate for your domain and successfully sit behind Cloudflare’s Full SSL mode. However, with Always use HTTPS and Full (Strict), Cloudflare will require a valid cert from the origin which presumably the MITM doesn’t have, so they can’t receive unencrypted requests, can’t request a certificate, and can’t MITM the traffic.


Footnotes

[1] https://support.cloudflare.com/hc/en-us/articles/200170706-How-do-I-restore-original-visitor-IP-with-Nginx-

[2] Note that these are the ranges from https://www.cloudflare.com/ips-v4

nginx proxy with LetsEncrypt

Why bother?

As a home lab nerd, I frequently find myself putting up various web-based services and wanting to access them remotely. Naturally, I can’t be sending passwords cleartext over the interwebs and while I do have a sweet VPN setup, I’m not always using a machine with my VPN keys handy. The obvious solution is HTTPS, but what to do about certs? Self-signed certs are great, except that you’re either always accepting the invalid cert warning or having to import your custom CA everywhere. And that’s where LetsEncrypt comes in. With the noble goal of encrypting all the things for free, LetsEncrypt–in theory–makes getting valid SSL certs for all of your sites extremely easy. In practice, I’ve only rarely had it work reliably for me, so I’m not anxious to use it as a solution.

Introduction and sick ASCII art diagram

That’s all changing today though. I finally figured out how I want to host things: an nginx ingress controller that manages LetsEncrypt certs and proxies requests through to my various backends. That way, I only have to get LetsEncrypt working consistently once and all of the certs are managed in a central location instead of strewn about my network like everything else.

       [ Internet ]
            |
       [ Router 10.0.0.1 ]
            |
       [ nginx / LetsEncrypt 10.0.0.10 ]
           /                        \
[ pastebin.geek.cm 10.0.0.14 ]    [ geek.cm 10.0.0.15 ]

nginx installation and configuration

So let’s start with the nginx setup. Personally, I like CentOS for servers. The transactional yum database (and ability to roll back) has saved me numerous times. The downside is that the packages get stale quickly and sometimes I like to feel a little bit more bleeding-edge. 3.10 kernel? Seems…old-fashioned. Anyway, nitpicking aside, CentOS will do fine for a quick and reproducible nginx/LetsEncrypt server.

First off, let’s get nginx installed. Drop the upstream repo definition into /etc/yum.repos.d/nginx.repo:

[nginx]
name=nginx repo
baseurl=http://nginx.org/packages/centos/7/$basearch/
gpgcheck=0
enabled=1

No GPG check because I like to live dangerously. J/K, I actually like to live with an abundance of caution and prudence and that means no unsigned packages on my network.

# wget http://nginx.org/keys/nginx_signing.key
# rpm --import nginx_signing.key
# sed -i 's/gpgcheck=0/gpgcheck=1/' /etc/yum.repos.d/nginx.repo
# yum -y install nginx
# systemctl enable nginx && systemctl start nginx

Now we can verify that nginx is up and running by visiting http://10.0.0.10 (obviously substitute your own local address) from another system on the network. You should get nginx’s default page. If you didn’t, there’s a good chance that you haven’t opened up the firewall ports. You are using a firewall, right?

# firewall-cmd --permanent --add-service=http
# firewall-cmd --permanent --add-service=https
# firewall-cmd --reload

Generate a self-signed certificate

To ensure that we’ve got a working SSL configuration without dealing with LetsEncrypt, we’ll start with a self-signed certificate.

# openssl req -x509 -newkey rsa:4096 -nodes -sha256 -keyout geek.cm.key -days 365 -out geek.cm.crt

I like the /etc/pki directories for certs, it makes sense to me, so we’re going to put them there:

# mv geek.cm.key /etc/pki/tls/private/
# mv geek.cm.crt /etc/pki/tls/certs/

And then modify the nginx configuration to add an SSL block:

server {
    listen 80;
    server_name geek.cm www.geek.cm;

    # Redirect all HTTP traffic to HTTPS.
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name geek.cm www.geek.cm;

    ssl_certificate /etc/pki/tls/certs/geek.cm.crt;
    ssl_certificate_key /etc/pki/tls/private/geek.cm.key;

    location / {
        root /usr/share/nginx/html/;
        allow all;
    }
}

Increasing the SSL security

This is based on Mozilla’s recommendations[1] for modern compatibility. If you’re going to have a wide variety of clients accessing your sites, then you may need to drop down to intermediate compatibility or legacy compatibility settings.

server {
    ...

    ssl_session_timeout 1d;
    ssl_session_cache shared:SSL:50m;
    ssl_session_tickets off;

    ssl_protocols TLSv1.2;
    ssl_ciphers 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256';
    ssl_prefer_server_ciphers on;

    add_header Strict-Transport-Security max-age=15768000;

    ...
}

Test the nginx configuration and reload it.

# nginx -t
# systemctl reload nginx

Generate certificates with certbot

The next step is to install certbot which is a pretty nice wrapper around LetsEncrypt functionality. To install certbot on CentOS, we need the EPEL (Extra Packages for Enterprise Linux) repo.

# yum -y install epel-release
# yum -y install certbot

Then we’ll go ahead and generate the cert using the http-01 challenge via the webroot plugin. Generally speaking, you can use the built-in web server plugins for certbot if you like (e.g., certbot nginx|apache), but I prefer to just get the certs and do the web server configuration myself. The other benefit to the webroot plugin is that it functions well behind Cloudflare without requiring additional hooks. You can use a hook to do the DNS challenge with Cloudflare[2], but I’ve found it less reliable than webroot. Cloudflare configuration will be discussed in another post.

# certbot certonly --webroot --webroot-path /usr/share/nginx/html/ -d geek.cm -d www.geek.cm

The webroot plugin adds a file to the /.well-known/ directory in the webroot that LetsEncrypt will then look for to validate that you actually own the domain for which you’re requesting certs. This will require that you’ve already got a DNS entry for those domains pointing to your server. You’ll answer some questions, accept the TOS, and like magic, you’ll have a free SSL cert to use for your server.

If everything went well, you’ve got a private key in /etc/letsencrypt/live/geek.cm/privkey.pem and a cert in /etc/letsencrypt/live/geek.cm/fullchain.pem. (Although obviously not geek.cm since that’s my domain, not yours.) You can optionally symlink the certs to /etc/pki or just modify the nginx configuration to point to the letsencrypt directory.

# rm /etc/pki/tls/certs/geek.cm.crt
# rm /etc/pki/tls/private/geek.cm.key
# ln -s /etc/letsencrypt/live/geek.cm/privkey.pem /etc/pki/tls/private/geek.cm.key
# ln -s /etc/letsencrypt/live/geek.cm/fullchain.pem /etc/pki/tls/certs/geek.cm.crt

Or point the nginx configuration at the letsencrypt paths directly:

server {
    ...
    ssl_certificate /etc/letsencrypt/live/geek.cm/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/geek.cm/privkey.pem;
    ...
}

If you’re doing this for a home lab like me, you will likely need to add a hosts entry since your domain will resolve to a public address and the routing won’t work as expected. (You’ll request https://public_ip, it’ll go to your modem, and your modem will be like, “Oh, you want to access the management interface? Here you go.”) One of these days, I’ll figure out some appropriate iptables rules so that things work on the LAN without split DNS or hosts entries.

You’ll want to test renewal of the certificate with certbot renew --dry-run. If everything is successful, then it’s time to set up the automated renewal of certificates with a cron job.

#!/bin/bash
/usr/bin/certbot renew --quiet --post-hook "systemctl reload nginx"
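
Dropped into cron (the path and schedule here are just an example, not something certbot installs for you), that might look like:

# /etc/cron.d/certbot-renew
17 3,15 * * * root /usr/local/sbin/certbot-renew.sh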

Configuring nginx to proxy to internal hosts

Presumably you’ll want to serve more than the nginx default page. In my case, I’ve got WordPress running on Apache over on 10.0.0.15. Since we aren’t going to co-mingle services by running Apache/PHP/MySQL on the same server as nginx, we’re going to need to proxy those requests.

upstream wordpress {
    server 10.0.0.15;
}

server {
    listen 80;
    ...
}

server {
    listen 443 ssl;
    ...

    location / {
        proxy_pass http://wordpress;
        proxy_buffering on;
        proxy_buffers 12 12k;
        proxy_redirect off;

        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header Host $host;
    }
}

It’s worth noting that Apache will be serving the site over HTTP. That is, nginx is receiving HTTPS requests on port 443 and making HTTP requests on port 80 to the upstream server. If you have concerns about your local network traffic being sniffed, then you’ll need to configure Apache for HTTPS and modify the proxy_pass line in the nginx configuration.
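
As a rough sketch of that change (assuming you’ve already configured Apache with its own certificate and it’s listening on 443), the nginx side would look something like this:

upstream wordpress {
    server 10.0.0.15:443;
}

server {
    ...
    location / {
        # nginx does not verify the upstream certificate by default;
        # see proxy_ssl_verify if you want that behavior.
        proxy_pass https://wordpress;
        ...
    }
}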

Side note: There was some additional configuration that I did on the Apache 2.4 side in order to log correct IP addresses from nginx. If you don’t configure these options, all requests to your upstream servers will appear to originate from the nginx proxy.

...

RemoteIPInternalProxy 10.0.0.10
RemoteIPHeader X-Real-IP

...
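
Those directives belong to Apache’s mod_remoteip, so the module has to be loaded, and the access log format needs to use %a (the client IP as adjusted by mod_remoteip) rather than %h. Roughly (module path and log format naming will vary by distro):

LoadModule remoteip_module modules/mod_remoteip.so
LogFormat "%a %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined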

And that’s it. Now we have a pretty decent setup that serves HTTPS on nginx and can proxy requests to our various internal hosts.

Ensuring that nginx doesn’t eat certbot requests

One issue that I’ve run into repeatedly when running certbot on the same host as web applications is that renewal doesn’t always function correctly. Things aren’t supposed to change, and if your certbot renew --dry-run succeeded during setup, then it’s supposed to work basically forever. In practice, that’s not always the case. In one instance, I had set up certbot, acquired the cert, tested the renewal, and then later moved DNS to Cloudflare. Three months later, I was receiving certificate expiration notices from LetsEncrypt. A look at the logs showed that domain validation was failing during the renewal. That was weird because I had definitely tested renewal during the setup process. Turned out that the validation method worked fine–as long as the site wasn’t behind Cloudflare. Since I generally prefer the privacy afforded by services like Cloudflare, it’s important to have LetsEncrypt working even when my sites aren’t directly accessible.

Tangentially-relevant story time: I also had one host running mod_python with Apache which had a very specific set of Python dependencies that were incompatible with the versions required by certbot. There were two quick options: run the application and get rid of certbot or find the application a different home. The longer option would’ve been figuring out a way to resolve the package dependency issues, e.g., with a virtual environment for one or the other. Anyway, I had already used certbot to acquire the cert for the domain and everything was up and running. Rather than deal with the finicky application, I opted to just remove certbot–completely forgetting that it’d need to be there to renew the cert.

Stories aside, we need a way to ensure that nginx doesn’t pass along the validation requests from certbot to whatever upstream we’ve configured. Since the webroot validation uses HTTP, it’s just a matter of adding a short stanza to our configs for each domain:

server {
    listen 80;
    ...

    location /.well-known {
        root /usr/share/nginx/html;
        allow all;
    }

    ...
}

Since we’ve already specified the webroot plugin and /usr/share/nginx/html as the webroot path during the initial certificate request, we just need to make sure that nginx knows that requests to the .well-known directory should stay local instead of being redirected to the proxy. This ensures that we pass the domain validation challenge during certbot’s renewal.


Footnotes

[1] Based on Mozilla’s recommendations using the modern compatibility settings since I’m usually the only person accessing things and my systems are generally up-to-date.

[2] https://github.com/kappataumu/letsencrypt-cloudflare-hook

HA network gateway with keepalived and conntrackd

Why bother?

I’ve always got long-running connections from my network out to the internet (mainly IRC) and historically, updating my router with the latest security fixes and/or kernels has been a real pain because it requires a reboot of the router. This meant everything got disconnected. Also, occasionally, I’d be playing with the router and something would break. Again, everything would get disconnected. I wanted a way to be able to update or work on my router without losing connectivity.

Introduction and sick ASCII network diagram

This is going to be a basic ACTIVE/BACKUP high availability (HA) setup for a network gateway. Both network gateways are running Ubuntu Server 16.04 LTS and have three NICs. The first NIC (eth0) is connected to the modem. The second NIC (eth1) is connected to the switch that feeds the LAN. The third NIC (eth2) is connected directly to the third NIC on the other router. The direct connection is for conntrackd. The conntrackd documentation suggests that its synchronization can be bandwidth intensive and recommends a dedicated interface. That isn’t strictly necessary, though; you can just as easily use the LAN interface for conntrackd. My routers have three NICs, so I went ahead and connected them directly.

         [___Modem 10.0.0.1/24____]
         |                        |
  --eth0--     WAN Virtual IP     --eth0--
  10.0.0.11       10.0.0.10       10.0.0.12
  |                                      |
[master]--10.10.1.1--eth2--10.10.1.2--[backup]
  |                                      |
 eth1          LAN Virtual IP           eth1
 192.168.0.1     192.168.0.3     192.168.0.2
  |                                      |
--------------------------------------------
|            LAN 192.168.0.0/24            |
--------------------------------------------

There are two components to this setup: conntrackd and keepalived. keepalived is responsible for creating the virtual IP addresses on the network and moving them between routers in the event of a failover. conntrackd keeps track of the state of network connections and synchronizes them between routers, which means that in the event of a failover, no connections are lost. conntrackd is not strictly necessary; keepalived can still move the IPs between routers in the event that one fails, but it comes at the cost of losing all of your existing connections. conntrackd wasn’t particularly difficult to set up or configure, so I highly recommend using it.

Setting up the firewalls

Since my routers are Ubuntu, we’re going to use iptables-persistent to manage the firewall. It’s worth noting that since iptables traverses the chains linearly, rules that will process the most traffic should come before rules that process less traffic. For example, since these are gateways and they’ll be managing the traffic for all of the systems behind them, the first rule in the forward chain should be accepting traffic for any connection that is already being tracked by the firewall.

Also, you’ll note that the policy for the filter chains is set to DROP. This means that the firewall on the routers will drop any packet that is not part of a connection that it knows about. The reason for this is the following scenario: You’ve got an active TCP stream going (e.g., a large download) and your master router dies. The other end of the connection sends some more data not knowing the master is down. keepalived starts the failover process and moves the virtual IP to the backup. BUT conntrackd hasn’t been able to commit the external cache on the backup yet. The backup doesn’t know about this stream and sees what looks like an errant TCP packet. The TCP/IP stack on the backup router sends an RST to tell the remote host that it doesn’t know WTF is going on and the connection dies. With a default DROP policy, that errant packet is silently ignored by the backup router, TCP retransmission kicks in, the backup router has time to update its internal cache, and the connection continues uninterrupted.

master# apt install iptables-persistent
backup# apt install iptables-persistent

Both routers get the same ruleset, saved to /etc/iptables/rules.v4:

*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT DROP [0:0]
:LOGGING - [0:0]
# Accept any traffic on the loopback interface.
-A INPUT -i lo -j ACCEPT
# Accept any traffic destined for this server that's part of an already-tracked connection.
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
# Accept any multicast VRRP traffic destined for 224.0.0.0/8 (this is how keepalived communicates).
-A INPUT -d 224.0.0.0/8 -p vrrp -j ACCEPT
# Accept any multicast traffic destined for 225.0.0.50 (this is how conntrackd communicates).
-A INPUT -d 225.0.0.50 -j ACCEPT
# Accept any traffic on the interface with the direct connection between routers.
-A INPUT -i eth2 -j ACCEPT
# Jump to the LOGGING chain.
-A INPUT -j LOGGING
# Accept any traffic for systems behind the NAT that's part of an already-tracked connection.
-A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
# Allow outbound connections from systems behind the NAT.
-A FORWARD -s 192.168.0.0/24 -i eth1 -o eth0 -m conntrack --ctstate NEW -j ACCEPT
# Allow any outbound traffic on the loopback interface.
-A OUTPUT -o lo -j ACCEPT
# Allow outbound multicast VRRP traffic destined for 224.0.0.0/8 (this is how keepalived communicates).
-A OUTPUT -d 224.0.0.0/8 -p vrrp -j ACCEPT
# Allow outbound multicast traffic destined for 225.0.0.50 (this is how conntrackd communicates).
-A OUTPUT -d 225.0.0.50 -j ACCEPT
# Allow outbound ICMP on the WAN interface.
-A OUTPUT -o eth0 -p icmp -j ACCEPT
# Allow outbound DNS on the WAN interface.
-A OUTPUT -o eth0 -p tcp -m tcp --dport 53 -j ACCEPT
-A OUTPUT -o eth0 -p udp -m udp --dport 53 -j ACCEPT
# Allow outbound NTP on the WAN interface.
-A OUTPUT -o eth0 -p udp -m udp --dport 123 -j ACCEPT
# Allow outbound HTTP/HTTPS on the WAN interface.
-A OUTPUT -o eth0 -p tcp -m multiport --dports 80,443 -j ACCEPT
# Allow any outbound traffic on the interface with the direct connection between routers.
-A OUTPUT -o eth2 -j ACCEPT
# Log (to syslog) any traffic that's going to be dropped with a limit of 2 entries per minute.
-A LOGGING -m limit --limit 2/min -j LOG --log-prefix "DROP: " --log-level 7
COMMIT

Load the iptables rules on each router so that everything is ready for the installation.

master# iptables-restore < /etc/iptables/rules.v4
backup# iptables-restore < /etc/iptables/rules.v4
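
One prerequisite that’s easy to forget when building routers from scratch: IP forwarding needs to be enabled in the kernel on both machines or nothing will be routed at all. Something like the following works (the file name is arbitrary):

master# echo 'net.ipv4.ip_forward=1' > /etc/sysctl.d/99-ip-forward.conf
master# sysctl -w net.ipv4.ip_forward=1
backup# echo 'net.ipv4.ip_forward=1' > /etc/sysctl.d/99-ip-forward.conf
backup# sysctl -w net.ipv4.ip_forward=1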

Installing and configuring conntrackd

master# apt install conntrackd
master# cp /etc/conntrackd/conntrackd.conf{,.original}

We’re going to use the basic FTFW[1] sync configuration provided with conntrackd. My understanding is that FTFW provides a reliable form of synchronization of connection states between conntrackd on each system.

master# gunzip /usr/share/doc/conntrackd/examples/sync/ftfw/conntrackd.conf.gz
master# cp /usr/share/doc/conntrackd/examples/sync/ftfw/conntrackd.conf /etc/conntrackd/

The primary-backup.sh script provided by conntrackd is what will be triggered by notifications from keepalived. It will force a synchronization of the connection states between routers when failing over or failing back.

master# cp /usr/share/doc/conntrackd/examples/sync/primary-backup.sh /etc/conntrackd/

backup# apt install conntrackd
backup# cp /etc/conntrackd/conntrackd.conf{,.original}
backup# gunzip /usr/share/doc/conntrackd/examples/sync/ftfw/conntrackd.conf.gz
backup# cp /usr/share/doc/conntrackd/examples/sync/ftfw/conntrackd.conf /etc/conntrackd/
backup# cp /usr/share/doc/conntrackd/examples/sync/primary-backup.sh /etc/conntrackd/
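
keepalived will be calling that script directly, so it’s worth double-checking that the copies in /etc/conntrackd/ are executable on both routers:

master# chmod +x /etc/conntrackd/primary-backup.sh
backup# chmod +x /etc/conntrackd/primary-backup.sh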

The basic conntrackd configuration is pretty much ready to go; there are just a few things that you need to update. First, set the appropriate interface for communication between conntrackd on the routers by updating IPv4_interface and Interface. Second, add the local addresses of the router’s interfaces to the ignore list, since it doesn’t really make sense to synchronize connections that are made directly to a specific router with the other router. We’ll do the same configuration on the backup, substituting in the appropriate values. Note that the files below are not complete; they are just the changes to the default FTFW conntrackd.conf that we extracted.

On the master:

Multicast {
  ...
  IPv4_interface 10.10.1.1
  ...
  Interface eth2
  ...
}

General {
  ...
  Filter From Userspace {
    ...
    Address Ignore {
      IPv4_address 127.0.0.1
      IPv4_address 10.0.0.11
      IPv4_address 192.168.0.1
      IPv4_address 10.10.1.1
    }
  }
}

On the backup:

Multicast {
  ...
  IPv4_interface 10.10.1.2
  ...
  Interface eth2
  ...
}

General {
  ...
  Filter From Userspace {
    ...
    Address Ignore {
      IPv4_address 127.0.0.1
      IPv4_address 10.0.0.12
      IPv4_address 192.168.0.2
      IPv4_address 10.10.1.2
    }
  }
}

master# systemctl restart conntrackd
backup# systemctl restart conntrackd

Verify that the connections are being tracked:

master# conntrackd -s
cache internal:
current active connections:              148
connections created:                     172    failed:            0
connections updated:                      40    failed:            0
connections destroyed:                    24    failed:            0

cache external:
current active connections:                1
connections created:                       1    failed:            0
connections updated:                       0    failed:            0
connections destroyed:                     0    failed:            0

traffic processed:
                   0 Bytes                         0 Pckts

multicast traffic (active device=eth2):
                5844 Bytes sent                  360 Bytes recv
                  90 Pckts sent                   18 Pckts recv
                   0 Error send                    0 Error recv

message tracking:
                   0 Malformed msgs                    0 Lost msgs

The internal cache is the set of connections being tracked by the master. The external cache is the set of connections being tracked by the backup. If you want to view the connections in either cache, you can use conntrackd -i for the internal cache or conntrackd -e for the external cache. If you’ve configured conntrackd to track UDP, the backup will show one active connection even when it isn’t acting as the master: conntrackd’s own multicast traffic.

Installing and configuring keepalived

We’ll be using keepalived to provide the virtual gateway IP.

Side note: This is where things got difficult in my setup. I’ve always run the modems in bridge mode so that my router was assigned the public IP address from the ISP, but I could not find any documentation on using keepalived in a situation where the WAN interface (and therefore the WAN virtual IP) is dynamically assigned.[2] In my virtualized test setup, both the master and backup got their WAN addresses via DHCP, but I had control of the range (and more than one IP address), so I could set the virtual IP to anything in that range and it wasn’t an issue. I worked around this by disabling the modem’s Bridge Mode, which put the routers on a private network with the modem as the gateway. Then I configured the WAN virtual IP address on the modem’s subnet and set the modem’s DMZ host to be the WAN virtual IP address. It’s not as clean as I’d like, but without some method of setting that WAN virtual IP via DHCP, it was the best solution I could come up with.

master# apt install keepalived
backup# apt install keepalived

Both routers need to share the same basic configuration, with a few minor differences on the backup:

  • The state needs to be BACKUP instead of MASTER.
  • The priority needs to be lower than the priority of the MASTER.

On the master:

vrrp_sync_group router-cluster {
    group {
        router-cluster-wan
        router-cluster-lan
    }
    notify_master "/etc/conntrackd/primary-backup.sh primary"
    notify_backup "/etc/conntrackd/primary-backup.sh backup"
    notify_fault "/etc/conntrackd/primary-backup.sh fault"
}

vrrp_instance router-cluster-wan {
    state MASTER
    interface eth0
    virtual_router_id 10
    priority 100
    virtual_ipaddress {
        10.0.0.10/24 brd 10.0.0.255 dev eth0
    }
}

vrrp_instance router-cluster-lan {
    state MASTER
    interface eth1
    virtual_router_id 11
    priority 100
    virtual_ipaddress {
        192.168.0.3/24 brd 192.168.0.255 dev eth1
    }
}

On the backup:

vrrp_sync_group router-cluster {
    group {
        router-cluster-wan
        router-cluster-lan
    }
    notify_master "/etc/conntrackd/primary-backup.sh primary"
    notify_backup "/etc/conntrackd/primary-backup.sh backup"
    notify_fault "/etc/conntrackd/primary-backup.sh fault"
}

vrrp_instance router-cluster-wan {
    state BACKUP
    interface eth0
    virtual_router_id 10
    priority 50
    virtual_ipaddress {
        10.0.0.10/24 brd 10.0.0.255 dev eth0
    }
}

vrrp_instance router-cluster-lan {
    state BACKUP
    interface eth1
    virtual_router_id 11
    priority 50
    virtual_ipaddress {
        192.168.0.3/24 brd 192.168.0.255 dev eth1
    }
}
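
With the configuration in place, restart keepalived on both routers and confirm that the virtual IPs show up on the master (e.g., with ip addr show eth0 and ip addr show eth1):

master# systemctl restart keepalived
backup# systemctl restart keepalived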

Finally, the routers are ready to go, but there’s still one major change that needs to be made on the network: the default gateway on all of the LAN hosts needs to point to the virtual IP address configured in keepalived. In this case, 192.168.0.3 is the virtual IP address that will move between routers as needed. I use (mostly) DHCP on my network, so I updated the configuration to change the DHCP router option from 192.168.0.1 to 192.168.0.3. Be sure to update any hosts that are configured with static addresses to use the new virtual IP address for the gateway; otherwise, you will not have the benefit of HA routing.
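
For example, if you happen to be running ISC dhcpd (just an assumption; use your DHCP server’s equivalent), the change is the routers option in the subnet declaration:

subnet 192.168.0.0 netmask 255.255.255.0 {
    ...
    option routers 192.168.0.3;
    ...
}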


Footnotes

[1] I can’t find documentation on this anywhere, but I think that FTFW stands for “Fault Tolerant FireWall”. It’s the protocol that conntrackd uses to reliably transfer state.

[2] I saw that the OpenWRT wiki hosted a high availability recipe that notes, “DHCP dynamic WAN IP is possible with keepalived, but requires extra scripting and is not going to be described here.”