This is the tale of how bugs get diagnosed in the world of FOSS software. Join us on a journey from “Man, why does this not work?” through “This makes no sense at all!” to final “All is good now.”
Being bugged
It all started on a sunny afternoon when our administrators replaced the old firewall with a brand new one. And because the new firewall requires a different client to establish the VPN connection, we also started updating the documentation on how to get the VPN working again. So far so good.
After we installed the required plugin for NetworkManager and created new the VPN configuration, we tried to establish a connection and were greeted by an error message, informing us, that something went wrong. But what?!?
Dragging the bug out into the open
Since the pop-up error message was not informative enough to be of much help, we opened our good-ol terminal emulator and ran
$ journalctl -u NetworkManager -e
Thankfully, developers of NetworkManager, NetworkManager-fortisslvpn and openfortivpn added enough diagnostic messages to the log that we were able to determine the main cause of the connection issue: the VPN client does not like our server certificate. The exact error message was:
Gateway certificate validation failed, and the certificate digest in not
in the local whitelist. If you trust it, rerun with:
--trusted-cert 3c8678a17367d9a5c...
or add this line to your config file:
trusted-cert = 3c8678a17367d9a5c...
Openfortivpn developers rock! They even included a fix/workaround for our problem in the error message.
At this point, we added the workaround into the documentation and published the docs, since getting people a working copy of an ugly procedure now allows them to resume working (perfection is a productivity killer in such situations).
Inspecting the bug
Temporary workaround that we placed in the documentation will work for a couple of months, but the certificate renewal will break the instructions, so we decided to find a more permanent solution to this problem. And so we started to dig deeper.
What could be wrong with the server certificate? Browsers all display the happy green lock on that URL, but just to be safe, we decided to check with OpenSSL’s client.
$ openssl s_client -connect <REDACTED>.xlab.si:443 < /dev/null
depth=0 C = SI, L = Ljubljana, O = XLAB d.o.o., CN = *.xlab.si
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 C = SI, L = Ljubljana, O = XLAB d.o.o., CN = *.xlab.si
verify error:num=21:unable to verify the first certificate
verify return:1
DONE
CONNECTED(00000003)
---
Certificate chain
0 s:/C=SI/L=Ljubljana/O=XLAB d.o.o./CN=*.xlab.si
i:/C=US/O=DigiCert Inc/CN=DigiCert SHA2 Secure Server CA
---
...
Well, it turns out that browsers sometimes do not tell the whole truth, since they happily tolerated the missing intermediary certificate. While we were updating the certificate that is served to the client, we started asking ourself: “Could this be it? Did we just fixed the issue for good?”
The bug behind the bug
To test the fix, we removed the trusted digest from the VPN connection settings and tried to establish the connection again. And it failed. With the same error message!?!
And this is how dreams get crushed. Some people would get discouraged by this, but not us. We were determined to get to the bottom of this issue. And since openfortivpn is FOSS, we checked-out the sources from GitHub and started taking things apart.
First thing that we did was to simplify our test by removing the NetworkManager from the testing procedure. What we ended up with was the next command that reproduced the output that we previously found int the log:
# openfortivpn -c k.cfg --no-dns --no-routes <REDACTED>.xlab.si:443
Next, we built the latest development release of the openfortivpn client (from the master branch):
$ ./autogen.sh && ./configure && make
Then we checked if the error still persisted … and it did. Being completely new to the codebase, we decided to run
$ git grep "Gateway certificate validation fail"
src/tunnel.c:577: log_debug("Gateway certificate validation ...
src/tunnel.c:603: log_error("Gateway certificate validation ...
Bingo! Two hits relatively close by indicated that src/tunnel.c
file
contained the functionality that was broken. And after we opened the file,
those two locations were in a function called ssl_verify_cert
. To get to
the exact point of the failure, we ran the connect command under gdb
:
# gdb --args \
./openfortivpn -c k.cfg --no-dns --no-routes <REDACTED>.xlab.si:443
...
(gdb) b ssl_verify_cert
Breakpoint 1 at 0x40abe1: file src/tunnel.c, line 545.
(gdb) r
...
Breakpoint 1, ssl_connect (tunnel=0x7fffffffbe80) at src/tunnel.c:754
754 if (ssl_verify_cert(tunnel))
(gdb) s
(gdb) n ## Repeat this 10 times
565 && strncasecmp(common_name, tunnel->config->gateway_host,
(gdb) n
577 log_debug("Gateway certificate validation failed.\n");
This way we found out that hostname of the VPN endpoint and common name in
certificate do not match, which was no surprise, since we are using wildcard
certificate for *.xlab.si
.
Now we needed to find a fix for this. Inspecting a ssl_verify_cert
function
once again, we noticed that line 555 contained a preprocessor
conditional that tested for macro HAVE_X509_CHECK_HOST
and that we were
using the else branch in our gdb
session.
The other branch of the conditional statement used X509_check_host
to
perform the validation and a few web searches later we were certain that using
this function would resolve our issue. But the build system was convinced that
we do not have this function even though our OpenSSL library was recent
enough.
Now, HAVE_SOME_FUNC
macros are usually defined by the autoconf
as a result
of a AC_CHECK_FUNCS([some_func])
macro call in configure.ac
. And indeed,
configure.ac
file for openfortivpn contained that macro on the line 42.
But why did this test fail to find the X509_check_host
function? To
determine the reason, we opened the config.log
file that has been produced
by the ./configure
call. Searching for the X509_check_host
string brought
us to the next section of the log file:
configure:5102: checking for X509_check_host
configure:5102: gcc -o conftest -g -O2 conftest.c -lutil -lpthread >&5
/tmp/cc46x0eH.o: In function `main':
conftest.c:184: undefined reference to `X509_check_host'
Test failed because there are some linker flags missing (-crypt
and -lssl
to be precise). When using autoconf
, these flags are usually added to the
compile commads by the AC_CHECK_LIB
macro. But the OpenSSL linker flags are
obtained from the pkg-config
command (PKG_CHECK_MODULES
macro in
configure.ac
file), which does not add them to the compile command.
The extermination of the bug
And this is how we got to the bottom of the problem. Interested to see how we fixed this? Join us in the second part of the journey (that is almost as long as the first part, despite the Tolkien’s stories that are concluded fairly quickly;).
Interested in seeing more such stories? Tell us about it on twitter or Reddit.