If you are responsible for managing a bunch of your organization’s AWS resources, you probably know how annoying it is to debug various deployment-related issues, only to learn that someone accidentally messed with your setup, yet again. You probably resolved it by navigating your command-line history like a maniac (because you don’t use Ansible and have to Ctrl+R your way through, as you also forgot to write down the setup instructions ;)). We’ve been there, too. And this is what we learned:
- You should use Ansible and deploy your environment with a playbook, and
- You should make the most of it by taking advantage of both Ansible’s built-in mechanisms and the modules that allow you to detect configuration drift and correct it.
Our dear sysadmins and DevOps engineers, this post is for you. Stay with us and read along!
Ansible and compliance? How?
According to the Cambridge Advanced Learner’s Dictionary, compliance is:
the fact of obeying a particular law or rule, or of acting according to an agreement.
To see what the above definition has to do with Ansible and the AWS Ansible Collection, let’s briefly descend into the world of Ansible playbooks and modules.
It’s all about state
Ideally, playbook tasks describe the desired final state of a managed system - for example, configuration of an AWS EC2 instance. An Ansible module takes that description and performs the necessary steps to ensure that state.
It’s important to emphasize that it’s easy to write playbook tasks describing the final state of a managed system when Ansible modules adopt declarative APIs. As a consequence, we can treat playbook tasks as “the rule”, “the policy”, or simply as something we need to be compliant with. Perhaps even more importantly, we can treat Ansible playbooks as simple diagnostic tools to help us detect when we are not compliant anymore, so that we can mitigate the situation.
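For illustration, here is roughly what such a declarative task could look like. Treat this as a sketch only: the module lives under the collection’s steampunk.aws namespace, but the parameter names below are assumptions made for illustration, so check the collection’s documentation for the exact interface.
- name: Ensure a t3.micro instance named steamy-server is running
  steampunk.aws.ec2_instance:    # module name assumed; parameters are illustrative
    name: steamy-server
    type: t3.micro
    ami: ami-085925f297f89fce1   # AMI ID reused from the diff output shown later
    state: running
The task says nothing about how to reach that state - whether the instance needs to be created, reconfigured, or left untouched is for the module to figure out.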
If you are familiar with Ansible, all of this probably already rings a bell. Was your first thought the value of the changed variable that you see every time you run your playbooks? Or perhaps even Ansible’s --diff and --check modes? Indeed, this post is about all of them!
Let’s walk through a practical example demonstrating how you can use the modules from the AWS Ansible Collection to detect and resolve configuration drift.
To follow along more easily, make sure you have completed the tutorial about the basic usage of the collection in Getting Started with the AWS Ansible Collection.
Make sure you have Steampunk AWS Ansible Collection version 0.8.3 installed.
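If you don’t have the collection yet, it is published on Ansible Galaxy, so installing it should look something like this (double-check the collection name and the version pinning syntax for your Ansible release):
$ ansible-galaxy collection install steampunk.aws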
Set up the desired state
Let’s start by creating a simple AWS environment that will support running an imaginary app. The environment will comprise:
- an EC2 instance named steamy-server;
- a network interface named steamy-eni to attach to the EC2 instance, which we secure with the VPC’s default security group.
We created a playbook called setup.yaml for spinning up this environment.
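To give you an idea of its shape, here is an abridged sketch of what setup.yaml could look like. The module and parameter names are assumptions made for illustration (inferred from the resource properties in the diff output below), not a copy of the real playbook - refer to the Getting Started tutorial for the exact task definitions.
---
- hosts: localhost
  gather_facts: false
  tasks:
    # Module and parameter names below are illustrative assumptions,
    # not necessarily the collection's exact interface.
    - name: Create the steamy-server instance for running a simple app
      steampunk.aws.ec2_instance:
        name: steamy-server
        type: t3.micro
        ami: ami-085925f297f89fce1
        monitoring: detailed
        on_instance_initiated_shutdown: stop
        tags:
          app: steamy
          env: staging

    # Two more tasks follow the same pattern: one retrieves the VPC's default
    # security group, and one creates the steamy-eni network interface,
    # secures it with that group, and attaches it to the instance.
With setup.yaml in place, we run: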
$ ansible-playbook setup.yaml --diff
Output:
PLAY [localhost] ************************************************************
TASK [Create the steamy-server instance for running a simple app] ***********
--- before
+++ after
@@ -1 +1,24 @@
-{}
+{
+ "ami": "ami-085925f297f89fce1",
+ "availability_zone": "use1-az1",
+ "id": "i-0f1300d2456233353",
+ "key_pair": "test-keypair",
+ "launched_at": "2020-06-12T07:35:19+00:00",
+ "monitoring": "detailed",
+ "network_interface": "eni-0cf22a025bd9f7c98",
+ "on_instance_initiated_shutdown": "stop",
+ "secondary_network_interfaces": [],
+ "security_groups": [
+ "sg-0415ac333af261fc1"
+ ],
+ "state": "running",
+ "subnet": "subnet-06a0f705bc79538ed",
+ "tags": {
+ "Name": "steamy-server",
+ "app": "steamy",
+ "env": "staging"
+ },
+ "tenancy": "default",
+ "type": "t3.micro",
+ "vpc": "vpc-032c4ec6c40cf17a3"
+}
changed: [localhost]
TASK [Retrieve the default security group for the VPC] **********************
ok: [localhost]
TASK [Create a dedicated network interface and attach it to steamy-server] **
--- before
+++ after
@@ -1 +1,21 @@
-{}
+{
+ "attachment": {
+ "device_index": 1,
+ "instance": "i-0f1300d2456233353",
+ "keep_on_termination": false
+ },
+ "description": null,
+ "id": "eni-0c7d09f8e5290b78d",
+ "ip": "172.31.6.213",
+ "mac_address": "02:5b:f9:a2:db:fd",
+ "public_ip": null,
+ "security_groups": [
+ "sg-0415ac333af261fc1"
+ ],
+ "source_dest_check": true,
+ "subnet": "subnet-06a0f705bc79538ed",
+ "tags": {
+ "Name": "steamy-eni"
+ },
+ "type": "normal"
+}
changed: [localhost]
PLAY RECAP ******************************************************************
localhost : ok=3 changed=2 unreachable=0 failed=0
skipped=0 rescued=0 ignored=0
We can see two changed tasks, corresponding to the two newly created AWS resources. Thanks to the --diff flag, which instructs the modules to report the differences they made, we also get a nice visual indication of what exactly the modules changed. The red-colored lines show the previous state of the AWS resource (before a specific task ran), while the green ones show the current state, i.e. after the task ran.
We can interpret the diff output above as “the instance and network interface didn’t exist before, so we created them, and they are now configured as shown”.
Typically, we can’t take support for diff mode for granted, as it is up to individual modules to implement it. This is why some tasks make changes but report no differences despite running ansible-playbook with --diff.
A word about idempotence
Can you guess what will happen if we run the same command again? Let’s try it out!
$ ansible-playbook setup.yaml --diff
Output:
PLAY [localhost] ************************************************************
TASK [Create the steamy-server instance for running a simple app] ***********
ok: [localhost]
TASK [Retrieve the default security group for the VPC] **********************
ok: [localhost]
TASK [Create a dedicated network interface and attach it to steamy-server] **
ok: [localhost]
PLAY RECAP ******************************************************************
localhost : ok=3 changed=0 unreachable=0 failed=0
skipped=0 rescued=0 ignored=0
This time around, we see that there was nothing to be done - the Ansible modules already did all the hard work when they were invoked the first time. So no changed tasks, no updates made, and no diff output shown. In other words, the current state of our managed AWS resources is compliant with the state described by the tasks in our setup.yaml playbook. This demonstrates another important aspect of Ansible’s declarative approach: playbook tasks should ideally be idempotent. If the state of the managed resource is aligned with the desired state described by the playbook tasks, running the playbook again will not affect that state.
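To see why idempotence matters, compare a declarative task with an imperative one, using two well-known built-in modules (a generic, non-AWS example):
- name: Declarative - reports changed only when the directory is missing
  ansible.builtin.file:
    path: /tmp/steamy-demo
    state: directory

- name: Imperative - reports changed on every single run
  ansible.builtin.command: mkdir -p /tmp/steamy-demo
The modules in the AWS Ansible Collection behave like the first task: they compare the current state of the resource with the desired one and only act (and report changed) when the two differ.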
But how can all of this be useful to me, you wonder? Glad you asked, time to spice things up a bit!
Deviate from the initial state
Now, let’s put ourselves in the shoes of an adversary who wants to tweak the state of our AWS resources to their taste. In practice, this could happen accidentally or, worse, with malicious intent. But we’ll be doing it solely for the purpose of demonstration. Disclaimer: we won’t actually be doing anything dangerous; we’ll only make some simple configuration changes that should indicate that the configuration has drifted if we take setup.yaml as our baseline for compliance.
To make the changes, we could open the AWS Management Console and manually modify some properties of the EC2 instance or the network interface that we created with setup.yaml earlier. But since we’re all about automation, let’s put the tweaks in an Ansible playbook called tweaked-setup.yaml.
Here’s a recap of what we’ll be tweaking with this playbook (a sketch of the playbook follows the list):
- We’ll downgrade our steamy-server’s CloudWatch monitoring to the basic level and modify its shutdown behavior;
- We’ll update the value of steamy-server’s env tag;
- We’ll create a new security group called dangerous-secgroup with some suspicious permissions;
- We’ll associate this dangerous-secgroup with the network interface attached to the steamy-server instance, and we’ll also disable source/destination checking for the network interface.
Sneaky, right? ;)
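In the same spirit as before, here is an abridged sketch of what tweaked-setup.yaml could contain. Again, module and parameter names are assumptions made for illustration:
---
- hosts: localhost
  gather_facts: false
  tasks:
    # Module and parameter names are illustrative assumptions.
    - name: Modify configuration of the steamy-server instance
      steampunk.aws.ec2_instance:
        name: steamy-server
        type: t3.micro
        ami: ami-085925f297f89fce1
        monitoring: basic                         # downgraded from detailed
        on_instance_initiated_shutdown: terminate
        tags:
          app: steamy
          env: dev                                # changed from staging

    # Further tasks (not shown) create the dangerous-secgroup security group
    # permitting SMB traffic from anywhere, add it to steamy-eni's security
    # groups, and disable source/destination checking on the interface.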
If we ran ansible-playbook tweaked-setup.yaml now, we would end up with three changed tasks. But since we’re learning about Ansible’s diff mode, let’s do:
$ ansible-playbook tweaked-setup.yaml --diff
and we end up with three changed tasks, plus the diff output consistent with the tweaks described above:
PLAY [localhost] ************************************************************
TASK [Retrieve the steamy-server instance] **********************************
ok: [localhost]
TASK [Modify configuration of the steamy-server instance] *******************
--- before
+++ after
@@ -4,9 +4,9 @@
"id": "i-0f1300d2456233353",
"key_pair": "test-keypair",
"launched_at": "2020-06-12T07:35:19+00:00",
- "monitoring": "detailed",
+ "monitoring": "basic",
"network_interface": "eni-0cf22a025bd9f7c98",
- "on_instance_initiated_shutdown": "stop",
+ "on_instance_initiated_shutdown": "terminate",
"secondary_network_interfaces": [
"eni-0c7d09f8e5290b78d"
],
@@ -18,7 +18,7 @@
"tags": {
"Name": "steamy-server",
"app": "steamy",
- "env": "staging"
+ "env": "dev"
},
"tenancy": "default",
"type": "t3.micro",
changed: [localhost]
TASK [Create a security group to permit SMB traffic from anywhere] **********
changed: [localhost]
TASK [Extend steamy-eni's security groups, and update source/dest checking] *
--- before
+++ after
@@ -10,9 +10,10 @@
"mac_address": "02:5b:f9:a2:db:fd",
"public_ip": null,
"security_groups": [
+ "sg-0447e7e7bc88aeab1",
"sg-0415ac333af261fc1"
],
- "source_dest_check": true,
+ "source_dest_check": false,
"subnet": "subnet-d8b640be",
"tags": {
"Name": "steamy-eni"
changed: [localhost]
PLAY RECAP ******************************************************************
localhost : ok=4 changed=3 unreachable=0 failed=0
skipped=0 rescued=0 ignored=0
Basically, tweaked-setup.yaml introduced an alternative desired state of the AWS resources - the adversary’s. And this state is not exactly in agreement with the one we described in setup.yaml (we already learned that if it were, the last invocation of ansible-playbook would report no changed tasks!).
Detect configuration drift
Let’s now imagine that one day, we get a call from our teammates who report unusual traffic to the app server instance we provisioned on AWS, using our setup.yaml.
Luckily, we know what to do; for starters, we’ll try re-running the setup.yaml playbook, but with one important addition: this time we’ll add the --check flag to the ansible-playbook command.
The --check flag instructs the modules behind playbook tasks to run in check mode (also called a dry run). In check mode, a module does not perform any work that would change the state of the resource it manages. Instead, the module merely checks whether a change would need to be made to ensure the desired state. This is evident from the module’s changed value, which you might already be familiar with.
$ ansible-playbook setup.yaml --check
Output:
PLAY [localhost] ************************************************************
TASK [Create the steamy-server instance for running a simple app] ***********
changed: [localhost]
TASK [Retrieve the default security group for the VPC] **********************
ok: [localhost]
TASK [Create a dedicated network interface and attach it to steamy-server] **
changed: [localhost]
PLAY RECAP ******************************************************************
localhost : ok=3 changed=2 unreachable=0 failed=0
skipped=0 rescued=0 ignored=0
We notice something peculiar in the above output: two changed tasks. This could very well be related to the problem our teammates reported. But at this point, it’s too early to draw any conclusions. We only know that something has changed on the instance and network interface with respect to how we want them configured.
However, our goal is not only to verify that the configuration has drifted, but to pinpoint exactly what deviated from the desired state. Luckily, we have another trick up our sleeve. You guessed it: it’s as simple as adding our beloved --diff flag to the previous ansible-playbook --check command:
$ ansible-playbook setup.yaml --check --diff
Output:
PLAY [localhost] ************************************************************
TASK [Create the steamy-server instance for running a simple app] ***********
--- before
+++ after
@@ -4,9 +4,9 @@
"id": "i-0f1300d2456233353",
"key_pair": "test-keypair",
"launched_at": "2020-06-12T07:35:19+00:00",
- "monitoring": "basic",
+ "monitoring": "detailed",
"network_interface": "eni-0cf22a025bd9f7c98",
- "on_instance_initiated_shutdown": "terminate",
+ "on_instance_initiated_shutdown": "stop",
"secondary_network_interfaces": [
"eni-0c7d09f8e5290b78d"
],
@@ -18,7 +18,7 @@
"tags": {
"Name": "steamy-server",
"app": "steamy",
- "env": "dev"
+ "env": "staging"
},
"tenancy": "default",
"type": "t3.micro",
changed: [localhost]
TASK [Retrieve the default security group for the VPC] **********************
ok: [localhost]
TASK [Create a dedicated network interface and attach it to steamy-server] **
--- before
+++ after
@@ -10,10 +10,9 @@
"mac_address": "02:5b:f9:a2:db:fd",
"public_ip": null,
"security_groups": [
- "sg-0447e7e7bc88aeab1",
"sg-0415ac333af261fc1"
],
- "source_dest_check": false,
+ "source_dest_check": true,
"subnet": "subnet-d8b640be",
"tags": {
"Name": "steamy-eni"
changed: [localhost]
PLAY RECAP ******************************************************************
localhost : ok=3 changed=2 unreachable=0 failed=0
skipped=0 rescued=0 ignored=0
If we compare the output of this command with the output of the ansible-playbook tweaked-setup.yaml --diff run from earlier, we can see that they’re exact opposites. So the last ansible-playbook command we ran detected what tweaked-setup.yaml modified in our desired AWS setup. And if we weren’t running in check mode, we could actually …
Enforce compliance!
Fear not - our setup.yaml describes exactly the desired configuration state of our AWS resources, so the modules from the AWS Ansible Collection know what to do. To repair our setup, i.e. to bring it back to the desired state, let’s run the ansible-playbook command once more, but this time without --check:
$ ansible-playbook setup.yaml --diff
And we can observe exactly the same output as with the previous command. This time however, the modules actually performed the changes shown in the diff output.
Afterwards, we can try running the same command again and we should end up with no changed tasks, once again. This indicates that the configuration state of our AWS resources is up-to-date with the configuration described in our setup.yaml playbook. And we are compliant with our baseline setup!
But I get no output with my playbooks
As we said before, not all modules support running in check mode or displaying differences between the before and after states. Our AWS Ansible Collection was designed with this use case in mind, so make sure you check it out.
What’s next?
In this post we demonstrated the value of Ansible’s --diff and --check modes and how we can use the modules in the AWS Ansible Collection to detect configuration drift, which makes them a suitable tool for keeping the state of AWS resources compliant. Perhaps the nicest thing in this whole story is that we used the same playbook for setting up the infrastructure, detecting the changes, and restoring our setup back to the desired state.
In our example, we ran the compliance checks using --check --diff after receiving an imaginary report about unusual behavior of the app running on our AWS resources. Try to imagine a similar situation, but with a real-world app deployment. How many of the app’s users could be affected, and possibly discouraged from interacting with the app ever again, before state alterations (accidental or not) were discovered, reported, and finally fixed? So if we wanted to take things a bit further, we could implement automated checks that periodically run the ansible-playbook setup.yaml --check --diff command. This way, we would increase the chances of picking up anomalies and be notified of similar situations before anyone else notices.
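One straightforward way to set this up is sketched below using the ansible.builtin.cron module; the paths and the log location are placeholders you would adapt to your environment:
- name: Schedule a nightly AWS compliance check
  ansible.builtin.cron:
    name: aws-compliance-check
    minute: "0"
    hour: "2"
    job: >-
      cd /path/to/playbooks &&
      ansible-playbook setup.yaml --check --diff
      >> /var/log/aws-compliance-check.log 2>&1
From there, a log watcher or a notification tool of your choice can alert you whenever the check reports changed tasks.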
If you have any questions, troubles, or doubts, you can always reach us on Twitter, LinkedIn, and Reddit. Thank you for checking out this post for a different perspective on simple Ansible concepts (get it? ;)).
So long!