If you are responsible for managing a bunch of your organization’s AWS resources, you probably know how annoying it is to debug various deployment-related issues, only to learn that someone accidentally messed with your setup, yet again. You probably resolved it by navigating your command-line history like a maniac (because you don’t use Ansible and have to Ctrl+R your way through, as you also forgot to write down the setup instructions ;)). We’ve been there, too. And this is what we learned:
- You should use Ansible and deploy your environment with a playbook, and
- You should make the most of it by taking advantage of both Ansible’s built-in mechanisms and the modules that allow you to detect configuration drift and correct it.
Our dear sysadmins and DevOps engineers, this post is for you. Stay with us and read along!
Ansible and compliance? How?
According to the Cambridge Advanced Learner’s Dictionary, compliance is:
the fact of obeying a particular law or rule, or of acting according to an agreement.
To see what the above definition has to do with Ansible and the AWS Ansible Collection, let’s briefly descend into the world of Ansible playbooks and modules.
It’s all about state
Ideally, playbook tasks describe the desired final state of a managed system - for example, configuration of an AWS EC2 instance. An Ansible module takes that description and performs the necessary steps to ensure that state.
It’s important to emphasize that it’s easy to write playbook tasks describing the final state of a managed system when Ansible modules adopt declarative APIs. As a consequence, we can treat playbook tasks as “the rule”, “the policy”, or simply as something we need to be compliant with. Perhaps even more importantly, we can treat Ansible playbooks as simple diagnostic tools to help us detect when we are not compliant anymore, so that we can mitigate the situation.
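For illustration, here is roughly what such a declarative task could look like. Treat this as a sketch only: the module lives under the collection’s steampunk.aws namespace, but the parameter names below are assumptions made for illustration, so check the collection’s documentation for the exact interface.
- name: Ensure a t3.micro instance named steamy-server is running
  steampunk.aws.ec2_instance:    # module name assumed; parameters are illustrative
    name: steamy-server
    type: t3.micro
    ami: ami-085925f297f89fce1   # AMI ID reused from the diff output shown later
    state: running
The task says nothing about how to reach that state - whether the instance needs to be created, reconfigured, or left untouched is for the module to figure out.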
If you are familiar with Ansible, all of this probably already rings a bell. Was your first thought the value of the changed variable that you see every time you run your playbooks? Or perhaps even Ansible’s --diff and --check modes? Indeed, this post is about all of them!
Let’s walk through a practical example demonstrating how you can use the modules from the AWS Ansible Collection to detect and resolve configuration drift.
To follow along more easily, make sure you have completed the tutorial about the basic usage of the collection in Getting Started with the AWS Ansible Collection.
Make sure you have Steampunk AWS Ansible Collection version 0.8.3 installed.
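If you don’t have the collection yet, it is published on Ansible Galaxy, so installing it should look something like this (double-check the collection name and the version pinning syntax for your Ansible release):
$ ansible-galaxy collection install steampunk.aws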
Set up the desired state
Let’s start by creating a simple AWS environment that will support running an imaginary app. The environment will comprise:
- an EC2 instance named steamy-server;
- a network interface named steamy-eni to attach to the EC2 instance, which we secure with the VPC’s default security group.
We created a playbook called setup.yaml for spinning up this environment.
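To give you an idea of its shape, here is an abridged sketch of what setup.yaml could look like. The module and parameter names are assumptions made for illustration (inferred from the resource properties in the diff output below), not a copy of the real playbook - refer to the Getting Started tutorial for the exact task definitions.
---
- hosts: localhost
  gather_facts: false
  tasks:
    # Module and parameter names below are illustrative assumptions,
    # not necessarily the collection's exact interface.
    - name: Create the steamy-server instance for running a simple app
      steampunk.aws.ec2_instance:
        name: steamy-server
        type: t3.micro
        ami: ami-085925f297f89fce1
        monitoring: detailed
        on_instance_initiated_shutdown: stop
        tags:
          app: steamy
          env: staging

    # Two more tasks follow the same pattern: one retrieves the VPC's default
    # security group, and one creates the steamy-eni network interface,
    # secures it with that group, and attaches it to the instance.
With setup.yaml in place, we run: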
$ ansible-playbook setup.yaml --diff
Output:
PLAY [localhost] ************************************************************
TASK [Create the steamy-server instance for running a simple app] ***********
--- before
+++ after
@@ -1 +1,24 @@
-{}
+{
+ "ami": "ami-085925f297f89fce1",
+ "availability_zone": "use1-az1",
+ "id": "i-0f1300d2456233353",
+ "key_pair": "test-keypair",
+ "launched_at": "2020-06-12T07:35:19+00:00",
+ "monitoring": "detailed",
+ "network_interface": "eni-0cf22a025bd9f7c98",
+ "on_instance_initiated_shutdown": "stop",
+ "secondary_network_interfaces": [],
+ "security_groups": [
+ "sg-0415ac333af261fc1"
+ ],
+ "state": "running",
+ "subnet": "subnet-06a0f705bc79538ed",
+ "tags": {
+ "Name": "steamy-server",
+ "app": "steamy",
+ "env": "staging"
+ },
+ "tenancy": "default",
+ "type": "t3.micro",
+ "vpc": "vpc-032c4ec6c40cf17a3"
+}
changed: [localhost]
TASK [Retrieve the default security group for the VPC] **********************
ok: [localhost]
TASK [Create a dedicated network interface and attach it to steamy-server] **
--- before
+++ after
@@ -1 +1,21 @@
-{}
+{
+ "attachment": {
+ "device_index": 1,
+ "instance": "i-0f1300d2456233353",
+ "keep_on_termination": false
+ },
+ "description": null,
+ "id": "eni-0c7d09f8e5290b78d",
+ "ip": "172.31.6.213",
+ "mac_address": "02:5b:f9:a2:db:fd",
+ "public_ip": null,
+ "security_groups": [
+ "sg-0415ac333af261fc1"
+ ],
+ "source_dest_check": true,
+ "subnet": "subnet-06a0f705bc79538ed",
+ "tags": {
+ "Name": "steamy-eni"
+ },
+ "type": "normal"
+}
changed: [localhost]
PLAY RECAP ******************************************************************
localhost : ok=3 changed=2 unreachable=0 failed=0
skipped=0 rescued=0 ignored=0
We can see two changed tasks, corresponding to the two newly created AWS resources. Thanks to the --diff flag, which instructs the modules to report the differences they made, we also get a nice visual indication of what exactly the modules changed. The red-colored lines show the previous state of the AWS resource (before a specific task ran), while the green ones show the current state, i.e. after the task ran.
We can interpret the diff output above as “the instance and network interface didn’t exist before, so we created them, and they are now configured as shown”.
Typically, we can’t take support for diff mode for granted, as it is up to individual modules to implement it. This is why some tasks make changes but report no differences despite running ansible-playbook with --diff.
A word about idempotence
Can you guess what will happen if we run the same command again? Let’s try it out!
$ ansible-playbook setup.yaml --diff
Output:
PLAY [localhost] ************************************************************
TASK [Create the steamy-server instance for running a simple app] ***********
ok: [localhost]
TASK [Retrieve the default security group for the VPC] **********************
ok: [localhost]
TASK [Create a dedicated network interface and attach it to steamy-server] **
ok: [localhost]
PLAY RECAP ******************************************************************
localhost : ok=3 changed=0 unreachable=0 failed=0
skipped=0 rescued=0 ignored=0
This time around, we see that there was nothing to be done - the Ansible modules already did all the hard work when they were invoked the first time. So no changed tasks, no updates made, and no diff output shown. In other words, the current state of our managed AWS resources is compliant with the state described by the tasks in our setup.yaml playbook. This demonstrates another important aspect of Ansible’s declarative approach: playbook tasks should ideally be idempotent. If the state of the managed resource is aligned with the desired state described by the playbook tasks, running the playbook again will not affect that state.
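To see why idempotence matters, compare a declarative task with an imperative one, using two well-known built-in modules (a generic, non-AWS example):
- name: Declarative - reports changed only when the directory is missing
  ansible.builtin.file:
    path: /tmp/steamy-demo
    state: directory

- name: Imperative - reports changed on every single run
  ansible.builtin.command: mkdir -p /tmp/steamy-demo
The modules in the AWS Ansible Collection behave like the first task: they compare the current state of the resource with the desired one and only act (and report changed) when the two differ.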
But how can all of this be useful to me, you wonder? Glad you asked, time to spice things up a bit!
Deviate from the initial state
Now, let’s put ourselves in the shoes of an adversary who wants to tweak the state of our AWS resources to their taste. In practice, this could happen accidentally or, worse, with malicious intent. But we’ll be doing it solely for the purpose of demonstration. Disclaimer: we won’t actually be doing anything dangerous; we’ll only make some simple configuration changes that should indicate that the configuration has drifted if we take setup.yaml as our baseline for compliance.
To make the changes, we could open the AWS Management Console and manually modify some properties of the EC2 instance or the network interface that we created with setup.yaml earlier. But since we’re all about automation, let’s put the tweaks in an Ansible playbook called tweaked-setup.yaml.
Here’s a recap of what we’ll be tweaking with this playbook (a sketch of the playbook follows the list):
- We’ll downgrade our steamy-server’s CloudWatch monitoring to the basic level and modify its shutdown behavior;
- We’ll update the value of steamy-server’s env tag;
- We’ll create a new security group called dangerous-secgroup with some suspicious permissions;
- We’ll associate this dangerous-secgroup with the network interface attached to the steamy-server instance, and we’ll also disable source/destination checking for the network interface.
Sneaky, right? ;)
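In the same spirit as before, here is an abridged sketch of what tweaked-setup.yaml could contain. Again, module and parameter names are assumptions made for illustration:
---
- hosts: localhost
  gather_facts: false
  tasks:
    # Module and parameter names are illustrative assumptions.
    - name: Modify configuration of the steamy-server instance
      steampunk.aws.ec2_instance:
        name: steamy-server
        type: t3.micro
        ami: ami-085925f297f89fce1
        monitoring: basic                         # downgraded from detailed
        on_instance_initiated_shutdown: terminate
        tags:
          app: steamy
          env: dev                                # changed from staging

    # Further tasks (not shown) create the dangerous-secgroup security group
    # permitting SMB traffic from anywhere, add it to steamy-eni's security
    # groups, and disable source/destination checking on the interface.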
If we ran ansible-playbook tweaked-setup.yaml now, we would end up with three changed tasks. But since we’re learning about Ansible’s diff mode, let’s do:
$ ansible-playbook tweaked-setup.yaml --diff
and we end up with three changed tasks, plus the diff output consistent with the tweaks described above:
PLAY [localhost] ************************************************************
TASK [Retrieve the steamy-server instance] **********************************
ok: [localhost]
TASK [Modify configuration of the steamy-server instance] *******************
--- before
+++ after
@@ -4,9 +4,9 @@
"id": "i-0f1300d2456233353",
"key_pair": "test-keypair",
"launched_at": "2020-06-12T07:35:19+00:00",
- "monitoring": "detailed",
+ "monitoring": "basic",
"network_interface": "eni-0cf22a025bd9f7c98",
- "on_instance_initiated_shutdown": "stop",
+ "on_instance_initiated_shutdown": "terminate",
"secondary_network_interfaces": [
"eni-0c7d09f8e5290b78d"
],
@@ -18,7 +18,7 @@
"tags": {
"Name": "steamy-server",
"app": "steamy",
- "env": "staging"
+ "env": "dev"
},
"tenancy": "default",
"type": "t3.micro",
changed: [localhost]
TASK [Create a security group to permit SMB traffic from anywhere] **********
changed: [localhost]
TASK [Extend steamy-eni's security groups, and update source/dest checking] *
--- before
+++ after
@@ -10,9 +10,10 @@
"mac_address": "02:5b:f9:a2:db:fd",
"public_ip": null,
"security_groups": [
+ "sg-0447e7e7bc88aeab1",
"sg-0415ac333af261fc1"
],
- "source_dest_check": true,
+ "source_dest_check": false,
"subnet": "subnet-d8b640be",
"tags": {
"Name": "steamy-eni"
changed: [localhost]
PLAY RECAP ******************************************************************
localhost : ok=4 changed=3 unreachable=0 failed=0
skipped=0 rescued=0 ignored=0
Basically, tweaked-setup.yaml introduced an alternative desired state of the AWS resources - the adversary’s. And this state is not exactly in agreement with the one we described in setup.yaml (we already learned that if it were, the last invocation of ansible-playbook would report no changed tasks!).
Detect configuration drift
Let’s now imagine that one day, we get a call from our teammates who report unusual traffic to the app server instance we provisioned on AWS, using our setup.yaml.
Luckily, we know what to do; for starters, we’ll try re-running the setup.yaml playbook, but with one important addition: this time we’ll add the --check flag to the ansible-playbook command.
The --check flag instructs the modules behind playbook tasks to run in check mode (also called a dry run). In check mode, a module does not perform any work that would change the state of the resource it manages. Instead, the module merely checks whether a change would need to be made to ensure the desired state. This is evident from the module’s changed value, which you might already be familiar with.
$ ansible-playbook setup.yaml --check
Output:
PLAY [localhost] ************************************************************
TASK [Create the steamy-server instance for running a simple app] ***********
changed: [localhost]
TASK [Retrieve the default security group for the VPC] **********************
ok: [localhost]
TASK [Create a dedicated network interface and attach it to steamy-server] **
changed: [localhost]
PLAY RECAP ******************************************************************
localhost : ok=3 changed=2 unreachable=0 failed=0
skipped=0 rescued=0 ignored=0
We notice something peculiar in the above output: two changed tasks. This could very well be related to the problem our teammates reported. But at this point, it’s too early to draw any conclusions. We only know that something has changed on the instance and network interface with respect to how we want them configured.
However, our goal is not only to verify that the configuration has drifted, but to pinpoint exactly what deviated from the desired state. Luckily, we have another trick up our sleeve. You guessed it: it’s as simple as adding our beloved --diff flag to the previous ansible-playbook --check command:
$ ansible-playbook setup.yaml --check --diff
Output:
PLAY [localhost] ************************************************************
TASK [Create the steamy-server instance for running a simple app] ***********
--- before
+++ after
@@ -4,9 +4,9 @@
"id": "i-0f1300d2456233353",
"key_pair": "test-keypair",
"launched_at": "2020-06-12T07:35:19+00:00",
- "monitoring": "basic",
+ "monitoring": "detailed",
"network_interface": "eni-0cf22a025bd9f7c98",
- "on_instance_initiated_shutdown": "terminate",
+ "on_instance_initiated_shutdown": "stop",
"secondary_network_interfaces": [
"eni-0c7d09f8e5290b78d"
],
@@ -18,7 +18,7 @@
"tags": {
"Name": "steamy-server",
"app": "steamy",
- "env": "dev"
+ "env": "staging"
},
"tenancy": "default",
"type": "t3.micro",
changed: [localhost]
TASK [Retrieve the default security group for the VPC] **********************
ok: [localhost]
TASK [Create a dedicated network interface and attach it to steamy-server] **
--- before
+++ after
@@ -10,10 +10,9 @@
"mac_address": "02:5b:f9:a2:db:fd",
"public_ip": null,
"security_groups": [
- "sg-0447e7e7bc88aeab1",
"sg-0415ac333af261fc1"
],
- "source_dest_check": false,
+ "source_dest_check": true,
"subnet": "subnet-d8b640be",
"tags": {
"Name": "steamy-eni"
changed: [localhost]
PLAY RECAP ******************************************************************
localhost : ok=3 changed=2 unreachable=0 failed=0
skipped=0 rescued=0 ignored=0
If we compare the output of this command with the output of the ansible-playbook tweaked-setup.yaml --diff run from earlier, we can see that they’re exact opposites. So the last ansible-playbook command we ran detected what tweaked-setup.yaml modified in our desired AWS setup. And if we weren’t running in check mode, we could actually …
Enforce compliance!
Fear not - our setup.yaml describes exactly the desired configuration state of our AWS resources, so the modules from the AWS Ansible Collection know what to do. To repair our setup, i.e. to bring it back to the desired state, let’s run the ansible-playbook command once more, but this time without --check:
$ ansible-playbook setup.yaml --diff
And we can observe exactly the same output as with the previous command. This time however, the modules actually performed the changes shown in the diff output.
Afterwards, we can try running the same command again and we should end up with no changed tasks, once again. This indicates that the configuration state of our AWS resources is up-to-date with the configuration described in our setup.yaml playbook. And we are compliant with our baseline setup!
But I get no output with my playbooks
As we said before, not all modules support running in check mode or displaying differences between the before and after states. Our AWS Ansible Collection was designed with this use case in mind, so make sure you check it out.
What’s next?
In this post we demonstrated the value of Ansible’s --diff and --check modes and how we can use the modules in the AWS Ansible Collection to detect configuration drift, which makes them a suitable tool for keeping the state of AWS resources compliant. Perhaps the nicest thing in this whole story is that we used the same playbook for setting up the infrastructure, detecting the changes, and restoring our setup back to the desired state.
In our example, we ran the compliance checks using --check --diff after receiving an imaginary report about unusual behavior of the app running on our AWS resources. Try to imagine a similar situation, but with a real-world app deployment. How many of the app’s users could be affected, and possibly discouraged from interacting with the app ever again, before state alterations (accidental or not) were discovered, reported, and finally fixed? So if we wanted to take things a bit further, we could implement automated checks that periodically run the ansible-playbook setup.yaml --check --diff command. This way, we would increase the chances of picking up anomalies and be notified of similar situations before anyone else notices.
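One straightforward way to set this up is sketched below using the ansible.builtin.cron module; the paths and the log location are placeholders you would adapt to your environment:
- name: Schedule a nightly AWS compliance check
  ansible.builtin.cron:
    name: aws-compliance-check
    minute: "0"
    hour: "2"
    job: >-
      cd /path/to/playbooks &&
      ansible-playbook setup.yaml --check --diff
      >> /var/log/aws-compliance-check.log 2>&1
From there, a log watcher or a notification tool of your choice can alert you whenever the check reports changed tasks.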
If you have any questions, troubles, or doubts, you can always reach us on Twitter, LinkedIn, and Reddit. Thank you for checking out this post for a different perspective on simple Ansible concepts (get it? ;)).
So long!