<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>aws | FLRNKS</title><link>https://flrnks.netlify.app/tag/aws/</link><atom:link href="https://flrnks.netlify.app/tag/aws/index.xml" rel="self" type="application/rss+xml"/><description>aws</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© 2024</copyright><lastBuildDate>Sat, 10 Oct 2020 11:11:00 +0000</lastBuildDate><image><url>https://flrnks.netlify.app/images/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url><title>aws</title><link>https://flrnks.netlify.app/tag/aws/</link></image><item><title>My first scala app</title><link>https://flrnks.netlify.app/post/aws-scala-tools/</link><pubDate>Sat, 10 Oct 2020 11:11:00 +0000</pubDate><guid>https://flrnks.netlify.app/post/aws-scala-tools/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>In this post I want to write about a personal project I started some time ago with the goal of learning more about Scala. At work we use Scala quite often to run big data jobs on AWS with Apache Spark. I had never used Scala before joining my current team, and its syntax was very alien to me. Recently, however, I had the chance to work on a task where I had to modify a component to use AWS Secrets Manager instead of HashiCorp&amp;rsquo;s Vault for fetching a secret value at runtime. To my surprise, I was able to complete this work without much of a struggle with Scala, and afterwards I became eager to learn more. On a colleague&amp;rsquo;s recommendation I started reading a book by Cay S. Horstmann titled &lt;strong>Scala for the Impatient (2nd edition)&lt;/strong>. I&amp;rsquo;m making slow but steady progress.&lt;/p>
&lt;p>
&lt;a href="https://learning.oreilly.com/library/view/scala-for-the/9780134540627/" target="_blank" rel="noopener">&lt;img src="images/scalabook.jpg" alt="Scala-For-The-Impatient">&lt;/a>&lt;/p>
&lt;p>Shortly after starting the book, I had the idea to begin a small project so that I could practice Scala by doing.&lt;/p>
&lt;h2 id="the-idea">The Idea&lt;/h2>
&lt;p>The idea, like many before it, came while fixing a bug at work. The bug was found in a component written in Scala to interact with the AWS Athena service. It had some neatly written functionality for making queries and waiting for their completion before fetching the results. I thought I would try to write something similar for AWS Systems Manager (SSM). It is a service with a few different components, so I decided to focus on &lt;code>Automation Documents&lt;/code>, which can carry out actions in an automated fashion. For example, the AWS-provided SSM document &lt;code>AWS-StartEC2Instance&lt;/code> can start an EC2 instance when invoked with the below two input parameters:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>InstanceId&lt;/strong>: to specify which EC2 instance you want to start&lt;/li>
&lt;li>&lt;strong>AutomationAssumeRole&lt;/strong>: to specify an IAM role which can be assumed by SSM to carry out this action&lt;/li>
&lt;/ul>
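&lt;p>To make the shape of these inputs concrete, here is a minimal sketch (with hypothetical instance ID and role ARN values) of the parameter map that the AWS Java SDK v2 expects for an automation execution, built from Scala collections:&lt;/p>

```scala
import scala.jdk.CollectionConverters._ // Scala 2.13+; older versions use scala.collection.JavaConverters

// Hypothetical values; the SDK takes input parameters as a
// java.util.Map[String, java.util.List[String]]
val parameters: java.util.Map[String, java.util.List[String]] = Map(
  "InstanceId"           -> List("i-0123456789abcdef0").asJava,
  "AutomationAssumeRole" -> List("arn:aws:iam::123456789012:role/AutomationServiceRole").asJava
).asJava
```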
&lt;p>I realized quite early on that if I wanted to implement this capability in my Scala app, it needed to be quite generic, so that it could support any Automation Document with an arbitrary number of input parameters. I also wanted it to be able to wait for the execution to finish and report whether it failed or succeeded. Here are the final requirements I came up with:&lt;/p>
&lt;ul>
&lt;li>create 2 separate git repos for:
&lt;ul>
&lt;li>a module that&amp;rsquo;s home for the AWS utility/helper classes&lt;/li>
&lt;li>a module for implementing the CLI App&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>support extra AWS services such as KMS, Secrets Manager and CloudFormation&lt;/li>
&lt;li>utilize
&lt;a href="https://github.com/localstack/localstack-java-utils" target="_blank" rel="noopener">localstack&lt;/a> for integration testing (when possible)&lt;/li>
&lt;/ul>
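&lt;p>The behaviour these requirements describe can be sketched as a small interface. All names here are illustrative, and an in-memory stub stands in for the real AWS calls:&lt;/p>

```scala
import scala.concurrent.Future

// Hypothetical interface: submit a document with arbitrary input parameters,
// then wait for the resulting execution to succeed or fail
trait AutomationRunner {
  def executeAutomation(documentName: String,
                        parameters: Map[String, List[String]]): Future[String]
  def waitForAutomationToFinish(executionId: String): Future[String]
}

// In-memory stub to illustrate the call flow without touching AWS
object FakeRunner extends AutomationRunner {
  def executeAutomation(documentName: String,
                        parameters: Map[String, List[String]]): Future[String] =
    Future.successful(s"exec-$documentName")
  def waitForAutomationToFinish(executionId: String): Future[String] =
    Future.successful(executionId)
}
```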
&lt;h2 id="initial-setup">Initial setup&lt;/h2>
&lt;p>Firstly, I had to figure out which third-party packages I needed to implement the app according to these simple requirements. To interact with AWS from Scala code, I decided to go with &lt;strong>v2&lt;/strong> of the official
&lt;a href="https://docs.aws.amazon.com/sdk-for-java/index.html" target="_blank" rel="noopener">Java SDK for AWS&lt;/a>. To implement the CLI app I mainly relied on the &lt;strong>picocli&lt;/strong> Java package, which was a bit less straightforward, but eventually it proved to be a good choice.&lt;/p>
&lt;p>Secondly, I have to admit that creating a reusable Scala package from scratch was a rather non-trivial task for me. Most of my programming experience comes from working in non-JVM environments, so that&amp;rsquo;s probably no surprise. I initially started out with &lt;strong>sbt&lt;/strong> for build &amp;amp; dependency management, but I kept running into issues that I couldn&amp;rsquo;t solve on my own, so I decided to swap it for &lt;strong>Maven&lt;/strong>, which was a bit more familiar to me.&lt;/p>
&lt;p>Finally, separating the project into two distinct git repositories allowed me to practice versioning and dependency management which I also found very useful:&lt;/p>
&lt;ul>
&lt;li>AWS Scala Utils: &lt;a href="https://github.com/florianakos/aws-utils-scala">https://github.com/florianakos/aws-utils-scala&lt;/a>&lt;/li>
&lt;li>AWS SSM CLI App: &lt;a href="https://github.com/florianakos/aws-ssm-scala-app">https://github.com/florianakos/aws-ssm-scala-app&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="the-utils-module">The utils module&lt;/h2>
&lt;p>Creating the utils module that serves as a kind of glue between the Scala CLI app and AWS Systems Manager was actually not as difficult as I had thought, mostly thanks to the example I had seen at work for a similar project with the AWS Athena service.&lt;/p>
&lt;p>The core functionality of the utils module, as far as SSM is concerned, is captured in the below functions:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-scala" data-lang="scala">&lt;span class="k">private&lt;/span> &lt;span class="k">def&lt;/span> &lt;span class="n">executeAutomation&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">documentName&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">String&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="n">parameters&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">java.util.Map&lt;/span>&lt;span class="o">[&lt;/span>&lt;span class="kt">String&lt;/span>,&lt;span class="kt">java.util.List&lt;/span>&lt;span class="o">[&lt;/span>&lt;span class="kt">String&lt;/span>&lt;span class="o">]])&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">Future&lt;/span>&lt;span class="o">[&lt;/span>&lt;span class="kt">String&lt;/span>&lt;span class="o">]&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="k">val&lt;/span> &lt;span class="n">startAutomationRequest&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="nc">StartAutomationExecutionRequest&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">builder&lt;/span>&lt;span class="o">()&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">documentName&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">documentName&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">parameters&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">parameters&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">build&lt;/span>&lt;span class="o">()&lt;/span>
&lt;span class="nc">Future&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="k">val&lt;/span> &lt;span class="n">executionResponse&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="n">ssmClient&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">startAutomationExecution&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">startAutomationRequest&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="n">logger&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">info&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="s">s&amp;#34;Execution id: &lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="n">executionResponse&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">automationExecutionId&lt;/span>&lt;span class="o">()&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s">&amp;#34;&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="n">executionResponse&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">automationExecutionId&lt;/span>&lt;span class="o">()&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="k">private&lt;/span> &lt;span class="k">def&lt;/span> &lt;span class="n">waitForAutomationToFinish&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">executionId&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">String&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">Future&lt;/span>&lt;span class="o">[&lt;/span>&lt;span class="kt">String&lt;/span>&lt;span class="o">]&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="k">val&lt;/span> &lt;span class="n">getExecutionRequest&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="nc">GetAutomationExecutionRequest&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">builder&lt;/span>&lt;span class="o">().&lt;/span>&lt;span class="n">automationExecutionId&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">executionId&lt;/span>&lt;span class="o">).&lt;/span>&lt;span class="n">build&lt;/span>&lt;span class="o">()&lt;/span>
&lt;span class="k">var&lt;/span> &lt;span class="n">status&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="nc">AutomationExecutionStatus&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nc">IN_PROGRESS&lt;/span>
&lt;span class="nc">Future&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="k">var&lt;/span> &lt;span class="n">retries&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;span class="k">while&lt;/span> &lt;span class="o">(&lt;/span>&lt;span class="n">status&lt;/span> &lt;span class="o">!=&lt;/span> &lt;span class="nc">AutomationExecutionStatus&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nc">SUCCESS&lt;/span>&lt;span class="o">)&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="k">val&lt;/span> &lt;span class="n">automationExecutionResponse&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="n">ssmClient&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">getAutomationExecution&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">getExecutionRequest&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="n">status&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="n">automationExecutionResponse&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">automationExecution&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">automationExecutionStatus&lt;/span>&lt;span class="o">()&lt;/span>
&lt;span class="n">status&lt;/span> &lt;span class="k">match&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="k">case&lt;/span> &lt;span class="nc">AutomationExecutionStatus&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nc">CANCELLED&lt;/span> &lt;span class="o">|&lt;/span> &lt;span class="nc">AutomationExecutionStatus&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nc">FAILED&lt;/span> &lt;span class="o">|&lt;/span> &lt;span class="nc">AutomationExecutionStatus&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nc">TIMED_OUT&lt;/span> &lt;span class="k">=&amp;gt;&lt;/span>
&lt;span class="k">throw&lt;/span> &lt;span class="nc">SsmAutomationExecutionException&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">status&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="n">automationExecutionResponse&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">automationExecution&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">failureMessage&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="k">case&lt;/span> &lt;span class="nc">AutomationExecutionStatus&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nc">SUCCESS&lt;/span> &lt;span class="k">=&amp;gt;&lt;/span>
&lt;span class="n">logger&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">info&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="s">s&amp;#34;Query finished with status: &lt;/span>&lt;span class="si">$status&lt;/span>&lt;span class="s">&amp;#34;&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="k">case&lt;/span> &lt;span class="n">status&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">AutomationExecutionStatus&lt;/span> &lt;span class="o">=&amp;gt;&lt;/span>
&lt;span class="n">logger&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">info&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="s">s&amp;#34;SSM Automation execution status: &lt;/span>&lt;span class="si">$status&lt;/span>&lt;span class="s">, check #&lt;/span>&lt;span class="si">$retries&lt;/span>&lt;span class="s">.&amp;#34;&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="nc">Thread&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sleep&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="k">if&lt;/span> &lt;span class="o">(&lt;/span>&lt;span class="n">retries&lt;/span> &lt;span class="o">&amp;lt;=&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="o">)&lt;/span> &lt;span class="mi">2500&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="o">(&lt;/span>&lt;span class="n">retries&lt;/span> &lt;span class="o">&amp;lt;=&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="o">)&lt;/span> &lt;span class="mi">5000&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="mi">15000&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="n">retries&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="o">}.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="k">_&lt;/span> &lt;span class="k">=&amp;gt;&lt;/span> &lt;span class="n">executionId&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>The first function, &lt;code>executeAutomation&lt;/code>, crafts an execution request, submits it to AWS, and returns the execution ID. This ID can be passed to &lt;code>waitForAutomationToFinish&lt;/code>, which periodically checks in with AWS until the execution completes. Between subsequent API requests it waits with an increasing delay to avoid API rate limiting caused by excessive polling.&lt;/p>
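&lt;p>The increasing delay can be factored out as a pure function (values in milliseconds, taken directly from the snippet above):&lt;/p>

```scala
// Poll delay schedule between status checks: 2.5s for the first few retries,
// then 5s, then 15s once the execution has been running for a while
def pollDelayMillis(retries: Int): Int =
  if (retries <= 3) 2500
  else if (retries <= 10) 5000
  else 15000
```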
&lt;h2 id="testing-the-utils-module">Testing the utils module&lt;/h2>
&lt;p>Once I had the core functionality ready, I wanted to write integration tests to ensure it worked as expected. Instead of hard-coding AWS credentials or using an AWS profile for a real account, I wanted to use Localstack, which mocks the real AWS API so that you can interact with it locally. For this reason I slightly tweaked the &lt;code>SsmAutomationHelper&lt;/code> class to accept an optional second argument (an &lt;code>Option[String]&lt;/code>) which is used while building the SSM API client:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-scala" data-lang="scala">&lt;span class="k">class&lt;/span> &lt;span class="nc">SsmAutomationHelper&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">profile&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">String&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="n">apiEndpoint&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">Option&lt;/span>&lt;span class="o">[&lt;/span>&lt;span class="kt">String&lt;/span>&lt;span class="o">])&lt;/span> &lt;span class="k">extends&lt;/span> &lt;span class="nc">LazyLogging&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="k">private&lt;/span> &lt;span class="k">val&lt;/span> &lt;span class="n">ssmClient&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="n">apiEndpoint&lt;/span> &lt;span class="k">match&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="k">case&lt;/span> &lt;span class="nc">None&lt;/span> &lt;span class="k">=&amp;gt;&lt;/span> &lt;span class="nc">SsmClient&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">builder&lt;/span>&lt;span class="o">()&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">credentialsProvider&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nc">ProfileCredentialsProvider&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">create&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">profile&lt;/span>&lt;span class="o">))&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">region&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nc">Region&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="nc">EU_WEST_1&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">build&lt;/span>&lt;span class="o">()&lt;/span>
&lt;span class="k">case&lt;/span> &lt;span class="nc">Some&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">localstackEndpoint&lt;/span>&lt;span class="o">)&lt;/span> &lt;span class="k">=&amp;gt;&lt;/span> &lt;span class="nc">SsmClient&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">builder&lt;/span>&lt;span class="o">()&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">credentialsProvider&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nc">StaticCredentialsProvider&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">create&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nc">AwsBasicCredentials&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">create&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="s">&amp;#34;foo&amp;#34;&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="s">&amp;#34;bar&amp;#34;&lt;/span>&lt;span class="o">)))&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">endpointOverride&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nc">URI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">create&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">localstackEndpoint&lt;/span>&lt;span class="o">))&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">build&lt;/span>&lt;span class="o">()&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This allowed me to pass &lt;code>http://localhost:4566&lt;/code> when running the integration tests against &lt;strong>localstack&lt;/strong> and have the API calls directed to the mocked endpoints. Previously each mocked service had its own dedicated port, but thanks to a recent change in &lt;strong>localstack&lt;/strong>, all AWS services can now be served on a single port, called the &lt;strong>edge&lt;/strong> port.&lt;/p>
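&lt;p>The endpoint selection above boils down to a simple pattern match on the &lt;code>Option&lt;/code>. Here is a dependency-free sketch of that decision; the real-AWS URL shown is illustrative only, since the SDK normally derives it from the configured region:&lt;/p>

```scala
// Pick the localstack edge endpoint when one is given, otherwise fall back
// to the regional AWS endpoint (illustrative URL)
def resolveEndpoint(apiEndpoint: Option[String]): String = apiEndpoint match {
  case Some(localstackEndpoint) => localstackEndpoint // e.g. "http://localhost:4566"
  case None                     => "https://ssm.eu-west-1.amazonaws.com"
}
```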
&lt;p>According to the documentation, SSM is supported in &lt;strong>localstack&lt;/strong>; however, I found out that running Automation Documents is a feature that is still missing. As a result, I had to run the integration tests against a real AWS account that I had set up for such scenarios. I was okay with this since there are plenty of built-in Automation Documents provided by AWS that I could safely use for the purpose.&lt;/p>
&lt;p>Eventually I decided to test against &lt;code>AWS-StartEC2Instance&lt;/code> and &lt;code>AWS-StopEC2Instance&lt;/code>, which only required me to set up a dummy EC2 instance to be the target of these requests. I also added a special &lt;strong>Tag&lt;/strong> to these integration tests so that they are excluded when invoked via &lt;code>mvn test&lt;/code> but can still be run manually whenever necessary.&lt;/p>
&lt;h2 id="cli-app-implementation">CLI App implementation&lt;/h2>
&lt;p>After running the tests, I was confident that the AWS utils worked correctly, so I started putting together the CLI app. I searched the web for a third-party package and found that argument parsing is not as simple as with Python&amp;rsquo;s &lt;code>argparse&lt;/code> package. I eventually settled on &lt;code>picocli&lt;/code>, which is written in Java but can also be used from Scala via the below annotations:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-scala" data-lang="scala">&lt;span class="nd">@Command&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">name&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="s">&amp;#34;SsmHelper&amp;#34;&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="n">version&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="nc">Array&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="s">&amp;#34;v0.0.1&amp;#34;&lt;/span>&lt;span class="o">),&lt;/span> &lt;span class="n">mixinStandardHelpOptions&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="kc">true&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="n">description&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="nc">Array&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="s">&amp;#34;CLI app for running automation documents in AWS SSM&amp;#34;&lt;/span>&lt;span class="o">))&lt;/span>
&lt;span class="k">class&lt;/span> &lt;span class="nc">SsmCliParser&lt;/span> &lt;span class="k">extends&lt;/span> &lt;span class="nc">Callable&lt;/span>&lt;span class="o">[&lt;/span>&lt;span class="kt">Unit&lt;/span>&lt;span class="o">]&lt;/span> &lt;span class="k">with&lt;/span> &lt;span class="nc">LazyLogging&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="nd">@Option&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">names&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="nc">Array&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="s">&amp;#34;-D&amp;#34;&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="s">&amp;#34;--document&amp;#34;&lt;/span>&lt;span class="o">),&lt;/span> &lt;span class="n">description&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="nc">Array&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="s">&amp;#34;Name of the SSM Automation document to execute&amp;#34;&lt;/span>&lt;span class="o">))&lt;/span>
&lt;span class="k">private&lt;/span> &lt;span class="k">var&lt;/span> &lt;span class="n">documentName&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="k">new&lt;/span> &lt;span class="nc">String&lt;/span>
&lt;span class="nd">@Parameters&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">index&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="s">&amp;#34;0..*&amp;#34;&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="n">arity&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="s">&amp;#34;0..*&amp;#34;&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="n">paramLabel&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="s">&amp;#34;&amp;lt;param1=val1&amp;gt; &amp;lt;param2=val2&amp;gt; ...&amp;#34;&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="n">description&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="nc">Array&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="s">&amp;#34;Key=Value parameters to use as Input Params&amp;#34;&lt;/span>&lt;span class="o">))&lt;/span>
&lt;span class="k">private&lt;/span> &lt;span class="k">val&lt;/span> &lt;span class="n">parameters&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">util.ArrayList&lt;/span>&lt;span class="o">[&lt;/span>&lt;span class="kt">String&lt;/span>&lt;span class="o">]&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="kc">null&lt;/span>
&lt;span class="o">[&lt;/span>&lt;span class="kt">...&lt;/span>&lt;span class="o">]&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>According to the original idea, there had to be one fixed CLI flag for the name of the AWS Automation Document (&lt;code>--document&lt;/code>), plus a variable number of additional arguments specifying the input parameters required by the given document. The &lt;code>picocli&lt;/code> package supports this workflow via the &lt;strong>@Option&lt;/strong> and &lt;strong>@Parameters&lt;/strong> annotations.&lt;/p>
&lt;p>The only thing left was a custom function to carry out the needed transformation of the input parameters. The values received in &lt;code>parameters&lt;/code> arrive as an &lt;strong>ArrayList&lt;/strong>: &lt;code>[&amp;lt;param1=val1&amp;gt;, &amp;lt;param2=val2&amp;gt;, ...]&lt;/code>, which had to be transformed into a &lt;strong>Map&lt;/strong>: &lt;code>[param1 -&amp;gt; [val1], param2 -&amp;gt; [val2]]&lt;/code> by splitting each String on the &lt;strong>=&lt;/strong> character. This format is required by the AWS SDK for SSM. After some iterations I ended up with the below function:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-scala" data-lang="scala">&lt;span class="k">private&lt;/span> &lt;span class="k">def&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">params&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">util.ArrayList&lt;/span>&lt;span class="o">[&lt;/span>&lt;span class="kt">String&lt;/span>&lt;span class="o">])&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">util.Map&lt;/span>&lt;span class="o">[&lt;/span>&lt;span class="kt">String&lt;/span>, &lt;span class="kt">util.List&lt;/span>&lt;span class="o">[&lt;/span>&lt;span class="kt">String&lt;/span>&lt;span class="o">]]&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="n">params&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asScala&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="k">_&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="sc">&amp;#39;=&amp;#39;&lt;/span>&lt;span class="o">))&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">collect&lt;/span> &lt;span class="o">{&lt;/span> &lt;span class="k">case&lt;/span> &lt;span class="nc">Array&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">key&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="n">value&lt;/span>&lt;span class="o">)&lt;/span> &lt;span class="k">=&amp;gt;&lt;/span> &lt;span class="n">key&lt;/span> &lt;span class="o">-&amp;gt;&lt;/span> &lt;span class="n">value&lt;/span> &lt;span class="o">}&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">groupBy&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="k">_&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">_1&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="o">.&lt;/span>&lt;span class="n">mapValues&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="k">_&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="k">_&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">_2&lt;/span>&lt;span class="o">).&lt;/span>&lt;span class="n">asJava&lt;/span>&lt;span class="o">).&lt;/span>&lt;span class="n">asJava&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Finally, I constructed the below method, which passes the two variables provided by &lt;code>picocli&lt;/code> to the &lt;code>SsmAutomationHelper&lt;/code> class from the utils module, invoking the requested Automation Document and waiting for its result via Scala&amp;rsquo;s &lt;code>Await&lt;/code> mechanism:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-scala" data-lang="scala">&lt;span class="k">def&lt;/span> &lt;span class="n">call&lt;/span>&lt;span class="o">()&lt;/span>&lt;span class="k">:&lt;/span> &lt;span class="kt">Unit&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="k">val&lt;/span> &lt;span class="n">conf&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="nc">ConfigFactory&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="o">()&lt;/span>
&lt;span class="k">val&lt;/span> &lt;span class="n">inputParams&lt;/span> &lt;span class="k">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">parameters&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="nc">Await&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">result&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nc">SsmAutomationHelper&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">newInstance&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">conf&lt;/span>&lt;span class="o">).&lt;/span>&lt;span class="n">runDocumentWithParameters&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="n">documentName&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="n">inputParams&lt;/span>&lt;span class="o">),&lt;/span> &lt;span class="mf">10.&lt;/span>&lt;span class="n">minutes&lt;/span>&lt;span class="o">)&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="packaging-the-cli-app">Packaging the CLI app&lt;/h2>
&lt;p>At this point I was ready with the CLI app and wanted to run it to see how it would function. Before I could run it, I needed to figure out how to package it all into a &lt;code>fat&lt;/code> JAR file with all needed dependencies, so that it could be invoked with CLI arguments. I googled around a bit and quickly found the
&lt;a href="https://docs.spring.io/spring-boot/docs/1.5.x/maven-plugin/repackage-mojo.html" target="_blank" rel="noopener">spring-boot-maven-plugin&lt;/a> which has the &lt;code>repackage&lt;/code> goal that&amp;rsquo;s just what I needed:&lt;/p>
&lt;blockquote>
&lt;p>Repackages existing JAR and WAR archives so that they can be executed from the command line using java -jar. With layout=NONE can also be used simply to package a JAR with nested dependencies (and no main class, so not executable).&lt;/p>
&lt;/blockquote>
&lt;p>I only had to add the below lines to my project&amp;rsquo;s &lt;strong>pom.xml&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-xml" data-lang="xml">&lt;span class="nt">&amp;lt;plugin&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;groupId&amp;gt;&lt;/span>org.springframework.boot&lt;span class="nt">&amp;lt;/groupId&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;artifactId&amp;gt;&lt;/span>spring-boot-maven-plugin&lt;span class="nt">&amp;lt;/artifactId&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;version&amp;gt;&lt;/span>2.3.2.RELEASE&lt;span class="nt">&amp;lt;/version&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;configuration&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;layout&amp;gt;&lt;/span>JAR&lt;span class="nt">&amp;lt;/layout&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;/configuration&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;executions&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;execution&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;goals&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;goal&amp;gt;&lt;/span>repackage&lt;span class="nt">&amp;lt;/goal&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;/goals&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;/execution&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;/executions&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;/plugin&amp;gt;&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Next I just had to run the &lt;code>mvn package&lt;/code> command, which invokes the plugin to build the &lt;code>fat&lt;/code> JAR.&lt;/p>
&lt;h2 id="running-the-cli-app">Running the CLI app&lt;/h2>
&lt;p>Once the JAR is available, it can be used via the &lt;code>java -jar ...&lt;/code> command with extra arguments to run any Automation Document, such as &lt;code>AWS-StartEC2Instance&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-Bash" data-lang="Bash">$ ▶ java -jar ./target/scala-cli-app-1.0.0.jar --document&lt;span class="o">=&lt;/span>AWS-StartEC2Instance &lt;span class="nv">InstanceId&lt;/span>&lt;span class="o">=&lt;/span>i-0ed4574c5ba94c877 &lt;span class="nv">AutomationAssumeRole&lt;/span>&lt;span class="o">=&lt;/span>arn:aws:iam::&lt;span class="o">{{&lt;/span>global:ACCOUNT_ID&lt;span class="o">}}&lt;/span>:role/AutomationServiceRole
15:24:41.998 &lt;span class="o">[&lt;/span>main&lt;span class="o">]&lt;/span> INFO c.f.utils.ssm.SsmAutomationHelper :: Going to kick off SSM orchestration document: AWS-StartEC2Instance
15:24:42.773 &lt;span class="o">[&lt;/span>ForkJoinPool-1-worker-29&lt;span class="o">]&lt;/span> INFO c.f.utils.ssm.SsmAutomationHelper :: Execution id: &amp;lt;...&amp;gt;
15:24:42.882 &lt;span class="o">[&lt;/span>ForkJoinPool-1-worker-11&lt;span class="o">]&lt;/span> INFO c.f.utils.ssm.SsmAutomationHelper :: Current status: &lt;span class="o">[&lt;/span>InProgress&lt;span class="o">]&lt;/span>, retry counter: &lt;span class="c1">#0&lt;/span>
&lt;span class="o">[&lt;/span>...&lt;span class="o">]&lt;/span>
15:28:01.226 &lt;span class="o">[&lt;/span>ForkJoinPool-1-worker-11&lt;span class="o">]&lt;/span> INFO c.f.utils.ssm.SsmAutomationHelper :: Current status: &lt;span class="o">[&lt;/span>InProgress&lt;span class="o">]&lt;/span>, retry counter: &lt;span class="c1">#21&lt;/span>
15:28:16.442 &lt;span class="o">[&lt;/span>ForkJoinPool-1-worker-11&lt;span class="o">]&lt;/span> INFO c.f.utils.ssm.SsmAutomationHelper :: Execution finished with final status: &lt;span class="o">[&lt;/span>Success&lt;span class="o">]&lt;/span>
15:28:16.444 &lt;span class="o">[&lt;/span>main&lt;span class="o">]&lt;/span> INFO com.flrnks.app.SsmCliParser :: SSM execution run took &lt;span class="m">215&lt;/span> seconds
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Seems to be working quite well!&lt;/p>
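&lt;p>To give a feel for what the app has to do with those arguments, below is a minimal, hypothetical sketch (the class and method names are mine, not the app&amp;rsquo;s) of splitting the &lt;code>--document&lt;/code> flag from the free-form &lt;code>Key=Value&lt;/code> parameters that get forwarded to the Automation Document:&lt;/p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the kind of argument handling such a CLI needs:
// one --document=<name> flag plus free-form Key=Value pairs that are
// forwarded to the SSM Automation Document. Names are illustrative only.
public class ArgSketch {

    static String documentName(String[] args) {
        for (String a : args) {
            if (a.startsWith("--document=")) {
                return a.substring("--document=".length());
            }
        }
        throw new IllegalArgumentException("missing --document flag");
    }

    static Map<String, String> documentParameters(String[] args) {
        Map<String, String> params = new HashMap<>();
        for (String a : args) {
            // Skip flags; everything else is treated as Key=Value
            if (!a.startsWith("--") && a.contains("=")) {
                int i = a.indexOf('=');
                params.put(a.substring(0, i), a.substring(i + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String[] demo = {"--document=AWS-StartEC2Instance", "InstanceId=i-0ed4574c5ba94c877"};
        System.out.println(documentName(demo));       // AWS-StartEC2Instance
        System.out.println(documentParameters(demo)); // {InstanceId=i-0ed4574c5ba94c877}
    }
}
```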
&lt;h2 id="bonus-running-in-a-container">Bonus: running in a container&lt;/h2>
&lt;p>I thought I would take the above one step further and package the JAR into a Java-based Docker container. This would allow me to forget about the syntax of the &lt;code>java&lt;/code> command that I previously used to run the app. Instead, I can hide it in a very minimal Dockerfile:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-dockerfile" data-lang="dockerfile">&lt;span class="k">FROM&lt;/span>&lt;span class="s"> openjdk:8-jdk-alpine&lt;/span>&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">MAINTAINER&lt;/span>&lt;span class="s"> flrnks &amp;lt;flrnks@flrnks.netlify.com&amp;gt;&lt;/span>&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">ADD&lt;/span> target/scala-cli-app-1.0.0.jar /usr/share/backend/app.jar&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">ENTRYPOINT&lt;/span> &lt;span class="p">[&lt;/span> &lt;span class="s2">&amp;#34;/usr/bin/java&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;-jar&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;/usr/share/backend/app.jar&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="err">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>mvn package&lt;/code> command used to build the fat JAR saves it into the &lt;strong>/target&lt;/strong> subdirectory, so one can put this Dockerfile into the project&amp;rsquo;s root and then manually build the Docker image by running &lt;code>docker build -t ssmcli .&lt;/code>. This creates an image called &lt;strong>ssmcli&lt;/strong> without issues. However, I&amp;rsquo;ve found an awesome plugin called &lt;code>dockerfile-maven-plugin&lt;/code> built by
&lt;a href="https://github.com/spotify/dockerfile-maven" target="_blank" rel="noopener">Spotify&lt;/a> which can automagically take this Dockerfile and turn it into an image based on the plugin configuration:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-xml" data-lang="xml">&lt;span class="nt">&amp;lt;plugin&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;groupId&amp;gt;&lt;/span>com.spotify&lt;span class="nt">&amp;lt;/groupId&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;artifactId&amp;gt;&lt;/span>dockerfile-maven-plugin&lt;span class="nt">&amp;lt;/artifactId&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;version&amp;gt;&lt;/span>1.4.10&lt;span class="nt">&amp;lt;/version&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;executions&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;execution&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;id&amp;gt;&lt;/span>default&lt;span class="nt">&amp;lt;/id&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;goals&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;goal&amp;gt;&lt;/span>build&lt;span class="nt">&amp;lt;/goal&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;/goals&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;configuration&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;repository&amp;gt;&lt;/span>flrnks/ssmcli&lt;span class="nt">&amp;lt;/repository&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;tag&amp;gt;&lt;/span>latest&lt;span class="nt">&amp;lt;/tag&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;/configuration&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;/execution&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;/executions&amp;gt;&lt;/span>
&lt;span class="nt">&amp;lt;/plugin&amp;gt;&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This plugin hooks into the &lt;code>mvn package&lt;/code> goal and when it&amp;rsquo;s executed it will automatically create the docker image:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-Bash" data-lang="Bash">&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> --- spring-boot-maven-plugin:2.3.2.RELEASE:repackage &lt;span class="o">(&lt;/span>default&lt;span class="o">)&lt;/span> @ scala-cli-app ---
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Layout: JAR
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Replacing main artifact with repackaged archive
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span>
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> --- dockerfile-maven-plugin:1.4.10:build &lt;span class="o">(&lt;/span>default&lt;span class="o">)&lt;/span> @ scala-cli-app ---
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> dockerfile: null
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> contextDirectory: /Users/flszabo/Desktop/personal-wrkspc/scala/scala-cli-app
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Building Docker context /Users/flszabo/Desktop/personal-wrkspc/scala/scala-cli-app
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Path&lt;span class="o">(&lt;/span>dockerfile&lt;span class="o">)&lt;/span>: null
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Path&lt;span class="o">(&lt;/span>contextDirectory&lt;span class="o">)&lt;/span>: /Users/flszabo/Desktop/personal-wrkspc/scala/scala-cli-app
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span>
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Image will be built as flrnks/ssmcli:latest
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Step 1/4 : FROM openjdk:8-jdk-alpine
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Pulling from library/openjdk
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Digest: sha256:94792824df2df33402f201713f932b58cb9de94a0cd524164a0f2283343547b3
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Status: Image is up to date &lt;span class="k">for&lt;/span> openjdk:8-jdk-alpine
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> ---&amp;gt; a3562aa0b991
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Step 2/4 : MAINTAINER flrnks &amp;lt;flrnks@flrnks.netlify.com&amp;gt;
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> ---&amp;gt; Using cache
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> ---&amp;gt; efcc673b4f35
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Step 3/4 : ADD target/scala-cli-app-1.0.0.jar /usr/share/backend/app.jar
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> ---&amp;gt; 8b2cf76f03c2
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Step 4/4 : ENTRYPOINT &lt;span class="o">[&lt;/span> &lt;span class="s2">&amp;#34;/usr/bin/java&amp;#34;&lt;/span>, &lt;span class="s2">&amp;#34;-jar&amp;#34;&lt;/span>, &lt;span class="s2">&amp;#34;/usr/share/backend/app.jar&amp;#34;&lt;/span>&lt;span class="o">]&lt;/span>
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> ---&amp;gt; Running in c9633237f9fa
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Removing intermediate container c9633237f9fa
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> ---&amp;gt; 6db69aa30fb1
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Successfully built 6db69aa30fb1
&lt;span class="o">[&lt;/span>INFO&lt;span class="o">]&lt;/span> Successfully tagged flrnks/ssmcli:latest
&lt;/code>&lt;/pre>&lt;/div>&lt;p>To test this new docker image I ran the &lt;code>AWS-StopEC2Instance&lt;/code> Automation Document and specified the same CLI arguments as before, thanks to the &lt;code>ENTRYPOINT&lt;/code> configuration in the Dockerfile. As an extra step I needed to share the AWS profile with the docker container at runtime by using the flag &lt;code>-v ~/.aws:/root/.aws&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-Bash" data-lang="Bash">$ ▶ ddocker run --rm -v ~/.aws:/root/.aws flrnks/ssmcli --document&lt;span class="o">=&lt;/span>AWS-StopEC2Instance &lt;span class="nv">InstanceId&lt;/span>&lt;span class="o">=&lt;/span>i-0ed4574c5ba94c877 &lt;span class="nv">AutomationAssumeRole&lt;/span>&lt;span class="o">=&lt;/span>arn:aws:iam::&lt;span class="o">{{&lt;/span>global:ACCOUNT_ID&lt;span class="o">}}&lt;/span>:role/AutomationServiceRole
17:18:59.541 &lt;span class="o">[&lt;/span>main&lt;span class="o">]&lt;/span> INFO c.f.utils.ssm.SsmAutomationHelper :: Going to kick off SSM orchestration document: AWS-StopEC2Instance
17:19:00.789 &lt;span class="o">[&lt;/span>ForkJoinPool-1-worker-13&lt;span class="o">]&lt;/span> INFO c.f.utils.ssm.SsmAutomationHelper :: Execution id: &amp;lt;...&amp;gt;
17:19:00.966 &lt;span class="o">[&lt;/span>ForkJoinPool-1-worker-11&lt;span class="o">]&lt;/span> INFO c.f.utils.ssm.SsmAutomationHelper :: Current status: &lt;span class="o">[&lt;/span>InProgress&lt;span class="o">]&lt;/span>, retry counter: &lt;span class="c1">#0&lt;/span>
17:19:03.564 &lt;span class="o">[&lt;/span>ForkJoinPool-1-worker-11&lt;span class="o">]&lt;/span> INFO c.f.utils.ssm.SsmAutomationHelper :: Execution finished with final status: &lt;span class="o">[&lt;/span>Success&lt;span class="o">]&lt;/span>
17:19:03.568 &lt;span class="o">[&lt;/span>main&lt;span class="o">]&lt;/span> INFO com.flrnks.app.SsmCliParser :: SSM execution run took &lt;span class="m">5&lt;/span> seconds
&lt;/code>&lt;/pre>&lt;/div>&lt;p>One may say that typing that long &lt;code>docker run ...&lt;/code> command above takes longer than typing &lt;code>java -jar ./target/scala-cli-app-1.0.0.jar ...&lt;/code> but I would argue that running it inside a docker container has its valid use-cases as well. It allows for controlled setup of the runtime environment and prevents dependency issues too!&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This project has allowed me to learn much more than I initially expected. I learnt a lot about Scala, which was the original goal, but I also gained valuable experience with Maven, its plugin ecosystem and of course with Java as well. I hope whoever reads this post will find something useful in it too!&lt;/p></description></item><item><title>Monitoring Flink on AWS EMR</title><link>https://flrnks.netlify.app/post/emr-flink-datadog/</link><pubDate>Sun, 16 Aug 2020 11:11:00 +0000</pubDate><guid>https://flrnks.netlify.app/post/emr-flink-datadog/</guid><description>&lt;h2 id="brief-intro">Brief intro&lt;/h2>
&lt;p>This is going to be a somewhat unusual post on this blog. It is about a problem I recently encountered while trying to improve the monitoring of a long-running Flink cluster we have on AWS EMR, following the official
&lt;a href="https://docs.datadoghq.com/integrations/flink/" target="_blank" rel="noopener">documentation&lt;/a> from Datadog.&lt;/p>
&lt;h2 id="the-emr-setup">The EMR setup&lt;/h2>
&lt;p>Our EMR cluster consumes 4 Kinesis Data Streams which are used to send S3 files in Avro format for processing. When a new file arrives, the Flink job will fetch it from S3, do some validation and filtering, then convert it to ORC format and save it to a new location on S3. In early June we experienced a failure in one of the Flink jobs consuming a production stream. Sadly, we did not have adequate monitoring set up to detect this in time. We only learnt about it when we noticed that data in the output bucket was missing for certain dates. Our streams were configured with the maximum retention period of 7 days. By the time we noticed, the missing data was already piling up in the stream, and the oldest records were close to half of this retention period old. By the time we managed to find the root cause and deploy the fix to the Flink job, it was too late, and some data had already expired from the stream.&lt;/p>
&lt;p>The existing monitoring solution was implemented via AWS Lambda functions running every 8 hours. These functions issued Athena queries to check whether any data had arrived in the S3 bucket during the last 48 hours. The problem with this approach was that we would not get alerted about missing data for up to 2 days, because the query used a 2-day sliding window.&lt;/p>
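&lt;p>To make the detection gap concrete, here is a small worked example (my own illustration, not our Lambda code) of how long an outage can stay invisible when a periodic check only alerts once a trailing window is completely empty:&lt;/p>

```java
// Illustration (not the actual Lambda code) of the alerting gap created by a
// sliding-window check: the checker runs every `checkIntervalHours` and
// alerts only when NO data arrived during the trailing `windowHours`.
public class DetectionDelay {

    // Hours between the last successful data arrival and the first check
    // that finds the trailing window empty and raises an alert.
    static int hoursUntilAlert(int windowHours, int checkIntervalHours) {
        // The trailing window is only empty once windowHours have elapsed;
        // the alert fires at the first scheduled check at or after that point.
        int intervals = (windowHours + checkIntervalHours - 1) / checkIntervalHours;
        return intervals * checkIntervalHours;
    }

    public static void main(String[] args) {
        // A 48-hour window checked every 8 hours: up to 2 full days can pass
        // before anyone is paged about missing data.
        System.out.println(hoursUntilAlert(48, 8) + " hours until alert");
    }
}
```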
&lt;p>The Flink cluster runs in a private VPC, so reaching the Flink Web UI to check the status of the jobs was quite difficult, to say the least. We either had to set up an SSH port forwarding session and use a FoxyProxy setup in Firefox, or set up a personal VM in the same private VPC via the AWS WorkSpaces managed service and then connect from that VM&amp;rsquo;s browser to the cluster&amp;rsquo;s Flink UI. Either way it was a cumbersome, manual process to connect to the Flink UI and check the cluster health. I wanted an automated way of gathering metrics and alerting if something went wrong, so I looked into how Flink could be monitored with Datadog.&lt;/p>
&lt;h2 id="datadog--flink">Datadog ❤️ Flink&lt;/h2>
&lt;p>A quick Google search turned up the official documentation from Datadog, where I found really straightforward instructions on enabling the submission of Flink metrics to Datadog, which could be instantly visualized in their default Flink dashboard. The main steps are:&lt;/p>
&lt;ul>
&lt;li>adding some new parameters to the flink-conf.yaml, such as the Datadog API/APP keys and custom tags&lt;/li>
&lt;li>copying the &lt;code>flink-datadog-metrics.jar&lt;/code> to the active flink installation path&lt;/li>
&lt;/ul>
&lt;p>The first step was quite easy. Our cluster was defined in Cloudformation where we used &lt;code>AWS::EMR::Cluster&lt;/code> which allows specifying the flink-conf.yaml content as below:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="k">Cluster&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>AWS&lt;span class="p">::&lt;/span>EMR&lt;span class="p">::&lt;/span>Cluster&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Properties&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>Flink-Cluster&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Configurations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="k">Classification&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>flink-conf&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ConfigurationProperties&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.class&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>org.apache.flink.metrics.datadog.DatadogHttpReporter&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.apikey&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;{{resolve:secretsmanager:datadog/api_key:SecretString}}&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.tags&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>name&lt;span class="p">:&lt;/span>flink-cluster&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>app&lt;span class="p">:&lt;/span>flink-cluster&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>region&lt;span class="p">:&lt;/span>eu-central&lt;span class="m">-1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>env&lt;span class="p">:&lt;/span>prod&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>...&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The above CF snippet shows just the 3 most important lines of the &lt;strong>flink-conf.yaml&lt;/strong>: (1) the full package name of the java class which implements the metric submission, (2) the Datadog API key loaded from AWS Secrets Manager and (3) a few custom tags which will be added to metrics sent to Datadog.&lt;/p>
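&lt;p>For the curious, the tag string is just a comma-separated list of &lt;code>key:value&lt;/code> pairs. A sketch of how such a string could be split into individual tags (my own illustration, not the reporter&amp;rsquo;s actual parsing code) looks like this:&lt;/p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch (not the reporter's actual implementation) of how a Datadog-style
// tag string like the one in metrics.reporter.dghttp.tags can be split into
// key/value pairs that are attached to every submitted metric.
public class TagSketch {

    static Map<String, String> parseTags(String raw) {
        Map<String, String> tags = new LinkedHashMap<>();
        for (String part : raw.split(",")) {
            // Split on the first ':' only, so values may contain colons
            String[] kv = part.trim().split(":", 2);
            if (kv.length == 2) {
                tags.put(kv[0], kv[1]);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String raw = "name:flink-cluster, app:flink-cluster, region:eu-central-1, env:prod";
        System.out.println(parseTags(raw));
        // {name=flink-cluster, app=flink-cluster, region=eu-central-1, env=prod}
    }
}
```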
&lt;p>To copy the necessary datadog-metrics JAR to the path it is loaded from (&lt;code>/usr/lib/flink/lib&lt;/code>), I added a new &lt;code>AWS::EMR::Step&lt;/code> in CloudFormation which is executed only on the EMR Master Node, in order to activate Datadog monitoring on the cluster via the Java class and API key supplied in the &lt;strong>flink-conf.yaml&lt;/strong>.&lt;/p>
&lt;p>To test that it was working properly I just needed to redeploy the cluster which was surprisingly easy thanks to the Cloudformation setup we had in place. But something was still not right.&lt;/p>
&lt;h2 id="know-your-continent">Know your continent&lt;/h2>
&lt;p>After redeploying the cluster I waited and waited and waited a bit more but metrics were not showing up in the Flink dashboard. So I got in touch with Datadog support who were very helpful in figuring out what the issue was. After a few rounds of emails back and forth we quickly discovered why the metrics were not showing up.&lt;/p>
&lt;p>The reason was that we had our Datadog account set up in the EU region and not in the USA. Thus, all our metrics were supposed to flow to the EU endpoint at &lt;code>app.datadoghq.eu/api/&lt;/code> instead of the USA endpoint at &lt;code>app.datadoghq.com/api/&lt;/code>. The difference is quite subtle, only a simple change in the TLD from &lt;strong>.com&lt;/strong> to &lt;strong>.eu&lt;/strong>. The catch was that our EMR cluster was running Flink 1.9.1 (provided by the EMR release 5.29.0) which had this API endpoint hardcoded, pointing to the USA data centre. The Datadog Support Engineer uncovered some extra
&lt;a href="https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#datadog-orgapacheflinkmetricsdatadogdatadoghttpreporter" target="_blank" rel="noopener">instructions&lt;/a> on how this can be solved by adding an extra line to the &lt;strong>flink-conf.yaml&lt;/strong> to change the default US region to the EU instead:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="k">Cluster&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>AWS&lt;span class="p">::&lt;/span>EMR&lt;span class="p">::&lt;/span>Cluster&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Properties&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>Flink-Cluster&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>...&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Configurations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="k">Classification&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>flink-conf&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ConfigurationProperties&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>...&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.class&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>org.apache.flink.metrics.datadog.DatadogHttpReporter&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.apikey&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;{{resolve:secretsmanager:datadog/api_key:SecretString}}&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.tags&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>name&lt;span class="p">:&lt;/span>flink-cluster&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>app&lt;span class="p">:&lt;/span>flink-cluster&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>region&lt;span class="p">:&lt;/span>eu-central&lt;span class="m">-1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>env&lt;span class="p">:&lt;/span>prod&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.dataCenter&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>EU&lt;span class="w"> &lt;/span>&lt;span class="c"># &amp;lt;&amp;lt; points the metrics reported to the EU region&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>...&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The problem was that this option only became available in Flink v1.11.0, while the highest version offered through the latest EMR release was v1.10.0, so this was not going to work for me. I almost gave up on the idea of monitoring Flink via Datadog when I had the idea to clone the official Flink repository from GitHub and tweak the code of v1.9.1, which we were running, to change the hardcoded API endpoint from &lt;strong>.com&lt;/strong> to &lt;strong>.eu&lt;/strong>. It was much easier than I expected; I just needed to slightly tweak the class &lt;code>./src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="cm">/**
&lt;/span>&lt;span class="cm"> * Http client talking to Datadog.
&lt;/span>&lt;span class="cm"> */&lt;/span>
&lt;span class="kd">public&lt;/span> &lt;span class="kd">class&lt;/span> &lt;span class="nc">DatadogHttpClient&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="cm">/* Changed endpoint for metric submission to use .eu instead of .com */&lt;/span>
&lt;span class="kd">private&lt;/span> &lt;span class="kd">static&lt;/span> &lt;span class="kd">final&lt;/span> &lt;span class="n">String&lt;/span> &lt;span class="n">SERIES_URL_FORMAT&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;https://app.datadoghq.eu/api/v1/series?api_key=%s&amp;#34;&lt;/span>&lt;span class="o">;&lt;/span>
&lt;span class="cm">/* Changed endpoint for API key validation to use .eu instead of .com */&lt;/span>
&lt;span class="kd">private&lt;/span> &lt;span class="kd">static&lt;/span> &lt;span class="kd">final&lt;/span> &lt;span class="n">String&lt;/span> &lt;span class="n">VALIDATE_URL_FORMAT&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;https://app.datadoghq.eu/api/v1/validate?api_key=%s&amp;#34;&lt;/span>&lt;span class="o">;&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Once I made the above code changes, I built a new JAR via &lt;code>mvn clean package&lt;/code>. The new JAR was made available at &lt;strong>./flink-metrics/flink-metrics-datadog/target/flink-metrics-datadog-1.9.1.jar&lt;/strong>, which I then uploaded to an S3 bucket where we store such files in my team. Next I slightly tweaked the AWS EMR step to load this JAR from S3 and redeployed the cluster once more. Finally, metrics started flowing! And it looked so nice. I was especially happy to see the TaskManager heap distribution, because the issue which sparked this whole endeavor had shown symptoms of heap memory problems.&lt;/p>
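&lt;p>To illustrate just how small the required change was, the standalone snippet below (a demo of my own, not Flink code) expands the two URL templates and shows that only the TLD differs between the US and EU endpoints:&lt;/p>

```java
// Standalone demo: the US and EU Datadog series endpoints differ only in the
// TLD of the hardcoded URL template. The API key below is a placeholder.
public class EndpointDemo {
    static final String US_SERIES_URL = "https://app.datadoghq.com/api/v1/series?api_key=%s";
    static final String EU_SERIES_URL = "https://app.datadoghq.eu/api/v1/series?api_key=%s";

    static String seriesUrl(String template, String apiKey) {
        return String.format(template, apiKey);
    }

    public static void main(String[] args) {
        System.out.println(seriesUrl(US_SERIES_URL, "dummy-key"));
        System.out.println(seriesUrl(EU_SERIES_URL, "dummy-key"));
    }
}
```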
&lt;p>&lt;img src="./images/default-dashboard.png" alt="Default Datadog Flink Dashboard">&lt;/p>
&lt;p>Unfortunately this default dashboard was not perfect, as some of its graphs failed to show any data. Maybe it was because we were using Flink v1.9.1 instead of v1.11.0; I&amp;rsquo;m not sure. In any case, I ended up cloning the dashboard and fixing the graphs manually, while also adding a few extras to show data about the AWS Kinesis streams feeding into the Flink cluster.&lt;/p>
&lt;p>&lt;img src="./images/custom-dashboard.jpg" alt="Custom Datadog Flink dashboard">&lt;/p>
&lt;p>Now it shows very nicely the age of each Flink job, which was not visible at all on the default dashboard. The end result is much better in my opinion.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>All in all, I am quite happy with how this whole story turned out in the end. Despite the issue with the hardcoded API endpoints to the USA region in v1.9.1 of Flink, I managed to implement a simple workaround thanks to the Open Source nature of the project. The result is that we have much better visibility and monitoring implemented for our Flink cluster which makes our lives in the DevOps world much better. I did not write much about it in this post, but once these metrics became available in our Datadog account it was trivial to set up a few Monitors which would alert us if for example one of the 4 Flink jobs were failing. I will leave it up to the reader to imagine how that&amp;rsquo;s done.&lt;/p></description></item><item><title>Testing Terraform Modules</title><link>https://flrnks.netlify.app/post/terraform-testing/</link><pubDate>Sun, 12 Jul 2020 11:11:00 +0000</pubDate><guid>https://flrnks.netlify.app/post/terraform-testing/</guid><description>&lt;h2 id="intro">Intro&lt;/h2>
&lt;p>I first heard of Terraform about a year ago while working on an assignment for a job interview. The learning curve was steep, and I still remember how confused I was about the syntax of HCL, which resembled JSON but was not exactly the same. I also remember hearing about the concept of Terraform Modules, but it was not needed for the assignment, so I skipped it for the time being.&lt;/p>
&lt;p>Fast forward to the present day: I&amp;rsquo;ve had a good amount of exposure to Terraform Modules at work, where we use them to provision resources on AWS in a standardized and rapid fashion. In order to broaden my knowledge of Terraform Modules, I decided to create an exercise in which I built two TF Modules using version 0.12 of Terraform. In this post I wanted to describe these two Terraform Modules and how I went about testing them to ensure they did what they were meant to.&lt;/p>
&lt;h2 id="what-is-a-terraform-module">What is a Terraform Module&lt;/h2>
&lt;p>According to official
&lt;a href="https://www.terraform.io/docs/configuration/modules.html" target="_blank" rel="noopener">documentation&lt;/a> a Terraform module is simply a container for multiple resources that are defined and used together. Terraform Modules can be embedded in each other to create a hierarchical structure of dependent resources. To define a Terraform Module one needs to create one or more Terraform files that define some input variables, some resources and some outputs. The input variabls are used to control properties of the resources, while the outputs are used to reveal information about the created resources. These are often organized into such structure as follows:&lt;/p>
&lt;ul>
&lt;li>&lt;code>variables.tf&lt;/code> defining the Terraform variables&lt;/li>
&lt;li>&lt;code>main.tf&lt;/code> creating the Terraform resources&lt;/li>
&lt;li>&lt;code>output.tf&lt;/code> listing the Terraform outputs&lt;/li>
&lt;/ul>
&lt;p>Note that the above is just an un-enforced convention, it simply makes it easier to get a quick understanding about a Terraform Module. As an example, if an organization needs to have their AWS S3 buckets secured with the same policies to protect their data, they can embed these security policies in a TF Module and then prescribe its use within the organization to enable those security policies automatically. Next up is an example of just that.&lt;/p>
&lt;h2 id="the-secure-bucket-tf-module">The Secure-Bucket TF Module&lt;/h2>
&lt;p>The first of the 2 Terraform Modules is &lt;code>tf-module-s3-bucket&lt;/code> which can be used to create an S3 bucket in AWS that is secured to a higher degree, so that it may be suitable for storing highly sensitive data. The security features of the bucket consists of:&lt;/p>
&lt;ul>
&lt;li>filtering on Source IPs that can access its contents&lt;/li>
&lt;li>enforcing encryption at rest (KMS) and in transit&lt;/li>
&lt;li>object-level and server access logging enabled&lt;/li>
&lt;li>filtering on IAM principals based on official
&lt;a href="https://aws.amazon.com/blogs/security/how-to-restrict-amazon-s3-bucket-access-to-a-specific-iam-role/" target="_blank" rel="noopener">docs&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>When using this module, one can define a list of IPs and a list of IAM Principals to control who, and from which networks, can access the contents of the bucket. These restrictions are written into the Bucket Policy, which is considered a &lt;code>resource-based policy&lt;/code> that always takes precedence over identity-based policies, so it does not matter if an IAM Role has been granted specific permission to access the bucket when the bucket&amp;rsquo;s own Bucket Policy denies the same access. Below is a good overview of the whole evaluation logic of AWS IAM:&lt;/p>
&lt;p>&lt;img src="static/aws-iam.png" alt="AWS IAM Evaluation Logic">&lt;/p>
&lt;p>In addition, server-access and object-level logging can be enabled as well to improve the bucket&amp;rsquo;s level of auditability. Altogether, these settings can greatly elevate the security of data in the S3 bucket that was created by this module.&lt;/p>
&lt;h2 id="the-s3-authz-tf-module">The S3-AuthZ TF Module&lt;/h2>
&lt;p>This 2nd Terraform Module is called &lt;code>tf-module-s3-auth&lt;/code> and it was written in part to complement the one used to create an S3 bucket. The aim of this module is to help with the creation of a single IAM policy that covers the S3 and KMS permissions needed for a given IAM Principal. The motivation behind this module comes from some difficulties I&amp;rsquo;ve faced at work, where some IAM Roles we used had too many policies attached. For further reference see the AWS
&lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_iam-quotas.html" target="_blank" rel="noopener">docs&lt;/a> on this.&lt;/p>
&lt;p>The Bucket Policy crafted by the first TF Module allows the definition of a list of IAM Principals that are allowed to interact with the bucket. With this TF module one can define the particular S3 actions that those IAM Principals CAN carry out on the data in the bucket. Additionally, this TF module can also be used to allow KMS actions on the KMS keys that protect the data at rest in the bucket.&lt;/p>
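&lt;p>As a rough illustration of the idea, the following Python sketch assembles S3 and KMS permissions into one combined policy document. The statement layout, ARNs and function name are illustrative placeholders, not the module&amp;rsquo;s exact output.&lt;/p>

```python
import json

def build_policy(bucket_arn, key_arn, s3_actions, kms_actions):
    # Combine S3 and KMS permissions into ONE policy document, so an
    # IAM role needs a single attachment instead of several (which
    # helps stay under the per-role policy quotas mentioned above).
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": s3_actions,
             "Resource": [bucket_arn, bucket_arn + "/*"]},
            {"Effect": "Allow", "Action": kms_actions,
             "Resource": [key_arn]},
        ],
    })

policy = build_policy("arn:aws:s3:::my-bucket",
                      "arn:aws:kms:eu-central-1:111122223333:key/abcd",
                      ["s3:GetObject", "s3:PutObject"],
                      ["kms:Decrypt", "kms:GenerateDataKey"])
```

The design choice is simply consolidation: one document with two statements, rather than two separately attached policies.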
&lt;h2 id="untested-code-is-broken-code">Untested code is broken code&lt;/h2>
&lt;p>With infrastructure-as-code, just as with normal code, testing is often an afterthought. However, it seems to be catching on more and more nowadays. Nothing shows this better than the number of Google search results for &lt;code>Infrastructure as Code testing&lt;/code>: &lt;strong>235.000.000&lt;/strong> as of today (15.8.2020). While Infrastructure as Code is a much broader topic with many other interesting projects, this post will focus solely on Terraform. With Terraform, a good step in the right direction is as simple as running &lt;code>terraform validate&lt;/code>, which can catch silly mistakes and syntax errors and provide feedback such as the below:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">Error: Missing required argument
on main.tf line 107, in output &lt;span class="s2">&amp;#34;s3_bucket_name&amp;#34;&lt;/span>:
107: output &lt;span class="s2">&amp;#34;s3_bucket_name&amp;#34;&lt;/span> &lt;span class="o">{&lt;/span>
The argument &lt;span class="s2">&amp;#34;value&amp;#34;&lt;/span> is required, but no definition was found.
&lt;/code>&lt;/pre>&lt;/div>&lt;p>In addition to the &lt;code>terraform validate&lt;/code> option, many IDEs, such as IntelliJ, already have plugins that can alert to such issues, so I find myself not using it so often. However, it&amp;rsquo;s still nice to have this feature built into the &lt;code>terraform&lt;/code> executable!&lt;/p>
&lt;p>Once all syntax errors are fixed, the next stage of testing can continue with the &lt;code>terraform plan&lt;/code> command. This command uses &lt;strong>terraform state&lt;/strong> information (local or remote) to figure out what changes are needed if the configuration is applied. This is truly very useful in showing in advance what will be created or destroyed. However, a successful &lt;code>terraform plan&lt;/code> can still result in a failed deployment, because some constraints cannot be verified without making actual API calls to the Cloud Service Provider. The &lt;code>terraform plan&lt;/code> command does not make any such calls; it only computes the differences that exist between the Terraform code and the Terraform state (local or remote). The failures are usually very provider-specific.&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">data &lt;span class="s2">&amp;#34;aws_iam_policy_document&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;Deny-Non-CiscoCidr-S3-Access&amp;#34;&lt;/span> &lt;span class="o">{&lt;/span>
statement &lt;span class="o">{&lt;/span>
&lt;span class="nv">sid&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Deny-All-S3-Actions-If-Not-In-IP-PrefixList&amp;#34;&lt;/span>
&lt;span class="nv">effect&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Deny&amp;#34;&lt;/span>
&lt;span class="nv">actions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span> &lt;span class="s2">&amp;#34;s3:*&amp;#34;&lt;/span> &lt;span class="o">]&lt;/span>
&lt;span class="nv">resources&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span> &lt;span class="s2">&amp;#34;*&amp;#34;&lt;/span> &lt;span class="o">]&lt;/span>
condition &lt;span class="o">{&lt;/span>
&lt;span class="nb">test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;NotIpAddress&amp;#34;&lt;/span>
&lt;span class="nv">variable&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;aws:SourceIp&amp;#34;&lt;/span>
&lt;span class="nv">values&lt;/span> &lt;span class="o">=&lt;/span> local.ip_prefix_list
&lt;span class="o">}&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This Terraform code is syntactically correct and passes &lt;code>terraform validate&lt;/code>, and &lt;code>terraform plan&lt;/code> produces a valid plan. However, it still fails at the &lt;code>terraform apply&lt;/code> stage because AWS has a restriction on the &lt;code>sid&lt;/code>: &lt;strong>For IAM policies, basic alphanumeric characters (A-Z,a-z,0-9) are the only allowed characters in the Sid value&lt;/strong>. This constraint is never checked before &lt;code>terraform apply&lt;/code> is called, at which point it fails the whole action with the below error:&lt;/p>
&lt;pre>&lt;code>An error occurred: Statement IDs (SID) must be alpha-numeric. Check that your input satisfies the regular expression [0-9A-Za-z]*
&lt;/code>&lt;/pre>&lt;p>Such types of errors can only be caught when making real API calls to the Cloud Service Provider (or to a truly identical mock of the real API) which will validate the calls and return errors if any are found. Next I will go into some details on how I went about testing the 2 Terraform Modules I wrote.&lt;/p>
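&lt;p>As an aside, since the AWS-side constraint on the &lt;code>sid&lt;/code> is a simple regular expression, one cheap mitigation is to lint candidate values before &lt;code>terraform apply&lt;/code> ever runs. A hypothetical pre-apply check could look like this:&lt;/p>

```python
import re

# AWS only allows [0-9A-Za-z]* in the Sid of an IAM policy statement
SID_RE = re.compile(r"^[0-9A-Za-z]*$")

def valid_sid(sid):
    return bool(SID_RE.match(sid))

print(valid_sid("Deny-All-S3-Actions-If-Not-In-IP-PrefixList"))  # False
print(valid_sid("DenyAllS3ActionsIfNotInIPPrefixList"))          # True
```

A check like this catches the specific failure shown above, but of course it cannot replace real API validation for the many other provider-side constraints.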
&lt;h3 id="manual-testing-via-aws">Manual Testing via AWS&lt;/h3>
&lt;p>This most rudimentary form of testing can be done by setting up a real project that imports and uses the two Terraform modules. This test can be found in my repository&amp;rsquo;s &lt;code>test/terraform/aws/&lt;/code> directory. For this to work properly the AWS provider has to be set up with real credentials, which is beyond the scope of this post. I also opted to use S3 as the TF state backend storage, but this is optional; the state can just as well be stored locally in a &lt;code>.tfstate&lt;/code> file.&lt;/p>
&lt;p>First, terraform has to be initialized via &lt;code>terraform init&lt;/code>, which triggers the download of the AWS Terraform Provider. Next, the changes can be planned and applied via &lt;code>terraform plan&lt;/code> and &lt;code>terraform apply&lt;/code> respectively. It&amp;rsquo;s interesting to note that a complete &lt;code>terraform apply&lt;/code> takes close to 1 minute:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">Apply complete! Resources: &lt;span class="m">7&lt;/span> added, &lt;span class="m">0&lt;/span> changed, &lt;span class="m">0&lt;/span> destroyed.
Outputs: &lt;span class="o">[&lt;/span>...&lt;span class="o">]&lt;/span>
real 0m49.090s
user 0m3.532s
sys 0m1.929s
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Once the &lt;code>terraform apply&lt;/code> is complete, one can manually assert whether it went as expected, based on the outputs (if any) and by inspecting the resources that were created. While this can be good enough for new setups, it may not be good enough when an already deployed project has to be modified and one needs to make sure the changes will not have any undesired side effects.&lt;/p>
&lt;h3 id="manual-testing-via-localstack">Manual Testing via localstack&lt;/h3>
&lt;p>In order to save time (and some costs), one may also consider using &lt;strong>localstack&lt;/strong>, which replicates most of the AWS API and its features to enable faster and easier development and testing. It&amp;rsquo;s important to note that this only helps if one&amp;rsquo;s infrastructure targets AWS, since localstack does not emulate other cloud providers. In an earlier
&lt;a href="https://flrnks.netlify.app/post/python-aws-datadog-testing/" target="_blank" rel="noopener">post&lt;/a> I&amp;rsquo;ve already written on how to set it up, so I will not repeat it here. The most important thing is to enable S3, IAM and KMS services in the
&lt;a href="https://github.com/florianakos/terraform-testing/blob/master/test/terraform/localstack/docker-compose.yml" target="_blank" rel="noopener">docker-compose.yaml&lt;/a> by setting this environment variable: &lt;code>SERVICES=s3,kms,iam&lt;/code> so the corresponding API endpoints are turned on.&lt;/p>
&lt;p>The Terraform files I wrote for testing on real AWS can be re-used for testing with localstack with some tweaks; for more detail see the &lt;code>test/terraform/localstack/&lt;/code> folder in my repository. Then it&amp;rsquo;s just a matter of running &lt;code>terraform init&lt;/code> followed by &lt;code>terraform plan&lt;/code> and &lt;code>terraform apply&lt;/code> to create the fake resources in Localstack.&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">Apply complete! Resources: &lt;span class="m">7&lt;/span> added, &lt;span class="m">0&lt;/span> changed, &lt;span class="m">0&lt;/span> destroyed.
Outputs: &lt;span class="o">[&lt;/span> ... &lt;span class="o">]&lt;/span>
real 0m11.649s
user 0m3.589s
sys 0m1.580s
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Notice that this time the &lt;code>terraform apply&lt;/code> took only about 10 seconds, which is considerably faster than using the real AWS API.&lt;/p>
&lt;h3 id="automating-tests-via-terratest">Automating tests via Terratest&lt;/h3>
&lt;p>As I&amp;rsquo;ve shown, running tests via Localstack can be much faster on average, but sometimes a project may require AWS services that are not supported by Localstack. In such cases it becomes necessary to run tests against the real AWS API. For those situations I recommend &lt;code>terratest&lt;/code> from
&lt;a href="https://terratest.gruntwork.io/" target="_blank" rel="noopener">Gruntwork.io&lt;/a>, which is a Go library that provides capabilities to automate tests.&lt;/p>
&lt;p>It still requires a terraform project to be set up, as described in &lt;code>Manual Testing via AWS&lt;/code>; however, having the ability to formally define and verify tests can greatly increase the confidence that the code being tested will function the way it&amp;rsquo;s supposed to. In the test I implemented some assertions on the output values of the &lt;code>terraform apply&lt;/code>, as well as on the existence of the S3 bucket just created. In addition, the Go library also provides ways to verify the AWS infrastructure setup by making HTTP calls or SSH connections. This can be a pretty powerful tool.&lt;/p>
&lt;p>This &lt;code>terratest&lt;/code> setup can be found in my repo under
&lt;a href="https://github.com/florianakos/terraform-testing/blob/master/test/go/terraform_test.go" target="_blank" rel="noopener">test/go/terraform_test.go&lt;/a>.&lt;/p>
&lt;p>Running this test takes considerably longer than either of the two previous ones, but the advantage is that this can be easily automated and integrated into a CI/CD build where it can verify on-demand that the TF code still works as intended, even if there were some changes.&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">▶ go &lt;span class="nb">test&lt;/span>
TestTerraform 2020-08-09T21:46:22+02:00 logger.go:66: Terraform has been successfully initialized!
...
TestTerraform 2020-08-09T21:47:30+02:00 logger.go:66: Apply complete! Resources: &lt;span class="m">7&lt;/span> added, &lt;span class="m">0&lt;/span> changed, &lt;span class="m">0&lt;/span> destroyed.
...
TestTerraform 2020-08-09T21:48:08+02:00 logger.go:66: Destroy complete! Resources: &lt;span class="m">7&lt;/span> destroyed.
...
PASS
ok github.com/florianakos/terraform-testing/tests 116.670s
&lt;/code>&lt;/pre>&lt;/div>&lt;p>The basic idea of &lt;code>terratest&lt;/code> is to automate the process of creating and cleaning up resources for the purposes of tests. To avoid name clashes with existing AWS resources, it&amp;rsquo;s a good practice to append some random string to resource names as part of the test, so the tests do not fail due to unique-name constraints.&lt;/p>
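&lt;p>The random-suffix trick (terratest provides an equivalent helper in Go) can be sketched in a few lines of Python; the naming scheme here is just an example:&lt;/p>

```python
import random
import string

def unique_name(prefix, length=8):
    # Append a random suffix so repeated test runs never collide with
    # leftover resources that carry globally unique names (e.g. S3 buckets).
    chars = string.ascii_lowercase + string.digits
    suffix = "".join(random.choices(chars, k=length))
    return prefix + "-" + suffix

bucket = unique_name("terratest-bucket")  # e.g. terratest-bucket-k3x9q2mf
```

The suffix also makes parallel test runs safe, since each run operates on its own uniquely named resources.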
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this post I have shown what options are available for testing a Terraform Module in local or remote settings. If one only works with AWS services then Localstack can be a great tool for quick local tests during development, while &lt;strong>terratest&lt;/strong> from Gruntwork can be a great help with codifying and automating such tests that run against the real AWS Cloud from your favourite CI/CD setup.&lt;/p></description></item><item><title>Identity &amp; Access Management</title><link>https://flrnks.netlify.app/post/aws-iam/</link><pubDate>Mon, 03 Feb 2020 11:11:00 +0000</pubDate><guid>https://flrnks.netlify.app/post/aws-iam/</guid><description>&lt;h2 id="introduction">INTRODUCTION&lt;/h2>
&lt;p>In this post I show how the Identity and Access Management service in the AWS Public Cloud works to secure resources and workloads. It is a very important topic, because it underpins all of the security that is needed for hosting one&amp;rsquo;s resources in the public cloud.&lt;/p>
&lt;p>At the end of the day, the cloud is just a concept that offers a convenient illusion of dedicated resources, but in reality it&amp;rsquo;s just some process that runs on someone else&amp;rsquo;s hardware, so one has to be absolutely sure about security before trusting it and running their business-critical workloads on it.&lt;/p>
&lt;p>It is enough to do a quick google search for
&lt;a href="https://www.google.com/search?q=unsecured%20s3%20bucket" target="_blank" rel="noopener">unsecured s3 bucket&lt;/a> to see plenty of examples of administrators failing to properly harden and configure their AWS resources, and falling victim to accidental disclosure of often business-critical information.&lt;/p>
&lt;p>
&lt;a href="https://docs.aws.amazon.com/iam/?id=docs_gateway" target="_blank" rel="noopener">IAM&lt;/a> exists in the realm of AWS Cloud as a standalone service, providing various ways in which access to resources and workloads can be restricted. For example, if someone has an S3 bucket for storing arbitrary data, one can use IAM policies to restrict access to data stored in the bucket based on various criteria such as user identity, connection source IP, VPC environment and so on. S3 is a convenient service to demonstrate IAM capabilities, because it is very easy to grasp the result of restrictions: access to files in an S3 bucket is either granted or denied.&lt;/p>
&lt;h2 id="how-it-works">HOW IT WORKS&lt;/h2>
&lt;p>In order to illustrate how IAM works, I decided to create a Python Lambda function, which is just an AWS service offering server-less functions, and implemented a routine that tries to access some data stored in a particular S3 bucket. By default the Lambda starts running with an
&lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html" target="_blank" rel="noopener">IAM role&lt;/a> that has only read-only permission to the bucket. This is verified by making an API call with the
&lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html" target="_blank" rel="noopener">boto3&lt;/a> package, which returns without any error. Next the Lambda tries to write some new data to the bucket, but this fails because the IAM role is not equipped with Write permission to the S3 bucket.&lt;/p>
&lt;p>To mitigate this problem, I use boto3 to make an AWS Secure Token Service (
&lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/Welcome.html" target="_blank" rel="noopener">STS&lt;/a>) call and assume a new role which is equipped with the necessary read-write access. Using this new role the program demonstrates that it can write to the bucket as expected. Below is a sample output of the Lambda Function in action:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yml" data-lang="yml">===&lt;span class="w"> &lt;/span>Checking&lt;span class="w"> &lt;/span>IAM&lt;span class="w"> &lt;/span>Identity&lt;span class="w"> &lt;/span>===&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">ARN&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>arn&lt;span class="p">:&lt;/span>aws&lt;span class="p">:&lt;/span>sts&lt;span class="p">::&lt;/span>ACCOUNT_ID&lt;span class="p">:&lt;/span>assumed-role/Base-Lambda-Custom-Role/lambda&lt;span class="w">
&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>===&lt;span class="w"> &lt;/span>Testing&lt;span class="w"> &lt;/span>Read&lt;span class="w"> &lt;/span>access&lt;span class="w"> &lt;/span>to&lt;span class="w"> &lt;/span>S3&lt;span class="w"> &lt;/span>file&lt;span class="w"> &lt;/span>in&lt;span class="w"> &lt;/span>bucket&lt;span class="w"> &lt;/span>===&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>{&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">&amp;#34;field1&amp;#34;: &lt;/span>&lt;span class="kc">true&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">&amp;#34;field2&amp;#34;: &lt;/span>&lt;span class="m">1.&lt;/span>4107917E7&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>}&lt;span class="w">
&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>===&lt;span class="w"> &lt;/span>Testing&lt;span class="w"> &lt;/span>Write&lt;span class="w"> &lt;/span>access&lt;span class="w"> &lt;/span>to&lt;span class="w"> &lt;/span>S3&lt;span class="w"> &lt;/span>bucket&lt;span class="w"> &lt;/span>===&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">Error&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>AccessDenied!&lt;span class="w">
&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>===&lt;span class="w"> &lt;/span>Assumed&lt;span class="w"> &lt;/span>New&lt;span class="w"> &lt;/span>IAM&lt;span class="w"> &lt;/span>Identity&lt;span class="w"> &lt;/span>===&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">ARN&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>arn&lt;span class="p">:&lt;/span>aws&lt;span class="p">:&lt;/span>sts&lt;span class="p">::&lt;/span>ACCOUNT_ID&lt;span class="p">:&lt;/span>assumed-role/S3-RW-Role/lambda&lt;span class="w">
&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>===&lt;span class="w"> &lt;/span>Testing&lt;span class="w"> &lt;/span>Write&lt;span class="w"> &lt;/span>access&lt;span class="w"> &lt;/span>to&lt;span class="w"> &lt;/span>S3&lt;span class="w"> &lt;/span>bucket&lt;span class="w"> &lt;/span>(using&lt;span class="w"> &lt;/span>new&lt;span class="w"> &lt;/span>role)&lt;span class="w"> &lt;/span>===&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>...&lt;span class="w"> &lt;/span>file&lt;span class="w"> &lt;/span>was&lt;span class="w"> &lt;/span>written&lt;span class="w"> &lt;/span>successfully!&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To get a better understanding how this all worked in code, feel free to check out the source code repository in Github (
&lt;a href="https://github.com/florianakos/aws-iam-exercise" target="_blank" rel="noopener">link&lt;/a>). Because I am a big fan of Terraform, I defined all resources (S3, IAM, Lambda) in code which makes it very simple and straightforward to deploy and test the code if you feel like!&lt;/p>
&lt;h2 id="advanced-iam">ADVANCED IAM&lt;/h2>
&lt;p>Besides providing the basic functionality to restrict access to resources based on user identity, there are some cool and more advanced features of AWS IAM that I wanted to touch upon. For example, to show how simple it is to give read-only permissions to a bucket for an IAM role:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">data &lt;span class="s2">&amp;#34;aws_iam_policy_document&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;s3_ro_access_policy_document&amp;#34;&lt;/span> &lt;span class="o">{&lt;/span>
statement &lt;span class="o">{&lt;/span>
&lt;span class="nv">effect&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Allow&amp;#34;&lt;/span>
&lt;span class="nv">actions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span>
&lt;span class="s2">&amp;#34;s3:GetObject&amp;#34;&lt;/span>,
&lt;span class="s2">&amp;#34;s3:ListBucket&amp;#34;&lt;/span>,
&lt;span class="o">]&lt;/span>
&lt;span class="nv">resources&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span>
&lt;span class="s2">&amp;#34;arn:aws:s3:::my-bucket&amp;#34;&lt;/span>,
&lt;span class="s2">&amp;#34;arn:aws:s3:::my-bucket/*&amp;#34;&lt;/span>
&lt;span class="o">]&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="o">}&lt;/span>
resource &lt;span class="s2">&amp;#34;aws_iam_policy&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;s3_ro_access_policy&amp;#34;&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="nv">name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;S3-ReadOnly-Access&amp;#34;&lt;/span>
&lt;span class="nv">policy&lt;/span> &lt;span class="o">=&lt;/span> data.aws_iam_policy_document.s3_ro_access_policy_document.json
&lt;span class="o">}&lt;/span>
resource &lt;span class="s2">&amp;#34;aws_iam_role_policy_attachment&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;Allow_S3_ReadOnly_Access&amp;#34;&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="nv">role&lt;/span> &lt;span class="o">=&lt;/span> aws_iam_role.aws_custom_role_for_lambda.name
&lt;span class="nv">policy_arn&lt;/span> &lt;span class="o">=&lt;/span> aws_iam_policy.s3_ro_access_policy.arn
&lt;span class="o">}&lt;/span>
resource &lt;span class="s2">&amp;#34;aws_iam_role&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;aws_s3_readwrite_role&amp;#34;&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="nv">name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;S3-RW-Role&amp;#34;&lt;/span>
&lt;span class="nv">description&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Role to allow full RW to bucket&amp;#34;&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Full source code on
&lt;a href="https://github.com/florianakos/aws-iam-exercise/blob/master/terraform/s3.tf" target="_blank" rel="noopener">GitHub&lt;/a>.&lt;/p>
&lt;p>With this short Terraform code, I created a role and assigned an IAM policy to it, which grants RO access to the &lt;code>my-bucket&lt;/code> resource in S3. To spice this up a bit, it is possible to add extra restrictions based on various elements of the request context, such as the Source IP:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">data &lt;span class="s2">&amp;#34;aws_iam_policy_document&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;s3_ro_access_policy_document&amp;#34;&lt;/span> &lt;span class="o">{&lt;/span>
statement &lt;span class="o">{&lt;/span>
&lt;span class="nv">effect&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Deny&amp;#34;&lt;/span>
&lt;span class="nv">actions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span>
&lt;span class="s2">&amp;#34;s3:*&amp;#34;&lt;/span>
&lt;span class="o">]&lt;/span>
&lt;span class="nv">resources&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span> &lt;span class="s2">&amp;#34;*&amp;#34;&lt;/span>&lt;span class="o">]&lt;/span>
condition &lt;span class="o">{&lt;/span>
&lt;span class="nb">test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;IpAddress&amp;#34;&lt;/span>
&lt;span class="nv">variable&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;aws:SourceIp&amp;#34;&lt;/span>
&lt;span class="nv">values&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span> &lt;span class="s2">&amp;#34;192.168.2.0/24&amp;#34;&lt;/span> &lt;span class="o">]&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>All of a sudden, even if the user who makes the request to S3 has the correct credentials but is connecting from a subnet outside the one specified above, the request will be &lt;strong>denied&lt;/strong>! This can be very useful, for example, when restricting access to resources so that it is possible only from within a corporate network with a specific CIDR range.&lt;/p>
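&lt;p>The source-IP condition is easy to reason about locally; this little Python sketch mirrors its semantics with the standard &lt;code>ipaddress&lt;/code> module (a simplified model, not the real policy engine):&lt;/p>

```python
import ipaddress

def source_ip_allowed(source_ip, allowed_cidr="192.168.2.0/24"):
    # Mirrors the policy above: requests from inside the CIDR pass,
    # everything else is caught by the Deny statement.
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(allowed_cidr)

print(source_ip_allowed("192.168.2.42"))  # True
print(source_ip_allowed("52.95.120.17"))  # False -> Deny applies
```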
&lt;p>One small issue with this source IP restriction is that it can cause problems for certain AWS services that run on behalf of a principal/user. When using the AWS Athena service, for example, triggering a query on data stored in S3 means Athena will make S3 API requests on behalf of the user who initiated the query, but with a source IP address from some Amazon AWS CIDR range, so the request will fail. For this purpose, there is an extra condition that can be added to remediate the issue:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">data &lt;span class="s2">&amp;#34;aws_iam_policy_document&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;s3_ro_access_policy_document&amp;#34;&lt;/span> &lt;span class="o">{&lt;/span>
statement &lt;span class="o">{&lt;/span>
&lt;span class="nv">effect&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Deny&amp;#34;&lt;/span>
&lt;span class="nv">actions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span>
&lt;span class="s2">&amp;#34;s3:*&amp;#34;&lt;/span>
&lt;span class="o">]&lt;/span>
&lt;span class="nv">resources&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span> &lt;span class="s2">&amp;#34;*&amp;#34;&lt;/span>&lt;span class="o">]&lt;/span>
condition &lt;span class="o">{&lt;/span>
&lt;span class="nb">test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;IpAddress&amp;#34;&lt;/span>
&lt;span class="nv">variable&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;aws:SourceIp&amp;#34;&lt;/span>
&lt;span class="nv">values&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span> &lt;span class="s2">&amp;#34;192.168.2.0/24&amp;#34;&lt;/span> &lt;span class="o">]&lt;/span>
&lt;span class="o">}&lt;/span>
condition &lt;span class="o">{&lt;/span>
&lt;span class="nb">test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;Bool&amp;#34;&lt;/span>
&lt;span class="nv">variable&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;aws:ViaAWSService&amp;#34;&lt;/span>
&lt;span class="nv">values&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="o">[&lt;/span> &lt;span class="s2">&amp;#34;false&amp;#34;&lt;/span> &lt;span class="o">]&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>aws:ViaAWSService = false&lt;/code> condition will ensure that this Deny only takes effect when the request does not come via an AWS service endpoint. For additional info on what other condition keys can be used to grant or deny access, please consult the AWS
&lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html" target="_blank" rel="noopener">documentation&lt;/a>.&lt;/p>
&lt;h2 id="conclusion">CONCLUSION&lt;/h2>
&lt;p>In this post I demonstrated how to use the boto3 python package to make AWS IAM and STS calls to access resources in the AWS cloud protected by IAM policies. I also discussed some advanced features of AWS IAM that can help you implement more granular IAM policies and access rights. The linked repository also contains an example which may be run locally and does not need the Lambda function to be created (it still, however, requires the Terraform resources to be deployed).&lt;/p></description></item><item><title>Cloud Service Testing</title><link>https://flrnks.netlify.app/post/python-aws-datadog-testing/</link><pubDate>Fri, 17 Jan 2020 11:11:00 +0000</pubDate><guid>https://flrnks.netlify.app/post/python-aws-datadog-testing/</guid><description>&lt;p>In this blog post I discuss a recent project I worked on to practice my skills related to AWS, Python and Datadog. It includes topics such as integration testing using &lt;strong>pytest&lt;/strong> and &lt;strong>localstack&lt;/strong>; running Continuous Integration via &lt;strong>Travis-CI&lt;/strong> and infrastructure as code using &lt;strong>Terraform&lt;/strong>.&lt;/p>
&lt;h2 id="intro">Intro&lt;/h2>
&lt;p>For the sake of this blog post, let&amp;rsquo;s assume that a periodic job runs somewhere in the Cloud, outside the context of this application, and generates a file with some meta-data about the job itself. This data includes mostly numerical values, such as the number of images used to train an ML model, or the number of files processed, etc. This part is depicted on the below diagram as a dummy Lambda function that periodically uploads such a metadata file, filled with random numerical values, to an S3 bucket.&lt;/p>
&lt;p>&lt;img src="img/arch.png" alt="Architecture">&lt;/p>
&lt;p>When this file is uploaded, an event notification is sent to the message queue. The goal of the Python application is to periodically drain these messages from the queue. When the application runs, it fetches the S3 file referenced in each SQS message, parses the file&amp;rsquo;s contents and submits the numerical metrics to DataDog for the purpose of visualisation and alerting.&lt;/p>
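&lt;p>The parsing step at the core of that loop can be sketched as follows; the field names and file shape are simplified placeholders, not the application&amp;rsquo;s exact format:&lt;/p>

```python
import json

def extract_metrics(metadata_file_contents):
    # Parse the job's metadata file and keep only the numerical values,
    # which are the ones worth submitting to Datadog as metrics.
    data = json.loads(metadata_file_contents)
    return {k: v for k, v in data.items() if isinstance(v, (int, float))}

# A simplified stand-in for a metadata file fetched from S3:
metrics = extract_metrics('{"images_trained": 1500, "files_processed": 42, "job": "nightly"}')
print(metrics)  # {'images_trained': 1500, 'files_processed': 42}
```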
&lt;h2 id="testing">Testing&lt;/h2>
&lt;p>Since the application interacts with two different APIs (AWS &amp;amp; Datadog), I figured it was a good idea to create integration tests that can be run easily via some free CI service (e.g.: Travis-CI.org). When writing the integration tests, I opted to create a simple mock class for testing the interaction with the Datadog API, and chose to rely on &lt;strong>localstack&lt;/strong> for testing the interaction with the AWS API.&lt;/p>
&lt;p>Thanks to &lt;strong>localstack&lt;/strong> I could skip creating real resources in AWS and instead use free fake resources in a docker container, that mimic the real AWS API close to 100%. The AWS SDK called &lt;code>boto3&lt;/code> is very easy to reconfigure to connect to the fake resources in &lt;strong>localstack&lt;/strong> with the &lt;code>endpoint_url=&lt;/code> parameter.&lt;/p>
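&lt;p>The redirection boils down to a single extra argument. The sketch below only assembles the keyword arguments; the port and dummy credentials reflect common localstack defaults (an assumption, check your own setup), and in real code the dict would be unpacked into the &lt;code>boto3&lt;/code> client constructor:&lt;/p>

```python
def localstack_client_kwargs(port=4566):
    # Pointing boto3 at localstack instead of the real AWS API only
    # requires overriding endpoint_url; the credentials are dummies,
    # since localstack does not validate them.
    return {
        "endpoint_url": "http://localhost:%d" % port,
        "region_name": "eu-central-1",
        "aws_access_key_id": "test",
        "aws_secret_access_key": "test",
    }

kwargs = localstack_client_kwargs()
# in real code: s3 = boto3.client("s3", **kwargs)
```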
&lt;p>In the following sections I go through different phases of the project:&lt;/p>
&lt;ol>
&lt;li>coding the python app&lt;/li>
&lt;li>mocking Datadog statsd client&lt;/li>
&lt;li>setting up AWS resources in localstack&lt;/li>
&lt;li>creating integration tests&lt;/li>
&lt;li>Travis-CI integration&lt;/li>
&lt;li>running the datadog-agent locally&lt;/li>
&lt;li>setting up real AWS resources&lt;/li>
&lt;li>live testing&lt;/li>
&lt;/ol>
&lt;h3 id="-coding-the-python-app-">~ Coding the python app ~&lt;/h3>
&lt;p>The
&lt;a href="https://github.com/florianakos/python-testing/blob/master/app/submitter.py" target="_blank" rel="noopener">code&lt;/a> is mainly composed of two Python classes with methods to interact with AWS and DataDog. The &lt;strong>CloudResourceHandler&lt;/strong> class has methods to interact with S3 and SQS, which can be replaced in integration-tests with preconfigured &lt;code>boto3&lt;/code> clients for &lt;strong>localstack&lt;/strong>.&lt;/p>
&lt;p>The &lt;strong>MetricSubmitter&lt;/strong> class uses the &lt;strong>CloudResourceHandler&lt;/strong> internally and offers some additional methods for sending metrics to DataDog. Internally it uses statsd from the &lt;code>datadog&lt;/code> python
&lt;a href="https://pypi.org/project/datadog/" target="_blank" rel="noopener">package&lt;/a>, which can be replaced via dependency injection in integration tests with a mock statsd class that I created to test its interaction with the Datadog API.&lt;/p>
&lt;p>To connect to the real AWS &amp;amp; Datadog APIs (via a preconfigured local datadog-agent), two environment variables need to be set at run-time:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>STATSD_HOST&lt;/strong> set to &lt;code>localhost&lt;/code>&lt;/li>
&lt;li>&lt;strong>SQS_QUEUE_URL&lt;/strong> set to the URL of the Queue&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">environ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;STATSD_HOST&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;localhost&amp;#39;&lt;/span>
&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">environ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;SQS_QUEUE_URL&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;https://sqs.eu-central-1.amazonaws.com/????????????/cloud-job-results-queue&amp;#39;&lt;/span>
&lt;span class="n">session&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">boto3&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Session&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">profile_name&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;profile-name&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;span class="n">MetricSubmitter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">statsd&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">datadog_statsd&lt;/span>&lt;span class="p">,&lt;/span>
&lt;span class="n">sqs_client&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">session&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">client&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;sqs&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span>
&lt;span class="n">s3_client&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">session&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">client&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;s3&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">run&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>It also requires a preconfigured AWS profile in &lt;code>~/.aws/credentials&lt;/code>, which &lt;strong>boto3&lt;/strong> uses to authenticate to AWS:&lt;/p>
&lt;pre>&lt;code class="language-console" data-lang="console">[profile-name]
aws_access_key_id = XXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
region = eu-central-1
&lt;/code>&lt;/pre>&lt;p>But before running it, let&amp;rsquo;s set up some integration tests!&lt;/p>
&lt;h3 id="-mocking-datadog-statsd-client-">~ Mocking Datadog statsd client ~&lt;/h3>
&lt;p>In truth, the application does not interact directly with the Datadog API, but rather it uses &lt;strong>statsd&lt;/strong> from the &lt;code>datadog&lt;/code> python package, which interacts with the local &lt;code>datadog-agent&lt;/code>, which in turn forwards metrics and events to the Datadog API.&lt;/p>
&lt;p>To test this flow that relies on &lt;code>statsd&lt;/code>, I created a class called &lt;strong>DataDogStatsDHelper&lt;/strong>. This class has two methods (&lt;strong>gauge&lt;/strong>/&lt;strong>event&lt;/strong>) whose signatures are identical to those of the real functions from the official &lt;code>datadog&lt;/code> package. However, the mock methods do not send anything to the &lt;code>datadog-agent&lt;/code>. Instead, they accumulate the values they were passed in local class variables:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="k">class&lt;/span> &lt;span class="nc">DataDogStatsDHelper&lt;/span>&lt;span class="p">:&lt;/span>
&lt;span class="n">event_title&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">event_text&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">event_alert_type&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">event_tags&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">event_counter&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;span class="n">gauge_metric_name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">gauge_metric_value&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">gauge_tags&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">gauge_counter&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">event&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">title&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">text&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">alert_type&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">aggregation_key&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">source_type_name&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span>
&lt;span class="n">date_happened&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">priority&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">tags&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">hostname&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">gauge&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">metric&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">value&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">tags&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sample_rate&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>When the MetricSubmitter class is tested, this mock class is injected instead of the real &lt;strong>statsd&lt;/strong> class, which enables assertions to be made and compare expectations with reality.&lt;/p>
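&lt;p>Filled in, such a mock might look like the sketch below. The method bodies are my illustration of the &amp;ldquo;accumulate in variables&amp;rdquo; idea, and instance attributes are used here for simplicity:&lt;/p>

```python
# Illustrative mock of the datadog statsd client: the signatures mirror
# the real event()/gauge() functions, but nothing is sent over the wire.
# Each call just records its arguments and bumps a counter.
class DataDogStatsDHelper:
    def __init__(self):
        self.event_title = None
        self.event_text = None
        self.event_alert_type = None
        self.event_tags = None
        self.event_counter = 0
        self.gauge_metric_name = None
        self.gauge_metric_value = None
        self.gauge_tags = None
        self.gauge_counter = 0

    def event(self, title, text, alert_type=None, aggregation_key=None,
              source_type_name=None, date_happened=None, priority=None,
              tags=None, hostname=None):
        self.event_title = title
        self.event_text = text
        self.event_alert_type = alert_type
        self.event_tags = tags
        self.event_counter += 1

    def gauge(self, metric, value, tags=None, sample_rate=None):
        self.gauge_metric_name = metric
        self.gauge_metric_value = value
        self.gauge_tags = tags
        self.gauge_counter += 1

# Usage in a test: inject the mock, exercise the code under test, then
# assert on the recorded fields (metric name and tag are illustrative).
statsd = DataDogStatsDHelper()
statsd.gauge("cloud_job.duration", 12.5, tags=["env:test"])
```

&lt;p>Assertions then simply compare these recorded fields and counters against expectations.&lt;/p>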
&lt;h3 id="-aws-resources-in-localstack-">~ AWS resources in localstack ~&lt;/h3>
&lt;p>To test how the python app integrates with S3 and SQS, I decided to use &lt;strong>localstack&lt;/strong>, running in a Docker container. To make the setup simple and repeatable, I created a &lt;code>docker-compose.yaml&lt;/code> file that defines the configuration parameters in YAML:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yml" data-lang="yml">&lt;span class="k">version&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;3.2&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">services&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">localstack&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>localstack/localstack&lt;span class="p">:&lt;/span>latest&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">container_name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>localstack&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ports&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="s1">&amp;#39;4563-4599:4563-4599&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="s1">&amp;#39;8080:8080&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">environment&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- SERVICES=s3&lt;span class="p">,&lt;/span>sqs&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- AWS_ACCESS_KEY_ID=foo&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- AWS_SECRET_ACCESS_KEY=bar&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The resulting fake AWS resources are accessible via different ports on localhost. In this case, S3 runs on port &lt;strong>4572&lt;/strong> and SQS on port &lt;strong>4576&lt;/strong>. Refer to the
&lt;a href="https://github.com/localstack/localstack#overview" target="_blank" rel="noopener">docs&lt;/a> on GitHub for more details on ports used by other AWS services in localstack.&lt;/p>
&lt;p>It is important to note that when localstack starts up, it is completely empty. Thus, before the integration tests can run, the S3 bucket and SQS queue must be provisioned in localstack, just as one would normally do with real AWS resources.&lt;/p>
&lt;p>For this purpose, it&amp;rsquo;s possible to write a simple bash script that can be called from the localstack container as part of an automatic init script:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">aws --endpoint-url&lt;span class="o">=&lt;/span>http://localhost:4572 s3api create-bucket --bucket &lt;span class="s2">&amp;#34;bucket-name&amp;#34;&lt;/span> --region &lt;span class="s2">&amp;#34;eu-central-1&amp;#34;&lt;/span>
aws --endpoint-url&lt;span class="o">=&lt;/span>http://localhost:4576 sqs create-queue --queue-name &lt;span class="s2">&amp;#34;queue-name&amp;#34;&lt;/span> --region &lt;span class="s2">&amp;#34;eu-central-1&amp;#34;&lt;/span> --attributes &lt;span class="s2">&amp;#34;MaximumMessageSize=4096,MessageRetentionPeriod=345600,VisibilityTimeout=30&amp;#34;&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>However, for the sake of making the integration-tests self-contained, I opted to integrate this into the tests as part of a class setup phase that runs before any tests and sets up the required S3 bucket and SQS queue:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="nd">@classmethod&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">setUpClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">cls&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="bp">cls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ls&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">LocalStackHelper&lt;/span>&lt;span class="p">()&lt;/span>
&lt;span class="bp">cls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_s3_client&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">create_bucket&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Bucket&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">cls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">s3_bucket_name&lt;/span>&lt;span class="p">)&lt;/span>
&lt;span class="bp">cls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_sqs_client&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">create_queue&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">QueueName&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">cls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sqs_queue_name&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="-creating-integration-tests-">~ Creating integration tests ~&lt;/h3>
&lt;p>As a next step I created the integration
&lt;a href="https://github.com/florianakos/python-testing/blob/master/app/test_submitter.py" target="_blank" rel="noopener">tests&lt;/a> which use the fake AWS resources in localstack, as well as the mock &lt;strong>statsd&lt;/strong> class for DataDog. I used two popular python packages to create these:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>unittest&lt;/strong> which is a built-in package&lt;/li>
&lt;li>&lt;strong>pytest&lt;/strong> which is a 3rd party package&lt;/li>
&lt;/ul>
&lt;p>In fact, the test cases only use &lt;strong>unittest&lt;/strong>, while &lt;strong>pytest&lt;/strong> merely collects and executes those tests. To get started with the &lt;strong>unittest&lt;/strong> framework, I created a python class and implemented the test cases within it:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="kn">import&lt;/span> &lt;span class="nn">unittest&lt;/span>
&lt;span class="kn">from&lt;/span> &lt;span class="nn">app.utils.datadog_fake_statsd&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">DataDogStatsDHelper&lt;/span>
&lt;span class="kn">from&lt;/span> &lt;span class="nn">app.utils.localstack_helper&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">LocalStackHelper&lt;/span>
&lt;span class="kn">from&lt;/span> &lt;span class="nn">app.submitter&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">MetricSubmitter&lt;/span>
&lt;span class="k">class&lt;/span> &lt;span class="nc">ProjectIntegrationTesting&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">unittest&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TestCase&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="nd">@classmethod&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">setUpClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">cls&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">setUp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">test_ddg_submitter_valid_payload&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">test_ddg_submitter_invalid_payload&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">test_aws_handler_invalid_s3key&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">test_aws_handler_valid_s3key&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>In the &lt;strong>setUpClass&lt;/strong> method, a few things are taken care of before tests can be executed:&lt;/p>
&lt;ul>
&lt;li>define class variables for the bucket &amp;amp; the queue&lt;/li>
&lt;li>create SQS &amp;amp; S3 clients using localstack endpoint url&lt;/li>
&lt;li>provision needed resources (Queue/Bucket) in localstack&lt;/li>
&lt;/ul>
&lt;p>To test the interaction with DataDog via the statsd client, the submitter app is executed; it stores some values in the mock &lt;strong>statsd&lt;/strong> class&amp;rsquo;s internal variables, which are then used in assertions to compare values with expectations.&lt;/p>
&lt;p>The other tests inspect the behaviour of the &lt;strong>CloudResourceHandler&lt;/strong> class. For example, one of the assertions tests whether the &lt;code>.has_available_messages()&lt;/code> function returns false when there are no more messages in the queue.&lt;/p>
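&lt;p>That check can be sketched as follows. The class and method names follow the post, but the bodies and the stand-in SQS client are my assumptions:&lt;/p>

```python
# Hypothetical sketch of the emptiness check: ask SQS for one message
# and report whether anything came back. SQS omits the "Messages" key
# from the response when the queue is empty.
class CloudResourceHandler:
    def __init__(self, sqs_client, queue_url):
        self.sqs = sqs_client
        self.queue_url = queue_url

    def has_available_messages(self):
        resp = self.sqs.receive_message(
            QueueUrl=self.queue_url, MaxNumberOfMessages=1
        )
        return len(resp.get("Messages", [])) > 0

# Stand-in for the localstack SQS client, representing an empty queue:
class EmptyQueueStub:
    def receive_message(self, QueueUrl, MaxNumberOfMessages):
        return {"ResponseMetadata": {}}

handler = CloudResourceHandler(
    EmptyQueueStub(), "http://localstack:4576/queue/queue-name"
)
```

&lt;p>With a real localstack client injected instead of the stub, the same assertion exercises the full request path.&lt;/p>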
&lt;p>A nice feature of &lt;strong>unittest&lt;/strong> is that it&amp;rsquo;s easy to define tasks that need to be executed before each test, to ensure a clean slate for each test. For example, the code in the &lt;strong>setUp&lt;/strong> method ensures two things:&lt;/p>
&lt;ul>
&lt;li>the fake SQS queue is emptied before each test&lt;/li>
&lt;li>class variables of the mock DataDog class are reset before each test&lt;/li>
&lt;/ul>
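&lt;p>A sketch of such a &lt;strong>setUp&lt;/strong> with both responsibilities spelled out; the queue purge is shown against a stand-in client, whereas real code would call &lt;code>purge_queue&lt;/code> (or drain messages) on the localstack SQS client:&lt;/p>

```python
import unittest

# Stand-in SQS client that records purge calls, for illustration only.
class FakeSqsClient:
    def __init__(self):
        self.purged = []

    def purge_queue(self, QueueUrl):  # same keyword argument as boto3
        self.purged.append(QueueUrl)

# Minimal stand-in for the statsd mock whose fields must be reset.
class FakeStatsdMock:
    event_counter = 99  # pretend leftovers from a previous test
    gauge_counter = 42

class ExampleIntegrationTests(unittest.TestCase):
    queue_url = "http://localstack:4576/queue/queue-name"  # illustrative

    def setUp(self):
        # 1) empty the fake SQS queue before each test
        self.sqs = FakeSqsClient()
        self.sqs.purge_queue(QueueUrl=self.queue_url)
        # 2) reset the mock DataDog class variables
        FakeStatsdMock.event_counter = 0
        FakeStatsdMock.gauge_counter = 0

    def test_starts_with_clean_slate(self):
        self.assertEqual(FakeStatsdMock.event_counter, 0)
        self.assertEqual(self.sqs.purged, [self.queue_url])
```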
&lt;p>In theory, the tests could now be run with &lt;code>pytest -s -v&lt;/code> from the python project&amp;rsquo;s root directory; however, they rely on localstack, so without it they would fail&amp;hellip;&lt;/p>
&lt;h3 id="-travis-ci-integration-">~ Travis-CI integration ~&lt;/h3>
&lt;p>So now that the integration tests are created, I thought it would be really nice to have them run automatically in a CI service whenever someone pushes changes to the Git repo. To this end, I created a free account on &lt;code>travis-ci.org&lt;/code> and integrated it with my GitHub repo by creating a &lt;strong>.travis.yml&lt;/strong> file with the below initial content:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="k">os&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>linux&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">language&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>python&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">python&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;3.8&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">services&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- docker&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">script&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- {...}&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>However, I still needed a way to run &lt;code>localstack&lt;/code> and then execute the integration tests within the CI environment. Luckily, I found &lt;strong>docker-compose&lt;/strong> to be a perfect fit for this purpose. I had already created a yaml file describing how to run &lt;code>localstack&lt;/code>, so now I could simply add an extra container to run my tests. Here is how I created a docker image to run the tests via docker-compose:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-dockerfile" data-lang="dockerfile">&lt;span class="k">FROM&lt;/span>&lt;span class="s"> python:3.8-alpine&lt;/span>&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">WORKDIR&lt;/span>&lt;span class="s"> /app&lt;/span>&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">COPY&lt;/span> ./requirements-test.txt ./&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">RUN&lt;/span> apk add --no-cache --virtual .pynacl_deps build-base gcc make python3 python3-dev libffi-dev &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> &lt;span class="o">&amp;amp;&amp;amp;&lt;/span> pip3 install --upgrade setuptools pip &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> &lt;span class="o">&amp;amp;&amp;amp;&lt;/span> pip3 install --no-cache-dir -r requirements-test.txt &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> &lt;span class="o">&amp;amp;&amp;amp;&lt;/span> rm requirements-test.txt&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">COPY&lt;/span> ./utils/*.py ./utils/&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">COPY&lt;/span> ./*.py ./&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">ENV&lt;/span> LOCALSTACK_HOST localstack&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">ENTRYPOINT&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;pytest&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;-s&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;-v&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="err">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>It installs the necessary dependencies into an Alpine-based Python 3.8 image, adds the source code, and sets &lt;strong>pytest&lt;/strong> as the entrypoint so that the tests are collected &amp;amp; run when the container starts. Here are the updates I had to make to the &lt;strong>docker-compose.yaml&lt;/strong> file:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="k">version&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;3.2&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">services&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">localstack&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>{...}&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">integration-tests&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">container_name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>cloud-job-it&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">build&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">context&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>.&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">dockerfile&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>Dockerfile-tests&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">depends_on&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;localstack&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Docker Compose automatically creates a shared network that enables connectivity between the defined services, which can reach one another by name. So when the tests run in the &lt;strong>cloud-job-it&lt;/strong> container, they can use the hostname &lt;strong>localstack&lt;/strong> in the endpoint URL of the &lt;strong>boto3&lt;/strong> session to reach the fake AWS resources.&lt;/p>
&lt;p>To make creating AWS clients for localstack easier, I used a package called
&lt;a href="https://github.com/localstack/localstack-python-client" target="_blank" rel="noopener">localstack-python-client&lt;/a>, so I don&amp;rsquo;t have to deal with port numbers and low level details. However, this client by default tries to use &lt;strong>localhost&lt;/strong> as the hostname, which wouldn&amp;rsquo;t work in my setup using docker-compose. After digging through the source-code of this python package, I found a way to change this by setting an environment variable named &lt;strong>LOCALSTACK_HOST&lt;/strong>.&lt;/p>
&lt;p>As a final step, I just had to add two lines to complete the &lt;strong>.travis.yml&lt;/strong> file:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="k">script&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- docker-compose&lt;span class="w"> &lt;/span>up&lt;span class="w"> &lt;/span>--build&lt;span class="w"> &lt;/span>--abort-on-container-exit&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- docker-compose&lt;span class="w"> &lt;/span>down&lt;span class="w"> &lt;/span>-v&lt;span class="w"> &lt;/span>--rmi&lt;span class="w"> &lt;/span>all&lt;span class="w"> &lt;/span>--remove-orphans&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Thanks to the &lt;code>--abort-on-container-exit&lt;/code> flag, docker-compose returns the exit code of the first container to exit, which fits this use-case perfectly, as the &lt;strong>cloud-job-it&lt;/strong> container only runs until the tests finish. This way the whole setup shuts down gracefully while preserving the container&amp;rsquo;s exit code, allowing the CI system to raise an alert if it is non-zero (meaning some test failed).&lt;/p>
&lt;h3 id="-running-the-datadog-agent-locally-">~ Running the datadog-agent locally ~&lt;/h3>
&lt;p>&lt;strong>Note&lt;/strong>: while Datadog is a paid service, it&amp;rsquo;s possible to create a trial account that&amp;rsquo;s free for 2 weeks, without the need to enter credit card details. This is pretty amazing!&lt;/p>
&lt;p>Now that the integration tests are automated and passing, I wanted to run the &lt;code>datadog-agent&lt;/code> locally, so that I could test the python application with some real data to be submitted to Datadog via the agent. Here is an
&lt;a href="https://docs.datadoghq.com/getting_started/agent/?tab=datadogeusite" target="_blank" rel="noopener">article&lt;/a> that was particularly useful to me, with instructions on how the agent should be set up.&lt;/p>
&lt;p>While the option of running it in docker-compose was initially appealing, I eventually decided to just start it manually as a long-lived detached container. Here is how I went about doing that:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="nv">DOCKER_CONTENT_TRUST&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">1&lt;/span> docker run -d &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> --name dd-agent &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -v /var/run/docker.sock:/var/run/docker.sock:ro &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -v /proc/:/host/proc/:ro &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -e &lt;span class="nv">DD_API_KEY&lt;/span>&lt;span class="o">=&lt;/span>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -e &lt;span class="nv">DD_SITE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;datadoghq.eu&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -e &lt;span class="nv">DD_DOGSTATSD_NON_LOCAL_TRAFFIC&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -p 8125:8125/udp &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> datadog/agent:7
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Most notable of these lines is the &lt;strong>DD_API_KEY&lt;/strong> environment variable, which ensures that whatever data I send to the agent is associated with my own account. In addition, since I am closest to the EU region, I had to specify the endpoint via the &lt;strong>DD_SITE&lt;/strong> variable. Also, because I want the agent to accept metrics from the python app, I had to enable a feature via the environment variable &lt;strong>DD_DOGSTATSD_NON_LOCAL_TRAFFIC&lt;/strong>, as well as expose port 8125 from the docker container to the host machine:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-bash" data-lang="bash"> ▶ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
477cb2ea74b2 datadog/agent &lt;span class="s2">&amp;#34;/init&amp;#34;&lt;/span> &lt;span class="m">3&lt;/span> days ago Up &lt;span class="m">3&lt;/span> days &lt;span class="o">(&lt;/span>healthy&lt;span class="o">)&lt;/span> 0.0.0.0:8125-&amp;gt;8125/udp, 8126/tcp dd-agent
&lt;/code>&lt;/pre>&lt;/div>&lt;p>All seems to be well!&lt;/p>
&lt;h3 id="-deploying-real-aws-resources-">~ Deploying real AWS resources ~&lt;/h3>
&lt;p>Here I briefly discuss how I deployed some real resources in AWS to see my application running live. In a nutshell, I set the infra up as code in Terraform, which greatly simplified the whole process. All the necessary files are collected in a
&lt;a href="https://github.com/florianakos/python-testing/tree/master/terraform" target="_blank" rel="noopener">directory&lt;/a> of my repository:&lt;/p>
&lt;ul>
&lt;li>&lt;code>variables.tf&lt;/code> defines some variables used in multiple places&lt;/li>
&lt;li>&lt;code>init.tf&lt;/code> initialisation of the AWS provider and definition of AWS resources&lt;/li>
&lt;li>&lt;code>outputs.tf&lt;/code> defines some values that are reported when deployment finishes&lt;/li>
&lt;/ul>
&lt;p>The first and last files are not very interesting. Most of the interesting stuff happens in &lt;strong>init.tf&lt;/strong>, which defines the necessary resources and permissions. One extra resource not mentioned before is an AWS Lambda function, which is executed every minute and uploads a JSON file to the S3 bucket. This acts as a random source of data, so that the python app has some work to do without manual intervention.&lt;/p>
&lt;h3 id="-live-testing-">~ Live testing ~&lt;/h3>
&lt;p>Now that all the parts seem ready, it&amp;rsquo;s time to run the main python app using the real S3 bucket and SQS queue, as well as the local datadog-agent. The console output provides some hints as to whether it is able to pump the metrics from AWS to DataDog:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-bash" data-lang="bash">▶ python3 submitter.py
Initializing new Cloud Resource Handler with SQS URL - https://.../cloud-job-results-queue
Processing available messages in SQS queue:
- sending data to DataDog via statsd/datadog-agent.
- removing message from SQS &lt;span class="o">(&lt;/span>AQEBO37smPPHg6OIqbh3HMu3g...&lt;span class="o">)&lt;/span>
- ...
- sending data to DataDog via statsd/datadog-agent.
- removing message from SQS &lt;span class="o">(&lt;/span>AQEBV0/JzMVEP6k5kBmx2kvGn...&lt;span class="o">)&lt;/span>
No more messages visible in the queue, shutting down ...
Process finished with &lt;span class="nb">exit&lt;/span> code &lt;span class="m">0&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Next, I checked my DataDog account to see whether the metric data arrived. For this I created a custom
&lt;a href="https://app.datadoghq.eu/notebook/list" target="_blank" rel="noopener">Notebook&lt;/a> with graphs to display them:&lt;/p>
&lt;p>&lt;img src="img/datadog-metrics.png" alt="DataDog Metrics">&lt;/p>
&lt;p>All seems to be well! The deployed AWS Lambda function has already run a few times, providing input data for the python app, which was successfully processed and forwarded to Datadog. As seen in the &lt;code>Notebook&lt;/code> above, it is really easy to display metric data about any recurring workload over time, which can provide pretty useful insights into those jobs.&lt;/p>
&lt;p>Furthermore, since DataDog also supports the submission of
&lt;a href="https://docs.datadoghq.com/events/" target="_blank" rel="noopener">events&lt;/a>, it becomes possible to design dashboards and create alerts that trigger based on more complex criteria, such as the presence or absence of events over certain periods of time. One such example can be seen below:&lt;/p>
&lt;p>&lt;img src="img/ok-vs-fail.png" alt="DataDog Dashboard OK">&lt;/p>
&lt;p>This is a so-called
&lt;a href="https://docs.datadoghq.com/dashboards/screenboards/" target="_blank" rel="noopener">screen-board&lt;/a> which I created to display the status of a Monitor that I set up previously. This Monitor tracks incoming events with the tag &lt;strong>cloud_job_metric&lt;/strong> and generates an alert, if there is not at least one such event of type &lt;strong>success&lt;/strong> in the last 30 minutes. The screen-board can be exported via a public URL if needed, or just simply displayed on a big screen somewhere in the office.&lt;/p>
&lt;h2 id="conclusions">Conclusions&lt;/h2>
&lt;p>In this post I discussed a relatively complex project with lots of exciting technology working together in the realm of cloud computing. In the end, I was able to create dashboards and monitors in DataDog which can ingest and display telemetry about AWS workloads, in a way that makes it useful to track and monitor the workloads themselves.&lt;/p></description></item><item><title>Infrastructure as Code</title><link>https://flrnks.netlify.app/post/infra-as-code/</link><pubDate>Tue, 12 Nov 2019 11:11:00 +0000</pubDate><guid>https://flrnks.netlify.app/post/infra-as-code/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post I will briefly introduce different AWS services and show how to use Terraform to orchestrate and manage them. While the service itself is rather simple, building it let me learn about an emerging practice called Infrastructure as Code, or IaC for short.&lt;/p>
&lt;h2 id="project-overview">Project overview&lt;/h2>
&lt;p>The main goal of this task is to deploy a serverless function that periodically queries the GitHub API for the list of public repositories of a given organisation (e.g. Google). The retrieved information should then be stored as a compressed CSV file in a specific S3 bucket, and notifications should be generated for new files saved to the bucket.&lt;/p>
&lt;p>&lt;img src="arch.png" alt="Go concurrency implemented">&lt;/p>
&lt;p>The main AWS components of the solution are:&lt;/p>
&lt;ul>
&lt;li>Lambda function written in Python&lt;/li>
&lt;li>CW Event Rule to schedule the Lambda periodically&lt;/li>
&lt;li>S3 for storing data in a bucket&lt;/li>
&lt;li>SQS for queueing notifications from S3&lt;/li>
&lt;/ul>
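&lt;p>To make the data flow concrete: the heart of the Lambda function is turning the GitHub API response (a list of repository objects) into a gzip-compressed CSV before uploading it to S3. A minimal sketch of that step might look like this; the column selection is my own assumption, the real function may store different fields:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">import csv
import gzip
import io

def repos_to_gzip_csv(repos):
    """Serialise a list of repo dicts (as returned by the GitHub API)
    into gzip-compressed CSV bytes, ready to upload to S3."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["name", "html_url", "stargazers_count"])
    for repo in repos:
        writer.writerow([repo["name"], repo["html_url"], repo["stargazers_count"]])
    return gzip.compress(buf.getvalue().encode("utf-8"))
&lt;/code>&lt;/pre>&lt;/div>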
&lt;h2 id="possibilities">Possibilities&lt;/h2>
&lt;p>Various methods exist for creating and configuring the necessary resources. The simplest one is to log in to the AWS Management Console and set up each component one by one via the GUI. This method, however, is slow, cumbersome and quite prone to errors.&lt;/p>
&lt;p>A better option can be to use the
&lt;a href="https://aws.amazon.com/tools/" target="_blank" rel="noopener">AWS SDK&lt;/a> for your favourite programming language. Several options exist, such as Java, Python, GO, Node.js, etc&amp;hellip; This option is less error-prone, but still quite cumbersome and slow.&lt;/p>
&lt;p>Perhaps one of the best options is to use Terraform, which is a popular Infrastructure as Code (IaC) tool these days. It lets you define your infrastructure in a configuration language, and its engine then talks to the AWS API to create the infrastructure you defined.&lt;/p>
&lt;h2 id="setup-procedure">Setup procedure&lt;/h2>
&lt;p>Before we can use Terraform to deploy our project on AWS, we need to set up credentials. This can be done by logging in to the AWS Management Console and going to the Identity and Access Management (IAM) section, which provides the Access Key ID and Secret Access Key values that you need to put into a file on disk. These credentials should be saved to &lt;code>~/.aws/credentials&lt;/code> as follows:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="o">[&lt;/span>default&lt;span class="o">]&lt;/span>
&lt;span class="nv">aws_access_key_id&lt;/span> &lt;span class="o">=&lt;/span> XXXXXXXXXXXX
&lt;span class="nv">aws_secret_access_key&lt;/span> &lt;span class="o">=&lt;/span> XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This enables Terraform to make changes to your AWS infrastructure through API calls to AWS, provisioning resources according to your definitions in the .tf file. Once you have created the desired configuration, the complete infrastructure can be deployed as simply as below:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-bash" data-lang="bash">$ ▶ ls -la
-rw-r--r-- &lt;span class="m">1&lt;/span> user group 4.9K Nov &lt;span class="m">21&lt;/span> 22:58 main.tf
$ ▶ terraform init
...
Terraform has been successfully initialized!
$ ▶ terraform apply
...
Plan: &lt;span class="m">13&lt;/span> to add, &lt;span class="m">0&lt;/span> to change, &lt;span class="m">2&lt;/span> to destroy.
Do you want to perform these actions?
Enter a value: yes
&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="project-building-blocks">Project building blocks&lt;/h2>
&lt;p>In this section I will go over each major component and explain what it is, what it does and how it is set up. First up is the storage layer where the results end up: AWS S3.&lt;/p>
&lt;h3 id="aws-simple-storage-service">AWS Simple Storage Service&lt;/h3>
&lt;p>This is a basic building block which we use to store the data generated by the Lambda function. Since Lambdas are by nature serverless, they have no persistent storage attached that could save data between two invocations of the function. If we need persistent storage, we can use S3. The necessary Terraform code is below:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-tf" data-lang="tf">&lt;span class="kr">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_s3_bucket&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;tf_aws_bucket&amp;#34;&lt;/span> &lt;span class="p">{&lt;/span>
&lt;span class="na">bucket&lt;/span> = &lt;span class="s2">&amp;#34;tf-aws-bucket&amp;#34;&lt;/span>
&lt;span class="na">tags&lt;/span> = &lt;span class="p">{&lt;/span>
&lt;span class="na">Name&lt;/span> = &lt;span class="s2">&amp;#34;Bucket for Terraform project&amp;#34;&lt;/span>
&lt;span class="na">Environment&lt;/span> = &lt;span class="s2">&amp;#34;Dev&amp;#34;&lt;/span>
&lt;span class="p">}&lt;/span>
&lt;span class="na">force_destroy&lt;/span> = &lt;span class="s2">&amp;#34;true&amp;#34;&lt;/span>
&lt;span class="p">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This will create a bucket named &lt;code>tf-aws-bucket&lt;/code> which we can then use to store the results of our Lambda function. As an extra feature, we also configure notifications for this bucket, which fire whenever a compressed file with the &lt;code>.gz&lt;/code> suffix is created in it. Each such notification is sent to the SQS queue that is also defined in the same Terraform file.&lt;/p>
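&lt;p>For reference, the message that S3 then delivers to the queue is a JSON document following the standard S3 event structure, and a consumer usually only needs the bucket and object key from it. A small extraction helper could look like this (payload heavily abbreviated):&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">import json

def new_objects(notification_body):
    """Extract (bucket, key) pairs from an S3 event notification
    as delivered to an SQS queue."""
    event = json.loads(notification_body)
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]
&lt;/code>&lt;/pre>&lt;/div>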
&lt;h3 id="aws-lambda">AWS Lambda&lt;/h3>
&lt;p>AWS Lambda is a serverless technology which lets you create a bare function in the cloud and call it from various other services, without having to worry about setting up an environment for it to run in. Different programming languages are supported, such as Python, Java, Go and Node.js. Once you deploy your code, your function receives input just as a regular function would, and it can be given permission to access and modify other resources in AWS, such as files stored in S3.&lt;/p>
&lt;p>This is exactly the use-case implemented in this project: a Lambda function that calls the GitHub API to download information, then stores it as a compressed CSV file in an S3 bucket. To define the target organisation and the bucket where the information is saved, the Lambda function expects two arguments in the function call:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="p">{&lt;/span>
&lt;span class="nt">&amp;#34;org_name&amp;#34;&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;twitter&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;span class="nt">&amp;#34;target_bucket&amp;#34;&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;repos_folder&amp;#34;&lt;/span>
&lt;span class="p">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This JSON input passed to the function is converted to a Python dictionary, which can be checked for the presence of the keys necessary for the correct functioning of the code:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="k">def&lt;/span> &lt;span class="nf">handler&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">event&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">context&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="c1"># verify that URL is passed correctly and create file_name variable based on it&lt;/span>
&lt;span class="k">if&lt;/span> &lt;span class="s1">&amp;#39;org_name&amp;#39;&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">event&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">keys&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="ow">or&lt;/span> &lt;span class="s1">&amp;#39;target_bucket&amp;#39;&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">event&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">keys&lt;/span>&lt;span class="p">():&lt;/span>
&lt;span class="k">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Missing &amp;#39;org_name&amp;#39; from request body (JSON)!&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>The rest of the function&amp;rsquo;s code downloads the list of public repositories of the given organisation from the GitHub API and stores it in a temporary file that can then be uploaded to S3, provided that the necessary permissions have been granted to the Lambda function:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="kn">import&lt;/span> &lt;span class="nn">boto3&lt;/span>
&lt;span class="n">s3&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">boto3&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">client&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;s3&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;span class="n">s3&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">upload_file&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path_to_local_file&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">target_bucket_name&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">key_name&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>In order to enable access to S3 from Lambda, we have to define some IAM policies and roles. First we define a policy which grants access to the S3 bucket to whichever role it is attached to:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-tf" data-lang="tf">&lt;span class="kr">data&lt;/span> &lt;span class="s2">&amp;#34;aws_iam_policy_document&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;s3_lambda_access&amp;#34;&lt;/span> &lt;span class="p">{&lt;/span>
&lt;span class="nx">statement&lt;/span> &lt;span class="p">{&lt;/span>
&lt;span class="na">effect&lt;/span> = &lt;span class="s2">&amp;#34;Allow&amp;#34;&lt;/span>
&lt;span class="na">resources&lt;/span> = &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;arn:aws:s3:::tf-aws-bucket/*&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;span class="na">actions&lt;/span> = &lt;span class="p">[&lt;/span>
&lt;span class="s2">&amp;#34;s3:GetObject&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;span class="s2">&amp;#34;s3:PutObject&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;span class="s2">&amp;#34;s3:ListBucket&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;span class="p">]&lt;/span>
&lt;span class="p">}&lt;/span>
&lt;span class="p">}&lt;/span>
&lt;span class="kr">
&lt;/span>&lt;span class="kr">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_iam_policy&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;s3_lambda_access&amp;#34;&lt;/span> &lt;span class="p">{&lt;/span>
&lt;span class="na">name&lt;/span> = &lt;span class="s2">&amp;#34;s3_lambda_access&amp;#34;&lt;/span>
&lt;span class="na">policy&lt;/span> = &lt;span class="nb">data&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">aws_iam_policy_document&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">s3_lambda_access&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">json&lt;/span>
&lt;span class="p">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This policy is then attached to an IAM role which is allowed to be assumed by AWS Lambda:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-tf" data-lang="tf">&lt;span class="kr">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_iam_role_policy_attachment&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;s3_lambda_access&amp;#34;&lt;/span> &lt;span class="p">{&lt;/span>
&lt;span class="na">role&lt;/span> = &lt;span class="nx">aws_iam_role&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">tf_aws_exercise_role&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">name&lt;/span>
&lt;span class="na">policy_arn&lt;/span> = &lt;span class="nx">aws_iam_policy&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">s3_lambda_access&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">id&lt;/span>
&lt;span class="p">}&lt;/span>
&lt;span class="kr">
&lt;/span>&lt;span class="kr">resource&lt;/span> &lt;span class="s2">&amp;#34;aws_iam_role&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;tf_aws_exercise_role&amp;#34;&lt;/span> &lt;span class="p">{&lt;/span>
&lt;span class="na">name&lt;/span> = &lt;span class="s2">&amp;#34;tfExerciseRole&amp;#34;&lt;/span>
&lt;span class="na">description&lt;/span> = &lt;span class="s2">&amp;#34;Role that allowed to be assumed by AWS Lambda, which will be taking all actions.&amp;#34;&lt;/span>
&lt;span class="na">tags&lt;/span> = &lt;span class="p">{&lt;/span>
&lt;span class="na">owner&lt;/span> = &lt;span class="s2">&amp;#34;tfExerciseBoss&amp;#34;&lt;/span>
&lt;span class="p">}&lt;/span>
&lt;span class="na">assume_role_policy&lt;/span> = &lt;span class="o">&amp;lt;&amp;lt;EOF&lt;/span>&lt;span class="s">
&lt;/span>&lt;span class="s">{
&lt;/span>&lt;span class="s"> &amp;#34;Version&amp;#34;: &amp;#34;2012-10-17&amp;#34;,
&lt;/span>&lt;span class="s"> &amp;#34;Statement&amp;#34;: [
&lt;/span>&lt;span class="s"> {
&lt;/span>&lt;span class="s"> &amp;#34;Action&amp;#34;: &amp;#34;sts:AssumeRole&amp;#34;,
&lt;/span>&lt;span class="s"> &amp;#34;Principal&amp;#34;: {
&lt;/span>&lt;span class="s"> &amp;#34;Service&amp;#34;: &amp;#34;lambda.amazonaws.com&amp;#34;
&lt;/span>&lt;span class="s"> },
&lt;/span>&lt;span class="s"> &amp;#34;Effect&amp;#34;: &amp;#34;Allow&amp;#34;
&lt;/span>&lt;span class="s"> }
&lt;/span>&lt;span class="s"> ]
&lt;/span>&lt;span class="s">}
&lt;/span>&lt;span class="s">&lt;/span>&lt;span class="o">EOF&lt;/span>
&lt;span class="p">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="aws-cloudwatch-events">AWS CloudWatch Events&lt;/h3>
&lt;p>This component is responsible for periodically invoking our Lambda function, with the required arguments passed in JSON format. It was also configured via Terraform, but for the sake of simplicity, below is a screenshot from the AWS Management Console showing the created CloudWatch Events rule:&lt;/p>
&lt;p>&lt;img src="cwe.png" alt="Cloudwatch Events Rule">&lt;/p>
&lt;p>The screenshot shows that the rule is configured to execute the target Lambda function every 2 minutes.&lt;/p>
&lt;h3 id="results">Results&lt;/h3>
&lt;p>In summary, it took me a while to get the hang of the Infrastructure as Code concept and apply it while working with Terraform on AWS, but I can definitely see how it benefits a bigger organisation that wants its cloud infrastructure to be stable and maintainable. IaC tools such as Terraform let developers define their infrastructure as code and check it in to version control for repeatable and more predictable deployment procedures. Now that I have this working project, a simple &lt;code>terraform apply&lt;/code> brings my service alive with all required components and permissions correctly set up in seconds, while &lt;code>terraform destroy&lt;/code> tears it down just as quickly if I choose to do so. This flexibility and ease of development can really speed up projects in the cloud.&lt;/p></description></item></channel></rss>