datadog | FLRNKS

Monitoring Flink on AWS EMR

Sun, 16 Aug 2020 11:11:00 +0000

Brief intro

This is going to be a somewhat unusual post on this blog. It is about a problem I recently encountered while trying to improve the monitoring of a long-running Flink cluster we have on AWS EMR, following the official documentation from Datadog.

The EMR setup

Our EMR cluster consumes 4 Kinesis Data Streams which are used to send S3 files in AVRO format for processing. When a new file arrives, the Flink job fetches it from S3, does some validation and filtering, then converts it to ORC format and saves it to a new location on S3. In early June we experienced a failure in one of the Flink jobs consuming a production stream. Sadly, we did not have adequate monitoring set up to detect this in time. We only learnt about it when we noticed that data in the output bucket was missing for certain dates. Our streams were configured with the maximum retention period of 7 days. By the time we noticed, unprocessed data was already piling up in the stream, and the oldest records had been there for close to half of this retention period. By the time we managed to find the root cause and deploy the fix to the Flink job, it was too late, and some data had already expired from the stream.

The existing monitoring solution was implemented via AWS Lambda functions running every 8 hours. These functions made Athena queries to check whether any data had arrived in the S3 bucket during the last 48 hours. The problem with this approach was that we would not get alerts about missing data for up to 2 days, because our query used a sliding window of 2 days.

The Flink cluster runs in a private VPC, so reaching the Flink Web UI to check the status of the jobs was quite difficult, to say the least. We either had to set up an SSH port forwarding session and use a FoxyProxy setup in Firefox, or set up a personal VM in the same private VPC via the AWS WorkSpaces managed service and then connect from that VM’s browser to the cluster’s Flink UI. Either way, connecting to the Flink UI to check the cluster health was cumbersome and still a manual process. I wanted an automated way of gathering metrics and alerting if something went wrong, so I looked into how Flink could be monitored by Datadog.

Datadog ❤️ Flink

A quick Google search turned up the official documentation from Datadog, where I found really straightforward instructions on enabling the submission of Flink metrics to Datadog, which could then be instantly visualized in their default Flink dashboard. The main steps are:

adding some new parameters to the flink-conf.yaml, such as the Datadog API/APP keys and custom tags
copying the flink-datadog-metrics.jar to the active flink installation path

The first step was quite easy. Our cluster was defined in Cloudformation where we used AWS::EMR::Cluster which allows specifying the flink-conf.yaml content as below:

Cluster:
  Type: AWS::EMR::Cluster
  Properties:
    Name: Flink-Cluster
    Configurations:
      - Classification: flink-conf
        ConfigurationProperties:
          metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
          metrics.reporter.dghttp.apikey: '{{resolve:secretsmanager:datadog/api_key:SecretString}}'
          metrics.reporter.dghttp.tags: name:flink-cluster, app:flink-cluster, region:eu-central-1, env:prod
          [...]

The above CF snippet shows just the 3 most important lines of the flink-conf.yaml: (1) the full package name of the java class which implements the metric submission, (2) the Datadog API key loaded from AWS Secrets Manager and (3) a few custom tags which will be added to metrics sent to Datadog.

To copy the necessary datadog-metrics JAR to the location it would be loaded from (/usr/lib/flink/lib), I added a new AWS::EMR::Step in CloudFormation, which is executed only on the EMR Master Node. Together with the Java class and API key supplied in the flink-conf.yaml, this activates Datadog monitoring on the cluster.

To test that it was working properly I just needed to redeploy the cluster which was surprisingly easy thanks to the Cloudformation setup we had in place. But something was still not right.

Know your continent

After redeploying the cluster I waited and waited and waited a bit more but metrics were not showing up in the Flink dashboard. So I got in touch with Datadog support who were very helpful in figuring out what the issue was. After a few rounds of emails back and forth we quickly discovered why the metrics were not showing up.

The reason was that we had our Datadog account set up in the EU region and not in the USA. Thus, all our metrics were supposed to flow to the EU endpoint at app.datadoghq.eu/api/ instead of the USA endpoint at app.datadoghq.com/api/. The difference is quite subtle, only a simple change in the TLD from .com to .eu. The catch was that our EMR cluster was running Flink 1.9.1 (provided by the EMR release 5.29.0) which had this API endpoint hardcoded, pointing to the USA data centre. The Datadog Support Engineer uncovered some extra instructions on how this can be solved by adding an extra line to the flink-conf.yaml to change the default US region to the EU instead:

Cluster:
  Type: AWS::EMR::Cluster
  Properties:
    Name: Flink-Cluster
    [...]
    Configurations:
      - Classification: flink-conf
        ConfigurationProperties:
          [...]
          metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
          metrics.reporter.dghttp.apikey: '{{resolve:secretsmanager:datadog/api_key:SecretString}}'
          metrics.reporter.dghttp.tags: name:flink-cluster, app:flink-cluster, region:eu-central-1, env:prod
          metrics.reporter.dghttp.dataCenter: EU # << points the metrics reporter to the EU region
          [...]
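To make the difference between the two regions concrete, here is a minimal Python sketch of the kind of switch involved. The helper name series_url and the lookup table are made up for illustration; only the two host names and the /api/v1/series path come from this post.

```python
# Hypothetical helper: the function name and mapping are illustrative only.
# The host names and the URL path are the ones mentioned in this post.
SERIES_URL_FORMAT = "https://app.datadoghq.{tld}/api/v1/series?api_key={api_key}"

TLD_BY_DATA_CENTER = {"US": "com", "EU": "eu"}


def series_url(data_center: str, api_key: str) -> str:
    """Return the metric submission URL for the given Datadog region."""
    try:
        tld = TLD_BY_DATA_CENTER[data_center.upper()]
    except KeyError:
        raise ValueError(f"unknown data centre: {data_center!r}") from None
    return SERIES_URL_FORMAT.format(tld=tld, api_key=api_key)


print(series_url("EU", "<api-key>"))
# https://app.datadoghq.eu/api/v1/series?api_key=<api-key>
```

A reporter that sends to the .com host while the account lives in the EU fails quietly from the user's point of view: nothing shows up on the dashboard. The dataCenter setting shown in the configuration above presumably performs a similar switch inside the reporter.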

The problem was that the dataCenter option was only available from Flink v1.11.0, while the highest version offered by EMR through the latest EMR release was only v1.10.0, so this was not going to work for me. I almost gave up on the idea of monitoring Flink via Datadog when I had the idea to clone the official Flink repository from GitHub and tweak the code of v1.9.1, which we were running, to change the hardcoded API endpoint from .com to .eu. It was much easier than I expected: I just needed to tweak this class slightly, ./src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java:

/**
 * Http client talking to Datadog.
 */
public class DatadogHttpClient {
    /* Changed endpoint for metric submission to use .eu instead of .com */
    private static final String SERIES_URL_FORMAT = "https://app.datadoghq.eu/api/v1/series?api_key=%s";
    /* Changed endpoint for API key validation to use .eu instead of .com */
    private static final String VALIDATE_URL_FORMAT = "https://app.datadoghq.eu/api/v1/validate?api_key=%s";
    ...
}

Once I had made the above code changes, I built a new JAR via mvn clean package. The new JAR was made available at ./flink-metrics/flink-metrics-datadog/target/flink-metrics-datadog-1.9.1.jar, which I then uploaded to an S3 bucket where my team stores such files. Next I slightly tweaked the AWS EMR step to load this JAR from S3 and redeployed the cluster once more. Finally, metrics started flowing! And it looked so nice. I was especially happy to see the TaskManager heap distribution, because the issue which sparked this whole endeavor was showing symptoms of heap memory problems.

Unfortunately this default dashboard was not perfect, as it had some graphs that were failing to show some data. Maybe it was because of using v1.9.1 of Flink instead of v1.11.0, not sure. In any case, I ended up cloning the dashboard and fixing the graphs manually, while also adding a few extras to show data about the AWS Kinesis streams which were feeding into the Flink cluster.

Now it shows very nicely the age of each Flink job, which was not visible at all on the default dashboard. The end result is much better in my opinion.

Conclusion

All in all, I am quite happy with how this whole story turned out in the end. Despite the issue with the API endpoints being hardcoded to the USA region in v1.9.1 of Flink, I managed to implement a simple workaround thanks to the Open Source nature of the project. The result is that we have much better visibility and monitoring implemented for our Flink cluster, which makes our lives in the DevOps world much better. I did not write much about it in this post, but once these metrics became available in our Datadog account it was trivial to set up a few Monitors which would alert us if, for example, one of the 4 Flink jobs was failing. I will leave it up to the reader to imagine how that’s done.

Cloud Service Testing

Fri, 17 Jan 2020 11:11:00 +0000

In this blog post I discuss a recent project I worked on to practice my skills related to AWS, Python and Datadog. It includes topics such as integration testing using pytest and localstack; running Continuous Integration via Travis-CI and infrastructure as code using Terraform.

Intro

For the sake of this blog post, let’s assume that a periodic job runs somewhere in the Cloud, outside the context of this application, which generates a file with some meta-data about the job itself. This data includes mostly numerical values, such as the number of images used to train an ML model, or the number of files processed, etc. This part is depicted on the below diagram as a dummy Lambda function that periodically uploads this metadata file to an S3 bucket with random numerical values.

When this file is uploaded, an event notification is sent to the message queue. The goal of the Python application is to periodically drain these messages from the queue. When the application runs, it fetches the S3 file referenced in each SQS message, parses the file’s contents and submits the numerical metrics to DataDog for the purpose of visualisation and alerting.
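The flow described above can be sketched roughly as follows. This is not the application's actual code: the function name drain_queue, the metric prefix, and the assumption that the metadata file is a flat JSON object are all made up for illustration. The client calls themselves (receive_message, delete_message, get_object) are standard boto3 methods, and the message body is assumed to follow the usual S3 event notification layout.

```python
import json
from urllib.parse import unquote_plus


def drain_queue(sqs, s3, statsd, queue_url, metric_prefix="cloud_job"):
    """Process every visible SQS message and return how many were handled.

    `sqs` and `s3` are boto3 clients (or test doubles with the same methods);
    `statsd` is anything with a gauge(metric, value) method.
    """
    handled = 0
    while True:
        response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        messages = response.get("Messages", [])
        if not messages:
            return handled
        for message in messages:
            # Each message body is an S3 event notification in JSON form.
            notification = json.loads(message["Body"])
            for record in notification.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                # Object keys arrive URL-encoded in S3 notifications.
                key = unquote_plus(record["s3"]["object"]["key"])
                body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
                for name, value in json.loads(body).items():
                    # Only numeric values become gauges (bool is excluded).
                    if isinstance(value, (int, float)) and not isinstance(value, bool):
                        statsd.gauge(f"{metric_prefix}.{name}", value)
            # Delete only after the message was fully processed.
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
            handled += 1
```

Passing the clients in as arguments, rather than creating them inside the function, is what makes it possible to swap in localstack clients or a mock statsd object during testing.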

Testing

Since the application interacts with two different APIs (AWS & Datadog), I figured it was a good idea to create integration tests that can be run easily via some free CI service (e.g.: Travis-CI.org). When writing the integration tests, I opted to create a simple mock class for testing the interaction with the Datadog API, and chose to rely on localstack for testing the interaction with the AWS API.

Thanks to localstack I could skip creating real resources in AWS and instead use free fake resources in a Docker container that mimic the real AWS API close to 100%. The AWS SDK for Python, boto3, is very easy to reconfigure to connect to the fake resources in localstack via the endpoint_url= parameter.

In the following sections I go through different phases of the project:

coding the python app
mocking Datadog statsd client
setting up AWS resources in localstack
creating integration tests
Travis-CI integration
running the datadog-agent locally
setting up real AWS resources
live testing

~ Coding the python app ~

The code is mainly composed of two Python classes with methods to interact with AWS and DataDog. The CloudResourceHandler class has methods to interact with S3 and SQS, which can be replaced in integration-tests with preconfigured boto3 clients for localstack.

The MetricSubmitter class uses the CloudResourceHandler internally and offers some additional methods for sending metrics to DataDog. Internally it uses statsd from the datadog python package, which can be replaced via dependency injection in integration tests with a mock statsd class that I created to test its interaction with the Datadog API.

To connect to the real AWS & Datadog APIs (via a preconfigured local datadog-agent), two environment variables need to be specified at run-time:

STATSD_HOST set to localhost
SQS_QUEUE_URL set to the URL of the Queue

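import os
import boto3
# Assumed imports: the post only states that statsd comes from the datadog package.
from datadog import statsd as datadog_statsd
from app.submitter import MetricSubmitter

os.environ['STATSD_HOST'] = 'localhost'
os.environ['SQS_QUEUE_URL'] = 'https://sqs.eu-central-1.amazonaws.com/????????????/cloud-job-results-queue'
session = boto3.Session(profile_name='profile-name')
MetricSubmitter(statsd=datadog_statsd,
                sqs_client=session.client('sqs'),
                s3_client=session.client('s3')).run()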

In addition, it also requires a preconfigured AWS profile in ~/.aws/credentials which is necessary for boto3 to authenticate to AWS:

[profile-name]
aws_access_key_id = XXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
region = eu-central-1

But before running it, let’s set up some integration tests!

~ Mocking Datadog statsd client ~

In truth, the application does not interact directly with the Datadog API, but rather it uses statsd from the datadog python package, which interacts with the local datadog-agent, which in turn forwards metrics and events to the Datadog API.

To test this flow that relies on statsd, I created a class called DataDogStatsDHelper. This class has 2 functions (gauge/event) with identical signatures to the real functions from the official datadog-statsd package. However, the mock functions do not send anything to the datadog-agent. Instead, they accumulate the values they were passed in local class variables:

class DataDogStatsDHelper:
    event_title = None
    event_text = None
    event_alert_type = None
    event_tags = None
    event_counter = 0
    gauge_metric_name = None
    gauge_metric_value = None
    gauge_tags = None
    gauge_counter = 0

    def event(self, title, text, alert_type=None, aggregation_key=None, source_type_name=None,
              date_happened=None, priority=None, tags=None, hostname=None):
        ...

    def gauge(self, metric, value, tags=None, sample_rate=None):
        ...

When the MetricSubmitter class is tested, this mock class is injected instead of the real statsd class. The tests can then make assertions on the recorded values and compare expectations with reality.
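The method bodies are elided above, so the following is only a guess at how such a recording mock might work, reduced to the gauge side. The class name FakeStatsd and the submit function are hypothetical stand-ins, not the project's real code.

```python
class FakeStatsd:
    """Records the most recent gauge call instead of sending it anywhere."""

    # Class-level state, mirroring the attribute names shown above.
    gauge_metric_name = None
    gauge_metric_value = None
    gauge_tags = None
    gauge_counter = 0

    def gauge(self, metric, value, tags=None, sample_rate=None):
        cls = type(self)
        cls.gauge_metric_name = metric
        cls.gauge_metric_value = value
        cls.gauge_tags = tags
        cls.gauge_counter += 1


def submit(statsd, metrics):
    """Hypothetical stand-in for the code under test."""
    for name, value in metrics.items():
        statsd.gauge(name, value, tags=["env:test"])


# Inject the fake, run the code, then inspect what was recorded.
fake = FakeStatsd()
submit(fake, {"files_processed": 42})
assert FakeStatsd.gauge_counter == 1
assert FakeStatsd.gauge_metric_name == "files_processed"
assert FakeStatsd.gauge_metric_value == 42
```

Because the state lives on the class rather than on an instance, it has to be reset between tests, otherwise counters from one test leak into the next.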

~ AWS resources in localstack ~

To test how the python app integrates with S3 and SQS, I decided to use localstack, running in a Docker container. To make it simple and repeatable, I created a docker-compose.yaml file that allows the configuration parameters to be defined in YAML:

version: '3.2'
services:
  localstack:
    image: localstack/localstack:latest
    container_name: localstack
    ports:
      - '4563-4599:4563-4599'
      - '8080:8080'
    environment:
      - SERVICES=s3,sqs
      - AWS_ACCESS_KEY_ID=foo
      - AWS_SECRET_ACCESS_KEY=bar

The resulting fake AWS resources are accessible via different ports on localhost. In this case, S3 runs on port 4572 and SQS on port 4576. Refer to the docs on GitHub for more details on ports used by other AWS services in localstack.
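As a small illustration of pointing boto3 at these ports, here is a sketch with two hypothetical helpers. The port numbers, region and dummy credentials are the ones used in this post; the helper names are made up. The endpoint_url, region_name and credential arguments are regular boto3.client() parameters.

```python
# Ports taken from the paragraph above; the helper names are hypothetical.
LOCALSTACK_PORTS = {"s3": 4572, "sqs": 4576}


def localstack_client_kwargs(service, host="localhost", region="eu-central-1"):
    """Build keyword arguments for boto3.client() that target localstack."""
    return {
        "service_name": service,
        "endpoint_url": f"http://{host}:{LOCALSTACK_PORTS[service]}",
        "region_name": region,
        # Dummy values, the same placeholders as in the docker-compose file.
        "aws_access_key_id": "foo",
        "aws_secret_access_key": "bar",
    }


def localstack_client(service, **overrides):
    """Create a boto3 client for localstack.

    Requires boto3 to be installed and the localstack container to be running.
    """
    import boto3  # imported lazily so the helper above works without boto3

    return boto3.client(**localstack_client_kwargs(service, **overrides))
```

With the container up, localstack_client("s3").list_buckets() would be a quick way to check connectivity.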

It is important to note that when localstack starts up, it is completely empty. Thus, before the integration tests can run, it is necessary to provision the S3 bucket and SQS queue in localstack, just as one would normally do it when using real AWS resources.

For this purpose, it’s possible to write a simple bash script that can be called from the localstack container as part of an automatic init script:

aws --endpoint-url=http://localhost:4572 s3api create-bucket --bucket "bucket-name" --region "eu-central-1"
aws --endpoint-url=http://localhost:4576 sqs create-queue --queue-name "queue-name" --region "eu-central-1" --attributes "MaximumMessageSize=4096,MessageRetentionPeriod=345600,VisibilityTimeout=30"

However, for the sake of making the integration-tests self-contained, I opted to integrate this into the tests as part of a class setup phase that runs before any tests and sets up the required S3 bucket and SQS queue:

@classmethod
def setUpClass(cls):
    cls.ls = LocalStackHelper()
    cls.ls.get_s3_client().create_bucket(Bucket=cls.s3_bucket_name)
    cls.ls.get_sqs_client().create_queue(QueueName=cls.sqs_queue_name)

~ Creating integration tests ~

As a next step I created the integration tests which use the fake AWS resources in localstack, as well as the mock statsd class for DataDog. I used two popular python packages to create these:

unittest which is a built-in package
pytest which is a 3rd party package

Actually, the test cases only use unittest, while pytest is used for the simple collection and execution of those tests. To get started with the unittest framework, I created a python class and implemented the test cases within this class:

import unittest

from app.utils.datadog_fake_statsd import DataDogStatsDHelper
from app.utils.localstack_helper import LocalStackHelper
from app.submitter import MetricSubmitter


class ProjectIntegrationTesting(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        ...

    def setUp(self):
        ...

    def test_ddg_submitter_valid_payload(self):
        ...

    def test_ddg_submitter_invalid_payload(self):
        ...

    def test_aws_handler_invalid_s3key(self):
        ...

    def test_aws_handler_valid_s3key(self):
        ...

In the setUpClass method, a few things are taken care of before tests can be executed:

define class variables for the bucket & the queue
create SQS & S3 clients using localstack endpoint url
provision needed resources (Queue/Bucket) in localstack

To test the interaction with DataDog via the statsd client, the submitter app is executed, which stores some values in the mock statsd class’s internal variables, which are then used in assertions to compare values with expectations.

The other tests inspect the behaviour of the CloudResourceHandler class. For example, one of the assertions tests whether the .has_available_messages() function returns false when there are no more messages in the queue.

A nice feature of unittest is that it’s easy to define tasks that need to be executed before each test, to ensure a clean slate for each test. For example, the code in the setUp method ensures two things:

the fake SQS queue is emptied before each test
class variables of the mock DataDog class are reset before each test
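The two clean-up steps could look roughly like this. It is a sketch under assumptions: FakeStatsd stands in for the mock statsd class, and the sqs_client and sqs_queue_url attributes are assumed to have been set by setUpClass. purge_queue is a regular boto3 SQS client method.

```python
import unittest


class FakeStatsd:
    """Stand-in for the mock statsd class, reduced to two counters."""

    gauge_counter = 0
    event_counter = 0


class QueueBackedTestCase(unittest.TestCase):
    # Assumed to be filled in by setUpClass, as described earlier.
    sqs_client = None
    sqs_queue_url = None

    def setUp(self):
        # 1. Drop any messages left behind by a previous test.
        self.sqs_client.purge_queue(QueueUrl=self.sqs_queue_url)
        # 2. Reset the class-level state of the mock statsd class.
        FakeStatsd.gauge_counter = 0
        FakeStatsd.event_counter = 0
```

One caveat: against real SQS, PurgeQueue may only be called once every 60 seconds per queue, so purging before every test suits a local fake like localstack rather than a live queue.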

Theoretically, it would be possible to run the test by running pytest -s -v in the python project’s root directory, however the tests rely on localstack, so they would fail…

~ Travis-CI integration ~

So now that the integration tests are created, I thought it would be really nice to have them run automatically in a CI service whenever someone pushes changes to the Git repo. To this end, I created a free account on travis-ci.org and integrated it with my GitHub repo by creating a .travis.yml file with the below initial content:

os: linux
language: python
python:
 - "3.8"
services:
 - docker
script:
 - {...}

However, I still needed a way to run localstack and then execute the integration tests within the CI environment. Luckily I found docker-compose to be a perfect fit for this purpose. I had already created a yaml file to describe how to run localstack, so now I could just simply add an extra container that would run my tests. Here is how I created a docker image to run the tests via docker-compose:

FROM python:3.8-alpine
WORKDIR /app
COPY ./requirements-test.txt ./
RUN apk add --no-cache --virtual .pynacl_deps build-base gcc make python3 python3-dev libffi-dev \
 && pip3 install --upgrade setuptools pip \
 && pip3 install --no-cache-dir -r requirements-test.txt \
 && rm requirements-test.txt
COPY ./utils/*.py ./utils/
COPY ./*.py ./
ENV LOCALSTACK_HOST localstack
ENTRYPOINT ["pytest", "-s", "-v"]

It installs the necessary dependencies to an alpine based python 3.8 image; adds the necessary source code, and finally executes pytest to collect & run the tests. Here are the updates I had to make to the docker-compose.yaml file:

version: '3.2'
services:
  localstack:
    {...}
  integration-tests:
    container_name: cloud-job-it
    build:
      context: .
      dockerfile: Dockerfile-tests
    depends_on:
      - "localstack"
Docker Compose auto-magically creates a shared network to enable connectivity between the defined services, which can call one-another by name. So when the tests are running in the cloud-job-it container, they can use the hostname localstack to create the boto3 session via the endpoint url to reach the fake AWS resources.

To make creating AWS clients for localstack easier, I used a package called localstack-python-client, so I don’t have to deal with port numbers and low-level details. However, this client by default tries to use localhost as the hostname, which wouldn’t work in my setup using docker-compose. After digging through the source code of this python package, I found a way to change this by setting an environment variable named LOCALSTACK_HOST.

As a final step, I just had to add two lines to complete the .travis.yml file:

script:
 - docker-compose up --build --abort-on-container-exit
 - docker-compose down -v --rmi all --remove-orphans

Thanks to the --abort-on-container-exit flag, docker-compose will return the same exit code which is returned from the container that first exits, which fits this use-case perfectly, as the cloud-job-it container only runs until the tests finish. This way the whole setup will gracefully shut down while preserving the exit code from the container, allowing the CI system to generate an alert if it’s not 0 (meaning some test failed).

~ Running the datadog-agent locally ~

Note: while Datadog is a paid service, it’s possible to create a trial account that’s free for 2 weeks, without the need to enter credit card details. This is pretty amazing!

Now that the integration tests are automated and passing, I wanted to run the datadog-agent locally, so that I could test the python application with some real data to be submitted to Datadog via the agent. Here is an article that was particularly useful to me, with instructions on how the agent should be set up.

While the option of running it in docker-compose was initially appealing, I eventually decided to just start it manually as a long-lived detached container. Here is how I went about doing that:

DOCKER_CONTENT_TRUST=1 docker run -d \
 --name dd-agent \
 -v /var/run/docker.sock:/var/run/docker.sock:ro \
 -v /proc/:/host/proc/:ro \
 -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
 -e DD_API_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \
 -e DD_SITE="datadoghq.eu" \
 -e DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true \
 -p 8125:8125/udp \
 datadog/agent:7

Most notable of these lines is the DD_API_KEY environment variable which ensures that whatever data I send to the agent is associated with my own account. In addition, since I am closest to the EU region, I had to specify the endpoint via the DD_SITE variable. Also, because I want the agent to accept metrics from the python app, I need to turn on a feature via the environment variable DD_DOGSTATSD_NON_LOCAL_TRAFFIC, as well as expose port 8125 from the docker container to the host machine:

 ▶ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
477cb2ea74b2 datadog/agent "/init" 3 days ago Up 3 days (healthy) 0.0.0.0:8125->8125/udp, 8126/tcp dd-agent

All seems to be well!

~ Deploying real AWS resources ~

Here I briefly discuss how I deployed some real resources in AWS to see my application running live. In a nutshell, I set the infra up as code in Terraform, which greatly simplified the whole process. All the necessary files are collected in a directory of my repository:

variables.tf defines some variables used in multiple places
init.tf initialisation of the AWS provider and definition of AWS resources
outputs.tf defines some values that are reported when deployment finishes

The first and last files are not very interesting. Most of the interesting stuff happens in the init.tf, which defines the necessary resources and permissions. One extra resource not mentioned before, is an AWS Lambda function, which gets executed every minute and is used to upload a JSON file to the S3 bucket. This acts as a random source of data, so that the python app has some work to do without manual intervention.
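For illustration, such a data-generating Lambda function might look roughly like the sketch below. The field names, the key prefix and the bucket name are placeholders, not the values used in the real project. put_object is a regular boto3 S3 client method.

```python
import json
import random
import time


def build_payload(rng=random):
    """Return job metadata with random numeric values (field names made up)."""
    return {
        "images_used_for_training": rng.randint(1_000, 10_000),
        "files_processed": rng.randint(1, 500),
    }


def upload_payload(s3_client, bucket, now=None):
    """Upload one JSON metadata file and return its S3 key."""
    timestamp = int(time.time() if now is None else now)
    key = f"job-results/{timestamp}.json"  # hypothetical key layout
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(build_payload()).encode("utf-8"),
        ContentType="application/json",
    )
    return key


def handler(event, context):
    """Lambda entry point; boto3 ships with the AWS Lambda Python runtime."""
    import boto3

    return upload_payload(boto3.client("s3"), bucket="bucket-name")
```

Keeping the upload logic in a function that receives the client makes it possible to exercise it against localstack as well, without touching the Lambda entry point.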

~ Live testing ~

Now that all parts seem to be ready, it’s time to run the main python app using the real S3 bucket and SQS queue, as well as the local datadog-agent. The console output provides some hints as to whether it’s able to pump the metrics from AWS to DataDog:

▶ python3 submitter.py
Initializing new Cloud Resource Handler with SQS URL - https://.../cloud-job-results-queue
Processing available messages in SQS queue:
- sending data to DataDog via statsd/datadog-agent.
- removing message from SQS (AQEBO37smPPHg6OIqbh3HMu3g...)
- ...
- sending data to DataDog via statsd/datadog-agent.
- removing message from SQS (AQEBV0/JzMVEP6k5kBmx2kvGn...)
No more messages visible in the queue, shutting down ...
Process finished with exit code 0

Next, I checked my DataDog account to see whether the metric data arrived. For this I created a custom Notebook with graphs to display them:

All seems to be well! The deployed AWS Lambda function has already run a few times, providing input data for the python app, which were successfully processed and forwarded to Datadog. As seen on the Notebook above, it is really easy to display metric data over time about any recurring workload, which can provide pretty useful insights into those jobs.

Furthermore, since DataDog also supports the submission of events, it becomes possible to design dashboards and create alerts which trigger based on more complex criteria, such as the presence or lack of events over certain periods of time. One such example can be seen below:

This is a so-called screen-board which I created to display the status of a Monitor that I set up previously. This Monitor tracks incoming events with the tag cloud_job_metric and generates an alert, if there is not at least one such event of type success in the last 30 minutes. The screen-board can be exported via a public URL if needed, or just simply displayed on a big screen somewhere in the office.
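The rule behind this Monitor can be spelled out in a few lines of Python. This only illustrates the criterion; the real evaluation happens inside Datadog, and the event dictionaries used here are a made-up shape, not Datadog's data model.

```python
from datetime import datetime, timedelta, timezone


def should_alert(events, now, window=timedelta(minutes=30)):
    """Return True when no recent 'success' event tagged cloud_job_metric exists.

    `events` is an iterable of dicts with 'timestamp' (aware datetime),
    'alert_type' and 'tags' keys; this shape is invented for the sketch.
    """
    cutoff = now - window
    for event in events:
        if (
            "cloud_job_metric" in event["tags"]
            and event["alert_type"] == "success"
            and cutoff <= event["timestamp"] <= now
        ):
            return False
    return True


now = datetime(2020, 1, 17, 12, 0, tzinfo=timezone.utc)
recent = {"timestamp": now - timedelta(minutes=5), "alert_type": "success",
          "tags": ["cloud_job_metric"]}
stale = {"timestamp": now - timedelta(minutes=45), "alert_type": "success",
         "tags": ["cloud_job_metric"]}
print(should_alert([recent], now))  # False
print(should_alert([stale], now))   # True
```

Alerting on the absence of success events, rather than on the presence of errors, also catches the case where the job silently stops running altogether.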

Conclusions

In this post I discussed a relatively complex project with lots of exciting technology working together in the realm of Cloud Computing. In the end, I was able to create DashBoards and Monitors in DataDog, which can ingest and display telemetry about AWS workloads, in a way that makes it useful to track and monitor the workloads themselves.