<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>datadog | FLRNKS</title><link>https://flrnks.netlify.app/tag/datadog/</link><atom:link href="https://flrnks.netlify.app/tag/datadog/index.xml" rel="self" type="application/rss+xml"/><description>datadog</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© 2024</copyright><lastBuildDate>Sun, 16 Aug 2020 11:11:00 +0000</lastBuildDate><image><url>https://flrnks.netlify.app/images/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url><title>datadog</title><link>https://flrnks.netlify.app/tag/datadog/</link></image><item><title>Monitoring Flink on AWS EMR</title><link>https://flrnks.netlify.app/post/emr-flink-datadog/</link><pubDate>Sun, 16 Aug 2020 11:11:00 +0000</pubDate><guid>https://flrnks.netlify.app/post/emr-flink-datadog/</guid><description>&lt;h2 id="brief-intro">Brief intro&lt;/h2>
&lt;p>This is going to be a somewhat unusual post on this blog. It is about a problem I recently encountered while trying to improve the monitoring of a long-running Flink cluster we have on AWS EMR, following the official
&lt;a href="https://docs.datadoghq.com/integrations/flink/" target="_blank" rel="noopener">documentation&lt;/a> from Datadog.&lt;/p>
&lt;h2 id="the-emr-setup">The EMR setup&lt;/h2>
&lt;p>Our EMR cluster consumes 4 Kinesis Data Streams which are used to send s3 files in AVRO format for processing. When a new file arrives, the Flink job will fetch it from S3, do some validation and filtering and then convert it to ORC format and save it to a new location on s3. In early June we experienced a failure in one of the Flink jobs consuming a production stream. Sadly we did not have adequate monitoring set up to detect this on time. We only learnt about it when we noticed that data in the output bucket was missing for certain dates. Our streams were configured with the maximum retention period of 7 days. By the time we noticed the missing data in the stream was already piling up, and the oldest was close to half of this retention period. By the time we managed to find the root cause and deploy the fix to the Flink job, it was too late, and some data had already expired from the stream.&lt;/p>
&lt;p>The existing monitoring solution was implemented via AWS Lambda functions running every 8 hours. These functions were making Athena queries to check if any data arrived to the S3 bucket during the last 48 hours. The problem with this was approach was that we do not get alerts about missing data for up to 2 days because of the way our query used a sliding window of 2 days.&lt;/p>
&lt;p>The Flink cluster runs in a private VPC, so reaching the Flink Web UI to check the status of the jobs was quite difficult to say the least. We either had to set up an SSH port forwarding session and use a FoxyProxy setup in Firefox, or set up a personal VM the same private VPC via the AWS WorkSpaces managed service and then connect from that VM&amp;rsquo;s browser to the cluster&amp;rsquo;s Flink UI. Either way it was quite cumbersome and still a manual process to connect to the Flink UI to check the cluster health. I wanted an automated way of gathering metrics and alerting if something went wrong, so I looked into how Flink could be monitored by Datadog.&lt;/p>
&lt;h2 id="datadog--flink">Datadog ❤️ Flink&lt;/h2>
&lt;p>A quick Google search threw up the official documentation from Datadog where I found really straightforward instructions on enabling the submission of Flink metrics to Datadog, which could be instantly visualized in their default Flink dashboard. These main steps are:&lt;/p>
&lt;ul>
&lt;li>adding some new parameters to the flink-conf.yaml, such as the Datadog API/APP keys and custom tags&lt;/li>
&lt;li>copying the &lt;code>flink-datadog-metrics.jar&lt;/code> to the active flink installation path&lt;/li>
&lt;/ul>
&lt;p>The first step was quite easy. Our cluster was defined in Cloudformation where we used &lt;code>AWS::EMR::Cluster&lt;/code> which allows specifying the flink-conf.yaml content as below:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="k">Cluster&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>AWS&lt;span class="p">::&lt;/span>EMR&lt;span class="p">::&lt;/span>Cluster&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Properties&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>Flink-Cluster&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Configurations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="k">Classification&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>flink-conf&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ConfigurationProperties&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.class&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>org.apache.flink.metrics.datadog.DatadogHttpReporter&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.apikey&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;{{resolve:secretsmanager:datadog/api_key:SecretString}}&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.tags&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>name&lt;span class="p">:&lt;/span>flink-cluster&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>app&lt;span class="p">:&lt;/span>flink-cluster&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>region&lt;span class="p">:&lt;/span>eu-central&lt;span class="m">-1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>env&lt;span class="p">:&lt;/span>prod&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>...&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The above CF snippet shows just the 3 most important lines of the &lt;strong>flink-conf.yaml&lt;/strong>: (1) the full package name of the java class which implements the metric submission, (2) the Datadog API key loaded from AWS Secrets Manager and (3) a few custom tags which will be added to metrics sent to Datadog.&lt;/p>
&lt;p>To copy the necessary datadog-metrics JAR where it would be loaded from (&lt;code>/usr/lib/flink/lib&lt;/code>), I added a new &lt;code>AWS::EMR::Step&lt;/code> to in CloudFormation which is executed only on the EMR Master Node in order to activate Datadog monitoring on the cluster via the supplied Java class and API key in the &lt;strong>flink-conf.yaml&lt;/strong>.&lt;/p>
&lt;p>To test that it was working properly I just needed to redeploy the cluster which was surprisingly easy thanks to the Cloudformation setup we had in place. But something was still not right.&lt;/p>
&lt;h2 id="know-your-continent">Know your continent&lt;/h2>
&lt;p>After redeploying the cluster I waited and waited and waited a bit more but metrics were not showing up in the Flink dashboard. So I got in touch with Datadog support who were very helpful in figuring out what the issue was. After a few rounds of emails back and forth we quickly discovered why the metrics were not showing up.&lt;/p>
&lt;p>The reason was that we had our Datadog account set up in the EU region and not in the USA. Thus, all our metrics were supposed to flow to the EU endpoint at &lt;code>app.datadoghq.eu/api/&lt;/code> instead of the USA endpoint at &lt;code>app.datadoghq.com/api/&lt;/code>. The difference is quite subtle, only a simple change in the TLD from &lt;strong>.com&lt;/strong> to &lt;strong>.eu&lt;/strong>. The catch was that our EMR cluster was running Flink 1.9.1 (provided by the EMR release 5.29.0) which had this API endpoint hardcoded, pointing to the USA data centre. The Datadog Support Engineer uncovered some extra
&lt;a href="https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#datadog-orgapacheflinkmetricsdatadogdatadoghttpreporter" target="_blank" rel="noopener">instructions&lt;/a> on how this can be solved by adding an extra line to the &lt;strong>flink-conf.yaml&lt;/strong> to change the default US region to the EU instead:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="k">Cluster&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Type&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>AWS&lt;span class="p">::&lt;/span>EMR&lt;span class="p">::&lt;/span>Cluster&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Properties&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>Flink-Cluster&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>...&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">Configurations&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="k">Classification&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>flink-conf&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ConfigurationProperties&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>...&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.class&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>org.apache.flink.metrics.datadog.DatadogHttpReporter&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.apikey&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;{{resolve:secretsmanager:datadog/api_key:SecretString}}&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.tags&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>name&lt;span class="p">:&lt;/span>flink-cluster&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>app&lt;span class="p">:&lt;/span>flink-cluster&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>region&lt;span class="p">:&lt;/span>eu-central&lt;span class="m">-1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>env&lt;span class="p">:&lt;/span>prod&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">metrics.reporter.dghttp.dataCenter&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>EU&lt;span class="w"> &lt;/span>&lt;span class="c"># &amp;lt;&amp;lt; points the metrics reported to the EU region&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>...&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The problem was that this was only available in Flink v1.11.0 while the highest version offered by EMR through the latest EMR Release was only v1.10.0, so this was not going to work for me. I almost gave up on the idea of monitoring Flink via Datadog when I had the idea to clone the official Flink repository from Github and tweak the code in v1.9.1 which we were running to change the hardcoded API endpoint from &lt;strong>.com&lt;/strong> to &lt;strong>.eu&lt;/strong>. It was much easier than I expected, I just needed to tweak this class slightly &lt;code>./src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-java" data-lang="java">&lt;span class="cm">/**
&lt;/span>&lt;span class="cm"> * Http client talking to Datadog.
&lt;/span>&lt;span class="cm"> */&lt;/span>
&lt;span class="kd">public&lt;/span> &lt;span class="kd">class&lt;/span> &lt;span class="nc">DatadogHttpClient&lt;/span> &lt;span class="o">{&lt;/span>
&lt;span class="cm">/* Changed endpoint for metric submission to use .eu instead of .com */&lt;/span>
&lt;span class="kd">private&lt;/span> &lt;span class="kd">static&lt;/span> &lt;span class="kd">final&lt;/span> &lt;span class="n">String&lt;/span> &lt;span class="n">SERIES_URL_FORMAT&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;https://app.datadoghq.eu/api/v1/series?api_key=%s&amp;#34;&lt;/span>&lt;span class="o">;&lt;/span>
&lt;span class="cm">/* Changed endpoint for API key validation to use .eu instead of .com */&lt;/span>
&lt;span class="kd">private&lt;/span> &lt;span class="kd">static&lt;/span> &lt;span class="kd">final&lt;/span> &lt;span class="n">String&lt;/span> &lt;span class="n">VALIDATE_URL_FORMAT&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">&amp;#34;https://app.datadoghq.eu/api/v1/validate?api_key=%s&amp;#34;&lt;/span>&lt;span class="o">;&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="o">}&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Once I made the above code changes, I built a new JAR via &lt;code>mvn clean package&lt;/code>. The new JAR was made available at &lt;strong>./flink-metrics/flink-metrics-datadog/target/flink-metrics-datadog-1.9.1.jar&lt;/strong> which I then uploaded to an S3 bucket where we store such files in my team. Next I slightly tweaked the AWS EMR step to load this JAR from S3 redeployed the cluster once more. Finally, metrics started flowing! And it looked so nice, I was especially happy to see the TaskManager heap distribution, because the issue which sparked this whole endeavor was showing symptoms of Heap Memory issues.&lt;/p>
&lt;p>&lt;img src="./images/default-dashboard.png" alt="Default Datadog Flink Dashboard">&lt;/p>
&lt;p>Unfortunately this default dashboard was not perfect, as it had some graphs that were failing to show some data. Maybe it was because of using v1.9.1 of Flink instead of v1.11.0, not sure. In any case, I ended up cloning the dashboard and fixing the graphs manually, while also adding a few extras to show data about the AWS Kinesis streams which were feeding into the Flink cluster.&lt;/p>
&lt;p>&lt;img src="./images/custom-dashboard.jpg" alt="Custom Datadog Flink dashboard">&lt;/p>
&lt;p>Now it shows very nicely the age of each Flink job, which was not visible at all on the default dashboard. The end result is much better in my opinion.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>All in all, I am quite happy with how this whole story turned out in the end. Despite the issue with the hardcoded API endpoints to the USA region in v1.9.1 of Flink, I managed to implement a simple workaround thanks to the Open Source nature of the project. The result is that we have much better visibility and monitoring implemented for our Flink cluster which makes our lives in the DevOps world much better. I did not write much about it in this post, but once these metrics became available in our Datadog account it was trivial to set up a few Monitors which would alert us if for example one of the 4 Flink jobs were failing. I will leave it up to the reader to imagine how that&amp;rsquo;s done.&lt;/p></description></item><item><title>Cloud Service Testing</title><link>https://flrnks.netlify.app/post/python-aws-datadog-testing/</link><pubDate>Fri, 17 Jan 2020 11:11:00 +0000</pubDate><guid>https://flrnks.netlify.app/post/python-aws-datadog-testing/</guid><description>&lt;p>In this blog post I discuss a recent project I worked on to practice my skills related to AWS, Python and Datadog. It includes topics such as integration testing using &lt;strong>pytest&lt;/strong> and &lt;strong>localstack&lt;/strong>; running Continuous Integration via &lt;strong>Travis-CI&lt;/strong> and infrastructure as code using &lt;strong>Terraform&lt;/strong>.&lt;/p>
&lt;h2 id="intro">Intro&lt;/h2>
&lt;p>For the sake of this blog post, let&amp;rsquo;s assume that a periodic job runs somewhere in the Cloud, outside the context of this application, which generates a file with some meta-data about the job itself. This data includes mostly numerical values, such as the number of images used to train an ML model, or the number of files processed, etc. This part is depicted on the below diagram as a dummy Lambda function that periodically uploads this metadata file to an S3 bucket with random numerical values.&lt;/p>
&lt;p>&lt;img src="img/arch.png" alt="Architecture">&lt;/p>
&lt;p>When this file is uploaded, an event notification is sent to the message queue. The goal of the Python application is to periodically drain these messages from the queue. When the application runs, it fetches the S3 file referenced in each SQS message, parses the file&amp;rsquo;s contents and submits the numerical metrics to DataDog for the purpose of visualisation and alerting.&lt;/p>
&lt;h2 id="testing">Testing&lt;/h2>
&lt;p>Since the application interacts with two different APIs (AWS &amp;amp; Datadog), I figured it was a good idea to create integration tests that can be run easily via some free CI service (e.g.: Travis-CI.org). When writing the integration tests, I opted to create a simple mock class for testing the interaction with the Datadog API, and chose to rely on &lt;strong>localstack&lt;/strong> for testing the interaction with the AWS API.&lt;/p>
&lt;p>Thanks to &lt;strong>localstack&lt;/strong> I could skip creating real resources in AWS and instead use free fake resources in a docker container, that mimic the real AWS API close to 100%. The AWS SDK called &lt;code>boto3&lt;/code> is very easy to reconfigure to connect to the fake resources in &lt;strong>localstack&lt;/strong> with the &lt;code>endpoint_url=&lt;/code> parameter.&lt;/p>
&lt;p>In the following sections I go through different phases of the project:&lt;/p>
&lt;ol>
&lt;li>coding the python app&lt;/li>
&lt;li>mocking Datadog statsd client&lt;/li>
&lt;li>setting up AWS resources in localstack&lt;/li>
&lt;li>creating integration tests&lt;/li>
&lt;li>Travis-CI integration&lt;/li>
&lt;li>running the datadog-agent locally&lt;/li>
&lt;li>setting up real AWS resources&lt;/li>
&lt;li>live testing&lt;/li>
&lt;/ol>
&lt;h3 id="-coding-the-python-app-">~ Coding the python app ~&lt;/h3>
&lt;p>The
&lt;a href="https://github.com/florianakos/python-testing/blob/master/app/submitter.py" target="_blank" rel="noopener">code&lt;/a> is mainly composed of two Python classes with methods to interact with AWS and DataDog. The &lt;strong>CloudResourceHandler&lt;/strong> class has methods to interact with S3 and SQS, which can be replaced in integration-tests with preconfigured &lt;code>boto3&lt;/code> clients for &lt;strong>localstack&lt;/strong>.&lt;/p>
&lt;p>The &lt;strong>MetricSubmitter&lt;/strong> class uses the &lt;strong>CloudResourceHandler&lt;/strong> internally and offers some additional methods for sending metrics to DataDog. Internally it uses statsd from the &lt;code>datadog&lt;/code> python
&lt;a href="https://pypi.org/project/datadog/" target="_blank" rel="noopener">package&lt;/a>, which can be replaced via dependency injection in integration tests with a mock statsd class that I created to test its interaction with the Datadog API.&lt;/p>
&lt;p>To connect to the real AWS &amp;amp; Datadog APIs (via a preconfigured local datadog-agent) there needs to be two environment variables specified at run-time:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>STATSD_HOST&lt;/strong> set to &lt;code>localhost&lt;/code>&lt;/li>
&lt;li>&lt;strong>SQS_QUEUE_URL&lt;/strong> set to the URL of the Queue&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">environ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;STATSD_HOST&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;localhost&amp;#39;&lt;/span>
&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">environ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;SQS_QUEUE_URL&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;https://sqs.eu-central-1.amazonaws.com/????????????/cloud-job-results-queue&amp;#39;&lt;/span>
&lt;span class="n">session&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">boto3&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Session&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">profile_name&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;profile-name&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;span class="n">MetricSubmitter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">statsd&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">datadog_statsd&lt;/span>&lt;span class="p">,&lt;/span>
&lt;span class="n">sqs_client&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">session&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">client&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;sqs&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span>
&lt;span class="n">s3_client&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">session&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">client&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;s3&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">run&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>In addition, it also requires a preconfigured AWS profile in &lt;code>~/.aws/credentials&lt;/code> which is necessary for &lt;strong>boto3&lt;/strong> to authenticate to AWS:&lt;/p>
&lt;pre>&lt;code class="language-console" data-lang="console">[profile-name]
aws_access_key_id = XXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
region = eu-central-1
&lt;/code>&lt;/pre>&lt;p>But before running it, let&amp;rsquo;s set up some integration tests!&lt;/p>
&lt;h3 id="-mocking-datadog-statsd-client-">~ Mocking Datadog statsd client ~&lt;/h3>
&lt;p>In truth, the application does not interact directly with the Datadog API, but rather it uses &lt;strong>statsd&lt;/strong> from the &lt;code>datadog&lt;/code> python package, which interacts with the local &lt;code>datadog-agent&lt;/code>, which in turn forwards metrics and events to the Datadog API.&lt;/p>
&lt;p>To test this flow that relies on &lt;code>statsd&lt;/code>, I created a class called &lt;strong>DataDogStatsDHelper&lt;/strong>. This class has 2 functions (&lt;strong>gauge/event&lt;/strong>) with identical signatures to the real functions from the official &lt;code>datadog-statsd&lt;/code> package. However, the mock functions do not send anything to the &lt;code>datadog-agent&lt;/code>. Instead, they accumulate the values they were passed in local class variables:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="k">class&lt;/span> &lt;span class="nc">DataDogStatsDHelper&lt;/span>&lt;span class="p">:&lt;/span>
&lt;span class="n">event_title&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">event_text&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">event_alert_type&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">event_tags&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">event_counter&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;span class="n">gauge_metric_name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">gauge_metric_value&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">gauge_tags&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">None&lt;/span>
&lt;span class="n">gauge_counter&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">event&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">title&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">text&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">alert_type&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">aggregation_key&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">source_type_name&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span>
&lt;span class="n">date_happened&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">priority&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">tags&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">hostname&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">gauge&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">metric&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">value&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">tags&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sample_rate&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>When the MetricSubmitter class is tested, this mock class is injected instead of the real &lt;strong>statsd&lt;/strong> class, which enables assertions to be made and compare expectations with reality.&lt;/p>
&lt;h3 id="-aws-resources-in-localstack-">~ AWS resources in localstack ~&lt;/h3>
&lt;p>To test how the python app integrates with S3 and SQS, I decided to use &lt;strong>loalstack&lt;/strong>, running in a Docker container. To make it simple and repeatable, I created a &lt;code>docker-compose.yaml&lt;/code> file that allows the configuration parameters to be defined in YAML:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yml" data-lang="yml">&lt;span class="k">version&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;3.2&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">services&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">localstack&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>localstack/localstack&lt;span class="p">:&lt;/span>latest&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">container_name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>localstack&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ports&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="s1">&amp;#39;4563-4599:4563-4599&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="s1">&amp;#39;8080:8080&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">environment&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- SERVICES=s3&lt;span class="p">,&lt;/span>sqs&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- AWS_ACCESS_KEY_ID=foo&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- AWS_SECRET_ACCESS_KEY=bar&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The resulting fake AWS resources are accessible via different ports on localhost. In this case, S3 runs on port &lt;strong>4572&lt;/strong> and SQS on port &lt;strong>4576&lt;/strong>. Refer to the
&lt;a href="https://github.com/localstack/localstack#overview" target="_blank" rel="noopener">docs&lt;/a> on GitHub for more details on ports used by other AWS services in localstack.&lt;/p>
&lt;p>It is important to note that when localstack starts up, it is completely empty. Thus, before the integration tests can run, it is necessary to provision the S3 bucket and SQS queue in localstack, just as one would normally do it when using real AWS resources.&lt;/p>
&lt;p>For this purpose, it&amp;rsquo;s possible to write a simple bash script that can be called from the localstack container as part of an automatic init script:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">aws --endpoint-url&lt;span class="o">=&lt;/span>http://localhost:4572 s3api create-bucket --bucket &lt;span class="s2">&amp;#34;bucket-name&amp;#34;&lt;/span> --region &lt;span class="s2">&amp;#34;eu-central-1&amp;#34;&lt;/span>
aws --endpoint-url&lt;span class="o">=&lt;/span>http://localhost:4576 sqs create-queue --queue-name &lt;span class="s2">&amp;#34;queue-name&amp;#34;&lt;/span> --region &lt;span class="s2">&amp;#34;eu-central-1&amp;#34;&lt;/span> --attributes &lt;span class="s2">&amp;#34;MaximumMessageSize=4096,MessageRetentionPeriod=345600,VisibilityTimeout=30&amp;#34;&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>However, for the sake of making the integration-tests self-contained, I opted to integrate this into the tests as part of a class setup phase that runs before any tests and sets up the required S3 bucket and SQS queue:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="nd">@classmethod&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">setUpClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">cls&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="bp">cls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ls&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">LocalStackHelper&lt;/span>&lt;span class="p">()&lt;/span>
&lt;span class="bp">cls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_s3_client&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">create_bucket&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Bucket&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">cls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">s3_bucket_name&lt;/span>&lt;span class="p">)&lt;/span>
&lt;span class="bp">cls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_sqs_client&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">create_queue&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">QueueName&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="bp">cls&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sqs_queue_name&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="-creating-integration-tests-">~ Creating integration tests ~&lt;/h3>
&lt;p>As a next step I created the integration
&lt;a href="https://github.com/florianakos/python-testing/blob/master/app/test_submitter.py" target="_blank" rel="noopener">tests&lt;/a> which use the fake AWS resources in localstack, as well as the mock &lt;strong>statsd&lt;/strong> class for DataDog. I used two popular python packages to create these:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>unittest&lt;/strong> which is a built-in package&lt;/li>
&lt;li>&lt;strong>pytest&lt;/strong> which is a 3rd party package&lt;/li>
&lt;/ul>
&lt;p>Actually, the test cases only use &lt;strong>unittest&lt;/strong>, while &lt;strong>pytest&lt;/strong> is used for the simple collection and execution of those tests. To get started with the &lt;strong>unittest&lt;/strong> framework, I created a python class and implemented the test cases within this class:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="kn">import&lt;/span> &lt;span class="nn">unittest&lt;/span>
&lt;span class="kn">from&lt;/span> &lt;span class="nn">app.utils.datadog_fake_statsd&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">DataDogStatsDHelper&lt;/span>
&lt;span class="kn">from&lt;/span> &lt;span class="nn">app.utils.localstack_helper&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">LocalStackHelper&lt;/span>
&lt;span class="kn">from&lt;/span> &lt;span class="nn">app.submitter&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">MetricSubmitter&lt;/span>
&lt;span class="k">class&lt;/span> &lt;span class="nc">ProjectIntegrationTesting&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">unittest&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TestCase&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="nd">@classmethod&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">setUpClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">cls&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">setUp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">test_ddg_submitter_valid_payload&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">test_ddg_submitter_invalid_payload&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">test_aws_handler_invalid_s3key&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;span class="k">def&lt;/span> &lt;span class="nf">test_aws_handler_valid_s3key&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;span class="o">...&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>In the &lt;strong>setUpClass&lt;/strong> method, a few things are taken care of before tests can be executed:&lt;/p>
&lt;ul>
&lt;li>define class variables for the bucket &amp;amp; the queue&lt;/li>
&lt;li>create SQS &amp;amp; S3 clients using localstack endpoint url&lt;/li>
&lt;li>provision needed resources (Queue/Bucket) in localstack&lt;/li>
&lt;/ul>
&lt;p>To test the interaction with DataDog via the statsd client, the submitter app is executed, which stores some values in the mock &lt;strong>statsd&lt;/strong> class&amp;rsquo;s internal variables, which are then used in assertions to compare values with expectations.&lt;/p>
&lt;p>The other tests inspect the behaviour of the &lt;strong>CloudResourceHandler&lt;/strong> class. For example, one of the assertions tests whether the &lt;code>.has_available_messages()&lt;/code> function returns false when there are no more messages in the queue.&lt;/p>
&lt;p>A nice feature of &lt;strong>unittest&lt;/strong> is that it&amp;rsquo;s easy to define tasks that need to be executed before each test, to ensure a clean slate for each test. For example, the code in the &lt;strong>setUp&lt;/strong> method ensures two things:&lt;/p>
&lt;ul>
&lt;li>the fake SQS queue is emptied before each test&lt;/li>
&lt;li>class variables of the mock DataDog class are reset before each test&lt;/li>
&lt;/ul>
&lt;p>Theoretically, it would be possible to run the test by running &lt;code>pytest -s -v&lt;/code> in the python project&amp;rsquo;s root directory, however the tests rely on localstack, so they would fail&amp;hellip;&lt;/p>
&lt;h3 id="-travis-ci-integration-">~ Travis-CI integration ~&lt;/h3>
&lt;p>So now that the integration tests are created, I thought it would be really nice to have them automatically run in a CI service, whenever someone pushes changes to the Git repo. To this end, I created a free account on &lt;code>travis-ci.org&lt;/code> and integrated it with my github rep by creating a &lt;strong>.travis.yaml&lt;/strong> file with the below initial content:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="k">os&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>linux&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">language&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>python&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">python&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;3.8&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">services&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- docker&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">script&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- {...}&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>However, I still needed a way to run &lt;code>localstack&lt;/code> and then execute the integration tests within the CI environment. Luckily I found &lt;strong>docker-compose&lt;/strong> to be a perfect fit for this purpose. I had already created a yaml file to describe how to run &lt;code>localstack&lt;/code>, so now I could just simply add an extra container that would run my tests. Here is how I created a docker image to run the tests via docker-compose:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-dockerfile" data-lang="dockerfile">&lt;span class="k">FROM&lt;/span>&lt;span class="s"> python:3.8-alpine&lt;/span>&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">WORKDIR&lt;/span>&lt;span class="s"> /app&lt;/span>&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">COPY&lt;/span> ./requirements-test.txt ./&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">RUN&lt;/span> apk add --no-cache --virtual .pynacl_deps build-base gcc make python3 python3-dev libffi-dev &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> &lt;span class="o">&amp;amp;&amp;amp;&lt;/span> pip3 install --upgrade setuptools pip &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> &lt;span class="o">&amp;amp;&amp;amp;&lt;/span> pip3 install --no-cache-dir -r requirements-test.txt &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> &lt;span class="o">&amp;amp;&amp;amp;&lt;/span> rm requirements-test.txt&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">COPY&lt;/span> ./utils/*.py ./utils/&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">COPY&lt;/span> ./*.py ./&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">ENV&lt;/span> LOCALSTACK_HOST localstack&lt;span class="err">
&lt;/span>&lt;span class="err">&lt;/span>&lt;span class="k">ENTRYPOINT&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;pytest&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;-s&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;-v&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="err">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>It installs the necessary dependencies to an alpine based python 3.8 image; adds the necessary source code, and finally executes &lt;strong>pytest&lt;/strong> to collect &amp;amp; run the tests. Here are the updates I had to make to the &lt;strong>docker-compose.yaml&lt;/strong> file:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="k">version&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;3.2&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w">&lt;/span>&lt;span class="k">services&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">localstack&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>{...}&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">integration-tests&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">container_name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>cloud-job-it&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">build&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">context&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>.&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">dockerfile&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>Dockerfile-tests&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">depends_on&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;localstack&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Docker Compose auto-magically creates a shared network to enable connectivity between the defined services, which can call one-another by name. So when the tests are running in the &lt;strong>cloud-job-it&lt;/strong> container, they can use the hostname &lt;strong>localstack&lt;/strong> to create the &lt;strong>boto3&lt;/strong> session via the endpoint url to reach the fake AWS resources.&lt;/p>
&lt;p>For easier to creation of AWS clients via localstack, I used a package called
&lt;a href="https://github.com/localstack/localstack-python-client" target="_blank" rel="noopener">localstack-python-client&lt;/a>, so I don&amp;rsquo;t have to deal with port numbers and low level details. However, this client by default tries to use &lt;strong>localhost&lt;/strong> as the hostname, which wouldn&amp;rsquo;t work in my setup using docker-compose. After digging through the source-code of this python package, I found a way to change this by setting an environment variable named &lt;strong>LOCALSTACK_HOST&lt;/strong>.&lt;/p>
&lt;p>As a final step, I just had to add two lines to complete to the &lt;strong>.travis.yaml&lt;/strong> file:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="k">script&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- docker-compose&lt;span class="w"> &lt;/span>up&lt;span class="w"> &lt;/span>--build&lt;span class="w"> &lt;/span>--abort-on-container-exit&lt;span class="w">
&lt;/span>&lt;span class="w"> &lt;/span>- docker-compose&lt;span class="w"> &lt;/span>down&lt;span class="w"> &lt;/span>-v&lt;span class="w"> &lt;/span>--rmi&lt;span class="w"> &lt;/span>all&lt;span class="w"> &lt;/span>--remove-orphans&lt;span class="w">
&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Thanks to the &lt;code>--abort-on-container-exit&lt;/code> flag, docker-compose will return the same exit code which is returned from the container that first exits, which first this use-case perfectly, as the &lt;strong>cloud-job-it&lt;/strong> container only runs until the tests finish. This way the whole setup will gracefully shut down, while preserving the exit code from the container, allowing the CI system to generate an alert if it&amp;rsquo;s not 0 (meaning some test failed).&lt;/p>
&lt;h3 id="-running-the-datadog-agent-locally-">~ Running the datadog-agent locally ~&lt;/h3>
&lt;p>&lt;strong>Note&lt;/strong>: while Datadog is a paid service, it&amp;rsquo;s possible to create a trial account that&amp;rsquo;s free for 2 weeks, without the need to enter credit card details. This is pretty amazing!&lt;/p>
&lt;p>Now that the integration tests are automated and passing, I wanted to run the &lt;code>datadog-agent&lt;/code> locally, so that I can test the python application with some real data that was to he submitted to Datadog via the agent. Here is an
&lt;a href="https://docs.datadoghq.com/getting_started/agent/?tab=datadogeusite" target="_blank" rel="noopener">article&lt;/a> that was particularly useful to me, with instructions on how the agent should be set up.&lt;/p>
&lt;p>While the option of running it in docker-compose was initially appealing, I eventually decided to just start it manually as a long-lived detached container. Here is how I went about doing that:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="nv">DOCKER_CONTENT_TRUST&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">1&lt;/span> docker run -d &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> --name dd-agent &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -v /var/run/docker.sock:/var/run/docker.sock:ro &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -v /proc/:/host/proc/:ro &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -e &lt;span class="nv">DD_API_KEY&lt;/span>&lt;span class="o">=&lt;/span>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -e &lt;span class="nv">DD_SITE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;datadoghq.eu&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -e &lt;span class="nv">DD_DOGSTATSD_NON_LOCAL_TRAFFIC&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> -p 8125:8125/udp &lt;span class="se">\
&lt;/span>&lt;span class="se">&lt;/span> datadog/agent:7
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Most notable of these lines is the &lt;strong>DD_API_KEY&lt;/strong> environment variable which ensures that whatever data I send to the agent is associated with my own account. In addition, since I am closest to the EU region, I had to specify the endpoint via the &lt;strong>DD_SITE&lt;/strong> variable. Also, because I want the agent to accept metrics from the python app, I need to turn on a feature via the environment variable &lt;strong>DD_DOGSTATSD_NON_LOCAL_TRAFFIC&lt;/strong>, as well as expose port 8125 from the docker container to the host machine:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-bash" data-lang="bash"> ▶ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
477cb2ea74b2 datadog/agent &lt;span class="s2">&amp;#34;/init&amp;#34;&lt;/span> &lt;span class="m">3&lt;/span> days ago Up &lt;span class="m">3&lt;/span> days &lt;span class="o">(&lt;/span>healthy&lt;span class="o">)&lt;/span> 0.0.0.0:8125-&amp;gt;8125/udp, 8126/tcp dd-agent
&lt;/code>&lt;/pre>&lt;/div>&lt;p>All seems to be well!&lt;/p>
&lt;h3 id="-deploying-real-aws-resources-">~ Deploying real AWS resources ~&lt;/h3>
&lt;p>Here I briefly discuss how I deployed some real resources in AWS to see my application running live. In a nutshell, I set the infra up as code in Terraform, which greatly simplified the whole process. All the necessary files are collected in a
&lt;a href="https://github.com/florianakos/python-testing/tree/master/terraform" target="_blank" rel="noopener">directory&lt;/a> of my repository:&lt;/p>
&lt;ul>
&lt;li>&lt;code>variables.tf&lt;/code> defines some variables used in multiple places&lt;/li>
&lt;li>&lt;code>init.tf&lt;/code> initialisation of the AWS provider and definition of AWS resources&lt;/li>
&lt;li>&lt;code>outputs.tf&lt;/code> defines some values that are reported when deployment finishes&lt;/li>
&lt;/ul>
&lt;p>The first and last files are not very interesting. Most of the interesting stuff happens in the &lt;strong>init.tf&lt;/strong>, which defines the necessary resources and permissions. One extra resource not mentioned before, is an AWS Lambda function, which gets executed every minute and is used to upload a JSON file to the S3 bucket. This acts as a random source of data, so that the python app has some work to do without manual intervention.&lt;/p>
&lt;h3 id="-live-testing-">~ Live testing ~&lt;/h3>
&lt;p>Now that all parts seem to be ready, it&amp;rsquo;s time to run the main python app using the real S3 bucket and SQS queue, as well as the local datadog-agent. The console output provides some hints whether it&amp;rsquo;s able to pump the metrics from AWS to a DataDog:&lt;/p>
&lt;div class="highlight">&lt;pre class="chroma">&lt;code class="language-bash" data-lang="bash">▶ python3 submitter.py
Initializing new Cloud Resource Handler with SQS URL - https://.../cloud-job-results-queue
Processing available messages in SQS queue:
- sending data to DataDog via statsd/datadog-agent.
- removing message from SQS &lt;span class="o">(&lt;/span>AQEBO37smPPHg6OIqbh3HMu3g...&lt;span class="o">)&lt;/span>
- ...
- sending data to DataDog via statsd/datadog-agent.
- removing message from SQS &lt;span class="o">(&lt;/span>AQEBV0/JzMVEP6k5kBmx2kvGn...&lt;span class="o">)&lt;/span>
No more messages visible in the queue, shutting down ...
Process finished with &lt;span class="nb">exit&lt;/span> code &lt;span class="m">0&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Next, I checked my DataDog account to see whether the metric data arrived. For this I created a custom
&lt;a href="https://app.datadoghq.eu/notebook/list" target="_blank" rel="noopener">Notebook&lt;/a> with graphs to display them:&lt;/p>
&lt;p>&lt;img src="img/datadog-metrics.png" alt="DataDog Metrics">&lt;/p>
&lt;p>All seems to be well! The deployed AWS Lambda function has already run a few times, providing input data for the python app, which were successfully processed and forwarded to Datadog. As seen on the &lt;code>Notebook&lt;/code> above, it is really easy to display metric data over time about any recurring workload, which can provide pretty useful insights into those jobs.&lt;/p>
&lt;p>Furthermore, since DataDog also submission of
&lt;a href="https://docs.datadoghq.com/events/" target="_blank" rel="noopener">events&lt;/a> it becomes possible to design dashboards and create alerts which trigger based on mor complex criteria, such as the presence or lack of events over certain periods of time. One such example can be seen below:&lt;/p>
&lt;p>&lt;img src="img/ok-vs-fail.png" alt="DataDog Dashboard OK">&lt;/p>
&lt;p>This is a so-called
&lt;a href="https://docs.datadoghq.com/dashboards/screenboards/" target="_blank" rel="noopener">screen-board&lt;/a> which I created to display the status of a Monitor that I set up previously. This Monitor tracks incoming events with the tag &lt;strong>cloud_job_metric&lt;/strong> and generates an alert, if there is not at least one such event of type &lt;strong>success&lt;/strong> in the last 30 minutes. The screen-board can be exported via a public URL if needed, or just simply displayed on a big screen somewhere in the office.&lt;/p>
&lt;h2 id="conclusions">Conclusions&lt;/h2>
&lt;p>In this post I discussed a relatively complex project with lots of exciting technology working together in the realm of Cloud Computing. In the end, I was able to create DashBoards and Monitors in DataDog, which can ingest and display telemetry about AWS workloads, in a way that makes it useful to track and monitor the workloads themselves.&lt;/p></description></item></channel></rss>