Design, build, innovate & share cloud.

I tend to check out various sites here and there to see what people are saying about AWS and cloud computing, and one user had an interesting request: how can I get a list of all EC2 instances running globally and send a notification if any have been running for longer than, say, 24 or 48 hours? I thought it was a good question and totally doable in AWS with existing services. So let's build it!

If you don't want to read the full post, skip to the end, where you'll find a deploy button, the CDK stack and a link to the open source repo on GitHub.

Why would you want this?

The request isn't that strange, and more use cases appear once you start to think about it. Once you get into AWS it's not uncommon to be hopping in and out of regions creating resources, and it's pretty easy to leave instances running. Hours turn into days and days turn into months. Before you know it the bill is here and you're face-palming.

This is great for labs where you want to run EC2 instances for hours, not days, and maybe you don't have a clean shutdown process or you want more flexibility. It also applies to multiple teams where your cloud admins don't really care whether an instance is running but the application owner does: you can separate out the notifications so that those who need to know can take action.

Some design

When I started to break the problem down, it was pretty clear there was a way to do this using existing AWS services. I am a big fan of serverless and Function as a Service (FaaS), so as expected everything runs on managed AWS services and functions (Lambda).

[Figure: solution design]

Key services:

  • EventBridge
  • Lambda
  • SNS

EventBridge

We'll be using EventBridge simply to run the Lambda function on a schedule. This capability was previously known as CloudWatch Events. It's pretty easy to use and still supports detailed scheduling with cron expressions.
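As a rough illustration of the cron scheduling, here is a minimal sketch that builds a once-a-day EventBridge schedule expression. The helper name `daily_schedule` is hypothetical and not part of the stack. EventBridge cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year) and are evaluated in UTC.

```python
def daily_schedule(hour_utc: int, minute: int = 0) -> str:
    """Return an EventBridge cron expression that fires once a day (UTC).

    Hypothetical helper for illustration only.
    """
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Day-of-week is '?' because day-of-month is already set to '*'.
    return f"cron({minute} {hour_utc} * * ? *)"


if __name__ == "__main__":
    print(daily_schedule(9))
```

A string like this is what you would pass as the `ScheduleExpression` of an EventBridge rule.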

https://aws.amazon.com/eventbridge/

Lambda

This is the core of the design: the Lambda function runs our Python program, tnc-ec2runner. The program itself is pretty simple, but I'll go into more detail about its design later.

SNS

Once the Lambda function has found the instances that are running and shouldn't be, we want to let somebody know so that they can most likely ignore it. This is where we'll publish to a topic, and subscribed users can receive emails or whatever.

An obvious addition is the ability to shut down instances automatically, which could be done pretty easily with some updates. I'd probably look into the AWS Instance Scheduler solution first and see if that fits.
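To make the publish step concrete, here is a minimal sketch of grouping findings into one message and publishing it. The function names and the shape of the `findings` dicts are assumptions for illustration and may differ from the real program. The client is passed in, so in Lambda you would hand it `boto3.client("sns")`.

```python
def build_message(findings):
    """Group all findings into one plain-text body (assumed dict shape)."""
    lines = [f"{len(findings)} instance(s) passed a running-time threshold:", ""]
    for item in sorted(findings, key=lambda i: (i["region"], i["instance_id"])):
        lines.append(
            f"{item['region']}  {item['instance_id']}  "
            f"{item['minutes']} min  [{item['level']}]"
        )
    return "\n".join(lines)


def notify(sns_client, topic_arn, findings):
    """Publish a single grouped message; return False if there is nothing to send."""
    if not findings:
        return False
    sns_client.publish(
        TopicArn=topic_arn,
        Subject="tnc-ec2runner: long-running EC2 instances",
        Message=build_message(findings),
    )
    return True
```

Sending one grouped message per run keeps the email count low compared with one message per instance.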

tnc-ec2runner, a Lambda Python program

The bit doing all the smarts here is the tnc (talkncloud) ec2 runner Python program. Let's step through the program to understand the logic a little better.

[Figure: program flow diagram]

We can see from the flow diagram that the program is actually pretty simple and made up of a few functions. The main function is checkEc2, which reads the EC2 instance information such as launchTime, compares it against the current time and checks the result against the set thresholds.
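The time comparison can be sketched as below. This is an illustration under assumptions, not the actual checkEc2 code: `minutes_running` is a hypothetical helper, and it relies on boto3 returning `LaunchTime` as a timezone-aware UTC `datetime`.

```python
from datetime import datetime, timezone


def minutes_running(launch_time, now=None):
    """Whole minutes elapsed between launch_time and now (both timezone-aware)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return int((now - launch_time).total_seconds() // 60)


if __name__ == "__main__":
    launched = datetime(2020, 11, 9, 12, 0, tzinfo=timezone.utc)
    checked = datetime(2020, 11, 10, 12, 30, tzinfo=timezone.utc)
    print(minutes_running(launched, checked))  # 24 hours and 30 minutes
```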

Constants as env vars

I like to keep items that might be configurable out of the code itself, and this is where OS environment variables come in handy. We can see in the program there are constants. These actually point to Lambda environment variables and can be changed without updating the code, which is useful here because users might not want to touch the code.

import os

#
# Constants
#
# Thresholds are running time in minutes, e.g. 1440 = 24 hours
THRESH_LOW = int(os.environ.get('THRESH_LOW'))
THRESH_MED = int(os.environ.get('THRESH_MED'))
THRESH_HIGH = int(os.environ.get('THRESH_HIGH'))
# Tag key used to check if the instance is tagged already, e.g. notification processed
TAGNAME_TNC = 'tnc-ec2runner'
# ARN of the SNS topic that notifications are published to
SNS_TOPIC = os.environ.get('SNS_TOPIC')


Tip: Never put sensitive information such as passwords in these variables. There are better ways to handle that, like AWS Secrets Manager.

Logging through CloudWatch

I like to use the Python logger to enable some useful logging, and look, it's just plain nice to read when you put some thought into it. More work now, but you'll thank me when you're reading that CloudWatch log later. Here you can see an example of the function running and printing out some nice information, reassuring you everything is OK.

[Screenshot: CloudWatch log output from a run]
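For readers who haven't used the standard `logging` module in Lambda, a minimal sketch might look like this. The logger name, the `log_instance` helper and the message format are assumptions, not the program's actual output. In Lambda, anything logged this way ends up in the function's CloudWatch log group.

```python
import logging

# Named logger; the level must be set explicitly or INFO lines are dropped.
logger = logging.getLogger("tnc-ec2runner")
logger.setLevel(logging.INFO)


def log_instance(region, instance_id, minutes, level):
    """Write one consistent key=value line per instance checked."""
    logger.info(
        "region=%s instance=%s running=%dmin threshold=%s",
        region, instance_id, minutes, level or "none",
    )
```

Consistent key=value lines make the log easy to scan and to filter later.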

Customization

As you can see there is some level of customization, although I haven't gone over the top and thought of every scenario. This program supports three thresholds:

  • Low
  • Med
  • High

The idea here is that these thresholds are minutes, from smallest (low) to biggest (high). So, for example, if you'd like to be notified when an EC2 instance has been running for 1 hour, you'd set low to 60; if you'd like to be alerted again if it's still running after 2 hours, you'd set med to 120, and so on.
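The threshold idea can be sketched as a small classification function. This is an illustration only: `threshold_level` is a hypothetical name, and treating a threshold as crossed when the running time is equal to it is an assumption.

```python
def threshold_level(minutes, low, med, high):
    """Return the highest threshold crossed ('low', 'med', 'high') or None."""
    if minutes >= high:
        return "high"
    if minutes >= med:
        return "med"
    if minutes >= low:
        return "low"
    return None


if __name__ == "__main__":
    # Thresholds of 1, 2 and 3 hours, expressed in minutes.
    for running in (59, 60, 125, 500):
        print(running, threshold_level(running, 60, 120, 180))
```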

State management

People following along might be wondering how we keep track of state. If a notification has been sent, we don't want to send another one on the next run, otherwise it will just keep sending notifications. I've chosen tags for this job: a simple key-value store on the EC2 instance itself. You could go with something like DynamoDB, but it really wasn't needed.

When the program runs and a notification is sent, a simple tag with key tnc-ec2runner and value low/med/high is set on the instance. If the instance is stopped the tag will be removed, starting the notification process all over again.
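The tag-based state check might be sketched as follows. The helper names and the rule "notify only when the new level is higher than the tagged one" are assumptions for illustration. The `Tags` list of `Key`/`Value` dicts matches what EC2 returns for an instance.

```python
TAGNAME_TNC = "tnc-ec2runner"
LEVELS = ("low", "med", "high")  # ordered from smallest to biggest threshold


def current_tag_level(instance):
    """Return the value of the tnc-ec2runner tag, or None if it is not set."""
    for tag in instance.get("Tags", []):
        if tag["Key"] == TAGNAME_TNC:
            return tag["Value"]
    return None


def should_notify(instance, level):
    """Notify only if a threshold was crossed and not already reported."""
    if level is None:
        return False
    tagged = current_tag_level(instance)
    if tagged not in LEVELS:
        return True  # never notified (or unknown tag value)
    return LEVELS.index(level) > LEVELS.index(tagged)
```

After a notification goes out, the tag would be updated to the new level so that the next run stays quiet until a higher threshold is crossed.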

Lambda profile

I haven't done extensive testing, but I've completed a few rounds to give an indication of how long tnc-ec2runner takes to execute, how much memory it uses etc. The longest-running part of the program is looping through all of the regions. This takes time, and there's no way around it.
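To show why the region loop dominates the run time, here is a minimal sketch of it: one API call to list regions, then at least one more call per region. The function name and the injected `client_factory` are assumptions. In Lambda you could pass `lambda region: boto3.client("ec2", region_name=region)`.

```python
def iter_running_instances(client_factory, home_region="us-east-1"):
    """Yield (region, instance) for every running instance in every region.

    client_factory(region) must return an EC2 client for that region.
    """
    ec2 = client_factory(home_region)
    regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]
    for region in regions:
        regional = client_factory(region)
        # Paginate so that regions with many instances are read completely.
        paginator = regional.get_paginator("describe_instances")
        pages = paginator.paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for page in pages:
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    yield region, instance
```

Each region is a separate network round trip, so the total time grows with the number of enabled regions even when most of them are empty.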

From the stats, here are some estimates:

  • ~18 second execution
  • ~90MB of memory
  • ~2.3 billed GB-seconds
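The billed figure follows from the configured memory and the duration. As a quick check, assuming the function is configured with 128 MB (Lambda bills on configured memory, not the ~90MB actually used):

```python
def billed_gb_seconds(memory_mb, duration_seconds):
    """GB-seconds billed for one invocation: configured memory in GB x seconds."""
    return (memory_mb / 1024) * duration_seconds


if __name__ == "__main__":
    # 128 MB configured, ~18 second run
    print(billed_gb_seconds(128, 18))
```

That gives 2.25 GB-seconds per run, which rounds to the ~2.3 shown in the list.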

We'll park this here and discuss total price estimates towards the end.

So, how do I get this?

As with all of my posts, you'll find a few different ways to get up and running:

  • CloudFormation
  • CDK
  • Manual build

The first method is the click to deploy button, all you need to do is click the button below and it will launch the stack in your AWS account. Pretty cool! There is one parameter which is the email address you want notifications to go to.

Launch Stack

I've developed the stack to use only the permissions required to get the job done, there are no excessive permissions. It will need access to the following:

  • EC2 Describe Instances
  • EC2 Describe Regions
  • EC2 Create tags
  • EC2 Delete tags
  • SNS Publish to topic
  • Lambda invocation from EventBridge

All of these are restricted to the stack. For example, the Lambda function has permission to describe all EC2 instances, but I have further restricted tagging to only the tag tnc-ec2runner, and the same goes for removing tags.
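As an illustration of what a tag-restricted policy can look like, here is a sketch of an IAM policy document built as a Python dict. It is an assumption about the general shape, not a copy of the policy in the stack. The `aws:TagKeys` condition key limits which tag keys the create and delete tag actions may touch, and the statement IDs and `build_policy` name are hypothetical.

```python
import json

TAG_KEY = "tnc-ec2runner"


def build_policy(topic_arn):
    """Return a least-privilege style policy document (illustrative only)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadInstances",
                "Effect": "Allow",
                "Action": ["ec2:DescribeInstances", "ec2:DescribeRegions"],
                "Resource": "*",  # Describe actions cannot be scoped to resources
            },
            {
                "Sid": "ManageRunnerTag",
                "Effect": "Allow",
                "Action": ["ec2:CreateTags", "ec2:DeleteTags"],
                "Resource": "arn:aws:ec2:*:*:instance/*",
                # Only requests whose tag keys are all tnc-ec2runner are allowed.
                "Condition": {
                    "ForAllValues:StringEquals": {"aws:TagKeys": [TAG_KEY]}
                },
            },
            {
                "Sid": "PublishNotifications",
                "Effect": "Allow",
                "Action": "sns:Publish",
                "Resource": topic_arn,
            },
        ],
    }


if __name__ == "__main__":
    arn = "arn:aws:sns:us-east-1:123456789012:example-topic"  # placeholder ARN
    print(json.dumps(build_policy(arn), indent=2))
```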

CDK (Cloud Development Kit)

If you haven't noticed already, I am a fan of CDK. This stack has been developed in CDK and can be downloaded from GitHub so you can deploy it, make it better etc.

cdk deploy --parameters email=my@email.com

GitHub

As I've mentioned before, this is open source: take it, use it, share it, make it better. Access the repository on GitHub. If you have suggestions, please use GitHub issues or send me a message and I'll check it out.

talkncloud/aws

Price estimates

Ah, yes, estimating pricing with AWS, a fun exercise. Let's recap some of the key points for pricing:

  • Runs once per day
  • Lambda, < 128MB, ~18 seconds (30 second timeout set)
  • SNS
    • Findings are grouped into one message per run, so you don't get a higher quantity of single messages


Estimates

Lambda: $0.00
SNS: $0.00 for the first 1k emails, then $2.00 per 100k emails
EventBridge: $0.00 for service events
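The arithmetic behind these numbers can be sketched as below. The rates and free-tier limits are assumptions taken from public AWS pricing around the time of writing (Lambda at $0.0000166667 per GB-second with 400,000 free GB-seconds a month, SNS email as listed above). Lambda request charges are ignored, so check current pricing for your region.

```python
# Assumed prices; verify against the current AWS pricing pages.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667
LAMBDA_FREE_GB_SECONDS = 400_000
SNS_FREE_EMAILS = 1_000
SNS_PRICE_PER_100K_EMAILS = 2.00


def lambda_monthly_cost(gb_seconds_per_run, runs_per_month):
    """Compute cost after the monthly free tier (request charges ignored)."""
    used = gb_seconds_per_run * runs_per_month
    billable = max(0.0, used - LAMBDA_FREE_GB_SECONDS)
    return billable * LAMBDA_PRICE_PER_GB_SECOND


def sns_email_monthly_cost(emails):
    """Email delivery cost after the first 1,000 free emails."""
    billable = max(0, emails - SNS_FREE_EMAILS)
    return billable / 100_000 * SNS_PRICE_PER_100K_EMAILS


if __name__ == "__main__":
    # One run a day at ~2.25 GB-seconds, one grouped email per run.
    print(lambda_monthly_cost(2.25, 31), sns_email_monthly_cost(31))
```

At one run a day, usage is around 70 GB-seconds and about 31 emails a month, far below both free allowances.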

This solution should cost you nothing to run, which is great!

These are estimates only. Actual costs will vary with your deployment and usage, so it's up to you to confirm and monitor them.

Final thoughts

I had a fun time with this one. I didn't realise EventBridge had taken over from CloudWatch Events, and really this just made it easier to integrate into the CDK stack. I've also learned about CDK helpers: if you look at the CDK code you'll see the SNS permissions added to Lambda using a helper, which is neat and tidy.

There are many different ways to tackle this problem, and I can think of a few improvements. ChatBot, anyone? But that is the beauty of the cloud: so much variety.

It's easy to see how this simple solution can really return savings. At the end of the day you deploy it, forget about it and shut down those instances you don't need.

If you think this is of use please share, give feedback, or better yet pull the code and make it better!
