AWS Global EC2 runner w/ CDK + CF
I tend to check out various sites here and there to see what people are talking about with AWS and cloud computing, and a user had an interesting request: how can I get a list of all EC2 instances running globally and send a notification if they have been running for longer than, say, 24 or 48 hours? I thought it was a good question and totally doable in AWS with existing services. So let's build it!
If you don't want to read the full post, drop down to the end: there is a deploy button, the CDK code and links to GitHub for the open source repo.
Why would you want this?
As discussed above, this isn't too strange, and it has a few more use cases when you start to think about it. Once you get into AWS it's not uncommon to be hopping in and out of regions creating resources, and it's pretty easy to leave instances running. Hours turn into days and days turn into months. Before you know it the bill is here and you face palm.
This is great for labs where you want to run EC2 instances for hours, not days, and maybe you don't have a clean shutdown or you want more flexibility. It also applies to multiple teams where your cloud admins don't really care if it's running or not but the application owner does: you can separate out the notifications to let those who need to know take action.
When I started to break the problem down it was pretty clear there was a way to do this using existing AWS services. I am a big fan of serverless and Function as a Service (FaaS), so as expected everything runs using managed services in AWS and functions (Lambda).
- EventBridge
We'll be using EventBridge simply to schedule the running of the Lambda function. This service was previously called CloudWatch Events. It's pretty easy to use and still supports detailed scheduling with cron.
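As a rough illustration of the cron-style scheduling mentioned above, here is a small helper that builds an EventBridge schedule expression for a once-a-day run. The helper name `daily_cron` and the 08:00 UTC run time are my own assumptions for the example, not something taken from the stack.

```python
def daily_cron(hour, minute=0):
    """Build an EventBridge cron expression that fires once a day (UTC).

    EventBridge cron has six fields:
    minutes, hours, day-of-month, month, day-of-week, year.
    Day-of-month and day-of-week can't both be specified,
    so one of them has to be '?'.
    """
    if not 0 <= hour <= 23 or not 0 <= minute <= 59:
        raise ValueError("hour must be 0-23 and minute 0-59")
    return f"cron({minute} {hour} * * ? *)"


# Hypothetical schedule: every day at 08:00 UTC
print(daily_cron(8))  # cron(0 8 * * ? *)
```

The resulting string is what you'd hand to the rule's schedule expression, whether you create the rule in the console, CloudFormation or CDK.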
This is the core of the design: the Lambda function is going to run our Python program, tnc-ec2runner. The program itself is pretty simple, but I'll go into more detail later about the design.
Once the Lambda function has found the instances running that shouldn't be we want to let somebody know so that they can most likely ignore it. This is where we'll publish to a topic and subscribed users can receive emails or whatever.
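To make the publish step a bit more concrete, here is a minimal sketch of grouping everything into one message and sending it to the topic. The function names and the shape of the `findings` dictionaries are assumptions for illustration, and the real program may well format things differently. The only AWS call used is the SNS client's `publish`.

```python
def build_message(findings):
    """Group all findings into one body so subscribers get a single email."""
    lines = ["These EC2 instances have been running longer than expected:", ""]
    for f in findings:
        lines.append(
            f"- {f['instance_id']} in {f['region']}: {f['level']} threshold reached"
        )
    return "\n".join(lines)


def notify(sns_client, topic_arn, findings):
    """Publish one grouped notification, or nothing if there are no findings."""
    if not findings:
        return None
    return sns_client.publish(
        TopicArn=topic_arn,
        Subject="tnc-ec2runner: long-running EC2 instances",
        Message=build_message(findings),
    )
```

In the Lambda function you'd call it with something like `notify(boto3.client("sns"), SNS_TOPIC, findings)`. Passing the client in also makes the function easy to test with a stub.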
An obvious addition to this is the ability to shut down instances automatically, which can be done pretty easily with some updates. I'd probably look into the Instance Scheduler on AWS solution first and see if that fits.
tnc-ec2runner a lambda python program
The bit doing all the smarts here is the tnc (talkncloud) ec2 runner python program. Let's step through the program to understand the logic a little better.
We can see from the flow diagram that the program is actually pretty simple and made up of a few functions. The main function is checkEc2, which does the actual reading of the EC2 instance information like launchTime and then compares it against the current time and the set thresholds.
Constants as env vars
I like to remove items that might be configurable from the code itself, and this is where OS environment variables come in handy. We can see in the program there are constants. These actually point to Lambda environment variables and can be changed without updating the code. This is useful in this context, where users might not want to change the code.
    import os

    #
    # Constants
    #

    # Thresholds are time in minutes, 1440 = 24 hours
    THRESH_LOW = int(os.environ.get('THRESH_LOW'))
    THRESH_MED = int(os.environ.get('THRESH_MED'))
    THRESH_HIGH = int(os.environ.get('THRESH_HIGH'))

    # Used to check if its tagged already, e.g. notification processed
    TAGNAME_TNC = 'tnc-ec2runner'

    # SNS arn
    SNS_TOPIC = os.environ.get('SNS_TOPIC')
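One thing to watch with this pattern: `int(os.environ.get(...))` raises a `TypeError` if the variable isn't set at all. If you'd rather fall back to a default, something like the sketch below would do it. The `read_minutes` helper and the default values are my own assumptions, not part of tnc-ec2runner.

```python
import os


def read_minutes(name, default, env=os.environ):
    """Read a threshold in minutes from the environment, with a fallback."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # non-numeric input raises ValueError, which is fine here
    if value <= 0:
        raise ValueError(f"{name} must be a positive number of minutes")
    return value


# Hypothetical defaults: 24, 48 and 72 hours
THRESH_LOW = read_minutes("THRESH_LOW", 1440)
THRESH_MED = read_minutes("THRESH_MED", 2880)
THRESH_HIGH = read_minutes("THRESH_HIGH", 4320)
```

Failing loudly on a bad value is deliberate: a Lambda that errors at startup is easier to spot than one that quietly never notifies anyone.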
Tip: Never put sensitive information like passwords in these variables. There are better ways to handle that, like AWS Secrets Manager.
Logging through CloudWatch
I like to use the Python logger to enable some useful logging, and look, it's just plain nice to read when you put some thought into it. More work now, but you'll thank me when you're reading that CloudWatch log later. Here you can see an example of the function running and printing out some nice information, reassuring you everything is OK.
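For anyone who hasn't used the logger inside Lambda before, here is a minimal sketch of the kind of setup I mean. The logger name, the helper and the message format are assumptions for illustration; the real program's log lines will look different.

```python
import logging

# In Lambda the runtime already attaches a handler to the root logger,
# so setting the level on our own logger is enough to get INFO lines
# into CloudWatch. Run locally, add logging.basicConfig() to see output.
logger = logging.getLogger("tnc-ec2runner")
logger.setLevel(logging.INFO)


def log_region_summary(region, running, flagged):
    """Emit one readable summary line per region."""
    logger.info("region=%s running=%d flagged=%d", region, running, flagged)
```

Key=value pairs like these are also easy to search for later with CloudWatch Logs filter patterns.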
As you can see there is some level of customization, though I haven't gone over the top and thought of every scenario. This program supports three thresholds: low, med and high.
The idea here is that these thresholds are minutes, from smallest (low) to biggest (high). So, for example, if you'd like to be notified when an EC2 instance has been running for 1 hour, you'd set low to 60, and if you'd like to be alerted again if it's still running after 2 hours, you'd set med to 120, and so on.
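The comparison itself boils down to a few lines. This is a sketch of the idea rather than the actual checkEc2 code, and the function names are mine. It assumes the launch time is a timezone-aware datetime, which is what boto3 returns for `LaunchTime`.

```python
from datetime import datetime, timedelta, timezone


def minutes_running(launch_time, now=None):
    """Whole minutes since launch; both datetimes must be timezone-aware."""
    now = now or datetime.now(timezone.utc)
    return int((now - launch_time).total_seconds() // 60)


def classify(minutes, low, med, high):
    """Return the highest threshold reached ('low', 'med', 'high') or None."""
    if minutes >= high:
        return "high"
    if minutes >= med:
        return "med"
    if minutes >= low:
        return "low"
    return None


# An instance launched 90 minutes ago, with the 60/120/180 example thresholds
launch = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
now = launch + timedelta(minutes=90)
print(classify(minutes_running(launch, now), low=60, med=120, high=180))  # low
```

Checking from high down to low means an instance that has blown through every threshold is reported at the most severe level only.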
People following along might be wondering how we keep track of the state. If a notification has been sent, we don't want to send another one on the next run, otherwise it will just keep sending notifications. I've chosen tags for this job: a simple key-value store on the EC2 instance itself. You could go with something like DynamoDB, but it really wasn't needed.
When the program runs and a notification is sent, a simple tag (key: tnc-ec2runner, value: low/med/high) is set on the instance. If the instance is stopped the tag will be removed, starting the notification process all over again.
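Here is a sketch of how the tag can drive the "have I already told someone?" decision. I'm assuming a notification should go out only when an instance moves up to a higher level than the one recorded in the tag. That's my reading of the design, and the function names are made up for the example.

```python
LEVELS = ["low", "med", "high"]  # ordered from least to most severe
TAG_KEY = "tnc-ec2runner"


def current_level(tags):
    """Pull the tracking tag out of the Tags list from describe_instances."""
    for tag in tags or []:
        if tag["Key"] == TAG_KEY:
            return tag["Value"]
    return None


def should_notify(tagged_level, new_level):
    """Notify only when the instance has moved past the recorded level."""
    if new_level is None:
        return False
    if tagged_level not in LEVELS:
        return True  # no tag yet, or a value we don't recognise
    return LEVELS.index(new_level) > LEVELS.index(tagged_level)


tags = [{"Key": "Name", "Value": "lab-box"}, {"Key": TAG_KEY, "Value": "low"}]
print(should_notify(current_level(tags), "low"))  # False
print(should_notify(current_level(tags), "med"))  # True
```

After a notification is sent, the tag would be updated to the new level so the next run stays quiet until the following threshold is crossed.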
I haven't done extensive testing, but I've completed a few rounds to give an indication of how long tnc-ec2runner takes to execute, how much memory it uses etc. The longest running part of the program is looping through all of the regions. This takes time, no way around it.
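For reference, the region loop looks roughly like the sketch below. It isn't lifted from the repo: the function names are mine, and the EC2 client is passed in as a factory so the example stays self-contained. The boto3 calls it relies on are `describe_regions` and the `describe_instances` paginator.

```python
def enabled_regions(ec2_client):
    """List the regions enabled for this account via describe_regions."""
    return [r["RegionName"] for r in ec2_client.describe_regions()["Regions"]]


def running_instances(ec2_client_for, regions):
    """Yield (region, instance) for every running instance in the regions.

    ec2_client_for is any callable that returns an EC2 client for a region,
    e.g. lambda region: boto3.client("ec2", region_name=region).
    """
    running = [{"Name": "instance-state-name", "Values": ["running"]}]
    for region in regions:
        paginator = ec2_client_for(region).get_paginator("describe_instances")
        for page in paginator.paginate(Filters=running):
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    yield region, instance
```

Each region needs its own client and at least one API round trip, which is why the run time grows with the number of enabled regions rather than the number of instances.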
We can see from the stats, here are some estimates:
- ~18 second execution
- ~90MB of memory
- ~2.3 billed GB-seconds
We'll park this here and discuss total price estimates towards the end.
So, how do I get this?
As with all of my posts, you'll find a few different ways to get up and running:
- Build it out manually
The first method is the click to deploy button, all you need to do is click the button below and it will launch the stack in your AWS account. Pretty cool! There is one parameter which is the email address you want notifications to go to.
I've developed the stack to use only the permissions required to get the job done, there are no excessive permissions. It will need access to the following:
- EC2 Describe Instances
- EC2 Describe Regions
- EC2 Create tags
- EC2 Delete tags
- SNS Publish to topic
- Lambda Event Bridge
All of these are restricted to the stack. For example, the Lambda function has permission to describe all EC2 instances, but I have further restricted tagging to only the tag tnc-ec2runner, and the same goes for removing tags.
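To show what that tag restriction can look like, here is an IAM policy document written out as a Python dictionary. Treat it as an assumption about the general approach (the `aws:TagKeys` condition key), not a copy of the policy the stack generates. The SNS and EventBridge permissions are left out to keep it short.

```python
import json

TAG_KEY = "tnc-ec2runner"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # The Describe* calls don't support resource-level permissions
            "Effect": "Allow",
            "Action": ["ec2:DescribeInstances", "ec2:DescribeRegions"],
            "Resource": "*",
        },
        {
            # Tagging is allowed only if every tag key in the request is ours
            "Effect": "Allow",
            "Action": ["ec2:CreateTags", "ec2:DeleteTags"],
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "ForAllValues:StringEquals": {"aws:TagKeys": [TAG_KEY]}
            },
        },
    ],
}

print(json.dumps(policy, indent=2))
```

With a condition like this, a bug in the function (or someone reusing its role) can't go around rewriting Name tags or anything else on your instances.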
CDK (Cloud Development Kit)
If you haven't noticed already, I am a fan of CDK. This stack has been developed in CDK and can be downloaded from GitHub so you can deploy it, make it better etc.
cdk deploy --parameters <parameter-name>=email@example.com
As I've mentioned before, this is open source: take it, use it, share it, make it better. Access the repository on GitHub. If you have suggestions, please use GitHub issues or send me a message and I'll check it out.
Ah, yes, estimating pricing with AWS, a fun exercise. Let's recap some of the key points for pricing:
- Runs once per day
- Lambda, < 128MB, ~18 seconds (30 second timeout set)
- Notifications are grouped into one message per run, so you don't get several single messages (which would mean a higher quantity)
SNS: $0.00 for the first 1k emails, then $2.00 per 100k after that
EventBridge: $0.00 for service events
This solution should cost you nothing to run, which is great!
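If you want to sanity check the Lambda side of that claim, the arithmetic is short. The free tier figures below (400,000 GB-seconds and 1 million requests per month) are the ones I'm assuming apply to your account, and the sums assume nothing else in the account is eating into them.

```python
RUNS_PER_MONTH = 30          # once per day
DURATION_SECONDS = 18        # observed execution time
MEMORY_GB = 128 / 1024       # 128 MB configured memory

FREE_GB_SECONDS = 400_000    # assumed Lambda monthly free tier (compute)
FREE_REQUESTS = 1_000_000    # assumed Lambda monthly free tier (requests)

gb_seconds = RUNS_PER_MONTH * DURATION_SECONDS * MEMORY_GB
print(f"GB-seconds per month: {gb_seconds}")  # 67.5
print(f"Share of free compute used: {gb_seconds / FREE_GB_SECONDS:.4%}")
print(f"Share of free requests used: {RUNS_PER_MONTH / FREE_REQUESTS:.4%}")
```

Even running it hourly instead of daily would only multiply those numbers by 24, which still leaves it a long way inside the free allowance.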
These are estimates only. Actual deployment, usage etc. may vary, so it's up to you to confirm and monitor your own costs.
I had a fun time with this one. I didn't realise EventBridge had taken over from CloudWatch Events, and really this just made it easier to integrate into the CDK stack. I've also learned about CDK helpers: if you look at the CDK code you'll see the SNS permissions added to Lambda using a helper, which is neat and tidy.
There are many different ways to tackle this problem, and I can think of a few improvements. ChatBot, anyone? But that is the beauty of the cloud: so much variety.
It's easy to see how this simple solution can really return savings. At the end of the day you deploy it, forget about it and shut down those instances you don't need.
If you think this is of use please share, give feedback, or better yet pull the code and make it better!