We aren’t a big shop, but we use AWS Auto Scaling Groups .. That means nodes spin up and down all day long .. Got some traffic ?? Add nodes .. Traffic is low ?? Drop nodes .. Someone accidentally terminates a node ?? Add nodes .. Someone sneezes funny ?? Drop nodes ..
This goes on all day long to get as close to “right-sizing” our infrastructure as we can ..
We also name our nodes using a simple formula for various monitoring and orchestration purposes .. For example:
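Think something like rmi1.s.example.net .. role (“rmi”), node number (“1”), environment (“s” for staging), then the domain .. (that exact hostname is hypothetical, but it lines up with the naming pattern you’ll see in the Splunk searches further down)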
Yes, I understand an argument could be made this is an anti-pattern .. “Greg, you don’t need to name the stinking node .. It is cattle — nobody cares !!” .. I get it, I really do — but sometimes it’s nice just to have a name ..
So then, with all this spinning up and down of nodes and the related creation of Route53 records — you can end up with a lot of dead entries .. For whatever reason, you may wish to purge these dead entries — and OBVIOUSLY you do NOT want to do this manually .. So what to do ??
Well, I might have a solution for you .. Go ahead and check out this GitHub repo: https://github.com/gkspranger/aws-route53-purge-dead-records .. It’s a simple playbook and role that will purge AWS Route53 records for a given hosted zone and a known naming pattern, while making sure to NOT delete records of nodes that are currently running ..
WARNING It can cause damage, so please be sure to review and understand what is going on with the playbook and role ..
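If you just want the flavor of it, here is a minimal Bash sketch of the same idea .. it is NOT the playbook itself, and the zone ID and naming pattern below are hypothetical:

```bash
#!/usr/bin/env bash
# Sketch: delete A records that match a node-naming pattern and do NOT
# point at a currently running EC2 instance .. values below are hypothetical
set -euo pipefail

ZONE_ID="Z0000000EXAMPLE"    # hypothetical hosted zone ID
PATTERN='^rmi[0-9]+\.s\.'    # hypothetical node-naming pattern

# private IPs of every running instance
RUNNING_IPS=$(aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].PrivateIpAddress' --output text)

# walk every A record in the zone, purging the dead ones
aws route53 list-resource-record-sets --hosted-zone-id "$ZONE_ID" \
  --query "ResourceRecordSets[?Type=='A']" --output json |
jq -c '.[]' | while read -r REC; do
  NAME=$(jq -r '.Name' <<<"$REC")
  IP=$(jq -r '.ResourceRecords[0].Value' <<<"$REC")
  [[ $NAME =~ $PATTERN ]] || continue            # not one of our nodes
  grep -qw "$IP" <<<"$RUNNING_IPS" && continue   # node is alive .. keep it
  echo "purging dead record $NAME ($IP)"
  aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" \
    --change-batch "{\"Changes\":[{\"Action\":\"DELETE\",\"ResourceRecordSet\":$REC}]}"
done
```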
Did you know you can tag a host in Splunk ?? I didn’t !! Do you know how much time tags would have saved me from having to craft a most excellent Splunk search to capture just the right hosts ?? Me neither — but I’m guessing it’s a lot ..
So instead of my searches looking like this:
```
# get all staging RMI nodes -- hard
index=* ( host=rmi1.s.* OR host=rmi2.s.* OR host=rmi3.s.* ) source=*tomcat* earliest=-1h
```
They can now look like this:
```
# get all staging RMI nodes -- easy
index=* tag=rmi tag=stage source=*tomcat* earliest=-1h
```
I know, I know — I could achieve the same level of excellence using targeted indexes (index=rmi_stage) and/or various regex filters .. Some of that, unfortunately, is out of my control ..
OK .. So how can you manage this without having to use the GUI ?? Easy !! You just need to drop a config file in the proper location (for me it’s: /opt/splunk/etc/system/local/tags.conf) on the search head, and away you go .. The syntax is pretty basic:
```
# tagging host login.example.net with PROD, TOMCAT, and LOGIN
[host=login.example.net]
prod = enabled
tomcat = enabled
login = enabled
```
Below’s a nice little example of how I automated this using Ansible (big surprise there 🙂) and the EC2 dynamic inventory script ..
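Something along these lines .. a sketch only (the group and tag names are hypothetical, and it assumes the classic ec2.py dynamic inventory, which exposes ec2_tag_* host variables):

```yaml
# tasks/main.yml -- render tags.conf on the Splunk search head
- name: generate Splunk tags.conf from EC2 inventory
  template:
    src: tags.conf.j2
    dest: /opt/splunk/etc/system/local/tags.conf
```

```jinja
{# tags.conf.j2 -- one stanza per EC2 host, tagging it with its role and #}
{# environment .. assumes instances carry hypothetical "role"/"env" tags #}
{% for host in groups['ec2'] %}
[host={{ host }}]
{{ hostvars[host]['ec2_tag_role'] | lower }} = enabled
{{ hostvars[host]['ec2_tag_env'] | lower }} = enabled
{% endfor %}
```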
So I want to share !!
I can’t get into too many details, but the overall concept was that every customer would be running a micro instance with our custom Hubot code installed .. This instance would pull code updates, if any, every 5 minutes and infrastructure updates, if any, every 15 minutes .. In addition, a customer could participate in pilot programs — AKA branch work ..
I really liked how I was able to avoid a dedicated “command node” by running Ansible locally, on a schedule .. I was also able to automate pretty much everything, from VPC creation all the way to Auto Scaling groups ..
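The local-and-scheduled trick boils down to ansible-pull plus cron .. here’s a rough sketch (the repo URL and playbook names are hypothetical):

```bash
# /etc/cron.d/hubot-updates -- hypothetical crontab entries
# pull code updates every 5 minutes, only running if the repo changed (-o)
*/5  * * * * root ansible-pull -o -U https://github.com/example/hubot-config.git code.yml  >> /var/log/ansible-pull.log 2>&1
# pull infrastructure updates every 15 minutes
*/15 * * * * root ansible-pull -o -U https://github.com/example/hubot-config.git infra.yml >> /var/log/ansible-pull.log 2>&1
# pilot programs (AKA branch work) would check out a branch instead, e.g. --checkout pilot
```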
Anyway, here’s the link: https://github.com/gkspranger/failed-chatbot .. Maybe it will help one of you out there in Internets land ..
Sooo .. You are monitoring a fleet of AWS EC2 hosts via Nagios, and have yet to find an easy way to manage their host definitions .. Good news (if you happen to be using Ansible dynamic inventories) !! I created an Ansible template that loops through all your EC2 instances and creates their host definitions for you ..
In addition, you can easily define Nagios service dependencies, helping you zero in on the root problem more quickly ..
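To give you the flavor, a trimmed-down sketch of such a template (hypothetical variable names .. assumes the classic ec2.py dynamic inventory):

```jinja
{# hosts.cfg.j2 -- one Nagios host definition per EC2 instance #}
{% for host in groups['ec2'] %}
define host {
  use        linux-server
  host_name  {{ host }}
  alias      {{ hostvars[host]['ec2_tag_Name'] | default(host) }}
  address    {{ hostvars[host]['ec2_private_ip_address'] }}
}
{% endfor %}
```

Service dependencies are just another define block (define servicedependency { .. }) that the same template loop can emit ..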
Afraid of having too many AWS EC2 images and/or snapshots, thus running up your bill ?? Fear not !! I have you covered:
- Nagios Plugins to Check AWS EC2 Images
- Nagios Plugin to Check AWS EC2 Snapshots
So then, let’s talk about the use case ..
- I have a Linux node
- On said node, I have attached and mounted some EBS volumes
  - Think Jenkins (/var/lib/jenkins), BitBucket (/opt/bitbucket), OSSEC (/var/ossec), etc ..
- I need to be able to create EBS snapshots of said EBS volumes on a regular basis, without majorly disrupting the service
- I don’t want to have to stop, snapshot, and start the related service
- I don’t always know what the EBS volume ID is, nor do I want to know
- Sometimes I want to wait until the EBS snapshot is done before I continue doing other things
- I want to tag the EBS snapshot something meaningful
So how am I going to solve this ??
- Find a way to snapshot an EBS volume
  - Hello CREATE-SNAPSHOT
- Find a way to determine the state of an EBS snapshot
  - Hello DESCRIBE-SNAPSHOTS
  - This is so I can wait until the EBS snapshot is done
- Find a way to “pause” the service without major disruption
  - Hello FSFREEZE
- Find a way to map a device to an EBS volume
  - Hello DESCRIBE-INSTANCES
  - It’s a little more complicated than that, but this is where the data is (see the sketch after this list)
- Find a way to tag an EBS snapshot
  - Hello CREATE-TAGS
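That “little more complicated” bit is mostly about digging the volume ID out of the DESCRIBE-INSTANCES output .. roughly like this (the device path is hypothetical, and note the OS may call the device /dev/xvdf while the AWS API calls it /dev/sdf):

```bash
#!/usr/bin/env bash
# sketch: map a local device path to its EBS volume ID
set -euo pipefail

DEVICE=/dev/xvdf   # hypothetical device path
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

aws ec2 describe-instances --instance-ids "$INSTANCE_ID" --output json |
jq -r --arg dev "$DEVICE" \
  '.Reservations[0].Instances[0].BlockDeviceMappings[]
   | select(.DeviceName == $dev) | .Ebs.VolumeId'
```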
Putting it all together ..
- I created a Bash script that is managed by Ansible and utilizes the AWS CLI and jq
- It should be run ON THE NODE where the EBS volume is attached and mounted
  - /root/bin/snapshot_ebs_volume.sh makes sense to me
- It allows you to pass as options:
  - -d <device path> .. the device path of the EBS volume you will freeze, EBS snapshot, and unfreeze
  - -c "<EBS snapshot description>" .. is what it says
  - -w [optional] .. if defined, the script will not finish until the EBS snapshot is done
- Sequence of events is as follows:
  - Get the EBS Volume ID of the provided device path
  - Get the mounted file system path of the provided device path
  - Freeze the mounted file system path
  - Create an EBS snapshot of the discovered EBS Volume ID
  - Unfreeze the mounted file system path
  - Tag the EBS snapshot with a pretty name
  - Wait until the EBS snapshot is done [optional] .. otherwise the whole run takes about 5 seconds total
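And for the visual learners, here is a heavily condensed sketch of that sequence .. NOT the actual script, and the device path, description, and tag format below are hypothetical:

```bash
#!/usr/bin/env bash
# condensed sketch of snapshot_ebs_volume.sh's sequence of events ..
# run as root on the node where the volume is attached and mounted
set -euo pipefail

DEVICE=/dev/xvdf                          # would come from -d
DESC="jenkins home backup"                # would come from -c
MOUNT=$(findmnt -n -o TARGET "$DEVICE")   # mounted file system path

# map device path -> EBS volume ID (see the mapping sketch above)
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
VOLUME_ID=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" --output json |
  jq -r --arg dev "$DEVICE" '.Reservations[0].Instances[0].BlockDeviceMappings[]
    | select(.DeviceName == $dev) | .Ebs.VolumeId')

fsfreeze --freeze "$MOUNT"
# never leave the file system frozen, even if the snapshot call fails
trap 'fsfreeze --unfreeze "$MOUNT"' ERR

# create-snapshot returns as soon as the point-in-time snapshot is initiated,
# so the freeze only lasts a few seconds
SNAP_ID=$(aws ec2 create-snapshot --volume-id "$VOLUME_ID" \
  --description "$DESC" --query SnapshotId --output text)

fsfreeze --unfreeze "$MOUNT"
trap - ERR

# tag the EBS snapshot with a pretty name
aws ec2 create-tags --resources "$SNAP_ID" \
  --tags "Key=Name,Value=$(hostname)-${DEVICE##*/}"

# the -w behavior: poll DESCRIBE-SNAPSHOTS until the snapshot completes
aws ec2 wait snapshot-completed --snapshot-ids "$SNAP_ID"
```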