Sooo .. You are monitoring a fleet of AWS EC2 hosts via Nagios, and have yet to find an easy way to manage their host definitions .. Good news (if you happen to be using Ansible dynamic inventories) !! I created an Ansible template that loops through all your EC2 instances and creates the Nagios host definitions for you ..
In addition, you can easily define Nagios service dependencies, helping you zero in on the root problem more quickly ..
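To make the "loop over the inventory, spit out host definitions" idea concrete, here is a rough Python sketch .. It is NOT the Ansible template itself, just the same loop in miniature .. The instance list, the `render_hosts` name, and the `generic-host` parent template are all assumptions made up for illustration ..

```python
from string import Template

# Nagios object definition for a single host. The "generic-host" parent
# template used below is an assumption, swap in whatever your config defines.
HOST_TEMPLATE = Template(
    "define host {\n"
    "    use        $use\n"
    "    host_name  $name\n"
    "    address    $address\n"
    "}\n"
)


def render_hosts(instances, use="generic-host"):
    """Return one Nagios host definition per instance, sorted by name."""
    blocks = []
    for inst in sorted(instances, key=lambda i: i["name"]):
        blocks.append(
            HOST_TEMPLATE.substitute(
                use=use, name=inst["name"], address=inst["address"]
            )
        )
    return "\n".join(blocks)


# Hypothetical stand-in for what a dynamic inventory would return.
instances = [
    {"name": "web-02", "address": "10.0.0.12"},
    {"name": "web-01", "address": "10.0.0.11"},
]
print(render_hosts(instances))
```

Sorting by name keeps the generated file stable between runs, so a diff only shows hosts that actually came or went ..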
Afraid of having too many AWS EC2 images and/or snapshots, thus running up your bill ?? Fear not !! I have you covered:
Nagios Plugins to Check AWS EC2 Images
Nagios Plugin to Check AWS EC2 Snapshots
So I have been using Ansible for over two years now .. I use it for damn near everything: provisioning infrastructure, configuring nodes, deploying Web applications, testing whatever I can, and other ad hoc tasks (sadly, I’m still working on a “get me beer” playbook) .. Long story short, it’s been a game changer .. Problem is, as my (and my team’s .. hi guys !!) Ansible usage grows (50+ playbooks and 130+ roles), so does my desire to organize it in a way that is scalable ..
There are many ways you can set up your Ansible project (one, two, three, four, etc ..), which is great !! That said, I love me some simplicity .. After trying out a few setups, I finally settled on one that works for me .. You can see it here ..
The view from 30,000 feet:
- All of my inventory files, static and dynamic, will be children of the inventories directory
- Group and other variables
- the group_vars directory can be relative to an inventory file, no matter where the playbook is .. since I like to add some structure to my playbook organization, it makes sense to put it here
- variable files (e.g. Ansible Vault files) that need to be explicitly loaded go in the vars directory
- IN THEORY: you could also put a host_vars directory here .. but as y’all know, host_vars are the devil
- Let me explain my logic here .. In my work, I essentially perform five functions: 1) provision infrastructure, 2) configure infrastructure, 3) deploy to infrastructure, 4) test infrastructure (and other things), and 5) save the world (i.e. ad hoc tasks) .. Knowing this:
- resource roles: these are the bits and pieces that make up a useful piece of infrastructure .. for example, we have resource roles for AEM, Bitbucket, CPANm, .forward, Java (1.6, 1.7, and 1.8), Apache, Netcat, Nagios, NRPE, etc .. all solid implementations .. all reusable ..
- STYLE ALERT: we DON’T use Ansible Galaxy directly .. we often refer to it for inspiration, but always end up using our own implementation
- primary roles: this is a “completed” piece of infrastructure .. the sum of the (resource) parts .. the cherry on top .. for example, we have primary roles for user_web_api_server, internal_dns_server, aem_author_server, aem_publish_server, etc ..
- action roles: this is what I “do” to my infrastructure that doesn’t maintain any real “resource” .. for example, we have action roles for silencing/unsilencing Nagios, deploying code, restarting application servers, taking data centers offline, etc ..
The view from 30 feet:
- PROBLEM: “I want my playbooks to be able to refer to my roles in a way that is consistent and easy” .. Great idea !! The problem is, your playbooks and roles are gonna be all over the place — so you can’t take advantage of the “relative referencing” you can do in a traditional project structure .. Let’s also assume you don’t want to define an absolute path in the config file ..
- SOLUTION: Hello bootstrap !! Take a path that will always be consistent (inventory_dir) and use it to define other paths with some regex magic .. When finished, shove it all into an external variables file ..
- FINALLY: Refer to the bootstrap, early and often, in every playbook you create ..
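The "regex magic" in the bootstrap boils down to something like the Python sketch below .. It only illustrates the path arithmetic (the real thing lives in an Ansible variables file) .. The `bootstrap_paths` name, the `/srv/ansible` root, and the `roles` directory name are assumptions for illustration, not the actual project layout ..

```python
import re


def bootstrap_paths(inventory_dir):
    """Derive other project paths from the one path that is always known.

    Assumes every inventory lives somewhere under <project>/inventories.
    """
    # Strip "/inventories" and anything after it to recover the project root.
    project_root = re.sub(r"/inventories(/.*)?$", "", inventory_dir)
    return {
        "project_root": project_root,
        "roles_dir": project_root + "/roles",  # hypothetical directory name
        "vars_dir": project_root + "/inventories/vars",
    }


paths = bootstrap_paths("/srv/ansible/inventories/ec2")
print(paths["project_root"])
print(paths["roles_dir"])
```

Because everything hangs off inventory_dir, a playbook can sit anywhere in the tree and still resolve the same paths, with no absolute path in the config file ..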
Again, this project setup works for me .. “Me” being myself and 4 other sys admins ..
Anywho — I need to get back to work on that “get me beer” playbook ..
Hal is my Hubot chatbot .. He’s awesome !! He gets me beer !!
He also does things like restart app servers, deploy code, and show me pictures of grumpy cats .. He’s so cool, I’ve started making non-humans talk to him .. “Greg, what do you mean ??” .. Well, let me show you ..
- I have a Nagios server
- It monitors (allthethings)
- When the “logged in users” alert is triggered, Nagios sends a message to my chat service using hipsaint
- “logged in users” is a monitor I have that alerts me when more than 3 users are logged into a server
- I see the alert and the server in question
- I SSH into the server
- I type who
- I then determine if I need to care
- If not, move on with my life
- If so, dig deeper
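For the curious, a “logged in users” check can be as small as the Python sketch below .. It is not the actual plugin, just the idea: parse the output of `who`, count the users, and hand Nagios a standard exit code (0 = OK, 2 = CRITICAL) .. Counting distinct user names and treating a breach as CRITICAL are both assumptions ..

```python
def check_logged_in_users(who_output, threshold=3):
    """Return (nagios_exit_code, message) for the given `who` output."""
    # The first column of each `who` line is the user name.
    users = sorted(
        {line.split()[0] for line in who_output.splitlines() if line.strip()}
    )
    if len(users) > threshold:
        return 2, "CRITICAL - %d users logged in: %s" % (
            len(users),
            ", ".join(users),
        )
    return 0, "OK - %d users logged in" % len(users)


# Made-up sample of what `who` prints on a busy server.
sample = """\
alice    pts/0        2016-11-10 09:01 (10.0.0.5)
bob      pts/1        2016-11-10 09:02 (10.0.0.6)
carol    pts/2        2016-11-10 09:03 (10.0.0.7)
dave     pts/3        2016-11-10 09:04 (10.0.0.8)
"""
code, message = check_logged_in_users(sample)
print(code, message)
```

A real plugin would run `who` itself and call `sys.exit(code)` after printing the message, since the exit code is what Nagios acts on ..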
The thing is, I have more than 1,200 active monitors .. That means Nagios can and will send many, many messages to my chat service — depending on the day .. So how can I make my life easier ??
Here’s an easy one: ask Hal who’s on a server ..
My stack is HipChat -> Hubot -> Jenkins -> Ansible .. That means I can do damn near anything I want, all from my chat client ..
Remember what I said earlier — about making non-humans talk to Hal ?? What I did was create a Nagios event handler that sends a message to my chat service using HipChat CLI .. Therefore, I AM NOT asking Hal who’s on a server, it’s NAGIOS WHO IS doing it ..
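Here is a rough Python sketch of the decision an event handler like that has to make .. Nagios can pass the service state and state type to an event handler as arguments; the sketch only asks Hal once the problem is a confirmed (HARD) one .. The `hal_message` name and the exact “hal who is on ...” phrasing are hypothetical, and actually posting the line to the chat room (HipChat CLI in my case) is left out ..

```python
def hal_message(host, state, state_type):
    """Return the chat line to post, or None if Hal should be left alone."""
    # Only ask once Nagios has confirmed the problem (HARD state),
    # so a single flaky check does not spam the chat room.
    if state in ("WARNING", "CRITICAL") and state_type == "HARD":
        return "hal who is on %s" % host
    return None


print(hal_message("web-01", "CRITICAL", "HARD"))
print(hal_message("web-01", "CRITICAL", "SOFT"))
```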
It doesn’t stop there !! You can create scripted Splunk alerts as well .. Before you know it, you will be making (allthethings) talk to Hal ..
So one of our AEM nodes was freaking out the other day .. No, not the election results .. Some code was deployed to it that had runaway processes, thus gorging itself on CPU and memory .. EEEKK !! What to do ?? If you’ve been around AEM for a while, you know how we love our raw thread dumps .. That being said, I really dislike the process for obtaining them:
- Log into the node
- Get the Java PID
- Execute jstack and output to a log file
- Repeat every 10 seconds for at least one minute
- Compress the log file
- Share the compressed log file
- Typically via email
So how can I automate this ?? Easy !!
Create a Bash script that will generate the thread dumps for you ..
Create an Ansible playbook that will execute the script, compress the log file, and email it
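The manual steps above can be sketched in a few lines of Python .. To be clear, this is NOT the Bash script or the playbook, just the same loop written out: call `jstack` on the PID, wait 10 seconds, repeat, and write everything into one compressed log .. The `collect_thread_dumps` name and the banner format are made up for illustration, and looking up the PID and emailing the file are left out ..

```python
import gzip
import subprocess
import time


def collect_thread_dumps(pid, out_path, count=7, interval=10,
                         run_jstack=None, sleep=time.sleep):
    """Write `count` thread dumps, `interval` seconds apart, to a gzip file.

    The defaults (7 dumps, 10 seconds apart) span one full minute.
    `run_jstack` and `sleep` are injectable so the loop can be dry-run.
    """
    if run_jstack is None:
        def run_jstack(p):
            # `jstack <pid>` prints the thread dump to stdout.
            result = subprocess.run(
                ["jstack", str(p)],
                capture_output=True, text=True, check=True,
            )
            return result.stdout

    with gzip.open(out_path, "wt") as fh:
        for i in range(count):
            fh.write("=== thread dump %d of %d ===\n" % (i + 1, count))
            fh.write(run_jstack(pid))
            fh.write("\n")
            if i < count - 1:  # no need to wait after the last dump
                sleep(interval)
    return out_path
```

Calling `collect_thread_dumps(12345, "/tmp/threaddumps.log.gz")` (with a made-up PID and path) would leave one compressed file ready to share ..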
We’re not dumb .. We don’t live in a cave .. As infrastructure “developers” — we KNOW we should be writing tests to validate our “code” .. Buuuuttt — how many of us are actually doing this ??
FEAR NOT !! Because it’s really not that hard if you break it down .. Think about it:
- I have a node
- I “do things” to that node to make it “more awesome”
- I validate “those things”
- I move on with my life
It’s the “I validate” bullet we need to focus on .. So then, here we go:
Pick a Testing Framework
Sooo many choices .. Here are some popular ones:
Get it on the Server
OK .. Here we can have a lengthy beer talk about WHAT framework is most supreme and HOW to best execute it (locally vs over-the-wire) — but it’s Friday — and my appetite for prolonged discussions about personal opinion is nil .. So let’s just agree to put ServerSpec on a node with the intention of running tests locally .. Here’s an example of a role that will do just that ..
Write Some Tests
We’re not splitting the atom here .. we’re just writing some simple role tests that can validate our intended work .. I mean, look how easy this is !!
Get Those Tests on the Server
This is where you have to get a little creative in order to avoid repeating yourself .. I need to find a way to get a role’s tests onto a node .. “Hey Greg, haven’t you heard of the synchronize module ??” Yeah, yeah, but I want to be DRY .. I want one solution that works for all roles ..
Eureka !! I’ve got it !!
I’ll create a role whose sole purpose is to “copy” a referring role’s tests to the target node, and then declare this “copier role” as a dependency of the referring role ..
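Stripped of the Ansible bits, the “copier” idea looks roughly like this Python sketch .. One function, parameterized by the referring role, so nothing has to be repeated per role .. The `copy_role_tests` name, the `tests` directory inside each role, and the destination layout are assumptions for illustration, and the copy here is local rather than to a remote node ..

```python
import shutil
from pathlib import Path


def copy_role_tests(role_dir, dest_root):
    """Copy <role_dir>/tests to <dest_root>/<role name>, if the role has tests.

    Returns the destination path, or None when there is nothing to copy.
    """
    role_dir = Path(role_dir)
    source = role_dir / "tests"  # hypothetical per-role tests directory
    target = Path(dest_root) / role_dir.name
    if not source.is_dir():
        return None
    # dirs_exist_ok lets repeat runs refresh tests that were copied before.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

Every role that wants its tests shipped just “refers” to the same function with its own path, which is the whole point of doing it as a dependency ..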
Run the Tests
OK .. So I lied a little .. You know that role whose “sole purpose” is to “copy” a referring role’s tests to the target node ?? I added more to it .. What can I say, it was lonely .. The role will also double as a way to execute your ServerSpec tests, when you want it to ..
Wrapping it Up
Like I said — it’s beer Friday — so here’s the complete solution .. I hope this helps make automated testing “less scary” .. Before you know it, you will be sipping Mai Tais with Mizzy, discussing the effectiveness of unit vs acceptance tests ..
Why don’t ops teams choose an “automation framework” for their “development” ?? Think about it .. How many Bash/Perl/Python/Ruby scripts does an ops team create and maintain in a year ?? 10 .. 100 .. 1000 ??
“But Greg, they do !! It’s called Chef/Puppet/Ansible/Salt/etc ..”
ZING !! Got me there .. But that’s not what I am talking about ..
Yes, ops teams use configuration management tools to MANAGE the scripts they create and maintain, but I’m talking about defining a framework in which ALL scripts would be developed .. Good (even bad) dev teams do it all the time .. Come on, say the names with me: Spring, Struts, CodeIgniter, Django, Rails, Express, AngularJS, etc .. They’re everywhere !!
So what does that make “us” ?? Good ?? Bad ?? Lazy ?? Adaptable ?? Wouldn’t ops teams realize the same benefits as dev teams if they chose a framework for their development ?? That seems like a logical conclusion, yet I don’t see the same requirement being demanded from ops teams .. Why is that ??
Sorry .. I don’t have an answer for you, but the more I do ops, the more I am convinced there is an opportunity here ..