Make Everyone Talk to Hal !!

Hal is my Hubot chatbot .. He’s awesome !! He gets me beer !!

hal beer me

He also does things like restart app servers, deploy code, and show me pictures of grumpy cats .. He’s so cool, I’ve started making non-humans talk to him .. “Greg, what do you mean ??” .. Well, let me show you ..


  1. I have a Nagios server
  2. It monitors (allthethings)
  3. When the “logged in users” alert is triggered, Nagios sends a message to my chat service using hipsaint
    1. “logged in users” is a monitor I have that alerts me when more than 3 users are logged into a server
  4. I see the alert and the server in question
  5. I SSH into the server
  6. I type who
  7. I then determine if I need to care
    1. If not, move on with my life
    2. If so, dig deeper
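
For the curious, a “logged in users” monitor like the one in the list above could be sketched roughly like this .. It is not my actual check, just a minimal Python illustration that counts the lines of `who` output and answers with the usual Nagios plugin exit codes (0 for OK, 2 for CRITICAL, 3 for UNKNOWN) .. The function names and the message wording are made up for the example ..

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def count_users(who_output):
    """Count sessions in `who` output: one non-empty line per session."""
    return sum(1 for line in who_output.splitlines() if line.strip())


def check_logged_in_users(who_output, max_users=3):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    users = count_users(who_output)
    if users > max_users:
        return CRITICAL, f"USERS CRITICAL - {users} users logged in"
    return OK, f"USERS OK - {users} users logged in"


def main():
    """Run `who` on this node and return the plugin exit code."""
    try:
        result = subprocess.run(["who"], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"USERS UNKNOWN - could not run who: {exc}")
        return UNKNOWN
    code, message = check_logged_in_users(result.stdout)
    print(message)
    return code
```

To use it as a plugin you would finish the script with `sys.exit(main())` .. The threshold of 3 comes straight from the monitor description above ..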

The thing is, I have more than 2,200 active monitors .. That means Nagios can and will send many, many messages to my chat service — depending on the day .. So how can I make my life easier ??

Here’s an easy one: ask Hal who’s on a server ..

hal whos on

My stack is HipChat -> Hubot -> Jenkins -> Ansible .. That means I can do damn near anything I want, all from my chat client ..

Remember what I said earlier — about making non-humans talk to Hal ?? What I did was create a Nagios event handler that sends a message to my chat service using HipChat CLI .. Therefore, I AM NOT asking Hal who’s on a server, it’s NAGIOS WHO IS doing it ..

nagios hal whos on

It doesn’t stop there !! You can create scripted Splunk alerts as well .. Before you know it, you will be making (allthethings) talk to Hal ..
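
If you want a feel for what such an event handler decides, here is a minimal Python sketch .. It assumes the handler receives the Nagios macros `$HOSTNAME$`, `$SERVICESTATE$` and `$SERVICESTATETYPE$` as arguments, and that Hal’s command looks like `hal whos on <hostname>` .. Both the function names and that exact command syntax are assumptions for illustration .. Sending is left to a `send` callable (a stand-in for whatever chat CLI you use), so nothing here pretends to know the HipChat CLI’s options ..

```python
def build_hal_message(hostname, state, state_type):
    """Return the chat message for Hal, or None when nothing should be sent.

    The arguments mirror the Nagios macros $HOSTNAME$, $SERVICESTATE$
    and $SERVICESTATETYPE$.
    """
    # Only speak up once Nagios has confirmed the problem (HARD state).
    if state == "CRITICAL" and state_type == "HARD":
        # Assumed command syntax: "hal whos on <hostname>".
        return f"hal whos on {hostname}"
    return None


def handle_event(hostname, state, state_type, send=print):
    """Build the message and pass it to `send`, a stand-in for the chat CLI."""
    message = build_hal_message(hostname, state, state_type)
    if message is not None:
        send(message)
    return message
```

Waiting for a HARD state keeps Nagios from pestering Hal on every soft retry, which matters when you have thousands of monitors ..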


Qik-n-EZ: Multiline Notes for Nagios Service Definitions

OMG !! It’s still hip to say that — right ??

Anyway .. While I consider myself to be reasonably intelligent, I still find myself doing dopey things from time to time .. For example: saying OMG .. Another dopey thing I have been doing for a very, very long time: writing run-on sentence notes for my Nagios service definitions .. I am ashamed to admit how many times I have tried to “fix” this, only to error out during the pre-flight check .. Yes, I am aware of Google, but I’ve never found anything definitive ..

And then it hit me like a bolt of lightning: Bash style line breaks ..

So obvious, yet so elusive ..

Qik-n-EZ: Collect AEM Thread Dumps and Email via Ansible

So one of our AEM nodes was freaking out the other day .. No, not the election results .. Some code was deployed to it that had runaway processes, thus gorging itself on CPU and memory .. EEEKK !! What to do ?? If you’ve been around AEM for a while, you know how we love our raw thread dumps .. That being said, I really dislike the process for obtaining them:

  1. Log into the node
  2. Get the Java PID
  3. Execute jstack and output to a log file
    1. Repeat every 10 seconds for at least one minute
  4. Compress the log file
  5. Share the compressed log file
    1. Typically via email

So how can I automate this ?? Easy !!

Create a Bash script that will generate the thread dumps for you ..

Create an Ansible playbook that will execute the script, compress the log file, and email it
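
My script is Bash, but the dump loop itself is simple enough to sketch in a few lines of Python .. This is only an illustration of the “jstack every 10 seconds for at least one minute” step, not the script itself .. The function name, the log header format and the defaults are assumptions .. The `run` and `sleep` parameters exist so the loop can be exercised without a real JVM ..

```python
import subprocess
import time


def collect_thread_dumps(pid, log_path, count=7, interval=10,
                         run=subprocess.run, sleep=time.sleep):
    """Append `count` jstack dumps of `pid` to `log_path`, `interval` seconds apart.

    With the defaults, 7 dumps at 10 second intervals cover one full minute.
    """
    with open(log_path, "a") as log:
        for i in range(count):
            result = run(["jstack", str(pid)], capture_output=True, text=True)
            log.write(f"=== thread dump {i + 1} of {count} ===\n")
            log.write(result.stdout)
            log.flush()
            if i < count - 1:  # no need to wait after the last dump
                sleep(interval)
    return log_path
```

Something like `collect_thread_dumps(12345, "/tmp/threads.log")` would then leave a single log file ready to be compressed and emailed by the playbook ..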


Ansible & ServerSpec, Sitting in a Tree, K-I-S-S-I-N-G

We’re not dumb .. We don’t live in a cave .. As infrastructure “developers” — we KNOW we should be writing tests to validate our “code” .. Buuuuttt — how many of us are actually doing this ??

FEAR NOT !! Because it’s really not that hard if you break it down .. Think about it:

  • I have a node
  • I “do things” to that node to make it “more awesome”
  • I validate “those things”
  • I move on with my life

It’s the “I validate” bullet we need to focus on .. So then, here we go:

Pick a Testing Framework

Sooo many choices .. Here are some popular ones:

Get it on the Server

OK .. Here we can have a lengthy beer talk about WHAT framework is most supreme and HOW to best execute it (locally vs over-the-wire) — but it’s Friday — and my appetite for prolonged discussions about personal opinion is nil .. So let’s just agree to put ServerSpec on a node with the intention of running tests locally .. Here’s an example of a role that will do just that ..

Write Some Tests

We’re not splitting the atom here .. We’re just writing some simple role tests that can validate our intended work .. I mean, look how easy this is !!


Get Those Tests on the Server

This is where you have to get a little creative in order to avoid repeating yourself .. I need to find a way to get a role’s tests onto a node .. “Hey Greg, haven’t you heard of the synchronize module ??” Yeah, yeah, but I want to be DRY .. I want one solution that works for all roles ..

Eureka !! I’ve got it !!

I’ll create a role whose sole purpose is to “copy” a referring role’s tests to the target node, and then declare this “copier role” as a dependency of the referring role ..

Run the Tests

OK .. So I lied a little .. You know that role whose “sole purpose” is to “copy” a referring role’s tests to the target node ?? I added more to it .. What can I say, it was lonely .. The role will also double as a way to execute your ServerSpec tests, when you want it to ..

Wrapping it Up

Like I said — it’s beer Friday — so here’s the complete solution .. I hope this helps make automated testing “less scary” .. Before you know it, you will be sipping Mai Tais with Mizzy, discussing the effectiveness of unit vs acceptance tests ..

Deep Thoughts – Why No Automation Framework ??

Why don’t ops teams choose an “automation framework” for their “development” ?? Think about it .. How many Bash/Perl/Python/Ruby scripts does an ops team create and maintain in a year ?? 10 .. 100 .. 1000 ??

“But Greg, they do !! It’s called Chef/Puppet/Ansible/Salt/etc ..”

ZING !! Got me there .. But that’s not what I am talking about ..

Yes, ops teams use configuration management tools to MANAGE the scripts they create and maintain, but I’m talking about defining a framework in which ALL scripts would be developed .. Good (even bad) dev teams do it all the time .. Come on, say the names with me: Spring, Struts, CodeIgniter, Django, Rails, Express, AngularJS, etc .. They’re everywhere !!

So what does that make “us” ?? Good ?? Bad ?? Lazy ?? Adaptable ?? Wouldn’t ops teams realize the same benefits as dev teams if they chose a framework for their development ?? That seems like a logical conclusion, yet I don’t see the same requirement being demanded from ops teams .. Why is that ??

Sorry .. I don’t have an answer for you, but the more I do ops, the more I am convinced there is an opportunity here ..


Reusable Hubot Event Listener to Build Jenkins Jobs

Jenkins (thank you CloudBees !!) is my chosen automation platform, often triggering an Ansible playbook (thank you Michael !!) to automate some mundane task ..  I also use HipChat and have implemented a Hubot (thank you GitHub !!) chatbot .. Wouldn’t it be nice to integrate it all ?? For example, in HipChat:

@hal restart tomcat in dev
  1. Which Hubot recognizes
  2. Who then builds a Jenkins job
  3. Which then executes some Ansible playbook
  4. Which then restarts Tomcat in the development environment

Actually, it’s not that difficult .. For this post, I won’t go into the details on how to set up Hubot (this might help) or how to restart the Tomcat service (this might help) .. What I am most interested in is how to:

  1. Create a reusable way to build any Jenkins job
    • Since Hubot is a Node.js application, an event listener sounds logical
  2. Invoke this reusable way
    • i.e. event trigger

Here is an example of a Hubot event listener that can build a Jenkins job with or without parameters ..

Here is an example of a Hubot command that can trigger the previously defined event listener ..
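
The real scripts are Hubot scripts running on Node.js, but the listener/trigger pattern can be sketched in Python to show the moving parts .. Treat this as an illustration under assumptions: the `EventBus` class is a tiny stand-in for Hubot’s `robot.on` and `robot.emit`, and the event name `jenkins:build`, the job name and the Jenkins URL are all made up .. The one solid piece is the URL shape: Jenkins exposes `/job/<name>/build` for plain jobs and `/job/<name>/buildWithParameters` for parameterized ones ..

```python
from urllib.parse import quote, urlencode


def jenkins_build_url(base_url, job, params=None):
    """Return the Jenkins remote-trigger URL for `job`, with or without parameters."""
    path = f"{base_url.rstrip('/')}/job/{quote(job)}"
    if params:
        return f"{path}/buildWithParameters?{urlencode(params)}"
    return f"{path}/build"


class EventBus:
    """Tiny stand-in for Hubot's robot.on / robot.emit."""

    def __init__(self):
        self._listeners = {}

    def on(self, event, listener):
        self._listeners.setdefault(event, []).append(listener)

    def emit(self, event, payload):
        # Call every listener registered for this event and collect the results.
        return [listener(payload) for listener in self._listeners.get(event, [])]


bus = EventBus()

# The reusable listener: the one place that knows how to build any Jenkins job.
# It only returns the URL here; a real listener would POST to it with credentials.
bus.on("jenkins:build",
       lambda p: jenkins_build_url(p["base_url"], p["job"], p.get("params")))
```

A chat command such as “restart tomcat in dev” then only has to emit the event with a job name and its parameters, and every new command reuses the same listener ..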

I hope this helps !!

Scripting AWS EBS Volume Snapshots

So then, let’s talk about the use case ..

  • I have a Linux node
  • On said node, I have attached and mounted some EBS volumes
    • Think Jenkins (/var/lib/jenkins), BitBucket (/opt/bitbucket), OSSEC (/var/ossec), etc ..
  • I need to be able to create EBS snapshots of said EBS volumes on a regular basis, without majorly disrupting the service
    • I don’t want to have to stop, snapshot, and start the related service
  • I don’t always know what the EBS volume ID is, nor do I want to know
  • Sometimes I want to wait until the EBS snapshot is done before I continue doing other things
  • I want to tag the EBS snapshot something meaningful
    • jenkins.example.net_var_lib_jenkins

So how am I going to solve this ??

  1. Find a way to snapshot an EBS volume
  2. Find a way to determine the state of an EBS snapshot
    1. This is so I can wait until the EBS snapshot is done
  3. Find a way to “pause” the service without major disruption
    1. Hello FSFREEZE
  4. Find a way to map a device to an EBS volume
    1. It’s a little more complicated than that, but this is where the data is
  5. Find a way to tag an EBS snapshot
    1. Hello CREATE-TAGS
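
To make the “map a device to an EBS volume” step a bit more concrete, here is a hedged Python sketch .. My script uses the AWS CLI and JQ; this version builds the same kind of `aws ec2 describe-volumes` call (the `attachment.instance-id` and `attachment.device` filters are real) and parses the JSON in Python instead .. The function names are invented for the example, and it assumes you already know the attachment’s device name, which is not always the name the OS shows ..

```python
import json


def describe_volumes_command(instance_id, device):
    """AWS CLI call listing the volume attached to `instance_id` at `device`."""
    return [
        "aws", "ec2", "describe-volumes",
        "--filters",
        f"Name=attachment.instance-id,Values={instance_id}",
        f"Name=attachment.device,Values={device}",
        "--output", "json",
    ]


def volume_id_from_json(raw_json):
    """Pull the single VolumeId out of describe-volumes JSON output."""
    volumes = json.loads(raw_json).get("Volumes", [])
    if len(volumes) != 1:
        # Zero or several matches means the device mapping is ambiguous.
        raise LookupError(f"expected exactly one volume, found {len(volumes)}")
    return volumes[0]["VolumeId"]
```

Failing loudly when the lookup returns anything other than one volume is deliberate .. Snapshotting the wrong volume is worse than not snapshotting at all ..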

Putting it all together ..

  • I created a Bash script that is managed by Ansible and utilizes the AWS CLI and JQ
  • It should be run ON THE NODE where the EBS volume is attached and mounted
    • /root/bin/ makes sense to me
  • It allows you to pass as options:
    • -d <device path> — the device path of the EBS volume you will freeze, EBS snapshot, and unfreeze
    • -c “<EBS snapshot description>” — is what it says
    • -w [optional]  — if defined, script will not finish until EBS snapshot is done
  • Sequence of events is as follows:
    1. Get the EBS Volume ID of the provided device path
    2. Get the mounted file system path of the provided device path
    3. Freeze the mounted file system path
    4. Create an EBS snapshot of the discovered EBS Volume ID
    5. Unfreeze the mounted file system path
    6. Tag the EBS snapshot with a pretty name
    7. Wait until done [optional], otherwise takes about 5 seconds total
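
The sequence above can be sketched as plain command lists .. Again, this is Python for illustration rather than my Bash script, and the function names are hypothetical .. The commands themselves are real: `fsfreeze -f` and `fsfreeze -u`, `aws ec2 create-snapshot`, `aws ec2 create-tags`, and `aws ec2 wait snapshot-completed` .. Using a `Name` tag built from the hostname plus the mount path is an assumption based on the `jenkins.example.net_var_lib_jenkins` example ..

```python
def snapshot_name(hostname, mount_path):
    """Build a tag value like jenkins.example.net_var_lib_jenkins."""
    return hostname + mount_path.replace("/", "_")


def freeze_snapshot_commands(volume_id, mount_path, description):
    """Commands for freeze, snapshot and unfreeze, in that order."""
    return [
        ["fsfreeze", "-f", mount_path],
        ["aws", "ec2", "create-snapshot",
         "--volume-id", volume_id, "--description", description],
        ["fsfreeze", "-u", mount_path],
    ]


def post_snapshot_commands(snapshot_id, name, wait=False):
    """Commands to tag the snapshot and, optionally, wait until it is done."""
    commands = [
        ["aws", "ec2", "create-tags",
         "--resources", snapshot_id, "--tags", f"Key=Name,Value={name}"],
    ]
    if wait:
        commands.append(
            ["aws", "ec2", "wait", "snapshot-completed",
             "--snapshot-ids", snapshot_id])
    return commands
```

Whatever runs these should make sure the unfreeze happens even when the snapshot call fails, otherwise the file system stays frozen and the service hangs ..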

script output

