December 4, 2016

Day 4 - Change Management: Keep it Simple, Stupid

Written By: Chris McDermott
Edited By: Christopher Webber (@cwebber)

I love change management. I love the confidence it gives me. I love the traceability–how it’s effectively a changelog for my environment. I love the discipline it instills in my team. If you do change management right, it allows you to move faster. But your mileage may vary.

Not everyone has had a good experience with change management. In caricature, this manifests as the Official Change Board that meets bi-monthly and requires all participants to be present for the full meeting as every proposed plan is read aloud from the long and complicated triplicate copy of the required form. Questions are asked and answered; final judgements eventually rendered. Getting anything done takes weeks or months. People have left organizations because of change management gone wrong.

I suppose we really should start at the beginning, and ask “Why do we need change management at all?” Many teams don’t do much in the way of formal change process. I’ve made plenty of my own production changes without any kind of change management. I’ve also made the occasional human error along the way, with varying degrees of embarrassment.

I challenge you to try a simple exercise. Start writing down your plan before you execute a change that might impact your production environment. It doesn’t have to be fancy – use notepad, or vim, or a pad of paper, or whatever is easiest. Don’t worry about approval or anything. Just jot down three things: step-by-step what you’re planning to do, what you’ll test when you’re done, and what you would do if something went wrong. This is all stuff you already know, presumably. So it should be easy and fast to write it down somewhere.

When I go through this exercise, I find that I routinely make small mistakes, or forget steps, or realize that I don’t know where the backups are. Most mistakes are harmless, or they’re things that I would have caught myself as soon as I tried to perform the change. But you don’t always know, and some mistakes can be devastating.

The process of writing down my change plan, test plan, and roll-back plan forces me to think through what I’m planning carefully, and in many cases I have to check a man page or a hostname, or figure out where a backup file is located. And it turns out that doing all that thinking and checking catches a lot of errors. If I talk through my change plan with someone else, well that catches a whole bunch more. It’s amazing how much smarter two brains are, compared to just one. Sometimes, for big scary changes, I want to run the damn thing past every brain I can find. Heh, in fact, sometimes I show my plan to people I’m secretly hoping can think of a better way to do it. Having another human being review the plan and give feedback helps tremendously.

For me, those are the really critical bits. Write down the complete, detailed plan, and then make sure at least one other person reviews it. There’s other valuable stuff you can do like listing affected systems and stakeholders, and making notification and communication part of the planning process. But it’s critical to keep the process as simple, lightweight, and easy as possible. Use a tool that everyone is already using – your existing ticketing software, or a wiki, or any tool that will work. Figure out what makes sense for your environment, and your organization.

When you can figure out a process that works well, you gain some amazing benefits. There’s a record of everything that was done, and when, and by whom. If a problem manifests 6 or 12 or 72 hours after a change was made, you have the context of why the change was made, and the detailed test plan and roll-back plan right there at your fingertips. Requiring some level of review means that multiple people should always be aware of what’s happening and can help prevent knowledge silos. Calling out stakeholders and communication makes it more likely that people across your organization will be aware of relevant changes being made, and unintended consequences can be minimized. And of course you also reduce mistakes, which is benefit enough all by itself. All of these things combined allow high-functioning teams to move faster and act with more confidence.

I can give you an idea of what this might look like in practice. Here at SendGrid, we have a Kanban board in Jira (a tool that all our engineering teams were already using when we rolled out our change management process). If an engineer is planning a change that has the potential to impact production availability or customer data, they create a new issue on the Change Management Board (CMB). The template has the following fields:

  • Summary
  • Description
  • Affected hosts
  • Stakeholders
  • Change plan
  • Test plan
  • Roll-back plan
  • Roll-back verification plan
  • Risks

All the fields are optional except the Summary, and several of them have example text giving people a sample of what’s expected. When the engineer is happy with the plan, they get at least one qualified person to review it. That might be someone on their team, or it might be a couple of people on different teams. Engineers are encouraged to use their best judgement when selecting reviewers. Once a CMB has been approved (the reviewer literally just needs to add a “LGTM” comment on the Jira issue), it is dragged to the “Approved” column, and then the engineer can move it across the board until they’re done with the change. Each time the CMB’s status in Jira changes, it automatically notifies a HipChat channel where we announce things like deploys. For simple changes, this whole process can happen in the space of 10 or 15 minutes. More complicated ones can take a day or two, or in a few cases weeks (usually indicative of complex inter-team dependencies). The upper bound on how long it has taken is harder to calculate. We’ve had change plans that were written and sent to other teams for review, which then spawned discussions that spawned projects that grew into features or fixes, and the original change plan withered and died. Sometimes that’s the better choice.

I don’t think we have it perfect yet; we’ll probably continue to tune it to our needs. Ours is just one possible solution among many. We’ve tried to craft a process that works for us. I encourage you to do the same.

December 3, 2016

Day 3 - Building Empathy: a devopsec story

Written By: Annie Hedgpeth (@anniehedgie)
Edited By: Kerim Satirli (@ksatirli)

’Twas the night before Christmas, and all through the office not a creature was stirring … except for the compliance auditors finishing up their yearly CIS audits.

Ahh, poor them. This holiday season, wouldn’t you love to give your security and compliance team a little holiday cheer? Wouldn’t you love to see a bit of peace, joy, and empathy across organizations? I was lured into technology by just that concept, and I want to share a little holiday cheer by telling you my story.

I’m totally new to technology, having made a pretty big leap of faith into a career change. The thing that attracted me to technology was witnessing this display of empathy firsthand. My husband works for a large company that specializes in point-of-sale software, and he’s a very effective driver of DevOps within his organization. He was ready to move forward with automating all of the things and bringing more of the DevOps cheer to his company, but his security and compliance team was, in his eyes, blocking his initiatives - and for good reason!

My husband’s year-long struggle with getting his security and compliance team on board with automation was such an interesting problem to solve for me. He was excited about the agile and DevOps methodologies that he had adopted and how they would bring about greater business outcomes by increasing velocity. But the security and compliance team was still understandably hesitant, especially with news stories of other companies suffering massive data breaches and losing millions of dollars. I would remind my husband that they were just trying to do their jobs, too. The security and compliance folks aren’t trying to be grinches. They’re just doing their job, which is to defend, not to intentionally block.

So I urged him to figure out what they needed and wanted (ENTER: Empathy). And what he realized is that they needed to understand what was happening with the infrastructure. I can see how all of the automated configuration management could have caused a bit of hesitation on behalf of security and compliance. They wanted to be able to inspect everything more carefully and not feel like the automation was creating vulnerability issues that were out of their control.

But the lightbulb turned on when they realized that they could code their compliance controls with a framework called InSpec. InSpec is an open-source framework owned by Chef but totally platform agnostic. The cool thing about it is that you don’t even need to have configuration management to use it, which makes it a great introduction for those who are new to DevOps or any sort of automation.

(Full-disclosure: Neither of us works for Chef/InSpec; we’re just big fans!)

You can run it locally or remotely, with nothing needing to be installed on the nodes being tested. That means you can store your InSpec test profile on your local machine or in version control and run it from the CLI to test your local machine or a remote host.

# run test locally
inspec exec test.rb

# run test on remote host on SSH
inspec exec test.rb -t ssh://user@hostname

# run test on remote Windows host on WinRM
inspec exec test.rb -t winrm://Administrator@windowshost --password 'your-password'

# run test on Docker container
inspec exec test.rb -t docker://container_id

# run with sudo
inspec exec test.rb --sudo [--sudo-password ...] [--sudo-options ...] [--sudo_command ...]

# run in a subshell
inspec exec test.rb --shell [--shell-options ...] [--shell-command ...]

The security and compliance team’s fears were finally allayed. All of the configuration automation that my husband was doing had allowed him to see his infrastructure as code, and now the security and compliance team could see their compliance as code, too.

They began to realize that they could automate a huge chunk of their PCI audits and verify every time the application or infrastructure code changed instead of the lengthy, manual audits that they were used to!

Chef promotes InSpec as being human-readable and accessible for non-developers, so I decided to learn it for myself and document on my blog whether or not that was true for me, a non-developer. As I learned it, I became more and more of a fan and could see how it was not only accessible, but in a very simple and basic way, it promoted empathy between the security and compliance teams and the DevOps teams. It truly is at the heart of the DevSecOps notion. We know that for DevOps to deliver on its promise of creating greater velocity and innovation, silos must be broken down. The silos being torn down absolutely must include those of the security and compliance teams. The InSpec framework does that in such a simple way that it is easy to gloss over. I promise you, though, it doesn’t have to be complicated. So here it is…metadata. Let me explain.

If you’re a compliance auditor, then you’re used to working with PDFs, spreadsheets, docs, etc. One example of that is the CIS benchmarks. Here’s what a CIS control looks like.

And this is what that same control looks like when it’s being audited using InSpec. Can you see how the metadata provides a direct link to the CIS control above?

control "cis-1-5-2" do
  impact 1.0
  title "1.5.2 Set Permissions on /etc/grub.conf (Scored)"
  desc "Set permission on the /etc/grub.conf file to read and write for root only."
  describe file('/etc/grub.conf') do
    it { should be_writable.by('owner') }
    it { should be_readable.by('owner') }
  end
end

And then when you run a profile of controls like this, you end up with a nice, readable output like this.

When security and compliance controls are written this way, developers know what standards they’re expected to meet, and security and compliance auditors know that those standards are being tested! InSpec allows them to speak the same language. When someone from security and compliance looks at this test, they feel assured that “Control 1.5.2” is being tested and what its impact level is for future prioritization. They can also read plainly how that control is being audited. And when a developer looks at this control, they see a description that gives them a frame of reference for why this control exists in the first place.

And when the three magi of Development, Operations, and Security and Compliance all speak the same language, bottlenecks are removed and progress can be realized!

Since I began my journey into technology, I have found myself at 10th Magnitude, a leading Azure cloud consultancy. My goal today is to leverage InSpec in as many ways as possible to add safety to 10th Magnitude’s Azure and DevOps engagements so that our clients can realize the true velocity the cloud makes possible.

I hope this sparked your interest in InSpec as it is my holiday gift to you! Find me on Twitter @anniehedgie, and find much more about my journey with InSpec and technology on my blog.

December 2, 2016

Day 2 - DBAs, a priesthood no more

Written by: Silvia Botros (@dbsmasher)
Edited by: Shaun Mouton (@sdmouton)
Header image: Hermione casting a spell. Illustration by Frida Lundqvist.

Companies have had and needed Database Administrators for years. Data is one of a business’s most important assets. That means many businesses, once they grow to the point where they must be able to rapidly scale, need someone to make sure that asset is well managed, performant for the product needs, and available to restore in case of disasters.

In a traditional sense, the job of the DBA means she is the only person with access to the servers that host the data, the person to go to for creating new database clusters for new features, the person to design new schemas, and the only person to contact when anything database-related breaks in a production environment.

Because DBAs traditionally have such unique roles, their time is at a premium, and it becomes harder to think big picture when day-to-day tasks overwhelm. It is typical to resort to brittle tools like bash for all sorts of operational tasks in DBA land. Need a new DB setup from a clean OS install? Take, validate, or restore backups? Rotate partitions or stale data? When your most commonly used tool is bash scripting, everything looks like a nail. I am sure many readers are preparing tweets to tell me how powerful bash is, but please hold your commentary until after evaluating my reasoning.

Does all this sound like your job description as a DBA? Does the job description talk in detail about upgrading servers, creating and testing backups, and monitoring? Most typical DBA job postings will make sure to say that you have to configure and set up ‘multiple’ database servers (because the expectation is that DBAs hand craft them), and automate database management tasks with (hand crafted) scripts.

Is that really a scalable approach for what is often a team of one in a growing, fast paced organization?

I am here to argue that your job is not to perform and manage backups, create and manage databases, or optimize queries. You will do all these things in the span of your job but the primary goal is to make your business’s data accessible and scalable. This is not just for the business to run the current product but also to build new features and provide value to customers.

Why

You may want to ask, why would I do any of this? There is an argument for continuing to execute the DBA role traditionally: job security, right?

Many tech organizations nowadays do one or more of the following:
  • They are formed of many smaller teams
  • They provide features by creating many micro-services in place of one or a few larger services
  • They adopt agile methodologies to speed the delivery of features
  • They combine operations and engineering under one leadership
  • They embed operations engineers with developers as early as possible in the design process
A DBA silo within operations means the operations team is less empowered to help debug production issues in its own stack, is sometimes unable to respond and fix issues without assistance, and is frankly less credible when demanding closer and earlier collaboration with the engineering teams if they aren’t practicing what they preach inside Tech Ops.

So what can be done to bust that silo and make it easier for other folks to debug, help scale the database layer, and empower engineers to design services that can scale? Most up-and-coming shops have at most one in-house DBA. Can the one DBA be ‘present’ in all design meetings, approve every schema change, and be on call for a sprawling, ever-growing database footprint?

DBAs can no longer be gate keepers or magicians. A DBA can and should be a source of knowledge and expertise for engineers in an organization. She should help the delivery teams not just deliver features but to deliver products that scale and empower them to not fear the database. But how can a DBA achieve that while doing the daily job of managing the data layer? There are a number of ways you, the DBA, can set yourself up for excellence.

Configuration management

This is a very important one. DBAs tend to prefer old-school tools like bash for database setup. I alluded to this earlier and I have nothing against using bash itself. I use it a lot, actually. But it is not the right tool for cluster setup. Especially if the rest of ops is NOT using Bash to manage the rest of the architecture. It’s true that operations engineers know Bash too, but if they are managing the rest of the infrastructure with a tool like Chef or Puppet and the databases are managed mostly by hand-crafted scripts written by the DBA, you are putting an obstacle in the way of their helping when an urgent change is needed. Moreover, it becomes harder to help engineering teams self-serve and own the creation of the new clusters they need for new feature foo. You become the ‘blocker’ for completing work. Getting familiar with the configuration management at your company is also a two-way benefit. As you get familiar with how the infrastructure is managed, you get to know the team’s standards, get more familiar with the stack, and are able to collaborate on changes that ultimately affect the product scale. A DBA who is tuned into the engineering organization’s product and infrastructure as a whole is invaluable.

Runbooks

This is technically a subset of the documentation you have to write (you document things, right?!) but in my experience it has proven so useful that I feel it has to be pointed out separately. When I say runbooks I am specifically saying a document written for an audience that is NOT a DBA. There are a lot of production DB issues we may encounter as DBAs that are simple for us to debug and resolve. We tend to underestimate that muscle memory and we fall into the pattern of ‘just send me the page’ and we ‘take care of things’.

If your operations team is like mine where you are the only DBA, it probably means someone else on the team is the first line of defense when a DB-related event pages. Some simple documentation on how to do initial debugging and data collection can go a long way in making the rest of the operations team comfortable with the database layer and more familiar with how we monitor it and debug it. Even if that event still results in paging the DBA, slowly but surely, the runbook becomes a place for everyone to add acquired knowledge.

Additionally, I add a link to the related runbook section (use anchors!) to the page descriptions that go to the pager. This is incredibly helpful for someone being paged by a database host at 3 AM to find a place to start. These things may seem small, but in my experience they have gone a long way toward breaking down mental barriers for my operations team when they need to work on the database layer.

As a personal preference, I write these as markdown docs inside my Chef cookbook repositories. This falls seamlessly into a pull request, review and merge pattern, and it becomes an integral part of the database cookbooks. As engineering teams start creating their own, the runbooks become a familiar template as new database clusters spring up all over the place.

Visibility

We like our terminal screens. We love them. The most popular tools in MySQL land are still terminal tools that live directly on the db hosts and require prior knowledge of how to use them. I am talking about things like innotop and the MySQL shell. These are fine and still helpful but they are created for DBAs. If you do not want to be the gatekeeper to questions like “is there replication lag right now”, you need better tools that make the health of any cluster, now and historically, visible and easy to digest for all team members. I have a few examples in this arena:

Orchestrator

We use read replicas to spread read load away from the primary, which means that once lag hits a certain threshold, it becomes a customer support event. It is important to make it easy for anyone in the company to know at any given time whether any cluster is experiencing lag, which servers in that cluster are lagging, and whether any of the hosts has gone down. Orchestrator is a great tool here because it makes visualizing clusters and their health a browser window away.

Grafana/Graphite

Metrics for the DB layer need to live in the same place metrics for the rest of the infrastructure are. It is important for the team to be able to juxtapose these metrics side by side. And it is important to have an easy way to see historical metrics for any DB cluster. While you may have a personal preference for cacti or munin, or artisanal templates that you have written over the years, if the metrics you use to investigate issues are not in the same place as the rest of the infrastructure metrics it sets up a barrier for other busy engineers, and they’ll be less inclined to use your tooling over that which is in use elsewhere. Graphite is in wide use for ingesting metrics in modern infrastructure teams, and Grafana is a widely used dashboarding front-end for metrics and analytics.

Query performance

We use VividCortex to track our queries on critical clusters, and while this article isn’t meant to be an advertisement for a paid service, I will say that inspecting the effect of deploys and code changes on running queries and query performance needs to be possible without special access to logs or manually crunching them. If VividCortex isn’t a possibility (although, seriously, they are awesome!), there are other products and open source tools that can capture even just the slow log and put it in an easy-to-read web page for non-DBAs to inspect and see the effect of their code. The important point here is that if you provide the means to see the data, engineers will use that data and do their best to keep things efficient. But it is part of your job to make that access available and not a special DBA trick.

Fight the pager fatigue

A lot of organizations do not have scaling the database layer as a very early imperative in their stack design, and they shouldn’t. In the early days of a company, you shouldn’t worry about how you will throttle API calls if no one is using the API yet. But it becomes appropriate to consider a few years later, when the product has gained traction, the table behind that API call has grown from a few thousand rows to multiple millions, and a couple of customers have built cron jobs that flood that API every morning at 6 AM in your timezone.

It takes a lot of work to change the application layer of any product to protect the infrastructure, and in the interim, allowing spurious database activity to cause pager fatigue is a big danger to both you and the rest of the operations organization. Get familiar with tools like pt-kill that can be used in a pinch to keep a database host from having major downtime due to unplanned volume. Make the use of that tool known, and communicate the action and its effect to the stakeholder engineering team. Trying to absorb the pain from something you cannot directly change is unhealthy, and it ultimately doesn’t help the engineering teams learn how to deal with growing pains.

There are a lot of ways a DBA’s work is unique in comparison to the rest of the operations team, but that doesn’t mean it has to be a magical priesthood no one can approach. These steps go a long way in making your work transparent, but most important is approaching your work not as a gatekeeper to a golden garden of database hosts but as a subject matter expert who can provide advice, help grow the engineers you work with, and provide more value to the business than backups and query tuning (but those are fun too!).

Special thanks to the wonderful operations team at SendGrid who continue to teach me many things, and to Charity Majors for coining the title of this post.

December 1, 2016

Day 1 - Why You Need a Postmortem Process

Written by: Gabe Abinante (@gabinante)
Edited by: Shaun Mouton (@sdmouton)

Why Postmortems?

Failure is inevitable. As engineers building and maintaining complex systems, we likely encounter failure in some form on a daily basis. Not every failure requires a postmortem, but if a failure impacts the bottom line of the business, it becomes important to follow a postmortem process. I say “follow a postmortem process” instead of “do a postmortem”, because a postmortem should have very specific goals designed to prevent future failures in your environment. Simply asking the five whys to try and determine the root cause is not enough.

A postmortem is intended to fill in the knowledge gaps that inevitably exist after an outage:
  1. Who was involved / Who should have been involved?
  2. Was/is communication good between those parties?
  3. How exactly did the incident happen, according to the people who were closest to it?
  4. What went well / what did we do right?
  5. What could have gone better?
  6. What action items can we take from this postmortem to prevent future occurrence?
  7. What else did we learn?
Without a systematic examination of failure, observers can resort to baseless speculation.

Without an analysis of what went right as well as what went wrong, the process can be viewed as a complete failure.

Without providing key learnings and developing action items, observers are left to imagine that the problem will almost certainly happen again.

A Case Study of the 2012 Knight Capital SMARS Error

Knight Capital was a financial services firm engaging in high-frequency trading on the New York Stock Exchange and NASDAQ. It posted revenue of $1.404 billion in 2011, but went out of business by the end of 2012.

On August 1, 2012, Knight Capital deployed untested software containing an obsolete function to a production environment. The incident happened because an engineer deployed new code to only 7 of the 8 servers responsible for Knight’s automated routing system for equity orders. The code repurposed a flag that was formerly used to activate an old function known as “Power Peg”, which was designed to move stock prices higher and lower in order to verify the behavior of trading algorithms in a controlled environment. All orders sent with the repurposed flag to one of the servers triggered the obsolete code still present on that server. As a result, Knight’s trading activities caused a major disruption in the prices of 148 companies listed on the New York Stock Exchange. This caused the prices of certain stocks to jump by as much as 1200%. For the incoming parent orders that were processed by the defective code, Knight Capital sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes (1). Knight Capital took a pre-tax loss of $440 million. Despite a bailout the day after, this precipitated the collapse of Knight Capital’s stock, which lost 75% of its equity value.

I chose to write about this incident because there is an incredible body of writing about it, but actually remarkably little information or substance beyond the SEC release. The amount of material is certainly partially because the incident had such a high impact - few companies have a technical glitch that puts them out of business so quickly. I believe that there’s more to it however - this type of response is an attempt by the community to make sense of the incident because the company itself never released a public postmortem. This is an incredibly interesting case because a production bug and operational failure actually precipitated the collapse of a seemingly successful business - but the lack of a public postmortem exposed the company to all kinds of baseless speculation about lackadaisical attitudes towards change controls, testing, and production changes (see various citations, especially 11, 12). It would also seem that there was not an internal postmortem, or that it was not well circulated, based upon the Knight Capital CEO’s comments to the press (2).

As John Allspaw notes in his blog (3), one of the worst consequences of Knight’s reticence was news companies and bloggers using the SEC investigation as a substitute for a postmortem. This was harmful to the business and particularly to the engineers involved in the incident. The SEC document is blamey. It’s supposed to be blamey. It details the incident timeline and outlines procedures that should have been in place to prevent an error - and in doing so it focuses entirely on what was lacking from their outside perspective. What it doesn’t do is accurately explain how the event came to be. What processes WERE in place that the engineers relied upon? What change controls WERE being used? What went right and what will be done to ensure this doesn’t happen in the future?

Did Knight Capital go out of business because they lost a bunch of money in a catastrophic way? Sure. But their core business was still a profitable model - it’s conceivable that they could have received a bailout, continued operations, and gotten out of the hole created by this massive failure. Unfortunately, they failed to demonstrate to their investors and to the public that they were capable of doing so. By failing to release a public document, they allowed the narrative to be controlled by news sites and bloggers.

Taking a look at IaaS provider outages

Infrastructure providers are in a unique position where they have to release postmortems to all of their customers for every outage, because all of their customers’ business systems rely upon IaaS uptime.

AWS experienced an outage that spanned April 21st-April 24th, 2011, and brought down the web infrastructure of several large companies such as Quora and Hootsuite. The incident began when someone improperly executed a network change and shunted a bunch of traffic to the wrong place, which cut a ton of nodes off from each other. Because so many nodes were affected at one time, all of them trying to re-establish replication and hunt for free nodes caused the entire EBS cluster to run out of free space. This generated a cascading failure scenario that required a massive amount of storage capacity in order to untangle. Recovery took quite a while because capacity had to be physically added to the cluster. The postmortem was published via Amazon’s blog on April 29th, 2011. This incident is notable because it was somewhat widespread (affected multiple availability zones) and resolution took longer than 24 hours - making it one of the largest outages that AWS has experienced. AWS has a response pattern that is characterized by communication throughout: updates to the status page during the incident, followed by a detailed postmortem afterwards (4). Amazon’s postmortem structure seems to be consistent across multiple events. Many seem to use roughly this outline:
  1. Statement of Purpose
  2. Infrastructure overview of affected systems
  3. Detailed recap of incident by service
  4. Explanation of recovery steps & key learnings
  5. Wrap-up and conclusion
From this we can learn two things: Firstly, we know that Amazon has a postmortem process. They are pursuing specific goals around analyzing the failure of their service. Secondly, we know what they want to communicate. Primarily, they want to explain why the failure occurred and why it will not happen again in the future. They also provide an avenue for disgruntled stakeholders to reach out, receive compensation, get additional explanation, etc.

Azure experienced a similar storage failure in 2014 and we see a similar response from them - immediate communication via status pages, followed by a postmortem after the incident (5).

Taking a look at how the media approaches these failure events, it’s worthy of note that the articles written about the outages include links to the postmortem itself, as well as status pages and social media (6,7). Because the companies are communicative and providing documentation about the problem, the journalist can disseminate that information in their article - thus allowing the company that experienced the failure to control the narrative. Because so much information is supplied, there’s very little speculation about what went right or wrong on the part of individuals or journalists, despite the outage events impacting a huge number of individuals and companies utilizing the services themselves or software which relied upon them.

Conclusion

So, while postmortems are often considered a useful tool only from an engineering perspective, they are critical to all parts of a business for four reasons:

  • Resolving existing issues causing failures
  • Preventing future failures
  • Controlling public perception of the incident
  • Helping internal business units and stakeholders to understand the failure

Having a process is equally critical, and that process needs to be informed by the needs of the business both internally and externally. A process helps ensure that the right questions get asked and the right actions are taken to understand and mitigate failure. With the typical software development lifecycle increasing in speed, availability is becoming more of a moving target than ever. Postmortem processes help us zero in on threats to our availability and efficiently attack problems at the source.

About the Author

My name is Gabe Abinante and I am an SRE at ClearSlide. You can reach me at gabe@abinante.com or @gabinante on twitter. If you are interested in developing or refining a postmortem process, check out the Operations Incident Board on GitHub: https://github.com/Operations-Incident-Board

Citations / Additional Reading:

  1. http://www.sec.gov/litigation/admin/2013/34-70694.pdf
  2. http://spectrum.ieee.org/riskfactor/computing/it/-knuckleheads-in-it-responsible-for-errant-trading-knight-capital-ceo-claims
  3. http://www.kitchensoap.com/2013/10/29/counterfactuals-knight-capital/
  4. https://aws.amazon.com/message/65648/
  5. https://azure.microsoft.com/en-us/blog/update-on-azure-storage-service-interruption/
  6. https://www.wired.com/insights/2011/08/amazon-outage-post-mortem/
  7. http://www.rightscale.com/blog/cloud-industry-insights/amazon-ec2-outage-summary-and-lessons-learned
  8. http://www.forbes.com/sites/benkepes/2014/11/18/its-a-return-to-the-azure-alypse-microsoft-azure-suffers-widespread-outage/#76bfc0194852
  9. http://moneymorning.com/2012/08/07/the-real-story-behind-the-knight-capital-trading-fiasco/
  10. https://developer.ibm.com/urbancode/2013/10/28/-knight-capitals-472-million-release-failure/
  11. http://pythonsweetness.tumblr.com/post/64740079543/how-to-lose-172222-a-second-for-45-minutes
  12. http://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
  13. http://www.cio.com/article/2393212/agile-development/software-testing-lessons-learned-from-knight-capital-fiasco.html
  14. http://dealbook.nytimes.com/2012/08/02/knight-capital-says-trading-mishap-cost-it-440-million/
  15. https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/
  16. https://github.com/Operations-Incident-Board

December 25, 2015

Day 25 - Laziest Christmas Ever

Written by: Sally Lehman (@sllylhmn) & Roman Fuentes (@romfuen)
Photography by: Brandon Lehman (@aperturetwenty)
Edited by: Bill Weiss (@BillWeiss)

Equipment List

This Christmas, we (Roman and Sally) decided to use a pair of thermocouples, a Raspberry Pi, Graphite, and Nagios to have our Christmas ham call and email us when it was done cooking (and by “done” we mean an internal temperature of 140°F). The setup also allowed us to remotely monitor the oven and ham temperatures from a phone or laptop using Graphite dashboards. Since we are both busy with life and family obligations around the holidays, finding new ways to automate food preparation was considered a necessity.

Temperature Sensor Setup

The Raspberry Pi was connected to a pair of high-temp, waterproof temperature sensors by following this tutorial from Adafruit. We deviated from the tutorial in that we added an additional temperature sensor, since we wanted one for the ham and a separate one for the oven. Attaching an additional sensor required soldering together the two 3.3v voltage leads, the ground leads, and the data lines. These data and voltage lines were then bridged using a 4.7k Ohm pull-up resistor.

Prototype with data output

We used some electrical tape to hide the nominal soldering job. The tape also helped keep the connection points for the pins in place. We wrapped each soldered lead pair individually and then to each other, completing the assembly by attaching this package to the pinout.

Prototype measuring whiskey temperature

The sensors shared the same data line feeding into the Pi and conveniently show up in Linux as separate device folders.

Device Folders.

The Adafruit tutorial included a Python script that would read data from the device files and output the temperature in Celsius and Fahrenheit. The script did most of what we needed, so we made some minor modifications and ran two versions in parallel: one to measure the internal temperature of the ham and the other for the oven [1]. A follow up project would combine this into a single script.
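For readers who have not gone through the Adafruit tutorial, here is a rough sketch (not the actual tutorial script, which also retries until the CRC check passes) of what one of those sensor reads looks like in Python. The device folders are discovered with a glob, and the math is just parsing the w1_slave file the kernel exposes:

import glob

# Minimal sketch of reading DS18B20-style 1-Wire sensors from sysfs.
# Each sensor appears under /sys/bus/w1/devices/ as its own 28-* folder.
def read_temp_f(device_file):
    with open(device_file) as f:
        lines = f.readlines()
    # First line ends in YES once the CRC check passes;
    # second line ends in t=<millidegrees Celsius>.
    if "YES" not in lines[0]:
        raise IOError("sensor read failed CRC check")
    temp_c = int(lines[1].split("t=")[-1]) / 1000.0
    return temp_c * 9.0 / 5.0 + 32.0

for device in glob.glob("/sys/bus/w1/devices/28-*/w1_slave"):
    print(device, read_temp_f(device))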

The troubling case of network access

We planned to set the Pi next to the oven, which is not an ideal place for Ethernet cabling. Wireless networking was an obvious solution; however, the setup and configuration was not trivial and took longer than expected because the adapter would work wonderfully, and subsequently refuse to connect to anything. Our combined investigative powers and Linux sleuthing led to the solution of repeatedly ‘turning it off and back on again’. Here is the config that worked for us using Sally’s iPhone hotspot [2]. Sadly, we lost a few data points when we moved the iPhone to another room and lost network connectivity, rudely terminating the Python script. In hindsight, using the local wireless access point would have prevented this, but we were happy to have any functional network connection at that point.

The plan

The workflow of our Christmas ham monitoring and alerting system is as follows. The Python scripts would grab temperature readings (in Fahrenheit) every second, modify the output to match the format expected by Graphite, and send them to our DigitalOcean droplet via TCP/IP socket. The metrics were named christmaspi.temperature.ham and christmaspi.temperature.oven. Nagios would poll Graphite via HTTP call for both of these metrics every few minutes and would send PagerDuty notifications for any warning or critical level breach.
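As a rough illustration of that export step (the host below is a placeholder, and the real scripts looped once a second), each reading simply becomes one line of Graphite’s plaintext protocol pushed over a TCP socket:

import socket
import time

CARBON_HOST = "graphite.example.com"  # placeholder for the droplet's address
CARBON_PORT = 2003                    # Carbon's plaintext listener

def send_reading(metric, value):
    # Graphite's plaintext protocol: "metric value timestamp\n"
    line = "%s %.2f %d\n" % (metric, value, int(time.time()))
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5)
    try:
        sock.sendall(line.encode())
    finally:
        sock.close()

# one pass of the polling loop, with made-up readings
send_reading("christmaspi.temperature.ham", 98.6)
send_reading("christmaspi.temperature.oven", 325.0)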

We first decided to run the complete suite of X.org, Apache, Nagios, Graphite, and the Python polling scripts on our Pi host ‘christmas-pi’. Installing Nagios 3 and launching the web interface was straightforward. The Graphite install, however, overwhelmed the ~3 GB SD card. We freed up some space by removing the wolfram-engine package from the default Raspbian install.

All I want for Christmas is Docker containers running Graphite

At this point, with all our services running, we were left with around 272 MB of free RAM. Surprisingly, instead of catching fire, the Pi was quite capably displaying the Nagios and Graphite web interfaces! Each Python script was running in an infinite loop and exporting data. Two thumbs up.

Imagine our astonishment when we attempted to take a peek at our Graphite graphs and saw only broken image placeholders! Google says that we may have been missing a Python package or two. In our darkest hour, we turned to the cloud. A simple IP change would point the Python script to send data via a TCP/IP socket to any Graphite cluster we had. There also was a nice Docker Compose setup that would automagically create a complete Graphite cluster with the command “docker-compose up -d” and a DigitalOcean droplet running Docker. Following the quick setup of the cluster, we were prepared to begin recording data and get nice graphs of it.

Alerting and Notifications

At this point, the remaining work was to set up Nagios and Graphite to talk to each other, and then to find a way for Nagios to alert us. To handle the call and email page-outs, we signed up for a 14-day trial of PagerDuty and followed their Perl script integration guide to configure it with Nagios.

Jason Dixon had created a nice Nagios poller for Graphite that we also made use of. Once the script was set as executable, and following an IP change to point our data export to the DigitalOcean droplet, we added the new check command to the default Nagios command file. We also changed the default Nagios service configuration to set the following limits:

  • Send a warning notification if oven temperature is below 320 °F or ham temperature is above 135 °F.
  • Send a critical notification if oven temperature is below 315 °F or ham temperature is above 140 °F.

Additionally, we modified the default Nagios hostname configuration so our host was called christmas-pi.
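Under the hood, a Graphite-polling check is really just an HTTP call to the render API plus a threshold comparison. Here is a hedged Python sketch of that idea (not the actual check-graphite plugin we used), with a placeholder Graphite host and the ham thresholds above:

import json
import urllib.request

GRAPHITE = "http://graphite.example.com"  # placeholder host
TARGET = "christmaspi.temperature.ham"

url = "%s/render?target=%s&from=-5min&format=json" % (GRAPHITE, TARGET)
with urllib.request.urlopen(url) as resp:
    series = json.load(resp)

# datapoints are [value, timestamp] pairs; take the latest non-null value
values = [v for v, _ in series[0]["datapoints"] if v is not None]
latest = values[-1]

if latest >= 140:
    print("CRITICAL: ham is at %.1f F" % latest)
elif latest >= 135:
    print("WARNING: ham is at %.1f F" % latest)
else:
    print("OK: ham is at %.1f F" % latest)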

We were now ready to turn on the gas (or electricity) and start cooking.

Robot Chef

Alas, the oven temperature stopped increasing at 260 °F, as our Graphite graphs below show. We looked at the temperature allowances for the probe - and... yep, 260 °F was the limit. A follow up project would be to locate and integrate a temperature sensor that can function at oven air temperatures.

Christmas Ham Graphite Dashboard - 4 Hours

The recommended cook time for ham is “15-18 minutes per pound”, so we estimated our 16-pound ham would need around 4 hours to fully cook. You will see in our Graphite graphs that the ham’s temperature rose too quickly, alerting us long before it was legitimately done. So, we did some troubleshooting and found that the reading was about 50 degrees lower with the sensor placed about 3 inches deeper. We reseated the temperature probe, and went back to not being in the kitchen.

Nagios warned when the ham reached a temperature of 135°F.

christmas-pi warning

The christmas-pi sent us text messages, emails, and a phone call to let us know that the warning alert triggered. A few minutes later, christmas-pi sent a critical alert that the ham had reached an internal temperature of 140°F.

christmas-pi critical

Here is another view of these alerts. Great success!

Photo of ham, dressed to impress, in delicious pork glory

We hope you enjoyed reading about our project, and wish you and your family a happy holiday!

References

[1] Using the Python socket library to export data to Graphite
[2] If that config file looks like gibberish to you, take a moment to learn Puppet or Chef

December 24, 2015

Day 24 - It's not Production without an audit trail

Written by: Brian Henerey (@bhenerey)
Edited by: AJ Bourg (@ajbourg)

Don't be scared

I suspect when most tech people hear the word audit they want to run away in horror. It tends to bring to mind bureaucracy, paperwork creation, and box ticking. There's no technical work involved, so it tends to feel like a distraction from the 'real work' we're already struggling to keep up with.

My mind shifted on this a while back when I worked very closely with an Auditor over several months, helping put together Controls, Policies and Procedures at an organization to prepare for a SOC2 audit. If you're not familiar with a SOC2, in essence it is a process where you define how you're going to protect your organization's Security, Availability, and Confidentiality (1) in a way that produces evidence for an outside auditor to inspect. The end result is a report you can share with customers, partners, or even board members, with the auditor's opinion on how well you're performing against what you said.

But even without seeking a report, aren't these all good things? As engineers working with complex systems, we constantly think about Security and Availability. We work hard implementing availability primitives such as Redundancy, Load Balancing, Clustering, Replication and Monitoring. We constantly strive to improve our security with DDOS protection, Web Application Firewalls, Intrusion Detection Systems, Pen tests, etc. People love this type of work because there's a never-ending set of problems to solve, and who doesn't love solving problems?

So why does Governance frighten us so? I think it's because we still treat it like a waterfall project, with all the audit work saved until the end. But what if we applied some Agile or Lean thinking to it?

Perhaps if we rub some Devops on it, it won't be so loathsome any more.

Metrics and Laurie's Law

We've been through this before. Does anyone save up monitoring to the end of a project any longer? No, of course not. We're building new infrastructure and shipping code all the time, and as we do, everything has monitoring in place as it goes out the door.

In 2011, Laurie Denness coined the phrase "If it moves, graph it". What this means to me is that any work done by me or my team is not "Done" until we have metrics flowing. Generally we'll have a dashboard as well, grouping as appropriate. However, I've worked with several different teams at a handful of companies, and people generally do not go far enough without some prompting. They might have os-level metrics, or even some application metrics, but they don't instrument all the things. Here are some examples that I see commonly neglected:

  • Cron jobs / background tasks - How often do they fail? How long do they take to run? Is it consistent? What influences the variance?

  • Deployments - How long did it take? How long did each individual step take? How often are deploys rolled back?

  • Operational "Meta-metrics" - How often do things change? How long do incidents last? How many users are affected? How quickly do we identify issues? How quickly from identification can we solve issues?

  • Data backups / ETL processes - Are we monitoring that they are running? How long do they take? How long do restores take? How often do these processes fail?

Now let's take these lessons we've learned about monitoring and apply them to audits.

Designing for Auditability

There's a saying that goes something like 'Systems well designed to be operated are easy to operate'. I think designing a system to be easily audited will have the same effect. So if you've already embraced 'measure all things!', then 'audit all the things!' should come easily to you. You can do this by having these standards:

  • Every tool or script you run should create a log event.
  • The log event should include as much meta-data as possible, but start with who, what and when.
  • These log events should be centralized into something like Logstash.
  • Adopt JSON as your logging format.
  • Incrementally improve things over time. This is not a big project to take on.

While I wrote this article, James Turnbull published a fantastic piece on Structured Logging.
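As a hedged sketch of what that can look like in practice (the field names below are just an example, not a standard), a small helper that every tool calls is usually enough:

import getpass
import json
import logging
import socket
import time

logging.basicConfig(level=logging.INFO)

def audit_event(action, **metadata):
    # One JSON object per line: who, what, when, plus any extra metadata.
    # Single-line JSON events are easy for Logstash to pick up and parse.
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "user": getpass.getuser(),
        "host": socket.gethostname(),
        "action": action,
    }
    event.update(metadata)
    logging.getLogger("audit").info(json.dumps(event))

audit_event("deploy", app="billing-api", version="1.4.2", result="success")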

Start small

The lowest hanging fruit comes from just centralizing your logs and using a tool like Logstash. Your configuration management and/or deployment changes are probably already being logged in /var/log/syslog.

The next step is to be a bit more purposeful and instrument your most heavily used tools.

Currently, at the beginning of every Ansible run we run this:

pre_tasks:
  - name: ansible start debug message #For audit trail
    shell: 'sudo echo Ansible playbook started on {{ inventory_hostname }} '

and also run this at the end:

post_tasks:
  - name: ansible finish debug message #For audit trail
    shell: 'sudo echo Ansible playbook finished on {{ inventory_hostname }} '

Running that command with sudo privileges ensures it will show up in /var/log/auth.log.

Improve as you go

In the first few years after Statsd came out, I evangelized often to get Dev teams to start instrumenting their code. Commonly, people would think of this as an extra task to be done outside of meeting the acceptance criteria of whatever story a Product Manager had fed to them. As such, this work tended to be put off till later, perhaps when we hoped we'd be less busy (hah!). Don't fall into this habit! Rather, add purposeful, quality logging to every bit of your work.

Back then, I asked a pretty senior engineer from an outside startup to give a demo of how he leveraged Statsd and Graphite at his company, and it was very well received. I asked him what additional amount of effort it added to any coding he did, and his answer was less than 1%.
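To give a sense of how little code that is, here is a hedged sketch using the Python statsd client (the daemon address and metric names are made up); clients in other languages look much the same:

import time
import statsd  # the "statsd" package on PyPI; any statsd client works similarly

# Assumes a statsd daemon listening on localhost:8125.
stats = statsd.StatsClient("localhost", 8125, prefix="myapp")

def create_account(name):
    time.sleep(0.05)  # stand-in for the real work
    return {"name": name}

def handle_signup(name):
    stats.incr("signup.attempt")          # one line for a counter...
    with stats.timer("signup.duration"):  # ...and one timer around the work
        return create_account(name)

handle_signup("alice")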

The lesson here is not to think of this as a big project to go and do across your infrastructure and tooling. Just begin now, improve whatever parts of your infrastructure code-base you're working in, and your incremental improvements will add up over time.

CloudTrail!

If you're working in AWS, you'd be silly not to leverage CloudTrail. Launched in November 2013, AWS CloudTrail "records API calls made on your account and delivers log files to your Amazon S3 bucket."

One of the most powerful uses for this has been tracking all Security Group changes.

Pulling your CloudTrail logs into Elasticsearch/Logstash/Kibana adds even more power. Here's a graph plus event stream of a security rule being updated that opens up a port to 0.0.0.0/0. Unless this rule is in front of a public-internet facing service, it is the equivalent of chmod 0777 on a file/directory when you're trying to solve a permissions problem.

It can occasionally be useful to open things to the world when debugging, but too often this change is left behind in a sloppy way and poses a security risk.
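If you want to poke at the same data without the ELK pipeline, a boto3 sketch along these lines (assuming your AWS credentials and region are already configured, and that a 24-hour lookup window and these printed fields suit your needs) pulls recent security group ingress changes straight from CloudTrail:

from datetime import datetime, timedelta

import boto3

cloudtrail = boto3.client("cloudtrail")

# Look up the last 24 hours of security group ingress changes.
resp = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName",
         "AttributeValue": "AuthorizeSecurityGroupIngress"}
    ],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
)

for event in resp["Events"]:
    # Each event also carries its full JSON payload in event["CloudTrailEvent"].
    print(event["EventTime"], event.get("Username", "unknown"), event["EventName"])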

Auditing in real-time!

Audit processes are not usually a part of technical workers' day-to-day activities. Keeping the compliance folks happy doesn't feel central to the work we're normally getting paid to do. However, if we think of the audit work as a key component of protecting our security or availability, perhaps we should be approaching it differently. For example, if the audit process is designed to keep unwanted security holes out of our infrastructure, shouldn't we be checking this all the time, not just in an annual audit? Can we get immediate feedback on the changes we make? Yes, we can.

Alerting on Elasticsearch data is an incredibly powerful way of getting immediate feedback on deviations from your policies. Elastic.co has a paid product for this called Watcher. I've not used it, preferring to use a Sensu plugin instead.

{
  "checks": {
    "es-query-count-cloudtrail": {
      "command": "/etc/sensu/plugins/check-es-query-count.rb -h my.elasticsearch -d 'logstash-%Y.%m.%d' --minutes-previous 30 -p 9200 -c 1 -w 100 --types \"cloudtrail\" -s http -q 'Authorize*' -f eventName --invert",
      "subscribers": ["sensu-server"],
      "handlers": ["default"],
      "interval": 60,
      "occurrences": 2
    }
  }
}

With this I can query over any time frame, within a subset of event 'types', look for matches in any event field, and define warning and critical alert criteria for the results.

Now you can find out immediately when things are happening like non-approved accounts making changes, new IAM resources being created, activity in AWS regions you don't use, etc.

Closing Time

It can be exceptionally hard to move fast and 'embrace devops' and actually follow what you've documented in your organization's controls, policies, and procedures. If an audit is overly time-consuming, even more time is lost from 'real' work, and there's even more temptation to cut corners and skip steps. I'd argue that the only way to avoid this is to bake auditing into every tool, all along the way as you go. Like I said before, it doesn't need to be a huge monumental effort, just start now and build on it as you go.

Good luck and happy auditing!

Footnotes

(1) I only mentioned 3, but there are 5 "Trust Service Principles", as defined by the AICPA

December 23, 2015

Day 23 - This Is Why We Can't Have Nice Things

Written by: Tray Torrance (@torrancew)
Edited by: Tom Purl (@tompurl)

TLS Edition

Preface

A previous job gave me the unique experience of building out the infrastructure for a security-oriented startup, beginning completely from scratch. In addition to being generally novel, this gave me the experience to learn a tremendous amount about security best practices, and TLS in particular, during the leaks of documents exposing various mass surveillance programs and cutely-named vulnerabilities such as "Heartbleed". Among our lofty goals was a very strict expectation around what protocols, ciphers and key sizes were acceptable for SSL/TLS connections (for the rest of this article, I will simply refer to this as "TLS"). This story is based on my experience implementing those standards internally.

Disclaimer

TLS is a heavy subject, and it is very easy to feel overwhelmed when approaching it, especially for a newcomer. This is NOT a guide to simplify that problem. For that type of help, see the resources section at the bottom of this article. To keep the tone lighter (and ideally combat the fact that TLS cipher recommendations can change at the drop of a single exploit), and hopefully more approachable for folks with less hands-on experience with these problems, I will use a single cipher, referred to by OpenSSL as ECDHE-RSA-AES256-GCM-SHA384, when demonstrating how a certain library represents its ciphers, along with amusing placeholders in my examples.

This will also (hopefully) help combat the fact that while this post may live for many years, TLS cipher recommendations can change. For the uninitiated, the above cipher string means:

  • Key Exchange: ECDHE with RSA keys

  • Encryption Algorithm: 256-bit AES in GCM mode

  • Signature: SHA384

These are the three components you need for a TLS-friendly cipher.
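If you want to see that naming scheme in action, here is a small hedged sketch using Python's ssl module (which wraps OpenSSL; get_ciphers() needs Python 3.6+, and TLS 1.3 suites, where present, are configured separately):

import ssl

ctx = ssl.create_default_context()
# Restrict the context to our example cipher, using the OpenSSL-style name.
ctx.set_ciphers("ECDHE-RSA-AES256-GCM-SHA384")

# Show which ciphers the context will actually offer.
for cipher in ctx.get_ciphers():
    print(cipher["name"], cipher["protocol"])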

The Goal

With regards to TLS, our main objectives, both for ourselves and our customers, were:

  • Secure all internal and external communications with TLS

  • Enforce a mandatory key size

  • Enforce a consistent, strong set of ciphers across all TLS connections

For cipher selection, we effectively chose a subset of the Mozilla-recommended "modern" ciphers for reasons that are out of scope for this post.

The Solution

With a modern configuration management solution (we used Puppet), this seems like a pretty trivial problem to solve. Indeed, we quickly threw together an NGINX template that enforced our preferences, and life was good.

...
ssl_ciphers ACIPHER_GOES_HERE:ANOTHER_GOES_HERE;
...

The Problem: EINCONSISTENT

So, that's it, right? Well, no. Anyone who's had to deal with these types of things at this level can tell you that sooner or later, something will come along that doesn't interoperate. In our case, eventually we added some software that more or less required the use of Apache instead of NGINX, and now we had a new config file in which to manage TLS ciphers.

The (Modified) Solution

Frustrated, we did what any good DevOps engineer does, and we abstracted the problem away. We threw a puppet variable in at "top scope", which let us access it from any other part of our codebase, and then referenced it from both our Apache and NGINX templates:

# manifests/site.pp
$ssl_ciphers = 'YOU_GET_ACIPHER:EVERYBODY_GETS_ACIPHER'

# modules/nginx/templates/vhost.conf.erb
...
ssl_ciphers <%= @ssl_ciphers %>;
...

# modules/apache/templates/vhost.conf.erb
...
SSLCipherSuite <%= @ssl_ciphers %>
...

The (Second) Problem: EUNSUPPORTED

As these things tend to go, after a few weeks (or maybe months, if we were lucky - I don't recall) of smooth sailing, another compatibility issue was introduced into our lives. This time, it was JRuby. For reasons that will be elaborated upon below, JRuby cannot use the OpenSSL library to provide its TLS support in the way that "normal" Ruby does. Instead, JRuby maintains a jopenssl library, whose purpose is to provide API-compatibility with Ruby's OpenSSL wrapper. Unfortunately, the library JRuby does use has a different notation for expressing TLS ciphers than OpenSSL, so jopenssl maintains a static mapping. Some of you may be groaning right about now, but wait - there's more!

In addition to not supporting some of the more modern ciphers we wanted to use (though it happily ignored them when specified, which was in this case helpful), feeding it malformed versions of the "magic" (aka ALL, LOW, ECDHE+RSA, etc) names supported by OpenSSL seemed to cause it to support any cipher that it understood - several of which are no longer secure enough for serious use.

This is why we can't have nice things.

The (Second) Solution

We had some pretty intelligent and talented folks attempt to patch this, but they were unsuccessful at unwinding the rather complicated build process by which JRuby tests and releases jopenssl. Ultimately, we decided that since the JRuby application was internal only, we could extend our policy for internal services to include the strongest two ciphers JRuby supported at the time. This meant adding another top scope puppet variable for use there:

# manifests/site.pp
...
$legacy_ssl_ciphers = "${ssl_ciphers}:JRUBY_WEAK_SAUCE"
...

And then, once more, referencing it in the proper template.

The (Third) Problem: ENOMENCLATURE

After another brief reprieve, along came a "proper" Java application. You may recall that I mentioned JRuby cannot use OpenSSL - well, this is because the JVM, being a cross-platform runtime, provides its own implementation of the TLS stack via a set of libraries referred to as JSSE (Java Secure Socket Extension). Now, for a brief digression.

OpenSSL cipher names are, you see, only barely based in reality. TLS cipher names are defined in RFCs, and our scapegoat cipher's official name, for example, is TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384. In my experience, OpenSSL almost always (annoyingly) ignores these names in favor of ones it makes up itself. JSSE, on the other hand (correctly, though I rarely use that word in the context of Java), uses the RFC names.

Still with me? Great. As you may have put together, adding a Java app meant that our existing variable was not going to be able to do what we needed on its own. Attempts to cobble together a programmatic mapping via files describing RFC names and tricks with OpenSSL syntax were fairly successful, but unidirectional, relatively brittle and prone to needing manual updates of the RFC name list in the future.

The (Third) Solution

As you may have guessed, it's certainly simple enough to do the following, using Java-compatible cipher names:

# manifests/site.pp
...
$java_ssl_ciphers  = 'IDK:WHO:I:EVEN:AM:ANYMORE'
...

And then use that variable as needed in the various Java templates, which are almost always XML.

The (Fourth) Problem: EERLANG

As I'm sure you've guessed by now, after a bit of smooth sailing, something else came along. This time, it was RabbitMQ, which is written in Erlang. Rabbit (and possibly other Erlang tools) supports SSL cipher configuration via an array of 3-element tuples. In RabbitMQ, our example cipher would be expressed as: {ecdhe_rsa,aes_256_gcm,sha384}. Now, let me first say that, academically, this is very clever. Practically, though, I want to start a pig farm.

The (Fourth) Solution

At this point, a hash, or really any data structure, was starting to look more appealing than a never-ending string of arbitrarily named global variables, so a refactor takes us to:

# manifests/site.pp
$ssl_ciphers = {
  'openssl' => 'MY:HANDS:ARE:TYPING:WORDS',
  'jruby'   => 'OH:MY:DEAR:WORD:WHY?!?',
  'java'    => 'PLEASE:MAKE:IT:STOP',
  'erlang'  => "[{screw,this,I},{am,going,home}]"
}

The Conclusion

Well, by now, you have possibly realized that everything is terrible and I need a drink (and perhaps a bit of therapy). While that is almost certainly true, it's worth noting a few things:

  • TLS, while important, can be very tedious/error-prone to deploy - this is why best practices are important!

  • A good config management solution is worth a thousand rants - our use of config management reduced this problem to figuring out the proper syntax for each new technology rather than reinventing the entire approach each time. This is more valuable than I can possibly express, and can be achieved with basically any config management framework available.

While we can't currently have nice things (and TLS is not the only reason), tools like Puppet (and its many alternatives, such as Chef, Ansible or Salt) can let us start thinking about a world where we can. With enough traction, maybe we'll get there one day. For now, I'll be off to find that drink.

References

Acknowledgements

I'd like to thank the following people for their help with this, regardless of whether they realized or volunteered to do so.

  • My wonderful partner Mandy, for her endless support (and proofing early drafts of this)
  • My SysAdvent editor, Tom, for being flexible and thorough
  • My former colleague, William, who inadvertently mentored much of my TLS education