Recently I was having a conversation with a colleague who asserted that we (SREs) are broadly the
types of engineer who, if given the choice, try to focus on perfecting the fundamentals. This
surprised me, because if you were to ask me about my views on engineering, I’d probably lean
in a slightly different direction.
My personal view on SRE is that it’s a game of balance. We’re not Software Engineers, we’re
not Operations Engineers and we’re also not Security Engineers. We tread a fine line in the
middle, pushing on aspects of the broader (humans included) system to help it find a stable
equilibrium in which it delivers maximum value for all stakeholders. That kind of balancing
requires a very pragmatic, flexible approach and often depends more on the subtleties of the
system at hand than a rigidly theoretical approach can offer.
With that in mind, I think that as engineers, we need to focus on building systems that support
that healthy equilibrium. Doing so means balancing a wide range of requirements from different,
often competing, stakeholders while attempting to divine what the future may bring. In my
experience, however, all of this becomes much easier to deal with if you can solve two key
problems: velocity and observability.
Before I dive into that, let’s quickly talk about that experience.
Today I work as an SRE, surrounded by dozens of complex systems designed to simplify the process of
taking the code we write and exposing it to customers. It’s easy to forget that software deployment
itself is a problem that many developers have not yet solved.
Today I’d like to run you through a straightforward process I recently implemented for Git Tool
to enable automated updates with minimal fuss. It’s straightforward, easy to implement and works
without any fancy tooling.
As an engineer, I like to think that I help fix problems. That’s what I’ve tried to do most of
my life and career and I love doing so to this day. It struck me, though, that there was one
problem which has followed me around for years without due attention: the state of my development
directories. That’s not to say that they are disorganized; I’ve spent hours deliberating over the best way
to arrange them such that I can always find what I need, yet I often end up having to resort to
some dark incantation involving find to locate the project I was certain sat under my Work
directory. No more: I’ve drawn the line and decided that if I can’t fix the problem, automation damn well
better be able to!
I’d like to introduce you to my new, standardized (and automated), development directory structure
and the tooling I use to maintain it. With any luck, you’ll find it useful and it will enable you
to save time, avoid code duplication and more easily transition between machines.
How iterators work in Python, details about the next function and a lesson from production
Recently we had an outage. It was a small one by all accounts and, as a result of the way our system is designed, it didn’t impact any users, didn’t lose any data and wasn’t in any way noticeable to anybody except us. It did happen, though, and that’s a problem.
The cause of this outage was pretty simple: engineer A designed a nice new feature in library X; engineer B liked this feature and decided to use it in service Y. This is a daily occurrence and is generally a very good thing: new, cleaner solutions help you constantly refactor away technical debt and improve the readability and all-important maintainability of your code.
This time, however, it went wrong and caused an outage so let’s talk about how that happened and take a detour through the land of Python iterators at the same time.
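Before getting into the incident itself, here is a minimal sketch of how the built-in `next` function behaves, since that is the part of the iterator protocol this detour is about. It is a generic illustration only and does not reproduce the code from library X or service Y; the helper name `first_or_default` is hypothetical.

```python
def first_or_default(iterable, default=None):
    """Return the first item of any iterable, or `default` if it is empty.

    Hypothetical helper for illustration: iter() turns the iterable into an
    iterator, and the second argument to next() suppresses StopIteration.
    """
    return next(iter(iterable), default)


numbers = [1, 2, 3]

# iter() returns an iterator: an object that remembers its position.
iterator = iter(numbers)
consumed = [next(iterator), next(iterator), next(iterator)]

# Once exhausted, next() raises StopIteration unless a default is supplied.
try:
    next(iterator)
    exhausted = False
except StopIteration:
    exhausted = True

# A list is iterable, but it is not itself an iterator, so calling next()
# on it directly raises TypeError.
try:
    next(numbers)
    list_is_iterator = True
except TypeError:
    list_is_iterator = False

print(consumed, exhausted, list_is_iterator)
print(first_or_default([]), first_or_default([7, 8]))
```

The distinction between an iterable (something `iter` accepts) and an iterator (something `next` accepts) is easy to miss, because a `for` loop hides the `iter` call for you.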
At one point in my career, I spent over 2 years building a monitoring stack. It started out
the way many do; with people staring at dashboards, hoping to divine the secrets of production
from ripples in the space-time continuum before an outage occurred.
Over these two years we were able to transform not just the technology used, but the entire
way the organization viewed monitoring, eventually removing the need for a NOC altogether.
I’ll walk you through the final design, which was responsible for everything from data acquisition
to alerting and much besides. In this post I’ll go over some of the design decisions we made,
why we made them and some guidance for anybody designing their own monitoring stack.
I’ve just spent the last month rewriting the core component in a monitoring stack which is responsible for protecting the availability of a billion-dollar-per-year franchise. The purpose of this rewrite was to improve the ability of our engineers to implement new features in a safe, quick and easy way. What we delivered ended up offering a four orders of magnitude performance and efficiency improvement over our previous system.
Let’s talk about how that happened, why it was possible and how we achieved that without it being a focal point of the redesign. I’m going to discuss evented input-output, often referred to as async.
Hopefully, by the time you’ve finished reading this article you should have a good grasp of what evented IO is, how it works and some of the situations in which it has a lot to offer - as well as some of the significant advantages it has over alternative approaches when we start talking about large scale production systems.
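To make the idea concrete, here is a minimal sketch of evented IO using Python’s standard `asyncio` module. It is not the monitoring component described above; the `fetch` coroutine is hypothetical, and `asyncio.sleep` stands in for a real network call. The point it shows is that three waits overlap on a single thread instead of running back to back.

```python
import asyncio
import time


async def fetch(name, delay):
    """Hypothetical IO-bound task; asyncio.sleep simulates waiting on a socket."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return f"{name} done"


async def main():
    started = time.perf_counter()
    # gather() schedules all three coroutines on the same event loop and
    # returns their results in the order the coroutines were passed in.
    results = await asyncio.gather(
        fetch("a", 0.2),
        fetch("b", 0.2),
        fetch("c", 0.2),
    )
    elapsed = time.perf_counter() - started
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Run one after another, the three calls would take roughly 0.6 seconds; under the event loop the total should be close to the longest single wait, because no thread sits idle while a task is blocked on IO.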
If you’ve built a production API before, you’ll know that they tend to
evolve over time. This evolution is not only unavoidable, it is a natural
state that any active system will exist in until it is deprecated.
Realizing and designing to support this kind of evolution in a proactive
way is one of the aspects that differentiates a mature API from the thousands
that litter the Wall of Shame.
At the same time, it is important that your API remains easy to use and
intuitive, maximizing the productivity of developers who will make use of it.
One of the most interesting discussions to have with people, notably those
with traditional database experience, is that of the relationship between
an off the shelf RDBMS and some modern NoSQL document stores.
What makes this discussion so interesting is that there’s invariably a lot
of opinion driven from, often very valid, experience one way or another.
The truth is that there simply isn’t a silver-bullet database solution and
that by better understanding the benefits and limitations of each, one can
make vastly better decisions on their adoption.
With the increasing popularity of Git as a tool for open source collaboration,
not to mention distribution of code for tools like Go, being able
to verify that the author of a piece of code is indeed who they claim to be
has become absolutely critical.
This requirement extends beyond simply ensuring that malicious actors cannot
modify the code we’ve published, something GitHub and its kin
(usually) do a very good job of preventing.
The simple fact is that by adopting code someone else has written, you are
entrusting your clients' security to them - you best be certain that trust
is wisely placed.
Using Git’s built in support for PGP signing and pairing it with
Keybase provides you with a great framework on which to build and
verify that trust. In this post I’ll go over how one sets up their development
environment to support this workflow.
Anybody who has worked in the development world for a significant portion of
time will have built up a vast repertoire of abbreviations to describe how
they solve problems. Everything from TDD to DDD and, my favourites, FDD
and HDD. There are so many in fact that you’ll find a
website dedicated to naming and shaming them.
I’m not one to add another standard to the mix… Oh who am I kidding, let me
introduce you to Chance Driven Development.
Bash’s ability to automatically provide suggested completions to a command
by pressing the Tab key is one of its most useful features. It
makes navigating complex command lines trivially simple; however, it’s generally
not something we see that often.
Bash CLI was designed with the intention of making it as easy as possible to
build a command line tool with a great user experience. Giving our users the
ability to use autocompletion would be great, but we don’t want to make it
any more difficult for developers to build their command lines.
Thankfully, Bash CLI’s architecture makes adding basic autocomplete possible
without changing our developer-facing API (always a good thing).
If you’re just looking to hop straight to the final project, you’ll want
to check out SierraSoftworks/bash-cli on GitHub.
Anybody who has worked in the ops space has probably built up a veritable
library of scripts which they use to manage everything from deployments
to brewing you coffee.
Unfortunately, this tends to make finding the script you’re after
and its usage information a pain: you’ll either end up grep-ing
a README file, or praying that the script has a help feature built
in. Neither approach is conducive to a productive workflow for you or
those who will (inevitably) replace you. Even if you do end up adding
help functionality to all your scripts, it’s probably a rather significant
chunk of your script code that is dedicated to docs…
After a project I was working on started reaching that point, I decided
to put together a tool which should help minimize both the development
workload around building well documented scripts, as well as the usage
complexity related to them.
Since there seems to be quite a bit of confusion surrounding the process of hacking the Asus ExpressGate system to change resolutions, I am going to try my best to clarify some of it for you.
I recently purchased an Asus P6X58D Premium which comes with ExpressGate (SplashTop) embedded on it. However, since the maximum resolution is limited to 1280x1024, I decided to do a bit of work and fix that.
I have since created a bash script and numerous applications to aid anyone looking to do their own bit of modding on ExpressGate.