DevOps for Application Developers
Unfortunately, Númenor has drowned, so you're probably going to have to learn about something called Kubernetes.
Hey there newsletterers. I know I said I probably wasn't going to write a newsletter this week but, well, then I had a draft that I wanted to finish, and, well, here we are! If you're just joining us (or just appreciate reminders) I'm Nat Bennett, and I do computer stuff for money. Here on this newsletter, I talk about computer stuff for free.
This week I want to talk a little bit about the wild world of DevOps for Application Developers. It's gotten weird in the last five years or so. Better in a lot of ways, but confusing and complicated. I've been writing a bunch about CI/CD and application platforms at work, so I cleaned up some of that and thought I'd share with y'all.
Even more than usual, if you have questions about anything and everything related to "DevOps for Application Developers," e-mail me! Respond to this e-mail directly, or write nat @ this website. I love writing about this stuff, especially when I have some specific questions to answer.
Númenor has drowned
So you've decided to deploy an application. You're a responsible application developer, so you're not just going to "throw code over the wall" to an ops team. You're going to test, deploy, and run your app yourself, with some help from– is it the ops team or the SRE team? (You think your company has both but you don't really understand the difference.)
Unfortunately, Númenor has drowned, so you're probably going to have to learn about something called Kubernetes. The good news is, whatever those ops/SRE/infrastructure people are called, they're really friendly, and they say you can use whatever you want to run your app! They even built you a whole bunch of stuff to help you!
But... uh... what do you want? There are a bunch of different options and they all seem to do slightly different things. People keep talking about containers and you don't really understand why (are they just... small VMs?). When you ask your infrastructure team to help, they just explain a bunch of things about their system that don't seem to have much to do with your application, like how fast it is at building layered JARs. You want to do this the right way but... what's the "right way?"
I can't help with that directly, since the "right way" depends a lot on the details of your particular application and your particular organization. But I can walk you through what these systems are doing in general, which will hopefully help you evaluate your options and ask useful questions.
A note on definitions: I'm typically going to use the word "application" to mean "a web application server that accepts HTTP requests and returns roughly web-page-shaped things that the business cares about." This is mostly distinct from "service," which I'll use to mean "a running process that accepts requests of some kind from applications or other services and returns data or decisions." The main difference for this purpose is that "applications" tend to have relatively generic infrastructure requirements that are similar to each other, while services might include things like databases or routers that need weird things like persistent disk or exclusive access to their host's compute resources.
What happens when you deploy an app
First, let's look at what needs to happen in order to deploy a generic application.
- Notice that the application needs to be deployed
- Get the code
- Get all the other stuff that's necessary to build the application (this includes libraries -- gems, jars, or what-have-you -- but might also include something like Tomcat or Puma, depending on the language/framework you're using and how you're testing it)
- Actually build the application
- Run the tests (and do whatever else needs to happen to validate that this is "good code")
- Get all the other stuff you're going to want to run on the same host as the application (things like Datadog agents, but also possibly things like Puma if you don't need those built into your application artifact)
- Get the configuration for your application for this particular environment
- Possibly, build all that stuff into one deployable package
- Get the host that your process is going to run on
- Get all that stuff ("bits") onto the host your process(es) are going to run on
- Start (or restart) the application
- Feed config into the application
- Possibly, stop/destroy the application
- Possibly, notify a router or similar that the application has moved or is about to move
- Confirm that the application is, in fact, working (can connect to its databases, handle traffic, etc.)
- Probably, start some kind of process that will watch the application and restart it when it crashes
Whew! I'm tired just writing all that!
The simplest possible CI/CD setup
You could set all that up on just, like, a server, with approximately the following capabilities:
- something that can poll resources (git, s3 buckets) for changes in state
- something that can download files from those resources and put them into an execution environment
- something that can run scripts in that execution environment
- something that can push files/artifacts back to those external resources
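In fact, a lot of early "CI/CD" really was just a shell script shaped roughly like this. Here's a sketch of what that hand-rolled glue might look like -- the repo URL, bucket name, and build commands are all made-up placeholders, and I'm assuming a Ruby app and the AWS CLI purely for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical hand-rolled "CI server": poll a git repo, build and test,
# push the artifact to a bucket. Every name here is a made-up placeholder.
set -euo pipefail

REPO=https://github.com/example/my-app.git
BUCKET=s3://example-build-artifacts

while true; do
  rm -rf workspace
  git clone --depth 1 "$REPO" workspace          # get the code
  sha=$(git -C workspace rev-parse --short HEAD)
  if ! aws s3 ls "$BUCKET/my-app-$sha.tar.gz" >/dev/null 2>&1; then
    (cd workspace && bundle install)             # get the libraries
    (cd workspace && bundle exec rake test)      # run the tests
    tar -czf "my-app-$sha.tar.gz" -C workspace . # build a deployable package
    aws s3 cp "my-app-$sha.tar.gz" "$BUCKET/"    # push the artifact back out
  fi
  sleep 60                                       # "poll resources for changes"
done
```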
But you'd run into a couple of problems.
One, you'd be writing just boatloads of glue code. Every external resource you wanted to work with would need its own poll/download/upload scripts. Or you'd have to write some kind of general solution. Every time you wanted to make any kind of change to your workflow it would take forever so you wouldn't do it very often. So a lot of what these systems are doing is handling these relatively straightforward problems in a standardized way, so you can use code that's being used by lots of people and maintained by a professional team, rather than code you wrote yourself six months ago and you don't remember how it works.
Two, you'd need some way to keep your environment "clean." If you just ran all your build scripts on one server you'd end up with like, a billion different versions of Java and Ruby or whatever, and your build would sometimes fail because of something that had happened on a previous run. Now you have to know things about the past! This is very bad, since of course, as programmers, we never want to think about state or time, and the past includes both! This radically increases the complexity of debugging.
(Actually– the simplest possible way I've heard of solving this problem was both dumber and cleverer than this– Honeycomb for a while updated all their systems by putting their built artifacts (Go binaries IIRC) into S3 buckets, and then on each of their production hosts they would run a cron job every 15 minutes to check if there was a new version of the artifact they ran and, if so, download and start it.)
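In script form, that approach might look something like this -- run from a cron entry every 15 minutes, with the bucket, binary, and service names all invented for the sake of the example (Honeycomb's real setup was surely more careful than this):

```bash
#!/usr/bin/env bash
# Hypothetical "pull the latest build and restart it" script, run from cron:
#   */15 * * * * /usr/local/bin/update-myapp.sh
# The bucket, binary, and service names are invented for illustration.
set -euo pipefail

BUCKET=s3://example-releases/myapp
BINARY=/usr/local/bin/myapp
VERSION_FILE=/var/run/myapp.version

# Most recently uploaded object under the bucket prefix (4th column is the key).
latest=$(aws s3 ls "$BUCKET/" | sort | tail -n 1 | awk '{print $4}')

if [ "$latest" != "$(cat "$VERSION_FILE" 2>/dev/null || true)" ]; then
  aws s3 cp "$BUCKET/$latest" "$BINARY.new"
  chmod +x "$BINARY.new"
  mv "$BINARY.new" "$BINARY"    # atomic swap on the same filesystem
  systemctl restart myapp       # assumes a systemd unit is managing the process
  echo "$latest" > "$VERSION_FILE"
fi
```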
What's the big deal about containers?
So now you have two new problems: Glue code and environment management.
One way to solve those problems is with, basically, immutability. Never update anything, just destroy and recreate your hosts whenever you want to run a task! Now we're starting to get to "why containers." In VM-world this is possible and wise, but slooooooow. It can take up to 10 minutes to get a new EC2 instance. Madness!
The reason this is slow is, basically, that there's a bunch of computer stuff that has to be started up to start a VM. You have to set up the whole, well, virtual machine – everything that's being simulated for the "virtual" part, plus the actual operating system that's running on that virtual machine. And you have to set up and attach the disk, and do all of this at the speed of your infrastructure provider, which you don't necessarily have much control over.
Containers, on the other hand, don't have any virtualization. They behave very similarly to "small, fast VMs" but they're categorically different. They're basically processes that the host OS is hiding stuff from. This makes it tricky to do true multi-tenancy with them for various reasons, but that isn't an issue if you're running lots of stuff from the same company or for the same app on one host, so you see them used that way a lot.
This is why containers matter: They let you set up an execution environment that has no history and has been built from pure source code + checked in configuration, and do this quickly. You're working at the speed of starting a process, not the speed of starting a whole dang computer. This is very important for a lot of tasks that are involved in CI/CD so it's why you see containers pop up so much.
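If you have Docker handy you can feel this difference directly; the image and commands here are just an illustration:

```bash
# A brand-new, throwaway environment in a second or two:
time docker run --rm ubuntu:22.04 echo "fresh environment, no history"

# Run it again and you get another pristine copy -- nothing from the previous
# run survives, because the container's filesystem was thrown away on exit.
docker run --rm ubuntu:22.04 ls /tmp
```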
Package management
Okay so now that we know that we're using containers to set up immutable environments... how do we get the bits we want to run into those environments?
Welcome to the magical world of package management.
There are a couple of ways to get packages onto a host.
- You can build them into the VM image. Very effective, but, again, slow! Now you don't just have to wait for the VM to start, you also have to wait for the image to build.
- You can stream them onto a generic VM. This is a bit faster but then you have to deal with streaming them in. You also have to figure out what you're willing to stream and where you're willing to stream it from. Tools that choose this method tend to also have some kind of package repository and a package manager -- `apt-get`, etc. But then you've got package repository problems -- you'd better trust your package publishers. (Hi, left-pad.)
- You can expect to have container images handed to you. This is the choice Kubernetes makes. It gets you all the good things about a VM image, but it means someone else has to solve the problem of getting the right files in the right places on the filesystem -- that is, building the container.
If you choose "do it with a container image" then you basically have a micro-version of this problem again. How do you get bits onto the container image? You can mix-and-match but in general again you're going to have two choices
- Download the bits directly into the container when you build the image with something like a
RUN apt-get
command or your language's equivalent ofbundle
(tricky because you don't always control what you get from a package manager! ask your friends about "left-pad") - Fetch them bits inside the container once it's up and running (this breaks immutability though so you usually don't want to do it)
COPY
the bits into the container. But then you have to figure out how to get the bits onto the location you're building the container image in. It's turtles all the way down!
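Here's roughly what the first and third of those look like in a Dockerfile. The base image, system packages, and commands are placeholders for a hypothetical Ruby app, not a recommendation:

```dockerfile
# Hypothetical Dockerfile for a Ruby web app -- image names, packages, and
# commands are examples only.
FROM ruby:3.3-slim

# Option one: download bits at image-build time with a package manager.
# (You're trusting whatever the repository serves you -- hi, left-pad.)
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential libpq-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Option three: COPY bits in from wherever the image is being built --
# which means you've already solved "get the bits onto *that* machine."
COPY Gemfile Gemfile.lock ./
RUN bundle install
COPY . .

CMD ["bundle", "exec", "puma", "-p", "8080"]
```

Note that the `COPY` lines only work because the source and the Gemfile are already sitting on the machine doing the build -- that's the turtle underneath this one.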
What Cloud Foundry and Heroku do to solve this problem, incidentally, is to accept source code from wherever, then schedule a container with a task to run it through a special script called a "buildpack" that builds the application (based on the language and framework it uses) and stores it as a container image (possibly also with telemetry sidecars.) This solution puts some limitations on what developers can run on the container system, but allows devs to run most normal apps, lets them not think about container details at all, and can be very secure, especially if you hook up your platform so that the build can only get artifacts from very specific places.
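From the developer's side, all of that machinery collapses into a single command (app and remote names here are placeholders):

```bash
# Heroku: push source; the platform runs the buildpack, builds the image, runs it.
git push heroku main

# Cloud Foundry: same idea -- hand over source, get back a running, routed app.
cf push my-app
```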
Actually deploying the app
The last thing that these systems often have to deal with is actual deployment workflows: things like rolling deploys, canarying, blue-green deployment. This is Spinnaker's whole deal -- it doesn't know anything about how to build artifacts, how to run scripts, how to run apps -- but it knows everything you could want to know about how to orchestrate those operations, which is important if you're trying to achieve very high uptime.
Different tools "slice" this space differently
You tend these days to see systems that focus on one or two of the following categories
- Get code and run builds (Drone, Jenkins, Harness)
- Package code (& friends) into deployable artifacts (Docker)
- Run one-off tasks (Kubernetes, Jenkins)
- Run deployable artifacts and keep 'em running (Kubernetes, Nomad, EC2 instances)
- Send commands that change the artifacts and/or configuration that are running (Terraform, Helm, Ansible)
- Orchestrate deploy/update operations (Harness, Spinnaker)
In general these systems tend to divide up their space very differently depending on whether they're working with VMs or containers.
With systems that work with VMs, like Ansible or Salt, there tends to be a lot more fiddling with the problem of "how do I get and configure files on the system?" They're also typically much more imperative: you write code describing a series of steps that get run in a particular order.
Kubernetes and friends, on the other hand, believe that "how to get packages onto hosts" is an extremely solved problem (Docker image!) and that how that image gets built is extremely not their problem. They're also broadly much more declarative. You describe the state that you want to be true, and Kubernetes sets that state up and keeps it that way.
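A minimal Kubernetes Deployment manifest shows what "declare the state you want" means in practice -- the app name and image here are made up:

```yaml
# Hypothetical Deployment: "I want three copies of this image running."
# Kubernetes doesn't know or care how the image was built.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.2.3
        ports:
        - containerPort: 8080
```

You `kubectl apply` that, Kubernetes converges on it, and if a pod dies or a node goes away it gets replaced -- until something changes the declared state.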
Task runners and process runners
One way to view these systems that I personally find useful is, are they *fundamentally* a process runner, or a one-off task runner? That is, when they run a "start" command and the process exits, do they restart that process, or do they put its output somewhere else?
If they're fundamentally a one-off task runner they're going to have a configuration file somewhere that describes
- the script that gets run
- the other stuff that gets loaded into the environment with the script
- what triggers the running of the script
- what gets loaded out of the environment after the script has run
That is, they should basically be chains of functions with inputs, outputs, and maybe side effects (there's a sketch of that shape after the next list). If they're fundamentally application runners, I'm going to look for
- How the stuff that's the same no matter where the app is running (dev, qa, prod, etc.) gets set up
- How the stuff that might be different gets communicated to the application (environment variables, secrets, connections to services)
- How the system "schedules" the container, how it knows how many to run, how to update them, what it needs to do to them before it shuts them down to update them, etc.
Questions, Comments
So that's the very broad overview:
- You're fundamentally just downloading files onto hosts, running scripts, and starting long-lived processes
- Doing this a lot gets complicated and confusing fast, because the scripts you ran earlier might affect the behavior of the scripts you run later
- It also involves writing a lot of code that's going to be kind of the same no matter what application is using it
- You're probably using a tool (or a bunch of tools) so that you can pay for that "glue code" in money rather than engineer time
- You and/or the tools you're using rely a lot on containers because containers let you get an environment that's easy to understand fast
- Containers also solve a lot of the "get bits onto hosts" part of the problem, but then you have to get the bits into the container images
- A big part of what's confusing about the tools in this area is they're all doing different subsets of the total set of things that need to be done to deploy and run apps
- But they basically fall into two categories, "process-runners" and "thing-doers," and you can evaluate them quickly from that frame
I also expect all of this to change a lot over the next five years as people figure out how to use things like Kubernetes and some kind of omakase emerges. I don't think that what appears to be the current status quo – that application devs are expected to choose between a huge number of options that are all basically equivalent except for factors that aren't really visible to app devs – is going to be stable for very long.
If you have questions or comments or just stuff you still find confusing related to this kind of "DevOps for Application Engineers" topic, e-mail me! And send this to any friends who you think might have questions. This is something I want to write a lot more about this summer, and it always helps to have specific people to help and problems to solve.