Using Github Actions? Take a look at Skylounge
Skylounge solves common problems with implementing CI/CD for large numbers of teams, and managing applications on platforms, especially in secure and regulated environments, and it does so without hiding anything from developers.
Hey there, newsletter readers. I'm Nat Bennett, software engineer, and you're reading Simpler Machines. This week we're talking about pipelines, baby, and a system that I've been exploring recently for managing Github Actions, called Skylounge.
This week is a little bit unusual because this is based on some paid work I did last month for 33 Teams, a consulting agency I've worked with a few times in the past. I spent about 30 hours last month learning and evaluating Skylounge, in preparation for future client work. I mention this mostly for transparency but also to make sure it's clear that I (and/or a team of folks like me!) am available right now to build out automation/infrastructure/platform-y things. And if you're reading this, I'm interested in learning more about your problems in this area, even if you don't need consulting help right now.
So if at any point you're reading this and you're like "hey, that sounds like something we're doing right now," please don't hesitate to send me an e-mail.
I love writing CI/CD pipelines.
They're like Rube Goldberg machines. I can almost hear the satisfying ka-thunk when a script puts a tested, working application into production. When I worked on Cloud Foundry, I spent much of my time on release teams, where writing and maintaining automation was the core of my daily job. In my work with 33 Teams, I'm often the person on the team who's the fastest and happiest writing Bash scripts.
So I was super interested in getting my hands on Skylounge. I've spent a few weeks working with it, and I can see the potential to solve some common problems I've encountered over the past few years. If you have complex workflows, or have to manage pipelines for many applications, you should check it out.
One thing I want to emphasize is that Skylounge works specifically with Github Actions. And I'm going to be talking a lot about Github Actions here specifically, because it's the main automation tool I've been using since I left Cloud Foundry. At Cloud Foundry I used Concourse, and it's still my first love, but these days I'm usually working with clients, so I'm usually not introducing new tools. Github Actions is available, familiar, and easy to get started with, so it's a good choice for a lot of teams. I desperately miss Concourse's resource abstraction, but it otherwise gets the job done: It runs scripts in isolated environments.
Writing workflow automation well is hard
My biggest problem with CI right now is totally tool agnostic. Regardless of where and how you're running the scripts, building complicated workflows out of Bash scripts is still a relatively specialized skill, and it's hard to get application developers to spend time on it. It tends to be highly error-driven, and requires command of some technology that application developers otherwise don't encounter in their day-to-day work.
When I encounter an error while writing a script or some YAML for a pipeline, I can often recognize the general class of error it belongs to and make sense of it quickly. I can immediately recognize when a mysterious parsing error is probably a YAML whitespace issue, for instance, or a Bash shell expansion issue. I also have a lot of habits that eliminate these kinds of errors. I habitually quote strings and surround variables with curly braces, and I start all my scripts with set -eux.
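Here's a minimal sketch of what those habits look like in a script. The artifact_name helper and the file names are made up for illustration; the point is only the set -eux line, the quoting, and the curly braces.

```shell
#!/usr/bin/env bash
# -e: exit as soon as a command fails
# -u: treat unset variables as errors
# -x: print each command before it runs
set -eux

# Hypothetical helper: builds an artifact file name from an app name and a version.
artifact_name() {
  local app_name="${1}"
  local version="${2}"
  echo "${app_name}-${version}.tar.gz"
}

# Quotes keep a path containing a space as a single argument.
# Without them, mkdir and touch would see two separate words.
work_dir="$(mktemp -d)"
release_dir="${work_dir}/my releases"
mkdir -p "${release_dir}"
touch "${release_dir}/$(artifact_name "demo app" "1.2.3")"
ls "${release_dir}"

# Clean up the temporary directory.
rm -rf "${work_dir}"
```

Try removing the quotes around "${release_dir}" and the script creates the wrong directories without complaining, which is exactly the kind of error that's hard to Google.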
But even with that experience, I still find that writing automation like this is slow, fiddly, and hard to make predictions about. It's a lot of "get an error, solve that error, whoops there's a new error, repeat." I've seen developers without these habits and experience struggle to even understand what's happening when they encounter errors in CI. Pipelines are often people's first serious encounters with details of the filesystem, Bash's error handling and process model, and all kinds of deliciously Linux-y "blub." Sometimes people don't even know what to Google, or have any idea what's going wrong. What they're looking at should work.
And while the microwaved-black-coffee-drinking sysadmin in me thinks it's good for engineers to learn these things about tools they use every day (better in a CI pipeline than in production!), these problems often get in the way of shipping software. And they can get in the way of people investing the time to get themselves a good, tight feedback loop to production. Folks will end up running long, complicated test and deploy processes manually, because they can't justify the time to figure out and set up a good CI system.
This all gets much worse in regulated or otherwise security-sensitive environments. Teams won't set up even simple security scanning or dependency automation, because they don't have command of their automation tools. Or they'll put together pipelines using scripts and container images they pulled from who-knows-where because they're just trying to get something working, dang it.
If you're a manager or a lead for a team in this position, especially if you're a manager or a lead for a lot of them, you know how painful this is. You know the work in question isn't even really that hard, for someone who knows how to do it, and it's not that different from team to team. But you don't have a lot of the people who are good at it, and they have a lot of demands on their time. You can't have them spend all day writing what's basically the same automation, with small variations, for each individual team. You could hire a specialist like me, but that can be expensive and doesn't scale well, especially once you start having to maintain those pipelines.
Skylounge automates the parts that are the same for all projects
So here's where we get back to Skylounge, and why I immediately went, "Oh, I want that." Skylounge is a Github application that combines templates from a shared organizational library with repository-specific configuration, builds Github Actions workflows, and makes pull requests against those repositories to create and update those workflows. When Skylounge detects changes to that repository-specific configuration or to the shared organizational templates, it opens a pull request on any affected repository to update the Github Actions workflows it manages.
It's a tool that can make managing CI/CD pipelines easier, and let organizations leverage the folks like me, who are fast and confident writing CI pipelines and even think it's kind of fun, to build tooling that works for whole fleets of applications.
So here's an example. Say you have a lot of Java Spring apps, and you want to deploy them to Kubernetes. You can write a Skylounge blueprint for a Github workflow that builds a container and pushes it to your cluster, then add the skylounge.yml that configures that workflow to your repositories. Want to add a scanning step to that workflow later, change the base image, or maybe add a sidecar container to the application pods? All you have to do is make that change once, in the blueprint file, and Skylounge will open up PRs on all the applications using it. You won't surprise the developers responsible for those applications by making changes out from under them (they'll have a chance to review and understand the PRs), but you'll also be able to manage those applications as a fleet.
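To make the "one shared template, many repositories" idea concrete, here's a rough Bash sketch of the general technique. To be clear, this is not Skylounge's blueprint or skylounge.yml format, which I'm not reproducing here. The render_workflow function, the app names, and registry.example.com are all hypothetical; the only real thing is the shape of the Github Actions workflow it prints.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for a shared template: prints a Github Actions
# workflow for one application, filled in with per-repository values.
render_workflow() {
  local app_name="${1}"
  local image_repo="${2}"
  # \${GITHUB_SHA} is escaped so it stays literal in the output and is
  # expanded later, by the shell that runs the workflow step.
  cat <<EOF
name: deploy-${app_name}
on:
  push:
    branches: [main]
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t ${image_repo}/${app_name}:\${GITHUB_SHA} .
      - run: docker push ${image_repo}/${app_name}:\${GITHUB_SHA}
EOF
}

# One template, two applications. Changing the template changes both outputs.
render_workflow "orders" "registry.example.com"
render_workflow "billing" "registry.example.com"
```

Adding a scanning step means editing the heredoc once. The part a tool like Skylounge adds on top is turning each changed output into a pull request on the affected repository.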
Skylounge helps with platform onboarding
The place I'm especially interested in this is for building "platforms" in the "platform as a service" sense: systems that take infrastructure primitives (like Kubernetes) and wrap them in developer-friendly interfaces for provisioning and updating those primitives. Platforms are great for freeing up developer time from repetitive infrastructure tasks, but for them to do their job, you have to be able to get applications onto the platform, and you have to be able to update applications once they're there. These are both harder problems than you might expect.
First, let's talk about onboarding applications. Platform/infrastructure/DevOps teams, by the nature of their role, tend to have a much better handle on tools like Docker and why they're valuable than developers do. They also tend to know a lot more about the options available. It's easy to underestimate the amount of work that you're asking a team to do when you're asking them to start using a tool that you know well, and they don't. It's also easy to underestimate the degree to which you're disrupting their workflow when you ask them to change, say, their application's build process in order to get onto your platform. If you don't have a plan for getting applications onto a platform, it's easy to spend millions of dollars in engineer time building something that sees very little use.
Using a tool like Skylounge to create and manage CI templates resolves several common barriers to onboarding. It makes it easier to "meet teams where they're at." You can start by templating a workflow they're already using and familiar with. Then, when you've got everyone's CI under management, you can start incrementally moving them towards the workflow you want them to adopt. It also means DevOps specialists can do more of the technical heavy lifting.
Abandoned apps, the platform problem no one tells you about
Then once you get the applications on the platform, you have to maintain them. You'll need to update dependencies when they have security vulnerabilities. Most especially, you'll need to be able to update applications that no longer have dedicated teams maintaining them.
This one is a surprisingly big problem, and I think it sneaks up on people who haven't worked with a platform before. But platforms make it easy for companies to deploy applications. And if you make it easy to deploy applications, then you'll get more applications. A lot more. And then once those applications are out there, in production, making money or providing services, they'll stay there. Often for decades, if things are going really well. And eventually they won't really need any more updates or work, and something else will, and the team that's dedicated to them will get rolled off.
I've seen situations where an engineering organization deployed an application platform, onboarded teams to it, and then within months had abandoned applications with no team attached running on it. It happens fast. If you're looking at building a platform, you need to have a plan for maintaining these apps.
There are a lot of ways to tackle this problem, but Skylounge's CI-management option is interesting because it essentially gives you a shim into the applications: a way to make a lot of different kinds of changes to an application and its runtime environment.
Let's take CVE-2021-44228 as an example, a vulnerability that you might know as "the Log4j problem." This is a vulnerability in a Java logging library that allowed attackers who could get an application to print a particular string to its logs to get that application to download and execute code from anywhere. (When it was first reported, vulnerable applications included basically every Minecraft server. It was an exciting week.) There are two ways to patch an application that's vulnerable to the issue: You can update the logging library, or you can set an environment variable. The attack relied on a lookup feature of the logging library that, it turns out, it's possible to turn off. So even if you can't update the dependency, if you had access to the application's runtime environment, you could set an environment variable that prevented the library from performing the lookup that fetched and ran the remote code.
Applications on container-orchestration systems like Kubernetes typically get their environment variables updated when they're deployed. If you can make updates to all the deploy processes for all your applications from one place, you can set that environment variable for all your applications at once. You don't have to make changes to the code. You don't even have to have access to the code! Changing environment variables isn't always safer than changing code, but it's nice to have the option for those times when it is: throw a quick fix on everything now, and then start the process of testing and deploying the bigger change.
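As a sketch of what that quick fix could look like as a step in a shared deploy script: the variable that was widely cited for this at the time was LOG4J_FORMAT_MSG_NO_LOOKUPS, which applies to Log4j 2.10 and later and was afterwards judged an incomplete mitigation, so treat it strictly as a stopgap. The deployment names and the production namespace below are made up, and the script only prints the kubectl commands unless DRY_RUN is set to false.

```shell
#!/usr/bin/env bash
set -eu

# Prints the command when DRY_RUN is "true" (the default), runs it otherwise.
run() {
  if [ "${DRY_RUN:-true}" = "true" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Sets the stopgap variable on one Kubernetes deployment.
set_mitigation() {
  local deployment="${1}"
  local namespace="${2}"
  run kubectl set env "deployment/${deployment}" \
    --namespace "${namespace}" \
    "LOG4J_FORMAT_MSG_NO_LOOKUPS=true"
}

# Hypothetical fleet: in practice this list would come from wherever
# you keep track of the applications on the platform.
for deployment in billing-api reports-worker; do
  set_mitigation "${deployment}" "production"
done
```

With the workflows under central management, the same one-line change could instead land as a reviewed pull request in each repository's deploy workflow, rather than as a command somebody ran by hand.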
Skylounge doesn't hide anything from developers
The last thing I really like about Skylounge is that it's accessible. It's easy to install on top of any existing infrastructure choices. Even if the main blueprint that most applications use doesn't work for a particular application, you can still write a one-off blueprint for a particular application, store the core of its deployment pipeline in your central library, and have everyone's CI available for inspection and update by your infrastructure team.
And while it takes over Github Actions from application teams, it does so in a relatively transparent way: through pull requests. The automation it creates is right there in the repository, and the developers responsible for it need to approve the PRs it generates to get updates. This doesn't guarantee they'll understand what it does, but it does prevent the operations of the pipelines Skylounge manages from being a complete black box. If something goes wrong, they know where the files are and they can at least begin the process of debugging them.
Get in touch
If any of these problems sound familiar, I'd love to talk more, even if you're not using Github Actions. As you may have picked up on by now, I love talking about, thinking about, and building automation to deploy and manage applications, and if that's something you're struggling with I want to hear about it, and might even be able to help.