Making Lives Easier with Knewton Crab Stacker

In a previous post, we discussed Knewton’s in-house deployment tool, Knewton Crab Stacker (KCS). To summarize, KCS is a command line tool used to make deployment of CloudFormation (one of Amazon’s services) easier. It saves our engineers from banging their heads against their desks when trying to deploy their services.

So what exactly makes KCS such a valuable, can’t-live-without-it tool? In this post, we’ll take a look at some of the many KCS commands that make the lives of a Knewton engineer easier.

SSH

Normally, when you want to ssh into an EC2 instance, you have to go through a long and arduous process to find the instance’s public DNS name, then locate your own ssh key for that instance, and then finally type out the command that lets you ssh into the instance. You have to do this every single time you want to ssh. As you may imagine, this gets annoying fast.

To make this whole process simpler, we have a KCS command that does everything for you. All you have to do is specify which stack and target environment you’re trying to ssh into, and then KCS will take care of the rest. It will find the public DNS of the instance, find your ssh key, and finally ssh-es into the box for you. Besides being a huge time saver, my favorite part about this command is that it adds colors to the instance’s terminal. Colors make everything better.

BrianZeng1.jpg

Kick

Often while working on a service, we will make modifications to the instance (which we get into by using the awesome KCS ssh command). But when you make modifications, inevitably something gets messed up. No one wants their instance to be messed up, so you have to restart it. This usually involves relaunching the stack the instance is a part of, and twiddling your thumbs while you wait.

Here at Knewton, we like to save time, so we created a command that allows us to essentially restart our instance. We call this command kick.

Underneath the hood, kick gets the address of the instance we want to kick, ssh-es into the instance, and re-runs cfn-init (the command that is first run when the instance is created). This re-downloads the needed resources and configures everything you need using Chef. After kicking an instance, the instance is essentially brand new and ready for more tinkering.

Roundhouse Kick

A very common scenario that an engineer comes across is when he’s made a change to a service, has finished testing it locally, and then wants to test it on a real stack. To do this using just CloudFormation, we would have to first upload our new build to S3, then update our stack to use the new build. Updating a stack takes quite a bit of time, anywhere from a couple to ten-plus minutes. That’s a lot of waiting around.

That’s why we invented roundhouse kick. Roundhouse kick does everything you need to update the version of your service without having to relaunch your stack.

Here’s how it works: first, it will upload your build to S3. Next, it will do what we call an in-place update of the stack. Instead of launching new instances as a regular update would do, an in-place update just updates the existing instances. The time saved with the in-place update makes up the majority of the time saved. After updating the stack, KCS will then kick all the instances of the stack, which, in effect, restarts the stack and grabs the new version of the service you uploaded earlier. We like to think we made Chuck Norris proud with roundhouse kick.

“Chuck Norris can upload his new build to S3, update his stack, and kick his stack all at once.”

Grab logs and Failure logs

Sometimes you roundhouse kick your stack too hard and it stops working (there’s no such thing as a soft roundhouse kick). To find out what’s wrong, you have to ssh into the instance and check the logs. But there are many logs. And you’ll probably have forgotten where all of these logs are located.

Don’t worry — KCS has got you covered.

With a simple command, you can get all of the logs from your instance in a nicely bundled tarball. To do this, KCS knows the location of your logs thanks to some coordination with the Chef recipes that set up the logging system. After determining these locations, KCS will then perform an scp command with all the needed arguments to retrieve all the files. Now you can find out why your stack couldn’t handle the roundhouse kick.

What’s Next for KCS?

Even with all the cool commands that KCS has, there’s always room for improvement. People want KCS to run faster, have more features, and be invincible to bugs. When there’s a bug in a new release of KCS (which are unfortunately inevitable), the deployment team gets bombarded with complaints from disgruntled KCS users. We then work to fix everything, and try to get a new release of KCS out. But even when we do release a new KCS, not everyone remembers to upgrade their version and we continue to get complaints. We ask them to check their version and then we find out they aren’t on the latest version. An upgrade then fixes the issue. This is annoying and unnecessary for both KCS users and the deployment team.

To solve this problem, we created KCSServer — the website version of KCS, which has been my baby during my summer internship. Since KCSServer is a website, we don’t have to worry about people having different versions of KCS. We can very easily make changes to KCSServer without having to worry about getting people to install the latest version.

Migrating KCS to a website also provides many other benefits. One of the main issues we wanted to address was the speed of KCS. As a command line tool, KCS is pretty slow. For a command (such as describing a stack), KCS has to make a call to Amazon after determining all the proper credentials, and then once it retrieves the information, it has to output everything in a readable format for the user. With KCSServer, we can make this command much faster by utilizing a cache. A command has to be run once. Then, for all other times the command is run, KCSServer can just retrieve the output from the cache (of course, we update the cache as needed). This reduces the latency of a command from a couple of seconds to milliseconds. Considering that our rapidly-growing team of engineers uses KCS a lot, these seconds saved will quickly become hours, then days of developer time saved.. Another added benefit? With some CSS, we can make KCSServer look a whole lot more pleasant to look at than the dull terminal.

What’s the Take-Away?

Hopefully after reading about how we at Knewton use KCS to maximize our efficiency, you’ll start thinking more about how to eliminate inefficiencies in your own deployment process, or any process for that matter. Hopefully you’ll start asking yourself, “What’s slowing me down at doing my job?” and “What can I do about it?” Then you can go out there and create your own version of KCS. Don’t forget to give it an awesome name.

Knewton Crab Stacker: Innovations in Systems Engineering

Creating and maintaining the infrastructure to support a 40+ person developer team and more than a million users on the world’s most powerful and rigorously adaptive learning platform is no simple task. Conventional wisdom would suggest that a ten-person team with a wide swath of specialists would be the ideal arrangement. But in this regard, as with a number of other tech team practices, Knewton is anything but conventional.

Simply put, Knewton does not have an Ops team. Instead, the knowledge and tools required for infrastructure and systems tasks are distributed throughout the team. This structure confers a number of benefits. Developers are better able to write configurable, maintainable, deployable software because they have a strong understanding of our systems infrastructure. And systems-focused engineers are better able to optimize and maintain our infrastructure because they are also contributing members of the service developer teams.

In practice, this means that all software developers at Knewton are expected to both understand and utilize infrastructure technologies. All developers may find themselves on production support and tasked with updating environment configurations or deploying new code. Likewise, systems engineers all have a solid software engineering background and are expected to write production code.

Expectations and cross-pollination are only part of the process, however. Here’s some insight into the tools we use to create an environment where development and infrastructure work hand-in-hand.

AWS for everyone!

Every Knewton developer has access to our AWS console as well as the AWS command line tools pre-installed on their machines. Eventually every developer will go through the process of deploying code to production, at which point he or she will learn the basics of the AWS toolkit as well as our deployment and configuration management systems. As a result, no developer need waste time emailing the Systems group for basic information about EC2 instances or other AWS resources.

Knewton makes use of a number of AWS technologies, from EC2, to RDS, to ElastiCache and ElasticMapReduce. Many of these products have their own CLI tools, though they are all simple enough to install and configure that at Knewton we’ve made them part of box setup for new developers. While not every engineer is a wizard with the console or CLI commands, there is enough documentation to ensure any developer working alone on a late night will not be blocked by an AWS issue (unless the block is on AWS’ end…).

We have a few in-house tools that make use of AWS’ APIs, among them k.aws and the Knewton Crab Stacker. To focus this article a bit, I’ll dive specifically into the latter, as it addresses an important problem: deploying and updating software stacks.

CloudFormation and KCS

Knewton Crab Stacker, or KCS, makes use of the AWS CloudFormation toolset. CloudFormation makes it possible to define and deploy combinations of AWS resources, from EC2 instances to load balancers. The “define” part comes in the form of JSON templates, while deployment can be done either using the AWS console or a CLI tool.

Now, CloudFormation on its own is great for just “putting a box out there.” The templates let you define a wide variety of resources and set parameters and mappings for a base level service setup and tuning.

What CloudFormation doesn’t do well is box configuration. The JSON templates don’t allow you to do much in the way of defining functions, setting conditions, or even basic math. If you try to force configuration management onto it, you end up with a lot of bash scripts and config files floating around, or worse, hardcoded into templates.

Even relegating KCS to deploying boxes can be tricky. The launch command for a given stack can include a dozen or more command line arguments, such as EC2 instance parameters (size, type, etc.) and command flags.

The simplest case launch will make use of all defaults in the template and look something like this on the command line:

cfn-create-stack $StackName --template-file $Template

But if you need to use other CloudFormation functionality and override a number of parameters at launch, you’ll end up with something like this:

cfn-create-stack $StackName --template-file $Template --parameters AWSAccount=Production;InstanceType=m1.large;ClusterSize=4;ConfigTarballVersion=2.1.5;AppVersion=1.2.3;SSHKey=ProductionKey --capabilities CAPABILITY_IAM --disable-rollback

Yikes! That’s a lot to type from memory for each deploy. You’re especially going to want that last option to disable rollback as it keeps the instances from failed launches around for you to debug — essential for when you inevitably mistype a version number.

If stack launches are fairly consistent, you can mitigate the annoyance of launch commands with BASH scripts, but these will be a pain to maintain. But what if you have a number of frequently changing parameters or decisions that need to be made at launch? What if you need to work with multiple AWS accounts or validate components of your launch config so as to avoid a painful debug cycle? (Is that tarball you need in s3? Does this environment have all of my new stack’s dependencies?) Complex stacks can take ten to twenty minutes to launch. You don’t want to have to keep re-launching just because you fat-fingered the instance type.

The problem with the command above is that every parameter represents a potential point of failure. CloudFormation is only able to ensure that your template is logically consistent and the JSON is valid. It can’t know whether or not AppVersion 1.2.3 is a thing, or whether a four node cluster matches what is in the current environment, or numerous other details that can spoil an update before it begins.

This is where KCS steps in. Knewton Crab Stacker was developed by a team of Knewton engineers (including yours truly). KCS is a Python command line tool designed to make CloudFormation deployment much simpler.

The first nice thing KCS does is add the abstraction “environment” to our AWS accounts. It does this by simply taking the stackname parameter and appending $EnvironmentName + “-” to the front of it. From CloudFormation’s perspective, the stackname is “Dev-UserService,” but KCS understands the stack as “The UserService stack in the Dev environment.”

Making use of the namespace this way greatly simplifies the task of isolating test environments from one another. It adds one more piece to launch commands, which in the simplest case look like this:

kcs stacks create $Service $Template $Environment

The difference between this and the simple CloudFormation command above is what goes on behind the scenes.

Before initiating the create, KCS checks a number of things. First, KCS makes sure that the environment has any stacks that the new service needs. If a dependency is missing, you can still force a create. Secondly, KCS ensures that any s3 resources referenced in the template or launch command actually exist. In other words, if your launch command specifies “ServiceTarballVersion=0.3.444”, KCS makes sure that said tarball is actually there.

Updates are far more common than creates, and it is here where KCS really does a lot for you. Here’s a simple update command:

kcs stacks update $Service $Template $Environment

Like the create, KCS does a ton of validation on the template and environment. With the update however, KCS also runs a diff on the existing stack. Before the update actually runs, you will be shown a list of every parameter the update adds, removes, or changes. From there, you can either proceed with or cancel the update.

Before I do an update of a stack, I can also use “describe” to see what’s in the environment currently. The full command is “kcs stacks describe”, but I can shorten it using “s” and “d”, and aim it at our Dev environment like so:

kcs s d User Dev

Dev - $93.6 monthly
Stack   Status            $ monthly   Creation       Last Update
User    update complete   $93.6       4 months ago   3 months ago

SystemConfig Version: 1.1.55 User App Version: 1.1.224

i-01234567  ec2-01-23-45-678.compute-1.amazonaws.com (m1.medium)


This gives me a lot of cool info including the version of the App, some parameter information, as well as the instance ID, type, and hostname. If I want an exhaustive list of parameters I can do this:

kcs s d User Dev --detailed

Dev - $93.6 monthly
Stack   Status            $ monthly   Creation       Last Update
User    update complete   $93.6       4 months ago   3 months ago

Environment: Dev
Cluster Size: 4
SystemConfig Version: 1.1.55
Environment Class: Dev
User Version: 1.1.224
Instance Type: m1.medium
DB Password: ******
Billing Environment: Staging
ELBRegistration: Yes
AMI: ami-11223344
UserDB Address: amazonrds.us-east-1.rds.amazonaws.com
Key Name: STAGING-007
i-1234567   ec2-01-23-45-678.compute-1.amazonaws.com   (m1.medium)
12.345.678.999                             us-east-1c

These commands make it easy to run updates without knowing much about the stack, but there is an even easier method for the truly lazy:

kcs stacks interactive-update $Service $Environment

This command uses the existing stack’s template and then lets you pass in values for each parameter while showing you what is in the current environment. It guarantees that you only change exactly what you want to change.

When the update actually runs, KCS adds a few layers of insurance that CloudFormation does not. For one, it spins up brand new instances, runs their launch config, and then waits for success signals before tearing down the old stack. This allows you to set up whatever level of functionality and performance testing you want as a condition of a successful update. If part of the launch config fails or does something unexpected, KCS rolls everything back.

All of this just scratches the surface at what KCS can do. I could write a few dozen pages about KCS’ other abilities, like grabbing service logs, executing remote commands, hotswapping jars on Java stacks, and even snapshotting entire environments and then recreating new environments from snapshots (need to copy a 100 instance staging environment? No problem).

The main thing that KCS does is to kill the need for a Release Engineer or “Deployment Guy.” Nobody is happier about this than I am, as I was the “Deployment Guy” for months. Instead, we have a situation now where systems engineers can focus on improving infrastructure and devs can get new code out easily.

The lion’s share of the credit for KCS has to go to Sarah Haskins and Trevor Smith, the two developers who did the bulk of the coding. It has made life easier for all developers here at Knewton, and we hope to open source it in the future.

Configuration management and future challenges

As nice as KCS is for our deployment workflow, it is only able to solve one part of our infrastructure needs. Like any moderately large tech team, there are natural conflicts of interest that arise between those focused on system stability and maintenance, and those trying to push out a slick new feature. We’re not immune from knowledge bottlenecks and technical debt, but as the team grows and practices are refined, the future looks brighter and brighter for our tech team.

At the very least, thanks to KCS, we have a pretty good handle on deploying services. Swell. But how do we configure boxes once CloudFormation puts them out there? How do we ensure that services are able to talk to one another and that stacks are resilient to a plethora of errors?

Those fantastic questions, I’m afraid, will have to be the subject of another “N Choose K” post.

Announcing WP Stack

My name is Mark Jaquith. I’m one of the lead developers on the WordPress core software, and am a technical consultant to Knewton. The marketing team at Knewton wanted their WordPress-powered site to have a more professional development approach using version control; a real staging and deployment system (with the ability to roll back); and fast, scalable redundancy. I helped create that system for them. They loved it, and after talking to other companies with similar needs for their WordPress-powered sites, they thought it would be useful to turn this system into a generic WordPress deployment system that anyone could use to run a professional WordPress site.

Today, we’re announcing the result of that effort: WP Stack.

What it is

WP Stack is a set of deployment scripts using Capistrano, and a series of drop-in WordPress plugins that work in tandem with this system to perform commonly desired tasks on professional sites, such as rewriting media files to a CDN or serving WordPress multisite uploads directly through Nginx. It supports both production and staging environments, as well as file/database snapshotting from production to staging, so that you can test your code changes on fresh data.

The commands for tasks like deployment, rollback, and database/file syncs, are short one-liners such as cap production deploy or cap staging db:sync. And while code changes (like templates or plugins) require commits and deployments, the regular WordPress workflow is unchanged… so content creators can post, upload media, and edit content to their heart’s content without needing to know what’s going on under the hood.

WP Stack can be used to deploy your existing WordPress site repository, but you’ll get best results if you start with WordPress Skeleton, a complementary project that gives you a nicely laid out starter WordPress Git repository, that not-coincidentally is pre-wired to work well with WP Stack.

Who this is for

This project will have the most benefit for professional WordPress sites, where “doing it live” just isn’t an option. But there’s no reason it can’t be used for personal or small business websites too. Many organizations know that they should be using version control and a deployment system, but they don’t quite know where to start. It is my hope that WP Stack (and WordPress Skeleton) will help lower the barriers and lead to more organizations using professional development and deployment techniques for using the software that powers their public face.

The future

We’re not done! The roadmap for WP Stack includes things like multi-server uploads syncing, Puppet/Chef manifests with full Nginx configs, and a Vagrant config for easier local development. If you’d like to get involved, head on over to GitHub.

I’d also like to thank Knewton for sponsoring the development of WP Stack. It’s great to be able to work with a company that can see the value of open source contribution.