Netflix is the world’s leading Internet television network, with over 40 million members in 41 countries watching more than one billion hours of TV shows and movies per month, which makes it one of the largest streaming services on the planet. In fact, Businessweek reports that Netflix accounts for almost a third of all Internet traffic entering North American homes.
Traffic spikes significantly during night-time viewing hours, and Netflix relies on Amazon’s worldwide cloud infrastructure to meet that server demand elastically. Behind the scenes, the Netflix Engineering Tools team works to deliver high-quality capacity, control costs, and let engineers across many product teams deploy code regularly and quickly roll back changes when bugs or other issues appear.
Netflix is deployed on Amazon Web Services (AWS), but AWS offered only limited built-in deployment and scaling services to support its applications, so Netflix chose to develop its own service console for interacting with Amazon. The result is a project called Asgard, written in Groovy and Grails.
Building on its experience with Groovy and Grails, and in the interest of the community, Netflix has open sourced Asgard so that other developers can use it as a reference app and companies can use it as their own console for AWS. Today, companies like IBM, eBay, and TRUSTe use it in various capacities.
Joe Sondow, lead Asgard engineer, has spoken at conferences around the world about the challenges Netflix faced, how Asgard helps address them, and how Asgard is developed and operated.
In this post, we share insights from a recent interview with Joe.
The Driving Requirements for Asgard
While Amazon provided the means to scale automatically, it lacked a good interface for configuring auto-scaling, and there were several key operational and technical challenges that Amazon had no roadmap to address. So Netflix began building Asgard, an application that, according to Sondow, “solves some of the shortcomings that we found from Amazon’s tooling for their cloud.”
First and foremost, security and privileges were important to Netflix. Just like with networks or databases, it is best practice to restrict access and account for spending on Amazon. Giving hundreds of engineers unlimited access to a large cloud account in Amazon is a recipe for trouble, and overall account information needed to remain secure and private.
Still, engineers needed a way to manage and deploy their own apps, including services like signup, streaming starts, ratings, autocomplete, search, and the API. While AWS Identity and Access Management (IAM) offered a solution in principle, managing accounts productively through IAM had its limits. Of course, most modern administrative systems also provide change tracking and logging of user actions to create an audit trail, help fix problems, and support legal requirements.
The deployment process also needed a specialized, automated workflow. According to Sondow, they needed to “support very easy and consistent behavior for lots of different teams who have to push code to the cloud repeatedly.” Automating the workflow with fast rollback became a key requirement. Essentially, Netflix wanted the old code on hot standby, which Amazon didn’t support. Netflix wanted to push new code to new instances, switch traffic over to those instances, leave the old instances running, monitor the results, and then switch traffic back to the old instances if there was a problem. New code, even when well tested, can still unexpectedly break or degrade service, and Netflix knew that manually managing two versions of the code, and the related changes across a large number of instances, was error-prone.
In short, engineering needed a deployment method that allowed fast rollback.
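The push-and-rollback workflow described above can be sketched as a small in-memory model. This is a minimal illustration under stated assumptions, not Asgard’s code: the `Cluster` and `AsgVersion` classes, their method names, and the `<app>-v<NNN>` naming scheme are all hypothetical. The point it shows is that a rollback only flips which group receives traffic, because the previous group was never shut down and needs no time to boot.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one versioned Auto Scaling Group (ASG) in a cluster. */
final class AsgVersion {
    final String name;     // hypothetical naming, e.g. "helloworld-v001"
    final String imageId;  // AMI the group's instances were launched from
    boolean inService;     // true if the load balancer sends traffic to this group

    AsgVersion(String name, String imageId) {
        this.name = name;
        this.imageId = imageId;
    }
}

/** Hypothetical cluster: every ASG version of one application, oldest first. */
final class Cluster {
    private final String appName;
    private final List<AsgVersion> versions = new ArrayList<>();

    Cluster(String appName) {
        this.appName = appName;
    }

    /** Launch a new group for the new code and move traffic to it; old groups stay up. */
    AsgVersion push(String imageId) {
        String name = String.format("%s-v%03d", appName, versions.size());
        AsgVersion next = new AsgVersion(name, imageId);
        for (AsgVersion v : versions) {
            v.inService = false; // old instances keep running on hot standby
        }
        next.inService = true;
        versions.add(next);
        return next;
    }

    /** Send traffic back to the previous group, which is still running. */
    AsgVersion rollback() {
        if (versions.size() < 2) {
            throw new IllegalStateException("no previous version to roll back to");
        }
        AsgVersion current = versions.get(versions.size() - 1);
        AsgVersion previous = versions.get(versions.size() - 2);
        current.inService = false;
        previous.inService = true;
        return previous;
    }

    /** The group currently receiving traffic. */
    AsgVersion active() {
        for (AsgVersion v : versions) {
            if (v.inService) {
                return v;
            }
        }
        throw new IllegalStateException("no version is in service");
    }
}

class FastRollbackDemo {
    public static void main(String[] args) {
        Cluster cluster = new Cluster("helloworld");
        cluster.push("ami-old");
        cluster.push("ami-new");
        System.out.println(cluster.active().name); // helloworld-v001
        cluster.rollback();                         // problem found: switch back
        System.out.println(cluster.active().name); // helloworld-v000
    }
}
```

Keeping the old group running costs extra instances for the monitoring window; the sketch assumes that is an acceptable price for a rollback that is a traffic switch rather than a redeploy.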
Netflix’s operating model also requires its applications to scale dynamically in the cloud so that costs track actual usage. To add and remove instances automatically, Netflix used Amazon’s Auto Scaling Groups (ASGs) as the primary unit of deployment: users set conditions that add EC2 instances when demand increases and remove them when they are no longer needed, keeping costs down. While ASGs gave Netflix scale, the AWS Management Console lacked support for them. Netflix therefore defined a new object called an application, which is related to its ASGs and their launch configurations (a launch configuration combines security groups and an Amazon Machine Image, or AMI). As the ASG launches new instances from this security group and AMI configuration, the instances are also attached to an Elastic Load Balancer (ELB) to manage traffic and assigned to a cluster version to support fast rollbacks.
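The relationships in that paragraph (application, ASG, launch configuration, ELB) can be sketched as a handful of Java types. This is a hypothetical data model for illustration only, not Asgard’s or AWS’s API: the type names, the fake instance IDs, and the simple clamp-to-bounds scaling rule are assumptions. It shows each new instance inheriting the AMI and security groups from the launch configuration and joining the group’s load balancer.

```java
import java.util.ArrayList;
import java.util.List;

/** Launch configuration: the AMI and security groups that new instances start with. */
record LaunchConfiguration(String imageId, List<String> securityGroups) {}

/** A launched EC2 instance, reduced to the fields this sketch cares about. */
record Instance(String id, String imageId, List<String> securityGroups, String loadBalancer) {}

/** One versioned Auto Scaling Group (ASG) belonging to an application. */
final class AutoScalingGroup {
    final String name;                // hypothetical "<app>-v<NNN>" cluster version label
    final LaunchConfiguration config; // what every new instance is launched from
    final String loadBalancer;        // ELB that new instances are attached to
    final int minSize;
    final int maxSize;
    final List<Instance> instances = new ArrayList<>();
    private int launched = 0;         // counter used to build fake instance IDs

    AutoScalingGroup(String name, LaunchConfiguration config, String loadBalancer,
                     int minSize, int maxSize) {
        this.name = name;
        this.config = config;
        this.loadBalancer = loadBalancer;
        this.minSize = minSize;
        this.maxSize = maxSize;
    }

    /** Add or remove instances to reach the desired count, clamped to [minSize, maxSize]. */
    void scaleTo(int desired) {
        int target = Math.max(minSize, Math.min(maxSize, desired));
        while (instances.size() < target) {
            // Each new instance inherits the AMI and security groups and joins the ELB.
            String id = name + "-i" + launched++;
            instances.add(new Instance(id, config.imageId(), config.securityGroups(), loadBalancer));
        }
        while (instances.size() > target) {
            // Simplification: drop the newest instance (AWS has its own termination policies).
            instances.remove(instances.size() - 1);
        }
    }
}

/** Netflix's "application" object: a name plus the ASGs that belong to it. */
record Application(String name, List<AutoScalingGroup> groups) {}

class ApplicationModelDemo {
    public static void main(String[] args) {
        LaunchConfiguration lc =
            new LaunchConfiguration("ami-12345678", List.of("helloworld-sg"));
        AutoScalingGroup asg =
            new AutoScalingGroup("helloworld-v000", lc, "helloworld-frontend", 2, 10);
        Application app = new Application("helloworld", List.of(asg));

        app.groups().get(0).scaleTo(25);          // demand spike, capped at maxSize
        System.out.println(asg.instances.size()); // 10
        app.groups().get(0).scaleTo(0);           // quiet period, never below minSize
        System.out.println(asg.instances.size()); // 2
    }
}
```

In real AWS the scaling conditions are alarms and policies that adjust the group’s desired capacity; here a single `scaleTo` call stands in for that decision so the sketch stays focused on how configuration flows from the launch configuration to each instance.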
Today, Asgard manages a high-scale, dynamic environment. It provides a single, clear view of the Netflix cloud, supports deployments, enables rollbacks, and provides security. It offers many other features as well, including rolling pushes, convention enforcement, Jenkins CI integration, status monitoring, and metrics tracking.
Why Groovy and Grails for Asgard?
According to Sondow, Netflix started Asgard’s first version on Groovy and Grails because it was mainly a Java shop and someone simply wanted to try it.
That experience has turned Netflix and Sondow into community spokespeople. Sondow sums it up enthusiastically: “Grails lets you iterate quickly. It is just a spectacular way to make things quickly. If you have an idea, you just write a little bit of code and there, that’s your new functionality. There is a lot less boilerplate than in Java or in some other web frameworks. If you’re into Java, then you don’t know it yet, but you’re actually into Groovy. And, if you’re into Groovy, then Grails is the most straightforward way to make a web app.”
Sold on the experience after just a few weeks, Sondow points to increased developer productivity as a main benefit: “I certainly like the reduction of repetitive boilerplate code, as well as the wealth of extra utility methods that Java hasn’t yet added to common classes.”
Sondow also believes Groovy and Grails are quick and easy to learn: “Anyone who is good at Java web development can get up to speed on Groovy and Grails pretty easily.” Netflix engineers who begin using Groovy and Grails in their applications face almost no learning curve, because the syntax, while shorter and more concise, is very familiar. Groovy code is also easier to maintain, which matters because on some projects developers spend a lot of time reading through old code. “For the engineering tools we develop, Grails brings a lot of value to the table by keeping a code base manageable and relatively small. Part of that value comes from embracing the Grails defaults, including the use of Groovy as the primary language,” said Sondow. Convention over configuration in Grails keeps things simple without sacrificing flexibility when a situation calls for it.
Groovy and Grails also run on the Java Virtual Machine (JVM). Sondow offered insight here: “Groovy benefits from in-house experts on JVM tuning and a wide ecosystem of Java libraries. It reduces or eliminates the drudgery of XML configuration files. Overall, it means a higher signal-to-noise ratio in the code, expressing the program’s intent in shorter bits than Java.” Beyond the productivity gains in writing code, testing and deployment also improve. “The greatest power of Groovy seems to be how it facilitates feature rich, minimalist frameworks for common use cases like Grails (web development), Gradle (builds), and Spock (automated testing),” said Sondow. Better tools for testing and building, in turn, raise application quality.
Get started by learning more about Groovy, Grails, and Asgard:
- Read the Asgard Case Study.
- Check out the Asgard Code on GitHub.
- Reference the Asgard Quick Start Guide.
- Review the Asgard Slideshare Presentation.
- Check out how the Netflix Dynamic Scripting Platform also uses Groovy.
- See the Groovy Website and Grails Website.