The most significant transformation that Netflix describes is the move away from centralized functions such as QA and testing.
In the Microservices at Netflix Scale presentation, Ruslan explains that the topmost priority for Netflix, above reliability and efficiency, is digital innovation, and that they identified these organizational stage gates as choke points that directly limited this ambition.
Going Cloud Native – Mastering Chaos
Traditional software testing methods come under sharply increasing strain as organizations wholly embrace the latest development practices, notably the shift to the ‘Cloud Native’ paradigm, which is based on the use of containers, microservices and high-frequency releases via Continuous Deployment. Netflix is the poster child of this new world.
To date, enterprise architecture has rested on the premise of a mostly static environment: a fixed set of applications, changed relatively infrequently and maintained in one or more strictly controlled data centres.
In contrast, Netflix now operates a global infrastructure spanning multiple AWS zones and running thousands of inter-operating microservices, with new services continually being spawned and the Cloud infrastructure auto-scaling beneath them; managing it is a process of ‘Mastering Chaos’.
As they repeatedly describe, testing is fundamental to this. They integrate the lessons they’ve learned into best practices, such as canary analysis and staged deployments, which are applied automatically as new code moves through their Continuous Deployment life-cycle in Spinnaker. Cloud guru David Linthicum makes the point that Cloud Native efforts won’t succeed without a suitable test automation capability like this.
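To make the idea of canary analysis concrete, here is a minimal sketch in Python. It is not Spinnaker's actual analysis, which is considerably more sophisticated; the function name, the thresholds and the three verdict strings are all assumptions chosen for illustration. The idea is simply to send a small share of traffic to the new version, compare its error rate against the existing baseline, and only then promote it or roll it back.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=1.5, min_requests=100):
    """Compare a canary's error rate with the baseline's.

    All names and thresholds here are hypothetical, for illustration only.
    Returns "continue" (not enough traffic yet), "promote" or "rollback".
    """
    # Do not judge until both versions have served enough requests.
    if baseline_requests < min_requests or canary_requests < min_requests:
        return "continue"

    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests

    # Tolerate a canary error rate up to max_ratio times the baseline,
    # with a small floor so a zero-error baseline does not block everything.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"
```

A pipeline stage built on a check like this would keep the canary running while the verdict is "continue", and only widen the rollout once it returns "promote".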
What is especially notable is that they don’t just apply testing to the process of writing and deploying new software; they also rigorously test the whole system.
In the Mastering Chaos presentation, Josh describes how they use techniques like Failure Injection Testing to simulate the failure of microservices. In the Microservices at Netflix Scale presentation, at 36:00, Ruslan demonstrates how they tested the failure of an entire AWS region.
Netflix has termed this approach ‘chaos engineering’. In short, they assume the system will fail, and they proactively simulate and test for those failures. At 11:00 Ruslan describes how they apply these principles in action, such as their use of Chaos Monkey for automating failure testing.
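The principle of assuming failure and testing for it can be sketched in a few lines of Python. This is not Netflix's Failure Injection Testing or Chaos Monkey, only a toy illustration of the same idea: the service names, the `with_failure_injection` wrapper and the fallback list are all hypothetical. A dependency is wrapped so that it fails at a chosen rate, and the test then checks that the caller degrades gracefully instead of crashing.

```python
import random


class InjectedFailure(Exception):
    """Stands in for a real dependency outage."""


def with_failure_injection(call, failure_rate, rng):
    """Wrap `call` so that it fails with probability `failure_rate`."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency outage")
        return call(*args, **kwargs)
    return wrapped


def fetch_recommendations(user_id):
    # Hypothetical downstream microservice call.
    return [f"title-{user_id}-{i}" for i in range(3)]


FALLBACK = ["popular-1", "popular-2"]


def homepage(user_id, recommendations):
    """Caller under test: must fall back rather than propagate the failure."""
    try:
        return recommendations(user_id)
    except InjectedFailure:
        return list(FALLBACK)


# Inject failures into half of the calls, using a seeded generator
# so the experiment is repeatable.
rng = random.Random(42)
flaky = with_failure_injection(fetch_recommendations, 0.5, rng)
results = [homepage(7, flaky) for _ in range(100)]
fallbacks = sum(r == FALLBACK for r in results)
```

Every entry in `results` is a usable page, whether it came from the real service or from the fallback, which is the property a failure-injection test is meant to confirm.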
In other words, Netflix applies testing from top to bottom and from start to finish across its entire environment, including but not limited to its software development life-cycle. Given the principle of ‘infrastructure as code’, they know that failures can occur at any point within the overall environment, not just in the code they write.
For organizations seeking to emulate this transformation, a number of methods and tools can be considered, including of course the components Netflix has open-sourced.
The Cloud Native QA guide repeats the Netflix philosophy, notably “It is important to not only design for failure but test for recovery”. It recommends a series of QA practices for adopting the same type of culture as Netflix, various ways to test for failure and recovery as Netflix does, and the use of tools such as OpenTracing.
Fernando Mayo explores the modernization of testing for this new microservices world, highlighting practices like property-based testing, fuzz testing and mutation testing that can help detect a wider range of defects in an automated way.
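As an illustration of the first of these practices, the following is a hand-rolled sketch of a property-based test in Python. It is not taken from Mayo's article; the run-length encoder and the function names are hypothetical examples. Rather than asserting on a few hand-picked inputs, the test generates many random inputs and checks a property that must hold for all of them, here that decoding an encoded string returns the original.

```python
import random


def rle_encode(text):
    """Run-length encode a string into (character, count) pairs."""
    pairs = []
    for ch in text:
        if pairs and pairs[-1][0] == ch:
            pairs[-1] = (ch, pairs[-1][1] + 1)
        else:
            pairs.append((ch, 1))
    return pairs


def rle_decode(pairs):
    """Invert rle_encode."""
    return "".join(ch * count for ch, count in pairs)


def check_round_trip(trials=500, seed=0):
    """Property: decode(encode(s)) == s for every generated string s."""
    rng = random.Random(seed)
    for _ in range(trials):
        # A two-letter alphabet makes long runs, the interesting case, likely.
        length = rng.randint(0, 20)
        text = "".join(rng.choice("ab") for _ in range(length))
        assert rle_decode(rle_encode(text)) == text, text
    return trials
```

In practice a dedicated library such as Hypothesis would generate the inputs and shrink any failing case to a minimal example; the loop above only shows the underlying idea.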
On LinkedIn, Shachar Landshut also proposes a framework for testing microservices, escalating from testing individual services through integration testing and ultimately to the chaos engineering approach that Netflix uses.