Microservices at Netflix Scale – Lessons Learned

CBPN Editor April 17, 2020

This insightful 2016 talk on microservices at Netflix was presented by Ruslan Meshenberg, who served as the director of Platform Engineering at Netflix.

Netflix operates a platform approach: a common layer that other developers use and build upon, enabling developer teams to work autonomously and quickly. The colossal levels of downstream traffic are supported by over 500 microservices, with 100 to 1,000 production changes being deployed daily.

Monolith to Microservices Cloud Migration

From 3:30 Ruslan describes the Netflix journey to the Cloud, with their desire to migrate being the primary motivation for also adopting microservices. It wasn’t possible to simply lift and shift their on-premises monolithic application; instead they ‘chiseled’ it up into microservice components and deployed these to the Cloud, a process that ultimately took seven years to complete.

At 5:35 he explains that the primary catalyst for this move was a major failure of the legacy application: a database corruption that took four days to recover from and had a severe impact on customers. This prompted their move to become an entirely Cloud-based company.

Core Principles

From 6:20 Ruslan begins to explain their microservices architecture, defining the core principles for guiding their use.

  • Use (and contribute to) open source where possible – Only develop new code if you really have to.
  • Services should be stateless – Don’t rely on sticky sessions. Only the persistence and caching layers should hold state, and chaos testing should prove that this holds.
  • Scale out vs scale up – The key dynamic of the Cloud is the ability to provision new instances in response to escalating demand.
  • Redundancy and isolation for resiliency – Make more than one of anything, and isolate the ‘blast radius’ of any one component failure.
  • Automate destructive testing – Assume failure will occur and continually test the system to prove it can withstand those failures.

From 11:00 Ruslan moves on to explore how they apply these principles in practice, such as their use of Chaos Monkey to automate failure testing and, in particular, their implementation of the Cassandra database and how best to optimize data consistency at large scale and across multiple regions.
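To make the link between statelessness, redundancy and automated failure testing concrete, here is a minimal sketch in Python. It is not Chaos Monkey and does not use any Netflix API; the `Instance`, `Cluster` and `chaos_round` names are hypothetical, and the "cluster" is just an in-memory list. It assumes that any live instance can serve any request, which is exactly what the stateless principle is meant to guarantee.

```python
import random


class Instance:
    """A hypothetical stateless service instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Cluster:
    """Routes each request to any live instance (no sticky sessions)."""

    def __init__(self, size):
        self.instances = [Instance(f"instance-{i}") for i in range(size)]

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # fail over to the next redundant instance
        raise RuntimeError("no healthy instances left")


def chaos_round(cluster, rng):
    """Terminate one randomly chosen live instance and return its name."""
    live = [i for i in cluster.instances if i.alive]
    victim = rng.choice(live)
    victim.alive = False
    return victim.name


# Kill two of three instances, then check the service still answers.
rng = random.Random(42)
cluster = Cluster(size=3)
killed = [chaos_round(cluster, rng) for _ in range(2)]
response = cluster.handle("GET /titles")
```

Because no instance holds session state, the request succeeds as long as one redundant instance survives. A real chaos tool terminates actual production instances; this sketch only illustrates the idea being tested.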

At 15:30 he addresses the billing component, the last and most intense of their microservice migrations, because billing systems are critically important and are also subject to strict compliance requirements that necessitate logging and auditing.

Microservices – Benefits

From 16:30 Ruslan moves on to defining the benefits of a microservices approach.

These benefits are best articulated by relating them to the organization’s priorities; in the case of Netflix, the core goals were ranked as 1) velocity of innovation, 2) reliability and 3) efficiency.

What is most notable about this paradigm isn’t the technology but the organizational model. Ruslan highlights that the core dynamic of monolithic enterprise systems that inhibits innovation is the single software deployment life-cycle. Yes, there are multiple teams working on various innovations, but each must feed its work into and through this one life-cycle, creating a “tight coupling” between the teams.

“Loosely coupled” teams work independently of each other and have end-to-end ownership of their own services. There is no central stage gate; each team manages the releases of its own code.

Furthermore, their platform approach provides each team with building blocks that address core needs like security and that the team can build upon, a ‘separation of concerns’ approach.

Microservices – Costs

From 19:40 Ruslan articulates the costs of the microservices approach. Again, this comes back not to technology but to organization. The new autonomous-teams approach requires a change in culture and structure (for example, there is no longer a single QA team), and adapting to these changes presents human challenges.

The main cost factor they experienced was operating both the on-premises and the cloud-native platforms in parallel while they migrated. Understandably, this created a huge workload and doubled the complexity and expense in areas such as bug fixing and maintenance.

However, having made that investment, they have empowered themselves to become the global digital success story we know today.

Microservices – Lessons Learned

So what lessons did Netflix learn from this journey? From 23:05 Ruslan explains that primarily these were:

  • Inter-process communication (IPC) is crucial for loose coupling – A common language for communication and contracts between microservices.
  • Common deployment methods – Homogeneity in how applications are deployed. Netflix initially achieved this through Asgard, then moved to Spinnaker. This provides a single source of truth for managing rollouts.
  • Database caching – Hundreds of microservices interacting with the databases presents a significant challenge, one that Netflix guarded against by implementing a caching layer, using EVCache.
  • Automated telemetry – Netflix generates over 20 million metrics per second, adding up to roughly 1.7 trillion a day. Human processing of reporting at this volume is impossible, so the metrics are piped into automated error detection and remediation algorithms.
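The database caching lesson can be illustrated with a minimal cache-aside sketch. This is not EVCache or its API: the `CountingDatabase` and `CacheAside` classes are hypothetical, the cache is a plain in-process dictionary, and concerns such as expiry, invalidation and replication are deliberately left out.

```python
class CountingDatabase:
    """Stand-in for a backing database that counts how often it is read."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows[key]


class CacheAside:
    """Serve reads from the cache; go to the database only on a miss."""

    def __init__(self, database):
        self.database = database
        self.cache = {}

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit: the database is not touched
        value = self.database.get(key)  # cache miss: read through
        self.cache[key] = value
        return value


# One hundred reads of the same key reach the database only once.
db = CountingDatabase({"title:1": "Example Show"})
store = CacheAside(db)
results = [store.get("title:1") for _ in range(100)]
```

The design point is that read load on the database grows with the number of distinct keys rather than with the number of calling services, which is why a shared caching layer helps when hundreds of services read the same data.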

At 28:20 Ruslan highlights the core challenge this presents for traditional approaches to enterprise IT architecture, notably the pointlessness of maintaining enterprise architecture diagrams, as these are rendered obsolete almost instantly. Instead the business needs a real-time view, achieved through this automated telemetry.
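The telemetry figures quoted above can be checked with simple arithmetic, and the idea of piping metrics into automated detection can be sketched in a few lines. The detector below is an assumption made for illustration only (a trailing-window deviation check); the talk summary does not say which algorithms Netflix uses, and `find_anomalies`, its parameters and the sample data are all hypothetical.

```python
from statistics import mean, pstdev

# 20 million metrics per second over one day is about 1.7 trillion.
METRICS_PER_SECOND = 20_000_000
SECONDS_PER_DAY = 24 * 60 * 60
metrics_per_day = METRICS_PER_SECOND * SECONDS_PER_DAY


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes whose value deviates from the mean of the trailing
    window by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            deviates = samples[i] != mu
        else:
            deviates = abs(samples[i] - mu) > threshold * sigma
        if deviates:
            anomalies.append(i)
    return anomalies


# Hypothetical error-rate samples with one obvious spike at index 6.
error_rates = [1.0, 1.1, 0.9, 1.0, 1.2, 1.1, 9.5, 1.0, 1.1]
flagged = find_anomalies(error_rates)
```

Here `metrics_per_day` evaluates to 1,728,000,000,000 and `flagged` contains only the index of the spike. In a real pipeline the flagged indexes would trigger alerting or remediation rather than being collected in a list.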

Reliability

At 29:30 he moves on to the critical importance of reliability and how this is achieved.

The key to this is the assumption that failures will occur, so the system is engineered with the capacity to adapt to them rapidly. Netflix pioneered and operates an approach of ‘circuit breakers’, implemented through their Hystrix component.
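The circuit breaker pattern can be sketched as follows. This is a simplified illustration in Python, not Hystrix (which is a Java library) and not its API; the `CircuitBreaker` class, its parameters and the two example functions are hypothetical. After a set number of consecutive failures the breaker "opens" and returns a fallback immediately, without calling the failing dependency, until a timeout has passed.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast, skip the dependency
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            tripped = self.failures >= self.failure_threshold
            if tripped or self.opened_at is not None:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


calls = {"count": 0}


def flaky_recommendations():
    """Hypothetical dependency that always times out."""
    calls["count"] += 1
    raise TimeoutError("dependency timed out")


def popular_titles():
    """Hypothetical degraded response used while the circuit is open."""
    return ["fallback title"]


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
responses = [breaker.call(flaky_recommendations, popular_titles)
             for _ in range(10)]
```

In this run the failing dependency is invoked only three times; the remaining seven calls are answered by the fallback straight away, which protects both the caller and the struggling dependency. A production implementation would also need thread safety, timeouts on the call itself and metrics, none of which are shown here.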

Again, automating these adaptations is critical, as is ongoing destructive testing of the environment to verify that capability. They run these tests at a very large scale, simulating the failure of entire AWS regions, as demonstrated in visual form at 35:35.

Containers and Conclusion

From 36:35 Ruslan focuses on the role of containers, highlighting that they are a valuable tool for implementing microservices but not a silver bullet.

As a very early Kubernetes user, they had to develop many of the additional components they needed. Wrapping up, he summarizes these and other tools that are available from Netflix.github.com.