They too previously operated a traditional enterprise data centre technology platform, where famously “a single missing semicolon brought down the entire (monolithic) Netflix website for several hours in 2008.” Read more about this story on DZone.
In this video the chief architect behind this digital transformation, Adrian Cockcroft, shares his experiences of the Netflix journey to the Cloud.
Cloud Migration Best Practices
First we can examine their transition from their legacy IT estate to the Cloud.
In this blog they focus on the migration of the core Netflix billing systems from their own data centre to AWS, and from Oracle to a Cassandra / MySQL combination, emphasizing in particular the scale and complexity of this database migration part of the Cloud Migration journey.
This initial quote from the Netflix blog sets the scene accordingly:
On January 4, 2016, right before Netflix expanded itself into 130 new countries, Netflix Billing infrastructure became 100% AWS cloud-native.
They also reference a previous blog also describing this overall AWS journey, again quickly making the most incisive point – this time describing the primary inflection point in CIO decision making that this shift represents, a move to ‘Web Scale IT‘:
That is when we realized that we had to move away from vertically scaled single points of failure, like relational databases in our datacenter, towards highly reliable, horizontally scalable, distributed systems in the cloud.
Migrating Mission-critical Systems
They then go on to explain their experiences of a complex migration of highly sensitive, operational customer systems from their own data centre to AWS.
As you might imagine the core customer billing systems are the backbone of a digital delivery business like Netflix, handling everything from billing transactions through reporting feeds for SOX compliance, and face a ‘change the tyre while the car is still moving’ challenge of keeping front-facing systems available and consistent to ensure unbroken service for a globally expanding audience, while conducting a background process of migrating terabytes of data from on-site enterprise databases into the AWS service.
- We had billions of rows of data, constantly changing and composed of all the historical data since Netflix’s inception in 1997. It was growing every single minute in our large shared database on Oracle. To move all this data over to AWS, we needed to first transport and synchronize the data in real time, into a double digit Terabyte RDBMS in cloud.
- Being a SOX system added another layer of complexity, since all the migration and tooling needed to adhere to our SOX processes.
- Netflix was launching in many new countries and marching towards being global soon.
- Billing migration needed to happen without adversely impacting other teams that were busy with their own migration and global launch milestones.”
The scope of data migration and the real-time requirements highlight the challenging nature of Cloud Migrations, and how it goes far beyond a simple lift and shift of an application from one operating environment to another.
The backbone of the challenge was how much code and data was interacting with Oracle, and so their goal was to ‘disintegrate’ that dependency into a services based architecture.
“Moving a database needs its own strategic planning:
Database movement needs to be planned out while keeping the end goal in sight, or else it can go very wrong. There are many decisions to be made, from storage prediction to absorbing at least a year’s worth of growth in data that translates into number of instances needed, licensing costs for both production and test environments, using RDS services vs. managing larger EC2 instances, ensuring that database architecture can address scalability, availability and reliability of data. Creating disaster recovery plan, planning minimal migration downtime possible and the list goes on. As part of this migration, we decided to migrate from licenced Oracle to open source MYSQL database running on Netflix managed EC2 instances.”
Overall this transformation scope and exercise included:
- APIs and Integrations – The legacy billing systems ran via batch job updates, integrating messaging updates from services such as gift cards, and billing APIs are also fundamental to customer workflows such as signups, cancellations or address changes.
- Globalization – Some of the APIs needed to be multi-region and highly available, so data was split into multiple Cassandra data stores. A data migration tool was written that transformed member billing attributes spread across many tables in oracle into a much smaller Cassandra structure.
- ACID – Payment processing needed ACID transaction, and so was migrated to MySQL. Netflix worked with the AWS team to develop a multi-region, scalable architecture for their MySQL master with DRBD copy and multiple read replicas available in different regions, with toolingn and alerts for MySQL instances to ensure monitoring and recovery as needed.
- Data / Code Purging – To optimize how much data needed migrated, the team conducted a review with business teams to identify what data was still actually live, and from that review purged many unnecessary and obsolete data sets. As part of this housekeeping obsolete code was also identified and removed.
A headline challenge was the real-time aspect, ‘changing the tyre of the moving car’, migrating data to MySQL that is constantly changing. This was achieved through Oracle GoldenGate, which could replicate their tables across heterogeneous databases, along with ongoing incremental changes. It took a heavy testing period of two months to complete the migration via this approach.
Downtime was needed for this scale of data migration, and to mitigate impact for users Netflix employed an approach of ‘decoupling user facing flows to shield customer experience from downtimes or other migration impacts’.
All of their tooling was built around ability to migrate a country at a time and funnel traffic as needed. They worked with ecommerce and membership services to change integration in user workflows to an asynchronous model, building retry capabilities to rerun failed processing and repeat as needed.
An absolute requirement was SOX Compliance, and for this Netflix made use of components available in their OSS open source suite.
“Our Cloud deployment tool Spinnaker was enhanced to capture details of deployment and pipe events to Chronos and our Big Data Platform for auditability. We needed to enhance Cassandra client for authentication and auditable actions. We wrote new alerts using Atlas that would help us in monitoring our applications and data in the Cloud.”
Building HA, Globally Distributed Cloud Applications with AWS
Netflix provides a detailed, repeatable best practice case study for implementing AWS Cloud services, at an extremely large scale, and so is an ideal baseline candidate for any enterprise organization considering the same types of scale challenges, especially with an emphasis on HA – High Availability.
Two Netflix presentations: Globally Distributed Cloud Applications, and From Clouds to Roots provide a broad and deep review of their overall global architecture approach, in terms of exploiting AWS with the largest and most demanding of of capacity and growth requirements, such as hosting tens of thousands of virtual server instances to operate the Netflix service, auto-scaling by 3k/day.
This goes into a granular level of detail of how they monitor performance, and then additionally in they focus specifically on High Availability Architecture, providing a broad and deep blueprint for this scenario requirements.
Cloud Native Toolchains
Migration is only the first step of the journey. Netflix has achieved their massive global growth because they are innovating continuously, meaning that they must be continually updating and changing their digital business systems.
In this blog Global Continuous Delivery With Spinnaker they explain how it addresses this scope of the code development lifecycle, across global teams, and forms the backbone of their DevOps ‘toolchain’, integrating with other tools such as Git, Nebula, Jenkins and Bakery.
As they describe:
Spinnaker is an open source multi-cloud Continuous Delivery platform for releasing software changes with high velocity and confidence. Spinnaker is designed with pluggability in mind; the platform aims to make it easy to extend and enhance cloud deployment models.
Moving from Asgard
Their history leading up to the conception and deployment of Spinnaker is helpful reading too; previously they utilized a tool called ‘Asgard’, and in Moving from Asgard:, describe the limitations they reached using that type of tool, and how instead they sought a new tool that could achieve:
- “enable repeatable automated deployments captured as flexible pipelines and configurable pipeline stages
- provide a global view across all the environments that an application passes through in its deployment pipeline
- offer programmatic configuration and execution via a consistent and reliable API
- be easy to configure, maintain, and extend”
These requirements formed into Spinnaker and the deployment practices they describe, which you can repeat through the Github Download.
The third Cloud Native foundation component is a microservices software architecture and again there are a wealth of resources to learn from.
In the past, architects tried to design every aspect of a software. Applications were supposed to work like perfect machines. The microservices approach to software development is more organic. Instead of trying to control every aspect of a complex system, architects try to set up rules to make a functioning organism. DevOps tools play an important role in this ecosystem. These tools help multiple teams work with each other seamlessly. The process results in healthy, flexible and scalable software.
Netflix OSS is a great place for any DevOps team to get an idea of the variety of tools that Netflix uses to run its massive microservice-based applications.
Matias De Santi of Wolox describes how microservices can make use of AWS services like their API Gateway and RisingStack offers this article Deploying Node.js Microservices to AWS using Docker.
- Build and Delivery Tools: Netflix has a collection of Gradle plugins called Nebula that helps development teams create repeatable builds. It helps save time during development. The Aminator tool packages AMIs for AWS. Spinnaker is Netflix’s continuous delivery platform that makes its complex microservices deployment possible.
- Common Runtime Services and Libraries: Eureka is Netflix’s service discovery tool. The application Ribbon helps with service communications. Hystrix helps isolate latency and fault tolerance at runtime.
- Data Persistence: Netflix’s EVCache and Dynomite are important innovations that help microservices use Memcached and Redis at scale.
- Insight, Reliability, and Performance: Netflix has developed a lot of tools to collect metrics and automatically address problems. But Chaos Monkey and Simian Army are its most famous reliability tools. These tools help Netflix test instances with random failures. The above discussion touches only a handful of options. The Netflix OSS page has a list of all the available open source tools that can help DevOps practices.