The uber poster child of migrating legacy applications and IT systems to a modular, high frequency agile ‘Cloud Native’ approach is Netflix.
In the featured video the chief architect behind this digital transformation, Adrian Cockcroft, shares his experiences of the Netflix journey to the Cloud.
Migrating to Web-Scale IT
In a VentureBeat article the author envisions ‘the future of enterprise tech’.
They describe how pioneering organizations like Netflix are entirely embracing a Cloud paradigm for their business, moving away from the traditional approach of owning and operating your own data centre populated by the likes of EMC, Oracle and VMware.
Instead they are moving to ‘web scale IT’ via on demand rental of containers, commodity hardware and NoSQL databases, but critically it’s not just about swapping out the infrastructure components. By further embracing software architecture principles like Microservices and high frequency Continuous Deployment practices an organization can go fully Cloud Native.
Cloud Migration Best Practices
First we can examine their transition from their legacy IT estate to the Cloud.
In this blog they focus on the migration of the core Netflix billing systems from their own data centre to AWS, and from Oracle to a Cassandra / MySQL combination, emphasizing in particular the scale and complexity of this database migration part of the Cloud Migration journey.
This initial quote from the Netflix blog sets the scene accordingly:
On January 4, 2016, right before Netflix expanded itself into 130 new countries, Netflix Billing infrastructure became 100% AWS cloud-native.
They also reference an earlier blog post describing the overall AWS journey, which quickly makes the most incisive point of all: the primary inflection point in CIO decision making that this shift represents, a move to ‘Web Scale IT’:
That is when we realized that we had to move away from vertically scaled single points of failure, like relational databases in our datacenter, towards highly reliable, horizontally scalable, distributed systems in the cloud.
Migrating Mission-critical Systems
They then go on to explain their experiences of a complex migration of highly sensitive, operational customer systems from their own data centre to AWS.
As you might imagine, the core customer billing systems are the backbone of a digital delivery business like Netflix, handling everything from billing transactions through to reporting feeds for SOX compliance. The team faced a ‘change the tyre while the car is still moving’ challenge: keeping front-facing systems available and consistent to ensure unbroken service for a globally expanding audience, while migrating terabytes of data in the background from on-site enterprise databases into AWS.
- “We had billions of rows of data, constantly changing and composed of all the historical data since Netflix’s inception in 1997. It was growing every single minute in our large shared database on Oracle. To move all this data over to AWS, we needed to first transport and synchronize the data in real time, into a double digit Terabyte RDBMS in cloud.
- Being a SOX system added another layer of complexity, since all the migration and tooling needed to adhere to our SOX processes.
- Netflix was launching in many new countries and marching towards being global soon.
- Billing migration needed to happen without adversely impacting other teams that were busy with their own migration and global launch milestones.”
The scope of the data migration and its real-time requirements highlight the challenging nature of Cloud Migrations, and how they go far beyond a simple lift and shift of an application from one operating environment to another.
The heart of the challenge was how much code and data were interacting with Oracle, and so their goal was to ‘disintegrate’ that dependency into a services-based architecture.
“Moving a database needs its own strategic planning:
Database movement needs to be planned out while keeping the end goal in sight, or else it can go very wrong. There are many decisions to be made, from storage prediction to absorbing at least a year’s worth of growth in data that translates into number of instances needed, licensing costs for both production and test environments, using RDS services vs. managing larger EC2 instances, ensuring that database architecture can address scalability, availability and reliability of data. Creating disaster recovery plan, planning minimal migration downtime possible and the list goes on. As part of this migration, we decided to migrate from licenced Oracle to open source MYSQL database running on Netflix managed EC2 instances.”
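The storage prediction mentioned in the quote can be sketched as simple compound-growth arithmetic. This is only an illustration: the function name, the growth rate, the headroom factor and the per-instance capacity are all hypothetical and are not figures from the Netflix blog.

```python
import math


def instances_needed(current_tb, monthly_growth_rate, months,
                     usable_tb_per_instance, headroom=0.25):
    """Project data size forward and convert it into an instance count.

    All inputs are hypothetical planning figures, not Netflix's numbers.
    """
    # Compound the current size over the planning horizon.
    projected_tb = current_tb * (1 + monthly_growth_rate) ** months
    # Add headroom so the estimate absorbs unexpected growth.
    required_tb = projected_tb * (1 + headroom)
    # Round up: a partially used instance still has to be provisioned.
    count = math.ceil(required_tb / usable_tb_per_instance)
    return projected_tb, count


if __name__ == "__main__":
    projected, count = instances_needed(
        current_tb=16, monthly_growth_rate=0.5, months=2,
        usable_tb_per_instance=8,
    )
    print(f"projected size: {projected} TB, instances: {count}")
```

The same calculation would need to be repeated for test environments, since the quote notes that licensing and instance costs apply to both production and test.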
Overall this transformation scope and exercise included:
- APIs and Integrations – The legacy billing systems ran via batch job updates, integrating messaging updates from services such as gift cards, and billing APIs are also fundamental to customer workflows such as signups, cancellations or address changes.
- Globalization – Some of the APIs needed to be multi-region and highly available, so data was split into multiple Cassandra data stores. A data migration tool was written that transformed member billing attributes spread across many tables in Oracle into a much smaller Cassandra structure.
- ACID – Payment processing needed ACID transactions, and so was migrated to MySQL. Netflix worked with the AWS team to develop a multi-region, scalable architecture for their MySQL master with DRBD copy and multiple read replicas available in different regions, with tooling and alerts for MySQL instances to ensure monitoring and recovery as needed.
- Data / Code Purging – To reduce how much data needed to be migrated, the team conducted a review with business teams to identify what data was still actually live, and from that review purged many unnecessary and obsolete data sets. As part of this housekeeping, obsolete code was also identified and removed.
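The Globalization step above, collapsing attributes spread across many relational tables into a smaller structure, can be sketched as follows. This is a minimal illustration only: the table names, field names and the shape of the output are hypothetical, and this is not Netflix's migration tool.

```python
def denormalize_billing(members, payment_methods, invoices):
    """Collapse rows from several relational tables into one record per member.

    Each argument is a list of dicts standing in for the rows of one table.
    The table and column names are hypothetical.
    """
    # Start one compact record per member, keyed by member id.
    records = {
        m["member_id"]: {
            "member_id": m["member_id"],
            "country": m["country"],
            "payment_methods": [],
            "invoices": [],
        }
        for m in members
    }
    # Fold child-table rows into the owning member's record.
    for pm in payment_methods:
        records[pm["member_id"]]["payment_methods"].append(pm["type"])
    for inv in invoices:
        records[inv["member_id"]]["invoices"].append(
            {"id": inv["invoice_id"], "amount": inv["amount"]}
        )
    return records


if __name__ == "__main__":
    out = denormalize_billing(
        members=[{"member_id": 1, "country": "US"}],
        payment_methods=[{"member_id": 1, "type": "card"}],
        invoices=[{"member_id": 1, "invoice_id": "A1", "amount": 9.99}],
    )
    print(out[1])
```

A wide-row store such as Cassandra favours this kind of single-key lookup over joins, which is one plausible reason the target structure ends up much smaller than the source schema.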
A headline challenge was the real-time aspect, ‘changing the tyre of the moving car’, migrating data to MySQL that is constantly changing. This was achieved through Oracle GoldenGate, which could replicate their tables across heterogeneous databases, along with ongoing incremental changes. It took a heavy testing period of two months to complete the migration via this approach.
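The general pattern described here, an initial bulk copy followed by a stream of incremental changes, can be sketched in a few lines. This is not Oracle GoldenGate's API; the change-record format (`seq`, `op`, `key`, `row`) and the function names are assumptions made purely for illustration.

```python
def apply_change(target, change):
    """Apply one captured change to the target table (a dict keyed by primary key)."""
    op, key = change["op"], change["key"]
    if op in ("insert", "update"):
        target[key] = change["row"]
    elif op == "delete":
        target.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def migrate(source_snapshot, change_log):
    """Bulk-load a snapshot, then replay changes captured while it was copied."""
    # Step 1: the initial load copies the snapshot as it stood at capture time.
    target = dict(source_snapshot)
    last_seq = 0
    # Step 2: replay changes in sequence order so later writes win.
    for change in sorted(change_log, key=lambda c: c["seq"]):
        apply_change(target, change)
        last_seq = change["seq"]
    # last_seq tells a later run where to resume the incremental replay.
    return target, last_seq


if __name__ == "__main__":
    snapshot = {1: {"plan": "basic"}, 2: {"plan": "premium"}}
    log = [
        {"seq": 2, "op": "delete", "key": 2, "row": None},
        {"seq": 1, "op": "update", "key": 1, "row": {"plan": "standard"}},
        {"seq": 3, "op": "insert", "key": 3, "row": {"plan": "basic"}},
    ]
    print(migrate(snapshot, log))
```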
Downtime was needed for this scale of data migration, and to mitigate impact for users Netflix employed an approach of ‘decoupling user facing flows to shield customer experience from downtimes or other migration impacts’.
All of their tooling was built around the ability to migrate one country at a time and funnel traffic as needed. They worked with the ecommerce and membership services to change the integration in user workflows to an asynchronous model, building retry capabilities to rerun failed processing as needed.
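An asynchronous model with retries can be sketched as a queue whose failed items are re-enqueued until an attempt limit is reached. This is a simplified, in-memory illustration: the function name, the attempt limit and the event names are hypothetical, and a real system would use a durable queue rather than a Python deque.

```python
from collections import deque


def process_with_retries(events, handler, max_attempts=3):
    """Process events from a queue, re-queuing failures up to max_attempts.

    Returns (succeeded, dead_lettered). The handler is any callable that
    raises an exception on failure.
    """
    queue = deque((event, 1) for event in events)
    succeeded, dead_lettered = [], []
    while queue:
        event, attempt = queue.popleft()
        try:
            handler(event)
            succeeded.append(event)
        except Exception:
            if attempt < max_attempts:
                # Put the event back so the user-facing flow is never blocked.
                queue.append((event, attempt + 1))
            else:
                # Give up and park the event for manual follow-up.
                dead_lettered.append(event)
    return succeeded, dead_lettered


if __name__ == "__main__":
    seen = {}

    def flaky_billing_update(event):
        # Hypothetical handler that fails the first time it sees each event.
        seen[event] = seen.get(event, 0) + 1
        if seen[event] == 1:
            raise RuntimeError("billing backend unavailable")

    print(process_with_retries(["signup-1", "cancel-2"], flaky_billing_update))
```

Because the caller only enqueues work, a billing outage during migration delays processing instead of failing the customer's signup or cancellation.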
An absolute requirement was SOX compliance, and for this Netflix made use of components available in their open source (OSS) suite.
“Our Cloud deployment tool Spinnaker was enhanced to capture details of deployment and pipe events to Chronos and our Big Data Platform for auditability. We needed to enhance Cassandra client for authentication and auditable actions. We wrote new alerts using Atlas that would help us in monitoring our applications and data in the Cloud.”
Building HA, Globally Distributed Cloud Applications with AWS
Netflix provides a detailed, repeatable best practice case study for implementing AWS Cloud services, at an extremely large scale, and so is an ideal baseline candidate for any enterprise organization considering the same types of scale challenges, especially with an emphasis on HA – High Availability.
Two Netflix presentations, Globally Distributed Cloud Applications and From Clouds to Roots, provide a broad and deep review of their overall global architecture approach, in terms of exploiting AWS for the largest and most demanding of capacity and growth requirements, such as hosting tens of thousands of virtual server instances to operate the Netflix service, auto-scaling by 3k/day.
This goes into a granular level of detail on how they monitor performance, and they additionally focus specifically on High Availability architecture, providing a blueprint for the requirements of this scenario.
Cloud Native Toolchains
Migration is only the first step of the journey. Netflix has achieved their massive global growth because they are innovating continuously, meaning that they must be continually updating and changing their digital business systems.
In the blog Global Continuous Delivery With Spinnaker they explain how Spinnaker addresses this scope of the code development lifecycle across global teams, and forms the backbone of their DevOps ‘toolchain’, integrating with other tools such as Git, Nebula, Jenkins and Bakery.
As they describe:
Spinnaker is an open source multi-cloud Continuous Delivery platform for releasing software changes with high velocity and confidence. Spinnaker is designed with pluggability in mind; the platform aims to make it easy to extend and enhance cloud deployment models.
Moving from Asgard
Their history leading up to the conception and deployment of Spinnaker is helpful reading too. Previously they utilized a tool called ‘Asgard’, and in Moving from Asgard they describe the limitations they reached using that type of tool, and how they instead sought a new tool that could:
- “enable repeatable automated deployments captured as flexible pipelines and configurable pipeline stages
- provide a global view across all the environments that an application passes through in its deployment pipeline
- offer programmatic configuration and execution via a consistent and reliable API
- be easy to configure, maintain, and extend”
These requirements took shape as Spinnaker and the deployment practices they describe, which you can repeat through the GitHub download.
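The idea of a deployment captured as a pipeline of configurable stages, executed through a consistent programmatic interface, can be sketched as below. This is not Spinnaker's API or data model; the `Stage` and `Pipeline` classes and the `bake` and `deploy` functions are hypothetical names used only to illustrate the requirements listed above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    """One configurable step; run takes the pipeline context and returns it updated."""
    name: str
    run: Callable[[Dict], Dict]


@dataclass
class Pipeline:
    """A repeatable deployment expressed as an ordered list of stages."""
    application: str
    stages: List[Stage] = field(default_factory=list)

    def execute(self, context: Dict) -> Dict:
        completed = []
        for stage in self.stages:
            # Pass each stage a copy so a failing stage cannot corrupt prior state.
            context = stage.run(dict(context))
            completed.append(stage.name)
        # The returned record gives one view of everything the run passed through.
        return {
            "application": self.application,
            "completed": completed,
            "context": context,
        }


def bake(context: Dict) -> Dict:
    # Hypothetical stage: derive an image id from the version being released.
    context["image"] = "image-" + context["version"]
    return context


def deploy(context: Dict) -> Dict:
    # Hypothetical stage: record which image was rolled out to which environment.
    context["deployed"] = {"env": context["env"], "image": context["image"]}
    return context


if __name__ == "__main__":
    pipeline = Pipeline("billing", [Stage("bake", bake), Stage("deploy", deploy)])
    print(pipeline.execute({"version": "1.2.3", "env": "test"}))
```

Because the pipeline is plain data plus callables, the same definition can be rerun against another environment by changing only the input context.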
Cloud Native on AWS
They further make extensive use of AWS capabilities. Naturally AWS is one of the primary Cloud providers for implementation of Cloud Native practices.
Amazon is proactively working with the Cloud Native Computing Foundation (CNCF) to integrate components with AWS ECS for the Container Network Interface (CNI). AWS native VPC networking will work with a CNI plugin. This means containers networked through CNI can operate at the same networking efficiency that AWS instances enjoy with each other.
In order to run containers, developers have to set up AMIs, daemons, and IAM roles. AWS Fargate allows developers to run containers without the worry of managing servers and clusters. As you would expect, AWS supports container technologies like Kubernetes and Docker.
CodeStar, CodeCommit, CodePipeline, CodeBuild and CodeDeploy offer a DevOps ‘toolchain’ for speeding up the software development, build and deploy lifecycle. CodeDeploy also supports GitHub, so that you can deploy application revisions stored in GitHub repositories or Amazon S3 buckets to instances.
AWS offers an excellent range of white papers explaining best practice use of the services, such as an Introduction to DevOps, Practicing Continuous Integration and Delivery on AWS, and using Jenkins on AWS. Their dedicated blog offers regular insights, and this video offers guidance from one of their presentations, describing Cloud Native DevOps on AWS.
Kubernetes as a Service
Amazon is also concentrating on Kubernetes integration with AWS through installers, IAM security and EKS. Around 63% of all Kubernetes workloads run on AWS. Amazon is investing resources to ensure Kubernetes users get a better experience.
An important component of the Kubernetes implementation on AWS is keeping it open source. AWS is not using a forked version of the platform. It is working with the community to reach consensus on any new feature or update.
However, Amazon is trying to ensure seamless Kubernetes integration with AWS features. Here are a few key integrations:
- IAM Authentication with Kubernetes: AWS is working with Heptio to create an open source project to integrate Kubernetes access and AWS IAM authentication.
- IAM Roles for Pods: The kube2iam open source project handles another part of Kubernetes management. Instead of sharing IAM credentials, containers inside Kubernetes clusters get their own IAM credentials based on annotations. AWS is also working on integration with both HashiCorp Vault and Secure Production Identity Framework for Everyone (SPIFFE).
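The annotation-based approach in the second item can be illustrated with a small lookup that decides which IAM role a pod should receive. This is a sketch of the idea only, not kube2iam's implementation; the annotation key shown is the one kube2iam is understood to use by default and should be treated as an assumption, and the pod dictionaries and role names are hypothetical.

```python
# Assumed default annotation key; verify against the kube2iam documentation.
ROLE_ANNOTATION = "iam.amazonaws.com/role"


def role_for_pod(pod, default_role=None):
    """Return the IAM role named in a pod's annotations, or a default.

    `pod` is a dict shaped like a Kubernetes pod manifest.
    """
    annotations = pod.get("metadata", {}).get("annotations") or {}
    role = annotations.get(ROLE_ANNOTATION, default_role)
    if role is None:
        # With no annotation and no default, the pod gets no credentials.
        raise LookupError("pod has no IAM role annotation and no default is set")
    return role


if __name__ == "__main__":
    billing_pod = {
        "metadata": {
            "name": "billing-api",
            "annotations": {ROLE_ANNOTATION: "billing-read-only"},
        }
    }
    print(role_for_pod(billing_pod))
```

Scoping the role per pod means a compromised container can only use the permissions named in its own annotation, rather than everything granted to the node.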
Amazon has taken all the learning and features from their work with Kubernetes customers and created Amazon Elastic Container Service for Kubernetes (Amazon EKS). It is a fully managed service that will use the open source version of the system to run Kubernetes clusters. Customers wouldn’t have to worry about installing and operating the Kubernetes master or configuring a cluster of workers.
Amazon EKS is still under development. It is being created on the following tenets:
- Intended for enterprises to run production-grade workloads.
- Provide a native and upstream Kubernetes experience.
- Make the integration seamless and eliminate extra work.
- Actively contribute to the Kubernetes project.
Currently, Amazon is working to get EKS released in 2018. Amazon Fargate integration with EKS will take place later.
The third Cloud Native foundation component is a microservices software architecture, and again there is a wealth of resources to learn from.
In the past, architects tried to design every aspect of a software system. Applications were supposed to work like perfect machines. The microservices approach to software development is more organic. Instead of trying to control every aspect of a complex system, architects try to set up rules to make a functioning organism. DevOps tools play an important role in this ecosystem. These tools help multiple teams work with each other seamlessly. The process results in healthy, flexible and scalable software.
Netflix OSS is a great place for any DevOps team to get an idea of the variety of tools that Netflix uses to run its massive microservice-based applications.
Matias De Santi of Wolox describes how microservices can make use of AWS services like their API Gateway, and RisingStack offers the article Deploying Node.js Microservices to AWS using Docker.
- Build and Delivery Tools: Netflix has a collection of Gradle plugins called Nebula that helps development teams create repeatable builds. It helps save time during development. The Aminator tool packages AMIs for AWS. Spinnaker is Netflix’s continuous delivery platform that makes its complex microservices deployment possible.
- Common Runtime Services and Libraries: Eureka is Netflix’s service discovery tool. The Ribbon library helps with service communications. Hystrix provides latency and fault tolerance, isolating failures at runtime.
- Data Persistence: Netflix’s EVCache and Dynomite are important innovations that help microservices use Memcached and Redis at scale.
- Insight, Reliability, and Performance: Netflix has developed many tools to collect metrics and automatically address problems, but Chaos Monkey and the Simian Army are its most famous reliability tools. These tools help Netflix test instances with random failures.
The above discussion touches only a handful of options. The Netflix OSS page has a list of all the available open source tools that can help DevOps practices.
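The fault-isolation idea associated with Hystrix is commonly described as a circuit breaker: after repeated failures, calls to a struggling dependency are short-circuited to a fallback for a while. The sketch below illustrates that pattern in general terms. It is not Hystrix's API (Hystrix is a Java library), and the class name, thresholds and fallback values are hypothetical.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: open after N failures, retry after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Circuit is open: skip the dependency and answer from the fallback.
                return fallback()
            # Timeout elapsed: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)

    def recommendations():
        raise TimeoutError("recommendation service is slow")

    for _ in range(3):
        print(breaker.call(recommendations, lambda: ["popular titles"]))
```

The fallback keeps the calling service responsive, so one slow dependency does not cascade into failures across the rest of the microservices.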
This 2016 AWS Summit presentation provides a comprehensive overview including the broader context of how it fits within this DevOps framework. Their white paper provides a detailed review and this presentation dives more into the technical details and offers a number of implementation patterns:
Anyone can implement a microservices architecture on AWS with a simple Elastic Load Balancer, a few EC2 instances and a datastore like Amazon RDS or DynamoDB. The EC2 instances can be used for deploying microservices. However, depending on the size of the service, this can be an expensive choice. Here are some other Amazon tools that can help with microservices:
- AWS Elastic Beanstalk – This orchestration service makes microservices deployment easier.
- Amazon Elastic Container Service (ECS) – Containers have become part of the microservices culture. Amazon’s ECS helps make the scheduling of containers more flexible.
- Amazon API Gateway and AWS Lambda – Serverless computing is gaining popularity. By combining Amazon API Gateway and AWS Lambda, it’s possible to create a microservices application that wouldn’t require any form of infrastructure management from the development team.
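As an illustration of the serverless option in the last item, the following is a minimal sketch of a Python Lambda handler shaped for API Gateway's Lambda proxy integration, where the function receives the HTTP request as an `event` dictionary and returns a dictionary with `statusCode`, `headers` and `body`. The `member_id` parameter and the response payload are hypothetical, and the wiring of the API Gateway route itself is not shown.

```python
import json


def lambda_handler(event, context):
    """Handle one HTTP request forwarded by API Gateway (proxy integration)."""
    # queryStringParameters is None when the request carries no query string.
    params = event.get("queryStringParameters") or {}
    member_id = params.get("member_id")  # hypothetical parameter name
    headers = {"Content-Type": "application/json"}
    if not member_id:
        return {
            "statusCode": 400,
            "headers": headers,
            "body": json.dumps({"error": "member_id is required"}),
        }
    # A real service would look the member up in a datastore here.
    return {
        "statusCode": 200,
        "headers": headers,
        "body": json.dumps({"member_id": member_id, "status": "active"}),
    }


if __name__ == "__main__":
    # Local invocation with a hand-built event; no AWS account is needed.
    print(lambda_handler({"queryStringParameters": {"member_id": "42"}}, None))
```

Each such handler can be deployed and scaled independently, which is what allows a microservice built this way to run without the team managing any servers.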
It’s also helpful to highlight some of the many partners in the AWS ecosystem who add value to this scenario.
For example BlazeMeter integrates with CodePipeline, adding load and performance testing into the pipeline process, enabling users to test APIs, mobile and web applications easily and rapidly.