Achieving Zero Downtime: Kubernetes Production Readiness for Highly Available Applications

In today's fast-paced software world, we want our applications to be resilient and able to scale smoothly. At Loconav, we decided to migrate our infrastructure from traditional EC2 servers to Kubernetes.



Introduction

We have many critical services that keep our applications running, and migrating them one by one was tricky. As we started this transformation, our system became a hybrid of two worlds: Kubernetes for container orchestration, and traditional virtual machines. This mix made things interesting, but also more complicated.

Along the way, we hit unexpected problems and periods when our services were offline. In this article, we'll share what we've learned and how you can keep your applications running smoothly on Kubernetes, without interruptions.

Production Stack

Our operational infrastructure heavily relies on AWS managed services, addressing a wide spectrum of our requirements. These services include:

  • Amazon Elastic Kubernetes Service (EKS)
  • Application Load Balancer (ALB)
  • Amazon Simple Storage Service (S3)
  • Elastic Container Registry (ECR)
  • Amazon Elastic Compute Cloud (EC2)
  • Amazon Relational Database Service (RDS)
  • Amazon ElastiCache, and more.

Our legacy applications are currently hosted on EC2 instances, and our data storage strategy spans ElastiCache, RDS, and on-premises MongoDB. On the containerization side, we have bolstered our capabilities by integrating EKS, ECR, TimescaleDB, PostgreSQL with Patroni, and HAProxy into our stack.

We've separated the data layer from the application layer to keep our applications stateless, in line with the principles of the 12 Factor App methodology, and we've devised distinct strategies for each layer. For the data layer, we've opted for PostgreSQL and TimescaleDB, which we manage ourselves because AWS's managed offerings lack TimescaleDB support.

Measures for the Data Layer (PostgreSQL + TimescaleDB)

PostgreSQL is the primary data source for our applications, and we rely on it heavily. We've taken great care to ensure that the following aspects remain consistent and in sync, as even a brief moment of downtime can lead to significant failures within the application.

High Availability Architecture

Purpose: We run Patroni on our PostgreSQL nodes, with HAProxy in front of them, to establish a High Availability (HA) architecture that ensures uninterrupted access and business continuity, even during failures.

Benefits: This implementation not only guarantees high availability, minimizing downtime and enhancing fault tolerance but also facilitates smoother deployments and scalability for our database system.
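
To make this concrete, below is a minimal sketch of a Patroni member configuration (Patroni's own configuration is YAML). The cluster scope, node names, and credentials are illustrative placeholders rather than our production values, and we assume etcd as the consensus store here (Patroni supports others as well):

  # patroni.yml (sketch) — one file per PostgreSQL node
  scope: pg-cluster                # cluster name shared by every member
  name: pg-node-1                  # unique name of this node

  restapi:
    listen: 0.0.0.0:8008           # REST endpoint that HAProxy health checks hit
    connect_address: pg-node-1:8008

  etcd3:
    hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379   # distributed store for leader election

  bootstrap:
    dcs:
      ttl: 30                      # leader lease; an expired lease triggers failover
      loop_wait: 10
      retry_timeout: 10

  postgresql:
    listen: 0.0.0.0:5432
    connect_address: pg-node-1:5432
    data_dir: /var/lib/postgresql/data
    authentication:
      replication:
        username: replicator
        password: change-me        # placeholder; use a secret store in practice

Patroni keeps exactly one member in the leader role via the lease in the consensus store, and HAProxy uses each member's REST endpoint to decide where to route traffic, which is what keeps failovers invisible to the applications.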

Readonly Applications to Postgres Replica

Purpose: Our strategy of directing read-only applications to the Postgres standby replicas serves the purpose of reducing the load on our primary database, optimizing its performance, and keeping it responsive for critical write operations.

Benefits: This approach not only minimizes the load on the primary database but also enhances overall performance, maintaining a responsive environment for critical write operations and ensuring the efficient operation of our applications.
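
One way to wire this up in HAProxy is to expose two ports, one that always routes to the current primary and one that routes to healthy replicas, using Patroni's REST API as the health check. A sketch, with hypothetical node names and ports:

  # haproxy.cfg (sketch)
  listen postgres_primary
      bind *:5000
      mode tcp
      option httpchk GET /primary            # Patroni returns 200 only on the leader
      http-check expect status 200
      default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
      server pg-node-1 pg-node-1:5432 check port 8008
      server pg-node-2 pg-node-2:5432 check port 8008

  listen postgres_replicas
      bind *:5001
      mode tcp
      option httpchk GET /replica            # 200 only on a healthy standby
      http-check expect status 200
      default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
      server pg-node-1 pg-node-1:5432 check port 8008
      server pg-node-2 pg-node-2:5432 check port 8008

Read-write applications connect to port 5000 and read-only applications to port 5001; neither needs to know which node is currently the primary.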

Connection Pooling

Purpose: Our implementation of a connection pooling mechanism using PgBouncer in front of HAProxy serves the purpose of efficiently managing connections from our applications, addressing the resource-intensive nature of database connections in PostgreSQL.

Benefits: This approach optimizes resource utilization by efficiently managing database connections, reducing the strain on PostgreSQL, and ensuring that resources are used effectively for improved application performance.
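
For illustration, a minimal pgbouncer.ini along these lines might look as follows; the database name, HAProxy host, and pool sizes are assumptions, not our production values:

  ; pgbouncer.ini (sketch)
  [databases]
  ; point the pooler at HAProxy so failovers stay transparent to it
  app_db = host=haproxy.internal port=5000 dbname=app_db

  [pgbouncer]
  listen_addr = 0.0.0.0
  listen_port = 6432
  auth_type = md5
  auth_file = /etc/pgbouncer/userlist.txt
  ; hand server connections back after every transaction
  pool_mode = transaction
  ; many lightweight client connections multiplexed onto a few real backends
  max_client_conn = 1000
  default_pool_size = 20

With transaction pooling, thousands of application connections can share a small, fixed number of PostgreSQL backends, which matters because every Postgres connection is a separate operating-system process.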

Backups from Replicas

Purpose: Our strategy of performing backups from standby replicas using tools like pgBackRest is designed to prevent imposing a substantial load on our primary PostgreSQL database during backup processes, preserving its performance for critical operations.

Benefits: This approach not only enhances backup efficiency but also minimizes the impact on the primary database's performance. As a result, it ensures the continued availability of the primary database for essential operations while maintaining an effective backup process.
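
As a sketch, a pgBackRest configuration that takes its file copies from a standby could look roughly like this; the stanza name, hosts, and S3 bucket are placeholders:

  # /etc/pgbackrest.conf (sketch)
  [global]
  # ship backups to an S3 repository
  repo1-type=s3
  repo1-s3-bucket=example-pg-backups
  repo1-s3-endpoint=s3.us-east-1.amazonaws.com
  repo1-s3-region=us-east-1
  repo1-path=/pgbackrest
  # keep two full backups
  repo1-retention-full=2
  # take the file copies from a standby rather than the primary
  backup-standby=y

  [main]
  # primary, kept for WAL coordination
  pg1-host=pg-primary.internal
  pg1-path=/var/lib/postgresql/data
  # standby that actually serves the backup reads
  pg2-host=pg-replica.internal
  pg2-path=/var/lib/postgresql/data

A backup is then as simple as running pgbackrest --stanza=main backup, and the primary only has to coordinate WAL archiving rather than serve the bulk file reads.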

Measures for Our App Layer (K8s Cluster)

We've implemented a range of measures throughout our EKS cluster to ensure seamless deployments and eliminate downtime.

PodDisruptionBudgets (PDBs)

Purpose: We employed PodDisruptionBudgets with the specific intent of guaranteeing that, during updates or maintenance procedures, a minimum number of our service's replicas would remain accessible and unaffected.

Benefits: PodDisruptionBudgets serve the valuable purpose of preserving the desired service availability level. They achieve this by acting as safeguards against disruptions caused by factors such as aggressive scaling, pod terminations, or actions initiated by the cluster autoscaler.
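
A minimal PodDisruptionBudget looks like the following; the service name and threshold are illustrative:

  # pdb.yaml (sketch)
  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: api-server-pdb
  spec:
    minAvailable: 2            # voluntary evictions may never leave fewer than 2 pods
    selector:
      matchLabels:
        app: api-server

With this in place, node drains and autoscaler scale-downs are blocked whenever honouring them would take availability below the budget.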

Readiness Probes

Purpose: Configuring readiness probes lets Kubernetes evaluate the health of our pods before permitting them to receive traffic, ensuring that requests are never routed to pods that aren't ready to serve them.

Benefits: These probes ensure that only fully operational pods are included in the load balancer's rotation, reducing the risk of routing traffic to unhealthy instances and enhancing the reliability of our services.
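
As a sketch, a container spec with a readiness probe might look like this; the /healthz path and the timings are assumptions for illustration:

  # deployment.yaml (excerpt, sketch)
  containers:
    - name: api-server
      image: example/api-server:latest
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /healthz         # hypothetical health endpoint in the app
          port: 8080
        initialDelaySeconds: 10  # give the app time to boot before the first check
        periodSeconds: 5         # re-check every 5 seconds
        failureThreshold: 3      # mark the pod unready after 3 consecutive failures

A pod that fails the probe is simply removed from the Service endpoints; it is not restarted, which is the job of a liveness probe.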

Pod Anti-Affinity

Purpose: To further enhance resilience and availability, we implemented Pod Anti-Affinity rules for our Kubernetes pods.

Benefits: This measure ensures that no two replicas of the same service are hosted on the same node within the cluster. By spreading replicas across different nodes, we reduce the risk of a single node failure impacting multiple instances of our service.
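
The rule itself is only a few lines in the pod template; the app label below is illustrative:

  # deployment.yaml (excerpt, sketch)
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: api-server
          topologyKey: kubernetes.io/hostname   # never co-locate two replicas on one node

Using the required form makes the constraint hard; swapping it for preferredDuringSchedulingIgnoredDuringExecution turns it into a best-effort spread when the cluster is short on nodes.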

Multiple Replicas

Purpose: The deployment of multiple replicas was aimed at load distribution and redundancy, ensuring that incoming requests are evenly distributed and the service remains available.

Benefits: Multiple replicas enhance our application's resilience, and with Pod Anti-affinity in place, they mitigate the impact of node failures, ensuring uninterrupted service.
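
Combined with a rolling-update strategy, the replica count is what actually delivers zero-downtime deploys. A sketch, with illustrative numbers:

  # deployment.yaml (excerpt, sketch)
  spec:
    replicas: 3                  # redundancy, spread across nodes by the anti-affinity rule
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 0        # never dip below the desired replica count mid-deploy
        maxSurge: 1              # start the new pod before an old one is removed

With maxUnavailable set to 0, Kubernetes only removes an old pod once its replacement passes the readiness probe described above.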

Summary

We've delved into our journey of migrating to a contemporary infrastructure, a journey marked by the integration of various technologies and by running Kubernetes alongside traditional virtual machines. This transformation brought its own set of challenges, including added complexity and sporadic downtime. As we continue on this journey, we're committed to sharing our evolving experiences and insights on achieving seamless operations with Kubernetes. Stay tuned: we'll update this post whenever we uncover further insights in our ongoing exploration.