
Unit 3.

Case Study: Scaling Food Delivery During Festivals – Infra Under Surge

IT 204: E-Commerce

Learning Objectives

By the end of this case study, you will be able to analyze and apply architectural patterns for high-traffic e-commerce events.

  • ✅ Analyze the impact of bursty, seasonal traffic on a typical e-commerce platform.
  • ✅ Identify key architectural solutions like caching, queues, and database scaling.
  • ✅ Evaluate the success of infrastructure changes using performance metrics (latency, error rates).
  • ✅ Apply these scaling concepts to the local Nepali e-commerce context.

The Challenge: The Festival Surge

Platforms like Foodmandu or Pathao Food face massive, predictable traffic spikes during festivals like Dashain and Tihar.

The Domino Effect of Unpreparedness:

Massive Traffic Spike 📈

➜ App/API Slowdowns 🐌

➜ Database Overload & Timeouts 💥

➜ Failed Orders & Payments ❌

➜ Unhappy Customers & Overwhelmed Support 😠

A Multi-Layered Solution

The solution wasn't a single fix, but a combination of strategies targeting different parts of the system.

1. Aggressive Caching

Serve content faster by reducing direct hits to the main server and database.

2. Database Optimization

Strengthen the core data layer to handle both high read and write volumes efficiently.

3. Asynchronous Workflows

Decouple system components to handle order processing resiliently, even under load.

Layer 1: Caching Deep Dive ⚡

Where to Cache?

  • CDN + Edge Caching: For static assets like images, logos, and restaurant menus that don't change often.
  • Server-Side Caching: In-memory stores (like Redis) for "hot" data like popular categories or promotional items.

Why it Works

  • Dramatically reduces database load.
  • Makes browsing the app feel instant.
  • Absorbs a huge percentage of user traffic before it hits the core infrastructure.
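The server-side layer typically follows the cache-aside pattern: check the cache first, and only on a miss fall back to the database and store the result. A minimal sketch, using an in-memory dict as a stand-in for Redis (the function names here are illustrative, not from any platform's codebase):

```python
import time

CACHE = {}
TTL_SECONDS = 300  # "hot" data expires after 5 minutes

def fetch_menu_from_db(restaurant_id):
    # Stand-in for an expensive database query.
    return {"restaurant_id": restaurant_id, "items": ["momo", "chowmein"]}

def get_menu(restaurant_id):
    entry = CACHE.get(restaurant_id)
    if entry and time.time() - entry["stored_at"] < TTL_SECONDS:
        return entry["value"]  # cache hit: no database round-trip
    # Cache miss: query the database, then populate the cache for later reads.
    value = fetch_menu_from_db(restaurant_id)
    CACHE[restaurant_id] = {"value": value, "stored_at": time.time()}
    return value
```

During a surge, most menu views become cache hits, so only a small fraction of browsing traffic ever reaches the database.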

Layer 2: A Smarter Database 📊

Directly addressing the database bottleneck is critical for transactional stability.

  • Read Replicas: Created copies of the database dedicated to handling read-only requests (e.g., viewing menus, browsing restaurants). This isolates browsing traffic from order-placement traffic.
  • Slow Query Optimization: Identified and fixed inefficient database queries that were consuming excessive resources, often by adding indexes.
  • Connection Pooling: Reused a "pool" of active database connections instead of creating a new one for every request, reducing overhead.

Key Principle (Unit 3.3): Separate read vs. write paths to prevent contention and scale them independently.
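One way to apply this principle is a small router that sends writes to the primary database and spreads reads across replicas. This is a sketch under assumed names (`DatabaseRouter`, `for_query`), not a real driver API:

```python
import itertools

class DatabaseRouter:
    """Routes writes to the primary and reads round-robin across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas)

    def for_query(self, sql):
        # Writes must hit the primary; reads can go to any replica.
        is_write = sql.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
        return self.primary if is_write else next(self._replica_cycle)

router = DatabaseRouter("primary-db", ["replica-1", "replica-2"])
```

With this split, a flood of menu-browsing SELECTs lands on the replicas and cannot starve the order-placement writes on the primary.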

Layer 3: Resilient Order Processing

Key Concept: Message Queues. An intermediary service that holds "messages" (like a new order) in a queue. Services can pull messages to process them at their own pace, preventing the system from being overwhelmed.

The order lifecycle was decoupled:

Order Placed ➜ [Queue]

➜ Order Accepted ➜ [Queue]

➜ Rider Assigned ➜ [Queue]

➜ Delivered

This ensures that even if the "Rider Assignment" service is slow, it doesn't stop new orders from being placed.
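The decoupling can be sketched with a standard-library queue standing in for a real broker such as RabbitMQ or Kafka (the function names are illustrative):

```python
from queue import Queue

order_queue = Queue()

def place_order(order):
    # Fast path: enqueue and return immediately, even if workers are busy.
    order_queue.put(order)
    return "accepted"

def process_next_order():
    # Workers pull at their own pace; a slow consumer never blocks placement.
    order = order_queue.get()
    order["status"] = "assigned"
    return order

place_order({"id": 1})
place_order({"id": 2})  # placement succeeds even before order 1 is processed
```

The key property is that `place_order` returns in microseconds regardless of how far behind the downstream workers are.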

Optimizing the 'Last Mile': Smart Dispatch

Improving operational efficiency for riders is just as important as server performance.

Geofencing

Virtual geographic boundaries were used to group riders into specific zones. This ensures riders are only offered deliveries within an efficient travel radius, reducing pickup times.
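A basic geofence check reduces to "is this rider within a radius of the zone centre?", which the haversine formula answers from two latitude/longitude pairs. A minimal sketch with illustrative coordinates (rough Kathmandu-area points, not real zone data):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def in_zone(rider_pos, zone_centre, radius_km=3.0):
    # A rider is only offered orders whose pickup falls inside their zone.
    return haversine_km(*rider_pos, *zone_centre) <= radius_km
```

Real dispatch systems use polygon geofences and spatial indexes rather than plain circles, but the radius check captures the core idea.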

Batched Assignments

The system intelligently grouped multiple orders from the same area for a single rider. This increases rider earnings and delivery speed during peak hours.
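A simple form of batching groups pending orders by pickup zone and caps each batch at what one rider can carry. A sketch with assumed field names (`zone`, `max_batch`):

```python
from collections import defaultdict

def batch_by_zone(orders, max_batch=3):
    """Group orders by zone, then split each zone into rider-sized batches."""
    by_zone = defaultdict(list)
    for order in orders:
        by_zone[order["zone"]].append(order)

    batches = []
    for zone_orders in by_zone.values():
        # Each slice is one rider's trip: several nearby drop-offs together.
        for i in range(0, len(zone_orders), max_batch):
            batches.append(zone_orders[i:i + max_batch])
    return batches
```

Production dispatchers also weigh food-readiness times and route overlap, but even this zone-level grouping cuts per-order travel sharply at peak.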

The Impact: Measurable Outcomes 🎯

BEFORE

  • High P95 Latency
  • High Order Failure Rate
  • Erratic Rider Utilization
  • Spike in Support Tickets

AFTER

  • P95 Latency down 45%
  • Failed Orders reduced by 70%
  • More Stable Rider Utilization
  • Fewer Customer Support Tickets

Practical Application: The Nepali Context

Scenario: You are the CTO of a local e-commerce platform (e.g., Daraz, Sastodeal) preparing for the Dashain shopping festival.

Question: Based on this case study, what are your top 3 priorities?

  • 🔍 Pre-warm Caches: Proactively load caches with "Dashain Deals" and top-selling product data before the sale begins.
  • 📈 Pre-scale Infrastructure: Schedule autoscaling of read replicas and message queue workers to activate just before peak shopping hours (e.g., evening times).
  • 📍 Logistics Planning: Use historical data to predict high-volume delivery zones in Kathmandu, Pokhara, etc., and pre-position delivery staff accordingly.
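Cache pre-warming is the most mechanical of these priorities: a scheduled job loads the hottest items into the cache before shoppers arrive. A minimal sketch (the product IDs and function names are hypothetical):

```python
CACHE = {}

def load_product_from_db(product_id):
    # Stand-in for the real product query.
    return {"id": product_id, "title": f"Dashain Deal #{product_id}"}

def prewarm(product_ids):
    """Populate the cache ahead of the sale so first visitors hit warm entries."""
    for pid in product_ids:
        CACHE[pid] = load_product_from_db(pid)

# Run from a scheduled job shortly before peak hours begin.
prewarm(["p1", "p2", "p3"])
```

The list of IDs to pre-warm would come from historical sales data, tying this priority back to the monitoring theme of the case study.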

Summary: Key Takeaways

  • Decouple with Queues: For any bursty, high-volume workload, message queues are fundamental to building a resilient, scalable system.
  • Cache Intelligently: A multi-layered caching strategy (CDN for static, Server-side for dynamic) is crucial to absorb read-heavy traffic.
  • Isolate Read/Write Paths: Separating database read and write operations is one of the most effective ways to prevent system-wide slowdowns during peak traffic.
  • Monitor and Prepare: Scalability isn't just about technology; it's about proactively monitoring capacity and planning for seasonal peaks.

Thank You!

This case study covers concepts from Chapters 3.2-3.5.


Next Topic: Unit 4 - E-Commerce Security and Payment Systems
