Learning to Build a Scalable LLM Chat Application: Microservices Architecture and Docker Containerization

Shakti Wadekar
16 min read · Feb 14, 2025


πŸ“œ Table of Contents

πŸš€ 1. Introduction 🌟
πŸ—οΈ 2. Microservices Architecture πŸ”§
πŸ’» 3. Building the Code: Frontend & Backend Services 🎨πŸ–₯️
🐳 4. Containerization with Docker πŸ”₯
πŸ”Ή 4.1 Docker Engine βš™οΈπŸš›
πŸ”Ή 4.2 Dockerfile πŸ“„πŸ› οΈ
πŸ”Ή 4.3 Docker Image πŸ“¦πŸŽ­
πŸ”Ή 4.4 Docker Compose πŸ”—πŸ”„

πŸ§ͺ 5. Running & Testing the LLM Chat Application πŸŽ―βœ…

πŸš€ 1. Introduction 🌟

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

In this article, we focus on the essential concepts for building a scalable LLM chat application. It covers two of them: microservices architecture and containerization (including Docker, Docker Engine, Dockerfile, Docker image, and Docker Compose). Other key concepts will be explored in future articles.

Link to the code on GitHub: Scalable Simple LLM Chatbot.

For completeness, let's summarize the concepts involved in building a scalable LLM chat application:

  • A microservices-based backend architecture separates the model, API, and database for better scalability.
  • Containerization (Docker) and orchestration (Kubernetes) streamline deployment and service management.
  • Efficient inference techniques such as LLM quantization, batching, and memory management optimize model performance.
  • Load balancing (Nginx, Traefik), caching (Redis, Memcached), and horizontal scaling (Kubernetes, AWS ECS, GKE) help handle increasing user demand.
  • Cloud providers like AWS, GCP, and Azure offer flexible hosting, computing power, and storage solutions.
  • Security measures such as TLS/SSL encryption, API gateways (AWS API Gateway, Cloudflare), and DDoS protection (AWS Shield, Cloudflare, Akamai) ensure data safety.
  • Logging and monitoring tools like the ELK Stack, Prometheus, Grafana, and OpenTelemetry help track system performance and troubleshoot issues.

At first glance, building a large language model (LLM) chat application seems straightforward. You pick a model, wrap it with an API, and deploy it β€” right?

Not quite.

A production-grade LLM chat application isn’t just a single piece of software. It consists of multiple moving parts:

  • A frontend (such as Gradio) to provide an interface for users.
  • A backend (such as FastAPI) to handle requests and manage communication.
  • A model server (such as Ollama running DeepSeek-R1) to process user queries.
  • A database (if needed) to store analytics or chat history.
  • Scaling infrastructure to handle multiple users at once.

If all of this is built as a single monolithic application β€” where everything runs in one big, tangled codebase β€” it quickly becomes difficult to scale, update, and debug.

When building a scalable LLM chat application that can handle real-world requests, we must adopt a more modular and resilient architecture. This is where microservices architecture and containerization (Docker) become essential. These technologies enable us to break down the system into independent, manageable components, making it easier to scale, maintain, and deploy.

We first need to understand how to design, containerize, and orchestrate our system effectively. Microservices architecture and containerization (Docker) allow us to build a modular, scalable LLM chat application that we can efficiently develop and test in a local environment. Once we have a stable local setup, the chat application can be moved to cloud platforms like AWS ECS, GCP Cloud Run, or Azure AKS to scale seamlessly.

In this article, we will focus on microservices architecture and containerization (Docker, Docker Engine, Dockerfile, Docker image, Docker Compose) β€” laying the groundwork for efficient local development. We will build a simple yet scalable LLM chat application to demonstrate these two concepts. Link to the code on GitHub: Scalable Simple LLM Chatbot. In future articles, I will explore cloud deployment, covering AWS ECS, ECR, and load balancing.

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

πŸ—οΈ 2. Microservices Architecture πŸ”§

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

What is Microservices Architecture?

Microservices architecture is a software design approach where an application is built as a collection of small, independent services, each responsible for a specific functionality. Instead of having one large, tightly integrated system (monolithic architecture), microservices decompose an application into loosely coupled components that communicate with each other.

Each microservice:

  • Performs a single function (e.g., user authentication, model inference, database operations).
  • Runs independently (can be deployed, updated, and scaled separately).
  • Communicates with other services using lightweight protocols like HTTP APIs, gRPC, or message queues.

Why is Microservices Architecture Used?

Scalability: Individual services can be scaled separately based on demand. Example: If the LLM model service is the bottleneck, you can scale only that part instead of the entire system.

Fault Isolation: A failure in one service does not bring down the entire application. Example: If the database service crashes, the model API can still function.

Faster Development & Deployment: Teams can work on different services independently. Services can be updated and deployed without redeploying the whole system.

Technology Flexibility: Different services can use different programming languages or frameworks. Example: A Python-based LLM model service can coexist with a Node.js frontend.

Easier Maintenance & Debugging: Smaller codebases per service make development more manageable. Debugging is easier since issues can be traced to specific services rather than a massive monolithic codebase.

Microservices in LLM Chat Applications

For a scalable LLM chat system, microservices help by separating:

  • Frontend UI (Gradio, React, etc.)
  • Backend API (FastAPI, Flask, etc.)
  • LLM Model Service (Ollama, DeepSeek-R1, etc.)
  • Database (if storing analytics/chat history)
  • Authentication & User Management (if required)

This modular design ensures better scalability, resilience, and maintainability, making it ideal for deploying production-ready LLM applications. πŸš€

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

πŸ’» 3. Building the Code: Frontend & Backend Services 🎨πŸ–₯️

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

βœ… Step 1: Setting up the Frontend (Gradio UI)
βœ… Step 2: Building the Backend (FastAPI API Service)

Let’s dive in! πŸŠβ€β™‚οΈ

✨ Step 1: Creating the Chat UI with Gradio

The frontend serves as the user interface where conversations happen. For this, we use Gradio, a simple yet powerful framework that allows us to create interactive UIs effortlessly.

πŸ”Ή Why Gradio?

βœ… Easy to implement β€” Write minimal code to get a working UI.
βœ… Fast & lightweight β€” Ideal for quick prototyping and production.
βœ… Runs in the browser β€” No complex frontend development required.

πŸ“Œ Code: Gradio UI for Chat

import os

import gradio as gr
import requests

# Read the backend URL from the environment (set in docker-compose.yml),
# falling back to the Docker service name for container-to-container calls.
BACKEND_URL = os.environ.get("BACKEND_URL", "http://backend:8000/chat")

def chat_with_llm(message, history):
    try:
        response = requests.post(BACKEND_URL, json={"message": message})
        bot_message = response.json()["response"]
        return bot_message
    except Exception as e:
        print(f"Error: {str(e)}")
        return f"Error: {str(e)}"

# Create Gradio interface with chat
demo = gr.ChatInterface(
    fn=chat_with_llm,
    chatbot=gr.Chatbot(height=450),
    textbox=gr.Textbox(placeholder="Ask me anything...", container=False, scale=7),
    title="AI Chat Assistant",
    description="Chat with an AI using ChatGPT API",
    theme="soft",
    examples=["Tell me a joke", "What is the meaning of life?", "Write a short poem"],
    cache_examples=False,
    retry_btn="Retry β†Ί",
    undo_btn="Undo β†Ά",
    clear_btn="Clear πŸ—‘οΈ"
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
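
The frontend image will install its dependencies from frontend/requirements.txt, which the Dockerfile copies in later. A minimal sketch, assuming only Gradio and requests are needed (the repo may pin specific versions):

# frontend/requirements.txt
gradio
requests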

πŸ‘€ What does the interface look like?

πŸ’‘ How It Works

  1. User enters a message in the Gradio chatbox.
  2. Request is sent to the backend API (http://backend:8000/chat).
  3. Response is received and displayed in the UI.

🎨 End Result: A simple but effective chatbot UI!

πŸš€ Step 2: Setting Up the Backend with FastAPI

The backend handles API requests, processes inputs, and forwards messages to the LLM model service. For simplicity, instead of building a separate LLM model service, we use OpenAI’s API to generate responses and integrate it into the backend service itself. Building LLM model services will be covered in future articles.

πŸ”Ή Why FastAPI?

βœ… Blazing fast β€” Uses async capabilities for efficient processing.
βœ… Lightweight & easy to use β€” Minimal boilerplate code.
βœ… Great for scaling β€” Works well with microservices.

πŸ“Œ Code: FastAPI Backend

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI
import os

app = FastAPI()

# Get the API key from an environment variable (set via .env / docker-compose)
client = OpenAI(
    # api_key=OPENAI_API_KEY,  # this is also the default; it can be omitted for local testing
    api_key=os.environ["OPENAI_API_KEY"],
)

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
async def chat(request: ChatRequest):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": request.message}
            ]
        )
        return {"response": response.choices[0].message.content}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
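
The backend image installs its dependencies from backend/requirements.txt. A minimal, unpinned sketch of what that file needs to contain (the repo may pin specific versions):

# backend/requirements.txt
fastapi
uvicorn
openai
pydantic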

πŸ’‘ How It Works

  1. Receives user message from the frontend.
  2. Sends the request to the LLM model service: in this case, we simply call OpenAI’s API.
  3. Receives generated response and returns it to the frontend.
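
To see the request/response contract concretely, here is a quick check you could run against the backend once it is up (we start everything with Docker Compose in section 5); the reply text will of course vary:

curl -X POST http://localhost:8000/chat \
     -H "Content-Type: application/json" \
     -d '{"message": "Tell me a joke"}'

# Returns JSON of the form: {"response": "<the model's reply>"}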

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

🐳 4. Containerization with Docker πŸ”₯

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

Now that we understand microservices architecture and have built the frontend and backend services, the next challenge is how to package and run these services efficiently. This is where Docker comes in.

What is Docker?

Docker is a containerization platform that allows developers to package applications and their dependencies into a lightweight, portable container that runs consistently across different environments.

What is Containerization?

Containerization is a method of packaging an application and its dependencies into a lightweight, portable unit called a container.

Think of a container as a self-contained environment that includes everything the application needs to run:

  • The application code
  • Runtime (Python, Node.js, etc.)
  • System libraries and dependencies
  • Environment variables and configurations

Why Docker?

  1. Eliminates Dependency Issues β†’ No more β€œit works on my machine” problems.
  2. Environment Consistency β†’ The same container runs on a local machine, server, or cloud.
  3. Fast and Lightweight β†’ Unlike virtual machines (VMs), Docker containers share the host OS kernel, making them more efficient.
  4. Portability β†’ Easily move containers across cloud providers or local systems.

How Does Docker Fit into Microservices?

Each microservice can be packaged as a separate Docker container, making it easier to manage, deploy, and scale.

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

πŸ”Ή 4.1 Docker Engine βš™οΈπŸš› : The Heart of Docker

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

What is Docker Engine?

Docker Engine is the core service that runs Docker containers. It is responsible for:

  • Building images from a Dockerfile.
  • Running and managing containers.
  • Networking and storage for containers.

Docker Engine runs as a background service on your machine and processes all Docker commands.

Why is Docker Engine Needed?

  • Without Docker Engine, Docker commands won’t work.
  • It allows you to interact with the Docker API, which controls container execution.

Installing Docker Engine or Docker Desktop

Before we can use Docker or Docker Compose, we need to install Docker Engine. Installation Link.
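
After installation, it is worth confirming the engine is actually running before proceeding. A few standard checks (output varies by version and platform):

docker --version        # prints the installed Docker client/engine version
docker info             # shows whether the daemon is running, plus storage and network details
docker run hello-world  # pulls and runs a tiny test container end to end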

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

πŸ”Ή 4.2 Dockerfile πŸ“„πŸ› οΈ

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

Now that we understand Docker and containerization, the next step is to create a Dockerfile β€” a key component in defining how an application runs inside a container.

What Is a Dockerfile?

A Dockerfile is a script containing a set of instructions to automate the creation of a Docker image. It defines everything needed to set up a containerized environment for your application, including:

  • The base image (OS and runtime environment)
  • Application source code
  • Required dependencies
  • Configuration settings
  • The commands to run when the container starts

Why Use a Dockerfile?

Automation & Reproducibility: Instead of manually setting up the environment, a Dockerfile ensures every instance is built exactly the same way. If a new developer joins the project, they can spin up the exact same environment without hassle.

Portability: A Dockerfile ensures your application runs identically on any machine, whether on local development, staging, or production servers.

Scalability & Deployment: Cloud platforms (AWS ECS, GCP Cloud Run, Azure AKS) directly support Dockerfiles for deployment. You can build once and run anywhere without worrying about system compatibility issues.

Version Control & Rollbacks: The Dockerfile allows for versioned application environments, making it easy to track changes and roll back if needed.

Dockerfile: Backend

πŸ› οΈ Backend: Dockerfile

Create a file backend/Dockerfile:

FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Here’s a breakdown of each line in the Dockerfile:

1. FROM python:3.9-slim

  • This line tells Docker to use the official Python 3.9 image, specifically the β€œslim” version, as the base for your container.
  • This line pulls the python:3.9-slim image, which is based on Debian (a popular Linux distribution). The slim version is a minimal version of the Debian-based Python image.
  • The slim version is a smaller image with fewer unnecessary packages, making the image lighter and faster to build compared to the full version. It’s ideal when you want to keep the container size minimal while still having Python available for your application.
  • While the Dockerfile doesn’t explicitly mention the OS, the python:3.9-slim base image inherently contains the operating system (Debian) with the necessary Python environment.

2. WORKDIR /app

  • This sets the working directory inside the container to /app.
  • Any subsequent commands (like COPY, RUN, or CMD) will be executed relative to this directory. It's a good practice to keep your application inside a dedicated folder within the container to keep things organized.

3. COPY requirements.txt .

  • This copies the requirements.txt file from your local machine into the container's working directory (/app).
  • requirements.txt contains all the Python dependencies for our backend. By copying it into the container, you can install the dependencies in the next step.

4. RUN pip install -r requirements.txt

  • This installs all the Python packages listed in requirements.txt using pip. This is necessary to set up your Python environment inside the container with all the required libraries, so our application will run correctly.

5. COPY app.py .

  • This copies your app.py file (which contains our FastAPI based app) into the container’s working directory.
  • This is the main application file that Docker will run when the container starts. It needs to be inside the container for it to be accessible when running the application.

6. CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

  • This sets the default command to run when the container starts. It uses uvicorn, a popular ASGI server for Python web apps, to run the app.py file. Specifically, it looks for the app object inside the app.py file and runs it on host 0.0.0.0 (making it accessible externally) and port 8000.
  • The CMD instruction is the default command to run when the container starts. It's essential to expose the correct host and port for the app to be accessible from outside the container (e.g., when deployed on a cloud server).

Dockerfile: Frontend

πŸ› οΈ Frontend: Dockerfile

Create a file frontend/Dockerfile:

FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

πŸ”Ή 4.3 Docker Image πŸ“¦πŸŽ­: The Blueprint of a Container

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

What is a Docker Image?

A Docker image is a blueprint for running a Docker container. It contains everything needed to run an application, including:

  • Source code
  • Dependencies & libraries
  • Configuration files
  • Operating system (lightweight version)

Think of a Docker image as a recipe, and a Docker container as the running instance of that recipe.

Why Docker Images?

  • Reusable & Versioned β†’ You can create an image once and deploy it anywhere.
  • Consistent Execution β†’ Ensures the same environment across development, testing, and production.

Creating Docker images for our frontend and backend services

Since we are using Docker Compose, we don’t always need to manually build Docker images for each service. Docker Compose automates this process depending on how we define our services in docker-compose.yml.
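
For reference, building and running a service image by hand would look roughly like this; the tag llm-chat-backend is just an illustrative name, and this is exactly the work that docker-compose up --build automates for both services:

# Build an image from backend/Dockerfile (run from the project root)
docker build -t llm-chat-backend ./backend

# Run the image, publishing port 8000 and passing the API key
docker run -p 8000:8000 -e OPENAI_API_KEY=your_key_here llm-chat-backend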

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

πŸ”Ή 4.4 Docker Compose πŸ”—πŸ”„: Managing Multi-Container Applications

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

When building a scalable LLM chat application, we don’t just have one container β€” we have multiple microservice containers running together. Managing them manually would be tedious, so we need a tool to orchestrate them. Docker Compose handles this for us.

What is Docker Compose?

Docker Compose is a tool that lets you define and run multi-container Docker applications using a simple YAML file (docker-compose.yml).

Why use Docker Compose?

  1. Easier Multi-Service Management β†’ Start multiple containers with a single command.
  2. Automatic Networking β†’ Services can talk to each other without manual setup.
  3. Simplified Local Development β†’ Quickly test the full system before deploying.

πŸ“Œ Creating docker-compose.yml file for our chat application:

version: '3'

services:
  frontend:
    build: ./frontend
    ports:
      - "7860:7860"
    depends_on:
      - backend
    environment:
      - BACKEND_URL=http://backend:8000/chat
    networks:
      - app-network

  backend:
    build: ./backend
    ports:
      - "8000:8000"
    env_file:
      - .env
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    networks:
      - app-network

networks:
  app-network:
    driver: bridge

Explaining each term:

version: '3'

  • Concept: Specifies the version of the Docker Compose syntax you’re using; in this case, version 3.
  • Why: Version 3 is widely used and supports various features for managing services, networks, and volumes.

services:

  • Concept: This section is where you define all the containers (or services) that your application will use.

services:
  frontend:
    build: ./frontend
    ports:
      - "7860:7860"
    depends_on:
      - backend
    environment:
      - BACKEND_URL=http://backend:8000/chat
    networks:
      - app-network

frontend: This is the name of the service, and here it refers to the frontend container.

build: ./frontend

  • This defines the build context for the frontend service. It tells Docker to use the Dockerfile in the ./frontend directory to build the image. This directory contains app.py, Dockerfile, and requirements.txt for the frontend service. Please take a look at the GitHub repo for more clarity.

ports:
  - "7860:7860"

  • This maps ports between the container and the host. "7860:7860" means that port 7860 on the host is mapped to port 7860 in the container.

Understanding Port Mapping in Docker

When you run a service inside a Docker container, it operates in an isolated environment, meaning it doesn’t directly expose its internal ports to the outside world (your laptop, PC, or server). Port mapping is how Docker allows external access to services running inside containers.

1. What is Port Mapping?

Port mapping links a port on your local machine/server (host) to a port inside the Docker container. This allows you to access the service running inside the container as if it were running directly on your machine.

2. How Does Port Mapping Work?

ports:
  - "7860:7860"

This means:

  • The first β€œ7860” refers to port 7860 on your laptop/PC/Server (the host).
  • The second β€œ7860” refers to port 7860 inside the Docker container.

So, when you run the frontend service, Docker will route all requests made to localhost:7860 on your laptop/PC to port 7860 inside the container where the frontend is running.
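
The two numbers do not have to match. For example, if port 7860 were already in use on your machine, you could map a different host port to the container’s port 7860 (an illustrative tweak, not what the repo uses):

ports:
  - "8080:7860"   # host port 8080 β†’ container port 7860

You would then open http://localhost:8080 in the browser, while Gradio still listens on 7860 inside the container.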

depends_on:
  - backend

  • This ensures that the frontend service waits for the backend service to start before it runs.
  • Dependencies are important in multi-container setups, especially when one service relies on another (e.g., frontend depends on backend APIs).

environment:
  - BACKEND_URL=http://backend:8000/chat

  • This section passes environment variables into the container. Here, the BACKEND_URL variable is defined to point to the backend service's URL (http://backend:8000/chat).

networks:
  - app-network

  • Specifies the networks this service will connect to. In this case, frontend is connected to app-network.
  • Networking is important for communication between services. Defining a custom network (e.g., app-network) ensures that containers can easily talk to each other.

backend:
  build: ./backend
  ports:
    - "8000:8000"
  env_file:
    - .env
  environment:
    - OPENAI_API_KEY=${OPENAI_API_KEY}
  networks:
    - app-network

Similar explanations apply to the backend service lines. The OPENAI_API_KEY is defined in your .env file. Take a look at the GitHub repo for more clarity.
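
The .env file lives next to docker-compose.yml and should not be committed to version control, since it holds your secret key. A minimal sketch with a placeholder value:

# .env
OPENAI_API_KEY=your_openai_api_key_here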

networks:
  app-network:
    driver: bridge

This section defines custom networks for the services.

  • app-network: This creates a custom network named app-network.
  • driver: bridge: The bridge driver is the default network driver in Docker. It allows containers to communicate with each other and with the host machine.
  • The bridge network is used here to allow frontend and backend services to communicate with each other. You can also define custom networking rules if needed.
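
Once the stack is running (we start it in the next section), you can see this network with standard Docker commands; note that Compose typically prefixes the network name with the project (directory) name:

docker network ls                                  # look for a network ending in "app-network"
docker network inspect <project-name>_app-network  # shows which containers are attached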

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

πŸ§ͺ 5. Running & Testing the LLM Chat Application 🎯

β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€” β€”

How Do These Pieces Fit Together?

  1. Microservices β†’ To keep components modular and scalable.
  2. Docker & Docker Images β†’ To package/containerize each microservice consistently.
  3. Docker Compose β†’ To run multiple containers together for local development and testing.

Now, let’s build and run everything in Docker on a local machine:

docker-compose up --build

Make sure Docker Engine or Docker Desktop is installed and running before executing this command.
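
Once the build finishes and both containers are up, a couple of commands are useful for checking status and troubleshooting:

docker-compose ps               # lists the frontend and backend containers and their state
docker-compose logs -f backend  # follows the backend logs (handy if API calls fail)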

🌍 Accessing the Application

  • Frontend UI: http://localhost:7860
  • Open this link in a browser and you will see the UI we designed.

How does this work?

πŸ”Ή Outside Docker (From Your Browser)

Since Docker exposes ports to the host machine, you access the services using localhost or your server IP: the frontend at http://localhost:7860 and the backend at http://localhost:8000.

πŸ”Ή Inside the Dockerized Code

When making API calls inside the containers, service names are used instead of localhost.

For example, the Gradio UI (frontend) reaches the backend through its service name:

BACKEND_URL = os.environ.get("BACKEND_URL", "http://backend:8000/chat")  # defaults to the service name "backend", not localhost
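
If you want to confirm this name resolution yourself, one quick check (a sketch; it assumes the stack is running and uses the requests library already installed in the frontend image) is to call the backend from inside the frontend container:

docker-compose exec frontend python -c "import requests; print(requests.post('http://backend:8000/chat', json={'message': 'ping'}).status_code)"

A printed 200 means the frontend container reached the backend by its service name (and that the OpenAI call succeeded).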

πŸ’¬ Chat and Test:

Works!

πŸ”₯ Now, we have a fully containerized LLM chat application that can run on any machine! πŸŽ‰

Here is the link to the full code: Scalable Simple LLM Chatbot.

We can now move to cloud deployment (AWS ECS, GCP Cloud Run, Azure AKS) for production scaling. We have to create Docker images of the full application, upload them to Docker Hub or AWS ECR, and then proceed with the remaining cloud deployment steps. This will be discussed and explored in future articles. πŸš€
