Understanding Scalable Deployment Tools on AWS: AWS ECR, ECS, ALB, IAM and Secrets Manager
Deploying a Large Language Model (LLM) chat application that can scale efficiently on AWS requires understanding key AWS services. This article explores important tools for scalable deployment like Amazon ECR, ECS, ALB, IAM, and Secrets Manager.
📜 Table of Contents 📜
✨ 1. Introduction
🏗️ 2. AWS ECR
🏢 3. AWS ECS
📌 3.1 Cluster, Task, Service
📌 3.2 Sequence of Setting Up ECS Components
🔍 4. In-depth Look at Tasks in ECS
⚙️ 4.1 Tasks can run in two modes
🛠️ 4.2 Task Configuration Strategies for Microservices
🎛️ 4.3 CPU and Memory Settings
🔑 4.4 Setup IAM & Secrets Manager for OpenAI Key Access
🚀 4.5 After Creating the Task, Two Options
📦 5. In-depth Look at Services in ECS
⚡ 5.1 Compute Configuration
🚀 5.2 Deployment Configuration
🌐 5.3 Networking
⚖️ 5.4 Load Balancing
📈 5.5 Service Auto Scaling
🌍 6. Allowing External Traffic to the Application
✨ 1. Introduction
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Why Scalable Deployment Matters
A scalable deployment means your application can handle a growing number of users and increasing traffic without performance degradation.
AWS provides several services that work together to achieve this:
- Elastic Container Registry (ECR) stores your application's container images.
- Elastic Container Service (ECS) runs and manages the application containers.
- Application Load Balancer (ALB) efficiently distributes incoming traffic across multiple targets.
- IAM (Identity and Access Management) manages permissions and access policies for AWS resources, ensuring secure authentication and authorization for developers, services, and applications interacting with AWS.
- Secrets Manager securely stores and manages sensitive data, such as API keys and credentials, preventing unauthorized access.
🏗️ 2. AWS Elastic Container Registry (ECR)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
ECR is like a storage space for Docker container images. When you containerize an application, the images need to be stored somewhere before deployment. ECR provides a secure, scalable, and highly available repository for these images.
Why Use ECR?
- Centralized Storage: All your container images are stored in one place.
- Security & Access Control: Images are encrypted, and only authorized users can access them.
- Versioning: Each image is identified by its tag (and digest), making changes easy to track.
- Integration with AWS Services: Directly connects with ECS for deployment.
Instead of manually transferring images to different servers, ECR allows smooth automation in the cloud deployment process.
Setup Steps Overview
To begin, create a repository on AWS ECR, which will initially be empty. This repository serves as a secure storage location for our Docker images, allowing seamless integration with ECS for deployment.
AWS simplifies this process by providing pre-generated push commands, accessible via the “View push commands” option in the top-right corner of the repository page. These commands guide us through pushing our Docker images step by step.
The process consists of four key steps:
- Authentication — Logging in to AWS ECR to gain push access.
- Building Docker Images — Creating container images for the application locally.
- Tagging the Images — Assigning repository-specific tags for correct storage.
- Pushing to ECR — Uploading the images to the repository for deployment.
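For concreteness, the four steps typically map to commands like the ones below; the account ID, region, repository, and image names are placeholders for your own values:

```bash
# 1. Authenticate your local Docker client against the ECR registry
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# 2. Build the container image locally
docker build -t llm-chat-frontend .

# 3. Tag the image with the full repository URI so Docker knows where to push it
docker tag llm-chat-frontend:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/llm-chat-frontend:latest

# 4. Push the image to the ECR repository
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/llm-chat-frontend:latest
```

For an application with two images, like our frontend and backend, steps 2 to 4 are repeated once per image.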
Once the images are in ECR, they are ready to be used by ECS, enabling a scalable and efficient deployment process.
🏢 3. AWS Elastic Container Service (ECS)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Amazon Elastic Container Service (ECS) is a fully managed container orchestration service that simplifies the deployment, management, and scaling of containerized applications on AWS. Understanding its core components — clusters, tasks, and services — is essential for effectively utilizing ECS.
📌 3.1 Cluster, Task, Service
A cluster is a logical grouping of resources where your tasks and services run. It can include Amazon EC2 instances, AWS Fargate tasks, or a combination of both. Clusters provide isolation and organization for your applications, allowing you to manage resources efficiently.
A task is a running instantiation of a task definition, the blueprint that describes how your application container(s) should run. It specifies details like the Docker image to use, CPU and memory requirements, networking configurations, and more. Tasks are the basic units of work scheduled by ECS.
A service enables you to run and maintain a specified number of tasks simultaneously in a cluster. It ensures that the desired number of tasks are always running, restarting failed tasks as needed. Services also facilitate load balancing and scaling, allowing your application to handle varying levels of traffic seamlessly.
📌 3.2 Sequence of Setting Up ECS Components
Establishing these components in the correct order is crucial for a successful deployment:
- Create a Cluster: Begin by setting up a cluster. This can be done using the AWS Management Console, AWS CLI, or infrastructure as code tools.
- Register a Task Definition: Craft a task definition that outlines the configuration for your containerized application, including the Docker image, resource allocations, and networking settings.
- Create a Service: Deploy a service based on your task definition within the cluster. Configure the service to maintain the desired number of task instances and, if necessary, integrate it with a load balancer for distributing traffic.
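As a quick illustration, the first step is a single AWS CLI call; the cluster name is a placeholder:

```bash
# Create an empty ECS cluster to hold the tasks and services defined next
aws ecs create-cluster --cluster-name llm-chat-cluster
```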
🔍 4. In-depth Look at Tasks in ECS
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Tasks are created within the cluster. The task definition settings include:
- Launch type: Fargate or EC2
- Container Definitions: Details about the Docker containers to run, including image names, port mappings, and environment variables.
- Resource Requirements: Specifications for CPU and memory allocations for each container.
- Networking Mode: Configuration of networking settings, such as bridge, host, awsvpc, or none.
- Volumes: Definitions of data volumes to be used by the containers.
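To make these settings concrete, below is a minimal sketch of registering a Fargate task definition from the CLI. The family name, image URI, container port, role ARN, and sizes are illustrative placeholders, not values taken from the article's repository:

```bash
# Write a minimal Fargate task definition to disk, then register it with ECS
cat > task-def.json <<'EOF'
{
  "family": "llm-chat-backend",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "2048",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "backend",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/llm-chat-backend:latest",
      "essential": true,
      "portMappings": [{ "containerPort": 8000, "protocol": "tcp" }]
    }
  ]
}
EOF
aws ecs register-task-definition --cli-input-json file://task-def.json
```

The execution role lets ECS pull the image from ECR and write logs on the task's behalf; the task role (covered in Section 4.4) is what the application code itself assumes.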
⚙️ 4.1 Tasks can run in two modes
- Fargate Launch Type: Allows you to run containers without managing the underlying infrastructure. AWS handles the provisioning and scaling of the compute resources. [This is preferred over EC2 for early development, deployment, and testing, as it allows you to focus on coding and development rather than managing compute infrastructure. It can also be a viable option in later stages if you prefer to avoid infrastructure management]
- EC2 Launch Type: Gives you more control over the infrastructure by running tasks on a cluster of Amazon EC2 instances that you manage.
🛠️ 4.2 Task Configuration Strategies for Microservices
A task in ECS allows you to run one or more Docker containers within it. In our LLM chat application, we have two images: one for the frontend and one for the backend. This raises the question — should we add both to a single task or create separate tasks for each?
In a microservices architecture like our application, where the frontend and backend are separate services (GitHub link to the code: scalable_llm_chatbot), there are two primary strategies for configuring tasks in ECS:
1. Single Task for Multiple Containers
In this approach, both frontend and backend containers are added within a single task.
Advantages:
- Simplified Networking: Containers can communicate over localhost, reducing complexity.
Disadvantages:
- Limited Flexibility: Scaling is uniform. You cannot scale frontend and backend independently based on their specific load requirements.
2. Separate Task for Each Microservice
Here, the frontend and backend each run in their own task and are managed as separate services.
Advantages:
- Independent Scaling: Each service can scale based on its own demand, optimizing resource utilization.
- Fault Isolation: Issues in one service do not directly impact the other, enhancing overall resilience.
Disadvantages:
- Complex Networking: Inter-service communication requires proper network configurations, such as setting up service discovery or using load balancers.
- Increased Management Overhead: Managing multiple services and task definitions adds complexity to deployment and monitoring processes.
Choosing between these strategies depends on your application’s specific requirements, including scalability needs and resource utilization patterns.
🎛️ 4.3 CPU and Memory Settings
- CPU: Since our LLM chat application is lightweight, 0.5 vCPU or 1 vCPU should be enough to handle its workload.
- Memory: In ECS, ‘memory’ defines the maximum amount of memory available to a task or container. Since our application is lightweight, 2GB should suffice, but setting it to 4GB provides a buffer for stability.
How is Memory Used in a Task?
- Application Execution → Every application running inside a container needs memory to store code execution states, temporary data, and cache.
- Buffering and Caching → If the application handles large requests or heavy data processing, more memory is required to prevent performance issues.
🔑 4.4 Setup IAM & Secrets Manager for OpenAI Key Access
Certain components of our application, such as the backend, need secure access to sensitive information like API keys.
- AWS Secrets Manager securely stores credentials like the OpenAI API key, preventing exposure in environment variables or code.
- IAM (Identity and Access Management) is used to create roles that allow ECS tasks to fetch secrets dynamically from Secrets Manager.
By linking IAM roles to ECS, the backend can securely retrieve API keys whenever needed, without manual intervention. You link these roles in the 'Task role' and 'Task execution role' settings under 'Infrastructure requirements' when setting up the task definition.
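A rough CLI sketch of this wiring; the secret name, role name, and ARNs are placeholders:

```bash
# 1. Store the OpenAI key in Secrets Manager (value shown is a placeholder)
aws secretsmanager create-secret \
  --name openai-api-key \
  --secret-string 'sk-PLACEHOLDER'

# 2. Allow the task execution role to read that secret at container startup
aws iam put-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-name ReadOpenAIKey \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:openai-api-key-*"
    }]
  }'
```

In the container definition, a secrets entry such as {"name": "OPENAI_API_KEY", "valueFrom": "<secret ARN>"} then tells ECS to inject the value as an environment variable when the task starts.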
🚀 4.5 After Creating the Task, Two Options
1. Run the Task: Running the task means launching a single instance of it. This is useful for testing parts of your application, but it does not scale. If you need auto-scaling, you must create a service on top of the task. Running a task without a service is mainly for debugging or temporary execution (a CLI sketch of this follows at the end of this subsection).
2. Create a Service: A service allows your task to scale based on demand. It manages scaling settings, including:
- Scaling Limits — You can define the minimum and maximum number of running tasks.
- Application Load Balancer (ALB) — Distributes traffic across tasks for better performance.
When traffic increases, the service automatically adds more tasks, but only up to the set limit. When traffic is low, it reduces the number of running tasks, ensuring efficiency and cost savings. This dynamic scaling makes AWS ECS more economical and efficient.
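Returning to option 1 for a moment, a one-off debug run of a task from the CLI might look like this; the cluster, task definition, and network IDs are placeholders:

```bash
# Launch a single task instance without a service; useful for debugging only
aws ecs run-task \
  --cluster llm-chat-cluster \
  --launch-type FARGATE \
  --task-definition llm-chat-backend \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-0abc1234],securityGroups=[sg-0abc1234],assignPublicIp=ENABLED}"
```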
Now, let’s dive deeper into how ECS services work.
📦 5. In-depth Look at Services in ECS
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Services in ECS manage the deployment and scalability of tasks. Key features include:
- Desired Task Count: The service maintains the specified number of tasks, automatically replacing any that fail or stop.
- Auto Scaling: ECS services can adjust the number of running tasks based on defined scaling policies, allowing your application to handle changes in load efficiently.
- Load Balancing: Services can be associated with load balancers (Application Load Balancer or Network Load Balancer) to distribute incoming traffic across tasks, ensuring high availability and reliability. For our LLM chat application, we use Application Load Balancer.
An ECS Service is created on top of a task definition to ensure that the specified number of tasks are running continuously and can scale based on demand. The service manages launching, stopping, and replacing tasks in response to failures or scaling events.
The following are the important configuration settings:
⚡ 5.1 Compute Configuration
There are two options for compute configuration: 1. Capacity provider strategy, 2. Launch type. Option 2 is preferable in our case.
The Capacity Provider Strategy lets you define where tasks will run. ECS supports capacity providers such as FARGATE and EC2. FARGATE is a serverless option managed by AWS, while EC2 gives more control over infrastructure.
The Launch Type determines whether tasks run in a fully managed (FARGATE) or self-managed (EC2) environment. FARGATE is ideal for unpredictable workloads, as AWS handles provisioning, scaling, and maintenance. EC2 allows fine-grained control over instance types and networking, making it suitable for long-running, cost-sensitive applications.
We choose the Launch type option with FARGATE.
🚀 5.2 Deployment Configuration
The Service Name is the unique identifier of the ECS service, making it easy to track and manage logs and deployments. The Desired Tasks setting defines how many instances of a task should always be running; if a task fails, ECS automatically replaces it to maintain the desired count. Set this to at least 1 and increase it depending on your requirements.
The Deployment Type determines how new versions of tasks are rolled out. In a Rolling Update, tasks are replaced incrementally to avoid downtime; this is the default.
The Minimum Running Tasks Percentage defines how many tasks should remain operational during an update. For example, a 50% setting ensures at least half the tasks remain active. This prevents downtime while deploying updates. The Maximum Running Tasks Percentage determines the number of extra tasks that can be temporarily launched during an update. Setting this to 200% allows scaling up temporarily to ensure smooth transitions before old tasks are terminated.
The Deployment Failure Detection mechanism identifies deployment issues such as failed health checks or container crashes. This ensures faulty deployments are quickly rolled back, preventing service disruptions. Keep the default options.
🌐 5.3 Networking
Settings here include VPC (Virtual Private Cloud), Subnets, Security Groups, and Public IP. Keep the default options for all of them, but make sure Public IP is turned on. This later gives us a public IP through which we can access our application from the internet.
⚖️ 5.4 Load Balancing
In Amazon Elastic Container Service (ECS), integrating a load balancer ensures that traffic is efficiently distributed across your containerized applications. Key settings to consider include:
1. Load Balancer Type: ECS supports different load balancers. Application Load Balancer is what we need for our LLM chat application.
- Application Load Balancer (ALB): Ideal for HTTP/HTTPS traffic, offering advanced routing based on URL paths or hostnames.
2. Health Check Grace Period: This setting defines a time window (in seconds) that allows new tasks to start and stabilize before the load balancer begins health checks. During this period, ECS ignores health check results, preventing premature marking of tasks as unhealthy.
3. Container: When configuring load balancing, it’s essential to specify which container (Image/container within a task) the load balancer should route traffic to, especially if multiple containers are defined. This involves mapping the container’s port to the load balancer’s listener port.
4. Listener: When a request is received, the listener forwards it to the appropriate target group. This is typically the frontend port on which the LLM chat application is running. Since we define it to be 7860, this value will be 7860. Refer to the LLM chat application code [scalable_llm_chatbot].
5. Target Group: A target group directs requests to one or more registered targets, such as ECS tasks. If no target group was created earlier, simply select 'Create new', give it a name, and keep the other settings at their defaults.
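Taken together, the compute, deployment, networking, and load-balancing settings above correspond roughly to a single create-service call like the one below; every name, ID, and ARN is a placeholder:

```bash
aws ecs create-service \
  --cluster llm-chat-cluster \
  --service-name llm-chat-frontend-svc \
  --task-definition llm-chat-frontend \
  --launch-type FARGATE \
  --desired-count 1 \
  --deployment-configuration "minimumHealthyPercent=50,maximumPercent=200" \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-0abc1234],securityGroups=[sg-0abc1234],assignPublicIp=ENABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/llm-chat-tg/0123456789abcdef,containerName=frontend,containerPort=7860" \
  --health-check-grace-period-seconds 60
```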
📈 5.5 Service Auto Scaling
Minimum and Maximum Number of Tasks define the scaling boundaries. The minimum value ensures that at least a certain number of tasks are always running, maintaining baseline availability. The maximum value prevents over-scaling, which can lead to unnecessary resource consumption and cost increases.
Target Tracking automatically adjusts the task count based on predefined metrics. If CPU utilization exceeds a certain threshold, ECS scales up tasks to handle the load. Common ECS service metrics include ECSServiceAverageCPUUtilization, ECSServiceAverageMemoryUtilization, and ALBRequestCountPerTarget. CPU-based scaling is effective for compute-intensive applications, while request-based scaling suits web services handling fluctuating traffic.
The Target Value is the threshold at which scaling actions are triggered. For example, setting CPU utilization to 75% ensures ECS adds more tasks when CPU load crosses this limit. The Scale-In Cooldown Period is the waiting time before reducing tasks after a scaling event. A properly configured cooldown prevents excessive scaling down, which could impact performance. Similarly, the Scale-Up Cooldown Period defines the waiting time before ECS increases the number of tasks, avoiding unnecessary rapid scaling fluctuations.
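The console configures all of this for you, but the underlying API calls look roughly like the following; the cluster and service names are placeholders and the limits are example values:

```bash
# Register the service as a scalable target with min/max task counts
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/llm-chat-cluster/llm-chat-frontend-svc \
  --min-capacity 1 \
  --max-capacity 4

# Attach a target-tracking policy: add tasks when average CPU exceeds 75%
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/llm-chat-cluster/llm-chat-frontend-svc \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 75.0,
    "PredefinedMetricSpecification": { "PredefinedMetricType": "ECSServiceAverageCPUUtilization" },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'
```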
After the configuration is complete, hit 'Create' and the service starts deploying.
🌍 6. Allowing External Traffic to the Application
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Even after the application is configured and deployed, it is still not accessible to internet users. To allow external access, we need to update the inbound traffic settings.
- If a single task was created, update its network settings.
- If multiple tasks were created, update the frontend task’s network settings, as it is the user-facing part of the application.
Steps to Allow Internet Access
Open the task in ECS. Go to ENI ID → Security Groups → Edit Inbound Rules → Add Rule. Set the following values:
- Type: Custom TCP
- Port Range: 7860 (or the port your application uses; our LLM chat app uses 7860)
- Source: Anywhere-IPv4
Click Save Rules.
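The same rule can also be added from the CLI; the security group ID below is a placeholder:

```bash
# Open port 7860 to all IPv4 addresses on the task's security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0abc1234 \
  --protocol tcp \
  --port 7860 \
  --cidr 0.0.0.0/0
```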
Finding the Public IP Address
- The public IP address is available under Task → Networking → Public IP.
- To access the application, use: http://<Public-IP>:<Application-Port>
This ensures that the application is reachable over the internet. 🚀