Deploying Dagster on AWS: A Production-Ready Guide
Dagster has become one of the most compelling choices for data orchestration — it brings software engineering rigor to data pipelines with its asset-centric model, rich type system, and excellent developer experience. But deploying it to production on AWS requires careful thought.
This guide walks through a battle-tested architecture for running Dagster on AWS, using ECS for compute, RDS for the event log, and Terraform for infrastructure-as-code.
Why Dagster?
Before diving into deployment, it's worth understanding what makes Dagster different from traditional orchestrators like Airflow:
- Asset-centric thinking: You define data assets (tables, ML models, reports) rather than tasks. Dagster tracks lineage and freshness automatically.
- Strong typing: Dagster's type system catches pipeline issues at definition time, not at 3am.
- Unified observability: The Dagit UI gives you a real-time view of your entire data platform — assets, runs, schedules, sensors.
Architecture Overview
Our production setup uses:
┌─────────────────────────────────────────────┐
│ AWS VPC │
│ │
│ ┌──────────┐ ┌───────────────────────┐ │
│ │ ALB │───▶│ ECS Fargate │ │
│ │ (HTTPS) │ │ (Dagit + Daemon) │ │
│ └──────────┘ └───────────────────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ RDS Postgres │ │
│ │ (Event Store) │ │
│ └─────────────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ ECS Run Launcher (auto-scaling) │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
Setting Up the Infrastructure with Terraform
VPC and Networking
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "dagster-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = true # Cost optimization for non-critical envs
}
RDS for the Event Store
Dagster stores all run history and event logs in PostgreSQL. Use RDS for managed, durable storage:
resource "aws_db_instance" "dagster" {
  identifier     = "dagster-event-store"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.t3.small"

  allocated_storage = 20
  db_name           = "dagster"
  username          = "dagster"
  password          = var.db_password

  vpc_security_group_ids = [aws_security_group.rds.id]
  db_subnet_group_name   = aws_db_subnet_group.dagster.name

  backup_retention_period = 7
  skip_final_snapshot     = false
  deletion_protection     = true
}
ECS Fargate for the Dagit Webserver
resource "aws_ecs_service" "dagit" {
  name            = "dagit"
  cluster         = aws_ecs_cluster.dagster.id
  task_definition = aws_ecs_task_definition.dagit.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = module.vpc.private_subnets
    security_groups  = [aws_security_group.dagit.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.dagit.arn
    container_name   = "dagit"
    container_port   = 3000
  }
}
Configuring Dagster for Production
The dagster.yaml instance configuration for production (workspace.yaml, which points at your code locations, is a separate file):
storage:
  postgres:
    postgres_db:
      username: dagster
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      hostname:
        env: DAGSTER_POSTGRES_HOST
      db_name: dagster
      port: 5432

run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher
  config:
    task_definition:
      env: DAGSTER_ECS_TASK_DEFINITION_ARN
    container_name: run

telemetry:
  enabled: false
Lessons Learned
1. Separate the daemon from Dagit. The Dagster daemon handles schedules and sensors; if it stops, your schedules stop with it. Run it as its own ECS task so it can be restarted and scaled independently of the webserver, and so crashes are easier to isolate and debug.
2. Use SSM Parameter Store for secrets. Never put secrets in environment variables directly in your task definitions. Use AWS SSM Parameter Store and reference them in Terraform.
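If pipeline code ever needs a parameter at runtime, rather than via the task definition's secrets block, it can fetch it through SSM directly. A sketch (the parameter path is an assumption; the client is passed in to keep the helper testable without AWS credentials):

```python
def get_db_password(ssm_client, parameter_name="/dagster/db_password"):
    """Fetch a SecureString parameter from SSM Parameter Store.

    WithDecryption=True is required for SecureString values; without it
    SSM returns the encrypted ciphertext.
    """
    response = ssm_client.get_parameter(Name=parameter_name, WithDecryption=True)
    return response["Parameter"]["Value"]
```

Call it as `get_db_password(boto3.client("ssm"))`; injecting the client also makes it trivial to stub in unit tests.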
3. Set memory limits carefully. ECS tasks have strict memory limits. Start with 2GB for Dagit and monitor with CloudWatch. Complex asset graphs with many ops can spike memory during graph traversal.
4. Enable S3 for I/O Manager. For any production Dagster deployment, configure the S3 I/O manager for intermediate storage. This makes your runs reproducible and allows you to debug individual steps.
from dagster import Definitions
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource

defs = Definitions(
    assets=all_assets,  # your list of software-defined assets
    resources={
        "io_manager": s3_pickle_io_manager.configured({
            "s3_bucket": "my-dagster-bucket",
            "s3_prefix": "dagster-io",
        }),
        "s3": s3_resource,
    },
)
Monitoring and Alerting
Set up CloudWatch alarms for:
- ECS task stop events (both Dagit and Daemon)
- RDS CPU and storage
- ALB 5xx errors
- Dagster run failures (via Dagster's built-in alerting or a custom sensor)
Cost Optimization
For a small team, this architecture runs comfortably on:
- ECS Fargate: ~$30/month (Dagit + Daemon, 0.5 vCPU / 1GB each)
- RDS t3.small: ~$25/month
- ALB: ~$20/month
- NAT Gateway: ~$35/month
Total: ~$110/month for a solid production Dagster deployment.
Next Steps
Once your Dagster deployment is stable, consider:
- Adding asset checks to your software-defined assets to monitor data quality
- Setting up Dagster Cloud Hybrid if you want Dagster to manage the control plane while compute stays in your own environment
- Exploring Dagster's dbt integration for seamless analytics engineering workflows
This architecture has been used in production deployments handling millions of daily events. If you have questions about adapting it to your use case, get in touch.
