AWS CloudWatch Metrics are time-series data points from AWS services. As a Data Engineer, you use them to monitor ETL jobs, detect issues, and automate alerts.
Metrics are timestamped numerical values that measure resource or application performance, for example CPUUtilization = 72%. Each metric belongs to a Namespace (a container such as AWS/EC2), is identified by Dimensions (name/value pairs such as InstanceId), and carries a Unit (Percent, Count, Bytes).
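To make the anatomy of a metric concrete, here is a minimal sketch that builds one data point for the `put_metric_data` call in boto3. The namespace `DataPlatform/ETL`, the `JobName` dimension, and the job name `orders_etl` are hypothetical values chosen for illustration; the actual publish step needs boto3 and AWS credentials, so it is kept in a separate function.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit, dimensions, timestamp=None):
    """Build one entry for the MetricData list accepted by put_metric_data."""
    return {
        "MetricName": name,
        # Dimensions are name/value pairs that identify this metric series.
        "Dimensions": [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def publish(namespace, datum):
    """Send the data point to CloudWatch (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=[datum]
    )


# Hypothetical ETL KPI: 12,500 records processed by a job named orders_etl.
datum = build_metric_datum(
    "RecordsProcessed", 12500, "Count", {"JobName": "orders_etl"}
)
# publish("DataPlatform/ETL", datum)  # uncomment when running against AWS
```

Custom namespaces must not start with `AWS/`, which is reserved for metrics emitted by AWS services.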
| 🧱 Service | 📊 Metric Name | 🎯 Why It Matters |
|---|---|---|
| 💻 EC2 | CPUUtilization | Detect high CPU usage during data processing or batch jobs → scale or alert. |
| 📦 S3 | NumberOfObjects, BucketSizeBytes | Track data lake growth and control storage costs. |
| 🧩 Glue | glue.driver.aggregate.numCompletedTasks | Monitor job progress and detect stuck tasks. |
| ⚡ Lambda | Invocations, Errors, Duration | Find failed or slow serverless transformations. |
| 🏢 Redshift | CPUUtilization, DatabaseConnections | Ensure data warehouse performance under heavy query loads. |
| 🔄 Kinesis | GetRecords.IteratorAgeMilliseconds | Detect stream consumer lag in real-time pipelines. |
| 🧠 Custom | RecordsProcessed, FilesIngested | Track ETL KPIs like record count and job runtime. |
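As an illustration of reading one of these metrics, the sketch below builds the parameters for boto3's `get_metric_statistics` to fetch the Kinesis iterator age and reduces the returned datapoints to a single lag figure. The stream name `clickstream` and the 15-minute lookback are assumptions made for the example, not values from this guide.

```python
from datetime import datetime, timedelta, timezone


def iterator_age_query(stream_name, end, minutes=15):
    """Parameters for get_metric_statistics on the Kinesis consumer-lag metric."""
    return {
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Maximum"],
    }


def max_lag_ms(datapoints):
    """Largest iterator age in the response, or 0.0 if there is no data."""
    return max((dp["Maximum"] for dp in datapoints), default=0.0)


def fetch_datapoints(params):
    """Call CloudWatch (requires boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above run offline

    return boto3.client("cloudwatch").get_metric_statistics(**params)["Datapoints"]


# Hypothetical stream name and a fixed end time for reproducibility.
params = iterator_age_query(
    "clickstream", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
)
```

A consumer that keeps up with the stream shows an iterator age near zero; a steadily growing maximum suggests the consumer is falling behind.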
💬 Example Alarm: CPUUtilization > 80% for 3 of 5 minutes → Send SNS notification or trigger remediation Lambda.
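The CPU alarm above could be expressed roughly as follows with boto3's `put_metric_alarm`, where "3 of 5 minutes" maps to `DatapointsToAlarm=3` and `EvaluationPeriods=5` with a 60-second period. The instance ID and SNS topic ARN are placeholders, and one-minute EC2 datapoints assume detailed monitoring is enabled. The `breaches` helper is only a local illustration of the "M out of N" idea, not CloudWatch's own evaluation logic.

```python
def cpu_alarm_params(instance_id, sns_topic_arn):
    """Parameters for put_metric_alarm: CPU > 80% in 3 of the last 5 minutes."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,  # assumes detailed (1-minute) monitoring
        "EvaluationPeriods": 5,  # N: look at the last 5 datapoints
        "DatapointsToAlarm": 3,  # M: alarm when 3 of them breach
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def breaches(values, threshold=80.0, window=5, needed=3):
    """Local illustration: do at least `needed` of the last `window` values breach?"""
    recent = values[-window:]
    return sum(v > threshold for v in recent) >= needed


def create_alarm(params):
    """Create or update the alarm (requires boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above run offline

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder identifiers, for illustration only.
alarm = cpu_alarm_params(
    "i-0123456789abcdef0", "arn:aws:sns:us-east-1:123456789012:etl-alerts"
)
```

Requiring 3 of 5 datapoints rather than 5 of 5 lets the alarm fire on sustained but not perfectly continuous load, while a single spike is still ignored.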
RecordsProcessed drops below threshold → Auto-retry pipeline.

Monitor BucketSizeBytes and alert when storage exceeds the budget threshold.

CloudWatch Logs centralize, monitor, and analyze logs from AWS services and applications in near real-time. They are essential for debugging, observability, compliance, and automation across data platforms.
| Use | How Logs Help | Example |
|---|---|---|
| 🔧 Debugging ETL Jobs | Full stack traces, Spark executor errors, and job progress appear in Glue/EMR logs. | Find "OutOfMemoryError" in Glue logs and identify failing stage. |
| 📊 Data Quality Checks | Log validation results and counts, enabling detection of missing or malformed records. | Log: "invalid rows=250" → trigger auto-retry or quarantine job. |
| ⏱️ Performance Tuning | Measure step durations and latencies to optimize transforms and partitioning. | Glue stage X takes 20m — add partition pruning to speed up. |
| 🚨 Alerting | Metric filters detect error keywords and raise CloudWatch Alarms/SNS notifications. | Filter: /ERROR/ in Lambda logs → Alarm → Slack via SNS. |
| 📚 Auditing & Compliance | Retention policies and secure storage for audit trails. | Keep pipeline execution logs for 180 days for audit review. |
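The alerting and auditing rows can be sketched with two CloudWatch Logs calls in boto3: `put_metric_filter` turns matching log events into a metric that an alarm can watch, and `put_retention_policy` sets how long events are kept. The log group name, filter name, metric name, and namespace below are hypothetical. The filter pattern `ERROR` is a simple case-sensitive term match, used here in place of the regex shown in the table.

```python
def error_metric_filter(log_group, namespace="DataPlatform/Logs"):
    """Parameters for put_metric_filter: count log events containing ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",
        "filterPattern": "ERROR",  # matches events containing the term ERROR
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": namespace,
                "metricValue": "1",  # each matching event adds 1
                "defaultValue": 0.0,  # emitted when nothing matches
            }
        ],
    }


def retention_policy(log_group, days=180):
    """Parameters for put_retention_policy: keep events for `days` days."""
    return {"logGroupName": log_group, "retentionInDays": days}


def apply_settings(filter_params, retention_params):
    """Apply both settings (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builders above run offline

    logs = boto3.client("logs")
    logs.put_metric_filter(**filter_params)
    logs.put_retention_policy(**retention_params)


# Hypothetical log group for a Lambda-based transformation.
filter_params = error_metric_filter("/aws/lambda/transform-orders")
retention_params = retention_policy("/aws/lambda/transform-orders")
```

An alarm on the resulting `ErrorCount` metric can then notify an SNS topic, which is the path the Alerting row describes.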

| Pattern | Purpose | How to implement |
|---|---|---|
| Centralized Log Aggregation | Single-pane view across accounts/environments | Use centralized Log Group naming (e.g., /prod/data-platform/*), cross-account subscriptions, and CloudWatch Logs Insights dashboards. |
| Stream & Process | Real-time analytics and enrichment | Subscribe logs to Kinesis Data Streams → process with Lambda/Firehose → index in OpenSearch or S3. |
| Error-Driven Automation | Auto-remediation for common failures | Metric filter on "JobFailed" → CloudWatch Alarm → EventBridge → trigger remediation Lambda. |
| Cold Storage Archival | Cost-effective long-term retention | Subscribe logs to S3 via Kinesis Firehose with lifecycle rules (Glacier transition). |
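For the streaming and automation patterns, a consumer first has to unpack what CloudWatch Logs delivers. The sketch below assumes a Lambda function subscribed directly to a log group, in which case the event carries a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The `JobFailed` keyword comes from the table; the handler's return shape is an illustrative choice, not a required format.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the base64 + gzip payload of a CloudWatch Logs subscription event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def failed_job_messages(payload, keyword="JobFailed"):
    """Return the messages of log events that mention the keyword."""
    return [e["message"] for e in payload["logEvents"] if keyword in e["message"]]


def handler(event, context):
    """Lambda entry point: report which log group produced failure messages."""
    payload = decode_subscription_event(event)
    return {
        "logGroup": payload["logGroup"],
        "failures": failed_job_messages(payload),
    }
```

From here, a remediation step (retrying the job, quarantining input, posting to a channel) can act on the returned failures. When logs arrive through Kinesis Data Streams instead, the records wrap a similarly compressed payload, so the decoding step would need to be adapted to the Kinesis event shape.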
`fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20`

✅ Key takeaway: CloudWatch Logs are the "black box" of your data platform, essential for debugging, observability, compliance, and automation. Combine logs with metrics, dashboards, and EventBridge to build resilient, observable pipelines.