AWS Data Engineering Services - Quick Reference Cheat Sheet
Data Ingestion Services
Amazon Kinesis Family
| Service | Use Case | Key Features | Limits |
|---|---|---|---|
| Kinesis Data Streams | Real-time streaming | • Custom consumer apps • Replay capability • Low latency (< 1 second) | • 1 MB record limit • 1,000 records/sec or 1 MB/sec write per shard |
| Kinesis Data Firehose | Near real-time delivery | • Serverless • Built-in transformation • Direct S3/Redshift delivery | • Buffers by size or interval before delivery (60 seconds was the long-standing minimum interval) • No replay capability |
| Kinesis Analytics | Real-time analytics | • SQL on streaming data • Anomaly detection • Time-windowed queries | • SQL applications are SQL-only; custom code requires the Apache Flink runtime |
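
Shard sizing for Kinesis Data Streams follows from the per-shard write limits. The sketch below is a rough estimate only, assuming the standard provisioned-mode limits of 1,000 records/sec and 1 MB/sec per shard; the function name `estimate_shards` and its parameters are illustrative, not part of any AWS SDK.

```python
import math

# Per-shard write limits (assumption: provisioned mode, default quotas).
MAX_RECORDS_PER_SEC_PER_SHARD = 1000
MAX_MB_PER_SEC_PER_SHARD = 1.0


def estimate_shards(records_per_sec: float, avg_record_kb: float) -> int:
    """Return the minimum shard count that satisfies both write limits."""
    mb_per_sec = records_per_sec * avg_record_kb / 1024
    by_records = records_per_sec / MAX_RECORDS_PER_SEC_PER_SHARD
    by_bytes = mb_per_sec / MAX_MB_PER_SEC_PER_SHARD
    # Whichever limit is hit first determines the shard count.
    return max(1, math.ceil(max(by_records, by_bytes)))


if __name__ == "__main__":
    # Many small records: the record-count limit dominates.
    print(estimate_shards(records_per_sec=5000, avg_record_kb=0.1))
    # Fewer large records: the byte limit dominates.
    print(estimate_shards(records_per_sec=500, avg_record_kb=8))
```

Read throughput and consumer count also affect sizing, so treat the result as a lower bound.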
AWS Glue
| Component | Purpose | Key Points |
|---|---|---|
| Glue Crawlers | Schema discovery | • Auto-detect schema changes • Populate Data Catalog • Schedule-based or on-demand |
| Glue ETL Jobs | Data transformation | • Serverless Spark • Auto-scaling • Built-in retry logic |
| Glue Data Catalog | Metadata repository | • Hive-compatible metastore • Integration with Athena/EMR • Schema versioning |
| Glue DataBrew | Visual data preparation | • No-code transformations • Data profiling • 250+ built-in transformations |
Data Storage Services
Amazon S3 Storage Classes
| Storage Class | Use Case | Retrieval Time | Cost |
|---|---|---|---|
| Standard | Frequently accessed | Immediate | Highest storage cost |
| Standard-IA | Infrequently accessed | Immediate | Lower storage, retrieval fee |
| One Zone-IA | Non-critical, infrequent | Immediate | 20% less than Standard-IA |
| Glacier Instant | Archive with instant access | Immediate | Lower storage cost |
| Glacier Flexible | Archive data | Minutes to 12 hours (Expedited, Standard, or Bulk) | Very low storage cost |
| Glacier Deep Archive | Long-term archive | 12-48 hours | Lowest storage cost |
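
Storage classes are usually applied through lifecycle rules rather than by hand. The following is a minimal sketch of a lifecycle configuration document in the shape accepted by the S3 lifecycle API; the prefix, rule ID, and day thresholds are hypothetical and should be tuned to actual access patterns.

```python
from typing import Any


def build_lifecycle_config(
    prefix: str,
    ia_days: int = 30,
    glacier_days: int = 90,
    deep_archive_days: int = 365,
) -> dict[str, Any]:
    """Build a lifecycle configuration that tiers objects down over time."""
    # S3 requires objects to be at least 30 days old before moving to Standard-IA,
    # and each later transition must happen after the previous one.
    if not (30 <= ia_days < glacier_days < deep_archive_days):
        raise ValueError("transition days must be ascending and start at 30 or later")
    return {
        "Rules": [
            {
                "ID": f"tier-down-{prefix.strip('/') or 'all'}",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                    {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    config = build_lifecycle_config("raw/")
    # With boto3 this dict would be passed as the LifecycleConfiguration
    # argument of put_bucket_lifecycle_configuration (not called here).
    print(config)
```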
Amazon Redshift
| Feature | Description | Best Practice |
|---|---|---|
| Distribution Keys | How data is distributed | • Use JOIN columns • Prefer high-cardinality columns to avoid skew • Consider ALL for small dimension tables |
| Sort Keys | Physical data ordering | • Use WHERE clause columns • Consider compound vs interleaved • Limit to 3-4 columns |
| Compression | Reduce storage/IO | • Use ANALYZE COMPRESSION • Different encoding per column • Automatic for new tables |
| Workload Management | Query prioritization | • Separate queues by workload • Set memory allocation • Use concurrency scaling |
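
To make the distribution and sort key guidance concrete, here is a small sketch that renders a Redshift `CREATE TABLE` statement with a `DISTKEY` on a join column and a compound `SORTKEY` on a filter column. The helper `create_table_ddl` and the `orders` table are hypothetical; the generated SQL uses standard Redshift DDL clauses.

```python
def create_table_ddl(
    table: str,
    columns: dict[str, str],
    dist_key: str,
    sort_keys: list[str],
) -> str:
    """Render CREATE TABLE DDL with KEY distribution and a compound sort key."""
    for col in [dist_key, *sort_keys]:
        if col not in columns:
            raise ValueError(f"unknown column: {col}")
    cols = ",\n".join(f"    {name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE {table} (\n{cols}\n)\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"COMPOUND SORTKEY ({', '.join(sort_keys)});"
    )


if __name__ == "__main__":
    ddl = create_table_ddl(
        table="orders",  # hypothetical fact table
        columns={
            "order_id": "BIGINT",
            "customer_id": "BIGINT",
            "order_date": "DATE",
            "amount": "DECIMAL(12,2)",
        },
        dist_key="customer_id",    # column used in JOINs
        sort_keys=["order_date"],  # column used in WHERE filters
    )
    print(ddl)
```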
DynamoDB
| Concept | Description | Guidelines |
|---|---|---|
| Partition Key | Primary hash key | • High cardinality • Uniform access pattern • Avoid hot partitions |
| Sort Key | Range key for sorting | • Enable range queries • Model 1:N relationships • Support query patterns |
| GSI/LSI | Secondary indexes | • GSI: Different partition key • LSI: Same partition key, different sort key • Max 20 GSIs (default quota) and 5 LSIs per table |
| Capacity Modes | Billing model | • On-Demand: Unpredictable • Provisioned: Predictable + cheaper |
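
The partition key guidance can be illustrated with a toy simulation. The sketch below assumes a simple SHA-256 hash modulo a fixed partition count as a stand-in for DynamoDB's internal (undocumented) hashing; it compares how evenly requests spread when keyed by a low-cardinality attribute versus a high-cardinality one.

```python
import hashlib
from collections import Counter


def partition_for(key: str, n_partitions: int) -> int:
    """Map a key to a partition (stand-in for DynamoDB's internal hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


def max_partition_share(keys: list[str], n_partitions: int = 8) -> float:
    """Fraction of all requests that land on the busiest partition."""
    counts = Counter(partition_for(k, n_partitions) for k in keys)
    return max(counts.values()) / len(keys)


# 10,000 requests keyed by order status (3 distinct values) vs. by user ID.
low_cardinality = [("PENDING", "SHIPPED", "DELIVERED")[i % 3] for i in range(10_000)]
high_cardinality = [f"user-{i}" for i in range(10_000)]

if __name__ == "__main__":
    print("status key, busiest partition share:", max_partition_share(low_cardinality))
    print("user key, busiest partition share:  ", max_partition_share(high_cardinality))
```

With only three distinct key values, at most three partitions ever receive traffic, which is the hot-partition problem the table warns about.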
Data Processing Services
Amazon EMR
| Component | Purpose | Key Points |
|---|---|---|
| Master Node | Cluster management | • Manages cluster • NameNode for HDFS • Single point of failure |
| Core Nodes | Data storage + processing | • Run HDFS DataNode + YARN NodeManager • HDFS storage • Can be removed with care |
| Task Nodes | Processing only | • No HDFS storage • Spot instances recommended • Safe to terminate |
AWS Lambda
| Aspect | Specification | Considerations |
|---|---|---|
| Timeout | 15 minutes max | • Use Step Functions for longer workflows • Consider EMR for heavy processing |
| Memory | 128MB - 10GB | • CPU scales with memory • Optimize for cost vs performance |
| Triggers | Event-driven | • S3 events, Kinesis, DynamoDB Streams • EventBridge for schedules |
| Concurrency | 1000 default | • Can request increases • Consider reserved concurrency |
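
For the Kinesis trigger, Lambda delivers records in batches with each payload base64-encoded under `Records[].kinesis.data`. The handler below is a minimal sketch that assumes every payload is a JSON document; the return shape is illustrative rather than anything Lambda requires.

```python
import base64
import json


def handler(event, context):
    """Decode a batch of Kinesis records delivered to Lambda."""
    decoded = []
    for record in event.get("Records", []):
        # Kinesis payloads arrive base64-encoded inside the event.
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))  # assumption: producers send JSON
    return {"processed": len(decoded), "items": decoded}


if __name__ == "__main__":
    # Build a synthetic event with one record for local testing.
    body = base64.b64encode(json.dumps({"user": "u1", "clicks": 3}).encode()).decode()
    sample_event = {"Records": [{"kinesis": {"data": body}}]}
    print(handler(sample_event, None))
```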
Analytics Services
Amazon Athena
| Feature | Description | Optimization Tips |
|---|---|---|
| Serverless SQL | Query S3 data directly | • Use columnar formats (Parquet) • Partition data by query patterns • Compress data (GZIP, Snappy) |
| Query Engines | Presto/Trino based | • Use appropriate data types • Avoid SELECT * queries • Use LIMIT for exploration |
| Workgroups | Query organization | • Set data limits • Control costs • Separate environments |
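
Partitioning pays off in Athena when the S3 layout uses Hive-style `key=value` prefixes that the query engine can prune. A small sketch of generating such prefixes follows; the bucket name and the year/month/day scheme are assumptions for illustration.

```python
from datetime import date


def partition_prefix(base: str, day: date) -> str:
    """Return a Hive-style partition prefix for the given date."""
    return f"{base.rstrip('/')}/year={day:%Y}/month={day:%m}/day={day:%d}/"


if __name__ == "__main__":
    # "example-bucket" is a placeholder, not a real bucket.
    print(partition_prefix("s3://example-bucket/events", date(2024, 1, 5)))
    # → s3://example-bucket/events/year=2024/month=01/day=05/
```

A query that filters on `year`, `month`, and `day` then only scans objects under the matching prefixes.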
Amazon QuickSight
| Component | Purpose | Key Features |
|---|---|---|
| SPICE | In-memory engine | • Fast query performance • Scheduled data refresh • 10 GB capacity included per user (per-dataset limits vary by edition) |
| Data Sources | Input connections | • 30+ native connectors • Direct query vs SPICE • Row-level security |
| Dashboards | Visualization | • Interactive dashboards • Mobile responsive • Embedded analytics |
Security Services
AWS IAM for Data Engineering
| Policy Type | Use Case | Example |
|---|---|---|
| Identity-based | User/role permissions | Glue job execution role |
| Resource-based | Cross-account access | S3 bucket policy |
| Session policies | Temporary restrictions | Federated access limits |
| Permissions boundaries | Maximum permissions | Developer sandbox limits |
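
As an example of the resource-based row, the sketch below builds an S3 bucket policy that grants another account read access. The bucket name and account ID are placeholders, and granting the account root principal is one common choice, not the only one.

```python
import json


def cross_account_read_policy(bucket: str, account_id: str) -> str:
    """Return a bucket policy JSON document granting read access to another account."""
    principal = {"AWS": f"arn:aws:iam::{account_id}:root"}
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CrossAccountList",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",  # bucket-level action
            },
            {
                "Sid": "CrossAccountRead",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",  # object-level action
            },
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # Placeholder bucket and account ID for illustration only.
    print(cross_account_read_policy("example-data-lake", "111122223333"))
```

The consuming account still needs an identity-based policy that allows the same actions; the bucket policy alone is not sufficient across accounts.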
AWS KMS
| Key Type | Management | Use Case |
|---|---|---|
| AWS Managed | AWS controls | • Default encryption • Service-specific keys |
| Customer Managed | You control | • Custom key policies • Cross-account access • Key rotation control |
| Customer Provided | You provide | • Full control • Higher complexity • Import your own keys |
Monitoring & Governance
CloudWatch for Data Pipelines
| Metric Category | Examples | Alerting Strategy |
|---|---|---|
| Glue Jobs | • Job duration • Success/failure rate • DPU hours | • Set SLA-based alarms • Monitor cost metrics |
| Kinesis | • IncomingRecords • WriteProvisionedThroughputExceeded | • Shard-level monitoring • Auto-scaling triggers |
| Redshift | • CPU utilization • Disk space • Query performance | • Performance alerts • Storage warnings |
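
A Kinesis throttling alarm might be defined roughly as follows. This sketch only builds the parameter dictionary in the shape CloudWatch's `PutMetricAlarm` operation expects; the alarm name, threshold, evaluation window, and SNS topic ARN are assumptions to adjust per pipeline.

```python
from typing import Any


def kinesis_throttle_alarm(stream_name: str, sns_topic_arn: str) -> dict[str, Any]:
    """Build alarm parameters that fire when producers are throttled."""
    return {
        "AlarmName": f"{stream_name}-write-throttled",  # hypothetical naming scheme
        "Namespace": "AWS/Kinesis",
        "MetricName": "WriteProvisionedThroughputExceeded",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Sum",
        "Period": 60,                 # seconds per datapoint
        "EvaluationPeriods": 5,       # assumption: 5 consecutive minutes
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    params = kinesis_throttle_alarm(
        "clickstream",  # placeholder stream name
        "arn:aws:sns:us-east-1:111122223333:data-alerts",  # placeholder topic
    )
    # With boto3 these would be passed as keyword arguments to
    # the CloudWatch client's put_metric_alarm (not called here).
    print(params)
```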
AWS Lake Formation
| Feature | Purpose | Best Practice |
|---|---|---|
| Data Permissions | Fine-grained access control | • Column-level permissions • Row-level security • Tag-based policies |
| Data Discovery | Catalog population | • Automatic crawling • ML-powered classification • PII detection with Macie |
| Data Sharing | Cross-account access | • Resource sharing • Query federation • Audit trails |
Common Architecture Patterns
Lambda Architecture
Batch Layer: S3 → Glue/EMR → Redshift (historical data)
Speed Layer: Kinesis → Lambda → DynamoDB (real-time)
Serving Layer: Athena/QuickSight (unified view)
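
The serving layer's "unified view" amounts to merging a precomputed batch result with the increments the speed layer has seen since the last batch run. A toy sketch with in-memory dictionaries standing in for Redshift and DynamoDB:

```python
def serve(batch_view: dict[str, int], speed_view: dict[str, int]) -> dict[str, int]:
    """Merge historical batch totals with recent real-time increments."""
    merged = dict(batch_view)  # copy so the batch view is not mutated
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


if __name__ == "__main__":
    batch_view = {"page_a": 1000, "page_b": 250}  # computed by the batch layer
    speed_view = {"page_a": 12, "page_c": 3}      # events since the last batch run
    print(serve(batch_view, speed_view))
    # → {'page_a': 1012, 'page_b': 250, 'page_c': 3}
```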
Kappa Architecture
Stream Processing: Kinesis → Kinesis Analytics → Output
Everything is treated as a stream, including batch data
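
Treating batch data as a stream means historical data is reprocessed by replaying it through the same streaming logic. The sketch below uses a tumbling-window count as that single code path; events are assumed to be `(epoch_seconds, key)` tuples, and the same function handles both a replay and live events.

```python
from collections import defaultdict
from typing import Iterable


def tumbling_counts(
    events: Iterable[tuple[int, str]], window_sec: int = 60
) -> dict[tuple[int, str], int]:
    """Count events per key in fixed, non-overlapping time windows."""
    counts: dict[tuple[int, str], int] = defaultdict(int)
    for ts, key in events:
        window_start = ts - ts % window_sec  # align to the window boundary
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    historical = [(0, "a"), (30, "a"), (61, "a"), (65, "b")]  # replayed from the log
    live = [(121, "a")]                                       # newly arrived
    print(tumbling_counts(historical + live))
    # → {(0, 'a'): 2, (60, 'a'): 1, (60, 'b'): 1, (120, 'a'): 1}
```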
Data Lake Pattern
Landing Zone (S3 Raw) → Processing (Glue/EMR) →
Curated Zone (S3 Processed) → Analytics (Athena/Redshift)
Performance Optimization Quick Tips
S3 Optimization
- Use prefixes to avoid hot spots
- Multipart upload for files > 100MB
- S3 Transfer Acceleration for global access
- CloudFront for frequently accessed data
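
For multipart uploads, the part size has to respect S3's limits of 10,000 parts per upload and a 5 MiB minimum part size (except the last part). A rough planning sketch, with the 8 MiB preferred part size as an assumption:

```python
import math

MiB = 1024 ** 2
MAX_PARTS = 10_000       # S3 limit on parts per multipart upload
MIN_PART_SIZE = 5 * MiB  # S3 minimum for every part except the last


def plan_multipart(object_size: int, preferred_part_size: int = 8 * MiB) -> tuple[int, int]:
    """Return (part_size_bytes, part_count) for an object of the given size."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Grow the part size when the preferred size would exceed the part limit.
    part_size = max(
        MIN_PART_SIZE,
        preferred_part_size,
        math.ceil(object_size / MAX_PARTS),
    )
    return part_size, math.ceil(object_size / part_size)


if __name__ == "__main__":
    print(plan_multipart(100 * MiB))   # small object: preferred part size is fine
    print(plan_multipart(1024 ** 4))   # 1 TiB: part size grows to stay under 10,000 parts
```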
Redshift Optimization
- VACUUM regularly to reclaim space
- ANALYZE to update table statistics
- Use COPY command for bulk loads
- Monitor query performance with system tables
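
A bulk load with `COPY` might look like the statement this sketch renders. The table name, S3 prefix, and IAM role ARN are placeholders, and Parquet is assumed as the source format; the plain string formatting is for illustration and is not safe for untrusted input.

```python
def copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Render a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


if __name__ == "__main__":
    print(
        copy_statement(
            "orders",                                       # placeholder table
            "s3://example-bucket/curated/orders/",          # placeholder prefix
            "arn:aws:iam::111122223333:role/RedshiftLoad",  # placeholder role
        )
    )
```

Loading from a prefix that contains multiple files lets Redshift parallelize the load across slices.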
Glue Optimization
- Use job bookmarks for incremental processing
- Optimize for fewer, larger files
- Use pushdown predicates
- Consider Glue streaming for low latency
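
A pushdown predicate is a filter expression over partition columns that lets Glue skip non-matching partitions before reading any data. The helper below is a hypothetical convenience for building such an expression; the Glue call itself is shown only in a comment because it needs the Glue runtime.

```python
def pushdown_predicate(partitions: dict[str, str]) -> str:
    """Build a partition filter expression from column/value pairs."""
    return " and ".join(f"{col} == '{val}'" for col, val in partitions.items())


if __name__ == "__main__":
    predicate = pushdown_predicate({"year": "2024", "month": "01"})
    print(predicate)
    # → year == '2024' and month == '01'

    # Inside a Glue job (assumed database/table names), this string would be
    # passed roughly like so:
    #   glueContext.create_dynamic_frame.from_catalog(
    #       database="example_db",
    #       table_name="events",
    #       push_down_predicate=predicate,
    #       transformation_ctx="events_source",  # also enables job bookmarks
    #   )
```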
Remember: The exam tests your ability to choose the right service for the right use case. Focus on understanding trade-offs between services!