AWS Data Engineering Services - Quick Reference Cheat Sheet

Data Ingestion Services

Amazon Kinesis Family

Service | Use Case | Key Features | Limits
Kinesis Data Streams | Real-time streaming | Custom consumer apps; replay capability; low latency (< 1 second) | 1 MB record limit; 1,000 records/sec per shard
Kinesis Data Firehose | Near real-time delivery | Serverless; built-in transformation; direct S3/Redshift delivery | 60-second buffer minimum; no replay capability
Kinesis Analytics | Real-time analytics | SQL on streaming data; anomaly detection; time-windowed queries | SQL-based processing only
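
The per-shard limits above translate into a rough shard-count estimate for Kinesis Data Streams. The following is a minimal sketch, assuming the provisioned-mode write limits of 1,000 records/sec and 1 MB/sec per shard; the function name and the sample workload are illustrative, not part of any AWS SDK.

```python
import math

# Per-shard write limits for provisioned-mode Kinesis Data Streams.
RECORDS_PER_SEC_PER_SHARD = 1000
BYTES_PER_SEC_PER_SHARD = 1_000_000  # assumption: 1 MB treated as 10^6 bytes for simplicity


def estimate_shards(records_per_sec: float, avg_record_bytes: float) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    by_count = math.ceil(records_per_sec / RECORDS_PER_SEC_PER_SHARD)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / BYTES_PER_SEC_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_count, by_bytes)


if __name__ == "__main__":
    # 5,000 records/sec at 2 KB each is throughput-bound, not count-bound.
    print(estimate_shards(5000, 2000))
```

Whichever limit is hit first decides the shard count, so small records tend to be count-bound and large records throughput-bound.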

AWS Glue

Component | Purpose | Key Points
Glue Crawlers | Schema discovery | Auto-detect schema changes; populate Data Catalog; schedule-based or on-demand
Glue ETL Jobs | Data transformation | Serverless Spark; auto-scaling; built-in retry logic
Glue Data Catalog | Metadata repository | Hive-compatible metastore; integration with Athena/EMR; schema versioning
Glue DataBrew | Visual data preparation | No-code transformations; data profiling; 250+ built-in transformations

Data Storage Services

Amazon S3 Storage Classes

Storage Class | Use Case | Retrieval Time | Cost
Standard | Frequently accessed | Immediate | Highest storage cost
Standard-IA | Infrequently accessed | Immediate | Lower storage, retrieval fee
One Zone-IA | Non-critical, infrequent | Immediate | 20% less than Standard-IA
Glacier Instant Retrieval | Archive with instant access | Immediate | Lower storage cost
Glacier Flexible Retrieval | Archive data | Minutes to 12 hours, depending on retrieval tier | Very low storage cost
Glacier Deep Archive | Long-term archive | 12-48 hours | Lowest storage cost
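
Storage classes are usually applied through lifecycle rules rather than by hand. The sketch below builds one rule as a plain dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` expects under `LifecycleConfiguration["Rules"]`. The 30/90/365-day thresholds, the rule ID format, and the function name are assumptions chosen for illustration.

```python
def build_lifecycle_rule(prefix: str) -> dict:
    """Build one S3 lifecycle rule that tiers objects down as they age."""
    # Hypothetical naming scheme for the rule ID, derived from the prefix.
    label = prefix.strip("/").replace("/", "-") or "all"
    return {
        "ID": f"tier-down-{label}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},    # infrequent access
            {"Days": 90, "StorageClass": "GLACIER"},        # Glacier Flexible Retrieval
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # Glacier Deep Archive
        ],
    }


if __name__ == "__main__":
    config = {"Rules": [build_lifecycle_rule("raw/")]}
    # With boto3 installed, this would be passed as:
    # s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)
    print(config)
```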

Amazon Redshift

Feature | Description | Best Practice
Distribution Keys | How data is distributed across slices | Use JOIN columns; prefer high-cardinality columns to avoid skew; consider ALL for small dimension tables and EVEN when there is no clear join key
Sort Keys | Physical data ordering | Use WHERE clause columns; consider compound vs interleaved; limit to 3-4 columns
Compression | Reduce storage/IO | Use ANALYZE COMPRESSION; different encoding per column; automatic for new tables
Workload Management | Query prioritization | Separate queues by workload; set memory allocation; use concurrency scaling
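
To make the distribution and sort key guidance concrete, the sketch below renders a Redshift `CREATE TABLE` statement with `DISTSTYLE KEY` and a compound sort key. The helper name, table, and columns are hypothetical; only the SQL keywords are Redshift syntax.

```python
def build_create_table(table: str, columns: dict, dist_key: str, sort_keys: list) -> str:
    """Render CREATE TABLE DDL with a distribution key and a compound sort key."""
    unknown = [c for c in [dist_key, *sort_keys] if c not in columns]
    if unknown:
        raise ValueError(f"keys not defined as columns: {unknown}")
    column_sql = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE TABLE {table} (\n    {column_sql}\n)\n"
        f"DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"COMPOUND SORTKEY ({', '.join(sort_keys)});"
    )


if __name__ == "__main__":
    # customer_id is the join column; sale_date is the common WHERE filter.
    print(build_create_table(
        "sales",
        {"customer_id": "BIGINT", "sale_date": "DATE", "amount": "DECIMAL(12,2)"},
        dist_key="customer_id",
        sort_keys=["sale_date"],
    ))
```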

DynamoDB

Concept | Description | Guidelines
Partition Key | Primary hash key | High cardinality; uniform access pattern; avoid hot partitions
Sort Key | Range key for sorting | Enable range queries; model 1:N relationships; support query patterns
GSI/LSI | Secondary indexes | GSI: different partition key; LSI: same partition key; max 20 GSI per table
Capacity Modes | Billing model | On-Demand: unpredictable workloads; Provisioned: predictable workloads, cheaper
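
One common way to avoid hot partitions is write sharding: appending a deterministic suffix to a low-cardinality partition key so writes spread across several partitions. This is a minimal sketch of that idea, not a DynamoDB API; the function name, the `#` separator, and the default of 10 shards are assumptions.

```python
import hashlib


def sharded_partition_key(base_key: str, item_id: str, shard_count: int = 10) -> str:
    """Append a stable shard suffix so one logical key maps to several partitions."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    # Use the first 4 bytes of the hash as an integer, then bucket it.
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{base_key}#{shard}"


if __name__ == "__main__":
    # All events for one day would otherwise land on a single partition key.
    print(sharded_partition_key("2024-06-01", "order-12345"))
```

The trade-off is on the read side: fetching everything for `base_key` means querying every suffix from 0 to `shard_count - 1` and merging the results.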

Data Processing Services

Amazon EMR

Component | Purpose | Key Points
Master Node | Cluster management | Manages cluster; NameNode for HDFS; single point of failure
Core Nodes | Data storage + processing | Run HDFS DataNode + YARN NodeManager; HDFS storage; can be removed with care
Task Nodes | Processing only | No HDFS storage; Spot instances recommended; safe to terminate
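
The node roles above map onto the `InstanceGroups` list that boto3's `run_job_flow` accepts under `Instances`. The sketch below only builds that list as plain dictionaries; the function name, group names, and the `m5.xlarge` default are illustrative assumptions.

```python
def build_instance_groups(core_count: int, task_count: int,
                          instance_type: str = "m5.xlarge") -> list:
    """Build EMR instance groups: On-Demand master and core, Spot task nodes."""
    groups = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes hold HDFS blocks, so keep them on On-Demand capacity.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes store no HDFS data, so losing a Spot instance loses no data.
        groups.append({"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
                       "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=4):
        print(group)
```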

AWS Lambda

Aspect | Specification | Considerations
Runtime | 15 minutes max | Use Step Functions for longer workflows; consider EMR for heavy processing
Memory | 128 MB - 10 GB | CPU scales with memory; optimize for cost vs performance
Triggers | Event-driven | S3 events, Kinesis, DynamoDB Streams; EventBridge for schedules
Concurrency | 1,000 default | Can request increases; consider reserved concurrency
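
For the Kinesis trigger listed above, Lambda delivers each record's payload base64-encoded under `Records[].kinesis.data`. A minimal handler sketch follows, assuming producers write JSON payloads; the returned summary shape is an illustrative choice.

```python
import base64
import json


def handler(event, context):
    """Decode the JSON payload of every Kinesis record in the batch."""
    payloads = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))  # assumption: payloads are JSON
    return {"processed": len(payloads), "payloads": payloads}


if __name__ == "__main__":
    # Simulate the event Lambda would pass for a single-record batch.
    data = base64.b64encode(json.dumps({"id": 1}).encode("utf-8")).decode("ascii")
    print(handler({"Records": [{"kinesis": {"data": data}}]}, None))
```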

Analytics Services

Amazon Athena

Feature | Description | Optimization Tips
Serverless SQL | Query S3 data directly | Use columnar formats (Parquet); partition data by query patterns; compress data (GZIP, Snappy)
Query Engines | Presto/Trino based | Use appropriate data types; avoid SELECT * queries; use LIMIT for exploration
Workgroups | Query organization | Set data limits; control costs; separate environments
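
Partitioning for Athena usually means laying objects out under Hive-style `key=value` prefixes, so queries that filter on those keys scan fewer objects. The sketch below builds such an S3 key; the year/month/day scheme, function name, and sample paths are assumptions for illustration.

```python
from datetime import date


def partitioned_key(table_prefix: str, day: date, filename: str) -> str:
    """Build an S3 object key with Hive-style year/month/day partitions."""
    return (
        f"{table_prefix.rstrip('/')}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
        f"/{filename}"
    )


if __name__ == "__main__":
    print(partitioned_key("curated/orders/", date(2024, 3, 7), "part-0000.parquet"))
```

A query with `WHERE year = '2024' AND month = '03'` can then skip every prefix outside that month, provided the partitions are registered in the Data Catalog.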

Amazon QuickSight

Component | Purpose | Key Features
SPICE | In-memory engine | Fast query performance; automatic data refresh; 10 GB per dataset
Data Sources | Input connections | 30+ native connectors; direct query vs SPICE; row-level security
Dashboards | Visualization | Interactive dashboards; mobile responsive; embedded analytics

Security Services

AWS IAM for Data Engineering

Policy Type | Use Case | Example
Identity-based | User/role permissions | Glue job execution role
Resource-based | Cross-account access | S3 bucket policy
Session policies | Temporary restrictions | Federated access limits
Permissions boundaries | Maximum permissions | Developer sandbox limits
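
As an example of an identity-based policy, the sketch below builds a least-privilege policy document that a Glue job execution role might carry: read from a raw prefix, write to a curated prefix. The bucket name, prefixes, and function name are hypothetical; the document structure and S3 action names follow the standard IAM policy grammar.

```python
import json


def glue_job_s3_policy(bucket: str, read_prefix: str, write_prefix: str) -> dict:
    """Build an identity-based policy limiting a job to two S3 prefixes."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "ListBucket", "Effect": "Allow",
             "Action": "s3:ListBucket", "Resource": bucket_arn},
            {"Sid": "ReadRaw", "Effect": "Allow",
             "Action": "s3:GetObject", "Resource": f"{bucket_arn}/{read_prefix}*"},
            {"Sid": "WriteCurated", "Effect": "Allow",
             "Action": "s3:PutObject", "Resource": f"{bucket_arn}/{write_prefix}*"},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(glue_job_s3_policy("example-data-lake", "raw/", "curated/"), indent=2))
```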

AWS KMS

Key Type | Management | Use Case
AWS Managed | AWS controls | Default encryption; service-specific keys
Customer Managed | You control | Custom key policies; cross-account access; key rotation control
Customer Provided | You provide | Full control; higher complexity; import your own keys

Monitoring & Governance

CloudWatch for Data Pipelines

Metric Category | Examples | Alerting Strategy
Glue Jobs | Job duration; success/failure rate; DPU hours | Set SLA-based alarms; monitor cost metrics
Kinesis | IncomingRecords; WriteProvisionedThroughputExceeded | Shard-level monitoring; auto-scaling triggers
Redshift | CPU utilization; disk space; query performance | Performance alerts; storage warnings
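
An alarm on the Kinesis `WriteProvisionedThroughputExceeded` metric can be described as a set of parameters for boto3's `put_metric_alarm`. The sketch below only assembles those parameters; the alarm name format, the 60-second period, and the three-period evaluation window are assumptions to tune per pipeline.

```python
def kinesis_throttle_alarm(stream_name: str, evaluation_periods: int = 3) -> dict:
    """Build put_metric_alarm parameters that fire on sustained write throttling."""
    return {
        "AlarmName": f"{stream_name}-write-throttled",  # hypothetical naming scheme
        "Namespace": "AWS/Kinesis",
        "MetricName": "WriteProvisionedThroughputExceeded",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Sum",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No datapoints means no throttling, so do not alarm on gaps.
        "TreatMissingData": "notBreaching",
    }


if __name__ == "__main__":
    params = kinesis_throttle_alarm("clickstream")
    # With boto3 installed: boto3.client("cloudwatch").put_metric_alarm(**params)
    print(params)
```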

AWS Lake Formation

Feature | Purpose | Best Practice
Data Permissions | Fine-grained access control | Column-level permissions; row-level security; tag-based policies
Data Discovery | Catalog population | Automatic crawling; ML-powered classification; PII detection with Macie
Data Sharing | Cross-account access | Resource sharing; query federation; audit trails

Common Architecture Patterns

Lambda Architecture

Batch Layer: S3 → Glue/EMR → Redshift (historical data)
Speed Layer: Kinesis → Lambda → DynamoDB (real-time)
Serving Layer: Athena/QuickSight (unified view)

Kappa Architecture

Stream Processing: Kinesis → Kinesis Analytics → Output
Everything is treated as a stream, including batch data

Data Lake Pattern

Landing Zone (S3 Raw) → Processing (Glue/EMR) → 
Curated Zone (S3 Processed) → Analytics (Athena/Redshift)

Performance Optimization Quick Tips

S3 Optimization

  • Use prefixes to avoid hot spots
  • Multipart upload for files > 100MB
  • S3 Transfer Acceleration for global access
  • CloudFront for frequently accessed data
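
For multipart uploads, the part size has to respect two S3 limits: at most 10,000 parts per upload and at least 5 MiB per part (except the last). The sketch below picks a part size under those limits; the 64 MiB preferred size and the function name are assumptions.

```python
MIB = 1024 * 1024
MIN_PART_BYTES = 5 * MIB   # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000         # S3 maximum number of parts per upload


def plan_multipart(object_bytes: int, preferred_part_bytes: int = 64 * MIB) -> tuple:
    """Return (part_size_bytes, part_count) that fits within S3 multipart limits."""
    # Smallest part size that keeps the upload within MAX_PARTS (integer ceiling).
    size_for_part_cap = -(-object_bytes // MAX_PARTS)
    part_size = max(preferred_part_bytes, MIN_PART_BYTES, size_for_part_cap)
    part_count = max(1, -(-object_bytes // part_size))
    return part_size, part_count


if __name__ == "__main__":
    size, count = plan_multipart(1024 * MIB)  # a 1 GiB object
    print(size // MIB, "MiB parts x", count)
```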

Redshift Optimization

  • VACUUM regularly to reclaim space
  • ANALYZE to update table statistics
  • Use COPY command for bulk loads
  • Monitor query performance with system tables

Glue Optimization

  • Use job bookmarks for incremental processing
  • Optimize for fewer, larger files
  • Use pushdown predicates
  • Consider Glue streaming for low latency
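
A pushdown predicate is a filter expression over partition columns that Glue applies before reading data, so unneeded partitions are never listed or loaded. The sketch below builds such an expression as a string; the helper name and partition columns are hypothetical, and the string would be passed to `create_dynamic_frame.from_catalog` through its `push_down_predicate` argument inside a Glue job.

```python
def build_pushdown_predicate(partitions: dict) -> str:
    """Join partition column/value pairs into a Spark SQL filter expression."""
    clauses = [f"{column} == '{value}'" for column, value in partitions.items()]
    return " and ".join(clauses)


if __name__ == "__main__":
    predicate = build_pushdown_predicate({"year": "2024", "month": "03"})
    print(predicate)
    # Inside a Glue job (awsglue is only available in the Glue runtime):
    # frame = glue_context.create_dynamic_frame.from_catalog(
    #     database="analytics", table_name="orders",  # hypothetical names
    #     push_down_predicate=predicate,
    # )
```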

Remember: The exam tests your ability to choose the right service for the right use case. Focus on understanding trade-offs between services!