AWS Certified Data Engineer Associate - Complete Study Guide
Table of Contents
- Data Security and Governance
- Data Ingestion and Transformation
- Data Store Management
- Data Operations and Support
- Real-time Data Processing
- Analytics and Querying
- Workflow Orchestration
- Migration and Integration
- Cost Optimization Strategies
- Performance Optimization
- Exam Strategy and Tips
Data Security and Governance
AWS Lake Formation
Purpose: Centralized data governance for data lakes with minimal operational overhead
Key Features:
- Row and column-level access control: Fine-grained permissions without managing complex IAM policies
- Data filters: Restrict access based on user attributes (department, region, clearance level)
- Cross-account data sharing: Secure sharing without data duplication
- Integration: Works seamlessly with Athena, EMR, Redshift Spectrum
When to use:
- Need database, table, column, row, and cell-level access control
- Multiple teams accessing the same data lake with different permission requirements
- Cross-account data sharing scenarios
- Compliance requirements for data governance
Roles:
- Data Lake Administrator: Full administrative access to Lake Formation
- Database Creator: Can create new databases and manage their permissions
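As a concrete illustration of column-level grants, the boto3 sketch below grants a role SELECT on just two columns of a catalog table; the database, table, column, and role names are placeholders.

```python
import boto3

# Minimal sketch: grant column-level SELECT through Lake Formation.
# Database, table, column, and role names below are placeholders.
lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst-role"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date"],  # only these columns are visible
        }
    },
    Permissions=["SELECT"],
)
```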
Amazon Macie
Purpose: Automatically discover, classify, and protect sensitive data (PII) using machine learning
Key Features:
- ML-based PII detection: Automatically identifies credit cards, SSNs, passport numbers
- EventBridge integration: Trigger automated responses to PII discovery
- Detailed reporting: Comprehensive reports on data classification results
- Custom data identifiers: Define organization-specific sensitive data patterns
When to use:
- Automated PII detection and data classification
- Compliance requirements (GDPR, HIPAA, PCI DSS)
- Data discovery across large S3 environments
- Event-driven data protection workflows
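One way to wire Macie findings into an automated response is an EventBridge rule that matches Macie finding events and forwards them to a Lambda function. The sketch below assumes the standard Macie event source and detail type; the rule name and target ARN are placeholders.

```python
import json
import boto3

events = boto3.client("events")

# Illustrative rule: match Amazon Macie finding events and route them to a Lambda target.
events.put_rule(
    Name="macie-pii-findings",
    EventPattern=json.dumps({
        "source": ["aws.macie"],
        "detail-type": ["Macie Finding"],
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="macie-pii-findings",
    Targets=[{
        "Id": "remediation-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:redact-pii",  # placeholder
    }],
)
```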
Data Encryption Options
SSE-S3 (Server-Side Encryption with S3-Managed Keys):
- AWS manages all encryption keys
- Default encryption option
- Use when: Basic encryption needs, minimal key management overhead
SSE-KMS (Server-Side Encryption with AWS KMS):
- Customer control over encryption keys
- Audit trail through CloudTrail
- Key rotation capabilities
- Use when: Sensitive data requiring access control, compliance requirements, audit trails needed
SSE-C (Server-Side Encryption with Customer-Provided Keys):
- Customer manages encryption keys completely
- AWS doesn’t store the keys
- Use when: Strict key management requirements, regulatory compliance demanding customer key control
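For example, requesting SSE-KMS on a single upload is one extra argument on the put call; the bucket, object key, and KMS key alias below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload an object encrypted with a customer-managed KMS key (SSE-KMS).
# Bucket, object key, and KMS key alias are placeholders.
s3.put_object(
    Bucket="example-data-lake",
    Key="raw/customers/2024/01/15/customers.json",
    Body=b'{"customer_id": 1}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/data-lake-key",
)
```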
AWS Secrets Manager vs Systems Manager Parameter Store
AWS Secrets Manager:
- Best for: Database credentials, API keys, sensitive data
- Features: Built-in automatic rotation, native RDS/Redshift integration
- Cost: Higher cost but better security features
- Use when: Managing database passwords, need automatic rotation
Systems Manager Parameter Store:
- Best for: Configuration data, application settings
- Features: No built-in rotation for credentials, lower cost
- Use when: Storing configuration parameters, non-sensitive data
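The retrieval calls differ only slightly; the sketch below fetches a database password from Secrets Manager and a configuration value from Parameter Store (secret and parameter names are placeholders).

```python
import json
import boto3

# Secrets Manager: sensitive credentials with optional automatic rotation.
secrets = boto3.client("secretsmanager")
secret = secrets.get_secret_value(SecretId="prod/redshift/etl-user")  # placeholder name
db_password = json.loads(secret["SecretString"])["password"]

# Parameter Store: plain configuration values (SecureString also supported).
ssm = boto3.client("ssm")
param = ssm.get_parameter(Name="/etl/batch-size", WithDecryption=True)  # placeholder name
batch_size = int(param["Parameter"]["Value"])
```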
Data Ingestion and Transformation
AWS Glue
Core Components:
Data Catalog:
- Central metadata repository for all data sources
- Automatically populated by crawlers
- Integrates with Athena, EMR, Redshift Spectrum
Crawlers:
- Automatically discover and catalog data schemas
- Handle schema evolution over time
- Support various data sources (S3, RDS, DynamoDB)
- Best Practice: Schedule crawlers to run after ETL jobs complete
ETL Job Types:
Standard Jobs:
- Dedicated resources for time-sensitive workloads
- Consistent performance and predictable execution time
- Use when: Production workloads, SLA requirements, time-critical processing
Flex Jobs:
- Run on spare compute capacity for up to 34% cost reduction
- Variable execution time
- Use when: Non-urgent workloads, cost optimization priority, batch processing
Python Shell Jobs:
- Most cost-effective for small files (<30MB)
- Supports fractional capacity down to 0.0625 DPU (1/16 DPU)
- Use when: Simple transformations, small data volumes, API calls
Ray Jobs:
- Scale Python workloads across distributed clusters
- Use when: Large-scale Python processing, machine learning workloads
Streaming ETL:
- Real-time data processing
- Use when: Near real-time transformation requirements
Advanced Features:
Glue DataBrew:
- Visual, no-code data preparation
- Pre-built transformations for common tasks
- Data profiling and quality assessment
- Use when: Business users need to prepare data, exploratory data analysis
Glue Studio:
- Visual interface for ETL job creation
- Supports both visual and code-based development
- Use when: Rapid ETL development, visual job design preference
Schema Registry:
- Centralized schema management for streaming data
- Schema evolution with versioning
- Integration with Kinesis and MSK
- Use when: Streaming data with evolving schemas, producer/consumer validation
Data Quality:
- Built-in data quality rules and validation
- DQDL (Data Quality Definition Language)
- Use when: Data validation requirements, quality monitoring
Performance Optimization:
- DPU Optimization: Monitor job metrics to determine optimal DPU allocation
- Pushdown Predicates: Filter data at source to reduce processing
- Partition Pruning: Organize data by date/region for efficient querying
- File Size: Keep files around 128MB for optimal processing
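The pushdown-predicate and partition-pruning points above come down to a single argument when reading from the Data Catalog inside a Glue Spark job; the database, table, and partition columns in this sketch are placeholders.

```python
# Sketch of a Glue Spark job that filters partitions at the source.
# Database, table, and partition column names are placeholders.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only partitions matching the predicate are listed and read from S3,
# reducing both the data scanned and the DPU time consumed.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
    push_down_predicate="year='2024' AND month='01'",
)
```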
Amazon AppFlow
Purpose: Fully managed integration service for SaaS applications
Trigger Types:
- Run on demand: Manual execution for ad-hoc transfers
- Run on event: Event-driven from SaaS applications (new Salesforce record)
- Run on schedule: Time-based execution (daily, hourly)
Transfer Types:
- Incremental transfer: Only changed/new records (most efficient)
- Full transfer: Complete dataset snapshot
When to use:
- Integrating SaaS applications (Salesforce, ServiceNow, Slack) with AWS
- Need for secure, managed data transfer
- Minimal operational overhead requirement
Amazon Data Firehose
Purpose: Real-time streaming data delivery with minimal operational overhead
Key Features:
- Format conversion: Automatically converts JSON to Parquet/ORC
- Compression: Built-in data compression
- Destinations: S3, Redshift, OpenSearch, HTTP endpoints (NOT Amazon Timestream)
- Buffering: Configurable size and time-based delivery
- Serverless: No infrastructure management
When to use:
- Near real-time data streaming (minimum 60-second buffering latency)
- Need format conversion (JSON to Parquet)
- Delivering to multiple destinations
- Minimal operational overhead priority
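Producers only write records to the delivery stream; Firehose handles buffering, optional format conversion, and delivery. A minimal producer sketch (the stream name is a placeholder):

```python
import json
import boto3

firehose = boto3.client("firehose")

# Write one JSON record to a Firehose delivery stream (stream name is a placeholder).
# Firehose buffers records and delivers them to the configured destination,
# optionally converting JSON to Parquet/ORC along the way.
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps({"user_id": 42, "action": "page_view"}) + "\n").encode("utf-8")},
)
```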
Amazon Kinesis Data Streams
Purpose: Real-time data streaming with sub-second latency
Key Concepts:
- Shards: Unit of capacity (1 MB/sec input, 2 MB/sec output per shard)
- Retention: 24 hours to 365 days
- Resharding: Scale using SplitShard/MergeShard commands
Advanced Features:
- Enhanced Fan-Out: Dedicated 2 MB/sec per consumer per shard
- Server-side encryption: KMS integration
- Cross-region replication: Data redundancy
When to use:
- Real-time data ingestion with sub-second latency requirements
- High-throughput streaming data
- Multiple consumers need real-time access
- Custom application processing requirements
Kinesis vs Firehose Decision Matrix:
- Use Kinesis Data Streams when: Real-time processing, multiple consumers, custom applications
- Use Firehose when: Simple delivery to destinations, format conversion needed, minimal management
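In a Kinesis Data Streams producer, the partition key chosen per record determines which shard it lands on, which is why high-cardinality keys matter for even shard utilization. A minimal sketch (stream name and payload are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Write one record to a Kinesis data stream (stream name is a placeholder).
# The PartitionKey is hashed to choose a shard, so a high-cardinality value
# such as a device or user ID spreads load evenly across shards.
kinesis.put_record(
    StreamName="sensor-events",
    Data=json.dumps({"device_id": "sensor-0042", "temp_c": 21.7}).encode("utf-8"),
    PartitionKey="sensor-0042",
)
```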
Data Store Management
Amazon S3
Storage Classes:
Standard:
- Frequent access, immediate availability
- Use when: Actively used data, high availability requirements
Standard-IA (Infrequent Access):
- Infrequent access but rapid retrieval when needed
- 30-day minimum storage duration
- Use when: Backup data, disaster recovery, long-term storage with occasional access
Intelligent-Tiering:
- Automatic optimization based on access patterns
- No retrieval fees for frequent access
- Use when: Unknown or changing access patterns, automatic cost optimization
Glacier Flexible Retrieval:
- Long-term archival, 1-5 minute to 12-hour retrieval
- Use when: Archival data with occasional access needs
Glacier Deep Archive:
- Lowest cost, 12-48 hours retrieval time
- Use when: Long-term archival, rarely accessed data, compliance requirements
Advanced Features:
Lifecycle Policies:
- Automate transitions between storage classes
- Delete expired objects automatically
- Best Practice: Implement based on data access patterns
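The lifecycle-policy idea above can be expressed in a single API call; the bucket name, prefix, transition thresholds, and expiration period below are illustrative, not recommendations.

```python
import boto3

s3 = boto3.client("s3")

# Illustrative lifecycle policy: move raw data to Standard-IA after 30 days,
# archive to Glacier Deep Archive after 180 days, and expire after ~7 years.
# Bucket name, prefix, and day thresholds are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 2555},
        }]
    },
)
```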
Transfer Acceleration:
- Use CloudFront edge locations for faster uploads
- Use when: Global users uploading large files
Event Notifications:
- Trigger actions on object creation/deletion
- Integrate with Lambda, SQS, SNS
- Use when: Event-driven data processing
S3 Object Lambda:
- Transform data during retrieval using Lambda functions
- PII redaction, data masking
- Use when: Dynamic data transformation without creating copies
Amazon Redshift
Core Concepts:
- Fully managed, petabyte-scale data warehouse
- Columnar storage with massively parallel processing (MPP)
- Best price-performance for analytics workloads
Distribution Styles:
KEY Distribution:
- Distribute based on specific column values
- Use when: Large tables with frequent joins on specific columns
- Example: Distribute orders table on customer_id for customer analysis
ALL Distribution:
- Copy entire table to all nodes
- Use when: Small, frequently joined dimension tables (<10M rows)
- Example: Country lookup tables, product categories
EVEN Distribution:
- Round-robin distribution (default)
- Use when: Tables without clear join patterns, fact tables without dominant join column
AUTO Distribution:
- Redshift automatically chooses optimal distribution
- Use when: Uncertain about optimal distribution strategy
Performance Optimization:
Sort Keys:
- Compound Sort Keys: Most common, define sort order priority
- Interleaved Sort Keys: Equal weight to all columns, better for diverse query patterns
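Distribution style and sort keys are declared in the table DDL. The sketch below submits such a statement through the Redshift Data API against a serverless workgroup; the workgroup, database, and table design are placeholders.

```python
import boto3

# Run DDL through the Redshift Data API (workgroup, database, and schema are placeholders).
rsd = boto3.client("redshift-data")

ddl = """
CREATE TABLE sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)          -- co-locate rows frequently joined on customer_id
COMPOUND SORTKEY (order_date); -- range-restricted scans on date predicates
"""

rsd.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql=ddl,
)
```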
VACUUM Operations:
- VACUUM FULL: Reclaim space and re-sort all data
- VACUUM DELETE ONLY: Reclaim space from deleted rows
- VACUUM SORT ONLY: Re-sort data without reclaiming space
- VACUUM REINDEX: For interleaved sort keys optimization
Query Optimization:
- Materialized Views: Pre-computed results for complex queries with auto-refresh
- Query Result Caching: Reuse results of identical queries while the underlying data is unchanged
- LIKE vs REGEX: Use LIKE operator for pattern matching (faster than regex)
- Stored Procedures: Encapsulate multiple SQL statements, reduce network traffic
Advanced Features:
Redshift Serverless:
- Pay-per-use model, automatic scaling
- Use when: Variable workloads, development/testing, cost optimization
Redshift Spectrum:
- Query S3 data without loading into Redshift
- Use when: Infrequently accessed data, data lake integration
Streaming Ingestion:
- Direct ingestion from Kinesis Data Streams using materialized views
- Use when: Real-time analytics on streaming data
Data Sharing:
- Share live data between clusters without copying
- Use when: Cross-team data access, multi-tenant scenarios
Concurrency Scaling:
- Automatically add capacity for concurrent queries
- Use when: Unpredictable query spikes, mixed workloads
Monitoring:
- STL_ALERT_EVENT_LOG: Flags queries with potential performance problems identified by the query planner
- Workload Management (WLM): Manage query priorities and concurrency
Amazon DynamoDB
Performance Characteristics:
- Single-digit millisecond latency
- Automatic scaling based on utilization
- Serverless with on-demand pricing option
Design Best Practices:
Partition Key Design:
- High Cardinality: Use PRODUCT_ID instead of CATEGORY_NAME
- Even Distribution: Avoid hot partitions
- Access Patterns: Design based on query requirements
Capacity Management:
- Provisioned: Predictable workloads, reserved capacity options
- On-Demand: Variable workloads, pay-per-request
Monitoring:
- CloudWatch Contributor Insights: Identify hot partitions and access patterns
- Throttling Analysis: Monitor capacity limits
Advanced Features:
- TTL (Time to Live): Automatically delete expired items
- Global Tables: Multi-region replication
- Point-in-time Recovery: Backup and restore capabilities
- DynamoDB Streams: Change data capture
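Putting the partition-key and TTL guidance above together, a minimal table definition might look like this; the table and attribute names are placeholders.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Create an on-demand table keyed on a high-cardinality attribute
# (table and attribute names are placeholders).
dynamodb.create_table(
    TableName="product_events",
    AttributeDefinitions=[{"AttributeName": "product_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "product_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity for unpredictable traffic
)

# Enable TTL so expired items are deleted automatically.
dynamodb.get_waiter("table_exists").wait(TableName="product_events")
dynamodb.update_time_to_live(
    TableName="product_events",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)
```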
Amazon Aurora
Integration Features:
- Native Lambda Functions: Invoke Lambda from Aurora
- Zero-ETL Integration: Direct integration with Redshift
- External Metastore: Serve as Hive metastore for EMR
Performance:
- Up to 5x the throughput of standard MySQL and up to 3x that of standard PostgreSQL
- Auto-scaling storage up to 128TB
- Read replicas for scaling read workloads
Amazon Athena
Purpose: Serverless SQL queries on S3 data with pay-per-query pricing
Optimization Techniques:
Partition Projection:
- Compute partition values from configured patterns instead of looking them up in the Glue Data Catalog
- Use when: Highly partitioned data with predictable patterns
- Example: Date-based partitions (year/month/day)
Query Result Caching:
- Reuse identical query results for cost reduction
- Configurable maximum reuse age per query
- Best Practice: Enable for repeated analytical queries
Columnar Formats:
- Use Parquet/ORC for 30-90% cost reduction
- Better compression and predicate pushdown
- Always use for analytical workloads
Advanced Features:
Federated Queries:
- Query multiple data sources using Lambda connectors
- Connect to RDS, DynamoDB, Redis, HBase
- Use when: Need to join data across different systems
Workgroups:
- Separate query execution across teams
- Enforce cost controls and query limits
- Use when: Multi-tenant scenarios, cost management
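Queries are submitted through a workgroup, which is also where result reuse, cost limits, and result locations are enforced. A minimal submission sketch (database, workgroup, output bucket, and table are placeholders):

```python
import boto3

athena = boto3.client("athena")

# Submit a query to a workgroup (database, workgroup, and output bucket are placeholders).
# Partition predicates on year/month keep the amount of S3 data scanned small.
response = athena.start_query_execution(
    QueryString="""
        SELECT customer_id, SUM(amount) AS total
        FROM sales.orders
        WHERE year = '2024' AND month = '01'
        GROUP BY customer_id
    """,
    QueryExecutionContext={"Database": "sales"},
    WorkGroup="analytics-team",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```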
Apache Iceberg Support:
- ACID transactions in data lakes
- Time travel queries
- Schema evolution
- Use when: Need ACID properties, versioning requirements
Data Operations and Support
Monitoring and Troubleshooting
AWS CloudTrail:
- Log API calls and data events
- Audit trail for compliance
- Use when: Security auditing, compliance requirements, troubleshooting
Amazon CloudWatch:
- Monitor performance metrics across all services
- Custom metrics and alarms
- Container Insights: Monitor EKS/ECS applications
- Use when: Performance monitoring, alerting, operational visibility
Performance Insights:
- Database performance analysis for RDS and Aurora
- Wait event analysis and top SQL identification
- Use when: Database performance troubleshooting
Service-Specific Monitoring:
- Glue: Job profiler and CloudWatch for DPU optimization
- Redshift: STL_ALERT_EVENT_LOG for performance issues
- Kinesis: Iterator age monitoring for processing lag
Workflow Orchestration
AWS Step Functions:
- Coordinate multiple AWS services into workflows
- State machine-based orchestration
State Types:
- Parallel State: Execute multiple branches simultaneously
- Map State: Process arrays of data in parallel
- Choice State: Conditional branching logic
- Wait State: Introduce delays in workflows
- Task State: Execute specific actions
When to use:
- Complex workflows with conditional logic
- Error handling and retry requirements
- Coordinating multiple AWS services
- Visual workflow representation needs
Amazon MWAA (Managed Workflows for Apache Airflow):
- Managed Apache Airflow for complex workflows
- DAG-based workflow definition
Key Features:
- requirements.txt: Install Python packages
- SSH Connections: Use apache-airflow-providers-ssh package
- Web UI: Visual workflow monitoring
When to use:
- Complex ETL workflows with dependencies
- Python-based workflow logic
- Integration with non-AWS systems
- Existing Airflow knowledge
AWS Glue Workflows:
- Orchestrate Glue jobs and crawlers
- Built-in monitoring and dependency management
- Use when: ETL-specific orchestration, simpler than Step Functions
Real-time Data Processing
Amazon Kinesis Ecosystem
Kinesis Data Streams:
- Real-time data ingestion with sub-second latency
- Shard-based scaling model
Key Concepts:
- Shards: 1 MB/sec input, 2 MB/sec output per shard
- Retention: 24 hours to 365 days
- Partition Keys: Determine shard assignment
Scaling Operations:
- SplitShard: Increase capacity by splitting shards
- MergeShard: Reduce capacity by merging adjacent shards
- Enhanced Fan-Out: Dedicated 2 MB/sec per consumer per shard
Lambda Integration:
- Parallelization Factor: Control concurrent executions per shard
- Batch Size: Number of records per invocation
- Starting Position: TRIM_HORIZON, LATEST, AT_TIMESTAMP
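Those Lambda settings live on the event source mapping rather than on the function itself; in this sketch the stream ARN, function name, and tuning values are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# Attach a Lambda consumer to a Kinesis stream (ARN and function name are placeholders).
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/sensor-events",
    FunctionName="process-sensor-events",
    StartingPosition="TRIM_HORIZON",  # read from the oldest available record
    BatchSize=500,                    # records per invocation
    ParallelizationFactor=4,          # concurrent batches per shard
)
```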
Amazon Managed Service for Apache Flink:
- Real-time stream processing with SQL, Java, Scala, Python
- Windowing operations for time-based analytics
Window Types:
- Tumbling Windows: Non-overlapping fixed intervals
- Sliding Windows: Overlapping time windows for continuous analysis
- Session Windows: Based on data activity
When to use:
- Complex stream processing logic
- Real-time analytics and aggregations
- Multiple data sources correlation
Integration Patterns
Event-Driven Architecture:
- S3 Event Notifications → Lambda → Processing
- EventBridge Rules → Scheduled workflows
- Kinesis → Lambda → DynamoDB pattern
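A minimal handler for the S3-to-Lambda pattern above might look like the following; the processing step is left as a placeholder.

```python
import urllib.parse

def handler(event, context):
    """Triggered by S3 ObjectCreated notifications; processing is a placeholder."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Placeholder: download the object, transform it, write results downstream.
        print(f"New object: s3://{bucket}/{key}")
```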
Cross-Account Streaming:
- Stream logs from production to audit accounts
- CloudWatch Logs subscription filters
- Cross-account IAM roles
Analytics and Querying
Amazon Athena Advanced
Query Optimization:
- Partition Projection: Calculate partitions dynamically for better performance
- Columnar Formats: Always use Parquet/ORC for analytical workloads
- Compression: Use appropriate compression (Snappy for Parquet)
- File Sizing: Optimize file sizes (128MB-1GB) for parallel processing
Federated Queries:
- Query multiple data sources without moving data
- Lambda connectors for RDS, DynamoDB, HBase, Redis
- Use when: Need to join data across different systems
Workgroups:
- Separate teams and enforce settings
- Cost controls and query limits
- Result location management
- Use when: Multi-tenant scenarios, cost governance
Amazon QuickSight
Key Features:
- VPC Connections: Secure access to databases in private subnets
- Data Sources: Connect to AWS and external data sources
- Real-time Dashboards: Live data visualization
Workflow Orchestration
AWS Step Functions
Core Concepts: Serverless workflow orchestration using state machines
State Types:
- Task State: Execute Lambda functions, start EMR clusters, run Glue jobs
- Parallel State: Execute multiple branches simultaneously for concurrent processing
- Choice State: Conditional branching based on input values
- Wait State: Pause execution for specified time or until timestamp
- Map State: Process arrays of data in parallel
Integration Patterns:
- Request-Response: Synchronous execution
- Run Job: Asynchronous with polling
- Callback: Wait for task token
When to use:
- Complex workflows with conditional logic
- Error handling and retry requirements
- Visual workflow representation needs
- Coordinating multiple AWS services
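A small Amazon States Language definition combining a Task, a Choice, and terminal states could look like the sketch below; the Lambda ARN, IAM role ARN, and state names are placeholders.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Illustrative state machine: run a validation Lambda, then branch on its output.
# Lambda ARN, role ARN, and state names are placeholders.
definition = {
    "StartAt": "ValidateBatch",
    "States": {
        "ValidateBatch": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-batch",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 3}],
            "Next": "IsValid",
        },
        "IsValid": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.isValid", "BooleanEquals": True, "Next": "Succeed"}],
            "Default": "Fail",
        },
        "Succeed": {"Type": "Succeed"},
        "Fail": {"Type": "Fail", "Error": "ValidationError"},
    },
}

sfn.create_state_machine(
    name="batch-validation",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-exec",
)
```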
Amazon MWAA (Managed Workflows for Apache Airflow)
Core Concepts: Managed Apache Airflow for complex workflow orchestration
Key Features:
- DAGs: Define workflows as Directed Acyclic Graphs
- requirements.txt: Install Python packages (apache-airflow-providers-ssh for SSH connections)
- Variables and Connections: Manage configuration and external system connections
- Web UI: Monitor and troubleshoot workflows
When to use:
- Complex ETL workflows with dependencies
- Python-based workflow logic
- Integration with external systems
- Need for rich scheduling capabilities
- Existing Airflow expertise
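An MWAA DAG is an ordinary Airflow DAG uploaded to the environment's dags folder in S3. The sketch below simply triggers a Glue job and then its crawler via boto3 (fire-and-forget for brevity; real DAGs would typically poll or use the Amazon provider's Glue operators and sensors); the DAG ID, job name, crawler name, and schedule are placeholders.

```python
# Sketch of an MWAA DAG: run a Glue job, then start its crawler.
# DAG ID, job name, crawler name, and schedule are placeholders.
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def start_glue_job():
    boto3.client("glue").start_job_run(JobName="orders-etl")  # placeholder job name


def start_crawler():
    boto3.client("glue").start_crawler(Name="orders-crawler")  # placeholder crawler name


with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="run_glue_job", python_callable=start_glue_job)
    crawl = PythonOperator(task_id="run_crawler", python_callable=start_crawler)
    etl >> crawl
```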
Amazon EventBridge
Core Concepts: Serverless event bus for event-driven architectures
Features:
- Event Rules: Filter and route events to targets
- Scheduled Rules: Cron-based job scheduling
- Custom Event Buses: Isolate different application events
- Event Replay: Replay events for debugging
Event Patterns: Filter events before triggering targets based on event content
When to use:
- Event-driven architectures
- Scheduled job execution
- Decoupling applications
- Real-time event processing
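A scheduled rule plus a target is enough for cron-style kicks of a pipeline; the rule name, cron expression, and target ARN below are placeholders.

```python
import boto3

events = boto3.client("events")

# Run a nightly job at 02:00 UTC (rule name, schedule, and target ARN are placeholders).
events.put_rule(
    Name="nightly-etl-trigger",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="nightly-etl-trigger",
    Targets=[{
        "Id": "start-etl",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:start-etl",
    }],
)
```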
Migration and Integration
AWS Database Migration Service (DMS)
Core Concepts: Migrate databases with minimal downtime
Migration Types:
- Full Load: One-time complete migration
- CDC (Change Data Capture): Ongoing replication of changes
- Full Load + CDC: Complete migration followed by ongoing sync
Source Requirements for CDC:
- SQL Server: Enable transaction logs
- MySQL: Enable binary logs
- Oracle: Enable archived redo logs
Data Validation: Built-in validation to ensure data consistency
Target Formats: Can write migrated data to S3 in Parquet for better analytical query performance
When to use:
- Database migrations to AWS
- Ongoing replication between databases
- Database consolidation projects
- Zero-downtime migration requirements
AWS Schema Conversion Tool (SCT)
Purpose: Convert database schemas between different engines
Capabilities:
- Schema conversion (Oracle to PostgreSQL, SQL Server to Aurora)
- Code conversion for stored procedures and functions
- Assessment reports for migration complexity
AWS DataSync
Purpose: Automated, high-speed data transfer service
Features:
- Multiple Destinations: Transfer to multiple S3 buckets simultaneously
- Bandwidth Optimization: Automatic network optimization
- Scheduling: Periodic data synchronization
- Verification: Data integrity checks
When to use:
- Large-scale data migration to AWS
- Regular data synchronization
- Hybrid cloud architectures
- Archive data movement
Cost Optimization Strategies
Service-Specific Optimizations
AWS Glue:
- Flex Jobs: Up to 34% cost reduction for non-urgent workloads
- Python Shell: Most cost-effective for small files (<30MB)
- DPU Optimization: Monitor and right-size based on job metrics
- Job Bookmarks: Avoid reprocessing data
Amazon S3:
- Lifecycle Policies: Automate transitions to cheaper storage classes
- Intelligent-Tiering: Automatic optimization for unknown access patterns
- Compression: Reduce storage costs and transfer time
- Delete Incomplete Multipart Uploads: Clean up abandoned uploads
Amazon Redshift:
- Serverless: Pay-per-use for variable workloads
- Reserved Instances: 1- or 3-year commitments for predictable workloads
- Pause/Resume: Pause clusters during non-business hours
- Query Result Caching: Avoid redundant query execution
- UNLOAD to S3: Move infrequently accessed data to cheaper storage
Amazon Athena:
- Query Result Caching: Reuse identical query results
- Columnar Formats: Reduce data scanned (30-90% cost reduction)
- Partitioning: Limit data scanned per query
- Compression: Reduce storage and scanning costs
General Cost Optimization Principles
Right-Sizing:
- Monitor resource utilization with CloudWatch
- Use AWS Cost Explorer for usage analysis
- Implement auto-scaling where appropriate
Reserved Capacity:
- Use for predictable workloads
- Available for Redshift, RDS, DynamoDB
- 1- or 3-year commitment options
Spot Instances:
- Use for fault-tolerant workloads
- EMR clusters with proper checkpointing
- Development and testing environments
Performance Optimization
Data Format Optimization
Columnar Formats (Parquet/ORC):
- Benefits: Predicate pushdown, column pruning, better compression
- Use Cases: Analytical workloads, data warehousing, OLAP queries
- Compression: Use Snappy for Parquet (good compression + fast queries)
Row-Based Formats (JSON/CSV):
- Use Cases: Operational workloads, OLTP systems, data ingestion
- Limitations: Poor analytical performance, higher storage costs
Partitioning Strategies
Time-Based Partitioning:
- Partition by year/month/day for time-series data
- Example: s3://bucket/year=2024/month=01/day=15/
- Benefits: Query performance, cost optimization
Categorical Partitioning:
- Partition by frequently filtered dimensions
- Example: s3://bucket/region=us-east-1/department=sales/
- Avoid: High cardinality partitions (>10,000 partitions)
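Partition layouts like the examples above are produced at write time. A PySpark sketch, with paths and column names as placeholders:

```python
# Sketch: write compressed, partitioned Parquet with PySpark.
# Paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

events = spark.read.json("s3://example-raw/events/")

(
    events.write.mode("overwrite")
    .partitionBy("year", "month", "day")  # produces year=/month=/day= prefixes
    .option("compression", "snappy")      # good balance of size and query speed
    .parquet("s3://example-curated/events/")
)
```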
Query Performance Tuning
Amazon Redshift:
- Choose appropriate distribution and sort keys
- Use materialized views for repeated complex queries
- Monitor STL_ALERT_EVENT_LOG for performance issues
- Implement proper VACUUM schedules
Amazon Athena:
- Use partition projection for highly partitioned data
- Implement columnar storage formats
- Optimize file sizes (128MB-1GB)
- Use workgroups for resource management
AWS Glue:
- Monitor DPU utilization and adjust accordingly
- Use pushdown predicates to filter data at source
- Optimize file sizes and formats
- Use job bookmarks to avoid reprocessing
Infrastructure Optimization
Auto Scaling:
- DynamoDB: Auto Scaling based on utilization
- EMR: Auto Scaling for dynamic workloads
- Redshift: Concurrency Scaling for query spikes
Compute Optimization:
- Use appropriate instance types for workloads
- Graviton instances for better price-performance
- Serverless options to avoid idle costs
Exam Strategy and Tips
Question Pattern Recognition
“Least operational overhead”:
- Choose managed/serverless services over self-managed
- Prefer AWS Glue over EMR for ETL
- Choose Athena over self-managed Spark clusters
- Select Redshift Serverless over provisioned clusters
“Cost-effective”:
- Consider Glue Flex jobs for non-urgent workloads
- Use appropriate S3 storage classes
- Implement lifecycle policies
- Choose serverless options for variable workloads
“Real-time”:
- Kinesis Data Streams for sub-second latency
- Lambda for event-driven processing
- Streaming ETL with Glue or Flink
- Enhanced Fan-Out for dedicated throughput
“Large datasets”:
- EMR for big data processing frameworks
- Redshift for data warehousing
- Distributed processing patterns
- Appropriate partitioning strategies
Service Selection Guidelines
Data Transformation:
- AWS Glue: Managed ETL, serverless, broad integration
- EMR: More control, big data frameworks, custom processing
- Lambda: Event-driven, small-scale transformations
Analytics:
- Athena: Ad-hoc queries, serverless, pay-per-query
- Redshift: Data warehouse, complex analytics, consistent performance
- EMR: Big data analytics, ML workloads, custom frameworks
Storage:
- S3: Data lake, archival, web-scale storage
- Redshift: Structured data warehouse storage
- DynamoDB: NoSQL, single-digit millisecond latency
Integration:
- AppFlow: SaaS application integration
- Glue: General ETL and data catalog
- DMS: Database migration and replication
Key Concepts to Master
Schema Evolution:
- Glue Schema Registry for streaming data
- Handle changing data structures over time
- Version management and compatibility
Data Lineage:
- Track data flow and transformations
- Important for ML model governance
- Compliance and audit requirements
Multi-Tenancy:
- Secure data isolation between tenants
- Lake Formation for fine-grained access control
- Workgroups in Athena for team separation
Compliance and Auditing:
- CloudTrail for API auditing
- Data encryption at rest and in transit
- PII detection and redaction
- Proper access controls and monitoring
Advanced Architectural Patterns
Event-Driven Data Processing:
- S3 events trigger Lambda for immediate processing
- EventBridge for complex event routing
- Kinesis for real-time stream processing
Serverless Data Lakes:
- S3 for storage, Athena for querying
- Glue for ETL and data catalog
- Lambda for event-driven processing
Data Mesh Architecture:
- Distributed data ownership
- Lake Formation for governance
- Cross-account data sharing
Hybrid and Multi-Cloud:
- DataSync for data movement
- Storage Gateway for hybrid storage
- Direct Connect for dedicated connectivity
Common Pitfalls to Avoid
Storage Class Selection:
- Don’t use Glacier for frequently accessed data
- Consider retrieval times for Glacier classes
- Use Intelligent-Tiering for unknown patterns
Service Limitations:
- Lambda 15-minute execution limit
- Firehose minimum 60-second latency
- Athena query timeout considerations
Security Misconfigurations:
- Always use least privilege access
- Enable encryption for sensitive data
- Use Secrets Manager for credentials, not Parameter Store
Performance Anti-Patterns:
- Avoid small files in analytical workloads
- Don’t skip compression for storage optimization
- Avoid hot partitions in DynamoDB
Final Exam Preparation Checklist
Core Services Mastery
- Understand when to use each service vs alternatives
- Know integration patterns between services
- Master cost optimization strategies
- Understand security best practices
Hands-On Experience
- Practice creating ETL jobs in Glue
- Set up data lakes with proper partitioning
- Configure Redshift clusters and optimize queries
- Implement real-time streaming with Kinesis
Scenario-Based Thinking
- Practice choosing services based on requirements
- Understand trade-offs between options
- Know when to prioritize cost vs performance vs operational overhead
Security and Compliance
- Master IAM roles and policies for data services
- Understand encryption options and when to use each
- Know Lake Formation for data governance
- Understand PII detection and data classification
This comprehensive study guide covers all essential concepts for the AWS Certified Data Engineer Associate exam. Focus on understanding not just what each service does, but when and why to use it in different scenarios. Practice with hands-on labs and scenario-based questions to reinforce your learning.