AWS Certified Data Engineer Associate - Complete Study Guide

Table of Contents

  1. Data Security and Governance
  2. Data Ingestion and Transformation
  3. Data Store Management
  4. Data Operations and Support
  5. Real-time Data Processing
  6. Analytics and Querying
  7. Workflow Orchestration
  8. Migration and Integration
  9. Cost Optimization Strategies
  10. Performance Optimization
  11. Exam Strategy and Tips

Data Security and Governance

AWS Lake Formation

Purpose: Centralized data governance for data lakes with minimal operational overhead

Key Features:

  • Row and column-level access control: Fine-grained permissions without managing complex IAM policies
  • Data filters: Restrict access based on user attributes (department, region, clearance level)
  • Cross-account data sharing: Secure sharing without data duplication
  • Integration: Works seamlessly with Athena, EMR, Redshift Spectrum

When to use:

  • Need database, table, column, row, and cell-level access control
  • Multiple teams accessing the same data lake with different permission requirements
  • Cross-account data sharing scenarios
  • Compliance requirements for data governance

Roles:

  • Data Lake Administrator: Full administrative access to Lake Formation
  • Database Creator: Can create new databases and manage their permissions
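
As a concrete illustration, fine-grained access can be granted through the Lake Formation API. The following is a minimal boto3 sketch of a column-level grant; the role ARN, database, table, and column names are placeholders, not values from this guide.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on a subset of columns to an analyst role (names are placeholders).
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date", "region"],  # column-level scope
        }
    },
    Permissions=["SELECT"],
)
```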

Amazon Macie

Purpose: Automatically discover, classify, and protect sensitive data (PII) using machine learning

Key Features:

  • ML-based PII detection: Automatically identifies credit cards, SSNs, passport numbers
  • EventBridge integration: Trigger automated responses to PII discovery
  • Detailed reporting: Comprehensive reports on data classification results
  • Custom data identifiers: Define organization-specific sensitive data patterns

When to use:

  • Automated PII detection and data classification
  • Compliance requirements (GDPR, HIPAA, PCI DSS)
  • Data discovery across large S3 environments
  • Event-driven data protection workflows

Data Encryption Options

SSE-S3 (Server-Side Encryption with S3-Managed Keys):

  • AWS manages all encryption keys
  • Default encryption option
  • Use when: Basic encryption needs, minimal key management overhead

SSE-KMS (Server-Side Encryption with AWS KMS):

  • Customer control over encryption keys
  • Audit trail through CloudTrail
  • Key rotation capabilities
  • Use when: Sensitive data requiring access control, compliance requirements, audit trails needed

SSE-C (Server-Side Encryption with Customer-Provided Keys):

  • Customer manages encryption keys completely
  • AWS doesn’t store the keys
  • Use when: Strict key management requirements, regulatory compliance demanding customer key control
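
A minimal boto3 sketch of an SSE-KMS upload; the bucket name, object key, and KMS key alias are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload an object encrypted with a customer-managed KMS key (SSE-KMS).
s3.put_object(
    Bucket="my-data-lake-bucket",
    Key="raw/customers/2024/01/15/customers.json",
    Body=b'{"customer_id": 1}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/data-lake-key",
)
```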

AWS Secrets Manager vs Systems Manager Parameter Store

AWS Secrets Manager:

  • Best for: Database credentials, API keys, sensitive data
  • Features: Built-in automatic rotation, native RDS/Redshift integration
  • Cost: Higher cost but better security features
  • Use when: Managing database passwords, need automatic rotation

Systems Manager Parameter Store:

  • Best for: Configuration data, application settings
  • Features: No built-in rotation for credentials, lower cost
  • Use when: Storing configuration parameters, non-sensitive data
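
A minimal boto3 sketch contrasting the two retrieval calls; the secret and parameter names are placeholders.

```python
import boto3

secrets = boto3.client("secretsmanager")
ssm = boto3.client("ssm")

# Sensitive credential whose rotation is handled by Secrets Manager.
db_secret = secrets.get_secret_value(SecretId="prod/redshift/admin")["SecretString"]

# Plain configuration value from Parameter Store (SecureString is also supported).
batch_size = ssm.get_parameter(Name="/etl/batch_size")["Parameter"]["Value"]
```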

Data Ingestion and Transformation

AWS Glue

Core Components:

Data Catalog:

  • Central metadata repository for all data sources
  • Automatically populated by crawlers
  • Integrates with Athena, EMR, Redshift Spectrum

Crawlers:

  • Automatically discover and catalog data schemas
  • Handle schema evolution over time
  • Support various data sources (S3, RDS, DynamoDB)
  • Best Practice: Schedule crawlers to run after ETL jobs complete

ETL Job Types:

Standard Jobs:

  • Dedicated resources for time-sensitive workloads
  • Consistent performance and predictable execution time
  • Use when: Production workloads, SLA requirements, time-critical processing

Flex Jobs:

  • Use spare compute capacity for cost savings (up to 34% cost reduction)
  • Variable execution time
  • Use when: Non-urgent workloads, cost optimization priority, batch processing

Python Shell Jobs:

  • Most cost-effective for small files (<30MB)
  • Can use 1/16 DPU minimum
  • Use when: Simple transformations, small data volumes, API calls

Ray Jobs:

  • Scale Python workloads across distributed clusters
  • Use when: Large-scale Python processing, machine learning workloads

Streaming ETL:

  • Real-time data processing
  • Use when: Near real-time transformation requirements

Advanced Features:

Glue DataBrew:

  • Visual, no-code data preparation
  • Pre-built transformations for common tasks
  • Data profiling and quality assessment
  • Use when: Business users need to prepare data, exploratory data analysis

Glue Studio:

  • Visual interface for ETL job creation
  • Supports both visual and code-based development
  • Use when: Rapid ETL development, visual job design preference

Schema Registry:

  • Centralized schema management for streaming data
  • Schema evolution with versioning
  • Integration with Kinesis and MSK
  • Use when: Streaming data with evolving schemas, producer/consumer validation

Data Quality:

  • Built-in data quality rules and validation
  • DQDL (Data Quality Definition Language)
  • Use when: Data validation requirements, quality monitoring

Performance Optimization:

  • DPU Optimization: Monitor job metrics to determine optimal DPU allocation
  • Pushdown Predicates: Filter data at source to reduce processing
  • Partition Pruning: Organize data by date/region for efficient querying
  • File Size: Keep files around 128MB for optimal processing
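
A hedged PySpark sketch of a pushdown predicate inside a Glue job; the database, table, and partition values are placeholders, and the awsglue libraries are only available in the Glue job environment.

```python
import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Pushdown predicate: only partitions matching the filter are read from S3,
# so unneeded data never enters the job.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
    push_down_predicate="year='2024' and month='01'",
)
```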

Amazon AppFlow

Purpose: Fully managed integration service for SaaS applications

Trigger Types:

  • Run on demand: Manual execution for ad-hoc transfers
  • Run on event: Event-driven, triggered by the SaaS application (e.g., a new Salesforce record)
  • Run on schedule: Time-based execution (daily, hourly)

Transfer Types:

  • Incremental transfer: Only changed/new records (most efficient)
  • Full transfer: Complete dataset snapshot

When to use:

  • Integrating SaaS applications (Salesforce, ServiceNow, Slack) with AWS
  • Need for secure, managed data transfer
  • Minimal operational overhead requirement

Amazon Data Firehose

Purpose: Real-time streaming data delivery with minimal operational overhead

Key Features:

  • Format conversion: Automatically converts JSON to Parquet/ORC
  • Compression: Built-in data compression
  • Destinations: S3, Redshift, OpenSearch, HTTP endpoints (NOT Amazon Timestream)
  • Buffering: Configurable size and time-based delivery
  • Serverless: No infrastructure management

When to use:

  • Near real-time data streaming (60 seconds minimum latency)
  • Need format conversion (JSON to Parquet)
  • Delivering to multiple destinations
  • Minimal operational overhead priority
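
A minimal boto3 producer sketch; the delivery stream name and record payload are placeholders. Firehose buffers the records and delivers them to the destination configured on the stream (e.g., S3, optionally with JSON-to-Parquet conversion enabled).

```python
import json
import boto3

firehose = boto3.client("firehose")

record = {"sensor_id": "s-001", "temperature": 21.7}

# Newline-delimited JSON is a common convention for downstream S3 consumers.
firehose.put_record(
    DeliveryStreamName="sensor-delivery-stream",   # placeholder name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```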

Amazon Kinesis Data Streams

Purpose: Real-time data streaming with sub-second latency

Key Concepts:

  • Shards: Unit of capacity (1 MB/sec input, 2 MB/sec output per shard)
  • Retention: 24 hours to 365 days
  • Resharding: Scale using SplitShard/MergeShard commands

Advanced Features:

  • Enhanced Fan-Out: Dedicated 2 MB/sec per consumer per shard
  • Server-side encryption: KMS integration
  • Cross-region replication: Data redundancy

When to use:

  • Real-time data ingestion with single-digit millisecond requirements
  • High-throughput streaming data
  • Multiple consumers need real-time access
  • Custom application processing requirements

Kinesis vs Firehose Decision Matrix:

  • Use Kinesis Data Streams when: Real-time processing, multiple consumers, custom applications
  • Use Firehose when: Simple delivery to destinations, format conversion needed, minimal management
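
A minimal boto3 producer sketch showing how the partition key drives shard assignment; the stream name and payload are placeholders.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"order_id": "o-123", "customer_id": "c-42", "amount": 99.5}

# The partition key determines which shard receives the record; a high-cardinality
# key such as customer_id spreads traffic evenly and avoids hot shards.
kinesis.put_record(
    StreamName="orders-stream",                    # placeholder name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["customer_id"],
)
```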

Data Store Management

Amazon S3

Storage Classes:

Standard:

  • Frequent access, immediate availability
  • Use when: Actively used data, high availability requirements

Standard-IA (Infrequent Access):

  • Infrequent access but rapid retrieval when needed
  • 30-day minimum storage duration
  • Use when: Backup data, disaster recovery, long-term storage with occasional access

Intelligent-Tiering:

  • Automatic optimization based on access patterns
  • No retrieval fees for frequent access
  • Use when: Unknown or changing access patterns, automatic cost optimization

Glacier Flexible Retrieval:

  • Long-term archival; retrieval from 1-5 minutes (expedited) to 5-12 hours (bulk)
  • Use when: Archival data with occasional access needs

Glacier Deep Archive:

  • Lowest cost, 12-48 hours retrieval time
  • Use when: Long-term archival, rarely accessed data, compliance requirements

Advanced Features:

Lifecycle Policies:

  • Automate transitions between storage classes
  • Delete expired objects automatically
  • Best Practice: Implement based on data access patterns
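
A minimal boto3 sketch of a lifecycle rule that tiers raw data into cheaper classes and then expires it; the bucket name, prefix, and day thresholds are placeholders.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",                  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
                # Clean up abandoned multipart uploads as well.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```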

Transfer Acceleration:

  • Use CloudFront edge locations for faster uploads
  • Use when: Global users uploading large files

Event Notifications:

  • Trigger actions on object creation/deletion
  • Integrate with Lambda, SQS, SNS
  • Use when: Event-driven data processing

S3 Object Lambda:

  • Transform data during retrieval using Lambda functions
  • PII redaction, data masking
  • Use when: Dynamic data transformation without creating copies

Amazon Redshift

Core Concepts:

  • Fully managed, petabyte-scale data warehouse
  • Columnar storage with massively parallel processing (MPP)
  • Best price-performance for analytics workloads

Distribution Styles:

KEY Distribution:

  • Distribute based on specific column values
  • Use when: Large tables with frequent joins on specific columns
  • Example: Distribute orders table on customer_id for customer analysis

ALL Distribution:

  • Copy entire table to all nodes
  • Use when: Small, frequently joined dimension tables (<10M rows)
  • Example: Country lookup tables, product categories

EVEN Distribution:

  • Round-robin distribution (default)
  • Use when: Tables without clear join patterns, fact tables without dominant join column

AUTO Distribution:

  • Redshift automatically chooses optimal distribution
  • Use when: Uncertain about optimal distribution strategy

Performance Optimization:

Sort Keys:

  • Compound Sort Keys: Most common, define sort order priority
  • Interleaved Sort Keys: Equal weight to all columns, better for diverse query patterns
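
A sketch of table DDL that chooses a distribution key and a compound sort key, submitted through the Redshift Data API; the cluster, database, user, and table names are placeholders (Redshift Serverless would use WorkgroupName instead of ClusterIdentifier/DbUser).

```python
import boto3

redshift_data = boto3.client("redshift-data")

ddl = """
CREATE TABLE orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(12,2)
)
DISTKEY (customer_id)            -- co-locate rows joined on customer_id
COMPOUND SORTKEY (order_date);   -- range-restricted scans on date filters
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=ddl,
)
```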

VACUUM Operations:

  • VACUUM FULL: Reclaim space and re-sort all data
  • VACUUM DELETE ONLY: Reclaim space from deleted rows
  • VACUUM SORT ONLY: Re-sort data without reclaiming space
  • VACUUM REINDEX: For interleaved sort keys optimization

Query Optimization:

  • Materialized Views: Pre-computed results for complex queries with auto-refresh
  • Query Result Caching: Serve repeated identical queries from the result cache while the underlying data is unchanged
  • LIKE vs REGEX: Use LIKE operator for pattern matching (faster than regex)
  • Stored Procedures: Encapsulate multiple SQL statements, reduce network traffic

Advanced Features:

Redshift Serverless:

  • Pay-per-use model, automatic scaling
  • Use when: Variable workloads, development/testing, cost optimization

Redshift Spectrum:

  • Query S3 data without loading into Redshift
  • Use when: Infrequently accessed data, data lake integration

Streaming Ingestion:

  • Direct ingestion from Kinesis Data Streams using materialized views
  • Use when: Real-time analytics on streaming data

Data Sharing:

  • Share live data between clusters without copying
  • Use when: Cross-team data access, multi-tenant scenarios

Concurrency Scaling:

  • Automatically add capacity for concurrent queries
  • Use when: Unpredictable query spikes, mixed workloads

Monitoring:

  • STL_ALERT_EVENT_LOG: Identifies performance issues and query failures
  • Workload Management (WLM): Manage query priorities and concurrency

Amazon DynamoDB

Performance Characteristics:

  • Single-digit millisecond latency
  • Automatic scaling based on utilization
  • Serverless with on-demand pricing option

Design Best Practices:

Partition Key Design:

  • High Cardinality: Use PRODUCT_ID instead of CATEGORY_NAME
  • Even Distribution: Avoid hot partitions
  • Access Patterns: Design based on query requirements

Capacity Management:

  • Provisioned: Predictable workloads, reserved capacity options
  • On-Demand: Variable workloads, pay-per-request

Monitoring:

  • CloudWatch Contributor Insights: Identify hot partitions and access patterns
  • Throttling Analysis: Monitor capacity limits

Advanced Features:

  • TTL (Time to Live): Automatically delete expired items
  • Global Tables: Multi-region replication
  • Point-in-time Recovery: Backup and restore capabilities
  • DynamoDB Streams: Change data capture
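
A minimal boto3 sketch that enables TTL and writes an item expiring after 24 hours; the table and attribute names are placeholders.

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL on an epoch-seconds attribute; expired items are deleted automatically.
dynamodb.update_time_to_live(
    TableName="session_data",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Write an item that expires 24 hours from now.
dynamodb.put_item(
    TableName="session_data",
    Item={
        "session_id": {"S": "sess-123"},
        "expires_at": {"N": str(int(time.time()) + 86400)},
    },
)
```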

Amazon Aurora

Integration Features:

  • Native Lambda Functions: Invoke Lambda from Aurora
  • Zero-ETL Integration: Direct integration with Redshift
  • External Metastore: Serve as Hive metastore for EMR

Performance:

  • 5x faster than MySQL, 3x faster than PostgreSQL
  • Auto-scaling storage up to 128TB
  • Read replicas for scaling read workloads

Amazon Athena

Purpose: Serverless SQL queries on S3 data with pay-per-query pricing

Optimization Techniques:

Partition Projection:

  • Calculate partitions dynamically instead of metadata lookup
  • Use when: Highly partitioned data with predictable patterns
  • Example: Date-based partitions (year/month/day)
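
A hedged sketch of a table using date-based partition projection, created through the Athena API; the database, bucket paths, and result location are placeholders.

```python
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
    user_id string,
    page    string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-data-lake-bucket/clickstream/'
TBLPROPERTIES (
    'projection.enabled'        = 'true',
    'projection.dt.type'        = 'date',
    'projection.dt.range'       = '2024/01/01,NOW',
    'projection.dt.format'      = 'yyyy/MM/dd',
    'storage.location.template' = 's3://my-data-lake-bucket/clickstream/dt=${dt}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```

Because partitions are computed from the projection properties, no MSCK REPAIR TABLE or crawler run is needed as new date prefixes arrive.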

Query Result Caching:

  • Reuse identical query results for cost reduction
  • 24-hour cache duration
  • Best Practice: Enable for repeated analytical queries

Columnar Formats:

  • Use Parquet/ORC for 30-90% cost reduction
  • Better compression and predicate pushdown
  • Always use for analytical workloads

Advanced Features:

Federated Queries:

  • Query multiple data sources using Lambda connectors
  • Connect to RDS, DynamoDB, Redis, HBase
  • Use when: Need to join data across different systems

Workgroups:

  • Separate query execution across teams
  • Enforce cost controls and query limits
  • Use when: Multi-tenant scenarios, cost management

Apache Iceberg Support:

  • ACID transactions in data lakes
  • Time travel queries
  • Schema evolution
  • Use when: Need ACID properties, versioning requirements

Data Operations and Support

Monitoring and Troubleshooting

AWS CloudTrail:

  • Log API calls and data events
  • Audit trail for compliance
  • Use when: Security auditing, compliance requirements, troubleshooting

Amazon CloudWatch:

  • Monitor performance metrics across all services
  • Custom metrics and alarms
  • Container Insights: Monitor EKS/ECS applications
  • Use when: Performance monitoring, alerting, operational visibility

Performance Insights:

  • Database performance analysis for RDS and Aurora
  • Wait event analysis and top SQL identification
  • Use when: Database performance troubleshooting

Service-Specific Monitoring:

  • Glue: Job profiler and CloudWatch for DPU optimization
  • Redshift: STL_ALERT_EVENT_LOG for performance issues
  • Kinesis: Iterator age monitoring for processing lag

Workflow Orchestration

AWS Step Functions:

  • Coordinate multiple AWS services into workflows
  • State machine-based orchestration

State Types:

  • Parallel State: Execute multiple branches simultaneously
  • Map State: Process arrays of data in parallel
  • Choice State: Conditional branching logic
  • Wait State: Introduce delays in workflows
  • Task State: Execute specific actions

When to use:

  • Complex workflows with conditional logic
  • Error handling and retry requirements
  • Coordinating multiple AWS services
  • Visual workflow representation needs

Amazon MWAA (Managed Workflows for Apache Airflow):

  • Managed Apache Airflow for complex workflows
  • DAG-based workflow definition

Key Features:

  • requirements.txt: Install Python packages
  • SSH Connections: Use apache-airflow-providers-ssh package
  • Web UI: Visual workflow monitoring

When to use:

  • Complex ETL workflows with dependencies
  • Python-based workflow logic
  • Integration with non-AWS systems
  • Existing Airflow knowledge

AWS Glue Workflows:

  • Orchestrate Glue jobs and crawlers
  • Built-in monitoring and dependency management
  • Use when: ETL-specific orchestration, simpler than Step Functions

Real-time Data Processing

Amazon Kinesis Ecosystem

Kinesis Data Streams:

  • Real-time data ingestion with sub-second latency
  • Shard-based scaling model

Key Concepts:

  • Shards: 1 MB/sec input, 2 MB/sec output per shard
  • Retention: 24 hours to 365 days
  • Partition Keys: Determine shard assignment

Scaling Operations:

  • SplitShard: Increase capacity by splitting shards
  • MergeShard: Reduce capacity by merging adjacent shards
  • Enhanced Fan-Out: Dedicated 2 MB/sec per consumer per shard

Lambda Integration:

  • Parallelization Factor: Control concurrent executions per shard
  • Batch Size: Number of records per invocation
  • Starting Position: TRIM_HORIZON, LATEST, AT_TIMESTAMP
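
A minimal boto3 sketch wiring a Kinesis stream to a Lambda consumer with these settings; the stream ARN and function name are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# ParallelizationFactor allows multiple concurrent batches per shard (up to 10).
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/orders-stream",
    FunctionName="process-orders",
    StartingPosition="TRIM_HORIZON",
    BatchSize=100,
    ParallelizationFactor=2,
)
```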

Amazon Managed Service for Apache Flink:

  • Real-time stream processing with SQL, Java, Scala, Python
  • Windowing operations for time-based analytics

Window Types:

  • Tumbling Windows: Non-overlapping fixed intervals
  • Sliding Windows: Overlapping time windows for continuous analysis
  • Session Windows: Based on data activity

When to use:

  • Complex stream processing logic
  • Real-time analytics and aggregations
  • Multiple data sources correlation

Integration Patterns

Event-Driven Architecture:

  • S3 Event Notifications → Lambda → Processing
  • EventBridge Rules → Scheduled workflows
  • Kinesis → Lambda → DynamoDB pattern

Cross-Account Streaming:

  • Stream logs from production to audit accounts
  • CloudWatch Logs subscription filters
  • Cross-account IAM roles

Analytics and Querying

Amazon Athena Advanced

Query Optimization:

  • Partition Projection: Calculate partitions dynamically for better performance
  • Columnar Formats: Always use Parquet/ORC for analytical workloads
  • Compression: Use appropriate compression (Snappy for Parquet)
  • File Sizing: Optimize file sizes (128MB-1GB) for parallel processing

Federated Queries:

  • Query multiple data sources without moving data
  • Lambda connectors for RDS, DynamoDB, HBase, Redis
  • Use when: Need to join data across different systems

Workgroups:

  • Separate teams and enforce settings
  • Cost controls and query limits
  • Result location management
  • Use when: Multi-tenant scenarios, cost governance

Amazon QuickSight

Key Features:

  • VPC Connections: Secure access to private databases
  • Data Sources: Connect to AWS and external sources
  • Real-time Dashboards: Live data visualization


Workflow Orchestration

AWS Step Functions

Core Concepts: Serverless workflow orchestration using state machines

State Types:

  • Task State: Execute Lambda functions, start EMR clusters, run Glue jobs
  • Parallel State: Execute multiple branches simultaneously for concurrent processing
  • Choice State: Conditional branching based on input values
  • Wait State: Pause execution for specified time or until timestamp
  • Map State: Process arrays of data in parallel

Integration Patterns:

  • Request-Response: Synchronous execution
  • Run Job: Asynchronous with polling
  • Callback: Wait for task token

When to use:

  • Complex workflows with conditional logic
  • Error handling and retry requirements
  • Visual workflow representation needs
  • Coordinating multiple AWS services
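
An illustrative state machine combining a Task (Run Job ".sync" pattern), Retry, and a Choice state, expressed in Amazon States Language and created via boto3; the job name, role ARN, and the output field checked by the Choice state are assumptions for illustration.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # Run Job pattern
            "Parameters": {"JobName": "daily-etl"},
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
            "Next": "CheckJobState",
        },
        "CheckJobState": {
            "Type": "Choice",
            "Choices": [
                # Output field name assumed for illustration; adjust to the actual task output.
                {"Variable": "$.JobRunState", "StringEquals": "SUCCEEDED", "Next": "Success"}
            ],
            "Default": "Failed",
        },
        "Success": {"Type": "Succeed"},
        "Failed": {"Type": "Fail"},
    },
}

sfn.create_state_machine(
    name="daily-etl-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",  # placeholder role
)
```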

Amazon MWAA (Managed Workflows for Apache Airflow)

Core Concepts: Managed Apache Airflow for complex workflow orchestration

Key Features:

  • DAGs: Define workflows as Directed Acyclic Graphs
  • requirements.txt: Install Python packages (apache-airflow-providers-ssh for SSH connections)
  • Variables and Connections: Manage configuration and external system connections
  • Web UI: Monitor and troubleshoot workflows

When to use:

  • Complex ETL workflows with dependencies
  • Python-based workflow logic
  • Integration with external systems
  • Need for rich scheduling capabilities
  • Existing Airflow expertise
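
A minimal Airflow 2.x DAG sketch of the kind uploaded to an MWAA environment's S3 dags/ folder; the DAG id, schedule, and task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting")  # placeholder task body


def load():
    print("loading")     # placeholder task body


with DAG(
    dag_id="daily_sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds
```

Extra Python packages referenced by the DAG go into the environment's requirements.txt (for example, apache-airflow-providers-ssh for SSH connections).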

Amazon EventBridge

Core Concepts: Serverless event bus for event-driven architectures

Features:

  • Event Rules: Filter and route events to targets
  • Scheduled Rules: Cron-based job scheduling
  • Custom Event Buses: Isolate different application events
  • Event Replay: Replay events for debugging

Event Patterns: Filter events before triggering targets based on event content

When to use:

  • Event-driven architectures
  • Scheduled job execution
  • Decoupling applications
  • Real-time event processing
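
A minimal boto3 sketch of a scheduled rule with a Lambda target; the rule name, cron expression, and function ARN are placeholders. The target function would also need a resource-based permission allowing events.amazonaws.com to invoke it.

```python
import boto3

events = boto3.client("events")

# Cron-based rule that fires every day at 06:00 UTC.
events.put_rule(
    Name="daily-etl-trigger",
    ScheduleExpression="cron(0 6 * * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="daily-etl-trigger",
    Targets=[
        {
            "Id": "start-etl",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:start-etl",
        }
    ],
)
```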

Migration and Integration

AWS Database Migration Service (DMS)

Core Concepts: Migrate databases with minimal downtime

Migration Types:

  • Full Load: One-time complete migration
  • CDC (Change Data Capture): Ongoing replication of changes
  • Full Load + CDC: Complete migration followed by ongoing sync

Source Requirements for CDC:

  • SQL Server: Enable transaction logs
  • MySQL: Enable binary logs
  • Oracle: Enable archived redo logs

Data Validation: Built-in validation to ensure data consistency

Target Formats: S3 targets can store migrated data as Parquet for better analytical query performance

When to use:

  • Database migrations to AWS
  • Ongoing replication between databases
  • Database consolidation projects
  • Zero-downtime migration requirements
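
A hedged boto3 sketch of a full-load-plus-CDC task with a simple selection rule; the endpoint and replication instance ARNs and the schema name are placeholders.

```python
import json
import boto3

dms = boto3.client("dms")

# Include every table in the "sales" schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="sales-full-load-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",   # full load followed by ongoing replication
    TableMappings=json.dumps(table_mappings),
)
```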

AWS Schema Conversion Tool (SCT)

Purpose: Convert database schemas between different engines

Capabilities:

  • Schema conversion (Oracle to PostgreSQL, SQL Server to Aurora)
  • Code conversion for stored procedures and functions
  • Assessment reports for migration complexity

AWS DataSync

Purpose: Automated, high-speed data transfer service

Features:

  • Multiple Destinations: Transfer to multiple S3 buckets simultaneously
  • Bandwidth Optimization: Automatic network optimization
  • Scheduling: Periodic data synchronization
  • Verification: Data integrity checks

When to use:

  • Large-scale data migration to AWS
  • Regular data synchronization
  • Hybrid cloud architectures
  • Archive data movement

Cost Optimization Strategies

Service-Specific Optimizations

AWS Glue:

  • Flex Jobs: Up to 34% cost reduction for non-urgent workloads
  • Python Shell: Most cost-effective for small files (<30MB)
  • DPU Optimization: Monitor and right-size based on job metrics
  • Job Bookmarks: Avoid reprocessing data

Amazon S3:

  • Lifecycle Policies: Automate transitions to cheaper storage classes
  • Intelligent-Tiering: Automatic optimization for unknown access patterns
  • Compression: Reduce storage costs and transfer time
  • Delete Incomplete Multipart Uploads: Clean up abandoned uploads

Amazon Redshift:

  • Serverless: Pay-per-use for variable workloads
  • Reserved Instances: 1-3 year commitments for predictable workloads
  • Pause/Resume: Pause clusters during non-business hours
  • Query Result Caching: Avoid redundant query execution
  • UNLOAD to S3: Move infrequently accessed data to cheaper storage

Amazon Athena:

  • Query Result Caching: Reuse identical query results
  • Columnar Formats: Reduce data scanned (30-90% cost reduction)
  • Partitioning: Limit data scanned per query
  • Compression: Reduce storage and scanning costs

General Cost Optimization Principles

Right-Sizing:

  • Monitor resource utilization with CloudWatch
  • Use AWS Cost Explorer for usage analysis
  • Implement auto-scaling where appropriate

Reserved Capacity:

  • Use for predictable workloads
  • Available for Redshift, RDS, DynamoDB
  • 1-3 year commitment options

Spot Instances:

  • Use for fault-tolerant workloads
  • EMR clusters with proper checkpointing
  • Development and testing environments

Performance Optimization

Data Format Optimization

Columnar Formats (Parquet/ORC):

  • Benefits: Predicate pushdown, column pruning, better compression
  • Use Cases: Analytical workloads, data warehousing, OLAP queries
  • Compression: Use Snappy for Parquet (good compression + fast queries)

Row-Based Formats (JSON/CSV):

  • Use Cases: Operational workloads, OLTP systems, data ingestion
  • Limitations: Poor analytical performance, higher storage costs

Partitioning Strategies

Time-Based Partitioning:

  • Partition by year/month/day for time-series data
  • Example: s3://bucket/year=2024/month=01/day=15/
  • Benefits: Query performance, cost optimization

Categorical Partitioning:

  • Partition by frequently filtered dimensions
  • Example: s3://bucket/region=us-east-1/department=sales/
  • Avoid: High cardinality partitions (>10,000 partitions)
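
A minimal PySpark sketch (for example on EMR or Glue, which provide the S3 connectors) that writes Snappy-compressed Parquet partitioned by date columns; bucket paths and column names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

events = spark.read.json("s3://my-data-lake-bucket/raw/events/")  # placeholder path

# Partitioned, columnar output: queries filtering on year/month/day
# (Athena, Redshift Spectrum) scan only the matching S3 prefixes.
(
    events.write.mode("overwrite")
    .partitionBy("year", "month", "day")
    .option("compression", "snappy")
    .parquet("s3://my-data-lake-bucket/curated/events/")
)
```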

Query Performance Tuning

Amazon Redshift:

  • Choose appropriate distribution and sort keys
  • Use materialized views for repeated complex queries
  • Monitor STL_ALERT_EVENT_LOG for performance issues
  • Implement proper VACUUM schedules

Amazon Athena:

  • Use partition projection for highly partitioned data
  • Implement columnar storage formats
  • Optimize file sizes (128MB-1GB)
  • Use workgroups for resource management

AWS Glue:

  • Monitor DPU utilization and adjust accordingly
  • Use pushdown predicates to filter data at source
  • Optimize file sizes and formats
  • Use job bookmarks to avoid reprocessing

Infrastructure Optimization

Auto Scaling:

  • DynamoDB: Auto Scaling based on utilization
  • EMR: Auto Scaling for dynamic workloads
  • Redshift: Concurrency Scaling for query spikes

Compute Optimization:

  • Use appropriate instance types for workloads
  • Graviton instances for better price-performance
  • Serverless options to avoid idle costs

Exam Strategy and Tips

Question Pattern Recognition

“Least operational overhead”:

  • Choose managed/serverless services over self-managed
  • Prefer AWS Glue over EMR for ETL
  • Choose Athena over self-managed Spark clusters
  • Select Redshift Serverless over provisioned clusters

“Cost-effective”:

  • Consider Glue Flex jobs for non-urgent workloads
  • Use appropriate S3 storage classes
  • Implement lifecycle policies
  • Choose serverless options for variable workloads

“Real-time”:

  • Kinesis Data Streams for sub-second latency
  • Lambda for event-driven processing
  • Streaming ETL with Glue or Flink
  • Enhanced Fan-Out for dedicated throughput

“Large datasets”:

  • EMR for big data processing frameworks
  • Redshift for data warehousing
  • Distributed processing patterns
  • Appropriate partitioning strategies

Service Selection Guidelines

Data Transformation:

  • AWS Glue: Managed ETL, serverless, broad integration
  • EMR: More control, big data frameworks, custom processing
  • Lambda: Event-driven, small-scale transformations

Analytics:

  • Athena: Ad-hoc queries, serverless, pay-per-query
  • Redshift: Data warehouse, complex analytics, consistent performance
  • EMR: Big data analytics, ML workloads, custom frameworks

Storage:

  • S3: Data lake, archival, web-scale storage
  • Redshift: Structured data warehouse storage
  • DynamoDB: NoSQL, single-digit millisecond latency

Integration:

  • AppFlow: SaaS application integration
  • Glue: General ETL and data catalog
  • DMS: Database migration and replication

Key Concepts to Master

Schema Evolution:

  • Glue Schema Registry for streaming data
  • Handle changing data structures over time
  • Version management and compatibility

Data Lineage:

  • Track data flow and transformations
  • Important for ML model governance
  • Compliance and audit requirements

Multi-Tenancy:

  • Secure data isolation between tenants
  • Lake Formation for fine-grained access control
  • Workgroups in Athena for team separation

Compliance and Auditing:

  • CloudTrail for API auditing
  • Data encryption at rest and in transit
  • PII detection and redaction
  • Proper access controls and monitoring

Advanced Architectural Patterns

Event-Driven Data Processing:

  • S3 events trigger Lambda for immediate processing
  • EventBridge for complex event routing
  • Kinesis for real-time stream processing

Serverless Data Lakes:

  • S3 for storage, Athena for querying
  • Glue for ETL and data catalog
  • Lambda for event-driven processing

Data Mesh Architecture:

  • Distributed data ownership
  • Lake Formation for governance
  • Cross-account data sharing

Hybrid and Multi-Cloud:

  • DataSync for data movement
  • Storage Gateway for hybrid storage
  • Direct Connect for dedicated connectivity

Common Pitfalls to Avoid

Storage Class Selection:

  • Don’t use Glacier for frequently accessed data
  • Consider retrieval times for Glacier classes
  • Use Intelligent-Tiering for unknown patterns

Service Limitations:

  • Lambda 15-minute execution limit
  • Firehose minimum 60-second latency
  • Athena query timeout considerations

Security Misconfigurations:

  • Always use least privilege access
  • Enable encryption for sensitive data
  • Use Secrets Manager for credentials, not Parameter Store

Performance Anti-Patterns:

  • Avoid small files in analytical workloads
  • Don’t skip compression for storage optimization
  • Avoid hot partitions in DynamoDB

Final Exam Preparation Checklist

Core Services Mastery

  • Understand when to use each service vs alternatives
  • Know integration patterns between services
  • Master cost optimization strategies
  • Understand security best practices

Hands-On Experience

  • Practice creating ETL jobs in Glue
  • Set up data lakes with proper partitioning
  • Configure Redshift clusters and optimize queries
  • Implement real-time streaming with Kinesis

Scenario-Based Thinking

  • Practice choosing services based on requirements
  • Understand trade-offs between options
  • Know when to prioritize cost vs performance vs operational overhead

Security and Compliance

  • Master IAM roles and policies for data services
  • Understand encryption options and when to use each
  • Know Lake Formation for data governance
  • Understand PII detection and data classification

This comprehensive study guide covers all essential concepts for the AWS Certified Data Engineer Associate exam. Focus on understanding not just what each service does, but when and why to use it in different scenarios. Practice with hands-on labs and scenario-based questions to reinforce your learning.