AWS Certified Data Engineer Associate - Complete Study Guide

Table of Contents

  1. Data Security and Governance
  2. Data Ingestion and Transformation
  3. Data Store Management
  4. Data Operations and Support
  5. Real-time Data Processing
  6. Analytics and Querying
  7. Workflow Orchestration
  8. Migration and Integration
  9. Cost Optimization Strategies
  10. Performance Optimization
  11. Exam Strategy and Tips

Data Security and Governance

AWS Lake Formation

Purpose: Centralized data governance for data lakes with minimal operational overhead

Key Features:

  • Row and column-level access control: Fine-grained permissions without managing complex IAM policies
  • Data filters: Restrict access based on user attributes (department, region, clearance level)
  • Cross-account data sharing: Secure sharing without data duplication
  • Integration: Works seamlessly with Athena, EMR, Redshift Spectrum

When to use:

  • Need database, table, column, row, and cell-level access control
  • Multiple teams accessing the same data lake with different permission requirements
  • Cross-account data sharing scenarios
  • Compliance requirements for data governance

Roles:

  • Data Lake Administrator: Full administrative access to Lake Formation
  • Database Creator: Can create new databases and manage their permissions
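
As a concrete illustration, fine-grained access can be granted through the Lake Formation API. The following is a minimal boto3 sketch of a column-level grant; the role ARN, database, table, and column names are placeholders, not values from this guide.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on a subset of columns to an analyst role (names are placeholders).
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date", "region"],  # column-level scope
        }
    },
    Permissions=["SELECT"],
)
```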

Amazon Macie

Purpose: Automatically discover, classify, and protect sensitive data (PII) using machine learning

Key Features:

  • ML-based PII detection: Automatically identifies credit cards, SSNs, passport numbers
  • EventBridge integration: Trigger automated responses to PII discovery
  • Detailed reporting: Comprehensive reports on data classification results
  • Custom data identifiers: Define organization-specific sensitive data patterns

When to use:

  • Automated PII detection and data classification
  • Compliance requirements (GDPR, HIPAA, PCI DSS)
  • Data discovery across large S3 environments
  • Event-driven data protection workflows

Data Encryption Options

SSE-S3 (Server-Side Encryption with S3-Managed Keys):

  • AWS manages all encryption keys
  • Default encryption option
  • Use when: Basic encryption needs, minimal key management overhead

SSE-KMS (Server-Side Encryption with AWS KMS):

  • Customer control over encryption keys
  • Audit trail through CloudTrail
  • Key rotation capabilities
  • Use when: Sensitive data requiring access control, compliance requirements, audit trails needed

SSE-C (Server-Side Encryption with Customer-Provided Keys):

  • Customer manages encryption keys completely
  • AWS doesn’t store the keys
  • Use when: Strict key management requirements, regulatory compliance demanding customer key control
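
A minimal boto3 sketch of an SSE-KMS upload; the bucket name, object key, and KMS key alias are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload an object encrypted with a customer-managed KMS key (SSE-KMS).
s3.put_object(
    Bucket="my-data-lake-bucket",
    Key="raw/customers/2024/01/15/customers.json",
    Body=b'{"customer_id": 1}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/data-lake-key",
)
```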

AWS Secrets Manager vs Systems Manager Parameter Store

AWS Secrets Manager:

  • Best for: Database credentials, API keys, sensitive data
  • Features: Built-in automatic rotation, native RDS/Redshift integration
  • Cost: Higher cost but better security features
  • Use when: Managing database passwords, need automatic rotation

Systems Manager Parameter Store:

  • Best for: Configuration data, application settings
  • Features: No built-in rotation for credentials, lower cost
  • Use when: Storing configuration parameters, non-sensitive data
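
A minimal boto3 sketch contrasting the two retrieval calls; the secret and parameter names are placeholders.

```python
import boto3

secrets = boto3.client("secretsmanager")
ssm = boto3.client("ssm")

# Sensitive credential whose rotation is handled by Secrets Manager.
db_secret = secrets.get_secret_value(SecretId="prod/redshift/admin")["SecretString"]

# Plain configuration value from Parameter Store (SecureString is also supported).
batch_size = ssm.get_parameter(Name="/etl/batch_size")["Parameter"]["Value"]
```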

Data Ingestion and Transformation

AWS Glue

Core Components:

Data Catalog:

  • Central metadata repository for all data sources
  • Automatically populated by crawlers
  • Integrates with Athena, EMR, Redshift Spectrum

Crawlers:

  • Automatically discover and catalog data schemas
  • Handle schema evolution over time
  • Support various data sources (S3, RDS, DynamoDB)
  • Best Practice: Schedule crawlers to run after ETL jobs complete

ETL Job Types:

Standard Jobs:

  • Dedicated resources for time-sensitive workloads
  • Consistent performance and predictable execution time
  • Use when: Production workloads, SLA requirements, time-critical processing

Flex Jobs:

  • Use spare compute capacity for cost savings (up to 34% cost reduction)
  • Variable execution time
  • Use when: Non-urgent workloads, cost optimization priority, batch processing

Python Shell Jobs:

  • Most cost-effective for small files (<30MB)
  • Can use 1/16 DPU minimum
  • Use when: Simple transformations, small data volumes, API calls

Ray Jobs:

  • Scale Python workloads across distributed clusters
  • Use when: Large-scale Python processing, machine learning workloads

Streaming ETL:

  • Real-time data processing
  • Use when: Near real-time transformation requirements

Advanced Features:

Glue DataBrew:

  • Visual, no-code data preparation
  • Pre-built transformations for common tasks
  • Data profiling and quality assessment
  • Use when: Business users need to prepare data, exploratory data analysis

Glue Studio:

  • Visual interface for ETL job creation
  • Supports both visual and code-based development
  • Use when: Rapid ETL development, visual job design preference

Schema Registry:

  • Centralized schema management for streaming data
  • Schema evolution with versioning
  • Integration with Kinesis and MSK
  • Use when: Streaming data with evolving schemas, producer/consumer validation

Data Quality:

  • Built-in data quality rules and validation
  • DQDL (Data Quality Definition Language)
  • Use when: Data validation requirements, quality monitoring

Performance Optimization:

  • DPU Optimization: Monitor job metrics to determine optimal DPU allocation
  • Pushdown Predicates: Filter data at source to reduce processing
  • Partition Pruning: Organize data by date/region for efficient querying
  • File Size: Keep files around 128MB for optimal processing
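
A hedged PySpark sketch of a pushdown predicate inside a Glue job; the database, table, and partition values are placeholders, and the awsglue libraries are only available in the Glue job environment.

```python
import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Pushdown predicate: only partitions matching the filter are read from S3,
# so unneeded data never enters the job.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
    push_down_predicate="year='2024' and month='01'",
)
```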

Amazon AppFlow

Purpose: Fully managed integration service for SaaS applications

Trigger Types:

  • Run on demand: Manual execution for ad-hoc transfers
  • Run on event: Event-driven, triggered by the SaaS application (e.g., a new Salesforce record)
  • Run on schedule: Time-based execution (daily, hourly)

Transfer Types:

  • Incremental transfer: Only changed/new records (most efficient)
  • Full transfer: Complete dataset snapshot

When to use:

  • Integrating SaaS applications (Salesforce, ServiceNow, Slack) with AWS
  • Need for secure, managed data transfer
  • Minimal operational overhead requirement

Amazon Data Firehose

Purpose: Real-time streaming data delivery with minimal operational overhead

Key Features:

  • Format conversion: Automatically converts JSON to Parquet/ORC
  • Compression: Built-in data compression
  • Destinations: S3, Redshift, OpenSearch, HTTP endpoints (NOT Amazon Timestream)
  • Buffering: Configurable size and time-based delivery
  • Serverless: No infrastructure management

When to use:

  • Near real-time data streaming (60 seconds minimum latency)
  • Need format conversion (JSON to Parquet)
  • Delivering to multiple destinations
  • Minimal operational overhead priority
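
A minimal boto3 producer sketch; the delivery stream name and record payload are placeholders. Firehose buffers the records and delivers them to the destination configured on the stream (e.g., S3, optionally with JSON-to-Parquet conversion enabled).

```python
import json
import boto3

firehose = boto3.client("firehose")

record = {"sensor_id": "s-001", "temperature": 21.7}

# Newline-delimited JSON is a common convention for downstream S3 consumers.
firehose.put_record(
    DeliveryStreamName="sensor-delivery-stream",   # placeholder name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```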

Amazon Kinesis Data Streams

Purpose: Real-time data streaming with sub-second latency

Key Concepts:

  • Shards: Unit of capacity (1 MB/sec input, 2 MB/sec output per shard)
  • Retention: 24 hours to 365 days
  • Resharding: Scale using SplitShard/MergeShard commands

Advanced Features:

  • Enhanced Fan-Out: Dedicated 2 MB/sec per consumer per shard
  • Server-side encryption: KMS integration
  • Cross-region replication: Data redundancy

When to use:

  • Real-time data ingestion with single-digit millisecond requirements
  • High-throughput streaming data
  • Multiple consumers need real-time access
  • Custom application processing requirements

Kinesis vs Firehose Decision Matrix:

  • Use Kinesis Data Streams when: Real-time processing, multiple consumers, custom applications
  • Use Firehose when: Simple delivery to destinations, format conversion needed, minimal management
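
A minimal boto3 producer sketch showing how the partition key drives shard assignment; the stream name and payload are placeholders.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"order_id": "o-123", "customer_id": "c-42", "amount": 99.5}

# The partition key determines which shard receives the record; a high-cardinality
# key such as customer_id spreads traffic evenly and avoids hot shards.
kinesis.put_record(
    StreamName="orders-stream",                    # placeholder name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["customer_id"],
)
```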

Data Store Management

Amazon S3

Storage Classes:

Standard:

  • Frequent access, immediate availability
  • Use when: Actively used data, high availability requirements

Standard-IA (Infrequent Access):

  • Infrequent access but rapid retrieval when needed
  • 30-day minimum storage duration
  • Use when: Backup data, disaster recovery, long-term storage with occasional access

Intelligent-Tiering:

  • Automatic optimization based on access patterns
  • No retrieval fees for frequent access
  • Use when: Unknown or changing access patterns, automatic cost optimization

Glacier Flexible Retrieval:

  • Long-term archival; retrieval from 1-5 minutes (expedited) to 5-12 hours (bulk)
  • Use when: Archival data with occasional access needs

Glacier Deep Archive:

  • Lowest cost, 12-48 hours retrieval time
  • Use when: Long-term archival, rarely accessed data, compliance requirements

Advanced Features:

Lifecycle Policies:

  • Automate transitions between storage classes
  • Delete expired objects automatically
  • Best Practice: Implement based on data access patterns
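
A minimal boto3 sketch of a lifecycle rule that tiers raw data into cheaper classes and then expires it; the bucket name, prefix, and day thresholds are placeholders.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",                  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
                # Clean up abandoned multipart uploads as well.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```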

Transfer Acceleration:

  • Use CloudFront edge locations for faster uploads
  • Use when: Global users uploading large files

Event Notifications:

  • Trigger actions on object creation/deletion
  • Integrate with Lambda, SQS, SNS
  • Use when: Event-driven data processing

S3 Object Lambda:

  • Transform data during retrieval using Lambda functions
  • PII redaction, data masking
  • Use when: Dynamic data transformation without creating copies

Amazon Redshift

Core Concepts:

  • Fully managed, petabyte-scale data warehouse
  • Columnar storage with massively parallel processing (MPP)
  • Best price-performance for analytics workloads

Distribution Styles:

KEY Distribution:

  • Distribute based on specific column values
  • Use when: Large tables with frequent joins on specific columns
  • Example: Distribute orders table on customer_id for customer analysis

ALL Distribution:

  • Copy entire table to all nodes
  • Use when: Small, frequently joined dimension tables (<10M rows)
  • Example: Country lookup tables, product categories

EVEN Distribution:

  • Round-robin distribution (default)
  • Use when: Tables without clear join patterns, fact tables without dominant join column

AUTO Distribution:

  • Redshift automatically chooses optimal distribution
  • Use when: Uncertain about optimal distribution strategy

Performance Optimization:

Sort Keys:

  • Compound Sort Keys: Most common, define sort order priority
  • Interleaved Sort Keys: Equal weight to all columns, better for diverse query patterns
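
A sketch of table DDL that chooses a distribution key and a compound sort key, submitted through the Redshift Data API; the cluster, database, user, and table names are placeholders (Redshift Serverless would use WorkgroupName instead of ClusterIdentifier/DbUser).

```python
import boto3

redshift_data = boto3.client("redshift-data")

ddl = """
CREATE TABLE orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(12,2)
)
DISTKEY (customer_id)            -- co-locate rows joined on customer_id
COMPOUND SORTKEY (order_date);   -- range-restricted scans on date filters
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=ddl,
)
```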

VACUUM Operations:

  • VACUUM FULL: Reclaim space and re-sort all data
  • VACUUM DELETE ONLY: Reclaim space from deleted rows
  • VACUUM SORT ONLY: Re-sort data without reclaiming space
  • VACUUM REINDEX: For interleaved sort keys optimization

Query Optimization:

  • Materialized Views: Pre-computed results for complex queries with auto-refresh
  • Query Result Caching: Serve repeated identical queries from the result cache while the underlying data is unchanged
  • LIKE vs REGEX: Use LIKE operator for pattern matching (faster than regex)
  • Stored Procedures: Encapsulate multiple SQL statements, reduce network traffic

Advanced Features:

Redshift Serverless:

  • Pay-per-use model, automatic scaling
  • Use when: Variable workloads, development/testing, cost optimization

Redshift Spectrum:

  • Query S3 data without loading into Redshift
  • Use when: Infrequently accessed data, data lake integration

Streaming Ingestion:

  • Direct ingestion from Kinesis Data Streams using materialized views
  • Use when: Real-time analytics on streaming data

Data Sharing:

  • Share live data between clusters without copying
  • Use when: Cross-team data access, multi-tenant scenarios

Concurrency Scaling:

  • Automatically add capacity for concurrent queries
  • Use when: Unpredictable query spikes, mixed workloads

Monitoring:

  • STL_ALERT_EVENT_LOG: Identifies performance issues and query failures
  • Workload Management (WLM): Manage query priorities and concurrency

Amazon DynamoDB

Performance Characteristics:

  • Single-digit millisecond latency
  • Automatic scaling based on utilization
  • Serverless with on-demand pricing option

Design Best Practices:

Partition Key Design:

  • High Cardinality: Use PRODUCT_ID instead of CATEGORY_NAME
  • Even Distribution: Avoid hot partitions
  • Access Patterns: Design based on query requirements

Capacity Management:

  • Provisioned: Predictable workloads, reserved capacity options
  • On-Demand: Variable workloads, pay-per-request

Monitoring:

  • CloudWatch Contributor Insights: Identify hot partitions and access patterns
  • Throttling Analysis: Monitor capacity limits

Advanced Features:

  • TTL (Time to Live): Automatically delete expired items
  • Global Tables: Multi-region replication
  • Point-in-time Recovery: Backup and restore capabilities
  • DynamoDB Streams: Change data capture
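
A minimal boto3 sketch that enables TTL and writes an item expiring after 24 hours; the table and attribute names are placeholders.

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL on an epoch-seconds attribute; expired items are deleted automatically.
dynamodb.update_time_to_live(
    TableName="session_data",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Write an item that expires 24 hours from now.
dynamodb.put_item(
    TableName="session_data",
    Item={
        "session_id": {"S": "sess-123"},
        "expires_at": {"N": str(int(time.time()) + 86400)},
    },
)
```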

Amazon Aurora

Integration Features:

  • Native Lambda Functions: Invoke Lambda from Aurora
  • Zero-ETL Integration: Direct integration with Redshift
  • External Metastore: Serve as Hive metastore for EMR

Performance:

  • 5x faster than MySQL, 3x faster than PostgreSQL
  • Auto-scaling storage up to 128TB
  • Read replicas for scaling read workloads

Amazon Athena

Purpose: Serverless SQL queries on S3 data with pay-per-query pricing

Optimization Techniques:

Partition Projection:

  • Calculate partitions dynamically instead of metadata lookup
  • Use when: Highly partitioned data with predictable patterns
  • Example: Date-based partitions (year/month/day)
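
A hedged sketch of a table using date-based partition projection, created through the Athena API; the database, bucket paths, and result location are placeholders.

```python
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
    user_id string,
    page    string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-data-lake-bucket/clickstream/'
TBLPROPERTIES (
    'projection.enabled'        = 'true',
    'projection.dt.type'        = 'date',
    'projection.dt.range'       = '2024/01/01,NOW',
    'projection.dt.format'      = 'yyyy/MM/dd',
    'storage.location.template' = 's3://my-data-lake-bucket/clickstream/dt=${dt}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```

Because partitions are computed from the projection properties, no MSCK REPAIR TABLE or crawler run is needed as new date prefixes arrive.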

Query Result Caching:

  • Reuse identical query results for cost reduction
  • 24-hour cache duration
  • Best Practice: Enable for repeated analytical queries

Columnar Formats:

  • Use Parquet/ORC for 30-90% cost reduction
  • Better compression and predicate pushdown
  • Always use for analytical workloads

Advanced Features:

Federated Queries:

  • Query multiple data sources using Lambda connectors
  • Connect to RDS, DynamoDB, Redis, HBase
  • Use when: Need to join data across different systems

Workgroups:

  • Separate query execution across teams
  • Enforce cost controls and query limits
  • Use when: Multi-tenant scenarios, cost management

Apache Iceberg Support:

  • ACID transactions in data lakes
  • Time travel queries
  • Schema evolution
  • Use when: Need ACID properties, versioning requirements

Data Operations and Support

Monitoring and Troubleshooting

AWS CloudTrail:

  • Log API calls and data events
  • Audit trail for compliance
  • Use when: Security auditing, compliance requirements, troubleshooting

Amazon CloudWatch:

  • Monitor performance metrics across all services
  • Custom metrics and alarms
  • Container Insights: Monitor EKS/ECS applications
  • Use when: Performance monitoring, alerting, operational visibility

Performance Insights:

  • Database performance analysis for RDS and Aurora
  • Wait event analysis and top SQL identification
  • Use when: Database performance troubleshooting

Service-Specific Monitoring:

  • Glue: Job profiler and CloudWatch for DPU optimization
  • Redshift: STL_ALERT_EVENT_LOG for performance issues
  • Kinesis: Iterator age monitoring for processing lag

Workflow Orchestration

AWS Step Functions:

  • Coordinate multiple AWS services into workflows
  • State machine-based orchestration

State Types:

  • Parallel State: Execute multiple branches simultaneously
  • Map State: Process arrays of data in parallel
  • Choice State: Conditional branching logic
  • Wait State: Introduce delays in workflows
  • Task State: Execute specific actions

When to use:

  • Complex workflows with conditional logic
  • Error handling and retry requirements
  • Coordinating multiple AWS services
  • Visual workflow representation needs

Amazon MWAA (Managed Workflows for Apache Airflow):

  • Managed Apache Airflow for complex workflows
  • DAG-based workflow definition

Key Features:

  • requirements.txt: Install Python packages
  • SSH Connections: Use apache-airflow-providers-ssh package
  • Web UI: Visual workflow monitoring

When to use:

  • Complex ETL workflows with dependencies
  • Python-based workflow logic
  • Integration with non-AWS systems
  • Existing Airflow knowledge

AWS Glue Workflows:

  • Orchestrate Glue jobs and crawlers
  • Built-in monitoring and dependency management
  • Use when: ETL-specific orchestration, simpler than Step Functions

Real-time Data Processing

Amazon Kinesis Ecosystem

Kinesis Data Streams:

  • Real-time data ingestion with sub-second latency
  • Shard-based scaling model

Key Concepts:

  • Shards: 1 MB/sec input, 2 MB/sec output per shard
  • Retention: 24 hours to 365 days
  • Partition Keys: Determine shard assignment

Scaling Operations:

  • SplitShard: Increase capacity by splitting shards
  • MergeShard: Reduce capacity by merging adjacent shards
  • Enhanced Fan-Out: Dedicated 2 MB/sec per consumer per shard

Lambda Integration:

  • Parallelization Factor: Control concurrent executions per shard
  • Batch Size: Number of records per invocation
  • Starting Position: TRIM_HORIZON, LATEST, AT_TIMESTAMP
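
A minimal boto3 sketch wiring a Kinesis stream to a Lambda consumer with these settings; the stream ARN and function name are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# ParallelizationFactor allows multiple concurrent batches per shard (up to 10).
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/orders-stream",
    FunctionName="process-orders",
    StartingPosition="TRIM_HORIZON",
    BatchSize=100,
    ParallelizationFactor=2,
)
```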

Amazon Managed Service for Apache Flink:

  • Real-time stream processing with SQL, Java, Scala, Python
  • Windowing operations for time-based analytics

Window Types:

  • Tumbling Windows: Non-overlapping fixed intervals
  • Sliding Windows: Overlapping time windows for continuous analysis
  • Session Windows: Based on data activity

When to use:

  • Complex stream processing logic
  • Real-time analytics and aggregations
  • Multiple data sources correlation

Integration Patterns

Event-Driven Architecture:

  • S3 Event Notifications → Lambda → Processing
  • EventBridge Rules → Scheduled workflows
  • Kinesis → Lambda → DynamoDB pattern

Cross-Account Streaming:

  • Stream logs from production to audit accounts
  • CloudWatch Logs subscription filters
  • Cross-account IAM roles

Analytics and Querying

Amazon Athena Advanced

Query Optimization:

  • Partition Projection: Calculate partitions dynamically for better performance
  • Columnar Formats: Always use Parquet/ORC for analytical workloads
  • Compression: Use appropriate compression (Snappy for Parquet)
  • File Sizing: Optimize file sizes (128MB-1GB) for parallel processing

Federated Queries:

  • Query multiple data sources without moving data
  • Lambda connectors for RDS, DynamoDB, HBase, Redis
  • Use when: Need to join data across different systems

Workgroups:

  • Separate teams and enforce settings
  • Cost controls and query limits
  • Result location management
  • Use when: Multi-tenant scenarios, cost governance

Amazon QuickSight

Key Features:

  • VPC Connections: Secure access to private databases
  • Data Sources: Connect to AWS and external sources
  • Real-time Dashboards: Live data visualization


Workflow Orchestration

AWS Step Functions

Core Concepts: Serverless workflow orchestration using state machines

State Types:

  • Task State: Execute Lambda functions, start EMR clusters, run Glue jobs
  • Parallel State: Execute multiple branches simultaneously for concurrent processing
  • Choice State: Conditional branching based on input values
  • Wait State: Pause execution for specified time or until timestamp
  • Map State: Process arrays of data in parallel

Integration Patterns:

  • Request-Response: Synchronous execution
  • Run Job: Asynchronous with polling
  • Callback: Wait for task token

When to use:

  • Complex workflows with conditional logic
  • Error handling and retry requirements
  • Visual workflow representation needs
  • Coordinating multiple AWS services
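
An illustrative state machine combining a Task (Run Job ".sync" pattern), Retry, and a Choice state, expressed in Amazon States Language and created via boto3; the job name, role ARN, and the output field checked by the Choice state are assumptions for illustration.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # Run Job pattern
            "Parameters": {"JobName": "daily-etl"},
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
            "Next": "CheckJobState",
        },
        "CheckJobState": {
            "Type": "Choice",
            "Choices": [
                # Output field name assumed for illustration; adjust to the actual task output.
                {"Variable": "$.JobRunState", "StringEquals": "SUCCEEDED", "Next": "Success"}
            ],
            "Default": "Failed",
        },
        "Success": {"Type": "Succeed"},
        "Failed": {"Type": "Fail"},
    },
}

sfn.create_state_machine(
    name="daily-etl-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",  # placeholder role
)
```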

Amazon MWAA (Managed Workflows for Apache Airflow)

Core Concepts: Managed Apache Airflow for complex workflow orchestration

Key Features:

  • DAGs: Define workflows as Directed Acyclic Graphs
  • requirements.txt: Install Python packages (apache-airflow-providers-ssh for SSH connections)
  • Variables and Connections: Manage configuration and external system connections
  • Web UI: Monitor and troubleshoot workflows

When to use:

  • Complex ETL workflows with dependencies
  • Python-based workflow logic
  • Integration with external systems
  • Need for rich scheduling capabilities
  • Existing Airflow expertise
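
A minimal Airflow 2.x DAG sketch of the kind uploaded to an MWAA environment's S3 dags/ folder; the DAG id, schedule, and task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting")  # placeholder task body


def load():
    print("loading")     # placeholder task body


with DAG(
    dag_id="daily_sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds
```

Extra Python packages referenced by the DAG go into the environment's requirements.txt (for example, apache-airflow-providers-ssh for SSH connections).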

Amazon EventBridge

Core Concepts: Serverless event bus for event-driven architectures

Features:

  • Event Rules: Filter and route events to targets
  • Scheduled Rules: Cron-based job scheduling
  • Custom Event Buses: Isolate different application events
  • Event Replay: Replay events for debugging

Event Patterns: Filter events before triggering targets based on event content

When to use:

  • Event-driven architectures
  • Scheduled job execution
  • Decoupling applications
  • Real-time event processing
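
A minimal boto3 sketch of a scheduled rule with a Lambda target; the rule name, cron expression, and function ARN are placeholders. The target function would also need a resource-based permission allowing events.amazonaws.com to invoke it.

```python
import boto3

events = boto3.client("events")

# Cron-based rule that fires every day at 06:00 UTC.
events.put_rule(
    Name="daily-etl-trigger",
    ScheduleExpression="cron(0 6 * * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="daily-etl-trigger",
    Targets=[
        {
            "Id": "start-etl",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:start-etl",
        }
    ],
)
```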

Migration and Integration

AWS Database Migration Service (DMS)

Core Concepts: Migrate databases with minimal downtime

Migration Types:

  • Full Load: One-time complete migration
  • CDC (Change Data Capture): Ongoing replication of changes
  • Full Load + CDC: Complete migration followed by ongoing sync

Source Requirements for CDC:

  • SQL Server: Enable transaction logs
  • MySQL: Enable binary logs
  • Oracle: Enable archived redo logs

Data Validation: Built-in validation to ensure data consistency

Target Formats: S3 targets can store migrated data as Parquet for better analytical query performance

When to use:

  • Database migrations to AWS
  • Ongoing replication between databases
  • Database consolidation projects
  • Zero-downtime migration requirements
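
A hedged boto3 sketch of a full-load-plus-CDC task with a simple selection rule; the endpoint and replication instance ARNs and the schema name are placeholders.

```python
import json
import boto3

dms = boto3.client("dms")

# Include every table in the "sales" schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="sales-full-load-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",   # full load followed by ongoing replication
    TableMappings=json.dumps(table_mappings),
)
```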

AWS Schema Conversion Tool (SCT)

Purpose: Convert database schemas between different engines

Capabilities:

  • Schema conversion (Oracle to PostgreSQL, SQL Server to Aurora)
  • Code conversion for stored procedures and functions
  • Assessment reports for migration complexity

AWS DataSync

Purpose: Automated, high-speed data transfer service

Features:

  • Multiple Destinations: Transfer to multiple S3 buckets simultaneously
  • Bandwidth Optimization: Automatic network optimization
  • Scheduling: Periodic data synchronization
  • Verification: Data integrity checks

When to use:

  • Large-scale data migration to AWS
  • Regular data synchronization
  • Hybrid cloud architectures
  • Archive data movement

Cost Optimization Strategies

Service-Specific Optimizations

AWS Glue:

  • Flex Jobs: Up to 34% cost reduction for non-urgent workloads
  • Python Shell: Most cost-effective for small files (<30MB)
  • DPU Optimization: Monitor and right-size based on job metrics
  • Job Bookmarks: Avoid reprocessing data

Amazon S3:

  • Lifecycle Policies: Automate transitions to cheaper storage classes
  • Intelligent-Tiering: Automatic optimization for unknown access patterns
  • Compression: Reduce storage costs and transfer time
  • Delete Incomplete Multipart Uploads: Clean up abandoned uploads

Amazon Redshift:

  • Serverless: Pay-per-use for variable workloads
  • Reserved Instances: 1-3 year commitments for predictable workloads
  • Pause/Resume: Pause clusters during non-business hours
  • Query Result Caching: Avoid redundant query execution
  • UNLOAD to S3: Move infrequently accessed data to cheaper storage

Amazon Athena:

  • Query Result Caching: Reuse identical query results
  • Columnar Formats: Reduce data scanned (30-90% cost reduction)
  • Partitioning: Limit data scanned per query
  • Compression: Reduce storage and scanning costs

General Cost Optimization Principles

Right-Sizing:

  • Monitor resource utilization with CloudWatch
  • Use AWS Cost Explorer for usage analysis
  • Implement auto-scaling where appropriate

Reserved Capacity:

  • Use for predictable workloads
  • Available for Redshift, RDS, DynamoDB
  • 1-3 year commitment options

Spot Instances:

  • Use for fault-tolerant workloads
  • EMR clusters with proper checkpointing
  • Development and testing environments

Performance Optimization

Data Format Optimization

Columnar Formats (Parquet/ORC):

  • Benefits: Predicate pushdown, column pruning, better compression
  • Use Cases: Analytical workloads, data warehousing, OLAP queries
  • Compression: Use Snappy for Parquet (good compression + fast queries)

Row-Based Formats (JSON/CSV):

  • Use Cases: Operational workloads, OLTP systems, data ingestion
  • Limitations: Poor analytical performance, higher storage costs

Partitioning Strategies

Time-Based Partitioning:

  • Partition by year/month/day for time-series data
  • Example: s3://bucket/year=2024/month=01/day=15/
  • Benefits: Query performance, cost optimization

Categorical Partitioning:

  • Partition by frequently filtered dimensions
  • Example: s3://bucket/region=us-east-1/department=sales/
  • Avoid: High cardinality partitions (>10,000 partitions)
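
A minimal PySpark sketch (for example on EMR or Glue, which provide the S3 connectors) that writes Snappy-compressed Parquet partitioned by date columns; bucket paths and column names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

events = spark.read.json("s3://my-data-lake-bucket/raw/events/")  # placeholder path

# Partitioned, columnar output: queries filtering on year/month/day
# (Athena, Redshift Spectrum) scan only the matching S3 prefixes.
(
    events.write.mode("overwrite")
    .partitionBy("year", "month", "day")
    .option("compression", "snappy")
    .parquet("s3://my-data-lake-bucket/curated/events/")
)
```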

Query Performance Tuning

Amazon Redshift:

  • Choose appropriate distribution and sort keys
  • Use materialized views for repeated complex queries
  • Monitor STL_ALERT_EVENT_LOG for performance issues
  • Implement proper VACUUM schedules

Amazon Athena:

  • Use partition projection for highly partitioned data
  • Implement columnar storage formats
  • Optimize file sizes (128MB-1GB)
  • Use workgroups for resource management

AWS Glue:

  • Monitor DPU utilization and adjust accordingly
  • Use pushdown predicates to filter data at source
  • Optimize file sizes and formats
  • Use job bookmarks to avoid reprocessing

Infrastructure Optimization

Auto Scaling:

  • DynamoDB: Auto Scaling based on utilization
  • EMR: Auto Scaling for dynamic workloads
  • Redshift: Concurrency Scaling for query spikes

Compute Optimization:

  • Use appropriate instance types for workloads
  • Graviton instances for better price-performance
  • Serverless options to avoid idle costs

Exam Strategy and Tips

Question Pattern Recognition

“Least operational overhead”:

  • Choose managed/serverless services over self-managed
  • Prefer AWS Glue over EMR for ETL
  • Choose Athena over self-managed Spark clusters
  • Select Redshift Serverless over provisioned clusters

“Cost-effective”:

  • Consider Glue Flex jobs for non-urgent workloads
  • Use appropriate S3 storage classes
  • Implement lifecycle policies
  • Choose serverless options for variable workloads

“Real-time”:

  • Kinesis Data Streams for sub-second latency
  • Lambda for event-driven processing
  • Streaming ETL with Glue or Flink
  • Enhanced Fan-Out for dedicated throughput

“Large datasets”:

  • EMR for big data processing frameworks
  • Redshift for data warehousing
  • Distributed processing patterns
  • Appropriate partitioning strategies

Service Selection Guidelines

Data Transformation:

  • AWS Glue: Managed ETL, serverless, broad integration
  • EMR: More control, big data frameworks, custom processing
  • Lambda: Event-driven, small-scale transformations

Analytics:

  • Athena: Ad-hoc queries, serverless, pay-per-query
  • Redshift: Data warehouse, complex analytics, consistent performance
  • EMR: Big data analytics, ML workloads, custom frameworks

Storage:

  • S3: Data lake, archival, web-scale storage
  • Redshift: Structured data warehouse storage
  • DynamoDB: NoSQL, single-digit millisecond latency

Integration:

  • AppFlow: SaaS application integration
  • Glue: General ETL and data catalog
  • DMS: Database migration and replication

Key Concepts to Master

Schema Evolution:

  • Glue Schema Registry for streaming data
  • Handle changing data structures over time
  • Version management and compatibility

Data Lineage:

  • Track data flow and transformations
  • Important for ML model governance
  • Compliance and audit requirements

Multi-Tenancy:

  • Secure data isolation between tenants
  • Lake Formation for fine-grained access control
  • Workgroups in Athena for team separation

Compliance and Auditing:

  • CloudTrail for API auditing
  • Data encryption at rest and in transit
  • PII detection and redaction
  • Proper access controls and monitoring

Advanced Architectural Patterns

Event-Driven Data Processing:

  • S3 events trigger Lambda for immediate processing
  • EventBridge for complex event routing
  • Kinesis for real-time stream processing

Serverless Data Lakes:

  • S3 for storage, Athena for querying
  • Glue for ETL and data catalog
  • Lambda for event-driven processing

Data Mesh Architecture:

  • Distributed data ownership
  • Lake Formation for governance
  • Cross-account data sharing

Hybrid and Multi-Cloud:

  • DataSync for data movement
  • Storage Gateway for hybrid storage
  • Direct Connect for dedicated connectivity

Common Pitfalls to Avoid

Storage Class Selection:

  • Don’t use Glacier for frequently accessed data
  • Consider retrieval times for Glacier classes
  • Use Intelligent-Tiering for unknown patterns

Service Limitations:

  • Lambda 15-minute execution limit
  • Firehose minimum 60-second latency
  • Athena query timeout considerations

Security Misconfigurations:

  • Always use least privilege access
  • Enable encryption for sensitive data
  • Use Secrets Manager for credentials, not Parameter Store

Performance Anti-Patterns:

  • Avoid small files in analytical workloads
  • Don’t skip compression for storage optimization
  • Avoid hot partitions in DynamoDB

Final Exam Preparation Checklist

Core Services Mastery

  • Understand when to use each service vs alternatives
  • Know integration patterns between services
  • Master cost optimization strategies
  • Understand security best practices

Hands-On Experience

  • Practice creating ETL jobs in Glue
  • Set up data lakes with proper partitioning
  • Configure Redshift clusters and optimize queries
  • Implement real-time streaming with Kinesis

Scenario-Based Thinking

  • Practice choosing services based on requirements
  • Understand trade-offs between options
  • Know when to prioritize cost vs performance vs operational overhead

Security and Compliance

  • Master IAM roles and policies for data services
  • Understand encryption options and when to use each
  • Know Lake Formation for data governance
  • Understand PII detection and data classification

This comprehensive study guide covers all essential concepts for the AWS Certified Data Engineer Associate exam. Focus on understanding not just what each service does, but when and why to use it in different scenarios. Practice with hands-on labs and scenario-based questions to reinforce your learning.