Data Lake Starter

advanced

S3-based data lake with raw, processed, and curated zones, a Glue Data Catalog with a daily crawler, an Athena WorkGroup, and an ingestion pipeline built from a Kinesis Data Stream and Firehose. An ETL Lambda handles the raw-to-processed transformation.

Tags: data, s3, glue, athena, kinesis, analytics

Quick Start

Via CLI (recommended)

npx cdk-starter create

Then select "Data Lake Starter" at the prompt.

Or scaffold directly

npx cdk-starter create --starter data-lake

README

S3-based data lake with raw, processed, and curated storage zones, a Glue Data Catalog with a daily crawler, an Athena WorkGroup, and a Kinesis Data Stream and Firehose ingestion pipeline.

Ingest data

aws kinesis put-record \
  --stream-name <IngestStreamName> \
  --data "$(printf '%s' '{"userId":"123","event":"page_view"}' | base64)" \
  --partition-key user123
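
To send several events without repeating the encoding step, the call can be wrapped in a small helper. This is a minimal sketch, assuming AWS CLI v2 (which reads `--data` as base64 text by default). The names `encode_event` and `send_event` are hypothetical and not part of the starter, and the stream name is still the value you substitute for `<IngestStreamName>`.

```shell
# Base64-encode a string for `aws kinesis put-record --data`.
# Assumption: AWS CLI v2, where blob parameters are base64 by default.
encode_event() {
  # tr removes the line wrapping some base64 implementations add
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Send one event. Arguments: stream name, partition key, JSON payload.
# send_event is an illustrative helper, not something the starter ships.
send_event() {
  aws kinesis put-record \
    --stream-name "$1" \
    --partition-key "$2" \
    --data "$(encode_event "$3")"
}
```

A call would then look like `send_event <IngestStreamName> user123 '{"userId":"123","event":"page_view"}'`. Passing the JSON in single quotes keeps the shell from touching the inner double quotes.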

Query with Athena

Run the Glue crawler to populate the catalog.
In the Athena console, select the database named by the GlueDatabaseName output.
Query the raw zone: SELECT * FROM raw_events LIMIT 10;
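
The same steps can also be driven from the AWS CLI. The following is a rough sketch rather than the starter's own tooling: `start_crawler`, `run_query`, and `get_results` are hypothetical helper names, and the crawler, database, and WorkGroup names are placeholders you supply from your own stack. It also assumes the WorkGroup already defines a query result location.

```shell
# Start the Glue crawler. Argument: crawler name (placeholder you supply).
start_crawler() {
  aws glue start-crawler --name "$1"
}

# Submit a query through an Athena WorkGroup and print the query execution ID.
# Arguments: Glue database name, WorkGroup name, SQL text.
run_query() {
  aws athena start-query-execution \
    --query-execution-context "Database=$1" \
    --work-group "$2" \
    --query-string "$3" \
    --query 'QueryExecutionId' \
    --output text
}

# Print the result rows of a finished query. Argument: query execution ID.
get_results() {
  aws athena get-query-results --query-execution-id "$1" --output table
}
```

For example, `id=$(run_query <GlueDatabaseName> <WorkGroupName> 'SELECT * FROM raw_events LIMIT 10;')` followed by `get_results "$id"`. Athena runs queries asynchronously, so the results are only available once `aws athena get-query-execution --query-execution-id "$id"` reports the state SUCCEEDED.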

Zones

Zone	Bucket	Purpose
Raw	`*-raw`	Landing — immutable original data
Processed	`*-processed`	Cleaned, validated, partitioned
Curated	`*-curated`	Business-ready aggregates

Prerequisites

Node.js ≥ 20
AWS CLI configured (aws configure)
CDK bootstrapped (npx cdk bootstrap)
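
The three prerequisites can be checked up front with a short script. This is a sketch under stated assumptions: `node_version_ok` and `check_prereqs` are hypothetical names, and the bootstrap check assumes the default bootstrap stack name `CDKToolkit`, which differs if you bootstrapped with a custom name.

```shell
# Succeed if a Node.js version string (e.g. "v20.11.0") has major version 20 or newer.
node_version_ok() {
  major="${1#v}"        # strip the leading "v"
  major="${major%%.*}"  # keep only the major component
  [ "$major" -ge 20 ]
}

# Print one message per missing prerequisite; prints nothing if all checks pass.
check_prereqs() {
  node_version_ok "$(node --version)" \
    || echo "Node.js 20 or newer is required"
  aws sts get-caller-identity >/dev/null 2>&1 \
    || echo "AWS CLI credentials are not configured"
  # Assumption: default bootstrap stack name in the current account and region
  aws cloudformation describe-stacks --stack-name CDKToolkit >/dev/null 2>&1 \
    || echo "CDK bootstrap stack (CDKToolkit) not found"
}
```

Running `check_prereqs` before deploying surfaces a missing credential or bootstrap stack earlier than `npx cdk deploy` would.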

Deploy

npm install
npx cdk diff
npx cdk deploy
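
After deployment, the values used elsewhere in this README (such as the stream name and the Glue database name) can be read back from the stack outputs. A minimal sketch, assuming the stack exposes them as CloudFormation outputs. `stack_output` is a hypothetical helper, and the stack name is a placeholder for whatever name `npx cdk deploy` reports.

```shell
# Print the value of a single CloudFormation stack output.
# Arguments: stack name (placeholder you supply), output key.
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
    --output text
}
```

For example, `stack_output <StackName> GlueDatabaseName` would print the database to select in Athena, provided the output key matches exactly.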

Tear down

npx cdk destroy