Data Lake Starter

advanced

S3-based data lake with raw, processed, and curated zones, a Glue catalog with a daily crawler, an Athena WorkGroup, and a Kinesis Data Stream + Firehose ingestion pipeline. An ETL Lambda handles the raw-to-processed transformation.

Quick Start

Via CLI (recommended)

npx cdk-starter create

Then select "Data Lake Starter" from the prompt

Or scaffold directly

npx cdk-starter create --starter data-lake

README

S3-based data lake with raw, processed, and curated storage zones, a Glue data catalog with a daily crawler, an Athena WorkGroup, and a Kinesis Data Stream + Firehose ingestion pipeline.

Ingest data

aws kinesis put-record \
  --stream-name <IngestStreamName> \
  --data "$(printf '%s' '{"userId":"123","event":"page_view"}' | base64)" \
  --partition-key user123
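
The encoding step is easy to get wrong, so a small wrapper can make it explicit. The sketch below assumes bash and AWS CLI v2, which by default expects blob parameters such as --data to be base64-encoded (v1 encodes raw input itself, so pre-encoding there would double-encode). The function names encode_event and put_event are illustrative and not part of the starter.

```shell
#!/usr/bin/env bash

# Encode a JSON event as single-line base64.
# GNU base64 wraps long output at 76 columns, so strip the newlines.
encode_event() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Send one event to the stream.
# Arguments: stream name, partition key, JSON payload.
put_event() {
  local stream_name="$1" partition_key="$2" json="$3"
  aws kinesis put-record \
    --stream-name "$stream_name" \
    --partition-key "$partition_key" \
    --data "$(encode_event "$json")"
}

# Usage (replace the placeholder with the deployed stream name):
#   put_event "<IngestStreamName>" user123 '{"userId":"123","event":"page_view"}'
```

Records that share a partition key land on the same shard, so a per-user key such as user123 keeps one user's events in order.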

Query with Athena

  1. Run the Glue crawler to populate the catalog
  2. Open Athena in the console and select the database named by the GlueDatabaseName stack output
  3. Query the raw zone: SELECT * FROM raw_events LIMIT 10;
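
The same steps can be scripted instead of run from the console. The following is a minimal sketch, assuming bash, AWS CLI v2, and an Athena WorkGroup that already has a query result location configured. The crawler and workgroup names are placeholders (this README only names the GlueDatabaseName output), and the function names are illustrative.

```shell
#!/usr/bin/env bash

# Start the crawler, then poll until it returns to the READY state.
run_crawler() {
  local crawler_name="$1"
  aws glue start-crawler --name "$crawler_name"
  until [ "$(aws glue get-crawler --name "$crawler_name" \
      --query 'Crawler.State' --output text)" = "READY" ]; do
    sleep "${POLL_SECONDS:-15}"
  done
}

# Submit a query to Athena and print its QueryExecutionId.
start_query() {
  local database="$1" workgroup="$2" sql="$3"
  aws athena start-query-execution \
    --query-string "$sql" \
    --query-execution-context "Database=$database" \
    --work-group "$workgroup" \
    --query 'QueryExecutionId' --output text
}

# Poll until the query leaves QUEUED/RUNNING, then print its final state.
wait_for_query() {
  local query_id="$1" state
  while :; do
    state="$(aws athena get-query-execution --query-execution-id "$query_id" \
      --query 'QueryExecution.Status.State' --output text)"
    case "$state" in
      QUEUED|RUNNING) sleep "${POLL_SECONDS:-2}" ;;
      *) printf '%s\n' "$state"; return ;;
    esac
  done
}

# Usage (all three names are placeholders for your deployed resources):
#   run_crawler "<CrawlerName>"
#   id="$(start_query "<GlueDatabaseName>" "<WorkGroupName>" 'SELECT * FROM raw_events LIMIT 10;')"
#   [ "$(wait_for_query "$id")" = "SUCCEEDED" ] && \
#     aws athena get-query-results --query-execution-id "$id"
```

The crawler must finish before the query runs; otherwise the raw_events table may not exist in the catalog yet.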

Zones

Zone        Bucket         Purpose
Raw         *-raw          Landing: immutable original data
Processed   *-processed    Cleaned, validated, partitioned
Curated     *-curated      Business-ready aggregates
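
To illustrate what "partitioned" can mean in the processed zone, here is a sketch of a Hive-style key prefix, which Glue crawlers and Athena recognize as partition columns. The year/month/day layout and the dataset name events are assumptions for illustration; the layout the ETL Lambda actually writes is not specified here.

```shell
#!/usr/bin/env bash

# Build a Hive-style partition prefix from a dataset name and an ISO date (YYYY-MM-DD).
# The year=/month=/day= layout is an assumed convention, not the starter's confirmed layout.
partition_prefix() {
  local dataset="$1" date="$2"
  local year month day
  IFS=- read -r year month day <<< "$date"
  printf '%s/year=%s/month=%s/day=%s/\n' "$dataset" "$year" "$month" "$day"
}

partition_prefix events 2024-03-05
# prints: events/year=2024/month=03/day=05/
```

With keys shaped like this, a query that filters on year, month, and day only scans the matching prefixes.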

Prerequisites

  • Node.js ≥ 20
  • AWS CLI configured (aws configure)
  • CDK bootstrapped (npx cdk bootstrap)

Deploy

npm install
npx cdk diff
npx cdk deploy
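
After deployment, values such as IngestStreamName and GlueDatabaseName can be read back from the stack outputs rather than copied from the console. This is a sketch assuming bash and that the starter exposes them as CloudFormation outputs under those keys; the stack name below is a placeholder, and stack_output is an illustrative helper name.

```shell
#!/usr/bin/env bash

# Print the value of one CloudFormation output for a deployed stack.
# Arguments: stack name, output key.
stack_output() {
  local stack_name="$1" output_key="$2"
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query "Stacks[0].Outputs[?OutputKey=='${output_key}'].OutputValue" \
    --output text
}

# Usage (replace <StackName> with the name shown by `npx cdk deploy`):
#   STREAM_NAME="$(stack_output "<StackName>" IngestStreamName)"
#   DATABASE="$(stack_output "<StackName>" GlueDatabaseName)"
```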

Tear down

npx cdk destroy