Data Lake Starter
advanced
S3-based data lake with raw, processed, and curated zones, a Glue catalog with a daily crawler, an Athena WorkGroup, and a Kinesis Data Stream + Firehose ingestion pipeline. Includes an ETL Lambda for raw-to-processed transformation.
Quick Start
Via CLI (recommended)
npx cdk-starter create
Then select "Data Lake Starter" from the prompt.
Or scaffold directly
npx cdk-starter create --starter data-lake
README
Data Lake Starter
S3-based data lake with raw, processed, and curated storage zones, Glue data catalog with daily crawler, Athena WorkGroup, and a Kinesis Firehose ingestion pipeline.
Ingest data
aws kinesis put-record \
--stream-name <IngestStreamName> \
--data "$(echo -n '{"userId":"123","event":"page_view"}' | base64)" \
--partition-key user123
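When sending more than one test event, the encoding step can be pulled into a small helper. The following is a minimal sketch, assuming AWS CLI v2 (which by default expects base64 input for `--data`) and a `STREAM_NAME` variable that you set to the value shown as `<IngestStreamName>` above. The function names `encode_event` and `put_event` are illustrative and not part of the starter.

```shell
#!/usr/bin/env bash
# Base64-encode a JSON string on a single line
# (some base64 builds wrap long output, so newlines are stripped).
encode_event() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Put one event on the stream. STREAM_NAME is an assumed variable holding
# the value shown as <IngestStreamName> above.
put_event() {
  local json="$1" key="$2"
  aws kinesis put-record \
    --stream-name "$STREAM_NAME" \
    --data "$(encode_event "$json")" \
    --partition-key "$key"
}
```

After exporting `STREAM_NAME`, a call such as `put_event '{"userId":"123","event":"page_view"}' user123` sends the same record as the command above.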
Query with Athena
- Run the Glue crawler to populate the catalog
- Open Athena in the console and select the database from the `GlueDatabaseName` output
- Query the raw zone:
SELECT * FROM raw_events LIMIT 10;
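The same steps can also be driven from the AWS CLI instead of the console. This is a sketch under stated assumptions: `CRAWLER_NAME`, `WORKGROUP`, and `GLUE_DATABASE` are placeholder variables whose real values come from your deployed stack, and the `run` wrapper with its `DRY_RUN` switch is a hypothetical helper added here so the commands can be inspected before they are executed.

```shell
#!/usr/bin/env bash
# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# CRAWLER_NAME, WORKGROUP and GLUE_DATABASE are assumed variables;
# take the real values from your deployed stack.
start_crawler() {
  run aws glue start-crawler --name "$CRAWLER_NAME"
}

# Submit a SQL statement to Athena in the given workgroup and database.
run_query() {
  run aws athena start-query-execution \
    --query-string "$1" \
    --work-group "$WORKGROUP" \
    --query-execution-context "Database=$GLUE_DATABASE"
}
```

`start-query-execution` returns a `QueryExecutionId` rather than rows; the results can be fetched afterwards with `aws athena get-query-results --query-execution-id <id>`. The crawler has to finish before the table is available to query.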
Zones
| Zone | Bucket | Purpose |
|---|---|---|
| Raw | `*-raw` | Landing zone: immutable original data |
| Processed | `*-processed` | Cleaned, validated, partitioned |
| Curated | `*-curated` | Business-ready aggregates |
Prerequisites
- Node.js ≥ 20
- AWS CLI configured (`aws configure`)
- CDK bootstrapped (`npx cdk bootstrap`)
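The prerequisites above can be checked with a short preflight script before deploying. This is only a sketch: the function names `node_version_ok` and `preflight` are hypothetical, and the check covers tool presence and the Node.js major version, not whether credentials or the CDK bootstrap are actually in place.

```shell
#!/usr/bin/env bash
# Succeeds when a Node.js version string such as "v20.11.0" has major >= 20.
node_version_ok() {
  local major="${1#v}"   # drop the leading "v"
  major="${major%%.*}"   # keep the part before the first dot
  [ "$major" -ge 20 ]
}

# Report whether each required tool is on PATH and Node.js is new enough.
preflight() {
  local tool
  for tool in node npx aws; do
    command -v "$tool" >/dev/null || { echo "missing: $tool"; return 1; }
  done
  node_version_ok "$(node --version)" || { echo "Node.js >= 20 required"; return 1; }
  echo "ok"
}
```

Running `preflight` prints `ok` when all three tools are found and Node.js is recent enough. To confirm that credentials work, `aws sts get-caller-identity` is a separate check worth running.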
Deploy
npm install
npx cdk diff
npx cdk deploy
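`cdk deploy` prints the stack outputs when it finishes. To read an output such as `GlueDatabaseName` again later, the values can be queried from CloudFormation. The sketch below assumes you know the deployed stack name (this README does not state it); `output_query` and `stack_output` are illustrative helper names.

```shell
#!/usr/bin/env bash
# Build the JMESPath expression that selects one output value by key.
output_query() {
  printf "Stacks[0].Outputs[?OutputKey=='%s'].OutputValue" "$1"
}

# Print one output of a stack. The stack name is passed in as an argument
# because it is not given in this README.
stack_output() {
  local stack="$1" key="$2"
  aws cloudformation describe-stacks \
    --stack-name "$stack" \
    --query "$(output_query "$key")" \
    --output text
}
```

For example, `stack_output <YourStackName> GlueDatabaseName` prints the database name to select in Athena.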
Tear down
npx cdk destroy