Open Data Stack
Overview
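Open Data Stack is an open-source data engineering demonstration that implements batch and streaming pipelines on real financial market data pulled from Yahoo Finance through the yfinance library. It showcases production-ready ETL patterns in a dual-path architecture: a batch path (Airflow into DuckDB) and a streaming path (Kafka into Spark), both feeding Superset dashboards.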
Architecture
text
+==============================================================================+
|                               OPEN DATA STACK                                |
+==============================================================================+
|                                                                              |
|            +-------------+                      +-------------+              |
|            |  yfinance   |                      |  yfinance   |              |
|            | (Stock API) |                      | (Stock API) |              |
|            +------+------+                      +------+------+              |
|                   |                                    |                     |
|                   v                                    v                     |
|            +-------------+                      +-------------+              |
|            |   Airflow   |                      |    Kafka    |              |
|            |   (Batch)   |                      | (Streaming) |              |
|            +------+------+                      +------+------+              |
|                   |                                    |                     |
|                   v                                    v                     |
|            +-------------+                      +-------------+              |
|            |   DuckDB    |                      |    Spark    |              |
|            | (Warehouse) |                      | (Processing)|              |
|            +------+------+                      +------+------+              |
|                   |                                    |                     |
|                   +----------------+-------------------+                     |
|                                    |                                         |
|                                    v                                         |
|                             +-------------+                                  |
|                             |  Superset   |                                  |
|                             | (Dashboards)|                                  |
|                             +-------------+                                  |
|                                                                              |
+==============================================================================+
Key Features
- Dual-Path Architecture - Supports both batch (data warehouse) and streaming pipelines
- Real-Time Data - Pulls live stock data from the Yahoo Finance API for five tickers (AAPL, GOOGL, MSFT, AMZN, META)
- Complete Observability - Pre-built Superset dashboards for monitoring
- Single-Command Deploy - Full stack via Docker Compose
- Production Patterns - Comprehensive test suite (73 tests)
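To make the dual-path idea concrete, here is a minimal, dependency-free sketch of one stream of quotes feeding both a batch path and a streaming path. It is not the project's code: the `Quote`, `BatchPath`, and `StreamPath` names are hypothetical, plain Python containers stand in for DuckDB and Kafka/Spark, and the prices are made up for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """One price observation. Field names are hypothetical."""
    symbol: str
    ts: str       # ISO-8601 timestamp
    price: float


class BatchPath:
    """Buffers quotes, then aggregates them in one pass (warehouse-style)."""

    def __init__(self):
        self.rows = []

    def ingest(self, quote):
        self.rows.append(quote)

    def daily_close(self):
        # Last price seen per (symbol, date); assumes rows arrive in time order.
        closes = {}
        for q in self.rows:
            closes[(q.symbol, q.ts[:10])] = q.price
        return closes


class StreamPath:
    """Updates a rolling mean per symbol as each quote arrives."""

    def __init__(self, window=3):
        self.windows = defaultdict(lambda: deque(maxlen=window))

    def ingest(self, quote):
        w = self.windows[quote.symbol]
        w.append(quote.price)
        return sum(w) / len(w)


# Made-up prices for illustration only; not real market data.
quotes = [
    Quote("AAPL", "2024-01-02T15:58:00", 100.0),
    Quote("AAPL", "2024-01-02T15:59:00", 102.0),
    Quote("MSFT", "2024-01-02T15:59:00", 200.0),
    Quote("AAPL", "2024-01-02T16:00:00", 104.0),
]

batch, stream = BatchPath(), StreamPath(window=2)
rolling = []
for q in quotes:
    batch.ingest(q)                    # batch path: store now, aggregate later
    rolling.append(stream.ingest(q))   # streaming path: one result per event

closes = batch.daily_close()
```

The batch path only produces an answer when `daily_close()` is called over everything stored so far, while the streaming path returns an updated value on every event. That trade-off between completeness and latency is what the two paths in the architecture are meant to show.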
Tech Stack
- Data Source - yfinance (Yahoo Finance API)
- Message Queue - Apache Kafka
- Stream Processing - Apache Spark
- Orchestration - Apache Airflow
- Data Warehouse - DuckDB
- Visualization - Apache Superset
- Processing - Pandas, PySpark
Quick Start
bash
# Clone and start
git clone https://github.com/AlharbiAbdullah/open_data_stack
cd open_data_stack
docker-compose up --build -d
# With Docker Compose v2, the equivalent is: docker compose up --build -d
# Access services
# Airflow: http://localhost:8080
# Superset: http://localhost:8088
# Kafka UI: http://localhost:8082
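Once the containers are starting, it can help to confirm that the three web UIs listed above are actually answering. The following is a small sketch using only the Python standard library; the `is_up` and `check_all` helpers are hypothetical and not part of the repository, and the URLs are simply the ones from the Quick Start. Any HTTP response, including a login redirect or an error status, is treated as "up", since it shows the server is listening.

```python
import urllib.error
import urllib.request

# URLs taken from the Quick Start section.
SERVICES = {
    "Airflow": "http://localhost:8080",
    "Superset": "http://localhost:8088",
    "Kafka UI": "http://localhost:8082",
}


def is_up(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return True if the service answers with any HTTP response."""
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # An error status (e.g. 401 or 404) still means the server is listening.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def check_all(services=SERVICES, opener=urllib.request.urlopen):
    """Map each service name to True (reachable) or False (not reachable)."""
    return {name: is_up(url, opener=opener) for name, url in services.items()}
```

Calling `check_all()` returns a dictionary mapping each service name to a boolean. Services such as Airflow and Superset can take a while to initialize after `docker-compose up`, so a `False` shortly after startup may just mean the container is still booting.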