# Reference Architecture: Data Pipelines
Status: Proposed | Date: 2025-01-28
## When to Use This Pattern
Use when building:
- Analytics and business intelligence reporting
- Data integration between organisational systems
- Automated data processing and transformation workflows
## Overview
This template implements scalable data pipelines using Ibis for portable dataframe operations and DuckLake for lakehouse storage. The approach prioritises simplicity and portability: write transformations once in Python and run them anywhere, from a laptop to a cloud warehouse.
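To make this concrete, the sketch below shows a transformation written with Ibis and executed on a local DuckDB backend. The Parquet path and column names are hypothetical placeholders, not part of the template.

```python
import ibis

# Connect to an in-memory DuckDB instance; no infrastructure required.
con = ibis.duckdb.connect()

# Register a local Parquet file as a table (path and columns are hypothetical).
orders = con.read_parquet("orders.parquet")

# Express the transformation once as a backend-agnostic Ibis expression.
summary = (
    orders.group_by("customer_id")
    .aggregate(
        total_spend=orders.amount.sum(),
        order_count=orders.count(),
    )
    .order_by(ibis.desc("total_spend"))
)

# Execution is deferred until results are requested.
df = summary.to_pandas()
```

Because the expression is backend-agnostic, the same `summary` definition can later be executed against a warehouse connection without being rewritten.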
## Core Components
Key Technologies:
| Component | Tool | Purpose |
|---|---|---|
| Transformation | Ibis | Portable Python dataframe API - same code runs on DuckDB, PostgreSQL, or cloud warehouses |
| Local Engine | DuckDB | Fast analytical queries, runs anywhere without infrastructure |
| Lakehouse | DuckLake | Open table format over S3 with ACID transactions |
| Reporting | Quarto | Static reports from notebooks, version-controlled |
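The lakehouse layer can be exercised locally before anything is deployed to the cloud. The sketch below assumes the DuckLake DuckDB extension with its `ATTACH ... (DATA_PATH ...)` option and DuckDB's credential-chain S3 secrets; the catalog file, bucket path, and table names are illustrative placeholders.

```python
import duckdb

con = duckdb.connect()

# DuckLake ships as a DuckDB extension.
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Resolve S3 credentials from the standard AWS credential chain.
con.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN)")

# Attach a DuckLake catalog whose table data lives in S3
# ('metadata.ducklake' and the bucket path are placeholders).
con.execute("""
    ATTACH 'ducklake:metadata.ducklake' AS lake
        (DATA_PATH 's3://example-lakehouse-bucket/data/')
""")

# Writes go through DuckLake's ACID-transactional table format.
con.execute("""
    CREATE TABLE lake.events AS
    SELECT * FROM read_parquet('events.parquet')
""")
```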
## Project Kickoff Steps
### Foundation Setup
- Apply Isolation - Follow ADR 001: Application Isolation for data processing network separation
- Deploy Infrastructure - Follow ADR 002: AWS EKS for Cloud Workloads for container deployment
- Configure Infrastructure - Follow ADR 010: Infrastructure as Code for reproducible provisioning
- Set Up Storage - Follow ADR 018: Database Patterns for S3 and DuckLake configuration
### Security & Operations
- Configure Secrets - Follow ADR 005: Secrets Management for data source credentials
- Set Up Logging - Follow ADR 007: Centralised Security Logging for audit trails
- Set Up Backups - Follow ADR 014: Object Storage Backups for data backups
- Data Governance - Follow ADR 015: Data Governance Standards for data quality
### Development Process
- Configure CI/CD - Follow ADR 004: CI/CD Quality Assurance for automated testing
- Set Up Releases - Follow ADR 009: Release Documentation Standards for versioning
- Analytics Tools - Follow ADR 017: Analytics Tooling Standards for Quarto integration
## Implementation Details
Why Ibis + DuckDB:
- Portability: Write transformations in Python and run them on any backend (DuckDB locally; PostgreSQL, BigQuery, or Snowflake elsewhere), as sketched after this list
- Simplicity: No complex orchestration infrastructure required for most workloads
- Performance: DuckDB handles analytical queries on datasets up to hundreds of gigabytes on a single machine
- Cost: Run development and small-to-medium workloads without incurring cloud compute costs
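As a rough illustration of the portability point above, and of the scaling path discussed next, the sketch below defines one transformation and runs it on two backends; the Parquet path and PostgreSQL connection details are placeholders.

```python
import ibis


def daily_revenue(orders):
    # One definition of the transformation, reused across backends.
    return (
        orders.group_by("order_date")
        .aggregate(revenue=orders.amount.sum())
        .order_by("order_date")
    )


# Local development: in-memory DuckDB over a sample Parquet file.
local = ibis.duckdb.connect()
print(daily_revenue(local.read_parquet("orders.parquet")).to_pandas())

# Shared workloads: the same function against PostgreSQL (placeholder credentials).
pg = ibis.postgres.connect(
    host="db.internal", user="analytics", password="change-me", database="warehouse"
)
print(daily_revenue(pg.table("orders")).to_pandas())
```

When data outgrows a single machine, the same function can be pointed at a BigQuery or Snowflake backend; only the connection line changes.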
When to Scale Beyond DuckDB:
- Data exceeds available memory/disk on a single node
- Need concurrent writes from multiple processes
- Require real-time streaming ingestion
Data Quality:
- Use Ibis expressions for data validation (see the example after this list)
- Implement schema checks in the CI/CD pipeline
- Track data lineage through transformation code in git
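A minimal sketch of what these checks could look like as a CI/CD step, assuming a hypothetical `orders.parquet` dataset and column names:

```python
import ibis

con = ibis.duckdb.connect()
orders = con.read_parquet("orders.parquet")  # hypothetical dataset

# Schema check: fail fast if expected columns are missing.
expected = {"order_id", "customer_id", "amount", "order_date"}
missing = expected - set(orders.columns)
assert not missing, f"missing columns: {missing}"

# Row-level checks expressed as Ibis expressions, pushed down to the backend.
checks = orders.aggregate(
    null_customer_ids=orders.customer_id.isnull().sum(),
    negative_amounts=(orders.amount < 0).sum(),
)
row = checks.to_pandas().iloc[0]
assert row["null_customer_ids"] == 0, "orders without a customer_id"
assert row["negative_amounts"] == 0, "orders with negative amounts"
```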
Cost Optimisation:
- Run DuckDB locally for development and testing (zero infrastructure cost)
- Use S3 Intelligent-Tiering for automatic tiering and archival of infrequently accessed objects (see the sketch after this list)
- Scale to cloud warehouses only when data volume requires it
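In practice a lifecycle rule like this belongs in the IaC stack from ADR 010: Infrastructure as Code; the boto3 sketch below only illustrates its shape, with a placeholder bucket name.

```python
import boto3

s3 = boto3.client("s3")

# Transition new objects to Intelligent-Tiering immediately after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-lakehouse-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to all objects
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"},
                ],
            }
        ]
    },
)
```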