# Reference Architecture: Data Pipelines
Status: Proposed | Date: 2025-01-28
## When to Use This Pattern
Use when building:
- Analytics and business intelligence reporting
- Data integration between organisational systems
- Batch data processing and transformation workflows
- Small to medium data products that should start simple and scale later
Do not use this pattern as a default for low-latency transactional APIs or streaming systems with sub-second processing requirements.
## Overview
Build data pipelines as version-controlled transformation code over an object-storage datalake. Keep storage and table formats separate from the access layer so teams can start with DuckDB/DuckLake and add S3 Tables, Trino, or other Iceberg-compatible engines when concurrency or scale requires it.
## Core Components

```mermaid
flowchart LR
    sources[Data Sources]
    transform[Ibis Transformations]
    storage[Object Storage + Open Tables]
    access[DuckDB / S3 Tables / Trino]
    output[Reports & APIs]
    sources -->|extract + validate| transform
    transform -->|load curated data| storage
    storage -->|query| access
    access -->|serve| output
```
Key Technologies:
| Component | Tool | Purpose |
|---|---|---|
| Transformation | Ibis | Portable Python dataframe API for transformations across local and cloud engines |
| Local Access | DuckDB + DuckLake | Lightweight client and lakehouse access for development, scheduled jobs, and smaller workloads |
| Serverless Tables | Amazon S3 Tables | Managed Apache Iceberg table storage and maintenance for AWS workloads |
| Distributed Query | Trino or equivalent | Concurrent and larger-scale SQL access to Iceberg tables |
| Reporting | Quarto | Static reports and dashboards from version-controlled notebooks |
## Project Kickoff Steps
### Foundation Setup
- Apply Isolation - Follow ADR 001: Application Isolation for data processing network and account boundaries
- Deploy Workloads - Follow ADR 002: AWS EKS for Cloud Workloads for scheduled pipeline jobs when local or CI execution is not sufficient
- Configure Infrastructure - Follow ADR 010: Infrastructure as Code for buckets, table resources, permissions, and deployment environments
- Set Up Storage and Access - Follow ADR 018: Database Patterns for object storage, DuckLake, S3 Tables, and Iceberg-compatible access layers
### Security & Operations
- Configure Secrets - Follow ADR 005: Secrets Management for source system credentials and scoped storage access
- Set Up Logging - Follow ADR 007: Centralised Security Logging for pipeline runs, data access, and failures
- Set Up Backups - Follow ADR 014: Object Storage Backups for datalake backup, replication, and recovery objectives
- Apply Data Governance - Follow ADR 015: Data Governance Standards for ownership, quality, classification, and retention
### Development Process
- Configure CI/CD - Follow ADR 004: CI/CD Quality Assurance for automated testing and deployment
- Set Up Releases - Follow ADR 009: Release Standards for versioned pipeline changes, release notes, promotion, and data-impact notes
- Publish Analytics - Follow ADR 017: Analytics Tooling Standards for Quarto reports and dashboards
## Implementation Details
Access Layer Selection:
- Use DuckDB + DuckLake for local development, scheduled jobs, notebooks, and simpler analytical workloads
- Use S3 Tables for managed Iceberg tables when AWS-managed table maintenance, catalog integration, or multi-engine access is required
- Use Trino or another Iceberg-compatible query engine when many users or services need concurrent SQL access
- Keep transformation logic in Ibis where practical so the same code can move between access layers
Data Quality:
- Validate schemas and business rules during ingestion and transformation
- Run schema and sample-data checks in CI/CD
- Track lineage through transformation code, table names, and release notes
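As a concrete example of the first two points, an ingestion-time check can reject rows that break the expected schema or a business rule before they reach curated storage. A minimal pure-Python sketch with a hypothetical schema and rule:

```python
# Hypothetical source contract: column name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": int, "region": str, "amount": float}


def validate_row(row):
    """Return a list of problems for one source record (empty means valid)."""
    problems = []
    for col, typ in EXPECTED_SCHEMA.items():
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            problems.append(f"{col}: expected {typ.__name__}")
    # Illustrative business rule: monetary amounts must be non-negative.
    if isinstance(row.get("amount"), float) and row["amount"] < 0:
        problems.append("amount: must be non-negative")
    return problems


def validate_batch(rows):
    """Split a batch into rows to load and rejects with their problems."""
    valid, rejects = [], []
    for row in rows:
        problems = validate_row(row)
        if problems:
            rejects.append((row, problems))
        else:
            valid.append(row)
    return valid, rejects
```

The same checks can run against a sample of data in CI/CD, so a schema drift in a source system fails a pipeline change before deployment rather than in production.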
Operations:
- Partition tables by common query and retention boundaries
- Use lifecycle policies and replication for backup and cost control
- Start with the simplest access layer and add distributed query only when measured concurrency or scale requires it
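For the partitioning point, Iceberg-compatible engines let you declare partition transforms when the table is created, so queries and retention jobs can prune data by those boundaries. An illustrative Trino DDL sketch, where the `iceberg` catalog, `analytics` schema, and column names are placeholders:

```sql
-- Partition by day of the event timestamp plus region, matching common
-- query filters and retention boundaries (names are illustrative).
CREATE TABLE iceberg.analytics.events (
    event_id BIGINT,
    event_ts TIMESTAMP(6),
    region   VARCHAR,
    payload  VARCHAR
)
WITH (
    partitioning = ARRAY['day(event_ts)', 'region']
);
```

Because Iceberg stores the partition spec in table metadata, the same partitioned table remains readable from DuckDB, S3 Tables, or Trino without engine-specific layout changes.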