# Reference Architecture: Data Pipelines

Status: Proposed | Date: 2025-01-28

## When to Use This Pattern

Use when building:

- Analytics and business intelligence reporting
- Data integration between organisational systems
- Batch data processing and transformation workflows
- Small to medium data products that should start simple and scale later

Do not use this pattern as a default for low-latency transactional APIs or streaming systems with sub-second processing requirements.

## Overview

Build data pipelines as version-controlled transformation code over an object-storage datalake. Keep storage and table formats separate from the access layer so teams can start with DuckDB/DuckLake and add S3 Tables, Trino, or other Iceberg-compatible engines when concurrency or scale requires it.
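
As a minimal sketch of that shape (the table, column, and file names here are illustrative, not part of the standard):

```python
import ibis

# Local DuckDB as the starting access layer; the same Ibis code can later
# target an Iceberg-compatible engine without rewriting transformations.
con = ibis.duckdb.connect("warehouse.duckdb")

# Extract: hypothetical raw export landed as Parquet files.
raw = con.read_parquet("raw/orders/*.parquet")

# Transform: validate and shape the data with version-controlled Ibis expressions.
curated = (
    raw.filter(raw.amount > 0)
    .group_by("order_date")
    .aggregate(revenue=raw.amount.sum(), order_count=raw.count())
)

# Load: materialise the curated table for the access layer to serve.
con.create_table("daily_revenue", curated, overwrite=True)
```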

## Core Components

```mermaid
flowchart LR
    sources[Data Sources]
    transform[Ibis Transformations]
    storage[Object Storage + Open Tables]
    access[DuckDB / S3 Tables / Trino]
    output[Reports & APIs]

    sources -->|extract + validate| transform
    transform -->|load curated data| storage
    storage -->|query| access
    access -->|serve| output
```

**Key Technologies:**

| Component | Tool | Purpose |
| --- | --- | --- |
| Transformation | Ibis | Portable Python dataframe API for transformations across local and cloud engines |
| Local Access | DuckDB + DuckLake | Lightweight client and lakehouse access for development, scheduled jobs, and smaller workloads |
| Serverless Tables | Amazon S3 Tables | Managed Apache Iceberg table storage and maintenance for AWS workloads |
| Distributed Query | Trino or equivalent | Concurrent and larger-scale SQL access to Iceberg tables |
| Reporting | Quarto | Static reports and dashboards from version-controlled notebooks |
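
For the local end of that table, a minimal sketch of attaching a DuckLake catalog from Python, assuming a recent DuckDB with the ducklake extension available (the metadata file and bucket names are hypothetical):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake;")

# Catalog metadata lives in a local DuckLake file, while table data is
# written to object storage under DATA_PATH (names here are illustrative).
con.execute(
    "ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://example-datalake/')"
)
con.execute("USE lake")
con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount DECIMAL(18, 2))")
```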

## Project Kickoff Steps

### Foundation Setup

1. **Apply Isolation** - Follow ADR 001: Application Isolation for data processing network and account boundaries
2. **Deploy Workloads** - Follow ADR 002: AWS EKS for Cloud Workloads for scheduled pipeline jobs when local or CI execution is not sufficient
3. **Configure Infrastructure** - Follow ADR 010: Infrastructure as Code for buckets, table resources, permissions, and deployment environments
4. **Set Up Storage and Access** - Follow ADR 018: Database Patterns for object storage, DuckLake, S3 Tables, and Iceberg-compatible access layers

### Security & Operations

1. **Configure Secrets** - Follow ADR 005: Secrets Management for source system credentials and scoped storage access (a retrieval sketch follows this list)
2. **Set Up Logging** - Follow ADR 007: Centralised Security Logging for pipeline runs, data access, and failures
3. **Set Up Backups** - Follow ADR 014: Object Storage Backups for datalake backup, replication, and recovery objectives
4. **Apply Data Governance** - Follow ADR 015: Data Governance Standards for ownership, quality, classification, and retention
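
For the secrets step, a sketch of how a pipeline job might read source credentials at runtime with boto3 (the secret name and key layout are hypothetical):

```python
import json

import boto3

# Fetch source credentials at runtime instead of embedding them in code.
secrets = boto3.client("secretsmanager")
response = secrets.get_secret_value(SecretId="pipelines/source-db")  # hypothetical name
credentials = json.loads(response["SecretString"])

# Pass the narrowest-scoped credentials on to the source system client.
user, password = credentials["username"], credentials["password"]
```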

### Development Process

1. **Configure CI/CD** - Follow ADR 004: CI/CD Quality Assurance for automated testing and deployment (see the test sketch after this list)
2. **Set Up Releases** - Follow ADR 009: Release Standards for versioned pipeline changes, release notes, promotion, and data-impact notes
3. **Publish Analytics** - Follow ADR 017: Analytics Tooling Standards for Quarto reports and dashboards
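
For the CI/CD step, a sketch of the kind of unit test that can run against in-memory data on every commit (the transformation and values are illustrative):

```python
import ibis

def test_daily_revenue_excludes_incomplete_orders():
    # ibis.memtable builds an in-memory table that executes on the default
    # DuckDB backend, so the test needs no external infrastructure.
    orders = ibis.memtable(
        {
            "status": ["complete", "cancelled"],
            "order_date": ["2025-01-01", "2025-01-01"],
            "amount": [10.0, 5.0],
        }
    )
    result = (
        orders.filter(orders.status == "complete")
        .aggregate(revenue=orders.amount.sum())
        .to_pandas()
    )
    assert result["revenue"].iloc[0] == 10.0
```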

## Implementation Details

**Access Layer Selection:**

- Use DuckDB + DuckLake for local development, scheduled jobs, notebooks, and simpler analytical workloads
- Use S3 Tables for managed Iceberg tables when AWS-managed table maintenance, catalog integration, or multi-engine access is required
- Use Trino or another Iceberg-compatible query engine when many users or services need concurrent SQL access
- Keep transformation logic in Ibis where practical so the same code can move between access layers (see the sketch after this list)
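
A sketch of what that portability looks like in practice, with the transformation written once against whichever connection is supplied (connection parameters are illustrative):

```python
import ibis

def daily_revenue(con):
    # Identical transformation logic regardless of the engine behind `con`.
    orders = con.table("orders")
    return (
        orders.filter(orders.status == "complete")
        .group_by("order_date")
        .aggregate(revenue=orders.amount.sum())
    )

# Start on DuckDB for development and smaller workloads...
local = daily_revenue(ibis.duckdb.connect("warehouse.duckdb"))

# ...and later point the same function at a distributed engine, e.g.:
# remote = daily_revenue(
#     ibis.trino.connect(host="trino.internal", database="iceberg", schema="curated")
# )
```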

**Data Quality:**

- Validate schemas and business rules during ingestion and transformation
- Run schema and sample-data checks in CI/CD (a sketch follows this list)
- Track lineage through transformation code, table names, and release notes
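
A sketch of such checks, assuming hypothetical table and column names:

```python
import ibis

EXPECTED = {"order_id": "int64", "order_date": "date", "amount": "float64"}

def check_orders(con):
    orders = con.table("orders")

    # Schema check: fail fast if column names or types drift.
    actual = {name: str(dtype) for name, dtype in orders.schema().items()}
    assert actual == EXPECTED, f"schema drift detected: {actual}"

    # Business-rule checks: no negative amounts, no missing order dates.
    assert orders.filter(orders.amount < 0).count().to_pandas() == 0
    assert orders.filter(orders.order_date.isnull()).count().to_pandas() == 0
```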

**Operations:**

- Partition tables by common query and retention boundaries
- Use lifecycle policies and replication for backup and cost control (a sketch follows this list)
- Start with the simplest access layer and add distributed query only when measured concurrency or scale requires it
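
For the lifecycle bullet, a sketch using boto3 (the bucket name, prefixes, and retention periods are illustrative and should follow ADR 014):

```python
import boto3

s3 = boto3.client("s3")

# Expire raw landing data after 90 days and shift curated data to a
# cheaper storage class once it is no longer hot (values are examples).
s3.put_bucket_lifecycle_configuration(
    Bucket="example-datalake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-raw-landing",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            },
            {
                "ID": "tier-curated",
                "Filter": {"Prefix": "curated/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            },
        ]
    },
)
```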