Reference Architecture: Data Pipelines

Status: Proposed | Date: 2025-01-28

When to Use This Pattern

Use when building:

  • Analytics and business intelligence reporting
  • Data integration between organisational systems
  • Automated data processing and transformation workflows

Overview

This template implements scalable data pipelines using Ibis for portable dataframe operations and DuckLake for lakehouse storage. The approach prioritises simplicity and portability: write transformations once in Python and run them anywhere, from laptops to cloud warehouses.
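
As a rough sketch of the workflow (assuming ibis-framework is installed; the orders.parquet file and its columns are hypothetical), a transformation is written once against a local DuckDB connection and can later be pointed at another backend:

```python
import ibis

# Local DuckDB backend for development; no infrastructure needed.
# The same expression code runs unchanged against other backends,
# e.g. ibis.postgres.connect(...).
con = ibis.duckdb.connect()

# Hypothetical input file, used here purely for illustration.
orders = con.read_parquet("orders.parquet")

# Portable transformation: revenue per customer, highest first.
summary = (
    orders.group_by("customer_id")
    .aggregate(total_revenue=orders.amount.sum())
    .order_by(ibis.desc("total_revenue"))
)

print(summary.execute())  # materialises the result as a pandas DataFrame
```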

Core Components

Key Technologies:

| Component | Tool | Purpose |
| --- | --- | --- |
| Transformation | Ibis | Portable Python dataframe API; the same code runs on DuckDB, PostgreSQL, or cloud warehouses |
| Local Engine | DuckDB | Fast analytical queries; runs anywhere without infrastructure |
| Lakehouse | DuckLake | Open table format over S3 with ACID transactions |
| Reporting | Quarto | Static reports from notebooks, version-controlled |
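
For the lakehouse layer, a minimal sketch of attaching a DuckLake catalogue from Python, assuming the ducklake DuckDB extension is available; the bucket name is hypothetical and the ATTACH options are illustrative, so consult the DuckLake documentation for your setup:

```python
import duckdb

con = duckdb.connect()
con.install_extension("ducklake")
con.load_extension("ducklake")

# Attach a DuckLake catalogue: metadata lives in a SQL database,
# table data is stored as Parquet under the (hypothetical) S3 prefix.
con.sql(
    "ATTACH 'ducklake:metadata.ducklake' AS lake "
    "(DATA_PATH 's3://my-data-lake-bucket/lake/')"
)

# Tables in the attached catalogue get ACID transaction semantics.
con.sql("CREATE TABLE lake.events AS SELECT 1 AS id, 'signup' AS kind")
```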

Project Kickoff Steps

Foundation Setup

  1. Apply Isolation - Follow ADR 001: Application Isolation for network separation of data-processing workloads
  2. Deploy Infrastructure - Follow ADR 002: AWS EKS for Cloud Workloads for container deployment
  3. Configure Infrastructure - Follow ADR 010: Infrastructure as Code for reproducible provisioning
  4. Set Up Storage - Follow ADR 018: Database Patterns for S3 and DuckLake configuration

Security & Operations

  1. Configure Secrets - Follow ADR 005: Secrets Management for data source credentials
  2. Set Up Logging - Follow ADR 007: Centralised Security Logging for audit trails
  3. Set Up Backups - Follow ADR 014: Object Storage Backups for data backup
  4. Data Governance - Follow ADR 015: Data Governance Standards for data quality

Development Process

  1. Configure CI/CD - Follow ADR 004: CI/CD Quality Assurance for automated testing
  2. Set Up Releases - Follow ADR 009: Release Documentation Standards for versioning
  3. Analytics Tools - Follow ADR 017: Analytics Tooling Standards for Quarto integration

Implementation Details

Why Ibis + DuckDB:

  • Portability: Write transformations in Python and run them on any backend (DuckDB locally, PostgreSQL, BigQuery, Snowflake) - see the sketch after this list
  • Simplicity: No complex orchestration infrastructure required for most workloads
  • Performance: DuckDB handles analytical queries on datasets up to hundreds of gigabytes on a single machine
  • Cost: Run development and small-medium workloads without cloud compute costs
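
To illustrate the portability claim, a sketch of a backend-agnostic transformation; the events table and its ts and user_id columns are hypothetical:

```python
import ibis


def monthly_active_users(events):
    # Pure Ibis expression: no backend-specific code, so the same
    # function works on DuckDB, PostgreSQL, BigQuery, or Snowflake.
    return events.group_by(month=events.ts.truncate("M")).aggregate(
        mau=events.user_id.nunique()
    )


# Development: zero-infrastructure local DuckDB.
con = ibis.duckdb.connect()
events = con.read_parquet("events.parquet")  # hypothetical input
print(monthly_active_users(events).execute())

# Production: swap only the connection, e.g.
# con = ibis.postgres.connect(host=..., database=...)
```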

When to Scale Beyond DuckDB:

  • Data exceeds available memory/disk on a single node
  • Need concurrent writes from multiple processes
  • Require real-time streaming ingestion

Data Quality:

  • Use Ibis expressions for data validation (see the sketch after this list)
  • Implement schema checks in CI/CD pipeline
  • Track data lineage through transformation code in git
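
A sketch of validation as Ibis expressions, suitable for running as a quality gate in the CI/CD pipeline; the rules, table, and column names are hypothetical:

```python
import ibis


def order_quality_violations(orders) -> dict:
    # Hypothetical rules: each expression counts rows that violate
    # a constraint (booleans sum as 0/1).
    checks = {
        "null_customer_id": orders.customer_id.isnull().sum(),
        "negative_amount": (orders.amount < 0).sum(),
    }
    # All checks evaluate in a single aggregation query on the backend.
    row = orders.aggregate(**checks).execute().iloc[0]
    return {name: int(count) for name, count in row.items() if count > 0}


con = ibis.duckdb.connect()
orders = con.read_parquet("orders.parquet")  # hypothetical input
violations = order_quality_violations(orders)
assert not violations, f"Data quality failures: {violations}"
```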

Cost Optimisation:

  • Run DuckDB locally for development and testing (zero infrastructure cost)
  • Use S3 Intelligent-Tiering for automatic archival (sketched below)
  • Scale to cloud warehouses only when data volume requires it
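
As a sketch of the archival point, objects can be uploaded with the Intelligent-Tiering storage class via boto3; the bucket and key names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key. INTELLIGENT_TIERING moves objects
# between access tiers automatically based on access patterns,
# with no retrieval fees.
s3.upload_file(
    "daily.parquet",
    "my-data-lake-bucket",
    "lake/reports/daily.parquet",
    ExtraArgs={"StorageClass": "INTELLIGENT_TIERING"},
)
```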