OLake

Fastest open-source tool for replicating databases to Apache Iceberg or a data lakehouse. ⚡ Efficient, quick, and scalable data ingestion for real-time analytics. Supports Postgres, MongoDB, and MySQL.

OLake started with MongoDB and now also supports Postgres and MySQL. Visit olake.io/docs for the full documentation and benchmarks.

The OLake connector ecosystem focuses on these key points:

  • Integrated writers, so that reads are not blocked and data is pushed directly into destinations
  • Connector Autonomy
  • Avoid operations that don't contribute to increasing record throughput
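
The first point can be pictured as a reader that hands records straight to a writer instead of staging them in between. The following is a minimal sketch of that idea only, assuming a buffered channel between the two sides. `Record`, `Writer`, `memoryWriter`, and `pipe` are hypothetical names for illustration and are not OLake's actual API.

```go
package main

import "fmt"

// Record is a hypothetical row or document read from a source.
type Record map[string]any

// Writer is a hypothetical destination interface (not OLake's real one).
type Writer interface {
	Write(r Record) error
}

// memoryWriter stands in for a real destination and keeps rows in memory.
type memoryWriter struct {
	rows []Record
}

func (w *memoryWriter) Write(r Record) error {
	w.rows = append(w.rows, r)
	return nil
}

// result carries the outcome of the writer goroutine back to the reader.
type result struct {
	written int
	err     error
}

// pipe reads records on the calling goroutine and pushes them through a
// buffered channel to a writer goroutine, so a read only waits when the
// buffer is full rather than on every individual write.
func pipe(source []Record, w Writer, buffer int) (int, error) {
	ch := make(chan Record, buffer)
	done := make(chan result, 1)

	go func() {
		var res result
		for r := range ch {
			if res.err != nil {
				continue // keep draining so the reader never deadlocks
			}
			if err := w.Write(r); err != nil {
				res.err = err
				continue
			}
			res.written++
		}
		done <- res
	}()

	for _, r := range source {
		ch <- r
	}
	close(ch)

	res := <-done
	return res.written, res.err
}

func main() {
	w := &memoryWriter{}
	n, err := pipe([]Record{{"id": 1}, {"id": 2}, {"id": 3}}, w, 2)
	fmt.Println(n, err)
}
```

The buffer size is the knob in this sketch: a larger buffer lets the reader run further ahead of a slow destination at the cost of memory.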

Getting Started with OLake

Source / Connectors

  1. Getting started Postgres -> Writers | Postgres Docs
  2. Getting started MongoDB -> Writers | MongoDB Docs
  3. Getting started MySQL -> Writers | MySQL Docs

Writers / Destination

  1. Apache Iceberg Docs
  2. AWS S3 Docs
  3. Local FileSystem Docs

Source/Connector Functionalities

Functionality | MongoDB | Postgres | MySQL
Full Refresh Sync Mode
Incremental Sync Mode
CDC Sync Mode
Full Parallel Processing
CDC Parallel Processing
Resumable Full Load
CDC Heart Beat

Additional planned sources: AWS S3 | Kafka

Writer Functionalities

Functionality | Local Filesystem | AWS S3 | Apache Iceberg
Flattening & Normalization (L1)
Partitioning
Schema Changes
Schema Evolution

Supported Catalogs For Iceberg Writer

Catalog | Status
Glue Catalog | WIP
Hive Metastore | Upcoming
JDBC Catalog | Upcoming
REST Catalog - Nessie | Upcoming
REST Catalog - Polaris | Upcoming
REST Catalog - Unity | Upcoming
REST Catalog - Gravitino | Upcoming
Azure Purview | Not planned, submit a request
BigLake Metastore | Not planned, submit a request

Core

Core (the framework) is the logic that has been abstracted out of the connectors to follow DRY. It includes the base CLI commands, state logic, validation logic, type detection for unstructured data, handling of the Config, State, Catalog, and Writer config files, logging, and so on.
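
To make "type detection for unstructured data" concrete, here is a minimal sketch of one way such detection could work: each value is classified, and conflicting types for the same field are widened. The `FieldType` names and the widening rules (int to float, anything else to string) are assumptions for illustration, not OLake's actual rules.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// FieldType is a hypothetical set of detected types.
type FieldType string

const (
	TypeNull   FieldType = "null"
	TypeBool   FieldType = "bool"
	TypeInt    FieldType = "int"
	TypeFloat  FieldType = "float"
	TypeString FieldType = "string"
	TypeObject FieldType = "object"
	TypeArray  FieldType = "array"
)

// detect classifies a single value as decoded by encoding/json,
// which represents every JSON number as float64.
func detect(v any) FieldType {
	switch x := v.(type) {
	case nil:
		return TypeNull
	case bool:
		return TypeBool
	case float64:
		if x == math.Trunc(x) {
			return TypeInt
		}
		return TypeFloat
	case string:
		return TypeString
	case map[string]any:
		return TypeObject
	case []any:
		return TypeArray
	default:
		return TypeString
	}
}

// merge widens two detected types into one that can hold both.
// Assumed rules: null yields to anything, int widens to float,
// and every other conflict falls back to string.
func merge(a, b FieldType) FieldType {
	switch {
	case a == b:
		return a
	case a == TypeNull:
		return b
	case b == TypeNull:
		return a
	case (a == TypeInt && b == TypeFloat) || (a == TypeFloat && b == TypeInt):
		return TypeFloat
	default:
		return TypeString
	}
}

// inferSchema scans documents and returns one widened type per field.
func inferSchema(docs []map[string]any) map[string]FieldType {
	schema := map[string]FieldType{}
	for _, doc := range docs {
		for key, value := range doc {
			t := detect(value)
			if prev, ok := schema[key]; ok {
				t = merge(prev, t)
			}
			schema[key] = t
		}
	}
	return schema
}

func main() {
	raw := []string{
		`{"id": 1, "score": 2, "tag": null}`,
		`{"id": 2, "score": 2.5, "tag": "a"}`,
	}
	var docs []map[string]any
	for _, line := range raw {
		var doc map[string]any
		if err := json.Unmarshal([]byte(line), &doc); err != nil {
			panic(err)
		}
		docs = append(docs, doc)
	}
	fmt.Println(inferSchema(docs))
}
```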

Core includes an HTTP server that exposes live stats about a running sync, such as:

  • Estimated finish time
  • Concurrently running processes
  • Live record count

Core handles the following commands for interacting with a driver:

  • spec command: Returns a renderable JSON Schema that can be consumed by rjsf (react-jsonschema-form) libraries in the frontend
  • check command: Performs all necessary checks on the Config, Catalog, State, and Writer config
  • discover command: Returns all streams and their schema
  • sync command: Extracts data out of Source and writes into destinations

Learn more about how OLake works here.

Roadmap

Check out the GitHub Project Roadmap and the Upcoming OLake Roadmap to track and influence what we build. If you have ideas, questions, or feedback, please share them on our GitHub Discussions or raise an issue.

Contributing

We ❤️ contributions, big or small. Check out our Bounty Program. As always, thanks to our amazing contributors!