OLake

Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supporting Postgres, MongoDB and MySQL


The fastest open-source tool for replicating databases to Apache Iceberg or a data lakehouse. ⚡ Efficient, quick, and scalable data ingestion for real-time analytics, starting with MongoDB. Visit olake.io/docs for the full documentation and benchmarks.


The OLake connector ecosystem focuses on these key points:

Integrated writers, so that reading is not blocked and records are pushed directly into destinations
Connector autonomy
Avoiding operations that do not contribute to increasing record throughput
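
To illustrate the first point, here is a minimal sketch of a reader that hands records to a writer through a bounded buffer, so the reader only waits when the buffer is full. This is not OLake's implementation: the names read_records, write_records, and run_pipeline are hypothetical, and a plain Python list stands in for a real destination.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def read_records(source, buffer):
    """Reader: push each record into the buffer, then signal completion."""
    for record in source:
        buffer.put(record)  # blocks only when the buffer is full
    buffer.put(SENTINEL)


def write_records(buffer, destination):
    """Writer: drain the buffer and append records to the destination."""
    while True:
        record = buffer.get()
        if record is SENTINEL:
            break
        destination.append(record)  # stand-in for a real destination write


def run_pipeline(source, buffer_size=1000):
    """Run the reader in this thread and the writer in a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    destination = []
    writer = threading.Thread(target=write_records, args=(buffer, destination))
    writer.start()
    read_records(source, buffer)
    writer.join()
    return destination
```

The bounded queue applies back-pressure: if the writer falls behind, the reader pauses instead of buffering an unbounded number of records in memory.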

Getting Started with OLake

Source / Connectors

Writers / Destination

Source/Connector Functionalities

Functionality	MongoDB	Postgres	MySQL
Full Refresh Sync Mode	✅	✅	✅
Incremental Sync Mode	❌	❌	❌
CDC Sync Mode	✅	✅	✅
Full Parallel Processing	✅	✅	✅
CDC Parallel Processing	✅	❌	❌
Resumable Full Load	✅	✅	✅
CDC Heart Beat	❌	❌	❌

The following additional sources are planned: AWS S3 and Kafka.

Writer Functionalities

Functionality	Local Filesystem	AWS S3
Flattening & Normalization (L1)	✅	✅
Partitioning	✅	✅
Schema Changes	✅	✅
Schema Evolution	✅	✅
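
As a rough illustration of the "Flattening & Normalization (L1)" row, the sketch below assumes that L1 means level-one flattening: top-level keys become columns, and nested objects or arrays are kept as JSON strings. That reading is an assumption, and flatten_level_one is a hypothetical helper, not an OLake function.

```python
import json


def flatten_level_one(document):
    """Promote top-level keys to columns; serialize nested values as JSON strings."""
    row = {}
    for key, value in document.items():
        if isinstance(value, (dict, list)):
            # Nested structures are not expanded further at level one.
            row[key] = json.dumps(value, sort_keys=True)
        else:
            row[key] = value
    return row


record = {"id": 1, "name": "Asha", "address": {"city": "Pune"}}
flat = flatten_level_one(record)
```

Here flat keeps id and name as scalar columns, while address holds the nested object as a single JSON string.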

Supported Catalogs For Iceberg Writer

Catalog	Status
Glue Catalog	WIP
Hive Metastore	Upcoming
JDBC Catalog	Upcoming
REST Catalog - Nessie	Upcoming
REST Catalog - Polaris	Upcoming
REST Catalog - Unity	Upcoming
REST Catalog - Gravitino	Upcoming
Azure Purview	Not Planned, submit a request
BigLake Metastore	Not Planned, submit a request

Core

Core, or the framework, is the logic that has been abstracted out of the connectors to follow the DRY (don't repeat yourself) principle. It includes the base CLI commands, state logic, validation logic, type detection for unstructured data, handling of the Config, State, Catalog, and Writer config files, logging, and so on.
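
Type detection for unstructured data can be pictured with the sketch below, which samples documents and picks the narrowest type that fits every value of a field. The type names ("integer", "number", "boolean", "string", "null") and the widening rules are assumptions chosen for illustration; they do not describe how Core actually resolves types.

```python
def detect_type(values):
    """Return the narrowest type name that fits every non-null sampled value."""
    seen = set()
    for value in values:
        if value is None:
            continue
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            seen.add("boolean")
        elif isinstance(value, int):
            seen.add("integer")
        elif isinstance(value, float):
            seen.add("number")
        else:
            seen.add("string")
    if not seen:
        return "null"
    if len(seen) == 1:
        return seen.pop()
    if seen == {"integer", "number"}:
        return "number"  # integers widen to floating point
    return "string"  # fall back to string on any other conflict


def detect_schema(documents):
    """Collect the values seen per field and detect a type for each field."""
    columns = {}
    for doc in documents:
        for key, value in doc.items():
            columns.setdefault(key, []).append(value)
    return {key: detect_type(vals) for key, vals in columns.items()}
```

Fields missing from some documents are simply not sampled there, so a sparse field still receives a type from the documents that do contain it.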

Core includes an HTTP server that exposes live stats about a running sync, such as:

Estimated finish time
Concurrently running processes
Live record count
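
One simple way to derive a finish-time estimate from a live record count is to extrapolate from the average throughput so far. The sketch below assumes the total record count is known up front; estimate_finish_seconds and its parameters are hypothetical, and the stats server may compute its estimate differently.

```python
def estimate_finish_seconds(records_done, records_total, elapsed_seconds):
    """Estimate the remaining seconds from the average throughput so far.

    Returns None when no estimate is possible yet.
    """
    if records_done <= 0 or elapsed_seconds <= 0:
        return None
    throughput = records_done / elapsed_seconds  # records per second
    remaining = max(records_total - records_done, 0)
    return remaining / throughput


# 2,500 of 10,000 records synced in 50 seconds: 50 records/s, 7,500 left.
eta = estimate_finish_seconds(2500, 10000, 50)
```

An average over the whole run is stable but slow to react; a moving window over recent throughput would track speed changes more closely.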

Core handles the following commands for interacting with a driver:

spec command: returns a renderable JSON Schema that rjsf libraries in the frontend can consume
check command: performs all necessary checks on the Config, Catalog, State, and Writer config
discover command: returns all streams and their schemas
sync command: extracts data from the source and writes it to the destinations
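
The four commands above can be thought of as entries in a dispatch table. The sketch below is purely illustrative: the handler bodies, return values, and the run_command helper are placeholders invented for this example, and they do not reflect OLake's actual CLI, flags, or output formats.

```python
# Hypothetical handlers: the names mirror the four commands, the bodies are placeholders.
def spec():
    """Return a JSON Schema describing the connector configuration."""
    return {"type": "object", "properties": {"hosts": {"type": "string"}}}


def check():
    """Validate the config, catalog, state, and writer config."""
    return {"status": "ok"}


def discover():
    """Return the available streams and their schemas."""
    return {"streams": [{"name": "orders", "schema": {"id": "integer"}}]}


def sync():
    """Extract data from the source and write it to the destinations."""
    return {"records_written": 0}


COMMANDS = {"spec": spec, "check": check, "discover": discover, "sync": sync}


def run_command(name):
    """Look up a command by name and run it; reject unknown names."""
    try:
        handler = COMMANDS[name]
    except KeyError:
        raise ValueError(f"unknown command: {name}") from None
    return handler()
```

Keeping the dispatch in the shared core means each connector only supplies the handler logic, which matches the DRY goal described for Core.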

Find more about how OLake works here.

Roadmap

Check out the GitHub Project Roadmap and the Upcoming OLake Roadmap to track and influence how we build OLake. If you have any ideas, questions, or feedback, please share them in our GitHub Discussions or raise an issue.

Contributing

We ❤️ contributions, big or small. Check out our Bounty Program. As always, thanks to our amazing contributors!

To contribute to OLake, check CONTRIBUTING.md.
To contribute to the UI, visit the OLake UI Repository.
To contribute to the OLake website and documentation (olake.io), visit the OLake Docs Repository.