Pachyderm is cost-effective at scale, enabling data engineering teams to automate complex pipelines with sophisticated data transformations across any type of data. Our unique approach provides parallelized processing of multi-stage, language-agnostic pipelines with data versioning and data lineage tracking. Pachyderm delivers the ultimate CI/CD engine for data.
To start deploying end-to-end, version-controlled data pipelines, run Pachyderm locally or deploy it on AWS, GCE, or Azure in about 5 minutes.
Our complete documentation includes tutorials, example projects, and guides to Pachyderm's advanced features.
If you'd like to see some examples and learn about core use cases for Pachyderm:
Keep up to date and get Pachyderm support via:
To get started contributing, sign the Contributor License Agreement.
You should also check out our contributing guide.
Send us PRs; we would love to see what you do! Our GitHub issues labeled "help-wanted" are a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.
Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the environment variable METRICS to false in the pachd container.
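
As a rough sketch of what disabling metrics could look like, assuming pachd runs as a container defined in a Kubernetes Deployment manifest: only the METRICS variable, its false value, and the pachd container name come from the text above. The surrounding manifest layout is an assumption for illustration and may not match your deployment.

```yaml
# Hypothetical excerpt of a Kubernetes Deployment manifest for pachd.
# Only METRICS=false and the container name "pachd" come from this README;
# the rest of the structure is assumed.
spec:
  template:
    spec:
      containers:
        - name: pachd
          env:
            - name: METRICS
              value: "false"  # quoted, since Kubernetes env values must be strings
```

If your deployment is generated by a tool rather than a hand-edited manifest, the same variable would need to be set wherever that tool defines the pachd container's environment.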