Infrastructure
CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing.
If you're new to CocoIndex, we recommend checking out the Documentation and the Quick Start Guide. There is also a quick start video tutorial to help you get started.
pip install -U cocoindex
Set up Postgres with the pgvector extension, or bring up a Postgres database using Docker Compose:
docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
Follow the Quick Start Guide to define your first indexing flow. A common indexing flow looks like:
import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))
    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()
    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)
        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))
            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])
    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
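It defines an index flow that reads files from a local directory, splits each document into chunks, embeds each chunk, and exports the results to a Postgres vector index.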
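To make the `chunk_size` and `chunk_overlap` parameters in the flow above more concrete, here is a minimal standalone sketch of overlapping chunking. It is not CocoIndex's `SplitRecursively` implementation (which is language-aware); it is a simplified fixed-window splitter, and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_offset, chunk) pairs that cover `text` with overlapping windows.

    Simplified illustration only: splits purely by character count and
    ignores document structure such as headings or paragraphs.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    # Each new window starts `step` characters after the previous one,
    # so consecutive windows share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    chunks: list[tuple[int, str]] = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for offset, chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(offset, chunk)
```

With `chunk_size=2000` and `chunk_overlap=500`, as in the flow above, this sketch would start a new chunk every 1500 characters, so text near a chunk boundary appears in two neighbouring chunks.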
Go to the examples directory to try any of the examples, following the instructions in each example's directory.
Example | Description |
---|---|
Text Embedding | Index text documents with embeddings for semantic search |
Code Embedding | Index code embeddings for semantic search |
PDF Embedding | Parse PDF and index text embeddings for semantic search |
Manuals LLM Extraction | Extract structured information from a manual using LLM |
Google Drive Text Embedding | Index text documents from Google Drive |
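Several of these examples build an index for semantic search, and the flow shown earlier declares its vector index with a cosine similarity metric. As a rough illustration of what that metric computes, here is a pure-Python sketch that ranks a few stored chunks against a query vector. The helper names (`cosine_similarity`, `top_k`) and the toy two-dimensional vectors are assumptions for illustration; in a real deployment the comparison would typically run inside Postgres via pgvector rather than in Python.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], rows: list[tuple[str, list[float]]], k: int = 3) -> list[tuple[float, str]]:
    """Return the k (score, text) pairs whose embeddings are most similar to `query`."""
    scored = [(cosine_similarity(query, embedding), text) for text, embedding in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings; real ones would come from an embedding model.
    rows = [("chunk a", [0.0, 1.0]), ("chunk b", [1.0, 0.0]), ("chunk c", [1.0, 1.0])]
    for score, text in top_k([1.0, 0.0], rows, k=2):
        print(f"{score:.3f} {text}")
```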
More are coming, so stay tuned! If there are any specific examples you would like to see, please let us know in our Discord community.
For detailed documentation, visit the CocoIndex Documentation, which includes a Quickstart guide.
We love contributions from our community. For details on contributing or running the project for development, check out our contributing guide.
Welcome with a huge coconut hug. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.
Join our community here:
CocoIndex is Apache 2.0 licensed.