Infrastructure
pipx install git+https://github.com/Storia-AI/sage.git@main
python -m venv sage-venv
source sage-venv/bin/activate
git clone https://github.com/Storia-AI/sage.git
cd sage
pip install -e .
sage performs two steps: it indexes your codebase, then lets you chat with it via an LLM.
To index the codebase locally, we use the open-source project Marqo, which is both an embedder and a vector store. To bring up a Marqo instance:
docker rm -f marqo
docker pull marqoai/marqo:latest
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
This will open a persistent Marqo console window. This should take around 2-3 minutes on a fresh install.
To chat with an LLM locally, we use Ollama:
ollama pull llama3.1
For embeddings, we support OpenAI and Voyage. According to our experiments, OpenAI yields better quality. Its batch API is also faster, with more generous rate limits. Export the API key of the desired provider:
export OPENAI_API_KEY=... # or
export VOYAGE_API_KEY=...
We use Pinecone for the vector store, so you will need an API key:
export PINECONE_API_KEY=...
If you want to reuse an existing Pinecone index, specify it. Otherwise, we'll create a new one called sage.
export PINECONE_INDEX_NAME=...
For reranking, we support NVIDIA, Voyage, Cohere, and Jina. For NVIDIA, the reranker model is nvidia/nv-rerankqa-mistral-4b-v3.
export NVIDIA_API_KEY=... # or
export VOYAGE_API_KEY=... # or
export COHERE_API_KEY=... # or
export JINA_API_KEY=...
For chatting with an LLM, we support OpenAI and Anthropic. For the latter, set an additional API key:
export ANTHROPIC_API_KEY=...
For easier configuration, adapt the entries in the sample .sage-env file (change the API key names based on your desired setup) and run:
source .sage-env
If you are planning on indexing GitHub issues in addition to the codebase, you will need a GitHub token:
export GITHUB_TOKEN=...
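Before indexing in remote mode, it can help to confirm that the keys for your chosen providers are actually exported. The following is a minimal sketch, not part of sage: the `REQUIRED_KEYS` table and the `missing_keys` helper are hypothetical names, and the mapping is inferred from the environment variables listed above (including the assumption that OpenAI chat reuses OPENAI_API_KEY).

```python
import os

# Environment variables named in this README, grouped by the component they configure.
# This table is an illustration, not a structure taken from the sage codebase.
REQUIRED_KEYS = {
    "embedder": {"openai": "OPENAI_API_KEY", "voyage": "VOYAGE_API_KEY"},
    "vector_store": {"pinecone": "PINECONE_API_KEY"},
    "reranker": {
        "nvidia": "NVIDIA_API_KEY",
        "voyage": "VOYAGE_API_KEY",
        "cohere": "COHERE_API_KEY",
        "jina": "JINA_API_KEY",
    },
    # Assumption: OpenAI chat uses the same key as OpenAI embeddings.
    "llm": {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"},
}


def missing_keys(choices, env=None):
    """Return the sorted variable names the chosen providers need but `env` lacks."""
    env = os.environ if env is None else env
    needed = {REQUIRED_KEYS[component][provider] for component, provider in choices.items()}
    return sorted(name for name in needed if not env.get(name))


if __name__ == "__main__":
    setup = {
        "embedder": "openai",
        "vector_store": "pinecone",
        "reranker": "nvidia",
        "llm": "anthropic",
    }
    for name in missing_keys(setup):
        print(f"Missing environment variable: {name}")
```

Running the script after `source .sage-env` prints one line per key that is still unset for the example setup.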
Select your desired repository:
export GITHUB_REPO=huggingface/transformers
Index the repository. This might take a few minutes, depending on its size.
sage-index $GITHUB_REPO
To use external providers instead of running locally, set --mode=remote.
Chat with the repository, once it's indexed:
sage-chat $GITHUB_REPO
To use external providers instead of running locally, set --mode=remote.
To get a public URL for your chat app, set --share=true.
Run sage-index --help or sage-chat --help for a full list of flags.
To index and chat with a private repository, simply set the GITHUB_TOKEN environment variable. To obtain this token, go to github.com, click on your profile icon, then Settings > Developer settings > Personal access tokens. You can create either a fine-grained token for the desired repository or a classic token.
export GITHUB_TOKEN=...
You can specify an inclusion or exclusion file in the following format:
# This is a comment
ext:.my-ext-1
ext:.my-ext-2
ext:.my-ext-3
dir:my-dir-1
dir:my-dir-2
dir:my-dir-3
file:my-file-1.md
file:my-file-2.py
file:my-file-3.cpp
where:
ext specifies a file extension.
dir specifies a directory. This is not a full path. For instance, if you specify dir:tests in an exclusion file, then a file like /path/to/my/tests/file.py will be ignored.
file specifies a file name. This is also not a full path. For instance, if you specify file:__init__.py in an exclusion file, then a file like /path/to/my/__init__.py will be ignored.
To specify an inclusion file (i.e. only index the specified files):
sage-index $GITHUB_REPO --include=/path/to/inclusion/file
To specify an exclusion file (i.e. index all files, except for the ones specified):
sage-index $GITHUB_REPO --exclude=/path/to/exclusion/file
By default, we use the exclusion file sample-exclude.txt.
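To make the matching rules concrete, here is a minimal Python sketch of how such a file could be parsed and applied. It is an illustration of the format described above, not sage's actual implementation: `parse_rules`, `matches` and `should_index` are hypothetical names, and the behavior when both an inclusion and an exclusion file are given is an assumption.

```python
from pathlib import PurePosixPath


def parse_rules(text):
    """Parse 'ext:', 'dir:' and 'file:' lines into three sets; '#' lines are comments."""
    rules = {"ext": set(), "dir": set(), "file": set()}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        kind, _, value = line.partition(":")
        if kind not in rules or not value:
            raise ValueError(f"Unrecognized rule: {line!r}")
        rules[kind].add(value.strip())
    return rules


def matches(path, rules):
    """True if the extension, the file name, or any parent directory name is listed."""
    p = PurePosixPath(path)
    return (
        p.suffix in rules["ext"]
        or p.name in rules["file"]
        or any(part in rules["dir"] for part in p.parts[:-1])
    )


def should_index(path, include=None, exclude=None):
    """Inclusion rules keep only matching files; exclusion rules drop matching files."""
    if include is not None and not matches(path, include):
        return False
    if exclude is not None and matches(path, exclude):
        return False
    return True


if __name__ == "__main__":
    exclude = parse_rules("# skip tests and package markers\ndir:tests\nfile:__init__.py\n")
    for path in ["/path/to/my/tests/file.py", "/path/to/my/__init__.py", "/path/to/my/module.py"]:
        print(path, should_index(path, exclude=exclude))
```

Because dir and file entries are compared against individual path components rather than full paths, dir:tests matches a tests directory at any depth.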
To index GitHub issues, you will need a GitHub token first:
export GITHUB_TOKEN=...
To index GitHub issues without comments:
sage-index $GITHUB_REPO --index-issues
To index GitHub issues with comments:
sage-index $GITHUB_REPO --index-issues --index-issue-comments
To index GitHub issues, but not the codebase:
sage-index $GITHUB_REPO --index-issues --no-index-repo
Retrieving the right files from the vector database is arguably the quality bottleneck of the system. We are actively experimenting with various retrieval strategies and documenting our findings here.
Currently, we support the following types of retrieval:
Vanilla RAG from a vector database (nearest neighbor between dense embeddings). This is the default.
Hybrid RAG that combines dense retrieval (embeddings-based) with sparse retrieval (BM25). Use --retrieval-alpha to weigh the two strategies.
Multi-query retrieval performs multiple query rewrites, makes a separate retrieval call for each, and takes the union of the retrieved documents. You can activate it by passing --multi-query-retrieval. This can be combined with both vanilla and hybrid RAG.
LLM-only retrieval completely circumvents indexing the codebase. We simply enumerate all file paths and pass them to an LLM together with the user query. We ask the LLM which files are likely to be relevant for the user query, solely based on their filenames. You can activate it by passing --llm-retriever.
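The hybrid and multi-query strategies above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code sage runs: the function names are hypothetical, the scores are assumed to be normalized to a comparable range, and the convention that alpha weighs the dense score (with 1 - alpha on the sparse score) is an assumption about how --retrieval-alpha is interpreted.

```python
def hybrid_scores(dense, sparse, alpha):
    """Blend per-document scores as alpha * dense + (1 - alpha) * sparse.

    Assumption: alpha=1.0 means dense-only and alpha=0.0 means sparse-only (BM25).
    Documents missing from one retriever contribute a score of 0.0 for it.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    docs = set(dense) | set(sparse)
    return {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0) for d in docs}


def top_k(scores, k):
    """Return the k highest-scoring document ids, best first (ties broken by id)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in ranked[:k]]


def multi_query_union(retrieve, queries):
    """Call `retrieve` once per query rewrite and merge results, dropping duplicates."""
    seen, merged = set(), []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


if __name__ == "__main__":
    dense = {"a.py": 0.9, "b.py": 0.2}   # made-up embedding similarities
    sparse = {"b.py": 1.0, "c.py": 0.5}  # made-up normalized BM25 scores
    print(top_k(hybrid_scores(dense, sparse, alpha=0.5), k=2))
```

In this sketch `retrieve` can be any callable that maps a query string to a list of document ids, so the union step works the same way on top of vanilla or hybrid retrieval.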
Sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself. sage is like an open-source GitHub Copilot with the most up-to-date information about your repo.
Features:
Renamed repo2vec to sage.
We're working to make all code on the internet searchable and understandable for devs. You can check out our early product, Code Sage. We pre-indexed a slew of OSS repos, and you can index your desired ones by simply pasting a GitHub URL.
If you're the maintainer of an OSS repo and would like a dedicated page on Code Sage (e.g. sage.storia.ai/your-repo), then send us a message at founders@storia.ai. We'll do it for free!
We built the code to be purposefully modular, so that you can plug in your desired embedding, LLM, and vector store providers by simply implementing the relevant abstract classes.
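As a rough picture of what implementing an abstract class looks like, here is a self-contained Python sketch. The `Embedder` base class and its `embed` method are hypothetical stand-ins, not the interfaces defined in the sage codebase, so check the source for the real class names and signatures before writing a plugin.

```python
from abc import ABC, abstractmethod


class Embedder(ABC):
    """Hypothetical interface: turns a list of texts into a list of vectors."""

    @abstractmethod
    def embed(self, texts):
        """Return one list of floats per input text."""


class ToyHashEmbedder(Embedder):
    """Toy provider that counts tokens into buckets; useful only as a placeholder."""

    def __init__(self, dim=8):
        self.dim = dim

    def embed(self, texts):
        vectors = []
        for text in texts:
            vec = [0.0] * self.dim
            for token in text.split():
                # Deterministic bucket: sum of the token's UTF-8 bytes modulo dim.
                vec[sum(token.encode("utf-8")) % self.dim] += 1.0
            vectors.append(vec)
        return vectors


if __name__ == "__main__":
    embedder = ToyHashEmbedder(dim=4)
    print(embedder.embed(["def main", "import os"]))
```

The base class cannot be instantiated directly, so a provider that forgets to implement `embed` fails at construction time rather than in the middle of indexing.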
Feel free to send feature requests to founders@storia.ai or make a pull request!