AutoArena
Create leaderboards ranking LLM outputs against one another using automated judge evaluation
Judging responses with a panel of smaller models like gpt-4o-mini, command-r, and
claude-3-haiku generally yields better accuracy than a single frontier judge like gpt-4o, while being faster and
much cheaper to run. AutoArena is built around this technique, called PoLL: Panel of LLM evaluators
(arXiv:2404.18796).
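To make the idea concrete, here is a minimal sketch of a panel vote, assuming hypothetical judge callables that wrap the individual models; it illustrates the technique, not AutoArena's internal implementation:

# A minimal sketch of the panel idea: each judge sees one prompt and two candidate
# responses and votes "A", "B", or "tie"; the panel's verdict is the majority vote.
from collections import Counter
from typing import Callable

Judge = Callable[[str, str, str], str]  # (prompt, response_a, response_b) -> "A" | "B" | "tie"

def panel_verdict(judges: list[Judge], prompt: str, response_a: str, response_b: str) -> str:
    votes = Counter(judge(prompt, response_a, response_b) for judge in judges)
    verdict, _count = votes.most_common(1)[0]  # the most common vote wins
    return verdict

# Usage (judge_gpt_4o_mini etc. are hypothetical wrappers around the respective model APIs):
# panel_verdict([judge_gpt_4o_mini, judge_command_r, judge_claude_3_haiku], prompt, a, b)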
Install from PyPI:
pip install autoarena
Run as a module and visit localhost:8899 in your browser:
python -m autoarena
With the application running, getting started is simple:
1. Create a project.
2. Add responses from a model by uploading a CSV file with prompt and response columns.
3. Configure automated judges, setting any required API credentials (e.g. X_API_KEY) in the environment where you're running AutoArena.
4. Upload responses from another model to kick off automated judging: the judges you configured decide which model produced the better response to a given prompt.

That's it! After these steps you're fully set up for automated evaluation on AutoArena.
AutoArena requires two pieces of information to test a model: the input prompt and corresponding model response.
- prompt: the inputs to your model. When uploading responses, any other models that have been run on the same prompts are matched and evaluated using the automated judges you have configured.
- response: the output from your model. Judges decide which of two models produced a better response, given the same prompt.
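For illustration, the snippet below writes an upload file in that shape; the file name and rows are made up, and it assumes your responses are in a CSV:

import csv

# Illustrative rows only; the required columns are "prompt" and "response".
rows = [
    {"prompt": "What is the capital of France?", "response": "The capital of France is Paris."},
    {"prompt": "Name a prime number greater than 10.", "response": "11 is prime and greater than 10."},
]

with open("my-model-responses.csv", "w", newline="") as f:  # hypothetical file name
    writer = csv.DictWriter(f, fieldnames=["prompt", "response"])
    writer.writeheader()
    writer.writerows(rows)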
Data is stored in ./data/<project>.sqlite files in the directory where you invoked AutoArena. See data/README.md for more details on data storage in AutoArena.
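If you want to poke at the stored data directly, any standard SQLite client works. A minimal sketch, assuming a project named my-project (the schema is not documented here, so this just lists the tables that exist):

import sqlite3

conn = sqlite3.connect("./data/my-project.sqlite")  # substitute your project's file
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print([name for (name,) in tables])
conn.close()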
AutoArena uses uv to manage dependencies. To set up this repository for development, run:
uv venv && source .venv/bin/activate
uv pip install --all-extras -r pyproject.toml
uv tool run pre-commit install
To run AutoArena for development, you will need to run both the backend and frontend service:
- Backend: uv run python3 -m autoarena serve --dev (the --dev/-d flag enables automatic service reloading when source files change)
- Frontend: see ui/README.md

To build a release tarball in the ./dist directory:
./scripts/build.sh