AutoArena: create leaderboards ranking LLM outputs against one another using automated judge evaluation.
Using a panel of smaller judges like `gpt-4o-mini`, `command-r`, and `claude-3-haiku` generally yields better accuracy than a single frontier judge like `gpt-4o`, while being faster and much cheaper to run. AutoArena is built around this technique, called PoLL: Panel of LLM Evaluators ([arXiv:2404.18796](https://arxiv.org/abs/2404.18796)).
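The idea behind PoLL is straightforward: each judge in the panel casts a head-to-head vote on a pair of responses, and the panel's votes are aggregated, for example by simple majority. The sketch below illustrates that aggregation step in Python; it is not AutoArena's implementation, and the `Judge` type and `poll_vote` helper are hypothetical.

```python
from collections import Counter
from typing import Callable

# A judge receives a prompt plus two candidate responses and votes "A", "B", or "tie".
# In AutoArena these votes come from the configured LLM judges; here the interface is
# a plain callable so the sketch stays self-contained.
Judge = Callable[[str, str, str], str]


def poll_vote(judges: list[Judge], prompt: str, response_a: str, response_b: str) -> str:
    """Aggregate head-to-head votes from a panel of judges by simple majority."""
    votes = Counter(judge(prompt, response_a, response_b) for judge in judges)
    winner, count = votes.most_common(1)[0]
    # Call it a tie when no single verdict wins an outright majority of the panel.
    return winner if count > len(judges) / 2 else "tie"


# Toy stand-ins for real LLM judges, just to make the example runnable.
always_a: Judge = lambda prompt, a, b: "A"
prefers_shorter: Judge = lambda prompt, a, b: "A" if len(a) <= len(b) else "B"

print(poll_vote([always_a, prefers_shorter, prefers_shorter],
                "What is 2 + 2?", "4", "The answer to 2 + 2 is 4."))  # -> A
```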
Install from PyPI:

```shell
pip install autoarena
```
Run as a module and visit [localhost:8899](http://localhost:8899) in your browser:

```shell
python -m autoarena
```
With the application running, getting started is simple:

1. Create a project via the UI.
2. Add responses from a model by uploading a CSV file with `prompt` and `response` columns.
3. Configure an automated judge via the UI. Note that most judges require credentials, e.g. an `X_API_KEY` in the environment where you're running AutoArena (see the example below).
4. Add responses from another model to kick off automated judging, where judges compare each model's `response` to a given `prompt`.

That's it! After these steps you're fully set up for automated evaluation on AutoArena.
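For example, to make a key available to an OpenAI-based judge, you could export it in your shell before launching the app. `OPENAI_API_KEY` here is an assumption; the exact variable depends on the judge provider you configure.

```shell
export OPENAI_API_KEY="sk-..."  # credential for the judge provider you plan to use
python -m autoarena
```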
AutoArena requires two pieces of information to test a model: the input `prompt` and corresponding model `response`.

- `prompt`: the inputs to your model. When uploading responses, any other models that have been run on the same prompts are matched and evaluated using the automated judges you have configured.
- `response`: the output from your model. Judges decide which of two models produced a better response, given the same prompt.
Data is stored in `./data/<project>.sqlite` files in the directory where you invoked AutoArena. See `data/README.md` for more details on data storage in AutoArena.
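Since each project is a standard SQLite file, you can open it with any SQLite client. Here is a quick sketch using Python's built-in `sqlite3` module; the `my-project` name is a placeholder for a project you've created.

```python
import sqlite3

# Open a project database created by AutoArena ("my-project" is a placeholder).
con = sqlite3.connect("data/my-project.sqlite")

# List the tables stored in the project file.
for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(name)

con.close()
```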
AutoArena uses uv to manage dependencies. To set up this repository for development, run:
```shell
uv venv && source .venv/bin/activate
uv pip install --all-extras -r pyproject.toml
uv tool run pre-commit install
uv run python3 -m autoarena serve --dev
```
To run AutoArena for development, you will need to run both the backend and frontend service:
- Backend: `uv run python3 -m autoarena serve --dev` (the `--dev`/`-d` flag enables automatic service reloading when source files change)
- Frontend: see `ui/README.md`
To build a release tarball in the `./dist` directory:

```shell
./scripts/build.sh
```