Documentation | Usage Examples | Free Cloud Service
The Synthetic Data SDK is a Python toolkit for high-fidelity, privacy-safe Synthetic Data.
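The SDK allows you to programmatically create, browse, and manage three key resources: generators, synthetic datasets, and connectors. The table below maps common intents to the corresponding SDK primitives: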
Intent | Primitive | API Reference |
---|---|---|
Train a Generator on tabular or language data | g = mostly.train(config) | mostly.train |
Generate any number of synthetic data records | sd = mostly.generate(g, config) | mostly.generate |
Live probe the generator on demand | df = mostly.probe(g, config) | mostly.probe |
Connect to any data source within your org | c = mostly.connect(config) | mostly.connect |
https://github.com/user-attachments/assets/d1613636-06e4-4147-bef7-25bb4699e8fc
Install the SDK via pip:
pip install mostlyai
Train your first generator:
import pandas as pd
from mostlyai.sdk import MostlyAI
# load original data
repo_url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev"
df_original = pd.read_csv(f"{repo_url}/census/census.csv.gz")
df_original = df_original.sample(n=10_000) # sub-sample to speed up demo
# initialize the SDK
mostly = MostlyAI()
# train a synthetic data generator, with default configs
g = mostly.train(name="Quick Start Demo", data=df_original)
# display the quality assurance report
g.reports(display=True)
Once the generator has been trained, you can generate synthetic data samples, either via probing:
# probe for some representative synthetic samples
df_samples = mostly.probe(g, size=100)
df_samples
or by creating a synthetic dataset entity for larger data volumes:
# generate a large representative synthetic dataset
sd = mostly.generate(g, size=100_000)
df_synthetic = sd.data()
df_synthetic
or by conditionally probing / generating synthetic data:
# create 100 seed records of 24-year-old Mexicans
df_seed = pd.DataFrame({
    'age': [24] * 100,
    'native_country': ['Mexico'] * 100,
})
# conditionally probe, based on provided seed
df_samples = mostly.probe(g, seed=df_seed)
df_samples
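Seeds do not have to repeat a single combination of values. The following is a minimal sketch, using only the standard library, of building seed columns that cover every combination of several column values. The helper name `build_seed_columns` and the value `'Canada'` are illustrative assumptions and not part of the SDK; the resulting dict can be passed to `pd.DataFrame` and then to `mostly.probe` as shown above.

```python
from itertools import product


def build_seed_columns(repeats, **choices):
    """Hypothetical helper (not part of the SDK).

    Returns a dict of equal-length lists that covers every combination of
    the given column values, with each combination repeated `repeats` times.
    """
    names = list(choices)
    columns = {name: [] for name in names}
    for combo in product(*(choices[name] for name in names)):
        for _ in range(repeats):
            for name, value in zip(names, combo):
                columns[name].append(value)
    return columns


# 2 ages x 2 countries = 4 combinations, 50 seed records each
seed_columns = build_seed_columns(
    50,
    age=[24, 48],
    native_country=["Mexico", "Canada"],  # illustrative values
)
print(len(seed_columns["age"]))  # 200

# With pandas and a trained generator available, the seed could then be used as:
# df_seed = pd.DataFrame(seed_columns)
# df_samples = mostly.probe(g, seed=df_seed)
```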
Use pip (or better uv pip) to install the official mostlyai package via PyPI. Python 3.10 or higher is required.
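Before installing, it can help to confirm that the active interpreter meets the Python 3.10 requirement. This is a small sketch using only the standard library; the function name `python_is_supported` is an assumption made for illustration.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirement above


def python_is_supported(version_info=sys.version_info):
    """Return True if the given (major, minor, ...) version is at least 3.10."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {current} is too old; 3.10 or higher is required")
```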
It is highly recommended to install the package within a dedicated virtual environment, such as venv, uv, or conda. For example:
conda create -n mostlyai python=3.12
conda activate mostlyai
The following is a lightweight installation for using the SDK in CLIENT mode only. In this mode, the SDK communicates with a MOSTLY AI platform to perform the requested tasks. See, for example, app.mostly.ai for a free-to-use hosted version.
pip install -U mostlyai
The following is a full installation for using the SDK in both CLIENT and LOCAL mode. It includes all dependencies, including PyTorch, for training and generating synthetic data locally.
# for CPU on macOS
pip install -U 'mostlyai[local]'
# for CPU on Linux
pip install -U 'mostlyai[local-cpu]' --extra-index-url https://download.pytorch.org/whl/cpu
# for GPU on Linux
pip install -U 'mostlyai[local-gpu]'
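The choice between the three commands above depends on the operating system and on whether a GPU is available. As a rough sketch, the mapping could be expressed as follows. The function `local_install_command` is hypothetical, and raising an error for other platforms is an assumption, since only macOS and Linux are covered above.

```python
import platform


def local_install_command(system=None, gpu=False):
    """Hypothetical helper: pick the LOCAL-mode install command listed above.

    `system` follows platform.system() naming ("Darwin" for macOS, "Linux").
    """
    system = system or platform.system()
    if system == "Darwin":
        # only a CPU variant is listed for macOS
        return "pip install -U 'mostlyai[local]'"
    if system == "Linux":
        if gpu:
            return "pip install -U 'mostlyai[local-gpu]'"
        return (
            "pip install -U 'mostlyai[local-cpu]' "
            "--extra-index-url https://download.pytorch.org/whl/cpu"
        )
    # assumption: other platforms are not covered by the commands above
    raise ValueError(f"no LOCAL-mode install command listed for {system!r}")


print(local_install_command("Linux", gpu=True))
```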
Note for Google Colab users: Installing any of the local extras (mostlyai[local], mostlyai[local-cpu], or mostlyai[local-gpu]) will downgrade PyTorch from 2.6.0 to 2.5.1. You'll need to restart the runtime after installation for the changes to take effect.
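After restarting the runtime, one way to confirm which PyTorch version is actually installed is to query the package metadata. This sketch relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


torch_version = installed_version("torch")
print("torch:", torch_version or "not installed")
```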
Add any of the following extras for further data connector support in LOCAL mode: databricks, googlebigquery, hive, mssql, mysql, oracle, postgres, snowflake. For example:
pip install -U 'mostlyai[local, databricks, snowflake]'
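When the set of connectors varies between environments, the requirement string can be assembled programmatically. The following is a minimal sketch: the helper `requirement_with_extras` is hypothetical, and the set of extra names is copied from the list above.

```python
# connector extras as listed above
CONNECTOR_EXTRAS = {
    "databricks", "googlebigquery", "hive", "mssql",
    "mysql", "oracle", "postgres", "snowflake",
}


def requirement_with_extras(connectors, base_extra="local"):
    """Hypothetical helper: build a pip requirement such as mostlyai[local,snowflake]."""
    unknown = sorted(set(connectors) - CONNECTOR_EXTRAS)
    if unknown:
        raise ValueError(f"unknown connector extras: {', '.join(unknown)}")
    # keep the given order and drop duplicates
    extras = list(dict.fromkeys([base_extra, *connectors]))
    return f"mostlyai[{','.join(extras)}]"


print(requirement_with_extras(["databricks", "snowflake"]))
# mostlyai[local,databricks,snowflake]
```

The resulting string can be passed, quoted, to pip install -U.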
Please consider citing our project if you find it useful:
@software{mostlyai,
author = {{MOSTLY AI}},
title = {{MOSTLY AI SDK}},
url = {https://github.com/mostly-ai/mostlyai},
year = {2025}
}