A/B Testing
A concise guide to running A/B tests with Prompt Versioner. This page focuses on the common, practical workflows: quick start, setting up versions, collecting results, basic analysis, and a short example.
🧪 Quick Start
from prompt_versioner import PromptVersioner, ABTest
import random
pv = PromptVersioner(project_name="my-project", enable_git=False)
# Prepare versions (assumes 1.0.0 and 1.1.0 exist)
ab_test = ABTest(
    versioner=pv,
    prompt_name="code_reviewer",
    version_a="1.0.0",
    version_b="1.1.0",
    metric_name="quality_score",
)
# Simulate logging
for _ in range(30):
    ab_test.log_result("a", random.uniform(0.70, 0.85))
    ab_test.log_result("b", random.uniform(0.75, 0.90))
if ab_test.is_ready(min_samples=20):
    result = ab_test.get_result()
    print(f"Winner: {result.winner} โ improvement: {result.improvement:.2%}, confidence: {result.confidence:.1%}")
🔧 Setting Up Tests
- Create the versions you want to compare. Use pv.save_version(...) with VersionBump.MAJOR/MINOR/PATCH depending on the change.
- Initialize the ABTest with the prompt name and the two versions to compare, and choose a primary metric (e.g., quality_score) to evaluate.
from prompt_versioner import PromptVersioner, VersionBump
pv = PromptVersioner(project_name="my-project")
# Baseline
pv.save_version(
    name="summarizer",
    system_prompt="You are a summarization assistant.",
    user_prompt="Summarize: {text}",
    bump_type=VersionBump.MAJOR,
)
# Improved
pv.save_version(
    name="summarizer",
    system_prompt="You are an expert summarizer that produces concise, accurate summaries.",
    user_prompt="Please summarize: {text}",
    bump_type=VersionBump.MINOR,
)
ab_test = ABTest(
    versioner=pv,
    prompt_name="summarizer",
    version_a="1.0.0",
    version_b="1.1.0",
    metric_name="quality_score",
)
📊 Collecting Results
- Use ab_test.log_result(variant, value) to add single observations (variant is "a" or "b").
- For batch uploads, use ab_test.log_batch_results(variant, list_of_values) where available (see the sketch after the example below).
- Integrate logging into your production path: after an LLM call, evaluate the response and log the metric.
# Example after calling the LLM and computing a quality_score
ab_test.log_result("a", quality_score)
pv.log_metrics(name="summarizer", version="1.1.0", model_name="gpt-4o", quality_score=quality_score, success=True)
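If your installation provides the log_batch_results method mentioned above, scores collected offline can be logged in one call. The sketch below assumes that method is available; score_response is a hypothetical placeholder for your own evaluator.
# Score a batch of stored responses for variant "b", then log them at once
def score_response(response: str) -> float:
    # Hypothetical evaluator: replace with your real quality scoring
    return min(len(response) / 100, 1.0)

batch_responses_b = ["First draft summary...", "Second draft summary..."]
scores_b = [score_response(r) for r in batch_responses_b]
ab_test.log_batch_results("b", scores_b)  # assumes this method exists in your version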
📈 Analyzing Results
- Check sample counts with ab_test.get_sample_counts().
- When ab_test.is_ready(min_samples) is True, call ab_test.get_result() to get summary stats (means, sample counts, winner, confidence).
- Use ab_test.print_result() for a human-readable report; a sketch of acting on the result follows the code below.
count_a, count_b = ab_test.get_sample_counts()
print(f"Samples: A={count_a}, B={count_b}")
if ab_test.is_ready(min_samples=30):
    result = ab_test.get_result()
    print(result)
    ab_test.print_result()
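Once the test is ready, you can gate a rollout decision on the result fields shown above (winner, improvement, confidence). The thresholds below are illustrative choices, not library defaults.
result = ab_test.get_result()
# Illustrative decision rule: require high confidence and a meaningful lift before switching
if result.confidence >= 0.95 and result.improvement >= 0.02:
    print(f"Adopt the winning version: {result.winner}")
else:
    print("No clear winner yet: keep collecting samples or stay on the baseline")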
⚙️ Best Practices (Short)
- Test one change at a time. Keep tests focused.
- Define the primary metric and a minimum sample size before starting.
- Randomize assignment to avoid bias (see the sketch after this list).
- Monitor test health (latency, errors) alongside quality metrics.
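ABTest records outcomes; as used in this guide, routing traffic between variants is handled by your application. A minimal sketch of a 50/50 random split, where call_llm and score_response are hypothetical stand-ins for your own LLM call and evaluator:
import random

def call_llm(variant: str) -> str:
    # Hypothetical stand-in: call the prompt version that corresponds to the variant
    return "generated response"

def score_response(response: str) -> float:
    # Hypothetical stand-in for your real quality evaluator
    return random.uniform(0.6, 0.9)

# Random 50/50 assignment avoids systematic bias between variants
variant = random.choice(["a", "b"])
response = call_llm(variant)
ab_test.log_result(variant, score_response(response))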
🧩 Short Example
A short, complete example showing the minimal end-to-end flow.
from prompt_versioner import PromptVersioner, VersionBump, ABTest
import random
pv = PromptVersioner(project_name="quick-ab")
pv.save_version(name="g", system_prompt="You are helpful.", user_prompt="Q:{q}", bump_type=VersionBump.MAJOR)
pv.save_version(name="g", system_prompt="You are concise and helpful.", user_prompt="Q:{q}", bump_type=VersionBump.MINOR)
ab = ABTest(versioner=pv, prompt_name="g", version_a="1.0.0", version_b="1.1.0", metric_name="quality_score")
for _ in range(40):
    ab.log_result("a", random.uniform(0.6, 0.8))
    ab.log_result("b", random.uniform(0.65, 0.9))
if ab.is_ready(min_samples=20):
    r = ab.get_result()
    ab.print_result()