Runner¶
The prompt_versioner.testing.runner module provides a framework for running automated tests on prompts.
PromptTestRunner¶
Class for running prompt tests with support for parallel execution and metric collection.
Constructor¶
Parameters:
- max_workers (int): Maximum number of parallel workers (default: 4)
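Example (a minimal sketch; max_workers is the only constructor parameter documented here):
from prompt_versioner.testing.runner import PromptTestRunner

# Create a runner that executes at most 2 tests in parallel
runner = PromptTestRunner(max_workers=2)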
Methods¶
run_test()¶
def run_test(
    self,
    test_case: TestCase,
    prompt_fn: Callable[[Dict[str, Any]], Any],
    metric_fn: Optional[Callable[[Any], Dict[str, float]]] = None,
) -> TestResult
Runs a single test case.
Parameters:
- test_case (TestCase): Test case to run
- prompt_fn (Callable): Function that takes input and returns LLM output
- metric_fn (Optional[Callable]): Optional function to compute metrics from output
Returns:
- TestResult: Object with test results
Example:
from prompt_versioner.testing.runner import PromptTestRunner
from prompt_versioner.testing.models import TestCase
from typing import Dict, Any

# Define prompt function
def my_prompt_function(inputs: Dict[str, Any]) -> str:
    prompt = f"Classify sentiment: {inputs['text']}"
    # Simulate LLM call
    return "positive"

# Define metrics function
def calculate_metrics(output: str) -> Dict[str, float]:
    return {
        "confidence": 0.95,
        "accuracy": 1.0 if output in ["positive", "negative", "neutral"] else 0.0,
    }

# Create test case
test_case = TestCase(
    name="sentiment_test_1",
    inputs={"text": "I love this product!"},
    expected_output="positive",
)

# Run test
runner = PromptTestRunner()
result = runner.run_test(test_case, my_prompt_function, calculate_metrics)

print(f"Success: {result.success}")
print(f"Output: {result.output}")
print(f"Duration: {result.duration_ms:.2f}ms")
print(f"Metrics: {result.metrics}")
run_tests()¶
def run_tests(
    self,
    test_cases: List[TestCase],
    prompt_fn: Callable[[Dict[str, Any]], Any],
    metric_fn: Optional[Callable[[Any], Dict[str, float]]] = None,
    parallel: bool = True,
) -> List[TestResult]
Runs multiple test cases sequentially or in parallel.
Parameters:
- test_cases (List[TestCase]): List of test cases
- prompt_fn (Callable): Prompt function
- metric_fn (Optional[Callable]): Optional metrics function
- parallel (bool): Whether to run tests in parallel (default: True)
Returns:
- List[TestResult]: List of test results
Example:
# Create multiple test cases
test_cases = [
    TestCase(
        name="positive_sentiment",
        inputs={"text": "I love this!"},
        expected_output="positive",
    ),
    TestCase(
        name="negative_sentiment",
        inputs={"text": "This is terrible"},
        expected_output="negative",
    ),
    TestCase(
        name="neutral_sentiment",
        inputs={"text": "It's okay"},
        expected_output="neutral",
    ),
]

# Run all tests
results = runner.run_tests(test_cases, my_prompt_function, calculate_metrics)

# Analyze results
for result in results:
    status = "✅" if result.success else "❌"
    print(f"{status} {result.test_case.name}: {result.output}")
get_summary()¶
Generates a summary of test results.
Parameters:
- results (List[TestResult]): List of test results
Returns:
- Dict[str, Any]: Dictionary with statistics:
- total: Total number of tests
- passed: Number of passed tests
- failed: Number of failed tests
- pass_rate: Fraction of tests that passed (0.0–1.0)
- metrics: Aggregated metric statistics
Example:
summary = runner.get_summary(results)
print(f"Pass rate: {summary['pass_rate']:.2%}")
print(f"Total: {summary['total']}, Passed: {summary['passed']}, Failed: {summary['failed']}")
print(f"Metrics: {summary['metrics']}")
format_summary()¶
def format_summary(
    self,
    summary: Dict[str, Any],
) -> str
Formats the summary as readable text.
Parameters:
- summary (Dict[str, Any]): Summary from get_summary()
Returns:
- str: Formatted summary
Example:
formatted = runner.format_summary(summary)
print(formatted)
# Output:
# ================================================================================
# TEST SUMMARY
# ================================================================================
# Total: 3 | Passed: 2 | Failed: 1 | Pass Rate: 66.67%
#
# METRICS SUMMARY:
# - duration_ms: mean=45.23, std=12.45, range=[32.10, 58.90]
# - accuracy: mean=0.83, std=0.41, range=[0.00, 1.00]
Complete Workflow¶
A complete example of using the test runner:
from prompt_versioner.testing.runner import PromptTestRunner
from prompt_versioner.testing.models import TestCase
from typing import Dict, Any
import random

# 1. Define prompt function
def sentiment_classifier(inputs: Dict[str, Any]) -> str:
    """Simulate a sentiment classifier."""
    text = inputs.get("text", "")
    # Simulate classification logic
    if "love" in text.lower() or "great" in text.lower():
        return "positive"
    elif "hate" in text.lower() or "terrible" in text.lower():
        return "negative"
    else:
        return "neutral"

# 2. Define metrics function
def sentiment_metrics(output: str) -> Dict[str, float]:
    """Compute metrics for sentiment output."""
    valid_sentiments = ["positive", "negative", "neutral"]
    return {
        "validity": 1.0 if output in valid_sentiments else 0.0,
        "confidence": random.uniform(0.7, 0.99),  # Simulate confidence
        "response_length": len(output),
    }

# 3. Define custom validation function
def validate_sentiment(output: str) -> bool:
    """Validate that the output is a valid sentiment."""
    return output in ["positive", "negative", "neutral"]

# 4. Create test dataset
test_cases = [
    TestCase(
        name="clearly_positive",
        inputs={"text": "I love this product, it's great!"},
        expected_output="positive",
        validation_fn=validate_sentiment,
    ),
    TestCase(
        name="clearly_negative",
        inputs={"text": "I hate this, it's terrible!"},
        expected_output="negative",
        validation_fn=validate_sentiment,
    ),
    TestCase(
        name="neutral_text",
        inputs={"text": "This is a product."},
        expected_output="neutral",
        validation_fn=validate_sentiment,
    ),
    TestCase(
        name="edge_case_empty",
        inputs={"text": ""},
        expected_output="neutral",
        validation_fn=validate_sentiment,
    ),
    TestCase(
        name="mixed_sentiment",
        inputs={"text": "I love the design but hate the price"},
        # No expected_output, use only validation_fn
        validation_fn=validate_sentiment,
    ),
]

# 5. Initialize runner and run tests
print("🧪 Starting test suite...")
runner = PromptTestRunner(max_workers=2)

# Sequential tests for debugging
print("\n🔍 Sequential execution for debugging:")
sequential_results = runner.run_tests(
    test_cases[:2],
    sentiment_classifier,
    sentiment_metrics,
    parallel=False,
)

for result in sequential_results:
    status = "✅" if result.success else "❌"
    print(f"{status} {result.test_case.name}: {result.output} ({result.duration_ms:.1f}ms)")

# Parallel tests for speed
print("\n⚡ Full parallel execution:")
all_results = runner.run_tests(
    test_cases,
    sentiment_classifier,
    sentiment_metrics,
    parallel=True,
)

# 6. Analyze results
summary = runner.get_summary(all_results)
print("\n📊 Results:")
print(f"Pass rate: {summary['pass_rate']:.1%}")
print(f"Tests passed: {summary['passed']}/{summary['total']}")

# 7. Show detailed summary
print("\n" + runner.format_summary(summary))

# 8. Error analysis
failed_tests = [r for r in all_results if not r.success]
if failed_tests:
    print("\n❌ Failed tests:")
    for result in failed_tests:
        print(f"- {result.test_case.name}: {result.error or 'Validation failed'}")
        print(f"  Input: {result.test_case.inputs}")
        print(f"  Output: {result.output}")
        print(f"  Expected: {result.test_case.expected_output}")

# 9. Performance analysis
durations = [r.duration_ms for r in all_results if r.duration_ms]
if durations:
    avg_duration = sum(durations) / len(durations)
    max_duration = max(durations)
    print("\n⏱️ Performance:")
    print(f"Average duration: {avg_duration:.1f}ms")
    print(f"Max duration: {max_duration:.1f}ms")
Advanced Features¶
Regression Testing¶
def regression_test_suite():
    """Suite for regression tests."""
    # Historical test cases that must always pass
    regression_cases = [
        TestCase(name="regression_1", inputs={"text": "baseline test"}),
        TestCase(name="regression_2", inputs={"text": "edge case"}),
        # ... other critical tests
    ]

    results = runner.run_tests(regression_cases, sentiment_classifier)

    # Ensure all pass
    all_passed = all(r.success for r in results)
    if not all_passed:
        raise AssertionError("Regression test failed!")

    return results
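The suite can also be wrapped in an existing test framework. A minimal sketch of a pytest-style wrapper (the function name is illustrative and not part of this module):
# Collected automatically by pytest when placed in a test_*.py file
def test_no_regressions():
    results = regression_test_suite()
    assert all(r.success for r in results)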
Performance Testing¶
import time

def performance_benchmark():
    """Performance benchmark."""
    # Create many test cases for stress testing
    stress_cases = [
        TestCase(
            name=f"perf_test_{i}",
            inputs={"text": f"Performance test number {i}"},
        )
        for i in range(100)
    ]

    # Measure performance
    start_time = time.time()
    results = runner.run_tests(stress_cases, sentiment_classifier, parallel=True)
    total_time = time.time() - start_time

    # Analyze performance metrics
    durations = [r.duration_ms for r in results]
    throughput = len(results) / total_time

    print(f"Throughput: {throughput:.1f} tests/sec")
    print(f"Average latency: {sum(durations)/len(durations):.1f}ms")

    return results
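The max_workers setting from the constructor is the main tuning knob for throughput. A short sketch (using only documented parameters; the helper name is illustrative) that compares different worker counts:
import time

def compare_worker_counts(worker_counts=(1, 2, 4, 8)):
    """Benchmark throughput for several max_workers settings."""
    cases = [
        TestCase(name=f"bench_{i}", inputs={"text": f"Benchmark input {i}"})
        for i in range(50)
    ]
    for workers in worker_counts:
        bench_runner = PromptTestRunner(max_workers=workers)
        start = time.time()
        results = bench_runner.run_tests(cases, sentiment_classifier, parallel=True)
        elapsed = time.time() - start
        print(f"max_workers={workers}: {len(results) / elapsed:.1f} tests/sec")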
CI/CD Integration¶
def ci_test_suite() -> bool:
    """Test suite for CI/CD pipeline."""
    try:
        # Run all tests (all_test_cases and prompt_function are defined elsewhere in the project)
        results = runner.run_tests(all_test_cases, prompt_function)
        summary = runner.get_summary(results)

        # Success criteria for CI
        success_criteria = {
            "min_pass_rate": 0.95,  # 95% of tests must pass
            "max_avg_duration": 1000,  # Max 1 second on average
        }

        pass_rate = summary["pass_rate"]
        avg_duration = summary["metrics"]["duration_ms"]["mean"]

        # Check criteria
        if pass_rate < success_criteria["min_pass_rate"]:
            print(f"❌ Pass rate too low: {pass_rate:.1%}")
            return False
        if avg_duration > success_criteria["max_avg_duration"]:
            print(f"❌ Average duration too high: {avg_duration:.1f}ms")
            return False

        print(f"✅ CI test passed: {pass_rate:.1%} pass rate, {avg_duration:.1f}ms avg")
        return True
    except Exception as e:
        print(f"❌ CI test failed: {e}")
        return False

# Usage in CI
if __name__ == "__main__":
    import sys

    success = ci_test_suite()
    sys.exit(0 if success else 1)
Error Handling¶
The runner automatically handles:
- Timeouts: Times out hanging tests instead of blocking the suite
- Exceptions: Catches exceptions and logs them as errors
- Validation: Supports custom validation via validation_fn
- Metrics: Collects metrics even in case of partial errors
Error Handling Example¶
def robust_prompt_function(inputs: Dict[str, Any]) -> str:
    """Prompt function with error handling."""
    try:
        text = inputs.get("text", "")
        if not text.strip():
            raise ValueError("Empty input text")

        # Simulate possible failure
        if "error" in text.lower():
            raise RuntimeError("Simulated LLM error")

        return sentiment_classifier(inputs)
    except Exception as e:
        # Log the error but continue with a default value
        print(f"Error in prompt function: {e}")
        return "neutral"  # Default safe value

# The runner will still catch any unhandled exceptions
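When a prompt function raises an exception it does not handle itself, the runner records the failure rather than aborting the run. A small sketch, assuming the exception message is surfaced on result.error as in the error-analysis step of the complete workflow above:
# Hypothetical prompt function that raises instead of handling the error itself
def fragile_prompt_function(inputs: Dict[str, Any]) -> str:
    raise RuntimeError("Simulated LLM error")

failing_case = TestCase(name="unhandled_error", inputs={"text": "anything"})
result = runner.run_test(failing_case, fragile_prompt_function)

print(f"Success: {result.success}")  # expected to be False
print(f"Error: {result.error}")      # expected to contain the exception message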
See Also¶
A/B Test - A/B testing for prompts