Models

The prompt_versioner.metrics.models module defines data models for metrics and comparison structures.

ModelMetrics

Dataclass representing metrics for a single LLM call.

Attributes

Model Information

  • model_name (Optional[str]): Name of the model used

Token Usage

  • input_tokens (Optional[int]): Number of input tokens
  • output_tokens (Optional[int]): Number of output tokens
  • total_tokens (Optional[int]): Total number of tokens

Costs

  • cost_eur (Optional[float]): Cost in EUR

Performance

  • latency_ms (Optional[float]): Latency in milliseconds

Quality Metrics

  • quality_score (Optional[float]): Quality score
  • accuracy (Optional[float]): Accuracy

Model Parameters

  • temperature (Optional[float]): Temperature used
  • max_tokens (Optional[int]): Maximum number of tokens
  • top_p (Optional[float]): Nucleus sampling (top_p) value

Status

  • success (bool): Whether the call was successful (default: True)
  • error_message (Optional[str]): Error message if the call failed

Additional Data

  • metadata (Optional[Dict[str, Any]]): Additional metadata

Methods

to_dict()

def to_dict(self) -> Dict[str, Any]

Converts the object to a dictionary.

Returns: - Dict[str, Any]: Dictionary representation

from_dict()

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ModelMetrics"

Creates an instance from a dictionary.

Parameters: - data (Dict[str, Any]): Data in dictionary format

Returns: - ModelMetrics: New instance

Example:

from prompt_versioner.metrics.models import ModelMetrics

# Manual creation
metrics = ModelMetrics(
    model_name="gpt-4",
    input_tokens=100,
    output_tokens=50,
    total_tokens=150,
    cost_eur=0.003,
    latency_ms=1200,
    quality_score=0.95,
    temperature=0.7,
    success=True
)

# Convert to dictionary
data = metrics.to_dict()
print(f"Cost: €{data['cost_eur']}")

# Create from dictionary
metrics_copy = ModelMetrics.from_dict(data)
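
Because to_dict() returns a plain dictionary, a metrics record can typically be round-tripped through JSON for logging or storage. The snippet below is a minimal sketch that assumes the dictionary contains only JSON-serializable values.

import json

from prompt_versioner.metrics.models import ModelMetrics

metrics = ModelMetrics(model_name="gpt-4", input_tokens=100, output_tokens=50, cost_eur=0.003)

# Serialize to JSON (assumes to_dict() yields only JSON-serializable values)
payload = json.dumps(metrics.to_dict())

# Restore from the JSON payload
restored = ModelMetrics.from_dict(json.loads(payload))
print(restored.model_name)  # "gpt-4"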

MetricStats

Dataclass representing a statistical summary of a metric.

Attributes

  • name (str): Name of the metric
  • count (int): Number of values
  • mean (float): Mean
  • median (float): Median
  • std_dev (float): Standard deviation
  • min_val (float): Minimum value
  • max_val (float): Maximum value

Methods

to_dict()

def to_dict(self) -> Dict[str, Any]

Converts to dictionary.

format()

def format(self) -> str

Formats as a readable string.

Example:

from prompt_versioner.metrics.models import MetricStats

stats = MetricStats(
    name="latency_ms",
    count=100,
    mean=150.5,
    median=145.0,
    std_dev=25.3,
    min_val=95.0,
    max_val=250.0
)

print(stats.format())
# Output: latency_ms: mean=150.5000, median=145.0000, std=25.3000, range=[95.0000, 250.0000], n=100
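
MetricStats instances are normally produced by the tracking and aggregation components, but a summary can also be built by hand from raw samples. The helper below is an illustrative sketch using Python's statistics module; summarize is not part of the library.

import statistics

from prompt_versioner.metrics.models import MetricStats

def summarize(name: str, values: list[float]) -> MetricStats:
    # Hypothetical helper: build a MetricStats from raw samples
    return MetricStats(
        name=name,
        count=len(values),
        mean=statistics.mean(values),
        median=statistics.median(values),
        std_dev=statistics.stdev(values) if len(values) > 1 else 0.0,
        min_val=min(values),
        max_val=max(values),
    )

latencies = [120.0, 145.0, 150.0, 160.0, 175.0]
print(summarize("latency_ms", latencies).format())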

MetricComparison

Dataclass representing a comparison between two sets of metrics.

Attributes

  • metric_name (str): Name of the metric
  • baseline_mean (float): Mean of the baseline version
  • new_mean (float): Mean of the new version
  • mean_diff (float): Difference between the means (new_mean - baseline_mean)
  • mean_pct_change (float): Percent change of the mean relative to the baseline
  • improved (bool): Whether the change counts as an improvement
  • baseline_stats (Dict[str, float]): Baseline statistics
  • new_stats (Dict[str, float]): New version statistics

Methods

format()

def format(self) -> str

Formats the comparison as a readable string.

Example:

from prompt_versioner.metrics.models import MetricComparison

comparison = MetricComparison(
    metric_name="latency_ms",
    baseline_mean=150.0,
    new_mean=120.0,
    mean_diff=-30.0,
    mean_pct_change=-20.0,
    improved=True,
    baseline_stats={"std_dev": 15.0},
    new_stats={"std_dev": 12.0}
)

print(comparison.format())
# Output: latency_ms: 150.0000 → 120.0000 (↑ 20.00%)

MetricType

Enum defining common metric types for LLM prompts.

Values

Token Metrics

  • INPUT_TOKENS = "input_tokens"
  • OUTPUT_TOKENS = "output_tokens"
  • TOTAL_TOKENS = "total_tokens"

Cost Metrics

  • COST = "cost_eur"
  • COST_PER_TOKEN = "cost_per_token"

Performance Metrics

  • LATENCY = "latency_ms"
  • THROUGHPUT = "throughput"

Quality Metrics

  • QUALITY = "quality_score"
  • ACCURACY = "accuracy"
  • RELEVANCE = "relevance"
  • COHERENCE = "coherence"
  • FACTUALITY = "factuality"
  • FLUENCY = "fluency"

Success Metrics

  • SUCCESS_RATE = "success_rate"
  • ERROR_RATE = "error_rate"

Example:

from prompt_versioner.metrics.models import MetricType

# Using constants
print(MetricType.LATENCY.value)  # "latency_ms"
print(MetricType.ACCURACY.value)  # "accuracy"

# Check if a string is a valid type
metric_name = "cost_eur"
if metric_name in [mt.value for mt in MetricType]:
    print(f"{metric_name} is a valid metric type")

MetricDirection

Enum defining the optimization direction for metrics.

Values

  • HIGHER_IS_BETTER = "higher": Higher values are better
  • LOWER_IS_BETTER = "lower": Lower values are better
  • NEUTRAL = "neutral": Neutral direction

METRIC_DIRECTIONS

Dictionary mapping metric types to their optimization direction.

METRIC_DIRECTIONS = {
    MetricType.COST: MetricDirection.LOWER_IS_BETTER,
    MetricType.COST_PER_TOKEN: MetricDirection.LOWER_IS_BETTER,
    MetricType.LATENCY: MetricDirection.LOWER_IS_BETTER,
    MetricType.ERROR_RATE: MetricDirection.LOWER_IS_BETTER,
    MetricType.QUALITY: MetricDirection.HIGHER_IS_BETTER,
    MetricType.ACCURACY: MetricDirection.HIGHER_IS_BETTER,
    MetricType.RELEVANCE: MetricDirection.HIGHER_IS_BETTER,
    MetricType.COHERENCE: MetricDirection.HIGHER_IS_BETTER,
    MetricType.FACTUALITY: MetricDirection.HIGHER_IS_BETTER,
    MetricType.FLUENCY: MetricDirection.HIGHER_IS_BETTER,
    MetricType.THROUGHPUT: MetricDirection.HIGHER_IS_BETTER,
    MetricType.SUCCESS_RATE: MetricDirection.HIGHER_IS_BETTER,
}

Example:

from prompt_versioner.metrics.models import MetricType, METRIC_DIRECTIONS, MetricDirection

# Check optimization direction
direction = METRIC_DIRECTIONS.get(MetricType.LATENCY, MetricDirection.NEUTRAL)
print(f"For latency, {direction.value} is better")  # "lower is better"

if direction == MetricDirection.LOWER_IS_BETTER:
    print("We aim to reduce latency")

MetricThreshold

Dataclass for configuring warning thresholds for a metric.

Attributes

  • metric_type (MetricType): Type of metric
  • warning_threshold (float): Value at which the metric is flagged as a warning
  • critical_threshold (float): Value at which the metric is flagged as critical
  • direction (MetricDirection): Optimization direction used to interpret the thresholds (default: HIGHER_IS_BETTER)

Methods

check()

def check(self, value: float) -> str

Checks if a value meets the thresholds.

Parameters: - value (float): Value to check

Returns: - str: 'ok', 'warning', or 'critical'

Example:

from prompt_versioner.metrics.models import MetricThreshold, MetricType, MetricDirection

# Thresholds for latency (lower is better)
latency_threshold = MetricThreshold(
    metric_type=MetricType.LATENCY,
    warning_threshold=200.0,  # warning if > 200ms
    critical_threshold=500.0,  # critical if > 500ms
    direction=MetricDirection.LOWER_IS_BETTER
)

# Test values
values = [150.0, 250.0, 600.0]
for val in values:
    status = latency_threshold.check(val)
    print(f"Latency {val}ms: {status}")

# Thresholds for accuracy (higher is better)
accuracy_threshold = MetricThreshold(
    metric_type=MetricType.ACCURACY,
    warning_threshold=0.8,   # warning if < 0.8
    critical_threshold=0.6,  # critical if < 0.6
    direction=MetricDirection.HIGHER_IS_BETTER
)

accuracy_values = [0.95, 0.75, 0.5]
for val in accuracy_values:
    status = accuracy_threshold.check(val)
    print(f"Accuracy {val}: {status}")

See Also

  • Aggregator - Functionality to aggregate metrics across multiple test runs
  • Analyzer - Functionality for analyzing and comparing metrics between versions
  • Calculator - Utility for single-call metric calculations
  • Pricing - Manages model pricing and calculates LLM call costs
  • Tracker - Functionality for tracking and statistical analysis of metrics