
Model Performance Evaluation

Why evaluate?​

Evaluation makes routing data-driven. By measuring per-category accuracy on MMLU-Pro (and doing a quick sanity check with ARC), you can:

  • Select the right model for each category and rank them into categories.model_scores
  • Pick a sensible default_model based on overall performance
  • Decide when CoT prompting is worth the latency/cost tradeoff
  • Catch regressions when models, prompts, or parameters change
  • Keep changes reproducible and auditable for CI and releases

In short, evaluation converts anecdotes into measurable signals that improve quality, cost efficiency, and reliability of the router.


This guide documents the automated workflow to evaluate models (MMLU-Pro and ARC Challenge) via a vLLM-compatible OpenAI endpoint, generate a performance-based routing config, and update categories.model_scores in config.

see code in /src/training/model_eval

What you'll run end-to-end​

1) Evaluate models​

  • MMLU-Pro: per-category accuracies
  • ARC Challenge: overall accuracy

2) Visualize results​

  • bar/heatmap plot of per-category accuracies

(Example outputs: bar plot and heatmap of per-category accuracies.)

3) Generate an updated config.yaml​

  • Rank models per category into categories.model_scores
  • Set default_model to the best average performer
  • Keep or apply category-level reasoning settings

1. Prerequisites

  • A running vLLM-compatible OpenAI endpoint serving your models

    # Terminal 1
    vllm serve microsoft/phi-4 --port 11434 --served-model-name phi4

    # Terminal 2
    vllm serve Qwen/Qwen3-0.6B --port 11435 --served-model-name qwen3-0.6B
  • Python packages for evaluation scripts:

    # We will work in this directory for the rest of this guide
    cd /src/training/model_eval
    pip install -r requirements.txt

Optional tip:

  • Ensure your config/config.yaml includes your deployed model names under vllm_endpoints[].models and any pricing/policy under model_config if you plan to use the generated config directly.
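
Before running the evaluation scripts, you can quickly confirm that each endpoint is up and serving the expected model name via the OpenAI-compatible /models route. A minimal sketch; the endpoint URLs and model names mirror the examples above, so adjust them to your deployment:

import requests

# Endpoint URL -> model name we expect it to serve (taken from the serve commands above)
ENDPOINTS = {
    "http://localhost:11434/v1": "phi4",
    "http://localhost:11435/v1": "qwen3-0.6B",
}

for endpoint, expected in ENDPOINTS.items():
    # OpenAI-compatible servers list served models under GET {endpoint}/models
    served = [m["id"] for m in requests.get(f"{endpoint}/models", timeout=10).json()["data"]]
    status = "OK" if expected in served else "MISSING"
    print(f"{endpoint}: serves {served} -> {expected} {status}")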

2. Evaluate on MMLU-Pro

see script in mmlu_pro_vllm_eval.py

Example usage patterns​

# Evaluate a few models, few samples per category, direct prompting
python mmlu_pro_vllm_eval.py \
--endpoint http://localhost:11434/v1 \
--models phi4 \
--samples-per-category 10

python mmlu_pro_vllm_eval.py \
--endpoint http://localhost:11435/v1 \
--models qwen3-0.6B \
--samples-per-category 10

# Evaluate with CoT (results saved under *_cot)
python mmlu_pro_vllm_eval.py \
--endpoint http://localhost:11435/v1 \
--models qwen3-0.6B \
--samples-per-category 10 \
--use-cot

# If you have set up Semantic Router properly, you can evaluate both models in one go
python mmlu_pro_vllm_eval.py \
--endpoint http://localhost:8801/v1 \
--models qwen3-0.6B,phi4 \
--samples-per-category 10
# Add --use-cot to the command above to enable CoT prompting

Key flags​

  • --endpoint: vLLM OpenAI URL (default http://localhost:8000/v1)
  • --models: space-separated list OR a single comma-separated string; if omitted, the script queries /models from the endpoint
  • --categories: restrict evaluation to specific categories; if omitted, uses all categories in the dataset
  • --samples-per-category: limit questions per category (useful for quick runs)
  • --use-cot: enables Chain-of-Thought prompting variant; results are saved in a separate subfolder suffix (_cot vs _direct)
  • --concurrent-requests: concurrency for throughput
  • --output-dir: where results are saved (default results)
  • --max-tokens, --temperature, --seed: generation and reproducibility knobs

What it outputs per model​

  • results/Model_Name_(direct|cot)/
    • detailed_results.csv: one row per question with is_correct and category
    • analysis.json: overall_accuracy, category_accuracy map, avg_response_time, counts
    • summary.json: condensed metrics
  • mmlu_pro_vllm_eval.txt: prompts and answers log (debug/inspection)

Note

  • Model naming: slashes in model names are replaced with underscores, and the prompting approach is appended to the folder name; e.g., gemma3:27b evaluated with direct prompting -> gemma3:27b_direct directory.
  • Category accuracy is computed on successful queries only; failed requests are excluded.
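
If you prefer to inspect results programmatically instead of (or before) plotting them in step 4, the fields listed above can be read straight from analysis.json. A minimal sketch, assuming the default results/ layout and a model folder named phi4_direct:

import json
from pathlib import Path

# Path assumes --output-dir results and a model named phi4 evaluated with direct prompting
analysis = json.loads(Path("results/phi4_direct/analysis.json").read_text())

print(f"overall accuracy : {analysis['overall_accuracy']:.3f}")
print(f"avg response time: {analysis['avg_response_time']:.2f}s")
# Per-category accuracies, best first
for category, accuracy in sorted(analysis["category_accuracy"].items(), key=lambda kv: -kv[1]):
    print(f"  {category:<20} {accuracy:.3f}")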

3. Evaluate on ARC Challenge (optional, overall sanity check)

see script in arc_challenge_vllm_eval.py

Example usage patterns​

python arc_challenge_vllm_eval.py \
--endpoint http://localhost:8801/v1 \
--models qwen3-0.6B,phi4 \
--output-dir arc_results

Key flags​

  • --samples: total questions to sample (default 20); ARC is not categorized in our script
  • Other flags mirror the MMLU-Pro script

What it outputs per model​

  • results/Model_Name_(direct|cot)/
    • detailed_results.csv: one row per question with is_correct and category
    • analysis.json: overall_accuracy, avg_response_time
    • summary.json: condensed metrics
  • arc_challenge_vllm_eval.txt: prompts and answers log (debug/inspection)

Note

  • ARC results do not feed categories[].model_scores directly, but they can help spot regressions.
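
One way to use ARC for regression spotting is to compare each model's current overall_accuracy against a stored baseline. A small sketch, where arc_results/ matches the --output-dir above and baseline.json is a hypothetical file you maintain yourself:

import json
from pathlib import Path

TOLERANCE = 0.02  # allow small run-to-run noise

# Hypothetical baseline file, e.g. {"qwen3-0.6B_direct": 0.55, "phi4_direct": 0.90}
baseline = json.loads(Path("arc_results/baseline.json").read_text())

for analysis_path in Path("arc_results").glob("*/analysis.json"):
    model = analysis_path.parent.name
    current = json.loads(analysis_path.read_text())["overall_accuracy"]
    previous = baseline.get(model)
    if previous is not None and current < previous - TOLERANCE:
        print(f"REGRESSION: {model} dropped {previous:.3f} -> {current:.3f}")
    else:
        print(f"ok: {model} {current:.3f}")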

4. Visualize per-category performance

see script in plot_category_accuracies.py

Example usage patterns

# Use results/ to generate bar plot
python plot_category_accuracies.py \
--results-dir results \
--plot-type bar \
--output-file results/bar.png

# Use results/ to generate heatmap plot
python plot_category_accuracies.py \
--results-dir results \
--plot-type heatmap \
--output-file results/heatmap.png

# Use sample-data to generate example plot
python src/training/model_eval/plot_category_accuracies.py \
--sample-data \
--plot-type heatmap \
--output-file results/category_accuracies.png

Key flags​

  • --results-dir: where analysis.json files are
  • --plot-type: bar or heatmap
  • --output-file: output image path (default model_eval/category_accuracies.png)
  • --sample-data: if no results exist, generates fake data to preview the plot

What it does​

  • Finds all results/**/analysis.json, aggregates analysis["category_accuracy"] per model
  • Adds an Overall column representing the average across categories
  • Produces a figure to quickly compare model/category performance

Note

  • It merges direct and cot as distinct model variants by appending :direct or :cot to the label; the legend hides :direct for brevity.
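
The aggregation itself is simple enough to reproduce by hand if you want a custom figure. Below is a simplified sketch of the same idea (not the plot_category_accuracies.py implementation), assuming the default results/ layout:

import json
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Collect category_accuracy from every analysis.json, keyed by the model folder name
rows = {}
for analysis_path in Path("results").glob("*/analysis.json"):
    rows[analysis_path.parent.name] = json.loads(analysis_path.read_text())["category_accuracy"]

df = pd.DataFrame(rows).T          # models as rows, categories as columns
df["Overall"] = df.mean(axis=1)    # average across categories, as in the script

fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(df.values, cmap="viridis", vmin=0.0, vmax=1.0)
ax.set_xticks(range(len(df.columns)))
ax.set_xticklabels(df.columns, rotation=45, ha="right")
ax.set_yticks(range(len(df.index)))
ax.set_yticklabels(df.index)
fig.colorbar(im, ax=ax, label="accuracy")
fig.tight_layout()
fig.savefig("results/quick_heatmap.png")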

5. Generate performance-based routing config

see script in result_to_config.py

Example usage patterns​

# Use results/ to generate a new config file (does not overwrite config/config.yaml)
python src/training/model_eval/result_to_config.py \
--results-dir results \
--output-file config/config.eval.yaml

# Override the semantic cache similarity threshold
python src/training/model_eval/result_to_config.py \
--results-dir results \
--output-file config/config.eval.yaml \
--similarity-threshold 0.85

# Generate from specific folder
python src/training/model_eval/result_to_config.py \
--results-dir results/mmlu_run_2025_09_10 \
--output-file config/config.eval.yaml

Key flags​

  • --results-dir: points to the folder where analysis.json files live
  • --output-file: target config path (default config/config.yaml)
  • --similarity-threshold: semantic cache threshold to set in the generated config

What it does​

  • Reads all analysis.json files, extracting analysis["category_accuracy"]
  • Constructs a new config:
    • categories: For each category present in results, ranks models by accuracy:
      • category.model_scores = [{ model: "Model_Name", score: 0.87 }, ...], highest first
    • default_model: the best average performer across categories
    • category reasoning settings: auto-filled from a built-in mapping (you can adjust after generation)
      • math, physics, chemistry, CS, engineering -> high reasoning
      • others default -> low/medium
    • Leaves out any special "auto" placeholder models if present
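
The ranking logic is easy to reason about in isolation. Below is a simplified sketch of the idea, not the result_to_config.py implementation (which also handles reasoning settings and placeholder filtering):

import json
from collections import defaultdict
from pathlib import Path

# model label -> {category: accuracy}; strip the _direct/_cot suffix from folder names
# (if both variants exist for a model, the later one simply overwrites in this sketch)
per_model = {}
for analysis_path in Path("results").glob("*/analysis.json"):
    label = analysis_path.parent.name.rsplit("_", 1)[0]
    per_model[label] = json.loads(analysis_path.read_text())["category_accuracy"]

# categories[].model_scores: models ranked by accuracy, highest first
model_scores = defaultdict(list)
for model, accuracies in per_model.items():
    for category, accuracy in accuracies.items():
        model_scores[category].append({"model": model, "score": round(accuracy, 4)})
for category in model_scores:
    model_scores[category].sort(key=lambda entry: entry["score"], reverse=True)

# default_model: best average accuracy across categories
default_model = max(per_model, key=lambda m: sum(per_model[m].values()) / len(per_model[m]))
print(default_model)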

Schema alignment​

  • categories[].name: the MMLU-Pro category string
  • categories[].model_scores: descending ranking by accuracy for that category
  • default_model: a top performer across categories (approach suffix removed, e.g., gemma3:27b from gemma3:27b:direct)
  • Keeps other config sections (semantic_cache, tools, classifier, prompt_guard) with reasonable defaults; you can edit them post-generation if your environment differs

Note

  • This script only works with results from the MMLU-Pro evaluation.
  • Existing config.yaml can be overwritten. Consider writing to a temp file first and diffing:
    • --output-file config/config.eval.yaml
  • If your production config.yaml carries environment-specific settings (endpoints, pricing, policies), port the evaluated categories[].model_scores and default_model back into your canonical config.
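
If you maintain a separate production config, one low-risk way to port the evaluated rankings is to copy only categories[].model_scores and default_model into it and review the diff. A minimal sketch with PyYAML; the paths are examples:

import yaml

with open("config/config.eval.yaml") as f:
    evaluated = yaml.safe_load(f)
with open("config/config.yaml") as f:
    canonical = yaml.safe_load(f)

# Copy only the evaluated rankings; leave endpoints, pricing, and policies untouched
evaluated_scores = {c["name"]: c["model_scores"] for c in evaluated["categories"]}
for category in canonical.get("categories", []):
    if category["name"] in evaluated_scores:
        category["model_scores"] = evaluated_scores[category["name"]]
canonical["default_model"] = evaluated["default_model"]

# Write to a new file so you can diff against config/config.yaml before adopting it
with open("config/config.merged.yaml", "w") as f:
    yaml.safe_dump(canonical, f, sort_keys=False)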

Example config.eval.yaml​

see the configuration documentation for more about these fields

bert_model:
  model_id: sentence-transformers/all-MiniLM-L12-v2
  threshold: 0.6
  use_cpu: true
semantic_cache:
  enabled: true
  similarity_threshold: 0.85
  max_entries: 1000
  ttl_seconds: 3600
tools:
  enabled: true
  top_k: 3
  similarity_threshold: 0.2
  tools_db_path: config/tools_db.json
  fallback_to_empty: true
prompt_guard:
  enabled: true
  use_modernbert: true
  model_id: models/jailbreak_classifier_modernbert-base_model
  threshold: 0.7
  use_cpu: true
  jailbreak_mapping_path: models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json

# The vllm_endpoints and model_config sections are not generated; add them here as needed

classifier:
  category_model:
    model_id: models/category_classifier_modernbert-base_model
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: models/category_classifier_modernbert-base_model/category_mapping.json
  pii_model:
    model_id: models/pii_classifier_modernbert-base_presidio_token_model
    use_modernbert: true
    threshold: 0.7
    use_cpu: true
    pii_mapping_path: models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json
categories:
- name: business
  use_reasoning: false
  reasoning_description: Business content is typically conversational
  reasoning_effort: low
  model_scores:
  - model: phi4
    score: 0.2
  - model: qwen3-0.6B
    score: 0.0
- name: law
  use_reasoning: false
  reasoning_description: Legal content is typically explanatory
  reasoning_effort: medium
  model_scores:
  - model: phi4
    score: 0.8
  - model: qwen3-0.6B
    score: 0.2

# Some categories omitted here for brevity

- name: engineering
  use_reasoning: true
  reasoning_description: Engineering problems require systematic problem-solving
  reasoning_effort: high
  model_scores:
  - model: phi4
    score: 0.6
  - model: qwen3-0.6B
    score: 0.2
default_reasoning_effort: medium
default_model: phi4
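
Before pointing the router at a generated file, a quick consistency check can catch obvious problems (rankings not sorted, default_model missing from the results). A small sketch, again assuming the config/config.eval.yaml path used above:

import yaml

with open("config/config.eval.yaml") as f:
    cfg = yaml.safe_load(f)

ranked_models = set()
for category in cfg["categories"]:
    scores = [entry["score"] for entry in category["model_scores"]]
    # model_scores must be ranked highest-first within every category
    assert scores == sorted(scores, reverse=True), f"{category['name']} is not ranked"
    ranked_models.update(entry["model"] for entry in category["model_scores"])

# default_model should be one of the models that was actually evaluated
assert cfg["default_model"] in ranked_models, "default_model missing from model_scores"
print("config looks consistent")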