sure/app/models/eval/metrics/base.rb
soky srm 88952e4714 Small llms improvements (#400)
* Initial implementation

* FIX keys

* Add langfuse evals support

* FIX trace upload

* Delete .claude/settings.local.json

Signed-off-by: soky srm <sokysrm@gmail.com>

* Update client.rb

* Small LLMs improvements

* Keep batch size normal

* Update categorizer

* FIX json mode

* Add reasonable alternative to matching

* FIX thinking blocks for llms

* Implement json mode support with AUTO mode

* Make auto default for everyone

* FIX linter

* Address review

* Allow export manual categories

* FIX user export

* FIX oneshot example pollution

* Update categorization_golden_v1.yml

* Update categorization_golden_v1.yml

* Trim to 100 items

* Update auto_categorizer.rb

* FIX for auto retry in auto mode

* Separate the Eval Logic from the Auto-Categorizer

The expected_null_count parameter conflates eval-specific logic with production categorization logic.

* Force json mode on evals

* Introduce a more mixed dataset

150 items, performance from a local model:

By Difficulty:
  easy: 93.22% accuracy (55/59)
  medium: 93.33% accuracy (42/45)
  hard: 92.86% accuracy (26/28)
  edge_case: 100.0% accuracy (18/18)

* Improve datasets

Remove Data leakage from prompts

* Create eval runs as "pending"

---------

Signed-off-by: soky srm <sokysrm@gmail.com>
Signed-off-by: Juan José Mata <juanjo.mata@gmail.com>
Co-authored-by: Juan José Mata <juanjo.mata@gmail.com>
2025-12-07 18:11:34 +01:00


class Eval::Metrics::Base
  attr_reader :eval_run

  def initialize(eval_run)
    @eval_run = eval_run
  end

  def calculate
    raise NotImplementedError, "Subclasses must implement #calculate"
  end

  protected

    def results
      @results ||= eval_run.results.includes(:sample)
    end

    def samples
      @samples ||= eval_run.dataset.samples
    end

    def total_count
      results.count
    end

    def correct_count
      results.where(correct: true).count
    end

    def incorrect_count
      results.where(correct: false).count
    end

    def accuracy
      return 0.0 if total_count.zero?
      (correct_count.to_f / total_count * 100).round(2)
    end

    def avg_latency_ms
      return nil if total_count.zero?
      results.average(:latency_ms)&.round(0)
    end

    def total_cost
      results.sum(:cost)&.to_f&.round(6)
    end

    def cost_per_sample
      return nil if total_count.zero?
      (total_cost / total_count).round(6)
    end

    def metrics_by_difficulty
      %w[easy medium hard edge_case].index_with do |difficulty|
        difficulty_results = results.joins(:sample).where(eval_samples: { difficulty: difficulty })
        next nil if difficulty_results.empty?

        correct = difficulty_results.where(correct: true).count
        total = difficulty_results.count

        {
          count: total,
          correct: correct,
          accuracy: (correct.to_f / total * 100).round(2)
        }
      end.compact
    end
end
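
The per-difficulty breakdown in `metrics_by_difficulty` can be tried outside Rails with a minimal plain-Ruby sketch. It is not part of the repository: `ResultRow`, `DIFFICULTIES`, `accuracy_pct`, and `build_rows` are hypothetical stand-ins for the ActiveRecord queries above. The counts are the ones quoted in the commit message (55/59, 42/45, 26/28, 18/18), and the arithmetic mirrors `accuracy` and `metrics_by_difficulty`.

```ruby
# Hypothetical stand-in for an eval result joined to its sample (not the real model).
ResultRow = Struct.new(:difficulty, :correct)

DIFFICULTIES = %w[easy medium hard edge_case].freeze

# Same arithmetic as Eval::Metrics::Base#accuracy: percentage rounded to 2 decimals.
def accuracy_pct(correct, total)
  return 0.0 if total.zero?

  (correct.to_f / total * 100).round(2)
end

# In-memory analogue of #metrics_by_difficulty: difficulties with no rows are skipped,
# which matches the `next nil ... end.compact` pattern in the original.
def metrics_by_difficulty(rows)
  DIFFICULTIES.each_with_object({}) do |difficulty, out|
    subset = rows.select { |row| row.difficulty == difficulty }
    next if subset.empty?

    correct = subset.count(&:correct)
    out[difficulty] = {
      count: subset.size,
      correct: correct,
      accuracy: accuracy_pct(correct, subset.size)
    }
  end
end

# Builds `total` rows for one difficulty, the first `correct` of which are marked correct.
def build_rows(difficulty, correct, total)
  Array.new(total) { |i| ResultRow.new(difficulty, i < correct) }
end

rows = build_rows("easy", 55, 59) +
       build_rows("medium", 42, 45) +
       build_rows("hard", 26, 28) +
       build_rows("edge_case", 18, 18)

by_difficulty = metrics_by_difficulty(rows)
overall = accuracy_pct(rows.count(&:correct), rows.size)

pp by_difficulty
puts "overall: #{overall}%"
```

The sketch filters in memory for illustration only. The class above issues a separate `joins(:sample)` query for each difficulty, which pushes the counting to the database.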