mirror of
https://github.com/we-promise/sure.git
synced 2026-04-19 12:04:08 +00:00
Small LLMs improvements (#400)
* Initial implementation
* FIX keys
* Add langfuse evals support
* FIX trace upload
* Delete .claude/settings.local.json

Signed-off-by: soky srm <sokysrm@gmail.com>

* Update client.rb
* Small LLMs improvements
* Keep batch size normal
* Update categorizer
* FIX json mode
* Add reasonable alternative to matching
* FIX thinking blocks for LLMs
* Implement json mode support with AUTO mode
* Make auto default for everyone
* FIX linter
* Address review
* Allow export manual categories
* FIX user export
* FIX oneshot example pollution
* Update categorization_golden_v1.yml
* Update categorization_golden_v1.yml
* Trim to 100 items
* Update auto_categorizer.rb
* FIX for auto retry in auto mode
* Separate the eval logic from the auto-categorizer

  The expected_null_count parameter conflates eval-specific logic with production categorization logic.

* Force json mode on evals
* Introduce a more mixed dataset

  150 items; performance from a local model, by difficulty:
  - easy: 93.22% accuracy (55/59)
  - medium: 93.33% accuracy (42/45)
  - hard: 92.86% accuracy (26/28)
  - edge_case: 100.0% accuracy (18/18)

* Improve datasets

  Remove data leakage from prompts

* Create eval runs as "pending"

---------

Signed-off-by: soky srm <sokysrm@gmail.com>
Signed-off-by: Juan José Mata <juanjo.mata@gmail.com>
Co-authored-by: Juan José Mata <juanjo.mata@gmail.com>
app/models/eval/metrics/base.rb (normal file, 68 lines added)
@@ -0,0 +1,68 @@
class Eval::Metrics::Base
  attr_reader :eval_run

  def initialize(eval_run)
    @eval_run = eval_run
  end

  def calculate
    raise NotImplementedError, "Subclasses must implement #calculate"
  end

  protected

  def results
    @results ||= eval_run.results.includes(:sample)
  end

  def samples
    @samples ||= eval_run.dataset.samples
  end

  def total_count
    results.count
  end

  def correct_count
    results.where(correct: true).count
  end

  def incorrect_count
    results.where(correct: false).count
  end

  def accuracy
    return 0.0 if total_count.zero?
    (correct_count.to_f / total_count * 100).round(2)
  end

  def avg_latency_ms
    return nil if total_count.zero?
    results.average(:latency_ms)&.round(0)
  end

  def total_cost
    results.sum(:cost)&.to_f&.round(6)
  end

  def cost_per_sample
    return nil if total_count.zero?
    (total_cost / total_count).round(6)
  end

  def metrics_by_difficulty
    %w[easy medium hard edge_case].index_with do |difficulty|
      difficulty_results = results.joins(:sample).where(eval_samples: { difficulty: difficulty })
      next nil if difficulty_results.empty?

      correct = difficulty_results.where(correct: true).count
      total = difficulty_results.count

      {
        count: total,
        correct: correct,
        accuracy: (correct.to_f / total * 100).round(2)
      }
    end.compact
  end
end
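The accuracy helper above guards the zero-denominator case and rounds the percentage to two decimals. A standalone sketch of that formula in plain Ruby (no ActiveRecord; the method name and inputs here are illustrative) reproduces the per-difficulty numbers quoted in the commit message:

```ruby
# Standalone illustration of the formula in Eval::Metrics::Base#accuracy.
# In the real class the counts come from ActiveRecord relations on the
# eval run; here they are passed in directly for demonstration.
def accuracy(correct_count, total_count)
  return 0.0 if total_count.zero?
  (correct_count.to_f / total_count * 100).round(2)
end

accuracy(55, 59) # => 93.22, matching the "easy" split reported above
accuracy(0, 0)   # => 0.0 rather than raising ZeroDivisionError
```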
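`metrics_by_difficulty` leans on ActiveSupport's `index_with` and an ActiveRecord join against `eval_samples`. The same grouping can be sketched in plain Ruby over an assumed in-memory row shape (the hash keys and `rows` data below are illustrative, not part of the actual schema):

```ruby
# Plain-Ruby approximation of Eval::Metrics::Base#metrics_by_difficulty.
# Each row stands in for one eval result joined to its sample.
DIFFICULTIES = %w[easy medium hard edge_case]

def metrics_by_difficulty(rows)
  DIFFICULTIES.each_with_object({}) do |difficulty, out|
    subset = rows.select { |r| r[:difficulty] == difficulty }
    next if subset.empty? # mirrors `next nil ... end.compact` in the original

    correct = subset.count { |r| r[:correct] }
    out[difficulty] = {
      count: subset.size,
      correct: correct,
      accuracy: (correct.to_f / subset.size * 100).round(2)
    }
  end
end

rows = [
  { difficulty: "easy", correct: true },
  { difficulty: "easy", correct: false },
  { difficulty: "hard", correct: true }
]
metrics_by_difficulty(rows)
# => easy scores 50.0% over 2 rows, hard 100.0% over 1 row;
#    medium and edge_case are omitted entirely, as in the original.
```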