mirror of
https://github.com/we-promise/sure.git
synced 2026-04-19 12:04:08 +00:00
Small LLMs improvements (#400)
* Initial implementation
* FIX keys
* Add langfuse evals support
* FIX trace upload
* Delete .claude/settings.local.json

Signed-off-by: soky srm <sokysrm@gmail.com>

* Update client.rb
* Small LLMs improvements
* Keep batch size normal
* Update categorizer
* FIX json mode
* Add reasonable alternative to matching
* FIX thinking blocks for LLMs
* Implement json mode support with AUTO mode
* Make auto default for everyone
* FIX linter
* Address review
* Allow export manual categories
* FIX user export
* FIX oneshot example pollution
* Update categorization_golden_v1.yml
* Update categorization_golden_v1.yml
* Trim to 100 items
* Update auto_categorizer.rb
* FIX for auto retry in auto mode
* Separate the eval logic from the auto-categorizer

  The expected_null_count parameter conflates eval-specific logic with production categorization logic.

* Force json mode on evals
* Introduce a more mixed dataset

  150 items; performance from a local model, by difficulty:
  - easy: 93.22% accuracy (55/59)
  - medium: 93.33% accuracy (42/45)
  - hard: 92.86% accuracy (26/28)
  - edge_case: 100.0% accuracy (18/18)

* Improve datasets

  Remove data leakage from prompts

* Create eval runs as "pending"

---------

Signed-off-by: soky srm <sokysrm@gmail.com>
Signed-off-by: Juan José Mata <juanjo.mata@gmail.com>
Co-authored-by: Juan José Mata <juanjo.mata@gmail.com>
app/models/eval/metrics/base.rb (normal file, 68 lines added)
@@ -0,0 +1,68 @@
class Eval::Metrics::Base
  attr_reader :eval_run

  def initialize(eval_run)
    @eval_run = eval_run
  end

  def calculate
    raise NotImplementedError, "Subclasses must implement #calculate"
  end

  protected

  def results
    @results ||= eval_run.results.includes(:sample)
  end

  def samples
    @samples ||= eval_run.dataset.samples
  end

  def total_count
    results.count
  end

  def correct_count
    results.where(correct: true).count
  end

  def incorrect_count
    results.where(correct: false).count
  end

  def accuracy
    return 0.0 if total_count.zero?
    (correct_count.to_f / total_count * 100).round(2)
  end

  def avg_latency_ms
    return nil if total_count.zero?
    results.average(:latency_ms)&.round(0)
  end

  def total_cost
    results.sum(:cost)&.to_f&.round(6)
  end

  def cost_per_sample
    return nil if total_count.zero?
    (total_cost / total_count).round(6)
  end

  def metrics_by_difficulty
    %w[easy medium hard edge_case].index_with do |difficulty|
      difficulty_results = results.joins(:sample).where(eval_samples: { difficulty: difficulty })
      next nil if difficulty_results.empty?

      correct = difficulty_results.where(correct: true).count
      total = difficulty_results.count

      {
        count: total,
        correct: correct,
        accuracy: (correct.to_f / total * 100).round(2)
      }
    end.compact
  end
end
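The accuracy helper above guards the zero-denominator case and rounds the percentage to two decimals. A standalone sketch of that formula in plain Ruby (no ActiveRecord; the method name and inputs here are illustrative) reproduces the per-difficulty numbers quoted in the commit message:

```ruby
# Standalone illustration of the formula in Eval::Metrics::Base#accuracy.
# In the real class the counts come from ActiveRecord relations on the
# eval run; here they are passed in directly for demonstration.
def accuracy(correct_count, total_count)
  return 0.0 if total_count.zero?
  (correct_count.to_f / total_count * 100).round(2)
end

accuracy(55, 59) # => 93.22, matching the "easy" split reported above
accuracy(0, 0)   # => 0.0 rather than raising ZeroDivisionError
```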
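`metrics_by_difficulty` leans on ActiveSupport's `index_with` and an ActiveRecord join against `eval_samples`. The same grouping can be sketched in plain Ruby over an assumed in-memory row shape (the hash keys and `rows` data below are illustrative, not part of the actual schema):

```ruby
# Plain-Ruby approximation of Eval::Metrics::Base#metrics_by_difficulty.
# Each row stands in for one eval result joined to its sample.
DIFFICULTIES = %w[easy medium hard edge_case]

def metrics_by_difficulty(rows)
  DIFFICULTIES.each_with_object({}) do |difficulty, out|
    subset = rows.select { |r| r[:difficulty] == difficulty }
    next if subset.empty? # mirrors `next nil ... end.compact` in the original

    correct = subset.count { |r| r[:correct] }
    out[difficulty] = {
      count: subset.size,
      correct: correct,
      accuracy: (correct.to_f / subset.size * 100).round(2)
    }
  end
end

rows = [
  { difficulty: "easy", correct: true },
  { difficulty: "easy", correct: false },
  { difficulty: "hard", correct: true }
]
metrics_by_difficulty(rows)
# => easy scores 50.0% over 2 rows, hard 100.0% over 1 row;
#    medium and edge_case are omitted entirely, as in the original.
```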