Files
sure/app/models/eval/result.rb
soky srm 88952e4714 Small llms improvements (#400)
* Initial implementation

* FIX keys

* Add langfuse evals support

* FIX trace upload

* Delete .claude/settings.local.json

Signed-off-by: soky srm <sokysrm@gmail.com>

* Update client.rb

* Small LLMs improvements

* Keep batch size normal

* Update categorizer

* FIX json mode

* Add reasonable alternative to matching

* FIX thinking blocks for llms

* Implement json mode support with AUTO mode

* Make auto default for everyone

* FIX linter

* Address review

* Allow export manual categories

* FIX user export

* FIX oneshot example pollution

* Update categorization_golden_v1.yml

* Update categorization_golden_v1.yml

* Trim to 100 items

* Update auto_categorizer.rb

* FIX for auto retry in auto mode

* Separate the Eval Logic from the Auto-Categorizer

The expected_null_count parameter conflates eval-specific logic with production categorization logic.

* Force json mode on evals

* Introduce a more mixed dataset

150 items, performance from a local model:

By Difficulty:
  easy: 93.22% accuracy (55/59)
  medium: 93.33% accuracy (42/45)
  hard: 92.86% accuracy (26/28)
  edge_case: 100.0% accuracy (18/18)

* Improve datasets

Remove Data leakage from prompts

* Create eval runs as "pending"

---------

Signed-off-by: soky srm <sokysrm@gmail.com>
Signed-off-by: Juan José Mata <juanjo.mata@gmail.com>
Co-authored-by: Juan José Mata <juanjo.mata@gmail.com>
2025-12-07 18:11:34 +01:00

71 lines
2.1 KiB
Ruby

class Eval::Result < ApplicationRecord
self.table_name = "eval_results"
belongs_to :run, class_name: "Eval::Run", foreign_key: :eval_run_id
belongs_to :sample, class_name: "Eval::Sample", foreign_key: :eval_sample_id
validates :actual_output, presence: true
validates :correct, inclusion: { in: [ true, false ] }
scope :correct, -> { where(correct: true) }
scope :incorrect, -> { where(correct: false) }
scope :with_nulls_returned, -> { where(null_returned: true) }
scope :with_nulls_expected, -> { where(null_expected: true) }
scope :exact_matches, -> { where(exact_match: true) }
scope :hierarchical_matches, -> { where(hierarchical_match: true) }
# Get actual category (for categorization results)
def actual_category_name
actual_output.dig("category_name") || actual_output["category_name"]
end
# Get actual merchant info (for merchant detection results)
def actual_business_name
actual_output.dig("business_name") || actual_output["business_name"]
end
def actual_business_url
actual_output.dig("business_url") || actual_output["business_url"]
end
# Get actual functions called (for chat results)
def actual_functions
actual_output.dig("functions") || actual_output["functions"] || []
end
# Get actual response text (for chat results)
def actual_response_text
actual_output.dig("response_text") || actual_output["response_text"]
end
# Summary for display
def summary
{
sample_id: sample_id,
correct: correct,
exact_match: exact_match,
expected: sample.expected_output,
actual: actual_output,
latency_ms: latency_ms,
cost: cost&.to_f
}
end
# Detailed comparison with expected
def detailed_comparison
{
sample_difficulty: sample.difficulty,
sample_tags: sample.tags,
input: sample.input_data,
expected: sample.expected_output,
actual: actual_output,
correct: correct,
exact_match: exact_match,
hierarchical_match: hierarchical_match,
null_expected: null_expected,
null_returned: null_returned,
fuzzy_score: fuzzy_score
}
end
end