sure/app/models/eval/runners/base.rb
soky srm 88952e4714 Small llms improvements (#400)
* Initial implementation

* FIX keys

* Add langfuse evals support

* FIX trace upload

* Delete .claude/settings.local.json

Signed-off-by: soky srm <sokysrm@gmail.com>

* Update client.rb

* Small LLMs improvements

* Keep batch size normal

* Update categorizer

* FIX json mode

* Add reasonable alternative to matching

* FIX thinking blocks for llms

* Implement json mode support with AUTO mode

* Make auto default for everyone

* FIX linter

* Address review

* Allow export manual categories

* FIX user export

* FIX oneshot example pollution

* Update categorization_golden_v1.yml

* Update categorization_golden_v1.yml

* Trim to 100 items

* Update auto_categorizer.rb

* FIX for auto retry in auto mode

* Separate the Eval Logic from the Auto-Categorizer

The expected_null_count parameter conflates eval-specific logic with production categorization logic.

* Force json mode on evals

* Introduce a more mixed dataset

150 items, performance from a local model:

By Difficulty:
  easy: 93.22% accuracy (55/59)
  medium: 93.33% accuracy (42/45)
  hard: 92.86% accuracy (26/28)
  edge_case: 100.0% accuracy (18/18)
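
The per-difficulty counts above combine to an overall accuracy of 141/150 = 94%. A quick sanity check (counts taken from the table; the snippet itself is only illustrative):

```ruby
# Per-difficulty results from the eval run above, as [correct, total] pairs.
results = {
  easy:      [55, 59],
  medium:    [42, 45],
  hard:      [26, 28],
  edge_case: [18, 18]
}

correct = results.values.sum { |c, _| c }
total   = results.values.sum { |_, t| t }
overall = (correct.to_f / total * 100).round(2)

puts "Overall: #{overall}% (#{correct}/#{total})"
# → Overall: 94.0% (141/150)
```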

* Improve datasets

Remove Data leakage from prompts

* Create eval runs as "pending"

---------

Signed-off-by: soky srm <sokysrm@gmail.com>
Signed-off-by: Juan José Mata <juanjo.mata@gmail.com>
Co-authored-by: Juan José Mata <juanjo.mata@gmail.com>
2025-12-07 18:11:34 +01:00

83 lines
1.7 KiB
Ruby

class Eval::Runners::Base
  attr_reader :eval_run

  def initialize(eval_run)
    @eval_run = eval_run
  end

  # Template method: marks the run as started, delegates the work to the
  # subclass hooks, then records completion (or failure) on the eval_run.
  def run
    eval_run.start!

    begin
      process_samples
      metrics = calculate_metrics
      eval_run.complete!(metrics)
    rescue => e
      eval_run.fail!(e)
      raise
    end

    eval_run
  end

  protected

    def process_samples
      raise NotImplementedError, "Subclasses must implement #process_samples"
    end

    def calculate_metrics
      raise NotImplementedError, "Subclasses must implement #calculate_metrics"
    end

    def samples
      eval_run.dataset.samples
    end

    def provider
      @provider ||= build_provider
    end

    def model
      eval_run.model
    end

  private

    def build_provider
      case eval_run.provider
      when "openai"
        build_openai_provider
      else
        raise "Unsupported provider: #{eval_run.provider}"
      end
    end

    # Resolves credentials in precedence order: per-run provider_config,
    # then environment variable, then the app-wide Setting.
    def build_openai_provider
      access_token = eval_run.provider_config["access_token"].presence ||
        ENV["OPENAI_ACCESS_TOKEN"].presence ||
        Setting.openai_access_token

      raise "OpenAI access token not configured" unless access_token.present?

      uri_base = eval_run.provider_config["uri_base"].presence ||
        ENV["OPENAI_URI_BASE"].presence ||
        Setting.openai_uri_base

      Provider::Openai.new(access_token, uri_base: uri_base, model: model)
    end

    def record_result(sample:, actual_output:, correct:, **attributes)
      eval_run.results.create!(
        sample: sample,
        actual_output: actual_output,
        correct: correct,
        **attributes
      )
    end

    def log_progress(message)
      Rails.logger.info("[Eval::Runner] #{message}")
    end
end
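
Subclasses supply `#process_samples` and `#calculate_metrics`; `#run` wraps them in the lifecycle calls (`start!`, then `complete!` or `fail!`). A minimal sketch of that contract, using hypothetical stand-ins (`StubEvalRun`, `BaseRunner`, and `ExactMatchRunner` are illustrative names, not classes from the codebase, and the real `EvalRun` is an ActiveRecord model):

```ruby
# Hypothetical stand-in for the EvalRun model, tracking lifecycle calls.
class StubEvalRun
  attr_reader :status, :metrics

  def initialize
    @status = :pending
  end

  def start!
    @status = :running
  end

  def complete!(metrics)
    @metrics = metrics
    @status = :completed
  end

  def fail!(_error)
    @status = :failed
  end
end

# Trimmed copy of the Eval::Runners::Base#run template method, for illustration.
class BaseRunner
  attr_reader :eval_run

  def initialize(eval_run)
    @eval_run = eval_run
  end

  def run
    eval_run.start!
    begin
      process_samples
      eval_run.complete!(calculate_metrics)
    rescue => e
      eval_run.fail!(e)
      raise
    end
    eval_run
  end
end

# Hypothetical runner: scores samples by exact match on the category name.
class ExactMatchRunner < BaseRunner
  SAMPLES = [
    { expected: "Groceries", actual: "Groceries" },
    { expected: "Rent",      actual: "Utilities" }
  ]

  def process_samples
    @correct = SAMPLES.count { |s| s[:expected] == s[:actual] }
  end

  def calculate_metrics
    { accuracy: @correct.to_f / SAMPLES.size }
  end
end

run = ExactMatchRunner.new(StubEvalRun.new).run
puts run.status              # → completed
puts run.metrics[:accuracy]  # → 0.5
```

Because the base class re-raises after `fail!`, a crashing runner still marks the eval run as failed while letting the caller (or job backend) see the original exception.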