* Initial implementation
* FIX keys
* Add langfuse evals support
* FIX trace upload
* Delete .claude/settings.local.json
Signed-off-by: soky srm <sokysrm@gmail.com>
* Update client.rb
* Small LLMs improvements
* Keep batch size normal
* Update categorizer
* FIX json mode
* Add reasonable alternative to matching
* FIX thinking blocks for llms
* Implement json mode support with AUTO mode
* Make auto default for everyone
* FIX linter
* Address review
* Allow export manual categories
* FIX user export
* FIX oneshot example pollution
* Update categorization_golden_v1.yml
* Update categorization_golden_v1.yml
* Trim to 100 items
* Update auto_categorizer.rb
* FIX for auto retry in auto mode
* Separate the Eval Logic from the Auto-Categorizer
The expected_null_count parameter conflates eval-specific logic with production categorization logic.
* Force json mode on evals
* Introduce a more mixed dataset
150 items, performance from a local model:
By Difficulty:
easy: 93.22% accuracy (55/59)
medium: 93.33% accuracy (42/45)
hard: 92.86% accuracy (26/28)
edge_case: 100.0% accuracy (18/18)
* Improve datasets
Remove Data leakage from prompts
* Create eval runs as "pending"
---------
Signed-off-by: soky srm <sokysrm@gmail.com>
Signed-off-by: Juan José Mata <juanjo.mata@gmail.com>
Co-authored-by: Juan José Mata <juanjo.mata@gmail.com>