Commit Graph

3 Commits

Author SHA1 Message Date
Guillem Arias Fauste
991cb959c1 feat(ai): Anthropic batch ops + LLM cost ledger (2/5) (#1984)
* feat(ai): add Anthropic provider with chat parity (1/5)

Introduces Provider::Anthropic alongside Provider::Openai, implementing
the LlmConcept chat_response contract over the official anthropic Ruby
SDK. Batch ops, PDF, and RAG land in follow-up PRs.

- Provider::Anthropic uses Messages API for sync and streaming responses
- ChatConfig builds requests with ephemeral prompt-cache markers on the
  system prompt and the last tool definition
- MessageFormatter reconstructs multi-turn history (text + tool_use +
  tool_result blocks) from raw Message records, including the paired
  user-role tool_result turn Anthropic requires after every tool_use
- ChatParser maps Anthropic Message into the shared ChatResponse Data
- Registry, Setting, User, Chat default model wired for ANTHROPIC_*
  envs and Setting.anthropic_*; LLM_PROVIDER selects between providers
- Responder forwards raw conversation_history (Array<Message>) so
  providers without hosted conversation state can rebuild context
- OpenAI provider accepts and ignores the new kwarg (no behavior change)

Tests cover provider init, model gating, MessageFormatter for all turn
shapes, ChatConfig request building (max_tokens, system cache, tool
conversion), ChatParser for text / tool_use / mixed blocks, Registry
discovery, and mocked chat_response success / error / function_request
paths. Live VCR cassettes recorded in a follow-up with a real key.

Stacked PRs: 2/5 batch ops + cost ledger, 3/5 PDF, 4/5 pgvector RAG,
5/5 settings UI + disclosure.

* fix(ai): address PR review on Anthropic provider foundation

Surface fixes raised by Codex + CodeRabbit on PR 1/5:

- Provider::Anthropic#chat_response now accepts (and ignores) a
  `messages:` kwarg. Assistant::Responder passes both `messages:`
  (OpenAI-shape) and `conversation_history:` (raw Message records) for
  cross-provider parity, so the previous signature raised
  ArgumentError on the first chat turn through the Anthropic provider.
- Provider::Anthropic#supports_model? bypasses the `claude` prefix
  gate when a custom base_url is configured, mirroring the OpenAI
  provider. Bedrock-shaped IDs like
  `anthropic.claude-sonnet-4-5-20250929-v1:0` and
  `claude-opus-4@20250514` are otherwise rejected by
  Assistant::Provided#get_model_provider and the chat dies.
- Setting.anthropic_access_token is now in
  EncryptedSettingFields::ENCRYPTED_FIELDS so the Anthropic API key
  is encrypted at rest like every other provider secret. Previously
  plaintext while siblings (openai_access_token, twelve_data_api_key,
  external_assistant_token) were ciphertext.
- Chat.default_model falls back to whichever provider is actually
  configured. Previously, with LLM_PROVIDER=anthropic but no
  Anthropic credentials, the default model resolved to a Claude ID
  that no registered provider supported, so chats failed even when
  OpenAI was fully configured. Adds Provider::{Anthropic,Openai}#configured?
  class methods for the readable callsite.
- Provider::Anthropic.effective_model uses
  `ENV["ANTHROPIC_MODEL"].presence || Setting.anthropic_model` so the
  Setting lookup is only performed when the env var is absent — the
  previous `ENV.fetch(KEY, default)` evaluated the default arg
  eagerly on every call.
- Provider::Anthropic::ChatConfig#anthropic_input_schema strips both
  `:strict` and `"strict"` keys so JSON-decoded schemas with string
  keys cannot leak the OpenAI-only flag through to Anthropic.

Test coverage added: supports_model? bypass on custom endpoints,
chat_response messages: kwarg compatibility, default_model fallback
in the three credential combinations, configured? against ENV +
Setting, strict-flag stripping for both key types, and a
`Setting.expects(:anthropic_model).never` assertion proving the
ENV-precedence test now exercises the lazy path.

All 4365 tests pass (1 pre-existing libvips env error unrelated).

* test(chat): make default_model tests resilient to ENV model overrides

CodeRabbit flagged on PR review: the new default_model tests asserted
against Provider::*::DEFAULT_MODEL, but Chat.default_model actually
returns Provider::*.effective_model.presence (which reads
OPENAI_MODEL / ANTHROPIC_MODEL from the environment). With either env
var set, the tests would fail intermittently even though routing was
correct.

- New default_model tests now assert against the provider's
  effective_model directly, so they verify the routing decision
  (which provider's value wins) without coupling to the constant.
- Pre-existing "creates with default model" assertions had the same
  brittleness; switch them to compare against Chat.default_model so
  the chosen model is whatever the env / Setting cascade resolves to.

Verified by running `ANTHROPIC_MODEL=claude-haiku-4-5 OPENAI_MODEL=gpt-4o
bin/rails test test/models/chat_test.rb` — 16 runs, 0 failures
(previously 2 pre-existing failures + 0 from the new tests).

* fix(ai): address local review on Anthropic foundation

- Provider::Anthropic#supports_pdf_processing? bypasses prefix gate for
  custom endpoints, mirroring supports_model?
- Provider::Anthropic#initialize raises Error when custom_endpoint? AND
  model.blank?, parity with Provider::Openai
- stream_chat_response captures partial usage on mid-stream errors and
  records it via the new on_partial callback so chat_response can skip
  the duplicate error row in the outer rescue
- safe_accumulated_message swallows the secondary failure when the SDK
  cannot reconstruct a snapshot
- langfuse_client memoizes properly (||= instead of =) so repeated calls
  don't churn Langfuse instances
- MessageFormatter sorts tool_calls by created_at then id so the
  message array is deterministic across replays; skips tool_calls
  missing both provider_call_id and provider_id rather than sending
  `id: nil` and getting rejected by Anthropic
- Setting.anthropic_access_token default falls back through
  ENV["ANTHROPIC_API_KEY"].presence (was missing .presence, so an
  empty-string env value bled through)
- User#openai_configured? / #anthropic_configured? delegate to the
  Provider::* class methods — single source of truth
- Assistant::Responder renames the OpenAI-shape history builder
  conversation_history → openai_messages_payload so the kwarg name
  matches the local method name (messages: openai_messages_payload,
  conversation_history: chat_message_records)
- Assistant::Builtin stale-history comment updated to reference both
  builders

Adds a streaming chat_response test using ad-hoc subclasses of the
SDK event types so the case/when dispatch matches via is_a? without
stubbing class-level === behavior.

* test(ai): add Anthropic tool_use round-trip + multi-tool turn coverage

Addresses @jjmata's "worth confirming" note on PR #1983: tool-use turns
from prior assistant messages must round-trip correctly when retrieved
from the database.

- New `ChatParser → ToolCall::Function → MessageFormatter` test walks
  the full path: Anthropic response with a tool_use block →
  ChatFunctionRequest → ToolCall::Function.from_function_request →
  persisted on the AssistantMessage → MessageFormatter rebuild on the
  next turn. Asserts the original `tool_use.id` is preserved end-to-end
  as both `tool_use.id` and the paired `tool_result.tool_use_id`, and
  that the original `input` hash and serialized result content survive.
- New multi-tool assistant turn test confirms two tool_use blocks on a
  single assistant message render as two tool_use blocks followed by
  two paired tool_result blocks in a single user-role follow-up,
  matching Anthropic's required alternation.

Both tests exercise the existing PR1 code without behavior changes.

* test(ai): require "ostruct" explicitly in Anthropic provider tests

OpenStruct is moving out of Ruby's default load path (warning in 3.4+,
removed in 3.5+). Tests work today because ActiveSupport transitively
loads it, but that's incidental. Match the existing convention in
test/controllers/settings/hostings_controller_test.rb which explicitly
requires ostruct for the same reason.

* fix(ai): sanitize Langfuse warn logs, normalize tool_use.input, dedup history fetch

Addresses three open CodeRabbit findings on PR #1983.

- Provider::Anthropic Langfuse rescue branches no longer include
  `e.full_message` in `Rails.logger.warn`. `full_message` bundles the
  backtrace + cause chain and on some SDK error types includes the
  serialized request/response payload (prompt, model output). Logs
  now report `#{e.class}: #{e.message}` only. Three sites:
  create_langfuse_trace, log_langfuse_generation, upsert_langfuse_trace.
  Note: Provider::Openai has the same pattern (copy-pasted source) —
  harmonization deferred to a follow-up cleanup PR; this commit fixes
  only the Anthropic provider to keep PR scope tight.

- MessageFormatter#parse_arguments now coerces any non-Hash parsed
  result to `{}`. Anthropic's Messages API requires `tool_use.input`
  to be a JSON object (map); a stored ToolCall::Function record whose
  arguments parse to a scalar, bool, or array (corrupt row, legacy
  data, cross-provider bleed) would otherwise produce a payload the
  API rejects. Normal flow stores Hash arguments end-to-end so the
  fix is defensive — adds 2 tests covering scalar/array JSON strings
  and non-String non-Hash inputs.

- Assistant::Responder dedups the chat-history fetch. The previous
  layout fired two near-identical `chat.messages.where(...).includes(
  :tool_calls).ordered` queries per LLM turn (one for the OpenAI-shape
  payload, one for the raw-records kwarg). A new memoized
  `complete_chat_messages` fetches once; `chat_message_records` filters
  out the current message via `Array#reject`, `openai_messages_payload`
  iterates the cached array unchanged. One SQL query per turn instead
  of two. Memoization scope = single Responder instance (per LLM call),
  so cache invalidation is not a concern.

All 4370 tests pass (1 pre-existing libvips env error unrelated).
Rubocop + brakeman clean.

* fix(ci): replace sk-ant- prefixed test placeholders

Pipelock secret scanner pattern-matches `sk-ant-*` as a real Anthropic
API key and fails the PR security-scan check. Test stubs and
ClimateControl env values used `sk-ant-test`, `sk-ant-from-setting`,
`sk-ant-x`, `sk-ant-y` as obvious placeholders, but the scanner does
not care about value entropy.

Switched to `fake-anthropic-key-*` / `fake-token-*` strings so the
scanner stops flagging them. No production code touched, no behavior
change — Provider::Anthropic still accepts any non-blank token.

* feat(ai): add Anthropic batch ops + LLM cost ledger (2/5)

Implements auto_categorize, auto_detect_merchants, and
enhance_provider_merchants on Provider::Anthropic via forced tool calls,
plus the cost-ledger plumbing they need.

- Provider::Anthropic::AutoCategorizer, AutoMerchantDetector,
  ProviderMerchantEnhancer each define a single output tool whose
  input_schema mirrors the desired output, then force the model to call
  it via tool_choice: { type: "tool", name: ..., disable_parallel_tool_use: true }.
  Anthropic guarantees the tool_use.input matches the schema, so there
  is no JSON parsing fragility, no <think> tag stripping, and no
  json_object/json_schema fallback ladders.
- Concerns::UsageRecorder mirrors the OpenAI sibling but persists
  cache_creation_input_tokens / cache_read_input_tokens to dedicated
  columns instead of metadata.
- Migration adds cache_creation_tokens, cache_read_tokens (nullable
  integers) to llm_usages. OpenAI rows leave them null.
- LlmUsage::PRICING gains Claude 4.x rows (opus-4-7 $15/$75, sonnet-4-6
  $3/$15, haiku-4-5 $1/$5 per MTok). infer_provider returns "anthropic"
  for claude-* via the existing exact/prefix lookup.
- Provider::Anthropic#chat_response now persists cache columns directly
  rather than stashing them in metadata.
- 25-transaction batch cap mirrors the OpenAI provider so the cost
  ledger sees the same shape regardless of which provider ran a batch.

Tests cover the forced-tool-call path, null/None normalization,
case-insensitive merchant matching, the missing-tool_use error path,
and Anthropic-specific pricing + provider inference on LlmUsage.

Stacked on #1983 (PR 1/5). 3/5 PDF + vision next.

* fix(ai): attribute Bedrock model IDs to anthropic + clean nil enum

- LlmUsage.infer_provider now returns "anthropic" for Bedrock /
  Vertex shaped IDs (anthropic.* and anthropic/*), so cost-ledger
  filtering by provider stays correct even when no per-MTok rate is
  stored. Previously these IDs fell through to the "openai" default.
- AutoCategorizer drops the redundant nil sentinel from the
  category_name enum — the union type [string, null] already permits
  null, and some JSON Schema validators reject nil literals inside
  enum arrays.

* test(ai): require "ostruct" in Anthropic batch op tests

Same rationale as the PR1 ostruct fix — explicit require so the tests
don't depend on ActiveSupport's transitive load when Ruby 3.5+ removes
OpenStruct from the default load path.

* fix(llm-usage): include Anthropic cache tokens in estimated_cost

calculate_cost only priced prompt + completion tokens, so estimated_cost
under-reported every cached call — the cache_creation/cache_read columns this PR
added were tracked but never billed. Verified against the Anthropic dashboard: a
cached chat turn billed $0.05 but the ledger recorded $0.038; the gap was exactly
the unpriced cache tokens.

Price them relative to the input rate (Anthropic: cache write 1.25x, read 0.1x)
and thread the cache counts from both recorders (chat + batch). OpenAI rows leave
the columns null (treated as 0), so they're unaffected. Ledger now reproduces the
dashboard ($0.054 for the test turn).

* chore(ai): guard chat usage double-record; flag deferred Anthropic batch wiring

- Hardening: guard the success-path record_llm_usage with
  `unless partial_usage_recorded` so a future change that emits partial usage on
  a normal stream can't silently double-bill (the symptom investigated in the
  #1984 review). No behavior change today — on_partial only fires from the
  mid-stream-error rescue, which re-raises past this line.
- Notice: the family auto-categorize / merchant-detect / merchant-enhance flows
  still hardcode get_provider(:openai). Provider::Anthropic now implements those
  batch ops but they aren't wired into the family flows yet — documented with
  TODOs at each site for the follow-up.

* chore(ai): point family-flow TODOs at tracking issue #2113

* fix(ai): address review findings on cost ledger + categorizer schema

Three AI-review findings on #1984:

- category_name enum omitted null (codex + coderabbit): the prompt + type allow
  Claude to abstain on uncertain transactions, but JSON Schema `enum` restricted
  the value to category names, so null was invalid — forcing miscategorization.
  Append nil to the enum (the consumer already normalizes null -> uncategorized).
- Cache pricing applied to all providers (coderabbit): the 1.25x/0.1x cache
  multipliers are Anthropic-specific. Gate them on provider == "anthropic" so a
  non-Anthropic caller passing cache counts isn't billed with the wrong rates.
- Negative cache-token counts (coderabbit): add DB check constraints
  (cache_*_tokens IS NULL OR >= 0), per the repo's DB-level-validation convention.

Tests: enum includes nil; non-Anthropic cache tokens aren't priced.
2026-06-01 22:00:48 +02:00
Juan José Mata
f491916411 Track failed LLM API calls in llm_usages table (#360)
* Track failed LLM API calls in llm_usages table

This commit adds comprehensive error tracking for failed LLM API calls:

- Updated LlmUsage model with helper methods to identify failed calls
  and retrieve error details (failed?, http_status_code, error_message)

- Modified Provider::Openai to record failed API calls with error metadata
  including HTTP status codes and error messages in both native and
  generic chat response methods

- Enhanced UsageRecorder concern with record_usage_error method to support
  error tracking for auto-categorization and auto-merchant detection

- Updated LLM usage UI to display failed calls with:
  - Red background highlighting for failed rows
  - Error indicator icon with "Failed" label
  - Interactive tooltip on hover showing error message and HTTP status code

Failed calls are now tracked with zero tokens and null cost, storing
error details in the metadata JSONB column for visibility and debugging.

* Dark mode fixes

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-11-22 02:15:20 +01:00
soky srm
bb364fab38 LLM cost estimation (#223)
* Password reset back button also after confirmation

Signed-off-by: Juan José Mata <juanjo.mata@gmail.com>

* Implement a filter for category (#215)

- Also implement an is empty/is null condition.

* Implement an LLM cost estimation page

Track costs across all the cost categories: auto categorization, auto merchant detection and chat.
Show warning with estimated cost when running a rule that contains AI.

* Update pricing

* Add google pricing

and fix inferred model everywhere.

* Update app/models/llm_usage.rb

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: soky srm <sokysrm@gmail.com>

* FIX address review

* Linter

* Address review

- Lowered log level
- extracted the duplicated record_usage method into a shared concern

* Update app/controllers/settings/llm_usages_controller.rb

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: soky srm <sokysrm@gmail.com>

* Moved attr_reader out of private

---------

Signed-off-by: Juan José Mata <juanjo.mata@gmail.com>
Signed-off-by: soky srm <sokysrm@gmail.com>
Co-authored-by: Juan José Mata <juanjo.mata@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-10-24 00:08:59 +02:00