From 8ccc434b3d16bcf03bcabe12b8c6766ebb380259 Mon Sep 17 00:00:00 2001
From: Guillem Arias Fauste
Date: Wed, 3 Jun 2026 11:30:51 +0200
Subject: [PATCH] feat(ai): Anthropic native PDF processing (3/5) (#1985)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* feat(ai): add Anthropic provider with chat parity (1/5)

Introduces Provider::Anthropic alongside Provider::Openai, implementing
the LlmConcept chat_response contract over the official anthropic Ruby
SDK. Batch ops, PDF, and RAG land in follow-up PRs.

- Provider::Anthropic uses Messages API for sync and streaming responses
- ChatConfig builds requests with ephemeral prompt-cache markers on the
  system prompt and the last tool definition
- MessageFormatter reconstructs multi-turn history (text + tool_use +
  tool_result blocks) from raw Message records, including the paired
  user-role tool_result turn Anthropic requires after every tool_use
- ChatParser maps Anthropic Message into the shared ChatResponse Data
- Registry, Setting, User, Chat default model wired for ANTHROPIC_* envs
  and Setting.anthropic_*; LLM_PROVIDER selects between providers
- Responder forwards raw conversation_history (Array) so providers
  without hosted conversation state can rebuild context
- OpenAI provider accepts and ignores the new kwarg (no behavior change)

Tests cover provider init, model gating, MessageFormatter for all turn
shapes, ChatConfig request building (max_tokens, system cache, tool
conversion), ChatParser for text / tool_use / mixed blocks, Registry
discovery, and mocked chat_response success / error / function_request
paths. Live VCR cassettes recorded in a follow-up with a real key.

Stacked PRs: 2/5 batch ops + cost ledger, 3/5 PDF, 4/5 pgvector RAG,
5/5 settings UI + disclosure.

* fix(ai): address PR review on Anthropic provider foundation

Surface fixes raised by Codex + CodeRabbit on PR 1/5:

- Provider::Anthropic#chat_response now accepts (and ignores) a
  `messages:` kwarg.
  Assistant::Responder passes both `messages:` (OpenAI-shape) and
  `conversation_history:` (raw Message records) for cross-provider
  parity, so the previous signature raised ArgumentError on the first
  chat turn through the Anthropic provider.
- Provider::Anthropic#supports_model? bypasses the `claude` prefix gate
  when a custom base_url is configured, mirroring the OpenAI provider.
  Bedrock-shaped IDs like `anthropic.claude-sonnet-4-5-20250929-v1:0`
  and `claude-opus-4@20250514` are otherwise rejected by
  Assistant::Provided#get_model_provider and the chat dies.
- Setting.anthropic_access_token is now in
  EncryptedSettingFields::ENCRYPTED_FIELDS so the Anthropic API key is
  encrypted at rest like every other provider secret. Previously
  plaintext while siblings (openai_access_token, twelve_data_api_key,
  external_assistant_token) were ciphertext.
- Chat.default_model falls back to whichever provider is actually
  configured. Previously, with LLM_PROVIDER=anthropic but no Anthropic
  credentials, the default model resolved to a Claude ID that no
  registered provider supported, so chats failed even when OpenAI was
  fully configured. Adds Provider::{Anthropic,Openai}#configured? class
  methods for the readable callsite.
- Provider::Anthropic.effective_model uses
  `ENV["ANTHROPIC_MODEL"].presence || Setting.anthropic_model` so the
  Setting lookup is only performed when the env var is absent; the
  previous `ENV.fetch(KEY, default)` evaluated the default arg eagerly
  on every call.
- Provider::Anthropic::ChatConfig#anthropic_input_schema strips both
  `:strict` and `"strict"` keys so JSON-decoded schemas with string keys
  cannot leak the OpenAI-only flag through to Anthropic.

Test coverage added: supports_model? bypass on custom endpoints,
chat_response messages: kwarg compatibility, default_model fallback in
the three credential combinations, configured?
against ENV + Setting, strict-flag stripping for both key types, and a
`Setting.expects(:anthropic_model).never` assertion proving the
ENV-precedence test now exercises the lazy path.

All 4365 tests pass (1 pre-existing libvips env error unrelated).

* test(chat): make default_model tests resilient to ENV model overrides

CodeRabbit flagged on PR review: the new default_model tests asserted
against Provider::*::DEFAULT_MODEL, but Chat.default_model actually
returns Provider::*.effective_model.presence (which reads OPENAI_MODEL /
ANTHROPIC_MODEL from the environment). With either env var set, the
tests would fail intermittently even though routing was correct.

- New default_model tests now assert against the provider's
  effective_model directly, so they verify the routing decision (which
  provider's value wins) without coupling to the constant.
- Pre-existing "creates with default model" assertions had the same
  brittleness; switch them to compare against Chat.default_model so the
  chosen model is whatever the env / Setting cascade resolves to.

Verified by running
`ANTHROPIC_MODEL=claude-haiku-4-5 OPENAI_MODEL=gpt-4o bin/rails test test/models/chat_test.rb`:
16 runs, 0 failures (previously 2 pre-existing failures + 0 from the new
tests).

* fix(ai): address local review on Anthropic foundation

- Provider::Anthropic#supports_pdf_processing? bypasses prefix gate for
  custom endpoints, mirroring supports_model?
- Provider::Anthropic#initialize raises Error when custom_endpoint?
  AND model.blank?, parity with Provider::Openai
- stream_chat_response captures partial usage on mid-stream errors and
  records it via the new on_partial callback so chat_response can skip
  the duplicate error row in the outer rescue
- safe_accumulated_message swallows the secondary failure when the SDK
  cannot reconstruct a snapshot
- langfuse_client memoizes properly (||= instead of =) so repeated calls
  don't churn Langfuse instances
- MessageFormatter sorts tool_calls by created_at then id so the message
  array is deterministic across replays; skips tool_calls missing both
  provider_call_id and provider_id rather than sending `id: nil` and
  getting rejected by Anthropic
- Setting.anthropic_access_token default falls back through
  ENV["ANTHROPIC_API_KEY"].presence (was missing .presence, so an
  empty-string env value bled through)
- User#openai_configured? / #anthropic_configured? delegate to the
  Provider::* class methods (single source of truth)
- Assistant::Responder renames the OpenAI-shape history builder
  conversation_history → openai_messages_payload so the kwarg name
  matches the local method name (messages: openai_messages_payload,
  conversation_history: chat_message_records)
- Assistant::Builtin stale-history comment updated to reference both
  builders

Adds a streaming chat_response test using ad-hoc subclasses of the SDK
event types so the case/when dispatch matches via is_a? without stubbing
class-level === behavior.

* test(ai): add Anthropic tool_use round-trip + multi-tool turn coverage

Addresses @jjmata's "worth confirming" note on PR #1983: tool-use turns
from prior assistant messages must round-trip correctly when retrieved
from the database.

- New `ChatParser → ToolCall::Function → MessageFormatter` test walks
  the full path: Anthropic response with a tool_use block →
  ChatFunctionRequest → ToolCall::Function.from_function_request →
  persisted on the AssistantMessage → MessageFormatter rebuild on the
  next turn.
  Asserts the original `tool_use.id` is preserved end-to-end as both
  `tool_use.id` and the paired `tool_result.tool_use_id`, and that the
  original `input` hash and serialized result content survive.
- New multi-tool assistant turn test confirms two tool_use blocks on a
  single assistant message render as two tool_use blocks followed by two
  paired tool_result blocks in a single user-role follow-up, matching
  Anthropic's required alternation.

Both tests exercise the existing PR1 code without behavior changes.

* test(ai): require "ostruct" explicitly in Anthropic provider tests

OpenStruct is moving out of Ruby's default load path (warning in 3.4+,
removed in 3.5+). Tests work today because ActiveSupport transitively
loads it, but that's incidental. Match the existing convention in
test/controllers/settings/hostings_controller_test.rb which explicitly
requires ostruct for the same reason.

* fix(ai): sanitize Langfuse warn logs, normalize tool_use.input, dedup history fetch

Addresses three open CodeRabbit findings on PR #1983.

- Provider::Anthropic Langfuse rescue branches no longer include
  `e.full_message` in `Rails.logger.warn`. `full_message` bundles the
  backtrace + cause chain and on some SDK error types includes the
  serialized request/response payload (prompt, model output). Logs now
  report `#{e.class}: #{e.message}` only. Three sites:
  create_langfuse_trace, log_langfuse_generation, upsert_langfuse_trace.
  Note: Provider::Openai has the same pattern (copy-pasted source).
  Harmonization is deferred to a follow-up cleanup PR; this commit fixes
  only the Anthropic provider to keep PR scope tight.
- MessageFormatter#parse_arguments now coerces any non-Hash parsed
  result to `{}`. Anthropic's Messages API requires `tool_use.input` to
  be a JSON object (map); a stored ToolCall::Function record whose
  arguments parse to a scalar, bool, or array (corrupt row, legacy data,
  cross-provider bleed) would otherwise produce a payload the API
  rejects.
  Normal flow stores Hash arguments end-to-end so the fix is defensive.
  Adds 2 tests covering scalar/array JSON strings and non-String
  non-Hash inputs.
- Assistant::Responder dedups the chat-history fetch. The previous
  layout fired two near-identical
  `chat.messages.where(...).includes(:tool_calls).ordered` queries per
  LLM turn (one for the OpenAI-shape payload, one for the raw-records
  kwarg). A new memoized `complete_chat_messages` fetches once;
  `chat_message_records` filters out the current message via
  `Array#reject`, `openai_messages_payload` iterates the cached array
  unchanged. One SQL query per turn instead of two. Memoization scope =
  single Responder instance (per LLM call), so cache invalidation is not
  a concern.

All 4370 tests pass (1 pre-existing libvips env error unrelated).
Rubocop + brakeman clean.

* fix(ci): replace sk-ant- prefixed test placeholders

Pipelock secret scanner pattern-matches `sk-ant-*` as a real Anthropic
API key and fails the PR security-scan check. Test stubs and
ClimateControl env values used `sk-ant-test`, `sk-ant-from-setting`,
`sk-ant-x`, `sk-ant-y` as obvious placeholders, but the scanner does not
care about value entropy.

Switched to `fake-anthropic-key-*` / `fake-token-*` strings so the
scanner stops flagging them. No production code touched, no behavior
change: Provider::Anthropic still accepts any non-blank token.

* feat(ai): add Anthropic batch ops + LLM cost ledger (2/5)

Implements auto_categorize, auto_detect_merchants, and
enhance_provider_merchants on Provider::Anthropic via forced tool calls,
plus the cost-ledger plumbing they need.

- Provider::Anthropic::AutoCategorizer, AutoMerchantDetector,
  ProviderMerchantEnhancer each define a single output tool whose
  input_schema mirrors the desired output, then force the model to call
  it via
  tool_choice: { type: "tool", name: ..., disable_parallel_tool_use: true }.
  With the tool forced, the reply is a tool_use block whose input
  follows the schema, so there is no JSON parsing fragility, no tag
  stripping, and no json_object/json_schema fallback ladders.
- Concerns::UsageRecorder mirrors the OpenAI sibling but persists
  cache_creation_input_tokens / cache_read_input_tokens to dedicated
  columns instead of metadata.
- Migration adds cache_creation_tokens, cache_read_tokens (nullable
  integers) to llm_usages. OpenAI rows leave them null.
- LlmUsage::PRICING gains Claude 4.x rows (opus-4-7 $15/$75, sonnet-4-6
  $3/$15, haiku-4-5 $1/$5 per MTok). infer_provider returns "anthropic"
  for claude-* via the existing exact/prefix lookup.
- Provider::Anthropic#chat_response now persists cache columns directly
  rather than stashing them in metadata.
- 25-transaction batch cap mirrors the OpenAI provider so the cost
  ledger sees the same shape regardless of which provider ran a batch.

Tests cover the forced-tool-call path, null/None normalization,
case-insensitive merchant matching, the missing-tool_use error path, and
Anthropic-specific pricing + provider inference on LlmUsage.

Stacked on #1983 (PR 1/5). 3/5 PDF + vision next.

* fix(ai): attribute Bedrock model IDs to anthropic + clean nil enum

- LlmUsage.infer_provider now returns "anthropic" for Bedrock / Vertex
  shaped IDs (anthropic.* and anthropic/*), so cost-ledger filtering by
  provider stays correct even when no per-MTok rate is stored.
  Previously these IDs fell through to the "openai" default.
- AutoCategorizer drops the redundant nil sentinel from the
  category_name enum: the union type [string, null] already permits
  null, and some JSON Schema validators reject nil literals inside enum
  arrays.

* test(ai): require "ostruct" in Anthropic batch op tests

Same rationale as the PR1 ostruct fix: explicit require so the tests
don't depend on ActiveSupport's transitive load when Ruby 3.5+ removes
OpenStruct from the default load path.
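
The forced-tool-call pattern from the 2/5 commit above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the provider code: the tool name `report_categories`, its schema, and both helper methods are hypothetical, and the response is modelled as hand-built hashes rather than real SDK objects.

```ruby
require "json"

# Hypothetical output tool; the real batch ops define their own names and schemas.
TOOL_NAME = "report_categories".freeze

# Builds Messages API params that force the model to call TOOL_NAME.
def forced_tool_request(model:, prompt:)
  {
    model: model,
    max_tokens: 1024,
    messages: [ { role: "user", content: prompt } ],
    tools: [ {
      name: TOOL_NAME,
      description: "Return one category per transaction.",
      input_schema: {
        type: "object",
        properties: {
          categories: { type: "array", items: { type: [ "string", "null" ] } }
        },
        required: [ "categories" ]
      }
    } ],
    # Forcing a single named tool; parallel tool use is switched off.
    tool_choice: { type: "tool", name: TOOL_NAME, disable_parallel_tool_use: true }
  }
end

# Pulls the tool input out of a response's content blocks (plain hashes here).
def extract_tool_input(content_blocks)
  tool_use = Array(content_blocks).find do |block|
    (block[:type] || block["type"]).to_s == "tool_use"
  end
  raise ArgumentError, "Model did not invoke #{TOOL_NAME}" unless tool_use

  input = tool_use[:input] || tool_use["input"]
  # Defensive: accept a JSON string as well as an already-parsed Hash.
  input.is_a?(String) ? JSON.parse(input) : input
end
```

The missing-tool_use branch is kept as an explicit error so a reply without the forced tool surfaces as a failure instead of an empty result.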
* feat(ai): Anthropic native PDF processing (3/5)

Implements process_pdf and extract_bank_statement on Provider::Anthropic
using the native `document` content block: no rasterization, no text
pre-extraction.

- Provider::Anthropic::PdfProcessor classifies the document, summarizes
  it, and extracts statement metadata via a forced
  report_document_analysis tool whose input_schema mirrors the existing
  Provider::Openai output (document_type from Import::DOCUMENT_TYPES,
  summary, extracted_data).
- Provider::Anthropic::BankStatementExtractor returns the same
  { transactions, period, account_holder, account_number, bank_name,
  opening_balance, closing_balance } shape via report_bank_statement so
  downstream pdf_import code is provider-agnostic.
- Both attach the PDF as
  { type: "document", source: { type: "base64", media_type: "application/pdf", data: } }.
  Claude 3.5+ / 4.x accept this natively (up to 32MB / 100 pages). No
  pdf-reader, no pdftoppm, no chunking for typical statements.
- supports_pdf_processing? (introduced in PR 1) already returns true for
  claude-* models, gating process_pdf with a clear error otherwise.
- Cost ledger rows are persisted via the shared UsageRecorder concern,
  including cache_creation/cache_read tokens.

Tests verify the document block shape, tool_choice forcing, normalized
document_type for unknown classifications, transaction normalization
(date / amount / reference → notes), and the missing-tool_use error
path. Blank pdf_content raises before any client call.

Stacked on #1984 (PR 2/5). 4/5 pgvector RAG next.

* fix(ai): guard PDF size + surface bank-statement truncation

- PdfProcessor and BankStatementExtractor raise upfront when
  pdf_content.bytesize exceeds MAX_PDF_BYTES (32 MB, matching
  Anthropic's hard limit). Previously a 100 MB PDF would be
  base64-encoded (~133 MB) and packed into the JSON body before the API
  rejected it, for a peak heap of ~270 MB per Sidekiq worker.
- BankStatementExtractor inspects response.stop_reason; when the model
  hit max_tokens it logs a warning and flags result[:truncated] so
  downstream callers know the transaction list may be incomplete.
- ISO date pattern added to statement_period_start/end schema in
  PdfProcessor to steer the model away from values like "March 2026";
  the regex ships as part of the tool's input_schema.

Tests cover the size guard (raises before any client.messages call),
truncated-result flagging, and the warning log path.

* test(ai): require "ostruct" in Anthropic PDF tests

Match the explicit ostruct require added in PR1/PR2, for the same Ruby
3.5+ load-path reason.

* fix(llm-usage): include Anthropic cache tokens in estimated_cost

calculate_cost only priced prompt + completion tokens, so estimated_cost
under-reported every cached call: the cache_creation/cache_read columns
this PR added were tracked but never billed. Verified against the
Anthropic dashboard: a cached chat turn billed $0.05 but the ledger
recorded $0.038; the gap was exactly the unpriced cache tokens.

Price them relative to the input rate (Anthropic: cache write 1.25x,
read 0.1x) and thread the cache counts from both recorders (chat +
batch). OpenAI rows leave the columns null (treated as 0), so they're
unaffected. Ledger now reproduces the dashboard ($0.054 for the test
turn).

* chore(ai): guard chat usage double-record; flag deferred Anthropic batch wiring

- Hardening: guard the success-path record_llm_usage with
  `unless partial_usage_recorded` so a future change that emits partial
  usage on a normal stream can't silently double-bill (the symptom
  investigated in the #1984 review). No behavior change today:
  on_partial only fires from the mid-stream-error rescue, which
  re-raises past this line.
- Notice: the family auto-categorize / merchant-detect / merchant-enhance
  flows still hardcode get_provider(:openai).
  Provider::Anthropic now implements those batch ops but they aren't
  wired into the family flows yet; documented with TODOs at each site
  for the follow-up.

* chore(ai): point family-flow TODOs at tracking issue #2113

* chore(ai): flag deferred Anthropic PDF wiring (TODO #2113)

The PDF import + bank-statement-extract flows hardcode
get_provider(:openai). Provider::Anthropic implements process_pdf /
extract_bank_statement (this PR) but they aren't wired into these paths
yet; documented with TODOs at each site. Tracked in #2113 (broadened to
cover batch ops + PDF).

* chore(anthropic-pdf): drop redundant strip_heredoc; document no-dedup

- The squiggly heredoc (<<~) already strips indentation, so the trailing
  .strip_heredoc was a no-op in both PDF extractors.
- Document why BankStatementExtractor intentionally does NOT deduplicate
  (unlike the OpenAI extractor): we send the whole PDF as one native
  document block, so there are no overlapping-chunk artifacts to dedupe,
  and deduping would wrongly merge legitimate same-day, same-amount
  transactions.

* fix(anthropic): cap PDF bytes below the base64-encoded request limit

Anthropic's 32 MB limit is on the Messages request body, and the PDF is
sent base64-encoded (~4/3 larger) alongside the JSON envelope, so a
32 MB raw PDF encodes to ~42 MB and is rejected. Cap the raw bytes at
3/4 of the request budget minus a 1 MB envelope reserve (~23 MiB).

Addresses Codex review on #1985.
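
The arithmetic behind that cap can be checked with a short Ruby sketch. The three constants use the names and values from the PdfProcessor diff in this patch; `base64_bytesize` and `fits_request_budget?` are hypothetical helpers added only for illustration.

```ruby
require "base64"

MAX_REQUEST_BYTES      = 32 * 1024 * 1024 # limit on the whole request body
REQUEST_ENVELOPE_BYTES = 1 * 1024 * 1024  # reserve for instructions + tool schema
MAX_PDF_BYTES = (MAX_REQUEST_BYTES - REQUEST_ENVELOPE_BYTES) * 3 / 4

# Hypothetical helper: padded base64 length for a given raw byte count.
# Every 3 raw bytes (rounded up) become 4 encoded bytes.
def base64_bytesize(raw_bytes)
  ((raw_bytes + 2) / 3) * 4
end

# Hypothetical helper: does the encoded PDF plus the envelope reserve
# stay within the request budget?
def fits_request_budget?(raw_bytes)
  base64_bytesize(raw_bytes) + REQUEST_ENVELOPE_BYTES <= MAX_REQUEST_BYTES
end
```

With these numbers the cap works out to 24,379,392 bytes (23.25 MiB), while a 32 MiB raw PDF would encode to roughly 42.7 MiB and overshoot the request limit.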
---
 .../function/import_bank_statement.rb         |   2 +
 app/models/pdf_import.rb                      |   4 +
 app/models/provider/anthropic.rb              |  43 +++-
 .../anthropic/bank_statement_extractor.rb     | 229 ++++++++++++++++++
 .../provider/anthropic/pdf_processor.rb       | 185 ++++++++++++++
 .../bank_statement_extractor_test.rb          | 141 +++++++++++
 .../provider/anthropic/pdf_processor_test.rb  | 126 ++++++++++
 7 files changed, 728 insertions(+), 2 deletions(-)
 create mode 100644 app/models/provider/anthropic/bank_statement_extractor.rb
 create mode 100644 app/models/provider/anthropic/pdf_processor.rb
 create mode 100644 test/models/provider/anthropic/bank_statement_extractor_test.rb
 create mode 100644 test/models/provider/anthropic/pdf_processor_test.rb

diff --git a/app/models/assistant/function/import_bank_statement.rb b/app/models/assistant/function/import_bank_statement.rb
index dee54602f..d3bf86dec 100644
--- a/app/models/assistant/function/import_bank_statement.rb
+++ b/app/models/assistant/function/import_bank_statement.rb
@@ -93,6 +93,8 @@ class Assistant::Function::ImportBankStatement < Assistant::Function
     end
 
     # Extract transactions from the PDF using provider
+    # TODO(#2113): hardcoded to OpenAI. Provider::Anthropic implements
+    # extract_bank_statement (PR #1985); this should honor Setting.llm_provider.
     provider = Provider::Registry.get_provider(:openai)
     unless provider
       return {
diff --git a/app/models/pdf_import.rb b/app/models/pdf_import.rb
index 1e1f494ef..69c74cb9e 100644
--- a/app/models/pdf_import.rb
+++ b/app/models/pdf_import.rb
@@ -89,6 +89,8 @@ class PdfImport < Import
   end
 
   def process_with_ai
+    # TODO(#2113): hardcoded to OpenAI. Provider::Anthropic implements
+    # process_pdf (PR #1985); this should honor Setting.llm_provider.
     provider = Provider::Registry.get_provider(:openai)
     raise "AI provider not configured" unless provider
     raise "AI provider does not support PDF processing" unless provider.supports_pdf_processing?
@@ -115,6 +117,8 @@ class PdfImport < Import
   def extract_transactions
     return unless statement_with_transactions?
 
+    # TODO(#2113): hardcoded to OpenAI. Provider::Anthropic implements
+    # extract_bank_statement (PR #1985); this should honor Setting.llm_provider.
     provider = Provider::Registry.get_provider(:openai)
     raise "AI provider not configured" unless provider
 
diff --git a/app/models/provider/anthropic.rb b/app/models/provider/anthropic.rb
index 8e2e1fa50..ab05a0e0b 100644
--- a/app/models/provider/anthropic.rb
+++ b/app/models/provider/anthropic.rb
@@ -155,11 +155,50 @@ class Provider::Anthropic < Provider
   end
 
   def process_pdf(pdf_content:, model: "", family: nil)
-    raise Error, "process_pdf not yet implemented for Provider::Anthropic"
+    with_provider_response do
+      effective_model = model.presence || @default_model
+      raise Error, "Model does not support PDF processing: #{effective_model}" unless supports_pdf_processing?(model: effective_model)
+
+      trace = create_langfuse_trace(
+        name: "anthropic.process_pdf",
+        input: { pdf_size: pdf_content&.bytesize }
+      )
+
+      result = PdfProcessor.new(
+        client,
+        model: effective_model,
+        pdf_content: pdf_content,
+        langfuse_trace: trace,
+        family: family
+      ).process
+
+      upsert_langfuse_trace(trace: trace, output: result.to_h)
+
+      result
+    end
   end
 
   def extract_bank_statement(pdf_content:, model: "", family: nil)
-    raise Error, "extract_bank_statement not yet implemented for Provider::Anthropic"
+    with_provider_response do
+      effective_model = model.presence || @default_model
+
+      trace = create_langfuse_trace(
+        name: "anthropic.extract_bank_statement",
+        input: { pdf_size: pdf_content&.bytesize }
+      )
+
+      result = BankStatementExtractor.new(
+        client: client,
+        pdf_content: pdf_content,
+        model: effective_model,
+        langfuse_trace: trace,
+        family: family
+      ).extract
+
+      upsert_langfuse_trace(trace: trace, output: { transaction_count: result[:transactions].size })
+
+      result
+    end
   end
 
   def chat_response(
diff --git a/app/models/provider/anthropic/bank_statement_extractor.rb b/app/models/provider/anthropic/bank_statement_extractor.rb
new file mode 100644
index 000000000..e91d44b47
--- /dev/null
+++ b/app/models/provider/anthropic/bank_statement_extractor.rb
@@ -0,0 +1,229 @@
+class Provider::Anthropic::BankStatementExtractor
+  include Provider::Anthropic::Concerns::UsageRecorder
+
+  TOOL_NAME = "report_bank_statement".freeze
+
+  # Reuses PdfProcessor's raw-byte cap so the base64-encoded request stays under 32 MB.
+  MAX_PDF_BYTES = Provider::Anthropic::PdfProcessor::MAX_PDF_BYTES
+
+  attr_reader :client, :model, :pdf_content, :langfuse_trace, :family
+
+  def initialize(client:, model:, pdf_content:, langfuse_trace: nil, family: nil)
+    @client = client
+    @model = model
+    @pdf_content = pdf_content
+    @langfuse_trace = langfuse_trace
+    @family = family
+  end
+
+  def extract
+    raise Provider::Anthropic::Error, "PDF content is required" if pdf_content.blank?
+    if pdf_content.bytesize > MAX_PDF_BYTES
+      raise Provider::Anthropic::Error,
+        "PDF is too large (#{pdf_content.bytesize} bytes); base64-encoded it would exceed Anthropic's 32 MB request limit"
+    end
+
+    span = langfuse_trace&.span(name: "extract_bank_statement_api_call", input: {
+      model: model,
+      pdf_size: pdf_content.bytesize
+    })
+
+    response = client.messages.create(
+      model: model,
+      max_tokens: max_tokens,
+      system_: instructions,
+      messages: [ { role: "user", content: user_content } ],
+      tools: [ output_tool ],
+      tool_choice: { type: "tool", name: TOOL_NAME, disable_parallel_tool_use: true }
+    )
+
+    parsed = extract_tool_input(response)
+    result = build_result(parsed)
+
+    truncated = stop_reason(response) == :max_tokens
+    if truncated
+      Rails.logger.warn(
+        "[BankStatementExtractor] response truncated by max_tokens — extracted #{result[:transactions].size} " \
+        "transactions but more may be present in the statement. Raise ANTHROPIC_MAX_TOKENS or chunk the PDF."
+      )
+      result[:truncated] = true
+    end
+
+    record_usage(model, response.usage, operation: "extract_bank_statement", metadata: {
+      pdf_size: pdf_content.bytesize,
+      transaction_count: result[:transactions].size,
+      truncated: truncated
+    })
+
+    span&.end(output: { transaction_count: result[:transactions].size }, usage: usage_hash(response.usage))
+    result
+  rescue => e
+    span&.end(output: { error: e.message }, level: "ERROR")
+    record_usage_error(model, operation: "extract_bank_statement", error: e, metadata: { pdf_size: pdf_content&.bytesize })
+    raise
+  end
+
+  private
+    def max_tokens
+      ENV.fetch("ANTHROPIC_MAX_TOKENS", 4096).to_i
+    end
+
+    def user_content
+      [
+        {
+          type: "document",
+          source: {
+            type: "base64",
+            media_type: "application/pdf",
+            data: Base64.strict_encode64(pdf_content)
+          }
+        },
+        {
+          type: "text",
+          text: "Extract every transaction from this bank statement and return them via the report_bank_statement tool."
+        }
+      ]
+    end
+
+    def output_tool
+      {
+        name: TOOL_NAME,
+        description: "Return the full set of transactions and statement metadata extracted from the PDF.",
+        input_schema: {
+          type: "object",
+          properties: {
+            bank_name: { type: [ "string", "null" ] },
+            account_holder: { type: [ "string", "null" ] },
+            account_number: { type: [ "string", "null" ], description: "Typically last 4 digits only." },
+            statement_period: {
+              type: "object",
+              properties: {
+                start_date: { type: [ "string", "null" ], description: "YYYY-MM-DD" },
+                end_date: { type: [ "string", "null" ], description: "YYYY-MM-DD" }
+              },
+              required: [],
+              additionalProperties: false
+            },
+            opening_balance: { type: [ "number", "null" ] },
+            closing_balance: { type: [ "number", "null" ] },
+            transactions: {
+              type: "array",
+              description: "Every transaction in the statement, in document order.",
+              items: {
+                type: "object",
+                properties: {
+                  date: { type: "string", description: "YYYY-MM-DD" },
+                  description: { type: "string" },
+                  amount: { type: "number", description: "Negative for debits / expenses, positive for credits / deposits." },
+                  reference: { type: [ "string", "null" ] },
+                  category: { type: [ "string", "null" ] }
+                },
+                required: [ "date", "description", "amount" ],
+                additionalProperties: false
+              }
+            }
+          },
+          required: [ "transactions" ],
+          additionalProperties: false
+        }
+      }
+    end
+
+    def instructions
+      <<~INSTRUCTIONS
+        Extract bank statement data from the attached PDF and return the result via the report_bank_statement tool.
+
+        Rules:
+        - Extract EVERY transaction in document order
+        - Negative amounts for debits / expenses, positive for credits / deposits
+        - Dates in YYYY-MM-DD
+        - Use null for any field you cannot read; do not invent values
+      INSTRUCTIONS
+    end
+
+    def stop_reason(response)
+      raw = response.respond_to?(:stop_reason) ? response.stop_reason : nil
+      raw.to_s.to_sym if raw
+    end
+
+    def extract_tool_input(response)
+      tool_use = Array(response.content).find { |block| block_type(block) == :tool_use }
+      raise Provider::Anthropic::Error, "Model did not invoke #{TOOL_NAME}" unless tool_use
+
+      input = block_input(tool_use)
+      input = JSON.parse(input) if input.is_a?(String)
+      input
+    end
+
+    def build_result(parsed)
+      # Intentionally NOT deduplicated, unlike Provider::Openai's extractor. That
+      # one chunks the PDF text with overlap and must drop transactions repeated
+      # across adjacent chunks. We send the whole PDF as a single native document
+      # block — no chunk artifacts — so deduping here would wrongly merge
+      # legitimate same-day, same-amount rows (e.g. two identical purchases).
+      # Preserve every transaction the model returns.
+      transactions = Array(parsed["transactions"] || parsed[:transactions]).map { |t| normalize_transaction(t) }.compact
+
+      {
+        transactions: transactions,
+        period: {
+          start_date: dig_period(parsed, :start_date),
+          end_date: dig_period(parsed, :end_date)
+        },
+        account_holder: parsed["account_holder"] || parsed[:account_holder],
+        account_number: parsed["account_number"] || parsed[:account_number],
+        bank_name: parsed["bank_name"] || parsed[:bank_name],
+        opening_balance: parsed["opening_balance"] || parsed[:opening_balance],
+        closing_balance: parsed["closing_balance"] || parsed[:closing_balance]
+      }
+    end
+
+    def dig_period(parsed, key)
+      period = parsed["statement_period"] || parsed[:statement_period]
+      return nil unless period.is_a?(Hash)
+      period[key.to_s] || period[key]
+    end
+
+    def normalize_transaction(txn)
+      return nil unless txn.is_a?(Hash)
+
+      {
+        date: parse_date(txn["date"] || txn[:date]),
+        amount: parse_amount(txn["amount"] || txn[:amount]),
+        name: txn["description"] || txn[:description] || txn["name"] || txn[:name],
+        category: txn["category"] || txn[:category],
+        notes: txn["reference"] || txn[:reference]
+      }
+    end
+
+    def parse_date(date_str)
+      return nil if date_str.blank?
+      Date.parse(date_str.to_s).strftime("%Y-%m-%d")
+    rescue ArgumentError
+      nil
+    end
+
+    def parse_amount(amount)
+      return nil if amount.nil?
+      return amount.to_f if amount.is_a?(Numeric)
+      amount.to_s.gsub(/[^0-9.\-]/, "").to_f
+    end
+
+    def block_type(block)
+      raw = block.respond_to?(:type) ? block.type : block[:type] || block["type"]
+      raw.to_s.to_sym
+    end
+
+    def block_input(block)
+      block.respond_to?(:input) ? block.input : (block[:input] || block["input"])
+    end
+
+    def usage_hash(raw_usage)
+      return {} unless raw_usage
+      {
+        "input_tokens" => raw_usage.input_tokens.to_i,
+        "output_tokens" => raw_usage.output_tokens.to_i,
+        "total_tokens" => raw_usage.input_tokens.to_i + raw_usage.output_tokens.to_i
+      }
+    end
+end
diff --git a/app/models/provider/anthropic/pdf_processor.rb b/app/models/provider/anthropic/pdf_processor.rb
new file mode 100644
index 000000000..dc6dc2c96
--- /dev/null
+++ b/app/models/provider/anthropic/pdf_processor.rb
@@ -0,0 +1,185 @@
+class Provider::Anthropic::PdfProcessor
+  include Provider::Anthropic::Concerns::UsageRecorder
+
+  TOOL_NAME = "report_document_analysis".freeze
+
+  # Anthropic enforces a 32 MB limit on the whole Messages *request body*, and
+  # the PDF travels base64-encoded (~4/3 larger) inside that body alongside the
+  # JSON envelope (instructions, tool schema). So a 32 MB raw PDF would encode
+  # to ~42 MB and be rejected. Cap the raw bytes at 3/4 of the request budget,
+  # minus a generous envelope reserve, so the encoded request stays under the
+  # limit. Guarding upstream also avoids base64-encoding an over-size blob in
+  # vain (peak heap before the API would reject it).
+  MAX_REQUEST_BYTES = 32 * 1024 * 1024
+  REQUEST_ENVELOPE_BYTES = 1 * 1024 * 1024
+  MAX_PDF_BYTES = (MAX_REQUEST_BYTES - REQUEST_ENVELOPE_BYTES) * 3 / 4
+
+  attr_reader :client, :model, :pdf_content, :langfuse_trace, :family
+
+  def initialize(client, model:, pdf_content:, langfuse_trace: nil, family: nil)
+    @client = client
+    @model = model
+    @pdf_content = pdf_content
+    @langfuse_trace = langfuse_trace
+    @family = family
+  end
+
+  def process
+    raise Provider::Anthropic::Error, "PDF content is required" if pdf_content.blank?
+    if pdf_content.bytesize > MAX_PDF_BYTES
+      raise Provider::Anthropic::Error,
+        "PDF is too large (#{pdf_content.bytesize} bytes); base64-encoded it would exceed Anthropic's 32 MB request limit"
+    end
+
+    span = langfuse_trace&.span(name: "process_pdf_api_call", input: {
+      model: model,
+      pdf_size: pdf_content&.bytesize
+    })
+
+    response = client.messages.create(
+      model: model,
+      max_tokens: max_tokens,
+      system_: instructions,
+      messages: [ { role: "user", content: user_content } ],
+      tools: [ output_tool ],
+      tool_choice: { type: "tool", name: TOOL_NAME, disable_parallel_tool_use: true }
+    )
+
+    parsed = extract_tool_input(response)
+    result = build_result(parsed)
+
+    record_usage(model, response.usage, operation: "process_pdf", metadata: { pdf_size: pdf_content.bytesize })
+
+    span&.end(output: result.to_h, usage: usage_hash(response.usage))
+    result
+  rescue => e
+    span&.end(output: { error: e.message }, level: "ERROR")
+    record_usage_error(model, operation: "process_pdf", error: e, metadata: { pdf_size: pdf_content&.bytesize })
+    raise
+  end
+
+  private
+    PdfProcessingResult = Provider::LlmConcept::PdfProcessingResult
+
+    def max_tokens
+      ENV.fetch("ANTHROPIC_MAX_TOKENS", 4096).to_i
+    end
+
+    def user_content
+      [
+        {
+          type: "document",
+          source: {
+            type: "base64",
+            media_type: "application/pdf",
+            data: Base64.strict_encode64(pdf_content)
+          }
+        },
+        {
+          type: "text",
+          text: "Analyze the attached document and return the result via the report_document_analysis tool."
+        }
+      ]
+    end
+
+    def output_tool
+      {
+        name: TOOL_NAME,
+        description: "Return the structured analysis of the attached document.",
+        input_schema: {
+          type: "object",
+          properties: {
+            document_type: {
+              type: "string",
+              enum: Import::DOCUMENT_TYPES,
+              description: "Classification of the document."
+            },
+            summary: {
+              type: "string",
+              description: "Concise human-readable summary of the document."
+ }, + extracted_data: { + type: "object", + properties: { + institution_name: { type: [ "string", "null" ] }, + statement_period_start: { type: [ "string", "null" ], pattern: "^\\d{4}-\\d{2}-\\d{2}$", description: "YYYY-MM-DD or null" }, + statement_period_end: { type: [ "string", "null" ], pattern: "^\\d{4}-\\d{2}-\\d{2}$", description: "YYYY-MM-DD or null" }, + transaction_count: { type: [ "integer", "null" ] }, + opening_balance: { type: [ "number", "null" ] }, + closing_balance: { type: [ "number", "null" ] }, + currency: { type: [ "string", "null" ] }, + account_holder: { type: [ "string", "null" ] } + }, + required: [], + additionalProperties: false + } + }, + required: [ "document_type", "summary", "extracted_data" ], + additionalProperties: false + } + } + end + + def instructions + <<~INSTRUCTIONS + You analyze financial documents. For the attached PDF, classify the document type, + summarize it, and extract key metadata. Return the result via the report_document_analysis tool. + + Classification options: + - bank_statement: bank account statements (incl. 
mobile money / digital wallets) + - credit_card_statement: credit card statements + - investment_statement: brokerage / investment statements + - financial_document: tax forms, receipts, invoices, financial reports + - contract: legal agreements, loans, terms of service + - other: anything else + + Rules: + - Be factual; only report what is clearly visible + - If a field is unclear/redacted, return null for it + - Do not invent figures or names you cannot read + - For statements with many transactions, return the count rather than enumerating them + INSTRUCTIONS + end + + def extract_tool_input(response) + tool_use = Array(response.content).find { |block| block_type(block) == :tool_use } + raise Provider::Anthropic::Error, "Model did not invoke #{TOOL_NAME}" unless tool_use + + input = block_input(tool_use) + input = JSON.parse(input) if input.is_a?(String) + input + end + + def build_result(parsed) + PdfProcessingResult.new( + summary: parsed["summary"] || parsed[:summary], + document_type: normalize_document_type(parsed["document_type"] || parsed[:document_type]), + extracted_data: parsed["extracted_data"] || parsed[:extracted_data] || {} + ) + end + + def normalize_document_type(doc_type) + return "other" if doc_type.blank? + + normalized = doc_type.to_s.strip.downcase.gsub(/\s+/, "_") + Import::DOCUMENT_TYPES.include?(normalized) ? normalized : "other" + end + + def block_type(block) + raw = block.respond_to?(:type) ? block.type : block[:type] || block["type"] + raw.to_s.to_sym + end + + def block_input(block) + block.respond_to?(:input) ? 
block.input : (block[:input] || block["input"]) + end + + def usage_hash(raw_usage) + return {} unless raw_usage + { + "input_tokens" => raw_usage.input_tokens.to_i, + "output_tokens" => raw_usage.output_tokens.to_i, + "total_tokens" => raw_usage.input_tokens.to_i + raw_usage.output_tokens.to_i + } + end +end diff --git a/test/models/provider/anthropic/bank_statement_extractor_test.rb b/test/models/provider/anthropic/bank_statement_extractor_test.rb new file mode 100644 index 000000000..203246221 --- /dev/null +++ b/test/models/provider/anthropic/bank_statement_extractor_test.rb @@ -0,0 +1,141 @@ +require "test_helper" +require "ostruct" + +class Provider::Anthropic::BankStatementExtractorTest < ActiveSupport::TestCase + setup do + @pdf_content = "%PDF-1.4 fake bytes".b + end + + test "sends PDF as native document and returns normalized transactions + metadata" do + fake_response = build_response(content: [ + tool_use_block( + id: "toolu_1", + name: "report_bank_statement", + input: { + "bank_name" => "Bank of Example", + "account_holder" => "Jane Doe", + "account_number" => "1234", + "statement_period" => { "start_date" => "2026-03-01", "end_date" => "2026-03-31" }, + "opening_balance" => 1000.0, + "closing_balance" => 1500.0, + "transactions" => [ + { "date" => "2026-03-05", "description" => "Coffee", "amount" => -4.5 }, + { "date" => "2026-03-15", "description" => "Salary", "amount" => 3000.0, "reference" => "Payroll Mar" } + ] + } + ) + ]) + client = stub_client(fake_response) + + result = Provider::Anthropic::BankStatementExtractor.new( + client: client, + model: "claude-sonnet-4-6", + pdf_content: @pdf_content + ).extract + + assert_equal "Bank of Example", result[:bank_name] + assert_equal "Jane Doe", result[:account_holder] + assert_equal "1234", result[:account_number] + assert_equal "2026-03-01", result[:period][:start_date] + assert_equal "2026-03-31", result[:period][:end_date] + assert_equal 1000.0, result[:opening_balance] + assert_equal 1500.0, 
result[:closing_balance] + + assert_equal 2, result[:transactions].size + txn1 = result[:transactions].first + assert_equal "2026-03-05", txn1[:date] + assert_equal "Coffee", txn1[:name] + assert_equal(-4.5, txn1[:amount]) + + txn2 = result[:transactions].last + assert_equal "Salary", txn2[:name] + assert_equal 3000.0, txn2[:amount] + assert_equal "Payroll Mar", txn2[:notes] + end + + test "raises when pdf_content is blank" do + err = assert_raises(Provider::Anthropic::Error) do + Provider::Anthropic::BankStatementExtractor.new( + client: mock, + model: "claude-sonnet-4-6", + pdf_content: nil + ).extract + end + assert_match(/PDF content is required/i, err.message) + end + + test "raises when model omits the tool call" do + fake_response = build_response(content: [ OpenStruct.new(type: :text, text: "no tool") ]) + client = stub_client(fake_response) + + err = assert_raises(Provider::Anthropic::Error) do + Provider::Anthropic::BankStatementExtractor.new( + client: client, + model: "claude-sonnet-4-6", + pdf_content: @pdf_content + ).extract + end + assert_match(/did not invoke report_bank_statement/i, err.message) + end + + test "raises before API call when pdf_content exceeds the 32 MB limit" do + oversized = "a".b * (Provider::Anthropic::BankStatementExtractor::MAX_PDF_BYTES + 1) + client = mock + client.expects(:messages).never + + err = assert_raises(Provider::Anthropic::Error) do + Provider::Anthropic::BankStatementExtractor.new( + client: client, + model: "claude-sonnet-4-6", + pdf_content: oversized + ).extract + end + assert_match(/exceeds Anthropic's 32 MB limit/i, err.message) + end + + test "flags result as truncated when stop_reason is max_tokens" do + fake_response = build_response( + content: [ + tool_use_block( + id: "toolu_1", + name: "report_bank_statement", + input: { "transactions" => [ { "date" => "2026-03-05", "description" => "Coffee", "amount" => -4.5 } ] } + ) + ] + ) + fake_response.stop_reason = :max_tokens + client = 
stub_client(fake_response) + + Rails.logger.expects(:warn).with(regexp_matches(/truncated by max_tokens/i)) + + result = Provider::Anthropic::BankStatementExtractor.new( + client: client, + model: "claude-sonnet-4-6", + pdf_content: @pdf_content + ).extract + + assert_equal true, result[:truncated] + end + + private + def stub_client(response) + messages = mock + messages.stubs(:create).returns(response) + client = mock + client.stubs(:messages).returns(messages) + client + end + + def build_response(content:, usage: { input_tokens: 1500, output_tokens: 400 }) + OpenStruct.new( + id: "msg_test", + model: "claude-sonnet-4-6", + content: content, + usage: OpenStruct.new(input_tokens: usage[:input_tokens], output_tokens: usage[:output_tokens]) + ) + end + + def tool_use_block(id:, name:, input:) + OpenStruct.new(type: :tool_use, id: id, name: name, input: input) + end +end diff --git a/test/models/provider/anthropic/pdf_processor_test.rb b/test/models/provider/anthropic/pdf_processor_test.rb new file mode 100644 index 000000000..d2cdbb3a7 --- /dev/null +++ b/test/models/provider/anthropic/pdf_processor_test.rb @@ -0,0 +1,126 @@ +require "test_helper" +require "ostruct" + +class Provider::Anthropic::PdfProcessorTest < ActiveSupport::TestCase + setup do + @pdf_content = "%PDF-1.4 fake bytes".b + end + + test "sends PDF as native document content block and parses tool response" do + fake_response = build_response(content: [ + tool_use_block( + id: "toolu_1", + name: "report_document_analysis", + input: { + "document_type" => "bank_statement", + "summary" => "Bank of Example, Mar 2026 statement.", + "extracted_data" => { + "institution_name" => "Bank of Example", + "statement_period_start" => "2026-03-01", + "statement_period_end" => "2026-03-31", + "transaction_count" => 42, + "opening_balance" => 1000.0, + "closing_balance" => 1500.0, + "currency" => "USD", + "account_holder" => "Account Holder" + } + } + ) + ]) + captured = nil + client = stub_client(fake_response) { 
|params| captured = params } + + result = Provider::Anthropic::PdfProcessor.new( + client, + model: "claude-sonnet-4-6", + pdf_content: @pdf_content + ).process + + document_block = captured[:messages].first[:content].first + assert_equal "document", document_block[:type] + assert_equal "application/pdf", document_block[:source][:media_type] + assert_equal "base64", document_block[:source][:type] + assert_equal Base64.strict_encode64(@pdf_content), document_block[:source][:data] + + assert_equal "report_document_analysis", captured[:tool_choice][:name] + assert captured[:tool_choice][:disable_parallel_tool_use] + + assert_equal "bank_statement", result.document_type + assert_equal "Bank of Example, Mar 2026 statement.", result.summary + assert_equal 42, result.extracted_data["transaction_count"] + end + + test "normalizes unknown document_type to other" do + fake_response = build_response(content: [ + tool_use_block( + id: "toolu_2", + name: "report_document_analysis", + input: { + "document_type" => "alien_invasion_form", + "summary" => "Unknown.", + "extracted_data" => {} + } + ) + ]) + client = stub_client(fake_response) + + result = Provider::Anthropic::PdfProcessor.new( + client, + model: "claude-sonnet-4-6", + pdf_content: @pdf_content + ).process + + assert_equal "other", result.document_type + end + + test "raises when pdf_content is blank" do + err = assert_raises(Provider::Anthropic::Error) do + Provider::Anthropic::PdfProcessor.new( + mock, + model: "claude-sonnet-4-6", + pdf_content: "" + ).process + end + assert_match(/PDF content is required/i, err.message) + end + + test "raises before any API call when pdf_content exceeds the base64-adjusted cap" do + oversized = "a".b * (Provider::Anthropic::PdfProcessor::MAX_PDF_BYTES + 1) + client = mock + client.expects(:messages).never + + err = assert_raises(Provider::Anthropic::Error) do + Provider::Anthropic::PdfProcessor.new( + client, + model: "claude-sonnet-4-6", + pdf_content: oversized + ).process + end 
+ assert_match(/32 MB request limit/i, err.message) + end + + private + def stub_client(response) + messages = mock + messages.expects(:create).with do |params| + yield(params) if block_given? + true + end.returns(response) + client = mock + client.stubs(:messages).returns(messages) + client + end + + def build_response(content:, usage: { input_tokens: 800, output_tokens: 200 }) + OpenStruct.new( + id: "msg_test", + model: "claude-sonnet-4-6", + content: content, + usage: OpenStruct.new(input_tokens: usage[:input_tokens], output_tokens: usage[:output_tokens]) + ) + end + + def tool_use_block(id:, name:, input:) + OpenStruct.new(type: :tool_use, id: id, name: name, input: input) + end +end
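
Reviewer note on the size cap in `pdf_processor.rb`: the comment above `MAX_PDF_BYTES` argues that capping the raw PDF at 3/4 of the request budget (after reserving 1 MiB for the JSON envelope) keeps the base64-encoded request under 32 MB. The standalone sketch below re-derives that arithmetic outside Rails. The three constants are copied from the patch; `base64_length`, `ENCODED_AT_CAP`, and `ENCODED_FULL` are hypothetical names introduced only for this illustration and are not part of the PR.

```ruby
# Constants copied verbatim from Provider::Anthropic::PdfProcessor in this patch.
MAX_REQUEST_BYTES = 32 * 1024 * 1024
REQUEST_ENVELOPE_BYTES = 1 * 1024 * 1024
MAX_PDF_BYTES = (MAX_REQUEST_BYTES - REQUEST_ENVELOPE_BYTES) * 3 / 4

# Hypothetical helper (not in the patch): size of padded base64 output for
# `raw_bytes` input bytes. Every 3-byte group, including a partial final
# group, becomes 4 output characters.
def base64_length(raw_bytes)
  ((raw_bytes + 2) / 3) * 4
end

# Sanity-check the formula against a real encoding. Array#pack("m0") produces
# the same newline-free output as Base64.strict_encode64.
sample = "a" * 10
raise "formula mismatch" unless [sample].pack("m0").bytesize == base64_length(sample.bytesize)

# Encoded size of a PDF exactly at the cap, and of a full 32 MiB PDF.
ENCODED_AT_CAP = base64_length(MAX_PDF_BYTES)
ENCODED_FULL = base64_length(MAX_REQUEST_BYTES)

puts "raw cap:          #{MAX_PDF_BYTES} bytes"   # 24379392
puts "encoded at cap:   #{ENCODED_AT_CAP} bytes"  # 32505856
puts "envelope reserve: #{MAX_REQUEST_BYTES - ENCODED_AT_CAP} bytes"
puts "encoded 32 MiB:   #{ENCODED_FULL} bytes (over the limit)"
```

Under these assumptions a PDF at the cap encodes to exactly `MAX_REQUEST_BYTES - REQUEST_ENVELOPE_BYTES`, so the reserve is left intact for the instructions and tool schema, while an uncapped 32 MiB PDF would encode to roughly 42.7 MiB.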