QT/sure - sure


mirror of https://github.com/we-promise/sure.git synced 2026-05-31 16:29:03 +00:00

Commit Graph

Author

SHA1

Message

Date

Guillem Arias

cfde4c70a1

fix(ai): guard PDF size + surface bank-statement truncation

- PdfProcessor and BankStatementExtractor raise upfront when
  pdf_content.bytesize exceeds MAX_PDF_BYTES (32 MB, matching
  Anthropic's hard limit). Previously a 100 MB PDF would be
  base64-encoded (~133 MB) and packed into the JSON body before
  the API rejected it — peak heap ~270 MB per Sidekiq worker.
- BankStatementExtractor inspects response.stop_reason; when the
  model hit max_tokens it logs a warning and flags result[:truncated]
  so downstream callers know the transaction list may be incomplete.
- ISO date pattern added to statement_period_start/end schema in
  PdfProcessor so the model can't return "March 2026" — Anthropic
  enforces the regex via the tool's input_schema.
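
A minimal sketch of the upfront size guard described in the first bullet, together with the base64 arithmetic behind the "~133 MB" figure. The error class `PdfTooLargeError` and the helpers `guard_pdf_size!` and `base64_size` are hypothetical names for illustration; the commit only names `MAX_PDF_BYTES` and `pdf_content.bytesize`.

```ruby
# Hypothetical placement: the commit names MAX_PDF_BYTES (32 MB) but not
# the module it lives in or the exception that is raised.
MAX_PDF_BYTES = 32 * 1024 * 1024

class PdfTooLargeError < StandardError; end

# Checks the raw byte size before any encoding work, so an oversized
# upload is never base64-expanded or serialized into a request body.
# Returns the size in bytes when the PDF is within the limit.
def guard_pdf_size!(pdf_content)
  size = pdf_content.bytesize
  return size if size <= MAX_PDF_BYTES

  raise PdfTooLargeError, "PDF is #{size} bytes; limit is #{MAX_PDF_BYTES} bytes"
end

# Base64 emits 4 output bytes for every 3 input bytes, rounding the
# input up to a whole 3-byte group (padding included).
def base64_size(bytes)
  ((bytes + 2) / 3) * 4
end
```

For a 100 MB input, `base64_size(100_000_000)` comes to roughly 133 MB, which is where the encoded-size figure in the commit message comes from. Holding that string and the JSON body built around it at the same time is consistent with the quoted peak heap.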

Tests cover the size guard (raises before any client.messages call),
truncated-result flagging, and the warning log path.
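
The truncation handling can be sketched as below. This assumes `response` is a Hash parsed from the Messages API JSON and `result` is the extractor's normalized output; `flag_truncation` is a hypothetical helper name, and the real extractor may structure this differently.

```ruby
require "logger"

# Marks the result as truncated when the model stopped because it ran
# out of output tokens, and logs a warning so operators can see it.
# `result[:truncated]` mirrors the flag named in the commit message.
def flag_truncation(result, response, logger: Logger.new($stderr))
  truncated = response["stop_reason"] == "max_tokens"
  if truncated
    logger.warn("Bank statement extraction hit max_tokens; transaction list may be incomplete")
  end
  result.merge(truncated: truncated)
end
```

Returning the flag alongside the transactions, rather than raising, lets a caller decide whether a partial list is still worth importing.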

2026-05-29 14:51:09 +02:00

Guillem Arias

38e950fe23

feat(ai): Anthropic native PDF processing (3/5)

Implements process_pdf and extract_bank_statement on Provider::Anthropic
using the native `document` content block — no rasterization, no text
pre-extraction.

- Provider::Anthropic::PdfProcessor classifies the document, summarizes
  it, and extracts statement metadata via a forced report_document_analysis
  tool whose input_schema mirrors the existing Provider::Openai output
  (document_type from Import::DOCUMENT_TYPES, summary, extracted_data).
- Provider::Anthropic::BankStatementExtractor returns the same
  { transactions, period, account_holder, account_number, bank_name,
  opening_balance, closing_balance } shape via report_bank_statement so
  downstream pdf_import code is provider-agnostic.
- Both attach the PDF as
  { type: "document", source: { type: "base64", media_type: "application/pdf", data: <b64> } }
  — Claude 3.5+ / 4.x accept this natively (up to 32MB / 100 pages).
  No pdf-reader, no pdftoppm, no chunking for typical statements.
- supports_pdf_processing? (introduced in PR 1) already returns true for
  claude-* models, gating process_pdf with a clear error otherwise.
- Cost ledger rows are persisted via the shared UsageRecorder concern,
  including cache_creation/cache_read tokens.
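
As an illustration of the request shape these bullets describe, here is a sketch of a Messages API body with the `document` block and a forced `report_document_analysis` tool. The helper name `build_pdf_analysis_request`, the prompt text, the tool description, the nesting of the statement-period fields under `extracted_data`, and the exact date regex are all assumptions; only the block shape, the tool name, and the top-level schema fields come from the commit messages.

```ruby
require "base64"
require "json"

# Assumed regex for YYYY-MM-DD; the commit mentions an ISO date pattern
# but does not show it.
ISO_DATE_PATTERN = "^\\d{4}-\\d{2}-\\d{2}$"

# Builds a request body (as a Hash) that attaches the PDF natively and
# forces the model to answer through a single tool call.
def build_pdf_analysis_request(pdf_content, model:, document_types:, max_tokens: 4096)
  {
    model: model,
    max_tokens: max_tokens,
    tools: [
      {
        name: "report_document_analysis",
        description: "Report the classification, summary and extracted metadata.",
        input_schema: {
          type: "object",
          properties: {
            document_type: { type: "string", enum: document_types },
            summary: { type: "string" },
            extracted_data: {
              type: "object",
              properties: {
                statement_period_start: { type: "string", pattern: ISO_DATE_PATTERN },
                statement_period_end: { type: "string", pattern: ISO_DATE_PATTERN }
              }
            }
          },
          required: %w[document_type summary]
        }
      }
    ],
    # Forcing the tool means the reply should contain a tool_use block.
    tool_choice: { type: "tool", name: "report_document_analysis" },
    messages: [
      {
        role: "user",
        content: [
          {
            type: "document",
            source: {
              type: "base64",
              media_type: "application/pdf",
              data: Base64.strict_encode64(pdf_content)
            }
          },
          { type: "text", text: "Classify and summarize this document." }
        ]
      }
    ]
  }
end
```

In the real code `document_types` would presumably be `Import::DOCUMENT_TYPES`. Re-checking the returned dates locally against the same pattern is a cheap extra safeguard that does not depend on how the schema is applied server-side.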

Tests verify the document block shape, tool_choice forcing, normalized
document_type for unknown classifications, transaction normalization
(date / amount / reference → notes), and the missing-tool_use error
path. Blank pdf_content raises before any client call.
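
The missing-tool_use error path and the transaction normalization might look roughly like this. It reads "date / amount / reference → notes" as: parse the date, parse the amount, and carry the bank reference into a `notes` field. That reading, the input key names (`description`, `reference`), the output key `name`, and the `MissingToolUseError` class are assumptions, not taken from the repository.

```ruby
require "date"
require "bigdecimal"

class MissingToolUseError < StandardError; end

# Pulls the forced tool's input out of a parsed Messages API response.
# Raises when the model replied without calling the expected tool.
def tool_input(response, tool_name)
  block = Array(response["content"]).find do |b|
    b["type"] == "tool_use" && b["name"] == tool_name
  end
  raise MissingToolUseError, "model did not call #{tool_name}" unless block

  block.fetch("input")
end

# Normalizes one raw transaction into typed values. Date.iso8601 and
# BigDecimal both raise on malformed input, so bad rows fail loudly.
def normalize_transaction(raw)
  reference = raw["reference"].to_s.strip
  {
    date: Date.iso8601(raw.fetch("date").to_s),
    amount: BigDecimal(raw.fetch("amount").to_s),
    name: raw["description"].to_s.strip,
    notes: reference.empty? ? nil : reference
  }
end
```

A caller would chain the two, for example `tool_input(response, "report_bank_statement").fetch("transactions").map { |t| normalize_transaction(t) }`.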

Stacked on #1984 (PR 2/5). 4/5 pgvector RAG next.

2026-05-29 14:51:09 +02:00

2 Commits