sure/app/models/provider/openai/pdf_processor.rb
MkDev11 0afdb1d0fd Feature/pdf import transaction rows (#846)
* Add import row generation from PDF extracted data

- Add generate_rows_from_extracted_data method to PdfImport
- Add import! method to create transactions from PDF rows
- Update ProcessPdfJob to generate rows after extraction
- Update configured?, cleaned?, publishable? for PDF workflow
- Add column_keys, required_column_keys, mapping_steps
- Set bank statements to pending status for user review
- Add tests for new functionality

Closes #844

* Add tests for BankStatementExtractor

- Test transaction extraction from PDF content
- Test deduplication across chunk boundaries
- Test amount normalization for various formats
- Test graceful handling of malformed JSON responses
- Test error handling for empty/nil PDF content

* Fix supports_pdf_processing? to validate effective model

The validation was always checking @default_model, but process_pdf
allows overriding the model via parameter. This could cause a
vision-capable override model to be rejected, or a non-vision-capable
override to pass validation only to fail during processing.

Changes:
- supports_pdf_processing? now accepts optional model parameter
- process_pdf passes effective model to validation
- Raise Provider::Openai::Error inside with_provider_response for
  consistent error handling

Addresses review feedback from PR#808
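
The effective-model check described above can be sketched in plain Ruby. This is only an illustration: `FakeProvider`, its `Error` class, the `VISION_MODELS` allow-list, and the `present` helper are hypothetical stand-ins, and only the method names `supports_pdf_processing?` and `process_pdf` come from the commit message.

```ruby
# Hypothetical stand-in for the provider; not the project's actual class.
class FakeProvider
  Error = Class.new(StandardError)

  # Assumed allow-list of vision-capable models, for illustration only.
  VISION_MODELS = %w[vision-model-a vision-model-b].freeze

  def initialize(default_model)
    @default_model = default_model
  end

  # Validate the model that will actually be used, not just the default.
  def supports_pdf_processing?(model: nil)
    effective = present(model) || @default_model
    VISION_MODELS.include?(effective)
  end

  # Pass the effective model to the validation before doing any work.
  def process_pdf(model: nil)
    effective = present(model) || @default_model
    unless supports_pdf_processing?(model: effective)
      raise Error, "Model #{effective} does not support PDF processing"
    end
    "processed with #{effective}"
  end

  private

  # Plain-Ruby substitute for ActiveSupport's String#presence.
  def present(value)
    value.to_s.strip.empty? ? nil : value
  end
end
```

With a non-capable default, an override such as `process_pdf(model: "vision-model-a")` passes validation, while calling `process_pdf` without an override raises before any processing starts.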

* Fix insert_all! bug: explicitly set import_id

Rails insert_all! on associations does NOT auto-set the foreign key.
Added import_id explicitly and use Import::Row.insert_all! directly.
Also reload rows before counting to ensure accurate count.

* Fix pending status showing as processing for bank statements with rows

When bank statement PDF imports have extracted rows, show a 'Ready for Review'
screen with a link to the confirm path instead of the 'Processing' spinner.

This addresses the PR feedback that users couldn't reach the review flow even
though rows were created.

* Gate publishable? on account.present? to prevent import failure

PDF imports are created without an account, and import! raises if account
is missing. This prevents users from hitting publish and having the job fail.

* Wrap generate_rows_from_extracted_data in transaction for atomicity

- Clear rows and reset count even when no transactions extracted
- Use transaction block to prevent partial updates on failure
- Use mapped_rows.size instead of reload for count

* Localize transactions count string with i18n helper

* Add AccountMapping step for PDF imports when account is nil

PDF imports need account selection before publishing. This adds
Import::AccountMapping to mapping_steps when account is nil,
matching the behavior of TransactionImport and TradeImport.

Addresses PR#846 feedback about account selection for PDF imports.

* Only include CategoryMapping when rows have non-empty categories

PDF extraction doesn't extract categories from bank statements,
so the CategoryMapping step would show empty. Now we only include
CategoryMapping if rows actually have non-empty category values.

This prevents showing an empty mapping step for PDF imports.
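
A rough sketch of that condition in plain Ruby, assuming a simplified row type. `SketchRow` and the `:category_mapping` symbol are hypothetical placeholders for the real row model and `Import::CategoryMapping`.

```ruby
# Hypothetical row type; the real import rows carry more columns.
SketchRow = Struct.new(:name, :category)

# Returns the mapping steps to show. A symbol stands in for the real
# mapping class, which is not reproduced here.
def sketch_mapping_steps(rows)
  steps = []
  has_categories = rows.any? { |row| !row.category.to_s.strip.empty? }
  steps << :category_mapping if has_categories
  steps
end
```

Rows whose category is `nil`, empty, or whitespace-only do not trigger the step, so a PDF import without extracted categories yields an empty list.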

* Fix PDF import UI flow and account selection

- Add direct account selection in PDF import UI instead of AccountMapping
- AccountMapping designed for CSV imports with multiple account values
- PDF imports need single account for all transactions
- Add update action and route for imports controller
- Fix controller to handle pdf_import param format from form_with
- Show Publish button when import is publishable (account set)
- Fix stepper nav: Upload/Configure/Clean non-clickable for PDF imports
- Redirect PDF imports from configuration step (auto-configured)
- Improve AI prompt to recognize M-PESA/mobile money as bank statements
- Fix migration ordering for import_rows table columns

* Add guard for invalid account_id in imports#update

Prevents silently clearing account when invalid ID is passed.
Returns error message instead of confusing 'Account saved' notice.

* Localize step names in import nav and add account guard

- Use t() helper for all step names (Upload, Configure, Clean, Map, Confirm)
- Add guard for invalid account_id in imports#update
- Prevents silently clearing account when invalid ID is passed

* Make category column migrations idempotent

Check if columns exist before adding to prevent duplicate column
errors when migrations are re-run with new timestamps.

* Add match_path for PDF import step highlighting

Fixes step detection when path is nil by using separate match_path
for current step highlighting while keeping links disabled.

* Rename category migrations and update to Rails 7.2

- Rename class to EnsureCategoryFieldsOnImportRows to avoid conflicts
- Rename class to EnsureCategoryIconOnImportRows
- Update migration version from 7.1 to 7.2 per guidelines
- Rename files to match class names
- Add match_path for PDF import step highlighting

* Use primary (black) style for Create Account and Save buttons

* Remove match_path from auto-completed PDF steps

Only step 4 (Confirm) needs match_path for active-step detection.
Steps 1-3 are purely informational and always complete.

* Add fallback for document type translation

Handles nil or unexpected document_type values gracefully.
Also removes match_path from auto-completed PDF steps.

* Use index-based step number for mobile indicator

Fixes 'Step 5 of 4' issue when Map step is dynamically removed.
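
To illustrate the off-by-one this fixes, here is a small sketch with a hypothetical step list in which each step keeps a fixed number. The hash layout and the `step_label` helper are assumptions for the example, not the view code from the repository.

```ruby
# Hypothetical step list; each step carries its original fixed number.
ALL_STEPS = [
  { name: "Upload", number: 1 },
  { name: "Configure", number: 2 },
  { name: "Clean", number: 3 },
  { name: "Map", number: 4 },
  { name: "Confirm", number: 5 }
].freeze

# Derive the displayed number from the position in the visible list
# instead of the fixed :number, so removed steps leave no gap.
def step_label(steps, current_name)
  index = steps.index { |step| step[:name] == current_name }
  "Step #{index + 1} of #{steps.size}"
end

visible_steps = ALL_STEPS.reject { |step| step[:name] == "Map" }
puts step_label(visible_steps, "Confirm") # prints "Step 4 of 4"
```

Using the fixed `:number` here would have produced "Step 5 of 4" once the Map step is dropped.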

* Fix hostings_controller_test: use blank? instead of nil

Setting returns empty string not nil for unset values.

* Localize step progress label and use design token

* Fix button styling: use design system Tailwind classes

btn--primary and btn--secondary CSS classes don't exist.
Use actual design system classes from DS::Buttonish.

* Fix CRLF line endings in tags_controller_test.rb

---------

Co-authored-by: mkdev11 <jaysmth689+github@users.noreply.github.com>
2026-02-02 16:27:02 +01:00

266 lines
9.5 KiB
Ruby

class Provider::Openai::PdfProcessor
  include Provider::Openai::Concerns::UsageRecorder

  attr_reader :client, :model, :pdf_content, :custom_provider, :langfuse_trace, :family

  def initialize(client, model: "", pdf_content: nil, custom_provider: false, langfuse_trace: nil, family: nil)
    @client = client
    @model = model
    @pdf_content = pdf_content
    @custom_provider = custom_provider
    @langfuse_trace = langfuse_trace
    @family = family
  end

  def process
    span = langfuse_trace&.span(name: "process_pdf_api_call", input: {
      model: model.presence || Provider::Openai::DEFAULT_MODEL,
      pdf_size: pdf_content&.bytesize
    })

    # Try text extraction first (works with all models)
    # Fall back to vision API with images if text extraction fails (for scanned PDFs)
    response = begin
      process_with_text_extraction
    rescue Provider::Openai::Error => e
      Rails.logger.warn("Text extraction failed: #{e.message}, trying vision API with images")
      process_with_vision
    end

    span&.end(output: response.to_h)
    response
  rescue => e
    span&.end(output: { error: e.message }, level: "ERROR")
    raise
  end

  def instructions
    <<~INSTRUCTIONS.strip
      You are a financial document analysis assistant. Your job is to analyze uploaded PDF documents
      and provide a structured summary of what the document contains.
      For each document, you must determine:
      1. **Document Type**: Classify the document as one of the following:
      - `bank_statement`: A bank account statement showing transactions, balances, and account activity. This includes mobile money statements (like M-PESA, Venmo, PayPal, Cash App), digital wallet statements, and any statement showing a list of financial transactions with dates and amounts.
      - `credit_card_statement`: A credit card statement showing charges, payments, and balances
      - `investment_statement`: An investment/brokerage statement showing holdings, trades, or portfolio performance
      - `financial_document`: General financial documents like tax forms, receipts, invoices, or financial reports
      - `contract`: Legal agreements, loan documents, terms of service, or policy documents
      - `other`: Any document that doesn't fit the above categories
      2. **Summary**: Provide a concise summary of the document that includes:
      - The issuing institution or company name (if identifiable)
      - The date range or statement period (if applicable)
      - Key financial figures (account balances, total transactions, etc.)
      - The account holder's name (if visible, use "Account Holder" if redacted)
      - Any notable items or important information
      3. **Extracted Data**: If the document is a statement with transactions, extract key metadata:
      - Number of transactions (if countable)
      - Statement period (start and end dates)
      - Opening and closing balances (if visible)
      - Currency used
      IMPORTANT GUIDELINES:
      - Be factual and precise - only report what you can clearly see in the document
      - If information is unclear or redacted, note it as "not clearly visible" or "redacted"
      - Do NOT make assumptions about data you cannot see
      - For statements with many transactions, provide a count rather than listing each one
      - Focus on providing actionable information that helps the user understand what they uploaded
      - If the document is unreadable or the PDF is corrupted, indicate this clearly
      Respond with ONLY valid JSON in this exact format (no markdown code blocks, no other text):
      {
      "document_type": "bank_statement|credit_card_statement|investment_statement|financial_document|contract|other",
      "summary": "A clear, concise summary of the document contents...",
      "extracted_data": {
      "institution_name": "Name of bank/company or null",
      "statement_period_start": "YYYY-MM-DD or null",
      "statement_period_end": "YYYY-MM-DD or null",
      "transaction_count": number or null,
      "opening_balance": number or null,
      "closing_balance": number or null,
      "currency": "USD/EUR/etc or null",
      "account_holder": "Name or null"
      }
      }
    INSTRUCTIONS
  end

  private

  PdfProcessingResult = Provider::LlmConcept::PdfProcessingResult

  def process_with_text_extraction
    effective_model = model.presence || Provider::Openai::DEFAULT_MODEL

    # Extract text from PDF using pdf-reader gem
    pdf_text = extract_text_from_pdf
    raise Provider::Openai::Error, "Could not extract text from PDF" if pdf_text.blank?

    # Truncate if too long (max ~100k chars to stay within token limits)
    pdf_text = pdf_text.truncate(100_000) if pdf_text.length > 100_000

    params = {
      model: effective_model,
      messages: [
        { role: "system", content: instructions },
        {
          role: "user",
          content: "Please analyze the following document text and provide a structured summary:\n\n#{pdf_text}"
        }
      ],
      response_format: { type: "json_object" }
    }

    response = client.chat(parameters: params)
    Rails.logger.info("Tokens used to process PDF: #{response.dig("usage", "total_tokens")}")

    record_usage(
      effective_model,
      response.dig("usage"),
      operation: "process_pdf",
      metadata: { pdf_size: pdf_content&.bytesize }
    )

    parse_response_generic(response)
  end

  def extract_text_from_pdf
    return nil if pdf_content.blank?

    reader = PDF::Reader.new(StringIO.new(pdf_content))
    text_parts = []

    reader.pages.each_with_index do |page, index|
      text_parts << "--- Page #{index + 1} ---"
      text_parts << page.text
    end

    text_parts.join("\n\n")
  rescue => e
    Rails.logger.error("Failed to extract text from PDF: #{e.message}")
    nil
  end

  def process_with_vision
    effective_model = model.presence || Provider::Openai::DEFAULT_MODEL

    # Convert PDF to images using pdftoppm
    images_base64 = convert_pdf_to_images
    raise Provider::Openai::Error, "Could not convert PDF to images" if images_base64.blank?

    # Build message content with images (max 5 pages to avoid token limits)
    content = []
    images_base64.first(5).each do |img_base64|
      content << {
        type: "image_url",
        image_url: {
          url: "data:image/png;base64,#{img_base64}",
          detail: "low"
        }
      }
    end

    content << {
      type: "text",
      text: "Please analyze this PDF document (#{images_base64.size} pages total, showing first #{[ images_base64.size, 5 ].min}) and respond with valid JSON only."
    }

    # Note: response_format is not compatible with vision, so we ask for JSON in the prompt
    params = {
      model: effective_model,
      messages: [
        { role: "system", content: instructions + "\n\nIMPORTANT: Respond with valid JSON only, no markdown or other formatting." },
        { role: "user", content: content }
      ],
      max_tokens: 4096
    }

    response = client.chat(parameters: params)
    Rails.logger.info("Tokens used to process PDF via vision: #{response.dig("usage", "total_tokens")}")

    record_usage(
      effective_model,
      response.dig("usage"),
      operation: "process_pdf_vision",
      metadata: { pdf_size: pdf_content&.bytesize, pages: images_base64.size }
    )

    parse_response_generic(response)
  end

  def convert_pdf_to_images
    return [] if pdf_content.blank?

    Dir.mktmpdir do |tmpdir|
      pdf_path = File.join(tmpdir, "input.pdf")
      File.binwrite(pdf_path, pdf_content)

      # Convert PDF to PNG images using pdftoppm
      output_prefix = File.join(tmpdir, "page")
      converted = system("pdftoppm", "-png", "-r", "150", pdf_path, output_prefix)
      # system returns false on a non-zero exit and nil if the command could not be run
      Rails.logger.warn("pdftoppm failed or is not installed") unless converted

      # Read all generated images (empty when the conversion produced nothing)
      image_files = Dir.glob(File.join(tmpdir, "page-*.png")).sort
      image_files.map do |img_path|
        Base64.strict_encode64(File.binread(img_path))
      end
    end
  rescue => e
    Rails.logger.error("Failed to convert PDF to images: #{e.message}")
    []
  end

  def parse_response_generic(response)
    raw = response.dig("choices", 0, "message", "content")
    parsed = parse_json_flexibly(raw)
    build_result(parsed)
  end

  def build_result(parsed)
    PdfProcessingResult.new(
      summary: parsed["summary"],
      document_type: normalize_document_type(parsed["document_type"]),
      extracted_data: parsed["extracted_data"] || {}
    )
  end

  def normalize_document_type(doc_type)
    return "other" if doc_type.blank?

    normalized = doc_type.to_s.strip.downcase.gsub(/\s+/, "_")
    Import::DOCUMENT_TYPES.include?(normalized) ? normalized : "other"
  end

  def parse_json_flexibly(raw)
    return {} if raw.blank?

    # Try direct parse first
    JSON.parse(raw)
  rescue JSON::ParserError
    # Try to extract JSON from markdown code blocks
    if raw =~ /```(?:json)?\s*(\{[\s\S]*?\})\s*```/m
      begin
        return JSON.parse($1)
      rescue JSON::ParserError
        # Continue to next strategy
      end
    end

    # Try to find any JSON object
    if raw =~ /(\{[\s\S]*\})/m
      begin
        return JSON.parse($1)
      rescue JSON::ParserError
        # Fall through to error
      end
    end

    raise Provider::Openai::Error, "Could not parse JSON from PDF processing response: #{raw.truncate(200)}"
  end
end
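
The three parsing strategies in `parse_json_flexibly` (direct parse, fenced code block, first-to-last brace) can be tried outside Rails with a standalone approximation. The sketch below is not the class's method: `parse_json_loosely` is a hypothetical name, it raises `ArgumentError` instead of `Provider::Openai::Error`, and it uses plain-Ruby string checks in place of ActiveSupport's `blank?` and `truncate`.

```ruby
require "json"

# Standalone approximation of the fallback chain: try the raw string,
# then a fenced code block, then the widest {...} span in the text.
def parse_json_loosely(raw)
  return {} if raw.nil? || raw.strip.empty?

  candidates = [raw]
  candidates << Regexp.last_match(1) if raw =~ /```(?:json)?\s*(\{.*?\})\s*```/m
  candidates << Regexp.last_match(1) if raw =~ /(\{.*\})/m

  candidates.each do |candidate|
    begin
      return JSON.parse(candidate)
    rescue JSON::ParserError
      next # try the next, looser candidate
    end
  end

  raise ArgumentError, "Could not parse JSON: #{raw[0, 200]}"
end

fenced = "```json\n{\"document_type\": \"bank_statement\"}\n```"
chatty = "Here is the result: {\"document_type\": \"other\"} Hope this helps."

puts parse_json_loosely(fenced)["document_type"] # prints "bank_statement"
puts parse_json_loosely(chatty)["document_type"] # prints "other"
```

Ordering the candidates from strictest to loosest means a well-formed response never goes through the regex paths, which only matter when the model wraps its JSON in markdown or surrounding prose.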