feat/Add AI-Powered Bank Statement Import (step 1, PDF import & analysis) (#808)

* feat: Add PDF import with AI-powered document analysis

This enhances the import functionality to support PDF files with AI-powered
document analysis. When a PDF is uploaded, it is processed by AI to:
- Identify the document type (bank statement, credit card statement, etc.)
- Generate a summary of the document contents
- Extract key metadata (institution, dates, balances, transaction count)

After processing, an email is sent to the user asking for next steps.

Key changes:
- Add PdfImport model for handling PDF document imports
- Add Provider::Openai::PdfProcessor for AI document analysis
- Add ProcessPdfJob for async PDF processing
- Add PdfImportMailer for user notification emails
- Update imports controller to detect and handle PDF uploads
- Add PDF import option to the new import page
- Add i18n translations for all new strings
- Add comprehensive tests for the new functionality

* Add bank statement import with AI extraction

- Create ImportBankStatement assistant function for MCP
- Add BankStatementExtractor with chunked processing for small context windows
- Register function in assistant configurable
- Make PdfImport#pdf_file_content public for extractor access
- Increase OpenAI request timeout to 600s for slow local models
- Increase DB connection pool to 20 for concurrent operations

Tested with M-Pesa bank statement via remote Ollama (qwen3:8b):
- Successfully extracted 18 transactions
- Generated CSV and created TransactionImport
- Works with 3000 char chunks for small context windows

* Add pdf-reader gem dependency

The BankStatementExtractor uses PDF::Reader to parse bank statement
PDFs, but the gem was not properly declared in the Gemfile. This would
cause NameError in production when processing bank statements.

Added pdf-reader ~> 2.12 to Gemfile dependencies.

* Fix transaction deduplication to preserve legitimate duplicates

The previous deduplication logic removed ALL duplicate transactions based
on [date, amount, name], which would drop legitimate same-day duplicates
like multiple ATM withdrawals or card authorizations.

Changed to only deduplicate transactions that appear in consecutive chunks
(chunking artifacts) while preserving all legitimate duplicates within the
same chunk or non-adjacent chunks.
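The rule described above can be sketched in plain Ruby (a minimal sketch of the described behavior, not the PR's exact implementation):

```ruby
require "set"

# Minimal sketch of the rule above: a transaction is dropped only when the
# same [date, amount, name] tuple was already seen in an ADJACENT chunk
# (a chunking artifact). Duplicates within one chunk, or in non-adjacent
# chunks, are treated as legitimate and preserved.
def dedupe(transactions)
  seen = Set.new
  transactions.select do |t|
    key = [t[:date], t[:amount], t[:name], t[:chunk_index]]
    dup = seen.any? { |k| k[0..2] == key[0..2] && (k[3] - key[3]).abs == 1 }
    seen << key
    !dup
  end
end

atm = { date: "2026-01-05", amount: -50.0, name: "ATM Withdrawal" }
rows = [atm.merge(chunk_index: 0), atm.merge(chunk_index: 0),  # same chunk: both kept
        atm.merge(chunk_index: 1),                             # adjacent chunk: dropped
        atm.merge(chunk_index: 3)]                             # non-adjacent: kept
puts dedupe(rows).size  # → 3
```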

* Refactor bank statement extraction to use public provider method

Address code review feedback:
- Add public extract_bank_statement method to Provider::Openai
- Remove direct access to private client via send(:client)
- Update ImportBankStatement to use new public method
- Add require 'set' to BankStatementExtractor
- Remove PII-sensitive content from error logs
- Add defensive check for nil response.error
- Handle oversized PDF pages in chunking logic
- Remove unused process_native and process_generic methods
- Update email copy to reflect feature availability
- Add guard for nil document_type in email template
- Document pdf-reader gem rationale in Gemfile

Tested with both OpenAI (gpt-4o) and Ollama (qwen3:8b):
- OpenAI: 49 transactions extracted in 30s
- Ollama: 40 transactions extracted in 368s
- All encapsulation and error handling working correctly

* Update schema.rb with ai_summary and document_type columns

* Address PR #808 review comments

- Rename :csv_file to :import_file across controllers/views/tests
- Add PDF test fixture (sample_bank_statement.pdf)
- Add supports_pdf_processing? method for graceful degradation
- Revert unrelated database.yml pool change (600->3)
- Remove month_start_day schema bleed from other PR
- Fix PdfProcessor: use .strip instead of .strip_heredoc
- Add server-side PDF magic byte validation
- Conditionally show PDF import option when AI provider available
- Fix ProcessPdfJob: sanitize errors, handle update failure
- Move pdf_file attachment from Import to PdfImport
- Document deduplication logic limitations
- Fix ImportBankStatement: catch specific exceptions only
- Remove unnecessary require 'set'
- Remove dead json_schema method from PdfProcessor
- Reduce default OpenAI timeout from 600s to 60s
- Fix nil guard in text mailer template
- Add require 'csv' to ImportBankStatement
- Remove Gemfile pdf-reader comment
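One item in the list above — the server-side PDF magic-byte validation — can be sketched as follows (illustrative names; the PR's actual validation code is not shown in this excerpt). Well-formed PDF files begin with the ASCII bytes `%PDF-`:

```ruby
require "stringio"

# Hedged sketch of a server-side magic-byte check (the helper name is
# illustrative, not from the PR). Rejects uploads whose content does not
# start with the PDF signature, regardless of file extension or MIME type.
def looks_like_pdf?(io)
  io.rewind
  header = io.read(5)
  io.rewind
  header == "%PDF-"
end

puts looks_like_pdf?(StringIO.new("%PDF-1.7\n%..."))  # → true
puts looks_like_pdf?(StringIO.new("GIF89a..."))       # → false
```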

* Fix RuboCop indentation in ProcessPdfJob

* Refactor PDF import check to use model predicate method

Replace is_a?(PdfImport) type check with requires_csv_workflow? predicate
that leverages STI inheritance for cleaner controller logic.
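A minimal sketch of that predicate, using plain-Ruby stand-ins rather than the actual ActiveRecord models:

```ruby
# Plain-Ruby stand-ins for the STI hierarchy (illustrative, not the real models).
class Import
  # Default: imports go through the CSV column-mapping workflow.
  def requires_csv_workflow?
    true
  end
end

class PdfImport < Import
  # PDF imports are analyzed by AI instead, so they skip the CSV workflow.
  def requires_csv_workflow?
    false
  end
end

# The controller can branch on the predicate instead of is_a?(PdfImport):
[Import.new, PdfImport.new].each do |import|
  puts import.requires_csv_workflow? ? "csv workflow" : "pdf workflow"
end
```

Each subclass owns its answer, so adding another non-CSV import type later means overriding one method rather than growing a type check in the controller.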

* Fix missing 'unknown' locale key and schema version mismatch

- Add 'unknown: Unknown Document' to document_types locale
- Fix schema version to match latest migration (2026_01_24_180211)

* Document OPENAI_REQUEST_TIMEOUT env variable

Added to .env.local.example and docs/hosting/ai.md

* Rename ALLOWED_MIME_TYPES to ALLOWED_CSV_MIME_TYPES for clarity

* Add comment explaining requires_csv_workflow? predicate

* Remove redundant required_column_keys from PdfImport

Base class already returns [] by default

* Add ENV toggle to disable PDF processing for non-vision endpoints

OPENAI_SUPPORTS_PDF_PROCESSING=false can be used for OpenAI-compatible
endpoints (e.g., Ollama) that don't support vision/PDF processing.
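The toggle parsing can be sketched like this (plain Ruby; the provider's version in the diff uses ActiveSupport's `in?`, which is equivalent here):

```ruby
# Sketch of the truthy ENV toggle: enabled unless the variable is set to
# something other than true/1/yes (case-insensitive). Defaults to enabled.
def pdf_processing_enabled?
  %w[true 1 yes].include?(ENV.fetch("OPENAI_SUPPORTS_PDF_PROCESSING", "true").to_s.downcase)
end

ENV["OPENAI_SUPPORTS_PDF_PROCESSING"] = "false"
puts pdf_processing_enabled?  # → false
ENV.delete("OPENAI_SUPPORTS_PDF_PROCESSING")
puts pdf_processing_enabled?  # → true (unset defaults to enabled)
```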

* Wire up transaction extraction for PDF bank statements

- Add extracted_data JSONB column to imports
- Add extract_transactions method to PdfImport
- Call extraction in ProcessPdfJob for bank statements
- Store transactions in extracted_data for later review

* Fix ProcessPdfJob retry logic, sanitize and localize errors

- Allow retries after partial success (classification ok, extraction failed)
- Log sanitized error message instead of raw message to avoid data leakage
- Use i18n for user-facing error messages
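The partial-success retry idea can be illustrated with a hypothetical sketch (not the actual ProcessPdfJob, which is not shown in this excerpt): because the classification result is persisted before extraction runs, a retry resumes at the failed step instead of redoing the whole job.

```ruby
# Hypothetical sketch: persist each step's result in job state so a retry
# after a transient extraction failure skips the already-completed
# classification step. CALLS just counts invocations for demonstration.
CALLS = Hash.new(0)

def classify(_doc)
  CALLS[:classify] += 1
  "bank_statement"
end

def extract(_doc)
  CALLS[:extract] += 1
  raise "transient extraction failure" if CALLS[:extract] < 2
  [{ date: "2026-01-01", amount: -10.0, name: "ATM" }]
end

def process(state, doc)
  state[:classification] ||= classify(doc)  # skipped on retry once present
  state[:transactions]   ||= extract(doc)   # retried after transient failure
  state
end

state = {}
begin
  process(state, "statement text")
rescue RuntimeError
  process(state, "statement text")  # retry resumes at extraction
end
puts CALLS[:classify]           # → 1 (classification not re-run)
puts state[:transactions].size  # → 1
```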

* Add vision-capable model validation for PDF processing

* Fix drag-and-drop test to use correct field name csv_file

* Revert schema bleedover from another branch

* Fix drag-drop import form field name to match controller

* Add vision capability guard to process_pdf method

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: mkdev11 <jaysmth689+github@users.noreply.github.com>
Co-authored-by: Juan José Mata <jjmata@jjmata.com>
Committed by MkDev11 (via GitHub) on 2026-01-30 14:44:25 -05:00
Parent: 9f5fdd4d13 · Commit: 6f8858b1a6
37 changed files with 1388 additions and 25 deletions


@@ -13,6 +13,16 @@ module Provider::LlmConcept
raise NotImplementedError, "Subclasses must implement #auto_detect_merchants"
end
PdfProcessingResult = Data.define(:summary, :document_type, :extracted_data)
def supports_pdf_processing?
false
end
def process_pdf(pdf_content:, family: nil)
raise NotImplementedError, "Provider does not support PDF processing"
end
ChatMessage = Data.define(:id, :output_text)
ChatStreamChunk = Data.define(:type, :data, :usage)
ChatResponse = Data.define(:id, :model, :messages, :function_requests)


@@ -8,6 +8,9 @@ class Provider::Openai < Provider
DEFAULT_OPENAI_MODEL_PREFIXES = %w[gpt-4 gpt-5 o1 o3]
DEFAULT_MODEL = "gpt-4.1"
# Models that support PDF/vision input (not all OpenAI models have vision capabilities)
VISION_CAPABLE_MODEL_PREFIXES = %w[gpt-4o gpt-4-turbo gpt-4.1 gpt-5 o1 o3].freeze
# Returns the effective model that would be used by the provider
# Uses the same logic as Provider::Registry and the initializer
def self.effective_model
@@ -18,6 +21,7 @@ class Provider::Openai < Provider
def initialize(access_token, uri_base: nil, model: nil)
client_options = { access_token: access_token }
client_options[:uri_base] = uri_base if uri_base.present?
client_options[:request_timeout] = ENV.fetch("OPENAI_REQUEST_TIMEOUT", 60).to_i
@client = ::OpenAI::Client.new(**client_options)
@uri_base = uri_base
@@ -112,6 +116,65 @@ class Provider::Openai < Provider
end
end
# Can be disabled via ENV for OpenAI-compatible endpoints that don't support vision
# Only vision-capable models (gpt-4o, gpt-4-turbo, gpt-4.1, etc.) support PDF input
def supports_pdf_processing?
return false unless ENV.fetch("OPENAI_SUPPORTS_PDF_PROCESSING", "true").to_s.downcase.in?(%w[true 1 yes])
# Custom providers manage their own model capabilities
return true if custom_provider?
# Check if the configured model supports vision/PDF input
VISION_CAPABLE_MODEL_PREFIXES.any? { |prefix| @default_model.start_with?(prefix) }
end
def process_pdf(pdf_content:, model: "", family: nil)
raise "Model does not support PDF/vision processing" unless supports_pdf_processing?
with_provider_response do
effective_model = model.presence || @default_model
trace = create_langfuse_trace(
name: "openai.process_pdf",
input: { pdf_size: pdf_content&.bytesize }
)
result = PdfProcessor.new(
client,
model: effective_model,
pdf_content: pdf_content,
custom_provider: custom_provider?,
langfuse_trace: trace,
family: family
).process
trace&.update(output: result.to_h)
result
end
end
def extract_bank_statement(pdf_content:, model: "", family: nil)
with_provider_response do
effective_model = model.presence || @default_model
trace = create_langfuse_trace(
name: "openai.extract_bank_statement",
input: { pdf_size: pdf_content&.bytesize }
)
result = BankStatementExtractor.new(
client: client,
pdf_content: pdf_content,
model: effective_model
).extract
trace&.update(output: { transaction_count: result[:transactions].size })
result
end
end
def chat_response(
prompt,
model:,


@@ -0,0 +1,213 @@
class Provider::Openai::BankStatementExtractor
MAX_CHARS_PER_CHUNK = 3000
attr_reader :client, :pdf_content, :model
def initialize(client:, pdf_content:, model:)
@client = client
@pdf_content = pdf_content
@model = model
end
def extract
pages = extract_pages_from_pdf
raise Provider::Openai::Error, "Could not extract text from PDF" if pages.empty?
chunks = build_chunks(pages)
Rails.logger.info("BankStatementExtractor: Processing #{chunks.size} chunk(s) from #{pages.size} page(s)")
all_transactions = []
metadata = {}
chunks.each_with_index do |chunk, index|
Rails.logger.info("BankStatementExtractor: Processing chunk #{index + 1}/#{chunks.size}")
result = process_chunk(chunk, index == 0)
# Tag transactions with chunk index for deduplication
tagged_transactions = (result[:transactions] || []).map { |t| t.merge(chunk_index: index) }
all_transactions.concat(tagged_transactions)
if index == 0
metadata = {
account_holder: result[:account_holder],
account_number: result[:account_number],
bank_name: result[:bank_name],
opening_balance: result[:opening_balance],
closing_balance: result[:closing_balance],
period: result[:period]
}
end
if result[:closing_balance].present?
metadata[:closing_balance] = result[:closing_balance]
end
if result.dig(:period, :end_date).present?
metadata[:period] ||= {}
metadata[:period][:end_date] = result.dig(:period, :end_date)
end
end
{
transactions: deduplicate_transactions(all_transactions),
period: metadata[:period] || {},
account_holder: metadata[:account_holder],
account_number: metadata[:account_number],
bank_name: metadata[:bank_name],
opening_balance: metadata[:opening_balance],
closing_balance: metadata[:closing_balance]
}
end
private
def extract_pages_from_pdf
return [] if pdf_content.blank?
reader = PDF::Reader.new(StringIO.new(pdf_content))
reader.pages.map(&:text).reject(&:blank?)
rescue => e
Rails.logger.error("Failed to extract text from PDF: #{e.message}")
[]
end
def build_chunks(pages)
chunks = []
current_chunk = []
current_size = 0
pages.each do |page_text|
if page_text.length > MAX_CHARS_PER_CHUNK
chunks << current_chunk.join("\n\n") if current_chunk.any?
current_chunk = []
current_size = 0
chunks << page_text
next
end
if current_size + page_text.length > MAX_CHARS_PER_CHUNK && current_chunk.any?
chunks << current_chunk.join("\n\n")
current_chunk = []
current_size = 0
end
current_chunk << page_text
current_size += page_text.length
end
chunks << current_chunk.join("\n\n") if current_chunk.any?
chunks
end
def process_chunk(text, is_first_chunk)
params = {
model: model,
messages: [
{ role: "system", content: is_first_chunk ? instructions_with_metadata : instructions_transactions_only },
{ role: "user", content: "Extract transactions:\n\n#{text}" }
],
response_format: { type: "json_object" }
}
response = client.chat(parameters: params)
content = response.dig("choices", 0, "message", "content")
raise Provider::Openai::Error, "No response from AI" if content.blank?
parsed = parse_json_response(content)
{
transactions: normalize_transactions(parsed["transactions"] || []),
period: {
start_date: parsed.dig("statement_period", "start_date"),
end_date: parsed.dig("statement_period", "end_date")
},
account_holder: parsed["account_holder"],
account_number: parsed["account_number"],
bank_name: parsed["bank_name"],
opening_balance: parsed["opening_balance"],
closing_balance: parsed["closing_balance"]
}
end
def parse_json_response(content)
cleaned = content.gsub(%r{^```json\s*}i, "").gsub(/```\s*$/, "").strip
JSON.parse(cleaned)
rescue JSON::ParserError => e
Rails.logger.error("BankStatementExtractor JSON parse error: #{e.message} (content_length=#{content.to_s.bytesize})")
{ "transactions" => [] }
end
def deduplicate_transactions(transactions)
# Deduplicates transactions that appear in consecutive chunks (chunking artifacts).
#
# KNOWN LIMITATION: Legitimate duplicate transactions (same date, amount, merchant)
# that happen to appear in adjacent chunks will be incorrectly deduplicated.
# This is an acceptable trade-off since chunking artifacts are more common than
# true same-day duplicates at chunk boundaries. Transactions within the same
# chunk are always preserved regardless of similarity.
seen = Set.new
transactions.select do |t|
# Key includes chunk_index so adjacency can be checked against prior sightings
key = [ t[:date], t[:amount], t[:name], t[:chunk_index] ]
# Drop only if the same [date, amount, name] appeared in an ADJACENT chunk;
# abs == 1 preserves same-chunk duplicates (abs == 0) and non-adjacent ones
duplicate = seen.any? do |prev_key|
prev_key[0..2] == key[0..2] && (prev_key[3] - key[3]).abs == 1
end
seen << key
!duplicate
end.map { |t| t.except(:chunk_index) }
end
def normalize_transactions(transactions)
transactions.map do |txn|
{
date: parse_date(txn["date"]),
amount: parse_amount(txn["amount"]),
name: txn["description"] || txn["name"] || txn["merchant"],
category: infer_category(txn),
notes: txn["reference"] || txn["notes"]
}
end.compact
end
def parse_date(date_str)
return nil if date_str.blank?
Date.parse(date_str).strftime("%Y-%m-%d")
rescue ArgumentError, TypeError
nil
end
def parse_amount(amount)
return nil if amount.nil?
if amount.is_a?(Numeric)
amount.to_f
else
amount.to_s.gsub(/[^0-9.\-]/, "").to_f
end
end
def infer_category(txn)
txn["category"] || txn["type"]
end
def instructions_with_metadata
<<~INSTRUCTIONS.strip
Extract bank statement data as JSON. Return:
{"bank_name":"...","account_holder":"...","account_number":"last 4 digits","statement_period":{"start_date":"YYYY-MM-DD","end_date":"YYYY-MM-DD"},"opening_balance":0.00,"closing_balance":0.00,"transactions":[{"date":"YYYY-MM-DD","description":"...","amount":-0.00}]}
Rules: Negative amounts for debits/expenses, positive for credits/deposits. Dates as YYYY-MM-DD. Extract ALL transactions. JSON only, no markdown.
INSTRUCTIONS
end
def instructions_transactions_only
<<~INSTRUCTIONS.strip
Extract transactions from bank statement text as JSON. Return:
{"transactions":[{"date":"YYYY-MM-DD","description":"...","amount":-0.00}]}
Rules: Negative amounts for debits/expenses, positive for credits/deposits. Dates as YYYY-MM-DD. Extract ALL transactions. JSON only, no markdown.
INSTRUCTIONS
end
end


@@ -0,0 +1,265 @@
class Provider::Openai::PdfProcessor
include Provider::Openai::Concerns::UsageRecorder
attr_reader :client, :model, :pdf_content, :custom_provider, :langfuse_trace, :family
def initialize(client, model: "", pdf_content: nil, custom_provider: false, langfuse_trace: nil, family: nil)
@client = client
@model = model
@pdf_content = pdf_content
@custom_provider = custom_provider
@langfuse_trace = langfuse_trace
@family = family
end
def process
span = langfuse_trace&.span(name: "process_pdf_api_call", input: {
model: model.presence || Provider::Openai::DEFAULT_MODEL,
pdf_size: pdf_content&.bytesize
})
# Try text extraction first (works with all models)
# Fall back to vision API with images if text extraction fails (for scanned PDFs)
response = begin
process_with_text_extraction
rescue Provider::Openai::Error => e
Rails.logger.warn("Text extraction failed: #{e.message}, trying vision API with images")
process_with_vision
end
span&.end(output: response.to_h)
response
rescue => e
span&.end(output: { error: e.message }, level: "ERROR")
raise
end
def instructions
<<~INSTRUCTIONS.strip
You are a financial document analysis assistant. Your job is to analyze uploaded PDF documents
and provide a structured summary of what the document contains.
For each document, you must determine:
1. **Document Type**: Classify the document as one of the following:
- `bank_statement`: A bank account statement showing transactions, balances, and account activity
- `credit_card_statement`: A credit card statement showing charges, payments, and balances
- `investment_statement`: An investment/brokerage statement showing holdings, trades, or portfolio performance
- `financial_document`: General financial documents like tax forms, receipts, invoices, or financial reports
- `contract`: Legal agreements, loan documents, terms of service, or policy documents
- `other`: Any document that doesn't fit the above categories
2. **Summary**: Provide a concise summary of the document that includes:
- The issuing institution or company name (if identifiable)
- The date range or statement period (if applicable)
- Key financial figures (account balances, total transactions, etc.)
- The account holder's name (if visible, use "Account Holder" if redacted)
- Any notable items or important information
3. **Extracted Data**: If the document is a statement with transactions, extract key metadata:
- Number of transactions (if countable)
- Statement period (start and end dates)
- Opening and closing balances (if visible)
- Currency used
IMPORTANT GUIDELINES:
- Be factual and precise - only report what you can clearly see in the document
- If information is unclear or redacted, note it as "not clearly visible" or "redacted"
- Do NOT make assumptions about data you cannot see
- For statements with many transactions, provide a count rather than listing each one
- Focus on providing actionable information that helps the user understand what they uploaded
- If the document is unreadable or the PDF is corrupted, indicate this clearly
Respond with ONLY valid JSON in this exact format (no markdown code blocks, no other text):
{
"document_type": "bank_statement|credit_card_statement|investment_statement|financial_document|contract|other",
"summary": "A clear, concise summary of the document contents...",
"extracted_data": {
"institution_name": "Name of bank/company or null",
"statement_period_start": "YYYY-MM-DD or null",
"statement_period_end": "YYYY-MM-DD or null",
"transaction_count": number or null,
"opening_balance": number or null,
"closing_balance": number or null,
"currency": "USD/EUR/etc or null",
"account_holder": "Name or null"
}
}
INSTRUCTIONS
end
private
PdfProcessingResult = Provider::LlmConcept::PdfProcessingResult
def process_with_text_extraction
effective_model = model.presence || Provider::Openai::DEFAULT_MODEL
# Extract text from PDF using pdf-reader gem
pdf_text = extract_text_from_pdf
raise Provider::Openai::Error, "Could not extract text from PDF" if pdf_text.blank?
# Truncate if too long (max ~100k chars to stay within token limits)
pdf_text = pdf_text.truncate(100_000) if pdf_text.length > 100_000
params = {
model: effective_model,
messages: [
{ role: "system", content: instructions },
{
role: "user",
content: "Please analyze the following document text and provide a structured summary:\n\n#{pdf_text}"
}
],
response_format: { type: "json_object" }
}
response = client.chat(parameters: params)
Rails.logger.info("Tokens used to process PDF: #{response.dig("usage", "total_tokens")}")
record_usage(
effective_model,
response.dig("usage"),
operation: "process_pdf",
metadata: { pdf_size: pdf_content&.bytesize }
)
parse_response_generic(response)
end
def extract_text_from_pdf
return nil if pdf_content.blank?
reader = PDF::Reader.new(StringIO.new(pdf_content))
text_parts = []
reader.pages.each_with_index do |page, index|
text_parts << "--- Page #{index + 1} ---"
text_parts << page.text
end
text_parts.join("\n\n")
rescue => e
Rails.logger.error("Failed to extract text from PDF: #{e.message}")
nil
end
def process_with_vision
effective_model = model.presence || Provider::Openai::DEFAULT_MODEL
# Convert PDF to images using pdftoppm
images_base64 = convert_pdf_to_images
raise Provider::Openai::Error, "Could not convert PDF to images" if images_base64.blank?
# Build message content with images (max 5 pages to avoid token limits)
content = []
images_base64.first(5).each do |img_base64|
content << {
type: "image_url",
image_url: {
url: "data:image/png;base64,#{img_base64}",
detail: "low"
}
}
end
content << {
type: "text",
text: "Please analyze this PDF document (#{images_base64.size} pages total, showing first #{[ images_base64.size, 5 ].min}) and respond with valid JSON only."
}
# Note: response_format is not compatible with vision, so we ask for JSON in the prompt
params = {
model: effective_model,
messages: [
{ role: "system", content: instructions + "\n\nIMPORTANT: Respond with valid JSON only, no markdown or other formatting." },
{ role: "user", content: content }
],
max_tokens: 4096
}
response = client.chat(parameters: params)
Rails.logger.info("Tokens used to process PDF via vision: #{response.dig("usage", "total_tokens")}")
record_usage(
effective_model,
response.dig("usage"),
operation: "process_pdf_vision",
metadata: { pdf_size: pdf_content&.bytesize, pages: images_base64.size }
)
parse_response_generic(response)
end
def convert_pdf_to_images
return [] if pdf_content.blank?
Dir.mktmpdir do |tmpdir|
pdf_path = File.join(tmpdir, "input.pdf")
File.binwrite(pdf_path, pdf_content)
# Convert PDF to PNG images using pdftoppm
output_prefix = File.join(tmpdir, "page")
system("pdftoppm", "-png", "-r", "150", pdf_path, output_prefix)
# Read all generated images
image_files = Dir.glob(File.join(tmpdir, "page-*.png")).sort
image_files.map do |img_path|
Base64.strict_encode64(File.binread(img_path))
end
end
rescue => e
Rails.logger.error("Failed to convert PDF to images: #{e.message}")
[]
end
def parse_response_generic(response)
raw = response.dig("choices", 0, "message", "content")
parsed = parse_json_flexibly(raw)
build_result(parsed)
end
def build_result(parsed)
PdfProcessingResult.new(
summary: parsed["summary"],
document_type: normalize_document_type(parsed["document_type"]),
extracted_data: parsed["extracted_data"] || {}
)
end
def normalize_document_type(doc_type)
return "other" if doc_type.blank?
normalized = doc_type.to_s.strip.downcase.gsub(/\s+/, "_")
Import::DOCUMENT_TYPES.include?(normalized) ? normalized : "other"
end
def parse_json_flexibly(raw)
return {} if raw.blank?
# Try direct parse first
JSON.parse(raw)
rescue JSON::ParserError
# Try to extract JSON from markdown code blocks
if raw =~ /```(?:json)?\s*(\{[\s\S]*?\})\s*```/m
begin
return JSON.parse($1)
rescue JSON::ParserError
# Continue to next strategy
end
end
# Try to find any JSON object
if raw =~ /(\{[\s\S]*\})/m
begin
return JSON.parse($1)
rescue JSON::ParserError
# Fall through to error
end
end
raise Provider::Openai::Error, "Could not parse JSON from PDF processing response: #{raw.truncate(200)}"
end
end