mirror of
https://github.com/we-promise/sure.git
synced 2026-04-07 14:31:25 +00:00
* feat: Add PDF import with AI-powered document analysis

  This enhances the import functionality to support PDF files with AI-powered document analysis. When a PDF is uploaded, it is processed by AI to:
  - Identify the document type (bank statement, credit card statement, etc.)
  - Generate a summary of the document contents
  - Extract key metadata (institution, dates, balances, transaction count)

  After processing, an email is sent to the user asking for next steps.

  Key changes:
  - Add PdfImport model for handling PDF document imports
  - Add Provider::Openai::PdfProcessor for AI document analysis
  - Add ProcessPdfJob for async PDF processing
  - Add PdfImportMailer for user notification emails
  - Update imports controller to detect and handle PDF uploads
  - Add PDF import option to the new import page
  - Add i18n translations for all new strings
  - Add comprehensive tests for the new functionality

* Add bank statement import with AI extraction

  - Create ImportBankStatement assistant function for MCP
  - Add BankStatementExtractor with chunked processing for small context windows
  - Register function in assistant configurable
  - Make PdfImport#pdf_file_content public for extractor access
  - Increase OpenAI request timeout to 600s for slow local models
  - Increase DB connection pool to 20 for concurrent operations

  Tested with M-Pesa bank statement via remote Ollama (qwen3:8b):
  - Successfully extracted 18 transactions
  - Generated CSV and created TransactionImport
  - Works with 3000 char chunks for small context windows

* Add pdf-reader gem dependency

  The BankStatementExtractor uses PDF::Reader to parse bank statement PDFs, but the gem was not properly declared in the Gemfile. This would cause NameError in production when processing bank statements. Added pdf-reader ~> 2.12 to Gemfile dependencies.
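The chunked processing for small context windows mentioned above can be sketched roughly as follows. This is a hypothetical illustration, not the actual BankStatementExtractor code: `chunk_text` is an invented helper that splits statement text on line boundaries into ~3000-character chunks so no transaction row is cut in half.

```ruby
# Hypothetical sketch of chunking statement text for a small context window.
# NOT the actual BankStatementExtractor implementation — just an illustration
# of splitting on line boundaries so each chunk stays under the budget.
# (A single line longer than max_chars would still produce an oversized chunk;
# the real extractor handles that case separately.)
def chunk_text(text, max_chars: 3000)
  chunks = []
  current = +""

  text.each_line do |line|
    # Start a new chunk if adding this line would exceed the budget
    if current.length + line.length > max_chars && !current.empty?
      chunks << current
      current = +""
    end
    current << line
  end

  chunks << current unless current.empty?
  chunks
end

# Made-up sample resembling CSV-ish statement rows
sample = (1..200).map { |i| "2026-01-#{format('%02d', (i % 28) + 1)},TXN #{i},-12.34\n" }.join
chunks = chunk_text(sample, max_chars: 3000)
```

Each chunk can then be sent to the model independently, which is what makes small-context local models like qwen3:8b workable.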
* Fix transaction deduplication to preserve legitimate duplicates

  The previous deduplication logic removed ALL duplicate transactions based on [date, amount, name], which would drop legitimate same-day duplicates like multiple ATM withdrawals or card authorizations. Changed to only deduplicate transactions that appear in consecutive chunks (chunking artifacts) while preserving all legitimate duplicates within the same chunk or non-adjacent chunks.

* Refactor bank statement extraction to use public provider method

  Address code review feedback:
  - Add public extract_bank_statement method to Provider::Openai
  - Remove direct access to private client via send(:client)
  - Update ImportBankStatement to use new public method
  - Add require 'set' to BankStatementExtractor
  - Remove PII-sensitive content from error logs
  - Add defensive check for nil response.error
  - Handle oversized PDF pages in chunking logic
  - Remove unused process_native and process_generic methods
  - Update email copy to reflect feature availability
  - Add guard for nil document_type in email template
  - Document pdf-reader gem rationale in Gemfile

  Tested with both OpenAI (gpt-4o) and Ollama (qwen3:8b):
  - OpenAI: 49 transactions extracted in 30s
  - Ollama: 40 transactions extracted in 368s
  - All encapsulation and error handling working correctly

* Update schema.rb with ai_summary and document_type columns

* Address PR #808 review comments

  - Rename :csv_file to :import_file across controllers/views/tests
  - Add PDF test fixture (sample_bank_statement.pdf)
  - Add supports_pdf_processing? method for graceful degradation
  - Revert unrelated database.yml pool change (600->3)
  - Remove month_start_day schema bleed from other PR
  - Fix PdfProcessor: use .strip instead of .strip_heredoc
  - Add server-side PDF magic byte validation
  - Conditionally show PDF import option when AI provider available
  - Fix ProcessPdfJob: sanitize errors, handle update failure
  - Move pdf_file attachment from Import to PdfImport
  - Document deduplication logic limitations
  - Fix ImportBankStatement: catch specific exceptions only
  - Remove unnecessary require 'set'
  - Remove dead json_schema method from PdfProcessor
  - Reduce default OpenAI timeout from 600s to 60s
  - Fix nil guard in text mailer template
  - Add require 'csv' to ImportBankStatement
  - Remove Gemfile pdf-reader comment

* Fix RuboCop indentation in ProcessPdfJob

* Refactor PDF import check to use model predicate method

  Replace is_a?(PdfImport) type check with requires_csv_workflow? predicate that leverages STI inheritance for cleaner controller logic.

* Fix missing 'unknown' locale key and schema version mismatch

  - Add 'unknown: Unknown Document' to document_types locale
  - Fix schema version to match latest migration (2026_01_24_180211)

* Document OPENAI_REQUEST_TIMEOUT env variable

  Added to .env.local.example and docs/hosting/ai.md

* Rename ALLOWED_MIME_TYPES to ALLOWED_CSV_MIME_TYPES for clarity

* Add comment explaining requires_csv_workflow? predicate

* Remove redundant required_column_keys from PdfImport

  Base class already returns [] by default

* Add ENV toggle to disable PDF processing for non-vision endpoints

  OPENAI_SUPPORTS_PDF_PROCESSING=false can be used for OpenAI-compatible endpoints (e.g., Ollama) that don't support vision/PDF processing.
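An ENV toggle like OPENAI_SUPPORTS_PDF_PROCESSING is typically read as "enabled unless explicitly disabled". The sketch below is illustrative only — the helper name mirrors the `supports_pdf_processing?` method mentioned above but is not the app's actual code, and a real Rails app might use ActiveModel::Type::Boolean instead of a plain string comparison:

```ruby
# Hypothetical reading of the OPENAI_SUPPORTS_PDF_PROCESSING env toggle.
# Defaults to enabled; only an explicit "false" (any case) disables PDF
# processing — useful for OpenAI-compatible endpoints without vision
# support, e.g. Ollama.
def supports_pdf_processing?(env = ENV)
  env.fetch("OPENAI_SUPPORTS_PDF_PROCESSING", "true").to_s.downcase != "false"
end
```

Defaulting to enabled keeps existing deployments working; only operators pointing at a non-vision endpoint need to set the variable.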
* Wire up transaction extraction for PDF bank statements

  - Add extracted_data JSONB column to imports
  - Add extract_transactions method to PdfImport
  - Call extraction in ProcessPdfJob for bank statements
  - Store transactions in extracted_data for later review

* Fix ProcessPdfJob retry logic, sanitize and localize errors

  - Allow retries after partial success (classification ok, extraction failed)
  - Log sanitized error message instead of raw message to avoid data leakage
  - Use i18n for user-facing error messages

* Add vision-capable model validation for PDF processing

* Fix drag-and-drop test to use correct field name csv_file

* Schema bleedover from another branch

* Fix drag-drop import form field name to match controller

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: mkdev11 <jaysmth689+github@users.noreply.github.com>
Co-authored-by: Juan José Mata <jjmata@jjmata.com>
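The consecutive-chunk deduplication described earlier in this change log — drop a transaction only when the same [date, amount, name] key appeared in the immediately preceding chunk — can be sketched like this. This is a hypothetical illustration, not the BankStatementExtractor's actual code:

```ruby
# Hypothetical sketch of consecutive-chunk deduplication. Duplicates within
# a single chunk are kept (legitimate same-day duplicates like repeated ATM
# withdrawals); a transaction is dropped only if its [date, amount, name]
# key also appeared in the immediately preceding chunk, which indicates a
# chunk-boundary artifact. Non-adjacent chunk duplicates are also kept.
def dedupe_consecutive_chunks(chunks)
  result = []
  previous_keys = []

  chunks.each do |transactions|
    current_keys = transactions.map { |t| t.values_at(:date, :amount, :name) }

    transactions.each do |t|
      key = t.values_at(:date, :amount, :name)
      result << t unless previous_keys.include?(key)
    end

    previous_keys = current_keys
  end

  result
end

atm = { date: "2026-01-01", amount: "-20.00", name: "ATM" }
chunks = [
  [ atm, atm.dup ],                                                      # same-chunk duplicate: both kept
  [ atm.dup, { date: "2026-01-02", amount: "-5.00", name: "Coffee" } ]   # boundary duplicate: dropped
]
deduped = dedupe_consecutive_chunks(chunks)
```

As the commit notes document, this is a heuristic: a legitimate duplicate that happens to straddle a chunk boundary would still be dropped, which is the documented limitation of the approach.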
409 lines
12 KiB
Ruby
class Import < ApplicationRecord
  MaxRowCountExceededError = Class.new(StandardError)
  MappingError = Class.new(StandardError)

  MAX_CSV_SIZE = 10.megabytes
  MAX_PDF_SIZE = 25.megabytes
  ALLOWED_CSV_MIME_TYPES = %w[text/csv text/plain application/vnd.ms-excel application/csv].freeze
  ALLOWED_PDF_MIME_TYPES = %w[application/pdf].freeze

  DOCUMENT_TYPES = %w[bank_statement credit_card_statement investment_statement financial_document contract other].freeze

  TYPES = %w[TransactionImport TradeImport AccountImport MintImport CategoryImport RuleImport PdfImport].freeze
  SIGNAGE_CONVENTIONS = %w[inflows_positive inflows_negative].freeze
  SEPARATORS = [ [ "Comma (,)", "," ], [ "Semicolon (;)", ";" ] ].freeze

  NUMBER_FORMATS = {
    "1,234.56" => { separator: ".", delimiter: "," }, # US/UK/Asia
    "1.234,56" => { separator: ",", delimiter: "." }, # Most of Europe
    "1 234,56" => { separator: ",", delimiter: " " }, # French/Scandinavian
    "1,234" => { separator: "", delimiter: "," } # Zero-decimal currencies like JPY
  }.freeze

  AMOUNT_TYPE_STRATEGIES = %w[signed_amount custom_column].freeze

  belongs_to :family
  belongs_to :account, optional: true

  before_validation :set_default_number_format
  before_validation :ensure_utf8_encoding

  scope :ordered, -> { order(created_at: :desc) }

  enum :status, {
    pending: "pending",
    complete: "complete",
    importing: "importing",
    reverting: "reverting",
    revert_failed: "revert_failed",
    failed: "failed"
  }, validate: true, default: "pending"

  validates :type, inclusion: { in: TYPES }
  validates :amount_type_strategy, inclusion: { in: AMOUNT_TYPE_STRATEGIES }
  validates :col_sep, inclusion: { in: SEPARATORS.map(&:last) }
  validates :signage_convention, inclusion: { in: SIGNAGE_CONVENTIONS }, allow_nil: true
  validates :number_format, presence: true, inclusion: { in: NUMBER_FORMATS.keys }
  validate :custom_column_import_requires_identifier
  validates :rows_to_skip, numericality: { only_integer: true, greater_than_or_equal_to: 0 }
  validate :account_belongs_to_family
  validate :rows_to_skip_within_file_bounds

  has_many :rows, dependent: :destroy
  has_many :mappings, dependent: :destroy
  has_many :accounts, dependent: :destroy
  has_many :entries, dependent: :destroy

  class << self
    def parse_csv_str(csv_str, col_sep: ",")
      CSV.parse(
        (csv_str || "").strip,
        headers: true,
        col_sep: col_sep,
        converters: [ ->(str) { str&.strip } ],
        liberal_parsing: true
      )
    end
  end

  def publish_later
    raise MaxRowCountExceededError if row_count_exceeded?
    raise "Import is not publishable" unless publishable?

    update! status: :importing

    ImportJob.perform_later(self)
  end

  def publish
    raise MaxRowCountExceededError if row_count_exceeded?

    import!

    family.sync_later

    update! status: :complete
  rescue => error
    update! status: :failed, error: error.message
  end

  def revert_later
    raise "Import is not revertable" unless revertable?

    update! status: :reverting

    RevertImportJob.perform_later(self)
  end

  def revert
    Import.transaction do
      accounts.destroy_all
      entries.destroy_all
    end

    family.sync_later

    update! status: :pending
  rescue => error
    update! status: :revert_failed, error: error.message
  end

  def csv_rows
    @csv_rows ||= parsed_csv
  end

  def csv_headers
    parsed_csv.headers
  end

  def csv_sample
    @csv_sample ||= parsed_csv.first(2)
  end

  def dry_run
    mappings = {
      transactions: rows_count,
      categories: Import::CategoryMapping.for_import(self).creational.count,
      tags: Import::TagMapping.for_import(self).creational.count
    }

    # merge! (not merge) so the accounts count is actually added to the hash
    mappings.merge!(
      accounts: Import::AccountMapping.for_import(self).creational.count
    ) if account.nil?

    mappings
  end

  def required_column_keys
    []
  end

  # Returns false for import types that don't need CSV column mapping (e.g., PdfImport).
  # Override in subclasses that handle data extraction differently.
  def requires_csv_workflow?
    true
  end

  # Subclasses that require CSV workflow must override this.
  # Non-CSV imports (e.g., PdfImport) can return [].
  def column_keys
    raise NotImplementedError, "Subclass must implement column_keys"
  end

  def generate_rows_from_csv
    rows.destroy_all

    mapped_rows = csv_rows.map do |row|
      {
        account: row[account_col_label].to_s,
        date: row[date_col_label].to_s,
        qty: sanitize_number(row[qty_col_label]).to_s,
        ticker: row[ticker_col_label].to_s,
        exchange_operating_mic: row[exchange_operating_mic_col_label].to_s,
        price: sanitize_number(row[price_col_label]).to_s,
        amount: sanitize_number(row[amount_col_label]).to_s,
        currency: (row[currency_col_label] || default_currency).to_s,
        name: (row[name_col_label] || default_row_name).to_s,
        category: row[category_col_label].to_s,
        tags: row[tags_col_label].to_s,
        entity_type: row[entity_type_col_label].to_s,
        notes: row[notes_col_label].to_s
      }
    end

    rows.insert_all!(mapped_rows)
    update_column(:rows_count, rows.count)
  end

  def sync_mappings
    transaction do
      mapping_steps.each do |mapping_class|
        mappables_by_key = mapping_class.mappables_by_key(self)

        updated_mappings = mappables_by_key.map do |key, mappable|
          mapping = mappings.find_or_initialize_by(key: key, import: self, type: mapping_class.name)
          mapping.mappable = mappable
          mapping.create_when_empty = key.present? && mappable.nil?
          mapping
        end

        updated_mappings.each { |m| m.save(validate: false) }
        mapping_class.where.not(id: updated_mappings.map(&:id)).destroy_all
      end
    end
  end

  def mapping_steps
    []
  end

  def uploaded?
    raw_file_str.present?
  end

  def configured?
    uploaded? && rows_count > 0
  end

  def cleaned?
    configured? && rows.all?(&:valid?)
  end

  def publishable?
    cleaned? && mappings.all?(&:valid?)
  end

  def revertable?
    complete? || revert_failed?
  end

  def has_unassigned_account?
    mappings.accounts.where(key: "").any?
  end

  def requires_account?
    family.accounts.empty? && has_unassigned_account?
  end

  # Used to optionally pre-fill the configuration for the current import
  def suggested_template
    family.imports
      .complete
      .where(account: account, type: type)
      .order(created_at: :desc)
      .first
  end

  def apply_template!(import_template)
    update!(
      import_template.attributes.slice(
        "date_col_label", "amount_col_label", "name_col_label",
        "category_col_label", "tags_col_label", "account_col_label",
        "qty_col_label", "ticker_col_label", "price_col_label",
        "entity_type_col_label", "notes_col_label", "currency_col_label",
        "date_format", "signage_convention", "number_format",
        "exchange_operating_mic_col_label",
        "rows_to_skip"
      )
    )
  end

  def max_row_count
    10000
  end

  private
    def row_count_exceeded?
      rows_count > max_row_count
    end

    def import!
      # no-op, subclasses can implement for customization of algorithm
    end

    def default_row_name
      "Imported item"
    end

    def default_currency
      account&.currency || family.currency
    end

    def parsed_csv
      return @parsed_csv if defined?(@parsed_csv)

      csv_content = raw_file_str || ""
      if rows_to_skip.to_i > 0
        csv_content = csv_content.lines.drop(rows_to_skip).join
      end

      @parsed_csv = self.class.parse_csv_str(csv_content, col_sep: col_sep)
    end

    def sanitize_number(value)
      return "" if value.nil?

      format = NUMBER_FORMATS[number_format]
      return "" unless format

      # First, normalize spaces and remove any characters that aren't numbers, delimiters, separators, or minus signs
      sanitized = value.to_s.strip

      # Handle French/Scandinavian format specially
      if format[:delimiter] == " "
        sanitized = sanitized.gsub(/\s+/, "") # Remove all spaces first
      else
        sanitized = sanitized.gsub(/[^\d#{Regexp.escape(format[:delimiter])}#{Regexp.escape(format[:separator])}\-]/, "")

        # Replace delimiter with empty string
        if format[:delimiter].present?
          sanitized = sanitized.gsub(format[:delimiter], "")
        end
      end

      # Replace separator with period for proper float parsing
      if format[:separator].present?
        sanitized = sanitized.gsub(format[:separator], ".")
      end

      # Return empty string if not a valid number
      unless sanitized =~ /\A-?\d+\.?\d*\z/
        return ""
      end

      sanitized
    end

    def set_default_number_format
      self.number_format ||= "1,234.56" # Default to US/UK format
    end

    def custom_column_import_requires_identifier
      return unless amount_type_strategy == "custom_column"

      if amount_type_inflow_value.blank?
        errors.add(:base, I18n.t("imports.errors.custom_column_requires_inflow"))
      end
    end

    # Common encodings to try when UTF-8 detection fails
    # Windows-1250 is prioritized for Central/Eastern European languages
    COMMON_ENCODINGS = [ "Windows-1250", "Windows-1252", "ISO-8859-1", "ISO-8859-2" ].freeze

    def ensure_utf8_encoding
      # Handle nil or empty string first (before checking if changed)
      return if raw_file_str.nil? || raw_file_str.bytesize == 0

      # Only process if the attribute was changed.
      # will_save_change_to_attribute? is safer for binary data.
      return unless will_save_change_to_raw_file_str?

      # If already valid UTF-8, nothing to do
      begin
        if raw_file_str.encoding == Encoding::UTF_8 && raw_file_str.valid_encoding?
          return
        end
      rescue ArgumentError
        # raw_file_str might have invalid encoding, continue to detection
      end

      # Detect encoding using rchardet
      begin
        require "rchardet"
        detection = CharDet.detect(raw_file_str)
        detected_encoding = detection["encoding"]
        confidence = detection["confidence"]

        # Only convert if we have reasonable confidence in the detection
        if detected_encoding && confidence > 0.75
          # Force encoding and convert to UTF-8
          self.raw_file_str = raw_file_str.force_encoding(detected_encoding).encode("UTF-8", invalid: :replace, undef: :replace)
        else
          # Fallback: try common encodings
          try_common_encodings
        end
      rescue LoadError
        # rchardet not available, fall back to trying common encodings
        try_common_encodings
      rescue ArgumentError, Encoding::CompatibilityError
        # Handle encoding errors by falling back to common encodings
        try_common_encodings
      end
    end

    def try_common_encodings
      COMMON_ENCODINGS.each do |encoding|
        begin
          test = raw_file_str.dup.force_encoding(encoding)
          if test.valid_encoding?
            self.raw_file_str = test.encode("UTF-8", invalid: :replace, undef: :replace)
            return
          end
        rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
          next
        end
      end

      # If nothing worked, force UTF-8 and replace invalid bytes
      self.raw_file_str = raw_file_str.force_encoding("UTF-8").scrub("?")
    end

    def account_belongs_to_family
      return if account.nil?
      return if account.family_id == family_id

      errors.add(:account, "must belong to your family")
    end

    def rows_to_skip_within_file_bounds
      return if raw_file_str.blank?
      return if rows_to_skip.to_i == 0

      line_count = raw_file_str.lines.count

      if rows_to_skip.to_i >= line_count
        errors.add(:rows_to_skip, "must be less than the number of lines in the file (#{line_count})")
      end
    end
end
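For reference, `Import.parse_csv_str` above is plain stdlib CSV; run standalone, the same options behave like this (the sample data is made up):

```ruby
require "csv"

# Mirrors the options used by Import.parse_csv_str: headers on, a per-cell
# strip converter, liberal parsing, and a configurable column separator.
# The sample rows here are invented for illustration.
table = CSV.parse(
  "date;name;amount\n2026-01-01; ATM Withdrawal ;-20,00\n".strip,
  headers: true,
  col_sep: ";",
  converters: [ ->(str) { str&.strip } ],
  liberal_parsing: true
)
```

Note the converter strips surrounding whitespace from each cell, but the amount stays in its localized form ("-20,00") — normalization to a parseable number is the job of `sanitize_number`, driven by the import's `number_format`.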