Fix CSV import for non-UTF-8 encodings (Windows-1250, ISO-8859-2, etc.) (#617)

* Initial plan

* Add encoding detection for CSV imports to handle Windows-1250 and other non-UTF-8 encodings

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Improve encoding detection: prioritize Windows-1250 and increase confidence threshold

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>
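
The order of the fallback list matters because single-byte code pages assign different characters to the same byte values, so the same bytes can decode without error under several of them and the first candidate tried wins. A minimal plain-Ruby sketch of that effect follows; the sample bytes and the `decode_as` helper are made up for illustration and are not part of this change:

```ruby
# Hypothetical helper: interpret raw bytes as `encoding_name`, then transcode to UTF-8.
def decode_as(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).encode("UTF-8")
end

# Two bytes as they might appear in a CSV exported on a Central European Windows system.
raw = "\xE8\xF8".b

as_1250 = decode_as(raw, "Windows-1250") # c-caron, r-caron (Czech letters)
as_1252 = decode_as(raw, "Windows-1252") # e-grave, o-slash (Western European letters)

# Both conversions succeed, but they yield different text.
different = as_1250 != as_1252
```

Neither conversion raises, so a fallback loop cannot tell them apart by validity alone; putting Windows-1250 first is a policy choice that favors Central/Eastern European files.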

* Update Gemfile.lock with rchardet dependency

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Refactor: Extract common encodings to constant and deduplicate code

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Fix Rubocop style violations

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Fix linter violations and encoding detection logic

- Remove trailing whitespace from test file (9 lines)
- Fix ensure_utf8_encoding to handle binary strings properly by checking bytesize instead of blank?
- Add error handling for ArgumentError and Encoding::CompatibilityError
- Add invalid/undef replacement options to encode calls for robustness

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>
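
The two string-level points above can be illustrated outside Rails. This is a sketch under assumptions, with invented sample bytes: `bytesize` only counts bytes and never inspects characters, and the `invalid:`/`undef:` options to `String#encode` substitute a replacement character for bytes that cannot be converted.

```ruby
# "Příliš" as a Windows-1250 editor would store it; these bytes are not valid UTF-8.
raw = "P\xF8\xEDli\x9A".b

# bytesize works on any byte string, whatever its encoding tag says.
empty = raw.bytesize == 0

# Byte 0x81 has no character assigned in Windows-1252. With the replacement
# options, encode substitutes it instead of stopping on that byte.
with_gap = "a\x81b".b.force_encoding("Windows-1252")
converted = with_gap.encode("UTF-8", invalid: :replace, undef: :replace)

usable = converted.valid_encoding?
```

The trade-off is that replaced bytes are lost, so a wrong guess at the source encoding shows up as replacement characters rather than as an exception.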

* Fix encoding error in ensure_utf8_encoding method

- Use will_save_change_to_raw_file_str? instead of raw_file_str_changed? to avoid encoding errors when checking if attribute changed
- Wrap UTF-8 validation check in begin/rescue to handle ArgumentError from invalid encodings
- This fixes the test failure: "ArgumentError: invalid byte sequence in UTF-8"

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>
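
The "already valid UTF-8" check can be sketched in plain Ruby as follows. `utf8_ready?` is a hypothetical name used only for this illustration, and the Rails-specific `will_save_change_to_raw_file_str?` part is not reproduced here:

```ruby
# Hypothetical helper mirroring the check: the string must be tagged UTF-8
# and its bytes must form valid UTF-8.
def utf8_ready?(str)
  str.encoding == Encoding::UTF_8 && str.valid_encoding?
end

good = "P\u0159\u00EDli\u0161"                        # real UTF-8 text
bad  = "P\xF8\xEDli\x9A".dup.force_encoding("UTF-8")  # Windows-1250 bytes mislabeled as UTF-8
bin  = good.b                                         # valid bytes, but tagged ASCII-8BIT

results = [ utf8_ready?(good), utf8_ready?(bad), utf8_ready?(bin) ]
```

`valid_encoding?` reports broken byte sequences by returning false, which makes it a reasonable first gate before any operation that has to interpret the string as characters.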

* Fix test: add missing column labels and reload import before checking rows

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Fix test: ensure import is reloaded before checking rows_count and accessing rows

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>
Copilot
2026-01-12 10:17:55 +01:00
committed by GitHub
parent d354ce48e1
commit 5b736bf691
5 changed files with 144 additions and 0 deletions

@@ -22,6 +22,7 @@ class Import < ApplicationRecord
belongs_to :account, optional: true
before_validation :set_default_number_format
before_validation :ensure_utf8_encoding
scope :ordered, -> { order(created_at: :desc) }
@@ -294,6 +295,68 @@ class Import < ApplicationRecord
self.number_format ||= "1,234.56" # Default to US/UK format
end
# Common encodings to try when UTF-8 detection fails
# Windows-1250 is prioritized for Central/Eastern European languages
COMMON_ENCODINGS = [ "Windows-1250", "Windows-1252", "ISO-8859-1", "ISO-8859-2" ].freeze
def ensure_utf8_encoding
  # Handle nil or empty string first (before checking if changed)
  return if raw_file_str.nil? || raw_file_str.bytesize == 0

  # Only process if the attribute was changed.
  # will_save_change_to_raw_file_str? is used instead of raw_file_str_changed?
  # because it avoids encoding errors on binary data.
  return unless will_save_change_to_raw_file_str?

  # If already valid UTF-8, nothing to do
  begin
    return if raw_file_str.encoding == Encoding::UTF_8 && raw_file_str.valid_encoding?
  rescue ArgumentError
    # raw_file_str might have invalid encoding, continue to detection
  end

  # Detect encoding using rchardet
  begin
    require "rchardet"
    detection = CharDet.detect(raw_file_str)
    detected_encoding = detection["encoding"]
    confidence = detection["confidence"]

    # Only convert if we have reasonable confidence in the detection
    if detected_encoding && confidence > 0.75
      # Force encoding and convert to UTF-8
      self.raw_file_str = raw_file_str.force_encoding(detected_encoding).encode("UTF-8", invalid: :replace, undef: :replace)
    else
      # Fallback: try common encodings
      try_common_encodings
    end
  rescue LoadError
    # rchardet not available, fall back to trying common encodings
    try_common_encodings
  rescue ArgumentError, Encoding::CompatibilityError
    # Handle encoding errors (e.g. an encoding name Ruby does not know)
    # by falling back to common encodings
    try_common_encodings
  end
end
def try_common_encodings
  COMMON_ENCODINGS.each do |encoding|
    begin
      test = raw_file_str.dup.force_encoding(encoding)
      if test.valid_encoding?
        self.raw_file_str = test.encode("UTF-8", invalid: :replace, undef: :replace)
        return
      end
    rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
      next
    end
  end

  # If nothing worked, force UTF-8 and replace invalid bytes
  self.raw_file_str = raw_file_str.force_encoding("UTF-8").scrub("?")
end
def account_belongs_to_family
return if account.nil?
return if account.family_id == family_id