Mirror of https://github.com/we-promise/sure.git (synced 2026-04-23 22:14:08 +00:00)
Fix CSV import for non-UTF-8 encodings (Windows-1250, ISO-8859-2, etc.) (#617)

* Initial plan
* Add encoding detection for CSV imports to handle Windows-1250 and other non-UTF-8 encodings
* Improve encoding detection: prioritize Windows-1250 and increase confidence threshold
* Update Gemfile.lock with rchardet dependency
* Refactor: Extract common encodings to constant and deduplicate code
* Fix Rubocop style violations
* Fix linter violations and encoding detection logic
  - Remove trailing whitespace from test file (9 lines)
  - Fix ensure_utf8_encoding to handle binary strings properly by checking bytesize instead of blank?
  - Add error handling for ArgumentError and Encoding::CompatibilityError
  - Add invalid/undef replacement options to encode calls for robustness
* Fix encoding error in ensure_utf8_encoding method
  - Use will_save_change_to_raw_file_str? instead of raw_file_str_changed? to avoid encoding errors when checking if attribute changed
  - Wrap UTF-8 validation check in begin/rescue to handle ArgumentError from invalid encodings
  - This fixes the test failure: "ArgumentError: invalid byte sequence in UTF-8"
* Fix test: add missing column labels and reload import before checking rows
* Fix test: ensure import is reloaded before checking rows_count and accessing rows

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>
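
For context, the failure this commit addresses can be reproduced in a few lines of plain Ruby. The snippet below is an illustrative sketch, not code from the repository: the sample word and the variable names are assumptions chosen for the example. It shows that Windows-1250 bytes are not valid UTF-8, and that declaring the real source encoding before transcoding recovers the text.

```ruby
# "Příjem" as Windows-1250 bytes: ř is 0xF8 and í is 0xED in that code page.
# (The sample word is an assumption for illustration only.)
raw = "P\xF8\xEDjem".b

# Interpreted as UTF-8, the byte sequence is invalid.
broken = raw.dup.force_encoding(Encoding::UTF_8)
utf8_ok = broken.valid_encoding? # false

# Tagging the bytes with their real encoding, then transcoding, yields
# a valid UTF-8 string. The replace options mirror the ones used in the diff.
fixed = raw.dup
           .force_encoding("Windows-1250")
           .encode("UTF-8", invalid: :replace, undef: :replace)

puts utf8_ok
puts fixed
```

Note that `force_encoding` only relabels the bytes, while `encode` actually rewrites them, which is why the two calls appear together in the change.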
@@ -22,6 +22,7 @@ class Import < ApplicationRecord
   belongs_to :account, optional: true

   before_validation :set_default_number_format
+  before_validation :ensure_utf8_encoding

   scope :ordered, -> { order(created_at: :desc) }

@@ -294,6 +295,68 @@ class Import < ApplicationRecord
       self.number_format ||= "1,234.56" # Default to US/UK format
     end

+    # Common encodings to try when UTF-8 detection fails
+    # Windows-1250 is prioritized for Central/Eastern European languages
+    COMMON_ENCODINGS = [ "Windows-1250", "Windows-1252", "ISO-8859-1", "ISO-8859-2" ].freeze
+
+    def ensure_utf8_encoding
+      # Handle nil or empty string first (before checking if changed)
+      return if raw_file_str.nil? || raw_file_str.bytesize == 0
+
+      # Only process if the attribute was changed
+      # Use will_save_change_to_attribute? which is safer for binary data
+      return unless will_save_change_to_raw_file_str?
+
+      # If already valid UTF-8, nothing to do
+      begin
+        if raw_file_str.encoding == Encoding::UTF_8 && raw_file_str.valid_encoding?
+          return
+        end
+      rescue ArgumentError
+        # raw_file_str might have invalid encoding, continue to detection
+      end
+
+      # Detect encoding using rchardet
+      begin
+        require "rchardet"
+        detection = CharDet.detect(raw_file_str)
+        detected_encoding = detection["encoding"]
+        confidence = detection["confidence"]
+
+        # Only convert if we have reasonable confidence in the detection
+        if detected_encoding && confidence > 0.75
+          # Force encoding and convert to UTF-8
+          self.raw_file_str = raw_file_str.force_encoding(detected_encoding).encode("UTF-8", invalid: :replace, undef: :replace)
+        else
+          # Fallback: try common encodings
+          try_common_encodings
+        end
+      rescue LoadError
+        # rchardet not available, fallback to trying common encodings
+        try_common_encodings
+      rescue ArgumentError, Encoding::CompatibilityError => e
+        # Handle encoding errors by falling back to common encodings
+        try_common_encodings
+      end
+    end
+
+    def try_common_encodings
+      COMMON_ENCODINGS.each do |encoding|
+        begin
+          test = raw_file_str.dup.force_encoding(encoding)
+          if test.valid_encoding?
+            self.raw_file_str = test.encode("UTF-8", invalid: :replace, undef: :replace)
+            return
+          end
+        rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
+          next
+        end
+      end
+
+      # If nothing worked, force UTF-8 and replace invalid bytes
+      self.raw_file_str = raw_file_str.force_encoding("UTF-8").scrub("?")
+    end
+
     def account_belongs_to_family
       return if account.nil?
       return if account.family_id == family_id
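
The detect-then-fall-back flow in this change can be sketched outside of Rails as a plain function. This is a simplified illustration under stated assumptions, not the commit's implementation: `normalize_to_utf8`, `FALLBACK_ENCODINGS`, and the `detector:` parameter are hypothetical names, and the detector is an injected callable standing in for `CharDet.detect` so the sketch runs without the rchardet gem. Unlike the model callback, it also checks whether the raw bytes already form valid UTF-8 regardless of their current encoding tag.

```ruby
# Same candidate order as the constant added in the diff.
FALLBACK_ENCODINGS = ["Windows-1250", "Windows-1252", "ISO-8859-1", "ISO-8859-2"].freeze

# Hypothetical helper, not part of the commit. `detector` is any callable
# returning a Hash with "encoding" and "confidence" keys (the two keys the
# diff reads from the detection result), or nil to skip detection.
def normalize_to_utf8(bytes, detector: nil, threshold: 0.75)
  return bytes if bytes.nil? || bytes.bytesize.zero?

  # Step 1: keep the input if its bytes are already valid UTF-8.
  as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?

  # Step 2: trust the detector only above the confidence threshold.
  guess = detector&.call(bytes)
  if guess && guess["encoding"] && guess["confidence"].to_f > threshold
    begin
      source = bytes.dup.force_encoding(guess["encoding"])
      return source.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
    rescue ArgumentError, EncodingError
      # Unknown encoding name or failed conversion: fall through to step 3.
    end
  end

  # Step 3: try the fixed candidate list in priority order.
  FALLBACK_ENCODINGS.each do |name|
    candidate = bytes.dup.force_encoding(name)
    next unless candidate.valid_encoding?
    return candidate.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end

  # Step 4: last resort, replace whatever is not valid UTF-8.
  as_utf8.scrub("?")
end

# Byte 0xB9 is "š" in ISO-8859-2 but "ą" in Windows-1250.
raw = "ko\xB9".b
confident = ->(_bytes) { { "encoding" => "ISO-8859-2", "confidence" => 0.9 } }
unsure    = ->(_bytes) { { "encoding" => "ISO-8859-2", "confidence" => 0.4 } }

puts normalize_to_utf8(raw, detector: confident) # koš (detector's guess is used)
puts normalize_to_utf8(raw, detector: unsure)    # koą (falls back to Windows-1250)
```

The two calls at the end illustrate why the threshold and the candidate order both matter: the same byte decodes to different characters depending on which path is taken, so a low-confidence guess is replaced by the first entry of the fallback list.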