Fix CSV import for non-UTF-8 encodings (Windows-1250, ISO-8859-2, etc.) (#617)

* Initial plan

* Add encoding detection for CSV imports to handle Windows-1250 and other non-UTF-8 encodings

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Improve encoding detection: prioritize Windows-1250 and increase confidence threshold

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Update Gemfile.lock with rchardet dependency

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Refactor: Extract common encodings to constant and deduplicate code

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Fix Rubocop style violations

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Fix linter violations and encoding detection logic

- Remove trailing whitespace from test file (9 lines)
- Fix ensure_utf8_encoding to handle binary strings properly by checking bytesize instead of blank?
- Add error handling for ArgumentError and Encoding::CompatibilityError
- Add invalid/undef replacement options to encode calls for robustness

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Fix encoding error in ensure_utf8_encoding method

- Use will_save_change_to_raw_file_str? instead of raw_file_str_changed? to avoid encoding errors when checking if attribute changed
- Wrap UTF-8 validation check in begin/rescue to handle ArgumentError from invalid encodings
- This fixes the test failure: "ArgumentError: invalid byte sequence in UTF-8"

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Fix test: add missing column labels and reload import before checking rows

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

* Fix test: ensure import is reloaded before checking rows_count and accessing rows

Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jjmata <187772+jjmata@users.noreply.github.com>
2026-04-23 22:14:08 +00:00 · 2026-01-12 10:17:55 +01:00
parent d354ce48e1
commit 5b736bf691
5 changed files with 144 additions and 0 deletions
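The conversion described in the commit message can be sketched outside Rails with plain Ruby strings. This is a minimal illustration, not the code from this commit: `to_utf8`, `FALLBACK_ENCODINGS`, and the `detected:` keyword are hypothetical names, and the detector (rchardet in the commit) is replaced here by an optional argument so the example needs no gems.

```ruby
# Hypothetical helper, not part of the commit: convert raw bytes to UTF-8.
FALLBACK_ENCODINGS = ["Windows-1250", "Windows-1252"].freeze

def to_utf8(raw, detected: nil)
  # Work on a binary copy so the caller's string is never mutated.
  bytes = raw.dup.force_encoding(Encoding::BINARY)

  # Fast path: the bytes already form valid UTF-8.
  as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?

  # Otherwise reinterpret the bytes using the detected encoding,
  # or the first fallback when no detection result is supplied.
  source = detected || FALLBACK_ENCODINGS.first
  bytes.force_encoding(source).encode("UTF-8", invalid: :replace, undef: :replace)
end

czech = "P\xF8\xEDli\x9A".b # the Czech word "Příliš" encoded as Windows-1250
puts to_utf8(czech)
puts to_utf8(czech, detected: "Windows-1252") # same bytes, different guess
```

The second call shows why the choice of source encoding matters: the same bytes decode without error under Windows-1252 but produce different characters, so a wrong guess silently corrupts text rather than failing.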
--- a/app/models/import.rb
+++ b/app/models/import.rb
@@ -22,6 +22,7 @@ class Import < ApplicationRecord
  belongs_to :account, optional: true

  before_validation :set_default_number_format
+  before_validation :ensure_utf8_encoding

  scope :ordered, -> { order(created_at: :desc) }

@@ -294,6 +295,68 @@ class Import < ApplicationRecord
      self.number_format ||= "1,234.56" # Default to US/UK format
    end

+    # Fallback encodings to try when rchardet is unavailable, raises, or reports low confidence
+    # Windows-1250 is listed first to favor Central/Eastern European languages
+    COMMON_ENCODINGS = [ "Windows-1250", "Windows-1252", "ISO-8859-1", "ISO-8859-2" ].freeze
+
+    def ensure_utf8_encoding
+      # Handle nil or empty string first (before checking if changed)
+      return if raw_file_str.nil? || raw_file_str.bytesize == 0
+
+      # Only process if the attribute was changed
+      # Use will_save_change_to_attribute? which is safer for binary data
+      return unless will_save_change_to_raw_file_str?
+
+      # If already valid UTF-8, nothing to do
+      begin
+        if raw_file_str.encoding == Encoding::UTF_8 && raw_file_str.valid_encoding?
+          return
+        end
+      rescue ArgumentError
+        # raw_file_str might have invalid encoding, continue to detection
+      end
+
+      # Detect encoding using rchardet
+      begin
+        require "rchardet"
+        detection = CharDet.detect(raw_file_str)
+        detected_encoding = detection["encoding"]
+        confidence = detection["confidence"]
+
+        # Only convert if we have reasonable confidence in the detection
+        if detected_encoding && confidence > 0.75
+          # Force encoding and convert to UTF-8
+          self.raw_file_str = raw_file_str.force_encoding(detected_encoding).encode("UTF-8", invalid: :replace, undef: :replace)
+        else
+          # Fallback: try common encodings
+          try_common_encodings
+        end
+      rescue LoadError
+        # rchardet not available, fallback to trying common encodings
+        try_common_encodings
+      rescue ArgumentError, Encoding::CompatibilityError
+        # Handle encoding errors by falling back to common encodings
+        try_common_encodings
+      end
+    end
+
+    def try_common_encodings
+      COMMON_ENCODINGS.each do |encoding|
+        begin
+          test = raw_file_str.dup.force_encoding(encoding)
+          if test.valid_encoding?
+            self.raw_file_str = test.encode("UTF-8", invalid: :replace, undef: :replace)
+            return
+          end
+        rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
+          next
+        end
+      end
+
+      # If nothing worked, force UTF-8 and replace invalid bytes
+      self.raw_file_str = raw_file_str.force_encoding("UTF-8").scrub("?")
+    end
+
    def account_belongs_to_family
      return if account.nil?
      return if account.family_id == family_id
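One point worth checking when reviewing `try_common_encodings`: as far as I know, Ruby treats every byte sequence as validly encoded in its single-byte encodings, so `valid_encoding?` would not distinguish between the entries of `COMMON_ENCODINGS`, and the first entry would always be selected. The sketch below tests that assumption against all 256 byte values; the variable names are illustrative and nothing here comes from the commit itself.

```ruby
# Every possible byte value, as one binary string.
all_bytes = (0..255).to_a.pack("C*")

# The same list the diff uses for its fallback loop.
single_byte = %w[Windows-1250 Windows-1252 ISO-8859-1 ISO-8859-2]

# Record whether Ruby considers the bytes valid under each encoding.
results = single_byte.to_h do |name|
  [name, all_bytes.dup.force_encoding(name).valid_encoding?]
end

results.each { |name, valid| puts "#{name}: valid_encoding? => #{valid}" }

# For contrast, UTF-8 validity does depend on the byte sequence.
utf8_valid = all_bytes.dup.force_encoding("UTF-8").valid_encoding?
puts "UTF-8: valid_encoding? => #{utf8_valid}"
```

If the single-byte results are all true, the order of `COMMON_ENCODINGS` alone decides the outcome of the fallback, which matches the commit's stated intent of prioritizing Windows-1250 but means the later entries are effectively never reached.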