A fresh test run on French exposed a real bug in _apply_translation:
when the model returns a JSON list for a plural entry (e.g.
["form0", "form1"], which is a valid representation since plural forms
are ordered), the previous code took the else branch and broadcast
str(list), i.e. the Python list repr "['form0', 'form1']", to every
plural form. Both msgstr[0] and msgstr[1] ended up containing that
same literal repr string, breaking gettext lookups for the entry.
Spanish dodged it by chance (the model returned dicts on that run);
the failure mode is reproducible on French.
Changes:
- Extract _apply_plural_translation helper. Handles dict, list,
scalar, and non-JSON-string responses. List path distributes forms
by index and repeats the last form if the model returned fewer
forms than the language requires (better than leaving slots blank,
which falls back to displaying the raw English msgid).
- The split also drops _apply_translation's cyclomatic complexity
back below the C901 threshold.
- Add four regression tests: list response, list response
round-tripped through parse_response, list shorter than the required
number of forms (last form repeats), and empty list (falls back to
raw-string broadcast).
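A minimal sketch of the dispatch described above, for illustration only.
The name apply_plural_sketch, its signature, and the returned
{index: text} mapping are hypothetical and are not the actual
_apply_plural_translation helper. How missing dict keys and scalars are
handled here is an assumption.

```python
import json


def apply_plural_sketch(raw: str, nplurals: int) -> dict[int, str]:
    """Map a model response onto plural slots 0..nplurals-1 (illustrative)."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON at all: broadcast the raw string to every form.
        return {i: raw for i in range(nplurals)}

    if isinstance(value, dict):
        # Dict keyed by form index ("0", "1", ...).
        # Assumption: a missing key leaves that slot empty.
        return {i: str(value.get(str(i), "")) for i in range(nplurals)}

    if isinstance(value, list):
        if not value:
            # Empty list: fall back to broadcasting the raw string.
            return {i: raw for i in range(nplurals)}
        # Distribute by index; repeat the last form when the list is short.
        last = len(value) - 1
        return {i: str(value[min(i, last)]) for i in range(nplurals)}

    # Scalar (string, number, ...): broadcast its string form.
    return {i: str(value) for i in range(nplurals)}
```

With nplurals=2, the response '["form0", "form1"]' fills slot 0 with
"form0" and slot 1 with "form1" instead of writing the list repr to both.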
Verified end-to-end on French: the previously-broken plural entry
"Added 1 new column to the virtual dataset" / "Added %s new columns
to the virtual dataset" now writes msgstr[0] and msgstr[1] correctly
on a fresh run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses sadpandajoe's review on #39448:
1. Adds tests/unit_tests/scripts/translations/backfill_po_test.py with
19 cases covering parse_response (singular/plural/markdown-fence
stripping/non-ASCII/non-numeric keys/list-and-scalar rejection/JSON
errors) and _apply_translation (singular path, plural-dict path,
plural-scalar fallback, plural invalid-JSON fallback, fuzzy flag,
attribution append/dedup, end-to-end round-trip from parse_response
into _apply_translation). The script is loaded via importlib since
it lives outside the package tree.
2. translate_batch now pipes the prompt over stdin instead of passing
it as argv. With --batch-size 50 and many reference languages a
single batch can grow into the tens of KB and approach ARG_MAX on
some platforms; stdin removes that ceiling.
3. _process_batches now saves the catalog after each batch that wrote
at least one translation (when not in --dry-run). For sparse
languages with thousands of missing strings, a crash mid-run now
only loses the in-flight batch rather than every batch translated
so far. The full save at end of backfill() is removed since the
per-batch save covers it.
4. The module docstring referenced --fuzzy/--no-fuzzy, but argparse only
registers --no-fuzzy; the docstring now matches the actual flag.
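For item 2, the stdin approach can be sketched as follows. This is a
generic illustration, not the body of translate_batch: run_with_stdin,
its timeout default, and the error message are hypothetical, and the
command is supplied by the caller.

```python
import subprocess


def run_with_stdin(cmd: list[str], prompt: str, timeout: float = 300.0) -> str:
    """Run cmd with prompt on stdin and return its stdout (illustrative)."""
    result = subprocess.run(
        cmd,
        input=prompt,         # sent over a pipe, so argv size limits do not apply
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if result.returncode != 0:
        # RuntimeError is one of the exception types the batch loop catches.
        raise RuntimeError(
            f"command failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout
```

Because the prompt never appears in argv, its size is bounded by pipe
throughput rather than by ARG_MAX.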
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A non-object JSON response (list, scalar, null) would raise AttributeError
from .items(). _process_batches only catches (ValueError, RuntimeError),
so the crash would abort the entire run instead of being handled per-batch.
Surface the type error as ValueError so it's caught gracefully.
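A minimal sketch of the type check, assuming the response text has
already been stripped of any markdown fence. parse_object_response is a
hypothetical name, not the real parse_response.

```python
import json


def parse_object_response(text: str) -> dict:
    """Parse text as JSON and require a top-level object (illustrative)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        # JSONDecodeError already subclasses ValueError; re-raise with context.
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        # Lists, scalars, and null have no .items(); reject them explicitly
        # so the caller sees ValueError rather than AttributeError.
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    return data
```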
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses two codeant-ai review comments about missing docstrings on
newly added top-level functions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Four "Apply suggestion" commits from codeant-ai replaced real code bodies
with docstring-only lines, introducing syntax errors
(build_translation_index.py:119). Restore the bodies while keeping the
suggested docstrings:
- build_translation_index.py: _plural_key, main()
- backfill_po.py: _lang_name, _plural_key
Also addresses two major issues raised in review:
1. parse_response() in backfill_po.py used str(v) on values, which
converted dict responses (from plural entries) into Python repr
like "{'0': 'x'}" that json.loads could not later parse in
_apply_translation. Serialize dict/list values with json.dumps.
2. build_index() wrote fuzzy entries as trusted context in the
cross-language index, letting AI-generated drafts propagate back
into future backfill runs as if reviewed. Gate index values via
_is_translated so fuzzy entries become null.
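The difference behind issue 1 can be sketched as follows.
serialize_value is a hypothetical helper for illustration; the use of
ensure_ascii=False is an assumption, not something taken from the
script.

```python
import json
from typing import Any


def serialize_value(value: Any) -> str:
    """Return a string that json.loads can read back for dicts and lists."""
    if isinstance(value, (dict, list)):
        # json.dumps emits double-quoted JSON; str() would emit a Python repr.
        return json.dumps(value, ensure_ascii=False)
    return str(value)
```

str({"0": "x"}) yields "{'0': 'x'}", which json.loads rejects because of
the single quotes, while the json.dumps output round-trips.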
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds two scripts to help maintainers fill in missing .po translations
using Claude AI, and applies them to backfill all 184 missing Spanish strings.
New scripts:
- scripts/translations/build_translation_index.py — reads every .po file
and outputs a cross-language JSON index {msgid: {lang: translation}}
used to provide reference context to the AI
- scripts/translations/backfill_po.py — for a target language, finds all
untranslated entries, batches them, and calls claude -p with cross-language
context to generate draft translations marked #, fuzzy for human review
Design highlights:
- Cross-language translations are passed per-string so the AI can disambiguate
ambiguous English (e.g. "Scale", "Table") from how other translators handled it
- --min-context N skips strings with fewer than N reference translations
- Each generated entry is tagged with a translator comment listing the model
and which languages provided context (e.g. [refs: fr, ru])
- translation_index.json added to .gitignore (regenerated locally)
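A rough sketch of how the index shape and --min-context could interact.
The helper names, the sample entries, and the exact tag format are
hypothetical; null (None) stands for a language with no trusted
translation.

```python
from typing import Optional

# Shape from the description: {msgid: {lang: translation or None}}
Index = dict[str, dict[str, Optional[str]]]


def reference_translations(index: Index, msgid: str, target: str) -> dict[str, str]:
    """Collect non-empty translations of msgid from languages other than target."""
    refs = index.get(msgid, {})
    return {lang: text for lang, text in refs.items() if lang != target and text}


def has_min_context(index: Index, msgid: str, target: str, min_context: int) -> bool:
    """True when at least min_context reference translations are available."""
    return len(reference_translations(index, msgid, target)) >= min_context


def refs_tag(refs: dict[str, str]) -> str:
    """Format the contributing languages, e.g. "[refs: fr, ru]"."""
    return f"[refs: {', '.join(sorted(refs))}]"
```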
Spanish translations:
- Backfilled all 184 previously untranslated strings in es/LC_MESSAGES/messages.po
- All entries marked #, fuzzy pending human review
Docs: added "Backfilling missing translations with AI" section to
docs/developer_docs/contributing/howtos.md
npm shortcuts added to superset-frontend/package.json:
- translations:build-index
- translations:backfill
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>