DOCX Translation
The DOCX handler (src/formats/docx.rs) translates Microsoft Word documents at the paragraph level while leaving all formatting, styles, numbering, images, and tables structurally intact.
How It Works
A .docx file is a ZIP archive containing XML files. The handler:
- Unpacks the archive into memory.
- Parses word/document.xml to locate paragraph (<w:p>) elements.
- Extracts the concatenated text of each paragraph’s runs (<w:r><w:t>).
- Translates each paragraph’s text via TranslationService::translate_text.
- Reconstructs the paragraph by replacing the text content of its runs with the translated text while keeping all run properties (<w:rPr>, bold, italic, font, etc.) unchanged.
- Repacks the modified XML back into a ZIP archive and writes it to the output path.
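The text-extraction step can be sketched with the standard library alone. This is a minimal illustration, not the handler's actual code: paragraph_text and find_text_open are hypothetical helpers, and naive string scanning stands in for whatever XML parser src/formats/docx.rs really uses. Entities such as &amp; are left undecoded.

```rust
/// Finds the next opening <w:t> tag (with or without attributes).
/// Skips look-alikes such as <w:tab/>, <w:tbl> and <w:tc>.
/// Hypothetical helper, not taken from the handler.
fn find_text_open(xml: &str) -> Option<usize> {
    let mut from = 0;
    while let Some(pos) = xml[from..].find("<w:t") {
        let idx = from + pos;
        match xml.as_bytes().get(idx + 4) {
            Some(b'>') | Some(b' ') => return Some(idx),
            _ => from = idx + 4, // "<w:t" is ASCII, so this is a char boundary
        }
    }
    None
}

/// Concatenates the content of every <w:t> element in one <w:p> fragment.
/// Hypothetical helper: no entity decoding, no namespace handling.
fn paragraph_text(paragraph_xml: &str) -> String {
    const CLOSE: &str = "</w:t>";
    let mut out = String::new();
    let mut rest = paragraph_xml;
    while let Some(start) = find_text_open(rest) {
        let tag = &rest[start..];
        let Some(gt) = tag.find('>') else { break };
        if tag[..gt].ends_with('/') {
            // Self-closing <w:t .../> carries no text.
            rest = &tag[gt + 1..];
            continue;
        }
        let body = &tag[gt + 1..];
        let Some(end) = body.find(CLOSE) else { break };
        out.push_str(&body[..end]);
        rest = &body[end + CLOSE.len()..];
    }
    out
}
```

Given a paragraph with a bold run "Hello" followed by a run " world", paragraph_text returns the single string "Hello world", which is the unit that would then be passed to the translation step.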
Formatting Preservation
Run-level properties (font, size, bold, colour) are preserved because the handler only modifies <w:t> text nodes, not the surrounding <w:rPr> elements. Paragraph-level properties (<w:pPr>) such as alignment, spacing, and list numbering are also untouched.
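To illustrate why formatting survives, the sketch below rewrites only the content of <w:t> elements and copies every other byte through untouched. It is an assumption-laden illustration of the principle, not the handler's code: map_text_nodes is a hypothetical name, the scanning is naive string matching rather than real XML parsing, and the closure stands in for translation.

```rust
/// Applies `f` to the content of each <w:t> element and copies everything
/// else (including <w:pPr> and <w:rPr>) through unchanged.
/// Hypothetical helper; naive scanning, no XML parser.
fn map_text_nodes(xml: &str, f: impl Fn(&str) -> String) -> String {
    const CLOSE: &str = "</w:t>";
    let mut out = String::with_capacity(xml.len());
    let mut rest = xml;
    while let Some(pos) = rest.find("<w:t") {
        let tag = &rest[pos..];
        // Only <w:t> or <w:t attr...> count; <w:tab/>, <w:tbl>, <w:tc> do not.
        let is_text_tag = matches!(tag.as_bytes().get(4), Some(b'>') | Some(b' '));
        let Some(gt) = tag.find('>') else { break };
        let open_end = pos + gt + 1; // index just past the opening tag
        out.push_str(&rest[..open_end]);
        rest = &rest[open_end..];
        if !is_text_tag || tag[..gt].ends_with('/') {
            continue; // other element or self-closing tag: copied as-is
        }
        let Some(end) = rest.find(CLOSE) else { break };
        out.push_str(&f(&rest[..end]));
        rest = &rest[end..]; // the closing tag is copied on a later pass
    }
    out.push_str(rest);
    out
}
```

Calling map_text_nodes(xml, |t| t.to_uppercase()) changes the visible text while the surrounding property elements come out byte-for-byte identical, which is the property this section relies on.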
Limitations
- If a paragraph’s runs have different formatting (e.g. one bold word mid-sentence), the handler may collapse them into a single run to simplify reconstruction. Inline formatting in such mixed-format paragraphs may therefore not survive translation exactly.
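The run-collapsing limitation can be pictured with a small model. This is a sketch under stated assumptions rather than the handler's implementation: the Run struct and collapse_runs are hypothetical, and the choice to keep the first run's properties is an assumption made for illustration only.

```rust
/// Simplified, hypothetical model of a <w:r> element.
#[derive(Debug, Clone, PartialEq)]
struct Run {
    /// Raw <w:rPr> XML for the run, kept opaque (empty if the run has none).
    properties: String,
    text: String,
}

/// Collapses a paragraph's runs into one run holding the translated text.
/// Assumption: the surviving run inherits the first run's properties, so
/// formatting that applied only to later runs is lost.
fn collapse_runs(runs: &[Run], translated: &str) -> Vec<Run> {
    match runs.first() {
        Some(first) => vec![Run {
            properties: first.properties.clone(),
            text: translated.to_string(),
        }],
        None => Vec::new(), // a paragraph without runs stays empty
    }
}
```

With the runs "Click ", a bold "Save", and " now", the collapsed paragraph has a single unformatted run, so the bold emphasis on the middle word disappears. Word order often changes between languages, which is one reason mapping translated text back onto the original run boundaries is hard.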
- Text inside text boxes, headers, and footers may not be translated in the current implementation. Check the source for which XML parts are currently processed.