PDF Translation Pipeline
Requires the pdf feature:
cargo build --release --features pdf
The PDF pipeline is the most complex format handler in TMT. Because PDFs encode text as positioned glyphs rather than editable strings, a full parse-rasterise-render cycle is required.
Pipeline Stages
1. Parsing (src/formats/pdf/parser.rs)
The parser walks the PDF content stream operator by operator, identifying text-drawing operators and extracting:
- The text string for each operator.
- The bounding box (Bbox): the page coordinates and dimensions of that text region.
- Font information referenced by the operator.
This produces a list of TextBlock structs, each containing a bounding box and its original text.
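As a rough illustration of that output shape, the sketch below models a TextBlock and its Bbox and collects blocks from a simplified operator list. The field names, the Op enum, and extract_blocks are hypothetical and exist only for this example; the real parser works on actual content-stream operators and font metrics and may be organised quite differently.

```rust
/// Axis-aligned box in page coordinates (PDF points, origin bottom-left).
/// Field names are assumptions made for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Bbox {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// A region of source text together with where it sits on the page.
#[derive(Debug, Clone, PartialEq)]
pub struct TextBlock {
    pub bbox: Bbox,
    pub text: String,
    pub font: String,
}

/// Hypothetical, heavily simplified stand-in for content-stream operators.
pub enum Op {
    /// Select a font and size (similar in spirit to the PDF `Tf` operator).
    SetFont { name: String, size: f32 },
    /// Set the text position (absolute here, to keep the sketch short).
    MoveTo { x: f32, y: f32 },
    /// Draw a string (similar in spirit to `Tj`).
    ShowText(String),
    /// Anything that does not draw text.
    Other,
}

/// Walk the operators in order and emit one TextBlock per text-drawing operator.
/// Width is estimated as 0.5 * font size per character, a crude placeholder
/// for real glyph metrics.
pub fn extract_blocks(ops: &[Op]) -> Vec<TextBlock> {
    let (mut font, mut size) = (String::new(), 0.0_f32);
    let (mut x, mut y) = (0.0_f32, 0.0_f32);
    let mut blocks = Vec::new();
    for op in ops {
        match op {
            Op::SetFont { name, size: s } => {
                font = name.clone();
                size = *s;
            }
            Op::MoveTo { x: nx, y: ny } => {
                x = *nx;
                y = *ny;
            }
            Op::ShowText(text) => {
                let width = text.chars().count() as f32 * size * 0.5;
                blocks.push(TextBlock {
                    bbox: Bbox { x, y, width, height: size },
                    text: text.clone(),
                    font: font.clone(),
                });
            }
            Op::Other => {}
        }
    }
    blocks
}
```

The key point is that graphics state (current font, current position) is tracked while walking, so each text-drawing operator can be paired with the state in effect when it ran.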
2. Translation
Each TextBlock’s text is sent to TranslationService::translate_text. Up to --concurrency blocks are translated in parallel.
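One way to bound the number of in-flight translations is a fixed pool of workers pulling from a shared index. The sketch below uses only the standard library; the Translator trait and translate_all function are hypothetical stand-ins, and the real translate_text signature (which may well be async and fallible) is not shown in this document.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for the translation backend; the real
/// TranslationService::translate_text signature may differ.
pub trait Translator: Sync {
    fn translate_text(&self, text: &str) -> String;
}

/// Translate every input with at most `concurrency` worker threads,
/// returning results in the same order as the inputs.
pub fn translate_all<T: Translator>(
    service: &T,
    texts: &[String],
    concurrency: usize,
) -> Vec<String> {
    // Never spawn zero workers, and never more workers than there are blocks.
    let workers = concurrency.max(1).min(texts.len().max(1));
    let next = AtomicUsize::new(0); // index of the next unclaimed block
    let results: Mutex<Vec<Option<String>>> = Mutex::new(vec![None; texts.len()]);

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Claim the next block; stop once all blocks are taken.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= texts.len() {
                    break;
                }
                let translated = service.translate_text(&texts[i]);
                results.lock().unwrap()[i] = Some(translated);
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|r| r.expect("every index is claimed exactly once"))
        .collect()
}
```

Writing each result back by index keeps the output aligned with the input blocks, which matters because every translation has to be paired with its original bounding box during reconstruction.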
3. Reconstruction (src/formats/pdf/reconstructor.rs)
For each page:
- The original page is rasterised to a bitmap at the configured --dpi. This raster becomes the background image.
- The background is JPEG-compressed at --jpeg-quality and embedded in the new PDF page.
- White rectangles are drawn over each original text bounding box to blank out the source text on the background.
- The translated text is rendered at the same bounding box position using the font specified by --font-path.
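
To relate --dpi to the size of the background raster, and to compare a bounding box against that raster, it helps to convert between PDF points and pixels. The sketch below relies on the PDF default of 72 user-space units per inch; PixelRect, points_to_pixels, and bbox_to_pixels are hypothetical names for this example and are not taken from the reconstructor.

```rust
/// PDF user space uses 72 units (points) per inch by default.
const POINTS_PER_INCH: f32 = 72.0;

/// Pixel rectangle with the origin at the top-left of the raster.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PixelRect {
    pub left: u32,
    pub top: u32,
    pub width: u32,
    pub height: u32,
}

/// Convert a length in points to whole pixels at the given DPI.
pub fn points_to_pixels(points: f32, dpi: f32) -> u32 {
    (points * dpi / POINTS_PER_INCH).round() as u32
}

/// Map a box given in PDF points (origin bottom-left, y pointing up)
/// to raster pixels (origin top-left, y pointing down).
pub fn bbox_to_pixels(
    x: f32,
    y: f32,
    width: f32,
    height: f32,
    page_height: f32,
    dpi: f32,
) -> PixelRect {
    // Flip the vertical axis: the top edge of the box, measured from the page top.
    let top_in_points = page_height - (y + height);
    PixelRect {
        left: points_to_pixels(x, dpi),
        top: points_to_pixels(top_in_points, dpi),
        width: points_to_pixels(width, dpi),
        height: points_to_pixels(height, dpi),
    }
}
```

For example, a US Letter page (612 x 792 points) rasterised at 96 DPI is 816 x 1056 pixels. Doubling the DPI quadruples the pixel count.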
4. Font Management (src/formats/pdf/fonts.rs)
Because translated Devanagari text cannot be rendered with a Latin font, a custom TrueType/OpenType font must be supplied via --font-path. The font manager loads the font, subsets it for the glyphs actually used, and embeds it in the output PDF.
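The first step of subsetting is working out what the output actually uses. The sketch below gathers the distinct characters across all translated strings; used_chars is a hypothetical helper for this example. A real subsetter operates on glyph IDs after text shaping, which matters for Devanagari because conjuncts and vowel signs often map to glyphs that do not correspond one-to-one with characters.

```rust
use std::collections::BTreeSet;

/// Collect the distinct, non-control characters that appear in the
/// translated text. This is only an approximation of the glyph set:
/// shaping may substitute or combine glyphs.
pub fn used_chars<'a, I>(texts: I) -> BTreeSet<char>
where
    I: IntoIterator<Item = &'a str>,
{
    texts
        .into_iter()
        .flat_map(|t| t.chars())
        .filter(|c| !c.is_control())
        .collect()
}
```

A BTreeSet keeps the result sorted and deduplicated, so repeated words across many text blocks do not enlarge the set.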
Configuration Reference
| Flag | Effect on PDF pipeline |
|---|---|
| --font-path | (required) Font used to render translated text |
| --dpi | Resolution of the rasterised background (default: 96) |
| --jpeg-quality | JPEG compression of background images (default: 85) |
| --debug-bboxes | Draw red rectangles around all detected text regions |
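
The defaults in the table can be captured in a small options struct. This is a sketch only: PdfOptions and validate are hypothetical names, not TMT's actual types, and treating --debug-bboxes as off by default is an assumption.

```rust
use std::path::PathBuf;

/// Hypothetical container for the PDF-related flags; not TMT's actual type.
#[derive(Debug, Clone, PartialEq)]
pub struct PdfOptions {
    pub font_path: Option<PathBuf>, // --font-path (required)
    pub dpi: u32,                   // --dpi
    pub jpeg_quality: u8,           // --jpeg-quality
    pub debug_bboxes: bool,         // --debug-bboxes (assumed off by default)
}

impl Default for PdfOptions {
    fn default() -> Self {
        Self {
            font_path: None,
            dpi: 96,
            jpeg_quality: 85,
            debug_bboxes: false,
        }
    }
}

impl PdfOptions {
    /// Reject configurations the pipeline could not run with.
    pub fn validate(&self) -> Result<(), String> {
        if self.font_path.is_none() {
            return Err("--font-path is required for PDF translation".to_string());
        }
        if self.dpi == 0 {
            return Err("--dpi must be greater than zero".to_string());
        }
        Ok(())
    }
}
```

Modelling the font path as an Option makes the "required" rule an explicit validation step rather than an implicit default.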
Debugging Layout Issues
If translated text appears misaligned or bounding boxes are wrong:
- Run with --debug-bboxes to visualise where the parser detected text regions.
- Increase --dpi for a higher-fidelity background that makes alignment easier to judge.
- Check that the provided font contains the glyphs required for the target language.
Limitations
- PDFs with scanned/image-only pages contain no text operators; the parser will find nothing to translate on those pages.
- Complex ligatures, and scripts other than Devanagari (right-to-left scripts in particular), may not render correctly, depending on the chosen font and the PDF rendering library’s shaping support.
- The rasterise-and-overlay approach means the output file size will typically be larger than the original.