PDF Translation Pipeline

Requires the pdf feature flag: cargo build --release --features pdf

The PDF pipeline is the most complex format handler in TMT. Because PDFs encode text as positioned glyphs rather than editable strings, translated text cannot simply be substituted in place; a full parse-rasterise-render cycle is required.

Pipeline Stages

1. Parsing (`src/formats/pdf/parser.rs`)

The parser walks the PDF content stream operator by operator, identifying text-drawing operators and extracting:

The text string for each operator.
The bounding box (Bbox) — the page coordinates and dimensions of that text region.
Font information referenced by the operator.

This produces a list of TextBlock structs, each containing a bounding box and its original text.
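As a rough illustration of the data the parser hands on, the structs might look like the sketch below. Only the names TextBlock and Bbox come from this page; every field name and type is an assumption made for the example and may differ from the definitions in `src/formats/pdf/parser.rs`.

```rust
/// Axis-aligned rectangle in PDF page coordinates.
/// Field names are illustrative assumptions, not TMT's actual definition.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Bbox {
    /// Horizontal position of the box on the page.
    pub x: f32,
    /// Vertical position of the box on the page.
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// One piece of text found by the parser, plus where it sat on the page.
#[derive(Debug, Clone, PartialEq)]
pub struct TextBlock {
    /// Region of the page the original text occupied.
    pub bbox: Bbox,
    /// Original (untranslated) text.
    pub text: String,
    /// Name of the font the operator referenced, if known (hypothetical field).
    pub font_name: Option<String>,
}
```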

2. Translation

Each TextBlock’s text is sent to TranslationService::translate_text. Up to --concurrency blocks are translated in parallel.
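The idea of a concurrency cap can be sketched with the standard library alone. This is not how TMT is implemented (the real service may well be async); `translate_stub` and `translate_all` are hypothetical names, and the stub only tags its input instead of calling a translation backend.

```rust
use std::thread;

/// Stand-in for the real translation call; it only tags the input.
fn translate_stub(text: &str) -> String {
    format!("[translated] {text}")
}

/// Translate `texts` with at most `concurrency` worker threads alive at once.
/// The output order matches the input order.
pub fn translate_all(texts: &[String], concurrency: usize) -> Vec<String> {
    let workers = concurrency.max(1);
    let mut out = Vec::with_capacity(texts.len());
    // Each batch holds at most `workers` items, which bounds the thread count.
    for batch in texts.chunks(workers) {
        let results: Vec<String> = thread::scope(|s| {
            // Spawn every worker in the batch before joining any of them.
            let handles: Vec<_> = batch
                .iter()
                .map(|t| s.spawn(move || translate_stub(t)))
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("worker panicked"))
                .collect()
        });
        out.extend(results);
    }
    out
}
```

Batching is the simplest way to enforce the cap, at the cost of waiting for the slowest item in each batch; a work queue would keep all workers busy.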

3. Reconstruction (`src/formats/pdf/reconstructor.rs`)

For each page:

The original page is rasterised to a bitmap at the configured --dpi. This raster becomes the background image.
The background is JPEG-compressed at --jpeg-quality and embedded in the new PDF page.
White rectangles are drawn over each original text bounding box to blank out the source text on the background.
The translated text is rendered at the same bounding box position using the font specified by --font-path.
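To see how a bounding box relates to the rasterised background, the sketch below maps a box from PDF points to pixels. It assumes the default PDF user space (72 units per inch, origin at the bottom-left, y pointing up) and a raster with a top-left origin; `PixelRect` and `bbox_to_pixels` are hypothetical names, not part of TMT.

```rust
/// PDF user space defaults to 72 units per inch.
const POINTS_PER_INCH: f32 = 72.0;

/// Pixel rectangle with a top-left origin, as used by most raster images.
#[derive(Debug, PartialEq)]
pub struct PixelRect {
    pub left: f32,
    pub top: f32,
    pub width: f32,
    pub height: f32,
}

/// Map a box given in PDF points (origin bottom-left) onto a page
/// raster rendered at `dpi`.
pub fn bbox_to_pixels(
    x: f32,
    y: f32,
    width: f32,
    height: f32,
    page_height_pt: f32,
    dpi: f32,
) -> PixelRect {
    let scale = dpi / POINTS_PER_INCH;
    PixelRect {
        left: x * scale,
        // Flip the y axis: the box's top edge sits at y + height in PDF space.
        top: (page_height_pt - (y + height)) * scale,
        width: width * scale,
        height: height * scale,
    }
}
```

For example, at 144 DPI every point becomes two pixels, so a 100 x 20 pt box covers 200 x 40 px on the background.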

4. Font Management (`src/formats/pdf/fonts.rs`)

Because translated Devanagari text cannot be rendered with a Latin font, a custom TrueType/OpenType font must be supplied via --font-path. The font manager loads the font, subsets it for the glyphs actually used, and embeds it in the output PDF.
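The first step of subsetting, working out which characters the output actually uses, can be sketched as follows. `used_chars` is a hypothetical helper, not TMT's font manager. It collects Unicode code points only; for shaped scripts such as Devanagari the final glyph set is usually larger, because conjuncts and positional forms are chosen during shaping.

```rust
use std::collections::BTreeSet;

/// Collect the distinct, non-control characters in the translated text.
/// A subsetter needs glyphs for at least these code points.
pub fn used_chars<'a, I>(texts: I) -> BTreeSet<char>
where
    I: IntoIterator<Item = &'a str>,
{
    texts
        .into_iter()
        .flat_map(|t| t.chars())
        .filter(|c| !c.is_control())
        .collect()
}
```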

Configuration Reference

Flag	Effect on PDF pipeline
`--font-path`	(required) Font used to render translated text
`--dpi`	Resolution of the rasterised background (default: 96)
`--jpeg-quality`	JPEG compression of background images (default: 85)
`--debug-bboxes`	Draw red rectangles around all detected text regions
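One way to picture these flags is as a single options struct. The sketch below is an assumption about shape only: `PdfOptions` and its methods are hypothetical, the defaults for --dpi and --jpeg-quality are taken from the table, and the validation ranges (non-zero DPI, quality between 1 and 100) are guesses rather than documented limits.

```rust
use std::path::PathBuf;

/// Hypothetical container for the PDF-related flags.
#[derive(Debug, Clone, PartialEq)]
pub struct PdfOptions {
    /// --font-path (required)
    pub font_path: PathBuf,
    /// --dpi
    pub dpi: u32,
    /// --jpeg-quality
    pub jpeg_quality: u8,
    /// --debug-bboxes (assumed to be off unless passed)
    pub debug_bboxes: bool,
}

impl PdfOptions {
    /// Build options with the documented defaults (96 DPI, JPEG quality 85).
    pub fn new(font_path: impl Into<PathBuf>) -> Self {
        Self {
            font_path: font_path.into(),
            dpi: 96,
            jpeg_quality: 85,
            debug_bboxes: false,
        }
    }

    /// Reject values that cannot be meaningful. The ranges are assumptions.
    pub fn validate(&self) -> Result<(), String> {
        if self.dpi == 0 {
            return Err("dpi must be greater than zero".to_string());
        }
        if self.jpeg_quality == 0 || self.jpeg_quality > 100 {
            return Err("jpeg quality must be between 1 and 100".to_string());
        }
        Ok(())
    }
}
```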

Debugging Layout Issues

If translated text appears misaligned or bounding boxes are wrong:

Run with --debug-bboxes to visualise where the parser detected text regions.
Increase --dpi for a higher-fidelity background that makes alignment easier to judge.
Check that the provided font contains the glyphs required for the target language.

Limitations

PDFs with scanned/image-only pages contain no text operators; the parser will find nothing to translate on those pages.
Scripts beyond Devanagari that rely on complex ligatures or right-to-left layout may not render correctly, depending on the chosen font and the PDF rendering library’s shaping support.
The rasterise-and-overlay approach means the output file size will typically be larger than the original.
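A back-of-the-envelope calculation shows why the background raster dominates file size and why raising --dpi makes it grow quickly. The helpers below are hypothetical and assume a page size given in points (72 per inch) and an uncompressed 8-bit RGB bitmap; the embedded JPEG will be smaller, by an amount that depends on the page content and --jpeg-quality.

```rust
/// Pixel dimensions of a page rasterised at `dpi`, for a page size in points.
pub fn raster_dimensions(width_pt: f32, height_pt: f32, dpi: f32) -> (u32, u32) {
    let scale = dpi / 72.0;
    (
        (width_pt * scale).round() as u32,
        (height_pt * scale).round() as u32,
    )
}

/// Size of the raster before JPEG compression, at 3 bytes per pixel (RGB).
pub fn uncompressed_rgb_bytes(width_px: u32, height_px: u32) -> u64 {
    u64::from(width_px) * u64::from(height_px) * 3
}
```

A US Letter page (612 x 792 pt) at the default 96 DPI is 816 x 1056 px, about 2.6 MB of raw pixels; doubling the DPI quadruples the pixel count.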

TMT — Translation Management Tool