TMT — Translation Management Tool
TMT is a high-performance command-line utility for translating structured documents while preserving their original formatting and metadata. It supports CSV, TSV, DOCX, and PDF formats with first-class handling of complex scripts such as Devanagari.
Translation is powered by the TMT API provided by Kathmandu University’s ILPRL lab, which supports English, Nepali, and Tamang language pairs.
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| Comma-separated values | .csv | All cells translated |
| Tab-separated values | .tsv | All cells translated |
| Word document | .docx | Paragraph-level translation |
| PDF | .pdf | Requires pdf feature flag; needs a font for Devanagari rendering |
Supported Languages
| Code | Language |
|---|---|
| en | English |
| ne | Nepali |
| tmg | Tamang |
Design Philosophy
The tool follows a strict Parse → Validate → Execute lifecycle:
- CLI arguments are parsed and validated into a RuntimeConfig.
- The validated config is used to dispatch to a format-specific handler.
- Each handler extracts text, delegates translation to the TranslationService, and reconstructs the output document.
This separation keeps format concerns isolated from network and rate-limiting logic, making each layer independently testable and extensible.
Source Repository
The source code is available at github.com/razzat008/tmt-hackathon.
Created by github/razzat008.
Getting Started
This section walks you through everything you need to go from zero to translating your first document.
- Prerequisites & Building — Install the Rust toolchain, optional system libraries for PDF support, and compile the binary.
- Configuration — Set up your API token and understand the runtime validation rules.
- CLI Reference — A complete reference for every command-line argument.
Prerequisites & Building
Prerequisites
Before building TMT, ensure the following are available on your system:
- Rust toolchain (Edition 2024). Install via rustup.rs.
- C libraries (PDF support only): freetype and libpdfium must be available in your system path or linker search path.
Building
Standard Build
The default build enables support for DOCX, CSV, and TSV translation:
cargo build --release
The compiled binary will be at target/release/tmt.
Build with PDF Support
PDF support is gated behind an optional feature flag because it depends on external C libraries (libpdfium, used through the pdfium-render crate, and freetype):
cargo build --release --features pdf
Note: Without the pdf feature, the binary cannot translate .pdf files.
Quick Smoke Test
After building, verify the binary works:
./target/release/tmt --help
Global Constants
These constants are compiled into the binary and cannot be overridden at runtime:
| Constant | Value | Description |
|---|---|---|
| DEFAULT_BASE_URL | https://tmt.ilprl.ku.edu.np/lang-translate | Default API endpoint |
| MAX_FILE_SIZE_BYTES | 1,000,000 (1 MB) | Hard limit for input files |
| MAX_REQUEST_TEXT_BYTES | 5,000 | Maximum characters sent per API request |
Configuration
API Token
TMT requires an API token to communicate with the TMT translation service. You can supply it in one of two ways:
Via environment variable (recommended):
export TMT_API_TOKEN=your_secret_token_here
Or place it in a .env file in the project root (a template is provided at .env_example):
TMT_API_TOKEN=your_secret_token_here
Via CLI argument (overrides the environment variable):
tmt --api-token your_secret_token_here ...
Runtime Validation
Before translation begins, all CLI arguments are transformed into a RuntimeConfig struct via TryFrom<&Cli>. This process enforces the following rules — any violation exits with a descriptive error:
| Rule | Details |
|---|---|
| API token must be present | Fails with AppError::MissingApiToken if neither --api-token nor TMT_API_TOKEN is set, or if the value is whitespace-only. |
| Concurrency ≥ 1 | --concurrency 0 is rejected with an InvalidArgument error. |
| Source and target languages must differ | --src-lang en --tgt-lang en is rejected (case-insensitive check). |
| DPI ≥ 1 | Applies to PDF output only. |
| JPEG quality in range [1, 100] | Applies to PDF output only. |
| Font path must exist | If --font-path is provided, the path must point to an existing file. |
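The rules in the table can be sketched as a `TryFrom` implementation. This is a minimal illustration, not the project's actual code: the field names, the trimmed-down `Cli` and `AppError` types, and the use of `InvalidArgument` for every non-token rule are assumptions, and the `TMT_API_TOKEN` fallback is assumed to have been resolved into `api_token` already.

```rust
use std::convert::TryFrom;
use std::path::PathBuf;

// Hypothetical, trimmed-down stand-ins for the real `Cli`, `RuntimeConfig`, and `AppError`.
#[derive(Debug, Clone)]
pub struct Cli {
    pub api_token: Option<String>, // assumed to already include any TMT_API_TOKEN fallback
    pub src_lang: String,
    pub tgt_lang: String,
    pub concurrency: usize,
    pub dpi: u32,
    pub jpeg_quality: u8,
    pub font_path: Option<PathBuf>,
}

#[derive(Debug, PartialEq)]
pub enum AppError {
    MissingApiToken,
    InvalidArgument(String),
}

#[derive(Debug)]
pub struct RuntimeConfig {
    pub api_token: String,
    pub src_lang: String,
    pub tgt_lang: String,
    pub concurrency: usize,
    pub dpi: u32,
    pub jpeg_quality: u8,
    pub font_path: Option<PathBuf>,
}

fn invalid(msg: &str) -> AppError {
    AppError::InvalidArgument(msg.to_string())
}

impl<'a> TryFrom<&'a Cli> for RuntimeConfig {
    type Error = AppError;

    fn try_from(cli: &'a Cli) -> Result<Self, Self::Error> {
        // A missing or whitespace-only token is rejected.
        let api_token = cli
            .api_token
            .as_deref()
            .map(str::trim)
            .filter(|t| !t.is_empty())
            .ok_or(AppError::MissingApiToken)?
            .to_string();

        if cli.concurrency == 0 {
            return Err(invalid("concurrency must be >= 1"));
        }
        // Case-insensitive comparison, as described in the table.
        if cli.src_lang.eq_ignore_ascii_case(&cli.tgt_lang) {
            return Err(invalid("source and target languages must differ"));
        }
        if cli.dpi == 0 {
            return Err(invalid("dpi must be >= 1"));
        }
        if !(1..=100).contains(&cli.jpeg_quality) {
            return Err(invalid("jpeg quality must be in [1, 100]"));
        }
        if let Some(path) = &cli.font_path {
            if !path.is_file() {
                return Err(invalid("font path must point to an existing file"));
            }
        }

        Ok(RuntimeConfig {
            api_token,
            src_lang: cli.src_lang.clone(),
            tgt_lang: cli.tgt_lang.clone(),
            concurrency: cli.concurrency,
            dpi: cli.dpi,
            jpeg_quality: cli.jpeg_quality,
            font_path: cli.font_path.clone(),
        })
    }
}
```

Calling `RuntimeConfig::try_from(&cli)` on a `Cli` with `src_lang = "en"` and `tgt_lang = "EN"` returns an `InvalidArgument` error in this sketch, because the language comparison ignores ASCII case.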
Example Usage
Translate a CSV from English to Nepali
tmt \
--input data.csv \
--output data_ne.csv \
--src-lang en \
--tgt-lang ne
Translate a DOCX from Nepali to English
tmt \
--input document.docx \
--output document_en.docx \
--src-lang ne \
--tgt-lang en
Translate a PDF (requires pdf feature)
PDF translation requires a TrueType or OpenType font for rendering the target script, especially for Devanagari:
tmt \
--input report.pdf \
--output report_ne.pdf \
--src-lang en \
--tgt-lang ne \
--font-path /usr/share/fonts/truetype/Noto/NotoSansDevanagari-Regular.ttf
CLI Reference
The TMT CLI is defined by the Cli struct using the clap crate. The file format is inferred from the input file’s extension — no explicit format flag is needed.
Arguments
| Argument | Short | Type | Default | Description |
|---|---|---|---|---|
| --input | -i | PathBuf | (required) | Path to the source file (.pdf, .docx, .csv, .tsv). |
| --output | -o | PathBuf | (required) | Path where the translated file will be saved. Must share the same extension as --input. |
| --src-lang | -s | String | (required) | Source language code: en, ne, or tmg. |
| --tgt-lang | -t | String | (required) | Target language code: en, ne, or tmg. Must differ from --src-lang. |
| --base-url | — | String | https://tmt.ilprl.ku.edu.np/lang-translate | Override the API base URL. |
| --api-token | — | String | (none) | API token. Overrides the TMT_API_TOKEN environment variable. |
| --concurrency | — | usize | 2 | Maximum number of simultaneous in-flight API requests. Must be ≥ 1. |
| --rate-limit-ms | — | u64 | (none) | Fixed delay in milliseconds between API requests, regardless of concurrency. |
| --max-retries | — | u32 | 4 | Number of retries for non-rate-limit failures (e.g., network errors). |
| --font-path | — | PathBuf | (none) | Path to a TrueType/OpenType font file. Required for PDF output. |
| --dpi | — | u32 | 96 | Resolution (DPI) for PDF page rendering. Must be ≥ 1. |
| --jpeg-quality | — | u8 | 85 | JPEG compression quality (1–100) for PDF background images. |
| --verbose | — | bool | false | Enable debug-level tracing logs. |
| --debug-bboxes | — | bool | false | Draw red rectangles around detected text regions in PDF output. Useful for diagnosing layout issues. |
Concurrency & Rate Limiting
--concurrency controls the capacity of the async Semaphore inside TranslationService — it caps how many translation tasks run in parallel. --rate-limit-ms adds a fixed cooldown delay enforced by a Mutex inside TmtClient, ensuring a minimum spacing between any two requests regardless of the concurrency setting. Both can be used together for fine-grained throughput control.
PDF-Specific Options
Because the PDF pipeline reconstructs documents by rendering translated text over a rasterized background of the original page, --dpi and --jpeg-quality let you trade off between output file size and visual fidelity:
- Higher --dpi → sharper background, larger file.
- Lower --jpeg-quality → smaller file, more compression artifacts.
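To make the DPI trade-off concrete, here is a small sketch of how raster dimensions grow with `--dpi`. It relies on the PDF default of 72 user-space units per inch. The function names are hypothetical, and the real rasteriser may round differently.

```rust
/// Converts a page size in PDF points (1/72 inch) to pixel dimensions at `dpi`.
/// Hypothetical helper; rounds to the nearest pixel.
pub fn raster_size(width_pt: f64, height_pt: f64, dpi: u32) -> (u32, u32) {
    let scale = f64::from(dpi) / 72.0;
    (
        (width_pt * scale).round() as u32,
        (height_pt * scale).round() as u32,
    )
}

/// Size of the uncompressed RGB bitmap (3 bytes per pixel), before JPEG compression.
pub fn raw_rgb_bytes(width_px: u32, height_px: u32) -> u64 {
    u64::from(width_px) * u64::from(height_px) * 3
}
```

For a US Letter page (612 × 792 points), `raster_size(612.0, 792.0, 96)` gives 816 × 1056 pixels. Doubling the DPI to 192 doubles both dimensions, so the uncompressed bitmap is four times larger before `--jpeg-quality` is applied.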
Debugging Flags
- --verbose sets the tracing subscriber level to debug for the application (dependencies stay at warn).
- --debug-bboxes instructs the PDF reconstructor to draw bounding boxes around every detected text region, which helps diagnose mis-aligned or missing translated text.
Core Architecture
The TMT tool is structured as a pipeline that transforms source documents into translated outputs through a Parse → Validate → Execute lifecycle. The design deliberately decouples document format handling from the underlying translation service and network concerns.
System Overview
The application follows a modular design where the app module orchestrates the high-level flow, while specialized modules handle CLI parsing, configuration, and format-specific logic.
┌─────────────┐     ┌──────────────┐     ┌──────────────────┐     ┌──────────────┐
│   main.rs   │────▶│   app::run   │────▶│ formats::        │────▶│ Translation  │
│ (CLI parse) │     │ (lifecycle)  │     │ translate_file   │     │ Service      │
└─────────────┘     └──────────────┘     └──────────────────┘     └──────────────┘
                           │                      │
                     RuntimeConfig          Format Handler
                      (validated)       (pdf / docx / csv_tsv)
Data Flow and Lifecycle
1. Parse & Validate
The system converts the raw Cli struct into a RuntimeConfig. During this phase it performs critical safety checks:
- File existence and size: verifies the input file exists and does not exceed MAX_FILE_SIZE_BYTES (1 MB).
- Format matching: ensures the output file extension matches the input extension.
- Environment preparation: creates any necessary parent directories for the output path.
2. Format Dispatch
Once validated, formats::translate_file acts as the central router. It reads the file extension, then initialises a TmtClient and TranslationService before handing off to the correct handler.
3. Execution
The request is dispatched to one of the three format handlers — PDF, DOCX, or CSV/TSV. Each handler owns the internal structure of its file format and delegates the actual text translation to TranslationService.
Major Subsystems
| Subsystem | Responsibility | Key Types |
|---|---|---|
| CLI & Config | Parse arguments and environment into a validated runtime state | Cli, RuntimeConfig |
| App Orchestration | Manage the high-level lifecycle and filesystem safety checks | app::run, validate_input_file |
| Format Handlers | Parse document structures and reconstruct translated output | formats::pdf, formats::docx, formats::csv_tsv |
| Translation Layer | Sentence splitting, caching, and concurrency control | TranslationService, TmtClient |
Error Handling Strategy
A centralised AppError enum (defined with the thiserror crate) captures all failure modes, from IO failures and rate limits to format-specific parsing errors. Standard library and third-party errors such as std::io::Error and csv::Error are automatically converted into AppError variants via From implementations. Errors bubble up to main.rs, where they are logged before the process exits with a non-zero code.
See Error Handling for a full reference of all error variants.
Application Entry Point & Request Lifecycle
Entry Point — main.rs
The binary entry point performs two steps and nothing else:
- Parse the command line into a Cli struct (via clap).
- Call app::run(&cli) and handle any returned AppError by logging it and exiting with a non-zero code.
// Conceptual outline of main.rs
fn main() {
    let cli = Cli::parse();
    if let Err(e) = app::run(&cli) {
        eprintln!("Error: {e}");
        std::process::exit(1);
    }
}
app::run — The Lifecycle Coordinator
app::run drives the full Parse → Validate → Execute sequence:
Step 1 — Build RuntimeConfig
let config = RuntimeConfig::try_from(&cli)?;
RuntimeConfig::try_from validates every argument (see Configuration) and fails fast with a descriptive AppError on any violation.
Step 2 — Validate the Input File
validate_input_file performs filesystem-level checks before any network call is made:
- The input path must exist and be a regular file.
- The file size must not exceed MAX_FILE_SIZE_BYTES (1 MB).
- The output extension must match the input extension (for example, you cannot translate a .csv into a .docx).
- Parent directories for the output path are created if they do not exist.
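The four checks above can be sketched with the standard library alone. This is an illustrative outline rather than the project's implementation: the function signature, the shapes of the `AppError` variants, and the case-insensitive extension comparison are all assumptions.

```rust
use std::fs;
use std::io;
use std::path::Path;

const MAX_FILE_SIZE_BYTES: u64 = 1_000_000;

// Hypothetical variant shapes; the real `AppError` may carry different data.
#[derive(Debug)]
pub enum AppError {
    Io(io::Error),
    InvalidArgument(String),
    FileTooLarge { size: u64, limit: u64 },
    FormatMismatch { input: String, output: String },
}

impl From<io::Error> for AppError {
    fn from(e: io::Error) -> Self {
        AppError::Io(e)
    }
}

// Lower-cased extension, or an empty string when there is none.
fn extension_lower(path: &Path) -> String {
    path.extension()
        .and_then(|e| e.to_str())
        .unwrap_or("")
        .to_ascii_lowercase()
}

pub fn validate_input_file(input: &Path, output: &Path) -> Result<(), AppError> {
    // 1. The input must exist (metadata fails otherwise) and be a regular file.
    let meta = fs::metadata(input)?;
    if !meta.is_file() {
        return Err(AppError::InvalidArgument(format!(
            "{} is not a regular file",
            input.display()
        )));
    }
    // 2. Enforce the size limit before reading anything.
    if meta.len() > MAX_FILE_SIZE_BYTES {
        return Err(AppError::FileTooLarge {
            size: meta.len(),
            limit: MAX_FILE_SIZE_BYTES,
        });
    }
    // 3. Input and output extensions must match.
    let (in_ext, out_ext) = (extension_lower(input), extension_lower(output));
    if in_ext != out_ext {
        return Err(AppError::FormatMismatch {
            input: in_ext,
            output: out_ext,
        });
    }
    // 4. Create missing parent directories for the output path.
    if let Some(parent) = output.parent() {
        if !parent.as_os_str().is_empty() {
            fs::create_dir_all(parent)?;
        }
    }
    Ok(())
}
```

Because every check runs before any network call, a mismatched extension or an oversized file fails without touching the API or consuming quota.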
Step 3 — Dispatch to Format Handler
formats::translate_file(&config).await?;
translate_file inspects the input extension and routes to the appropriate handler:
| Extension | Handler |
|---|---|
| .csv | formats::csv_tsv::translate |
| .tsv | formats::csv_tsv::translate |
| .docx | formats::docx::translate |
| .pdf | formats::pdf::translate (requires pdf feature) |
An unrecognised extension returns AppError::UnsupportedFormat.
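The routing table can be pictured as a single `match` on the lower-cased extension. The `Handler` enum and `select_handler` function below are hypothetical names used only for illustration. The real router calls the handler modules directly and gates the PDF arm behind the pdf feature.

```rust
use std::path::Path;

// Hypothetical representation of the routing decision.
#[derive(Debug, PartialEq)]
pub enum Handler {
    CsvTsv { delimiter: u8 },
    Docx,
    Pdf,
}

#[derive(Debug, PartialEq)]
pub enum AppError {
    UnsupportedFormat(String),
}

pub fn select_handler(input: &Path) -> Result<Handler, AppError> {
    let ext = input
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("")
        .to_ascii_lowercase();

    match ext.as_str() {
        // CSV and TSV share one handler and differ only in the delimiter.
        "csv" => Ok(Handler::CsvTsv { delimiter: b',' }),
        "tsv" => Ok(Handler::CsvTsv { delimiter: b'\t' }),
        "docx" => Ok(Handler::Docx),
        // In the real build this arm would only exist with the pdf feature enabled.
        "pdf" => Ok(Handler::Pdf),
        _ => Err(AppError::UnsupportedFormat(ext)),
    }
}
```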
Initialization Sequence (Diagram)
main()
│
├─ Cli::parse()                          # clap parses argv
│
└─ app::run(&cli)
   │
   ├─ RuntimeConfig::try_from(&cli)      # validation
   │  └─ AppError on failure
   │
   ├─ validate_input_file()              # filesystem checks
   │  └─ AppError on failure
   │
   └─ formats::translate_file(&config)   # dispatch
      ├─ TmtClient::new()
      ├─ TranslationService::new()
      └─ <format handler>::translate()
Error Handling
The AppError Enum
TMT uses a single centralised error type defined in src/error.rs using the thiserror crate. Every failure mode in the application is represented as a variant of AppError, which gives callers a single type to match on and provides human-readable messages automatically via the Display trait.
Error Propagation
All fallible functions return Result<T, AppError> and use the ? operator to bubble errors upward. The chain terminates in main.rs, which logs the error and exits with a non-zero status code.
format handler error
│ ?
TranslationService error
│ ?
app::run
│ ?
main ──▶ eprintln! + process::exit(1)
Automatic Conversions
Standard library and third-party error types are automatically converted into AppError variants through From implementations:
| Source Type | AppError Variant |
|---|---|
| std::io::Error | AppError::Io |
| csv::Error | AppError::Csv |
| HTTP / network errors | AppError::Network (via TmtClient) |
| Rate-limit responses | AppError::RateLimit |
Key Error Variants
| Variant | When it occurs |
|---|---|
| MissingApiToken | No token provided via --api-token or TMT_API_TOKEN |
| InvalidArgument | A CLI argument fails a business-logic constraint (e.g. concurrency = 0) |
| FileTooLarge | The input file exceeds MAX_FILE_SIZE_BYTES |
| FormatMismatch | Input and output extensions do not match |
| UnsupportedFormat | The file extension is not handled by any format module |
| RateLimit | The API responded with HTTP 429 and backoff was exhausted |
| Io | Any filesystem read/write failure |
| Csv | A parsing error in CSV/TSV content |
| DocxParse | A structural error in a DOCX file |
| PdfParse | A structural error in a PDF file (pdf feature only) |
Adding New Error Variants
Because AppError uses thiserror, adding a new variant is straightforward:
#[derive(Debug, thiserror::Error)]
pub enum AppError {
    // ... existing variants ...
    #[error("my new error: {0}")]
    MyNewError(String),
}
Implement From<YourLibraryError> if you want automatic ? conversion from a third-party type.
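For readers unfamiliar with what thiserror generates, the sketch below writes the equivalent `Display` and `From` implementations by hand using only the standard library. With thiserror, the `#[error(...)]` and `#[from]` attributes produce code of this shape. The `read_len` helper and the exact message wording are hypothetical.

```rust
use std::fmt;
use std::io;

#[derive(Debug)]
pub enum AppError {
    Io(io::Error),
    MyNewError(String),
}

// Hand-written equivalent of thiserror's `#[error("...")]` attribute.
impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            AppError::Io(e) => write!(f, "io error: {}", e),
            AppError::MyNewError(msg) => write!(f, "my new error: {}", msg),
        }
    }
}

impl std::error::Error for AppError {}

// Hand-written equivalent of thiserror's `#[from]` attribute.
impl From<io::Error> for AppError {
    fn from(e: io::Error) -> Self {
        AppError::Io(e)
    }
}

// Hypothetical helper: `?` converts the io::Error into AppError::Io via `From`.
pub fn read_len(path: &str) -> Result<usize, AppError> {
    let text = std::fs::read_to_string(path)?;
    Ok(text.len())
}
```

The `From` implementation is what lets `?` work on a `std::io::Error` inside a function returning `Result<_, AppError>`. Without it, every call site would need an explicit `map_err`.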
Translation Service Layer
The Translation Service Layer sits between the document format handlers and the external TMT API. Its job is to take large, unstructured text extracted from a document and produce translated text efficiently — without hammering the API or doing redundant work.
Responsibilities
- Text segmentation: split raw strings into sentences using linguistic delimiters (., !, ?, and the Devanagari danda ।).
- Concurrency management: an async Semaphore limits the number of simultaneous API requests.
- Deduplication & caching: identical sentences within or across documents are translated only once.
- In-flight merging: if two tasks request the same sentence at the same time, only one network call is made; the second waits on a tokio::sync::oneshot channel for the result.
Service Constraints
| Constraint | Value / Mechanism |
|---|---|
| Max request size | MAX_REQUEST_TEXT_BYTES (5,000 chars) |
| Concurrency | tokio::sync::Semaphore (capacity = --concurrency) |
| Sentence delimiters | . ! ? । |
| Minimum fragment length | 5 characters (shorter fragments are merged) |
| Deduplication | tokio::sync::oneshot channels in a shared HashMap |
Pipeline Flow
Document Handler
      │
      │  raw text block
      ▼
TranslationService::translate_text()
      │
      ├─ split_sentences()            # split on . ! ? ।
      │
      ├─ for each sentence:
      │    ├─ cache hit?   ──▶ return cached result
      │    ├─ in-flight?   ──▶ wait on oneshot channel
      │    └─ new request  ──▶ acquire Semaphore permit
      │                              │
      │                       TmtClient::translate()
      │                              │
      │                       (rate limiting / retry)
      │                              │
      │                       translated sentence
      │
      └─ rejoin sentences ──▶ translated text block
TranslationService & Sentence Splitting
TranslationService
TranslationService (defined in src/translate/service.rs) is the single public interface that format handlers call. It holds:
- An Arc<TmtClient> for HTTP communication.
- A Semaphore to cap concurrent requests.
- A Mutex<HashMap<CacheKey, CacheEntry>> for deduplication and caching.
translate_text
The primary method signature (conceptually):
pub async fn translate_text(
    &self,
    text: &str,
    src_lang: &str,
    tgt_lang: &str,
) -> Result<String, AppError>
Internally it:
- Calls split_sentences(text) to obtain a list of sentence fragments.
- Spawns a concurrent task per fragment (bounded by the Semaphore).
- Each task checks the cache / in-flight map before making a network call.
- Joins all translated fragments back into a single string.
Cache Key
Each entry in the internal cache is keyed on a CacheKey:
struct CacheKey {
    text: String,
    src_lang: String,
    tgt_lang: String,
}
This means the same sentence translated between different language pairs is cached independently.
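A minimal, single-threaded sketch of how such a key behaves in a `HashMap` follows. The `TranslationCache` type, its `misses` counter, and the closure standing in for the API call are hypothetical. The real service wraps the map in a `Mutex` and also tracks in-flight requests, which this sketch omits.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CacheKey {
    text: String,
    src_lang: String,
    tgt_lang: String,
}

// Hypothetical cache wrapper; `misses` counts how often the translator was invoked.
struct TranslationCache {
    entries: HashMap<CacheKey, String>,
    pub misses: usize,
}

impl TranslationCache {
    fn new() -> Self {
        TranslationCache {
            entries: HashMap::new(),
            misses: 0,
        }
    }

    /// Returns the cached translation, or calls `translate` once and stores the result.
    fn get_or_translate<F>(&mut self, text: &str, src: &str, tgt: &str, translate: F) -> String
    where
        F: FnOnce(&str) -> String,
    {
        let key = CacheKey {
            text: text.to_string(),
            src_lang: src.to_string(),
            tgt_lang: tgt.to_string(),
        };
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone();
        }
        self.misses += 1;
        let translated = translate(text);
        self.entries.insert(key, translated.clone());
        translated
    }
}
```

Looking up "Hello." for en→ne twice invokes the translator once, while the same text for en→tmg is a separate key and triggers a second call.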
split_sentences
Defined in src/translate/sentence.rs, split_sentences divides a text block on the following delimiters:
| Delimiter | Meaning |
|---|---|
| . | Latin full stop |
| ! | Exclamation mark |
| ? | Question mark |
| । | Devanagari danda (used in Nepali and Tamang) |
Fragment Merging
Short fragments (fewer than 5 characters) are merged with the next fragment before being sent to the API. This avoids wasting API quota on punctuation-only or single-word fragments and ensures the server has enough context for accurate translation.
Example
Input:
Hello world. How are you? I am fine!
Fragments after splitting and merging:
["Hello world.", "How are you?", "I am fine!"]
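The splitting and merging behaviour can be approximated as two passes: cut after every delimiter, then fold short fragments into the one that follows. This is a sketch under assumptions, not the code in src/translate/sentence.rs. In particular, joining merged fragments with a single space and attaching a trailing short fragment to the previous one are guesses about edge cases the text does not specify.

```rust
const DELIMITERS: [char; 4] = ['.', '!', '?', '।'];
const MIN_FRAGMENT_CHARS: usize = 5;

pub fn split_sentences(text: &str) -> Vec<String> {
    // Pass 1: cut after every delimiter, trimming surrounding whitespace.
    let mut raw: Vec<String> = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if DELIMITERS.contains(&ch) {
            let piece = current.trim().to_string();
            if !piece.is_empty() {
                raw.push(piece);
            }
            current.clear();
        }
    }
    let tail = current.trim().to_string();
    if !tail.is_empty() {
        raw.push(tail);
    }

    // Pass 2: a fragment shorter than the minimum is held back and
    // merged with the next fragment.
    let mut merged: Vec<String> = Vec::new();
    let mut pending = String::new();
    for piece in raw {
        if !pending.is_empty() {
            pending.push(' ');
        }
        pending.push_str(&piece);
        if pending.chars().count() >= MIN_FRAGMENT_CHARS {
            merged.push(std::mem::take(&mut pending));
        }
    }
    // A short fragment at the very end has no successor (assumption:
    // attach it to the previous fragment, or keep it if it is the only one).
    if !pending.is_empty() {
        match merged.pop() {
            Some(mut last) => {
                last.push(' ');
                last.push_str(&pending);
                merged.push(last);
            }
            None => merged.push(pending),
        }
    }
    merged
}
```

With this sketch, the input from the example yields the same three fragments, while "Hi. How are you?" yields a single fragment because "Hi." is shorter than five characters and is merged forward.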
TMT API Client
Overview
TmtClient (defined in src/tmt/mod.rs) is the low-level HTTP client responsible for communicating with the TMT translation API. It is initialised once and shared (via Arc) across all concurrent translation tasks.
Responsibilities
- Construct TmtRequest objects from text and language pair.
- Execute HTTP POST requests to the configured base_url.
- Parse TmtResponse and map HTTP status codes to typed TmtError variants.
- Enforce a minimum inter-request delay via a Mutex-protected timestamp.
Request Format
Each API call sends a JSON body of roughly this shape:
{
"text": "Hello world.",
"src_lang": "en",
"tgt_lang": "ne"
}
The text field is limited to MAX_REQUEST_TEXT_BYTES (5,000 characters). Longer inputs must be split upstream by TranslationService.
Response Handling
A successful response returns the translated text. Failure responses are mapped as follows:
| HTTP Status | TmtError / AppError |
|---|---|
| 200 OK | Success |
| 429 Too Many Requests | RateLimit (triggers backoff) |
| 5xx | Network (triggers retry) |
| Other | UnexpectedStatus |
Rate-Limiting Gate
TmtClient holds a Mutex<Instant> recording the time of the last request. When --rate-limit-ms is configured, each call acquires the mutex, sleeps until the required interval has elapsed since the last request, fires the HTTP call, and then updates the stored instant before releasing the mutex. This serialises the delay calculation even when multiple tokio tasks are running concurrently.
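The gate can be illustrated with a blocking, standard-library stand-in. The real client is async and uses tokio; the `RateGate` type, its method names, and the use of `Option<Instant>` for "no request yet" are assumptions made for this sketch.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Hypothetical blocking stand-in for the async gate inside TmtClient.
pub struct RateGate {
    min_interval: Duration,
    last_request: Mutex<Option<Instant>>,
}

impl RateGate {
    pub fn new(min_interval: Duration) -> Self {
        RateGate {
            min_interval,
            last_request: Mutex::new(None),
        }
    }

    /// How long a caller arriving at `now` still has to wait.
    pub fn remaining(last: Option<Instant>, now: Instant, min_interval: Duration) -> Duration {
        match last {
            Some(prev) => min_interval.saturating_sub(now.saturating_duration_since(prev)),
            None => Duration::ZERO,
        }
    }

    /// Holds the lock while sleeping and calling, so concurrent callers queue up
    /// and each request starts at least `min_interval` after the previous one ended.
    pub fn call<T>(&self, request: impl FnOnce() -> T) -> T {
        let mut last = self.last_request.lock().unwrap();
        let wait = Self::remaining(*last, Instant::now(), self.min_interval);
        if !wait.is_zero() {
            std::thread::sleep(wait);
        }
        let result = request();
        *last = Some(Instant::now());
        result
    }
}
```

Keeping the lock held across the sleep and the call is what serialises the spacing. If the lock were released before sleeping, two tasks could read the same timestamp and fire together.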
Rate Limiting & Backoff
TMT implements a multi-layer strategy to stay within API rate limits and recover gracefully from transient failures.
Layer 1 — Fixed Rate Limit Delay
When --rate-limit-ms <N> is provided, TmtClient inserts a mandatory sleep of at least N milliseconds between any two consecutive requests. This is enforced by a Mutex<Instant> (see TMT API Client).
Layer 2 — Exponential Backoff with Jitter
AsyncGlobalBackoffState tracks the history of failures and calculates a retry delay using jittered exponential backoff:
delay = min(BASE_DELAY * 2^failures, MAX_DELAY) + random_jitter
- Streak tracking — consecutive failures increase the exponent; a successful request resets the streak.
- Jitter — randomness is added to prevent thundering-herd problems when many tasks hit the API simultaneously.
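The formula and streak tracking can be sketched as follows. The constant values, the `BackoffState` type, and the choice to pass jitter in as a parameter (so the example stays deterministic and needs no random-number crate) are assumptions; the real `AsyncGlobalBackoffState` may differ on all three.

```rust
use std::time::Duration;

// Hypothetical values; the document does not state the real constants.
const BASE_DELAY: Duration = Duration::from_millis(500);
const MAX_DELAY: Duration = Duration::from_secs(30);

/// delay = min(BASE_DELAY * 2^failures, MAX_DELAY) + jitter
pub fn backoff_delay(failures: u32, jitter: Duration) -> Duration {
    // Saturate instead of overflowing for very long failure streaks.
    let factor = 2u32.checked_pow(failures).unwrap_or(u32::MAX);
    let exponential = BASE_DELAY.checked_mul(factor).unwrap_or(MAX_DELAY);
    exponential.min(MAX_DELAY) + jitter
}

/// Tracks the current streak of consecutive failures.
pub struct BackoffState {
    failures: u32,
}

impl BackoffState {
    pub fn new() -> Self {
        BackoffState { failures: 0 }
    }

    /// Returns the delay for this failure, then lengthens the streak.
    pub fn on_failure(&mut self, jitter: Duration) -> Duration {
        let delay = backoff_delay(self.failures, jitter);
        self.failures = self.failures.saturating_add(1);
        delay
    }

    /// A successful request resets the streak.
    pub fn on_success(&mut self) {
        self.failures = 0;
    }
}
```

With these assumed constants and zero jitter, three consecutive failures produce delays of 0.5 s, 1 s, and 2 s, and the delay is capped at 30 s however long the streak grows.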
Layer 3 — Retry-After Header Parsing
When the API responds with HTTP 429 (Too Many Requests), the response may include a Retry-After header specifying how many seconds to wait. TMT parses this header and uses its value as the minimum delay before the next attempt, overriding the backoff calculation if the server-supplied value is larger.
HTTP 429
Retry-After: 30
TMT will wait at least 30 seconds before retrying that request.
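A sketch of the "use the larger of the two delays" rule is shown below. It handles only the integer (delta-seconds) form of the header; the HTTP-date form would need a date parser and is left out. Both function names are hypothetical.

```rust
use std::time::Duration;

/// Parses the integer-seconds form of Retry-After.
/// Returns None for a missing, malformed, or HTTP-date value (not handled in this sketch).
pub fn parse_retry_after_secs(header: Option<&str>) -> Option<Duration> {
    header?.trim().parse::<u64>().ok().map(Duration::from_secs)
}

/// The server-supplied value acts as a floor on the computed backoff.
pub fn next_delay(backoff: Duration, retry_after: Option<Duration>) -> Duration {
    match retry_after {
        Some(server) => backoff.max(server),
        None => backoff,
    }
}
```

With a computed backoff of 4 seconds and `Retry-After: 30`, `next_delay` returns 30 seconds. With a computed backoff of 45 seconds, the backoff wins because it is already longer than the server's minimum.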
--max-retries
Non-rate-limit failures (e.g., network timeouts, HTTP 5xx) are retried up to --max-retries times (default: 4). Rate-limit (429) failures use backoff and do not count against this limit — they are retried indefinitely until the backoff delay is served.
Summary
Request fails
    │
    ├─ HTTP 429 ──▶ parse Retry-After ──▶ sleep ──▶ retry (no limit)
    │
    └─ Other error
          │
          ├─ retries remaining? ──▶ exponential backoff + jitter ──▶ retry
          │
          └─ retries exhausted ──▶ AppError::Network (propagated up)
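The decision tree above reduces to a small pure function. The `Failure` and `Action` enums and the `next_action` name are hypothetical and exist only to make the branching explicit. Delay computation is left out.

```rust
// Hypothetical classification of a failed request.
#[derive(Debug, PartialEq)]
pub enum Failure {
    RateLimited, // HTTP 429
    Other,       // network error, HTTP 5xx, ...
}

#[derive(Debug, PartialEq)]
pub enum Action {
    RetryAfterDelay,
    GiveUp,
}

/// `retries_used` counts only non-rate-limit retries already performed.
pub fn next_action(failure: &Failure, retries_used: u32, max_retries: u32) -> Action {
    match failure {
        // 429 responses never consume the retry budget.
        Failure::RateLimited => Action::RetryAfterDelay,
        Failure::Other if retries_used < max_retries => Action::RetryAfterDelay,
        Failure::Other => Action::GiveUp,
    }
}
```

With the default `--max-retries 4`, a fifth consecutive network failure (`retries_used == 4`) yields `GiveUp`, whereas a 429 at the same point still yields `RetryAfterDelay`.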
Format Handlers Overview
Format handlers live in the src/formats/ module. Each handler is responsible for:
- Parsing the source document structure (cells, paragraphs, PDF content streams, etc.).
- Extracting text units to translate.
- Delegating translation to TranslationService.
- Reconstructing the output document with translated text, preserving all non-text structure.
The central router is formats::translate_file, which reads the file extension from RuntimeConfig and dispatches to the correct handler. An unrecognised extension returns AppError::UnsupportedFormat.
Handlers at a Glance
| Format | Handler module | Key challenge |
|---|---|---|
| CSV / TSV | formats::csv_tsv | Preserve column structure across all rows |
| DOCX | formats::docx | Translate paragraph runs while keeping styles, numbering, and tables intact |
| PDF | formats::pdf | Parse content streams, rasterise background, render translated text at correct positions |
CSV & TSV Translation
The CSV/TSV handler (src/formats/csv_tsv.rs) translates delimited text files cell by cell while preserving the row/column structure and any header rows.
How It Works
- Read: The file is parsed using the csv crate. For TSV files the delimiter is changed to \t.
- Translate: Each non-empty cell value is sent to TranslationService::translate_text. Empty cells are passed through unchanged.
- Write: Translated cell values are written to the output file using the same delimiter and quoting rules as the input.
Because cell values are translated independently, the TranslationService cache is particularly effective here: repeated values (e.g. category names appearing in every row) are translated once and served from cache on subsequent rows.
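The translate step can be pictured as a pass over already-parsed rows. This sketch leaves out the csv crate and the async service entirely: `translate_rows` is a hypothetical function, and the closure stands in for `TranslationService::translate_text`.

```rust
/// Applies `translate` to every non-empty cell and keeps the row/column shape.
/// Empty cells are copied through without calling the translator.
pub fn translate_rows<F>(rows: &[Vec<String>], mut translate: F) -> Vec<Vec<String>>
where
    F: FnMut(&str) -> String,
{
    let mut out = Vec::with_capacity(rows.len());
    for row in rows {
        let mut new_row = Vec::with_capacity(row.len());
        for cell in row {
            if cell.is_empty() {
                new_row.push(cell.clone());
            } else {
                new_row.push(translate(cell));
            }
        }
        out.push(new_row);
    }
    out
}
```

Because the output has exactly as many rows and columns as the input, writing it back with the same delimiter preserves the table layout.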
Error Handling
Parsing errors from the csv crate are automatically converted to AppError::Csv via the From implementation in src/error.rs.
Limitations
- The handler does not attempt to detect or skip header rows automatically. All cells, including headers, are sent for translation. If you want to preserve English headers, pre-process the file before passing it to TMT.
- Cells exceeding MAX_REQUEST_TEXT_BYTES (5,000 characters) are chunked by TranslationService at sentence boundaries before being sent.
DOCX Translation
The DOCX handler (src/formats/docx.rs) translates Microsoft Word documents at the paragraph level while leaving all formatting, styles, numbering, images, and tables structurally intact.
How It Works
A .docx file is a ZIP archive containing XML files. The handler:
- Unpacks the archive into memory.
- Parses word/document.xml to locate paragraph (<w:p>) elements.
- Extracts the concatenated text of each paragraph’s runs (<w:r><w:t>).
- Translates each paragraph’s text via TranslationService::translate_text.
- Reconstructs the paragraph by replacing the text content of its runs with the translated text while keeping all run properties (<w:rPr>, bold, italic, font, etc.) unchanged.
- Repacks the modified XML back into a ZIP archive and writes it to the output path.
Formatting Preservation
Run-level properties (font, size, bold, colour) are preserved because the handler only modifies <w:t> text nodes, not the surrounding <w:rPr> elements. Paragraph-level properties (<w:pPr>) such as alignment, spacing, and list numbering are also untouched.
Limitations
- If a paragraph’s runs have different formatting (e.g. a word in bold mid-sentence), the translated output may collapse those runs into a single run to simplify reconstruction. Visual differences may appear in mixed-format paragraphs.
- Text inside text boxes, headers, and footers may not be translated in the current implementation. Check the source for which XML parts are currently processed.
PDF Translation Pipeline
Requires the pdf feature flag:
cargo build --release --features pdf
The PDF pipeline is the most complex format handler in TMT. Because PDFs encode text as positioned glyphs rather than editable strings, a full parse-rasterise-render cycle is required.
Pipeline Stages
1. Parsing (src/formats/pdf/parser.rs)
The parser walks the PDF content stream operator by operator, identifying text-drawing operators and extracting:
- The text string for each operator.
- The bounding box (Bbox): the page coordinates and dimensions of that text region.
- Font information referenced by the operator.
This produces a list of TextBlock structs, each containing a bounding box and its original text.
2. Translation
Each TextBlock’s text is sent to TranslationService::translate_text. Blocks are processed concurrently up to --concurrency in parallel.
3. Reconstruction (src/formats/pdf/reconstructor.rs)
For each page:
- The original page is rasterised to a bitmap at the configured --dpi. This raster becomes the background image.
- The background is JPEG-compressed at --jpeg-quality and embedded in the new PDF page.
- White rectangles are drawn over each original text bounding box to blank out the source text on the background.
- The translated text is rendered at the same bounding box position using the font specified by --font-path.
4. Font Management (src/formats/pdf/fonts.rs)
Because translated Devanagari text cannot be rendered with a Latin font, a custom TrueType/OpenType font must be supplied via --font-path. The font manager loads the font, subsets it for the glyphs actually used, and embeds it in the output PDF.
Configuration Reference
| Flag | Effect on PDF pipeline |
|---|---|
--font-path | (required) Font used to render translated text |
--dpi | Resolution of the rasterised background (default: 96) |
--jpeg-quality | JPEG compression of background images (default: 85) |
--debug-bboxes | Draw red rectangles around all detected text regions |
Debugging Layout Issues
If translated text appears misaligned or bounding boxes are wrong:
- Run with --debug-bboxes to visualise where the parser detected text regions.
- Increase --dpi for a higher-fidelity background that makes alignment easier to judge.
- Check that the provided font contains the glyphs required for the target language.
Limitations
- PDFs with scanned/image-only pages contain no text operators; the parser will find nothing to translate on those pages.
- Complex ligatures, and scripts other than Devanagari (right-to-left scripts in particular), may not render correctly depending on the chosen font and the PDF rendering library’s shaping support.
- The rasterise-and-overlay approach means the output file size will typically be larger than the original.
Testing
TMT’s test suite lives in the tests/ directory and covers two main areas.
Configuration Validation Tests (tests/config_validation.rs)
These tests exercise the RuntimeConfig::try_from validation logic by constructing Cli structs with deliberate constraint violations and asserting that the correct AppError variant is returned.
Covered scenarios include:
- Missing API token (neither --api-token nor TMT_API_TOKEN set).
- Whitespace-only API token.
- --concurrency 0 (must be ≥ 1).
- Identical source and target language codes.
- --dpi 0 (must be ≥ 1).
- --jpeg-quality 0 and --jpeg-quality 101 (must be in [1, 100]).
- --font-path pointing to a non-existent file.
Retry-After Parsing & Sentence Splitting Tests
tests/parse_retry_after.rs
Validates the HTTP Retry-After header parser used by TmtClient. Tests cover:
- Integer values (Retry-After: 30).
- HTTP-date values (Retry-After: Tue, 21 Oct 2025 07:28:00 GMT).
- Missing or malformed headers.
tests/sentence_split.rs
Validates split_sentences from src/translate/sentence.rs. Tests cover:
- Splitting on ., !, ?.
- Splitting on the Devanagari danda ।.
- Fragment merging behaviour for short (<5 character) fragments.
- Empty strings and strings with no delimiters.
Running the Tests
# All tests
cargo test
# A specific test file
cargo test --test config_validation
# With PDF feature enabled
cargo test --features pdf
Adding New Tests
Integration tests belong in tests/. Unit tests for a module can be added inline in the source file in a #[cfg(test)] block:
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn my_test() {
        // ...
    }
}
Glossary
AppError
The central error enum defined in src/error.rs. All fallible operations in the application return Result<T, AppError>.
AsyncGlobalBackoffState
The struct that tracks failure streaks and computes jittered exponential backoff delays between API retries.
Bbox
Bounding box: a rectangle expressed in PDF page coordinates (x, y, width, height) that describes where a piece of text is positioned on a page. Used by the PDF parser and reconstructor.
CacheKey
A composite key of (text, src_lang, tgt_lang) used to look up previously translated sentences in TranslationService’s internal cache.
Cli
The struct derived by clap from raw command-line arguments. It is the unvalidated precursor to RuntimeConfig.
Danda (।)
The Devanagari sentence-ending punctuation mark, equivalent to a full stop in Latin scripts. TMT’s sentence splitter recognises it as a sentence boundary.
formats::translate_file
The top-level dispatch function in src/formats/mod.rs. It inspects the input file extension and routes execution to the correct format handler.
RuntimeConfig
The validated configuration struct produced by RuntimeConfig::try_from(&cli). It is the single source of truth for the application’s runtime state and is passed to every subsystem.
Semaphore
A tokio::sync::Semaphore used by TranslationService to cap the number of concurrent in-flight API requests to the value specified by --concurrency.
split_sentences
The function in src/translate/sentence.rs that divides a text string into sentence fragments on ., !, ?, and ।.
TextBlock
A struct produced by the PDF parser that pairs a Bbox with the extracted text string for one text-drawing operator on a PDF page.
TMT API
The external HTTP translation service provided by Kathmandu University’s ILPRL lab, accessible at https://tmt.ilprl.ku.edu.np/lang-translate. TMT the CLI tool is a client for this API.
TmtClient
The low-level HTTP client in src/tmt/mod.rs that sends TmtRequest objects to the TMT API and parses TmtResponse objects back.
TranslationService
The high-level orchestration layer in src/translate/service.rs. It wraps TmtClient and adds sentence splitting, caching, deduplication, and concurrency control.