Two-stage architecture

Trace OCR uses a two-stage pipeline:
  1. Detection — a deep learning model scans the full image and outputs bounding boxes around text regions
  2. Recognition — each detected region is cropped and fed to a second model that reads the characters
This separation allows you to swap detection and recognition models independently, optimizing for speed or accuracy depending on your use case.
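As an illustration of the flow (not the service's actual code), a minimal sketch of the two stages might look like this, where detector and recognizer stand in for whichever models you configure:

def run_ocr(image, detector, recognizer):
    # Stage 1: detector returns pixel-space boxes as (x0, y0, x1, y1)
    boxes = detector(image)
    results = []
    for (x0, y0, x1, y1) in boxes:
        crop = image[y0:y1, x0:x1]  # cut the region out, assuming an HxW(xC) array
        results.append(((x0, y0, x1, y1), recognizer(crop)))  # stage 2: read it
    return results

Because each stage is just a callable here, swapping the detection or recognition model is a one-argument change.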

Detection stage

The detection model takes a full document image and produces a list of bounding boxes, each containing text. You can tune detection behavior with these parameters:
Parameter              Default       Description
detection_model        db_resnet50   The detection architecture to use
assume_straight_pages  true          Skip rotation handling for faster inference on upright documents
preserve_aspect_ratio  true          Pad the image to preserve aspect ratio before detection
symmetric_padding      true          Pad symmetrically instead of bottom-right only
binary_threshold       0.1           Pixel-level threshold for the segmentation map
box_threshold          0.1           Minimum confidence to keep a detected box
detection_batch_size   2             Number of pages processed in parallel
Lower thresholds produce more boxes (higher recall, more noise). Higher thresholds produce fewer boxes (higher precision, may miss faint text).
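To make the two thresholds concrete, here is a rough sketch of segmentation-based post-processing, assuming the detection model emits a per-pixel text-probability map; the service's actual implementation may differ:

import numpy as np
from scipy import ndimage

def candidate_boxes(seg_map, binary_threshold=0.1, box_threshold=0.1):
    # binary_threshold binarizes the per-pixel probability map
    mask = seg_map >= binary_threshold
    labels, _ = ndimage.label(mask)            # connected text regions
    boxes = []
    for region in ndimage.find_objects(labels):
        score = float(seg_map[region].mean())  # region-level confidence
        if score >= box_threshold:             # box_threshold filters whole boxes
            ys, xs = region
            boxes.append((xs.start, ys.start, xs.stop, ys.stop, score))
    return boxes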

Recognition stage

Each detected text region is cropped and resized, then passed to the recognition model. The model outputs:
  • value — the recognized text string
  • confidence — a score from 0 to 1 indicating how certain the model is
Parameter               Default        Description
recognition_model       crnn_vgg16_bn  The recognition architecture to use
recognition_batch_size  128            Number of crops processed in parallel
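In practice these two output fields are what you act on downstream. For example, flagging uncertain words for human review (the cutoff here is illustrative, not part of the API):

MIN_CONFIDENCE = 0.5  # illustrative cutoff; tune for your documents

def review_queue(words):
    # words: recognition outputs with "value" and "confidence" fields
    return [w for w in words if w["confidence"] < MIN_CONFIDENCE]

words = [{"value": "Invoice", "confidence": 0.98},
         {"value": "t0ta1", "confidence": 0.31}]
print(review_queue(words))  # [{'value': 't0ta1', 'confidence': 0.31}]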

Response hierarchy

The full OCR endpoint (/ocr) returns a nested structure:
Document (array of files)
└── OCROut (per file)
    ├── name, orientation, language, dimensions
    └── pages (array)
        └── OCRPage
            └── blocks (array)
                └── OCRBlock
                    ├── geometry, detection_score
                    └── lines (array)
                        └── OCRLine
                            ├── geometry, detection_score
                            └── words (array)
                                └── OCRWord
                                    ├── value
                                    ├── geometry
                                    ├── confidence
                                    ├── detection_score
                                    └── text_orientation
Words are grouped into lines, lines into blocks, and blocks into pages. This grouping is controlled by the group_lines (default true) and group_blocks (default false) parameters on the /ocr endpoint.
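A common downstream task is flattening this hierarchy back into plain text. A minimal walker, assuming the JSON field names match the tree above:

def extract_text(ocr_out):
    # ocr_out: one OCROut entry from the /ocr response
    lines = []
    for page in ocr_out["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                lines.append(" ".join(w["value"] for w in line["words"]))
    return "\n".join(lines)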

Geometry format

All bounding boxes use normalized coordinates — floats between 0 and 1 relative to the page dimensions. For straight pages (assume_straight_pages=true), geometry is [x_min, y_min, x_max, y_max]:
"geometry": [0.12, 0.05, 0.45, 0.10]
To convert to pixel coordinates, multiply by the page dimensions:
def to_pixels(geometry, width, height):
    # geometry is [x_min, y_min, x_max, y_max], normalized to [0, 1]
    x_min_px = geometry[0] * width
    y_min_px = geometry[1] * height
    x_max_px = geometry[2] * width
    y_max_px = geometry[3] * height
    return x_min_px, y_min_px, x_max_px, y_max_px
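For example, for a 2480 × 3508 px page (A4 at 300 DPI):

box_px = to_pixels([0.12, 0.05, 0.45, 0.10], width=2480, height=3508)
# (297.6, 175.4, 1116.0, 350.8)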

Additional options

The /ocr and /kie endpoints support extra parameters:
Parameter                 Default  Description
detect_orientation        false    Estimate the page rotation angle
detect_language           false    Predict the page language
straighten_pages          false    Auto-rotate pages before detection
disable_page_orientation  false    Skip page orientation classification
disable_text_orientation  false    Skip per-crop text orientation classification
paragraph_break           0.0035   Vertical gap threshold for splitting blocks (OCR only)
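For example, to OCR a rotated scan you might combine several of these options. The endpoint path and parameter names come from the tables above, but the host, upload field name, and transport details below are illustrative assumptions:

import requests

with open("scan.pdf", "rb") as f:
    resp = requests.post(
        "https://trace.example.com/ocr",    # placeholder host
        files={"file": f},                  # assumed upload field name
        data={
            "detect_orientation": "true",   # estimate rotation angle
            "straighten_pages": "true",     # auto-rotate before detection
            "detect_language": "true",      # predict page language
        },
    )
resp.raise_for_status()
document = resp.json()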