hOCR Editor is a planned graphical editor for editing OCR results in the open standard format hOCR 1.2. The goal is to allow editing that is true to the original and lossless.

The idea came out of my Bachelor’s thesis on ‘Developing an hOCR Editor for Editing OCR Results.’ Here is a wireframe to illustrate the concept:

HocrEditor-wireframe

Introduction Link to heading

OCR engines can output results in open formats such as hOCR, ALTO und PAGE. These formats enrich plain text with additional information, such as:

  • Bounding boxes (hOCR: bbox) for positioning and dimensions of paragraphs, images, lines, words, or even characters; effectively allowing layout reconstruction.
  • Confidence scores, indicating how certain the OCR engine was about the recognition.
  • Typographic details like the baseline.

You can use OCR results to generate searchable PDFs, for example with OCRmyPDF. The resulting PDF consists of scanned pages as raster images, with an invisible, searchable, and selectable text layer overlaid.

Problem Link to heading

A fundamental issue with OCR is its error susceptibility. Incorrect words are frequently recognized. But depending on the use case, accuracy can be critical. A few examples:

  • for training OCR engines, where the so-called ground truth must be correct
  • in memory institutions digitizing historical documents
  • for retro-digitization in companies or public institutions (e.g., invoice data)

Another big issue is data loss in many OCR workflows. Much of the data (like typographic details) is lost during downstream processing (e.g., PDF generation). Converting from scan to hOCR to PDF is inherently lossy.

Even in round-trip serialization, data can be altered: simply loading and re-saving hOCR data often introduces unwanted changes. For instance, if a bounding box shifts vertically, ideally only one number in the hOCR should change. This is essential for meaningful, text-based version control (e.g., with Git). Version control, in turn, is a key part of maintaining data integrity, especially in memory institutions. Moreover, a clean round-trip is crucial for proper integration testing to ensure data remains unchanged.

Currently, there is not a single mature graphical editor for OCR results in open standard formats like hOCR, ALTO, or PAGE that avoids this data loss problem. Some OCR editors do exist (e.g., Adobe Acrobat, OmniPage, ABBYY FineReader), but they are expensive and based on proprietary result formats. Institutions like archives deliberately use open standards to avoid vendor lock-in.

Solution Link to heading

The base library HocrSharp models the entire hOCR standard (including its dialects) at the data level and solves the issue of lossy round-trip conversion. In my professional experience, I have worked on multiple projects focused specifically on solving this problem, and I am applying that knowledge here.

Built on top of it, HocrSharp.Editor is a GUI frontend enabling visual editing of OCR results: bounding boxes, text, properties, structure, and layout can be modified. Proofreading is possible. Reactive programming with Avalonia UI and a clean MVVM architecture solve the round-trip problem even at the GUI level.

Target Audience Link to heading

This tool is aimed at OCR workflows that prioritize full control, data integrity, and long-term archiving — in other words, workflows that require minimally invasive editing of OCR results. Key audiences include:

  • memory institutions, such as:
  • archives, libraries, and museums
  • the Internet Archive, which uses hOCR
  • (retro-)digitization service providers
  • scholars in the digital humanities
  • users of OCRmyPDF and Tesseract

I myself need software like this. As a genealogy enthusiast, I have scanned local history books for personal use, OCRed them with Tesseract 4.0.0-beta.1, and have already partially corrected the hOCR data manually or via script. But I also need a visual editor for further corrections and annotations — without ‘side effects’ during editing. Status

  • development resumed in early 2025
  • base library HocrSharp:
    • extensive test coverage, including standard compliance
    • essentially complete
  • frontend HocrSharp.Editor:
    • in progress, mockups available
  • open to open source

Current Goal Link to heading

I am currently looking for practical feedback:

  • Is there a need for such an editor in your field?
  • What features would you consider essential?
  • What challenges have you encountered with hOCR?

📬 Contact: mail@akopetsch.de