Please note: The algorithm descriptions in English have been automatically translated. Errors may have been introduced in this process. For the original descriptions, go to the Dutch version of the Algorithm Register.

Automatic text recognition "Loghi"

National Archives

The National Archives uses the Automatic Text Recognition (ATH) algorithm 'Loghi' to make digitised archives accessible. The algorithm automatically creates transcriptions: it converts handwritten, typed or printed text on scans of documents into digitally searchable text.

Last change on 28th of April 2026, at 13:34 (CET) | Publication Standard 1.0

Publication category: Other algorithms
Impact assessment: DPIA
Status: In use

Theme

Education and Science

Begin date

2023-04

Contact information

info@nationaalarchief.nl

Link to publication website

https://www.nationaalarchief.nl/onderzoeken/datalab-nationaal-archief/handschriftherkenning

Goal and impact

The purpose of the Automatic Text Recognition (ATH) software 'Loghi' is to automatically convert text on scans of documents into digital text (transcriptions), so that the text can be searched digitally and processed further. An example of further digital processing is recognising personal names in the digital text.

Considerations

This algorithm can automatically transcribe large volumes of documents for which the cost and turnaround time of manual transcription would be too high. It helps researchers work more efficiently and makes it possible to find connections that would otherwise be hard to discover.

Human intervention

Automatic text recognition requires a model. A model is the result of training an algorithm on a large set of data, which allows a computer to perform a task automatically. When a model is created, the scans for training, validation and testing are usually selected at random, and the selected scans are usually transcribed manually. The National Archives assesses the suitability of a model based on the error rates on the validation set and, where applicable, the test set. The National Archives also spot-checks the quality of the automatic transcriptions. This is done visually per scan and, where needed, at character level. Correcting automatic transcriptions is possible but usually too time-consuming to apply on a large scale. This is considered acceptable because the original text on the scan is also displayed.
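The random selection of scans for training, validation and testing could be sketched as follows. This is a minimal illustration and not the procedure actually used by the National Archives: the function name `split_scans`, the 80/10/10 proportions and the fixed seed are all assumptions made for the example.

```python
import random


def split_scans(scan_ids, train_frac=0.8, val_frac=0.1, seed=42):
    """Randomly split scan identifiers into training, validation and test sets.

    The fractions and the seed are illustrative assumptions, not values
    taken from the register entry.
    """
    ids = list(scan_ids)
    rng = random.Random(seed)  # local generator so the split is reproducible
    rng.shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test
```

Fixing the seed keeps the split reproducible, so that error rates reported on the validation and test sets always refer to the same scans.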

Risk management

The algorithm determines which characters appear on the scan of a document. Because the software is not flawless, it is advisable to consult the scan as well when using the automatically created transcriptions: characters may be recognised incorrectly and therefore misrepresented in the transcription. Bias in algorithms cannot be avoided entirely, so it is important to remain alert to it and to check for possible bias. The algorithm itself does not pose a risk. The risk lies in the data processed by the algorithm, which may contain personal data. Risk management therefore depends on the dataset used and the personal data it contains.

The algorithm uses data at two moments: when training the model for a specific dataset, and when actually converting the digital images to digital text.

When training the model, risk management consists of:

A Data Protection Impact Assessment (DPIA). If the archive used for training may contain personal data within the meaning of the GDPR (AVG in Dutch), a DPIA must have been carried out.
No sharing. If archives containing personal data have been used to train the model, the model will not be made available to third parties.
Error rate. The model must reach a certain reliability value. This value is expressed as the number of characters or words that are recognised incorrectly.
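A count of wrongly recognised characters is commonly normalised into a character error rate (CER): the edit distance between the manual and the automatic transcription, divided by the length of the manual one. The register entry does not say which exact metric is used, so the sketch below is an assumption about one plausible choice, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: the minimum number of character insertions,
    deletions and substitutions needed to turn one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from hypothesis
                current[j - 1] + 1,      # extra character in hypothesis
                previous[j - 1] + cost,  # substitution, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the manual transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A word error rate can be computed the same way by passing lists of words instead of strings.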

When converting digital images to digital text, risk management consists of:

A DPIA. If personal data may be present in the archive being converted, a DPIA must have been carried out.
A test set. The model must be applicable to the dataset. This is determined using a test set: a random selection of sentences is transcribed both automatically and manually, and the results are compared. If the deviation is too large, the model is not (yet) appropriate.
Reliability value. The processed scans must meet certain reliability values. The reliability value is expressed as a value between 0 and 1, where 1 indicates the highest reliability.
Manual sampling. An employee visually checks random samples from the processed batch for transcription errors.
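The last two checks above could be combined along the following lines. This is only a sketch under stated assumptions: the register does not publish the actual threshold or sample size, so `threshold=0.9`, `sample_size=5` and the function name `review_batch` are hypothetical.

```python
import random


def review_batch(reliability, threshold=0.9, sample_size=5, seed=0):
    """Select scans for follow-up after a batch has been transcribed.

    reliability: dict mapping a scan identifier to a value between 0 and 1.
    Returns the scans whose value falls below the threshold, plus a random
    sample of the whole batch for a visual check by an employee.
    The threshold and sample size are illustrative assumptions.
    """
    below = sorted(scan for scan, value in reliability.items() if value < threshold)
    rng = random.Random(seed)
    k = min(sample_size, len(reliability))
    manual_sample = rng.sample(sorted(reliability), k)
    return below, manual_sample
```

Sampling from the whole batch, rather than only from low-scoring scans, also catches errors on scans where the reported reliability value is high.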

Legal basis

Under the Archives Act, the archives transferred to the National Archives are managed by the general state archivist. The archives must be kept in good, orderly and accessible condition so that research of the archives is facilitated to the maximum extent possible. The present algorithm serves to improve the accessible state of the archives.

Links to legal bases

Archives Act: https://wetten.overheid.nl/BWBR0007376/2024-06-19/0

Elaboration on impact assessments

The scans of documents, and therefore also the digital text of the transcriptions, may contain personal data. Personal data may thus be processed both when training the model and when actually making the transcriptions. Whether this happens depends on the archive being transcribed; in many cases, archives do contain personal data. When an archive is less than 110 years old (the assumed maximum human lifespan), contains personal data, and is used as training data or is transcribed, a Data Protection Impact Assessment (DPIA) is carried out. The DPIA covers the processing and the data and is carried out before the algorithm is deployed; it does not specifically assess the algorithm itself or the training of the model.
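The age rule in the paragraph above can be written as a simple decision function. This is a sketch: the 110-year limit comes from the text, but the function name `dpia_required` and the choice to represent an archive's age by a single year (for instance that of its most recent document) are assumptions.

```python
from datetime import date

MAX_AGE_YEARS = 110  # age limit named in the register entry


def dpia_required(archive_year, contains_personal_data, today=None):
    """Return True when a DPIA is needed before training on or transcribing an archive.

    archive_year is assumed to be the year of the most recent document in
    the archive; the register entry does not define how age is measured.
    """
    today = today or date.today()
    age = today.year - archive_year
    return bool(contains_personal_data) and age < MAX_AGE_YEARS
```

For example, `dpia_required(1950, True, date(2026, 1, 1))` is true, while an archive from 1900 or one without personal data gives false.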

The algorithm is not created for a specific archive. It is generic in design and can therefore be deployed on various archives. A DPIA is done on the processing and data of a specific archive. It is not possible to list the DPIAs for all transcribed archives here, as the list would keep growing.

Impact assessment

Data Protection Impact Assessment (DPIA)

Data

The Loghi algorithm processes text from scans of (historical) documents. Depending on the archive supplied, this text can contain all kinds of data.

The algorithm has been applied in transcribing the Central Archive on Special Jurisdiction (CABR) but also in transcribing older archives. No specific data source can be designated here because the algorithm can be applied to multiple data sources/archives.

Technical design

The ATH software "Loghi" first determines where the lines of text are located. It can do so because it has been trained to detect the lines on which the text rests, the so-called baselines. This is comparable to finding the ruled lines that carry the text in a lined notebook.

Using these baselines, the entire line of text can be cut out. The cut-out line is then transcribed automatically. This is possible because, at an earlier stage, the software learned from many examples of cut-out lines of text and their corresponding manual transcriptions. That knowledge is contained in a model.
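Cutting out a line of text around a baseline can be illustrated with a simple rectangular crop. This is a deliberately simplified sketch and not Loghi's actual extraction code: the image is modelled as a plain list of pixel rows, the baseline as a list of (x, y) points, and the margins `above` and `below` are hypothetical parameters.

```python
def crop_text_line(image, baseline, above=3, below=1):
    """Cut a rectangular band around a baseline out of an image.

    image: list of rows, each row a list of pixel values.
    baseline: list of (x, y) points on which the text rests.
    above/below: number of pixel rows to keep above and below the baseline
    (illustrative values; text mostly sits above its baseline).
    """
    xs = [x for x, _ in baseline]
    ys = [y for _, y in baseline]
    height = len(image)
    width = len(image[0])
    # Clamp the band to the image borders.
    top = max(min(ys) - above, 0)
    bottom = min(max(ys) + below, height - 1)
    left = max(min(xs), 0)
    right = min(max(xs), width - 1)
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

A real implementation would need to follow slanted or curved baselines rather than take a bounding rectangle, which is one reason this should be read as an illustration only.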

To train a model, both for recognising text lines and for recognising characters, training data has to be created. To do this, a representative dataset is obtained by taking a sample from a set of scans of documents. For these scans, the software automatically determines where the baselines are and which characters are on them. The resulting automatic transcriptions contain errors, which are corrected manually. The corrected transcriptions are the training data.

Feeding the software with the training data creates a model that can predict on its own where the baselines are on a scan and which characters are on each baseline. After training, the model is evaluated with a test set. This process can be repeated several times until the desired result is obtained.

The ATH software consists of several components. Two of them use machine learning and can make predictions about the data based on examples captured in a trained model.

The first component is Laypa (https://doi.org/10.1145/3604951.3605520). Laypa uses scans that are annotated with data showing where each line of text is located on the scan of the document. The goal of the software is to predict as accurately as possible where the text lines are in a scan. This is achieved by predicting which pixels are part of a baseline.
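Predicting which pixels are part of a baseline amounts to a yes/no decision per pixel. The sketch below assumes that a model has already produced a probability for each pixel and shows only the final thresholding step. It does not use Laypa's API: the function names and the 0.5 threshold are hypothetical.

```python
def baseline_mask(probabilities, threshold=0.5):
    """Turn per-pixel baseline probabilities into a binary mask.

    probabilities: list of rows, each a list of values between 0 and 1.
    The threshold is an illustrative assumption.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def rows_with_baseline(mask):
    """Return the indices of pixel rows that contain at least one baseline pixel."""
    return [y for y, row in enumerate(mask) if any(row)]
```

Connected runs of marked pixels can then be grouped into individual baselines, a step that is left out here.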

The second component is Loghi (https://doi.org/10.1007/978-3-031-70645-5_6). Loghi learns to predict the text on lines of text it has not seen before. It does this using machine learning, based on examples of text lines and their associated transcriptions.

External provider

KNAW Huygens Institute

Link to code base

https://github.com/knaw-huc/loghi

Similar algorithm descriptions

Text recognition and document routing
Municipality of Steenwijkerland
An application is used to help our organisation digitise and manage documents. This involves converting scanned documents from an image to text. Metadata is automatically read from the text to give the document information for the route to follow in the internal process.
Last change on 16th of May 2025, at 8:36 (CET) | Publication Standard 1.0
Publication category
Other algorithms
Impact assessment
Field not filled in.
Status
In use
Voice-activated reporting
GGD IJsselland
This algorithm automatically converts spoken text into written text. Professionals can record a report, minutes or a case note during or immediately after a meeting. The system generates a draft text from this, which is then checked and edited by a member of staff before being saved.
Last change on 3rd of July 2026, at 7:12 (CET) | Publication Standard 1.0
Publication category
Other algorithms
Impact assessment
AIIA, DPIA
Status
In use
PolyAI Voicebot
Municipality of Amsterdam
PolyAI is a voicebot that can communicate with a citizen based on natural speech recognition. Poly AI uses an algorithm to recognise the subject of a question asked.
Last change on 10th of September 2025, at 9:46 (CET) | Publication Standard 1.0
Publication category
Impactful algorithms
Impact assessment
DPIA, The Ethical Guide
Status
In use
Automatic website translation.
Municipality of Reimerswaal
AI translations of websites and e-forms with Weglot
Last change on 22nd of May 2025, at 14:47 (CET) | Publication Standard 1.0
Publication category
Other algorithms
Impact assessment
Field not filled in.
Status
In use
Automatic remission
Hoogheemraadschap Hollands Noorderkwartier
When a citizen applies for automated waiver, eligibility is checked using decision rules.
Last change on 23rd of April 2024, at 8:58 (CET) | Publication Standard 1.0
Publication category
Impactful algorithms
Impact assessment
Field not filled in.
Status
In use
