Please note: The algorithm descriptions in English have been automatically translated. Errors may have been introduced in this process. For the original descriptions, go to the Dutch version of the Algorithm Register.
Automatic text recognition "Loghi"
- Publication category
- Other algorithms
- Impact assessment
- DPIA
- Status
- In use
General information
Theme
Begin date
Contact information
Link to publication website
Responsible use
Goal and impact
The purpose of the Automatic Text Recognition (ATH) software 'Loghi' is to automatically convert text on scans of documents into digital text (transcripts), so that the text can be searched digitally and processed further. An example of such further processing is recognising personal names in the digital text.
Considerations
This algorithm is capable of automatically transcribing large volumes of documents where the cost and turnaround time of human input would be too high. It helps researchers conduct research more efficiently and allows for other connections to be made.
Human intervention
For automatic text recognition, a model is created. A model is the result of training an algorithm on a large set of data, which allows computers to perform tasks automatically. When creating the model, scans for training, validation and testing are usually selected at random. The selected scans are usually transcribed manually for this purpose. The National Archives assesses the suitability of a model based on the error rates on the validation set and possibly the test set. The National Archives also spot-checks the quality of automatic transcriptions. This is done visually per scan and possibly at character level. Correction of automatic transcriptions is possible but usually too time-consuming to apply on a large scale. This is considered acceptable because the original text on the scan is also displayed.
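The random selection of scans for training, validation and testing described above could look roughly like the following. This is a minimal sketch, not the National Archives' procedure: the function name `split_scans`, the 80/10/10 proportions, the fixed seed and the file names are all assumptions made for illustration.

```python
import random


def split_scans(scan_ids, train_frac=0.8, val_frac=0.1, seed=42):
    """Randomly split scan identifiers into training, validation and test sets.

    The fractions and the seed are illustrative assumptions, not values
    taken from the register entry.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a share for the test set")
    shuffled = list(scan_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything that is left over
    return train, val, test


# Hypothetical file names, for illustration only.
scans = [f"scan_{i:04d}.jpg" for i in range(100)]
train, val, test = split_scans(scans)
print(len(train), len(val), len(test))
```

Keeping the three sets disjoint matters: the error rates on the validation and test sets only say something about the model if those scans were never used for training.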
Risk management
The algorithm determines which characters are on the scan of a document. When using the automatically created transcriptions, it is advisable to also consult the scan, because the software is not flawless. Characters may be recognised incorrectly and therefore be misrepresented in the transcription. Bias in algorithms cannot be avoided entirely, so it is important to reflect on it and to check for possible bias. The algorithm itself does not pose a risk. The data processed by the algorithm, which may contain personal data, does pose a risk. Risk management therefore depends on the dataset used and the personal data it contains.
Data is used by the algorithm at two points: when training the model for a specific dataset, and when actually converting the digital images to digital text.
When training the model, risk management consists of:
- A Data Protection Impact Assessment (DPIA). If the archive used for training may contain personal data within the meaning of the GDPR (AVG in Dutch), a DPIA must have been carried out.
- No sharing. If archives containing personal data have been used to train the model, the model will not be made available to third parties.
- Error rate. The model must reach a certain reliability value. The reliability value is expressed as the number of characters or words that were recognised incorrectly.
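A common way to count incorrectly recognised characters or words is the edit distance between the manual and the automatic transcription, divided by the length of the manual one. The sketch below shows that calculation. Whether the National Archives use exactly this measure is an assumption; the function names and the example sentence are invented for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: the minimum number of insertions, deletions
    and substitutions needed to turn `hypothesis` into `reference`.
    Works on strings (characters) and on lists (e.g. words)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Character errors divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word errors divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Invented example: one digit is misread in a 16-character line.
manual = "den 3 maart 1947"
automatic = "den 8 maart 1947"
print(character_error_rate(manual, automatic))  # 1 error / 16 characters = 0.0625
print(word_error_rate(manual, automatic))       # 1 error / 4 words = 0.25
```

Lower is better for both rates. A single misread character counts as one full word error, so the word error rate is usually the higher of the two.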
When converting digital images to digital text, risk management consists of:
- A DPIA. If personal data may be present in the archive being converted, a DPIA must have been carried out.
- A test set. The model must be applicable to the dataset. This is determined using a test set: randomly selected sentences are transcribed both automatically and manually, and the results are compared. If the deviation is too large, the model is not (yet) suitable.
- Reliability value. The processed scans must meet certain reliability values. The reliability value is expressed as a value between 0 and 1, where 1 is good.
- Manual sampling. An employee visually checks a random sample of the processed batch for transcription errors.
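The reliability check and the manual sampling in the list above can be sketched as two small steps: flag scans whose reliability value falls below a threshold, and draw a random sample for visual inspection. This is an illustration only. The threshold of 0.9, the sample size, the function names and the scan names are assumptions and do not come from the register entry.

```python
import random


def flag_low_reliability(reliability_by_scan, threshold=0.9):
    """Return the scans whose reliability value (0 to 1, 1 is good) is below
    the threshold. The threshold of 0.9 is an assumed example value."""
    for scan, value in reliability_by_scan.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"reliability for {scan} is outside 0..1: {value}")
    return sorted(s for s, v in reliability_by_scan.items() if v < threshold)


def draw_manual_sample(scan_ids, sample_size, seed=None):
    """Pick scans at random, without replacement, for a visual check."""
    ids = sorted(scan_ids)
    return random.Random(seed).sample(ids, min(sample_size, len(ids)))


# Hypothetical batch with invented reliability values.
batch = {"scan_a": 0.97, "scan_b": 0.82, "scan_c": 0.93, "scan_d": 0.64}
print(flag_low_reliability(batch))            # ['scan_b', 'scan_d']
print(draw_manual_sample(batch, 2, seed=1))   # two scans chosen at random
```

The sample is drawn from the whole batch rather than only from the flagged scans, so the manual check can also catch errors that the reliability value does not reveal.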
Legal basis
Under the Archives Act, the archives transferred to the National Archives are managed by the general state archivist. The archives must be kept in good, orderly and accessible condition so that research of the archives is facilitated to the maximum extent possible. The present algorithm serves to improve the accessible state of the archives.
Links to legal bases
Elaboration on impact assessments
The scans of documents, and therefore also the digital text of the transcription, may contain personal data. Personal data may therefore be processed both when training the model and when actually making the transcriptions. This depends on the archive being transcribed. In many cases, archives contain personal data. When an archive is less than 110 years old (the assumed maximum age of a human being), contains personal data and is used as training data or is transcribed, a Data Protection Impact Assessment (DPIA) is carried out. The DPIA is carried out before the algorithm is deployed and covers the processing and the data, not the algorithm itself or the training of the model.
The algorithm is not created for a specific archive. It is generic in design and can therefore be deployed on various archives. A DPIA covers the processing and data of a specific archive. It is not possible to include the DPIAs for all transcribed archives here, as the list would keep growing.
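The condition for carrying out a DPIA, as described in this section, can be written as a simple rule. The sketch below is an illustration and not a legal test. It assumes that the age of an archive is measured from the year of its most recent document, which the text does not specify; the function and parameter names are invented.

```python
from datetime import date

# Assumed maximum age of a human being, as stated in the text.
MAX_HUMAN_AGE_YEARS = 110


def dpia_required(newest_document_year, contains_personal_data,
                  used_for_training_or_transcription, today=None):
    """Return True when a DPIA is carried out under the rule described above.

    Assumption: the archive's age is counted from the year of its most
    recent document. The register entry does not say how age is measured.
    """
    today = today or date.today()
    archive_age = today.year - newest_document_year
    return (archive_age < MAX_HUMAN_AGE_YEARS
            and contains_personal_data
            and used_for_training_or_transcription)


reference_day = date(2025, 1, 1)
print(dpia_required(1950, True, True, today=reference_day))   # True: 75 years old
print(dpia_required(1850, True, True, today=reference_day))   # False: 175 years old
print(dpia_required(1950, False, True, today=reference_day))  # False: no personal data
```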
Impact assessment
Operations
Data
The algorithm Loghi processes text from scans of (historical) documents. Depending on the archives offered, this can be all kinds of data.
The algorithm has been applied in transcribing the Central Archive on Special Jurisdiction (CABR) but also in transcribing older archives. No specific data source can be designated here because the algorithm can be applied to multiple data sources/archives.
Technical design
The ATH software "Loghi" first determines where the lines of text are located. The software can determine this because it is trained to detect the lines on which the text rests: the so-called baselines. This can be compared to finding the ruled lines on which text is written on lined paper.
Using these baselines, the entire line of text can be cut out and then transcribed automatically. This is possible because, at an earlier stage, the software learned from many examples of cut-out lines of text and their corresponding manual transcriptions. That knowledge is contained in a model.
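To illustrate how a baseline can be used to cut out a line of text, the sketch below crops a rectangle around a baseline from an image stored as a list of pixel rows. It is a simplification and not Loghi's actual cutting logic, which the description does not detail: the function name `crop_line`, the margins above and below the baseline, and the rectangular shape of the cut-out are all assumptions.

```python
def crop_line(image, baseline, above=3, below=1):
    """Cut a rectangular region around a baseline out of an image.

    image    -- list of rows, each row a list of pixel values
    baseline -- list of (x, y) points along the line the text rests on
    above    -- rows to keep above the baseline (room for letter bodies)
    below    -- rows to keep below the baseline (room for descenders)

    The margins are illustrative; real values depend on the script size.
    """
    height, width = len(image), len(image[0])
    xs = [x for x, _ in baseline]
    ys = [y for _, y in baseline]
    left = max(min(xs), 0)
    right = min(max(xs) + 1, width)
    top = max(min(ys) - above, 0)
    bottom = min(max(ys) + below + 1, height)
    return [row[left:right] for row in image[top:bottom]]


# Synthetic 10 x 20 "scan" whose pixel value encodes its position (y * 100 + x).
scan = [[y * 100 + x for x in range(20)] for y in range(10)]
line = crop_line(scan, baseline=[(2, 6), (15, 7)])
print(len(line), len(line[0]))  # rows and columns of the cut-out
```

More space is kept above the baseline than below it because most of a letter sits on top of the baseline, while only descenders such as those of "g" and "p" extend beneath it.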
To train a model, for both recognising text lines and recognising characters, training data has to be created. To do this, a representative data set is obtained using a sample from a set of scans of documents. From these scans, it is automatically determined where the baselines are and which characters are on the baselines. There are errors in the automatic transcriptions. These are corrected manually. The corrected transcripts are the training data.
Feeding the software with the training data creates a model that can predict on its own where the baselines are on a scan and which characters are on each baseline. After training, the model is evaluated with a test set. This process can be repeated several times until the desired result is obtained.
The ATH software consists of several components. Two of them use machine learning and can make predictions about the data based on examples captured in a trained model.
The first component is Laypa (https://doi.org/10.1145/3604951.3605520). Laypa uses scans annotated with data that shows where each line of text is located on the scan of the document. The goal of the software is to predict as accurately as possible where the text lines are in a scan. This is achieved by predicting which pixels are part of a baseline.
The second component is Loghi (https://doi.org/10.1007/978-3-031-70645-5_6). Loghi learns to predict what text is on lines of text it has not seen before. This is done using machine learning on examples of text lines and their associated transcripts.
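Predicting which pixels are part of a baseline yields a value per pixel, which still has to be turned into a line. The sketch below shows one simple way to do that: keep the pixels above a threshold and take the average row per column. It is not Laypa's actual post-processing, which is not described here. It assumes the region contains a single baseline, and the function name, threshold and example values are invented.

```python
def baseline_from_probabilities(prob_map, threshold=0.5):
    """Turn per-pixel baseline probabilities into a list of (x, y) points.

    prob_map  -- list of rows, each a list of probabilities between 0 and 1
    threshold -- pixels at or above this value count as baseline pixels

    Simplifying assumption: the map covers exactly one baseline, so all
    selected pixels in a column belong to the same line.
    """
    if not prob_map:
        return []
    points = []
    for x in range(len(prob_map[0])):
        rows = [y for y, row in enumerate(prob_map) if row[x] >= threshold]
        if rows:
            points.append((x, sum(rows) / len(rows)))  # average row in this column
    return points


# Tiny invented probability map: 3 rows by 4 columns.
probabilities = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.2, 0.1],
]
print(baseline_from_probabilities(probabilities))  # [(1, 1.5), (2, 1.0)]
```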
External provider
Link to code base