Please note: The algorithm descriptions in English have been automatically translated. Errors may have been introduced in this process. For the original descriptions, go to the Dutch version of the Algorithm Register.

Anonymisation software DataMask

The algorithm supports the anonymisation of documents by highlighting personal data. An employee checks that the anonymisation has been carried out correctly. After the employee's approval, the software removes the marked data and blacks it out. The documents can then be published, for example under the Open Government Act (Woo).

Last change on 24th of September 2024, at 12:38 (CET) | Publication Standard 1.0
Publication category
Impactful algorithms
Impact assessment
DPIA, ...
Status
In use

General information

Theme

Organisation and business operations

Begin date

2024-05

Contact information

provincieloket@gelderland.nl 026-359 9999

Responsible use

Goal and impact

The anonymisation software is used to anonymise documents that the province publishes, faster and more accurately than before. This helps us prevent data breaches and better protect data subjects' rights under the GDPR (AVG).

Considerations

The Province of Gelderland increasingly needs to disclose information. To comply with privacy law, among other things, privacy-sensitive or business-sensitive information must be redacted before publication. Before the algorithm was deployed, redaction did not always go well: there were data breaches in which not all personal data had been removed, or in which redacted information could still be read. The advantage of the anonymisation software is that anonymisation is faster and more accurate. The disadvantage is that the text layer of the document is analysed on a Microsoft Azure server. Because the content is not stored on this server, the privacy risk of using the algorithm does not outweigh the privacy benefit of having fewer data breaches caused by improper anonymisation.

Human intervention

The outcome of the algorithm is checked by an employee. The software requires the employee to check every page. The employee determines whether the document has been anonymised correctly.
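The requirement that every page be checked before a document is finalised could be enforced roughly as follows. This is a minimal sketch only: the class `ReviewSession` and its methods are hypothetical names chosen for illustration and are not taken from the DataMask software.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewSession:
    """Tracks which pages an employee has checked (hypothetical illustration)."""

    page_count: int
    reviewed: set[int] = field(default_factory=set)
    finalised: bool = False

    def mark_reviewed(self, page: int) -> None:
        # Pages are numbered from 1, as a reviewer would see them.
        if not 1 <= page <= self.page_count:
            raise ValueError(f"page {page} is outside 1..{self.page_count}")
        self.reviewed.add(page)

    def unreviewed_pages(self) -> list[int]:
        return [p for p in range(1, self.page_count + 1) if p not in self.reviewed]

    def finalise(self) -> None:
        # Refuse to finalise while any page is still unchecked.
        missing = self.unreviewed_pages()
        if missing:
            raise RuntimeError(f"cannot finalise, pages not yet reviewed: {missing}")
        self.finalised = True


session = ReviewSession(page_count=3)
session.mark_reviewed(1)
session.mark_reviewed(3)
print(session.unreviewed_pages())  # page 2 still has to be checked
```

A gate like this only shows that each page was opened for review. Whether the check was thorough still depends on the employee.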

Risk management

There is no risk of automated decision-making or infringement of fundamental rights, as the algorithm does not make binding decisions but only suggests which personal data to anonymise. In addition, the developer uses the algorithm itself, which helps to identify errors quickly. Furthermore, the algorithm is regularly retrained to improve its performance. Our organisation has specifically requested that our documents not be used to train the algorithm. Should the algorithm be insufficiently accurate, we can refine the process by using so-called blacklists and whitelists. A blacklist contains terms or data, such as specific names or addresses, that should always be flagged and anonymised. In contrast, the whitelist contains information that does not need to be flagged, for example because it is not personal data or because it is information that should explicitly not be anonymised, such as job titles or general terms. This makes it possible to further improve the accuracy of anonymisation.
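The blacklist and whitelist refinement described above could work roughly as follows. This is a sketch under assumptions: the `Span` type, the function `apply_lists` and the case-insensitive matching are illustrative choices, not the vendor's implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # offset of the first character
    end: int     # offset one past the last character
    source: str  # "model" or "blacklist"


def apply_lists(text, model_spans, blacklist, whitelist):
    """Drop suggestions whose text is whitelisted, then add every blacklist match."""
    allowed = {term.casefold() for term in whitelist}
    kept = [s for s in model_spans if text[s.start:s.end].casefold() not in allowed]
    seen = {(s.start, s.end) for s in kept}
    for term in blacklist:
        # re.escape makes sure the term is matched literally, not as a pattern.
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            key = (match.start(), match.end())
            if key not in seen:
                kept.append(Span(match.start(), match.end(), "blacklist"))
                seen.add(key)
    return sorted(kept, key=lambda s: (s.start, s.end))


text = "Project manager Jan Jansen lives at Markt 11."
title, name = "Project manager", "Jan Jansen"
suggested = [
    Span(text.index(title), text.index(title) + len(title), "model"),
    Span(text.index(name), text.index(name) + len(name), "model"),
]
result = apply_lists(text, suggested, blacklist=["Markt 11"], whitelist=["project manager"])
print([text[s.start:s.end] for s in result])  # ['Jan Jansen', 'Markt 11']
```

In this example the job title is removed from the suggestions because it is whitelisted, and the address is added because it is blacklisted, even though the model did not flag it.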


The final step in the process is always a manual check by a provincial employee, who assesses whether the anonymisation has been carried out correctly. However, there is a risk that employees do not check thoroughly. We try to mitigate this by training employees on the importance of thorough checking and careful assessment of the data found by the algorithm.


The remaining risk is the privacy risk of using Azure. Because of the Patriot Act, Microsoft may in some cases be required to hand over data to US authorities. To mitigate this risk, the vendor applies privacy by default, which means that the default settings are always privacy-friendly. Data can be sent to the Azure service via the API synchronously or asynchronously. The vendor has chosen to disable the feature whereby Azure temporarily stores the data sent via the API for debugging purposes. As a result, the data is deleted immediately after processing. Moreover, the vendor is ISO 27001 certified, which indicates that its information security is managed and audited. With these precautions in place, the benefits of using the software outweigh the risk of anonymising data improperly without it.

Legal basis

1. WOO 2. WCO 3. UAVG 4. WEP 5. WDO

Links to legal bases

  • Woo: https://wetten.overheid.nl/BWBR0045754/
  • WDO: https://eur-lex.europa.eu/legal-content/NL/TXT/HTML/?uri=CELEX:31995L0046
  • UAVG: https://wetten.overheid.nl/BWBR0040940
  • Wep: https://wetten.overheid.nl/BWBR0043961
  • Wdo: https://wetten.overheid.nl/BWBR0048156
  • Wet elektronische publicaties: https://wetten.overheid.nl/BWBR0043961/2024-01-01

Link to Processing Index

Trustbound

Elaboration on impact assessments

DEDA & DPIA conducted by DataMask. Pre-DPIA conducted by province of Gelderland. ICO Wizard requested by province of Gelderland, completed by DataMask.

Impact assessment

  • DEDA anonimiseringssoftware
  • DPIA anonimiseringssoftware
  • ICO Wizard (BIO)

Operations

Data

All information found in the uploaded documents (except metadata) is processed by the algorithm. This may include ordinary personal data, special categories of personal data and data relating to criminal offences. It may also include business-sensitive information.

Technical design

An employee uploads documents to the application. At that point, a temporary copy of the original is made in the form of a PDF with a text layer, and the metadata of the original document is removed from the copy. This copy is stored on a Dutch server for a maximum of 30 days.

The text layer of the PDF is passed through an API to the machine learning algorithm. This is a Natural Language Processing algorithm (named entity recognition) from Microsoft Azure. The API returns the locations in the analysed text where personal data is likely to occur, along with a probability score (a percentage) for each. Azure then immediately deletes the text layer. The probability score is used together with proprietary AI models developed by the vendor to make the recognition of personal data as accurate as possible. The datasets used to train the models include CoNLL-2003, UD Dutch LassySmall v2.8, Dutch NER Annotations for UD LassySmall and UD Dutch Alpino v2.8. The minimum key figures for the accuracy of identifying personal data are: named entities (precision) 0.78, named entities (recall) 0.76, named entities (F-score) 0.77.

Finally, an employee checks the document. Once it is finalised, the data to be anonymised is permanently removed from the text layer and blacked out.
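How locations and probability scores might be turned into proposed redactions, and how the reported F-score relates to precision and recall, can be sketched as follows. Everything here is an assumption made for illustration: the `Detection` fields, the threshold of 0.5 and the mask character are hypothetical and are not taken from the Azure API or from the DataMask software. The F-score function uses the standard harmonic mean of precision and recall.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    offset: int   # start position in the text layer
    length: int   # number of characters covered
    score: float  # probability score between 0 and 1


def propose_redactions(detections, threshold):
    """Keep only detections at or above the (assumed) score threshold."""
    return [d for d in detections if d.score >= threshold]


def redact(text, detections, mask="X"):
    """Overwrite each detected span so the original characters are gone."""
    chars = list(text)
    for d in detections:
        for i in range(d.offset, min(d.offset + d.length, len(chars))):
            chars[i] = mask
    return "".join(chars)


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


text = "Contact Jan Jansen today"
detections = [
    Detection(offset=8, length=10, score=0.93),  # "Jan Jansen"
    Detection(offset=19, length=5, score=0.20),  # "today", low score
]
proposed = propose_redactions(detections, threshold=0.5)
print(redact(text, proposed))          # Contact XXXXXXXXXX today
print(round(f_score(0.78, 0.76), 2))   # 0.77
```

The last line shows that the stated F-score of 0.77 is consistent with the stated precision of 0.78 and recall of 0.76. In the process described here, proposed redactions are only applied after an employee has approved them.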

External provider

DataMask B.V.
