Please note: The algorithm descriptions in English have been automatically translated. Errors may have been introduced in this process. For the original descriptions, go to the Dutch version of the Algorithm Register.

Deduplication script

The deduplication script supports search specialists in filtering duplicate files in a Woo request.

Last change on 31st of October 2024, at 8:26 (CET) | Publication Standard 1.0
Publication category
Other algorithms
Impact assessment
Field not filled in.
Status
In use

General information

Theme

Organisation and business operations

Begin date

07-2023

Contact information

cio-office@minfin.nl

Responsible use

Goal and impact

The algorithm aims to make the handling of Open Government Act (Woo) requests faster and more efficient. In a Woo request, multiple versions of the requested documents may exist. The algorithm helps to de-duplicate these documents so that the Woo requester receives only the relevant versions. Before the algorithm went into production, deduplication was done manually. Deploying the algorithm has sped up the process, so the Woo requester gets a faster response to the request.

Considerations

The algorithm helps a search specialist find duplicate documents, so that this search does not have to be done entirely by hand. A possible drawback is that some documents are wrongly marked as "duplicate". The algorithm's settings are chosen carefully to keep such errors to a minimum, and the search specialist always checks the results manually.

Human intervention

A person checks the results at several points during document deduplication. First, the algorithm makes a proposal, which the search specialist evaluates. The search specialist then decides whether the selection should be sent to the appropriate policy officer. Finally, the policy officer decides which documents are relevant to the Woo request and whether any documents are missing.


Risk management

The risk of a document being wrongly marked as a duplicate is relatively low. This risk has been minimised by the following measures:

  • Algorithm tuning: The algorithm is tuned conservatively, so it tends to mark too few documents as duplicates rather than too many.
  • Manual assessment: The search specialist manually assesses the results of the algorithm. The results are then forwarded to the relevant policy officer, who assesses them for completeness and relevance.

Legal basis

Open Government Act

Links to legal bases

Woo: https://wetten.overheid.nl/BWBR0045754/2023-04-01

Elaboration on impact assessments

The algorithm does not process personal data. It compares only file metadata: the file name and the file size.

Operations

Data

The algorithm uses the following data:


  • File name
  • Size of file
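
As an illustration of how little input this is, the sketch below gathers these two values for every file in a folder. It is not the ministry's script: the `FileMeta` record and the `collect_metadata` function are hypothetical names introduced here, and the sketch assumes all files sit directly in a single folder.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical record for the two inputs listed above."""
    path: Path
    name: str
    size: int  # file size in bytes


def collect_metadata(folder: Path) -> list[FileMeta]:
    """Return name and size for each regular file directly inside `folder`.

    Only the directory listing and file status are read; the files
    themselves are never opened.
    """
    records = []
    for entry in sorted(folder.iterdir()):
        if entry.is_file():  # skip subfolders
            records.append(FileMeta(entry, entry.name, entry.stat().st_size))
    return records
```

Because only the directory listing and `stat()` are used, a sketch like this never reads document contents.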

Technical design

The input comes from the Ministry of Finance's Search & Find programme (Zoek & Vind, Ministerie van Financiën, overheid.nl). This programme adds extra characters to file names to make them unique, as required by Windows. The deduplication script compares the file names with these extra characters removed. If two files then have the same name, their file sizes are compared. Based on certain settings, the algorithm moves one of the two files to another folder. This leaves the search specialist with a folder of documents that the deduplication script has rated as "non-duplicate".
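
The steps described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual script. The register entry does not say what the extra characters look like, so the `UNIQUE_SUFFIX` pattern (an underscore followed by digits, just before the extension) is a guess. The rule that two files count as duplicates only when their sizes match exactly is likewise an assumed stand-in for the unspecified settings. All function names are hypothetical.

```python
import re
import shutil
from pathlib import Path

# ASSUMPTION: the real suffix format is not documented in the source.
# Here a trailing "_<digits>" before the extension is treated as the
# extra characters added to make the name unique.
UNIQUE_SUFFIX = re.compile(r"_\d+$")


def normalised_name(filename: str) -> str:
    """Return the file name with the assumed uniqueness suffix removed."""
    p = Path(filename)
    return UNIQUE_SUFFIX.sub("", p.stem) + p.suffix


def find_duplicates(files: list[tuple[str, int]]) -> tuple[list[str], list[str]]:
    """Split (file name, size in bytes) pairs into kept and duplicate names.

    ASSUMPTION: a file is a duplicate only if an earlier file has the same
    normalised name AND exactly the same size. The first file seen is kept.
    """
    seen: set[tuple[str, int]] = set()
    kept: list[str] = []
    duplicates: list[str] = []
    for name, size in files:
        key = (normalised_name(name), size)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
            kept.append(name)
    return kept, duplicates


def move_duplicates(folder: Path, duplicate_dir: Path) -> list[str]:
    """Move files rated as duplicates from `folder` into `duplicate_dir`."""
    files = [(p.name, p.stat().st_size)
             for p in sorted(folder.iterdir()) if p.is_file()]
    _, duplicates = find_duplicates(files)
    duplicate_dir.mkdir(parents=True, exist_ok=True)
    for name in duplicates:
        shutil.move(str(folder / name), str(duplicate_dir / name))
    return duplicates
```

For example, `find_duplicates([("report.docx", 100), ("report_1.docx", 100), ("report_2.docx", 120)])` keeps `report.docx` and `report_2.docx` and rates only `report_1.docx` as a duplicate, because `report_2.docx` differs in size. Requiring an exact size match is one way to err on the side of marking too few files. A name that legitimately ends in digits, such as `budget_2023.pdf`, would also be normalised by this assumed pattern, which is one reason the manual check by the search specialist remains necessary.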