Please note: The algorithm descriptions in English have been automatically translated. Errors may have been introduced in this process. For the original descriptions, go to the Dutch version of the Algorithm Register.
Deduplication script
- Publication category: Other algorithms
- Impact assessment: Field not filled in.
- Status: In use
General information
Theme
Begin date
Contact information
Responsible use
Goal and impact
Considerations
The algorithm helps a search specialist find duplicate documents, so that this does not have to be done by hand. A possible drawback is that some documents are wrongly marked as duplicates. Careful tuning of the algorithm keeps such errors to a minimum, and the search specialist always checks the results manually.
Human intervention
Risk management
The risk that a document is wrongly marked as a duplicate is relatively low. The following measures minimise this risk:
- Algorithm tuning: The algorithm is tuned conservatively. This means that it tends to mark too few documents as duplicates rather than too many.
- Manual assessment: The search specialist manually assesses the results of the algorithm and forwards them to the relevant policy officer, who assesses them for completeness and relevance.
Legal basis
Open Government Act
Links to legal bases
Elaboration on impact assessments
The algorithm does not process personal data. It compares only file metadata (file name and file size).
Operations
Data
The algorithm uses the following data:
- File name
- Size of file
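To illustrate how little input this is, the following minimal sketch reads only those two attributes for each file in a folder and never opens a file's contents. The names `FileMetadata` and `collect_metadata` are hypothetical and chosen for this example; they do not come from the actual script.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileMetadata:
    """The two attributes the register entry lists as input data."""
    name: str  # file name, including extension
    size: int  # file size in bytes


def collect_metadata(folder: Path) -> list[FileMetadata]:
    """Return name and size for every regular file directly inside `folder`.

    Only directory entries and stat information are read; file contents
    are never opened. Subfolders are skipped (an assumption of this sketch).
    """
    records = [
        FileMetadata(name=p.name, size=p.stat().st_size)
        for p in folder.iterdir()
        if p.is_file()
    ]
    return sorted(records, key=lambda record: record.name)
```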
Technical design
The input comes from the Ministry of Finance's Search & Find programme (Zoek & Vind, Ministerie van Financiën, overheid.nl). This programme adds extra characters to file names to make them unique, because Windows does not allow two files with the same name in one folder. The deduplication script compares the file names without these extra characters to find possible duplicates. When two names match, the script compares the file sizes. Based on certain settings, it then moves one of the two files to another folder. This leaves the search specialist with a folder that contains only the documents that the deduplication script has rated as "non-duplicate".
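The steps described above could look roughly like the following sketch. It is not the ministry's script, and several details are assumptions made for illustration: the extra characters are assumed to be an underscore followed by digits at the end of the file name (the real format is not documented here), two files count as duplicates only when both the stripped name and the exact size match, and the first file in sorted order is the one that is kept. The names `UNIQUE_SUFFIX`, `normalised_name` and `deduplicate` are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumed suffix format: "_" plus digits just before the extension,
# e.g. "report_0001.pdf". The real Search & Find format may differ.
UNIQUE_SUFFIX = re.compile(r"_\d+$")


def normalised_name(path: Path) -> str:
    """Return the file name with the assumed uniqueness suffix removed."""
    return UNIQUE_SUFFIX.sub("", path.stem) + path.suffix


def deduplicate(source: Path, duplicates: Path) -> list[Path]:
    """Move suspected duplicates from `source` to `duplicates`.

    A file is treated as a duplicate only if an earlier file has the same
    normalised name AND the same size in bytes. This strict rule is an
    assumption that mirrors the conservative tuning described in the text.
    Returns the new locations of the moved files.
    """
    duplicates.mkdir(parents=True, exist_ok=True)
    seen: set[tuple[str, int]] = set()
    moved: list[Path] = []
    # sorted() lists the folder once, before any file is moved.
    for path in sorted(p for p in source.iterdir() if p.is_file()):
        key = (normalised_name(path), path.stat().st_size)
        if key in seen:
            target = duplicates / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
        else:
            seen.add(key)
    return moved
```

Because files are moved rather than deleted, a reviewer can still inspect everything in the `duplicates` folder, which fits the manual check by the search specialist.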