amir/duplicatefinder

Fork 0

Go to file

amir 2076a2d035 Readme edit

2026-05-04 18:37:15 +01:00

experiments

Project reordering and mpmc code

2026-05-04 13:39:49 +01:00

.gitignore

Add a build system

2026-05-01 20:59:51 +01:00

arena_base.h

LF MPMC queue improvements

2026-03-09 13:21:45 +01:00

arena.c

Add a build system

2026-05-01 20:59:51 +01:00

arena.h

LF MPMC queue improvements

2026-03-09 13:21:45 +01:00

base.h

Project reordering and mpmc code

2026-05-04 13:39:49 +01:00

build.bat

Reworking process_completion() function

2026-05-02 16:48:21 +01:00

build.sh

Reworking process_completion() function

2026-05-02 16:48:21 +01:00

changelog.txt

Add a build system

2026-05-01 20:59:51 +01:00

CMakeLists.txt

Add a build system

2026-05-01 20:59:51 +01:00

config.h

Reworking process_completion() function

2026-05-02 16:48:21 +01:00

file_hasher.c

Add a build system

2026-05-01 20:59:51 +01:00

lf_mpmc.h

MPMC queues implementation

2026-05-04 14:06:48 +01:00

LICENSE

Initial commit

2026-02-23 23:21:13 +00:00

mt_mpmc.h

Project reordering and mpmc code

2026-05-04 13:39:49 +01:00

platform.c

Experimenting with the restrict keyword

2026-05-04 17:27:01 +01:00

README.md

Readme edit

2026-05-04 18:37:15 +01:00

sm_mpmc.h

MPMC queues implementation

2026-05-04 14:06:48 +01:00

xxh_x86dispatch.c

Updating the LF MPMC queue and replacing DirQueue with it

2026-03-12 13:57:09 +01:00

xxh_x86dispatch.h

Updating the LF MPMC queue and replacing DirQueue with it

2026-03-12 13:57:09 +01:00

xxhash.c

first commit

2026-02-28 10:54:16 +01:00

xxhash.h

Add support for AVX2 instead of SSE2

2026-02-28 19:44:43 +01:00

README.md

Duplicate Finder

Presentation

Collects some metadata and hashes files then detects duplicate files and directories. It outputs the path, hash, size, creation and last modification dates and the author in file_hasher.txt.
Creation and modification dates and author can be disabled in the config file.

It is a high performance cross platform Windows and Linux compatible program, it uses:

Multiple threads for scanning and hashing (multi-threading can be disabled in the config file).
Stores the generated data in thread local configurable arenas that support growing by committing more memory and chaining blocks.
Two Multi Producer Multi Consumer queues, one for the scanners and one between the scanners and hashers.
xxh3_128bits algorithm from xxhash, that supports SIMD instruction sets (SSE2, AVX2, AVX512) and uses a runtime dispatcher to select the best available instruction set.
IO Ring for asynchronous I/O in Windows and the equivalent io_uring in Linux. The implementation is event driven, thread local, uses DMA and direct disk I/O, bypassing the OS cache completely, registered buffers (and registered files in io_uring), it supports bashing multiple submissions and can handle multiple files at the same time.
It can be disabled in the config file.
Fallback to buffered I/O if there is errors in the IO Ring path.

Building

Windows

Requirements: Make sur to use UCRT64 environment from MSYS2 instead of the standard MinGW environment. UCRT64 uses the modern Universal C Runtime (ucrtbase.dll), which supports the newest APIs, the standard MSYS2 uses the legacy msvcrt.dll and does not support IO Ring.
To install:
pacman -S mingw-w64-ucrt-x86_64-clang
or:
pacman -S mingw-w64-ucrt-x86_64-gcc

pacman -Syu
And add to path:
C:\msys64\ucrt64\bin

Additionally, to use clang-cl install the latest version of Windows SDK and MSVC, or at least select these in Visual Studio Installer:

MSVC Build tools fo x64/86.
C++ Build tools core features.
MSBuild support for LLVM (clang-cl) toolset.
Windows Universal C runtime.
Windows Universal CRT SDK.
Windows 11 SDK.

And use the MSVC command prompt or run a script to add MSVC environment variables to current session.
Ex: for PowerShell Terminal save as .ps1 (not persistent):

# Add MS visual studio environment variables
cmd /c '"C:\Program Files (x86)\Microsoft Visual Studio\18\BuildTools\VC\Auxiliary\Build\vcvarsall.bat" x64 && set' |
ForEach-Object {
    if ($_ -match "^(.*?)=(.*)$") {
        Set-Item -Path "Env:$($matches[1])" -Value $matches[2]
    }
}

Optional: to use the build system
pacman -S mingw-w64-ucrt-x86_64-cmake
The build system uses Ninja and fallsback to make, in Windows it prefers clang-cl > gcc > clang, and in Linux gcc > clang.

Using a build system

Command	Description
./build.bat	Build Release with best available compiler
./build.bat Debug	Build Debug
./build.bat clean	Clean and build Release
./build.bat Debug clean	Clean and build Debug

Release

gcc -O3 file_hasher.c xxhash.c xxh_x86dispatch.c -o filehasher
clang -O3 file_hasher.c xxhash.c xxh_x86dispatch.c -o filehasher
clang-cl /O2 file_hasher.c xxhash.c xxh_x86dispatch.c

Debug

gcc -g -O0 file_hasher.c xxhash.c xxh_x86dispatch.c -o filehasher
clang -g -O0 file_hasher.c xxhash.c xxh_x86dispatch.c -o filehasher
clang-cl /Zi /Od file_hasher.c xxhash.c xxh_x86dispatch.c

Linux

Requirements: GCC or clang, optional CMake, Ninja or make.

Using a build system

Command	Description
./build.sh	Build Release with best available compiler
./build.sh Debug	Build Debug
./build.sh clean	Clean and build Release
./build.sh Debug clean	Clean and build Debug

Release

gcc -O3 file_hasher.c xxhash.c xxh_x86dispatch.c -pthread -luring -o filehasher
clang -O3 file_hasher.c xxhash.c xxh_x86dispatch.c -pthread -luring -o filehasher

Debug

gcc -g -O0 file_hasher.c xxhash.c xxh_x86dispatch.c -pthread -luring -o filehasher
clang -g -O0 file_hasher.c xxhash.c xxh_x86dispatch.c -pthread -luring -o filehasher

Notes about the IO Ring implementations

IO Ring

File registration

Registering files is a performance optimization that allows the kernel to allocate an array of descriptors/handles to pre-validate and maintain long-term references to file handles. Instead of passing a standard file descriptor/handle with every I/O request, you pass a simple integer index into a pre-registered table.

The Linux implementation has io_uring_register_files_scarse() to create an empty array of descriptors (initialized with -1) without having to create and initialize it in the user space, and we can use io_uring_register_files_update() to update one or more entries. Windows on the other hand is limited to BuildIoRingRegisterFileHandles() only, so we need to re register the entire array of handles each time. This is why there is a provided macro in config.h to disable or enable it.

Why Register Files? (The Benefits)

When you use a standard file descriptor in a high-frequency I/O loop, the kernel must perform several "hidden" tasks for every single operation:

Permission Checks: Validating that the process still has the right to read/write that specific file.
Reference Counting: Incrementing the file's internal reference count at the start of the I/O and decrementing it at the end to ensure the file isn't closed while in use.
Object Lookup: Traversing the internal "file descriptor table" to find the actual kernel object associated with your integer ID.

Registering the files performs these checks once at registration time. Subsequent I/O operations skip these steps, significantly reducing CPU overhead and latency, especially when handling thousands of small I/O operations per second.

Comparison: Linux vs. Windows Implementation

While both systems share the same core concept, their APIs and management styles differ significantly.

Feature	Linux (`io_uring`)	Windows (`IoRing`)
API Call	`io_uring_register`	`BuildIoRingRegisterFileHandles`
Registration Method	Synchronous system call; blocks until the table is set up.	Asynchronous request submitted to the ring like a read/write operation.
Partial Updates	Supports `IORING_REGISTER_FILES_UPDATE` to swap specific indices.	No partial updates; a new registration replaces the entire table.
Scope of Operations	Extremely broad (files, sockets, timers, signals, etc.).	Primarily focused on file storage (read, write, flush).

Completion Wait count and peek

To avoid busy waiting when receiving CQEs, we can use io_uring_submit_and_wait() in Linux by entering a wait count, the threads sleep until the count of CQEs are received, in windows the wait_count is present in SubmitIoRing() but is not implemented yet, so we wait with a completion event for a single completion. Another limitation on the completion event is that the kernel will waik up the thread only when receiving the first CQE, after that we need to drain the completion queue completely before sleeping again, or we enter an eternal slumber.
In the other hand, in Linux we can batch pop completions with io_uring_peek_batch_cqe() + io_uring_cq_advance(), in Windows we can only pop one completion at a time with PopIoRingCompletion() (equivalent to io_uring_peek_cqe() + io_uring_cqe_seen()).
To simulate the same behavior as the Linux functions we use a double loop, an outer loop to control how much we wait and in inner loop to drain all the available completions.

Filtering CQEs

Unlike Linux, The Windows implementation treats buffer and file registration as an asynchronous operation that we submit to the ring, similar to a read or write. Those operations produce CQEs (completion queue entries) that we filter here using cqe.UserData == USERDATA_REGISTER

    if (win_cqe.UserData == USERDATA_REGISTER)
      continue;

io_uring

Creation flags

io_uring provides a lot of configuration flags compared to IO Ring, some of them are at the creation and others during the operations, here what we use in this implementation at creation time and is lacking in the IO Ring implementation.

IORING_SETUP_SINGLE_ISSUER: Since we are using a thread local io_uring, we can set this flag to remove the atomic operations.
IORING_SETUP_DEFER_TASKRUN: By default, the kernel sends an interrupts when a CQE is ready, we use this flag to disable this syscall and wait for a specific number of CQEs to be ready to group them, this reduces the number of syscall.

Memlock limit warning

     "WARNING: Buffer registration failed due to memlock limits (ENOMEM).\n"
     "Increase the limit to solve this warning.\n");

The Memlock limit in Linux restricts the amount of memory that can be "locked" into physical RAM using the mlock() family of system calls. This prevents the operating system from swapping that memory out to disk.
And registering buffers will lock the buffers memory so the hardware can access it directly without kernel intervention and prevents the kernel from swapping it to the SSD or HDD.
This limit does not apply to a single process, but it applies to what all the runnig processes can lock, so in order to be able to register the buffers, we need to set it to unlimited or increase it to at least:
num_hash_threads * NUM_BUFFERS_PER_THREAD * IORING_BUFFER_SIZE + extra memory reserved for other processes.

Modifying the Limit

The method for changing the memlock limit depends on whether you are managing a user session or a system service.

For Users and Interactive Sessions
To permanently increase the limit for a specific user or group, modify the /etc/security/limits.conf file. Add the following lines:

 # Example for a specific user (replace 'username'), unlimited or a custom value in KB
 username   soft    memlock    unlimited
 username   hard    memlock    unlimitedhttps://wiki.postgresql.org/wiki/AIO

 # Example for all users
 *          soft    memlock    unlimited
 *          hard    memlock    unlimited

Soft Limit: The value the user starts with; can be raised up to the hard limit.

Hard Limit: The absolute maximum; only a privileged user (root) can increase this. Values: Can be set in Kilobytes (KB) or as unlimited.

For Systemd Services Settings in limits.conf do not affect background services managed by systemd. To increase the limit for a service, edit its service file (e.g., /etc/systemd/system/myservice.service) and add:

  [Service]
  LimitMEMLOCK=infinity

Why Register Buffers?

In a standard "unregistered" I/O operation, the kernel must perform several expensive steps for every single read or write:

Virtual-to-Physical Mapping: The kernel has to translate your application's virtual memory addresses into physical RAM addresses.
Page Pinning: The kernel must "pin" the memory pages (using get_user_pages) to prevent them from being swapped to disk or moved while the hardware (like your SSD) is writing to them.
TLB Overhead: Constant mapping and unmapping put pressure on the Translation Lookaside Buffer (TLB), which can slow down the CPU.

Registering the buffers performs all of this "pinning" and "mapping" once.

Direct I/O: O_DIRECT (Linux) and FILE_FLAG_NO_BUFFERING (Windows)

Modern operating systems normally use a page cache when reading files. This means file data is first loaded into kernel memory and then copied to user space. While this improves performance for many workloads, it introduces extra memory usage and copy overhead.

Both Linux and Windows provide a way to bypass this cache and perform direct I/O:

Linux: O_DIRECT Windows: FILE_FLAG_NO_BUFFERING

These flags instruct the OS to transfer data directly between disk and user-provided buffers, avoiding the page cache.

Benefits

Reduced memory overhead Avoids polluting the OS page cache Especially useful for large sequential reads (e.g. hashing, backups)
Lower CPU usage Eliminates extra memory copies between kernel and user space
Predictable performance No interference from cache eviction or readahead heuristics More consistent throughput for streaming workloads
Better scalability Ideal for high-throughput, multi-threaded I/O pipelines Prevents cache contention between threads
Avoids double caching Important when the application already manages its own buffering

File system compatibility

Not all file systems are compatible with O_DIRECT, if we try to open files residing in an NTFS partition, most of the time it will fail, and some times it opens but the CQEs return with an error code bad descriptor, and it causes some lags.

To address this issue the program falls back to sequential read when the open fails, and falls back to buffered sequential hashing if we receive an error in the CQEs. There is also a file system detection that we can enable in the config file, it will enable the collection of the file system in scan_folder() and the file will be opened accordingly, but it costs one additional syscall / directory.