224 lines
11 KiB
Markdown
224 lines
11 KiB
Markdown
# filehasher
|
|
|
|
# Presentation
|
|
Collects some metadata and hashes files. It outputs the path, hash, size, creation and
|
|
last modification dates and the author in file_hasher.txt.
|
|
Creation and modification dates and author can be disabled in the config file.
|
|
|
|
It is a high performance cross platform Windows and Linux compatible program, it uses:
|
|
* Multiple threads for scanning and hashing (multi-threading can be disabled in the config file).
|
|
* Stores the generated data in thread local configurable arenas that support growing
|
|
by committing more memory and chaining blocks.
|
|
* Two Multi Producer Multi Consumer queues, one for the scanners and one between the scanners and hashers.
|
|
* xxh3_128bits algorithm from xxhash, that supports SIMD instruction sets (SSE2, AVX2, AVX512)
|
|
and uses a runtime dispatcher to select the best available instruction set.
|
|
* IO Ring for asynchronous I/O in Windows and the equivalent io_uring in Linux.
|
|
The implementation is event driven, thread local, uses DMA and direct disk I/O,
|
|
bypassing the OS cache completely, registered buffers (and registered files in io_uring),
|
|
it supports bashing multiple submissions and can handle multiple files at the same time.
|
|
It can be disabled in the config file.
|
|
* Fallback to buffered I/O if there is errors in the IO Ring path.
|
|
|
|
# Building
|
|
## Windows
|
|
### Release
|
|
|
|
**Note**: Make sur to use UCRT64 environment from MSYS2 instead of the standard MinGW environment.
|
|
UCRT64 uses the modern Universal C Runtime (ucrtbase.dll), which supports the newest APIs,
|
|
the standard MSYS2 uses the legacy msvcrt.dll and does not support IO Ring.
|
|
To install:
|
|
pacman -S mingw-w64-ucrt-x86_64-gcc
|
|
pacman -S mingw-w64-ucrt-x86_64-clang
|
|
pacman -Syu
|
|
And add to path:
|
|
C:\msys64\ucrt64\bin
|
|
|
|
gcc -O3 file_hasher.c xxhash.c xxh_x86dispatch.c -o file_hasher
|
|
clang -O3 file_hasher.c xxhash.c xxh_x86dispatch.c -o file_hasher
|
|
clang-cl /O2 file_hasher.c xxhash.c xxh_x86dispatch.c
|
|
|
|
### Debug
|
|
gcc -g -O0 file_hasher.c xxhash.c xxh_x86dispatch.c -o file_hasher
|
|
clang -g -O0 file_hasher.c xxhash.c xxh_x86dispatch.c -o file_hasher
|
|
clang-cl /Zi /Od file_hasher.c xxhash.c xxh_x86dispatch.c
|
|
|
|
## Linux
|
|
### Release
|
|
gcc -O3 file_hasher.c xxhash.c xxh_x86dispatch.c -pthread -luring -o file_hasher
|
|
clang -O3 file_hasher.c xxhash.c xxh_x86dispatch.c -pthread -luring -o file_hasher
|
|
|
|
### Debug
|
|
gcc -g -O0 file_hasher.c xxhash.c xxh_x86dispatch.c -pthread -luring -o file_hasher
|
|
clang -g -O0 file_hasher.c xxhash.c xxh_x86dispatch.c -pthread -luring -o file_hasher
|
|
|
|
# Notes about the IO Ring implementations
|
|
## IO Ring
|
|
|
|
### File registration
|
|
Registering files is a performance optimization that allows the kernel to allocate an array
|
|
of descriptors/handles to pre-validate and maintain long-term references to file handles.
|
|
Instead of passing a standard file descriptor/handle with every I/O request, you pass a simple integer
|
|
index into a pre-registered table.
|
|
|
|
The Linux implementation has io_uring_register_files_scarse() to create an empty array of descriptors
|
|
(initialized with -1) without having to create and initialize it in the user space, and we can
|
|
use io_uring_register_files_update() to update one or more entries. Windows on the other hand
|
|
is limited to BuildIoRingRegisterFileHandles() only, so we need to re register the entire array of handles
|
|
each time. This is why there is a provided macro in config.h to disable or enable it.
|
|
|
|
#### *Why Register Files? (The Benefits)*
|
|
When you use a standard file descriptor in a high-frequency I/O loop,
|
|
the kernel must perform several "hidden" tasks for every single operation:
|
|
* Permission Checks: Validating that the process still has the right to read/write
|
|
that specific file.
|
|
* Reference Counting: Incrementing the file's internal reference count at the start of
|
|
the I/O and decrementing it at the end to ensure the file isn't closed while in use.
|
|
* Object Lookup: Traversing the internal "file descriptor table" to find the actual
|
|
kernel object associated with your integer ID.
|
|
|
|
Registering the files performs these checks once at registration time. Subsequent
|
|
I/O operations skip these steps, significantly reducing CPU overhead and latency,
|
|
especially when handling thousands of small I/O operations per second.
|
|
|
|
#### *Comparison: Linux vs. Windows Implementation*
|
|
While both systems share the same core concept, their APIs and management styles differ significantly.
|
|
|
|
| Feature | Linux (`io_uring`) | Windows (`IoRing`) |
|
|
| :--- | :--- | :--- |
|
|
| **API Call** | `io_uring_register` | `BuildIoRingRegisterFileHandles` |
|
|
| **Registration Method** | Synchronous system call; blocks until the table is set up. | Asynchronous request submitted to the ring like a read/write operation. |
|
|
| **Partial Updates** | Supports `IORING_REGISTER_FILES_UPDATE` to swap specific indices. | No partial updates; a new registration replaces the entire table. |
|
|
| **Scope of Operations** | Extremely broad (files, sockets, timers, signals, etc.). | Primarily focused on file storage (read, write, flush). |
|
|
|
|
### Completion Wait count
|
|
To avoid busy waiting when receiving CQEs, we can use io_uring_submit_and_wait() in Linux by entering a wait count,
|
|
the threads sleeps until the count of CQEs are received, in windows the wait_count is present in SubmitIoRing()
|
|
but is not implemented yet, so we wait with a completion event for a single completion. Another limitation on the completion
|
|
event is that the kernel will waik up the thread only when receiving the first CQE, after that we need to drain the completion
|
|
queue completely before sleeping again, or we enter an eternal slumber. And my config, each time the thread wakes up
|
|
it receives rarely more than 3 to 5 CQEs and most of the time only one CQE.
|
|
|
|
### Filtering CQEs
|
|
|
|
Unlike Linux, The Windows implementation treats buffer and file registration
|
|
as an asynchronous operation that we submit to the ring, similar to a read or write.
|
|
Those operations produce CQEs (completion queue entries) that we filter here using
|
|
cqe.UserData == USERDATA_REGISTER
|
|
```c
|
|
if (win_cqe.UserData == USERDATA_REGISTER)
|
|
continue;
|
|
```
|
|
|
|
## io_uring
|
|
|
|
### Creation flags
|
|
io_uring provides a lot of configuration flags compared to IO Ring, some
|
|
of them are at the creation and others during the operations, here what
|
|
we use in this implementation at creation time and is lacking in the
|
|
IO Ring implementation.
|
|
|
|
* IORING_SETUP_SINGLE_ISSUER: Since we are using a thread local io_uring, we can
|
|
set this flag to remove the atomic operations.
|
|
* IORING_SETUP_DEFER_TASKRUN: By default, the kernel sends an interrupts when a CQE
|
|
is ready, we use this flag to disable this syscall and wait for a specific number of
|
|
CQEs to be ready to group them, this reduces the number of syscall.
|
|
|
|
### Memlock limit warning
|
|
|
|
```c
|
|
"WARNING: Buffer registration failed due to memlock limits (ENOMEM).\n"
|
|
"Increase the limit to solve this warning.\n");
|
|
```
|
|
|
|
The Memlock limit in Linux restricts the amount of memory a process can
|
|
"lock" into physical RAM using the mlock() family of system calls. This
|
|
prevents the operating system from swapping that memory out to disk.
|
|
And registering buffers will lock the buffers memory so the hardware
|
|
can access it directly without kernel intervention and prevents the kernel from
|
|
swapping it to the SSD or HDD. Increase the limit to be able to register the buffers.
|
|
|
|
#### *Modifying the Limit*
|
|
The method for changing the memlock limit depends on whether you are
|
|
managing a user session or a system service.
|
|
1. For Users and Interactive Sessions
|
|
To permanently increase the limit for a specific user or group, modify
|
|
the /etc/security/limits.conf file. Add the following lines:
|
|
|
|
```conf
|
|
# Example for a specific user (replace 'username'), unlimited or a custom value in KB
|
|
username soft memlock unlimited
|
|
username hard memlock unlimited
|
|
```
|
|
```conf
|
|
# Example for all users
|
|
* soft memlock unlimited
|
|
* hard memlock unlimited
|
|
```
|
|
|
|
Soft Limit: The value the user starts with; can be raised up to the
|
|
hard limit.
|
|
|
|
Hard Limit: The absolute maximum; only a privileged user
|
|
(root) can increase this. Values: Can be set in Kilobytes (KB) or as
|
|
unlimited.
|
|
|
|
2. For Systemd Services
|
|
Settings in limits.conf do not affect background services managed by
|
|
systemd. To increase the limit for a service, edit its service file
|
|
(e.g., /etc/systemd/system/myservice.service) and add:
|
|
```conf
|
|
[Service]
|
|
LimitMEMLOCK=infinity
|
|
```
|
|
|
|
#### *Why Register Buffers?*
|
|
In a standard "unregistered" I/O operation, the kernel must perform several
|
|
expensive steps for every single read or write:
|
|
* Virtual-to-Physical Mapping: The kernel has to translate your application's
|
|
virtual memory addresses into physical RAM addresses.
|
|
* Page Pinning: The kernel must "pin" the memory pages (using get_user_pages)
|
|
to prevent them from being swapped to disk or moved while the hardware
|
|
(like your SSD) is writing to them.
|
|
* TLB Overhead: Constant mapping and unmapping put pressure on the Translation
|
|
Lookaside Buffer (TLB), which can slow down the CPU.
|
|
|
|
Registering the buffers performs all of this "pinning" and "mapping" once.
|
|
|
|
### Direct I/O: O_DIRECT (Linux) and FILE_FLAG_NO_BUFFERING (Windows)
|
|
|
|
Modern operating systems normally use a page cache when reading files. This means file
|
|
data is first loaded into kernel memory and then copied to user space. While this improves
|
|
performance for many workloads, it introduces extra memory usage and copy overhead.
|
|
|
|
Both Linux and Windows provide a way to bypass this cache and perform direct I/O:
|
|
|
|
Linux: O_DIRECT
|
|
Windows: FILE_FLAG_NO_BUFFERING
|
|
|
|
These flags instruct the OS to transfer data directly between disk and user-provided buffers, avoiding the page cache.
|
|
|
|
#### *Benefits*
|
|
1. Reduced memory overhead
|
|
Avoids polluting the OS page cache
|
|
Especially useful for large sequential reads (e.g. hashing, backups)
|
|
2. Lower CPU usage
|
|
Eliminates extra memory copies between kernel and user space
|
|
3. Predictable performance
|
|
No interference from cache eviction or readahead heuristics
|
|
More consistent throughput for streaming workloads
|
|
4. Better scalability
|
|
Ideal for high-throughput, multi-threaded I/O pipelines
|
|
Prevents cache contention between threads
|
|
5. Avoids double caching
|
|
Important when the application already manages its own buffering
|
|
|
|
#### *File system compatibility*
|
|
Not all file systems are compatible with O_DIRECT, if we try to open files residing in an NTFS partition,
|
|
most of the time it will fail, and some times it opens but the CQEs return with an error code bad
|
|
descriptor, and it causes some lags.
|
|
|
|
To address this issue the program falls back to sequential read when the open fails, and falls back to
|
|
buffered sequential hashing if we receive an error in the CQEs. There is also a file system detection
|
|
that we can enable in the config file, it will enable the collection of the file system in scan_folder()
|
|
and the file will be opened accordingly, but it costs one additional syscall / directory.
|