Well, to understand what's going on, you need to know what happens while AOM is hashing a file. Basically:
- AOM tells Windows to read some data.
- Windows asks NTFS which block the requested data is in.
- Windows asks the hard disk to read that block. (t1)
- Windows copies the relevant data from that block into a buffer provided by AOM.
- AOM waits for Windows to finish reading, then pipes the data through the four separate hashing algorithms. (t2)
- AOM repeats this process until the whole file has been read.
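Ignoring the details, the loop above boils down to something like this - just a portable sketch, where std::ifstream stands in for the actual Win32 calls and a trivial checksum stands in for the four real hash algorithms:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

// Toy stand-in for the four real hashing algorithms (a simple rolling checksum).
uint32_t toy_hash_step(uint32_t state, const char* data, size_t len) {
    for (size_t i = 0; i < len; ++i)
        state = state * 31u + static_cast<unsigned char>(data[i]);
    return state;
}

// Sequential read-and-hash loop: each read (t1) is followed by a hashing
// pass over the chunk just read (t2). The disk sits idle during t2 and
// the CPU sits idle during t1 - that's the "dead time" discussed below.
uint32_t hash_file_sequential(const char* path, size_t buffer_size) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(buffer_size);
    uint32_t state = 0;
    while (in) {
        in.read(buffer.data(), buffer.size());       // t1: wait for the read
        state = toy_hash_step(state, buffer.data(),  // t2: hash the chunk
                              static_cast<size_t>(in.gcount()));
    }
    return state;
}
```

Note that the result is independent of the buffer size - only the timing changes, which is exactly why tuning the buffer is "free" correctness-wise.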
For any given implementation of a hashing algorithm, these tasks are the same - how efficient the whole process is depends on how you order them. In the sketch above, the time t2 depends solely on the CPU and scales linearly (meaning: twice as much data takes twice as long to process). The time t1, however, depends primarily on the hard disk. Due to notable per-request overhead (especially seek times, but also the time needed to look up block positions in NTFS's master file table), it does not scale that simply: assume it takes 1 unit of time to read as much data as fits into one buffer. Reading 0.1 buffers would then take more than 0.1 time units, and reading 2 buffers would take less than 2 time units. In other words, if you can manage to read the file in one go, without interrupting to process the data in between, you save time - at the same time, however, you're forced to read the file in chunks, since you can't assume the entire file fits into memory...
Thus, the reason a larger buffer increases performance is that less "dead time" is spent waiting on hard disk seeks and Windows' bookkeeping.
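That cost model can be put into numbers. Everything here is invented for illustration - the overhead value o = 0.25 is an arbitrary example, and the units are chosen so that reading exactly one buffer takes 1 time unit:

```cpp
#include <cassert>

// Toy cost model: every read request pays a fixed overhead o (seek +
// MFT lookup), then transfers data at a constant rate. Units are chosen
// so that reading one buffer takes 1 time unit, i.e. o + (1 - o) = 1.
// o = 0.25 is an invented example value, not a measured one.
double read_time(double buffers, double o = 0.25) {
    return o + buffers * (1.0 - o);   // one overhead + linear transfer
}

// Total time to read `total` buffers' worth of data split into
// `requests` equally sized reads: each request pays the overhead again.
double total_time(double total, int requests, double o = 0.25) {
    return requests * read_time(total / requests, o);
}
```

With this model, reading 16 buffers in one request costs 12.25 units, in 4 requests 13 units, and in 16 requests 16 units - fewer, larger reads always win, which is the whole argument for a bigger buffer.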
One implementation suggestion I've seen somewhere on Google recently is multithreaded:
Thread 1 reads the data into N separate buffers, allocated at runtime to be twice the volume's block size, but at least 32 KB. Thread 2 empties these buffers, piping their contents through the hashing algorithms. With clever interlocking (I figure you need at most two semaphores, one per thread to track the status of the N buffers, plus one ordinary lock to hold thread 2 back until thread 1 has pre-filled all buffers for the first time), the slower of the two processes will set the pace for the other automagically.
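Roughly like this - and this is only a sketch, not the suggested implementation: the "file" is an in-memory byte vector so the example stays self-contained, the two semaphores are replaced by a mutex with two condition variables (the portable C++ equivalent), and a trivial rolling checksum again stands in for the four real hash algorithms:

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Pipeline {
    std::mutex m;
    std::condition_variable not_full, not_empty;
    std::queue<std::vector<char>> filled;   // buffers ready to be hashed
    size_t max_buffers;                     // the N from the text
    bool done = false;

    explicit Pipeline(size_t n) : max_buffers(n) {}

    // Thread 1: split the "file" into chunks and hand them over, blocking
    // whenever all N buffers are already filled (the hasher is too slow).
    void produce(const std::vector<char>& file, size_t chunk) {
        for (size_t off = 0; off < file.size(); off += chunk) {
            size_t len = std::min(chunk, file.size() - off);
            std::vector<char> buf(file.begin() + off, file.begin() + off + len);
            std::unique_lock<std::mutex> lk(m);
            not_full.wait(lk, [&] { return filled.size() < max_buffers; });
            filled.push(std::move(buf));
            not_empty.notify_one();
        }
        std::lock_guard<std::mutex> lk(m);
        done = true;
        not_empty.notify_one();
    }

    // Thread 2: drain buffers, blocking whenever none is filled yet
    // (the reader is too slow). Hashing happens outside the lock.
    uint32_t consume() {
        uint32_t state = 0;
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            not_empty.wait(lk, [&] { return !filled.empty() || done; });
            if (filled.empty()) return state;   // reader finished, queue drained
            std::vector<char> buf = std::move(filled.front());
            filled.pop();
            not_full.notify_one();
            lk.unlock();
            for (char c : buf)                  // toy stand-in for the hashes
                state = state * 31u + static_cast<unsigned char>(c);
        }
    }
};
```

In a real implementation thread 1 would call ReadFile into preallocated buffers instead of copying from memory, but the interlocking is the interesting part: whichever thread is slower blocks the other exactly as described above, with no busy-waiting.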
PetriW wrote:Oh yeah, and if you hint to windows that you'll read the file sequentially from the start to the end it'll lower io throughput by about... 60%.... wtf...

That is indeed weird, but what do you mean by "hinting to Windows"? I'm not aware of any special function for sequential reads in the C/C++ Win32 API - maybe it's a problem with Delphi's libraries instead?