I have one of those SoC FPGA devices which at first seems like it could be a fit for your plan to reprogram the FPGA for every hash. However, I think you need to redo some of the math here. The board I have has a 32-bit 400/800 DDR interface, which comes out to 25.6 Gbit/s. Since the result of each hash is 512 bits, that gives you roughly 50M hashes into and out of memory per second. Each hash algorithm has to read the previous hash and then write its output, halving the bandwidth, so 25M raw hashes. Divided across the 11 algorithms, this gives a best-case performance of about 2.3 MH/s, i.e. quite a bit less than a 750 Ti. It also calls into question the idea of fully unrolling the hashes, since 25 MH/s per algorithm should only require a 2- or 4-way unroll. Obviously there is a lot to be gained by fitting multiple hashes into the chip at one time and avoiding memory altogether.
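For reference, here is the arithmetic above as a quick Python sketch; the bus width, transfer rate, hash size, and 11-algorithm chain are just the numbers from this post, so swap in your own board's figures:

```python
# Back-of-the-envelope memory-bandwidth limit for chaining X11 hashes through DDR.
bus_bits  = 32        # 32-bit DDR interface
rate_tps  = 800e6     # 400 MHz DDR clock -> 800 MT/s
hash_bits = 512       # each intermediate hash result is 512 bits
num_algos = 11        # X11 chains 11 hash functions

bandwidth_bps   = bus_bits * rate_tps         # 25.6 Gbit/s
transfers_per_s = bandwidth_bps / hash_bits   # ~50M 512-bit transfers per second
raw_hashes      = transfers_per_s / 2         # each stage reads + writes -> ~25M
x11_rate        = raw_hashes / num_algos      # ~2.3 MH/s best case

print(f"best-case X11 rate: {x11_rate / 1e6:.2f} MH/s")
```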
I have only looked into BLAKE and BMW so far. They are both wide-pipe Merkle-Damgård-style constructions. If all the other hashes are similar, it should be possible to build some kind of construction where the inner loops run as a small state machine, so that resources like the huge numbers of 64-bit adders can be reused across algorithms. In that case maybe all the hashes could fit into a single "generic hash" module. If this module could run above 100 MHz, total performance could be as high as 10 MH/s, which imho is a lot better than what you get by going through memory.
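To make the "generic hash" idea a bit more concrete, here is a toy Python model (not RTL, and not how I would actually structure the hardware) of a time-shared datapath: one 64-bit add/XOR/rotate unit, with each algorithm reduced to a micro-op table that a small FSM steps through. Only the BLAKE-512 G mixing step is shown, with the message/constant injection omitted for brevity; a BMW-style step would just be another table reusing the same alu():

```python
# Toy model of a time-shared 64-bit datapath: every inner-loop step is one
# micro-op on the same adder/XOR/rotate resources, and each algorithm is just
# a different micro-op sequence driven by a small state machine.
MASK64 = (1 << 64) - 1

def rotr64(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK64

def alu(op, a, b):
    """The one shared 64-bit ALU: stands in for the reused adders/XORs on the FPGA."""
    if op == "add":
        return (a + b) & MASK64
    if op == "xor":
        return a ^ b
    raise ValueError(op)

# BLAKE-512's G step as a micro-op program over that ALU (message/constant
# injection left out; the rotation amounts 32/25/16/11 are the real ones).
BLAKE_G = [
    ("add", "a", "b", "a"), ("xor", "d", "a", "d"), ("rot", 32, "d", "d"),
    ("add", "c", "d", "c"), ("xor", "b", "c", "b"), ("rot", 25, "b", "b"),
    ("add", "a", "b", "a"), ("xor", "d", "a", "d"), ("rot", 16, "d", "d"),
    ("add", "c", "d", "c"), ("xor", "b", "c", "b"), ("rot", 11, "b", "b"),
]

def run_microprogram(program, regs):
    """One FSM pass: execute each micro-op on the shared datapath."""
    for op, x, y, dst in program:
        if op == "rot":
            regs[dst] = rotr64(regs[y], x)
        else:
            regs[dst] = alu(op, regs[x], regs[y])
    return regs

state = {"a": 1, "b": 2, "c": 3, "d": 4}
print(run_microprogram(BLAKE_G, state))
```

The point is just that the expensive resources (the 64-bit adders) appear once and get reused every cycle, while the per-algorithm differences collapse into the micro-op tables, which is exactly the trade you would hope to make in a single "generic hash" module.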