I want to get a conversation going among devs all about X11 - the different hash functions, implementation issues, security, anything. Since I'm starting the thread, and have been spending far too much time with X11, I'll start with examining one of the very few algorithms in X11 whose GPU performance isn't totally shot: Shavite-3.
Shavite-3 is basically a lot of AES rounds, using a dynamic key, so it's pretty hard to screw up. The basic framework of the current implementation, however, is sound - yes, Shavite-3 can be done 4-way, but you'd also pound the hell out of LDS; it's likely better to keep the state in registers, doing it 1-way - the biggest fuckup here seems to just be the PHM-style "unroll fucking everything." That implementation has two loops - and one of them is for putting the AES tables into LDS - so code size kills it on GCN. Code cache is 32kB, and it weighs in at 76,680 bytes when compiled for Tahiti. There's also one more thing, the PHM implementation is scalar. It doesn't use arrays, just a whole load of 32-bit integers. Shavite-3 can be vectorized pretty well with uint4 - you have to access the individual members more than I'd like - but overall, it's cleaner and faster. Adding loops to roll up a lot of repetitive code helps a ton, as well. Of course, GPUs don't do branching well, so don't go overboard, but I managed to put the size at 31,492 bytes - rolled up just enough to fly under 32kB.
For reference, the algorithms used in X11, in order, follow (all are the 512-bit versions):
1. Blake
2. Blue Midnight Wish (BMW)
3. Groestl
4. Skein
5. JH
6. Keccak
7. Luffa
8. CubeHash
9. Shavite-3
10. SIMD
11. Echo
Shavite-3 is basically a lot of AES rounds, using a dynamic key, so it's pretty hard to screw up. The basic framework of the current implementation, however, is sound - yes, Shavite-3 can be done 4-way, but you'd also pound the hell out of LDS; it's likely better to keep the state in registers, doing it 1-way - the biggest fuckup here seems to just be the PHM-style "unroll fucking everything." That implementation has two loops - and one of them is for putting the AES tables into LDS - so code size kills it on GCN. Code cache is 32kB, and it weighs in at 76,680 bytes when compiled for Tahiti. There's also one more thing, the PHM implementation is scalar. It doesn't use arrays, just a whole load of 32-bit integers. Shavite-3 can be vectorized pretty well with uint4 - you have to access the individual members more than I'd like - but overall, it's cleaner and faster. Adding loops to roll up a lot of repetitive code helps a ton, as well. Of course, GPUs don't do branching well, so don't go overboard, but I managed to put the size at 31,492 bytes - rolled up just enough to fly under 32kB.
For reference, the algorithms used in X11, in order, follow (all are the 512-bit versions):
1. Blake
2. Blue Midnight Wish (BMW)
3. Groestl
4. Skein
5. JH
6. Keccak
7. Luffa
8. CubeHash
9. Shavite-3
10. SIMD
11. Echo
Last edited by a moderator: