X11 Technical Discussion Thread

Wolf0

Member
I want to get a conversation going among devs all about X11 - the different hash functions, implementation issues, security, anything. Since I'm starting the thread, and have been spending far too much time with X11, I'll start with examining one of the very few algorithms in X11 whose GPU performance isn't totally shot: Shavite-3.

Shavite-3 is basically a lot of AES rounds, using a dynamic key, so it's pretty hard to screw up. The basic framework of the current implementation, however, is sound - yes, Shavite-3 can be done 4-way, but you'd also pound the hell out of LDS; it's likely better to keep the state in registers, doing it 1-way - the biggest fuckup here seems to just be the PHM-style "unroll fucking everything." That implementation has two loops - and one of them is for putting the AES tables into LDS - so code size kills it on GCN. Code cache is 32kB, and it weighs in at 76,680 bytes when compiled for Tahiti. There's also one more thing, the PHM implementation is scalar. It doesn't use arrays, just a whole load of 32-bit integers. Shavite-3 can be vectorized pretty well with uint4 - you have to access the individual members more than I'd like - but overall, it's cleaner and faster. Adding loops to roll up a lot of repetitive code helps a ton, as well. Of course, GPUs don't do branching well, so don't go overboard, but I managed to put the size at 31,492 bytes - rolled up just enough to fly under 32kB.

For reference, the algorithms used in X11, in order, follow (all are the 512-bit versions):
1. Blake
2. Blue Midnight Wish (BMW)
3. Groestl
4. Skein
5. JH
6. Keccak
7. Luffa
8. CubeHash
9. Shavite-3
10. SIMD
11. Echo
 
Last edited by a moderator:
For reference, the algorithms used in X11, in order, follow (all are the 512-bit versions):
1. Blake
2. Blue Midnight Wish (BMW)
3. Groestl
4. Skein
5. JH
6. Keccak
7. Luffa
8. CubeHash
9. Shavite-3
10. SIMD
11. Echo

Thanks for starting this discussion. I'm hopeful there is a few technical mining nerds kicking around (looking at your jimbit) that will want to get in on a discussion about how the improve x11 hash rates. You mention in your post that Shavite-3 is done fairly well. Can you comment on which other ones need the most improvement and which ones aren't total dogs?

I heard you talk in the channel about about SIMD being pretty much the slowest dog of them all?
 
Thanks for starting this discussion. I'm hopeful there is a few technical mining nerds kicking around (looking at your jimbit) that will want to get in on a discussion about how the improve x11 hash rates. You mention in your post that Shavite-3 is done fairly well. Can you comment on which other ones need the most improvement and which ones aren't total dogs?

I heard you talk in the channel about about SIMD being pretty much the slowest dog of them all?

Yes, but that isn't really due to bad coding. Not that SIMD was done well, or anywhere near, but it has a fucked design, and it's gonna be really hard to do properly on GPU. No one-way implementation is gonna cut it - you'll spill registers to global memory, which is dead slow. It needs to be split at least 4-way, but likely 8-way, to have any real chance of performing as it should. I've said it before: a properly done X11 implementation for GPU would use pretty much no global memory at all, and if necessary, could easily avoid using even LDS by bitslicing the AES S-box.

The incompetence awards go to Echo. That code is just terrible. First off, what is with the seperate integer variables? Did PHM have a fear of arrays or something? That shit is just hard to read. Then, there are 16 64-bit integer variables used that flat out are not needed in any way - if you look hard enough, you'll see they're constant - never modified. It has 4 32-bit variables to count the number of bits we hash - but only one is needed, because we're hashing a small amount of data every single time, and it's NEVER going to be large enough to need the others - still, space and time is wasted on checking for overflow.

Then you have the very fact that it's using 64-bit integers to begin with - Echo is easily doable with only 32-bit operations, and the AMD compiler, while it shouldn't have graduated compiler kidnergarten, is particularly brain-dead when it comes to compiling 64-bit code. Additionally, the GPU doesn't have 64-bit hardware - the operations are emulated, I believe - so there's zero advantage to doing it that way, and quite a disadvantage due to the compiler. Then, of course, PHM's tendency to unroll ALL the things hurts it a bit, although it still fits in the code cache, the compiler doesn't really optimize it as well as it could - which isn't much, anyway. Oh, yeah, and a missed vectorization opportunity when shifting rows.

Echo can, and probably should, be done 8-way, but my 1-way is pretty good.
 
So now that we have outlined what some of the x11 algorithms are how about we talk a bit about how some of these current implementations could be improved. Is it something we can pick away at improving one at a time or do we need to attack them all at once?

What would be some of the incentives for improving them?

Is it worth your time to do if we could raise the funds from the community?

Do you see any glaring security issues in relation to how we are using x11 and attempting to scale it?
 
You can take them one at a time - it's the easiest way. I did it by making changes to one and hashing, if you get HW errors, you fucked up.

I don't know what incentives there are for improving them; as for it being worth my time, Ignition75 is working on that, he said.

As for glaring security issues... not really, besides it being fairly FPGA and ASIC-friendly. I know that's by design, though. Basically, memory is usually constrained on an FPGA, and you can do X11 with almost none at all; the oddball fuckup of a hash called SIMD has a ridiculous 1kb state, so you'd want that, some scratch space for calculations, and some breathing room - wouldn't be expensive in terms of memory. For ASIC, it gets better, because you can just make chips with that tiny amount of memory needed, probably, and parallelize the fuck out of it.

I'll detail some improvement ideas and stuff in a seperate post.
 
Crowd funding project for GPU mining optimisations is forming right now, I'll be running it and reaching out to every community that will benefit.

Please stay tuned details should be finalised in the next few days...
 
Crowd funding project for GPU mining optimisations is forming right now, I'll be running it and reaching out to every community that will benefit.

Please stay tuned details should be finalised in the next few days...
Wow! Exciting stuff. Keep me updated!
 
Wow! Exciting stuff. Keep me updated!
Fingers crossed, I don't want to reveal who's involved out of respect for them, until all the details are finalised. I've got about 50 coin communities on my list so far, this kind of co-ordination will require the whole community to come together, I don't think it has been done before.

The ShadowCoin designer CRZ who is a friend of mine is putting together something professional for us to use as people like shiny things, if the scammers can appear professional why can't we?
 
Opportunities for improvements in the current implementations of these hashes are countless. I'll talk about Skein here - it can be improved, even though it's not particularly slow.

The PHM implementation of Skein just barely fits in the code cache, at 30,808 bytes. This is good, but it's too heavy on VGPRs, meaning less threads can be run per compute unit (less "waves in flight.") I should note that GCN cards, that is, all AMD cards 7xxx and later, do NOT have hardware vectors - a vector general purpose register is 4 bytes. However, using vectors larger than this in your code can sometimes help the compiler see opportunities - I vectorized my implementation of Skein almost completely, and dropped VGPR usage from 73 to 49 - getting me another wave in flight by a large margin. Additionally, the compiler is brain dead on 64-bit rotate - I emulated it with a uint2. In addition to having more waves in flight, my code drops the clock cycles needed per wavefront from 52 to 44, so each wave also executes faster.

Vectorizing Skein isn't super easy, but it's not super hard, either. State and key go in a 64-bit unsigned integer vector of 8. last 64-bits of key go in a seperate variable. For both the even and odd Skein rounds, you break the state into two vectors of four - then you can add them in a vectorized fashion, rotate one of them by the necessary constants (I do this in a scalar fashion, because I emulate 64-bit rotate for the retarded compiler), XOR them in a vectorized fashion, and then it's just a quick shuffle, or permutation of the vector, before you do it three more times with more constants.

Before this, though, you need to do the key injection. Not hard, add the key vector to the state vector, then add those constants that come from the options (the params to UBI, in the Skein spec.) I store those in an array and dynamically index it with modulus, it's okay - the compiler hardcodes them as constants because I force unroll the main loop of all the Skein rounds. Add the round number, and the key injection is done.

The way to loop it isn't dead obvious - you see, there are 9 64-bit integers used for the key, just only 8 are used at a time. So, you need to shuffle it oddly - store the current first 64-bit integer in the vector in a temp var, shuffle it so all the integers move one to the left, and then store the last part of the key (in a seperate variable), into the slot all the way to the right. After that, the integer that was first previously - the one you stored in the temp var, becomes the last part of the key. In this way, you're implementing a rotate left of 9 64-bit integers, even though you only have a vector of 8. The vector size is perfect, because the variable that gets shuffled out is not added to the state that round.

So, do the Skein even round, comprised of a key injection, then the main mix 4 times with the even round constants, do the key shuffle, do the Skein odd round, same thing, key injection, same mix 4 times, different constants. Shuffle the key again, rinse and repeat 9 times (18 rounds). After that, one last key injection, and it's done.
 
Can someone ELI5 about the recent changes in sgminer that have bumped hashing rates higher? People are talking about a 40-50% increase. Is this real? Be curious to know the back story behind all this?
 
Can someone ELI5 about the recent changes in sgminer that have bumped hashing rates higher? People are talking about a 40-50% increase. Is this real? Be curious to know the back story behind all this?
Those would be mine; yes, it's real. A friend of mine begged me to work on X11, because he was getting ready to shut down his GPUs, and I kept saying no, until I finally said I'd look into it.
 
Back
Top