john-dev - Re: GCN: indexed access to VGPRs

Message-ID: <20121209234436.GD4261@openwall.com>
Date: Mon, 10 Dec 2012 03:44:36 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: GCN: indexed access to VGPRs

On Sun, Dec 09, 2012 at 02:38:18PM +0200, Milen Rangelov wrote:
> I thought about that when doing the bcrypt kernel. There is one problem
> with that though - we have a hard limit of 256 VGPRs per workitem

Yeah, I thought of this shortly after I sent the messages yesterday, but
I was unsure.  The GCN instruction encoding only allows for fixed VGPR
numbers in the 0 to 255 range, but it is unclear whether this limitation
also applies to indexed access to VGPRs (there's no fixed-width field
for the register number then).  Anyhow, OpenCL might impose this
limitation universally, regardless of what the hardware is capable of.

> and it
> does not matter how many workitems per group we spawn, the limit stays even
> if we run the kernel with worksize of say just 2 items (effectively that
> means we'd underuse the register file a lot). So we can utilize at most 1KB
> of registers for our sbox data. What eventually happens though is that the
> compiler spills registers into global memory (and this register spill is
> much worse than I expected). I tried having one of the 4 sboxes as a
> private array and got a lot of spilled registers, the end result being
> slower even given the increased occupancy and finally for some reason the
> kernel was not calculating the hash correctly (might be mistake on my part
> or a compiler issue, didn't investigate).

Understood.  Placing one of the 4 S-boxes into registers was one of my
ideas, too (was not mentioned yet).
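
As a rough sanity check on the 1KB figure quoted above, here is a small
sketch of the per-work-item register budget.  It assumes each VGPR holds
one 32-bit value per work-item and uses the standard Blowfish S-box size
of 256 32-bit entries; the function names are made up for illustration.

```python
# Back-of-the-envelope VGPR budget for one work-item.
VGPR_LIMIT = 256       # per-work-item limit mentioned in the thread
VGPR_BYTES = 4         # assumption: one VGPR = one 32-bit value per work-item
SBOX_ENTRIES = 256     # a Blowfish S-box has 256 entries
SBOX_ENTRY_BYTES = 4   # each entry is 32 bits wide


def vgpr_budget_bytes(limit=VGPR_LIMIT):
    """Total register storage available to one work-item, in bytes."""
    return limit * VGPR_BYTES


def vgprs_for_sbox_fraction(fraction):
    """VGPRs needed to hold the given fraction of one S-box."""
    return int(SBOX_ENTRIES * fraction) * SBOX_ENTRY_BYTES // VGPR_BYTES


def vgprs_left(fraction, limit=VGPR_LIMIT):
    """VGPRs remaining for everything else in the kernel."""
    return limit - vgprs_for_sbox_fraction(fraction)


if __name__ == "__main__":
    print("budget in bytes:", vgpr_budget_bytes())
    for fraction in (1.0, 0.5, 0.25):
        print(fraction, "of an S-box leaves", vgprs_left(fraction), "VGPRs")
```

Under these assumptions a full S-box would need all 256 VGPRs on its
own, leaving none for the rest of the kernel state, which is consistent
with the heavy spilling described above.  Half an S-box would leave 128.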

> Perhaps though, smaller chunk of the sbox in VGPRs would be beneficial, I
> just did not try that possibility.

We'd have an if/else then - and if it's implemented with eager
execution (both paths evaluated, then one result selected), we incur the
LDS access latency even when the data is in fact in a register.  What we
gain is a slightly higher number of concurrent bcrypt instances per CU
(18 instead of 16 if we put one half of one S-box into registers?).
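
The 16 and 18 figures can be reproduced with a short calculation.  This
is only a sketch under stated assumptions: 64 KB of LDS per CU, four
1 KB S-boxes per bcrypt instance, and nothing else competing for LDS
(allocation granularity and any OpenCL per-work-group local memory limit
are ignored).  The function name is hypothetical.

```python
# Estimate how many bcrypt instances fit in one CU's LDS.
LDS_BYTES_PER_CU = 64 * 1024   # assumption: 64 KB of LDS per GCN CU
SBOX_BYTES = 256 * 4           # one Blowfish S-box: 256 x 32-bit entries
SBOXES_PER_INSTANCE = 4        # bcrypt keeps four S-boxes per instance


def instances_per_cu(sbox_bytes_in_vgprs=0):
    """Concurrent instances per CU when some S-box bytes live in VGPRs."""
    lds_per_instance = SBOXES_PER_INSTANCE * SBOX_BYTES - sbox_bytes_in_vgprs
    return LDS_BYTES_PER_CU // lds_per_instance


if __name__ == "__main__":
    print("all S-boxes in LDS:", instances_per_cu(0))
    print("half of one S-box in VGPRs:", instances_per_cu(SBOX_BYTES // 2))
```

With everything in LDS this gives 16 instances; moving half of one S-box
(512 bytes) into registers gives 18, i.e. a gain of about 12%, before
accounting for any cost of the extra if/else.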

This is worth experimenting with, but if the 256 registers per work-item
limit does in fact apply, then any possible gain is quite minor.

Alexander
