Message-ID: <20121209234436.GD4261@openwall.com>
Date: Mon, 10 Dec 2012 03:44:36 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: GCN: indexed access to VGPRs

On Sun, Dec 09, 2012 at 02:38:18PM +0200, Milen Rangelov wrote:
> I thought about that when doing the bcrypt kernel. There is one problem
> with that though - we have a hard limit of 256 VGPRs per workitem

Yeah, I thought of this shortly after I sent the messages yesterday, but
I was unsure.  The GCN instruction encoding only allows for fixed VGPR
register numbers in the 0 to 255 range, but it is unclear whether this
limitation also applies to indexed access to VGPRs (there's no
fixed-width field for the register number then).  Anyhow, OpenCL might
impose this limitation universally, regardless of what the hardware is
capable of.

> and it
> does not matter how many workitems per group we spawn, the limit stays even
> if we run the kernel with worksize of say just 2 items (effectively that
> means we'd underuse the register file a lot). So we can utilize at most 1KB
> of registers for our sbox data. What eventually happens though is that the
> compiler spills registers into global memory (and this register spill is
> much worse than I expected). I tried having one of the 4 sboxes as a
> private array and got a lot of spilled registers, the end result being
> slower even given the increased occupancy and finally for some reason the
> kernel was not calculating the hash correctly (might be a mistake on my part
> or a compiler issue, didn't investigate).

Understood.  Placing one of the 4 S-boxes into registers was one of my
ideas, too (I had not mentioned it yet).
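
For reference, the 1KB figure quoted above can be checked with a few lines
of arithmetic.  This is only a rough sketch: it assumes each VGPR holds one
32-bit value per work-item and uses the standard Blowfish layout of four
S-boxes with 256 32-bit entries each.  The constant names are made up for
illustration.

```python
# Per-work-item register budget vs. bcrypt S-box size.
# Assumptions (not stated in the thread): one 32-bit value per VGPR per
# work-item; four S-boxes of 256 32-bit entries, as in Blowfish.
VGPR_LIMIT = 256        # hard limit mentioned in the quoted message
BYTES_PER_VGPR = 4
SBOX_ENTRIES = 256
BYTES_PER_ENTRY = 4
NUM_SBOXES = 4

register_bytes = VGPR_LIMIT * BYTES_PER_VGPR
one_sbox_bytes = SBOX_ENTRIES * BYTES_PER_ENTRY
all_sboxes_bytes = NUM_SBOXES * one_sbox_bytes

print(register_bytes, one_sbox_bytes, all_sboxes_bytes)  # 1024 1024 4096
```

Under these assumptions a single full S-box would by itself consume the
entire 256-register budget, leaving nothing for the rest of the state,
which is consistent with the heavy spilling described above.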

> Perhaps, though, a smaller chunk of the sbox in VGPRs would be beneficial, I
> just did not try that possibility.

We'd have an if/else then - and if it's implemented with eager
execution, then we incur the LDS access latency even when the data is in
fact in a register.  What we gain is a slightly higher number of
concurrent bcrypt instances per CU (18 instead of 16 if we put one half
of one S-box into registers?)
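
To illustrate the eager execution concern, here is a toy model in Python
(not OpenCL) of a hybrid lookup for a single work-item, with the lower half
of one S-box in "registers" and the upper half in "LDS".  All names here
(CountingLDS, lookup_branchy, lookup_eager, SPLIT) are hypothetical, and the
model only counts LDS reads; it says nothing about what a real compiler
would generate.

```python
# Toy model: entries 0..127 live in "registers", entries 128..255 in "LDS".
SPLIT = 128

class CountingLDS:
    """Stand-in for local memory that counts how often it is read."""
    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def load(self, i):
        self.reads += 1
        return self.data[i]

def lookup_branchy(regs, lds, idx):
    # Real branch: LDS is touched only when the index is in the LDS half.
    if idx < SPLIT:
        return regs[idx]
    return lds.load(idx - SPLIT)

def lookup_eager(regs, lds, idx):
    # Eager form: both sides are evaluated, then one result is selected,
    # so the LDS read is issued even when the register value is used.
    from_regs = regs[idx % SPLIT]
    from_lds = lds.load(idx % SPLIT)
    return from_regs if idx < SPLIT else from_lds

# Arbitrary 32-bit test values, not real Blowfish S-box contents.
sbox = [(i * 2654435761) & 0xFFFFFFFF for i in range(256)]
regs = sbox[:SPLIT]
lds_a = CountingLDS(sbox[SPLIT:])
lds_b = CountingLDS(sbox[SPLIT:])

indices = [3, 200, 17, 255, 64]
out_a = [lookup_branchy(regs, lds_a, i) for i in indices]
out_b = [lookup_eager(regs, lds_b, i) for i in indices]

print(out_a == out_b, lds_a.reads, lds_b.reads)  # True 2 5
```

Both variants return the same values, but the eager one reads LDS on every
lookup, so the register copy saves LDS capacity without saving any LDS
latency.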

This is worth experimenting with, but if the 256 registers per work-item
limit does in fact apply, then any possible gain is quite minor.
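
The "18 instead of 16" estimate above can be reproduced with a short
calculation.  This is a sketch under assumptions not spelled out in this
message: 64 KB of LDS per CU, LDS capacity being the only limit on the
number of concurrent bcrypt instances, and each instance keeping only its
S-boxes in LDS.  The function name is made up for illustration.

```python
# Concurrent bcrypt instances per CU as limited by LDS capacity alone.
# Assumptions: 64 KB LDS per CU; four 1 KB S-boxes per instance; any other
# per-instance LDS use and all other occupancy limits are ignored.
LDS_BYTES_PER_CU = 64 * 1024
SBOX_BYTES = 256 * 4
NUM_SBOXES = 4

def instances_per_cu(sbox_bytes_in_registers):
    """Instances that fit in LDS when part of the S-box data is in VGPRs."""
    lds_per_instance = NUM_SBOXES * SBOX_BYTES - sbox_bytes_in_registers
    return LDS_BYTES_PER_CU // lds_per_instance

print(instances_per_cu(0))                # all four S-boxes in LDS -> 16
print(instances_per_cu(SBOX_BYTES // 2))  # half of one S-box in VGPRs -> 18
```

Going from 16 to 18 instances is a 12.5% increase at best, and that is
before accounting for the LDS latency still being incurred.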

Alexander
