Message-ID: <20110520043450.GA12135@openwall.com>
Date: Fri, 20 May 2011 08:34:50 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

On Fri, May 20, 2011 at 12:07:05AM -0300, Yuri Gonzaga wrote:
> Even so, I did some research on Google.
> According to [1], in section "CLBs, Slices, and LUTs", one Virtex-6
> slice has 4 LUTs/8 flip-flops.
> It looks like one LUT has up to 2 flip-flops.

Yes.  This appears to be a 2x improvement over Virtex-5.

> Maybe, some of the registers were merged into logic in the 131 LUTs.

It appears so.  This "Virtex-6 Family Overview" you found says that
"between 25-50% of all slices can also use their LUTs as distributed
64-bit RAM".  Per the table on page 2, for the device you synthesized
for, this percentage appears to be at around 35.9%.

Also, it appears that you synthesized for the smallest Virtex-6 device,
with 11,640 slices.  The largest described in that overview is 6.4 times
larger.  If my math is right, assuming the 66x figure from my previous
e-mail, it could give us a 420x advantage over a quad-core CPU, or as
much as 2500x for Pico's 6-chip board.

Would it be possible to generate some sort of diagram, to easily see how
those 131 LUTs are used?

That table on page 2 also gives the number of 36 Kbit Block RAMs - from
156 to 1064.  This should let us implement this many full bcrypt cores.
I think we should try to do just that, and perhaps we'd have plenty of
LUTs left for smaller/other cores (bflike and/or DES).  So please
proceed to implement bcrypt's performance-critical loop (as discussed
previously) for Virtex-6 using Block RAM.

Finally, there are lots of DSP slices - so many that it feels wrong not
to use them - from 288 to 2016.  Each DSP can process 48-bit data at
600 MHz.  It is unclear from this overview how much memory the DSPs can
access, and by what means, though.
If they can access an equivalent of arbitrary 48-bit registers (many of
them), then 2016 of those DSPs could outperform a quad-core dual-issue
128-bit SIMD CPU at bitslice DES by a factor of 15.

> > The relatively small increase in LUT count means that with a 2-round
> > implementation, we waste most LUTs on the state machine, right?
>
> I am not sure.  Maybe doing some experiments could clarify this
> question.
> For example, remove part of the state machine and see the impact on
> the result.

Yes, you could try.

Another guess: many LUTs are wasted on the s[] initial values.  Those
values could theoretically be packed into just four LUTs, but that would
probably not allow for them to be read out quickly.  So perhaps more
LUTs are wasted for that reason.

> > Are you sure this is good enough for the compiler to do the smart
> > thing, realizing that we have an adder with some holes punched in
> > it?
>
> And what do you suggest?  Find an equivalent function but simpler?

No, I thought you'd show this partial carry addition bit-by-bit, without
the redundancy that you expect the synthesizer to detect and optimize
out.  But maybe this is not needed.  If we could see (and understand) a
diagram of the generated circuit, we'd know for sure.  Or you can try
both ways and compare the LUT counts.

Oh, here's a simpler test: try replacing pcadd() with a simple addition
or a simple XOR.  If the synthesizer is smart enough, this should not
change the LUT count.  If this does reduce the LUT count, then perhaps
there was room for improvement.

> In any case, the synthesizer already does optimizations in logic (or
> at least, tries to).

Yes, perhaps it does that well enough.

> > Do these initial value assignments consume any logic?
>
> Yes.

> > If so, perhaps you can save some logic by having the initial state
> > of s[] uploaded from the host?
>
> It is possible.  However, for each calculation it will be necessary to
> upload them.

And that's fine.
We'll need to be supplying them anyway - if not from the host, then from
a SHA-256 core (or the like) implemented in the FPGA.

Also, the Niter of 512 that I had in my code was just for testing.  In
actual use, the iteration counts would be a lot higher - perhaps about
2 million (to compute our hashes in 10 ms at 200 MHz).  Uploading some
data every 10 ms is fine.  We'd need to do it anyway; the only question
is how much data - only the input to a SHA-256 core implemented in the
FPGA (which would in turn feed those other components), or the output of
SHA-256 or SHA-512 running on the host (inputs to each of the thousands
of cores individually).  I think that either approach would work.

BTW, I did some testing of bflike.c with millions of iterations -
detected no state repeats.

> Unless we upload just once and maintain them stored in some registers.

No, you're missing the fact that we actually need to provide inputs to
those cores.  L and R values are not sufficient - they're only 16-bit
together, so there will be lots of repeated values of them across the
cores and between iterations.  We want our cores to be computing
different things, not the same thing many times (which an attacker would
optimize out).

Oh, if we used different initial values of s[] for each core, then we
could possibly send/receive just the 16-bit blocks (and accept having
some collisions).  Yet this feels worse than sending/receiving s[].

Regarding correctness testing:

> I did exactly what you said.

Great!

> [1]
> http://www.dinigroup.com/product/data/DN-DualV6-PCIe-4/files/Virtex6_Overview_v11.pdf

Thanks,

Alexander