
Date: Sat, 14 May 2011 03:47:06 +0400
From: Solar Designer <solar@...nwall.com>
To: cryptdev@...ts.openwall.com
Subject: Re: alternative approach
David, Yuri 
I've experimented with this today. Attached is a program that
demonstrates a heavily cutdown Eksblowfishlike algorithm.
On Thu, May 12, 2011 at 02:51:40PM +0400, Solar Designer wrote:
> The primary idea is to use smaller Sboxes. Normally, Blowfish uses
> four 8to32 Sboxes. We could reduce their size maybe to 4to16.
Reduced to 4to8 in the attached proofofconcept program. The number
of Sboxes is reduced from 4 to 2.
The size of Sboxes is 256 bits (2x16x8). Besides the changing Sboxes,
there's 16 bits of internal state in the L and R variables (8bit each).
I wonder if it's possible to synthesize this into 68 logic cells on
Virtex6 (we have 272 bits of state, and new Xilinx logic cells have 4
output registers each). Yuri  can you give this a try? Please note
that Nrounds can be made as low as 2 if needed, eliminating that loop,
or instead it can be made high, making the outer loop's logic
nonperformancecritical. Whatever works best. I did my testing with
either combination of settings (high Nrounds gives slightly better
properties).
David  for comparison, how many logic cells do the DES cores use?
Since they're pipelined, we'll need to divide by 16 to compare against
an iterated implementation... or we'll need to make a pipelined
implementation as well (and take the number of pipeline stages into
consideration when comparing). With the internal state reduced to 272
bits, I think a pipelined implementation is possible.
> I think 4bit indices are close to the minimum we can use. With even
> smaller indices, we allow for bitslice implementations in software that
> keep the arrays in registers (or load them into registers when needed)
> and use nottoomany bitwise operations to extract the right elements.
The above is correct, but the below is not:
> For 4bit, it takes 16 registers (or loads) per Sbox and at least 4
> instructions (when there's a bit select instruction like selb, vsel,
> vpcmov, BFI_INT) or 12 (with andnot) or 16 instructions. Oh, there
> will also be plenty of instructions needed to spread/duplicate our
> index bits over the SIMD vectors before they're usable for bit selects.
When writing the above, I had a partiallybitslice implementation in
mind, and I got the numbers wrong anyway. So please disregard this.
With a fully bitslice implementation, it'd take at least 15 instructions
(assuming the most appropriate instruction set with 3 inputs per
instruction) to combine the 16 values, but there won't be any need to
spread/duplicate the index bits. Those 15+ instructions will give us
parallel 4to1 bit lookups (as many of them as our SIMD vector width
permits, e.g. 128). We'll need to do this 8 times for one 4to8 Sbox
lookup. This gives us a slowdown estimate of 120x for each of the
instances being computed in parallel, compared to a purely sequential
software implementation with explicit array lookups. If we're able to
compute not just 128, but, say, 256 instances in parallel (e.g., on a
future processor implementing AMD XOP efficiently), then we might have
some speedup. There's another maybe 50% slowdown because of having to
load a half of the 256 values into registers, though, since we don't
expect our CPU to have 128+ registers and since on x86 instruction set
extensions only one input operand may reside in memory. (And on RISC
architectures, none can.) So perhaps the vectors would need to be wider
than 256 bits for any speedup due to bitslicing. On the other hand,
bitslicing allows for optimizing the partialcarry adds.
> The Sboxes themselves will be everchanging, like they are in Blowfish
> normally. This prevents DESlike bitslice implementations.
...However, there's some similarity to DES:
With DES, we directly put the constant Sboxes into LUTs. With changing
Sboxes, we instead use the LUTs to select the right outputs, emulating
table lookups much like a bitslice software implementation would. Thus,
in either case (DES or this Blowfishlike construction) a bitslice
implementation in software would use the same approach that a hardware
implementation does.
This makes me think that we might not actually gain much (or anything)
by doing this, compared to just using DES. But this needs further
research and testing.
If there's a reduction in logic cell count (compared to that for DES),
then this is what provides advantage  we'll have many more cores per
FPGA chip. If not, then maybe the current unfriendliness to bitslice
implementations (needing relatively uncommon instructions and frequent
loads from L1 cache) provides some advantage.
> Also, Blowfish uses XORs (nocarry addition) and ADDs (allcarry
> addition). In hardware, we can instead use partialcarry addition 
> e.g., do carry for every other bit. A nonbitslice software
> implementation would need to use an extra instruction there, whereas
> we save logic.
Here's what this looks like in a nonbitslice software implementation:
#define pcadd(a, b, mask) \
(((a) ^ (b)) + (((a) & (b) & (mask)) << 1))
The mask is constant (since we optimize for hardware implementation).
That's 5 instructions instead of 1, whereas the cost in FPGA should be
the same. Or did I miss a way to do this in software more optimally
(other than bitslice)?
> Finally, we can add bit permutations. These are free in hardware, but
> are costly for nonbitslice software implementations. (Some modern
> instruction sets include arbitrary byte permutation instructions, but
> not arbitrary bit permutations.)
I haven't tried this yet.
> We'd need to do some testing of the resulting component, such as for
> collision resistance, but we'd call it "noncrypto". We may use a
> software implementation for most of such testing (slow, but that's OK).
I did some testing of this kind. The results are acceptable.
$ gcc bflike.c o bflike O2 fomitframepointer funrollloops Wall
$ ./bflike < w  sort u  wc l
ntot/neq = 255.83
65536
$ wc l w
65536 w
No full internal state collisions for the 65536 inputs here. Also, the
code does some sanity checks.
Yuri  here's a simplified version of encrypt(), with the sanity checks
and encryption mode removed (leaving only the slow key setup mode), for
you to try to synthesize:
static void slow_key_setup(unsigned int n)
{
word *p;
word l = 0;
word r = 0;
do {
p = &s[0][0];
do {
int i;
for (i = 0; i < Nrounds / 2; i++) {
r ^= pcadd(s[0][l & 0xf], s[1][l >> 4], 0x55);
l ^= pcadd(s[0][r & 0xf], s[1][r >> 4], 0xaa);
}
*p++ ^= l;
*p++ ^= r;
} while (p < &s[1][15]);
} while (n);
}
Thanks,
Alexander
View attachment "bflike.c" of type "text/plain" (3746 bytes)
Powered by blists  more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.