Date: Sat, 14 May 2011 03:47:06 +0400
From: Solar Designer <>
Subject: Re: alternative approach

David, Yuri -

I've experimented with this today.  Attached is a program that
demonstrates a heavily cut-down Eksblowfish-like algorithm.

On Thu, May 12, 2011 at 02:51:40PM +0400, Solar Designer wrote:
> The primary idea is to use smaller S-boxes.  Normally, Blowfish uses
> four 8-to-32 S-boxes.  We could reduce their size maybe to 4-to-16.

Reduced to 4-to-8 in the attached proof-of-concept program.  The number
of S-boxes is reduced from 4 to 2.

The total size of the S-boxes is 256 bits (2x16x8).  Besides the changing
S-boxes, there are 16 bits of internal state in the L and R variables
(8 bits each).

I wonder if it's possible to synthesize this into 68 logic cells on
Virtex-6 (we have 272 bits of state, and new Xilinx logic cells have 4
output registers each).  Yuri - can you give this a try?  Please note
that Nrounds can be made as low as 2 if needed, eliminating that loop,
or instead it can be made high, making the outer loop's logic
non-performance-critical.  Whatever works best.  I did my testing with
either combination of settings (high Nrounds gives slightly better
results).

David - for comparison, how many logic cells do the DES cores use?

Since they're pipelined, we'll need to divide by 16 to compare against
an iterated implementation... or we'll need to make a pipelined
implementation as well (and take the number of pipeline stages into
consideration when comparing).  With the internal state reduced to 272
bits, I think a pipelined implementation is possible.

> I think 4-bit indices are close to the minimum we can use.  With even
> smaller indices, we allow for bitslice implementations in software that
> keep the arrays in registers (or load them into registers when needed)
> and use not-too-many bitwise operations to extract the right elements.

The above is correct, but the below is not:

> For 4-bit, it takes 16 registers (or loads) per S-box and at least 4
> instructions (when there's a bit select instruction like selb, vsel,
> vpcmov, BFI_INT) or 12 (with and-not) or 16 instructions.  Oh, there
> will also be plenty of instructions needed to spread/duplicate our
> index bits over the SIMD vectors before they're usable for bit selects.

When writing the above, I had a partially-bitslice implementation in
mind, and I got the numbers wrong anyway.  So please disregard this.

With a fully bitslice implementation, it'd take at least 15 instructions
(assuming the most appropriate instruction set with 3 inputs per
instruction) to combine the 16 values, but there won't be any need to
spread/duplicate the index bits.  Those 15+ instructions will give us
parallel 4-to-1 bit lookups (as many of them as our SIMD vector width
permits, e.g. 128).  We'll need to do this 8 times for one 4-to-8 S-box
lookup.  This gives us a slowdown estimate of 120x for each of the
instances being computed in parallel, compared to a purely sequential
software implementation with explicit array lookups.  If we're able to
compute not just 128, but, say, 256 instances in parallel (e.g., on a
future processor implementing AMD XOP efficiently), then we might have
some speedup.  There's another maybe 50% slowdown because of having to
load a half of the 256 values into registers, though, since we don't
expect our CPU to have 128+ registers and since on x86 instruction set
extensions only one input operand may reside in memory.  (And on RISC
architectures, none can.)  So perhaps the vectors would need to be wider
than 256 bits for any speedup due to bitslicing.  On the other hand,
bitslicing allows for optimizing the partial-carry adds.

> The S-boxes themselves will be ever-changing, like they are in Blowfish
> normally.  This prevents DES-like bitslice implementations.

...However, there's some similarity to DES:

With DES, we directly put the constant S-boxes into LUTs.  With changing
S-boxes, we instead use the LUTs to select the right outputs, emulating
table lookups much like a bitslice software implementation would.  Thus,
in either case (DES or this Blowfish-like construction) a bitslice
implementation in software would use the same approach that a hardware
implementation does.

This makes me think that we might not actually gain much (or anything)
by doing this, compared to just using DES.  But this needs further
research and testing.

If there's a reduction in logic cell count (compared to that for DES),
then this is what provides the advantage - we'll have many more cores per
FPGA chip.  If not, then maybe the current unfriendliness to bitslice
implementations (needing relatively uncommon instructions and frequent
loads from L1 cache) provides some advantage.

> Also, Blowfish uses XORs (no-carry addition) and ADDs (all-carry
> addition).  In hardware, we can instead use partial-carry addition -
> e.g., do carry for every other bit.  A non-bitslice software
> implementation would need to use an extra instruction there, whereas
> we save logic.

Here's what this looks like in a non-bitslice software implementation:

#define pcadd(a, b, mask) \
	(((a) ^ (b)) + (((a) & (b) & (mask)) << 1))

The mask is constant (since we optimize for hardware implementation).

That's 5 instructions instead of 1, whereas the cost in FPGA should be
the same.  Or did I miss a way to do this in software more optimally
(other than bitslice)?

> Finally, we can add bit permutations.  These are free in hardware, but
> are costly for non-bitslice software implementations.  (Some modern
> instruction sets include arbitrary byte permutation instructions, but
> not arbitrary bit permutations.)

I haven't tried this yet.

> We'd need to do some testing of the resulting component, such as for
> collision resistance, but we'd call it "non-crypto".  We may use a
> software implementation for most of such testing (slow, but that's OK).

I did some testing of this kind.  The results are acceptable.

$ gcc bflike.c -o bflike -O2 -fomit-frame-pointer -funroll-loops -Wall
$ ./bflike < w | sort -u | wc -l
ntot/neq = 255.83
$ wc -l w
65536 w

No full internal state collisions for the 65536 inputs here.  Also, the
code does some sanity checks.

Yuri - here's a simplified version of encrypt(), with the sanity checks
and encryption mode removed (leaving only the slow key setup mode), for
you to try to synthesize:

static void slow_key_setup(unsigned int n)
{
	word *p;
	word l = 0;
	word r = 0;

	do {
		p = &s[0][0];
		do {
			int i;
			for (i = 0; i < Nrounds / 2; i++) {
				r ^= pcadd(s[0][l & 0xf], s[1][l >> 4], 0x55);
				l ^= pcadd(s[0][r & 0xf], s[1][r >> 4], 0xaa);
			}
			*p++ ^= l;
			*p++ ^= r;
		} while (p < &s[1][15]);
	} while (--n);
}


View attachment "bflike.c" of type "text/plain" (3746 bytes)
