|
Message-ID: <CAGiyFdd6_LVzUUfFcaqMyub1c2WPvWUzAQDCH+Aza-_t6mvmXg@mail.gmail.com>
Date: Fri, 16 Dec 2016 08:08:57 +0000
From: Jean-Philippe Aumasson <jeanphilippe.aumasson@...il.com>
To: George Spelvin <linux@...encehorizons.net>, ak@...ux.intel.com, davem@...emloft.net,
David.Laight@...lab.com, ebiggers3@...il.com, hannes@...essinduktion.org,
Jason@...c4.com, kernel-hardening@...ts.openwall.com,
linux-crypto@...r.kernel.org, linux-kernel@...r.kernel.org,
luto@...capital.net, netdev@...r.kernel.org, tom@...bertland.com,
torvalds@...ux-foundation.org, tytso@....edu, vegard.nossum@...il.com
Cc: djb@...yp.to
Subject: Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF
Here's a tentative HalfSipHash:
https://github.com/veorq/SipHash/blob/halfsiphash/halfsiphash.c
Haven't computed the cycle count nor measured its speed.
On Fri, Dec 16, 2016 at 4:46 AM George Spelvin <linux@...encehorizons.net>
wrote:
> Jean-Philippe Aumasson wrote:
> > If a halved version of SipHash can bring significant performance boost
> > (with 32b words instead of 64b words) with an acceptable security level
> > (64-bit enough?) then we may design such a version.
>
> It would be fairly significant, a 2x speed benefit on a lot of 32-bit
> machines.
>
> First is the fact that a 64-bit SipHash round on a generic 32-bit machine
> requires not twice as many instructions, but more than three.
>
> Consider the core SipHash quarter-round operation:
> a += b;
> b = rotate_left(b, k)
> b ^= a
>
> The add and xor are equivalent between 32- and 64-bit rounds; twice the
> instructions do twice the work. (There's a dependency via the carry
> bit between the two halves of the add, but it ends up not being on the
> critical path even in a superscalar implementation.)
>
> The problem is the rotates. Although some particularly nice code is
> possible on 32-bit ARM due to its support for shift-and-xor operations,
> on a generic 32-bit CPU the rotate grows to 6 instructions with a 2-cycle
> dependency chain (more in practice because barrel shifters are large and
> even quad-issue CPUs can't do 4 shifts per cycle):
>
> temp_lo = b_lo >> (32-k)
> temp_hi = b_hi >> (32-k)
> b_lo <<= k
> b_hi <<= k
> b_lo ^= temp_hi
> b_hi ^= temp_lo
>
> The resultant instruction counts and (assuming wide issue)
> latencies are:
>
> 64-bit SipHash "Half" SipHash
> Inst. Latency Inst. Latency
> 10 3 3 2 Quarter round
> 40 6 12 4 Full round
> 80 12 24 8 Two rounds
> 82 13 26 9 Mix in one word
> 82 13 52 18 Mix in 64 bits
> 166 26 61 18 Four round finalization + final XOR
> 248 39 113 36 Hash 64 bits
> 330 52 165 54 Hash 128 bits
> 412 65 217 72 Hash 192 bits
>
> While the ideal latencies are actually better for the 64-bit algorithm,
> that requires an unrealistic 6+-wide superscalar implementation that's
> more than twice as wide as the 64-bit code requires (which is already
> optimized for quad-issue). For a 1- or 2-wide processor, the instruction
> counts dominate, and not only does the 64-bit algorithm take 60% more
> time to mix in the same number of bytes, but the finalization rounds
> bring the ratio to 2:1 for small inputs.
>
> (And I haven't included the possible savings if the input size is an odd
> number of 32-bit words, such as networking applications which include
> the source/dest port numbers.)
>
>
> Notes on particular processors:
> - x86 can do a 64-bit rotate in 3 instructions and 2 cycles using
> the SHLD/SHRD instructions instead:
> movl %b_hi, %temp
> shldl $k, %b_lo, %b_hi
> shldl $k, %temp, %b_lo
> ... but as I mentioned the problem is registers. SipHash needs 8 32-bit
> words plus at least one temporary, and 32-bit x86 has only 7 available.
> (And compilers can rarely manage to keep more than 6 of them busy.)
> - 64-bit SipHash is particularly efficient on 32-bit ARM due to its
> support for shift-and-op instructions. The 64-bit shift and following
> xor can be done in 4 instructions. So the only benefit is from the
> reduced finalization.
> - Double-width adds cost a little more on CPUs like MIPS and RISC-V without
> condition codes.
> - Certain particularly crappy uClinux processors with slow shifts
> (68000, anyone?) really suffer from extra shifts.
>
> One *weakly* requested feature: It might simplify some programming
> interfaces if we could use the same key for multiple hash tables with a
> 1-word "tweak" (e.g. pointer to the hash table, so it could be assumed
> non-zero if that helped) to make distinct functions. That would let us
> more safely use a global key for multiple small hash tables without the
> need to add code to generate and store key material for each place that
> an unkeyed hash is replaced.
>
Content of type "text/html" skipped
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.