crypt-dev - scrypt parallelism vs. area-time vs. TMTO resistance

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131219020852.GA23401@openwall.com>
Date: Thu, 19 Dec 2013 06:08:53 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: scrypt parallelism vs. area-time vs. TMTO resistance

Hi,

Setting scrypt's p > 1 introduces parallelism at (almost) the highest
level.  This has the advantage of needing to synchronize the threads
just once (before the final PBKDF2), but it results in much lower
area-time cost for the attacker than we could achieve by having the
parallelism at a slightly lower level, inside SMix.

Specifically, in SMix's first loop, we may have the threads each write
to its own portion of V, and in SMix's second loop we may have them all
read from the entire V.  This requires that we synchronize the threads
just one more time: between the two loops in SMix.

For classic scrypt, the running time is 2*N (assuming that we have
enough computing resources to make full use of the parallelism, and
omitting r from this discussion for now).  For the revision proposed
above, the running time is only 2*N/p: the first loop fills all N
elements of V in N/p time, and the second loop performs a total of N
lookups (as usual) and accesses 63% of V elements (as usual) in N/p time.

Now given the same p and the same target running time, we can set N to
be p times higher.

If for classic scrypt the area-time cost of attack is
O(old_N^2 * p * r^y) [1], then for the proposed revision it is
O(new_N^2 * r^y).  If we maintain the same running time, then
new_N = old_N * p, giving the new area-time cost O(old_N^2 * p^2 * r^y).
This is a major improvement: we went from p to p^2 for a factor in the
attacker's cost, whereas the defender's cost remained the same.

While p > 1 is currently most relevant to (offline) uses of the KDF, we
may also look at what happens on authentication servers doing password
hashing and needing to achieve a certain throughput.  With classic
scrypt, going from p = 1 (with N and r tuned to achieve the desired
throughput) to p > 1 (with N decreased accordingly) weakens the hashes
(in area-time terms) by a factor of p.  With the proposed revision, this
(theoretically) does not affect security, so the decision may be made
based on other criteria (including possibly providing lower latency for
the users by making full use of the available computing resources even
when the server is under less than the maximum supported load).  This
may be most relevant for possible defensive use of GPUs and such, where
the amount of parallelism available from concurrent authentication
attempts may be way below what's needed to optimally utilize the device,
even on busy authentication servers (need tens of thousands of
concurrent threads for current GPUs, and this keeps growing).

Note that this revision to scrypt does not affect its TMTO friendliness
(it remains just as TMTO friendly as the original, thereby allowing for
efficient computation of the same derived keys or hashes e.g. on smaller
mobile devices, if desired).

Sounds great so far?  It does to me.  Unfortunately, when we're trying
to discourage TMTO (which we should when we don't intend to use it
ourselves, as it gives extra flexibility to the attacker), this revision
prevents us from using some of the possible approaches to that.
Specifically, it is incompatible with any writes to V in SMix's second
loop.  Some workarounds to that are possible - e.g., split SMix's second
loop into several, write only to small portions of V at a time, read
only from the remainder of the entire V, synchronize the threads
whenever there's a change in which portions are read-write (and local to
their respective threads) and which are read-only (and shared between
the threads) - but nevertheless the TMTO-discouraging effect is reduced
and complexity is increased.

Another minor drawback is that while with classic scrypt p is reusable
as a way to control running time without increasing memory needs (to be
used on small devices, where we can nevertheless wait a long time for a
derived key to be produced), in the proposed revision a new parameter
would need to be introduced for that - preferably a way to increase SMix
second loop's iteration count arbitrarily beyond N/p.  Arguably, as
pointed out by CodesInChaos, classic scrypt should have had a parameter
like this anyway, because increasing it provides better security than
abusing p, due to increasing the portion of time spent in SMix's second
loop (N^2 cost to attacker) vs. first loop (N^1.5 cost to attacker).

Here are some excerpts from my current implementation/hack, made as a
revision of crypto_scrypt-ref.c (without actually making use of the
parallelism yet, which I am leaving for the optimized implementations):

/**
 * smix(B, r, N, p, flags, V, NROM, VROM, XY):
 * Compute B = SMix_r(B, N).  The input B must be 128rp bytes in length; the
 * temporary storage V must be 128rN bytes in length; the temporary storage
 * XY must be 256r bytes in length.  The value N must be a power of 2.
 */
static void
smix(uint8_t * B, size_t r, uint64_t N, uint32_t p, escrypt_flags_t flags,
    uint8_t * V, uint64_t NROM, const uint8_t * VROM, uint8_t * XY)
{
	size_t s = 128 * r;
	uint64_t Vchunk = 0, Nchunk = N / p, Nloop;
	uint32_t i;

	Nloop = 0;
	if (flags & ESCRYPT_RW)
		Nloop = Nchunk / p;

	for (i = 0; i < p - 1; i++) {
		smix1(&B[i * s], r, Nchunk, flags,
		    &V[Vchunk * s], NROM, VROM, XY);
		smix2(&B[i * s], r, p2floor(Nchunk), Nloop, flags,
		    &V[Vchunk * s], NROM, VROM, XY);
		Vchunk += Nchunk;
	}
	smix1(&B[i * s], r, N - Vchunk, flags,
	    &V[Vchunk * s], NROM, VROM, XY);
	smix2(&B[i * s], r, p2floor(N - Vchunk), Nloop, flags,
	    &V[Vchunk * s], NROM, VROM, XY);

	for (i = 0; i < p; i++)
		smix2(&B[i * s], r, N, Nchunk - Nloop, flags & ~ESCRYPT_RW,
		    V, NROM, VROM, XY);
}

(Please disregard NROM and VROM for now, they're irrelevant to the topic
at hand.)  The functions smix1() and smix2() implement equivalents of
the original SMix's first and second loop, respectively.  This code is
actually still able to compute classic scrypt (when invoked with the
appropriate parameter values, including the flags), matching scrypt test
vectors, as well as the revision discussed here (and more).

ESCRYPT_RW requests read-write accesses to V to discourage TMTO.  When
this flag is not set (like it is not set when computing classic scrypt
or when the only deviation from classic scrypt is the new approach to
parallelization), Nloop is 0, so the first two smix2() calls seen in the
code fragment above are no-ops.  When ESCRYPT_RW is set, there's some
additional magic here, which ensures there's no reduction in "absolute"
TMTO resistance when going from p=1 to p>1 and increasing N
proportionally.  Unfortunately, there is reduction in "relative" TMTO
resistance (the amount of memory an attacker has to provide for
efficient TMTO does not increase), as well as if p is increased without
a corresponding increase of N.

Because of this added tradeoff (higher area-time vs. TMTO resistance),
as well as to continue to offer computation of classic scrypt as an
option in this source tree for now, there's a flag that requests the new
approach to parallelization.  Here's a comment I wrote on the flags:

 * Classic scrypt is available by setting flags to ESCRYPT_WORM [...]
[...] In this mode, the thread-local memory region (RAM)
 * is first sequentially written to and then randomly read from.  This
 * algorithm is friendly towards time-memory tradeoffs, available both to
 * defenders (albeit not in this implementation) and to attackers.
 *
 * Setting ESCRYPT_RW adds extra random reads and writes to the thread-local
 * memory region (RAM), which makes time-memory tradeoffs less efficient.
 * This may be used to slow down the kinds of attackers who would otherwise
 * benefit from classic scrypt's efficient time-memory tradeoff.
 *
 * ESCRYPT_PARALLEL_SMIX moves parallelism that is present with p > 1 to a
 * different level (as compared to where it is in classic scrypt), which
 * results in higher relative area-time cost for the attacker (given the same
 * running time for the defender) and changes the way the running time is to be
 * controlled from N*r*p (for classic scrypt) to N*r (in this modification).
 * In the ESCRYPT_WORM mode, there are no serious drawbacks to this setting
 * except that it breaks backwards compatibility with classic scrypt.  Thus, if
 * you want time-memory tradeoff friendliness (for possible defensive use at a
 * later time, e.g. on a small mobile device), but don't need compatibility
 * with classic scrypt, then set ESCRYPT_PARALLEL_SMIX.  In the ESCRYPT_RW
 * mode, which is meant to discourage time-memory tradeoffs, this new approach
 * to parallelization makes the time-memory tradeoffs less inefficient (this is
 * a very unfortunate side-effect of avoiding some random writes, as we have to
 * in order to allow for parallel threads to access a common memory region
 * without synchronization overhead).  Thus, in this mode this setting poses a
 * tradeoff of its own (higher area-time cost vs. better TMTO resistance).
 * All of this applies only when p > 1.  For p = 1, this setting is a no-op.

I'd appreciate any feedback.

Alexander

[1] The scrypt paper suggests that y=2 (in Conjecture 1 on page 12), but
given the attack on SMix's first loop (and similar) pointed out by
CodesInChaos and the similarity between SMix's first loop and BlockMix,
I think it may actually be 1.5.  This may be more intuitive if (for the
sake of this thought experiment) we assume large r, small N (e.g., N=2),
and an attacker's implementation that does not actually store V elements
(extreme TMTO).  Luckily, whether y is 2 or 1.5 or something else, it
remains the same for all revisions discussed above, so does not affect
any conclusions here.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.