|
Message-ID: <20121014020400.GA726@openwall.com> Date: Sun, 14 Oct 2012 06:04:00 +0400 From: Solar Designer <solar@...nwall.com> To: crypt-dev@...ts.openwall.com Subject: ROM or memory port hardness Colin, all - Here's an idea that I am going to explore further and I'd appreciate your input on. First, let me define the problem: The task is password hashing on servers dedicated to authentication (or maybe even to password hashing alone), at a large enough org to go for this approach anyway. At low durations (e.g. the 1 ms to 100 ms range as needed for user authentication at such orgs), we face two problems with scrypt: 1. Somewhat low memory usage: acceptable at 100 ms, way too low at 1 ms. 2. Poor scalability on multi-CPU (and multi-core) systems. Moreover, there's a trade-off between these two: we can improve memory usage slightly (up to ~2x) at the cost of even worse scalability, or we can improve scalability (no hard limit) at the cost of even lower memory usage. (We access this trade-off by hacking the Salsa20 rounds count in scrypt, so we deviate from the official scrypt when we do this.) Here's a possible solution (with its own drawbacks, though): Pre-fill a large amount of RAM (would be tens of gigabytes currently) with reproducible yet not too quick to recompute pseudo-random data. This may take e.g. a minute upon system bootup. There should be no trivial patterns like V[j] = f(V[j-1]), to defeat an efficient TMTO that we'd otherwise have. As far as further processing is concerned, we have a fairly large ROM filled with pseudo-random data. At password hash computation, use a loop similar to scrypt smix's second loop, up to a certain number of iterations (chosen for the desired duration, which can then be arbitrary). This loop will access a small percentage of the ROM array elements only, but as long as the set of elements it will need to access for a given {password, salt} is not predictable in advance (without going through the same sequential computation), the entire ROM needs to be quickly accessible in order for hash computation to complete quickly. This allows for parallel computation of hashes for many {password, salt} pairs with only one instance of the ROM. This is both good and bad. It is good for authentication servers: we fully use our RAM size for defense regardless of the current request rate (be it 1 or 1000 requests in a given second). We have perfect scalability to multiple CPU cores: we only need to increase the amount of processing between memory accesses to be such that we stay just below saturating the memory bandwidth when all CPU cores are in use. However, obviously an attacker has the same advantages, and more: a custom or otherwise more suitable hardware setup would have a larger number of ports to the same ROM size (and more cores too). So this is not exactly a memory-hard algorithm; it's more like a memory-port-hard one. While asymptotically this is less secure than scrypt, I think that at realistic parameter values for user authentication on current server hardware and given likely attackers' hardware (when we're talking password hashing and not KDF for data encryption), it is actually a lot more secure. scrypt at 100 ms is currently limited to using about 64 MB (either as two instances using 32 MB each, or one hacked instance using 64 MB due to lower Salsa20 rounds count). With the ROM approach above, we can make it e.g. 64 GB. Yes, it allows for parallelism, including for attackers, but if we consider ASICs, it'd take 1024 times more memory ports to get at the same area*time/candidate as scrypt's, and that's not counting area consumed by the memory ports themselves. OK, those don't have to be ports to the same large ROM - instead, separate smaller ROM banks may be used, assuming that bank conflicts will be fairly rare - but nevertheless the area*time/candidate estimate vs. scrypt's holds. An attack would be to use a smaller RAM holding only a portion of the large ROM at a time (e.g., one half of it). When we have a large number of candidate passwords to test, we may proceed with computation for each as long as we have the needed data in RAM. When we don't, computation for this candidate gets stalled (but others can proceed). Then we switch to the next ROM window (fill our RAM with the next subset of data, e.g. read from slower external storage) and proceed with previously-stalled candidates. And so on. This is a trade-off: our cost is in RAM needed to store the in-flight candidates (OK, they might be quickly recomputable) and per-candidate state (not recomputable without swapped-out portions of the ROM). A similar attack may work with a botnet, if typical nodes don't possess enough RAM to compute these hashes without trade-offs on their own. The ROM table may be split across several nodes, but then the trade-off is not only in extra RAM needed to store the per-candidate state, but also in need for a lot of communication between the nodes. Thus, I am suggesting to use this with ROM table sizes that are likely to exceed a typical botnet node's RAM size for a few years to come - e.g., currently this might be 48 GB to 256 GB RAM in a new server vs. up to 16 GB RAM in a typical end-user's computer. (Supporting ROM table sizes that are not powers of 2 might be desirable to get closer to the full RAM size actually installed in a server, yet leave a little bit of room for the OS/software/buffers/caches - e.g., use 46 out of 48 GB.) I also have thoughts on making use of CPU caches for defense: we could use L3 and L2 in scrypt-like manner, and L1 in bcrypt-like manner (take advantage of being able to do narrow lookups efficiently), along with the large ROM approach above (scrypt on L2 caches is too weak on its own, as we know from Litecoin, yet it may be a reasonable addition if we can have it for free). This will increase the cost for attackers who go for parallelism levels beyond what we use for defense. Ideally, to avoid excessive complexity, we'd come up with a way to use similar code for all 4 levels - e.g., merely accessing them at different iterations of the loop. This is tricky. If we fail to make this simple enough, we should use only the ROM or maybe 2 layers (ROM and one cache level). Any comments? Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.