Message-ID: <20150427012043.GA27103@openwall.com>
Date: Mon, 27 Apr 2015 04:20:43 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

On Sun, Apr 26, 2015 at 09:37:10PM +0200, Agnieszka Bielec wrote:
> 2015-04-25 13:39 GMT+02:00 Solar Designer <solar@...nwall.com>:
> >
> > A major task that you haven't approached yet is instruction interleaving
> > on the CPUs.  Do you understand this concept?  Including why it helps?
>
> I doubt about that interleaving can help in pomelo.

It might, or it might not.  We should try.  Then re-test on future CPUs.

> Maybe I missed
> something. Interleaving can make possibility to make SIMD when it's
> not possible in one function execution.
> actually I think that this SIMD is good. Maybe you want to speed up
> the RAM access or maybe something else?

By interleaving, we mean primarily mixing of instructions from multiple
instances.  Not SIMD.

I understand what you mean by saying that interleaving 2+ hash
computations might enable use of SIMD, and we're doing that too (e.g.,
we need 8 parallel MD5's to fill a 256-bit AVX2 vector), but that's not
what we refer to when we say "interleaving".  We're also using
interleaving on top of SIMD (so e.g. 16 or more parallel MD5's per
thread is likely optimal on AVX2, not just 8).

Please do take a look at and play with different versions of
php_mt_seed.  It uses both SIMD and interleaving at once.  If you modify
it to only use SIMD, and not interleaving, it'd become much slower.  You
need to understand why.

What's your understanding as to why interleaving might help, beyond
SIMD?

As to POMELO's SIMD being good, yes, it appears to be good for up to
256-bit.  For 512-bit, such as on MIC and AVX-512, we'd need to
experiment.  It might be best just to waste the upper 256 bits, or we
might use those too (run two instances in the wider SIMD vectors
side-by-side).
In fact, something like this happens on GPUs too, but this detail is
hidden from you by the OpenCL "driver's" auto-vectorization.

I think POMELO's performance significantly depends on the device's
efficiency at gather loads, of 256-bit quantities in this case, and on
how those are implemented in code (e.g., using native gather load
instructions, although those typically support up to 64-bit vector
elements only, so might be wasteful, or with explicit loads/shifts of
the 256-bit portions).

Once again, interleaving is a separate thing, on top of SIMD, although
it will need to be tuned along with SIMD (what interleaving is optimal
may vary depending on how we use SIMD, and vice versa).

Alexander
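
To illustrate the distinction drawn above between interleaving and SIMD,
here is a minimal scalar sketch in C.  It is not code from John the
Ripper, php_mt_seed, or POMELO: toy_round, toy_hash, toy_hash_x2, and
toy_selfcheck are hypothetical names, and the mixing function is made up
for illustration.  The assumption behind it is the usual one: each round
depends on the previous round's output, so a single instance forms one
long dependency chain, and mixing in instructions from a second,
independent instance gives the CPU other work to issue while the first
chain waits on instruction latencies.  Whether this actually helps for a
given hash and CPU has to be measured.

```c
#include <assert.h>
#include <stdint.h>

/* One round of a toy mixing function (hypothetical; not POMELO or MD5).
 * The output of each round feeds the next, so rounds of one instance
 * cannot overlap with each other. */
static inline uint32_t toy_round(uint32_t x, uint32_t k)
{
    x += k;
    x ^= x >> 15;
    x *= 0x2c1b3c6dU;
    x ^= x >> 12;
    return x;
}

/* Reference version: one instance, one dependency chain. */
uint32_t toy_hash(uint32_t seed, int rounds)
{
    uint32_t x = seed;
    for (int i = 0; i < rounds; i++)
        x = toy_round(x, (uint32_t)i);
    return x;
}

/* Interleaved version: two independent instances advance in the same
 * loop body.  No SIMD is involved; 'a' and 'b' are ordinary scalars.
 * The two chains do not depend on each other, so their instructions
 * may execute in an overlapped fashion. */
void toy_hash_x2(const uint32_t seed[2], int rounds, uint32_t out[2])
{
    uint32_t a = seed[0], b = seed[1];
    for (int i = 0; i < rounds; i++) {
        a = toy_round(a, (uint32_t)i);
        b = toy_round(b, (uint32_t)i);
    }
    out[0] = a;
    out[1] = b;
}

/* Sanity check: interleaving must not change the results. */
void toy_selfcheck(void)
{
    const uint32_t seed[2] = { 12345U, 67890U };
    uint32_t out[2];

    toy_hash_x2(seed, 1000, out);
    assert(out[0] == toy_hash(seed[0], 1000));
    assert(out[1] == toy_hash(seed[1], 1000));
}
```

The same idea carries over to SIMD code by replacing each scalar with a
vector, so that every interleaved "instance" is itself a full vector of
parallel computations.  The interleave factor (2 here) is a tuning
parameter, and the best value may change with the SIMD width in use.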