Message-ID: <20121008210624.GA10754@openwall.com>
Date: Tue, 9 Oct 2012 01:06:24 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Password hashing at scale (for Internet companies with millions of users) - YaC 2012 slides

On Mon, Oct 08, 2012 at 11:54:16PM +0530, Sayantan Datta wrote:
> If we are doing bcrypt on xeon phi , then in order to utilize the 512 bit
> wide SIMDs , I think we must mix at least 16 bcrypt hash per core at
> instruction level.

Yes, but we might have to limit the usable number of vector elements to
8 (thus 256-bit) in order to fit in 32 KB of L1 data cache. With full
512-bit vectors, we'll be accessing L2 cache for one half of the vector
elements, which will probably be very slow (entire cache lines will be
getting transferred for each of those L1-missing 32-bit accesses).

> However for GCN GPUs we usually don't have to worry
> about instruction level parallelism (only for GCN architecture, VLIW4 could
> benefit from ILP) because by definition the kernels follow SIMD
> execution.

We do need ILP beyond what's included in one SIMD instruction and VLIW
bundle, especially on GPU. This is because of pipelining and thus high
instruction latencies. In fact, from the bcrypt speed numbers on HD 7970
that we've obtained so far, the latencies appear to be on the order of
10 clock cycles (this would be very high for a CPU). Unfortunately, the
limited local memory size does not permit us to mix in more instructions
to hide these latencies - hence the poor performance.

> Doesn't this make programming on xeon phi harder? In my
> opinion a GCN GPU with gather-scatter load/store should be the best for the
> programmers.

I think you're wrong. Sure, programming in OpenCL might be easier than
with explicit intrinsics, especially if in the latter case you have to
explicitly mix multiple instances to provide the ILP - but that's not
inherent to the hardware devices. We might have OpenCL for Xeon Phi
eventually, too.
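
As a rough check of the cache arithmetic above, here is a minimal Python sketch (an illustration, not a measurement). It assumes one bcrypt instance per 32-bit vector element and counts only the four Blowfish S-boxes (4 x 256 entries of 32 bits, that is 4 KB per instance). The P-array and any other working data are ignored, and the function names are made up for this example.

```python
# Sketch: L1 data cache footprint of bcrypt instances interleaved one per
# 32-bit vector element. Only the Blowfish S-boxes are counted (a
# simplification); the P-array and other working data are ignored.

SBOXES_PER_INSTANCE = 4     # Blowfish uses four S-boxes
ENTRIES_PER_SBOX = 256      # 256 entries each
BYTES_PER_ENTRY = 4         # 32-bit entries
SBOX_BYTES = SBOXES_PER_INSTANCE * ENTRIES_PER_SBOX * BYTES_PER_ENTRY  # 4096

L1D_BYTES = 32 * 1024       # 32 KB L1 data cache, the figure used above


def sbox_footprint(instances):
    """Total S-box bytes for the given number of interleaved instances."""
    return instances * SBOX_BYTES


def fits_in_l1(vector_bits, l1_bytes=L1D_BYTES):
    """True if one instance per 32-bit element still fits in L1 data cache."""
    elements = vector_bits // 32
    return sbox_footprint(elements) <= l1_bytes


if __name__ == "__main__":
    for vector_bits in (256, 512):
        elements = vector_bits // 32
        print(vector_bits, "bits:", elements, "instances,",
              sbox_footprint(elements), "bytes, fits in L1:",
              fits_in_l1(vector_bits))
```

Under these assumptions, 8 instances need exactly 32 KB of S-boxes, leaving no room to spare, while 16 instances need 64 KB, so half of the S-box data would have to be served from L2.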

As to the performance we obtain, as you're aware, HD 7970 achieves only
general-purpose-CPU-like speeds at bcrypt. Xeon Phi is likely to allow
for much greater performance, as I expect its L1 data cache access
latencies to be CPU-like (~1 cycle), not GPU-like (~10 cycles). I could
be wrong about the latencies, though.

Alexander