Message-ID: <20150707131648.GA8487@openwall.com>
Date: Tue, 7 Jul 2015 16:16:48 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: extend SIMD intrinsics

Alain,

Your reply is slightly out of context. I guess we confused you by discussing several topics at once. Earlier, we discussed use of load/store intrinsics vs. simple assignments (or direct use of in-memory SIMD operands in expressions with other intrinsics). However, in the message you replied to, and in the piece of it you quoted, we were discussing different variants of the "simple assignments" approach, which may differ in how they relate to C strict aliasing rules and to compiler optimizations unrelated to what you mention.

That said, your comment is useful anyway, and I'll respond to it further:

On Mon, Jul 06, 2015 at 11:15:41PM -0400, Alain Espinosa wrote:
> In Visual C the difference of a simple assignment and a vload is that for the assignment the compiler generate an unaligned SIMD load instruction, and for vload it generates an aligned SIMD load with the usual restriction: if this memory access isn't aligned the required byte amount an exception is raised. In general the performance difference is negligible, if any.

I saw similar behavior with recent gcc, but it wasn't as simple as you describe. It turned out that recent gcc started generating unaligned SIMD load instructions when it didn't have a reliable way to see that the access is aligned. This meant that we should make the alignment transparent to gcc, that is, avoid going via opaque pointers (which, as discussed elsewhere in this thread, also tends to violate strict aliasing rules). When I corrected my code (bitslice DES code in JtR) to make the alignment apparent to gcc, it stopped generating the unaligned load instructions and generated the aligned ones instead. I suspect Visual C might be similar.

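
To illustrate what "making the alignment apparent" can look like, here is a minimal sketch, assuming GCC or Clang and their vector extension. The type `vtype`, the array `keys`, and all function names are hypothetical and not taken from the JtR source. Which load instruction the compiler actually emits depends on compiler version, target, and flags, so the generated assembly (e.g. from `gcc -O2 -S`) should still be reviewed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit vector type (GCC/Clang vector extension).
 * Its natural alignment is 16 bytes, and that is part of the type. */
typedef uint32_t vtype __attribute__((vector_size(16)));

/* Declared with the vector type, so the compiler knows it is aligned. */
vtype keys[4];

/* Alignment hidden: the data arrives via an opaque pointer, so the
 * compiler has to assume it may be unaligned. memcpy() is used here
 * to stay within strict aliasing rules. */
vtype xor_opaque(const void *p, vtype x)
{
    vtype v;
    memcpy(&v, p, sizeof(v));
    return v ^ x;
}

/* Alignment visible: a plain typed access through a vtype pointer. */
vtype xor_typed(const vtype *p, vtype x)
{
    return *p ^ x;
}

/* Alignment promised explicitly. This is only valid if p really is
 * 16-byte aligned and really points to a vtype object. */
vtype xor_assumed(const void *p, vtype x)
{
    const vtype *q = __builtin_assume_aligned(p, 16);
    return *q ^ x;
}

/* All three routes compute the same value; only the alignment
 * information available to the compiler differs.
 * Returns 1 if the results agree, 0 otherwise. */
int alignment_demo(void)
{
    const vtype mask = { 1, 2, 3, 4 };
    keys[0] = (vtype){ 0x10, 0x20, 0x30, 0x40 };

    vtype a = xor_opaque(keys, mask);
    vtype b = xor_typed(keys, mask);
    vtype c = xor_assumed(keys, mask);

    return memcmp(&a, &b, sizeof(a)) == 0 &&
           memcmp(&b, &c, sizeof(b)) == 0;
}
```

Comparing the assembly for `xor_opaque` and `xor_typed` on an x86 target is one way to check whether a given compiler picks the unaligned or the aligned load form in each case.
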
As to the performance difference being negligible or non-existent, this is true on recent Intel CPUs, but not true on older ones. In particular, I saw performance impact for bitslice DES on the order of 20% on Xeon E5420 (Core 2'ish), caused solely by the unaligned load instructions. Simply replacing those instructions (via sed applied to gcc-generated assembly) with their aligned counterparts regained that performance loss. Ditto correcting the source code to make the alignments apparent to gcc.

So there is in fact an issue for us to keep in mind here: if we avoid the load/store intrinsics, we have to make sure the compiler is aware of the alignment through other means, and we should review the generated code to make sure it uses aligned loads/stores. Well, or we may use the intrinsics.

Thank you for reminding me about this issue!

Alexander