|
Message-ID: <20150509020039.GA27423@openwall.com> Date: Sat, 9 May 2015 05:00:39 +0300 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: Adding OpenMP support to SunMD5 On Sat, May 09, 2015 at 09:23:45AM +0800, Lei Zhang wrote: > There seems to be some compatibility issue among different compilers' implementation of OpenMP. Initially I used the following OpenMP clause, which works fine with icc on my laptop: > > #pragma omp parallel for default(none) private(idx) copyin(input_buf_big) \ > shared(saved_salt, data, constant_phrase, ngroups, group_sz) > > But when I experimented it on well, gcc failed to compile, saying that constant_phrase is already a constant, thus no need to be declared as shared. So I removed it, and then icc failed to compile... Magnum suggested I use the simplified form to avoid this issue: > > #pragma omp parallel for copyin(input_buf_big) That's not great. I prefer that we use default(none). > It works well both on my laptop and well. I don't know it fails on super before you pointing it out. I tried changing the OpenMP clause back to its lengthy form, and now it works on super: > > [lei@...er src]$ ../run/john --test --format=sunmd5 > Will run 32 OpenMP threads > Benchmarking: SunMD5 [MD5 128/128 AVX 4x3]... (32xOMP) DONE > Speed for cost 1 (iteration count) of 5000 > Raw: 5907 c/s real, 194 c/s virtual Note that this is only ~11x faster than single-thread speed. It should be ~15x+ faster. > The default gcc version on super is 4.4.7, and I'm using gcc-4.9.2 on well. I assume there's incompatibility even between different versions of gcc. Not necessarily. Passing self-tests isn't enough proof the code is actually reliable. It is possible that there are still issues, and they just appear on self-tests "randomly" with different versions of code on different machines. Anyway, you need to revise this code as I suggested in another message. Then test it for real (beyond self-test). > > and the single-thread performance is a bit lower than it was before. > > It was: > > > > [solar@...er run]$ ./john -te -form=sunmd5 > > Benchmarking: SunMD5 [MD5 128/128 AVX 4x3]... DONE > > Speed for cost 1 (iteration count) of 5000 > > Raw: 538 c/s real, 538 c/s virtual > > I'm not sure of the penalty introduced by a new outer loop, even if the loop is only iterated once in non-openmp mode. I can use macros to disable the outer loop in a non-openmp build if necessary. I haven't done so because the code looks cleaner the current way. I think copyin() might have a measurable performance cost. A little for one thread, and more for many threads. This may also be why the OpenMP scaling is worse than expected in the 32-thread benchmark. Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.