Message-ID: <20130419001910.J0YDS.37573.imail@eastrmwml206>
Date: Fri, 19 Apr 2013 0:19:10 -0400
From: <jfoug@....net>
To: john-dev@...ts.openwall.com
Subject: Re: Got all dyna formats (except $1$ and $apr1$) working with OMP

---- jfoug@....net wrote:
> ---- magnum <john.magnum@...hmail.com> wrote:
> This made 1 thread OMP work almost same speed as non-OMP, for 'some' dynas. However, in others, things were bad. 60%, 50% and even some slower than that (40% or so).
>
> I THINK this is due to unicode checking, calling omp_get_thread_num() within many of the string functions.

I am pretty sure the thread-safe unicode data was the bottleneck. There may be others still lurking; I will check.

Here is the new call within the OMP for loop:

    (*(curdat.dynamic_FUNCTIONS[i]))(j,top,omp_get_thread_num());

The 3rd param was added (to all primitives and some helper functions). I WILL need to do some #define magic for non-OMP builds, for some of the non-primitive helper functions (like the unicode getter and setter), but all in all, it should be pretty trivial.

Here are timings of dyna0 and dyna1.

*** Non OMP:
Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics 10x4x3]... DONE
Raw:	27730K c/s real, 27764K c/s virtual

Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics 10x4x3]... DONE
Many salts:	16422K c/s real, 16394K c/s virtual
Only one salt:	12244K c/s real, 12259K c/s virtual

*** OMP 1x, thread id as 3rd param
Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics 480x4x3]... DONE
Raw:	26282K c/s real, 26285K c/s virtual

Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics 480x4x3]... DONE
Many salts:	14510K c/s real, 14499K c/s virtual
Only one salt:	11237K c/s real, 11241K c/s virtual

*** OMP 1x, thread id being computed within unicode thread getter/setter
Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics 480x4x3]... DONE
Raw:	26135K c/s real, 26125K c/s virtual

Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics 480x4x3]... DONE
Many salts:	6952K c/s real, 6951K c/s virtual
Only one salt:	6066K c/s real, 6064K c/s virtual

In the 3rd param method, we are calling omp_get_thread_num() 4 times for every 5760 candidates (480x4x3 = 5760). For the one where the omp_get_thread_num() call was in the unicode getter/setter, omp_get_thread_num() was being called at least 11520 times for every 5760 candidates, i.e. at least twice per candidate!

That could be GREATLY reduced (basically a loop-invariant code motion). But using the 2nd method (newest), the getter/setter is simply an inline function indexing into an array, so a smart compiler will actually do the loop-invariant motion for us.

Thanks for pointing out the problem. I may be able to use these hints to reduce other overhead.

Jim.
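
For readers following along, the difference between the two OMP variants timed above can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual dynamic format code: `unicode_flag`, `lookup_tid`, `get_flag_lookup`, `get_flag_tid`, `primitive_lookup`, `primitive_tid`, `run_block_per_call`, `run_block_hoisted` and `MAX_THREADS` are all hypothetical names, and the per-thread array merely stands in for the thread-safe unicode data. The stub in the `#else` branch is one possible shape for the non-OMP "#define magic" mentioned in the message. The call counts it produces are illustrative and are not meant to reproduce the 11520 figure.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Non-OMP build: assume a single thread with id 0 (hypothetical stand-in). */
static int omp_get_thread_num(void) { return 0; }
#endif

#define MAX_THREADS 64 /* hypothetical upper bound on thread count */

/* Hypothetical per-thread state standing in for the unicode data. */
static int unicode_flag[MAX_THREADS];

/* Wrapper that counts how often the thread id is looked up. */
static int lookup_tid(unsigned long *calls)
{
    ++*calls;
    return omp_get_thread_num();
}

/* Variant A: the getter looks up the thread id itself on every call. */
static int get_flag_lookup(unsigned long *calls)
{
    int tid = lookup_tid(calls);
    assert(tid >= 0 && tid < MAX_THREADS);
    return unicode_flag[tid];
}

/* Variant B: the caller passes the thread id; the getter is an array read. */
static inline int get_flag_tid(int tid)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    return unicode_flag[tid];
}

/* A primitive over candidates [first, last) using variant A. */
static long primitive_lookup(int first, int last, unsigned long *calls)
{
    long sum = 0;
    for (int i = first; i < last; i++)
        sum += get_flag_lookup(calls) + i;
    return sum;
}

/* The same primitive with the thread id as an extra (3rd) parameter. */
static long primitive_tid(int first, int last, int tid)
{
    long sum = 0;
    for (int i = first; i < last; i++)
        sum += get_flag_tid(tid) + i; /* tid is invariant inside the loop */
    return sum;
}

/* Run n_prims primitives over one block; return the number of id lookups. */
static unsigned long run_block_per_call(int count, int n_prims, long *out)
{
    unsigned long calls = 0;
    long sum = 0;
    for (int p = 0; p < n_prims; p++)
        sum += primitive_lookup(0, count, &calls);
    *out = sum;
    return calls;
}

/* Same work, but the id is looked up once per primitive at the call site. */
static unsigned long run_block_hoisted(int count, int n_prims, long *out)
{
    unsigned long calls = 0;
    long sum = 0;
    for (int p = 0; p < n_prims; p++)
        sum += primitive_tid(0, count, lookup_tid(&calls));
    *out = sum;
    return calls;
}
```

Calling `run_block_hoisted(5760, 4, &out)` performs one lookup per primitive, while `run_block_per_call(5760, 4, &out)` performs one lookup per candidate per primitive; both compute the same sum. In variant B the getter has no function call left in it, which is what gives the compiler a chance to hoist the array read out of the loop.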