john-users - Re: bitslice DES parallelization with OpenMP

Message-ID: <20100630194632.GA16803@openwall.com>
Date: Wed, 30 Jun 2010 23:46:32 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: bitslice DES parallelization with OpenMP

On Tue, Jun 29, 2010 at 10:12:15PM -0500, Joshua J. Drake wrote:
> There seems to be a huge slow down when I try with a few cores under
> high load.. I was actually getting worse performance than running a
> single instance by itself.

Confirmed, and this is not limited to the DES code, nor to JtR.  This
turns out to be a known problem with libgomp:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

The fixes committed with reference to this bug do not actually fix the
problem (they fix another related problem), as correctly pointed out by
the original reporters.

Setting any one of the following environment variables, or a
combination of them, is a workaround that brings performance back to a
reasonable level (not great, just "reasonable") even when other
programs are keeping some CPUs busy:

GOMP_SPINCOUNT=10000	# When waiting, don't spin for too long
GOMP_CPU_AFFINITY=0-99	# Forcibly bind threads to CPUs sequentially
OMP_WAIT_POLICY=PASSIVE	# Avoid spinning
OMP_NUM_THREADS=7	# On an 8-core system, voluntarily only use 7 threads

Which one of these (or which combination) produces best results varies
across systems, kinds of other load, and JtR invocation.
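
A minimal sketch of how the settings above could be compared on a given
machine, assuming a POSIX shell.  try_workarounds is a hypothetical
helper name, not part of JtR or libgomp.  It runs the given command once
with no override and then once per variable.  Capping threads at one
less than the online CPU count generalizes the "7 threads on an 8-core
system" example and is an assumption.

```shell
#!/bin/sh
# try_workarounds CMD [ARGS...]  (hypothetical helper)
# Runs CMD once as a baseline, then once per libgomp workaround.
try_workarounds() {
    # Assumption: leave one CPU free, as in the 8-core / 7-thread example.
    ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)
    if [ "$ncpu" -gt 1 ]; then
        nthreads=$(( $ncpu - 1 ))
    else
        nthreads=1
    fi

    echo "== baseline (no override) =="
    "$@"

    for setting in \
        "GOMP_SPINCOUNT=10000" \
        "GOMP_CPU_AFFINITY=0-99" \
        "OMP_WAIT_POLICY=PASSIVE" \
        "OMP_NUM_THREADS=$nthreads"
    do
        echo "== $setting =="
        # env sets the variable for this one run only
        env "$setting" "$@"
    done
}
```

For example, "try_workarounds ./john -te=1 -fo=des" would repeat the
benchmark used in this message under each setting.  Combinations of
variables are not covered by this sketch.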

Overall, OpenMP behaves poorly when there's other load on the system.
I'll keep trying to make the code less sensitive to other load, but I
expect that at some later point I'll need to introduce another
parallelization approach.

Curiously, setting GOMP_SPINCOUNT=10000 on a system with SMT also
significantly improves speed for the single-salt case even with almost
no other load.  On the Core i7 system (the same one I posted benchmarks
for previously), I am getting:

host!solar:~/john/john-1.7.6-omp-des/run$ GOMP_SPINCOUNT=10000 ./john -te=1 -fo=des
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     10174K c/s real, 1270K c/s virtual
Only one salt:  6045K c/s real, 1149K c/s virtual

The old benchmark was (without GOMP_SPINCOUNT override):

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     10174K c/s real, 1267K c/s virtual
Only one salt:  4841K c/s real, 602923 c/s virtual

Both results are currently reproducible across multiple runs.

I think the speedup with lower GOMP_SPINCOUNT is attributable to
Core i7's SMT (two threads per core), where having a waiting thread
yield the CPU actually frees CPU resources up for use by other threads.
This is confirmed by OMP_WAIT_POLICY=PASSIVE:

host!solar:~/john/john-1.7.6-omp-des/run$ OMP_WAIT_POLICY=PASSIVE ./john -te=1 -fo=des
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     7962K c/s real, 1731K c/s virtual
Only one salt:  5182K c/s real, 1279K c/s virtual

Notice how the "c/s virtual" of 1279K is similar to that for the
GOMP_SPINCOUNT=10000 run.  This must be what it becomes when other
threads are not busy-waiting.  However, only GOMP_SPINCOUNT=10000
provides an overall speedup: with OMP_WAIT_POLICY=PASSIVE, the
many-salts speed drops from 10174K to 7962K c/s real.  So going for
passive waits is not always optimal.
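
To put the three benchmarks side by side, the relative changes in
"c/s real" can be worked out as below.  This is only arithmetic on the
figures quoted in this message, with the run without any override taken
as the baseline; pct_change is a hypothetical helper name.

```shell
#!/bin/sh
# pct_change NEW OLD: percentage change of NEW relative to OLD
pct_change() {
    awk -v new="$1" -v old="$2" \
        'BEGIN { printf "%+.1f%%\n", (new / old - 1) * 100 }'
}

# "c/s real" figures in thousands, copied from the benchmarks above
echo "GOMP_SPINCOUNT=10000,    many salts: $(pct_change 10174 10174)"
echo "GOMP_SPINCOUNT=10000,    one salt:   $(pct_change 6045 4841)"
echo "OMP_WAIT_POLICY=PASSIVE, many salts: $(pct_change 7962 10174)"
echo "OMP_WAIT_POLICY=PASSIVE, one salt:   $(pct_change 5182 4841)"
```

The lower spin count leaves the many-salts speed unchanged and raises
the single-salt speed by about a quarter, whereas passive waiting gains
only a few percent for a single salt and loses roughly a fifth with
many salts.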

Thank you for testing and for your feedback!

Alexander