john-dev - Re: tuning OMP_SCALE on MIC (was: Lei's weekly report #7)

Message-ID: <20150624015316.GB24971@openwall.com>
Date: Wed, 24 Jun 2015 04:53:16 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: tuning OMP_SCALE on MIC (was: Lei's weekly report #7)

On Tue, Jun 23, 2015 at 11:09:00PM +0800, Lei Zhang wrote:
> Below are the slowest formats (non-OpenMP c/s rate lower than 1M) among the 125 I found, and their c/s rates under different OMP_SCALEs (1, 2, 4, 8, 16):

Weird stuff.  Most of these are obscure (rarely used) formats, and many
of these are fast hashes, which are meant to perform orders of magnitude
faster once we have proper parallelization for them.

OTOH, some of these are slow, and it is surprising that OMP_SCALE makes
so much of a difference for them.

These are worth optimizing:

> tc_whirlpool    [545, 685, 1177, 680, 680]
> vtp             [464, 922, 1794, 3654, 7112]
> keyring         [26713, 34285, 42666, 46900, 47627]

Of these, tc_whirlpool and vtp show surprisingly low speeds, whereas
keyring isn't that bad.  (Compared to speeds seen on CPU.)

You need to look into why tc_* and vtp are so slow.  The primary issue
is probably not OMP_SCALE, but something else.

For keyring, there's probably something else as well.

It may be OK to tune OMP_SCALE first, though.  Even if temporarily.

> dynamic_1023    [577188, 673663, 650376, 672000, 551764]
> dahua           [21644, 42613, 75956, 136439, 212831]
> Panama          [1284096, 1146880, 1819648, 2035712, 1872896]
> skein-256       [1482752, 1631232, 2254848, 2530304, 2923520]
> skein-512       [1440768, 1711104, 1936384, 2379776, 3183616]
> HAVAL-256-3     [1345536, 1452032, 1657856, 2137088, 2437120]
> Tiger           [1288192, 1433600, 1645568, 1949696, 2394112]
> mdc2            [16717, 26401, 45405, 78721, 133338]
> Raw-Keccak-256  [1235968, 1449984, 1841152, 2293760, 3083264]
> HAVAL-128-4     [1486848, 1712128, 2068480, 2915328, 3301376]
> ripemd-128      [1531904, 1748992, 2014208, 2372608, 2898944]
> whirlpool       [988752, 1098752, 1144832, 1500160, 1599488]
> ripemd-160      [1400832, 808941, 1857536, 2558976, 3176448]
> Snefru-128      [934574, 951920, 1077248, 1479680, 1494016]
> Raw-Keccak      [1060864, 1133568, 1645568, 1911808, 1969152]
> has-160         [987089, 1037312, 1164288, 1376256, 1532928]
> Snefru-256      [918178, 1142784, 1081344, 1432576, 1611776]
> MD2             [547485, 612705, 735058, 872554, 933647]
> VNC             [3179520, 4987904, 6769664, 8011776, 8549376]
> MongoDB         [3656704, 5729280, 7598080, 9051136, 10071040]
> OpenVMS         [3147776, 4302848, 5317632, 5926912, 6306816]
> Raw-Blake2      [1141760, 1236992, 1360896, 1832960, 2039808]

Most of these are fast crap (even if they appear slow in this test).

Please sanity-check against speeds you obtain on CPU.  And no, I am not
asking you to post more data in here - I am merely suggesting what
checks to perform when you work on this.

> In addition, I filtered out formats that didn't show noticeable variation under different OMP_SCALEs, which I think are not worth tuning for the moment.

Maybe not worth tuning OMP_SCALE for, but certainly there were others
where some other(?) kind of tuning is needed.  IIRC, for example the
speeds for phpass were way lower than expected.

> Some of those left above might need an OMP_SCALE higher than 16 to achieve optimal performance.

Substantially benefiting from OMP_SCALE higher than 1 at low c/s rates
(like below 100k total at 240 threads) is unexpected.  This suggests
something else is wrong.
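
To illustrate why this is unexpected, here is a rough sketch in Python
that estimates how long one crypt_all() call takes.  It assumes the
batch per call is threads * OMP_SCALE keys, with one key per thread as
the base (the helper name and the keys_per_thread parameter are made up
for this example, and a format may use a different base):

```python
def seconds_per_call(total_cps, threads=240, omp_scale=1, keys_per_thread=1):
    """Estimate the duration of one crypt_all() call.

    Assumption: each call processes threads * omp_scale * keys_per_thread
    candidates, and total_cps is the aggregate c/s rate across all threads.
    """
    keys_per_call = threads * omp_scale * keys_per_thread
    return keys_per_call / total_cps


if __name__ == "__main__":
    # 100k c/s total at 240 threads, OMP_SCALE=1
    print("100k c/s:", seconds_per_call(100_000), "s per call")
    # vtp at OMP_SCALE=1, rate taken from the table above
    print("vtp:     ", seconds_per_call(464), "s per call")
```

Under these assumptions a call at 100k c/s already takes milliseconds,
and vtp at 464 c/s takes about half a second per call, so per-call
thread synchronization overhead should be small relative to the work.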

I don't mean to say we should lock OMP_SCALE=1 for them and not tune,
but I am saying that if you see e.g. vtp's improvement from 464 to 922
at OMP_SCALE 1 vs. 2, this suggests that something else is very wrong.
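
A quick way to spot this pattern is to look at the ratio between
consecutive OMP_SCALE steps.  The sketch below (Python, with the rates
copied from the quoted table and a made-up helper name) prints those
ratios for the three formats worth optimizing:

```python
# c/s rates at OMP_SCALE = 1, 2, 4, 8, 16, copied from the quoted table
rates = {
    "tc_whirlpool": [545, 685, 1177, 680, 680],
    "vtp": [464, 922, 1794, 3654, 7112],
    "keyring": [26713, 34285, 42666, 46900, 47627],
}


def step_ratios(cps):
    """Speedup from each OMP_SCALE value to the next (each step doubles it)."""
    return [round(b / a, 2) for a, b in zip(cps, cps[1:])]


if __name__ == "__main__":
    for name, cps in rates.items():
        print(f"{name:14s}", step_ratios(cps))
```

A ratio close to 2 at every step, as vtp shows, means the speed simply
tracks the batch size, whereas keyring's ratios flatten out towards 1.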

OK, I just took a look at vtp_fmt_plug.c.  It's a poor fit for MIC as
written, because it doesn't use SIMD and uses moderately large arrays
(with 4 threads/core, we only have 8 KB of L1 data cache per thread).
Maybe you'll optimize it later (SIMD'ing it will also benefit CPU a lot).
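
The cache arithmetic, as a small Python sketch.  The 32 KB per-core L1
data cache is what the 8 KB per thread figure implies at 4 threads/core.
The working-set sizes in the example are hypothetical and are not
measured from vtp_fmt_plug.c:

```python
L1D_PER_CORE = 32 * 1024   # bytes; implied by 8 KB/thread at 4 threads/core
THREADS_PER_CORE = 4


def l1_budget_per_thread(l1d=L1D_PER_CORE, threads=THREADS_PER_CORE):
    """L1 data cache share per hardware thread, in bytes."""
    return l1d // threads


def fits_in_l1(working_set_bytes, l1d=L1D_PER_CORE, threads=THREADS_PER_CORE):
    """True if a per-thread working set fits in that thread's L1 share."""
    return working_set_bytes <= l1_budget_per_thread(l1d, threads)


if __name__ == "__main__":
    print("budget per thread:", l1_budget_per_thread(), "bytes")
    # Hypothetical per-thread working sets, for illustration only
    for size in (4 * 1024, 16 * 1024):
        print(size, "bytes fits:", fits_in_l1(size))
```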

> Anyway, I think it's clear that some formats will perform way better on MIC (than the current state) with a tuned OMP_SCALE.

Yes, but to me the table above primarily suggests that there are other
problems.

Alexander