john-dev - Re: DES BS + OMP improvements, MPI direction

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CADkMHCnG9nnMcp7O29ff8CQeiNKq7x6Gg4=zPL0hcgvK1s6rzQ@mail.gmail.com>
Date: Mon, 19 Dec 2011 08:48:34 -0700
From: RB <aoz.syn@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: DES BS + OMP improvements, MPI direction

2011/12/19 Solar Designer <solar@...nwall.com>:
> On Sun, Dec 18, 2011 at 01:29:02AM -0700, RB wrote:
>> On Sat, Dec 17, 2011 at 15:46, Solar Designer <solar@...nwall.com> wrote:
>> > Are you suggesting that we'd apply MPI at the same low level where we
>> > currently apply OpenMP?  No, that doesn't sound like a good idea to me,
>> > except maybe for the slowest hash/cipher types only (BF, MSCash2, RAR,
>> > crypt when run against SHA-crypt), where it might be acceptable.
>>
>> Yes, it is what I'm suggesting.  Obviously anything that completes in
>> less than the RTT of the MPI environment would need to be handled
>> differently (if at all), but the problem the MPI implementation has
>> historically had is that it splits work at too high a level.  Last I
>> checked it was splitting work based on individual work steps - each
>> thread took a given length x count and worked it by itself on that.
>> For very fast hashes (like LM) I'm sure that's more acceptable, but
>> for slower ones and with larger MPI implementations (hundreds to
>> thousands of threads), the network very quickly gets to the point that
>> every thread is working on an effectively impossible-to-solve step,
>> which really isn't the end-user's intent of parallelization.
>
> I think you don't literally mean "every" above (as that would imply that
> all easier "steps" were already completed, so everything is as desired),
> but I get what you're referring to.

For posterity's completeness, I meant that for a given run of X
seconds, no given step/pass/stage (let's call it a 'set' unless you
desire otherwise) that takes a single processor longer than X seconds
to solve will finish, no matter the number of processor resources.
This can be exacerbated in MPI environments with asymmetric processors
(or even cooling, in the case of clock-boosting processors).

It is often the case with MPI runs that, when aborted, many (if not
all) of the processors have been working for a considerable length of
time on a single set that, given the overall processing power
available, "could" have been completed had all resources been focused
there.  It would be fascinating (but programmatically challenging) to
have the implementation automatically switch approaches (or reduce
working unit size) as timing for individual sets clears a given
threshold.  There will definitely have to be a balance, which history
has shown you're certainly capable of finding and striking.

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.