Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <AANLkTinLCvii4dsAj3iNSsJLXx292T95qW1FHCOZJK1f@mail.gmail.com>
Date: Tue, 1 Feb 2011 10:54:17 +0200
From: Milen Rangelov <gat3way@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: FreeBSD crypt() / MD5-crypt implementation question

Hello,

The "andnot" optimization won't work on GPUs as they don't have a PANDN-like
instruction in their ISA - neither AMD, nor Nvidia. I've been playing around
with this and found out the xor versions to be faster. Speaking about
optimizations, on 5xxx ATIs there is a single "leftrotate" instruction
(bitalign) that is 3 times faster than the "two-shifts, one or" version.
Newer Nvidia hardware has fused SHL+ADD instruction which also helps, but
not as much as bitalign on ATI.

AMD ISA has BFI_INT which is a single instruction that does practically the
exact same thing as the F and G transformations. Unfortunately, AMD did not
expose BFI_INT to IL and it's rather useless at present. That would give a
nice 15% speed boost for MD5 (perhaps even more if hash reversal is
performed, which unfortunately is not possible for md5crypt).

On Tue, Feb 1, 2011 at 12:41 AM, Freddie Witherden <freddie@...herden.org>wrote:

> On 31/01/11 22:04, Simon Marechal wrote:
> > Le 31/01/2011 16:32, Freddie Witherden a écrit :
> >> implementation is not ideal (I am reasonably confident that you can get
> >> 20% out of it without too much work) it does perform better than the
> >
> > That would be quite a nice improvement. The MD5 body function is pretty
> > tight (performs like barswf, and I know nothing that is faster), but the
> > "dispatch" part is far from optimal. According to my last profiling it
> > only accounted for 15% of the processing time. I'm not sure where the
> > low hanging fruits are in this patch ...
>
> Interesting, as I've found it to perform quite a bit worse than BarsWF.
>  In the MD5 code I've been poking about with (which is--essentially--the
> same as that in JtR) around ~4% can be gained from 0 input optimisation
> for inputs less than 224-bits.  This is similar to what BarsWF does,
> although far less aggressive, given that MD5 crypt strings tend to be
> longer.
>
> I've also had some luck changing the definitions of the F and G
> functions from their xor forms to their andnot forms.  (Although this is
> quite likely to be dependant on the individual instruction latencies.)
>
> While I have not really toyed with the dispatch function I gleefully
> assumed after a cursory glance that with a bit of TLC a good few percent
> could be squeezed out of it.  Perhaps I was being a bit too optimistic.
>
> Polemically yours, Freddie.
>
>

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.