Message-ID: <CA+TsHUCB7mua=pPF9STSC_TVX5hh_6SoHhYd34xD8QLZnPXzTQ@mail.gmail.com>
Date: Tue, 10 Jul 2012 11:16:17 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl
On Tue, Jul 10, 2012 at 9:49 AM, Solar Designer <solar@...nwall.com> wrote:
> On Tue, Jul 10, 2012 at 09:23:11AM +0530, Sayantan Datta wrote:
> > On Tue, Jul 10, 2012 at 8:16 AM, Solar Designer <solar@...nwall.com>
> wrote:
> >
> > > Shouldn't we expect more like a 50% improvement, based on the speeds
> for
> > > the implementation using global memory that you had before? Compared
> to
> > > your LDS-using implementation, we're adding uses of computing and
> memory
> > > resources that would otherwise be completely idle.
> >
> > We are still not capable of utilizing 100% of the hardware.
>
> Of course not. But I don't see what prevents us from achieving the
> combined speed of your global-memory-using and your LDS-using
> implementations. Yes, the former tried to use all SIMDs (if I
> understand correctly), even though it kept them stalled waiting for data
> most of the time, but can't we achieve roughly the same speed with fewer
> SIMDs (such as with just two per CU, which we're not using for LDS),
> since the task is memory speed bound anyway? I think we'll saturate the
> 384-bit bus even with just two SIMDs per CU, or even with just one per
> CU, for that matter (so we may save some electricity and heat
> dissipation by leaving one SIMD per CU completely unused).
>
> Am I missing something?
>
> Alexander
>
I remember that during actual cracking the speed was limited to somewhere
near 1000 c/s with the kernel using global memory, although benchmarking
suggested a much higher 2400 c/s. This suggests that we were incurring stalls
during actual cracking that we weren't during benchmarking. I think this
is the limit of what we can achieve using global memory.
Also, I could achieve nearly the same numbers using global memory alone
despite heavily under-utilizing the CUs. I limited the global number of work
items to 512 and the work group size to 8, which produced 1019 c/s in actual
cracking.
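For reference, limiting the launch like that is just a matter of the NDRange
sizes passed on the host side. A minimal sketch (variable and kernel names
here are illustrative, not necessarily what the format code uses):

    size_t global_work_size = 512;  /* total work-items          */
    size_t local_work_size  = 8;    /* work-items per work group */

    /* enqueue the bcrypt kernel with the reduced geometry */
    cl_int err = clEnqueueNDRangeKernel(queue, bf_kernel, 1, NULL,
                                        &global_work_size, &local_work_size,
                                        0, NULL, NULL);

That is only 64 work groups of 8 work-items each, so most of each CU sits
idle, yet it still produced about the same 1019 c/s.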
This puts my revised value of x at 4, not 8. So we will see up to 25%
extra using global memory.
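To spell out the arithmetic behind that estimate (taking x as the ratio of
the LDS-only speed to the global-memory-only speed): the otherwise-idle
global-memory path can add at most about 1/x of the LDS-only speed, so x = 4
means up to 25% extra, whereas x = 8 would have meant only about 12.5%.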
One more thing I would like you to know: your Sptr implementation performs
nearly the same as before on NVIDIA after a 4x unroll of the 512-iteration
loop.
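In case it helps to see what I mean by the unroll, here is a rough sketch
(ROUND() stands in for the actual loop body in bf_kernel.cl, so this is
illustrative rather than the exact code):

    /* 4x manual unroll of the 512-iteration loop */
    for (int i = 0; i < 512; i += 4) {
        ROUND(i);
        ROUND(i + 1);
        ROUND(i + 2);
        ROUND(i + 3);
    }

    /* or, where the compiler supports it, let it unroll: */
    #pragma unroll 4
    for (int i = 0; i < 512; i++)
        ROUND(i);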
Regards,
Sayantan