Message-ID: <CA+TsHUC373WXZ8ubaaRRBp85-EDNPOCaD_AtfgCw9p2TgA_dgg@mail.gmail.com>
Date: Mon, 9 Jul 2012 13:02:12 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl (was: Sayantan:Weekly Report #11)
On Mon, Jul 9, 2012 at 7:37 AM, Solar Designer <solar@...nwall.com> wrote:
> On Thu, Jul 05, 2012 at 12:11:14PM +0530, SAYANTAN DATTA wrote:
> > Therefore
> > your calculation of the utilization of one out of four SIMD units per CU
> > on 7970 is valid for the current kernel.
>
> Do you know if we currently utilize one of four SIMD units or maybe
> 1/4th of each SIMD unit (vector width) or another combination (e.g.,
> maybe 8 out of 16 vector elements in two SIMD units, leaving the other
> two completely idle)?
>
One wavefront or work-group (whichever is smaller) is scheduled on one SIMD
unit. So we are using two SIMD units, and only half of the processing
elements on each of them.
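
As a rough check of that arithmetic, here is a minimal sketch in plain C. It uses only the figures mentioned in this thread (four SIMD units per CU, 16 vector elements per SIMD unit, work-groups of 8, two work-groups resident per CU); the function names are made up for illustration.

```c
#include <assert.h>

/* Figures as quoted in this thread for the 7970. */
enum {
    SIMDS_PER_CU   = 4,   /* SIMD units per compute unit */
    LANES_PER_SIMD = 16   /* vector elements (processing elements) per SIMD */
};

/* Lanes doing useful work when each resident work-group sits on its own
 * SIMD unit (hypothetical helper, not part of bf_kernel.cl). */
static int busy_lanes(int groups_per_cu, int work_group_size)
{
    assert(groups_per_cu <= SIMDS_PER_CU);
    assert(work_group_size <= LANES_PER_SIMD);
    return groups_per_cu * work_group_size;
}

/* All lanes available in one compute unit. */
static int total_lanes(void)
{
    return SIMDS_PER_CU * LANES_PER_SIMD;
}
```

With two work-groups of 8, busy_lanes(2, 8) is 16 out of total_lanes() = 64, which is the same one-quarter utilization as "one out of four SIMD units", just spread over two half-filled SIMD units.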
> > I once tried running two john builds together on the 7970, one using LDS
> > and the other using global memory, but it caused an ASIC hang. Maybe we
> > need to merge them into one build, but how to merge them remains a
> > question. One possible way is to enqueue two kernels per crypt (two
> > clEnqueueNDRangeKernel calls); the other is to merge the two kernels.
> > Also, how the two branches get scheduled on the GPU will impact
> > performance.
>
> I'd try setting WORK_GROUP_SIZE to 12, keep the declaration of S_Buffer
> at its current size - introduce some new macro for this, like
> LDS_GROUP_SIZE, which we'd keep at 8 for 7970. If lid is <
> LDS_GROUP_SIZE, then use the current code. If lid is >= LDS_GROUP_SIZE
> (would be 8, 9, 10, or 11 under this example), then use new code that
> would use global memory instead (just modify the supplied BF_current_S
> directly?)
>
I'll try this first.
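
A minimal host-side sketch of that split, written in plain C rather than OpenCL C so it can be compiled and checked on its own. WORK_GROUP_SIZE, LDS_GROUP_SIZE, S_Buffer and BF_current_S are the names from the suggestion above; the helper names, BF_S_WORDS, and the flat per-work-item layout (one contiguous 1024-word S-box set per work-item) are assumptions for illustration and may not match the real kernel's indexing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORK_GROUP_SIZE 12    /* work-items per work-group, as proposed */
#define LDS_GROUP_SIZE  8     /* work-items whose S-boxes fit in LDS */
#define BF_S_WORDS      1024  /* 4 S-boxes x 256 32-bit words (assumed layout) */

typedef enum { S_IN_LDS, S_IN_GLOBAL } s_location;

/* Decide where a work-item keeps its S-boxes, based on its local id. */
static s_location s_location_for(unsigned lid)
{
    assert(lid < WORK_GROUP_SIZE);
    return lid < LDS_GROUP_SIZE ? S_IN_LDS : S_IN_GLOBAL;
}

/* Base of this work-item's S-boxes. In an OpenCL 1.x kernel the two
 * branches would be separate code paths, since local and global pointers
 * cannot share one pointer type; plain C has no such restriction. */
static uint32_t *s_base(unsigned lid, size_t gid,
                        uint32_t *S_Buffer, uint32_t *BF_current_S)
{
    if (s_location_for(lid) == S_IN_LDS)
        return S_Buffer + (size_t)lid * BF_S_WORDS;   /* current LDS path */
    return BF_current_S + gid * BF_S_WORDS;           /* modify in place */
}
```

With these values, local ids 0 to 7 keep the current LDS code and local ids 8 to 11 work directly on their slice of global memory, so S_Buffer stays at its current size.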
>
> Meanwhile, attached is a quick hack that uses simpler addressing modes
> (maybe, depending on what the code is compiled into). No significant
> performance change on 7970 from this (but the source code size is
> reduced), yet you could want to benchmark this more carefully. There
> appears to be a 10% slowdown on GTX 570 from this, but that's with
> non-optimal settings (the same as 7970's), so it might not be relevant.
> You may want to play with this too.
>
> Also, the 512-iteration loop in BF_body() could be partially unrolled -
> maybe try a 2x unroll first (256 iterations of the new loop). Maybe
> this would let a few instructions of one iteration of the original loop
> be intermixed with instructions from the other iteration, hiding some
> latencies.
>
> Thanks,
>
> Alexander
>
I'll try them too.
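
For reference, a minimal sketch of what the 2x unroll looks like, in plain C. It assumes the loop in question makes one block encryption per iteration and stores the two resulting words into the S-boxes; toy_encrypt() is a made-up stand-in and not the real Blowfish rounds, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define S_WORDS 1024  /* 4 S-boxes x 256 words, two words stored per iteration */

/* Hypothetical stand-in for one block encryption; NOT real Blowfish. */
static void toy_encrypt(uint32_t *L, uint32_t *R)
{
    uint32_t l = *L, r = *R;
    l ^= (r << 5) + 0x9e3779b9u;
    r ^= (l >> 3) + 0x7f4a7c15u;
    *L = r;
    *R = l;
}

/* Original shape: 512 iterations, one encryption each. */
static void fill_rolled(uint32_t *S, uint32_t L, uint32_t R)
{
    for (int i = 0; i < S_WORDS; i += 2) {
        toy_encrypt(&L, &R);
        S[i] = L;
        S[i + 1] = R;
    }
}

/* 2x unrolled: 256 iterations, two encryptions each. The stores from the
 * first half can be scheduled alongside the start of the second half. */
static void fill_unrolled2(uint32_t *S, uint32_t L, uint32_t R)
{
    assert(S_WORDS % 4 == 0);
    for (int i = 0; i < S_WORDS; i += 4) {
        toy_encrypt(&L, &R);
        S[i] = L;
        S[i + 1] = R;
        toy_encrypt(&L, &R);
        S[i + 2] = L;
        S[i + 3] = R;
    }
}
```

Both versions produce the same S contents. Whether the unrolled form is actually faster on the GPU depends on how the compiler schedules it, so it needs benchmarking.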
Regards,
Sayantan