Message-ID: <20120710060236.GA6348@openwall.com>
Date: Tue, 10 Jul 2012 10:02:36 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl

On Tue, Jul 10, 2012 at 11:16:17AM +0530, Sayantan Datta wrote:
> I remember that during actual cracking the speed was limited to somewhere
> near 1000 c/s on the kernel using global memory, although benchmarking
> suggested a much higher 2400 c/s. This suggests that we were incurring stalls
> during actual cracking which we weren't during benchmarking. I think this
> is the ultimate we can achieve using global memory.

Oh, I think I was not aware of the lower speed during actual cracking.

The speed difference could be due to benchmarks having only a handful of
candidate passwords to test, and testing them repeatedly. So we get
multiple instances of the same candidate password in-flight at a given
time, presumably resulting in accesses to similar memory locations.

Yet this is puzzling since the S-boxes are all separate and read-write.
Even if similar (mod 4 KB) addresses are accessed and the same data is
being read/written, this shouldn't allow for better cache usage than if
the data were different. In fact, the similarity in addresses could
result in more bank conflicts. So we may want to investigate this.

> Also, I could achieve nearly the same numbers using global memory alone
> despite heavily underutilizing the CU. I limited the global number of work
> items to 512 and the work group size to 8, which produced 1019 c/s in actual
> cracking.
>
> This puts my revised value of x at 4, not 8. So we will see up to 25%
> extra using global memory.

Makes sense.

> One more thing I would like you to know is that your Sptr implementation
> performs nearly the same as before on nvidia after a 4x loop unroll of the
> 512 iteration loop.

This also makes sense. Are you committing this change? I think it
makes the code simpler, although it needs 3 extra registers per bcrypt
instance. We should have plenty of spare registers since we're
under-utilizing the GPU anyway (assuming that this OpenCL code is being
run on a GPU).

Alexander
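
For readers following the thread, here is a rough sketch of what "an Sptr
implementation with a 4x unroll of the 512 iteration loop" could look like.
It is plain C rather than OpenCL, and it is not the code from bf_kernel.cl.
The names toy_encrypt, fill_sboxes, update_indexed, update_sptr_unrolled and
sboxes_equal are made up for this illustration, and toy_encrypt is only a
stand-in mixing function, not Blowfish. The only point is that walking the
S-boxes with a pointer and unrolling by 4 writes the same 1024 words in the
same order as the indexed 512 iteration loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define S_WORDS 1024 /* 4 S-boxes x 256 32-bit words, as in Blowfish */

/* Hypothetical stand-in for a Blowfish block encryption. It only mixes
 * (L, R) with the current S-box contents so that every iteration depends
 * on the previous writes. This is NOT real Blowfish. */
static void toy_encrypt(const uint32_t *S, uint32_t *L, uint32_t *R)
{
    uint32_t l = *L, r = *R;
    l ^= S[r & 0xff] + S[256 + ((r >> 8) & 0xff)];
    r ^= S[512 + (l & 0xff)] ^ S[768 + ((l >> 8) & 0xff)];
    l = (l << 7) | (l >> 25);
    *L = r + 0x9e3779b9u;
    *R = l;
}

/* Fill the S-boxes with arbitrary but reproducible values. */
static void fill_sboxes(uint32_t *S, uint32_t seed)
{
    for (int i = 0; i < S_WORDS; i++) {
        seed = seed * 1664525u + 1013904223u;
        S[i] = seed;
    }
}

/* Baseline: indexed loop, 512 iterations, two words written per iteration. */
static void update_indexed(uint32_t *S, uint32_t *L, uint32_t *R)
{
    for (int i = 0; i < S_WORDS; i += 2) {
        toy_encrypt(S, L, R);
        S[i] = *L;
        S[i + 1] = *R;
    }
}

/* Pointer-based variant, manually unrolled 4x: 128 trips of 4 steps each. */
static void update_sptr_unrolled(uint32_t *S, uint32_t *L, uint32_t *R)
{
    uint32_t *Sptr = S;
    uint32_t *const end = S + S_WORDS;

    assert(S_WORDS % 8 == 0); /* 4 steps x 2 words per trip */
    do {
        toy_encrypt(S, L, R); *Sptr++ = *L; *Sptr++ = *R;
        toy_encrypt(S, L, R); *Sptr++ = *L; *Sptr++ = *R;
        toy_encrypt(S, L, R); *Sptr++ = *L; *Sptr++ = *R;
        toy_encrypt(S, L, R); *Sptr++ = *L; *Sptr++ = *R;
    } while (Sptr < end);
}

/* Returns 1 if both S-box arrays hold identical contents. */
static int sboxes_equal(const uint32_t *A, const uint32_t *B)
{
    return memcmp(A, B, S_WORDS * sizeof(uint32_t)) == 0;
}
```

To try it, fill two arrays with the same seed via fill_sboxes, run one
variant on each starting from the same (L, R), and compare the results with
sboxes_equal. Whether the unrolled form is actually faster, and how many
registers it costs, depends on the OpenCL compiler and device.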