Message-ID: <51CC882C.9040707@gmail.com>
Date: Fri, 28 Jun 2013 00:15:00 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Re: latest version of descrypt project on reddit

On Thursday 27 June 2013 08:01 PM, Solar Designer wrote:
> On Thu, Jun 27, 2013 at 07:23:44PM +0530, Sayantan Datta wrote:
>> I profiled the kernel using CodeXL.
>>
>> When all 16 rounds are unrolled under a single iteration, the cache
>> hit rate is above 99%. This is supported by there being 0% memory
>> unit stalls and very little fetching from video memory. This
>> corresponds to the first case (4694 Mkeys/s).
>>
>> Next, when we put the 16 rounds of DES in a 25-iteration loop, the
>> cache hit rate suddenly drops to 1%. Now the memory unit is stalled
>> 23% of the time and video memory fetches increase by nearly 100x.
>> This is the second case (117 Mkeys/s).
> This makes me wonder: what if instead of the 25 iter loop, you unroll
> the entire thing - that is, repeat the 16 rounds 25 times?
>
> I think 16 rounds already exceed the I-cache size, yet with a
> non-looping kernel like that the hardware somehow manages to avoid
> most cache misses - perhaps by reusing each portion of fetched code
> many times (due to the high GWS). Perhaps we can make use of this
> feature for the entire descrypt just as easily?

400-round unroll with much higher GWS:

Run time: 6.897884 s
Rate: 77.831245 Mkeys/s
Time to search keyspace: 10715.490027 days
Stats: cache hit 92%, fetch size 891 MB, 0% memory stalls.

This is a little harder to explain. On one hand we have a very good
cache hit rate, but on the other hand the fetch size is extremely
large. In the best-case scenario we had a cache hit rate above 90% and
a fetch size of around 50 KB (for a 4-round unroll under 100
iterations).

>> We can increase the cache hit rate back above 90% with almost zero
>> memory unit stalls if we can somehow unroll only 4 rounds and put
>> them under an appropriate iteration count.
> That's tough. We can easily do it for 8 rounds (in fact, that's what
> we already do in descrypt-opencl), but when we reduce to 4, we have
> to use non-constant indices into B[].
>
> BTW, can you check the I-cache hit rate for the current
> descrypt-opencl?

The current fully hardcoded kernel has a cache hit rate of 26%, but
this figure includes all cache hits, not just the I-cache. However, I
am fairly sure it is mainly the I-cache, because little or no data is
being fetched from global memory. Also, the total amount fetched is
very high, nearly 13 GB.

For a hardcoded but not fully unrolled kernel the cache hit rate is
just 2%, but the amount fetched is also low, around 238 MB.

Note that these kernels also include the compare and password
generation code.

Regards,
Sayantan
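
P.S. For clarity, here is a toy OpenCL sketch of the two kernel
structures being compared. This is not the actual descrypt-opencl
code; ROUND() is just placeholder work on B[], not a real bitslice DES
round, and the kernel names are made up. It only shows the structural
difference between the looping variant and the full 400-round unroll.

/* Placeholder "round": arbitrary work on the 64-element B[] array,
 * standing in for one hardcoded DES round. */
#define ROUND(i) (B[(i) & 63] ^= (B[((i) + 1) & 63] << 1) + (uint)(i))

#define SIXTEEN_ROUNDS() do { \
    ROUND(0);  ROUND(1);  ROUND(2);  ROUND(3);  \
    ROUND(4);  ROUND(5);  ROUND(6);  ROUND(7);  \
    ROUND(8);  ROUND(9);  ROUND(10); ROUND(11); \
    ROUND(12); ROUND(13); ROUND(14); ROUND(15); \
} while (0)

__kernel void toy_looped(__global uint *out)
{
    uint B[64];
    for (int i = 0; i < 64; i++)
        B[i] = (uint)(get_global_id(0) + i);

    /* 16 rounds unrolled inside a 25-iteration loop: the structure
     * that showed the ~1% cache hit rate above. */
    for (int iter = 0; iter < 25; iter++)
        SIXTEEN_ROUNDS();

    out[get_global_id(0)] = B[0];
}

__kernel void toy_unrolled(__global uint *out)
{
    uint B[64];
    for (int i = 0; i < 64; i++)
        B[i] = (uint)(get_global_id(0) + i);

    /* Full 400-round unroll: 25 copies of the 16-round body emitted
     * back to back, with no outer loop at all. */
    SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS();
    SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS();
    SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS();
    SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS();
    SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS();
    SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS(); SIXTEEN_ROUNDS();
    SIXTEEN_ROUNDS();

    out[get_global_id(0)] = B[0];
}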