Message-ID: <BANLkTik36yhDeMXBg=8XAmouAzUW3UETAQ@mail.gmail.com> Date: Mon, 11 Apr 2011 22:47:48 +0200 From: Łukasz Odzioba <lukas.odzioba@...il.com> To: john-dev@...ts.openwall.com Subject: Re: sha256 format patches Hi! I've updated sha256cuda patch for JtR and wanted to share results with you. Test Environment: CPU: Intel Core 2 Duo P7350@...z 16 Gflops - theoretical GPU: nVidia GeForce 9600m - 32 cuda cores 110 Gflops - theoretical Gflops are not good mertics to compare gpu and cpu, but it's difficult to find something well fitted. I considered two types of hashes: fast hashes: classic sha256 with 64 iterations slow hashes: sha256 calculated 5000 time - 320 000 iterations I used nVidia Visual profiler to distinguish difference between slow and fast hashes. Following charts shows time consumtion during accordingly fast and slow hash computation: http://sphere.pl/~ukasz/gsoc2011/slow.png http://sphere.pl/~ukasz/gsoc2011/fast.png Short analysis leads to following conclusions: Fast hashes: -mostly ALU bounded -optimizing pci-e transfers can be profitable -requires changes in cpu code to utilize gpu power Slow hashes: -extreme ALU bounded -pci-e data transfers are negligible -great scalability Sha256 is simple function and can be efficently computed with “few” registers. Every hash calculation requires 96 bytes to be transfered through PCI-e bus so: Fast sha256 calculates ~2000k c/s - 185 MB/s PCI-e transfer Slow sha256 calculates 380 c/s - 35kB/s PCI-e transfer First version of my sha256cuda patch was able to calculate ~1795k c/s, comparing to CPU version ~1134k c/s (it is not optimal). Sha256 cuda after following optimizations: --use constant memory for initialization values +12% --find best threads per block configuration +18% --throw away unused table initialization +5% --unrolling loops +23% achieved ~2987k c/s. To unroll loops I used nvcc #pragma unroll N macro but it resulted utilization 56 registers per thread while unrolled version only 18 registers. So I've crated my own unrolling macros but huge number of used registers didn't change, and there was big performance loss. This could be easily solved by using nvcc --maxrregcount N option and resulted a big speedup. CUDA api delivers page locked memory allocation so data through PCIe can be transfered in DMA mode. It is great advantage and gives +8% boost (checked 3205k c/s). It can be connected with asynchronous transfer and gpu code execution which should provide another 10-15 percent speedup (on fast hashes). On Fermi architecture data can be transfered in both sides while gpu code executes, on G80 only one side background is allowed. Picture below describes idea similar to pipeline. http://sphere.pl/~ukasz/gsoc2011/streams.png We need to divide computations in smaller parts and utilize memory controller asynchronous data transfer mode. These are percentage time on my gpu: -Reading data from Host do Device - 17% -Calculating hashes - 68% -Writing data from Device to Host - 14% Hovewer there are three problems with this approach: -pinned memory allocation is more expensive -we should not allocate a lot of pinned memory because of system stability -JtR format does not provide mechanism to do cleanup after computations, so I suppose we would have to alocate and dealocate memory for every crypt_all call, but it is not healthy. I was wondering about my bitwise ror macro but according to .asm i .ptx files nvcc and gcc optimized it well. 
There are still places where I could gain some performance:
- shared memory
- texture memory (better than global memory because it is cached)
- looking deeper into the ptx files
- developing something similar to Alain's code, which searches for the best thread/block/register settings in the running environment
However, I might not have time to play with these before Friday.

Since my main topic is slow hashes, the changes I made affected slow SHA-256 computation. For slow SHA-256 I got the following results:
CPU - 380 c/s
GPU - 2520 c/s
So the GPU version is 6.6x faster than the CPU. Slow hashes scale better, so with a 2x faster GPU we can easily get code 12x faster than the CPU. Those results do not include the loop unrolling modification.

Here is a link to the newest version of the sha256 patch:
http://sphere.pl/~ukasz/gsoc2011/john-1.7.6-sha256cuda-1.diff
The code is a bit dirty; I will do a cleanup and post a link on the wiki soon.

Solar Designer benchmarked version 0 of the sha256cuda patch and got the following results on a GeForce 8800 GTS 512:

"(...)
4456k c/s real, 4412k c/s virtual

after a certain change, I got this to:

5255k c/s real, 5308k c/s virtual

running two processes simultaneously gives a combined speed of about 9M"

and also the raw MD5 OpenCL patch on the 8800 GTS 512:

Raw:  36296K c/s real, 94371K c/s virtual

running two at once:

Raw:  32263K c/s real, 86037K c/s virtual
Raw:  32577K c/s real, 79891K c/s virtual

so that's about 64M combined speed.

Lukas