john-users - NTLM and OpenCL 1.0

Message-ID: <AANLkTimn4veSyrBSfdH6DL=tFOSE12ddUxn+SCuXkcze@mail.gmail.com>
Date: Mon, 6 Sep 2010 10:01:29 -0700
From: Alain Espinosa <alainesp@...il.com>
To: john-users <john-users@...ts.openwall.com>
Subject: NTLM and OpenCL 1.0

I was working on Mac OS X development, looking for ways to accelerate
Computer Vision apps, and I came across Grand Central Dispatch and the
OpenCL standard. I know there are programs that use the GPU to
accelerate cracking, and I think a good approach to learning OpenCL is
to write a patch for John. Attached is an NTLM patch on top of jumbo7
that uses OpenCL 1.0 if chosen.

I added 2 make targets for Mac OS X: "macosx-x86-64-opencl" and
"macosx-x86-sse2-opencl". If you know how to link OpenCL on
other platforms, please leave me a comment. Note that the header
"cl.h" is needed; on platforms other than Mac it normally lives in
"CL/cl.h".

<<<<<<The following NTLM benchmarks were made on a Core 2 Duo T8100
2.1GHz with Mac OS 10.6.4 in VMware 7.0>>>>>
M = millions of passwords

//x86-64---------------------------------------------------
Benchmarking: NT MD4 [128/128 X2 SSE2-16]... DONE
Raw:    20830K c/s real, 21040K c/s virtual

//x86-sse2-------------------------------------------------
Benchmarking: NT MD4 [128/128 X2 + 32/32]... DONE
Raw:    14597K c/s real, 14744K c/s virtual

//x86-64-C_code-------------------------------------------------
Benchmarking: NT MD4 [32/32]... DONE
Raw:    8818K c/s real, 8906K c/s virtual

//x86-64-opencl--------------------------------------------
Benchmarking: NT MD4 [OpenCL 1.0]... DONE
Raw:    8568K c/s real, 5070 c/s virtual

The performance was the same using a vectorized OpenCL implementation.
A compiler should generate SSE2 code from vectorized kernels on CPU
devices, but Apple's compiler apparently does not make this improvement.
Maybe the ATI platform, which is advertised as smart enough to
generate SSE2 code, would do better. I think the problem here is an immature
OpenCL implementation, but maybe it is the virtual machine, or maybe it is my
code, or maybe NTLM is so fast that OpenCL calls take a significant portion
of the time. I cannot test with a GPU because of the virtualized
environment. On my "real" PC I have Windows 7 64-bit, and john crashes in
cygwin with a segfault if I try to link dynamically against
OpenCL.dll.

Theoretical Performance:
1- GPUs are normally connected through a PCI or PCI Express port with a
bandwidth of 8GB/sec. Then 8GB/(length of keys=32 bytes) = 268 M/sec.
The program ighashgpu benchmarks at 195 M/sec on my Quadro FX 3600M GPU.
Better GPUs perform much better. So there is a problem with the
bandwidth. This can be solved or attenuated by halving the key length
or by implementing a simple key compression algorithm.
2- John generates keys at 65 M/sec for -test and 35 M/sec for -inc on
my PC (this was benchmarked with the crypt function eliminated). So
there is a big problem here for good speeds. This can be a speed limit
even with x86-64 code.
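
The two limits above can be put side by side with a little arithmetic. The
sketch below is only an illustration, written in Python rather than John's C.
It assumes that 8GB means 8 GiB (which is what gives 268 M/sec), that a key
occupies 32 bytes on the bus, and that a chain of stages runs no faster than
its slowest stage. The function names are made up for this example.

```python
GIB = 2**30

def transfer_limit(bandwidth_bytes_per_s, key_bytes):
    """Upper bound on keys/sec that the bus can deliver to the GPU."""
    return bandwidth_bytes_per_s / key_bytes

def pipeline_limit(*stage_rates):
    """A chain of stages cannot run faster than its slowest stage."""
    return min(stage_rates)

bus = 8 * GIB                          # assumed meaning of "8GB/sec"
full_keys = transfer_limit(bus, 32)    # about 268 M keys/sec
half_keys = transfer_limit(bus, 16)    # key length halved: about 537 M keys/sec

gpu_hash_rate = 195e6   # ighashgpu on the Quadro FX 3600M
gen_test = 65e6         # John key generation, -test
gen_inc = 35e6          # John key generation, -inc

for name, gen in (("-test", gen_test), ("-inc", gen_inc)):
    limit = pipeline_limit(gen, full_keys, gpu_hash_rate)
    print(f"{name}: at most {limit / 1e6:.0f} M keys/sec")
```

With these figures the tightest bound is key generation on the CPU, not the
bus, which is why point 2 matters even more than point 1.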

These 2 points present a problem for programs like john that walk
over a search space. Which is better, a sophisticated search at a low
benchmark speed or a naive one at a high benchmark speed? The problem
exists because on the GPU it is difficult to program a sophisticated
search, and there is a lot of existing code that targets CPUs; moving it
to the GPU is tricky and time consuming.

Some solutions could be:
- Optimize generation of keys. This requires work on john and I do not
know if it is possible.
- Add a naive but fast generation of keys.
- Generate keys continuously on the CPU, concurrently with GPU cracking.
- Run several instances of John to make full use of the GPU and bandwidth.
I do not know if this is possible.
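
The second and third ideas can be sketched together: a naive generator
produces candidate keys in batches on one thread while another thread
consumes them. This is a Python illustration of the structure only, not code
from the patch. The consumer is a plain CPU function standing in for the
OpenCL kernel, SHA-256 from hashlib stands in for NT MD4, and all names
(generate_batches, hash_key, crack, BATCH) are hypothetical.

```python
import hashlib
import itertools
import queue
import threading

BATCH = 1024   # keys handed over per batch (arbitrary choice)
DONE = None    # sentinel marking the end of the key stream

def generate_batches(charset, length, out_q):
    """Producer: naive brute-force generation, grouped into batches."""
    batch = []
    for combo in itertools.product(charset, repeat=length):
        batch.append("".join(combo))
        if len(batch) == BATCH:
            out_q.put(batch)
            batch = []
    if batch:
        out_q.put(batch)
    out_q.put(DONE)

def hash_key(key):
    # Stand-in hash; the real patch computes NT MD4 on the device.
    return hashlib.sha256(key.encode("utf-16-le")).hexdigest()

def crack(targets, charset, length):
    """Consumer: hash each batch while the producer builds the next one."""
    # Bounded queue, so the producer cannot run far ahead of the consumer.
    q = queue.Queue(maxsize=2)
    producer = threading.Thread(
        target=generate_batches, args=(charset, length, q))
    producer.start()
    found = {}
    while True:
        batch = q.get()
        if batch is DONE:
            break
        for key in batch:
            digest = hash_key(key)
            if digest in targets:
                found[digest] = key
    producer.join()
    return found

target = hash_key("cab")
print(crack({target}, "abc", 3))
```

In CPython the two threads mostly take turns, so this shows the shape of the
pipeline rather than a speedup. With a real GPU the consumer would spend its
time waiting on the device, leaving the CPU free to generate the next batch.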

So do not expect big speedups from this patch. Of course, this
problem only exists for extremely fast hashes like NTLM; for other hash
types, using OpenCL could give a big speedup.


This patch may or may not work with a particular OpenCL 1.0
implementation. Please send comments if it does not. In particular, my
Nvidia driver successfully compiles the code but does not load it. All
benchmarks and tests are welcome.

saludos,
alain

Download attachment "john-1.7.6-jumbo-7-ntopencl-1.diff" of type "application/octet-stream" (22369 bytes)