john-dev - RE: Speedup of x86 .S build of raw-sha1 format

Message-ID: <000301cbe40c$a6e24be0$f4a6e3a0$@net>
Date: Wed, 16 Mar 2011 14:02:09 -0500
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: Speedup of x86 .S build of raw-sha1 format

>2011-03-15 8:54 PM, magnum wrote
>>On 2011-03-15 23:28, jfoug wrote:
>> Here is about a 10% speedup. A very trivial change.
>>
>> - memset(saved_key, 0, sizeof(saved_key));
>> + //memset(saved_key, 0, sizeof(saved_key));
>> + memset(saved_key, 0, 64*MMX_COEF);
>
>Good find. I suppose it could be done exactly the same in
>mysqlSHA1_fmt.c too? It even has a comment about that memset. But it
>runs x1 on x86-64

In mysqlSHA1_fmt.c I found 2 speedups (SSE).

The first is the memset change above. The second removes a double byte
swap.  Here are the timings on my system:

john -test=5 -for=mysql-sha1
Benchmarking: MySQL 4.1 double-SHA-1 SSE2 [mysql-sha1 SSE2]... DONE
Raw:    4145K c/s

john -test=5 -for=mysql-sha1
Benchmarking: MySQL 4.1 double-SHA-1 SSE2 [mysql-sha1 SSE2]... DONE
Raw:    4391K c/s

john -test=5 -for=mysql-sha1
Benchmarking: MySQL 4.1 double-SHA-1 SSE2 [mysql-sha1 SSE2]... DONE
Raw:    4608K c/s

The 1st is the 'starting point'.
The 2nd is the simple change to the memset.
The 3rd adds a change to the asm: a new function that does not byteswap
prior to returning (on top of the reduced memset).
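
To make the memset change concrete, here is a minimal sketch. The buffer
size and the value of MMX_COEF are assumptions for illustration only (they
are not copied from the format's source), and clear_all / clear_block_only
are hypothetical helper names. The idea is simply that clearing one 64-byte
SHA-1 input block per lane touches far fewer bytes than clearing the whole
buffer.

```c
#include <stddef.h>
#include <string.h>

#define MMX_COEF 4                     /* assumption: 4 interleaved lanes (SSE2) */
#define BLOCK_BYTES (64 * MMX_COEF)    /* one 64-byte SHA-1 block per lane */

/* Hypothetical buffer, deliberately larger than BLOCK_BYTES, standing in
 * for a saved_key that has room beyond the input block. */
static unsigned char saved_key[80 * 4 * MMX_COEF];

/* Original behaviour: clear the whole buffer on every call. */
static size_t clear_all(void)
{
    memset(saved_key, 0, sizeof(saved_key));
    return sizeof(saved_key);
}

/* Patched behaviour: clear only the input block area. */
static size_t clear_block_only(void)
{
    memset(saved_key, 0, BLOCK_BYTES);
    return BLOCK_BYTES;
}
```

With these assumed sizes the patched call clears 256 bytes instead of 1280.
This is only safe if nothing beyond the first block per lane needs to be
zero when the next keys are set.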

static void mysqlsha1_crypt_all(int count) {  
#ifdef MMX_COEF
    unsigned int i;

-    shammx((unsigned char *) crypt_key, (unsigned char *) saved_key, total_len);
-
-    for(i = 0; i < MMX_COEF*BINARY_SIZE/sizeof(unsigned); i++)
-    {
-        ((unsigned*)interm_key)[i] = BYTESWAP(((unsigned*)crypt_key)[i]);
-    }
+    shammx_nofinalbyteswap((unsigned char *) crypt_key, (unsigned char *) saved_key, total_len);
+    for(i = 0; i < MMX_COEF*BINARY_SIZE/sizeof(unsigned); i++)
+        ((unsigned*)interm_key)[i] = ((unsigned*)crypt_key)[i];

Simply create a new function called shammx_nofinalbyteswap(). That function
works just like shammx, except that before jumping into shammx_noinit it
sets a dword to 1.  Then at the bottom of shammx_noinit, right before the
jmp to skip_endianity, I check that dword. If it is 1, I set it back to 0
(so the next call to shammx works properly) and dump the 5 mmx registers
straight into the output buffer.  If the dword is 0, the original
jmp skip_endianity is done.
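
The control flow can be modelled in C roughly like this. It is only a
sketch of the one-shot flag idea, not the assembly itself: skip_final_swap,
finish, hash_swapped and hash_noswap are hypothetical names, and the
'state' array stands in for the five registers holding the SHA-1 result.

```c
#include <stdint.h>

/* Hypothetical stand-in for the dword that the new entry point sets. */
static uint32_t skip_final_swap = 0;

/* Reverse the byte order of one 32-bit word. */
static uint32_t bswap32(uint32_t x)
{
    return ((x >> 24) & 0xFFu) | ((x >> 8) & 0xFF00u) |
           ((x << 8) & 0xFF0000u) | (x << 24);
}

/* Models the tail of the shared routine: check the flag, reset it, and
 * either dump the state as-is or take the original byteswapping path. */
static void finish(const uint32_t state[5], uint32_t out[5])
{
    int i;

    if (skip_final_swap) {
        skip_final_swap = 0;            /* next plain call behaves normally */
        for (i = 0; i < 5; i++)
            out[i] = state[i];          /* raw dump, no byteswap */
        return;
    }
    for (i = 0; i < 5; i++)
        out[i] = bswap32(state[i]);     /* original path */
}

/* Models the existing entry point. */
static void hash_swapped(const uint32_t state[5], uint32_t out[5])
{
    finish(state, out);
}

/* Models the new entry point: set the flag, then share the same code. */
static void hash_noswap(const uint32_t state[5], uint32_t out[5])
{
    skip_final_swap = 1;
    finish(state, out);
}
```

Resetting the flag inside the shared tail is what keeps the two entry
points independent: a call through the old entry point never sees a stale 1.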

What that gives is an x86-friendly dump of sha1 into the buffer, which
avoids endianity checks later (converting to base-16, comparing against the
original base-16, etc), and it also gives a shammx variant which leaves the
buffer in the PROPER endianity format to be used as input data.

So if there was a function such as phpass, which used sha1, this speedup
would allow multiple encryptions right in the same buffer.  Phpass is 'like'
this mysql-sha1 in that the key is set up, the crypt is done, and then the
crypt is done over and over again on the binary result left in the buffer.
In phpass, there is more than just the binary result (it also appends the
key, but the binary residue from the prior run is ALWAYS 16 bytes).

In phpass, we have this:    md5(binary(md5($p.$s)).$p)^2048    

In mysql-sha1 we have sha1(binary(sha1(byteswap.c($p)))).  After my change,
this is what the logic actually does.

Before the change we were doing
sha1(byteswap.c(byteswap.sse(binary(sha1(byteswap.c($p))))))
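
The reason the two inner byteswaps can be dropped is that a 32-bit byteswap
is its own inverse, so swapping in the SSE code and then swapping again in C
leaves every word unchanged. A small sketch (swap32, double_swap_copy and
plain_copy are hypothetical names used only for this illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word. */
static uint32_t swap32(uint32_t x)
{
    return ((x >> 24) & 0xFFu) | ((x >> 8) & 0xFF00u) |
           ((x << 8) & 0xFF0000u) | (x << 24);
}

/* Old path: one swap when the hash is written out, a second swap when the
 * result is copied into the input buffer for the next SHA-1. */
static void double_swap_copy(const uint32_t *in, uint32_t *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = swap32(swap32(in[i]));
}

/* New path: the two swaps cancel, so a plain copy gives the same words. */
static void plain_copy(const uint32_t *in, uint32_t *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i];
}
```

Both functions produce identical output for any input, and the second one
does no per-word shuffling at all.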


NOTE: I probably should have kept some of these optimizations quiet, so that
I had some way to offset the overhead of the generic functions and could
still design subformats that replace existing ones (like sha1-raw,
mysql-sha1, etc) while keeping 'almost' equal speeds.

Jim.