Message-ID: <20120615151205.GT3546@cmpxchg8b.com>
Date: Fri, 15 Jun 2012 17:12:05 +0200
From: Tavis Ormandy <taviso@...xchg8b.com>
To: john-dev@...ts.openwall.com
Subject: Re: [patch] optional new raw sha1 implementation

Oh, unrelated note, I also have an original CUDA SHA-1 that (based on some
quick calculations) should compare quite favourably. I can port it to a
plugin if anyone is interested.

My obsession with SHA-1 actually comes from trying to find colliding
blocks[1], but I suppose it can also benefit password security :-)

[1] https://lock.cmpxchg8b.com/shatter/visualize.html?best=1

Tavis.

On Fri, Jun 15, 2012 at 04:54:15PM +0200, Tavis Ormandy wrote:
> [starting a new thread]
>
> > On 2012-06-14 21:38, magnum wrote:
> > > On 2012-06-14 16:32, Frank Dittrich wrote:
> > > > On Thu, Jun 14, 2012 at 6:19 PM, Tavis
> > > > Ormandy<taviso-1TlbntoI6+xF6kxbq+BtvQ@...lic.gmane.org> wrote:
> > > > > p.s. I also have a sha-1 implementation that's a little faster than
> > > > > the jumbo version, would this be the right list to send that to? Is
> > > > > there a jumbo cvs repo I can checkout to patch against?
> > > >
> > > > Probably the latest git version is considerably faster than the last
> > > > jumbo version.
> > >
> > > I was going to say "not much" but I just checked raw-sha1 and apparently
> > > it's 33% faster. I'm not sure how that happened; from memory, the code
> > > changes only boosted it by 10-12% (and this CPU does not support XOP or
> > > the other stuff that Solar added optimisations for).
> >
> > I tracked this down: I was remembering correctly, but that was not compared
> > to Jumbo-5 but to 80x4 [1] magnum-jumbo. The SHA-1 intrinsics changes by
> > Jim made about half of those 33%, and my optimisations of set_key() in the
> > sha1 formats did the rest. I suppose Tavis improved the intrinsics code,
> > so these changes may well be worth looking at and comparing with the 16x4
> > code Jim made. Even if Jim's code turns out to be faster, there may be
> > some bits and pieces we can use.
> >
> > [1] The 80x4 vs 16x4 refers to the SSE2 key buffer. In the older code by
> > Simon, it's 80 bytes per candidate, of which 16 bytes are just scratch
> > space. In Jim's code, it's 64 bytes, just like MD4 and MD5, and the
> > scratch space is on the stack.
> >
> > magnum
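For illustration, the difference magnum describes might be sketched like this
(illustrative C only; these are not the actual JtR structures):

    /* Older (Simon's) layout: each of the four interleaved candidates owns
     * 80 bytes, a 64-byte message block plus 16 bytes of scratch space. */
    typedef struct {
        uint32_t block[16][4];   /* 16 words x 4 candidates, interleaved */
        uint32_t scratch[4][4];  /* extra 16 bytes per candidate */
    } keybuf_80x4;

    /* Newer (Jim's) layout: only the 64-byte message block is stored, as in
     * the MD4/MD5 formats; scratch lives on the stack during crypt_all(). */
    typedef struct {
        uint32_t block[16][4];
    } keybuf_16x4;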
> I see, thanks for the information magnum! I cloned from your git repo, and my
> code still seems to be around ~40% faster on most of my hardware. I'm not
> sure you're going to be too happy though: I didn't change any intrinsics, I
> actually wrote a new plugin from scratch.
>
> I know SHA-1 and SSE quite well, so I'm not afraid to dive in and start
> hacking, but the existing code is quite hard to follow for an outsider!
>
> I understand why (it needs to work well on lots of different silicon), but I
> was hoping you might be convinced to include mine as an optional build that
> might perform better on some machines.
>
> Here are the numbers on my slow x32 xeon:
>
> $ printf madmda16 | sha1sum | awk '{print $1}' > passwords
> $ time ../run/john --format=rawsha1_sse4 passwords
> Loaded 1 password hash (Raw SHA-1 (taviso sse4 build) [rawsha1_sse4])
> madmda16 (?)
> guesses: 1 time: 0:00:01:29 DONE (Fri Jun 15 16:01:37 2012) c/s: 9612K trying: madmda11 - madmda16
>
> real 1m29.253s
> user 1m28.767s
> sys 0m0.049s
>
> $ rm ../run/john.pot
> $ time ../run/john --format=raw-sha1 passwords
> Loaded 1 password hash (Raw SHA-1 [SSE2 4x])
> madmda16 (?)
> guesses: 1 time: 0:00:02:02 DONE (Fri Jun 15 16:43:14 2012) c/s: 6973K trying: madmda11 - madmda16
>
> real 2m3.020s
> user 2m2.356s
> sys 0m0.084s
>
> About 2.6M c/s (roughly 38%) faster. I haven't tested it on any AMD hardware;
> I imagine it will still be a bit faster there, as I have some interesting
> logical optimisations as well as code optimisations, but maybe not as much of
> an improvement.
>
> As the code is self contained, I've just attached my first attempt. Let me
> know if you have any feedback, or suggestions to help get it merged.
>
> I've only tested it on Linux with GCC, and you need to add -msse4 to CFLAGS
> in the Makefile.
>
> The code is original, I can assign copyright to Solar if required.
>
> Tavis.
>
> p.s. I can live without sse4 if it's a deal breaker, but I use it on
> quite a hot path.
>
> --
> -------------------------------------
> taviso@...xchg8b.com | pgp encrypted mail preferred
> -------------------------------------------------------
> #define _GNU_SOURCE // for memrchr(), a GNU extension
>
> #include <stdbool.h>
> #include <emmintrin.h>
> #include <smmintrin.h>
> #include <string.h>
> #include <stdint.h>
> #include <limits.h>
>
> #include "params.h"
> #include "formats.h"
> #include "memory.h"
> #include "sha.h"
>
> //
> // Alternative SSE2 raw SHA-1 format plugin for John The Ripper. I believe
> // this version should perform well on most recent SSE2+ hardware, but is
> // not as portable as the standard version.
> //
> // I would recommend benchmarking both on your machine.
> //
> // Tavis Ormandy <taviso@...xchg8b.com>, 2012.
> //
>
> #if defined(__SSE4_1__) && defined(__GNUC__) && defined(__linux__)
> # warning building experimental rawsha1_sse4 plugin, send bug reports to taviso@...xchg8b.com
> #else
> # error not recommended to use this code on your platform
> #endif
>
> #define SHA1_BLOCK_SIZE 64
> #define SHA1_BLOCK_WORDS 16
> #define SHA1_DIGEST_SIZE 20
> #define SHA1_DIGEST_WORDS 5
> #define SHA1_PARALLEL_HASH 4
>
> #define __aligned __attribute__((aligned(16)))
>
> // Message schedule update: W[t] = (W[t-16] ^ W[t-14] ^ W[t-8] ^ W[t-3]) rol 1,
> // computed in place over a 16-word circular buffer.
> #define X(X0, X2, X8, X13) do { \
>     X0 = _mm_xor_si128(X0, X8); \
>     X0 = _mm_xor_si128(X0, X13); \
>     X0 = _mm_xor_si128(X0, X2); \
>     X0 = _mm_xor_si128(_mm_slli_epi32(X0, 1), _mm_srli_epi32(X0, 31)); \
> } while (false)
>
> #define R1(W, A, B, C, D, E) do { \
>     E = _mm_add_epi32(E, K); \
>     E = _mm_add_epi32(E, _mm_and_si128(C, B)); \
>     E = _mm_add_epi32(E, _mm_andnot_si128(B, D)); \
>     E = _mm_add_epi32(E, W); \
>     B = _mm_xor_si128(_mm_slli_epi32(B, 30), _mm_srli_epi32(B, 2)); \
>     E = _mm_add_epi32(E, _mm_xor_si128(_mm_slli_epi32(A, 5), _mm_srli_epi32(A, 27))); \
> } while (false)
>
> #define R2(W, A, B, C, D, E) do { \
>     E = _mm_add_epi32(E, K); \
>     E = _mm_add_epi32(E, _mm_xor_si128(_mm_xor_si128(B, C), D)); \
>     E = _mm_add_epi32(E, W); \
>     B = _mm_xor_si128(_mm_slli_epi32(B, 30), _mm_srli_epi32(B, 2)); \
>     E = _mm_add_epi32(E, _mm_xor_si128(_mm_slli_epi32(A, 5), _mm_srli_epi32(A, 27))); \
> } while (false)
>
> #define R3(W, A, B, C, D, E) do { \
>     E = _mm_add_epi32(E, K); \
>     E = _mm_add_epi32(E, _mm_or_si128(_mm_and_si128(_mm_or_si128(B, D), C), _mm_and_si128(B, D))); \
>     E = _mm_add_epi32(E, W); \
>     B = _mm_xor_si128(_mm_slli_epi32(B, 30), _mm_srli_epi32(B, 2)); \
>     E = _mm_add_epi32(E, _mm_xor_si128(_mm_slli_epi32(A, 5), _mm_srli_epi32(A, 27))); \
> } while (false)
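> // For reference, these are the standard SHA-1 round functions:
> //
> //   Ch(b,c,d)     = (b & c) | (~b & d)            rounds  0..19 (R1)
> //   Parity(b,c,d) = b ^ c ^ d                     rounds 20..39, 60..79 (R2)
> //   Maj(b,c,d)    = (b & c) | (b & d) | (c & d)   rounds 40..59 (R3)
> //
> // R1 can add the (c & b) and (~b & d) terms into E separately because they
> // can never both have the same bit set, so ADD and OR are equivalent there.
> // R3 uses the equivalent form ((b | d) & c) | (b & d), which needs one
> // fewer operation than the textbook expression.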
> // 4x4 32-bit integer matrix transpose, used to interleave the four
> // candidates' message words.
> #define _MM_TRANSPOSE4_EPI32(R0, R1, R2, R3) do { \
>     __m128i T0, T1, T2, T3; \
>     T0 = _mm_unpacklo_epi32(R0, R1); \
>     T1 = _mm_unpacklo_epi32(R2, R3); \
>     T2 = _mm_unpackhi_epi32(R0, R1); \
>     T3 = _mm_unpackhi_epi32(R2, R3); \
>     R0 = _mm_unpacklo_epi64(T0, T1); \
>     R1 = _mm_unpackhi_epi64(T0, T1); \
>     R2 = _mm_unpacklo_epi64(T2, T3); \
>     R3 = _mm_unpackhi_epi64(T2, T3); \
> } while (false)
>
> // Relax the argument types of the load/store intrinsics. The
> // self-reference is safe: function-like macros do not expand recursively.
> #define _mm_load_si128(x) _mm_load_si128((void *)(x))
> #define _mm_store_si128(x, y) _mm_store_si128((void *)(x), (y))
>
> static uint32_t __aligned M[SHA1_PARALLEL_HASH][4];
> static uint32_t __aligned N[SHA1_PARALLEL_HASH][4];
> static uint32_t __aligned MD[SHA1_PARALLEL_HASH];
>
> static struct fmt_tests sha1_fmt_tests[] = {
>     { "f8252c7b6035a71242b4047782247faabfccb47b", "taviso" },
>     { "b47f363e2b430c0647f14deea3eced9b0ef300ce", "is" },
>     { "03d67c263c27a453ef65b29e30334727333ccbcd", "awesome" },
>     { "7a73673e78669ea238ca550814dca7000d7026cc", "!!!!1111eleven" },
>     { "da39a3ee5e6b4b0d3255bfef95601890afd80709", "" },
>     { "AC80BAA235B7FB7BDFC593A976D40B24B851F924", "CAPSLOCK" },
>     { NULL, NULL }
> };
>
> static inline uint32_t __attribute__((const)) rotateright(uint32_t value, uint8_t count)
> {
>     register uint32_t result;
>
>     asm("ror %%cl, %0"
>         : "=r" (result)
>         : "0" (value),
>           "c" (count));
>
>     return result;
> }
>
> static inline uint32_t __attribute__((const)) rotateleft(uint32_t value, uint8_t count)
> {
>     register uint32_t result;
>
>     asm("rol %%cl, %0"
>         : "=r" (result)
>         : "0" (value),
>           "c" (count));
>
>     return result;
> }
>
> static int sha1_fmt_valid(char *ciphertext, struct fmt_main *format)
> {
>     // This routine is not hot; verify that the input contains only hex
>     // digits and is exactly the right length.
>     if (strspn(ciphertext, "0123456789aAbBcCdDeEfF") != SHA1_DIGEST_SIZE * 2)
>         return 0;
>
>     return ciphertext[SHA1_DIGEST_SIZE * 2] == '\0';
> }
>
> static void * sha1_fmt_binary(char *ciphertext)
> {
>     uint32_t *result = mem_alloc_tiny(SHA1_DIGEST_SIZE, MEM_ALIGN_SIMD);
>     uint8_t *binary = (uint8_t *) result;
>     char byte[3] = {0};
>
>     // Convert the ascii representation into binary. This routine is not
>     // hot, so it's okay to keep it simple.
>     for (; *ciphertext; ciphertext += 2, binary += 1) {
>         *binary = strtoul(memcpy(byte, ciphertext, 2), NULL, 16);
>     }
>
>     // Now subtract the SHA-1 IV, returning this hash to the R80 state. This
>     // saves 4 SIMD additions for every crypt(), because we don't have to do
>     // them there.
>     result[0] = __builtin_bswap32(result[0]) - 0x67452301;
>     result[1] = __builtin_bswap32(result[1]) - 0xEFCDAB89;
>     result[2] = __builtin_bswap32(result[2]) - 0x98BADCFE;
>     result[3] = __builtin_bswap32(result[3]) - 0x10325476;
>
>     // One additional preprocessing step: if we calculate E80 rol 2 here, we
>     // can compare it against A75 and save 5 rounds.
>     result[4] = rotateleft(__builtin_bswap32(result[4]) - 0xC3D2E1F0, 2);
>
>     return result;
> }
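> // Why subtracting the IV is sound: for a single-block message, SHA-1's
> // final feed-forward is digest[i] = IV[i] + state[i] (mod 2^32), so
> // state[i] = digest[i] - IV[i]. Comparing against the pre-feed-forward
> // registers means crypt_all() never has to perform those additions.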
> // This function is called when John wants us to buffer a crypt() operation
> // on the specified key. We also preprocess it for SHA-1 as we load it.
> //
> // This implementation is hardcoded to accept passwords of at most 15
> // characters. This is because we can create a new message block in just two
> // MOVDQA instructions (15 rather than 16, because we must append a bit to
> // the message).
> //
> // This routine assumes that key is not on an unmapped page boundary, but
> // doesn't require it to be 16 byte aligned (although that would be nice).
> static void sha1_fmt_set_key(char *key, int index)
> {
>     __m128i Z = _mm_setzero_si128();
>     __m128i X = _mm_loadu_si128((const __m128i *) key);
>     __m128i B = _mm_set_epi32(1 << 31, 0, 0, 0);
>     uint32_t len = _mm_movemask_epi8(_mm_cmpeq_epi8(X, Z));
>
>     // First, find the length of the key by scanning for a zero byte.
>     len = __builtin_ctz(len);
>
>     // Zero out the rest of the DQWORD in X by making a mask. First, set all
>     // the bits by comparing Z to itself, then shift the mask left len bytes.
>     Z = _mm_cmpeq_epi8(Z, Z);
>
>     // XXX: PSRLDQ requires an imm8 shift count. Is there a way to do this
>     // without the branch? We can use PSRLQ, but then how do I reset the
>     // other QWORD? (See the sketch after this function for one option.)
>     switch (len) {
>         case  0: Z = _mm_slli_si128(Z,  0); B = _mm_srli_si128(B, 15); break;
>         case  1: Z = _mm_slli_si128(Z,  1); B = _mm_srli_si128(B, 14); break;
>         case  2: Z = _mm_slli_si128(Z,  2); B = _mm_srli_si128(B, 13); break;
>         case  3: Z = _mm_slli_si128(Z,  3); B = _mm_srli_si128(B, 12); break;
>         case  4: Z = _mm_slli_si128(Z,  4); B = _mm_srli_si128(B, 11); break;
>         case  5: Z = _mm_slli_si128(Z,  5); B = _mm_srli_si128(B, 10); break;
>         case  6: Z = _mm_slli_si128(Z,  6); B = _mm_srli_si128(B,  9); break;
>         case  7: Z = _mm_slli_si128(Z,  7); B = _mm_srli_si128(B,  8); break;
>         case  8: Z = _mm_slli_si128(Z,  8); B = _mm_srli_si128(B,  7); break;
>         case  9: Z = _mm_slli_si128(Z,  9); B = _mm_srli_si128(B,  6); break;
>         case 10: Z = _mm_slli_si128(Z, 10); B = _mm_srli_si128(B,  5); break;
>         case 11: Z = _mm_slli_si128(Z, 11); B = _mm_srli_si128(B,  4); break;
>         case 12: Z = _mm_slli_si128(Z, 12); B = _mm_srli_si128(B,  3); break;
>         case 13: Z = _mm_slli_si128(Z, 13); B = _mm_srli_si128(B,  2); break;
>         case 14: Z = _mm_slli_si128(Z, 14); B = _mm_srli_si128(B,  1); break;
>         case 15: Z = _mm_slli_si128(Z, 15); break;
>         default: __builtin_trap();
>     }
>
>     // Now we have this:
>     // B = 00 00 00 00 00 80 00 00 00 00 00 00 00 00 00
>     // Z = 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff
>     // X = 41 41 41 41 41 00 12 34 56 78 12 34 56 78 9A
>     //     <------------>    <------------------------>
>     //     key bytes w/nul   junk from stack
>
>     // Use PANDN to apply the mask, then POR to append the trailing bit
>     // required by SHA-1, which leaves us with this:
>     // X = 41 41 41 41 41 80 00 00 00 00 00 00 00 00 00
>     X = _mm_or_si128(_mm_andnot_si128(Z, X), B);
>
>     // SHA-1 requires us to bswap all the 32bit words in the message, which
>     // we can do now.
>     // FIXME: Do this with PSHUFB; this will do for now. SHA-1 requires this
>     // byte order for input words.
>     X = _mm_set_epi32(__builtin_bswap32(_mm_cvtsi128_si32(_mm_srli_si128(X, 12))),
>                       __builtin_bswap32(_mm_cvtsi128_si32(_mm_srli_si128(X, 8))),
>                       __builtin_bswap32(_mm_cvtsi128_si32(_mm_srli_si128(X, 4))),
>                       __builtin_bswap32(_mm_cvtsi128_si32(X)));
>
>     // Store the result and its length into the message buffer; we need the
>     // length in bits because SHA-1 requires the length to be part of the
>     // final message block (or, here, the only message block).
>     _mm_store_si128(&M[index], X);
>     _mm_store_si128(&N[index], _mm_set_epi32(len * CHAR_BIT, 0, 0, 0));
>
>     return;
> }
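> // A possible branch-free alternative to the switch above (an untested
> // sketch, not part of the tested patch; the table names are invented):
> // slide 16-byte windows over two constant 32-byte tables instead of
> // switching on an immediate shift count.
> //
> // static const uint8_t len_mask[32] = {
> //     0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,
> //     0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,
> //     0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> //     0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> // };
> // static const uint8_t stop_bit[32] = {
> //     0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> //     0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> //     0x80,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> //     0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
> // };
> //
> // // Keep the first len bytes of X and OR in the 0x80 terminator at
> // // offset len, using two unaligned loads and no branch:
> // X = _mm_or_si128(_mm_and_si128(X, _mm_loadu_si128((const __m128i *)(len_mask + 16 - len))),
> //                  _mm_loadu_si128((const __m128i *)(stop_bit + 16 - len)));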
> static char * sha1_fmt_get_key(int index)
> {
>     static uint32_t key[4];
>
>     // This function is not hot, we can do this slowly. First, restore
>     // endianness.
>     key[0] = __builtin_bswap32(M[index][0]);
>     key[1] = __builtin_bswap32(M[index][1]);
>     key[2] = __builtin_bswap32(M[index][2]);
>     key[3] = __builtin_bswap32(M[index][3]);
>
>     // Skip backwards until we hit the trailing bit, then remove it.
>     memset(memrchr(key, 0x80, sizeof key), 0x00, 1);
>
>     // Return a pointer to the static buffer.
>     return (char *) key;
> }
>
> static void sha1_fmt_crypt_all(int count)
> {
>     __m128i W[SHA1_BLOCK_WORDS];
>     __m128i A, B, C, D, E;
>     __m128i K;
>
>     A = _mm_set1_epi32(0x67452301);
>     B = _mm_set1_epi32(0xEFCDAB89);
>     C = _mm_set1_epi32(0x98BADCFE);
>     D = _mm_set1_epi32(0x10325476);
>     E = _mm_set1_epi32(0xC3D2E1F0);
>
>     // Zero the unused parts of W.
>     W[4] = W[5] = W[6]  = W[7]  = _mm_setzero_si128();
>     W[8] = W[9] = W[10] = W[11] = _mm_setzero_si128();
>
>     // Fetch the lengths.
>     W[12] = _mm_load_si128(&N[0]);
>     W[13] = _mm_load_si128(&N[1]);
>     W[14] = _mm_load_si128(&N[2]);
>     W[15] = _mm_load_si128(&N[3]);
>
>     // Fetch the message.
>     W[0] = _mm_load_si128(&M[0]);
>     W[1] = _mm_load_si128(&M[1]);
>     W[2] = _mm_load_si128(&M[2]);
>     W[3] = _mm_load_si128(&M[3]);
>
>     // Use a 4x4 integer matrix transpose to shuffle those words into
>     // position.
>     _MM_TRANSPOSE4_EPI32(W[12], W[13], W[14], W[15]);
>     _MM_TRANSPOSE4_EPI32(W[0],  W[1],  W[2],  W[3]);
>
>     K = _mm_set1_epi32(0x5A827999);
>
>     R1(W[0],  A, B, C, D, E);
>     R1(W[1],  E, A, B, C, D);
>     R1(W[2],  D, E, A, B, C);
>     R1(W[3],  C, D, E, A, B);
>     R1(W[4],  B, C, D, E, A);
>     R1(W[5],  A, B, C, D, E);   // 5
>     R1(W[6],  E, A, B, C, D);
>     R1(W[7],  D, E, A, B, C);
>     R1(W[8],  C, D, E, A, B);
>     R1(W[9],  B, C, D, E, A);
>     R1(W[10], A, B, C, D, E);   // 10
>     R1(W[11], E, A, B, C, D);
>     R1(W[12], D, E, A, B, C);
>     R1(W[13], C, D, E, A, B);
>     R1(W[14], B, C, D, E, A);
>     R1(W[15], A, B, C, D, E);   // 15
>
>     X(W[0],  W[2],  W[8],  W[13]);  R1(W[0],  E, A, B, C, D);
>     X(W[1],  W[3],  W[9],  W[14]);  R1(W[1],  D, E, A, B, C);
>     X(W[2],  W[4],  W[10], W[15]);  R1(W[2],  C, D, E, A, B);
>     X(W[3],  W[5],  W[11], W[0]);   R1(W[3],  B, C, D, E, A);
>
>     K = _mm_set1_epi32(0x6ED9EBA1);
>
>     X(W[4],  W[6],  W[12], W[1]);   R2(W[4],  A, B, C, D, E);   // 20
>     X(W[5],  W[7],  W[13], W[2]);   R2(W[5],  E, A, B, C, D);
>     X(W[6],  W[8],  W[14], W[3]);   R2(W[6],  D, E, A, B, C);
>     X(W[7],  W[9],  W[15], W[4]);   R2(W[7],  C, D, E, A, B);
>     X(W[8],  W[10], W[0],  W[5]);   R2(W[8],  B, C, D, E, A);
>     X(W[9],  W[11], W[1],  W[6]);   R2(W[9],  A, B, C, D, E);   // 25
>     X(W[10], W[12], W[2],  W[7]);   R2(W[10], E, A, B, C, D);
>     X(W[11], W[13], W[3],  W[8]);   R2(W[11], D, E, A, B, C);
>     X(W[12], W[14], W[4],  W[9]);   R2(W[12], C, D, E, A, B);
>     X(W[13], W[15], W[5],  W[10]);  R2(W[13], B, C, D, E, A);
>     X(W[14], W[0],  W[6],  W[11]);  R2(W[14], A, B, C, D, E);   // 30
>     X(W[15], W[1],  W[7],  W[12]);  R2(W[15], E, A, B, C, D);
>     X(W[0],  W[2],  W[8],  W[13]);  R2(W[0],  D, E, A, B, C);
>     X(W[1],  W[3],  W[9],  W[14]);  R2(W[1],  C, D, E, A, B);
>     X(W[2],  W[4],  W[10], W[15]);  R2(W[2],  B, C, D, E, A);
>     X(W[3],  W[5],  W[11], W[0]);   R2(W[3],  A, B, C, D, E);   // 35
>     X(W[4],  W[6],  W[12], W[1]);   R2(W[4],  E, A, B, C, D);
>     X(W[5],  W[7],  W[13], W[2]);   R2(W[5],  D, E, A, B, C);
>     X(W[6],  W[8],  W[14], W[3]);   R2(W[6],  C, D, E, A, B);
>     X(W[7],  W[9],  W[15], W[4]);   R2(W[7],  B, C, D, E, A);
>
>     K = _mm_set1_epi32(0x8F1BBCDC);
>
>     X(W[8],  W[10], W[0],  W[5]);   R3(W[8],  A, B, C, D, E);   // 40
>     X(W[9],  W[11], W[1],  W[6]);   R3(W[9],  E, A, B, C, D);
>     X(W[10], W[12], W[2],  W[7]);   R3(W[10], D, E, A, B, C);
>     X(W[11], W[13], W[3],  W[8]);   R3(W[11], C, D, E, A, B);
>     X(W[12], W[14], W[4],  W[9]);   R3(W[12], B, C, D, E, A);
>     X(W[13], W[15], W[5],  W[10]);  R3(W[13], A, B, C, D, E);   // 45
>     X(W[14], W[0],  W[6],  W[11]);  R3(W[14], E, A, B, C, D);
>     X(W[15], W[1],  W[7],  W[12]);  R3(W[15], D, E, A, B, C);
>     X(W[0],  W[2],  W[8],  W[13]);  R3(W[0],  C, D, E, A, B);
>     X(W[1],  W[3],  W[9],  W[14]);  R3(W[1],  B, C, D, E, A);
>     X(W[2],  W[4],  W[10], W[15]);  R3(W[2],  A, B, C, D, E);   // 50
>     X(W[3],  W[5],  W[11], W[0]);   R3(W[3],  E, A, B, C, D);
>     X(W[4],  W[6],  W[12], W[1]);   R3(W[4],  D, E, A, B, C);
>     X(W[5],  W[7],  W[13], W[2]);   R3(W[5],  C, D, E, A, B);
>     X(W[6],  W[8],  W[14], W[3]);   R3(W[6],  B, C, D, E, A);
>     X(W[7],  W[9],  W[15], W[4]);   R3(W[7],  A, B, C, D, E);   // 55
>     X(W[8],  W[10], W[0],  W[5]);   R3(W[8],  E, A, B, C, D);
>     X(W[9],  W[11], W[1],  W[6]);   R3(W[9],  D, E, A, B, C);
>     X(W[10], W[12], W[2],  W[7]);   R3(W[10], C, D, E, A, B);
>     X(W[11], W[13], W[3],  W[8]);   R3(W[11], B, C, D, E, A);
>
>     K = _mm_set1_epi32(0xCA62C1D6);
>
>     X(W[12], W[14], W[4],  W[9]);   R2(W[12], A, B, C, D, E);   // 60
>     X(W[13], W[15], W[5],  W[10]);  R2(W[13], E, A, B, C, D);
>     X(W[14], W[0],  W[6],  W[11]);  R2(W[14], D, E, A, B, C);
>     X(W[15], W[1],  W[7],  W[12]);  R2(W[15], C, D, E, A, B);
>     X(W[0],  W[2],  W[8],  W[13]);  R2(W[0],  B, C, D, E, A);
>     X(W[1],  W[3],  W[9],  W[14]);  R2(W[1],  A, B, C, D, E);   // 65
>     X(W[2],  W[4],  W[10], W[15]);  R2(W[2],  E, A, B, C, D);
>     X(W[3],  W[5],  W[11], W[0]);   R2(W[3],  D, E, A, B, C);
>     X(W[4],  W[6],  W[12], W[1]);   R2(W[4],  C, D, E, A, B);
>     X(W[5],  W[7],  W[13], W[2]);   R2(W[5],  B, C, D, E, A);
>     X(W[6],  W[8],  W[14], W[3]);   R2(W[6],  A, B, C, D, E);   // 70
>     X(W[7],  W[9],  W[15], W[4]);   R2(W[7],  E, A, B, C, D);
>     X(W[8],  W[10], W[0],  W[5]);   R2(W[8],  D, E, A, B, C);
>     X(W[9],  W[11], W[1],  W[6]);   R2(W[9],  C, D, E, A, B);
>     X(W[10], W[12], W[2],  W[7]);   R2(W[10], B, C, D, E, A);
>     X(W[11], W[13], W[3],  W[8]);   R2(W[11], A, B, C, D, E);   // 75
>
>     // A75 has an interesting property: it is the first word that is
>     // (almost) part of the final MD (E79 ror 2). The common case will be
>     // that this doesn't match, so we bail early here.
>     _mm_store_si128(MD, E);
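>     // Why this works: each SHA-1 round only renames the registers,
>     // (A,B,C,D,E) <- (T, A, B rol 30, C, D), so the final E is just the A
>     // produced four rounds earlier rotated left by 30. sha1_fmt_binary()
>     // rotates the target E left by 2 (rol 2 == ror 30 for 32-bit words),
>     // so it can be compared right here and the remaining rounds skipped.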
>
> #if 0
>
>     // For the benefit of developers who need to hack on this in future,
>     // here is what I would do if I wanted to do the full 80 rounds. We
>     // store the registers in an unusual way, so the final transposition
>     // may not be obvious.
>     X(W[12], W[14], W[4],  W[9]);   R2(W[12], E, A, B, C, D);
>     X(W[13], W[15], W[5],  W[10]);  R2(W[13], D, E, A, B, C);
>     X(W[14], W[0],  W[6],  W[11]);  R2(W[14], C, D, E, A, B);
>     X(W[15], W[1],  W[7],  W[12]);  R2(W[15], B, C, D, E, A);
>
>     // Transpose the final digest, stored as a 5x4 integer matrix, so that
>     // we can use MOVDQA to store the final result. This is annoyingly
>     // tricky, but means we do not have to shuffle the results outside of
>     // XMM registers.
>     //
>     // Desired result:
>     //
>     //   A[0] B[0] C[0] D[0]
>     //   E[0] A[1] B[1] C[1]
>     //   D[1] E[1] A[2] B[2]
>     //   C[2] D[2] E[2] A[3]
>     //   B[3] C[3] D[3] E[3]
>     //
>     {
>         __m128i T, U;
>         __m128i V, W, X, Y, Z;
>
>         T = _mm_unpacklo_epi32(A, C);
>         U = _mm_unpacklo_epi32(B, D);
>         V = _mm_unpacklo_epi32(T, U);
>
>         T = _mm_unpacklo_epi32(A, E);
>         U = _mm_unpacklo_epi32(C, B);
>         T = _mm_shuffle_epi32(T, _MM_SHUFFLE(1, 2, 3, 0));
>         U = _mm_unpackhi_epi32(U, T);
>         W = _mm_shuffle_epi32(U, _MM_SHUFFLE(0, 2, 1, 3));
>
>         T = _mm_unpacklo_epi32(D, E);
>         U = _mm_unpackhi_epi32(A, B);
>         U = _mm_shuffle_epi32(U, _MM_SHUFFLE(1, 0, 1, 0));
>         T = _mm_unpackhi_epi32(T, U);
>         X = _mm_shuffle_epi32(T, _MM_SHUFFLE(3, 1, 2, 0));
>
>         T = _mm_unpackhi_epi32(C, D);
>         U = _mm_unpackhi_epi32(E, A);
>         U = _mm_shuffle_epi32(U, _MM_SHUFFLE(1, 2, 3, 0));
>         T = _mm_unpacklo_epi32(T, U);
>         Y = _mm_shuffle_epi32(T, _MM_SHUFFLE(3, 1, 2, 0));
>
>         T = _mm_unpackhi_epi32(B, D);
>         U = _mm_unpackhi_epi32(C, E);
>         Z = _mm_unpackhi_epi32(T, U);
>
>         _mm_store_si128((void *)(&MD[0][0]), V);  // A3 B3 C3 D3
>         _mm_store_si128((void *)(&MD[0][4]), W);  // E3 A2 B2 C2
>         _mm_store_si128((void *)(&MD[1][3]), X);  // D2 E2 A1 B1
>         _mm_store_si128((void *)(&MD[2][2]), Y);  // C1 D1 E1 A0
>         _mm_store_si128((void *)(&MD[3][1]), Z);  // B0 C0 D0 E0
>     }
>
> #endif
>
>     return;
> }
>
> // This intrinsic is not always available in GCC.
> static inline int _mm_testz_si128 (__m128i __M, __m128i __V)
> {
>     return __builtin_ia32_ptestz128 ((__v2di)__M, (__v2di)__V);
> }
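> // If SSE4.1 PTEST is not available at all, it could be emulated with plain
> // SSE2 like this (an untested sketch; the function name is invented):
> //
> // static inline int _mm_testz_si128_sse2(__m128i M, __m128i V)
> // {
> //     // All 16 bytes of (M & V) equal zero iff PTEST would set ZF.
> //     __m128i Z = _mm_cmpeq_epi8(_mm_and_si128(M, V), _mm_setzero_si128());
> //     return _mm_movemask_epi8(Z) == 0xFFFF;
> // }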
>
> // This is a modified SSE2 port of Algorithm 6-2 from "Hacker's Delight" by
> // Henry S. Warren, ISBN 0-201-91465-4. Returns non-zero if any double word
> // in X is zero, using a branchless algorithm. -- taviso.
> static inline int _mm_testz_epi32 (__m128i __X)
> {
>     __m128i M = _mm_cmpeq_epi32(__X, __X);
>     __m128i Z = _mm_srli_epi32(M, 1);
>     __m128i Y = _mm_andnot_si128(_mm_or_si128(_mm_or_si128(_mm_add_epi32(_mm_and_si128(__X, Z), Z), __X), Z), M);
>     return ! _mm_testz_si128(Y, M);
> }
>
> static int sha1_fmt_cmp_all(uint32_t *binary, int count)
> {
>     __m128i A = _mm_cmpeq_epi32(_mm_set1_epi32(binary[4]), _mm_load_si128(MD));
>
>     // This function is hot, we need to do this quickly.
>     return _mm_testz_epi32(_mm_andnot_si128(A, _mm_cmpeq_epi32(A, A)));
> }
>
> // This function is not hot; it will only be called about once per 2^30
> // random crypts.
> static int sha1_fmt_cmp_one(void *binary, int index)
> {
>     uint32_t full_sha1_digest[SHA1_DIGEST_WORDS];
>     SHA_CTX ctx;
>
>     SHA1_Init(&ctx);
>     SHA1_Update(&ctx, sha1_fmt_get_key(index), strlen(sha1_fmt_get_key(index)));
>     SHA1_Final((unsigned char *) full_sha1_digest, &ctx);
>
>     // Remove the IV, as sha1_fmt_binary() does.
>     full_sha1_digest[0] = __builtin_bswap32(full_sha1_digest[0]) - 0x67452301;
>     full_sha1_digest[1] = __builtin_bswap32(full_sha1_digest[1]) - 0xEFCDAB89;
>     full_sha1_digest[2] = __builtin_bswap32(full_sha1_digest[2]) - 0x98BADCFE;
>     full_sha1_digest[3] = __builtin_bswap32(full_sha1_digest[3]) - 0x10325476;
>     full_sha1_digest[4] = rotateleft(__builtin_bswap32(full_sha1_digest[4]) - 0xC3D2E1F0, 2);
>
>     // Compare the result.
>     return memcmp(binary, full_sha1_digest, sizeof full_sha1_digest) == 0;
> }
>
> static int sha1_fmt_cmp_exact(char *source, int count)
> {
>     return 1;
> }
>
> static int sha1_fmt_get_hash0(int index) { return MD[index] & 0x0000000F; }
> static int sha1_fmt_get_hash1(int index) { return MD[index] & 0x000000FF; }
> static int sha1_fmt_get_hash2(int index) { return MD[index] & 0x00000FFF; }
> static int sha1_fmt_get_hash3(int index) { return MD[index] & 0x0000FFFF; }
> static int sha1_fmt_get_hash4(int index) { return MD[index] & 0x000FFFFF; }
> static int sha1_fmt_get_hash5(int index) { return MD[index] & 0x00FFFFFF; }
> static int sha1_fmt_get_hash6(int index) { return MD[index] & 0x07FFFFFF; }
>
> static int sha1_fmt_binary0(uint32_t *binary) { return binary[4] & 0x0000000F; }
> static int sha1_fmt_binary1(uint32_t *binary) { return binary[4] & 0x000000FF; }
> static int sha1_fmt_binary2(uint32_t *binary) { return binary[4] & 0x00000FFF; }
> static int sha1_fmt_binary3(uint32_t *binary) { return binary[4] & 0x0000FFFF; }
> static int sha1_fmt_binary4(uint32_t *binary) { return binary[4] & 0x000FFFFF; }
> static int sha1_fmt_binary5(uint32_t *binary) { return binary[4] & 0x00FFFFFF; }
> static int sha1_fmt_binary6(uint32_t *binary) { return binary[4] & 0x07FFFFFF; }
>
> struct fmt_main sha1_fmt_main = {
>     .params = {
>         .label              = "rawsha1_sse4",
>         .format_name        = "Raw SHA-1 (taviso sse4 build)",
>         .algorithm_name     = "rawsha1_sse4",
>         .benchmark_comment  = "",
>         .benchmark_length   = 1,
>         .plaintext_length   = sizeof(__m128i) - 1,
>         .binary_size        = SHA1_DIGEST_SIZE,
>         .salt_size          = 0,
>         .min_keys_per_crypt = 4,
>         .max_keys_per_crypt = 4,
>         .flags              = FMT_CASE | FMT_8_BIT | FMT_SPLIT_UNIFIES_CASE,
>         .tests              = sha1_fmt_tests,
>     },
>     .methods = {
>         .init       = fmt_default_init,
>         .prepare    = fmt_default_prepare,
>         .valid      = sha1_fmt_valid,
>         .split      = fmt_default_split,
>         .binary     = sha1_fmt_binary,
>         .salt       = fmt_default_salt,
>         .salt_hash  = fmt_default_salt_hash,
>         .set_salt   = fmt_default_set_salt,
>         .set_key    = sha1_fmt_set_key,
>         .get_key    = sha1_fmt_get_key,
>         .clear_keys = fmt_default_clear_keys,
>         .crypt_all  = sha1_fmt_crypt_all,
>         .get_hash = {
>             [0] = sha1_fmt_get_hash0,
>             [1] = sha1_fmt_get_hash1,
>             [2] = sha1_fmt_get_hash2,
>             [3] = sha1_fmt_get_hash3,
>             [4] = sha1_fmt_get_hash4,
>             [5] = sha1_fmt_get_hash5,
>             [6] = sha1_fmt_get_hash6,
>         },
>         .binary_hash = {
>             [0] = sha1_fmt_binary0,
>             [1] = sha1_fmt_binary1,
>             [2] = sha1_fmt_binary2,
>             [3] = sha1_fmt_binary3,
>             [4] = sha1_fmt_binary4,
>             [5] = sha1_fmt_binary5,
>             [6] = sha1_fmt_binary6,
>         },
>         .cmp_all   = sha1_fmt_cmp_all,
>         .cmp_one   = sha1_fmt_cmp_one,
>         .cmp_exact = sha1_fmt_cmp_exact,
>     },
> };

--
-------------------------------------
taviso@...xchg8b.com | pgp encrypted mail preferred
-------------------------------------------------------