john-users - Re: other algorithms on ZTEX 1.15y?

Message-ID: <20170531173333.GA10584@openwall.com>
Date: Wed, 31 May 2017 19:33:33 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: other algorithms on ZTEX 1.15y?

On Wed, May 31, 2017 at 06:07:56AM -0800, Royce Williams wrote:
> Beyond the algorithms either already supported in john or implemented
> elsewhere (descrypt, bcrypt, DES), what other algorithms are feasible
> or worthwhile on ZTEX?

Are you aware of bcrypt already implemented on ZTEX elsewhere?  Where
exactly?  Have you tested?

Regarding DES, are you referring to Gifts' implementation?  Have you
tried using it, or anything else?

Maybe we need to add a plain DES cracker mode to JtR, like I think
hashcat has now (but not on FPGAs yet).

As to our developments so far, after the descrypt-ztex format Denis has
also been working on bcrypt-ztex, citing speeds of ~105k c/s per board
at bcrypt cost 5 - but this work is yet to be completed and merged.
Actual speeds will vary by cracking mode since the current synchronous
crypt_all() API combined with the not-so-fast USB interface results in
significant idle time when the candidate passwords are fed from the
host.  On-FPGA mask mode mostly avoids that (and so will an API revision
for asynchronous processing, but we haven't gotten around to that yet).
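
To illustrate the idle time issue, here is a minimal sketch of a
throughput model.  The batch size and timings are made-up numbers chosen
only for illustration (not measurements of any board), and the function
names are hypothetical rather than part of JtR's API:

```python
def synchronous_rate(batch, t_transfer, t_compute):
    """Candidates per second when the host sends a batch and then waits.

    Transfer and computation never overlap, so their times add up.
    """
    return batch / (t_transfer + t_compute)


def overlapped_rate(batch, t_transfer, t_compute):
    """Candidates per second when the next batch is sent during computation.

    The slower of the two stages becomes the only bottleneck.
    """
    return batch / max(t_transfer, t_compute)


if __name__ == "__main__":
    # Hypothetical figures: 1000 candidates per batch, 4 ms on USB,
    # 10 ms of computation on the board.
    batch, t_usb, t_fpga = 1000, 0.004, 0.010
    print("synchronous:", round(synchronous_rate(batch, t_usb, t_fpga)), "c/s")
    print("overlapped: ", round(overlapped_rate(batch, t_usb, t_fpga)), "c/s")
```

In this model, generating candidates on the FPGA corresponds to
t_transfer being close to zero, which is why mask mode mostly avoids the
penalty.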

> This project is working on WPA2 support, which seems interesting:
> 
> https://github.com/JarrettR/FPGA-Cryptoparty
> 
> From a brief review of the project's files, I infer that SHA1 and
> PBKDF2 would be possible on ZTEX. Would they be worth the effort?

For PBKDF2 with MD*/SHA-1/SHA-2, it should be possible to obtain
GPU-like speeds on ZTEX, much as these boards did for Bitcoin
mining (thus, one quad-FPGA board is roughly the same as one high-end GPU
from 2015 or so).  The purpose would be to put these boards to more
general use and to achieve better energy efficiency (compared to GPUs).
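
For reference, the WPA2 case from the quoted message comes down to
PBKDF2-HMAC-SHA1 with 4096 iterations and a 256-bit output, salted with
the SSID.  A minimal host-side sketch using Python's hashlib follows;
the helper names are hypothetical, and the compression count assumes the
usual optimization of precomputing the two HMAC key states once per
candidate:

```python
import hashlib


def wpa2_pmk(passphrase: str, ssid: str) -> bytes:
    """Derive the 32-byte WPA2 pairwise master key for one candidate."""
    return hashlib.pbkdf2_hmac(
        "sha1", passphrase.encode(), ssid.encode(), 4096, 32
    )


def sha1_compressions(iterations: int = 4096, dklen: int = 32,
                      digest_len: int = 20) -> int:
    """Rough count of SHA-1 compression calls per candidate.

    Assumes the ipad/opad key states are computed once (2 calls), and that
    each HMAC invocation then costs one inner and one outer compression,
    which holds while salt plus block index fits in one SHA-1 block.
    """
    blocks = -(-dklen // digest_len)  # ceiling division: 2 blocks for 32 bytes
    return 2 + blocks * iterations * 2


if __name__ == "__main__":
    print(wpa2_pmk("password", "IEEE").hex())
    print(sha1_compressions(), "SHA-1 compressions per candidate")
```

The second function is what matters for sizing an FPGA design: the work
per candidate is dominated by a fixed, fairly large number of SHA-1
compressions, which is the kind of workload that pipelines well.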

For fast unsalted hashes, good speeds may only be achieved for up to a
few thousand hashes loaded for cracking.  This is a lot worse than with
GPUs, which handle millions.  So focusing on PBKDF2 makes more sense.

We haven't come up with a good enough idea for a generic password hashing
soft CPU yet.  My current thinking is that, to avoid running into the BRAM
port count limit for the register file as we would with instructions doing
little work each, maybe we should have different bitstreams for
different crypto primitives like MD5, SHA-1, etc. (one at a time) and
have those available through very high latency instructions in the soft
CPU to allow for full pipelining - thus, 64 cycles latency for MD5, etc.
We'd also have a handful of simpler instructions (same or similar in the
different bitstreams) for implementing higher-level crypto schemes
around the current bitstream's crypto primitive (this way, the same
bitstream will be usable for multiple higher-level schemes sharing the
same crypto primitive).  These would include data copying and control
transfer instructions.  A tough question is how to combine the extremely
high-latency crypto instructions with control flow transfer - do we have
something like 63 delay slots?  SPARC has 1, some DSPs have a few, but I've never
heard of an ISA having tens of delay slots.  Yet maybe this is the way
to go.
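
To make the latency concern concrete, here is a toy simulation of a
fully pipelined unit with 64 cycles of latency that accepts one new
operation per cycle.  It is only a sketch under assumptions that are not
part of the design discussed above: each chain is a hypothetical
sequence of dependent operations (for example one candidate password's
successive hash steps), and chains are issued round-robin:

```python
def issue_slot_utilization(latency, chains, ops_per_chain):
    """Fraction of cycles in which the pipelined unit accepts a new op.

    Each chain may issue its next op only once its previous result is out,
    that is `latency` cycles after that op was issued.
    """
    ready_at = [0] * chains              # cycle when each chain may issue again
    remaining = [ops_per_chain] * chains
    total = chains * ops_per_chain
    issued = cycle = start = 0
    while issued < total:
        # Round-robin search for one chain that can issue in this cycle.
        for i in range(chains):
            c = (start + i) % chains
            if remaining[c] > 0 and ready_at[c] <= cycle:
                ready_at[c] = cycle + latency
                remaining[c] -= 1
                issued += 1
                start = (c + 1) % chains
                break
        cycle += 1
    return issued / cycle


if __name__ == "__main__":
    for chains in (1, 32, 64):
        u = issue_slot_utilization(latency=64, chains=chains, ops_per_chain=10)
        print(f"{chains:2d} independent chains: utilization {u:.3f}")
```

With fewer than 64 independent chains the unit sits partly idle, which
is the same gap that delay slots would otherwise have to fill with
useful instructions.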

Meanwhile, or alternatively, maybe we need PBKDF2-SHA* bitstreams.
There are many JtR formats that use PBKDF2, so it would have been a
primary candidate for implementation on the soft CPU anyway.

For NTLM, we could use a soft CPU having an MD4 primitive, but then do
we have anything else needing MD4?  Perhaps just raw-MD4?  That's very
rare, and other MD4-based things are probably even more rare.  So
perhaps a separate bitstream for NTLM as well, or maybe one usable for
NTLM and for raw-MD4 (different placement of characters into the current
block in on-FPGA mask mode; the rest of the difference can probably be
handled on host).

LM will need to be its own bitstream, although it could be a revision of
the descrypt design.  Denis probably has specific thoughts on it.

Technically, we could share a bitstream between descrypt and LM, as the
differences are basically the IV (0 vs. non-0), the iteration count (25 vs. 1),
and the salt size (12 vs. 0 bits, but we can simply set the 12 bits to all 0's),
but this would be suboptimal.
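
The parameter differences listed above can be written down as a small
configuration sketch.  The class and function names are hypothetical,
and only the three parameters named above are modelled (how the DES key
is built from the password also differs between the two schemes and is
left out here):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesCoreConfig:
    initial_block: bytes  # 8-byte block fed into the first DES encryption
    iterations: int       # number of chained DES encryptions
    salt_bits: int        # width of the salt that perturbs the E expansion


# descrypt: all-zero block, 25 iterations, 12-bit salt.
DESCRYPT = DesCoreConfig(initial_block=bytes(8), iterations=25, salt_bits=12)

# LM: the fixed constant "KGS!@#$%", a single iteration, no salt.
LM = DesCoreConfig(initial_block=b"KGS!@#$%", iterations=1, salt_bits=0)


def salt_for_shared_core(cfg: DesCoreConfig, salt: int = 0) -> int:
    """Value to drive onto a shared core's 12 salt bits for this scheme."""
    if cfg.salt_bits == 0:
        return 0  # LM: simply set all 12 bits to 0
    if not 0 <= salt < (1 << cfg.salt_bits):
        raise ValueError("salt out of range")
    return salt


if __name__ == "__main__":
    print(DESCRYPT.iterations // LM.iterations, "times more DES work per hash")
    print(salt_for_shared_core(LM), salt_for_shared_core(DESCRYPT, 0xABC))
```

A core built for 25 iterations and salted expansion carries logic that
LM never uses, which is one way to see why sharing would be suboptimal.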

Overall, most JtR formats (perhaps 90%+, with the exception of scrypt and
the like) could be reasonably implemented for ZTEX, but a speedup over
GPU is expected for only a few (bcrypt, maybe Lotus/Domino), the
required effort is substantial, and there's almost no demand.

Alexander