Message-ID: <20151007031433.GA12826@openwall.com>
Date: Wed, 7 Oct 2015 06:14:33 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: Roman Rusakov <rusakovster@...il.com>
Subject: Re: nVidia Maxwell support (especially descrypt)?

On Tue, Oct 06, 2015 at 07:42:24PM +0200, magnum wrote:
> Have you seen this work by Janet Yellen? I can't recall it mentioned here.

I had not seen it. Cool stuff!

> https://github.com/DeepLearningJohnDoe/merikens-tripcode-engine/tree/master
> https://devtalk.nvidia.com/default/topic/860120/cuda-programming-and-performance/bitslice-des-optimization/post/4622827/#4622827
>
> "Gate counts: 25 24 25 18 25 24 24 23 (avg. 23.5)
> Depth: 8 7 7 6 8 10 10 8 (avg. 8)"

Roman's S4 posted here is 1 gate shorter (17 vs. 18):

http://www.openwall.com/lists/john-users/2014/09/18/2

> "With this version, I get a performance of 950 MH/s for UNIX DES
> crypt(3) (or equivalently 23750 MH/s for 1 round of DES) on my reference
> Gigabyte GTX 980 Ti (+270 MHz). Considering hashcat's implementation
> gets 165.5 MH/s on a GTX Titan X (+225 MHz), it's a great improvement.
> Even my naive implementation bounded by shared memory/synchronization
> with old SBOXes from JtR is faster (300 MH/s on 980 Ti +300 MHz)."

Also from that NVIDIA forum:

"As for your Nvidia/AMD comparison, I am currently getting 800MH/s on my
7990 with OpenCL and rewriting my implementation with a GCN assembler.
We will see how that goes :)"

These are very good speeds indeed. The "naive implementation ... with
old SBOXes from JtR" mentioned there was somehow using the
bitselect-lacking S-boxes (I just took a look at the code on GitHub), so
it would likely run even faster with the bitselect-enabled ones. OTOH,
it is very interesting how they appear to split (even in that naive
implementation) one DES computation across 4 "threads" (aka work-items),
for 4 different pairs of S-box lookups.
During this summer's CMIYC contest, I was getting ~235M c/s per Tahiti
at 1 GHz (in 7990s), using Sayantan's code in JtR and his instructions
for pre-building per-salt kernels (which takes about an hour). (Some
less lucky Catalyst versions on other machines got slightly lower speeds
on Tahiti.) I think this is ~2x faster than hashcat, but clearly (as
seen from the speeds above) it is possible to do better yet.

BTW, we should post such instructions somewhere public, e.g. include
them in a file under doc/ and/or put them on the wiki.

Also, I think Sayantan has changed the code greatly since then.

Running the latest code against 10 descrypt hashes on Titan X (stock
clocks) with -mask='?l?l?l?l?l?l?l?l', I get:

0g 0:00:02:31 1.27% (ETA: 09:15:40) 0g/s 17601Kp/s 181316Kc/s 181316KC/s GPU:66C util:100% fan:24% aayuspia..aayuspia

(BTW, the range of candidates looks weird here. I think it's the
initial "aa" that is being iterated, but that part isn't shown, making
it appear as though the start and end of the range are the same.)

On Tahiti (1050 MHz with Catalyst 15.7 here), it quickly gets to higher
speeds:

0g 0:00:02:19 1.53% (ETA: 08:34:44) 0g/s 22864Kp/s 233304Kc/s 233304KC/s aaiemika..aaiemika

I've just tried the latest code, and it pre-builds by default (is it
possible to disable this behavior, or are we only supporting per-salt
kernels now?), and it appears to do so before the start of cracking
(is it still possible to lazy-build during cracking?)

Oh, and the current code appears to build kernels for some salts (the
self-test ones?) multiple times - a minor bug?

And it still doesn't appear to allow easy use of multiple CPU cores for
the kernel building. I think it should be made --fork friendly for
that - I think this is easy to implement, but it's a topic for john-dev.
I found there's PARALLEL_BUILD in opencl_DES_hst_dev_shared.h now,
which enables use of OpenMP, but enabling it made the compiler crash
with a weird error (when targeting Titan X), so maybe something isn't
MT-safe.

Then, I don't understand why it pre-builds by default when
HARDCODE_SALT is not enabled by default:

#define OVERRIDE_AUTO_CONFIG 0
#define HARDCODE_SALT 0
#define FULL_UNROLL 0
#define PARALLEL_BUILD 0

IIRC, previously HARDCODE_SALT was the setting that enabled this mode.
We really need documentation for this, even if it's still in
development. And then we could proceed to discuss how to revise it to
make it more usable.

And of course we'll also need to include some LOP3.LUT S-boxes. If
Roman's are still unreleased (except for S4), then Janet's. Sayantan?

Alexander