|
Message-ID: <20081012161508.GA21821@openwall.com> Date: Sun, 12 Oct 2008 20:15:08 +0400 From: Solar Designer <solar@...nwall.com> To: john-users@...ts.openwall.com Subject: Re: Breaking UNIX crypt() on the PlayStation 3 Hi Marc, Thank you for posting this - it's good stuff! On Tue, Oct 07, 2008 at 12:45:26AM +0000, Marc Bevand wrote: > The implementation I came up with averages 45.5 gates per S-box and is > capable of testing 11.5 million password/sec on a PS3 (1.9 Mpw/sec > per SPU core). > > Slides and source code: > http://www.epitech.eu/v4/perso/~bevand_m/talks/breaking-unix-crypt.pdf > http://www.epitech.eu/v4/perso/~bevand_m/cell-bf/cell-bf-1.tar.gz This is quite impressive. Here are some comments on your slides: Your implementation is specialized for a given target salt at compile time, and it tries candidate passwords "sequentially". Both of these things allow for greater candidate passwords per second rate, but they limit usability and reduce the success rate (defined as successful guesses per unit of time when cracking multiple target hashes, which, by the way, your code does not even support). This is fine, but it means that your comparison of cell-bf on Cell/PS3 vs. JtR on Q6700 is not exactly "fair" if you use it to illustrate relative performance of these CPUs and these programs - because the two programs differ in what they actually do. If we similarly modify the code in JtR to only allow for trying candidate passwords in an algorithm-dependent order, then its performance at "only one salt" will approach that at "many salts". That's roughly a 10% improvement. If we "hard-code" the salt, this might provide another 10% improvement at traditional crypt(3). Together, these changes bring us almost to the speed you've achieved on a Cell/PS3. (Obviously, "hard-coding" of the salt may be done at runtime with self-modifying code. Then it won't reduce usability on a given supported platform, but it will reduce portability and/or require separate implementations for different platforms. I had this as a questionable item on my to-do for years.) On the other hand, you're not making use of the PPU yet, and you might have under-estimated the performance improvement it will provide. As to your price comparison, it does not hold for many parts of the world. I've just checked the prices for PlayStation 3 (retail, new) in Moscow, Russia - they appear to start at $560. The prices for Q6700 (just the CPU) start at $250 (and Q6600 starts at $200), and it should be possible to have a reasonable cut-down computer for less than $500. Overall, I'd say that these are similar in terms of both performance per machine and performance per dollar. > The source code also includes an "S-box circuit bruteforcer" I wrote > based on Matthew Kwan's ideas (he made the S-boxes used in John the > Ripper itself). I ran it for 2-3 hours to find good S-box circuits > leveraging all the logical instructions available to the SPU. Thank you for releasing this code! I think that Matthew never released his code (he only released the resulting S-box C files, then released the paper describing the approach some years later - but never the actual optimizer code), nor am I aware of anyone else doing that (some people claimed impressive results, but never released anything). I think that your results are almost directly reusable for "traditional" PowerPC with AltiVec. The "selb" instruction appears to be equivalent to AltiVec's "vsel" and the vec_sel() C intrinsic. The only real difference appears to be in the number of registers (128 vs. 32), so maybe some of the S-boxes will need extra instructions on AltiVec. Something I am tempted to try (but don't have time right now) is to convert your "best.c" to AltiVec C intrinsics and put it into JtR, letting the compiler deal with register allocation - then see if it provides a performance improvement compared to the code currently in JtR (which is essentially Matthew Kwan's nonstd.c). Have you looked into this yet? Then, it'd be curious/useful to consider reusing/enhancing your optimizer for other target architectures. The major obstacle here appears to be your reliance on a mux instruction, which is usually missing. Can you suggest whether and how your current approach and code can reasonably be modified to avoid this dependency? (Of course, it is trivial to "emulate" mux with several other instructions, but the resulting instruction counts would likely be unacceptable. For example, in your best "circuit" for S4, as many as 15 out of 34 "gates" are mux.) The "next level" could be making the "circuit" optimizer aware of more subtle properties of the target instruction set architecture (e.g., extra "moves" needed on 2-operand architectures, register count, invalid operand combinations) and specific CPU (instruction latencies and multiple-issue rules) - optimize the cycle count, not the "gate count". On "non-trivial" architectures/CPUs, it is almost certain that the most optimal "circuit", resulting in the best cycle count, is not the smallest one. Thanks again, Alexander -- To unsubscribe, e-mail john-users-unsubscribe@...ts.openwall.com and reply to the automated confirmation request that will be sent to you.
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.