|
Message-ID: <20081017002539.GA16615@openwall.com> Date: Fri, 17 Oct 2008 04:25:39 +0400 From: Solar Designer <solar@...nwall.com> To: john-users@...ts.openwall.com Subject: Re: Breaking UNIX crypt() on the PlayStation 3 On Mon, Oct 13, 2008 at 02:18:38AM +0000, Marc Bevand wrote: > I talked about that in a private discussion with Simon Marechal. The PPU > indeed implements the AltiVec SIMD instruction set, which includes about the > same logical instructions as the SPU ones (except nand, or-not, xnor), and 32 > 128-bit registers. An S-box would almost fit in the 32 registers and may need > about 5-10 more gates because of the lack of nand/or-not/xnor. So I think the > PPU could provide 70-90% the perf of one SPU. Isn't the PPU capable of issuing more instructions per cycle (than the SPUs)? If not more AltiVec instructions, then maybe more PowerPC instructions intermixed with AltiVec ones? BTW, this additional optimization - using native 64-bit instructions intermixed with vector ones - might work on all of: SSE2, AltiVec, SPU. > If mux doesn't exist, there is no other option than to replace with other > instructions. To regain the lost speed, the bruteforcing space of sbox-gen can > simply be extended to search through more circuits. The core function of > sbox-gen, get_crackin(), tries various unary, binary, and ternary "operators". > By extending it to 4-ary, or 5-ary (or more) operators, it should produce even > better circuits. According to my estimations, bruteforcing up to 5-ary > operators should take about 50-100 days on a 2 GHz AMD64 CPU. This should probably be done, including for architectures that do have a mux instruction. > However, as I explained to Simon Marechal in the same discussion, I have > noticed that dumb bruteforcing like this doesn't always produce the best > result. By that I mean finding the smallest number of gates outputting bit 0, > then finding the smallest number of gates outputing bit 1 (based on the > circuit built so far for bit 0), etc doesn't necessarily end up with the best > circuit. Sometimes a solution with more gates to output bit 0 can lead to a > better circuit because more gates can be shared between bit 0 and bit 1. > > So a simple solution is to increase the bruteforcing space by also considering > the 2nd, 3rd, etc best solutions for each bits. But it quickly increases the > bruteforcing time... This is not surprising. After all, what we are trying to > achive, generating an optimal bitslice circuit, is an unsolved pb in computer > science. Yes. It is also possible to start with any bit number, not necessarily bit 1, and proceed to add gates for other bits in any order - then choose the best solution. This will make the search roughly 4! = 24 times slower, which is two days instead of two hours. Of course, this will likely not find the most optimal solution, so this approach can be combined with what you describe above. > > The "next level" could be making the "circuit" optimizer aware of more > > subtle properties of the target instruction set architecture (e.g., > > extra "moves" needed on 2-operand architectures, register count, invalid > > operand combinations) and specific CPU (instruction latencies and > > multiple-issue rules) - optimize the cycle count, not the "gate count". > > On "non-trivial" architectures/CPUs, it is almost certain that the most > > optimal "circuit", resulting in the best cycle count, is not the > > smallest one. > > Yeah these are good ideas. For the Cell, we would only need to make sbox-gen > aware of the instruction latencies (always 2 cycles for logical ones). BTW, another way to "satisfy" the latencies for the Cell would be by using 256-bit or longer "virtual" vectors, with repeated instructions. You have enough registers for that. > For other archs, it seems a very hard task to implement all of this. Yes, some of this is hard to implement, but if it is implemented for code generation (based on S-box "circuits") anyway, then rolling it into the same brute-forcing loop rather than having it as a separate step might be pretty easy. Of course, code generation becomes performance critical then, so it'd need to be in C and not, say, in Perl, and it may be cut down to calculating the number of instructions, maybe code size (if instructions are not of fixed size), and expected number of cycles (on a given CPU) - without actually emitting any code. A difficulty here is that some optimization approaches involve actually running the code - e.g., I've been brute-forcing instruction scheduling for x86-64 on Athlon64 and Xeon Nocona (need to repeat that on Core 2) - and the final cycle count is not known until that is done. > And maybe not > worth it because a human can do these optimizations pretty easily, after you > give him or her a good, low-gate count output of sbox-gen. Have you tried doing that for MMX or 8-register SSE2? ;-) It's not easy for a human to do at all. In fact, it is not easy even for 16-register SSE2 - and least for me it was easier to write a Perl script than to do it all by hand. Conversion to 2-op and register allocation should definitely be done by a computer, at least initially (then manual adjustments are possible). Dealing with invalid operand combinations may be done manually, although it is also pretty hard to do optimally (there's simply no spare register for the extra temporaries, and having the script leave a register unused "just in case" kills performance). More importantly, the lowest cycle count may likely be achieved with a "circuit" that is not the smallest. Alexander -- To unsubscribe, e-mail john-users-unsubscribe@...ts.openwall.com and reply to the automated confirmation request that will be sent to you.
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.