Message-ID: <435301E9.6010401@gmail.com>
Date: Sun, 16 Oct 2005 18:44:09 -0700
From: h1kari <0x31337@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: Using Hardwareaccelerators to speed up John

Thanks so much for contacting me regarding my work. I really look
forward to working with you guys and discussing this more. First, I
would like to point out that I have more detailed information on my
work on a website I just put up:

http://www.openciphers.org

I also have all of my source code published on SourceForge if you guys
want to look at it. Most of it is based on a modified version of Rudi
Usselman's opencores DES core. Oh, and my newer research has been in
cracking Lanman/NTLM passwords, so my older Unix DES code isn't on there
currently.

My comments are inline:

Solar Designer wrote:
>>>1. General-purpose FPGA-based boards. These would need to be programmed
>>>for the very specific task. I briefly evaluated this possibility back
>>>in 1998-1999 and it appeared that FPGAs would deliver roughly 5 times
>>>better DES performance for the money, compared against the most suitable
>>>CPUs (at the time, that was Alpha 21164PC - affordable and really good
>>>at bitslice DES). I used retail prices; the improvement could be a lot
>>>better for large quantities.

I know that some of the statistics in my older presentations were a bit
off. Currently, on our (Pico E-12) LX25 boards, we are able to clock our
design at 125MHz, either with one DES core cracking 128 hashes in
parallel or with the core instantiated 4 times cracking one hash at
125MHz * 4 keys per second. For Unix DES, it would essentially be the
Lanman performance / 25, since Unix DES requires 25 iterations of DES,
so the max performance of our card is currently ~50M c/s, which is a
little less than my projected number in the slides. I'd also like to
note that the clock speed should be able to be increased with additional
cooling and/or higher FPGA speed grades.
Currently it's limited to 125MHz because the chip goes into thermal
runaway if it's clocked higher without additional cooling (synthesis
says that it should run at 200MHz+).

> Indeed, it has.
>
> However, my estimate from 6+ years ago ("5 times better DES performance
> for the money") appears to still hold true for low-end FPGAs.
>
>>Also have a look at the slides:
>>http://www.ccc.de/congress/2004/fahrplan/files/340-fpga-slides.pdf
>
> On slide 35, "Password File Cracker", the following performance numbers
> are given:
>
> "PC (3.0Ghz P4 \w john)" - 300,000 c/s
> "Hardware (Low end FPGA \w jawn)" - 4,000,000 c/s
>
> I am assuming that these are for traditional DES-based crypt(3), which
> is 25 iterations of modified-DES.
>
> My guess is that the 3 GHz P4 benchmark was done with John 1.6, which
> did not yet use bitslice DES on x86 processors. Current publicly
> available development versions of John do around 700k c/s on 3 GHz P4s.
> Current non-public development versions of John do around 900k c/s with
> SSE code on the same P4s. (And as I have already mentioned, PPC G5s are
> even faster than that - up to 1.6M c/s - but they're more expensive.)
>
> So this gives us a 5x performance increase. As it relates to prices,
> low-end computers based on P4 Celerons (which are not any slower than
> "full" P4s for John) are likely cheaper than low-end general-purpose
> FPGA-based cards, both in retail quantities.

There are newer slides and source on openciphers. A lot of the
information I provided in my older talks was work-in-progress
statistics.

>>David Hulton claimed you'd be able to "crack password hashes as fast as
>>100+ PCs using FPGA PCMCIA cards on your laptop".
>
> This claim could become the reality, but we're not quite there yet.
>
> Slide 36, "Up & Coming", gives an estimate of 60M c/s for Picomon, the
> most powerful card (of those listed) that would be usable in a laptop.
> Given that the cards currently available from Pico Computing are priced
> at around $2,500, my guess is that this new card, when released, is not
> going to be cheaper. (I'm sure it's a lot cheaper to produce, but
> companies such as Pico Computing need to cover their development costs
> and make a profit.)
>
> If we're comparing this against desktop PCs, a similarly priced one
> would be Apple's PowerMac G5 with dual 2.7 GHz processors. It can do
> over 3M c/s. So the FPGA-based card is "only" 20 times faster (which is
> still a lot!), not 100+ times faster.

That is also with Xilinx's lowest-end Virtex-4 FPGA. Our newer boards
will feature up to an FX60, which should at least double the
performance, and we'll be able to interface with 2 onboard PowerPC
processors to do software acceleration, which has been one of my main
goals that I haven't been able to implement yet.

>>See http://www.ccc.de/congress/2004/fahrplan/event/244.en.html
>>(IIRC, "basic functionality of john the ripper" merely implemented
>>brute forcing a part of the key space.
>
> This is a very important observation you make. It's not only about c/s
> rates, but also about the order in which candidate passwords are tried.
> Much of John's success is due to its ability to try candidate passwords
> in an optimal order.
>
> With a PCMCIA card capable of hashing candidate passwords at a rate of
> 60 million per second, either the card itself will have to generate the
> candidate passwords to try (in a far less optimal order) or the laptop's
> CPU would become the bottleneck since it wouldn't be able to feed the
> card with candidate passwords at this high a speed.
>
> A similar problem exists with testing the computed hashes against a
> large number of those loaded for cracking. Perhaps the hash tables
> (used to quickly locate potentially matching hashes) will have to be
> loaded onto the FPGA card.

Yeah. I'm sorry it ended up coming across as a direct comparison to the
functionality of John.
The idea I was trying to get across was that when most people think of
password cracking, they think of John, and I was doing something
similar. Ideally, I'd like to have the FPGA act as a hardware
accelerator plugin for John and be able to directly enhance the speed of
checking based on intelligent wordlists. Right now our only niche with
this project is passwords that can't be easily cracked by John or
L0phtcrack.

>>But it should also be possible to let john create the password
>>candidates, and calculate (and compare) the hashes using FPGA
>>hardware, using an order of magnitude larger MAX_KEYS_PER_CRYPT
>>value than for general purpose CPUs.)
>
> Actually, fully-pipelined implementations of DES are not small, so you
> can only fit a handful of them onto current FPGAs. If I interpret the
> numbers from David's slides correctly, he has been assuming 1 to 5
> instances of DES per chip. So MAX_KEYS_PER_CRYPT would need to be
> rather small, unless a larger value is determined to help reduce the
> communication overhead, etc.

Yeah. The 16-stage pipeline method that most people use takes up
roughly 20% of the LX25 FPGA, mostly because of the S-boxes. When you're
talking about Unix DES, you have to feed in 16 passwords and wait 16*26
clock cycles to get all of the hashes out the other side.

> David,
>
> Some more comments on your publications:
>
> http://www.picocomputing.com/press/KeyRecoveryServer.pdf
>
> "World's Fastest Lanman/NTLM Key Recovery Server Shipped."
>
> This press release says that the server can try over 500M LM keys per
> second. Very impressive indeed. However, the claims that this is "250x
> faster than a top of the line CPU" and the "12 hours vs. 136 days" once
> again assume unoptimal software. (I am sure this is unintentional.)
>
> The current publicly available development version of John can do around
> 7M c/s at LM on modern P4s (2.8+ GHz) and 9.5M c/s at LM on G5 2.7 GHz.
> So your special-purpose server (with 10 FPGA cards) appears to be
> 50 to 70 times faster than individual general-purpose CPUs. (Curiously
> enough, my "5x speedup" estimate from 6+ years ago still holds true.)
>
> Also, I believe John's performance at LM hashes could be made 2-3 times
> better if I would re-design it to try candidate passwords in an order
> that is optimal for the low-level routines (essentially eliminating key
> setup overhead). Currently, John tries more likely passwords first,
> which is highly desirable when using it to detect weak passwords.
> Exhaustive key searches with no requirement to get the weakest passwords
> cracked early on are quite a different task.
>
> Please don't get me wrong, I find that you're doing the right thing and
> I'd be interested in possible cooperation. I just couldn't resist the
> temptation to defend my software. ;-)

I'm definitely glad that we're having this discussion. I think that we
ended up testing this against rainbowcrack or l0phtcrack when we did a
speed comparison. You should also note that the server that we built
didn't have optimal cooling, so we ended up running all of the cards at
50MHz to make sure they operated consistently out in the field. Keep in
mind that, if you give up the 128 parallel compares on the boards, we
have a mode you can use to run 4 cores in parallel on each card,
providing 4x the 500M LM keys/s performance that we mentioned in the
press release. We're working on a new prototype now that provides
cooling for each of our boards, which should additionally allow us to
clock all of the boards at 100MHz or more, which would double the
speeds.

Anyway, I definitely didn't mean to knock the work you guys are doing.
A lot of the benchmarking stuff is a little murky and it was hard to
find specific benchmarks from the different open source projects.
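
The figures discussed above can be cross-checked with a little
arithmetic. The following is only a rough sketch, not anything from the
openciphers code: it assumes that each pipelined DES core tests one LM
key per clock cycle, and the function names are made up for
illustration. The card count, clock speeds and John's LM rates are the
ones quoted in this thread.

```python
def lm_keys_per_second(cards, cores_per_card, clock_hz):
    """Aggregate LM key rate of a server.

    Assumption (not stated in the thread): every pipelined DES core
    tests one key per clock cycle.
    """
    return cards * cores_per_card * clock_hz


def speedup(hardware_rate, software_rate):
    """How many times faster the hardware is than one CPU."""
    return hardware_rate / software_rate


# Server as shipped: 10 cards, one core each, clocked at 50MHz.
shipped = lm_keys_per_second(cards=10, cores_per_card=1, clock_hz=50_000_000)

# Same server with 4 cores per card instead of 128 parallel compares.
four_core = lm_keys_per_second(cards=10, cores_per_card=4, clock_hz=50_000_000)

# Same server with better cooling, one core per card at 100MHz.
cooled = lm_keys_per_second(cards=10, cores_per_card=1, clock_hz=100_000_000)

# John's LM rates quoted above: about 7M c/s on a P4, 9.5M c/s on a G5.
vs_p4 = speedup(shipped, 7_000_000)
vs_g5 = speedup(shipped, 9_500_000)

print(f"shipped:   {shipped / 1e6:.0f}M LM keys/s")
print(f"four-core: {four_core / 1e6:.0f}M LM keys/s")
print(f"cooled:    {cooled / 1e6:.0f}M LM keys/s")
print(f"speedup:   {vs_g5:.0f}x (G5) to {vs_p4:.0f}x (P4)")
```

Under that one-key-per-clock assumption, the shipped configuration
comes out at the 500M LM keys per second from the press release, and
the ratios land in the "50 to 70 times" range quoted above.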
If you guys could provide some specs, I would really like to set up a
page that provides accurate performance information from all of the
projects, including rainbowcrack/l0phtcrack/etc., or maybe there's a
resource already out there for that.

As for future work, we've been doing a lot of research with the
Virtex-4 FX cards and the onboard PowerPCs, and we see a lot of
potential for using the APU bus to provide custom instructions to
software (John) that would allow you to accelerate your DES and other
functions with single instruction calls. I don't know how much this
would speed up John, considering the onboard PowerPCs can only be
clocked up to 450MHz, but it seems like it would at least be a bit of a
speed improvement over doing the crypto in software. Your comments on
this would be really appreciated. Also, if we were able to provide the
hardware end of this to you guys, would you be interested in tying it
into John?

Also, for the record, I'm perfectly fine with you guys running my code
on any other FPGA boards, and there are plenty of other ones out there
that are a lot cheaper than the ones that we sell. It would be really
cool if FPGA crypto acceleration started getting more mainstream, so I'd
totally support you guys if you want to port this to other, cheaper
boards.

Thanks,

-David