Message-ID: <20140413082311.GB22732@openwall.com>
Date: Sun, 13 Apr 2014 12:23:11 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

Hi Katja,

Thank you for bringing this back to the list!

On Fri, Apr 11, 2014 at 06:54:39PM +0200, Katja Malvoni wrote:
> Here are some better news about this.

Great!

> Redesigning PS-PL communication resulted in improvement. I have working
> design with 70 bcrypt cores. Performance is 2162 c/s on 71 MHz frequency.
> 2 cycles are needed for one Blowfish round. Computation on host is
> overlapped with computation on FPGA 5/6th of the time.

Off-list, you had reported 67 cores at 71 MHz doing 1895 c/s when you
had 4 cycles/round, or so I understood you (possibly incorrectly, since
this data was split across multiple e-mails).  70 cores would be doing
something like 1895*70/67 = 1980 c/s.  At 2 cycles/round, the speed
should be almost twice that, but you're reporting "only" 2162 c/s.  Why
is that?  Are we possibly bumping into computation on the host, despite
the 5/6 overlap, now that you've halved the cycles per round?

If I'm not mistaken, at $2a$05 the host's computation is at around 1.8%
of the FPGA's:

(512+64)/(512*2*32) = 0.01758

This corresponds to a required minimum host speed, to avoid the host
becoming the bottleneck, of:

2162*0.01758 = 38 c/s

We were actually getting 84 c/s with (semi-)optimized code on one of
these ARM cores.  That's 2x+ higher than the required minimum, yet we
may be close enough that, along with maybe less optimal code,
communication overhead, and less than perfect overlapping, this has a
significant impact on the (mostly missing) speed improvement when going
from 4 to 2 cycles/round.
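To make this arithmetic easy to re-check, here it is as a tiny
standalone C program.  The 512+64 vs. 512*2*2^cost split is my reading
of the host/FPGA division of work implied by the formula above; adjust
the constants if your actual split differs:

#include <stdio.h>

int main(void)
{
	int cost;

	/* Host: roughly one key expansion (~512 Blowfish encryptions)
	   plus the final-output work (counted as 64 above).  FPGA: the
	   2^cost main loop iterations of ~512*2 encryptions each. */
	for (cost = 5; cost <= 8; cost++)
		printf("$2a$%02d: host fraction = %.5f\n",
		    cost, (512.0 + 64) / (512 * 2 * (1 << cost)));

	/* Minimum host speed needed to keep up with 2162 c/s at $2a$05 */
	printf("min host speed = %.0f c/s\n",
	    2162 * (512.0 + 64) / (512 * 2 * 32));

	return 0;
}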
It makes sense to start by running some more benchmarks, though: what
speeds are you getting for 1 core (in FPGA), for the 2-cycle and
4-cycle versions?  What speeds are you getting for $2a$08 (which
reduces the relative cost of the host's computation by a factor of 8
compared to $2a$05)?

Once you've run the benchmarks above, you might want to try adding
OpenMP to use the second ARM core (a rough sketch is in the P.S.
below).

You might also want to try implementing almost the entire bcrypt in
FPGA, although chances are that at least initially this will result in
much bigger cores, so fewer will fit.  This is unlikely to be a good
idea as long as we're able to provide enough CPU power for the roughly
2% of the total processing that stays on the host.  Yet it could be
worth trying at some point.

> Utilization is:
> Number of Slice Registers:     11,849 out of 106,400    11%
> Number of Slice LUTs:          44,811 out of  53,200    84%
> Number of occupied Slices:     12,914 out of  13,300    97%
> Number of RAMB36E1/FIFO36E1s:     140 out of     140   100%
> Number of BUFG/BUFGCTRLs:           2 out of      32     6%
>
> I can't fit more than 70 cores, BRAM is the limiting resource. If I
> don't store P, expanded key, salt and cost in BRAM, I have to store it
> in distributed RAM in order to keep the communication the way it is
> now. I can't use AXI4 bus to store something in register, it has to be
> a memory with address bus, data in and data out buses and write enable
> signal (actually, when I implement it such that it uses write enable,
> it's synthesized as distributed RAM. And write enable is the only way I
> can tell is the host writing or reading). LUT utilization for this
> design was around 55% for 4 bcrypt cores.

Ouch.  I think there was still much room for optimization there, while
keeping those things in distributed RAM.  Still, it might well be that
spending two BRAM blocks per bcrypt core is the optimal configuration.

Yet what about sharing a BRAM block across multiple cores - e.g., one
shared BRAM per two cores - for the tiny things (P, etc.)?  You have
two cycles/round, so clearly you're reading from P on only one of those
two cycles.  You could have a nearby core read its P from the same BRAM
on the other cycle.  Then you'd have three BRAMs per two cores, so 46
pairs of cores, or the equivalent of 92 current cores, would fit in
terms of BRAM.  Oh, but you're at 97% for Slices, so this is unlikely
to allow for a core count increase...

> Code: git clone https://github.com/kmalvoni/JohnTheRipper -b master

Can you also post a summary of what work is done on each of those two
cycles?

Are you still getting correct results on my ZedBoard only, but not on
yours (needing a lower core count for yours)?  And not on the
Parallella board either?  I suspect the limited power / core voltage
drop issue.  At 1.0 V core voltage, even a (peak) power usage of just
1.0 W means a current of 1.0 A, so if e.g. a PCB trace has a resistance
of 0.1 Ohm (I think this is too high, but not unrealistic) we might
have a voltage drop of 0.1 V right there, and that's 10% of the total.
That's not even considering limitations of the voltage regulator.  (I
am assuming that there's no voltage sense going back from the FPGA to
the voltage regulator.  I think there is not.)

As discussed off-list, I think you should also proceed with the ZTEX
board.  You mentioned that the documentation wasn't of sufficient help
for you to get communication going, right?  If so, I suggest that you
work primarily from working code examples, such as those for Bitcoin
and Litecoin mining, as well as from the vendor's SDK examples.

Overall, I am happy about the progress you're making on this project.

Thanks again,

Alexander
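P.S. Regarding OpenMP on the host: here's a rough sketch of the kind of
change I mean.  All type and function names below are made up for
illustration - the actual hooks in your tree will differ - but the
#pragma is the whole trick.  Build with gcc -fopenmp:

/* Made-up types/helpers for illustration; see the real code for the
   actual per-candidate host-side setup. */
struct bcrypt_job;
void bcrypt_host_setup(struct bcrypt_job *job);

/* Spread the per-candidate host-side work across both Cortex-A9 cores
   while the previous batch runs on the FPGA. */
void host_setup_batch(struct bcrypt_job **jobs, int count)
{
	int i;

#pragma omp parallel for
	for (i = 0; i < count; i++)
		bcrypt_host_setup(jobs[i]);
}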