Message-ID: <20131103220247.GA25424@openwall.com>
Date: Mon, 4 Nov 2013 02:02:47 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

Hi Katja,

On Sun, Nov 03, 2013 at 02:06:27PM +0100, Katja Malvoni wrote:
> On Wed, Oct 30, 2013 at 10:17 AM, Solar Designer <solar@...nwall.com> wrote:
> >
> > If so, does anything prevent you from optimizing this to? -
> >
> > Cycle 0: compute new R; swap L and R; initiate 4 S-box lookups
> > Cycle 1: wait
>
> I implemented this -

Great! I think your next step is to implement two instances of bcrypt
per core, so that there are no wait-only cycles. That is, in Cycle 1
above you would be doing the same kind of work as on Cycle 0, but for
the other instance. You may use the currently wasted halves of the same
RAM blocks (just set the most significant address bit when doing the
memory accesses for the second bcrypt instance) or you may use separate
RAM blocks - whichever results in lower utilization of other resources.

> performance on self test for one core is 79 c/s while
> for 14 cores it's 765 c/s. For cost 12 these numbers are 0.6656c/s for 1
> core and 8.002c/s for 14 cores. Overhead of loading data from shared BRAM
> into per core BRAMs is significant.

I think it's not only the overhead of loading data, but also the
overhead of host-side computation, which is not currently overlapped
with computation on the FPGA. Remember that you only implemented
bcrypt's variable-cost loop on the FPGA, keeping some fixed-cost
Blowfish stuff before and after this loop on the host CPU.

Although JtR's format interface currently requires that everything is
in sync by the time crypt_all() returns (no precomputation for next set
of candidate passwords possible at this point), you may nevertheless
overlap host and FPGA computation most of the time by making
max_keys_per_crypt several times higher and overlapping things inside
of crypt_all(), except for the very last subset of candidate passwords.
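
To make the overlap inside crypt_all() concrete, here is a rough sketch of one possible schedule. It is only an illustration under stated assumptions: host_prepare(), fpga_start(), fpga_wait(), host_finish(), crypt_all_sketch() and overlap_stats are hypothetical names, not JtR or ZedBoard driver functions, and the stubs do no real work. The sketch only counts how many host-side steps run while the FPGA is busy.

```c
#include <assert.h>

/* Counts of host-side steps, to show how much host work gets hidden. */
typedef struct {
    int hidden; /* host steps run while the FPGA was busy */
    int total;  /* all host steps: one setup and one finish per subset */
} overlap_stats;

/* Hypothetical no-op stand-ins; NOT real JtR or ZedBoard driver calls. */
static void host_prepare(int subset) { (void)subset; } /* fixed-cost Blowfish work before the loop */
static void fpga_start(int subset)   { (void)subset; } /* load per-core BRAMs, start variable-cost loop */
static void fpga_wait(int subset)    { (void)subset; } /* block until the cores are done */
static void host_finish(int subset)  { (void)subset; } /* fixed-cost Blowfish work after the loop */

/* Process `count` candidates in subsets of `subset_size` (assumed layout). */
overlap_stats crypt_all_sketch(int count, int subset_size)
{
    overlap_stats st = { 0, 0 };
    int n, i;

    assert(subset_size > 0 && count > 0 && count % subset_size == 0);
    n = count / subset_size;
    st.total = 2 * n;

    host_prepare(0);                 /* nothing to overlap with yet */
    for (i = 0; i < n; i++) {
        fpga_start(i);
        if (i > 0) {                 /* finish the previous subset meanwhile */
            host_finish(i - 1);
            st.hidden++;
        }
        if (i + 1 < n) {             /* prepare the next subset meanwhile */
            host_prepare(i + 1);
            st.hidden++;
        }
        fpga_wait(i);
    }
    host_finish(n - 1);              /* everything is in sync before returning */
    return st;
}
```

With four subsets per call, this schedule hides 6 of the 8 host-side steps, that is 3/4 of the host work; with a single subset nothing can be hidden.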

I suggest that you make this max_keys_per_crypt increase factor
configurable - at least at compile-time, or it can even be chosen at
runtime since the format's init() may modify max_keys_per_crypt.

For example, with 14 cores and two bcrypt instances per core, you'd have
min_keys_per_crypt at 28, but you may have max_keys_per_crypt at higher
multiples of 28 - e.g., 112. With that, you'd be able to overlap host
and FPGA computation 3/4 of the time (all but the last of the four
subsets of 28).

Is the above explanation clear? Please feel free to ask any questions
you might have.

> Maximum frequency is now 93.765 MHz although design seems to be working
> properly with 100 MHz clock.

OK. I think you might be able to optimize this and the LUT utilization
later. For now, please focus on getting a second instance of bcrypt per
core. I hope you'll be able to keep the core count at 14 or, if you
have to, very slightly lower.

In fact, you might be able to achieve better overall results with even
more than two bcrypt instances per core - try that after you get two
instances working.

Alexander