|
Message-ID: <20150624064347.GA27412@openwall.com> Date: Wed, 24 Jun 2015 09:43:47 +0300 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: optimizing bcrypt cracking on x86 On Wed, Jun 24, 2015 at 07:10:07AM +0300, Solar Designer wrote: > Now, the bottleneck does indeed appear to be different. It appears to > be taking too many instructions and uops to extract all of the S-box > indices and compute the effective addresses to keep the L1 data cache > read ports 100% busy at all times. (We only keep them about 60% busy.) BTW, optimizing the effective address calculation could make a difference. In my testing on Haswell, not of bcrypt code but in general, there is performance impact from using larger than 1-byte displacement, as well as from using an index register (even if the scaling factor is set to 1). Loads that use only a base register with either no displacement or a 1-byte displacement (sufficient e.g. to skip P-boxes and get to the first S-box) run faster (or rather, have less impact on issue rate of adjacent instructions). To avoid needing index scaling, we could use 64-bit S-box elements, using bits 33 to 2 for data. Extracting the lowest pre-scaled index is then quick. Extracting the remaining 3, not so much - can't just use BEXTR, need an AND. Maybe the cost of an AND can be amortized, if applied to two non-adjacent pre-scaled indices at once, and then BEXTR could be used. BMI's 3-op ANDN instruction may be used to avoid MOVs. Unfortunately, when using only a base register, we can't avoid either needing large displacements or ADDing them to that register first. Yet in my testing performance impact from using both a large displacement and an index at once is worse than from using either of these alone. So maybe large_displacement+base is better than small+base+index*4. ... Just tested on artificial code: yes, this is so, especially when the instruction is a load-op rather than a pure load, Unfortunately, there's also performance impact from larger than 1-byte immediate values in instructions such as AND. So we'd need to minimize the number of different masks and preload them into registers. This means that this approach is possibly only worthwhile on CPUs with HT, where we don't need more than 2x interleaving. Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.