john-dev - Re: Parallella: bcrypt

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130725020252.GC10917@openwall.com>
Date: Thu, 25 Jul 2013 06:02:52 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja, Yaniv -

On Wed, Jul 24, 2013 at 07:28:19PM +0200, Katja Malvoni wrote:
> When I moved the same code in function instead of leaving it under #define
> it compiled. I don't remember what was I doing wrong but after some changes
> it compiled as a macro too. I'm still working on optimizing BF_ROUND, at
> the moment I'm using two FPU instructions (IADD and IMADD) and I'm getting
> 926 c/s. I'm sorry for being so slow with this optimization, it took me a
> lot just to get it working.

That's OK.  So I am looking at parallella_e_bcrypt.c with the pieces of
inline asm, and here are my comments:

| #ifdef assembly

Thank you for keeping the plain C version in the same source file as
well.  That's how I like it.

| #define BF_ROUND(L, R, N) \
| __asm__ __volatile__ ( \
|         "and r44, %1, r25\n" \
|         "lsr r23, %1, 0xe\n" \
|         "and r23, r23, r21\n" \
|         "lsr r24, %1, 0x16\n" \
|         "and r24, r24, r21\n" \
|         "imadd r44, r44, r46\n" \
|         "ldr r23, [r18, +r23]\n" \
|         "ldr r24, [r26, +r24]\n" \
|         "lsr r22, %1, 6\n" \
|         "and r22, r22, r21\n" \
|         "iadd r23, r24, r23\n" \
|         "ldr r22, [r19, +r22]\n" \
|         "ldr r27, [r45], 0x1\n" \
|         "ldr r44, [r20, +r44]\n" \
|         "eor %0, %0, r27\n" \
|         "eor r23, r22, r23\n" \
|         "add r23, r23, r44\n" \
|         "eor %0, %0, r23\n" \
|         : "+r" (R) \
|         : "r" (L) \
| );

The code itself mostly looks good to me (including your delayed use of
results from IMADD and IADD).  Shouldn't you re-order these two, though? -

|         "eor %0, %0, r27\n" \
|         "eor r23, r22, r23\n" \

because r22 is loaded sooner than r27?  Well, maybe this makes no
difference on the current chip, but it might if load's latency is
increased in a future revision of Epiphany.

With 2x interleave, once you implement it, you might be able to
reasonably replace the "add r23, r23, r44" with IADD, because you'd have
some more instructions to put before the EOR.  You might need to put two
rounds into the macro, though (so that the two instances' rounds are
slightly out of sync in order to have more parallelism in each round's
end-game).

Now, here's an issue/bug in the above: you rely on registers being
preserved across multiple pieces of inline asm, but gcc does not
guarantee you that.  Also, you don't declare which registers you
clobber.  To fix this, your BF_ROUND should not be the entire __asm__
block, but rather just a portion of the string you put inside such
block.  The asm block itself, with proper confession on what registers
you clobber, should be in the BF_encrypt function.

Instead of the asm block before the loop, where you initialize register
values, I think you should simply have assignments to local variables in
C, and have the compiler allocate registers to those.  Then you
reference those variables via the "r" (var) notation in the one and only
asm block inside the loop (or better yet, you can have the loop in asm
too, which is easy to do once you have a macro for one or two rounds in
the form of a string).

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.