john-dev - Re: Handling of hashes with different iteration counts

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BLU0-SMTP24B6EDAFCD15B3A4F9C576FDAD0@phx.gbl>
Date: Tue, 28 Jan 2014 09:49:04 +0100
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Handling of hashes with different iteration counts

Alexander,

thank you very much for your valuable input.
I'll incorporate many of your suggestions into V2 of this patch.

On 01/28/2014 01:46 AM, Solar Designer wrote:
> On Tue, Jan 28, 2014 at 01:10:58AM +0100, Frank Dittrich wrote:
>> Any suggestions what to change?
> 
> Iteration count is just a special case of tunable cost.  We already have
> scrypt, which accepts three tunable cost parameters: N, r, p.  Maybe the
> format should export these as up to two - t_cost and m_cost (as they're
> called in PHC) - even if the underlying hash/KDF is more flexible. 

You are right, "iteration" (as well as "rounds") are too implementation
specific, and may not be the right terms for all hash algorithms.
I also prefer "tunable" over "configurable" (which I used in earlier
mails), because it is the more general term.

OK, if we intend to report and use up to two tunables right now, my
current approach will still work (if the 2nd array element points to
NULL, john can assume the cost to be 1).
If a format cannot convert all its relevant tunables into just 2 values,
we can still decide to check up to tree values without the need to
adjust formats that don't need that many tunables.

> For
> the scrypt example, as long as our implementation doesn't use TMTO and
> doesn't use scrypt's internal parallelism (since we've got plenty due to
> multiple candidate passwords), the format may export N*r as m_cost and
> N*r*p as t_cost.

I need to check scrypt's reference implementation or docs, to know what
each of these parameters means.
Currently, this doesn't make much sense to me, because if m_cost is N*r
and t_cost is N*r*p, it means t_cost = m_cost*p.
Without knowing the implementation details of scrypt, I would have
assumed that if N*r gets reported as m_cost, I would just report p as
t_cost, or r as m_cost and N*p as t_cost.

Also, if I have 2 tunables A and B which both influence run time, is it
really OK to report A*B as a single cost?
Even if 2A*B = A*2B (both resulting in roughly the same time for
computation), is it really OK to mix 2A*B and A*2B salts when cracking
those hashes, or will the cost be 2A*2B in this case?

But you don't need to reply on this, I'll first learn more about scrypt.

> And we'd need command-line options, similar to --salts,
> which will let us choose subsets of hashes based on ranges of these
> values (in fact, combined with --salts, this will be up to three ranges
> that the chosen hashes must or must not fall within).

Yes, that was my intention.
But --rounds and --rounds2 (my preferred choice before you replied) is
not such a good option name, --cost and --cost2 (--cost3 if required)
are better names.
I could use --cost_t and --cost_m, but may be this is too specific and
will be wrong for some future algorithms.
To be able to print a correct name for the tunable in question, instead
of "Loaded hashes with varying # of rounds (from 16 to 256)", we could
extend the format interface by an array of strings describing each tunable.

E.g., one format could call these tunables "time" and "memory", the next
one could use "iterations" and "salt length".
Then, --list=format-[all-]details could also print these tunables as a
comma separated list .

> Your array of functions idea could be good, but I think we'll need
> exactly two, so maybe just make it two functions, t_cost() and m_cost().

Currently, it doesn't really matter how many of these functions a format
specifies, I can always assume cost 1 if no format method is specified.
Since these functions need to be called just once per salt, they are not
really performance critical. (I could even drop the need to use
fmt_default_iteration_count (or fmt_default_cost...) and assume cost 1
if not even format method cost[0] exists.
As long as the framework just checks cost[0] and cost[1], additional
format methods listed in a format will just be ignored, or we could make
self test fail if there are too many format methods.
We could also use the constant which defines the array size to determine
which of the --cost2 ... --costN options should be printed on the usage
screen or in the --list=hidden-options output.

> Many future password hashing schemes are likely to have more than 2
> tunable parameters (and more than scrypt's 3, too), but for the purpose
> of choosing which hashes to focus attacks on, we may translate those
> many parameters into t_cost and m_cost for _our_ attacks.

I think it should depend on the format implementation which tunables (or
combination of tunables) to report in which sequence, and how many
different values are needed (or nice to have).
A GPU implementation might even decide to report another combination of
tunables (or the same tunables in another sequence) than a CPU
implementation.
I'd prefer the tunable with the highest impact (whatever highest means
if one tunable adjusts memory requirements and the other computation
time) to reported by the first format method, the other by the second...

> t_cost shouldn't be in real time units, but rather in some fixed units -
> e.g., for bcrypt it will be exactly the cost parameter specified with
> hashes.  I think it's OK to keep it the base2-logarithm of actual cost,
> since that's how bcrypt is defined.  Or do we prefer to make it linear,
> so that regardless of hash type if one doubles the acceptable t_cost,
> they should expect the c/s rate to be roughly twice lower?  That would
> make sense to me, too.  (Of course, actual effect on c/s rate will vary
> depending on actual t_cost's distribution across available hashes.)

I think reporting the "real" cost is more user friendly.
On the other hand, smaller values are easier to read.
We could assume that a person who uses john to detect poor passwords
knows that the cost is given as base2-logarithm of the iteration count.
But in this case, it should be well documented that a doubled run time
doesn't mean a doubled t_cost value.
(We need to document anyway that the cost values reported by one hash
algorithm have nothing in common with cost values reported by other hash
algorithms.)

> On the other hand, it'd be nice if our m_cost will be in bytes or KiB,
> and will not include build-specific memory allocation overhead.  That
> way, it'd be convenient to choose only hashes that fit in RAM, yet the
> same command-line will result in consistent behavior across systems.

My only preference is that format should be able to report all possible
cost values as unsigned int > 0.

> We could also have a just.conf setting that will specify the maximum
> memory size that JtR is permitted to allocate, and its implementation
> could use formats' exported m_cost in order to detect would-be-excessive
> memory usage without actually reaching the threshold.  (I briefly
> mentioned this idea in a discussion with Alexander Cherepanov before.)

For some "hash" algorithms this might be hard to do, e.g. some formats
that need to uncompress large files.

> https://password-hashing.net/call.html
> 
> "The t_cost and m_cost arguments are intended to parameterize time and
> memory usage, respectively"
> 
> BTW, there are interesting discussions going on here:
> 
> http://news.gmane.org/gmane.comp.security.phc

Very interesting, indeed.

Frank
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.