Message-ID: <017701cbed50$f2429ed0$d6c7dc70$@net>
Date: Mon, 28 Mar 2011 09:03:42 -0500
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: john scalability

>From: magnum
>Sent: Saturday, March 26, 2011 2:59 PM
>
>Do the self tests check all five (ten) functions? I'll experiment a
>little.

Yes, but I think you found this answer already.

>However, I did some experimenting: Adding size 3 and 4 to mssql did not
>speed things up for 1M hashes, unless they are 1M hashes with very few
>salts. I have seen bad salt distribution with mssql but not that bad. So
>maybe this is very academic for all but the unsalted formats. Raw-sha1
>is one, but it is fixed in intrinsics-2.

For many formats, it matters little. For salted hashes, the three pairs of
original hash functions are usually more than enough, unless you run
against a single hash; but even then, if they are 'real' passcodes, there
will usually be only a few hashes per salt (unless the salt function is
terribly busted). It is the unsalted formats which need the larger set of
hashing functions.

One thing that 'could' speed things up, without blowing memory totally out
of whack, is to have fewer salt-hash functions, but then have the final
search (the collision resolution) done with a binary search instead of a
linear search. That way, if you have 30 million unsalted candidates and use
the current hash-5 (1M), then on average each hash 'bucket' gets 30 items.
Currently we check 30 items for each crypt (assuming failure). If each
bucket were sorted and we could bin-search, then we would find the item, or
know it does not exist, in about 5 comparisons. If we took the same 30
million candidate list but only used hash-4, then the average count in each
hash bucket would be 30*16 = 480. The current linear search would then have
to check 480 items on average for each crypt, but a bin-search would only
need about 9 comparisons.

The nice thing is that building this hash table (and the collision
resolution lists) happens only at startup, so getting the data into the
right format carries no runtime penalty. We would have to change the way
the hash lists are created; right now we have an array of unsorted lists
(I think).

If we look at changing the code like this, we would likely want to keep the
linear-search collision resolution model when the hash table is on the
sparse side, and switch over to the bin-search model when the table gets
overfull. The break-even point is probably where the average count in each
(used) hash bucket exceeds 6 to 10. I say 'used' because some of the hash
functions only populate some of the buckets, and may never hash anything
into certain values.

One additional improvement the hashed bin-search model can offer is that
the group of candidates we hash to can be kept very localized in memory,
which can also greatly reduce page faults on certain HW.

I am just throwing this idea out there now. I imagine Solar has thought
along these lines in the past, but has not invested the time to see the
impact, or to R/D what it would take to make this happen.

Jim.
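A minimal sketch (editorial illustration, not code from this thread or from
JtR itself) of the two collision-resolution models compared above: the
current linear scan of an unsorted bucket, versus a bucket sorted once at
startup and then probed with a binary search. The struct and function names
(bucket, find_linear, bucket_prepare, find_binary) and the sample bucket
contents are assumptions made up for the example.

/*
 * Illustrative only: linear vs. binary collision resolution within one
 * hash bucket.  Bucket layout and names are assumptions, not JtR's code.
 */
#include <stdio.h>
#include <stdlib.h>

/* One hash bucket: partial binary hashes of the loaded hashes that
 * landed in this bucket. */
struct bucket {
    unsigned int *vals;
    size_t count;
};

/* Current model: walk the whole bucket until a match or the end
 * (a miss costs 'count' comparisons). */
static int find_linear(const struct bucket *b, unsigned int key)
{
    size_t i;
    for (i = 0; i < b->count; i++)
        if (b->vals[i] == key)
            return 1;
    return 0;
}

static int cmp_uint(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *)a;
    unsigned int y = *(const unsigned int *)b;
    return (x > y) - (x < y);
}

/* Proposed model, step 1: sort each bucket once at startup
 * (no penalty during cracking). */
static void bucket_prepare(struct bucket *b)
{
    qsort(b->vals, b->count, sizeof(b->vals[0]), cmp_uint);
}

/* Proposed model, step 2: resolve a hit or miss in about log2(count)
 * comparisons with a binary search. */
static int find_binary(const struct bucket *b, unsigned int key)
{
    return bsearch(&key, b->vals, b->count, sizeof(b->vals[0]),
                   cmp_uint) != NULL;
}

int main(void)
{
    unsigned int sample[] = { 91, 7, 483, 22, 305, 68, 1200, 554 };
    struct bucket b = { sample, sizeof(sample) / sizeof(sample[0]) };

    printf("linear hit:  %d\n", find_linear(&b, 305));

    bucket_prepare(&b);                  /* startup-only cost */
    printf("binary hit:  %d\n", find_binary(&b, 305));
    printf("binary miss: %d\n", find_binary(&b, 9999));
    return 0;
}

With roughly 30 items per bucket, a miss costs about 30 comparisons with
find_linear() but only about log2(30) ~ 5 with find_binary(); with 16 times
fewer buckets (about 480 items per bucket) the counts become roughly 480
vs. log2(480) ~ 9, matching the numbers in the message above.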