Message-ID: <20201112205125.GA25194@openwall.com>
Date: Thu, 12 Nov 2020 21:51:25 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: SIMD performance impact

On Thu, Oct 15, 2020 at 07:08:32PM +0200, Solar Designer wrote:
> I've just added benchmarks of AWS EC2 c5.24xlarge
> (2x Intel Xeon Platinum 8275CL, 3.6 GHz all-core turbo) and AWS EC2

I think the clock rate is actually way lower when running AVX-512 code.
The 3.6 GHz all-core turbo is probably for at most 128-bit SIMD.

> c5a.24xlarge (AMD EPYC 7R32, ~3.3 GHz sustained turbo) as text files
> linked from these AWS EC2 instance names at:
>
> https://www.openwall.com/john/cloud/
>
> The Intel benchmark uses AVX-512, the AMD one uses AVX2, except where
> the corresponding JtR format doesn't support SIMD (e.g., bcrypt) or
> doesn't support wide SIMD (e.g., scrypt uses plain AVX).
>
> AVX-512 wins by a large margin, but on the other hand it's two Intel
> chips for the 96 vCPUs vs. just one AMD chip for the same vCPU count.
> Much higher TDP for the two chips, too.
>
> Some highlights, Intel:
>
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
> Many salts: 561512K c/s real, 5906K c/s virtual
> Only one salt: 85685K c/s real, 1415K c/s virtual
> AMD:
>
> Benchmarking: descrypt, traditional crypt(3) [DES 256/256 AVX2]... (96xOMP) DONE
> Many salts: 408354K c/s real, 4262K c/s virtual
> Only one salt: 64290K c/s real, 668373 c/s virtual

With today's further changes in PR #4453 and further experiments, these
are now improved to:

Intel:

$ GOMP_CPU_AFFINITY=0-47 OMP_NUM_THREADS=48 ./john -test -form=descrypt
Will run 48 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (48xOMP) DONE
Many salts: 941985K c/s real, 19649K c/s virtual
Only one salt: 102051K c/s real, 2126K c/s virtual

The affinity was only needed because our default benchmark is quick -
during actual cracking, the scheduler eventually does the right thing on
its own, without an explicit affinity setting.

$ ./john -test -form=descrypt
Will run 96 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
Many salts: 862912K c/s real, 9047K c/s virtual
Only one salt: 84194K c/s real, 1048K c/s virtual

AMD:

Benchmarking: descrypt, traditional crypt(3) [DES 256/256 AVX2]... (96xOMP) DONE
Many salts: 541163K c/s real, 5640K c/s virtual
Only one salt: 65731K c/s real, 686305 c/s virtual

Actual cracking, Intel:

$ OMP_NUM_THREADS=48 ./john ~/pw-fake-unix -form=descrypt -mask -len=7 -progress=10
Using default input encoding: UTF-8
Loaded 3269 password hashes with 2243 different salts (1.5x same-salt boost) (descrypt, traditional crypt(3) [DES 512/512 AVX512F])
Will run 48 OpenMP threads
Using default mask: ?1?2?2?2?2?2?2
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:10 0.00% (ETA: 2020-11-16 22:59) 0g/s 378092p/s 908193Kc/s 1323MC/s Izodeaa..K0kueaa
0g 0:00:00:20 0.01% (ETA: 2020-11-16 16:23) 0g/s 405301p/s 920561Kc/s 1341MC/s Cj42iaa..Qzc5iaa
0g 0:00:00:30 0.01% (ETA: 2020-11-16 16:23) 0g/s 405368p/s 923657Kc/s 1345MC/s Idusnaa..Kb0cnaa
0g 0:00:00:40 0.01% (ETA: 2020-11-16 14:52) 0g/s 412159p/s 924954Kc/s 1348MC/s Cstqraa..Qdf2raa
0g 0:00:00:50 0.02% (ETA: 2020-11-16 15:10) 0g/s 410828p/s 925635Kc/s 1348MC/s I7xosaa..Koissaa
0g 0:00:01:00 0.02% (ETA: 2020-11-16 15:22) 0g/s 409941p/s 925972Kc/s 1349MC/s 90rptaa..M8pftaa
0g 0:00:01:10 0.02% (ETA: 2020-11-16 15:31) 0g/s 409307p/s 926259Kc/s 1349MC/s wjw5maa..Ez77maa
0g 0:00:01:20 0.02% (ETA: 2020-11-16 14:52) 0g/s 412210p/s 926397Kc/s 1350MC/s 9bkudaa..Mj3gdaa
0g 0:00:01:30 0.03% (ETA: 2020-11-16 15:02) 0g/s 411465p/s 926564Kc/s 1350MC/s ws42yaa..Edc5yaa
0g 0:00:01:40 0.03% (ETA: 2020-11-16 15:10) 0g/s 410869p/s 926655Kc/s 1350MC/s u7hsuaa..3o0cuaa
0g 0:00:01:50 0.03% (ETA: 2020-11-16 14:44) 0g/s 412839p/s 926670Kc/s 1350MC/s w8sqbaa..E7v2baa
0g 0:00:02:00 0.04% (ETA: 2020-11-16 14:52) 0g/s 412228p/s 926705Kc/s 1350MC/s uzxogaa..30esgaa
0g 0:00:02:10 0.04% (ETA: 2020-11-16 14:59) 0g/s 411710p/s 926117Kc/s 1349MC/s lbrppaa..fjpfpaa
0g 0:00:02:20 0.04% (ETA: 2020-11-16 15:05) 0g/s 411267p/s 926082Kc/s 1349MC/s Xlw5jaa..hd77jaa
0g 0:00:02:30 0.05% (ETA: 2020-11-16 14:46) 0g/s 412685p/s 926008Kc/s 1349MC/s lokufaa..fs3gfaa
0g 0:00:02:40 0.05% (ETA: 2020-11-16 14:52) 0g/s 412236p/s 925966Kc/s 1349MC/s X952waa..h7m5waa
0g 0:00:02:50 0.05% (ETA: 2020-11-16 14:57) 0g/s 411840p/s 925962Kc/s 1349MC/s Pwhsxaa..r01cxaa
0g 0:00:03:00 0.05% (ETA: 2020-11-16 15:02) 0g/s 411488p/s 925929Kc/s 1349MC/s Tu9fqaa..Zpsqqaa

I deliberately ran the default mask, which doesn't fit this set of
passwords well, so that there wouldn't be a flood of cracks in this test
(like there would be with "--incremental").

$ OMP_NUM_THREADS=12 ./john ~/pw-fake-unix -form=descrypt -mask -len=7 -progress=10 -fork=4
Using default input encoding: UTF-8
Loaded 3269 password hashes with 2243 different salts (1.5x same-salt boost) (descrypt, traditional crypt(3) [DES 512/512 AVX512F])
Will run 12 OpenMP threads per process (48 total across 4 processes)
Node numbers 1-4 of 4 (fork)
Using default mask: ?1?2?2?2?2?2?2
Press 'q' or Ctrl-C to abort, almost any other key for status
3 0g 0:00:00:10 0.00% (ETA: 2020-11-16 16:27) 0g/s 101376p/s 237165Kc/s 345475KC/s Xlwyaap..egkhaap
1 0g 0:00:00:10 0.00% (ETA: 2020-11-16 16:27) 0g/s 101376p/s 234361Kc/s 341393KC/s Xlwyaaa..egkhaaa
2 0g 0:00:00:10 0.00% (ETA: 2020-11-16 16:27) 0g/s 101376p/s 238274Kc/s 347124KC/s Xlwyaam..egkhaam
4 0g 0:00:00:10 0.00% (ETA: 2020-11-16 16:27) 0g/s 101376p/s 235158Kc/s 342610KC/s Xlwyaa0..egkhaa0
3 0g 0:00:00:20 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 238071Kc/s 346885KC/s axi1aap..o651aap
1 0g 0:00:00:20 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 236577Kc/s 344729KC/s axi1aaa..o651aaa
2 0g 0:00:00:20 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 239237Kc/s 348628KC/s axi1aam..o651aam
4 0g 0:00:00:20 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 237385Kc/s 345861KC/s axi1aa0..o651aa0
3 0g 0:00:00:30 0.01% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238402Kc/s 347413KC/s irjoeap..rbhneap
1 0g 0:00:00:30 0.01% (ETA: 2020-11-16 14:26) 0g/s 103628p/s 237316Kc/s 345879KC/s X9xieaa..erjoeaa
2 0g 0:00:00:30 0.01% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 239565Kc/s 349089KC/s irjoeam..rbhneam
4 0g 0:00:00:30 0.01% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238136Kc/s 347043KC/s irjoea0..rbhnea0
3 0g 0:00:00:40 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 238448Kc/s 347542KC/s ayrkeap..ow7keap
1 0g 0:00:00:40 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 237561Kc/s 346192KC/s ayrkeaa..ow7keaa
4 0g 0:00:00:40 0.01% (ETA: 2020-11-16 13:28) 0g/s 104755p/s 238377Kc/s 347427KC/s ayrkea0..ow7kea0
2 0g 0:00:00:40 0.01% (ETA: 2020-11-16 12:02) 0g/s 106444p/s 239603Kc/s 349169KC/s nw7keam..s53geam
3 0g 0:00:00:50 0.02% (ETA: 2020-11-16 12:53) 0g/s 105431p/s 238403Kc/s 347419KC/s i3f3eap..rok9eap
1 0g 0:00:00:50 0.02% (ETA: 2020-11-16 12:53) 0g/s 105431p/s 237649Kc/s 346311KC/s i3f3eaa..rok9eaa
2 0g 0:00:00:50 0.02% (ETA: 2020-11-16 11:46) 0g/s 106782p/s 239560Kc/s 349141KC/s lok9eam..mhc8eam
4 0g 0:00:00:50 0.02% (ETA: 2020-11-16 12:53) 0g/s 105431p/s 238464Kc/s 347512KC/s i3f3ea0..rok9ea0
william (u245-des)
3 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238262Kc/s 347217KC/s ncisiap..sv5siap
1 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 237611Kc/s 346302KC/s ncisiaa..sv5siaa
2 1g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0.01666g/s 105881p/s 239410Kc/s 348910KC/s ncisiam..sv5siam
4 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238409Kc/s 347423KC/s ncisia0..sv5sia0

Slightly higher cumulative speed with some use of "--fork" (4 processes
with 12 threads each): ~953M total. It'd still take full use of
"--fork" instead of OpenMP to get beyond 1 billion like I mentioned
before, but we're getting closer to that with OpenMP now.

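As a rough check of the ~953M figure, the per-process "Kc/s" values can be summed from one round of status lines. The sketch below is only an illustration: the helper name total_kcs is hypothetical (not part of JtR), and taking the 1:00 status lines as the sample is an assumption about which round was summed.

```python
import re

# Status lines copied from the 1:00 mark of the 4-process "--fork" run above.
# Choosing this particular round as the sample is an assumption.
STATUS_LINES = """\
3 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238262Kc/s 347217KC/s ncisiap..sv5siap
1 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 237611Kc/s 346302KC/s ncisiaa..sv5siaa
2 1g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0.01666g/s 105881p/s 239410Kc/s 348910KC/s ncisiam..sv5siam
4 0g 0:00:01:00 0.02% (ETA: 2020-11-16 12:31) 0g/s 105881p/s 238409Kc/s 347423KC/s ncisia0..sv5sia0
"""

# Match only the lowercase-c "Kc/s" field; the regex is case-sensitive,
# so the uppercase "KC/s" field on the same line is ignored.
RATE_RE = re.compile(r"(\d+)Kc/s")


def total_kcs(text):
    """Hypothetical helper: sum every Kc/s value found in the status text."""
    return sum(int(m.group(1)) for m in RATE_RE.finditer(text))


total = total_kcs(STATUS_LINES)
print(f"{total}K c/s cumulative, i.e. about {total / 1000:.1f}M")
```

The same helper can be pointed at any other round of per-process status lines, as long as each process contributes exactly one line to the sample.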
Actual cracking, AMD:

$ ./john ~/pw-fake-unix -form=descrypt -mask -len=7 -progress=10
Using default input encoding: UTF-8
Loaded 3269 password hashes with 2243 different salts (1.5x same-salt boost) (descrypt, traditional crypt(3) [DES 256/256 AVX2])
Will run 96 OpenMP threads
Using default mask: ?1?2?2?2?2?2?2
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:10 0.00% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 554729Kc/s 807469KC/s ceh3aaa..0cqaeaa
0g 0:00:00:20 0.00% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 552576Kc/s 804401KC/s vi1weaa..Edi9eaa
0g 0:00:00:30 0.01% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 552507Kc/s 804952KC/s 9ookiaa..Dybziaa
0g 0:00:00:40 0.01% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 551293Kc/s 803060KC/s Lnkmoaa..Fh2koaa
0g 0:00:00:50 0.01% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 551107Kc/s 803009KC/s Hr3inaa..rbrcnaa
0g 0:00:01:00 0.01% (ETA: 2020-11-19 11:15) 0g/s 235929p/s 550738Kc/s 802691KC/s Xll5naa..bkporaa

Somehow slightly better speed than the benchmark reported.

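To put a number on "slightly better", here is a minimal sketch comparing the AMD cracking speed with the AMD "Many salts" benchmark figure. Both numbers are copied from the output above; using the 1:00 status line as the cracking speed, and the helper name percent_gain, are assumptions made for illustration.

```python
# AMD "Many salts" benchmark result, in K c/s (from the benchmark above).
BENCH_MANY_SALTS_KCS = 541163
# AMD cracking speed at the 1:00 status line, in K c/s (assumed sample point).
CRACKING_KCS = 550738


def percent_gain(new, old):
    """Hypothetical helper: relative change of new over old, in percent."""
    return (new - old) / old * 100.0


gain = percent_gain(CRACKING_KCS, BENCH_MANY_SALTS_KCS)
print(f"cracking is {gain:.1f}% faster than the benchmark")
```

The earlier status lines show somewhat higher Kc/s values, so the gap depends on which sample is taken.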
$ ./john ~/pw-fake-unix -form=descrypt -mask -len=7 -progress=10 -fork=4
Using default input encoding: UTF-8
Loaded 3269 password hashes with 2243 different salts (1.5x same-salt boost) (descrypt, traditional crypt(3) [DES 256/256 AVX2])
Will run 24 OpenMP threads per process (96 total across 4 processes)
Node numbers 1-4 of 4 (fork)
Using default mask: ?1?2?2?2?2?2?2
Press 'q' or Ctrl-C to abort, almost any other key for status
1 0g 0:00:00:10 0.00% (ETA: 2020-11-19 11:16) 0g/s 58923p/s 140385Kc/s 204331KC/s pmysaaa..Edlmaaa
4 0g 0:00:00:10 0.00% (ETA: 2020-11-19 11:16) 0g/s 58923p/s 142035Kc/s 206615KC/s pmysaa0..Edlmaa0
2 0g 0:00:00:10 0.00% (ETA: 2020-11-19 11:16) 0g/s 58923p/s 142565Kc/s 207307KC/s pmysaam..Edlmaam
3 0g 0:00:00:10 0.00% (ETA: 2020-11-19 11:16) 0g/s 58923p/s 142049Kc/s 206629KC/s pmysaap..Edlmaap
1 0g 0:00:00:20 0.00% (ETA: 2020-11-19 11:16) 0g/s 58952p/s 140374Kc/s 204434KC/s Apxuaaa..Jvpkaaa
2 0g 0:00:00:20 0.00% (ETA: 2020-11-19 11:16) 0g/s 58952p/s 142562Kc/s 207573KC/s Apxuaam..Jvpkaam
3 0g 0:00:00:20 0.00% (ETA: 2020-11-19 11:16) 0g/s 58952p/s 142113Kc/s 206932KC/s Apxuaap..Jvpkaap
4 0g 0:00:00:20 0.00% (ETA: 2020-11-19 11:16) 0g/s 58952p/s 142150Kc/s 206983KC/s Apxuaa0..Jvpkaa0
1 0g 0:00:00:30 0.01% (ETA: 2020-11-19 11:16) 0g/s 58962p/s 140409Kc/s 204600KC/s G0awaaa..d99zaaa
2 0g 0:00:00:30 0.01% (ETA: 2020-11-19 11:16) 0g/s 58962p/s 142498Kc/s 207804KC/s G0awaam..d99zaam
4 0g 0:00:00:30 0.01% (ETA: 2020-11-19 11:16) 0g/s 58962p/s 142124Kc/s 207219KC/s G0awaa0..d99zaa0
3 0g 0:00:00:30 0.01% (ETA: 2020-11-19 11:16) 0g/s 58962p/s 142075Kc/s 207126KC/s G0awaap..d99zaap
2 0g 0:00:00:40 0.01% (ETA: 2020-11-19 01:55) 0g/s 62653p/s 142468Kc/s 207566KC/s 9os8aam..Yre4aam
1 0g 0:00:00:40 0.01% (ETA: 2020-11-19 11:16) 0g/s 58967p/s 140370Kc/s 204591KC/s ceh3aaa..3os8aaa
4 0g 0:00:00:40 0.01% (ETA: 2020-11-19 01:55) 0g/s 62653p/s 142110Kc/s 207042KC/s 9os8aa0..Yre4aa0
3 0g 0:00:00:40 0.01% (ETA: 2020-11-19 01:55) 0g/s 62653p/s 142033Kc/s 206943KC/s 9os8aap..Yre4aap
2 0g 0:00:00:50 0.01% (ETA: 2020-11-19 03:42) 0g/s 61919p/s 142436Kc/s 207550KC/s Byjieam..rbhneam
1 0g 0:00:00:50 0.01% (ETA: 2020-11-19 03:42) 0g/s 61919p/s 140354Kc/s 204506KC/s Byjieaa..rbhneaa
4 0g 0:00:00:50 0.01% (ETA: 2020-11-19 03:42) 0g/s 61919p/s 142082Kc/s 207004KC/s Byjiea0..rbhnea0
3 0g 0:00:00:50 0.01% (ETA: 2020-11-19 03:42) 0g/s 61919p/s 142003Kc/s 206883KC/s Byjieap..rbhneap
2 0g 0:00:01:00 0.01% (ETA: 2020-11-19 04:55) 0g/s 61440p/s 142449Kc/s 207571KC/s nw8meam..zxqdeam
1 0g 0:00:01:00 0.01% (ETA: 2020-11-19 04:55) 0g/s 61429p/s 140353Kc/s 204494KC/s nw8meaa..zxqdeaa
3 0g 0:00:01:00 0.01% (ETA: 2020-11-19 04:55) 0g/s 61429p/s 142009Kc/s 206947KC/s nw8meap..zxqdeap
4 0g 0:00:01:00 0.01% (ETA: 2020-11-19 04:55) 0g/s 61429p/s 142095Kc/s 207074KC/s nw8mea0..zxqdea0

That's ~567M total. Again, pure "--fork" in a non-OpenMP build would be
somewhat faster, but anyhow these speeds are not bad for a CPU.

Alexander