Message-ID: <20150912095745.GA21500@openwall.com>
Date: Sat, 12 Sep 2015 12:57:45 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: SHA-1 H()

Lei,

On Sat, Sep 12, 2015 at 04:53:42PM +0800, Lei Zhang wrote:
> On my laptop, where each core supports 2 hardware threads, running 2
> threads gets a 2x speedup compared to 1 thread on the same core.

This happens, but it's not very common. Usually, the speedup from running
2 threads/core is much less than 2x.

> OTOH, each Power8 core supports up to 8 hardware threads, so I'd expect
> a higher speedup than just 2x.

SMT isn't only a way to increase resource utilization of a core when
running many threads. It's also a way to achieve lower latency due to
fewer context switches in server workloads (with lots of concurrent
requests), and to allow CPU designers to use higher instruction latencies
and achieve a higher clock rate. (Note that my two uses of the word
"latency" in the previous sentence refer to totally different latencies:
server response latency on the order of milliseconds may be improved, but
instruction latency on the order of nanoseconds may be harmed at the same
time.)

Our workload uses relatively low-latency instructions: integer only, and
with a nearly 100% L1 cache hit rate. Some other workloads, such as
multiplication of large matrices (exceeding the L1 data cache), might
benefit from more hardware threads per core (or from explicit
interleaving, but that's uncommon in scientific workloads except through
OpenCL and such), and that's also a reason for Power CPU designers to
support, and possibly optimize for, more hardware threads per core.

Finally, SMT provides middle ground between increasing the number of
ISA-visible CPU registers (which is limited by instruction size and the
number of register operands you can encode per instruction, as well as by
the need to maintain compatibility) and increasing the number of rename
registers.
With SMT, there are in effect more ISA-visible CPU registers: more in
total across the many hardware threads. Those registers are as good as
ISA-visible ones for the purpose of replacing the need to interleave
instructions within one thread, yet they don't bump into instruction size
limitations.

I expect that on a CPU with more than 2 hardware threads, the speedup
growth from increasing the number of threads/core in use is spread over
the range from 1 to the maximum thread count. So e.g. the speedup at only
2 threads on an 8-hardware-thread CPU may very well be less than the
speedup at 2 threads on a 2-hardware-thread CPU. I don't necessarily
expect the speedup achieved at max threads to be much, or any, greater
than that achieved at 2 threads on a CPU where 2 is the max. There's
potential for it to be greater (in the sense that the thread count
doesn't limit it to at most 2), but it might or might not be greater in
practice.

Alexander