I've been thinking along similar lines, and also wondering what percentage of workloads still see significant benefits from SMT. It made a lot more sense in 2002, when it was still common to find servers with only one core, but now even a phone has half a dozen cores, laptops are pushing into the tens of cores, and servers are an order of magnitude beyond that.
I remember disabling SMT in the mid-2000s because it was a performance degradation for some memory-sensitive code one of my users was running. I'd be really curious what percentage of workloads are not limited by I/O but still stay below the level where resource contention between threads lowers performance.
DDR4 RAM has roughly the same latency (in nanoseconds) as DDR2 RAM. DDR5 will probably have about the same latency as well.
In 2002, our processors were so narrow they could only execute maybe 2 instructions per clock tick. Today, we've got CPUs that hit 4 (x86 decoding from L1 cache), 6 (x86 decoding from the uop cache), or 8 (Apple M1) instructions per tick.
With many problems still memory-latency bound (aka: any Java code going through blah->foo()->bar()... look at all those pointer indirections!!), it makes sense to "find other work to do" while the CPU stalls on memory latency.
The RAM supports higher bandwidth. The CPU supports more instructions-per-tick. Spending effort to find work to do just makes sense.
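The pointer-chasing pattern above can be sketched in a few lines; the class names (Blah, Foo, Bar) are hypothetical stand-ins for the kind of object graph typical Java code builds:

```python
# A sketch of the pointer-chasing pattern described above. Each attribute
# access is a dependent load: the CPU cannot start fetching the next object
# until the previous pointer has come back from memory, which is exactly
# the stall SMT tries to fill with work from another thread.
class Bar:
    def __init__(self, value):
        self.value = value

class Foo:
    def __init__(self, bar):
        self.bar = bar

class Blah:
    def __init__(self, foo):
        self.foo = foo

def deep_read(blah):
    # blah -> foo -> bar -> value: three dependent pointer dereferences,
    # each one potentially a full trip to main memory on a cache miss.
    return blah.foo.bar.value

obj = Blah(Foo(Bar(42)))
print(deep_read(obj))  # 42
```

In a flat array the hardware prefetcher can hide most of this latency; in a chain of heap objects it usually can't, because the next address isn't known until the current load completes.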
Oh, I'm aware of that concern, but I was wondering how many benchmarks show a significant win. Once you account for the impact on the processor caches, and for how small the fraction of apps is that are performance-sensitive yet still limited by cache-unfriendly random memory access, it seems like it could be an increasingly niche concern: most of the code people run either isn't performance-sensitive or has gotten better about memory access patterns after decades of hardware trending in that direction.
Apple's highly-competitive performance with processors which don't support SMT suggests that this is not such a clear win, especially given the non-trivial cost of the security mitigations.
Thanks - that’s the kind of spread I was curious about, since other factors like the cost of Spectre mitigations have added additional dimensions to the comparison.
It is very rare for programs to be memory bandwidth bound. It usually takes a lot of optimization just to get to that point, plus an access pattern that hammers bandwidth on top of it (such as looping through large arrays, doing only one simple calculation per index, then doing that on many cores).
The vast majority of what people run is memory latency bound and in those cases using extra threads makes sense so that the explicit parallelism can compensate for memory latency.
> (such as looping through large arrays, only doing one simple calculation to each index, then doing that on many cores).
...which perfectly describes a parallelized mat-vec-mult. Yes, that's not common in most applications, but I'd have a hard time naming a more basic operation in scientific (and related) computations.
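For concreteness, here's the operation in question; a minimal pure-Python sketch, with the bandwidth point noted in the comments:

```python
def mat_vec_mult(matrix, vec):
    # Matrix-vector multiply: one multiply-add per matrix element loaded.
    # The arithmetic intensity is very low, so for matrices much larger
    # than cache this is limited by how fast memory can stream the data
    # in, not by the ALUs -- the textbook memory-bandwidth-bound kernel.
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in matrix]

A = [[1, 2],
     [3, 4]]
x = [10, 1]
print(mat_vec_mult(A, x))  # [12, 34]
```

Parallelizing the outer loop across cores is what makes it saturate memory bandwidth, since each core streams its own rows.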
We are saying the same thing here, though I think you are missing the point that this is all a response to someone asking if SMT is useful anymore since there are many cores in almost every CPU.
The answer is that it is absolutely still useful since your example is niche and most software/systems can still benefit from being able to work around memory latency with more threads.
There are different ways to do that. Apple put effort into wide decoders, which increases single-threaded performance. Intel realistically cannot do the same because of the limitations of the variable-length x86 instruction format, so they worked around it with hyper-threading, which lets the core decode from two threads in parallel.
It has decreased, but halving a latency over 20-25 years is not exactly the kind of progress people intuitively associate with semiconductors.
What has decreased quite a lot though is the time to transfer, which is of course stacked on top of CL. CPUs always have to fetch a full cache line (usually 64 bytes), and the time to get 64 bytes out of memory has more than halved each generation.
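A back-of-the-envelope calculation of that transfer time, using the standard peak per-channel bandwidth figures for two generations (DDR2-800 at 6.4 GB/s, DDR4-3200 at 25.6 GB/s) as illustrative endpoints:

```python
# Time to stream one 64-byte cache line at peak per-channel bandwidth.
# The bandwidth figures are the standard peak rates for DDR2-800 and
# DDR4-3200; real effective bandwidth is lower, but the ratio holds.
CACHE_LINE_BYTES = 64

def line_transfer_ns(bandwidth_gb_per_s):
    # bytes / (bytes per second), converted to nanoseconds
    return CACHE_LINE_BYTES / (bandwidth_gb_per_s * 1e9) * 1e9

print(line_transfer_ns(6.4))   # DDR2-800:  10.0 ns per cache line
print(line_transfer_ns(25.6))  # DDR4-3200:  2.5 ns per cache line
```

So while the first-word latency (CL plus the rest of the access pipeline) has barely moved, the burst-transfer portion of fetching a line shrank 4x across those two generations.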
I would say most workloads benefit from SMT. I've done work disabling and enabling it and rerunning benchmarks to understand the impact.
As a gross over-generalization, disabling it can improve latency for compute-intensive workloads. Those aren't most workloads, though: at lower utilizations, almost everything sees some improvement and an increase in throughput from having SMT turned on.
As you push the workload beyond 60-70% utilization the wins from SMT fall away, but they generally don't impact you negatively. What happens is that the processor over-performs at lower utilizations, and the performance curve then bends back toward where it would have been without SMT.
The only workload I've run lately that buried all the cores was a heavily parallel CSV import into ClickHouse from compressed zips of multiple CSVs, while filtering the CSVs with Python. Running nproc=cores was 5% slower in overall throughput than nproc=cores+smt.