I've been thinking along similar lines, and also wondering what percentage of workloads still see significant benefits from SMT. It made a lot more sense in 2002, when it was still common to find servers with only one core, but now even a phone has half a dozen cores, laptops are pushing into the tens of cores, and servers are an order of magnitude beyond that.
I remember disabling SMT in the mid-2000s because it was a performance degradation for some memory-sensitive code one of my users was running. I'd be really curious what percentage of workloads are not limited by I/O but still stay below the level where resource contention between threads lowers performance.
DDR4 RAM has roughly the same latency (in nanoseconds) as DDR2 RAM. DDR5 will probably have about the same latency as well.
In 2002, our processors were so narrow they could only execute maybe 2 instructions per clock tick. Today, we've got CPUs that hit 4 (x86 decoding from L1 cache), 6 (x86 decoding from the uop cache), or 8 (Apple M1) instructions per tick.
With many problems still memory-latency bound (aka: any Java code going through blah->foo()->bar()... look at all those pointer indirections!!), it makes sense to "find other work to do" while the CPU stalls on memory latency.
The RAM supports higher bandwidth. The CPU supports more instructions-per-tick. Spending effort to find work to do just makes sense.
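The pointer-chasing pattern above can be sketched in a few lines; the class names (Blah, Foo, Bar) are hypothetical stand-ins for the kind of object graph typical Java code builds:

```python
# A sketch of the pointer-chasing pattern described above. Each attribute
# access is a dependent load: the CPU cannot start fetching the next object
# until the previous pointer has come back from memory, which is exactly
# the stall SMT tries to fill with work from another thread.
class Bar:
    def __init__(self, value):
        self.value = value

class Foo:
    def __init__(self, bar):
        self.bar = bar

class Blah:
    def __init__(self, foo):
        self.foo = foo

def deep_read(blah):
    # blah -> foo -> bar -> value: three dependent pointer dereferences,
    # each one potentially a full trip to main memory on a cache miss.
    return blah.foo.bar.value

obj = Blah(Foo(Bar(42)))
print(deep_read(obj))  # 42
```

In a flat array the hardware prefetcher can hide most of this latency; in a chain of heap objects it usually can't, because the next address isn't known until the current load completes.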
Oh, I'm aware of that concern, but I was wondering how many benchmarks show a significant win. Once you account for the impact on the processor caches, and for how small the fraction of apps is that are performance-sensitive yet still limited by cache-unfriendly random memory access, it seems like it could be an increasingly niche concern: most of the code people run either isn't performance-sensitive or has gotten better about memory access patterns after decades of hardware trending in that direction.
Apple's highly-competitive performance with processors which don't support SMT suggests that this is not such a clear win, especially given the non-trivial cost of the security mitigations.
Thanks - that’s the kind of spread I was curious about, since other factors like the cost of Spectre mitigations have added additional dimensions to the comparison.
It is very rare for programs to be memory bandwidth bound. It usually takes a lot of optimization just to get to that point, plus an access pattern that hammers bandwidth on top of it (such as looping through large arrays, doing only one simple calculation per index, then doing that on many cores).
The vast majority of what people run is memory latency bound and in those cases using extra threads makes sense so that the explicit parallelism can compensate for memory latency.
> (such as looping through large arrays, only doing one simple calculation to each index, then doing that on many cores).
...which perfectly describes a parallelized mat-vec-mult. Yes, that's not common in most applications, but I'd have a hard time naming a more basic operation in scientific (and related) computations.
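For concreteness, here's the operation in question; a minimal pure-Python sketch, with the bandwidth point noted in the comments:

```python
def mat_vec_mult(matrix, vec):
    # Matrix-vector multiply: one multiply-add per matrix element loaded.
    # The arithmetic intensity is very low, so for matrices much larger
    # than cache this is limited by how fast memory can stream the data
    # in, not by the ALUs -- the textbook memory-bandwidth-bound kernel.
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in matrix]

A = [[1, 2],
     [3, 4]]
x = [10, 1]
print(mat_vec_mult(A, x))  # [12, 34]
```

Parallelizing the outer loop across cores is what makes it saturate memory bandwidth, since each core streams its own rows.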
We are saying the same thing here, though I think you are missing the point that this is all a response to someone asking if SMT is useful anymore since there are many cores in almost every CPU.
The answer is that it is absolutely still useful since your example is niche and most software/systems can still benefit from being able to work around memory latency with more threads.
There are different ways to do that. Apple put effort into wide decoders, which increases single-threaded performance. Intel realistically cannot do the same because of the limitations of the variable-length x86 instruction format, so they worked around it with hyper-threading, which lets the core decode from two threads in parallel.
It has decreased, but halving a latency over 20-25 years is not exactly the kind of progress people intuitively associate with semiconductors.
What has decreased quite a lot though is the time to transfer, which is of course stacked on top of CL. CPUs always have to fetch a full cache line (usually 64 bytes), and the time to get 64 bytes out of memory has more than halved each generation.
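A back-of-the-envelope calculation of that transfer time, using the standard peak per-channel bandwidth figures for two generations (DDR2-800 at 6.4 GB/s, DDR4-3200 at 25.6 GB/s) as illustrative endpoints:

```python
# Time to stream one 64-byte cache line at peak per-channel bandwidth.
# The bandwidth figures are the standard peak rates for DDR2-800 and
# DDR4-3200; real effective bandwidth is lower, but the ratio holds.
CACHE_LINE_BYTES = 64

def line_transfer_ns(bandwidth_gb_per_s):
    # bytes / (bytes per second), converted to nanoseconds
    return CACHE_LINE_BYTES / (bandwidth_gb_per_s * 1e9) * 1e9

print(line_transfer_ns(6.4))   # DDR2-800:  10.0 ns per cache line
print(line_transfer_ns(25.6))  # DDR4-3200:  2.5 ns per cache line
```

So while the first-word latency (CL plus the rest of the access pipeline) has barely moved, the burst-transfer portion of fetching a line shrank 4x across those two generations.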
I would say most workloads benefit from SMT. I've done work disabling and enabling it and rerunning benchmarks to understand the impact.
As a gross over-generalization, disabling it can improve latency for compute-intensive workloads. Those aren't most workloads, though: at lower utilizations, almost everything sees some improvement and an increase in throughput from having SMT turned on.
As you push the workload beyond 60-70% utilization the wins from SMT fall away, but they generally don't impact you negatively. What happens is that the processor over-performs at lower utilizations, and the performance curve then bends back toward where it would have been without SMT.
The only workload I've run lately that buried all the cores was a heavily parallel CSV import into ClickHouse from compressed zips of multiple CSVs, while filtering the CSVs with Python. Running nproc=cores was 5% slower in overall throughput than nproc=cores+smt.