> threads being penalised now for activity from several seconds ago,
Exactly... They would have found this out much quicker with a trace. They would have seen "how come this application-level request is being handled on thread number X, yet that thread is not running on any core while many cores are idle?" Then they could quickly find the reason the thread isn't scheduled by enabling extra tracing detail that exposes the scheduler's internal data structures, showing why something is or isn't schedulable at that instant.
I think you're suffering from hindsight bias here. A trace is rarely as clear as that, and it's hard to see the details it's not designed to expose.
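To make that concrete, here is a rough sketch of the check I mean. It assumes you have already exported two things from your trace: the time spans where thread X was runnable but not on a core, and the time spans where each core was idle. The data layout and the names (`starved_while_idle`, `idle_by_cpu`) are made up for illustration and are not any real tracing tool's API.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds, hypothetical trace export


def overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the overlapping part of two intervals, or None if they don't overlap."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def starved_while_idle(
    runnable: List[Interval],
    idle_by_cpu: Dict[int, List[Interval]],
) -> List[Tuple[int, float, float]]:
    """Find (cpu, start, end) spans where the thread was waiting to run
    while that cpu sat idle. Any hit is worth a closer look."""
    hits = []
    for r in runnable:
        for cpu, idle_spans in idle_by_cpu.items():
            for idle in idle_spans:
                o = overlap(r, idle)
                if o is not None:
                    hits.append((cpu, o[0], o[1]))
    return sorted(hits, key=lambda h: h[1])


# Made-up example data: thread X waited twice; two cores had idle time.
runnable_x = [(1.0, 1.5), (3.0, 3.2)]
idle = {0: [(0.0, 1.2)], 1: [(1.4, 4.0)]}
for cpu, start, end in starved_while_idle(runnable_x, idle):
    print(f"cpu {cpu} idle while thread X waited: {start}s to {end}s")
```

This only tells you that something is off, not why. The "why" is where the extra scheduler detail comes in.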
Your original message would probably be better received if you'd omitted the "I think this problem would have been debugged and solved much quicker [...]" and its insulting implications and instead started with "Sometimes, I find that CPU activity traces can really help with diagnosing this sort of problem".
Please stop advocating for politeness over correctness.
Sure, hindsight helps, but regardless, a company such as Twitter should have tracing experts whose tools and knowledge go beyond the average developer's knowledge of tracing methodologies. Excusing that is an appeal to lowering technical excellence worldwide, which is hugely important and matters more than hypothetical feelings.
> a company such as Twitter should have experts at tracing
In a big company, getting the person with the most skills to solve a problem to be the one actually tasked with solving the problem is very hard. This particular problem had many avenues to find a solution - and while I think my proposed route would have been quicker, if you aren't aware of those tools or techniques, then other avenues might be much quicker. When starting an investigation like this, you don't know where you're going to end up either - if it turned out that the performance cliff was caused by CPU thermal throttling, it would be hard to see in a scheduling trace - everything would just seem universally slow all of a sudden.
On Windows, we have the xperf and WPA toolset, which makes looking at holistic scheduling performance, including processor power management and device I/O, tractable. Even then, the skillset to analyze an issue like the one presented here takes months to acquire, and only a few engineers can do it. We have dedicated teams to do this performance analysis work, and they're always in high demand.