Expect that the higher the cache affinity of an application's single-threaded execution, the less tolerant that application will be of the reduced cache resources available to each logical processor under Hyper-Threading Technology. For a given cache hit rate in the original, single-threaded execution, the following figure illustrates the effective hit rate at which each thread would run twice as slowly as in serial execution. Thus, any hit rate that falls in the region between the curves should correspond to an overall speed-up when two threads are active:
The region between the curves narrows dramatically as the original cache hit rate approaches 100%, indicating that applications with excellent cache affinity will be the least tolerant of reduced effective cache size. For example, when a single-threaded run achieves a 60% hit rate, the dual-threaded run's hit rate can be as low as 10% and still offer overall speed-up. On the other hand, an application with a 99% hit rate must maintain an 88% hit rate in the smaller cache to avoid slowdown.
In processors that support Hyper-Threading Technology, the goal is to execute both threads with no resource contention or stalls. When this occurs, two fully independent threads should be able to execute an application in half the time of a single thread. Likewise, as long as each thread runs at more than half its single-threaded speed, the two threads together still yield an overall speed-up.
The following formula approximates the time to execute an application on a hypothetical system with a three-level memory hierarchy consisting of registers, cache, and main memory:
Texe/N = (1 - Fmemory) Tproc + Fmemory [Ghit Tcache + (1 - Ghit) Tmemory]
The variables used in the formula are as follows:
- N = Number of instructions executed
- Fmemory = Fraction of N that access memory
- Ghit = Fraction of loads that hit the cache
- Tproc = #cycles to process an instruction
- Tcache = #cycles to process a hit
- Tmemory = #cycles to process a miss
- Texe = Execution time
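As a minimal sketch, the formula can be written as a small Python function that returns Texe/N, the average number of cycles per instruction. The function name, parameter names, and range check are illustrative assumptions; only the arithmetic comes from the formula above. The sample values in the demonstration are the ones this section assumes for its calculation.

```python
def cycles_per_instruction(g_hit, f_memory, t_proc, t_cache, t_memory):
    """Return Texe/N for the three-level model (registers, cache, memory).

    g_hit    -- fraction of memory accesses that hit the cache (Ghit)
    f_memory -- fraction of instructions that access memory (Fmemory)
    t_proc   -- cycles to process a non-memory instruction (Tproc)
    t_cache  -- cycles to process a cache hit (Tcache)
    t_memory -- cycles to process a cache miss (Tmemory)
    """
    if not 0.0 <= g_hit <= 1.0:
        raise ValueError("g_hit must be between 0 and 1")
    if not 0.0 <= f_memory <= 1.0:
        raise ValueError("f_memory must be between 0 and 1")

    # Instructions that never touch memory cost Tproc cycles each.
    non_memory_cost = (1.0 - f_memory) * t_proc
    # Memory instructions cost Tcache on a hit and Tmemory on a miss.
    memory_cost = f_memory * (g_hit * t_cache + (1.0 - g_hit) * t_memory)
    return non_memory_cost + memory_cost


if __name__ == "__main__":
    # Illustrative sweep over hit rates, using the values assumed in the text.
    for g_hit in (0.0, 0.6, 0.99, 1.0):
        cpi = cycles_per_instruction(g_hit, f_memory=0.20, t_proc=2,
                                     t_cache=3, t_memory=100)
        print(f"Ghit = {g_hit:.2f}: {cpi:.3f} cycles per instruction")
```

Multiplying the result by N gives the estimated Texe in cycles.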
While cache hit rates, Ghit, cannot be easily estimated for the shared cache, we can explore the performance impact of a range of possible hit rates. For the purposes of this calculation, we assume the values Fmemory = 20%, Tproc = 2, Tcache = 3, and Tmemory = 100.
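With these values, the model can be solved for the break-even hit rate: the lowest hit rate a thread can have in the shared cache while running no more than twice as slowly as the single-threaded run. The following Python sketch does that. The function names and the clamping to zero are illustrative assumptions, and the results are model outputs that land close to, but not exactly on, the rounded figures quoted in the text.

```python
# Values assumed in the text.
F_MEMORY = 0.20   # fraction of instructions that access memory
T_PROC = 2.0      # cycles per non-memory instruction
T_CACHE = 3.0     # cycles per cache hit
T_MEMORY = 100.0  # cycles per cache miss


def cycles_per_instruction(g_hit):
    """Texe/N from the formula, for a given cache hit rate."""
    memory_cost = g_hit * T_CACHE + (1.0 - g_hit) * T_MEMORY
    return (1.0 - F_MEMORY) * T_PROC + F_MEMORY * memory_cost


def break_even_hit_rate(g_single):
    """Lowest per-thread hit rate that avoids an overall slowdown.

    Solves cycles_per_instruction(g) == 2 * cycles_per_instruction(g_single)
    for g. The model is linear in g, so the solution is closed-form.
    """
    target = 2.0 * cycles_per_instruction(g_single)
    worst_case = cycles_per_instruction(0.0)      # every access misses
    slope = F_MEMORY * (T_MEMORY - T_CACHE)       # cycles saved per unit of hit rate
    g = (worst_case - target) / slope
    # A negative result means even a 0% hit rate stays within the 2x budget.
    return max(0.0, g)


if __name__ == "__main__":
    for g_single in (0.60, 0.90, 0.99):
        g_min = break_even_hit_rate(g_single)
        print(f"single-threaded hit rate {g_single:.0%}: "
              f"break-even hit rate {g_min:.1%}")
```

Under this model, a 60% single-threaded hit rate gives a break-even of roughly 9%, and a 99% hit rate gives roughly 87%, which shows how quickly the tolerance shrinks as the original hit rate approaches 100%.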