How to Predict the Impact of Cache Effects on Applications Running on Hyper-Threading Technology-Enabled Processors

Author: Intel® Software Network
Published On: Saturday, April 01, 2006 | Last Modified On: Tuesday, August 05, 2008

Challenge
Gauge whether the negative impact of sharing data-cache resources will outweigh the performance gains from Hyper-Threading Technology for a specific application. Depending on the application's characteristics, the shared caches can either help or hinder performance.

Each logical processor exposed by a Hyper-Threading-enabled processor maintains a complete set of the architecture state, which consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total. Logical processors share nearly all other resources on the physical processor, such as caches.

The threads in data-parallel applications tend to work on distinct subsets of memory, so sharing the cache may be expected to halve the effective cache size available to each logical processor. Note that each execution core in a multi-core processor has its own cache. Nevertheless, if those cores support Hyper-Threading Technology, the two logical processors within each execution core must still share that core's cache resources.

Solution
Expect that the higher the cache affinity of an application's single-threaded execution, the less tolerant that application will be of the reduced cache resources available to each logical processor under Hyper-Threading Technology. For a given cache hit rate in the original, single-threaded execution, the following figure illustrates the reduced hit rate at which each thread would run twice as slowly as in the serial case, which is the break-even point. Thus, any hit rate that falls in the region between the curves should correspond to an overall speed-up when two threads are active:



The region between the curves narrows dramatically as the original cache hit rate approaches 100%, indicating that applications with excellent cache affinity will be the least tolerant of reduced effective cache size. For example, when a single-threaded run achieves a 60% hit rate, the dual-threaded run's hit rate can be as low as 10% and still offer overall speed-up. On the other hand, an application with a 99% hit rate must maintain an 88% hit rate in the smaller cache to avoid slowdown.

In processors that support Hyper-Threading Technology, the goal is to execute both threads with no resource contention or stalls. When this occurs, two fully independent threads should be able to execute an application in half the time of a single thread. Likewise, because each thread handles only half of the work, each thread can run at as little as half its single-threaded speed before the overall speed-up disappears.
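The break-even reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original article: it assumes the work divides evenly between the two threads, and the function name `dual_thread_speedup` and its `slowdown` parameter are hypothetical.

```python
def dual_thread_speedup(slowdown: float) -> float:
    """Overall speed-up when two threads split the work evenly.

    slowdown: per-thread time per instruction relative to the serial run
    (1.0 = no contention, 2.0 = each thread runs twice as slowly).
    Assumption: each thread executes exactly half of the instructions.
    """
    if slowdown <= 0:
        raise ValueError("slowdown must be positive")
    serial_time = 1.0                          # normalized single-thread run time
    dual_time = (serial_time / 2) * slowdown   # half the work, at reduced speed
    return serial_time / dual_time


for s in (1.0, 1.5, 2.0, 2.5):
    print(f"slowdown {s:.1f}x -> speed-up {dual_thread_speedup(s):.2f}x")
```

Under this assumption the speed-up is simply 2 divided by the slowdown factor, so it drops to 1.0 (no gain) when each thread runs twice as slowly, and falls below 1.0 beyond that point.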

The following formula gives the approximate time to execute an application on a hypothetical system with a three-level memory hierarchy consisting of registers, cache, and main memory:

Texe / N = (1 - Fmemory) × Tproc + Fmemory × [Ghit × Tcache + (1 - Ghit) × Tmemory]

The variables used in the formula are as follows:

  • N = Number of instructions executed
  • Fmemory = Fraction of the N instructions that access memory
  • Ghit = Fraction of loads that hit the cache
  • Tproc = Number of cycles to process an instruction
  • Tcache = Number of cycles to process a cache hit
  • Tmemory = Number of cycles to process a cache miss
  • Texe = Execution time, in cycles

While the cache hit rate, Ghit, cannot be easily estimated for the shared cache, we can explore the performance impact of a range of possible hit rates. For the purposes of this calculation, we assume the values Fmemory = 20%, Tproc = 2 cycles, Tcache = 3 cycles, and Tmemory = 100 cycles.
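As a rough sketch, the formula and the assumed values can be evaluated directly in Python. The function names `cycles_per_instruction` and `break_even_hit_rate` are illustrative and do not come from the original article. The break-even calculation assumes, as in the discussion above, that a thread may take up to twice the single-threaded cycles per instruction before the overall speed-up is lost.

```python
def cycles_per_instruction(g_hit, f_memory=0.20, t_proc=2.0,
                           t_cache=3.0, t_memory=100.0):
    """Texe / N from the formula above; defaults are the article's assumed values."""
    return (1 - f_memory) * t_proc + f_memory * (
        g_hit * t_cache + (1 - g_hit) * t_memory)


def break_even_hit_rate(g_single, **params):
    """Hit rate at which a thread needs twice the single-threaded cycles per instruction.

    Solves cycles_per_instruction(g) == 2 * cycles_per_instruction(g_single) for g.
    The formula is linear in g, so the values at g = 0 and g = 1 fix the slope.
    A negative result means even a 0% hit rate would not double the run time.
    """
    target = 2 * cycles_per_instruction(g_single, **params)
    at_zero = cycles_per_instruction(0.0, **params)
    at_one = cycles_per_instruction(1.0, **params)
    return (at_zero - target) / (at_zero - at_one)


for g in (0.60, 0.90, 0.99):
    print(f"single-thread hit rate {g:.0%}: "
          f"{cycles_per_instruction(g):.2f} cycles/instr, "
          f"break-even hit rate {break_even_hit_rate(g):.1%}")
```

With the assumed values, this sketch gives break-even hit rates of roughly 9% and 87% for single-threaded hit rates of 60% and 99%, close to the 10% and 88% quoted earlier.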
