Using Intel® Thread Profiler for Win32* Threads: Nuts and Bolts

Author: Clay P. Breshears
Published On: Friday, January 20, 2006 | Last Modified On: Tuesday, August 05, 2008

Introduction

In Part One, we examined the concepts of processor utilization, or concurrency level, and critical path analysis used in Intel® Thread Profiler for Win32* Threads. We saw how color was used to distinguish between different levels of concurrency and the amounts of time an application spent within each category. We also observed how the color intensity denotes the interactions between threads and discussed how overall application performance can be impacted by those interactions.

In Part Two, we demonstrate how these concepts are realized within the GUI for Intel Thread Profiler for Win32 Threads. We shall also examine how the information presented by the tool is used to identify and locate performance problems within an explicitly threaded application.

We assume that the reader is familiar with the installation of the Intel® Threading Tools and with how to create an Intel Thread Profiler activity. No special switches are needed for compilation of the target application, beyond those required for binary instrumentation, to use Intel Thread Profiler on Win32 threaded code. (See Getting Started with the Intel Thread Profiler, included in the documentation for the tool, for more information on preparing applications for analysis.) Binary instrumentation of the threaded application will be done when an Intel Thread Profiler activity is started. This instrumentation is targeted at system threading APIs, such as thread creation and termination, suspension and resumption of thread execution, events, mutexes, critical sections, and other synchronization APIs, as well as I/O requests, pipes, ports, and other blocking APIs.


Example Application: Critical Paths View
To illustrate some of the basic techniques in using Intel Thread Profiler for Win32 Threads, we have written a small, poorly performing application. After some initial serial preparation, the master thread creates five threads: four worker threads and a single verification thread. Each worker thread is set up to perform a sequence of independent tasks. After completing one task and before starting on the next, worker threads pause to allow the verification thread to check that the correct results have been computed.

When using Intel Thread Profiler it is best to run your application with a “production-size” data set to sufficiently exercise the entire application. If a data set is too small, the performance results can be skewed by things that are actually inconsequential to the performance, such as input/output and thread management overheads.

Figure 1 shows the Critical Paths View within Intel Thread Profiler for the example application. Near the bottom of the histogram bar we see a section of bright green. According to the legend, this is “Fully parallel impact time.” That is, on the four-processor machine that ran this application, four threads were active, but some number of threads were unable to run due to a synchronization object that was under the control of one or more of the executing threads. Even though threads are being prevented from running, all four processors are being utilized, so there would be no “room” for the impacted threads to run. (From our example application, we know the blocked thread is the verification thread that must wait for all worker threads to finish before it can execute.) The light blue segment (“Oversubscribed cruise time”) at the very bottom of the graph is time spent with more than four threads able to run. At the top of the Critical Paths histogram we see some “Serial cruise time” (light orange) from the initialization by the master thread before any worker threads are created.
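
The naming of the histogram categories can be illustrated with a minimal sketch. The function name, its parameters, and the exact thresholds are assumptions made for illustration; they are not taken from the tool, which may apply additional categories and rules.

```python
def classify_interval(active, impacted, processors):
    """Name the category of one interval on the critical path.

    active     -- threads running or able to run during the interval
    impacted   -- threads kept from running by one of the active threads
    processors -- processors available on the machine
    All three parameters are hypothetical inputs for this sketch.
    """
    if active < 1 or processors < 1 or impacted < 0:
        raise ValueError("need active >= 1, processors >= 1, impacted >= 0")

    # Concurrency level: compare the active thread count to the processor count.
    if active == 1:
        level = "Serial"
    elif active < processors:
        level = "Undersubscribed"
    elif active == processors:
        level = "Fully parallel"
    else:
        level = "Oversubscribed"

    # Interaction: "impact" if some other thread is being held up, else "cruise".
    interaction = "impact" if impacted > 0 else "cruise"
    return f"{level} {interaction} time"


# The situations described for Figure 1, on a four-processor machine.
print(classify_interval(active=4, impacted=1, processors=4))
print(classify_interval(active=5, impacted=0, processors=4))
print(classify_interval(active=1, impacted=0, processors=4))
```

Under these assumptions the three calls print “Fully parallel impact time”, “Oversubscribed cruise time”, and “Serial cruise time”, matching the segments discussed above.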


Figure 1. Critical Paths View (condensed)

The segment of the histogram that should draw our attention is the “Serial impact time” in dark orange. This portion accounts for almost half of the application execution time and is the result of a single thread running and being directly responsible for keeping all other threads from executing. Recall from Part One, “Using Intel® Thread Profiler for Win32 Threads: Philosophy and Theory,” that overall execution time can be reduced for the application by reducing the amount of time spent on the critical path. Impact time, especially time when processor resources are undersubscribed, provides significant opportunities to shorten the execution time of the critical path.

Since the Critical Paths View is a summary, we do not know whether a single thread is responsible, whether other threads share this responsibility over the course of execution, or whether one particular synchronization object is involved. To get more detailed information about how threads interacted on the critical path, we need to use the Profile View.


Profile View
Select a critical path histogram from the Critical Paths View by left-clicking once on the chosen bar. The selected histogram will be outlined by a dark border. Clicking on the “Profile View” tab at the bottom of the data view pane will change the display to the Profile View with the selected critical path data loaded. The initial display is the critical path summary data histogram, but reversed from the Critical Paths View.

Grouping

The Profile View allows us to group, filter, and sort the data in order to get a better understanding of the performance of the threads within the application. Across the top of the data view pane are shortcut buttons for the most common methods of viewing and grouping the critical path data. The first pair of buttons (left and right arrows) works like the back and forward buttons of a web browser. After some groupings have been displayed, the user can traverse between them in the order they were created by using the navigation arrows.

The second set of buttons is the grouping shortcuts. The first button (red circle with slash) returns the view to the default critical path summary; the second button (“CL”) shows the concurrency level along the critical path; the third button (spool of thread) breaks down the data by time spent on the critical path for each thread (see Figure 2); and the fourth button (trio of blobs) will show the impact time for which each synchronization object was directly responsible.

Other grouping options can be found in the drop-down menus under the first two buttons (“1” and “2”) in the next set. Once an initial grouping has been selected, the “2” button will add a second grouping within each first grouping category. The third button (two circular arrows) will swap the order of the groupings. Text describing the current grouping being displayed is listed in the blue border between the shortcuts and the data display.

As an example, after grouping the critical path data by Thread, choosing Object from the “2” button menu will further divide the critical path time displayed for each thread according to the synchronization objects involved in impacting other threads. The details of each relevant synchronization object will be found in the ID box below each histogram bar. This particular grouping combination can identify objects whose use may be adversely impacting performance.

The fourth set of buttons allows the user to zoom in and out on the data display. The magnification ranges from 1x to 100x. Users can quickly select a magnification level from the drop-down menu to the right of the icons.

Filtering

Filtering allows users to exclude data that is not of interest. After grouping the data, select one or more histogram bars that are to be examined in-depth. Right-click on a selected bar and choose “Filter Selection” from the top of the pop-up menu. All non-selected bars will be removed and further groupings will only be applied to the data that remains. The “Filter” status bar in the blue border above the data display is an indicator of the percentage of data that remains after filtering. Filtering can be applied successively to refine the focus of the data display as needed.

Sorting

Controls for sorting the data displays are found by clicking on the double-down arrow icon at the far right of the blue border bar (just above the legend). What data to sort on (Field), how to order the sorted results (Order), and which groups the sorting criteria apply to are all chosen here. For example, choosing the Field “CP Time” and Order “Descending” will sort the displayed histogram bars by time spent on the critical path from most to least.
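
The grouping, filtering, and sorting operations described above behave much like operations on a small table of records. The sketch below is a toy model only: the record layout, the sample values, and every function name are hypothetical and do not reflect how the tool stores or exposes its data.

```python
from collections import defaultdict

# Hypothetical records: (thread, sync object, seconds on the critical path).
SAMPLES = [
    ("Thread 3", "Semaphore 2", 4.0),
    ("Thread 3", "none", 1.0),
    ("Thread 6", "Semaphore 2", 3.0),
    ("Thread 2", "none", 1.0),
    ("Thread 1", "none", 1.0),
]
FIELD_INDEX = {"thread": 0, "object": 1}


def group_by(samples, *fields):
    """Total critical path time per group; two fields model the "1" and "2" groupings."""
    totals = defaultdict(float)
    for sample in samples:
        key = tuple(sample[FIELD_INDEX[f]] for f in fields)
        totals[key] += sample[2]
    return dict(totals)


def filter_selection(samples, selected_threads):
    """Keep only the selected threads and report the percentage of time that remains."""
    kept = [s for s in samples if s[0] in selected_threads]
    remaining = 100.0 * sum(s[2] for s in kept) / sum(s[2] for s in samples)
    return kept, remaining


def sort_descending(totals):
    """Model of Field "CP Time" with Order "Descending"."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


by_thread = sort_descending(group_by(SAMPLES, "thread"))
kept, remaining = filter_selection(SAMPLES, {"Thread 3", "Thread 6"})
by_thread_then_object = group_by(kept, "thread", "object")

print(by_thread[0])            # the bar with the most critical path time
print(remaining)               # percentage of data left after filtering
print(by_thread_then_object)
```

With these made-up samples, the first bar after sorting is Thread 3, and filtering down to Threads 3 and 6 leaves 80% of the critical path time for further grouping.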

Halos and Thread Lifetime

Threads have a lifetime beyond time spent on the critical path. When the Profile View is grouped by Threads, the lifetime of a thread is given by a light green “halo” that is positioned behind the critical path histogram bar. The height of the halo corresponds to the total time the given thread was “alive” from creation to termination. A darker green halo within the lifetime halo represents the total time that the given thread was active and available for execution, but was not on the critical path.

Synchronization objects also have lifetimes within an application. After grouping by Object, the halo behind a critical path histogram bar is the amount of time the given sync object was “alive” from initialization to destruction. Synchronization object lifetime halos are only visible in the Object grouping, just as the thread lifetime halo is only visible under the Thread grouping.


Example Application Analysis and Improvement
Figure 2 is the Thread grouping of the Profile View for our example application. The histogram bars have been sorted by time on the critical path. Thread 1 is the master thread and has the longest lifetime, Threads 2-5 are the worker threads, and Thread 6 is the verification thread.


Figure 2. Profile View – Grouped by Threads

Threads executing in parallel and performing the same amount and type of computation should spend approximately the same amount of time on the critical path, although some difference should be expected due to scheduling, system load, and other intangibles beyond the control of the application. However, the active time spent by Thread 3, being much greater than that of any other worker thread, indicates a load imbalance between the workers. This imbalance is also evident from the “Serial impact time” spent by Thread 3 on the critical path. Eliminating this imbalance and, consequently, the impact time that keeps Thread 6 from starting the verification process will shorten the critical path and reduce the overall execution time of the application.
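
One way to put a number on such an imbalance is to compare each worker's active time with the average across workers. The following is a minimal sketch; the active times are invented for illustration (they are not read from Figure 2), and the 25% tolerance is an arbitrary assumption.

```python
from statistics import mean

# Hypothetical active times in seconds for the four worker threads.
ACTIVE_TIMES = {
    "Thread 2": 10.0,
    "Thread 3": 22.0,
    "Thread 4": 9.0,
    "Thread 5": 11.0,
}


def imbalance_ratio(active_times):
    """Longest active time divided by the mean; 1.0 means perfectly balanced."""
    values = list(active_times.values())
    return max(values) / mean(values)


def overloaded_threads(active_times, tolerance=1.25):
    """Threads whose active time exceeds the mean by more than the tolerance factor."""
    average = mean(list(active_times.values()))
    return sorted(name for name, t in active_times.items() if t > tolerance * average)


print(round(imbalance_ratio(ACTIVE_TIMES), 2))
print(overloaded_threads(ACTIVE_TIMES))
```

A ratio close to 1.0 suggests differences attributable to scheduling and system load, while a ratio well above 1.0 points to a thread that has been assigned more work than its peers.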


Timeline View
The critical path histogram bars are summarized data with no means of showing how the critical path passed from one thread to another. The Timeline View is able to display the motion of the critical path across threads. Keeping track of these transitions, however, is an expensive operation that will increase memory usage and cause a noticeable slowdown in execution time within Intel Thread Profiler. Therefore, the user must configure the Intel Thread Profiler for Win32 Threads data collector to activate this feature before execution of the target application begins.

Figure 3 shows the Timeline View for our example application. The Lifetime and Active Lifetime halos are overlaid with a simplified color coding of critical path activity. This simplified scheme uses bright green for cruising time, red for impact time, and yellow for overhead or transition time.


Figure 3. Timeline View

Knowing which threads create which other threads, we can use the Timeline View to determine how threads are numbered by Intel Thread Profiler. The master thread is Thread 1, which first creates Threads 2-5 and then creates the verification thread (Thread 6). All threads “join” back with the master thread at termination.

The load imbalance of Thread 3 with the other worker threads is very apparent from the Timeline View. Each Active Lifetime segment from Thread 3 is longer than the corresponding segments from any other worker thread. If only a few tasks were out of balance with the rest of the worker threads, the Timeline View could be used to quickly identify these larger tasks and allow the programmer to change the assignment of work to a more even distribution.
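
Once the larger tasks have been identified, one possible way to even out the assignment is a greedy "longest task first" heuristic. This is a sketch of a general scheduling technique, not the change made to the example application; the task names and costs are hypothetical, and the approach assumes task costs can be estimated ahead of time.

```python
import heapq

# Hypothetical task costs (arbitrary time units), keyed by task name.
TASK_COSTS = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4, "f": 3, "g": 2, "h": 1}


def assign_tasks(task_costs, workers):
    """Give each task, largest first, to the worker with the least load so far."""
    heap = [(0, w) for w in range(workers)]   # (current load, worker index)
    heapq.heapify(heap)
    plan = [[] for _ in range(workers)]
    ordered = sorted(task_costs.items(), key=lambda item: item[1], reverse=True)
    for task, cost in ordered:
        load, worker = heapq.heappop(heap)    # least-loaded worker
        plan[worker].append(task)
        heapq.heappush(heap, (load + cost, worker))
    return plan


def loads(plan, task_costs):
    """Total cost assigned to each worker."""
    return [sum(task_costs[t] for t in tasks) for tasks in plan]


# Static assignment in blocks of two versus the greedy assignment.
static_plan = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
greedy_plan = assign_tasks(TASK_COSTS, workers=4)

print(loads(static_plan, TASK_COSTS))   # [15, 11, 7, 3]
print(loads(greedy_plan, TASK_COSTS))   # [9, 9, 9, 9]
```

When task costs are not known in advance, a shared queue from which idle workers pull the next task is a common alternative.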


Source Views
Any impact time, especially “Serial impact time” when multiple threads might be able to run concurrently, represents an undesirable performance problem from thread synchronization. Identifying those synchronization objects that were involved with impact time on the critical path is done by simply grouping the Profile View by Object.

Placing the mouse over the ID box at the bottom of the histogram bar will reveal information about the object, such as lifetime, where created, etc. Figure 4 shows the Profile View of our example application grouped by object and the identification of the second object, which is a semaphore. Intel Thread Profiler uses an internal naming scheme to identify different objects. To link the Intel Thread Profiler identification back to the specific object within the source code, right-click on a histogram bar of interest and choose “Creation Source Code” from the pop-up menu. If the application was compiled with debug symbols, a Source View window will be generated that shows the dynamic initialization of the object.


Figure 4. Profile View grouped by Object

Within Intel Thread Profiler for Win32 Threads, a transition is when the critical path moves from one thread to another. At such transitions, Intel Thread Profiler is able to keep track of several call sites within the application that were involved. These call sites are:

  • Previous: the transition of the critical path out of the previous thread.
  • In: the transition of the critical path into the current thread.
  • Out: the call site of the current event, when the critical path transfers out of the current thread.
  • Next: the transition of the critical path into the next thread.

Release and acquisition of synchronization objects often cause transitions of the critical path. Right-clicking on an object’s histogram and choosing “Transition Source View” will bring up a source view window that can be used to find transition points that were caused by the sync object. Figure 5 shows the Out location (a call to ReleaseSemaphore()) that is transitioning the critical path out of Thread 6 to Thread 4 (Next). The previous threads to affect this semaphore are Threads 3, 4, and 2 (Prev).


Figure 5. Transition Source View

Knowing what impact objects have on execution performance, and having a direct view of the source code, allows the programmer to judge how those objects could be used more effectively. Based on this information, two performance modifications that could improve utilization of synchronization objects are (1) reducing the amount of time threads hold the relevant objects and (2) modifying the behavior of blocked threads so that they perform other processing and wait for objects at a later point in their execution.
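
Both modifications can be sketched in a few lines. The example below uses Python's threading module in place of the Win32 synchronization APIs purely for illustration; all function names are hypothetical, and slow_square() stands in for whatever expensive computation a thread performs.

```python
import threading


def slow_square(x):
    # Stand-in for an expensive computation that needs no shared data.
    return x * x


def record_long_hold(lock, shared, x):
    # Before: the computation runs while the lock is held, so every
    # other thread that needs the lock is impacted for the whole call.
    with lock:
        shared.append(slow_square(x))


def record_short_hold(lock, shared, x):
    # Modification (1): compute first, hold the lock only for the update.
    value = slow_square(x)
    with lock:
        shared.append(value)


def record_or_defer(lock, shared, x, other_work):
    # Modification (2): if the lock is busy, do other useful work first
    # and only then block on the lock.
    value = slow_square(x)
    if not lock.acquire(blocking=False):
        other_work()
        lock.acquire()
    try:
        shared.append(value)
    finally:
        lock.release()


lock = threading.Lock()
shared = []
threads = [
    threading.Thread(target=record_short_hold, args=(lock, shared, x))
    for x in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(shared))  # [0, 1, 4, 9]
```

Whether the second modification helps depends on the blocked thread having independent work available; if it does not, the thread simply waits later instead of sooner.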


More Example Application Analysis and Improvement
From Figure 2, you may have noticed that Thread 6 spends a large portion of time on the critical path under “Serial impact time.” For our example application this is a necessary evil due to the requirement that results be verified before new results may be computed. However, after correcting the load imbalance of work assigned to Thread 3, this serial execution stands out as an obvious roadblock to better utilization of resources and parallel performance.

For our example application, any parallelism that can be introduced to the verifier thread could be of great benefit. Upon further consideration, it was decided that a single verifier thread for four worker threads was unnecessary. Each worker thread could be paired with its own verifier thread, eliminating the draconian synchronization required by the single verification thread.
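
The pairing idea can be sketched as follows. This is not the source code of the example application, which is not shown in this article; it is a minimal illustration using Python's threading.Semaphore, with hypothetical names throughout. Each worker hands a result to its own verifier and waits only for that verifier, never for the other workers.

```python
import threading

DONE = object()  # sentinel telling a verifier that its worker has finished


def run_paired(task_lists, compute, verify):
    """Run one worker and one verifier thread per task list.

    Returns, per worker, a list of (result, verification outcome) pairs.
    """
    outcomes = [[] for _ in task_lists]

    def worker(tasks, slot, ready, checked):
        for task in tasks:
            slot[0] = compute(task)
            ready.release()      # result is ready for the paired verifier
            checked.acquire()    # wait only for this worker's own verifier
        slot[0] = DONE
        ready.release()

    def verifier(index, slot, ready, checked):
        while True:
            ready.acquire()      # wait for the paired worker's next result
            if slot[0] is DONE:
                return
            outcomes[index].append((slot[0], verify(slot[0])))
            checked.release()    # let the worker start its next task

    threads = []
    for index, tasks in enumerate(task_lists):
        slot = [None]                        # one-element hand-off buffer
        ready = threading.Semaphore(0)
        checked = threading.Semaphore(0)
        threads.append(threading.Thread(target=worker, args=(tasks, slot, ready, checked)))
        threads.append(threading.Thread(target=verifier, args=(index, slot, ready, checked)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


results = run_paired(
    [[1, 2], [3], [4, 5, 6], []],
    compute=lambda x: x * x,
    verify=lambda r: r % 2 == 0,
)
print(results)
```

Because each pair has its own semaphores, a slow task in one worker delays only its own verifier rather than holding up verification for every worker.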

After making the necessary modifications to the source code (using the Intel® Thread Checker to determine that no new threading errors were introduced by the changes), the example application was run through Intel Thread Profiler for Win32 Threads again. Figure 6 shows a comparison of the critical paths of the original application with the updated version corrected for load balance and using four verification threads. The worker threads must still block on synchronization while the verification threads execute. However, with four verification threads, all processors are kept busy more often (larger percentage of “Fully parallel” green in critical path summary histogram) than in previous versions of the application.


Figure 6. Comparison of original and tuned application


Summary
Intel Thread Profiler for Win32 Threads monitors execution of applications to detect threading performance issues, including thread overhead and synchronization impact. Intel Thread Profiler provides results in graphical displays to help quickly pinpoint the locations in code that directly affect execution time where fewer than the optimal number of threads were executing, identify synchronization objects that may be impacting thread execution, and ascertain load balance issues between threads.

A good rule of thumb is that an application spending 80% or more of its execution time running in parallel is performing very well. Maximum utilization of computing resources is another goal to increase parallel performance. Removing situations within an application where fewer threads than available processors are found to be executing, especially at the same time other threads are blocked from running, will both increase the utilization of resources and increase the parallel working percentage.
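
As a rough check against this rule of thumb, the parallel percentage can be estimated from the time in each category of a critical path histogram. The sketch below assumes that "running in parallel" means every category other than the serial ones, which is an interpretation rather than a definition from the tool, and the times are invented for illustration rather than taken from Figure 6.

```python
def parallel_percentage(category_times):
    """Percentage of critical path time not spent in a "Serial ..." category."""
    total = sum(category_times.values())
    if total <= 0:
        raise ValueError("total time must be positive")
    serial = sum(t for name, t in category_times.items() if name.startswith("Serial"))
    return 100.0 * (total - serial) / total


# Hypothetical breakdowns in seconds, before and after tuning.
before = {
    "Serial cruise time": 1.0,
    "Serial impact time": 4.5,
    "Fully parallel impact time": 4.0,
    "Oversubscribed cruise time": 0.5,
}
after = {
    "Serial cruise time": 1.0,
    "Fully parallel impact time": 3.5,
    "Oversubscribed cruise time": 0.5,
}

print(parallel_percentage(before))  # 45.0
print(parallel_percentage(after))   # 80.0
```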


Additional Resources

Articles

Using Intel® Thread Profiler for Win32* Threads: Philosophy and Theory

Intel® Threading Tools and OpenMP*

Advanced OpenMP* Programming

Developer Centers

Threading/Multi-Core

Digital Media

Intel® Pentium® 4 Processor

Intel® Xeon® Processor

Community

Threading on Intel Parallel Architectures

Other Resources

Threading KnowledgeBase

Intel Software College


Acknowledgments
The author wishes to thank Douglas Armstrong and the Intel Threading Tools Development Team for their assistance with facts, details, and examples that went into the research and writing of this article.


About the Author
Clay Breshears is currently a Parallel Applications Engineer at the Intel Parallel Applications Center in Champaign, IL. He has been involved with parallel programming and computing for over two decades, from message-passing on distributed memory clusters to multithreading on SMP nodes. Before joining Intel, Clay was a Research Scientist at Rice University working on DoD HPC contracts.
