 |  | Page & Feed Options Bookmark This  |
 Vote on This Page Community Tags 
|
By Kiefer Kuah
The 'sum of absolute difference' routine commonly found in video encoders was optimized by decomposing the routine into multiple threads and by using the Intel® Integrated Performance Primitives (Intel® IPP). The performance speedup from multi-threading was 3.84x on a 4-core processor. The Intel IPP function injected another factor of 1.45x to the gain to yield a combined speedup of 5.55x. The methods illustrated here do not require a great deal of code rewrite and can be implemented quickly. Introduction The rise in the number of dual core processors in desktop and mobile computers over the last year should give software developers a compelling reason to multi-thread their applications. The trend towards greater levels of parallel computing will continue. The Intel® Core™2 Duo processor is a dual core processor. A 4-core version of this processor, the Intel® Core™2 Quad, was launched recently. In order to take advantage of these multiple cores, software applications need to be multi-threaded. A single threaded application will only execute on a single core at any given time and leave the other cores idle. In a well threaded application, the workload is divided into equal chunks and distributed evenly to the available cores. Making sure that the workload given to each core is well balanced is crucial to obtaining the best performance out of multi-threading. In this article, we present an example from the video encoding application where multi-threading a routine produces speedups that scale almost linearly to the number of cores. On a 4-core processor, the speedup from multi-threading was 3.84x.
One of the computational intensive operations in video encoding is motion estimation. This operation constitutes about 40% or more of the computational cycles in an encoder. Searching for motion vectors involves comparing a block of pixels from one frame to blocks of pixels from a reference frame to find a block in the reference that is the most similar to it. The amount of work increases with the size of the data set, which, in this case, is the size of the video frames.
Part of the code to do this motion vector search operation is shown in Figure 1. The function blockMatch is called in a double nested loop. In this example, it is called frameHeight/blockHeight* frameWidth/blockWidth number of times, where frameHeight, frameWidth, blockHeight and blockWidth are the frame's height and width, and the block's height and width in number of pixels. The search range is often limited to a region within the reference frame, which would reduce the number of times this function gets called. The function is passed in pointers to the reference frame and to the current block in the current frame and it computes the block in the reference frame that is the best match to the current block. The results are stored in the argument 'matchBlock[i][j]' that is passed into the function.
Figure 1. The routine blockMatch is called within a nested loop.
for (int i=0; i<frameHeight/blockHeight; i++)
{
for (int j=0; j<frameWidth/blockWidth; j++)
{
blockMatch(refFrame, stepBytesRF,
curFrame+stepBytesCF*i+j*blockWidth, stepBytesCF, matchBlock[i][j]);
}
}
|
This code was multi-threaded using OpenMP*. While Windows* threading API could have been used, using OpenMP was simpler in this case. It was done with an OpenMP pragma directive added before the loop (Figure 2). This pragma directive directs the compiler to thread the loops. The number of threads to use is determined dynamically at run time by the number of cores available in the system. It is possible to program the number of threads to use into the application but this should be done only if there are reasons to want to do so. Intel® Compiler 9 and Microsoft Visual Studio 2005* support OpenMP.
Figure 2. Using OpenMP to multi-thread.
#pragma omp parallel for
for (int i=0; i<frameHeight/blockHeight; i++)
{
for (int j=0; j<frameWidth/blockWidth; j++)
{
blockMatch(refFrame, stepBytesRF,
curFrame+stepBytesCF*i+j*blockWidth, stepBytesCF, matchBlockOptimized[i][j]);
}
}
|
We next replaced the C++ code with an optimized function from IPP 5.1 (Figure 3). This IPP function computes the 'sum of absolute differences' between the reference block and the current block. There are 7 'sum of absolute differences' functions in IPP to accommodate different block sizes.
Figure 3. Calling the IPP function, ippiSAD4x4_8u32s.
for (int j=0; j<frameWidth-blockWidth; j++)
{
temSum = 0;
pCur = curBlock;
pRef = refFrame+i*stepBytesRF+j;
ippiSAD4x4_8u32s(pCur, stepBytesCB, pRef,
stepBytesRF, &temSum, IPPVC_MC_APX_FF);
if (temSum < lowSum)
{
lowSum = temSum;
matchBlock[0] = j;
matchBlock[1] = i;
}
}
|
Results The motion vector search routine was tested on the Intel Core 2 Duo processor and the Intel Core 2 Quad processor. The threaded routine attained speedups that scaled almost linearly with the number of cores. It yielded a speedup of 1.89x on the dual core processor and 3.84x on the 4-core processor (Table 1). Further gain was obtained by using the optimized IPP function. The IPP function itself yielded a speedup of between 1.39x and 1.45x. Combined it with multi-threading, the speedup was 2.67x on the dual core processor and 5.55x on the 4-core processor.
The IPP function contains SIMD instructions. During initialization, the IPP library detects the processor's features and determines which Streaming SIMD Extentions (SSE) are supported and would dispatch the most optimized version of the functions supported by the processor. Using IPP is the alternative to having to write your own SIMD optimized functions.
Table 1. Elapsed time in milliseconds. The speedups were computed using the elapse time for the C++ function as the baseline. For these tests, the frame size and block size were set to the following values: frameWidth = 128, frameHeight = 128, blockWidth = 4, blockHeight = 4.
|
|
2.67GHz Core™2 Duo |
2.67GHz Core™2 Quad | |
Function |
Time (ms) |
Speedup |
Time (ms) |
Speedup | |
C++ |
355.82 |
1.00x |
356.08 |
1.00x | |
C++ threaded |
188.12 |
1.89x |
92.68 |
3.84x | |
IPP optimized |
256.21 |
1.39x |
245.1 |
1.45x |
IPP optimized threaded |
133.21 |
2.67x |
64.13 |
5.55x |
The ratio of parallel to serial code in an application determines the magnitude of the speedups from multi-threading. If p is the fraction of the code in an application that can be parallelized and n is the number of cores, then Amdahl's Law states that the speedup possible is
Speedup = 1 / (1-p+p/n)
As the level of parallelism increases, that is, as n increases, the term p/n gets smaller. As n becomes much greater than p, p/n tends toward 0, leaving 1/(1-p) on the right hand side of the equation. We know that (1-p) is the fraction of serial code. In other words, as we increase the level of parallelism in the parallel section of the application, the factor that limits speedup is the serial code.
We can estimate how much the gain in optimized motion estimation could impact the overall performance. If motion vector search constitutes about 40% of the encoding workload and it is sped up by 5.55x using IPP and 4 threads, the theoretical speedup in the overall encoder would be 1.49x.
Conclusion Multi-threading for performance is typically only an afterthought in the development of an application. As dual-core and 4-core processors become more widespread, it is necessary to purposefully use multi-threading to speed up performance critical code. You should no longer depend on increases in the processor frequency to scale your application's performance or feature sets. The speedup achievable from well threaded code running on multi-core processors can be significant. Speedups that are close to the theoretical limit were attained in this motion estimation example. This is one of several components in an encoder. Other components in the encoder can also be examined for opportunities to be decomposed and threaded.
Additional Resources We invite you to post a comment (not monitored by customer support) on this page or send a question directly to our support team. |