How to Reduce the Impact of Misaligned Memory Accesses

Submit New Article


Last Modified On :   May 7, 2008 1:25 AM PDT
Rate
 


Challenge

Reduce the impact of misaligned memory accesses in an SSE2 algorithm. Misalignment of memory access is a problem commonly encountered when optimizing code with Streaming SIMD Extensions 2 (SSE2). An SSE2 algorithm often requires loading and storing data 16 bytes at a time to match the size of the XMM registers. If alignment cannot be guaranteed, some part of the performance gain achieved by processing multiple data elements in parallel will be lost because either the compiler or assembly programmer must use unaligned move instructions.

These penalties can sometimes be avoided by forcing 16-byte alignment on the data structures from which the SIMD operands are being drawn. You can do this rather cleanly using either the __declspec(align(16)) directive for static variables, or the _mm_malloc() call for dynamic memory allocation. Both of these language extensions originated with the 
Intel® Compiler but are also supported in more recent versions of the Microsoft* Visual C++* compiler (see the Intel Compiler documentation for details).

Given the nature of a particular algorithm, some misalignments cannot be resolved by any means. Motion estimation or motion compensation algorithms used in a video codec are good examples. These algorithms typically process pixel data in 16x16 chunks, also known as macroblocks. While this matches well with the XMM register size (each pixel is one byte in length), misaligned loads are prevalent because the 16x16 block can reside anywhere within the video frame.

A quarter-pixel interpolation algorithm offers an illustration of how to minimize performance loss when unaligned access is unavoidable.

The quarter pixel average is to be computed over a 16x16 block of pixels. You can envision this as a small square shape within a video image displayed on your monitor. For all 256 pixels in the block, an average will be taken between the given pixel, the pixels adjacent to the right and below, and its diagonal neighbor to the lower right. The four byte-length values are summed and then divided by four to obtain the average. Note that for the pixels on the bottom row and right-most column of the 16x16 block, we need to access data outside the block to obtain the quarter-pixel average.

Since we are adding four bytes together, each intermediate sum needs to be represented as a word-length value (two bytes) to avoid saturation. Therefore, the 16-byte XMM registers only allow us to process half a row (eight pixels) at a time.

The following code accomplishes the quarter-pixel interpolation in a suboptimal fashion using SSE2 assembly. The computational sections are condensed to pseudo-code comments to keep the focus on the memory-access pattern. The 'RowLoop' produces one row of quarter pixel averages per iteration and executes sixteen times. The top half of the loop processes the left side of the row, while the bottom processes the right side:



For the left-side calc ulations, the SIMD algorithm requires four vector operands, all represented as packed word values:

TopRow[0:7]
TopRow[1:8]
BottomRow[0:7]
BottomRow[1:8]

The following figure illustrates how they are drawn from the macroblock:


 
These operands are arranged through a series of unpack and shift instructions. Once this organizational overhead is complete, eight quarter-pixel averages can be computed in just a handful of SSE2 instructions.

The second half of the loop on the right side of the row works in very similar fashion. The four operands here are:

TopRow[8:15]
TopRow[9:16]
BottomRow[8:15]
BottomRow[9:16]

Once the averages from both sides of the row are calculated, they can be packed back into one XMM register as byte values and stored to the destination buffer.

If you have experience developing SSE2 assembly code, you may have already identified some inefficiency with the code shown above. In processing the left side of the row, we twice load 16 bytes using MOVDQU, even though we only need the first nine bytes of each row to fill out the four vector operands. The MOVDQU instruction is convenient because it brings in all the data at once, but it also fetches seven extra bytes that are eventually discarded.

Solution

Revise the memory-access pattern of the entire loop, making better use of the seven left-over bytes from the left-side processing to give a head start to the right-side work further down in the loop. With a little tweaking, we can completely remove two of the four MOVDQU instructions from the loop, as shown here:



The instructions colored green indicate changes from the code in the Challenge section. The row data loaded in the top half of the iteration is saved in a register for the bottom, where bytes [8:15] can be used for the right-side averaging.

Note that some changes are required to set up the [9:16] operands. The PSRLDQ-PUNPCKHBW sequence shifts zero into the upper byte of the register and unpacks the upper eight bytes to word values. This leaves pixels [9:15] and an empty slot where we need [16]. This is accommodated by MOVZX and PINSRW, which load byte [16] into a general register, zero extend it to a 32-bit value, and insert it into the final XMM word position to complete the [9:16] operand.

This rework increases performance of the function by 13%, as we reduced the total memory load bandwidth from 64 bytes to 34. While minimizing the number of memory accesses in a tight loop is generally a good idea, it is especially good when accesses are unaligned. A cache-line split could be lurking between any two pixels, so this optimization cuts our exposure to that performance killer in half.

Further optimizations that build on this one are available in the article Reducing the Impact of Misaligned Memory Accesses.

Source





Comments (0)



Leave a comment

Name (required)

Email (required; will not be displayed on this page)

Your URL (optional)


Comment*