Sun Solutions by Forsythe
Chad Mynhier
Master Consultant

Memory alignment on SPARC, or a 300x speedup!

Wed, 11/19/2008 - 22:36 by Chad Mynhier

I remember first running across SIGBUS in an introductory programming course some years ago. You'll get a SIGBUS when you have a misaligned memory access. For example, if you're on a 32-bit processor, an integer is going to be 4-byte aligned, i.e., the address to access that integer will be evenly divisible by 4. If you try to access an integer starting at an odd address, you'll get a SIGBUS, and your application will crash, probably leaving you a nice core file. SIGBUS is one of those things you don't really understand until you have your head wrapped around pointers. (It probably also helps to understand a little bit about computer architecture, at least if you're curious why they call it a SIGBUS instead of something else like SIGMISALIGN.)

In Solaris on the SPARC architecture, you can gracefully handle a misaligned memory access if you use the right compilation options. (The x86 architecture handles things differently.) The option to cc is "-xmemalign", which takes two parameters: the assumed byte alignment and how to handle misaligned memory accesses. For example, "-xmemalign=8s" means that you want to assume that all memory accesses are 8-byte aligned and that you want a SIGBUS on any misaligned access. "-xmemalign=4i" means that you want to assume that memory accesses are 4-byte aligned and that you want to handle misaligned memory accesses gracefully.

So what does it mean to handle misaligned accesses gracefully? Darryl Gove, in his book Solaris Application Programming, discusses this compiler option in a little more detail, but there's not much more to it than this: the application traps into the kernel on a misaligned memory access, and a kernel function does the right thing.

Okay, so there are really two ways you can handle misaligned memory accesses. (This is logical, given that there are two parameters to the -xmemalign compiler option.) If you know ahead of time that you're going to have plenty of misaligned memory accesses, you can set the assumed byte alignment appropriately. For example, if you know that things will frequently be aligned at odd addresses, you can use "-xmemalign=1s". The penalty you'll pay for this is that every 8-byte memory access will translate into eight separate instructions. Your binary will be bigger, and you'll have a little added runtime, depending on how many memory accesses your program makes.

If you don't think you'll have a lot of misaligned memory accesses, you can set the byte alignment appropriately and let the kernel handle any misaligned accesses. You'll get a smaller binary, your runtime will be proportionately less, but every once in a while you'll pay a big penalty for a misaligned access. But how big is that penalty?

Here's a sample program to measure the penalty we'd pay for misaligned memory accesses:

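#include <stdio.h>
#include <stdlib.h>

typedef struct {
        int a;
        int b;
} pair_t;

#define PAIRS 100
#define REPS 10000

int
main()
{
        int i, j;
        char *foo;
        pair_t *pairs;

        /* Allocate one extra pair so the array still fits when shifted by a byte. */
        if ((foo = (char *) malloc((PAIRS + 1) * sizeof(pair_t))) == NULL) {
                fprintf(stderr, "Unable to allocate memory\n");
                exit(1);
        }

#ifdef ALIGNED
        /* malloc() returns memory suitably aligned for any type. */
        pairs = (pair_t *) foo;
#else
        /* Start at an odd address so that every int access is misaligned. */
        pairs = (pair_t *) (foo + 1);
#endif

        /* Fill the array once. */
        for (i = 0; i < PAIRS; i++) {
                pairs[i].a = i;
                pairs[i].b = i+5;
        }

        /* Read every element REPS times. */
        for (j = 0; j < REPS; j++) {
                int sum = 0;    /* running total for this repetition */

                for (i = 0; i < PAIRS; i++) {
                        sum += pairs[i].a + pairs[i].b;
                }
        }

        return (0);
}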

With this Makefile:

all: aligned unaligned onebyte-aligned onebyte-unaligned

aligned: memalign.c
	cc -DALIGNED -xmemalign=4i -o aligned memalign.c

unaligned: memalign.c
	cc -xmemalign=4i -o unaligned memalign.c

onebyte-aligned: memalign.c
	cc -DALIGNED -xmemalign=1s -o onebyte-aligned memalign.c

onebyte-unaligned: memalign.c
	cc -xmemalign=1s -o onebyte-unaligned memalign.c

First, let's look at the impact of handling misaligned access in the kernel. (The version of ptime(1) I'm using here is a modified version that will be putback into Solaris sometime soon, probably Nevada build 101 or 102.)

# ptime ./aligned
real 0.016512200
user 0.008128100
sys 0.004643800
#
#
# ptime ./unaligned
real 5.749458300
user 3.343899250
sys 2.339621750
#

So, misaligned accesses cause us to run more than 300x slower! This is nowhere near what I expected at first glance. Given that we were spending some 40% of our time in sys, I would have expected to get that time back by eliminating the misaligned accesses, not a 30,000% speedup. The unexpected thing here is that the time spent in userland is so large; I'd have expected that to be about the same. I'm not sure why this is the case, so I'll have to do some digging. (It's likely that we're blowing something out of the cache, which would make sense. But that's just a hypothesis for the moment.)

That aside, if we look a bit deeper at this, we'll see where all of our time is spent (using the "microstate accounting" option to ptime(1), another part of the modifications being putback soon):

# ptime -m ./unaligned
real 6.263908350
user 3.591816850
sys 0.013817200
trap 2.589204200
tflt 0.000000000
dflt 0.000000000
kflt 0.000000000
lock 0.000000000
slp 0.000000000
lat 0.067424450
stop 0.000162500
#

So nearly all of the time that plain ptime charged to sys is not being spent actually doing the useful work of handling the misaligned memory access; it's being spent in trap (about 2.59 seconds, against about 0.014 seconds of sys). This isn't that unexpected, 'cause it's well-known that traps are expensive. But it does demonstrate just how wasteful this is.

So now let's look at the other option, compiling with "-xmemalign=1s". What performance penalty do we pay for this? Here's a comparison with the above aligned version of the program:

# ptime ./aligned
real 0.012762500
user 0.007944150
sys 0.004371950
#
# ptime ./onebyte-aligned
real 0.030157100
user 0.024818150
sys 0.004310100
#
# ptime ./onebyte-unaligned
real 0.030376500
user 0.024850600
sys 0.004306500
#

Okay, so that's reasonable: we end up running roughly 2.4x slower in real time. (Note that the onebyte-aligned and onebyte-unaligned versions run in the same time, since with an assumed alignment of one byte we don't technically have any misaligned accesses.) Of course, for any real application, being able to run more than twice as fast is probably worth investing some time to debug the misaligned memory accesses (no matter how much it might pale in comparison to a 30,000% performance improvement).