Art of Softsynth Development: Using SSE in C++ without the hassle

Jul 28, 2013

In the previous article we arrived at the conclusion that processing audio samples as 64-bit doubles using SSE has some clear advantages, the most important being that it simplifies development. However, SSE usage is tedious without a few abstractions. So let’s abstract a bit.

Intrinsics

One way to use SSE instructions is to write your program manually in assembly, integrating with C++ code using either an external assembler or inline assembly code. There are two disadvantages to this approach:

  1. The abstraction level is really low, making it unnecessarily tedious work.
  2. You can’t rely on the compiler to do any optimizations. If you’re an expert assembly programmer, this might work in your favor, but for the other 99% of us this really isn’t an advantage.

What you should do instead is use C/C++ SSE intrinsics. I hope most of you are familiar with them, but if not, let me quickly introduce them.

SSE intrinsics are a set of datatypes, functions and macros which allow you to work with SSE data as if it were any regular old datatype. SSE intrinsics are available in all major C/C++ compilers, as long as you include the right header file (<emmintrin.h> for the SSE2 double-precision operations we'll be using).

Here’s an example of how it works.

#include <emmintrin.h> // SSE2 intrinsics

__m128d a = _mm_set_pd(1.0, 2.0);
__m128d b = _mm_set_pd(0.5, 8.0);
__m128d x = _mm_set_pd(3.0, 4.0);

// y = a * x + b
__m128d y = _mm_add_pd(_mm_mul_pd(a, x), b);

Although it doesn’t look like much, we did gain just a bit of abstraction here. We can now work with SSE data as regular C++ datatypes, meaning we can leave optimizations and C++ integration to the compiler.

If we were to stop here, I couldn’t in good conscience recommend SSE for softsynth programming, unless absolutely necessary for performance reasons. Fortunately, there is one more step which absolutely saves the day.

C++ operator overloading

Operator overloading in C++ is a controversial feature. If you’re doubtful, let me assure you that what we’re going to do now is perfectly fine. I hope it is obvious to most of you. I have to admit that I didn’t come up with this independently, I picked it up from rygorous’ posts on optimizing occlusion culling in a 3D software renderer.

We’re going to create a new datatype sample_t which is a very thin wrapper around the __m128d intrinsic SSE datatype. It represents a pair of 64-bit doubles - one for each stereo channel. We’ll use a C++ union to our advantage so we can access the data both as a __m128d and as two doubles:

union sample_t
{
	double  d[2];
	__m128d v;
};

Now we get to the magic part - overloading the arithmetic operators:

union sample_t
{
	sample_t& operator=(const sample_t& other) { v = other.v; return *this; }

	sample_t& operator+=(const sample_t& other) { v = _mm_add_pd(v, other.v); return *this; }
	sample_t& operator-=(const sample_t& other) { v = _mm_sub_pd(v, other.v); return *this; }
	sample_t& operator*=(const sample_t& other) { v = _mm_mul_pd(v, other.v); return *this; }
	sample_t& operator/=(const sample_t& other) { v = _mm_div_pd(v, other.v); return *this; }
};

inline sample_t operator+(const sample_t& a, const sample_t& b) { return sample_t(_mm_add_pd(a.v, b.v)); }
inline sample_t operator-(const sample_t& a, const sample_t& b) { return sample_t(_mm_sub_pd(a.v, b.v)); }
inline sample_t operator*(const sample_t& a, const sample_t& b) { return sample_t(_mm_mul_pd(a.v, b.v)); }
inline sample_t operator/(const sample_t& a, const sample_t& b) { return sample_t(_mm_div_pd(a.v, b.v)); }

Here we just map the arithmetic operators directly to the corresponding intrinsic SSE functions, meaning that operators will be compiled directly down to SSE instructions.

If you’re a little concerned that the binary operators outside the class return by value, don’t be. This is the canonical way to implement this, and it will be optimized by the compiler. So don’t try to “fix” this if you’re inexperienced at operator overloading.

As a last step, let me (read: rygorous) suggest a few constructors and helper functions:

union sample_t
{
	sample_t() {}
	explicit sample_t(const double x) : v(_mm_set1_pd(x)) {}
	sample_t(const __m128d& in) : v(in) {}
	sample_t(const sample_t& x) : v(x.v) {}
	sample_t(double x1, double x2) : v(_mm_set_pd(x1, x2)) {}

	static sample_t zero()                    { return sample_t(_mm_setzero_pd()); }
	static sample_t load(const double* ptr)   { return sample_t(_mm_load_pd(ptr)); }
	static sample_t loadu(const double* ptr)  { return sample_t(_mm_loadu_pd(ptr)); }
	void store(double* ptr)                   { _mm_store_pd(ptr, v); }
	void storeu(double* ptr)                  { _mm_storeu_pd(ptr, v); }
};

Constructing with a single double sets both doubles to that value. I have marked this constructor with the explicit keyword, meaning that the compiler will not automatically convert doubles to sample_ts:

sample_t s = 1.0;

This will fail to compile. Instead you have to explicitly call the constructor:

sample_t s = sample_t(1.0);

I was on the fence about this for a while. There are arguments for (readability) and against (you might end up with more sample_t values than you realize). In the end, it came down to performance: _mm_set1_pd is quite slow, so you should keep it out of tight loops. Making this conversion easy to spot, and not introducing it without consideration, carried the argument in favor of explicit.

loadu and storeu are for unaligned access, which you should generally try to avoid.

Usage example

Now, finally, let me demonstrate the power of SSE intrinsics with operator overloading. This is the processing code for a one-pole lowpass filter which uses this approach:

// buf and f are sample_t class members, k1 is a constant defined outside the function for performance
inline sample_t ZOnepoleFilter::Lowpass(const sample_t& input)
{
	sample_t out = (buf + f*input) / (k1 + f);
	buf = f*(input-out) + out;
	return out;
}

This very simple and concise piece of code

  1. Expresses the filter in completely natural C++ syntax.
  2. Handles both stereo channels with no loops and no code duplication, expressing only the raw computation.
  3. Compiles down to highly efficient SSE instructions, letting the compiler do all the heavy lifting of data shuffling and optimization.
  4. Doesn’t have to worry about denormals, because they will be flushed to zero once the control flags from the bonus section below are set (as opposed to FPU code, which has to handle denormals manually).

This was the main takeaway. Now let’s cover a few details.

Alignment

In order for SSE code to work, __m128d values must be 16-byte aligned; the aligned loads and stores will crash on misaligned addresses. If you haven’t handled alignment manually before, let me give you a few pointers.

You can specify 16-byte alignment with the __declspec(align(16)) modifier. However, for our purposes this is unnecessary: the compiler already knows that __m128d must be 16-byte aligned, and it automatically propagates that alignment to any class containing one. This goes all the way up the hierarchy, so if class A contains a class B which contains a class C which contains a __m128d, all classes in the hierarchy will themselves be 16-byte aligned. That is the simplest way to guarantee the alignment of the embedded __m128d. Well, as long as you’re allocating on the stack!

If you’re allocating on the heap, you’re in trouble. In order for this to work, you must make sure that classes of types A, B and C from the example above are all 16-byte aligned when they’re allocated.

There’s no way to specify alignment for the builtin new/delete operators. Fortunately, you can (and must!) override the new/delete operators, and make sure they return aligned instances of classes (if you’re in a demoscene context, you’re probably already overloading new/delete, right?). There are two routes you can go:

  1. Override global new/delete so everything is 16-byte aligned on allocation.
  2. Override new/delete per class.

Option 1 is much simpler than option 2, so use that if you can. However, if your code will ever have to work as a static or a dynamic library (for instance as a VSTi plugin) you probably have to go with option 2. The simplest solution in option 2 is to create a base class that every class, struct and union will inherit new/delete operator overloading from.

Bonus: C++ code for setting FPU flags to flush denormals

Here is a very handy little C++ class for setting the SSE control register (MXCSR - loosely, the "FPU flags") to flush denormals.

// RAII FPU state class, sets FTZ and DAZ and rounding, no exceptions 
// Adapted from code by mystran @ kvraudio
// http://www.kvraudio.com/forum/viewtopic.php?t=312228&postdays=0&postorder=asc&start=0

class ZFPUState
{ 
private:
	unsigned int sse_control_store; 

public: 
	enum Rounding 
	{
		kRoundNearest = 0, 
		kRoundNegative, 
		kRoundPositive, 
		kRoundToZero, 
	}; 

	ZFPUState(Rounding mode = kRoundToZero) 
	{
		sse_control_store = _mm_getcsr(); 

		// bits: 15 = flush to zero | 6 = denormals are zero 
		// bitwise-OR with exception masks 12:7 (exception flags 5:0) 
		// rounding 14:13, 00 = nearest, 01 = neg, 10 = pos, 11 = to zero 
		// The enum above is defined in the same order so just shift it up 
		_mm_setcsr(0x8040 | 0x1f80 | ((unsigned int)mode << 13)); 
	} 

	~ZFPUState() 
	{
		// clear exception flags, just in case (probably pointless) 
		_mm_setcsr(sse_control_store & (~0x3f)); 
	} 
}; 

Because other parts of your system might change these flags, it’s recommended to tread lightly here. If you’re running in a VST or other plugin context, set the flags when you’re asked to do your processing, and reset them when you’re done. For instance, as part of my VST code I have the following:

void VSTInstrument::processReplacing(float** inputs, float** outputs, VstInt32 sampleFrames)
{
	// Before processing, set rounding flags, and denormal flags
	ZFPUState fpuState(ZFPUState::kRoundToZero);

	// Do actual processing.. 

	// FPU flags are automatically reset when leaving scope
}

Also remember that these flags are thread-specific, so if you have multiple threads in your synth, you have to set them in each thread.