Message Passing – You’re Already Doing It, part 2

In part 1 we discussed two reasons why remote method calls have never been successfully hidden under the covers of normal OOP-style programming: remote method calls may be orders of magnitude slower than local ones, and they may fail catastrophically while the calling system carries on. Alternatives such as Erlang, Akka and the MailboxProcessor in F# have popped up that make the potentially remote and asynchronous nature of these operations explicit using message passing. However, are such frameworks at all relevant for systems running on a single computer?

A “single computer” used to refer to a system with one execution unit (processor). Multiprocessor systems have been used for a long time in heavy-duty servers, but they belonged to a specialist domain and were usually used to serve a large number of users rather than to run a single application quickly. When the PC world went multi-core, all that changed, and programmers everywhere were forced to care. It is now rare to find a computer with a single logical processor; high-end Intel chips such as the i7 include six cores and offer 12 logical execution units using hyper-threading. The term “single computer” therefore has to be redefined. It can be seen as a system of processing units all connected to a shared, “fast” temporary storage (RAM). However, as we shall see, fast and shared may be competing goals.
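As an aside, you can ask the thread library how many logical execution units the machine you are on exposes. A minimal sketch using Boost (which the code below relies on anyway):

#include <iostream>
#include <boost/thread.hpp>

int main()
{
    // Logical execution units (cores times hardware threads per core)
    // visible to the operating system; may be 0 if it cannot be determined.
    unsigned units = boost::thread::hardware_concurrency();
    std::cout << "Logical execution units: " << units << std::endl;
    return 0;
}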

Let us examine what the real cost of shared memory is. To avoid being fooled by optimisers, garbage collectors and other mechanisms that are usually our friends in high level languages, we will have to turn to C++ and even (*gasp*) assembler. To measure the cost of using shared memory, a very basic computation will be used: incrementing a counter 1,000,000,000 times, either using a single processor or using multiple cores concurrently. First, we need a simple profiling function:

#include <cstdint>
#include <iostream>
#include <vector>
#include <boost/thread.hpp>
#include <boost/timer/timer.hpp>

using namespace std;

void runWorkers(
    void (*worker)(uint32_t*, uint32_t),
    int concurrency,
    uint32_t iterations)
{
    uint32_t counter = 0;           // shared by all workers
    boost::timer::auto_cpu_timer t; // reports timings when it goes out of scope
    vector<boost::thread*> threads(concurrency);
    for (int i = 0; i < concurrency; i++) {
        threads[i] = new boost::thread(worker, &counter, iterations);
    }
    for (int i = 0; i < concurrency; i++) {
        threads[i]->join();         // wait for each worker to finish
        delete threads[i];
    }
    cout << "All done! Count is now " << counter << endl;
}

(Don’t you just love C++ function pointer type declarations?)

The above takes some function and runs it across the requested number of threads. When started, each thread is handed a pointer to the shared counter in memory and the number of iterations to execute. Finally, runWorkers waits for every thread to complete and, via boost::timer::auto_cpu_timer, reports timings for the whole sequence.
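(As an aside: with a C++11 compiler the same harness can be written without raw pointers and manual delete using std::thread. The sketch below, with the made-up name runWorkers11, is only an illustration of that; it is not the version the measurements in this post were produced with.)

#include <thread> // C++11 threads

// Hypothetical C++11 flavour of the harness above, shown for comparison only.
void runWorkers11(
    void (*worker)(uint32_t*, uint32_t),
    int concurrency,
    uint32_t iterations)
{
    uint32_t counter = 0;
    boost::timer::auto_cpu_timer t;
    std::vector<std::thread> threads;
    for (int i = 0; i < concurrency; i++) {
        threads.emplace_back(worker, &counter, iterations);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    std::cout << "All done! Count is now " << counter << std::endl;
}

Next, let’s throw some work at it.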

void incrementShared(uint32_t *counter, uint32_t iterations)
{
    __asm {
        mov ecx, iterations // Count the iterations
        mov edx, counter    // Remember pointer to shared location
$loop:  inc DWORD PTR [edx] // In each iteration, increment variable in memory
        loop $loop          // Repeat until ECX is 0
    }
}

As promised, there’s assembler, in this case Intel-style x86 assembler embedded into Visual C++. Ignoring the two setup instructions, we end up with a tight loop that runs the inc (increment) instruction the specified number of times, incrementing our variable in memory on each iteration.
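For reference, the plain C++ equivalent would be something along the following lines. The reason for not using it in the measurements is that an optimiser is entirely within its rights to keep the counter in a register or collapse the whole loop into a single addition, which is exactly the kind of interference the inline assembler avoids (the function below is only an illustration):

// Plain C++ sketch of what incrementShared does. Not used for the
// measurements: the compiler may hoist *counter into a register or fold
// the loop into "*counter += iterations".
void incrementSharedPlain(uint32_t *counter, uint32_t iterations)
{
    for (uint32_t i = 0; i < iterations; i++) {
        (*counter)++;
    }
}

Let us take it for a spin, starting out in the traditional way with a single execution unit: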

    cout << "Single worker, shared increment, 1000000000 iterations ..." << endl;
    runWorkers(incrementShared, 1, 1000000000);

which, in the test environment used, results in:

Single worker, shared increment, 1000000000 iterations ...
All done! Count is now 1000000000
 2.700709s wall, 2.558416s user + 0.000000s system = 2.558416s CPU (94.7%)

2.7 seconds to add one a billion times may be impressive in its own right, but with all those cores available to us, we should be able to do better. The test environment happens to have four cores, making four threads the ideal number for parallelising a CPU-bound workload. We will split the one billion iterations into equal quarters:

    cout << "4 workers, shared increment, 250000000 iterations each ..." << endl;
    runWorkers(incrementShared, 4, 250000000);

which outputs:

4 workers, shared increment, 250000000 iterations each ...
All done! Count is now 401385749
 1.318384s wall, 3.541223s user + 0.000000s system = 3.541223s CPU (268.6%)

You may (or may not) be surprised by the results. We did get a speedup, but only roughly a doubling of the throughput, not quite the theoretical four times we were hoping for. Worse, we appear to have lost our ability to count! That harmless “inc”, even though it is a single assembler instruction, does not execute atomically with regard to updating memory. Instead, the inner workings of the processor have pipelined it into a series of operations – reading from memory, incrementing the value and writing it back into memory. Store buffers, speculative execution, caching and the phase of the moon work together to ensure that each core “sometimes” increments the value in memory correctly but often overwrites it with a value that is already stale, cancelling the updates made by other cores.
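In other words, as far as memory is concerned the single instruction behaves more like the following three-step sequence, and two cores are free to interleave their steps (a conceptual sketch with a made-up helper name, not what the hardware literally executes):

// Conceptual expansion of "inc DWORD PTR [edx]" as seen by memory.
// Illustration only; the real micro-operations are more involved.
void incrementOnce(uint32_t *counter)
{
    uint32_t tmp = *counter; // 1. load the current value from shared memory
    tmp = tmp + 1;           // 2. increment it locally
    *counter = tmp;          // 3. store it back, possibly overwriting a newer
                             //    value written by another core in the meantime
}

If two cores both load the same value, both add one and both store the result, the counter only moves forward by one even though two increments were executed, which is where the missing updates above went.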

To solve this problem, we need some way of protecting our variable in memory from concurrent changes while we read, increment and write it back. For larger chunks of code, this may be done using monitors, mutexes, critical sections or similar threading primitives. The cheapest way in our case, however, is to go straight to the processor and ask it to do the job for us using the lock prefix. This is the underlying primitive behind Interlocked.Increment in .NET, AtomicInteger in Java and std::atomic in C++11.
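For comparison, a C++11 std::atomic version of the worker might look like the sketch below. It is shown for orientation only; the measurements that follow use the inline assembler version, but on x86 fetch_add compiles down to the same kind of lock-prefixed read-modify-write instruction:

#include <atomic> // needed for std::atomic

// Sketch of the worker using std::atomic instead of inline assembler.
void incrementSharedAtomic(std::atomic<uint32_t> *counter, uint32_t iterations)
{
    for (uint32_t i = 0; i < iterations; i++) {
        counter->fetch_add(1); // emits a lock-prefixed instruction on x86
    }
}

Let’s try the lock prefix on our problem: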

void incrementSharedLocked(uint32_t *counter, uint32_t iterations)
{
    __asm {
        mov ecx, iterations      // Count the iterations
        mov edx, counter         // Remember pointer to shared location
$loop:  lock inc DWORD PTR [edx] // Increment, atomically this time
        loop $loop               // Repeat until ECX is 0
    }
}

Note that the only difference between this and the previous version is the lock prefix, which simply tells the processor that we care about concurrent updates to the shared memory address.
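The call has the same shape as before, just swapping in the locked worker, something along the lines of:

    runWorkers(incrementSharedLocked, 4, 250000000);

The output when running this, in the same style as above: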

4 workers, locked increment, 250000000 iterations
All done! Count is now 1000000000
 77.033589s wall, 282.50221s user + 0.03120s system = 282.53341s CPU (366.8%)

We can count again, which is good, but look at the timings! What previously took just over a second now takes 77! Obviously that innocent-looking lock prefix wasn’t so harmless after all, slowing us down by a factor of almost sixty (77 s versus 1.3 s), close to two orders of magnitude. Let’s try running the exact same thing on a single thread, a billion times, instead of four threads with a quarter of a billion each.
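The call follows the same pattern, along the lines of:

    runWorkers(incrementSharedLocked, 1, 1000000000);

which gives: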

Single worker, locked increment, 1000000000 iterations
All done! Count is now 1000000000
 10.354197s wall, 10.046464s user + 0.000000s system = 10.046464s CPU (97.0%)

Yes, the lock prefix does appear to cost us a bit on its own, but now we’re seeing a slowdown of a factor of four, not sixty. Something else is going on here. Next time: what exactly is going on, why you should care about cache coherency and, finally, how all of this relates to message passing.