Provide a set of basic & advanced atomic operations. More...

Data Structures
struct	SprAtomicSlot

Functions
void	spr_atomic_nop_MBwrite (void)

void	spr_atomic_nop_MBread (void)

void	spr_atomic_nop_MBfull (void)
	Impose a full (read+write) memory barrier. More...

uint32_t	spr_atomic_get32_MBacquire (volatile uint32_t *ptr)

uint32_t	spr_atomic_twinget32_MBnone (uint32_t ptr, uint32_t val2)

uint64_t	spr_atomic_twingetc48_18_MBnone (uint64_t ptr, uint64_t val2)

void	spr_atomic_set32_MBrelease (volatile uint32_t *ptr, uint32_t value)

void	spr_atomic_twinset32_MBrelease (volatile uint32_t *ptr, uint32_t value1, uint32_t value2)

void	spr_atomic_inc32_MBnone (volatile uint32_t *ptr)

uint32_t	spr_atomic_dec32_MBnone (volatile uint32_t *ptr)

uint32_t	spr_atomic_add32_MBnone (volatile uint32_t *ptr, uint32_t value)

uint32_t	spr_atomic_or32_MBnone (volatile uint32_t *ptr, uint32_t value)

uint32_t	spr_atomic_and32_MBnone (volatile uint32_t *ptr, uint32_t value)

uint32_t	spr_atomic_swap32_MBnone (volatile uint32_t *ptr, uint32_t value)

int	spr_atomic_cas32_MBnone (volatile uint32_t *ptr, uint32_t tst_value, uint32_t new_value)

int	spr_atomic_twincas32_MBnone (volatile uint32_t *ptr, uint32_t tst_value1, uint32_t tst_value2, uint32_t new_value1, uint32_t new_value2)

int	spr_atomic_bitset32_MBnone (volatile uint32_t *ptr, int bit_ndx)

int	spr_atomic_bitclr32_MBnone (volatile uint32_t *ptr, int bit_ndx)

int	spr_atomic_slot_acquire_nowait (SprAtomicSlot *slot)

int	spr_atomic_slot_acquire_wait (SprAtomicSlot *slot)

int	spr_atomic_slot_acquire0_nowait (SprAtomicSlot *slot)

int	spr_atomic_slot_acquire0_wait (SprAtomicSlot *slot)

void	spr_atomic_slot_signal_ready (SprAtomicSlot *slot, int ndx)

int	spr_atomic_slot_signal_ready_acquire0_nowait (SprAtomicSlot *slot, int ndx)

void	spr_atomic_slot_release (SprAtomicSlot *slot, int ndx)

void	spr_atomic_slot_release0 (SprAtomicSlot *slot)
	Release slot 0. More...

uint32_t	spr_atomic_slot_release_nocheck (SprAtomicSlot *slot, int ndx)

void	spr_atomic_slot_release0_check (SprAtomicSlot *slot)

int	spr_atomic_slot_release_check0 (SprAtomicSlot *slot, int ndx)

int	spr_atomic_slot_acquire_all (SprAtomicSlot *slot)

void	spr_atomic_slot_release_all (SprAtomicSlot *slot)

Detailed Description

Provide a set of basic & advanced atomic operations.

The basic atomic operations are provided as a set of macro's. We opted to go with macro's for the basic atomic operations instead of (inline) functions for the following two reasons:

Since not all atomic operations are supported on all platforms, one should check the existence of the requested functionality before using it. Checking the existence of a macro definition at compile time is straight-forward, while checking the existence of a function is near impossible.
Side effects are extremely undesired when working with atomic operations. By going with macro's, it should also be absolutely clear that these atomic operations are fragile.

The more advanced atomic(-like) operations use (inline) function calls.

Basic atomic operations

The name of every basic atomic operation macro follows the following naming convention:

spr_atomic_<op><sz>_MB<mfence>

<op> is the operation and can be any of the following:

nop: No-operation. This will only enforce a memory barrier. Note that the operand size <sz> is not relevant for a nop-operation and hence must be omitted.
get: Read the value from a given memory location (Note: on most CPU's a load is atomic by itself, but there are exceptions; so even if no memory barrier is needed, one should still use the atomic variant when atomicity is required).
twinget: Read a pair of values from a given memory location.
set: Write a given value to a given memory location (Note: on most CPU's a store is atomic by itself, but there are exceptions; so even if no memory barrier is needed, one should still use the atomic variant when atomicity is required).
twinset: Write a pair of values to a given memory location.
inc: Increment the value on a given memory address by one.
dec: Decrement the value on a given memory address by one and test for zero.
add: Add a given value to the value on a given memory address and return the old memory value.
or: Or a value on a given memory address with the given value and return the old memory value.
and: And a value on a given memory address with the given value and return the old memory value.
swap: Swap a given value with the value on a given memory address.
cas: Check if the value at the given memory location (still) equals a certain value, and if so, set to a new value. Return 0 if the update was successful (memory value was updated).
twincas: Check if the two value at the given memory location (still) equal certain values, and if so, set both to a new value. Return 0 if the update was successful (memory value was updated).
bitset: Set a given bit in a bit array to 1 and return the old bit value.
bitclr: Set a given bit in a bit array to 0 and return the old bit value.

<sz> is the operand size and can be any of the following:

08: an 8bits quantity
16: a 16bits quantity
32: a 32bits quantity
64: a 64bits quantity
c48_18: a 48bits and a 18bits quantity (twin-ops)
this allows twin-operations (needed to make performant lock-free algorithms) on platforms that do not support 64bits twin-ops but that only use 48bits pointers, leaving 16bits + 2bits (assuming all pointers are at least 32bits aligned) for a transaction number
48: the 48bits quantity (the pointer) from a c48_18 op
18: the 18bits quantity (the transaction number) from a c48_18 op

All quantities are assumed to be unsigned. Casting to signed (or pointer to signed) will give the desired result.

<mfence> indicates the type of memory synchronisation (barrier) enforced by the atomic operation, and can be any of the following:

none: No memory barrier.
write: Assure that the effect of all write operations from before the atomic operation are visible to the outside world before any of the writes during or after the atomic operation becomes visible. In theory, the write barrier also impose strict ordering between the writes during the atomic operation and those after the atomic operation. Since this is never needed in practice, we dropped this constraint. Typically used to prevent other threads from reading outdated information.
release: Ensure that earlier memory writes become visible to the outside world before the atomic write. In other words, this is a half way write barrier in the sense that it only orders the preceding writes w.r.t to the atomic write; the order of the succeeding writes is left open.
read: Make sure reads after the atomic operation get newer data than any of the reads before or during the atomic operation (only of importance for reads from the outside world). In theory, the read barrier also impose strict ordering between the reads during the atomic operation and those after the atomic operation. Since this is never needed in practice, we dropped this constraint. Typically used to prevent this thread from fetching newer data in older reads (i.e. reads from before this point).
acquire: Ensure that later memory reads from the outside world are started after the atomic read. In other words, this is a half way read barrier in the sense that it only orders the succeeding reads w.r.t to the atomic read; the order of the preceding reads is left open.
ddacq: A special version of the acquire memory barrier. Assures that all subsequent reads that depend on the reads in the atomic operation use the new atomically read values. Only few CPU's (as far as I known, only the DEC-alpha v6) do not autmomatically impose this contraint.
full: A read and write barrier, i.e. enforce that all reads and writes before and after the atomic operation are ordered w.r.t. the atomic operation.

In order the better specify the effect of the memory barriers, lets define six distinct sets of operations:

RB & WB: the set of read and write operations before the atomic operation
RX & WX: the reads and writes in atomic operation
RA & WA: the set of read and write operations after the atomic operation

Then the different memory barriers impose the following ordering:

none: no ordering
write: WB < {WX,WA}
release: WB < WX
read: {RB,RX} < RA
acquire: RX < RA
ddacq: RX < RA(RX)
full: {RB,WB} < {RX,WX} < {RA,WA}

Note: Write/release and read/acquire barriers are usually paired, i.e. the producer of the data issues a release barrier while the consumer uses an acquire barrier.; Use the least restrictive barrier for maximal performance (and if performance is not an issue, use the more standard mutexes). Using the full memory barrier may look like a safe option, but on certain computer architectures this also has a huge impact on performance.

Advanced atomic operations

The advanced atomic operations provided by SPRAAK are:

atomic_slot: Manage a main resource (one that does not support concurency) but that can be backed up by several auxilary resources, e.g. by adding the concurent requests to a todo list (the todo requests are then handled by whoever acquired the main resource). See the spr_atomic_slot_xxx() functions for more details. See msg.c for an example on how to use this system.

Important note on the use of atomic operations

When making multi-threaded applications, two main techniques are used to deal with the problems resulting from concurrent access to shared resources (mostly memory).

The most common (and easiest) way is the use of mutexes (which uses locks, read-write locks, semaphores and conditions to get everything synchronized). Mutexes avoid race conditions by serializing the operations around the critical part of the code (usually when shared data is accessed), i.e. only one thread can execute the protected part of the code. This serialization is also the biggest down-side of mutexes. If the shared data access cannot be isolated in a small and localized fraction of the code, the threads may end-up waiting on each other the better part of the time, hence nullifying any possible speedup when working with multiple threads.

The alternative to mutexes is the use of lock-free algorithms. Such lock-free algorithms use atomic operations to provide coherent access to the shared data (e.g. concurrently accessible queues and hash-tables). Unfortunately, the standard libraries do not yet provide such lock-free algorithms (although there exist proprietary libraries). Furthermore, not all data structures are well suited for full lock-free access (e.g. double linked lists). Another (and possibly the biggest) down-side of concurrent algorithms is that they are extremely hard to develop (e.g. bugs are virtually untraceable and correctness proofs are extremely difficult). Only with better hardware support (multiple outstanding transactions in a single atomic operation) lock-free programming will become an option for the non-experts.

SPRAAK provides the atomic operations that allow the creation of concurrent data structures and algorithms. However, given the above mentioned problems, they are used only when absolutley crucial. In most situation, mutexes are preferred.

Note on memory barriers.

In order to guarantee the causality of a program, memory access inside a single CPU-core (read 'thread') will appear to happen in program order. I.e. the compiler or the CPU may reorder memory access but only if this will not affect the outcome of the program. However, to the outside world –this includes any other CPU-cores (read 'threads') accessing the same memory pool– there are no guarantees whatsoever w.r.t. memory access. This is mainly done for efficiency reasons. For example, a core may keep data in its cache and not write it to memory, or it may read data well before its time (pre-fetching), or it may arbitrarily rearrange reads and writes (fusing and/or grouping according to memory bank). In order to still allow concurrency, special instructions have been added to all modern CPU's to ensure that interaction with the outside world happens in an orderly fashion. These memory barriers impose a perceived partial ordering between the memory operations specified on either side of the barrier on all other CPU's of the system. This for example guarantees that all elements in a new list element are initialized and exported to the out-side world before the new list element is added to the list using an atomic operation.

References.: For a good introducation on memory barriers, see http://www.gelato.unsw.edu.au/lxr/source/Documentation/memory-barriers.txt For more information about the why and how of atomic operations and memory bariers, we refer to xxx.pdf For more information about mutexes, we refer to the POSIX pthreads library.

Author: Kris Demuynck

Date: 03/09/2007

Bug:

Not all atomic operations are (or even can be) implemented for all platforms.

Developing concurrent algorithms, choosing the correct memory barrier, and to a certain extent every piece of code that needs atomic operations is difficult and bugs will be extremely difficult to trace. So avoid them unless absolutely necessary.

Data Structures

Functions

Detailed Description

Basic atomic operations

Advanced atomic operations

Important note on the use of atomic operations

Note on memory barriers.