



- > Optimizations, Races, and the Memory Model
- Ordering What: Acquire and Release
- Ordering How: Mutexes, Atomics, and/or Fences
- ▶ Other Restrictions on Compilers and Hardware (→ Bugs)
- Code Gen & Performance: x86/x64, IA64, POWER, ARM, ... ???
- Relaxed Atomics (as time allows)
- Coda: Volatile (as time allows)









# The Talk In One Slide

Don't write a race condition or use non-default atomics and your code will do what you think.

#### Unless you:

(a) use compilers/hardware that can have bugs;

(b) are irresistably drawn to pull Random Big Red Levers; or

(c) are one of Those Folks who long to take over the gears in the Machine.







# Two Key Concepts

- Sequential consistency (SC): Executing the program you wrote.
  - Defined in 1979 by Leslie Lamport as "the result of any execution is the same as if the reads and writes occurred in some order, and the operations of each individual processor appear in this sequence in the order specified by its program"
- Race condition: A memory location (variable) can be simultaneously accessed by two threads, and at least one thread is a writer.
  - Memory location == non-bitfield variable, or sequence of non-zerolength bitfield variables.
  - Simultaneously == without happens-before ordering.



# Converging the SW and HW Models

- Sequential consistency for data race free programs (SC-DRF, or DRF0): Appearing to execute the program you wrote, as long as you didn't write a race condition.
  - Defined in 1990 by Sarita Adve and Mark Hill as "a formalization that prohibits data races in a program. We believe that this allows for faster hardware than an unconstrained synchronization model, without reducing software flexibility much, since a large majority of programs are already written using explicit synchronization operations and attempt to avoid data races."
  - The purpose is to define "a contract between software and hardware where hardware promises to appear sequentially consistent at least to the software that obeys a certain set of constraints which we have called the synchronization model. This definition is analogous to that given by Lamport for sequential consistency in that it only specifies how hardware should appear to software. ... It allows programmers to continue reasoning about their programs using the sequential model of memory."



# Fact of Life

#### Transformations at all levels are **equivalent**.

 $\Rightarrow$  Can reason about all transformations as reorderings of source code loads and stores.











# Fact of Life

# Software MMs have converged on SC for data-race-free programs (SC-DRF).

Java: SC-DRF required since 2005. C11 and C++11: SC-DRF default (relaxed == transitional tool).



# How To Think About Races

Q: While debugging an optimized build, have you ever seen pink elephants?

In a race, one thread can see into another thread with the same view as a debugger.

- Optimizations, Races, and the Memory Model
- Ordering What: Acquire and Release
- Ordering How: Mutexes, Atomics, and/or Fences
- Other Restrictions on Compilers and Hardware (→ Bugs)
- Code Gen & Performance: x86/x64, IA64, POWER, ARM, ... ???
- Relaxed Atomics (as time allows)
- Coda: Volatile (as time allows)















|   | "Plain" Acq/Rel | V | s. | SC Acq/Rel |   |
|---|-----------------|---|----|------------|---|
|   | acquire         | V | V  | acquire    | V |
|   | release         | A | 4  | release    |   |
| 4 | release         |   |    | release    | 4 |
| ▼ | acquire         | V | V  | acquire    | V |





|                           | low Do We Cope With Latency?<br>Concurrency Everywhere                                                                                |                       |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------|-----------------------|
| Strategy                  | Technique                                                                                                                             | Can affect your code? |
| Parallelize<br>leverage   | <b>Pipeline, execute out of order ("OoO"):</b> Launch expensive memory operations earlier, and do other work while waiting.           | Yes                   |
| compute<br>power)         | Add <u>hardware</u> threads: Have other work available for the <i>same CPU core</i> to perform while other work is blocked on memory. | No *                  |
|                           | Instruction cache                                                                                                                     | No                    |
| C <b>ache</b><br>Teverage | Data cache: Multiple levels. Unit of sharing = cache line.                                                                            | Yes                   |
| capacity)                 | <b>Other buffering:</b> Perhaps the most popular is store buffering, because writes are usually more expensive.                       | Yes                   |
| Speculate                 | Predict branches: Guess whether an "if" will be true.                                                                                 | No                    |
| leverage                  | Other optimistic execution: E.g., try both branches?                                                                                  | No                    |
| pandwidth,<br>compute)    | Prefetch, scout: Warm up the cache.                                                                                                   | No                    |

\* But you have to provide said other work (e.g., software threads) or this is useless!



- > Optimizations, Races, and the Memory Model
- Ordering What: Acquire and Release
- Ordering How: Mutexes, Atomics, and/or Fences
- ▶ Other Restrictions on Compilers and Hardware (→ Bugs)
- Code Gen & Performance: x86/x64, IA64, POWER, ARM, ... ???
- Relaxed Atomics (as time allows)
- Coda: Volatile (as time allows)



























- Optimizations, Races, and the Memory Model
- Ordering What: Acquire and Release
- Ordering How: Mutexes, Atomics, and/or Fences
- ▶ Other Restrictions on Compilers and Hardware (→ Bugs)
- Code Gen & Performance: x86/x64, IA64, POWER, ARM, ... ???
- Relaxed Atomics (as time allows)
- Coda: Volatile (as time allows)

















#### The system **must never invent a write to a variable that wouldn't be written to** in an SC execution.

Q: Why?

If you the programmer can't see all the variables that get written to, you can't possibly know what locks to take.















- Optimizations, Races, and the Memory Model
- Ordering What: Acquire and Release
- Ordering How: Mutexes, Atomics, and/or Fences
- Other Restrictions on Compilers and Hardware (→ Bugs)
- Code Gen & Performance: x86/x64, IA64, POWER, ARM, ... ???
- Relaxed Atomics (as time allows)
- Coda: Volatile (as time allows)

# **RECALL:** Fact of Life

# Software MMs have converged on SC for data-race-free programs (SC-DRF).

Java: SC-DRF required since 2005. C11 and C++11: SC-DRF default (relaxed == transitional tool).



| Code | Generation |
|------|------------|
|------|------------|

|                                                       | Load<br>Ordinary SC Atomic |     | S <sup>.</sup><br>Ordinary | <b>tore</b><br>SC Atomic |         | CAS |  |  |
|-------------------------------------------------------|----------------------------|-----|----------------------------|--------------------------|---------|-----|--|--|
| x86/x64                                               | mov                        | mov | mov                        | xchg                     | cmpxchg |     |  |  |
|                                                       |                            |     |                            |                          |         |     |  |  |
|                                                       |                            |     |                            |                          |         |     |  |  |
|                                                       |                            |     |                            |                          |         |     |  |  |
|                                                       |                            |     |                            |                          |         |     |  |  |
|                                                       |                            |     |                            |                          |         |     |  |  |
|                                                       |                            |     |                            |                          |         |     |  |  |
|                                                       |                            |     |                            |                          |         |     |  |  |
| http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html |                            |     |                            |                          |         |     |  |  |



|         | L<br>Ordinary | . <b>oad</b><br>SC At | omic_                                                                                                                      | Store<br>nic Ordinary SC Atomic |      | CAS     |  |  |
|---------|---------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------|---------------------------------|------|---------|--|--|
| x86/x64 | mov           | mov                   | ornic.                                                                                                                     | mov                             | xchg | cmpxchg |  |  |
|         |               |                       |                                                                                                                            |                                 |      |         |  |  |
|         |               |                       | On x86, SC atomic store could also be " <b>mov + mfence</b> "<br>Q: Would it be a good idea for a compiler to choose that? |                                 |      |         |  |  |
|         |               |                       | A: No.                                                                                                                     |                                 |      |         |  |  |
|         |               |                       | (minor) 1. mfence is expensive, and anyway order semantic<br>should be attached to the memory op (not be standalone)       |                                 |      |         |  |  |
|         |               |                       | (major) 2. Everybody on a given platform has to agree on the code gen, at least on compilation boundaries                  |                                 |      |         |  |  |

|         | Load<br>Ordinary SC Atomic |        | S<br>Ordinary | <b>tore</b><br>SC Atomic | CAS             |
|---------|----------------------------|--------|---------------|--------------------------|-----------------|
| (86/x64 | mov                        | mov    | mov           | xchg                     | cmpxchg         |
| A64     | ld                         | ld.acq | st            | st.rel; mf               | cmpxchg.rel; mf |
|         |                            |        |               |                          |                 |

# Code Generation

|         | L<br>Ordinary | <b>oad</b><br>SC Atomic |                                                                                                                                                          | <b>tore</b><br>SC Atomic | CAS                           |  |  |
|---------|---------------|-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------|-------------------------------|--|--|
| x86/x64 | mov           | mov                     | mov                                                                                                                                                      | xchg                     | cmpxchg                       |  |  |
| IA64    | ld            | ld.acq                  | st                                                                                                                                                       | st.rel; mf               | cmpxchg.rel <mark>;</mark> mf |  |  |
|         |               |                         | Q: Why after the store?<br>A: Purely to prevent <b>st.rel</b> + <b>ld.acq</b> reordering, in cas<br>there's a ld.acq to another location coming up soon. |                          |                               |  |  |
|         |               |                         |                                                                                                                                                          |                          |                               |  |  |
|         |               |                         |                                                                                                                                                          |                          |                               |  |  |
|         |               | http://www.c            | l.cam.ac.uk/~pes20/                                                                                                                                      | /con/con0ymanning        | html                          |  |  |

|         | L<br>Ordinary                                                      | <b>.oad</b><br>SC Atomic | S<br>Ordinary                                                | <b>tore</b><br>SC Atomic | CAS                                   |  |
|---------|--------------------------------------------------------------------|--------------------------|--------------------------------------------------------------|--------------------------|---------------------------------------|--|
| x86/x64 | mov                                                                |                          | mov                                                          |                          |                                       |  |
| IA64    | ld                                                                 | ld.acq                   | st                                                           | st.rel; mf               | cmpxchg.rel; mf                       |  |
|         |                                                                    |                          | A: No, that's broken.                                        |                          |                                       |  |
|         |                                                                    |                          |                                                              |                          | C atomic store as " <b>mf + st</b> "? |  |
|         |                                                                    | (m                       | najor) 1. <b>Id.acq</b> and <b>st.rel</b> are a package deal |                          |                                       |  |
|         | (major) 2. That wouldn't prevent <b>st.rel + ld.acg</b> reordering |                          |                                                              |                          |                                       |  |

| Code | Generation |
|------|------------|
|------|------------|

|         | L        | oad                            | S        | tore       | CAS                                                                                  |
|---------|----------|--------------------------------|----------|------------|--------------------------------------------------------------------------------------|
|         | Ordinary | SC Atomic                      | Ordinary | SC Atomic  |                                                                                      |
| x86/x64 | mov      |                                | mov      |            |                                                                                      |
| IA64    | ld       | ld.acq                         | st       | st.rel; mf | cmpxchg.rel; mf                                                                      |
| POWER   | ld (     | sync; Id;<br>cmp; bc;<br>isync | st       | sync; st   | <pre>sync; loop: lwarx; cmp;<br/>bc _exit; stwcx.; bc<br/>_loop; isync; _exit:</pre> |

| Herb | Sutter |
|------|--------|
|      |        |

|         | l<br>Ordinary | <b>.oad</b><br>SC Atomic       | S<br>Ordinary                                       | <b>tore</b><br>SC Atomic | CAS                                                                                  |
|---------|---------------|--------------------------------|-----------------------------------------------------|--------------------------|--------------------------------------------------------------------------------------|
| x86/x64 | mov           | mov                            | mov                                                 | xchg                     | cmpxchg                                                                              |
| IA64    | ld            |                                | st                                                  |                          |                                                                                      |
| POWER   | ld            | sync; ld;<br>cmp; bc;<br>isync | st                                                  | sync; st                 | <pre>sync:_loop: lwarx; cmp;<br/>bc _exit; stwcx.; bc<br/>_loop; isync; _exit:</pre> |
|         |               | Y                              | ou can <i>almo</i><br>away with<br><b>lwsync</b> he | an                       |                                                                                      |

| Code         |                                           | eratior                        |                                                                                 |                          |                                                                                      |
|--------------|-------------------------------------------|--------------------------------|---------------------------------------------------------------------------------|--------------------------|--------------------------------------------------------------------------------------|
|              | L<br>Ordinary                             | <b>oad</b><br>SC Atomic        | S<br>Ordinary                                                                   | <b>tore</b><br>SC Atomic | CAS                                                                                  |
| x86/x64      | mov                                       |                                | mov                                                                             |                          |                                                                                      |
| IA64         | ld                                        | ld.acq                         | st                                                                              | st.rel; mf               | cmpxchg.rel; mf                                                                      |
| POWER        | ld                                        | sync; ld;<br>cmp; bc;<br>isvnc | st                                                                              | sync; st                 | <pre>sync; loop: lwarx; cmp;<br/>bc _exit; stwcx.; bc<br/>_loop; isync; _exit:</pre> |
| Q: Why is t  | his bad? an                               | d how bad?                     |                                                                                 |                          |                                                                                      |
| A: Heavy cos | st on loads i                             | is anathema.                   |                                                                                 | State -                  |                                                                                      |
| primary r    | ruction is ha<br>eason wh<br>s exist in C | ny relaxed "                   | ()<br>()<br>()<br>()<br>()<br>()<br>()<br>()<br>()<br>()<br>()<br>()<br>()<br>( | /cpp/cpp0xmapping.       | s.html                                                                               |

1

|         | L<br>Ordinary | <b>.oad</b><br>SC Atomic                     | S<br>Ordinary | <b>tore</b><br>SC Atomic | CAS                                                                                                                            |
|---------|---------------|----------------------------------------------|---------------|--------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| x86/x64 | mov           |                                              | mov           |                          |                                                                                                                                |
| IA64    | ld            |                                              | st            |                          |                                                                                                                                |
| POWER   | ld            | <b>sync;</b> ld;<br>cmp; bc;<br><b>isync</b> | st st         |                          |                                                                                                                                |
| ARM v7  | ldr           | ldr dmb                                      | str           | dmb; str;<br>dmb         | <b>dmb; _loop:</b> Idrex roldval,<br>[rptr]; mov rres, 0; teq<br>roldval, rold; strexeq rres,<br>rnewval, [rptr]; teq rres, 0; |





# Memory synchronization **actively works against** important modern hardware optimizations.

 $\Rightarrow$  Want to do **as little as possible**.

#### Fact of Life

Software MMs have converged on SC for data-race-free programs (SC-DRF). Hardware MMs are disadvantaged unless SC acquire/release

are the primary HW MM instructions.

|         |     | <b>oad</b><br>SC Atomic                      | S<br>Ordinary | <b>tore</b><br>SC Atomic | CAS |
|---------|-----|----------------------------------------------|---------------|--------------------------|-----|
| x86/x64 | mov |                                              | mov           |                          |     |
| IA64    | ld  |                                              | st            |                          |     |
| POWER   | ld  | <b>sync;</b> ld;<br>cmp; bc;<br><b>isync</b> | st            |                          |     |
| ARM v7  | ldr | ldr; <b>dmb</b>                              | str           |                          |     |
| ARM v8  | ldr | Idra                                         | str           | strl                     |     |

|          | Load           |                  | Store      |                      | CAS                       |
|----------|----------------|------------------|------------|----------------------|---------------------------|
|          | Ordinary       | SC Atomic        | Ordinary   | SC Atomic            |                           |
|          |                |                  |            |                      |                           |
| ARM CP   | Us: In Oct     | 2011. ARM ar     | nounced n  | ew <b>"SC load a</b> | cquire" and "SC store     |
|          |                | •                |            |                      | ture (32-bit and 64-bit). |
| NB: Indu | ustry first. A | And very new     | – no annou | nced silicon ye      | et from ARM or partners.  |
|          |                |                  |            |                      |                           |
| ۱RM GP   | Us: Currer     | ntly have a stru | onger mem  | orv model (fu        | IV SC). ARIVI has         |













### Fact of Life

#### Relaxed: Don't do it.

Data point from Hans Boehm:

"I would emphasize that we've taken great care that **without relaxed** atomics, 'simultaneously' really means what you thought it did."

























A: Increment can be relaxed (not a publish operation).
Decrement can be acq\_rel (both acq+rel necessary, probably sufficient).

























#### It's always legal to **reduce** the set of possible executions.

Example: **a=1; a=2;** → **a=2;** 

 $\Rightarrow$  "as if" thread always ran really fast, window never exercised. OK because window **wasn't guaranteed to ever be exercised**,

so no valid code in another thread could rely on it.





### Hans Boehm

The difference between acq\_rel and seq\_cst is generally whether the operation is required to participate in the single global order of sequentially consistent operations.

This has subtle and unintuitive effects.

The fences in the current standard may be the most experts-only construct we have in the language.









#### Roadmap

- > Optimizations, Races, and the Memory Model
- Ordering What: Acquire and Release
- Ordering How: Mutexes, Atomics, and/or Fences
- ▶ Other Restrictions on Compilers and Hardware (→ Bugs)
- Code Gen & Performance: x86/x64, IA64, POWER, ARM, ... ???
- Relaxed Atomics (as time allows)
- Coda: Volatile (as time allows)







| Ordered Atomics vs. Unoptimizable Vars                                      |                                                                                                                   |                                                                                                                                                                                           |  |  |  |  |
|-----------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|
|                                                                             | Inter-thread synchronization                                                                                      | External memory locations (e.g., HW reg)                                                                                                                                                  |  |  |  |  |
|                                                                             | $\Rightarrow$ Ordered atomic (atomic <t>)</t>                                                                     | ightarrow Unoptimizable variable (C/C++ volatile)                                                                                                                                         |  |  |  |  |
| Atomic, all-or-<br>nothing?                                                 | Yes, either for types T up to a certain size (Java and ≈.NET) or for all T (ISO C++)                              | <b>No</b> , in fact sometimes they cannot be naturally atomic (e.g., HW registers that must be unaligned or larger than CPU's native word size)                                           |  |  |  |  |
| Reorder/invent/elide<br>ordinary memory<br>ops across these<br>special ops? | Some (1): in one direction<br>only, down across an ordered<br>atomic load or up across an<br>ordered atomic store | <b>Some (2)</b> : one reading of the standard is<br>"like I/O"; another is that ordinary loads<br>can move across a volatile load/store in<br>either direction, but ordinary stores can't |  |  |  |  |
| Reorder/invent/elide<br>these special ops<br>themselves?                    | <b>Some</b> optimizations are<br>allowed, such as combining<br>two adjacent stores to the<br>same location        | <b>No</b> optimization possible; the compiler is<br>not allowed to assume it knows anything<br>about the type<br>not even $v = 1$ ; $r1 = v$ ; $\rightarrow v = 1$ ; $r1=1$ ;             |  |  |  |  |

## The Talk In One Slide

Don't write a race condition or use non-default atomics and your code will do what you think.

#### Unless you:

(a) use compilers/hardware that can have bugs;

(b) are irresistably drawn to pull Random Big Red Levers; or

(c) are one of Those Folks who long to take over the gears in the Machine.





# A Few Good Optimizations



Programmer (Tom Cruise): Kernel hardware, did you reorder the code I wrote?Judge: You don't have to answer that question.Compiler/Processor/Cache (Jack Nicholson): I'll answer the question.