# The Bandwidth Benchmark

This is a collection of simple streaming kernels for teaching purposes.
It is heavily inspired by John McCalpin's STREAM benchmark (https://www.cs.virginia.edu/stream/).

It contains the following streaming kernels with their corresponding data access patterns (Notation: S - store, L - load, WA - write allocate):

* init (S1, WA): Initialize an array. Store only.
* sum (L1): Vector reduction. Load only.
* copy  (L1, S1, WA): Classic memcopy.
* update (L1, S1): Update a vector. Also load + store but without write allocate.
* triad (L2, S1, WA): Stream triad - `a = b + c * scalar`.
* daxpy (L2, S1): Daxpy - `a = a + b * scalar`.
* striad (L3, S1, WA): Schoenauer triad - `a = b + c * d`.
* sdaxpy (L3, S1): Schoenauer triad without write allocate - `a = a + b * c`.
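
To make the access-pattern notation concrete, here is a minimal C sketch of four of the kernels. The function names and signatures are illustrative assumptions, not the benchmark's actual source, and the `update` formula (`a = a * scalar`) is assumed because the list above does not spell it out.

```c
#include <assert.h>
#include <stddef.h>

/* sum (L1): one load per iteration, no store. */
double sum(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* update (L1, S1): a is loaded before it is stored, so its cache line
 * is already present and no write allocate is needed.
 * The formula a = a * scalar is an assumption for illustration. */
void update(double *a, double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] * scalar;
    }
}

/* triad (L2, S1, WA): b and c are loaded, a is only written, so the
 * store to a typically triggers a write allocate. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i] * scalar;
    }
}

/* daxpy (L2, S1): a appears on both sides, so it counts as one of the
 * two loads and the store needs no write allocate. */
void daxpy(double *restrict a, const double *restrict b,
           double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + b[i] * scalar;
    }
}
```

The pairs copy/update, triad/daxpy and striad/sdaxpy differ mainly in whether the target array is also read, which is what decides if a write allocate occurs.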

As an added benefit, the code is a blueprint for a minimal benchmarking application with a generic makefile and modules for aligned array allocation, accurate timing and affinity settings. These components can be used standalone in your own project.
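
As an illustration of what an accurate timing module can look like, the following sketch wraps the POSIX monotonic clock. The function names `get_timestamp` and `get_timer_resolution` are hypothetical and are not taken from the benchmark's sources.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <time.h>

/* Return a monotonic timestamp in seconds (hypothetical helper). */
double get_timestamp(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* avoid an unused-variable warning when NDEBUG is set */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}

/* Return the resolution reported for the monotonic clock in seconds. */
double get_timer_resolution(void)
{
    struct timespec res;
    int rc = clock_getres(CLOCK_MONOTONIC, &res);
    assert(rc == 0);
    (void)rc;
    return (double)res.tv_sec + (double)res.tv_nsec * 1.0e-9;
}
```

A kernel would be timed by taking `get_timestamp()` before and after the loop and subtracting the two values. A monotonic clock is used here so that wall-clock adjustments cannot distort the measurement.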

## Build

1. Configure the toolchain to use in the `Makefile`:
```
TAG = GCC  # Supported GCC, CLANG, ICC
```

2. Review the flags for the toolchain in the corresponding included file, e.g. `include_GCC.mk`. OpenMP is disabled by default; you can enable it by uncommenting the OpenMP flag:
```
OPENMP   = -fopenmp
```
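
As a rough sketch of what the OpenMP flag changes, a streaming loop annotated as below runs threaded when compiled with `-fopenmp` and serially otherwise, because the pragma is ignored without it. The names `copy_parallel` and `max_threads` are hypothetical and only serve this example.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of threads a parallel region may use; 1 without OpenMP. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* copy (L1, S1, WA): the loop iterations are split statically across
 * the threads when OpenMP is enabled. */
void copy_parallel(double *restrict a, const double *restrict b, size_t n)
{
    assert(a != NULL && b != NULL);
#pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i];
    }
}
```

Static scheduling gives every thread one contiguous chunk of the arrays, which is the usual choice for bandwidth measurements.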

3. Adjust the options set in `config.mk`:
```
OPTIONS  =  -DSIZE=40000000ull
OPTIONS +=  -DNTIMES=10
OPTIONS +=  -DARRAY_ALIGNMENT=64
#OPTIONS +=  -DVERBOSE_AFFINITY
#OPTIONS +=  -DVERBOSE_DATASIZE
#OPTIONS +=  -DVERBOSE_TIMER
```

The verbosity options enable detailed output about affinity settings, allocation sizes and timer resolution.
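
The sketch below shows one plausible way such `-D` options are consumed in C: a fallback default per macro, and an aligned allocation that uses `ARRAY_ALIGNMENT`. The helper name `allocate_array` and the fallback values are assumptions for illustration; the benchmark's own allocation module may differ.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Fallbacks used only if the build does not pass -DSIZE=... etc. */
#ifndef SIZE
#define SIZE 40000000ull
#endif
#ifndef NTIMES
#define NTIMES 10
#endif
#ifndef ARRAY_ALIGNMENT
#define ARRAY_ALIGNMENT 64
#endif

/* Allocate n doubles aligned to `alignment` bytes (hypothetical helper).
 * posix_memalign requires a power of two that is a multiple of
 * sizeof(void *). Returns NULL on failure; release with free(). */
double *allocate_array(size_t alignment, size_t n)
{
    void *ptr = NULL;
    assert(alignment % sizeof(void *) == 0);
    assert((alignment & (alignment - 1)) == 0);
    if (posix_memalign(&ptr, alignment, n * sizeof(double)) != 0) {
        return NULL;
    }
    return (double *)ptr;
}
```

A benchmark array would then be obtained with `allocate_array(ARRAY_ALIGNMENT, SIZE)`. An alignment of 64 bytes matches the cache line size of many current CPUs.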

4. Build with:
```
make
```

You can build with multiple toolchains in the same directory, but note that the Makefile only acts on the toolchain currently set. Intermediate build results are located in the `<TOOLCHAIN>` directory.

To output the executed commands use:
```
make Q=
```

5. Clean up with:
```
make clean
```
to clean intermediate build results.

```
make distclean
```
to clean intermediate build results and binary.

6. (Optional) Generate assembler:
```
make asm
```
The assembler files will also be located in the `<TOOLCHAIN>` directory.

## Usage

To run the benchmark call:
```
./bwbench-<TOOLCHAIN>
```

The benchmark prints its results in a format similar to the STREAM benchmark. Results are validated.
For threaded execution it is recommended to control thread affinity.

We recommend using `likwid-pin` for benchmarking:
```
likwid-pin -c 0-3 ./bwbench-GCC  
```

Example output for threaded execution:
```
-------------------------------------------------------------
[pthread wrapper] 
[pthread wrapper] MAIN -> 0
[pthread wrapper] PIN_MASK: 0->1  1->2  2->3  
[pthread wrapper] SKIP MASK: 0x0
        threadid 140271463495424 -> core 1 - OK
        threadid 140271455102720 -> core 2 - OK
        threadid 140271446710016 -> core 3 - OK
OpenMP enabled, running with 4 threads
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Init:        14681.5000       0.0110       0.0109       0.0111
Sum:         20634.9290       0.0079       0.0078       0.0082
Copy:        18822.2827       0.0172       0.0170       0.0176
Update:      28135.9717       0.0115       0.0114       0.0117
Triad:       19263.0634       0.0253       0.0249       0.0268
Daxpy:       26718.1377       0.0182       0.0180       0.0187
STriad:      21229.4470       0.0305       0.0301       0.0313
SDaxpy:      26714.3897       0.0243       0.0240       0.0253
-------------------------------------------------------------
Solution Validates
```
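
To help interpret the `Rate (MB/s)` column, the sketch below computes a rate the way STREAM does: bytes moved divided by the minimum time, scaled to 10^6 bytes per second. That this benchmark uses the same formula, and whether write-allocate traffic is counted, are assumptions not confirmed by this README. The function name `bandwidth_mbs` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM-style rate in MB/s (1 MB = 10^6 bytes).
 * words_per_iter: doubles loaded plus stored per loop iteration,
 *                 e.g. 2 for copy if write allocate is not counted.
 * n:              number of loop iterations (array length).
 * min_time:       fastest measured run in seconds. */
double bandwidth_mbs(size_t words_per_iter, size_t n, double min_time)
{
    assert(min_time > 0.0);
    double bytes = (double)words_per_iter * (double)n * (double)sizeof(double);
    return 1.0e-6 * bytes / min_time;
}
```

For example, a kernel that moves 2 doubles per iteration over 1,000,000 elements in 0.001 s transfers 16 MB, which gives about 16000 MB/s.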