High Level Transforms for SIMD and low-level computer vision algorithms
L. Lacassagne, D. Etiemble, A. Zaharee, A. Dominguez, P. Vezolle
extended version of [PPoPP/WMVP 2014]
[email protected]
www.lip6.fr
Context & HLT
‣ Context:
- implementation of 2D stencils and convolutions for image processing
‣ Observation: cache overflow
- SIMD+OpenMP is not enough to get and sustain a high level of performance (here: cycles per point vs image size)
- after cache overflow (a capacity problem), performance is divided by 3 (at least!)
‣ HLT (High Level Transforms)
- to get better performance before cache overflow
- to sustain performance after cache overflow (Intel mobile proc, ARM Cortex A9)
‣ Presentation in 3 points
- algorithm presentation
- algorithm optimization with HLT
- benchmarks of Intel, IBM and ARM machines
[Figure: execution time in cycles per point (cpp, the lower the better) vs image size on Nehalem, before and after the cache overflow area]
Harris Point of Interest detector
[Figure: Nopipe dataflow graph: I -> GradX/GradY -> Ix, Iy; Mul -> Ixx, Ixy, Iyy; Gauss -> Sxx, Sxy, Syy; Coarsity -> K]
‣ Harris is representative of low-level image processing
- combination of point and convolution operators = 2D stencils
- can scale across a wide range of codes using 2D stencils and convolutions
- arithmetic intensity is low => memory-bound algorithm, like many image processing algorithms
- neither SIMD nor OpenMP will change that!
- apply High Level Transforms to get a higher level of performance

operator                | Nopipe
MUL+ADD                 | 22+35=57
LOAD+STORE              | 54+9=63
arithmetic intensity AI | 0.90
HLT = algorithmic transforms
‣ First set of HLT:
- Operator Fusion (named Halfpipe and Fullpipe)
• to avoid memory accesses to temporary arrays (less stress on memory buses)
[Figure: (X -> F1 -> Y) followed by (Y -> F2 -> Z) fuses into X -> (F1,F2) -> Z, eliminating the temporary array Y]
- Convolution decomposition with column-wise reduction (named Red)
• to reduce both arithmetic complexity and the amount of memory accesses
- can be done in {scalar, SIMD} x {mono-thread, multi-thread}
‣ Second set of HLT:
- Operator pipeline + circular buffer (to store temporary data) with modular addressing (Mod)
• increases spatial & temporal locality
- but the memory layout must be transformed
operator fusion: transform rules
1. operators are described with a producer-consumer model, with an input pattern and an output pattern
- a point-to-point operator has pattern (1x1)->(1x1); a convolution kernel / stencil has pattern (3x3)->(1x1)
2. pattern adaptation: the input of the 2nd operator = the output of the 1st operator; patterns can be combined with pattern transformations: factorization / decomposition / combination
3. scalarization: temporary results are stored into registers (variables instead of array cells)
case #1: (1x1)->(1x1) o (1x1)->(1x1) = (1x1)->(1x1)
case #2: (3x3)->(1x1) o (1x1)->(1x1) = (3x3)->(1x1)
case #3: (1x1)->(1x1) o (3x3)->(1x1) = (3x3)->(1x1), where F is duplicated 9 times (9F) to feed G
case #4: (3x3)->(1x1) o (3x3)->(1x1) = (5x5)->(1x1)
Harris & operator fusion
[Figure: Harris dataflow graphs for the four fusion variants: Nopipe (no fusion), Halfpipe1 and Halfpipe2 (partial fusions of GradX/GradY, Mul, Gauss and Coarsity), and Fullpipe (the whole chain fused into a single operator producing K from I)]

operator   | Nopipe+red | Halfpipe2+red | Halfpipe1+red | Fullpipe+red
MUL+ADD    | 5+27=32    | 5+27=32       | 11+27=73      | 29+82=111
LOAD+STORE | 21+9=30    | 12+4=16       | 27+3=30       | 5+1=6
AI         | 1.10       | 2.0           | 2.4           | 18.5

memory bound? computation bound?
Macro Meta Programming
• Level #1
‣ one set of macros for each SIMD instruction set
- IBM Altivec (aka VMX), SPX
- Intel SSE, AVX, AVX2, KNC
- ARM Neon
- ST Microelectronics VECx
‣ arithmetic: vec_add, vec_sub, vec_mul, vec_fmadd, vec_fmsub, vec_set
I/O: vec_load1D, vec_store1D, vec_load2D, vec_store2D
permutation: vec_left1, vec_right1
• Level #2
‣ one set of Harris operators: GRADIENTX, GRADIENTY, MUL, GAUSS, COARSITY
Three families of processors
‣ Three processor families: Intel, IBM and ARM
- Intel (Core and Xeon): Penryn, Nehalem, SandyBridge, IvyBridge, Haswell + preliminary results on Xeon Phi
- IBM: PowerPC 970MP, Power6, Power7, Power7+
- ARM: Cortex-A9 (TI OMAP4), Cortex-A15 (Samsung Exynos 5)
- prediction: depending on the processor's Arithmetic Intensity, the cache overflow magnitude will be high or low (same for the impact of HLT)
‣ SIMD & fairness: 128-bit only, to be fair with ARM
- the paper is about HLT impact, not architecture comparison (except SIMD multi-cores vs GPU)
- SIMD = {SSE, Altivec, Neon} without FMA; compilers = {icc, xlc, armgcc}
- all codes are fully parallelized with SIMD & OpenMP
- Xeon Phi codes use 512-bit MIC instructions

proc      | Cortex A9 | Cortex A15 | PowerPC | Power7+ | Nehalem | IvyBridge | Xeon Phi
# cores   | 1x2       | 1x2        | 2x2     | 4x8     | 2x4     | 2x12      | 1x61
GHz       | 1.2       | 1.7        | 2.5     | 3.8     | 2.67    | 2.7       | 1.33
GFlops    | 4.8       | 13.6       | 40      | 486     | 85.1    | 518.4     | 1298
BW (GB/s) | 1.2       | 5.8        | 5.4     | 265     | 22      | 92        | 170
AI        | 4.0       | 2.3        | 7.4     | 1.8     | 3.9     | 5.6       | 7.6
Benchmark: Halfpipe & Fullpipe with reduction
[Figure: cpp vs image size for the No, Half1, Half2 and Full versions on Penryn, PowerPC, Cortex-A9, IvyBridge, Power7 and Cortex-A15]
- metric: cpp = cycles per pixel, vs matrix size [32x32 .. 4096x4096]
- for all processors: data fit in the cache longer, and Halfpipe1 is always faster than Halfpipe2
- depending on the processor's AI, Fullpipe can be faster than Halfpipe1
- Halfpipe2 and Halfpipe1 are memory bound => cache overflow still happens :-(
Modular addressing #1
‣ Iliffe matrix [1961]:
- popularized by Numerical Recipes in C
- offset addressing = array of pointers to rows
- can easily manage the apron for stencils / convolutions
- hypothesis: parallelism p (p cores, p threads); k=3 for (k x k) stencils / convolutions
[Figure: classic Iliffe matrix; Iliffe matrix + 1 set of circular buffers (mono-threaded version); Iliffe matrix + 2 sets of circular buffers (multi-threaded version), with external and internal aprons]
‣ Circular buffer with modular addressing
- mono-thread: 1 set of 3-row circular buffers for a (3x3) stencil: T[i] points to row i mod 3
- multi-thread: p sets of 3-row circular buffers: for the "internal apron", T[i] points to different sets of circular buffers
‣ Hack
- the memory layout is transparent to the user ... even for the multi-threaded version :-) (thanks to offset addressing)
- one only has to write a pipelined version of the operators
Modular addressing #2
‣ Example: pipeline of two convolution operators F1 and F2
- input array X and output array Y remain classical arrays
- temporary array T is a set (or sets) of circular buffers
- apron processing is not detailed, to keep the example simple
‣ Steps:
- prolog: initiate the filling of T by calling F1 (k-1) times (steps #0 and #1), producing 2 rows
- data-flow processing: after each call of F1, there is enough data to call F2; for each input row, 1 output row (steps #2.a/#2.b, #3.a/#3.b, ...)
- repeat until the end (size / p rows)
[Figure: X (input array), T (temporary circular buffer) and Y (output array) during the prolog steps #0 and #1, then the data-flow steps #2.a, #2.b, #3.a, #3.b]
Xeon Phi: preliminary results
‣ Observations
- the programming model is easy (shared memory + SIMD)
- HLT transforms are efficient: x5.5
- peak bandwidth is reached
- Mod results are not presented: some problems with memory and multi-threading remain to fix (with VTune) ...
[Figure: cpp vs size on Xeon Phi for the No, Half2, Half1 and Full versions]

proc      | Nopipe | Half2+red | Half1+red | Full+red | speedup
cpp       | 0.99   | 0.81      | 0.35      | 0.18     | x5.5
GFlops    | 65.8   | 52.5      | 144.4     | 820.2    |
BW (GB/s) | 306    | 105       | 182       | 177      |

Xeon Phi: cpp, GFlops and bandwidth for 2048x2048 images
Benchmark: Modular addressing
[Figure: cpp vs image size for the No, Half1, Half2, Full, Half1+M and Half2+M versions on Power7+, IvyBridge and Cortex-A15]

cpp          | Cortex A9 | Cortex A15 | PowerPC | Power7+ | Nehalem | IvyBridge | Xeon Phi
no           | 86.4      | 34.1       | 75.7    | 1.62    | 10.5    | 5.30      | 0.99
red          | 31.2      | 13.9       | 18.0    | 0.40    | 2.95    | 0.65      | 0.18
mod          | 9.4       | 5.6        | 10.2    | 0.21    | 0.50    | 0.15      | -

HLT speedups | Cortex A9 | Cortex A15 | PowerPC | Power7+ | Nehalem | IvyBridge | Xeon Phi
red          | x2.8      | x2.5       | x4.2    | x4.1    | x3.6    | x8.2      | x5.5
mod          | x3.3      | x2.5       | x1.8    | x1.9    | x5.9    | x4.3      | -
tot          | x9.2      | x6.1       | x7.4    | x7.7    | x21.0   | x35.3     | -

- Mod codes sustain the performance longer after the initial cache overflow
- on IvyBridge, even for Halfpipe1+Red+Mod, cache overflow still happens ...
- the total speedups are very high: x9.2 & x7.7 for A9 & P7+, up to x35.3 for IVB (depends on the processor's AI)
- Fullpipe rank is a clue to processor peak power (the lower the rank, the higher the peak power)
Multi-core SIMD versus GPU
‣ GPU
- benchmark done on a GTX 580, and estimated for Titan and K40, for 2048x2048 images
- Texture versions use free bi-linear interpolation to reduce computations
- tile size has been optimized (exhaustive search) for the Shared memory version
- HLT are also efficient for GPU: x5.5
‣ Observations:
- GPP match GPU performance thanks to HLT
- for Harris, Xeon Phi behaves like a GPU: minimize communications => Fullpipe is the fastest version

cpp          | Cortex A15 | IvyBridge | Power7+ | Xeon Phi
No           | 84.1       | 8.23      | 1.79    | 3.12
Half+Red     | 34.2       | 1.01      | 0.44    | 1.10
Full+Red     | 60.3       | 1.26      | 1.5     | 0.57
Half+Red+Mod | 13.7       | 0.23      | 0.23    | -
gain         | x6.1       | x35.3     | x7.7    | x5.5

cpp, memory  | GTX 580 | Titan (est.) | K40 (est.)
No, global   | 6.52    | 2.29         | 2.41
Half, Tex    | 2.24    | 0.79         | 0.83
Full, Tex    | 1.4     | 0.49         | 0.52
Full, Shared | 1.16    | 0.41         | 0.43
gain         | x5.6    | x5.6         | x5.6
Conclusion & future works
‣ Conclusion
- huge impact of High Level Transforms for SIMD multi-core GPP, GPU and Xeon Phi
- can scale across a wide range of codes using 2D stencils and convolutions
- done by hand, as compilers can't vectorize the Red versions
- GPP match GPU with HLT
‣ Future works
- improve Xeon Phi performance for the Mod versions
- benchmark upcoming machines (Xeon Haswell, Power8, Cortex A57, ...)
- apply HLT to complex algorithms (image stabilization, tracking)
- Harris code as a reference for benchmarking?
We are looking for access to new machines (through NDA?) and collaboration.
Thanks!