High Level Transforms for SIMD and low-level computer vision algorithms
L. Lacassagne, D. Etiemble, A. Zaharee, A. Dominguez, P. Vezolle
extended version of [PPoPP/WPMVP 2014]
[email protected]
www.lip6.fr

Context & HLT

‣ Context:
- implementation of 2D stencils and convolutions for image processing

‣ Observation: cache overflow
- SIMD+OpenMP is not enough to get and sustain a high level of performance (here: cycles per point vs image size)
- after cache overflow (a capacity problem), the performance is divided by 3 (at least!)

‣ HLT (High Level Transforms)
- to get better performance before cache overflow (CO)
- to sustain performance after CO (Intel mobile proc, ARM Cortex A9)

‣ Presentation in 3 points
- algorithm presentation
- algorithm optimization with HLT
- benchmarks of Intel, IBM and ARM machines

[Figure: Nehalem. Execution time in cycles per point (cpp, the lower the better) vs image size (400 to 2000): performance before the cache overflow area :-) and after cache overflow :-(]

Harris Point of Interest detector

[Figure: Nopipe operator graph. The input image I goes through GradX and GradY to produce Ix and Iy; Mul produces Ixx, Ixy and Iyy; Gauss smooths them into Sxx, Sxy and Syy; Coarsity combines them into the output K]

‣ Harris is representative of low-level image processing
- combination of point and convolution operators = 2D stencils; results can scale across a wide range of codes using 2D stencils and convolutions
- arithmetic intensity is low => memory-bound algorithm, like many image processing algorithms
- neither SIMD nor OpenMP will change that!
- apply High Level Transforms to get a higher level of performance

operator   MUL+ADD    LOAD+STORE   arithmetic intensity (AI)
Nopipe     22+35=57   54+9=63      0.90
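The AI column is consistent with the ratio of arithmetic operations to memory accesses (a standard definition, matching the counts above):

$$\mathrm{AI} = \frac{\mathrm{MUL} + \mathrm{ADD}}{\mathrm{LOAD} + \mathrm{STORE}} = \frac{22 + 35}{54 + 9} = \frac{57}{63} \approx 0.90$$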

HLT = algorithmic transforms

‣ First set of HLT:
- Operator Fusion (named Halfpipe and Fullpipe), see the C sketch below
  • to avoid memory accesses to temporary arrays (less stress on the memory buses)
  [diagram: X -> F1 -> Y followed by Y -> F2 -> Z becomes X -> (F1 fused with F2) -> Z]
- Convolution decomposition with column-wise reduction (named Red)
  • to reduce both the arithmetic complexity and the amount of memory accesses
- can be done in {scalar, SIMD} x {mono-thread, multi-thread}

‣ Second set of HLT
- Operator pipeline + circular buffers (to store temporary data) with modular addressing (Mod)
  • increases spatial & temporal locality
  • but the memory layout must be transformed
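To make the fusion idea concrete, here is a minimal C sketch with hypothetical placeholder point operators f1 and f2 (not the Harris operators): the unfused version materializes the temporary array Y, while the fused version keeps the intermediate value in a register.

// placeholder point operators (for illustration only)
static inline float f1(float x) { return 2.0f * x; }
static inline float f2(float x) { return x + 1.0f; }

// Nopipe: the intermediate result Y is a full temporary array
// => n extra stores and n extra loads on the memory buses
void nopipe(const float *X, float *Y, float *Z, int n) {
    for (int i = 0; i < n; i++) Y[i] = f1(X[i]);   // F1: X -> Y
    for (int i = 0; i < n; i++) Z[i] = f2(Y[i]);   // F2: Y -> Z
}

// Fused (F2 after F1): the intermediate value is scalarized into a
// register, so the temporary array and its memory traffic disappear
void fused(const float *X, float *Z, int n) {
    for (int i = 0; i < n; i++) {
        float y = f1(X[i]);
        Z[i] = f2(y);
    }
}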

operator fusion: transform rules

1. operators are described with a producer-consumer model, with an input pattern and an output pattern
   [diagram: a point-to-point operator has a (1x1)->(1x1) pattern; a convolution kernel / stencil has a (3x3)->(1x1) pattern]

2. pattern adaptation: the input pattern of the 2nd operator = the output pattern of the 1st operator; patterns can be combined with pattern transformations: factorization / decomposition / combination

3. scalarization: temporary results are stored into registers (variables instead of array cells), see the C sketch below

case #1: (1x1)->(1x1) o (1x1)->(1x1) = (1x1)->(1x1)
case #2: (3x3)->(1x1) o (1x1)->(1x1) = (3x3)->(1x1)
case #3: (1x1)->(1x1) o (3x3)->(1x1) = (3x3)->(1x1)
case #4: (3x3)->(1x1) o (3x3)->(1x1) = (5x5)->(1x1)

[diagrams: when a point operator F feeds a 3x3 stencil G, the fused operator applies F on each of the 9 inputs of G (the "9F" pattern); composing two 3x3 stencils yields a single 5x5 stencil]
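As an illustration of rule 3, here is a minimal C sketch for a 3x3 stencil whose single output feeds a point operator; the 3x3 kernel is a placeholder sum and G is a placeholder point operator (not the Harris operators), and rows are accessed through Iliffe-style row pointers.

// placeholder point operator (for illustration only)
static inline float G(float x) { return x * x; }

// fused (3x3)->(1x1) stencil followed by a (1x1)->(1x1) operator:
// the stencil result stays in a register, no temporary array is written
void fused_stencil_point(float **X, float **Z, int h, int w) {
    for (int i = 1; i < h - 1; i++) {
        for (int j = 1; j < w - 1; j++) {
            float f = X[i-1][j-1] + X[i-1][j] + X[i-1][j+1]   // 3x3 input
                    + X[i  ][j-1] + X[i  ][j] + X[i  ][j+1]   // pattern, accumulated
                    + X[i+1][j-1] + X[i+1][j] + X[i+1][j+1];  // in a register
            Z[i][j] = G(f);   // G consumes the stencil output directly
        }
    }
}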

Harris & operator fusion

[Figure: operator graphs of the four versions. Nopipe: all operators separate, with temporary arrays Ix, Iy, Ixx, Ixy, Iyy, Sxx, Sxy, Syy. Halfpipe1 and Halfpipe2: two different ways of fusing the chain into two pipelined stages. Fullpipe: the whole chain from I to K fused into a single operator]

operator        MUL+ADD     LOAD+STORE   AI
Nopipe+Red      5+27=32     21+9=30      1.10
Halfpipe2+Red   5+27=32     12+4=16      2.0
Halfpipe1+Red   11+27=73    27+3=30      2.4
Fullpipe+Red    29+82=111   5+1=6        18.5

memory bound? computation bound?
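One standard way to answer this question (a roofline-style reading, not spelled out on the slide) is to compare the operator's AI with the machine balance, i.e. peak compute rate divided by peak bandwidth (the AI column of the processor table on the next slides), with both intensities expressed in the same units:

$$\mathrm{AI}_{\mathrm{op}} < \mathrm{AI}_{\mathrm{machine}} = \frac{\text{peak compute}}{\text{peak bandwidth}} \;\Longrightarrow\; \text{memory bound; otherwise compute bound}$$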

Macro Meta Programming

• Level #1
‣ one set of macros for each SIMD instruction set
- IBM Altivec (aka VMX), SPX
- Intel SSE, AVX, AVX2, KNC
- ARM Neon
- ST Microelectronics VECx

‣ arithmetic: vec_add, vec_sub, vec_mul, vec_fmadd, vec_fmsub, vec_set
  I/O: vec_load1D, vec_store1D, vec_load2D, vec_store2D
  permutation: vec_left1, vec_right1

• Level #2
‣ one set of Harris operators: GRADIENTX, GRADIENTY, MUL, GAUSS, COARSITY
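As an illustration of the two levels (a minimal sketch; the actual macro set, spellings and signatures may differ), the same generic names can be mapped at compile time onto SSE or Neon intrinsics, and an operator is then written once on top of them:

#ifdef __SSE__
  #include <xmmintrin.h>
  typedef __m128 vfloat;
  #define vec_add(a, b)         _mm_add_ps((a), (b))
  #define vec_load1D(p, i)      _mm_loadu_ps((p) + (i))
  #define vec_store1D(p, i, v)  _mm_storeu_ps((p) + (i), (v))
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
  typedef float32x4_t vfloat;
  #define vec_add(a, b)         vaddq_f32((a), (b))
  #define vec_load1D(p, i)      vld1q_f32((p) + (i))
  #define vec_store1D(p, i, v)  vst1q_f32((p) + (i), (v))
#endif

// level-#2 style operator written once on top of the level-#1 macros
// (hypothetical point operator: elementwise addition of two rows;
//  n is assumed to be a multiple of the 128-bit vector width, 4 floats)
void add_rows(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        vfloat va = vec_load1D(a, i);
        vfloat vb = vec_load1D(b, i);
        vec_store1D(c, i, vec_add(va, vb));
    }
}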

Three families of processors

‣ Three processor families: Intel, IBM and ARM
- Intel (Core and Xeon): Penryn, Nehalem, SandyBridge, IvyBridge, Haswell + preliminary results on Xeon Phi
- IBM: PowerPC 970MP, Power6, Power7, Power7+
- ARM: Cortex-A9 (TI OMAP4), Cortex-A15 (Samsung Exynos 5)
- Prediction: depending on the processor's arithmetic intensity, the cache overflow magnitude will be high or low (and likewise for the impact of HLT)

‣ SIMD & fairness: 128-bit only, to be fair with ARM
- the paper is about HLT impact, not architecture comparison (except SIMD multi-cores vs GPU)
- SIMD = {SSE, Altivec, Neon} without FMA, compilers = {icc, xlc, armgcc}
- all codes are fully parallelized with SIMD & OpenMP
- Xeon Phi codes use 512-bit MIC instructions

proc         # cores   GHz    GFlops   BW (GB/s)   AI
Cortex A9    1x2       1.2    4.8      1.2         4.0
Cortex A15   1x2       1.7    13.6     5.8         2.3
PowerPC      2x2       2.5    40       5.4         7.4
Power7+      4x8       3.8    486      265         1.8
Nehalem      2x4       2.67   85.1     22          3.9
IvyBridge    2x12      2.7    518.4    92          5.6
Xeon Phi     1x61      1.33   1298     170         7.6

Benchmark: Halfpipe & Fullpipe with reduction

[Figure: cpp vs image size for the Nopipe, Halfpipe1, Halfpipe2 and Fullpipe versions (all +Red) on six machines: Penryn, PowerPC, Cortex-A9, IvyBridge, Power7, Cortex-A15]

- metric: cpp = cycles per pixel, vs matrix size [32x32 .. 4096x4096]
- for all machines: data fit in the cache longer, and Halfpipe1 is always faster than Halfpipe2
- depending on the processor's AI, Fullpipe is faster than Halfpipe1
- Halfpipe2 and Halfpipe1 are memory bound => cache overflow still happens :-(

Modular addressing #1

‣ Iliffe matrix [1961]:
- popularized by Numerical Recipes in C
- offset addressing = an array of pointers to the rows
- can easily manage the apron for stencils / convolutions
- hypotheses: p-way parallelism (p cores, p threads); k = 3 for (k x k) stencils / convolutions

[Figure: three layouts: a classic Iliffe matrix; an Iliffe matrix + 1 set of circular buffers (mono-threaded version); an Iliffe matrix + 2 sets of circular buffers (multi-threaded version), with external and internal aprons]

‣ Circular buffer with modular addressing (see the C sketch below)
- mono-thread: 1 set of 3-row circular buffers for a (3x3) stencil: T[i] points to row i mod 3
- multi-thread: p sets of 3-row circular buffers; for the "internal apron", T[i] points to the different sets of circular buffers

‣ Hack
- the memory layout is transparent to the user ... even for the multi-threaded version :-) (thanks to offset addressing)
- one only has to write a pipelined version of the operators
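A minimal mono-threaded C sketch of this layout (hypothetical helper, not the authors' code): the logical row table has one entry per image row, but only k physical rows are allocated, and T[i][j] can be used as if T were a full image.

#include <stdlib.h>

// Iliffe-style row-pointer table whose i-th entry aliases physical
// row (i mod k) of a k-row circular buffer (k = 3 for a 3x3 stencil)
float **alloc_circular_rows(int height, int width, int k) {
    float  *buf  = malloc((size_t)k * width * sizeof *buf);   // k physical rows
    float **rows = malloc((size_t)height * sizeof *rows);     // logical row pointers
    for (int i = 0; i < height; i++)
        rows[i] = buf + (size_t)(i % k) * width;               // modular addressing
    return rows;   // release with free(rows[0]); free(rows);
}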

Modular addressing #2

‣ Example:
- the input and output arrays X and Y are classical arrays
- the temporary array T is a set (or sets) of circular buffers
- pipeline of two convolution operators: F1 and F2
- apron processing is not detailed, in order to simplify

- Prolog: initiate the filling of T
  • call operator F1 (k-1) times: steps #0 and #1 produce 2 rows
- Data-flow processing: for each input row, 1 output row
  • after each call of F1, there is enough data to call F2
  • repeat until the end (size / p)

[Figure: X (input array), T (temporary circular buffer), Y (output array); steps #0 and #1 (prolog: F1 fills T), then steps #2.a/#2.b and #3.a/#3.b (data-flow: F1 and F2 alternate)]
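Putting the prolog and the data-flow steady state together, the pipeline can be sketched in C as below (assumptions: 3x3 kernels, h >= 5, X and Y are classic Iliffe matrices, T comes from the circular-row helper of the previous slide, the convolution is a placeholder 3x3 average, and apron handling is omitted as on the slide).

// one output row of a 3x3 convolution from three input rows
// (placeholder kernel: plain 3x3 average; apron columns zeroed)
static void conv3_row(const float *r0, const float *r1, const float *r2,
                      float *out, int w) {
    for (int j = 1; j < w - 1; j++)
        out[j] = (r0[j-1] + r0[j] + r0[j+1]
                + r1[j-1] + r1[j] + r1[j+1]
                + r2[j-1] + r2[j] + r2[j+1]) / 9.0f;
    out[0] = out[w-1] = 0.0f;
}

// Y = F2(F1(X)), with T stored in a 3-row circular buffer:
// T[i], T[i-1], T[i-2] always map to 3 distinct physical rows
void pipeline_f1_f2(float **X, float **T, float **Y, int h, int w) {
    // prolog: (k-1) = 2 calls of F1 to fill the circular buffer (steps #0, #1)
    conv3_row(X[0], X[1], X[2], T[1], w);
    conv3_row(X[1], X[2], X[3], T[2], w);
    // data-flow steady state: one call of F1, then one call of F2, per row
    for (int i = 3; i <= h - 2; i++) {
        conv3_row(X[i-1], X[i],   X[i+1], T[i],   w);   // F1: new row of T
        conv3_row(T[i-2], T[i-1], T[i],   Y[i-1], w);   // F2: one row of Y
    }
}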

Xeon Phi: preliminary results

‣ Observations
- the programming model is easy (shared memory + SIMD)
- Mod results are not presented: some problems with memory and multi-threading remain to be fixed (with VTune) ...
- the HLT transforms are efficient: x5.5
- peak bandwidth is reached

[Figure: Xeon Phi, cpp vs size for the Nopipe, Halfpipe2, Halfpipe1 and Fullpipe versions]

Xeon Phi: cpp, GFlops and bandwidth for 2048x2048 images

            Nopipe   Half2+Red   Half1+Red   Full+Red   speedup
cpp         0.99     0.81        0.35        0.18       x5.5
GFlops      65.8     52.5        144.4       820.2
BW (GB/s)   306      105         182         177

Benchmark: Modular addressing

[Figure: cpp vs size for Nopipe, Halfpipe1, Halfpipe2, Fullpipe, Halfpipe1+Mod and Halfpipe2+Mod on Power7+, IvyBridge and Cortex-A15]

             cpp                        HLT speedups
proc         no      red     mod        red     mod     total
Cortex A9    86.4    31.2    9.4        x2.8    x3.3    x9.2
Cortex A15   34.1    13.9    5.6        x2.5    x2.5    x6.1
PowerPC      75.7    18.0    10.2       x4.2    x1.8    x7.4
Power7+      1.62    0.40    0.21       x4.1    x1.9    x7.7
Nehalem      10.5    2.95    0.50       x3.6    x5.9    x21.0
IvyBridge    5.30    0.65    0.15       x8.2    x4.3    x35.3
Xeon Phi     0.99    0.18    -          x5.5    -       -

- Mod codes sustain the performance longer after the initial cache overflow
- on IvyBridge, even for Halfpipe1+Red+Mod, cache overflow still happens ...
- the total speedups are very high: x9.2 & x7.7 for the Cortex A9 & Power7+, up to x35.3 for IvyBridge (this depends on the processor's AI)
- the Fullpipe rank is a clue to the processor's peak power (the lower the rank, the higher the peak power)

Multi-core SIMD versus GPU

‣ GPU
- benchmark done on a GTX 580 and estimated for Titan and K40, for 2048x2048 images
- the Texture versions use free bi-linear interpolation to reduce computations; the tile size has been optimized (exhaustive search) for the Shared-memory version
- HLT are also efficient for GPU: x5.6

‣ Observations:
- GPPs match GPU performance thanks to HLT
- for Harris, the Xeon Phi behaves like a GPU: minimize communications => Fullpipe is the fastest version

execution time (ms) for 2048x2048 images

version        Cortex A15   IvyBridge   Power7+   Xeon Phi
No             84.1         8.23        1.79      3.12
Half+Red       34.2         1.01        0.44      1.10
Full+Red       60.3         1.26        1.5       0.57
Half+Red+Mod   13.7         0.23        0.23      -
gain           x6.1         x35.3       x7.7      x5.5

version   memory   GTX 580   Titan (est.)   K40 (est.)
No        global   6.52      2.29           2.41
Half      Tex      2.24      0.79           0.83
Full      Tex      1.4       0.49           0.52
Full      Shared   1.16      0.41           0.43
gain               x5.6      x5.6           x5.6

Conclusion & future works

‣ Conclusion
- huge impact of High Level Transforms for SIMD multicore GPPs, GPUs and the Xeon Phi
- can scale across a wide range of codes using 2D stencils and convolutions
- done by hand, as compilers can't vectorize the Red versions
- GPPs match GPUs with HLT

‣ Future works
- improve Xeon Phi performance for the Mod versions
- benchmark upcoming machines (Xeon Haswell, Power8, Cortex A57, ...)
- apply HLT to complex algorithms (image stabilization, tracking)
- Harris code as a reference for benchmarking?

We are looking for access to new machines (through NDA?) and collaborations.

Thanks!