Magnetohydrodynamic Turbulence in Accretion Discs
A test case for petascale computing in astrophysics

Marc Joos, Sébastien Fromang
Collaborators: Pierre Kestener, Geoffroy Lesur, Héloïse Méheut, Daniel Pomarède & Bruno Thooris, Patrick Hennebelle, Andrea Ciardi, Romain Teyssier...

Service d'Astrophysique - CEA Saclay
14/11/2013
M. Joos, Petascale computing in astrophysics
Outline

Introduction
- Accretion discs
- Magneto-rotational instability
Numerical approach
- IBM BlueGene/Q
- Numerical methods & initial conditions
- What challenges?
Results
- Overview
- Power spectra
- Angular momentum transport rate
Parallel I/O
- Why do we care?
- Approaches
- Benchmark
Hybridization
- Why hybridize codes?
- Hybridization of Ramses
- Auto-parallelization
GPU
- Why do we want GPUs?
- OpenACC
Introduction / Accretion discs

Accretion discs: what are they?
- discs of diffuse material (mostly gas)
- rotating around a central object
- observed at all scales:
  - protostars
  - neutron stars
  - supermassive black holes
  - etc.

Angular momentum:
- material accretion ⇒ angular momentum loss
- which mechanism transports angular momentum efficiently?
- ad hoc prescription: ν_t = α c_s H (Shakura & Sunyaev 1973; Lynden-Bell & Pringle 1974)

Fig.: accretion disc illustrations (NASA; Grosso et al. 2003)
Introduction / Magneto-rotational instability

The magneto-rotational instability (MRI): what is it?
- MHD instability (Balbus & Hawley 1991)
- weak B field
- d_r Ω < 0

Fig.: MRI principle

Some dimensionless numbers...
- Magnetic intensity: β ∼ (c_s/v_a)²
- Viscosity: Re = c_s H / ν
- Resistivity: Rm = c_s H / η
- Magnetic Prandtl number: Pm = Rm/Re = ν/η
- Pm ≪ 1 in accretion discs (Balbus & Henri 2008)
Evolution of α with Pm?
Fig.: α as a function of Pm for Re = 400 to 20 000 and β = 10² to 10⁴ (Lesur & Longaretti 2010)
Fig.: α as a function of Pm at fixed Rm = 2600 (α from 0.00 to 0.05, Pm from 10⁻² to 10¹)
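The dimensionless numbers above can be evaluated directly from the disc parameters. A minimal Python sketch, where the numerical values of cs, H, va, nu and eta are illustrative placeholders chosen to reproduce the Re and Rm quoted later in the talk:

```python
# Dimensionless numbers characterizing MRI turbulence (illustrative values).
cs = 1.0             # sound speed (code units)
H = 1.0              # disc scale height
va = 0.05            # Alfven speed (weak field)
nu = 1.0 / 85000.0   # kinematic viscosity -> Re = 85 000
eta = 1.0 / 2600.0   # Ohmic resistivity   -> Rm = 2 600

beta = (cs / va) ** 2   # plasma beta: magnetic field strength
Re = cs * H / nu        # Reynolds number
Rm = cs * H / eta       # magnetic Reynolds number
Pm = Rm / Re            # magnetic Prandtl number = nu / eta

print(beta, Re, Rm, round(Pm, 3))
```

With these values Pm ≈ 0.03, the low-Pm regime the talk argues is relevant for accretion discs.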
Numerical approach / IBM BlueGene/Q

The BlueGene/Q hierarchy: simulations performed on Turing@IDRIS.
Numerical approach / Numerical methods & initial conditions

Local approach

Resolution issue:
- turbulent scale: ℓ_turb ∼ H
- a few hundred fluid elements needed to resolve H
- H/R ∼ 0.1
- ∼25 H to cover the radial range
→ a global simulation is computationally expensive!

The local approach: solve the MHD equations in a co-rotating patch, with or without explicit dissipation, closed by an energy equation or an EOS:

∂_t ρ + ∇·(ρv) = 0                                            (1)
ρ (∂_t v + (v·∇)v) = −∇P + (∇×B)×B − 2ρΩ×v + 2qρΩ₀² x e_x     (2)
∂_t B = ∇×(v×B)                                               (3)
+ energy equation or EOS

The shearing box
Fig.: (a) t = 0, (b) t > 0
Boundary conditions:
- azimuthal direction: periodic
- vertical direction: periodic
- radial direction: periodic in shearing coordinates
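The shearing-periodic radial boundary condition above can be sketched as follows: ghost zones in x are filled from the opposite radial side, shifted azimuthally by the shear displacement accumulated since t = 0. This is a minimal Python sketch with an integer-cell shift (real codes like Ramses interpolate fractional shifts; the sign convention of the shift is an assumption):

```python
import numpy as np

def shearing_periodic_x(u, t, q_omega_lx, dy, ng=1):
    """Fill the x ghost zones of u[x, y] with shearing-periodic values.

    q_omega_lx : q * Omega * Lx, the relative shear velocity of the two
                 radial faces; the azimuthal offset grows as t * q*Omega*Lx.
    ng         : number of ghost cells on each radial side.
    """
    ny = u.shape[1]
    shift = int(round(q_omega_lx * t / dy)) % ny  # shear offset in cells
    # inner ghost zones <- outer active zones, shifted azimuthally
    u[:ng, :] = np.roll(u[-2*ng:-ng, :], -shift, axis=1)
    # outer ghost zones <- inner active zones, shifted the other way
    u[-ng:, :] = np.roll(u[ng:2*ng, :], shift, axis=1)
    return u
```

At t = 0 the shift vanishes and the box is plainly periodic in x, matching panel (a) of the figure.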
Numerical approach / Numerical methods & initial conditions

Numerical methods

The Ramses code (Teyssier 2002; Fromang et al. 2006)
- Finite volume method (Godunov's scheme):
  ∂_t u + ∇·F(u) = 0  ⇒  u_i^{n+1} = u_i^n − (Δt/Δx) (F_{i+1/2}^{n+1/2} − F_{i−1/2}^{n+1/2})
  ⇒ a Riemann problem to solve at each cell interface
- upwind scheme: stable if |aΔt/Δx| ≤ 1
- Constrained transport: using Stokes' theorem, the induction equation becomes
  ∂_t ∫_S B·dS + ∮_L (B×v)·dl = 0
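In its simplest setting, 1D linear advection, the Godunov update above reduces to upwinding: the interface Riemann problem is solved by taking the flux from the upwind cell. A short Python sketch (a > 0 and periodic boundaries are assumed):

```python
import numpy as np

def upwind_step(u, a, dt, dx):
    """One finite-volume update u_i^{n+1} = u_i^n - dt/dx (F_{i+1/2} - F_{i-1/2}).

    For linear advection F(u) = a u with a > 0, the Riemann solution at each
    interface is simply the upwind (left) state; boundaries are periodic.
    """
    assert a > 0 and abs(a * dt / dx) <= 1.0  # CFL stability condition
    F = a * u                                 # flux evaluated in each cell
    # F_{i+1/2} = F_i and F_{i-1/2} = F_{i-1} (upwind interface fluxes)
    return u - dt / dx * (F - np.roll(F, 1))
```

At the stability limit aΔt/Δx = 1 the update shifts the profile by exactly one cell per step, which is a handy sanity check.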
Numerical approach / Numerical methods & initial conditions

Initial conditions

Parameters:
- resolution: 800×1600×832
- ∼800 000 timesteps (∼9 million CPU hours, 25 orbits)
- on 32 768 CPUs (131 072 sub-grids of 25×25×13)
- toroidal B
- homogeneous ρ
- ideal MHD → Pm ∼ 1
- non-ideal MHD: Re = 85 000 & Rm = 2600 ⇒ Pm = 0.03, the highest Re ever reached!
Numerical approach / What challenges?

What challenges at the petascale?

1. Weak scaling

    # CPUs    t_elapsed [s] (2 th./CPU)    t_elapsed [s] (4 th./CPU)
     4 096            ∼0.58                        ∼0.55
     8 192            ∼0.82                        ∼0.78
    32 768            ∼0.84                        ∼0.80

- ∼70% efficiency on 32 768 CPUs
- ∼5% faster with 4 threads/CPU
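Weak-scaling efficiency is the ratio of the per-step elapsed time on the smallest run to that on the largest run (the work per CPU is held fixed). A quick check against the table above:

```python
# Weak scaling: ideally the elapsed time per step stays constant as CPUs
# (and total work) grow; efficiency = t_small / t_large.
t_4096 = 0.58    # elapsed time per step [s] on 4 096 CPUs (2 threads/CPU)
t_32768 = 0.84   # elapsed time per step [s] on 32 768 CPUs

efficiency = t_4096 / t_32768
print(f"{efficiency:.0%}")  # close to the ~70% quoted on the slide
```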
Numerical approach / What challenges?

2. Parallel I/O
- 131 072 MPI processes, no hybridization ⇒ with sequential I/O: 131 072 files to write (and read)!
- different libraries tested, in particular parallel HDF5 and parallel NetCDF; more details later!
- (note however that GPFS holds up well even with so many files to deal with...)
Numerical approach / What challenges?

3. Visualization & data processing

How to visualize the data? (200 GB outputs!)
- high-frequency outputs: sides of the domain, every 3 200 timesteps → fast visualization, 3D movies
- low-frequency outputs: whole domain, every 32 000 timesteps → science (averages, power spectra...)
Results / Overview

Overview
- ideal vs. non-ideal MHD: dissipation effects
  Fig.: B_y in ideal MHD vs. B_y in non-ideal MHD
- kinetic vs. magnetic: dissipation scales
  Fig.: v_z vs. B_y in non-ideal MHD
Results / Power spectra

Power spectra

Kinetic & magnetic energies:
  E_k(k_z) = (1/2) ρ₀ |ṽ(k_z)|²
  E_mag(k_z) = |B̃(k_z)|² / 8π
- E_k ∝ k^{−3/2}

Fig.: E_k and E_m as functions of k (log-log; k from 10⁰ to 10², energies from 10⁻⁹ to 10⁻¹)
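The kinetic spectrum above is built by Fourier transforming the velocity and collecting |ṽ|² per wavenumber. A minimal 1D Python sketch; the normalization convention and rho0 value are illustrative, not necessarily the ones used in the talk:

```python
import numpy as np

def kinetic_spectrum(v, rho0=1.0):
    """E_k(k) = 0.5 * rho0 * |v~(k)|^2 for a 1D periodic velocity field."""
    n = v.size
    v_hat = np.fft.rfft(v) / n           # discrete Fourier amplitudes
    return 0.5 * rho0 * np.abs(v_hat) ** 2

# sanity check: a single Fourier mode of amplitude 2 at wavenumber k = 3
x = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
Ek = kinetic_spectrum(2.0 * np.cos(3.0 * x))
```

With this normalization the cosine mode lands entirely in bin k = 3 with E_k = 0.5, which makes the convention easy to verify before applying it to simulation data.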
Results / Angular momentum transport rate

Angular momentum transport rate

How to measure the turbulence efficiency?
  T_Reynolds = ρ (v_x − v̄_x)(v_y − v̄_y)
  T_Maxwell = − B_x B_y / 4π
⇒ α = (T_Reynolds + T_Maxwell) / P₀

Fig.: α as a function of Pm at Rm = 2600 (α from 0.00 to 0.05, Pm from 10⁻² to 10¹)
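Measuring α from a snapshot then amounts to volume-averaging the two stresses. An illustrative Python sketch; the input fields here are made-up arrays, and the overbars of the slide are taken as box averages:

```python
import numpy as np

def alpha_parameter(rho, vx, vy, bx, by, P0):
    """alpha = <T_Reynolds + T_Maxwell> / P0, with
    T_Reynolds = rho (vx - <vx>)(vy - <vy>) and T_Maxwell = -bx by / (4 pi)."""
    dvx = vx - vx.mean()
    dvy = vy - vy.mean()
    t_reyn = rho * dvx * dvy          # Reynolds stress (velocity fluctuations)
    t_maxw = -bx * by / (4.0 * np.pi) # Maxwell stress (magnetic correlations)
    return (t_reyn + t_maxw).mean() / P0

# illustrative call on random fields
rng = np.random.default_rng(0)
shape = (32, 32, 32)
alpha = alpha_parameter(rho=np.ones(shape),
                        vx=rng.normal(size=shape), vy=rng.normal(size=shape),
                        bx=rng.normal(size=shape), by=rng.normal(size=shape),
                        P0=1.0)
```

For uncorrelated random fields α averages to nearly zero; in an MRI-turbulent box the v_x-v_y and B_x-B_y correlations make it positive.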
Parallel I/O / Why do we care?

Why do we care?

On-going evolution:
- computing power increases;
- the number of cores increases rapidly;
- memory per core stays constant or decreases;
- storage capacity grows faster than access speed.

Consequences:
- the data generated increase with the computing power;
- more cores but less memory per core: more files!
- one-file-per-process approach: saturation of filesystems; heavier pre- & post-processing steps;
- time spent in I/O increases.

⇒ Need for parallel I/O with sustainable performance on supercomputers
Parallel I/O / Approaches

Parallel I/O approaches

Possible approaches:
- sequential I/O
- master process distributing data
- MPI-IO
- Parallel HDF5
- Parallel NetCDF
- ADIOS
- ...
Parallel I/O / Approaches

Parallel I/O approaches: POSIX

Fig.: one file per MPI process (data_CPU 1 → file_CPU 1, ..., data_CPU n → file_CPU n)

    real(8), dimension(xdim,ydim,zdim,nvar) :: data
    character(LEN=80) :: filename

    call get_filename(myrank, 'posix', filename)
    open(unit=10, file=filename, status='unknown', form='unformatted')
    write(10) data
    close(10)
Parallel I/O / Approaches

Parallel I/O approaches: MPI-IO

Fig.: all MPI processes write their data to a single shared file

    integer :: xpos, ypos, zpos, myrank, i
    real(8), dimension(xdim,ydim,zdim,nvar) :: data
    integer, dimension(3) :: boxsize, domdecomp
    character(LEN=13) :: filename
    ! MPI variables
    integer :: fhandle, ierr
    integer :: int_size, double_size
    integer(kind=MPI_OFFSET_KIND) :: buf_size
    integer :: written_arr
    integer, dimension(3) :: wa_size, wa_subsize, wa_start

    ! Create MPI array type
    wa_size    = (/ nx*xdim, ny*ydim, nz*zdim /)
    wa_subsize = (/ xdim, ydim, zdim /)
    wa_start   = (/ xpos, ypos, zpos /)*wa_subsize
    call MPI_Type_Create_Subarray(3, wa_size, wa_subsize, wa_start &
       & , MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, written_arr, ierr)
    call MPI_Type_Commit(written_arr, ierr)
    call MPI_Type_Size(MPI_INTEGER, int_size, ierr)
    call MPI_Type_Size(MPI_DOUBLE_PRECISION, double_size, ierr)

    filename = 'parallelio.mp'
    ! Open file
    call MPI_File_Open(MPI_COMM_WORLD, trim(filename) &
       & , MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fhandle, ierr)

    ! Write data
    buf_size = 6*int_size + xdim*ydim*zdim*double_size*myrank
    call MPI_File_Seek(fhandle, buf_size, MPI_SEEK_SET, ierr)
    call MPI_File_Write_All(fhandle, data(:,:,:,1), xdim*ydim*zdim &
       & , MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr)
    ! Close file
    call MPI_File_Close(fhandle, ierr)
Parallel I/O / Approaches

Parallel I/O approaches: Parallel NetCDF

PnetCDF: Network Common Data Form → self-documented, portable format

Fig.: all processes write to a single NetCDF file (dimensions, variables, data sections)

    integer :: xpos, ypos, zpos, myrank
    real(8), dimension(xdim,ydim,zdim,nvar) :: data
    character(LEN=13) :: filename
    ! PnetCDF variables
    integer(kind=MPI_OFFSET_KIND) :: nxtot, nytot, nztot
    integer :: nout, ncid, xdimid, ydimid, zdimid, vid1
    integer, dimension(3) :: sdimid
    integer(kind=MPI_OFFSET_KIND), dimension(3) :: dims, start, count
    integer :: ierr

    dims = (/ xdim, ydim, zdim /)
    ! Create file
    filename = 'parallelio.nc'
    nout = nfmpi_create(MPI_COMM_WORLD, filename, NF_CLOBBER, MPI_INFO_NULL &
         , ncid)
    ! Define dimensions
    nout = nfmpi_def_dim(ncid, "x", nxtot, xdimid)
    nout = nfmpi_def_dim(ncid, "y", nytot, ydimid)
    nout = nfmpi_def_dim(ncid, "z", nztot, zdimid)
    sdimid = (/ xdimid, ydimid, zdimid /)
    ! Create variable
    nout = nfmpi_def_var(ncid, "var1", NF_DOUBLE, 3, sdimid, vid1)
    ! End of definitions
    nout = nfmpi_enddef(ncid)

    start = (/ xpos, ypos, zpos /)*dims + 1
    count = dims
    ! Write data
    nout = nfmpi_put_vara_double_all(ncid, vid1, start, count, data(:,:,:,1))
    ! Close file
    nout = nfmpi_close(ncid)
Parallel I/O / Approaches

Parallel I/O approaches: Parallel HDF5

HDF5: Hierarchical Data Format → self-documented, hierarchical, portable format

Fig.: all processes write one dataset in a single HDF5 file

    integer :: xpos, ypos, zpos, myrank
    real(8), dimension(xdim,ydim,zdim,nvar) :: data
    character(LEN=13) :: filename
    ! HDF5 variables
    integer :: ierr
    integer(HID_T) :: file_id, fapl_id, dxpl_id
    integer(HID_T) :: h5_dspace, h5_dset, h5_dspace_file
    integer(HSIZE_T), dimension(3) :: start, count, stride, blockSize
    integer(HSIZE_T), dimension(3) :: dims, dims_file

    ! Initialize HDF5 interface
    call H5open_f(ierr)

    ! Create HDF5 property IDs for parallel file access
    filename = 'parallelio.h5'
    call H5Pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierr)
    call H5Pset_fapl_mpio_f(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL, ierr)
    call H5Fcreate_f(filename, H5F_ACC_RDWR_F, file_id, ierr &
       , access_prp=fapl_id)

    ! Select space in memory and file
    dims = (/ xdim, ydim, zdim /)
    dims_file = (/ xdim*nx, ydim*ny, zdim*nz /)
    call H5Screate_simple_f(3, dims, h5_dspace, ierr)
    call H5Screate_simple_f(3, dims_file, h5_dspace_file, ierr)

    ! Hyperslab for selecting data in h5_dspace
    start = (/ 0, 0, 0 /)
    stride = (/ 1, 1, 1 /)
    count = dims
    blockSize = (/ 1, 1, 1 /)
    call H5Sselect_hyperslab_f(h5_dspace, H5S_SELECT_SET_F, start, count &
       , ierr, stride, blockSize)

    ! Hyperslab for selecting location in h5_dspace_file (to set the
    ! correct location in file where we want to put our piece of data)
    start = (/ xpos, ypos, zpos /)*dims
    stride = (/ 1, 1, 1 /)
    count = dims
    blockSize = (/ 1, 1, 1 /)
    call H5Sselect_hyperslab_f(h5_dspace_file, H5S_SELECT_SET_F, start, count &
       , ierr, stride, blockSize)

    ! Enable parallel collective IO
    call H5Pcreate_f(H5P_DATASET_XFER_F, dxpl_id, ierr)
    call H5Pset_dxpl_mpio_f(dxpl_id, H5FD_MPIO_COLLECTIVE_F, ierr)

    ! Create data set
    call H5Dcreate_f(file_id, trim(dsetname), H5T_NATIVE_DOUBLE &
       , h5_dspace_file, h5_dset, ierr, H5P_DEFAULT_F, H5P_DEFAULT_F &
       , H5P_DEFAULT_F)

    ! Finally write data to file
    call H5Dwrite_f(h5_dset, H5T_NATIVE_DOUBLE, data, dims, ierr &
       , mem_space_id=h5_dspace, file_space_id=h5_dspace_file &
       , xfer_prp=dxpl_id)

    ! Clean HDF5 IDs
    call H5Pclose_f(dxpl_id, ierr)
    call H5Dclose_f(h5_dset, ierr)
    call H5Sclose_f(h5_dspace, ierr)
    call H5Sclose_f(h5_dspace_file, ierr)
    call H5Fclose_f(file_id, ierr)
    call H5Pclose_f(fapl_id, ierr)
    call H5close_f(ierr)
Parallel I/O / Approaches

Parallel I/O approaches: ADIOS

    integer :: xpos, ypos, zpos, myrank
    real(8), dimension(xdim,ydim,zdim,nvar) :: data
    character(LEN=17) :: filename
    ! MPI & ADIOS variables
    integer :: adios_err
    integer(8) :: adios_handle, offset_x, offset_y, offset_z
    integer :: xdimglob, ydimglob, zdimglob
    integer :: ierr

    ! Init ADIOS
    call ADIOS_Init("adios_BRIO.xml", MPI_COMM_WORLD, ierr)

    ! Define offset and global dimensions
    offset_x = xdim*xpos
    offset_y = ydim*ypos
    offset_z = zdim*zpos
    xdimglob = xdim*nx; ydimglob = ydim*ny; zdimglob = zdim*nz

    ! Open ADIOS file & write data
    call ADIOS_Open(adios_handle, "dump", "parallelio_XML.bp", "w" &
       & , MPI_COMM_WORLD, ierr)
    ! Write I/O
    #include "gwrite_dump.fh"
    ! Close ADIOS file and interface
    call ADIOS_Close(adios_handle, ierr)
    call ADIOS_Finalize(myrank, ierr)

The I/O variables themselves are defined in the accompanying XML file ("adios_BRIO.xml").
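As a sketch of what that definitions file can look like, an ADIOS 1.x XML configuration for the "dump" group used above might be structured as follows; the variable and dimension names here are illustrative assumptions, not the talk's actual file:

```xml
<?xml version="1.0"?>
<adios-config host-language="Fortran">
  <!-- group "dump" matches the name passed to ADIOS_Open -->
  <adios-group name="dump">
    <var name="xdimglob" type="integer"/>
    <var name="ydimglob" type="integer"/>
    <var name="zdimglob" type="integer"/>
    <var name="offset_x" type="integer"/>
    <var name="offset_y" type="integer"/>
    <var name="offset_z" type="integer"/>
    <!-- the distributed array: a local block of a global array -->
    <global-bounds dimensions="xdimglob,ydimglob,zdimglob"
                   offsets="offset_x,offset_y,offset_z">
      <var name="data" type="double" dimensions="xdim,ydim,zdim"/>
    </global-bounds>
  </adios-group>
  <method group="dump" method="MPI"/>
  <buffer size-MB="100" allocate-time="now"/>
</adios-config>
```

From such a file the ADIOS toolchain generates the gwrite_dump.fh include used in the Fortran listing, which is why no explicit write calls appear there.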
Parallel I/O / Approaches

Parallel I/O approaches

Possible approaches: sequential I/O; master process distributing data; MPI-IO; Parallel HDF5; Parallel NetCDF; ADIOS; ...

    library      ease of use   1 file   portability   self-documented   flexibility   interface
    sequential        ✓           ✗          ✗               ✗               ✓             ✓
    MPI-IO            –           ✓          ✗               ✗               –             ✓
    PHDF5             ✓           ✓          ✓               ✓               ✓             ✓
    PnetCDF           –           ✓          ✓               ✓               ✓             ✓
    ADIOS            ✓✓           ✓          ✓               ✓              ✓✓             –
Parallel I/O / Benchmark

BRIO: a benchmark for parallel I/O

Tested libraries:
- sequential I/O
- MPI-IO
- parallel HDF5
- parallel NetCDF
- ADIOS

What does it do?
- write/read data distributed on a cartesian grid
- compute writing/reading time
- a few parameters: size of the grid; domain decomposition; contiguity of data; XML/noXML interface (for ADIOS only)
- under the GNU GPL license, available at https://bitbucket.org/mjoos
Parallel I/O / Benchmark

Results on Turing (BG/Q)

    # MPI processes   library            contiguous   t_writing [s]
    4 096             sequential              —            9.602
    4 096             HDF5                    —            9.337
    4 096             parallel HDF5           ✗           29.226
    4 096             parallel HDF5           ✓            8.394
    4 096             parallel NetCDF         ✓            5.941
    16 384            sequential              —           12.129
    16 384            parallel HDF5           ✗          109.419
    16 384            parallel HDF5           ✓           12.165
    16 384            parallel NetCDF         ✓            9.557
    131 072           sequential              —          100.197
    131 072           parallel NetCDF         ✓           47.592
Hybridation I Why hybridize codes?
Why hybridize codes? Advantages I
Hybridation I Why hybridize codes?
Why hybridize codes? Advantages:
- better match with modern architectures: interconnected nodes with shared memory
- optimized memory usage: less data duplicated by MPI processes, hence a lower memory footprint
- better I/O performance: fewer simultaneous accesses, fewer operations on bigger datasets, fewer files (without parallel I/O)
- better granularity: an MPI program alternates computing and communication steps; granularity is the ratio between the two, and the larger the granularity, the better the scalability

[Figure: domain decomposition. Pure MPI: each process duplicates its neighbours' boundary data in ghost zones. MPI+OpenMP: one process per node, data shared between threads.]

Example: 800×1600×832 domain, 11 variables (double precision):

# MPI processes | # OpenMP threads | Size (GB)
131 072         | 1                | 197
16 384          | 8                | 135

⇒ memory gain: >30%
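The >30% gain comes almost entirely from ghost-zone duplication. A back-of-the-envelope model (Python, purely illustrative: cubic subdomains and a 3-cell ghost layer are assumptions, not the actual Ramses decomposition) roughly reproduces the table:

```python
def footprint_gb(cells, n_vars, n_proc, ghost=3, bytes_per_val=8):
    # Each MPI process stores its cubic subdomain plus a ghost layer,
    # duplicated from its neighbours, on every face.
    edge = (cells / n_proc) ** (1 / 3)        # subdomain edge [cells]
    per_proc = (edge + 2 * ghost) ** 3        # cells incl. ghost zones
    return n_proc * per_proc * n_vars * bytes_per_val / 1e9

cells = 800 * 1600 * 832                      # global grid from the table
pure_mpi = footprint_gb(cells, n_vars=11, n_proc=131072)  # 1 thread/proc
hybrid   = footprint_gb(cells, n_vars=11, n_proc=16384)   # 8 threads/proc
gain = 1.0 - hybrid / pure_mpi                # ~0.31, i.e. >30%
```

The model lands within a few percent of the 197 GB and 135 GB figures, which suggests ghost zones do account for most of the overhead.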
Hybridation I Hybridation of Ramses
Hybridation of Ramses
Main steps:
1. reorganize the code: from 42 to 10 source files, reorganization of the modules, etc.
2. fine-grain approach: parallelize the outer loops
3. MPI communications

Serial version:

   integer, parameter :: nx=512, ny=512, nz=1024
   real(8), dimension(:,:,:), allocatable :: array
   integer :: i, j, k

   allocate(array(nx,ny,nz))
   do k = 1, nz
      do j = 1, ny
         do i = 1, nx
            array(i,j,k) = (k*j + i)*1.
         enddo
      enddo
   enddo
   deallocate(array)

OpenMP version:

   integer, parameter :: nx=512, ny=512, nz=1024
   real(8), dimension(:,:,:), allocatable :: array
   integer :: i, j, k

   allocate(array(nx,ny,nz))
   !$OMP PARALLEL
   !$OMP DO SCHEDULE(RUNTIME)
   do k = 1, nz
      do j = 1, ny
         do i = 1, nx
            array(i,j,k) = (k*j + i)*1.
         enddo
      enddo
   enddo
   !$OMP END DO
   !$OMP END PARALLEL
   deallocate(array)
Hybridation of Ramses
Preliminary results:

Poincaré@Maison de la Simulation (16 cores per node, Intel Sandy Bridge):

# MPI proc. | # OpenMP threads | t_elapsed [s]
16          | 1                | 17.49
8           | 2                | 17.04
4           | 4                | 16.85
2           | 8                | 20.07

Turing@IDRIS (16 cores per node, PowerPC A2):

# MPI proc. | # OpenMP threads | t_elapsed [s]
2048        | 1                | 355.9
1024        | 2                | 337.9
512         | 4                | 366.1
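In both tables #MPI × #threads is constant (16 cores on Poincaré, 2048 on Turing), so each column of elapsed times compares pure MPI against hybrid mixes at fixed core count. A small illustrative Python helper makes the comparison explicit:

```python
# Elapsed times from the tables above, keyed by OpenMP threads per process
poincare = {1: 17.49, 2: 17.04, 4: 16.85, 8: 20.07}   # 16 cores total
turing   = {1: 355.9, 2: 337.9, 4: 366.1}             # 2048 cores total

def best_mix(times):
    # Thread count giving the lowest elapsed time at fixed core count
    return min(times, key=times.get)
```

On both machines a moderate hybrid mix is marginally fastest (4 threads per process on Poincaré, 2 on Turing), while the most aggressive mix is slower than pure MPI.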
Hybridation I Auto-parallelization
Auto-parallelization: is it worth it?
What is it?
- compilers can detect serial portions of the code that can be multithreaded, which is equivalent to an OpenMP parallelization

How to use it?

compiler | option
Intel    | -parallel
IBM      | -qsmp=auto
PGI      | -Mconcur

How to help your compiler?

compiler | pragma
Intel    | parallel
IBM      | —
PGI      | concur[/noconcur]
Auto-parallelization: is it worth it?
Simple example (the OpenMP loop shown on the previous slide). Results:

Intel compiler:

# threads | t_auto [s] | t_OpenMP [s]
1         | 0.5174     | 0.5187
2         | 0.2454     | 0.2383
4         | 0.1302     | 0.1272
8         | 0.07820    | 0.06650

PGI compiler:

# threads | t_auto [s] | t_OpenMP [s]
1         | 0.6510     | 1.049
2         | 0.3092     | 0.5067
4         | 0.1559     | 0.2580
8         | 0.08200    | 0.1332
Auto-parallelization: is it worth it?
Intermediate example (Jacobi relaxation):

   !$omp parallel shared(a,anew,error,iter)
   do
      error = 0.d0
      !$omp do reduction(max:error) schedule(runtime)
      do j = 2, m-1
         do i = 2, n-1
            anew(i,j) = 0.25*(a(i,j+1) + a(i,j-1) &
                            + a(i-1,j) + a(i+1,j))
            error = max(error,abs(anew(i,j)-a(i,j)))
         enddo
      enddo
      !$omp do schedule(runtime)
      do j = 2, m-1
         do i = 2, n-1
            a(i,j) = anew(i,j)
         enddo
      enddo
      if((error .lt. tolerance) .or. &
         (iter-1 .gt. iter_max)) exit
      !$omp single
      if(mod(iter,10).eq.0) print*, iter, error
      iter = iter + 1
      !$omp end single
   enddo
   !$omp end parallel

Results:

Intel compiler:

# threads | t_auto [s] | t_OpenMP [s]
1         | 0.0627     | 0.0856
2         | 0.0593     | 0.0469
4         | 0.0518     | 0.0251
8         | 0.0581     | 0.0154
16        | 0.0796     | 0.0122

PGI compiler:

# threads | t_auto [s] | t_OpenMP [s]
1         | 0.175      | 0.194
2         | 0.101      | 0.0967
4         | 0.0643     | 0.0509
8         | 0.0440     | 0.0285
16        | 0.0344     | 0.0177
Auto-parallelization: is it worth it?
"Real" example, tested on Ramses:

- Intel compiler, with 4 MPI processes:

# threads | t_auto [s] | t_OpenMP [s]
1         | 656.5      | 777.6
2         | 540.5      | 424.2
4         | 491.4      | 246.4

- IBM compiler, with 1024 MPI processes:

# threads | t_auto [s] | t_OpenMP [s]
1         | 587.7      | 582.3
2         | 340.3      | 337.9
4         | 218.5      | 216.0
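Reducing these tables to speedups over the single-thread run (illustrative Python, values copied from above) shows the gap between auto-parallelization and explicit OpenMP with the Intel compiler, and their near parity with the IBM compiler:

```python
def speedups(times):
    # Speedup relative to the first (1-thread) entry,
    # for times listed in increasing thread order (1, 2, 4)
    return [times[0] / t for t in times]

intel_auto   = speedups([656.5, 540.5, 491.4])   # ~1.3x on 4 threads
intel_openmp = speedups([777.6, 424.2, 246.4])   # ~3.2x on 4 threads
ibm_auto     = speedups([587.7, 340.3, 218.5])   # ~2.7x on 4 threads
ibm_openmp   = speedups([582.3, 337.9, 216.0])   # ~2.7x on 4 threads
```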
Outline Introduction Numerical approach Results Parallel I/O Hybridation GPU Why do we want GPUs? OpenACC
GPU I Why do we want GPUs?
Why do we want to do astrophysics on GPUs?

Pros:
✓ sheer computing power:

hardware                         | processing power [GFLOPS]
Intel Sandy Bridge (single core) | 24.6
Intel Sandy Bridge (whole chip)  | 157.7
NVIDIA Tesla Kepler 20 (SP)      | 4106
NVIDIA Tesla Kepler 20 (DP)      | 1173

✓ weak scaling of Ramses (Ramses-GPU doc., P. Kestener):

# MPI proc. | Global size   | perf_CPU [updates/s] | perf_GPU [updates/s]
1           | 128×128×128   | 0.21                 | 13.6
8           | 256×256×256   | 1.68                 | 95.3
64          | 512×512×512   | 13.4                 | 750.3
128         | 1024×512×512  | 26.8                 | 1498.3
256         | 1024×1024×512 | 52.5                 | 2969.3

Cons:
✗ CUDA is a C library → scientific codes have to be translated into C/C++
✗ CUDA is not memory-management-friendly → the algorithms have to be rethought

⇒ 1.5 years to recode Ramses
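Two ratios worth extracting from these tables (illustrative Python): the raw double-precision peak ratio, and the sustained per-configuration speedup, which stays roughly flat, as expected in a weak-scaling test:

```python
# Kepler 20 (DP) vs whole Sandy Bridge chip: ~7.4x in peak GFLOPS
dp_peak_ratio = 1173 / 157.7

# Sustained performance from the weak-scaling table [updates/s]
cpu = [0.21, 1.68, 13.4, 26.8, 52.5]
gpu = [13.6, 95.3, 750.3, 1498.3, 2969.3]
sustained = [g / c for c, g in zip(cpu, gpu)]   # ~55-65x in every run
```

The sustained speedup (well above the DP peak ratio) compares one GPU against the CPU baseline per MPI process; it is a measured code-to-code ratio, not a hardware-only one.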
GPU I OpenACC
OpenACC: the way to go?
What is it?
- compiler directives to specify loops and regions to offload from the CPU to an accelerator
- no need to explicitly manage data
- Fortran friendly!
OpenACC: the way to go?
Jacobi relaxation, ported to the GPU in two incremental steps: first a !$acc kernels loop directive on each loop nest, then a !$acc data region so that A and Anew stay on the device across iterations:

   tol = 1.d-6
   iter_max = 1000
   !$acc data copy(A) create(Anew)
   do while ( error .gt. tol .and. iter .lt. iter_max )
      error = 0.d0
      !$acc kernels loop
      do j = 1, m-2
         do i = 1, n-2
            Anew(i,j) = 0.25 *(A(i+1,j) + A(i-1,j) &
                             + A(i,j-1) + A(i,j+1))
            error = max( error, abs(Anew(i,j) - A(i,j)))
         end do
      end do
      if(mod(iter,100).eq.0) print*, iter, error
      iter = iter + 1
      !$acc kernels loop
      do j = 1, m-2
         do i = 1, n-2
            A(i,j) = Anew(i,j)
         end do
      end do
   end do
   !$acc end data

Timings:
- CPU, no directives: t_CPU = 0.177 s
- with the two !$acc kernels loop directives: t_GPU = 0.149 s
- adding the !$acc data region: t_GPU,data = 0.00667 s
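For reference, the same Jacobi relaxation can be sketched with NumPy slices; the grid size, boundary condition and tolerance below are placeholders, not the setup used for the timings above:

```python
import numpy as np

def jacobi(n=64, m=64, tol=1e-4, iter_max=5000):
    # Laplace solver by Jacobi relaxation, vectorized over the interior:
    # each sweep replaces every interior point by the mean of its 4 neighbours.
    a = np.zeros((n, m))
    a[0, :] = 1.0                      # hypothetical fixed boundary value
    anew = a.copy()
    error, it = 1.0, 0
    while error > tol and it < iter_max:
        anew[1:-1, 1:-1] = 0.25 * (a[2:, 1:-1] + a[:-2, 1:-1]
                                   + a[1:-1, 2:] + a[1:-1, :-2])
        error = np.max(np.abs(anew[1:-1, 1:-1] - a[1:-1, 1:-1]))
        a[1:-1, 1:-1] = anew[1:-1, 1:-1]
        it += 1
    return a, error, it
```

The two slice assignments correspond one-to-one to the two loop nests that receive !$acc kernels loop directives in the Fortran version.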
Conclusions & prospects

Physics:
1. asymptotic convergence of α at low Pm
2. hydrodynamic cascade at small scales
3. a simulation is still running on Turing to confirm the result

Numerics:
1. parallel I/O is mature enough to be used: (very) good performance, increasingly easy to use, and unavoidable on the road to exascale
2. hybridization (OpenMP/OpenACC + MPI) is needed to keep up with the increasing computational power
3. GPUs are not so hard to use
4. don't worry: compilers are becoming smarter and smarter

Thank you for your attention!