FPGA2, or Hardware/Software Codesign
Bertrand Granado - Andrea Pinna, LIP6 / UPMC
E-mail: [email protected] [email protected]

Fall 2017

Outline

1. Embedded systems: introduction, definition
2. Designing embedded systems on chip with software and hardware parts (co-design): specification, implementation
3. Platforms for embedded computing
4. Application profiling: time, power consumption
5. Interlude: convolutional neural networks (CNNs)
6. Heuristics
7. Optimization algorithms
8. The Pareto front
9. Amdahl's law: speedup and efficiency of parallelism
10. Multi-criteria optimization
11. HLS

Embedded Systems - Introduction

Examples of embedded systems:
Figure: wireless sensor network
Figure: video capsule endoscopy
Figure: mobile telephony (1973)
Figure: my new friend
Figure: ABS braking system
Figure: health monitoring (copyright Sagem)

Embedded Systems - Definition

Embedded system

Embedded systems are reactive systems: "A reactive system is one which is in continual interaction with its environment and executes at a pace determined by that environment" [Bergé, 1995]. Their behaviour depends on the inputs at the current instant.

Embedded system: a first attempt at a definition

An autonomous system that interacts with its environment. An embedded system must be efficient; it is governed by constraints:
- Energy → low power consumption
- Code size → limited memory resources
- Time → real-time constraints
- Area → limited physical space
- Cost → integration into consumer devices
- Specificity → dedicated to particular applications
The behaviour is known at design time, which helps minimize resources and maximize robustness.
Dedicated user interface (not necessarily a mouse, keyboard or screen...).

Embedded systems: the energy constraint

Dissipated power:
  Pdis = Psta + Pdyn
  Psta = Ioff * VDD
  Pdyn = Fc * CL * VDD^2

Low power:
- Psta: technological optimization
- Pdyn: technological and architectural optimization
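As an order-of-magnitude illustration of the formula above (added here, not part of the original slides), the short C program below evaluates Pdyn = Fc * CL * VDD^2; the frequency, capacitance and voltage values are invented for the example.

#include <stdio.h>

/* Illustrative sketch: evaluate the dynamic-power formula
 * Pdyn = Fc * CL * VDD^2 for hypothetical parameter values. */
int main(void)
{
    double Fc  = 100e6;   /* switching frequency: 100 MHz (assumed)      */
    double CL  = 10e-9;   /* total switched capacitance: 10 nF (assumed) */
    double Vdd = 1.2;     /* supply voltage: 1.2 V (assumed)             */

    double Pdyn = Fc * CL * Vdd * Vdd;          /* dynamic power, watts  */
    printf("Pdyn at 1.2 V = %.2f W\n", Pdyn);

    /* Lowering VDD reduces dynamic power quadratically. */
    double Pdyn_low = Fc * CL * 1.0 * 1.0;
    printf("Pdyn at 1.0 V = %.2f W (%.0f%% of the 1.2 V figure)\n",
           Pdyn_low, 100.0 * Pdyn_low / Pdyn);
    return 0;
}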

Embedded systems: code size

- Limited memory size, imposed by the environment
- Bounded memory size: impossible to increase it
- Need to optimize memory usage: nothing superfluous
- Compiler optimization options (gcc -O2, -O3) can help, but without any guarantee

Embedded systems: time

An embedded system must often meet real-time constraints. A real-time system must react to a stimulus within a time interval determined by the environment. There are two kinds of real time:
- Hard real time (latency): a real-time system that produces a correct answer, but too late, is wrong. The response time of a hard real-time system cannot be statistical; it is a worst case (WCET: Worst-Case Execution Time). "A real-time constraint is called hard, if not meeting that constraint could result in a catastrophe" [Kopetz, 1997].
- Soft real time (throughput)

Hard real time - Figure: aircraft control
Soft real time - Figure: digital TV

Embedded systems: area

Form factor
Figure: spy camera
Figure: cochlear implant

Embedded systems: cost

Where did my euros go?
Figure: connected objects
Figure: a huge market!

Embedded systems: specificity

Depends on the application domain:
- aeronautics and aerospace
- automotive
- biomedical
- e-health
- robotics
- sensor networks
- ...

Embedded system: a second attempt at a definition

Not every embedded system has all of these characteristics.
Definition: an information-processing system that exhibits most of these characteristics is called an embedded system.

Designing embedded systems on chip with software and hardware parts (co-design) - Specification

Embedded systems on chip: specification

Observation: a human cannot grasp a system containing more than about 5 to 10 objects, and most embedded systems manipulate far more objects than that. Hence the need for:
- readability
- portability and flexibility

Designing embedded systems on chip with software and hardware parts (co-design) - Implementation

Embedded systems: implementation

Choose an electronic architecture on which to implement the embedded system, and use a methodology to:
- deploy it
- choose the architecture
- explore the space of architectural solutions
- be able to define it!
→ Implementation as a system on chip

Figure: a system on chip, or SoC (System on Chip) (Cours de )

Embedded systems: choosing an architecture

Choose a hardware architecture made of:
- general-purpose programmable blocks: CPU, GPU, DSP
- specialized or dedicated blocks: FPGA, ASIC
- communication buses
SoC = these resources coexist on a single chip and are taken into account globally for the software/hardware implementation.

Embedded systems: choosing an architecture

ASIC - Figure: ASIC wafer
FPGA - Figure: FPGA
CPU - Figure: AMD K10

Figure: performance vs. flexibility comparison

Platforms for embedded computing

Arduino Yun
Samsung Artik 10
Raspberry Pi
Xilinx Zynq UltraScale+
Intel/Altera Cyclone

Embedded systems: design methodology

A procedure for designing a system; understanding a methodology helps guarantee that the design is done safely.
Design flow: the set of compilers, software development tools, computer-aided design (CAD) tools, etc., that make it possible to:
- help automate the steps of the methodology;
- keep track of how the methodology is applied (version control, reports, faster iterations).

Goals - satisfy:
- performance: overall speed, deadlines
- functionality and user interface
- manufacturing cost
- power consumption
- miscellaneous requirements (physical size, ...)

Method: a problem-solving technique characterized by a set of well-defined rules that lead to a correct solution.
Methodology: a structured, coherent set of models, methods, guidelines and tools from which the way to solve a problem can be derived.
Model: a representation of a partial, coherent aspect of the real world; it precedes any decision or opinion and is built to answer the question that drives the development of a system.

A bit of history:
- 1970s-80s: full custom - schematics, mask layout, electrical simulation
- 1980s-90s: standard cells, FPGAs - reuse of elementary building blocks, modelling, simulation
- 2000s onward: SoC - reuse of hardware and software, co-design, verification

Embedded systems: the notion of IP

Speeding up system-on-chip design:
- reuse blocks already designed in-house;
- use macro-cell generators (RAMs, multipliers, ...);
- buy blocks designed outside the company.

IPs are complex, reusable functional blocks:
- hardware (hard) IP: already implemented, technology-dependent, highly optimized
- software (soft) IP: described in a high-level language (VHDL, Verilog, C++, ...), parameterizable
- standardized interfaces (e.g., OCP)
- development environment (co-design, co-specification, co-verification)
- average performance (little optimization)

Embedded systems: using IP

For a reusable block (IP), you must:
- know its functionality
- estimate its performance within the system
- be confident that the IP works correctly
- integrate the IP into the system
- validate the system

Embedded systems: using IP - soft, firm and hard IP

Design flow: system design → RTL design → synthesis → floorplanning → placement → routing → verification

                  Soft IP                   Firm IP                        Hard IP
Representation    Behavioural, RTL          RTL, blocks, netlist           Regular polygons (layout)
Libraries         -                         Reference (timing, layout)     Process-specific, design rules
Technology        Technology-independent    Generic technology             Fixed technology
Portability       Unlimited                 Portable onto a library        Process-dependent

Co-design

The heart of the problem (definition to be added).

Definition:
- design of macro- or micro-scale systems that integrate both software parts (running on processors or DSPs) and IP blocks (implemented on FPGAs or ASICs);
- joint design of the software and hardware components;
- unification of the software and hardware design paths, which are usually kept separate.

Definition: a design methodology that supports cooperative and concurrent development of the software and hardware parts (co-specification, co-development and co-verification) in order to obtain shared functionality and reach the expected performance [R. Gupta and G. De Micheli, "Hardware-Software Cosynthesis for Digital Systems", IEEE Design and Test of Computers, 1993, pp. 29-41].

Toy example: measuring the speed of a wheel

Constraints: area: 40 units; time: 100 cycles.
Possible implementations: processors only, dedicated hardware only, or a combination of processor and dedicated hardware.

Software implementation on processors:
- area: 48 units > 40 units
- time: 132 cycles > 100 cycles
- development: 2 months
Both constraints are violated.

Hardware implementation on ASICs or FPGAs:
- area: 24 units < 40 units
- time: 54 cycles < 100 cycles
- roughly 40% margin with respect to the time and area constraints
- development: 9 months (too long in a hyper-competitive market)

Combined implementation, software on a processor plus hardware on ASICs or FPGAs:
- area: 37 units < 40 units
- time: 97 cycles < 100 cycles
- development: 3.5 months
Not as efficient as the purely hardware implementation, but it meets the constraints: a good trade-off.

Motivation for co-design

- Reach the expected performance by moving bottlenecks from software to hardware.
- Use hardware to meet timing and area constraints that a general-purpose processor cannot satisfy.
- With a static hardware implementation, limited resources make it impossible to put everything in hardware; with a dynamically reconfigurable implementation, this statement has to be reconsidered.
- Some parts of an application (control, for example) are better suited to sequential execution on a general-purpose processor.
- Today many systems are embedded, which requires both software and hardware parts.

- System complexity and functionality grow at a fast pace, and Systems on Chip (SoCs) are emerging.
- It is difficult, if not impossible, to design, implement and test ad hoc systems in an acceptable time, even with the most advanced standard CAD tools.
- Solution: take advantage of previously designed blocks (IPs) and of proven processors to shorten design time and increase reliability.

The designer productivity gap (ITRS).

Trade-offs and decisions

Starting from a set of specified constraints and mastered technologies, designers must find the trade-offs that make the software and hardware components work together.
Decisions, constraints and evaluation criteria: performance, area, power consumption, programmability, development effort, manufacturing cost, reliability, robustness, maintenance, evolution.

Co-design: research

Research in co-design spans several fields, such as:
- system specification and modelling
- design exploration
- partitioning
- scheduling
- co-verification and co-simulation
- hardware and software code generation
- hardware/software interfacing
The common goal is to develop a unified methodology for building systems that contain both hardware and software.

A simple approach

Application profiling: time, power consumption

Profiling and partitioning - benefits:
- speedups of 10x to 200x, with up to 800x possible
- much more potential than dynamic software optimizations (inside the processor, loop unrolling, software pipelining, ...)
- energy consumption reduced by 25% to 95%

Profiling

Profiling shows where, in terms of code, a program spends its time, and which function calls which other function during execution. Profiling is based on data collected while the application runs, so it can be used to analyse programs that are too complex to study by reading the source. The profile highlights the pieces of code where the program is slower than expected; those pieces are good candidates for:
- an optimized rewrite
- a move to hardware

Profiling: how?

With gcc, first compile and link the program with the profiling options enabled:
  gcc -o myprog.exe myprog.c utils.c -g -pg

Then run the program to collect the execution-profile data; the program writes the collected data to a file named gmon.out just before it exits.

You can then use gprof to analyse the collected data:
  gprof [options] myprog.exe gmon.out > outfile
gprof produces a flat profile and a call graph.
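As a minimal, self-contained illustration of this flow (added here, not taken from the course), the toy program below contains a deliberately expensive function; the file name myprog.c and the function names are invented. Built with -pg and analysed with gprof as shown above, hot_loop should appear at the top of the flat profile.

/* myprog.c - toy program to illustrate gcc -pg / gprof profiling.
 * Build:  gcc -o myprog.exe myprog.c -g -pg
 * Run:    ./myprog.exe          (writes gmon.out on exit)
 * Report: gprof -b myprog.exe gmon.out > outfile
 */
#include <stdio.h>

/* Deliberately expensive function: should dominate the flat profile. */
static double hot_loop(int n)
{
    double acc = 0.0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            acc += 1.0 / (i + j);
    return acc;
}

/* Cheap function, for contrast in the call graph. */
static double cold_part(int n)
{
    double acc = 0.0;
    for (int i = 1; i <= n; i++)
        acc += i;
    return acc;
}

int main(void)
{
    printf("hot  = %f\n", hot_loop(3000));
    printf("cold = %f\n", cold_part(3000));
    return 0;
}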

Profiling: useful gprof options

- -e function_name: tells gprof not to print information about function_name (and its children...) in the call graph
- -f function_name: restricts the call-graph analysis to function_name and its children
- -b: gprof omits the explanatory text describing the fields of the tables

Profiling: the flat profile

- % time: percentage of the total execution time the program spent in this function.
- cumulative seconds: running total, in seconds, of the CPU time spent in this function plus the functions listed above it in the table.
- self seconds: time in seconds spent in this function alone.
- calls: total number of times this function was called.
- self ms/call: average time in milliseconds spent in this function per call.
- total ms/call: average time in milliseconds spent in this function and its descendants per call.
- name: name of the function.

Weaknesses of this first approach

- Some functions are not trivial to implement in hardware.
- Decisions taken too early in the flow may not be optimal.
- Communication and interfacing are not taken into account at all.
- If the application changes, profiling and then partitioning have to be run again.

Co-design: a design workbench

Partitioning and scheduling

Task partitioning and scheduling are essential in many applications of system co-design, for multiprocessors and for reconfigurable systems. The tasks identified in the initial description of the application must be implemented:
- in the right place (partitioning)
- at the right time (scheduling)
Both problems are well known and have been shown to be NP-complete. Heuristic optimization techniques are therefore generally used to explore the solution space, where near-optimal solutions can be found.

What to optimize during partitioning:
- minimize communication over the bus
- extract as much parallelism as possible → have the hardware (FPGA/ASIC) and the software (processor) run simultaneously
- extract as much performance as possible from the processor

Heuristics: Fiduccia-Mattheyses

Task graph and cost function. The partition of the task graph is improved move by move; at each step the number of cut edges and the cost are re-evaluated (values from the slide animation):
- initial partition: number of cut edges = 5, cost = 8
- after one move: number of cut edges = 3, cost = 0
- after another move: number of cut edges = 2, cost = -4
The Fiduccia-Mattheyses heuristic moves one task at a time across the partition, each time choosing the move with the best gain.
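To make the cut and gain numbers above concrete, here is a small C sketch added for illustration: the task graph and the initial partition are invented, and the "gain" of a move follows the spirit of Fiduccia-Mattheyses; the exact cost function used on the slide is not reproduced.

#include <stdio.h>

#define NTASKS 6
#define NEDGES 7

/* Edge list of a small, invented task graph. */
static const int edge[NEDGES][2] = {
    {0,1}, {0,2}, {1,3}, {2,3}, {2,4}, {3,5}, {4,5}
};

/* part[i] = 0 -> software, 1 -> hardware. */
static int cut_size(const int part[NTASKS])
{
    int cut = 0;
    for (int e = 0; e < NEDGES; e++)
        if (part[edge[e][0]] != part[edge[e][1]])
            cut++;
    return cut;
}

/* FM-style gain of moving task t to the other side:
 * (edges it currently cuts) - (edges it would cut after the move). */
static int move_gain(int part[NTASKS], int t)
{
    int before = cut_size(part);
    part[t] ^= 1;
    int after = cut_size(part);
    part[t] ^= 1;                 /* undo the tentative move */
    return before - after;
}

int main(void)
{
    int part[NTASKS] = {0, 0, 1, 1, 0, 1};   /* arbitrary initial partition */
    printf("initial cut = %d\n", cut_size(part));
    for (int t = 0; t < NTASKS; t++)
        printf("gain of moving task %d: %d\n", t, move_gain(part, t));
    return 0;
}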

Optimization algorithms: simulated annealing

Simulated annealing (Kirkpatrick, 1983):
- inspired by statistical physics and the cooling of metals
- allows moves that degrade the solution, with a probability that depends on a temperature:
    Paccept = exp(-δE / T)
- if the energy decreases, the system accepts the perturbation
- if the energy increases, the system accepts the perturbation with probability Paccept

Algorithm 1: simulated annealing

1: select an initial solution s
2: select an initial temperature T > 0
3: while the stopping condition is not met do
4:     select s' at random in N(s); compute δ = f(s') - f(s)
5:     if δ < 0 then s = s'
6:     else x = random([0, 1])
7:          if x < exp(-δ / T) then s = s'
8:     update the temperature T
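A compact C sketch of the algorithm above, applied to a toy two-way hardware/software partitioning problem; the cost function, cooling schedule and all constants are invented for the illustration and are not the course's reference implementation.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NTASKS 8

/* Invented per-task costs: hw[i]/sw[i] = cost of running task i in
 * hardware/software; the objective below just sums the chosen costs. */
static const double hw[NTASKS] = {2, 5, 1, 8, 3, 6, 2, 4};
static const double sw[NTASKS] = {6, 4, 3, 9, 2, 9, 5, 3};

static double cost(const int part[NTASKS])   /* part[i] = 1 -> hardware */
{
    double c = 0.0;
    for (int i = 0; i < NTASKS; i++)
        c += part[i] ? hw[i] : sw[i];
    return c;
}

int main(void)
{
    int s[NTASKS] = {0};                     /* initial solution: all software */
    double T = 10.0;                         /* initial temperature            */
    srand(42);

    while (T > 1e-3) {                       /* stopping condition             */
        for (int k = 0; k < 100; k++) {
            int t = rand() % NTASKS;         /* random neighbour: flip one task */
            double before = cost(s);
            s[t] ^= 1;
            double delta = cost(s) - before;
            if (delta >= 0) {                /* worse move: accept with prob. exp(-delta/T) */
                double x = (double)rand() / RAND_MAX;
                if (x >= exp(-delta / T))
                    s[t] ^= 1;               /* reject: undo the move          */
            }
        }
        T *= 0.95;                           /* cool down                      */
    }
    printf("final cost = %.1f\n", cost(s));
    for (int i = 0; i < NTASKS; i++)
        printf("task %d -> %s\n", i, s[i] ? "hardware" : "software");
    return 0;
}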

Optimization algorithms: greedy algorithms

Optimization algorithms: SynDEx

Optimization algorithms: genetic algorithms

Vulcan

Gupta and De Micheli, Stanford University.
Primal approach:
1. initially there are only hardware IPs;
2. iteratively, some IPs are moved to software to reduce the cost.
Uses a hardware specification language, HardwareC, compiled into a dataflow graph.

Definition of the dataflow graph: a variant of a task graph.
- Nodes represent operations, typically low-level operations such as additions, multiplications, ...
- Edges represent data dependencies; each edge is labelled with a boolean giving the condition for moving from one node to the next.

The dataflow graph:
- is executed periodically;
- may carry, for each node, timing constraints: T(vj) ≥ T(vi) + lij and T(vj) ≤ T(vi) + uij;
- may carry, for each node, rate constraints: mi ≤ Ri ≤ Mi.

Co-synthesis algorithm in Vulcan:
- the partitioning quantum is the thread;
- the algorithm cuts the dataflow graph into threads and allocates them onto the resources;
- a thread boundary is determined:
  - always by a non-deterministic delay element, such as an event on an external variable;
  - sometimes by other points of the dataflow graph.
Target architecture: a processor plus hardware accelerators.

Cosyma

- Unified representation: ES graph (a CDFG).
- Partitioning: a combined method based on user-guided partitioning driven by a cost function, plus finer partitioning performed by a simulated-annealing algorithm.
- Scheduling: no specific method.
- Modelling: models written in C++.
- Validation: simulation based on executables written in C++.
- Main emphasis on partitioning towards hardware accelerators.

Developed at the Technical University of Braunschweig, Germany.
An experimental system for the co-design of small real-time embedded systems:
- implement as many operations as possible in software on a processor;
- generate a hardware accelerator only when a timing constraint is violated.
Target architecture: a RISC processor plus hardware accelerators.
Communication between the hardware IPs and the software IPs goes through a shared memory with a CSP-style sequential communication protocol (CSP: Communicating Sequential Processes).

The system is described in the C* language. This description is translated into Cosyma's internal graph representation, which supports:
- partitioning;
- generation of the hardware accelerators when software is migrated to hardware.
The internal representation combines a control graph and a dataflow graph into an extended syntax (ES) graph: a syntax graph, a symbol table, and local data/control dependencies.

Cosyma: cost of moving a basic block b to hardware

  Δc(b) = w · (tHW(b) − tSW(b) + tCOM(Z) − tCOM(Z ∪ {b})) · It(b)

with
  w      a fixed weight
  tHW    the hardware execution time
  tSW    the software execution time
  tCOM   the communication time
  Z      the set of blocks already mapped to hardware
  It(b)  the number of iterations of b
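Purely as an illustration of how this cost could be evaluated (the function below follows the formula as reconstructed above; the numeric values are invented and this is not Cosyma code):

#include <stdio.h>

/* Sketch of Cosyma's partitioning cost for moving block b to hardware:
 * delta_c(b) = w * (tHW(b) - tSW(b) + tCOM(Z) - tCOM(Z u {b})) * It(b)
 * All inputs are assumed to come from profiling and estimation. */
static double delta_c(double w,
                      double t_hw, double t_sw,
                      double t_com_Z, double t_com_Zb,
                      double iterations)
{
    return w * (t_hw - t_sw + t_com_Z - t_com_Zb) * iterations;
}

int main(void)
{
    /* Invented example: the block is 40 time units faster in hardware,
     * communication time grows slightly, and it runs 1000 times.
     * The interpretation of the sign follows Cosyma's convention. */
    double dc = delta_c(1.0, 10.0, 50.0, 120.0, 135.0, 1000.0);
    printf("delta_c(b) = %.1f\n", dc);
    return 0;
}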

Cosyma: communication time after moving block b to hardware

  tCOM(Z ∪ {b}) = tCOM(Z) − ( Σ_{a ∈ Z} C_a,b − Σ_{d ∉ Z} C_d,b ) · tTRANS

with C_x,b the number of data transfers between blocks x and b, tTRANS the time of one transfer, Z the current hardware partition and d ranging over the blocks that remain in software.
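The communication-time update can be evaluated directly from a block-to-block transfer-count matrix; the sketch below applies the formula above to invented data and is not Cosyma code.

#include <stdio.h>

#define NBLOCKS 5

/* C[x][y]: number of data transfers between blocks x and y (invented). */
static const double C[NBLOCKS][NBLOCKS] = {
    {0, 4, 0, 2, 0},
    {4, 0, 3, 0, 1},
    {0, 3, 0, 5, 0},
    {2, 0, 5, 0, 2},
    {0, 1, 0, 2, 0},
};

/* tCOM(Z u {b}) = tCOM(Z) - (sum_{a in Z} C[a][b] - sum_{d not in Z, d != b} C[d][b]) * t_trans */
static double tcom_after_move(double tcom_Z, const int inZ[NBLOCKS],
                              int b, double t_trans)
{
    double to_hw = 0.0, to_sw = 0.0;
    for (int x = 0; x < NBLOCKS; x++) {
        if (x == b) continue;
        if (inZ[x]) to_hw += C[x][b];   /* communication with blocks already in hardware */
        else        to_sw += C[x][b];   /* communication with blocks staying in software */
    }
    return tcom_Z - (to_hw - to_sw) * t_trans;
}

int main(void)
{
    int inZ[NBLOCKS] = {1, 0, 1, 0, 0};    /* current hardware partition Z (invented) */
    double tcom_Z = 40.0, t_trans = 2.0;   /* current comm. time, per-transfer time   */
    int b = 3;                             /* candidate block to move to hardware     */
    printf("tCOM(Z u {b}) = %.1f\n", tcom_after_move(tcom_Z, inZ, b, t_trans));
    return 0;
}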

Cosyma (figure)

Vilfredo Pareto, Italian economist, 1848-1923

Studied the distribution of income.
"Ecrits sur la courbe de la répartition des richesses", edited by Giovanni Busino, Librairie Droz, Geneva, 1965.
http://books.google.fr/books?hl=fr&lr=&id=CP4a4VSJO0QC&oi=fnd&pg=PA1&dq=authornbsp:pareto&ots=sgU2aq9axn&sig=oeFCRb5Kc71y0JjSmeYfcXNym24#v=onepage&q&f=false
Data: income of the population of England (Griffin).
Figures: Pareto's observation; validation of Pareto's observation.

The Pareto principle

A consequence of Pareto's observation: 20% of a country's population owns 80% of its wealth.
Joseph Juran, 1904 (Romania) - 2008 (USA), inventor of quality management, generalized it: 20% of the causes produce 80% of the effects.
- 20% of the causes → 80% of the production defects
- 20% of the customers → 80% of the revenue
- 20% of the customers → 80% of the complaints
Joseph Juran, "Universals in Management, Planning and Controlling", The Management Control, 1954.

Pareto chart

Introduced by Joseph Juran: a histogram of the causes sorted in decreasing order. It distinguishes:
- the 20% most important causes, which produce 80% of the effects;
- the secondary causes, which produce the remaining effects.

Pareto efficiency

Theoretical economics - allocation of goods:
- distribution of goods and services among agents;
- distribution of production factors among industrial agents;
- distribution of income among agents.

Given an allocation of goods, a Pareto improvement is an allocation that gives more goods to at least one person without reducing the goods given to the others.
A Pareto-efficient allocation, or Pareto optimum, is an allocation to which no Pareto improvement can be applied.

Fundamental theorems of welfare economics:
- TH1: in a perfect economy, every equilibrium state is a Pareto optimum.
- TH2: in a perfect economy, for every Pareto optimum there is an initial allocation whose equilibrium state is that optimum.
Note: a Pareto optimum is efficient, but it can be unequal.

Example (guns vs. butter):
- Allocation A: weapons production can be increased without decreasing butter production, so a Pareto improvement is possible.
- Allocations B, C, D: increasing one production forces the other to decrease, so no Pareto improvement is possible.

Design space exploration

Comparison metrics:
- IP size: basic logic elements of the FPGA, percentage of FPGA resources used
- processing speed: critical path, operating frequency
- energy: number of transistors, transistor size, supply voltage

Design space exploration (architecture exploration): vary design parameters such as
- the degree of parallelism
- the data word width
- the hardware/software partitioning
- ...

Pareto front
Figure: latency (processing time) versus size (number of cells) for the different designs; the Pareto front is the curve of the best trade-offs, made of the Pareto-optimal points.
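A minimal sketch, with invented design points, of how the Pareto-optimal designs can be filtered out of a set of explored designs when both latency and size are to be minimized:

#include <stdio.h>

/* One explored design point: both metrics are to be minimized. */
struct design { const char *name; double latency; double size; };

static const struct design pts[] = {
    {"A", 10.0, 900.0}, {"B", 12.0, 500.0}, {"C", 20.0, 450.0},
    {"D", 25.0, 200.0}, {"E", 22.0, 600.0}, {"F", 40.0, 210.0},
};
#define NPTS (sizeof pts / sizeof pts[0])

/* p dominates q if it is no worse on both metrics and better on at least one. */
static int dominates(const struct design *p, const struct design *q)
{
    return p->latency <= q->latency && p->size <= q->size &&
           (p->latency < q->latency || p->size < q->size);
}

int main(void)
{
    for (size_t i = 0; i < NPTS; i++) {
        int dominated = 0;
        for (size_t j = 0; j < NPTS && !dominated; j++)
            if (j != i && dominates(&pts[j], &pts[i]))
                dominated = 1;
        if (!dominated)
            printf("%s (latency %.0f, size %.0f) is on the Pareto front\n",
                   pts[i].name, pts[i].latency, pts[i].size);
    }
    return 0;
}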

Example of Pareto fronts

Figure: generation by genetic algorithms; left, the initial state; right, the resulting Pareto fronts.
"Pareto Front Generation for a Tradeoff between Area and Timing", M. Holzer and B. Knerr, Vienna University of Technology, Austrochip 2006, Vienna, Austria, 11.10.2006 (copyright IEEE 2006).

Illustration of Amdahl's law
Figure: speedup as a function of the sequential fraction of the code, for 100 PEs.

Implications of Amdahl's law

Take a program containing 10% of purely sequential code (τseq = 0.1), and a parallel machine with 100 processors that could ideally speed a program up by a factor of 100. Then the speedup will be lower than 10, whatever the number of processors. Indeed:

  Sp = 1 / (τseq + (1 − τseq)/p) = 1 / (0.1 + 0.9/100) ≈ 9.2,  and  Sp ≤ 1/τseq = 1/0.1 = 10.

Generalized Amdahl's law

A program contains a part that can be accelerated and a part that cannot (Hennessy & Patterson). Writing tbefore for the original execution time, timpr for the time spent in the part that is improved, tunch for the rest, and p for the acceleration factor:

  tbefore = tunch + timpr
  tafter  = tunch + timpr / p
  τimpr   = timpr / tbefore

  Sap = tbefore / tafter = tbefore / (tbefore − timpr + timpr/p) = 1 / ((1 − τimpr) + τimpr / p)

Illustration of the generalized Amdahl's law

Crossing the sierra (fixed duration) and then the desert (can be sped up):

Transport   Sierra time   Desert time   Total time   Desert speedup   Overall speedup
On foot     20 h          50 h          70 h         1                1
Bicycle     20 h          20 h          40 h         2.5              1.8
Ferrari     20 h          1.7 h         21.7 h       30               3.2

Applications of the generalized Amdahl's law

Memory hierarchy: a cache is 5 times faster than main memory and, thanks to locality of reference, is used 90% of the time. Speed gain due to the cache: about 3.6.

  Sap = 1 / ((1 − τimpr) + τimpr/p) = 1 / (1 − 0.9 + 0.9/5) = 1 / (0.1 + 0.18) ≈ 3.6

Code optimization: a program spends 90% of its execution time in 10% of the code, and those 10% are sped up by a factor of 3 by optimizing the source code. The complete program is sped up by a factor of 2.5.

  Sap = 1 / ((1 − τimpr) + τimpr/p) = 1 / (1 − 0.9 + 0.9/3) = 1 / (0.1 + 0.3) = 2.5

Applications of the generalized Amdahl's law

Co-design: a program spends 80% of its execution time in 20% of the code, and those 20% of the code are sped up by a factor of 50. What is the speedup of the complete program?
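The three applications can be checked, and the co-design question answered, by evaluating the generalized formula directly; the small C program below is an illustration added to these notes, not part of the slides.

#include <stdio.h>

/* Generalized Amdahl's law: overall speedup when a fraction `tau`
 * of the execution time is accelerated by a factor `p`. */
static double amdahl(double tau, double p)
{
    return 1.0 / ((1.0 - tau) + tau / p);
}

int main(void)
{
    printf("cache      (tau=0.9, p=5):  %.2f\n", amdahl(0.9, 5.0));   /* ~3.6 */
    printf("code opt.  (tau=0.9, p=3):  %.2f\n", amdahl(0.9, 3.0));   /* 2.5  */
    printf("co-design  (tau=0.8, p=50): %.2f\n", amdahl(0.8, 50.0));  /* ~4.6 */
    return 0;
}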

The Pareto front

Pareto's observation
Vilfredo Pareto was an Italian economist (1848-1923) who studied the distribution of wealth in European cities.
Book: "Ecrits sur la courbe de la répartition des richesses".

Figures: Pareto's observation (income distribution data).

Statement of the Pareto principle

Pareto's observation: 20% of a country's population owns 80% of its wealth.
This principle was brought into quality management by Joseph Juran (1904-2008):
- 20% of the causes produce 80% of the effects
- 20% of the causes produce 80% of the production defects
- 20% of the customers generate 80% of the revenue
- 20% of the customers generate 80% of the complaints

Amdahl's law - Speedup and efficiency of parallelism

Parallelism: speedup and efficiency

The speedup (gain) expresses the acceleration:
  G = T_sequential / T_parallel

The efficiency expresses how effectively the available resources are used:
  E = G / (number of resources)

Parallelism - Amdahl's law

Question: with 100 processors, do I go 100 times faster than with a single one?
Answer: no; there is always a sequential part of the program that cannot be parallelized.

Example:
- a program of 20 instructions, each lasting 1 cycle
- 30% of the instructions are sequential
- duration of the program with 1 processor: 20 cycles
- ideal duration with 20 processors: 1 cycle
- actual duration with 20 processors: 7 cycles
- G = 20/7 = 2.85
- E = 2.85/20 = 0.142

Parallelism - Amdahl's law

Speedup:
  Acc = 1 / ((1 − P) + P/N)
with P the fraction of parallel code and N the number of processors.

Parallelism - Amdahl's law: derivation

  T    = S + P                     (total time: sequential part S plus parallelizable part P)
  T(N) = S + P/N                   (time on N processors)
  G    = T / T(N) = (S + P) / (S + P/N)

Normalizing the total time to 1, so that S = 1 − P:

  G = ((1 − P) + P) / ((1 − P) + P/N) = 1 / ((1 − P) + P/N)
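A quick numerical check of this formula, added as an illustration: for the 20-instruction example earlier in this section, the closed-form value 1/(0.3 + 0.7/20) ≈ 2.99 differs slightly from the discrete count 20/7 ≈ 2.86 because the example counts whole cycles.

#include <stdio.h>

/* Amdahl's law: G = 1 / ((1 - P) + P/N), efficiency E = G / N. */
static double speedup(double P, int N) { return 1.0 / ((1.0 - P) + P / N); }

int main(void)
{
    double P = 0.7;            /* 70% of the code is parallelizable */
    int    N = 20;             /* 20 processors                     */
    double G = speedup(P, N);
    printf("P = 0.7, N = 20:  G = %.2f, E = %.3f\n", G, G / N);

    /* With 100 processors and only 10% sequential code, G stays below 10. */
    printf("P = 0.9, N = 100: G = %.2f\n", speedup(0.9, 100));
    return 0;
}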

HLS

Slides by Syed Zahid Ahmed and by Xilinx.

What is ESL?

• ESL (Electronic System Level): http://en.wikipedia.org/wiki/Electronic_system-level_design_and_verification
  A design and verification methodology for system design at a higher abstraction level. The term was coined by Gartner Dataquest for industrial market analysis.

• HLS (High Level Synthesis): http://en.wikipedia.org/wiki/High-level_synthesis
  Hardware/system design at a higher level of abstraction. It is essentially the more scientific/technical name for the ESL tools that mostly (but not only) synthesize standard C/C++/SystemC into RTL (VHDL/Verilog).

• Some state-of-the-art C/C++/SystemC → RTL tools:
  – Synopsys: Synphony (former Synfora's Pico)
  – Cadence: C-to-Silicon
  – Calypto: Catapult C (former Mentor's Catapult C)
  – Xilinx: VivadoHLS (former AutoESL's AutoPilot)

• Academic research
  – Some current examples include LegUp from the University of Toronto, etc.
  – Former examples include AutoPilot from UCLA (the origin of AutoESL!), etc.

HLS tools: historical overview

A nice industrial survey on the topic: Grant Martin, Gary Smith, "High-Level Synthesis: Past, Present and Future", IEEE Design and Test of Computers, July/Aug. 2009.
http://cas.et.tudelft.nl/education/courses/et4054/2009_PAPER_High-Level_Synthesis_Past,_Present,_and_Future.pdf

Issues with the early tools: custom languages, and the wrong people were targeted (HLS-to-gates flows in the beginning tools...).

FPGAs vs coarse-grain arrays (a special form of HLS): the story

Coarse-grain architectures suffered a solo adventure in the desert, a simultaneous war at three frontiers: new hardware, new language, no IP. It is difficult to enter and win in industry like this. FPGAs, by contrast, have enjoyed a nice-party scenario, sheltered under the umbrella of Moore's law, RTL and ANSI C/C++ (with ESL and the many custom C dialects on top).

The diverse solutions of coarse-grain arrays have kept them difficult to adopt widely in industry: no or partial reuse of designs, scarce IP leverage and non-standard programming make them a risky investment option for companies compared to FPGAs.

Survey of new trends in industry for programmable hardware: FPGAs, MPPAs, MPSoCs, Structured ASICs, eFPGAs and a new wave of innovation in FPGAs. Syed Zahid Ahmed, Gilles Sassatelli, Lionel Torres, Laurent Rougé. FPL-2010 (http://conferenze.dei.polimi.it/FPL2010/presentations/W1_B_1.pdf)

Modern ESL tools • Standard C/C++/SystemC input • Constraints infrastructure to optimize implementations • Leveraging the mature RTL design flow and tools


Outline • ESL – Fundamentals & Historical overview – Current ESL tools/vendors

• Xilinx Vivado tools suite Basics – ISE vs Vivado

• Xilinx VivadoHLS – Fundamentals and flow overview – Strengths & Limitations – Code examples & Xilinx Videos

• Image Processing IPs ESL exploration work in SagemCom Project – ESL vs Hand coded IP


Xilinx Vivado vs ISE

ISE: all FPGAs up to the 7 Series and Zynq. Vivado: 7 Series/Zynq and future devices.

• New Xilinx tool suite with an IP-centric design environment for 7 Series/Zynq and future devices, with the built-in ESL tool VivadoHLS
• New algorithms for faster place and route, improved design flow vs ISE
• New SDC-based constraint system (.ucf is obsolete), improved facilities for timing/power analysis...
http://www.xilinx.com/support/documentation/white_papers/wp416-Vivado-Design-Suite.pdf

Multiple Xilinx video tutorials for Vivado are available at:
http://www.xilinx.com/training/vivado/index.htm
http://www.youtube.com/playlist?list=PL35626FEF3D5CB8F2&feature=plcp

ETIS project: SagemCom SuperSoC

Figure: SuperSOC_v1 block diagram on an Arria II GX125 FPGA: a Nios system with 128 KB of on-chip SRAM, an Altera performance counter, a DDR2 controller (1 GB) and a PLL, plus the image-processing IPs (ImgColor, GrayScale, Binarize, BinarizeCount, SelectiveTrc, Wiener, FindEdges, Hough, NeuralGas, Bitonal), all connected through the Avalon interconnect.

SuperSoC hardware statistics

The per-IP figures are CONFIDENTIAL; only the system-level totals for the Arria II GX125 are recoverable here:

                  ALUTs     FFs       Memory bits    DSP blocks
Total used        40,443    43,399    4,350,528      276
Device capacity   99,280    99,280    6,727,680      576
% of device       41%       44%       65%            48%

Dynamic and static power figures are statistical values from the PowerPlay tool, complemented by real-time power measurements (total and reset/static); most of them are confidential as well.

Hand-coded RTL vs ESL for selected IPs

Complementary research work:
• Exploration of the Xilinx ESL tools
• Two IPs* explored: RTL vs ESL
• Evaluations on Virtex-6 and Zynq
• C source adapted for Vivado HLS: removal of mallocs, AXI interfaces, constraints

Figure: evaluation platform on a Zynq Z20 FPGA: ARM Cortex-A9 processing system with DDR3 controller, AXI4 interconnect, AXI timer and the Vivado HLS-generated hardware IP.

Key findings:
• ESL is cool :)
• RTL vs ESL resources: same/similar
• The ESL DMA is not good in the current version
• The ARM Cortex is very powerful

Figure: development time in man-weeks; read from the bar chart, roughly 15 (IP-1 RTL) vs 3 (IP-1 ESL) and 11 (IP-2 RTL) vs 1 (IP-2 ESL).

* Names not mentioned to comply with Sagemcom agreements

Sagemcom project example: design space exploration (DSE) for IP-1

Rapid design space exploration with the quick synthesis of Vivado HLS.

IP-1: Vivado HLS results for the Zynq Z20, 256x256-pixel image

Step                                         | LUTs   | FFs    | BRAMs | DSP | ClkPr (ns) | Latency (cycles) | Speedup | Power*
S1: raw transform                            | 20,276 | 18,293 | 16    | 59  | 8.75       | 8,523,620        | NA      | 1410
S2: S1 with shared div                       |  9,158 |  7,727 | 16    | 51  | 8.75       | 8,555,668        | 1.00    | 1638
S3: S2 with div-to-mult                      |  7,142 |  6,172 | 16    | 84  | 8.75       | 3,856,048        | 2.22    | 1337
S4: S3 with buffer memory partitioning       |  4,694 |  4,773 | 16    | 81  | 8.75       | 2,438,828        | 3.51    |  953
S5: S4 with shared 32/64-bit multipliers     |  4,778 |  4,375 | 16    | 32  | 8.75       | 3,111,560        | 2.75    |  890
S6: S5 with smart mult sharing               |  4,900 |  4,569 | 16    | 40  | 8.75       | 2,910,344        | 2.94    |  918
S7: S6 with mult latency exp. (comb. mult)   |  4,838 |  4,080 | 16    | 40  | 13.56      | 1,975,952        | 4.33    |  893
S8: S6 with mult latency exp. (1-cycle mult) |  4,838 |  4,056 | 16    | 40  | 22.74      | 2,181,776        | 3.92    |  890
S9: S4 with SuperBurst                       |  4,025 |  4,357 | 80    | 79  | 8.75       | NA               | NA      |  NA
S10: S4 with smart burst                     |  3,979 |  4,314 | 24    | 80  | 8.75       | NA               | NA      |  NA
S11: S6 with smart burst                     |  4,224 |  4,079 | 24    | 39  | 8.75       | NA               | NA      |  NA

* Vivado HLS quick-synthesis power estimate (no units).

FPGA prototype of selected steps on the Zedboard (issues of timing closure and DMA efficiency).

IP-1: Zynq Z20 FPGA board results for a 256x256-pixel test image

Implementation                            | LUTs  | FFs   | BRAMs | DSP | Fmax (MHz) | Fexp (MHz) | Latency (ms) | Speedup | Power** (mW)
ARM Cortex-A9 MPCore (Core 0 only)        | NA    | NA    | NA    | NA  | NA         | 666.7      | 32.5         | 15.0    | NA
MicroBlaze (with cache and int. divider)  | 2,027 | 1,479 | 6     | 3   | NA         | 100.0      | 487.4        | 1.00    | NA
Hand-coded IP (resources for reference)   | 8,798 | 3,693 | 8     | 33  | 54.2       | *50 (NA)   | *4 (NA)      | NA      | 57
S3: S2 with div-to-mult                   | 6,721 | 6,449 | 8     | 72  | 40.2       | 40.0       | 142.4        | 3.42    | 12
S4: S3 with buffer memory partitioning    | 4,584 | 5,361 | 10    | 69  | 84.5       | 83.3       | 124.5        | 3.91    | 30
S6: S5 with smart mult sharing            | 5,156 | 4,870 | 10    | 32  | 91.3       | 90.9       | 136.9        | 3.56    | 26
S8: S6 with mult latency exp. (1 clk)     | 5,027 | 4,607 | 10    | 33  | 77.3       | 76.9       | 121.2        | 4.02    | 22
S9: S4 with SuperBurst                    | 4,504 | 5,008 | 42    | 70  | 77.6       | 76.9       | 26.5         | 18.39   | NA
S10: S4 with smart burst                  | 4,485 | 4,981 | 14    | 73  | 103        | 100        | 22.48        | 21.68   | 62
S11: S6 with smart burst                  | 4,813 | 4,438 | 14    | 33  | 101        | 100        | 33.7         | 14.46   | 42

** Statistical power from XPower.

Work published in Xilinx Xcell Journal

Published in Xilinx Xcell Journal, Issue 84, pages 34-41 (July 2013): "Vivado's ESL Capabilities Speed IP Design on Zynq SoC Project: automated methodology delivers results similar to hand-coded RTL for two image-processing IP cores", Syed Zahid Ahmed, Sébastien Fuhrmann, Bertrand Granado.
http://www.xilinx.com/publications/xcellonline/
www.xilinx.com/publications/archives/xcell/Xcell84.pdf

Conclusions • Potential for image/video processing projects:

– Rapid design space exploration for HW accelerators: in days instead of months!
– Wide range of optimization options using directives and constraints
– HW/SW co-design further simplified
– Ultra-fast verification cycle: self-testing testbench and its re-use, co-simulation…
– Experiments with dual-core ARM using Zynq
– Support for floating-point hardware
– Automatic creation of SW drivers for the IP

Syed Zahid AHMED

12/01/2017

34

References to get started

The Zynq Book (2014): www.zynqbook.com
Vivado documentation: http://www.xilinx.com/support/documentation/dt_vivado.htm
Vivado HLS User Guide: http://www.xilinx.com/support/documentation/sw_manuals/xilinx2012_4/ug902-vivado-high-level-synthesis.pdf
Vivado HLS Getting Started Tutorial: http://www.xilinx.com/support/documentation/sw_manuals/xilinx2012_4/ug871-vivado-high-level-synthesis-tutorial.pdf / Project files: https://secure.xilinx.com/webreg/clickthrough.do?cid=198573&license=RefDesLicense&filename=ug871-design-files.zip

White Papers / Application Notes:
– http://www.xilinx.com/support/documentation/application_notes/xapp745-processor-control-vhls.pdf
– http://www.xilinx.com/support/documentation/application_notes/xapp793-memory-structures-video-vivado-hls.pdf
– http://www.xilinx.com/support/documentation/application_notes/xapp599-floating-point-vivado-hls.pdf
– http://www.xilinx.com/support/documentation/white_papers/wp416-Vivado-Design-Suite.pdf

Video tutorials of Vivado and Vivado HLS:
– http://www.xilinx.com/training/vivado/index.htm
– http://www.youtube.com/playlist?list=PL35626FEF3D5CB8F2&feature=plcp

Xilinx’s Vivado HLS tutorial (2013):
– http://www.xilinx.com/support/documentation/sw_manuals/xilinx2013_2/ug871-vivado-high-level-synthesis-tutorial.pdf
– https://secure.xilinx.com/webreg/clickthrough.do?cid=338217&license=RefDesLicense&filename=ug871-designfiles.zip&languageID=1

Syed Zahid AHMED

12/01/2017

35

Introduction to High-Level Synthesis with Vivado HLS   This material exempt per Department of Commerce license exception TSU

Objectives   After completing this module, you will be able to: –  Describe the high level synthesis flow –  Understand the control and datapath extraction –  Describe scheduling and binding phases of the HLS flow –  List the priorities of directives set by Vivado HLS –  List comprehensive language support in Vivado HLS –  Identify steps involved in validation and verification flows

Intro to HLS 11- 2

© Copyright 2016 Xilinx

Outline   Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary

Intro to HLS 11- 3

© Copyright 2016 Xilinx

Need for High-Level Synthesis   Algorithmic-based approaches are getting popular due to accelerated design time and time to market (TTM) –  Larger designs pose challenges in design and verification of hardware at HDL level

Industry trend is moving towards hardware acceleration to enhance performance and productivity –  CPU-intensive tasks can be offloaded to hardware accelerator in FPGA –  Hardware accelerators require a lot of time to understand and design

Vivado HLS tool converts algorithmic description written in C-based design flow into hardware description (RTL) –  Elevates the abstraction level from RTL to algorithms

High-level synthesis is essential for maintaining design productivity for large designs Intro to HLS 11- 4

© Copyright 2016 Xilinx

High-Level Synthesis: HLS   High-Level Synthesis –  Creates an RTL implementation from C, C++, SystemC, OpenCL API C kernel code –  Extracts control and dataflow from the source code –  Implements the design based on defaults and user applied directives

Many implementations are possible from the same source description –  Smaller designs, faster designs, optimal designs –  Enables design exploration

Intro to HLS 11- 5

© Copyright 2016 Xilinx

Design Exploration with Directives   One body of code: Many hardware outcomes

The same hardware is used for each iteration of the loop: • Small area • Long latency • Low throughput

Intro to HLS 11- 6

… loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } ….

Different hardware is used for each iteration of the loop: • Higher area • Short latency • Better throughput

© Copyright 2016 Xilinx

Before we get into details, let’s look under the hood ….

Different iterations are executed concurrently: • Higher area • Short latency • Best throughput
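As an illustration (not part of the original slide), the three outcomes above can be obtained from the same C code purely with directives. The pragmas below use the Vivado HLS spellings, added to the FIR loop shown on this slide; with no directive at all the loop stays rolled (small area, long latency):

loop: for (i=3;i>=0;i--) {
#pragma HLS UNROLL          /* replicate the body: higher area, shorter latency       */
/* #pragma HLS PIPELINE */  /* alternative: overlap successive iterations, best throughput */
    if (i==0) {
        acc+=x*c[0];
        shift_reg[0]=x;
    } else {
        shift_reg[i]=shift_reg[i-1];
        acc+=shift_reg[i]*c[i];
    }
}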

Introduction to High-Level Synthesis   How is hardware extracted from C code? –  Control and datapath can be extracted from C code at the top level –  The same principles used in the example can be applied to sub-functions •  At some point in the top-level control flow, control is passed to a sub-function •  Sub-functions may be implemented to execute concurrently with the top level and/or other sub-functions

How is this control and dataflow turned into a hardware design? –  Vivado HLS maps this to hardware through scheduling and binding processes

How is my design created? –  How functions, loops, arrays and IO ports are mapped?

Intro to HLS 11- 7

© Copyright 2016 Xilinx

HLS: Control Extraction   Code void fir ( data_t *y, coef_t c[4], data_t x ){

Control Behavior Finite State Machine (FSM) states

Function Start

static data_t shift_reg[4]; acc_t acc; int i; acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

0 For-Loop Start

1

For-Loop End

2

Function End From any C code example ..

Intro to HLS 11- 8

The loops in the C code correlate to states of the behavior

© Copyright 2016 Xilinx

This behavior is extracted into a hardware state machine

HLS: Control & Datapath Extraction   Code void fir ( data_t *y, coef_t c[4], data_t x ){ static data_t shift_reg[4]; acc_t acc; int i; acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

From any C code example ..

Intro to HLS 11- 9

Operations

Control Behavior Finite State Machine (FSM) states

Control & Datapath Behavior Control Dataflow

RDx RDc

>= == + * + *

0

1

WRy Operations are extracted…

© Copyright 2016 Xilinx

2 The control is known

RDx

RDc

>=

-

==

-

+

*

+

*

WRy

A unified control dataflow behavior is created.

High-Level Synthesis: Scheduling & Binding   Scheduling & Binding –  Scheduling and Binding are at the heart of HLS

Scheduling determines in which clock cycle an operation will occur –  Takes into account the control, dataflow and user directives –  The allocation of resources can be constrained

Binding determines which library cell is used for each operation –  Takes into account component delays, user directives Technology Library

Design Source (C, C++, SystemC)

Scheduling

Binding

User Directives

Intro to HLS 11- 10

© Copyright 2016 Xilinx

RTL (Verilog, VHDL, SystemC)

Scheduling   The operations in the control flow graph are mapped into clock cycles a b c d e

void foo ( … t1 = a * b; t2 = c + t1; t3 = d * t2; out = t3 - e; }

Schedule 1

* +

* -

*

*

+

out

-

The technology and user constraints impact the schedule –  A faster technology (or slower clock) may allow more operations to occur in the same clock cycle

Schedule 2

*

The code also impacts the schedule –  Code implications and data dependencies must be obeyed

Intro to HLS 11- 11

© Copyright 2016 Xilinx

+

*

-

Binding   Binding is where operations are mapped to cores from the hardware library –  Operators map to cores

Binding Decision: to share –  Given this schedule:

*

+

*

-

•  Binding must use 2 multipliers, since both are in the same cycle •  It can decide to use an adder and subtractor or share one addsub

Binding Decision: or not to share –  Given this schedule:

*

+

*

-

•  Binding may decide to share the multipliers (each is used in a different cycle) •  Or it may decide the cost of sharing (muxing) would impact timing and it may decide not to share them •  It may make this same decision in the first example above too

Intro to HLS 11- 12

© Copyright 2016 Xilinx

Outline   Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary

Intro to HLS 11- 13

© Copyright 2016 Xilinx

RTL vs High-Level Language

Intro to HLS 11- 14

© Copyright 2016 Xilinx

Vivado  HLS  Benefits   Productivity –  Verification Video Design Example

•  Functional •  Architectural

–  Abstraction •  Datatypes

Input

C Simulation Time

RTL Simulation Time

Improvement

10 frames 1280x720

10s

~2 days (ModelSim)

~12000x

•  Interface •  Classes

–  Automation

RTL (Spec) C (Spec/Sim)

RTL (Sim)

Block level specification AND verification significantly reduced

Intro to HLS 11- 15

© Copyright 2016 Xilinx

RTL (Sim)

Vivado  HLS  Benefits   Portability –  Processors and FPGAs –  Technology migration –  Cost reduction –  Power reduction

Design and IP reuse Intro to HLS 11- 16

© Copyright 2016 Xilinx

Vivado  HLS  Benefits   Permutability –  Architecture Exploration •  Timing –  Parallelization –  Pipelining

•  Resources –  Sharing

–  Better QoR

Rapid design exploration delivers QoR rivaling hand-coded RTL Intro to HLS 11- 17

© Copyright 2016 Xilinx

Understanding  Vivado  HLS  Synthesis   Vivado HLS –  Determines in which cycle operations should occur (scheduling) –  Determines which hardware units to use for each operation (binding) –  Performs high-level synthesis by : •  Obeying built-in defaults •  Obeying user directives & constraints to override defaults •  Calculating delays and area using the specified technology/device

Priority of directives in Vivado HLS 1.  Meet Performance (clock & throughput) • 

Vivado HLS will allow a local clock path to fail if this is required to meet throughput

• 

Often possible the timing can be met after logic synthesis

2.  Then minimize latency 3.  Then minimize area

Intro to HLS 11- 18

© Copyright 2016 Xilinx

The Key Attributes of C code   void fir ( data_t *y, coef_t c[4], data_t x ){ static data_t shift_reg[4]; acc_t acc; int i; acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i] * c[i]; }

Functions: All code is made up of functions which represent the design hierarchy: the same in hardware Top Level IO : The arguments of the top-level function determine the hardware RTL interface ports Types: All variables are of a defined type. The type can influence the area and performance Loops: Functions typically contain loops. How these are handled can have a major impact on area and performance Arrays: Arrays are used often in C code. They can influence the device IO and become performance bottlenecks

} *y=acc; }

Operators: Operators in the C code may require sharing to control area or specific hardware implementations to meet performance

Let’s examine the default synthesis behavior of these …

Intro to HLS 11- 19

© Copyright 2016 Xilinx

Functions & RTL Hierarchy   Each function is translated into an RTL block –  Verilog module, VHDL entity

Source Code

void A() { ..body A..} void B() { ..body B..} void C() { B(); } void D() { B(); }

void foo_top() { A(…); C(…); D(…) }

RTL hierarchy foo_top

D

my_code.c

B

B

Each function/block can be shared like any other component (add, sub, etc) provided it’s not in use at the same time

–  By default, each function is implemented using a common instance –  Functions may be inlined to dissolve their hierarchy •  Small functions may be automatically inlined

Intro to HLS 11- 20

C

A

© Copyright 2016 Xilinx

Types = Operator Bit-sizes   Code void fir ( data_t *y, coef_t c[4], data_t x ){ static data_t shift_reg[4]; acc_t acc; int i; acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

From any C code example ...

Intro to HLS 11- 21

Operations

Types Standard C types long long (64-bit)

RDx RDc

>= == + * + *

short (16-bit)

int (32-bit)

char (8-bit)

float (32-bit)

double (64-bit)

unsigned types

Arbitrary Precision types

WRy

C:

ap(u)int types (1-1024)

C++:

ap_(u)int types (1-1024) ap_fixed types

C++/SystemC:

sc_(u)int types (1-1024) sc_fixed types

Can be used to define any variable to be a specific bit-width (e.g. 17-bit, 47bit etc).

Operations are extracted…

© Copyright 2016 Xilinx

The C types define the size of the hardware used: handled automatically
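As a small illustration of the arbitrary precision types listed above, here is a minimal sketch; the bit-widths are chosen arbitrarily here, and data_t/coef_t/acc_t are the typedef names used in the FIR example:

#include "ap_int.h"          /* Vivado HLS arbitrary precision header (C++)            */

typedef ap_int<12> data_t;   /* 12-bit signed samples (width picked for illustration)  */
typedef ap_int<8>  coef_t;   /* 8-bit signed coefficients                              */
typedef ap_int<24> acc_t;    /* accumulator sized to hold the sum of the products      */

acc_t mac(data_t x, coef_t c, acc_t acc) {
    return acc + x * c;      /* infers a 12x8-bit multiplier and a 24-bit adder
                                instead of full 32-bit operators                        */
}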

Loops   By default, loops are rolled –  Each C loop iteration → implemented in the same state –  Each C loop iteration → implemented with the same resources

void foo_top (…) {
   ...
   Add: for (i=3; i>=0; i--) {
      b = a[i] + b;
      ...
   }
}

[Synthesis figure: the N iterations over a[N] share a single adder accumulating into b.]

Loops require labels if they are to be referenced by Tcl directives (GUI will auto-add labels)

–  Loops can be unrolled if their indices are statically determinable at elaboration time •  Not when the number of iterations is variable

–  Unrolled loops result in more elements to schedule but greater operator mobility •  Let’s look at an example ….

Intro to HLS 11- 22

© Copyright 2016 Xilinx

Data  Dependencies:  Good     void fir ( … acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

Default Schedule == -

*

>=

==

-

+

-

RDc

*

>=

==

-

+

-

RDc

Iteration 1

Iteration 2

The read X operation has good mobility

Example of good mobility –  The read on data port X can occur anywhere from the start to iteration 4 •  The only constraint on RDx is that it occur before the final multiplication

–  Vivado HLS has a lot of freedom with this operation •  It waits until the read is required, saving a register •  There are no advantages to reading any earlier (unless you want it registered) •  Input reads can be optionally registered

–  The final multiplication is very constrained…

Intro to HLS 11- 23

© Copyright 2016 Xilinx

*

>=

==

*

>=

-

+

-

RDx

+

RDc

Iteration 3

RDc

Iteration 4

WRy

Data  Dependencies:  Bad   void fir ( … acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

Default Schedule == -

*

>=

==

-

+

-

RDc

*

>=

==

-

+

-

RDc

Iteration 1

Iteration 2

>=

==

*

>=

-

+

-

RDx

+

Iteration 3

RDc

Iteration 4

Mult is very constrained

Example of bad mobility –  The final multiplication must occur before the read and final addition •  It could occur in the same cycle if timing allows

–  Loops are rolled by default •  Each iteration cannot start till the previous iteration completes •  The final multiplication (in iteration 4) must wait for earlier iterations to complete

–  The structure of the code is forcing a particular schedule •  There is little mobility for most operations

–  Optimizations allow loops to be unrolled giving greater freedom Intro to HLS 11- 24

* RDc

© Copyright 2016 Xilinx

WRy

Schedule after Loop Optimization   With the loop unrolled (completely) –  The dependency on loop iterations is gone –  Operations can now occur in parallel •  If data dependencies allow •  If operator timing allows

RDc

RDc

RDc RDx

–  Design finished faster but uses more operators •  2 multipliers & 2 Adders

* *

* *

+

+ +

Schedule Summary

WRy

–  All the logic associated with the loop counters and index checking are now gone –  Two multiplications can occur at the same time •  All 4 could, but it’s limited by the number of input reads (2) on coefficient port C

–  Why 2 reads on port C? •  The default behavior for arrays now limits the schedule…

Intro to HLS 11- 25

RDc

© Copyright 2016 Xilinx

void fir ( … acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

Arrays  in  HLS   An array in C code is implemented by a memory in the RTL –  By default, arrays are implemented as RAMs, optionally a FIFO void foo_top(int x, …) { int A[N]; L1: for (i = 0; i < N; i++) A[i+x] = A[i] + i; }

N-1

SPRAMB

N-2 …

foo_top

A[N]

Synthesis

A_in

1 0

DIN ADDR

DOUT

A_out

CE WE

The array can be targeted to any memory resource in the library –  The ports (Address, CE active high, etc.) and sequential operation (clocks from address to data out) are defined by the library model –  All RAMs are listed in the Vivado HLS Library Guide

Arrays can be merged with other arrays and reconfigured –  To implement them in the same memory or one of different widths & sizes

Arrays can be partitioned into individual elements –  Implemented as smaller RAMs or registers Intro to HLS 11- 26

© Copyright 2016 Xilinx
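A hedged sketch of targeting an array to a specific memory resource, as described above. The RESOURCE directive spelling follows the Vivado HLS user guide of this era; the array name, size and core are chosen here for illustration:

void foo_top(int x, int out[32]) {
    int A[32];
#pragma HLS RESOURCE variable=A core=RAM_1P_BRAM   /* map A to a single-port block RAM */
    L1: for (int i = 0; i < 32; i++)
        A[i] = x + i;
    L2: for (int i = 0; i < 32; i++)
        out[i] = A[i];
}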

Top-Level IO Ports   Top-level function arguments –  All top-level function arguments have a default hardware port type

When the array is an argument of the top-level function –  The array/RAM is “off-chip” –  The type of memory resource determines the top-level IO ports –  Arrays on the interface can be mapped & partitioned •  E.g. partitioned into separate ports for each element in the array DPRAMB

foo_top

Synthesis

+

void foo_top( int A[3*N] , int x) { L1: for (i = 0; i < N; i++) A[i+x] = A[i] + i; }

DIN0 ADDR0

DOUT0

CE0 WE0

Number of ports defined by the RAM resource

DIN1 ADDR1

Default RAM resource –  Dual port RAM if performance can be improved otherwise Single Port RAM

Intro to HLS 11- 27

© Copyright 2016 Xilinx

CE1 WE1

DOUT1

Schedule after an Array Optimization   With the existing code & defaults –  Port C is a dual port RAM –  Allows 2 reads per clock cycle

RDc

RDc

RDc

RDc RDx

•  IO behavior impacts performance Note: It could have performed 2 reads in the original rolled design but there was no advantage since the rolled loop forced a single read per cycle

* *

* *

+

+ +

loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc;

WRy

With the C port partitioned into (4) separate ports –  All reads and mults can occur in one cycle –  If the timing allows •  The additions can also occur in the same cycle •  The write can be performed in the same cycles •  Optionally the port reads and writes could be registered

RDc RDc RDc RDc RDx

* * * *

+ + + WRy

Intro to HLS 11- 28

© Copyright 2016 Xilinx
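A hedged sketch of the partitioning behind the schedule above, in pragma form and applied to the FIR example from the earlier slides (the pragmas and the UNROLL are additions chosen here; complete partitioning turns c[4] into four individual ports/registers so all reads can happen in one cycle):

void fir (data_t *y, coef_t c[4], data_t x) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1          /* 4 separate coefficient ports */
    static data_t shift_reg[4];
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1  /* registers instead of a RAM   */
    acc_t acc;
    int i;
    acc = 0;
    loop: for (i=3; i>=0; i--) {
#pragma HLS UNROLL                                             /* let the iterations overlap    */
        if (i==0) {
            acc += x*c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i-1];
            acc += shift_reg[i]*c[i];
        }
    }
    *y = acc;
}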

Operators   Operator sizes are defined by the type –  The variable type defines the size of the operator

Vivado HLS will try to minimize the number of operators –  By default Vivado HLS will seek to minimize area after constraints are satisfied

User can set specific limits & targets for the resources used –  Allocation can be controlled •  An upper limit can be set on the number of operators or cores allocated for the design: This can be used to force sharing •  e.g limit the number of multipliers to 1 will force Vivado HLS to share 3

2

1

0

Use 1 mult, but take 4 cycle even if it could be done in 1 cycle using 4 mults

–  Resources can be specified •  The cores used to implement each operator can be specified •  e.g. Implement each multiplier using a 2 stage pipelined core (hardware)

Intro to HLS 11- 29

3

1

2

0

Same 4 mult operations could be done with 2 pipelined mults (with allocation limiting the mults to 2)

© Copyright 2016 Xilinx
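A hedged sketch of limiting operator allocation as described above; the ALLOCATION pragma form follows the Vivado HLS user guide, and the function and array names are chosen here. Limiting multipliers to one forces the four multiplications to share a single core over four cycles:

void dot4(const int a[4], const int b[4], int *out) {
#pragma HLS ALLOCATION instances=mul limit=1 operation   /* one shared multiplier */
    int acc = 0;
    M: for (int i = 0; i < 4; i++)
        acc += a[i] * b[i];
    *out = acc;
}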

Outline   Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary

Intro to HLS 11- 30

© Copyright 2016 Xilinx

Comprehensive  C  Support   A Complete C Validation & Verification Environment –  Vivado HLS supports complete bit-accurate validation of the C model –  Vivado HLS provides a productive C-RTL co-simulation verification solution

Vivado HLS supports C, C++, SystemC and OpenCL API C kernel –  Functions can be written in any version of C –  Wide support for coding constructs in all three variants of C

Modeling with bit-accuracy –  Supports arbitrary precision types for all input languages –  Allowing the exact bit-widths to be modeled and synthesized

Floating point support –  Support for the use of float and double in the code

Support for OpenCV functions –  Enable migration of OpenCV designs into Xilinx FPGA –  Libraries target real-time full HD video processing Intro to HLS 11- 31

© Copyright 2016 Xilinx

C, C++ and SystemC Support   The vast majority of C, C++ and SystemC is supported –  Provided it is statically defined at compile time –  If it's not defined until run time, it won't be synthesizable

Any of the three variants of C can be used –  If C is used, Vivado HLS expects the file extensions to be .c –  For C++ and SystemC it expects file extensions .cpp

Intro to HLS 11- 32

© Copyright 2016 Xilinx
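A minimal illustration (not from the slides) of the “statically defined at compile time” rule; the function names are chosen here:

#include <stdlib.h>

void copy_fixed(int out[64], const int in[64]) {
    C1: for (int i = 0; i < 64; i++)           /* fixed bound, static storage: synthesizable */
        out[i] = in[i];
}

void copy_dynamic(int *out, const int *in, int n) {
    int *tmp = (int *)malloc(n * sizeof(int)); /* run-time allocation: not synthesizable      */
    C2: for (int i = 0; i < n; i++)
        tmp[i] = in[i];
    C3: for (int i = 0; i < n; i++)
        out[i] = tmp[i];
    free(tmp);
}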

Outline   Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary

Intro to HLS 11- 33

© Copyright 2016 Xilinx

C Validation and RTL Verification   There are two steps to verifying the design –  Pre-synthesis: C Validation •  Validate the algorithm is correct

–  Post-synthesis: RTL Verification •  Verify the RTL is correct

C validation

Validate C

–  A HUGE reason users want to use HLS •  Fast, free verification −  Validate the algorithm is correct before synthesis •  Follow the test bench tips given over

RTL Verification

Verify RTL

–  Vivado HLS can co-simulate the RTL with the original test bench Intro to HLS 11- 34

© Copyright 2016 Xilinx

C Function Test Bench   The test bench is the level above the function –  The main() function is above the function to be synthesized

Good Practices –  The test bench should compare the results with golden data •  Automatically confirms any changes to the C are validated and verifies the RTL is correct

–  The test bench should return a 0 if the self-checking is correct •  Anything but a 0 (zero) will cause RTL verification to issue a FAIL message •  Function main() should expect an integer return (non-void) int main () { int ret=0; … ret = system("diff --brief -w output.dat output.golden.dat"); if (ret != 0) { printf("Test failed !!!\n"); ret=1; } else { printf("Test passed !\n"); } … return ret; } Intro to HLS 11- 35

© Copyright 2016 Xilinx

Determine or Create the Top-level Function   Determine the top-level function for synthesis If there are multiple functions, they must be merged –  There can only be 1 top-level function for synthesis Given a case where functions func_A and func_B are to be implemented in FPGA

Re-partition the design to create a new single top-level function inside main() main.c

main.c int main () { ... func_A(a,b,*i1); func_B(c,*i1,*i2); func_C(*i2,ret)

#include func_AB.h int main (a,b,c,d) { ... // func_A(a,b,i1); // func_B(c,i1,i2); func_AB (a,b,c, *i1, *i2); func_C(*i2,ret)

func_A func_B func_C

return ret; }

func_AB func_C

return ret; }

func_AB.c

Recommendation is to separate test bench and design files

Intro to HLS 11- 36

© Copyright 2016 Xilinx

#include func_AB.h func_AB(a,b,c, *i1, *i2) { ... func_A(a,b,*i1); func_B(c,*i1,*i2); … }

func_A func_B

Outline   Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary

Intro to HLS 11- 37

© Copyright 2016 Xilinx

Summary   In HLS –  C becomes RTL –  Operations in the code map to hardware resources –  Understand how constructs such as functions, loops and arrays are synthesized

HLS design involves –  Synthesize the initial design –  Analyze to see what limits the performance •  User directives to change the default behaviors •  Remove bottlenecks

–  Analyze to see what limits the area •  The types used define the size of operators •  This can have an impact on what operations can fit in a clock cycle

Intro to HLS 11- 38

© Copyright 2016 Xilinx

Summary   Use directives to shape the initial design to meet performance –  Increase parallelism to improve performance –  Refine bit sizes and sharing to reduce area

Vivado HLS benefits –  Productivity –  Portability –  Permutability

Intro to HLS 11- 39

© Copyright 2016 Xilinx

Improving  Performance   This material exempt per Department of Commerce license exception TSU

Objectives   After completing this module, you will be able to: –  Add directives to your design –  List a number of ways to improve performance –  State directives which are useful to improve latency –  Describe how loops may be handled to improve latency –  Recognize the dataflow technique that improves throughput of the design –  Describe the pipelining technique that improves throughput of the design –  Identify some of the bottlenecks that impact design performance

Improving Performance 13- 2

© Copyright 2016 Xilinx

Outline   Adding Directives Improving Latency –  Manipulating Loops

Improving Throughput Performance Bottleneck Summary

Improving Performance 13- 3

© Copyright 2016 Xilinx

Improving  Performance   Vivado HLS has a number of ways to improve performance –  Automatic (and default) optimizations –  Latency directives –  Pipelining to allow concurrent operations

Vivado HLS support techniques to remove performance bottlenecks –  Manipulating loops –  Partitioning and reshaping arrays

Optimizations are performed using directives –  Let’s look first at how to apply and use directives in Vivado HLS

Improving Performance 13- 4

© Copyright 2016 Xilinx

Applying Directives   If the source code is open in the GUI Information pane –  The Directive tab in the Auxiliary pane shows all the locations and objects upon which directives can be applied (in the opened C file, not the whole design) •  Functions, Loops, Regions, Arrays, Top-level arguments

–  Select the object in the Directive Tab •  “dct” function is selected

–  Right-click to open the editor dialog box –  Select a desired directive from the dropdown menu •  “DATAFLOW” is selected

–  Specify the Destination •  Source File •  Directive File

Improving Performance 13- 5

© Copyright 2016 Xilinx

Optimization Directives: Tcl or Pragma   Directives can be placed in the directives file –  The Tcl command is written into directives.tcl –  There is a directives.tcl file in each solution •  Each solution can have different directives Once applied, the directive will be shown in the Directives tab (right-click to modify or delete)

Directives can be placed into the C source –  Pragmas are added (and will remain) in the C source file –  Pragmas (#pragma) will be used by every solution which uses the code

Improving Performance 13- 6

© Copyright 2016 Xilinx
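As an illustration of the two forms (the function, label and types below are chosen here, not taken from the slides): the pragma version lives in the C source, while the equivalent Tcl command, e.g. set_directive_pipeline "scale/L", would be written into the solution's directives.tcl.

typedef int data_t;   /* stand-in type for this sketch */

void scale(const data_t in[16], data_t out[16]) {
    L: for (int i = 0; i < 16; i++) {
#pragma HLS PIPELINE   /* pragma form: stays in the source, used by every solution */
        out[i] = in[i] * 3;
    }
}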

Solution Configurations   Configurations can be set on a solution – Set the default behavior for that solution •  Open configuration settings from the menu (Solutions > Solution Settings…)

“Add” or “Remove” configuration settings

Select “General”

– Choose the configuration from the drop-down menu •  Array Partitioning, Binding, Dataflow Memory types, Interface, RTL Settings, Core, Compile, Schedule efforts

Improving Performance 13- 7

© Copyright 2016 Xilinx

Example:  Configuring  the  RTL  Output   Specify the FSM encoding style –  By default the FSM is auto

Add a header string to all RTL output files –  Example: Copyright Acme Inc.

Add a user specified prefix to all RTL output filenames –  The RTL has the same name as the C functions –  Allow multiple RTL variants of the same top-level function to be used together without renaming files

Reset all registers –  By default only the FSM registers and variables initialized in the code are reset –  RAMs are initialized in the RTL and bitstream

Synchronous or Asynchronous reset

The remainder of the configuration commands will be covered throughout the course

–  The default is synchronous reset

Active high or low reset –  The default is active high Improving Performance 13- 8

© Copyright 2016 Xilinx

Copying Directives into New Solutions   Click the New Solution Button Optionally modify any of the settings –  Part, Clock Period, Uncertainty –  Solution Name

Copy existing directives –  Selected by default –  Uncheck if you do not want to copy –  No need to copy pragmas, they are in the code

Improving Performance 13- 9

© Copyright 2016 Xilinx

Outline   Adding Directives Improving Latency –  Manipulating Loops

Improving Throughput Performance Bottleneck Summary

Improving Performance 13- 10

© Copyright 2016 Xilinx

Latency and Throughput – The Performance Factors   Design Latency –  The latency of the design is the number of cycles it takes to output the result •  In this example the latency is 10 cycles

Design Throughput –  The throughput of the design is the number of cycles between new inputs •  By default (no concurrency) this is the same as latency •  Next start/read is when this transaction ends Improving Performance 13- 11

© Copyright 2016 Xilinx

Latency  and  Throughput   In the absence of any concurrency –  Latency is the same as throughput

Pipelining for higher throughput –  Vivado HLS can pipeline functions and loops to improve throughput –  Latency and throughput are related –  We will discuss optimizing for latency first, then throughput

Improving Performance 13- 12

© Copyright 2016 Xilinx

Vivado  HLS:  Minimize  latency   Vivado HLS will by default minimize latency –  Throughput is prioritized above latency (no throughput directive is specified here) –  In this example •  The functions are connected as shown •  Assume function B takes longer than any other functions

Vivado HLS will automatically take advantage of the parallelism –  It will schedule functions to start as soon as they can •  Note it will not do this for loops within a function: by default they are executed in sequence

Improving Performance 13- 13

© Copyright 2016 Xilinx

Reducing  Latency   Vivado HLS has the following directives to reduce latency –  LATENCY •  Allows a minimum and maximum latency constraint to be specified

–  LOOP_FLATTEN •  Allows nested loops to be collapsed into a single loop with improved latency

–  LOOP_MERGE •  Merge consecutive loops to reduce overall latency, increase sharing, and improve logic optimization

–  UNROLL

Improving Performance 13- 14

© Copyright 2016 Xilinx
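A hedged sketch of two of the latency directives listed above; the function name, label and bound are chosen here for illustration:

void accumulate(const int a[4], int *out) {
#pragma HLS LATENCY max=8      /* constrain the whole function to at most 8 cycles */
    int acc = 0;
    Add: for (int i = 0; i < 4; i++) {
#pragma HLS UNROLL             /* unrolling removes the loop-boundary cycles        */
        acc += a[i];
    }
    *out = acc;
}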

Default  Behavior:  Minimizing  Latency   Functions –  Vivado HLS will seek to minimize latency by allowing functions to operate in parallel •  As shown on the previous slide

Loops –  Vivado HLS will not schedule loops to operate in parallel by default •  Dataflow optimization must be used or the loops must be unrolled •  Both techniques are discussed in detail later

Operations –  Vivado HLS will seek to minimize latency by allowing operations to occur in parallel, where data dependencies allow

Loops are rolled by default:

void foo_top (…) {
   ...
   Add: for (i=3; i>=0; i--) {
      b = a[i] + b;
      ...
   }
}

Loops require labels if they are to be referenced by Tcl directives (GUI will auto-add labels)

–  Loops can be unrolled if their indices are statically determinable at elaboration time •  Not when the number of iterations is variable

Improving Performance 13- 19

© Copyright 2016 Xilinx

b

Rolled Loops Enforce Latency   A rolled loop can only be optimized so much –  Given this example, where the delay of the adder is small compared to the clock period:

void foo_top (…) {
   ...
   Add: for (i=3; i>=0; i--) {
      b = a[i] + b;
      ...
   }
}

[Timing figure: iterations 3, 2, 1, 0 each occupy one clock cycle; the adder delay fills only a fraction of each clock period.]

–  This rolled loop will never take less than 4 cycles •  No matter what kind of optimization is tried •  This minimum latency is a function of the loop iteration count

Improving Performance 13- 20

© Copyright 2016 Xilinx


Unrolled  Loops  can  Reduce  Latency  

Select loop “Add” in the directives pane and right-click

Unrolled loops allow greater option & exploration

Options explained on next slide

Improving Performance 13- 21

Unrolled loops are likely to result in more hardware resources and higher area

© Copyright 2016 Xilinx

Partial Unrolling   Fully unrolling loops can create a lot of hardware Loops can be partially unrolled –  Provides the type of exploration shown in the previous slide

Partial Unrolling –  A standard loop of N iterations can be unrolled to by a factor –  For example unroll by a factor 2, to have N/2 iterations

Add: for(int i = 0; i < N; i++) { a[i] = b[i] + c[i]; }

Add: for(int i = 0; i < N; i += 2) { a[i] = b[i] + c[i]; if (i+1 >= N) break; a[i+1] = b[i+1] + c[i+1]; }

•  Similar to writing new code as shown on the right → •  The break accounts for the condition when N/2 is not an integer

Effective code after compiler transformation

–  If N is known to be an integer multiple of the unroll factor •  The user can remove the exit check (and associated logic) •  Vivado HLS is not always able to determine this is true (e.g. if N is an input argument) •  User takes responsibility: verify!

Improving Performance 13- 22

© Copyright 2016 Xilinx

for(int i = 0; i < N; i += 2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1]; }

An extra adder for N/ 2 cycles trade-off
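A hedged sketch of the partial unrolling shown above in pragma form; the skip_exit_check option removes the exit test. Here N=100 is a known multiple of the factor, so it is safe, but in general the user must verify this:

#define N 100

void vadd(int a[N], const int b[N], const int c[N]) {
    Add: for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=2 skip_exit_check   /* two additions per iteration, no exit check */
        a[i] = b[i] + c[i];
    }
}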

Loop Flattening   Vivado HLS can automatically flatten nested loops –  A faster approach than manually changing the code

Flattening should be specified on the inner most loop –  It will be flattened into the loop above –  The “off” option can prevent loops in the hierarchy from being flattened

[FSM figure: the original nested-loop structure requires 36 state transitions; after flattening, 28 transitions.]

void foo_top (…) {
   ...
   L1: for (i=3; i>=0; i--) {
      [loop body l1]
   }
   L2: for (i=3; i>=0; i--) {
      L3: for (j=3; j>=0; j--) {
         [loop body l3]
      }
   }
   L4: for (i=3; i>=0; i--) {
      [loop body l4]
   }
}

After flattening (L3 collapsed into L2):

void foo_top (…) {
   ...
   L1: for (i=3; i>=0; i--) {
      [loop body l1]
   }
   L2: for (k=15; k>=0; k--) {
      [loop body l3]
   }
   L4: for (i=3; i>=0; i--) {
      [loop body l4]
   }
}

Loops will be flattened by default: use “off” to disable

Improving Performance 13- 23

© Copyright 2016 Xilinx


Perfect and Semi-Perfect Loops   Only perfect and semi-perfect loops can be flattened –  The loop should be labeled or directives cannot be applied –  Perfect Loops –  Only the inner most loop has body (contents)

Loop_outer: for (i=3; i>=0; i--) {
   Loop_inner: for (j=3; j>=0; j--) {
      [loop body]
   }
}

–  There is no logic specified between the loop statements –  The loop bounds are constant

–  Semi-perfect Loops –  Only the inner most loop has body (contents) –  There is no logic specified between the loop statements

Loop_outer: for (i=3; i>N; i--) {
   Loop_inner: for (j=3; j>=0; j--) {
      [loop body]
   }
}

–  The outer most loop bound can be variable –  Other types

–  Should be converted to perfect or semi-perfect loops

Improving Performance 13- 24

© Copyright 2016 Xilinx

Loop_outer: for (i=3; i>N; i--) {
   [loop body]
   Loop_inner: for (j=3; j>=M; j--) {
      [loop body]
   }
}

Loop  Merging   Vivado HLS can automatically merge loops –  A faster approach than manually changing the code –  Allows for more efficient architecture explorations –  FIFO reads, which must occur in strict order, can prevent loop merging •  Can be done with the “force” option : user takes responsibility for correctness

[FSM figure: the three separate loops require 36 state transitions; after merging, 18 transitions.]

void foo_top (…) {
   ...
   L1: for (i=3; i>=0; i--) {
      [loop body l1]
   }
   L2: for (i=3; i>=0; i--) {        (L2/L3 already flattened)
      L3: for (j=3; j>=0; j--) {
         [loop body l3]
      }
   }
   L4: for (i=3; i>=0; i--) {
      [loop body l4]
   }
}

After merging:

void foo_top (…) {
   ...
   L123: for (l=16; l>=0; l--) {
      if (cond1)
         [loop body l1]
      [loop body l3]
      if (cond4)
         [loop body l4]
   }
}

Improving Performance 13- 25

© Copyright 2016 Xilinx

Loop  Merge  Rules   If loop bounds are all variables, they must have the same value If loops bounds are constants, the maximum constant value is used as the bound of the merged loop –  As in the previous example where the maximum loop bounds become 16 (implied by L3 flattened into L2 before the merge)

Loops with both variable bound and constant bound cannot be merged The code between loops to be merged cannot have side effects –  Multiple execution of this code should generate same results •  A=B is OK, A=A+1 is not

Reads from a FIFO or FIFO interface must always be in sequence –  A FIFO read in one loop will not be a problem –  FIFO reads in multiple loops may become out of sequence •  This prevents loops being merged

Improving Performance 13- 26

© Copyright 2016 Xilinx

Loop  Reports   Vivado HLS reports the latency of loops –  Shown in the report file and GUI

Given a variable loop index, the latency cannot be reported –  Vivado HLS does not know the limits of the loop index –  This results in latency reports showing unknown values

The loop tripcount (iteration count) can be specified –  Apply to the loop in the directives pane –  Allows the reports to show an estimated latency

Improving Performance 13- 27

© Copyright 2016 Xilinx

Impacts reporting – not synthesis
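A hedged sketch of the tripcount directive described above; the bounds and names are illustrative, and the pragma affects reporting only, not the generated hardware:

void accumulate_var(const int *a, int n, int *out) {
    int s = 0;
    Sum: for (int i = 0; i < n; i++) {
#pragma HLS LOOP_TRIPCOUNT min=16 max=256   /* lets the report estimate latency for a variable bound */
        s += a[i];
    }
    *out = s;
}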

Techniques for Minimizing Latency - Summary   Constraints –  Vivado HLS accepts constraints for latency

Loop Optimizations –  Latency can be improved by minimizing the number of loop boundaries •  Rolled loops (default) enforce sharing at the expense of latency •  The entry and exits to loops costs clock cycles

Improving Performance 13- 28

© Copyright 2016 Xilinx

Outline   Adding Directives Improving Latency –  Manipulating Loops

Improving Throughput Performance Bottleneck Summary

Improving Performance 13- 29

© Copyright 2016 Xilinx

Improving  Throughput   Given a design with multiple functions –  The code and dataflow are as shown

Vivado HLS will schedule the design

It can also automatically optimize the dataflow for throughput

Improving Performance 13- 30

© Copyright 2016 Xilinx

Dataflow Optimization   Dataflow Optimization –  Can be used at the top-level function –  Allows blocks of code to operate concurrently •  The blocks can be functions or loops •  Dataflow allows loops to operate concurrently

–  It places channels between the blocks to maintain the data rate

•  For arrays the channels will include memory elements to buffer the samples •  For scalars the channel is a register with hand-shakes

Dataflow optimization therefore has an area overhead –  Additional memory blocks are added to the design –  The timing diagram on the previous page should have a memory access delay between the blocks •  Not shown to keep explanation of the principle clear

Improving Performance 13- 31

© Copyright 2016 Xilinx

Dataflow Optimization Commands   Dataflow is set using a directive –  Vivado HLS will seek to create the highest performance design •  Throughput of 1

Improving Performance 13- 32

© Copyright 2016 Xilinx

Dataflow Optimization through Configuration Command   Configuring Dataflow Memories –  Between functions Vivado HLS uses ping-pong memory buffers by default •  The memory size is defined by the maximum number of producer or consumer elements

–  Between loops Vivado HLS will determine if a FIFO can be used in place of a ping-pong buffer –  The memories can be specified to be FIFOs using the Dataflow Configuration •  Menu: Solution > Solution Settings > config_dataflow •  With FIFOs the user can override the default size of the FIFO •  Note: Setting the FIFO too small may result in an RTL verification failure

Individual Memory Control –  When the default is ping-pong •  Select an array and mark it as Streaming (directive STREAM) to implement the array as a FIFO

–  When the default is FIFO •  Select an array and mark it as Streaming (directive STREAM) with option “off” to implement the array as a ping-pong buffer To use FIFOs the access must be sequential. If HLS determines that the access is not sequential then it will halt and issue a message. If HLS cannot determine the sequential nature then it will issue a warning and continue.

Improving Performance 13- 33

© Copyright 2016 Xilinx
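A hedged sketch combining the DATAFLOW directive with the STREAM directive mentioned above; the array sizes, names and FIFO depth are chosen here, and the accesses are strictly sequential so a FIFO channel is legal:

void top(const int in[64], int out[64]) {
#pragma HLS DATAFLOW                       /* let the two loops run concurrently            */
    int tmp[64];
#pragma HLS STREAM variable=tmp depth=8    /* FIFO channel instead of the default ping-pong */
    Producer: for (int i = 0; i < 64; i++)
        tmp[i] = in[i] * 3;
    Consumer: for (int i = 0; i < 64; i++)
        out[i] = tmp[i] + 1;
}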

Dataflow: Ideal for streaming arrays & multi-rate functions   Arrays are passed as single entities by default –  This example uses loops but the same principle applies to functions

Dataflow pipelining allows loop_2 to start when data is ready –  The throughput is improved –  Loops will operate in parallel •  If dependencies allow

Multi-Rate Functions –  Dataflow buffers data when one function or loop consumes or produces data at different rate from others

IO flow support –  To take maximum advantage of dataflow in streaming designs, the IO interfaces at both ends of the datapath should be streaming/handshake types (ap_hs or ap_fifo) Improving Performance 13- 34

© Copyright 2016 Xilinx

Dataflow Limitations (1)   Must be single-producer, single-consumer; the following code violates the rule and dataflow does not work. The Fix

Improving Performance 13- 35

© Copyright 2016 Xilinx

Dataflow Limitations (2)   You cannot bypass a task; the following code violates this rule and dataflow does not work. The fix: make it a systolic-like datapath

Improving Performance 13- 36

© Copyright 2016 Xilinx

Dataflow vs Pipelining Optimization   Dataflow Optimization –  Dataflow optimization is “coarse grain” pipelining at the function and loop level –  Increases concurrency between functions and loops –  Only works on functions or loops at the top-level of the hierarchy •  Cannot be used in sub-functions

Function & Loop Pipelining –  “Fine grain” pipelining at the level of the operators (*, +, >>, etc.) –  Allows the operations inside the function or loop to operate in parallel –  Unrolls all sub-loops inside the function or loop being pipelined •  Loops with variable bounds cannot be unrolled: This can prevent pipelining •  Unrolling loops increases the number of operations and can increase memory and run time

Improving Performance 13- 37

© Copyright 2016 Xilinx

Function Pipelining

Without pipelining: there are 3 clock cycles before operation RD can occur again – throughput = 3 cycles; and there are 3 cycles before the 1st output is written – latency = 3 cycles.

With pipelining: the latency is the same, but the throughput is better – fewer cycles between new inputs.

void foo(...) {
   op_Read;      /* RD  */
   op_Compute;   /* CMP */
   op_Write;     /* WR  */
}

[Timing diagram: without pipelining, RD CMP WR complete before the next RD starts (throughput = 3 cycles, latency = 3 cycles); with pipelining, a new RD starts every clock cycle, overlapping the CMP and WR of previous transactions (throughput = 1 cycle, latency = 3 cycles).]

Improving Performance 13- 38

© Copyright 2016 Xilinx
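A hedged illustration of the pipelining directive discussed above; the function body and names are chosen here, not taken from the slides:

void foo(volatile int *in, volatile int *out) {
#pragma HLS PIPELINE II=1   /* target a new transaction every clock cycle */
    int v = *in;            /* RD  */
    int r = v * v + 3;      /* CMP */
    *out = r;               /* WR  */
}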

Loop Pipelining   [Timing diagram: without pipelining, each loop iteration completes before the next begins; with pipelining, successive iterations overlap, improving throughput.]

Function Inlining example (Improving Area): sumsub_func and shift_func are inlined into add_sub_pass, so the whole datapath (2 adders, 2 subtractors, and shifts such as B>>1 that cost zero area) can be optimized together:

int sumsub_func (int *in1, int *in2, int *outSum, int *outSub) {
   *outSum = *in1 + *in2;
   *outSub = *in1 - *in2;
}

void shift_func (int *in1, int *in2, int *outA, int *outB) {
   *outA = *in1 >> 1;
   *outB = *in2 >> 2;
}

void add_sub_pass (int A, int B, int *C, int *D) {
   int apb, amb;
   int a2, b2;
   sumsub_func(&A,&B,&apb,&amb);
   sumsub_func(&apb,&amb,&a2,&b2);
   shift_func(&a2,&b2,C,D);
}

Inlining allows optimization to be performed across function hierarchies. Like RTL ungrouping, too much inlining can create a lot of logic and slow runtime.

21- 12 Improving Area and Resources 21- 12

© Copyright 2016 Xilinx

Inline and Allocation: Shape the Hierarchy   Easy to Share

One RTL block is reused for both instances of function foo

Cannot be shared

Function foo is not within the immediate scope of foo_top

21- 13 Improving Area and Resources 21- 13

© Copyright 2016 Xilinx

Controlling Sharing

Inlining brings foo into function foo_top where it can be shared
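A hedged sketch of the two options above; the function names follow the slide's foo/foo_top, while the bodies and pragma placement are chosen here. ALLOCATION limits the number of RTL instances of a function, whereas INLINE dissolves the hierarchy instead:

int foo(int a, int b) {
    /* adding "#pragma HLS INLINE" here would dissolve foo into foo_top instead */
    return (a + b) >> 1;
}

void foo_top(int x[4], int y[2]) {
#pragma HLS ALLOCATION instances=foo limit=1 function   /* share one RTL instance of foo */
    y[0] = foo(x[0], x[1]);
    y[1] = foo(x[2], x[3]);
}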

Loops   By default, loops are rolled –  Each C loop iteration → implemented in the same state –  Each C loop iteration → implemented with the same resources

void foo_top (…) {
   ...
   Add: for (i=3; i>=0; i--) {
      b = a[i] + b;
      ...
   }
}

[Synthesis figure: the N iterations over a[N] share a single adder accumulating into b.]

For Area optimization: keeping loops rolled maximizes sharing across loop iterations, since each iteration of the loop uses the same hardware resources

21- 14 Improving Area and Resources 21- 14

© Copyright 2016 Xilinx

Loop Merging & Flattening   Loop merging & flattening can remove the redundant computation among multiple (related) loops –  Improving area (and sometimes performance) My_Region: { #pragma HLS LOOP_MERGE for (i = 0; i < N; ++i) A[i] = B[i] + 1;

Merge

for (i = 0; i < N; ++i) C[i] = A[i] / 2;

for (i = 0; i < N; ++i) { A[i] = B[i] + 1; C[i] = A[i] / 2; }

Effective code after compiler transformation

}

Allows Vivado HLS to perform optimizations –  Optimization cannot occur across loop boundaries for (i = 0; i < N; ++i) C[i] = (B[i] + 1) / 2;

Removes A[i], any address logic and any potential memory accesses

21- 15 Improving Area and Resources 21- 15

© Copyright 2016 Xilinx

Mapping  Arrays   The arrays in the C model may not be ideal for the available RAMs –  The code may have many small arrays –  The array may not utilize the RAMs very well

Array Mapping –  Mapping combines smaller arrays into larger arrays •  Allows arrays to be reconfigured without code edits

–  Specify the array variable to be mapped –  Give all arrays to be combined the same instance name

Vivado HLS provides options as to the type of mapping –  Combine the arrays without impacting performance •  Vertical & Horizontal mapping

Global Arrays –  When a global array is mapped all arrays involved are promoted to global –  When arrays are in different functions, the target becomes global

Arrays which are function arguments –  All must be part of the same function interface

21- 16 Improving Area and Resources 21- 16

© Copyright 2016 Xilinx

Horizontal  Mapping   Horizontal Mapping –  Combines multiple arrays into longer (horizontal) array –  Optionally allows the arrays to be offset •  The default is to concatenate after the last element

•  The first array specified (in GUI or Tcl script) starts at location zero 21- 17 Improving Area and Resources 21- 17

© Copyright 2016 Xilinx

Vertical Mapping   Vertical Mapping –  Combines multiple arrays into an array with more bits

–  The first array specified (in Tcl or GUI) starts at the LSB

Vertical Mapping for performance –  Creates RAMs with wide words è Parallel accesses

21- 18 Improving Area and Resources 21- 18

© Copyright 2016 Xilinx

Arbitrary Precision Integers   C and C++ have standard types created on the 8-bit boundary –  char (8-bit), short (16-bit), int (32-bit), long long (64-bit) •  Also provides stdint.h (for C), and stdint.h and cstdint (for C++) •  Types: int8_t, uint16_t, uint32_t, int64_t etc.

–  They result in hardware which is not bit-accurate and can give sub-standard QoR

Vivado HLS provides bit-accurate types in both C and C++ –  Plus SystemC types can be used in C++ –  Allow any arbitrary bit-width to be specified –  Will simulate with bit-accuracy

21- 19 Improving Area and Resources 21- 19

© Copyright 2016 Xilinx

Why  are  Arbitrary  Precision  types  Needed?   Code using native C int type

However, if the inputs will only have a max range of 8-bit –  Arbitrary precision data-types should be used

–  It will result in smaller & faster hardware with full precision 21- 20 Improving Area and Resources 21- 20

© Copyright 2016 Xilinx

Outline   Optimizing Resource Utilization Reducing Area Usage Summary

Improving Area and Resources 21- 21

© Copyright 2016 Xilinx

Summary   Resource utilization can be reduced using allocation and binding controls Arbitrary precision data types help controlling both the area and resource utilization The design structure can be controlled by –  Inlining functions: direct impact on RTL hierarchy & optimization possibilities –  Loops: direct impact on reuse of resources –  Arrays: direct impact on the RAM

Major area optimization techniques –  Minimize bit widths –  Map smaller arrays into larger arrays •  Make better use of existing RAMs

–  Control loop hierarchy –  Control function call hierarchy –  Control the number of operators and cores

Improving Area and Resources 21- 22

© Copyright 2016 Xilinx

Using  Vivado  HLS   This material exempt per Department of Commerce license exception TSU

Objectives   After completing this module, you will be able to: –  List the various OS under which Vivado HLS is supported –  Describe how projects are created and maintained in Vivado HLS –  State the various steps involved in using the Vivado HLS project creation wizard –  Distinguish between the role of the top-level module in the testbench and the design to be synthesized –  List the various verifications which can be done in Vivado HLS –  List the Vivado HLS project directory structure

Using Vivado HLS 12 - 2

© Copyright 2016 Xilinx

Outline   Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary

Using Vivado HLS 12 - 3

© Copyright 2016 Xilinx

Vivado HLS OS Support   Vivado HLS is supported on both Linux and Windows The Vivado HLS tool is available under two licenses –  HLS license •  The HLS license comes with Vivado System Edition •  Supports all 7 series devices including Zynq® All Programmable SoC •  Does not support Virtex®-6 and earlier devices –  Use an older version of Vivado HLS for Virtex-6 and earlier

Operating System Windows

Using Vivado HLS 12 - 4

© Copyright 2016 Xilinx

Version Windows 10 Professional (64-bit) Windows 8.1 Professional (64-bit) Windows 7 SP1 Professional (64-bit)

Red Hat Linux

RHEL Enterprise Linux 5.11, 6.7-6.8 (64-bit) RHEL Enterprise Linux 7.1 and 7.2 (64-bit)

SUSE

SUSE Linux Enterprise 11.4 and 12.1 (64-bit)

CentOS

CentOS 6.8 (64-bit)

Ubuntu

Ubuntu Linux 16.04 LTS (64-bit)

Invoke  Vivado  HLS  from  Windows  Menu  

The first step is to open or create a project

12- 5 Using Vivado HLS 12 - 5

© Copyright 2016 Xilinx

Vivado  HLS  GUI  

Information Pane

Auxiliary Pane

Project Explorer Pane

Console Pane

12- 6 Using Vivado HLS 12 - 6

© Copyright 2016 Xilinx

Outline   Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary

Using Vivado HLS 12 - 7

© Copyright 2016 Xilinx

Vivado HLS Projects and Solutions   Vivado HLS is project based –  A project specifies the source code which will be synthesized

Source

–  Each project is based on one set of source code –  Each project has a user specified name

A project can contain multiple solutions –  Solutions are different implementations of the same code –  Auto-named solution1, solution2, etc. –  Supports user specified names

Project Level

Solution Level

–  Solutions can have different clock frequencies, target technologies, synthesis directives

Projects and solutions are stored in a hierarchical directory structure –  Top-level is the project directory –  The disk directory structure is identical to the structure shown in the GUI project explorer (except for source code location) 12- 8 Using Vivado HLS 12 - 8

© Copyright 2016 Xilinx

Vivado  HLS  Step  1:  Create  or  Open  a  project   Start a new project –  The GUI will start the project wizard to guide you through all the steps

Optionally use the Toolbar Button to Open New Project

Open an existing project –  All results, reports and directives are automatically saved/remembered –  Use “Recent Project” menu for quick access 12- 9 Using Vivado HLS 12 - 9

© Copyright 2016 Xilinx

Project  Wizard   The Project Wizard guides users through the steps of opening a new project Step-by-step guide …

Define project and directory

Add design source files

Specify test bench files

1st Solution Information

Project Level Information Using Vivado HLS 12 - 10

Specify clock and select part

© Copyright 2016 Xilinx

Define  Project  &  Directory   Define the project name −  Note, here the project is given the extension .prj −  A useful way of seeing it’s a project (and not just another directory) when browsing

Browse to the location of the project –  In this example, project directory “matrixmul.prj” will be created inside directory “lab1”

Using Vivado HLS 12 - 11

© Copyright 2016 Xilinx

Add  Design  Source  Files   Add Design Source Files −  This allows Vivado HLS to determine the top-level design for synthesis, from the test bench and associated files −  Not required for SystemC designs

Add Files… –  Select the source code file(s) –  The CTRL and SHIFT keys can be used to add multiple files –  No need to include headers (.h) if they reside in the same directory

Select File and Edit CFLAGS… −  If required, specify C compile arguments using the “Edit CFLAGS…” −  Define macros: -DVERSION1 −  Location of any (header) files not in the same directory as the source: -I../include Using Vivado HLS 12 - 12

© Copyright 2016 Xilinx

There is no need to add the location of standard Vivado HLS or SystemC header files or header files located in the same project location

Specify  Test  Bench  Files   Use “Add Files” to include the test bench –  Vivado HLS will re-use these to verify the RTL using cosimulation

And all files referenced by the test bench –  The RTL simulation will be executed in a different directory (Ensures the original results are not overwritten) –  Vivado HLS needs to also copy any files accessed by the test bench • 

E.g. Input data and output results

Add Folders –  If the test bench uses relative paths like “sub_directory/my_file.dat” you can add “sub_directory” as a folder/directory

Use “Edit CFLAGS…” –  To add any C compile flags required for compilation Using Vivado HLS 12 - 13

© Copyright 2016 Xilinx

Test benches I   The test bench should be in a separate file, or excluded from synthesis –  The macro __SYNTHESIS__ can be used to isolate code which will not be synthesized •  This macro is defined when Vivado HLS parses any code (-D__SYNTHESIS__); a guarded test function is sketched below

[Project directory structure: the top-level project directory contains one directory per solution (there can be multiple solutions per project, each a different implementation of the same source code); each solutionN directory contains impl, syn, sim and sysgen sub-directories, and the exported IP lives under impl/ip.]

Importing the exported RTL   In Vivado: 1.  Open the IP Catalog 2.  Add IP to import this block 3.  Browse to the zip file inside “ip”   In System Generator: 1.  Use XilinxBlockAdd 2.  Select Vivado_HLS block type 3.  Browse to the solution directory

Using Vivado HLS 12 - 32

© Copyright 2016 Xilinx
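A minimal sketch of the __SYNTHESIS__ guard mentioned above; the body of test() is an assumption, since the original test.c listing is incomplete here:

#include <stdio.h>

void test(int d[10]) {
    int acc = 0;
    int i;
    for (i = 0; i < 10; i++) {
        acc += d[i];
#ifndef __SYNTHESIS__
        /* C-simulation-only debug print: removed when Vivado HLS synthesizes the code */
        printf("iteration %d, acc = %d\n", i, acc);
#endif
    }
    d[0] = acc;   /* arbitrary use of the result for this sketch */
}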

RTL Export for Implementation   Click on Export RTL –  Export RTL Dialog opens

Select the desired output format

Optionally, configure the output Select the desired language Optionally, click on Vivado RTL Synthesis and Place and Route options for invoking implementation tools from within Vivado HLS Click OK to start the implementation

Using Vivado HLS 12 - 33

© Copyright 2016 Xilinx

RTL Export (Place and Route Option) Results   Impl directory created –  Will contain a sub-directory for each RTL which is synthesized

Report –  A report is created and opened automatically

Using Vivado HLS 12 - 34

© Copyright 2016 Xilinx

RTL Export Results (Place and Route Option Unchecked)   Impl directory created –  Will contain a sub-directory for both VHDL and Verilog along with the ip directory

No report will be created Observe the console –  No packing, routing phases

Using Vivado HLS 12 - 35

© Copyright 2016 Xilinx

Outline   Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary

Using Vivado HLS 12 - 36

© Copyright 2016 Xilinx

Analysis Perspective   Perspective for design analysis –  Allows interactive analysis

Using Vivado HLS 12 - 37

© Copyright 2016 Xilinx

Performance  Analysis  

Using Vivado HLS 12 - 38

© Copyright 2016 Xilinx

Resources  Analysis  

Using Vivado HLS 12 - 39

© Copyright 2016 Xilinx

Outline   Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary

Using Vivado HLS 12 - 40

© Copyright 2016 Xilinx

Command  Line  Interface:  Batch  Mode   Vivado HLS can also be run in batch mode –  Opening the Command Line Interface (CLI) will give a shell

–  Supports the commands required to run Vivado HLS & pre-synthesis verification (gcc, g++, apcc, make)

12- 41 Using Vivado HLS 12 - 41

© Copyright 2016 Xilinx

Using  Vivado  HLS  CLI   Invoke Vivado HLS in interactive mode –  Type Tcl commands one at a time

> vivado_hls –i

Execute Vivado HLS using a Tcl batch file –  Allows multiple runs to be scripted and automated

> vivado_hls –f run_aesl.tcl

Open an existing project in the GUI –  For analysis, further work or to modify it

> vivado_hls –p my.prj

Use the shell to launch Vivado HLS GUI > vivado_hls

12- 42 Using Vivado HLS 12 - 42

© Copyright 2016 Xilinx

Using Tcl Commands   When the project is created –  All Tcl commands to run the project are created in script.tcl •  User specified directives are placed in directives.tcl

–  Use this as a template for creating Tcl scripts •  Uncomment the commands before running the Tcl script

Using Vivado HLS 12 - 43

© Copyright 2016 Xilinx

Help   Help is always available – The Help Menu – Opens User Guide, Reference Guide and Man Pages

In interactive mode – The help command lists the man page for all commands Vivado_hls> help add_files

Auto-Complete all commands using the tab key

SYNOPSIS add_files [OPTIONS] Etc…

Using Vivado HLS 12 - 44

© Copyright 2016 Xilinx

Outline   Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary

Using Vivado HLS 12 - 45

© Copyright 2016 Xilinx

Summary   Vivado HLS can be run under Windows (7/8.1/10), Red Hat Linux, SUSE, CentOS, and Ubuntu Vivado HLS can be invoked through the GUI and command line on Windows, and through the command line on Linux The Vivado HLS project creation wizard involves –  Defining project name and location –  Adding design files –  Specifying testbench files –  Selecting clock and technology

The top-level module in testbench is main() whereas top-level module in the design is the function to be synthesized

12- 46 Using Vivado HLS 12 - 46

© Copyright 2016 Xilinx

Summary   Vivado HLS project directory consists of –  *.prj project file –  Multiple solutions directories –  Each solution directory may contain •  impl, synth, and sim directories •  The impl directory consists of ip, verilog, and vhdl folders •  The synth directory consists of reports, vhdl, and verilog folders •  The sim directory consists of testbench and simulation files

Using Vivado HLS 12 - 47

© Copyright 2016 Xilinx