FPGA2, or Hardware/Software Codesign
Bertrand Granado - Andrea Pinna
LIP6 / UPMC
Email: [email protected] [email protected]
Fall 2017
Bertrand Granado - Andrea Pinna (LIP6 / UPMC) — FPGA2, or Hardware/Software Codesign — Fall 2017
Outline
1. Embedded systems: introduction, definition
2. Designing embedded systems-on-chip with software parts and hardware parts (co-design): specification, implementation
3. Platforms for embedded computing
4. Application profiling: time, power consumption
5. Interlude: convolutional neural networks (CNNs)
6. Heuristics
7. Optimization algorithms
8. The Pareto front
9. Amdahl's law: speedup and efficiency of parallelism
10. Multi-criteria optimization
11. HLS
Embedded Systems: Introduction
Embedded Systems

Figure: Wireless sensor network
Figure: Video capsule endoscope
Embedded Systems

Figure: Mobile telephony (1973)
Figure: My new friend
Embedded Systems

Figure: ABS braking system
Figure: Health monitoring (copyright Sagem)
Embedded Systems: Definition
Embedded System

Embedded systems are reactive systems: "A reactive system is one which is in continual interaction with its environment and executes at a pace determined by that environment" [Bergé, 1995]. Their behavior depends on the inputs at the current instant.
Embedded System: a first attempt at a definition

An autonomous system that interacts with its environment. An embedded system must be efficient; it is governed by constraints:
- Energy → low power consumption
- Code size → limited memory resources
- Time → real-time constraints
- Area → limited space
- Cost → integration into consumer devices
- Specificity → dedicated to particular applications

Its behavior is known at design time, which helps minimize resources and maximize robustness. It has a dedicated user interface (not necessarily a mouse, keyboard, screen...).
Embedded Systems: the energy constraint

Dissipated power:
Pdis = Psta + Pdyn
Psta = Ioff × VDD
Pdyn = Fc × CL × VDD²

Low power:
- Psta: technological optimization
- Pdyn: technological and architectural optimization
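A minimal numerical sketch of the two power formulas above; the component values (leakage current, supply voltage, clock frequency, switched capacitance) are illustrative assumptions, not measurements.

```python
# Minimal sketch of the power model above.
# All numeric values below are illustrative assumptions.

def static_power(i_off, vdd):
    """Psta = Ioff * VDD (leakage current times supply voltage)."""
    return i_off * vdd

def dynamic_power(fc, cl, vdd):
    """Pdyn = Fc * CL * VDD^2 (clock frequency, switched capacitance)."""
    return fc * cl * vdd ** 2

# Example: 1 mA leakage, 1.0 V supply, 100 MHz clock, 1 nF switched capacitance.
psta = static_power(1e-3, 1.0)          # 0.001 W
pdyn = dynamic_power(100e6, 1e-9, 1.0)  # 0.1 W
pdis = psta + pdyn
```

The quadratic VDD term is why lowering the supply voltage (e.g. via DVFS) is the main architectural lever on Pdyn: halving VDD in this sketch divides the dynamic power by four.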
Embedded Systems: code size

- Limited memory size: imposed by the environment
- Bounded memory size: impossible to increase it
- Need to optimize memory usage: no surplus
- Use of compiler optimization options (gcc -O2, -O3), though without guarantees
Embedded Systems: time

An embedded system must often meet real-time constraints. A real-time system must react to a stimulus within a time interval determined by the environment. There are two kinds of real time:

Hard real time (latency): a real-time system that produces a correct answer, but too late, is wrong. The response of a hard real-time system cannot be statistical; it is bounded by the worst case (WCET: Worst Case Execution Time). "A real-time constraint is called hard, if not meeting that constraint could result in a catastrophe" [Kopetz, 1997].

Soft real time (throughput)
Embedded Systems: time

Hard real time — Figure: aircraft control
Soft real time — Figure: digital television
Embedded Systems: area

Form factor
Figure: Spy camera
Figure: Cochlear implant
Embedded Systems: cost

Where did my euros go?
Figure: Connected objects
Figure: A huge market!
Embedded Systems: specificity

Depends on the application domain:
- Aeronautics and aerospace
- Automotive
- Biomedical
- E-health
- Robotics
- Sensor networks
- ...
Embedded System: a second attempt at a definition

Not every embedded system has all of these characteristics.
Definition: an information-processing system exhibiting most of these characteristics is called an embedded system.
Designing embedded systems-on-chip with software parts and hardware parts (Co-design): Specification
Embedded systems-on-chip: specification

Observation: a human cannot grasp a system containing more than about 5 to 10 objects at once, yet most embedded systems manipulate far more. Hence the need for:
- Readability
- Portability and flexibility
Designing embedded systems-on-chip with software parts and hardware parts (Co-design): Implementation
Embedded Systems: implementation

- Choose an electronic architecture to implement the embedded system
- Use a methodology to: deploy; choose the architecture; explore the space of architectural solutions; be able to define it!

→ Implementation of a system-on-chip
Embedded Systems: implementation

Figure: A system-on-chip, or SoC (System on Chip) (Cours de )
Embedded Systems: choosing an architecture

Choose a hardware architecture composed of:
- General-purpose programmable blocks: CPU, GPU, DSP
- Specialized or dedicated blocks: FPGA, ASIC
- Communication buses

SoC = these resources coexisting on a single circuit, considered globally for the software/hardware implementation
Embedded Systems: choosing an architecture

ASIC — Figure: ASIC wafer
FPGA — Figure: FPGA
CPU — Figure: AMD K10
Embedded Systems: choosing an architecture

Figure: Performance vs. flexibility comparison
Platforms for embedded computing
Arduino Yun
Artik10
Raspberry Pi
Zynq UltraScale+
Cyclone
Embedded Systems: design methodology

A procedure for designing a system. Understanding a methodology helps guarantee that the design is sound.

Design flow: compilers, software development tools, computer-aided design (CAD) tools, etc., that:
- help automate the steps of the methodology;
- keep track of how the methodology is applied (version control, reports, faster iterations).
Embedded Systems: design methodology

Goals — satisfy:
- Performance: overall speed, deadlines
- Functionality and user interface
- Manufacturing cost
- Power consumption
- Miscellaneous requirements (physical size, ...)
Embedded Systems: design methodology

Method: a problem-solving technique characterized by a set of well-defined rules leading to a correct solution.

Methodology: a structured, coherent set of models, methods, guidelines and tools from which one can derive how to solve a problem.

Model: a representation of a partial, coherent aspect of the real "world"; it precedes any decision or formulation of an opinion, and it is built to answer the question that drives the development of a system.
Embedded Systems: design methodology

A bit of history:
- 1970s-80s: full-custom — schematics, mask layout, electrical simulation
- 1980s-90s: standard cells, FPGA — reuse of elementary blocks, modeling, simulation
- 2000s onward: SoC — hardware and software reuse, co-design, verification
Embedded Systems: the notion of IP

Speeding up system-on-chip design:
- Reuse blocks already designed in-house;
- Use macro-cell generators (RAMs, multipliers, ...);
- Buy blocks designed outside the company.
Embedded Systems: the notion of IP

Reusable complex functional blocks:
- Hardware: already implemented, technology-dependent, highly optimized
- Software: in a high-level language (VHDL, Verilog, C++, ...), parameterizable
- Standardized interfaces (e.g. OCP)
- Development environment (co-design, co-specification, co-verification)
- Average performance (little optimized)
Embedded Systems: using IPs

To use a reusable block (IP), one must:
- know its functionality
- estimate its performance within a system
- be certain the IP works correctly
- integrate the IP into the system
- validate the system
Embedded Systems: using IPs

Design flow: system design → RTL design → synthesis → floorplanning → place & route → verification

                 Soft IP                  Firm IP                       Hard IP
Representation   Behavioral, RTL          RTL, blocks, netlist          Regular polygons
Libraries        -                        Reference (timing, layout)    Process-specific, design rules
Technology       Technology-independent   Generic technology            Fixed technology
Portability      Unlimited                Portable across libraries     Process-dependent
Codesign

The heart of the problem [definition to be added]
Definition: the design of macro- or micro-systems that integrate both software parts (running on processors or DSPs) and IPs (implemented on FPGAs or ASICs). Joint design of the software and hardware components; unification of the usually separate software and hardware design flows.
Definition: a design methodology supporting cooperative and concurrent development of the hardware and software parts (co-specification, co-development and co-verification) in order to share functionality and achieve the expected performance. [R. Gupta and G. De Micheli, "Hardware-Software Cosynthesis for Digital Systems", IEEE Design and Test of Computers, 1993, pp. 29-41]
Toy example: measuring the speed of a wheel

Constraints: area 40 units, time 100 cycles.

Possible implementations:
- Processors
- Specialized hardware
- A combination of processor and specialized hardware
Toy example: measuring the speed of a wheel

Software implementation on processors:
- Area: 48 units > 40 units
- Time: 132 cycles > 100 cycles
- Development: 2 months
Toy example: measuring the speed of a wheel

Hardware implementation on ASICs or FPGAs:
- Area: 24 units < 40 units
- Time: 54 cycles < 100 cycles
- About 40% better than the time and area constraints require
- Development: 9 months (too long a delay in a hyper-competitive market)
Toy example: measuring the speed of a wheel

Mixed implementation, software on processors plus hardware on ASICs or FPGAs:
- Area: 37 units < 40 units
- Time: 97 cycles < 100 cycles
- Development: 3.5 months
- Not as efficient as the pure hardware implementation, but it satisfies the constraints: a good trade-off.
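The three design points of the toy example can be checked mechanically against the constraints; the figures below are taken from the slides.

```python
# Checking the three implementations of the toy example against the
# constraints (area <= 40 units, time <= 100 cycles).
CONSTRAINTS = {"area": 40, "time": 100}

implementations = {
    "software only":     {"area": 48, "time": 132, "months": 2},
    "hardware only":     {"area": 24, "time": 54,  "months": 9},
    "hardware/software": {"area": 37, "time": 97,  "months": 3.5},
}

def feasible(impl):
    """True when the design point meets both area and time constraints."""
    return (impl["area"] <= CONSTRAINTS["area"]
            and impl["time"] <= CONSTRAINTS["time"])

for name, impl in implementations.items():
    print(name, "feasible" if feasible(impl) else "infeasible")
# Software-only is infeasible; the other two are feasible, and the
# hardware/software mix is the fastest feasible option to develop.
```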
Motivation for codesign

- Reach the expected performance by moving bottlenecks from software to hardware.
- Use hardware to satisfy time and area constraints that a general-purpose processor cannot meet.
- In a static hardware implementation, limited resources make it impossible to put everything in hardware; in a dynamically reconfigurable implementation, this claim must be reconsidered.
- Some parts of an application are better suited to sequential processing (control, for example) on a general-purpose processor.
- Today many systems are embedded, which requires both software and hardware parts.
Motivation for codesign

System complexity and functionality grow at a rapid pace, and Systems-on-Chip (SoC) are emerging. It is difficult, if not impossible, for ad-hoc systems to be designed, implemented and tested in an acceptable time, even with the most advanced standard CAD tools. The solution: exploit previously completed designs (IPs) and proven processors to shorten design time and increase reliability.
Motivation for codesign: the engineer productivity gap (ITRS)
Trade-offs and decisions

Starting from a set of specified constraints and mastered technologies, designers must find the trade-offs that make the software and hardware components work together.
Decisions, constraints and evaluation criteria: performance, area, power consumption, programmability, development cost, manufacturing cost, reliability, robustness, maintenance, evolution.
Co-design: research

Codesign research spans several fields of expertise:
- System specification and modeling
- Design exploration
- Partitioning
- Scheduling
- Co-verification and co-simulation
- Hardware and software code generation
- Hardware/software interfacing

The common objective is to develop a unified methodology for building systems that contain both hardware and software.
A simple approach
Application profiling: time, power consumption
Profiling and partitioning: benefits

- Speedups of 10x to 200x, with up to 800x possible
- Far more potential than dynamic software optimizations (processor-internal, loop unrolling, software pipelining, ...)
- Energy consumption reduced by 25 to 95%
Profiling

Profiling tells you where, in terms of code, a program spends its time, and which function calls which other functions during execution. Profiling works on data collected while the application runs, so it can be used to analyze programs too complex to study by reading the sources. The profile shows the pieces of code where the program is slower than expected; those pieces are good candidates for:
- an optimized rewrite
- a move to hardware
Profiling: how?

With gcc, first compile and link the program with profiling enabled:
gcc -o myprog.exe myprog.c utils.c -g -pg

Then run the program to collect the execution-profile data; just before exiting, the program writes the collected data to a file named gmon.out.

Afterwards, gprof can analyze the collected data:
gprof options myprog.exe gmon.out > outfile

gprof produces a flat profile and a call graph.
Profiling: good to know

Options:
-e function_name: tells gprof not to produce call-graph information for function_name (and its children...)
-f function_name: restricts the call-graph analysis to function_name and its children
-b: gprof omits the explanatory text describing the fields in the tables
Profiling: the flat profile

- % time: percentage of the total execution time the program spent in this function.
- cumulative seconds: cumulative CPU time spent in this function and the functions above it in the listing.
- self seconds: time in seconds spent in this function alone.
- calls: total number of times this function was called.
- self ms/call: average time in milliseconds per call, for this function alone.
- total ms/call: average time in milliseconds per call, for this function and its descendants.
- name: the function's name.
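The slides use the gcc/gprof flow; as a self-contained analogy, the same flat-profile idea (time per function, call counts) can be reproduced with Python's built-in cProfile. This is a stand-in for illustration, not the gprof flow itself; the function names are hypothetical.

```python
# gprof-style flat profiling, illustrated with Python's built-in cProfile.
import cProfile
import io
import pstats

def busy(n):
    """A deliberately slow function so it shows up in the profile."""
    return sum(i * i for i in range(n))

def main():
    return [busy(50_000) for _ in range(10)]

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Equivalent of gprof's flat profile: per-function times and call counts.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print(report)  # columns include ncalls, tottime, percall, cumtime
```

As with gprof, the functions dominating cumulative time are the candidates for an optimized rewrite or a move to hardware.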
Weaknesses of this first approach

- Some functions are not trivial to implement in hardware.
- Decisions taken too early in the flow may not be optimal.
- No consideration of communication and interfacing.
- If the application changes, profiling and then partitioning must be redone.
Codesign: a design workbench
Partitioning and scheduling

Task partitioning and scheduling are imperative in many applications, in system codesign, for multiprocessors and for reconfigurable systems. The tasks identified in the initial description of the application must be implemented:
- in the right place (partitioning)
- at the right time (scheduling)

These well-known problems, partitioning and scheduling, are NP-complete. Heuristic-based optimization techniques are therefore generally used to explore the solution space, where near-optimal solutions can be found.
Partitioning and scheduling

What to optimize during partitioning:
- Minimize communication over a bus
- Extract maximum parallelism → run the hardware (FPGA/ASIC) and the software (processor) simultaneously
- Extract maximum performance from the processor
Heuristics
Fiduccia-Mattheyses

Task graph and cost function. Successive moves of the heuristic on the example graph:
- Initial partition: number of cut edges = 5, cost = 8
- After one move: number of cut edges = 3, cost = 0
- After another move: number of cut edges = 2, cost = -4
Optimization algorithms
Simulated annealing

Simulated Annealing (Kirkpatrick, 1983):
- Inspired by statistical physics and the cooling of metals
- Allows moves that degrade the solution, with a probability that depends on a temperature:
  Paccept = exp(−δE / T)
- If the energy decreases, the system accepts the perturbation
- If the energy increases, the system accepts the perturbation with probability Paccept
Algorithm 1: Simulated annealing

Select an initial solution s
Select an initial temperature T > 0
while the stopping condition is not met do
    Select s' ∈ N(s) at random
    Compute δ = f(s') − f(s)
    if δ < 0 then
        s = s'
    else
        x = random([0,1])
        if x < exp(−δ/T) then
            s = s'
        end if
    end if
    Update the temperature T
end while
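A runnable sketch of Algorithm 1. The cost function, the neighborhood (flip one bit of a binary assignment) and the geometric cooling schedule are illustrative assumptions, not prescribed by the slide.

```python
import math
import random

def simulated_annealing(f, s, neighbor, t0=10.0, cooling=0.95, steps=2000, seed=0):
    """Algorithm 1 from the slide: accept worse moves with prob. exp(-delta/T)."""
    rng = random.Random(seed)
    t = t0
    best = s
    for _ in range(steps):
        s2 = neighbor(s, rng)
        delta = f(s2) - f(s)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            s = s2
            if f(s) < f(best):
                best = s
        t *= cooling  # geometric cooling schedule (an assumption)
    return best

# Toy cost: minimize the number of 1-bits in a binary assignment.
def cost(s):
    return sum(s)

def flip_one(s, rng):
    i = rng.randrange(len(s))
    return s[:i] + (1 - s[i],) + s[i + 1:]

start = (1,) * 12
best = simulated_annealing(cost, start, flip_one)
print(cost(best))  # typically reaches 0 once the temperature is low
```

At high temperature the loop accepts most moves (exploration); as T decays it degenerates into greedy descent (exploitation), which is exactly the escape-then-settle behavior the slide describes.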
Greedy algorithms
SynDEx
Genetic algorithms
Vulcan
Gupta and De Micheli, Stanford University. A primal approach:
1. Initially there are only hardware IPs.
2. Iteratively, some IPs become software to reduce cost.

Uses a hardware specification language, HardwareC, compiled into a data-flow graph.
Vulcan
Definition of the data-flow graph, a variant of a task graph:
- Nodes represent operations, typically low-level operations such as addition, multiplication, ...
- Arcs represent data dependencies. Each arc carries a boolean value representing the condition for moving from one node to the next.
Vulcan
The data-flow graph:
- is executed periodically
- may carry timing constraints on each node:
  T(vj) ≥ T(vi) + l_ij  and  T(vj) ≤ T(vi) + u_ij
- may carry rate constraints on each node: m_i ≤ R_i ≤ M_i
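A minimal checker for the two timing constraints above; the toy graph, bounds and start times are illustrative assumptions, not part of Vulcan.

```python
# Checking Vulcan-style timing constraints on node start times:
# T(vj) >= T(vi) + l_ij and T(vj) <= T(vi) + u_ij for each arc (vi, vj).

def check_timing(T, arcs):
    """arcs: list of (vi, vj, l_ij, u_ij); T: dict node -> start time."""
    for vi, vj, lo, hi in arcs:
        if not (T[vi] + lo <= T[vj] <= T[vi] + hi):
            return False
    return True

# Toy graph: a -> b -> c with min/max separations on each arc.
arcs = [("a", "b", 2, 5), ("b", "c", 1, 4)]
print(check_timing({"a": 0, "b": 3, "c": 5}, arcs))  # True
print(check_timing({"a": 0, "b": 6, "c": 7}, arcs))  # False: b starts too late
```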
Vulcan
Co-synthesis algorithm in Vulcan:
- The unit of partitioning is the thread
- The algorithm cuts the data-flow graph into threads and allocates them to resources
- Thread boundaries are determined:
  - always by a non-deterministic delay element, such as an event on an external variable;
  - sometimes by other points of the data-flow graph.

Target architecture: a processor plus hardware accelerators
Cosyma
- Unified representation: ES graph (a CDFG)
- Partitioning: a combined method, user-guided partitioning driven by a cost function, refined by a simulated-annealing algorithm
- Scheduling: no specific method
- Modeling: models written in C++
- Validation: simulation based on executables written in C++
- Main emphasis on partitioning toward hardware accelerators
Cosyma

Developed at the Technical University of Braunschweig, Germany. An experimental system for the co-design of small real-time embedded systems:
- Implement as many operations as possible in software, on a processor.
- Generate a hardware accelerator only when a timing constraint is violated.

Target architecture: a RISC processor plus hardware accelerators.

Communication between the hardware IPs and the software IPs goes through a shared memory with a CSP-style sequential communication protocol (CSP: Communicating Sequential Processes).
Optimization algorithms: Cosyma

The system is described in the C* language. This description is translated into Cosyma's internal graph representation, which enables:
- the partitioning;
- the generation of the hardware accelerators when software migrates to hardware.

The internal representation combines a control graph and a data-flow graph into an Extended Syntax (ES) graph: syntax graph, symbol table, local data/control dependencies.
Optimization algorithms: Cosyma

Cost change when moving a basic block b into hardware:

Δc(b) = w · (t_HW(b) − t_SW(b) + t_COM(Z) − t_COM(Z ∪ {b})) · It(b)

with:
- w a fixed weight
- t_HW the hardware time
- t_SW the software time
- t_COM the communication time
- It the number of iterations
- Z the current set of hardware blocks
Optimization algorithms: Cosyma

The communication time after moving b is updated as:

t_COM(Z ∪ {b}) = t_COM(Z) − ( Σ_{a ∈ Z} C_{a,b} − Σ_{d ∉ Z} C_{d,b} ) · t_TRANS

with C_{x,y} the amount of data exchanged between blocks x and y, and t_TRANS the transfer time of one data item.
Vilfredo Pareto, Italian economist, 1848-1923
- Studied the distribution of income: "Écrits sur la courbe de la répartition des richesses", edited by Giovanni Busino, Librairie Droz, Genève, 1965.
  http://books.google.fr/books?hl=fr&lr=&id=CP4a4VSJO0QC&oi=fnd&pg=PA1&dq=authornbsp:pareto&ots=sgU2aq9axn&sig=oeFCRb5Kc71y0JjSmeYfcXNym24#v=onepage&q&f=false
- Figure: incomes of the English population (Griffin)

Pareto's observation (figure)
Pareto's validation (figure)
Pareto principle
- Consequence of Pareto's observation: 20% of a country's population owns 80% of its wealth.
- Joseph Juran, 1904 (Romania) - 2008 (USA), inventor of quality management: 20% of the causes produce 80% of the effects.
  - 20% of causes -> 80% of production defects
  - 20% of customers -> 80% of revenue
  - 20% of customers -> 80% of complaints
- Joseph Juran, "Universals in Management, Planning and Controlling", The Management Control, 1954
Pareto diagram (Joseph Juran)
- Histogram of the causes, sorted in decreasing order.
- Distinguishes:
  - the 20% most important causes, which produce 80% of the effects;
  - the secondary causes, which produce the remaining effects.
B. Pareto efficiency

Theoretical economics: allocation
- Allocation of goods: distribution of goods and services among agents; distribution of production factors among industrial agents; distribution of income among agents.

Pareto efficiency
- Given an allocation of goods, a Pareto improvement is an allocation that gives more goods to at least one person without reducing the goods given to the others.
- A Pareto-efficient allocation, or Pareto optimum, is an allocation that admits no Pareto improvement.

Fundamental theorems of welfare economics
- TH1: in a perfect economy, every equilibrium state is a Pareto optimum.
- TH2: in a perfect economy, for every Pareto optimum there exists an initial allocation whose equilibrium state is that optimum.
- Note: a Pareto optimum is efficient, but it can be inegalitarian.
Example
- Allocation A: the production of weapons can be increased without decreasing the production of butter, so a Pareto improvement is possible.
- Allocations B, C, D: increasing one production forces the other to decrease, so no Pareto improvement is possible.
II. Design space exploration
Comparison metrics
- IP size: basic logic elements of the FPGA; percentage of FPGA resource usage.
- Processing speed: critical path; operating frequency.
- Energy: number of transistors; size of the transistors; supply voltage.
Design space exploration
- Design space exploration, or architecture exploration.
- Vary parameters of the design:
  - degree of parallelism
  - width of the data encoding
  - software/hardware partitioning
  - ...
Pareto front
Figure: latency (processing duration) versus size (number of cells) of the different designs; the Pareto front is the curve of the best trade-offs, made of the Pareto-optimal points.
Example of Pareto fronts
- Generation by genetic algorithms: on the left, the initial state; on the right, the Pareto fronts.
- "Pareto Front Generation for a Tradeoff between Area and Timing", M. Holzer and B. Knerr, Vienna University of Technology, Austrochip 2006, Vienna, Austria, copyright IEEE 2006.
Illustration of Amdahl's law: speedup as a function of the fraction of sequential code, for 100 PEs (figure)

Implications of Amdahl's law
- Take a program containing 10% of purely sequential code (τ_seq = 0.1) and a parallel machine containing 100 processors, hoped to accelerate the program by a factor of 100.
- Then its speedup will be below 10, whatever the number p of processors:

S_p = 1 / (τ_seq + (1 − τ_seq)/p) ≤ 1 / τ_seq = 1 / 0.1 = 10

and with p = 100: S_100 = 1 / (0.1 + 0.9/100) ≈ 9.2
B. Generalized Amdahl's law (Hennessy & Patterson)

A program contains:
- a part that can be accelerated;
- a part that cannot.

With t_av the time before, t_unch the unchanged part, t_amél the accelerable part and p the acceleration applied to it:

t_av = t_unch + t_amél
t_ap = t_unch + t_amél / p
τ_amél = t_amél / t_av

S_ap = t_av / t_ap = t_av / (t_av − t_amél + t_amél/p) = 1 / ((1 − τ_amél) + τ_amél/p)
Illustration of generalized Amdahl's law: crossing the Sierra, then a desert; only the desert leg can be accelerated.

Transport | Sierra time | Desert time | Total time | Desert speedup | Overall speedup
Foot      | 20 h        | 50 h        | 70 h       | 1              | 1
Bicycle   | 20 h        | 20 h        | 40 h       | 2.5            | 1.8
Ferrari   | 20 h        | 1.7 h       | 21.7 h     | 30             | 3.2
Applications of generalized Amdahl: memory hierarchy
- We have a cache 5 times faster than main memory.
- This cache is used 90% of the time, thanks to locality of reference.
- Speedup brought by the cache: 3.6

S_ap = 1 / ((1 − τ_amél) + τ_amél/p) = 1 / (1 − 0.9 + 0.9/5) = 1 / (0.1 + 0.18) ≈ 3.6
Applications of generalized Amdahl: code optimization
- A program spends 90% of its execution time in 10% of the code.
- These 10% are accelerated by a factor of 3 by optimizing the source code.
- The complete program is accelerated by a factor of 2.5:

S_ap = 1 / ((1 − τ_amél) + τ_amél/p) = 1 / (1 − 0.9 + 0.9/3) = 1 / (0.1 + 0.3) = 2.5
Applications of generalized Amdahl: codesign
- A program spends 80% of its execution time in 20% of the code.
- These 20% are accelerated by a factor of 50.
- What is the speedup of the complete program?
The Pareto front: Pareto's observation

Vilfredo Pareto is an Italian economist (1848-1923) who studied the distribution of wealth in European cities. Book: "Écrits sur la courbe de la répartition des richesses".
The Pareto front: statement of the Pareto principle

Pareto's observation: 20% of a country's population owns 80% of its wealth. The principle was taken up in quality management, introduced by Joseph Juran (1904-2008):
- 20% of the causes produce 80% of the effects
- 20% of the causes produce 80% of the production defects
- 20% of the customers generate 80% of the revenue
- 20% of the customers generate 80% of the complaints
Amdahl's law: Gain and efficiency of parallelism
Amdahl's law: Gain and efficiency of parallelism

Parallelism: gain and efficiency

The gain expresses the acceleration; it is equal to

G = Time_sequential / Time_parallel

The efficiency expresses the effective use of the available resources; it is equal to

E = G / Number_of_resources
Parallelism: Amdahl's law

Question: with 100 processors, will I go 100 times faster than with a single one?
Answer: no, there is always a sequential part of the program that cannot be parallelized.

Example
- a program of 20 instructions, each lasting 1 cycle
- 30% of the instructions are sequential
- duration of the program on 1 processor: 20 cycles
- ideal duration on 20 processors: 1 cycle
- actual duration on 20 processors: 7 cycles
- G = 20/7 = 2.85
- E = 2.85/20 = 0.142
Amdahl's law: Gain and efficiency of parallelism

Parallelism: Amdahl's law

Acceleration:

Acc = 1 / ((1 − P) + P/N)

with P the fraction of parallel code and N the number of processors.

Derivation: normalize the total time to T = S + P = 1, where S = 1 − P is the sequential part and P the parallel part. Then:

T(N) = S + P/N

G = T / T(N) = (S + P) / (S + P/N) = ((1 − P) + P) / ((1 − P) + P/N) = 1 / ((1 − P) + P/N)
HLS

Slides by Syed Zahid Ahmed; slides by Xilinx.
What is ESL?
• ESL (Electronic System Level): http://en.wikipedia.org/wiki/Electronic_system-level_design_and_verification
  A design and verification methodology for system design at a higher abstraction level. The term was coined by Gartner Dataquest for industrial market analysis.
• HLS (High Level Synthesis): http://en.wikipedia.org/wiki/High-level_synthesis
  Hardware/system design at a higher abstraction level. HLS is, in essence, the more scientific/technical name for the ESL tools that mostly (but not only) synthesize standard C/C++/SystemC into RTL (VHDL/Verilog).
• Some state-of-the-art C/C++/SystemC-to-RTL tools:
  – Synopsys: Synphony (former Synfora's Pico)
  – Cadence: C-to-Silicon
  – Calypto: Catapult C (former Mentor's Catapult C)
  – Xilinx: VivadoHLS (former AutoESL's AutoPilot)
• Academic research
  – Some current examples include LegUp of Toronto Univ., etc.
  – Former examples include AutoPilot of UCLA (origin of AutoESL!), etc.
Syed Zahid AHMED
12/01/2017
3
HLS tools: historical overview
A nice industrial survey on the topic: Grant Martin, Gary Smith, "High-Level Synthesis: Past, Present and Future", IEEE Design and Test of Computers, July/Aug. 2009. http://cas.et.tudelft.nl/education/courses/et4054/2009_PAPER_High-Level_Synthesis_Past,_Present,_and_Future.pdf
Issues of custom languages; the wrong people were targeted (HLS -> gates in the early tools)...
FPGAs vs coarse-grain architectures (a special form of HLS): the missing-umbrella story
- Figure: many custom C dialects (myC, newC, lovelyC, almostC, greatC) alongside ANSI C/C++ and RTL under the ESL* umbrella, riding Moore's law (umbrella picture: www.patioumbrellas.com). It is difficult to enter/win in industry like this.
- Coarse-grain architectures suffered a solo adventure in the desert: a simultaneous war on 3 frontiers (new hardware, new language, no IP).
- FPGAs, in contrast, have enjoyed a nice party scenario.
- The diverse coarse-grain solutions have remained difficult for industry to adopt widely: no or partial reuse of designs, scarce IP leverage and non-standard programming make them a risky investment for companies compared to FPGAs.

Survey of new trends in industry for programmable hardware: FPGAs, MPPAs, MPSoCs, Structured ASICs, eFPGAs and the new wave of innovation in FPGAs. Syed Zahid Ahmed, Gilles Sassatelli, Lionel Torres, Laurent Rougé. FPL-2010 (http://conferenze.dei.polimi.it/FPL2010/presentations/W1_B_1.pdf)
Modern ESL tools • Standard C/C++/SystemC input • Constraints infrastructure to optimize implementations • Leveraging the mature RTL design flow and tools
Outline • ESL – Fundamentals & Historical overview – Current ESL tools/vendors
• Xilinx Vivado tools suite Basics – ISE vs Vivado
• Xilinx VivadoHLS – Fundamentals and flow overview – Strengths & Limitations – Code examples & Xilinx Videos
• Image Processing IPs ESL exploration work in SagemCom Project – ESL vs Hand coded IP
Xilinx Vivado vs ISE
- ISE: all FPGAs up to 7-Series and Zynq
- Vivado: 7-Series/Zynq and future devices
• New tools suite of Xilinx of IP-Centric Design environment for 7-series/Zynq and future devices with Built-in ESL tool VivadoHLS • New algorithms for faster Place and Route, improved design flow vs ISE • New SDC based constraints system (.ucf is obsolete), improved facilities for timing/power analysis… http://www.xilinx.com/support/documentation/white_papers/wp416-Vivado-Design-Suite.pdf
Multiple Xilinx video tutorials for Vivado are available at http://www.xilinx.com/training/vivado/index.htm and http://www.youtube.com/playlist?list=PL35626FEF3D5CB8F2&feature=plcp
ETIS project: SagemCom SuperSoC
Figure: block diagram of SuperSOC_v1 on an Arria II GX125 FPGA. A Nios system with 128 KB of on-chip SRAM, the Altera performance counter, a PLL, a 1 GB DDR2 memory behind its controller on the Avalon bus, and the image-processing IPs: GrayScale, Binarize, BinarizeCount, SelectiveTrc, ImgColor, Wiener, FindEdges, Hough, NeuralGas, Bitonal.
SuperSoC (hardware statistics). The per-IP figures (ImageColor, Wiener, Binarize, BinarizeCount, Grayscale, SelectiveTrc, FindEdges, Hough, NeuralGas, Bitonal) are CONFIDENTIAL; the remaining rows are:

                        ALUTs    FFs      Mem. bits   DSP   Freq. (MHz)
NIOS system             769      549      2,107,392   0     50
DDR2 contr.             3,287    3,001    74,816      0     100
Others (bus, buf...)    11,129   21,972   162,304     0     NA
TOTAL                   40,443   43,399   4,350,528   276   NA
% of device             41       44       65          48    NA
Arria II GX125          99,280   99,280   6,727,680   576   NA
Dynamic/static power (mW; *statistical values from the PowerPlay tool): mostly NA; the reported figures are 110.1, 328.4, 218.6, 565.9 and 1012.2 mW.
Figure: real-time power measurement, total vs. reset (~static).
Hand-coded RTL vs ESL for selected IPs

Complementary research work:
- exploration of the ESL tools of Xilinx
- two IPs* explored: RTL vs ESL
- evaluations on Virtex-6 & Zynq

Flow: C source -> (removal of mallocs..., AXI interfaces, constraints) -> Vivado HLS -> hardware IP, on a Zynq Z20 FPGA with the ARM Cortex-A9 processing system, a DDR3 controller, the AXI4 bus and an AXI timer.

Key findings:
- ESL is cool
- RTL vs ESL resources: same/similar
- ESL DMA not good in the current version
- the ARM Cortex is very powerful

Development time (man-weeks): IP-1 RTL: 15, IP-1 ESL: 3, IP-2 RTL: 11, IP-2 ESL: 1

* Names not mentioned to comply with Sagemcom agreements
Sagemcom project example: Design Space Exploration (DSE) for IP1

Rapid design space exploration with the quick synthesis of Vivado HLS. IP-1: VivadoHLS results for Zynq Z20 for a 256x256-pixel image:

Solution                                  LUTs    FFs     BRAMs  DSP  ClkPr (ns)  Latency (clk)  Speedup  Power*
S1: Raw transform                         20,276  18,293  16     59   8.75        8,523,620      NA       1410
S2: S1 with Shared Div                    9,158   7,727   16     51   8.75        8,555,668      1.00     1638
S3: S2 with DivToMult                     7,142   6,172   16     84   8.75        3,856,048      2.22     1337
S4: S3 with Buffer Memory Partitioning    4,694   4,773   16     81   8.75        2,438,828      3.51     953
S5: S4 with Shared 32/64bit Multipliers   4,778   4,375   16     32   8.75        3,111,560      2.75     890
S6: S5 with Smart Mult sharing            4,900   4,569   16     40   8.75        2,910,344      2.94     918
S7: S6 with Mult Lat. exp (comb. Mult)    4,838   4,080   16     40   13.56       1,975,952      4.33     893
S8: S6 with Mult Lat. exp (1clk Mult)     4,838   4,056   16     40   22.74       2,181,776      3.92     890
S9: S4 with SuperBurst                    4,025   4,357   80     79   8.75        NA             NA       NA
S10: S4 with Smart Burst                  3,979   4,314   24     80   8.75        NA             NA       NA
S11: S6 with Smart Burst                  4,224   4,079   24     39   8.75        NA             NA       NA

* Vivado HLS quick-synthesis power; it has no units

FPGA prototype of selected steps on the Zedboard (issues of timing closure and DMA efficiency). IP-1: Zynq Z20 FPGA board results for a 256x256-pixel test image:

Solution                                  LUTs   FFs    BRAMs  DSP  Fmax (MHz)  Fexp (MHz)  Latency (ms)  Speedup  Power** (mW)
ARM Cortex-A9 MPCore (Core-0 only)        NA     NA     NA     NA   NA          666.7       32.5          15.0     NA
MicroBlaze (with cache & int. divider)    2,027  1,479  6      3    NA          100.0       487.4         1.00     NA
Hand-coded IP (resources for ref.)        8,798  3,693  8      33   54.2        *50 (NA)    *4 (NA)       NA       57
S3: S2 with DivToMult                     6,721  6,449  8      72   40.2        40.0        142.4         3.42     12
S4: S3 with Buffer Mem. Partitioning      4,584  5,361  10     69   84.5        83.3        124.5         3.91     30
S6: S5 with Smart Mult sharing            5,156  4,870  10     32   91.3        90.9        136.9         3.56     26
S8: S6 with Mult Latency exp (1clk)       5,027  4,607  10     33   77.3        76.9        121.2         4.02     22
S9: S4 with SuperBurst                    4,504  5,008  42     70   77.6        76.9        26.5          18.39    NA
S10: S4 with Smart Burst                  4,485  4,981  14     73   103         100         22.48         21.68    62
S11: S6 with Smart Burst                  4,813  4,438  14     33   101         100         33.7          14.46    42

** statistical power from XPower
Work published in the Xilinx Xcell Journal

Published in Xilinx Xcell Journal, Issue 84, pages 34-41 (July 2013): "Vivado's ESL Capabilities Speeds IP design on Zynq SoC Project: Automated methodology delivers results similar to hand-coded RTL for two image processing IP cores", Syed Zahid Ahmed, Sébastien Fuhrmann, Bertrand Granado.
http://www.xilinx.com/publications/xcellonline/
www.xilinx.com/publications/archives/xcell/Xcell84.pdf
Conclusions
Potential for image/video processing projects:
- rapid design space exploration for HW accelerators: in days instead of months!
- wide range of optimization options using directives/constraints
- HW/SW co-design further simplified
- ultra-fast verification cycle: self-testing testbench and its re-use, co-simulation...
- experiments with the dual-core ARM using Zynq
- support for floating-point hardware
- automatic creation of SW drivers for the IP
References to get started
- The Zynq Book (2014): www.zynqbook.com
- Vivado, all documentation: http://www.xilinx.com/support/documentation/dt_vivado.htm
- VivadoHLS user guide: http://www.xilinx.com/support/documentation/sw_manuals/xilinx2012_4/ug902-vivado-high-level-synthesis.pdf
- VivadoHLS getting-started tutorial and its design files: http://www.xilinx.com/support/documentation/sw_manuals/xilinx2012_4/ug871-vivado-high-level-synthesis-tutorial.pdf / project files: https://secure.xilinx.com/webreg/clickthrough.do?cid=198573&license=RefDesLicense&filename=ug871-design-files.zip
- White papers / application notes:
  - http://www.xilinx.com/support/documentation/application_notes/xapp745-processor-control-vhls.pdf
  - http://www.xilinx.com/support/documentation/application_notes/xapp793-memory-structures-video-vivado-hls.pdf
  - http://www.xilinx.com/support/documentation/application_notes/xapp599-floating-point-vivado-hls.pdf
  - http://www.xilinx.com/support/documentation/white_papers/wp416-Vivado-Design-Suite.pdf
- Video tutorials of Vivado and VivadoHLS:
  - http://www.xilinx.com/training/vivado/index.htm
  - http://www.youtube.com/playlist?list=PL35626FEF3D5CB8F2&feature=plcp
- Xilinx's Vivado HLS tutorial (2013):
  - http://www.xilinx.com/support/documentation/sw_manuals/xilinx2013_2/ug871-vivado-high-level-synthesis-tutorial.pdf
  - https://secure.xilinx.com/webreg/clickthrough.do?cid=338217&license=RefDesLicense&filename=ug871-designfiles.zip&languageID=1
Introduction to High-Level Synthesis with Vivado HLS (this material exempt per Department of Commerce license exception TSU)
Objectives
After completing this module, you will be able to:
– describe the high-level synthesis flow
– understand the control and datapath extraction
– describe the scheduling and binding phases of the HLS flow
– list the priorities of directives set by Vivado HLS
– list the comprehensive language support in Vivado HLS
– identify the steps involved in the validation and verification flows
Intro to HLS 11- 2
© Copyright 2016 Xilinx
Outline Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary
Need for High-Level Synthesis
- Algorithmic-based approaches are getting popular due to accelerated design time and time to market (TTM): larger designs pose challenges in the design and verification of hardware at the HDL level.
- The industry trend is moving towards hardware acceleration to enhance performance and productivity: CPU-intensive tasks can be offloaded to a hardware accelerator in an FPGA, but hardware accelerators require a lot of time to understand and design.
- The Vivado HLS tool converts an algorithmic description written in a C-based design flow into a hardware description (RTL), elevating the abstraction level from RTL to algorithms.
- High-level synthesis is essential for maintaining design productivity for large designs.
High-Level Synthesis: HLS
- Creates an RTL implementation from C, C++, SystemC, or OpenCL API C kernel code.
- Extracts control and dataflow from the source code.
- Implements the design based on defaults and user-applied directives.
- Many implementations are possible from the same source description: smaller designs, faster designs, optimal designs. This enables design exploration.
Design Exploration with Directives
One body of code: many hardware outcomes

    loop: for (i=3;i>=0;i--) {
      if (i==0) {
        acc+=x*c[0];
        shift_reg[0]=x;
      } else {
        shift_reg[i]=shift_reg[i-1];
        acc+=shift_reg[i]*c[i];
      }
    }

- The same hardware is used for each iteration of the loop: small area, long latency, low throughput.
- Different hardware is used for each iteration of the loop: higher area, short latency, better throughput.
- Different iterations are executed concurrently: higher area, short latency, best throughput.

Before we get into details, let's look under the hood...
Introduction to High-Level Synthesis
- How is hardware extracted from C code? Control and datapath can be extracted from C code at the top level, and the same principles apply to sub-functions: at some point in the top-level control flow, control is passed to a sub-function, which may be implemented to execute concurrently with the top level and/or other sub-functions.
- How are this control and dataflow turned into a hardware design? Vivado HLS maps them to hardware through the scheduling and binding processes.
- How is my design created? How are functions, loops, arrays and IO ports mapped?
HLS: Control Extraction

Code (from any C code example):

    void fir ( data_t *y, coef_t c[4], data_t x ) {
      static data_t shift_reg[4];
      acc_t acc;
      int i;
      acc=0;
      loop: for (i=3;i>=0;i--) {
        if (i==0) {
          acc+=x*c[0];
          shift_reg[0]=x;
        } else {
          shift_reg[i]=shift_reg[i-1];
          acc+=shift_reg[i]*c[i];
        }
      }
      *y=acc;
    }

Control behavior: finite state machine (FSM) states, function start (state 0), for-loop start (1), for-loop end (2), function end. The loops in the C code correlate to states of behavior, and this behavior is extracted into a hardware state machine.
HLS: Control & Datapath Extraction

From the same C code example, the operations are extracted (RDx, RDc, >=, ==, +, *, +, *, WRy) and the control (FSM states 0, 1, 2) is known; a unified control-dataflow behavior is created, in which each state triggers the reads, the comparisons, the multiplies and adds, and the final write.
High-Level Synthesis: Scheduling & Binding
- Scheduling and binding are at the heart of HLS.
- Scheduling determines in which clock cycle an operation will occur; it takes into account the control, the dataflow and the user directives, and the allocation of resources can be constrained.
- Binding determines which library cell is used for each operation; it takes into account component delays and user directives.

Flow: the design source (C, C++, SystemC), the technology library and the user directives feed scheduling, then binding, producing RTL (Verilog, VHDL, SystemC).
Scheduling

The operations in the control-flow graph are mapped into clock cycles:

    void foo ( ... ) {
      t1 = a * b;
      t2 = c + t1;
      t3 = d * t2;
      out = t3 - e;
    }

- Schedule 1 performs one operation per cycle: *, then +, then *, then -.
- The technology and user constraints impact the schedule: a faster technology (or a slower clock) may allow more operations to occur in the same clock cycle. Schedule 2 chains * and + in one cycle, then * and - in the next.
- The code also impacts the schedule: code implications and data dependencies must be obeyed.
Binding

Binding is where operations are mapped to cores from the hardware library: operators map to cores.

- Binding decision: to share. Given a schedule where both multiplies fall in the same cycle, binding must use 2 multipliers; it can decide to use an adder and a subtractor, or to share one addsub.
- Binding decision: or not to share. Given a schedule where each multiply falls in a different cycle, binding may decide to share the multipliers; or it may decide that the cost of sharing (muxing) would impact timing, and decide not to share them. It may make this same decision in the first example too.
RTL vs High-Level Language (figure)
Vivado HLS Benefits: Productivity
– Verification
  • Functional
  • Architectural
– Abstraction
  • Datatypes
  • Interface
  • Classes
– Automation

Video design example:
  Input: 10 frames, 1280x720 — C simulation time: 10 s — RTL simulation time: ~2 days (ModelSim) — Improvement: ~12000x

[Figure: RTL flow (spec, sim) vs C flow (spec/sim) — block-level specification AND verification time significantly reduced]
Vivado HLS Benefits Portability – Processors and FPGAs – Technology migration – Cost reduction – Power reduction
Design and IP reuse
Vivado HLS Benefits Permutability – Architecture Exploration • Timing – Parallelization – Pipelining
• Resources – Sharing
– Better QoR
Rapid design exploration delivers QoR rivaling hand-coded RTL
Understanding Vivado HLS Synthesis Vivado HLS – Determines in which cycle operations should occur (scheduling) – Determines which hardware units to use for each operation (binding) – Performs high-level synthesis by : • Obeying built-in defaults • Obeying user directives & constraints to override defaults • Calculating delays and area using the specified technology/device
Priority of directives in Vivado HLS
1. Meet performance (clock & throughput)
   • Vivado HLS will allow a local clock path to fail if this is required to meet throughput
   • It is often possible for timing to be met after logic synthesis
2. Then minimize latency
3. Then minimize area
The Key Attributes of C Code

void fir (data_t *y, coef_t c[4], data_t x) {
    static data_t shift_reg[4];
    acc_t acc;
    int i;
    acc = 0;
    loop: for (i = 3; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i-1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

Functions: all code is made up of functions, which represent the design hierarchy — the same as in hardware
Top-level IO: the arguments of the top-level function determine the hardware RTL interface ports
Types: all variables are of a defined type; the type can influence the area and performance
Loops: functions typically contain loops; how these are handled can have a major impact on area and performance
Arrays: arrays are used often in C code; they can influence the device IO and become performance bottlenecks
Operators: operators in the C code may require sharing to control area, or specific hardware implementations to meet performance
Let’s examine the default synthesis behavior of these …
Functions & RTL Hierarchy
Each function is translated into an RTL block – a Verilog module or VHDL entity
Source code (my_code.c):

void A() { ..body A.. }
void B() { ..body B.. }
void C() { B(); }
void D() { B(); }

void foo_top() { A(…); C(…); D(…); }
[Figure: RTL hierarchy — foo_top contains A, C and D; C and D each instantiate B]
Each function/block can be shared like any other component (add, sub, etc) provided it’s not in use at the same time
– By default, each function is implemented using a common instance – Functions may be inlined to dissolve their hierarchy • Small functions may be automatically inlined
Types = Operator Bit-sizes

From any C code example (here, the same fir function), the operations are extracted: RDx, RDc, >=, ==, +, *, WRy.

Standard C types:
  char (8-bit), short (16-bit), int (32-bit), long long (64-bit)
  float (32-bit), double (64-bit)
  unsigned types

Arbitrary precision types:
  C: ap(u)int types (1-1024 bits)
  C++: ap_(u)int types (1-1024 bits), ap_fixed types
  C++/SystemC: sc_(u)int types (1-1024 bits), sc_fixed types

These can be used to define any variable to be a specific bit-width (e.g. 17-bit, 47-bit, etc.).
The C types define the size of the hardware used: handled automatically
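The arbitrary-precision types such as ap_int require the Xilinx headers, but their wrap-around behavior can be sketched in plain C with masking and two's-complement sign handling. The helper below is illustrative only — it is not the Vivado HLS API — and mimics what a W-bit signed variable would hold after an assignment:

```c
#include <stdint.h>

/* Illustrative: truncate v to w bits with two's-complement sign,
 * mimicking the value an ap_int<w> variable would hold. */
int32_t to_width(int64_t v, int w) {
    int64_t m = 1LL << w;       /* 2^w                       */
    int64_t u = v & (m - 1);    /* keep the low w bits       */
    if (u >= m / 2)             /* sign bit set?             */
        u -= m;                 /* wrap into [-2^(w-1), 2^(w-1)) */
    return (int32_t)u;
}
```

For example, `to_width(200, 8)` wraps to -56, showing how an 8-bit signed operator would behave, while a 32-bit `int` operator would of course keep 200 — which is exactly why the chosen type sets the operator size and area.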
Loops By default, loops are rolled – Each C loop iteration è Implemented in the same state – Each C loop iteration è Implemented with same resources
[Figure: synthesis of foo_top — array a[N] feeds one shared adder accumulating into b over N iterations]
Loops require labels if they are to be referenced by Tcl directives (GUI will auto-add labels)
– Loops can be unrolled if their indices are statically determinable at elaboration time • Not when the number of iterations is variable
– Unrolled loops result in more elements to schedule but greater operator mobility • Let’s look at an example ….
void foo_top (…) {
    ...
    Add: for (i = 3; i >= 0; i--) {
        b = a[i] + b;
        ...
    }
Data Dependencies: Good
(the same fir function as before, with its default schedule)
[Figure: default schedule — iterations 1 and 2 each perform RDc, >=, ==, -, +, * in sequence]
The read X operation has good mobility
Example of good mobility – The read on data port X can occur anywhere from the start to iteration 4 • The only constraint on RDx is that it occur before the final multiplication
– Vivado HLS has a lot of freedom with this operation • It waits until the read is required, saving a register • There are no advantages to reading any earlier (unless you want it registered) • Input reads can be optionally registered
– The final multiplication is very constrained…
[Figure: iterations 3 and 4 — RDx can float anywhere before the final multiplication; the schedule ends with WRy]
Data Dependencies: Bad
(the same fir function, default schedule)
[Figure: default schedule across iterations 1-4 — the final multiplication is squeezed at the very end, just before RDx, the final addition and WRy]
Mult is very constrained
Example of bad mobility – The final multiplication must occur before the read and final addition • It could occur in the same cycle if timing allows
– Loops are rolled by default • Each iteration cannot start till the previous iteration completes • The final multiplication (in iteration 4) must wait for earlier iterations to complete
– The structure of the code is forcing a particular schedule • There is little mobility for most operations
– Optimizations allow loops to be unrolled, giving greater freedom
Schedule after Loop Optimization
With the loop unrolled (completely):
– The dependency on loop iterations is gone
– Operations can now occur in parallel
  • If data dependencies allow
  • If operator timing allows
– The design finishes faster but uses more operators: 2 multipliers & 2 adders
[Figure: unrolled schedule — three RDc reads and RDx, then two multiplications per cycle over two cycles, then the additions, ending with WRy]
– All the logic associated with the loop counters and index checking are now gone – Two multiplications can occur at the same time • All 4 could, but it’s limited by the number of input reads (2) on coefficient port C
– Why 2 reads on port C? • The default behavior for arrays now limits the schedule…
Arrays in HLS
An array in C code is implemented by a memory in the RTL
– By default, arrays are implemented as RAMs, optionally as a FIFO

void foo_top(int x, …) {
    int A[N];
    L1: for (i = 0; i < N; i++)
        A[i+x] = A[i] + i;
}
[Figure: synthesis of foo_top — array A[N] maps to a single-port block RAM with DIN, ADDR, CE, WE and DOUT ports]
The array can be targeted to any memory resource in the library – The ports (Address, CE active high, etc.) and sequential operation (clocks from address to data out) are defined by the library model – All RAMs are listed in the Vivado HLS Library Guide
Arrays can be merged with other arrays and reconfigured – To implement them in the same memory or one of different widths & sizes
Arrays can be partitioned into individual elements
– Implemented as smaller RAMs or registers
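A hedged sketch of how a partitioning directive is typically written in the source (the pragma is ignored by an ordinary C compiler, so the function still runs unchanged in software; the function and names are invented for illustration — check the Vivado HLS documentation for the exact pragma syntax):

```c
/* Four coefficient reads per call: fully partitioning c into
 * registers would let all four reads happen in the same cycle
 * in hardware instead of being serialized through one RAM port. */
int dot4(const int x[4], const int c[4]) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
    int acc = 0;
    for (int i = 0; i < 4; i++)
        acc += x[i] * c[i];
    return acc;
}
```

In software the pragma has no effect; in synthesis it trades the single RAM for four registers, removing the memory-port bottleneck described above.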
Top-Level IO Ports
Top-level function arguments – all top-level function arguments have a default hardware port type
When the array is an argument of the top-level function
– The array/RAM is “off-chip”
– The type of memory resource determines the top-level IO ports
– Arrays on the interface can be mapped & partitioned
  • E.g. partitioned into separate ports for each element in the array
void foo_top(int A[3*N], int x) {
    L1: for (i = 0; i < N; i++)
        A[i+x] = A[i] + i;
}

[Figure: synthesis — A maps to a dual-port block RAM interface (DIN0/ADDR0/CE0/WE0/DOUT0 and DIN1/ADDR1/CE1/WE1/DOUT1); the number of ports is defined by the RAM resource]
Default RAM resource – a dual-port RAM if performance can be improved, otherwise a single-port RAM
Schedule after an Array Optimization
With the existing code & defaults:
– Port C is a dual-port RAM
– This allows 2 reads per clock cycle
• IO behavior impacts performance. Note: the original rolled design could also have performed 2 reads per cycle, but there was no advantage since the rolled loop forced a single read per iteration.
[Figure: schedule with dual-port C — the four RDc reads occur two per cycle, with RDx, the multiplications, the additions and WRy following]
With the C port partitioned into 4 separate ports
– All reads and multiplications can occur in one cycle
– If the timing allows
  • The additions can also occur in the same cycle
  • The write can be performed in the same cycle
  • Optionally, the port reads and writes could be registered
[Figure: fully partitioned schedule — RDc ×4 and RDx, then * ×4, then the additions and WRy, potentially all in one cycle]
Operators Operator sizes are defined by the type – The variable type defines the size of the operator
Vivado HLS will try to minimize the number of operators – By default Vivado HLS will seek to minimize area after constraints are satisfied
User can set specific limits & targets for the resources used
– Allocation can be controlled
  • An upper limit can be set on the number of operators or cores allocated for the design: this can be used to force sharing
  • E.g. limiting the number of multipliers to 1 will force Vivado HLS to share it
[Figure: 4 multiplications scheduled over 4 cycles — use 1 mult, and take 4 cycles, even if it could be done in 1 cycle using 4 mults]
– Resources can be specified • The cores used to implement each operator can be specified • e.g. Implement each multiplier using a 2 stage pipelined core (hardware)
[Figure: the same 4 mult operations done in 2 cycles with 2 pipelined mults (with allocation limiting the mults to 2)]
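A hedged sketch of how such an allocation limit is usually expressed in the source (the pragma is ignored by a software compiler, so the function runs normally off-target; the function itself is invented for illustration and the exact pragma spelling should be checked against the tool version):

```c
/* Four independent multiplications. The ALLOCATION pragma asks HLS
 * to implement them with a single multiplier, forcing the schedule
 * to spread them over several cycles instead of instantiating four
 * multipliers that all fire in one cycle. */
void mul4(const int a[4], const int b[4], int out[4]) {
#pragma HLS ALLOCATION instances=mul limit=1 operation
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * b[i];
}
```

This is the area/latency trade-off of the figure above made explicit: same results, fewer operators, more cycles.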
Outline Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary
Comprehensive C Support A Complete C Validation & Verification Environment – Vivado HLS supports complete bit-accurate validation of the C model – Vivado HLS provides a productive C-RTL co-simulation verification solution
Vivado HLS supports C, C++, SystemC and OpenCL API C kernel – Functions can be written in any version of C – Wide support for coding constructs in all three variants of C
Modeling with bit-accuracy – Supports arbitrary precision types for all input languages – Allowing the exact bit-widths to be modeled and synthesized
Floating point support – Support for the use of float and double in the code
Support for OpenCV functions
– Enable migration of OpenCV designs into Xilinx FPGA
– Libraries target real-time full HD video processing
C, C++ and SystemC Support
The vast majority of C, C++ and SystemC is supported
– Provided it is statically defined at compile time
– If it is not defined until run time, it won't be synthesizable
Any of the three variants of C can be used – If C is used, Vivado HLS expects the file extensions to be .c – For C++ and SystemC it expects file extensions .cpp
Outline Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary
C Validation and RTL Verification
There are two steps to verifying the design:
– Pre-synthesis: C validation
  • Validate that the algorithm is correct
– Post-synthesis: RTL verification
  • Verify that the RTL is correct

C validation
– A HUGE reason users want to use HLS
  • Fast, free verification — validate the algorithm is correct before synthesis
  • Follow the test bench tips given in the following slides

RTL verification
– Vivado HLS can co-simulate the RTL with the original test bench
C Function Test Bench
The test bench is the level above the function
– The main() function is above the function to be synthesized
Good Practices – The test bench should compare the results with golden data • Automatically confirms any changes to the C are validated and verifies the RTL is correct
– The test bench should return 0 if the self-checking is correct
  • Anything but a 0 (zero) will cause RTL verification to issue a FAIL message
  • Function main() should have an integer return type (non-void)

int main () {
    int ret = 0;
    …
    ret = system("diff --brief -w output.dat output.golden.dat");
    if (ret != 0) {
        printf("Test failed !!!\n");
        ret = 1;
    } else {
        printf("Test passed !\n");
    }
    …
    return ret;
}
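The system("diff …") call above assumes golden files are present on disk. The same self-checking discipline can be kept entirely inside the test bench by comparing against golden data held in arrays — a sketch, where dut() is an invented stand-in for the function under test:

```c
#include <stdio.h>

/* Invented stand-in for the design under test. */
int dut(int x) { return 2 * x + 1; }

/* Self-checking test bench body: returns 0 only if every result
 * matches the golden data, which is what RTL co-simulation expects
 * from main(). */
int run_testbench(void) {
    const int stimuli[4] = {0, 1, 2, 3};
    const int golden[4]  = {1, 3, 5, 7};
    int ret = 0;
    for (int i = 0; i < 4; i++)
        if (dut(stimuli[i]) != golden[i]) ret = 1;
    printf(ret ? "Test failed !!!\n" : "Test passed !\n");
    return ret;
}
```

A real main() would simply `return run_testbench();`, preserving the "0 means pass" contract.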
Determine or Create the Top-Level Function
Determine the top-level function for synthesis. If there are multiple functions, they must be merged — there can only be 1 top-level function for synthesis. Given a case where functions func_A and func_B are to be implemented in the FPGA:
Re-partition the design to create a new single top-level function inside main().

Original main.c (hierarchy: func_A, func_B, func_C):

int main () {
    ...
    func_A(a, b, *i1);
    func_B(c, *i1, *i2);
    func_C(*i2, ret);
    return ret;
}

Re-partitioned main.c (hierarchy: func_AB, func_C):

#include func_AB.h
int main (a, b, c, d) {
    ...
    // func_A(a,b,i1);
    // func_B(c,i1,i2);
    func_AB(a, b, c, *i1, *i2);
    func_C(*i2, ret);
    return ret;
}

func_AB.c (contains func_A and func_B):

#include func_AB.h
func_AB(a, b, c, *i1, *i2) {
    ...
    func_A(a, b, *i1);
    func_B(c, *i1, *i2);
    …
}

Recommendation: keep test bench and design files separate.
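The wrapping pattern above can be shown as minimal compilable C. The bodies of func_A and func_B here are invented placeholders — only the structure (one new top-level that calls the two leaves) is the point:

```c
/* Placeholder leaf functions: bodies are illustrative only. */
void func_A(int a, int b, int *i1) { *i1 = a + b; }
void func_B(int c, int *i1, int *i2) { *i2 = c * (*i1); }

/* New single top-level function for synthesis: it wraps func_A and
 * func_B so the tool sees exactly one top-level entry point. */
void func_AB(int a, int b, int c, int *i1, int *i2) {
    func_A(a, b, i1);
    func_B(c, i1, i2);
}
```

main() then calls func_AB() instead of the two leaves, and func_AB becomes the synthesis top.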
Outline Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary
Summary In HLS – C becomes RTL – Operations in the code map to hardware resources – Understand how constructs such as functions, loops and arrays are synthesized
HLS design involves – Synthesize the initial design – Analyze to see what limits the performance • User directives to change the default behaviors • Remove bottlenecks
– Analyze to see what limits the area • The types used define the size of operators • This can have an impact on what operations can fit in a clock cycle
Summary Use directives to shape the initial design to meet performance – Increase parallelism to improve performance – Refine bit sizes and sharing to reduce area
Vivado HLS benefits – Productivity – Portability – Permutability
Improving Performance This material exempt per Department of Commerce license exception TSU
Objectives
After completing this module, you will be able to:
– Add directives to your design
– List a number of ways to improve performance
– State which directives are useful to improve latency
– Describe how loops may be handled to improve latency
– Recognize the dataflow technique that improves design throughput
– Describe the pipelining technique that improves design throughput
– Identify some of the bottlenecks that impact design performance
Improving Performance 13- 2
© Copyright 2016 Xilinx
Outline Adding Directives Improving Latency – Manipulating Loops
Improving Throughput Performance Bottleneck Summary
Improving Performance Vivado HLS has a number of ways to improve performance – Automatic (and default) optimizations – Latency directives – Pipelining to allow concurrent operations
Vivado HLS supports techniques to remove performance bottlenecks
– Manipulating loops
– Partitioning and reshaping arrays
Optimizations are performed using directives – Let’s look first at how to apply and use directives in Vivado HLS
Applying Directives
If the source code is open in the GUI Information pane
– The Directive tab in the Auxiliary pane shows all the locations and objects upon which directives can be applied (in the opened C file, not the whole design)
  • Functions, loops, regions, arrays, top-level arguments
– Select the object in the Directive Tab • “dct” function is selected
– Right-click to open the editor dialog box – Select a desired directive from the dropdown menu • “DATAFLOW” is selected
– Specify the Destination • Source File • Directive File
Optimization Directives: Tcl or Pragma
Directives can be placed in the directives file
– The Tcl command is written into directives.tcl
– There is a directives.tcl file in each solution
  • Each solution can have different directives
Once applied, the directive will be shown in the Directives tab (right-click to modify or delete)
Directives can be placed into the C source
– Pragmas are added (and will remain) in the C source file
– Pragmas (#pragma) will be used by every solution which uses the code
Solution Configurations
Configurations can be set on a solution
– Set the default behavior for that solution
  • Open configuration settings from the menu (Solutions > Solution Settings…)
“Add” or “Remove” configuration settings
Select “General”
– Choose the configuration from the drop-down menu • Array Partitioning, Binding, Dataflow Memory types, Interface, RTL Settings, Core, Compile, Schedule efforts
Example: Configuring the RTL Output Specify the FSM encoding style – By default the FSM is auto
Add a header string to all RTL output files – Example: Copyright Acme Inc.
Add a user specified prefix to all RTL output filenames – The RTL has the same name as the C functions – Allow multiple RTL variants of the same top-level function to be used together without renaming files
Reset all registers – By default only the FSM registers and variables initialized in the code are reset – RAMs are initialized in the RTL and bitstream
Synchronous or asynchronous reset
– The default is synchronous reset
Active-high or active-low reset
– The default is active high
The remainder of the configuration commands will be covered throughout the course
Copying Directives into New Solutions
Click the New Solution button. Optionally modify any of the settings
– Part, clock period, uncertainty
– Solution name
Copy existing directives
– Selected by default
– Uncheck if you do not want to copy
– No need to copy pragmas; they are in the code
Outline Adding Directives Improving Latency – Manipulating Loops
Improving Throughput Performance Bottleneck Summary
Latency and Throughput – The Performance Factors
Design latency
– The latency of the design is the number of cycles it takes to output the result
  • In this example the latency is 10 cycles
Design throughput
– The throughput of the design is the number of cycles between new inputs
  • By default (no concurrency) this is the same as the latency
  • The next start/read occurs when this transaction ends
Latency and Throughput In the absence of any concurrency – Latency is the same as throughput
Pipelining for higher throughput – Vivado HLS can pipeline functions and loops to improve throughput – Latency and throughput are related – We will discuss optimizing for latency first, then throughput
Vivado HLS: Minimize latency Vivado HLS will by default minimize latency – Throughput is prioritized above latency (no throughput directive is specified here) – In this example • The functions are connected as shown • Assume function B takes longer than any other functions
Vivado HLS will automatically take advantage of the parallelism – It will schedule functions to start as soon as they can • Note it will not do this for loops within a function: by default they are executed in sequence
Reducing Latency Vivado HLS has the following directives to reduce latency – LATENCY • Allows a minimum and maximum latency constraint to be specified
– LOOP_FLATTEN
  • Allows nested loops to be collapsed into a single loop with improved latency
– LOOP_MERGE • Merge consecutive loops to reduce overall latency, increase sharing, and improve logic optimization
– UNROLL
Default Behavior: Minimizing Latency Functions – Vivado HLS will seek to minimize latency by allowing functions to operate in parallel • As shown on the previous slide
Loops – Vivado HLS will not schedule loops to operate in parallel by default • Dataflow optimization must be used or the loops must be unrolled • Both techniques are discussed in detail later
Operations

void foo_top (…) {
    ...
    Add: for (i = 3; i >= 0; i--) {
        b = a[i] + b;
        ...
    }
Loops require labels if they are to be referenced by Tcl directives (GUI will auto-add labels)
– Loops can be unrolled if their indices are statically determinable at elaboration time • Not when the number of iterations is variable
Rolled Loops Enforce Latency
A rolled loop can only be optimized so much. Given this example, where the delay of the adder is small compared to the clock period:

void foo_top (…) {
    ...
    Add: for (i = 3; i >= 0; i--) {
        b = a[i] + b;
        ...
    }
[Figure: timing — the adder delay occupies only a fraction of each clock period; iterations 3, 2, 1, 0 each take a full cycle]
– This rolled loop will never take less than 4 cycles • No matter what kind of optimization is tried • This minimum latency is a function of the loop iteration count
Unrolled Loops can Reduce Latency
Select loop “Add” in the directives pane and right-click
Unrolled loops allow greater option & exploration
Options explained on next slide
Unrolled loops are likely to result in more hardware resources and higher area
Par3al Unrolling Fully unrolling loops can create a lot of hardware Loops can be partially unrolled – Provides the type of exploration shown in the previous slide
Partial Unrolling
– A standard loop of N iterations can be unrolled by a factor
– For example, unroll by a factor of 2 to have N/2 iterations
Add: for(int i = 0; i < N; i++) { a[i] = b[i] + c[i]; }
Add: for(int i = 0; i < N; i += 2) { a[i] = b[i] + c[i]; if (i+1 >= N) break; a[i+1] = b[i+1] + c[i+1]; }
• Similar to writing the new code shown on the right
• The break accounts for the condition when N/2 is not an integer
Effective code after compiler transformation
– If N is known to be an integer multiple of the unroll factor
  • The user can remove the exit check (and associated logic)
  • Vivado HLS is not always able to determine this is true (e.g. if N is an input argument)
  • The user takes responsibility: verify!
for(int i = 0; i < N; i += 2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1]; }
Trade-off: an extra adder in exchange for halving the cycle count to N/2
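The factor-2 transformation above is easy to check in software: both versions must produce identical results for even and, thanks to the exit check, odd N. A self-contained sketch of the two versions:

```c
/* Reference loop. */
void add_ref(int n, const int *b, const int *c, int *a) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Factor-2 partially unrolled version, including the exit check
 * needed when n is not a multiple of 2. */
void add_unroll2(int n, const int *b, const int *c, int *a) {
    for (int i = 0; i < n; i += 2) {
        a[i] = b[i] + c[i];
        if (i + 1 >= n) break;          /* exit check for odd n */
        a[i + 1] = b[i + 1] + c[i + 1];
    }
}
```

Dropping the `break` is only safe when n is guaranteed even — exactly the "user takes responsibility: verify!" caveat above.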
Loop Flattening
Vivado HLS can automatically flatten nested loops – a faster approach than manually changing the code
Flattening should be specified on the innermost loop
– It will be flattened into the loop above
– The “off” option can prevent loops in the hierarchy from being flattened
Before flattening (36 FSM transitions):

void foo_top (…) {
    ...
    L1: for (i = 3; i >= 0; i--) {
        [loop body l1]
    }
    L2: for (i = 3; i >= 0; i--) {
        L3: for (j = 3; j >= 0; j--) {
            [loop body l3]
        }
    }
    L4: for (i = 3; i >= 0; i--) {
        [loop body l4]
    }

After flattening L3 into L2 (28 transitions):

void foo_top (…) {
    ...
    L1: for (i = 3; i >= 0; i--) {
        [loop body l1]
    }
    L2: for (k = 15; k >= 0; k--) {
        [loop body l3]
    }
    L4: for (i = 3; i >= 0; i--) {
        [loop body l4]
    }

Loops will be flattened by default: use “off” to disable.
Perfect and Semi-Perfect Loops
Only perfect and semi-perfect loops can be flattened
– The loops should be labeled, or directives cannot be applied
Perfect loops:
– Only the innermost loop has a body (contents)
Loop_outer: for (i = 3; i >= 0; i--) {
    Loop_inner: for (j = 3; j >= 0; j--) {
        [loop body]
    }
}
– There is no logic specified between the loop statements – The loop bounds are constant
Semi-perfect loops:
– Only the innermost loop has a body (contents)
– There is no logic specified between the loop statements
Loop_outer: for (i = 3; i > N; i--) {
    Loop_inner: for (j = 3; j >= 0; j--) {
        [loop body]
    }
}
– The outermost loop bound can be variable
Other types:
– Should be converted to perfect or semi-perfect loops
Loop_outer: for (i = 3; i > N; i--) {
    [loop body]
    Loop_inner: for (j = 3; j >= M; j--) {
        [loop body]
    }
}
Loop Merging Vivado HLS can automatically merge loops – A faster approach than manually changing the code – Allows for more efficient architecture explorations – FIFO reads, which must occur in strict order, can prevent loop merging • Can be done with the “force” option : user takes responsibility for correctness
Before merging (36 transitions):

void foo_top (…) {
    ...
    L1: for (i = 3; i >= 0; i--) {
        [loop body l1]
    }
    L2: for (i = 3; i >= 0; i--) {      // L3 already flattened into L2
        L3: for (j = 3; j >= 0; j--) {
            [loop body l3]
        }
    }
    L4: for (i = 3; i >= 0; i--) {
        [loop body l4]
    }

After merging (18 transitions):

void foo_top (…) {
    ...
    L123: for (l = 16; l >= 0; l--) {
        if (cond1)
            [loop body l1]
        [loop body l3]
        if (cond4)
            [loop body l4]
    }
Loop Merge Rules
If loop bounds are all variables, they must have the same value. If loop bounds are constants, the maximum constant value is used as the bound of the merged loop
– As in the previous example, where the maximum loop bound becomes 16 (implied by L3 being flattened into L2 before the merge)
Loops with both variable bound and constant bound cannot be merged The code between loops to be merged cannot have side effects – Multiple execution of this code should generate same results • A=B is OK, A=A+1 is not
Reads from a FIFO or FIFO interface must always be in sequence – A FIFO read in one loop will not be a problem – FIFO reads in multiple loops may become out of sequence • This prevents loops being merged
Loop Reports Vivado HLS reports the latency of loops – Shown in the report file and GUI
Given a variable loop index, the latency cannot be reported – Vivado HLS does not know the limits of the loop index – This results in latency reports showing unknown values
The loop tripcount (iteration count) can be specified – Apply to the loop in the directives pane – Allows the reports to show an estimated latency
Note: the tripcount directive impacts reporting only – not synthesis
Techniques for Minimizing Latency - Summary
Constraints
– Vivado HLS accepts constraints for latency
Loop optimizations
– Latency can be improved by minimizing the number of loop boundaries
  • Rolled loops (the default) enforce sharing at the expense of latency
  • Entering and exiting a loop costs clock cycles
Outline Adding Directives Improving Latency – Manipulating Loops
Improving Throughput Performance Bottleneck Summary
Improving Throughput Given a design with multiple functions – The code and dataflow are as shown
Vivado HLS will schedule the design
It can also automatically optimize the dataflow for throughput
Dataflow Optimization
Dataflow optimization
– Can be used at the top-level function
– Allows blocks of code to operate concurrently
  • The blocks can be functions or loops
  • Dataflow allows loops to operate concurrently
– It places channels between the blocks to maintain the data rate
• For arrays the channels will include memory elements to buffer the samples • For scalars the channel is a register with hand-shakes
Dataflow optimization therefore has an area overhead – Additional memory blocks are added to the design – The timing diagram on the previous page should have a memory access delay between the blocks • Not shown to keep explanation of the principle clear
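A hedged sketch of what dataflow-friendly code typically looks like (the function, names and sizes are invented; the pragma is ignored by a software compiler, so the function behaves identically off-target):

```c
#define N 16

/* A producer loop followed by a consumer loop. With the DATAFLOW
 * pragma, HLS inserts a channel (ping-pong buffer or FIFO) for tmp
 * so the consumer can start before the producer has finished. */
void top(const int in[N], int out[N]) {
#pragma HLS DATAFLOW
    int tmp[N];
    Produce: for (int i = 0; i < N; i++)
        tmp[i] = in[i] * 2;
    Consume: for (int i = 0; i < N; i++)
        out[i] = tmp[i] + 1;
}
```

The channel inserted for tmp is exactly the area overhead mentioned above: extra memory buys overlapped execution of the two loops.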
Dataflow Optimization Commands
Dataflow is set using a directive
– Vivado HLS will seek to create the highest performance design
  • Throughput of 1
Dataflow Optimization through the Configuration Command
Configuring dataflow memories
– Between functions, Vivado HLS uses ping-pong memory buffers by default
  • The memory size is defined by the maximum number of producer or consumer elements
– Between loops Vivado HLS will determine if a FIFO can be used in place of a ping-pong buffer – The memories can be specified to be FIFOs using the Dataflow Configuration • Menu: Solution > Solution Settings > config_dataflow • With FIFOs the user can override the default size of the FIFO • Note: Setting the FIFO too small may result in an RTL verification failure
Individual Memory Control – When the default is ping-pong • Select an array and mark it as Streaming (directive STREAM) to implement the array as a FIFO
– When the default is FIFO
  • Select an array and mark it as Streaming (directive STREAM) with option “off” to implement the array as a ping-pong
Note: to use FIFOs, access must be sequential. If HLS determines that the access is not sequential, it will halt and issue a message; if it cannot determine the sequential nature, it will issue a warning and continue.
Dataflow: Ideal for Streaming Arrays & Multi-Rate Functions
Arrays are passed as single entities by default
– This example uses loops, but the same principle applies to functions
Dataflow pipelining allows loop_2 to start when data is ready – The throughput is improved – Loops will operate in parallel • If dependencies allow
Multi-Rate Functions – Dataflow buffers data when one function or loop consumes or produces data at different rate from others
IO flow support – To take maximum advantage of dataflow in streaming designs, the IO interfaces at both ends of the datapath should be streaming/handshake types (ap_hs or ap_fifo) Improving Performance 13- 34
Dataflow Limitations (1)
Each channel must have a single producer and a single consumer; code that violates this rule prevents dataflow from working. [Figure: violating code and its fix]
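Since the violating code is only shown in the missing figure, here is a hedged reconstruction of the usual pattern (all names invented): temp is read by two tasks, violating the single-consumer rule; the fix has the producer write two separate copies so each channel gets exactly one producer and one consumer.

```c
void producer(int in, int *out)  { *out = in + 1; }
void consumer1(int v, int *out)  { *out = v * 2; }
void consumer2(int v, int *out)  { *out = v * 3; }

/* Violates dataflow: temp feeds two consumers (two readers of one channel). */
void top_bad(int in, int *o1, int *o2) {
    int temp;
    producer(in, &temp);
    consumer1(temp, o1);
    consumer2(temp, o2);
}

/* Fix: the producer emits two copies, so each channel is strictly
 * point-to-point. */
void producer_dup(int in, int *outA, int *outB) { *outA = in + 1; *outB = in + 1; }

void top_good(int in, int *o1, int *o2) {
    int t1, t2;
    producer_dup(in, &t1, &t2);
    consumer1(t1, o1);
    consumer2(t2, o2);
}
```

Functionally the two tops are identical; only the fixed one has the point-to-point channel structure dataflow requires.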
Dataflow Limitations (2)
You cannot bypass a task; code that violates this rule prevents dataflow from working. The fix: make the datapath systolic-like, passing data through every stage. [Figure: violating code and its fix]
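Again the code lives only in the missing figure; a hedged reconstruction of the bypass problem (invented names): stage1's output would go both to stage2 and directly to stage3, skipping stage2. The fix forwards the value through stage2 so the chain is a simple pipeline with no skipped task.

```c
void stage1(int in, int *out) { *out = in + 1; }

/* Fix: stage2 forwards the value it would otherwise be bypassed by,
 * making the chain systolic: stage1 -> stage2 -> stage3. */
void stage2_pass(int v, int *sq, int *fwd) { *sq = v * v; *fwd = v; }

void stage3(int sq, int v, int *out) { *out = sq - v; }

void top_fixed(int in, int *out) {
    int t1, sq, fwd;
    stage1(in, &t1);
    stage2_pass(t1, &sq, &fwd);
    stage3(sq, fwd, out);
}
```

The extra *fwd output costs a small channel, but every value now flows through every task, which is what dataflow needs.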
Dataflow vs Pipelining Optimization
Dataflow optimization
– Dataflow optimization is “coarse grain” pipelining at the function and loop level
– Increases concurrency between functions and loops
– Only works on functions or loops at the top level of the hierarchy
  • Cannot be used in sub-functions
Function & Loop Pipelining – “Fine grain” pipelining at the level of the operators (*, +, >>, etc.) – Allows the operations inside the function or loop to operate in parallel – Unrolls all sub-loops inside the function or loop being pipelined • Loops with variable bounds cannot be unrolled: This can prevent pipelining • Unrolling loops increases the number of operations and can increase memory and run time
Function Pipelining

void foo(...) {
    op_Read;     // RD
    op_Compute;  // CMP
    op_Write;    // WR
}

Without pipelining:
– There are 3 clock cycles before operation RD can occur again
  • Throughput = 3 cycles
– There are 3 cycles before the 1st output is written
  • Latency = 3 cycles
[Figure: RD CMP WR | RD CMP WR — one transaction completes before the next starts]

With pipelining:
– The latency is the same: 3 cycles
– The throughput is better: 1 cycle — fewer cycles between reads, higher throughput
[Figure: RD CMP WR overlapped — a new RD starts every cycle]
Loop Pipelining

Without pipelining, each loop iteration (e.g. Loop: for (i = 1; i < N; i++) { … }) completes before the next one starts; with pipelining, iterations overlap in the same way the operations of a pipelined function do.

Inlining

int sumsub_func (int *in1, int *in2, int *outSum, int *outSub) {
    *outSum = *in1 + *in2;
    *outSub = *in1 - *in2;
}

int shift_func (int *in1, int *in2, int *outA, int *outB) {
    *outA = *in1 >> 1;
    *outB = *in2 >> 2;
}

void add_sub_pass (int A, int B, int *C, int *D) {
    int apb, amb;
    int a2, b2;
    sumsub_func(&A, &B, &apb, &amb);
    sumsub_func(&apb, &amb, &a2, &b2);
    shift_func(&a2, &b2, C, D);
}

(Figure: before inlining the datapath instantiates 2 adders and 2 subtractors; after inlining, the whole function reduces to C = A and D = B >> 1 – zero area.)

Inlining allows optimization to be performed across function hierarchies. Like RTL ungrouping, too much inlining can create a lot of logic and slow runtime.
Improving Area and Resources 21- 12
Inline and Allocation: Shape the Hierarchy

Easy to Share – One RTL block is reused for both instances of function foo.
Cannot be shared – Function foo is not within the immediate scope of foo_top.
Controlling Sharing – Inlining brings foo into function foo_top where it can be shared.
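Once the calls sit in the same scope, the ALLOCATION directive can cap how many RTL instances are built; a sketch using the 2016-era pragma syntax (function names are hypothetical — check the directive syntax for your tool version):

```c
/* Small function called twice from the same scope. */
static int foo(int a, int b) {
    return a * b + 1;
}

int foo_top(int w, int x, int y, int z) {
/* limit=1: both calls below time-share a single RTL instance of foo
 * (a plain C compiler ignores the pragma). */
#pragma HLS ALLOCATION instances=foo limit=1 function
    return foo(w, x) + foo(y, z);
}
```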
Loops

By default, loops are rolled:
– Each C loop iteration → implemented in the same state
– Each C loop iteration → implemented with the same resources

void foo_top (…) {
    ...
    Add: for (i = 3; i >= 0; i--) {
        b = a[i] + b;
        ...
    }
}

(Figure: synthesis builds a single adder that accumulates the elements of a[N] into b.)

For area optimization, keeping loops rolled maximizes sharing across loop iterations: each iteration of the loop uses the same hardware resources.
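The rolled/unrolled trade-off can be made explicit in source (function names are illustrative; both variants compute the same result, and a software compiler ignores the pragma):

```c
#define N 4

/* Rolled (the default): one adder is reused across the N iterations. */
int accum_rolled(const int a[N]) {
    int b = 0;
    for (int i = N - 1; i >= 0; i--)
        b = a[i] + b;
    return b;
}

/* Unrolled: HLS replicates the body, so up to N adds can be scheduled
 * in parallel — faster, but more area. */
int accum_unrolled(const int a[N]) {
    int b = 0;
    for (int i = N - 1; i >= 0; i--) {
#pragma HLS UNROLL
        b = a[i] + b;
    }
    return b;
}
```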
Loop Merging & Flattening

Loop merging & flattening can remove the redundant computation among multiple (related) loops – improving area (and sometimes performance).

My_Region: {
#pragma HLS loop_merge
    for (i = 0; i < N; ++i)
        A[i] = B[i] + 1;
    for (i = 0; i < N; ++i)
        C[i] = A[i] / 2;
}

Effective code after compiler transformation (Merge):

    for (i = 0; i < N; ++i) {
        A[i] = B[i] + 1;
        C[i] = A[i] / 2;
    }

Merging allows Vivado HLS to perform optimizations that cannot occur across loop boundaries:

    for (i = 0; i < N; ++i)
        C[i] = (B[i] + 1) / 2;

This removes A[i], any address logic and any potential memory accesses.
Mapping Arrays The arrays in the C model may not be ideal for the available RAMs – The code may have many small arrays – The array may not utilize the RAMs very well
Array Mapping – Mapping combines smaller arrays into larger arrays • Allows arrays to be reconfigured without code edits
– Specify the array variable to be mapped – Give all arrays to be combined the same instance name
Vivado HLS provides options as to the type of mapping – Combine the arrays without impacting performance • Vertical & Horizontal mapping
Global Arrays – When a global array is mapped all arrays involved are promoted to global – When arrays are in different functions, the target becomes global
Arrays which are function arguments – All must be part of the same function interface
Horizontal Mapping – Combines multiple arrays into one longer (horizontal) array – Optionally allows the arrays to be offset • The default is to concatenate after the last element
• The first array specified (in GUI or Tcl script) starts at location zero
Vertical Mapping – Combines multiple arrays into an array with more bits
– The first array specified (in Tcl or GUI) starts at the LSB
Vertical Mapping for performance – Creates RAMs with wide words → parallel accesses
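A source-level sketch of the mapping directive (array and instance names are hypothetical, and the syntax follows the 2016-era ARRAY_MAP pragma — verify against your tool version; a plain C compiler ignores the pragmas):

```c
#define N 8

/* Two small arrays mapped into one RAM instance "AB".
 * horizontal: AB gets 2*N elements (A then B);
 * vertical would instead give N words of double width. */
int sum_both(int idx) {
    int A[N], B[N];
#pragma HLS ARRAY_MAP variable=A instance=AB horizontal
#pragma HLS ARRAY_MAP variable=B instance=AB horizontal
    for (int i = 0; i < N; i++) {
        A[i] = i;
        B[i] = 10 * i;
    }
    return A[idx] + B[idx];   /* behavior is unchanged by the mapping */
}
```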
Arbitrary Precision Integers C and C++ have standard types created on the 8-bit boundary – char (8-bit), short (16-bit), int (32-bit), long long (64-bit) • Also provided: stdint.h (for C), and stdint.h or cstdint (for C++) • Types: int8_t, uint16_t, uint32_t, int64_t etc.
– They result in hardware which is not bit-accurate and can give sub-standard QoR
Vivado HLS provides bit-accurate types in both C and C++ – Plus SystemC types can be used in C++ – Allow any arbitrary bit-width to be specified – Will simulate with bit-accuracy
Why are Arbitrary Precision types Needed? Code using native C int type
However, if the inputs will only have a max range of 8-bit – Arbitrary precision data-types should be used
– It will result in smaller & faster hardware with full precision
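In real designs this means using ap_uint<W>/ap_int<W> from ap_int.h (C++); as a plain-C illustration of the semantics — an emulation of the 8-bit wrap-around, not the real Xilinx header:

```c
#include <stdint.h>

/* Emulate ap_uint<8> wrap-around by masking to 8 bits. */
static uint32_t wrap8(uint32_t v) {
    return v & 0xFFu;
}

/* 8-bit accumulate: with an 8-bit type, HLS builds an 8-bit adder
 * instead of a full 32-bit datapath. */
uint32_t acc8(const uint8_t *d, int n) {
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = wrap8(acc + d[i]);
    return acc;
}
```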
Outline Optimizing Resource Utilization Reducing Area Usage Summary
Summary Resource utilization can be reduced using allocation and binding controls. Arbitrary precision data types help control both the area and resource utilization. The design structure can be controlled by – Inlining functions: direct impact on RTL hierarchy & optimization possibilities – Loops: direct impact on reuse of resources – Arrays: direct impact on the RAM
Major area optimization techniques – Minimize bit widths – Map smaller arrays into larger arrays • Make better use of existing RAMs
– Control loop hierarchy – Control function call hierarchy – Control the number of operators and cores
Using Vivado HLS This material exempt per Department of Commerce license exception TSU
Objectives After completing this module, you will be able to: – List the operating systems under which Vivado HLS is supported – Describe how projects are created and maintained in Vivado HLS – State the steps involved in using the Vivado HLS project creation wizard – Distinguish between the role of the top-level module in the test bench and the design to be synthesized – List the verifications which can be done in Vivado HLS – Describe the Vivado HLS project directory structure
Using Vivado HLS 12 - 2
Outline Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary
Vivado HLS OS Support Vivado HLS is supported on both Linux and Windows. The Vivado HLS tool is available under two licenses – HLS license • The HLS license comes with Vivado System Edition • Supports all 7 series devices including Zynq® All Programmable SoC • Does not support Virtex®-6 and earlier devices – Use an older version of Vivado HLS for Virtex-6 and earlier
Operating System   Version
Windows            Windows 10 Professional (64-bit); Windows 8.1 Professional (64-bit); Windows 7 SP1 Professional (64-bit)
Red Hat Linux      RHEL Enterprise Linux 5.11, 6.7-6.8 (64-bit); RHEL Enterprise Linux 7.1 and 7.2 (64-bit)
SUSE               SUSE Linux Enterprise 11.4 and 12.1 (64-bit)
CentOS             CentOS 6.8 (64-bit)
Ubuntu             Ubuntu Linux 16.04 LTS (64-bit)
Invoke Vivado HLS from Windows Menu
The first step is to open or create a project
Vivado HLS GUI
Information Pane
Auxiliary Pane
Project Explorer Pane
Console Pane
Vivado HLS Projects and Solutions

Vivado HLS is project based – A project specifies the source code which will be synthesized.

Project Level (Source)
– Each project is based on one set of source code
– Each project has a user specified name

Solution Level
A project can contain multiple solutions – Solutions are different implementations of the same code – Auto-named solution1, solution2, etc. – Supports user specified names
– Solutions can have different clock frequencies, target technologies, synthesis directives

Projects and solutions are stored in a hierarchical directory structure – Top-level is the project directory – The disk directory structure is identical to the structure shown in the GUI project explorer (except for source code location)
Vivado HLS Step 1: Create or Open a project Start a new project – The GUI will start the project wizard to guide you through all the steps
Optionally use the Toolbar Button to Open New Project
Open an existing project – All results, reports and directives are automatically saved/remembered – Use “Recent Project” menu for quick access
Project Wizard The Project Wizard guides users through the steps of opening a new project Step-by-step guide …
Define project and directory
Add design source files
Specify test bench files
1st Solution Information
Project Level Information
Specify clock and select part
Define Project & Directory Define the project name − Note, here the project is given the extension .prj − A useful way of seeing that it is a project (and not just another directory) when browsing
Browse to the location of the project – In this example, project directory “matrixmul.prj” will be created inside directory “lab1”
Add Design Source Files Add Design Source Files − This allows Vivado HLS to determine the top-level design for synthesis, from the test bench and associated files − Not required for SystemC designs
Add Files… – Select the source code file(s) – The CTRL and SHIFT keys can be used to add multiple files – No need to include headers (.h) if they reside in the same directory
Select File and Edit CFLAGS… − If required, specify C compile arguments using the “Edit CFLAGS…” option − Define macros: -DVERSION1 − Location of any (header) files not in the same directory as the source: -I../include
There is no need to add the location of standard Vivado HLS or SystemC header files or header files located in the same project location
Specify Test Bench Files Use “Add Files” to include the test bench – Vivado HLS will re-use these to verify the RTL using cosimulation
And all files referenced by the test bench – The RTL simulation will be executed in a different directory (this ensures the original results are not overwritten) – Vivado HLS also needs to copy any files accessed by the test bench • E.g. input data and output results
Add Folders – If the test bench uses relative paths like “sub_directory/my_file.dat” you can add “sub_directory” as a folder/directory
Use “Edit CFLAGS…” – To add any C compile flags required for compilation
Test Benches I

The test bench should be in a separate file, or excluded from synthesis:
– The macro __SYNTHESIS__ can be used to isolate code which will not be synthesized
• This macro is defined when Vivado HLS parses any code (-D__SYNTHESIS__)

// test.c
#include <stdio.h>
void test (int d[10]) {
    int acc = 0;
    int i;
    for (i = 0; i < 10; i++) {
        acc += d[i];
    }
#ifndef __SYNTHESIS__
    printf("Sum: %d\n", acc);   /* excluded from synthesis */
#endif
}

Importing the exported IP into a Vivado project:
1. Open the IP Catalog
2. Add IP to import this block
3. Browse to the zip file inside “ip”
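Since cosimulation reuses the C test bench to verify the RTL, the test bench should be self-checking; a minimal sketch of the pattern (dut_sum and run_testbench are illustrative names — in a real project the checking code sits directly in main()):

```c
#include <stdio.h>

/* Design under test: sums 10 values (a stand-in for the slide's test()). */
static int dut_sum(const int d[10]) {
    int acc = 0;
    for (int i = 0; i < 10; i++)
        acc += d[i];
    return acc;
}

/* In the real test bench this body belongs in main(): Vivado HLS
 * cosimulation reuses main() and treats a zero return value as "pass". */
int run_testbench(void) {
    int d[10], golden = 0;
    for (int i = 0; i < 10; i++) {
        d[i] = i;
        golden += i;
    }
    if (dut_sum(d) != golden) {
        printf("FAIL\n");
        return 1;   /* nonzero return => cosimulation reports a failure */
    }
    printf("PASS\n");
    return 0;
}
```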
In System Generator:
1. Use XilinxBlockAdd
2. Select Vivado_HLS block type
3. Browse to the solution directory

Solution directories – There can be multiple solutions for each project; each solution is a different implementation of the same (project) source code.

Project Directory – Top-level project directory (there must be one), containing the solutionN directories; each solution holds the impl (with its ip sub-directory), syn, sim and sysgen output directories.
RTL Export for Implementation Click on Export RTL – Export RTL Dialog opens
Select the desired output format
Optionally, configure the output Select the desired language Optionally, click on Vivado RTL Synthesis and Place and Route options for invoking implementation tools from within Vivado HLS Click OK to start the implementation
RTL Export (Place and Route Option) Results Impl directory created – Will contain a sub-directory for each RTL which is synthesized
Report – A report is created and opened automatically
RTL Export Results (Place and Route Option Unchecked) Impl directory created – Will contain a sub-directory for both VHDL and Verilog along with the ip directory
No report will be created Observe the console – No packing, routing phases
Analysis Perspective – Perspective for design analysis – Allows interactive analysis
Performance Analysis
Resources Analysis
Command Line Interface: Batch Mode Vivado HLS can also be run in batch mode – Opening the Command Line Interface (CLI) will give a shell
– Supports the commands required to run Vivado HLS & pre-synthesis verification (gcc, g++, apcc, make)
Using Vivado HLS CLI Invoke Vivado HLS in interactive mode – Type Tcl commands one at a time
> vivado_hls –i
Execute Vivado HLS using a Tcl batch file – Allows multiple runs to be scripted and automated
> vivado_hls –f run_aesl.tcl
Open an existing project in the GUI – For analysis, further work or to modify it
> vivado_hls –p my.prj
Use the shell to launch Vivado HLS GUI > vivado_hls
Using Tcl Commands When the project is created – All Tcl commands to run the project are created in script.tcl • User specified directives are placed in directives.tcl
– Use this as a template for creating Tcl scripts • Uncomment the commands before running the Tcl script
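A minimal batch script along these lines can drive the whole flow (project name, file names, part number and clock period are placeholders — adapt them to your design; run it with vivado_hls -f script.tcl):

```tcl
# Sketch of a Vivado HLS batch script (placeholder names throughout)
open_project my.prj
set_top my_top_function
add_files src/my_top_function.c
add_files -tb test/tb_main.c

open_solution "solution1"
set_part {xc7z020clg484-1}
create_clock -period 10

csynth_design                     ;# C synthesis
cosim_design                      ;# RTL/C cosimulation
export_design -format ip_catalog  ;# package as IP
exit
```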
Help Help is always available – The Help Menu – Opens User Guide, Reference Guide and Man Pages
In interactive mode – The help command lists the man page for all commands vivado_hls> help add_files
Auto-Complete all commands using the tab key
SYNOPSIS add_files [OPTIONS] Etc…
Summary Vivado HLS can be run under Windows 7/8.1/10, Red Hat Linux, SUSE, CentOS and Ubuntu. Vivado HLS can be invoked through the GUI or the command line. The Vivado HLS project creation wizard involves – Defining project name and location – Adding design files – Specifying test bench files – Selecting clock and technology
The top-level module in testbench is main() whereas top-level module in the design is the function to be synthesized
Summary The Vivado HLS project directory consists of – a *.prj project file – multiple solution directories – Each solution directory may contain • impl, syn, and sim directories • The impl directory consists of ip, verilog, and vhdl folders • The syn directory consists of report, vhdl, and verilog folders • The sim directory consists of test bench and simulation files