FPGA2, or Hardware/Software Codesign
Bertrand Granado - Andrea Pinna, LIP6 / UPMC
E-mail: [email protected] [email protected]

Fall 2017

Outline

1. Embedded systems: introduction, definition
2. Designing embedded systems on chip with software and hardware parts (co-design): specification, implementation
3. Platforms for embedded computing
4. Application profiling: time, power consumption
5. Interlude: convolutional neural networks (CNNs)
6. Heuristics
7. Optimization algorithms
8. The Pareto front
9. Amdahl's law: speedup and efficiency of parallelism
10. Multi-criteria optimization
11. HLS

Embedded Systems - Introduction

Examples of embedded systems:
Figure: wireless sensor network
Figure: video capsule endoscopy
Figure: mobile telephony (1973)
Figure: my new friend
Figure: ABS braking system
Figure: health monitoring (copyright Sagem)

Embedded Systems - Definition

Embedded system

Embedded systems are reactive systems: "A reactive system is one which is in continual interaction with its environment and executes at a pace determined by that environment" [Bergé, 1995]. Their behaviour depends on the inputs at the current instant.

Embedded system: a first attempt at a definition

An autonomous system that interacts with its environment. An embedded system must be efficient; it is governed by constraints:
- Energy → low power consumption
- Code size → limited memory resources
- Time → real-time constraints
- Area → limited physical space
- Cost → integration into consumer devices
- Specificity → dedicated to particular applications
The behaviour is known at design time, which helps minimize resources and maximize robustness.
Dedicated user interface (not necessarily a mouse, keyboard or screen...).

Embedded systems: the energy constraint

Dissipated power:
  Pdis = Psta + Pdyn
  Psta = Ioff * VDD
  Pdyn = Fc * CL * VDD^2

Low power:
- Psta: technological optimization
- Pdyn: technological and architectural optimization
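As an order-of-magnitude illustration of the formula above (added here, not part of the original slides), the short C program below evaluates Pdyn = Fc * CL * VDD^2; the frequency, capacitance and voltage values are invented for the example.

#include <stdio.h>

/* Illustrative sketch: evaluate the dynamic-power formula
 * Pdyn = Fc * CL * VDD^2 for hypothetical parameter values. */
int main(void)
{
    double Fc  = 100e6;   /* switching frequency: 100 MHz (assumed)      */
    double CL  = 10e-9;   /* total switched capacitance: 10 nF (assumed) */
    double Vdd = 1.2;     /* supply voltage: 1.2 V (assumed)             */

    double Pdyn = Fc * CL * Vdd * Vdd;          /* dynamic power, watts  */
    printf("Pdyn at 1.2 V = %.2f W\n", Pdyn);

    /* Lowering VDD reduces dynamic power quadratically. */
    double Pdyn_low = Fc * CL * 1.0 * 1.0;
    printf("Pdyn at 1.0 V = %.2f W (%.0f%% of the 1.2 V figure)\n",
           Pdyn_low, 100.0 * Pdyn_low / Pdyn);
    return 0;
}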

Embedded systems: code size

- Limited memory size, imposed by the environment
- Bounded memory size: impossible to increase it
- Need to optimize memory usage: nothing superfluous
- Compiler optimization options (gcc -O2, -O3) can help, but without any guarantee

Embedded systems: time

An embedded system must often meet real-time constraints. A real-time system must react to a stimulus within a time interval determined by the environment. There are two kinds of real time:
- Hard real time (latency): a real-time system that produces a correct answer, but too late, is wrong. The response time of a hard real-time system cannot be statistical; it is a worst case (WCET: Worst-Case Execution Time). "A real-time constraint is called hard, if not meeting that constraint could result in a catastrophe" [Kopetz, 1997].
- Soft real time (throughput)

Hard real time - Figure: aircraft control
Soft real time - Figure: digital TV

Embedded systems: area

Form factor
Figure: spy camera
Figure: cochlear implant

Embedded systems: cost

Where did my euros go?
Figure: connected objects
Figure: a huge market!

Embedded systems: specificity

Depends on the application domain:
- aeronautics and aerospace
- automotive
- biomedical
- e-health
- robotics
- sensor networks
- ...

Embedded system: a second attempt at a definition

Not every embedded system has all of these characteristics.
Definition: an information-processing system that exhibits most of these characteristics is called an embedded system.

Designing embedded systems on chip with software and hardware parts (co-design) - Specification

Embedded systems on chip: specification

Observation: a human cannot grasp a system containing more than about 5 to 10 objects, and most embedded systems manipulate far more objects than that. Hence the need for:
- readability
- portability and flexibility

Designing embedded systems on chip with software and hardware parts (co-design) - Implementation

Embedded systems: implementation

Choose an electronic architecture on which to implement the embedded system, and use a methodology to:
- deploy it
- choose the architecture
- explore the space of architectural solutions
- be able to define it!
→ Implementation as a system on chip

Figure: a system on chip, or SoC (System on Chip) (Cours de )

Embedded systems: choosing an architecture

Choose a hardware architecture made of:
- general-purpose programmable blocks: CPU, GPU, DSP
- specialized or dedicated blocks: FPGA, ASIC
- communication buses
SoC = these resources coexist on a single chip and are taken into account globally for the software/hardware implementation.

Embedded systems: choosing an architecture

ASIC - Figure: ASIC wafer
FPGA - Figure: FPGA
CPU - Figure: AMD K10

Figure: performance vs. flexibility comparison

Platforms for embedded computing

Arduino Yun
Samsung Artik 10
Raspberry Pi
Xilinx Zynq UltraScale+
Intel/Altera Cyclone

Embedded systems: design methodology

A procedure for designing a system; understanding a methodology helps guarantee that the design is done safely.
Design flow: the set of compilers, software development tools, computer-aided design (CAD) tools, etc., that make it possible to:
- help automate the steps of the methodology;
- keep track of how the methodology is applied (version control, reports, faster iterations).

Goals - satisfy:
- performance: overall speed, deadlines
- functionality and user interface
- manufacturing cost
- power consumption
- miscellaneous requirements (physical size, ...)

Method: a problem-solving technique characterized by a set of well-defined rules that lead to a correct solution.
Methodology: a structured, coherent set of models, methods, guidelines and tools from which the way to solve a problem can be derived.
Model: a representation of a partial, coherent aspect of the real world; it precedes any decision or opinion and is built to answer the question that drives the development of a system.

A bit of history:
- 1970s-80s: full custom - schematics, mask layout, electrical simulation
- 1980s-90s: standard cells, FPGAs - reuse of elementary building blocks, modelling, simulation
- 2000s onward: SoC - reuse of hardware and software, co-design, verification

Embedded systems: the notion of IP

Speeding up system-on-chip design:
- reuse blocks already designed in-house;
- use macro-cell generators (RAMs, multipliers, ...);
- buy blocks designed outside the company.

IPs are complex, reusable functional blocks:
- hardware (hard) IP: already implemented, technology-dependent, highly optimized
- software (soft) IP: described in a high-level language (VHDL, Verilog, C++, ...), parameterizable
- standardized interfaces (e.g., OCP)
- development environment (co-design, co-specification, co-verification)
- average performance (little optimization)

Embedded systems: using IP

For a reusable block (IP), you must:
- know its functionality
- estimate its performance within the system
- be confident that the IP works correctly
- integrate the IP into the system
- validate the system

Embedded systems: using IP - soft, firm and hard IP

Design flow: system design → RTL design → synthesis → floorplanning → placement → routing → verification

                  Soft IP                   Firm IP                        Hard IP
Representation    Behavioural, RTL          RTL, blocks, netlist           Regular polygons (layout)
Libraries         -                         Reference (timing, layout)     Process-specific, design rules
Technology        Technology-independent    Generic technology             Fixed technology
Portability       Unlimited                 Portable onto a library        Process-dependent

Co-design

The heart of the problem (definition to be added).

Definition:
- design of macro- or micro-scale systems that integrate both software parts (running on processors or DSPs) and IP blocks (implemented on FPGAs or ASICs);
- joint design of the software and hardware components;
- unification of the software and hardware design paths, which are usually kept separate.

Definition: a design methodology that supports cooperative and concurrent development of the software and hardware parts (co-specification, co-development and co-verification) in order to obtain shared functionality and reach the expected performance [R. Gupta and G. De Micheli, "Hardware-Software Cosynthesis for Digital Systems", IEEE Design and Test of Computers, 1993, pp. 29-41].

Toy example: measuring the speed of a wheel

Constraints: area: 40 units; time: 100 cycles.
Possible implementations: processors only, dedicated hardware only, or a combination of processor and dedicated hardware.

Software implementation on processors:
- area: 48 units > 40 units
- time: 132 cycles > 100 cycles
- development: 2 months
Both constraints are violated.

Hardware implementation on ASICs or FPGAs:
- area: 24 units < 40 units
- time: 54 cycles < 100 cycles
- roughly 40% margin with respect to the time and area constraints
- development: 9 months (too long in a hyper-competitive market)

Combined implementation, software on a processor plus hardware on ASICs or FPGAs:
- area: 37 units < 40 units
- time: 97 cycles < 100 cycles
- development: 3.5 months
Not as efficient as the purely hardware implementation, but it meets the constraints: a good trade-off.

Motivation for co-design

- Reach the expected performance by moving bottlenecks from software to hardware.
- Use hardware to meet timing and area constraints that a general-purpose processor cannot satisfy.
- With a static hardware implementation, limited resources make it impossible to put everything in hardware; with a dynamically reconfigurable implementation, this statement has to be reconsidered.
- Some parts of an application (control, for example) are better suited to sequential execution on a general-purpose processor.
- Today many systems are embedded, which requires both software and hardware parts.

- System complexity and functionality grow at a fast pace, and Systems on Chip (SoCs) are emerging.
- It is difficult, if not impossible, to design, implement and test ad hoc systems in an acceptable time, even with the most advanced standard CAD tools.
- Solution: take advantage of previously designed blocks (IPs) and of proven processors to shorten design time and increase reliability.

The designer productivity gap (ITRS).

Trade-offs and decisions

Starting from a set of specified constraints and mastered technologies, designers must find the trade-offs that make the software and hardware components work together.
Decisions, constraints and evaluation criteria: performance, area, power consumption, programmability, development effort, manufacturing cost, reliability, robustness, maintenance, evolution.

Co-design: research

Research in co-design spans several fields, such as:
- system specification and modelling
- design exploration
- partitioning
- scheduling
- co-verification and co-simulation
- hardware and software code generation
- hardware/software interfacing
The common goal is to develop a unified methodology for building systems that contain both hardware and software.

A simple approach

Application profiling: time, power consumption

Profiling and partitioning - benefits:
- speedups of 10x to 200x, with up to 800x possible
- much more potential than dynamic software optimizations (inside the processor, loop unrolling, software pipelining, ...)
- energy consumption reduced by 25% to 95%

Profiling

Profiling shows where, in terms of code, a program spends its time, and which function calls which other function during execution. Profiling is based on data collected while the application runs, so it can be used to analyse programs that are too complex to study by reading the source. The profile highlights the pieces of code where the program is slower than expected; those pieces are good candidates for:
- an optimized rewrite
- a move to hardware

Profiling: how?

With gcc, first compile and link the program with the profiling options enabled:
  gcc -o myprog.exe myprog.c utils.c -g -pg

Then run the program to collect the execution-profile data; the program writes the collected data to a file named gmon.out just before it exits.

You can then use gprof to analyse the collected data:
  gprof [options] myprog.exe gmon.out > outfile
gprof produces a flat profile and a call graph.
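As a minimal, self-contained illustration of this flow (added here, not taken from the course), the toy program below contains a deliberately expensive function; the file name myprog.c and the function names are invented. Built with -pg and analysed with gprof as shown above, hot_loop should appear at the top of the flat profile.

/* myprog.c - toy program to illustrate gcc -pg / gprof profiling.
 * Build:  gcc -o myprog.exe myprog.c -g -pg
 * Run:    ./myprog.exe          (writes gmon.out on exit)
 * Report: gprof -b myprog.exe gmon.out > outfile
 */
#include <stdio.h>

/* Deliberately expensive function: should dominate the flat profile. */
static double hot_loop(int n)
{
    double acc = 0.0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            acc += 1.0 / (i + j);
    return acc;
}

/* Cheap function, for contrast in the call graph. */
static double cold_part(int n)
{
    double acc = 0.0;
    for (int i = 1; i <= n; i++)
        acc += i;
    return acc;
}

int main(void)
{
    printf("hot  = %f\n", hot_loop(3000));
    printf("cold = %f\n", cold_part(3000));
    return 0;
}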

Profiling: useful gprof options

- -e function_name: tells gprof not to print information about function_name (and its children...) in the call graph
- -f function_name: restricts the call-graph analysis to function_name and its children
- -b: gprof omits the explanatory text describing the fields of the tables

Profiling: the flat profile

- % time: percentage of the total execution time the program spent in this function.
- cumulative seconds: running total, in seconds, of the CPU time spent in this function plus the functions listed above it in the table.
- self seconds: time in seconds spent in this function alone.
- calls: total number of times this function was called.
- self ms/call: average time in milliseconds spent in this function per call.
- total ms/call: average time in milliseconds spent in this function and its descendants per call.
- name: name of the function.

Weaknesses of this first approach

- Some functions are not trivial to implement in hardware.
- Decisions taken too early in the flow may not be optimal.
- Communication and interfacing are not taken into account at all.
- If the application changes, profiling and then partitioning have to be run again.

Co-design: a design workbench

Partitioning and scheduling

Task partitioning and scheduling are essential in many applications of system co-design, for multiprocessors and for reconfigurable systems. The tasks identified in the initial description of the application must be implemented:
- in the right place (partitioning)
- at the right time (scheduling)
Both problems are well known and have been shown to be NP-complete. Heuristic optimization techniques are therefore generally used to explore the solution space, where near-optimal solutions can be found.

What to optimize during partitioning:
- minimize communication over the bus
- extract as much parallelism as possible → have the hardware (FPGA/ASIC) and the software (processor) run simultaneously
- extract as much performance as possible from the processor

Heuristics: Fiduccia-Mattheyses

Task graph and cost function. The partition of the task graph is improved move by move; at each step the number of cut edges and the cost are re-evaluated (values from the slide animation):
- initial partition: number of cut edges = 5, cost = 8
- after one move: number of cut edges = 3, cost = 0
- after another move: number of cut edges = 2, cost = -4
The Fiduccia-Mattheyses heuristic moves one task at a time across the partition, each time choosing the move with the best gain.
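To make the cut and gain numbers above concrete, here is a small C sketch added for illustration: the task graph and the initial partition are invented, and the "gain" of a move follows the spirit of Fiduccia-Mattheyses; the exact cost function used on the slide is not reproduced.

#include <stdio.h>

#define NTASKS 6
#define NEDGES 7

/* Edge list of a small, invented task graph. */
static const int edge[NEDGES][2] = {
    {0,1}, {0,2}, {1,3}, {2,3}, {2,4}, {3,5}, {4,5}
};

/* part[i] = 0 -> software, 1 -> hardware. */
static int cut_size(const int part[NTASKS])
{
    int cut = 0;
    for (int e = 0; e < NEDGES; e++)
        if (part[edge[e][0]] != part[edge[e][1]])
            cut++;
    return cut;
}

/* FM-style gain of moving task t to the other side:
 * (edges it currently cuts) - (edges it would cut after the move). */
static int move_gain(int part[NTASKS], int t)
{
    int before = cut_size(part);
    part[t] ^= 1;
    int after = cut_size(part);
    part[t] ^= 1;                 /* undo the tentative move */
    return before - after;
}

int main(void)
{
    int part[NTASKS] = {0, 0, 1, 1, 0, 1};   /* arbitrary initial partition */
    printf("initial cut = %d\n", cut_size(part));
    for (int t = 0; t < NTASKS; t++)
        printf("gain of moving task %d: %d\n", t, move_gain(part, t));
    return 0;
}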

Optimization algorithms: simulated annealing

Simulated annealing (Kirkpatrick, 1983):
- inspired by statistical physics and the cooling of metals
- allows moves that degrade the solution, with a probability that depends on a temperature:
    Paccept = exp(-δE / T)
- if the energy decreases, the system accepts the perturbation
- if the energy increases, the system accepts the perturbation with probability Paccept

Algorithm 1: simulated annealing

1: select an initial solution s
2: select an initial temperature T > 0
3: while the stopping condition is not met do
4:     select s' at random in N(s); compute δ = f(s') - f(s)
5:     if δ < 0 then s = s'
6:     else x = random([0, 1])
7:          if x < exp(-δ / T) then s = s'
8:     update the temperature T
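A compact C sketch of the algorithm above, applied to a toy two-way hardware/software partitioning problem; the cost function, cooling schedule and all constants are invented for the illustration and are not the course's reference implementation.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NTASKS 8

/* Invented per-task costs: hw[i]/sw[i] = cost of running task i in
 * hardware/software; the objective below just sums the chosen costs. */
static const double hw[NTASKS] = {2, 5, 1, 8, 3, 6, 2, 4};
static const double sw[NTASKS] = {6, 4, 3, 9, 2, 9, 5, 3};

static double cost(const int part[NTASKS])   /* part[i] = 1 -> hardware */
{
    double c = 0.0;
    for (int i = 0; i < NTASKS; i++)
        c += part[i] ? hw[i] : sw[i];
    return c;
}

int main(void)
{
    int s[NTASKS] = {0};                     /* initial solution: all software */
    double T = 10.0;                         /* initial temperature            */
    srand(42);

    while (T > 1e-3) {                       /* stopping condition             */
        for (int k = 0; k < 100; k++) {
            int t = rand() % NTASKS;         /* random neighbour: flip one task */
            double before = cost(s);
            s[t] ^= 1;
            double delta = cost(s) - before;
            if (delta >= 0) {                /* worse move: accept with prob. exp(-delta/T) */
                double x = (double)rand() / RAND_MAX;
                if (x >= exp(-delta / T))
                    s[t] ^= 1;               /* reject: undo the move          */
            }
        }
        T *= 0.95;                           /* cool down                      */
    }
    printf("final cost = %.1f\n", cost(s));
    for (int i = 0; i < NTASKS; i++)
        printf("task %d -> %s\n", i, s[i] ? "hardware" : "software");
    return 0;
}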

Optimization algorithms: greedy algorithms

Optimization algorithms: SynDEx

Optimization algorithms: genetic algorithms

Vulcan

Gupta and De Micheli, Stanford University.
Primal approach:
1. initially there are only hardware IPs;
2. iteratively, some IPs are moved to software to reduce the cost.
Uses a hardware specification language, HardwareC, compiled into a dataflow graph.

Definition of the dataflow graph: a variant of a task graph.
- Nodes represent operations, typically low-level operations such as additions, multiplications, ...
- Edges represent data dependencies; each edge is labelled with a boolean giving the condition for moving from one node to the next.

The dataflow graph:
- is executed periodically;
- may carry, for each node, timing constraints: T(vj) ≥ T(vi) + lij and T(vj) ≤ T(vi) + uij;
- may carry, for each node, rate constraints: mi ≤ Ri ≤ Mi.

Co-synthesis algorithm in Vulcan:
- the partitioning quantum is the thread;
- the algorithm cuts the dataflow graph into threads and allocates them onto the resources;
- a thread boundary is determined:
  - always by a non-deterministic delay element, such as an event on an external variable;
  - sometimes by other points of the dataflow graph.
Target architecture: a processor plus hardware accelerators.

Cosyma

- Unified representation: ES graph (a CDFG).
- Partitioning: a combined method based on user-guided partitioning driven by a cost function, plus finer partitioning performed by a simulated-annealing algorithm.
- Scheduling: no specific method.
- Modelling: models written in C++.
- Validation: simulation based on executables written in C++.
- Main emphasis on partitioning towards hardware accelerators.

Developed at the Technical University of Braunschweig, Germany.
An experimental system for the co-design of small real-time embedded systems:
- implement as many operations as possible in software on a processor;
- generate a hardware accelerator only when a timing constraint is violated.
Target architecture: a RISC processor plus hardware accelerators.
Communication between the hardware IPs and the software IPs goes through a shared memory with a CSP-style sequential communication protocol (CSP: Communicating Sequential Processes).

The system is described in the C* language. This description is translated into Cosyma's internal graph representation, which supports:
- partitioning;
- generation of the hardware accelerators when software is migrated to hardware.
The internal representation combines a control graph and a dataflow graph into an extended syntax (ES) graph: a syntax graph, a symbol table, and local data/control dependencies.

Cosyma: cost of moving a basic block b to hardware

  Δc(b) = w · (tHW(b) − tSW(b) + tCOM(Z) − tCOM(Z ∪ {b})) · It(b)

with
  w      a fixed weight
  tHW    the hardware execution time
  tSW    the software execution time
  tCOM   the communication time
  Z      the set of blocks already mapped to hardware
  It(b)  the number of iterations of b
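Purely as an illustration of how this cost could be evaluated (the function below follows the formula as reconstructed above; the numeric values are invented and this is not Cosyma code):

#include <stdio.h>

/* Sketch of Cosyma's partitioning cost for moving block b to hardware:
 * delta_c(b) = w * (tHW(b) - tSW(b) + tCOM(Z) - tCOM(Z u {b})) * It(b)
 * All inputs are assumed to come from profiling and estimation. */
static double delta_c(double w,
                      double t_hw, double t_sw,
                      double t_com_Z, double t_com_Zb,
                      double iterations)
{
    return w * (t_hw - t_sw + t_com_Z - t_com_Zb) * iterations;
}

int main(void)
{
    /* Invented example: the block is 40 time units faster in hardware,
     * communication time grows slightly, and it runs 1000 times.
     * The interpretation of the sign follows Cosyma's convention. */
    double dc = delta_c(1.0, 10.0, 50.0, 120.0, 135.0, 1000.0);
    printf("delta_c(b) = %.1f\n", dc);
    return 0;
}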

Cosyma: communication time after moving block b to hardware

  tCOM(Z ∪ {b}) = tCOM(Z) − ( Σ_{a ∈ Z} C_a,b − Σ_{d ∉ Z} C_d,b ) · tTRANS

with C_x,b the number of data transfers between blocks x and b, tTRANS the time of one transfer, Z the current hardware partition and d ranging over the blocks that remain in software.
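The communication-time update can be evaluated directly from a block-to-block transfer-count matrix; the sketch below applies the formula above to invented data and is not Cosyma code.

#include <stdio.h>

#define NBLOCKS 5

/* C[x][y]: number of data transfers between blocks x and y (invented). */
static const double C[NBLOCKS][NBLOCKS] = {
    {0, 4, 0, 2, 0},
    {4, 0, 3, 0, 1},
    {0, 3, 0, 5, 0},
    {2, 0, 5, 0, 2},
    {0, 1, 0, 2, 0},
};

/* tCOM(Z u {b}) = tCOM(Z) - (sum_{a in Z} C[a][b] - sum_{d not in Z, d != b} C[d][b]) * t_trans */
static double tcom_after_move(double tcom_Z, const int inZ[NBLOCKS],
                              int b, double t_trans)
{
    double to_hw = 0.0, to_sw = 0.0;
    for (int x = 0; x < NBLOCKS; x++) {
        if (x == b) continue;
        if (inZ[x]) to_hw += C[x][b];   /* communication with blocks already in hardware */
        else        to_sw += C[x][b];   /* communication with blocks staying in software */
    }
    return tcom_Z - (to_hw - to_sw) * t_trans;
}

int main(void)
{
    int inZ[NBLOCKS] = {1, 0, 1, 0, 0};    /* current hardware partition Z (invented) */
    double tcom_Z = 40.0, t_trans = 2.0;   /* current comm. time, per-transfer time   */
    int b = 3;                             /* candidate block to move to hardware     */
    printf("tCOM(Z u {b}) = %.1f\n", tcom_after_move(tcom_Z, inZ, b, t_trans));
    return 0;
}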

Cosyma (figure)

Vilfredo Pareto, Italian economist, 1848-1923

Studied the distribution of income.
"Ecrits sur la courbe de la répartition des richesses", edited by Giovanni Busino, Librairie Droz, Geneva, 1965.
http://books.google.fr/books?hl=fr&lr=&id=CP4a4VSJO0QC&oi=fnd&pg=PA1&dq=authornbsp:pareto&ots=sgU2aq9axn&sig=oeFCRb5Kc71y0JjSmeYfcXNym24#v=onepage&q&f=false
Data: income of the population of England (Griffin).
Figures: Pareto's observation; validation of Pareto's observation.

The Pareto principle

A consequence of Pareto's observation: 20% of a country's population owns 80% of its wealth.
Joseph Juran, 1904 (Romania) - 2008 (USA), inventor of quality management, generalized it: 20% of the causes produce 80% of the effects.
- 20% of the causes → 80% of the production defects
- 20% of the customers → 80% of the revenue
- 20% of the customers → 80% of the complaints
Joseph Juran, "Universals in Management, Planning and Controlling", The Management Control, 1954.

Pareto chart

Introduced by Joseph Juran: a histogram of the causes sorted in decreasing order. It distinguishes:
- the 20% most important causes, which produce 80% of the effects;
- the secondary causes, which produce the remaining effects.

Pareto efficiency

Theoretical economics - allocation of goods:
- distribution of goods and services among agents;
- distribution of production factors among industrial agents;
- distribution of income among agents.

Given an allocation of goods, a Pareto improvement is an allocation that gives more goods to at least one person without reducing the goods given to the others.
A Pareto-efficient allocation, or Pareto optimum, is an allocation to which no Pareto improvement can be applied.

Fundamental theorems of welfare economics:
- TH1: in a perfect economy, every equilibrium state is a Pareto optimum.
- TH2: in a perfect economy, for every Pareto optimum there is an initial allocation whose equilibrium state is that optimum.
Note: a Pareto optimum is efficient, but it can be unequal.

Example (guns vs. butter):
- Allocation A: weapons production can be increased without decreasing butter production, so a Pareto improvement is possible.
- Allocations B, C, D: increasing one production forces the other to decrease, so no Pareto improvement is possible.

Design space exploration

Comparison metrics:
- IP size: basic logic elements of the FPGA, percentage of FPGA resources used
- processing speed: critical path, operating frequency
- energy: number of transistors, transistor size, supply voltage

Design space exploration (architecture exploration): vary design parameters such as
- the degree of parallelism
- the data word width
- the hardware/software partitioning
- ...

Pareto front
Figure: latency (processing time) versus size (number of cells) for the different designs; the Pareto front is the curve of the best trade-offs, made of the Pareto-optimal points.
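A minimal sketch, with invented design points, of how the Pareto-optimal designs can be filtered out of a set of explored designs when both latency and size are to be minimized:

#include <stdio.h>

/* One explored design point: both metrics are to be minimized. */
struct design { const char *name; double latency; double size; };

static const struct design pts[] = {
    {"A", 10.0, 900.0}, {"B", 12.0, 500.0}, {"C", 20.0, 450.0},
    {"D", 25.0, 200.0}, {"E", 22.0, 600.0}, {"F", 40.0, 210.0},
};
#define NPTS (sizeof pts / sizeof pts[0])

/* p dominates q if it is no worse on both metrics and better on at least one. */
static int dominates(const struct design *p, const struct design *q)
{
    return p->latency <= q->latency && p->size <= q->size &&
           (p->latency < q->latency || p->size < q->size);
}

int main(void)
{
    for (size_t i = 0; i < NPTS; i++) {
        int dominated = 0;
        for (size_t j = 0; j < NPTS && !dominated; j++)
            if (j != i && dominates(&pts[j], &pts[i]))
                dominated = 1;
        if (!dominated)
            printf("%s (latency %.0f, size %.0f) is on the Pareto front\n",
                   pts[i].name, pts[i].latency, pts[i].size);
    }
    return 0;
}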

Example of Pareto fronts

Figure: generation by genetic algorithms; left, the initial state; right, the resulting Pareto fronts.
"Pareto Front Generation for a Tradeoff between Area and Timing", M. Holzer and B. Knerr, Vienna University of Technology, Austrochip 2006, Vienna, Austria, 11.10.2006 (copyright IEEE 2006).

Illustration of Amdahl's law
Figure: speedup as a function of the sequential fraction of the code, for 100 PEs.

Implications of Amdahl's law

Take a program containing 10% of purely sequential code (τseq = 0.1), and a parallel machine with 100 processors that could ideally speed a program up by a factor of 100. Then the speedup will be lower than 10, whatever the number of processors. Indeed:

  Sp = 1 / (τseq + (1 − τseq)/p) = 1 / (0.1 + 0.9/100) ≈ 9.2,  and  Sp ≤ 1/τseq = 1/0.1 = 10.

Generalized Amdahl's law

A program contains a part that can be accelerated and a part that cannot (Hennessy & Patterson). Writing tbefore for the original execution time, timpr for the time spent in the part that is improved, tunch for the rest, and p for the acceleration factor:

  tbefore = tunch + timpr
  tafter  = tunch + timpr / p
  τimpr   = timpr / tbefore

  Sap = tbefore / tafter = tbefore / (tbefore − timpr + timpr/p) = 1 / ((1 − τimpr) + τimpr / p)

Illustration of the generalized Amdahl's law

Crossing the sierra (fixed duration) and then the desert (can be sped up):

Transport   Sierra time   Desert time   Total time   Desert speedup   Overall speedup
On foot     20 h          50 h          70 h         1                1
Bicycle     20 h          20 h          40 h         2.5              1.8
Ferrari     20 h          1.7 h         21.7 h       30               3.2

Applications of the generalized Amdahl's law

Memory hierarchy: a cache is 5 times faster than main memory and, thanks to locality of reference, is used 90% of the time. Speed gain due to the cache: about 3.6.

  Sap = 1 / ((1 − τimpr) + τimpr/p) = 1 / (1 − 0.9 + 0.9/5) = 1 / (0.1 + 0.18) ≈ 3.6

Code optimization: a program spends 90% of its execution time in 10% of the code, and those 10% are sped up by a factor of 3 by optimizing the source code. The complete program is sped up by a factor of 2.5.

  Sap = 1 / ((1 − τimpr) + τimpr/p) = 1 / (1 − 0.9 + 0.9/3) = 1 / (0.1 + 0.3) = 2.5

Applications of the generalized Amdahl's law

Co-design: a program spends 80% of its execution time in 20% of the code, and those 20% of the code are sped up by a factor of 50. What is the speedup of the complete program?
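The three applications can be checked, and the co-design question answered, by evaluating the generalized formula directly; the small C program below is an illustration added to these notes, not part of the slides.

#include <stdio.h>

/* Generalized Amdahl's law: overall speedup when a fraction `tau`
 * of the execution time is accelerated by a factor `p`. */
static double amdahl(double tau, double p)
{
    return 1.0 / ((1.0 - tau) + tau / p);
}

int main(void)
{
    printf("cache      (tau=0.9, p=5):  %.2f\n", amdahl(0.9, 5.0));   /* ~3.6 */
    printf("code opt.  (tau=0.9, p=3):  %.2f\n", amdahl(0.9, 3.0));   /* 2.5  */
    printf("co-design  (tau=0.8, p=50): %.2f\n", amdahl(0.8, 50.0));  /* ~4.6 */
    return 0;
}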

The Pareto front

Pareto's observation
Vilfredo Pareto was an Italian economist (1848-1923) who studied the distribution of wealth in European cities.
Book: "Ecrits sur la courbe de la répartition des richesses".

Figures: Pareto's observation (income distribution data).

Statement of the Pareto principle

Pareto's observation: 20% of a country's population owns 80% of its wealth.
This principle was brought into quality management by Joseph Juran (1904-2008):
- 20% of the causes produce 80% of the effects
- 20% of the causes produce 80% of the production defects
- 20% of the customers generate 80% of the revenue
- 20% of the customers generate 80% of the complaints

Amdahl's law - Speedup and efficiency of parallelism

Parallelism: speedup and efficiency

The speedup (gain) expresses the acceleration:
  G = T_sequential / T_parallel

The efficiency expresses how effectively the available resources are used:
  E = G / (number of resources)

Parallelism - Amdahl's law

Question: with 100 processors, do I go 100 times faster than with a single one?
Answer: no; there is always a sequential part of the program that cannot be parallelized.

Example:
- a program of 20 instructions, each lasting 1 cycle
- 30% of the instructions are sequential
- duration of the program with 1 processor: 20 cycles
- ideal duration with 20 processors: 1 cycle
- actual duration with 20 processors: 7 cycles
- G = 20/7 = 2.85
- E = 2.85/20 = 0.142

Parallelism - Amdahl's law

Speedup:
  Acc = 1 / ((1 − P) + P/N)
with P the fraction of parallel code and N the number of processors.

Parallelism - Amdahl's law: derivation

  T    = S + P                     (total time: sequential part S plus parallelizable part P)
  T(N) = S + P/N                   (time on N processors)
  G    = T / T(N) = (S + P) / (S + P/N)

Normalizing the total time to 1, so that S = 1 − P:

  G = ((1 − P) + P) / ((1 − P) + P/N) = 1 / ((1 − P) + P/N)
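A quick numerical check of this formula, added as an illustration: for the 20-instruction example earlier in this section, the closed-form value 1/(0.3 + 0.7/20) ≈ 2.99 differs slightly from the discrete count 20/7 ≈ 2.86 because the example counts whole cycles.

#include <stdio.h>

/* Amdahl's law: G = 1 / ((1 - P) + P/N), efficiency E = G / N. */
static double speedup(double P, int N) { return 1.0 / ((1.0 - P) + P / N); }

int main(void)
{
    double P = 0.7;            /* 70% of the code is parallelizable */
    int    N = 20;             /* 20 processors                     */
    double G = speedup(P, N);
    printf("P = 0.7, N = 20:  G = %.2f, E = %.3f\n", G, G / N);

    /* With 100 processors and only 10% sequential code, G stays below 10. */
    printf("P = 0.9, N = 100: G = %.2f\n", speedup(0.9, 100));
    return 0;
}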

HLS

Slides by Syed Zahid Ahmed and by Xilinx.

What is ESL?

• ESL (Electronic System Level): http://en.wikipedia.org/wiki/Electronic_system-level_design_and_verification
  A design and verification methodology for system design at a higher abstraction level. The term was coined by Gartner Dataquest for industrial market analysis.

• HLS (High Level Synthesis): http://en.wikipedia.org/wiki/High-level_synthesis
  Hardware/system design at a higher level of abstraction. It is essentially the more scientific/technical name for the ESL tools that mostly (but not only) synthesize standard C/C++/SystemC into RTL (VHDL/Verilog).

• Some state-of-the-art C/C++/SystemC → RTL tools:
  – Synopsys: Synphony (former Synfora's Pico)
  – Cadence: C-to-Silicon
  – Calypto: Catapult C (former Mentor's Catapult C)
  – Xilinx: VivadoHLS (former AutoESL's AutoPilot)

• Academic research
  – Some current examples include LegUp from the University of Toronto, etc.
  – Former examples include AutoPilot from UCLA (the origin of AutoESL!), etc.

HLS tools: historical overview

A nice industrial survey on the topic: Grant Martin, Gary Smith, "High-Level Synthesis: Past, Present and Future", IEEE Design and Test of Computers, July/Aug. 2009.
http://cas.et.tudelft.nl/education/courses/et4054/2009_PAPER_High-Level_Synthesis_Past,_Present,_and_Future.pdf

Issues with the early tools: custom languages, and the wrong people were targeted (HLS-to-gates flows in the beginning tools...).

FPGAs vs coarse-grain arrays (a special form of HLS): the story

Coarse-grain architectures suffered a solo adventure in the desert, a simultaneous war at three frontiers: new hardware, new language, no IP. It is difficult to enter and win in industry like this. FPGAs, by contrast, have enjoyed a nice-party scenario, sheltered under the umbrella of Moore's law, RTL and ANSI C/C++ (with ESL and the many custom C dialects on top).

The diverse solutions of coarse-grain arrays have kept them difficult to adopt widely in industry: no or partial reuse of designs, scarce IP leverage and non-standard programming make them a risky investment option for companies compared to FPGAs.

Survey of new trends in industry for programmable hardware: FPGAs, MPPAs, MPSoCs, Structured ASICs, eFPGAs and a new wave of innovation in FPGAs. Syed Zahid Ahmed, Gilles Sassatelli, Lionel Torres, Laurent Rougé. FPL-2010 (http://conferenze.dei.polimi.it/FPL2010/presentations/W1_B_1.pdf)

Modern ESL tools • Standard C/C++/SystemC input • Constraints infrastructure to optimize implementations • Leveraging the mature RTL design flow and tools


Outline • ESL – Fundamentals & Historical overview – Current ESL tools/vendors

• Xilinx Vivado tools suite Basics – ISE vs Vivado

• Xilinx VivadoHLS – Fundamentals and flow overview – Strengths & Limitations – Code examples & Xilinx Videos

• Image Processing IPs ESL exploration work in SagemCom Project – ESL vs Hand coded IP


Xilinx Vivado vs ISE

ISE: all FPGAs up to the 7 Series and Zynq. Vivado: 7 Series/Zynq and future devices.

• New Xilinx tool suite with an IP-centric design environment for 7 Series/Zynq and future devices, with the built-in ESL tool VivadoHLS
• New algorithms for faster place and route, improved design flow vs ISE
• New SDC-based constraint system (.ucf is obsolete), improved facilities for timing/power analysis...
http://www.xilinx.com/support/documentation/white_papers/wp416-Vivado-Design-Suite.pdf

Multiple Xilinx video tutorials for Vivado are available at:
http://www.xilinx.com/training/vivado/index.htm
http://www.youtube.com/playlist?list=PL35626FEF3D5CB8F2&feature=plcp

ETIS project: SagemCom SuperSoC

Figure: SuperSOC_v1 block diagram on an Arria II GX125 FPGA: a Nios system with 128 KB of on-chip SRAM, an Altera performance counter, a DDR2 controller (1 GB) and a PLL, plus the image-processing IPs (ImgColor, GrayScale, Binarize, BinarizeCount, SelectiveTrc, Wiener, FindEdges, Hough, NeuralGas, Bitonal), all connected through the Avalon interconnect.

SuperSoC hardware statistics

The per-IP figures are CONFIDENTIAL; only the system-level totals for the Arria II GX125 are recoverable here:

                  ALUTs     FFs       Memory bits    DSP blocks
Total used        40,443    43,399    4,350,528      276
Device capacity   99,280    99,280    6,727,680      576
% of device       41%       44%       65%            48%

Dynamic and static power figures are statistical values from the PowerPlay tool, complemented by real-time power measurements (total and reset/static); most of them are confidential as well.

Hand-coded RTL vs ESL for selected IPs

Complementary research work:
• Exploration of the Xilinx ESL tools
• Two IPs* explored: RTL vs ESL
• Evaluations on Virtex-6 and Zynq
• C source adapted for Vivado HLS: removal of mallocs, AXI interfaces, constraints

Figure: evaluation platform on a Zynq Z20 FPGA: ARM Cortex-A9 processing system with DDR3 controller, AXI4 interconnect, AXI timer and the Vivado HLS-generated hardware IP.

Key findings:
• ESL is cool :)
• RTL vs ESL resources: same/similar
• The ESL DMA is not good in the current version
• The ARM Cortex is very powerful

Figure: development time in man-weeks; read from the bar chart, roughly 15 (IP-1 RTL) vs 3 (IP-1 ESL) and 11 (IP-2 RTL) vs 1 (IP-2 ESL).

* Names not mentioned to comply with Sagemcom agreements

Sagemcom project example: design space exploration (DSE) for IP-1

Rapid design space exploration with the quick synthesis of Vivado HLS.

IP-1: Vivado HLS results for the Zynq Z20, 256x256-pixel image

Step                                         | LUTs   | FFs    | BRAMs | DSP | ClkPr (ns) | Latency (cycles) | Speedup | Power*
S1: raw transform                            | 20,276 | 18,293 | 16    | 59  | 8.75       | 8,523,620        | NA      | 1410
S2: S1 with shared div                       |  9,158 |  7,727 | 16    | 51  | 8.75       | 8,555,668        | 1.00    | 1638
S3: S2 with div-to-mult                      |  7,142 |  6,172 | 16    | 84  | 8.75       | 3,856,048        | 2.22    | 1337
S4: S3 with buffer memory partitioning       |  4,694 |  4,773 | 16    | 81  | 8.75       | 2,438,828        | 3.51    |  953
S5: S4 with shared 32/64-bit multipliers     |  4,778 |  4,375 | 16    | 32  | 8.75       | 3,111,560        | 2.75    |  890
S6: S5 with smart mult sharing               |  4,900 |  4,569 | 16    | 40  | 8.75       | 2,910,344        | 2.94    |  918
S7: S6 with mult latency exp. (comb. mult)   |  4,838 |  4,080 | 16    | 40  | 13.56      | 1,975,952        | 4.33    |  893
S8: S6 with mult latency exp. (1-cycle mult) |  4,838 |  4,056 | 16    | 40  | 22.74      | 2,181,776        | 3.92    |  890
S9: S4 with SuperBurst                       |  4,025 |  4,357 | 80    | 79  | 8.75       | NA               | NA      |  NA
S10: S4 with smart burst                     |  3,979 |  4,314 | 24    | 80  | 8.75       | NA               | NA      |  NA
S11: S6 with smart burst                     |  4,224 |  4,079 | 24    | 39  | 8.75       | NA               | NA      |  NA

* Vivado HLS quick-synthesis power estimate (no units).

FPGA prototype of selected steps on the Zedboard (issues of timing closure and DMA efficiency).

IP-1: Zynq Z20 FPGA board results for a 256x256-pixel test image

Implementation                            | LUTs  | FFs   | BRAMs | DSP | Fmax (MHz) | Fexp (MHz) | Latency (ms) | Speedup | Power** (mW)
ARM Cortex-A9 MPCore (Core 0 only)        | NA    | NA    | NA    | NA  | NA         | 666.7      | 32.5         | 15.0    | NA
MicroBlaze (with cache and int. divider)  | 2,027 | 1,479 | 6     | 3   | NA         | 100.0      | 487.4        | 1.00    | NA
Hand-coded IP (resources for reference)   | 8,798 | 3,693 | 8     | 33  | 54.2       | *50 (NA)   | *4 (NA)      | NA      | 57
S3: S2 with div-to-mult                   | 6,721 | 6,449 | 8     | 72  | 40.2       | 40.0       | 142.4        | 3.42    | 12
S4: S3 with buffer memory partitioning    | 4,584 | 5,361 | 10    | 69  | 84.5       | 83.3       | 124.5        | 3.91    | 30
S6: S5 with smart mult sharing            | 5,156 | 4,870 | 10    | 32  | 91.3       | 90.9       | 136.9        | 3.56    | 26
S8: S6 with mult latency exp. (1 clk)     | 5,027 | 4,607 | 10    | 33  | 77.3       | 76.9       | 121.2        | 4.02    | 22
S9: S4 with SuperBurst                    | 4,504 | 5,008 | 42    | 70  | 77.6       | 76.9       | 26.5         | 18.39   | NA
S10: S4 with smart burst                  | 4,485 | 4,981 | 14    | 73  | 103        | 100        | 22.48        | 21.68   | 62
S11: S6 with smart burst                  | 4,813 | 4,438 | 14    | 33  | 101        | 100        | 33.7         | 14.46   | 42

** Statistical power from XPower.

Work published in Xilinx Xcell Journal

Published in Xilinx Xcell Journal, Issue 84, pages 34-41 (July 2013): "Vivado's ESL Capabilities Speed IP Design on Zynq SoC Project: automated methodology delivers results similar to hand-coded RTL for two image-processing IP cores", Syed Zahid Ahmed, Sébastien Fuhrmann, Bertrand Granado.
http://www.xilinx.com/publications/xcellonline/
www.xilinx.com/publications/archives/xcell/Xcell84.pdf

Conclusions • Potential for image/video processing projects:

– Rapid design space exploration for HW accelerators: in days instead of months!
– Wide range of optimization options using directives and constraints
– HW/SW co-design further simplified
– Ultra-fast verification cycle: self-testing testbench and its re-use, co-simulation…
– Experiments with dual-core ARM using Zynq
– Support for floating-point hardware
– Automatic creation of SW drivers for the IP

Syed Zahid AHMED

12/01/2017

34

References to get started

The Zynq Book (2014): www.zynqbook.com
Vivado documentation: http://www.xilinx.com/support/documentation/dt_vivado.htm
Vivado HLS User Guide: http://www.xilinx.com/support/documentation/sw_manuals/xilinx2012_4/ug902-vivado-high-level-synthesis.pdf
Vivado HLS Getting Started Tutorial: http://www.xilinx.com/support/documentation/sw_manuals/xilinx2012_4/ug871-vivado-high-level-synthesis-tutorial.pdf / Project files: https://secure.xilinx.com/webreg/clickthrough.do?cid=198573&license=RefDesLicense&filename=ug871-design-files.zip

White Papers / Application Notes:
– http://www.xilinx.com/support/documentation/application_notes/xapp745-processor-control-vhls.pdf
– http://www.xilinx.com/support/documentation/application_notes/xapp793-memory-structures-video-vivado-hls.pdf
– http://www.xilinx.com/support/documentation/application_notes/xapp599-floating-point-vivado-hls.pdf
– http://www.xilinx.com/support/documentation/white_papers/wp416-Vivado-Design-Suite.pdf

Video tutorials of Vivado and Vivado HLS:
– http://www.xilinx.com/training/vivado/index.htm
– http://www.youtube.com/playlist?list=PL35626FEF3D5CB8F2&feature=plcp

Xilinx’s Vivado HLS tutorial (2013):
– http://www.xilinx.com/support/documentation/sw_manuals/xilinx2013_2/ug871-vivado-high-level-synthesis-tutorial.pdf
– https://secure.xilinx.com/webreg/clickthrough.do?cid=338217&license=RefDesLicense&filename=ug871-designfiles.zip&languageID=1

Syed Zahid AHMED

12/01/2017

35

Introduction to High-Level Synthesis with Vivado HLS   This material exempt per Department of Commerce license exception TSU

Objectives   After completing this module, you will be able to: –  Describe the high level synthesis flow –  Understand the control and datapath extraction –  Describe scheduling and binding phases of the HLS flow –  List the priorities of directives set by Vivado HLS –  List comprehensive language support in Vivado HLS –  Identify steps involved in validation and verification flows

Intro to HLS 11- 2

© Copyright 2016 Xilinx

Outline   Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary

Intro to HLS 11- 3

© Copyright 2016 Xilinx

Need for High-Level Synthesis   Algorithmic-based approaches are getting popular due to accelerated design time and time to market (TTM) –  Larger designs pose challenges in design and verification of hardware at HDL level

Industry trend is moving towards hardware acceleration to enhance performance and productivity –  CPU-intensive tasks can be offloaded to hardware accelerator in FPGA –  Hardware accelerators require a lot of time to understand and design

Vivado HLS tool converts algorithmic description written in C-based design flow into hardware description (RTL) –  Elevates the abstraction level from RTL to algorithms

High-level synthesis is essential for maintaining design productivity for large designs Intro to HLS 11- 4

© Copyright 2016 Xilinx

High-Level Synthesis: HLS   High-Level Synthesis –  Creates an RTL implementation from C, C++, SystemC, OpenCL API C kernel code –  Extracts control and dataflow from the source code –  Implements the design based on defaults and user applied directives

Many implementations are possible from the same source description –  Smaller designs, faster designs, optimal designs –  Enables design exploration

Intro to HLS 11- 5

© Copyright 2016 Xilinx

Design Exploration with Directives   One body of code: Many hardware outcomes

The same hardware is used for each iteration of the loop: • Small area • Long latency • Low throughput

Intro to HLS 11- 6

… loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } ….

Different hardware is used for each iteration of the loop: • Higher area • Short latency • Better throughput

© Copyright 2016 Xilinx

Before we get into details, let’s look under the hood ….

Different iterations are executed concurrently: • Higher area • Short latency • Best throughput
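As an illustration (not part of the original slide), the three outcomes above can be obtained from the same C code purely with directives. The pragmas below use the Vivado HLS spellings, added to the FIR loop shown on this slide; with no directive at all the loop stays rolled (small area, long latency):

loop: for (i=3;i>=0;i--) {
#pragma HLS UNROLL          /* replicate the body: higher area, shorter latency       */
/* #pragma HLS PIPELINE */  /* alternative: overlap successive iterations, best throughput */
    if (i==0) {
        acc+=x*c[0];
        shift_reg[0]=x;
    } else {
        shift_reg[i]=shift_reg[i-1];
        acc+=shift_reg[i]*c[i];
    }
}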

Introduction to High-Level Synthesis   How is hardware extracted from C code? –  Control and datapath can be extracted from C code at the top level –  The same principles used in the example can be applied to sub-functions •  At some point in the top-level control flow, control is passed to a sub-function •  Sub-functions may be implemented to execute concurrently with the top level and/or other sub-functions

How is this control and dataflow turned into a hardware design? –  Vivado HLS maps this to hardware through scheduling and binding processes

How is my design created? –  How functions, loops, arrays and IO ports are mapped?

Intro to HLS 11- 7

© Copyright 2016 Xilinx

HLS: Control Extraction   Code void fir ( data_t *y, coef_t c[4], data_t x ){

Control Behavior Finite State Machine (FSM) states

Function Start

static data_t shift_reg[4]; acc_t acc; int i; acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

0 For-Loop Start

1

For-Loop End

2

Function End From any C code example ..

Intro to HLS 11- 8

The loops in the C code correlate to states of the behavior

© Copyright 2016 Xilinx

This behavior is extracted into a hardware state machine

HLS: Control & Datapath Extraction   Code void fir ( data_t *y, coef_t c[4], data_t x ){ static data_t shift_reg[4]; acc_t acc; int i; acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

From any C code example ..

Intro to HLS 11- 9

Operations

Control Behavior Finite State Machine (FSM) states

Control & Datapath Behavior Control Dataflow

RDx RDc

>= == + * + *

0

1

WRy Operations are extracted…

© Copyright 2016 Xilinx

2 The control is known

RDx

RDc

>=

-

==

-

+

*

+

*

WRy

A unified control dataflow behavior is created.

High-Level Synthesis: Scheduling & Binding   Scheduling & Binding –  Scheduling and Binding are at the heart of HLS

Scheduling determines in which clock cycle an operation will occur –  Takes into account the control, dataflow and user directives –  The allocation of resources can be constrained

Binding determines which library cell is used for each operation –  Takes into account component delays, user directives Technology Library

Design Source (C, C++, SystemC)

Scheduling

Binding

User Directives

Intro to HLS 11- 10

© Copyright 2016 Xilinx

RTL (Verilog, VHDL, SystemC)

Scheduling   The operations in the control flow graph are mapped into clock cycles a b c d e

void foo ( … t1 = a * b; t2 = c + t1; t3 = d * t2; out = t3 - e; }

Schedule 1

* +

* -

*

*

+

out

-

The technology and user constraints impact the schedule –  A faster technology (or slower clock) may allow more operations to occur in the same clock cycle

Schedule 2

*

The code also impacts the schedule –  Code implications and data dependencies must be obeyed

Intro to HLS 11- 11

© Copyright 2016 Xilinx

+

*

-

Binding   Binding is where operations are mapped to cores from the hardware library –  Operators map to cores

Binding Decision: to share –  Given this schedule:

*

+

*

-

•  Binding must use 2 multipliers, since both are in the same cycle •  It can decide to use an adder and subtractor or share one addsub

Binding Decision: or not to share –  Given this schedule:

*

+

*

-

•  Binding may decide to share the multipliers (each is used in a different cycle) •  Or it may decide the cost of sharing (muxing) would impact timing and it may decide not to share them •  It may make this same decision in the first example above too

Intro to HLS 11- 12

© Copyright 2016 Xilinx

Outline   Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary

Intro to HLS 11- 13

© Copyright 2016 Xilinx

RTL vs High-Level Language

Intro to HLS 11- 14

© Copyright 2016 Xilinx

Vivado  HLS  Benefits   Productivity –  Verification Video Design Example

•  Functional •  Architectural

–  Abstraction •  Datatypes

Input

C Simulation Time

RTL Simulation Time

Improvement

10 frames 1280x720

10s

~2 days (ModelSim)

~12000x

•  Interface •  Classes

–  Automation

RTL (Spec) C (Spec/Sim)

RTL (Sim)

Block level specification AND verification significantly reduced

Intro to HLS 11- 15

© Copyright 2016 Xilinx

RTL (Sim)

Vivado  HLS  Benefits   Portability –  Processors and FPGAs –  Technology migration –  Cost reduction –  Power reduction

Design and IP reuse Intro to HLS 11- 16

© Copyright 2016 Xilinx

Vivado  HLS  Benefits   Permutability –  Architecture Exploration •  Timing –  Parallelization –  Pipelining

•  Resources –  Sharing

–  Better QoR

Rapid design exploration delivers QoR rivaling hand-coded RTL Intro to HLS 11- 17

© Copyright 2016 Xilinx

Understanding  Vivado  HLS  Synthesis   Vivado HLS –  Determines in which cycle operations should occur (scheduling) –  Determines which hardware units to use for each operation (binding) –  Performs high-level synthesis by : •  Obeying built-in defaults •  Obeying user directives & constraints to override defaults •  Calculating delays and area using the specified technology/device

Priority of directives in Vivado HLS 1.  Meet Performance (clock & throughput) • 

Vivado HLS will allow a local clock path to fail if this is required to meet throughput

• 

Often possible the timing can be met after logic synthesis

2.  Then minimize latency 3.  Then minimize area

Intro to HLS 11- 18

© Copyright 2016 Xilinx

The Key Attributes of C code   void fir ( data_t *y, coef_t c[4], data_t x ){ static data_t shift_reg[4]; acc_t acc; int i; acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i] * c[i]; }

Functions: All code is made up of functions which represent the design hierarchy: the same in hardware Top Level IO : The arguments of the top-level function determine the hardware RTL interface ports Types: All variables are of a defined type. The type can influence the area and performance Loops: Functions typically contain loops. How these are handled can have a major impact on area and performance Arrays: Arrays are used often in C code. They can influence the device IO and become performance bottlenecks

} *y=acc; }

Operators: Operators in the C code may require sharing to control area or specific hardware implementations to meet performance

Let’s examine the default synthesis behavior of these …

Intro to HLS 11- 19

© Copyright 2016 Xilinx

Functions & RTL Hierarchy   Each function is translated into an RTL block –  Verilog module, VHDL entity

Source Code

void A() { ..body A..} void B() { ..body B..} void C() { B(); } void D() { B(); }

void foo_top() { A(…); C(…); D(…) }

RTL hierarchy foo_top

D

my_code.c

B

B

Each function/block can be shared like any other component (add, sub, etc) provided it’s not in use at the same time

–  By default, each function is implemented using a common instance –  Functions may be inlined to dissolve their hierarchy •  Small functions may be automatically inlined

Intro to HLS 11- 20

C

A

© Copyright 2016 Xilinx

Types = Operator Bit-sizes   Code void fir ( data_t *y, coef_t c[4], data_t x ){ static data_t shift_reg[4]; acc_t acc; int i; acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

From any C code example ...

Intro to HLS 11- 21

Operations

Types Standard C types long long (64-bit)

RDx RDc

>= == + * + *

short (16-bit)

int (32-bit)

char (8-bit)

float (32-bit)

double (64-bit)

unsigned types

Arbitrary Precision types

WRy

C:

ap(u)int types (1-1024)

C++:

ap_(u)int types (1-1024) ap_fixed types

C++/SystemC:

sc_(u)int types (1-1024) sc_fixed types

Can be used to define any variable to be a specific bit-width (e.g. 17-bit, 47bit etc).

Operations are extracted…

© Copyright 2016 Xilinx

The C types define the size of the hardware used: handled automatically
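As a small illustration of the arbitrary precision types listed above, here is a minimal sketch; the bit-widths are chosen arbitrarily here, and data_t/coef_t/acc_t are the typedef names used in the FIR example:

#include "ap_int.h"          /* Vivado HLS arbitrary precision header (C++)            */

typedef ap_int<12> data_t;   /* 12-bit signed samples (width picked for illustration)  */
typedef ap_int<8>  coef_t;   /* 8-bit signed coefficients                              */
typedef ap_int<24> acc_t;    /* accumulator sized to hold the sum of the products      */

acc_t mac(data_t x, coef_t c, acc_t acc) {
    return acc + x * c;      /* infers a 12x8-bit multiplier and a 24-bit adder
                                instead of full 32-bit operators                        */
}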

Loops   By default, loops are rolled –  Each C loop iteration → implemented in the same state –  Each C loop iteration → implemented with the same resources

void foo_top (…) {
   ...
   Add: for (i=3; i>=0; i--) {
      b = a[i] + b;
      ...
   }
}

[Synthesis figure: the N iterations over a[N] share a single adder accumulating into b.]

Loops require labels if they are to be referenced by Tcl directives (GUI will auto-add labels)

–  Loops can be unrolled if their indices are statically determinable at elaboration time •  Not when the number of iterations is variable

–  Unrolled loops result in more elements to schedule but greater operator mobility •  Let’s look at an example ….

Intro to HLS 11- 22

© Copyright 2016 Xilinx

Data  Dependencies:  Good     void fir ( … acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

Default Schedule == -

*

>=

==

-

+

-

RDc

*

>=

==

-

+

-

RDc

Iteration 1

Iteration 2

The read X operation has good mobility

Example of good mobility –  The read on data port X can occur anywhere from the start to iteration 4 •  The only constraint on RDx is that it occur before the final multiplication

–  Vivado HLS has a lot of freedom with this operation •  It waits until the read is required, saving a register •  There are no advantages to reading any earlier (unless you want it registered) •  Input reads can be optionally registered

–  The final multiplication is very constrained…

Intro to HLS 11- 23

© Copyright 2016 Xilinx

*

>=

==

*

>=

-

+

-

RDx

+

RDc

Iteration 3

RDc

Iteration 4

WRy

Data  Dependencies:  Bad   void fir ( … acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

Default Schedule == -

*

>=

==

-

+

-

RDc

*

>=

==

-

+

-

RDc

Iteration 1

Iteration 2

>=

==

*

>=

-

+

-

RDx

+

Iteration 3

RDc

Iteration 4

Mult is very constrained

Example of bad mobility –  The final multiplication must occur before the read and final addition •  It could occur in the same cycle if timing allows

–  Loops are rolled by default •  Each iteration cannot start till the previous iteration completes •  The final multiplication (in iteration 4) must wait for earlier iterations to complete

–  The structure of the code is forcing a particular schedule •  There is little mobility for most operations

–  Optimizations allow loops to be unrolled giving greater freedom Intro to HLS 11- 24

* RDc

© Copyright 2016 Xilinx

WRy

Schedule after Loop Optimization   With the loop unrolled (completely) –  The dependency on loop iterations is gone –  Operations can now occur in parallel •  If data dependencies allow •  If operator timing allows

RDc

RDc

RDc RDx

–  Design finished faster but uses more operators •  2 multipliers & 2 Adders

* *

* *

+

+ +

Schedule Summary

WRy

–  All the logic associated with the loop counters and index checking are now gone –  Two multiplications can occur at the same time •  All 4 could, but it’s limited by the number of input reads (2) on coefficient port C

–  Why 2 reads on port C? •  The default behavior for arrays now limits the schedule…

Intro to HLS 11- 25

RDc

© Copyright 2016 Xilinx

void fir ( … acc=0; loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc; }

Arrays  in  HLS   An array in C code is implemented by a memory in the RTL –  By default, arrays are implemented as RAMs, optionally a FIFO void foo_top(int x, …) { int A[N]; L1: for (i = 0; i < N; i++) A[i+x] = A[i] + i; }

N-1

SPRAMB

N-2 …

foo_top

A[N]

Synthesis

A_in

1 0

DIN ADDR

DOUT

A_out

CE WE

The array can be targeted to any memory resource in the library –  The ports (Address, CE active high, etc.) and sequential operation (clocks from address to data out) are defined by the library model –  All RAMs are listed in the Vivado HLS Library Guide

Arrays can be merged with other arrays and reconfigured –  To implement them in the same memory or one of different widths & sizes

Arrays can be partitioned into individual elements –  Implemented as smaller RAMs or registers Intro to HLS 11- 26

© Copyright 2016 Xilinx
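A hedged sketch of targeting an array to a specific memory resource, as described above. The RESOURCE directive spelling follows the Vivado HLS user guide of this era; the array name, size and core are chosen here for illustration:

void foo_top(int x, int out[32]) {
    int A[32];
#pragma HLS RESOURCE variable=A core=RAM_1P_BRAM   /* map A to a single-port block RAM */
    L1: for (int i = 0; i < 32; i++)
        A[i] = x + i;
    L2: for (int i = 0; i < 32; i++)
        out[i] = A[i];
}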

Top-Level IO Ports   Top-level function arguments –  All top-level function arguments have a default hardware port type

When the array is an argument of the top-level function –  The array/RAM is “off-chip” –  The type of memory resource determines the top-level IO ports –  Arrays on the interface can be mapped & partitioned •  E.g. partitioned into separate ports for each element in the array DPRAMB

foo_top

Synthesis

+

void foo_top( int A[3*N] , int x) { L1: for (i = 0; i < N; i++) A[i+x] = A[i] + i; }

DIN0 ADDR0

DOUT0

CE0 WE0

Number of ports defined by the RAM resource

DIN1 ADDR1

Default RAM resource –  Dual port RAM if performance can be improved otherwise Single Port RAM

Intro to HLS 11- 27

© Copyright 2016 Xilinx

CE1 WE1

DOUT1

Schedule after an Array Optimization   With the existing code & defaults –  Port C is a dual port RAM –  Allows 2 reads per clock cycle

RDc

RDc

RDc

RDc RDx

•  IO behavior impacts performance Note: It could have performed 2 reads in the original rolled design but there was no advantage since the rolled loop forced a single read per cycle

* *

* *

+

+ +

loop: for (i=3;i>=0;i--) { if (i==0) { acc+=x*c[0]; shift_reg[0]=x; } else { shift_reg[i]=shift_reg[i-1]; acc+=shift_reg[i]*c[i]; } } *y=acc;

WRy

With the C port partitioned into (4) separate ports –  All reads and mults can occur in one cycle –  If the timing allows •  The additions can also occur in the same cycle •  The write can be performed in the same cycles •  Optionally the port reads and writes could be registered

RDc RDc RDc RDc RDx

* * * *

+ + + WRy

Intro to HLS 11- 28

© Copyright 2016 Xilinx
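A hedged sketch of the partitioning behind the schedule above, in pragma form and applied to the FIR example from the earlier slides (the pragmas and the UNROLL are additions chosen here; complete partitioning turns c[4] into four individual ports/registers so all reads can happen in one cycle):

void fir (data_t *y, coef_t c[4], data_t x) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1          /* 4 separate coefficient ports */
    static data_t shift_reg[4];
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1  /* registers instead of a RAM   */
    acc_t acc;
    int i;
    acc = 0;
    loop: for (i=3; i>=0; i--) {
#pragma HLS UNROLL                                             /* let the iterations overlap    */
        if (i==0) {
            acc += x*c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i-1];
            acc += shift_reg[i]*c[i];
        }
    }
    *y = acc;
}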

Operators   Operator sizes are defined by the type –  The variable type defines the size of the operator

Vivado HLS will try to minimize the number of operators –  By default Vivado HLS will seek to minimize area after constraints are satisfied

User can set specific limits & targets for the resources used –  Allocation can be controlled •  An upper limit can be set on the number of operators or cores allocated for the design: This can be used to force sharing •  e.g limit the number of multipliers to 1 will force Vivado HLS to share 3

2

1

0

Use 1 mult, but take 4 cycle even if it could be done in 1 cycle using 4 mults

–  Resources can be specified •  The cores used to implement each operator can be specified •  e.g. Implement each multiplier using a 2 stage pipelined core (hardware)

Intro to HLS 11- 29

3

1

2

0

Same 4 mult operations could be done with 2 pipelined mults (with allocation limiting the mults to 2)

© Copyright 2016 Xilinx
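A hedged sketch of limiting operator allocation as described above; the ALLOCATION pragma form follows the Vivado HLS user guide, and the function and array names are chosen here. Limiting multipliers to one forces the four multiplications to share a single core over four cycles:

void dot4(const int a[4], const int b[4], int *out) {
#pragma HLS ALLOCATION instances=mul limit=1 operation   /* one shared multiplier */
    int acc = 0;
    M: for (int i = 0; i < 4; i++)
        acc += a[i] * b[i];
    *out = acc;
}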

Outline   Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary

Intro to HLS 11- 30

© Copyright 2016 Xilinx

Comprehensive  C  Support   A Complete C Validation & Verification Environment –  Vivado HLS supports complete bit-accurate validation of the C model –  Vivado HLS provides a productive C-RTL co-simulation verification solution

Vivado HLS supports C, C++, SystemC and OpenCL API C kernel –  Functions can be written in any version of C –  Wide support for coding constructs in all three variants of C

Modeling with bit-accuracy –  Supports arbitrary precision types for all input languages –  Allowing the exact bit-widths to be modeled and synthesized

Floating point support –  Support for the use of float and double in the code

Support for OpenCV functions –  Enable migration of OpenCV designs into Xilinx FPGA –  Libraries target real-time full HD video processing Intro to HLS 11- 31

© Copyright 2016 Xilinx

C, C++ and SystemC Support   The vast majority of C, C++ and SystemC is supported –  Provided it is statically defined at compile time –  If it's not defined until run time, it won't be synthesizable

Any of the three variants of C can be used –  If C is used, Vivado HLS expects the file extensions to be .c –  For C++ and SystemC it expects file extensions .cpp

Intro to HLS 11- 32

© Copyright 2016 Xilinx
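A minimal illustration (not from the slides) of the “statically defined at compile time” rule; the function names are chosen here:

#include <stdlib.h>

void copy_fixed(int out[64], const int in[64]) {
    C1: for (int i = 0; i < 64; i++)           /* fixed bound, static storage: synthesizable */
        out[i] = in[i];
}

void copy_dynamic(int *out, const int *in, int n) {
    int *tmp = (int *)malloc(n * sizeof(int)); /* run-time allocation: not synthesizable      */
    C2: for (int i = 0; i < n; i++)
        tmp[i] = in[i];
    C3: for (int i = 0; i < n; i++)
        out[i] = tmp[i];
    free(tmp);
}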

Outline   Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary

Intro to HLS 11- 33

© Copyright 2016 Xilinx

C Validation and RTL Verification   There are two steps to verifying the design –  Pre-synthesis: C Validation •  Validate the algorithm is correct

–  Post-synthesis: RTL Verification •  Verify the RTL is correct

C validation

Validate C

–  A HUGE reason users want to use HLS •  Fast, free verification −  Validate the algorithm is correct before synthesis •  Follow the test bench tips given over

RTL Verification

Verify RTL

–  Vivado HLS can co-simulate the RTL with the original test bench Intro to HLS 11- 34

© Copyright 2016 Xilinx

C Function Test Bench   The test bench is the level above the function –  The main() function is above the function to be synthesized

Good Practices –  The test bench should compare the results with golden data •  Automatically confirms any changes to the C are validated and verifies the RTL is correct

–  The test bench should return a 0 if the self-checking is correct •  Anything but a 0 (zero) will cause RTL verification to issue a FAIL message •  Function main() should expect an integer return (non-void) int main () { int ret=0; … ret = system("diff --brief -w output.dat output.golden.dat"); if (ret != 0) { printf("Test failed !!!\n"); ret=1; } else { printf("Test passed !\n"); } … return ret; } Intro to HLS 11- 35

© Copyright 2016 Xilinx

Determine or Create the Top-level Function   Determine the top-level function for synthesis If there are multiple functions, they must be merged –  There can only be 1 top-level function for synthesis Given a case where functions func_A and func_B are to be implemented in FPGA

Re-partition the design to create a new single top-level function inside main() main.c

main.c int main () { ... func_A(a,b,*i1); func_B(c,*i1,*i2); func_C(*i2,ret)

#include func_AB.h int main (a,b,c,d) { ... // func_A(a,b,i1); // func_B(c,i1,i2); func_AB (a,b,c, *i1, *i2); func_C(*i2,ret)

func_A func_B func_C

return ret; }

func_AB func_C

return ret; }

func_AB.c

Recommendation is to separate test bench and design files

Intro to HLS 11- 36

© Copyright 2016 Xilinx

#include func_AB.h func_AB(a,b,c, *i1, *i2) { ... func_A(a,b,*i1); func_B(c,*i1,*i2); … }

func_A func_B

Outline   Introduction to High-Level Synthesis High-Level Synthesis with Vivado HLS Language Support Validation Flow Summary

Intro to HLS 11- 37

© Copyright 2016 Xilinx

Summary   In HLS –  C becomes RTL –  Operations in the code map to hardware resources –  Understand how constructs such as functions, loops and arrays are synthesized

HLS design involves –  Synthesize the initial design –  Analyze to see what limits the performance •  User directives to change the default behaviors •  Remove bottlenecks

–  Analyze to see what limits the area •  The types used define the size of operators •  This can have an impact on what operations can fit in a clock cycle

Intro to HLS 11- 38

© Copyright 2016 Xilinx

Summary   Use directives to shape the initial design to meet performance –  Increase parallelism to improve performance –  Refine bit sizes and sharing to reduce area

Vivado HLS benefits –  Productivity –  Portability –  Permutability

Intro to HLS 11- 39

© Copyright 2016 Xilinx

Improving  Performance   This material exempt per Department of Commerce license exception TSU

Objectives   After completing this module, you will be able to: –  Add directives to your design –  List a number of ways to improve performance –  State directives which are useful to improve latency –  Describe how loops may be handled to improve latency –  Recognize the dataflow technique that improves throughput of the design –  Describe the pipelining technique that improves throughput of the design –  Identify some of the bottlenecks that impact design performance

Improving Performance 13- 2

© Copyright 2016 Xilinx

Outline   Adding Directives Improving Latency –  Manipulating Loops

Improving Throughput Performance Bottleneck Summary

Improving Performance 13- 3

© Copyright 2016 Xilinx

Improving  Performance   Vivado HLS has a number of ways to improve performance –  Automatic (and default) optimizations –  Latency directives –  Pipelining to allow concurrent operations

Vivado HLS support techniques to remove performance bottlenecks –  Manipulating loops –  Partitioning and reshaping arrays

Optimizations are performed using directives –  Let’s look first at how to apply and use directives in Vivado HLS

Improving Performance 13- 4

© Copyright 2016 Xilinx

Applying Directives   If the source code is open in the GUI Information pane –  The Directive tab in the Auxiliary pane shows all the locations and objects upon which directives can be applied (in the opened C file, not the whole design) •  Functions, Loops, Regions, Arrays, Top-level arguments

–  Select the object in the Directive Tab •  “dct” function is selected

–  Right-click to open the editor dialog box –  Select a desired directive from the dropdown menu •  “DATAFLOW” is selected

–  Specify the Destination •  Source File •  Directive File

Improving Performance 13- 5

© Copyright 2016 Xilinx

Optimization Directives: Tcl or Pragma   Directives can be placed in the directives file –  The Tcl command is written into directives.tcl –  There is a directives.tcl file in each solution •  Each solution can have different directives Once applied, the directive will be shown in the Directives tab (right-click to modify or delete)

Directives can be placed into the C source –  Pragmas are added (and will remain) in the C source file –  Pragmas (#pragma) will be used by every solution which uses the code

Improving Performance 13- 6

© Copyright 2016 Xilinx
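As an illustration of the two forms (the function, label and types below are chosen here, not taken from the slides): the pragma version lives in the C source, while the equivalent Tcl command, e.g. set_directive_pipeline "scale/L", would be written into the solution's directives.tcl.

typedef int data_t;   /* stand-in type for this sketch */

void scale(const data_t in[16], data_t out[16]) {
    L: for (int i = 0; i < 16; i++) {
#pragma HLS PIPELINE   /* pragma form: stays in the source, used by every solution */
        out[i] = in[i] * 3;
    }
}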

Solution Configurations   Configurations can be set on a solution – Set the default behavior for that solution •  Open configuration settings from the menu (Solutions > Solution Settings…)

“Add” or “Remove” configuration settings

Select “General”

– Choose the configuration from the drop-down menu •  Array Partitioning, Binding, Dataflow Memory types, Interface, RTL Settings, Core, Compile, Schedule efforts

Improving Performance 13- 7

© Copyright 2016 Xilinx

Example:  Configuring  the  RTL  Output   Specify the FSM encoding style –  By default the FSM is auto

Add a header string to all RTL output files –  Example: Copyright Acme Inc.

Add a user specified prefix to all RTL output filenames –  The RTL has the same name as the C functions –  Allow multiple RTL variants of the same top-level function to be used together without renaming files

Reset all registers –  By default only the FSM registers and variables initialized in the code are reset –  RAMs are initialized in the RTL and bitstream

Synchronous or Asynchronous reset

The remainder of the configuration commands will be covered throughout the course

–  The default is synchronous reset

Active high or low reset –  The default is active high Improving Performance 13- 8

© Copyright 2016 Xilinx

Copying Directives into New Solutions   Click the New Solution Button Optionally modify any of the settings –  Part, Clock Period, Uncertainty –  Solution Name

Copy existing directives –  Selected by default –  Uncheck if you do not want to copy –  No need to copy pragmas, they are in the code

Improving Performance 13- 9

© Copyright 2016 Xilinx

Outline   Adding Directives Improving Latency –  Manipulating Loops

Improving Throughput Performance Bottleneck Summary

Improving Performance 13- 10

© Copyright 2016 Xilinx

Latency and Throughput – The Performance Factors   Design Latency –  The latency of the design is the number of cycles it takes to output the result •  In this example the latency is 10 cycles

Design Throughput –  The throughput of the design is the number of cycles between new inputs •  By default (no concurrency) this is the same as latency •  Next start/read is when this transaction ends Improving Performance 13- 11

© Copyright 2016 Xilinx

Latency  and  Throughput   In the absence of any concurrency –  Latency is the same as throughput

Pipelining for higher throughput –  Vivado HLS can pipeline functions and loops to improve throughput –  Latency and throughput are related –  We will discuss optimizing for latency first, then throughput

Improving Performance 13- 12

© Copyright 2016 Xilinx

Vivado  HLS:  Minimize  latency   Vivado HLS will by default minimize latency –  Throughput is prioritized above latency (no throughput directive is specified here) –  In this example •  The functions are connected as shown •  Assume function B takes longer than any other functions

Vivado HLS will automatically take advantage of the parallelism –  It will schedule functions to start as soon as they can •  Note it will not do this for loops within a function: by default they are executed in sequence

Improving Performance 13- 13

© Copyright 2016 Xilinx

Reducing  Latency   Vivado HLS has the following directives to reduce latency –  LATENCY •  Allows a minimum and maximum latency constraint to be specified

–  LOOP_FLATTEN •  Allows nested loops to be collapsed into a single loop with improved latency

–  LOOP_MERGE •  Merge consecutive loops to reduce overall latency, increase sharing, and improve logic optimization

–  UNROLL

Improving Performance 13- 14

© Copyright 2016 Xilinx
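A hedged sketch of two of the latency directives listed above; the function name, label and bound are chosen here for illustration:

void accumulate(const int a[4], int *out) {
#pragma HLS LATENCY max=8      /* constrain the whole function to at most 8 cycles */
    int acc = 0;
    Add: for (int i = 0; i < 4; i++) {
#pragma HLS UNROLL             /* unrolling removes the loop-boundary cycles        */
        acc += a[i];
    }
    *out = acc;
}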

Default  Behavior:  Minimizing  Latency   Functions –  Vivado HLS will seek to minimize latency by allowing functions to operate in parallel •  As shown on the previous slide

Loops –  Vivado HLS will not schedule loops to operate in parallel by default •  Dataflow optimization must be used or the loops must be unrolled •  Both techniques are discussed in detail later

Operations –  Vivado HLS will seek to minimize latency by allowing operations to occur in parallel, where data dependencies allow

Loops are rolled by default:

void foo_top (…) {
   ...
   Add: for (i=3; i>=0; i--) {
      b = a[i] + b;
      ...
   }
}

Loops require labels if they are to be referenced by Tcl directives (GUI will auto-add labels)

–  Loops can be unrolled if their indices are statically determinable at elaboration time •  Not when the number of iterations is variable

Improving Performance 13- 19

© Copyright 2016 Xilinx

b

Rolled Loops Enforce Latency   A rolled loop can only be optimized so much –  Given this example, where the delay of the adder is small compared to the clock period:

void foo_top (…) {
   ...
   Add: for (i=3; i>=0; i--) {
      b = a[i] + b;
      ...
   }
}

[Timing figure: iterations 3, 2, 1, 0 each occupy one clock cycle; the adder delay fills only a fraction of each clock period.]

–  This rolled loop will never take less than 4 cycles •  No matter what kind of optimization is tried •  This minimum latency is a function of the loop iteration count

Improving Performance 13- 20

© Copyright 2016 Xilinx


Unrolled  Loops  can  Reduce  Latency  

Select loop “Add” in the directives pane and right-click

Unrolled loops allow greater option & exploration

Options explained on next slide

Improving Performance 13- 21

Unrolled loops are likely to result in more hardware resources and higher area

© Copyright 2016 Xilinx

Partial Unrolling   Fully unrolling loops can create a lot of hardware Loops can be partially unrolled –  Provides the type of exploration shown in the previous slide

Partial Unrolling –  A standard loop of N iterations can be unrolled to by a factor –  For example unroll by a factor 2, to have N/2 iterations

Add: for(int i = 0; i < N; i++) { a[i] = b[i] + c[i]; }

Add: for(int i = 0; i < N; i += 2) { a[i] = b[i] + c[i]; if (i+1 >= N) break; a[i+1] = b[i+1] + c[i+1]; }

•  Similar to writing new code as shown on the right → •  The break accounts for the condition when N/2 is not an integer

Effective code after compiler transformation

–  If N is known to be an integer multiple of the unroll factor •  The user can remove the exit check (and associated logic) •  Vivado HLS is not always able to determine this is true (e.g. if N is an input argument) •  User takes responsibility: verify!

Improving Performance 13- 22

© Copyright 2016 Xilinx

for(int i = 0; i < N; i += 2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1]; }

An extra adder for N/ 2 cycles trade-off
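A hedged sketch of the partial unrolling shown above in pragma form; the skip_exit_check option removes the exit test. Here N=100 is a known multiple of the factor, so it is safe, but in general the user must verify this:

#define N 100

void vadd(int a[N], const int b[N], const int c[N]) {
    Add: for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=2 skip_exit_check   /* two additions per iteration, no exit check */
        a[i] = b[i] + c[i];
    }
}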

Loop Flattening   Vivado HLS can automatically flatten nested loops –  A faster approach than manually changing the code

Flattening should be specified on the inner most loop –  It will be flattened into the loop above –  The “off” option can prevent loops in the hierarchy from being flattened

[FSM figure: the original nested-loop structure requires 36 state transitions; after flattening, 28 transitions.]

void foo_top (…) {
   ...
   L1: for (i=3; i>=0; i--) {
      [loop body l1]
   }
   L2: for (i=3; i>=0; i--) {
      L3: for (j=3; j>=0; j--) {
         [loop body l3]
      }
   }
   L4: for (i=3; i>=0; i--) {
      [loop body l4]
   }
}

After flattening (L3 collapsed into L2):

void foo_top (…) {
   ...
   L1: for (i=3; i>=0; i--) {
      [loop body l1]
   }
   L2: for (k=15; k>=0; k--) {
      [loop body l3]
   }
   L4: for (i=3; i>=0; i--) {
      [loop body l4]
   }
}

Loops will be flattened by default: use “off” to disable

Improving Performance 13- 23

© Copyright 2016 Xilinx


Perfect and Semi-Perfect Loops   Only perfect and semi-perfect loops can be flattened –  The loop should be labeled or directives cannot be applied –  Perfect Loops –  Only the inner most loop has body (contents)

Loop_outer: for (i=3; i>=0; i--) {
   Loop_inner: for (j=3; j>=0; j--) {
      [loop body]
   }
}

–  There is no logic specified between the loop statements –  The loop bounds are constant

–  Semi-perfect Loops –  Only the inner most loop has body (contents) –  There is no logic specified between the loop statements

Loop_outer: for (i=3; i>N; i--) {
   Loop_inner: for (j=3; j>=0; j--) {
      [loop body]
   }
}

–  The outer most loop bound can be variable –  Other types

–  Should be converted to perfect or semi-perfect loops

Improving Performance 13- 24

© Copyright 2016 Xilinx

Loop_outer: for (i=3; i>N; i--) {
   [loop body]
   Loop_inner: for (j=3; j>=M; j--) {
      [loop body]
   }
}

Loop  Merging   Vivado HLS can automatically merge loops –  A faster approach than manually changing the code –  Allows for more efficient architecture explorations –  FIFO reads, which must occur in strict order, can prevent loop merging •  Can be done with the “force” option : user takes responsibility for correctness

[FSM figure: the three separate loops require 36 state transitions; after merging, 18 transitions.]

void foo_top (…) {
   ...
   L1: for (i=3; i>=0; i--) {
      [loop body l1]
   }
   L2: for (i=3; i>=0; i--) {        (L2/L3 already flattened)
      L3: for (j=3; j>=0; j--) {
         [loop body l3]
      }
   }
   L4: for (i=3; i>=0; i--) {
      [loop body l4]
   }
}

After merging:

void foo_top (…) {
   ...
   L123: for (l=16; l>=0; l--) {
      if (cond1)
         [loop body l1]
      [loop body l3]
      if (cond4)
         [loop body l4]
   }
}

Improving Performance 13- 25

© Copyright 2016 Xilinx

Loop  Merge  Rules   If loop bounds are all variables, they must have the same value If loops bounds are constants, the maximum constant value is used as the bound of the merged loop –  As in the previous example where the maximum loop bounds become 16 (implied by L3 flattened into L2 before the merge)

Loops with both variable bound and constant bound cannot be merged The code between loops to be merged cannot have side effects –  Multiple execution of this code should generate same results •  A=B is OK, A=A+1 is not

Reads from a FIFO or FIFO interface must always be in sequence –  A FIFO read in one loop will not be a problem –  FIFO reads in multiple loops may become out of sequence •  This prevents loops being merged

Improving Performance 13- 26

© Copyright 2016 Xilinx

Loop  Reports   Vivado HLS reports the latency of loops –  Shown in the report file and GUI

Given a variable loop index, the latency cannot be reported –  Vivado HLS does not know the limits of the loop index –  This results in latency reports showing unknown values

The loop tripcount (iteration count) can be specified –  Apply to the loop in the directives pane –  Allows the reports to show an estimated latency

Improving Performance 13- 27

© Copyright 2016 Xilinx

Impacts reporting – not synthesis
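A hedged sketch of the tripcount directive described above; the bounds and names are illustrative, and the pragma affects reporting only, not the generated hardware:

void accumulate_var(const int *a, int n, int *out) {
    int s = 0;
    Sum: for (int i = 0; i < n; i++) {
#pragma HLS LOOP_TRIPCOUNT min=16 max=256   /* lets the report estimate latency for a variable bound */
        s += a[i];
    }
    *out = s;
}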

Techniques for Minimizing Latency - Summary   Constraints –  Vivado HLS accepts constraints for latency

Loop Optimizations –  Latency can be improved by minimizing the number of loop boundaries •  Rolled loops (default) enforce sharing at the expense of latency •  The entry and exits to loops costs clock cycles

Improving Performance 13- 28

© Copyright 2016 Xilinx

Outline   Adding Directives Improving Latency –  Manipulating Loops

Improving Throughput Performance Bottleneck Summary

Improving Performance 13- 29

© Copyright 2016 Xilinx

Improving  Throughput   Given a design with multiple functions –  The code and dataflow are as shown

Vivado HLS will schedule the design

It can also automatically optimize the dataflow for throughput

Improving Performance 13- 30

© Copyright 2016 Xilinx

Dataflow Optimization   Dataflow Optimization –  Can be used at the top-level function –  Allows blocks of code to operate concurrently •  The blocks can be functions or loops •  Dataflow allows loops to operate concurrently

–  It places channels between the blocks to maintain the data rate

•  For arrays the channels will include memory elements to buffer the samples •  For scalars the channel is a register with hand-shakes

Dataflow optimization therefore has an area overhead –  Additional memory blocks are added to the design –  The timing diagram on the previous page should have a memory access delay between the blocks •  Not shown to keep explanation of the principle clear

Improving Performance 13- 31

© Copyright 2016 Xilinx

Dataflow Optimization Commands   Dataflow is set using a directive –  Vivado HLS will seek to create the highest performance design •  Throughput of 1

Improving Performance 13- 32

© Copyright 2016 Xilinx

Dataflow Optimization through Configuration Command   Configuring Dataflow Memories –  Between functions Vivado HLS uses ping-pong memory buffers by default •  The memory size is defined by the maximum number of producer or consumer elements

–  Between loops Vivado HLS will determine if a FIFO can be used in place of a ping-pong buffer –  The memories can be specified to be FIFOs using the Dataflow Configuration •  Menu: Solution > Solution Settings > config_dataflow •  With FIFOs the user can override the default size of the FIFO •  Note: Setting the FIFO too small may result in an RTL verification failure

Individual Memory Control –  When the default is ping-pong •  Select an array and mark it as Streaming (directive STREAM) to implement the array as a FIFO

–  When the default is FIFO •  Select an array and mark it as Streaming (directive STREAM) with option “off” to implement the array as a ping-pong buffer To use FIFOs the access must be sequential. If HLS determines that the access is not sequential then it will halt and issue a message. If HLS cannot determine the sequential nature then it will issue a warning and continue.

Improving Performance 13- 33

© Copyright 2016 Xilinx
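A hedged sketch combining the DATAFLOW directive with the STREAM directive mentioned above; the array sizes, names and FIFO depth are chosen here, and the accesses are strictly sequential so a FIFO channel is legal:

void top(const int in[64], int out[64]) {
#pragma HLS DATAFLOW                       /* let the two loops run concurrently            */
    int tmp[64];
#pragma HLS STREAM variable=tmp depth=8    /* FIFO channel instead of the default ping-pong */
    Producer: for (int i = 0; i < 64; i++)
        tmp[i] = in[i] * 3;
    Consumer: for (int i = 0; i < 64; i++)
        out[i] = tmp[i] + 1;
}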

Dataflow: Ideal for streaming arrays & multi-rate functions   Arrays are passed as single entities by default –  This example uses loops but the same principle applies to functions

Dataflow pipelining allows loop_2 to start when data is ready –  The throughput is improved –  Loops will operate in parallel •  If dependencies allow

Multi-Rate Functions –  Dataflow buffers data when one function or loop consumes or produces data at different rate from others

IO flow support –  To take maximum advantage of dataflow in streaming designs, the IO interfaces at both ends of the datapath should be streaming/handshake types (ap_hs or ap_fifo) Improving Performance 13- 34

© Copyright 2016 Xilinx

Dataflow Limitations (1)   Must be single-producer, single-consumer; the following code violates the rule and dataflow does not work. The Fix

Improving Performance 13- 35

© Copyright 2016 Xilinx

Dataflow Limitations (2)   You cannot bypass a task; the following code violates this rule and dataflow does not work. The fix: make it a systolic-like datapath

Improving Performance 13- 36

© Copyright 2016 Xilinx

Dataflow vs Pipelining Optimization   Dataflow Optimization –  Dataflow optimization is “coarse grain” pipelining at the function and loop level –  Increases concurrency between functions and loops –  Only works on functions or loops at the top-level of the hierarchy •  Cannot be used in sub-functions

Function & Loop Pipelining –  “Fine grain” pipelining at the level of the operators (*, +, >>, etc.) –  Allows the operations inside the function or loop to operate in parallel –  Unrolls all sub-loops inside the function or loop being pipelined •  Loops with variable bounds cannot be unrolled: This can prevent pipelining •  Unrolling loops increases the number of operations and can increase memory and run time

Improving Performance 13- 37

© Copyright 2016 Xilinx

Function Pipelining

Without pipelining: there are 3 clock cycles before operation RD can occur again – throughput = 3 cycles; and there are 3 cycles before the 1st output is written – latency = 3 cycles.

With pipelining: the latency is the same, but the throughput is better – fewer cycles between new inputs.

void foo(...) {
   op_Read;      /* RD  */
   op_Compute;   /* CMP */
   op_Write;     /* WR  */
}

[Timing diagram: without pipelining, RD CMP WR complete before the next RD starts (throughput = 3 cycles, latency = 3 cycles); with pipelining, a new RD starts every clock cycle, overlapping the CMP and WR of previous transactions (throughput = 1 cycle, latency = 3 cycles).]

Improving Performance 13- 38

© Copyright 2016 Xilinx
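A hedged illustration of the pipelining directive discussed above; the function body and names are chosen here, not taken from the slides:

void foo(volatile int *in, volatile int *out) {
#pragma HLS PIPELINE II=1   /* target a new transaction every clock cycle */
    int v = *in;            /* RD  */
    int r = v * v + 3;      /* CMP */
    *out = r;               /* WR  */
}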

Loop Pipelining   [Timing diagram: without pipelining, each loop iteration completes before the next begins; with pipelining, successive iterations overlap, improving throughput.]

Function Inlining example (Improving Area): sumsub_func and shift_func are inlined into add_sub_pass, so the whole datapath (2 adders, 2 subtractors, and shifts such as B>>1 that cost zero area) can be optimized together:

int sumsub_func (int *in1, int *in2, int *outSum, int *outSub) {
   *outSum = *in1 + *in2;
   *outSub = *in1 - *in2;
}

void shift_func (int *in1, int *in2, int *outA, int *outB) {
   *outA = *in1 >> 1;
   *outB = *in2 >> 2;
}

void add_sub_pass (int A, int B, int *C, int *D) {
   int apb, amb;
   int a2, b2;
   sumsub_func(&A,&B,&apb,&amb);
   sumsub_func(&apb,&amb,&a2,&b2);
   shift_func(&a2,&b2,C,D);
}

Inlining allows optimization to be performed across function hierarchies. Like RTL ungrouping, too much inlining can create a lot of logic and slow runtime.

21- 12 Improving Area and Resources 21- 12

© Copyright 2016 Xilinx

Inline and Allocation: Shape the Hierarchy   Easy to Share

One RTL block is reused for both instances of function foo

Cannot be shared

Function foo is not within the immediate scope of foo_top

21- 13 Improving Area and Resources 21- 13

© Copyright 2016 Xilinx

Controlling Sharing

Inlining brings foo into function foo_top where it can be shared
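A hedged sketch of the two options above; the function names follow the slide's foo/foo_top, while the bodies and pragma placement are chosen here. ALLOCATION limits the number of RTL instances of a function, whereas INLINE dissolves the hierarchy instead:

int foo(int a, int b) {
    /* adding "#pragma HLS INLINE" here would dissolve foo into foo_top instead */
    return (a + b) >> 1;
}

void foo_top(int x[4], int y[2]) {
#pragma HLS ALLOCATION instances=foo limit=1 function   /* share one RTL instance of foo */
    y[0] = foo(x[0], x[1]);
    y[1] = foo(x[2], x[3]);
}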

Loops   By default, loops are rolled –  Each C loop iteration → implemented in the same state –  Each C loop iteration → implemented with the same resources

void foo_top (…) {
   ...
   Add: for (i=3; i>=0; i--) {
      b = a[i] + b;
      ...
   }
}

[Synthesis figure: the N iterations over a[N] share a single adder accumulating into b.]

For Area optimization: keeping loops rolled maximizes sharing across loop iterations, since each iteration of the loop uses the same hardware resources

21- 14 Improving Area and Resources 21- 14

© Copyright 2016 Xilinx

Loop Merging & Flattening   Loop merging & flattening can remove the redundant computation among multiple (related) loops –  Improving area (and sometimes performance) My_Region: { #pragma HLS LOOP_MERGE for (i = 0; i < N; ++i) A[i] = B[i] + 1;

Merge

for (i = 0; i < N; ++i) C[i] = A[i] / 2;

for (i = 0; i < N; ++i) { A[i] = B[i] + 1; C[i] = A[i] / 2; }

Effective code after compiler transformation

}

Allows Vivado HLS to perform optimizations –  Optimization cannot occur across loop boundaries for (i = 0; i < N; ++i) C[i] = (B[i] + 1) / 2;

Removes A[i], any address logic and any potential memory accesses

21- 15 Improving Area and Resources 21- 15

© Copyright 2016 Xilinx

Mapping  Arrays   The arrays in the C model may not be ideal for the available RAMs –  The code may have many small arrays –  The array may not utilize the RAMs very well

Array Mapping –  Mapping combines smaller arrays into larger arrays •  Allows arrays to be reconfigured without code edits

–  Specify the array variable to be mapped –  Give all arrays to be combined the same instance name

Vivado HLS provides options as to the type of mapping –  Combine the arrays without impacting performance •  Vertical & Horizontal mapping

Global Arrays –  When a global array is mapped all arrays involved are promoted to global –  When arrays are in different functions, the target becomes global

Arrays which are function arguments –  All must be part of the same function interface

21- 16 Improving Area and Resources 21- 16

© Copyright 2016 Xilinx

Horizontal  Mapping   Horizontal Mapping –  Combines multiple arrays into longer (horizontal) array –  Optionally allows the arrays to be offset •  The default is to concatenate after the last element

•  The first array specified (in GUI or Tcl script) starts at location zero 21- 17 Improving Area and Resources 21- 17

© Copyright 2016 Xilinx

Vertical Mapping   Vertical Mapping –  Combines multiple arrays into an array with more bits

–  The first array specified (in Tcl or GUI) starts at the LSB

Vertical Mapping for performance –  Creates RAMs with wide words è Parallel accesses

21- 18 Improving Area and Resources 21- 18

© Copyright 2016 Xilinx

Arbitrary Precision Integers   C and C++ have standard types created on the 8-bit boundary –  char (8-bit), short (16-bit), int (32-bit), long long (64-bit) •  Also provides stdint.h (for C), and stdint.h and cstdint (for C++) •  Types: int8_t, uint16_t, uint32_t, int64_t etc.

–  They result in hardware which is not bit-accurate and can give sub-standard QoR

Vivado HLS provides bit-accurate types in both C and C++ –  Plus SystemC types can be used in C++ –  Allow any arbitrary bit-width to be specified –  Will simulate with bit-accuracy

21- 19 Improving Area and Resources 21- 19

© Copyright 2016 Xilinx

Why  are  Arbitrary  Precision  types  Needed?   Code using native C int type

However, if the inputs will only have a max range of 8-bit –  Arbitrary precision data-types should be used

–  It will result in smaller & faster hardware with full precision 21- 20 Improving Area and Resources 21- 20

© Copyright 2016 Xilinx

Outline   Optimizing Resource Utilization Reducing Area Usage Summary

Improving Area and Resources 21- 21

© Copyright 2016 Xilinx

Summary   Resource utilization can be reduced using allocation and binding controls Arbitrary precision data types help controlling both the area and resource utilization The design structure can be controlled by –  Inlining functions: direct impact on RTL hierarchy & optimization possibilities –  Loops: direct impact on reuse of resources –  Arrays: direct impact on the RAM

Major area optimization techniques –  Minimize bit widths –  Map smaller arrays into larger arrays •  Make better use of existing RAMs

–  Control loop hierarchy –  Control function call hierarchy –  Control the number of operators and cores

Improving Area and Resources 21- 22

© Copyright 2016 Xilinx

Using  Vivado  HLS   This material exempt per Department of Commerce license exception TSU

Objectives   After completing this module, you will be able to: –  List the various OS under which Vivado HLS is supported –  Describe how projects are created and maintained in Vivado HLS –  State the various steps involved in using the Vivado HLS project creation wizard –  Distinguish between the role of the top-level module in the testbench and the design to be synthesized –  List the various verifications which can be done in Vivado HLS –  List the Vivado HLS project directory structure

Using Vivado HLS 12 - 2

© Copyright 2016 Xilinx

Outline   Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary

Using Vivado HLS 12 - 3

© Copyright 2016 Xilinx

Vivado HLS OS Support   Vivado HLS is supported on both Linux and Windows The Vivado HLS tool is available under two licenses –  HLS license •  The HLS license comes with Vivado System Edition •  Supports all 7 series devices including Zynq® All Programmable SoC •  Does not support Virtex®-6 and earlier devices –  Use an older version of Vivado HLS for Virtex-6 and earlier

Operating System Windows

Using Vivado HLS 12 - 4

© Copyright 2016 Xilinx

Version Windows 10 Professional (64-bit) Windows 8.1 Professional (64-bit) Windows 7 SP1 Professional (64-bit)

Red Hat Linux

RHEL Enterprise Linux 5.11, 6.7-6.8 (64-bit) RHEL Enterprise Linux 7.1 and 7.2 (64-bit)

SUSE

SUSE Linux Enterprise 11.4 and 12.1 (64-bit)

CentOS

CentOS 6.8 (64-bit)

Ubuntu

Ubuntu Linux 16.04 LTS (64-bit)

Invoke  Vivado  HLS  from  Windows  Menu  

The first step is to open or create a project

12- 5 Using Vivado HLS 12 - 5

© Copyright 2016 Xilinx

Vivado  HLS  GUI  

Information Pane

Auxiliary Pane

Project Explorer Pane

Console Pane

12- 6 Using Vivado HLS 12 - 6

© Copyright 2016 Xilinx

Outline   Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary

Using Vivado HLS 12 - 7

© Copyright 2016 Xilinx

Vivado HLS Projects and Solutions   Vivado HLS is project based –  A project specifies the source code which will be synthesized

Source

–  Each project is based on one set of source code –  Each project has a user specified name

A project can contain multiple solutions –  Solutions are different implementations of the same code –  Auto-named solution1, solution2, etc. –  Supports user specified names

Project Level

Solution Level

–  Solutions can have different clock frequencies, target technologies, synthesis directives

Projects and solutions are stored in a hierarchical directory structure –  Top-level is the project directory –  The disk directory structure is identical to the structure shown in the GUI project explorer (except for source code location) 12- 8 Using Vivado HLS 12 - 8

© Copyright 2016 Xilinx

Vivado  HLS  Step  1:  Create  or  Open  a  project   Start a new project –  The GUI will start the project wizard to guide you through all the steps

Optionally use the Toolbar Button to Open New Project

Open an existing project –  All results, reports and directives are automatically saved/remembered –  Use “Recent Project” menu for quick access 12- 9 Using Vivado HLS 12 - 9

© Copyright 2016 Xilinx

Project  Wizard   The Project Wizard guides users through the steps of opening a new project Step-by-step guide …

Define project and directory

Add design source files

Specify test bench files

1st Solution Information

Project Level Information Using Vivado HLS 12 - 10

Specify clock and select part

© Copyright 2016 Xilinx

Define  Project  &  Directory   Define the project name −  Note, here the project is given the extension .prj −  A useful way of seeing it’s a project (and not just another directory) when browsing

Browse to the location of the project –  In this example, project directory “matrixmul.prj” will be created inside directory “lab1”

Using Vivado HLS 12 - 11

© Copyright 2016 Xilinx

Add  Design  Source  Files   Add Design Source Files −  This allows Vivado HLS to determine the top-level design for synthesis, from the test bench and associated files −  Not required for SystemC designs

Add Files… –  Select the source code file(s) –  The CTRL and SHIFT keys can be used to add multiple files –  No need to include headers (.h) if they reside in the same directory

Select File and Edit CFLAGS… −  If required, specify C compile arguments using the “Edit CFLAGS…” −  Define macros: -DVERSION1 −  Location of any (header) files not in the same directory as the source: -I../include Using Vivado HLS 12 - 12

© Copyright 2016 Xilinx

There is no need to add the location of standard Vivado HLS or SystemC header files or header files located in the same project location

Specify  Test  Bench  Files   Use “Add Files” to include the test bench –  Vivado HLS will re-use these to verify the RTL using cosimulation

And all files referenced by the test bench –  The RTL simulation will be executed in a different directory (Ensures the original results are not overwritten) –  Vivado HLS needs to also copy any files accessed by the test bench • 

E.g. Input data and output results

Add Folders –  If the test bench uses relative paths like “sub_directory/my_file.dat” you can add “sub_directory” as a folder/directory

Use “Edit CFLAGS…” –  To add any C compile flags required for compilation Using Vivado HLS 12 - 13

© Copyright 2016 Xilinx

Test benches I   The test bench should be in a separate file, or excluded from synthesis –  The macro __SYNTHESIS__ can be used to isolate code which will not be synthesized •  This macro is defined when Vivado HLS parses any code (-D__SYNTHESIS__); a guarded test function is sketched below

[Project directory structure: the top-level project directory contains one directory per solution (there can be multiple solutions per project, each a different implementation of the same source code); each solutionN directory contains impl, syn, sim and sysgen sub-directories, and the exported IP lives under impl/ip.]

Importing the exported RTL   In Vivado: 1.  Open the IP Catalog 2.  Add IP to import this block 3.  Browse to the zip file inside “ip”   In System Generator: 1.  Use XilinxBlockAdd 2.  Select Vivado_HLS block type 3.  Browse to the solution directory

Using Vivado HLS 12 - 32

© Copyright 2016 Xilinx
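A minimal sketch of the __SYNTHESIS__ guard mentioned above; the body of test() is an assumption, since the original test.c listing is incomplete here:

#include <stdio.h>

void test(int d[10]) {
    int acc = 0;
    int i;
    for (i = 0; i < 10; i++) {
        acc += d[i];
#ifndef __SYNTHESIS__
        /* C-simulation-only debug print: removed when Vivado HLS synthesizes the code */
        printf("iteration %d, acc = %d\n", i, acc);
#endif
    }
    d[0] = acc;   /* arbitrary use of the result for this sketch */
}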

RTL Export for Implementation   Click on Export RTL –  Export RTL Dialog opens

Select the desired output format

Optionally, configure the output Select the desired language Optionally, click on Vivado RTL Synthesis and Place and Route options for invoking implementation tools from within Vivado HLS Click OK to start the implementation

Using Vivado HLS 12 - 33

© Copyright 2016 Xilinx

RTL Export (Place and Route Option) Results   Impl directory created –  Will contain a sub-directory for each RTL which is synthesized

Report –  A report is created and opened automatically

Using Vivado HLS 12 - 34

© Copyright 2016 Xilinx

RTL Export Results (Place and Route Option Unchecked)   Impl directory created –  Will contain a sub-directory for both VHDL and Verilog along with the ip directory

No report will be created Observe the console –  No packing, routing phases

Using Vivado HLS 12 - 35

© Copyright 2016 Xilinx

Outline   Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary

Using Vivado HLS 12 - 36

© Copyright 2016 Xilinx

Analysis Perspective   Perspective for design analysis –  Allows interactive analysis

Using Vivado HLS 12 - 37

© Copyright 2016 Xilinx

Performance  Analysis  

Using Vivado HLS 12 - 38

© Copyright 2016 Xilinx

Resources  Analysis  

Using Vivado HLS 12 - 39

© Copyright 2016 Xilinx

Outline   Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary

Using Vivado HLS 12 - 40

© Copyright 2016 Xilinx

Command  Line  Interface:  Batch  Mode   Vivado HLS can also be run in batch mode –  Opening the Command Line Interface (CLI) will give a shell

–  Supports the commands required to run Vivado HLS & pre-synthesis verification (gcc, g++, apcc, make)

12- 41 Using Vivado HLS 12 - 41

© Copyright 2016 Xilinx

Using  Vivado  HLS  CLI   Invoke Vivado HLS in interactive mode –  Type Tcl commands one at a time

> vivado_hls –i

Execute Vivado HLS using a Tcl batch file –  Allows multiple runs to be scripted and automated

> vivado_hls –f run_aesl.tcl

Open an existing project in the GUI –  For analysis, further work or to modify it

> vivado_hls –p my.prj

Use the shell to launch Vivado HLS GUI > vivado_hls

12- 42 Using Vivado HLS 12 - 42

© Copyright 2016 Xilinx

Using Tcl Commands   When the project is created –  All Tcl commands to run the project are created in script.tcl •  User specified directives are placed in directives.tcl

–  Use this as a template for creating Tcl scripts •  Uncomment the commands before running the Tcl script

Using Vivado HLS 12 - 43

© Copyright 2016 Xilinx

Help   Help is always available – The Help Menu – Opens User Guide, Reference Guide and Man Pages

In interactive mode – The help command lists the man page for all commands Vivado_hls> help add_files

Auto-Complete all commands using the tab key

SYNOPSIS add_files [OPTIONS] Etc…

Using Vivado HLS 12 - 44

© Copyright 2016 Xilinx

Outline   Invoking Vivado HLS Project Creation using Vivado HLS Synthesis to IPXACT Flow Design Analysis Other Ways to use Vivado HLS Summary

Using Vivado HLS 12 - 45

© Copyright 2016 Xilinx

Summary   Vivado HLS can be run under Windows (7/8.1/10), Red Hat Linux, SUSE, CentOS, and Ubuntu Vivado HLS can be invoked through the GUI and command line on Windows, and through the command line on Linux The Vivado HLS project creation wizard involves –  Defining project name and location –  Adding design files –  Specifying testbench files –  Selecting clock and technology

The top-level module in testbench is main() whereas top-level module in the design is the function to be synthesized

12- 46 Using Vivado HLS 12 - 46

© Copyright 2016 Xilinx

Summary   Vivado HLS project directory consists of –  *.prj project file –  Multiple solutions directories –  Each solution directory may contain •  impl, synth, and sim directories •  The impl directory consists of ip, verilog, and vhdl folders •  The synth directory consists of reports, vhdl, and verilog folders •  The sim directory consists of testbench and simulation files

Using Vivado HLS 12 - 47

© Copyright 2016 Xilinx