FPGA-Based Evolvable Hardware Systems - ARCS 2013

Lukáš Sekanina
Brno University of Technology, Faculty of Information Technology
IT4Innovation Centre of Excellence
http://www.fit.vutbr.cz/~sekanina

ARCS 2013, Prague, February 21, 2013

Evolvable Hardware = Evolutionary Computation + Reconfigurable Device

Evolvable Hardware

[Figure: the evolvable hardware loop. An initial population generator produces candidate configurations (bit strings such as 011010000110). Each circuit is configured in the device, the test set is introduced, and the fitness computation assigns a score (e.g. 2,78; 2,98; 3,56; 3,44; 0,23). High-scored configurations are selected, and a new circuit population is created by applying genetic operators (crossover and mutation). The loop repeats until evolution is finished and yields the final circuit.]

Why evolutionary design?
• Allows engineers to increase the level of design automation.
• The design problem is transformed into a search problem!
• Exploring “dark corners” of design spaces.
• Multi-objective search.
• Novel designs (unreachable by conventional techniques) can be discovered by means of EA.
  • antennas, analog circuits, digital circuits, optical lens systems, programs, protocols…
• Adaptive and self-repairing hardware can be implemented using evolvable hardware.

[Figure: courtesy of NASA Ames.]

Human-Competitive Results

Competition: The Humies at GECCO since 2004

J. Koza: We say that an automatically created result is “human-competitive” if it satisfies one or more of the eight criteria below.
• (A) The result was patented as an invention in the past, is an improvement over a patented invention, or would qualify today as a patentable new invention.
• (B) The result is equal to or better than a result that was accepted as a new scientific result at the time when it was published in a peer-reviewed scientific journal.
• (C) The result is equal to or better than a result that was placed into a database or archive of results maintained by an internationally recognized panel of scientific experts.
• (D) The result is publishable in its own right as a new scientific result independent of the fact that the result was mechanically created.
• (E) The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions.
• (F) The result is equal to or better than a result that was considered an achievement in its field at the time it was first discovered.
• (G) The result solves a problem of indisputable difficulty in its field.
• (H) The result holds its own or wins a regulated competition involving human contestants (in the form of either live human players or human-written computer programs).

Winners of Humies 2004 - 2012
• 2004
  • An Evolved Antenna for Deployment on NASA's Space Technology 5 Mission (NASA Ames)
  • Automatic Quantum Computer Programming: A Genetic Programming Approach (Hampshire College, US)
• 2005
  • Two-dimensional photonic crystals designed by evolutionary algorithms (Cornell U., US)
  • Shaped-pulse optimization of coherent soft-x-rays (Colorado State University, US)
• 2006
  • Catalogue of Variable Frequency and Single-Resistance-Controlled Oscillators (MIT, US)
• 2007
  • Evolutionary Design of Single-Mode Microstructured Polymer Optical Fibres (UCL, UK)
• 2008
  • Genetic Programming for Finite Algebras (Hampshire College, US)
• 2009
  • A Genetic Programming Approach to Automated Software Repair (University of New Mexico, US)
• 2010
  • Evolutionary design of the energy function for protein structure prediction (University of Nottingham, UK)
• 2011
  • GA-FreeCell: Evolving Solvers for the Game of FreeCell (Ben-Gurion University of the Negev, IL)
• 2012
  • Evolutionary Game Design (Imperial College London, UK)

Intrinsic evolution in XC6216 FPGA (A. Thompson, 1996)

Task: In the XC6216 FPGA, find a circuit that outputs logic 1 when the input frequency is 10 kHz and logic 0 when it is 1 kHz.

The resulting configuration works correctly only on the FPGA where it was evolved and in the environment that existed at the end of evolution. It is impossible to fully understand how the evolved circuit works.

Outline
I. Case study: Evolutionary design of image filters
II. EHW systems utilizing Virtual Reconfigurable Circuits
III. EHW systems utilizing Dynamic Partial Reconfiguration
IV. EHW systems in Zynq-7000 all programmable SoC
V. Conclusions

Part I: Case study: Evolutionary design of image filters


Evolutionary design of image filters

Can an EA design an image filter that exhibits better filtering properties and a lower implementation cost than conventional solutions?

Target domain: filters suppressing shot noise, Gaussian noise, burst noise, edge detectors, …

Method: Cartesian Genetic Programming
• Array of nc x nr programmable elements (PE).
• Feed-forward networks.
• All I/O and connections on 8 bits.
• Encoding of PE22: (17, 19, 5)
• Complete encoding: nr x nc x 3 + 1 integers
• Chromosome: (0,1,1), (2,3,5), …, (35,36,5), 22

[Figure: CGP array of programmable elements with 9 x 8-bit inputs and 1 x 8-bit output; nodes are labelled by index.]
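The encoding above can be illustrated with a small decoder. This is a minimal Python sketch rather than the implementation used in the talk. It assumes that the 9 primary inputs are numbered 0 to 8, that nodes are numbered from 9 upward, and that each gene is a triple (input a, input b, function id). The three-entry `FUNCTIONS` table is hypothetical, because the actual PE function list is not reproduced in this text.

```python
from typing import Callable, List, Sequence, Tuple

Gene = Tuple[int, int, int]  # (index of input a, index of input b, function id)

# Hypothetical function table (assumption): the real PE functions from the
# slides are not available here. Results are kept within 8 bits.
FUNCTIONS: List[Callable[[int, int], int]] = [
    lambda a, b: min(a + b, 255),  # 0: saturated addition
    lambda a, b: max(a, b),        # 1: maximum
    lambda a, b: min(a, b),        # 2: minimum
]


def chromosome_length(n_rows: int, n_cols: int) -> int:
    """Number of integers in the encoding: three per PE plus one output gene."""
    return n_rows * n_cols * 3 + 1


def evaluate(genes: Sequence[Gene], output: int, inputs: Sequence[int]) -> int:
    """Evaluate a feed-forward CGP circuit on one vector of 8-bit inputs."""
    values = list(inputs)  # values[i] holds primary input i or node i
    for a, b, f in genes:
        # Feed-forward: a node may only read values that already exist.
        if a >= len(values) or b >= len(values):
            raise ValueError("gene refers to a node that is not yet computed")
        values.append(FUNCTIONS[f](values[a], values[b]) & 0xFF)
    return values[output]


# Toy circuit on a 3x3 window (9 inputs) with two nodes:
# node 9 = max(in0, in1), node 10 = min(node 9, in4); the output is node 10.
window = [10, 200, 30, 40, 50, 60, 70, 80, 90]
toy_genes = [(0, 1, 1), (9, 4, 2)]
result = evaluate(toy_genes, 10, window)
```

For the 4 x 8 array used later in the talk, `chromosome_length(4, 8)` gives 97 integers.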

List of functions in PE

All inputs and outputs at 8 bits!

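Fitness function

$$\mathrm{MAE} = \frac{1}{RC} \sum_{r=0}^{R-1} \sum_{c=0}^{C-1} \left| I(r,c) - K(r,c) \right|$$

I: golden image; K: filtered image; R: rows; C: columns

[Figure: fitness unit. The filtered image is subtracted from the golden image, the absolute value is taken (|ABS|), and the result is added to an accumulator register (REG).]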
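The mean absolute error can be computed in software as follows. This is a minimal sketch that assumes both images are given as equally sized lists of rows of 8-bit integers; the function name is illustrative.

```python
from typing import Sequence

Image = Sequence[Sequence[int]]


def mean_absolute_error(golden: Image, filtered: Image) -> float:
    """MAE between the golden image I and the filtered image K."""
    rows = len(golden)
    cols = len(golden[0])
    if len(filtered) != rows or any(len(row) != cols for row in filtered):
        raise ValueError("images must have the same dimensions")
    total = 0
    for r in range(rows):
        for c in range(cols):
            total += abs(golden[r][c] - filtered[r][c])
    return total / (rows * cols)


golden = [[10, 20], [30, 40]]
filtered = [[12, 20], [25, 40]]
mae = mean_absolute_error(golden, filtered)  # (2 + 0 + 5 + 0) / 4
```

A lower MAE means the filtered image is closer to the golden image, so the search minimizes this value.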

CGP: Search algorithm ES(1+λ)
1. Randomly generate 1+λ individuals (the initial population).
2. Evaluate the population.
3. WHILE the termination criterion is not satisfied DO
   • Select the highest scored individual.
   • Use mutation to create λ offspring of the parent individual.
   • Create a new population using the parent and its λ offspring.
   • Evaluate the population.
4. Report the individual with the highest fitness.

Typical setup:
• λ = 1 - 10
• Mutation of 1 – 5 genes
• Array of 4 x 8 elements
• 30000 generations
• 100 runs
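The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the talk: genomes are plain lists of integers, the score is an error to be minimized (as with MAE), the termination criterion is a generation limit or zero error, and ties keep the parent. The names `es_one_plus_lambda`, `mutate`, and `toy_error` are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Genome = List[int]


def mutate(parent: Genome, n_genes: int, gene_max: int,
           rng: random.Random) -> Genome:
    """Return a copy of the parent with n_genes positions re-randomized."""
    child = parent[:]
    for i in rng.sample(range(len(child)), n_genes):
        child[i] = rng.randint(0, gene_max)
    return child


def es_one_plus_lambda(error: Callable[[Genome], float], genome_len: int,
                       gene_max: int, lam: int, generations: int,
                       n_mutated: int, rng: random.Random
                       ) -> Tuple[float, Genome]:
    # Steps 1 and 2: random initial population of 1 + lam individuals.
    population = [[rng.randint(0, gene_max) for _ in range(genome_len)]
                  for _ in range(1 + lam)]
    scored = [(error(g), g) for g in population]
    # Step 3: iterate until the generation limit or a perfect individual.
    for _ in range(generations):
        parent_err, parent = min(scored, key=lambda s: s[0])
        if parent_err == 0:
            break
        offspring = [mutate(parent, n_mutated, gene_max, rng)
                     for _ in range(lam)]
        # The parent is listed first, so ties keep the parent (sketch choice).
        scored = [(parent_err, parent)] + [(error(c), c) for c in offspring]
    # Step 4: report the best individual found.
    return min(scored, key=lambda s: s[0])


TARGET = [3, 1, 4, 1, 5, 9, 2, 6]  # toy problem: match this genome


def toy_error(genome: Genome) -> float:
    return sum(abs(a - b) for a, b in zip(genome, TARGET))


best_err, best = es_one_plus_lambda(toy_error, genome_len=8, gene_max=15,
                                    lam=4, generations=300, n_mutated=2,
                                    rng=random.Random(1))
```

Because the parent always survives into the next population, the best error never gets worse from one generation to the next.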

Example of evolved filter behavior
a) Image corrupted by 5% salt-and-pepper noise. PSNR (peak signal-to-noise ratio): 18.43 dB
b) Original image
c) Median filter (kernel 3x3). PSNR: 27.92 dB; 268 FPGA slices; 305 MHz
d) Evolved filter (kernel 3x3). PSNR: 37.50 dB; 200 FPGA slices; 308 MHz

[Figure: panels a) to d) showing the four images listed above.]
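The slides do not state how the PSNR values were computed. The sketch below uses the common definition for 8-bit images, PSNR = 10 log10(255² / MSE), which is an assumption here; the function name and the tiny test images are illustrative only.

```python
import math
from typing import Sequence

Image = Sequence[Sequence[int]]


def psnr(original: Image, processed: Image, peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB for two equally sized images."""
    rows = len(original)
    cols = len(original[0])
    squared_error = sum((original[r][c] - processed[r][c]) ** 2
                        for r in range(rows) for c in range(cols))
    mse = squared_error / (rows * cols)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


reference = [[100, 100], [100, 100]]
off_by_one = [[101, 99], [101, 99]]
value_db = psnr(reference, off_by_one)  # MSE = 1, so about 48.13 dB
```

A higher PSNR means the processed image is closer to the original, which is how the evolved filter (37.50 dB) compares favorably with the median filter (27.92 dB).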

Comparison of various filters

[Figure: mean PSNR for 25 test images.]

MF – Median Filter
AMF – Adaptive Median Filter
The single filter is one of the evolved filters.
Best SW – Y. Dong, S. Xu: A new directional weighted median filter for removal of random-valued impulse noise. Signal Processing Letters, vol. 14, no. 3, p. 193–196, 2007

Burst noise filtering
• switching image filters
• 5 x 5 filtering window

[Figure: image corrupted by 5% impulse burst noise, with filtering results labelled CWMF, evolved, AMF, and PWMAD.]

Part II: EHW systems utilizing Virtual Reconfigurable Circuits

… because earlier FPGAs did not support internal reconfiguration at all, and dynamic partial reconfiguration was too complicated.

Implementation strategies

[Figure: implementation strategies. The diagram labels DPR, VRC, and simulator appear together with a list of options that is shown twice: HW; SW – outside FPGA; SW – inside FPGA (soft CPU core); SW – inside FPGA (hard CPU core).]

Virtual Reconfigurable Circuit

• VRC is a new MUX-based reconfigurable layer on top of the FPGA.
• Fast pipelined reconfiguration and pipelined processing.
• Configuration register contains 384 bits for the 8x4 processing elements.

Fitness calculation

[Figure: fitness calculation from the image with noise and the golden image.]

EHW in Xilinx Virtex II Pro FPGA


Time of evolution Evaluation of a single candidate circuit (@100 MHz, M x N pixels):

Total evolution time: (Ng generations, Nc VRC instances, p – population size):


Speedup and resources utilization
• Target platform: Combo6X
• CGP
  • 4 x 8 nodes
  • training image: 128x128
• Results (FPGA at 100 MHz)
  • 1 VRC: 44 times faster than a Celeron 2.4 GHz CPU
    • evaluates approx. 6k candidate filters per second
    • requires approx. 10 s to produce a filter (~30k generations)
  • 4 VRCs: the speedup is 170.
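The figure of roughly 6k candidate filters per second is consistent with a simple estimate. The sketch below assumes that the pipelined VRC consumes one pixel per clock cycle and that pipeline fill and reconfiguration overhead are negligible; both are assumptions, since the timing formulas of the talk are not reproduced in this text.

```python
def evaluations_per_second(clock_hz: float, width: int, height: int) -> float:
    """Candidate filters evaluated per second at one pixel per clock cycle."""
    return clock_hz / (width * height)


# 100 MHz clock and a 128 x 128 training image, as stated above.
rate = evaluations_per_second(100e6, 128, 128)  # about 6.1e3 per second
```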

Results of synthesis for Virtex II Pro 2VP50FF1517 (Nc is the number of VRCs)

Part III: EHW systems utilizing Dynamic Partial Reconfiguration (DPR)


DPR in Virtex-5
• Internal Configuration Access Port (ICAP) ~ 100 MHz
• Frame: the smallest addressable unit of information of the configuration memory (41 x 32 bits).
• Relocation: the capability to configure the same reconfigurable module at different positions of the device.

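Image filter evolution using DPR

[Figure: evolvable FPGA-based SoC. An embedded µP (MicroBlaze) runs the EA and issues reconfiguration commands. It is connected over the PLB bus to a memory controller (external DDR2 memory holding the configuration information) and to a reconfiguration engine that drives the HWICAP over a fast link to perform dynamic and partial reconfiguration of the reconfigurable core. The core turns input data into a candidate result, which the evaluation block compares with the ideal result; improved candidates are kept in memory.]

Platform: Virtex-5 LX110T FPGA included in Digilent’s XUPV5 Evaluation Platform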

Processing architecture
• PEs
  • Functional Blocks (FB), routing logic is incorporated
  • 9 basic functions and routing possibilities yield 16 PE variants (10 CLBs per PE)
  • One BusMacro for each port of PE.
• Chromosome = 98 bits
  • 4 bits per PE (16 times)
  • 4 bits per input MUX (8 times)
  • 2 bits for the output
• ES (1 + 8)

[Table: PE functions with columns ID (0 to 9) and Function. Entries as extracted: x+y, x > 1, 255, x >> 1, x, max(x,y), min(x,y), x –s y]
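The 98-bit chromosome follows from 16 x 4 + 8 x 4 + 2 = 98. A minimal Python sketch of packing and unpacking such a chromosome is given below. The field order (PE genes first, then input MUX genes, then the output gene in the least significant bits) is an assumption made for illustration; the slides give only the field widths.

```python
from typing import List, Tuple

PE_COUNT, PE_BITS = 16, 4    # 4 bits per PE, 16 PEs
MUX_COUNT, MUX_BITS = 8, 4   # 4 bits per input MUX, 8 MUXes
OUT_BITS = 2                 # 2 bits for the output
CHROMOSOME_BITS = PE_COUNT * PE_BITS + MUX_COUNT * MUX_BITS + OUT_BITS


def pack(pe: List[int], mux: List[int], out: int) -> int:
    """Pack all fields into one integer (assumed order: PE, MUX, output)."""
    if len(pe) != PE_COUNT or len(mux) != MUX_COUNT:
        raise ValueError("wrong number of PE or MUX genes")
    fields = ([(v, PE_BITS) for v in pe] + [(v, MUX_BITS) for v in mux]
              + [(out, OUT_BITS)])
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("gene value does not fit its field")
        word = (word << width) | value
    return word


def unpack(word: int) -> Tuple[List[int], List[int], int]:
    """Inverse of pack: read fields starting from the least significant bits."""
    out = word & ((1 << OUT_BITS) - 1)
    word >>= OUT_BITS
    mux = []
    for _ in range(MUX_COUNT):
        mux.append(word & ((1 << MUX_BITS) - 1))
        word >>= MUX_BITS
    mux.reverse()
    pe = []
    for _ in range(PE_COUNT):
        pe.append(word & ((1 << PE_BITS) - 1))
        word >>= PE_BITS
    pe.reverse()
    return pe, mux, out


pe_genes = list(range(16))
mux_genes = [8, 7, 6, 5, 4, 3, 2, 1]
chromosome = pack(pe_genes, mux_genes, 3)
```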

[Figure: screenshot of the evolution demonstrator. Panels: representation of the filter, filter latency, reference image, image to filter, result, fitness value, and mutation rate. Evolution statistics: number of generations, number of filters evaluated (8 per generation), and number of reconfigured elements. A fitness vs. time plot (lower is better) is annotated at 3.5 seconds and 55 seconds.]

Time of evolution: DPR vs. VRC
• 100000 generations, (1+8) ES, 128×128 pixels
• SW: Intel Core i5 processor @ 3.3 GHz
• 5056 circuits/s for the RA
• 6818 circuits/s for the VRC
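As a rough cross-check, the two evaluation rates can be turned into total evolution times. The sketch assumes 8 circuit evaluations per generation (the offspring of the (1+8) ES, with the parent not re-evaluated) and ignores the initial population and any other overhead; the resulting times are estimates, not figures from the talk.

```python
GENERATIONS = 100_000
EVALUATIONS_PER_GENERATION = 8  # assumption: only the 8 offspring are evaluated


def evolution_time_s(circuits_per_second: float) -> float:
    """Estimated wall-clock time of the whole run in seconds."""
    return GENERATIONS * EVALUATIONS_PER_GENERATION / circuits_per_second


time_ra_s = evolution_time_s(5056)   # about 158 s
time_vrc_s = evolution_time_s(6818)  # about 117 s
```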

Resources utilization: DPR vs. VRC
• VRC is not as expensive as we thought!
• HWICAP can be reused by other applications

Part IV: EHW systems in Zynq-7000 all programmable SoC


Zynq-7000 all programmable SoC (PS-centric architecture)
• Processing System (PS)
  • Dual-core ARM Cortex-A9 (667 MHz – 1 GHz)
  • Cache (L1: 32 kB + 2 x 32 kB, L2: 512 kB)
  • On-chip RAM (256 kB)
  • External memory interfaces (DDR3, DDR2, …)
  • DMA controller
  • IO (USB, UART, CAN, … )
• Programmable Logic (PL)
  • Artix-7 or Kintex-7
  • LUT (17 600 – 218 000)
  • CLB = 8 x 6-inp. LUT + 16 FF
  • FF (35 200 – 437 200)
  • BRAM (240 – 2 180 kB)
  • DSP slices, A/D converter, …

Zynq-7000 all programmable SoC
• Processor configuration access port (PCAP)
  • a part of the PS
  • does not need any instantiation in the PL part
• Download throughput: max 400 MB/s for non-secure PL configuration
• Configuration data size: 4 MB for XC7Z020
• Reconfiguration performed by DMA.
• The frames are 50 CLB high and 1 CLB wide ~ 15 kB/frame
• Frame relocation is possible.
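The throughput and size figures above give a lower bound on reconfiguration time. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 kB = 10^3 bytes) and transfer at the maximum throughput with no setup overhead; both are assumptions. The per-frame result is of the same order as the 41 µs reported in the results slide for replacing one PE.

```python
MB = 1_000_000  # assumption: decimal megabyte
KB = 1_000      # assumption: decimal kilobyte

PCAP_THROUGHPUT = 400 * MB  # bytes per second, maximum for non-secure config


def transfer_time_s(size_bytes: float,
                    throughput: float = PCAP_THROUGHPUT) -> float:
    """Ideal transfer time in seconds at the given throughput."""
    return size_bytes / throughput


full_config_s = transfer_time_s(4 * MB)  # 0.01 s for the XC7Z020 bitstream
one_frame_s = transfer_time_s(15 * KB)   # 37.5e-6 s for one 15 kB frame
```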

Image filter evolution in Zynq: Overview
• The EA is software running on the PS.
• Reconfigurable Array
  • PEs – programmable elements (pre-synthesized frames)
  • Config – a register determining the interconnection of PEs by multiplexers (VRC-like style)
  • pipelined
• FU – fitness unit

Image filter evolution in Zynq: Results
• CGP: 9 inputs / 2 outputs (over 8 bits), 4 rows x 8 columns, ES (1 + 4), 16 functions per PE
• We measured the number of clock cycles for 100 generations in 10 independent runs.

f_vrc = 203 MHz, f_hyb = 265 MHz

Replacement of 1 PE by PCAP takes 41 µs.

Is Zynq-7000 good for EHW?
• It is the best currently available platform for EHW!
• A very fast programmable logic.
• ADC – a direct analog input from environment is available.
• Options for EA
  • a standalone system running on the “second” core (well tuned for a particular application)
  • one of applications of the OS running on a dual core system
• PCAP is more comfortable than ICAP to perform DPR.
• But, frames still too big => long reconfiguration times!

Conclusions
• I presented the evolution of FPGA-based implementations of EHW systems.
• Current trends:
  • Complete EHW systems on a chip
  • Genetic operations as a SW for embedded processor.
• The concept of VRC has not been outperformed by DPR yet!
  • DPR beats the VRC in the normal operation mode only.
• EHW fits the recently introduced concept of underdesigned and opportunistic computing.

References
• Sekanina, L.: Evolvable hardware. Handbook of Natural Computing, Berlin, Springer, 2012, p. 1657-1705
• Vašíček, Z., Sekanina, L.: Hardware Accelerator of Cartesian Genetic Programming with Multiple Fitness Units. Computing and Informatics, Vol. 29, No. 6, 2010, p. 1359-1371
• Salvador, R., Otero, A., Mora, J., de la Torre, E., Riesgo, T., Sekanina, L.: Self-reconfigurable Evolvable Hardware System for Adaptive Image Processing. IEEE Transactions on Computers, 2013, in press
• Dobai, R., Sekanina, L.: Image Filter Evolution on the Xilinx Zynq Platform. In Proc. of the NASA/ESA Adaptive Hardware and Systems, 2013, submitted

Acknowledgments
• Faculty of Information Technology, Brno University of Technology
  • Roland Dobai, Zdeněk Vašíček, Michal Bidlo
• Center of Industrial Electronics, Universidad Politécnica de Madrid
  • Ruben Salvador, Eduardo de la Torre, Andrés Otero