Doctoral thesis of Université Paris-Sud, presented for the degree of Doctor of Université Paris-Sud, specialty: Computer Science, by

Cédric HARTLAND, Équipe-projet TAO, LRI, UMR CNRS 8623, Bât. 490, Université Paris-Sud, 91405 Orsay Cedex, France

A contribution to robust adaptive robotic control acquisition

under the supervision of Michèle SEBAG and Nicolas BREDECHE

Jury:

Jean-Sylvain Liénard, Research Director, Chair
Cyril Fonlupt, Professor, Reviewer
Hélène Paugam-Moisy, Professor, Reviewer
Stéphane Doncieux, Associate Professor, Examiner
Nicolas Bredeche, Associate Professor, Thesis Supervisor
Michèle Sebag, Research Director, Thesis Supervisor

Acknowledgements

This thesis manuscript is the outcome of many years of study. Particular thanks go to my two thesis supervisors, Nicolas Bredeche and Michèle Sebag, who played an important role throughout this thesis. I have learned a great deal in your company, both scientifically and personally, and I warmly thank you for all your efforts. My thanks also go to Hélène Paugam-Moisy and Cyril Fonlupt, who did me the honour of reviewing my thesis, and equally to Stéphane Doncieux and Jean-Sylvain Liénard for accepting to sit on my jury. Thanks to the members of the TAO team for their friendliness, their patience and their contagious passion, and in particular to Chaouki Aouiti, Nicolas Baskiotis, Jacques Bibai, Ettore Cavallaro, Alexandre Devert, Lou Fedon, Jiang Fei, Mary Felkin, Alvaro Fialho, Romaric Gaudel, Sylvain Gelly, Jean-Baptiste Hoock, Mohamed Jebalia, Miguel Nicolau, Mathieu Pierres, Arpad Rimmel, Raymond Ros, Marc Schoenauer, Damien Tessier, Fabien Teytaud and Olivier Teytaud. I also wish to thank the other members of the LRI, the whole staff and my fellow PhD students, especially Anh Hoang Phan and Rafael Lopez. There are people dear to me whom I also wish to thank for their unfailing support. To my parents, Marie-Paule and Paul Hartland, who have always supported me: I thank you with all my heart and dedicate this thesis to you. To my wife, Yin Tang, for having loved and supported me during these last years, which were not the easiest: I thank you most especially.


Thesis Summary

This manuscript presents the work carried out during these studies towards a doctoral degree in computer science. It presents research in computer science for robotics, through machine learning and artificial evolution, and more precisely revolves around the notion of robustness for autonomous mobile robotics. This summary covers the key points that are detailed in the body of the manuscript. A few technical details about the simulation tools are given in the appendix at the end of the manuscript, followed by a glossary and the bibliography.

Introduction

The content of this thesis belongs to the field of Artificial Intelligence (AI). Depending on the point of view, artificial intelligence may aim either at understanding cognitive phenomena by means of computational models, or at solving difficult problems. The point of view adopted here concerns the resolution of difficult problems (navigation problems in robotics) by calling upon bio-inspired models, in the sense that they are inspired by living beings. The ultimate goal is to implement algorithms able to solve tasks that, while easily solved by humans, are extremely hard for computers; in this context, the difficulty is often related to the need for high-level mental processes. A robot is a mechatronic (mechanical and electronic) platform, able on the one hand to gather information about its environment through sensors, and on the other hand to act on this same environment through actuators. As stated here, the problem of computer science for robotics consists in proposing a control function for the robot which, given its perception of the world (sensory information and internal states), proposes the most appropriate actions in order to solve the task assigned to the robot.

One of the problems that quickly appears when looking at the design of a robot control program is complexity. Robot programming is constrained by physical phenomena and by a large amount of uncertainty regarding the robot environment. In order to meet the needs and the growing complexity of ever more sophisticated robotic platforms, robot programming methods as original as they are varied have been proposed in the past. First, the so-called bottom-up approach starts from easily programmable building blocks responding to specific sensory stimuli and requiring no precise knowledge of the world; a hierarchical assembly of these blocks makes it easier to compose complex behaviours. The adaptive approach, for its part, proposes to program behaviours automatically according to given criteria. Adaptive methods rely on models and algorithms from stochastic optimisation (gradient search, evolutionary algorithms) and from machine learning, including artificial neural networks; they allow a behaviour solving a problem to emerge, based on known control data or on evaluation criteria relative to the task to be performed. The combination of these approaches, among others, has enabled many significant advances in computer science for robotics.

Motivations

To this day, many questions remain open and several challenges still stand. This study is motivated by a multi-faceted issue that currently represents a serious limitation for machine learning and evolutionary algorithms applied to robotics: robustness. The robustness of the learned controller is considered first. In evolutionary robotics (the application of evolutionary algorithms to robotics), most of the work tends to be carried out in simulation rather than directly on real robots, for practical reasons. Owing to the weaknesses of the simulated models, the resulting controllers do not always transfer to the real robot, or their application is strongly compromised. This problem, known as the Reality Gap, is one of the main bottlenecks limiting the robustness of the trained controller. Robustness during controller acquisition is an equally important issue, whether in the machine learning setting, where the choice of the controller model arises in addition to the questionable accuracy of the data used for training, or in the evolutionary setting, when the model exhibits temporal properties. Generally speaking, learning problems in a robotic context do not satisfy the usual assumptions on the data under which learning algorithms operate. The contributions of this thesis address this notion of robustness, both when the controller is being learned and when it is being used.

State of the art

A state of the art, as exhaustive as possible within the scope of the thesis, is detailed in the corresponding chapter. It is briefly summarised here.

Machine Learning

The field of machine learning concerns the automatic extraction and acquisition of knowledge from data. A machine learning program is designed to adapt to its context in order to improve its performance on the task it is assigned. Applications are numerous and include computer vision, handwritten character recognition, game playing and autonomous mobile robotics. Machine learning has been used in industry since the 80s to program automata on the basis of optimal trajectories. The machine learning setting mainly considered in this thesis is supervised learning: data are collected in the environment under consideration, and an operator plays the role of an oracle and annotates these examples with labels, based on his or her expertise. A model is trained by applying learning techniques so that, at the end of training, when the model receives data as input, it outputs a value as close as possible to the label the oracle would give. The experiments focus in particular on learning by demonstration. In learning by demonstration, the oracle drives the robot (using a joystick, for instance); at every time step, the sensor values are recorded together with the motor order given by the oracle, and these motor orders are the desired labels for the sensor values. By regression, the controller is trained so that, in the situations encountered, it reacts as the oracle would have; one thus tries to model the control strategy of the human operator. The advantage of this approach is that, in principle, no computer science knowledge is required to program the robot, and it can be applied directly on a real robot. The drawbacks are that a) the data distribution does not follow the standard assumption in machine learning that samples are independently and identically distributed (iid); b) the demonstrations may be imprecise, incomplete and locally inconsistent, making learning difficult or the result unreliable; c) the amount of training data may be too small to yield sufficiently general and robust solutions; d) the model best suited to learn the task is not known beforehand.

Evolutionary algorithms

Evolutionary algorithms are families of stochastic optimisation algorithms inspired by the theory of Darwinian evolution. These families of algorithms rest on two principles: natural selection and random variation. Broadly speaking, variation operators explore a search space containing candidate solutions, while selection operators steer this search by favouring the candidate solutions that perform best with respect to the evaluation function under consideration. In a robotic application, this evaluation function typically characterises the task the robot must accomplish (a distance to travel within a given time, a specific task to perform, etc.); one then speaks of evolutionary robotics.


In the work presented in this manuscript, the optimised models are artificial neural networks. Most of them are parametric (fixed topology and number of parameters) with real-valued attributes (the weights of the connections between neurons). In this context, the family of evolutionary algorithms employed is the evolution strategy, chosen for its ability to handle problems in continuous spaces. This paradigm offers a small set of variation operators with the desired properties, based on random exploration driven by a Gaussian normal distribution; the desired properties are the ability to perform both local and global search in the space containing the solutions. In particular, in this work, the method that adapts the covariance matrix of the candidate solutions partly de-randomises the search (the CMA-ES algorithm), which represents the state of the art in stochastic optimisation in continuous spaces. As mentioned above, most of the models are neural networks with fixed topologies. Non-parametric models are also used, namely neural networks generated by the NEAT algorithm (NeuroEvolution of Augmenting Topologies). NEAT is an evolutionary algorithm dedicated to the generation of neural networks, optimising the number of neurons and their interconnections.
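To make the evolution strategy principle concrete, the sketch below shows a minimal (1, lambda) evolution strategy with an isotropic Gaussian mutation. It only illustrates the paradigm; it is not CMA-ES (which additionally adapts a full covariance matrix and the step size), and the fitness function, population size and problem dimension are placeholders rather than the settings used in the thesis.

```python
import numpy as np

def evolve(fitness, dim, sigma=0.1, lam=20, generations=100, seed=0):
    """Minimal (1, lambda) evolution strategy with isotropic Gaussian mutation."""
    rng = np.random.default_rng(seed)
    parent = rng.normal(size=dim)            # e.g. the weights of a fixed-topology network
    best = fitness(parent)
    for _ in range(generations):
        # variation: sample lambda offspring around the current parent
        offspring = parent + sigma * rng.normal(size=(lam, dim))
        scores = np.array([fitness(x) for x in offspring])
        # selection: the best offspring becomes the new parent (comma selection)
        i = scores.argmax()
        parent, best = offspring[i], scores[i]
    return parent, best

# toy usage: maximise a placeholder fitness (the negated sphere function)
best_weights, best_score = evolve(lambda w: -np.sum(w ** 2), dim=10)
```

In a robotic setting, the fitness function would score the behaviour produced by the controller parameterised by the candidate vector, for instance in simulation.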

Neural Networks

Neural networks are one of the many models that exist in machine learning. They form a graph in which the nodes are computing units whose values, computed from their inputs, are transmitted to the other units. Some nodes are inputs of the system and some are its outputs; all the others are considered hidden. Neural networks may have recurrent connections, which allows a non-monotonic flow of information and acts as a memory. Neural networks come with learning methods that adjust their parameters, such as the gradient back-propagation algorithm; these parameters are, in the standard case, the weights of the connections between neurons. Learning methods that adjust the topology also exist. To address the challenges raised in this work, a recent neural network paradigm is considered: the Echo State Networks (ESN). These networks are generated at random, taking into account a density parameter and a contraction parameter for the connections between hidden neurons. Learning in this type of network only applies to the connections towards the output neurons, and can be done linearly by taking into account the context observed in the hidden neurons.
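As a concrete illustration of this idea, the sketch below builds a small Echo State Network following the usual conventions (sparse random reservoir rescaled to a given spectral radius, tanh units, linear readout trained by ridge regression on the collected reservoir states). The reservoir size, density and regularisation values are illustrative assumptions, not the settings used in the thesis.

```python
import numpy as np

class ESN:
    """Minimal Echo State Network: only the readout weights are trained."""

    def __init__(self, n_in, n_res, n_out, density=0.1, spectral_radius=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.uniform(-1, 1, (n_res, n_in))
        w = rng.uniform(-1, 1, (n_res, n_res))
        w *= rng.random((n_res, n_res)) < density               # sparse reservoir
        w *= spectral_radius / max(abs(np.linalg.eigvals(w)))   # contraction control
        self.w_res = w
        self.w_out = np.zeros((n_out, n_res))
        self.x = np.zeros(n_res)

    def _update(self, u):
        # reservoir state update; the state acts as the temporal context (memory)
        self.x = np.tanh(self.w_in @ np.asarray(u) + self.w_res @ self.x)
        return self.x.copy()

    def fit(self, inputs, targets, ridge=1e-4):
        """Linear (ridge regression) training of the readout on reservoir states."""
        states = np.array([self._update(u) for u in inputs])
        targets = np.asarray(targets)
        reg = ridge * np.eye(states.shape[1])
        self.w_out = np.linalg.solve(states.T @ states + reg, states.T @ targets).T

    def predict(self, u):
        return self.w_out @ self._update(u)
```

Only the readout weights are learned; the input and reservoir weights remain fixed, which is what makes the training linear and cheap.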

Anticipation and Adaptation

Motivations

The first contribution tackles the problem of transferring a behaviour obtained in simulation (in silico) to a real robot (in situ). The dynamics of the real world differ from those of the simulated world; these differences result in differences in the robot behaviour, which show up through the sensors and the actuators. This problem is known as the Reality Gap. Several solutions have been proposed in the literature to address it. Implementing, more or less precisely, noise on the essential elements of the dynamics involved represents the minimum in terms of robustness. Dual evolution, in simulation and then on the real robot, reduces the experimental time compared to evolution on the physical robot alone; the adaptive approach, finally, is the direction followed here. Adaptive approaches have been proposed before; they rely on a model that allows the controller to be modified and adapted to the disturbances encountered. In particular, anticipation-based approaches provide a criterion for controller adaptation. Modifications made according to this anticipation criterion do indeed allow an adequate adjustment of the controller; however, they inevitably lead to the loss of the controller's ability to behave appropriately for the task it was assigned. Building on the principle of anticipation, the proposed model no longer adapts the controller itself, as is at least partly the case in previous models, but adapts the output of the controller so as to fit the situations of the real world.

Experiments and Results

First, neural network controllers are optimised in simulation using evolution strategies (CMA-ES). Disturbances are then applied to the actuators of the simulated robot and the ability of the model to adapt the control is assessed. Finally, a controller is evaluated on the real robot, and the model is assessed on its ability to adapt the control efficiently. The experiments show that adaptation can take place quickly, allowing the controller to behave very similarly to its expected behaviour.

Demonstrating Behaviours

Motivations

As seen in the previous chapter, the evolutionary paradigm, while effective on many challenges, raises major difficulties. The lack of an extremely precise simulation model is not the only one: managing the evolutionary algorithm, namely defining the evaluation function and choosing the operators of the algorithm, proves delicate even for an expert. This chapter considers a different route for generating a behaviour; here, instead of an evaluation criterion characterising the task to be accomplished, examples of the behaviour expected to accomplish this task are available, and the learning by demonstration paradigm is used. The main task consists in finding a target object in the environment and approaching it. Several issues arise: a) the demonstrations of the behaviour are few and noisy, which makes any generalisation difficult; b) the collected data are not independently and identically distributed (iid), an assumption generally required by learning algorithms; c) the choice of the model is difficult to make beforehand, all the more so as it must handle non-iid data and temporal dynamics. The approach considered uses recurrent neural networks from Reservoir Computing, the Echo State Networks (ESN), which can capture temporal dynamics and allow easy context-based learning.

Experiments and Results

To validate the interest of using ESNs in a learning by demonstration setting, two experiments were carried out, first in simulation, then on the real robot. The experiments in simulation aim at reproducing, to some extent, the controllers obtained by evolution in the previous chapter. The human operator drives a simulated Khepera II robot with the keyboard arrow keys; the recorded data are used to train ESNs and standard neural networks, and the resulting behaviours are then assessed by the operator. The results show that ESNs are very competitive with standard neural networks. The experiments on the real robot (Khepera II) rely on two demonstrations of the expected behaviour: search for, then reach, the target object. In this context of scarce data, noisier than in simulation, standard neural networks turn out to be unable to learn the target behaviour whereas ESNs succeed. A state-of-the-art learning by demonstration model, MPL, is also considered for comparison; this model does not include temporal dynamics and nevertheless performs rather well on the task, showing that the temporal dynamics at play here have a rather small impact.

Memory-enhanced Controllers

Motivations

The experiments of the previous chapter provide a non-iid setting that does not necessarily require temporal dynamics. In order to assess more precisely the ability of recurrent neural networks to learn tasks requiring stronger dynamics, chapter 5 defines an environment featuring several perceptual aliases: the Tolman comb. A perceptual alias is a situation in which the sensory information does not allow a given situation to be distinguished from another one (e.g. the robot cannot tell which of two rooms it is in). The Tolman comb consists of a corridor with many branches opening on the same side at regular intervals. The objective of the robot is to turn exclusively into the third branch of the environment. This strictly requires the robot to identify a temporal context, the only cue that distinguishes the different branches. ESNs, optimised with CMA-ES, are compared here with NEAT, an evolutionary algorithm that generates neural networks of variable topologies. Beyond the comparison of approaches, the evolutionary setting of this experiment highlights the problem of defining the evaluation function as well as the problem of evolutionary opportunism: if the evaluation function is poorly defined, the result of the optimisation will be either trivial or undesirable (for an obstacle avoidance criterion, the most obvious strategy is to stay put).

Experiments and Results

The experiments take place in simulation, with two evaluation functions. In the first one, the minimal distance to a virtual target placed at the end of the target corridor is computed over a given evaluation period. A second evaluation function makes the task easier by considering an additional waypoint at the entrance of the corridor. While the expected best solution remains the same, in the first case a robot going into the second or fourth corridor may end up closer to the target than a robot stopping at the entrance of the third corridor. Also, to ensure generality and avoid the trivial behaviours that were observed, the evaluation is split into several sub-evaluations in environments with random variations (distance from the robot to the first corridor, distance between corridors). This is all the more necessary since, because of the dynamics at play, a given controller may or may not succeed at the task; several evaluations make it possible to assess the general ability to solve the task and to discard solutions that are globally poor but lucky in some environments. Both NEAT and the ESNs prove able to accomplish the task and to take only the third corridor into account. In particular, apart from the forgetting constraints, which allow a precise control of memory in the ESNs, NEAT produces networks with topologies fairly similar to the ESNs. The results show that neural networks are thus able to express an implicit counting ability.

Conclusion and Perspectives

The contributions give a more precise grasp of the phenomena related to robustness in behaviour learning. On the one hand, the adaptation of controllers to disturbances has been addressed through anticipation mechanisms; on the other hand, robust learning principles have been highlighted, both by using models able to operate despite a difficult learning context and by proposing a protocol enabling robust controller learning. This work opens new research directions:

• The work carried out on the in-silico to in-situ transition focused on adapting the behaviour obtained in silico by means of an anticipation mechanism. This mechanism adjusts the behaviour using a model of the environment (by considering the differences in behaviour between the real and the simulated world). The criterion used to assess the adaptation remains a problem when adapting controllers more complex than holonomic mobile robots (which can turn in place without manoeuvring). For this reason, the perspectives focus on adapting the sensor values, still using a model of the world, this time acting as a preprocessor on these data rather than as a postprocessor on the actuators.

• Learning by demonstration sidesteps the difficulties of artificial evolution (need for simulation, definition of the evaluation criterion, evolutionary opportunism, narrowly targeted solutions). Moreover, this approach could be carried out by an end-user who is not an engineer. Because of the various problems affecting the demonstrations, the error minimisation criterion may prove insufficient to obtain an effective controller that generalises well. A protocol incorporating interaction during learning would distribute the demonstrator's workload as well as possible while making learning more efficient. An evaluation criterion could also be identified from the demonstrations.

• Chapter 5 proposed a protocol that makes the learning of non-reactive tasks (requiring the use of a memory) more robust. On the one hand, the possibility of biasing the random construction of the reservoirs in ESNs could allow a more precise a priori control of the dynamics at play. The protocol however has a high cost, due to the need for multiple evaluations; the application of bandit methods is also considered as a perspective, so as to distribute the evaluation credits over the most promising individuals.


Contents

1 Introduction
  1.1 Embodied Artificial Intelligence
  1.2 Motivations
  1.3 Contributions
  1.4 Outline of the thesis
  1.5 Published works

2 Literature Review
  2.1 Introduction
    2.1.1 Myths and history
    2.1.2 From prototypes to Industry, Space and Entertainment
    2.1.3 Background and requirements
    2.1.4 Challenges
  2.2 Machine Learning
    2.2.1 Introduction
      Goals and criteria
      Generalities
    2.2.2 Supervised learning
      Classification and regression
    2.2.3 Reinforcement Learning
    2.2.4 Learning by Demonstration
      Micro Population based Learning (MPL)
  2.3 Artificial Neural Networks
    2.3.1 Introduction
    2.3.2 Definitions
      Neuron
      Neural Network
    2.3.3 Learning in neural networks
      Hebbian rule
      Widrow-Hoff rule
      Back-propagation algorithm
      Training recurrent networks
      Other adjustments
      Optimisation
    2.3.4 Reservoir computing: Echo State Networks
      Structure
      Training
  2.4 Evolutionary Computation
    2.4.1 Optimisation, Stochastic Optimisation
    2.4.2 Evolutionary Computation
    2.4.3 Phenotypic Operators
      Selection
      Replacements
      Termination criterion
    2.4.4 Genotypic Operators
      Initialisation
      Crossovers
      Mutation
    2.4.5 Evolution Strategies
      CMA-ES and sep CMA-ES
    2.4.6 Neuro-Evolution of Augmenting Topologies
    2.4.7 Evolutionary Robotics

3 Anticipation and Adaptation
  3.1 Motivation: Reality Gap
    3.1.1 Disturbances
    3.1.2 Off-line gap avoidance
    3.1.3 On-line gap avoidance
  3.2 Anticipation based Approach
    3.2.1 Predictive architecture
    3.2.2 Correction module
    3.2.3 Experiments Goal
  3.3 Experimental Settings
    3.3.1 Simulator
    3.3.2 Physical robot
    3.3.3 Neural network settings
    3.3.4 Tasks
      Evolving the Control Module
      Training The Anticipation Module
      Validation of the model
  3.4 Task 1: Evolving the control module
  3.5 Task 2: Training the anticipation module
  3.6 Task 3: Validation of the model
    3.6.1 On-line calibration correction
      Results
      Discussion
    3.6.2 On-line adaptation to continuous wear
      Results
      Discussion
    3.6.3 In-situ experiments
      Results
      Discussion
  3.7 Conclusion

4 Demonstrating Behaviours
  4.1 Motivation
    4.1.1 Experiments Goal
  4.2 Training ESNs by demonstration
    4.2.1 Structure of the log
    4.2.2 Controller Search Space
    4.2.3 Assessment of a Controller
    4.2.4 Tasks
      Wandering behaviour
      Target following behaviour
  4.3 Experimental Settings
    4.3.1 Learning by Demonstration Environment
    4.3.2 Controller Parameters
  4.4 Experimental Results, task 1: Wandering behaviour
    4.4.1 Methodology
    4.4.2 Results
    4.4.3 Discussion
  4.5 Experimental Results, task 2: Target following behaviour
    4.5.1 Methodology
    4.5.2 Results
    4.5.3 Discussion
  4.6 Conclusion

5 Memory Enhanced Controllers
  5.1 Going beyond reactive control
    5.1.1 Deliberative control
    5.1.2 Neural approaches to deliberative control
  5.2 Tolman Comb and Robust Training
    5.2.1 A benchmark Environment
    5.2.2 Designing robust memory-enhanced controllers
  5.3 Validation
    5.3.1 Preliminary experiments
    5.3.2 Tolman Experiments
      Experimental Setting
    5.3.3 Results
  5.4 Conclusion

6 Conclusion and Perspectives

A Software
  A.1 The simulator
    A.1.1 Creating an environment
    A.1.2 Creating an agent
    A.1.3 Launching simulation
  A.2 Running an evolutionary process
  A.3 Khepera

Index

Bibliography

Chapter 1

Introduction

Bender: Admit it! You think robots are just machines built to make life easier.
Fry: Well, aren't they?
Bender: I've never made anyone's life easier, and you know it!
(Bender the robot and P.J. Fry, Futurama)

1.1 Embodied Artificial Intelligence

Artificial Intelligence is a research field meant to answer one question: can a machine think? Since Alan Turing's test, proposed in 1950 (if, during a text-based conversation, a machine is indistinguishable from a human, then it can be said to be 'thinking' and can therefore be attributed with intelligence), AI has solved many difficult challenges. Among the many challenges left is autonomous robotics. Autonomous robotics has been considered from the computer scientist's point of view since the early years of Artificial Intelligence. Embodied Artificial Intelligence aims at intelligence within a robot that is able both to sense and to act on its environment, as an actual living being would. This embodiment raises new problems, related to the difficulty of controlling the robot based on the huge amount of information available in real time. In order to face the complexity of the challenge, adaptive approaches ranging from Machine Learning (ML) to Evolutionary Robotics (ER) were proposed in the literature to design intelligent robot software.


1.2 Motivations

Both ML and ER approaches (among others) have been used to address issues relevant to robotics, such as signal and image processing, kinematics, interaction, and the design of control policies, to name a few. The presented work focuses on methodologies for designing adaptive control policies, referred to as controllers. The ER paradigm proceeds by exploring some controller search space in order to optimise a so-called fitness function (section 2.4), involving a large number of trials and errors. This large number of trials and errors advocates for a simulation-based approach (design in silico) as opposed to dealing with actual robots (design in situ). While ER proved to be very efficient in generating adequate controller behaviours, the resulting controllers do not necessarily perform adequately in situ; this main limitation of ER is referred to as the Reality Gap. Another ER limitation concerns the design of the fitness function, that is, the criterion used to assess the controller behaviour: a robotics end-user can hardly be expected to provide a computable fitness function. The learning by demonstration paradigm, allowing a human teacher to demonstrate the desired robot behaviour, has been investigated to address this issue. The definition of the controller search space also strongly depends on the desired behaviour, while again, selecting the appropriate search space can hardly be left to the end-user. Considering these facts, the present work is concerned with robust adaptive learning of robot control policies, requiring only a reasonable effort from the human designer.

1.3 Contributions

The presented work aims at achieving robustness in robotic control and in robotic control acquisition. In order to reach that goal, the dissertation presents three main contributions:

• Firstly, an anticipation-based correction module is proposed, aimed at automatically reducing the reality gap within the evolutionary robotic context.

• Secondly, the learning by demonstration framework is investigated in relation with recurrent neural networks (NNs), specifically Echo State Networks (ESN), to achieve wandering and pursuit behaviours.


• Finally, the question of memory-enhanced controllers is studied within the evolutionary robotic context; several NN settings are compared and a robust methodology avoiding the so-called Evolution Opportunism is presented.

1.4 Outline of the thesis

Chapter 2 provides the historical and formal background and the key concepts needed to comprehend the challenges, and introduces the state of the art. It is separated into four distinct sections: a historical overview of robotics from the Artificial Intelligence point of view, and short introductions to basic Machine Learning concepts, to artificial neural networks and to evolutionary computation, including the evolutionary robotic paradigm. Chapter 3 is devoted to anticipation-based control: the presented approach, aimed at addressing the reality gap, is detailed and its experimental validation is discussed. Chapter 4 focuses on Learning by Demonstration, detailing the experimental methodology and discussing the limitations of the approach. Chapter 5 investigates memory-enhanced controllers, proposing a specific benchmark task and comparing several NN-based search spaces. The last chapter summarizes the lessons learned from the presented work and describes some perspectives for further research. The appendix describes the software programming environment and the hardware supporting the in-situ experiments. An index is also provided at the end of the manuscript.

1.5 Published works

The presented research has been published in peer-reviewed conferences and workshops. Additionally, some work related to the PASCAL challenge On-line Trading between Exploration and Exploitation ([Hartland et al., 2006] and [Hartland et al., 2007]) was carried out; the connection between this work, related to Dynamic Multi-Armed Bandits [Auer et al., 2006], and robotic applications is mentioned in the perspectives of the PhD thesis. The following publications were produced during the thesis:

Memory-Enhanced Evolutionary Robotics: The Echo State Network Approach. C. Hartland, N. Bredeche, M. Sebag. IEEE Congress on Evolutionary Computation (IEEE CEC 2009), pages 2788-2795.

Robotique Evolutionnaire et Mémoire. C. Hartland, N. Bredeche, M. Sebag. Conférence francophone sur l'apprentissage automatique (CAp09), pages 161-172.

Using Echo State Networks for robot navigation behavior acquisition. C. Hartland, N. Bredeche. IEEE International Conference on Robotics and Biomimetics (ROBIO 2007), Sanya, China, IEEE Computer Society Press, pages 201-206.

Change Point Detection and meta-bandits for online learning in dynamic environments. C. Hartland, S. Gelly, N. Baskiotis, O. Teytaud, M. Sebag. Conférence francophone sur l'apprentissage automatique (CAp07), pages 237-250.

Human Heuristics for a Team of Mobile Robots. C. Tijus, N. Bredeche, Y. Kodratoff, M. Felkin, C. Hartland, V. Besson, E. Zibetti. Proceedings of the 5th International Conference on Research, Innovation and Vision for the Future (RIVF'07), Hanoï, IEEE Computer Society Press, pages 122-129.

Evolutionary Robotics, Anticipation and the Reality Gap. C. Hartland, N. Bredeche. Proceedings of the IEEE International Conference on Robotics and Biomimetics (ROBIO 2006), pages 1640-1645.

Multi-armed Bandit, Dynamic Environments and Meta-Bandits. C. Hartland, S. Gelly, N. Baskiotis, O. Teytaud, M. Sebag. Workshop On-line Trading of Exploration and Exploitation, NIPS 2006.

Evolutionary Robotics: From Simulation to the Real World using Anticipation. C. Hartland, N. Bredeche. Workshop ABIALS 2006.


Chapter 2

Literature Review

The overall goal of the presented PhD thesis is to achieve robust mobile machine control. Within a short survey of robotics and robotic challenges, this chapter introduces the key concepts that will be used and discusses the main questions addressed in the dissertation. Lastly, the two main enabling technologies used to address these questions are presented, namely Machine Learning and Evolutionary Robotics.

2.1 Introduction

2.1.1 Myths and history

The fascination with robots existed long before robotic science itself. Among the ancient Greek gods, Hephaestus, god of the forge, built, among other creations, Talos, a mechanical bronze man that may be thought of as a golem or an automaton. Automata have been produced for centuries to imitate living beings, such as Jacques Vaucanson's mechanical duck in 1739 (Fig. 2.1), supposedly the first robot ever created, the Jacquet-Droz brothers' automata, or Hisashige Tanaka's karakuri-ningyō (mechanical dolls), including the Bow Shooting Boy around 1850. The word robot appeared in 1920 in Rossum's Universal Robots(1), a science fiction play by the Czech writer Karel Capek(2). Many different robots have existed for decades, and many more populated the science fiction literature, with renowned authors such as Isaac Asimov.

(1) In this play, the robots rebel against human tyranny in a violent way, then end up discovering the meaning of life and feelings such as love.
(2) The term robot was apparently suggested by his brother Joseph Capek; robota means serf labour in Czech.


Figure 2.1: Vaucanson's digesting duck automaton, built in 1739, having over 400 moving parts.

2.1.2 From prototypes to Industry, Space and Entertainment

In the 1940s, Grey Walter designed the turtle robots, which are, to the best of our knowledge, the first autonomous mobile robots. For decades, the main application field for robots remained industry, where they are used in assembly lines to replace human operators on repetitive and well-specified tasks. Unimate, the first industrial robotic arm, was designed in 1961. In 1967 appeared Shakey, the first robot endowed with planning abilities and complex sensory inputs such as a video camera. Wabot-1 was created in 1973 as the first mobile robot using two legs for biped walking. In 1986, the French robot Magali was developed to harvest apples directly from the trees; the same year, the E0 (Experimental Model) project started in Japan, leading to the Asimo robot in 2000. The well-known Pathfinder Sojourner rover was sent to Mars to explore its surface in 1997, followed by the Spirit and Opportunity rovers in 2003. Among the first entertainment robots, Aibo became commercially available in 1999. Over the past years, the potential of robots has increased, with the ability to walk, roll, crawl, fly, or swim. Robots can act on, and sense, their environments with various precise actuators and sensors (Fig. 2.2). Together with the increasing potential of new robotic platforms, the application range has widened, including e.g. medical assistance with robotic/bionic arm transplantation, personal assistance, service or entertainment.


Figure 2.2: A few robotic devices existing in 2008, from left to right: ASIMO, HRP-2m, Aibo, Nao. Robot scales are not preserved.

2.1.3 Background and requirements

A robot is a mechatronic (mechanical and electronic engineering) device, incorporating sensors, actuators and a controller. Sensors provide information about the environment surrounding the robot; actuators allow the robot to act on its environment; the controller encodes the action policy, i.e. the choice of the actuator values depending on the current state of the world, the internal state of the system or external guidance. A mobile robot is a robot able to move thanks to its own batteries and motors. Autonomous robots, as opposed to tele-operated robots, act without external guidance: an autonomous robot controller exclusively relies on its own internal states and sensors to select the actuator values, a.k.a. actions. A sensor is a piece of equipment able to grab data from the environment (a sensor may be active or passive, depending on whether it interacts with the environment, and internal or external, depending on whether it is located within the robot or in the environment); sensors provide a numerical, noisy, partial representation of the environment. Signal processing can be applied to the raw sensor data, providing higher-level information referred to as perceptions. An actuator performs a change of state of the physical device, usually being a motor. The autonomous robot relies on its controller, a.k.a. policy, defined as a function mapping the sensor values, and possibly program internal states, onto actuator values. The behaviour is defined through the actions of the robot.
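As a minimal illustration of a controller in this sense, the sketch below maps proximity sensor readings onto two wheel speeds in a Braitenberg-like fashion; the sensor layout, gains and cruise speed are hypothetical and only serve to show the sensor-to-actuator mapping.

```python
def reactive_controller(sensors, cruise=0.5, gain=1.0):
    """Map sensor values onto actuator values (left and right wheel speeds).

    `sensors` is assumed to be a list of proximity readings in [0, 1], the
    first half on the robot's left side, the second half on its right side;
    higher values mean a closer obstacle.
    """
    half = len(sensors) // 2
    left_obstacle = sum(sensors[:half]) / half
    right_obstacle = sum(sensors[half:]) / (len(sensors) - half)
    # speed up the wheel on the obstructed side so the robot turns away from it
    left_speed = cruise + gain * left_obstacle - gain * right_obstacle
    right_speed = cruise + gain * right_obstacle - gain * left_obstacle
    return left_speed, right_speed

# e.g. an obstacle on the left makes the robot veer to the right
print(reactive_controller([0.8, 0.7, 0.0, 0.1]))
```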

2.1.4 Challenges

Despite a number of advances in robotics, autonomous robotics still is a challenge, for several reasons. In the most general case, battery autonomy is the main bottleneck, constraining both the robot range of action and the embedded computing power. The design of adequate sensors and actuators is also a challenge. The last, but not least, challenge remains the robotic software, in charge of controlling the robot behaviour and its adaptation in an open world. This last challenge, adaptive robot controller design, is the goal of the presented research.

Figure 2.3: Robot Control System Spectrum.

Robotic control was tackled by researchers from domains such as physics, mathematics, biology, or psychology, laying the foundations of Artificial Intelligence [Russell et al., 2006]. According to [Arkin, 1998], two main approaches, referred to as deliberative and reactive controllers, are distinguished. Deliberative controllers deal with high-level representations of the environment and are endowed with complex planning abilities, while reactive controllers are fed raw sensor values and always react identically to the same sensor input. It must be noted that different situations might result in the same sensory inputs, leading to the so-called perceptual aliasing phenomenon; reactive control, bound to provide the same reaction to different situations with the same sensory coding, is thus hindered by perceptual aliasing. Figure 2.3 comparatively illustrates deliberative and reactive approaches. While deliberative approaches usually rely on computationally heavy algorithms and critical crafting of high-level representations, reactive approaches quickly deal with sensor stimuli. The complementarity of deliberative and reactive approaches thus suggests that the ideal controller is a mixture of both.


The literature on controller design distinguishes top-down and bottom-up approaches.

• The top-down approach requires some exhaustive knowledge about the environment, enabling optimal control to be specified through equations (for industrial robotic manipulators) or high-level symbolic controllers (for AI projects up to the 80s, such as the Shakey robot, Fig. 2.4).

Figure 2.4: Generic hierarchical control architecture; environment modeling is performed using perceptions. This environment model serves planning algorithms leading to actions undertaken by the robot.

• The bottom-up approach does not assume any exhaustive knowledge about the environment, as such knowledge is usually not available. The best representative of the bottom-up approach is Brooks' behavioural approach [Brooks, 1986] with the so-called subsumption architecture: simple controllers, or behaviours, are established for specific sensor activations and then integrated in hierarchical controllers (Fig. 2.5; a code sketch of this arbitration scheme is given at the end of this section).

Figure 2.5: Subsumption architecture: the move forward controller is the default one, and can be overridden by the obstacle avoidance behaviour, which can itself be overridden by the environment exploration.

While bottom-up controllers have been manually crafted for decades, they do not scale up beyond certain limits. New approaches, namely Machine Learning and Evolutionary Robotics, have been investigated to address these scalability limitations. As mentioned earlier on, the key goal of the presented research is to achieve adaptive autonomous robotic control; the core enabling technologies to do so are Evolutionary Robotics and Machine Learning. Specifically, the main criteria for the presented research involve:

• the robustness of the robot behaviour with respect to the environment and the device specificities (sensor and actuator noise and failures);

• the robustness of the controller design with respect to the training procedure and the load on the human designer;

• the ability of the controller to deal with complex environments and/or tasks through memory skills.
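The following sketch illustrates the priority-based arbitration behind the subsumption architecture of Figure 2.5; the behaviours, actions and threshold are hypothetical placeholders. Each layer either proposes an actuator command, overriding the layers below, or abstains.

```python
def move_forward(sensors):
    # default layer: always proposes to go straight ahead
    return (0.5, 0.5)

def avoid_obstacle(sensors, threshold=0.6):
    # overrides move_forward whenever a proximity reading is high
    return (-0.3, 0.3) if max(sensors) > threshold else None   # None = abstain

def explore(sensors, bored=False):
    # highest layer: occasionally overrides everything to visit new places
    return (0.7, 0.2) if bored else None

def subsumption_step(sensors, layers):
    """Return the action of the highest-priority layer that does not abstain."""
    for behaviour in layers:                  # ordered from highest to lowest priority
        action = behaviour(sensors)
        if action is not None:
            return action

# exploration > obstacle avoidance > default forward motion
action = subsumption_step([0.1, 0.8, 0.2], [explore, avoid_obstacle, move_forward])
```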

2.2 Machine Learning

Science makes extensive use of models, be they based on human expertise, on the statistical analysis of empirical data, or both. Machine Learning (ML) is a scientific discipline concerned with the design and development of algorithms that allow computers to learn from data, such as sensor and control data. Machine learning is closely related to fields such as statistics, probability theory, data mining, pattern recognition, artificial intelligence, adaptive control, and theoretical computer science.

2.2.1 Introduction

Machine Learning is the field concerned with the automatic extraction and acquisition of knowledge from data [Mitchell, 1997] [Russell et al., 2006] [Vapnik, 1995]. A machine learning program is designed to adapt itself so as to perform better on the tasks it was meant for. Applications of machine learning include computer vision [Saxena et al., 2009], speech and handwriting recognition [LeCun et al., 2004], game playing, and robot locomotion, to cite a few. Machine Learning has been applied to industrial robotic control tasks since the 80s [Segre, 1988].


Goals and criteria

Machine Learning algorithms are organised in a taxonomy, according to the input and expected output of the algorithm. The main approaches are the following:

• Supervised Learning [Breiman et al., 1984] [Vapnik, 1995]: exploits training data E = {(x_i, y_i^*), i = 1, ..., N}, x_i ∈ X, y_i^* ∈ Y, where y_i^* is the desired label (a.k.a. output) for the instance (a.k.a. input) x_i according to some oracle. The algorithm goal is to produce a hypothesis h from X onto Y, such that h(x) = y^* corresponds to the correct label for new instances not within E. The label may be a discrete (classification) or a continuous (regression) value.

• Unsupervised Learning: exploits training data made of instances only: E = {x_i, i = 1, ..., N}, x_i ∈ X. The goal is to define classes, a.k.a. clusters, including similar instances. Main applications of unsupervised learning, a.k.a. clustering, are (hierarchical) concept discovery or anomaly detection.

• Reinforcement Learning (RL) [Sutton and Barto, 1998a] [Girgin et al., 2008]: interacts with the environment through emitting actions a_1, a_2, ... and receiving rewards. The RL goal is to learn an action policy that maximises the overall reward received over the lifetime of the RL controller. Reinforcement Learning, related to the field of decision and control theory, has been extensively applied to mobile robot control [Smart and Kaelbling, 2002].

A wide variety of ML algorithms has been developed, ranging from Artificial Neural Networks (section 2.3) to Decision Trees [Breiman et al., 1984], Support Vector Machines [Vapnik, 1995], Bayesian Networks [Pearl, 1985], Hidden Markov Models [Baum and Petrie, 1966] and Ensemble Learning (Boosting [Schapire, 1990] and Bagging [Breiman, 1994]). It must be emphasised that there is no such thing as a universal ML algorithm [Wolpert, 1996]. The best algorithm depends on the data, the task at hand and the domain application criteria, specifically:

• The data distribution. While training data are usually assumed to be Independently and Identically Distributed (iid) in ML, this assumption does not hold in the case of Robotics.


• Noise within data. In robotics, the data come from various sensors and actuators prone to noise. In the case of learning by demonstration (section 2.2.4), a teacher directs the robot, usually introducing an even larger amount of noise.

• The intelligibility of the results. Some models are self-explanatory and can be interpreted (e.g. decision trees) while others cannot (e.g. neural networks). The intelligibility of the results may be a prerequisite of the study.

• Real-time constraints. The training time or the computational cost of the model may be restricted for some reason.

Supervised Learning, Reinforcement Learning and Programming by Demonstration are presented briefly in the following.

Generalities

Modeling is a 3-step process: defining an appropriate representation of the data (a set of descriptive attributes, e.g. the sensors), defining a performance measurement, and finally selecting an appropriate model space and training method. The use of machine learning algorithms requires some care, depending on the data specificities and the choice of the model space. Most real-world data are noisy; this is especially true when dealing with robotic sensors, and the data noise might severely hinder learning algorithms. Evaluating the quality or the performance requires a success criterion. Achieving a good performance on the training data does not imply that the target concept has been acquired, due to the overfitting problem: the ML goal is to generalise, i.e. to achieve a good performance on unseen data too. The generalisation error (the expectation of the error) is estimated from the error on a test set (disjoint from the training set). When training occurs in high dimension (i.e. with a large number of attributes), there should be enough training data to represent the concept precisely. If there is not enough data with respect to the dimension, no model may acquire the target concept, whatever the training method; this is referred to as the curse of dimensionality.

A central dilemma in ML is the trade-off between the accuracy on the training set and the flexibility of the model space (e.g. measured by its Vapnik-Chervonenkis dimension [Vapnik and Chervonenkis, 1971]), addressing the so-called bias-variance issue. If the model space H is too poor, the bias (the error of the best model in H) is high; if H is too expressive relative to the available amount of training data, the variance of the model selected from the training set is high. The bias-variance dilemma is particularly critical in domain applications with a high number of attributes, e.g. for X = R^d and d > 50.

2.2.2 Supervised learning

Supervised learning can be viewed as a 3-player game, involving the environment, the oracle and the learner. Consider a data set {(x_i, y_i)}_{i=1}^{m}, where the instances x_i are provided by the environment and the labels y_i are provided by the oracle. The learner aims at finding a model h ∈ H with small error on the training data (h(x_i) = y_i for most i = 1...m) and on unseen data. While h ideally coincides with the unknown target concept, its identification might be impossible for two reasons. Firstly, noise present in the data renders the true concept unreachable. Secondly, an insufficient amount of training data leads to situations where one cannot distinguish between the true concept and the concept obtained. The goal thus becomes to find the best approximation of the target concept within the model space, given the available data.

Classification and regression

As mentioned earlier on, supervised learning is referred to as classification or regression, depending on whether the label space Y is discrete or continuous. In the regression context, the standard learning criterion is the Mean Square Error (MSE):

$$\mathrm{MSE}(h) = \sum_{i=1}^{n} \left( h(x_i) - y_i^* \right)^2$$

In Robotics, the label most often corresponds to the desired real-valued actuator command, defining a regression problem. Given a data set $\{(x_{i1}, x_{i2}, \dots, x_{ip}, y_i)\}_{i=1}^{m}$ of $m$ statistical units, a linear regression model assumes that the relationship between the dependent variable $y_i$ and the $p$-vector of regressors $x_i$ is approximately linear. This approximate relationship is modeled through a so-called disturbance term $E_i$, an unobserved random variable that adds noise to the linear relationship between the dependent variable and the regressors. The model thus takes the form $Y = X\beta + E$, where

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \quad
X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}
  = \begin{pmatrix} x_{11} & \dots & x_{1p} \\ x_{21} & \dots & x_{2p} \\ \vdots & \ddots & \vdots \\ x_{m1} & \dots & x_{mp} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
E = \begin{pmatrix} E_1 \\ E_2 \\ \vdots \\ E_m \end{pmatrix}$$

In linear regression, the model space H is the set of linear combinations of the attributes, parameterised by $\beta = (\beta_1, \dots, \beta_p)$. Linear regression assumes that the instances in X are linearly independent and iid, and that the noise is Gaussian. The optimal linear model is computed as

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
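For illustration, the least-squares solution above can be computed in a few lines of NumPy; this is a minimal sketch, in which the toy data and the use of np.linalg.lstsq (numerically safer than forming the inverse explicitly) are illustrative choices rather than the setup used later in this work.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit, equivalent to beta = (X^T X)^{-1} X^T y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Toy usage: recover beta = (2, -1) from noisy observations
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # m = 100 samples, p = 2 regressors
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)
print(fit_linear_regression(X, y))                  # approximately [ 2.  -1. ]
```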

2.2.3 Reinforcement Learning

Reinforcement Learning, at the crossroads of Control Theory, Operations Research, Artificial Intelligence and Cognitive Science, is concerned with an agent or robot in interaction with its environment. The robot performs actions, modifying the environment, including the robot state and the way the robot perceives the environment. The robot receives some pay-off or rewards upon executing some actions, usually with some delay, and the goal of reinforcement learning is to devise a policy (deciding which action to select in every state) that maximises the cumulative reward collected by the robot over a given time period (which can depend, for instance, on the battery autonomy or the time taken to perform the task). For a comprehensive introduction to Reinforcement Learning, the interested reader is referred to [Sutton and Barto, 1998b] and to the tutorial presented by Satinder Singh at NIPS 2005 (http://www.eecs.umich.edu/~baveja/NIPS05RLTutorial/). Among the prototypical RL problems are pole balancing, the car on the hill, and the search for a navigation policy in order to arrive at the treasure location in a maze
[Sutton and Barto, 1998b]. Reinforcement Learning is most often formalised in the Markov Decision Process setting. The main concepts are summarised below.

• The set of states, denoted S, corresponds to the possible situations of the robot in the environment. A state can be described from the robot position and/or its sensor values; it can also incorporate some memory of the past robot behaviour. The state (a vector of state variables) is meant to convey all the information relevant to action selection.

• The set of actions, denoted A, encodes the possible decisions of the robot in any state (motor activation). Notably, actions can be described at any level of generality, ranging from low-level decisions (speed of the right and left wheels) to high-level ones (selecting a controller in some controller library).

• The transition function p : S × A × S → IR+ defines the conditional probability p(s'|s, a) of arriving at state s' by selecting action a in state s. The transition function corresponds to a model of the world; a simulator can be viewed as a procedural transition function.

• The reward function r is a mapping from (a subset of) the states onto real-valued numbers, representing the outcome of the robot actions. The main point is that the rewards are delayed in time; in the maze example, the reward is received at the moment when the robot arrives at the goal state.

The goal of RL is to find a policy, that is, a mapping π : S → A from the set of states onto the set of actions, which is optimal in terms of the expected cumulative reward obtained by starting in state s and selecting actions according to π:

$$\pi_T^*(s) = \operatorname{argmax}_{\pi} \; E_\pi\!\left[ \sum_{t=0}^{T} r(s_t) \,\middle|\, s_0 = s \right]$$

where T is the time horizon. An extension to Reinforcement Learning is the so-called Inverse Reinforcement Learning. In standard RL, the reward function is available; in the Inverse RL setting, the reward function is learned from available samples of the desired policy, acquired through Learning by Demonstration [Abbeel, 2008].
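As a toy illustration of policy learning in this setting, the following sketch runs tabular Q-learning on a small corridor MDP with a delayed reward. The environment and all constants are illustrative assumptions; Q-learning is one classical RL algorithm, not the specific method used in this work.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2            # actions: 0 = left, 1 = right
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.1    # discount, learning rate, exploration rate

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0    # delayed reward at the goal
    return s_next, reward

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))
for episode in range(500):
    s = 0
    for t in range(50):
        a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(Q[s].argmax())
        s_next, r = step(s, a)
        # Temporal-difference update toward r + gamma * max_a' Q(s', a')
        Q[s, a] += ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # learned greedy policy: move right in every state
```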


2.2.4 Learning by Demonstration

Standard robotic control approaches usually require a formal description of the target behaviour, or a formal description of the behaviour assessment, as in Reinforcement Learning or Evolutionary Robotics (section 2.4.7). Learning by Demonstration proposes instead to train the controller based on examples of the target behaviour demonstrated by a teacher [Schaal, 1999] [Dillmann, 2003] [Billard and Siegwart, 2004]. Learning by Demonstration (LbD) is also referred to as apprenticeship learning, where expert demonstrations of the task are available. This is useful when no reward function can be defined adequately, as in Inverse Reinforcement Learning. The Learning by Demonstration technique is a direct application of supervised learning to robotic control. The aim is to acquire robust controllers from demonstrations of the desired behaviour, through tele-operation or imitation of the operation [Calinon and Billard, 2007].

LbD requires some interaction between the teacher (most probably a human, but possibly another autonomous robot) and the robot learner. The user may program the robot using speech, gestures or direct demonstration of the task, without requiring specific knowledge of Machine Learning. The robot observes, interprets, then acts according to what has been learned. This paradigm should provide the user with a natural and easy way to program robot behaviours for everyday use in human-inhabited environments.

As already mentioned, LbD relies on supervised learning, where the training data are provided by the demonstrated behaviours. In [Katagami and Yamada, 2001], a Khepera robot is trained by using classifier systems, genetic algorithms and joystick-driven demonstrations from a human supervisor; the goal is to enable the robot to play robotic soccer. In [LeCun et al., 2005], several demonstrations of a tele-operated mini off-road truck are recorded. The data are left and right low-resolution video inputs and the control output at each time step; a specific neural network (section 2.3) is trained and provides efficient results. Approaches based on standard regression or classifier systems are investigated in [Hugues and Drogoul, 2002], together with Micro-Population based Learning (MPL). In both settings, the user demonstrates the expected behaviour(s), and a model is trained on the recorded data in order to achieve such behaviour. In [Boucher and Dominey, 2006] and [Miura and Nishimura, 2007], robots are trained using guidance and available policies; the training generates new policies which are a combination of the previously existing policies. In [Miura and Nishimura, 2007], interactive learning is used to learn unknown but required intermediate policies.

Robotic arm manipulation is considered in [Atkeson and Schaal, 1997] and hand grasping in [Dillmann et al., 1999]. The combination of Reinforcement Learning and Inverse Reinforcement Learning with learning by demonstration is considered in [Abbeel and Ng, 2004, Abbeel, 2008]: the provided demonstrations make it possible to estimate a reward function as a linear combination of its features. In [Abbeel et al., 2007, Kolter et al., 2008], apprenticeship learning is applied to helicopter and quadruped robot control.

Micro-Population based Learning (MPL)

Micro-Population based Learning (MPL) has been designed by [Hugues and Drogoul, 2002] specifically for Learning by Demonstration. MPL works as follows: at each time step, the sensor input is split into micro-percepts, also referred to as tessels. These percepts can be considered at various granularities, from pixel colours in a video image to high-level concepts (chairs, doors, etc.). The original studies consider pixel micro-percepts from a video image; a tessel is hence defined as a pixel coordinate together with the colour information. The control model makes use of control rules. Each rule is based on a tessel and some desired control output provided by the demonstrations. For each sensor input within the demonstration, new rules are built. The set of rules defines the controller; at each time step, the rules which match the current sensor input are triggered and the corresponding actuator values are computed. Since the triggered rules could have different control outputs, the rules vote to produce the final control output (Fig. 2.6). The MPL algorithm was evaluated and validated on a real-world Pioneer 2 DX mobile robot platform with a video camera, for a set of navigation tasks including target following and slalom. It will be used as the baseline approach in Chapter 4.

In [Hugues, 2002], MPL is considered for a video image input. Demonstrations of the desired behaviour are performed and MPL generates a set of rules. Specifically, MPL is provided with the training data {(x_i, y_i), i = 1...m}, where x_i stands for a video image and y_i for the associated actuator values. Each video image x_i is a point in IR^d, where d = 30 × 40. x_i is stochastically generalised, retaining s among the d features, where s is a parameter of the algorithm. For each example (x_i, y_i), s random indices j are drawn in {1, ..., d}, and the corresponding rules are added: r_k(x_{ij}) → y_i.

Figure 2.6: Micro-Population based Learning: the sensor input is split into tessels. Among the rules, those recognising the tessel are triggered (in red) and return their control output. The final control is given by the aggregation of the whole set of activated rules.

MPL rule sets can be defined under several conditions. The basic model works as a reactive model, only taking one video image to produce the control output. Tessels can also be sampled in time from sequential video images (e.g. the pixel colour variation through time for a given coordinate) or combined in space (e.g. several pixels at different coordinates within the same image are considered to be one tessel). Moreover, the experiments presented in [Hugues, 2002] conclude that not all tessels within the input space need to be considered: a random sub-sampling achieves similar results with smaller rule sets, enabling real-time computation of the control output.
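A much simplified, illustrative reading of this rule-voting scheme is sketched below; the pixel tolerance, the sampling rate and all names are assumptions made for the sake of the example and do not reproduce the original MPL implementation.

```python
import numpy as np

RULES = []          # list of (pixel_index, pixel_value, command)
S = 50              # number of tessels sampled per demonstration frame (assumption)
TOL = 10            # colour tolerance for a rule to match (assumption)

def add_demonstration(image, command, rng):
    """Build S rules from one demonstration frame and its demonstrated command."""
    flat = image.ravel()
    for j in rng.choice(flat.size, size=S, replace=False):
        RULES.append((j, flat[j], command))

def control(image):
    """Trigger all matching rules and average their commands (simple voting)."""
    flat = image.ravel()
    votes = [cmd for (j, val, cmd) in RULES
             if abs(int(flat[j]) - int(val)) <= TOL]
    return np.mean(votes, axis=0) if votes else np.zeros(2)

rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(30, 40), dtype=np.uint8)
add_demonstration(demo, np.array([0.5, 0.5]), rng)   # e.g. (left, right) wheel speeds
print(control(demo))                                  # -> close to [0.5, 0.5]
```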

2.3 Artificial Neural Networks

2.3.1 Introduction

Artificial Neural Networks, or Neural Networks, are computational models inspired from biological neural systems [McCulloch and Pitts, 1943] [Rosenblatt, 1958]. Neural Networks involve synaptic connections between elementary computational units referred to as neurons. As a sub-field of Machine Learning, neural networks can be trained to address difficult tasks [Widrow and Hoff, 1960] [LeCun, 1985] [Rumelhart et al., 1988]. Not only are Artificial Neural Networks efficient models for many tasks; they also prove robust to noise and support real-time computation. The main factor limiting their use is the difficulty of interpreting the model once trained, e.g. compared to decision trees. This section formally introduces artificial neural networks and describes the standard training methods; lastly, the recent Reservoir Computing paradigm is presented.

2.3.2 Definitions

Neuron

A neuron, also named unit, cell or node, is a mathematical and computational representation of the biological neuron (Fig. 2.7). A unit has an input vector x = {x_1, ..., x_n} and one output value y. To each input x_i applies a weight w_i. The unit computes a weighted sum of its inputs; an activation function f is applied to this sum and produced as the output of the unit. A threshold value θ, referred to as bias, defines the threshold of the activation value:

$$y = f\Big( \sum_{i=1}^{n} w_i x_i - \theta \Big)$$

For practical implementation reasons, the bias θ is often represented as an additional constant input x_0 = 1 connected to the neuron with a connection weight w_0 = -θ, giving:

$$y = f\Big( \sum_{i=0}^{n} w_i x_i \Big)$$
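As a minimal illustration of this neuron model, the sketch below computes the output of a single unit with a sigmoid activation (discussed below); the input, weight and threshold values are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, theta):
    # Equivalent to appending x0 = 1 with weight w0 = -theta
    return sigmoid(np.dot(w, x) - theta)

x = np.array([0.2, -0.5, 0.9])      # inputs
w = np.array([1.0, 0.4, -0.7])      # connection weights
print(neuron(x, w, theta=0.1))      # single scalar output in (0, 1)
```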

The usual activation function is sigmoidal, although other functions can also be used (Fig. 2.8). The reason motivating the use of a sigmoidal function rather than a simple threshold function is that the sigmoidal function is differentiable. This key property is important when considering the training of neural networks. Moreover, the sigmoid function provides a continuous non-linear function of its inputs, enforcing a smooth behaviour.

Figure 2.7: Neuron: a unit provides an output value, obtained by applying an activation function f to the weighted sum of the input values minus the threshold θ.

Figure 2.8: Typical activation functions, from left to right and top to bottom: threshold, linear, piecewise linear (saturation), sigmoid.

Neural Network

A neural network is a graph whose nodes are neurons. Neural networks map an input x ∈ IR^d onto an output y ∈ IR^m. In the topology, three types of neurons can be distinguished: input neurons, hidden neurons and output neurons. The input neurons receive data from outside of the network; the output neurons provide data to the outside of the network; the inputs and outputs of hidden neurons remain within the network, thus being hidden from the outside. If a connection is established between two neurons, from i to j, the corresponding weight is noted w_ij.


Figure 2.9: 4-layered neural network topology.

Figure 2.10: Neural network topologies: feed-forward neural network (left), recurrent neural network (right).

Figure 2.11: Two examples of neural network topologies: Hopfield network topology (left) and Elman network (right). In Elman networks, all neurons within the hidden layer are connected to each other.

Neural network topologies can be categorised as feed-forward or recurrent (Fig. 2.10). In a feed-forward topology, the information flow is propagated from the input to the output layer; the output only depends on the input vector. In recurrent networks, cycles exist in the topology graph, including feed-back connections: the output produced at time t depends on the inputs of the t' previous time steps. Recurrent networks express dynamical properties while feed-forward networks do not. The recurrences can be seen as states reached based on previous input sequences. Typical recurrent topologies include the Hopfield or Elman networks [Elman, 1990b], to cite a few (Fig. 2.11).

Neural Networks can be structured in layers of neurons. Such layered neural networks are named after their number of hidden layers of neurons: a neural network having one input layer, one hidden layer and one output layer is a 1-hidden-layer neural network. When considering a recurrent neural network, the update rule, computing the output vector of a layer from its input vector, depends on the topology. The update can be synchronous (all neuron outputs are computed at the same time) or asynchronous (neurons are updated after their input neurons are updated). The update rule considered in this work is asynchronous.

Recurrent networks suffer from three main limitations: memory lifespan, possible memory saturation and high training cost. Depending on the network topology, its dynamics might lead to extremely fast memory fading. Training sequences of actions in recurrent neural networks is hard, since the most recent stimulus will dominate the previous stimuli. Moreover, the problem of saturation may occur, when the network is stuck in one state and unable to react to new input stimuli. Recurrent neural networks potentially have a larger number of training parameters than feed-forward neural networks, due to the recurrent connections (neurons and connection weights); they require a sufficient amount of training data in order to cope with all possible dynamics within the network. A recent paradigm, known as Reservoir Computing, proposes to tackle those limitations and is presented below.

2.3.3 Learning in neural networks

Neural networks are adaptive models; their design is achieved through learning or optimisation methods. Training methods have been developed that allow neural networks to express their capabilities on problems formalised in either the supervised or the unsupervised paradigm (section 2.2). While the methods presented in this section consider a fixed topology and optimise the weights, some approaches optimise both the topology and the connection weights (section 2.4.6). Whatever the approach, samples x_i are drawn randomly in each training step until all samples have been selected once, and the process is repeated until some stopping criterion is met (a finite number of iterations,
an error threshold, or convergence to a local optimum, for instance). The standard setup optimises the connection weights in fixed-topology neural networks. These weights command the relations between neurons and the meaning of the network. Through the learning process, parts of the network might end up specialising for specific parts of the task. Due to the nature of the topology, this specialisation usually goes beyond comprehension; the sole understandable part is usually the outputs produced by the output neurons.

Hebbian rule

One basis for training neural networks in the unsupervised learning setting is the Hebbian learning rule proposed by [Hebb, 1949]: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." Hence, if two neurons are active at the same time, then their inter-connection is strengthened. For a neuron with output y receiving input x_j through weight w_j, the weight is changed by Δw_j, where:

$$\Delta w_j = \gamma\, x_j\, y, \qquad y = f\Big(\sum_i w_i x_i\Big)$$

where γ is a positive constant representing the learning rate.

Widrow-Hoff rule

The Widrow-Hoff training rule considers the standard supervised learning setting, where the data set {(x_i, y_i)}_{i=1}^{m} contains m samples x_i and their labels y_i (section 2.2.2). For each input x_m provided, the network computes an output y^m; the difference with the target value y^{*m} can be evaluated as (y^{*m} − y^m). The delta rule uses a cost, or error, function based on the Euclidean distance between the known targets and the network outputs in order to adjust the weights. The total error is given by the following formula:

$$E = \frac{1}{2}\sum_{m} \left(y^{*m} - y^m\right)^2$$

The weights are varied to reduce the error, according to the so-called gradient descent method. Gradient descent is a first-order optimisation algorithm: to find a local minimum of a function, one takes steps proportional to the negative of the gradient of the function at the current point:

$$\Delta w_j = -\gamma \frac{\partial E}{\partial w_j}, \qquad \text{where} \quad \frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial w_j}$$

Back-propagation algorithm

The back-propagation algorithm extends the delta rule to multilayer neural networks. Thanks to differentiable activation functions, the gradient of the error can be estimated from one layer to the next, back from the output to the inputs, and used to modify the weights accordingly. The algorithm is repeated for several iterations over the training data; the weight change for one sample proceeds as follows: firstly, all neuron output values are computed from the input; the error estimate is then computed backward from the output to the input neurons; all weights are finally changed according to the learning rule. In feed-forward neural networks, the output y of the network is computed by propagating the input x through the layers. The error E between the output and the desired output is obtained as follows:

$$E = \frac{1}{2}\,(y_i - y)^2$$

where y_i is the desired output and y the output produced by the network. The error signal can be defined as

$$\delta_j = -\frac{\partial E}{\partial \left(\sum_{i=0}^{n} w_{ij} x_i\right)}$$

and the learning rule is:

$$\Delta w_{ij} = \gamma\, \delta_j\, f\Big(\sum_{i=0}^{n} w_{ij} x_i\Big)$$
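For concreteness, the sketch below applies the gradient-descent update Δw = −γ ∂E/∂w to a single sigmoid unit on synthetic data; the data, learning rate and number of iterations are illustrative assumptions, not the settings used in this work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                         # 200 samples, 3 inputs
targets = sigmoid(X @ np.array([1.5, -2.0, 0.5]))     # outputs of a "teacher" unit
w = np.zeros(3)
gamma = 1.0

for epoch in range(5000):
    y = sigmoid(X @ w)                                # forward pass
    # dE/dw for E = 1/2 * sum (y - t)^2, with y = sigmoid(X w)
    grad = X.T @ ((y - targets) * y * (1 - y))
    w -= gamma * grad / len(X)                        # gradient-descent step

print(w)   # tends toward the teacher weights [1.5, -2.0, 0.5]
```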

Training recurrent networks

Most of the training algorithms presented above only apply to feed-forward networks. A recurrent neural network can however be reformulated as a feed-forward neural network by unfolding it in time over some number of iterations [Minsky and Papert, 1969] (Fig. 2.12). The Back-Propagation Through Time (BPTT) algorithm [Werbos, 1990] proceeds as back-propagation on the unfolded neural network. The Real-Time Recurrent Learning (RTRL) algorithm computes the derivatives of the states and outputs with respect to the weights during the forward computation [Williams and Zipser, 1989]. A few other training methods exist, folding either the space or the time representation of the network, albeit with a high computational cost. Black-box optimisation methods can be used as an alternative for neural network training.

Figure 2.12: Standard Elman topology representation (left) and the same network with a flattened contextual representation (right): the hidden layer is duplicated as an input of the network. At time t, the duplicated layer corresponds to the hidden layer state at time t−1. This representation could include several duplications for times t−2, t−3, ...

Other adjustments

As the presented methods only apply to networks with fixed topologies, the remaining question concerns the expressiveness of the network: is it high enough to allow the network to reach the solution, or too high, hindering training due to the high dimension or the variance of the training samples? Experimental tuning of the network size (through the number of hidden neurons) does not necessarily have to be done to find an appropriate network: methods exist, such as optimal brain damage [LeCun et al., 1990], in which useless connections are removed along the training process, or optimal brain surgeon [Stork and Hassibi, 1993], which also removes neurons.

Optimisation

Another possibility for training (recurrent) neural networks is based on optimisation algorithms. Notably, the optimisation approach is an adequate alternative when training data are not available but an assessment criterion is. Optimisation algorithms deal with parametric optimisation (the target solution is a fixed-size vector) or non-parametric optimisation (e.g. aimed at finding the neural network topology, the weights, and/or the parameters of the activation functions). Some parametric and non-parametric stochastic optimisation algorithms, used in this thesis to optimise neural networks, will be introduced in section 2.4, including the NEAT algorithm [Stanley and Miikkulainen, 2002], specifically designed for the non-parametric optimisation of neural networks.

2.3.4 Reservoir computing: Echo State Networks

As mentioned earlier on, while recurrent neural networks encode some memory of the inputs presented in the last time steps through the state of the hidden neurons, their training raises specific difficulties; adjusting the memory lifespan and preventing network saturation is also done by trial and error. A new approach, Reservoir Computing (RC) [Jaeger et al., 2006, Verstraeten et al., 2007], involves an original recurrent architecture together with a training algorithm, enforcing the control of the memory lifespan within an affordable training effort. RC comes in two flavours, Liquid State Machines [Maass et al., 2002] and Echo State Networks (ESNs). Most RC models use spiking neurons, a more biologically plausible type of neuron [Maass, 1996]. Only ESNs [Jaeger, 2001, Jaeger, 2002] with sigmoidal activation functions will be considered in the following and are described in this section, referring the reader to [Jaeger, 2002] for a more comprehensive presentation.

Reservoir computing involves a set of hidden neurons, referred to as the reservoir, which are randomly connected together (as opposed to e.g. Elman NNs, where hidden neurons are fully connected). The connection matrix is controlled by a density parameter ρ (usual values for ρ are 10 or 20%). The hidden neurons, connected to the input neurons, express several different dynamics due to the random cycles created by the random connection generation. The idea behind RC is that the desired output can be sought as a linear combination of the hidden neurons. Formally, input and output neurons are fully connected to the hidden neurons (Figure 2.13). Some other connections (from input to output neurons, from output to reservoir neurons) can be used, but they are not considered in the ESN topologies used in the following chapters.

Figure 2.13: Typical ESN topology: reservoir (hidden) neurons are fully connected to the input and output neurons. The connections among hidden neurons are stochastically defined according to a density parameter.

Both the memory saturation and the memory lifespan are controlled from the connection matrix, by upper-bounding its maximal eigenvalue α, referred to as the damping factor or spectral radius. By setting α < 1, one avoids memory saturation (when the neural network output does not depend on the input any more), since memory saturation occurs when the state of the hidden neurons corresponds to a fixed point with value 1. Likewise, the α value controls the memory lifespan; the smaller α, the sooner the information encoded through the hidden neuron states is forgotten.

Structure

Formally, an ESN is defined from its inputs x_1, ..., x_K, its outputs y_1, ..., y_L, and N hidden neurons e_1, ..., e_N (Fig. 2.13), where each hidden and output neuron implements a standard sigmoidal activation function. As mentioned above, the inputs are connected to the hidden neurons through the K × N matrix W_in. Likewise, the hidden neurons are connected to the outputs through the N × L weight matrix W_out. Finally, the hidden neurons are connected together through the N × N matrix W, where w_ij denotes the connection weight from neuron e_i to neuron e_j. Matrix W is sparse, depending on the density factor ρ.

The specificity of ESNs compared to standard neural networks is twofold. Firstly, ESN training only modifies the matrix W_out, referred to as the readout matrix. Therefore, the underlying optimisation problem has linear size in the number N of hidden neurons, contrasting with the quadratic size for typical Elman recurrent networks. Secondly, the ESN core, made of the hidden connection matrix W, is specified from two macro-parameters, namely the connectivity rate ρ and the damping factor α, the largest eigenvalue of W. Formally, w_ij is set to 0 with probability 1−ρ; otherwise, w_ij is uniformly drawn in [−a, a]. The weights are finally adjusted such that α satisfies the constraint on the damping factor (usually α ∈ [0.8, 0.95]). The ESN design thus involves two hyper-parameters, the connectivity rate ρ and the damping factor α, besides the weights on the output connections only. In contrast, standard NN design involves several hyper-parameters (size and number of layers, learning rate, connections), in addition to the weights on all neural network connections.

Training

The ESN training only considers the readout matrix (the weights on the connections from the reservoir to the output neurons). Provided the reservoir offers a sufficiently rich catalogue of dynamics depending on the input sequence, the training task is to approximate the desired dynamics as a linear combination of the reservoir neurons. When the output values are known, ESNs can be trained using linear regression. The main difficulty lies in the stochasticity of the approach: the reservoir dynamics cannot be predicted from the two hyper-parameters. In practice, an efficient ESN design thus proceeds by trial and error, generating several ESNs with the same parameters and retaining the best one.
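For illustration, the sketch below builds a small ESN along the lines described above: a sparse random reservoir rescaled to a prescribed spectral radius, whose readout is fitted by least squares on a toy prediction task. The sizes, constants, tanh activation (one possible sigmoidal choice) and the task itself are illustrative assumptions, not the setup used later in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, L = 1, 100, 1            # inputs, reservoir neurons, outputs
rho, alpha = 0.2, 0.9          # connectivity rate and damping factor

W_in = rng.uniform(-0.5, 0.5, size=(N, K))
W = rng.uniform(-1, 1, size=(N, N)) * (rng.random((N, N)) < rho)
W *= alpha / max(abs(np.linalg.eigvals(W)))      # enforce spectral radius alpha

def run_reservoir(inputs):
    states, x = [], np.zeros(N)
    for u in inputs:
        x = np.tanh(W @ x + W_in @ u)            # reservoir update
        states.append(x.copy())
    return np.array(states)

# Toy task: predict sin(t + 0.3) from sin(t)
t = np.linspace(0, 20 * np.pi, 2000)
U, Y = np.sin(t)[:, None], np.sin(t + 0.3)[:, None]
S = run_reservoir(U)
W_out, *_ = np.linalg.lstsq(S[200:], Y[200:], rcond=None)   # fit readout, discard warm-up
print(np.mean((S[200:] @ W_out - Y[200:]) ** 2))            # small training MSE
```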

2.4 Evolutionary Computation

Evolutionary Computation pertains to the field of Stochastic Optimisation and Metaheuristic methods. Metaheuristic methods are problem-solving approaches independent from the problem representation; stochastic optimisation approaches involve a stochastic exploration of the solution space. This section first introduces optimisation and stochastic optimisation, before describing Evolutionary Computation, specifically Evolution Strategies (ES) [Rechenberg, 1965]. The Neuro-Evolution of Augmenting Topologies (NEAT) algorithm [Stanley and Miikkulainen, 2002] is thereafter detailed, before reviewing the state of the art in Evolutionary Robotics [Nolfi and Floreano, 2000].

2.4.1 Optimisation, Stochastic Optimisation

Optimisation aims at finding the optima x* of a function f, a.k.a. the fitness function, defined on some search space S. One distinguishes combinatorial and continuous optimisation, respectively involving a discrete or a continuous search space S. A solution x* is a global optimum iff

$$\forall x \in S \quad \begin{cases} f(x^*) \geq f(x) & \text{(maximisation of } f\text{)} \\ f(x^*) \leq f(x) & \text{(minimisation of } f\text{)} \end{cases}$$

A solution x is a local optimum iff there exists a neighbourhood N(x) ⊂ S such that x is a global optimum on N(x) (Fig. 2.14).

Figure 2.14: Global Optimisation (maximisation): x* is the global optimum and x′ a local optimum.

A well-posed optimisation problem is defined on a continuous search space S, and corresponds to a continuous, differentiable, convex fitness function f; in such cases, the global optimum exists, is unique and can be found using gradient methods. Ill-posed optimisation problems include mixed (discrete and continuous) search spaces S, and functions f that are non-convex and admit local optima, or that are not differentiable. In many ill-posed optimisation problems, stochastic heuristics are used to implement any-time algorithms, providing a solution within a given computational budget. Stochastic optimisation algorithms include in particular hill climbing, simulated annealing [Kirkpatrick et al., 1983], tabu search [Glover, 1989] and evolutionary computing [Eiben and Smith, 2003]; the randomness of the algorithm relates to the initial conditions and/or the navigation in the search space.

The key issue in stochastic optimisation is referred to as the Exploration versus Exploitation dilemma (EVE): the algorithm must sufficiently explore the search space (otherwise, no guarantee about the discovery of the global optimum can be provided); it must also exploit the good individuals found so far (as opposed to pure random search, which almost surely finds the global optimum but does not scale up with respect to the size of the search space). According to the No Free Lunch theorem [Wolpert, 1996] (established under very restrictive assumptions), there is no such thing as a universal optimisation algorithm. The success ultimately depends on the available prior knowledge about the structure of the search space, the fitness function, or the sought solution, and on how this prior knowledge is exploited within the optimisation algorithm. The success of Evolutionary Computation critically depends on its flexibility to incorporate whatever prior knowledge is available.

2.4.2 Evolutionary Computation

The first Evolutionary Computation algorithms appeared at almost the same time during the sixties: Rechenberg pioneered Evolution Strategies [Rechenberg, 1965] while Larry Fogel pioneered Evolutionary Programming [Fogel et al., 1966]. Later on, Holland pioneered Genetic Algorithms [Holland, 1975]. While all three paradigms rely on similar concepts, they apply to different data types: Evolution Strategies deal with continuous optimisation (S = IR^d); Evolutionary Programming deals with finite state automata; Genetic Algorithms deal with bit strings (S = {0, 1}^d). Genetic Programming, a fourth evolutionary paradigm, came later on [Cramer, 1985, Koza, 1992] and deals with program spaces. A main factor behind the spread of Evolutionary Computation, after the success of Goldberg's book Genetic Algorithms in Search, Optimisation and Machine Learning [Goldberg, 1989], is its simplicity and its empirical effectiveness, e.g. for industrial applications.

An EC algorithm evolves candidate solutions, referred to as Individuals; a set of individuals constitutes a Population. The Genotype is the coding representation of an individual, whose elementary attributes are called genes. The expression of the genotype is referred to as the Phenotype, conditioning the quality of the individual according to a fitness function (the function to be optimised). The algorithm iteratively transforms the current population into another one; the transition from the population at time t to the next one is referred to as a Generation. The initialisation of the process is done by sampling the first population in the search space, either uniformly or using prior knowledge. In each generation, individuals are evaluated according to the fitness function; the so-called parents are selected individuals, used to generate new individuals called offspring through stochastic variation operators (recombination and mutation). The replacement step builds the next population from the current offspring and possibly some parents. The algorithm continues until a termination condition is reached, then returning the best individual found. The canonical EC algorithm is given in pseudo-code in Algorithm 2.1 and in Figure 2.15.

Figure 2.15: Evolutionary process. Initialisation and recombination operators (crossover, mutation) are the genotypic operators making use of randomness (in pink on the figure). Selection and replacement are the phenotypic operators; they make use of the fitness value and possibly randomness (in green on the figure).

EC algorithms feature the following properties:

• 0-th order method: they only require the fitness function to be computable on (most) elements of the search space;

• any-time behaviour: like all stochastic algorithms, they can return the best solution reached within a prescribed computational budget;

• a high number of fitness computations is required (over a few tens of thousands in most cases) and 99.99% of the overall computational cost is spent in computing the fitness, making its effective and robust implementation a key requirement for the success of EC;

• the EVE dilemma, governing the quality of the found solutions, is controlled by the designer through trial and error.

Algorithm 2.1 General Scheme of an Evolutionary Algorithm
  INITIALISE population with random individuals
  EVALUATE each individual
  repeat
    SELECT parents
    RECOMBINE parents
    MUTATE the resulting offspring
    EVALUATE new individuals
    SELECT individuals for the next generation
  until TERMINATION CONDITION is satisfied

Among the various ingredients of EC algorithms, one distinguishes genotypic operators (initialisation, crossover, mutation, operating on the genotypic information) and phenotypic operators (selection and replacement, depending only on the fitness of the individuals).
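As an illustration of Algorithm 2.1, the following minimal Python sketch instantiates the scheme on a toy continuous problem, with tournament selection, uniform crossover, Gaussian mutation and generational replacement; all constants and the toy fitness function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, POP, GENS, SIGMA = 5, 30, 100, 0.3

def fitness(x):
    return -np.sum(x ** 2)                 # maximise -||x||^2, optimum at x = 0

def tournament(pop, fits, k=3):
    idx = rng.choice(len(pop), size=k, replace=False)
    return pop[idx[np.argmax(fits[idx])]]

pop = rng.uniform(-5, 5, size=(POP, DIM))  # INITIALISE
fits = np.array([fitness(x) for x in pop]) # EVALUATE
for gen in range(GENS):                    # repeat ... until budget exhausted
    offspring = []
    for _ in range(POP):
        p1, p2 = tournament(pop, fits), tournament(pop, fits)   # SELECT parents
        child = np.where(rng.random(DIM) < 0.5, p1, p2)         # RECOMBINE (uniform crossover)
        child = child + SIGMA * rng.normal(size=DIM)            # MUTATE (Gaussian)
        offspring.append(child)
    pop = np.array(offspring)                                   # generational replacement
    fits = np.array([fitness(x) for x in pop])                  # EVALUATE

print(max(fits))    # best fitness found (approaches 0)
```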

2.4.3 Phenotypic Operators

Selection

Each individual is evaluated at least once. One evaluation can involve several sub-evaluations, referred to as epochs, during which the fitness is computed for specific initial conditions. After all individuals in the population have been evaluated, the algorithm selects the parent individuals used to create the next generation (Fig. 2.15), in view of generating new and better individuals. The selection operator is either deterministic or stochastic. The simplest deterministic selection retains the best individuals among the offspring and the parents. This approach, commonly used in Evolution Strategies (see (µ + λ)-ES, section 2.4.5), might favour the premature convergence of the algorithm; its merit is to avoid losing a good individual. Among the stochastic selection heuristics are the roulette wheel (based either on fitness values or on ranks) and the tournament. Stochastic selection expectedly favours the fittest individuals, although unfit individuals get a chance to be selected too. In roulette wheel selection, the probability of selecting an individual is proportional to its fitness value. If P denotes the population size, P draws are launched to select (possibly multiple copies of) individuals in the current population. In some cases (large differences in the fitness values), many copies of the fittest individual might be selected. An alternative is provided by rank wheel selection, where the probability of selecting an individual is proportional to its rank (the better the fitness, the higher the rank). Tournament selection uniformly selects K individuals in the population and returns the best one.

Replacements

The replacement operator builds the next population from the current one and the offspring, based on the fitness values of all individuals. The simplest replacement (generational replacement) replaces the current population by the offspring generated from the selected parents. The main other replacement procedures are the (µ, λ) and (µ + λ) procedures used in Evolution Strategies. In both cases, µ parents are used to generate λ offspring. In (µ, λ)-ES, the best µ offspring are selected to become the next population. In (µ + λ)-ES, the best µ individuals among parents and offspring are selected and become the next population. As could be expected, both approaches have pros and cons: (µ + λ)-ES is prone to premature convergence as the best individual is never lost; only (µ, λ)-ES yields some guarantees of convergence toward the global optimum [Bäck et al., 1991, Functions et al., 1997].

Termination criterion

The most straightforward termination criteria are reaching the target fitness value or exhausting the computational budget. Other termination criteria are based on detecting the stagnation of evolution: when no fitness improvement has occurred for a given number of generations or fitness computations, or when the diversity of the population falls below a prescribed threshold.
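The (µ, λ) and (µ + λ) replacement schemes described above can be written compactly; the hedged sketch below assumes maximisation and NumPy arrays of individuals and fitness values (an illustrative formulation, not the implementation used later in this work).

```python
import numpy as np

def comma_replacement(offspring, off_fits, mu):
    """(mu, lambda): the next population is taken from the offspring only."""
    best = np.argsort(off_fits)[::-1][:mu]
    return offspring[best], off_fits[best]

def plus_replacement(parents, par_fits, offspring, off_fits, mu):
    """(mu + lambda): parents compete with the offspring for survival."""
    pool = np.vstack([parents, offspring])
    fits = np.concatenate([par_fits, off_fits])
    best = np.argsort(fits)[::-1][:mu]
    return pool[best], fits[best]
```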

2.4.4 Genotypic Operators

Initialisation

The first step in the evolutionary process concerns the initialisation of individuals within the search space. Two situations may occur. In the first case, some prior knowledge about potentially interesting regions is available, e.g. through former optimisation phases based on the current fitness function or other ones, or given by an expert. While such prior knowledge can facilitate the search, it might also lead to premature convergence. Otherwise, the first population is stochastically sampled in the search space. Quasi-random techniques (see [Sobol, 1967] or [Niederreiter, 1992]; Fig. 2.16) provide a better coverage of the search space, possibly improving the overall performance of EC [Kimura and Matsumura, 2005] [Teytaud and Gelly, 2007].

[Figure 2.16 plots: "Pseudo random draw" (left) and "quasi-random (Sobol) draw" (right), both over [−1, 1] × [−1, 1].]

Figure 2.16: Example of random draws: on the left, pseudo-random generation; on the right, quasi-random generation using the Sobol algorithm. We can notice a better, but still far from perfect, coverage of the space using quasi-random draws.

The population size is always small relative to the search space; some results regarding the appropriate size [Hansen and Ostermeier, 2001] and its adjustment [Hansen and Ostermeier, 1995] [Auger and Hansen, 2005] have been proposed.

Crossovers

The crossover operator, remotely inspired from sexual reproduction, uses two or several parents to produce offspring sharing their parents' genes. The canonical bit-string crossover is the 1-point crossover, splitting the parent genotypes at some randomly defined cutting point, and concatenating the first fragment of the first parent with the second fragment of the second parent, and vice versa (Fig. 2.17). Other crossover operators include the multi-point crossover and the uniform crossover [Sywerda, 1989].

Figure 2.17: One-point crossover.

Crossover operators enforce both Exploration and Exploitation depending on the current population, since the offspring lie in the convex hull defined by the parents: if the parents are far apart, crossover achieves exploration; otherwise, it achieves exploitation, exploring the close neighbourhood of the two parents. In the continuous case, it is recommended to slightly extend the crossover scope [Michalewicz, 1996]:

$$Offspring = (1 - \alpha)\, Parent_1 + \alpha\, Parent_2, \qquad \alpha \in [-0.5, 1.5]$$

in order to preserve the population span.

Mutation

Mutation proceeds by stochastically modifying some gene values in the individual, according to the mutation probability. It allows for Exploration, and for recovering from the so-called genetic drift in discrete spaces: when some gene value is no longer present in the population, only mutation (as opposed to crossover) can reintroduce this value. Binary mutation proceeds by flipping the value of the selected genes. In the continuous case, several heuristics have been proposed, based on Gaussian perturbations of the selected genes: x → N(x, σ), where the mutation step or amplitude is defined by σ (Fig. 2.19).

Figure 2.18: Mutation operator: the mutated genes are picked randomly and, in the binary case, their value is flipped.

Figure 2.19: Gaussian distribution centred on µ with variance σ².

As for the initialisation step, quasi-random mutation can improve on random mutation [Teytaud and Gelly, 2007]. The appropriate mixture of variation operators depends on the search space, the fitness landscape, and the other ingredients of the EC algorithm. Typically, Genetic Algorithms and Genetic Programming used to rely heavily on crossover, whereas Evolutionary Programming mostly uses mutation, and Evolution Strategies are agnostic. The main point is to enforce a good EVE trade-off, and to preserve the population diversity.
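For concreteness, the sketch below gives minimal implementations of the operators described above (one-point crossover and bit-flip mutation on bit strings, Gaussian mutation on real-valued genotypes); the probabilities and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_point_crossover(p1, p2):
    cut = rng.integers(1, len(p1))                   # random cutting point
    return (np.concatenate([p1[:cut], p2[cut:]]),
            np.concatenate([p2[:cut], p1[cut:]]))

def bit_flip_mutation(genotype, p_mut=0.05):
    mask = rng.random(len(genotype)) < p_mut
    return np.where(mask, 1 - genotype, genotype)    # flip the selected genes

def gaussian_mutation(genotype, sigma=0.1):
    return genotype + sigma * rng.normal(size=len(genotype))   # x -> N(x, sigma)

a = rng.integers(0, 2, size=10)
b = rng.integers(0, 2, size=10)
print(one_point_crossover(a, b))
print(bit_flip_mutation(a))
print(gaussian_mutation(rng.normal(size=5)))
```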

2.4.5 Evolution Strategies

Evolution Strategies, among the major Evolutionary Computation paradigms, are specifically designed for continuous optimisation. Following the standard EC skeleton (Fig. 2.15), they focus on the adjustment of the mutation step size σ (section 2.4.4). When σ is large (respectively small), the algorithm is biased toward Exploration (resp. Exploitation). The key to an effective convergence toward the global optimum is to adaptively adjust σ depending on the state of the search.

The adaptation of the step size σ proceeds in various ways. One simple, well-known approach is the 1/5th rule [Rechenberg, 1971]: if more than 20% of the mutated offspring lead to fitness improvements within the last N generations, then σ is multiplied by 1.22; if less than 20% of the offspring obtain a better fitness, then σ is divided by 1.22. This approach and its parameters were designed to be optimal on the sphere test function $f(x) = \sum_{i=1}^{d} x_i^2$, $x \in IR^d$ [Michalewicz, 1994], supposed to represent a typical fitness landscape for many optimisation problems. The 1/5th rule is suboptimal on some non-linear or non-convex functions [Chellapilla and Fogel, 1999].

Self-adaptive mutation proceeds by adding the mutation parameter σ (either a scalar, a vector or a matrix) to the individual genotype [Schwefel, 1981]. This way, evolution adjusts the mutation parameters for free; an individual with a suboptimal genotype x or suboptimal mutation parameters will have no descendants after a few generations. Said otherwise, only individuals with good genotypes and good σ values will have offspring and grand-children. The limitation of self-adaptive mutation is that it increases and complexifies the search space.

CMA-ES and sep-CMA-ES

The acknowledged best approach in continuous evolutionary optimisation is the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [Hansen and Ostermeier, 1996, Hansen and Ostermeier, 2001]. Covariance matrix adaptation is a method designed to update the covariance matrix of the multivariate normal mutation distribution in the evolution strategy. New candidate solutions are sampled according to the mutation distribution, and the covariance matrix describes the pairwise dependencies between the variables in the distribution. While CMA-ES can accommodate a small population size (depending on the search space dimensionality) [Hansen and Ostermeier, 1995], it also involves a restart policy possibly increasing the population size [Auger and Hansen, 2005], leading to a virtually parameter-less evolutionary algorithm. Two adaptation principles are exploited in the CMA-ES algorithm.


Firstly, a maximum likelihood principle, based on the idea of increasing the probability of a successful mutation step: the covariance matrix of the distribution is updated such that the likelihood of previously realised successful steps appearing again is increased. Consequently, the CMA conducts an iterated principal component analysis of the successful mutation steps while retaining all principal axes. Secondly, a path of the time evolution of the distribution mean of the strategy is recorded, called the evolution path. Such a path contains significant information about the correlation between consecutive steps. The evolution path is exploited in two ways: it is used for the covariance matrix adaptation procedure in place of single successful mutation steps, and it is used to conduct an additional step-size control. This step-size control aims to make consecutive movements of the distribution mean orthogonal in expectation, and effectively prevents premature convergence.

Separable CMA-ES (sep-CMA-ES) improves on the original CMA-ES algorithm [Ros and Hansen, 2008] in large-dimension search spaces, by using diagonal covariance matrices: their estimation has linear complexity in the dimension of the search space (as opposed to quadratic for CMA-ES), significantly improving the algorithm scalability. This algorithm is to be preferred to the standard CMA-ES when evolving neural networks with a large number of parameters.
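To illustrate the simplest of the step-size adaptation schemes above, the following sketch runs a (1+1)-ES with the 1/5th success rule on the sphere function; the adaptation factor follows the 1.22 value given earlier, while the window length, dimension and budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_WINDOW = 10, 50

def sphere(x):
    return np.sum(x ** 2)          # minimised at x = 0

x, sigma = rng.uniform(-5, 5, size=DIM), 1.0
successes = []
for it in range(2000):
    child = x + sigma * rng.normal(size=DIM)    # Gaussian mutation with step size sigma
    better = sphere(child) < sphere(x)
    if better:
        x = child
    successes.append(better)
    if len(successes) >= N_WINDOW:              # adapt sigma every N_WINDOW mutations
        rate = np.mean(successes)
        sigma = sigma * 1.22 if rate > 0.2 else sigma / 1.22
        successes = []

print(sphere(x), sigma)     # fitness close to 0, small final step size
```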

2.4.6 Neuro-Evolution of Augmenting Topologies

The above optimisation algorithms mainly apply to fixed-size vector genotypes, such as neural network connection weights. Neural network design, however, also involves the selection of the network topology (number of hidden neurons and connection graph). The Neuro-Evolution of Augmenting Topologies algorithm (NEAT) [Stanley and Miikkulainen, 2002] has been specifically designed for the non-parametric optimisation of neural networks through a flexible genetic encoding [Stanley, 2004]. NEAT features three specific heuristics, endowing the algorithm with an adequate search policy within a potentially large search space: homology detection, innovation protection and structure minimisation.

The first principle is homology: NEAT encodes each node and connection in a network with a gene. Whenever a structural mutation results in a new gene, that gene receives a historical marking. Historical markings are used to match up homologous genes during crossover, and to define a compatibility operator. Historical markings allow NEAT to perform crossover without analysing topologies; genomes of different organisations and sizes stay compatible throughout evolution. This methodology allows NEAT to complexify the structure while different networks still remain compatible.

The second principle is protecting innovation. A compatibility operator is used to speciate the population, which protects innovative solutions and prevents incompatible genomes from crossing over. Generating new offspring requires adding new neurons and connections between neurons according to some user-specified probabilities, possibly leading to large fitness variations. Innovation protection relies on fitness sharing [Goldberg and Richardson, 1987] and the pruning of stagnant species. Speciation in NEAT allows niches to be considered, so that individuals compete primarily within their own niches instead of with the whole population. This way, topological innovations are protected and have time to optimise their structure before they have to compete with other niches in the population. In addition, speciation prevents the bloating of genomes: species with smaller genomes survive as long as their fitness is competitive, ensuring that small networks are not replaced by larger ones unnecessarily.

Finally, NEAT follows the philosophy that search should begin in as small a space as possible and expand gradually. Evolution in NEAT always begins with a population of minimal structures. Structural mutations add new connections and nodes to the networks in the population, leading to incremental growth. Topological innovations have a chance to realise their potential because they are protected from the rest of the population by speciation. Because only useful structural additions tend to survive in the long term, the structures being optimised tend to be the minimum necessary to solve the problem.

2.4.7 Evolutionary Robotics

Evolutionary Robotics emerges in the 1990s as a direct application of Evolutionary Computation to autonomous robotic control. The term evolutionary robotics is proposed in [Harvey and Husbands, 1992] and the interest in ER grows quickly. While most ER work relies on simulated robots, real robots are considered as well [Miglino et al., 1995]. Neural network controllers are widely considered in these experiments, due to bio-inspiration [Nelson et al., 2003] and to the good properties of neural networks, although other models such as Markov Decision Processes [Bellman, 1957], fuzzy controllers [Driankov et al., 1996] or Learning Classifier Systems (LCS) [Sigaud, 2007] have also been considered. In [Nolfi et al., 1994] and [Nolfi and Parisi, 1993], learning occurs during the controller evaluation thanks to the back-propagation algorithm. Teaching robots to walk is considered in [Gruau, 1994] and [Jakobi, 1998]. Complex navigation behaviour is investigated in [Capi and Doya, 2005], with sequential tasks requiring the robot to visit three places in the right order, and in [Nelson et al., 2003], where robots are competitively evolved to find target objects in large and complex environments. The interested reader is referred to [Nolfi and Floreano, 2000] for a more comprehensive presentation.

Figure 2.20 illustrates a standard evolutionary robotics process. A population of random controllers is created in the first generation. Each controller is evaluated within some environment according to the fitness function. The best controllers are selected and undergo variation operators to generate offspring. Along the course of evolution, the controllers improve in order to maximise the fitness function.

LCS rely on genetic algorithms to optimise sensor-matching rules [Holland, 1975] [Sigaud, 2007]. The control system is a set of rules, made of premises, actions and rewards. The rule premise conditions the applicability of the rule; it is defined as a sensor pattern. The action part is the control command; the reward is an estimation of the pay-off relative to the given action in the given context. In each time step, the applicable rules are determined and the resulting action is computed from the rule actions and the associated rewards. The generality of the premises (tuned by the GA) commands the effectiveness of the rule set (the fraction of applicable rules in each time step).

In [Miglino et al., 1995], neural network controllers are evolved in-situ, using the robot internal data (sensor values and motor activations) to estimate the fitness value. The training time cost is very high (several hours on a standard PC) for evolving an obstacle avoidance behaviour. Many typical evolutionary robotics experiments are detailed in [Nolfi and Floreano, 2000], ranging from obstacle avoidance (in-silico) to prey-predator co-evolution and hardware design. The current trends in evolutionary robotics focus on the simultaneous evolution of the robot structure and controller [Yasuda et al., 2007], the evolution of robotic swarms [Baldassarre et al., 2006], and the use of behavioural models for anomaly detection and fast controller repair. The latter trend is illustrated by [Bongard and Lipson, 2004], where a legged robot builds an empirical model of its behaviour (an embedded simplified simulator). When a failure occurs, an evolutionary process is launched in order to adapt the robot to its new situation with only a few trials on the actual robot. The robot possesses a proprioceptive model of itself, allowing it to detect failures or changes in its behaviour.

Figure 2.20: Evolutionary Robotics standard process: the first generation involves randomly generated robot controllers; each controller is evaluated according to the fitness function. The best individuals tend to be selected in order to produce new individuals. The new individuals undergo variations and replace the old ones, leading to the next generation. Thanks to the fitness function, adequate control characteristics emerge within the individuals, increasing the performance over time, until the algorithm reaches some termination criterion.


Chapter 3

Anticipation and Adaptation

In this chapter, the problem of transferring controllers obtained in simulation to real robots, known as the reality gap, is considered. Controllers evolved in simulation are confronted to the physical robot. An approach based on an anticipation module is implemented and evaluated with respect to the reduction of the reality gap impact.

3.1 Motivation: Reality Gap

The Evolutionary Robotics approach addresses the problem of the automatic design of robotic controllers [Nolfi and Floreano, 2000]. Such algorithms are particularly well suited when the target behaviour is difficult to specify but can easily be assessed. These optimisation algorithms have been shown to be very efficient for robotic control problems. The number and cost of fitness evaluations needed for optimisation, however, generally precludes the controller assessment in-situ, i.e. in the real world. The controllers evolved are most often run and assessed within a simulation platform (in-silico). Controller evolution, however, has long been rightfully criticised because of the poor modelling of the robot's physical behaviour in-situ. Short- and long-term disturbances are either ignored or loosely modelled within the simulator because of the potential difficulty of the task [Lipson et al., 2006] [Nehmzow, 2001] [Jakobi et al., 1995]. Hence, evolved controllers are very difficult to implement on a real-world robot and experiments are usually limited to simulated environments. A clear effect of the reality gap is presented in [Boeing et al., 2004], where biped walking gaits are evolved in-silico but cannot run efficiently in-situ.

3.1.1 Disturbances

The study focuses on the adaptive on-line calibration process with respect to locomotion disturbances1 . The approach presented is based on the definition of an Anticipation Module (AM) together with the control architecture. The anticipation mechanism is trained to model interactions between the robot and its environment. The anticipation module is used to estimate whether the effects of a given action correspond to the robot expectancy, or whether some error occured. The feedback of the Anticipation Module is used to perform on-line adaptation of the controler in order to recover from the possible disturbances. The presented work aims at optimising controllers that are robust with respect to disturbances in the real world. Several kinds of disturbances are considered, most of them resulting in directional biases during the robot locomotion: 1. calibration disturbances depending on wheels or motor characteristics at initialisation such as wheel diameter or power (Fig. 3.1) 2. isolated disturbances very limited in time. Slides or collisions fall into this category. 3. long-lasting or wear-related disturbances due to a partial minor hardware failures, lasting energy supply problem, external disturbances or change in the wheel texture due to wear. 4. intensity-changing disturbances; ongoing loss of energy for one motor, or aggravating motor problem Isolated disturbances will most probably have a low impact in time and the controller will handle the situation with immediate recovery without requiring anticipation module to act. The controller may not handle the other disturbances, having to oscillate between the goal-oriented control and the error compensation. For instance, lets consider a robot, with a malfunctioning motor, whose behaviour is to follow the walls to reach the exit of a maze. The robot is to keep appropriate distance to the wall, but it will either continuously get nearer or farther to the wall due to the failure. Without control compensation, the controller will switch to follow the wall, get nearer or get farther behaviours, resulting in a poor ability to behave. 1

[1] Calibration issues, partial hardware or software failures, medium- and long-lasting changes in the environment.


Figure 3.1: On the left, the behaviour trace for one robot; on the right, a slight calibration change exists on the right wheel, resulting in a collision on the left. Solid elements represents obstacles such as walls while circles represents screenshots of the robots position at each time step.

3.1.2 Off-line gap avoidance

One extreme approach, to reduce the reality gap, is to evolve controllers directly on real robots, as done in [Floreano and Mondada, 1994] where an avoidance control is evolved on a real robot. In this work, a generation takes approximately 39 minutes, making 65 hours for 100 generations to achieve design of a wander behaviour. While this makes it possible to avoid the reality gap, it is extremely time consuming both from the robot and from the human supervisor viewpoint. During 65 hours however, it is extremely likely that electronic or mechanical failures will occur, making the Evolutionary Robotic in-situ not only very time consuming, but also prone to failures. The same experiments in simulation could be performed in a matter of minutes. A different approach consists in sampling robot sensors or effectors to sustain a more accurate simulation, so that to reduce the reality gap. In [Lund and Miglino, 1996], the simulator is designed thanks to sensor and motor sampling on the physical robot. The robots then evolved in simulation perform similarly in-silico and in-situ. Real world dynamics are modeled based on robotic arm control samples in [Nakamura and Hashimoto, 2007]. Noise is considered as unavoidable when evolving in simulation [Jakobi et al., 1995]. Only the useful features are to be precisely modeled, leaving the other features within the fog of 45


noise. The nature of the noise remain dependent on sensor and actuator sampling [Lund and Miglino, 1996]. Sensor coupling might result in sensor correlated answer, requiring for specific modeling [Meeden, 1998]. In [Miglino et al., 1995], a comparative study of various noise models is done. The most appropriate noise model is the one established thanks to sensor sampling, the conservative noise. An alternative to these approaches is the use of both simulation and physical robots. Some specific settings [Miglino et al., 1995] aims at the best of both worlds, pasting the best controllers evolved in simulation onto the real robot to perform a few evolutionary steps in-situ. This approach adapts the controller to the specificities of the given robot and world while lowering the experimental cost. As mentioned in these studies, each sensor and actuator, and more generally all robots are different. By adjusting the simulation to a specific robot, one forgets about the general performance of the controllers. This fact limits the interest into the previous approach of evolving in simulation at first and then on physical robots.

3.1.3 On-line gap avoidance

Adapting the controller in order to cope with the specificities of the robot or of the environment is referred to as adaptiveness to (long-)lasting changes. In [Nolfi and Parisi, 1993], an adaptive neural network model is proposed, aiming at evolving robust controllers. During the evolutionary process, environmental variations introduce variations in the efficiency of the sensors to detect obstacles. The model, referred to as the auto-teaching setting, proceeds as follows: evolution optimises all the initial weights of the control and teacher neural networks; during the robot's life, the teacher network is frozen while the control part is adapted by back-propagation, from the difference between the actual controller output and the desired one produced by the teacher network. The system fitness is computed from the behaviour of the auto-teaching neural network, assessing both its initial values and its ability to cope with the environment changes. [Godzik, 2005], however, made it clear that the auto-teaching setting itself is prone to evolutionary opportunism: when the controller is observed over a period longer than the one considered during evolution, the best auto-teaching controllers always lose their base control features, making them unable to navigate any more. [Godzik, 2005] proposes a new architecture, namely Action Anticipation Adaptation (AAA), embedding a control and an anticipatory network. The anticipatory network's task is to predict the sensor values at time t + 1 from the sensor and motor activations at time t; this prediction can be directly compared to the actual sensor input at the next time step t + 1, and the difference is used to adjust the network through back-propagation. Results show that long-term robustness is possible for the same night-and-day experimental setting as in [Nolfi and Parisi, 1993]. A possible limitation of the AAA architecture remains, somewhat similar to that of the auto-teaching approach: the control network is changed using the back-propagation algorithm based on the prediction error, and not on a model of the behaviour. Networks are evolved so that this correction helps maintain the controller's capacity to behave for longer than in the auto-teaching approach; the controller, however, is still likely to lose its correct behaviour in the long term.

3.2 Anticipation based Approach

The proposed approach will thus present a principled use of Anticipation Module. As shown in the previous works, anticipation modules were concerned with adaptation of the controller towards long-lasting sensor disturbances, the anticipatory mechanism provides an efficient way to take into consideration environmental disturbances. Indeed, the main point of such a mechanism is to provide an error rate based on the expected consequence of an action in the environment. The concept of anticipation based adaptation is thus considered in the presented study to reduce the reality gap from simulation to physical world. An anticipatory behaviour is one that will not only make use of present and past data, but also of predictions over the future [Butz et al., 2007]. Early works in evolutionary robotics already show the interest of using anticipation for behaviour planning [Nolfi et al., 1994]. Formally [Rosen, 1985], an anticipation system contains the model of a system with which it interacts, being predictive over the future states of the system. The present state of the model system causes changes of states within the anticipation system. These changes are involved in the interaction of the anticipation system and model system, without affecting the model of the system. With regard to this definition, in [Nolfi et al., 1994] or [Godzik et al., 2004], the controllers belong to the model system and are directly affected by the anticipation based learning, which result in alterations of the behaviour. 47


While the evolutionary process selected the best controllers together with the best anticipation modules for adaptation of the control, the disturbances on the long run can be explained due to this direct disturbance of the model system with regard to anticipation without explicit improvement criterion. An anticipation-based model is meant to play the role of an error detector rather than an exact model of the world. Simply presented: an action will most probably have a consequence. If this consequence is predicted, then, one can identify differences with this prediction and the effective consequence. This way, any error introduced by any bias can be identified using these differences. In the context of robotics, if the difference is due by any wear on the wheels or sensor noise over the robot, then, the control produced can be adjusted in order to cope with hardware specificities without altering the performance of the controller.

3.2.1 Predictive architecture

The main idea behind our model concerns the adaptation of the control output, not of the controller. Since the optimised controller displays an adequate behaviour for the task at hand, changing it is to some extent meaningless, as the quality of such changes cannot in general be evaluated on-line after evolution. Adapting to changes may require such a modification of the controller; in the general case, however, what needs to be adapted is the intensity of the response to the new conditions, not the behaviour itself. The use of anticipation not only allows the controller to plan future actions based on beliefs acquired about the world, it also allows it to detect when an action does not have the expected effect. The global architecture proposed is presented in Figure 3.2: the anticipation module predicts the variation in orientation[2], AVS_t, from the desired motor outputs. The predicted variation is confronted to the actually observed variation in orientation, VS_t, which is accessible to the robot, enabling the evaluation of the prediction error; this error is finally used to correct the desired actions A_t into effective actions A*_t.

Figure 3.2: Control module including the anticipation and correction modules. The anticipation module takes the previously decided action A*_{t-1} and returns the estimated variation of the sensors AVS_t. The effective variation VS_t resulting from the previous action, the current sensor state S_t and AVS_t are provided to the correction module, which adjusts the action A_t into A*_t.

This architecture features two properties:

1. The anticipation module, fully independent from the other modules, can be trained independently.

2. The training of the Anticipation Module can be formalised as a supervised learning problem, where each time step provides the input data for the current step and the oracle output for the previous one; that is, for all t ∈ T, find F_anticipation defined as
F_anticipation(motors_t, sensors_t) → variation_{t+1}
(a minimal sketch of this training-pair construction is given below).

[2] The orientation variation is considered rather than the exact orientation; the two representations carry the same information, but the variation is context independent.
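For illustration, the following is a minimal Java sketch of how such training pairs can be assembled from a recorded run; the class and field names (AnticipationExample, fromLog, etc.) are illustrative and do not refer to the actual code used in this work. In the experiments below, the inputs reduce to the two motor commands and the target is the orientation variation observed at the next step.

import java.util.ArrayList;
import java.util.List;

final class AnticipationExample {
    final double[] input;  // e.g. {leftMotor, rightMotor} applied at time t
    final double target;   // orientation variation observed between t and t+1 (oracle)
    AnticipationExample(double[] input, double target) { this.input = input; this.target = target; }
}

final class AnticipationDataset {
    // motors[t] = {left, right} command applied at step t; orientation[t] = heading at step t
    static List<AnticipationExample> fromLog(double[][] motors, double[] orientation) {
        List<AnticipationExample> examples = new ArrayList<>();
        for (int t = 0; t + 1 < orientation.length && t < motors.length; t++) {
            double variation = orientation[t + 1] - orientation[t];
            examples.add(new AnticipationExample(
                new double[] { motors[t][0], motors[t][1] }, variation));
        }
        return examples;
    }
}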

3.2.2 Correction module

The anticipation module described here is not meant to directly affect the controller module (Fig. 3.2). It instead feeds a correction module, whose aim is to correct the desired output so that it is performed as intended, while taking into account the disturbances that may occur. This module is added to the architecture once the controller has been obtained and is embedded on the physical robot. Algorithm 3.1 describes the correction module. The algorithm takes five parameters: vs, the variation observed on the sensors; va, the anticipated variation; as, the adaptation step size; and desiredMotorleft and desiredMotorright, the control outputs decided by the controller according to the current sensory input.


Algorithm 3.1: Correction Module
Require: vs, va, as, desiredMotorleft, desiredMotorright
  vA ← |vs − va|
  vD ← sign(vs − va)
  if vA > ε then
    if vD < 0 then
      αleft ← αleft + as × σ(vA)
    else
      αright ← αright + as × σ(vA)
    end if
  end if
  correctedMotorleft ← desiredMotorleft + αleft
  correctedMotorright ← desiredMotorright + αright

The algorithm first computes the absolute difference vA between vs and va, and the sign vD of this difference. If the difference exceeds some small ε > 0, then either αleft or αright is adjusted proportionally to vA and as; the threshold ε is set to filter out the effects of noise. All parameters are normalised, and αleft and αright are constrained within [0, 1]. Long-term adaptation is performed by modifying αleft and αright: these terms adjust the motor calibration and should converge toward optimal values. One should note that the correction function used here remains limited to a linear one, a simple addition of the α parameters.
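For illustration, a minimal Java sketch of this correction step is given below. The class layout and the choice of σ (here a hyperbolic tangent) are assumptions made for the example, since the text above leaves σ unspecified; αleft and αright persist across time steps, as in Algorithm 3.1.

final class CorrectionModule {
    private final double epsilon;   // small threshold filtering out sensor noise
    private final double stepSize;  // adaptation step a_s
    private double alphaLeft = 0.0, alphaRight = 0.0;

    CorrectionModule(double epsilon, double stepSize) {
        this.epsilon = epsilon;
        this.stepSize = stepSize;
    }

    // Assumed squashing function sigma; the text only states that sigma(vA) scales the update.
    private static double sigma(double x) { return Math.tanh(x); }

    // vs: observed orientation variation, va: anticipated variation,
    // desired: {leftMotor, rightMotor} decided by the controller. Returns corrected commands.
    double[] correct(double vs, double va, double[] desired) {
        double diff = vs - va;
        double magnitude = Math.abs(diff);
        if (magnitude > epsilon) {
            if (diff < 0) alphaLeft  = clamp01(alphaLeft  + stepSize * sigma(magnitude));
            else          alphaRight = clamp01(alphaRight + stepSize * sigma(magnitude));
        }
        return new double[] { desired[0] + alphaLeft, desired[1] + alphaRight };
    }

    private static double clamp01(double v) { return Math.max(0.0, Math.min(1.0, v)); }
}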

3.2.3 Experiments Goal

A specific approach to correct the control output has been described; in the following sections, a few experiments are performed in order to validate it. The experiments involve several disturbances, as described in section 3.1.1, applied to both evolved and hand-made controllers. The ability of the correction module to adapt the control output is evaluated according to quantitative and qualitative criteria:

• The fitness of the evolved controllers is evaluated in three situations: without disturbance, as during the evolutionary process; and with disturbances, with and without correction.

• The control behaviour is assessed qualitatively according to a quantifiable criterion: the collision rate over a long period of time for the simulated robot, and a deviation assessment for the physical robot.

It is expected that, under disturbances, the robot should not behave properly without correction, while it should converge quickly to the evolved behaviour with correction.

3.3 Experimental Settings

This section details the methodology followed to assess strengths and weaknesses of the anticipation based control correction for the reality-gap challenge.

3.3.1 Simulator

The evolutionary robotic experiments are performed using a home-made Java simulator. The simulator is designed to match the khepera II specifications and include noise model based on khepera robot sensor sampling. More details about the simulator are given in appendix section A.1. The controllers and all neural networks are implemented in Java, using the PicoNode open source library. The library implements training algorithms such as the back-propagation algorithm. At each time step, the program get the sensor values from the robot, computes the output and produces the control outputs for the robot control. Neural networks are trained using CMA-ES evolutionary algorithm (Chap. 2.4). The optimisation software is the standard java implementation of CMA-ES available from the designer’s homepage.
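As an illustration of this sensori-motor loop, a minimal Java sketch is given below; the Robot and Controller interfaces are purely illustrative and do not correspond to the simulator's or PicoNode's actual API. The same loop applies in-situ, with the physical robot standing behind the Robot interface.

interface Robot {
    double[] readSensors();               // e.g. 8 infra-red values, normalised
    void setMotors(int left, int right);  // integer motor commands in [-20, 20]
}

interface Controller {
    double[] act(double[] sensors);       // returns 2 normalised outputs in [-1, 1]
}

final class ControlLoop {
    static void run(Robot robot, Controller controller, int steps) {
        for (int t = 0; t < steps; t++) {
            double[] sensors = robot.readSensors();
            double[] out = controller.act(sensors);
            // scale the normalised outputs to the robot's integer motor range
            robot.setMotors((int) Math.round(out[0] * 20), (int) Math.round(out[1] * 20));
        }
    }
}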

3.3.2 Physical robot

The robot manipulated in the following experiments is a two wheeled khepera II mobile robot. Khepera and khepera II have been used extensively in evolutionary robotic mobile robot navigation experiments through the literature. The robot has two independent motors activating wheels and eight infra-red (IR) sensors able to detect light sources as well as obstacles. An extension turret is added with a wireless colour video camera. No compass is available on the robot, however, a landmark detection algorithm together with the video camera are used on the robot. This device allows the robot to detect orientation variations. From the anticipation module point of view, the variation values are interpreted as an orientation variation. 51


More details about the robot architecture are given in appendix section A.3. The khepera II robot is wired to a computer hosting the neural network. At each time step, the controller program get the sensor values from the physical robot, computes the output and produces these outputs to the physical robot for acting.

3.3.3 Neural network settings

Both the anticipation and control modules are neural networks. This section presents the different parameters used in the experiments. Four neural network structures are evolved for the control module. The networks have 8 input nodes, corresponding to each infra-red sensor present on the robot, 2 output nodes, one for each motor, and 0, 2, 4 and 8 hidden neurons. On top of that, a bias neuron is added and connected to hidden and output neurons. Activation functions used are hyperbolic tangents for hidden and output neurons and simple linear function f (x) = x for the input neurons. The networks nodes are fully connected from one layer to another, making respectively 18, 24, 46 and 90 connexion weights for 0, 2, 4 and 8 hidden neurons. The anticipation neural network contains 2 input nodes with linear activation functions and one output node with hyperbolic tangent activation function. The inputs correspond to motor activations while the output is the estimated angle variation. Two anticipation networks are considered, having 0 or 2 hidden neurons and 1 bias neuron.
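As a check of the connexion counts above (and assuming full connectivity between consecutive layers, with the bias neuron feeding the hidden and output neurons), a control network with h > 0 hidden neurons has 8h + 2h + (h + 2) = 11h + 2 weights, i.e. 24, 46 and 90 weights for h = 2, 4 and 8, while the network without a hidden layer has 8 × 2 + 2 = 18 weights, consistent with the figures given above.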

3.3.4 Tasks

Evolving the Control Module

Evolving a neural controller is a typical Evolutionary Robotics experiment. The fitness to be optimised is the maximum number of places visited in the environment (Fig. 3.3). This environment aims at favouring the emergence of general and robust behaviours. The robot starts at the centre of the environment, in the small corridor, with the wall at its back. In order to explore its environment, the controller must be able to move forward, turn right, turn left and avoid collisions as much as possible in any situation. Many obstacles are added to the environment so that the controller encounters difficult situations.

Figure 3.3: Experimental environment with the grid fitness visualisation. The fitness is defined from the number of grid positions reached during evaluation. The robot's initial position is given by the red landmark, with its orientation toward the right.

The fitness consists in minimising the fraction of non-visited regions of a grid laid over the environment; formally,

F = (i × j − Σ_{a=1}^{i} Σ_{b=1}^{j} explored_{a,b}) / (i × j)

where the environment is equally divided into i × j regions to be visited over a duration of Tmax simulated time steps, and explored_{a,b} equals 1 if the robot ever explored region (a, b), and 0 otherwise. If the robot collides at any time t < Tmax, the evaluation is terminated, resulting in an implicit penalisation of collisions: the collided robot, no longer able to explore new regions, gets a sub-optimal fitness, although it might still be selected if it managed to visit enough regions before the collision. A high Tmax = 1000 is chosen in order to let the robot explore its environment. This fitness, as opposed to a fitness using directly accessible variables[3], gives no indication of, or bias toward, the optimal solution: nothing specifies how the task should be achieved, and all regions are equivalent in terms of reward.

[3] Such as internal proprioceptive sensors, motor activations or infra-red sensor values.
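The following is a minimal Java sketch of this fitness computation, assuming the simulator marks the visited grid regions in a boolean array while the robot moves (names are illustrative):

final class ExplorationFitness {
    // visited[a][b] is true if the robot ever entered region (a, b) during the evaluation.
    static double compute(boolean[][] visited) {
        int regions = 0, explored = 0;
        for (boolean[] row : visited) {
            for (boolean cell : row) {
                regions++;
                if (cell) explored++;
            }
        }
        // 1 means nothing was visited, 0 would mean every region was visited
        return (regions - explored) / (double) regions;
    }
}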


The values of i and j (here i = j = 20) impact the ability of the optimisation to converge toward a solution. If i and j are too small (large regions), it becomes difficult to distinguish behaviours in terms of fitness; if they are too large (small regions), a strong bias toward erratic behaviours appears, since visiting as many regions as possible is no longer compatible with a smooth behaviour. In practice, over the different tests performed in this work, this fitness proved efficient in simulation and provided, on average, better generalisation than the standard fitness function described in [Nolfi and Floreano, 2000] for a wandering behaviour with no collisions.

A last parameter considered in the evolution is the output space of the controller. Since the motor controls of the real robot are integer values in [−20 : 20], the same holds for the simulated robot. Both input and output values of the neural networks are normalised floating-point values in [−1 : 1]; the simulated motor outputs, however, are converted into integer values in [−20 : 20] to match the real robot. While the default experiments use integer output controls, they are also run with floating-point output controls, in order to generalise the concept of anticipation-based correction and to observe more precisely the global effects of the adaptation to disturbances with floating-point values.

The neural network connexion weights are optimised by the CMA-ES evolutionary algorithm with the following parameters. 20,000 evaluations are performed for each optimisation, without any restart policy. The population size λ is fixed and higher than the default λ = 4 + ⌊3 ln(n)⌋, where n is the dimension of the problem, i.e. 18, 24, 46 or 90 connexion weights (for the largest network, n = 90, the default value would be λ = 4 + ⌊3 ln 90⌋ = 17); λ is set to 50. Given the number of evaluations and the population size, the number of generations is 400. All other parameters are set to their default values.

Training The Anticipation Module

The anticipation module is trained using the motor commands at time t as input and the variation observed at time t + 1 as output, with the goal of predicting as accurately as possible the future state of the robot in the world. This is a straightforward regression learning task (ℝ² → ℝ). In order to perform this regression, the anticipation mechanism is implemented as a neural network with 2 inputs and one output: the inputs are the left and right motor activations, and the output is the estimated compass variation. The weights of the network are trained using the back-propagation algorithm.


This is made easy by the fact that motor activations are discrete values in [−20 : 20], making 1600 possible motor activation combinations. Because of noise, a sample of 3000 pairs of motor activations and compass changes is collected; the sample is randomly split into a training set of 2400 examples and a test set with the remaining 600 examples. Learning is performed on the training set with the back-propagation algorithm; the termination criterion is the number of training iterations, set to 500. The model obtained is assessed through its error on the test set, and the whole process is repeated 10 times to estimate the accuracy of the training performance. A few network topologies are tested: a single-layer neural network, and a multi-layer neural network with 2 fully connected hidden neurons. Once the anticipation module is trained and added to the control architecture, the correction module is fixed and manually parametrised, with αleft = 0, αright = 0 and a given adaptation step as.

Validation of the model

Three experiments are presented to validate the model:

• The first one addresses on-line calibration, targeting the reality gap: the response rates of sensors and motors vary from simulation to the physical robot. The ability of the anticipation module to adapt the control is assessed on a simple go-forward behaviour and on the best evolved behaviours; the go-forward behaviour provides a visual example of the adaptation capability.

• The second experiment addresses on-line adaptation with respect to robot wear: the evolved controllers are evaluated on their adaptation to wear such as wheel or motor wear, or an energy supply decreasing over time.

• The last experiment addresses full on-line adaptation on the physical robot: the simple go-forward behaviour is transferred to the physical robot and adapted as in the first experiment.

3.4 Task 1: Evolving the control module

Controller evolution provides the results shown in Figures 3.14, 3.15, 3.16 and 3.17, which correspond to the evolution of a neural network with 0, 2, 4 or 8 hidden neurons respectively. In each case, the top figure gives three pieces of information: the average fitness over 11 independent runs, its variance, and the best run. The fitness is designed so that the behaviour minimises the number of non-visited regions: visiting no region leads to a fitness of 1 (in fact slightly below one, since the robot already starts in one region), while a robot visiting all regions would reach a fitness of 0; the latter cannot happen, since obstacles and the available time prevent the robot from visiting every region. The bottom figures show the same results using the median and quartiles rather than the average; this representation gives a better insight into the global convergence tendency toward an optimal solution.

The convergence toward an optimal solution is better for the controllers with 4 and 8 hidden neurons: despite the larger dimension of the search space, multi-layered networks provide overall better results given enough hidden neurons (Table 3.1). Since the target behaviour is rather reactive, the best networks provide similar results and fitnesses independently of the topology (Fig. 3.4 and 3.5). Slight variations in behaviour, due to the noise introduced in simulation, help in exploring the regions and reinforce the generality of the behaviour. A qualitative study of the best controllers obtained suggests that controller networks with more hidden neurons are more robust in general (Table 3.2).

hidden neurons | connexions | best fitness | average
0              | 18         | 3.9·10^-1    | 5.5·10^-1 ± 1.8·10^-1
2              | 24         | 4.1·10^-1    | 5.5·10^-1 ± 1.7·10^-1
4              | 46         | 3.5·10^-1    | 4.0·10^-1 ± 3.9·10^-2
8              | 90         | 3.4·10^-1    | 4.2·10^-1 ± 1.2·10^-1

Table 3.1: Controller evolution results.

hidden neurons | collisions | best fitness | average
0              | 86.3%      | 4.2·10^-1    | 6.6·10^-1 ± 1.4·10^-1
2              | 63.1%      | 4.3·10^-1    | 5.7·10^-1 ± 1.3·10^-1
4              | 53.8%      | 3.8·10^-1    | 5.6·10^-1 ± 1.6·10^-1
8              | 38.8%      | 3.4·10^-1    | 5.0·10^-1 ± 1.4·10^-1

Table 3.2: Qualitative study: the best controller evolved is evaluated 1000 times on 1000 time steps each. The percentage represents the collision rate during evaluation.


Figure 3.4: Typical behaviour expressed by the best genotypes obtained.

Figure 3.5: Typical behaviour expressed by the best genotypes obtained.

3.5 Task 2: Training the anticipation module

The main results are given in Figures 3.6 and 3.7; since the training converges quickly to a solution, and in order to compare the models, the results are also given in logarithmic scale. The network with no hidden neurons quickly converges to a solution with a low error rate, around 2.8·10^-8 on the training set and around 2.5·10^-8 on the test set. Training the models with hidden neurons does not provide solutions as accurate as the first one, with an error rate of 4.7·10^-4 for 1 or 2 hidden neurons on the training set. This is not a surprise, since the estimation of the sensor variation is a fairly direct mapping of the inputs to the outputs in this setting. In a more general case, such a network may require hidden neurons, since the module may have to process complex sensory inputs as well as controls, and to produce a less direct mapping estimating multiple sensor variations. For the experiments to come, we consider the best trained anticipation module, the one with no hidden neurons.

Figure 3.6: Training of the anticipation module using the back-propagation algorithm (median error on the training set vs. number of iterations); the varying parameter is the number of hidden neurons (0 or 2).

Figure 3.7: Training of the anticipation module using the back-propagation algorithm, shown in log scale (median error on the training set vs. number of iterations); the varying parameter is the number of hidden neurons (0 or 2).

3.6 Task 3: Validation of the model

3.6.1 On-line calibration correction

The first problem to address when applying the evolved controllers to the physical robot is the initial calibration problem, arising when the response rates of both sensors and motors vary. The evaluation of the calibration is considered for a hand-written go-forward controller and for the evolved controllers; the go-forward behaviour is considered because it clearly exposes the consequences of the calibration problem.

Results

In the first setting, a go-forward controller is assessed. In Figure 3.8, on the left, the behaviour is expressed without disturbance: the environment is square and closed, and the robot starts from the upper-left corner and goes straight ahead to the bottom-right corner. On the right, a constant disturbance is subtracted from the left motor, making it go slower than it should.


Figure 3.8: On the left, a go-forward behaviour, headed from the upper-left corner to the lower-right one. On the right, the same behaviour with a continuous disturbance on its left wheel.

The trajectory is quickly deflected and the robot collides with the left wall (from the reader's viewpoint). The anticipation-correction module is then enabled, leading to the results presented in Figure 3.9 for the same disturbance setting.

Figure 3.9: Both behaviours are the same go-forward with a continuous wear on the left wheel, with the anticipation-correction module plugged in. On the left, the correction step is small, resulting in a slow adaptation; on the right, the step is equal to 1, resulting in a quick adaptation.

On the left trace (Fig. 3.9), a low value as = 0.2 is given to the correction step: a few steps are required before the behaviour is adjusted and the robot can move straight forward again. On the right, a higher adjustment step as = 1 applies and the correction is faster; in this case the disturbance has only a small consequence on the behaviour. In the first case (as = 0.2), the error between the anticipated and the effective sensor variation is displayed together with the α value in Figure 3.10; the results are medians and quartiles over 11 runs. Once the error becomes lower than ε = 1·10^-4, it barely rises again and stabilises quickly; as seen in the figures, only a few iterations are required to achieve a stable adjustment of the control.

In the second setting, the best evolved behaviours are evaluated under the same conditions as during the evolutionary process (1000 evaluations of 1000 time steps per behaviour), in the same environment and with the same initial conditions. Table 3.3 gives the results for the best behaviours with or without correction, and with or without noise. The noise value is equal to 1 motor unit, i.e. 1 is subtracted from a motor control value in [−20 : 20].

Discussion Results tends to suggest that adaptation occurred and disturbance effect could be reduced. The best behaviour found with 4 hidden neurons acts in another way, however, since it appear to avoid more collisions and reach better fitness under the effect of the disturbances. Its average fitness remain higher with disturbance anyway, suggesting that this side effect is relative to controller evolved. All controllers evolved displays biases toward left or right. This bias disambiguates symmetric situations that occurs within the environment. Disturbances have different effects on the robots, possibly improving the behaviour, however, this is not taken into account and the correction occurs so that the behaviour remain in-situ as near as possible as it is insilico. Figure 3.11 provide qualitative insight on the correction effect through a typical disturbed behaviours with and without correction.

3.6.2 On-line adaptation to continuous wear

The following setting is similar to the previous one, except that the disturbance is now due to continuous wear. Such a disturbance may come from wheel, motor or mechanical wear, or simply from the energy supply decreasing over time. The experiments involve the same controllers as before; the disturbance intensity increases following a logarithmic function, so as to reach a given wear threshold on the component in finite time. The correction mechanism is exactly the same as before and the same tests are performed. Results are given in Table 3.4.

Figure 3.10: On the top, the prediction error on the sensor variation over time; on the bottom, the adjustment of the αleft value obtained by applying the update rule of Algorithm 3.1.


hidden neurons | collisions | best fitness | average
Best controller in simulation
0 | 86.3% | 4.2·10^-1 | 6.6·10^-1 ± 1.4·10^-1
2 | 63.1% | 4.3·10^-1 | 5.7·10^-1 ± 1.3·10^-1
4 | 53.8% | 3.8·10^-1 | 5.6·10^-1 ± 1.6·10^-1
8 | 38.8% | 3.4·10^-1 | 5.0·10^-1 ± 1.4·10^-1
With calibration disturbances
0 | 99.6% | 4.4·10^-1 | 8.3·10^-1 ± 1.3·10^-1
2 | 97.5% | 4.6·10^-1 | 8.5·10^-1 ± 1.8·10^-1
4 | 49.8% | 4.4·10^-1 | 6.0·10^-1 ± 1.6·10^-1
8 | 99.9% | 4.2·10^-1 | 9.2·10^-1 ± 1.1·10^-1
With disturbances and correction
0 | 86.4% | 4.1·10^-1 | 6.6·10^-1 ± 1.4·10^-1
2 | 57.6% | 4.3·10^-1 | 5.7·10^-1 ± 1.3·10^-1
4 | 52.5% | 3.8·10^-1 | 5.6·10^-1 ± 1.6·10^-1
8 | 38.3% | 3.6·10^-1 | 5.1·10^-1 ± 1.2·10^-1

Table 3.3: Qualitative study: the best controller evolved is evaluated 1000 times on 1000 time steps each. The percentage represents the collision rate during evaluation.

Results and Discussion

Results similar to those of the calibration adaptation are found: applying the correction allows the robot to behave almost as if no disturbance were applied, both in terms of fitness and of behaviour. Again, the controller network found for 4 hidden neurons exhibits the same side effect due to noise.

3.6.3 In-situ experiments

As a final experiment, behaviour adaptation is made on a physical robot. A khepera II robot with a 2D camera is used. The 2D camera helps grabbing variations of orientation towards a visual mark in the environment. The controller is the simple go-forward ad-hoc behaviour and the anticipation module is trained in simulation. The orientation sensor is implemented using a visual tracking algorithm which tracks a coloured landmark in the environment. The robot is placed in straight line in front of the landmark. The controller does not make use of landmark detection sensor. 63


Figure 3.11: Evolved controller with disturbances on the left. On the right, the same controller with anticipation correction modules and the same disturbances. The controller quickly correct over the variations and act almost as the original (Fig. 3.5)

hidden neurons | collisions | best fitness | average
Best controller in simulation
0 | 86.3% | 4.2·10^-1 | 6.6·10^-1 ± 1.4·10^-1
2 | 63.1% | 4.3·10^-1 | 5.7·10^-1 ± 1.3·10^-1
4 | 53.8% | 3.8·10^-1 | 5.3·10^-1 ± 1.6·10^-1
8 | 38.8% | 3.4·10^-1 | 5.0·10^-1 ± 1.4·10^-1
With continuously increasing disturbance
0 | 99.6% | 4.0·10^-1 | 8.2·10^-1 ± 1.3·10^-1
2 | 95.3% | 4.7·10^-1 | 7.4·10^-1 ± 1.8·10^-1
4 | 51.5% | 4.3·10^-1 | 6.0·10^-1 ± 1.6·10^-1
8 | 99.8% | 3.6·10^-1 | 8.6·10^-1 ± 1.1·10^-1
With disturbance and correction
0 | 85.9% | 4.1·10^-1 | 6.7·10^-1 ± 1.3·10^-1
2 | 63.0% | 4.2·10^-1 | 5.9·10^-1 ± 1.6·10^-1
4 | 55.2% | 3.6·10^-1 | 5.6·10^-1 ± 1.7·10^-1
8 | 39.1% | 3.5·10^-1 | 5.2·10^-1 ± 1.5·10^-1

Table 3.4: Qualitative study: the best controller evolved is evaluated 1000 times on 1000 time steps each. The percentage represents the collision rate during evaluation.


Results Experiments provide similar results on the real robot than the one obtained in simulation (Fig. 3.12). The error between variation estimation and effective variation is given in Figure 3.13. When applying correction, error raises less until the robot reaches the front wall. The sensor algorithm becomes unstable in both cases, leading to high increases of error. Error should be considered up to these increases: around 120 time steps for the controller without correction, 160 for the one having correction.

Figure 3.12: Example on real robot with low disturbance. Two left images: go-forward without correction. Two right images: go-forward with correction.

Figure 3.13: Summed error between anticipated variation and effective variation while applying correction or not.


Discussion The in-situ experiment provides hints that the correction is effective and the behaviour can be corrected. The experimental setting remains, however, limited to the hand-crafted controller on the physical robot, due to the lack of appropriate sensors for the anticipation. A better sensor feedback should be used, depending on the robot architecture and tasks at hand.

3.7 Conclusion

In this chapter, the problem of transferring a controller obtained in simulation to an autonomous mobile robot in the real world has been addressed. This remains one of the key issues in Evolutionary Robotics, since evolved controllers are rarely used on real robots, because of the extreme difficulty of precisely simulating the relevant features of the real world. As presented in the introduction, a few tricks can be used, mostly by tuning the impact of noise, in order to reduce the problem; on-line adaptation, however, appears to be an adequate approach to solving the problem in the general case.

Our approach consists in applying a simple parameter adjustment based on the estimation error between what is expected and what actually occurs. Both the behaviour and the adaptation rule remain simple and most probably sub-optimal. In the presented experiments, the controllers behave almost properly despite the disturbances applied, which corresponds to the expectations of the experiments. Handling a more complex robot or behaviour model should not be an issue, as long as the anticipation system can rely on low-bias sensory or control feedback; the adjustment rule should then be tuned according to the problem at hand.

The experiments provide good hints toward understanding the capability of an anticipation module to compute adequate prediction variations in order to optimise the control on-line. The controller remains as it is: the expected consequences of the actions are compared to the effective ones, allowing the correction module to adjust the control output, not the controller. The experiments, however, have several flaws; in particular, better sensors should be used for an adequate detection of errors, such as proprioceptive motor feedback, accelerometers, or vision. Further work on this topic should investigate two main issues:

• Making use of more appropriate sensors for the anticipation module, rather than dedicated ones. The infra-red sensors used by the controller, together with the controller output, might provide enough information in several situations to drive an anticipation-correction module; the adaptation of the controller would then be more precise. This could apply to legged robots, for which a proprioceptive model of the robot can be built.

• The adaptation rule should be trained on-line rather than hand-designed, in order to cope with unexpected disturbances and to reduce possible side effects of the parameter adjustments. The use of a dynamical system such as a recurrent neural network could help in the general case.


Figure 3.14: Evolution statistics for the controller with no hidden neurons (fitness vs. number of evaluations, over 20,000 evaluations). 11 independent runs were conducted. On the top, the average fitness with standard deviation and the best run (minimum fitness); on the bottom, the same runs displayed using median and quartiles.


Figure 3.15: Evolution statistics for the controller with 2 hidden neurons (fitness vs. number of evaluations, over 20,000 evaluations). 11 independent runs were conducted. On the top, the average fitness with standard deviation and the best run (minimum fitness); on the bottom, the same runs displayed using median and quartiles.


Figure 3.16: Evolution statistics for the controller with 4 hidden neurons (fitness vs. number of evaluations, over 20,000 evaluations). 11 independent runs were conducted. On the top, the average fitness with standard deviation and the best run (minimum fitness); on the bottom, the same runs displayed using median and quartiles.


Figure 3.17: Evolution statistics for the controller with 8 hidden neurons (fitness vs. number of evaluations, over 20,000 evaluations). 11 independent runs were conducted. On the top, the average fitness with standard deviation and the best run (minimum fitness); on the bottom, the same runs displayed using median and quartiles.


Chapter 4
Demonstrating Behaviours

This chapter is concerned with Learning by Demonstration, using Neural Networks and Echo State Networks (ESNs) as search spaces. The motivation for the presented setting is twofold. On the one hand, as already mentioned (Chapter 2), ESNs offer a good trade-off between expressive power and tractable optimisation, with a complexity linear in the number of neurons of the network. On the other hand, Learning by Demonstration (Chapter 2.2.4) is conducive to a robust fitness design, sidestepping the so-called evolution opportunism. The presented approach is applied to two robotic control tasks: a wandering behaviour and a target-finding behaviour.

4.1 Motivation

Evolutionary Robotics faces several limitations, as discussed in Chapter 2: • The generality of the solution controller critically depends on the fitness function and experimental setting (see discussion on Auto-Teaching and robustness in facing the environment 3.1.3). • The fitness function must enforce the discovery of an efficient solution (avoiding a Needle in the Haystack-like landscape [Crammer and Chechik, 2004]) and avoid discovering stupid solutions through the so-called Evolution Opportunism [Godzik, 2005]. • An additional limitation is referred to as Reality Gap (Chapter 3). If the controller is assessed in simulation, then it is subject to the inaccuracy of the simulator (e.g. sensor and actuator noise). 73


Learning by Demonstration, exploiting the traces of the human teachers demonstrating the desired behaviour (Chapter 2.2.4), partially addresses the above limitations: • It enables some generality through several demonstrations of the target behaviour. • The fitness function is simply defined as reproducing the expert demonstrated behaviour. • It avoids the reality gap to the extent that demonstrations can be performed directly on the physical robot. In counterpart, Learning by Demonstration raises some other limitations: • Only sparse training datasets are available in the general case (with some exceptions, see LeCun [LeCun et al., 2005]). The teacher might not be willing or able to provide several demonstrations of the target behaviour to the robot. • Training data are non-iid, which is a usual precondition for Machine Learning algorithms (Chapter 2.2). • Humanly performed demonstrations are prone to errors. Since only a few demonstrations are likely to be available, expert errors will have a deep impact on the models behaviour trained1 . One outcome of the study will be to compare qualitatively the behaviours obtained in the Learning by Demonstration context with those obtained by Evolutionary Computation in chapter 3.

4.1.1 Experiments Goal

As described in the motivation, Learning by Demonstration paradigm allows one to avoid the reality-gap problem since the behaviour can be demonstrated directly on the physical robot (in-situ). Neural approaches such as convolutional networks have been shown to be efficiently trained from many demonstrations. Limitation remains, in the situations where training data is sparse and non-iid. Robust robot controller training under those conditions is investigated in the present research. 1

Additionally, sensor and motor noise adversely affect learning (as on the general case)


ESNs have been shown to learn complex dynamic controls on non-iid data sequences for robotic tasks. ESN robust training and control ability are therefore evaluated in the Learning by Demonstration context. It is expected that ESN should perform better than standard feed-forward neural networks. In order to evaluate the overall controller, results are compared to a state-of-the-art robotic control algorithm, known as MPL (section 2.2.4). Another goal of the presented experiments concerns the capability to train a controller facing a task with or without pre-processing over the sensor inputs. With simplified sensors such as a target’s estimated position rather than a picture containing the target, it might be expected that the controller can be trained in an easier way. However, simplification might introduce a bias or remove the key features present in the sensory inputs, leading to worst results.

4.2 Training ESNs by demonstration

The presented approach aims at achieving robust autonomous control from only a few demonstrations of the target behaviour. In a first phase, the desired robot behaviour is obtained through tele-operating the robot, leading to one or several robotic logs. These robotic logs are used to define a fitness function or a training criterion, and to train the controller in a second phase.

4.2.1 Structure of the log

Experiments are performed both in-silico and in-situ. At each time step of the teacher's demonstration, the value of each sensor and motor is recorded. Formally, the log L is described as {(x_i, y_i*), i = 1...T, x_i ∈ ℝ^d, y_i* ∈ ℝ²}, where d denotes the number of sensors. In-silico, d = 8: the simulator involves 8 infra-red distance sensors. In-situ, d = D: besides the 8 infra-red sensors, the mobile khepera II (Chapter A.3) is equipped with a wireless video camera with a resolution of 320 × 240 pixels[2], which is used instead of the infra-red sensors.

[2] Hue, Saturation and Brightness (HSB) is used rather than Red, Green and Blue (RGB), as it gives better results when extracting visual information such as colours.
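As an illustration, a demonstration log can be represented by a simple record type such as the following sketch (the class names are illustrative and do not refer to the actual code of the environment described in section 4.3.1):

import java.util.ArrayList;
import java.util.List;

final class LogEntry {
    final double[] sensors;      // x_i: 8 infra-red values in-silico, a compressed camera image in-situ
    final double[] desiredMotor; // y_i*: {left, right} motor values recorded from the teacher
    LogEntry(double[] sensors, double[] desiredMotor) {
        this.sensors = sensors;
        this.desiredMotor = desiredMotor;
    }
}

final class DemonstrationLog {
    final List<LogEntry> entries = new ArrayList<>();
    void record(double[] sensors, double[] desiredMotor) {
        entries.add(new LogEntry(sensors, desiredMotor));
    }
}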


For tractability reasons, noise reduction is performed using a Kalman filter [Welch and Bishop, 2008], followed by a standard dimensionality reduction to 20×15 sub-pixels (each sub-pixel taking the average HSB value of the pixels it covers), making D = 20 × 15 × (H + S + B) = 900. In the experiments, the reduced image is further compressed to 8×6, making D = 144, for reasonable tractability. Figure 4.1 displays both the original camera image and the reduced image stored in the logs.

Figure 4.1: Camera images: Original image (left), compressed image stored (right)
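A minimal Java sketch of such a block-averaging reduction is given below; it assumes the image is already converted to HSB values, and the names are illustrative.

final class ImageReducer {
    // hsb[y][x] = {h, s, b} for each pixel of a w x h image; returns a targetH x targetW image.
    static double[][][] reduce(double[][][] hsb, int targetW, int targetH) {
        int h = hsb.length, w = hsb[0].length;
        double[][][] out = new double[targetH][targetW][3];
        int[][] counts = new int[targetH][targetW];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int ty = y * targetH / h;   // sub-pixel row covering source row y
                int tx = x * targetW / w;   // sub-pixel column covering source column x
                for (int c = 0; c < 3; c++) out[ty][tx][c] += hsb[y][x][c];
                counts[ty][tx]++;
            }
        }
        for (int y = 0; y < targetH; y++)
            for (int x = 0; x < targetW; x++)
                for (int c = 0; c < 3; c++)
                    out[y][x][c] /= Math.max(1, counts[y][x]); // average over the covered pixels
        return out;
    }
}

Applied twice (320×240 → 20×15, then 20×15 → 8×6), this yields the D = 144 input vector used in the experiments.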

4.2.2 Controller Search Space

Each controller is a model whose parameters define a hypothesis h mapping a sensory input x ∈ ℝ^d to a 2-dimensional control output y ∈ ℝ² (left and right motor activations). The experiments consider three main controller models:

• A fixed feed-forward neural network controller (NN); the number of hidden neurons (0 to 100) depends on the task. The neuron activation function is the hyperbolic tangent, and the networks are trained using the back-propagation algorithm.

• Echo State Networks (ESNs). An ESN is a recurrent neural network with specific construction constraints, providing controllable memory properties (Chapter 2.3.4). The ESN readout layer, from hidden neurons to output neurons, is trained using a simple linear regression algorithm (Chapter 2.2.2).

• The Micro-Population based Learning algorithm (MPL), a standard algorithm designed for robot Learning by Demonstration experiments (Chapter 2.2.4). MPL can be viewed as a random projection-based classifier: for each example (x_i, y_i*), a random feature selection fs is drawn and the rule fs(x, x_i) → y_i is generated, where fs(x, x_i) is true if and only if the values of x and x_i coincide.

The sub pixel has an average value HSB of each pixels included within.

76

4.2. TRAINING ESNS BY DEMONSTRATION

drawn and the rule f s(x, xi ) → yi is generated, where f s(x, xi ) is true if and only if the values of x and xi coincide. Additionally, a feed-forward neural network trained in the same conditions as an ESN will be considered, referred to as reactive ESN (rESN). Only the readout connexions from the hidden neurons to the outputs are trained using linear regression, in the same way the standard ESN is trained.

4.2.3

Assessment of a Controller

Two criteria will be used to assess the competence of a controller. The first criteria is the mean square error (MSE) with respect to a validation log (not considered during the training). Training and test are performed 11 times on a controller randomly initialised each time according to their specific parameters (Neural networks with different initial random weights, ESN with different reservoirs). E(h) =

T X

kh(xi ) − yi∗ k22

i=1

This criterion, meant to be minimised, however does not provide enough guaranties to achieve the appropriate behaviour acquisition. Demonstrations are sparse and prone to errors and noise. If the controller mean square error is low with regard to those errors, then the adequate behaviour won’t be achieved. Moreover, practical experiments suggest that an error threshold exists, separating adequate from inadequate behaviours, however, estimating this threshold requires extra-information. For these reasons, a high level subjective assessment is provided by the designer, and will be made accessible to the reader through the graphical representation of the controller traces.

4.2.4

Tasks

Wandering behaviour The teacher demonstrates a given trajectory in the environment. The challenge is twofold: on the one hand, the traces satisfy implicit requirements such as obstacle avoidance, which should also be satisfied by the trained controller. On the other hand, the robot is subject to perceptual aliasing. As different decisions should be made in perceptually similar conditions, it is expected that reactive controllers (feed-forward neural networks) will not perform as efficiently as non-reactive controllers (ESN).

77


The task is meant to produce a behaviour similar to the one evolved in chapter 3, within the same environment, however, differences exist in both the available traces of the behaviour and the assessment criterion. On the one hand, the speed of the controller is artificially reduced so as to enable an easy robot control by the teacher. On the second hand, the MSE criterion does not consider the environment exploration as does the fitness defined in chapter 3. In the end, the expected target behaviour should provide similar wandering behaviour, but the results should not be quantitatively comparable (computing the fitness value from chapter 3 on the trained behaviours will probably lead to variations in results depending on the demonstration data). Such comparison will hence not be performed. Target following behaviour The second task involves finding and reaching an object (in-situ). The teacher demonstrates two behaviours: finding the target object, a red ball, where the initial position of both the robot and the ball are randomly drawn in each experiment; and reaching the target object when it appears in the scope of the robot camera (according to the robot teacher).

4.3 Experimental Settings

This section details the methodology followed to assess strengths and weaknesses of robust control acquisition in the context of Learning by Demonstration.

4.3.1 Learning by Demonstration Environment

In order to enable the teacher to provide effective traces both in-situ and in-silico, we have developed a Java-based Learning by Demonstration environment[4] (Chapter A). The control can be achieved through a joystick (in-situ) or a keyboard (in-silico).

• The joystick offers an analogue control of the robot to the teacher. Two measures are computed from the two joystick axes X and Y to derive the motor activations: the displacement of the stick from its neutral position, s = √(x² + y²) with x, y ∈ [−1; 1], and the orientation y of the stick. The displacement and orientation values are converted into left and right motor activations (a × (y − s), a × (−y − s)), where a is a constant set experimentally to 8 (a small sketch of this conversion is given after this list).

• The keyboard offers a discrete control of the robot, with specific control values for the directional keys: up, left, right and down provide respectively the motor activations (3,3), (1,2), (2,1) and (−1,−1). These values were obtained through experiments so as to give the teacher an efficient control in the environments considered.

Together with the control values provided by the teacher, the environment displays the robot sensor values: the infra-red sensor values in task 1, and the sensor values plus the camera image in task 2 (Fig. 4.7). Figure 4.2 illustrates the Java-based application developed for controlling the robot and storing the data log.

[4] This environment is available upon request to the author. It has been used for various robotics teaching activities at the IFIPS engineering school, University Paris-Sud, from 2008 to 2009.
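For illustration, the joystick conversion can be sketched as follows; the code reproduces the formula exactly as stated above, with a = 8, and the names are illustrative.

final class JoystickMapping {
    static final double A = 8.0; // scaling constant set experimentally

    // x, y in [-1, 1]: joystick axis values; returns {leftMotor, rightMotor}.
    static double[] toMotors(double x, double y) {
        double s = Math.sqrt(x * x + y * y); // displacement from the neutral position
        double left  = A * (y - s);
        double right = A * (-y - s);
        return new double[] { left, right };
    }
}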

Figure 4.2: Two samples of the demonstration log. On the top is the video input, the infra-red sensor activations are displayed in the middle, and the controls are at the bottom.

The demonstration environment records the robotic logs (section 4.2.1), storing for each time step the robot sensors and the desired motor values computed from the joystick or keyboard activated by the teacher. Notably, in task 2, the log stores a compressed version of the camera image (D = 20 × 15), although a higher degree of compression is applied during learning.


4.3.2 Controller Parameters

The hyper-parameters of the feed-forward NN, ESN, rESN and MPL controllers are presented in Table 4.1 for task 1 and in Table 4.2 for task 2.

Task 1

NN: 8 inputs (the 8 infra-red sensors); no hidden neurons; full connexion between the input and output layers; 2 outputs with hyperbolic tangent activation function; all weights uniformly drawn in [−1; 1].

ESN: 8 inputs (the 8 infra-red sensors); 20 hidden neurons in the reservoir; 2 outputs with hyperbolic tangent activation function; all weights uniformly drawn in [−1; 1], except the reservoir connexion weights, uniformly drawn in [−0.5; 0.5]; damping 0.9; connexion density 0.1.

Table 4.1: Overall controller parameters for task 1.

MPL involves a single hyper-parameter, the size s of the feature set randomly extracted from the whole set of D features; in task 2, s is set to 4. MPL is designed to work with video images as sensory input, so it is not considered for the first task, which only involves the infra-red sensors. In all experiments, the ESN connexion weights are uniformly drawn in [−1; 1], except for the reservoir internal connexions, which are uniformly drawn in [−0.5; 0.5]. This choice was made because the reservoir weights are eventually rescaled so that the largest eigenvalue remains lower than one and equal to the damping factor; in the end there is no obvious difference between the two initial distributions, and a slight distribution difference should have no consequence on the reservoir weight generation.

Two experiments are presented in the following: the first one considers the acquisition of a simple behaviour in simulation from teacher demonstrations, while the second one is performed on the physical robot for a more complex task. As in chapter 3, the simulator used simulates a khepera II; it is described in appendix A.1.


NN (image input): 144 inputs (8×6 image, h, s, b); 100 hidden neurons; full connexion between layers; 2 outputs with hyperbolic tangent activation function; all weights uniformly drawn in [−1; 1].
NN (coordinate input): 1 input (target coordinate); 100 hidden neurons; full connexion between layers; 2 outputs with hyperbolic tangent activation function; all weights uniformly drawn in [−1; 1].
ESN (image input): 144 inputs (8×6 image, h, s, b); 100 hidden neurons in the reservoir; 0.05 connexion density; 0.8 damping factor; 2 outputs with hyperbolic tangent activation function; all weights uniformly drawn in [−1; 1], except the reservoir connexion weights, uniformly drawn in [−0.5; 0.5].
ESN (coordinate input): 1 input (target coordinate); 100 hidden neurons in the reservoir; 0.05 connexion density; 0.8 damping factor; 2 outputs with hyperbolic tangent activation function; all weights uniformly drawn in [−1; 1], except the reservoir connexion weights, uniformly drawn in [−0.5; 0.5].
rESN: 144 inputs (8×6 image, h, s, b); 100 hidden neurons in the reservoir; 0.0 connexion density (no recurrences); 2 outputs with hyperbolic tangent activation function; all weights uniformly drawn in [−1; 1].
MPL: sampling with 4 random pixel inputs per frame.
Table 4.2: Overall controller parameters for task 2.



4.4 Experimental Results, task 1: Wandering behaviour

4.4.1 Methodology

The environment considered in the first experiment is the same as the one used in the evolutionary experiments presented in chapter 3 (Fig. 4.3). The robot's starting location remains the same, in the middle of the environment. The teacher performs four demonstrations of about the same length, showing how to get out of the maze (Fig. 4.3). The log of the first three demonstrations is used as the training set (429 examples); the MSE criterion is computed on the log of the fourth demonstration (137 examples).

Figure 4.3: Wandering task environment. Left: the environment, with the initial robot position and an orientation toward the reader's right. Right: the trace of one demonstration log.

The controllers considered here are a feed-forward neural network with no hidden neurons, trained using back-propagation, and an ESN with 20 reservoir neurons. The evaluation criterion is, for each sensor input, to minimise the distance between the network control output and the desired output provided in simulation by the teacher (section 4.2.3). In this experiment, the feed-forward neural network controller has no hidden neurons: all inputs are connected to all outputs and all activation functions are hyperbolic tangents. The training algorithm is back-propagation on the training samples, with 10,000 training iterations and a learning rate of 10^-2.


The ESN has 20 reservoir neurons, with a connexion density of 0.1 and a damping factor of 0.9. As for the feed-forward network, the activation functions are hyperbolic tangents. The ESN is trained using a linear regression algorithm. All parameters are listed in Table 4.1. For each controller type, experiments are performed 11 times; for the ESN, this implies 11 randomly generated networks, each with a different random reservoir.
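The training procedure can be summarised by the following sketch, assuming the standard ESN recipe: the reservoir is driven by the logged sensor values, the collected states are regressed against the demonstrated motor commands by least squares, and the MSE criterion is computed on a held-out demonstration. Function names and array shapes are illustrative, not those of the library actually used here.

import numpy as np

def esn_states(w_in, w_res, inputs):
    # Drive the reservoir with the logged sensor values and collect its states.
    x = np.zeros(w_res.shape[0])
    states = []
    for u in inputs:                                   # inputs: (T, n_inputs) sensor log
        x = np.tanh(w_res @ x + w_in @ u)
        states.append(np.concatenate([x, [1.0]]))      # state plus a bias term
    return np.array(states)                            # shape (T, n_reservoir + 1)

def train_readout(states, targets):
    # Outputs use a tanh activation, so regress against atanh of the demonstrated
    # motor values (clipped away from +/-1 to keep atanh finite).
    y = np.arctanh(np.clip(targets, -0.999, 0.999))
    w_out, *_ = np.linalg.lstsq(states, y, rcond=None)
    return w_out                                       # shape (n_reservoir + 1, n_outputs)

def mse(states, w_out, targets):
    # Mean square error between the controller outputs and the demonstrated commands.
    return np.mean((np.tanh(states @ w_out) - targets) ** 2)

A typical use on task 1 would collect the states on the first three demonstrations, fit the readout, and report the MSE on the fourth one.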

4.4.2 Results

Each controller is trained in approximately 1 to 3 seconds on a 3.0 GHz Pentium processor. Quantitative errors on both the training and test sets are displayed in Table 4.3. The obtained controllers are then evaluated qualitatively by running them in the given environment. Trace examples of the obtained behaviours are presented in Figures 4.4 and 4.5.

Controller    training average error          test average error
NN            0.2 × 10^-2 ± 1.5 × 10^-6       0.28 × 10^-2 ± 5.9 × 10^-7
ESN           0.8 × 10^-5 ± 2.8 × 10^-8       1.0 × 10^-3 ± 4.4 × 10^-8
Table 4.3: Neural network and ESN average errors on the training and test sets.

The error rates are low for both controllers, with a low variance; results nevertheless suggest that the ESN achieves a better fit on these demonstrations made of non-iid data. The trained controllers provide similar qualitative results, expressing more or less a wall-following behaviour, or a move-forward-and-avoid-obstacles behaviour. Both controllers are able to navigate through the environment while avoiding obstacles. Differences of behaviour occur in the parts of the environment not visited during the demonstrations, which is expected given the small amount of training data. The demonstrations only cover a sub-part of the environment (Fig. 4.3); the parts ignored during the demonstrations lead to variations in the obtained behaviours. In order to improve robustness, or generalisation, new demonstrations can be made (an extra demonstration and the resulting improved behaviour are displayed in Figure 4.6).



Figure 4.4: Neural network wandering behaviours trained on the demonstrated training data.

Figure 4.5: ESN wandering behaviours trained on the demonstrated training data.

4.4.3 Discussion

Both the standard feed-forward neural network and the Echo State Network controllers were efficiently trained to yield the expected wandering behaviour. Compared to the behaviours evolved in chapter 3, where the goal is to maximise the explored area of the environment, the obtained behaviours look similar: the demonstration logs contain behaviours close to those evolved in the same environment, up to a speed difference (the evolved behaviour is faster). They do not capture exactly the same features, however, since the training criteria are not the same, as discussed in section 4.1.1.


Figure 4.6: Example of a demonstration including extra situations (left) and the behaviour of the ESN trained on it (right).

Adding a few demonstrations of the specific unseen situations allows the training to increase robustness and cope with more general situations. This robustness-through-generality issue is the same problem met when defining the fitness function and the environment in an evolutionary robotics experiment.

4.5 Experimental Results, task 2: Target following behaviour

4.5.1 Methodology

The task is formally described as a combination of two subtasks:
• finding the target: the teacher demonstrates this subtask by directing the robot along a spiral, moving forward while slightly turning, until the target object comes into the robot's sight. If the robot only turned on itself, it would miss a target located too far away to be seen; with a spiral motion, it eventually finds it. The demonstrations always turn in the same direction (to the left), since no sensor data allows the robot to tell whether the target is on its left or on its right unless the target is within camera range.


• reaching the target: the demonstrated behaviour then changes to going straight forward toward the target. During this approach phase, the teacher tries to keep the ball in the middle of the video image, relying on both an overall view of the environment and the robot sensor values displayed on the computer screen.
Figure 4.7 displays the experimental setup. The sensors used for the control are the video image inputs, with a downsized resolution of 8×6. Feed-forward neural network, ESN, rESN and MPL controllers are considered in this experiment (Tab. 4.2). As discussed in the motivations (section 4.1), only two demonstration sets are created for this experiment.

Figure 4.7: In-situ experimental setup.

The experiment is also conducted using an image pre-processing algorithm that converts the picture into a coordinate system, estimating the target object position in the image plane along one dimension, i.e. a single variable x. By default, if the ball is out of camera range, its estimated position is set to the extreme left. Only the feed-forward neural network and ESN controllers are evaluated with this sensory input; they use this reduced sensor space, mapping the 144-pixel image onto a single orientation feature. The algorithm simply estimates the target's colour density along the width axis of the image in order to locate the target whenever it is within range.
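As an illustration of this pre-processing step, the sketch below estimates the target's horizontal position from the colour density along the width axis; the colour test, the thresholds and the output scaling are assumptions, not the exact procedure used in the experiments.

import numpy as np

def target_x(image, is_target_colour):
    # `image` is a (6, 8, 3) array of h, s, b pixel values; `is_target_colour`
    # is a predicate returning a boolean mask of pixels matching the target colour.
    # Returns a value in [-1, 1] (left to right), or -1.0 when the target is out of range.
    mask = is_target_colour(image)                  # (6, 8) boolean mask
    column_density = mask.sum(axis=0)               # colour density along the width axis
    if column_density.sum() == 0:
        return -1.0                                 # out of range: default to the extreme left
    columns = np.arange(image.shape[1])
    centre = (column_density * columns).sum() / column_density.sum()
    return 2.0 * centre / (image.shape[1] - 1) - 1.0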


The qualitative evaluation of the controllers considers two experiments:
• In a first experiment, the robot is placed in the conditions of the demonstrated task. The initial position of the robot is within a distance of [15 cm; 20 cm] and at an orientation of approximately 90 degrees to the target, so that the target is not within reach of the video camera on top of the robot. In order to reach the target, the robot needs to display the two sub-behaviours, finding and reaching the target; once the target is reached, the robot has to stop in front of it. Success is defined as reaching the target, and the experiment is repeated 11 times for each of the 11 controllers obtained per controller type (making 11 × 11 = 121 evaluations per controller type).
• A second criterion regards the time taken to reach the target from three different initial positions with different orientations toward the target. The robot is positioned at approximately 20 cm from the target with orientations of 90, 180 and 270 degrees, i.e. with the target respectively on its left, rear and right. This experiment is repeated 3 times for the best trained controller of each type.

4.5.2 Results

Two experiments are performed on the second task (section 4.5.1). Results for the first experiment are shown in Table 4.4. The average errors over the training and test sets are presented with their variance for each trained model; the two sets are used for training and testing alternately. The setting using the video image as input yields better overall results for the ESN. No clear difference appears for the feed-forward neural network, even if it tends to produce slightly better results with the coordinate system. The ESN approach provides the best results with regard to the error rate on both training and test sets, outperforming the feed-forward neural network. The number of parameters to train in the neural networks is dramatically high: with the full 8×6 image input, the network involves 14,702 weights⁵, trained from only a few demonstrated data. The ESN in comparison requires the training of only 202 connexions (the reservoir-to-output connexions plus the bias-to-output connexions), significantly reducing the number of parameters. The recurrence-less ESN is equivalent to the feed-forward neural network in terms of model structure and expressiveness; nevertheless, the neural network is outperformed. Technically, the only differences are the training method and the connexion weights involved in the training: back-propagation tries to adjust all 14,702 weights while, for the rESN, only 202 weights are changed, the remaining ones being kept fixed.

⁵ 144 × 100 connexions from the input to the hidden layer, 100 × 2 from the hidden to the output layer, plus one bias connexion for each hidden and output neuron, adding 102 connexions.



Training on demonstration 1, testing on demonstration 2
Algorithm    demo 1 (learning)               demo 2 (testing)
Using the x coordinate as input:
NN           1.9 × 10^-2 ± 4.0 × 10^-4       1.8 × 10^-2 ± 9.0 × 10^-4
ESN          0.9 × 10^-2 ± 9.7 × 10^-4       1.1 × 10^-2 ± 6.2 × 10^-4
Using the 8×6 image as input:
NN           2.0 × 10^-2 ± 1.6 × 10^-3       2.2 × 10^-2 ± 2.4 × 10^-3
MPL          4.0 × 10^-3 ± 4.2 × 10^-5       0.9 × 10^-2 ± 9.5 × 10^-5
ESN          4.0 × 10^-4 ± 3.5 × 10^-5       0.6 × 10^-2 ± 1.7 × 10^-3
rESN         3.9 × 10^-4 ± 1.6 × 10^-5       0.5 × 10^-2 ± 1.5 × 10^-3

Training on demonstration 2, testing on demonstration 1
Algorithm    demo 2 (learning)               demo 1 (testing)
Using the x coordinate as input:
NN           1.9 × 10^-2 ± 8.7 × 10^-4       1.6 × 10^-2 ± 4.8 × 10^-4
ESN          0.7 × 10^-2 ± 4.2 × 10^-4       2.1 × 10^-2 ± 3.5 × 10^-3
Using the 8×6 image as input:
NN           2.2 × 10^-2 ± 2.6 × 10^-3       1.8 × 10^-2 ± 1.6 × 10^-3
MPL          6.0 × 10^-3 ± 3.7 × 10^-5       0.6 × 10^-2 ± 9.7 × 10^-5
ESN          6.1 × 10^-4 ± 4.7 × 10^-5       0.4 × 10^-2 ± 1.2 × 10^-3
rESN         5.2 × 10^-4 ± 3.0 × 10^-5       0.4 × 10^-2 ± 1.3 × 10^-3
Table 4.4: Training and testing errors averaged over N = 11 runs.

On the training set, ESNs achieve the best results, but little difference appears with MPL on the test sets. In particular, MPL has a low variance over the runs compared to the ESNs, despite the random sampling in the video image. The qualitative experiments assessing controller performance are then performed (section 4.5.1). Regarding the first experiment, a qualitative success is defined as the ability to find and reach the target. With the full image as input, the Echo State Network success probability is 73% with recurrent connexions and 70% without, while the MPL success rate is 67%. The controllers obtained with the coordinate input description failed to behave efficiently at all.


• Despite its high success rate, the ESN architecture showed variations in the obtained behaviour: one controller out of eleven searched for the target by turning right rather than left as demonstrated, and two controllers out of eleven did not even try to turn at all. For the ESNs that do not turn, inspection of the control output shows that the controller actually tries to turn left, but the robot uses integer motor commands, so that the slightly different real-valued outputs are converted into identical integer values. All ESN controllers nevertheless move forward to reach the target as soon as it comes within camera range.
• The MPL behaviours tend to be very regular, always showing the same patterns, with slight performance variations in target detection; they sometimes fail to detect the target.
• The feed-forward neural network could never reach the target and failed to produce any relevant behaviour. The experiments revealed that the neural network-based controllers are hardly able to locate the target and only react when placed very close to it.
Results for the second experiment are shown in Table 4.5. MPL controllers reach the target efficiently in the favourable setting where the target is on the robot's left, but the performance quickly decreases as the rotation needed to face the target increases, and they never reach a target initially located on the right. On the contrary, ESN controllers could perform the task under all conditions, with a high variance in the time elapsed before the target was reached. The large variations from one orientation to another are explained first by the demonstrated behaviour, which requires the robot to turn left to find the target object, and second by the poor video image quality, which did not always provide a clear view of the target. Most unsuccessful controllers suffer from the video image quality: they turn around but do not always stop turning while the target is within camera range.

4.5.3 Discussion

The first lesson learned from the experiments is that the Mean Square Error (MSE) does not correlate well with the acquisition of the desired behaviour. Typically, the neural networks yield a low MSE while they completely fail to reach the target object when it is not in sight of the robot (Table 4.4). A tentative interpretation for this failure is the bias of the example distribution: as shown in Figure 4.8, the demonstrations oversample the favourable case where the target object is visible. Not only does this distribution make it hard to learn the appropriate behaviour in the not-so-favourable cases; it also gives an over-optimistic view of the controller performance.

4. DEMONSTRATING BEHAVIOURS

Algorithm    orientation 1    orientation 2    orientation 3
MPL          57 s to 80 s     9 s to 10 s      never reached
ESN          11 s to 32 s     7 s to 8 s       12 s to 19 s
rESN         12 s to 36 s     8 s to 10 s      13 s to 23 s
Table 4.5: Time needed to reach the target object, depending on its orientation with respect to the robot.

Figure 4.8: Sensor space distribution in the log (projection of the target object coordinate on the X axis).

The second lesson is that MPL improves on the neural nets, with respect to the MSE and, above all, in terms of behaviour. The effectiveness of the approach is explained by its controlled generality: each step in the demonstration (log example) is considered independently and is generalised as a rule; furthermore, the rule premises involve a set of s = 4 randomly selected pixels. In comparison, neural networks rely on a more fragile learning process, given the size of the learning space (14,702 weights) and the fact that back-propagation is doomed to end up in a local optimum. The third lesson is that ESNs succeed in reproducing the desired behaviour, despite the fact that the corresponding MSE is comparable to that of the neural networks.


The fact that they outperform MPL is explained by the more flexible processing of the input (the camera image) that they enable, inherited from the NN framework. Their improvement over feed-forward neural nets is explained by two facts. Firstly, the size of the learning space is dramatically reduced compared to the neural networks (202 weights instead of 14,702). Secondly, they are intrinsically able to deal with sequential (non-iid) data: the readout connexions are trained to select the dynamics of the internal nodes best matching the task at hand.

4.6 Conclusion

This chapter has investigated the Learning by Demonstration approach on two tasks, respectively aimed at a wandering behaviour and a target reaching behaviour. Several controller spaces and learning algorithms have been experimented with, including feed-forward neural networks with back-propagation, ESNs and rESNs with linear regression, and MPL. According to the empirical evidence, ESNs offer a good trade-off between learning tractability and expressive power: ESN learning has a linear complexity in the number of neurons (as opposed to the number of connexions for standard neural networks), and the generalisation is smoother than for MPL. Additional experiments have investigated pre-processing the raw sensor values, e.g. to estimate the orientation of the target object. In this simple environment, the considered pre-processing failed to provide useful cues. The extension of this pre-processing step, e.g. using dimensionality reduction [Jenkins and Matarić, 2004], is left for further studies. Along the same lines, it might be interesting to investigate the use of Random Forests on the demonstration log [Breiman, 2001]. Still, the main bottleneck of Learning by Demonstration, as witnessed by the reported experiments, is that minimising the Mean Square Error with respect to the demonstration provides little guarantee as to the quality of the final behaviour of the controller. A qualitative assessment of the learned controllers is required to tell whether the desired behaviour has been acquired. A major perspective for further research thus is to either couple Learning by Demonstration with Interactive Optimisation [Herdy, 1996, Llorà et al., 2005], or simply consider Learning by Demonstration in an interactive setting.


Independently, in order to investigate more precisely the limitations of the ESN framework, it is also required to consider explicitly non-reactive tasks, where the controller needs to be provided with some memory of its past actions and environments. Such tasks are investigated in the next chapter.


Chapter 5

Memory Enhanced Controllers

In previous chapters, the robust training of neural controllers has been tackled, whether evolved in-silico or trained in-situ. In the Learning by Demonstration experiments, Echo State Networks (ESN) were found to provide an efficient neural network framework, expressing dynamics with a reduced number of parameters (reservoir-to-output connexions) compared to standard neural networks (all connexion weights). This chapter focuses on the robust acquisition and exploitation of memory skills in the context of robotic control. State-of-the-art recurrent neural networks are investigated for memory-enhanced control modelling, namely Echo State Networks and Neuro-Evolution of Augmenting Topologies (NEAT). While recurrent neural networks offer the expressiveness required to model complex dynamics, ESNs have a linear complexity in the size of the reservoir, i.e. the number of hidden neurons. The ESNs are trained with Evolution Strategies with covariance matrix adaptation; NEAT performs a genetic, non-parametric optimisation of both the neural network connexion weights and its topology. The experimental validation of the approach considers the Tolman maze benchmark, requiring the robot controller to feature some limited counting abilities. An elaborate experimental setting is used to enforce the controller generality and prevent opportunistic evolution from completing a deliberative task through smart reactive heuristics.

5.1 Going beyond reactive control

Well-posed problems in robotics are handled using control theory, modelling the target goal and the environment in terms of differential equations [Laumond, 1998].


The control framework gives optimality guarantees regarding the solution controller and its stability. This framework however relies on strong assumptions, including a comprehensive model of the world; this assumption is not verified in the current context. One prominent challenge in Autonomous Robotics is to go beyond reactive control [Thrun et al., 2006] and to feature deliberative control, involving at least to some extent planning abilities [Toussaint et al., 2007]. As discussed in chapter 2, section 2.1, one key difference between reactive and deliberative control is that the latter proposes planning mechanisms, expressing a memory ability. The robot situation cannot be determined from its current sensor values only, due to perceptual aliasing (section 2.1.4); memory-based control extends the perception range of the robot, disambiguating such delicate situations. Approaches investigating explicit or implicit memory modelling have been proposed in the literature for robotic experimental setups. In [Lanzi, 1998a], an explicit memory represents an implicit knowledge about past states and is used to enhance the controller in a discrete control environment. A quite elaborate and demanding approach proceeds by encoding every possible situation of the robot in the search space, as in [Meuleau and Brafman, 2007]. Implicit memory modelling often proceeds by representing robotic controllers as recurrent neural networks [Nolfi and Floreano, 2000] [Urzelai and Floreano, 2001] [Tuci et al., 2004] (section 2.3.2). The implicit memory in recurrent neural networks motivates the present chapter, although other search spaces could also have been considered.

5.1.1 Deliberative control

Many robotic tasks are tackled by deliberative controllers rather than reactive ones. These approaches rely on high-level perceptions, which often involve the use of maps and localisation algorithms [Arkin, 1998]. Map localisation is commonly handled through a position belief, using position filters over the available map, such as the Kalman filter from signal processing or Monte-Carlo localisation [Welch and Bishop, 2008]. The dual problem of building a map and localising within it, known as Simultaneous Localisation And Mapping (SLAM), is also a great challenge in robotic planning-based control [Smith et al., 1986] [Bailey and Durrant-Whyte, 2006]. Maps enable the use of planning algorithms, such as A*, to generate a sequence of actions toward a goal [Hart et al., 1968].


SLAM approaches support planning activities, possibly under uncertainty. Planning under uncertainty is the context considered in this chapter: the environment is not known, and it can be dynamic and change with time, as a result of the robot's actions or of other phenomena. According to [Russell et al., 2006], most robots make use of deterministic algorithms to undertake control decisions, using the most likely state. Markov Decision Processes (MDP) are often considered under such premises, when uncertainty occurs during transitions while the states are fully observable through the sensors [Puterman, 1994]. The robot goal is expressed via a reward function; solutions are policies maximising the reward expectation in a Reinforcement Learning (RL) setting. Reinforcement Learning proceeds by estimating value functions, namely the expected reward associated with each state of the search space, or with each (state, action) pair, through the Hamilton-Jacobi-Bellman equations [Sutton and Barto, 1998b]; an optimal policy can then be derived by selecting, in every state, the action maximising the associated reward or leading to the state with maximal expected reward. Helicopter and quadruped robotic control using MDPs and Apprenticeship Learning has been tackled in [Abbeel, 2008]. Memory modelling in RL is often based on redesigning the state space [Meuleau and Brafman, 2007]: each state may be duplicated depending on the current situation of the robot with respect to the goal, and the MDP provides an optimal control policy for each state. The environment is usually partially observable, due to sensor limitations and perceptual aliasing, calling for belief-based MDPs, referred to as Partially Observable Markov Decision Processes (POMDP); POMDP approaches imply planning under uncertainty. While reinforcement learning provides sound guarantees of optimality, it hardly scales up with respect to the size of the state and action spaces, particularly so when no model of the environment is provided. When the model is provided, policy learning can be directly tackled in terms of maximum a posteriori estimation [Toussaint et al., 2007]. As a consequence, these approaches hardly apply to robotic problems, since they require a model of the world which is usually not available. Another state-of-the-art framework, based on Genetic Algorithms and MDPs, is the Learning Classifier System (LCS) [Holland, 1975] [Sigaud, 2007]. The genetically evolved individuals are rules referred to as classifiers; the control system contains several rules, composed of premises, actions and rewards. The premises represent the world situation, i.e. the input parameters that enable the triggering of the rule.


The action is the control command, and the reward is an estimate of the pay-off associated with the given action in the given context. The genetic algorithm and the reinforcement learning rules generate and modify classifiers so as to update both the trigger conditions and the rewards. Several rules may be triggered by a given input, requiring an arbitration mechanism to select the final control output. LCS can be used in time-dependent systems, referred to as multi-step systems [Lanzi, 1998a] [Butz et al., 2005].
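As a purely illustrative sketch of the rule structure described above (not of any specific LCS variant cited here), each classifier can be seen as a premise, an action and a reward estimate, with an arbitration step among the triggered rules:

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Classifier:
    premise: Callable[[Sequence[float]], bool]   # does the rule match the current sensor input?
    action: int                                  # control command proposed by the rule
    reward: float                                # estimated pay-off of the action in this context

def arbitrate(rules: Sequence[Classifier], sensors: Sequence[float]) -> int:
    # Among the triggered rules, select the action with the highest estimated reward.
    matching = [r for r in rules if r.premise(sensors)]
    if not matching:
        raise ValueError("no rule matches the current situation")
    return max(matching, key=lambda r: r.reward).action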

5.1.2 Neural approaches to deliberative control

Neural networks contribute to deliberative control in two ways. On the one hand, the network topology performs an implicit transformation of low-level sensations into some level of perception, encapsulated in the neuron states, through the hidden neurons and connexions. On the other hand, neural networks are able to express different forms of memory. The first form of memory expressed by neural networks is the so-called associative memory: the connexion weights between neurons are defined through training so that the network retains the data presented as input, which constitutes a form of memory. The Hopfield neural network is a typical neural network based auto-associative memory [Hopfield, 1982], relying on the Hebb training rule (section 2.3.2). In the Hopfield model, neurons are fully connected, but no connexion is made from a neuron to itself. Associative memory encodes the input data presented to the system in the neural network weights; such approaches prove able, through training, to memorise data such as human faces. The second kind of memory is concerned with temporal dynamics. Aside from associative memories, where the connexion weight values represent the memory, as detailed in section 2.3.2, the states of the neurons retain some information about their previous values, due to the recurrences in the network. The recurrences are graph cycles that allow the information to circulate in the network for some iterations before fading completely. In Elman networks [Elman, 1990a], the hidden layer is fully connected; the original representation considers a copy layer connected to the hidden layer, providing a context for the hidden layer. The output produced depends on this context, which itself depends on past inputs. Recurrent neural networks thus provide a relevant framework for memory-based robotic control. On the one hand, it is rather difficult to determine precisely the amount of dynamics required for a given task; through the neural network topology, dynamic behaviours can be implemented.


On the other hand, the training of such dynamics can be achieved through Machine Learning and optimisation techniques¹. Evolutionary techniques provide a direct approach to train recurrent neural networks, as in chapter 3. Whatever the approach, training recurrent neural networks is significantly more complex than training feed-forward ones. Recurrent neural networks meet the expressiveness requirement [Hecht-Nielsen, 1989], but this implies a high training cost. An equally important requirement regards the training capability: the search for satisfactory solutions must be computationally tractable. When focussing specifically on evolutionary robotics, an additional requirement regards the amount of human effort needed to find satisfactory solutions, by designing an appropriate fitness function and evolution parameters. Prominent approaches in the evolutionary robotics literature address the above requirements in different ways. Elman recurrent neural networks can be trained with parametric continuous optimisation; the expressiveness of the recurrent neural network search space is ruled by the user-supplied number of neurons N, but the size of the optimisation search space increases quadratically with N [Tuci et al., 2004]. The remaining challenge is to find a good trade-off between the size of the search space (the number N of neurons and the related connexion weights) and the memory span supported by the architecture. The literature proposes some work concerned with time-dependent control for mobile robots achieved through the evolution of recurrent neural networks [Urzelai and Floreano, 2001] [Capi and Doya, 2005]. Recent approaches to recurrent network training include the Neuro-Evolution of Augmenting Topologies (NEAT) framework and Echo State Networks (ESN). The NEAT framework proposes a non-parametric optimisation of neural networks, be they recurrent or not [Stanley and Miikkulainen, 2002] [K.O. Stanley, 2003]: NEAT optimises both the topology and the weights of the neural networks. Even though NEAT does adjust the size of the search space, it requires approximately 30 evolutionary hyper-parameters to be properly tuned (by trial and error); the number of neurons and the network topology are automatically adjusted according to the fitness value. An alternative to NEAT is based on the parametric optimisation of the so-called Echo State Networks proposed by Jaeger [Jaeger, 2002] (chapter 4). Moreover, when modelling memory in recurrent neural networks, two situations can occur: on the one hand, the information from previous time steps may fade exponentially, scaling down the memory capability; on the other hand, the reservoir dynamics may be subject to memory saturation.

¹ Back-propagation techniques do not work directly on recurrent topologies (section 2.3.3).



ESNs provide a means to control memory more precisely than the general recurrent neural network setting, thanks to the damping factor α used to constrain the reservoir dynamics. Recent investigations of ESNs in the domain of Autonomous Robotics [Hartland and Bredeche, 2007] [Jiang et al., 2008] [Antonelo et al., 2007] or Optimal Design [Devert et al., 2007] show that ESNs are amenable to frugal and efficient training. The work presented in [Antonelo et al., 2007] is concerned with the detection of complex events and the localisation of the robot in its environment: ESNs are used to detect specific environment patterns with regard to the robot's previous positions, allowing it to make use of an implicit map for decision making. In [Jiang et al., 2008], the famous double pole problem is tackled using an ESN optimised with CMA-ES. A third approach to memory modelling within neural networks is based on memory-enhanced neurons. Spiking neurons represent a biologically more plausible neuron model [Maass, 1996] [Maass et al., 2002], where the output of a neuron depends on the input dynamics over the past time T; the reservoir computing framework based on spiking neurons is referred to as Liquid State Machines [Jaeger et al., 2006] [Jaeger et al., 2007]. Neural networks with an explicit memory encoded in the neurons have also been proposed in the literature [Hochreiter and Schmidhuber, 1995].

5.2 Tolman Comb and Robust Training

The contribution presented here is threefold. Firstly, a benchmark environment is proposed to assess memory-enhanced robot controllers. Secondly, a robust experimental methodology is defined in order to ensure robust controller training, preventing the evolutionary opportunism that might otherwise occur. Finally, a comparative study is performed in order to assess two state-of-the-art recurrent neural network settings: NEAT and ESNs.

5.2.1 A benchmark Environment

The benchmark environment presented in this section is inspired by ethology experiments conducted in the early 1930s by Tolman. Tolman's experiments aimed at investigating latent learning abilities, focussing on rats and humans [Tolman and Honzik, 1930] [Tolman, 1948]. Various experiments had rats navigate through mazes; under various premises the rats learned the shortest path toward food, and doors and curtains were positioned through the maze to enforce a strong perceptual aliasing, so as to enforce the requirement for memory (Figure 5.1).



Figure 5.1: Environment for latent learning in rats.

Learning Classifier Systems were applied to a discrete variant of the Tolman comb shown in Figure 5.2 [Lanzi, 1998b]. The behaviour aimed at is that of reaching the third avenue in a maze containing four branches, thus demonstrating some limited counting ability.

Figure 5.2: Tolman comb environment. The robot starts from the extreme left position, heading to the right. The target is located at the red cross.

The Tolman comb is designed so that the robot cannot distinguish between the four branches using its sensors alone. The sensor range is limited so that, at any given time, whatever the robot position, only one branch can be detected. Therefore, for the robot to navigate directly to the third branch, the controller requires a memory capability. Since the translation from one branch to another takes time, the memory span needs to be finely tuned in order to avoid memory fading effects.



5.2.2 Designing robust memory-enhanced controllers

The task is defined by the minimal distance to a target, possibly involving intermediate checkpoint targets; the target is located at the bottom of the third branch of the comb (Fig. 5.2). The fitness is meant to be minimised: the optimisation should produce controllers minimising the distance to the target. A large amount of time is given for the controller assessment, so that the robot is physically able to reach the target. Evolutionary opportunism is a critical issue with regard to the fitness function and the experimental conditions defined (section 2.4.3): trivial or undesirable solutions, corresponding to optimal but inappropriate behaviours, may emerge. Such solutions tend to crowd the population, and the evolutionary search never recovers. A representative example described in [Nolfi and Floreano, 2000] considers a fitness maximising the speed while penalising collisions; evolutionary opportunism yields controllers which turn very fast while staying at the same location. Opportunism was sidestepped in the experiments presented in chapter 3, thanks to the fitness function defined there. Notably, heuristics enforcing diversity in the controller population [Lehman and Stanley, 2008] might counteract evolutionary opportunism by enforcing the discovery of more diverse solutions. In the general case however, efficient fitness functions often result from some co-evolution between the evolutionary engine and the designer, barring the discovery of undesired solutions while yielding a tractable fitness landscape. Preliminary experiments on the given Tolman comb yield such inappropriate behaviours: training performed with feed-forward neural networks surprisingly produced robots turning into the third branch. One way to reach the target without making use of memory is to go to the end of the corridor and then make an adequate rebound so as to enter the appropriate branch. Another behaviour simply consists in following the wall and entering every branch until the third one is reached (Fig. 5.3). In this context, no memory is required; the experimental setting thus requires some generality features in order to enforce the need for memory. This is achieved by applying two simple yet efficient tricks. Firstly, the environment is opened at the end of the main corridor, which leads to the open-Tolman comb (Fig. 5.4). Secondly, variations are introduced within the environment by changing two corridor length parameters from one evaluation to another.


Figure 5.3: Feed-forward neural networks behaving in the Tolman comb environment.

Figure 5.4: Open-Tolman comb environment. The lengths d and d′ are uniformly drawn at each new epoch during assessment; the controller fitness value is averaged over K independent epochs.

Both the length from the starting position to the first branch and the distance from one branch to another are drawn uniformly within specific ranges ensuring the minimal requirement for perceptual aliasing. The controller is thus not evaluated on one open-Tolman comb, but on several variants with different values of the parameters d and d′ (Fig. 5.4). More details are given in the experimental setting section; the point is that, since no two controllers are assessed in the same environment, some issues need to be considered. Some controllers, performing adequately in an environment defined by specific values of the parameters d and d′, will perform poorly in environments defined by other values of d and d′; the solution is expected to perform the adequate behaviour whatever the values of d and d′. In order to make the assessment more robust to evolutionary opportunism, the evaluation is performed by averaging the fitness value over several epochs, each epoch involving a new environment with a uniformly drawn set of values d and d′.


When considering this one-epoch-one-environment setting with several epochs per evaluation, the solution controllers are expected to behave correctly in a more general context with respect to d and d′, thus reducing the impact of evolutionary opportunism.

5.3 Validation

The experiments are designed to evaluate the capacity of neural networks to acquire memory skills, which can be seen here as a counting ability. The experiment is performed in two steps. A preliminary experiment first considers ESNs alone, assessing their counting ability on simple two-valued input sequences. Control experiments are then performed, comparing the ESN and NEAT approaches on the open-Tolman comb environment.

5.3.1 Preliminary experiments

A sanity check is performed in order to assess the ability of an ESN trained with the standard procedure to feature counting abilities. The input function is defined over sequences of 100 time steps during which events occur. The default input value is 0. Three times, at regular intervals of 10 time steps, blocks of 10 time steps with value 0.25 are produced as inputs, giving input block patterns of the form:

0⁺ + (0.25 … 0.25) + (0 … 0) + (0.25 … 0.25) + (0 … 0) + (0.25 … 0.25) + 0⁺

For these inputs, the desired output corresponds to the block pattern:

0⁺ + (0 … 0) + (0 … 0) + (0 … 0) + (0 … 0) + (0.25 … 0.25) + 0⁺

The output must thus activate only in answer to the third input block of value 0.25; in other words, the output activates every third time an input block is provided to the system, which represents a counting ability. Training follows the procedure presented in [Jaeger, 2001], using a simple linear regression algorithm on repetitions of 10 randomly generated sequences. The ESN is expected to express the desired counting behaviour; the experiment also serves to assess the adequate reservoir size.
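The sketch below shows one plausible way to generate such training sequences; the exact noise level, the padding and the sequence bookkeeping of the original experiments are assumptions.

import numpy as np

def counting_sequence(block=10, gap=10, pad=25, value=0.25, noise=0.005, rng=None):
    # Input holds three blocks of `value`; the desired output is active only on the third block.
    rng = rng or np.random.default_rng()
    length = 2 * pad + 3 * block + 2 * gap            # 100 time steps with the defaults
    u = np.zeros(length)
    y = np.zeros(length)
    start = pad
    for i in range(3):
        u[start:start + block] = value
        if i == 2:                                    # the output answers only the third block
            y[start:start + block] = value
        start += block + gap
    u += rng.normal(0.0, noise, length)               # small random noise added to the input
    return u, y

For the variant with varying delays used later in this section, the gap between blocks can be drawn uniformly in [10, 15] for each interval instead of being fixed.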


ESNs with various reservoir sizes (N ∈ {10, 20, 50, 100, 150, 200, 300}) are trained on 10 randomly generated sequences; the connectivity rate and the damping factor remain the same, respectively ρ = 0.1 and α = 0.9. These sequences are repeated 10 times with a small random noise added, making a total of 100 sequences submitted for training. The ESNs are finally assessed on 2 newly generated test sequences. Results are given in Table 5.1, and a typical result for an ESN with 100 reservoir neurons is shown in Figure 5.5. The results suggest that increasing the size of the reservoir allows the network to reach a more and more precise result, up to some extent, expressing a counting ability; overly large reservoirs tend to limit the performance. Two possible reasons are the expression of overly complex dynamics, or a limit in the performance of the training method.

Reservoir size    best error        average test error
10                3.65 × 10^-2      1.4 × 10^-1 ± 1.3 × 10^-1
20                3.46 × 10^-2      3.9 × 10^-2 ± 1.6 × 10^-3
50                2.76 × 10^-2      3.2 × 10^-2 ± 1.8 × 10^-3
100               4.00 × 10^-3      8.4 × 10^-3 ± 3.9 × 10^-3
150               1.38 × 10^-3      3.7 × 10^-3 ± 1.7 × 10^-3
200               1.28 × 10^-3      2.8 × 10^-3 ± 7.0 × 10^-3
300               1.81 × 10^-3      5.3 × 10^-3 ± 4.6 × 10^-3
Table 5.1: Mean Square Error of the ESN for various reservoir sizes. With increasing reservoir size, the error gets lower; ESNs with reservoirs of 100 neurons or more begin to produce usable results, even if the performance decreases as the reservoir keeps growing.

In order to check the validity of the experiment with regard to the counting ability, the trained networks are also evaluated on sequences containing only two input blocks. The expected ESN behaviour is not to activate when the third input block is absent or far away in time. Results are displayed in Table 5.2. The performance remains similar to the previous experiment and the trained networks behave as predicted: if the sequence contains only two blocks, the network does not activate, as shown in Figure 5.6. Finally, in order to come closer to the open-Tolman experiment, variations between signals are also introduced: the waiting duration between two blocks now varies from one interval to another, drawn between 10 and 15 time steps, whereas it was fixed to 10 time steps in the previous experiments. This experiment does not provide adequate behaviours with reservoir sizes below 200 neurons (Table 5.3). Figure 5.7 presents the best result obtained. With such variations, ESN training becomes harder and requires a larger reservoir: networks of up to 200 neurons hardly behave, while 300 neurons produce adequate behaviours. The results clearly suggest that generalisation is rather hard to obtain in this specific example where several parameters change. Even though the task is not trivial, it can be accomplished to some extent by finding an appropriate reservoir size; the training method can however be questioned in terms of generality and ability to overcome local optima.


Figure 5.5: Top: the input and desired output functions. Bottom: the output of a representative ESN with N = 100 on the training data (time steps 0 to 1000) and test data (1000 to 1200).



Figure 5.6: Top: the input and desired output functions. Bottom: the output of a representative ESN with N = 100 on the training and test data. In the test phase (time steps 1200 to 1400), the sequences contain only two blocks; the ESN does not trigger since no third block is given in input, and the time before the next sequence allows the ESN to wash out its reservoir.



Reservoir size    best error        average test error
10                3.66 × 10^-2      4.6 × 10^-2 ± 1.5 × 10^-2
20                4.00 × 10^-2      4.4 × 10^-2 ± 5.9 × 10^-3
50                3.34 × 10^-2      4.1 × 10^-2 ± 3.3 × 10^-3
100               3.20 × 10^-3      1.3 × 10^-2 ± 5.4 × 10^-3
150               3.31 × 10^-3      4.6 × 10^-3 ± 1.6 × 10^-3
200               2.12 × 10^-3      3.6 × 10^-3 ± 9.2 × 10^-4
300               3.89 × 10^-3      8.5 × 10^-3 ± 3.8 × 10^-3
Table 5.2: Mean Square Error of the ESN for various reservoir sizes when only two-block sequences are given, checking whether the ESNs answer to the inputs or to an internal dynamic.

Reservoir size    best error        average test error
10                3.69 × 10^-2      1.2 × 10^-1 ± 1.2 × 10^-1
20                3.57 × 10^-2      6.1 × 10^-2 ± 7.0 × 10^-2
50                3.02 × 10^-2      3.3 × 10^-2 ± 2.8 × 10^-3
100               2.76 × 10^-2      3.7 × 10^-2 ± 1.2 × 10^-2
150               2.04 × 10^-2      5.0 × 10^-2 ± 5.3 × 10^-2
200               1.53 × 10^-2      4.9 × 10^-2 ± 4.9 × 10^-2
300               6.26 × 10^-3      5.3 × 10^-2 ± 6.0 × 10^-2
Table 5.3: Mean Square Error of the ESN for various reservoir sizes when variations occur between and within sequences. With increasing reservoir size, the error gets lower; ESNs with reservoirs larger than 200 neurons yield the desired behaviour.



Figure 5.7: Top: the input and desired output functions. Bottom: the output of a representative ESN with N = 300 on the training and test data, with a randomly varying delay between blocks.



5.3.2 Tolman Experiments

In this section, the memory-testing robotic control benchmark is considered. The experimental settings are first defined concisely; the two main evolutionary approaches, the prominent NEAT and the CMA-ES based optimisation of ESNs, are then compared, and the results are discussed.

Experimental Setting

Fitness functions. In the following, the controllers are evolved using NEAT (for recurrent neural networks) and CMA-ES (for ESNs), exploiting the known ability of evolutionary computation to explore the search space². Formally, let F_i be the first fitness function, defined as the minimal distance to the target location x*:

F_i = min_t ||x(t) − x*||_2

The F_i landscape features a local optimum: the end of the second branch is closer to the target location than the entry of the third branch, so a controller arriving at the end of the second branch and staying there fits better than a controller cruising in the top avenue of the maze. It is worth noting that the F_i landscape nevertheless favours exploration: letting t* denote the time step where the trajectory is closest to the target position x* (t* = argmin_t ||x(t) − x*||), anything the controller does after t* is free of charge. Formally, the F_i landscape thus involves large neutrality plateaus. The second fitness function considered, noted F_d, incorporates extensive background knowledge about the task at hand and the structure of the solution. Inspired by the RL progress estimator [Mataric, 1997], it is based on decomposing the problem into two sub-tasks: arriving at the entry of the third branch, noted z*, and further arriving at the end of the third branch, x*. More precisely, fitness F_d multiplicatively aggregates two terms, the minimal distance between the robot trajectory and z*, and the minimal distance between the trajectory and x*:

F_d = (min_{t=1...T} ||x(t) − z*||_2 + 1) × (min_{t=1...T} ||x(t) − x*||_2)

² Complementary experiments using the Learning by Demonstration paradigm on the Tolman comb with linear regression did not provide adequate behaviours.



Indeed, this fitness function is very specific and crafted for the problem at hand; it provides explicit clues concerning the expected behaviour, through one extra target distance to minimise³.
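Reading the norm as the Euclidean distance, the two fitness functions can be sketched as follows from a robot trajectory given as a sequence of 2D positions; names and array shapes are illustrative.

import numpy as np

def fitness_i(trajectory, x_star):
    # F_i: minimal distance between the trajectory and the target location x*.
    d = np.linalg.norm(np.asarray(trajectory) - np.asarray(x_star), axis=1)
    return d.min()

def fitness_d(trajectory, z_star, x_star):
    # F_d: product of (minimal distance to the checkpoint z*, plus one) and
    # the minimal distance to the target x* at the end of the third branch.
    traj = np.asarray(trajectory)
    d_z = np.linalg.norm(traj - np.asarray(z_star), axis=1).min()
    d_x = np.linalg.norm(traj - np.asarray(x_star), axis=1).min()
    return (d_z + 1.0) * d_x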

Training. The software environment supporting the experiments includes the following. The simulator, used in the experiments of the previous chapters, is described in appendix A. The ESN library, used throughout all presented experiments, is a home-made library exploiting the JAva MAtrix package (JAMA), available at http://math.nist.gov/javanumerics/jama/. Due to the larger dimension of the search space (compared to chapter 3), a linear-complexity variant of the original CMA-ES is used, referred to as Sep-CMA-ES [Ros and Hansen, 2008] (chapter 2.4.5); specifically, Sep-CMA-ES uses a diagonal matrix to compute the σ variation. The implementation used is the Java one, available at http://www.bionik.tu-berlin.de/user/niko/cmaesintro.html. The NEAT library is Another NEAT Java Implementation 2.0 (ANJI), available from http://www.cs.ucf.edu/∼kstanley/neat.html. Thanks to (Sep-)CMA-ES, the evolution of ESNs involves only three hyper-parameters: the number N of neurons in the reservoir, the connectivity rate ρ and the damping factor α. After preliminary experiments, the connectivity rate and damping parameters were respectively set to 0.1 and 0.9; the reservoir size N was varied in {50, 100, 200}. NEAT, through the ANJI implementation, requires a fine tuning of approximately 30 parameters, 12 of which are essential (Table 5.4). Used at first with its default parameters, NEAT failed to solve the problem at hand; the failure was a posteriori mainly blamed on an insufficient population size, preventing NEAT from exploiting solution clusters. In the remainder of the chapter, the NEAT parameters are taken from [Devert et al., 2007], with a population size set to 500⁴; all parameters are set to their default values except for those described in Table 5.4. The number of generations is set so that the overall number of evaluations remains the same for both Sep-CMA-ES and NEAT.

³ As NEAT is tailored to maximise, the fitness function is set to C − F_i, respectively C − F_d, where C is a high constant value.
⁴ This population size is large compared to the Sep-CMA-ES population, but the two methods involve different search spaces.



Parameter                                      Value
Population size                                500
Reproduction ratio per species                 0.2
Elite size per species                         1
Crossover probability                          0.15
Add-node mutation probability                  0.01
Add-link mutation probability                  0.01
Enable-link mutation probability               0.045
Disable-link mutation probability              0.045
Gaussian weights mutation probability          0.8
Std. dev. for Gaussian weight mutation         0.1
Uniform weights mutation probability           0.01
Distance parameters for fitness sharing        1.0 – 1.0 – 0.2
Table 5.4: NEAT parameters. All other parameters are set to their default values [Devert et al., 2007].

One simulation, or epoch, is defined by:
• drawing the maze parameters d and d′;
• setting the initial position of the robot;
• running the controller for T = 200 time steps.
The fitness of each controller is averaged over K epochs, making one evaluation with K × T total time steps (see the sketch below). Several values of K were tested: 2, 8, 16 and 32. Each run is stopped after a total of 200,000 epochs, for both NEAT and ESN evolution. For both ESNs and NEAT, the computational time is circa 24 to 48 hours on a Dual Core AMD Opteron 1.8 GHz; the variation in duration mainly depends on K. All results are averaged over 16 independent runs with the same initial parameters (each run involves one ESN reservoir, making 16 different ESN optimisations).
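The sketch below illustrates the epoch-averaged evaluation driving the ESN optimisation. It is written with the Python cma package in its separable (diagonal) mode, rather than the Java implementation actually used; simulate_epoch, the maze parameter ranges and the mapping from genome to ESN weights are assumptions.

import numpy as np
import cma  # Python implementation of CMA-ES

def evaluate(genome, simulate_epoch, k_epochs=16, rng=np.random.default_rng()):
    # Average the fitness over K epochs, each with freshly drawn maze parameters d, d'.
    total = 0.0
    for _ in range(k_epochs):
        d, d_prime = rng.uniform(low=[1.0, 1.0], high=[2.0, 2.0])   # illustrative ranges
        total += simulate_epoch(genome, d, d_prime, n_steps=200)     # returns F_d for one epoch
    return total / k_epochs

def optimise(n_weights, simulate_epoch, budget_epochs=200_000, k_epochs=16):
    # Separable (diagonal) CMA-ES over the trainable ESN weights encoded in the genome.
    es = cma.CMAEvolutionStrategy(np.zeros(n_weights), 0.5, {'CMA_diagonal': True})
    evals = 0
    while evals * k_epochs < budget_epochs and not es.stop():
        genomes = es.ask()
        es.tell(genomes, [evaluate(g, simulate_epoch, k_epochs) for g in genomes])
        evals += len(genomes)
    return es.result.xbest

In this setting, each call to evaluate draws K fresh open-Tolman combs, so that no two candidate controllers are scored on exactly the same environments.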

Syntactic and Semantic Evaluation

Two success criteria are considered to evaluate the quality of the results obtained, due to the intrinsic bias in the fitness: a syntactic and a semantic success criterion. The semantic evaluation refers to the behaviour performance with regard to the task, while the syntactic evaluation refers to the fitness value during evolution. An individual that is excellent for specific values of d and d′ might not be very efficient in the general case.


Each experiment, for a given set of parameters (K = 2, 8, 16, 32; N = 50, 100, 200), was launched 16 times for a syntactic evaluation (evolution based on the fitness functions F_i and F_d) of the overall ability to converge. Additionally, for each of the 16 runs, the performance, convergence and variance of the approach are assessed as follows: each best-so-far individual (improving on the previous best individual found in the evolutionary process) is evaluated according to its success rate on 100 independent mazes, in order to filter out the noise. The success rate measures whether the robot gets sufficiently close to the target location, up to a tolerance parameter ε, experimentally set to half the branch length. Parameter ε is however not very sensitive, as most robots either go very close to the target location or stay far away. This experiment allows us to observe the improvement of the behaviours and the possible increase in robustness due to averaging the evaluation over epochs.

5.3.3 Results

Unless otherwise stated, the experiments use the fitness F_d.

Number of epochs. The first results compare different numbers of epochs within the evolutionary process, using the F_d fitness function. The optimisation is performed over 16 independent runs on ESNs with 100 reservoir neurons (Fig. 5.8 and 5.9). The evolution curves are given in logarithmic scale for better readability. The semantic performance (Fig. 5.13) shows the median and best performance over 100 trials for each of the 16 evolutionary processes, i.e. the number of times the target is reached over 100 independent open-Tolman comb environments, for each individual improving on the evolutionary process so far. The results suggest that the number K of epochs, aimed at smoothing the noise effects, controls the convergence of the process: if K is too low, a premature convergence toward a local optimum is observed; if K is too high, the search is confined to stable, under-optimal individuals. A first result concerns the seemingly efficient convergence obtained with only 2 epochs, compared to higher epoch numbers (8, 16 or 32). The semantic study confirms that evolutionary opportunism occurred with a low number K of epochs: the median performance of the 2-epoch trained controllers is 0, and the best individuals, found in the early stages of the process, show that further fitness improvement did not bring behavioural improvement.


With K = 2, the evolved best behaviours only behave adequately for specific environment profiles, with advantageous d and d′ values (the best ESNs performed adequately on approximately 26% of the 100 generated environments). For larger K values, a slower convergence toward a lower fitness is observed. The median fitness for K = 16 is the best, and this result is consistent with the semantic study: the median performance of the best individuals is highest with 16 epochs. In fact, performance rises together with the number of epochs, but K = 32 provides less efficient behaviours than K = 16. Except for K = 2, the best individuals obtained through the 16 processes provide valid solutions. Not only does this approach allow the evolution to avoid evolutionary opportunism; it also allows a robust and general training of the controller. The results suggest that the number of epochs should be high enough to avoid opportunism, but probably not too high either, in order to ensure the convergence of the algorithm: increasing K to large values seems to hinder the evolutionary process. In this study, however, this limit was evaluated experimentally using broad ranges for K, which might not be optimal with regard to the task at hand.

Size of the reservoir. In the preliminary experiments (section 5.3.1), the trained reservoir sizes were {10, 20, 50, 100, 150, 200, 300}. For the Tolman problem, for tractability reasons, the reservoir size varies in {50, 100}. Another reason to constrain the reservoir size comes from the results obtained with NEAT: even if its range is larger, the efficient networks produced by NEAT have between 55 and 96 hidden neurons. The results suggest that the 50-neuron ESNs converge more slowly on average than those with 100 neurons, although one 50-neuron ESN achieves a better convergence speed (Fig. 5.10 and 5.11); the ESNs with 100 reservoir neurons converge to a better fitness. The 50-neuron ESNs involve fewer parameters (half as many as a 100-neuron ESN) and hence tend to converge faster, but their topologies appear not to express sufficient dynamics to solve the problem at hand, so that the average optimisation does not converge to the best results. The semantic performance of the best ESNs obtained is equal, but on average the 50-neuron ESNs tend to perform less efficiently (Fig. 5.14). A tentative interpretation of these results involves the limited random dynamics generated for a 50-neuron reservoir.


reservoir is more likely to produce the dynamics required to solve the task, but requires more parameters to optimise.

ESN and NEAT

Table 5.5 displays the results obtained with ESNs (with a reservoir size of 100 neurons) and NEAT, considering fitness Fd and fitness Fi with K = 16. The left columns report the syntactic results, namely the median and best fitness obtained during evolution (Fig. 5.11 and 5.12). The right columns report the semantic evaluation, namely the success rate of the best controller evolved in each evolutionary process, averaged over 100 mazes (Fig. 5.15, 5.17 and 5.18). The best NEAT topologies behaving adequately contain 55 to 96 hidden neurons. The other best NEAT topologies range from 0 to 118 hidden neurons but achieve a low semantic performance. The average damping factor of the NEAT topologies obtained is 1.79 ± 0.28 and is always higher than 1, except for the solution with no hidden neurons. No difference appears between the successful and unsuccessful NEAT topologies with regard to the damping factor.

             Fitness value               Success rate (/100)
             Best         Median         Best   Median   Average
Fd   NEAT    4.6·10^-2    0.97           98     78       57.4 ± 37.2
     ESN     4.5·10^-2    0.18           99     70       59.06 ± 31.52
Fi   NEAT    1·10^-2      0.94           95     76       72.87 ± 14.8
     ESN     9.33·10^-4   0.84           98     24       43.75 ± 32.0

Table 5.5: The Tolman Maze: ESN and NEAT results. Fitness values (left part) and semantic success rate (right part), with fitness Fd and Fi, over 16 independent runs.

Overall, ESNs and NEAT reach comparable performances; however, they react differently to the two fitness functions. The fitness landscape defined by the prior-knowledge-based fitness Fd is more amenable to ESNs than to NEAT, as shown by the stagnating median fitness of NEAT. Optimisation in the Fi fitness landscape (minimum distance to the target) proves more difficult, as expected. The results confirm that, with Fi, an improved fitness value does not necessarily imply that the success rate likewise increases.



A tentative interpretation involves the number of epochs K, which contributes to reducing evolutionary opportunism (trivial and undesirable solutions), together with the fitness landscape defined by this fitness function. Controllers turning into the second or fourth branch reach low fitness values, generating local optima that are hard to escape. While NEAT faces the same convergence difficulties, it remains consistent and overall produces more robust behaviours with the Fi fitness function in terms of semantic success rate. For both fitness functions, NEAT is outperformed by ESNs in terms of median fitness (the difference is only significant when comparing NEAT and ESNs on Fd); in both cases, the best results are obtained by ESNs. With the Fi fitness function, the best ESN individuals are found after 55,000 evaluations vs 100,000 for NEAT (Fig. 5.16).

             Time to target
             Best   Median   Average
Fd   NEAT    44     49       59.6 ± 21.26
     ESN     40     47       47.7 ± 5.43
Fi   NEAT    45     51       58.6 ± 14.89
     ESN     40     46       46.45 ± 4.86

Table 5.6: The Tolman Maze: ESN and NEAT results. Controller speed during the semantic study, with fitness Fd and Fi, averaged over 16 independent runs.

Additional features appear from the results presented in Table 5.6. During the semantic study, the robot is assessed in 100 randomly generated mazes with regard to d and d′. The time taken to reach the target, or speed, is not explicitly part of the fitness, and the time allotted during the assessment is large enough not to be a constraint during the evolutionary process. The results show that, for comparable success rates (Table 5.5), ESN controllers reach the target faster than NEAT controllers. The average number of time steps required by ESNs is 48 ± 5 with Fd, vs 60 ± 21 for NEAT. With fitness Fi, the average number of time steps required by ESNs is 46.5 ± 5 vs 55 ± 10 for NEAT. A tentative interpretation of this difference, suggesting that ESNs undergo a stronger selection pressure in favour of fast controllers than NEAT, goes as follows. By definition, the target behaviour relies on the memory of the past trajectory; this memory is encoded in the hidden neuron states. Within an ESN, this memory vanishes exponentially fast, according to the damping


factor of 0.9. While the damping factor is generally bounded by 1, the actual limit could be higher, depending on the reservoir topology. The faster the ESN controller, the more reliable its memory; efficiency and speed are thus tightly related. By contrast, NEAT does not set any constraint on the damping factor of its recurrent neural networks. Further investigation shows that the NEAT solutions have a low connectivity rate and, for the best results, a number of hidden neurons ranging from 55 to 96, making their structure close to that of the ESN solutions.
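For reference, the memory argument can be made explicit with the standard ESN state update, written here in generic reservoir-computing notation (as an illustration of the mechanism, not necessarily the exact formulation used elsewhere in this manuscript):

    x(t+1) = tanh( W_in · u(t+1) + W · x(t) ),    with ρ(W) = damping factor < 1,

where x(t) is the reservoir state, u(t) the sensor inputs, and ρ(W) the spectral radius of the reservoir matrix. Heuristically, the contribution of the state at time t to the state at time t+k decays roughly as ρ(W)^k, so a damping factor of 0.9 limits the effective memory to roughly a few tens of time steps, which is consistent with the selection pressure in favour of fast controllers discussed above.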

5.4 Conclusion

This chapter has investigated the feasibility of training a recurrent neural network controller with counting ability for an autonomous robot, comparing two evolutionary approaches: the state-of-the-art Neuro-Evolution of Augmenting Topologies (NEAT), which achieves the non-parametric optimisation of a recurrent neural network, and a more recent recurrent neural architecture, the Echo State Network (ESN), which is amenable to efficient parametric optimisation through (Sep-)CMA-ES. Experimental results suggest that evolved ESNs open promising research avenues for implicit memory modelling, with similar performances and significantly fewer hyper-parameters to tune than NEAT. For the considered setting, the NEAT solutions end up with parameters close to the ESN ones (number of neurons and topology features).
A carefully designed methodology has been proposed to enforce the generality of the controller and discard evolutionary opportunism: stochastic perturbations of the Tolman Maze, together with the control of the number K of epochs used to smooth the fitness landscape, enforce the generality of the controller solution and discard trivial, unusable solutions.
The presented work opens a new research avenue in Evolutionary Robotics, based on the comparison of the neural net architectures respectively built by NEAT and by evolving ESNs. The presented experiments provide insights into the capability of recurrent neural networks in a generic environment. Further work should investigate two main issues:
• The best topologies built by NEAT achieve results similar to ESNs, with sparse and recurrent connectivity, without any explicit limit on the damping factor. The issue of obtaining the actual best damping factor for a given network topology, and of relating it to the memory skill, remains open. The question, along the same lines as [Schrauwen et al., 2008], is whether


the performance of the controller relates to some deep characteristics of the neural connexion matrix.
• The study of the memory ability of ESNs should be deepened. ESNs have been shown to count to some extent, but the challenge of having them perform complex time-dependent tasks (involving states) remains open. As shown in the sanity checks, an increase in task complexity implies a larger reservoir. Proposing an approach to optimise the reservoir topology might be another path toward increased complexity.



[Plots: fitness (log scale) vs. number of evaluations; top panel: ESN, N=100, 2 epochs; bottom panel: ESN, N=100, 8 epochs.]

Figure 5.8: Evolution median and quartiles over 16 runs for an ESN having 100 neurons within the reservoir, using Fd fitness function. The variable parameter is the number of epochs considered: two on the top, 8 on the bottom. Result is given in logarithmic scale for better readability.



[Plots: fitness (log scale) vs. number of evaluations; top panel: ESN, N=100, 16 epochs; bottom panel: ESN, N=100, 32 epochs.]

Figure 5.9: Evolution median and quartiles over 16 runs for an ESN having 100 neurons within the reservoir, using Fd fitness function. The variable parameter is the number of epochs considered: 16 on the top, 32 on the bottom. Result is given in logarithmic scale for better readability.



[Plots: fitness (log scale) vs. number of evaluations; top panel: ESN, N=50, 16 epochs; bottom panel: ESN, N=100, 16 epochs.]

Figure 5.10: Evolution median and quartiles over 16 runs for an ESN having 50 or 100 neurons within the reservoir, using Fd fitness function. The number of epochs considered is equal to 16. Result is given in logarithmic scale for better readability.



[Plots: fitness (log scale) vs. number of evaluations; top panel: ESN, N=100, 16 epochs, fitness Fd; bottom panel: ESN, N=100, 16 epochs, fitness Fi.]

Figure 5.11: Evolution median and quartiles over 16 runs for an ESN having 100 neurons within the reservoir, using Fd fitness function on the top and Fi fitness function on the bottom. The number of epochs considered is equal to 16. Result is given in logarithmic scale for better readability.



[Plots: fitness (log scale) vs. number of evaluations; top panel: NEAT, 16 epochs, fitness Fd; bottom panel: NEAT, 16 epochs, fitness Fi.]

Figure 5.12: Evolution median and quartiles over 16 runs for NEAT, using Fd fitness function on the top and Fi fitness function on the bottom. The number of epochs considered is equal to 16. Result is given in logarithmic scale for better readability.



[Plots: number of successes (out of 100) vs. number of evaluations; top panel: median performance vs epochs, Fd; bottom panel: best performance vs epochs, Fd; curves: ESN with 2, 8, 16 and 32 epochs.]

Figure 5.13: Semantic study of median improvement along with evolution over 16 runs for an ESN having 100 neurons within the reservoir, using Fd fitness function for variable epoch sizes. On the top, the median success is given, on the bottom, the best results obtained are given.



[Plots: number of successes (out of 100) vs. number of evaluations; top and bottom panels: performance vs reservoir size, Fd; curves: ESN with 50 neurons and ESN with 100 neurons, 16 epochs.]

Figure 5.14: Semantic study of median improvement along with evolution over 16 runs for an ESN having 50 or 100 neurons within the reservoir, using Fd fitness function, for variable reservoir sizes. On the top, the median success is given, on the bottom, the best results obtained are given.



[Plots: number of successes (out of 100) vs. number of evaluations; top and bottom panels: performance, ESN vs NEAT, Fd; curves: NEAT with 16 epochs and ESN with 100 neurons, 16 epochs.]

Figure 5.15: Semantic study of median improvement along with evolution over 16 runs for an ESN having 100 neurons within the reservoir, compared to NEAT, using Fd fitness function. On the top, the median success is given, on the bottom, the best results obtained are given.



[Plots: number of successes (out of 100) vs. number of evaluations; top and bottom panels: performance, ESN vs NEAT, Fi; curves: NEAT with 16 epochs and ESN with 100 neurons, 16 epochs.]

Figure 5.16: Semantic study of median improvement along with evolution over 16 runs for an ESN having 100 neurons within the reservoir, compared to NEAT, using Fi fitness function. On the top, the median success is given, on the bottom, the best results obtained are given.



[Plots: number of successes (out of 100) vs. number of evaluations; top panel: NEAT, success evaluation, Fd; bottom panel: NEAT, success evaluation, Fi; 16 epochs.]

Figure 5.17: Semantic study along evolution over 16 runs for NEAT: success rate over 100 mazes, using the Fd fitness function on the top and the Fi fitness function on the bottom.



[Plots: number of successes (out of 100) vs. number of evaluations; top panel: ESN N=100, success evaluation, Fd; bottom panel: ESN N=100, success evaluation, Fi; 16 epochs.]

Figure 5.18: Semantic study along evolution over 16 runs for an ESN having 100 neurons within the reservoir: success rate over 100 mazes, using the Fd fitness function on the top and the Fi fitness function on the bottom.




Chapter 6

Conclusion and Perspectives

Several issues at the core of Autonomous Robotic Control have been investigated in the presented research. In this last chapter, the position of the considered issues is summarized and the contributions are discussed, opening some perspectives for further research.

Training with bounded and unreliable resources

A first major issue for Autonomous Robotics is that of in-situ vs in-silico training. In-silico approaches extensively rely on robotic simulators, encoding a world model and, specifically, sensor and actuator models. While the whole in-silico approach relies on the quality of the simulator at hand, acquiring reliable models of the sensor and actuator devices raises significant difficulties, with respect to the noise model (it is widely acknowledged that sensors provide noisy information, and actuators likewise are noisy; still, the noise model proves elusive and, in particular, not Gaussian) and to the many parameters involved in real-world modelling (e.g., light conditions and texture of the obstacles for visual sensors). A physics-compliant simulator (e.g., modelling elastic shocks or gravity) is usually slow, to such an extent that simulations might take about as long as in-situ experiments. Using in-silico approaches, a robotic controller is thus trained within either cheap but unreliable, or accurate but time-consuming, programming environments. In the first case, the trained controller suffers from the so-called Reality Gap: while accurate in-silico, the robot behaviour is inappropriate in real conditions. In the second case, the merits of in-silico training are


questionable compared to that of in-situ training (the real thing). By contrast, in-situ approaches are always time-demanding: experiments take time and require the full participation of the human designer. Furthermore, they put stress on the robotic devices, increasing the chances of failure. The exploration of the environment and of the behaviours also requires special care to preserve the robot integrity. Lastly, in-situ training does not necessarily offer decent guarantees about the controller generality, for two reasons. On the one hand, robot sensors and actuators might differ noticeably from one robot to another; the controller trained on one physical robot might present some inaccuracies when transferred to another robot. On the other hand, an autonomous robot cannot be trained under all possible experimental conditions; a change in the light conditions might later hinder the effectiveness of the robot controller.
One promising direction for handling the in-silico vs in-situ dilemma (ISIS) is to consider that the Reality Gap (RG) is unavoidable, whatever the training mode of the controller, and therefore to learn how to deal with it. Addressing the Reality Gap phenomenon assumedly requires an anticipation module to be built: the robot can hardly detect the presence of the gap if it does not have any expectation about the likely consequences of its actions. Preliminary results along this line, building a RG recovery module as a post-processor of the controller on top of an anticipation module, have been presented in Chapter 3; the feasibility of the approach has been shown in the very simple case of a parametric recovery module. A new perspective for further research is to consider RG recovery modules operating as pre-processors of the controller, repairing the sensor values provided to the (unchanged) controller. One possible approach is to bridge the gap between the training conditions and the actual environmental conditions of the robot, by normalising the distribution of the robot sensor data. In such a case, the anticipation module only builds a model of the sensor data distribution; the RG recovery module only repairs the actual sensor data in order to fit the reference distribution. Another approach, based on a more elaborate anticipation module, proceeds as follows. Whenever some discrepancy between the actual sensor values and the expected ones occurs, the idea is to determine what the sensor values of the robot should have been at time t in order to account for its sensor values at time t + 1 given the actuator values at time t. In this way, one repairs the sensor values, thus de facto adapting the controller.
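As an illustration of the first, distribution-normalising approach, the following minimal Java sketch standardises raw sensor readings and maps them back onto a reference distribution estimated in simulation. All class and method names are illustrative assumptions, not part of the RSK simulator or of the thesis code:

import java.util.Arrays;

// Minimal sketch of a Reality-Gap pre-processor normalising sensor values.
// The controller itself is left unchanged: it keeps receiving values whose
// first two moments match those of the training (in-silico) conditions.
public class SensorNormaliser {
    private final double[] refMean;  // per-sensor mean in the training conditions
    private final double[] refStd;   // per-sensor standard deviation in the training conditions
    private final double[] curMean;  // running mean in the current (in-situ) conditions
    private final double[] curVar;   // running variance in the current conditions

    public SensorNormaliser(double[] refMean, double[] refStd) {
        this.refMean = refMean.clone();
        this.refStd = refStd.clone();
        this.curMean = refMean.clone();               // start from the reference statistics
        this.curVar = new double[refStd.length];
        for (int i = 0; i < refStd.length; i++) curVar[i] = refStd[i] * refStd[i];
    }

    // Updates the running statistics with the latest raw readings (exponential averaging).
    public void observe(double[] raw, double rate) {
        for (int i = 0; i < raw.length; i++) {
            curMean[i] = (1 - rate) * curMean[i] + rate * raw[i];
            double dev2 = (raw[i] - curMean[i]) * (raw[i] - curMean[i]);
            curVar[i] = (1 - rate) * curVar[i] + rate * dev2;
        }
    }

    // Maps current readings back onto the distribution seen at training time.
    public double[] repair(double[] raw) {
        double[] repaired = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            double curStd = Math.max(1e-6, Math.sqrt(curVar[i]));
            double z = (raw[i] - curMean[i]) / curStd;
            repaired[i] = refMean[i] + z * refStd[i];
        }
        return repaired;
    }
}

Such a module only assumes that the first two moments of the sensor distribution can be tracked on-line; richer discrepancies between training and deployment would call for the second, anticipation-based repair scheme.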

Criteria for controller learning

A second issue is to provide the robot with some criterion of quality, decently encoding the desired behaviour. How critical the design of a robotic fitness function is (all the more so because of Evolutionary Opportunism) has been extensively discussed in Evolutionary Robotics [Nolfi and Floreano, 2000]; the design of reward functions in Reinforcement Learning raises similar difficulties. Specifically, the fitness/reward function defines an optimisation problem with two requirements. On the one hand, the optimisation problem should be tractable (avoiding Needle-In-The-Haystack-like fitness landscapes): the criterion should distinguish between inappropriate robotic behaviours and very inappropriate ones, so as to be conducive to the discovery of appropriate behaviours. On the other hand, the criterion should not admit trivial and undesirable solutions; typically, an undesirable way of achieving obstacle avoidance is to stay motionless.
Learning by Demonstration (LbD), i.e., show me!, is a promising alternative to the definition of a quality criterion, in the spirit of Machine Learning: instead of providing rules or criteria, the human teacher provides examples. This approach has been investigated in Chapter 4, suggesting that LbD both addresses some known difficulties and raises new ones. On the one hand, the teacher demonstrations are noisy. On the other hand, not all time steps in a demonstration are equally meaningful to the quality of the behaviour. Hence, minimising the Mean Square Error between the robot traces and the expert demonstrations is not relevant: better controllers might get a worse MSE. As the MSE criterion is not sufficient to judge the merits of the controllers learned from the teacher traces, the teacher was required to mark the controllers based on the traces of their behaviour, opening two perspectives for further research.
The first one is based on interactive optimisation [Marks et al., 2000], where the designer in the loop replaces both the fitness computation and the selection operator in the EC framework. Additionally, interactive optimisation could be coupled with active learning [Teytaud et al., 2007], allowing the learning controller to ask the oracle about the most appropriate actions in critical situations. The main limitation of interactive evolutionary optimisation [Llorà et al., 2005], however, is the stress put on the expert: interactive


optimisation hardly scales up above a few thousand evaluations. A second perspective is thus to learn the teacher's criteria along interactive optimisation, as proposed in [Llorà et al., 2005], within the Learning to Rank framework [Burges et al., 2007]. The goal is then to learn the fitness function from the teacher traces and from the interaction with the expert. This approach is in the same spirit as Inverse Reinforcement Learning [Abbeel and Ng, 2004], where the reward function is learned from the teacher traces; the difference is that the proposed approach does not require features (e.g., speed) to be specified in advance, while seamlessly involving the teacher's feedback.
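For reference, one standard pairwise Learning-to-Rank formulation (given here only as an illustration of the framework, not as the formulation that would necessarily be retained) fits a surrogate fitness s_θ over behaviour traces from the teacher's preferences "trace i preferred to trace j", by minimising the logistic pairwise loss:

    L(θ) = Σ_{(i,j) : i ≻ j}  log( 1 + exp( −( s_θ(x_i) − s_θ(x_j) ) ) ),

where x_i and x_j are behaviour traces and i ≻ j denotes that the teacher prefers trace i over trace j. The learned s_θ can then serve as a fitness function, limiting the number of direct requests to the expert.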

Memory, Setting and Controller Space

A third issue concerns the acquisition of elaborate behaviours, involving memory skills. The contributions presented in Chapter 5 rely on the so-called Tolman maze, used as a benchmark problem since the desired behaviour, reaching the third branch in a comb maze, requires some limited counting abilities due to perceptual aliasing. In order to acquire and exploit memory skills, three interdependent questions must be handled: selecting an appropriate controller search space; enforcing a tractable training approach; and designing a robust training methodology that enforces the generality of the trained controller.
Promising results have been obtained and discussed in Chapter 5. The search space of Echo State Networks, pertaining to the field of Reservoir Computing [Verstraeten et al., 2007], has been shown to encode memory skills (sufficient for the problem at hand). Meanwhile, ESNs are trained with linear complexity in the size of the network (number of neurons), as opposed to e.g. Elman or standard recurrent networks. Further, ESN design only involves two hyper-parameters: the number of neurons and the so-called damping factor. The evolutionary optimisation of ESNs was thus found to be competitive with the evolutionary non-parametric optimisation of neural nets achieved by NEAT, which involves about 30 hyper-parameters [Stanley and Miikkulainen, 2002]. Lastly, a robust methodology enforcing the generality of the solution controllers is based on averaging the solution fitness over several epochs, where each epoch considers a different maze (the lengths of the comb branches are independently drawn in each epoch). This methodology prevents the

opportunist discovery of reactive controllers, guessing a reactive path toward the desired location. In return, the fitness gathered from one epoch is a random variable, raising all the difficulties related to noisy optimisation [Arnold and Beyer, 2002, Runarsson, 2006].
This work opens several perspectives for further research. The first one is to relate more precisely the structure of the ESN topology to the dynamics of the reservoir, along the same lines as [Maier et al., 2009]. The variability of the trained controllers, based on topologies with the same connexion density and damping factors, suggests that some other parameters of the connexion graph control the memory lifespan.
The second one revisits the optimisation of noisy fitness functions, extending the Multi-Armed Bandit framework [Auer et al., 2002]. Some preliminary work along this line has been done [Hartland et al., 2007], albeit in the context of the On-line Trading Exploration vs Exploitation Pascal Challenge (this challenge, http://pascallin.ecs.soton.ac.uk/Challenges/EEC/, was won by Nicolas Baskiotis, Sylvain Gelly, Cédric Hartland, Michèle Sebag and Olivier Teytaud in 2006). The situation is as follows. In order to face the noise in the fitness values, the standard approach is to average the values recorded over several independent measures; in the considered robotic setting, the controller fitness is averaged over 16 independent epochs. While increasing the number of measures/epochs clearly decreases the variance, it (linearly) increases the computational cost. It makes sense to get a more precise estimate of the fitness for the most promising individuals in the population, while the estimate may remain less precise for the other individuals. In other words, the number of epochs devoted to each individual should be adjusted, with the goal of determining which ones, among the λ offspring, should be retained in the next population, or used to adjust the covariance matrix. Due to time limitations, this perspective could not be investigated within the PhD. Still, it is hoped that the work done to address the dynamic Exploration vs Exploitation dilemma can lead to new and effective approaches for the optimisation of noisy functions, at the core of Autonomous Robotic Control.
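To make this last perspective concrete, here is a minimal, illustrative Java sketch of a UCB1-style rule [Auer et al., 2002] for deciding which offspring receives the next evaluation epoch. It is only meant to show the shape of such an allocation scheme, not a validated algorithm, and it is not part of the thesis software; the noisy fitness call is a placeholder:

import java.util.Random;

// Illustrative UCB1-style allocation of extra evaluation epochs among lambda offspring.
// fitnessOf(i) is assumed to return one noisy, to-be-minimised fitness sample.
public class EpochAllocator {

    public static double[] allocate(int lambda, int budget, Random rng) {
        double[] sum = new double[lambda];   // sum of observed fitness samples
        int[] count = new int[lambda];       // number of epochs spent on each offspring

        // Initialisation: one epoch per offspring.
        for (int i = 0; i < lambda; i++) { sum[i] += fitnessOf(i, rng); count[i] = 1; }

        for (int t = lambda; t < budget; t++) {
            int pick = 0;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < lambda; i++) {
                // Fitness is minimised, so the bandit "reward" is its negation.
                double ucb = -sum[i] / count[i] + Math.sqrt(2.0 * Math.log(t) / count[i]);
                if (ucb > best) { best = ucb; pick = i; }
            }
            sum[pick] += fitnessOf(pick, rng);  // spend one more epoch on the chosen offspring
            count[pick]++;
        }

        double[] estimate = new double[lambda];
        for (int i = 0; i < lambda; i++) estimate[i] = sum[i] / count[i];
        return estimate;                        // sharper estimates for the promising offspring
    }

    // Placeholder noisy fitness; a real implementation would run one epoch in the simulator.
    private static double fitnessOf(int i, Random rng) { return i + rng.nextGaussian(); }
}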




Appendix A

Software

A.1 The simulator

The experiments presented in this dissertation rely on a physical robot as well as on simulation. Simulation allows quick controller evaluations and experiments that would be time consuming on real robots (e.g. Evolutionary Robotics). In simulation, one can also optimise or change the robot physiology, or perform experiments involving countless robots (e.g. Swarm Robotics). The choice of the simulator was made early on. The Simbad simulator, written in Java, was first considered, as part of a whole package of libraries containing a neural network implementation as well as evolution software [Hugues and Bredeche, 2006]. Due to limitations faced during the experiments, a new simulator was developed. All experiments previously made on Simbad were re-done using the new simulator, for both validation of the simulator and consistency of the experiments.
The new simulator, the Robot Simulation Kit (RSK), strongly inspired from Simbad (Fig. A.1), is implemented in Java 1.6 and can be used on any computer running a Sun Java JRE 1.6. The simulator is similar to Simbad and implements a similar structure: the user can focus on implementing an Environment and a Robot class to perform specific experiments. Since many Simbad mechanisms rely on Java3D, an extension to Java, several difficulties arose. Up to recent updates, Java3D suffered from a memory leak occasionally causing Simbad to crash. The simulator could not be launched efficiently on computer clusters since it requires a graphical


server to run, even when no Graphical User Interface (GUI) is used. It also requires graphical acceleration for a smooth run, which is not necessarily available, resulting in poor performance. Due to the specific implementation, it proved rather difficult to move, resize and remove items from the environment without causing crashes, although this was an important requirement for the experiments presented in this thesis. A few more apparent bugs caused erratic behaviours within the simulator. These reasons led to the need for new simulation material, similar to Simbad, keeping track of previously made experiments.

Figure A.1: RSK: Robot Simulation Kit software used in the simulation experiments.

RSK is a lightweight robot simulator, proposing a 2D implementation and thus freeing the code from Java3D. Since no 3D computation occurs, the simulator runs fluently on slower computers, without requiring a graphical display, graphical acceleration or specific extra Java libraries. The code has been carefully tested and validated for kinematics, sensors and collision detection. Noise sampled on two Khepera robots has been used to model Gaussian sensor noise in simulation. A keyboard interface has been implemented,


used by a specific human-controlled robot: bots.KeyboardKhepera. The controls given by the teacher are stored, along with the robot sensor values, in a file. One should note, however, that this simulator suffers limitations compared to Simbad, in the sense that extra robotic components may not be used: neither cameras nor light sources are implemented at the time being.
As in Simbad, RSK may be launched in two modes:
• standard mode: the simulator is used through a graphical application, proposing a graphical display of the environment and robots, a control panel, a terminal window, and a log trace window. The environment window displays the current state of the world. The control panel allows one to reset, play, pause and step through the simulation. The speed of the simulation can also be changed. Screenshots of the environment and of the robot's behaviour log may be taken using the dedicated menu. The GUI workspace organisation can be configured and stored from one run to another (Fig. A.2).
• batch mode: experiments such as evolutionary computing require specific code to access the simulator core. This mode addresses the simulation without going through the GUI, and is useful when performing evolutionary robotics experiments on computer clusters. A limited graphical window showing the environment may, or may not, be displayed; this last feature allows one to check quickly whether an experiment seems to produce results or not.
The architecture of the simulator is rather simple. On the one hand, a simulation core, in package sim, contains the code required to perform the simulation: sensor updates, movement updates and collision detection. On the other hand, a graphical user interface exists in package gui; this interface uses the simulator core in package sim. The simulator provides the functions to manipulate the world, i.e. the environment containing obstacles and robots. A simulation step consists in updating the position and status of the robots by calling their function step(). Any Artificial Intelligence controller should be implemented within the function step(), inside a robot class inheriting from either sim.Agent or sim.KheperaII.
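As mentioned above, the sensor noise model is Gaussian, fitted on readings sampled from two physical Khepera robots. A minimal sketch of how such noise can be applied to simulated sensor values is given below; it is illustrative only, and the class, method names and the normalised [0, 1] sensor range are assumptions, not the actual RSK code:

import java.util.Random;

// Illustrative sensor-noise model: Gaussian noise whose per-sensor standard
// deviation was estimated from readings of physical Khepera robots.
public class NoisySensors {
    private final Random rng = new Random();
    private final double[] sigma;   // per-sensor standard deviation estimated on the real robots

    public NoisySensors(double[] sigma) { this.sigma = sigma.clone(); }

    // Returns noisy copies of the ideal simulated readings,
    // clamped to an assumed normalised range [0, 1].
    public double[] perturb(double[] ideal) {
        double[] noisy = new double[ideal.length];
        for (int i = 0; i < ideal.length; i++) {
            noisy[i] = ideal[i] + sigma[i] * rng.nextGaussian();
            noisy[i] = Math.min(1.0, Math.max(0.0, noisy[i]));
        }
        return noisy;
    }
}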

A.1.1 Creating an environment

The first step in setting up an experiment consists in creating an environment class. An environment is a bounded world representation, containing obstacles and robots.


Figure A.2: A few GUI configurations in standard mode.

The construction of an environment may be driven by an XML file containing construction rules for the elements to add to the environment, as well as their properties; elements may also be added directly in code. The class sim.Environment provides constructors for both approaches. The default constructor creates an empty environment:

Environment env = new Environment();

Once the Environment instance is created, items may be added to it by calling the add() function, which applies to both obstacles and robots. The sim.Box class provides a default generic rectangular obstacle. The construction of a Box requires two 2D coordinates (x1, y1) and (x2, y2), representing respectively the upper-left and the lower-right corner, and a rotation r. The box is constructed and then added to the environment:

Box box = new Box(x1, y1, x2, y2, r);
env.add(box);

which is equivalent to:

env.add(new Box(x1, y1, x2, y2, r));


A few default environments are proposed in package env, such as env.ClosedEnvironment, which encloses the environment with walls. Many environments were tested, but only a few are used throughout the thesis. Consider the example of the environment mainly used in the thesis (Fig. A.3) (Chapters 3 and 4):

Environment env = new Environment();
env.add(new Box(0,0,20,0.2,0));
env.add(new Box(0,0,0.2,20,0));
env.add(new Box(0,20,20,19.8,0));
env.add(new Box(20,0,19.8,20,0));
env.add(new Box(0,18,2,20,0));
env.add(new Box(1,19,3,20,0));
env.add(new Box(16,16,20,20,0));
env.add(new Box(15,19,16,20,0));
env.add(new Box(8,12.5,13,14,0));
env.add(new Box(8,12,16.5,12.5,0));
env.add(new Box(8,0,8.5,12,0));
env.add(new Box(16,4,16.5,12,0));
env.add(new Box(8,12,16.5,12.5,0));
env.add(new Box(11,3.5,16.5,5,0));
env.add(new Box(8,7.5,13.5,9,0));
env.add(new Box(0,10,2,10.5,0));
env.add(new Box(6,10,8,10.5,0));

Figure A.3: Environment Java code (left), environment display (right).

A.1.2 Creating an agent

Together with the environment, one simply needs to create a robot class. A robot is a class inheriting from the generic abstract class sim.Agent or from the more specific class sim.KheperaII. The latter represents an agent with the Khepera II robot properties: size, weight, control kinematics and sensors. Aside from the constructor, two functions need to be implemented: step() and initAgent(). The function initAgent() is called upon reset to (re)initialise any needed parameter. The step() function contains the robot control algorithm and is called at each simulation step: sensor values are read, and the control is decided according to the sensors and possibly internal states. The control outputs are set by calling the setSpeeds() function. Once all agent step() functions have returned, the simulator updates the state of the world and performs the movements if possible. The default agent constructor has 4 parameters: initial coordinates x and


y, initial orientation r in radians, and a name. The creation of a Khepera robot controller may follow the guide given in Figure A.4.

public class DemoKhepera extends KheperaII {

    public DemoKhepera(double x, double y, double r, String name) {
        super(x, y, r, name);
    }

    public void initAgent() {
    }

    public void step() {
        double[] sensors = new double[8];
        if (!collisionDetected()) {
            readProximitySensors(sensors);
            // compute the control depending on the sensors
            double left = 0, right = 0;
            setSpeeds(left, right);
        } else {
            setSpeeds(0, 0);
        }
    }
}

Figure A.4: Agent class creation.

The sensor values are obtained by calling the readProximitySensors() method, and the control values are set via setSpeeds(). Since the robot considered is a Khepera, the outputs produced are the left and right motor activations, referred to as differential kinematics: each motor is activated independently. Once created, agents can be added to the environment just like boxes:

DemoKhepera robot = new DemoKhepera(x, y, r, name);
env.add(robot);


A.1.3 Launching simulation

Once the environment MyEnvironment and the agent(s) DemoKhepera are created, the simulator can be used in two ways: batch or standard mode. The batch mode addresses the simulator directly and performs simulation steps at will; this is useful when running an evolutionary process to optimise the controller. In standard mode, the simulator is launched through the GUI (Fig. A.5).

public static void main(String[] args) {
    MyEnvironment env = new MyEnvironment();
    env.add(new DemoKhepera(10.5, 10.5, 0, "robot"));
    SwingUtilities.invokeLater(new KsK(env));
}

Figure A.5: Running a simulation with one DemoKhepera agent within MyEnvironment.

A.2 Running an evolutionary process

Let us consider that both an environment and an agent class have been implemented. The controller coded within the agent class, be it a neural network or, more generally, a control model, needs to be optimised according to a fitness function. Without entering into the details of any specific algorithm or optimisation library, let us consider that a fitness function is defined as a class containing an evaluation function. This class contains everything needed to evaluate the genotypes (controller parameters) provided by the optimisation algorithm: the batch simulator, the environment and the agent. In the evaluation function, the controller parameters are set to the genotype given as input. The robot then behaves within its environment according to its controller and the given parameters; this happens by repeatedly calling for a simulation step until a stopping criterion is reached. Such a criterion may be a time limit, a collision, a success, or an evident failure to solve the task. The fitness value, computed according to the behaviour, is then returned. Chapter 3 gives an example of an evolutionary robotics experiment involving the optimisation of neural networks through a CMA-ES implementation (Chapter 2).
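As an illustration, a minimal sketch of such a fitness class is given below. It only shows the overall structure; the genotype decoding, the environment seeding, and the epoch criterion (MyEnvironment(long), setControllerParameters(), targetReached(), distanceToTarget(), env.step()) are placeholders, and the actual RSK and optimiser interfaces may differ:

import java.util.Random;

// Minimal sketch of a fitness class used by a batch-mode evolutionary process.
// Structure only: names marked in the text above are illustrative assumptions.
public class NavigationFitness {
    private static final int MAX_STEPS = 500;   // assumed time limit per epoch
    private static final int K_EPOCHS = 16;     // epochs averaged to smooth the noisy fitness
    private final Random rng = new Random();

    // Evaluates one genotype (controller parameters) and returns its fitness (to be minimised).
    public double evaluate(double[] genotype) {
        double total = 0.0;
        for (int epoch = 0; epoch < K_EPOCHS; epoch++) {
            MyEnvironment env = new MyEnvironment(rng.nextLong());   // randomised maze for this epoch
            DemoKhepera robot = new DemoKhepera(10.5, 10.5, 0, "robot");
            robot.setControllerParameters(genotype);                 // decode genotype into the controller
            env.add(robot);

            int step = 0;
            while (step < MAX_STEPS && !robot.collisionDetected() && !env.targetReached(robot)) {
                env.step();                                          // one batch-mode simulation step
                step++;
            }
            total += env.distanceToTarget(robot);                    // epoch fitness (placeholder criterion)
        }
        return total / K_EPOCHS;                                     // fitness averaged over the K epochs
    }
}

The evaluate() method is the single entry point the optimiser needs; averaging over K randomised epochs is what smooths the fitness landscape, as discussed in Chapter 5.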


A.3 Khepera

Together with simulation, real Khepera II robots were used to perform the experiments presented in this thesis. The Khepera II is a small, holonomous, round robot equipped with infra-red sensors around its body (Fig. A.6). Holonomous refers to the capability of the robot to rotate while staying at the same location.

Figure A.6: Khepera II robot and sensor positions around the robot. Sensor 0 is the left sensor.

The sensors available on the Khepera robot are oriented toward the front of the robot, with 3 pairs of infra-red sensors on the front, front-left and front-right, and 2 extra sensors on the rear. Each sensor allows both distance computation to nearby obstacles and light detection. These sensors are non-linear and remain highly sensitive to external light conditions. On top of the robot, several modules may be added, such as a gripper, a linear camera or a video camera, to cite a few. In the presented experiments, only a wireless video camera was used, for demonstration learning. This robotic platform has been widely used in the robotic control literature and especially in evolutionary robotics [Nolfi and Floreano, 2000].
A Java interface from the computer to the robot over a serial connexion was developed, based on a previous version by Thomas Delquie, using the javax.comm extra library for serial communication on both Windows and Linux systems. This implementation is available as a Java package named khepera. A serial communication port was coded, and a class containing all functions available on the physical robot was implemented. Class modules were also created to make use of Khepera plug-ins such as the gripper or the video camera (linear and 2D). The Agent class coded in simulation can be mapped directly, using a simple simulation Thread that replaces the simulated


Khepera by the actual physical Khepera. Its use is rather straightforward.




Index Actuator, 7 Adaptive, 8 Anticipation, 47 Apprenticeship Learning, 16 Artificial Neural Networks, 18 Autonomous robot, 7 Behaviour, 7 Bias, 19 Classification, 13 Control Policy, see Controller Controller, 7 crossover, 35 Deliberative control, 8 Dimension, 12 Echo State Network, 26, 97 ESN, 26 Epoch, 32 Evaluation, 30, 32 Evolution Strategies, 36 Evolutionary Computation, 28 Evolutionary Robotics, 39 Exploitation, 29 Exploration, 29 fitness function, 30 Gene, 30 Generalisation, 12 Generation, 30 Genotype, 30

Global Optimum, 29 In-silico, see Simulation In-situ, 43 Independently and Identically Distributed, 11 iid, 11 Individual, 30 Inverse Reinforcement Learning, 15 Learning by Demonstration, 16 Learning Classifier System, 40, 95 LCS, 95 Machine Learning, 10 Markov Decision Process, 95 MDP, 95 Partially Observable Markov Decision Process, 95 POMDP, 95 Mobile Robot, 7 mutation, 35 Neural Network, 19 Elman Network, 21, 22 feed-forward, 21 Hopfield Network, 21, 22 Recurrent, see Recurrent Neural Network Neuro-Evolution of Augmenting Topologies, 97 NEAT, 97 Neuron, 19 145


hidden neuron, 19 input neuron, 19 output neuron, 19 Optimisation, 29 Overfitting, 12 Perception, 7 Perceptual Aliasing, 8, 77 Phenotype, 30 Population, 30 Quasi-Random, 34 Random, 29, 34 Reactive Control, 8 Reality gap, 43 Recurrent Neural Network, 21 Regression, 13 Reinforcement Learning, 14 Robot, 7 Selection, 31 Sensor, 7 Simulation, 43 Solution space, see Search space Stochastic, see Random Stochastic Optimisation, 28 Tolman Maze, 93 Tournament Selection, 33 Wheel Selection, 32


Bibliography [Abbeel, 2008] Abbeel, P. (2008). Apprenticeship Learning and Reinforcement Learning with Application to Robotic Control. PhD thesis, Stanford University. [Abbeel et al., 2007] Abbeel, P., Coates, A., Quigley, M., and Ng, A. Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. In Scholkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 1–8. MIT Press, Cambridge, MA. [Abbeel and Ng, 2004] Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In In Proc. ICML. ACM Press. [Antonelo et al., 2007] Antonelo, E., Schrauwen, B., Dutoit, X., Stroobandt, D., and Nuttin, M. (2007). Event detection and localization in mobile robot navigation using reservoir computing. In Proceedings of the International Conference on Artificial Neural Networks, Porto. Portugal. [Arkin, 1998] Arkin, R. C. (1998). A Behavior-based Robotics. MIT Press, Cambridge, MA, USA. [Arnold and Beyer, 2002] Arnold, D. V. and Beyer, H.-G. (2002). Noisy Local Optimization with Evolution Strategies. Kluwer Academic Publishers, Norwell, MA, USA. [Atkeson and Schaal, 1997] Atkeson, C. G. and Schaal, S. (1997). Robot learning from demonstration. In Proc. 14th International Conference on Machine Learning, pages 12–20. Morgan Kaufmann. [Auer et al., 2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finitetime analysis of the multiarmed bandit problem. Machine Learning, 47(23):235–256. 147


[Auer et al., 2006] Auer, P., Cesa-Bianchi, N., Hussain, Z., Newnham, L., and Shawe-Taylor, J., editors (2006). NIPS 2006 Workshop on On-line Trading of Exploration and Exploitation. [Auger and Hansen, 2005] Auger, A. and Hansen, N. (2005). A restart cma evolution strategy with increasing population size. In In Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2005, pages pp.1769– 1776. [Bailey and Durrant-Whyte, 2006] Bailey and Durrant-Whyte (2006). Simultaneous localisation and mapping (slam): Part ii. In Robotics and Automation Magazine, pages 108–117. [Baldassarre et al., 2006] Baldassarre, G., Parisi, D., and Nolfi, S. (Summer 2006). Distributed coordination of simulated robots based on selforganisation. Artificial Life, 12(3):289–311. [Baum and Petrie, 1966] Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state markov chains. The Annals of Mathematical Statistics. [Bellman, 1957] Bellman, R. (1957). A Markov decision process. Journal of Mathematical Mechanics, 6:679–684. [Billard and Siegwart, 2004] Billard, A. and Siegwart, R. (2004). Special issue on robot learning from demonstration. Robotics And Autonomous Systems, 47(2-3):65–67. [Boeing et al., 2004] Boeing, A., Hanham, S., and Braunl, T. (2004). Evolving autonomous biped control from simulation to reality. In International Conference on Autonomous Robots and Agents, ICARA 04, pages 440–445. [Bongard and Lipson, 2004] Bongard, J. C. and Lipson, H. (2004). Automated robot function recovery after unanticipated failure or environmental change using a minimum of hardware trials. [Boucher and Dominey, 2006] Boucher, J.-D. and Dominey, P. F. (2006). Perceptual-motor sequence learning via human-robot interaction. In SAB, pages 224–235. [Breiman, 1994] Breiman, L. (1994). Bagging predictors. Technical report, Department of Statistics, University of California, Berkeley. [Breiman et al., 1984] Breiman, L. et al. (1984). Classification and Regression Trees. Chapman & Hall, New York. 148


[Brooks, 1986] Brooks, R. (1986). A robust layered control system for a mobile robot. Robotics and Automation, IEEE Journal of [legacy, pre 1988], 2(1):14–23. [Burges et al., 2007] Burges, C. J., Ragno, R., and Le, Q. V. (2007). Learning to rank with nonsmooth cost functions. In Scholkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 193–200. MIT Press, Cambridge, MA. [Butz et al., 2005] Butz, M. V., Goldberg, D. E., and Lanzi, P. L. (2005). Gradient descent methods in learning classifier systems: improving xcs performance in multistep problems. IEEE Trans. Evolutionary Computation, 9(5):452–473. [Butz et al., 2007] Butz, M. V., Sigaud, O., Pezzulo, G., and Baldassarre, G. (2007). Anticipations, brains, individual and social behavior: An introduction to anticipatory systems. pages 1–18. [B¨ack et al., 1991] B¨ack, T., Hoffmeister, F., and Schwefel, H.-P. (1991). A survey of evolution strategies. In Proceedings of the Fourth International Conference on Genetic Algorithms, pages 2–9. Morgan Kaufmann. [Calinon and Billard, 2007] Calinon, S. and Billard, A. (2007). Learning of gestures by imitation in a humanoid robot. In Imitation and Social Learning in Robots, Humans and Animals: Behavioural, Social and Communicative Dimensions, pages 153–177. Cambridge University Press, K. Dautenhahn and C.L. Nehaniv edition. [Capi and Doya, 2005] Capi, G. and Doya, K. (2005). Evolution of neural architecture fitting environmental dynamics. Adaptive Behavior, 13:53–66. [Chellapilla and Fogel, 1999] Chellapilla, K. and Fogel, D. B. (1999). Fitness distributions in evolutionary computation: motivation and examples in the continuous domain. Biosystems, 54:15–29. [Cramer, 1985] Cramer, N. L. (1985). A representation for the adaptive generation of simple sequential programs. In Grefenstette, J. J., editor, Proceedings of an International Conference on Genetic Algorithms and the Applications, pages 183–187, Carnegie-Mellon University, Pittsburgh, PA, USA. [Crammer and Chechik, 2004] Crammer, K. and Chechik, G. (2004). A needle in a haystack: Local one-class optimization. In In In Proc. ICML. 149


[Devert et al., 2007] Devert, A., Bredeche, N., and Schoenauer, M. (2007). Unsupervised learning of echo state networks: A case study in artificial embryogeny. In Artificial Evolution, pages 278–290. [Dillmann, 2003] Dillmann, R. (2003). Teaching and learning of robot tasks via observation of human performance. In Proceedings of the IROS-2003 Workshop on Robot Learning by Demonstration, pages 5–10. [Dillmann et al., 1999] Dillmann, R., Rogalla, O., Ehrenmann, M., Z¨ollner, R., and Bordegoni, M. (1999). Learning robot behaviour and skills based on human demonstration and advice: the machine learning paradigm. In 9th International Symposium of Robotics Research (ISRR ’99), pages 229– 238. [Driankov et al., 1996] Driankov, D., Hellendoorn, H., and Reinfrank, M. (1996). An introduction to fuzzy control (2nd ed.). Springer-Verlag, London, UK. [Eiben and Smith, 2003] Eiben, A. E. and Smith, J. E. (2003). Introduction to Evolutionary Computing (Natural Computing Series). Springer. [Elman, 1990a] Elman, J. (1990a). Finding structure in time. In Cognitive Science, volume 14, pages 179–211. [Elman, 1990b] Elman, J. L. (1990b). Finding structure in time. Cognitive Science, 14(2):179–211. [Floreano and Mondada, 1994] Floreano, D. and Mondada, F. (1994). Automatic creation of an autonomous agent: Genetic evolution of a neuralnetwork driven robot. In In, pages 421–430. MIT Press. [Fogel et al., 1966] Fogel, L. J., Owens, A. J., and Walsh, M. J. (1966). Artificial Intelligence through Simulated Evolution. John Wiley, New York, USA. [Functions et al., 1997] Functions, O. T. R., Oyman, A. I., Beyer, H.-G., and Schwefel, H.-P. (1997). Convergence behavior of the (1,+ lambda) evolution strategy on the ridge functions. [Girgin et al., 2008] Girgin, S., Loth, M., Munos, R., Preux, P., and Ryabko, D., editors (2008). Recent Advances in Reinforcement Learning, 8th European Workshop, EWRL 2008, Villeneuve d’Ascq, France, June 30 - July 3, 2008, Revised and Selected Papers, volume 5323 of Lecture Notes in Computer Science. Springer. 150


[Glover, 1989] Glover, F. (1989). Tabu search – Part I. ORSA J. on Computing, 1:190–206. [Godzik, 2005] Godzik, N. (2005). Une approche ´evolutionnaire de la robotique modulaire et anticipative. PhD thesis, Universit´e de Paris Sud – Orsay. [Godzik et al., 2004] Godzik, N., Schoenauer, M., and Sebag, M. (2004). Robotics and multi-agent systems robustness in the long run: Autoteaching vs anticipation in evolutionary robotics. In PPSN, pages 932–941. [Goldberg, 1989] Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. [Goldberg and Richardson, 1987] Goldberg, D. E. and Richardson, J. (1987). Genetic algorithms with sharing for multimodalfunction optimization. In ICGA, pages 41–49. [Gruau, 1994] Gruau, F. (1994). Automatic definition of modular neural networks. Adapt. Behav., 3(2):151–183. [Hansen and Ostermeier, 1995] Hansen, N., A. G. and Ostermeier, A. (1995). Sizing the population with respect to the local progress in (1, λ)-evolution strategies – a theoretical analysis. In In 1995 IEEE International Conference on Evolutionary Computation Proceedings, pages pp. 80–85. [Hansen and Ostermeier, 1996] Hansen, N. and Ostermeier, A. (1996). Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In In Proceedings of the 1996 IEEE International Conference on Evolutionary Computation, pages 312–317. [Hansen and Ostermeier, 2001] Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195. [Hart et al., 1968] Hart, P. E., Nilsson, N. J., and Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. Systems Science and Cybernetics, IEEE Transactions on, 4(2):100–107. [Hartland et al., 2006] Hartland, C., Baskiotis, N., Gelly, S., Teytaud, O., and Sebag, M. (2006). Multi-armed bandit, dynamic environments and meta-bandits. In Online Trading of Exploration and Exploitation Workshop, NIPS, Whistler, Canada. 151


[Hartland et al., 2007] Hartland, C., Baskiotis, N., Gelly, S., Teytaud, O., and Sebag, M. (2007). Change point detection and meta-bandits for online learning in dynamic environments. In Conf´erence francophone sur l’apprentissage automatique (CAp07), Whistler, Canada. Association Fran¸caise pour l’Intelligence Artificielle. [Hartland and Bredeche, 2007] Hartland, C. and Bredeche, N. (2007). Using echo state networks for robot navigation behavior acquisition. In IEEE International Conference on Robotics and Biomimetics (ROBIO07), pages 201–206, Sanya, China. IEEE Computer Society Press. [Harvey and Husbands, 1992] Harvey, I. and Husbands, P. (1992). Evolutionary robotics. Proceedings of IEE Colloquium on Genetic Algorithms for Control Systems Engineering, pages 1–4. [Hebb, 1949] Hebb, D. O. (1949). The Organization of Behavior: A Neuropsychological Theory. Wiley, New York. [Hecht-Nielsen, 1989] Hecht-Nielsen, R. (1989). Neuro-computing. Addison Wesley. [Herdy, 1996] Herdy, M. (1996). Evolution strategies with subjective selection. In PPSN, pages 22–31. [Hochreiter and Schmidhuber, 1995] Hochreiter, S. and Schmidhuber, K. (1995). Long short-term memory. Technical report, Fakultat fur Informatik, Technische Universitat Munchen. [Holland, 1975] Holland, J. H. (1975). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. University of Michigan Press. [Hopfield, 1982] Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. NAS, 79:2554–2558. [Hugues, 2002] Hugues, L. (2002). Apprentissage de comportements pour un robot autonome. PhD thesis, Universit´e Pierre et Marie Curie - Paris 6. [Hugues and Bredeche, 2006] Hugues, L. and Bredeche, N. (2006). Simbad : an autonomous robot simulation package for education and research. In Proceedings of The Ninth International Conference on the Simulation of Adaptive Behavior (SAB’06), pages 831–842., Rome, Italy. Published in Springer’s Lecture Notes in Computer Sciences / Artificial Intelligence series (LNCS/LNAI) n.4095. 152


[Hugues and Drogoul, 2002] Hugues, L. and Drogoul, A. (2002). Synthesis of robot’s behavior from few examples. In IEEE/RSJ International Conference on Intelligent Robots and Sytems, IROS’02. [Jaeger, 2001] Jaeger, H. (2001). The echo state approach to analysing and training recurrent neural networks. In Technical report GMD report 148. German National Research Center for Information Technology. [Jaeger, 2002] Jaeger, H. (2002). A tutorial on training recurrent neural networks. covering bptt, rtrl, ekf, and the echo state network approach. In GMD report 159. German National Research Center for Information Technology. [Jaeger et al., 2006] Jaeger, H., Maass, W., and Pr´ıncipe, J. C., editors (2006). NIPS 2006 Workshop on Echo State Networks and Liquid State Machines. http://www.esn-lsm.tugraz.at/. [Jaeger et al., 2007] Jaeger, H., Maass, W., and Pr´ıncipe, J. C. (2007). Special issue on echo state networks and liquid state machines. Neural Networks, 20(3):287–289. [Jakobi, 1998] Jakobi, N. (1998). Running across the reality gap: octopod locomotion evolved in a minimal simulation. In P. Husbands, J. M., editor, Evolutionary Robotics: First European Workshop, EvoRobot98, pages 39– 58. Springer-Verlag. [Jakobi et al., 1995] Jakobi, N., Husbands, P., and Harvey, I. (1995). Noise and the reality gap : the use of simulation in evolutionnary robotics. [Jenkins and Matari´c, 2004] Jenkins, O. C. and Matari´c, M. J. (2004). A spatio-temporal extension to isomap nonlinear dimension reduction. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning, page 56, New York, NY, USA. ACM Press. [Jiang et al., 2008] Jiang, F., Berry, H., and Schoenauer, M. (2008). Supervised and evolutionary learning of echo state networks. In PPSN X, 10th International Conference on Parallel Problem Solving from Nature, pages 215–224. [Katagami and Yamada, 2001] Katagami, D. and Yamada, S. (2001). Real robot learning with human teaching. In The4-th Japan-Australia Joint Workshop on Intelligent and Evolutionary Systems, pages 263–270. 153


[Kimura and Matsumura, 2005] Kimura, S. and Matsumura, K. (2005). Genetic algorithms using low-discrepancy sequences. In GECCO ’05: Proceedings of the 2005 conference on Genetic and evolutionary computation, pages 1341–1346, New York, NY, USA. ACM. [Kirkpatrick et al., 1983] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science. [K.O. Stanley, 2003] K.O. Stanley, B.D. Bryant, R. M. (2003). Evolving adaptive neural networks with and without adaptive synapses. Evolutionary Computation, 4:2557–2564. [Kolter et al., 2008] Kolter, J. Z., Abbeel, P., and Ng, A. (2008). Hierarchical apprenticeship learning with application to quadruped locomotion. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA. [Koza, 1992] Koza, J. R. (1992). Genetic programming: on the programming of computers by means of natural selection. The MIT Press, Cambridge, MA. [Lanzi, 1998a] Lanzi, P. L. (1998a). Adding Memory to XCS. In Proceedings of the IEEE Conference on Evolutionary Computation (ICEC98). IEEE Press. [Lanzi, 1998b] Lanzi, P. L. (1998b). An analysis of the memory mechanism of XCSM. In Koza, J. R., Banzhaf, W., Chellapilla, K., Deb, K., Dorigo, M., Fogel, D. B., Garzon, M. H., Goldberg, D. E., Iba, H., and Riolo, R., editors, Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 643–651, University of Wisconsin, Madison, Wisconsin, USA. Morgan Kaufmann. [Laumond, 1998] Laumond, J.-P. P. (1998). Robot Motion Planning and Control. Springer-Verlag New York, Inc., Secaucus, NJ, USA. [LeCun, 1985] LeCun, Y. (1985). Une proc´edure d’apprentissage pour r´eseau a seuil asymmetrique (a learning scheme for asymmetric threshold networks). In Proceedings of Cognitiva 85, pages 599–604, Paris, France. [LeCun et al., 1990] LeCun, Y., Denker, J. S., and Solla, S. A. (1990). Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605. Morgan Kaufmann. 154


[LeCun et al., 2004] LeCun, Y., Huang, F.-J., and Bottou, L. (2004). Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of CVPR’04. IEEE Press. [LeCun et al., 2005] LeCun, Y., Muller, U., Ben, J., Cosatto, E., and Flepp, B. (2005). Off-road obstacle avoidance through end-to-end learning. In NIPS. [Lehman and Stanley, 2008] Lehman, J. and Stanley, K. O. (2008). Exploiting open-endedness to solve problems through the search for novelty. In Proceedings of the Eleventh International Conference on Artificial Life (ALIFE XI), Cambridge, MA. MIT Press. [Lipson et al., 2006] Lipson, H., Bongard, J. C., Zykov, V., and Malone, E. (2006). Evolutionary robotics for legged machines: From simulation to physical reality. In IAS, pages 11–18. [Llor`a et al., 2005] Llor`a, X., Sastry, K., Goldberg, D. E., Gupta, A., and Lakshmi, L. (2005). Combating user fatigue in igas: partial ordering, support vector machines, and synthetic fitness. In GECCO, pages 1363– 1370. [Lund and Miglino, 1996] Lund, H. H. and Miglino, O. (1996). From simulated to real robots. In International Conference on Evolutionary Computation, pages 362–365. [Maass, 1996] Maass, W. (1996). Networks of spiking neurons: The third generation of neural network models. In Bartlett, P., Burkitt, A., and Williamson, R., editors, Australian Conference on Neural Networks, pages 1–10. Australian National University. [Maass et al., 2002] Maass, W., Natschl¨ager, T., and Markram, H. (2002). Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Comput., 14(11):2531–2560. [Maier et al., 2009] Maier, M., Luxburg, U. V., and Hein, M. (2009). Influence of graph construction on graph-based clustering measures. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems (NIPS). [Marks et al., 2000] Marks, J., Mirtich, B., Ratajczak, D., Ryall, K., Anderson, D., Anderson, E., and Lesh, N. (2000). Human-guided simple search. In In Proc. of AAAI 2000, pages 209–216. AAAI Press. 155

BIBLIOGRAPHY

[Mataric, 1997] Mataric, M. J. (1997). Reinforcement learning in the multirobot domain. Autonomous Robots, 4(1):73–83. [McCulloch and Pitts, 1943] McCulloch, W. and Pitts, H. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, pages 115–133. [Meeden, 1998] Meeden, L. (1998). Bridging the gap between robot simulations and reality with improved models of sensor noise. In Koza, J. R., Banzhaf, W., Chellapilla, K., Deb, K., Dorigo, M., Fogel, D. B., Garzon, M. H., Goldberg, D. E., Iba, H., and Riolo, R., editors, Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 824–831, University of Wisconsin, Madison, Wisconsin, USA. Morgan Kaufmann. [Meuleau and Brafman, 2007] Meuleau, N. and Brafman, R. I. (2007). Hierarchical heuristic forward search in stochastic domains. In Veloso, M. M., editor, IJCAI, pages 2542–2549. [Michalewicz, 1994] Michalewicz, Z. (1994). Genetic Algorithms Plus Data Structures Equals Evolution Programs. Springer-Verlag New York, Inc., Secaucus, NJ, USA. [Michalewicz, 1996] Michalewicz, Z. (1996). Genetic algorithms + data structures = evolution programs (3rd ed.). Springer-Verlag, London, UK. [Miglino et al., 1995] Miglino, O., Lund, H. H., and Nolfi, S. (1995). Evolving mobile robots in simulated and real environments. Artificial Life, 2(4):417– 434. [Minsky and Papert, 1969] Minsky, M. and Papert, S. (1969). Perceptrons: An introduction to computational geometry. The MIT Press, Cambridge. [Mitchell, 1997] Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. [Miura and Nishimura, 2007] Miura, J. and Nishimura, Y. (2007). Codevelopment of task models through robot-human interaction. In IEEE International Conference on Robotics and Biomimetics (ROBIO07), pages 640–645, Sanya, China. IEEE Computer Society Press. [Nakamura and Hashimoto, 2007] Nakamura, S. and Hashimoto, S. (2007). Hybrid learning strategy to solve pendulum swing-up problem for real hardware. In IEEE International Conference on Robotics and Biomimetics (ROBIO07), pages 1972–1977, Sanya, China. IEEE Computer Society Press. 156

BIBLIOGRAPHY

[Nehmzow, 2001] Nehmzow, U. (2001). Quantitative analysis of robotenvironment interaction - on the difference between simulations and the real thing. In Eurobot. [Nelson et al., 2003] Nelson, A., Grant, E., Barlow, G., and White, M. (2003). Evolution of autonomous robot behaviors using relative competitive fitness. In 2003 IEEE International Conference on Integration of Knowledge Intensive Multi-Agent Systems (KIMAS’03) Modeling, Exploration, and Engineering Systems, pages 145–150, Boston MA. [Niederreiter, 1992] Niederreiter, H. (1992). Random number generation and quasi-monte carlo methods. In SIAM. [Nolfi and Floreano, 2000] Nolfi, S. and Floreano, D. (2000). Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. Cambridge, MA: MIT Press/Bradford Books. [Nolfi and Parisi, 1993] Nolfi, S. and Parisi, D. (1993). Auto-teaching networks that develop their own teaching input. In Deneubourg, J., Bersini, H., Goss, S., Nicolis, G., and Dagonnier, R., editors, Proceedings of the Second European Conference on Artificial Life, Brussels, Free University of Brussels. [Nolfi et al., 1994] Nolfi, S., Psicologia, I. D., Elman, J. L., and Parisi, D. (1994). Learning and evolution in neural networks. Adaptive Behavior, 3:5–28. [Pearl, 1985] Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning. In 7th Conference of the Cognitive Science Society, University of California, Irvine, CA, pages 329–334. [Puterman, 1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience. [Rechenberg, 1965] Rechenberg, I. (1965). Cybernetic solution path of an experimental problem. Royal Aircraft Establishment, translation No. 1122, Ministry of Aviation, Farnborough Hants, UK. [Rechenberg, 1971] Rechenberg, I. (1971). Evolutionsstrategie - optimierung technischer systeme nach prinzipien der biologischen evolution. Published 1973 by Fromman-Holzboog. [Ros and Hansen, 2008] Ros, R. and Hansen, N. (2008). A simple modification in cma-es achieving linear time and space complexity. In Proceedings 157

BIBLIOGRAPHY

of the 10th international conference on Parallel Problem Solving from Nature, pages 296–305. Springer-Verlag. [Rosen, 1985] Rosen, R. (1985). Anticipatory Systems: Philosophical, Mathematical and Methodological Foundations. Pergamon Press. [Rosenblatt, 1958] Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408. [Rumelhart et al., 1988] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by back-propagating errors. pages 696–699. [Runarsson, 2006] Runarsson, T. (2006). Approximate evolution strategy using stochastic ranking. In Proceedings of the 2006 IEEE Congress on Evolutionary Computation, pages 745–752, Vancouver, BC, Canada. IEEE Press. [Russell et al., 2006] Russell, S., Norvig, P., Miclet, L., and Popineau, F. (2006). Intelligence Artificielle. 2e ´edition, french version. Pearson Education. [Saxena et al., 2009] Saxena, A., Driemeyer, J., and Ng., A. Y. (2009). Learning 3-d object orientation from images. In International Conference on Robotics and Automation (ICRA). [Schaal, 1999] Schaal, S. (1999). Is imitation learning the route to humanoid robots? Trends Cogn Sci, 3(6):233–242. [Schapire, 1990] Schapire, R. (1990). The strength of weak learnability. [Schrauwen et al., 2008] Schrauwen, B., Buesing, L., and Legenstein, R. A. (2008). On computational power and the order-chaos phase transition in reservoir computing. In NIPS. [Schwefel, 1981] Schwefel, H.-P. (1981). Numerical Optimization of Computer Models. John Wiley & Sons, Inc., New York, NY, USA. [Segre, 1988] Segre, A. M. (1988). Machine learning of robot assembly plans. Kluwer Academic Publishers, Norwell, MA, USA. [Sigaud, 2007] Sigaud, O. (2007). Les syst`emes de classeurs. d’Intelligence Artificielle, 21(1):75–106. 158

Revue

BIBLIOGRAPHY

[Smart and Kaelbling, 2002] Smart, W. D. and Kaelbling, L. P. (2002). Effective reinforcement learning for mobile robots. [Smith et al., 1986] Smith, R., Self, M., and Cheeseman, P. (1986). Estimating uncertain spatial relationships in robotics. In UAI, pages 435–461. [Sobol, 1967] Sobol, I. M. (1967). The distribution of points in a cube and the approximate evaluation of integrals. In USSR Comput. Math. Math. Phys., volume 7, pages 86–112. [Stanley, 2004] Stanley, K. O. (2004). Efficient evolution of neural networks through complexification. PhD thesis. Supervisor-Miikkulainen,, Risto P. [Stanley and Miikkulainen, 2002] Stanley, K. O. and Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127. [Statistics and Breiman, 2001] Statistics, L. B. and Breiman, L. (2001). Random forests. In Machine Learning, pages 5–32. [Stork and Hassibi, 1993] Stork, D. and Hassibi, B. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems (NIPS) 5, pages 164–171. Morgan Kaufmann. [Sutton and Barto, 1998a] Sutton, R. and Barto, A. (1998a). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA. [Sutton and Barto, 1998b] Sutton, R. and Barto, A. (1998b). Reinforcement learning: An introduction. MIT Press. [Sywerda, 1989] Sywerda, G. (1989). Uniform crossover in genetic algorithms. In Proceedings of the third international conference on Genetic algorithms, pages 2–9, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. [Teytaud and Gelly, 2007] Teytaud, O. and Gelly, S. (2007). Dcma: yet another derandomization in covariance-matrix-adaptation. In GECCO ’07: Proceedings of the 9th annual conference on Genetic and evolutionary computation, pages 955–963, New York, NY, USA. ACM. [Teytaud et al., 2007] Teytaud, O., Gelly, S., and Mary, J. (2007). Active learning in regression, with application to stochastic dynamic programming. In International Conference On Informatics in Control, A. and Robotics, editors, ICINCO and CAP. 159

BIBLIOGRAPHY

[Thrun et al., 2006] Thrun, S., Montemerlo, M., Dahlkamp, H., Stavens, D., Aron, A., Diebel, J., Fong, P., Gale, J., Halpenny, M., Hoffmann, G., Lau, K., Oakley, C., Palatucci, M., Pratt, V., Stang, P., Strohband, S., Dupont, C., Jendrossek, L.-E., Koelen, C., Markey, C., Rummel, C., van Niekerk, J., Jensen, E., Alessandrini, P., Bradski, G., Davies, B., Ettinger, S., Kaehler, A., Nefian, A., and Mahoney, P. (2006). Winning the darpa grand challenge. Journal of Field Robotics. [Tolman, 1948] Tolman, E. (1948). Cognitive maps in rats and men. Psychological Review, 4(55):189–208. [Tolman and Honzik, 1930] Tolman, E. and Honzik, C. (1930). Insights in rats. University of California Publications in Psychology, 4(14):215–232. [Toussaint et al., 2007] Toussaint, M., Willert, V., Eggert, J., and Korner, E. (2007). Motion segmentation using inference in dynamic bayesian networks. In British Machine Vision Conference 2007. [Tuci et al., 2004] Tuci, E., Trianni, V., and Dorigo, M. (2004). Evolving the feeling of time through sensory-motor coordination: a robot-based model. In Yao, X., Burke, E., Lozano, J., Smith, J., Merelo-Guerv´os, J., Bullinaria, J., Rowe, J., Tiˆ no, P., Kab`an, A., and Schwefel, H., editors, The 8th International Conference on Parallel Problem Solving From Nature (PPSN 2004), volume 3242 of Lecture Notes in Computer Science, pages 1001–1010. Springer Verlag, Berlin, Germany. [Urzelai and Floreano, 2001] Urzelai, J. and Floreano, D. (2001). Evolution of adaptive synapses: Robots with fast adaptive behavior in new environments. Evol. Comput., 9(4):495–524. [Vapnik, 1995] Vapnik, V. N. (1995). The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA. [Vapnik and Chervonenkis, 1971] Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264–280. [Verstraeten et al., 2007] Verstraeten, D., Schrauwen, B., D’Haene, M., and Stroobandt, D. (2007). An experimental unification of reservoir computing methods. Neural Networks, 20(3):391–403. [Welch and Bishop, 2008] Welch, G. and Bishop, G. (2008). Siggraph 2001 course 8 an introduction to the kalman filter. 160

BIBLIOGRAPHY

[Werbos, 1990] Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560. [Widrow and Hoff, 1960] Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuit. IRE Western Electric Show and Convention Record, pages 96–104. [Williams and Zipser, 1989] Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270–280. [Wolpert, 1996] Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. [Yasuda et al., 2007] Yasuda, N., Kawakami, T., Iwano, H., and Kikuchi, K. (2007). Robotic design principles emerging from balance of morphology and intelligence. In IEEE International Conference on Robotics and Biomimetics (ROBIO07), pages 541–546, Sanya, China. IEEE Computer Society Press.

161