
DESS Bioinformatique, Université des Sciences et Technologies de Lille

Project presented by

Pierre MARGUERITE

Rendre les standards de description de biopuces accessibles: réalisation d’un module de conversion inter-standard [Facilitating Standardization and Exchange of Microarray Layout]

4 mai 2004 - 31 octobre 2004 Microarray Informatics Team, EMBL-EBI, Cambridge, UK

Supervisors: Philippe Rocca-Serra, PhD; Ugis Sarkans, PhD; Mohammadreza Shojatalab; Niran Abeygunawardena; Sergio Contrino; Susanna-Assunta Sansone, PhD

Acknowledgment
This thesis is part of the Master of Science (MSc) degree in Computer Science specialising in Bioinformatics at Lille University; it concludes the course of study leading to that degree. I would like to take this opportunity to thank everyone who made this project possible, and particularly: Alvis Brazma, Microarray Informatics Team Leader, for welcoming me into his team; Philippe Rocca-Serra, Ugis Sarkans, Mohammadreza Shojatalab, Niran Abeygunawardena, Sergio Contrino and Susanna-Assunta Sansone, my project supervisors, for their support; Maude Pupin for assessing my work; and, finally, all team members for their support. Thanks to Kjell Petersen at the Bergen Center for useful assistance throughout the training period, and special thanks to Joel for his help with XML. Lille, October 2004

Remerciements
I warmly thank everyone who made this DESS project possible, in particular: Alvis Brazma, head of the Microarray team, for welcoming me into the project; Philippe Rocca-Serra, Ugis Sarkans, Mohammadreza Shojatalab, Niran Abeygunawardena, Sergio Contrino and Susanna-Assunta Sansone, my internship supervisors, for their guidance; Maude Pupin for kindly agreeing to assess my work; not forgetting Philippe, Joel, Vincent, Eric, Mélanie, Catherine, Isabelle, Mado, Armand, my housemates, Paula, Mark, Lawrence, Pelin, Abdullah and the others; and finally all the members of the European institute's Microarray team for the generosity of their welcome.


Résumé
As a student in the DESS Bioinformatique at the Université des Sciences et Technologies de Lille, I am concluding my studies with a six-month internship in England, from 4 May to 31 October 2004. It follows an offer posted by the Microarray team of the European Bioinformatics Institute [ebib] on their website. After sending my curriculum vitae and a telephone interview in English with the project leaders, I was selected, along with two other students, to carry out my internship in their team. The proposed project concerned microarrays, and more precisely the standardisation of microarray data. It is a topic I had already tackled during the bioinformatics project at the « plateforme de génomique de Lille » during this DESS year: through a retrospective analysis of microarray data, I had been able to discover the ins and outs of this promising new technique.

The European Bioinformatics Institute
The E.B.I., or European Bioinformatics Institute, is a non-profit organisation with research and service missions in bioinformatics. Its objective is to make the discoveries of genomic and molecular-biology research available to the scientific community. The institute is one of the five sites forming the European Molecular Biology Laboratory [emb] (E.M.B.L.): Heidelberg (Germany), Hamburg (Germany), Grenoble (France), Monterotondo (Italy) and Hinxton (United Kingdom).

Figure 1: The European Bioinformatics Institute, a component of the EMBL.
The E.B.I. is located at Hinxton, about twenty kilometres from Cambridge, in the United Kingdom, on the Wellcome Trust Genome Campus. Its staff of around 200 people comprises developers, annotators (curators) and researchers. These members come from many different countries, with different backgrounds, practices and cultures; we all share one thing in common, the use of English as the working and exchange language. As Figure 1 shows, the E.B.I. has four cross-cutting activities:
• Industry: promoting standards
• Research: research in bioinformatics
• Training: courses on various fields of bioinformatics
• Services: providing public databases (around fifteen)

The Microarray team
Within the EBI, microarrays, a technology dating from 1993, are handled by the Microarray Informatics team. This group, led by Alvis Brazma, is a small team of 26 people, mainly annotators and developers. They work together on the ArrayExpress project: the first public microarray database, together with submission tools such as the MIAMExpress web service (see part 1.5.1.1 for more details). The annotators play a key role in the team: they ensure that submitted data are correct before they enter the database. Within the MGED (Microarray Gene Expression Data) consortium, the team works on the standardisation of data, their exchange and their storage. This consortium is an international organisation of biologists, developers and analysts whose aim is to facilitate the exchange of microarray data obtained from functional-genomics and proteomics experiments.

Why standardisation?
During my bioinformatics project, I had noticed with surprise the large number of existing data formats. Each manufacturer generally has its own format (spotting robot, scanner, ...), to which are added, for example, the formats of normalisation software. This profusion of data formats forces biologists into many manual manipulations to obtain the results of an experiment, with the share of human error this implies. These needless manipulations reveal the lack of agreement and communication between the various tool and equipment suppliers, a situation aggravated by the scarcity of tools for managing data and transforming them from one format into another. Biologists therefore lose precious time. Starting from this observation, the Microarray team proposes, through the MGED consortium, recommendations and several standards (MIAME [BHQ+01], MAGE [maga], the MGED ontology [mgea]), as well as tools to ease the handling of these data, in particular their submission to databases (ArrayExpress).

This work is important because the quantities of biological data, and therefore of microarray data, keep growing. Putting standards in place (a rather long process) must be done as early as possible, before this quantity becomes too large.

The project
My internship project concerns only one part of microarray data: array layout descriptions (also called microarray designs). Its objective is to improve the standardisation and exchange of microarray descriptions by providing a conversion tool between layout-description formats. The layout description corresponds to the design of the microarray before the experiment. Two standards currently exist: the ADF (« Array Design File ») format and MAGE-ML. The ADF format is fairly simple and very close to existing formats: it consists of tabular text files (see part 2.1.2 for an example). MAGE-ML is an XML-based format. It covers much more than the layout description, including the MIAME (Minimum Information About a Microarray Experiment) recommendations: protocols, materials, treatments, etc. A MAGE-ML file cannot be written by hand and requires precise knowledge as well as suitable tools. This project will bring much to the scientific community and to the ArrayExpress infrastructure by speeding up the exchange of layout descriptions in a standard format:
• improving the visualisation of description annotations;
• letting users choose the best solution for their needs and resources;
• easing standardisation by demonstrating the feasibility of capturing all the recommended (MIAME) information efficiently.
ADF
An ADF (« Array Design File ») consists of three parts, which may be held in three tabular flat files or three worksheets of a Microsoft Excel workbook. These three parts, each with its own headers, are (see part 2.1.2 for an example):
• Header: the metadata about the design, such as a list of contacts, a version number, etc.
• Feature/Reporter: two design elements called « feature » and « reporter » (the terms follow the MIAME recommendations and are objects of the MAGE model). Put simply, a « feature » is a position on a microarray slide, defined by its coordinates; a reporter is the element deposited on a feature, whose main characteristic is the deposited sequence, associated with the biological sample.
• Composites: composite sequences link « reporters » to a biological entity; indeed, reporters have no biological reality stricto sensu.
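As a rough illustration only (not the tool's actual code), the tabular Feature/Reporter part can be read into memory with a few lines of Java; the column names used in the usage example below are simplified stand-ins for the real ADF headers:

```java
import java.util.*;

public class AdfTableReader {

    // Parse tab-delimited lines: the first line is the header row,
    // each following line becomes a column-name -> value map.
    public static List<Map<String, String>> parse(List<String> lines) {
        String[] headers = lines.get(0).split("\t", -1);
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < headers.length; i++) {
                // Pad missing trailing cells with empty strings.
                row.put(headers[i], i < values.length ? values[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }
}
```

For example, `parse(List.of("MetaColumn\tMetaRow\tReporter Identifier", "1\t1\tclone-42"))` yields one row whose "Reporter Identifier" value is "clone-42".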

MAGE-ML
The MAGE-ML format is an application of XML, the eXtensible Markup Language. This file format is a persistent representation of the MAGE-OM object model [magd]. The model contains elements of the MIAME recommendations; to cover those recommendations completely, good practices have been laid down concerning the relationships between objects, the presence of additional objects, the use of certain mandatory attributes, and so on. The model covers every aspect of a microarray experiment, that is, much more than the design: the materials used, the various products and manipulations applied, as well as the normalisation applied to the results obtained.

Contribution
I therefore designed a standalone, multi-platform (Microsoft Windows, GNU/Linux, Mac OS) file-conversion tool between these two formats, with no database connection. It first checks and corrects the structure of the files and the data so that they respect the good practices derived from the MIAME recommendations; during this step, the supplied values are standardised to match those practices. I defined a set of validation rules to carry out this validation (see appendix .4.5). The data are then converted into the other format (ADF to MAGE-ML and MAGE-ML to ADF), and the various elements required by the good practices are added. At present no tool performs a complete validation of design-description data. For the conversion, some tools do exist, but they use data held in a database and assume them to be correct. Talking with the annotators, it emerged that submitted files often contained various errors, mainly typographical, notably in column names or sequences. These files were rejected although their content was correct, or only part of the data was recognised. It was therefore agreed that common errors would be accepted by the tool but modified, in a sense standardised, so as to obtain a correct output file. In addition, during conversion, implicit groups are created to ease database searches, in particular for sequence location (chromosome and chromosome strand). While files are processed, events (errors, information) are logged to a file, which lets the user make the necessary changes at the end of the run and then submit the data.
File validation
Before the description data are converted, a syntactic and lexical validation of the files and their content is performed. To this end, a validation checklist was drawn up (see appendix .4.5), listing every element to verify; the validation rules were also ranked by severity level. For ADF files, the validation relies on an XML file describing a data file (structure, header, etc.). These description files contain the structure of a document (several files/worksheets) and, for each table it contains, the list of allowed header names, their cardinality, the type of the values in the associated fields, and so on. For MAGE-ML files there is a Document Type Definition (DTD) describing the structure and the elements allowed in a document. This DTD has been approved by the Object Management Group (OMG) [omg], a body that accepts specifications for standardisation so as to maintain interoperability between applications; a version is available on the OMG website. Validating a MAGE-ML file against this DTD guarantees that the file is well formed and structurally correct. However, DTD validation cannot verify all the established rules, notably the good practices for creating a MAGE-ML file and the controlled vocabularies (the MGED ontology terms and everything concerning the databases). Files are read with the MAGE-stk Java library [mage], which reads and writes MAGE-ML files and manipulates the data as MAGE-OM objects.
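The DTD-validation mechanism can be illustrated with the standard Java SAX API; the tiny DTD and document below are invented stand-ins for the real MAGE-ML DTD, and this is a self-contained sketch, not the tool's code:

```java
import java.io.*;
import java.nio.file.*;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

public class DtdValidation {

    // Validate an XML file against the DTD it declares; returns the
    // validity errors reported by the parser (empty list = valid).
    public static java.util.List<String> validate(Path xmlFile) throws Exception {
        java.util.List<String> errors = new java.util.ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);               // turn on DTD validation
        factory.newSAXParser().parse(xmlFile.toFile(), new DefaultHandler() {
            @Override public void error(SAXParseException e) {
                errors.add(e.getMessage());        // record, do not abort
            }
        });
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Toy DTD and matching document written to temp files.
        Path dtd = Files.createTempFile("array", ".dtd");
        Files.writeString(dtd, "<!ELEMENT array (feature*)> <!ELEMENT feature EMPTY>");
        Path xml = Files.createTempFile("array", ".xml");
        Files.writeString(xml,
            "<?xml version=\"1.0\"?><!DOCTYPE array SYSTEM \"" + dtd.toUri()
            + "\"><array><feature/></array>");
        System.out.println(validate(xml).isEmpty() ? "valid" : "invalid");
    }
}
```

Collecting errors in a handler instead of letting the parser abort mirrors the tool's logging approach: all problems in a file are reported in one pass.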

Checking MGED ontology terms
MGED ontology terms are checked by means of a parser for the ontology file in DAML+OIL format. This parser comes from the bioinformatics group of the Bergen Center for Computational Science [ber].

Checking approved databases
The Microarray team maintains a list of the databases approved for use in ArrayExpress. This list is an Excel workbook containing, for each database, its name, its label, and regular expressions describing the accession numbers of the associated database.
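That check can be pictured with Java's standard regex classes; the database names and accession patterns below are hypothetical examples, not entries from the team's actual list:

```java
import java.util.Map;
import java.util.regex.Pattern;

public class AccessionChecker {

    // Hypothetical excerpt of an approved-database list: database name
    // mapped to the regular expression its accession numbers must match.
    private static final Map<String, Pattern> APPROVED = Map.of(
        "embl",     Pattern.compile("[A-Z]{1,2}[0-9]{5,6}"),
        "interpro", Pattern.compile("IPR[0-9]{6}"));

    // True if the accession is well formed for a known database.
    public static boolean isValid(String database, String accession) {
        Pattern p = APPROVED.get(database.toLowerCase());
        return p != null && p.matcher(accession).matches();
    }
}
```

For instance, `isValid("embl", "X12345")` is true, while an accession for an unlisted database is always rejected.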

The validation offers two levels of assessment:
• strict: files and data must match the specifications and expected values exactly;
• relax: this level allows more flexibility. For example, a file header may contain erroneous names; the detected errors are generally corrected automatically by the curation mechanism.
Conversion
The conversion of ADF data into MAGE-ML follows the good practices: various MAGE objects, distinct from the array design itself and not required for a MAGE-ML file to validate against the DTD, are added. These created objects mainly correspond to the metadata supplied in the Header part of the ADF. In addition, the tool generates identifiers for MAGE objects according to the new specifications defined at a MAGE Jamboree meeting in December 2003, again so as to move towards standardisation; previously, everyone was free to create their own identifiers, which could lead to different objects in the database carrying the same identifier. Moreover, to ease database searches, implicit groups are created, in particular for sequence location (the chromosome on which the sequence lies and the chromosome strand).
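A minimal sketch of the relax-level curation idea, using a made-up table of header synonyms rather than the tool's real correction rules:

```java
import java.util.Map;

public class HeaderCurator {

    // Hypothetical table: common variants mapped to canonical header names.
    private static final Map<String, String> SYNONYMS = Map.of(
        "metacolumn",          "MetaColumn",
        "meta column",         "MetaColumn",
        "reporter identifier", "Reporter Identifier",
        "reporter id",         "Reporter Identifier",
        "reporteridentifier",  "Reporter Identifier");

    public enum Level { STRICT, RELAX }

    // STRICT: accept only the canonical spelling.
    // RELAX: silently curate known variants; reject anything unknown.
    public static String curate(String header, Level level) {
        String canonical = SYNONYMS.get(header.trim().toLowerCase());
        if (canonical == null)
            throw new IllegalArgumentException("unknown header: " + header);
        if (level == Level.STRICT && !canonical.equals(header.trim()))
            throw new IllegalArgumentException("strict mode rejects: " + header);
        return canonical;
    }
}
```

In relax mode a misspelled "meta column" is standardised to "MetaColumn" and logged, instead of causing the whole file to be rejected.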

While files are processed, events (errors, information) are logged to a file, one per processed file, as well as to a global file gathering general information about the tool. This is provided by the Log4j library, which offers several information levels (debug, info, warn, error, fatal). Usefully for the future of the project, it can also emit the information in various output formats, such as HTML, easing integration into a web interface such as MIAMExpress as an external tool.
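As an illustration only, a Log4j 1.x properties file of the following shape (the file name and appender names are hypothetical) can route events to both the console and an HTML report suitable for a web interface:

```properties
# Root logger: everything at INFO and above goes to two appenders.
log4j.rootLogger=INFO, console, html

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %-5p %c - %m%n

# HTML output, convenient for embedding in a web tool such as MIAMExpress.
log4j.appender.html=org.apache.log4j.FileAppender
log4j.appender.html.File=conversion-report.html
log4j.appender.html.layout=org.apache.log4j.HTMLLayout
```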

The tool is driven from a command line that takes a file or a directory to process, together with various parameters, notably the type of checking and the option of saving the corrected (curated) data after validation. The integration of further user interfaces, in particular a graphical one, was nevertheless taken into account in the tool's specifications: the data processing has been separated from the user-interaction modules. Through an advanced configuration module, the user can also set default options for the various treatments; the configuration is kept in a Java properties file that the user can easily edit. Initially the tool will only offer command-line use, which some consider very practical, for example for processing many files at once, and others find austere and off-putting. It was therefore agreed to ease the integration of new user interfaces, such as a graphical or web interface, through a kind of library: a class in charge of initialising the various components and centralising all method calls. The interfaces then only have to pass parameters to this library. By enforcing the good practices and the standardisation of the data, the homogeneity of databases such as ArrayExpress will be improved. The tool will also lighten the annotators' workload: they currently check all submitted data, a long and tedious process. With this tool, part of the data validation can be done by the biologists themselves before they submit their data. It will moreover encourage submission in the MAGE-ML format, helping to generalise its use.
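That design can be sketched as a thin facade class that every front end (command line, GUI, web) calls into; the class and method names here are illustrative, not the tool's actual API:

```java
// Illustrative facade: interfaces only pass parameters here, and the
// facade wires validation, curation and conversion together.
public class ConverterFacade {

    public enum Mode { ADF_TO_MAGEML, MAGEML_TO_ADF }

    private final StringBuilder log = new StringBuilder();

    // Single entry point used by every front end; returns the output path.
    public String convert(String inputPath, Mode mode, boolean saveCuration) {
        log.append("validating ").append(inputPath).append('\n');
        // ... validation, curation, then conversion would happen here ...
        String out = inputPath + (mode == Mode.ADF_TO_MAGEML ? ".xml" : ".adf");
        log.append("writing ").append(out).append('\n');
        return out;
    }

    public String getLog() { return log.toString(); }
}
```

A command-line main, a Swing window or a web servlet would all reduce to gathering parameters and calling `convert`, which keeps the processing code untouched when a new interface is added.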

Technical choices
Since laboratories generally run heterogeneous machines, a multi-platform application was the obvious choice. The Java language was chosen notably because the MAGE-stk library [mage], which handles MAGE-ML files and their serialisation into objects, is only available in Java and Perl. Being autonomous in the design and the technical choices, I decided to validate the files and the data by means of XML files describing the file structures; the idea is to allow the file formats to change without modifying the tool. The conversion between formats follows paths that avoid redundant processing and try to limit memory use, memory being a critical resource for a tool that must handle large quantities of data. As soon as the internship ends, the tool will be available directly on the EBI website: http://www.ebi.ac.uk/adf. It will then be maintained by the team's developers. The tool being free software, the source code will also be available to encourage its improvement.

Assessment
The challenges inherent in such an internship are:
• an introduction to working life;
• autonomy in design and technical choices;
• the capacity for initiative, to propose new ideas;
• analysis and methodical thinking;
• command of English, the working language, for good communication;
• being part of a team.

Technical gains
This project was an opportunity to use new tools and techniques and to deepen various notions seen during my studies. During my internship I used:
• Java: the tool was developed entirely in this object-oriented language;
• XML (« eXtensible Markup Language »): by manipulating files in the MAGE-ML format, an XML-based format, and also, while developing the validation tool, by creating an XML Schema for the file-description files;
• Eclipse: this development environment let me automate various separate tasks and structure my source code better;
• CVS (the Concurrent Versions System): used to save the progress of the project, with the possibility of rolling back when necessary.

• IzPack: a free Java installer that installs applications under Microsoft Windows, GNU/Linux and even Mac OS. The project will later be distributed with this installer.

A project open to the outside
During this internship I had the opportunity to talk with people outside the team who were very interested in the tool being developed. The MAGE-ML standard, dating from 2001, is used more and more, but its spread is limited by the complexity of the model and by the lack of tools for creating files in this format. Michael Miller of Rosetta Inpharmatics, as well as the team behind the MAGE-stk Java library, were my main contacts.

Proposing and managing choices
This project showed me the importance of the choices and decisions taken. This facet of project management was largely revealed by the choice of libraries (JAXB, jxl rather than POI, etc.) and of methods. I saw how much weight such decisions carry for performance, ease of maintenance and the integration of new features. The main choices were also made while defining the project specifications: choice of technologies, ease of use, integration into a larger project (MIAMExpress - ArrayExpress).

An enriching work environment
Going abroad, to a European institute moreover, was the chance to work in an international environment whose working language is English, so I was able to improve my English considerably. I rubbed shoulders with people from different countries and from different fields (researchers, annotators, developers), and I hope this experience has given me a more open mind.

It now seems clear to me that the adoption and deployment of standards happens through tools that are both easy to use and complete. These tools must adapt to users' needs without exposing the often complex underlying concepts. ADF and MAGE-ML have many assets, notably being free and open standards and containing enough information to suit every use case, thanks to the MIAME recommendations. Personally, the benefit of such an internship is far from negligible. First of all I improved my command of English; I also had the opportunity to make many contacts, which will certainly serve me in my job search, and later on if, as I hope, I find a job in bioinformatics.


Abstract
Over the last ten years, microarray technology has become a tool of choice for life-science experimentalists. Terabytes of data are generated by the routine use of arrays surveying the expression of thousands of genes at a time. However, such experiments can only be interpreted in context: additional knowledge about the data (metadata) is necessary to interpret, reproduce and share them. As various factors may influence the interpretation of results and the exchange of data, grass-roots work by a consortium of leading academic laboratories and manufacturers has been carried out to assess and define the information required to fully describe microarray experiments. This consortium, known as the MGED Society, has released the MIAME requirements, a document defining the content and nature of the minimum information that should be recorded. While this first achievement delivered by the society set the content in stone, it fell short of addressing the issue of data persistence. Aiming to fill that gap, MGED Society members have produced an OMG-approved object model of microarray experiments known as MAGE-OM, standing for MicroArray Gene Expression Object Model. This model paved the way to database construction and, from it, an electronic format has been automatically generated: MAGE-ML, an XML serialisation of the MAGE object model allowing compact and consistent representation of information related to microarray experiments. Both standards, MAGE-OM and MAGE-ML, target computer scientists, bioinformaticians and anyone dealing with data persistence and data exchange. However, it rapidly became clear that a format friendlier to wet-lab scientists was needed to represent, share and distribute array designs within the scientific community.
The necessity for such a lighter format arose from the need to ensure consistent description of array designs when submitting them to microarray data repositories such as ArrayExpress or GEO. A format now known as ADF, for Array Design File format, has been derived from the MAGE-OM to provide a spreadsheet-like view of a microarray layout. The ADF format is currently routinely used for submitting array designs to ArrayExpress through MIAMExpress, an online data-submission tool [mia]. Usage has identified limitations in the current ADF format: among others, limited support for microarray applications other than gene expression (for instance chromatin immunoprecipitation or comparative genomic studies) as well as poor grouping facilities. The current project focuses first on refining the ADF specifications in order to address these limitations. A tool has then been developed to convert ADF to MAGE-ML files and back. The tool's main focus remains gene expression, which represents 90% of data submissions to ArrayExpress; the ADF specifications have nevertheless been modified to accommodate basic support for Comparative Genomic Hybridization (CGH) and binding-site identification (ChIP) microarray usage. It should be noted that no specific modifications of the ADF format have been made to take into account the specifics of protein microarrays, mainly because that technology is still in its infancy. Finally, the tool's main goal is to enhance the usage of the MGED standards by offering a simple interface to complex ones, ensuring with limited effort, and with extended support, the creation of MIAME-compliant files in MAGE-ML, a format tailored for data persistence and used in state-of-the-art repositories. This project concludes the Master's degree in computer science, specialising in bioinformatics, of the University of Lille. It involved the Microarray Informatics team of the European Bioinformatics Institute (E.B.I.), an E.M.B.L. outstation located near Cambridge, United Kingdom.


Contents

Acknowledgment
Remerciements
Résumé
Abstract
Table of contents
Table of figures
Introduction

1 The European Bioinformatics Institute (E.B.I.)
  1.1 European Molecular Biology Laboratory (E.M.B.L.)
  1.2 History of the EBI
  1.3 Research fields
  1.4 Analysis Tools and Database Access
  1.5 Microarray Informatics Team

2 The project: ADF MAGE-ML Tool
  2.1 What is it?
  2.2 How to convert data?
  2.3 Functionalities
  2.4 Implementation - Technical Choices
  2.5 Available resources
  2.6 Future improvements

3 The training period
  3.1 Work environment
  3.2 Programming skills
  3.3 Contribution to the Team

Bibliography

Annexe
  .1 MicroArray layout
  .2 ArrayExpress Licence
  .3 Approved Databases
  .4 ADF specification overview
  .5 MAGE-ML
  .6 Used tools

List of Figures

1 The EBI within the EMBL
2 Microarray technique steps
1.1 ArrayExpress architecture
1.2 ArrayExpress conceptual schema and external links
2.1 New MIAMExpress components
2.2 Audit and Security MAGE-OM package
2.3 Description MAGE-OM package
2.4 Array MAGE-OM package
2.5 Array design MAGE-OM package
2.6 DesignElement MAGE-OM package
2.7 BioSequence MAGE-OM package
2.8 ADF MAGE mapping
2.9 ADF Header information example
2.10 ADF Feature Reporter data example
2.11 Tool steps
2.12 Command-line mode available functionalities
2.13 Architecture of the tool
2.14 Picture of the whole XML Schema
2.15 Item structure description
2.16 GUI mode functionalities
1 Simple microarray layout schema
2 Approved databases
3 Approved databases (continued)
4 ADF Header item information
5 ADF Feature Reporter item information
6 ADF Feature Reporter item information (continued)
7 ADF Feature Reporter item information (continued)
8 ADF Feature Reporter item information (continued)
9 ADF CompositeSequence item information
10 ADF CompositeSequence item information (continued)
11 Ontology in ArrayExpress

Introduction
This project was carried out for my DESS Bioinformatique, the second year of my Master's degree specialising in bioinformatics, at the University of Lille. It was developed at the EBI (European Bioinformatics Institute, Cambridge, UK) in the Microarray Informatics Team, led by Alvis Brazma. We propose a tool that facilitates the use of the MAGE-ML file format in order to ease the exchange and the standardisation of microarray layouts. Microarray experiments are information-intensive, and recording their complex structure requires defining and capturing each step, including the experimental design, the sample sources and their treatments, the preparation of extracts and labelled extracts, the hybridisation processes, the array designs used and the final gene-expression data. Array designs in particular are very complex to describe, and the layout varies from manufacturer to manufacturer. The requirement for minimal descriptors and a common terminology to represent microarray experiments, along with the demand for standardised data storage and exchange formats, has been recognised by the Microarray Gene Expression Data (MGED) Society [mgeb], an international organisation of biologists, computer scientists and data analysts. The MGED Society has developed the set of standards needed for a microarray data-sharing infrastructure. Guidelines known as the Minimum Information About a Microarray Experiment, or MIAME [BHQ+01], have been developed to help data producers and microarray software developers capture the minimum descriptors required to interpret and reproduce a microarray experiment. To ensure software interoperability, a formal standard that specifies the communication protocols has also been developed.
MicroArray Gene Expression (MAGE) standards[maga] - the MAGE object model (MAGE-OM) and the MAGE markup language (MAGE-ML) - encoding all MIAME-required information are currently used worldwide by several databases and microarray informatics tools. MIAME and MAGE work in combination with a third component, the MGED Ontology[mgea], which defines sets of common terms and annotation rules for microarray experiments, enabling unambiguous annotation, efficient queries, meaningful data exchange and comprehensive data analysis. Its direct involvement in the MGED Society, as a founder member and active participant, has uniquely positioned the EBI Microarray Team[ebia] at the forefront of this effort to define an international standard for capturing and exchanging microarray data. The team, led by Alvis Brazma, has developed ArrayExpress, the first public, MGED standards-compliant infrastructure for microarray-based data[arra]. The ArrayExpress infrastructure consists of two data submission routes, a core repository, an online query interface, a query-optimized data warehouse (under development) and an online analysis tool, Expression Profiler. The first data submission route (via an ftp site) allows batch submission in MAGE-ML format; however, the creation of such files is a demanding exercise, and for data producers with little bioinformatics support this might

not be an option. To assist such user groups, the ArrayExpress infrastructure provides an on-line annotation and submission tool, MIAMExpress[mia], which does not require any knowledge of the MAGE standards. MIAMExpress is presented in the form of a MIAME-based questionnaire, where the MGED Ontology is used to structure inputs and provide controlled vocabularies for entry. Upon submission of array designs, a set of procedures is provided to the users to format the array into a standard referencing system. The Array Design File (ADF) format is a simple tab-delimited format, generated by merging a spotter output file and a clone tracker file. The ADF provides the layout information (from the spotter output file) and the biological annotation for each spot (from the clone tracker file) in a single file. The ADF format unambiguously locates each element on the array and provides consistent biological annotation for data mining, data evaluation and data comparison across different arrays and technology platforms. Another set of tools allows users to access the latest gene annotation and to re-annotate or update their array, through the link provided to another EBI database, EnsMart[ens]. MIAMExpress is a generic tool suitable for annotating microarray experiments involving any species, single-channel (e.g. Affymetrix) or dual-channel experiments, gene expression or CGH experiments. MIAMExpress has also been developed further for the toxicogenomics/pharmacogenomics community as Tox-MIAMExpress[MPS+04], and an Arabidopsis-specific version of MIAMExpress is currently under development. These specific instances are similar to the generic MIAMExpress, but carry domain-specific information and use species- or domain-specific controlled vocabularies. MIAMExpress is an open source project, consisting of HTML forms generated through a Perl-CGI interface, a MySQL database, and a MAGE-ML export component.
Coding an array design in MAGE-ML can be a lengthy process, and a complex one for users without in-house bioinformatics support. MAGE-ML files are neither human readable nor easy to visualize. The purpose of this project is to develop a module providing an interchangeable, standard format for array designs: it converts a MAGE-ML document into an ADF and, conversely, takes a valid ADF as input and generates a self-contained MAGE-ML file. The conversion module works as a stand-alone tool. The outcome of this project will be an important resource for the user community and the ArrayExpress infrastructure, accelerating array design exchange in a standard format. Conversion from MAGE-ML to ADF will enhance visualization of the array design annotation. Combining MAGE-ML and ADF support will assist users in choosing the solution best suited to their needs and resources. Facilitating standardization will demonstrate the feasibility of gathering MIAME-required information for complex array designs in an effective manner.

Microarray technology

Microarray technology now allows researchers to look at many genes at once and determine which are expressed in a particular cell type. DNA molecules representing many genes are placed in discrete spots on a microscope slide; this is called a microarray. Thousands of individual genes can be spotted on a single square-inch slide. Next, messenger RNA (the working copies of genes within cells, and thus an indicator of which genes are being used in these cells) is purified from cells of a particular type. The RNA

molecules are then "labelled" by attaching a fluorescent dye that allows us to see them under a microscope, and are added to the DNA spots on the microarray. Due to a phenomenon termed base-pairing, each RNA will stick to the gene it came from. After washing away all of the unstuck RNA, we can look at the microarray under a microscope and see which RNA remains stuck to the DNA spots. Since we know which gene each spot represents, and the RNA only sticks to the gene that encoded it, we can determine which genes are turned on in the cells. Some researchers are using this powerful technology to learn which genes are turned on or off in diseased versus healthy human tissues. The genes that are expressed differently in the two tissues may be involved in causing the disease.

Figure 2: Description of the Microarray technique.

Microarray technology currently has several applications:
• gene expression studies
• protein location and function analysis
• Comparative Genomic Hybridization and tumor analysis
• Single Nucleotide Polymorphism (SNP) analysis and genotyping

This report first presents the EBI, its role and achievements as part of the EMBL; then the Microarray Informatics Team; followed by a presentation of the developed tool, including a summary of what a microarray layout is and of the existing formats; and finally my motivation, the outcome of the project, and the conclusion.


Chapter 1

The European Bioinformatics Institute (E.B.I)

The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL). The EBI is a centre for research and services in bioinformatics. The Institute manages databases of biological data including nucleic acid sequences, protein sequences and macromolecular structures. The mission of the EBI is to ensure that the growing body of information from molecular biology and genome research is placed in the public domain and is freely accessible to all facets of the scientific community in ways that promote scientific progress. The EBI serves researchers in molecular biology, genetics, medicine and agriculture from academia, and from the agricultural, biotechnology, chemical and pharmaceutical industries. It does this by building, maintaining and making available databases and information services relevant to molecular biology, as well as by carrying out research in bioinformatics and computational molecular biology. The EBI Industry Programme was established to meet the special needs of the biotechnology, chemical and pharmaceutical industries while remaining consistent with the public domain policy of the EBI. This programme aims to help industry adapt quickly to, and maximise benefits from, innovations in bioinformatics.

1.1 European Molecular Biology Laboratory (E.M.B.L.)

General Information The European Molecular Biology Laboratory (EMBL) was established in 1974 and is supported by seventeen member states, including nearly all of Western Europe and Israel. EMBL consists of five facilities: the main Laboratory in Heidelberg [Germany], Outstations in Hamburg [Germany], Grenoble [France] and Hinxton [U.K.], and an external Research Programme in Monterotondo [Italy]. The EMBL is an international network of research institutes dedicated to research in molecular biology. EMBL was founded with a four-fold mission: to conduct basic research in molecular biology, to provide essential services to scientists in its Member States, to provide high-level training to its staff, students and visitors, and to develop new instrumentation for biological research. These core functions are combined with significant outreach activities in the areas of technology transfer, science and society, and training for

science teachers.

1.2 History of the EBI

The roots of the EBI lie in the EMBL Nucleotide Sequence Data Library, which was established in 1980 at the EMBL laboratories in Heidelberg, Germany, and was the world’s first nucleotide sequence database. The original goal was to establish a central computer database of DNA sequences, rather than have scientists submit sequences to journals. What began as a modest task of abstracting information from the literature soon became a major database activity, with direct electronic submissions of data and the need for highly skilled informatics staff. The task grew in scale with the start of the genome projects, and grew in visibility as the data became relevant to research in the commercial sector. It soon became apparent that the EMBL Nucleotide Sequence Data Library needed better financial security to ensure its long-term viability and to cope with the sheer scale of the task. This led to the creation of a dedicated institute, the European Bioinformatics Institute.

1.3 Research fields

Twelve research groups at the EBI contribute actively to the development of this new science, bioinformatics. These groups work in different fields:
• study of protein-protein and protein-DNA interactions
• manual and automatic genome annotation, protein classification
• three-dimensional protein structure analysis
• macromolecular structure analysis
• three-dimensional protein structure comparison
• standards development
• computational neurobiology
• data mining

1.4 Analysis Tools and Database Access

The EBI maintains versions of all the major public-domain sequence database searching and analysis tools, e.g. FASTA (Pearson & Lipman, 1988), BLAST (Altschul et al., 1990), CLUSTALW (Thompson et al., 1994) and Smith & Waterman (Smith & Waterman, 1981) implementations. The EBI also hosts tools such as DALI (Holm & Sander, 1997), a service for comparing protein structures in three dimensions and revealing biologically interesting similarities, and GeneQuiz, a system for highly automated analysis of protein sequences for the prediction of biochemical function. A major utility is the SRS system, which was developed at EMBL Heidelberg and the EBI and is now deployed at sites around the world. SRS is a program for the indexing and

cross-referencing of databases of textual information; it provides unified access to molecular biology databases, integration of analysis tools, and advanced parsing tools for disseminating and reformatting information stored in ASCII text. The services that the EBI develops and offers centre on the major databases of molecular biology information that it maintains. These are the EMBL Nucleotide Sequence Database, the TrEMBL and SWISS-PROT protein sequence databases, the Macromolecular Structure Database (EBI-MSD) of 3D co-ordinates of biological macromolecules, and the RHdb database of radiation hybrid maps. Aside from the major database projects, the EBI is involved in the preparation and distribution of over 70 other databases dedicated to particular areas of molecular biology.

EMBL Nucleotide Sequence Database - the first EBI database In September 2004, the EMBL Nucleotide Sequence Database (usually referred to as EMBL) contained over seventy billion nucleotides in more than forty-two million sequence entries, which hold sequence information and associated annotation. It is produced in close collaboration with GenBank in the USA and DDBJ in Japan. Together, these three sites constitute the global deposition sites for nucleotide sequence information, and every twenty-four hours the three databases exchange information. This information exchange requires carefully co-ordinated protocols. EMBL also contains all the sequence data from the European patent literature. The EBI has developed automated methods to propagate updates to remote copies of the database, making it easy for users to maintain a complete and up-to-date local copy of the EMBL database.

ArrayExpress database - the public microarray data repository ArrayExpress is a public repository for microarray data, aimed at storing well-annotated data in accordance with MGED recommendations[BPS+03].
The data relating to each microarray project in the ArrayExpress database is subdivided into two main components: the Array, which refers to information about the design and manufacture of the array itself, and the Experiment, which provides information on the experimental factors and the actual data obtained. In addition to these, a third component, Protocol, describes the procedures used in the production of the array or the execution of the experiment. This repository is hosted and maintained by the Microarray Informatics Team.

1.5 Microarray Informatics Team

Microarrays are being used to generate large datasets containing valuable information about many aspects of biology and medicine. For instance, they can be used to find which genes are under- or overexpressed in different cell types, or in a diseased tissue versus its healthy counterpart. To realize the full potential of microarrays, microarray experiments need to be described consistently so that they can be compared. They also need to be stored in a well-organized database linked to relevant bioinformatics resources so that, for example, functional information on groups of genes with similar expression patterns can easily be found.

The Microarray Informatics Team makes use of the sequence resources created by the genome projects to answer the question of which genes are expressed in a particular cell type of an organism. The team addresses the problems of managing, storing and analysing microarray data. Several microarray projects are being undertaken in this team: for example, the nutrigenomics and toxicogenomics projects.

Toxicogenomics Toxicogenomics combines the conventional tools of toxicology (such as enzyme assays, clinical chemistry, pathology and histopathology) with the new approaches of transcriptomics, proteomics, metabolomics and bioinformatics. This marriage of toxicology and genomics has created not only opportunities, but also new informatics challenges. The EBI is rising to the challenge, teaming up with ILSI-HESI and the NIH-NIEHS National Center for Toxicogenomics. The collaborations aim to:
• establish an international infrastructure for toxicogenomics data
• develop standard formats for data storage and data exchange
• promote the harmonization of terminologies (ontologies)

Nutrigenomics In the past decade, nutrition research has undergone an important shift in focus from epidemiology and physiology to molecular biology and genetics. Nutrigenomics is the application of transcriptomics, proteomics, metabolomics and bioinformatics in nutrition research. The European Nutrigenomics Organisation (NuGO) has recently taken up the ambitious challenge of translating nutrigenomics data into accurate predictions of the beneficial or adverse health effects of dietary components. Furthermore, NuGO soon realized the need for a strong investment in informatics infrastructure, welcoming the EBI as a new partner. The EBI will contribute to tackling issues such as:
• nutrigenomics technology standardisation and innovation
• bioinformatics environment harmonisation
• integrated information system development

1.5.1 The ArrayExpress project

1.5.1.1 What is ArrayExpress? ArrayExpress is an international public repository for microarray data. It aims to store and provide access to well-annotated data from microarray experiments. This project has three major goals: (1) to serve the scientific community as a repository for data supporting publications, (2) to provide the community with easy access to high-quality data in a standard format, and (3) to facilitate the sharing of microarray designs and experimental protocols. ArrayExpress supports the microarray community standards MIAME and MAGE-ML, which were developed by the Microarray Gene Expression Data Society (www.mged.org).

MIAME specifies the minimum information about a microarray experiment that should be provided to interpret the experiment unambiguously. MAGE-ML, the microarray gene expression markup language, is a standard format for exchanging data among microarray databases and data analysis tools. ArrayExpress data submissions are divided into three parts: Experiment, Protocol and Array. Each is given an accession number, so that an Array or Protocol can be referenced by many Experiments. The ArrayExpress code is freely available on the EBI website: http://www.ebi.ac.uk/arrayexpress/Implementation/index.html; the database and its contents can be downloaded to analyse publicly available microarray data locally, or ArrayExpress can be used to store one’s own data. ArrayExpress has four major components (see figure 1.5.1.1): the database itself, a web-based query interface, a data submission and annotation tool called MIAMExpress, and an online data analysis tool called Expression Profiler. MIAMExpress allows anybody to submit microarray data using a web interface. All the data are stored in the central ArrayExpress database, from which they can be accessed using a web-based query tool. The data can be imported directly into Expression Profiler for analysis, or exported for local analysis with appropriate tools.

Figure 1.1: Architecture of the ArrayExpress repository
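The accession-based cross-referencing between submission parts can be sketched as follows. This is a minimal illustration, not the actual ArrayExpress schema: the record fields and accession strings are invented.

```python
from dataclasses import dataclass, field

# Illustrative sketch of accession-based cross-referencing (NOT the actual
# ArrayExpress schema): each submission part carries an accession number,
# so a single Array or Protocol record can be referenced by many Experiments.

@dataclass
class Array:
    accession: str   # hypothetical accession string
    name: str

@dataclass
class Protocol:
    accession: str
    description: str

@dataclass
class Experiment:
    accession: str
    array_refs: list = field(default_factory=list)     # accessions of Arrays used
    protocol_refs: list = field(default_factory=list)  # accessions of Protocols used

# Two experiments referencing the same array design by accession:
shared = Array("A-EXAMPLE-1", "myarray")
exp1 = Experiment("E-EXAMPLE-1", array_refs=[shared.accession])
exp2 = Experiment("E-EXAMPLE-2", array_refs=[shared.accession])
assert exp1.array_refs == exp2.array_refs  # both point at the same Array record
```

Storing only the accession in each Experiment, rather than a copy of the Array record, is what lets one Array description be shared and curated once.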

1.5.1.2 What can I do with ArrayExpress?
• Make microarray data publicly available in accordance with the standards developed by the microarray community
• Download public-domain microarray data for in-house data mining
• Link gene expression patterns with other genomic information

• Obtain protocols for microarray experiments
• Find out which sequences are represented on different microarrays
• Provide supporting data for publications
• Export public data for analysis with the EBI’s analysis tools or your own tools.

Submitting data to ArrayExpress Many journals recommend, and the Nature journals now require, submission of MIAME-compliant data to a public repository as a condition of publication. ArrayExpress can be used for this purpose. Data can be submitted to ArrayExpress in two different ways:
• Using MIAMExpress, the EBI’s microarray data submission and annotation tool. Its series of web forms guides submitters through the annotation of microarray experiments, allowing them to document arrays, samples, hybridizations and relevant protocols before uploading data and completing the submission. MIAMExpress can be run using any web browser, and has been designed for use by biologists with minimal experience of bioinformatics. It is supported by the ArrayExpress curation team, which checks data before uploading it into ArrayExpress. MIAMExpress allows entry of MIAME-compliant information and exports MAGE-ML files, and its interface provides a convenient and robust means of annotating experimental data to the MIAME standard. Two-thirds of the hybridizations in ArrayExpress, accounting for roughly half the available experiments, have been deposited via MIAMExpress; data from over 4000 hybridisations have been entered this way.
• Submissions from external microarray databases, in the MAGE-ML file format. Many microarray laboratories can export their data in MAGE-ML format from their internal microarray database or laboratory information management system (LIMS).
Pipelines with array manufacturers, including Affymetrix[aff] and Agilent[agi], and with contributors such as TIGR, the Stanford Microarray Database[tig], the Wellcome Trust Sanger Institute[san] and the Deutsches Ressourcenzentrum fuer Genomforschung GmbH (RZPD)[rzp] are being established.

Retrieving data from ArrayExpress The ArrayExpress query interface allows users to query public datasets on fields such as author, accession number, species, experiment type and array name. Experiments and Protocols can also be viewed as browsable lists and downloaded as MAGE-ML or tab-delimited files.

The future ArrayExpress has been operational since February 2002, and data have been accumulating rapidly. Current plans for development include: a data warehouse that permits gene-centric queries; domain-specific annotation tools; improvements to the ArrayExpress query interface; and new functionality for Expression Profiler. As of September 2004, the data available in ArrayExpress represent approximately 10000 individual hybridizations relating to over 340 experiments. These studies, performed on 25 different organisms ranging from human to bacteria, cover a wide range

of topics, such as cancer in mammalian tissues and stress responses in single-celled fungi. Experimental data may currently be accessed by queries based on species, experimental design type (e.g. time series), the factors varied during the experiment, authors, or the array design used. A distinguishing feature of ArrayExpress is the support provided by the microarray curation team at the EBI. Curator input is an important part of the data submission process, which allows the microarray team to maintain the quality of submissions to ArrayExpress. In addition to maintaining the ArrayExpress database, the microarray team at the EBI provides a number of tools for the annotation and analysis of microarray data. A data analysis suite, Expression Profiler[Vil03], allows researchers to apply a variety of algorithms, such as hierarchical or k-means clustering, to ArrayExpress data. Source code is freely available for all of these applications, allowing individual researchers or organizations to set up local database installations. Specialized research communities can then use the EBI’s tools to annotate data with detailed organism- or application-specific information. The ArrayExpress database (see figure 1.5.1.2) is designed to store all MIAME-compliant information and is based on the MAGE object model (MAGE-OM), enabling users to construct detailed and precise queries of microarray data in the database. ArrayExpress also uses the MGED ontology as its controlled vocabulary, underpinning a reliable and consistent interface for the submission and querying of experimental data, array designs and protocols. The combination of convenient data submission options, expert curation, analysis tools and the implementation of data standards has made ArrayExpress widely used throughout the microarray community.

Figure 1.2: ArrayExpress schema and external links (ontologies, references, other databases, ...)


Chapter 2

The project: ADF MAGE-ML Tool

2.1 What is it? The aim of this project is to provide a standalone tool for converting microarray layout data files from the ADF format (part 2.1.2) to the MAGE-ML file format (part 2.1.1) and vice versa. As both formats contain controlled vocabulary and have a specific structure, a first step is mandatory before converting: the validation step. This step ensures data structure and consistency (syntax and semantics). This project will be made public with the new release of the data submission tool, MIAMExpress. This new release will include three new components (see figure 2.1), including this one. The two related components are also student projects: the visualisation module will display a representation of a submitted experiment, and the batch upload module will allow users to upload data directly to MIAMExpress without using the web interface.

2.1.1 Microarray Gene Expression - Markup Language Format (MAGE-ML)

Communication of microarray data and experimental annotation has historically proven difficult due to their varied and complex nature. To help resolve this problem, the Microarray Gene Expression Data Society (MGED) has produced two standards. The Minimum Information About a Microarray Experiment (MIAME) [BHQ+01] specification describes what information should be provided when making microarray data public. Several journals have adopted MIAME and require compliance for all microarray-related publications. The second standard, the Microarray Gene Expression Data Object Model (MAGE-OM) [magd], provides a standardized data format for transmitting microarray data, and is based on MIAME. The various components of MIAME are grouped into discrete packages within MAGE-OM.

Figure 2.1: MIAMExpress architecture with three new Components

2.1.1.1 MAGE-OM format and MAGE-ML: The MAGE-OM model was created using standard software development tools with Unified Modeling Language (UML) notation[arrb]. Such a process allowed the MAGE working group to document the model as development progressed. It also allowed various platform-specific software toolkits (MAGEstk [mage]) to be produced directly from the model. Lastly, a data transfer XML standard (MAGE-ML) was produced directly from MAGE-OM. MAGE-OM and MAGE-ML were subsequently approved by the Object Management Group (OMG) as a microarray data transfer standard in late 2002. The MAGE-ML format is an XML serialization of the MAGE object model [magd]; it has been automatically generated from the object model. The various tags allow for object representation and the creation of associations between objects. MAGE-ML documents must comply with the MAGE-ML Document Type Definition (DTD), available from the OMG website or the sourceforge project page of MAGE. MAGE-ML-coded information lives in packages that must appear in the order specified by the DTD. Yet not all packages are necessary to form a valid MAGE-ML document: out of the sixteen packages present in the MAGE object model, only six are actually relevant for describing microarray layouts (see below). Furthermore, MAGE-OM being a complex model, there is more than one way to represent the same information in a MAGE document. To reduce MAGE-ML document polymorphism, additional work has been carried out during MAGE Jamborees. This resulted in the publication of a series of guidelines, known as the « MAGE Good practice » (http://www.mged.org/Workgroups/MIAME/miame_mage-om.html). The document focuses on how to generate optimal MAGE documents, ensuring adequate usage of objects and optimal querying capabilities.
The advantage of such guidelines is obvious for all developers, as they will ensure in the long run that MAGE-ML documents are consistently generated and can reliably be parsed by tools developed to support microarray data interchange.
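To make the package structure concrete, the following sketch parses a tiny hand-written fragment in the spirit of MAGE-ML with Python's standard library and lists the top-level packages in document order. The fragment is invented for illustration and is far smaller than any real document; real MAGE-ML declares the DTD, which also fixes the package order.

```python
import xml.etree.ElementTree as ET

# A tiny invented fragment following MAGE-ML's "<Name>_package" element
# naming convention (identifiers are made up for this example).
fragment = """
<MAGE-ML identifier="example:doc1">
  <ArrayDesign_package>
    <ArrayDesign_assnlist>
      <PhysicalArrayDesign identifier="example:myarray" name="myarray"/>
    </ArrayDesign_assnlist>
  </ArrayDesign_package>
  <DesignElement_package/>
  <BioSequence_package/>
</MAGE-ML>
"""

root = ET.fromstring(fragment)
packages = [child.tag for child in root]
print(packages)  # -> ['ArrayDesign_package', 'DesignElement_package', 'BioSequence_package']
```

Note that the standard library parser checks well-formedness only; validating a document against the MAGE-ML DTD requires a validating parser.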

Various examples of MAGE-ML documents are available from the ArrayExpress web site.

Audit and Security package (see figure 2.2) Contains classes describing submitters (Organizations and Persons).

Figure 2.2: Audit and Security MAGE-OM package

Description package (see figure 2.3) Contains information describing an element. "Description" is a generic class associated with all objects derived from the "Describable" MAGE super-class. Any "Describable" object can be "tagged" with a "Description" object whose most basic attribute is a text, but the object may have more complex associations, for instance to "Database" objects and "OntologyEntry" objects via "annotation" associations.

Array package (see figure 2.4) Information about the array. Only basic objects have been used, and no support is provided in the tool for the « Manufacture LIMS » objects and their related associations and ancillary objects. The same is true of the « ArrayManufactureDeviation » objects.

ArrayDesign package (see figure 2.5) Information about the design of an array. Almost all objects are used in the definition of the ADF and in the tool; however, the "Zone Layout" object and its association to Unit are not used.

Figure 2.3: Description MAGE-OM package

Figure 2.4: Array MAGE-OM package


Figure 2.5: Array design MAGE-OM package

DesignElement package (see figure 2.6) Information about the elements contained in the array design. All objects are used except Position and the association to Position.

Figure 2.6: DesignElement MAGE-OM package


BioSequence package (see figure 2.7) Information about biosequences. A class for describing biological sequence objects: it has two mandatory associations ("PolymerType" and "Type" to "OntologyEntry") for which controlled vocabulary has to be supplied. Both have required the creation of their counterparts in the ADF format. The "SeqFeature" association is not exploited.

Figure 2.7: BioSequence MAGE-OM package

A major goal for the original MAGE model was to provide a standard format for sharing microarray data. The MAGE-ML standard has successfully enabled the transfer and archiving of many well-annotated complete microarray studies. As web-services models become more prevalent, groups are looking to use MAGE-ML in providing distributed query capabilities on individual components, such as details on an individual hybridization.

2.1.1.2 Optimal use of MAGE: MAGE Good Practice The MAGE object model is a highly engineered representation of the microarray world. Initial work with earlier versions of ADF and MAGE-ML documents revealed that some object usages were better than others. Furthermore, in some cases, for the sake of simplification, some objects were improperly used. A MAGE Jamboree held at the EBI in December 2003 released guidelines on the optimal usage of the MAGE model and MAGE objects. Figure 2.8 below presents in grey boxes the mandatory objects that should be used when dealing with microarray layouts. Relationships between objects are shown as directed arrows. Optional associations and objects are represented as white or dashed boxes. This representation insists on one important point that was overlooked in earlier work with the MAGE standard: all actual biological information attached to a microarray element should be mapped to a CompositeSequence element via a BiologicalCharacteristics association, while the actual sequence deposited on the array should be linked to the Reporter object via ImmobilizedCharacteristics.

Figure 2.8: Mapping between ADF and MAGE objects / MAGE good practice applied to design element description
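The central good-practice rule of figure 2.8 (biological annotation on the CompositeSequence, physical sequence on the Reporter) can be sketched by building the two associations directly. The element names follow MAGE-ML's `_assnreflist` naming convention, but the identifiers are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Good-practice rule sketched with invented identifiers:
#  - the sequence physically deposited on the array hangs off the Reporter
#    via ImmobilizedCharacteristics;
#  - the biological annotation hangs off the CompositeSequence via
#    BiologicalCharacteristics.
reporter = ET.Element("Reporter", identifier="Reporter:clone42")
immob = ET.SubElement(reporter, "ImmobilizedCharacteristics_assnreflist")
ET.SubElement(immob, "BioSequence_ref", identifier="BioSequence:oligo42")

composite = ET.Element("CompositeSequence", identifier="CompositeSequence:geneX")
biol = ET.SubElement(composite, "BiologicalCharacteristics_assnreflist")
ET.SubElement(biol, "BioSequence_ref", identifier="BioSequence:mRNA.geneX")

print(ET.tostring(reporter).decode())
```

Keeping the two associations distinct is what lets tools distinguish the deposited oligo from the transcript it is meant to measure.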

2.1.2 Array Design File Format (A.D.F.)

The ADF format corresponds to the definition of a human-readable (usually tabular) format for representing an array design. It results from the identification, in the MAGE object model, of the core classes and mandatory MIAME information that should be represented. The major concern was to put together a format that could easily be used by the community, without compromising on MGED standards usage. Various layers of information had to be combined in order to keep the ADF format as lightweight as possible. A core set of headers has been devised, and more specific ones have been added in order to provide better support for microarray applications that are not oriented towards gene expression (e.g. CGH, ChIP).

« Header » spreadsheet/file - ".adh" (see annexe .4.1) Contains additional information needed or useful for comparison (Database), such as the date of public release and the submitter name, or for conversion to MAGE-ML.

« FeatureReporter » spreadsheet/file - ".adr" (see annexe .4.1) Contains Feature and Reporter data. This file corresponds to the previous ADF format, even though slight modifications have been incorporated. Each row corresponds to a deposit on the array: position (feature) and biological information on the reporter.

« Composites » spreadsheet/file - ".adc" (see annexe .4.1) Optional; used only when a more complex array design has been created, where Reporters can be combined in order to represent different genetic elements. Typically this is the case for an array devised to monitor splice-variant events or an operon: Reporters, representing exons, can be grouped in various ways to represent the different transcripts. This third file allows the description of such relationships (maps) between Reporters and CompositeSequences (representing the various transcripts).

The tables below show the various authorized headers that should be used in the ADF and their mapping to the relevant MAGE objects. In the latest version of the ADF specification, an ADF file can be supplied either as a Microsoft Excel workbook containing three spreadsheets or as three text component files. To summarize, the Array Design File (ADF) format is a simple tab-delimited format generated by merging a spotter output file and a clone tracker file. The ADF provides the layout information (from the spotter output file) and the biological annotation for each spot (from the clone tracker file) in a single file. The ADF format unambiguously locates each element on the array and provides consistent biological annotation for data mining, data evaluation and data comparison across different arrays and technology platforms.

ADF example - usual case (minimal Feature-Reporter file for a gene expression application, simple microarray layout)
• Header information (figure 2.1.2, see annexe .4.2.1 for item specification)

Audit & Security

Person LastName: Smith
Person FirstName: John
Person email: [email protected]
Person office phone: +44 1248 485 485

Institution/Company: EMBL-EBI
Department: Microarray Team
URL: www.ebi.ac.uk/microarray

Address: Wellcome Trust Genome Campus
Zip code/PO box: CB10 1SD
City: Hinxton, Cambridge
State/Province: Cambridgeshire
Country: UK

Array Design Description

Array Design Name: myarray
Array Version: 1
Array Design Number of Features: 2600
Application: G.E.
Technology Type: spotted_ss_oligo_features
Surface Type: polylysine
Substrate Type: glass
Array Description: This is a test array. No specific usage
Array Protocol: Protocol for array printing: please describe here printer make and manufacture, slides, buffer used

Miscellaneous information

Release Date: 12/11/2005
User defined value: mydatabase

Figure 2.9: ADF Header information example: mandatory items and some optional ones.

• Feature Reporter data (figure 2.10, see annexe 5 for item specification)


MetaColumn | MetaRow | Column | Row | Reporter Name [intergenic] | Reporter ControlType | Reporter BioSequence Type | Reporter BioSequence Polymer Type | Reporter BioSequence Database Entry [embl] | Assigned_Gene Database Entry [mydatabase]
1 | 1 | 1 | 1 | iYC3243 | not_control | intergenic | DNA | AC04235 | SGD:003424
1 | 1 | 1 | 2 | iYC3244 | not_control | intergenic | DNA | AC04236 | SGD:003425;SGD00345
1 | 1 | 1 | 3 | iYC3245 | not_control | intergenic | DNA | AC04237 | SGD:003426
1 | 1 | 2 | 1 | iYC3246 | not_control | intergenic | DNA | AC04238 | SGD:003427
1 | 1 | 2 | 2 | iYC3247 | not_control | intergenic | DNA | AC04239 | SGD:003428
1 | 1 | 2 | 3 | iYC3248 | not_control | intergenic | DNA | AC04240 | SGD:003429

Figure 2.10: ADF Feature Reporter data example: mandatory items and some optional ones.

• CompositeSequence file: no data provided. This is not a complex array design (one-to-one Reporter-CompositeSequence relation).

2.1.2.1 Used Tags

Initially defined as a means to declare databases when providing accession numbers, the "tag" mechanism has been generalized to various fields of the ADF format to enhance expressivity and information acquisition. Tags are mainly used in the ADF header. Tags can be split into three categories:

• Database tags. The use of database tags:
  – enforces homogeneous data content submission;
  – enables accession number structure checks;
  – avoids costly data checks by providing direct identification of the resource.
A list of approved database tags (with associated regular expressions for accession number structure control), maintained by the ArrayExpress curation team, is used by the converter during parsing.

• Fixed tags. Additional tags have been devised to help the parsing process distinguish between several elements of the same nature.

• Optional tags. Depending on the microarray application, a tag may be appended to a header item to signal a biological feature; for instance, a tag may be appended to the "Reporter Name" field. In the case of a binding-site identification application, the ADF should contain the following Reporter Name header: « Reporter Name [intergenic] », simply to tell the parser to expect the concept of an assigned gene and therefore switch to the appropriate parsing mode.
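The bracketed tags above lend themselves to a simple parsing step. Below is a minimal sketch of how a header item such as « Reporter Name [intergenic] » could be split into its base name and optional tag; the class and method names are hypothetical, not the converter's actual API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: splitting an ADF header item into its base
// name and an optional bracketed tag, e.g. "Reporter Name [intergenic]".
public class HeaderTagParser {
    // Lazy first group so that a trailing "[tag]" is captured separately.
    private static final Pattern ITEM = Pattern.compile("^(.+?)\\s*(?:\\[(.+?)\\])?$");

    /** Returns {name, tag}; tag is null when no bracketed tag is present. */
    public static String[] parse(String headerItem) {
        Matcher m = ITEM.matcher(headerItem.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unparsable header item: " + headerItem);
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

A parser built this way can dispatch on the returned tag, e.g. switching to the "assigned gene" parsing mode when the tag is "intergenic".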

2.1.3 Controlled Vocabulary

Both formats contain domain-specific vocabularies and terms. Relying on ontology definitions and terms facilitates unambiguous description and subsequent query efficiency. The main source of controlled terms is the MGED Ontology (also known as MO), the third standard developed by the MGED Society.

It aims at making available a set of categories covering the world of microarray and biological annotation. Both ADF and MAGE-ML documents will undergo a checking process aiming at ensuring adequate use of the controlled terms where required.

2.2 How to convert data?

2.2.1 Requirements

First of all, the project must be an open-source project: project files (source and binary files) must be freely provided. The tool is conceived as a standalone application, without any connection to a database or any server. It should allow submitters to test files locally before submitting them to a public repository such as ArrayExpress, and to modify them to be compliant with the file format specifications. It could also be used as a module of MIAMExpress, the submission tool for ArrayExpress.

2.2.2 Guidelines

When starting the project, there was no application that checks and converts ADF to MAGE-ML and MAGE-ML to ADF. Within the Microarray Informatics Team, only partial tools existed:
1. an ADF file checking tool, the ADF checker (http://www.ebi.ac.uk/~farne/ADF/);
2. a tool to validate MAGE-ML and load it into a database, the MAGELoader (ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/MAGEvalidator-DISTRIB/);
3. the possibility to obtain an ADF file (previous version) from the ArrayExpress database.

After an analysis of these existing tools, it was possible to identify useful guidelines:
• Respect good practices and follow the guidelines given to submitters (http://www.ebi.ac.uk/~ele/ext/submitter.html)
• Perform the widest possible data checking
• Be fast
• As a standalone application, find another way to check MGED ontology terms and database data (tags and accession numbers)
• Be flexible - permit mistakes in header item names - by offering a relaxed checking mode in addition to a strict one that exactly matches the file format specification
• Offer a command-line user mode
• Offer a batch mode for processing multiple files
• Not require specific hardware such as a server; a usual desktop computer should be enough
• Additionally, offer a graphical user interface (G.U.I.) for easier use than the command line

2.2.3 Overview

The tool will check the files (ADF and MAGE-ML) for structure and content (well-formed and valid) (see section 2.3.1):
• ADF files will be screened against the field definitions devised in the specification[adf] (annexe .4) (structure check);
• MAGE-ML files will be screened against the entity definitions provided by a DTD[magb] (structure check);
• the data contained in MAGE-ML and ADF files will be verified where terms have to be encoded using a controlled vocabulary (for example, against the MGED ontology) or in a specific format (database accession number validity), where applicable.

The tool then converts the correct data into the other file format. Obviously, the user (i.e. a submitter) is informed of all problems identified during the process.

2.3 Functionalities

From these previous elements, it has been decided to have two mandatory steps for the file conversion: a checking step and a conversion step. In so doing, data quality (syntax and semantics) is ensured before starting the conversion process. The application mainly offers two functionalities (see process in figure 2.11):
1. ADF and MAGE-ML file validation: well-formedness (data structure) and validity (data) verification (see annexe .4.5 for ADF);
2. file conversion from one format to the other (ADF to MAGE-ML and MAGE-ML to ADF) - data must be identified and correct.

Figure 2.11: Steps of the tool execution.


2.3.1 File/Data checking

This step checks the data files and informs the user about errors and, if possible, how to correct them.

2.3.1.1 ADF check

Based on current experience with submitted data at ArrayExpress[arra] and discussions with the curation team, most submitted ADFs contain many errors and very seldom comply with the ADF specification. This is mainly due to the fact that the files are created by hand, so the item names are often incorrect. However, the data content is correct most of the time and the submission should be accepted. Consequently, the tool is flexible enough to allow usual mistakes (see annexe .4.4), which are corrected during the checking process. This flexibility allows two levels of checking:

Relaxed: allows the usual mistakes (as long as the data can still be identified). If the data are identified, the conversion process can proceed.

Strict: the file must exactly match the specification.
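The two checking levels could be sketched as follows. The normalisation rules shown (case, repeated whitespace, trailing colon) are illustrative examples of "usual mistakes"; the class and method names are hypothetical, not the converter's actual API.

```java
// Sketch of the two checking levels: strict requires an exact match
// against the specification, relaxed tolerates common hand-editing
// mistakes as long as the item can still be identified.
public class ItemMatcher {
    /** Strict mode: the header item must match the specification exactly. */
    public static boolean strictMatch(String expected, String found) {
        return expected.equals(found);
    }

    /** Relaxed mode: tolerate case, spacing and trailing-colon mistakes. */
    public static boolean relaxedMatch(String expected, String found) {
        return normalize(expected).equals(normalize(found));
    }

    private static String normalize(String s) {
        // Collapse whitespace, drop a trailing colon, ignore case.
        return s.trim().replaceAll("\\s+", " ").replaceAll(":$", "").toLowerCase();
    }
}
```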

And, for convenience, the tool can offer two execution modes during this step:
• a complete mode, which checks the whole data set; in that case, the process does not stop when an error is identified;
• a step-by-step mode: once an error is found, the process stops, allowing errors to be corrected one by one (for small data sets or known small numbers of errors).

File format validation: The control of the file format is a structural control performed before any attempt to assess the data content of the file. If the file structure does not comply with the structure specification (described in the appendix), a major error is reported, possibly causing the application to abort subsequent checks.

File content validation: Once the file format has been validated, a second checking step verifies that the type of data is compatible with the header declarations. The data content of each field is verified. This control covers the data type (string, integer...), all database tags, all database entries and the values supplied where controlled vocabulary usage is enforced. When a controlled vocabulary applies, terms are matched against lists of approved terms; most controlled terms are supplied by the MGED ontology. Several checking lists have been compiled for ADF file validation - see annexe.

Automatic curation: Thanks to the flexibility supplied by the different checking levels, it is possible to perform partial automatic curation in order to strictly match the ADF specification[adf], and thus ensure a correct file structure. Additionally, this functionality offers the user the possibility to save the data in new, correct ADF files exactly matching the ADF specification[adf]. In order to indicate that the tool has processed and curated the file, a tag is added to the original file name: the new curated ADF files (or workbook) are tagged with « _curated_ » between the filename and the file extension (adh, adr, adc or xls) to identify them in the filesystem. In the case of a simple microarray layout, composite data are generated automatically from the reporter data, and consequently the ADF CompositeSequences part (adc file or composites sheet) is generated by the application.

2.3.1.2 MAGE-ML validation:
1. Validation against the DTD: the validation of a MAGE-ML document is rather straightforward, as it can be checked simply against the Document Type Definition.
2. Validation against MAGE good practice: testing whether the MAGE good practices have been used to create the document is also necessary. This is a critical point in order to address the issue of file consistency and optimal usage of MAGE objects.
3. Validation of the controlled vocabulary (MGED ontology terms and databases).

Error report: For the correction of data files, warning and error information can be displayed on the standard output (command line), saved in a file (default), or even displayed in a window (GUI). For each file treated during the execution, all warnings and errors occurring during the process are reported in a separate log file, whose name is that of the treated file with a « .log » extension appended. For ADF files, the reported information refers to the ADF data line numbers.

2.3.1.3 Conversion

Once the files are verified (correct files and data), the conversion process can start. ADF files (workbook) are converted into a MAGE-ML file, or MAGE-ML files into ADF files (or workbook), as determined by the user.

ADF to MAGE-ML conversion: For this conversion, some additional groups of reporters and composite sequences are created automatically during the conversion from the FeatureReporter and CompositeSequence items. They are mainly grouped by control type, species, chromosome and chromosome band.
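The « _curated_ » file-name tagging described in the automatic-curation step above can be sketched as follows (the class and method names are hypothetical):

```java
// Sketch of the curated-file naming rule: the « _curated_ » tag is
// inserted between the base name and the extension (adh, adr, adc, xls).
public class CuratedName {
    public static String tag(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return fileName + "_curated_"; // no extension: just append the tag
        }
        return fileName.substring(0, dot) + "_curated_" + fileName.substring(dot);
    }
}
```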
These groups will be used in the database to simplify queries on the submitted data. When converting ADF files, the referenced DTD describing the MAGE-ML file structure of the output can be changed by the user from the default location - the MGED website[mgea] - to another, e.g. local, location.

MAGE-ML to ADF conversion: For this conversion, the main drawback of the MAGE-stk is that all entities are loaded in memory. MAGE-ML files usually contain many entities that are not related to the array description. This is an important point to keep in mind, because the application only uses a subset of all the MAGE objects. The application therefore tries not to consume a large quantity of memory, to avoid running out of it.
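The automatic group creation described above (grouping reporters by control type, species and chromosome) could be sketched as follows; the Reporter class here is an illustrative simplification, not a MAGE-stk class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reporter group creation: one group per
// distinct (control type, species, chromosome) combination.
public class ReporterGrouping {
    static class Reporter {
        final String name, controlType, species, chromosome;
        Reporter(String name, String controlType, String species, String chromosome) {
            this.name = name; this.controlType = controlType;
            this.species = species; this.chromosome = chromosome;
        }
    }

    /** Groups reporters under a composite key built from the three criteria. */
    public static Map<String, List<Reporter>> group(List<Reporter> reporters) {
        Map<String, List<Reporter>> groups = new LinkedHashMap<>();
        for (Reporter r : reporters) {
            String key = r.controlType + "/" + r.species + "/" + r.chromosome;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
        }
        return groups;
    }
}
```

Each resulting map entry would then be emitted as one group in the generated MAGE-ML.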

More information is represented in MAGE-ML than appears in the ADF tabular format. The difficulty is to find the right balance between data capture and ease of use: a MAGE-ML file must hold enough information to be MIAME compliant, without overloading submitters with too many fields when generating the ADF. From MAGE-ML to ADF, not all entities are converted, only the data corresponding to the array design; other entities are skipped. For every file given as parameter (ADF or MAGE-ML), the tool checks its compliance with the specifications[adf],[magc]. For example, some fields may only be filled in with ontology terms (controlled vocabulary), or have to respect format rules (i.e. database accession numbers). The tool must be flexible enough to allow usual mistakes and error cases. As for the ADF conversion, the MAGE-stk is used to read the MAGE-ML file and obtain MAGE-OM Java objects.

2.3.1.4 Conversion: If the input file has successfully passed validation steps 1 and 2, the conversion process is triggered.

2.3.1.5 User mode

Command-line: The command-line mode is mainly intended for the treatment of large quantities of files, as a batch mode, so it is expected to be fast and to require little user interaction. For now, the development work is focused on the command-line version, but a graphical user interface (G.U.I.) is planned to improve the tool's usability.

Figure 2.12: Command-line mode available functionalities.

In this mode, the user has four possibilities:
• Check an ADF (three files: adh, adr, adc, or a workbook) or several ADFs present in a directory given as parameter;

• Check and convert an ADF or several ADFs present in a directory given as parameter; • Check a MAGE-ML file or several MAGE-ML files present in a directory given as parameter; • Check and convert a MAGE-ML file or several MAGE-ML files present in a directory given as parameter.

2.4 Implementation - Technical Choices

2.4.1 How does it work?

The main goal of the ADF converter is to provide a platform-independent application capable of validating files in ADF format and converting them into MAGE-ML format, and of validating MAGE-ML files and converting them into the ADF tabular format. Building on these two core functionalities, additional features are added: different running modes for the application, an automatic curation facility, a log report extension, and a facility for new users to rapidly create valid ADF files.

2.4.1.1 The programming language

As the tool was supposed to use the available MAGE-stk, the choice of programming language was restricted to the only languages in which that API is available: Perl and Java. In addition, since the tool is specified as a standalone and platform-independent application, the choice was obvious: the Java language, thanks to its system-wide availability and easy installation. Currently, the Java platform (Java Virtual Machine) is available on the three main platforms used by submitters: Unix/Linux, Microsoft Windows and Mac. The Java version targeted is 1.4: even if the core functionalities should be compatible with version 1.2, it has been decided to use the Java/Swing libraries of Java 1.4 for the GUI.

Despite the use of SAX parsing for XML handling, when using the Java MAGE-stk API all MAGE objects (even those not needed) are loaded in memory. This causes an issue in terms of memory consumption and management, all the more critical as MAGE-ML files for large array designs can reach considerable sizes. The memory usage problem may be made even more complex by multiprocessing (in the case of batch-mode conversion). The issue has to be addressed to make sure that the application can run on an average computer.
2.4.1.2 Architecture / Components

To summarize, four distinct components can be identified in the tool:
1. Validation component
   (a) file format
   (b) file content
       i. MGED ontology checking
       ii. database checking
   (c) file curation/fixing component
2. File conversion component
   (a) group creation
3. Common tool component
   (a) file reader/writer module
   (b) configuration module
   (c) logging module
4. User mode component
   (a) command-line user interface

The different libraries have been chosen to be compatible with an open-source project (see figure 2.13):
1. programmatic interface to the MAGE object model: MAGE-stk Java API (MAGE software tool kit)
2. XML parsing: JAXB (Java Architecture for XML Binding)
3. DAML+OIL parsing, MGED Ontology parsing: from Kjell Petersen, BCCS Bergen Center for Computational Science
4. workbook/spreadsheet handling: jxl (Java Excel API)

Validation: The idea for the ADF validation is to describe the file structure in an XML file and to check the content of the ADF against the description contained in that XML file. Using an external description file provides enough flexibility to support new versions of the ADF format specification, and indirectly any tabular file. This idea came from the need for flexibility, as the new ADF specification was not stable at the time the project began. To structure the description XML files, an XML Schema has been developed. This schema represents a document (for example, an ADF) (see figure 2.15), which contains several files. These files can be composed of several datasets, which can contain several data tables separated by a table delimiter. A table is composed of a header, and the data of the table are delimited by a string, which can simply be a tabulation or a comma. The interesting point for the validation is that a header is composed of items. The schema for the description of tabular files is available at http://www.ebi.ac.uk/pierre/adf_converter/implementation/filestructurechecking/DocumentStructure.xsd and its documentation at http://www.ebi.ac.uk/pierre/adf_converter/implementation/filestructurechecking/doc.html. Each item is described by different attributes:
• patterns to identify it (for relaxed and strict checking)

[Figure: the ADFConverter library is used by command-line and GUI clients and by MIAMExpress, and relies on the JAXB API (XML file-structure description files), the JXL API (xls ADF files), the MO parsing API (MGED Ontology term file) and the MAGE-stk API (MAGE-ML files).]

Figure 2.13: Architecture of the tool

Figure 2.14: Picture of the whole XML Schema


Figure 2.15: Item structure description

• the item name
• the subname type or pattern to validate it
• the item cardinality, if the item is allowed to be present more than once
• the associated field value type name and pattern for controlled vocabulary
• the field cardinality (number of non-empty fields)
• the field multiplicity and value delimiter, if the field can contain several values
• the dependence on other items: if there is a value in the associated field, there must be a value in the dependent item's field
• for simple microarray layouts (no composite sequences provided), the matching item for composite sequence data may be provided

For dealing with the XML description file, there were several possibilities: create a simple Java handler with associated Java objects, or use the Java Architecture for XML Binding (JAXB) for an automatic binding between XML entities and Java objects. A handler would use one of the two well-known Java APIs for XML processing provided in the Java development kit, the Document Object Model (DOM) or the Simple API for XML (SAX); but the JAXB API allows a direct mapping of XML Schema entities and, of course, of the XML entities of the description file. Additionally, there is one description file per supported microarray application design, as the file header can have different items, cardinalities, etc., especially for the feature/reporter ADF file. Once the checking and curation are finished and correct, files can be converted.
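As an illustration, a single item entry in such a structure-description file might look like the fragment below. The element and attribute names are hypothetical; the authoritative schema is the DocumentStructure.xsd referenced above.

```xml
<!-- Hypothetical description-file entry for one ADF header item;
     strict and relaxed identification patterns, cardinality and the
     controlled-vocabulary value type are the attributes described above. -->
<item name="Reporter BioSequence Polymer Type"
      strictPattern="Reporter BioSequence Polymer Type"
      relaxedPattern="(?i)reporter\s+biosequence\s+polymer\s*type"
      cardinality="1"
      valueType="MGEDOntology:PolymerType"/>
```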

MGED ontology validation: Currently, MGED ontology terms are supplied on the MGED web site as a DAML+OIL file. DAML+OIL is a semantic markup language for Web resources. It builds on earlier W3C standards such as RDF and RDF Schema, and extends these languages with richer modelling primitives commonly found in frame-based languages. DAML+OIL was built from the original DAML ontology language, DAML-ONT (October 2000), in an effort to combine many of the language components of OIL. It is mainly used to exchange ontology data. This file is used to verify the terms contained in ADF and MAGE-ML files. A DAML+OIL parser for the MGED ontology has been provided by Kjell Petersen from the Bergen Center for Computational Science. Associated with the MAGE-stk API, this parser is able to test whether an OntologyEntry element from MAGE-ML has a correct value, or directly whether a MAGE object is associated with correct ontology entries, and to retrieve the possible values for a given category. However, terms provided as « new_term [proposed term] » are considered new terms, submitted to the MGED ontology team but not yet approved; they are simply accepted so that the conversion can proceed while the term is under submission. In the future, a specific mechanism for retrieving the latest MGED ontology version from the MGED ontology web site could be implemented for up-to-date checking.

Database validation: File content validation also includes checks on database tags and database accession numbers. The provided database tags must be approved by ArrayExpress; however, in the ADF format the user is able to declare his own databases in order to submit data referring to his own repository, in which case no checking is done on the accessions. The checking relies on pattern matching based on information gathered from the primary databases about their identifier formats.
Accession numbers provided by submitters are checked for format (but not for authenticity, which would be too costly) against a list of regular expressions specific to each database's accession number format - the appendix shows the full list of approved databases (http://www.ebi.ac.uk/~ele/ext/submitter.html). The approved database list is maintained internally at the Microarray Informatics Team by curators and provided as a Microsoft workbook on the EBI web site. It contains, for each approved database, its name, its associated tag and possibly a web link, some comments and a pattern (as a Perl regular expression) for accession numbers. The file validation module is used to retrieve the contained data and store them in the data structure presented previously.

Sequence validation: The control of supplied sequences is performed according to the given polymer type from the MGED ontology. According to the MGED ontology, the PolymerType category can have one of the following values:
• DNA: the sequence should be composed of ATGC uppercase characters; lowercase characters are accepted in the relaxed checking mode.
• RNA: the sequence should be composed of AUGC uppercase characters; the checking is case-insensitive in relaxed mode.
• protein: all letters and the « - », « _ » characters are accepted in sequences.
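The polymer-type-dependent sequence check described above can be sketched as follows; the alphabets follow the three cases listed, while the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch of the sequence check: the allowed alphabet depends on the
// MGED ontology PolymerType value; relaxed mode is case-insensitive
// for DNA and RNA, strict mode accepts upper case only.
public class SequenceValidator {
    public static boolean isValid(String polymerType, String seq, boolean relaxed) {
        String pattern;
        switch (polymerType) {
            case "DNA": pattern = relaxed ? "[ATGCatgc]+" : "[ATGC]+"; break;
            case "RNA": pattern = relaxed ? "[AUGCaugc]+" : "[AUGC]+"; break;
            case "protein": pattern = "[A-Za-z_-]+"; break; // letters plus « - » and « _ »
            default: return false; // unknown polymer type
        }
        return Pattern.matches(pattern, seq);
    }
}
```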

MAGE-ML / MAGE Objects: For the conversion from and to MAGE-ML, the MAGE software tool kit[mage] (MAGE-stk) is used. It deals with MAGE-OM[magd] objects: it converts MAGE-OM Java objects into MAGE-ML entities and vice versa. To summarize, it is a MAGE-ML file reader/writer. However, for the conversion to ADF, only a subset of the MAGE-OM packages and objects is needed - those related to the array description (nothing from QuantitationType, BioAssay, BioMaterial and so forth).

MAGE identifier creation: The tool automatically generates identifiers for all MAGE objects requiring them. In agreement with the MAGE best practices agreed during the MAGE Jamboree held at the EBI in Hinxton in December 2003, MAGE identifiers are used. A MAGE identifier is supposed to be unique and is considered a Life Sciences Identifier (LSID), an I3C and OMG Life Sciences Research (LSR) Uniform Resource Name (URN), supported by IBM. As the specifications are imprecise, a more specific scheme has been defined for the generated identifiers:

Rule: identifier = "$usr_provided_url:$usr_provided_namespace/ObjectType:A-ABCD-0000.ObjectInitial-number"
Example: identifier = "ebi.ac.uk:impression/Feature:A-ABC-0000.F-1"
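Following the rule and example above, identifier generation could be sketched as below. The array accession ("A-ABC-0000" in the example) and the object-initial convention are taken from the example; the class and method names are hypothetical.

```java
// Sketch of the MAGE identifier scheme quoted above:
// url:namespace/ObjectType:arrayAccession.ObjectInitial-number
public class MageIdentifier {
    public static String build(String url, String namespace, String objectType,
                               String arrayAccession, int number) {
        String initial = objectType.substring(0, 1); // e.g. "F" for Feature
        return url + ":" + namespace + "/" + objectType + ":"
                + arrayAccession + "." + initial + "-" + number;
    }
}
```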

2.4.1.3 Library

To facilitate the integration of new user modes, a library has been developed. It initialises the application (configuration) from the given parameters and calls the appropriate method for treatment. This separates the data treatment from the interaction with the user: user modes only handle the parameters given by the user and transmit them to the library.

2.4.1.4 Logging module

This module is used to report information to the user, mainly errors or warnings about a data file, written to a file. For this, the Log4j API from the Apache foundation is used. Log4j has the advantage of supporting different kinds of output (file, standard output, graphical window, ...), different output formats (pattern, HTML, ...) and different logging levels, which is quite useful for debugging. It also allows changing the output dynamically.

2.4.2 Technical issues

Several issues appeared during the application development. First, when the project started, the specification of the new ADF file format was not stable, making it difficult to make some technical choices without taking the risk of having to re-engineer large parts of the project in case of specification modification. The tool has thus been developed with enough flexibility to support modifications to the ADF specification. Then, during the test step, memory and CPU consumption problems appeared. To solve them, some components were rewritten, and object pools were added to avoid recreating unnecessary objects and to release objects as soon as they were no longer used. This modification halved the amount of memory used.
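An object pool along the lines described above could look like the following generic sketch; it is an illustration of the technique, not the converter's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Minimal object-pool sketch: instead of recreating short-lived helper
// objects for every parsed row, they are borrowed from and returned to
// a pool, reducing allocation and garbage-collection pressure.
public class SimplePool<T> {
    private final Deque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;

    public SimplePool(Supplier<T> factory) { this.factory = factory; }

    /** Reuses an idle object when available, otherwise creates a new one. */
    public T borrow() {
        T obj = free.poll();
        return obj != null ? obj : factory.get();
    }

    /** Returns an object to the pool once it is no longer used. */
    public void release(T obj) { free.push(obj); }

    public int idleCount() { return free.size(); }
}
```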

2.5 Available resources

At the end of the project, the tool will be available on the E.B.I. website: http://www.ebi.ac.uk/adf. The development team of the Microarray Informatics Team will then support it. The project (source, documentation, ...) should move to SourceForge, an open-source software development website that provides free services to open-source developers. On SourceForge, the project will be open to many developers external to the EBI and to many users, ensuring wide distribution and plenty of feedback thanks to the facilities SourceForge provides, such as mailing lists and bug reports.

2.6 Future improvements

Some enhancements could be added to the application via a graphical user interface (G.U.I.), in addition to its two main functionalities. These enhancements have been specified since the early stages of the project as an outlook. They could be added as an integral part of the application or as an external application.

2.6.1 Graphical User Interface (G.U.I)

After the command-line mode, a GUI could be developed on top of the checking and conversion functionalities. The GUI should be simple and easy to use. To the basic functionalities of checking and converting one or several files - even a whole directory -, two features could be added: a visualisation of an ADF (the data contained in the three files) and an ADF creation helper, to graphically create correct ADFs.


Figure 2.16: GUI mode functionalities.


2.6.2 ADF creator helper

A helper tool could be a great improvement to the application: users could directly create ADFs from the GUI. It could deal with mandatory items (directly included) and controlled terms - including the MGED Ontology - through a list of accepted terms for each corresponding field. This functionality would help the user create the ADF file, and the corresponding MAGE-ML file, without having to create files by hand; it would thus avoid errors and produce files strictly compliant with the ADF specification. Usually, submitters have feature, reporter and composite sequence data in files generated by a spotting robot, so the user should be able to import data from these files or to copy and paste data into the relevant fields. This functionality would not cover all microarray applications; it would focus on the following application cases: G.E., C.G.H. and ChIP.

2.6.2.1 Helper tool: Annotation component

Based on the experience gained by the ArrayExpress curation team, it appeared necessary to extend or, more precisely, to harmonize the biological annotation of microarrays, both in terms of format and content. The tool could retrieve annotation from one of the most popular resources for microarray annotation, GO annotation (http://www.geneontology.org), or by querying the Ensembl data mart, EnsMart (http://www.ensembl.org/EnsMart). These two functionalities would significantly ease the submission of MIAME-compliant data in ADF files, provided the species of interest are supported by the GO consortium and Ensembl.


Chapter 3

The training period

This thesis concludes my training period at the EBI, which lasted six months, from May 4th to October 31st, in the Microarray Informatics Team. During this period, I worked in an international environment, composed of people coming from different countries and with different backgrounds (biologists, computer scientists...).

3.1 Work environment

The EBI is a European institute, so most of the people come from all over Europe, with different cultures and languages, but the working language is English. Throughout the period, I spoke English to interact with other people and to present my project. Even if it was difficult, after a while it became easier and I got used to it. The team itself is composed of people with different backgrounds: developers, curators and researchers. Every week, a student meeting was organised with developers and curators, where each student presented the progress and issues of his or her project; in this way, students could receive feedback and advice. This was complemented by direct meetings with the project supervisors when necessary, for specific points or brainstorming. A talk with all group members was also given each week to present the current progress of each staff member. The philosophy of development is to provide data and tools to anybody, so the different projects, including this one, are open source: redistribution and use in source and binary forms, with or without modification, are permitted without specific limitations.

3.1.1 A full project
When the project started, no similar tool existed; only a few tools covered parts of the problem. The project started from scratch and choices had to be made. The majority of the choices were made during specification drafting, after an analysis of the needs and of the state of the art. Then, throughout the period, I carried out the project autonomously, with some guidelines from my supervisors for the tool design and implementation. I was free to propose techniques or APIs, and I had the final choice. But, as the aim was to offer a flexible tool, I always kept in mind that every part or API could be replaced by another module, to simplify maintenance; for example, if a new version of the ADF is approved, new libraries can be used more efficiently than in the current tool. Choices matter for performance, maintenance, usability and integration into other projects. Most choices were made while writing the specification, but some technical ones were made during the design. It is also important to search for and analyse previous work, to identify possible issues, drawbacks and guidelines to respect. This project gave me the possibility to do my training period abroad, with skilful and motivated people. During the period, I met people coming from different countries and research centres; all were very interested in the project.

3.2 Programming skills
During the period, I improved my computing skills, in addition to my master's degree courses taken this year, and it was an opportunity to see and use new development practices and tools:
• Eclipse: an integrated development environment (IDE), well suited to this project.
• UML: the Unified Modeling Language, used to design the application before its implementation.
• Java: the object-oriented programming language, with additional libraries.
• XML: the eXtensible Markup Language, and the Java Architecture for XML Binding (JAXB), which provides a convenient way to bind an XML schema to a representation in Java code.
• Ant: a tool for automating compilation and file release (considered a replacement for make, and much more).
• CVS: the Concurrent Versions System, for program version management, providing anybody with access to the source code and an indirect backup of the data.
• IzPack: a Java open source installer, provided with the tool release for effortless installation.
3.2.0.1 An open project
This project required communication skills, especially for discussions with team members about computing (existing methods and tools, technical issues...), but also about biology (microarray concepts, applications, needs...). This interaction was both internal and external to the team. I met and spoke with several people from different research centres who were interested in my project, in particular Michael Miller from Rosetta Inpharmatics, Seattle, and the Java MAGE-stk development team (reporting bugs...). In general, there is a real need for this kind of tool and for standards; at the same time, they must be simple, simple to use, and must reduce the time spent on data manipulation. It was motivating to know that the developed tool will be widely used.

3.2.1 Involvement in an open source project
Open source means that the source code will be freely available to anybody wanting to modify and improve the application. It allows bugs to be found and reported easily and quickly, and developers can even propose patches for correction. However, it requires well-commented and well-documented source code. It is rewarding to know that the tool will be widely used, easily maintainable, and modifiable and improvable at will, without limitation. At the time of writing, as the tool has not yet been publicly released, the source code is not available. But as soon as the tool is released, it will be tested and bugs will be reported via a mailing list. A mailing list is an important part of a project: it is the first place to look when resolving a problem (after the documentation). Searching the archives shows whether a problem has already occurred and how to solve it, or lets you report the problem for feedback from the developers. The MAGE-stk mailing list was used to identify problems and to work around them before a new release.

3.3 Contribution to the Team
This project will provide a simple tool that fills a gap. Currently, the curation team receives submitted data and there is no tool that checks all of it: most of the tools in use only convert data without checking, assuming the data are valid, and most of the curation work is to check all data by hand. This tool will therefore produce a real time gain for curation. Additionally, all data output by the tool are curated, so data are also standardized, allowing better comparison of experiments and more consistent data in the databases.



Conclusion
There is a growing consensus in the life science community on the need for public repositories of gene expression data, analogous to DDBJ/EMBL/GenBank for sequences. Some of the reasons:
• Gradually building up gene expression profiles for various organisms, tissues, cell types, developmental stages and states, under the influence of various compounds
• Building up, through links to other genomics databases, systematic knowledge about gene functions and networks
• Comparison of profiles, and access to and analysis of data by third parties
• Cross-validation of results and platforms - quality control
The outcome of this project will be an important resource for the user community and the ArrayExpress infrastructure, accelerating array design exchange in a standard format. Ensuring conversion from MAGE-ML to ADF will enhance visualization of the array design annotation. Combining MAGE-ML and ADF will assist users in choosing the most approachable solution for their needs and resources. Facilitating standardization will demonstrate the feasibility of gathering MIAME-required information for complex array designs in an effective manner. The developed tool will simplify the submission of microarray layouts, especially the creation of standardized data. The aim of the developed tool was to meet the needs of users who deal with several file formats, each different from the others but all containing the same data. Standardisation is the way to make better use of costly data and to obtain efficient and precise results more quickly. The MAGE-ML format has many advantages over existing formats: it is free, not tied to a manufacturer, and, as an application of XML, well suited to computer processing and perfectly adapted to data exchange. Because of its complexity, the need for tools such as the one developed for this project is increasing, to ease the work of submitters and curators and let them focus on the data and nothing else.
Moreover, professional life at the EBI was really instructive: the presence of scientists from different fields (informatics, biological sciences, bioinformatics) and of different nationalities allowed me to learn a great deal and to meet very interesting people. As a more personal benefit, living for six months in a different culture opens the mind and is intensely enriching. Besides, Cambridge, thanks to its cosmopolitan character, offers more cultural, social and intellectual possibilities than most cities in the United Kingdom. In summary, beyond the informatics and bioinformatics benefits, these six months in the UK were an unforgettable experience.


Glossary
P.C.R.: Polymerase Chain Reaction, http://allserv.rug.ac.be/~avierstr/principles/pcr.html
CGH: Comparative Genomic Hybridisation, http://amba.charite.de/cgh/cgh01.html
ChIP: Binding-site identification by Chromatin immunoprecipitation, http://www.bio.brandeis.edu/haberlab/jehsite/chip.html

Batch mode: this mode allows a limited subset of commands to be started and run from the command line, with little or no user interaction. Batch processing is the en-masse execution of a series of jobs on a computer.
Ontology: From the Stanford Knowledge Systems Lab: « An ontology is an explicit specification of some topic. For our purposes, it is a formal and declarative representation which includes the vocabulary (or names) for referring to the terms in that subject area and the logical statements that describe what the terms are, how they are related to each other, and how they can or cannot be related to each other. Ontologies therefore provide a vocabulary for representing and communicating knowledge about some topic and a set of relationships that hold among the terms in that vocabulary. »
Graphical user interface (GUI): A program interface that takes advantage of the computer's graphics capabilities to make the program easier to use. Well-designed graphical user interfaces can free the user from learning complex command languages. On the other hand, many users find that they work more effectively with a command-driven interface, especially if they already know the command language.
Validity: A document is valid if the data or information given in the document have a correct semantics.
Well-formedness: A document is well formed if data or information are given in a specified order and respect a given document structure.
Array design: The layout or conceptual description of an array that can be implemented as one or more physical arrays. The array design specification consists of the description of the common features of the array as a whole, and the description of each array design element (e.g., each spot). MIAME distinguishes three levels of array design elements: feature (the location on the array), reporter (the nucleotide sequence present in a particular location on the array), and composite sequence (a set of reporters used collectively to measure the expression of a particular gene).
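The difference between well-formedness and validity can be illustrated with a short non-validating parse in Java. This is a sketch checking well-formedness only; checking validity would additionally require a DTD or schema (such as the MAGE-ML DTD):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Sketch: a non-validating parse succeeds on any well-formed XML
// document; checking validity would additionally need a DTD or schema.
public class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;  // parse succeeded: the document is well formed
        } catch (Exception e) {
            return false; // parse failed: not well formed
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>"));  // true
        System.out.println(isWellFormed("<a><b></a>"));   // false: mismatched tags
    }
}
```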



Annexes


.1 MicroArray layout
.1.1 Simple microarray design

In a simple microarray layout, a genetic element is represented by a single sequence. It may be present in more than one location on a glass slide, as part of a technical replication scheme defined when engineering the microarray layout. Translating this design into MAGE terms, several Features (i.e. several physical locations defined by their coordinates) receive the same « Reporter », which is associated with a particular sequence. That particular Reporter reports on the genetic element of interest, which can be extensively characterised. As MAGE good practice requires that biological information be attached to a CompositeSequence object, it is necessary to create such an object. In MAGE terms, this adds a layer of complexity, which may sound redundant to a wet-lab scientist.

Figure 1: Simple microarray layout: one probe represents one genetic element.

Simple microarray design in ADF
A simple organization is assumed for the microarray layout, i.e. one probe represents one genetic element. Building on that flat relationship, all CompositeSequences, as well as all ancillary objects (ReporterCompositeMaps), can be generated automatically. The same BioSequence object will therefore be referenced at the level of the Reporter and at the level of the CompositeSequence (which may cause some discussion). As this creation can be performed automatically, the burden of ADF creation can be reduced by leaving the "ADC" component out of the submission.
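The automatic generation described above can be sketched as follows. The class and method names are hypothetical, not the actual MAGE-stk objects: in the flat one-probe-per-gene case, every distinct Reporter yields exactly one CompositeSequence referencing the same sequence.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the flat case: one CompositeSequence is derived per distinct
// Reporter, both referencing the same BioSequence. Names are illustrative.
public class FlatDesign {
    record Reporter(String name, String bioSequence) {}

    // Maps each generated CompositeSequence name to its BioSequence.
    static Map<String, String> autoComposites(List<Reporter> reporters) {
        Map<String, String> map = new LinkedHashMap<>();
        for (Reporter r : reporters) {
            // Technical replicates (same Reporter on several Features)
            // still produce a single CompositeSequence.
            map.putIfAbsent(r.name() + "_comp", r.bioSequence());
        }
        return map;
    }

    public static void main(String[] args) {
        var reps = List.of(new Reporter("rep1", "seqA"),
                           new Reporter("rep1", "seqA"),  // technical replicate
                           new Reporter("rep2", "seqB"));
        System.out.println(autoComposites(reps)); // {rep1_comp=seqA, rep2_comp=seqB}
    }
}
```

Because this derivation is mechanical, the submitter never has to write the "ADC" component for a simple design.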

.1.2 Complex microarray design
In the case of a more complex microarray design, Reporters are engineered so that they can be combined in many different ways, for example to monitor the expression of splice variants. With such arrays, two levels of measurement can be made:
- the measurement of the signal of every individual probe
- the measurement of combinations of individual probes
In MAGE terms, in the first case measurements are made at the level of the Reporters; in the second case, they are made at a higher level, that of the CompositeSequences. MIAME requires biologists to submit the precise relationship between CompositeSequences and the Reporters that make them up, as they are part of the experimental design used to assess gene expression.

Complex microarray design in ADF
In the case of a more complex microarray design, the MIAME document requires biologists to submit the precise relationship between CompositeSequences and the Reporters that make them up, as they are part of the experimental design used to assess gene expression. Reporting such information is achieved through the submission of the "ADC" component, whose format was presented earlier in this document. The creation of the "ADC" component is assumed to be straightforward for anyone engineering such a complex design. The use of the "ADC" component cuts redundant information down to a minimum, creating a format that is easier to read: all biologically relevant information is confined to a separate component, distinct from the actual probe-level component, which is restricted to the "ADF" component.

In addition, such an organization allows Reporter-to-CompositeSequence mappings to be updated as new annotations and/or knowledge emerge.
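A minimal sketch of such a Reporter-to-CompositeSequence map (all names are hypothetical), for an array monitoring two splice variants built from shared exon reporters:

```java
import java.util.List;
import java.util.Map;

// Sketch of a ReporterCompositeMap for a splice-variant design: each
// CompositeSequence (transcript) lists the exon Reporters that make it
// up, and a Reporter may belong to several composites.
public class CompositeMap {
    static Map<String, List<String>> reporterCompositeMap() {
        return Map.of(
            "transcript_v1", List.of("exon1", "exon2", "exon3"),
            "transcript_v2", List.of("exon1", "exon3"));  // variant skipping exon2
    }

    public static void main(String[] args) {
        // Updating annotation means editing only this map, not the
        // probe-level "ADF" component.
        for (var e : reporterCompositeMap().entrySet())
            System.out.println(e.getKey() + " <- " + e.getValue());
    }
}
```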

.2 ArrayExpress Licence

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. The name ArrayExpress must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact arrayexpress@ebi.ac.uk
4. Products derived from this software may not be called "ArrayExpress" nor may "ArrayExpress" appear in their names without prior written permission of the ArrayExpress developers.
5. Redistributions of any form whatsoever must retain the following acknowledgment: "This product includes software developed by ArrayExpress (http://www.ebi.ac.uk/arrayexpress)"

THIS SOFTWARE IS PROVIDED BY THE ARRAYEXPRESS GROUP "AS IS" AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE ARRAYEXPRESS GROUP OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The European Bioinformatics Institute may publish revised and/or new versions of this license with new releases of ArrayExpress software. Copyright (c) 2002 The European Bioinformatics Institute. All rights reserved.

.3 Approved Databases
Database entries occur in the Description package as references to records in a database. Database_ref identifiers are found in the BioMaterial and BioSequence packages. The database codes are prefixed with ebi.ac.uk:Database:. Database codes:


Code (Database Name): Details
atcc (ATCC): American Type Culture Collection.
affymetrix (Affymetrix): Affymetrix.
astra_hpylori (Helicobacter pylori Genome Database): contains annotation and relationships for all putative open reading frames from H. pylori strain J99 and strain 26695.
blocks (Blocks): multiple alignments of conserved regions of protein families.
candidadb (CandidaDB): genomic and protein sequence information and relevant annotation related to the human fungal pathogen Candida albicans.
catma (CATMA): Complete Arabidopsis Transcriptome MicroArray; contains Gene Sequence Tags (GSTs) covering most Arabidopsis genes, primarily for use in transcription profiling DNA arrays.
compugen (Compugen): database for Compugen 75bp oligo sets, which carry an ID of type CGEN_MOUSE_3001299_1.
cp450 (Cytochrome P450): Cytochrome P450 homepage.
dbsnp (dbSNP): single nucleotide polymorphism database; NCBI assigns reference SNP (rs) IDs to SNPs that appear to be unique in the database.
embl (DDBJ/EMBL/GenBank): DNA Data Bank of Japan/European Molecular Biology Laboratory/GenBank genetic sequence database.
ensembl (Ensembl): up-to-date sequence annotation for eukaryotic genomes.
ens_fam_id: Ensembl family ID.
ens_gene_id: Ensembl gene ID.
ens_trscrpt_id: Ensembl transcript ID.
entrez (Entrez): text-based search tool at NCBI for major databases (incl. PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy).
entrez_protein (Entrez Protein): protein entries from various sources (incl. SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq).
ec (Enzyme Commission): NC-IUBMB, general information on enzyme nomenclature plus a list of EC numbers.
expasy (ExPASy): Expert Protein Analysis System; analysis of protein sequences and structures as well as 2-D PAGE.
flybase (FlyBase): Drosophila genome database.
flybase_bt (FlyBase: Body Part): FlyBase controlled vocabularies for body parts.
flybase_dv (FlyBase: Developmental Stage): FlyBase controlled vocabularies for developmental stages.
genbank (GenBank): see DDBJ/EMBL/GenBank.
gdb (GDB): The Genome Database; human genes and genomic maps.
genecards (GeneCards): database of human genes, their products and their involvement in diseases.
genedb (GeneDB): database resource for Schizosaccharomyces pombe, Leishmania major and Trypanosoma brucei.
genesnps (GeneSNPs): integrates gene, sequence and polymorphism data into individually annotated gene models.
genew (Genew): human gene nomenclature database search engine.
genmapp (GenMAPP): Gene MicroArray Pathway Profiler.
go (GO): Gene Ontology (biological process, cellular component and molecular function).
gpcrdb (GPCRDB): information system for G protein-coupled receptors (GPCRs).
gxd (GXD): Gene Expression Database (mouse).
hgmd (HGMD): Human Gene Mutation Database; contains known (published) gene lesions underlying human inherited disease.
hgvbase (HGVbase): curated human polymorphisms.
howdy (HOWDY): Human Organized Whole genome Database; integrated human genomic information.
hugo (HUGO): Human Genome Organisation - human gene symbols.
image (IMAGE Consortium): Integrated Molecular Analysis of Genomes and their Expression (I.M.A.G.E.) consortium clone resource center.
incyte (Incyte Genomics Proteome BioKnowledge Library): private database - access restricted to Incyte customers.
interpro (InterPro): useful resource for whole genome analysis.
jsnp (JSNP): database of Japanese Single Nucleotide Polymorphisms.
kegg (KEGG): Kyoto Encyclopedia of Genes and Genomes.
locus (LocusLink): contains information on genetic loci.
medline (MEDLINE): bibliographic database.
mens (MENS): Medicago EST Navigation System.
mgc (MGC): Mammalian Gene Collection.
mgd (MGD): Mouse Genome Database.
MO (MGED Ontology): the Microarray Gene Expression Data Society Ontology; an ontology for microarray experiments.
mips (MIPS): Munich Information Center for Protein Sequences.
mtb (MTB): Mouse Tumor Biology database.
musage
nasc (NASC On-Line Catalogue): the Nottingham Arabidopsis Stock Centre provides seed and information resources to the International Arabidopsis Genome Programme and the wider research community.

Figure 2: List of approved databases for ArrayExpress.

ncbitax (NCBI Taxonomy): Taxonomy browser.
nci_meta (NCI Metathesaurus): based on NLM's Unified Medical Language System Metathesaurus, supplemented with additional cancer-centric vocabulary.
netaffx (NetAffx): Affymetrix NetAffx analysis center.
nextdb (NEXTDB): the Nematode Expression Pattern DataBase.
nia_nih (NIA/NIH Mouse Genomics): mouse genomics home page of the Laboratory of Genetics, National Institute on Aging, National Institutes of Health.
omim (OMIM): Online Mendelian Inheritance in Man.
omni1 (OmniArray): OmniArray MicroArray Analysis tool; B. pseudomallei sequence database 1.
omni2: B. pseudomallei sequence database 2.
pdb (PDB): Protein Data Bank.
pfam (Pfam): multiple sequence alignments and hidden Markov models of common protein domains.
pharmgkb (PharmGKB): the Pharmacogenetics and Pharmacogenomics Knowledge Base; variation in drug response based on human variation.
pir (PIR): Protein Information Resource.
pkr (PKR): the Protein Kinase Resource.
pkr_hanks (PKR: Hanks Classification): eukaryotic protein kinase superfamily organised into distinct families that share basic structural and functional properties (classification by Steven K. Hanks).
plasmodb (PlasmoDB): the Plasmodium Genome Resource.
populusDB (PopulusDB): Populus tremula x tremuloides genomic sequence database.
pseudomonas (Pseudomonas genome project): Pseudomonas aeruginosa genome annotation.
pubmed (PubMed): bibliographic database.
refseq (RefSeq): NCBI Reference Sequence project.
rgd (RGD): Rat Genome Database.
rgd_qtl (RGD: QTL): Rat Genome Database: Quantitative Trait Locus.
riken (Riken).
rzpd (RZPD): Resource Center and Primary Database.
sanger (Sanger Institute Human Genome Project): human mapping and sequencing information.
scop (SCOP): Structural Classification of Proteins.
sgd (SGD): Saccharomyces Genome Database.
stack (STACKdb): Sequence Tag Alignment and Consensus Knowledgebase; non-redundant, gene-oriented clusters.
subtilist (SubtiList): Bacillus subtilis database.
sulfolobus (Sulfolobus P2): Sulfolobus P2 annotation database.
swall (SWALL): non-redundant protein sequence database (SwissProt+Trembl+TremblNew).
swissprot (Swiss-Prot): curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.
tair (TAIR): the Arabidopsis Information Resource.
tigr_atdb (TIGR: AtDB): the TIGR Arabidopsis thaliana Database.
tigr_cmr (TIGR: CMR): Comprehensive Microbial Resource at TIGR; contains all of the bacterial genome sequences completed to date.
tigr_cmr_hpylori26695 (TIGR: CMR H. pylori 26695): Comprehensive Microbial Resource at TIGR for Helicobacter pylori 26695.
tigr_cmr_hpylorij99 (TIGR: CMR H. pylori J99): Comprehensive Microbial Resource at TIGR for Helicobacter pylori J99.
tigr_egad (TIGR: EGAD): the Expressed Gene Anatomy Database at TIGR.
tigr_ego (TIGR: EGO): TIGR ortholog database - linking orthologous genes across eukaryotic organisms.
tigr_mgi (TIGR: MGI): TIGR Mouse Gene Index.
trembl (TrEMBL): computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.
tsc (The SNP Consortium Ltd.): the TSC database contains details of single nucleotide polymorphisms (SNPs) that have been discovered and characterised by the TSC.
toxoest (ToxoEST): Toxoplasma gondii clustered EST database.
tuberculist (TubercuList): genomic information on tubercle bacilli such as M. tuberculosis.
unigene (UniGene): non-redundant, gene-oriented clusters.
uw_ecoli (UW E. coli genome project): the University of Wisconsin E. coli Genome Project.
wormbase (WormBase): genome and biology of C. elegans (as of 28/8/2002, the genome browser shows preliminary gene predictions for C. briggsae).

Figure 3: Approved databases for ArrayExpress.

.4 ADF specification overview
.4.1 Item names

« Header » spreadsheet / Header file - « .adh » : contains additional information useful for future use (e.g. by the database), such as the date of public release and the submitter name, or needed for the conversion to MAGE-ML.
• Audit & Security information :
– Person LastName
– Person email
– Person FirstName
– Person office phone
– Institution/Company
– Department
– URL
– Address
– Zip code/PO box
– City
– State/Province
– Country
• Array Design Description information :
– Array Design Name
– Array Version
– Array Design Number of Features
– Application
– Technology Type
– Surface Type
– Substrate Type
– Array Description
– Array Protocol
• Miscellaneous information :
– Release Date (date of public release)
– Submission Date (date of submission to the repository)
– User defined value (field used for a user-defined database)
– FeatureReporter file path (only in the case of three plain text files)
– CompositeSequence file path (only in the case of three plain text files)
Items in italics are optional. All others are mandatory.
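Assuming a simple tab-delimited « item name / value » layout for the header file (an illustration only, not the normative ADF syntax), reading the header and flagging missing mandatory items could look like this. The mandatory items listed come from the specification above; the parsing format is an assumption:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: parse header lines of the form "Item name<TAB>value" and flag
// missing mandatory items. The tab-delimited layout is an assumption
// for illustration; see the ADF specification for the real format.
public class AdhHeader {
    // A subset of the mandatory header items from the specification.
    static final List<String> MANDATORY = List.of(
        "Person LastName", "Person email", "Array Design Name", "Release Date");

    static Map<String, String> parse(List<String> lines) {
        Map<String, String> items = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab > 0)
                items.put(line.substring(0, tab), line.substring(tab + 1).trim());
        }
        return items;
    }

    static List<String> missingMandatory(Map<String, String> items) {
        return MANDATORY.stream().filter(k -> !items.containsKey(k)).toList();
    }

    public static void main(String[] args) {
        var items = parse(List.of("Person LastName\tMarguerite",
                                  "Array Design Name\tTestArray"));
        System.out.println(missingMandatory(items)); // [Person email, Release Date]
    }
}
```

A helper tool built on such a check could refuse to export MAGE-ML until every mandatory item has a value.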


« FeatureReporter » spreadsheet / FeatureReporter file - « .adr » : contains Feature and Reporter data. This file corresponds to the previous ADF format, even though slight modifications have been incorporated (see below). Each row corresponds to a spot on the array: its position and the biological information on the reporter.
Feature data :
• MetaColumn
• MetaRow
• Column
• Row
Reporter data :
• Reporter Name
• Reporter ControlType
• Reporter BioSequence Type
• Reporter BioSequence PolymerType
• Reporter BioSequence DatabaseEntry [db-tag]
• Reporter FailType
• Reporter WarningType
• Reporter BioSequence [Actual Sequence]
• Reporter Group [reportergroup tag]
Items in italics are optional. All others are mandatory. These items cover the common case, the Gene Expression application use case.

For the ChIP oligo-based and PCR-based application use cases, another BioSequence object is added, called the assigned gene :
• Assigned_Gene Name
• Assigned_Gene Database Entry [db-tag]
In this case, the « Reporter Name » item is renamed « Reporter Name [intergenic] », and there are some optional items :
• Chromosome
• Chromosome_band
In addition, for the PCR-based use case, there are two optional items :
• Reporter BioSequence [Fwd Primer]
• Reporter BioSequence [Rev Primer]
For the CGH use case :
• Species
• Chromosome : not mandatory but highly recommended for CGH
• Chromosome_band : recommended field for CGH

« Composites » spreadsheet / CompositeSequence file - « .adc » (optional) : only when a more complex array design has been created and Reporters can be combined to represent different genetic elements; typically, the case of an array devised to monitor splice-variant events. Reporters representing exons can be grouped in various ways to represent the different transcripts. This third file allows the description of such relationships between the Reporters (individual elementary segments) and the CompositeSequences (representing the various transcripts); there is no one-to-one Reporter-CompositeSequence relation. Basically, a CompositeSequence carries the same information as a Reporter, plus a map of the relation between the CompositeSequence and several Reporters, or with several CompositeSequences.
Items :
• CompositeSequence Name
• CompositeSequence BioSequence [Actual Sequence]
• CompositeSequence BioSequence Type
• CompositeSequence BioSequence Polymer Type
• CompositeSequence BioSequence Database Entry [db-tag]
• CompositeSequence Annotation Database Entry [db-tag]
• ReporterCompositeMap
• Chromosome
• Chromosome_band
Items in italics are optional. All others are mandatory.

.4.2 Item information
.4.2.1 Header items (see figure 4)
.4.2.2 Feature and Reporter file (see figures 5, 6, 7 and 8)

Item name | Presence in file header | Type | Comment

Audit & Security:
Person LastName | M | Free text |
Person FirstName | M | Free text |
Person email | M | Free text | Free text with an @ or AT (anti-spam)
Person office phone | O | List of numbers, may begin with « + » |
Institution/Company | M | Free text |
Department | M | Free text |
URL | M | Website address | /http[s]://.*.[a-z][a-z][a-z]*/
Address | O | Free text |
Zip code/PO box | O | Free text |
City | O | Free text |
State/Province | O | Free text |
Country | M | Free text |

Array Design Description:
Array Design Name | M | Free text |
Array Version | M | Free text |
Array Design Number of Features | O | Integer |
Application | M | Ontology | Possible values: SNP, CGH, ChromatideIP, GE
Technology Type | M | MGED Ontology term |
Surface Type | M | MGED Ontology term |
Substrate Type | M | MGED Ontology term |
Attachment Type | M | Ontology term |
Strandedness Type | M | Ontology term |
Array Description | O | Free text |
Array Protocol | M | Free text |
Release Date | M | Date in ISO 8601 (MAGE-OM) | Given by curator
Submission Date | M | Date in ISO 8601 (MAGE-OM) |
User defined value | O | Free text | For an unknown database

Figure 4: ADF Header item information. Legend: O: optional item; M: mandatory item.
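The URL field above can be checked programmatically. Below is a minimal sketch using Java's java.util.regex; the pattern is a slightly relaxed assumption based on the table's pattern (the « s » is made optional, since /http[s]:/ as printed would reject plain http:// addresses, and the dot before the top-level domain is escaped):

```java
import java.util.regex.Pattern;

public class UrlFieldCheck {
    // Hypothetical relaxation of the table's pattern /http[s]://.*.[a-z][a-z][a-z]*/ :
    // optional "s", escaped dot, top-level domain of at least two letters.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://.*\\.[a-z][a-z][a-z]*");

    public static boolean isValidUrl(String value) {
        return URL_PATTERN.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidUrl("http://www.ebi.ac.uk"));
        System.out.println(isValidUrl("ftp://ftp.ebi.ac.uk"));
    }
}
```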


Each entry below gives: header item; cardinality; presence in file header; presence of a value in field; location; condition(s), comment, number of values in a field; information; field type.

MetaColumn; 1..1; M; M; column 1; single value; zone column; integer.

MetaRow; 1..1; M; M; column 2; single value; zone row; integer.

Column; 1..1; M; M; column 3; single value; column in the zone; integer.

Row; 1..1; M; M; column 4; single value; row in the zone; integer.

Reporter Name [intergenic] (the [intergenic] tag is optional: only for ChIP arrays!); 1..1; M; M; column 5; single value, must be unique; the name of the reporter; free text.

Reporter Control Type (ATTENTION: see Reporter Group [role], as this item will be discarded); 1..1; M; M; column 6; single value; if the reporter is not a control: not_control; if the reporter has a correct control_type, it is in the control role group, otherwise it is in the Experimental group; if the reporter is a control, the kind of control must be indicated; descriptors for the type of control design element; ontology entry (if not a control: not_control, check with the ontology if possible).

Reporter FailType; 0..1; O; O; after Reporter Control Type; single value, if there is a failure it is indicated by a type; descriptors of failures (as in PCR) associated with reporters; ontology entry.

Reporter WarningType; 0..1; O; O; after Reporter Control Type, or Reporter FailType if present; single value, if there is a warning it is indicated by a type; descriptors of the warnings associated with reporters; ontology entry.

Reporter Comment (to be renamed Reporter Description); 0..n; O; O; after Reporter Control Type, or Reporter FailType if present, or Reporter WarningType if present; single value in each field; a comment about the reporter; free text.

Reporter Description Database Entry [database tag]; 0..n; O; O; after Reporter Control Type, or Reporter FailType if present, or Reporter WarningType if present, or Reporter Comment if present; multiple values, mandatory approved database tag in lowercase (or user-defined from the header part), mandatory square brackets; a database reference describing the reporter (kept for convenience, but not really used); correct databaseID(s).

(Continued in figure 6.)

Figure 5: ADF Feature/Reporter item information. M: mandatory, if missing the MAGE-ML transformation will fail. O: optional, so the column can be missing in the file. A: application specific. D: dependence.

Each entry below gives: header item; cardinality; presence in file header; presence of a value in field; location; condition(s), comment, number of values in a field; information; field type.

Reporter BioSequence [Fwd Primer]; 0..1; A (O in case of PCR); O; after Reporter Control Type, or Reporter FailType, Reporter WarningType, Reporter Comment or Reporter Description Database Entry if present; single value, only when PCR primers are supplied (PCR microarray use only); the sequence of the forward primer; a sequence (A, T, G, C, N) in uppercase.

Reporter BioSequence [Rev Primer]; 0..1; A (O in case of PCR); O; after the same items, or Reporter BioSequence [Fwd Primer] if the PCR use case applies and it is present; single value, only when PCR primers are supplied (PCR microarray use only); the sequence of the reverse primer; a sequence (A, T, G, C, N) in uppercase.

Reporter BioSequence [Actual Sequence]; 0..1; O (M in case of PCR); O (M in case of PCR); after the same items, or Reporter BioSequence [Fwd Primer] or Reporter BioSequence [Rev Primer] if the PCR use case applies and they are present; single value (protein sequence?); the biological sequence corresponding to the BioSequence; a sequence (A, T, G, C, N) in uppercase.

(Continued in figure 7.)

Figure 6: ADF Feature/Reporter item information. M: mandatory, if missing the MAGE-ML transformation will fail. O: optional, so the column can be missing in the file. A: application specific. D: dependence.


Each entry below gives: header item; cardinality; presence in file header; presence of a value in field; location; condition(s), comment, number of values in a field; information; field type.

Reporter BioSequence Type; 1..1; M (it is assumed there is at least one BioSequence across all the reporters); M if the Reporter BioSequence Database Entry [db-tag] and Reporter BioSequence PolymerType fields are not empty (no value in the case of buffer and empty buffer); after Reporter Control Type, or Reporter FailType, Reporter WarningType, Reporter Comment, Reporter Description Database Entry or Reporter BioSequence [Actual Sequence] if present; single value, deals with buffer and empty controls; the type of the BioSequence associated with the current reporter, descriptors of biosequences based on the Sequence Ontology (SO) project; ontology entry.

Reporter BioSequence PolymerType; 1..1; M (same assumption); M if the Reporter BioSequence Database Entry [db-tag] and Reporter BioSequence PolymerType fields are not empty (no value in the case of buffer and empty buffer); the previous item is Reporter BioSequence Type; single value, deals with buffer and empty controls; the type of the polymer (RNA, DNA, protein) of the BioSequence; ontology entry.

Reporter BioSequence Database Entry [database_tag]; 1..n; M (same assumption); M if the Reporter BioSequence Database Entry [db-tag] and Reporter BioSequence PolymerType fields are not empty (no value in the case of buffer and empty buffer); the previous item is Reporter BioSequence PolymerType; replicated field (0..n), multiple values separated by « ; », mandatory approved database tag in lowercase, mandatory square brackets, deals with buffer and empty controls; a reference to an entry in a given database for the BioSequence associated with the reporter; correct databaseID(s).

Reporter Group [reportergroup tag]; 0..n; O; O; the previous item is Reporter BioSequence Database Entry; mandatory square brackets (constraint for the reportergroup tag?), single value, avoid the names chromosome, chromosome_band and role; value for reporters belonging to the given reporter group (tag); free text.

(Continued in figure 8.)

Figure 7: ADF Feature/Reporter item information. M: mandatory, if missing the MAGE-ML transformation will fail. O: optional, so the column can be missing in the file. A: application specific. D: dependence.


Each entry below gives: header item; cardinality; presence in file header; presence of a value in field; location; condition(s), comment, number of values in a field; information; field type.

Species; 0..1; O (only in the CGH use case); O; the previous item is Reporter BioSequence Database Entry, or Reporter Group [reportergroup] if present; multiple values concatenated by « ; »; species; free text (separator between species name and species DatabaseId: « , »; separator between species values: « ; »).

Assigned_Gene Name; 0..1 (if ChIP or PCR); O (only in the ChIP-on-chip use case); O; the previous item is Reporter BioSequence Database Entry, or Reporter Group [reportergroup] if present; multiple values concatenated by « ; »; gene associated with the CompositeSequence; names of genes separated by « ; ».

Assigned_Gene Database Entry [db_tag]; 1..1 (if ChIP or PCR); M (only in the ChIP-on-chip use case); M; the previous item is Reporter BioSequence Database Entry, or Reporter Group [reportergroup] if present, or Assigned_Gene Name if present; multiple values concatenated by « ; », mandatory approved database tag in lowercase, mandatory square brackets; a reference to an entry in a given database for the given assigned genes; main separator « ; », internal separator « , », correct databaseID(s) for the given db-tag.

Chromosome; 0..1 (if ChIP or PCR or CGH); O (if ChIP or PCR or CGH); O; the previous item is Assigned_Gene Database Entry for the ChIP-on-chip or PCR use cases, otherwise Reporter BioSequence Database Entry or Reporter Group [reportergroup] if present; single value; the number of the chromosome on which the BioSequence is located; integer.

Chromosome_band; 0..1 (if ChIP or PCR or CGH); O (if ChIP or PCR or CGH); O; the previous item is Chromosome if present, Assigned_Gene Database Entry for the ChIP-on-chip or PCR use cases, otherwise Reporter BioSequence Database Entry or Reporter Group [reportergroup] if present; single value; the localisation of the BioSequence; integer[pq?]integer.integer.

Figure 8: ADF Feature/Reporter item information. M: mandatory, if missing the MAGE-ML transformation will fail. O: optional, so the column can be missing in the file. A: application specific. D: dependence.

Note: header item words are separated by a space, with the first letter of each word capitalised, except for the tag; capital S in BioSequence.

.4.3 CompositeSequence item information (see figures 9 and 10)

Each entry below gives: header item; cardinality; presence in file header; presence of a value in field; location; condition(s), comment, number of values in a field; information; field type.

CompositeSequence Name; 1..1; M; M; column 1; single value, must be unique; the name of the CompositeSequence; free text.

CompositeSequence BioSequence [Actual Sequence]; 0..1; O (M in case of PCR); O (M in case of PCR); after CompositeSequence Name; single value; the biological sequence corresponding to the BioSequence; a sequence (A, T, G, C, N) in uppercase.

CompositeSequence BioSequence Type; 1..1; M (it is assumed there is at least one BioSequence across all the CompositeSequences); M if the CompositeSequence BioSequence Database Entry [db-tag] and CompositeSequence BioSequence PolymerType fields are not empty; after CompositeSequence Name, or CompositeSequence BioSequence [Actual Sequence] in the PCR case if present; single value; the type of the BioSequence associated with the current CompositeSequence, descriptors of BioSequences based on the Sequence Ontology (SO) project; ontology entry.

CompositeSequence BioSequence PolymerType; 1..1; M (same assumption); M if the CompositeSequence BioSequence Database Entry [db-tag] and CompositeSequence BioSequence PolymerType fields are not empty; the previous item is CompositeSequence BioSequence Type; single value, deals with buffer and empty controls; the type of the polymer (RNA, DNA, protein) of the BioSequence; ontology entry.

CompositeSequence BioSequence Database Entry [database_tag]; 1..n; M (same assumption); M if the CompositeSequence BioSequence Database Entry [db-tag] and CompositeSequence BioSequence PolymerType fields are not empty; the previous item is CompositeSequence BioSequence PolymerType; replicated field (0..n), multiple values separated by « ; », mandatory approved database tag in lowercase, mandatory square brackets; a reference to an entry in a given database for the BioSequence associated with the CompositeSequence; correct databaseID(s).

(Continued in figure 10.)

Figure 9: ADF CompositeSequence item information. M: mandatory, if missing the MAGE-ML transformation will fail. O: optional, so the column can be missing in the file. A: application or technology specific.

Additional constraints:
• a space between header item words, with the first letter of each word capitalised, except for the tag;
• Composite and Sequence are concatenated, with a capital S in CompositeSequence;
• capital S in BioSequence.

.4.4 Usual mistakes

The usual mistakes made in ADF files following the previous ADF specification are the following (based on discussions with the curators):
• Incorrect header item name
• Incorrect database tag
• Incorrect database accession number
• Duplicate feature
• Duplicate reporter identifier
• Missing reporter identifier
• Missing reporter name
• Invalid ontology term
• Missing item

.4.5 Checking lists

The first step of the tool development was to determine what would be checked, in which order, and with what severity. This was done for each ADF table, for both the file structure and the contained data.

Header file
File/Data structure checklist:
1. The header file is a tab-delimited file
2. Item names are correct or can be identified (if an item is not identified, it is skipped)
3. All mandatory items are present in the header
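The mandatory-item check on the header file can be sketched as follows. This is an illustrative fragment, not the tool's actual code; the item names are a subset of the mandatory items of figure 4, and the "Item name<TAB>value" line layout is an assumption:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderCheck {
    // Subset of the mandatory header items from figure 4 (illustrative).
    private static final List<String> MANDATORY = Arrays.asList(
            "Person LastName", "Person FirstName", "Person email",
            "Array Design Name", "Array Version", "Technology Type");

    // Returns the mandatory items absent from a tab-delimited header,
    // where each line is "Item name<TAB>value".
    public static List<String> missingItems(String headerFile) {
        List<String> present = new ArrayList<String>();
        for (String line : headerFile.split("\n")) {
            String[] fields = line.split("\t");
            if (fields.length > 0) {
                present.add(fields[0].trim()); // unknown items are simply skipped later
            }
        }
        List<String> missing = new ArrayList<String>();
        for (String item : MANDATORY) {
            if (!present.contains(item)) {
                missing.add(item);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String header = "Person LastName\tMarguerite\n"
                      + "Person FirstName\tPierre\n"
                      + "Array Design Name\tTest array";
        System.out.println(missingItems(header));
    }
}
```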

Each entry below gives: header item; cardinality; presence in file header; presence of a value in field; location; condition(s), comment, number of values in a field; information; field type.

CompositeSequence Group [reportergroup tag]; 0..n; O; O; the previous item is CompositeSequence BioSequence Database Entry; mandatory square brackets, avoid the names chromosome, chromosome_band and role, single value; value for CompositeSequences belonging to the given CompositeSequence group (tag); free text.

CompositeSequence Annotation Database Entry [unigene]; 0..n; O; O; after CompositeSequence Group if present, or CompositeSequence BioSequence Database Entry; multiple values, mandatory approved database tag in lowercase (or user-defined from the header part), mandatory square brackets; a database reference describing the reporter (kept for convenience, but not really used); correct databaseID(s).

Chromosome; 0..1 (if ChIP or PCR or CGH); O (if ChIP or PCR or CGH); O; the previous item is CompositeSequence Annotation Database Entry if present, or CompositeSequence Group if present, or CompositeSequence BioSequence Database Entry; single value; the number of the chromosome on which the BioSequence is located; integer.

Chromosome_band; 0..1 (if ChIP or PCR or CGH); O (if ChIP or PCR or CGH); O; the previous item is Chromosome if present, Assigned_Gene Database Entry for the ChIP-on-chip or PCR use cases, otherwise CompositeSequence BioSequence Database Entry or CompositeSequence Group [compositegroup] if present; single value; the localisation of the BioSequence; integer[pq?]integer.integer.

ReporterCompositeMap; 1; M; M; the previous item is Chromosome_band if present, Chromosome if present, Assigned_Gene Database Entry for the ChIP-on-chip or PCR use cases, otherwise CompositeSequence BioSequence Database Entry or CompositeSequence Group [reportergroup] if present; multiple values separated by semicolons; mapping between the current CompositeSequence and one Reporter or several CompositeSequences; name(s) of Reporters or CompositeSequences (usually only Reporters, or CompositeSequences).

Figure 10: ADF CompositeSequence item information. M: mandatory, if missing the MAGE-ML transformation will fail. O: optional, so the column can be missing in the file. A: application or technology specific.

Data/file content checklist
1. Correct field value format. Possible value types:
• "Integer"
• "Free Text"
• "Controlled vocabulary"
• "MGED ontology term"
• "DatabaseEntry"
• "Sequence"
• "Species"
2. Check single/multiple values

Feature Reporter file
File/Data structure checklist:
1. The header file is correct (structure and data)
2. The FeatureReporter file is a tab-delimited file
3. Header item names are correct (unknown items are skipped)
4. All mandatory items are present; item cardinalities and dependences are correct
5. Database tags are approved and database accession numbers are correct
6. Exactly the correct number of fields for each row
7. Item order is correct (optional, does not fail the checking)
8. Field dependences are correct

Data/file content checklist
1. The FeatureReporter file structure must be correct
2. Mandatory fields are present; field cardinalities and field value multiplicities must be correct
3. Field values are in the mandatory format:
• Database tags are approved by ArrayExpress and are supplied in lowercase between square brackets
• Database IDs are correct


• Ontology terms are correct (MGED ontology)
• Sequences are correct with respect to the associated polymer type (DNA, RNA, protein)
• Integer field values are correct
4. Duplicate features must not exist

5. Duplicate Reporters (equal names) must have the same characteristics. A reporter can appear several times on a microarray; this is called a duplicate. Usually, this ensures at least one correct value for the reporter if a problem occurs during the experiment. Every field of the duplicated Reporter rows should be identical; otherwise there is a mistake.

CompositeSequence
File/Data structure checklist:
1. The Feature Reporter file must be correct (structure and data)
2. The CompositeSequence file is a tab-delimited file
3. Header item names are correct (unknown items are skipped)
4. All mandatory items are present; header item cardinalities and dependences are correct
5. Column order is correct (non mandatory)
6. Exactly the correct number of fields for each row

Data/file content checklist
1. The CompositeSequence file structure must be correct
2. All mandatory fields are present; field cardinalities are correct
3. Field values are in the expected format; field multiplicity is correct (same as Feature/Reporter)

4. Names in the map are Reporter or CompositeSequence names
5. No duplicate CompositeSequences (equal names)
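The duplicate-Reporter rule (identical names must carry identical row content) can be sketched like this; the row representation, an array of field strings with the reporter name first, is an assumption made for illustration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class DuplicateReporterCheck {
    // Each row: reporter name followed by its remaining fields.
    // Duplicated reporters (same name) must have strictly identical fields;
    // returns false as soon as two rows with the same name differ.
    public static boolean duplicatesAreConsistent(String[][] rows) {
        Map<String, String[]> seen = new HashMap<String, String[]>();
        for (String[] row : rows) {
            String name = row[0];
            String[] previous = seen.get(name);
            if (previous == null) {
                seen.put(name, row);
            } else if (!Arrays.equals(previous, row)) {
                return false; // same name, different content: a mistake
            }
        }
        return true;
    }
}
```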

.5 MAGE-ML

Microarray Gene Expression Markup Language (MAGE-ML) "is a language designed to describe and communicate information about microarray based experiments. MAGE-ML is based on XML and can describe microarray designs, microarray manufacturing information, microarray experiment setup and execution information, gene expression data and data analysis results. MAGE-ML has been automatically derived from the Microarray Gene Expression Object Model (MAGE-OM), which is developed and described using the Unified Modelling Language (UML) – a standard language for describing object models. Descriptions using UML have an advantage over direct XML document type definitions (DTDs) in many respects. First, they use a graphical representation depicting the relationships between different entities in a way which is much easier to follow than DTDs. Second, the UML diagrams are primarily meant for humans, while DTDs are meant for computers. Therefore MAGE-OM should be considered as the primary model, and we will explain MAGE-ML by providing simplified fragments of MAGE-OM, rather than XML DTD or XML Schema."

.5.1 Identifiers - LSID

The Life Sciences Identifier (LSID) is an I3C and OMG Life Sciences Research (LSR) Uniform Resource Name (URN) specification in progress. The LSID concept introduces a straightforward approach to naming and identifying data resources stored in multiple, distributed data stores, in a manner that overcomes the limitations of the naming schemes in use today. Almost every public, internal, or department-level data store today has its own way of naming individual data resources, making integration between different data sources a tedious, never-ending chore for informatics developers and researchers. By defining a simple, common way to identify and access biologically significant data, whether that data is stored in files, relational databases, applications, or internal or public data sources, LSID provides a naming-standard underpinning for wide-area science and interoperability.
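An LSID is a colon-separated URN of the form urn:lsid:authority:namespace:object[:revision]. A minimal parsing sketch in Java (the example identifier is hypothetical, shown only to illustrate the structure):

```java
public class LsidParser {
    // Splits urn:lsid:authority:namespace:object[:revision] into its parts.
    // Returns null if the string does not look like an LSID.
    public static String[] parse(String lsid) {
        String[] parts = lsid.split(":");
        if (parts.length < 5 || !parts[0].equalsIgnoreCase("urn")
                || !parts[1].equalsIgnoreCase("lsid")) {
            return null;
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical identifier, for illustration only.
        String[] p = parse("urn:lsid:ebi.ac.uk:arraydesign:A-TEST-1");
        System.out.println(p[2] + " / " + p[3] + " / " + p[4]);
    }
}
```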

.5.2 Ontology Entries

OntologyEntries are present in MAGE-ML files to allow entries from ontologies or, where no ontology exists, from controlled vocabularies. For an existing ontology, a database can be referenced to provide a location for information on the particular ontology used. The MGED ontology provides values for the terms required to fill OntologyEntries in MAGE. The table below maps the packages which require OntologyEntries to the allowed instances from the MGED ontology. Where subclasses are used in the ontology (to organize instances), these are shown under their package. In some cases MAGE requires an enumerated list; these cases are marked, but the terms are also contained within the ontology (for completeness) and are defined there too.

MAGE Class | MAGE Association | Link to MGED Ontology Class

Array_package:
ArrayGroup | SubstrateType_assn | SubstrateType

ArrayDesign_package:
FeatureGroup | FeatureShape_assn | FeatureShape
FeatureGroup | TechnologyTypes_assn | TechnologyType
PhysicalArrayDesign | SurfaceType_assn | SurfaceType

BioSequence_package:
BioSequence | PolymerType_assn | PolymerType
BioSequence | Type_assn | BioSequenceType: PhysicalBioSequenceType, TheoreticalBioSequenceType
BioSequence | Species_assn | Organism

Description_package: (no associations listed)

DesignElement_package:
DesignElementGroup | Types_assn | DesignElementGroupType
Reporter | FailTypes_assn | FailType
DesignElement | ControlType_assn | ControlType
Reporter | WarningType_assn | WarningType
DesignElementGroup | Species_assn | Organism

Figure 11: Link between MAGE and the MGED Ontology.

.6 Used tools

.6.1 Java Excel API

The Java Excel API (http://www.andykhan.com/jexcelapi/) is distributed under the GNU Lesser General Public License. It is an open-source Java API that allows Java developers to read Excel spreadsheets and to generate Excel spreadsheets dynamically. In addition, it provides a mechanism for a Java application to read in a spreadsheet, modify some cells and write out the new spreadsheet. The API allows non-Windows operating systems to run pure Java applications that can both process and deliver Excel spreadsheets. Additionally, it may be invoked from within a servlet, thus giving access to Excel functionality over internet and intranet web applications.

.6.2 Log4J

Log4j is an open-source project based on the work of many authors. It allows the developer to control which log statements are output, with arbitrary granularity. It is fully configurable at runtime using external configuration files.
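For instance, an external log4j.properties file typically sets the root logging level and an appender (an illustrative fragment for Log4j 1.x):

```
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %-5p %c - %m%n
```

Changing the first line to DEBUG, for example, raises the verbosity at runtime without recompiling the application.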

.6.3 Ant

Ant is a Java-centric build tool that performs tasks similar to the well-known Make command, mainly used for project compilation, while keeping Java's multiplatform aspect (write once, run on every operating system). Furthermore, Ant integrates many features such as FTP and SMTP. The build file is XML-formatted, which allows easy reading and parsing. It contains several targets (internal commands); usually it is used to compile a project, to build a project release containing the jar file and external files, and to deploy an application.
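A minimal build file illustrating the target mechanism (project name and paths are placeholders):

```xml
<project name="adf-checker" default="compile">
    <!-- Compile all sources into the build directory -->
    <target name="compile">
        <mkdir dir="build"/>
        <javac srcdir="src" destdir="build"/>
    </target>
    <!-- Package the compiled classes into a jar file -->
    <target name="jar" depends="compile">
        <jar destfile="adf-checker.jar" basedir="build"/>
    </target>
</project>
```

Running "ant jar" resolves the dependency chain, compiling first and then packaging.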

.6.4 The W3C XML Schema Language

The purpose of an XML Schema is to define the legal building blocks of an XML document, just like a Document Type Definition (DTD). An XML Schema:
• defines elements that can appear in a document
• defines attributes that can appear in a document
• defines which elements are child elements
• defines the order of child elements
• defines the number of child elements
• defines whether an element is empty or can include text
• defines data types for elements and attributes
• defines default and fixed values for elements and attributes
In comparison with a DTD, an XML Schema is itself an XML file. It therefore offers the same advantages as XML, especially fast parsing, and it extends the possibilities of DTDs through entity types, reusability of types and standardisation of types. For example, schemas written in the XML Schema Language can describe structural relationships and data types that cannot be expressed (or cannot easily be expressed) in DTDs.
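As an illustration, the following fragment (element and type names are hypothetical) declares an element with typed, ordered children, one of them optional:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- A Reporter element with a mandatory name and an optional sequence -->
  <xs:element name="Reporter">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <xs:element name="Sequence" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The minOccurs attribute expresses optionality, something a DTD can express only indirectly.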

.6.5 Java Architecture for XML Binding (JAXB)

The Java Architecture for XML Binding (JAXB) provides a convenient way to bind an XML schema to a representation in Java code. This makes it easy for developers to incorporate XML data and processing functions in applications based on Java technology without having to know much about XML itself.

What is JAXB? XML and Java technology are recognized as ideal building blocks for developing Web services and applications that access Web services. A new Java API called Java Architecture for XML Binding (JAXB) can make it easier to access XML documents from applications written in the Java programming language. The Extensible Markup Language (XML) and Java technology are natural partners in helping developers exchange data and programs across the Internet. That's because

XML has emerged as the standard for exchanging data across disparate systems, and Java technology provides a platform for building portable applications. How can an XML document (that is, a file containing XML-tagged data) be accessed and used through the Java programming language? One way to do this, perhaps the most typical, is through parsers that conform to the Simple API for XML (SAX) or the Document Object Model (DOM). Both of these parsers are provided by the Java API for XML Processing (JAXP). Java developers can invoke a SAX or DOM parser in an application through the JAXP API to parse an XML document – that is, scan the document and logically break it up into discrete pieces. The parsed content is then made available to the application. In the SAX approach, the parser starts at the beginning of the document and passes each piece of the document to the application in the sequence it finds it. Nothing is saved in memory. The application can take action on the data as it gets it from the parser, but it cannot do any in-memory manipulation of the data. For example, it cannot update the data in memory and return the updated data to the XML file. In the DOM approach, the parser creates a tree of objects that represents the content and organization of the data in the document. In this case, the tree exists in memory. The application can then navigate through the tree to access the data it needs and, if appropriate, manipulate it. Currently, the JAXB API makes it possible to:
• access an XML document
• create an XML document
• update an XML document
To this end, the API binds an XML schema to Java objects, unmarshals an XML document into Java objects for reading, and marshals Java objects as a content tree in order to create an XML document.
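The DOM approach described above can be sketched with the standard JAXP API (javax.xml.parsers, part of the JDK); the document content here is a placeholder:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomExample {
    // Parses an XML string into an in-memory DOM tree and returns
    // the name of the root element.
    public static String rootElementName(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document =
                builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return document.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootElementName("<MAGE-ML><ArrayDesign/></MAGE-ML>"));
    }
}
```

Because the whole tree lives in memory, the application can navigate and modify it, which a SAX stream does not allow.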

Binding an XML Schema. JAXB simplifies access to an XML document from a Java program by presenting the XML document to the program in a Java format. The first step in this process is to bind the schema for the XML document into a set of Java classes that represents the schema. Binding: binding a schema means generating a set of Java classes that represents the schema. All JAXB implementations provide a tool called a binding compiler to bind a schema (the way the binding compiler is invoked can be implementation-specific).

Unmarshalling the Document. Unmarshalling an XML document means creating a tree of content objects that represents the content and organization of the document. The content objects are instances of the classes produced by the binding compiler. After unmarshalling, the program can access and display the data in the XML document simply by accessing the data in the Java content objects and then displaying it. There is no need to create and use a parser, and no need to write a content handler with callback methods.

Marshalling the Document. Marshalling is the opposite of unmarshalling: it creates an XML document from a content tree. By combining both functionalities, JAXB also makes it possible to modify an XML document: the document is unmarshalled in memory, the application can easily modify it through simple method calls, and the content tree is then marshalled to recreate the modified XML document.

Distinct Advantages. Let us reiterate a number of important advantages of using JAXB:

• JAXB simplifies access to an XML document from a Java program.
• JAXB allows you to access and process XML data without having to know XML or XML processing. Unlike SAX-based processing, there is no need to create a SAX parser or write callback methods.
• JAXB allows you to access data in non-sequential order but, unlike DOM-based processing, it does not force you to navigate through a tree to access the data.
• By unmarshalling XML data through JAXB, Java content objects that represent the content and organization of the data are directly available to your program.
• JAXB uses memory efficiently: the tree of content objects produced through JAXB tends to be more efficient in terms of memory use than DOM-based trees.
• JAXB is flexible:
– It is possible to unmarshal XML data from a variety of input sources, including a file, an InputStream object, a URL, a DOM node, or a transformed source object.

– It is possible to marshal a content tree to a variety of output targets, including an XML file, an OutputStream object, a DOM node, or a transformed data object.
– It is possible to unmarshal SAX events – for example, you can do a SAX parse of a document and then pass the events to JAXB for unmarshalling.
– JAXB allows you to create XML data without having to unmarshal an existing document: once a schema is bound, you can use the ObjectFactory methods to create the objects and then use the set methods of the generated objects to create content.
– It is possible to validate source data against an associated schema as part of the unmarshalling operation, but you can turn validation off if you do not want to incur the additional overhead.
– It is possible to validate a content tree, using the Validator class, separately from marshalling. For example, you can do the validation at one point in time and the marshalling at another.
– JAXB's binding behaviour can be customized in a variety of ways.
