Chemoinformatics: definition

“InfoChimie” option in LC and LCP

Alexandre Varnek,

Louis Pasteur University, Strasbourg, France

Laboratoire d’Infochimie (Chemoinformatics Laboratory)

Molecular Models

• Quantum: nuclei + electrons. Schrödinger equation; Hartree-Fock, DFT.
• Mechanical: spherical atoms linked by springs, point charges, … Force-field approaches (MM2, AMBER, MMFF, …).
• Descriptive: ensemble of parameters (descriptors). QSAR/QSPR, chemoinformatics.

Introduction to Chemoinformatics

• Chemoinformatics: definition
• Chemoinformatics: why?
• Chemoinformatics: how?

Chemoinformatics: Definition

"Chem(o)informatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization, and use of chemical information."

G. Paris (August 1999 Meeting of the American Chemical Society), quoted by W. Warr at www.warr.com/warrzone.htm

Chemoinformatics: Definition

The application of informatics methods to solve chemical problems (J. Gasteiger)

Chemoinformatics: Why?

• complex relationships between structure and biological activity or chemical reactivity
• quantity of information: several million compounds and reactions, several million scientific publications

[Figure: Number of registered chemical compounds (compounds published in CAS). Vertical axis: compounds (millions), 0 to 50; horizontal axis: year, 1965 to 2005.]

[Figure: Number of publications in chemistry (abstracts published in CAS: papers, patents, books). Vertical axis: abstracts (millions), 0 to 25; horizontal axis: year, 1907 to 1997. Curves: all abstracts, patents, books, papers.]

Problem: too much information

• 41 million compounds
• 1 million new compounds per year
• 800,000 publications per year

Problem: not enough information

• 41,000,000 compounds
• 250,000 3D structures in the Cambridge Crystallographic Database
• 220,000 infrared spectra in the Bio-Rad Database

=> 3D structures and IR spectra are available for only about 0.5 % of all compounds

Synthesis of Properties

"The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties."

George S. Hammond, Norris Award Lecture, 1968

Chemoinformatics: How?

Fundamental Questions in Chemistry

• What structure do I need for a certain property? (structure-activity relationships)
• How do I make this structure? (synthesis design)
• What is the product of my reaction? (reaction prediction, structure elucidation)

Learning in Chemistry • deductive learning from calculations:

• inductive learning from observations:

quantum mechanics molecular mechanics

model (model driven) analogy (data driven)

From Data to Knowledge

[Diagram. Levels: data, information, knowledge. Labels: measurement, calculation, context, generalization, inductive learning, deductive learning.]

Significance

“The industrial society will change into an information society. Information will be the most valuable asset.” (The Wall Street Journal, 14 Sept. 1985)

Motivation for Students

• better understanding of chemistry
• flood of information can only be processed by computers
• chemoinformatics specialists are urgently needed
• many jobs offered, particularly from pharmaceutical chemistry

Chemoinformatics: A Textbook

J. Gasteiger, T. Engel (Editors), 650 pages, Wiley-VCH, Weinheim (September 2003)

Teaching Courses in Chemoinformatics

• University of Sheffield, UK (P. Willett)
• UMIST (Manchester), UK (H. Schofield)
• Indiana University, USA (G. Wiggins)
• University of Strasbourg, France (A. Varnek)


Introduction to Chemoinformatics

• Data management (databases) in chemistry.
• Molecular modelling.
• Structure-property relationships.
• Virtual screening. In silico design of new compounds.


Software to study

• Data management in chemistry: ChemFinder (ChemOffice), DIVA
• Molecular modelling: SPARTAN, Chem3D (ChemOffice), DS Studio
• Structure-property relationships, virtual screening, in silico design of new compounds: ISIDA, ChemDraw Ultra (ChemOffice), CODESSA

Storage, retrieval and management of chemical data

Chemical databases: acquisition, storage, organization and manipulation of data

Representation of chemical structures
• Graphs
• Connectivity matrices
• Connection tables
• MOL, SDF, RDF, … formats
• Structural keys

Existing databases: CAS, STN, Beilstein, Gmelin, CCDS, …

Concept of similarity
• chemical space
• Tanimoto criterion

Tutorial (TD): building databases with ChemFinder

Representation of chemical structures

• How should chemical structures be stored?
• How should structures be searched?

Representation of structures and reactions

1D
• SMILES and SMARTS strings
• Structural keys, fingerprints

2D
• Graphs
• Connectivity matrices
• Connection tables
• MOL, SDF, RDF, … formats

3D
• PDB format
• Z-matrix

Representing a chemical structure

How much information do you want to include?
• atoms present
• connections between atoms
• bond types
• stereochemical configuration
• charges
• isotopes
• 3D-coordinates for atoms

C8H9NO3

Topological Graph Theory

• branch of mathematics particularly useful in chemical informatics and in computer science generally
• study of “graphs”, which consist of
  - a set of “nodes”
  - a set of “edges” joining pairs of nodes

Properties of graphs

• graphs are only about connectivity
• spatial position of nodes is irrelevant
• length of edges is irrelevant
• crossing edges are irrelevant

Properties of Graphs

• nodes and edges can be “coloured” to distinguish them

[Structure diagram of the example molecule; labels: OH, CH2, H2N, O, CH, OH.]

Structure Diagrams as Graphs

• 2D structure diagrams are very like topological graphs: atoms ↔ nodes, bonds ↔ edges
• terminal hydrogen atoms are not normally shown as separate nodes (“implicit” hydrogens)
  - reduces number of nodes by ~50%
  - “hydrogen count” information is used to colour the neighbouring “heavy” atom
• separate nodes are sometimes used for “special” hydrogens
  - deuterium, tritium
  - hydrogen bonded to more than one other atom
  - hydrogens attached to stereocentres

Advantages of using graphs

• mathematical theory is well understood
• graphs can be easily represented in computers
• many useful algorithms are known
• identical graphs ⇔ identical molecules
• different graphs ⇔ different molecules
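The graph view of a molecule can be sketched in a few lines of Python. This is a minimal illustration, not the data model of any particular package: the class name `MolGraph` and its fields are assumptions, and ethanol with implicit hydrogens is used as the example. Node "colours" are element symbols and edge "colours" are bond orders; no coordinates are stored, since only connectivity matters.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Molecule as a coloured graph (illustrative class, not a library API)."""
    atoms: dict = field(default_factory=dict)   # node id -> element symbol (node colour)
    bonds: dict = field(default_factory=dict)   # frozenset({i, j}) -> bond order (edge colour)

    def add_atom(self, idx, element):
        self.atoms[idx] = element

    def add_bond(self, i, j, order=1):
        # An edge has no direction, so the atom pair is stored as an unordered set.
        self.bonds[frozenset((i, j))] = order

    def neighbours(self, idx):
        return sorted(j for pair in self.bonds if idx in pair for j in pair if j != idx)


# Ethanol, CH3-CH2-OH, with implicit hydrogens: three nodes, two edges.
ethanol = MolGraph()
for idx, element in [(1, "C"), (2, "C"), (3, "O")]:
    ethanol.add_atom(idx, element)
ethanol.add_bond(1, 2)
ethanol.add_bond(2, 3)

# The same graph entered in a different order (as if drawn differently).
same = MolGraph()
for idx, element in [(3, "O"), (2, "C"), (1, "C")]:
    same.add_atom(idx, element)
same.add_bond(3, 2)
same.add_bond(2, 1)

print(ethanol.neighbours(2))   # [1, 3]
print(ethanol == same)         # True: only connectivity and colours are compared
```

The equality test only works here because both graphs use the same atom numbering. Comparing graphs whose atoms are numbered differently needs a canonical numbering or graph-isomorphism step, which this sketch does not attempt.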

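Connection table

[Structure diagram: the example molecule with its non-hydrogen atoms numbered 1 to 13.]

Atom  Element  H count  Neighbours (atom number, bond order)
 1    O        1        (2, 1)
 2    C        0        (1, 1) (3, 2) (4, 1)
 3    O        0        (2, 2)
 4    C        1        (2, 1) (5, 1) (6, 1)
 5    N        2        (4, 1)
 6    C        2        (4, 1) (7, 1)
 7    C        0        (6, 1) (8, 2) (12, 1)
 8    C        1        (7, 2) (9, 1)
 9    C        1        (8, 1) (10, 2)
10    C        0        (9, 2) (11, 1) (13, 1)
11    C        1        (10, 1) (12, 2)
12    C        1        (11, 2) (7, 1)
13    O        1        (10, 1)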
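As a rough sketch of how such a connection table can be built and used, the Python below starts from a bond list transcribed from the table above (each bond entered once) and derives the per-atom neighbour lists and the implicit hydrogen counts. The valence table (C = 4, N = 3, O = 2, neutral atoms only) and all function names are assumptions made for this illustration.

```python
from collections import Counter

# Heavy atoms of the example molecule, numbered as in the connection table.
ELEMENTS = {1: "O", 2: "C", 3: "O", 4: "C", 5: "N", 6: "C", 7: "C",
            8: "C", 9: "C", 10: "C", 11: "C", 12: "C", 13: "O"}

# Each bond listed once as (atom, atom, bond order).
BONDS = [(1, 2, 1), (2, 3, 2), (2, 4, 1), (4, 5, 1), (4, 6, 1), (6, 7, 1),
         (7, 8, 2), (8, 9, 1), (9, 10, 2), (10, 11, 1), (11, 12, 2),
         (12, 7, 1), (10, 13, 1)]

# Assumed standard valences for neutral C, N and O.
VALENCE = {"C": 4, "N": 3, "O": 2}


def connection_table(elements, bonds):
    """Return {atom: [(neighbour, bond order), ...]} from a bond list."""
    table = {idx: [] for idx in elements}
    for i, j, order in bonds:
        table[i].append((j, order))
        table[j].append((i, order))
    return table


def hydrogen_counts(elements, table):
    """Implicit hydrogens = valence minus the sum of bond orders."""
    return {idx: VALENCE[elements[idx]] - sum(order for _, order in table[idx])
            for idx in elements}


table = connection_table(ELEMENTS, BONDS)
h_counts = hydrogen_counts(ELEMENTS, table)

# Element counts of the heavy atoms plus the implicit hydrogens.
formula = Counter(ELEMENTS.values())
formula["H"] = sum(h_counts.values())

print([h_counts[i] for i in sorted(h_counts)])
# [1, 0, 0, 1, 2, 2, 0, 1, 1, 0, 1, 1, 1]
print(dict(formula))
# {'O': 3, 'C': 9, 'N': 1, 'H': 11}
```

The computed hydrogen counts reproduce the thirteen values listed with the table, which is a useful consistency check on the transcribed bond list.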

Distance matrix

[Structure diagram: the example molecule with its non-hydrogen atoms numbered 1 to 13.]

Each entry is the number of bonds on the shortest path between two atoms; the diagonal shows the element of each atom.

      1  2  3  4  5  6  7  8  9 10 11 12 13
 1    O  1  2  2  3  3  4  5  6  7  6  5  8
 2    1  C  1  1  2  2  3  4  5  6  5  4  7
 3    2  1  O  2  3  3  4  5  6  7  6  5  8
 4    2  1  2  C  1  1  2  3  4  5  4  3  6
 5    3  2  3  1  N  2  3  4  5  6  5  4  7
 6    3  2  3  1  2  C  1  2  3  4  3  2  5
 7    4  3  4  2  3  1  C  1  2  3  2  1  4
 8    5  4  5  3  4  2  1  C  1  2  3  2  3
 9    6  5  6  4  5  3  2  1  C  1  2  3  2
10    7  6  7  5  6  4  3  2  1  C  1  2  1
11    6  5  6  4  5  3  2  3  2  1  C  1  2
12    5  4  5  3  4  2  1  2  3  2  1  C  3
13    8  7  8  6  7  5  4  3  2  1  2  3  O
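A topological distance matrix like the one above can be computed from the bond list with a breadth-first search from each atom. The sketch below assumes the 13-atom numbering of the example molecule; the function name `distance_matrix` is illustrative, and the diagonal is returned as 0 rather than as an element symbol.

```python
from collections import deque

# Bonds of the example molecule (bond orders are irrelevant for distances).
BOND_PAIRS = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8), (8, 9),
              (9, 10), (10, 11), (11, 12), (12, 7), (10, 13)]


def distance_matrix(n_atoms, bonds):
    """Topological distances (bonds on the shortest path) by breadth-first search."""
    adjacency = {i: [] for i in range(1, n_atoms + 1)}
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)

    matrix = {}
    for start in adjacency:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in adjacency[current]:
                if nxt not in dist:          # first visit gives the shortest path
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        # Assumes a connected molecule, so every atom has been reached.
        matrix[start] = [dist[j] for j in range(1, n_atoms + 1)]
    return matrix


D = distance_matrix(13, BOND_PAIRS)
for atom in range(1, 14):
    print(f"{atom:2d}", D[atom])
```

Row 1 of the result is `[0, 1, 2, 2, 3, 3, 4, 5, 6, 7, 6, 5, 8]`, matching the first row of the matrix above.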

Structure representations

Exchange formats (Molecular Design Limited, MDL)
• *.MOL: 2D or 3D structure of one molecule
• *.SDF (Structure-Data File): several molecules + properties
• *.RXN: one reaction (structures of reactants and products)
• *.RDF (Reaction-Data File): several reactions + properties

http://www.mdli.com/downloads/literature/ctfile.pdf

Structure representations: Structure-Data File (exchange standard)
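To make the "several molecules + properties" layout concrete, here is a minimal Python sketch that splits SD-file text into records and separates each record into its molfile block and its data items. It relies only on three features of the format: the molfile block ends with an `M  END` line, each data item starts with a `>` header line carrying the field name in angle brackets, and each record ends with a `$$$$` line. The two single-atom records and the `NAME` field are made-up examples, and the function names are illustrative.

```python
def split_sdf(text):
    """Split SD-file text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def parse_record(lines):
    """Return (molfile block lines, {field name: value}) for one record."""
    end = next(i for i, line in enumerate(lines) if line.startswith("M  END"))
    molblock, props = lines[:end + 1], {}
    name = None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # Data header, e.g. '>  <NAME>': the field name sits between < and >.
            name = line[line.index("<") + 1:line.rindex(">")]
            props[name] = []
        elif line.strip() == "":
            name = None                      # a blank line closes the data item
        elif name is not None:
            props[name].append(line)
    return molblock, {key: "\n".join(value) for key, value in props.items()}


def one_atom_record(name, symbol):
    """Build a made-up single-atom record (three header lines, counts line, atom line)."""
    return [
        name, "", "",
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        f"    0.0000    0.0000    0.0000 {symbol:<3} 0  0  0  0  0  0  0  0  0  0  0  0",
        "M  END",
        ">  <NAME>", name, "",
        "$$$$",
    ]


SDF_TEXT = "\n".join(one_atom_record("methane", "C") + one_atom_record("water", "O"))

records = split_sdf(SDF_TEXT)
for record in records:
    molblock, props = parse_record(record)
    print(molblock[0], len(molblock), props)
```

The sketch does not interpret the atom and bond blocks themselves; the CTfile document linked above describes their fixed-column layout.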

Structure representations

Bitstrings
• Structural keys: library of predefined fragments
• Fingerprints: generated from a library of compounds

Bitstrings

• the fragments present in a structure can be represented as a sequence of 0s and 1s, e.g. 00010100010101000101010011110100
• 0 means fragment is not present in structure
• 1 means fragment is present in structure (perhaps multiple times)
• each 0 or 1 can be represented as a single bit in the computer (a “bitstring”)
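A minimal Python sketch of this encoding follows. The eight-entry fragment library and the hand-entered fragment counts are hypothetical, chosen only for illustration; a real system would obtain the counts by substructure search, which is not shown here.

```python
# Hypothetical key library: one bit position per predefined fragment.
FRAGMENT_LIBRARY = ["C=O", "O-H", "N-H", "aromatic ring", "C-Cl", "C#N", "S", "P"]


def structural_key(fragment_counts, library=FRAGMENT_LIBRARY):
    """Return a bitstring with 1 where the fragment occurs at least once."""
    return "".join("1" if fragment_counts.get(frag, 0) > 0 else "0" for frag in library)


def to_int(bitstring):
    """Pack the bitstring into an integer so that each key occupies a single bit."""
    return int(bitstring, 2)


# Fragment counts for a hypothetical molecule, entered by hand.
example = {"C=O": 1, "O-H": 2, "N-H": 2, "aromatic ring": 1}

key = structural_key(example)
print(key)                            # 11110000
print(bin(to_int(key)).count("1"))    # 4
```

The two O-H groups set the same single bit, so the number of occurrences is no longer recoverable from the key.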

Structural Keys

Compounds are multi-domain:
• multiple occurrences of a key/substructure
• members of more than one chemical family

=> Information loss!

The “Similarity Principle”

Structurally similar molecules are assumed to have similar biological properties
⇓
Relevant descriptors show a correlation between descriptor similarity and biological similarity

What is similar?

Different “spaces”, classified by:
• Shape
• Size
• Colour
• Pattern

16 diverse aldehydes...

[Figure: structure diagrams of the 16 aldehydes, carrying OH, COOH, NH2, Cl and N-containing groups.]

...sorted by common scaffold

[Figure: the same 16 aldehydes, grouped by common scaffold.]

...sorted by functional groups

[Figure: the same 16 aldehydes, grouped by functional group.]

Similarity from fingerprints

• similarity measures are most commonly calculated from structure fingerprints
• count the bits that are “on” in both molecules
• count the bits that are “on” in each molecule separately

struct A:  00010100010101000101010011110100   13 bits on (A)
struct B:  00000000100101001001000011100000    8 bits on (B)
A AND B:   00000000000101000001000011100000    6 bits on (C)

• the similarity coefficient can be calculated from A, B and C

[Venn diagram: sets A and B overlapping in C.]

Tanimoto coefficient

similarity = C / (A + B - C) = 6 / (13 + 8 - 6) = 0.4

• the number of bits set in both molecules divided by the number of bits set in either molecule
• the Tanimoto coefficient is the most commonly used similarity coefficient in chemical informatics
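The calculation above can be reproduced with a short Python sketch. The fingerprints are taken as strings of 0 and 1, as on the slide; the function name `tanimoto` and the convention of returning 0.0 for two empty fingerprints are assumptions of this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient C / (A + B - C) for two equal-length bitstrings."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    a = bits_a.count("1")                                        # bits on in A
    b = bits_b.count("1")                                        # bits on in B
    c = sum(1 for x, y in zip(bits_a, bits_b) if x == y == "1")  # bits on in both
    if a + b - c == 0:
        return 0.0   # convention assumed here for two empty fingerprints
    return c / (a + b - c)


struct_a = "00010100010101000101010011110100"
struct_b = "00000000100101001001000011100000"

print(struct_a.count("1"), struct_b.count("1"))   # 13 8
print(tanimoto(struct_a, struct_b))               # 0.4
```

In practice fingerprints are usually stored as packed bits rather than character strings, so the three counts come from bitwise AND and population-count operations; the string form is kept here to stay close to the slide.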

Chemo- and bioinformatics: historical perspective

Drug development: discovery of active molecules

Industrial era: tests on animals
• not rational
• costly
• slow

1 molecule in 500 is active; 1 in 10,000 becomes a drug

“Rational” era (from 1960-1970)

Disease → biological mechanisms involved → biological target (protein, DNA, RNA, …) → “rational” design of molecules

Areas of expertise: molecular pharmacology (1960-70), molecular biology (1980), structural biology (1980-90), molecular modelling (1980-90).

“Screening” era (from 1990-1995)

Target + molecules → screening, High Throughput Screening (HTS) → hit → lead → preclinical candidate

[Diagram: the same screening scheme (biological target, molecules, High Throughput Screening (HTS), hit, lead, preclinical candidate), annotated with the labels Chemoinformatics and Bioinformatics.]