The « InfoChimie » option in LC and LCP
Alexandre Varnek,
Louis Pasteur University, Strasbourg, France
Laboratoire d'Infochimie (Chemoinformatics Laboratory)
Molecular Models
- Quantum
  Nuclei + electrons
  Schrödinger eq., Hartree-Fock, DFT
- Mechanical
  Spherical atoms linked by springs, point charges, ...
  Force field approaches (MM2, AMBER, MMFF, ...)
- Descriptive
  Ensemble of parameters (descriptors)
  QSAR/QSPR, chemoinformatics
Introduction to Chemoinformatics
• chemoinformatics: definition
• chemoinformatics: why?
• chemoinformatics: how?
Chemoinformatics: Definition
"Chem(o)informatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization, and use of chemical information."
G. Paris (August 1999 Meeting of the American Chemical Society), quoted by W. Warr at www.warr.com/warrzone.htm
Chemoinformatics: Definition
The application of informatics methods to solve chemical problems (J. Gasteiger)
Chemoinformatics: Why?
• complex relationships between structure and biological activity or chemical reactivity
• quantity of information: several million compounds and reactions, several million scientific publications
Number of registered chemical compounds (compounds published in CAS)
[Figure: compounds (millions, 0 to 50) versus year, 1965 to 2005]
Number of publications in chemistry (abstracts published in CAS)
[Figure: abstracts (millions, 0 to 25) versus year, 1907 to 1997; curves for papers, patents, books and all abstracts]
Problem: too much information
• 41 million compounds
• 1 million new compounds per year
• 800,000 publications per year
Problem: not enough information
• 41,000,000 compounds
• 250,000 3D structures in the Cambridge Crystallographic Database
• 220,000 infrared spectra in the Bio-Rad Database
=> 3D structures and IR spectra are available for only about 0.5% of all compounds
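The percentage above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the three counts are the ones quoted on the slide, and the function name `coverage_percent` is introduced here for illustration.

```python
# Counts as quoted on the slide (not independently verified here).
TOTAL_COMPOUNDS = 41_000_000
STRUCTURES_3D = 250_000   # 3D structures
IR_SPECTRA = 220_000      # infrared spectra


def coverage_percent(n_available: int, n_total: int) -> float:
    """Share of all compounds for which a given kind of data exists, in percent."""
    return 100.0 * n_available / n_total


for label, count in [("3D structures", STRUCTURES_3D), ("IR spectra", IR_SPECTRA)]:
    print(f"{label}: {coverage_percent(count, TOTAL_COMPOUNDS):.2f} % of all compounds")
```

Both ratios come out slightly above one half of one percent, which is consistent with the rounded figure on the slide.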
Synthesis of Properties
"The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties."
George S. Hammond, Norris Award Lecture, 1968
Chemoinformatics: How?
Fundamental Questions in Chemistry
• What structure do I need for a certain property? → structure-activity relationships
• How do I make this structure? → synthesis design
• What is the product of my reaction? → reaction prediction, structure elucidation
Learning in Chemistry
• deductive learning from calculations (quantum mechanics, molecular mechanics): model (model driven)
• inductive learning from observations: analogy (data driven)
From Data to Knowledge
[Diagram: measurement / calculation → data → (context) → information → (generalization) → knowledge; the two directions are labelled "inductive learning" and "deductive learning"]
Significance
"The industrial society will change into an information society. Information will be the most valuable asset." (The Wall Street Journal, 14 Sept. 1985)
Motivation for Students
• better understanding of chemistry
• the flood of information can only be processed by computers
• chemoinformatics specialists are urgently needed
• many jobs are offered, particularly in pharmaceutical chemistry
Chemoinformatics: A Textbook
J. Gasteiger, T. Engel (Editors), 650 pages, Wiley-VCH, Weinheim (September 2003)
Teaching Courses in Chemoinformatics
• University of Sheffield, UK (P. Willett)
• UMIST (Manchester), UK (H. Schofield)
• Indiana University, USA (G. Wiggins)
• University of Strasbourg, France (A. Varnek)
Introduction to Chemoinformatics
• Data management (databases) in chemistry
• Molecular modelling
• Structure-property relationships
• Virtual screening; in silico design of new compounds
Software to be studied
• Data management in chemistry: ChemFinder (ChemOffice), DIVA
• Molecular modelling: SPARTAN, Chem3D (ChemOffice), DS Studio
• Structure-property relationships; virtual screening; in silico design of new compounds: ISIDA, ChemDraw Ultra (ChemOffice), CODESSA
Storage, retrieval and management of chemical data
Databases in chemistry
Acquisition, storage, organization and manipulation of data
Representation of chemical structures
• Graphs; • Connectivity matrices; • Connection tables; • MOL, SDF, RDF, ... formats; • Structural keys
Existing databases
CAS, STN, Beilstein, Gmelin, CCDS, ...
Concept of similarity
• chemical space • Tanimoto criterion
Tutorial (TD): building databases with ChemFinder
Representation of chemical structures
How should chemical structures be stored?
How should structures be searched?
Representation of structures and reactions
1D • SMILES and SMARTS strings; • Structural keys, fingerprints
2D • Graphs; • Connectivity matrices; • Connection tables; • MOL, SDF, RDF, ... formats
3D • PDB format • Z-matrix
Representing a chemical structure
How much information do you want to include?
• atoms present
• connections between atoms
• bond types
• stereochemical configuration
• charges
• isotopes
• 3D coordinates for atoms
C8H9NO3
Topological Graph Theory
• branch of mathematics
• particularly useful in chemical informatics and in computer science generally
• study of "graphs", which consist of
  - a set of "nodes"
  - a set of "edges" joining pairs of nodes
Properties of graphs
• graphs are only about connectivity
• spatial position of nodes is irrelevant
• length of edges is irrelevant
• crossing edges are irrelevant
Properties of Graphs
• nodes and edges can be "coloured" to distinguish them
[Figure: structure diagram with atom labels OH, CH2, H2N, O, CH, OH]
Structure Diagrams as Graphs
• 2D structure diagrams are very like topological graphs
  - atoms ↔ nodes
  - bonds ↔ edges
• terminal hydrogen atoms are not normally shown as separate nodes ("implicit" hydrogens)
  - reduces the number of nodes by ~50%
  - "hydrogen count" information is used to colour the neighbouring "heavy" atom
• separate nodes are sometimes used for "special" hydrogens
  - deuterium, tritium
  - hydrogen bonded to more than one other atom
  - hydrogens attached to stereocentres
Advantages of using graphs
• mathematical theory is well understood
• graphs can be easily represented in computers
• many useful algorithms are known
• identical graphs ⇔ identical molecules
• different graphs ⇔ different molecules
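Connection table
[Figure: structure diagram of the example molecule with its 13 heavy atoms numbered 1 to 13]

Atom  Element  H count  Neighbours (atom number, bond order)
 1    O        1        (2, 1)
 2    C        0        (1, 1) (3, 2) (4, 1)
 3    O        0        (2, 2)
 4    C        1        (2, 1) (5, 1) (6, 1)
 5    N        2        (4, 1)
 6    C        2        (4, 1) (7, 1)
 7    C        0        (6, 1) (8, 2) (12, 1)
 8    C        1        (7, 2) (9, 1)
 9    C        1        (8, 1) (10, 2)
10    C        0        (9, 2) (11, 1) (13, 1)
11    C        1        (10, 1) (12, 2)
12    C        1        (11, 2) (7, 1)
13    O        1        (10, 1)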
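One possible way to hold the connection table above in a program is sketched below. The atom and bond data are those of the slide's 13-atom example; the names `ATOMS`, `BONDS`, `connection_table` and `molecular_formula` are illustrative choices, and storing each bond only once is an assumption about layout, not the format of any particular package.

```python
from collections import Counter

# Heavy atoms numbered as on the slide: atom number -> (element, implicit H count).
ATOMS = {
    1: ("O", 1), 2: ("C", 0), 3: ("O", 0), 4: ("C", 1), 5: ("N", 2),
    6: ("C", 2), 7: ("C", 0), 8: ("C", 1), 9: ("C", 1), 10: ("C", 0),
    11: ("C", 1), 12: ("C", 1), 13: ("O", 1),
}

# Each bond is stored once as (atom_i, atom_j, bond_order).
BONDS = [
    (1, 2, 1), (2, 3, 2), (2, 4, 1), (4, 5, 1), (4, 6, 1), (6, 7, 1),
    (7, 8, 2), (8, 9, 1), (9, 10, 2), (10, 11, 1), (11, 12, 2),
    (12, 7, 1), (10, 13, 1),
]


def connection_table(atoms, bonds):
    """Expand the bond list into per-atom neighbour lists (atom, bond order)."""
    table = {number: [] for number in atoms}
    for i, j, order in bonds:
        table[i].append((j, order))
        table[j].append((i, order))
    for neighbours in table.values():
        neighbours.sort()
    return table


def molecular_formula(atoms):
    """Formula in Hill order (C, H, then other elements alphabetically)."""
    counts = Counter(element for element, _ in atoms.values())
    counts["H"] += sum(h_count for _, h_count in atoms.values())
    first = [e for e in ("C", "H") if counts[e] > 0]
    rest = sorted(e for e in counts if e not in ("C", "H") and counts[e] > 0)
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}" for e in first + rest)


table = connection_table(ATOMS, BONDS)
print(table[7])                  # neighbours of ring atom 7
print(molecular_formula(ATOMS))
```

Keeping the hydrogen count on each heavy atom, rather than adding hydrogens as nodes, follows the "implicit hydrogen" convention described earlier in these slides.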
Distance matrix
[Figure: structure diagram of the example molecule with its 13 heavy atoms numbered 1 to 13]
Each entry is the number of bonds on the shortest path between two atoms; the diagonal shows the element symbol.

      1   2   3   4   5   6   7   8   9  10  11  12  13
 1    O   1   2   2   3   3   4   5   6   7   6   5   8
 2    1   C   1   1   2   2   3   4   5   6   5   4   7
 3    2   1   O   2   3   3   4   5   6   7   6   5   8
 4    2   1   2   C   1   1   2   3   4   5   4   3   6
 5    3   2   3   1   N   2   3   4   5   6   5   4   7
 6    3   2   3   1   2   C   1   2   3   4   3   2   5
 7    4   3   4   2   3   1   C   1   2   3   2   1   4
 8    5   4   5   3   4   2   1   C   1   2   3   2   3
 9    6   5   6   4   5   3   2   1   C   1   2   3   2
10    7   6   7   5   6   4   3   2   1   C   1   2   1
11    6   5   6   4   5   3   2   3   2   1   C   1   2
12    5   4   5   3   4   2   1   2   3   2   1   C   3
13    8   7   8   6   7   5   4   3   2   1   2   3   O
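A distance matrix of this kind can be derived from the bond list alone. The sketch below uses a breadth-first search from every atom, which is one standard way to obtain shortest path lengths in an unweighted graph; the slides do not say which algorithm was used. The bond list repeats the slide's 13-atom example, bond orders are ignored, and the function name `distance_matrix` is illustrative. The molecule is assumed to be a single connected graph.

```python
from collections import deque

# Bonds of the slide's example (atom numbers 1..13); bond orders do not matter here.
BONDS = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8), (8, 9),
         (9, 10), (10, 11), (11, 12), (12, 7), (10, 13)]


def distance_matrix(n_atoms, bonds):
    """Topological distances (number of bonds on the shortest path) for all atom pairs."""
    neighbours = {i: [] for i in range(1, n_atoms + 1)}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    matrix = []
    for start in range(1, n_atoms + 1):
        # Breadth-first search: atoms are reached in order of increasing distance.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in neighbours[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        matrix.append([dist[j] for j in range(1, n_atoms + 1)])
    return matrix


D = distance_matrix(13, BONDS)
print(D[0])   # row for atom 1; the diagonal entry is 0 instead of the element symbol
```

Here the diagonal holds 0 rather than the element symbol, which keeps the matrix purely numeric.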
Structure representations
Exchange formats (Molecular Design Limited (MDL))
*.MOL - 2D or 3D structure of a single molecule;
*.SDF (Structure-Data File) - several molecules + properties;
*.RXN - a single reaction (structures of reactants and products);
*.RDF (Reaction-Data File) - several reactions + properties
http://www.mdli.com/downloads/literature/ctfile.pdf
Structure representations: Structure-Data File (exchange standard)
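To give an idea of what a Structure-Data File contains, the sketch below writes and re-reads a tiny file with two records. It assumes the fixed-column V2000 layout as commonly described for the CTfile formats (three header lines, a counts line, an atom block, a bond block, `M  END`, data items introduced by `> <FIELD>`, and `$$$$` as record separator); the column positions should be checked against the CTfile documentation cited above. The function names, the coordinates and the `NAME` field are hypothetical, and only a small subset of the format is handled.

```python
def mol_block(name, atoms, bonds):
    """Minimal V2000 molfile text. atoms: (symbol, x, y, z); bonds: (i, j, order), 1-based."""
    lines = [name, "", ""]                                        # three header lines
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for symbol, x, y, z in atoms:
        lines.append(f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3s} 0" + "  0" * 11)
    for i, j, order in bonds:
        lines.append(f"{i:3d}{j:3d}{order:3d}  0")
    lines.append("M  END")
    return "\n".join(lines)


def sd_record(block, properties):
    """Append data items and the record terminator to a molfile block."""
    parts = [block]
    for field, value in properties.items():
        parts += [f"> <{field}>", str(value), ""]
    parts.append("$$$$")
    return "\n".join(parts)


def parse_record(lines):
    """Read atom symbols, bonds and data items from the lines of one record."""
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = [lines[4 + k][31:34].strip() for k in range(n_atoms)]
    first_bond = 4 + n_atoms
    bonds = [(int(l[0:3]), int(l[3:6]), int(l[6:9]))
             for l in lines[first_bond:first_bond + n_bonds]]
    properties, k = {}, first_bond + n_bonds
    while k < len(lines):
        if lines[k].startswith(">"):
            field = lines[k][lines[k].index("<") + 1:lines[k].rindex(">")]
            k += 1
            value = []
            while k < len(lines) and lines[k].strip():
                value.append(lines[k])
                k += 1
            properties[field] = "\n".join(value)
        k += 1
    return {"name": lines[0], "atoms": atoms, "bonds": bonds, "properties": properties}


def parse_sdf(text):
    """Split SDF text on '$$$$' lines and parse each record."""
    records, current = [], []
    for line in text.split("\n"):
        if line.strip() == "$$$$":
            records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    return records


# Two toy records (heavy atoms only, made-up coordinates).
sdf_text = "\n".join([
    sd_record(mol_block("mol_1", [("C", 0.0, 0.0, 0.0), ("O", 1.4, 0.0, 0.0)], [(1, 2, 1)]),
              {"NAME": "methanol"}),
    sd_record(mol_block("mol_2", [("C", 0.0, 0.0, 0.0), ("O", 1.2, 0.0, 0.0)], [(1, 2, 2)]),
              {"NAME": "formaldehyde"}),
]) + "\n"

records = parse_sdf(sdf_text)
for record in records:
    print(record["name"], record["atoms"], record["bonds"], record["properties"])
```

For real files an established cheminformatics toolkit is the safer choice, since the format has many optional fields that this sketch ignores.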
Structure representations
Bitstrings
• Structural keys: based on a library of predefined fragments
• Fingerprints: generated from the compound library itself
Bitstrings
• the fragments present in a structure can be represented as a sequence of 0s and 1s
  00010100010101000101010011110100
  - 0 means the fragment is not present in the structure
  - 1 means the fragment is present in the structure (perhaps multiple times)
• each 0 or 1 can be represented as a single bit in the computer (a "bitstring")
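The idea can be sketched in a few lines of Python. The fragment library below is hypothetical and far smaller than a real one, and the step that actually detects fragments in a structure (substructure searching) is not shown: the function simply takes fragment counts as input.

```python
# Hypothetical fragment library: bit k answers "is FRAGMENT_LIBRARY[k] present?".
FRAGMENT_LIBRARY = [
    "carboxylic acid", "primary amine", "hydroxyl", "aromatic ring",
    "aldehyde", "chlorine", "nitrile", "ester",
]


def structural_key(fragment_counts, library=FRAGMENT_LIBRARY):
    """Return a bitstring with one character per library fragment.

    fragment_counts maps fragment name -> number of occurrences in the structure.
    A fragment that occurs several times still sets its bit only once.
    """
    return "".join("1" if fragment_counts.get(name, 0) > 0 else "0" for name in library)


def pack(bitstring):
    """Store the bitstring as an integer, i.e. one real bit per fragment."""
    return int(bitstring, 2)


# A structure with one aromatic ring and two chlorine atoms (assumed input).
key = structural_key({"aromatic ring": 1, "chlorine": 2})
print(key)                           # one character per fragment
print(bin(pack(key)).count("1"))     # number of bits that are "on"
```

The two chlorine atoms set a single bit, which illustrates why a key records presence but not the number of occurrences.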
Structural Keys
Compounds are multi-domain:
• multiple occurrences of a key/substructure
• members of more than one chemical family
Information loss!
The "Similarity Principle":
Structurally similar molecules are assumed to have similar biological properties
⇓
Relevant descriptors show a correlation between descriptor similarity and biological similarity
What is similar?
Different "spaces", classified by:
• Shape
• Size
• Colour
• Pattern
16 diverse aldehydes...
[Figure: structure diagrams of 16 aldehydes]
...sorted by common scaffold
[Figure: the same 16 aldehydes, grouped by common scaffold]
...sorted by functional groups
[Figure: the same 16 aldehydes, grouped by functional groups]
Similarity from fingerprints
• similarity measures are most commonly calculated from structure fingerprints
• count the bits that are "on" in both molecules
• count the bits that are "on" in each molecule separately

struct A:  00010100010101000101010011110100   13 bits on (A)
struct B:  00000000100101001001000011100000    8 bits on (B)
A AND B:   00000000000101000001000011100000    6 bits on (C)

• the similarity coefficient can be calculated from A, B and C
[Figure: two overlapping sets A and B with intersection C]

Tanimoto coefficient
$\text{similarity} = \dfrac{C}{A + B - C} = \dfrac{6}{13 + 8 - 6} = 0.4$
• the number of bits set in both molecules divided by the number of bits set in either molecule
• the Tanimoto coefficient is the most commonly used similarity coefficient in chemical informatics
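The worked example above can be reproduced with a short Python sketch. The two bitstrings are the ones shown on the slide; the function name `tanimoto` and the choice to return 0.0 when neither structure has any bit set are assumptions made for this illustration.

```python
def tanimoto(bits_a: str, bits_b: str) -> float:
    """Tanimoto coefficient C / (A + B - C) for two equal-length bitstrings."""
    if len(bits_a) != len(bits_b):
        raise ValueError("bitstrings must have the same length")
    a = bits_a.count("1")                                         # bits on in A
    b = bits_b.count("1")                                         # bits on in B
    c = sum(1 for x, y in zip(bits_a, bits_b) if x == y == "1")   # bits on in both
    if a + b - c == 0:
        return 0.0   # convention chosen for this sketch: no bits set at all
    return c / (a + b - c)


STRUCT_A = "00010100010101000101010011110100"
STRUCT_B = "00000000100101001001000011100000"

print(STRUCT_A.count("1"), STRUCT_B.count("1"))   # A and B
print(tanimoto(STRUCT_A, STRUCT_B))
```

The denominator A + B - C is the number of bits set in at least one of the two structures, so the coefficient ranges from 0 (no bits in common) to 1 (identical bitstrings).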
Chemo- and bioinformatics: historical perspective
Drug development: discovery of active molecules
Industrial era: animal testing
• not rational • costly • slow
1 molecule in 500 is active; 1 in 10,000 becomes a drug
Drug development: discovery of active molecules
"Rational" era (from 1960-1970)
Disease → biological mechanisms involved → biological target (protein, DNA, RNA, ...) → "rational" design of molecules
Fields of expertise: molecular pharmacology (1960-70), molecular biology (1980), structural biology (1980-90), molecular modelling (1980-90).
Drug development: discovery of active molecules
"Screening" era (from 1990-1995)
target + molecules → screening (High Throughput Screening, HTS) → hit → lead → preclinical candidate
Chemoinformatics and bioinformatics
biological target + molecules → screening (High Throughput Screening, HTS) → hit → lead → preclinical candidate