Change Control in XML
Course: Semi-structured Data (Données semi-structurées)
DEA I3: Information, Interaction, Intelligence
Grégory Cobéna http://www-rocq.inria.fr/verso/
[email protected]
20/12/2002

Objectives
Understand the management of dynamic data
• At large scale: the case of a Web data warehouse (Xyleme)
• At the scale of the XML document: the case of version management

Motivations: at the scale of the Web
Change control is, first of all:
• Knowing how to discover data sources and XML documents, on the Web or on an Intranet
• Setting up monitoring of these documents over time
• Extracting knowledge about what changes: the documents, their properties, their content
DEA I3 - Données semi-structurées - Grégory Cobéna

Motivations: at the scale of the document
In which cases does the notion of change arise?
• When managing several documents, one studies inter-document changes
  Example: an XML file describing two car models, a Peugeot-307 and a 206
• When one is interested in the evolution of a given document over time
  Example: an XML file describing an address book

Challenges
Semi-structured data must provide a more precise description than plain text, with well-defined semantics.
Managing changes in semi-structured data is even more complex than in relational databases.

Course outline
Xyleme
• A large-scale XML data warehouse
• Integration of Web data
• Active monitoring of Web data
XML Diff
• Representation of changes
• Detection of changes

Part One: Xyleme
A Dynamic Warehouse for the XML data of the Web

Organization
1. The Web and XML
2. Xyleme
3. Data Acquisition and Maintenance
4. XML Repository, Semantic Data Integration and Query Processing
5. Query Subscription
Conclusion

(Part I: Xyleme) 1. The Web and XML

The Web today
Terabytes of data
A lot of public pages
• 1 billion in [06/2000]
• several millions of servers
Private web: not publicly available pages
Deep web: data hidden behind forms

Information System
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
Data + Structure

HTML = Hypertext Language
The X23 new camera replaces the X22. It comes equipped with a flash (worth by itself 53.99 $) and provides great quality for only 359.99 $.
Text + presentation
Where is the data ?

XML = Semistructured Data
[XML example: product elements with reference="X23" (camera, 359.99) and reference="R2D2" (Robot, 19350); the element markup did not survive extraction]
Semistructured: more flexible

XML : Tree Types
[Figure: tree rooted at product-table, with product children, each having designation, price, reference and description children]
Semantics and structure are in paths
• product-table/product/reference
• product-table/product/price
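
To make "semantics and structure are in paths" concrete, here is a minimal Python sketch that lists the root-to-leaf path of every value in a small document. The sample document is hypothetical: it only reuses the element names shown in the tree above, and the helper name `leaf_paths` is an illustration, not part of Xyleme.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document built from the element names in the tree above.
SAMPLE = """
<product-table>
  <product>
    <reference>X23</reference>
    <designation>Camera</designation>
    <price>359.99</price>
    <description>Comes with a flash</description>
  </product>
  <product>
    <reference>R2D2</reference>
    <designation>Robot</designation>
    <price>19350.00</price>
  </product>
</product-table>
"""


def leaf_paths(element, prefix=""):
    """Yield (path, text) for every leaf element, in document order."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    children = list(element)
    if not children:
        yield path, (element.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)


root = ET.fromstring(SAMPLE.strip())
pairs = list(leaf_paths(root))

# The path alone tells us what a value means: every price sits at the same path.
prices = [text for path, text in pairs if path == "product-table/product/price"]
print(prices)
```

The same path query works on the second product even though it has no description, which is the flexibility that semistructured data is meant to offer.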

(Part I: Xyleme) 2. A Dynamic Warehouse for the XML Data of the Web

Xyleme Research
Project Xyleme at INRIA (1999-2000): explore XML + Web + DBMS to make the Web a Knowledge Database
INRIA
• Sophie Cluet: Databases (OQL…)
• Serge Abiteboul: semi-structured data + web
• Guy Ferran: ex O2 Technology
Mannheim University
• Guido Moerkotte
Université d’Orsay
• Marie Christine Rousset
CNAM
• Dan Vodislav

Xyleme Company
Started September 2000 (25 employees end of 2001)
Market Challenges:
• Few XML documents available on the Web (because of weak software support)
• Company is focusing on private XML: Press, Editors, Financial Data, Biology…
Technology:
• Scalability for large amounts of data
• Internet (+focus) / Intranet support
• Monitoring and Version Management
• Heterogeneous Data Integration

Architecture
Cluster of PCs
Developed with Linux and C++
Communications
• local: Corba
• external: HTTP
Distribution between autonomous machines
Now Web Services

Functional Architecture
[Figure: User Interface and Web Interface, reached through the Internet, on top of the Xyleme Interface; below it the Change Control, Semantic Module and Query Processor components, the Acquisition Loader & Crawler, and the Repository and Index Manager]

(Part I: Xyleme) 3. Data Acquisition and Maintenance, Page Importance

Architecture
[Figure: below the Internet, machines for Change Control and Semantic Integration and for Acquisition and Maintenance, plus several machines each running Loader | Query over an Index and a Repository; all machines are connected by Ethernet]

Life Cycle of a page in Xyleme
The URL of D is discovered as a link in another page (or published by a customer)
The page scheduler decides to read D
• The metadata of D is read (type, last_date_update...)
• The document D is loaded
The document D is (re)read regularly

Goals
Discover XML pages on the web that are of interest for customers
• For this, crawl the web (HTML+XML)
Maintain them up to date
Do this under bounded resources:
• Memory for known URLs
• Bandwidth

Main Issues
Loading of pages
• we can load up to 5 million pages/day on a standard PC; the main cost is the Internet connection
Metadata management (access to disk)
Page scheduling
• decide which page to read or refresh next

Page Importance
Definition: important pages are linked to by important pages
Offline algorithm (used by Google)
Our Online algorithm (M. Preda, S. Abiteboul, G. Cobena)
• does not require maintaining graph information
• faster convergence with focused crawling
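
The circular definition of page importance ("important pages are linked to by important pages") is usually resolved as a fixed point. The sketch below is the classic offline power iteration over a fully stored link graph, written in Python for illustration; it is not the online algorithm mentioned in these slides, and the damping factor, iteration count and toy graph are assumptions.

```python
def page_importance(links, damping=0.85, iterations=50):
    """Offline power iteration. links maps a page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base importance...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # ...plus an equal share of the importance of each page linking to it.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its importance uniformly.
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank


# Toy graph (hypothetical): "c" is linked to by three pages, "d" by none.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_rank = page_importance(toy_links)
```

Note that this version needs the whole graph in memory for every iteration, which is exactly the cost the online variant is said to avoid.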

(Part I: Xyleme) 4. XML Repository: Semantic Data Integration and Query Processing

Querying Language
Today: a mix of OQL and XQL
We are currently moving to X-Query (which is also a mix of OQL and XQL…)
Select boss/Name, boss/Phone
From comp in BusinessDomain, boss in comp//Manager
Where comp/Product contains “Xyleme”

Web Heterogeneity
Semantic domains, e.g., cinema
Many possible types for data in this domain, many DTDs
Semantic Integration
• one abstract DTD for the domain
• gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD

Indexing
Standard inverted index
• word → documents that contain this word
Xyleme index
• word → elements that contain this word (document + element identifier)
Goal: more work can be performed without accessing data
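
The difference between the two indexes above can be sketched in a few lines of Python. This is an illustration under assumptions, not Xyleme's index: element identifiers are taken to be the element's position in document order, only text directly inside an element is indexed, and the two sample documents are invented.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_indexes(docs):
    """docs maps a document id to an XML string. Returns (doc_index, element_index)."""
    doc_index = defaultdict(set)      # word -> {doc_id}
    element_index = defaultdict(set)  # word -> {(doc_id, element_id)}
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # Assumption: an element is identified by its rank in document order.
        for element_id, element in enumerate(root.iter()):
            for word in re.findall(r"\w+", (element.text or "").lower()):
                doc_index[word].add(doc_id)
                element_index[word].add((doc_id, element_id))
    return doc_index, element_index


sample_docs = {
    "d1": "<film><title>Blade Runner</title><actor>Harrison Ford</actor></film>",
    "d2": "<film><title>Harrison</title><actor>Ford Coppola</actor></film>",
}
doc_index, element_index = build_indexes(sample_docs)

# Document level: both documents contain both words, so both must be fetched and checked.
same_document = doc_index["harrison"] & doc_index["ford"]
# Element level: the index alone shows that only one element contains both words.
same_element = element_index["harrison"] & element_index["ford"]
```

The second intersection answers the query without opening any document, which is the stated goal of indexing at the element level.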

I.4.1 Xyleme: Semantic Data Integration

Data Integration
One application domain -- Several schemas
• heterogeneous vocabulary and structure
Xyleme Semantic Integration
• gives the illusion that the system maintains a homogeneous database for each domain
• abstracts a set of DTDs into an abstract DTD = a hierarchy of pertinent terms for a particular domain

Technology in short
Cluster DTDs into application domains
• Business, culture, tourism, biology, …
For an application domain – semi-automatically
• Organize tags into a hierarchy of concepts using thesauri such as Wordnet and other linguistic tools
• This provides the abstract DTD for the particular domain
• Generate mappings between concrete DTDs and the abstract one

I.4.2 Xyleme: Query Processing

Xyleme Query Language
A mix of OQL and XQL; will use the W3C standard when there is one
Select product/name, product/price
From doc in catalogue, product in doc/product
Where product//components contains “flash”
and product/description contains “camera”

Principle of Querying
A query on the abstract DTD becomes a union of concrete queries (possibly with joins)
MAPPINGS between concrete and abstract DTDs
catalogue/product/price
⇒ d1//camera/price
⇒ d2/product/cost
catalogue/product/description
⇒ d1//camera/description
⇒ d2/product/info, ref
⇒ d2/description
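
As a rough illustration of how such mappings turn one abstract path into a union of concrete ones, the Python sketch below stores the mappings listed above in a dictionary and groups the concrete paths by source. The mapping strings are copied verbatim from the slide and treated as opaque; the function names and the idea of grouping by the leading "d1"/"d2" prefix are assumptions made for this example.

```python
# Mappings copied from the slide above; each concrete path is kept as an opaque string.
MAPPINGS = {
    "catalogue/product/price": [
        "d1//camera/price",
        "d2/product/cost",
    ],
    "catalogue/product/description": [
        "d1//camera/description",
        "d2/product/info, ref",
        "d2/description",
    ],
}


def concrete_paths(abstract_path):
    """Union of concrete paths for one abstract path (empty if nothing is mapped)."""
    return list(MAPPINGS.get(abstract_path, []))


def by_source(abstract_path):
    """Group the concrete paths by the concrete DTD they start with (assumed prefix)."""
    grouped = {}
    for path in concrete_paths(abstract_path):
        source = path.split("/", 1)[0]
        grouped.setdefault(source, []).append(path)
    return grouped


price_sources = by_source("catalogue/product/price")
description_sources = by_source("catalogue/product/description")
```

Grouping by source is one plausible way to decide which concrete collections need to be contacted for a given abstract query.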

Query Processing
1. Partial translation, from abstract to concrete, to identify “machines” with relevant data
2. Algebraic rewriting, linear search strategy based on simple heuristics: in priority, use in-memory indexes and minimize communication
3. Decomposition into local physical subplans and installation
4. Execution of plans
5. If needed, Relaxation

Query processing
Essential use of a smart index combining full-text and structure

I.4.2 Xyleme: Repository

Storage System: Xyleme Store
Efficient storage of trees in variable length records within fixed length pages
Balancing of tree branches in case of overflow
• minimize the number of I/O for direct access and scanning
• good compromise: compaction / access time

Tree Balancing in Xyleme Store
[Figure: Records 1 to 4 stored in pages, with two kinds of overflow: more children in another page, and a sub-tree in another page]

Questions ?

(Part I: Xyleme) 5. Change Control

The Web changes all the time
Data acquisition + maintenance
• keep the warehouse up-to-date
Version management
• representation and storage of change (see part II)
Change monitoring
• query subscription

Subscription Language
SQL-like language based on ‘atomic events’.
Combines the use of monitoring queries and continuous queries.
The language can be extended by adding new types of atomic events.
Uses the XML Query Language for continuous queries.
“Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report

Example
subscription myPaintings
% what are the new painting entries in Musee d’Orsay site
monitoring newPainting
select URL
where URL extends www.musee-orsay.fr/*
and contains “Monet”
% manage the changes in the expositions
continuous delta Exposition
select ... from ... where
when monthly
notify daily % send me a daily report
(The two conditions of the where clause are the atomic events.)

Step 1: Atomic Event Detection
[Figure: 5 million pages/day; each document d goes through the XML loader and the metadata manager; the document and its alerts (d/46, then d/46,67) are passed on to complex event detection]
atomic event 46: URL matches pattern www.musee-orsay.fr/*
atomic event 67: XML document contains the tag with the value “Monet”

Alerters
Each Alerter can be viewed as a plug-in that acts on a document flow.
All sorts of atomic events can be detected: URL pattern detection, Keywords, XPath expressions, Page rank…
Can be distributed.
Some advanced alerts are:
• Long string look-ups
• Finding XML Patterns (e.g. XPath)
• Comparing digital signature of text documents (copy tracker)

URL Patterns Detection (1)
Supported patterns
• URL | prefix* | *suffix
Using Hash Table: try all possible patterns
• Example: http://www.inria.fr/verso/index.html
  Test: http://www.inria.fr/verso/* http://www.inria.fr/*
Each test is in O(1), total test time is O(n), where n is the length of the URL

URL Patterns Detection (2)
Using a tree: navigate on the tree until a leaf is encountered
Example: [tree figure not recoverable]
Patricia Trees ?
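
The hash-table variant of URL pattern detection ("try all possible patterns") can be sketched as follows in Python. This is a minimal illustration, not the Xyleme alerter: the function name and the way subscriptions are split into three sets are assumptions, and because Python slicing copies the string, each probe here costs more than the O(1) quoted above, which would require an incremental hash.

```python
def matching_patterns(url, exact, prefixes, suffixes):
    """Return the subscribed patterns that `url` matches.

    exact:    set of full URLs
    prefixes: set of strings P standing for the pattern 'P*'
    suffixes: set of strings S standing for the pattern '*S'
    """
    found = []
    if url in exact:
        found.append(url)
    # Try every prefix of the URL against the hash table of prefix patterns.
    for end in range(1, len(url) + 1):
        if url[:end] in prefixes:
            found.append(url[:end] + "*")
    # Same idea for suffix patterns.
    for start in range(len(url)):
        if url[start:] in suffixes:
            found.append("*" + url[start:])
    return found


subscribed_prefixes = {
    "http://www.inria.fr/verso/",
    "http://www.inria.fr/",
    "http://www.musee-orsay.fr/",
}
hits = matching_patterns(
    "http://www.inria.fr/verso/index.html",
    exact=set(),
    prefixes=subscribed_prefixes,
    suffixes={".html"},
)
```

The number of probes depends only on the length of the URL, not on the number of subscribed patterns, which is what makes the approach usable with millions of subscriptions.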

Keywords Sequence Algorithm
Detect: « Air France »
Solution:
• a Tree of backward keyword sequences
• a context memory with O(1) update cost
Tree is implemented over a hash table

Simple XPath filtering Algorithm
Problem:
• detect CONTAINS « word »
Solution:
• Reverse path expression
• Use postfix order
• Use a stack for ‘//’ and another stack for ‘/’

Simple XPath filter example: Understanding the tree structure in postfix order
Consider the tree whose deepest text node is toto. Nodes come as:
toto (id=1, level=4)
C (id=2, level=3)
C (id=3, level=3)
B (id=4, level=2)
A (id=5, level=1)

Simple XPath: Example
CONTAINS toto is detected by:
• « toto »::ancestor
When « toto » is detected, it is stored
For each ancestor of « toto », its name is compared to the name expected by the path expression.
All tests are executed using a hash table

Stemming
On the Alerter
• Example: Éléphant –> ELEPHANT
• Do it for 500 documents / second
• Noise may be introduced (Example: tâche = tache)
On the Subscription Manager
• To avoid duplicate registration of similar events
• To show the user how his query is stemmed
Real stemmers: chevaux -> cheval

Step 2: Complex Event Detection
[Figure: the HTML parser and the XML loader send millions of alerts per day to complex event detection, which holds millions of subscriptions]
complex event 12: 67 & 46 (XML document contains the tag with value “Monet” and URL matches pattern www.musee-orsay.fr/*)
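
A complex event such as 12 = 67 & 46 is a conjunction of atomic events, so detection amounts to finding every registered conjunction that is fully covered by the atomic events raised for one document. The Python sketch below uses a simple counting scheme over an inverted index; it is a naive baseline written for illustration, not the algorithm used in Xyleme, and the class and method names are invented.

```python
from collections import defaultdict


class ComplexEventDetector:
    """Naive conjunction matcher: count, per complex event, how many of its atomic events fired."""

    def __init__(self):
        self.size = {}                      # complex id -> number of atomic events it requires
        self.by_atomic = defaultdict(list)  # atomic id -> complex ids that mention it

    def register(self, complex_id, atomic_ids):
        atomic_ids = set(atomic_ids)
        self.size[complex_id] = len(atomic_ids)
        for atomic_id in atomic_ids:
            self.by_atomic[atomic_id].append(complex_id)

    def detect(self, detected_atomic_ids):
        """Return the complex events whose atomic events were all detected on one document."""
        hits = defaultdict(int)
        for atomic_id in set(detected_atomic_ids):
            for complex_id in self.by_atomic.get(atomic_id, ()):
                hits[complex_id] += 1
        return sorted(c for c, count in hits.items() if count == self.size[c])


detector = ComplexEventDetector()
detector.register(12, [67, 46])   # the example above: tag value "Monet" AND URL pattern
detector.register(13, [46])       # hypothetical: URL pattern only
detector.register(14, [67, 99])   # hypothetical: needs an event that never fires below

fired = detector.detect([46, 67])
```

The work per document is proportional to the number of (atomic event, subscription) pairs touched, which is why the number of atomic events detected per document matters for throughput.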

Complex Events Algorithm
The formal problem is NP-hard
We proposed several possible algorithms
Experimental (simulation) values proved the effectiveness of our solutions
The Hash-Tree based algorithm is well suited for our application:
• 10 million Complex Events
• 1 million Atomic Events
• 100 Atomic events detected per document
0.8 ms to process a document.
~2 million documents per day (on each PC).

Step 3: Notification Processor
[Figure: alerts flow into complex event detection, which sends notifications (monitoring) to the Reporter; a clock fires triggers for the continuous queries, whose results are also sent to the Reporter as notifications; millions of notifications/day]

Architecture
[Figure: Monitoring Applications, Xyleme Query Processor, Trigger Engine, Complex Event Detection, Xyleme Alerter (fed with documents), Xyleme Reporter, Xyleme Subscription Manager (SQL) and Web Browser]

Copy tracking
Example: a press agency wants to check that people are not illegally publishing copies of their wires
Need to react fast on changes: an illegal copy of the wire may last only a couple of days
[Figure, steps 1 to 3: query to search engine, or specific crawl + pre-filter; flow of candidate documents; slice the document; filter; detection]

Web portal management
Standard portal management
• Unreachable pages
• Dangling pointers
• Incorrect pages (e.g., do not parse)
• Detection of interesting pages on the web
• Etc.
Portal archiving
Subscription and notification

Web surveillance
Applications
• Anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisition of chemical products
• Business intelligence, e.g., discovering potential customers, partners, competitors
Find the data (crawl the web)
Monitor the changes
• new pages, deleted pages, changes in a page
Classify information and extract data of interest
• Data mining, text understanding, knowledge representation and extraction, linguistics… Very AI

Conclusion & Prospectives
Focus crawling on important pages
• Refine notion of importance
• Improve important pages discovery
Improve Change control accuracy
• Semantic web
• Real-time advanced processing

Questions ?

Part Two: XML Diff

(Part II: XML Diff) 1. Detecting Changes in XML Documents
Grégory Cobéna, Serge Abiteboul, Amélie Marian
INRIA Rocquencourt, Columbia University

Objectives:
Versions
• Temporal Queries (persistent identification of nodes)
• Version some documents or some sites (store a ‘delta’)
• Change Monitoring (query changes)
We proposed a representation of changes
“Change-Centric Management of Versions” (VLDB 2001)
We developed a Diff algorithm for XML
“Detecting Changes in XML Documents”, G. Cobena, S. Abiteboul, A. Marian, ICDE 2002 (San Jose)

Introduction
Algorithms for detecting changes in XML documents
• An XML Diff algorithm
• A comparative study for XML change detection

Overview
Plan
Motivations
State of the art
Change model
Algorithm
• Tradeoff ‘quality’ versus speed
• Quasi linear time and space complexity
Experiments
• Synthetic and real world experiments

Motivations
Monitoring XML data on the Web
B. Nguyen, S. Abiteboul, G. Cobéna, M. Preda, SIGMOD2001
Change-centric management of versions in an XML warehouse
A. Marian, S. Abiteboul, G. Cobéna, L. Mignet, VLDB2001
Learning about changes
Architecture and requirements ( ‘speed’ )
Multiple optimality criteria ( ‘quality’ )

II.1.1 XML Diff
What is a diff ?
Unix Diff: shows the different lines between two text files
String Diff: shows which symbols have changed
XML Diff: which parts of the tree have been modified, inserted or deleted
In fact, all these problems are very similar

The String Edit Problem
Consider string: abcdefg
How to transform it into: bczdeyz ?
Possible solutions
• delete all 7 chars and insert 7 other chars
• Update a into b, b into c, c into z, f into y, g into z
• Mix both solutions
Question: What is the shortest edit sequence?

Solving the String-Edit-Problem
If we know how to transform S1 into S2, then we know how to transform:
• S1x into S2y
Conversely, to find the shortest path for transforming S1x into S2y, it is sufficient to compare the following transformations:
• S1 into S2, then x into y
• S1x into S2, then insert y
• delete x, and then S1 into S2y
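
The three-way comparison above translates directly into a dynamic program over prefixes. The Python sketch below fills a matrix `m` where `m[x][y]` is the cheapest way to turn the first `x` characters of `s1` into the first `y` characters of `s2`. Unit costs and a zero cost for "updating" a character into itself are assumptions of this example.

```python
def edit_distance(s1, s2, insert_cost=1, delete_cost=1, update_cost=1):
    """Cheapest edit sequence cost from s1 to s2, in O(len(s1) * len(s2)) time and space."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]
    # Borders: turning a prefix into the empty string (or the reverse).
    for x in range(1, rows):
        m[x][0] = m[x - 1][0] + delete_cost
    for y in range(1, cols):
        m[0][y] = m[0][y - 1] + insert_cost
    for x in range(1, rows):
        for y in range(1, cols):
            same = s1[x - 1] == s2[y - 1]
            m[x][y] = min(
                m[x - 1][y - 1] + (0 if same else update_cost),  # S1 into S2, then x into y
                m[x - 1][y] + delete_cost,                        # delete x, then S1 into S2y
                m[x][y - 1] + insert_cost,                        # S1x into S2, then insert y
            )
    return m[-1][-1]


print(edit_distance("abcdefg", "bczdeyz"))
```

For the running example, the best script mixes both kinds of solutions: delete a, insert z after c, then update f into y and g into z.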

String Edit Problem
The algorithm
Two strings S1 and S2
Cost(x,y) represents the shortest edit cost to transform S1[1..x] into S2[1..y]
The cost is the sum of individual costs for each edit operation (insert, delete, update)
Then, cost(x,y) is the min of:
• Cost(x-1,y-1)+update_cost(S1[x],S2[y])
• Cost(x-1,y)+delete_cost(S1[x])
• Cost(x,y-1)+insert_cost(S2[y])

A Quadratic Solution
The solution is to represent all possible paths on a matrix: M[0..|S1|][0..|S2|]
• M[x,y] represents the cost of transforming S1[1..x] into S2[1..y]
• M[x,y] can be computed using M[x-1,y-1], M[x-1,y] and M[x,y-1]
• M[0,i] and M[i,0] are obvious
• Thus, M[|S1|,|S2|] can be computed
Note that the number of paths is exponential, but the cost remains quadratic.
Time and Space cost is O(|S1|*|S2|)

State of the art (1): the string edit problem
Best result is an O(|s|^2 / log s) solution over a finite alphabet
O(|x|*|y|) solution with a Directed Acyclic Graph
[Figure: edit graph from source string (A, C) to destination string (B, C), with edges labelled delete C (cost=1), insert C (cost=1) and do nothing (cost=0)]

Questions?

Extending the String problem
Compute M[x,y] only close to the diagonal (E. Myers)
• Finds the solution in O(n*D) where n is the size of the largest string, and D the distance between the two strings
Adapt M[x,y] to work on trees (S. Chawathe)
• Remove some edges to ensure that deleting a node will delete the subtree rooted at that node (and conversely for insert)

State of the art (2): the tree pattern matching for XML
Kuo-Chung Tai, Lu, Selkow: based on string edit problem
• in XML, many labels are identical
Unix Diff, Sun DiffML
LaDiff (MH-Diff): Chawathe, Rajaraman, Garcia-Molina, J. Widom
• matching criteria to compare nodes and subtrees
• quadratic in the ‘distance’ between both trees
IBM diff available at alphaworks

Data Model
Issue: Persistent identification of nodes
[Figure: Catalog, Version 1, with three products Pr, each with a name N and a price P: (Camera, 300), (TV, 100), (VCR, 200); Catalog, Version 2, with (TV, 100), (DVD, 500), (VCR, 150)]

Change Model
Attach persistent identifiers:
• to every node = XID
• to the document = XID-map
Represent changes with a Delta
• Delta = Set of changes
• Nice mathematical properties
Change-centric Management of versions, VLDB2001

Algorithm: Intuition
[Figure: the two versions with an XID on every node. In Version 1 the Camera subtree carries XIDs 1-5, the TV subtree 6-10, the VCR subtree 11-15 and Catalog 16; in Version 2 the new DVD subtree carries XIDs 17-21. The Camera subtree is marked Delete, the VCR price Update, the DVD subtree Insert]
Version 1 XID-map: (1-16|17)
Diff (V1,V2)
delete(5)
update(13,150)
insert(16,2,(17-21))
New XID-map: (6-10,17-21,11-16|22)

Representation of Changes: Example
[XML delta; the markup did not survive extraction. Remaining values: Camera $300, DVD $400, $200 and $150]

II.1.2 XML Diff
The XyDiff Algorithm

Objectives
Assign persistent identifiers by matching nodes
Compute a representation of changes between the two documents
Also
• Constraint-Awareness: follow DTD specifications
• Correctness: no change is missed

Phase 1: Identify Subtrees
One traversal of the tree
Use ID-attributes from DTD to match nodes (or forbid matching)
Compute for every subtree
• Signature
• Weight

Phase 2: Bottom Up+Lazy Down Propagation
Let L be the list of all subtrees in the second document
For each subtree S in L (in decreasing weight order)
• 1/ Find all identical subtrees in the first document
• 2/ Select acceptable matches
• 3/ If at least one match, choose the best candidate
• 4/ Propagate [Very Carefully] matching to parents and ancestors
Remove S and all its subtrees from L
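
Phases 1 and 2 can be sketched as follows in Python. This is a heavily simplified illustration and not the XyDiff implementation: it ignores ID attributes, takes the first candidate instead of selecting an acceptable one, and does not propagate matches to ancestors. The weight formula (1 per element, plus 1 + log(length) for its text) and the use of SHA-1 as the signature are assumptions of this example.

```python
import hashlib
import math
import xml.etree.ElementTree as ET


def annotate(element, table):
    """Phase 1 (sketch): one postorder pass storing (signature, weight) for every subtree."""
    text = (element.text or "").strip()
    weight = 1.0 + ((1.0 + math.log(len(text))) if text else 0.0)  # assumed weight formula
    digest = hashlib.sha1(element.tag.encode() + b"\0" + text.encode())
    for child in element:
        child_signature, child_weight = annotate(child, table)
        digest.update(child_signature)
        weight += child_weight
    table[element] = (digest.digest(), weight)
    return table[element]


def match_identical_subtrees(old_root, new_root):
    """Phase 2 (sketch): match identical subtrees, heaviest subtrees of the new document first."""
    old_table, new_table = {}, {}
    annotate(old_root, old_table)
    annotate(new_root, new_table)
    by_signature = {}
    for element in old_root.iter():
        by_signature.setdefault(old_table[element][0], []).append(element)
    matches, used = {}, set()  # matches: new element -> old element
    for element in sorted(new_root.iter(), key=lambda e: -new_table[e][1]):
        if element in matches:  # already covered by a heavier matched subtree
            continue
        candidates = [c for c in by_signature.get(new_table[element][0], []) if c not in used]
        if candidates:
            # Equal signatures mean equal shape, so the two traversals line up node by node.
            for new_node, old_node in zip(element.iter(), candidates[0].iter()):
                matches[new_node] = old_node
                used.add(old_node)
    return matches


# Hypothetical catalog documents, using Pr / N / P for product, name and price.
old_doc = ET.fromstring(
    "<Catalog><Pr><N>Camera</N><P>300</P></Pr><Pr><N>TV</N><P>100</P></Pr>"
    "<Pr><N>VCR</N><P>200</P></Pr></Catalog>")
new_doc = ET.fromstring(
    "<Catalog><Pr><N>TV</N><P>100</P></Pr><Pr><N>DVD</N><P>500</P></Pr>"
    "<Pr><N>VCR</N><P>150</P></Pr></Catalog>")
matched = match_identical_subtrees(old_doc, new_doc)
```

On these two documents only the unchanged TV product and the VCR name are matched by signature; matching the VCR product itself and the Catalog root is the job of the propagation step that this sketch leaves out.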

Select Acceptable Match
Key aspect: the weight of trees
Look-up
• Use locality (e.g. find matching ancestors) to avoid wrong matches
• Two small trees are matched if some ancestors are matching. For large trees, further look-up is accepted.
Propagation
• Propagation should try not to induce wrong matches. Intuition is that large matching subtrees are more relevant
• The larger the tree, the more we propagate the matching to ancestors.

Algorithm: Tuning
Definition of weight affects both speed and quality
Look-up and Propagation distances
• Choice affects speed and quality
Trade-off Quality vs. Speed
We exhibit in the paper some bounds that guarantee linear time complexity

Phase 3: Optimization
Some nodes are unmatched after previous phases
Use previous results to propagate matching [now a bit less carefully]
First, bottom-up: propagate matching to ancestors
Then a quick top-down pass: propagate matching to descendant nodes based on element names if no ambiguity

Phase 4: Construct the delta
Find inserted/deleted nodes
Find “easy” move operations: parent node changed
Find “complex” move = reordering children
• Largest common subsequence (weight)
• Ex: A, B, C, D, E, F becomes E, D, A, B, C, F
  Largest common subsequence is A, B, C, F; nodes D and E are ‘moved’
• Complexity is quadratic
• We approximate the solution in linear time
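
The reordering example of Phase 4 can be checked with a plain longest-common-subsequence computation. The Python sketch below is the textbook quadratic dynamic program on unweighted items; it is neither the weighted variant nor the linear-time approximation mentioned above, and is shown only to make the notion of "moved" children concrete.

```python
def longest_common_subsequence(a, b):
    """Textbook O(len(a) * len(b)) dynamic program, returning one longest common subsequence."""
    rows, cols = len(a) + 1, len(b) + 1
    length = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Walk back through the table to recover the subsequence itself.
    result, i, j = [], len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i, j = i - 1, j - 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


old_children = list("ABCDEF")
new_children = list("EDABCF")
kept_in_place = longest_common_subsequence(old_children, new_children)
moved = [child for child in old_children if child not in kept_in_place]
print(kept_in_place, moved)
```

Children in the common subsequence keep their relative order, so only the remaining ones need a move operation in the delta.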

Complexity: n*log(n)
Phase 1 (identification) is one traversal of the tree
Phase 2 (propagation) is n times ‘get best candidate’ in the worst case
• Look-up level is designed to have ‘get best candidate’ cost in O(log(n))
• uses some pre-computed indexes
Phase 3 (optimization) is designed to be linear
Phase 4 (delta construction) is linear
• longest common subsequences of children is approximated

Experiments
• Simulator of changes on XML documents
• Speed and Quality evaluation on synthetic data
• Comparison with Unix Diff on web data

Simulator of changes
Parameters control the number of delete/insert/move
Input: XML document D
Outputs:
• delta of changes
• XML document D’=delta(D)

Synthetic Data: Speed of the algorithm
Experimental verification that the algorithm is quasi-linear
[Figure: running time against document size; annotation “Typical Pattern”]

Synthetic Data: Quality of the algorithm
Start from document D
– Generate changes over D: D’ = delta(D)
– Give (D, D’) to XyDiff and compute a delta
[Figure: size ratio of the diff over the original delta, between 0 and 4; annotation “Typical Pattern”]
Size of the computed delta is comparable to ‘original’ delta size
For large deltas, XyDiff finds more efficient operations

Comparison of the size of results: XyDiff vs. UnixDiff
Experiments on 10,000 XML web documents that changed
• at the time of that experiment, we had to crawl 10 million web pages to find them ☺
80% of the documents below the size of (UnixDiff*1.2)
Almost all below UnixDiff*2
Of course, the delta of XyDiff contains much more information

Conclusion
A novel algorithm for XML diff in quasi linear time
XML specificities are used to improve quality
Available as Open Source freeware at:
http://www-rocq.inria.fr/~cobena/XyDiffWeb/

Perspective
Larger scale experiments on web data
Learn about changes:
• Frequency, patterns, …
• Obtain statistics for DTD and XMLSchema
• Use the statistics to learn about changes and improve XyDiff for typed XML data
Use XML diff to observe changes between websites

XyDiff in Xyleme Architecture
[Figure: Web Crawler, XML Loader, XyDiff and Alerter in sequence, with Storage holding V(n) of the XML document and Delta(V(n-1),V(n))]

Questions ?

(Part II: XML Diff) A Comparative Study of Change Detection in XML
Grégory Cobéna (INRIA), Talel Abdessalem (ENST), Yassine Hinnach (ENST)

Context
Consider change-control in XML data warehouses.
We want to understand changes
We have only the old and new version of documents
A diff needs to be computed

Organization
Motivations
Data Model
Representing Changes
• Version Management and Querying
• Comparison of Change representation models
• Experiments
Detecting Changes
• State of the art in change detection
• Performance analysis and experiments
• Quality analysis and experiments
Summary

Motivations: Representing Changes
Version management, which means that the representation should allow for effective storage strategies
Temporal Databases: the support for persistent identification of nodes is mandatory
Monitoring: information about changes is used to support triggers or detect events
Note: HTML or XHTML documents may be used

Motivations: Detecting Changes
Correctness: the diff programs miss no changes
Minimality of the result is important to save storage space and network bandwidth
Semantics: some algorithms consider more semantics in XML documents
Performance: with dynamic services and/or large amounts of data, high speed and low memory usage are mandatory
‘Move operations’: some algorithms support move operations whereas others don’t. This impacts both the performance of the tool and the quality of results.

II.2.1 XML Diff
Comparing Data Models

Data Model (quick overview)
Operations are:
• (i) insert, delete applied to leaves or subtrees
• (ii) update of text nodes
• (iii) move applied to a subtree root, moving the entire subtree
An edit cost is assigned to each operation. Usually, the cost is 1 per node touched
The semantics of move is to identify subtrees even when their context has changed.
We use the notion of mapping between the two trees. Each node in document A (or B) that is not deleted (or inserted) is matched to the corresponding node in B (or A).
II.2.2 XML Diff Representing Changes
Data Model: Intuition
Tai’s model: delete ‘b’
Selkow’s model: delete ‘b’
[Figure: the same tree, root with children a, b, c, where b has children x and y, shown once for each model]
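
The two delete semantics can be contrasted on a small tree. In the Python sketch below, Tai's model removes a single node and re-attaches its children to its parent, while Selkow's model removes the whole subtree. The dictionary-based tree representation and the example tree root(a, b(x, y), c), chosen to resemble the figure, are assumptions made for illustration.

```python
def tree(label, *children):
    """Tiny tree node: a label and an ordered list of children."""
    return {"label": label, "children": list(children)}


def delete_tai(node, label):
    """Tai's model: the deleted node's children take its place under its parent."""
    new_children = []
    for child in node["children"]:
        reduced = delete_tai(child, label)
        if child["label"] == label:
            new_children.extend(reduced["children"])
        else:
            new_children.append(reduced)
    return {"label": node["label"], "children": new_children}


def delete_selkow(node, label):
    """Selkow's model: deleting a node deletes the whole subtree rooted at it."""
    kept = [delete_selkow(c, label) for c in node["children"] if c["label"] != label]
    return {"label": node["label"], "children": kept}


def show(node):
    """Render a tree as label(child, child, ...)."""
    if not node["children"]:
        return node["label"]
    return node["label"] + "(" + ", ".join(show(c) for c in node["children"]) + ")"


example = tree("root", tree("a"), tree("b", tree("x"), tree("y")), tree("c"))
after_tai = show(delete_tai(example, "b"))
after_selkow = show(delete_selkow(example, "b"))
print(after_tai, after_selkow)
```

Under Tai's model one operation costs one node, whereas under Selkow's model the same "delete b" touches three nodes, which changes what a minimal edit script looks like.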

Representing Changes
Version Management
• There are several version management strategies. For instance, when only deltas are stored, their size must be reduced
• We also consider the performance of reconstructing a document given the delta and the previous document. It is linear in all cases.
• A simple text-based version management is possible but cannot be used for querying.
Querying Changes
• Labeling nodes by prefix+postfix identifiers improves querying algorithms
• Labeling nodes with persistent identifiers improves temporal databases
• There is no short labeling scheme that is good for both

Our Example
[Two versions of an XML document; the markup did not survive extraction]
Old version: Notebook 2200MHz Pentium4 $1999; Digital Camera Fuji FinePix 2600Z Not Available
New version: Notebook 2200MHz Pentium4 $1999; Digital Camera Fuji FinePix 2600Z $299
Change Models: XUpdate (Example)
[Example listing not recoverable; it contains the value $299 and is annotated "XPath expression".]
Change Models: DeltaXML (Example)
[Example listing not recoverable; annotation: same look’n’feel as the document.]
Change Models: XyDelta (Example)
[Example listing not recoverable; it contains the values "Not Available" and $399.]
Change Models: Microsoft XDL (Example)
[Example listing not recoverable; it contains the value $299 and is annotated "Verify consistency" and "Identify nodes".]
Other annotations on the example slides:
• What is the parent node?
• element node
• Persistent identifiers
• The order is important (no ids, no move)
• mentions some unchanged nodes
Storage Experiments: Comparing Delta Size
[Figure: File Size against Edit Cost for XyDelta and DeltaXML. Annotation: identifiers save space when there are few updates.]
Change Models: Conclusion
Still missing for all of them
• A framework for querying
Nice features that some are missing
• Validation by a DTD (may be a problem for DeltaXML, XyDelta)
• Verify the source document (only XDL)
• Support of ‘move’ operations (only XyDelta and XDL)
• Backward deltas (only XyDelta)
• Monitoring the delta (only XUpdate and DeltaXML)
Unique advantages of XyDelta
• A formal model and nice mathematical properties
• Persistent identification of nodes (at least as an option)
Summary
• Change monitoring is easier with DeltaXML and XUpdate
• Temporal queries are easier to evaluate with XyDelta (persistent identifiers)
Future work:
• It is not yet clear how to query changes
• Define transaction or synchronization protocols
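The idea of a backward delta, combined with persistent identifiers, can be sketched as follows. This is a hypothetical model, not the XyDelta format: the document is reduced to a mapping from node identifiers to text, the identifier 7 is invented, and an update records both the old and the new value so that it can be inverted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    node_id: int  # persistent identifier of the text node (hypothetical numbering)
    old: str
    new: str

    def inverse(self):
        # Swapping old and new yields the operation that undoes this one.
        return Update(self.node_id, self.new, self.old)


def apply_delta(document, delta):
    """Apply updates to a {node_id: text} mapping, checking the expected old value."""
    result = dict(document)
    for op in delta:
        if result.get(op.node_id) != op.old:
            raise ValueError(f"node {op.node_id}: unexpected content")
        result[op.node_id] = op.new
    return result


def invert(delta):
    """Backward delta: inverse operations, applied in reverse order."""
    return [op.inverse() for op in reversed(delta)]


v1 = {7: "Not Available"}
forward = [Update(7, "Not Available", "$399")]
v2 = apply_delta(v1, forward)
print(v2)                                # {7: '$399'}
print(apply_delta(v2, invert(forward)))  # {7: 'Not Available'}
```

Because the identifier stays attached to the node across versions, the same operation can be replayed in either direction without recomputing positions.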
II.2.3 Detecting Changes
State of the art
Tree-to-tree correction algorithms:
• Based on the String Edit Problem (1966)
• Find the Minimum Edit Script in O(m*n) time and space, where m and n are the sizes of the two documents
Other algorithms:
• Run in linear time or close
• Match nodes or subtrees depending on their content
Experiments: Speed of several algorithms
[Figure]
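The O(m*n) bound quoted under State of the art comes from dynamic programming over all pairs of prefixes. The sketch below shows the classical string version only (unit costs for insert, delete and substitute); it is not one of the tree-to-tree algorithms compared in this course, but those extend the same table-filling idea to trees.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]; (m+1)*(n+1) cells.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete the first i characters of a
    for j in range(n + 1):
        dist[0][j] = j  # insert the first j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete a[i-1]
                dist[i][j - 1] + 1,                 # insert b[j-1]
                dist[i - 1][j - 1] + substitution,  # keep or substitute
            )
    return dist[m][n]


print(edit_distance("kitten", "sitting"))  # 3
```

Both the time and the memory grow with the product of the two sizes, which is what makes such algorithms costly on large documents.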
Algorithms: Overview
[Figure: a source tree ("From") and a target tree ("To"); node labels not recoverable.]
• The cheapest choice would be to move the two elements concerned (cost=2). But finding the best script with ‘move’ operations is NP-hard.
• The minimum edit script without ‘move’ consists in deleting the two elements and then inserting them (cost=4) (MMDiff).
• Preprocessing often consists in mapping identical subtrees. In this case, an additional ‘move’ operation will be needed (cost=5).
Experiments: Quality (measured by the Edit Cost)
[Figure]
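The preprocessing step that maps identical subtrees can be sketched with content hashes. This is an illustration under simple assumptions (trees as nested tuples, SHA-1 over the label and the child hashes) and is not the matching procedure of any of the tools discussed here.

```python
import hashlib


def subtree_hash(node, index):
    """Return a content hash for `node` and record it in `index` (hash -> nodes)."""
    name, children = node
    digest = hashlib.sha1(name.encode("utf-8") + b"\x00")
    for child in children:
        # A node's hash depends on its label and on its children, in order.
        digest.update(subtree_hash(child, index).encode("ascii"))
    key = digest.hexdigest()
    index.setdefault(key, []).append(node)
    return key


def identical_subtrees(old, new):
    """Pair up subtrees of `old` and `new` whose content is identical."""
    old_index, new_index = {}, {}
    subtree_hash(old, old_index)
    subtree_hash(new, new_index)
    return [(old_index[h][0], new_index[h][0]) for h in old_index if h in new_index]


# Hypothetical documents: the two children of the root are swapped.
old = ("root", [("a", [("x", [])]), ("b", [])])
new = ("root", [("b", []), ("a", [("x", [])])])
for old_node, new_node in identical_subtrees(old, new):
    print(old_node[0])  # x, a, b (the root itself differs because of child order)
```

Each tree is hashed in a single traversal, which is how such matching stays close to linear time; the price is that the resulting script is no longer guaranteed to be minimal.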
Experiments: Speed (focus on DeltaXML)
[Figure]
Comparison summary
• Many other algorithms exist that offer no advantages
• MMDiff is the reference for quality
• DeltaXML and XyDiff are good quality/performance compromises, but the performance of XyDiff is more regular
• Performance measurements for Microsoft's tool should be available soon; it seems comparable in performance to DeltaXML
Other issues
Constrained Diff is often interesting:
• Using ‘keys’ to match specific nodes (e.g. DeltaXML)
• Using XMLSchema or DTD information
• Time-constrained diff (e.g. XyDiff)
Postprocessing of results?
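Matching nodes through ‘keys’ can be sketched with the Python standard library. The markup below (`catalog`, `product`, `id`, `price`) and the third product are hypothetical, and the function is not DeltaXML's interface: it only shows that once a key attribute identifies elements, matching no longer depends on their position.

```python
import xml.etree.ElementTree as ET


def match_by_key(old_xml, new_xml, tag, key):
    """Match <tag> elements of two documents by the value of their `key` attribute."""
    # Assumes every <tag> element carries the key attribute.
    old_items = {e.get(key): e for e in ET.fromstring(old_xml).iter(tag)}
    new_items = {e.get(key): e for e in ET.fromstring(new_xml).iter(tag)}
    deleted = sorted(old_items.keys() - new_items.keys())
    inserted = sorted(new_items.keys() - old_items.keys())
    # Matched elements count as updated when their serialized content differs.
    updated = sorted(
        k for k in old_items.keys() & new_items.keys()
        if ET.tostring(old_items[k]) != ET.tostring(new_items[k])
    )
    return deleted, inserted, updated


OLD = ('<catalog><product id="p1"><price>$1999</price></product>'
       '<product id="p2"><price>Not Available</price></product></catalog>')
NEW = ('<catalog><product id="p2"><price>$299</price></product>'
       '<product id="p1"><price>$1999</price></product>'
       '<product id="p3"><price>$49</price></product></catalog>')

print(match_by_key(OLD, NEW, "product", "id"))  # ([], ['p3'], ['p2'])
```

Here the reordering of p1 and p2 is not reported as a change at all; whether that is desirable depends on whether order is meaningful in the document.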
What’s next?
Representing Changes:
• Unify and improve existing features
• Support Queries!
• Chain versions?
Change Detection:
• We are currently working on Microsoft’s XML Diff
• Use XMLSchema (or DTD) information
• Mining changes? Use learning?
Questions?
Thank you