Change Control in XML
DEA I3: Information, Interaction, Intelligence
Course: Semi-structured Data
Grégory Cobéna, http://www-rocq.inria.fr/verso/
20/12/2002
Objectives
Understand the management of dynamic data
• At large scale: the case of a Web data warehouse (Xyleme)
• At the scale of an XML document: the case of version management
DEA I3 - Semi-structured Data - Grégory Cobéna
Motivations: at the scale of the Web
Change control is first of all about:
• Knowing how to discover data sources and XML documents, on the Web or on an intranet
• Tracking these documents over time
• Extracting knowledge about what changes: the documents, their properties, their content
Motivations: at the scale of the document
Where does the notion of change arise?
• When we are interested in how a given document evolves over time. Example: an XML file describing an address book
• When we manage several documents and study the changes between them. Example: an XML file describing two car models, a Peugeot 307 and a 206
Challenges
• Semi-structured data should provide a more precise description than plain text, with well-defined semantics
• Managing changes in semi-structured data is even more complex than in relational databases
Course outline
• Xyleme
  • A large-scale XML data warehouse
  • Integration of Web data
  • Active monitoring of Web data
• XML Diff
  • Representation of changes
  • Detection of changes
Part One: Xyleme
A Dynamic Warehouse for the XML Data of the Web
Organization
1. The Web and XML
2. Xyleme
3. Data Acquisition and Maintenance
4. XML Repository, Semantic Data Integration and Query Processing
5. Query Subscription
Conclusion
(Part I: Xyleme) 1. The Web and XML
The Web today
• Terabytes of data
• A lot of public pages
  • 1 billion in [06/2000]
  • several million servers
• Private web: pages that are not publicly available
• Deep web: data hidden behind forms
HTML = Hypertext Language
Information System:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
HTML:
The X23 new camera replaces the X22. It comes equipped with a flash (worth by itself 53.99 $) and provides great quality for only 359.99 $.
Text + presentation
Where is the data?
XML = Semistructured Data
Information System:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
XML:
< product reference="X23"> camera 359.99 easy … ...
< product reference="R2D2"> Robot 19350 …
...
Data + Structure
Semistructured: more flexible
XML: Tree Types
[Figure: tree of element names: product-table, product, reference, designation, price, description]
Semantics and structure are in paths
• product-table/product/reference
• product-table/product/price
(Part I: Xyleme) 2. A Dynamic Warehouse for the XML Data of the Web
Xyleme Research
Project Xyleme at INRIA (1999-2000):
Explore XML + Web + DBMS to make the Web a Knowledge Database
• INRIA
  • Sophie Cluet: Databases (OQL…)
  • Serge Abiteboul: semi-structured data + web
  • Guy Ferran: ex O2 Technology
• Mannheim University
  • Guido Moerkotte
• Université d’Orsay
  • Marie Christine Rousset
• CNAM
  • Dan Vodislav
Xyleme Company
• Started September 2000 (25 employees at the end of 2001)
• Market Challenges:
  • Few XML documents available on the Web (because of weak software support)
  • Company is focusing on private XML: Press, Editors, Financial Data, Biology…
• Technology:
  • Scalability for large amounts of data
  • Internet (+focus) / Intranet support
  • Monitoring and Version Management
  • Heterogeneous Data Integration
Architecture
• Cluster of PCs
• Developed with Linux and C++
• Communications
  • local: Corba
  • external: HTTP
• Distribution between autonomous machines
• Now Web Services
Functional Architecture
[Figure: functional architecture diagram. Components: User Interface, Internet, Web Interface, Xyleme Interface, Acquisition (Loader & Crawler), Repository and Index Manager, Change Control, Query Processor, Semantic Module]
Architecture
[Figure: physical architecture diagram. Machines sit below the Internet and are connected by Ethernet. Machine roles shown: Acquisition and Maintenance, Loader | Query, Index, Repository, Change Control and Semantic Integration]
(Part I: Xyleme) 3. Data Acquisition and Maintenance, Page Importance
Goals
• Discover XML pages on the web that are of interest for customers
  • For this, crawl the web (HTML+XML)
• Keep them up to date
• Do this under bounded resources:
  • Memory for known URLs
  • Bandwidth
Life Cycle of a page in Xyleme
• The URL of D is discovered as a link in another page (or published by a customer)
• The page scheduler decides to read D
  • The metadata of D is read
    • type, last_date_update...
  • The document D is loaded
• The document D is (re)read regularly
Main Issues
• Loading of pages
  • we can load up to 5 million pages/day on a standard PC
  • main cost is the Internet connection
• Metadata management (access to disk)
• Page scheduling
  • decide which page to read or refresh next
Page Importance
• Definition: important pages are linked to by important pages
• Offline algorithm (used by Google)
• Our online algorithm (M. Preda, S. Abiteboul, G. Cobéna)
  • does not require maintaining graph information
  • faster convergence with focused crawling
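The recursive definition (important pages are linked to by important pages) can be illustrated with the classical offline computation. The sketch below is a plain power iteration over a link graph that is fully known in advance; it is not the online algorithm mentioned on the slide. The function name, the damping factor of 0.85 and the toy graph are assumptions made for illustration.

```python
def page_importance(links, damping=0.85, iterations=100):
    """Offline importance by power iteration (illustrative sketch).

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values, not taken from the slides.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share, then shares from its in-links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its importance evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: "c" is linked to by every other page.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_rank = page_importance(toy_links)
```

Note that this version needs the whole graph in memory, which is the cost the online approach on the slide is said to avoid.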
(Part I: Xyleme) 4. XML Repository: Semantic Data Integration and Query Processing
Querying Language
• Today: a mix of OQL and XQL
• We are currently moving to XQuery (which is also a mix of OQL and XQL…)
Select boss/Name, boss/Phone
From comp in BusinessDomain, boss in comp//Manager
Where comp/Product contains “Xyleme”
Web Heterogeneity
• Semantic domains, e.g., cinema
• Many possible types for data in this domain, many DTDs
• Semantic Integration
  • one abstract DTD for the domain
  • gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD
Indexing
• Standard inverted index
  • word → documents that contain this word
• Xyleme index
  • word → elements that contain this word (document + element identifier)
• Goal: more work can be performed without accessing data
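The difference between the two indexes can be sketched in a few lines. The code below builds both a document-level and an element-level inverted index from small XML strings. The function name, the use of the element's position in document order as its identifier, and the sample documents are assumptions for illustration; the actual Xyleme index structure is not described on the slide.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_indexes(documents):
    """Return (doc_index, elem_index) for a dict doc_id -> XML text.

    doc_index:  word -> set of doc_id
    elem_index: word -> set of (doc_id, element position, element tag)
    Only the text directly inside an element is indexed (tail text is
    ignored to keep the sketch short).
    """
    doc_index = defaultdict(set)
    elem_index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # root.iter() walks the elements in document order.
        for position, elem in enumerate(root.iter()):
            for word in re.findall(r"\w+", (elem.text or "").lower()):
                doc_index[word].add(doc_id)
                elem_index[word].add((doc_id, position, elem.tag))
    return doc_index, elem_index


sample_docs = {
    "d1": "<product><name>camera</name><description>flash included</description></product>",
    "d2": "<product><name>flash</name></product>",
}
doc_index, elem_index = build_indexes(sample_docs)
```

With the element-level index, a condition such as "description contains flash" can be checked against the index entries alone, which is the stated goal of doing more work without accessing the data.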
I.4.1 Xyleme: Semantic Data Integration
Data Integration
• One application domain, several schemas
  • heterogeneous vocabulary and structure
• Xyleme Semantic Integration
  • gives the illusion that the system maintains a homogeneous database for each domain
  • abstracts a set of DTDs into an abstract DTD = a hierarchy of pertinent terms for a particular domain
Technology in short
• Cluster DTDs into application domains
  • Business, culture, tourism, biology, …
• For an application domain (semi-automatically):
  • Organize tags into a hierarchy of concepts using thesauri such as WordNet and other linguistic tools
  • This provides the abstract DTD for the particular domain
  • Generate mappings between concrete DTDs and the abstract one
I.4.2 Xyleme: Query Processing
Xyleme Query Language
A mix of OQL and XQL; will use the W3C standard once there is one
Select product/name, product/price
From doc in catalogue, product in doc/product
Where product//components contains “flash” and product/description contains “camera”
Principle of Querying
query on abstract DTD
  → MAPPINGS between concrete and abstract DTDs
  → Union of concrete queries (possibly with joins)
catalogue/product/price
  ⇒ d1//camera/price
  ⇒ d2/product/cost
catalogue/product/description
  ⇒ d1//camera/description
  ⇒ d2/product/info, ref ⇒ d2/description
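The translation step can be pictured as a table lookup followed by grouping per concrete DTD. The sketch below uses a hypothetical `MAPPINGS` dictionary holding the path mappings listed on the slide; the join through `ref` for d2 descriptions is left out, and the function name is an assumption.

```python
# Hypothetical mapping table: abstract path -> concrete paths (from the slide).
MAPPINGS = {
    "catalogue/product/price": ["d1//camera/price", "d2/product/cost"],
    "catalogue/product/description": ["d1//camera/description", "d2/product/info"],
}


def translate(abstract_paths, mappings=MAPPINGS):
    """Group the concrete paths to evaluate by concrete DTD (d1, d2, ...).

    The overall answer is the union of the results obtained on each DTD.
    """
    per_dtd = {}
    for path in abstract_paths:
        for concrete in mappings.get(path, []):
            dtd = concrete.split("/", 1)[0]  # the prefix before the first "/"
            per_dtd.setdefault(dtd, []).append(concrete)
    return per_dtd


concrete_queries = translate(
    ["catalogue/product/price", "catalogue/product/description"]
)
```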
Query Processing
1. Partial translation, from abstract to concrete, to identify “machines” with relevant data
2. Algebraic rewriting, linear search strategy based on simple heuristics: in priority, use in-memory indexes and minimize communication
3. Decomposition into local physical subplans and installation
4. Execution of plans
5. If needed, relaxation
Query processing
Essential use of a smart index combining full-text and structure
I.4.3 Xyleme: Repository
Storage System: Xyleme Store
• Efficient storage of trees in variable-length records within fixed-length pages
• Balancing of tree branches in case of overflow
  • minimize the number of I/Os for direct access and scanning
  • good compromise: compaction / access time
Tree Balancing in Xyleme Store
[Figure: a tree stored across Record 1 to Record 4, with two overflow cases: "sub-tree in other page" and "more children in other page"]
Questions?
(Part I: Xyleme) 5. Change Control
The Web changes all the time
• Data acquisition + maintenance
  • keep the warehouse up-to-date
• Version management
  • representation and storage of change (see part II)
• Change monitoring
  • query subscription
Subscription Language
• SQL-like language based on ‘atomic events’.
• Combines the use of monitoring queries and continuous queries.
• The language can be extended by adding new types of atomic events.
• Uses the XML Query Language for continuous queries.
“Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report
Example
subscription myPaintings
% what are the new painting entries in Musee d’Orsay site
monitoring newPainting
  select URL
  where URL extends www.musee-orsay.fr/*
  and contains “Monet”
  (the two conditions above are labelled "Atomic events" on the slide)
% manage the changes in the expositions
continuous delta Exposition
  select ... from ... where
  when monthly
notify daily % send me a daily report
Step 1: Atomic Event Detection
[Figure: flow of 5 million pages/day. A document d goes through the metadata manager and the XML loader; the document and its alerts (d/46, then d/46,67) are passed on to complex event detection]
• atomic event 46: URL matches pattern www.musee-orsay.fr/*
• atomic event 67: XML document contains the tag with the value “Monet”
Alerters
• Each Alerter can be viewed as a plug-in that acts on a document flow.
• All sorts of atomic events can be detected: URL pattern detection, keywords, XPath expressions, page rank…
• Can be distributed.
• Some advanced alerts are:
  • Long string look-ups
  • Finding XML patterns (e.g. XPath)
  • Comparing digital signatures of text documents (copy tracker)
URL Patterns Detection (1)
• Supported patterns
  • URL | prefix* | *suffix
• Using a hash table: try all possible patterns
  • Example: http://www.inria.fr/verso/index.html
  • Test: http://www.inria.fr/verso/*, http://www.inria.fr/*
  • Each test is O(1); total test time is O(n), where n is the length of the URL
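The hash-table approach can be sketched as follows: every prefix and every suffix of the incoming URL is looked up in a set of registered patterns. The function name and the three pattern sets are assumptions for illustration. Python slicing copies the string, so this sketch is quadratic in the URL length; reaching the O(n) total quoted on the slide would need an incrementally updated hash, which is not shown here.

```python
def matching_patterns(url, exact, prefixes, suffixes):
    """Return the registered patterns matched by url.

    exact:    set of full URLs
    prefixes: set of strings p registered as the pattern "p*"
    suffixes: set of strings s registered as the pattern "*s"
    """
    found = []
    if url in exact:
        found.append(url)
    # Try every prefix of the URL against the hash set of prefix patterns.
    for end in range(len(url) + 1):
        if url[:end] in prefixes:
            found.append(url[:end] + "*")
    # Same idea for suffix patterns.
    for start in range(len(url) + 1):
        if url[start:] in suffixes:
            found.append("*" + url[start:])
    return found


registered_prefixes = {
    "http://www.inria.fr/verso/",
    "http://www.inria.fr/",
    "http://www.w3.org/",
}
hits = matching_patterns(
    "http://www.inria.fr/verso/index.html",
    exact=set(),
    prefixes=registered_prefixes,
    suffixes={".html"},
)
```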
URL Patterns Detection (2)
• Using a tree: navigate the tree until a leaf is encountered
• Example: [Figure: example tree]
• Patricia trees?
Keywords Sequence Algorithm
• Detect: « Air France »
• Solution:
  • a tree of backward keyword sequences
  • a context memory with O(1) update cost
• The tree is implemented over a hash table
Simple XPath filtering Algorithm
• Problem:
  • detect CONTAINS « word »
• Solution:
  • Reverse the path expression
  • Use postfix order
  • Use a stack for ‘//’ and another stack for ‘/’
Simple XPath filter example: understanding the tree structure in postfix order
• Consider tree: toto
• Nodes come as:
  • toto (id=1, level=4)
  • C (id=2, level=3)
  • C (id=3, level=3)
  • B (id=4, level=2)
  • A (id=5, level=1)
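The order in which nodes arrive can be reproduced with a postfix (post-order) traversal. The tree itself did not survive on the slide, so the sketch below assumes a tree that is consistent with the listed nodes: A contains B, B contains two C elements, and the first C contains the text "toto". The function name is also an assumption.

```python
import xml.etree.ElementTree as ET


def postfix_nodes(xml_text):
    """List (label, id, level) for text and element nodes in postfix order."""
    root = ET.fromstring(xml_text)
    visited = []

    def visit(elem, level):
        # A text node is a child of its element, so it comes first,
        # one level deeper than the element itself.
        if elem.text and elem.text.strip():
            visited.append((elem.text.strip(), level + 1))
        for child in elem:
            visit(child, level + 1)
        # The element is emitted only after all of its children.
        visited.append((elem.tag, level))

    visit(root, 1)
    return [(label, i + 1, level) for i, (label, level) in enumerate(visited)]


# Assumed tree, chosen to match the node list on the slide.
nodes = postfix_nodes("<A><B><C>toto</C><C/></B></A>")
```

Because a node is emitted after its whole subtree, the ancestors of « toto » are exactly the later nodes with a strictly smaller level along the path, which is what makes stack-based filtering in postfix order possible.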
Simple XPath: Example
CONTAINS toto is detected by:
• When « toto » is detected, it is stored
• For each ancestor of « toto », the name is compared to « toto »::ancestor.
• All tests are executed using a hash table
Stemming
• On the Alerter
  • Example: Éléphant -> ELEPHANT
  • Do it for 500 documents / second
  • Noise may be introduced (example: tâche = tache)
• On the Subscription Manager
  • To avoid duplicate registration of similar events
  • To show users how their query is stemmed
• Real stemmers: chevaux -> cheval
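The alerter-side normalization (Éléphant -> ELEPHANT) amounts to removing accents and upper-casing. A minimal sketch with the Python standard library is shown below; the function name is an assumption, and this is only the cheap normalization from the slide, not a real stemmer (it will not map chevaux to cheval).

```python
import unicodedata


def normalize_keyword(word):
    """Strip accents and upper-case a keyword."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.upper()


elephant = normalize_keyword("Éléphant")
# The noise mentioned on the slide: two different French words collide.
collision = normalize_keyword("tâche") == normalize_keyword("tache")
```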
Step 2: Complex Event Detection
[Figure: the XML loader and the HTML parser feed complex event detection. Annotations: millions of page alerts per day, millions of subscriptions]
• complex event 12: 67 & 46 (XML document contains the tag with value “Monet” and URL matches pattern www.musee-orsay.fr/*)
Complex Events Algorithm
• The formal problem is NP-hard
• We proposed several possible algorithms
• Experimental (simulation) values showed the effectiveness of our solutions
• The Hash-Tree based algorithm is well suited for our application:
  • 10 million complex events
  • 1 million atomic events
  • 100 atomic events detected per document
  • 0.8 ms to process a document
  • ~2 million documents per day (on each PC)
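A complex event is a conjunction of atomic events, so detection reduces to finding every registered set that is included in the set of atomic events raised for a document. The sketch below is a simple baseline, not the Hash-Tree algorithm from the slide: each complex event is indexed under its smallest atomic event identifier, so only a few candidates are checked per document. Function names and the sample events (other than event 12 = 67 & 46) are assumptions.

```python
from collections import defaultdict


def build_event_index(complex_events):
    """Index each complex event under its smallest atomic event id."""
    index = defaultdict(list)
    for name, atoms in complex_events.items():
        required = frozenset(atoms)
        index[min(required)].append((name, required))
    return index


def detect_complex_events(index, detected_atoms):
    """Return the complex events whose atomic events were all detected."""
    detected = frozenset(detected_atoms)
    fired = []
    for atom in sorted(detected):
        for name, required in index.get(atom, []):
            # Subset test: every required atomic event must be present.
            if required <= detected:
                fired.append(name)
    return fired


event_index = build_event_index({12: [67, 46], 13: [46, 99], 14: [67]})
fired = detect_complex_events(event_index, [46, 67])
```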
Step 3: Notification Processor
[Figure: notification processor diagram. Inputs: alerts from complex event detection, a clock, triggers, continuous queries. Outputs: notification/monitoring and notification/results sent to the Reporter. Annotation: millions of notifications/day]
Architecture
[Figure: subscription architecture diagram. Components: Xyleme Alerter (documents), Complex Event Detection, Trigger Engine, Xyleme Query Processor, Xyleme Subscription Manager, Subscription Manager, Xyleme Reporter, Reporter, Web Browser; two links labelled SQL]
Monitoring Applications
Copy tracking
• Example: a press agency wants to check that people are not illegally publishing copies of its wires
• Need to react fast to changes: an illegal copy of a wire may stay online for only a couple of days
[Figure: three numbered steps (1, 2, 3) with the labels: slice the document; query to search engine, or specific crawl + pre-filter; flow of candidate documents; filter; detection]
Web portal management
• Standard portal management
  • Unreachable pages
  • Dangling pointers
  • Incorrect pages (e.g., do not parse)
  • Detection of interesting pages on the web
  • Etc.
• Portal archiving
• Subscription and notification
Web surveillance
• Applications
  • Anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisition of chemical products
  • Business intelligence, e.g., discovering potential customers, partners, competitors
• Find the data (crawl the web)
• Monitor the changes
  • new pages, deleted pages, changes in a page
• Classify information and extract data of interest
  • Data mining, text understanding, knowledge representation and extraction, linguistics… Very AI
Conclusion & Perspectives
• Focus crawling on important pages
  • Refine the notion of importance
  • Improve the discovery of important pages
• Improve change control accuracy
  • Semantic web
  • Real-time advanced processing
Questions?
Part Two: XML Diff
Versions
• Objectives:
  • Temporal queries (persistent identification of nodes)
  • Version some documents or some sites (store a ‘delta’)
  • Change monitoring (query changes)
• We proposed a representation of changes
  • “Change-Centric Management of Versions” (VLDB 2001)
• We developed a diff algorithm for XML
  • “Detecting Changes in XML Documents”, G. Cobéna, S. Abiteboul, A. Marian, ICDE 2002 (San Jose)
(Part II: XML Diff) 1. Detecting Changes in XML Documents
Grégory Cobéna, Serge Abiteboul, Amélie Marian
INRIA Rocquencourt, Columbia University
Introduction
• Algorithms for detecting changes in XML documents
• Plan
  • An XML diff algorithm
  • A comparative study of XML change detection
Overview
• Motivations
• State of the art
• Change model
• Algorithm
  • Tradeoff ‘quality’ versus speed
  • Quasi-linear time and space complexity
• Experiments
  • Synthetic and real-world experiments
Motivations
• Monitoring XML data on the Web
  • B. Nguyen, S. Abiteboul, G. Cobéna, M. Preda, SIGMOD 2001
• Change-centric management of versions in an XML warehouse
  • A. Marian, S. Abiteboul, G. Cobéna, L. Mignet, VLDB 2001
• Learning about changes
• Architecture and requirements (‘speed’)
• Multiple optimality criteria (‘quality’)
II.1.1 XML Diff: What is a diff?
• Unix Diff: shows the lines that differ between two text files
• String Diff: shows which symbols have changed
• XML Diff: which parts of the tree have been modified, inserted or deleted
In fact, all these problems are very similar
The String Edit Problem
• Consider string: abcdefg
• How to transform it into: bczdeyz ?
• Possible solutions
  • delete all 7 chars and insert 7 other chars
  • update a into b, b into c, c into z, f into y, g into z
  • mix both solutions
• Question: what is the shortest edit sequence?
Solving the String-Edit-Problem
• If we know how to transform S1 into S2, then we know how to transform S1x into S2y
• Conversely, to find the shortest path for transforming S1x into S2y, it is sufficient to compare the following transformations:
  • S1 into S2, then x into y
  • S1x into S2, then insert y
  • delete x, and then S1 into S2y
String Edit Problem: the algorithm
• Two strings S1 and S2
• Cost(x,y) represents the shortest edit cost to transform S1[1..x] into S2[1..y]
• The cost is the sum of the individual costs of each edit operation (insert, delete, update)
• Then, Cost(x,y) is the min of:
  • Cost(x-1,y-1)+update_cost(S1[x],S2[y])
  • Cost(x-1,y)+delete_cost(S1[x])
  • Cost(x,y-1)+insert_cost(S2[y])
A Quadratic Solution
• The solution is to represent all possible paths on a matrix: M[0..|S1|][0..|S2|]
  • M[x,y] represents the cost of transforming S1[1..x] into S2[1..y]
  • M[x,y] can be computed using M[x-1,y-1], M[x-1,y] and M[x,y-1]
  • M[0,i] and M[i,0] are obvious
  • Thus, M[|S1|,|S2|] can be computed
• Note that the number of paths is exponential, but the cost remains quadratic.
• Time and space cost is O(|S1|*|S2|)
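The recurrence and the matrix can be written down directly. The sketch below fills M row by row with unit costs for insert, delete and update (an update of a character into itself costs 0); the function name and the default costs are assumptions, since the slides leave the cost functions open.

```python
def edit_cost(s1, s2, insert_cost=1, delete_cost=1, update_cost=1):
    """Shortest edit cost to transform s1 into s2 (quadratic time and space)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # m[x][y] is the cost of transforming s1[:x] into s2[:y].
    m = [[0] * cols for _ in range(rows)]
    for x in range(1, rows):
        m[x][0] = m[x - 1][0] + delete_cost      # delete everything
    for y in range(1, cols):
        m[0][y] = m[0][y - 1] + insert_cost      # insert everything
    for x in range(1, rows):
        for y in range(1, cols):
            same = s1[x - 1] == s2[y - 1]
            m[x][y] = min(
                m[x - 1][y - 1] + (0 if same else update_cost),
                m[x - 1][y] + delete_cost,
                m[x][y - 1] + insert_cost,
            )
    return m[-1][-1]


cost = edit_cost("abcdefg", "bczdeyz")
```

With these unit costs, the example from the earlier slide costs 4 (delete a, insert z, update f into y, update g into z), which beats the 5 position-by-position updates.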
State of the art (1): the string edit problem
• O(|x|*|y|) solution with a directed acyclic graph
• Best result is an O(|s|^2 / log s) solution over a finite alphabet
[Figure: edit graph between a source string and a destination string (letters A, B, C, …), with edges labelled "do nothing (cost=0)", "delete C (cost=1)" and "insert C (cost=1)"]
Questions?
Extending the String problem
• Compute M[x,y] only close to the diagonal (E. Myers)
  • Finds the solution in O(n*D), where n is the size of the largest string and D the distance between the two strings
• Adapt M[x,y] to work on trees (S. Chawathe)
  • Remove some edges to ensure that deleting a node will delete the subtree rooted at that node (and conversely for insert)
State of the art (2): the tree pattern matching for XML
• Kuo-Chung Tai, Lu, Selkow: based on the string edit problem; in XML, many labels are identical
• Unix Diff, Sun DiffML
• LaDiff (MH-Diff), Chawathe, Rajaraman, Garcia-Molina, J. Widom
  • matching criteria to compare nodes and subtrees
  • quadratic in the ‘distance’ between both trees
• IBM diff available at alphaworks
Data Model
Issue: persistent identification of nodes
[Figure: two versions of a Catalog tree; each Pr node has children N and P.
Version 1: Camera 300; TV 100; VCR 200
Version 2: TV 100; DVD 500; VCR 150]
Change Model
• Attach persistent identifiers:
  • to every node = XID
  • to the document = XID-map
• Represent changes with a Delta
  • Delta = set of changes
  • Nice mathematical properties
Change-centric Management of Versions, VLDB 2001
Algorithm: Intuition
[Figure: the two Catalog versions annotated with XIDs.
Version 1, XID-map (1-16|17): Camera 300 (Delete), TV 100, VCR 200
Version 2, new XID-map (6-10,17-21,11-16|22): TV 100, DVD 500 (Insert), VCR 150 (Update)]
Diff (V1,V2):
• delete(5)
• update(13,150)
• insert(16,2,(17-21))
Objectives
• Assign persistent identifiers by matching nodes
• Compute a representation of changes between the two documents
• Also
  • Constraint-awareness: follow DTD specifications
  • Correctness: no change is missed
Representation of Changes: Example
[Delta listing; XML markup lost in extraction. Remaining text values: Camera $300, DVD $400, $200 $150]
II.1.2 XML Diff: The XyDiff Algorithm
Phase 1: Identify Subtrees
• One traversal of the tree
• Use ID-attributes from the DTD to match nodes (or forbid matching)
• Compute for every subtree
  • Signature
  • Weight
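Both values can be computed bottom-up in a single traversal. In the sketch below the signature is a SHA-1 hash of the element name, its text and its children's signatures, and the weight is simply the number of elements in the subtree. These two definitions, the function name and the choice to ignore attributes are assumptions made for illustration; the slides do not give the definitions used by XyDiff.

```python
import hashlib
import xml.etree.ElementTree as ET


def annotate(elem, table):
    """Fill table[elem] = (signature, weight) for elem and its descendants."""
    digest = hashlib.sha1()
    digest.update(elem.tag.encode("utf-8"))
    digest.update(b"\0")
    digest.update((elem.text or "").strip().encode("utf-8"))
    weight = 1
    for child in elem:
        child_signature, child_weight = annotate(child, table)
        # Children contribute in order, so reordering changes the signature.
        digest.update(child_signature.encode("ascii"))
        weight += child_weight
    signature = digest.hexdigest()
    table[elem] = (signature, weight)
    return signature, weight


old_root = ET.fromstring(
    "<Catalog><Pr><N>TV</N><P>100</P></Pr><Pr><N>VCR</N><P>200</P></Pr></Catalog>"
)
old_table = {}
annotate(old_root, old_table)
```

Two subtrees with equal signatures are candidates for matching, and the weight gives the order in which candidates are considered in the next phase.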
Phase 2: Bottom-Up + Lazy-Down Propagation
• Let L be the list of all subtrees in the second document
• For each subtree S in L (in decreasing weight order)
  1/ Find all identical subtrees in the first document
  2/ Select acceptable matches
  3/ If at least one match, choose the best candidate
     • Remove S and all its subtrees from L
  4/ Propagate [very carefully] matching to parents and ancestors
Phase 3: Optimization
• Some nodes are unmatched after previous phases
• Use previous results to propagate matching [now a bit less carefully]
• First, bottom-up
  • propagate matching to ancestors
• Then quick top-down pass
  • propagate matching to descendant nodes based on element names if no ambiguity
Phase 4: Construct the delta
• Find inserted/deleted nodes
• Find “easy” move operations: parent node changed
• Find “complex” moves = reordering children
  • Largest common subsequence (weight)
  • Ex: A, B, C, D, E, F becomes E, D, A, B, C, F
    The largest common subsequence is A, B, C, F; nodes D and E are ‘moved’
  • Complexity is quadratic
  • We approximate the solution in linear time
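The reordering example can be checked with the textbook quadratic algorithm. The sketch below computes an unweighted longest common subsequence of the two child lists and reports the remaining children as moved; it is neither the weighted variant nor the linear-time approximation mentioned on the slide, and the function names are assumptions.

```python
def longest_common_subsequence(a, b):
    """Classic quadratic dynamic program returning one LCS of a and b."""
    n, m = len(a), len(b)
    # length[i][j] is the LCS length of a[i:] and b[j:].
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])
    # Walk the table from the start to recover the subsequence itself.
    result, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


def moved_children(old_order, new_order):
    """Children that are not part of the common subsequence are 'moved'."""
    kept = set(longest_common_subsequence(old_order, new_order))
    return [child for child in old_order if child not in kept]


stable = longest_common_subsequence(list("ABCDEF"), list("EDABCF"))
moved = moved_children(list("ABCDEF"), list("EDABCF"))
```

On the slide's example this keeps A, B, C, F in place and reports D and E as moved.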
Key aspect: the weight of trees
• Select Acceptable Match
  • Use locality (e.g. find matching ancestors) to avoid wrong matches. Two small trees are matched if some ancestors are matching. For large trees, further look-up is accepted.
• Propagation
  • Propagation should try not to induce wrong matches. The intuition is that large matching subtrees are more relevant
  • The larger the tree, the more we propagate the matching to ancestors.
Algorithm: Tuning
• Definition of weight affects both speed and quality
• Look-up and propagation distances
  • Choice affects speed and quality
• Trade-off quality vs. speed
• We exhibit in the paper some bounds that guarantee linear time complexity
Complexity: n*log(n)
• Phase 1 (identification) is one traversal of the tree
• Phase 2 (propagation) is n times ‘get best candidate’ in the worst case
  • Look-up level is designed to have ‘get best candidate’ cost in O(log(n)); uses some pre-computed indexes
• Phase 3 (optimization) is designed to be linear
• Phase 4 (delta construction) is linear
  • the longest common subsequence of children is approximated
Experiments
• Simulator of changes on XML documents
• Speed and quality evaluation on synthetic data
• Comparison with Unix Diff on web data
Simulator of changes
• Parameters control the number of delete/insert/move operations
• Input: XML document D
• Outputs:
  • delta of changes
  • XML document D’=delta(D)
Synthetic Data: Speed of the algorithm
• Experimental verification that the algorithm is quasi-linear
[Figure: speed plot, annotated "Typical Pattern" and "1Mb"]
Synthetic Data: Quality of the algorithm
• Start from document D
  – Generate changes over D: D’ = delta(D)
  – Give (D, D’) to XyDiff and compute a delta
• Size of the computed delta is comparable to the ‘original’ delta size
• For large deltas, XyDiff finds more efficient operations
[Figure: plot of the size ratio of the diff over the original delta (axis from 0 to 4), annotated "Typical Pattern"]
Comparison of the size of results: XyDiff vs. UnixDiff
• Experiments on 10,000 XML web documents that changed
  • at the time of that experiment, we had to crawl 10 million web pages to find them ☺
• 80% of the documents are below the size of (UnixDiff*1.2)
• Almost all are below UnixDiff*2
• Of course, the delta of XyDiff contains much more information
Perspectives
• Larger-scale experiments on web data
• Learn about changes:
  • Frequency, patterns, …
  • Obtain statistics for DTD and XML Schema
  • Use the statistics to learn about changes and improve XyDiff for typed XML data
• Use XML diff to observe changes between websites
Conclusion
• A novel algorithm for XML diff in quasi-linear time
• XML specificities are used to improve quality
• Available as Open Source freeware at: http://www-rocq.inria.fr/~cobena/XyDiffWeb/
XyDiff in Xyleme Architecture
[Figure: pipeline with the components Web Crawler, XML Loader, Alerter, XyDiff and Storage. XyDiff receives V(n) of the XML document and produces Delta(V(n-1),V(n))]
Questions?
(Part II: XML Diff) A Comparative Study of Change Detection in XML
Grégory Cobéna (INRIA), Talel Abdessalem (ENST), Yassine Hinnach (ENST)
Context
• Consider change control in XML data warehouses.
• We want to understand changes
• We have only the old and new versions of documents
• A diff needs to be computed
Organization
• Motivations
• Data Model
• Representing Changes
  • Version management and querying
  • Comparison of change representation models
  • Experiments
• Detecting Changes
  • State of the art in change detection
  • Performance analysis and experiments
  • Quality analysis and experiments
• Summary
Motivations
Motivations: Representing Changes
• Version management: the representation should allow for effective storage strategies
• Temporal databases: support for persistent identification of nodes is mandatory
• Monitoring: information about changes is used to support triggers or detect events
• Note: HTML or XHTML documents may be used
Motivations: Detecting Changes
• Correctness: the diff programs miss no changes
• Minimality of the result is important to save storage space and network bandwidth
• Semantics: some algorithms consider more semantics in XML documents
• Performance: with dynamic services and/or large amounts of data, high speed and low memory usage are mandatory
• ‘Move operations’: some algorithms support move operations whereas others don’t. This impacts both the performance of the tool and the quality of results.
II.2.1 XML Diff: Comparing Data Models
Data Model (quick overview)
• Operations are:
  (i) insert, delete applied to leaves or subtrees
  (ii) update of text nodes
  (iii) move applied to a subtree root, moving the entire subtree
• An edit cost is assigned to each operation. Usually, the cost is 1 per node touched
• The semantics of move is to identify subtrees even when their context has changed.
• We use the notion of a mapping between the two trees. Each node in document A (or B) that is not deleted (or inserted) is matched to the corresponding node in B (or A).
Data Model: Intuition
[Figure: a tree with nodes root, a, b, c, x, y, shown after "delete ‘b’" in Selkow’s model and in Tai’s model]
II.2.2 XML Diff: Representing Changes
Representing Changes
• Version Management
  • There are several version management strategies. For instance, when only deltas are stored, their size must be reduced
  • We also consider the performance of reconstructing a document given the delta and the previous document. It is linear in all cases.
  • A simple text-based version management is possible but cannot be used for querying.
• Querying Changes
  • Labeling nodes with prefix+postfix identifiers improves querying algorithms
  • Labeling nodes with persistent identifiers improves temporal databases
  • There is no short labeling scheme that is good for both
Our Example
[Two versions of an XML document; markup lost in extraction, text values only]
• Notebook, 2200MHz Pentium4, $1999; Digital Camera, Fuji FinePix 2600Z, Not Available
• Notebook, 2200MHz Pentium4, $1999; Digital Camera, Fuji FinePix 2600Z, $299
20/12/2002
DEA I3 - Données semi-structurées - Grégory Cobéna
Different representations
Change Models: XUpdate
[Example lost in extraction; surviving annotations: the value “$299” and the callout “XPath expression”]
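The idea of addressing a changed node with a path expression can be sketched as follows. This is not XUpdate syntax: it is a small Python illustration using the limited XPath subset of `xml.etree.ElementTree`, and the element names (`catalog`, `product`, `name`, `price`) as well as the helper `apply_update` are assumptions, since the original markup is not available.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the catalog example.
OLD = """<catalog>
  <product><name>Notebook</name><price>$1999</price></product>
  <product><name>Digital Camera</name><price>$299</price></product>
</catalog>"""


def apply_update(root, select, new_text):
    """Replace the text of the single node addressed by `select`."""
    targets = root.findall(select)
    if len(targets) != 1:
        raise ValueError(f"{select!r} matched {len(targets)} nodes, expected 1")
    targets[0].text = new_text


root = ET.fromstring(OLD)
# The change is described by a path to the node plus its new value.
apply_update(root, "./product[2]/price", "Not Available")
prices = [p.text for p in root.findall("./product/price")]
```

Because the path is positional, it addresses the intended node only in the version of the document it was written against.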
Change Models: DeltaXML (Example)
• Same look’n’feel as the document
• Mentions some unchanged nodes
• The order is important (no ids, no move)
[Example markup lost in extraction; surviving values: “Not Available”, “$399”]
Change Models: XyDelta (Example)
• Persistent identifiers
• What is the parent node?
[Example markup lost in extraction; surviving values: “$399”, “Not Available”]
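The following sketch illustrates why persistent identifiers and stored old values make a delta easy to apply in both directions. It does not reproduce the XyDelta format; the `Update` record, the identifier 7 and the dictionary-based document are assumptions chosen for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    node_id: int   # persistent identifier of the text node
    old: str
    new: str

    def inverse(self):
        """The same change read backwards."""
        return Update(self.node_id, self.new, self.old)


def apply_delta(doc, delta):
    """Apply updates to `doc`, a mapping from node id to text value."""
    result = dict(doc)
    for op in delta:
        if result.get(op.node_id) != op.old:
            raise ValueError(f"node {op.node_id} does not hold {op.old!r}")
        result[op.node_id] = op.new
    return result


def invert(delta):
    """Backward delta: inverse operations in reverse order."""
    return [op.inverse() for op in reversed(delta)]


v1 = {7: "$299"}                              # hypothetical identifier 7
delta = [Update(7, "$299", "Not Available")]
v2 = apply_delta(v1, delta)
restored = apply_delta(v2, invert(delta))
```

Checking the stored old value before applying an update also gives a cheap consistency test that the delta is being applied to the version it was computed from.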
Change Models: Microsoft XDL (Example)
• Identify nodes
• Element node
• Verify consistency
[Example markup lost in extraction; surviving value: “$299”]
Summary
Nice features that some models lack
• Validation by a DTD (may be a problem for DeltaXML, XyDelta)
• Verify the source document (only XDL)
• Support of ‘move’ operations (only XyDelta and XDL)
• Backward deltas (only XyDelta)
• Monitoring the delta (only XUpdate and DeltaXML)
Still missing for all of them
• A framework for querying
Unique advantages of XyDelta
• A formal model and nice mathematical properties
• Persistent identification of nodes (at least as an option)
Storage Experiments
[Figure: Comparing Delta Size. Logarithmic axes labelled File Size and Edit Cost; series: DeltaXML and XyDelta]
Identifiers save space when there are few updates
Change Models: Conclusion
• Change monitoring is easier with DeltaXML and XUpdate
• Temporal queries are easier to evaluate with XyDelta (persistent identifiers)
Future work:
• It is not yet clear how to query changes
• Define transaction or synchronization protocols
II.2.3 Detecting Changes
State of the art
• Based on the String Edit Problem (1966)
• Tree-to-tree correction algorithms: find the Minimum Edit Script in O(m*n) time and space, where m and n are the sizes of the two documents
Other algorithms
• Run in linear time or close to it
• Match nodes or subtrees depending on their content
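For reference, the string edit problem has a classic dynamic-programming solution that fills an (m+1) × (n+1) table, hence the O(m*n) time and space. The sketch below uses unit costs for insertion, deletion and substitution, which is an assumption; it covers only the string case, not the tree-to-tree algorithms built on the same idea.

```python
def edit_distance(a, b):
    """Minimum number of unit-cost edits turning sequence a into sequence b."""
    m, n = len(a), len(b)
    # dist[i][j] = cost of turning a[:i] into b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # delete the i first items of a
    for j in range(n + 1):
        dist[0][j] = j          # insert the j first items of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete a[i-1]
                dist[i][j - 1] + 1,                 # insert b[j-1]
                dist[i - 1][j - 1] + substitution,  # keep or substitute
            )
    return dist[m][n]


distance = edit_distance("kitten", "sitting")
```

The function also accepts lists, so it can compare two sequences of child labels as well as two strings.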
Experiments: Speed of several algorithms
Algorithms: Overview
[Figure: a “From:” tree and a “To:” tree; the element names were lost in extraction]
• The cheapest choice would be to move […] and […] (cost=2). But finding the best script with ‘move’ operations is NP-hard.
• The minimum edit script consists in deleting […] and […] and then inserting them (cost=4) (MMDiff).
• Preprocessing often consists in mapping identical subtrees. In this case, an additional ‘move’ operation will be needed (cost=5).
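The preprocessing step of mapping identical subtrees can be sketched with subtree signatures: each node is hashed together with its text and the signatures of its children, and nodes with equal signatures in the two versions are paired. This is only an illustration of the idea, not the algorithm of XyDiff or of any other tool named here; the `Node` class and function names are assumptions.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    text: str = ""
    children: list = field(default_factory=list)


def signature(node, table):
    """Hash a subtree bottom-up and record the node under its signature."""
    child_sigs = [signature(child, table) for child in node.children]
    payload = "\x00".join([node.label, node.text, *child_sigs])
    sig = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    table.setdefault(sig, []).append(node)
    return sig


def match_identical_subtrees(old_root, new_root):
    """Pair nodes whose whole subtrees are identical in both versions."""
    old_table, new_table = {}, {}
    signature(old_root, old_table)
    signature(new_root, new_table)
    pairs = []
    for sig, old_nodes in old_table.items():
        for old_node, new_node in zip(old_nodes, new_table.get(sig, [])):
            pairs.append((old_node, new_node))
    return pairs


# Hypothetical versions: the two children have swapped places.
old = Node("root", children=[Node("a", "1"), Node("b", "2")])
new = Node("root", children=[Node("b", "2"), Node("a", "1")])
pairs = match_identical_subtrees(old, new)
matched = sorted(o.label for o, _ in pairs)
```

Here the two children are matched although their order changed, while the roots are not (their signatures differ), so a later stage would still have to pair the roots and describe the reordering, for instance with a move.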
Experiments: Quality (measured by the Edit Cost)
Experiments: Speed (focus on DeltaXML)
Comparison summary
• Many other algorithms have no particular advantage
• MMDiff is the reference for quality
• DeltaXML and XyDiff are good quality/performance compromises, but the performance of XyDiff is more regular
• Performance measurements for Microsoft’s tool will be available soon; it seems comparable in performance to DeltaXML
Other issues
Constrained Diff is often interesting:
• Using ‘keys’ to match specific nodes (e.g. DeltaXML)
• Using XMLSchema or DTD information
• Time-constrained diff (e.g. XyDiff)
Postprocessing of results?
What’s next?
Change Detection:
• We are currently working on Microsoft’s XML Diff
• Use XMLSchema (or DTD) information
• Mining changes? Use learning?
Representing Changes:
• Unify and improve existing features
• Support queries!
• Chain versions?
Questions?
Thank you