Change Control in XML

Course: Semi-structured Data

DEA I3: Information, Interaction, Intelligence

Grégory Cobena http://www-rocq.inria.fr/verso/ [email protected]

20/12/2002

Objectives

Understand the management of dynamic data:

• At large scale: the case of a Web data warehouse (Xyleme)
• At the scale of an XML document: the case of version management

Stakes

• Semi-structured data must provide a more precise description than plain text, with well-defined semantics
• Change management in semi-structured data is even more complex than in relational databases

20/12/2002

DEA I3 - Données semi-structurées - Grégory Cobéna

Motivations: at the scale of the Web

Change control means, first of all:

• Discovering data sources and XML documents, on the Web or on an intranet
• Tracking these documents over time
• Extracting knowledge about what changes: the documents, their properties, their content

Motivations: at the scale of the document

Where does the notion of change arise?

• When managing several documents, one studies inter-document changes. Example: an XML file describing two car models, a Peugeot 307 and a 206
• When one is interested in the evolution of a given document over time. Example: an XML file describing an address book

Course outline

Xyleme

• A large-scale XML data warehouse
• Integration of Web data
• Active monitoring of Web data

XML Diff

• Representation of changes
• Detection of changes

Part One: Xyleme

A Dynamic Warehouse for the XML data of the Web

Organization

1. The Web and XML
2. Xyleme
3. Data Acquisition and Maintenance
4. XML Repository, Semantic Data Integration and Query Processing
5. Query Subscription
Conclusion

(Part I: Xyleme) 1. The Web and XML

The Web today

• Terabytes of data
• A lot of public pages: 1 billion in [06/2000], several million servers
• Private web: pages that are not publicly available
• Deep web: data hidden behind forms

HTML = Hypertext Language

The X23 new camera replaces the X22. It comes equipped with a flash (worth by itself 53.99 $) and provides great quality for only 359.99 $.

• Text + presentation
• Where is the data?
• HTML → Information System: hard

Information System:

Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99

XML = Semistructured Data

<product reference="X23"> … camera … 359.99 …
<product reference="R2D2"> … Robot … 19350 …

• Data + Structure
• Semistructured: more flexible
• XML → Information System: easy

XML: Tree Types

product-table
  product
    reference
    designation
    price
    description

Semantics and structure are in paths:

• product-table/product/reference
• product-table/product/price

(Part I: Xyleme) 2. A Dynamic Warehouse for the XML Data of the Web

Xyleme Research

Project Xyleme at INRIA (1999-2000): explore XML + Web + DBMS to make the Web a knowledge database

• INRIA: Sophie Cluet (databases, OQL…), Serge Abiteboul (semi-structured data + web), Guy Ferran (ex O2 Technology)
• Mannheim University: Guido Moerkotte
• Université d’Orsay: Marie Christine Rousset
• CNAM: Dan Vodislav

Xyleme Company

Started September 2000 (25 employees end of 2001)

Market challenges:

• Few XML documents available on the Web (because of weak software support)
• Company is focusing on private XML: press, editors, financial data, biology…

Technology:

• Scalability for large amounts of data
• Internet (+focus) / Intranet support
• Monitoring and version management
• Heterogeneous data integration

Architecture

• Cluster of PCs
• Developed with Linux and C++
• Communications: local: Corba; external: HTTP
• Distribution between autonomous machines
• Now Web Services

Functional Architecture

[Figure: User Interface and Web Interface, connected through the Internet to the Xyleme Interface; modules: Change Control, Semantic Module, Query Processor, Acquisition, Loader & Crawler, Repository and Index Manager.]

(Part I: Xyleme) 3. Data Acquisition and Maintenance, Page Importance

Architecture

[Figure: machines reached from the Internet and linked to each other by Ethernet. Components: Change Control and Semantic Integration; Acquisition and Maintenance; several Index, Loader | Query and Repository nodes.]

Goals

• Discover XML pages on the web that are of interest for customers; for this, crawl the web (HTML+XML)
• Maintain them up to date
• Do this under bounded resources: memory for known URLs, bandwidth

Life Cycle of a page in Xyleme

• The URL of D is discovered as a link in another page (or published by a customer)
• The page scheduler decides to read D
• The metadata of D is read: type, last_date_update...
• The document D is loaded
• The document D is (re)read regularly

Main Issues

• Loading of pages: we can load up to 5 million pages/day on a standard PC; the main cost is the Internet connection
• Metadata management (access to disk)
• Page scheduling: decide which page to read or refresh next

Page Importance

• Definition: important pages are linked to by important pages
• Offline algorithm (used by Google)
• Our online algorithm (M. Preda, S. Abiteboul, G. Cobena): does not require maintaining graph information; faster convergence with focused crawling
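The definition "important pages are linked to by important pages" is a fixpoint, and the classical offline way to compute it is a power iteration over the link graph. The sketch below is that textbook formulation on a tiny hypothetical graph; the function name `page_importance`, the damping factor 0.85 and the sample graph are assumptions for illustration. It is not the online algorithm mentioned on the slide.

```python
def page_importance(links, damping=0.85, iterations=50):
    """Power iteration over a link graph {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small base importance (assumed damping model).
        new_score = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page shares its importance among the pages it links to.
                share = damping * score[p] / len(targets)
                for q in targets:
                    new_score[q] += share
            else:
                # A page without out-links spreads its importance evenly.
                share = damping * score[p] / n
                for q in pages:
                    new_score[q] += share
        score = new_score
    return score


# Hypothetical four-page web: "c" is linked to by three pages, "d" by none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
importance = page_importance(graph)
```

Note that this version needs the whole link graph in memory, which is exactly the cost the slide says the online algorithm avoids.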

(Part I: Xyleme) 4. XML Repository: Semantic Data Integration and Query Processing

Querying Language

Today: a mix of OQL and XQL. We are currently moving to XQuery (which is also a mix of OQL and XQL…)

Select boss/Name, boss/Phone
From comp in BusinessDomain, boss in comp//Manager
Where comp/Product contains “Xyleme”

Web Heterogeneity

• Semantic domains, e.g., cinema
• Many possible types for data in this domain, many DTDs
• Semantic integration: one abstract DTD for the domain; gives the illusion that the system maintains a homogeneous database for this domain
• 1 domain = 1 abstract DTD

Indexing

• Standard inverted index: word → documents that contain this word
• Xyleme index: word → elements that contain this word (document + element identifier)
• Goal: more work can be performed without accessing data
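The difference between the two indexes can be sketched as follows. This is a minimal illustration, not Xyleme's index: the function name `build_indexes`, the use of the document-order position as element identifier, and the two sample documents are all assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_indexes(documents):
    """Build a document-level and an element-level inverted index."""
    doc_index = defaultdict(set)      # word -> {doc_id}
    element_index = defaultdict(set)  # word -> {(doc_id, element_id)}
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # element_id is the element's position in document order (assumption).
        for element_id, element in enumerate(root.iter()):
            for word in (element.text or "").lower().split():
                doc_index[word].add(doc_id)
                element_index[word].add((doc_id, element_id))
    return doc_index, element_index


docs = {
    "d1": "<product><name>camera</name>"
          "<description>camera with flash</description></product>",
    "d2": "<catalog><product><name>camera</name></product>"
          "<product><name>flash</name></product></catalog>",
}
doc_index, element_index = build_indexes(docs)

# Both documents contain both words somewhere ...
same_document = doc_index["camera"] & doc_index["flash"]
# ... but only one element contains both, and the index alone tells us which.
same_element = element_index["camera"] & element_index["flash"]
```

With the element-level postings, the question "do these words occur in the same element?" is answered by intersecting index entries, without loading either document.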

I.4.1 Xyleme: Semantic Data Integration


Data Integration

• One application domain, several schemas: heterogeneous vocabulary and structure
• Xyleme Semantic Integration:
  – gives the illusion that the system maintains a homogeneous database for each domain
  – abstracts a set of DTDs into an abstract DTD = a hierarchy of pertinent terms for a particular domain

Technology in short

• Cluster DTDs into application domains: business, culture, tourism, biology, …
• For an application domain, semi-automatically:
  – Organize tags into a hierarchy of concepts using thesauri such as WordNet and other linguistic tools
  – This provides the abstract DTD for the particular domain
  – Generate mappings between concrete DTDs and the abstract one

I.4.2 Xyleme: Query Processing

Xyleme Query Language

A mix of OQL and XQL; will use the W3C standard when there is one

Select product/name, product/price
From doc in catalogue, product in doc/product
Where product//components contains “flash” and product/description contains “camera”

Principle of Querying

Query on abstract DTD → union of concrete queries (possibly with joins)

Mappings between concrete and abstract DTDs:

catalogue/product/price
  ⇒ d1//camera/price
  ⇒ d2/product/cost

catalogue/product/description
  ⇒ d1//camera/description
  ⇒ d2/product/info, ref
  ⇒ d2/description
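The translation from an abstract path to a union of concrete paths can be pictured as a lookup in a mapping table. The sketch below reuses some of the paths shown on the slide; the names `MAPPINGS` and `concrete_queries`, and the convention that the first path component names the source, are assumptions made for the example.

```python
# Abstract path -> concrete paths (a subset of the mappings on the slide).
MAPPINGS = {
    "catalogue/product/price": ["d1//camera/price", "d2/product/cost"],
    "catalogue/product/description": ["d1//camera/description",
                                      "d2/product/info"],
}


def concrete_queries(abstract_paths):
    """Group the concrete paths for a query by the source they refer to."""
    per_source = {}
    for abstract in abstract_paths:
        for concrete in MAPPINGS.get(abstract, []):
            # Assumed convention: the first component identifies the source.
            source = concrete.split("/")[0]
            per_source.setdefault(source, []).append(concrete)
    return per_source


plan = concrete_queries(["catalogue/product/price"])
```

Grouping by source is one simple way to see which sources hold relevant data before any concrete query is executed.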

Query Processing

1. Partial translation, from abstract to concrete, to identify “machines” with relevant data
2. Algebraic rewriting; linear search strategy based on simple heuristics: in priority, use in-memory indexes and minimize communication
3. Decomposition into local physical subplans and installation
4. Execution of plans
5. If needed, relaxation

Essential use of a smart index combining full-text and structure

I.4.2 Xyleme: Repository

Storage System: Xyleme Store

• Efficient storage of trees in variable-length records within fixed-length pages
• Balancing of tree branches in case of overflow: minimizes the number of I/Os for direct access and scanning
• Good compromise: compaction / access time

Tree Balancing in Xyleme Store

[Figure: Records 1 to 4 stored in pages. On overflow, either more children are placed in another page, or a sub-tree is placed in another page.]

Questions ?

(Part I: Xyleme) 5. Change Control

The Web changes all the time

• Data acquisition + maintenance: keep the warehouse up-to-date
• Version management: representation and storage of change (see part II)
• Change monitoring: query subscription

Subscription Language

• SQL-like language based on ‘atomic events’
• Combines the use of monitoring queries and continuous queries
• The language can be extended by adding new types of atomic events
• Uses the XML Query Language for continuous queries: “Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report

Example

subscription myPaintings
% what are the new painting entries in Musee d’Orsay site
monitoring newPainting
select URL
where URL extends www.musee-orsay.fr/*
and contains “Monet”
% (the two conditions above are atomic events)
% manage the changes in the expositions
continuous delta Exposition
select ... from ... where
when monthly
notify daily % send me a daily report

Step 1: Atomic Event Detection

[Figure: a flow of documents d (5 million pages/day) goes from the XML loader through the metadata manager and the alerters; the document and its alerts (d/46, then d/46,67) are passed on to complex event detection.]

• atomic event 46: URL matches pattern www.musee-orsay.fr/*
• atomic event 67: XML document contains the tag with the value “Monet”

Alerters

• Each alerter can be viewed as a plug-in that acts on a document flow
• All sorts of atomic events can be detected: URL pattern detection, keywords, XPath expressions, page rank…
• Can be distributed
• Some advanced alerts are:
  – Long string look-ups
  – Finding XML patterns (e.g. XPath)
  – Comparing digital signatures of text documents (copy tracker)

URL Patterns Detection (1)

• Supported patterns: URL | prefix* | *suffix
• Using a hash table: try all possible patterns
  – Example: http://www.inria.fr/verso/index.html
  – Test: http://www.inria.fr/verso/*, http://www.inria.fr/*
• Each test is in O(1); total test time is O(n), where n is the length of the URL

URL Patterns Detection (2)

• Using a tree: navigate the tree until a leaf is encountered
• Patricia trees?
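The hash-table approach of "URL Patterns Detection (1)" can be sketched as below: every pattern of a supported form that could match the URL is generated and looked up in a dictionary of registered patterns. The names `candidate_patterns` and `detect`, the event numbers and the `*.xml` pattern are assumptions for the example.

```python
def candidate_patterns(url):
    """Yield every supported pattern (URL, prefix*, *suffix) matching url."""
    yield url                       # exact URL
    for i in range(len(url) + 1):
        yield url[:i] + "*"         # prefix*
        yield "*" + url[i:]         # *suffix


def detect(url, registered):
    """Return the atomic events whose registered pattern matches url."""
    found = {registered[p] for p in candidate_patterns(url) if p in registered}
    return sorted(found)


# Hypothetical registrations: pattern -> atomic event number.
registered = {
    "http://www.inria.fr/verso/*": 1,
    "http://www.inria.fr/*": 2,
    "*.xml": 3,
}
events = detect("http://www.inria.fr/verso/index.html", registered)
```

The number of look-ups is linear in the URL length, as on the slide. In this sketch each slice and hash is itself linear in the prefix length, so reaching a true O(n) total would need incremental hashing, which is left out here.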

Keywords Sequence Algorithm

• Problem: detect « Air France »
• Solution:
  – a tree of backward keyword sequences
  – a context memory with O(1) update cost
• The tree is implemented over a hash table

Simple XPath filtering Algorithm

• Problem: detect CONTAINS « word »
• Solution:
  – Reverse the path expression
  – Use postfix order
  – Use a stack for ‘//’ and another stack for ‘/’

Simple XPath filter example: understanding the tree structure in postfix order

Consider a tree whose leaf is the text « toto ». Nodes come as:

• toto (id=1, level=4)
• C (id=2, level=3)
• C (id=3, level=3)
• B (id=4, level=2)
• A (id=5, level=1)

Simple XPath: Example

CONTAINS toto is detected by:

• « toto »::ancestor
• When « toto » is detected, it is stored
• For each ancestor of « toto », its name is compared to the tag named in the expression
• All tests are executed using a hash table
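The "Keywords Sequence Algorithm" slide above can be illustrated with a small sketch: the phrases are stored word by word in reverse order in a trie kept in one hash table, and a bounded context of recent words is walked backwards each time a new word arrives. All names (`build_backward_trie`, `detect_phrases`) and the sample sentence are assumptions; this follows the idea on the slide, not the actual Xyleme code.

```python
from collections import deque


def build_backward_trie(phrases):
    """Store each phrase, last word first, in a trie held in one hash table."""
    edges = {}      # (node_id, word) -> child node_id
    terminal = {}   # node_id -> phrase ending at this node
    next_id = 1
    for phrase in phrases:
        node = 0
        for word in reversed(phrase.lower().split()):
            key = (node, word)
            if key not in edges:
                edges[key] = next_id
                next_id += 1
            node = edges[key]
        terminal[node] = phrase
    return edges, terminal


def detect_phrases(words, phrases):
    """Return (position, phrase) for every phrase ending at that position."""
    edges, terminal = build_backward_trie(phrases)
    longest = max(len(p.split()) for p in phrases)
    context = deque(maxlen=longest)   # appending is O(1), old words drop out
    hits = []
    for position, word in enumerate(words):
        context.append(word.lower())
        node = 0
        for previous in reversed(context):   # walk the text backwards
            node = edges.get((node, previous))
            if node is None:
                break
            if node in terminal:
                hits.append((position, terminal[node]))
    return hits


hits = detect_phrases("I flew Air France to Paris".split(),
                      ["Air France", "France"])
```

Storing the sequences backwards means the walk always starts from the word that just arrived, so no partial-match state has to be carried from one word to the next.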

Stemming

On the Alerter

• Example: Éléphant → ELEPHANT
• Do it for 500 documents / second
• Noise may be introduced (example: tâche = tache)

On the Subscription Manager

• To avoid duplicate registration of similar events
• To show the user how his query is stemmed

Real stemmers: chevaux → cheval

Step 2: Complex Event Detection

[Figure: the HTML parser and the XML loader send millions of alerts per day to complex event detection, which checks them against millions of subscriptions.]

complex event 12: 67 & 46 (XML document contains the tag with value “Monet” and URL matches pattern www.musee-orsay.fr/*)
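A complex event such as "12: 67 & 46" is a conjunction of atomic events. One simple way to detect conjunctions, sketched below, is to index complex events by their atomic events and count how many of them were raised for a document. This counting scheme is an illustration chosen for brevity, not the hash-tree algorithm of the slides; the class and method names are assumptions.

```python
from collections import defaultdict


class ComplexEventDetector:
    """Detect conjunctions of atomic events by counting (illustrative)."""

    def __init__(self):
        self.size = {}                      # complex id -> number of atomic events
        self.by_atomic = defaultdict(list)  # atomic id -> complex ids using it

    def register(self, complex_id, atomic_ids):
        atomic_ids = set(atomic_ids)
        self.size[complex_id] = len(atomic_ids)
        for atomic_id in atomic_ids:
            self.by_atomic[atomic_id].append(complex_id)

    def detect(self, detected_atomic_ids):
        """Return the complex events whose atomic events were all detected."""
        counts = defaultdict(int)
        for atomic_id in set(detected_atomic_ids):
            for complex_id in self.by_atomic.get(atomic_id, ()):
                counts[complex_id] += 1
        return sorted(c for c, k in counts.items() if k == self.size[c])


detector = ComplexEventDetector()
detector.register(12, [67, 46])           # the example from the slide
raised = detector.detect([46, 67])        # document d/46,67
```

Only the complex events that share at least one atomic event with the document are touched, which matters when millions of subscriptions are registered.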

Complex Events Algorithm

• The formal problem is NP-hard
• We proposed several possible algorithms
• Experimental (simulation) values proved the effectiveness of our solutions
• The hash-tree based algorithm is well suited for our application:
  – 10 million complex events
  – 1 million atomic events
  – 100 atomic events detected per document
  – 0.8 ms to process a document
  – ~2 million documents per day (on each PC)

Step 3: Notification Processor

[Figure: alerts go through complex event detection to notification/monitoring; a clock triggers the continuous queries, which produce notification/results; both reach the Reporter. Millions of notifications/day.]

Architecture

[Figure: components are Monitoring Applications, Xyleme Query Processor, Trigger Engine (Complex Event Detection), Xyleme Alerter (documents), Xyleme Reporter, Xyleme Subscription Manager (SQL) and Web Browser.]

Copy tracking

• Example: a press agency wants to check that people are not illegally publishing copies of its wires
• Need to react fast to changes: an illegal copy of a wire may stay online for only a couple of days

[Figure: (1) slice the document; (2) query a search engine, or run a specific crawl + pre-filter, to obtain a flow of candidate documents; (3) filter and detection.]

Web portal management

• Standard portal management:
  – Unreachable pages
  – Dangling pointers
  – Incorrect pages (e.g., do not parse)
  – Detection of interesting pages on the web
  – Etc.
• Portal archiving
• Subscription and notification

Web surveillance

Applications:

• Anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisition of chemical products
• Business intelligence, e.g., discovering potential customers, partners, competitors

• Find the data (crawl the web)
• Monitor the changes: new pages, deleted pages, changes in a page
• Classify information and extract data of interest: data mining, text understanding, knowledge representation and extraction, linguistics… Very AI

Conclusion & Prospectives

• Focus crawling on important pages: refine the notion of importance; improve the discovery of important pages
• Improve change control accuracy
• Versions:
  – Temporal queries (persistent identification of nodes)
  – Version some documents or some sites (store a ‘delta’)
  – Change monitoring (query changes)
• Semantic web
• Real-time advanced processing

Questions ?

Part Two: XML Diff

(Part II: XML Diff) 1. Detecting Changes in XML Documents

Grégory Cobéna, Serge Abiteboul, Amélie Marian

INRIA Rocquencourt, Columbia University

Objectives:

• We proposed a representation of changes: “Change-Centric Management of Versions” (VLDB 2001)
• We developed a Diff algorithm for XML: “Detecting Changes in XML Documents”, G. Cobena, S. Abiteboul, A. Marian, ICDE 2002 (San Jose)

Introduction

Algorithms for detecting changes in XML documents:

• An XML Diff algorithm
• A comparative study for XML change detection

Overview

Plan:

• Motivations
• State of the art
• Change model
• Algorithm: tradeoff ‘quality’ versus speed; quasi-linear time and space complexity
• Experiments: synthetic and real world experiments

II.1.1 XML Diff

Motivations

• Change-centric management of versions in an XML warehouse: A. Marian, S. Abiteboul, G. Cobéna, L. Mignet, VLDB 2001
• Monitoring XML data on the Web: B. Nguyen, S. Abiteboul, G. Cobéna, M. Preda, SIGMOD 2001
• Learning about changes
• Architecture and requirements (‘speed’)
• Multiple optimality criteria (‘quality’)

What is a diff ?

• Unix Diff: shows the lines that differ between two text files
• String Diff: shows which symbols have changed
• XML Diff: which parts of the tree have been modified, inserted or deleted

In fact, all these problems are very similar

The String Edit Problem

• Consider string: abcdefg
• How to transform it into: bczdeyz ?
• Possible solutions:
  – Delete all 7 chars and insert 7 other chars
  – Update a into b, b into c, c into z, f into y, g into z
  – Mix both solutions
• Question: what is the shortest edit sequence?

Solving the String-Edit-Problem

If we know how to transform S1 into S2, then we know how to transform S1x into S2y.

Conversely, to find the shortest path for transforming S1x into S2y, it is sufficient to compare the following transformations:

• S1 into S2, then x into y
• S1x into S2, then insert y
• delete x, and then S1 into S2y

String Edit Problem: The algorithm

• Two strings S1 and S2
• Cost(x,y) represents the shortest edit cost to transform S1[1..x] into S2[1..y]
• The cost is the sum of individual costs for each edit operation (insert, delete, update)
• Then, Cost(x,y) is the min of:
  – Cost(x-1,y-1) + update_cost(S1[x],S2[y])
  – Cost(x-1,y) + delete_cost(S1[x])
  – Cost(x,y-1) + insert_cost(S2[y])

A Quadratic Solution

The solution is to represent all possible paths on a matrix: M[0..|S1|][0..|S2|]

• M[x,y] represents the cost of transforming S1[1..x] into S2[1..y]
• M[x,y] can be computed using M[x-1,y-1], M[x-1,y] and M[x,y-1]
• M[0,i] and M[i,0] are obvious (i insertions or i deletions)
• Thus, M[|S1|,|S2|] can be computed

Note that the number of paths is exponential, but the cost remains quadratic. Time and space cost is O(|S1|*|S2|).
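The recurrence and the matrix above translate directly into a short dynamic program. The sketch below assumes unit costs by default and an update cost of 0 when the two symbols are equal; the function name `edit_cost` is an assumption.

```python
def edit_cost(s1, s2, insert_cost=1, delete_cost=1, update_cost=1):
    """Fill M so that M[x][y] is the cost of turning s1[:x] into s2[:y]."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]
    for x in range(1, rows):            # M[x,0]: delete the first x symbols
        m[x][0] = m[x - 1][0] + delete_cost
    for y in range(1, cols):            # M[0,y]: insert the first y symbols
        m[0][y] = m[0][y - 1] + insert_cost
    for x in range(1, rows):
        for y in range(1, cols):
            same = s1[x - 1] == s2[y - 1]
            m[x][y] = min(
                m[x - 1][y - 1] + (0 if same else update_cost),
                m[x - 1][y] + delete_cost,
                m[x][y - 1] + insert_cost,
            )
    return m[-1][-1]


# The example of the slides: abcdefg into bczdeyz.
cost = edit_cost("abcdefg", "bczdeyz")
```

For the slide example the minimum is 4, for instance: delete a, insert z after c, update f into y, update g into z. That is cheaper than the five updates of the purely positional solution.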

Questions?

State of the art (1): the string edit problem

• Best result is an O(|s|^2 / log |s|) solution over a finite alphabet
• O(|x|*|y|) solution with a directed acyclic graph

[Figure: edit graph between a source string and a destination string, with edges labelled “delete C (cost=1)”, “insert C (cost=1)” and “do nothing (cost=0)”.]

Extending the String problem

• Compute M[x,y] only close to the diagonal (E. Myers): finds the solution in O(n*D), where n is the size of the largest string and D the distance between the two strings
• Adapt M[x,y] to work on trees (S. Chawathe): remove some edges to ensure that deleting a node will delete the subtree rooted at that node (and conversely for insert)

State of the art (2): the tree pattern matching for XML

• Kuo-Chung Tai, Lu, Selkow: based on the string edit problem
• In XML, many labels are identical
• Unix Diff, Sun DiffML
• LaDiff (MH-Diff), Chawathe, Rajaraman, Garcia-Molina, J. Widom:
  – matching criteria to compare nodes and subtrees
  – quadratic in the ‘distance’ between both trees
• IBM diff available at alphaworks

Data Model

Issue: persistent identification of nodes

[Figure: two versions of a Catalog tree made of products (Pr), each with a name (N) and a price (P). Version 1: Camera 300, TV 100, VCR 200. Version 2: TV 100, DVD 500, VCR 150.]

Change Model

• Attach persistent identifiers:
  – to every node = XID
  – to the document = XID-map
• Represent changes with a Delta:
  – Delta = set of changes
  – Nice mathematical properties

Change-centric Management of versions, VLDB2001

[Figure: the same two versions with XIDs assigned in postfix order. In Version 1 the Camera product is nodes 1-5, the TV product nodes 6-10, the VCR product nodes 11-15 and the Catalog root node 16; XID-map: (1-16|17). The Camera product (5) is marked Delete, the VCR price (13) is marked Update, and the DVD product (17-21) is marked Insert.]
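The XIDs of Version 1 in the figure above correspond to a postfix (children first) numbering of the tree, text nodes included. The sketch below reproduces that numbering on a small tuple-based tree; the helper names `postfix_labels` and `product` and the tree encoding are assumptions for the example.

```python
def postfix_labels(node):
    """Labels in postfix order; the 1-based position of a label is its XID."""
    if isinstance(node, str):           # text node
        return [node]
    label, children = node
    labels = []
    for child in children:
        labels.extend(postfix_labels(child))
    labels.append(label)                # a node comes after all its children
    return labels


def product(name, price):
    """Build a Pr element with a name (N) and a price (P)."""
    return ("Pr", [("N", [name]), ("P", [price])])


version1 = ("Catalog", [product("Camera", "300"),
                        product("TV", "100"),
                        product("VCR", "200")])
xids = dict(enumerate(postfix_labels(version1), start=1))
```

With this numbering the Camera product is node 5, the VCR price text is node 13 and the Catalog root is node 16, so the next free identifier is 17, as in the XID-map (1-16|17).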

Diff (V1,V2):

• delete(5)
• update(13,150)
• insert(16,2,(17-21))

New XID-map: (6-10,17-21,11-16|22)

Objectives

• Assign persistent identifiers by matching nodes
• Compute a representation of changes between the two documents
• Also:
  – Constraint-awareness: follow DTD specifications
  – Correctness: no change is missed

Representation of Changes: Example

• delete: Camera, $300
• insert: DVD, $400
• update: $200 → $150

II.1.2 XML Diff: The XyDiff Algorithm

Algorithm: Intuition

Phase 1: Identify Subtrees

• One traversal of the tree
• Use ID-attributes from DTD to match nodes (or forbid matching)
• Compute for every subtree: signature, weight

Phase 2: Bottom Up + Lazy Down Propagation

• Let L be the list of all subtrees in the second document
• For each subtree S in L (in decreasing weight order):
  – 1/ Find all identical subtrees in the first document
  – 2/ Select acceptable matches
  – 3/ If at least one match, choose the best candidate
  – 4/ Propagate [very carefully] the matching to parents and ancestors
  – Remove S and all its subtrees from L

Phase 3: Optimization

• Some nodes are unmatched after previous phases
• Use previous results to propagate matching [now a bit less carefully]
• First, bottom-up: propagate matching to ancestors
• Then a quick top-down pass: propagate matching to descendant nodes based on element names if there is no ambiguity

Phase 4: Construct the delta

• Find inserted/deleted nodes
• Find “easy” move operations: parent node changed
• Find “complex” move = reordering children:
  – Largest common subsequence (weight)
  – Ex: A, B, C, D, E, F → E, D, A, B, C, F. The largest common subsequence is A, B, C, F; nodes D and E are ‘moved’
  – Complexity is quadratic
  – We approximate the solution in linear time

Key aspect: the weight of trees

• Definition of weight affects both speed and quality
• Look-up and propagation distances

Select Acceptable Match

• Use locality (e.g. find matching ancestors) to avoid wrong matches
• Two small trees are matched if some ancestors are matching
• For large trees, further look-up is accepted

Propagation

• Propagation should try not to induce wrong matches
• Intuition is that large matching subtrees are more relevant
• The larger the tree, the more we propagate the matching to ancestors

Algorithm: Tuning

• Choice affects speed and quality
• Trade-off: quality vs. speed
• We exhibit in the paper some bounds that guarantee linear time complexity
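Phase 1 and the first step of Phase 2 can be sketched as follows: every subtree gets a signature (a hash of its content) and a weight, and the subtrees of the new document are then matched, heaviest first, against identical subtrees of the old one. This is a simplified illustration: the weight is just the node count, there is no locality check and no propagation to ancestors, and the names `Node`, `annotate` and `match_identical` are assumptions.

```python
import hashlib


class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.signature = None
        self.weight = 0


def annotate(node):
    """One post-order pass computing the signature and weight of each subtree."""
    digest = hashlib.sha1((node.label + "\0").encode("utf-8"))
    node.weight = 1                      # simplified weight: number of nodes
    for child in node.children:
        annotate(child)
        digest.update(child.signature.encode("ascii"))
        node.weight += child.weight
    node.signature = digest.hexdigest()


def subtrees(node):
    yield node
    for child in node.children:
        yield from subtrees(child)


def match_identical(old_root, new_root):
    """Match identical subtrees, heaviest subtrees of the new document first."""
    annotate(old_root)
    annotate(new_root)
    by_signature = {}
    for n in subtrees(old_root):
        by_signature.setdefault(n.signature, []).append(n)
    matches, used_old, covered_new = [], set(), set()
    for s in sorted(subtrees(new_root), key=lambda n: -n.weight):
        if id(s) in covered_new:         # already inside a matched subtree
            continue
        free = [c for c in by_signature.get(s.signature, [])
                if id(c) not in used_old]
        if free:
            matches.append((free[0], s))
            used_old.update(id(n) for n in subtrees(free[0]))
            covered_new.update(id(n) for n in subtrees(s))
    return matches


def product(name, price):
    return Node("Pr", [Node("N", [Node(name)]), Node("P", [Node(price)])])


old = Node("Catalog", [product("Camera", "300"), product("TV", "100"),
                       product("VCR", "200")])
new = Node("Catalog", [product("TV", "100"), product("DVD", "500"),
                       product("VCR", "150")])
matches = match_identical(old, new)
```

On the catalog example, the whole TV product is matched as one subtree of weight 5, and the name of the VCR product as a subtree of weight 2; the VCR price is left for the later phases, which would detect it as an update.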

Complexity: n*log(n)

• Phase 1 (identification) is one traversal of the tree
• Phase 2 (propagation) is n times ‘get best candidate’ in the worst case:
  – the look-up level is designed to have a ‘get best candidate’ cost in O(log(n))
  – uses some pre-computed indexes
• Phase 3 (optimization) is designed to be linear
• Phase 4 (delta construction) is linear: the longest common subsequence of children is approximated

Experiments

• Simulator of changes on XML documents
• Speed and quality evaluation on synthetic data
• Comparison with Unix Diff on web data
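The reordering example of Phase 4 (A, B, C, D, E, F becoming E, D, A, B, C, F) can be checked with the exact, quadratic longest common subsequence computation. The sketch below is the standard unweighted dynamic program, not the linear-time approximation the slides refer to; the function name is an assumption.

```python
def longest_common_subsequence(old, new):
    """Exact LCS by dynamic programming (quadratic time and space)."""
    rows, cols = len(old) + 1, len(new) + 1
    # length[i][j] is the LCS length of old[i:] and new[j:].
    length = [[0] * cols for _ in range(rows)]
    for i in range(len(old) - 1, -1, -1):
        for j in range(len(new) - 1, -1, -1):
            if old[i] == new[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])
    # Walk the table from the start to rebuild one longest subsequence.
    i = j = 0
    result = []
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            result.append(old[i])
            i += 1
            j += 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


old_children = list("ABCDEF")
new_children = list("EDABCF")
kept = longest_common_subsequence(old_children, new_children)
moved = [c for c in old_children if c not in kept]
```

Children in the common subsequence keep their relative order and need no operation; the remaining children are the ones reported as moved.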

Simulator of changes

• Parameters control the number of delete/insert/move
• Input: XML document D
• Outputs:
  – delta of changes
  – XML document D’ = delta(D)

Typical Pattern

• Start from document D
  – Generate changes over D: D’ = delta(D)
  – Give (D, D’) to XyDiff and compute a delta

Synthetic Data: Speed of the algorithm

• Experimental verification that the algorithm is quasi-linear

Synthetic Data: Quality of the algorithm

[Figure: size ratio of the diff over the original delta.]

• Size of the computed delta is comparable to the ‘original’ delta size
• For large deltas, XyDiff finds more efficient operations

Comparison of the size of results: XyDiff vs. UnixDiff

• Experiments on 10,000 XML web documents that changed (at the time of that experiment, we had to crawl 10 million web pages to find them ☺)
• 80% of the documents below the size of (UnixDiff*1.2)
• Almost all below UnixDiff*2
• Of course, the delta of XyDiff contains much more information

Conclusion

• A novel algorithm for XML diff in quasi-linear time
• XML specificities are used to improve quality
• Available as Open Source freeware at: http://www-rocq.inria.fr/~cobena/XyDiffWeb/

Perspective

• Larger scale experiments on web data
• Learn about changes:
  – Frequency, patterns, …
  – Obtain statistics for DTD and XMLSchema
  – Use the statistics to learn about changes and improve XyDiff for typed XML data
• Use XML diff to observe changes between websites

XyDiff in Xyleme Architecture

[Figure: Web Crawler → XML Loader → XyDiff → Alerter, with Storage; XyDiff receives V(n) of the XML document and produces Delta(V(n-1),V(n)).]

Questions ?

(Part II: XML Diff) Comparative study of change detection in XML

Grégory Cobéna (INRIA), Talel Abdessalem (ENST), Yassine Hinnach (ENST)

Context

• Consider change control in XML data warehouses
• We want to understand changes
• We have only the old and new versions of documents
• A diff needs to be computed

Organization

• Motivations
• Data Model
• Representing Changes:
  – Version management and querying
  – Comparison of change representation models
  – Experiments
• Detecting Changes:
  – State of the art in change detection
  – Performance analysis and experiments
  – Quality analysis and experiments
• Summary

Motivations: Representing Changes

• Version management: the representation should allow for effective storage strategies
• Temporal databases: support for persistent identification of nodes is mandatory
• Monitoring: information about changes is used to support triggers or detect events
• Note: HTML or XHTML documents may be used

Motivations: Detecting Changes

• Correctness: the diff programs miss no changes
• Minimality of the result is important to save storage space and network bandwidth
• Semantics: some algorithms consider more semantics in XML documents
• Performance: with dynamic services and/or large amounts of data, high speed and low memory usage are mandatory
• ‘Move operations’: some algorithms support move operations whereas others don’t. This impacts both the performance of the tool and the quality of results.

II.2.1 XML Diff: Comparing Data Models

Data Model (quick overview)

Operations are:

• (i) insert, delete applied to leaves or subtrees
• (ii) update of text nodes
• (iii) move applied to a subtree root, moving the entire subtree

An edit cost is assigned to each operation. Usually, the cost is 1 per node touched.

The semantics of move is to identify subtrees even when their context has changed.

We use the notion of mapping between the two trees. Each node in document A (or B) that is not deleted (or inserted) is matched to the corresponding node in B (or A).

II.2.2 XML Diff Representing Changes

Data Model: Intuition Tai’s model: delete ‘b’

Selkow’s model: delete ‘b’ root

root a

b x

20/12/2002

c

a

b x

y

c y

DEA I3 - Données semi-structurées - Grégory Cobéna

109

Representing Changes

• There are several version management strategies. For instance, when only deltas are stored, their size must be kept small.
• We also consider the cost of reconstructing a document from the delta and the previous document. It is linear in all cases.
• Simple text-based version management is possible, but it cannot be used for querying.

Querying Changes

• Labeling nodes with prefix+postfix identifiers improves querying algorithms
• Labeling nodes with persistent identifiers improves temporal databases
• There is no short labeling scheme that is good for both
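As a rough illustration of the trade-off, the sketch below assumes that "prefix+postfix identifiers" means pre-order and post-order ranks, which is one common reading. With such labels, an ancestor test needs only two comparisons, but the labels shift as soon as a node is inserted, so they do not identify nodes persistently. The tree, the unique node names and the helper functions are all hypothetical.

```python
def label_tree(tree):
    """Return {name: (pre, post)} ranks for a (name, [children]) tree.

    Node names are assumed to be unique.
    """
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        name, children = node
        pre = counters["pre"]
        counters["pre"] += 1
        for child in children:
            visit(child)
        ranks[name] = (pre, counters["post"])
        counters["post"] += 1

    visit(tree)
    return ranks


def is_ancestor(ranks, ancestor, descendant):
    """True if `ancestor` is a proper ancestor of `descendant`."""
    pre_a, post_a = ranks[ancestor]
    pre_d, post_d = ranks[descendant]
    return pre_a < pre_d and post_a > post_d


v1 = ("root", [("a", []), ("b", [("x", []), ("y", [])]), ("c", [])])
ranks_v1 = label_tree(v1)
print(is_ancestor(ranks_v1, "b", "x"))  # True
print(is_ancestor(ranks_v1, "a", "x"))  # False

# Inserting one node in a new version changes the labels of unchanged nodes.
v2 = ("root", [("new", []), ("a", []), ("b", [("x", []), ("y", [])]), ("c", [])])
ranks_v2 = label_tree(v2)
print(ranks_v1["b"], ranks_v2["b"])  # (2, 3) (3, 4)
```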


Our Example

Version Management




Different representations

[Two versions of the example document; XML markup lost, text content only]

• Notebook, 2200MHz Pentium4, $1999; Digital Camera, Fuji FinePix 2600Z, Not Available
• Notebook, 2200MHz Pentium4, $1999; Digital Camera, Fuji FinePix 2600Z, $299


Change Models: XUpdate

[Example delta not recoverable. Surviving value: $299. Annotation: XPath expression]


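Change Models: DeltaXML (Example)

[Example delta not recoverable. Annotation: same look and feel as the document]

Change Models: XyDelta (Example)

[Example delta not recoverable. Surviving values: Not Available, $399]

Change Models: Microsoft XDL (Example)

[Example delta not recoverable. Surviving value: $299. Annotations: verify consistency; identify nodes]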

[Further annotations on the example deltas, slide of origin unclear: "What is the parent node?"; "element node"; "Persistent identifiers"; "The order is important (no ids, no move)"; "mentions some unchanged nodes"; values $399 and Not Available]

Storage Experiments: Comparing Delta Size

[Figure: delta sizes for XyDelta and DeltaXML; axes File Size and Edit Cost, logarithmic scales. Annotation: identifiers save space when there are few updates]

Summary

Unique advantages of XyDelta
• A formal model and nice mathematical properties
• Persistent identification of nodes (at least as an option)

Nice features that some are missing
• Validation by a DTD (may be a problem for DeltaXML, XyDelta)
• Verify the source document (only XDL)
• Support of 'move' operations (only XyDelta and XDL)
• Backward deltas (only XyDelta)
• Monitoring the delta (only XUpdate and DeltaXML)

Still missing for all of them
• A framework for querying

Change Models: Conclusion

• Change monitoring is easier with DeltaXML and XUpdate
• Temporal queries are easier to evaluate with XyDelta (persistent identifiers)
• Future work:
  • It is not yet clear how to query changes
  • Define transaction or synchronization protocols

II.2.3 Detecting Changes

State of the art

Based on the String Edit Problem (1966)

Tree-to-tree correction algorithms:
• Find the Minimum Edit Script in O(m*n) time and space, where m and n are the sizes of the two documents

Other algorithms:
• Run in linear time or close to it
• Match nodes or subtrees depending on their content

Experiments: Speed of several algorithms

[Plot not recoverable]
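To make the O(m*n) bound quoted in the state of the art concrete, here is a minimal sketch of the classical dynamic program for the string edit problem, from which tree-to-tree correction algorithms descend. It works on strings rather than trees and assumes unit costs for insert, delete and substitution (the counterpart of update in the data model); it is not the algorithm of any tool named in these slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of unit-cost edits turning `a` into `b`.

    Fills an (m+1) x (n+1) table, hence O(m*n) time and space.
    """
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete the first i characters of a
    for j in range(n + 1):
        dist[0][j] = j  # insert the first j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitute = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dist[i][j] = min(dist[i - 1][j] + 1,  # delete a[i-1]
                             dist[i][j - 1] + 1,  # insert b[j-1]
                             substitute)          # keep or substitute
    return dist[m][n]


print(edit_distance("kitten", "sitting"))  # 3
```

The quadratic table is what makes these algorithms costly on large documents, and it motivates the faster algorithms that match nodes or subtrees by content.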

Algorithms: Overview

[Figure: a "From" tree and a "To" tree; markup not recoverable]

• The cheapest choice would be to move the two elements concerned (cost=2). But finding the best script with 'move' operations is NP-hard.
• The minimum edit script consists of deleting the two elements and then inserting them (cost=4). (MMDiff)
• Preprocessing often consists of mapping identical subtrees. In this case, an additional 'move' operation will be needed (cost=5).

Experiments: Quality (measured by the Edit Cost)

Experiments: Speed (focus on DeltaXML)

Comparison summary

• There are many other algorithms that offer no advantage
• MMDiff is the reference for quality
• DeltaXML and XyDiff are good quality/performance compromises, but the performance of XyDiff is more regular
• Performance measurements for Microsoft's tool will be available soon; it seems comparable in performance to DeltaXML

Other issues

• Constrained Diff is often interesting:
  • Using 'keys' to match specific nodes (e.g. DeltaXML)
  • Using XMLSchema or DTD information
  • Time-constrained diff (e.g. XyDiff)
• Postprocessing of results?
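Key-based matching, the first form of constrained diff listed above, can be sketched with Python's standard `xml.etree.ElementTree`. This is a minimal illustration and not DeltaXML's actual API: the element names, the `id` key attribute and the `match_by_key` helper are all hypothetical, and the data loosely follows the catalog example used earlier.

```python
import xml.etree.ElementTree as ET


def match_by_key(old_parent, new_parent, key="id"):
    """Pair children of two elements that carry the same key attribute."""
    old = {e.get(key): e for e in old_parent if e.get(key) is not None}
    new = {e.get(key): e for e in new_parent if e.get(key) is not None}
    matched = [(old[k], new[k]) for k in old if k in new]
    deleted = [old[k] for k in old if k not in new]
    inserted = [new[k] for k in new if k not in old]
    return matched, deleted, inserted


old_doc = ET.fromstring("""
<catalog>
  <product id="notebook"><price>$1999</price></product>
  <product id="camera"><price>Not Available</price></product>
</catalog>""")

new_doc = ET.fromstring("""
<catalog>
  <product id="camera"><price>$299</price></product>
  <product id="notebook"><price>$1999</price></product>
</catalog>""")

matched, deleted, inserted = match_by_key(old_doc, new_doc)

# Report price updates among matched products, regardless of their order.
updates = [(o.get("id"), o.findtext("price"), n.findtext("price"))
           for o, n in matched
           if o.findtext("price") != n.findtext("price")]
print(updates)  # [('camera', 'Not Available', '$299')]
```

Because nodes are paired by key rather than by position, reordering the products does not produce spurious deletions and insertions.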

What's next?

Representing Changes:
• Unify and improve existing features
• Support queries!
• Chain versions?

Change Detection:
• We are currently working on Microsoft's XML Diff
• Use XMLSchema (or DTD) information
• Mining changes? Use learning?

Questions?

Thank you