UNIVERSITE PARIS 13 INSTITUT GALILEE

The PMML Interpreter INTERNSHIP REPORT

Sujeevan ASEERVATHAM, DESS Exploration Informatique des Données. Under the direction of: Erik MARCADÉ, KXEN Younes BENNANI, Université Paris 13

April 2002 – September 2002

ACKNOWLEDGEMENTS

I wish to express my gratitude to Erik Marcadé, KXEN's CTO, for his support during the training period. Special thanks go to Bertrand Lamy, who assisted me in the project's development and from whom I learned much about computer engineering. I am also particularly thankful to Benoît Rognier, who was always willing to help and to explain things on whatever subject (especially Linux). I also thank Fatima Adouhane, Alain Charroux, Redouane Mahrach and Victor Coustenoble for their help and kindness. Finally, I greet former, current and future trainees, especially Youssef Laghzali and Marie Berlioz.


ABSTRACT

This report describes the six-month traineeship that I did at KXEN during the last year of my computer engineering studies. It presents the new emerging standard for defining data mining models, along with some background information on data mining, and gives a full description of the project I had to develop. The objective of this training was to study this new standard, called the Predictive Model Markup Language (PMML) V2.0, which is intended to allow models to be exported from one application to another, thus opening a new era: the era of cooperation between data mining applications such as IBM's Intelligent Miner, KXEN Analytic Framework, SAS, Clementine, SPSS and many others. The study was to lead to the development of a program able to validate and execute PMML models. The program was intended to be given to the data mining community as an open source program in response to miscellaneous needs.


SUMMARY

The present report deals with my training period at KXEN from April the 1st, 2002 to September the 30th, 2002. An overview of data mining is given as well as information about the Predictive Model Markup Language.

KXEN is an American company of about 35 people, with the research and development department located in France, near Paris. The company is specialized in the development of data mining software components to be embedded in applications (usually statistical applications). KXEN is well known for the quality of its components. Indeed, KXEN is a partner of companies such as BusinessObjects, IBM, SPSS…

Data mining is a science that has emerged in response to the request of companies wanting to quickly obtain knowledge from large volumes of data. Data mining is used to find hidden patterns in a dataset by discovering relationships among the data. Those patterns can then be used to help in decision-making, usually in Customer Relationship Management. The data mining community is led by the Data Mining Group (DMG), which is composed of the most active companies in this science. One of the DMG's main objectives is to develop cooperation between data mining applications. That means, as a first step, making data mining models interchangeable, so that one application can read and execute models generated by another. Following this idea, the DMG has developed the Predictive Model Markup Language (PMML), an XML-based language which allows data mining applications to export most of their models to other applications. The latest version, PMML V2.0, defines eight kinds of models.

KXEN, strongly believing that PMML can be beneficial to the community, has decided to implement the PMML V2.0 standard and to provide a free tool, the "PMML Interpreter", able not only to read and validate PMML 2.0 models but also to execute such models on datasets.
The PMML Interpreter must be able to check the models' compliance to ensure that applications release valid models, in order to keep the standard free of corruption. Moreover, it must be able to execute models in order to convince the editors of data mining applications to convert their models to the PMML 2.0 standard. Indeed, software editors can support PMML at no cost because KXEN provides a free PMML interpreter that can be embedded into any application.

In the R&D department of KXEN, I was in charge of the PMML Interpreter project. I mainly worked on the translation of the PMML 2.0 specification into a C++ program.


TABLE OF CONTENTS

INTRODUCTION .......... 6
I. The presentations .......... 7
  I.1. The KXEN company .......... 7
    1.1. Company overview .......... 7
    1.2. The team .......... 7
      1.2.1. The management team .......... 7
      1.2.2. The Research and Development team .......... 7
      1.2.3. The Administration and Sales team .......... 8
      1.2.4. The Scientific team .......... 8
    1.3. KXEN Analytic Framework .......... 9
    1.4. The Business Model .......... 10
    1.5. Examples of applications of the KXEN components .......... 10
  I.2. Data Mining .......... 10
    2.1. What Is Data Mining? .......... 10
    2.2. The applications of Data Mining .......... 11
    2.3. The branches of industry of Data Mining .......... 12
  I.3. The Training .......... 12
    3.1. The subject .......... 12
    3.2. The Predictive Model Markup Language (PMML) .......... 12
    3.3. The aim .......... 14
    3.4. The work environment .......... 15
II. PMML V2.0 specification .......... 16
  II.1. The header .......... 16
  II.2. The Data Dictionary .......... 16
  II.3. The Transformation Dictionary .......... 16
    3.1. Constant .......... 17
    3.2. FieldRef .......... 17
    3.3. NormContinuous .......... 17
    3.4. NormDiscrete .......... 18
    3.5. Discretize .......... 18
    3.6. MapValues .......... 18
    3.7. Aggregate .......... 18
  II.4. The Mining schema .......... 19
  II.5. The Statistics .......... 19
    5.1. Counts .......... 19
    5.2. NumericInfo .......... 19
    5.3. DiscrStats .......... 19
    5.4. ContStats .......... 19
    5.5. Partition .......... 20
  II.6. The Regression model .......... 20
  II.7. The Clustering model .......... 21
    7.1. The Center-Based clustering .......... 21
    7.2. The Distribution-Based clustering .......... 23
  II.8. The Neural Network model .......... 28
III. The PMML Interpreter .......... 30
  III.1. The architecture .......... 30
    1.1. The Basis library .......... 30
    1.2. The XMLEventParser library .......... 30
    1.3. The StateMachine library .......... 31
    1.4. The PmmlTree library .......... 32
    1.5. The IOConnector library .......... 33
    1.6. The TreeBuilder library .......... 34
    1.7. The Data flows .......... 34
  III.2. The encountered problems .......... 35
    2.1. The Data types in PMML .......... 35
    2.2. The Missing, invalid and outlier values .......... 35
    2.3. The Distribution based clustering .......... 36
    2.4. The PMML V2.0 stability .......... 38
  III.3. The results .......... 39
CONCLUSION .......... 41
REFERENCES .......... 42
APPENDICES .......... 43
  Appendix 1: An example of PMML V1.1 File .......... 44
  Appendix 2: An example of PMML V2.0 File .......... 45
  Appendix 3: First Progress report, April the 8th, 2002 .......... 46
  Appendix 4: Nominal to Numeric encoding proposal .......... 51
GLOSSARY .......... 54
INDEX .......... 56


INTRODUCTION

This is the report of my training period at KXEN's Research and Development department, France. A training period is a required part of the graduation program of the computer science department at the Université de Paris 13, France. I am currently a fifth-year student, and this traineeship, which is about data mining and software development, concludes my studies.

The training period ran from April the 1st, 2002 to September the 30th, 2002. I joined the R&D department on April the 2nd, 2002 (April the 1st being a public holiday in France) and, after a week of presentations, I was given the PMML Interpreter project. The project was to develop a program able to read, validate and execute data mining models in the Predictive Model Markup Language (PMML) format. PMML was developed by the Data Mining Group to describe models generated by data mining applications. My mission was thus to propose a specification for the project and, once it was validated by my supervisor, Erik Marcadé, I was in charge of carrying the project out.

This document deals with the work I carried out during the traineeship. We will first present the company and the mission that I was given, as background information; we will then describe the PMML 2.0 specification, present the work done on the project, and finally close with a short conclusion about the training.


I. The presentations

I.1. The KXEN company.

1.1. Company overview
KXEN (Knowledge Extraction Engines) is a global analytic software company that provides advanced analytics to be embedded in existing enterprise applications and business processes. KXEN makes cutting-edge data mining technology available to both business decision makers and data mining professionals, providing them with more accurate, timely and actionable information. Customers gain the ability to understand, predict, manage and influence through a better knowledge of the information contained in their corporate data. The KXEN Analytic Framework is distributed through partnerships with leading system integrators, application vendors, and OEMs.

Founded in 1998 by Roger Haddad, Erik Marcadé and Michel Bera, KXEN is a privately held company headquartered in California with research and development facilities in France. After receiving an initial round of seed financing of $500,000 in March 1999, KXEN secured a Preferred Series B round of financing of $5.5 million in June 2000 and later extended that round by an additional $2 million from a core group of well-known investors including Sofinnova France, Sofinnova US, and Innovacom. As of August 31, 2002, KXEN has 35 employees worldwide, excluding contractors. Its offices are located across the USA (San Francisco, Chicago, New York and Charlotte), as well as in Europe (Paris, France; Peterborough, England; and Zurich, Switzerland).

1.2. The team

1.2.1. The management team
The management team is composed of the company's three founders:
- Roger Haddad: Chief Executive Officer
- Erik Marcadé: Chief Technical Officer
- Michel Bera: Chief Scientific Officer

1.2.2. The Research and Development team

- Erik Marcadé, CTO
- Bertrand Lamy, Senior Software Engineer
- Alain Charroux, Senior Software Engineer
- Willy Buet, System Administrator and Quality Control
- Benoit Rognier, Software Engineer
- Serge Danzanvilliers, Software Engineer
- Sebastien Ducamp, Software Engineer
- Cédric Simard, Documentation and Translation Manager
- François Paris, Software Engineer

1.2.3. The Administration and Sales team
- Roger Haddad, President of EMEA Operations
- Eric Sallou, France Operations Director
- Victor Coustenoble, Integration & Support Engineer
- Bruno Delahaye, EMEA Operations Manager
- Emmanuel Duhesme, Account Manager
- Thierry Mulot, Pre-Sales Engineer
- Claire TranLe, Europe Marketing and Communications Manager
- Jocelyne Gérault, CFO
- Sophie Chevallier, Accounting Manager
- Redouane Mahrach, Legal Manager
- Fatima Adouhane, Office Assistant

1.2.4. The Scientific team
Chaired by Michel Bera, the scientific committee groups several experts in France and in the US:
- Leon Bottou
- Olivier Chapelle
- Lee Giles
- Yann LeCun
- Philippe Lelong
- Gregory Piatetsky-Shapiro
- Gilbert Saporta
- Emmanuel Viennet


1.3. KXEN Analytic Framework.
KXEN Analytic Framework is composed of eight components that can be separated into three groups:
- The first group is composed of components used to transform and encode data before use. The automation of this treatment is one of the main strengths of KXEN: it saves a lot of time and allows very different files (historical data, log files, time series…) to be treated the same way.
- The second one contains the modelling tools: the objective of such components is to build the best model, the one that is the most suitable for the data and gives the best score. Here again, the user has several options: robust regression, support vector machine, segmentation…
- The last one contains only one component, which allows the user to apply the model. It creates C code that gives the same results as the initial model.

Data preparation:
- K2C (KXEN Consistent Coder): prepares data: encodes nominal and ordinal variables, automatically fills in missing values and detects out-of-range data.
- KEL (KXEN Event Log): aggregates events into periods of time: allows integrating transactional data with demographic customer data.
- KSC (KXEN Sequence Coder): aggregates events into a series of transitions (for example, a customer click-stream from a Web site can be transformed into a series of data for each session).
- KTS (KXEN Time Series): predicts meaningful patterns and trends in your data over time.

Modelling:
- K2R (KXEN Robust Regression): uses a proprietary regression algorithm to build predictive and descriptive models.
- KSVM (KXEN Support Vector Machine): a binary classification component, particularly well suited for analysing data sets with a small number of observations but a high number of variables.
- K2S (KXEN Smart Segmenter): discovers natural groupings or clusters in a set of data.

Generation of C code:
- KCG (KXEN Code Generator): generates C or XML code corresponding to the model built with the KXEN Analytic Framework.

1.4. The Business Model.
The business model of KXEN is based on a policy of indirect sales via partners. Taking into account the specificity of its offer (software components), KXEN is distributed through partnerships with leading system integrators, application vendors and original equipment manufacturers (OEMs). These various partners combine their trade competence with the new prospects brought by the KXEN components.

1.5. Examples of applications of the KXEN components.
- Predict which customers will buy a given product, and the reasons why.
- Improve risk analysis for credits, loans or mortgages.
- Find unexploited market segments and possible sources of profit.
- Discover influential factors on profit and productivity.
- Find customers about to leave.
- Predict road traffic or TV audience.
- Predict energy consumption in order to optimise production.

I.2. Data Mining.
During the last few years, the quantity of data to work with has greatly increased. Databases have become an essential solution to store such quantities of data. However, as the volume increases, it becomes harder and harder to analyse and interpret the data. Moreover, companies do not need data but information. It has thus become necessary to get information from large quantities of data.

2.1. What Is Data Mining?
Data mining (also known as KDD, Knowledge Discovery in Databases) is the solution to quickly get the essential information from data. One definition could be: the search for valuable information in large volumes of data. Data mining is at the crossroads of databases, statistics, symbolic learning and artificial intelligence. The goal is to build models that describe the relation between descriptive variables (variables that describe or explain a situation) and target variables (variables that are the results of the situation). Such models can be used:
- To describe the most important variables that are responsible for the results; we can then, for example, classify data.
- To predict the result according to a situation.
The schema below describes the usual way data mining is used:

With data mining, it is now possible to answer questions like "which products are likely to be bought by people according to their sex and address?" from observation data like:
- "Males who live in Paris buy the product B."
- "Females who live in Paris buy the product A."
- …
In this example, the descriptive variables are sex and address, and the target variable is product.

[Schema: the usual data mining flow — databases of observations are explored, sorted and managed; the useful data are then modeled (to analyze, predict or cluster); finally the results are displayed to meet the required goals (improving quality, targeting customers).]

2.2. The applications of Data Mining.
Data mining can be used for several purposes:
- To better target commercial efforts.
- To improve the quality of services.
- To detect fraudulent behaviors.
- To analyze technical data.
- …


2.3. The branches of industry of Data Mining.
The uses of data mining are numerous:
o Medicine and genetics.
o Astronomy.
o Industrial processes.
o Agriculture.
o Customer Relationship Management (CRM).
o …
Data mining is especially used in activities that need a better knowledge of the customers in order to improve sales.

I.3. The Training
The training at KXEN takes place in the Research and Development department from April the 1st, 2002 to October the 1st, 2002. It is intended to provide solid experience of software engineering in a state-of-the-art data mining company.

3.1. The subject
The subject of this training is to develop a program able to read and interpret the Predictive Model Markup Language (PMML). The program must respect the following constraints:
- In order to achieve good performance, it must be written in C++.
- It should run on a wide variety of platforms; that means it must be compilable by several compilers, even those that can, nowadays, be considered obsolete.
- It should be easy to integrate into a software environment, as a small component.
- It should be able to validate PMML files by indicating whether the file is PMML compliant or not. If the file is not compliant, the program must provide a human-readable error message indicating what is wrong with the file; the error message must be clear enough to quickly locate and correct the error in the file.
- It is intended to be distributed as free software, so it must be well commented to allow several people to deal with the software's maintenance and evolution.

3.2. The Predictive Model Markup Language (PMML)
The Predictive Model Markup Language is an XML (eXtensible Markup Language) based language used to describe statistical and data mining models (Appendices 1 & 2). XML is a set of conventions used to structure data in a text file that is both human-readable and machine-readable; it looks much like HTML (Hyper-Text Markup Language). PMML is nothing else than XML with its own conventions. The conventions are defined by a DTD (Document Type Definition) and by an XML schema, but they essentially rely on the specification provided in English.


PMML was defined by the DMG (Data Mining Group), which includes the following companies:
o Angoss Software Corp.
o IBM Corp.
o KXEN.
o NCR Corp.
o Magnify Inc.
o Microsoft.
o MINEit Software Ltd.
o Oracle Corp.
o Quadstone.
o National Center for Data Mining.
o SAS.
o SPSS Inc.
o Xchange Inc.
Currently, Robert Grossman is the person in charge of PMML maintenance and evolution. PMML provides a quick and easy way for companies to define and share predictive models. A PMML document is a definition of a fully trained analytic model with sufficient information for an application to build and apply the model on a data set. With PMML, it is now possible to exchange data mining models between various applications, regardless of the application that generated the model.

[Schema: a learning application builds a model from data and exports it as PMML; a scoring application imports the PMML model and applies it to data to produce a score.]

PMML V1.1 was released in August 2000. It provides the specification to define six kinds of models:
o Tree models.
o Neural Network models.
o Clustering models.
o Regression models.
o General Regression models.
o Association models.
Nowadays, the following software packages are known to generate PMML V1.1 compliant models:
o IBM's Intelligent Miner for Data.
o KXEN components.
o Dialogis's D-Miner (formerly known as Kepler).
And the following are able to interpret PMML V1.1 models:
o IBM's IM Scoring.
o Dialogis's D-Miner.
PMML V2.0 was released in August 2001. It introduces data transformations, so that PMML can describe data preprocessing before applying models. The supported transformations are the following:
o Constant (the result is always a constant).
o Continuous normalization by piecewise linear interpolation.
o Discrete normalization of nominal values to discrete numeric values.
o Discretization of continuous values to nominal values.
o Value mapping.
o Aggregate functions (sum, average…) of a variable.
It also supports two more kinds of models:
o Naïve Bayes models.
o Sequence models.
Nowadays, only KXEN Analytic Framework is known to generate PMML V2.0 models.

3.3. The aim
The goal of PMML is to allow the definition of predictive models while ensuring that they are independent of the application that produced them. In fact, since PMML introduces the ability to define vendor-specific conventions (tags), sharing models is not as simple as it seems. Indeed, applications abusively use such conventions to describe their models. As a result, the information contained in the PMML core specification is not enough to build consistent models from such files. It then often happens that a client cannot read a PMML file generated by a given producer, even though both claim to be PMML compliant. To face such problems, KXEN decided to provide the data mining community with a program that will help not only producers to generate PMML compliant models but also applications to read and interpret PMML models. The primary aim is to allow PMML producers to be sure that their files are PMML 2.0 compliant. In other words, if our program validates a file then it is 100% PMML 2.0 compliant, which means that any PMML client that cannot read such a file should be considered PMML 2.0 non-compliant. Moreover, it will also give the DMG a tool to validate that proposed PMML extensions can be executed correctly.


The usage of this program can be beneficial to the community. Indeed, the guarantee that all the files available on the web are compliant will help give a single direction to PMML's evolution. The secondary aim is to provide an open source interpreter that can apply PMML models to various kinds of data sources (files, databases…).

3.4. The work environment
The development team for this project is composed of only one member: the trainee, supervised by Erik Marcadé, the KXEN CTO. The materials available for the development are:
o A PC running Windows 2000.
o Emacs for writing the program.
o "cl", the Visual C++ compiler.
o The Cygwin bash shell for command-line compilation.
o Doxygen for generating the project's documentation from the program's comments.
o CVS to share the program's source code.


II. PMML V2.0 specification

A document that is well-formed with respect to the XML specification is not necessarily PMML compliant: the document must also conform to the specification provided by the Data Mining Group. This specification is essentially provided in English, because an XML DTD or schema can only express the way data should be structured to form a valid PMML document. So the biggest difficulty in this project is to understand and translate the specification into C++ code. Moreover, the task is made harder by the fact that there are, at this time, no PMML V2.0 files, since the standard is not yet supported. In this section we will only discuss the parts of the specification that have been studied and implemented. Indeed, PMML defines too many data mining models for all of them to be implemented within the deadline.

II.1. The header
The header of a PMML document contains only background information about the application that generated the document. It contains a field to describe the copyright of the model, the name and version of the application that produced the model, the date of creation and some additional annotations about the model. The header is not essential to applying a model, but it is worth storing the information in case a user wants some background on the model.
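As a rough illustration, a header carrying the fields described above might look like the following fragment (the element names follow the PMML 2.0 DTD; the copyright, application name and dates are invented for the example):

```xml
<Header copyright="Copyright (c) 2002 ExampleCorp">
  <Application name="Example Miner" version="1.0"/>
  <Annotation>Model trained on the 2001 customer dataset.</Annotation>
  <Timestamp>2002-08-31 12:00:00</Timestamp>
</Header>
```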

II.2. The Data Dictionary
The data dictionary is a field where, at the least, all the variables required to apply the model must be defined. That means the name of each variable must be given together with the values that the variable can take. PMML defines three kinds of variables:
- Continuous variables: for those variables, either an interval of valid values or an enumeration of (valid or invalid) values must be defined.
- Ordinal variables: in PMML, an ordinal variable can take any value in an ordered set of nominal values; this set of values must be defined.
- Categorical variables: those variables can take any of the defined nominal values.
The data dictionary is used to check whether a value taken by a variable, coming from the external environment, is valid or invalid.

II.3. The Transformation Dictionary
The transformation dictionary allows the definition of internal variables constructed by mathematical transformations of external variables (those defined in the data dictionary). There are seven kinds of transformations (detailed below). This field was introduced in PMML 2.0, allowing some of the data preprocessing required to apply a model to be captured. It is important to note that the output of a transformation can also be the input of another transformation.


[Schema: transformation data flows — an external variable is checked against the valid values of the DataDictionary, then feeds a transformation that produces an internal variable; the output of one transformation can in turn feed another transformation.]

3.1. Constant
The constant transformation is the simplest one: it defines a variable that always takes the same defined value, a constant.

3.2. FieldRef
The "FieldRef" transformation is simply used to rename a variable: the new variable is only a reference to the input variable. It was introduced to allow inline transformations (transformations defined outside the transformation dictionary, usually inside a model, inherited from PMML 1.1) to use a transformation already defined in the transformation dictionary. Moreover, it is also used in models that can only be applied to transformed data (such as center-based clustering); a reference to a variable defined in the data dictionary is then used.

3.3. NormContinuous
NormContinuous defines how to normalize a continuous variable using piecewise linear interpolation. A set of pairs of values must be defined; each pair consists of an original value and its normalized value.

[Figure: piecewise linear interpolation — the original values V0, V1, V2, V3 on the X axis are mapped to the normalized values N0, N1, N2, N3 on the Y axis by the set of pairs.]

To normalize a value, we just have to look for the nearest pair below it and the nearest pair above it. For example, to normalize X, we look for the two pairs (V2; N2) and (V3; N3) such that V2 ≤ X ≤ V3, and interpolate linearly between them: N = N2 + (X − V2) × (N3 − N2) / (V3 − V2).