Grammatical Inference of Context-Free Languages from Positive Data
Seminar Paper Report

Author: James Scicluna Supervisor: Dr. John Abela

University of Malta Faculty of ICT Department of Intelligent Systems

September 2010

The research work disclosed in this publication is partially funded by the Strategic Educational Pathways Scholarship (Malta). The scholarship is part-financed by the European Union European Social Fund.

Abstract

In this report, we define the problem of grammatical inference as a concept learning task on formal languages. We analyze the field's positive and negative theoretical results. We discuss the general techniques used to solve this problem in the setting of context-free languages from positive data. Each technique is presented as a method to overcome the limitations on learning imposed by the negative results. We classify the techniques into 4 categories, namely: learning subclasses, heuristic-based learning, Bayesian learning and learning alternative representations. For each of these techniques, we investigate the most relevant work done by several researchers. We then proceed by explaining how grammatical inference algorithms can be empirically evaluated. Finally, we propose the initial parts of a system that infers context-free grammars from positive data.

Preliminary Notation

N denotes the set of natural numbers. R denotes the set of real numbers. P(X) denotes the power set (i.e. the set of all subsets) of X. ε denotes the empty string. Σ denotes a finite alphabet of symbols. Σ∗ denotes the infinite set of all sequences of 0 or more symbols from Σ. Unless explicitly stated otherwise:
• N denotes the finite set of non-terminals (disjoint from Σ) of a context-free grammar.
• P denotes the finite set of production rules of type N × (Σ ∪ N)∗ of a context-free grammar.
• A, B, C denote non-terminals in N.
• a, b, c denote symbols in Σ.
• α, β denote sequences of terminals and non-terminals in (Σ ∪ N)∗.
The terms training data, training examples and training set are used interchangeably. The terms positive data and positive examples are used interchangeably.


Contents

1 Introduction

2 Preliminaries
  2.1 Formal Languages
    2.1.1 Phrase Structure Grammars
    2.1.2 Language Classes
    2.1.3 Context-Free Languages
  2.2 Concept Learning
    2.2.1 Assumptions in Concept Learning
    2.2.2 Information Given
    2.2.3 Learning Models
    2.2.4 VC-Dimension
  2.3 Bibliographical Notes

3 Grammatical Inference
  3.1 Hardness Results
    3.1.1 Identification in the Limit
    3.1.2 PAC-learning
  3.2 Positive Results

4 Grammatical Inference of CFLs from Positive Data
  4.1 Learning Subclasses
    4.1.1 Even Linear CFLs
    4.1.2 Simple CFLs
    4.1.3 Substitutable CFLs
    4.1.4 Non-Terminally Separated CFLs
  4.2 Heuristic-Based Techniques
    4.2.1 Alignment Based Learning
    4.2.2 EMILE
    4.2.3 Clark's Omphalos Algorithm
    4.2.4 The Minimum Description Length principle
    4.2.5 Lattice Exploration
  4.3 Bayesian Learning
    4.3.1 Inside-Outside Algorithm
    4.3.2 Bayesian Model Merging
  4.4 Alternative Representations
    4.4.1 String Rewriting Systems
    4.4.2 Pure Grammars
    4.4.3 Lindenmayer Systems
  4.5 Bibliographical Notes

5 Empirical Evaluation of GI Algorithms
  5.1 Evaluation Techniques
    5.1.1 Grammar Equality
    5.1.2 Comparison of Derivation Trees
    5.1.3 Classification of Unseen Examples
    5.1.4 Precision and Recall
  5.2 The Omphalos Competition

6 Proposal
  6.1 Phrasal/Terminal Normal Form (PTNF)
  6.2 Cycle Separation
  6.3 Nonterminal Hierarchy in the Cycle-Free Part
  6.4 Reversible Cycle-Free Part
  6.5 Annotated Substring Lattice
  6.6 Search for the Most Specific Hypothesis

Chapter 1 Introduction

Many real world objects can be represented as strings over some alphabet. For example, DNA sequences are strings over the 4-letter alphabet {'A','C','G','T'}. Natural language sentences are strings over a lexicon. Musical compositions are strings over the alphabet of possible musical notes. Speech is made up of strings over the alphabet of phonemes. Sometimes, we do not know how to describe some category of such strings, but we do know some examples in that category. For example, let us take the case of DNA sequencing. We might not know the exact description of the class of DNA sequences which are common to people with a certain illness. However, we will have examples of DNA sequences of these people. The same applies to natural language. We do not have a general description of what a grammatically correct sentence is, but we have many examples of grammatically correct sentences. The task of grammatical inference is to learn (or infer) the categories from the examples. In this report, we will focus on strings in general, which can represent many different real world objects. Categories of strings are known as formal languages. We focus on the inference of a particular class of formal languages, known as context-free languages. We investigate techniques which infer these languages from the minimum possible information, i.e. just examples from the target category (known as positive data). In fact, much real world data consists only of positive examples. This report is made up of 6 chapters (including this introductory chapter). In chapter 2, we formally describe what formal languages are and in particular focus on context-free languages. We also describe what it means for a concept to be learned and introduce the terminology used in the field of machine learning. In chapter 3, we define grammatical inference as a concept learning task on formal languages. We investigate both the positive and negative results obtained in this field. In chapter 4, we introduce the techniques used in inferring context-free languages from positive data. We group these techniques into 4 categories:

1. Correct inference of non-super-finite subclasses of context-free languages.
2. Approximate solutions through the use of heuristics (without giving proof of correctness).
3. Bayesian learning (i.e. learning using probabilities).
4. Correct identification of classes of languages (which include some context-free languages) using alternative language representations (i.e. not context-free grammars).

In chapter 5, we describe several ways to evaluate a grammatical inference algorithm. We also introduce the Omphalos Context-Free Learning Competition, which is the only large-scale competition-type benchmark for evaluating state-of-the-art GI algorithms on context-free languages. We conclude, in chapter 6, by proposing the initial phases of a heuristic-based approach for identifying relatively large and non-trivial languages (i.e. Omphalos-type languages).


Chapter 2 Preliminaries

In this section, we will give a very brief overview of the two principal areas from which grammatical inference evolved: Formal Languages and Concept Learning. For further reading on the subject of formal languages, we recommend M. Sipser's book Introduction to the Theory of Computation [Sip05] and Hopcroft and Ullman's Introduction to Automata Theory, Languages, and Computation [HU79]. For concept learning, we recommend T. Mitchell's book Machine Learning [Mit97].

2.1 Formal Languages

A formal language is simply a set of strings. More formally:

Definition 2.1.1 A formal language L over a finite alphabet Σ is a subset of Σ∗, where Σ∗ is the free monoid.

A representation for some formal language L is a function which, given any string w ∈ Σ∗, returns whether w ∈ L is true or not. More formally:

Definition 2.1.2 A representation of a formal language L, over a finite alphabet Σ, is a predicate πL : Σ∗ → {True, False}, s.t. πL(w) = True if w ∈ L, and False otherwise.

Any language representation πL, apart from deciding whether a string is in L or not, can also generate L.

Definition 2.1.3 A language representation πL generates L as follows:

L = {w ∈ Σ∗ | πL(w)}

It can be shown that there are languages which do not have a (finite) language representation. This is because there are uncountably many languages (as many as there are real numbers) but only countably many finite language representations (as many as there are natural numbers), so some languages are necessarily left without a representation. Proof of this is given in [Sip05]. Languages which have a language representation are known as computable languages; the rest are known as uncomputable languages. It has been shown that a particular class of language representations known as phrase structure grammars (a.k.a. unrestricted grammars) is able to represent any computable language (see [HU79] for proof). Thus, we can characterize the set of computable languages using phrase structure grammars as shown in figure 2.1.

Figure 2.1: The set of all languages P(Σ∗), split into computable and uncomputable languages, vs. the set of phrase structure grammars (PSGs)

2.1.1 Phrase Structure Grammars

Definition 2.1.4 The syntax of a phrase structure grammar is a 4-tuple G = <N, Σ, P, S>, where N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols with N ∩ Σ = ∅, P is a finite set of production rules of type (Σ ∪ N)+ × (Σ ∪ N)∗, and S ∈ N is referred to as the start symbol.


Note that we shall abbreviate the term phrase structure grammar to the term grammar in this section.

Definition 2.1.5 The derivational binary relation over a grammar G = <N, Σ, P, S>, denoted as ⇒G ⊆ (Σ ∪ N)∗ × (Σ ∪ N)∗, is defined as follows: a ⇒G b ⇔ ∃u, v, x, y ∈ (Σ ∪ N)∗ : (a = uxv) ∧ (b = uyv) ∧ (x → y ∈ P). The reflexive transitive closure of ⇒G is denoted as ⇒∗G. This means that, for X ⇒∗G Y, X can derive Y using zero or more production rules from grammar G.

Definition 2.1.6 A grammar G = <N, Σ, P, S> for a language L decides that a string w is in L iff S ⇒∗G w is true.

Definition 2.1.7 For a grammar G = <N, Σ, P, S>, a sequence D of length n of elements in (N ∪ Σ)∗ is said to be a derivation of G iff D[1] = S, D[n] ∈ Σ∗ and for each i ∈ N, i < n, D[i] ∈ (N ∪ Σ)∗ and D[i] ⇒G D[i + 1].

Definition 2.1.8 w ∈ (Σ ∪ N)∗ is said to be a sentential form of a grammar G = <N, Σ, P, S> if w ∈ {u ∈ (Σ ∪ N)∗ | S ⇒∗G u}.

Definition 2.1.9 A sentential form w is a sentence if w ∈ Σ∗.

Definition 2.1.10 The language generated by a grammar G = <N, Σ, P, S>, L(G), is the set of all sentences derived from G, i.e. L(G) = {w ∈ Σ∗ | S ⇒∗G w}.

Some grammars have the property of being ambiguous.

Definition 2.1.11 The leftmost derivational binary relation over a grammar G = <N, Σ, P, S>, denoted as ⇒lG ⊆ (Σ ∪ N)∗ × (Σ ∪ N)∗, is defined as follows: a ⇒lG b ⇔ ∃u, v, x, y ∈ (Σ ∪ N)∗ : (a = uxv) ∧ (b = uyv) ∧ (x → y ∈ P), and for all substrings us of u, there is no production rule us → α ∈ P for any α.

Definition 2.1.12 For a grammar G = <N, Σ, P, S>, a sequence D of length n of elements in (N ∪ Σ)∗ is said to be a leftmost derivation of G iff D[1] = S, D[n] ∈ Σ∗ and for each i ∈ N, i < n, D[i] ∈ (N ∪ Σ)∗ and D[i] ⇒lG D[i + 1].

Definition 2.1.13 A grammar G is said to be ambiguous iff some sentence in L(G) has 2 or more different leftmost derivations.

Definition 2.1.14 A language L is said to be inherently ambiguous iff for all grammars G s.t. L(G) = L, G is ambiguous.
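To make Definition 2.1.5 concrete, the following minimal Python sketch (our own illustration; the encoding of a grammar as single-character symbols, uppercase for nonterminals, is an assumption and not part of the text) enumerates all sentential forms reachable from a given one in a single derivation step.

```python
# One-step derivation relation a =>_G b (Definition 2.1.5): b is obtained from a
# by replacing one occurrence of a production's left-hand side x with its
# right-hand side y. Symbols are single characters (uppercase = nonterminal).

def derive_one_step(sentential_form, productions):
    """Return the set of all b such that sentential_form =>_G b."""
    results = set()
    for x, y in productions:                        # a production x -> y
        start = 0
        while True:
            pos = sentential_form.find(x, start)    # a = u x v
            if pos == -1:
                break
            u, v = sentential_form[:pos], sentential_form[pos + len(x):]
            results.add(u + y + v)                  # b = u y v
            start = pos + 1
    return results

# Illustrative grammar for {a^n b^n | n >= 1}: S -> aSb | ab.
P = [("S", "aSb"), ("S", "ab")]
print(derive_one_step("S", P))      # {'aSb', 'ab'}
print(derive_one_step("aSb", P))    # {'aaSbb', 'aabb'}
```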

2.1.2 Language Classes

A language class is a set of languages. P(Σ∗) is the infinite set of all languages over an alphabet Σ. Thus any language class C over an alphabet Σ is a subset of (or equal to) P(Σ∗). Therefore, any language class can be formally defined as a restriction on P(Σ∗):

Definition 2.1.15 A language class C is a restriction on P(Σ∗) through some predicate π s.t. C = {L ∈ P(Σ∗) | π(L)}.

Language classes can be defined through restrictions imposed by some set of language representations:

Definition 2.1.16 A language class C can be defined using a (possibly infinite) set of language representations {πL1, πL2, . . .} as follows: C = {L ∈ P(Σ∗) | L ∈ {L1, L2, . . .}}

In [Cho56], Chomsky defined a hierarchy of 4 language classes (on computable languages) using 4 different language representation classes. Each language class is a proper subset of the next one up the hierarchy. The largest class contains all computable languages and is characterized by the class of unrestricted grammars (i.e. phrase structure grammars). The classes are defined as follows:

Figure 2.2: The Chomsky Hierarchy (Computable ⊃ Context-Sensitive ⊃ Context-Free ⊃ Regular)

• Computable languages (Type-0): All languages representable using unrestricted grammars.
• Context-sensitive languages (Type-1): All languages representable using context-sensitive grammars.
• Context-free languages (Type-2): All languages representable using context-free grammars.
• Regular languages (Type-3): All languages representable using regular grammars.

Note that context-sensitive, context-free and regular grammars are all restricted forms of unrestricted grammars. Chomsky's containment hierarchy of languages was, and still is, accepted as the 'standard' way of classifying computable languages. Many works on formal languages define languages as elements of one of these classes or of their subsets. Since the majority of the work in the field of grammatical inference is on Chomsky's language classes, many of the interesting properties and results in GI are related to this hierarchy. We now focus on the language class of interest to us, i.e. the class of context-free languages.

2.1.3 Context-Free Languages

The class of context-free languages is defined as the set of all languages representable using a context-free grammar.

Definition 2.1.17 A phrase structure grammar G = <N, Σ, P, S> is said to be context-free if ∀p : P · p ∈ N × (N ∪ Σ)∗.

2.1.3.1 Normal Forms

Context-free grammars are equivalent to grammars with restrictions on the forms of production rules. Two forms of restrictions on CFGs are the Chomsky Normal Form (CNF) and the Greibach Normal Form (GNF).

Definition 2.1.18 A CFG is said to be in Chomsky Normal Form if the production rules are of the form A → BC, A → a or A → ε, where A, B, C ∈ N, a ∈ Σ and ε is the empty string.

Definition 2.1.19 A CFG is said to be in Greibach Normal Form if the production rules are of the form A → aα or A → ε, where A ∈ N, a ∈ Σ, α ∈ N∗ and ε is the empty string.

Theorem 2.1.1 Every context-free language can be generated by a grammar in Chomsky Normal Form.

Theorem 2.1.2 Every context-free language can be generated by a grammar in Greibach Normal Form.

Theorem 2.1.3 Every ε-free CFL L (i.e. ε ∉ L) can be generated by a CFG in both Chomsky and Greibach Normal Forms without production rules of the form A → ε.

Proofs for Theorems 2.1.1, 2.1.2 and 2.1.3 are given in [HU79].

2.1.3.2 Closure Properties

Definition 2.1.20 A class of languages C is said to be closed under an n-ary operation ω ⇔ ∀ L0, L1, . . . , Ln−1 : C · ∃ L : C · L = ω(L0, L1, . . . , Ln−1).

Table 2.1 lists and defines a number of language operations, and states whether CFLs are closed under each operation. Proofs of these results are given in [HU79] and [Sip05].

Operation                          Definition                               Closed
Union (L0 ∪ L1)                    {w | w ∈ L0 ∨ w ∈ L1}                    Yes
Intersection (L0 ∩ L1)             {w | w ∈ L0 ∧ w ∈ L1}                    No
Complement (¬L0)                   {w | w ∉ L0}                             No
Concatenation (L0 · L1)            {w · x | w ∈ L0 ∧ x ∈ L1}                Yes
Kleene Star (L0∗)                  {ε} ∪ {w · x | w ∈ L0 ∧ x ∈ L0∗}         Yes
Reverse (L0^R)                     {w^R | w ∈ L0}                           Yes
Homomorphism (h(L0))               See Definition 2.1.21                    Yes
Inverse Homomorphism (h−1(L0))     See Definition 2.1.22                    Yes
Substitution (σ(L0))               See Definition 2.1.23                    Yes

Table 2.1: Context-Free Language Closures

Definition 2.1.21 A homomorphism is a function h : Σ → ∆∗ substituting symbols in Σ by strings in ∆∗. This can be extended to h : Σ∗ → ∆∗ by: h(ε) = ε and, for all u = av with a ∈ Σ and u, v ∈ Σ∗, h(av) = h(a) · h(v), where · is the string concatenation operator. h can be extended to operate on languages: for L ⊆ Σ∗, h(L) = {h(w) | w ∈ L} [Gro].

Definition 2.1.22 An inverse homomorphism is a function h−1 : ∆∗ → Σ substituting strings in ∆∗ by symbols in Σ. This can be extended to h−1 : ∆∗ → Σ∗ by: h−1(ε) = ε and, for all x = x0x1x2 . . . xn where x0, x1, x2, . . ., xn are elements of ∆∗, h−1(x) = h−1(x0) · h−1(x1) · h−1(x2) · . . . · h−1(xn). h−1 can be extended to operate on languages: for L ⊆ ∆∗, h−1(L) = {h−1(x) | x ∈ L} [Gro].

Definition 2.1.23 A substitution is a function σ : Σ → P(∆∗) substituting each symbol in Σ by a (possibly infinite) set of strings in ∆∗. This can be extended to σ : Σ∗ → P(∆∗) by: σ(ε) = {ε} and, for all u = av with a ∈ Σ and u, v ∈ Σ∗, σ(av) = σ(a) · σ(v). σ can be extended to operate on languages: for L ⊆ Σ∗, σ(L) = {σ(w) | w ∈ L} [Gro].
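As a small illustration of Definition 2.1.21, the following Python sketch (the alphabets and the particular mapping h are invented for illustration) extends a symbol-level homomorphism to strings and to a finite language.

```python
# A homomorphism h: Sigma -> Delta* extended to strings and to a finite language
# (Definition 2.1.21). The mapping below is an arbitrary illustrative example.

h = {"a": "01", "b": "1", "c": ""}        # h(a) = 01, h(b) = 1, h(c) = epsilon

def h_string(w):
    """h(epsilon) = epsilon and h(a v) = h(a) . h(v)."""
    return "".join(h[symbol] for symbol in w)

def h_language(L):
    """h(L) = { h(w) | w in L } (enumerable here because L is finite)."""
    return {h_string(w) for w in L}

print(h_string("abc"))                    # '011'
print(h_language({"ab", "ca", "bbb"}))    # {'011', '01', '111'}
```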

2.1.3.3 Other Properties

The following are decidable properties of context-free languages, for any context-free grammar A:
• Emptiness: is L(A) = ∅?
• Finiteness: is the context-free language L(A) finite?
• The membership problem: for any string w, is w ∈ L(A)? (A sketch of the standard decision procedure is given below.)

These are undecidable properties of context-free languages, for any context-free grammars A and B:
• The equivalence problem: is L(A) = L(B)?
• Empty intersection: is L(A) ∩ L(B) = ∅?
• Universality: is L(A) = Σ∗?
• The subset problem: is L(A) ⊆ L(B)?

The following are some other interesting properties:
• The intersection of a context-free language with a regular language is always context-free.
• The pumping lemma [Sip05] can be used to show that a language is not context-free (it cannot be used to show that a language is context-free).
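The decidability of the membership problem is usually demonstrated with the CYK algorithm for grammars in Chomsky Normal Form (Definition 2.1.18). The sketch below is a standard textbook implementation rather than anything specific to this report; the example grammar for {a^n b^n | n ≥ 1} is our own choice, and ε-membership is ignored for brevity.

```python
# CYK membership test for a CFG in Chomsky Normal Form (rules A -> BC or A -> a).
# Decides "is w in L(G)?" in O(|w|^3 * |P|) time; the epsilon case is omitted.

def cyk(w, binary_rules, terminal_rules, start="S"):
    n = len(w)
    if n == 0:
        return False                       # epsilon handled separately in general
    # table[i][j] = set of nonterminals deriving the substring w[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, a in enumerate(w):
        table[i][0] = {A for (A, t) in terminal_rules if t == a}
    for length in range(2, n + 1):                     # substring length
        for i in range(n - length + 1):                # start position
            for split in range(1, length):             # split point
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for (A, B, C) in binary_rules:
                    if B in left and C in right:
                        table[i][length - 1].add(A)
    return start in table[0][n - 1]

# Example CNF grammar for {a^n b^n | n >= 1}: S -> AT | AB, T -> SB, A -> a, B -> b.
binary = [("S", "A", "T"), ("S", "A", "B"), ("T", "S", "B")]
terminal = [("A", "a"), ("B", "b")]
print(cyk("aaabbb", binary, terminal))   # True
print(cyk("aabbb", binary, terminal))    # False
```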

2.2 Concept Learning

Informally, concept learning is a learning task in which an agent is trained to infer a class of objects C (by learning a predicate c that decides whether any object is in C) by being shown a finite set of objects from C (and possibly a finite set of objects not in C). Let U be the universal set and let UX ⊆ U be the universe of discourse containing all objects of type X. Elements x ∈ UX are known as instances. A category C is a subset of UX (i.e. a set of elements of type X). Elements in C are known as instances of C. The predicate that decides whether an instance x is in C is known as the concept of C, c, where C = {x ∈ UX | c(x)}. So, we can formally describe concept learning as follows:


Definition 2.2.1 Concept learning is the task of inferring a concept c, known as the target concept (which describes the target category C), from a given finite set of pairs of type (UX × {True, False}) known as the training set, where for each (x, b) in the training set, b = c(x).

We will denote the training set of a target concept c as Bc. In concept learning (where the target concept is c), for all instances x ∈ UX, if (x, True) is in Bc, then x is known to be a member of the target category. The set of all known members in Bc is called the positive data. If (x, False) is in Bc, then x is known to be a nonmember of the target category. The set of all known nonmembers in a training set is called the negative data. If neither (x, True) nor (x, False) is in the training set, then x is labeled as unknown. A candidate concept h for describing the target category is said to be a hypothesis. The set of all hypotheses is called the hypothesis space.

Definition 2.2.2 A hypothesis h is said to be consistent with a training set Bc, Consistent(h, Bc), iff ∀(x, c(x)) : Bc · h(x) = c(x).

Figure 2.3: The universe of discourse, with instances, categories defined by concepts, and a consistent hypothesis for the training set {(A, true), (J, true), (F, false)}

Definition 2.2.3 The version space VSH,Bc with respect to hypothesis space H and training set Bc is the subset of hypotheses from H consistent with Bc, i.e. VSH,Bc = {h ∈ H | Consistent(h, Bc)}.

Definition 2.2.4 A hypothesis h0 is said to be more general than or equal to a hypothesis h1, denoted as h0 ≥g h1, iff ∀x : UX · h1(x) is true ⇒ h0(x) is true.

The ≥g relation over a hypothesis space H creates a partial order. This is because ≥g is reflexive (h0 ≥g h0 is true), antisymmetric (h0 ≥g h1 ∧ h1 ≥g h0 ⇒ h0 = h1) and transitive (h0 ≥g h1 ∧ h1 ≥g h2 ⇒ h0 ≥g h2 is true).


Figure 2.4: Part of the General-to-Specific Ordering of the Hypotheses in Figure 2.3

Thus, the version space can be represented as a partial order (as shown in figure 2.4) bounded by the sets of most general and most specific hypotheses.

Definition 2.2.5 The set of most general hypotheses with respect to a version space VSH,Bc is {h | h ∈ VSH,Bc, ¬∃h0 : VSH,Bc · h0 ≥g h}.

Definition 2.2.6 The set of most specific hypotheses with respect to a version space VSH,Bc is {h | h ∈ VSH,Bc, ¬∃h0 : VSH,Bc · h ≥g h0}.
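Over a finite universe these definitions can be checked by brute force. The following Python sketch (the universe, hypothesis space and training set are invented for illustration; hypotheses are represented extensionally as sets, so h0 ≥g h1 becomes h1 ⊆ h0) computes a version space together with its most general and most specific members.

```python
# Brute-force version space over a tiny finite universe. Hypotheses are plain
# sets of instances, so h0 >=_g h1 holds exactly when h1 is a subset of h0.
from itertools import combinations

universe = {"A", "B", "C", "D", "E"}
# Hypothesis space H: all 2^5 = 32 subsets of the universe.
H = [set(c) for r in range(len(universe) + 1)
            for c in combinations(sorted(universe), r)]

training_set = [("A", True), ("C", True), ("E", False)]    # pairs (x, c(x))

def consistent(h, data):
    """Consistent(h, B_c): h(x) = c(x) for every (x, c(x)) in the training set."""
    return all((x in h) == label for x, label in data)

def more_general(h0, h1):          # h0 >=_g h1
    return h1 <= h0

version_space = [h for h in H if consistent(h, training_set)]
most_general = [h for h in version_space
                if not any(more_general(g, h) and g != h for g in version_space)]
most_specific = [h for h in version_space
                 if not any(more_general(h, s) and s != h for s in version_space)]

print(len(version_space))   # 4 consistent hypotheses
print(most_general)         # [{'A', 'B', 'C', 'D'}]  (element order may vary)
print(most_specific)        # [{'A', 'C'}]
```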

2.2.1 Assumptions in Concept Learning

Without making any assumptions, concept learning can at best guarantee that the hypothesis found is consistent with the target concept over the training set. Thus, the learner is unable to classify any unseen example. Therefore, a fundamental assumption is made in concept learning in order to address this limitation. This is known as the inductive learning hypothesis.

Definition 2.2.7 The inductive learning hypothesis: Any hypothesis found to approximate the target concept well over a sufficiently large set of training examples will also approximate the target concept well over other unobserved examples. [Mit97]

With the inductive learning hypothesis, the learner is able to classify unseen examples. The transition the learner makes from being able to classify training examples only to dealing with any example given is known as the inductive leap. Concept learning is essentially a search through the version space for the hypothesis that correctly describes the target concept. When the version space is infinite (as it normally is for non-trivial problems), there will always be an infinite number of candidate hypotheses consistent with the training set, irrespective of how large the training set is. This is because for any finite subset T (training set) of elements in an infinite set U (universe of discourse), there is always an infinite number of sets (hypotheses) which have T as one of their subsets. So, to make it possible for the learner to select a hypothesis, it must make some assumptions on the hypothesis space. These assumptions limit the number of candidate hypotheses to a finite one.

Definition 2.2.8 The inductive bias of a learning algorithm L with an infinite hypothesis space H is any set of assumptions that reduces H to a finite set.

Alternatively, the inductive bias can be defined as a total ordering relation over the hypothesis space, where the first hypothesis is chosen as the one which best describes the target concept. A well-known example of such a bias is Occam's Razor.

Definition 2.2.9 Occam's Razor is an inductive bias which orders the hypothesis space according to a 'simplicity' relation. One way of defining the simplicity of a hypothesis h is by its length |h| under some encoding E; the smaller |h| is under E, the simpler the hypothesis is.

2.2.2 Information Given

Apart from the training set, learning algorithms can be given other types of information on the target concept.

Definition 2.2.10 A concept learning algorithm is said to be unsupervised if the training set is the only given information on the target concept.

Definition 2.2.11 A concept learning algorithm is said to be supervised if more information is given on the target concept apart from the training set.

Definition 2.2.12 An oracle for a target concept c is an entity which has full information about c and can give this information through queries made by the learner.

An oracle is defined by the types of queries it can answer. The more types of queries an oracle can answer, the more resourceful it is for the learner. There are several types of queries from which the learner can gain different types of information.

Definition 2.2.13 A membership query on a target concept c is a function which, given an instance x (whose label is unknown to the learner), returns c(x).

Definition 2.2.14 A weak equivalence query on a target concept c is a function which, given a hypothesis h, returns true if h is equal to c, false otherwise.

Definition 2.2.15 A strong equivalence query on a target concept c is a function which, given a hypothesis h, returns true if h is equal to c, and returns a counterexample instance x otherwise, where (c(x) is true ∧ h(x) is false) ∨ (c(x) is false ∧ h(x) is true).

Definition 2.2.16 A subset query on a target concept c is a function which, given a hypothesis h, returns true if c is more general than or equal to h, and returns an instance x otherwise, where h(x) is true ∧ c(x) is false.

2.2.3 Learning Models

When is a concept successfully learned? Does a learning algorithm LA learn a target concept c when a hypothesis h is found such that for all instances x, h(x) = c(x)? Is a concept c learned if h(x) = c(x) is true with a high probability p? Is c learned if, with a high probability q, h(x) = c(x) is true with a high probability p? Clearly, the meaning of learning depends on the particular framework, or learning model, we have in mind [Mit97]. So, for every learning algorithm, we need to formally define a learning model which defines what it means for a concept to be learned. Three popular learning models used in concept learning are Identification in the Limit [Gol67], Probably Approximately Correct (PAC)-learning [Val84] and Query Learning [Ang88].

2.2.3.1 Identification in the Limit

Definition 2.2.17 A positive presentation of a concept c, Ic, is a (possibly infinite) sequence of instances containing every element of the set {x ∈ UX | c(x) is true}. In other words, it is a sequence of all the instances of the category C described by concept c.


Definition 2.2.18 Let C be a set of concepts and let Ic be the set of all possible positive presentations of a concept c ∈ C. C can be identified in the limit iff there exists a concept learning algorithm LA such that for any concept c ∈ C and for any sequence I ∈ Ic, there is some n such that LA infers c when given In as positive data, where In is the finite sequence consisting of the first n instances in sequence I.

Alternatively, identification in the limit can be thought of in terms of a learning process where the learning algorithm (LA) is presented with a sequence of instances (x1, x2, x3, . . .) from the target concept. Instances are presented one at a time. Each time an instance is presented, the learning algorithm proposes a hypothesis to describe the target concept. LA is said to identify in the limit a class of concepts C if, for each concept c ∈ C and for each presentation I of instances in c, LA produces a hypothesis hi = c after presentation of i instances and keeps producing the same hypothesis for all subsequent instances. Figure 2.5 illustrates this process.


Figure 2.5: Identification in the Limit

Thus, identification in the limit assumes that the learner:
1. has to identify the target concept exactly;
2. receives a training set with positive examples only;
3. has access to an arbitrarily large number of examples; and
4. is not limited by any consideration of computational complexity. [NKN02]
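As a concrete (and deliberately simple) illustration of this protocol, the sketch below presents positive examples one at a time to the conservative learner that conjectures exactly the set of strings seen so far. This learner identifies any finite language in the limit, which does not conflict with the negative results of chapter 3, since the class of finite languages is not super-finite. The target language and the presentation are invented for illustration.

```python
# The identification-in-the-limit protocol, illustrated with the class of
# finite languages and the conservative learner "conjecture exactly the set of
# strings seen so far". Target and presentation are invented for illustration.

def learner(examples_seen):
    return frozenset(examples_seen)        # hypothesis = finite set of strings

target = frozenset({"ab", "aabb", "aaabbb"})            # a finite target language
presentation = ["ab", "aabb", "ab", "aaabbb", "aabb", "ab", "aaabbb", "ab"]

seen, hypotheses = [], []
for x in presentation:                     # instances are presented one at a time
    seen.append(x)
    hypotheses.append(learner(seen))       # a new hypothesis after every instance

# Once every element of the target has appeared, the hypothesis equals the
# target and never changes again: the learner has converged "in the limit".
convergence_point = next(i for i, h in enumerate(hypotheses) if h == target)
print(convergence_point)                                         # 3
print(all(h == target for h in hypotheses[convergence_point:]))  # True
```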

PAC-learning is another learning model which relaxes the restrictive assumption (1) and the unrestrictive assumptions (3) and (4).

2.2.3.2 PAC-learning

Definition 2.2.19 Consider a set of concepts C defined over a set of instances X of length n and a learner L using hypothesis space H. C is PAC-learnable by L using H if, for all concepts c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n and size(c). [Mit97]

Where errorD(h) is defined as follows:

Definition 2.2.20 The true error (denoted errorD(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D. [Mit97]

errorD(h) ≡ Pr x∈D [c(x) ≠ h(x)]

Thus, in PAC-learning, a learning algorithm is said to learn a concept if it most of the time (probably) finds a concept which is nearly equal (approximately) to the target concept, in polynomial time.
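Since errorD(h) is defined with respect to the distribution D, in practice it is estimated by sampling. A minimal Monte Carlo sketch follows; the target concept, hypothesis and distribution are all toy choices of ours.

```python
# Monte Carlo estimate of error_D(h) = Pr_{x in D}[c(x) != h(x)] for a toy
# target concept, hypothesis and distribution over strings (all illustrative).
import random

random.seed(0)

def c(w):                  # target concept: equally many a's and b's
    return w.count("a") == w.count("b")

def h(w):                  # hypothesis: strings of even length (a rough guess)
    return len(w) % 2 == 0

def sample_from_D():       # D: uniform length 0..6, symbols uniform over {a, b}
    n = random.randint(0, 6)
    return "".join(random.choice("ab") for _ in range(n))

samples = [sample_from_D() for _ in range(100000)]
estimate = sum(c(w) != h(w) for w in samples) / len(samples)
print(round(estimate, 3))  # approximately 0.26 for these toy choices
```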

2.2.3.3 Query Learning

Definition 2.2.21 A concept class C is said to be indexable if there is an enumeration (ci)i≥1 of all the concepts in C such that membership is uniformly decidable, i.e. there exists a computable function that, for any instance x in the instance space and for any i ≥ 1, returns 1 if ci(x) is true, and 0 otherwise. [TK07]

Definition 2.2.22 A learning algorithm A learns an indexable concept class C under the query learning model if, for all concepts c ∈ C, A is able to exactly guess c after using a finite number of queries.

In section 2.2.2, we list several types of queries which can be used in the query learning model. Note that since the membership problem for context-free languages is decidable, the class of context-free languages is indexable. Thus, the query learning model can be applied to the class of context-free languages.


2.2.4 VC-Dimension

A natural question to ask in concept learning is "How many training examples are required to learn any target concept c from hypothesis space H under learning model M?". The answer to this question clearly depends on the complexity of the hypothesis space H. Vapnik and Chervonenkis in [VC71] describe a measure, known as the VC-dimension, for the complexity of a hypothesis space. This measure can be used with different learning models to answer our question. To define the VC-dimension, we must first define what it means for a set of instances to be shattered by a hypothesis space.

Definition 2.2.23 A set of instances S is shattered by hypothesis space H iff for every dichotomy of S, there exists some hypothesis in H consistent with this dichotomy. A dichotomy of a set S is a partition of S into two disjoint subsets. [Mit97]

So, for example, a set of 3 instances is shattered by the hypothesis space shown in figure 2.6 (assuming that the hypothesis space consists of a set of circles). This is because each of the 2³ = 8 dichotomies is covered by some hypothesis.


Figure 2.6: A set of 3 instances shattered by the hypothesis space

However, for our example, a set of 4 instances is not shattered. This is because there exists a dichotomy (as shown in figure 2.7) which cannot be represented by the hypothesis space.


Figure 2.7: A set of 4 instances not shattered because of this dichotomy


When an unbiased hypothesis space H (i.e. one without an inductive bias) is able to shatter an instance space X, it means that H is capable of representing every possible concept (dichotomy) definable over X [Mit97]. Intuitively, the larger the subset of X that can be shattered, the more expressive H is [Mit97]. The VC dimension of H is precisely this measure [Mit97].

Definition 2.2.24 The Vapnik-Chervonenkis dimension (VC-dimension), VC(H), of a hypothesis space H defined over an instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) = ∞. [Mit97]
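Shattering can be checked mechanically for small instance sets. The sketch below uses closed intervals on the real line as the hypothesis space (a deliberately simple stand-in for the circles of figures 2.6 and 2.7, chosen because the tightest consistent interval is easy to compute); it confirms that any 2 points are shattered but 3 points are not, i.e. the VC-dimension of intervals is 2.

```python
# Check whether a finite set of points on the real line is shattered by the
# hypothesis space of closed intervals [l, u] (an illustrative hypothesis space
# whose VC-dimension is 2). For each dichotomy it suffices to test the tightest
# interval spanning the positive points: if that one fails, every interval does.
from itertools import product

def shattered_by_intervals(points):
    for labels in product([True, False], repeat=len(points)):   # every dichotomy
        positives = [p for p, pos in zip(points, labels) if pos]
        if not positives:
            continue          # an interval away from all points realizes this one
        l, u = min(positives), max(positives)
        if not all((l <= p <= u) == pos for p, pos in zip(points, labels)):
            return False      # this dichotomy is not realized by any hypothesis
    return True

print(shattered_by_intervals([1.0, 2.0]))        # True: any 2 points are shattered
print(shattered_by_intervals([1.0, 2.0, 3.0]))   # False: (+, -, +) is unrealizable
```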

2.3 Bibliographical Notes

Unless explicitly stated otherwise:
• The definitions in section 2.1 are adapted from either M. Sipser's book Introduction to the Theory of Computation [Sip05] or Hopcroft and Ullman's Introduction to Automata Theory, Languages, and Computation [HU79].
• The definitions in section 2.2 are adapted from T. Mitchell's book Machine Learning [Mit97].


Chapter 3 Grammatical Inference

Definition 3.0.1 Grammatical Inference is a concept learning task where:
Instances are strings s ∈ Σ∗.
Target Category is a formal language L ⊆ Σ∗.
Target Concept is a language representation πL : Σ∗ → {True, False}.
Hypothesis Space is a language class C ⊆ P(Σ∗).
Training Set is a set of pairs (s, πL(s)).

Thus, each grammatical inference problem is characterized by the following features:
The Language class defining the hypothesis space.
The Language representation used to define the target concept.
The Information Given to the learning task.
The Learning Model describing what successful learning means.

This report is concerned with the problem of inferring CFLs from positive data (using grammar-like language representations). Thus, we investigate techniques for which:
The Language class is the class of context-free languages.
The Language representation is a grammar-like structure (we will be mainly concerned with CFGs, but we shall also investigate techniques which use alternative representations similar to CFGs).


The Information Given is only a training set with positive examples.

Before introducing the techniques used in grammatical inference of context-free languages, we must consider the hardness results in the field. The techniques themselves are modeled to 'get around' the learning limitations imposed by the hardness results.

3.1 Hardness Results

Most of the theoretical results in grammatical inference are negative. The most prominent hardness results in grammatical inference are on learning under the identification in the limit learning model and on PAC-learning.

3.1.1 Identification in the Limit

Definition 3.1.1 A class of languages is super-finite if it contains all the finite languages and at least one infinite language. [Gol67]

Corollary 3.1.1 The class of context-free languages is super-finite (this follows from definition 3.1.1).

Theorem 3.1.2 Gold's Theorem: No super-finite class of languages can be identified in the limit from positive examples. [Gol67]

Corollary 3.1.3 The class of context-free languages is not identifiable in the limit (this follows from corollary 3.1.1 and theorem 3.1.2).

Proof We shall prove theorem 3.1.2 by contradiction. Let L be a super-finite language class, and let Linf ∈ L be an infinite language. Assume that there exists a learning algorithm A that identifies L in the limit. We shall construct an infinite sequence of instances I on which A does not converge to the language Linf. This contradicts our assumption since, by definition, any learning algorithm that identifies in the limit a class of languages C should converge to any target language l ∈ C given any infinite positive presentation of l. Let s1 be a string such that there exists a language L1 ∈ L which contains only the string s1, i.e. L1 = {s1}. So, for A to identify L in the limit, it should be able to produce L1 in the limit. Thus, if we keep adding s1 to I, at some point A will return L1. When this happens, we add a string s2 to I, where L2 ∈ L contains both s1 and s2, i.e. L2 = {s1, s2}. The string s2 is then added to I repeatedly until eventually A returns L2. If this procedure continues indefinitely, I will be equal to s1, . . . , s2, . . . , s3, . . ., where the number of repetitions of each si is enough for A to produce Li = {s1, s2, s3, . . . , si}. Note that I is now an infinite sequence of strings from which A can learn the languages L1, L2, L3, etc. Let Linf = {sn | n ∈ N}. So, the sequence I contains all the strings in Linf; however, Linf can never be produced by A. Thus, the language class L is not identifiable in the limit, which contradicts our assumption. ∎

The proof of Gold's result (theorem 3.1.2) is related to the limit point property:

Definition 3.1.2 A class of languages L has a limit point if there exists an infinite sequence of languages in L: L0 ⊂ L1 ⊂ L2 ⊂ L3 ⊂ . . ., and there exists a language Linf ∈ L such that Linf = ⋃n∈N Ln [dlH10].

Theorem 3.1.4 If a language class L admits a limit point, then L is not identifiable in the limit.

Figure 3.1 illustrates a language class with a limit point: L1 ⊂ L2 ⊂ L3 ⊂ . . . ⊂ Li ⊂ . . . is an infinite ascending chain of languages and Linf is the union of all these languages.


Figure 3.1: A language class that admits a limit point
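The diagonal construction in the proof can be simulated against a concrete learner. The sketch below (our own illustration, not part of the report) uses the chain Ln = {a^i | 1 ≤ i ≤ n} with limit point Linf = {a^i | i ≥ 1} and plays the adversary against the learner that conjectures exactly the strings seen so far: the learner is forced to change its hypothesis forever and never conjectures Linf.

```python
# The diagonal construction from the proof, simulated against one concrete
# learner (conjecture exactly the strings seen so far). L_n = {a^i | 1 <= i <= n}
# and L_inf = {a^i | i >= 1} form the ascending chain of Figure 3.1.

def learner(seen):
    return frozenset(seen)                   # conjecture: the finite set seen so far

def L(n):                                    # the language L_n of the chain
    return frozenset("a" * i for i in range(1, n + 1))

presentation, hypotheses = [], []
n = 1
for _ in range(30):                          # a finite prefix of the infinite game
    presentation.append("a" * n)             # keep feeding a string of L_n ...
    hypothesis = learner(presentation)
    hypotheses.append(hypothesis)
    if hypothesis == L(n):                   # ... until the learner outputs L_n,
        n += 1                               # then switch to a string of L_{n+1}

# The learner changes its mind at every step and never conjectures L_inf, even
# though every string of L_inf eventually appears in the presentation.
print(len(set(hypotheses)))                  # 30 distinct hypotheses in 30 steps
print(sum(h1 != h2 for h1, h2 in zip(hypotheses, hypotheses[1:])))   # 29 changes
```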


3.1.2 PAC-learning

Another important hardness result is concerned with learning under the PAC learning model.

Theorem 3.1.5 A super-finite language class has an infinite VC-dimension [BEHW89].

Corollary 3.1.6 The context-free language class has an infinite VC-dimension (this follows from corollary 3.1.1 and theorem 3.1.5).

Proof We shall prove theorem 3.1.5 by showing that, for any finite set of instances (i.e. strings) S and any dichotomy of S, a super-finite language class L always contains a consistent hypothesis (i.e. a language). This implies that VC(L) = ∞. Take any finite set of instances S. For every dichotomy of S, we can build a finite language Lfinite which contains exactly the positive examples (for the case when there are no positive examples, Lfinite = ∅). By definition 3.1.1, a super-finite language class contains all the finite languages. Thus, Lfinite will always be an element of L, since L is a super-finite language class. This implies that any finite set of instances S is shattered by the hypothesis space L, since for every dichotomy there exists a consistent hypothesis Lfinite ∈ L. So, by definition 2.2.24, the VC-dimension of any super-finite language class is infinite, i.e. VC(L) = ∞. ∎

Theorem 3.1.7 A concept class with an infinite VC-dimension is not PAC-learnable [BEHW89].

Corollary 3.1.8 The context-free language class is not PAC-learnable (this follows from corollary 3.1.6 and theorem 3.1.7).

Moreover, it has been shown in [KV94] that if context-free languages are PAC-learnable, then the problem of breaking the RSA code can be solved in polynomial time. This implies that PAC-learning context-free languages is at least as hard as solving cryptographic problems such as RSA.

3.2 Positive Results

Two positive theoretical results on grammatical inference from positive data are described in the following theorems (3.2.1 and 3.2.2):


Definition 3.2.1 A language class L has finite thickness if for every string s there are only finitely many languages in L consistent with s (i.e. each set Ls = {L | L ∈ L, s ∈ L} is finite for any string s) [dlH10].

Theorem 3.2.1 Any computable class of languages with finite thickness is identifiable in the limit from positive data [Ang80].

Definition 3.2.2 A language class L has finite elasticity if for every infinite sequence of strings s0 s1 . . . and for every infinite sequence of languages L0 L1 . . . in L, there is a finite number n s.t. if sn ∉ Ln then Ln is inconsistent with {s0, s1, . . . , sn−1} [dlH10].

Theorem 3.2.2 Any computable class of languages with finite elasticity is identifiable in the limit from positive data [Wri89].

Theorem 3.2.1 can be proved easily. If there are only finitely many candidate languages for each string in a finitely thick language class L, then a learning algorithm for L needs only to traverse the finite space of candidate languages for one string. Note that this space can be exponentially large; however, the identification in the limit learning model puts no constraints on the time complexity of the learning algorithm. K. Wright in [Wri89] showed that if a language class has finite thickness, then it also has finite elasticity. Since finite thickness implies finite elasticity, Theorem 3.2.1 is in fact a special case of Theorem 3.2.2.

Another positive result is that the whole class of context-free languages can be learned in polynomial time under the query learning model. D. Angluin in [Ang87] designed an algorithm that does exactly this by using equivalence and membership queries. However, as we stated earlier, our aim is to learn from positive examples only (without the use of queries). According to [dlH10], most of the positive results obtained in grammatical inference are on learning subclasses of regular languages or on learning regular languages using queries. Regular languages are easier to learn than context-free languages since the former satisfy 3 important properties which the latter do not. These properties are:
1. A canonical normal form.
2. Decidable equivalence.
3. A finite index.


For every regular language R, there is a unique minimal DFA that accepts R [HU79]. On the other hand, context-free languages do not have this property; the Chomsky normal form does not guarantee a unique representation since it is not canonical [HU79]. The advantage of having a canonical normal form for regular languages is that the hypothesis space can be restricted to the set of all minimal DFAs instead of all possible DFAs. A canonical normal form entails decidable equivalence: languages L1 and L2 are equivalent iff they have the same representation in canonical normal form. Thus, for regular languages, equivalence is computable since there exists a canonical normal form, whilst the equivalence of context-free languages is undecidable. To define what a language index is, we first need to define the L-equivalence relation.

Definition 3.2.3 For a language L over Σ∗, two strings a, b ∈ Σ∗ are L-equivalent (denoted as a ≡L b) iff for all strings x ∈ Σ∗, ax ∈ L ⇔ bx ∈ L. [HU79]

L-equivalence is an equivalence relation. This is because it is reflexive: a ≡L a (since ax ∈ L ⇔ ax ∈ L); symmetric: a ≡L b ⇒ b ≡L a (since if ax ∈ L ⇔ bx ∈ L is true then bx ∈ L ⇔ ax ∈ L is true); and transitive: a ≡L b ∧ b ≡L c ⇒ a ≡L c (if u ⇔ v and v ⇔ w then u ⇔ w). [HU79]

Definition 3.2.4 The index of a language L is the number of equivalence classes of L under the L-equivalence relation. [HU79]

Theorem 3.2.3 All regular languages have a finite index. All languages which are context-free and not regular have an infinite index. [HU79]

Theorem 3.2.4 The Myhill-Nerode theorem states that if the index of a language L is i, then there is a minimal DFA D with i states such that L(D) = L. [HU79]

Thus, the Myhill-Nerode theorem gives us a very interesting relationship between the observable strings in a language and the minimal language representation. This theorem is useful when learning a language class with a finite index. Thus, the Myhill-Nerode theorem can only be used for regular language learning, not for CFLs.
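The L-equivalence relation can be explored experimentally for a concrete regular language. The sketch below (the language, test prefixes and distinguishing suffixes are all our own illustrative choices, so it only approximates the relation) groups prefixes by their membership signature over a finite set of suffixes; for this language the three signatures that emerge match its index of 3.

```python
# Approximating the L-equivalence classes (Definition 3.2.3) of the regular
# language L = { w over {a,b} : w ends with "ab" }. Prefixes are grouped by
# their membership signature over a finite set of suffixes, so this is only an
# approximation of the relation; language, prefixes and suffixes are our choices.
from itertools import product

def in_L(w):
    return w.endswith("ab")

alphabet = "ab"
prefixes = ["".join(p) for n in range(4) for p in product(alphabet, repeat=n)]
suffixes = ["".join(s) for n in range(3) for s in product(alphabet, repeat=n)]

def signature(x):
    """Membership pattern of x over the chosen suffixes: is xz in L?"""
    return tuple(in_L(x + z) for z in suffixes)

classes = {}
for x in prefixes:
    classes.setdefault(signature(x), []).append(x)

for members in classes.values():
    print(members[:5])
# Three groups emerge (strings ending in "ab", strings ending in "a", the rest),
# matching the index 3 of this language: its minimal DFA has 3 states.
```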


Chapter 4 Grammatical Inference of CFLs from Positive Data

In this chapter, we present the different techniques used in algorithms to infer context-free languages from positive examples. Each technique is presented as a method to avoid the limitation problems imposed by the hardness results in section 3.1.

4.1 Learning Subclasses

One of the most common techniques used in GI to 'get around' the hardness results is to consider subclasses of context-free languages which are not super-finite [Lee96]. The normal approach is to first define a non-super-finite subclass of context-free languages (as shown in figure 4.1), design a learning algorithm that infers languages of this subclass and subsequently show that the algorithm works under some learning model in polynomial time.

Figure 4.1: A language class which is not super-finite

In this section, we investigate four CFL subclasses: Even Linear, Simple, Substitutable and Unambiguous Non-Terminally Separated CFLs.

4.1.1 Even Linear CFLs

Definition 4.1.1 A linear CFG is a CFG with at most one nonterminal in the RHS of each production rule, i.e. each rule is of the form A → α or A → αNβ, where α and β are sequences of zero or more terminals. [LN03]

Definition 4.1.2 An even linear CFG is a linear CFG where for each production rule of the form A → αNβ, the number of terminals in α is equal to that in β, i.e. |α| = |β|. [LN03]

Definition 4.1.3 An even linear CFL is a language which can be generated by an even linear CFG. [LN03]

Theorem 4.1.1 Every even linear CFG admits the following normal form: A → aBb or A → a or A → ε, where A, B ∈ N and a, b ∈ Σ. [LN03]

The earliest works on the inference of subclasses of CFLs were on even linear CFLs. This subclass of CFLs was first introduced by [AP64]. Afterwards, several authors ([Tak88], [RN88], [Maa96] and [SG94]) showed how the problem of learning even linear CFLs (ELCFLs) can be reduced to the problem of learning regular languages. This is because ELCFLs have several properties in common with regular languages. However, ELCFLs include the whole class of regular languages [Tak88], which means that they cannot be identified in the limit from positive data alone (since the class of regular languages is super-finite). So, later on, non-super-finite subclasses of ELCFLs were defined and shown to be learnable under Gold's model. In this section, we investigate one of these subclasses.

4.1.1.1 Terminal Distinguishable Even Linear CFLs

Definition 4.1.4 An even linear CFL L is a Terminal Distinguishable Even Linear CFL (TDEL-CFL) iff for all u1, u2, v1, v2, w, z ∈ Σ∗ such that |u1| = |v1|, |u2| = |v2| and Ter(w) = Ter(z), where Ter(x) denotes the set of terminals in x, if u1wv1, u2wv2 ∈ L then u1zv1 ∈ L iff u2zv2 ∈ L.

The finite language Lfinite = {axyc, bxyd, ayxc} is not a TDEL-CFL (u1 = a, u2 = b, v1 = c, v2 = d, w = xy, z = yx). Thus the class of TDEL-CFLs is not super-finite.


Laxminarayan & Nagaraja in [LN03] propose an algorithm, TDELG-Inference, that polynomially infers TDEL-CFLs in the limit from positive data. The following are some preliminary definitions needed to understand the algorithm.

Definition 4.1.5 A skeleton of a string s w.r.t. a grammar G is a derivation tree of s using G.

Now, since even linear grammars admit the normal form in theorem 4.1.1, the skeleton of any string s = a1a2a3 . . . an on an even linear grammar will always take one of the forms shown in figure 4.2. We shall call the skeleton of an even linear grammar the even linear skeleton.


Figure 4.2: The two forms of even linear skeletons (one for strings of even length, one for strings of odd length)

Let ND(S+) denote the set of nodes (non-terminals) found in all even linear skeletons of S+, where S+ is the set of positive examples. A partition of a set ND(S+) is a collection of pairwise disjoint nonempty subsets of ND(S+) whose union is ND(S+). If π is a partition of the set ND(S+), then for any element p ∈ ND(S+) there is a unique element of π containing p, which is denoted as B(p, π), and we shall call this the block of π containing p. A partition π is said to refine another partition π′ iff every block of π′ is a union of blocks of π.

Definition 4.1.6 The frontier string of a node N in an even linear skeleton, FS(N), is the string derived from the sub-skeleton rooted at node N.

Definition 4.1.7 The head string u and tail string v of a node N in an even linear skeleton, denoted as Head(N) and Tail(N) respectively, are strings such that u FS(N) v = the whole string generated by the even linear skeleton.

Definition 4.1.8 A skeletal structure node function of a node N in an even linear skeleton, denoted as SSNF(N), is an ordered triple <Head(N), Ter(FS(N)), Tail(N)>.

Definition 4.1.9 A parent terms function of a node N in an even linear skeleton, denoted as PTF(N), is an ordered triple <Parent(N), (Last(Head(N)), First(Tail(N))), Ter(FS(N))>, where First(s) and Last(s) are the first and last characters of a string s respectively and Parent(N) is the parent node of N in the even linear skeleton.

Figure 4.3 shows an example even linear skeleton along with the values of the different functions we have just defined.

Figure 4.3: An even linear skeleton of abcabcddbacbac (reproduced from [LN03])

Algorithm 1 shows the TDELG-Inference algorithm. This algorithm can be split in two parts: the initialization phase (before the while loop) and the merging stage. In the initialization part, ND(S+) is first computed. So, given S+ = {ab, aabb, aaabbb}, ND(S+) = {N1, N2, N3, N4, N5, N6}. Then, the initial partition π0 is computed such that each block in the partition has a common SSNF. For our example, π0 = {{N1, N2, N4}, {N3, N5}, {N6}}; this was computed using the values in Table 4.1.


       Head   Tail   FS       Ter(FS)
N1     ε      ε      ab       {a, b}
N2     ε      ε      aabb     {a, b}
N3     a      b      ab       {a, b}
N4     ε      ε      aaabbb   {a, b}
N5     a      b      aabb     {a, b}
N6     aa     bb     ab       {a, b}

Table 4.1: Computing the head, tail, FS and Ter(FS)

Algorithm 1: TDELG-Inference Algorithm in [LN03]
Input: Set of positive examples S+
Output: A TDEL context-free grammar
  Compute ND(S+);
  Compute the partition π0 of ND(S+) s.t. two nodes p and q are in the same block B if SSNF(p) = SSNF(q);
  Let LIST contain all unordered pairs {p, q} of nodes in ND(S+) s.t. p ∈ B1, q ∈ B2, B1, B2 ∈ π0, B1 ≠ B2 and FS(p) = FS(q);
  i ← 0;
  while |LIST| > 0 do
    Remove some element {p, q} from LIST;
    Find the two blocks B1 and B2 of πi such that p ∈ B1 and q ∈ B2;
    if B1 ≠ B2 then
      Construct πi+1, which is the same as πi with B1 and B2 merged;
      For every pair of distinct nodes (r, s) whose parent nodes are in B1 ∪ B2 and PTF(r) = PTF(s), place {r, s} in LIST;
      i ← i + 1;
    end
  end
  Construct the production rules P, using the skeletons, which have been modified to reflect the equivalence classes in πi;

The algorithm proceeds by constructing a list which contains all the pairs of nodes with a common frontier string. In our case, LIST = {(N1, N3), (N1, N6), (N3, N6), (N2, N5)}. The algorithm then starts the merging phase. An element (p, q) of LIST is selected (any element) and it is checked whether p and q reside in different blocks of π0. If so, a new partition π1 is constructed by merging the block of p with that of q. LIST is then updated accordingly. So, in our case, if (N1, N3) is selected from LIST, we can see that N1 and N3 reside in different blocks of π0. Thus, these blocks are merged and the new partition is π1 = {{N1, N2, N3, N4, N5}, {N6}}. This process is repeated on every new partition until LIST is empty. Then, a non-terminal is assigned to each block of the final partition, which is substituted into the even linear skeletons to obtain the resulting grammar. For our example, the final partition πfinal would be {{N1, N2, N3, N4, N5, N6}}, and thus the grammar is:

S → S1
S1 → aS1b
S1 → ab

Laxminarayan & Nagaraja in [LN03] show that the TDELG-Inference algorithm correctly learns TDEL-CFLs in O(kα(k)), where k is the number of positive samples and α is a very slowly growing function.
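To make the initialization and merging phases concrete, the following Python sketch recomputes the worked example above for S+ = {ab, aabb, aaabbb}. It is a simplified illustration rather than the full algorithm of [LN03]: the function names mirror the definitions above, but the PTF-based propagation of new pairs into LIST and the final rule construction are omitted, since they are not needed to reach the final partition on this particular sample.

```python
# Simplified sketch of the TDELG-Inference initialization and merging phases on
# the worked example S+ = {ab, aabb, aaabbb}. The PTF-based update of LIST and
# the final rule construction are omitted here for brevity (an assumption).

def skeleton_nodes(w):
    """Return (head, frontier, tail) for every internal node of the even linear
    skeleton of w: the node at depth i has a head and a tail of length i."""
    nodes, i = [], 0
    while len(w) - 2 * i >= 1:          # one symbol is peeled off each end per level
        nodes.append((w[:i], w[i:len(w) - i], w[len(w) - i:]))
        i += 1
    return nodes

def ssnf(node):
    head, fs, tail = node
    return (head, frozenset(fs), tail)  # <Head(N), Ter(FS(N)), Tail(N)>

S_plus = ["ab", "aabb", "aaabbb"]
nodes = []                              # ND(S+), numbered N1, N2, ... in order
for w in S_plus:
    nodes.extend(skeleton_nodes(w))

# Initial partition pi_0: nodes sharing an SSNF go into the same block.
blocks = {}
for idx, node in enumerate(nodes):
    blocks.setdefault(ssnf(node), set()).add(idx)
partition = [set(b) for b in blocks.values()]

def block_of(i):
    return next(b for b in partition if i in b)

# LIST: unordered pairs of nodes in different blocks sharing a frontier string.
LIST = [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
        if nodes[i][1] == nodes[j][1] and block_of(i) is not block_of(j)]

# Simplified merging loop (no PTF-based propagation of new pairs).
while LIST:
    p, q = LIST.pop()
    bp, bq = block_of(p), block_of(q)
    if bp is not bq:
        partition.remove(bp)
        partition.remove(bq)
        partition.append(bp | bq)

print([sorted("N%d" % (i + 1) for i in block) for block in partition])
# Expected output: a single block containing N1..N6, as in the text.
```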

4.1.2 Simple CFLs

Definition 4.1.10 A simple CFG is a CFG in Greibach Normal Form such that if A → aα, A → aβ ∈ P then α = β, where a ∈ Σ and α, β ∈ N∗. [Yok03]

Definition 4.1.11 A simple CFL is a language which can be generated by a simple CFG. [Yok03]

Like even linear CFLs, the class of simple CFLs is super-finite. We can prove this by showing that simple CFGs can generate all finite languages and at least one infinite language. Every finite language is made up of a finite number n of strings, where the ith string, 1 ≤ i ≤ n, has length mi. Thus, we can represent the jth symbol of the ith string in a finite language as ai.j (1 ≤ i ≤ n and 1 ≤ j ≤ mi). So, any finite language L can be represented as follows:

L = {a1.1 a1.2 . . . a1.m1,
     a2.1 a2.2 . . . a2.m2,
     a3.1 a3.2 . . . a3.m3,
     . . .
     an.1 an.2 . . . an.mn}

Let Cx be the set of all symbols ai.x for any 1 ≤ i ≤ n. Let the k equivalence classes of Cx under the equality relation ('=') be [ax,1], [ax,2], . . ., [ax,k]. So, for x = 1 (i.e. the first symbols of each string), we can build the following simple CFG:


S → a1,1 N1,1
S → a1,2 N1,2
. . .
S → a1,k N1,k

where ax,1, ax,2, . . ., ax,k are symbols from the equivalence classes [ax,1], [ax,2], . . ., [ax,k] respectively. Note that all symbols in each class are the same, since these are equivalence classes over the equality relation. Using the same procedure, the CFG can be extended with rules for x = 2 by using the non-terminals N1,1, N1,2, . . ., N1,k. The procedure is repeated until the CFG accepts all the strings. So, for example, given the finite language Lfinite = {a, b, ab, ac, abc}, we can build the following simple CFG:

S → aN1,1
S → bN1,2
N1,1 → ε
N1,1 → bN2,1
N1,1 → cN2,2
N1,2 → ε
N2,1 → ε
N2,1 → cN3,1
N2,2 → ε
N3,1 → ε

One infinite language accepted by a simple CFG is (a)∗: S → aS, S → ε. Therefore, the class of simple CFLs is super-finite. However, there are subclasses of simple CFLs that are not super-finite. We investigate two of these subclasses (very simple CFLs and right-unique simple CFLs) which have been shown to be polynomially identifiable in the limit from positive data alone.
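The finite-language construction above is essentially a prefix-tree construction. The following Python sketch (our own illustration, with invented nonterminal names) builds such a simple CFG in Greibach Normal Form for any finite language; on Lfinite = {a, b, ab, ac, abc} it reproduces, up to renaming, the grammar listed above.

```python
# Sketch: build a simple CFG (GNF, at most one rule per (nonterminal, terminal)
# pair) for a finite language by sharing prefixes, in the spirit of the
# construction above. Nonterminal names S, N1, N2, ... are invented here.

def simple_cfg_for_finite_language(strings):
    rules = []                    # (lhs, terminal or None, rhs nonterminal or None)
    trie = {}                     # maps (state, terminal) -> successor nonterminal
    counter = [0]

    def fresh():
        counter[0] += 1
        return "N%d" % counter[0]

    accepting = set()
    for w in strings:
        state = "S"
        for a in w:
            if (state, a) not in trie:            # keep determinism in (A, a)
                successor = fresh()
                trie[(state, a)] = successor
                rules.append((state, a, successor))      # A -> a B
            state = trie[(state, a)]
        accepting.add(state)
    for state in sorted(accepting):
        rules.append((state, None, None))   # A -> epsilon (allowed by Def. 2.1.19)
    return rules

for lhs, a, rhs in simple_cfg_for_finite_language(["a", "b", "ab", "ac", "abc"]):
    print("%s -> %s" % (lhs, a + rhs if a else "epsilon"))
```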

4.1.2.1 Very Simple CFLs

Definition 4.1.12 A very simple CFG (VSCFG) is a simple CFG such that if A → aα, B → aβ ∈ P then A = B and α = β.

Definition 4.1.13 A very simple CFL (VSCFL) is a language which can be generated by a very simple CFG.

The following are examples of VSCFLs in different language classes:
• Any language containing one string of unit length; S → a (finite).
• The language (ab)∗; S → aB, B → bS, S → ε (infinite regular).
• The language {a^n b^n | n ∈ N}; S → aSB, S → ε, B → b (non-regular context-free).

The class of VSCFLs is not a super-finite class. We can show this by giving an example of a finite language which is not a VSCFL.

Example Take the finite language Lfinite = {aab, aac}. Now, either one or two rules have the start non-terminal S on the LHS. With two rules having S on the LHS, the CFG takes the following form:

S → aα
S → aβ

By the property of simple CFGs, α must be equal to β. Thus, we can only build a CFG with one rule having S on the LHS. Now, there are only 2 remaining possible CFGs in Greibach Normal Form that generate Lfinite using only one rule with S on the LHS:

Grammar 1:
S → aX
X → aB
X → aC
B → b
C → c

Grammar 2:
S → aAX
A → a
X → b
X → c

The leftmost grammar does not satisfy the VSCFL property with the rules S → aX, X → aB and X → aB. The other grammar also does not satisfy the VSCFL property with the rules S → aAX and A → a. This means that there is no VSCFG that generates Lf inite , which thus means that the class of VSCFLs is not super-finite. Yokomori in [Yok03] presents an algorithm that identifies VSCFLs in the limit in polynomial time from positive data. The algorithm first builds a prefix tree acceptor (PTA) that generates only the strings in the training set. Each node of the PTA represents a non-terminal in the target grammar. The PTA is then generalized through a sequence of state merges, where each merge is justified by some theorem that holds for VSCFLs. The following are some of the theorems used for state merging: Theorem 4.1.2 For any VSCFL, if S ⇒∗ xα and S ⇒∗ xα0 then α = α0 . Thus, if the start state in the PTA has multiple edges with the same terminal to different states, merge all of these states into one state. Theorem 4.1.3 For any VSCFL, if A → a and B → a then A = B. Thus, if multiple final states have incoming edges with the same terminal, merge the final states into one state. Theorem 4.1.4 For any VSCFL L, if for some n > 0, uv ∈ L and uxn v ∈ L, then ux∗ v ⊆ L 31

Thus, if multiple non-starting states generate the same string, merge them into one state.
Once the states (which represent non-terminals) have been merged, the algorithm proceeds by finding the length of each rule. A parameter na is defined s.t. for rules of the form Na → aC, na = |C| − 1. This parameter represents the length increase of a sentential form induced by one application of the rule Na → aC [Sta04]. The following is the explanation given by Starkie [Sta04] for the remaining part of the algorithm:
"Using the length of each sentence in the training set, and the number of instances of each terminal symbol in each string, a set of simultaneous equations are formed. These simultaneous equations are then solved to determine the number of nonterminals that must appear on the right-hand side of each rule. Depending upon the number of examples and number of sentences there may be one, more than one or less than one solution to this set of simultaneous equations. At this point the number of rules, the left-hand side of each rule, the first symbol on the right-hand side of each rule and the number of non-terminals on the right-hand side of each rule is known. The last remaining step is to determine the actual non-terminals that appear on the right-hand side of each rule. This is determined by simulating the derivations of a number of training examples."
4.1.2.2 Right-Unique Simple CFLs

Definition 4.1.14 A right-unique simple CFG (RSCFG) is a simple CFG such that if A → aα, B → aβ ∈ P then α = β.
Definition 4.1.15 A right-unique simple CFL (RSCFL) is a language which can be generated by a right-unique simple CFG.
Clearly, RSCFLs are a superclass of VSCFLs. Therefore, the examples of VSCFLs in different language classes shown in section 4.1.2.1 are all RSCFLs. The class of RSCFLs is not super-finite. This is because we can show that the finite language Lfinite = {aab, aac} is not an RSCFL. The proof of this is similar to the proof in section 4.1.2.1 that Lfinite is not a VSCFL.
Yoshinaka in [Yos06] presents a polynomial algorithm for identifying RSCFLs from positive data. This algorithm is consistent and conservative.
Definition 4.1.16 A learning algorithm is said to be consistent if it always outputs a hypothesis which is consistent with the training examples.
Definition 4.1.17 A learning algorithm is said to be conservative if it changes its current hypothesis only when that hypothesis is no longer consistent with a newly given example from the target concept.

Yoshinaka's algorithm uses functions known as shapes.
Definition 4.1.18 For any RSCFG G, a function φG from (Σ ∪ N)∗ to Z is called the shape of G iff
• φG(a) = |α| − 1 for A → aα ∈ P,
• φG(A) = −1 for A ∈ N, and
• φG(uv) = φG(u) + φG(v) for u, v ∈ (Σ ∪ N)∗.
Definition 4.1.19 The shape φ of some RSCFG is said to be consistent with a language L iff there is an RSCFG G whose shape is φ such that L ⊆ L(G).
Algorithm 2: RSCFG learning algorithm [Yos06]
Input: Positive data x0, x1, x2, . . . , xn
Output: An RSCFG
Let Gcurrent = grammar generating the empty language
for i ← 0 to n do
    if xi ∉ L(Gcurrent) then
        Enumerate all the shapes φ1, φ2, . . . , φm consistent with x0, . . . , xi
        if m = 0 then return "It's not a RSCFG"
        for j ← 1 to m do
            Construct the minimum grammar Gj with φGj = φj and {x0, . . . , xi} ⊆ L(Gj)
        end
        Gcurrent ← G1
        for j ← 2 to m do
            if L(Gcurrent) ⊂ L(Gj) then Gcurrent ← Gj
        end
    end
end
return Gcurrent
Yoshinaka's algorithm (see algorithm 2) considers the training examples one by one. It enumerates all the shapes which are consistent with the training examples. For each shape s, it constructs the minimum grammar that has s as its shape. From these grammars, it chooses the one which produces the largest language. The algorithm updates its current hypothesis only if an inconsistent training example is found (i.e. it is conservative).
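Definition 4.1.18 is easy to compute directly. The following is a small sketch of the shape function (my own toy grammar and helper names, not Yoshinaka's code); note the invariant that every string of the language has the same shape value as S, namely −1.

def shape_of_grammar(rules):
    """rules: iterable of (lhs, terminal, tail) with tail a tuple of non-terminals."""
    return {a: len(tail) - 1 for _lhs, a, tail in rules}   # right-uniqueness makes this well defined

def shape_of_string(phi, w):
    """Extend phi additively over a terminal string w."""
    return sum(phi[c] for c in w)

# An illustrative RSCFG (my own example): S -> aSB, S -> c, B -> b, generating {a^n c b^n | n >= 0}.
rules = [("S", "a", ("S", "B")), ("S", "c", ()), ("B", "b", ())]
phi = shape_of_grammar(rules)                 # {'a': 1, 'c': -1, 'b': -1}
print(phi, shape_of_string(phi, "aacbb"))     # every string of the language sums to -1 (= phi(S))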

4.1.3 Substitutable CFLs

Clark & Eyraud in [CE07] define substitutable context-free languages and show how these languages can be polynomially identified in the limit from positive data only. Their main motivation is to apply their work in the field of First Language Acquisition (which is the study of how humans learn natural language in their early childhood). Unlike [Yok03], they do not define the class of substitutable languages as the set of languages generated by a restricted form of context-free grammar. Instead, they define it as a class of languages that satisfy a restrictive property on their own structure. This is because in First Language Acquisition the language representation is not known (do children learn how to construct grammatically correct sentences by using context-free grammars?); however, it is known that natural languages are learnable. Thus, Clark & Eyraud in [CE07] identify a structure in languages (substitutability) that can be observed and construct a learnable representation based on that structure.
Definition 4.1.20 A pair (l, r) ∈ Σ∗ × Σ∗ is said to be a context of a string w w.r.t a language L iff lwr ∈ L.
Definition 4.1.21 Two strings u and v are syntactically congruent (or strongly substitutable) w.r.t a language L iff ∀l, r ∈ Σ∗ · lur ∈ L ⇔ lvr ∈ L.
The syntactic congruence relation is an equivalence relation. We denote the equivalence class of a string u under syntactic congruence as [u]. Note that if [u0] = {u0, u1, u2, . . . , un}, then [u1], [u2], . . . and [un] all refer to the same congruence class [u0].
Definition 4.1.22 Two strings u and v are weakly substitutable w.r.t a language L iff ∃l, r ∈ Σ∗ · lur ∈ L ∧ lvr ∈ L.
Definition 4.1.23 A language L is substitutable iff ∀u, v ∈ Σ∗, if u and v are weakly substitutable then u and v are also strongly substitutable (i.e. syntactically congruent).
The following are examples of substitutable languages in different language classes [CE07]:
• Any language containing only one string (finite).

• The set Σ∗ of all the strings over the alphabet is substitutable. This is because all strings appear in all possible contexts (infinite regular).
• The language {aⁿ | n > 0} (another infinite regular).
• The palindrome language with a center marker: {wcwᴿ | w ∈ (Σ\{c})∗} (non-regular context-free).
To prove that the class of substitutable languages is not super-finite, we only need to give an example of a finite language which is not substitutable.
Example Take for example the finite language Lfinite = {ba, ab, abba}. Clearly, ab and ba are weakly substitutable since they appear in the same context (ε, ε), where ε is the empty string. Thus, for Lfinite to be substitutable, ab and ba need to be strongly substitutable. However, ab also appears in the context (ε, ba), in which ba does not appear (since the string baba is not in Lfinite). Thus, ab and ba are weakly substitutable but not strongly substitutable. Therefore, the finite language Lfinite is not substitutable.
To learn substitutable languages, Clark & Eyraud define an algorithm called SGL (Substitution Graph Learner) [CE07]. This algorithm builds a substitution graph from the set of all substrings of the given positive data.
Definition 4.1.24 A substitution graph SG(S) on a finite set of words S is made up of a set of vertices V and a set of edges E s.t.
V = {u ∈ Σ⁺ | ∃l, r ∈ Σ∗ · lur ∈ S}
E = {(u, v) ∈ Σ⁺ × Σ⁺ | u and v are weakly substitutable}
For example, suppose that the set of positive examples is +ve = {a, ab, ac, abc}. Figure 4.4 illustrates the substitution graph for +ve. Note that the number of graph components is always equal to the number of equivalence classes under the syntactic congruence relation. For our example, we have 2 equivalence classes: [a] = {a, ab, ac, abc}, where all the strings have (ε, ε) as their common context, and [bc] = {b, c, bc}, where all the strings have (a, ε) as their common context.
The algorithm continues by associating each graph component (i.e. equivalence class) with a different non-terminal. Note that there will be precisely one component of the substitution graph that contains all the strings given as positive examples, since positive examples (and only these) have the context (ε, ε). The component containing the positive examples is associated with the starting non-terminal S. So, for our example, we will associate non-terminal S with component [a] and non-terminal X with component [bc].

Figure 4.4: Substitution graph for +ve = {a, ab, ac, abc}
The algorithm then proceeds by looping on all substrings as shown in algorithm 3.
Algorithm 3: SGL Loop
Input: Set of all substrings (i.e. nodes) V
Output: Set of production rules P
foreach substring w ∈ V do
    if |w| > 1 then
        foreach non-empty strings u, v such that w = uv do
            P ← P ∪ {[w] → [u][v]}
        end
    else
        P ← P ∪ {[w] → w}
    end
end
So, for our example, the following production rules are obtained from algorithm 3:
[a] → a
[b] → b
[c] → c
[ab] → [a][b]
[ac] → [a][c]
[bc] → [b][c]
[abc] → [a][bc]
[abc] → [ab][c]

Then, by replacing the congruence classes with their associated non-terminal, we obtain the following grammar:


S → a
S → SX
X → b
X → c
X → XX

Thus, for our example, the substitutable language learned is a(b|c)∗. Clark & Eyraud prove in [CE07] that the SGL algorithm identifies in the limit the class of substitutable context-free languages in polynomial time. They prove this by first showing that, on any given training set, the resultant hypothesis will always be a subset of the target language. Then, they show how a characteristic set of strings can be constructed from the target grammar such that, if the training examples contain this characteristic set, the resultant hypothesis will always be correct.
Definition 4.1.25 The characteristic set, CS(G), for the SGL algorithm [CE07] on a target grammar G = <V, Σ, P, S> is {lwr | (N → α) ∈ P, (l, r) = c(N), w = w(α)}, where c(N) is the smallest pair of strings (l, r) such that S ⇒∗ lNr, and w(α) is the smallest word generated by α ∈ (Σ ∪ V)⁺.
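The SGL idea can be sketched compactly. The following is my own simplified illustration (not Clark & Eyraud's implementation): it groups substrings that are weakly substitutable (i.e. share at least one context) into graph components and then emits the rules of Algorithm 3 over the resulting classes; the step that renames the component containing the full positive examples to S is omitted for brevity.

from itertools import combinations

def substrings(s):
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def contexts(w, sample):
    """All contexts (l, r) of w with respect to the strings in `sample`."""
    return {(s[:i], s[i + len(w):]) for s in sample
            for i in range(len(s) - len(w) + 1) if s[i:i + len(w)] == w}

def sgl_rules(sample):
    subs = sorted(set().union(*(substrings(s) for s in sample)))
    ctx = {w: contexts(w, sample) for w in subs}
    comp = {w: w for w in subs}                 # union-find over the substitution graph
    def find(w):
        while comp[w] != w:
            w = comp[w]
        return w
    for u, v in combinations(subs, 2):
        if ctx[u] & ctx[v]:                     # weakly substitutable -> same component
            comp[find(u)] = find(v)
    label = {w: "[" + find(w) + "]" for w in subs}
    rules = set()
    for w in subs:
        if len(w) > 1:
            for i in range(1, len(w)):          # [w] -> [u][v] for every split w = uv
                rules.add((label[w], (label[w[:i]], label[w[i:]])))
        else:
            rules.add((label[w], (w,)))         # [a] -> a
    return rules

for lhs, rhs in sorted(sgl_rules({"a", "ab", "ac", "abc"})):
    print(lhs, "->", " ".join(rhs))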

4.1.3.1 k,l-Substitutable CFLs

Yoshinaka in [Yos08] extended Clark & Eyraud's [CE07] work by showing how a superclass of substitutable CFLs, known as k,l-substitutable CFLs (KLSCFLs), can be polynomially identified in the limit from positive data.
Definition 4.1.26 Let k and l be nonnegative integers. A language L is k,l-substitutable iff for any x1, y1, z1, x2, y2, z2 ∈ Σ∗, v ∈ Σᵏ (i.e. |v| = k) and u ∈ Σˡ (i.e. |u| = l) such that vy1u, vy2u ≠ ε,
x1vy1uz1, x1vy2uz1, x2vy1uz2 ∈ L ⇒ x2vy2uz2 ∈ L
In Clark & Eyraud [CE07], 2 strings are said to be substitutable if they share the same contexts, whilst in Yoshinaka's [Yos08], 2 strings are substitutable if they share the same k,l-context. Essentially, Clark & Eyraud's notion of substitutability is exactly 0,0-substitutability in Yoshinaka's terms.
Definition 4.1.27 Let k and l be nonnegative integers. A pair (l, r) ∈ Σ∗ × Σ∗ is said to be a k,l-context of a string w w.r.t a language L iff there exists v ∈ Σᵏ, u ∈ Σˡ such that lvwur ∈ L.


The substitutable languages in different language classes given in section 4.1.3 are obviously all k,l-substitutable languages (for k = l = 0). There are languages which are k,l-substitutable but not substitutable. The context-free language {aⁿbⁿ | n ∈ N} is k,l-substitutable for k = l = 1 but not substitutable (a and aab are weakly substitutable, since both appear in the context (ε, b), but they are not syntactically congruent, since a also appears in the context (ε, abb) whereas aab does not). The class of k,l-substitutable languages is not super-finite, since for any k and l we can construct a finite language Lkl = {aᵏbaˡ, aᵏcaˡ, xaᵏbaˡx, yaᵏcaˡy} which is not k,l-substitutable. This is because, whilst the strings b and c have the same k,l-context (ε, ε) in aᵏbaˡ and aᵏcaˡ, they have the different k,l-contexts (x, x) and (y, y) in xaᵏbaˡx and yaᵏcaˡy respectively.
Unlike Clark & Eyraud's algorithm in [CE07], Yoshinaka presents an algorithm in [Yos08] that learns k,l-substitutable languages without using a substitution graph. Yoshinaka's algorithm starts with the hypothesis grammar generating the empty language. It reads positive examples one by one, and updates the current hypothesis only when a new example is inconsistent with it. A grammar Gcurrent = <Σ, VK, PK, S> is built using the following construction (note that K is the set of positive examples read so far):
VK = {[y] | xyz ∈ K, y ≠ ε} ∪ {S}
PK = {[vyu] → [vy′u] | xvyuz, xvy′uz ∈ K, |v| = k, |u| = l, vyu, vy′u ≠ ε}
     ∪ {S → [w] | w ∈ K}
     ∪ {[xy] → [x][y] | [xy], [x], [y] ∈ VK}
     ∪ {[a] → a | a ∈ Σ}
Algorithm 4: Learning algorithm for k,l-substitutable CFLs [Yos08]
Input: Sequence of positive examples w1, w2, . . . , wn
Output: Grammar Gcurrent
Gcurrent ← CFG generating the empty language
for i ← 1 to n do
    if wi ∉ L(Gcurrent) then
        Gcurrent ← Build new grammar on K = {w1, . . . , wi}
    end
end
return Gcurrent
So, for example, if algorithm 4 is given the training set {ab, aabb, aaabbb}, with k = l = 1, the grammar generated on ab is:
S → [ab]
[ab] → [a][b]
[a] → a
[b] → b

Then, on aabb, Gcurrent would be extended with the following production rules:
S → [aabb]
[aabb] → [a][abb]
[aabb] → [aa][bb]
[aabb] → [aab][b]
[aab] → [a][ab]
[aab] → [aa][b]
[abb] → [a][bb]
[abb] → [ab][b]
[aa] → [a][a]
[bb] → [b][b]
[ab] → [a][b]

Finally, on aaabbb, Gcurrent is similarly extended with rules of the form [xy] → [x][y] and with the rule S → [aaabbb]. The crucial rule added now is [aabb] → [aaabbb]. This rule is added since ab has the same 1,1-context as aabb (with v = a and u = b). This rule allows the grammar to accept infinitely many strings. In fact, the resulting grammar generates the context-free language {aⁿbⁿ | n ∈ N}.
Like Clark & Eyraud in [CE07], Yoshinaka [Yos08] proves that, on any given training set, the resultant hypothesis will always be a subset of the target language. Then, he shows how a characteristic set of strings, KG, can be constructed from the target grammar G such that, if the training examples K contain this characteristic set, the resultant hypothesis will always be correct.
Definition 4.1.28 Let δ(α) = min{w ∈ Σ∗ | α ⇒∗G w} denote the smallest string reachable from α ∈ (Σ ∪ V)∗.
Definition 4.1.29 Let χ(A) = min{(x, z) ∈ Σ∗ × Σ∗ | S ⇒∗G xAz} denote the smallest context of A ∈ V.
Definition 4.1.30 Let KA = {vw1 . . . wnu ∈ Σ∗ | A → vB1 . . . Bnu ∈ P, Bi → βi ∈ P, wi = δ(βi)} ∪ {y ∈ Σ∗ | A → y ∈ P} for A ∈ V.
Definition 4.1.31 The characteristic set, KG, for Yoshinaka's algorithm [Yos08] is {xyz ∈ Σ∗ | χ(A) = (x, z), y ∈ KA, A ∈ V}.
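The construction of VK and PK above can be sketched directly. The following is my own reading of that construction (not Yoshinaka's implementation); non-terminals are written as "[y]" and only the four kinds of rules listed above are generated.

def substrings(s):
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def build_grammar(K, k, l):
    nt = lambda y: "[" + y + "]"
    V = {nt(y) for w in K for y in substrings(w)} | {"S"}
    P = {("S", (nt(w),)) for w in K}                       # S -> [w] for w in K
    P |= {(nt(a), (a,)) for w in K for a in set(w)}        # [a] -> a
    for v in V - {"S"}:                                    # branching rules [xy] -> [x][y]
        y = v[1:-1]
        for i in range(1, len(y)):
            if nt(y[:i]) in V and nt(y[i:]) in V:
                P.add((v, (nt(y[:i]), nt(y[i:]))))
    for s1 in K:                                           # [vyu] -> [vy'u] for shared k,l-contexts
        for s2 in K:
            for p in range(k, len(s1) + 1):                # p = |x| + k
                if s1[:p] != s2[:p]:
                    continue
                for q in range(l, min(len(s1), len(s2)) - p + 1):   # q = l + |z|
                    if s1[len(s1) - q:] != s2[len(s2) - q:]:
                        continue
                    v_, u_ = s1[p - k:p], s1[len(s1) - q:len(s1) - q + l]
                    y1, y2 = s1[p:len(s1) - q], s2[p:len(s2) - q]
                    if v_ + y1 + u_ and v_ + y2 + u_:
                        P.add((nt(v_ + y1 + u_), (nt(v_ + y2 + u_),)))
    return V, P

V, P = build_grammar({"ab", "aabb", "aaabbb"}, 1, 1)
print(("[aabb]", ("[aaabbb]",)) in P)          # True: the crucial rule from the example above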

4.1.4 Non-Terminally Separated CFLs

The techniques we investigated till now all learn subclasses of CFLs under the identification in the limit learning model. Now we give an example of a CFL subclass learned under the PAC framework. Clark in [Cla06] shows how Unambiguous Non-terminally Separated (UNTS) context-free languages can be PAC-learned from positive data.

Definition 4.1.32 A grammar G = <Σ, V, P, S> is non-terminally separated (NTS) iff for all N, M ∈ V, if N ⇒∗G αβγ and M ⇒∗G β then N ⇒∗G αMγ, for α, β, γ ∈ (Σ ∪ N)∗.
Definition 4.1.33 A language L is an Unambiguous Non-terminally Separated (UNTS) CFL iff there exists an unambiguous NTS grammar G generating L.
The algorithm adopted by Clark to learn UNTS CFLs, known as PACCFG [Cla06], is similar in principle to that in [CE07]. The main difference is that it uses probabilistic CFGs (PCFGs) and assumes certain distributions. A detailed explanation of PCFGs is given in section 4.3. PACCFG uses probabilistic congruence classes (similar to the syntactic congruence classes used in [CE07]) as non-terminals, and production rules are built over these classes. The following is a high-level step-by-step description, given by Clark & Lappin in [LC10], of the PACCFG algorithm:
• Gather a finite sample.
• Identify frequent substrings in the sample.
• Test the substrings for probabilistic congruence, and identify the probabilistic congruence classes.
• Create a grammar by
  – adding non-terminals for each congruence class,
  – adding production rules [uv] → [u][v] for congruence classes of uv substrings,
  – adding production rules [a] → a for congruence classes of single symbol substrings a, and
  – identifying the initial S symbol with the congruence class of strings in the language.

4.2 Heuristic-Based Techniques

Definition 4.2.1 A heuristic is a technique which seeks good (i.e. near-optimal) solutions at a reasonable computational cost without being able to guarantee either feasibility or optimality, or even in many cases to state how close to optimality a particular solution is. [Ree93]


So, a heuristic is a technique used to approximately solve complex problems without giving proof of correctness. In learning algorithms, this is done by searching in an exponentially (or infinitely) large hypothesis space without traversing the whole space. This is achieved through a number of decisions taken during the search process. These decisions eliminate candidate hypotheses which, according to the heuristic used, are not classified as good solutions (but might nevertheless be correct). In the context of grammatical inference, heuristics are used as a technique to avoid the hardness results, under the assumption that an approximate solution is good enough. Heuristics are mainly used when the class of languages to be learned is not known or is known to be super-finite. In fact, many practical applications in grammatical inference use heuristics, since the language classes of real-world data are often not known or super-finite. However, when the language class is known to be non-super-finite and learnable under some learning model, then exact algorithms (like those in section 4.1) should be used rather than heuristics.
In this section, we analyze five learning techniques (ABL, EMILE, A. Clark's Omphalos algorithm, MDL and lattice exploration) that make use of heuristics. Note that the authors of the first 3 techniques are interested in grammatical inference of natural languages. Thus, in their work, they use some terminology which is normally used by linguists. The term 'constituent' is one important common term used in these three papers.
Definition 4.2.2 In linguistics, a constituent is a word or group of words which behaves as a single unit within a hierarchical structure [JM09].
In natural languages, sentences can be structured into subgroups of one or more words (i.e. they are not just a juxtaposition of words). Sentences have a tree-like structure, where the leaf nodes are the words themselves and the rest of the nodes are parts of speech. Figure 4.5 illustrates a sentence tree example. The constituents of a sentence are the words (or groups of contiguous words) derived by a single node of the sentence tree. So, for the sentence in figure 4.5, the constituents are the words themselves, the whole sentence, "that man", "this woman" and "saw this woman". CFGs are one commonly used mathematical structure for modeling constituent structure in natural language [JM09]. For our example, the CFG is:
S → NP VP
NP → D N
VP → V NP
D → that
D → this
N → man
N → woman
V → saw

Figure 4.5: The sentence tree for ”that man saw this woman” The following is an important property of constituents, stated by Z.Harris in [Har51]: Definition 4.2.3 Harris’ implication: if two constituents A and B are of the same type, then A and B can be replaced by each other in a sentence (and the sentence will remain syntactically correct) For our example, the constituents ”that man” and ”this woman” (both of type NP) can be replaced to form a new syntactically correct sentence ”this woman saw that man”.

4.2.1 Alignment Based Learning

Alignment Based Learning (ABL) is a learning algorithm introduced by M. van Zaanen in [vZ01]. It was particularly designed to infer natural language grammars; however, it can be applied in any domain. The algorithm takes a set of unstructured positive examples and outputs a labeled, bracketed version of the input. So, for the example sentence "that man saw this woman", the output from ABL would be the following labeled bracketing:
( ( (that)D (man)N )NP ( (saw)V ( (this)D (woman)N )NP )VP )S

Then, a context-free grammar can be easily deduced from the structured examples [vZ01]. ABL consists of two stages: alignment and selection.

4.2.1.1 Alignment Phase

This phase uses the reverse of Harris' implication [Har51] as a heuristic to identify constituents.
Definition 4.2.4 Reverse of Harris' implication: If words (or groups of contiguous words) A and B are substitutable, then A and B are two constituents of the same type.
Algorithm 5 shows how sentences are compared to each other to find the substitutable parts (i.e. the candidate constituents).
Algorithm 5: Alignment stage
Input: A set of correct sentences S
foreach sentence sa ∈ S do
    foreach sentence sb ∈ S do
        if sa ≠ sb then
            Align sa to sb ;
            Find the identical parts of sentences sa and sb and link them ;
            Group adjacent linked words ;
            Group the remaining adjacent non-linked words ;
            Assign non-terminals to the candidate constituents (where the constituents are the groups of distinct parts) ;
        end
    end
end
Identical parts of the sentences can be found using a string edit distance algorithm [WF74], which finds the minimum number of edit operations (insertion, deletion and substitution) needed to change one sentence into the other [vZ01]. The identical words in a sentence are those found in places where no edit operation is applied [vZ01]. Clearly, this process is ambiguous (i.e. it can give different results). For example, given the 2 sentences "from Malta to Italy" and "from Italy to Malta", the following are all possible results (underlined words are the linked identical words and the words in brackets are the candidate constituents):


from ()1 Malta (to Italy)2          from (Italy to)1 Malta ()2
from (Malta to)1 Italy ()2          from ()1 Italy (to Malta)2
from (Malta)1 to (Italy)2           from (Italy)1 to (Malta)2
To solve this ambiguity problem, the ABL algorithm uses one of these three methods:
1. Selects the alignment with the maximum number of linked words. When multiple alignments admit the same maximum number of linked words, the choice between them is postponed to the selection stage of the algorithm [vZ03].
2. Selects one alignment at random [vZ03].
3. Keeps all possible alignments, and decides in the selection stage of the algorithm (thus increasing the burden on the selection phase) [vZ03].
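The alignment step can be sketched with a standard sequence matcher standing in for the edit-distance algorithm. This is my own illustration, not van Zaanen's code: the matching blocks play the role of the linked identical words, and the unlinked stretches between them become the candidate constituents.

from difflib import SequenceMatcher

def candidate_constituents(sent_a, sent_b):
    a, b = sent_a.split(), sent_b.split()
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()  # linked identical words
    pairs, ia, ib = [], 0, 0
    for m in blocks:             # the gap before each matching block is an unlinked group
        pairs.append((a[ia:m.a], b[ib:m.b]))
        ia, ib = m.a + m.size, m.b + m.size
    return [p for p in pairs if p[0] or p[1]]

print(candidate_constituents("from Malta to Italy", "from Italy to Malta"))
# pairs of unlinked word groups; which of the alignments above is produced
# depends on the matcher, since the alignment is ambiguous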

4.2.1.2 Selection Phase

Another potential problem (apart from ambiguity) is that overlapping constituents are found. Overlapping constituents cannot exist, since they cannot be represented by a context-free grammar. The following example shows a sentence (B) aligned to two different sentences (A and C), producing overlapping constituents ("Give me all flights" overlaps with "all flights from Malta to Italy"):
Sentence A: (Book this flight)1 from Malta to Italy
Sentence B: (Give me all flights)1 from Malta to Italy
Sentence B: Give me (all flights from Malta to Italy)2
Sentence C: Give me (your passport number)2
Thus, the role of the selection phase is to solve this overlapping problem by choosing the correct constituent. One of these three methods is used to solve this problem:
1. The easiest method is ABL:incr, which assumes that the constituent identified first is the correct one. Thus, if at some point a constituent is found that overlaps a previously found constituent, then the new one is ignored. However, once an incorrect constituent is learned, it will never be corrected [vZ03].

2. Another method is ABL:leaf, which selects the most probable constituent. The probability of a constituent is computed using the following equation:
Pleaf(c) = |c′ ∈ C : yield(c′) = yield(c)| / |C|
where C is the set of all candidate constituents and the yield of a constituent is the list of words derived from the constituent's non-terminal.
3. The ABL:branch method builds on ABL:leaf by taking into account the constituent's non-terminal, NT(c):
Pleaf(c | NT(c) = N) = |c′ ∈ C : yield(c′) = yield(c) ∧ NT(c′) = N| / |c″ ∈ C : NT(c″) = N|
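The two scores above are plain frequency ratios and can be computed directly. The following is my own small sketch (hypothetical data; `constituents` stands for the pool C of candidate constituents collected during alignment, represented as (yield, non-terminal) pairs).

def p_leaf(c, constituents):
    same = sum(1 for y, _ in constituents if y == c[0])
    return same / len(constituents)

def p_branch(c, constituents):
    same_nt = [y for y, nt in constituents if nt == c[1]]
    return sum(1 for y in same_nt if y == c[0]) / len(same_nt)

C = [(("Malta",), "X1"), (("Italy",), "X1"), (("to", "Italy"), "X2"), (("Malta",), "X1")]
print(p_leaf((("Malta",), "X1"), C), p_branch((("Malta",), "X1"), C))   # 0.5 and 2/3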

4.2.1.3 Results from ABL

Each ABL selection method (i.e. ABL:incr, ABL:leaf and ABL:branch) was evaluated on two large corpora, ATIS and OVIS. The ATIS corpus is from the Penn Treebank and has 716 sentences containing 11,777 constituents. The larger OVIS corpus, which is a Dutch corpus, contains sentences on travel information. It is made of 6,797 sentences (with two or more words) and 48,562 constituents. Each ABL method was tested 10 times, and the mean percentage results and standard deviations (shown in table 4.2) were computed on three different metrics: NCBP, NCBR and ZCS.

                            ATIS                                          OVIS
             NCBP          NCBR          ZCS            NCBP          NCBR          ZCS
ABL:incr     83.24 (1.17)  87.21 (0.67)  18.56 (2.32)   88.71 (0.79)  84.36 (1.10)  45.11 (3.22)
ABL:leaf     81.42 (0.11)  86.27 (0.06)  21.63 (0.50)   85.32 (0.02)  79.96 (0.03)  30.87 (0.09)
ABL:branch   85.31 (0.01)  89.31 (0.01)  29.75 (0.00)   89.25 (0.00)  85.04 (0.00)  42.20 (0.01)
Table 4.2: ABL results on ATIS and OVIS corpora (from [vZ01])
NCBP, which stands for Non-Crossing Brackets Precision, denotes the percentage of learned constituents that do not overlap with any constituents in the original corpus. NCBR, which stands for Non-Crossing Brackets Recall, denotes the percentage of constituents in the original corpus that do not overlap with any learned constituents. ZCS, Zero-Crossing Sentences, denotes the percentage of sentences that do not have any overlapping constituents.


4.2.2 EMILE

P. Adriaans et al. in [ATV00] describe a context-free grammar learning algorithm called EMILE (version 4.1). Like ABL, EMILE was designed to learn natural language grammars, but can be used to infer context-free grammars from any type of data. EMILE finds grammatical types (constituents) by using 2-dimensional clustering on a context/expression matrix (definition 4.1.20 explains what a context is). We will first define what a context/expression matrix is, then explain 1-dimensional clustering, and finally explain 2-dimensional clustering and show how this is used to infer the rules of the target grammar.

4.2.2.1 Context/Expression Matrix

Definition 4.2.5 A context/expression function f on a set of sentences S is a function which, when given a context (l, r) from some sentence in S and a list of contiguous words w from some sentence in S, returns whether lwr ∈ S.
The context/expression matrix of S is thus a matrix whose rows are the lists of contiguous words from all sentences in S and whose columns are all possible contexts in S, where an entry of the matrix is marked if f((l, r), w) is true. Table 4.3 shows an example of a context/expression matrix.

                    (ε, likes tennis)  (Mary, tennis)  (Mary likes, ε)  (ε, tennis)  (Mary, ε)  (ε, ε)  (ε, plays tennis)  (Mary plays, ε)
Mary                ×                                                                                   ×
likes                                  ×
tennis                                                 ×                                                                   ×
Mary likes                                                              ×
likes tennis                                                                         ×
Mary likes tennis                                                                               ×
plays                                  ×
Mary plays                                                              ×
plays tennis                                                                         ×
Mary plays tennis                                                                               ×

Table 4.3: A context/expression matrix for S = {'Mary likes tennis', 'Mary plays tennis'}. Each column is a context, written here as a (left, right) pair; in the original layout the upper two rows show the left and right context respectively.

4.2.2.2 1-Dimensional Clustering

Definition 4.2.6 The set of 1-dimensional clusters, 1D, on a context/expression matrix over sentences S consists of pairs of the type P(W) × C, where P(W) is the power set of the contiguous word sequences in S and C is the set of all possible contexts of sentences in S.

1D = { ({w | w ∈ W, s ∈ S, lwr = s}, (l, r)) | (l, r) ∈ C }
where the cardinality of the set of words for each context must be greater than or equal to 2. So, for the example shown in table 4.3, the set of 1-dimensional clusters is:
{ ({likes, plays}, (Mary, tennis)),
  ({Mary likes, Mary plays}, (ε, tennis)),
  ({likes tennis, plays tennis}, (Mary, ε)),
  ({Mary likes tennis, Mary plays tennis}, (ε, ε)) }
We can extend the set of 1-dimensional clusters by grouping contexts that have the same set of expressions. So, if we add the sentences 'Mary likes football' and 'Mary plays football' to our example, the new extended 1-dimensional clusters would contain the following clusters with grouped contexts:
({likes, plays}, {(Mary, tennis), (Mary, football)})
({tennis, football}, {(Mary likes, ε), (Mary plays, ε)})
({Mary likes, Mary plays}, {(ε, tennis), (ε, football)})
We can then assign grammatical types to each cluster. So, in our case, likes and plays will have the same type (verb) and tennis and football will have another grammatical type (noun).
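The 1-dimensional clusters with grouped contexts can be computed with a simple dictionary pass. The following is my own sketch (not the EMILE implementation); it also produces a couple of extra clusters, such as the one for the contexts (Mary, ε) and (ε, ε), which the text omits.

from collections import defaultdict

def one_dim_clusters(sentences):
    by_context = defaultdict(set)
    for s in sentences:
        words = s.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                l, w, r = " ".join(words[:i]), " ".join(words[i:j]), " ".join(words[j:])
                by_context[(l, r)].add(w)
    grouped = defaultdict(set)               # same expression set -> grouped contexts
    for ctx, exprs in by_context.items():
        if len(exprs) >= 2:
            grouped[frozenset(exprs)].add(ctx)
    return grouped

S = ["Mary likes tennis", "Mary plays tennis", "Mary likes football", "Mary plays football"]
for exprs, ctxs in one_dim_clusters(S).items():
    print(sorted(exprs), sorted(ctxs))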

4.2.2.3 2-Dimensional Clustering

The assignment of grammatical types (constituents) by 1-dimensional clustering does not take into consideration the contexts whose type is ambiguous [ATV00]. For instance, if we add the sentences 'Mary is skiing' and 'Mary likes skiing', the context (Mary likes, ε) would become ambiguous. This is because it is a context of both nouns (tennis, football) and ing-phrases (skiing). Thus, we would like to obtain the following clusters for tennis, football and skiing, so that skiing is assigned a different grammatical type:
({tennis, football}, {(Mary likes, ε), (Mary plays, ε)})
({skiing}, {(Mary likes, ε), (Mary is, ε)})
but instead, with 1-dimensional clustering, we will obtain the following clusters:
({tennis, football}, {(Mary plays, ε)})
({tennis, football, skiing}, {(Mary likes, ε)})
({skiing}, {(Mary likes, ε), (Mary is, ε)})

In order to obtain the desired result, we need a different type of clustering. Thus, EMILE uses 2-dimensional clustering. This clustering technique searches for the maximum-sized blocks in the context/expression matrix [ATV00]. Table 4.4 shows the clustered context/expression matrix (for one word phrases, excluding the word Mary)

            (Mary, tennis)  (Mary, football)  (Mary, skiing)  (Mary likes, ε)  (Mary plays, ε)  (Mary is, ε)
plays       (1)             (1)
likes       (1)             (1)               (2)
is                                            (2)
skiing                                                        (3)                               (3)
tennis                                                        (4)              (4)
football                                                      (4)              (4)

Table 4.4: Part of the clustered context/expression matrix for S = {'Mary likes tennis', 'Mary plays tennis', 'Mary likes football', 'Mary plays football', 'Mary likes skiing', 'Mary is skiing'}. There are 4 cluster blocks; elements of the same block are marked with the same number.
So, the 4 clusters derived from the context/expression matrix in table 4.4 are:
(1) → ({plays, likes}, {(Mary, tennis), (Mary, football)})
(2) → ({likes, is}, {(Mary, skiing)})
(3) → ({skiing}, {(Mary likes, ε), (Mary is, ε)})
(4) → ({tennis, football}, {(Mary likes, ε), (Mary plays, ε)})

P. Adriaans et al. explain in [ATV00] the algorithm to find the maximum-sized blocks as follows: "The algorithm starts from a single context/expression pair. It randomly adds contexts and expressions whilst ensuring that the resulting block is contained in the matrix, and keeps adding contexts and expressions until the block can no longer be enlarged. This is done for each context/expression pair that is not already contained in some block. Some of the resulting blocks may be completely covered by other blocks; these are eliminated at the end of the algorithm."
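The block growth just quoted can be sketched as follows. This is my own reading of the quotation, not the EMILE 4.1 implementation (in particular, the quoted algorithm adds rows and columns in random order, while this sketch simply keeps adding any row or column that fits).

def grow_block(marked, expr0, ctx0):
    """marked: set of (expression, context) cells that are marked in the matrix."""
    exprs, ctxs = {expr0}, {ctx0}
    changed = True
    while changed:
        changed = False
        for e in {e for e, _ in marked} - exprs:
            if all((e, c) in marked for c in ctxs):    # adding row e keeps the block inside the matrix
                exprs.add(e); changed = True
        for c in {c for _, c in marked} - ctxs:
            if all((e, c) in marked for e in exprs):   # same for column c
                ctxs.add(c); changed = True
    return exprs, ctxs

marked = {("likes", ("Mary", "tennis")), ("likes", ("Mary", "football")),
          ("plays", ("Mary", "tennis")), ("plays", ("Mary", "football")),
          ("likes", ("Mary", "skiing")), ("is", ("Mary", "skiing"))}
print(grow_block(marked, "plays", ("Mary", "tennis")))   # cluster (1) from table 4.4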

4.2.2.4 Finding Rules

After finding the grammatical types through 2-dimensional clustering, EMILE transforms these into production rules. Each expression e belonging to a grammatical type T generates the rule: T →e


EMILE derives more complex rules by comparing clusters which share common phrases. So, for the clusters:
({tennis, football}, {(Mary likes, ε), (Mary plays, ε)})
({Mary likes tennis, Mary likes football}, {(ε, ε)})
EMILE derives the rule:
S → Mary likes T
T → tennis
T → football
EMILE is also able to find recursive rules. Given the sentence 'Ben likes that Mary plays football', EMILE can derive the following recursive rules:
S → Ben likes that S
S → Mary plays T
T → tennis
T → football

4.2.2.5 Results from EMILE

Adriaans et. al in [ATV00] conducted several experiments with EMILE. These included experiments on small and large data sets. For example, EMILE correctly identified the grammatical types (and the whole grammar) of the following target grammar: S → I cannot V mail with N V → read | write | open | send N → MS-Mail | MS-Outlook | Mail | Outlook EMILE was also tested on large data sets which included a 2000 sentence sample, a set of bio-medical abstracts from the Medline archive (3000 lines) and the King James version of the Bible. For the 2000 sentence sample, EMILE produced an oversized grammar but identified the key grammatical types. The Medline and Bible experiments produced good results. Moreover, EMILE was compared to ABL in [vZA01]. Tables 4.5 and 4.6 show the results obtained


              UR               UP               F
ATIS  EMILE   16.814 (0.687)   51.588 (2.705)   25.351 (1.002)
      ABL     35.564 (0.020)   43.640 (0.023)   39.189 (0.021)
OVIS  EMILE   36.893 (0.769)   49.932 (1.961)   41.433 (3.213)
      ABL     61.536 (0.007)   61.956 (0.008)   61.745 (0.007)
Table 4.5: Results: UR = unlabeled recall, UP = unlabeled precision, F = F-score (from [vZA01])

              CB               0 CB             ≤2 CB
ATIS  EMILE   16.814 (0.687)   51.588 (2.705)   25.351 (1.002)
      ABL     35.564 (0.020)   43.640 (0.023)   39.189 (0.021)
OVIS  EMILE   36.893 (0.769)   49.932 (1.961)   41.433 (3.213)
      ABL     61.536 (0.007)   61.956 (0.008)   61.745 (0.007)
Table 4.6: Results: CB = average crossing brackets, 0 CB = no crossing brackets, ≤2 CB = two or fewer crossing brackets (from [vZA01])

4.2.3 Clark's Omphalos Algorithm

Alexander Clark's algorithm in [Cla07] is the only one which solved some of the hard problems in the 2005 Omphalos Context-Free Learning Competition (see section 5.2 for more details on this competition). It still remains the only algorithm with successful results on at least one hard Omphalos problem, since further attempts (i.e. after the end of the competition) to solve these were all unsuccessful. Clark's algorithm solved 2 hard problems (with target grammars having 24 and 38 rules) by correctly labeling a large number of test sentences (12,004 and 17,230) from a small number of positive examples (281 and 419) [SCvZ05]. Moreover, the algorithm exactly identified the target languages [SCvZ05].
Basically, Clark's Omphalos algorithm tries to identify constituents in the positive examples and substitutes them with arbitrary non-terminals. This constituent identification (and substitution) process gives structure to the positive examples. The algorithm then merges some non-terminals (i.e. it sets different non-terminals equal to each other). This merging process generalizes the grammar, such that the grammar will accept other strings apart from the positive examples. The algorithm continues identifying constituents, substituting them with non-terminals and merging non-terminals. This is repeated until the training examples become very similar, at which point the initial non-terminal S is assigned to each different training example.


4.2.3.1 Constituent Identification

Clark's algorithm identifies constituents by selecting a subset of strings from the set of all positive example substrings. First, it eliminates all substrings which occur less than a predefined number (fmin) of times in the training examples. The constant fmin should depend on the number and size of the positive examples (the larger and bigger the positive examples are, the larger the constant should be set). The algorithm then proceeds by filtering out all the substrings whose mutual information (MI) is less than a predefined number Mmin. The MI is an information-theoretic measure that is defined on substrings from positive examples using the functions both, left, right and none:
Definition 4.2.7 Let both(l, w, r), for l, r ∈ Σ and w ∈ Σ∗, denote the number of occurrences of the substring lwr.
Definition 4.2.8 Let left(l, w), for l ∈ Σ and w ∈ Σ∗, denote the number of occurrences of the substring lw, i.e. left(l, w) = Σr∈Σ both(l, w, r).
Definition 4.2.9 Let right(w, r), for r ∈ Σ and w ∈ Σ∗, denote the number of occurrences of the substring wr, i.e. right(w, r) = Σl∈Σ both(l, w, r).
Definition 4.2.10 Let none(w), for w ∈ Σ∗, denote the number of occurrences of the substring w.
Definition 4.2.11 The mutual information of a substring w, MI(w), is:
MI(w) = Σl∈Σ∪{ε} Σr∈Σ∪{ε} (both(l, w, r) / none(w)) · log( none(w) · both(l, w, r) / (left(l, w) · right(w, r)) )

Clark claims that substrings which are constituents tend to have a high MI value, whilst substrings that are not will have a low MI value [Cla07].
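MI(w) can be computed directly from raw occurrence counts. The following is my own sketch (hypothetical data; `counts` maps (l, w, r) triples, with "" standing for a sentence boundary, to the number of times l w r occurs in the sample); the two toy substrings illustrate the high-versus-low MI contrast Clark describes.

from collections import defaultdict
from math import log

def mutual_information(w, counts):
    both = {(l, r): c for (l, w2, r), c in counts.items() if w2 == w}
    none = sum(both.values())
    left, right = defaultdict(int), defaultdict(int)
    for (l, r), c in both.items():
        left[l] += c
        right[r] += c
    mi = 0.0
    for (l, r), c in both.items():
        mi += (c / none) * log(c * none / (left[l] * right[r]))
    return mi

counts = {("a", "xy", "b"): 8, ("c", "xy", "d"): 8,        # 'xy' keeps its left/right contexts paired
          ("a", "pq", "b"): 4, ("a", "pq", "d"): 4, ("c", "pq", "b"): 4, ("c", "pq", "d"): 4}
print(mutual_information("xy", counts), mutual_information("pq", counts))   # positive vs 0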

4.2.3.2 Merging

For Clark's algorithm to generalize from the training examples, it makes an assumption on the substitutability of constituents. The assumption is that if strings xuy and xvy are both positive examples, where x, y ∈ Σ∗ and u and v are both constituents, then the non-terminals of u and v are the same (and thus are merged). This assumption holds when the target grammar is NTS (see the definition of NTS in section 4.1.4). Clark uses a substitution graph to check for the substitutability property. Note that the substitution graph is the same as the one used by Clark & Eyraud in [CE07] (which is explained in section 4.1.3). Figure 4.6 shows part of the substitution graph of an Omphalos training example. Each clique in the substitution graph will contain the substrings which have at least one context in common. Thus, the non-terminals of all substrings in each clique are merged.
Figure 4.6: Part of the substitution graph for problem 4 in the Omphalos Context-Free Learning Competition [SCvZ05]. Each number labeling an edge from node A to B denotes the number of contexts that nodes A and B have in common (reproduced from [Cla07])

4.2.3.3 Generalizing and Specifying the Hypothesis

When the algorithm does not return a correct hypothesis, Clark assumed that his inferred grammar was either too specific or too general. So he used two heuristics, one to generalize the grammar and the other to make the grammar more specific. The generalizing heuristic takes two production rules of the form A → uwv and B → w, where u, v, w ∈ Σ∗ and A and B are non-terminals, and replaces the first rule with A → uBv. The other heuristic makes the grammar more specific by taking two production rules of the form A → uBv and B → w, and replaces the first with A → uwv.
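The two heuristics are simple rule rewrites. The following is my own toy illustration (not Clark's code), using a deliberately crude representation in which a rule is a (LHS, RHS) pair and non-terminals are single upper-case letters embedded in the RHS string.

def generalize(rules):
    """Replace A -> u w v with A -> u B v whenever some rule B -> w exists."""
    out = set(rules)
    for a, rhs_a in rules:
        for b, w in rules:
            if a != b and w and w in rhs_a:
                out.discard((a, rhs_a))
                out.add((a, rhs_a.replace(w, b, 1)))
    return out

def specialize(rules):
    """Replace A -> u B v with A -> u w v for some rule B -> w (the inverse step)."""
    out = set(rules)
    for a, rhs_a in rules:
        for b, w in rules:
            if a != b and b in rhs_a:
                out.discard((a, rhs_a))
                out.add((a, rhs_a.replace(b, w, 1)))
    return out

rules = {("A", "uwv"), ("B", "w")}
print(generalize(rules))    # {('A', 'uBv'), ('B', 'w')}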

4.2.4 The Minimum Description Length principle

GRIDS is a heuristic based algorithm introduced by G. Wolff in [Wol78], and further developed by P. Langley and S. Stromsten in [LS00] and later by G. Petasis et al. in [PPSH04]. It uses the minimum description length principle.
Definition 4.2.12 The minimum description length principle (MDL) is a version of Occam's Razor inductive bias which recommends that a learning algorithm choose the hypothesis that minimizes the description length of the hypothesis plus the description length of the data given the hypothesis [Mit97].
So, GRIDS searches the space of possible grammars and chooses the one which minimizes a cost function. This cost function is computed on the size of the grammar plus the length of the derivations of the strings in the training set on the grammar. These two factors are both needed so that the two extreme grammar cases (i.e. overly specific and overly general grammars) are avoided. Large grammars with very specific rules, albeit having short derivations, will be avoided due to their large size. On the other hand, small and overly general grammars will be ruled out due to their long derivations on the strings in the training set.
GRIDS starts its search with a grammar that exactly generates the given positive strings. This is done by simply adding the rule S → w for each string w in the training set. Then the algorithm iteratively searches for a better grammar by performing one of the following two operations:
1. Substitution: Take a sequence of terminals and non-terminals from the grammar, and substitute each occurrence of this sequence in the RHS of rules by a new non-terminal. Note that this operation does not change the language generated by the grammar; it just introduces new structure to the grammar, which affects the derivations.
2. Merging: Take two non-terminals and merge them (i.e. substitute each occurrence of one non-terminal, both in the LHS and RHS of rules, by the other). This operation generalizes the grammar by generating other strings not in the training set. Also note that merging can introduce cycles in the grammar, and thus infinite languages can be generated.
The algorithm finally halts when no operation can improve the current grammar. So, for example, let S+ = {aa, baabb, bbaabbbb} be a set of positive examples. The following are the grammars GRIDS might find in its search:
Start: Initial grammar
S → aa
S → baabb
S → bbaabbbb

Step 1: Substitute aa with A
S → A
S → bAbb
S → bbAbbbb
A → aa
Step 2: Substitute bAbb with B
S → A
S → B
S → bBbb
A → aa
B → baabb
Step 3: Merge S and B
S → A
S → S
S → bSbb
A → aa
S → baabb

GRIDS was tested on some small test cases. On subsets of English grammars (15 rules, 8 non-terminals, 9 terminals), it needed 120 sentences to converge to the correct grammar. On the language (ab)∗ , it converged after being presented with all (15) possible strings of length ≤ 30. Finally, on Dyck language on 2 symbols, it needed all (65) possible strings of length ≤ 12 to converge.
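The two GRIDS operators are easy to mimic on a grammar stored as a list of (LHS, RHS) pairs. The following is my own toy illustration (not the GRIDS implementation; the MDL cost function that drives the search is omitted, and the intermediate B rule is kept as B → bAbb rather than expanded to B → baabb as in the worked example above).

def substitute(grammar, seq, new_nt):
    """Replace every occurrence of `seq` in the right-hand sides by `new_nt`."""
    def rewrite(rhs):
        out, i = [], 0
        while i < len(rhs):
            if tuple(rhs[i:i + len(seq)]) == tuple(seq):
                out.append(new_nt); i += len(seq)
            else:
                out.append(rhs[i]); i += 1
        return tuple(out)
    return [(lhs, rewrite(rhs)) for lhs, rhs in grammar] + [(new_nt, tuple(seq))]

def merge(grammar, nt1, nt2):
    """Replace every occurrence of nt2 (LHS and RHS) by nt1."""
    ren = lambda s: nt1 if s == nt2 else s
    return sorted({(ren(lhs), tuple(ren(s) for s in rhs)) for lhs, rhs in grammar})

g0 = [("S", ("a", "a")), ("S", ("b", "a", "a", "b", "b")),
      ("S", ("b", "b", "a", "a", "b", "b", "b", "b"))]
g1 = substitute(g0, ("a", "a"), "A")              # Step 1 in the example above
g2 = substitute(g1, ("b", "A", "b", "b"), "B")    # Step 2
g3 = merge(g2, "S", "B")                          # Step 3
print(g3)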

4.2.5 Lattice Exploration

J. Giordano in [Gio94] proposes a search in a partial ordering of context-free grammars under the structural containment relation.
Definition 4.2.13 The set of structures of a grammar G is the set of derivation trees of G with all non-terminal labels deleted [HR80].
Definition 4.2.14 A grammar G1 is structurally contained in (resp. structurally equivalent to) a grammar G2 if the set of structures generated by G1 is contained in (resp. equal to) the set of structures generated by G2 [Gio94].
For example, given the following two CFGs:
G1:  S → AB    A → AA    A → a    B → b
G2:  S → SB    S → AA    A → AA    A → a    B → b

It is clear that G1 is structurally contained in G2. This is because the infinite set of structures of G1 (shown in figure 4.7) is contained in the set of structures of G2 (shown in figure 4.8). Note that the set of CFGs under the structural containment relation results in a partial order. This is because structural containment is reflexive, antisymmetric and transitive (analogous to the ⊆ relation). This means the space of CFGs can be represented as a lattice under the structural containment relation, with the grammar generating Σ∗ being the unique supremum and the grammar generating the empty language being the unique infimum. Structural containment is decidable in exponential time on the set of context-free grammars in general [McN67], but in polynomial time on one of their normal forms [HR80]: uniquely invertible grammars.

Figure 4.7: The infinite set of structures of G1, which contains a derivation tree for each sequence of as followed by a b

Figure 4.8: The infinite set of structures of G2, which contains a derivation tree for each sequence of as combined with a sequence of bs. This clearly includes all the structures in figure 4.7
Definition 4.2.15 A CFG G is uniquely invertible iff no two production rules in G have the same right-hand side, i.e. if X → u and Y → u are productions in G, then X = Y.
Thus, Giordano considers the space of uniquely invertible CFGs as hypotheses, partially ordered under the structural containment relation [Gio94]. To obtain the inductive leap (i.e. to restrict the hypothesis space to a finite size), Giordano assumes that the target grammar admits, at most, a fixed number of non-terminals (where this number is chosen a priori, arbitrarily) [Gio94]. Giordano's algorithm then finds the best hypothesis by 'walking' upwards or downwards in the lattice of grammars to find more general or more specific grammars respectively. With each positive example, a more general grammar is found and with each negative example, a more specific grammar is found. Now, since we are concerned with the problem of inferring CFLs from positive data alone, Giordano's algorithm can be easily modified to 'walk' only upwards in the lattice with each positive

example given by starting from the grammar generating the empty language.

4.3 Bayesian Learning

The basic principle in Bayesian learning is that the best hypothesis for describing the target concept is the most probable one, given information on the target concept plus any initial knowledge about the prior probabilities (before any information is given) of the various hypotheses [Mit97]. So, let P(h) be the probability that a hypothesis h describes the target concept, before any information is given. This is known as the prior probability of h. It reflects our background knowledge on the hypothesis space [Mit97]. Similarly, let P(D) be the prior probability that the training data D will be observed (given no knowledge about which hypothesis holds) [Mit97]. P(D|h) is then the probability that the training data D is observed given that the target concept is described by hypothesis h. As we mentioned earlier, we are in fact interested in maximizing P(h|D), which is the probability that h holds given the training set D. This is called the posterior probability. Bayes' theorem [Mit97] states that:
P(A|B) = P(B|A)P(A) / P(B)
So, by Bayes' theorem, we can find the most probable hypothesis, known as the maximum a posteriori (MAP) hypothesis hMAP, by maximizing the posterior probability as follows:
hMAP ≡ arg max_{h∈H} P(h|D) = arg max_{h∈H} P(D|h)P(h) / P(D) = arg max_{h∈H} P(D|h)P(h)
Note that in the last step we omitted P(D). This is because, for any possible hypothesis h, P(D) always remains the same. Thus, since we are interested in maximizing P(h|D) (rather than finding its exact value), we can ignore the constant P(D).
In the context of grammatical inference, in general, we do not know the a priori probabilities of our hypotheses. This is because, before any training examples are given, any language representation is equally likely to be the one describing the target language. This means that P(h) is the same for any h ∈ H. Therefore, as we have done for P(D), we can remove P(h) from our equation. This leaves us with one probability measure to maximize, namely P(D|h). The hypothesis h from H that maximizes P(D|h) is referred to as the maximum likelihood (ML) hypothesis, hML:
hML = arg max_{h∈H} P(D|h)

In order for the notion of probabilistic hypotheses to make sense in GI of context-free languages, probabilistic CFGs need to be introduced. The underlying assumption behind a 'normal' CFG is that all strings generated by a grammar are equally probable. A PCFG is more flexible, since it assigns probabilities to its production rules such that each generated string has an associated probability.
Definition 4.3.1 A probabilistic CFG (PCFG, a.k.a. stochastic CFG) is a 5-tuple <Σ, N, P, S, φ> where, like CFGs, Σ is the alphabet of terminals, N is the set of non-terminals, P is the set of production rules N × (N ∪ Σ)∗ and S is the starting non-terminal. φ is a function P → [0, 1] that assigns a probability to each production rule s.t., for every A ∈ N,
Σβ φ(A → β) = 1, summing over all rules A → β ∈ P. [Cha96]

So, given the following 2 examples of PCFGs (where the probability of each production rule is written right next to each rule):
PCFG 1:             PCFG 2:
S → AB   0.7        S → AB   1.0
A → AA   0.3        A → AA   0.6
A → B    0.8        A → B    0.1
A → a    0.4        A → a    0.3
B → BB   0.6        B → BB   0.2
B → b    0.4        B → b    0.8

PCFG 1 is not well formed. This is because there exist non-terminals (S and A) for which the probabilities of the rules in which they appear on the LHS do not add up to 1 (φ(S → AB) = 0.7 ≠ 1 and φ(A → AA) + φ(A → B) + φ(A → a) = 0.3 + 0.8 + 0.4 ≠ 1). On the other hand, PCFG 2 is well formed.
Definition 4.3.2 The probability of a string w generated by a PCFG G is calculated by summing, over each leftmost derivation dw ∈ Dw of w (i.e. dw[|dw|] = w and Dw is the set of all possible leftmost derivations of w), the product of the probabilities of the production rules used in that derivation:
P(w) = Σ_{dw∈Dw} Π_{i=1}^{|dw|−1} φ(the production rule used s.t. dw[i] ⇒lG dw[i+1]) [Cha96]
An efficient way to compute this measure is by using the CYK algorithm [HU79], which returns all the possible derivations of a given string by a given CFG.


So, for example, the string aabb on PCFG 2 has the following two leftmost derivations: 1. (S, AB, AAB, aAB, aaB, aaBB, aabB, aabb) (see figure 4.9) 2. (S, AB, AAB, AAAB, aAAB, aaAB, aaBB, aabB, aabb) (see figure 4.10)

Figure 4.9: First derivation tree of PCFG 2

Figure 4.10: Second derivation tree of PCFG 2
Thus, the probability of aabb on PCFG 2 is:
First derivation: 1.0 × 0.6 × 0.3 × 0.3 × 0.2 × 0.8 × 0.8 = 0.006912
Second derivation: 1.0 × 0.6 × 0.6 × 0.3 × 0.3 × 0.1 × 0.8 × 0.8 = 0.0020736
Total: 0.006912 + 0.0020736 = 0.0089856
In this chapter, we investigate two techniques that use Bayesian learning on PCFGs from positive data: the Inside-Outside algorithm and Bayesian Model Merging.
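The worked arithmetic above is just a sum of products of rule probabilities and can be checked in a few lines (the probabilities are those of PCFG 2 given in the text; expect tiny floating-point noise in the printed values).

from math import prod

deriv1 = [1.0, 0.6, 0.3, 0.3, 0.2, 0.8, 0.8]        # rules used in the first leftmost derivation
deriv2 = [1.0, 0.6, 0.6, 0.3, 0.3, 0.1, 0.8, 0.8]   # rules used in the second leftmost derivation
print(prod(deriv1), prod(deriv2), prod(deriv1) + prod(deriv2))   # 0.006912, 0.0020736, 0.0089856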

4.3.1 Inside-Outside Algorithm

One of the earliest uses of Bayesian learning in the inference of PCFGs is attributed to K. Lari & S. J. Young in [LY90]. They propose a method for finding the maximum likelihood PCFG given a set of training examples (i.e. the PCFG that best describes the training examples). The following are the basic steps involved in their approach:
1. Generate the production rules from some polynomially bounded set of possible production rules that generate the training examples. All the possible production rules for the training examples cannot be generated, since there are exponentially many of them; thus, some assumptions on the target grammar are needed to bound this set.
2. Assign them some initial probabilities arbitrarily (let φ be the probability function).
3. Run the Inside-Outside algorithm to assign better rule probabilities.
4. Remove the rules with zero probability, leaving the 'correct' grammar.
The Inside-Outside algorithm receives as input a PCFG G and a set of training examples W, and returns a PCFG G′ with the same production rules as G but with different probabilities. The algorithm basically performs a hill-climbing search for the best probability function φ for G to maximize the likelihood of W.


Algorithm 6: The Inside-Outside Algorithm
Input: PCFG G = <Σ, N, P, S, φ> in CNF ; training set W
Output: A PCFG G′ = <Σ, N, P, S, φ′>
repeat
    PreviousLikelihood ← Pφ(W|G)
    foreach A → α ∈ P do
        count(A → α) = 0
        foreach w = w1w2 . . . wn ∈ W do
            foreach substring wk wk+1 . . . wl of w do
                Ikl(A, w) = Pφ(A ⇒∗ wk wk+1 . . . wl)
                Okl(A, w) = Pφ(S ⇒∗ w1 w2 . . . wk−1 A wl+1 wl+2 . . . wn)
            end
            Calculate cφ(A → α, w) from the values of I(A, w) and O(A, w) for all k, l
            count(A → α) = count(A → α) + cφ(A → α, w)
        end
    end
    foreach A → α ∈ P do
        φ(A → α) = count(A → α) / ΣA→β∈P count(A → β)
    end
until PreviousLikelihood = Pφ(W|G)
φ′ ← φ
return G′ = <Σ, N, P, S, φ′>
Where:
• The likelihood of a training set W on grammar G is Pφ(W|G) = Pφ(w1|G) × Pφ(w2|G) × . . . × Pφ(wn|G). These probabilities can be calculated using definition 4.3.2.
• Ikl(A, w) is the inside probability, which denotes the probability that non-terminal A generates the string wk wk+1 . . . wl. This can be calculated using definition 4.3.2 (by treating A as the starting non-terminal).
• Okl(A, w) is the outside probability, which denotes the probability that the starting non-terminal S generates the string w1 w2 . . . wk−1 A wl+1 wl+2 . . . wn. This can be calculated using definition 4.3.2.

60

• cφ(A → α, w) is the expected number of times the rule A → α is used to generate w. If we assume (for convenience) that the rules are from a CFG in CNF, then:
cφ(A → BC, w) = (φ(A → BC) / Pφ(w)) Σ1≤i≤j≤k≤n Oik(A, w) Iij(B, w) Ij+1,k(C, w)
cφ(A → a, w) = (φ(A → a) / Pφ(w)) Σ1≤i≤n, wi=a Oii(A, w)

• The probabilities of the production rules are re-estimated in the last foreach loop.
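The inside probabilities Ikl(A, w) for a CNF grammar admit a simple recursive computation. The following is my own sketch with a hypothetical toy PCFG (it is not the Lari & Young implementation, and the outside probabilities and re-estimation loop are omitted).

from functools import lru_cache

binary = {("S", "A", "B"): 1.0, ("A", "A", "A"): 0.4, ("B", "B", "B"): 0.3}   # toy CNF rules
lexical = {("A", "a"): 0.6, ("B", "b"): 0.7}

def inside(w):
    @lru_cache(maxsize=None)
    def I(A, k, l):                       # probability that A derives w[k..l] (inclusive)
        if k == l:
            return lexical.get((A, w[k]), 0.0)
        total = 0.0
        for (X, B, C), p in binary.items():
            if X == A:
                for m in range(k, l):     # split point between the two children
                    total += p * I(B, k, m) * I(C, m + 1, l)
        return total
    return I

I = inside("aabb")
print(I("S", 0, 3))                       # P(S =>* aabb) under this toy PCFG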

Figure 4.11: The inside and outside probabilities (reproduced from [Cha96])
There are mainly 2 problems with this approach. First of all, there is the issue of the polynomially bounded set of possible production rules used for the Inside-Outside algorithm. It might be the case that the selected production rules do not characterize the target language (i.e. no grammar with those rules exists, on any possible probabilities, from which the target language can be generated). Secondly, even when we assume that the correct production rules are selected initially, the Inside-Outside algorithm does not guarantee the optimal solution. This is because it can get 'stuck' in local maxima (since it uses a hill-climbing strategy), as shown in figure 4.12.


Figure 4.12: The problem with a hill climbing strategy (reproduced from [Max])

4.3.2 Bayesian Model Merging

A. Stolcke & S. Omohundro in [SO94] describe how Bayesian model merging can be used to infer PCFGs. In general, Bayesian model merging is a method for building probabilistic models (HMMs, PCFGs, etc.) that best describe a given set of positive data. It involves 3 steps:
1. Data incorporation: Given a positive training set X, build an initial model M0 by explicitly accommodating each element x ∈ X such that M0 maximizes the likelihood P(X|M0). The size of this initial model M0 becomes larger as the number of examples in the training set increases.
2. Structure merging: Perform a hill-climbing search for new models M1, M2, . . . by applying some generalization operation g, such that Mi+1 = g(Mi) for i = 0, 1, 2, . . . . Halt when no model can be found which improves the current model.
3. Parameter estimation: The probabilities of the model are found using some parameter estimation algorithm (e.g. the Baum-Welch algorithm [BPSW70] for HMMs, or the Inside-Outside algorithm for PCFGs, described in section 4.3.1).
Like Lari & Young's approach in [LY90] (described in section 4.3.1), Bayesian model merging uses a hill-climbing technique to search for the target grammar and thus it has the disadvantage of possibly getting stuck in local maxima. However, unlike Lari & Young's technique, Bayesian model merging searches for both the best model structure (in our case, the PCFG's production rules) and the probabilistic parameters (in Lari & Young's algorithm, no search is done for the initial structure). The following are the 3 steps involved in Bayesian model merging when applied to the PCFG model, as described in [SO94]:

1. Data incorporation: This is done by assigning the symbols in the training set a1 a2 . . . an different non-terminals Na1 Na2 . . . Nan respectively. Then, for each training example w = w1 w2 . . . wk , add production rules of the form S → Nw1 Nw2 . . . Nwk to build the initial grammar (model) G0 . 2. Structure merging: This is done by using two operations: merging of nonterminals (by choosing two non-terminals and substitute each occurrence of one with the other) and chunking (by choosing a sequence of non-terminals and substitute each occurrence of this by a newly added non-terminal). The resulting grammars after merging/chunking are evaluated using the posterior and prior probabilities P (G|X) and P (G) respectively (where P (G) is calculated on the size of the grammar s.t. if |G1 | < |G2 | then P (G1 ) > P (G2 )). The algorithm favors grammars which maximize these measures. 3. Parameter estimation: Use the inside-outside algorithm. Note that Bayesian model merging is very similar to the GRIDS algorithm proposed by G. Wolff in [Wol78] (described in section 4.2.4). The main difference is that GRIDS only tries to minimize the description length of grammars (i.e. it essentially tries to maximize P (G)) but does not take into consideration the posterior probability P (G|X).

4.4 Alternative Representations

Another technique used to avoid the hardness results in grammatical inference is to change the representation of the hypothesis space. The idea is to infer context-free language generating mechanisms which are not CFGs (or restrictive forms of CFGs). We focus here on grammar-like structures (we do not mention inference of automata). In this section, we investigate three examples of alternative representations, namely, string rewriting systems, pure grammars and Lindenmayer systems.

4.4.1 String Rewriting Systems

Definition 4.4.1 Let Σ be a finite alphabet. A string rewriting system (SRS or semi-Thue system) R on Σ is a finite subset of Σ∗ × Σ∗. Each element (l, r) of R is called a rewrite rule, and is denoted as l ` r. [BO93]

Definition 4.4.2 The length of a SRS R is |R| = ∑(l,r)∈R (|l| + |r|). [BO93]

Definition 4.4.3 If R is an SRS on Σ, then the single-step reduction relation on Σ∗ that is induced by R is defined as follows: for any u, v ∈ Σ∗, u →R v iff there exists l ` r ∈ R such that for some x, y ∈ Σ∗, u = xly and v = xry. The reduction relation on Σ∗ induced by R is the reflexive, transitive closure of →R and is denoted by →∗R. [BO93]

Definition 4.4.4 A string w is reducible with an SRS R iff there exists w′ s.t. w →R w′; otherwise the string is irreducible. [BO93]

Definition 4.4.5 An SRS R is said to terminate iff any given string w always reaches an irreducible string w′ (i.e. w →∗R w′) under any possible way of applying the rewrite rules. [BO93]

Definition 4.4.6 The language L(R, x) generated by an SRS R on Σ, where x ∈ Σ∗ is irreducible, is L(R, x) = {s | s ∈ Σ∗, s →∗R x}. [BO93]

The following are three examples of SRSs on Σ = {a, b}:

R0: ab ` ε

R1: ab ` ε, ba ` ε

R2: aa ` a, bb ` b, ab ` ε
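To make Definitions 4.4.3 to 4.4.6 concrete, the following is a small Python sketch (an illustration only, not taken from [BO93] or [dlHEJ07]) that explores all reduction sequences of an SRS to decide whether s ∈ L(R, x). It terminates for length-reducing systems such as the examples above, but can take exponential time in general.

    def one_step_reductions(s, rules):
        # All strings reachable from s by one application of some rule l -> r.
        results = set()
        for l, r in rules:
            start = 0
            while True:
                i = s.find(l, start)
                if i == -1:
                    break
                results.add(s[:i] + r + s[i + len(l):])
                start = i + 1
        return results

    def in_language(s, rules, x):
        # Decide s ->*_R x by breadth-first search over all reductions.
        frontier, seen = {s}, {s}
        while frontier:
            if x in frontier:
                return True
            nxt = set()
            for w in frontier:
                for v in one_step_reductions(w, rules):
                    if v not in seen:
                        seen.add(v)
                        nxt.add(v)
            frontier = nxt
        return x in seen

    R0 = [("ab", "")]                      # Dyck language (a = '(', b = ')')
    R1 = [("ab", ""), ("ba", "")]          # equal numbers of a's and b's
    print(in_language("aabbab", R0, ""))   # True
    print(in_language("aab", R0, ""))      # False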

L(R0, ε) is the Dyck language (i.e. the language of balanced parentheses, where a and b are opening and closing parentheses respectively). L(R1, ε) = {w ∈ Σ∗ | |w|a = |w|b}, where |w|x denotes the number of occurrences of x in w. L(R2, ε) = {(ai bj)n | i, j, n ∈ N}.

Clearly, the language class generated by SRSs is super-finite. Any finite language Lfinite can be generated by an SRS with rules of the form w ` x for each w ∈ Lfinite, where x is the irreducible string. We have already shown examples of infinite languages generated by an SRS (the Dyck language, for example). So, SRSs cannot be identified in the limit from positive data.

C. de la Higuera et al. in [dlHEJ07] introduce a learning algorithm, LARS, that polynomially identifies in the limit a class of SRSs called hybrid ANo DSRSs. We first define DSRSs, hybrid DSRSs and ANo DSRSs. Then, we proceed by explaining the algorithm used by LARS to infer hybrid ANo DSRSs. Finally, we evaluate the LARS algorithm.

4.4.1.1 Delimited SRS

Definition 4.4.7 A delimited string rewriting system (DSRS) on Σ is a finite subset of {$ ∪ ε}Σ∗{£ ∪ ε} × {$ ∪ ε}Σ∗{£ ∪ ε}, where each rewrite rule l ` r satisfies one (and only one) of the following four constraints:

1. l, r ∈ $Σ∗ (used to rewrite prefixes)
2. l, r ∈ $Σ∗£ (used to rewrite whole strings)
3. l, r ∈ Σ∗ (used to rewrite substrings)
4. l, r ∈ Σ∗£ (used to rewrite suffixes)

Rules satisfying constraints 1 or 2 are called $-rules. Rules satisfying constraints 3 or 4 are called non-$-rules. We shall denote the set {$ ∪ ε}Σ∗{£ ∪ ε} as Σ∗.

Definition 4.4.8 The language L(R, x) generated by a DSRS R on Σ, where x ∈ Σ∗ is irreducible, is L(R, x) = {s | s ∈ Σ∗, $s£ →∗R $x£}.

4.4.1.2 Hybrid Delimited SRS

Definition 4.4.9 The length-lexicographic order on Σ∗, denoted as ◁, is defined as follows: ∀w1, w2 ∈ Σ∗ · w1 ◁ w2 ⇔ (|w1| < |w2|) ∨ (|w1| = |w2| ∧ w1 is lexicographically smaller than w2).

Definition 4.4.10 The length-lexicographic order on {$ ∪ ε}Σ∗{£ ∪ ε}, denoted as ≺, is defined as follows: ∀w1, w2 ∈ Σ∗ · w1 ◁ w2 ⇒ w1 ≺ $w1 ≺ w1£ ≺ $w1£ ≺ w2.

Definition 4.4.11 A DSRS rewrite rule l ` r is said to be hybrid iff

1. l ` r is a $-rule and r ≺ l, or
2. l ` r is a non-$-rule and |r| < |l|.

A DSRS is hybrid iff all its rules are hybrid.

C. de la Higuera et al. in [dlHEJ07] prove that any hybrid DSRS R is capable of inducing only a finite number of reductions. This means that a hybrid DSRS cannot apply rewrite rules indefinitely to any given string (i.e. hybrid DSRSs always terminate). Moreover, they show that every derivation starting from a string w on a hybrid DSRS R has length at most |w| · |R|. This means that all derivations of a hybrid DSRS are polynomial w.r.t. the size of the starting string and the size of the DSRS. All this means that hybrid DSRSs induce finite and tractable reductions.


4.4.1.3 Almost Non-Overlapping Delimited SRS

Hybrid DSRSs guarantee termination and tractable derivations. However, many different irreducible strings can be reached from a given string. Thus, a hybrid DSRS does not guarantee a tractable solution to the language membership problem: to decide whether a string w ∈ L(R, x), it is required to compute all the derivations that start with w and to check whether one of them ends with x. This might be intractable if the number of possible ways to reduce w is exponentially large. Thus, another restriction on DSRSs is introduced: the Church-Rosser property.

Definition 4.4.12 A DSRS R is Church-Rosser iff for all strings w, a, b ∈ Σ∗ such that w →∗ a and w →∗ b, there exists w′ ∈ Σ∗ such that a →∗ w′ and b →∗ w′ (see figure 4.13).


Figure 4.13: The Church-Rosser property: if w derives a and b, then there exists w′ s.t. a and b derive w′

Clearly, if a DSRS R satisfies the Church-Rosser property then, for any string w, there will always be at most one irreducible string w′ derivable from w using R. C. de la Higuera et al. in [dlHEJ07] describe a restricted form of DSRS, the ANo DSRS, which always satisfies the Church-Rosser property.

Definition 4.4.13 Two (not necessarily distinct) rules l1 ` r1 and l2 ` r2 are almost non-overlapping (ANo) iff they satisfy the following six conditions:

1. If l1 = l2 then r1 = r2, or
2. If l1 is strictly included in l2 (i.e. ∃u, v ∈ Σ∗ · ul1v = l2 ∧ uv ≠ ε, see figure 4.14(a)) then ur1v = r2, or
3. If l2 is strictly included in l1 (i.e. ∃u, v ∈ Σ∗ · ul2v = l1 ∧ uv ≠ ε) then ur2v = r1, or
4. If a strict suffix of l1 is a strict prefix of l2 (i.e. ∃u, v ∈ Σ∗ · l1u = vl2 ∧ 0 < |v| < |l1|, see figure 4.14(b)) then r1u = vr2, or
5. If a strict suffix of l2 is a strict prefix of l1 (i.e. ∃u, v ∈ Σ∗ · ul1 = l2v ∧ 0 < |v| < |l1|) then ur1 = r2v, or
6. There is no overlapping (i.e. for i, j ∈ {1, 2}, i ≠ j, li = uv and lj = vw ⇒ v = ε).


Figure 4.14: (a): l1 is strictly included in l2 . (b): a strict suffix of l1 is a strict prefix of l2
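The six conditions of Definition 4.4.13 can be checked mechanically. The following is a minimal Python sketch (an illustration only, not the procedure used in [dlHEJ07]) that tests whether a pair of rules, each given as a (left, right) pair of strings, is almost non-overlapping:

    def almost_non_overlapping(rule1, rule2):
        l1, r1 = rule1
        l2, r2 = rule2
        # Condition 1: identical left-hand sides must have identical right-hand sides.
        if l1 == l2 and r1 != r2:
            return False
        # Conditions 2 and 3: strict inclusion of one left-hand side in the other.
        for (la, ra, lb, rb) in [(l1, r1, l2, r2), (l2, r2, l1, r1)]:
            if len(lb) > len(la):
                for i in range(len(lb) - len(la) + 1):
                    if lb[i:i + len(la)] == la:
                        u, v = lb[:i], lb[i + len(la):]
                        if u + ra + v != rb:
                            return False
        # Conditions 4 and 5: a strict suffix of one is a strict prefix of the other.
        for (la, ra, lb, rb) in [(l1, r1, l2, r2), (l2, r2, l1, r1)]:
            for k in range(1, min(len(la), len(lb))):
                if la[-k:] == lb[:k]:          # overlap of length k: la.u = v.lb
                    u, v = lb[k:], la[:-k]
                    if ra + u != v + rb:
                        return False
        # Condition 6: if the rules do not overlap at all, nothing needs to hold.
        return True

    print(almost_non_overlapping(("ab", ""), ("ba", "")))   # True
    print(almost_non_overlapping(("aa", "a"), ("ab", "")))  # False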

Definition 4.4.14 A DSRS is almost non-overlapping (ANo) iff its rules are pairwise almost non-overlapping.

C. de la Higuera et al. prove in [dlHEJ07] that every ANo DSRS is Church-Rosser. This means that the language membership problem for a hybrid ANo DSRS is decidable in polynomial time (the hybrid property guarantees termination and the ANo property guarantees a tractable solution). They also show in [dlHEJ07] that for each regular language Lreg, there is a hybrid ANo DSRS R and a string x s.t. Lreg = L(R, x). This means that the class of languages that can be generated by hybrid ANo DSRSs is super-finite. Therefore, the whole class of languages generated by hybrid ANo DSRSs cannot be identified in the limit from positive data alone. In fact, LARS learns only a non-super-finite subclass of the languages generated by hybrid ANo DSRSs.

4.4.1.4 The LARS Algorithm

The LARS algorithm described in [dlHEJ07] works on both positive and negative examples. However, it can easily be modified to work on positive data alone. Algorithm 7 describes LARS on positive data.


Algorithm 7: LARS (Learning Algorithm for Rewriting Systems) on positive data
Input: A set of positive examples S+
Output: (R, x), where R is a hybrid ANo DSRS and x is an irreducible string

I+ ← {$s£ | s ∈ S+}
R ← ∅
F ← the set of all substrings of I+, sorted by the relation ≺ (i.e. smallest first)
for i ← 1 to |F| do
    if F[i] is a substring of some string in I+ then
        for j ← 0 to (i − 1) do
            Rtemp ← R ∪ {F[i] ` F[j]}
            if Rtemp is a hybrid ANo DSRS then
                I+ ← normalize(I+, Rtemp)
                R ← Rtemp
            end
        end
    end
end
x ← min I+
foreach w ∈ I+ do
    if w ≠ x then
        R ← R ∪ {w ` x}
    end
end
return (R, x)

The algorithm basically builds a DSRS by greedily searching for rewrite rules among the substrings of the positive examples. The chosen rewrite rules are consistent with the training examples and can form part of a hybrid ANo DSRS. After a rewrite rule is chosen, the set of positive examples I+ is normalized: each string is reduced to its unique irreducible form. This is possible since the chosen rewrite rules are always hybrid and ANo. After all the substrings have been traversed, the algorithm concludes by adding a rewrite rule w ` x for each remaining string w ≠ x in the normalized positive sample. This ensures that the language generated by the resulting hybrid ANo DSRS includes all the positive examples.
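A minimal sketch of the normalize step, assuming Python; it relies on the hybrid property (termination) and the ANo property (a unique normal form, by the Church-Rosser property), so applying the rules in any order is sufficient:

    def normalize(strings, rules):
        # Reduce every string to its unique irreducible form under the rules.
        def reduce_to_normal_form(s):
            changed = True
            while changed:
                changed = False
                for l, r in rules:
                    i = s.find(l)
                    if i != -1:
                        s = s[:i] + r + s[i + len(l):]
                        changed = True
            return s
        return {reduce_to_normal_form(s) for s in strings}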

4.4.1.5 Evaluation of LARS

Figure 4.15 shows the class of languages which can be identified in the limit with LARS compared to other language classes.

Figure 4.15: The languages identified by LARS vs other language classes (reproduced from [dlHEJ07])

Experiments were conducted on LARS using both small and large benchmark languages [dlHEJ07]. LARS correctly identified all the small languages. These were the Dyck language, the context-free language {aⁿbⁿ | n ∈ N}, the language {w ∈ Σ∗ | |w|a = |w|b} and the Lukasiewicz language. However, LARS did not learn any benchmark language from either the ABBADINGO [LPP98] or the OMPHALOS [SCvZ04] competition. C. de la Higuera et al. in [dlHEJ07] claim that the reasons behind LARS's failure to learn these large languages may be the following:

1. Insufficient training examples.
2. Some of the large benchmark languages might not be generated by any hybrid ANo DSRS.
3. LARS is not optimized (maybe a non-greedy approach would have given more positive results).

4.4.2 Pure Grammars

Koshiba et al. in [KMT00] study the learnability of pure grammars from positive data. Pure grammars, unlike Chomsky-type grammars, make no distinction between terminals and non-terminals (in other words, there are no non-terminals³).

³ It is argued [MSW80] that the idea of including non-terminals in grammars originates from the linguistic background of formal language theory (due to Chomsky's work). Some argue [dlH10] that it is more feasible to learn grammars that do not use non-terminals, since non-terminals are essentially hidden states which cannot be observed in the training data.

Definition 4.4.15 A pure context-free grammar (PCF grammar) is a system G =< Σ, P, s >, where Σ is a finite alphabet, P is a finite set of production rules of the form α → β, where α ∈ Σ and β ∈ Σ∗, and s ∈ Σ∗ is the axiom. We assume that the empty string ε is not allowed in the RHS of any production. [KMT00]

Definition 4.4.16 The language generated by a PCF grammar G =< Σ, P, s > is {w ∈ Σ∗ | s ⇒∗P w}.

Note that a PCF grammar is not a restricted form of CFG; it is just a different language representation for generating context-free languages. The class of languages generated by PCF grammars (PCFL) is clearly super-finite. Any finite language Lfinite over an alphabet Σ can be built using a PCF grammar G =< Σ ∪ {s}, P, s > where P = {s → w | w ∈ Lfinite}. The regular language a+ is one example of an infinite language generated by a PCF grammar, namely G =< {a}, {a → aa}, a >. So, the PCF language class is super-finite and thus cannot be identified in the limit from positive data. However, there are subclasses of the PCF languages which are identifiable in the limit from positive data alone. We investigate two of these subclasses: k-uniform and deterministic PCFLs [KMT00].

4.4.2.1 k-uniform PCFLs

Definition 4.4.17 A PCF grammar G =< Σ, P, s > is k-uniform, k > 1, if for all rules l → r ∈ P, |r| = k. The language generated by a k-uniform PCF grammar is a k-uniform PCFL.

The class of k-uniform PCFLs is identifiable in the limit from positive data. This is because it has finite thickness.

Proof Given any k-uniform PCFL L, described by a PCF grammar G =< Σ, P, s >, we are guaranteed that the length of s is smaller than or equal to the length of the smallest string in L. This is because the rules of a PCF grammar can only generate strings whose length is larger than or equal to the length of the axiom. Thus, s can be found in finite time. Note that this property holds for PCFLs in general. Also, there is only a finite number of possible production rules for any k-uniform PCF grammar. This is because the length of the RHS of rules is bounded by k and the cardinality of Σ is finite. All this implies that there is only a finite number of consistent k-uniform PCF languages for any string s, and this means that this language class has finite thickness. □


4.4.2.2 Deterministic PCFLs

Definition 4.4.18 A PCF grammar G =< Σ, P, s > is deterministic if, for each a ∈ Σ, a → x ∈ P and a → y ∈ P imply x = y. The language generated by a deterministic PCF grammar is a deterministic PCFL.

The class of deterministic PCFLs is identifiable in the limit from positive data. This is because it has finite elasticity.

Proof Suppose that the language class D of deterministic PCFLs has infinite elasticity. This means that there exists an infinite sequence w0, w1, ... of strings and an infinite sequence L0, L1, ... of languages from D s.t. for all n > 1, {w0, w1, ..., wn−1} ⊆ Ln but wn ∉ Ln. Let Gn =< Σ, Pn, sn > be a deterministic PCF grammar generating Ln. We know that there can only be finitely many possible axioms for a PCF grammar (we explained this previously in the k-uniform PCFL proof). Hence, for at least one axiom s, there are infinitely many grammars Gn that start from axiom s. These grammars have a growing subset of common strings. However, the number of rules in each deterministic PCF grammar Gi is bounded by the cardinality of Σ. It is impossible to have such an infinite sequence of grammars with a bounded number of production rules. Thus, our initial assumption is incorrect, which means that D has finite elasticity. □

Note that proving that k-uniform and deterministic PCFLs are identifiable in the limit does not imply that there exists a polynomial-time algorithm to do so (it only guarantees an algorithm which can take, for example, exponential time). In fact, Koshiba et al. in [KMT00] conclude by posing the following open problem: is there a polynomial-time learning algorithm that identifies in the limit from positive data the language class of k-uniform deterministic PCFLs?

4.4.3 Lindenmayer Systems

Definition 4.4.19 A Lindenmayer system (0L-system) is a triple G =< Σ, P, s >, where Σ is an alphabet, P is a set of production rules of type Σ → Σ∗ and s ∈ Σ+ is called the axiom.

Definition 4.4.20 The 0L derivation relation over a 0L-system G =< Σ, P, s >, denoted as ⇒G : Σ∗ → Σ∗, is defined as follows: a ⇒G b ⇔ ∃x ∈ Σ · ∃y, v1, v2, ..., v|a|x+1 ∈ Σ∗ · (a = v1xv2x...xv|a|x+1) ∧ (b = v1yv2y...yv|a|x+1) ∧ (x → y ∈ P), where |a|x is the number of occurrences of symbol x in a string a.

The reflexive transitive closure of ⇒G is denoted as ⇒∗G. This means that, for a ⇒∗G b, a can derive b using zero or more derivation steps of the 0L-system G.

Definition 4.4.21 The language generated by a Lindenmayer system G =< Σ, P, s > is L(G) = {w ∈ Σ∗ | s ⇒∗G w}.

The only difference between PCF grammars and Lindenmayer systems lies in the way production rules are applied. PCF grammars apply a rule to only one occurrence of a symbol, whilst Lindenmayer systems substitute every occurrence of the symbol in parallel. For example, in a PCF grammar, a rule of the form a → b1b2...bn applied to the axiom aa can derive the strings b1b2...bn a and a b1b2...bn when applied once, and the string b1b2...bn b1b2...bn when applied twice. For a Lindenmayer system with the same rule and axiom, only the string b1b2...bn b1b2...bn can be derived from aa with one application of the rule (and the rule can only be applied once, since the resulting string does not contain any a's). Y. Mukouchi et al. in [MYS98] compare the inference of PCF grammars and Lindenmayer systems.
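The difference between the two rewriting modes can be made explicit in a few lines of Python. This is only an illustration; pcf_step returns all strings reachable in one PCF derivation step, while ol_step performs one parallel 0L step (assuming, for simplicity, at most one rule per symbol, as in a D0L-system):

    def pcf_step(w, rules):
        # Pure CF grammar: rewrite a single occurrence of one symbol.
        out = set()
        for a, beta in rules:
            for i, sym in enumerate(w):
                if sym == a:
                    out.add(w[:i] + beta + w[i + 1:])
        return out

    def ol_step(w, rules):
        # 0L-system: rewrite every occurrence of every symbol in parallel
        # (symbols without a rule are left unchanged here).
        table = dict(rules)
        return "".join(table.get(sym, sym) for sym in w)

    rules = [("a", "bb")]
    print(pcf_step("aa", rules))   # {'bba', 'abb'} -- one occurrence at a time
    print(ol_step("aa", rules))    # 'bbbb'         -- all occurrences at once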

Figure 4.16: The derivation tree of a Lindenmayer system

It can easily be shown that the class of languages generated by 0L-systems is not identifiable in the limit from positive data. We can construct any finite language Lfinite on an alphabet Σ by a Lindenmayer system < Σ, {s → w | w ∈ Lfinite}, s >, where |s| = 1. One infinite language generated by a 0L-system is that of < {a}, {a → aa}, a >, which generates the language {a^(2ⁿ) | n ≥ 0}. Thus, the class of languages generated by 0L-systems is super-finite, which means that it cannot be identified in the limit from positive data.

A PD0L-system is a restricted form of 0L-system.

Definition 4.4.22 A PD0L-system (propagating deterministic Lindenmayer system) is a 0L-system G =< Σ, P, s > where, for all production rules a → b ∈ P, b ≠ ε (propagating property) and, for each a ∈ Σ, there is exactly one production rule with a on its LHS (deterministic property).

T. Yokomori in [Yok92] proves that the class of languages generated by PD0L-systems is identifiable in the limit from positive data. He also describes an algorithm, A1, for doing so (see algorithm 8).

Algorithm 8: A1: Identification in the limit of PD0L-systems
Input: A positive presentation U
Output: A sequence of PD0L-systems GT

T ← ∅
repeat
    Read the next positive example w ∈ U
    T ← T ∪ {w}
    if Procedure(T, GT) succeeds then
        Output GT
    else
        Output GT =< Σ, Id, w >, where Id is the identity relation over Σ
    end
until GT converges

Procedure(T, GT) constructs a grammar GT given a set T of positive examples. This procedure succeeds only when TL ⊆ T ⊂ L(G), where TL is a finite tell-tale set of the target language L = L(G) and G is a grammar for the target language equivalent to GT.

Definition 4.4.23 A language class L is identifiable in the limit iff there is a computable partial function φ : L × N → Σ∗ s.t. ∀L ∈ L · TL = {φ(L, n) | n ∈ N} is a finite subset of L, called a tell-tale set of L. [dlH10]

T. Yokomori in [Yok92] showed that any presentation of examples from a language L generated by a PD0L-system will, after some finite number of examples, contain a finite tell-tale set for L. This guarantees that, at some point, Procedure(T, GT) will successfully generate a grammar which describes the target language.

Let A be a learning algorithm for identifying a language class L in the limit. The following are 3 desirable properties of A:

Definition 4.4.24 We say that A is consistent on a language class L if, for all L ∈ L and for every positive presentation of L, the hypotheses produced by A are consistent with the training examples read up to that point in the computation.

Definition 4.4.25 We say that A is responsive on a language class L if, for all L ∈ L and for every positive presentation of L, A returns a hypothesis in response to any training example given, before accepting another example.

Definition 4.4.26 We say that A is conservative on a language class L if, for all L ∈ L and for every positive presentation of L, each hypothesis hi from A is equal to the previous hypothesis hi−1 unless hi−1 is inconsistent with the ith training example given. In other words, a conservative algorithm never changes its hypothesis unless the hypothesis fails to be consistent with the training examples.

T. Yokomori shows in [Yok92] that A1 is responsive but neither consistent nor conservative. Thus, he extends A1 and describes another algorithm for inferring PD0L-systems in the limit, A2, which is responsive, consistent and conservative.

4.5 Bibliographical Notes

Unless explicitly stated otherwise, the contents (definitions) of the sections listed under Section are based on (taken from) the corresponding work listed under Reference in table 4.7:

Section     Reference
4.1.1.1     [LN03]
4.1.2.1     [Yok03]
4.1.2.2     [Yos06]
4.1.3       [CE07]
4.1.3.1     [Yos08]
4.1.4       [Cla06]
4.2.1       [vZ01]
4.2.2       [ATV00]
4.2.3       [Cla07]
4.2.4       [Wol78]
4.2.5       [Gio94]
4.3.1       [LY90]
4.3.2       [SO94]
4.4.1       [dlHEJ07]
4.4.2       [KMT00]
4.4.3       [Yok92]

Table 4.7: Bibliographical Notes


Chapter 5
Empirical Evaluation of GI Algorithms

Ideally, the merit of a GI algorithm is determined by the class of languages it successfully learns under some specified learning model. The larger the class of languages, the more useful the algorithm is. This method of evaluation gives us a precise measure of the capabilities of an algorithm. By comparing the classes learned by different GI algorithms, we can make informed judgments on whether one algorithm is better than another. We can say that an algorithm A is better than an algorithm B if the language class learnt by A contains all the languages in the language class learnt by B.

However, this approach to evaluation may not always be possible. First of all, algorithms that learn under different learning models cannot be compared, due to their different notions of learning. Secondly, it may not always be possible to determine the class of languages an algorithm learns. This depends on the technique used in the algorithm: if a GI algorithm was not designed a priori to infer a particular class of languages, it can be difficult to determine this class afterwards by analysis. In particular, algorithms that use heuristics or that are empirically driven are more difficult to analyze.

The solution for evaluating such GI algorithms is to use empirical methods of evaluation. The idea is to evaluate the algorithm by its results and not through analysis. This eliminates the problems involved in the analysis and modeling of the algorithms. In this chapter we introduce 4 empirical evaluation techniques proposed by Starkie, Coste and van Zaanen in Progressing the State-of-the-art in Grammatical Inference by Competition [SCvZ05]. Furthermore, we introduce the Omphalos Context-Free Language Learning Competition [SCvZ05] as the only available benchmark in the GI community for evaluating grammatical inference algorithms on context-free languages.


5.1 Evaluation Techniques

5.1.1 Grammar Equality

One technique used to empirically evaluate GI algorithms is by checking the equality of grammars. First, a set of CFGs is constructed and, for each grammar, strings are randomly generated and passed to the GI algorithm. Then, the algorithm is evaluated by checking whether the output grammars exactly match the target grammars. Note that only the equality of the grammars themselves can be checked (not language equality). This is because the equivalence problem for context-free languages is undecidable (for any grammars A and B, there is no algorithm that checks whether the language generated by A is equal to that generated by B).

The advantage of this technique is that it can be applied to any target grammar; the complexity of the grammars to be learned does not make any difference. This means that the target grammars can be manually created or randomly generated. Also, the training data can be easily augmented by generating more random strings. Moreover, the training data can be generated in such a way that certain strings are preferred over others. For example, shorter strings can be given more preference than longer ones. This might then produce a training set which includes all the strings in the characteristic set that a GI algorithm would need in order to successfully learn the target grammar.

The main disadvantage of this technique is that it only checks whether the learning algorithm has exactly identified the target grammar and not the target language. Note that, for each context-free language, there are multiple grammars describing the same language. Thus, identifying grammars is much harder (in terms of learnability) than learning languages. Also, if two algorithms do not exactly match the target grammar, this technique has no way of measuring which one came closer.

5.1.2 Comparison of Derivation Trees

Another approach is to use unlabeled derivation trees (i.e. without non-terminal labels) as the structures that the GI algorithm should infer. Given a target grammar and a training set, we can build a derivation tree for each string in the training set. The task for the learner will be to infer a CFG and return a derivation tree (based on the inferred grammar) for each string in the language. Note that the derivation trees indirectly describe the inferred grammar.

The advantage of this approach is that we can rank GI algorithms according to the number of correctly inferred derivation trees. Moreover, a distance measure on trees can be used to rank the similarity between the target and inferred trees. Thus, if two GI systems produce two derivation trees for a particular string that do not exactly match the target one, we can measure which one came closer through a distance measure.

The main disadvantage of this technique is due to the ambiguity of CFGs. For each string in the training set, an ambiguous target grammar can derive an exponential number of derivation trees. This also applies to the inferred grammar. Thus, it might be the case that the target and inferred grammars are the same but their derivation trees for some strings are different.

5.1.3 Classification of Unseen Examples

The previous two evaluation methods measure the structure (grammar or derivation trees) inferred by GI algorithms. With classification of unseen examples, we can evaluate a GI algorithm on the language it infers rather than on the structure it uses to represent the language. With this approach, the GI system is first given a set of training examples (generated by the target CFG). Its task is to infer some language representation (not necessarily a grammar) that is able to decide whether any given unseen example is in the target language or not. Then, the GI system is evaluated on a set of unseen examples (the test set) which contains both negative and positive strings. The more unseen examples are classified correctly, the more confident we can be that the system has indeed inferred the correct language.

The main advantage of this approach is that it removes the problems involved in comparing structures. There is no room for ambiguity: if the target language is exactly equal to the inferred one, then the classification of unseen examples will always give perfect results. Also, this technique accommodates the evaluation of a wider range of GI algorithms which do not necessarily infer a CFG (e.g. algorithms that infer PDAs or SRSs).

The main setback of this technique lies in how to generate the test set. If it is too small, then it might not give a good idea of the actual performance of the GI algorithm. Also, the construction of negative examples is by itself a major problem. If the negative examples vary greatly from the positive ones, then it is possible that the learning algorithm infers a generic language that correctly classifies all the examples. On the other hand, if the negative examples are too similar to the positive ones, then it might be difficult for the learning algorithm to classify any of the examples. Figure 5.1 shows the difference between having negative examples that vary greatly from, or are very similar to, the positive examples.



Figure 5.1: The shaded part shows the space of languages which are consistent with the positive examples. In (A), it is easy to infer an inconsistent language. In (B), it is hard to infer a consistent language.

5.1.4 Precision and Recall

This technique gives a numeric measure based on how many unseen positive examples the inferred grammar correctly accepts (recall) and how many of the strings generated by the inferred grammar belong to the target language (precision). First, the learning algorithm is given a set of training examples and its task is to infer a CFG. Then, a previously unseen set of positive examples is tested on the inferred grammar. The percentage of these examples that the grammar accepts is measured. This is known as the recall measure. Next, strings are randomly generated from the inferred grammar. The percentage of these strings that are in fact in the target language is measured. This is known as the precision measure. Recall and precision can be combined into one measure, the F-score:

F-score = (2 · precision · recall) / (precision + recall)

This technique shares the advantages of using classification. Also, it returns a single number that can be used to directly compare different GI systems. One problem with this technique is that once a test example has been used to measure recall, it cannot be used again, because it could then serve as an additional training example. Also, unlike classification, this technique cannot be used on every GI system, since it requires the algorithm to return some language representation that is able to generate new positive examples.
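A minimal sketch of these three measures in Python; accepts and in_target are hypothetical placeholders for, respectively, the membership test of the inferred grammar (e.g. a parser) and an oracle for the target language:

    def recall(accepts, unseen_positives):
        # Fraction of unseen positive test strings accepted by the inferred grammar.
        return sum(1 for w in unseen_positives if accepts(w)) / len(unseen_positives)

    def precision(in_target, generated):
        # Fraction of strings sampled from the inferred grammar that belong
        # to the target language.
        return sum(1 for w in generated if in_target(w)) / len(generated)

    def f_score(p, r):
        # Harmonic mean of precision and recall.
        return 2 * p * r / (p + r) if p + r else 0.0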


5.2 The Omphalos Competition

The Omphalos Context-Free Language Learning Competition [SCvZ05] formed part of ICGI-2004 (the International Colloquium on Grammatical Inference). It was set up with the aim of promoting the development of new GI algorithms, better than the current state-of-the-art in GI. This was done through the random generation of CFGs, with their respective training and test sets, which no algorithm at that time could identify. The task of the competitors was to identify 5 unknown context-free languages from positive examples and another 5 from positive and negative examples (as shown in table 5.1). Classification of unseen examples is the evaluation technique used in Omphalos to determine whether a GI system identified the target language. The Omphalos system of evaluation ensures that no hill-climbing techniques could be used by the competitors: the system only informs the user whether all examples were classified correctly or not (i.e. no percentage results, nor any indication of which examples were classified incorrectly). Moreover, the system allows users to evaluate their systems only 25 times a day.

Prob. No.   +ve \ -ve   |Σ|   #Training Ex.   #Test Ex.
1           +ve & -ve   5     801             12000
2           +ve         5     280             12004
3           +ve & -ve   24    924             13016
4           +ve         24    418             17230
5           +ve & -ve   24    1238            17994
6           +ve         24    498             15888
7           +ve & -ve   236   2529            26815
8           +ve         260   1079            13026
9           +ve & -ve   249   1783            17981
10          +ve         243   1055            18925

Table 5.1: The benchmark problems of the Omphalos Competition

A. Clark's algorithm in [Cla07] (explained in section 4.2.3) was the only algorithm (and, to our knowledge, still is) that correctly identified problems of the Omphalos competition from positive data alone. These were problems 2 and 4.


Chapter 6
Proposal

We have seen that, for the problem of inferring non-trivial and relatively large context-free languages from positive data alone, only one algorithm gave some positive results: A. Clark's Omphalos algorithm. We propose a heuristic-based system which performs better than Clark's algorithm. To prove this, we shall try to solve some (or possibly all) of the yet unsolved problems in the Omphalos competition.

So far, we have defined a new normal form for context-free grammars, which divides the production rules into two sections: a cycle-free part and a cycle-inducing part. We claim that it is more 'natural' to learn grammars in our normal form. We also developed a new method, using an annotated substring lattice on the partial order induced by the substring relation, that extracts relevant information from the given positive data. We propose a search for the most specific hypothesis in the space of cycle-free grammars. This should be done by using the information extracted from the annotated substring lattice. Then, we plan to generalize the most specific grammar by using heuristics (or metaheuristics) to add rules to the cycle-inducing part of the grammar.

In this chapter, we first prove (by construction) that any context-free grammar can be transformed into our new normal form and still generate the same language. We then define our annotated substring lattice data structure. We conclude by showing how we intend to search for the most specific hypothesis.

6.1 Phrasal/Terminal Normal Form (PTNF)

Any grammar in CNF can be transformed into PTNF. A grammar in PTNF takes the following form:

A → β  or  A → α

where β is a sequence of one or more non-terminals and α is a sequence of one or more terminals. Production rules with a sequence of non-terminals in the RHS are called phrasal rules. Rules with a sequence of terminals in the RHS are called terminal rules.

We can prove that any CFG in CNF can be transformed into a CFG in PTNF and vice versa (and still generate the same language). The proof that CNF ⇒ PTNF is trivially true, since any production rule in CNF is itself in PTNF: production rules containing two non-terminals in the RHS (A → BC) are production rules with a sequence of one or more non-terminals in the RHS (A → β), and production rules with one terminal in the RHS (A → a) are rules with a sequence of one or more terminals in the RHS (A → α).

Now we prove the other direction, i.e. that CNF ⇐ PTNF. Any production rule in PTNF of the form N → a0 a1 a2 ... an (where the RHS is composed of a sequence of terminals) can be replaced by one rule in PTNF with a sequence of non-terminals in the RHS and a number of CNF rules:

N → A0 A1 A2 ... An
A0 → a0
A1 → a1
A2 → a2
...
An → an

where A0, A1, A2, ..., An are newly introduced non-terminals (different from the other non-terminals in the grammar). Any production rule in PTNF of the form N → B0 B1 B2 ... Bn (where the RHS is composed of a sequence of non-terminals) can be replaced by the following CNF rules:

N → B0 T0
T0 → B1 T1
T1 → B2 T2
...
Tn−2 → Bn−1 Bn

where T0, T1, T2, ..., Tn−2 are newly introduced non-terminals (different from the other non-terminals in the grammar).

Thus, if CNF ⇒ PTNF and CNF ⇐ PTNF then CNF ⇔ PTNF. This means that any CFL can be represented by a PTNF CFG.
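A minimal Python sketch of this two-step construction (an illustration of the proof; fresh is a hypothetical helper for generating unused non-terminal names, and rules are represented as (LHS, [RHS symbols]) pairs):

    import itertools

    _fresh = itertools.count()          # counter for fresh non-terminal names

    def fresh(prefix):
        # Hypothetical helper returning a new, unused non-terminal name.
        return prefix + str(next(_fresh))

    def terminal_rule_to_cnf(N, terminals):
        # N -> a0 a1 ... an becomes N -> A0 A1 ... An plus Ai -> ai, as in the
        # text.  The phrasal rule still has to be binarised afterwards if it
        # has more than two symbols.
        nts = [fresh("A") for _ in terminals]
        return [(N, nts)] + [(A, [a]) for A, a in zip(nts, terminals)]

    def phrasal_rule_to_cnf(N, nonterminals):
        # N -> B0 B1 ... Bn becomes the chain N -> B0 T0, T0 -> B1 T1, ...,
        # T(n-2) -> B(n-1) Bn.
        if len(nonterminals) <= 2:
            return [(N, list(nonterminals))]
        rules, left = [], N
        for B in nonterminals[:-2]:
            T = fresh("T")
            rules.append((left, [B, T]))
            left = T
        rules.append((left, nonterminals[-2:]))
        return rules

    print(phrasal_rule_to_cnf("N", ["B0", "B1", "B2", "B3"]))
    # [('N', ['B0', 'T0']), ('T0', ['B1', 'T1']), ('T1', ['B2', 'B3'])]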


6.2 Cycle Separation

We can define a relation similar to the derivational binary relation (see definition 2.1.5), which we name the reachability binary relation. The reachability binary relation over a grammar G =< N, Σ, P, S > in PTNF, denoted as ⇝G : N → (N ∪ Σ∗), is formally defined as follows: A ⇝G β ⇔ ∃u, v ∈ (Σ ∪ N)∗ : A → uβv ∈ P. Thus, A ⇝G β means that a string containing β (where β can either be a non-terminal or a sequence of terminals) can be reached from A using only one production rule of G.

Given any context-free grammar G =< N, Σ, P, S >, we can build a directed reachability graph RG as follows: RG =< N, {(A, β) | A ∈ N, β ∈ (N ∪ Σ∗), A ⇝G β} >. First of all, the graph is always weakly connected, because all non-terminals should be reachable from S, and thus there will always exist a path in the graph from S to any other non-terminal. Also, the graph can be cyclic, since any cycles in the grammar will be represented as cycles in the graph. For example, for this context-free grammar in PTNF:

S → AB
S → BC
A → AD
B → SCS
C → DE
D → EX
E → BY
S → abcxyz
A → abc
B → bc
X → xz
Y → yz

the reachability graph is:

Figure 6.1: Reachability Graph


Using DFS on RG (starting from S), we can identify all the cycles in the graph. If we remove the last edge from each cycle, the graph becomes acyclic. Using these last edges, we can transform the grammar in such a way that cycles are separated from the rest of the grammar. We can build a grammar which is split in two, where one part is cycle-free and the other includes the production rules which induce cycles. This can be done by the following algorithm:

Algorithm 9: Cycle Separation
Input: Grammar G =< N, Σ, P, S >
Output: Modified G

Build the reachability graph of G, RG(G)
Using DFS, identify all the cycles in RG(G)
foreach cycle C = N0 N1 ... Nn N0 in RG(G) do
    foreach production rule p ∈ P do
        if Nn ⇝G N0 using p then
            Substitute each occurrence of N0 in the RHS of p by N0′
            Add production rule N0′ → N0 to P
        end
    end
end

Thus, for the CFG in figure 6.1, the resulting grammar after cycle separation is:

Cycle-free part of the grammar:
S → AB
S → BC
A → A′D
B → S′CS′
C → DE
D → EX
E → B′Y
S → abcxyz
A → abc
B → bc
X → xz
Y → yz

Part which induces cycles in the grammar:
S′ → S
A′ → A
B′ → B
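A minimal Python sketch of this transformation, given only as an illustration: the grammar is represented as a list of (LHS, [RHS symbols]) rules with terminal blocks kept as single items, cycle-closing edges are detected as DFS back edges, and primes are simulated by appending an apostrophe to the non-terminal name.

    def reachability_graph(rules):
        # Edge A -> X for every symbol X appearing on the RHS of a rule of A.
        graph = {}
        for lhs, rhs in rules:
            graph.setdefault(lhs, set()).update(rhs)
        return graph

    def back_edges(graph, start):
        # DFS from the start symbol; edges that close a cycle are returned.
        found, visited, stack = set(), set(), set()
        def dfs(node):
            visited.add(node)
            stack.add(node)
            for succ in graph.get(node, ()):
                if succ in stack:
                    found.add((node, succ))
                elif succ not in visited:
                    dfs(succ)
            stack.discard(node)
        dfs(start)
        return found

    def separate_cycles(rules, start="S"):
        # Replace each cycle-closing occurrence by a primed non-terminal and
        # add the corresponding cycle-inducing rule, as in Algorithm 9.
        cuts = back_edges(reachability_graph(rules), start)
        cycle_free = [(lhs, [s + "'" if (lhs, s) in cuts else s for s in rhs])
                      for lhs, rhs in rules]
        inducing = [(t + "'", [t]) for t in {t for _, t in cuts}]
        return cycle_free, inducing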

6.3 Nonterminal Hierarchy in the Cycle-Free Part

After transforming the grammar into two parts using cycle separation, it is clear that the reachability graph of the cycle-free part of the grammar is a directed acyclic graph. For the context-free grammar example in the previous section, the reachability graph of the cycle-free part is:

Figure 6.2: Cycle-Free Reachability Graph

A topological sort of a DAG is a total ordering of its nodes in which each node comes before all nodes to which it has outbound edges. We define a similar but different ordering, which we call a hierarchical sort. A hierarchical sort of a DAG G = (V, E) consists of a finite hierarchy of sets H = H0, H1, H2, ..., Hn covering V, such that the intersection of any two sets Hx ∩ Hy is always ∅ and H0 ∪ H1 ∪ H2 ∪ ... ∪ Hn = V. The hierarchical sort is defined by these two conditions:

• ∀v : Hx · x > 0 ⇒ ∃u : Hx−1 · (u, v) ∈ E, and
• ∀v : Hx · x < n ⇒ ∀u : Hx+1 ∪ Hx+2 ∪ ... ∪ Hn · (u, v) ∉ E

The following is the algorithm for hierarchically sorting a DAG:


Algorithm 10: Hierarchical Sort
Input: DAG D =< V, E >; S ∈ V start vertex
Output: A sequence of sets H0, H1, H2, ..., Hn where H0 ∪ H1 ∪ H2 ∪ ... ∪ Hn = V

H0 ← {S}
hnum ← 1
Placed ← {S}
Current ← ∅
while Placed ≠ V do
    foreach e = (v0, v1) ∈ E do
        if v0 ∈ Placed ∧ v1 ∉ Placed then
            Current ← Current ∪ {v1}
        end
    end
    foreach w0 ∈ Current do
        foreach w1 ∈ Current do
            if w0 ≠ w1 then
                if there is a path in D from w0 to w1 then
                    Current ← Current \ {w1}
                end
            end
        end
    end
    Hhnum ← Current
    hnum ← hnum + 1
    Placed ← Placed ∪ Current
    Current ← ∅
end

If we only allow reachability graph edges which link nodes in neighboring sets (the neighboring sets of Hx are Hx−1 and Hx+1), then we end up with a hierarchical reachability graph. So, if we do this for the graph in figure 6.2, we will have the following graph:


Figure 6.3: Hierarchical Reachability Graph

The remaining edges which link nodes in non-neighboring hierarchy sets can be replaced by a sequence of edges passing through neighboring sets. This can be done by the addition of intermediary nodes. In our example, there are 3 edges linking non-neighboring nodes: (S, C), (A, D) and (C, E). The following is the hierarchical reachability graph after the intermediary nodes C0, D0 and E0 are added to link the non-neighboring nodes in figure 6.3 (the added edges/nodes are marked in red):


Figure 6.4: Hierarchical Reachability Graph with added Intermediary Nodes

Algorithm 11 shows how intermediary nodes can be added as new non-terminals:

Algorithm 11: Adding Intermediary Non-terminals
Input: Grammar G =< N, Σ, P, S >; Set of non-neighboring edges (u to v) with their respective number of intermediary nodes (i): NNE = {(u0, v0, i0), (u1, v1, i1), ..., (un, vn, in)}
Output: Modified G

foreach (u, v, i) ∈ NNE do
    foreach production rule p ∈ P do
        if u ⇝G v using p then
            Substitute each occurrence of v in the RHS of p by v0
            for j ← 1 to (i − 1) do
                Add production rule vj−1 → vj
            end
            Add production rule vi−1 → v
        end
    end
end


Thus, the grammar for our example after adding intermediary non-terminals will be (the production rules in bold are the new/modified ones):

Cycle-free part of the grammar, with added intermediary non-terminals:
S → AB
S → BC0
C0 → C
A → A′D0
D0 → D
B → S′CS′
C → DE0
E0 → E
D → EX
E → B′Y
S → abcxyz
A → abc
B → bc
X → xz
Y → yz

Part which induces cycles in the grammar:
S′ → S
A′ → A
B′ → B

Note that the leaf nodes will always be either the sequences of terminals (in our example: abcxyz, abc, bc, xz, yz) or the non-terminals added in the cycle separation process (in our example: S′, A′, B′). The other nodes, representing the non-terminals in the grammar, cannot be leaf nodes since, for the grammar to make sense, at some point they must reach a sequence of terminals. We can add edges to the hierarchical graph in such a way that the leaf nodes will consist only of the terminal nodes. Secondly, we can augment the graph with nodes such that all the terminal nodes (and only these) are placed in the last set of the hierarchy. The following is the algorithm for performing these two transformations:


Algorithm 12: Terminal nodes in the last hierarchical set as leaf nodes
Input: DAG D =< V, E >; S ∈ V start vertex; a sequence of sets H0 H1 H2 ... Hn
Output: Modified D with added nodes; a modified sequence of sets H0, H1, H2, ..., Hn with an added set Hn+1

foreach terminal node t ∈ V do
    counter ← 0
    while t ∉ Hn+1 do
        Let Hx be the set which contains vertex t
        Hx ← Hx \ {t}
        Hx+1 ← Hx+1 ∪ {t}
        Hx ← Hx ∪ {Ntcounter}
        Let v be the vertex which points to t
        Remove edge (v, t), i.e. E ← E \ {(v, t)}
        Add edges (v, Ntcounter) and (Ntcounter, t)
        counter ← counter + 1
    end
end
foreach leaf node N′ ∈ V do
    Let Hx be the set which contains vertex N′
    Choose any node M from Hx+1 such that there is a path from N to M
    Add edge (N′, M), i.e. E ← E ∪ {(N′, M)}
end

In the first foreach of the algorithm, nodes are added to the graph such that all the terminal nodes (and only these) are placed in the last set of the hierarchy. In the second foreach, edges are added to the graph such that all leaf nodes are terminal nodes. The resultant graph after applying algorithm 12 will satisfy the following conditions:

1. ∀(u, v) : E · u ∈ Hx ⇒ v ∈ Hx+1, and
2. ∀u : V · u ∉ Hn ⇒ ∃(u, v) ∈ E,

where Hn is the last set in the hierarchy. This means that (1) all edges link nodes from neighboring sets, and (2) all nodes (excluding those belonging to the last set in the hierarchy) have at least one outgoing edge.

By definition of the reachability graph, all nodes (apart from those in the first hierarchical set) have at least one incoming edge. The following is the resultant hierarchical reachability graph after applying algorithm 12 to the graph in figure 6.4 (the added edges/nodes are marked in red):

Figure 6.5: Hierarchical Reachability Graph with all terminal nodes in the last set as the only leaf nodes

The new non-terminals/edges of the reachability graph in figure 6.5 can be added to the grammar in a similar way to how the intermediary non-terminals were added in algorithm 11. The following are the production rules added to our example grammar:

S → N0abcxyz
N0abcxyz → N1abcxyz
N1abcxyz → N2abcxyz
N2abcxyz → N3abcxyz
N3abcxyz → N4abcxyz
N4abcxyz → N5abcxyz
N5abcxyz → abcxyz

A → N0abc
N0abc → N1abc
N1abc → N2abc
N2abc → N3abc
N3abc → N4abc
N4abc → abc

B → N0bc
N0bc → N1bc
N1bc → N2bc
N2bc → N3bc
N3bc → N4bc
N4bc → bc

X → N0xz
N0xz → N1xz
N1xz → xz

Y → N0yz
N0yz → yz

S′ → N2abcxyz
A′ → D
B′ → N0yz

6.4 Reversible Cycle-Free Part

Definition 6.4.1 A CFG is invertible if A → α, B → α ∈ P implies A = B for any A, B ∈ N and α ∈ (N ∪ Σ)∗.

Definition 6.4.2 A CFG is reset-free if A → αBβ, A → αCβ ∈ P implies B = C for any A, B, C ∈ N and α, β ∈ (N ∪ Σ)∗.

Definition 6.4.3 A CFG is reversible if it is invertible and reset-free.

Thus, for a CFG to be reversible, each of its production rules has to have a unique RHS (the invertible property) and it has to have exactly one rule for each non-terminal (the reset-free property), assuming useless non-terminals and exactly identical production rules are removed. This means that the invertible property does not hold when a PTNF grammar has rules of the form:

A1 → α
A2 → α
A3 → α
...
An → α

where α ∈ Σ∗ or α ∈ N∗. The reset-free property does not hold when a grammar has rules of the form:

A → α1
A → α2
A → α3
...
A → αn

where each α1, α2, α3, ..., αn ∈ N∗.

Y. Sakakibara in [Sak90] showed that every CFL can be generated by a reversible CFG. Thus, by restricting the cycle-free part of a grammar in our normal form to be reversible, we are not restricting the space of languages it can generate. Moreover, the way in which a CFG is transformed into a reversible CFG does not affect the hierarchy of non-terminals we described in the previous section.
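A minimal Python sketch that checks the two properties of Definitions 6.4.1 and 6.4.2 on a grammar given as (LHS, [RHS symbols]) rules; this is only an illustration of the definitions, not part of [Sak90]:

    def is_invertible(rules):
        # No two rules with distinct left-hand sides share the same right-hand side.
        by_rhs = {}
        for lhs, rhs in rules:
            by_rhs.setdefault(tuple(rhs), set()).add(lhs)
        return all(len(lhss) == 1 for lhss in by_rhs.values())

    def is_reset_free(rules):
        # Whenever A -> alpha B beta and A -> alpha C beta are both rules, B = C,
        # i.e. no two distinct rules of the same non-terminal differ in exactly
        # one RHS position (identical duplicate rules are removed first).
        rules = list({(lhs, tuple(rhs)) for lhs, rhs in rules})
        for lhs1, rhs1 in rules:
            for lhs2, rhs2 in rules:
                if lhs1 != lhs2 or rhs1 == rhs2 or len(rhs1) != len(rhs2):
                    continue
                if sum(a != b for a, b in zip(rhs1, rhs2)) == 1:
                    return False
        return True

    def is_reversible(rules):
        return is_invertible(rules) and is_reset_free(rules)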


6.5 Annotated Substring Lattice

Given a set of positive examples P, we can easily generate the set Psubs of its substrings. The set Psubs under the (proper) substring relation gives us a strict (i.e. irreflexive) partial order. This is because the relation is irreflexive (a string w is not a proper substring of itself), antisymmetric (if w′ is a proper substring of w, then w is not a proper substring of w′) and transitive (if w′ is a substring of w and w″ is a substring of w′, then w″ is a substring of w). Thus, we can represent the strings in Psubs in a lattice structure under the substring relation. The strings of length 1 will be the lower bounds and some of the strings in P will be the upper bounds (note that if a ∈ P is a substring of b ∈ P, then a will not be an upper bound). We can add information to this structure (i.e. annotate it) by highlighting nodes and arrows with different colours to represent other relations. Figure 6.6 shows an annotated substring lattice, where:

• The number in brackets in each node represents the number of occurrences of the string as a substring in P.
• Nodes in red represent strings in P.
• Nodes with green circumferences represent strings which are prefixes of one or more strings in P.
• Nodes with blue circumferences represent strings which are suffixes of one or more strings in P.
• Nodes with orange circumferences represent strings which are both prefixes and suffixes of one or more strings in P.
• A green arrow from string u to v means that v = au, where |a| = 1.
• A blue arrow from string u to v means that v = ua, where |a| = 1.

Based on this information, we can annotate the lattice with more colours, from which more information can be extracted:

• The nodes coloured in magenta represent the strings which have 2 or more different 1-left-contexts and 2 or more different 1-right-contexts.
• The number in square brackets in each magenta node represents the number of magenta nodes which are ancestors of that node.


Definition 6.5.1 The 1-left-contexts of a string w ∈ Psubs w.r.t. P are all the different symbols a ∈ Σ ∪ {#} (where # ∉ Σ) such that aw is a substring of some string #u#, u ∈ P.

Definition 6.5.2 The 1-right-contexts of a string w ∈ Psubs w.r.t. P are all the different symbols a ∈ Σ ∪ {#} (where # ∉ Σ) such that wa is a substring of some string #u#, u ∈ P.
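The information used to annotate the lattice can be computed directly from the sample. The following is a minimal Python sketch (plain dictionaries are used instead of an explicit lattice structure) that collects, for every substring, its occurrence count, prefix/suffix flags, its 1-left- and 1-right-contexts as in Definitions 6.5.1 and 6.5.2, and the 'magenta' flag described above:

    from collections import Counter

    def annotate_substrings(P):
        # '#' marks the string boundaries, as in Definitions 6.5.1 and 6.5.2.
        counts, left, right = Counter(), {}, {}
        prefixes, suffixes = set(), set()
        for u in P:
            marked = "#" + u + "#"
            for i in range(len(u)):
                prefixes.add(u[:i + 1])
                suffixes.add(u[i:])
                for j in range(i + 1, len(u) + 1):
                    w = u[i:j]
                    counts[w] += 1
                    left.setdefault(w, set()).add(marked[i])        # symbol before w
                    right.setdefault(w, set()).add(marked[j + 1])   # symbol after w
        annotations = {}
        for w in counts:
            annotations[w] = {
                "count": counts[w],
                "in_P": w in P,
                "is_prefix": w in prefixes,
                "is_suffix": w in suffixes,
                "left_contexts": left[w],
                "right_contexts": right[w],
                # 'magenta': at least two different 1-left- and 1-right-contexts
                "magenta": len(left[w]) >= 2 and len(right[w]) >= 2,
            }
        return annotations

    ann = annotate_substrings({"abxy", "xyab", "abxyab"})
    print(ann["ab"]["count"], ann["ab"]["magenta"])   # 4 True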



Figure 6.6: Annotated Substring Lattice for the set of strings: {abxy, xyab, abxyab, ababxy, xyabxyab, abxyabab}. Note that nodes which have to be both red and magenta are coloured magenta and the text ’RED’ is added in the node

6.6 Search for the Most Specific Hypothesis

The most specific hypothesis is simply the CFG that generates exactly the positive data. It should not contain cycles, since it only generates a finite language. By using G. Wolff's idea of the substitution operation [Wol78], described in section 4.2.4, we can build the most specific hypothesis CFG by simply selecting substrings from the training examples and substituting them with newly introduced non-terminals (we do not need the merge operator since no generalization is needed). We restrict Wolff's substitution operation by only allowing the selection of terminals (not non-terminals) from the training set. In other words, we keep substituting substrings from the training set until it is covered with non-terminals. Let the set of non-terminals introduced at this stage be NT1. We then repeat this process by treating the newly introduced non-terminals as terminals and covering them with another set of newly added non-terminals, which makes up the set NT2. This process is repeated until no substitution can be made on the training set. We can note two important facts about the resulting CFG after this process:

1. It will always be reversible (see definition 6.4.3). First of all, no two production rules will have the same LHS, because we are always introducing new non-terminals. Secondly, no two production rules will have the same RHS, since we substitute each occurrence of the selected strings.

2. Its non-terminals will form a hierarchy as explained in section 6.3. This is because, given any non-terminal A in a set NTx, x > 1, A can only reach non-terminals found in NTx−1.

We have proved in the previous sections of this chapter that any CFL can be represented by a CFG with a reversible cycle-free part, where the non-terminals form a hierarchy (as explained in section 6.3). Thus, this approach for building the most specific hypothesis grammar can be used to build any CFL (if the appropriate production rules are used in the cycle-inducing part of the grammar). This is an important result since it gives us the potential of learning any CFL, given that the correct substrings are selected for the cycle-free part and the correct production rules are added in the cycle-inducing part. The hardness of the learning problem now lies in:

1. the selection process for building the cycle-free part, and
2. choosing the rules for the cycle-inducing part.
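The layered substitution process described above can be sketched in a few lines of Python. This is only an illustration: choose_substrings is a hypothetical placeholder for the selection strategy (which we intend to drive with the annotated substring lattice), and the final S-rules are added here so that the resulting grammar generates exactly the training examples.

    def replace_all(seq, pattern, symbol):
        # Replace every occurrence of the sub-sequence `pattern` in `seq` by `symbol`.
        out, i = [], 0
        while i < len(seq):
            if seq[i:i + len(pattern)] == pattern:
                out.append(symbol)
                i += len(pattern)
            else:
                out.append(seq[i])
                i += 1
        return out

    def most_specific_grammar(examples, choose_substrings):
        # Build the most specific hypothesis grammar by repeated substitution.
        sequences = [list(w) for w in examples]      # level 0: terminals only
        rules, counter = [], 0
        while True:
            patterns = choose_substrings(sequences)  # sub-sequences to replace, [] to stop
            if not patterns:
                break
            for pattern in patterns:
                nt = "N" + str(counter)
                counter += 1
                rules.append((nt, list(pattern)))
                sequences = [replace_all(s, list(pattern), nt) for s in sequences]
        # One rule per (rewritten) training example, so that exactly the
        # positive data is generated.
        rules += [("S", list(s)) for s in set(map(tuple, sequences))]
        return rules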


We propose the use of an annotated Hasse diagram (explained in section 6.5) as a search space for the selection process. The multiple relations on the Hasse diagram give us important information about the given data by highlighting patterns. This will guide the selection process in choosing the best possible substrings needed to build the most specific hypothesis grammar.

The final stage, after building the most specific grammar, is to generalize it. This is done by adding cycle-inducing rules of the form A → B. This technique is more effective than the merging operation proposed by G. Wolff [Wol78]: by merging two non-terminals A and B, what one is essentially doing is adding the rules A → B and B → A, and the generalization involved in this step is greater than that of adding just one rule. We can show this with a simple example. Let us consider the following CFG, which generates a language containing just one string, {ab}:

S → AB
A → a
B → b

If we first merge A with B, the resulting grammar would be:

S → AA
A → a
A → b

and if we then merge S with A, the grammar would be:

S → SS
S → a
S → b

This CFG accepts any non-empty string over Σ = {a, b}. Thus, with two merges, we generalized to the most generic language. Now, if instead of merging we include rules of the form X → Y, the resulting grammar is not as generalized. So, if we include the rules A → B and A → S, the resulting grammar would be:

S → AB
A → a
B → b
A → B
A → S

The language accepted by this grammar is {bⁿ | n ∈ N, n > 1} ∪ {abᵐ | m ∈ N, m > 0}.


Bibliography [Ang80]

Dana Angluin. Inductive inference of formal languages from positive data. Information and Control, 45(2):117 – 135, 1980.

[Ang87]

D. Angluin. Learning k-bounded context-free grammars. Technical Report YALEU/ DCS/RR-557, Department of Computer Science, Yale University, August 1987.

[Ang88]

Dana Angluin. Queries and concept learning. Mach. Learn., 2(4):319–342, 1988.

[AP64]

V. Amar and Gianfranco R. Putzolu. On a family of linear grammars. Information and Control, 7(3):283–291, 1964.

[ATV00]

Pieter Adriaans, Marten Trautwein, and Marco Vervoort. Towards high speed grammar induction on large text corpora. In Vclav Hlavc, Keith Jeffery, and Jir Wiedermann, editors, SOFSEM 2000: Theory and Practice of Informatics, volume 1963 of Lecture Notes in Computer Science, pages 58–75. Springer Berlin / Heidelberg, 2000.

[BEHW89] Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the vapnik-chervonenkis dimension. J. ACM, 36(4):929–965, 1989. [BO93]

Ronald V. Book and Friedrich Otto. String-rewriting systems. SpringerVerlag, London, UK, 1993.

[BPSW70] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. The Annals of Mathematical Statistics, 41:164–171, 1970. [CE07]

Alexander Clark and R´emi Eyraud. Polynomial identification in the limit of substitutable context-free languages. Journal of Machine Learning Research, 8:1725–1745, Aug 2007. 97

[Cha96]

Eugene Charniak. Statistical Language Learning. MIT Press, Cambridge, MA, USA, 1996.

[Cho56]

Noam Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2:113–124, 1956.

[Cla06]

Alexander Clark. PAC-learning unambiguous NTS languages. In Proceedings of the 8th International Colloquium on Grammatical Inference (ICGI), pages 59–71, 2006.

[Cla07]

Alexander Clark. Learning deterministic context free grammars: the Omphalos competition. Machine Learning, 66(1):93–110, January 2007.

[dlH10]

Colin de la Higuera. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, New York, NY, USA, 2010.

[dlHEJ07] Colin de la Higuera, R´emi Eyraud, and Jean-Christophe Janodet. Lars: A learning algorithm for rewriting systems. Machine Learning, 66(1):7–31, 2007. [Gio94]

Jean Giordano. Inference of context-free grammars by enumeration: Structural containment as an ordering bias. In Rafael Carrasco and Jose Oncina, editors, Grammatical Inference and Applications, volume 862 of Lecture Notes in Computer Science, pages 212–221. Springer Berlin / Heidelberg, 1994.

[Gol67]

E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967.

[Gro]

Radu Grosu. Properties of context-free grammars, lecture slides from the cse 350: Theory of computation course at stony brook university. http://www.cs.sunysb.edu/ cse350/ as accessed on July 25th 2010.

[Har51]

Zellig Harris. Methods in Structural Linguistics. University of Chicago Press, Chicago, 1951.

[HR80]

H. B. Hunt, III and D. J. Rosenkrantz. Efficient algorithms for structural similarity of grammars. In POPL ’80: Proceedings of the 7th ACM SIGPLANSIGACT symposium on Principles of programming languages, pages 213–219, New York, NY, USA, 1980. ACM.

[HU79]

JE Hopcroft and JD Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Massachusetts, 1979.

98

[JM09]

Dan Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall series in artificial intelligence. Prentice Hall, Pearson Education International, Englewood Cliffs, NJ, 2. ed., [pearson international edition] edition, 2009.

[KMT00]

Takeshi Koshiba, Erkki M¨akinen, and Yuji Takada. Inferring pure context-free languages from positive data. Acta Cybern., 14(3):469–477, 2000.

[KV94]

Michael Kearns and Leslie Valiant. Cryptographic limitations on learning boolean formulae and finite automata. J. ACM, 41(1):67–95, 1994.

[LC10]

Shalom Lappin and Alexander Clark. Computational learning theory and poverty of the stimulus arguments. NASSLLI 2010, 2010.

[Lee96]

Lillian Lee. Learning of context-free languages: A survey of the literature. Technical Report TR-12-96, Harvard University, 1996.

[LN03]

J. A. Laxminarayana and G. Nagaraja. Efficient inference of a subclass of even linear languages. Department of Computer Science and Engineering, Indian Institute of Technology, Mumbai, 2003.

[LPP98]

Kevin Lang, Barak Pearlmutter, and Rodney Price. Results of the abbadingo one dfa learning competition and a new evidence-driven state merging algorithm, 1998.

[LS00]

Pat Langley and Sean Stromsten. Learning context-free grammars with a simplicity bias. In ECML ’00: Proceedings of the 11th European Conference on Machine Learning, pages 220–228, London, UK, 2000. Springer-Verlag.

[LY90]

K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language, 4(1):35 – 56, 1990.

[Maa96]

Erkki Maakinen. A note on the grammatical inference problem for even linear languages. Fundam. Inf, 25:25–175, 1996.

[Max]

Dama Max. Trading optimization: Simulated annealing. http://www.maxdama.com/2008/07/trading-optimization-simulated.html as accessed on August 19th 2010.

[McN67]

Robert McNaughton. Parenthesis grammars. J. ACM, 14(3):490–500, 1967.

[Mit97]

Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[MSW80]

H. A. Maurer, A. Salomaa, and D. Wood. Pure grammars. Information and Control, 44(1):47–72, 1980.

[MYS98]

Yasuhito Mukouchi, Ikuyo Yamaue, and Masako Sato. Inferring a rewriting system from examples. In Setsuo Arikawa and Hiroshi Motoda, editors, Discovery Science, volume 1532 of Lecture Notes in Computer Science, pages 568–568. Springer Berlin / Heidelberg, 1998.

[NKN02]

Martin A. Nowak, Natalia L. Komarova, and Partha Niyogi. Computational and evolutionary aspects of language. Nature, 417(6889):611–617, June 2002.

[PPSH04]

Georgios Petasis, Georgios Paliouras, Constantine D. Spyropoulos, and Constantine Halatsis. eg-GRIDS: Context-free grammatical inference from positive examples using genetic search. In Grammatical Inference: Algorithms and Applications, 7th International Colloquium, ICGI 2004, Athens, Greece, October 11-13, 2004, Proceedings, volume 3264 of Lecture Notes in Computer Science, pages 223–234. Springer, 2004.

[Ree93]

Colin R. Reeves. Modern Heuristic Techniques for Combinatorial Problems. Halsted Press, New York, 1993.

[RN88]

V. Radhakrishnan and G. Nagaraja. Inference of even linear grammars and its applications to picture description languages. Pattern Recogn., 21(1):55–62, 1988.

[Sak90]

Yasubumi Sakakibara. Learning context-free grammars from structural data in polynomial time. Theor. Comput. Sci., 76(2-3):223–242, 1990.

[SCvZ04]

Bradford Starkie, François Coste, and Menno van Zaanen. The Omphalos context-free grammar learning competition. In Grammatical Inference: Algorithms and Applications, 7th International Colloquium, ICGI 2004, Athens, Greece, October 11-13, 2004, Proceedings, pages 16–27, 2004.

[SCvZ05]

Bradford Starkie, François Coste, and Menno van Zaanen. Progressing the state-of-the-art in grammatical inference by competition: The Omphalos context-free language learning competition. AI Commun., 18(2):93–115, 2005.

[SG94]

Jose Sempere and Pedro García. A characterization of even linear languages and its application to the learning problem. In Rafael Carrasco and Jose Oncina, editors, Grammatical Inference and Applications, Lecture Notes in Computer Science, pages 38–44. Springer Berlin / Heidelberg, 1994.

[Sip05]

Michael Sipser. Introduction to the Theory of Computation. Course Technology, 2 edition, February 2005.

[SO94]

Andreas Stolcke and Stephen M. Omohundro. Inducing probabilistic grammars by Bayesian model merging. In ICGI ’94: Proceedings of the Second International Colloquium on Grammatical Inference and Applications, pages 106–118, London, UK, 1994. Springer-Verlag.

[Sta04]

Bradford C. Starkie. Identifying languages in the limit using alignment-based learning. PhD thesis, 2004.

[Tak88]

Yuji Takada. Grammatical inference for even linear languages based on control sets. Information Processing Letters, 28(4):193–199, 1988.

[TK07]

Cristina Tîrnăucă and Timo Knuutila. Efficient language learning with correction queries. Technical Report 822, Turku Centre for Computer Science, May 2007.

[Val84]

L. G. Valiant. A theory of the learnable. In STOC ’84: Proceedings of the sixteenth annual ACM symposium on Theory of computing, pages 436–445, New York, NY, USA, 1984. ACM.

[VC71]

V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264–280, 1971.

[vZ01]

Menno van Zaanen. ABL: Alignment-based learning. CoRR, 2001.

[vZ03]

Menno van Zaanen. Theoretical and practical experiences with Alignment-Based Learning. In Proceedings of the Australasian Language Technology Workshop; Melbourne, Australia, pages 25–32, December 2003.

[vZA01]

Menno van Zaanen and Pieter Adriaans. Alignment-Based Learning versus EMILE: A comparison. In Proceedings of the Belgian-Dutch Conference on Artificial Intelligence (BNAIC); Amsterdam, the Netherlands, pages 315–322, October 2001.

[WF74]

Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974.

[Wol78]

J. Gerard Wolff. Grammar discovery as data compression. In AISB/GI (ECAI), pages 375–379, 1978.

[Wri89]

Keith Wright. Identification of unions of languages drawn from an identifiable class. In COLT ’89: Proceedings of the second annual workshop on Computational learning theory, pages 328–333, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc.

[Yok92]

Takashi Yokomori. Inductive inference of 0L languages. In Rozenberg and Salomaa, editors, Lindenmayer Systems, pages 115–132, 1992.

[Yok03]

Takashi Yokomori. Polynomial-time identification of very simple grammars from positive data. Theor. Comput. Sci., 298(1):179–206, 2003.

[Yos06]

Ryo Yoshinaka. Polynomial-time identification of an extension of very simple grammars from positive data. In Grammatical Inference: Algorithms and Applications, volume 4201 of Lecture Notes in Computer Science, pages 45–58. Springer Berlin / Heidelberg, 2006.

[Yos08]

Ryo Yoshinaka. Identification in the limit of k,l-substitutable context-free languages. In ICGI ’08: Proceedings of the 9th international colloquium on Grammatical Inference, pages 266–279, Berlin, Heidelberg, 2008. Springer-Verlag.
