Learning Probabilistic Automata

Colin de la Higuera

May 4, 2014

Abstract

Probabilistic finite state automata define probability distributions over strings. The model is equivalent to hidden Markov models, and has many interesting extensions to tree automata, probabilistic context-free grammars and transducers. A number of properties concerning such finite state machines are worth studying, parsing issues among them. State merging techniques have been proposed in order to learn deterministic versions. Other forms of learning are possible.

1 Organisation of the tutorial

Note: the tutorial's organisation takes into account the talk to be given by Ricard Gavaldà at Wata 2014. Important, exciting new results concerning learning probabilistic finite state machines will therefore not be reported in this talk. Readers more interested in convergence proofs and spectral methods are encouraged to turn to his presentation for more complete references.

• About grammatical inference and distributions
  – A bit of history
  – Some motivations
• About probabilistic finite state automata
  – Representation issues
  – Parsing issues
  – Distances
  – Most probable string
• Estimation issues
  – Learning Pfa: Em
  – Learning Pfa: Gibbs Sampling
  – Smoothing
• Learning models
  – Identification in the limit with probability one
  – PAC learning
• Learning Dpfa: State merging
  – Alergia and variants
  – Dsai
  – Mdi
• Extending learning
  – Probabilistic transducers
  – Probabilistic context-free grammars
  – Benchmarks and competitions (PAutomaC)

2 Some references

In this section we point out a number of papers and books which may be of use when studying the topic. The presentation is necessarily incomplete.

2.1 Motivations

The first recorded attempt to learn grammars from data was by Horning [1969], who proposed to learn probabilistic grammars for natural language processing; since then, this line has been followed by a number of researchers. Strong arguments in favour of this approach can be found in [Clark and Lappin, 2011]. Yet, before that, Solomonoff [1997] (the date of the reference is misleading, as his work on the subject was done at the end of the 1950s) worked on ideas concerning inductive inference; there also, probabilistic context-free grammars were used and learnt, and led to his definitions of intrinsic complexity. Horning [1969] proved that probabilistic context-free grammars are identifiable with probability one and also proposed an alternative empirical algorithm, relying on finding a (context-free) grammar giving a good compromise between the quality of the probabilities and its simplicity. This approach was to be followed several times later, for example by algorithm Mdi [Thollard et al., 2000]. Among the applications of algorithms that learn probabilistic deterministic finite state automata (Dpfa), one can find text and document analysis [Young-Lai and Tompa, 2000] and web page classification [Goan et al., 1996].

2.2 PFA

The earliest reference about probabilistic finite automata is the book by Paz [1971]; more recent material can be found in [Vidal et al., 2005] and [de la Higuera, 2010].

2.2.1 Relation with other models

Among other probabilistic finite state machines one can find: hidden Markov models (Hmms) [Rabiner, 1989, Jelinek, 1998], probabilistic regular grammars [Carrasco and Oncina, 1994], Markov chains [Saul and Pereira, 1997], n-grams [Ney et al., 1997, Jelinek, 1998], probabilistic suffix trees [Ron et al., 1994], deterministic probabilistic automata (Dpfa) [Carrasco and Oncina, 1994], and weighted automata [Mohri, 1997]. Dupont et al. [2005] prove the equivalence between Hmms and Pfa; an alternative proof is given by Vidal et al. [2005].

2.2.2 Parsing issues

The parsing algorithms are now well known; the Forward algorithm is described by Baum et al. [1970], and the Viterbi algorithm is named after Viterbi [1967]. Another problem related to parsing is the computation of the probability of all strings sharing a given prefix, suffix or substring in a Pfa [Fred, 2000].
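
To make the Forward computation concrete, here is a minimal Python sketch for computing the probability of a string under a Pfa; the dictionary-based representation (initial, transition and final probabilities) is an assumption made for this example, not a standard interface.

```python
def forward_probability(init, trans, final, w):
    """Probability assigned to the string w by a Pfa.

    init[q]     -- probability of starting in state q
    trans[q][a] -- list of (q2, p): reading a in q moves to q2 with probability p
    final[q]    -- probability of stopping in state q
    """
    # alpha[q] = probability of emitting the prefix read so far and being in state q
    alpha = dict(init)
    for a in w:
        nxt = {}
        for q, p in alpha.items():
            for q2, pt in trans.get(q, {}).get(a, []):
                nxt[q2] = nxt.get(q2, 0.0) + p * pt
        alpha = nxt
    return sum(p * final.get(q, 0.0) for q, p in alpha.items())


# Toy two-state Pfa over {a, b}: state 0 loops on a, moves to 1 on b, stops in 1.
init = {0: 1.0}
trans = {0: {"a": [(0, 0.5)], "b": [(1, 0.5)]}, 1: {}}
final = {0: 0.0, 1: 1.0}
print(forward_probability(init, trans, final, "aab"))  # 0.5 * 0.5 * 0.5 = 0.125
```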

2.2.3 About the power of expression

Several interesting questions are raised in Guttman [2006]'s PhD thesis: for instance, how reasonable is it to attempt to approximate an unknown distribution with a regular one? He proves that, for a fixed bounded number of states n, the best Pfa with at most n states can be arbitrarily bad. The proof that Pfa are strictly more powerful than Dpfa can be found in a number of texts, including [Vidal et al., 2005, de la Higuera, 2010]. On the other hand, Pfa with λ-transitions are equivalent to those without. More practically, there exist various algorithms for eliminating λ-transitions; some authors have provided alternative parsing algorithms to deal with λ-Pfa [Picó and Casacuberta, 2001]. Mohri et al. [2000] eliminate λ-transitions by first running the Floyd-Warshall algorithm in order to obtain the λ-transition distances before removing these transitions. Hanneforth [2008] points out that when dealing with automata obtained automatically, once the λ-transitions have been eliminated, some pruning is needed, as some states may no longer be accessible. Hanneforth and de la Higuera [2010] give an alternative algorithm for the same task.

2.2.4 Distances

Pfa model distributions over the set of all strings. Therefore, each Pfa can be viewed as a vector in infinite dimension, and there are several ways to measure how close two vectors (distributions) are to one another (see for example [Cover and Thomas, 1991]). In the context of Hmms, and with intended bio-informatics applications, intractability results were given by Lyngsø et al. [1999], Lyngsø and Pedersen [2001]: they showed that many distances could not be computed. Interestingly, being able to compute a distance also makes it possible to decide the equivalence of two Pfa. This was noticed by Cortes et al. [2006], using an algorithm by Balasubramanian [1993] which computes the Euclidean distance. More work along these lines has been done by Carrasco [1997] and Calera-Rubio and Carrasco [1998] for the KL divergence (for string languages and tree languages), by Carrasco and Rico-Juan [2003] for the L2 distance between trees, and by Murgue and de la Higuera [2004] for the L2 distance between strings. Lyngsø et al. [1999], Lyngsø and Pedersen [2001] introduced the co-emission probability, an idea also used to build kernels. Cortes et al. [2006] proved that computing the L1 distance is intractable; this is also the case for each Ln with n odd. Guttman [2006] studied the influence of the different distance norms on the hardness of the learning problem, extending a result by Kearns et al. [1994] obtained for the KL divergence.
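
To illustrate how such computations can look in the deterministic case, the sketch below approximates the co-emission probability of two Dpfa by iterating its recursive definition and derives the L2 distance from it. The dictionary representation, the single initial state 0 and the fixed number of iterations are assumptions made for this example; the cited works solve the corresponding linear system exactly.

```python
import math

def coemission(A, B, iterations=200):
    """Approximate sum over all strings w of P_A(w) * P_B(w) for two Dpfa.

    A Dpfa is a dict with:
      "delta": delta[q][a] = next state,   "prob": prob[q][a] = transition probability,
      "final": final[q] = stopping probability,   initial state is 0.
    """
    # C[(q1, q2)] ~ sum over strings w of P_A(w from q1) * P_B(w from q2)
    C = {(q1, q2): 0.0 for q1 in A["final"] for q2 in B["final"]}
    for _ in range(iterations):
        for (q1, q2) in C:
            total = A["final"][q1] * B["final"][q2]
            for a, r1 in A["delta"].get(q1, {}).items():
                if a in B["delta"].get(q2, {}):
                    r2 = B["delta"][q2][a]
                    total += A["prob"][q1][a] * B["prob"][q2][a] * C[(r1, r2)]
            C[(q1, q2)] = total
    return C[(0, 0)]

def l2_distance(A, B):
    """L2 distance between the distributions defined by two Dpfa."""
    return math.sqrt(max(0.0, coemission(A, A) - 2 * coemission(A, B) + coemission(B, B)))
```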

2.2.5 Most probable string

A more complicated question is that of finding the most probable string in a distribution defined by a Pfa. In the general case the problem is intractable [Casacuberta and de la Higuera, 2000], and some associated problems are undecidable [Blondel and Canterini, 2003] (for general weighted automata), but in the deterministic case a polynomial algorithm can be obtained by dynamic programming. A recent parameterized algorithm is given in [de la Higuera and Oncina, 2014]. The problem of finding the optimal translation given a transducer is called optimal decoding: it is NP-hard [Casacuberta and de la Higuera, 2000]. Recent work provides techniques allowing this string to be computed in many cases [de la Higuera and Oncina, 2013].
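
For the deterministic case, the dynamic programming idea can be sketched as follows: since removing a cycle from a path never decreases the probability of the corresponding string, some most probable string uses fewer symbols than there are states, so it suffices to extend best prefixes length by length. The Dpfa representation (same as in the previous sketch, initial state 0) is an assumption made for this example.

```python
def most_probable_string(dpfa):
    """Most probable string of a Dpfa.

    dpfa["delta"][q][a] = next state, dpfa["prob"][q][a] = transition probability,
    dpfa["final"][q]    = stopping probability, initial state 0.
    """
    n = len(dpfa["final"])
    # best[q] = (probability, string) of the best prefix of the current length ending in q
    best = {0: (1.0, "")}
    champion = (dpfa["final"][0], "")           # the empty string is the first candidate
    for _ in range(n - 1):                      # some most probable string needs < n symbols
        nxt = {}
        for q, (p, s) in best.items():
            for a, q2 in dpfa["delta"].get(q, {}).items():
                cand = p * dpfa["prob"][q][a]
                if q2 not in nxt or cand > nxt[q2][0]:
                    nxt[q2] = (cand, s + a)
        best = nxt
        for q, (p, s) in best.items():
            champion = max(champion, (p * dpfa["final"][q], s))
    return champion                             # (probability, string)
```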

2.3 About estimation

An option is to suppose that the architecture of the finite state machine is given, and the goal is then only to find the best values for the numerical parameters. If the architecture corresponds to a deterministic automaton, things are relatively simple, as noticed by Wetherell [1980].
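
In that deterministic case, the maximum-likelihood estimates reduce to relative frequencies of the transitions and stops observed when parsing the learning sample through the given structure; a minimal sketch (the dictionary representation is an assumption of this example) is:

```python
from collections import defaultdict

def estimate_dpfa_probabilities(delta, sample, q0=0):
    """Relative-frequency estimates for a Dpfa whose structure delta[q][a] is given."""
    use = defaultdict(lambda: defaultdict(int))   # use[q][a] = times transition (q, a) is taken
    stop = defaultdict(int)                       # stop[q]   = times a string ends in state q
    for w in sample:
        q = q0
        for a in w:
            use[q][a] += 1
            q = delta[q][a]                       # assumes every sample string is parsable
        stop[q] += 1
    prob, final = {}, {}
    for q in delta:
        total = sum(use[q].values()) + stop[q]
        # unvisited states get arbitrary parameters (here: always stop)
        final[q] = stop[q] / total if total else 1.0
        prob[q] = {a: use[q][a] / total for a in delta[q]} if total else {}
    return prob, final
```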

2.3.1 Baum-Welch

When the underlying machine is not deterministic the question is, of course, more interesting, and corresponds to solving an optimization problem. One alternative is to use the probability of the best path instead of the total probability in the function to be optimised: this allows the use of a simpler algorithm, called the Viterbi re-estimation algorithm, which is discussed by Casacuberta [1996]. The expectation maximisation ideas were first presented for hidden Markov models and Pfa by Baum et al. [1970], Baum [1972]. Convergence issues of the algorithm are discussed in [Chaudhuri and Rao, 1986]. Obtaining the global optimum is itself a hard problem, as shown by Abe and Warmuth [1992]; this is why only local-optimum solutions to the optimisation problem seem possible. The relation between the probability of the optimal path of states and the probability of generating a string has been studied in [Sánchez et al., 1996]. Obviously, the real question is: where does the structure come from? This is why Stolcke [1994] proposed a method not only to estimate the probabilities but also to learn the structure of hidden Markov models.
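
As an illustration of the Viterbi re-estimation idea, the following sketch parses each string of the sample along its most probable path and re-fits all probabilities as relative frequencies of the events counted on those paths; the representation matches the Forward sketch above, and the whole block is an assumption-laden example rather than the exact procedure of the cited works.

```python
from collections import defaultdict

def viterbi_path(init, trans, final, w):
    """Most probable state path generating w, or None if w has probability 0."""
    best = {q: (p, (q,)) for q, p in init.items() if p > 0}   # state -> (prob, path so far)
    for a in w:
        nxt = {}
        for q, (p, path) in best.items():
            for q2, pt in trans.get(q, {}).get(a, []):
                if q2 not in nxt or p * pt > nxt[q2][0]:
                    nxt[q2] = (p * pt, path + (q2,))
        best = nxt
    ending = [(p * final.get(q, 0.0), path) for q, (p, path) in best.items()]
    ending = [e for e in ending if e[0] > 0]
    return max(ending)[1] if ending else None

def viterbi_reestimation_step(init, trans, final, sample):
    """One re-estimation step: new parameters are relative frequencies on best paths."""
    i_c, f_c = defaultdict(float), defaultdict(float)
    t_c = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for w in sample:
        path = viterbi_path(init, trans, final, w)
        if path is None:                           # unparsable strings are ignored here
            continue
        i_c[path[0]] += 1
        for a, q, q2 in zip(w, path, path[1:]):
            t_c[q][a][q2] += 1
        f_c[path[-1]] += 1
    total = sum(i_c.values())
    new_init = {q: c / total for q, c in i_c.items()} if total else dict(init)
    new_trans, new_final = {}, {}
    for q in final:                                # every state of the machine
        out = f_c[q] + sum(c for d in t_c[q].values() for c in d.values())
        if out == 0:                               # state unused by any best path: keep old values
            new_final[q], new_trans[q] = final[q], trans.get(q, {})
            continue
        new_final[q] = f_c[q] / out
        new_trans[q] = {a: [(q2, c / out) for q2, c in d.items()]
                        for a, d in t_c[q].items()}
    return new_init, new_trans, new_final
```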


2.3.2 Gibbs sampling

Other learning methods, for fixed topologies, include spectral methods [Hsu et al., 2012] and Gibbs sampling [Gelfand and Smith, 1990]. For Pfa and Hmms, the Pautomac competition [Verwer et al., 2013] made it possible to compare methods. The winners, Shibata and Yoshinaka [2013], used Gibbs sampling.

2.3.3 Smoothing

An important practical issue is smoothing: this is crucial if perplexity is used as a measure of success (as, for example, in speech [Goodman, 2001] or statistical clustering [Kneser and Ney, 1993, Brown et al., 1992]). The idea is that after learning there will usually be strings whose probability, when parsed with the resulting Pfa, is 0. This is a source of all kinds of problems in practice. Dupont and Amengual [2000], and also Thollard [2001], studied this difficult question, for which there is still a lot of room for answers.
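
As a naive illustration of the flavour of these techniques, one can at least interpolate the learnt parameters of each state with a uniform back-off so that no event kept in the structure receives probability zero; a complete solution (as in the cited error-correcting and back-off approaches) also has to add back-off transitions to the structure itself, which this sketch does not do. The interface below is an assumption of the example.

```python
def smooth_state(prob_a, final_p, alphabet, lam=0.05):
    """Interpolate one state's next-event distribution with a uniform one.

    prob_a[a] -- learnt probability of reading symbol a in this state
    final_p   -- learnt stopping probability of this state
    lam       -- mass reserved for the uniform back-off
    Assumes sum(prob_a.values()) + final_p == 1; the result keeps that property.
    """
    u = 1.0 / (len(alphabet) + 1)          # the +1 accounts for the stop event
    smoothed = {a: (1 - lam) * prob_a.get(a, 0.0) + lam * u for a in alphabet}
    smoothed_final = (1 - lam) * final_p + lam * u
    return smoothed, smoothed_final
```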

2.4 Theoretical models and limits

2.4.1 Identification with probability one

Identification in the limit with probability one has been studied formally since Angluin [1988], who first analysed the setting and gave an enumerative algorithm for learning probabilistic context-free grammars in the limit, with probability one. De la Higuera and Thollard [2000] introduced Stern-Brocot trees in order to identify the probabilities, and de la Higuera and Oncina [2004] proposed a more detailed study of this question. Definitions in which the complexity of the learning algorithm (or the quantity of information needed for learning) is taken into account have been attempted, without real success.
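
The Stern-Brocot idea can be illustrated with a short sketch: given an empirical frequency and a tolerance, descend the tree to the simplest fraction compatible with the estimate; as the sample grows and the tolerance shrinks, a rational target probability ends up being returned exactly. This is only an illustration of the principle, not the procedure of the cited paper.

```python
from fractions import Fraction

def simplest_compatible_fraction(estimate, tolerance):
    """Smallest-denominator fraction in [estimate - tolerance, estimate + tolerance],
    found by descending the Stern-Brocot tree (left bound 0/1, right bound 1/0)."""
    lo, hi = estimate - tolerance, estimate + tolerance
    if lo <= 0:
        return Fraction(0, 1)
    a, b, c, d = 0, 1, 1, 0                    # current enclosing fractions a/b and c/d
    while True:
        num, den = a + c, b + d                # their mediant
        mediant = Fraction(num, den)
        if mediant < lo:
            a, b = num, den                    # move right in the tree
        elif mediant > hi:
            c, d = num, den                    # move left in the tree
        else:
            return mediant

# An observed frequency of 332/1000 with tolerance 0.01 is identified as 1/3.
print(simplest_compatible_fraction(0.332, 0.01))   # Fraction(1, 3)
```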

2.4.2 PAC learning

Probably Approximately Correct (Pac) learning, in the context of approximating distributions, is studied by Abe and Warmuth [1992], who show that even the seemingly simpler problem of correctly estimating the probabilities, given the structure provided by a finite non-deterministic automaton, is hard. For different types of distances, Pac-learning algorithms have been proposed for deterministic probabilistic automata from text [Thollard and Clark, 2004, Palmer and Goldberg, 2005, Castro and Gavaldà, 2008], with queries [Balle et al., 2010], and from data streams [Balle et al., 2012]. When the structure is unknown, even in the case of Dfa, most results are negative: Kearns and Valiant [1989] linked the difficulty of learning Dfa to that of solving cryptographic problems believed to be intractable (a nice proof is published in Kearns and Vazirani [1994]'s book).

2.5 State merging techniques

In the work of Guttman et al. [2005], Guttman [2006], one can find a geometric point of view on the hardness of learning Pfa.


2.5.1 ALERGIA and variants

Algorithm Alergia was invented by Carrasco and Oncina [1994], who proved the convergence of a simpler version of the algorithm, called Rlips [Carrasco and Oncina, 1999]. An extension of Alergia by de la Higuera and Thollard [2000] identifies not only the structure but also the actual probabilities. An extension of Alergia to the tree case was proposed by Carrasco et al. [2001]. Another extension, dealing with the richer class of probabilistic deterministic linear languages, can be found in [de la Higuera and Oncina, 2003]. The same authors study the learnability of probabilistic languages for a variety of queries in [de la Higuera and Oncina, 2004].
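
At the core of Alergia and its variants is a compatibility test between two states, each summarised by the counts of the symbols (and stops) observed after it. A typical Hoeffding-style version of the test is sketched below; the exact constants and the recursive application of the test to successor states in the published algorithm are left out, and the interface is an assumption of the example.

```python
from math import log, sqrt

def frequencies_differ(f1, n1, f2, n2, alpha=0.05):
    """Hoeffding-style test: True when f1/n1 and f2/n2 are significantly different."""
    if n1 == 0 or n2 == 0:
        return False                           # no evidence against merging
    gamma = sqrt(0.5 * log(2.0 / alpha)) * (1.0 / sqrt(n1) + 1.0 / sqrt(n2))
    return abs(f1 / n1 - f2 / n2) > gamma

def compatible(counts1, counts2, alpha=0.05):
    """States are merge-compatible when no outgoing event (symbol or stop) differs
    significantly; counts1 and counts2 map each event to its observed count."""
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    return all(not frequencies_differ(counts1.get(e, 0), n1, counts2.get(e, 0), n2, alpha)
               for e in set(counts1) | set(counts2))
```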

2.5.2 DSAI

The use of distinguishing strings was introduced by Ron et al. [1995]. The key ideas have led to further investigation [Thollard and Clark, 2004, Palmer and Goldberg, 2005, Guttman, 2006] in order to obtain Pac-type results. An incremental version was proposed by Gavaldà et al. [2006].

2.5.3 MDI

Algorithm Mdi was introduced by Thollard and Dupont [1999], Thollard et al. [2000], Thollard [2001], and has since been used on a variety of tasks, with a specific interest in language modelling.

2.6 Extensions

2.6.1 Dealing with non-determinism

A first step in learning non-deterministic probabilistic finite automata consisted in studying the class of probabilistic residual finite state automata, introduced by Esposito et al. [2002]; these are the probabilistic counterparts of the residual finite state automata introduced by Denis et al. [2000, 2001]. The richness of the class of probabilistic finite automata led Habrard et al. [2006], Denis et al. [2006] to introduce the innovative algorithm Dees, which learns a multiplicity automaton (the weights can be negative) by iteratively solving equations on the residuals; conversion to a Pfa is possible.

2.6.2 Transducers

Transducers are used for a number of tasks, including morphology [Roark and Sproat, 2007] and automatic translation [Amengual et al., 2001]. There are few positive results concerning the learning of probabilistic transducers. Akram et al. [2012] learn deterministic ones in an active setting and Akram and de la Higuera [2012] in a batch setting. New models, less deterministic (and therefore less constrained) than subsequential machines, are currently being investigated.

2.6.3 PCFGs

One approach to learning probabilistic tree automata was followed by Rico-Juan et al. [2000], who learn k-testable tree languages and then estimate the probabilities.

Deterministic linear grammars are a special class of linear grammars; a probabilistic version is studied by de la Higuera and Oncina [2003]. In such grammars there is exactly one terminal symbol before the non-terminal, and the rule to apply is determined by the non-terminal and that symbol. There is little work on learning probabilistic context-free grammars from text; one exception is algorithm Comino [Sicluna and de la Higuera, 2014].

3 Pautomac

The Pautomac competition, which took place in 2012, was won by Shibata and Yoshinaka [2012]. Bayesian model merging [Stolcke, 1994] and spectral methods [Bailly, 2011] are other methods used for this task. A number of the ideas used during the competition were analysed by the organisers [Verwer et al., 2013], the winners [Shibata and Yoshinaka, 2013], and participants [Balle et al., 2013b, Balle et al., 2013a].

3.1 Learning from experts and queries

A different approach is to consider that the strings do not come from a set but from a sequence (each string can only be generated once); this has been analysed in [de la Higuera, 1998]. Kermorvant et al. [2004] learn Dpfa with additional knowledge. Learning distributions with queries is interesting, at least for theoretical reasons [de la Higuera and Oncina, 2004]. Alternative sources are [Bergadano and Varricchio, 1996, Guttman et al., 2005]. More up-to-date references can be found in [Balle et al., 2010] or in Gavaldà's talk at Wata 2014.

3.2 R-NNs

There are, of course, alternative ways to represent distributions over strings: recurrent neural networks [Carrasco et al., 1996] have been tried, but comparisons with the direct grammar-based approaches have not been made on large datasets. With the new generation of (deep) neural networks, their recurrent counterparts are currently being (re-)investigated.

References

N. Abe and M. Warmuth. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning Journal, 9:205–260, 1992.
H. I. Akram and C. de la Higuera. Learning probabilistic subsequential transducers from positive data. In Proceedings of Icaart 2013, Barcelona, 2012.
H. I. Akram, C. de la Higuera, and C. Eckert. Actively learning probabilistic subsequential transducers. In J. Heinz, C. de la Higuera, and T. Oates, editors, Proceedings of the Eleventh International Conference on Grammatical Inference, University of Maryland, College Park, United States, volume 21, pages 19–33. Jmlr.org, 2012.

J. C. Amengual, J. M. Benedí, F. Casacuberta, A. Castaño, A. Castellanos, V. M. Jiménez, D. Llorens, A. Marzal, M. Pastor, F. Prat, E. Vidal, and J. M. Vilar. The EuTrans-I speech translation system. Machine Translation, 15(1):75–103, 2001.
D. Angluin. Identifying languages from stochastic examples. Technical Report Yaleu/Dcs/RR-614, Yale University, March 1988.
R. Bailly. QWA: Spectral algorithm. Journal of Machine Learning Research Workshop and Conference Proceedings, ACML'11, 20:147–163, 2011.
V. Balasubramanian. Equivalence and reduction of hidden Markov models. Master's thesis, Department of Electrical Engineering and Computer Science, Mit, 1993. Issued as AI Technical Report 1370.
B. Balle, J. Castro, and R. Gavaldà. A lower bound for learning distributions generated by probabilistic automata. In Algorithmic Learning Theory, 21st International Conference, ALT 2010, volume 6331 of Lncs, pages 179–193. Springer-Verlag, 2010.
B. Balle, J. Castro, and R. Gavaldà. Bootstrapping and learning Pdfa in data streams. Journal of Machine Learning Research - Proceedings Track, 21:34–48, 2012.
B. Balle, X. Carreras, F. M. Luque, and A. Quattoni. Spectral learning of weighted automata. Machine Learning Journal, to appear, DOI 10.1007/s10994-013-5416-x, 2013a.
B. Balle, J. Castro, and R. Gavaldà. Adaptively learning probabilistic deterministic automata from data streams. Machine Learning Journal, to appear, DOI 10.1007/s10994-013-5408-x, 2013b.
L. E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1–8, 1972.
L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41:164–171, 1970.
F. Bergadano and S. Varricchio. Learning behaviors of automata from multiplicity and equivalence queries. Siam Journal of Computing, 25(6):1268–1280, 1996.
V. D. Blondel and V. Canterini. Undecidable problems for probabilistic automata of fixed dimension. Theory of Computer Systems, 36(3):231–245, 2003.
P. Brown, V. Della Pietra, P. de Souza, J. Lai, and R. Mercer. Class-based N-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
J. Calera-Rubio and R. C. Carrasco. Computing the relative entropy between regular tree languages. Information Processing Letters, 68(6):283–289, 1998.


R. C. Carrasco. Accurate computation of the relative entropy between stochastic regular grammars. Rairo (Theoretical Informatics and Applications), 31(5):437–444, 1997.
R. C. Carrasco and J. Oncina. Learning stochastic regular grammars by means of a state merging method. In R. C. Carrasco and J. Oncina, editors, Grammatical Inference and Applications, Proceedings of Icgi '94, number 862 in Lnai, pages 139–150. Springer-Verlag, 1994.
R. C. Carrasco and J. Oncina. Learning deterministic regular grammars from stochastic samples in polynomial time. Rairo (Theoretical Informatics and Applications), 33(1):1–20, 1999.
R. C. Carrasco and J. R. Rico-Juan. A similarity between probabilistic tree languages: application to Xml document families. Pattern Recognition, 36(9):2197–2199, 2003.
R. C. Carrasco, M. Forcada, and L. Santamaria. Inferring stochastic regular grammars with recurrent neural networks. In L. Miclet and C. de la Higuera, editors, Proceedings of Icgi '96, number 1147 in Lnai, pages 274–281. Springer-Verlag, 1996.
R. C. Carrasco, J. Oncina, and J. Calera-Rubio. Stochastic inference of regular tree languages. Machine Learning Journal, 44(1):185–197, 2001.
F. Casacuberta. Growth transformations for probabilistic functions of stochastic grammars. International Journal on Pattern Recognition and Artificial Intelligence, 10(3):183–201, 1996.
F. Casacuberta and C. de la Higuera. Computational complexity of problems on probabilistic grammars and transducers. In de Oliveira [2000], pages 15–24.
J. Castro and R. Gavaldà. Towards feasible Pac-learning of probabilistic deterministic finite automata. In A. Clark, F. Coste, and L. Miclet, editors, Grammatical Inference: Algorithms and Applications, Proceedings of Icgi '08, volume 5278 of Lncs, pages 163–174. Springer-Verlag, 2008.
R. Chaudhuri and S. Rao. Approximating grammar probabilities: Solution to a conjecture. Journal of the Acm, 33(4):702–705, 1986.
A. Clark and S. Lappin. Linguistic Nativism and the Power of the Stimulus. Wiley-Blackwell Press, Chichester, UK, 2011.
C. Cortes, M. Mohri, and A. Rastogi. On the computation of some standard distances between probabilistic automata. In Proceedings of Ciaa 2006, volume 4094 of Lncs, pages 137–149. Springer-Verlag, 2006.
T. Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, New York, NY, 1991.
C. de la Higuera. Learning stochastic finite automata from experts. In V. Honavar and G. Slutski, editors, Grammatical Inference, Proceedings of Icgi '98, number 1433 in Lnai, pages 79–89. Springer-Verlag, 1998.


C. de la Higuera. Grammatical inference: learning automata and grammars. Cambridge University Press, 2010.
C. de la Higuera and J. Oncina. Identification with probability one of stochastic deterministic linear languages. In R. Gavaldà, K. Jantke, and E. Takimoto, editors, Proceedings of Alt 2003, number 2842 in Lncs, pages 134–148. Springer-Verlag, 2003.
C. de la Higuera and J. Oncina. Learning probabilistic finite automata. In G. Paliouras and Y. Sakakibara, editors, Grammatical Inference: Algorithms and Applications, Proceedings of Icgi '04, volume 3264 of Lnai, pages 175–186. Springer-Verlag, 2004.
C. de la Higuera and J. Oncina. Computing the most probable string with a probabilistic finite state machine. In Proceedings of Fsmnlp, https://aclweb.org/anthology/W/W13/W13-1801.pdf, 2013.
C. de la Higuera and J. Oncina. The most probable string: an algorithmic study. Journal of Logic and Computation, 24(2):311–330, 2014.
C. de la Higuera and F. Thollard. Identification in the limit with probability one of stochastic deterministic finite automata. In de Oliveira [2000], pages 15–24.
A. L. de Oliveira, editor. Grammatical Inference: Algorithms and Applications, Proceedings of Icgi '00, volume 1891 of Lnai, 2000. Springer-Verlag.
F. Denis, A. Lemay, and A. Terlutte. Learning regular languages using non deterministic finite automata. In de Oliveira [2000], pages 39–50.
F. Denis, A. Lemay, and A. Terlutte. Learning regular languages using Rfsa. In N. Abe, R. Khardon, and T. Zeugmann, editors, Proceedings of Alt 2001, number 2225 in Lncs, pages 348–363. Springer-Verlag, 2001.
F. Denis, Y. Esposito, and A. Habrard. Learning rational stochastic languages. In Proceedings of Colt 2006, volume 4005 of Lncs, pages 274–288. Springer-Verlag, 2006.
P. Dupont and J.-C. Amengual. Smoothing probabilistic automata: an error-correcting approach. In de Oliveira [2000], pages 51–62.
P. Dupont, F. Denis, and Y. Esposito. Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. Pattern Recognition, 38(9):1349–1371, 2005.
Y. Esposito, A. Lemay, F. Denis, and P. Dupont. Learning probabilistic residual finite state automata. In P. Adriaans, H. Fernau, and M. van Zaannen, editors, Grammatical Inference: Algorithms and Applications, Proceedings of Icgi '02, volume 2484 of Lnai, pages 77–91. Springer-Verlag, 2002.
A. Fred. Computation of substring probabilities in stochastic grammars. In de Oliveira [2000], pages 103–114.


R. Gavaldà, P. W. Keller, J. Pineau, and D. Precup. Pac-learning of Markov models with hidden state. In J. Fürnkranz, T. Scheffer, and M. Spiliopoulou, editors, Proceedings of Ecml '06, volume 4212 of Lncs, pages 150–161. Springer-Verlag, 2006.
A. Gelfand and A. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990.
T. Goan, N. Benson, and O. Etzioni. A grammar inference algorithm for the world wide web. In Proceedings of Aaai Spring Symposium on Machine Learning in Information Access, Stanford, CA, 1996. Aaai Press.
J. Goodman. A bit of progress in language modeling. Technical report, Microsoft Research, 2001.
O. Guttman. Probabilistic Automata and Distributions over Sequences. PhD thesis, The Australian National University, 2006.
O. Guttman, S. V. N. Vishwanathan, and R. C. Williamson. Learnability of probabilistic automata via oracles. In Jain et al. [2005], pages 171–182.
A. Habrard, F. Denis, and Y. Esposito. Using pseudo-stochastic rational languages in probabilistic grammatical inference. In Y. Sakakibara, S. Kobayashi, K. Sato, T. Nishino, and E. Tomita, editors, Grammatical Inference: Algorithms and Applications, Proceedings of Icgi '06, volume 4201 of Lnai, pages 112–124. Springer-Verlag, 2006.
T. Hanneforth. A memory-efficient epsilon-removal algorithm for weighted acyclic finite-state automata. In Proceedings of Fsmnlp, 2008.
T. Hanneforth and C. de la Higuera. Epsilon removal by loop reduction for finite-state automata. In T. Hanneforth and G. Fanselow, editors, Studia Grammatica. Language and Logos. Akademie Verlag, 2010.
J. J. Horning. A study of Grammatical Inference. PhD thesis, Stanford University, 1969.
D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. J. Comput. Syst. Sci., 78(5):1460–1480, 2012.
S. Jain, H.-U. Simon, and E. Tomita, editors. Proceedings of Alt 2005, volume 3734 of Lncs, 2005. Springer-Verlag.
F. Jelinek. Statistical Methods for Speech Recognition. The Mit Press, Cambridge, Massachusetts, 1998.
M. Kearns and L. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. In 21st Acm Symposium on Theory of Computing, pages 433–444, 1989.
M. J. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. Mit Press, 1994.


M. J. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie. On the learnability of discrete distributions. In Proceedings of the 25th Annual Acm Symposium on Theory of Computing, pages 273–282, 1994.
C. Kermorvant, C. de la Higuera, and P. Dupont. Improving probabilistic automata learning with additional knowledge. In A. Fred, T. Caelli, R. Duin, A. Campilho, and D. de Ridder, editors, Structural, Syntactic and Statistical Pattern Recognition, Proceedings of Sspr and Spr 2004, volume 3138 of Lncs, pages 260–268. Springer-Verlag, 2004.
R. Kneser and H. Ney. Improved clustering techniques for class-based language modelling. In European Conference on Speech Communication and Technology, pages 973–976, Berlin, 1993.
R. B. Lyngsø and C. N. S. Pedersen. Complexity of comparing hidden Markov models. In Proceedings of Isaac '01, number 2223 in Lncs, pages 416–428. Springer-Verlag, 2001.
R. B. Lyngsø, C. N. S. Pedersen, and H. Nielsen. Metrics and similarity measures for hidden Markov models. In Proceedings of Ismb '99, pages 178–186, 1999.
M. Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(3):269–311, 1997.
M. Mohri, F. C. N. Pereira, and M. Riley. The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231(1):17–32, 2000.
T. Murgue and C. de la Higuera. Distances between distributions: Comparing language models. In A. Fred, T. Caelli, R. Duin, A. Campilho, and D. de Ridder, editors, Structural, Syntactic and Statistical Pattern Recognition, Proceedings of Sspr and Spr 2004, volume 3138 of Lncs, pages 269–277. Springer-Verlag, 2004.
H. Ney, S. Martin, and F. Wessel. Corpus-Based Statistical Methods in Speech and Language Processing, chapter Statistical Language Modeling Using Leaving-One-Out, pages 174–207. S. Young and G. Bloothooft, Kluwer Academic Publishers, 1997.
N. Palmer and P. W. Goldberg. Pac-learnability of probabilistic deterministic finite state automata in terms of variation distance. In Jain et al. [2005], pages 157–170.
A. Paz. Introduction to probabilistic automata. Academic Press, New York, 1971.
D. Picó and F. Casacuberta. Some statistical-estimation methods for stochastic finite-state transducers. Machine Learning Journal, 44(1):121–141, 2001.
L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the Ieee, 77:257–286, 1989.
J. R. Rico-Juan, J. Calera-Rubio, and R. C. Carrasco. Probabilistic k-testable tree-languages. In de Oliveira [2000], pages 221–228.


B. Roark and R. Sproat. Computational Approaches to Syntax and Morphology. Oxford University Press, 2007.
D. Ron, Y. Singer, and N. Tishby. Learning probabilistic automata with variable memory length. In Proceedings of Colt 1994, pages 35–46, New Brunswick, New Jersey, 1994. Acm Press.
D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of Colt 1995, pages 31–40, 1995.
J. A. Sánchez, J. M. Benedí, and F. Casacuberta. Comparison between the inside-outside algorithm and the Viterbi algorithm for stochastic context-free grammars. In P. Perner, P. Wang, and A. Rosenfeld, editors, Advances in Structural and Syntactical Pattern Recognition, volume 1121 of Lncs, pages 50–59. Springer-Verlag, 8th International Workshop Sspr '96, Leipzig, 1996.
L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In C. Cardie and R. Weischedel, editors, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 81–89. Association for Computational Linguistics, Somerset, New Jersey, 1997.
C. Shibata and R. Yoshinaka. Marginalizing out transition probabilities for several subclasses of Pfas. Journal of Machine Learning Research - Workshop and Conference Proceedings, ICGI'12, 21:259–263, 2012.
C. Shibata and R. Yoshinaka. A comparison of collapsed Bayesian methods for probabilistic finite automata. Machine Learning Journal, to appear, DOI 10.1007/s10994-013-5410-3, 2013.
J. Sicluna and C. de la Higuera. Pcfg induction for unsupervised parsing and language modelling. Working draft, 2014.
R. Solomonoff. The discovery of algorithmic probability. Jcss, 55(1):73–88, 1997.
A. Stolcke. Bayesian Learning of Probabilistic Language Models. Ph.D. dissertation, University of California, 1994.
F. Thollard. Improving probabilistic grammatical inference core algorithms with post-processing techniques. In Proceedings of the 8th International Conference on Machine Learning, pages 561–568. Morgan Kaufmann, 2001.
F. Thollard and A. Clark. Pac-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research, 5:473–497, 2004.
F. Thollard and P. Dupont. Entropie relative et algorithmes d'inférence grammaticale probabiliste. In M. Sebag, editor, Actes de la conférence Cap '99, pages 115–122, 1999.
F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic Dfa inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

S. Verwer, R. Eyraud, and C. de la Higuera. Pautomac: a probabilistic automata and hidden Markov models learning competition. Machine Learning Journal, to appear, DOI 10.1007/s10994-013-5409-9, 2013.
E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. C. Carrasco. Probabilistic finite state automata – part I and II. Pattern Analysis and Machine Intelligence, 27(7):1013–1039, 2005.
A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. Ieee Transactions on Information Theory, 13:260–269, 1967.
C. S. Wetherell. Probabilistic languages: a review and some open questions. Computing Surveys, 12(4):361–379, 1980.
M. Young-Lai and F. W. Tompa. Stochastic grammatical inference of text database structure. Machine Learning Journal, 40(2):111–137, 2000.
