{"title": "Controlling the Complexity of HMM Systems by Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 737, "page_last": 743, "abstract": null, "full_text": "Controlling the Complexity of HMM Systems by Regularization \n\nChristoph Neukirchen, Gerhard Rigoll \n\nDepartment of Computer Science \n\nGerhard-Mercator-University Duisburg \n\n47057 Duisburg, Germany \n\nemail: {chn.rigoll}@fb9-ti.uni-duisburg.de \n\nAbstract \n\nThis paper introduces a method for regularization of HMM systems that avoids parameter overfitting caused by insufficient training data. Regularization is done by augmenting the EM training method with a penalty term that favors simple and smooth HMM systems. The penalty term is constructed as a mixture model of negative exponential distributions that is assumed to generate the state-dependent emission probabilities of the HMMs. This new method is a successful transfer of a well-known regularization approach for neural networks to the HMM domain and can be interpreted as a generalization of traditional state-tying for HMM systems. The effect of regularization is demonstrated for continuous speech recognition tasks by improving overfitted triphone models and by speaker adaptation with limited training data. \n\n1 Introduction \n\nOne general problem when constructing statistical pattern recognition systems is to ensure the capability to generalize well, i.e. the system must be able to classify data that is not contained in the training data set. Hence the classifier should learn the true underlying data distribution instead of overfitting to the few data examples seen during system training. One way to cope with the problem of overfitting is to balance the system's complexity and flexibility against the limited amount of data that is available for training. 
\nIn the neural network community it is well known that the amount of information used in system training that is required for good generalization performance should be larger than the number of adjustable weights (Baum, 1989). A common method to train a large neural network sufficiently well is to reduce the number of adjustable parameters, either by removing those weights that seem to be less important (in (le Cun, 1990) the sensitivity of individual network weights is estimated by the second-order gradient) or by sharing the weights among many network connections (in (Lang, 1990) the connections that share identical weight values are determined in advance by using prior knowledge about invariances in the problem to be solved). A second approach to avoid overfitting in neural networks is to make use of regularization methods. Regularization adds an extra term to the training objective function that penalizes network complexity. The simplest regularization method is weight decay (Plaut, 1986), which assigns high penalties to large weights. A more complex regularization term is used in soft weight-sharing (Nowlan, 1992) by favoring neural network weights that fall into a finite set of small weight clusters. The traditional neural weight-sharing technique can be interpreted as a special case of soft weight-sharing regularization when the cluster variances tend towards zero. \nIn continuous speech recognition the Hidden Markov Model (HMM) method is common. When using detailed context-dependent triphone HMMs, the number of HMM states and parameters to estimate in the state-dependent probability density functions (pdfs) is increasingly large and overfitting becomes a serious problem. 
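The soft weight-sharing penalty discussed above can be illustrated with a small sketch (not from the paper; all names and numbers are illustrative): the penalty is the negative log-likelihood of the network weights under a small Gaussian mixture, so weights that fall near one of a few cluster centers are penalized less than outliers.

```python
import math

def soft_weight_sharing_penalty(weights, centers, variances, mix):
    """Soft weight-sharing penalty in the style of (Nowlan, 1992):
    negative log-likelihood of the weights under a Gaussian mixture,
    favoring weights that cluster around a few centers. All values
    here are illustrative, not taken from the paper."""
    nll = 0.0
    for w in weights:
        p = sum(c / math.sqrt(2.0 * math.pi * v) *
                math.exp(-(w - m) ** 2 / (2.0 * v))
                for c, m, v in zip(mix, centers, variances))
        nll -= math.log(p)
    return nll

# Weights near the cluster centers (0.0 and 1.0) incur a smaller
# penalty than weights that sit far away from every center.
tight = soft_weight_sharing_penalty([0.01, 0.99], [0.0, 1.0], [0.05, 0.05], [0.5, 0.5])
loose = soft_weight_sharing_penalty([0.5, 1.7], [0.0, 1.0], [0.05, 0.05], [0.5, 0.5])
```

As the cluster variances shrink towards zero, the penalty increasingly forces weights to coincide with a center, recovering hard weight sharing.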
The most common approach to balance the complexity of triphone HMM systems against the training data set is to reduce the number of parameters by tying, i.e. parameter sharing (Young, 1992). A popular sharing method is state-tying, with the HMM states to be tied selected in advance, either by data-driven state-clustering based on a pdf-dependent distance metric (Young, 1993), or by constructing binary decision trees that incorporate higher phonetic knowledge (Bahl, 1991). In these methods, the number of state clusters and the decision tree sizes, respectively, must be chosen adequately to match the training data size. However, a possible drawback of both methods is that two different states may be selected to be tied (and their pdfs are forced to be identical) although there is enough training data to estimate the different pdfs of both states sufficiently well. In the following, a method to reduce the complexity of general HMM systems based on a regularization term is presented. Due to its close relationship to the soft weight-sharing method for neural networks this novel approach can be interpreted as soft state-tying. \n\n2 Maximum likelihood training in HMM systems \n\nTraditionally, the method most commonly used to determine the set of adjustable parameters \\Theta in a HMM system is maximum likelihood (ML) estimation via the expectation maximization (EM) algorithm. If the training observation vector sequence is denoted as X = (x(1), ..., x(T)) and the corresponding HMM is denoted as W, the ML estimator is given by: \n\n\\hat{\\Theta}_{ML} = \\arg\\max_{\\Theta} \\{ \\log p_{\\Theta}(X|W) \\} \\quad (1) \n\nIn the following, the total number of different HMM states is given by K. 
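For orientation, the EM auxiliary function that the next paragraphs introduce formally can be sketched as a soft-count-weighted log-likelihood over frames and states (the array names are illustrative stand-ins; the soft counts would come from a forward-backward pass):

```python
import math

def em_auxiliary(gamma, log_b):
    """Sketch of the EM auxiliary function
    Q = sum_t sum_k gamma_k(t) * log b_k(x(t)).

    `gamma[t][k]` are probabilistic state counts and `log_b[t][k]`
    the per-frame state log-emission scores; both are toy inputs."""
    return sum(g * lb
               for g_t, lb_t in zip(gamma, log_b)
               for g, lb in zip(g_t, lb_t))

# Two frames, two states: the soft counts weight the emission scores.
gamma = [[0.9, 0.1], [0.2, 0.8]]
log_b = [[math.log(0.5), math.log(0.25)], [math.log(0.1), math.log(0.4)]]
Q = em_auxiliary(gamma, log_b)
```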
The emission pdf of the k-th state is denoted as b_k(x); for continuous HMMs, b_k(x) is most commonly a mixture of Gaussian pdfs; in the case of discrete HMMs the observation vector x is mapped by a vector quantizer (VQ) on the discrete VQ-label m(x) and the emission pdf is replaced by the discrete output probability b_k(m). By the forward-backward algorithm the probabilistic state counts \\gamma_k(t) can be determined for each training observation, and the log-likelihood over the training data can be decomposed into the auxiliary function Q(\\Theta) optimized in the EM steps (state transition probabilities are neglected here): \n\nQ(\\Theta) = \\sum_{t=1}^{T} \\sum_{k=1}^{K} \\gamma_k(t) \\cdot \\log b_k(x(t)) \\quad (2) \n\nSometimes, the observation vector x is split up into several independent streams. If the total number of streams is given by Z, the features in the z-th stream comprise the subvector x^{(z)}, and in the case of application of a VQ the corresponding VQ label is denoted as m^{(z)}(x^{(z)}). The observation subvectors in different streams are assumed to be statistically independent, thus the states' pdfs can be written as: \n\nb_k(x) = \\prod_{z=1}^{Z} b_k^{(z)}(x^{(z)}) \\quad (3) \n\n3 A complexity measure for HMM systems \n\nWhen using regularization methods to train the HMM system, the traditional training objective function Q(\\Theta) is augmented by a complexity penalization term \\Omega and the new optimization problem becomes: \n\n\\hat{\\Theta}_{reg} = \\arg\\max_{\\Theta} \\{ Q(\\Theta) + \\nu \\cdot \\Omega(\\Theta) \\} \\quad (4) \n\nHere, the regularizer term \\Omega should be small if the HMM system has high complexity and parameter overfitting becomes a problem; \\Omega should be large if the HMM states' pdfs are shaped smoothly and system generalization works well. The constant \\nu \\geq 0 is a control parameter that adjusts the tradeoff between the pure ML solution and the smoothness penalization. In Eqn. 
(4) the term Q(\\Theta) becomes larger the more data is used for training (which makes the ML estimation more reliable), and the influence of the term \\nu \\cdot \\Omega becomes relatively less important. \nThe basic idea when constructing an expression for the regularizer \\Omega that favors smooth HMM systems is that, in the case of simple and smooth systems, the state-dependent emission pdfs b_k(\\cdot) should fall into several groups of similar pdfs. This is in contrast to traditional state-tying, which forces identical pdfs in each group. In the following, these clusters of similar emission pdfs are described by a probabilistic mixture model. Each pdf is assumed to be generated by a mixture of I different mixture components p_i(\\cdot). In this case the probability (-density) of generating the emission pdf b_k(\\cdot) is given by: \n\np(b_k(\\cdot)) = \\sum_{i=1}^{I} c_i \\cdot p_i(b_k(\\cdot)) \\quad (5) \n\nwith the mixture weights c_i that are constrained to 0 \\leq c_i \\leq 1 and \\sum_{i=1}^{I} c_i = 1. The i-th mixture component p_i(\\cdot) is used to model the i-th cluster of HMM emission pdfs. Each cluster is represented by a prototype pdf that is denoted as \\beta_i(\\cdot) for the i-th cluster; the distance (using a suitable metric) between a HMM emission pdf b_k(\\cdot) and the i-th prototype pdf is denoted as D_i(b_k(\\cdot)). If these distances are small for all HMM emission probabilities, there are several small clusters of emission probabilities and the regularizer term \\Omega should be large. Now, it is assumed that the distances follow a negative exponential distribution (with a deviation parameter \\lambda_i), yielding an expression for the mixture components: \n\np_i(b_k(\\cdot)) = \\left( \\prod_{z=1}^{Z} \\lambda_{i,z} \\right) \\cdot \\exp\\left( - \\sum_{z=1}^{Z} \\lambda_{i,z} \\cdot D_{i,z}(b_k^{(z)}(\\cdot)) \\right) \\quad (6) \n\nIn Eqn. (6) the more general case of Z independent streams is given. Hence, the HMM emission pdfs and the cluster prototype pdfs are split up into Z different pdfs b_k^{(z)}(\\cdot) and \\beta_i^{(z)}(\\cdot
), respectively, and the stream-dependent distances D_{i,z} and parameters \\lambda_{i,z} are used. Now, for the regularizer term \\Omega the log-likelihood of the mixture model in Eqn. (5) over all emission pdfs in the HMM system can be used: \n\n\\Omega(\\Theta) = \\sum_{k=1}^{K} \\log p(b_k(\\cdot)) \\quad (7) \n\n4 Regularization example: discrete HMMs \n\nAs an example for parameter estimation in the regularization framework, a discrete HMM system with different VQs for each of the Z streams is considered here: each VQ subdivides the feature space into J_z different partitions (i.e. the z-th codebook size is J_z) and the VQ-partition labels are denoted m_j^{(z)}. If the observation subvector x^{(z)} is in the j-th VQ-partition, the VQ output is m^{(z)}(x^{(z)}) = m_j^{(z)}. \n\nSince discrete HMM output probabilities b_k^{(z)}(m^{(z)}) are used here, the regularizer's prototypes are the discrete probabilities \\beta_i^{(z)}(m^{(z)}). As the distance metric between the HMM emission probabilities and the prototype probabilities used in Eqn. (6), the asymmetric Kullback-Leibler divergence is applied: \n\nD_{i,z}(b_k^{(z)}(\\cdot)) = \\sum_{j=1}^{J_z} \\beta_i^{(z)}(m_j^{(z)}) \\cdot \\log \\frac{\\beta_i^{(z)}(m_j^{(z)})}{b_k^{(z)}(m_j^{(z)})} \\quad (8) \n\n4.1 Estimation of HMM parameters using regularization \n\nThe parameter set \\Theta of the HMM system to be estimated mainly consists of the discrete HMM emission probabilities (transition probabilities are not subject to regularization here). To get an iterative parameter estimation in the EM style, Eqn. (4) must be maximized, e.g. by setting the derivative of Eqn. (4) with respect to the HMM parameter b_k^{(z)}(m_j^{(z)}) to zero and application of Lagrange multipliers with regard to the constraint \\sum_{j=1}^{J_z} b_k^{(z)}(m_j^{(z)}) = 1. This leads to a quite complex solution that can only be solved numerically. \nThe optimization problem can be simplified if the mixture in Eqn. (5) is replaced by the maximum approximation, i.e. only the maximum component in the sum is considered. 
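The asymmetric Kullback-Leibler distance used in Eqn. (8) can be sketched for two toy discrete distributions (the argument order, prototype first, is an assumption chosen to be consistent with the log-domain prototype re-estimate given later; the function name is illustrative):

```python
import math

def kl_distance(prototype, emission):
    """Asymmetric Kullback-Leibler divergence D(prototype || emission)
    between a discrete cluster prototype and a discrete HMM emission
    distribution, as a sketch of the distance metric of Eqn. (8)."""
    return sum(p * math.log(p / b)
               for p, b in zip(prototype, emission) if p > 0.0)

# The distance vanishes when both distributions agree and grows as
# the emission probabilities move away from the prototype.
same = kl_distance([0.5, 0.5], [0.5, 0.5])
diff = kl_distance([0.5, 0.5], [0.9, 0.1])
```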
The corresponding index of the maximum component is denoted i*: \n\ni^* = \\arg\\max_{i} \\{ c_i \\cdot p_i(b_k(\\cdot)) \\} \\quad (9) \n\nIn this simplified case the HMM parameter estimation is given by: \n\nb_k^{(z)}(m_j^{(z)}) = \\frac{ \\sum_{t=1}^{T} \\gamma_k(t) \\cdot \\delta(m^{(z)}(x^{(z)}(t)), m_j^{(z)}) + \\nu \\cdot \\lambda_{i^*,z} \\cdot \\beta_{i^*}^{(z)}(m_j^{(z)}) }{ \\sum_{t=1}^{T} \\gamma_k(t) + \\nu \\cdot \\lambda_{i^*,z} } \\quad (10) \n\nThis is a weighted sum of the well-known ML solution (the relative state counts, where \\delta(\\cdot,\\cdot) denotes the Kronecker delta) and the regularizer's prototype probability \\beta_{i^*}^{(z)}(\\cdot) that is selected by the maximum search in Eqn. (9). The larger the value of the constant \\nu, the stronger is the force that pushes the estimate of the HMM emission probability b_k^{(z)}(m_j^{(z)}) towards the prototype probability \\beta_{i^*}^{(z)}(\\cdot). The situation when \\nu tends towards infinity corresponds to the case of traditional state-tying, because all different states that fall into the same cluster i* make use of \\beta_{i^*}^{(z)}(\\cdot) as emission probability in the z-th stream. \n\n4.2 Estimation of regularizer parameters \n\nThe parameter set of the regularizer consists of the mixture weights c_i, the deviation parameters \\lambda_{i,z}, and, in the case of regularizing discrete HMMs, the discrete prototype probabilities \\beta_i^{(z)}(m_j^{(z)}). These parameters can be set in advance by making use of prior knowledge; e.g. the prototype probabilities can be obtained from a simple HMM system that uses a small number of states. Alternatively, the regularizer's parameters can be estimated in a similar way as in (Nowlan, 1992) by maximizing Eqn. (7). Since there is no direct solution to this optimization problem, maximization must be performed in an EM-like iterative procedure that uses the HMM emission pdfs b_k(\\cdot) as training data for the mixture model and increases the following auxiliary function in each step: \n\nR = \\sum_{k=1}^{K} \\sum_{i=1}^{I} P(i|b_k(\\cdot)) \\cdot \\log p(i, b_k(\\cdot)) = \\sum_{k=1}^{K} \\sum_{i=1}^{I} P(i|b_k(\\cdot)) \\cdot \\log\\left( c_i \\cdot p_i(b_k(\\cdot)) \\right) \\quad (11) \n\nwith the posterior probability used as weighting factor given by: \n\nP(i|b_k(\\cdot)) = \\frac{ c_i \\cdot p_i(b_k(\\cdot)) }{ \\sum_{l=1}^{I} c_l \\cdot 
p_l(b_k(\\cdot)) } \\quad (12) \n\nAgain, maximization of Eqn. (11) can be performed by setting the derivative of R with respect to the regularizer's parameters to zero, under consideration of the constraints \\sum_{i=1}^{I} c_i = 1 and \\sum_{j=1}^{J_z} \\beta_i^{(z)}(m_j^{(z)}) = 1, by application of Lagrange multipliers. For the estimation of the regularizer parameters this yields: \n\n\\hat{c}_i = \\frac{1}{K} \\sum_{k=1}^{K} P(i|b_k(\\cdot)) \\quad (13) \n\n\\hat{\\lambda}_{i,z} = \\frac{ \\sum_{k=1}^{K} P(i|b_k(\\cdot)) }{ \\sum_{k=1}^{K} D_{i,z}(b_k^{(z)}(\\cdot)) \\cdot P(i|b_k(\\cdot)) } \\quad (14) \n\n\\hat{\\beta}_i^{(z)}(m_j^{(z)}) = \\frac{ \\exp\\left( \\frac{ \\sum_{k=1}^{K} P(i|b_k(\\cdot)) \\cdot \\log b_k^{(z)}(m_j^{(z)}) }{ \\sum_{k=1}^{K} P(i|b_k(\\cdot)) } \\right) }{ \\sum_{l=1}^{J_z} \\exp\\left( \\frac{ \\sum_{k=1}^{K} P(i|b_k(\\cdot)) \\cdot \\log b_k^{(z)}(m_l^{(z)}) }{ \\sum_{k=1}^{K} P(i|b_k(\\cdot)) } \\right) } \\quad (15) \n\nThe estimate \\hat{c}_i can be interpreted as the average probability that a HMM emission probability falls into the i-th mixture cluster; \\hat{\\lambda}_{i,z} is the inverse of the weighted average distance between the emission probabilities and the prototype probability \\beta_i^{(z)}(\\cdot). The estimate \\hat{\\beta}_i^{(z)}(m_j^{(z)}) is the average probability over all emission probabilities for the VQ-label m_j^{(z)}, weighted in the log-domain. \nIf the Euclidean distance between the discrete probabilities is used instead of Eqn. (8) to measure the differences between the HMM emission probabilities and the prototypes, \n\nD_{i,z}(b_k^{(z)}(\\cdot)) = \\sum_{j=1}^{J_z} \\left( \\beta_i^{(z)}(m_j^{(z)}) - b_k^{(z)}(m_j^{(z)}) \\right)^2 \\quad (16) \n\nthe estimate of the prototype probabilities is given by the average of the HMM probabilities weighted in the original space: \n\n\\hat{\\beta}_i^{(z)}(m_j^{(z)}) = \\frac{ \\sum_{k=1}^{K} P(i|b_k(\\cdot)) \\cdot b_k^{(z)}(m_j^{(z)}) }{ \\sum_{k=1}^{K} P(i|b_k(\\cdot)) } \\quad (17) \n\n5 Experimental results \n\nTo investigate the performance of the regularization methods described above, a HMM speech recognition system for the speaker-independent Resource Management (RM) continuous speech task is built up. For training, 3990 sentences from 109 different speakers are used. 
Recognition results are given as word error rates averaged over the official DARPA RM test sets feb'89, oct'89, feb'91 and sep'92, consisting of 1200 sentences from 40 different speakers in total. Recognition is done via a beam-search-guided Viterbi decoder using the DARPA RM word pair grammar (perplexity: 60). \nAs acoustic features, every 10 ms 12 MFCC coefficients and the relative signal power are extracted from the speech signal along with the dynamic \\Delta- and \\Delta\\Delta-features, comprising 39 features per frame. The HMM system makes use of standard 3-state discrete probability phonetic models. Four different neural networks, trained by the MMI method that is described in (Rigoll, 1997) and extended in (Neukirchen, 1998), are used as VQs to quantize the features into Z = 4 different streams of discrete labels. The codebook size in each stream is set to 200. \nA simple system with models for 47 monophones and for the 33 most prominent function words (394 states in total) yields a word error rate of 8.6%. A system that makes use of the more detailed (but untied) word-internal triphone models (6921 states in total) yields 12.2% word error. Hence HMM overfitting because of insufficient training data is a severe problem in this case. Traditional methods to overcome the effects of overfitting, like interpolating between triphones and monophones (Bahl, 1983), data-driven state-clustering and decision tree clustering, yield error rates of 6.5%, 8.3% and 6.4%, respectively. It must be noted that, in contrast to the usual training procedure in (Rigoll, 1996), no further smoothing methods are applied to the HMM emission probabilities here. \nIn a first series of experiments the untied triphone system is regularized by a quite simple mixture of I = 394 density components, i.e. the number of clusters in the penalty term is identical to the number of states in the monophone system. 
In this case the prototype probabilities are initialized by the emission probabilities of the monophone system; the mixture weights and the deviation parameters in the regularizer are initially set to be uniform. In order to test the influence of the tradeoff parameter \\nu, it is set to 50, 10 and 2, respectively. The corresponding word error rates are 8.4%, 6.9% and 6.3%, respectively. In the case of large \\nu, regularization degrades to a tying of triphone states to monophone states and the error rate tends towards the monophone system performance. For smaller \\nu there is a good tradeoff between data fitting and HMM smoothness, yielding improved system performance. The initial prototype probability settings provided by the monophone system do not seem to be changed much by regularizer parameter estimation, since the system performance only changes slightly when the regularizer's parameter reestimation is not incorporated. \nIn preliminary experiments the regularization method is also used for speaker adaptation. A speaker-independent system trained on the Wall Street Journal (WSJ) database yields an error rate of 32.4% on the Nov. 93 S3 test set with 10 different non-native speakers. The speaker-independent HMM emission probabilities are used to initialize the prototype probabilities of the regularizer. Then, speaker-dependent systems are built up for each speaker using only 40 fast enrollment sentences for training along with regularization (\\nu is set to 10). Now, the error rate drops to 25.7%, which is better than the speaker adaptation method described in (Rottland, 1998) that yields 27.3% by a linear feature space transformation. In combination both methods achieve 23.0% word error. \n\n6 Summary and Discussion \n\nA method to avoid parameter overfitting in HMM systems by application of a regularization term that favors smooth and simple models has been presented here. 
The complexity measure applied to the HMMs is based on a finite mixture of negative exponential distributions that generates the state-dependent emission probabilities. This kind of regularization term can be interpreted as soft state-tying, since it forces the HMM emission probabilities to form a finite set of clusters. The effect of regularization has been demonstrated on the RM task by improving overfitted triphone models. On a WSJ non-native speaker adaptation task with limited training data, regularization outperforms feature space transformations. \nEqn. (4) may also be interpreted from a perspective of Bayesian inference: the term \\nu \\cdot \\Omega plays the role of setting a prior distribution on the HMM parameters to be estimated. Hence, the use of a mixture model for \\Omega is equivalent to using a special kind of prior in the framework of MAP estimation for HMMs (Gauvain, 1994). \n\nReferences \n\nL.R. Bahl, F. Jelinek, R.L. Mercer, (1983) A maximum likelihood approach to continuous speech recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 5, No. 2, 179-190. \nL.R. Bahl, P.V. de Souza, P.S. Gopalakrishnan, D. Nahamoo, M.A. Picheny, (1991) Context dependent modeling of phones in continuous speech using decision trees. Proc. DARPA speech and natural language processing workshop, 264-270. \nE.B. Baum, D. Haussler, (1989) What size net gives valid generalization? Neural Computation, 1:151-160. \nY. le Cun, J. Denker, S. Solla, R.E. Howard, L.D. Jackel, (1990) Optimal brain damage. Advances in Neural Information Processing Systems 2, San Mateo, CA, Morgan Kaufmann. \nJ.L. Gauvain, C.-H. Lee, (1994) Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions Speech and Audio Proc., Vol. 2, 2:291-298. \nK.J. Lang, A.H. Waibel, G.E. 
Hinton, (1990) A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43. \nCh. Neukirchen, D. Willett, S. Eickeler, S. Muller, (1998) Exploiting acoustic feature correlations by joint neural vector quantizer design in a discrete HMM system. Proc. ICASSP'98, 5-8. \nS.J. Nowlan, G.E. Hinton, (1992) Simplifying neural networks by soft weight-sharing. Neural Computation, 4:473-493. \nD.C. Plaut, S.J. Nowlan, G.E. Hinton, (1986) Experiments on learning by back-propagation. Technical report CMU-CS-86-126, Carnegie-Mellon University, Pittsburgh, PA. \nG. Rigoll, Ch. Neukirchen, J. Rottland, (1996) A new hybrid system based on MMI-neural networks for the RM speech recognition task. Proc. ICASSP'96, 865-868. \nG. Rigoll, Ch. Neukirchen, (1997) A new approach to hybrid HMM/ANN speech recognition using mutual information neural networks. Advances in Neural Information Processing Systems 9, Cambridge, MA, MIT Press, 772-778. \nJ. Rottland, Ch. Neukirchen, G. Rigoll, (1998) Speaker adaptation for hybrid MMI-connectionist speech recognition systems. Proc. ICASSP'98, 465-468. \nS.J. Young, (1992) The general use of tying in phoneme-based HMM speech recognizers. Proc. ICASSP'92, 569-572. \nS.J. Young, P.C. Woodland, (1993) The use of state tying in continuous speech recognition. Proc. Eurospeech'93, 2203-2206. \n", "award": [], "sourceid": 1519, "authors": [{"given_name": "Christoph", "family_name": "Neukirchen", "institution": null}, {"given_name": "Gerhard", "family_name": "Rigoll", "institution": null}]}