182 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 39, NO. 1, FEBRUARY 2009
A Constraint-Based Evolutionary Learning Approach
to the Expectation Maximization for Optimal
Estimation of the Hidden Markov Model
for Speech Signal Modeling
Shamsul Huda, John Yearwood, and Roberto Togneri, Senior Member, IEEE
Abstract—This paper attempts to overcome the tendency of the expectation–maximization (EM) algorithm to locate a local rather than global maximum when applied to estimate the hidden Markov model (HMM) parameters in speech signal modeling. We propose a hybrid algorithm for estimation of the HMM in automatic speech recognition (ASR) using a constraint-based evolutionary algorithm (EA) and EM, the CEL-EM. The novelty of our hybrid algorithm (CEL-EM) is that it is applicable for estimation of constraint-based models with many constraints and large numbers of parameters (which use EM) like the HMM. Two constraint-based versions of the CEL-EM with different fusion strategies have been proposed using a constraint-based EA and the EM for better estimation of the HMM in ASR. The first uses a traditional constraint-handling mechanism of EA. The other version transforms a constrained optimization problem into an unconstrained problem using Lagrange multipliers. The fusion strategies for the CEL-EM use a staged-fusion approach, where EM is plugged in with the EA periodically, after the EA has executed for a specific period of time, to maintain the global sampling capabilities of the EA in the hybrid algorithm. A variable initialization approach (VIA) has been proposed using a variable segmentation to provide a better initialization for the EA in the CEL-EM. Experimental results on the TIMIT speech corpus show that CEL-EM obtains higher recognition accuracies than the traditional EM algorithm as well as a top-standard EM (VIA-EM, constructed by applying the VIA to EM).

Index Terms—Constraint-based evolutionary algorithm (EA), expectation maximization (EM), fusion strategies, hidden Markov model (HMM), hybrid algorithm, Lagrange multiplier (LM), signal modeling and classification, speech recognition.

Manuscript received January 25, 2008; revised April 29, 2008 and July 18, 2008. First published December 9, 2008; current version published January 15, 2009. This paper was recommended by Associate Editor Y. Soon.

S. Huda and J. Yearwood are with the Center for Informatics and Applied Optimization, University of Ballarat, Ballarat, Vic. 3350, Australia (e-mail: shuda@ballarat.edu.au; shuda9203@yahoo.com; j.yearwood@ballarat.edu.au).

R. Togneri is with the Centre for Intelligent Information Processing System, School of Electrical, Electronic and Computer Engineering, University of Western Australia, Perth, W.A. 6009, Australia (e-mail: roberto@ee.uwa.edu.au).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSMCB.2008.2004051

1083-4419/$25.00 © 2008 IEEE

I. INTRODUCTION

THE hidden Markov model (HMM) is the most successful and widely used statistical modeling technique for speech signal modeling in automatic speech recognition (ASR) [1], [2] and also signal classification. This is because the HMM has a powerful ability to model the temporal nature of speech signals statistically as well as the ability to represent arbitrarily complex probability density functions. In a Bayesian classification scenario (for signal classification and recognition), a model for a speech signal provides the mapping from the features of the instances of a particular phoneme class (a basic theoretical unit of speech sound) to the probabilistic parameterized model (HMM). The HMM provides the posterior probability of the speech signal given the phoneme classes/signal classes in signal classification. Therefore, the success of the recognizer/classifier for speech depends heavily on how precisely the estimated HMM can represent the underlying phoneme or signal classes in the speech data. In a real recognition task, feature vectors from different instances of the same phoneme vary largely due to variations among speakers, variation in the emotion of the speaker, and changes in the environment. Therefore, it is very difficult to find an appropriate estimate of the parameters of the HMM that can precisely represent all the phonemes in the training speech data.

The standard method to estimate the parameters of HMM-based acoustic models in ASR systems is the Baum–Welch (expectation maximization, EM) [1]–[3] algorithm. The Baum–Welch (EM) estimation approach is attractive because it can approximate the underlying distribution from a set of observed data that has some missing or hidden components. In particular, while modeling a speech signal using an HMM, features from the speech signal are observed, but the state sequence of the HMM which generated the signal remains hidden. In this case, optimization of the likelihood function is usually analytically intractable [3]. However, the EM algorithm simplifies the likelihood function by considering some additional variables for the hidden components of the data and initial values for those variables so that the likelihood function can be optimized. EM estimates the parameters of the HMM in an iterative manner, which makes it computationally efficient and helps it converge fast, because EM guarantees an increment in the likelihood function at each iteration [1]–[4] of its estimation procedure.

Unfortunately, the estimate of the HMM parameters computed by the Baum–Welch (EM) approach is not always the best [1], [2], and thereby, the use of the model estimated by EM may
lower the recognition accuracy of ASR systems. The reason is that the EM algorithm is strongly dependent on the selection of the initial values of the model parameters and is guaranteed to produce a local rather than a global maximum of the likelihood function [1], [2], [4]. This gives a nonoptimized estimate of the parameters of the HMM and consequently lowers the recognition accuracy. Two important research questions therefore arise: How do we choose initial estimates of the HMM parameters, and how can we escape from a local maximum point if EM is found to be stuck there, so as to obtain a better estimate of the HMM parameters and higher recognition accuracy for the ASR system? In this paper, we focus on these drawbacks of the EM algorithm for the estimation of HMM parameters in signal classification or ASR systems.

A global search method such as an evolutionary algorithm (EA) could be used to avoid the local maximum problem of EM. The EA can explore the search space without using any knowledge about the underlying problem structure and is less likely to be trapped in local maxima. However, it is well known that the EA is inefficient for high-dimensional optimization problems. The application of a hybrid algorithm using EA and local search will therefore be investigated.

Recently, several investigators have applied hybrid algorithms using neighborhood search methods such as tabu search (TS) [5], [6] and simulated annealing (SA) [7] in combination with EM to overcome the problem of EM for the continuous density HMM (CDHMM). TS requires a huge amount of memory to maintain the list of already visited solutions due to the high number of variables in a CDHMM. A move attribute-based method could be used to reduce the memory requirement, but this makes TS too restrictive. Empirical studies show that in a high-dimensional search space, a limited number of iterations with restrictive TS makes it dependent on the initial point. An inappropriate choice of initial point results in a failure to find an optimal solution. Both SA and TS work on single-candidate models. In this context, a hybrid algorithm that combines a population-based algorithm (EA) and local search (EM) may be more effective. In the hybrids of EA [8]–[13], the probability of choosing an inappropriate initial point is minimized due to the use of a large number of initial points of the EA distributed over the whole search space. Therefore, hybrids of EA [8]–[13] can explore the search space more extensively than hybrids of single-candidate-model-based approaches (with SA and TS).

In the literature, the authors in [8]–[10] and [14] have used EA in combination with EM for optimal estimation of a Gaussian mixture model (GMM) in a nonlinear classification problem and in unsupervised clustering. These hybrid algorithms [8]–[10], [14] ignore the constraints of the GMM and assume equal mixture weights, which may fail in many practical situations where the mixture weights of the individual mixtures of the GMM are not the same. Therefore, these algorithms [8]–[10], [14] cannot be applied to constraint-based models. When a hybrid of EA and EM is applied to a constraint-based model like the HMM, the problem context is changed. The HMM combines several GMMs into a single model, which constitutes a large number of parameters and mixture constraints aggregated from the GMMs. The HMM also has transition and observation probability constraints for each state. These constraints must be satisfied while estimating the HMM parameters. When EM is applied separately, the constraints are automatically satisfied [1]. However, the EA is stochastic and can violate the constraints of the HMM when applied within the hybrid algorithm. Therefore, the algorithms proposed in [8]–[10] and [14] cannot be applied to the HMM.

Moreover, in the literature [8]–[10], hybrids of EA and EM use a pipelining fusion strategy [15]–[17] with the Lamarckian viewpoint [16]–[18], where the EA provides the initial points for EM at every generation of the EA. A pipelining strategy is conceptually similar to segmental K-means segmentation [19] in the context of varying the EM initial point. However, the optimization problem in EM for the HMM is very high-dimensional, and the surface of the optimization function in ASR is very complex and includes many local maxima [1], [2]. Empirical studies [18], [20] show that a pipelining mechanism is not suitable for high-dimensional functions, since it reduces the global sampling capabilities of the EA and cannot maintain the diversity in the population [20] (diversity in the EA population is essential for high-dimensional function optimization [21]).

The power of the global sampling capabilities of the EA is due to its schema processing capabilities (which represent the hyperplane partition) [22], [23]. In fact, far more hyperplanes are sampled simultaneously than the actual number of chromosomes contained in the population of the EA, by means of the implicit parallelism of the EA. This provides the global search capabilities of the EA [22], [23]. However, the use of a pipelining strategy [8], [9], [14] changes the genetic information of the chromosomes at each generation of the EA, which results in a loss of the statistical information about the hyperplane partition implicitly contained in the population as well as in the inherited schema [15], [18], [24]. This, in turn, decreases the global sampling capabilities of the EA [18], [22]–[24]. The disadvantages of pipelining hybrid EAs have also been discussed in many papers, including those in [15]–[18], [20], and [24].

In this paper, we therefore propose a constraint-based evolutionary learning approach to EM (CEL-EM) that hybridizes a constraint-based EA and the EM for optimal estimation of the HMM for ASR systems. The novelty of our hybrid algorithm (CEL-EM) is that it is applicable for estimation of constraint-based models with many constraints and large numbers of parameters like the HMM.

Our contribution also includes the following investigations, hitherto unreported in the literature.

1) Different constraint-based versions of CEL-EM have been developed and formulated for the HMM in ASR systems to avoid the local maxima problem of EM in the HMM estimation process.
2) Combinations of constraint-based versions of and fusion strategies for CEL-EM have also been developed and experimentally verified to find a suitable constraint-based method and fusion strategy for the estimation of the HMM in ASR systems.
3) In CEL-EM, a variable initialization approach (VIA) has been proposed using a variable segmentation to provide a better initialization for the EA and also for EM. Experimental
results of CEL-EM have been compared with a standard EM as well as with VIA-EM, which has been constructed by applying the VIA to EM.

Two different constraint-based versions of CEL-EM have been developed. The first constraint-based version of CEL-EM follows a penalization method [25] similar to the traditional constraint-handling technique used in the EA. The second version of CEL-EM uses a Lagrange multiplier (LM)-based technique [26], [27] to handle the constraints. The traditional constraint-handling method [25] of EA makes different levels in a particular constraint depending on the values of the constraint. This requires a large number of optimal static penalty coefficients for the many HMM constraints, which are hard to find and increase the number of parameters to be optimized [25]. Therefore, in the second constraint-handling approach, we have used LMs. LMs can be used to transform a constrained optimization problem into an unconstrained problem [26], [27]. The LM approach also adds some additional parameters to be optimized, but the total number of multipliers is fixed for the HMM. The only concern here is to find optimal values for the LMs. We have proposed an evolutionary approach to find the values of the LMs. The LM approach has also been found to provide better recognition than the traditional method.

In addition, we have proposed two fusion strategies for CEL-EM where Lamarckian evolution [18] is executed periodically after the execution of Darwinian evolution [22], [23], [28] for a specific period of time, making the two stages of the hybrid algorithm a staged fusion [15], [18], [20]. In the staged fusion of CEL-EM, EM is executed periodically after the execution of the EA for a specific period of time (a Darwinian evolution [22], [23], [28]), thus utilizing the local knowledge of EM as well as minimizing the loss of hyperplane partition information in the EA and maintaining the global sampling capabilities of the EA. The first fusion strategy of CEL-EM uses a simple staged-fusion approach, and the second strategy applies a biased-crossover technique [29], [30] with the staged fusion. The staged-fusion strategy of CEL-EM requires a careful choice of evolutionary operators for the EA that can produce feasible solutions in the offspring population of the EA, which, in turn, requires a feasible initial population created by problem-specific heuristics [25] (not randomly generated). Therefore, in CEL-EM, we have proposed a variable segmentation technique to create the initial population for the evolutionary process of CEL-EM. The recognition experiments have shown that the second fusion strategy with the LM-based constrained version outperforms all other combinations of fusion strategies and constraint-based versions of CEL-EM.

The remainder of this paper is organized as follows. In the next section, a brief description of the HMM, its parameters, and its constraints is given. Section III briefly describes the EM algorithm and its problems in estimating the HMM parameters. The CEL-EM with fusion strategies for the estimation of the HMM parameters is described in Section IV. Section V discusses the experimental procedure and results. The significance of the results is analyzed in Section VI. Conclusions of this paper are given in the last section.

II. HMM FOR SPEECH SIGNAL MODELING, ITS PARAMETERS, AND CONSTRAINTS

The HMM is a doubly stochastic process. One stochastic process comprises the distribution of the observations at each state, which is a multimodal Gaussian mixture for a CDHMM. The other stochastic process involves the transitions between the HMM states, which are the transition probabilities $A = \{a_{ij}\}$. The HMM parameters can be represented as $\lambda = \{c_{jn}, \mu_{jn}, \Sigma_{jn}, a_{ij}\}$, where $i, j = 1, 2, \ldots, K$ ($K$ = total number of states) and $n = 1, 2, \ldots, M$ ($M$ = total number of mixtures), with $c_{jn}$ = mixture weight, $\mu_{jn}$ = mean vector, and $\Sigma_{jn}$ = covariance for the $j$th state and $n$th mixture. The probability of the $t$th observation at the $j$th state is considered as $b_j(O_t) = \sum_{n=1}^{M} c_{jn} b_j^n(O_t)$, where $b_j^n(O_t) = (1/\sqrt{(2\pi)^D |\Sigma_{jn}|}) \exp(-\frac{1}{2}(O_t - \mu_{jn})' \Sigma_{jn}^{-1} (O_t - \mu_{jn}))$ and $D$ = dimension of $O_t$. The constraints of the HMM are $\sum_{n=1}^{M} c_{jn} = 1$, $\sum_{j=1}^{K} a_{ij} = 1$, and $\sum_{t=1}^{T} b_j(O_t) = 1$, where $T$ = total number of observations in an instance. In a Bayesian classification scenario, the model for speech signals provides the posterior probability of the unknown signal data or phoneme, where each signal class is represented by a separate HMM. Therefore, appropriate estimation of the HMM parameters is a primary concern in signal classification/phoneme recognition.

Maximum likelihood (ML) estimation is the standard method to estimate the parameters of a probabilistic model. The ML estimate of the HMM parameters is $\hat{\lambda}_{\mathrm{ML}} = \arg\max_{\lambda} \{\log P(O|\lambda)\}$, where $\log P(O|\lambda)$ is the log-likelihood of the observed data. $P(O|\lambda)$ can be written as

$$P(O|\lambda) = P(O, q|\lambda)/P(q|O, \lambda) \qquad (1)$$

where $q$ = HMM state sequence. Taking the logarithm of (1)

$$\log\left[P(O|\lambda)\right] = \log\left[P(O, q|\lambda)\right] - \log\left[P(q|O, \lambda)\right]. \qquad (2)$$

If we knew the state sequence $q$, the ML estimate of the HMM parameters $\hat{\lambda}_{\mathrm{ML}}$ could be computed by taking the derivative of (2) with respect to $\lambda$ and then equating the derivative to zero. However, we do not know the state sequence $q$ which generated the observed data $O$. Therefore, there is no closed-form solution from the derivative, and a direct derivative method of ML estimation will not work to estimate the HMM parameters. This estimation problem for the HMM can be solved by the EM algorithm [1], [3].

III. EM ALGORITHM TO ESTIMATE HMM PARAMETERS AND PROBLEM OF EM

The EM algorithm [1], [3] is an iterative method to solve the estimation problem of the HMM parameters. EM simplifies the log-likelihood $L = \log P(O|\lambda)$ of the observed data $O$ in terms of the expected value of the log-likelihood of the complete data $(O, q)$, $Q(\lambda, \lambda^k) = \sum_{\forall q \in Q} P(q|O, \lambda^k) \log[P(O, q|\lambda)]$, by assuming a set of variables for the hidden states $q$ (which generate the observed data $O$) and initial values of the model parameters $\lambda^k$. Then, it maximizes $Q(\lambda, \lambda^k)$ to estimate the parameters. However, maximization of $Q(\lambda, \lambda^k)$ maximizes the likelihood
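To make the decomposition in (1) and the auxiliary function $Q(\lambda, \lambda^k)$ concrete, the following sketch enumerates all state sequences of a toy discrete HMM by brute force. The numbers and helper names are purely illustrative assumptions (the paper itself works with CDHMMs on TIMIT); the point is only to verify that $P(O, q|\lambda)/P(q|O, \lambda)$ equals $P(O|\lambda)$ for every state sequence $q$, as (1) states.

```python
import itertools
import math

# Toy discrete HMM (illustrative numbers, not from the paper):
# K = 2 states, 2 observation symbols, T = 3 observations.
pi = [0.6, 0.4]                  # initial state probabilities
A = [[0.7, 0.3], [0.4, 0.6]]     # transition probabilities a_ij (each row sums to 1)
B = [[0.9, 0.1], [0.2, 0.8]]     # emission probabilities b_j(o)
O = [0, 1, 0]                    # observed symbol sequence

def joint(q):
    """P(O, q | lambda) for a complete state sequence q."""
    p = pi[q[0]] * B[q[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[q[t - 1]][q[t]] * B[q[t]][O[t]]
    return p

# P(O | lambda): marginalize the joint over all K^T state sequences.
seqs = list(itertools.product(range(2), repeat=len(O)))
likelihood = sum(joint(q) for q in seqs)

# Eq. (1): P(O|lambda) = P(O, q|lambda) / P(q|O, lambda) holds for ANY q.
for q in seqs:
    posterior = joint(q) / likelihood          # P(q | O, lambda)
    assert math.isclose(joint(q) / posterior, likelihood)

# Auxiliary function of Section III, evaluated at lambda = lambda^k:
# Q(lambda, lambda^k) = sum_q P(q|O, lambda^k) log P(O, q|lambda)
Q_self = sum((joint(q) / likelihood) * math.log(joint(q)) for q in seqs)
```

Since $\log P(O|\lambda) = Q(\lambda, \lambda) + H(P(q|O, \lambda))$ and entropy is nonnegative, `Q_self` never exceeds `log(likelihood)`; EM's E-step builds this $Q$ and its M-step maximizes it over $\lambda$, which is what underlies the monotone likelihood increase cited in the Introduction.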