TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol. 15, No. 2, August 2015, pp. 368 ~ 372
DOI: 10.11591/telkomnika.v15i2.8374
ISSN: 2302-4046

Received April 17, 2015; Revised June 29, 2015; Accepted July 14, 2015
Phonemes Classification Using the Spectrum
Ahmed El Ghazi*1, Cherki Daoui2
Laboratory of Information Processing and Decision Support, Faculty of Sciences and Techniques,
Béni Mellal, Morocco
*Corresponding author, e-mail: hmadgm@yahoo.fr1, daouic@yahoo.com2
Abstract
In this work, we present an automatic speech classification system for the Tamazight phonemes. We rely on the spectrum representation of the speech signal to model these phonemes, using an oral database of Tamazight phonemes. To test the system's performance, we calculate the classification rate. The obtained results are satisfactory in comparison with the reference database and the quality of the speech files.
Keywords: phonemes, spectrum, Gaussian mixture model, Tamazight
Copyright © 2015 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
Some studies show that Automatic Speech Recognition (ASR) systems still lack performance when compared to human listeners in conditions that involve additive noise [1]. Such systems can improve performance in those conditions through additional levels of language and context modeling. However, this contextual information is most effective when the underlying phoneme sequence is sufficiently accurate. Hence, robust phoneme classification is a very important stage of ASR [2-4]. Accordingly, the front-end features must be selected carefully to ensure that the best phoneme sequence is predicted. In this paper, we investigate the performance of the speech spectrum and the Gaussian mixture.
Phoneme classification is commonly used for this purpose. We are particularly interested in Moroccan Tamazight phonemes, and we have selected spectral features. For instance, the Mel Frequency Cepstral Coefficients (MFCC) are the most popular features used to model the speech signal; they model well the human perception and production of speech. In this work, we use the Gaussian model to classify the Tamazight phonemes. In this context, we took a population of phonemes that construct the digits from one to ten, and we model each phoneme by feature vectors. Phoneme classification by Gaussian mixture makes it possible to group the acoustic vectors that have the same characteristics. This classification can be used in a hybrid scheme with a hidden Markov model in particular applications [4-7]. The obtained classification rate varies according to the phoneme type and its context.
This paper is organized in the following manner. In Section 2, we give a description of the speech spectrum; Section 3 describes the Gaussian mixture for speech classification; Section 4 presents the database, and Section 5 presents the experimental results. Finally, the study ends with a conclusion.
2. Speech Spectrum
The speech spectrum is a representation of the signal in three-dimensional space: the X axis presents time, the Y axis frequency, and the Z axis the level of each frequency. This analysis is obtained by using a filter bank and the Fourier transform. The following figures present an example of a spectrum for the word 'yan' (number one); the black levels represent a concentration of frequencies (Figure 1 and Figure 2).
Figure 1. Speech spectrum for the word 'yan'

Figure 2. Spectrum of the word 'SIN' and phoneme boundaries
This spectrum varies according to the signal form. The speech spectrum is used to improve phoneme boundaries by detecting the areas of frequency concentration (Figure 2). This segmentation is approximate; it is difficult to determine the exact boundaries in a speech signal.
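A spectrum like the one in Figure 1 can be sketched directly from the description above: frame the signal, apply the Fourier transform per frame, and read time, frequency and level off the resulting array. This is a minimal NumPy sketch; the 25 ms window matches Section 4, but the Hamming window and 200-sample hop are assumptions, not the paper's exact front end.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=200, n_fft=512):
    """Magnitude spectrum per frame: rows are time (X axis), columns are
    frequency (Y axis), cell values are the level of each frequency (Z axis)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        frames.append(np.abs(np.fft.rfft(frame, n_fft)))
    return np.array(frames)

# Toy check: a pure 440 Hz tone at 16 kHz concentrates energy in one
# frequency bin, the kind of dark band the figures show for formants.
fs = 16000
t = np.arange(fs) / fs
mag = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = mag.mean(axis=0).argmax()   # expected near 440 * 512 / 16000
```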
3. Gaussian Mixture
The Gaussian mixture model (GMM) is an effective description of data sets comprising clusters of vectors that are more complex than a simple Gaussian distribution. A Gaussian mixture model [1, 7], [9-11] is defined as:

p(x) = ∑_{i=1}^{N} p_i g(x; µ_i, Σ_i)

where g(x; µ_i, Σ_i) is the Gaussian probability density function with mean µ_i and covariance Σ_i, x is a random D-dimensional vector, i = 1, ..., N, and the p_i are weights which describe the relative likelihood of classes being generated from each of the clusters and must satisfy ∑_{i=1}^{N} p_i = 1, where N is the number of classes.
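To make the definition concrete, the mixture density can be evaluated directly. A minimal sketch assuming diagonal covariance matrices; the two-component weights, means and variances below are invented for illustration, since the paper does not fix them here.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """D-dimensional Gaussian g(x; mu, Sigma) with diagonal covariance `var`."""
    d = len(mu)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(var))
    return norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))

def gmm_density(x, weights, means, variances):
    """p(x) = sum_i p_i g(x; mu_i, Sigma_i); weights must sum to one."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Two equally weighted components in D = 2.
p = gmm_density(np.zeros(2),
                weights=[0.5, 0.5],
                means=[np.zeros(2), np.ones(2)],
                variances=[np.ones(2), np.ones(2)])
```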
In order to generate the GMMs from the phoneme training sequence, we employed the Expectation-Maximization (EM) algorithm [10], [12-13]. The EM algorithm for maximum-likelihood estimation of the parameters of a GMM is an iterative procedure in which each iteration consists of two steps: an estimation step (E-step), followed by a maximization step (M-step). In the E-step, the likelihoods, means and covariance matrices of the GMMs are estimated depending on the observation sequence. In the M-step, the new estimates of the parameters of the GMMs are computed.
Suppose that we have a sample of S points x_1, x_2, ..., x_S, j = 1, ..., S, drawn from a set of points which are assumed to lie in N clusters. We initialize N Gaussians with probabilities p_1 = p_2 = ... = p_N = 1/N; means µ_1, µ_2, ..., µ_N, which can either be random or set equal to N of the data points with a small perturbation; and covariance matrices Σ_1, Σ_2, ..., Σ_N, set equal to the identity matrix [1].
In the E-step we compute the total likelihood:

L = ∏_{j=1}^{S} ∑_{i=1}^{N} p_i g(x_j; µ_i, Σ_i)

where g is the Gaussian probability density function, and the normalized likelihoods:
p_{ij} = p_i g(x_j; µ_i, Σ_i) / ∑_{k=1}^{N} p_k g(x_j; µ_k, Σ_k), i = 1, ..., N; j = 1, ..., S

The notional counts:

S_i = ∑_{j=1}^{S} p_{ij}, i = 1, ..., N

The notional means:

µ_i' = (1/S_i) ∑_{j=1}^{S} p_{ij} x_j, i = 1, ..., N

And the notional sums of squares:

Σ_i' = (1/S_i) ∑_{j=1}^{S} p_{ij} (x_j − µ_i')(x_j − µ_i')^T, i = 1, ..., N

In the M-step, we compute the new values of the parameters of the Gaussian model as follows:

p_i = S_i / S, µ_i = µ_i', Σ_i = Σ_i'

where i = 1, 2, ..., N.
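The two steps above can be sketched in NumPy. This is an illustrative implementation with diagonal covariances, not the authors' code; one call performs the E-step (normalized likelihoods, notional counts, means and sums of squares) and returns the M-step updates.

```python
import numpy as np

def em_step(X, weights, means, variances):
    """One EM iteration for a diagonal-covariance GMM on data X (S x D)."""
    S, D = X.shape
    N = len(weights)
    # E-step: normalized likelihoods p_ij (responsibilities).
    resp = np.zeros((N, S))
    for i in range(N):
        norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.prod(variances[i]))
        diff = X - means[i]
        resp[i] = weights[i] * norm * np.exp(
            -0.5 * np.sum(diff ** 2 / variances[i], axis=1))
    resp /= resp.sum(axis=0)        # divide by sum_k p_k g(x_j; mu_k, Sigma_k)
    counts = resp.sum(axis=1)       # notional counts S_i
    # M-step: new weights, means and variances.
    new_weights = counts / S
    new_means = (resp @ X) / counts[:, None]
    new_vars = np.array([
        (resp[i][:, None] * (X - new_means[i]) ** 2).sum(axis=0) / counts[i]
        for i in range(N)])
    return new_weights, new_means, new_vars

# Toy run: two well-separated 1-D clusters; EM should recover means near ±3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (100, 1)), rng.normal(3, 1, (100, 1))])
w, mu, var = np.array([0.5, 0.5]), np.array([[-1.0], [1.0]]), np.ones((2, 1))
for _ in range(30):
    w, mu, var = em_step(X, w, mu, var)
```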
The Gaussian model can be assimilated to a hidden Markov model with one state (Figure 3). Each state represents a phoneme with n Gaussian components. The observations of this state are divided between the Gaussian components. In the classification step, we calculate the likelihood between the input features and all the Gaussian components.
Figure 3. Gaussian mixture with 2 components
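The classification step described above reduces to scoring the input features against each reference model and keeping the most likely one. A toy sketch, with one diagonal Gaussian per phoneme standing in for a full mixture; the class labels, means and variances below are invented for illustration.

```python
import numpy as np

def log_gaussian(x, mu, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def classify(x, models):
    """models: {phoneme: (mean, variance)}; returns the best-scoring phoneme."""
    return max(models, key=lambda ph: log_gaussian(x, *models[ph]))

# Hypothetical reference models for three phonemes.
models = {
    "/S/":  (np.array([0.0, 1.0]),  np.array([1.0, 1.0])),
    "/SS/": (np.array([0.5, 1.2]),  np.array([1.0, 1.0])),
    "/A/":  (np.array([3.0, -1.0]), np.array([1.0, 1.0])),
}
label = classify(np.array([2.8, -0.9]), models)   # closest to the /A/ model
```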
4. Database
The training data comprised a small vocabulary of ten isolated digits in Tamazight (from one to ten) spoken by ten speakers (5 males and 5 females), and test data spoken by five speakers. The produced signals are sampled at 16 kHz. The speech data was then windowed (25 ms) and 512-point FFTs were computed with a 256-point (12.5 ms) advance between frames. The FFT coefficients were binned into 12 Mel-spaced values to produce 12-dimensional feature vectors [1]. Table 1 presents the training database and Table 2 presents the list of phonemes used.
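The front end just described can be approximated as follows. This sketch assumes a standard triangular mel filter bank and a 200-sample hop (12.5 ms at 16 kHz); those layout details are common conventions assumed here, not specified by the paper.

```python
import numpy as np

FS, FRAME, HOP, NFFT, NMEL = 16000, 400, 200, 512, 12

def mel(f):
    """Hz to mel, using the common 2595*log10 formula."""
    return 2595 * np.log10(1 + f / 700)

def mel_filterbank():
    """NMEL triangular filters over the rfft bins, mel-spaced from 0 to FS/2."""
    edges = np.linspace(mel(0), mel(FS / 2), NMEL + 2)
    hz = 700 * (10 ** (edges / 2595) - 1)
    bins = np.floor((NFFT + 1) * hz / FS).astype(int)
    fb = np.zeros((NMEL, NFFT // 2 + 1))
    for i in range(NMEL):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def features(signal):
    """25 ms Hamming frames, 512-point FFT, 12 mel-binned values per frame."""
    fb = mel_filterbank()
    out = []
    for start in range(0, len(signal) - FRAME + 1, HOP):
        frame = signal[start:start + FRAME] * np.hamming(FRAME)
        mag = np.abs(np.fft.rfft(frame, NFFT))
        out.append(fb @ mag)          # 12-dimensional feature vector
    return np.array(out)

feats = features(np.random.default_rng(1).standard_normal(FS))
```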
Table 1. Training database

Numbers  Phonetic transcription  Tifinagh transcription
1        Y A N                   ⵢⴰⵏ
2        S I N                   ⵙⵉⵏ
3        C R A DD                ⴽⵕⴰⴹ
4        K O Z                   ⴽⴽⵓⵣ
5        S MM U S                ⵙⵎⵎⵓⵙ
6        SS DD E SS              ⵚⴹⵉⵚ
7        SS A                    ⵙⴰ
8        TT A M                  ⵜⴰⵎ
9        T Z A                   ⵜⵥⴰ
Table 2. Phonemes list used

Phoneme     Context            Symbol
/Y/         Y ̴ A ̴ N           ⵢ
/I/, /N/    S ̴ I ̴ N           ⵉ, ⵏ
/C/, /R/    C ̴ R ̴ A ̴ DD       ⴽ, ⵕ
/K/, /O/    K ̴ O ̴ Z           ⴽⴽ, ⵓ
/S/, /U/    S ̴ MM ̴ U ̴ S       ⵙ, ⵓ
/DD/        SS ̴ DD ̴ E ̴ SS     ⴹ
/A/, /SS/   SS ̴ A             ⵙ, ⴰ
/TT/, /M/   TT ̴ A ̴ M          ⵜ, ⵎ
/T/, /Z/    T ̴ Z ̴ A           ⵜ, ⵥ
5. Results of Phonemes Classification
The classification system assigns each phoneme to its Gaussian component. In the classification step, we calculate the likelihood between the phoneme features in the input vectors and all the reference Gaussian components. The obtained results are shown in Table 3.
Table 3. Obtained results

Phonemes  Classification rate  Error rate
/Y/       74%                  26%
/I/       72%                  28%
/C/       78%                  22%
/K/       62%                  38%
/S/       79%                  21%
/DD/      80%                  20%
/A/       75.5%                24.5%
/TT/      72.66%               27.34%
/T/       70%                  30%
/N/       81%                  19%
/R/       82%                  18%
/O/       78.5%                21.5%
/U/       68%                  32%
/SS/      70.5%                29.5%
/M/       72%                  28%
/Z/       76%                  24%
The error rate represents the classification error that illustrates the ambiguity between phonemes. At the acoustic level, there are common characteristics between speech units. Table 4 presents some of these ambiguities.
Table 4. Some ambiguities between phonemes

Phonemes             i=/S/, j=/SS/  i=/T/, j=/TT/  i=/O/, j=/U/  i=/C/, j=/K/
Ambiguity rate Ti/j  20%            24%            20%           25%
Ambiguity rate Tj/i  27%            25%            21%           26%
The obtained results illustrate the important ambiguity between phonemes. This shows that, at the acoustic level, it is difficult to distinguish between phonemes. To remedy this problem, a linguistic study is needed to integrate new parameters that permit determining the phoneme boundaries carefully. Ambiguity, in general, takes place between neighbouring phonemes or phonemes that have nearly the same pronunciation. In this context, there is an ambiguity between /T/ and /TT/, /S/ and /SS/, and /O/ and /U/. There is also another ambiguity between phonemes that are close in the speech signal, for example between /A/ and /DD/ in the number 'CRAD' (three). This ambiguity is due to the interaction of the acoustic features.
6. Conclusion
Phoneme classification is a method that can classify speech units based on acoustic features. This classification can be used as a classifier for the hidden Markov model or a neural network; it permits improving the recognition rate and reducing the ambiguity of speech units. The Gaussian mixture is the most popular model used in classification. It is based on the three-dimensional presentation of data via the mean vectors and covariance matrices, which model the variation of the speech signal well. The obtained results show that there is ambiguity between phonemes and there are no exact boundaries in the speech signal. To resolve this problem, a special language study for each dialect must be made to take into account other characteristics of the speech signal.
References
[1] Y Zhang, M Alder, R Togneri. Using Gaussian Mixture Modeling for Phoneme Classification. Centre for Intelligent Information Processing Systems, Department of Electrical and Electronic Engineering, The University of Western Australia. 2003.
[2] S Jamoussi. Méthodes statistiques pour la compréhension automatique de la parole. Ecole doctorale IAEM Lorraine. 2004.
[3] T Pellegrini, R Durée. Suivi de la voix parlée grâce au modèle de Markov caché. 1989.
[4] A Cornijeol, L Miclet. Apprentissage Artificielle: méthodes et concepts. 1988.
[5] SJ Young, PC Woodland. The use of state tying in continuous speech recognition. Proc. ESCA Eurospeech 1993. Berlin, Germany. 1993; 3: 2203-2206.
[6] H Bourlard, CJ Wellekens, H Ney. Connected digit recognition using vector quantization. Proc. IEEE Int. Conf. ASSP, San Diego, CA. 1984.
[7] RM Gray. Vector quantization. IEEE ASSP Mag. 1984; 1(2): 4-29.
[8] JL Gauvain, CH Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. on Speech and Audio Processing. 1994; 2(2): 291-298.
[9] D Jouvet, M Dautremont, Q Gossart. Comparaison des multi modèles et des densités multigaussiennes pour la reconnaissance de la parole par des modèles de Markov. Actes des 20èmes JEP. 1994: 159-164.
[10] R André-Obrecht. A new statistical approach for the automatic segmentation of continuous speech signal. IEEE Trans. on Acoustics, Speech, Signal Processing. 1988; 36(1).
[11] F Jelinek. Continuous speech recognition by statistical methods. Proceedings of IEEE. 1976; 64(4): 532-556.
[12] LA Liporace. Maximum likelihood estimation for multivariate observations of Markov sources. IEEE Trans. IT. 1982; 28(5): 729-734.
[13] M Hwang, X Huang. Subphonetic modeling with Markov model. Proceedings IEEE ICASSP-92, San Francisco, CA. 1992; 1: 33-36.