TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol. 13, No. 1, January 2015, pp. 137 ~ 144
DOI: 10.11591/telkomnika.v13i1.6902
ISSN: 2302-4046

Received September 11, 2014; Revised October 23, 2014; Accepted November 15, 2014

Voice Activity Robust Detection of Noisy Speech in Toeplitz

Jingfang Wang
School of Information Science and Engineering, Hunan International Economics University, Changsha, China, postcode: 410205
email: matlab_bysj@126.com

Abstract
A Toeplitz de-noising method using the maximum eigenvalue is proposed for voice activity detection in low-SNR scenarios. The method uses the autocorrelation sequence of the speech-band spectrum to construct a new symmetric Toeplitz matrix and computes its largest eigenvalue; double decision thresholds on the largest eigenvalue are then applied in the decision framework. Simulation results show that the presented algorithm is more effective in distinguishing speech from noise and is more robust under various noisy environments. Compared with the recently proposed method of recurrence rate analysis, this algorithm shows a lower wrong-decision rate. The algorithm has low computational complexity and is simple to realize in real time.

Keywords: voice activity detection (VAD), speech bandwidth spectrum, maximum eigenvalue, robustness, Toeplitz matrix

Copyright © 2015 Institute of Advanced Engineering and Science. All rights reserved.

1. Introduction
Voice is the acoustic manifestation of language. It is the auditory organ's perception of mechanical vibration propagated through an external sound medium, and it is an important carrier of human information transmission and emotional communication. Currently, voice processing technologies require voice input in a quiet environment; when there is ambient noise (such as in factories, airports, etc.), system performance decreases dramatically. However, voice communication is inevitably affected by noise from the surrounding environment, such as the propagation medium. Voice activity detection is an important part of digital speech processing [1-5]; its aim is to detect the speech segments and the noise segments in the sampled digital signal. The collected voice signal is divided into pure-noise segments and noisy-speech segments, and the beginning and ending points of each speech segment are determined. Speech endpoint detection is an important part of speech enhancement algorithms and speech coding. In speech recognition, if the beginning and ending endpoints of the speech segments are correctly determined, the amount of computation and the error rate of speech recognition can be reduced.
In endpoint detection algorithms, short-term energy is the most common feature [6]. It can effectively separate speech from noise in high-SNR environments, but a large number of experimental results show that the performance of the short-term energy approach declines markedly in low-SNR and non-stationary noise environments. Some algorithms can maintain stable performance in low-SNR environments [7]; their disadvantage is that the computational complexity is too high, so they are not suitable for real-time speech recognition systems. Shen [8] first proposed using entropy for speech/noise classification, since the difference between noise and human speech can be expressed by their spectral entropy. The speech spectral entropy algorithm performs better in low-SNR environments than energy-based approaches. It works well in white noise, but it still has difficulty in colored noise.
The signal subspace approach [9-12] has been used for speech enhancement. Because it is difficult to achieve voice endpoint detection under low-SNR and non-stationary noise conditions, a de-noising speech endpoint detection method based on the largest eigenvalue of a Toeplitz matrix is proposed in this paper. In this method, the autocorrelation sequence of the speech spectrum is used to construct a symmetric Toeplitz matrix, the information carried by the maximum eigenvalue of the matrix is used, and the endpoints of the speech signal are detected with a double threshold. The algorithm greatly improves VAD accuracy and validity, and it maintains better detection performance in a variety of noisy environments and low-SNR conditions.

2. Structure of Toeplitz Information Matrix
The voice signal and the parameters that characterize it change randomly on the whole; its essential characteristics vary over time, so it is a typical non-stationary process. Over a short period of time (10 ~ 30 ms), however, its properties remain relatively stable, and it can therefore be regarded as a quasi-stationary process; this is the short-term stationarity of the speech signal. At present, most speech signal processing techniques are based on this "short-time" assumption: the voice signal is divided into a number of segments, and the characteristic parameters of each segment are analyzed. Each segment is called a "frame", and the segmentation process is called "framing". Framing is achieved by applying a window function to the voice signal, and the frame size is generally 10 ~ 30 ms. Frames can be contiguous segments, but framing is generally carried out with overlapping segments from a sliding window, so that the transition between frames is smooth and the continuity of the signal is maintained. As for the window function, in order to get a high frequency resolution and overcome the Gibbs phenomenon, we choose the Hanning window with overlapping segments.
The noisy speech signal x(n) is framed with frame size FrameLen and frame shift StepLen (StepLen < FrameLen); the total number of frames is Num. Applying the fast Fourier transform (FFT) to the k-th frame gives the NFFT-point spectrum $Y_F(i, k)$ $(0 \le i \le NFFT)$. The speech frequency range is between 200 Hz and 4 kHz, and the corresponding interval of spectral points is $[N_d, N_g]$ $(0 \le N_d < N_g \le NFFT)$. Let $L = N_g - N_d + 1$, and let $LM = L/2$ be the size of the Toeplitz matrix. The band-limited spectrum of the k-th frame is $X_k(i) = Y_F(i + N_d - 1, k)$ $(1 \le i \le L)$.
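The framing and band-selection step can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `band_spectra`, the use of the magnitude spectrum, the choice NFFT = FrameLen, and the rounding used to obtain $N_d$ and $N_g$ are all assumptions made for the example.

```python
import numpy as np

def band_spectra(x, fs, frame_len, step_len, f_lo=200.0, f_hi=4000.0):
    """Frame x with a Hanning window and return per-frame spectra in [f_lo, f_hi].

    Hypothetical helper: names and rounding conventions are illustrative only.
    Requires f_hi <= fs / 2.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num = 1 + (len(x) - frame_len) // step_len       # total number of frames (Num)
    window = np.hanning(frame_len)
    nfft = frame_len                                  # assumption: NFFT = FrameLen
    nd = int(np.ceil(f_lo * nfft / fs))               # first FFT bin inside the band (Nd)
    ng = int(np.floor(f_hi * nfft / fs))              # last FFT bin inside the band (Ng)
    spectra = np.empty((num, ng - nd + 1))            # each row has L = Ng - Nd + 1 points
    for k in range(num):
        start = k * step_len
        frame = x[start:start + frame_len] * window
        yf = np.abs(np.fft.rfft(frame, nfft))         # assumption: magnitude spectrum
        spectra[k] = yf[nd:ng + 1]                    # band-limited spectrum X_k
    return spectra, nd, ng
```

With fs = 19980 Hz and 25 ms frames, for example, `frame_len = int(0.025 * fs)` gives 499 samples, and each row of `spectra` holds the points of one frame that fall inside the 200 Hz to 4 kHz band.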
The autocorrelation sequence $R(m)$ of the speech spectrum of the k-th frame is:

$$R(m) = \frac{1}{L-m} \sum_{i=1}^{L-m} X_k(i)\, X_k(i+m), \qquad m = 0, 1, 2, \ldots, LM-1 \tag{1}$$

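The LM-dimensional real symmetric Toeplitz matrix A is constructed as:

$$A = \mathrm{Toeplitz}(R) = \begin{pmatrix} R(0) & R(1) & \cdots & R(LM-1) \\ R(1) & R(0) & \cdots & R(LM-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(LM-1) & R(LM-2) & \cdots & R(0) \end{pmatrix} \tag{2}$$
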
The order of this Toeplitz matrix is not high, so its eigenvalues can be computed quickly.
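Equations (1) and (2) can be illustrated with a short sketch. The function names `spectral_autocorr` and `toeplitz_matrix` are hypothetical, and the input is assumed to be a real-valued (for example, magnitude) band spectrum so that the resulting matrix is real and symmetric.

```python
import numpy as np

def spectral_autocorr(xk, lm):
    """Eq. (1): R(m) = 1/(L-m) * sum_{i=1}^{L-m} X_k(i) X_k(i+m), m = 0..LM-1."""
    xk = np.asarray(xk, dtype=float)
    L = len(xk)
    return np.array([np.dot(xk[:L - m], xk[m:]) / (L - m) for m in range(lm)])

def toeplitz_matrix(r):
    """Eq. (2): real symmetric Toeplitz matrix with entries A[i, j] = R(|i - j|)."""
    r = np.asarray(r, dtype=float)
    n = len(r)
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return r[idx]
```

For one frame with band spectrum `xk` of length L, a call such as `toeplitz_matrix(spectral_autocorr(xk, len(xk) // 2))` would give the LM x LM matrix A used in the following sections.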
3. Realization of Voice Activity Detection
3.1. Iterative Method for the Maximum Eigenvalue
The matrix power method is an iterative method for finding the largest eigenvalue and the corresponding eigenvector. Suppose the n-th order matrix A has n linearly independent eigenvectors $v_1, v_2, \ldots, v_n$, and the corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ satisfy:

$$|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n| \tag{3}$$

3.1.1. The Basic Idea
Because $\{v_1, v_2, \ldots, v_n\}$ is a basis of $C^n$, any given $x^{(0)} \ne 0$ has the linear representation $x^{(0)} = \sum_{i=1}^{n} a_i v_i$.
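Therefore:

$$A^k x^{(0)} = A^k \sum_{i=1}^{n} a_i v_i = \sum_{i=1}^{n} a_i A^k v_i = \sum_{i=1}^{n} a_i \lambda_i^k v_i = \lambda_1^k \left[ a_1 v_1 + \sum_{i=2}^{n} a_i \left( \frac{\lambda_i}{\lambda_1} \right)^k v_i \right] \tag{4}$$
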
If $a_1 \ne 0$, then since $|\lambda_i / \lambda_1| < 1$ for $i \ge 2$, when k is large enough $A^k x^{(0)} \approx \lambda_1^k a_1 v_1 = c\,v_1$, which is an eigenvector of $\lambda_1$.
On the other hand, let $\max(x) = x_i$, where $|x_i| = \|x\|_\infty$. When k is sufficiently large,

$$\frac{\max(A^{k+1} x^{(0)})}{\max(A^k x^{(0)})} \approx \frac{\max(\lambda_1^{k+1} a_1 v_1)}{\max(\lambda_1^k a_1 v_1)} = \frac{\lambda_1^{k+1} a_1 \max(v_1)}{\lambda_1^k a_1 \max(v_1)} = \lambda_1$$

If $a_1 = 0$, then because of rounding error the iteration vector will acquire a nonzero component in the $v_1$ direction, and as the iteration continues $\lambda_1$ and the corresponding approximate eigenvector can still be obtained.
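3.1.2. Standardization
In practical calculation, if $|\lambda_1| > 1$ then $|\lambda_1^k a_1| \to \infty$, and if $|\lambda_1| < 1$ then $|\lambda_1^k a_1| \to 0$; in either case the computation breaks down (overflow or underflow). A "standardized" approach is therefore used:

$$y^{(k)} = \frac{x^{(k)}}{\max(x^{(k)})}, \qquad x^{(k+1)} = A y^{(k)}, \qquad k = 0, 1, 2, \ldots \tag{5}$$
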
Theorem: Given any initial vector $x^{(0)} \ne 0$,

$$\lim_{k \to \infty} y^{(k)} = \frac{v_1}{\max(v_1)} \ \text{(eigenvector)}, \qquad \lim_{k \to \infty} \max(x^{(k)}) = \lambda_1 \ \text{(eigenvalue)} \tag{6}$$

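Proof:

$$\begin{aligned}
y^{(k)} &= \frac{x^{(k)}}{\max(x^{(k)})} = \frac{A y^{(k-1)}}{\max(A y^{(k-1)})} = \frac{A x^{(k-1)} / \max(x^{(k-1)})}{\max\!\left( A x^{(k-1)} / \max(x^{(k-1)}) \right)} = \cdots = \frac{A^k x^{(0)}}{\max(A^k x^{(0)})} \\
&\overset{(4)}{=} \frac{\lambda_1^k \left[ a_1 v_1 + \sum_{i=2}^{n} a_i \left( \frac{\lambda_i}{\lambda_1} \right)^k v_i \right]}{\max\!\left( \lambda_1^k \left[ a_1 v_1 + \sum_{i=2}^{n} a_i \left( \frac{\lambda_i}{\lambda_1} \right)^k v_i \right] \right)} = \frac{a_1 v_1 + \sum_{i=2}^{n} a_i \left( \frac{\lambda_i}{\lambda_1} \right)^k v_i}{\max\!\left[ a_1 v_1 + \sum_{i=2}^{n} a_i \left( \frac{\lambda_i}{\lambda_1} \right)^k v_i \right]} \\
&\xrightarrow{k \to \infty} \frac{a_1 v_1}{\max(a_1 v_1)} = \frac{v_1}{\max(v_1)}
\end{aligned}$$

$$\max(x^{(k)}) = \max(A y^{(k-1)}) \xrightarrow{k \to \infty} \max\!\left( A \frac{v_1}{\max(v_1)} \right) = \frac{\max(A v_1)}{\max(v_1)} = \frac{\lambda_1 \max(v_1)}{\max(v_1)} = \lambda_1$$
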
Note: If the eigenvalues do not satisfy condition (3), the convergence analysis of the power method is more complicated. However, if $\lambda_1 = \lambda_2 = \cdots = \lambda_r$ and $|\lambda_1| > |\lambda_{r+1}| \ge \cdots \ge |\lambda_n|$, the conclusion of the Theorem still holds. In this case, iterations started from different initial vectors generally tend to different eigenvectors of $\lambda_1$.
3.2. The Largest Eigenvalue Algorithm for Toeplitz Matrix A
To obtain the largest eigenvalue, we use the power method, which avoids matrix decomposition and matrix inversion when computing eigenvalues. The implementation steps are:
1) Initial values: LM-dimensional column vector $y = [1, 1, \ldots, 1]^H$, where H denotes transpose; LM-dimensional column vector $y_0 = [0, 0, \ldots, 0]^H$; loop termination tolerance eps = 0.0001 (a small number); d = 1.
2) Matrix product: $z = A y$
3) Normalization:

$$y = \frac{z}{\|z\|_\infty}, \qquad \|z\|_\infty = \max\{ |z(i)|,\ i = 1, 2, \ldots, LM \} \tag{7}$$

4) Calculation: $d = \max\{ |y(i) - y_0(i)|,\ i = 1, 2, \ldots, LM \}$. The latest y is preserved: $y_0 = y$.
5) Loop test: if d > eps, go to step 2); otherwise go to step 6).
6) Calculate the largest eigenvalue:

$$\lambda = \max\{ |z(i)|,\ i = 1, 2, \ldots, LM \} \tag{8}$$

7) Retain the largest-eigenvalue information of the k-th frame:

$$Tzv(k) = 10 \log_{10}(\lambda) \tag{9}$$

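The seven steps above can be sketched as a normalized power iteration. This is an illustrative sketch rather than the author's code: the function name `largest_eigen_info` and the `max_iter` safeguard are assumptions added for the example, and the guard against a zero vector is not part of the listed steps.

```python
import numpy as np

def largest_eigen_info(A, eps=1e-4, max_iter=1000):
    """Normalized power iteration following steps 1)-7).

    Returns (lam, y, tzv): the estimate of the largest eigenvalue magnitude,
    the normalized eigenvector estimate, and Tzv = 10*log10(lam).
    max_iter is an added safeguard (assumption), not part of the original steps.
    """
    A = np.asarray(A, dtype=float)
    lm = A.shape[0]
    y = np.ones(lm)                    # step 1: y = [1, ..., 1]^T
    y0 = np.zeros(lm)                  # step 1: y0 = [0, ..., 0]^T
    d, lam, it = 1.0, 0.0, 0
    while d > eps and it < max_iter:   # step 5: repeat while d > eps
        z = A @ y                      # step 2: z = A y
        lam = np.max(np.abs(z))        # infinity norm of z; equals Eq. (8) at exit
        if lam == 0.0:                 # A y vanished, nothing to normalize
            break
        y = z / lam                    # step 3: Eq. (7)
        d = np.max(np.abs(y - y0))     # step 4: change since the last iterate
        y0 = y.copy()
        it += 1
    tzv = 10.0 * np.log10(lam) if lam > 0 else -np.inf   # step 7: Eq. (9)
    return lam, y, tzv
```

Each pass costs one matrix-vector product, which is why the approach suits the small LM x LM matrices built per frame.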
3.3. Double Threshold Voice Endpoint Discrimination
To prevent jagged fluctuations of the largest-eigenvalue information Tzv between frames, Tzv is smoothed by averaging over three adjacent frames.
The double-threshold voice endpoint discrimination steps are:
Step 1: Take the initial N0 frames as noise frames, and calculate the average value Avg and the standard deviation Std of Tzv(l) (0 < l ≤ N0). The double threshold is defined as the speech-frame threshold TS and the noise-frame threshold TN, given by Formula (10):

$$TN = Avg + \alpha \cdot Std,\ \ \alpha > 0; \qquad TS = Avg + \beta \cdot Std,\ \ \beta > \alpha \tag{10}$$

Step 2: Calculate the largest-eigenvalue information Tzv(l) of the next frame of the speech signal. When the previous frame is a noise frame, Tzv(l) is compared with the threshold TS: if Tzv(l) < TS, the l-th frame is a noise frame; otherwise it is a speech frame. When the previous frame is a speech frame, Tzv(l) is compared with the threshold TN: if Tzv(l) < TN, the l-th frame is a noise frame; otherwise it is a speech frame. Step 2 is repeated until the end of the sampled signal.
α and β can be selected in the interval (0, 4), and different values of α and β are selected for different noises. A speech segment lasts for at least some minimum time, such as 0.2 seconds. If a detected speech segment is shorter than this minimum duration, it is called "voice debris" (common under non-Gaussian noise, e.g. factory noise (factory) and loud noise (babble)). Finally, isolated "voice debris" is removed, and adjacent "voice debris" is merged.
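The smoothing and the two-threshold decision can be sketched as below. Several details are assumptions made for this example and are not specified in the text: the function name `double_threshold_vad`, the default values of `alpha` and `beta`, the handling of the first and last frame in the three-frame average, the use of the population standard deviation, and the simplified debris handling (short runs are removed, and merging of adjacent debris is omitted).

```python
import numpy as np

def double_threshold_vad(tzv, n0=20, alpha=1.0, beta=2.0, min_frames=0):
    """Label frames as speech (True) or noise (False) from Tzv values.

    Hypothetical sketch of Steps 1-2; min_frames is the minimum speech run
    length in frames (e.g. 0.2 s divided by the frame shift).
    """
    tzv = np.asarray(tzv, dtype=float)
    # Three-frame average; the first and last frame reuse their own value
    # in place of the missing neighbour (assumption).
    padded = np.pad(tzv, 1, mode="edge")
    smooth = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

    # Step 1: statistics of the initial N0 noise frames and Formula (10).
    avg, std = smooth[:n0].mean(), smooth[:n0].std()
    tn = avg + alpha * std
    ts = avg + beta * std

    # Step 2: compare with TS after a noise frame, with TN after a speech frame.
    is_speech = np.zeros(len(smooth), dtype=bool)
    prev = False
    for l in range(n0, len(smooth)):
        prev = smooth[l] >= (tn if prev else ts)
        is_speech[l] = prev

    # Simplified "voice debris" handling: drop speech runs shorter than min_frames.
    l = 0
    while l < len(is_speech):
        if is_speech[l]:
            end = l
            while end < len(is_speech) and is_speech[end]:
                end += 1
            if end - l < min_frames:
                is_speech[l:end] = False
            l = end
        else:
            l += 1
    return is_speech
```

Using the higher threshold to enter speech and the lower one to leave it gives the decision hysteresis, so a single frame near the boundary does not toggle the label back and forth.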
4. Experimental Evaluation
Background noise is taken from the Noisex-92 database [13], with sampling frequency fs = 19.98 kHz. The test speech was recorded on a computer in an indoor noise environment at the same sampling frequency fs; the utterance "language, tone, endpoint" is shown in Figure 1(a), where the boxed lines mark the endpoint detection results of the method. In processing, the voice is divided into frames of 25 ms each, so the frame size is FrameLen = [0.025 fs] points and the frame shift is StepLen = [FrameLen/4]. The fast Fourier transform (FFT) length for each frame is taken equal to the frame length FrameLen, and the number of initial frames taken as noise is N0 = 20.
The original voice, and the original voice mixed with noises from the Noisex-92 library, namely white noise (white), pink noise (pink), aircraft noise (f16_cockpit), and loud noise (babble), were tested with the Toeplitz-matrix largest-eigenvalue endpoint detection method of this article at signal-to-noise ratios SNR = 5dB, 0dB, -5dB. The test results, compared with the signal recurrence analysis method [14], are presented in Figures 1-3. In the left part of each figure, the abscissa is time (seconds) and the vertical axis is amplitude; in the middle part, the abscissa is the frame number and the vertical axis is the largest-eigenvalue information of the Toeplitz matrix (dB); in the right part, the abscissa is the frame number and the ordinate is the recurrence rate (%). In the figure for SNR = 5dB, the left part shows the voice, the voice mixed with different noises, and their detection; the central part shows the largest-eigenvalue information curve of the Toeplitz matrix algorithm with its dividing line and endpoints. Under the various mixed noises, the Toeplitz-matrix maximum-eigenvalue information curve still yields accurate voice endpoint segmentation and good adaptability.

Figure 1. Endpoint detection comparison for the original voice and the voice mixed with different noises (SNR = 5dB)
(a) The original speech and its endpoint detection by the algorithm; (a1) the Toeplitz-matrix maximum-eigenvalue information curve of the original voice and the partition line; (a2) the corresponding recurrence analysis measure curve of the signal and its segmentation;
(b) Speech mixed with white noise (white) and its endpoint detection by the algorithm; (b1) the Toeplitz-matrix maximum-eigenvalue information curve for the white-noise mixture and the partition line; (b2) the corresponding recurrence analysis measure curve of the signal and its segmentation;
(c) Speech mixed with pink noise (pink) and its endpoint detection by the algorithm; (c1) the Toeplitz-matrix maximum-eigenvalue information curve for the pink-noise mixture and the partition line; (c2) the corresponding recurrence analysis measure curve of the signal and its segmentation;
(d) Speech mixed with fighter cockpit noise (f16_cockpit) and its endpoint detection by the algorithm; (d1) the Toeplitz-matrix maximum-eigenvalue information curve for the aircraft-noise mixture and the partition line; (d2) the corresponding recurrence analysis measure curve of the signal and its segmentation;
(e) Speech mixed with loud crowd noise (babble) and its endpoint detection by the algorithm; (e1) the Toeplitz-matrix largest-eigenvalue information curve for the loud-noise mixture and the dividing line; (e2) the corresponding recurrence analysis measure curve of the signal and its segmentation.

Figure 2. Endpoint detection comparison for the original voice and the voice mixed with different noises (SNR = 0dB) (legend as in Figure 1)

Figure 3. Endpoint detection comparison for the original voice and the voice mixed with different noises (SNR = -5dB)

To further evaluate the performance of the algorithm and analyze its merits quantitatively, this paper selects the following three indicators [15-17]:

$$P(A/S) = 1 - \frac{N_{1,0}}{N_1}, \qquad P(A/N) = 1 - \frac{N_{0,1}}{N_0}, \qquad P(A) = 1 - \frac{N_{0,1} + N_{1,0}}{N_0 + N_1} \tag{11}$$

Here $N_1$ and $N_0$ are the total numbers of hand-labeled speech frames and noise frames in the test speech; $N_{1,0}$ is the number of hand-labeled speech frames wrongly identified as noise; and $N_{0,1}$ is the number of hand-labeled noise frames wrongly identified as speech. Then P(A/S) is the correct detection rate of speech frames, P(A/N) is the correct detection rate of noise frames, and P(A) is the total detection accuracy.
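As a worked illustration of Formula (11), the three rates can be computed from per-frame labels as in the sketch below. The function name `vad_rates` and the boolean encoding (True for speech, False for noise) are assumptions for the example, and the input is assumed to contain at least one frame of each class.

```python
import numpy as np

def vad_rates(labels, decisions):
    """Return (P(A/S), P(A/N), P(A)) from hand labels and detector decisions.

    Both inputs are per-frame sequences where True/1 means speech and
    False/0 means noise (hypothetical encoding).
    """
    labels = np.asarray(labels, dtype=bool)
    decisions = np.asarray(decisions, dtype=bool)
    n1 = int(labels.sum())                      # hand-labeled speech frames, N1
    n0 = int((~labels).sum())                   # hand-labeled noise frames, N0
    n10 = int((labels & ~decisions).sum())      # speech frames detected as noise, N1,0
    n01 = int((~labels & decisions).sum())      # noise frames detected as speech, N0,1
    p_as = 1.0 - n10 / n1
    p_an = 1.0 - n01 / n0
    p_a = 1.0 - (n01 + n10) / (n0 + n1)
    return p_as, p_an, p_a
```

For instance, with 4 speech frames of which 1 is missed and 6 noise frames of which 2 are flagged as speech, the rates are 0.75, about 0.667, and 0.7.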
Table 1 summarizes the results of the two methods for different noises in different SNR environments.

Table 1. Endpoint detection test results (correct rate, %)

                      Toeplitz maximum eigenvalue algorithm    Signal recurrence analysis algorithm
Noise     SNR         P(A/S)     P(A/N)     P(A)               P(A/S)     P(A/N)     P(A)
white     5dB         91.90      97.48      94.86              79.35      100.00     90.29
white     0dB         80.57      100.00     90.86              68.02      100.00     84.95
white     -5dB        68.83      100.00     85.33              51.01      100.00     76.95
pink      5dB         91.90      97.48      94.86              78.95      100.00     90.10
pink      0dB         79.76      100.00     90.48              67.61      100.00     84.76
pink      -5dB        69.23      100.00     85.52              47.37      100.00     75.24
f16       5dB         91.90      100.00     96.19              81.38      78.06      79.62
f16       0dB         72.47      100.00     87.05              71.66      77.70      74.86
f16       -5dB        66.80      100.00     84.38              63.56      77.70      71.05
babble    5dB         78.54      77.34      77.90              87.45      50.72      68.00
babble    0dB         73.58      77.34      75.62              80.97      50.72      64.95
babble    -5dB        74.90      60.43      67.24              72.47      50.72      60.95

5. Conclusions and Outlook
A new method of robust endpoint detection for noisy speech, based on the maximum eigenvalue of a Toeplitz matrix, is proposed in this paper from a new perspective. In this method, the autocorrelation sequence of the spectrum in the speech band (200 Hz - 4 kHz) is used to construct a symmetric Toeplitz matrix, and the maximum eigenvalue of the matrix is used for endpoint detection of the speech signal with a pair of thresholds. The main signal is extracted by the maximum eigenvalue, and noise is suppressed better. When the SNR is below 5dB, general endpoint detection methods, such as short-time spectral estimation, seem almost powerless, whereas this algorithm is still useful. It is simple to calculate and has strong noise immunity, and experiments show that the method is not only correct but also robust; the algorithm works well in common use cases and adapts to a wide range of environments. In particular, when the aliasing noise lies in the low and high frequency bands, noisy speech endpoint detection is very good; the case of noise aliasing in the voice band is worthy of further improvement.

References
[1] Raj B, Singh R. Classifier-based non-linear projection for adaptive endpointing of continuous speech. Computer Speech and Language. 2003; 17: 5-26.
[2] Tanyer SG, Ozer H. Voice activity detection in nonstationary noise. IEEE Transactions on Speech and Audio Processing. 2000; 8(4): 478-482.
[3] Karray L, Martin A. Towards improving speech detection robustness for speech recognition in adverse conditions. Speech Communication. 2003; 40: 261-276.
[4] Kuroiwa S, Naito M, Yamamoto S, et al. Robust speech detection method for telephone speech recognition system. Speech Communication. 1999; 27: 135-148.
[5] Ramirez J, Segura JC, Benitez C, et al. Efficient voice activity detection algorithms using long-term speech information. Speech Communication. 2004; 42: 271-287.
[6] Ramirez J, Segura JC, Benitez C, et al. An effective subband OSF-based VAD with noise reduction for robust speech recognition. IEEE Transactions on Speech and Audio Processing. 2005; 13(6): 1119-1129.
[7] Nemer E, Goubran R, Mahmoud S. Robust voice activity detection using higher-order statistics in the LPC residual domain. IEEE Transactions on Speech and Audio Processing. 2001; 9(3): 217-231.
[8] Shen J, Hung J, Lee L. Robust entropy-based endpoint detection for speech recognition in noisy environments. Proc of International Conference on Spoken Language Processing, Sydney, Australia. 1998: 232-238.
[9] Ephraim Y, van Trees HL. A signal subspace approach for speech enhancement. IEEE Trans on Speech Audio Processing. 1995; 3(4): 251-266.
[10] Klein M, Kabal P. Signal subspace speech enhancement with perceptual post filtering. IEEE-ICASSP. Orlando, Florida, USA. 2002: 537-540.
[11] Mittal U, Phamdo N. Signal/noise KLT based approach for enhancing speech degraded by colored noise. IEEE Trans on Speech Audio Processing. 2000; 8: 159-167.
[12] Yi H, Loizou PC. A generalized subspace approach for enhancing speech corrupted by colored noise. IEEE Trans on Speech and Audio Processing. 2003; 11(4).
[13] Spib noise data [EB/OL] [2011-10-20]. http://spib.rice.edu/spib/select_noise.html.
[14] YAN Run-qiang, ZHU Yi-sheng. Speech endpoint detection based on recurrence rate analysis. Journal of Communications. 2007; 28(1): 35-39. doi:10.3321/j.issn:1000-436X.2007.01.006
[15] Marzinzik M, Kollmeier B. Speech pause detection for noise spectrum estimation by tracking power envelope dynamics. IEEE Trans on Speech and Audio Processing. 2002; 10: 109-118.
[16] LI Jin, WANG Jing-fang, GAO Jin-ding. Speech endpoint detection algorithm based on EMD and RP. Computer Engineering and Applications. 2010; 46(34): 132-135. doi:10.3778/j.issn.1002-8331.2010.34.040
[17] WANG Jingfang. Real-time voice activity robust detection. Computer Engineering and Applications. 2011; 47(20): 147-149. doi:10.3778/j.issn.1002-8331.2011.20.042