TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol. 12, No. 6, June 2014, pp. 4475 ~ 4484
DOI: 10.11591/telkomnika.v12i6.5485

Received December 24, 2013; Revised February 19, 2014; Accepted March 5, 2014

A Dual-Microphone Speech Enhancement Algorithm for Close-Talk System

Yi Jiang*1, Zhenming Feng1, Yuanyuan Zu2, Xi Lu2
1 Department of Electronic Engineering, Tsinghua University, Beijing 100084, P.R. China, +86-62781702
2 Quartermaster Equipment Research Institute, General Logistics Department, Beijing 100082, P.R. China, +86-62276675
*Corresponding author, e-mail: jiangyi09@mails.tsinghua.edu.cn

Abstract
While human listening is robust in complex auditory scenes, current speech enhancement algorithms do not perform well in noisy environments, even when a close-talk system is used. This paper addresses the robustness of a dual-microphone embedded close-talk system by employing a computational auditory scene analysis (CASA) framework. The energy difference between the two microphones is used as the primary separation cue to estimate the ideal binary mask (IBM). We also use voice activity detection to find the noise periods and update the separation critical value. Various interference locations and reverberant conditions are used to examine how well the proposed system generalizes. Evaluation and comparison show that the proposed system outperforms two other systems on the test conditions.

Keywords: energy difference, close-talk system, speech enhancement, binary mask

Copyright © 2014 Institute of Advanced Engineering and Science. All rights reserved.

1. Introduction

Speech enhancement in noisy and reverberant environments is a very challenging problem, even when a close-talk microphone is used to collect the target speech. The performance gap between human listeners and speech enhancement algorithms remains large [1]. In recent decades, computational auditory scene analysis (CASA) has provided a new approach to the speech enhancement problem that draws directly on human auditory processing [2]. It uses a two-dimensional ideal binary mask (IBM) to segregate the speech-dominated time-frequency (T-F) units, and it can dramatically improve the intelligibility of noisy speech for both normal-hearing and hearing-impaired listeners [3, 4].

Based on CASA, single-microphone speech enhancement algorithms use monaural features, such as fundamental frequency (F0), onset/offset, and GFCC [5], to segregate the target speech from the noise. These systems work poorly in noisy conditions because the noise distorts the monaural features. Monaural speech separation is particularly difficult, as one has access only to a single-channel noisy signal.

With two ears, human listening is robust under both noisy and reverberant conditions. Binaural cues contribute to auditory scene analysis [6]. So far, dual-microphone systems have often employed binaural features, such as interaural time differences (ITD) and interaural intensity differences (IID). These systems have yielded significant improvements in speech separation [7, 8]. Kernel density estimation, GMM with SVM [9, 10], the multilayer perceptron (MLP) [11] and deep neural networks (DNNs) [12] are used to model these features and classify the target speech. However, model training and classification are too complex and time-consuming for use in a real-time embedded system. Other two-microphone or multi-microphone array systems are also used in portable systems [13, 14], but they cannot deal with non-stationary noise.

In this paper, using computational auditory scene analysis (CASA) as a framework, we propose a speech separation approach for a close-talk system. We use only the dual-microphone energy difference (DMED) as the cue to separate the close speech from the far noise [15]. We also use the DMED to detect the noise periods and update the separation critical value. The system needs no training and runs in real time. It is simple to integrate into an embedded system.

The rest of this paper is organized as follows. Section 2 presents an overview of the speech separation algorithm for the dual-microphone close-talk system. Section 3 describes how to extract the DMED feature and estimate the IBM. The systematic evaluation and comparison are presented in Section 4. Finally, we conclude the paper in Section 5.

2. System Overview

The proposed dual-microphone embedded close-talk system is shown in Figure 1. Two microphones collect the close speech from the mouth and the far noise simultaneously. Two identical gammatone filterbanks decompose the two microphone inputs into T-F domain representations. A T-F unit corresponds to a certain channel of a filterbank at a certain time frame. This decomposition works in a manner similar to the human ear.

Figure 1. Schematic Diagram of the Dual-microphone Close-talk System

We extract the dual-microphone energy difference (DMED) feature in each T-F unit pair; it depends on the locations of the sound sources. The DMED is used as the cue to estimate the IBM for the T-F units, where 1 indicates that the target signal dominates the corresponding T-F unit and 0 indicates otherwise.

The DMED feature is also used in voice activity detection (VAD) to find the noise-only periods. In these periods, we update the local SNR criterion (LC) value for the IBM estimation algorithm. As the energy difference features of the two microphones vary with the frequency channel [16, 17], the LC is calculated for each channel separately.

In the resynthesis stage, the T-F units with the target label (unity) make up the segregated target stream.

3. Feature Extraction and Speech Segregation

In the close-talk system, one microphone, indexed as 1, is placed in front of the mouth, only several centimeters away. It mainly collects the target speech. The other microphone, indexed as 2, is placed near the left ear and collects the target speech and the interference equally.

3.1. Gammatone Filterbanks

The auditory filterbank decomposes the input mixture signal into narrow frequency bands. In this paper, a simplified implementation of the gammatone filterbank, a cochlear model proposed by Roy Patterson, is used for the embedded system. Each filterbank is designed as a set of parallel band-pass filters with equivalent rectangular bandwidth (ERB), described in the time domain by Equation (1).

$$g(f_c, t) = t^{n-1} e^{-2\pi b(f_c) t} \cos(2\pi f_c t + \varphi) \qquad (1)$$

The number of channels $c$ of the gammatone filterbank is set to 32 for the embedded system. The filter center frequencies $f_c$ are ordered from high frequencies at the base of the cochlea to low frequencies at the apex. To simulate human auditory behavior, the center frequencies $f_c$ of the filterbank in this paper range from 50 Hz to 8000 Hz. The bandwidth of each cochlear filter, $b(f_c)$, is 1.019 times the ERB at the corresponding center frequency. The order $n$ is set to 4 for the embedded system. $\varphi$ is the phase, which we set to zero. We also use a simplified gammatone filter form for this work [18]:

$$G(s) = \frac{6\left[(s+B_c)^4 - 6\omega_c^2 (s+B_c)^2 + \omega_c^4\right]}{\left[(s+B_c)^2 + \omega_c^2\right]^4} \qquad (2)$$

where $B_c = 2\pi b(f_c)$ and $\omega_c = 2\pi f_c$. Each filterbank channel is set up with four second-order filters.

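The time-domain form in Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the embedded implementation of [18]. It assumes the Glasberg-Moore ERB formula, ERB-rate spacing of the center frequencies, and a 16 kHz sampling rate, none of which is stated in the text. The function names `erb`, `erb_space` and `gammatone_ir` are hypothetical.

```python
import math


def erb(fc_hz):
    """ERB in Hz at center frequency fc_hz (Glasberg-Moore formula, assumed here)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def erb_space(f_low, f_high, num_channels):
    """Center frequencies equally spaced on the ERB-rate scale, ordered high to low."""
    def hz_to_erb_rate(f):
        return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)

    def erb_rate_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    e_low, e_high = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (e_high - e_low) / (num_channels - 1)
    return [erb_rate_to_hz(e_high - i * step) for i in range(num_channels)]


def gammatone_ir(fc_hz, fs_hz, duration_s=0.02, order=4, phase=0.0):
    """Sampled impulse response t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase)."""
    b = 1.019 * erb(fc_hz)  # bandwidth: 1.019 times the ERB, as in the text
    num_samples = int(round(duration_s * fs_hz))
    response = []
    for i in range(num_samples):
        t = i / fs_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        response.append(envelope * math.cos(2.0 * math.pi * fc_hz * t + phase))
    return response


# 32 channels from 8000 Hz down to 50 Hz, order 4, zero phase (values from the text).
centers = erb_space(50.0, 8000.0, 32)
ir = gammatone_ir(centers[15], 16000.0)
```

Convolving a microphone signal with one such impulse response gives one frequency channel. The cascade of four second-order sections mentioned above is a cheaper route to a similar response and is not reproduced here.
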
3.2. DMED Feature Extraction

With the gammatone filterbank, the signals received by the microphones are divided into frequency channels. A time frame window is then used to segment each channel signal into small units called T-F units.

$$X(c,m) = g(f_c, t) * \left[x_{f_c,m,t_1}, x_{f_c,m,t_2}, \ldots, x_{f_c,m,t_k}\right] \qquad (3)$$

Here $c$ is the filter frequency channel index and $f_c$ is its center frequency. $m$ is the frame index, and $t$ denotes the time span of frame $m$, in which the signal contains $k$ data points. In this paper, each T-F unit is a 20-ms time frame with a 10-ms overlap between consecutive frames, and $k$ is equal to 320.

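The framing step can be illustrated with a short sketch. It assumes a 16 kHz sampling rate, which is inferred from a 20-ms frame holding 320 points and is not stated explicitly. The helper name `split_frames` is hypothetical, and trailing samples that do not fill a whole frame are simply dropped.

```python
def split_frames(signal, frame_len=320, hop=160):
    """Cut one channel signal into overlapping frames.

    frame_len=320 and hop=160 correspond to 20-ms frames with a 10-ms
    overlap at an assumed sampling rate of 16 kHz.
    """
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames


# 800 samples (50 ms) yield frames starting at samples 0, 160, 320 and 480.
frames = split_frames(list(range(800)))
```
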
The energy of one T-F unit is calculated by:

$$\|X(c,m)\|^2 = \sum_{t} x_{f_c,m,t}^2 \qquad (4)$$

The energy difference between the two microphones in each T-F unit is expressed as the energy ratio:

$$DMED(c,m) = \frac{\|X_1(c,m)\|^2}{\|X_2(c,m)\|^2} \qquad (5)$$

The DMED value reflects the difference between the distances from the sound source to the two microphones. We also use $DMED_S(c,m)$ and $DMED_N(c,m)$ to denote the DMED values of the target sound source and the noise source, respectively.

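As a small numerical illustration of Equations (4) and (5), the sketch below computes the DMED of one T-F unit pair, both as a raw ratio and in dB. The dB conversion (10 log10 of the ratio) and the small energy floor that avoids division by zero are assumptions of this sketch; the function names are hypothetical.

```python
import math


def unit_energy(samples):
    """Energy of one T-F unit: sum of squared samples (Equation 4)."""
    return sum(x * x for x in samples)


def dmed(unit_mic1, unit_mic2, floor=1e-12):
    """Energy ratio between microphone 1 and microphone 2 (Equation 5).

    The floor is an assumption that guards against a silent second channel.
    """
    return unit_energy(unit_mic1) / max(unit_energy(unit_mic2), floor)


def dmed_db(unit_mic1, unit_mic2):
    """The same ratio expressed in dB."""
    return 10.0 * math.log10(max(dmed(unit_mic1, unit_mic2), 1e-12))


# A source ten times larger in amplitude at microphone 1 gives a ratio of
# about 100, that is, about 20 dB.
near = [1.0] * 320
far = [0.1] * 320
ratio = dmed(near, far)
ratio_db = dmed_db(near, far)
```
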
Figure 2. The Histogram of $DMED_S$

As shown in Figure 2, in the 16th channel the $DMED_S$ values of the near sound source are distributed around 20 dB. The smallest value is above 10 dB, and the largest is more than 24 dB. The two large peaks indicate that the microphone location changed during the talking period.

Figure 3. The Histogram of $DMED_N$ at Various Azimuths: (a) 0° babble noise; (b) 45° babble noise; (c) 90° babble noise; (d) 135° babble noise

Figure 3 shows the 16th channel $DMED_N$ of a babble noise, with the interference located at azimuths 0°, 45°, 90° and 135°. The $DMED_N$ values are close to 0 dB. Compared with the distance from the far interference to the two microphones, the difference between the two distances is very small. The $DMED_N$ therefore changes little across interference locations.

Figure 4. DMED of Mixtures: (a) speech with 0° babble noise; (b) speech with 45° babble noise

The DMED of two mixture signals is shown in Figure 4. The mixture speech is an actual recorded sentence, with a babble noise present at azimuth 0° (a) and 45° (b). There are clearly two peaks in both (a) and (b). The peaks on the left (close to 0 dB) represent the DMED of the far noise, while the peaks on the right (close to 20 dB) represent the DMED of the target speech. Owing to the shape of the human head and the microphone locations, the DMED is robust to various noise locations.

In a close-talk system, the difference between $DMED_N$ and $DMED_S$ is significant. The DMED is therefore used as the cue to separate the target speech.

3.3. Ideal Binary Mask Estimation

In this close-talk system, the IBM is estimated from the DMED cue and used to separate the target signal as follows:

$$BM(c,m) = \begin{cases} 1, & \text{if } DMED(c,m) > LC(c,m) \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

where $BM$ is the estimated binary mask value of the T-F unit in frequency channel $c$ and time frame $m$. A value of 1 indicates a T-F unit dominated by the target speech, and 0 indicates a T-F unit dominated by noise.

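The thresholding rule of Equation (6) is simple enough to sketch directly. The example below assumes a strict greater-than comparison and stores the DMED and LC values as nested lists indexed by channel and then frame; both choices, and the function name, are assumptions of this illustration.

```python
def estimate_binary_mask(dmed, lc):
    """Label each T-F unit 1 (target) when its DMED exceeds the criterion.

    dmed[c][m] and lc[c][m] hold linear (not dB) values for channel c, frame m.
    """
    return [
        [1 if d > threshold else 0 for d, threshold in zip(d_row, lc_row)]
        for d_row, lc_row in zip(dmed, lc)
    ]


# Two channels, two frames: units with a DMED near 100 are kept,
# units with a DMED near 1 are rejected.
dmed_values = [[120.0, 1.2], [0.9, 150.0]]
lc_values = [[2.0, 2.0], [2.0, 2.0]]
mask = estimate_binary_mask(dmed_values, lc_values)
```
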
$LC$ is the local separation criterion. Using the DMED as the speech separation cue, the $LC$ is calculated as:

$$LC(c,m) = \frac{2}{\dfrac{1}{DMED_S(c,m)} + \dfrac{1}{DMED_N(c,m)}} \qquad (7)$$

In a usual close-talk implementation (Figure 2), and given the properties of the HRTF, $DMED_S$ is always over 100. $DMED_N$ from the far noise is around 1, much smaller than $DMED_S$, so the difference between them is significant. According to Equation (7), $LC(c,m)$ is determined mainly by the smaller of $DMED_S$ and $DMED_N$, so $DMED_N$ is the decisive factor. Since $1/DMED_S$ is below 0.01, we replace it with this bound and calculate the $LC$ as:

$$LC(c,m) = \frac{2}{0.01 + \dfrac{1}{DMED_N(c,m)}} \qquad (8)$$

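A quick numerical check shows why the simplification in Equation (8) is reasonable. The sketch below evaluates the harmonic-mean form of Equation (7) and the noise-only form of Equation (8) for linear DMED values; the function names are hypothetical.

```python
def lc_harmonic(dmed_s, dmed_n):
    """Equation (7): harmonic mean of the speech and noise DMED values."""
    return 2.0 / (1.0 / dmed_s + 1.0 / dmed_n)


def lc_noise_only(dmed_n):
    """Equation (8): the speech term 1/DMED_S is replaced by its bound 0.01."""
    return 2.0 / (0.01 + 1.0 / dmed_n)


# With DMED_N = 1 (0 dB) the criterion is about 1.98 (roughly 3 dB), and it
# stays below 2 even when DMED_S grows far above 100.
lc_exact = lc_harmonic(100.0, 1.0)
lc_approx = lc_noise_only(1.0)
lc_large_speech = lc_harmonic(1000.0, 1.0)
```

Because the criterion depends only on $DMED_N$, it can be refreshed from noise-only frames without any estimate of the speech energy.
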
In this paper, we update the $LC$ in noise-only periods. A voice activity detection (VAD) is used to identify the noise-only sections of the mixture.

$$VAD(m) = \begin{cases} 1, & N_m > \alpha \ \text{and} \ DMTD(m,\tau) > \beta \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$

where $N_m$ is the number of channels in frame $m$ that contain speech, counted as the number of channels for which $DMED(c,m) > 10$. The threshold $\alpha$ is set to 2, reflecting that the target speech may exist in only a very limited number of channels. $DMTD(m,\tau)$ is the time difference between the two microphone signals; the DMTD of the target speech is larger than that of the noise signal in most conditions. The threshold $\beta$ is set to 6 based on our experience.

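The frame-level decision can be sketched as follows. The parameter names `alpha` and `beta` are illustrative labels for the two thresholds (2 and 6 in the text), the strict comparisons are an assumption, and the DMTD value is taken as an input computed elsewhere.

```python
def vad(dmed_frame, dmtd, channel_threshold=10.0, alpha=2, beta=6):
    """Return 1 when a frame is judged to contain target speech, else 0.

    dmed_frame: linear DMED values of all channels in one frame.
    dmtd: time-difference feature of the same frame.
    """
    # N_m: number of channels whose DMED exceeds 10 in this frame.
    n_m = sum(1 for d in dmed_frame if d > channel_threshold)
    return 1 if (n_m > alpha and dmtd > beta) else 0


speech_frame = vad([120.0, 150.0, 90.0, 1.0, 1.1], dmtd=8)    # 3 channels, large DMTD
few_channels = vad([120.0, 1.0, 1.1, 0.9, 1.2], dmtd=8)       # only 1 channel above 10
small_dmtd = vad([120.0, 150.0, 90.0, 1.0, 1.1], dmtd=3)      # DMTD below threshold
```

Frames that return 0 are treated as noise-only, and these are the frames in which the LC is updated.
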
To calculate $DMTD(m,\tau)$, we use the two full-band microphone signals $x_1$ and $x_2$. The normalized correlogram between the two microphones in each frame, $Corr(m,\tau)$, is calculated with delay $\tau$ by the following cross-correlation function.

$$Corr(m,\tau) = \frac{\sum_{n=1}^{k} \left(x_{m,1}(n) - \bar{x}_{m,1}\right)\left(x_{m,2}(n-\tau) - \bar{x}_{m,2}\right)}{\sqrt{\sum_{n=1}^{k} \left(x_{m,1}(n) - \bar{x}_{m,1}\right)^2 \sum_{n=1}^{k} \left(x_{m,2}(n-\tau) - \bar{x}_{m,2}\right)^2}} \qquad (10)$$

$$DMTD(m,\tau) = \max_{\tau} Corr(m,\tau) \qquad (11)$$

where $k$ is the frame size in sampling points, equal to 320 in this study. The range of $\tau$ is from -1 ms to 1 ms.

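The normalized cross-correlation can be sketched as below. Several points are assumptions of this illustration rather than details given in the text: the sum runs only over samples where both frames overlap, the lag range of ±16 samples corresponds to ±1 ms at an assumed 16 kHz sampling rate, and the function returns both the peak value and the lag at which it occurs, since the text does not make clear which of the two is compared with the threshold.

```python
import math


def normalized_corr(x1, x2, lag):
    """Normalized cross-correlation of x1[n] with x2[n - lag] over the overlap."""
    pairs = [(x1[n], x2[n - lag]) for n in range(len(x1)) if 0 <= n - lag < len(x2)]
    if not pairs:
        return 0.0
    mean1 = sum(a for a, _ in pairs) / len(pairs)
    mean2 = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mean1) * (b - mean2) for a, b in pairs)
    den = math.sqrt(
        sum((a - mean1) ** 2 for a, _ in pairs) * sum((b - mean2) ** 2 for _, b in pairs)
    )
    return num / den if den > 0.0 else 0.0


def best_lag(x1, x2, max_lag=16):
    """Search lags in [-max_lag, max_lag]; return (lag of the peak, peak value)."""
    lags = range(-max_lag, max_lag + 1)
    scores = [normalized_corr(x1, x2, lag) for lag in lags]
    peak = max(scores)
    return lags[scores.index(peak)], peak
```

For a frame in which microphone 1 receives a copy of the microphone 2 signal delayed by five samples, `best_lag` returns a lag of 5 with a peak close to 1.
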
4. Evaluation and Comparison

4.1. Test Corpus

Actual recordings from a dual-microphone system are used to test the algorithm's performance in an office environment. We record the target speech and the noise separately. In the recorded test corpus, the target speech includes 600 short Chinese utterances, comprising 200 Chinese names, 200 stock names and 200 place names, which were collected in quiet office rooms from two male speakers and one female speaker. The noise sounds include babble, white, m109 and machinegun noise from the NOISEX-92 database.

The interference is presented at a distance of 1.5 m from the listener, at azimuths of 0°, 45°, 90°, 135° and 180°, all at 0° elevation unless otherwise specified.

We use the recorded dual-microphone clean speech and the noise recorded at various locations to generate mixture signals with a defined SNR. The clean speech is kept at its original magnitude, and the energy of the recorded noise is adjusted to obtain the defined SNR.

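The mixing step described above can be sketched as follows: the speech is left untouched and the noise is scaled so that the speech-to-noise power ratio matches the target SNR. The text does not state on which microphone the SNR is defined, so this single-channel sketch is only an illustration, and the function names are hypothetical.

```python
import math


def scale_noise_to_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [gain * v for v in noise]


def mix(speech, noise, snr_db):
    """Add the rescaled noise to the unchanged speech, sample by sample."""
    scaled = scale_noise_to_snr(speech, noise, snr_db)
    return [s + v for s, v in zip(speech, scaled)]


speech = [1.0, -1.0] * 100
noise = [0.5, 0.5, -0.5, -0.5] * 50
mixture = mix(speech, noise, -5.0)
```

In a dual-microphone setting, applying the same gain to the noise in both channels would leave the energy difference of the noise between the microphones unchanged.
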
A simulated test corpus is also employed, created from various clean speech signals with four different noises and two reverberant room configurations. We use a set of binaural impulse responses (BIRs) to generate the transfer functions from the interference to the two microphones. For a close-talk system, it is hard to obtain the transfer function from the mouth to the two microphones, so we use the same sentences with different amplitudes to simulate the target speech signals at the two microphones. The speech materials are chosen randomly from the TIMIT corpus. The four noises again come from the NOISEX-92 database.

We use the ROOMSIM package to generate a library of BIRs. The reflection paths of a particular sound source are obtained using the image reverberation model for a small rectangular office room. The reflection coefficients of all wall surfaces are set to be equal. The room size is 6 m × 4 m × 3 m, and the two microphones are located at 2.5 m × 2.5 m × 2 m. The distance between the two microphones is 8 cm. Seventeen sound sources are located at the same distance of 1.5 m from the two microphones. The azimuth is between -180° and 180°, and the elevation is between 0° and 90°, in steps of 45°.

Noises are drawn randomly from the database and convolved with selected BIRs to generate the mixtures with the speech utterances. The number of interferences is chosen randomly from 1 to 5, and the interferences are placed randomly at the 17 positions. The SNR of all mixture signals is -5 dB. Finally, we generate a set of 1000 simulated mixtures to evaluate the performance of the dual-microphone speech enhancement algorithm.

4.2. Comparison Systems

In the experiments below, the results of the proposed method are compared with two existing methods from the literature [14, 19]. The system proposed in [14], denoted PLD (it appears as DLP in the tables, figures and later text), is a coherence-based algorithm. The energy level difference and the coherence function are used to extract the target sound in a noisy environment. It is designed for a small distance between the two microphones, which makes it hard to use in a close-talk system. The algorithm estimates the power spectral density of the noise and reduces it, which makes it hard to eliminate non-stationary noise. In this paper, we use the first 100 ms of the mixture signal to estimate the noise. The second comparison system is the joint localization and segregation approach presented in [19], dubbed MESSL, which is representative of the spatial clustering approach to localization. The system requires the number of sources to be specified and iteratively fits GMM models of the interaural phase difference (IPD) and the interaural level difference (ILD) to the observed data using an EM procedure. Across-frequency integration is handled by tying the GMM models in individual frequency bands to a principal ITD. We give MESSL the number of sound sources in these tests. This algorithm is also hard to use online.

We use implementations of DLP and MESSL provided by the authors of the algorithms. We also show the results of the IBM as a baseline.

4.3. Evaluation Results

1) SNR performance with recording data

Table 1 shows the speech segregation results with one babble noise under various SNR conditions. The interference is fixed at azimuth 0°. Among the compared systems, the proposed algorithm achieves the best performance in all conditions and comes closest to the results of the IBM. Almost all systems yield positive results under these test conditions, especially at low SNR. Because the locations of the target speech and the noise are not fixed during the recording period, the IPD and ILD change from time to time, and MESSL gives the worst results. The proposed algorithm is well suited to dealing with everyday noise.

Table 1. SNR (dB) Performance with Babble Noise

Mixture SNR (dB)    -5       0        5        10
IBM                 6.76     9.74     12.91    16.37
Proposed            2.28     6.32     10.49    14.43
DLP                 1.97     5.45     9.32     13.43
MESSL               1.31     4.47     5.11     8.87

The SNR performance with various noise types is also evaluated. The SNR of the mixture signal is 0 dB, and the interference is located at azimuth 180°, elevation 0°. As shown in Table 2, in most conditions the proposed algorithm obtains better results than the comparison systems. With machinegun noise, MESSL obtains a slightly higher score than the proposed system, while DLP fails on machinegun noise.

Table 2. SNR (dB) Performance with Various Noise Types

Noise type    BABBLE    M109     MACHINEGUN    WHITE
IBM           9.99      9.56     11.98         10.29
Proposed      6.46      5.47     8.25          7.00
DLP           5.67      4.66     0.00          6.58
MESSL         5.37      5.30     8.51          6.17

Figure 5. The SNR Performances with Babble Noise at Various Azimuths

White noise located at various azimuths is used to evaluate the SNR performance of the systems. The SNR of the mixture signal is -5 dB. As shown in Figure 5, the proposed algorithm obtains the highest SNR improvements in most conditions. It is also more robust than the two comparison algorithms. The proposed DMED algorithm can improve the SNR at various azimuths and elevations, and it also performs well across frequencies.

2) Performance evaluation with simulated data

Figure 6 illustrates the results of the three speech segregation algorithms for a test condition with five interferences at -5 dB SNR. The five interferences are located randomly at the 17 positions. Panel (a) shows the spectrogram of the reverberant mixture. The noise seriously corrupts the target speech, and it is hard to distinguish the target speech from the noise. Panels (b) and (c) show the spectrograms of the signals resynthesized with the ideal binary mask and with the binary mask estimated by the proposed algorithm, respectively. Compared with the mixture signal, both methods recover the target speech and reduce the noise significantly. The performance of the proposed system is very close to that of the IBM; estimation errors make the DMED result slightly worse than the IBM result. Panel (d) shows the spectrogram of the output of the PLD-based algorithm. It removes most of the noise but damages the target speech at the same time, and some noise clearly remains. The result of the MESSL algorithm is shown in (e). It removes most of the noise and rebuilds the target speech, but it also damages the target signal significantly, especially at certain frequencies.

Figure 6. The Spectrograms of Speech Segregation Results: (a) mixture; (b) IBM; (c) proposed algorithm; (d) DLP; (e) MESSL

We evaluate the system performance with various numbers of babble noises. All input mixture SNRs are -5 dB, and we calculate the SNR improvements. The results are given in Table 3. The ideal binary mask (IBM) is again used as the baseline.

Table 3. SNR Improvements with 1 to 5 Interferences

Number of noise sounds    1       2       3       4       5
IBM                       8.64    7.46    7.16    7.13    6.92
Proposed                  7.84    6.63    6.38    6.37    6.13
DLP                       2.72    1.95    1.99    2.13    2.20
MESSL                     7.64    5.88    4.34    3.69    3.05

As shown in Table 3, all algorithms yield a positive result. The worst result is 1.95, from DLP. Among the compared systems, the proposed algorithm achieves the best performance in all conditions and is close to the baseline IBM results. Its performance decreases gradually as the number of sound sources increases. Multiple sound sources are the main reason for the weaker performance of the MESSL algorithm: MESSL scores very well with a single babble noise, but its score drops quickly as the number of interferences increases. We also find that the systems perform better with the simulated data than with the recorded data, because the recorded data presents more complex auditory scenes than the simulated data.

Figure 7. The Performance of Speech Intelligibility with Various Interferences (accuracy in % versus noise number; curves: HIT, FA, HIT-FA)

To measure classification-based separation performance, we use HIT-FA as our main evaluation criterion, which has been shown to correlate well with human intelligibility [3]. The HIT rate is the percentage of target-dominant T-F units in the IBM that are correctly classified. The FA (false-alarm) rate is the percentage of interference-dominant T-F units that are wrongly classified. As shown in Figure 7, the intelligibility decreases slowly as the number of noises increases. The HIT-FA rate is almost 70% with five babble noises, which is better than most single-microphone algorithms [5].

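The HIT-FA criterion follows directly from its definition and can be sketched as below. Masks are assumed to be nested lists of 0 and 1 with identical shapes, and the function name `hit_fa` is hypothetical.

```python
def hit_fa(ibm, estimated):
    """Return (HIT, FA, HIT-FA) in percent for an estimated mask against the IBM."""
    hits = target_units = false_alarms = interference_units = 0
    for ibm_row, est_row in zip(ibm, estimated):
        for ideal, est in zip(ibm_row, est_row):
            if ideal == 1:
                target_units += 1
                hits += 1 if est == 1 else 0
            else:
                interference_units += 1
                false_alarms += 1 if est == 1 else 0
    hit = 100.0 * hits / target_units if target_units else 0.0
    fa = 100.0 * false_alarms / interference_units if interference_units else 0.0
    return hit, fa, hit - fa


# 3 of 4 target units are kept (HIT = 75%) and 1 of 4 interference units is
# wrongly kept (FA = 25%), so HIT-FA = 50%.
ideal_mask = [[1, 1, 0, 0], [1, 0, 0, 1]]
estimated_mask = [[1, 0, 0, 1], [1, 0, 0, 1]]
hit, fa, score = hit_fa(ideal_mask, estimated_mask)
```
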
5. Summary and Concluding Remarks

The performance of the proposed algorithm with various interferences in reverberant and noisy environments is evaluated in terms of SNR and speech intelligibility. The results indicate that the proposed system performs better than the comparison algorithms. The proposed speech separation approach is well suited to the close-talk system: it offers not only high performance but also low complexity and real-time operation. Monaural features such as pitch, GFCC and MFCC could potentially benefit speech detection and segregation, and could be used to improve the performance of the proposed algorithm. This is a topic that will be addressed in future work.

References
[1] WG Yan, GY Xiang, ZX Qun. A signal subspace speech enhancement method for various noises. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2013; 11(2): 726-735.
[2] DL Wang, GJ Brown. Eds. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. New Jersey: John Wiley & Sons. 2006.
[3] N Li, PC Loizou. Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction. J Acoust Soc Am. 2008; 123(3): 1673-1682.
[4] Y Jiang, W Liang, H Zhou, ZM Feng. Performance of binary time-frequency masks in low signal to noise ratio environments. J. Tsinghua Univ. 2012; 52(5): 636-641.
[5] YX Wang, K Han, DL Wang. Exploring monaural features for classification-based speech segregation. IEEE Trans. on Audio Speech Lang Process. 2013; 21(2): 270-279.
[6] AS Bregman. Auditory Scene Analysis. MIT Press, Cambridge, MA. 1990.
[7] J Woodruff, DL Wang. Binaural localization of multiple sources in reverberant and noisy environments. IEEE Trans. on Audio Speech Lang Process. 2012; 20(5): 1503-1512.
[8] S Keronen, H Kallasjoki, U Remes, GJ Brown, JF Gemmeke, KJ Palomaki. Mask estimation and imputation methods for missing data speech recognition in a multisource reverberant environment. Computer Speech & Language. 2013; 27(3): 798-819.
[9] B Jing, W Jie, ZX Ying. A Parameters Optimization Method of nu-support Vector Machine and Its Application in Speech Recognition. Journal of Computers. 2013; 8(1): 113-120.
[10] Y Bo, L Haifeng, F Chunying. Speech Emotion Recognition based on Optimized Support Vector Machine. Journal of Software. 2012; 7(12): 2726-2733.
[11] ZZ Jin, DL Wang. A supervised learning approach to monaural segregation of reverberant speech. IEEE Trans. on Audio Speech Lang Process. 2009; 17(4): 625-638.
[12] G Hinton, S Osindero, Y Teh. A fast learning algorithm for deep belief nets. Neural Comput. 2006; 18(7): 1527-1554.
[13] H Zhou, Y Jiang, M Jiang, Q Chen. Energy difference based speech segregation for close-talk system. Applied Mechanics and Materials. 2012; 223-231: 1738-1741.
[14] N Yousefian, PC Loizou. A dual-microphone speech enhancement algorithm based on the coherence function. IEEE Trans. on Audio Speech Lang Process. 2012; 20(2): 599-609.
[15] Y Jiang, M Jiang, Y Zu, H Zhou, Feng. Using energy difference for speech separation of dual-microphone close-talk system. Sensors and Transducers. 2013; 21(5): 122-127.
[16] J Blauert. Spatial Hearing: The Psychophysics of Human Sound Localization. Cambridge, MA: MIT Press. 1997.
[17] N Roman, DL Wang, GJ Brown. Speech segregation based on sound localization. J Acoust Soc Am. 2003; 114(4): 2236-2252.
[18] Y Jiang, YY Zu, X Chen, H Zhou. Performance evaluation of a gammatone filterbank for the embedded system. Applied Mechanics and Materials. 2013; 336-338: 1459-1462.
[19] MI Mandel, RJ Weiss, DPW Ellis. Model-based expectation-maximization source separation and localization. IEEE Trans. on Audio Speech Lang. Process. 2010; 18(2): 382-394.