International Journal of Electrical and Computer Engineering (IJECE)
Vol. 6, No. 3, June 2016, pp. 1059~1071
ISSN: 2088-8708, DOI: 10.11591/ijece.v6i3.8904
Journal homepage: http://iaesjournal.com/online/index.php/IJECE
Exploration of Corpus Augmentation Approach for English-Hindi Bidirectional Statistical Machine Translation System
K. Jaya, Deepa Gupta
1 Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham, Amrita School of Engineering, Bangalore Campus, India
2 Department of Mathematics, Amrita Vishwa Vidyapeetham, Amrita School of Engineering, Bangalore Campus, India
Article Info
Article history:
Received Aug 25, 2015
Revised Dec 4, 2015
Accepted Jan 2, 2016
ABSTRACT
Even though a lot of Statistical Machine Translation (SMT) research is being carried out for the English-Hindi language pair, no effort has been made to standardize the dataset. Each study uses a different number of sentences, different datasets and different parameters during the various phases of translation, resulting in varied translation output. Comparing these models, understanding their results, gaining insight into corpus behavior and reproducing the results therefore become tedious. This necessitates standardizing the dataset and identifying the common parameters for model development. The main contribution of this paper is an approach to standardize the dataset and to identify the parameter combination that gives the best performance. It also investigates a novel corpus augmentation approach to improve the translation quality of an English-Hindi bidirectional statistical machine translation system. This model works well for scarce resources without incorporating an external parallel data corpus of the underlying language. The experiment is carried out using the open source phrase-based toolkit Moses. The Indian Languages Corpora Initiative (ILCI) Hindi-English tourism corpus is used. With a limited dataset, considerable improvement is achieved using the corpus augmentation approach for the English-Hindi bidirectional SMT system.
Keyword:
Corpus augmentation
Indian language
Machine translation
Moses SMT
OOV
Statistical machine translation
Copyright © 2016 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Dr. Deepa Gupta,
Department of Mathematics,
Amrita School of Engineering,
Bangalore, India.
Email: jayakarayil@gmail.com
1. INTRODUCTION
The exponential growth of the internet and the huge availability of information pose a new challenge to language technology. The generation of knowledge, and the need to absorb that knowledge at the pace at which it is disseminated, require one to be well versed in the language in which the knowledge is published. But mastering all languages is impossible. This is where Machine Translation (MT) comes in, translating a document in any language into a document in any other language. Of all the machine translation technologies, Statistical Machine Translation (SMT) [1] is considered an important MT technique. SMT systems can be developed independently of the underlying language and are based on a bilingual, sentence-aligned parallel corpus. With the increased availability of free, large language corpora and of high-speed processors with huge memory, SMT has become an important paradigm in machine translation.
India is a multilingual country with Hindi as its official language. English, being the lingua franca of science, media and technology, is the de facto medium of educational materials created worldwide, so the importance of English-Hindi machine translation is obvious. Hindi is a morphologically rich language and is a
Subject-Object-Verb (SOV) word-order language. The difference in word order between the two languages and the morphological richness of Hindi prove to be a challenge in Statistical Machine Translation.
The English-Hindi (Eng-Hin) language pair is considered for this experiment, using the ILCI Hindi-English tourism corpus (http://tdil-dc.in/). Even though there is SMT research [2],[3] using the ILCI Hindi-English tourism corpus, there is no set standard for splitting the parallel corpus into training, development and test datasets for the translation task. Varying the number of sentences used in translation varies the translation quality. When models are generated using various SMT techniques on different splits, comparing these models, reproducing their results or understanding the behavior of the corpus under the various SMT techniques is not possible. It is therefore necessary to standardize the corpus for SMT research. One of the aims of this work is to provide a method to generate the dataset for the English-Hindi (Eng-Hin) bidirectional system. This methodology can be used to standardize the dataset and can be adapted to any language pair being considered for translation.
This paper also explores the various model parameters to find those that give the best translation quality when generating the baseline for the Eng-Hin bidirectional system using the ILCI Hindi-English tourism corpus. This work is the first of its kind for an Eng-Hin bidirectional system. A further contribution of this paper is parallel corpus augmentation to improve the translation quality of the Eng-Hin bidirectional system.
The rest of the paper is organized as follows. Section 2 briefly discusses the machine translation research carried out for Indian languages using various machine translation techniques. Section 3 covers the statistical machine translation approach in brief. Section 4 gives information about the corpus and the experimental setup. Section 5 discusses the proposed approach and Section 6 the experiment and results.
2. MT IN INDIA
For a multilingual country like India, the development of good machine translation systems for the various local languages is necessary for people to communicate and share knowledge without actually mastering the individual languages. Considering the importance of MT for India, the Government of India initiated TDIL (Technology Development for Indian Languages) with the intention of creating tools and techniques for machine translation. There are a large number of active groups working on MT. Some of the active players in MT research are CDAC, IIT Bombay, IISC Bangalore, IIIT Hyderabad, IIT Kanpur, Tamil University, Cochin University and Amrita University. Some of the projects funded by TDIL are the Angalabharthi [4], Anusaarka [5] and Anubharathi [6] systems developed by IIT Kanpur; MaTra (http://www.cdacmumbai.in/matra), developed by CDAC, Mumbai; Mantra (http://www.cdac.in/html/aai/mantra.asp), developed by CDAC, Pune; and Shiva and Shakthi, the projects jointly developed by IISC Bangalore and IIIT, Hyderabad.
Research work in MT for Indian languages involves various MT techniques, such as rule-based and empirical-based techniques. A rule-based machine translation system retrieves language knowledge from the dictionary and grammar of the respective languages to aid in translation. Rule-Based MT (RBMT) is of three categories: Direct, Transfer and Interlingual RBMT. A direct translation system [7] does a word-by-word translation using a bilingual dictionary. Anusarak, a direct MT system, translates between two closely related Indian languages using the principles of Paninian grammar. An interactive English-Tamil MT system [8] allows the user to update the system by adding more words to the lexicon and more rules to the rule base. Interlingua-based MT [9],[10] translates the source language to an intermediate language and then to the target language. Angalabharthi, an interlingua-based approach, analyses English sentences and creates an intermediate structure called PLIL (Pseudo Lingua for Indian Languages). An English-Hindi interlingua MT system uses Universal Networking Language (UNL) as the interlingua: it converts the source sentence into UNL, from which the target sentence is generated. This system does part-of-speech disambiguation and some sense disambiguation for postposition markers and pronouns. Mantra (MAchiNe assisted TRAnslation tool), developed by CDAC, uses a transfer-based approach. An English-to-Kannada MT system [11], developed at the Resource Centre for Indian Language Technology Solutions and funded by the Govt. of Karnataka, uses a transfer-based approach and is applied to the domain of government circulars.
Empirical-Based Machine Translation (EPBMT) uses large amounts of data in the form of corpora. EPBMT is of two categories: Example-Based (EBMT) and Statistical-Based. MATREX [12], the English-to-Hindi EBMT system [13], Anubharti, and the Shiva and Shakti MT systems use the example-based MT technique. Example-based MT translates by analogy and works well for domains with a limited vocabulary.
The disadvantage of the varied MT techniques discussed above is their inability to generate a language-independent model. This has resulted in the Statistical Machine Translation technique, which provides near-human translation and is language independent, gaining momentum in language technology. Figure 1
reflects the SMT technique having an edge over other MT techniques in recent years. Figure 1 is generated based on the number of papers published for Indian languages utilizing various MT technologies. Use of the SMT technique for Indian languages gained momentum from the year 2005, when Moses, an open source toolkit for SMT, became popular, as can be seen from Figure 1.
A survey of various machine translation methods is carried out in [14]. The major challenge in using SMT for Indian languages is the availability of a large parallel corpus. Generating a large parallel corpus is expensive and time consuming, so most research work concentrates on improving the translation quality using the scarce resource. Research in statistical machine translation for Indian languages, translating from one local language to another or from English to local languages, has incorporated preprocessing approaches: morphology and dependency relations [15], reordering rules in the preprocessing stage [16],[17], POS tagging [18],[19], concept labeling [20], syntactic and morphological information [21]-[24], and source-side reordering [25],[26].
One example of an SMT system for Indian languages is 'Google Translate' (translate.google.com/about/intl/en_ALL/), a multilingual service provided by Google. It is based on the Statistical Machine Translation technique. Google Translate translates the source language to an intermediate language and then to the target language. It uses millions of documents during translation.
Another MT technique is Hybrid Machine Translation (HMT), which uses multiple MT techniques in translating the language. Sampaark (http://sampark.iiit.ac.in/) and Anuvadaksh are hybrid MT systems funded by TDIL. Anuvadaksh is a hybrid system that integrates four different MT technologies: Tree-Adjoining-Grammar (TAG) based MT, statistical machine translation, Analyze and Generate rules (Anlagen) based MT, and example-based MT. This system translates text from English to six other Indian languages, i.e. Hindi, Urdu, Oriya, Bangla, Marathi and Tamil.
Figure 1. MT Research Trend for Indian Language
The system handles language divergence in a better way. The importance of HMT lies in SMT adding more value to the translation, with the other MT techniques providing support to enhance the translation output.
In spite of SMT's popularity, there has been little or no effort to standardize the datasets used in the various stages of generating the translation model, or to define a method for creating such a standardized dataset. Because of this limitation in SMT research, comparing various SMT models on a common baseline, reproducing the results of these experiments and understanding the corpus behavior when various techniques are used have not been studied. This calls for standardizing the data and using the standardized data to create translation models, which helps SMT researchers to be in sync with model behavior and to understand the translation quality better.
In this paper, a language-independent method is described that can be used to split the corpus into test, train and development datasets. These datasets are used during the various phases of translation. In addition to standardizing the dataset, the best parameters for generating the baseline of the bidirectional Eng-Hin SMT system are also identified. Using the standardized dataset, parallel corpus augmentation, a preprocessing approach, is used to improve the baseline translation output. This augmentation helps to improve the word alignment and to reduce the OOV in the test set, resulting in better translation output. In this experiment Moses [27], a phrase-based open source toolkit and a complete SMT system, is used.
Other SMT toolkits are also available. MARIE (http://www.talp.upc.edu/index.php/technology/tools/machine-translation-tools/75-marie) was developed at the TALP Research Center of the Universitat Politècnica de Catalunya (UPC) by Joseph M. Crego in 2005. Phramer [28] is an open source phrase-based SMT system written in Java and compatible with Pharaoh (2004). Joshua
[29] is a decoder developed as a research collaboration between Johns Hopkins University and the University of Pennsylvania in 2009. Of all these open source tools, Moses is a complete SMT system that is widely used in SMT research and has a wide development and support community.
Before discussing the experimental setup and the proposed approach, a brief discussion of the basics of the SMT approach is provided.
3. BASICS OF STATISTICAL MACHINE TRANSLATION APPROACH
Statistical Machine Translation is a corpus-based machine translation approach based on the noisy channel model. When given a huge parallel corpus as input, it provides near error-free translation. SMT uses a bilingual sentence-aligned model, which is independent of the languages being translated. Generating translations through SMT is inexpensive, requires no human intervention in the translation and can also produce a language-independent model.
The goal of SMT is to generate the target sentence from the source sentence using the parallel corpus. SMT has three components: the Language Model, which reflects the fluency of the target language; the Translation Model, which identifies the correspondence between words and phrases in the source and target languages; and the Decoder, which identifies the best target sentence for a given input sentence using the translation and language models. These three components form the core of Statistical Machine Translation. Figure 2 shows the SMT architecture.
When a source language sentence (S) is given as input to the decoder, the corresponding target language sentence (T) is generated based on the equation.
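For reference, the noisy channel model is conventionally written as the decision rule below. This is a sketch in the usual notation, assuming P(T) denotes the language model probability and P(S | T) the translation model probability; the symbol \hat{T} for the decoder output is introduced here for illustration.

```latex
\hat{T} = \arg\max_{T} P(T \mid S) = \arg\max_{T} P(S \mid T)\, P(T)
```

The second equality follows from Bayes' rule, since P(S) is constant for a given input sentence and does not affect the maximization.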
4. CORPUS CONTOUR & EXPERIMENTAL SETUP
In this experiment, the Hindi-English tourism corpus, collected under the Indian Languages Corpora Initiative (ILCI) project initiated by the DeitY, Govt. of India, Jawaharlal Nehru University, New Delhi, is used. The corpus statistics are presented in Table 1.
Table 1. Statistics of ILCI Hindi-English tourism corpus
Data Statistics         English    Hindi
No. of Sentences        12194      12200
No. of words            35876      36786
Min Sentence Length     9          11
Max Sentence Length     79         94
Avg Sentence Length     50         55
Figure 2. High Level Design of SMT System
The toolkit used in this experiment, Moses, requires Giza++ [30], an open source implementation of the IBM models, for word alignment. Tuning is done by decoding and minimum error rate training (MERT) [31]. The KenLM toolkit [32] is used for building the language model. BLEU [33] is used for the automatic evaluation of the SMT system. The BLEU score is a widely used metric which is language independent,
inexpensive and correlates highly with human evaluation. This metric gives the precision of n-grams with respect to the reference translation. The score is usually between 0 and 1, and a score closer to 1 represents a good translation. The modified n-gram precision score, p_n, is computed for each n-gram length by summing over the matches for every hypothesis sentence S in the complete corpus C as:
p_n =
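The computation described here, clipped n-gram matches summed over every hypothesis sentence and divided by the total number of hypothesis n-grams, can be sketched as follows. This is a minimal illustration of the standard BLEU modified precision with a single reference per sentence; the function and variable names are assumptions introduced here, not taken from the paper or from any toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hypotheses, references, n):
    """Corpus-level modified n-gram precision p_n.

    hypotheses: list of token lists, one per hypothesis sentence S in C
    references: list of token lists, one reference per hypothesis
    """
    matched = 0  # clipped n-gram matches, summed over all sentences
    total = 0    # all hypothesis n-grams, summed over all sentences
    for hyp, ref in zip(hypotheses, references):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        for gram, count in hyp_counts.items():
            # Clip each hypothesis count by its count in the reference.
            matched += min(count, ref_counts[gram])
        total += sum(hyp_counts.values())
    return matched / total if total else 0.0


hyps = [["the", "the", "the", "cat"]]
refs = [["the", "cat", "sat"]]
print(modified_precision(hyps, refs, 1))  # 0.5
```

Clipping is what keeps the repeated "the" in the hypothesis from being rewarded three times: only one occurrence is matched because the reference contains it once.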
5. PROPOSED APPROACH
There are 6 stages in this experiment, and Figure 3 represents the sequence of the approach. They are described in the subsequent sections.
5.1. Preprocessing
Even though the parallel corpus is published by TDIL, it needs to be cleaned to be relevant for the experiment. The first part of the experiment is cleaning the corpus presented in Table 1. In this stage, blank lines and sentences that have no equivalent or valid translation are removed from the parallel corpus. After this step, basic preprocessing such as case conversion and tokenization is done. Table 2 lists the statistics of the cleaned parallel corpus. After corpus cleaning, the parallel corpus is ready to be used in generating the translation model.
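The cleaning step can be sketched as follows. This is a minimal illustration under stated assumptions: the two sides are given as line-aligned lists of strings, a pair is dropped when either side is blank, and whitespace splitting stands in for a real tokenizer. The function name and the sample sentences are hypothetical.

```python
def clean_parallel_corpus(src_lines, tgt_lines):
    """Drop pairs with a blank side, then lowercase and tokenize the rest."""
    cleaned = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side has no equivalent translation
        # Lowercasing only affects cased scripts; Devanagari is unchanged.
        cleaned.append((src.lower().split(), tgt.lower().split()))
    return cleaned


english = ["The Taj Mahal is in Agra .", "", "Delhi is the capital ."]
hindi = ["ताजमहल आगरा में है ।", "यह वाक्य अकेला है ।", "   "]

pairs = clean_parallel_corpus(english, hindi)
print(len(pairs))   # 1
print(pairs[0][0])  # ['the', 'taj', 'mahal', 'is', 'in', 'agra', '.']
```

Removing a pair as a unit, rather than filtering each side separately, keeps the two sides sentence-aligned, which is why the cleaned corpus has the same sentence count in both languages.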
Table 2. Statistics of parallel corpus after cleaning
Data Statistics         English    Hindi
No. of Sentences        11700      11700
No. of words            33720      35930
5.2. Generation of Standardized Dataset
After the preprocessing stage, the corpus is split into training, test and development datasets to be used in the various phases of SMT. The training set is used for generating the translation model; the development set, which is unique and is part of neither the training nor the test set, is needed for tuning the SMT model parameter combination; and the test set is for testing the built model. The translation quality varies with the split of the data. One of the main contributions of this paper is to define a method that can be used as a standard method to generate the dataset. This method uses the Out-Of-Vocabulary (OOV) criterion. Out-Of-Vocabulary is the number of unknown words seen in the test set that are not in the train set. Two types of dataset are created: 0% Out-Of-Vocabulary (0OOV) and Least Out-Of-Vocabulary (LOOV). For the 0% OOV dataset, the test and dev datasets contain sentences that are present in the training dataset. There are no new words in the 0% OOV test dataset, i.e., the test set is a subset of the training set. The translation accuracy measured on the 0% OOV (0OOV) dataset provides an insight into the ability of the corpus to translate the document. In the case of the LOOV dataset, the test and dev datasets have some words that are not seen in the training dataset. The dataset with the least number of OOV words is taken as the LOOV dataset.
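The OOV criterion can be illustrated with a short sketch. It assumes that OOV is counted over distinct word types and that sentences are already tokenized; the text does not say whether types or tokens are counted, so this choice and the function names are assumptions.

```python
def vocabulary(sentences):
    """Set of distinct tokens in a list of tokenized sentences."""
    return {token for sentence in sentences for token in sentence}


def oov_words(train, test):
    """Distinct test tokens that are absent from the training vocabulary."""
    return vocabulary(test) - vocabulary(train)


train = [["delhi", "is", "big"], ["agra", "is", "old"]]

# 0% OOV case: the test set is drawn from the training sentences.
test_0oov = [train[0]]
print(len(oov_words(train, test_0oov)))  # 0

# A held-out sentence introduces unseen words.
test_new = [["jaipur", "is", "pink"]]
print(sorted(oov_words(train, test_new)))  # ['jaipur', 'pink']
```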
Figure 3. Schematic Diagram of Proposed Approach
For creation of the dataset, 5% of the parallel corpus is segregated as the test set, 15% as the dev set and the remaining 80% as the train set. The test and dev sets of the LOOV dataset have unique sentences with respect to the train dataset. The LOOV dataset is constructed to have the least OOV in the test set. This is achieved by generating datasets with various combinations of sentences in train, test and dev, and selecting the combination for which the number of unknown words in the test set is minimum.
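One way to realize this selection is a search over repeated random splits, keeping the run with the fewest unseen test words. The sketch below assumes random shuffling as the way the combinations are generated, which the text does not specify; the function names, the toy corpus and the split sizes are hypothetical. For a parallel corpus the same sentence indices would be applied to both language sides.

```python
import random


def oov_count(train, test):
    """Number of distinct test words that never occur in the training set."""
    train_vocab = {tok for sent in train for tok in sent}
    test_vocab = {tok for sent in test for tok in sent}
    return len(test_vocab - train_vocab)


def least_oov_split(sentences, n_test, n_dev, runs, seed=0):
    """Try `runs` random splits; keep the one with the fewest OOV test words."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    best = None
    for _ in range(runs):
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        test = shuffled[:n_test]
        dev = shuffled[n_test:n_test + n_dev]
        train = shuffled[n_test + n_dev:]
        score = oov_count(train, test)
        if best is None or score < best[0]:
            best = (score, train, dev, test)
    return best


corpus = [s.split() for s in [
    "delhi is the capital",
    "agra is an old city",
    "the taj mahal is in agra",
    "jaipur is the pink city",
    "the fort is in jaipur",
    "delhi has an old fort",
    "the city is big",
    "goa has a beach",
]]
oov, train, dev, test = least_oov_split(corpus, n_test=1, n_dev=1, runs=20)
print(oov, len(train), len(dev), len(test))
```

Because train, dev and test are disjoint slices of one shuffled list, the test and dev sentences are unique with respect to the train set, as the LOOV construction requires.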
Figure 4. Generation of English Dataset with LOOV
Figure 5. Generation of Hindi Dataset with LOOV
Table 3. Data statistics of Training, Dev and Test dataset
LOOV Eng-to-Hin Datasets Statistics
         English                  Hindi
         # Sentences   # words    # Sentences   # words
Train    10200         23967      10200         30208
Dev      500           3649       500           4200
Test     1000          6104       1000          7038
LOOV Hin-to-Eng Datasets Statistics
         Hindi                    English
         # Sentences   # words    # Sentences   # words
Train    10200         26163      10200         24926
Dev      500           3749       500           4142
Test     1000          6018       1000          6862
Figure 4 shows the OOV rate for various English runs. The run with the minimum OOV in the test set is taken as the Least OOV dataset for the Eng-to-Hin direction of the bidirectional system, and the corresponding dev and train sets are also extracted. The Least OOV dataset is generated for both the English and Hindi languages. Figure 5 shows the OOV rate for various Hindi test datasets, which are used for the Eng-Hin SMT system. Table 3 gives the data statistics of the training, dev and test datasets.
5.3. Identifying Best Model Configuration For Eng-Hin Bidirectional System
Moses comprises a Language Model and a Translation Model. These models support various parameters which, on tuning, yield the baseline model with the best translation quality. The language model has phrase length (n-gram order) as one of its parameters. The parameters that the translation model supports are translation phrase length, alignment and reordering. Translation phrase length specifies the number of words within a phrase. Word alignment is the task of identifying translation relationships among words. Moses supports various alignment heuristics, among them intersection, grow-diagonal and union. In intersection, the intersection of the two directional Giza++ alignments is taken, and in union, their union is considered. Reordering accounts for differences in word order between the source and target sentences. The reordering models supported by Moses include distance, hierarchical and phrase-msd-bidirectional. Distance-based reordering assigns a penalty to every reordering, and the penalty increases as the reordering distance increases.
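The two simplest symmetrization heuristics and the distance-based reordering cost can be sketched as follows. This is an illustration, not Moses code: alignments are assumed to be sets of (source index, target index) links in a common orientation, the distance is the common phrase-based formulation |start_i - end_{i-1} - 1|, and the function names are hypothetical. The grow-diagonal family is not shown.

```python
def symmetrize(src_to_tgt, tgt_to_src, heuristic):
    """Combine two directional word alignments given as sets of (src, tgt) links."""
    if heuristic == "intersection":
        return src_to_tgt & tgt_to_src  # links found in both directions
    if heuristic == "union":
        return src_to_tgt | tgt_to_src  # links found in either direction
    raise ValueError(f"unsupported heuristic: {heuristic}")


def distortion_distance(prev_end, next_start):
    """Reordering distance between consecutively translated source phrases.

    prev_end: index of the last source word of the previous phrase
    next_start: index of the first source word of the next phrase
    Monotone translation (next_start == prev_end + 1) costs 0.
    """
    return abs(next_start - prev_end - 1)


e2h = {(0, 0), (1, 2), (2, 1)}
h2e = {(0, 0), (1, 2), (2, 2)}
print(sorted(symmetrize(e2h, h2e, "intersection")))  # [(0, 0), (1, 2)]
print(sorted(symmetrize(e2h, h2e, "union")))  # [(0, 0), (1, 2), (2, 1), (2, 2)]

print(distortion_distance(prev_end=2, next_start=3))  # 0
print(distortion_distance(prev_end=2, next_start=6))  # 3
```

Intersection yields fewer, higher-precision links, while union yields more links with higher recall; the grow-diagonal heuristics start from the intersection and add selected links from the union.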
Another objective of this experiment is to find the model that gives the top performance for the baseline. Features of the language pair, such as the word order and morphological richness of the languages, influence the model. For the Eng-Hin bidirectional SMT system, no previous work discusses the best model for the baseline system. Identifying the best model for the Eng-Hin bidirectional SMT system is one of the tasks of this experiment.
The baseline is generated using the 0OOV and LOOV datasets described in the previous section. In this experiment the language model (lm) phrase length is set to 3 (the default) because of limited system resources. Experiments with various translation phrase lengths are carried out. From these experiments it is found that setting the translation phrase length to 7 gives the top performance for the Eng-Hin bidirectional system. Figure 6 presents the Bleu score of the model for varying phrase lengths. A longer phrase length, e.g. phrase length = 11, gives a better score for the 0OOV dataset, but the score deteriorates for the LOOV dataset.
Using various alignment and reordering parameters, base models for both the 0OOV and LOOV datasets are generated. Figure 7 and Figure 8 show the Bleu score for the 0OOV dataset and the LOOV dataset respectively, using various combinations of parameters. Table 4 gives a summary of the best parameters identified in this experiment. From Table 4, for the baseline system on the LOOV dataset, grow-diag-final-and as the alignment parameter and msd-bidirectional-fe/distance as the reordering parameter give the best result. The output of the 0OOV baseline SMT system indicates the maximum translation accuracy an SMT system can achieve.
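The search over parameter combinations can be organized as a simple grid, sketched below. The candidate lists are illustrative assumptions (only some of these values are named in the text), and `score_fn` is a hypothetical callback that would train a model for a configuration and return its Bleu score; here it is replaced by a dummy scorer so the example runs on its own.

```python
from itertools import product

# Illustrative candidate values; the full grid used in the experiment is not listed.
ALIGNMENTS = ["intersection", "grow-diag", "grow-diag-final-and", "union"]
REORDERINGS = ["distance", "msd-bidirectional-fe", "hier-mslr-bidirectional-fe"]
PHRASE_LENGTHS = [3, 5, 7, 9, 11]


def configurations():
    """Yield every combination of alignment, reordering and phrase length."""
    for alignment, reordering, length in product(
            ALIGNMENTS, REORDERINGS, PHRASE_LENGTHS):
        yield {
            "alignment": alignment,
            "reordering": reordering,
            "max_phrase_length": length,
        }


def best_configuration(score_fn):
    """Return the configuration with the highest score under score_fn."""
    return max(configurations(), key=score_fn)


def dummy_score(cfg):
    """Stand-in for 'train the model and measure Bleu on the test set'."""
    return (
        -abs(cfg["max_phrase_length"] - 7)
        + (cfg["alignment"] == "grow-diag-final-and")
        + (cfg["reordering"] == "msd-bidirectional-fe")
    )


print(len(list(configurations())))  # 60
print(best_configuration(dummy_score))
```

Running the same grid separately on the 0OOV and LOOV datasets, and for each translation direction, would give one best configuration per row of a summary table.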
Figure 6. Baseline Bleu Score for various phrase length
Figure 7. OOV Bleu Score for various Model using various parameters
Figure 8. LOOV Bleu Score for various Model using various parameters
Table 4. Summary of the best parameter identified for generating baseline
Language pair             Alignment              Reordering                       Bleu Score
Eng-to-Hin (0% OOV)       Grow-diag              msd-bidirectional-fe             74.85
Hin-to-Eng (0% OOV)       Intersection           Hier-mslr-bidirectional-fe       83.23
Eng-to-Hin (least OOV)    Grow-diag-final-and    msd-bidirectional-fe/distance    23.05
Hin-to-Eng (Least OOV)    Grow-diag-final-and    msd-bidirectional-fe             16.03
5.4. Enhancement to Baseline System
Using the standardized dataset discussed in Section 5.2 and the best model specified in Table 4 for the Eng-Hin bidirectional system, the corpus augmentation preprocessing approach is used to improve the translation. In this stage, the parallel corpus is augmented with additional information extracted from the original parallel corpus. For the scarce-resource Eng-Hin bidirectional system, this preprocessing approach proves to be a viable solution for improving the translation quality. The corpus augmentation approach differs between the Eng-to-Hin and Hin-to-Eng SMT systems because of the difference in the morphological richness of the two languages. To incorporate morphological variations, the corpus augmentation is done differently for each direction to get better performance.
For a scarce-resource language like Hindi, the corpus augmentation approach works well to improve translation. In this approach, the frequent, relevant phrases are extracted from the parallel corpus, augmented by increasing their weight linearly, and added to the original parallel corpus to get better alignment, which in turn gives better performance. The approach uses only the corpus, without any knowledge of the underlying language, to improve the translation quality. Two types of corpus augmentation are done in
this phase, using a lexicalized corpus and a lemmatized corpus for the Hin-to-Eng bidirectional SMT system; both are explained in the subsequent section.
5.4.1. Parallel Corpus Augmentation for Hin-to-Eng SMT System
For the Hin-to-Eng SMT system, corpus augmentation is done by adding the lexicalized and lemmatized corpora to the original corpus. The resultant corpus is used for generating the Hin-to-Eng SMT system. The algorithm to generate the lexicalized and lemmatized corpus is as follows.
1. Generate a phrase table of phrase length 1.
2. Extract phrases from the phrase table. For each source phrase (h), there is a target translation phrase (e).
3. For each phrase (h), fetch the probabilities φ(h|e) and φ(e|h), where φ(h|e) is the inverse phrase translation probability and φ(e|h) is the direct phrase translation probability.
Get phrases (h, e), (h, e) =
The phrases which are more relevant in the corpus are extracted from the corpus and added to the original corpus so as to get better word alignment. To construct the phrase table from the clean parallel corpus, the phrase length is set to one and the alignment is set to intersection. For example, the phrase ‘अंग्रेज’ has multiple translations in the phrase table, as shown in the example.
अंग्रेज ||| british ||| 1 0.0130719 0.333333 0.512821
अंग्रेज ||| built ||| 0.027027 0.0036101 0.166667 0.025641
अंग्रेज ||| few ||| 0.0714286 0.009434 0.166667 0.025641
अंग्रेज ||| jia ||| 0.14 0.333333 0.166667 0.025641
अंग्रेज ||| legacies ||| 0.5 0.25 0.166667 0.025641
There are 4 different scores for each phrase. In this experiment only the first and third scores, the inverse phrase translation probability φ(h|e) and the direct phrase translation probability φ(e|h) respectively, are considered. More reliable words are extracted from the phrase table; these are the ones with the highest direct/inverse translation probabilities.
अंग्रेज ||| british ||| 1 0.0130719 0.333333 0.512821
So the phrase is considered for the augmentation. The phrase ‘british’ is added to the English training corpus and ‘अंग्रेज’ is added to the Hindi corpus, and the weight of this corpus is linearly scaled.
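The selection step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical. Ranking candidates by the product φ(h|e)·φ(e|h) is an assumption, since the text only says that the pairs with the highest direct/inverse probabilities are kept. Modelling the linear weight as a repetition count is also an assumption.

```python
def parse_phrase_table_line(line):
    """Split 'source ||| target ||| s1 s2 s3 s4' into (source, target, scores)."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(value) for value in fields[2].split()]
    return source, target, scores


def select_reliable_pairs(lines):
    """For every source phrase h, keep the target phrase e with the best rank."""
    best = {}
    for line in lines:
        source, target, scores = parse_phrase_table_line(line)
        inverse, direct = scores[0], scores[2]  # phi(h|e) and phi(e|h)
        rank = inverse * direct  # assumed ranking criterion
        if source not in best or rank > best[source][1]:
            best[source] = (target, rank)
    return {source: target for source, (target, _) in best.items()}


def augment_corpus(hindi, english, pairs, weight=1):
    """Append each selected pair `weight` times (assumed form of linear weighting)."""
    hindi, english = list(hindi), list(english)
    for source, target in pairs.items():
        hindi.extend([source] * weight)
        english.extend([target] * weight)
    return hindi, english


phrase_table = [
    "अंग्रेज ||| british ||| 1 0.0130719 0.333333 0.512821",
    "अंग्रेज ||| built ||| 0.027027 0.0036101 0.166667 0.025641",
    "अंग्रेज ||| few ||| 0.0714286 0.009434 0.166667 0.025641",
    "अंग्रेज ||| jia ||| 0.14 0.333333 0.166667 0.025641",
    "अंग्रेज ||| legacies ||| 0.5 0.25 0.166667 0.025641",
]
pairs = select_reliable_pairs(phrase_table)
hindi_aug, english_aug = augment_corpus([], [], pairs, weight=2)
```

On the five entries listed above, ‘british’ has the highest value for both probabilities, so it is selected whichever of the two is used for ranking.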
The intention of generating the lemmatized corpus is to reduce the number of unknown words. Multiple Hindi words map to one English word. For example, the Hindi nouns ‘बिल्ला’ and ‘बिल्ली’ are mapped to ‘cat’ in English, and the adjectives ‘अच्छा’ and ‘अच्छे’ are mapped to ‘good’ in English. Hindi has various kinds of inflection: noun, adjective and verb. Nouns in Hindi are inflected for gender, number and case. Adjectives need to agree with nouns, which are inflected for gender, number and case. Verbs are inflected for gender, number, case, tense and voice. This inflection results in OOV words, i.e., words that are seen in the test set but not seen during training. To handle this morphological variation, the parallel corpus is lemmatized. English lemmatization is done using the Python web mining module Pattern (http://www.clips.ua.ac.be/pattern) and Hindi lemmatization is done using the Hindi shallow parser (http://ltrc.iiit.ac.in/analyzer/hindi/). After lemmatization, the more reliable words are extracted from the lemmatized corpus using the same procedure used for lexical corpus extraction.
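The effect of lemmatization on OOV words can be illustrated with a small Python sketch. It does not call Pattern or the Hindi shallow parser: the `lemma_lookup` dictionary is a hypothetical stand-in for their output, the function names are illustrative, and the toy sentences are invented only to show the mechanism.

```python
def lemmatize_sentence(tokens, lemma_lookup):
    """Replace every token by its lemma; unknown tokens are left unchanged."""
    return [lemma_lookup.get(token, token) for token in tokens]


def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that never occur in the training sentences."""
    vocabulary = {token for sentence in train_sentences for token in sentence}
    test_tokens = [token for sentence in test_sentences for token in sentence]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return unseen / len(test_tokens)


# Hypothetical lemma table: inflected adjective forms mapped to one lemma.
lemma_lookup = {"अच्छे": "अच्छा", "अच्छी": "अच्छा"}

train = [["अच्छा"]]           # toy training data
test = [["अच्छे"], ["अच्छी"]]  # toy test data with inflected forms only

raw_rate = oov_rate(train, test)
lemma_rate = oov_rate(
    [lemmatize_sentence(sentence, lemma_lookup) for sentence in train],
    [lemmatize_sentence(sentence, lemma_lookup) for sentence in test],
)
```

In this toy setting every inflected test token is unseen before lemmatization and seen afterwards, which is the behaviour the lemmatized corpus is meant to exploit.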
The lexical corpus and lemmatized corpus generated as described above are added to the original parallel corpus. The resultant augmented corpus is used to generate the Hin-to-Eng SMT system.
5.4.2. Parallel Corpus Augmentation for Eng-to-Hin SMT System
Corpus augmentation for the Eng-to-Hin SMT system is done differently from that for the Hin-to-Eng SMT system. Translation from English to Hindi is a challenge because of the difference in the morphological nature of the two languages. Corpus augmentation for Eng-to-Hin is done in a similar manner as explained in section 5.4.1. In addition to that, morphological variations for Hindi nouns and adjectives are
added to the parallel corpus. For example, ‘bad’ has the translation phrase ‘बुरा’ in the phrase table. The morphological forms ‘बुरे’ and ‘बुरी’ of ‘बुरा’ are generated and added to the parallel corpus. Similarly, noun morphological variations are added to the augmented list. These variations are added to the parallel corpus for the corresponding English phrase. The resultant transformed corpus is used to generate the Eng-to-Hin SMT system.
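The adjective case can be sketched in Python as follows. The sketch assumes only the regular pattern seen in the ‘बुरा’ example, where an adjective ending in the vowel sign ‘ा’ also takes the endings ‘े’ and ‘ी’. It is not a morphological generator: nouns and irregular forms are not handled, and the function names are hypothetical.

```python
AA, E, II = "\u093e", "\u0947", "\u0940"  # Devanagari vowel signs ा, े, ी


def adjective_variants(adjective):
    """Return the 'े' and 'ी' forms of an adjective ending in 'ा', else nothing."""
    if not adjective.endswith(AA):
        return []
    stem = adjective[:-1]
    return [stem + E, stem + II]


def augment_with_variants(pairs):
    """Add (english, variant) pairs for every generated Hindi variant."""
    augmented = list(pairs)
    for english, hindi in pairs:
        for variant in adjective_variants(hindi):
            augmented.append((english, variant))
    return augmented


pairs = augment_with_variants([("bad", "बुरा")])
```

Each generated form is paired with the same English phrase, so all variants are attached to the corresponding English side as described above.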
6. EXPERIMENTAL RESULT ANALYSIS
The proposed parallel corpus augmentation system is compared against one of the best performing open source SMT tools, Moses. Initially, parameter tuning for the system is done, and the best parameters were found to be the following.
language model phrase length = 3
translation phrase length = 7
The alignment and reordering parameters identified and summarized in Table 4 are used to build the Eng-Hin bidirectional SMT system using the data set listed in Table 3. Table 5 lists the results of the corpus augmentation preprocessing approach for the Eng-Hin bidirectional SMT system.
Table 5. Bleu Scores
System             Baseline Dev  Baseline Test  Augmentation Dev  Augmentation Test
Eng-to-Hin system  22.43         23.05          26.52             25.12
Hin-to-Eng system  15.59         16.63          20.12             19.56
From the table it can be observed that the Bleu score has improved by approximately 2 points in the case of the Eng-to-Hin translation system. For the Hin-to-Eng translation system, the score is improved by 2.93 points. The improved translation system identifies the correct word, which improves the translation quality. Some of the OOV words not translated in the Baseline SMT system output are translated because of the addition of parallel corpus augmentation. Further, local reordering is also taken care of in the proposed SMT.
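The gains quoted above can be checked against Table 5 with a few lines of Python. The scores are copied from the table; the dictionary layout and the function name are illustrative only. The figures in the text appear to correspond to the Test columns.

```python
# Bleu scores from Table 5.
bleu = {
    "Eng-to-Hin": {
        "baseline": {"dev": 22.43, "test": 23.05},
        "augmented": {"dev": 26.52, "test": 25.12},
    },
    "Hin-to-Eng": {
        "baseline": {"dev": 15.59, "test": 16.63},
        "augmented": {"dev": 20.12, "test": 19.56},
    },
}


def bleu_gain(scores, split):
    """Augmented minus baseline Bleu for one split, rounded to two decimals."""
    return round(scores["augmented"][split] - scores["baseline"][split], 2)


gains = {
    system: {split: bleu_gain(scores, split) for split in ("dev", "test")}
    for system, scores in bleu.items()
}
for system, by_split in gains.items():
    print(system, by_split)
```

The test-set differences are 2.07 points for Eng-to-Hin and 2.93 points for Hin-to-Eng. The development-set differences are larger, at 4.09 and 4.53 points.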
An example of the Hin-to-Eng parallel corpus augmented system is illustrated in Table 6. The reference translation gives the actually expected translation. From Table 6, it can be observed that in the Baseline SMT system some words are not properly translated, viz., ‘गहन’, ‘ज़री’ and ‘कशीदाकारी’. These OOV words in the Baseline SMT system output are translated to ‘jewellery’, ‘brocade’ and ‘handicraft’ in the Parallel Corpus Augmented System output. Further, the phrase ‘उत्पाद हस्तशिल्प’ is translated to ‘handicraft such a the work of embroidery’, which is a notable translation improvement over the baseline system.
Table 6. Hin-to-Eng sample output
Test Sentence: आगरा संगमरमर पर जड़ाऊ काम, चर्मकार्य, जूते, तथा पीतल का काम, कालीन, गहन, ज़री तथा कशीदाकारी के काम जैसे हस्तशिल्प उत्पाद के लिए प्रसिद्ध है।
Reference Translation: agra is famous for handicrafts, products such as inlay work on marble, leatherwork, footwear, and brass work, carpets, jewellery, zari and embroidery work.
Baseline System output: agra inlay works on marble , चर्मकार्य , shoes , and brass work , carpets , the गहन , ज़री and कशीदाकारी work as उत्पाद हस्तशिल्प is famous for .
Parallel Corpus Augmented System output: agra inlay work on marble , चर्मकार्य , shoe , and brass work , carpet , jewellery , brocade and handicraft such a the work of embroidery be famous for the .