TELKOMNIKA, Vol.12, No.2, June 2014, pp. 447~454
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v12i2.2033

Received February 25, 2014; Revised April 12, 2014; Accepted April 29, 2014

Searching and Visualization of References in Research Documents

Firnas Nadirman1, Ahmad Ridha*2, Annisa2
1Agency for the Assessment and Application of Technology
Jl. M.H. Thamrin No. 8 Jakarta, 10340, Indonesia
2Department of Computer Science, Bogor Agricultural University
Kampus IPB Darmaga, Jl. Meranti Wing 20 Level 5 - 6, Bogor, 16680, Indonesia
*Corresponding author, e-mail: ridha@apps.ipb.ac.id

Abstract
This research aims to develop an information retrieval module that can trace references from the bibliography entries of research documents, specifically those that follow the writing guidelines of Bogor Agricultural University (IPB). A total of 242 research documents in PDF from the Department of Computer Science IPB were used to generate parsing patterns for extracting the bibliography entries. With a modified ParaTools, automatic extraction of bibliography entries was performed on text files generated from the PDF files. The entries are stored in a database that is used to visualize author relationships as graphs. The module is supplemented by an information retrieval system based on the Sphinx search system and also provides information on authors' publications and citations. Evaluation showed that (1) bibliography entry extraction missed only 5.37% of the bibliography entries, caused by incorrect bibliography formatting, (2) 91.54% of the bibliography entry attributes could be identified correctly, and (3) 90.31% of the entries were successfully connected to other documents.

Keywords: research documents search, bibliography entries extraction, author relationship visualization, ParaTools

1. Introduction
The bibliography is an important part of a research document because it lists the references cited in the document. The list is useful for readers, usually fellow scientists, to locate other related documents and to learn which other scientists are working on the topic. An important document in a field is more likely to be cited, so it is also desirable to know the number of citations that a document has.
Numerous studies on searching research documents have been proposed [1]-[5]. A system called Bibliometric Information Retrieval System (BIRS) [1] was developed as a web-based information retrieval system for research documents. BIRS connected three types of search engines: search engines on the internet, libraries, and online databases. Another web-based system [4], DBL Browser Framework, was built to track research documents by dividing the system into three modules: GUI layer, Visualization layer, and Data layer. The search results were visualized in the form of text and graphics. The system displayed the relationships between journals with the goal of finding journals referred to by other journals. The study assumed that a highly referred journal becomes the basis of the development of knowledge in a particular field.
Research [6]-[15] has also been conducted on bibliography extraction, and algorithms have been developed to recognize bibliography patterns. One of them [8] built a small collection of functions in the Perl programming language. It used templates to extract metadata from bibliography entries. A method to extract research documents [10] was developed using a combination of heuristics-based regular expressions and a knowledge system to seek bibliography entries. Another document extraction method [15] was developed using the Basic Local Alignment Search Tool (BLAST). BLAST is a sequence alignment tool that finds the template most similar to a protein sequence in a previously constructed template database. A database is used to store the bibliography templates. A citation is transformed into protein sequence form, and BLAST is used to search for the most similar sequence in the template database.
Research on the visualization of search results [16]-[17] has also been conducted. A prototype visualization system [16] was created to enhance author searching. The system was based on author co-citation analysis and algorithms such as Kohonen's feature maps and Pathfinder networks.
Google has Google Scholar (http://scholar.google.com), a search engine for retrieving research documents. It obtains bibliography entries from online documents to compute some metrics and provides authors with a publication profile. Microsoft has also created a search engine called Microsoft Academic Search (http://academic.research.microsoft.com) with similar features. In addition, it can display a Co-Author Graph, Co-Author Path, Citation Graph, and Genealogy Graph to visualize the relationships between authors.
The applications developed by Google and Microsoft can be used freely online, but they cannot be modified to extract the bibliography entries of research documents with a specific format. We also have no control over the scope of the documents. This paper proposes a method to extract the bibliographic entries of research documents from a given collection of documents.
This study aims to create a module that can extract references from the bibliography entries of research documents. A method is created to recognize the bibliography entries in the research documents. Once identified, the bibliography entries are stored in a database. The database is used to build an information retrieval system for searching research documents along with their references and to visualize the relationships between the authors.

2. The Methods
This study began with collecting the research documents as PDF files. Each file was converted into a plaintext file and stored in a research document database. The text was then processed to extract and identify the bibliography entries. The bibliography entries were stored in the database. The database was used to build an information retrieval system for research documents. A visualization module was created to display the relationships between the authors of the documents based on the bibliographic entries in the database. The steps of the proposed method are shown in Figure 1.

Figure 1. The Proposed Method

2.1. System Evaluation
System evaluation in this study adopts two information retrieval metrics, namely recall and precision [18]. This study carries out two kinds of evaluation. The first evaluation measures the success of the extraction and attribute identification of bibliography entries in each document. Assume B_i is the set of bibliography entries in the i-th research document and E_i is the set of bibliography entries that are successfully extracted and identified from the i-th research document by the system. Recall can then be calculated by (1) and precision by (2).
(1)
[Figure 1 flowchart: Start -> Collecting research documents and designing the database -> Extraction and attribute identification of bibliography entries -> Information retrieval system of research document development -> Creates visualization of the author relationship -> System Evaluation -> Finish]
(2)
Equations (1) and (2) are used to calculate the percentage of recall with (3) and the percentage of precision with (4) over all bibliography entries throughout the documents.
(3)
(4)
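
Under the standard set-based definitions of recall and precision [18], the four quantities referred to as (1)-(4) can be sketched as follows. This is a hedged reconstruction from the definitions of B_i and E_i, not a transcription: the symbol n (the number of documents) and the pooled form of (3) and (4) are assumptions, and an average of the per-document values would also fit the description.

```latex
\begin{align}
\mathrm{recall}_i &= \frac{|B_i \cap E_i|}{|B_i|} \tag{1} \\
\mathrm{precision}_i &= \frac{|B_i \cap E_i|}{|E_i|} \tag{2} \\
\mathrm{recall}\,(\%) &= \frac{\sum_{i=1}^{n} |B_i \cap E_i|}{\sum_{i=1}^{n} |B_i|} \times 100\% \tag{3} \\
\mathrm{precision}\,(\%) &= \frac{\sum_{i=1}^{n} |B_i \cap E_i|}{\sum_{i=1}^{n} |E_i|} \times 100\% \tag{4}
\end{align}
```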
The second evaluation measures how successfully documents in the collection are related to a bibliography entry. Assume C_i is the set of research documents that refer to the i-th bibliography entry and F_i is the set of research documents that the system determines to refer to the i-th bibliography entry. Recall can then be calculated by (5) and precision by (6).
(5)
(6)
Equations (5) and (6) are then used to calculate the percentage of recall with (7) and the percentage of precision with (8) over the total number of entries that are connected to a document.
(7)
(8)
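
Both evaluations rest on the same set arithmetic, so they can be illustrated with one small sketch. The code below assumes the usual definitions from [18] (hits divided by the actual set for recall, hits divided by the system's set for precision) and assumes that the percentages pool the counts over all items; the function names are hypothetical and do not come from the paper.

```python
from typing import Hashable, List, Set, Tuple

Pair = Tuple[Set[Hashable], Set[Hashable]]  # (actual set, set produced by the system)


def recall_precision(actual: Set[Hashable], found: Set[Hashable]) -> Tuple[float, float]:
    """Per-item recall and precision, e.g. for (B_i, E_i) or (C_i, F_i)."""
    hits = len(actual & found)
    recall = hits / len(actual) if actual else 0.0
    precision = hits / len(found) if found else 0.0
    return recall, precision


def pooled_percentages(pairs: List[Pair]) -> Tuple[float, float]:
    """Recall and precision in percent, pooled over all pairs (an assumed aggregation)."""
    hits = sum(len(actual & found) for actual, found in pairs)
    total_actual = sum(len(actual) for actual, _ in pairs)
    total_found = sum(len(found) for _, found in pairs)
    recall = 100.0 * hits / total_actual if total_actual else 0.0
    precision = 100.0 * hits / total_found if total_found else 0.0
    return recall, precision


if __name__ == "__main__":
    pairs = [({"a", "b", "c"}, {"a", "b"}), ({"d"}, {"d", "x"})]
    print(pooled_percentages(pairs))
```

For the first evaluation each pair would hold the entries of one document; for the second, each pair would hold the documents linked to one bibliography entry.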

2.2. Collection
Our collection consists of 242 PDF files of Bachelor theses from the Computer Science Department, Bogor Agricultural University (IPB), Indonesia, and almost all of them are written in Indonesian. Therefore, the templates in our bibliographic entry extraction are based on IPB's writing guidelines. The evaluation is performed using this collection.

3. Results
3.1. Data Characteristics
The documents in our collection share several characteristics: (i) they use two columns; (ii) a new chapter does not necessitate a page break; and (iii) the bibliography is indicated with a title of 'DAFTAR PUSTAKA' (Indonesian for 'bibliography') or 'REFERENCES', and is located before the appendices.
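
Characteristic (iii) suggests a simple way to isolate the bibliography in the extracted text. The following is a minimal sketch, not the authors' implementation: it assumes the heading stands alone on a line and that appendices begin with a line starting with 'LAMPIRAN' (Indonesian for 'appendix'), 'APPENDIX', or 'APPENDICES'. The appendix markers and the function name are assumptions made for illustration.

```python
import re
from typing import List

# Headings named in the text; the appendix markers are an assumption.
BIB_HEADING = re.compile(r"^\s*(DAFTAR PUSTAKA|REFERENCES)\s*$")
APPENDIX_HEADING = re.compile(r"^\s*(LAMPIRAN|APPENDIX|APPENDICES)\b")


def bibliography_lines(raw_text: str) -> List[str]:
    """Return the non-empty lines between the bibliography heading and the appendices."""
    lines = raw_text.splitlines()
    start = None
    for index, line in enumerate(lines):
        if BIB_HEADING.match(line):
            start = index + 1  # keep the last matching heading
    if start is None:
        return []
    body: List[str] = []
    for line in lines[start:]:
        if APPENDIX_HEADING.match(line):
            break
        if line.strip():
            body.append(line.strip())
    return body


if __name__ == "__main__":
    sample = "BAB I\nisi\nDAFTAR PUSTAKA\nEntry one.\n\nEntry two.\nLAMPIRAN\nTabel 1"
    print(bibliography_lines(sample))
```

Keeping the last matching heading is a design choice: a table of contents normally lists the heading together with a page number, so it would not match the stand-alone pattern anyway.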

3.2. Database Design
Database design starts with identifying the entities in the information retrieval system for research documents. The main entity of the system is a document that has at least one author and has a bibliography. The identified entities are used to obtain the conceptual, logical, and physical designs illustrated in Figure 2.

Figure 2. Database Design
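
To make the design concrete, the core tables named in Figure 2 can be sketched as a schema. This sketch uses SQLite through Python's standard library purely for illustration; the paper does not state which database engine was used. The column types, the constraints, and the reading of SourceDocumentID as the citing document are assumptions.

```python
import sqlite3

# Table and column names follow Figure 2; types and constraints are assumed.
SCHEMA = """
CREATE TABLE Document (
    DocumentID INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    year       TEXT,
    publisher  TEXT,
    file       TEXT,
    content    TEXT,
    type       TEXT CHECK (type IN ('journal', 'thesis', 'proceeding', 'website'))
);
CREATE TABLE Author (
    AuthorID    INTEGER PRIMARY KEY,
    author_name TEXT NOT NULL
);
CREATE TABLE DocumentAuthor (
    DocumentID INTEGER NOT NULL REFERENCES Document(DocumentID),
    AuthorID   INTEGER NOT NULL REFERENCES Author(AuthorID),
    PRIMARY KEY (DocumentID, AuthorID)
);
CREATE TABLE "Reference" (
    ReferenceID      INTEGER PRIMARY KEY,
    bibentry         TEXT NOT NULL,
    DocumentID       INTEGER NOT NULL REFERENCES Document(DocumentID),
    SourceDocumentID INTEGER NOT NULL REFERENCES Document(DocumentID)
);
CREATE TABLE Journal (
    DocumentID   INTEGER PRIMARY KEY REFERENCES Document(DocumentID),
    journal_name TEXT,
    volume       TEXT,
    pages        TEXT
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the sketched tables and return the open connection."""
    connection = sqlite3.connect(path)
    connection.executescript(SCHEMA)
    return connection


if __name__ == "__main__":
    conn = create_database()
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    print(sorted(row[0] for row in rows))
```

The Thesis, Proceeding, and Website tables would follow the same pattern as Journal, each keyed by DocumentID.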

3.3. Data Processing
Data processing involves three steps. The first is the conversion of the research documents from PDF files into text files, the second is the extraction of the bibliography entries, and the third is the identification of the attributes of the bibliography entries. PDF files are converted into text with the Xpdf tool pdftotext. Each PDF file is converted into two text files, i.e., raw text and layout text. Conversion into raw text provides the text in sequential order, while the layout text has a layout similar to that of the PDF file (see Figure 3).
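
The double conversion can be sketched with pdftotext's `-raw` and `-layout` options. The output file names and the helper names below are assumptions for illustration; the paper does not give the exact command lines.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def pdftotext_commands(pdf_path: str) -> Dict[str, List[str]]:
    """Build the two pdftotext command lines for one PDF file."""
    pdf = Path(pdf_path)
    raw_out = pdf.with_name(pdf.stem + ".raw.txt")        # hypothetical naming scheme
    layout_out = pdf.with_name(pdf.stem + ".layout.txt")  # hypothetical naming scheme
    return {
        "raw": ["pdftotext", "-raw", str(pdf), str(raw_out)],
        "layout": ["pdftotext", "-layout", str(pdf), str(layout_out)],
    }


def convert(pdf_path: str) -> None:
    """Run both conversions; requires pdftotext to be installed and on PATH."""
    for command in pdftotext_commands(pdf_path).values():
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for mode, command in pdftotext_commands("thesis.pdf").items():
        print(mode, " ".join(command))
```

Separating command construction from execution keeps the sketch testable on machines where Xpdf is not installed.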
[Figure 2 shows three panels: the conceptual design (Enhanced Entity Relationship diagram), the logical design, and the physical design. Tables in the physical design: Document (DocumentID, title, year, publisher, file, content, type, where type is "journal", "thesis", "proceeding", or "website"); Journal (DocumentID, journal_name, volume, pages); Proceeding (DocumentID, proceeding_name, location, date); Website (DocumentID, url); Thesis (DocumentID, Perceptor); Reference (ReferenceID, bibentry, DocumentID, SourceDocumentID); DocumentAuthor (DocumentID, AuthorID); Author (AuthorID, author_name).]

Figure 3. The Result of Conversion of PDF Files into Text

The next step is extracting the bibliography entries. The raw text is used to separate the bibliography from the other parts of the document. The bibliography entries are separated using a ParaTools module, Biblio Document Parser. The module checks each line of the raw text to extract the bibliographic entries. The extracted lines are then split into single bibliography entries by comparing each line with its text position in the layout text.
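
One way to read "comparing each line with its text position" is as a hanging-indent test: a line that starts at the left margin opens a new entry, and an indented line continues the previous one. The sketch below illustrates that idea only. It assumes a single column of layout text and a hanging indent, which the paper does not confirm, and it does not reproduce the ParaTools logic.

```python
from typing import List


def column_of(raw_line: str, layout_lines: List[str]) -> int:
    """Return the column at which raw_line first appears in the layout text, or -1."""
    for layout_line in layout_lines:
        column = layout_line.find(raw_line)
        if column >= 0:
            return column
    return -1


def split_entries(raw_lines: List[str], layout_text: str) -> List[str]:
    """Group raw bibliography lines into entries using their layout columns."""
    layout_lines = layout_text.splitlines()
    columns = [column_of(line, layout_lines) for line in raw_lines]
    known = [column for column in columns if column >= 0]
    if not known:
        return []
    margin = min(known)  # assumed left margin of the bibliography column
    entries: List[str] = []
    for line, column in zip(raw_lines, columns):
        if column == margin or not entries:
            entries.append(line)          # starts at the margin: new entry
        else:
            entries[-1] += " " + line     # indented or unmatched: continuation
    return entries


if __name__ == "__main__":
    layout = "First entry starts here\n   and continues here.\nSecond entry.\n"
    raw = ["First entry starts here", "and continues here.", "Second entry."]
    print(split_entries(raw, layout))
```

A two-column page, as in this collection, would need the margin to be estimated per column, which this sketch leaves out.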
The last step is identifying the attributes of each extracted bibliography entry by comparing it with bibliography entry templates. The ParaTools Biblio Citation Parser module is used to implement this process. The resulting bibliographic attributes are of two types, i.e., general attributes and specific attributes (see Table 1).

Table 1. Bibliography Entries Attributes
Type of Attributes    Attributes Name
General               Authors Name, Title, Publisher, Year
Specific              Journal Name, Volume, Number, Proceedings Name (Publication), Location, URL

Each bibliography entry template contains certain words that identify the attributes of a bibliography entry. The templates can be adjusted to the format of the bibliography entries. This study defines several bibliography entry templates based on IPB's writing guidelines because the documents in our testing collection are from IPB. In addition, further templates are generated to enable the extraction of malformed bibliographic entries in the documents.
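
Template matching of this kind can be illustrated with regular expressions. ParaTools itself is written in Perl and the actual IPB templates are not listed in the paper, so the two patterns below are hypothetical: they assume entries of the form "Authors. Year. Title. Journal Volume(Number): pages." and "Authors. Year. Title. Location: Publisher.".

```python
import re
from typing import Dict, Optional

# Hypothetical templates; the entry formats they encode are assumptions.
JOURNAL = re.compile(
    r"^(?P<authors>.+?)\.\s+(?P<year>\d{4})\.\s+(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?)\s+(?P<volume>\d+)\((?P<number>\d+)\):\s*"
    r"(?P<pages>\d+-\d+)\.?$"
)
BOOK = re.compile(
    r"^(?P<authors>.+?)\.\s+(?P<year>\d{4})\.\s+(?P<title>.+?)\.\s+"
    r"(?P<location>[^:.]+):\s+(?P<publisher>.+?)\.?$"
)
# The more specific template is tried first, since a journal entry
# would otherwise also satisfy the looser book pattern.
TEMPLATES = [("journal", JOURNAL), ("book", BOOK)]


def identify_attributes(entry: str) -> Optional[Dict[str, str]]:
    """Return the attributes of the first matching template, or None."""
    for name, pattern in TEMPLATES:
        match = pattern.match(entry.strip())
        if match:
            return {"type": name, **match.groupdict()}
    return None


if __name__ == "__main__":
    print(identify_attributes("Surname A. 2011. An example book. Bogor: Example Press."))
```

Malformed entries, such as those with a misplaced year, would fall through to None here, which is why the study adds extra templates for them.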
After the attributes of a bibliography entry are identified, the entry is stored in the database. Before storing an entry, the system runs a bibliography similarity check. The process compares the authors' names, title, and year of the extracted entry, using the Levenshtein distance function, against all bibliography entries that are already stored in the database. Two bibliography entries are considered the same if the conditions shown in Table 2 are satisfied.
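
The similarity check can be sketched with a plain dynamic-programming Levenshtein distance. Only the year equality condition listed for Table 2 is taken from the source; the distance thresholds for titles and author names, the lowercasing, and the function names are assumptions for illustration.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def same_entry(first: Dict[str, str], second: Dict[str, str],
               max_title_distance: int = 3, max_author_distance: int = 2) -> bool:
    """Decide whether two entries match; both thresholds are hypothetical."""
    if first["year"] != second["year"]:
        return False
    title_gap = levenshtein(first["title"].lower(), second["title"].lower())
    author_gap = levenshtein(first["authors"].lower(), second["authors"].lower())
    return title_gap <= max_title_distance and author_gap <= max_author_distance


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Comparing a new entry against every stored entry costs one distance computation per stored entry and attribute, so the check grows with the size of the database.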
[Figure 3 panels: Layout Text, Raw Text, Bibliography Entries]

Table 2. Conditions under which the attributes of two bibliography entries are considered the same
Attribute       Condition
Year            year1 = year2
Title
Authors Name

3.4. Information Retrieval of Research Document
An information retrieval system for research documents is built to implement the modules that have been made. The system consists of three components, i.e., the backend (document entry), the search engine, and the user interface.
The search engine component in this study uses an information retrieval engine called Sphinx [19], which provides basic tasks such as creating an index, assigning weights to the index, and searching the collection. Sphinx is configured to connect to the research document database as a data source and to create an index from a given table name or SQL query.
The interface of the information retrieval system consists of a form that contains a field for entering search terms and a button to initiate the search (see Figure 4). Users enter search terms, and the system uses the Sphinx search module to retrieve the relevant documents from the collection. The retrieved documents are shown in the search results, sorted in descending order of the relevance provided by the Sphinx search module (see Figure 5).

Figure 4. The Interface of Information Retrieval System
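
The Sphinx data source and index described in Section 3.4 are normally declared in a configuration file. The sketch below renders such a file from Python. It assumes a MySQL database on localhost and uses the Document table from Figure 2; the source and index names, credentials, query, and index path are all hypothetical, since the paper does not give the configuration.

```python
def render_sphinx_conf(db_name: str, db_user: str, db_pass: str,
                       index_path: str, sql_query: str) -> str:
    """Render a minimal sphinx.conf with one SQL source and one index."""
    return (
        "source documents\n"
        "{\n"
        "    type = mysql\n"          # assumed database engine
        "    sql_host = localhost\n"  # assumed host
        f"    sql_user = {db_user}\n"
        f"    sql_pass = {db_pass}\n"
        f"    sql_db = {db_name}\n"
        f"    sql_query = {sql_query}\n"
        "}\n"
        "\n"
        "index documents\n"
        "{\n"
        "    source = documents\n"
        f"    path = {index_path}\n"
        "}\n"
    )


if __name__ == "__main__":
    print(render_sphinx_conf(
        db_name="research_documents",
        db_user="sphinx",
        db_pass="secret",
        index_path="/var/lib/sphinx/documents",
        # Sphinx expects the first selected column to be a unique integer ID.
        sql_query="SELECT DocumentID, title, content FROM Document",
    ))
```

Indexing the title and content columns together lets a single query match either field; weighting them differently would be a further configuration step.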

Figure 5. Search Result Interface

3.5. The Authors Relationship Visualization
The bibliography entries stored in the database can be used to visualize author relationships (see Figure 6). The database yields two types of author relationship, i.e., the co-author relationship and the citation relationship. The visualization is created using the JavaScript InfoVis Toolkit, which utilizes the HTML 5 Canvas.

Figure 6. The Authors Relationship Visualization (panels: Co-Author Graph and Citation Graph)
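
Before anything is drawn, the two relationship types have to be turned into graph edges. The sketch below derives weighted co-author and citation edges from in-memory data shaped like the DocumentAuthor and Reference tables; the function names, the data layout, and the decision to skip self-citations are assumptions, and the rendering with the JavaScript InfoVis Toolkit is not shown.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]


def coauthor_edges(document_authors: Dict[int, List[str]]) -> Counter:
    """Count, for each pair of authors, the documents they wrote together."""
    edges: Counter = Counter()
    for authors in document_authors.values():
        for pair in combinations(sorted(set(authors)), 2):
            edges[pair] += 1
    return edges


def citation_edges(document_authors: Dict[int, List[str]],
                   references: Iterable[Tuple[int, int]]) -> Counter:
    """Count directed (citing author, cited author) links.

    references holds (citing document ID, cited document ID) pairs.
    """
    edges: Counter = Counter()
    for citing, cited in references:
        for source in document_authors.get(citing, []):
            for target in document_authors.get(cited, []):
                if source != target:  # skip self-citations (a design choice)
                    edges[(source, target)] += 1
    return edges


if __name__ == "__main__":
    docs = {1: ["A", "B"], 2: ["B", "C"]}
    print(dict(coauthor_edges(docs)))
    print(dict(citation_edges(docs, [(2, 1)])))
```

The edge weights could drive line thickness or node size in the visualization.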

3.6. System Evaluation
Document processing consists of converting the PDF files into text files, extracting the bibliography entries, identifying the attributes, and storing the bibliography entries in the database. Processing a document can take from 10 seconds up to 3 hours, with an average of 1 minute 20 seconds. The varied duration is due to the large number of templates used for identifying bibliography entries.
Errors occur in attribute identification because of malformed entries. The errors include (i) wrong placement of the year or an unstated year; (ii) wrong placement of the publisher name; (iii) wrong format for internet addresses (not preceded by a protocol prefix such as http://, https://, or ftp://); and (iv) wrong format for author names.
Bibliography extraction and bibliography attribute identification in the documents are measured using equations (3) and (4). The measurement counts the number of bibliography entries successfully extracted by the system from each document and the number of bibliography entries actually contained in the documents. The extraction process extracts 94.63% of the bibliography entries successfully, and 98.92% of the extracted bibliography entries are extracted correctly by the system. In attribute identification, 91.54% of the bibliography entry attributes are identified correctly, and 99.84% of the bibliography entry attributes are successfully identified.
The relationships between bibliography entries and the documents in the collection are evaluated to measure how many bibliography entries the system links to the documents they refer to. The evaluation is performed on the 50 most referred authors' documents and calculated with equations (7) and (8). The results show that 90.31% of the bibliography entries are successfully linked to the documents, and 95.19% of the bibliography entries are linked to the correct documents.

4. Conclusion
This paper proposes a system that extracts bibliography entries from research documents automatically. The extracted bibliography entries can be used to create two visualizations, the co-author graph and the citation relationship graph.
The identification of the attributes of a bibliography entry depends on the extraction process. Errors during extraction affect the result of attribute identification, as our evaluation results show. Around 5% of the bibliography entries could not be extracted accurately. As a result, almost 9% of the attributes were not correctly identified.

Acknowledgement
This research was supported by the Computer Science Department, Bogor Agricultural University (IPB), Indonesia; and the Agency for the Assessment and Application of Technology (BPPT), Indonesia.

References
[1] Ding Y, Chowdhury GG, Foo S, Qian W. Bibliometric information retrieval system (BIRS): A web search interface utilizing bibliometric research results. Journal of The American Society for Information Science. 2000; 51(13): 1190-1204.
[2] Jacso P. As we may search-Comparison of major features of the Web of Science, Scopus, and Google Scholar citation-based and citation-enhanced databases. Current Science-Bangalore. 2005; 89(9): 1537.
[3] Alt FL, Kirsch RA. Citation searching and bibliographic coupling with remote on-line computer access. Journal of Research of the National Bureau of Standards - B. Mathematical Sciences. 1968; 72(1): 61-78.
[4] Klink S, Ley M, Rabbidge E, Reuther P, Walter B, Weber A. Browsing and visualizing digital bibliographic data. IEEE TCVG Symposium on Visualization. Konstanz: Eurographics Association. 2004.
[5] Stuart DG, Simpson F. Efficient literature searching: a core skill for the practice of evidence-based medicine. Intensive care medicine. 2003; 29(12): 2119-2127.
[6] Day MY, Tsai TH, Sung CL, Lee CW, Wu SH, Ong CS, Hsu WL. A Knowledge-based Approach to Citation Extraction. IRI-2005 IEEE International Conference. 2005.
[7] Garfield E. Citation analysis as a tool in journal evaluation. Science. 1972; 178(60): 471-479.
[8] Jewell M. ParaTools Reference Parsing Toolkit-Version 1.0 Released. D-lib Magazine. 2003; 9(2).
[9] Hetzner E. A simple method for citation metadata extraction using hidden markov models. JCDL '08 Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries. New York: ACM. 2008.
[10] Gupta D, Morris B, Catapano T, Sautter G. A new approach towards bibliographic reference identification, parsing and inline citation matching. IC3 of Communications in Computer and Information Science. 2009; 40: 93-102.
[11] Ohta M, Daiki A, Atsuhiro T, Jun A. CRF-based bibliography extraction from reference strings focusing on various token granularities. 10th IAPR International Workshop on Document Analysis Systems (DAS). 2012: 276-281.
[12] Staelin C, Elad M, Greig D, Shmueli O, Vans M. Biblio: automatic meta-data extraction. International Journal of Document Analysis and Recognition (IJDAR). 2007; 10(2): 113-126.
[13] Peng F, McCallum A. Accurate Information Extraction from Research Papers using Conditional Random Fields. Human Language Technology conference / North American chapter of the Association for Computational Linguistics annual meeting (HLT/NAACL) 2004. 2004: 329-336.
[14] Han H, Giles C, Manavoglu E, Zha H, Zhang Z, Fox E. Automatic Document Meta-data Extraction using Support Vector Machines. Proceedings of Joint Conference on Digital Libraries. 2003.
[15] Huang IA, Ho JM, Kao HY, Lin SH. Extracting citation metadata from online publication lists using BLAST. PAKDD of Lecture Notes in Computer Science. London: Springer. 2004; 3056: 539-548.
[16] Lin X, White HD, Buzydlowski J. Real-time author co-citation mapping for online searching. Information Processing and Management. 2003; 39: 689-706.
[17] Anegón FM, Quesada BV, Solana VH, Rodríguez ZC, Álvarez EC, Fernández M, José F. A new technique for building maps of large scientific domains based on the cocitation of classes and categories. Scientometrics. Budapest: Kluwer Academic Publisher. 2004; 61: 129-145.
[18] Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval: The Concepts and Technology Behind Search. New Jersey: Pearson Higher Education. 2011: 135.
[19] Ali A. Sphinx Search Beginner's Guide. Packt Publishing Ltd. 2011.