Indonesian Journal of Electrical Engineering and Computer Science
Vol. 2, No. 1, April 2016, pp. 205 ~ 214
DOI: 10.11591/ijeecs.v2.i1.pp205-214
Received December 29, 2015; Revised March 12, 2016; Accepted March 27, 2016
Information Retrieval: Textual Indexing Using an Oriented Object Database
Mohammed Erritali
TIAD Laboratory, Computer Sciences Department, Faculty of Sciences and Techniques, Sultan Moulay Slimane University, Beni-Mellal, BP: 523, Morocco
e-mail: m.erritali@usms.ma
Abstract
The growth over centuries in the volume of text data held in libraries, such as books and articles, has made it necessary to establish effective mechanisms for locating them. Early techniques such as abstraction, indexing and the use of classification categories marked the birth of a new field of research called "Information Retrieval". Information Retrieval (IR) can be defined as the task of defining models and systems whose purpose is to facilitate access to a set of documents in electronic form (a corpus), so that a user can find the documents relevant to him, that is to say, the contents that match his information needs. Most information retrieval models use a specific data structure to index a corpus, called an "inverted file" or "reverse index". The inverted file collects information on all terms of the corpus documents: the identifiers of the documents that contain each term, the frequency of each term in the documents of the corpus, and the positions of the occurrences of the word. In this paper we use an object-oriented database (db4o) instead of the inverted file; that is to say, instead of searching for a term in the inverted file, we search for it in the db4o database. The purpose of this work is to carry out a comparative study to see whether object-oriented databases can compete with the inverted index in terms of access speed and resource consumption on a large volume of data.
Keywords: Information Retrieval, indexation, oriented object database (db4o), inverted file
Copyright © 2016 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
Due to the rapid growth in the volume of electronically stored information, the major problem that arises is how to respond to a search query in a relevant manner from a set of unstructured documents in a database called the corpus. This research problem is known as Information Retrieval (IR). Information Retrieval can be defined as a set of techniques and tools dealing with access to information and with its presentation, organization and storage [1], [2]. The term "information retrieval" was first used by Calvin N. Mooers in 1948 in his thesis [3].
According to [18], an information retrieval system (IRS) is a set of computer programs that aims to select relevant information that meets users' needs expressed in the form of queries. Lancaster, cited in [19], notes that an IRS does not inform the user on the subject of his research; it simply reports the existence or non-existence of documents relating to his request.
From the above definitions we can deduce that a user translates his needs, in a structured way, into a query that he transmits to the information retrieval system. The main task of this system is to return to the user the maximum number of documents relevant to his need (and the minimum number of irrelevant documents). To do this, the information retrieval system connects the available information (the corpus documents) with the requirements of the user (the user query).
In the literature we find several representations of the information retrieval process [20]-[22], which show that the mapping between the information contained in a corpus, on the one hand, and the information needs of users, on the other hand, is done through two mechanisms: indexing and search. This operation is carried out through a process known as the U process, as shown in Figure 1.
ISSN: 2502-4752, IJEECS Vol. 2, No. 1, April 2016: 205 – 214
Figure 1. Process of Information Retrieval
2. Indexation for Information Retrieval
In information retrieval systems, the query and the documents in the corpus are difficult to use in their raw state. Techniques and models are therefore implemented to organize the documents and the query into an intermediate representation that reflects their content as closely as possible. These techniques describe the documents and the request by a set of descriptors. This process of representation is called the indexing process.
Indexing consists in analyzing the documents and the query to extract a set of descriptors [2], [5], [7], [10], [11].
The descriptor of a document or a query is a list of words or groups of significant terms for the corresponding textual unit, usually accompanied by a weight representing the degree to which they represent the content that they describe [12].
A. Indexation Approaches
Indexing is traditionally performed manually; that is to say, a human operator, usually a librarian or a domain expert, is responsible for characterizing the content of a document according to his knowledge. The analysis of the document is performed by a person, not a machine, which is very costly in time because, on the one hand, the indexer must read and understand a document before it can be properly indexed. On the other hand, this type of indexing is practically inapplicable to large corpora of texts [10], [12].
This approach has another drawback: it is subjective, since the choice of indexing terms depends on the indexer and his domain knowledge.
With the increase in the number of documents to be indexed, indexing tends to be automated [10].
Automatic indexing is a completely automated process that is responsible for extracting the words that characterize the document [14]. The advantage of this approach lies in its ability to process text faster than the previous approach; it is therefore particularly suitable for large corpora [2], [5], [12].
One of the problems of automatic indexing lies in the semantics of the words, because the automatic indexing process counts the number of occurrences of a word without taking into account the meaning of each occurrence. For example, the word "orange" in French means both a color and a fruit. If the two meanings of the word are used in a single text, the word will be counted twice even though the two occurrences do not have the same meaning.
Another disadvantage of automatic indexing concerns compound words. For example, consider the compound word "pomme de terre" (potato) in French. It will be indexed under "pomme" and "terre" but not under "pomme de terre", which carries its actual meaning.
There is an intermediate method of indexing: semi-automatic indexing, or controlled indexing. In this type of indexing, a first automatic process is used to extract the terms of the document. However, the final choice rests with the specialist in the field, who establishes the relationships between words and selects the significant terms. In this paper we are particularly interested in the automatic indexing approach.
B. Indexing Languages
The vocabulary of an indexing language is formed from the set of indexing terms. This section presents the two main types of indexing language [10]-[12]: free language and controlled language.
A controlled indexing language is constructed from a set of pre-defined terms, usually organized in a thesaurus. When a document is analyzed, we keep only the words belonging to this thesaurus.
The free language is a language close to natural language (NL). In this language, the descriptors are automatically extracted from the documents or from the user request. This type of indexing is used especially by search engines that perform fully automatic indexing, such as Google.
C. Automatic Indexing Process
Automatic indexing is a set of automated processes applied to a document: segmentation, removal of empty words (stop words), stemming of words (reduction to their roots), and weighting.
Figure 2. Phases of automatic indexation
1) The Tokenization (Segmentation)
Tokenization, also called segmentation, consists in dividing the text into elementary tokens. This operation "locates" the strings surrounded by separators (white space, punctuation) and identifies them as words.
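The segmentation step can be sketched as follows. This is a minimal illustration, not the code of the system studied here: the set of separators and the lowercasing of tokens are assumptions made for the example, and the name tokenize is hypothetical.

```python
import re

# Assumed separator set: white space plus common punctuation characters.
SEPARATORS = re.compile(r"""[\s.,;:!?"'()\[\]{}«»/-]+""")

def tokenize(text):
    """Split text on runs of separators and return the non-empty tokens in lowercase."""
    return [token.lower() for token in SEPARATORS.split(text) if token]

print(tokenize("La recherche d'information, c'est l'indexation !"))
# ['la', 'recherche', 'd', 'information', 'c', 'est', 'l', 'indexation']
```

Note that the apostrophe is treated as a separator here, so elided French articles such as "d" and "l" become tokens of their own; they are typical candidates for the stop-word phase.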
2) Elimination of Empty Words (Stop Words)
Stop words (empty words) are words such as prepositions and conjunctions. Eliminating empty words reduces the size of the index, which saves storage space; moreover, not processing empty words reduces the execution time of an information retrieval system [8]. Since reducing the number of terms improves performance, some systems also treat certain verbs, adjectives and adverbs as empty words. There are two techniques for filtering out empty words:
• The use of a predefined list of stop words (also called an anti-dictionary or stop-list).
• Counting the number of occurrences of each word in the document collection, then removing the words whose frequency exceeds a certain threshold; these words are treated as empty words.
In this work we have chosen to use the first technique, the anti-dictionary. In this phase the treatment is simple: if a term of the corpus appears in the anti-dictionary, it is not considered as an index term.
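The anti-dictionary technique reduces to a membership test. The sketch below assumes a small hypothetical French stop-list held in memory; in practice the list would be loaded from a file, and the names STOP_WORDS and remove_stop_words are illustrative only.

```python
# Hypothetical anti-dictionary (stop-list); a real one would be much larger.
STOP_WORDS = {"le", "la", "les", "de", "des", "et", "ou", "dans", "pour"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that do not appear in the anti-dictionary."""
    return [token for token in tokens if token not in stop_words]

tokens = ["la", "recherche", "de", "information", "dans", "les", "documents"]
print(remove_stop_words(tokens))
# ['recherche', 'information', 'documents']
```

Storing the stop-list in a set keeps each lookup at constant expected cost, which matters when every token of a large corpus is tested.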
3) Normalization of Index Terms
Normalization is a process that groups the morphological variants of words under a single base form. Its goal is to keep only representative word forms in the indexing language, which offers a considerable saving of storage memory and a more effective search. Normalization is based on one of two procedures: stemming or lemmatization [13].
a) Lemmatization
Lemmatization is used to group words of the same grammatical category and transform them into their canonical form, called the lemma (e.g. the different forms of a verb are transformed into the infinitive) [7], [10]. This technique relies on lemmatization software and resources, namely TreeTagger, WinBrill and LEFFF. Some lemmatizers can handle multiple languages (e.g. TreeTagger handles English and German).
b) Stemming
Stemming transforms a word into its root. A stemmer seeks the root of a word based on its form and on the target language. For example, in French, "écologie, écologiste, écologique" are all stemmed to one word: "écologie" [8], [13].
In the literature there are several algorithms used for stemming, such as the Lovins algorithm [23], the Paice/Husk algorithm [24] and the Porter algorithm [25]. Snowball [26] is another stemming tool, created by Martin Porter (the creator of the Porter algorithm). There are Snowball stemmers for various languages (French, English, Spanish, …).
Experiments have shown that stemming and lemmatization significantly increase search performance for morphologically rich languages such as French and Italian [13].
4) Weighting of Terms
To measure the importance of a word in a document, indexing uses the concept of weight. Weighting assigns a weight to the indexing and search terms. This weight specifies the relative importance of the words represented in the documents with respect to those identified in the request. Weighting answers two questions: do all terms have the same importance, and how should a weight be assigned to the extracted terms?
In general, the weighting formulas used are based on the combination of a local weighting factor, quantifying the local representativeness of the word in the document [4], and a global weighting factor, quantifying the overall representativeness of the term with respect to the collection of documents [2], [10].
Local Weighting
Local weighting is used to measure the local representativeness of a term. It takes into account the local information of the term in relation to a given document, and indicates the importance of the term in this document. This weighting is generally measured by the frequency of the term t in the document d considered (term frequency, denoted tf).
Global Weighting
Global weighting is based on the idea that a term does not distinguish documents from each other during the search if it is distributed uniformly across all documents in the collection; such a term has no discriminatory power. Conversely, the terms that appear in few documents are discriminating, and higher weights are assigned to them.
This weighting is expressed by the inverse document frequency (idf) of a term t in the collection. It is generally defined by the following formula:
$$ \mathrm{idf} = \log\left(\frac{N}{n}\right) $$
Where N is the number of documents in the collection and n is the number of documents indexed by the term t.
Salton [6] has defined the tf * idf weighting formula by:
$$ \mathrm{tf} \ast \mathrm{idf} = \mathrm{tf} \ast \log\left(\frac{N}{n}\right) $$
The tf * idf measure is a good approximation of the importance of a term in collections composed of documents of homogeneous size. However, in collections containing documents of varying sizes, words in longer documents appear more frequently and therefore receive higher weights than those in shorter documents. Longer documents are thus more likely to be selected [2], [13], [15].
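The tf * idf weighting described above can be sketched in a few lines. The example assumes the natural logarithm and raw term counts for tf, since the formula does not fix the base of the logarithm; the function name tf_idf and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)  # N in the formula
    # n in the formula: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw frequency of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = [
    ["recherche", "information", "recherche"],
    ["information", "base"],
    ["structure", "base", "information"],
]
weights = tf_idf(docs)
print(weights[0]["information"])  # 0.0, the term occurs in every document
```

The term "information" appears in all three documents, so its idf is log(3/3) = 0 and it carries no discriminatory power, whereas "recherche" appears twice in a single document and receives the weight 2 * log(3).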
3. Models of Information Retrieval
An information retrieval system is based on a theoretical model. This model allows us to interpret the notion of relevance of a document with respect to a query in a formal setting. It therefore provides a theoretical framework for modeling this measure of relevance [7].
Figure 3. The various models of information retrieval
In the literature, many models of information retrieval have been proposed (Figure 3). They are divided into three major categories: set-theoretic models, vector models and probabilistic models.
The Boolean model was the first model to be used in information retrieval, because of its simplicity. However, the lack of weighting in this model limits its uses. Thus, extended versions of this model that include weighting have been proposed, for example through the use of fuzzy set theory [7], [10]. The vector model is probably the most widely used in information retrieval. Its popularity is due to its ability to rank the documents found and to its good performance. Probabilistic models are based on probability theory; these models began to show good performance in the 1990s [10].
We present in the following the principles of three models: the Boolean model, the vector model and the probabilistic model.
a) Boolean Model
The Boolean model was introduced in 1983 by Salton and McGill [6]. It is the oldest model in the field of information retrieval. It emerged thanks to the simplicity and speed of its implementation. The query interface of most search engines (Google, AltaVista) is based on the principles of this model. The Boolean model is based on set theory. The query is represented as a logical expression in which the descriptors are combined using the Boolean operators NOT (¬), AND (∧) and OR (∨). Documents satisfying the logical expression representing the query are considered relevant [2], [7].
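$$ \mathrm{sim}(d, t_i) = \begin{cases} 1 & \text{if } t_i \in d \\ 0 & \text{otherwise} \end{cases} $$
$$ \mathrm{sim}(d, t_i \wedge t_j) = \begin{cases} 1 & \text{if } t_i \in d \wedge t_j \in d \\ 0 & \text{otherwise} \end{cases} $$
$$ \mathrm{sim}(d, t_i \vee t_j) = \begin{cases} 1 & \text{if } t_i \in d \vee t_j \in d \\ 0 & \text{otherwise} \end{cases} $$
$$ \mathrm{sim}(d, \neg t_i) = \begin{cases} 1 & \text{if } t_i \notin d \\ 0 & \text{otherwise} \end{cases} $$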
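Boolean matching can be illustrated with a small recursive evaluator. This is a sketch under assumptions: the nested-tuple encoding of the query and the name matches are hypothetical conveniences for the example, and each document is reduced to the set of its index terms.

```python
def matches(doc_terms, query):
    """Evaluate a Boolean query against the set of terms of one document.

    query is either a term (str) or a tuple:
    ("NOT", q), ("AND", q1, q2) or ("OR", q1, q2).
    """
    if isinstance(query, str):
        return query in doc_terms
    operator = query[0]
    if operator == "NOT":
        return not matches(doc_terms, query[1])
    if operator == "AND":
        return matches(doc_terms, query[1]) and matches(doc_terms, query[2])
    if operator == "OR":
        return matches(doc_terms, query[1]) or matches(doc_terms, query[2])
    raise ValueError(f"unknown operator: {operator}")

docs = {
    1: {"recherche", "information", "base"},
    2: {"information", "structure"},
}
query = ("AND", "information", ("NOT", "base"))
relevant = [doc_id for doc_id, terms in docs.items() if matches(terms, query)]
print(relevant)  # [2]
```

The result is a plain list of matching documents with no score attached, which makes the strictness of the model visible: a document either satisfies the expression or it does not.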
Although this model is simple to implement, it has major drawbacks [10]:
• matching is strict and only allows documents to be classified into two categories, relevant and non-relevant documents, within which the documents cannot be ranked;
• all terms of a document or a query are equal in importance (weighted at 0 or 1), which is not the case in reality;
• Boolean expressions are not accessible to a wide audience, and confusion exists because of the difference between the "meaning" of the logical operators AND and OR and their connotations in natural language.
To overcome these drawbacks, the extended Boolean model [2], [7], [10] has been proposed. It takes into account the importance of terms in the representations of the document and of the query by assigning weights to each word of the document and of the query.
b) The Vector Model
The vector model was proposed by G. Salton [6]. It is based on the mathematical foundations of vector spaces. In this model, documents and the query are represented by vectors in the indexing space, i.e. the coordinates of a document vector represent the weights of its words. Formally, a document d_i is represented by a vector of dimension n, which is the number of indexing terms of the collection of documents [2], [10].
$$ d_i = (w_{i1}, w_{i2}, \ldots, w_{in}), \quad i = 1, 2, 3, \ldots, m $$
Where w_{ij} is the weight of term t_j in the document d_i, m is the number of documents in the collection, and n is the number of index terms.
A query Q is represented by a vector of keywords defined in the same vector space as the documents.
$$ Q = (w_{q1}, w_{q2}, \ldots, w_{qn}) $$
Where w_{qj} is the weight of term t_j in the query Q.
The relevance of the document d_i for a query Q is measured as the degree of correlation of the corresponding vectors. This correlation can be expressed by one of the following measures [2], [5], [10], [16]:
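The scalar product [10]:
$$ \mathrm{sim}(Q, d_i) = \sum_{j=1}^{n} w_{qj} \ast w_{ij} \qquad (1) $$
The cosine measure [10], [14]:
$$ \mathrm{sim}(Q, d_i) = \frac{\sum_{j=1}^{n} w_{qj} \ast w_{ij}}{\sqrt{\sum_{j=1}^{n} w_{qj}^2 \ast \sum_{j=1}^{n} w_{ij}^2}} \qquad (2) $$
The measure of Dice [10]:
$$ \mathrm{sim}(Q, d_i) = \frac{2 \ast \sum_{j=1}^{n} w_{qj} \ast w_{ij}}{\sum_{j=1}^{n} w_{qj}^2 + \sum_{j=1}^{n} w_{ij}^2} \qquad (3) $$
The measure of Jaccard [10]:
$$ \mathrm{sim}(Q, d_i) = \frac{\sum_{j=1}^{n} w_{qj} \ast w_{ij}}{\sum_{j=1}^{n} w_{qj}^2 + \sum_{j=1}^{n} w_{ij}^2 - \sum_{j=1}^{n} w_{qj} \ast w_{ij}} \qquad (4) $$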
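The superposition (overlap) coefficient [10]:
$$ \mathrm{sim}(Q, d_i) = \frac{\sum_{j=1}^{n} w_{qj} \ast w_{ij}}{\min\left(\sum_{j=1}^{n} w_{qj}^2, \sum_{j=1}^{n} w_{ij}^2\right)} \qquad (5) $$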
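The five correlation measures of the vector model can be written directly from their usual definitions. The sketch below assumes that the query and the document are given as weight vectors of the same length n and that neither is the zero vector; the function names and the example vectors are hypothetical.

```python
import math

def dot(q, d):
    """Scalar product of two weight vectors of equal length."""
    return sum(wq * wd for wq, wd in zip(q, d))

def cosine(q, d):
    return dot(q, d) / math.sqrt(dot(q, q) * dot(d, d))

def dice(q, d):
    return 2 * dot(q, d) / (dot(q, q) + dot(d, d))

def jaccard(q, d):
    return dot(q, d) / (dot(q, q) + dot(d, d) - dot(q, d))

def overlap(q, d):
    return dot(q, d) / min(dot(q, q), dot(d, d))

q = [1.0, 0.0, 1.0]  # query weights over three index terms
d = [2.0, 1.0, 0.0]  # document weights over the same terms
print(dot(q, d), jaccard(q, d), overlap(q, d))  # 2.0 0.4 1.0
```

Here the scalar product is 2, the squared norms are 2 for the query and 5 for the document, so the Jaccard measure is 2 / (2 + 5 - 2) = 0.4 and the overlap coefficient is 2 / min(2, 5) = 1.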
c) The Probabilistic Model
The probabilistic model is based on decision theory. The aim is to compute the probability of relevance of a document D with respect to a query Q. The first probabilistic model was proposed by Maron and Kuhns [16].
In the probabilistic model, the documents and the query are represented by vectors in the indexing space, as in the vector model. In these vectors the weights of the index terms are binary. For a query q, all available documents are divided into two subsets: the set R of relevant documents and the set NR of irrelevant documents. For each document d, two probabilities are associated [2], [5]:
• P(R / d): the probability that the document d is relevant to the query q.
• P(NR / d): the probability that the document d is not relevant to the query q.
The similarity between the document d and the query q is calculated as a function of these two probabilities as follows:
$$ \mathrm{sim}(q, d) = \frac{P(R / d)}{P(NR / d)} \qquad (6) $$
4. Description of the Proposed Solution and Experimental Results
After discussing the theoretical aspects of information retrieval, this section is devoted to the description of our system and to the comparison of two techniques for storing the text index.
Our information retrieval system is based on the vector model and provides the following two features:
Indexation:
In the indexing phase (shown in Figure 4), the first operation is the removal of separators (common punctuation characters) according to a file that contains a set of delimiters. After this segmentation phase, the corpus passes to the second phase, which is the removal of empty words. The treatment is simple: if a term appears in the anti-dictionary, it is not considered as an index term. Finally, we pass to the important step of linguistic normalization, in which we use a dictionary of roots to replace a term with its lemma if the word appears in the dictionary of roots. After this phase, the result is the index, which will be recorded in an inverted file or in an object database.
Figure 4. Phases of indexing of the proposed system
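The linguistic normalization step of the indexing phase amounts to a dictionary lookup with a fallback. The sketch below is an illustration under assumptions: the content of the dictionary of roots is invented for the example, and the names ROOTS and normalize are hypothetical.

```python
# Hypothetical dictionary of roots: surface form -> lemma.
ROOTS = {
    "recherches": "recherche",
    "informations": "information",
    "structurée": "structure",
}

def normalize(tokens, roots=ROOTS):
    """Replace each token by its lemma when it appears in the dictionary of roots."""
    return [roots.get(token, token) for token in tokens]

print(normalize(["recherches", "informations", "base"]))
# ['recherche', 'information', 'base']
```

A word that is absent from the dictionary, such as "base" here, is kept unchanged, which matches the rule that the replacement only happens if the word appears in the dictionary of roots.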
When the db4o object database is used, its structure is as shown in Table 1.
Table 1. Example of the Structure of the Database
Terms        | Document 1               | Document 2
             | Id | frq | tabPos       | Id | frq | tabPos
Recherche    | 1  | 22  | 3, 10, …     | 2  | 22  | 4, 5, …
Information  | 1  | 5   | 0, 2, 19,    | 2  | 9   | 3, 1, …
Base         | 1  | 1   | 4, 5, …      | 2  | 0   | NULL
structure    | 1  | 0   | NULL         | 2  | 11  | 0, 9, …
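The structure of Table 1 (for each term, one entry per document with its identifier, frequency and table of positions) can be mirrored by a small object model. The sketch below uses plain Python dataclasses held in a dictionary as a stand-in for stored objects; it does not use the db4o API, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Posting:
    doc_id: int                                  # "Id" column
    frq: int = 0                                 # "frq" column
    tab_pos: list = field(default_factory=list)  # "tabPos" column

@dataclass
class TermEntry:
    term: str
    postings: dict = field(default_factory=dict)  # doc_id -> Posting

def build_index(docs):
    """docs: {doc_id: list of index terms}. Returns {term: TermEntry}."""
    index = {}
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            entry = index.setdefault(term, TermEntry(term))
            posting = entry.postings.setdefault(doc_id, Posting(doc_id))
            posting.frq += 1
            posting.tab_pos.append(position)
    return index

docs = {1: ["information", "base", "information"], 2: ["structure", "information"]}
index = build_index(docs)
print(index["information"].postings[1])
# Posting(doc_id=1, frq=2, tab_pos=[0, 2])
```

One difference from Table 1 is a design choice of this sketch: a document that does not contain a term simply has no posting for it, instead of an entry with frequency 0 and a NULL position table.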
Figure 5 shows the difference in indexing time between the inverted file (FI) and the db4o object database. We note that indexing using the database is faster than indexing using an inverted file.
Figure 5. Comparison of indexing time in db4o and the inverted file
Search:
Search is the second feature of our system. It maps the representation of the collection of documents (the index), which is stored in the database (db4o), to the representation of the user's needs (the indexed request), in order to return a set of documents relevant to the query.
The treatment associated with search goes through two stages:
• Indexing of the request in the same way as the documents in the corpus;
• Matching the representation of the documents with that of the query using the similarity measure (tf * idf).
The search process using a database (db4o) is illustrated in Figure 6.
Figure 6. The process of search in a database (db4o)
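The matching stage can be sketched as a ranking of documents by accumulated tf * idf weight. This is one possible reading of the similarity measure, stated as an assumption: each query term counts with weight 1, the score of a document is the sum of the tf * idf weights of the query terms it contains, and the index is reduced to a dictionary of frequencies. The name search and the sample frequencies are illustrative.

```python
import math

def search(index, query_terms, n_docs):
    """index: {term: {doc_id: frq}}. Returns doc ids sorted by decreasing score."""
    scores = {}
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue  # the term indexes no document
        idf = math.log(n_docs / len(postings))
        for doc_id, frq in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + frq * idf
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Frequencies chosen to resemble a two-document index.
index = {"recherche": {1: 22, 2: 22}, "base": {1: 1}, "structure": {2: 11}}
print(search(index, ["base", "structure"], n_docs=2))  # [2, 1]
```

Document 2 is ranked first because "structure" occurs 11 times in it, against a single occurrence of "base" in document 1; a term such as "recherche" that occurs in both documents has an idf of 0 and does not affect the ranking.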
Figure 7 shows the difference in search time between the database and the inverted file:
Figure 7. Comparison of search time in db4o and the inverted file (FI) according to the size of the corpus
Figure 7 shows that search in the db4o object database is very slow compared to search in an inverted file.
5. Conclusion
In this paper we presented the main steps of the information retrieval process, as well as the basic models of information retrieval. The paper focused mainly on the study and evaluation of the performance of two approaches to storing the index: the db4o object database and the inverted file. The comparisons we performed show that db4o is fast for indexing, but that search in an inverted file is faster than search in the database.
In our future work, we will introduce ontology-based semantics into our system to turn it into a semantic search engine.
References
[1] Ricardo BY, Berthier RN. Modern information retrieval, ACM (Association for Computing Machinery).
[2] Baziz M. Indexation conceptuelle guidée par ontologie pour la recherche d'information (Doctoral dissertation, Toulouse 3). 2005.
[3] Mooers CN. Application of random codes to the gathering of statistical information (Doctoral dissertation, Massachusetts Institute of Technology). 1948.
[4] Karbasi S. Pondération des termes en Recherche d'Information (Doctoral dissertation, Toulouse 3).
[5] Harrathi F. Extraction de concepts et de relations entre concepts à partir des documents multilingues: approche statistique et ontologique. 2009.
[6] Salton G. A comparison between manual and automatic indexing methods. American Documentation. 1969; 20(1): 61-71.
[7] Mallak I. De nouveaux facteurs pour l'exploitation de la sémantique d'un texte en Recherche d'Information (Doctoral dissertation, Université Paul Sabatier-Toulouse III). 2011.
[8] Aouicha MB. Une approche algébrique pour la recherche d'information structurée (Doctoral dissertation). 2009.
[9] Barry CL. User-defined relevance criteria: an exploratory study. JASIS. 1994; 45(3): 149-159.
[10] Boubekeur-Amirouche F. Contribution à la définition de modèles de recherche d'information flexibles basés sur les CP-Nets (Doctoral dissertation, Université de Toulouse, Université Toulouse III-Paul Sabatier). 2008.
[11] Roussey C. Une méthode d'indexation sémantique adaptée aux corpus multilingues. Institut National des Sciences Appliquées de Lyon, Lyon, Ecole Doctorale Informatique et Information pour la Société. 2001.
[12] Azzoug W. Contribution à la définition d'une approche d'indexation sémantique de documents textuels. 2014.
[13] Porter MF. An algorithm for suffix stripping. Program: electronic library and information systems. 1980; 14(3): 130-137.
[Figure 7 plot. Axes: time in seconds, size of the corpus. Series: Recherche BD4O (search in db4o), Recherche FI (search in the inverted file).]
[14] Buckley C, Singhal A, Mitra M & Salton G. New retrieval approaches using SMART: TREC 4. In Proceedings of the Fourth Text REtrieval Conference (TREC-4). 1995: 25-48.
[15] Brini AH. Un modèle de recherche d'information basé sur les réseaux possibilistes (Doctoral dissertation, Toulouse 3). 2005.
[16] Maron ME & Kuhns JL. On relevance, probabilistic indexing and information retrieval. Journal of the ACM (JACM). 1960; 7(3): 216-244.
[17] Agrawal R, Imieliński T & Swami A. Mining association rules between sets of items in large databases. In ACM SIGMOD Record. ACM. 1993; 22(2): 207-216.
[18] Tebri H. Formalisation et spécification d'un système de filtrage incrémental d'information. Thèse de doctorat de l'université Paul Sabatier, Toulouse. 2004.
[19] V Rijsbergen CJ. Information Retrieval. Department of Computing Science University of Glasgow.
[20] Iadh O. Un modèle d'indexation relationnel pour les graphes conceptuels fondé sur une interprétation logique, Thèse pour obtenir le grade de Docteur de l'Université Joseph Fourier. 1992.
[21] Piwowarski B, Denoyer L, Gallinari P. Un modèle pour la recherche d'information sur des documents structurés. 6es Journées internationales d'Analyse statistique des Données Textuelles. LIP6, PARIS – France. 2002.
[22] Denos N. Modélisation de la pertinence en recherche d'information: modèle conceptuel, formalisation et application. Thèse pour obtenir le grade de Docteur de l'Université Joseph Fourier-Grenoble I. 1997.
[23] http://www.comp.lancs.ac.uk/computing/research/stemming/Links/lovins.htm
[24] http://www.comp.lancs.ac.uk/computing/research/stemming/Links/paice.htm
[25] http://tartarus.org/martin/PorterStemmer/
[26] http://snowball.tartarus.org/