TELKOM
NIKA
, Vol.11, No
.3, March 2
0
1
3
, pp. 1497 ~ 1505
ISSN: 2302-4
046
1497
Re
cei
v
ed O
c
t
ober 2
3
, 201
2; Revi
se
d Ja
nuary 18, 20
1
3
; Acce
pted Janua
ry 3
1
, 20
13
Multi-pattern Matching Methods Based on Numerical
Computation
Lu Jun*
1,2
, Zhang Zhuo
1,2
, Mo Juan
1,2
, Lv
Xingfeng
1
1
Colle
ge of Co
mputer Scie
nc
e and T
e
chno
l
o
g
y
, Hei
l
on
gji
a
ng Un
iversit
y
, Harbi
n
, Chi
n
a
2
Ke
y
L
abor
ator
y of Data
bas
e and Par
a
ll
el C
o
mputi
ng of He
ilon
g
j
i
an
g Provi
n
ce, Harb
in, C
h
in
a
*Corres
p
o
ndi
n
g
author, e-ma
i
l
: luju
n11
1@
ya
hoo.com.cn
A
b
st
r
a
ct
Multi-pattern
match
i
n
g
met
h
ods bas
ed on
nu
me
ric
a
l co
mputatio
n are a
d
vanc
ed in thi
s
paper.
F
i
rstly it advan
ced the mu
ltipl
e
patterns mat
c
hin
g
al
gor
ith
m
based on ad
d
ed infor
m
at
i
on.
In
the process
of
accu
mu
latin
g
o
f
informatio
n
, the se
lect
meth
od of byt
e
-acc
umul
ate o
perat
ion w
i
l
l
affect the co
llis
ion
od
ds
,
w
h
ich me
ans that the met
h
o
d
s or bytes involve
d
in the differ
ent matchin
g
steps shoul
d have gre
a
te
r
differenc
es as muc
h
as possi
ble. In additi
on
, it c
an use bal
ance
d
bin
a
ry tree to ma
nag
e ind
e
x to reduc
e
the avera
ge se
archi
ng times, and us
e t
he ch
aracteristics of a give
n pattern
set by setting the coll
isio
n fie
l
d
to eli
m
inate
col
lisio
n further. I
n
ord
e
r to re
du
ce the c
o
ll
isio
n
odds
in th
e i
n
i
t
ial step, th
e i
n
formatio
n
spl
i
ci
n
g
meth
od is a
d
va
nced, w
h
ich h
a
s
greater val
u
e
space
than
ad
ded i
n
for
m
atio
n met
hod,
thus
greatly red
u
ci
n
g
the initia
l coll
isi
on od
ds. Multiple patter
n
s matchin
g
metho
d
s base
d
on n
u
merica
l co
mp
utation fits for larg
e
m
u
lti-pattern matching.
Ke
y
w
ords
: Single pattern mat
c
hin
g
, Multi-pat
tern matchin
g
, Coll
isio
n
Copy
right
©
2013 Un
ive
r
sita
s Ah
mad
Dah
l
an
. All rig
h
t
s r
ese
rved
.
1. Introduc
tion
Pattern mat
c
hing i
s
an
essential te
ch
n
o
logy in
com
puter
kn
owle
dge. It is
wid
e
ly applie
d
in text editing, literature
searchin
g, natural la
n
gua
g
e
identifying, biology, ima
ge manip
u
lati
on,
informatio
n secu
rity, network et
c. The
r
e are
two types of pattern mat
c
hing
: single pattern
matc
hing and multiple patterns
matching. In s
i
mp
le single pattern matchin
g
alg
o
rithm, a pattern
matchin
g
pro
c
ed
ure is loo
k
ed a
s
key word (patte
rn
string) searchi
ng. It divides the text with the
length n i
n
to
n-m
+
1
sub
-
string
with the l
ength
m and detect
s
whet
her ea
ch sub
-
stri
ng
m
a
tch
e
s
to pattern stri
ng with lengt
h m or not. C
l
assical singl
e pattern matchin
g
algorith
m
include
s KMP
[1] algorithm
and BM [2] algorithm.
Other al
gorit
hm is mo
stly improved
o
n
these
cla
s
sical
algorith
m
s [3
-5]. KMP algorithm an
alyze
s
the di
st
ri
bution chara
c
teri
stic
of p
a
ttern stri
ng
and
make
s u
s
e of this in pattern matching to
advance
mat
c
hin
g
efficien
cy. In BM algorithm, pattern
string
move
s from left to
right, while
chara
c
te
r
is
compa
r
ed f
r
o
m
right to lef
t. When it lo
se
s
matchin
g
, the predefine
d
excursion fun
c
tion adopts th
e max value to decid
e right
shift value. So
it realize
s
jum
p
ing a
s
far as possibl
e a
n
d
redu
ce
s the times of matching.
Related work. Multiple pattern matchi
ng is
an exte
nsio
n of single pattern matchin
g
,
inclu
d
ing A
C
[6] algorith
m
ba
sed
on
automata,
Wu
-Man
ber [7] algo
rithm, multiple p
a
ttern
matchin
g
alg
o
rithm ba
sed on sequ
ential
binary
tree [8], fast matching algorithm
based on cro
ss
data lin
k tabl
e[9], etc. Moreover,
the
r
e
are
other i
m
p
r
oved
algo
rith
ms [10,1
1
] de
rived fro
m
th
ese
algorith
m
s. A sea
r
ching tre
e
is co
nstruct
ed ba
sed
on
pattern set in AC algo
rithm. The size of the
sea
r
ching tre
e
is related t
o
the size of
c
haracte
r se
t, so the biggish mem
o
ry
is needed.
Wu-
Manbe
r algo
rithm is multiple pattern matching algo
rit
h
m and it extend
s BM algorithm in sin
g
le
pattern mat
c
hing. This al
g
o
rithm ad
opts bad ch
ara
c
t
e
r shifting m
e
cha
n
ism of B
M
algorithm
and
make
s
use o
f
block
cha
r
a
c
ter to
exten
d
efficien
cy o
f
bad
cha
r
a
c
ter
shifting. It also
uses ha
sh
Table to sele
ct pattern string match
ed in ma
tchin
g
stage an
d re
duce more
computing. Th
e
averag
e ca
pa
bility of Wu-M
anbe
r algo
rith
m is the be
st in appli
c
ation.
2. Limitation
of Existen
t
Patte
rn Matc
hing Algorithms
Existent pattern m
a
tchin
g
algorithm
s
restri
ct
a patt
e
rn a
s
“ch
a
ra
cter
strin
g
”. T
he whole
pattern set is not been disposed as a body in matc
hing pro
c
ed
ure
.
It
takes at most out part
of
comm
on prefixes or po
stfixes to re
duce the times of m
a
tchin
g
.
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NIKA
Vol. 11, No
. 3, March 20
13 : 1497 – 1
505
1498
Some pattern
matchin
g
al
gorithm
s hav
e req
u
ire
m
en
ts for lang
ua
ge. The
s
e al
gorithm
s
are fit for English and unfit for Chin
ese. They are
aim
ed at English
.
In
the case
of thousan
ds
of
pattern
s, the advan
ce of efficien
cy of th
ese alg
o
rithm
s
is notable. But there are
some que
sti
ons
in dealing
with multiple pat
terns m
a
tchi
n
g
in Chin
ese.
The efficien
cy of BM algorithm is preferably
in single
pattern matching. Some improve
d
algorith
m
s
de
rive from BM
. These alg
o
rithms a
dopt
comp
ared m
anne
r fro
m
b
a
ck to fro
n
t a
nd
gene
rate jum
p
ing Table ba
sed on curre
n
t matched c
hara
c
te
rs. All these redu
ce
matching times.
When we anatomize these algorithms
based on
char
acter jum
p
i
ng, we will fi
nd that matching
efficien
cy is
not esse
ntial
l
y advanced
by analy
z
in
g ch
aracte
r
of pattern
st
ring
s to
reali
z
e
jumping, be
cause it needs time to judge jump
ing. For example,
Horspool ad
vance
d
Boyer-
Moore-Ho
rsp
ool algo
rithm.
This algo
rith
m is an
improved algo
rith
m of BM. We call it BMH [12
]
algorith
m
. In this algo
rithm,
the chara
c
te
r pointed by
text pointer is firstly compa
r
ed with the las
t
cha
r
a
c
ter
of pattern. If it is eq
uality, then co
mpa
r
e t
he re
st m-1
cha
r
a
c
ters. No matter whi
c
h
character in the text result
s in failed matching, pa
ttern will slide to
the right according to the l
a
st
cha
r
a
c
ter in the text
that correspon
ds to patte
rn. Figure 1 sh
ows the matchin
g
pro
c
e
ss in BMH
algorith
m
.
Figure 1. Matchin
g
pro
c
e
s
s in BMH alg
o
rithm
Figure 1
sho
w
s th
at if pattern
strin
g
“se
a
rch”
ca
n jum
p
matching, t
he text T[6] should
be
sea
r
ched i
n
p
a
ttern st
ring
at first matchi
ng. It is obv
io
us that d[r]
=2
, so the patte
rn sli
d
e
s
rig
h
t 2
cha
r
a
c
ters to
the right. In fact, efficien
cy is
not advan
ced, be
ca
use
it needs 2 times to comp
are
whe
n
ch
aract
e
r ‘r’ i
s
searched in p
a
ttern. It is not
be
tter to com
p
a
r
e the la
st ch
ara
c
ter ‘h’ i
n
th
e
pattern an
d each cha
r
a
c
ter in the text step by step.
It
only saves o
ne time in the last succe
s
sful
pattern mat
c
h
i
ng. Apparent
ly, if jumping algorith
m
is u
s
ed in p
a
ralle
l multiple pattern
s matchin
g
,
the process
will be much
more
compl
e
x. It is
not favorabl
e to reali
z
e parallel operation.
Rep
r
e
s
entati
v
e multiple pattern algo
rith
m is Wu-Ma
n
ber. Wu
-Man
ber algo
rithm is based
on BM, but i
t
uses cha
r
a
c
ter blo
c
k wi
th B lengt
h (2 or 3) to re
duce the po
ssi
bility of p
a
rt
matchin
g
. Three table
s
are neede
d in this al
gorith
m
: SHIFT table,
HASH Table and PRE
F
IX
table. SHIFT[
h] ca
n
store
the num
be
r of ch
ar
a
c
te
rs jum
ped i
n
se
curity, HA
SH[h] ca
n
store
pattern ch
ain
Table with h compute
d
by the
last
B chara
c
ters’
hash value. PREFIX Ta
ble
filtrates the
p
a
ttern that h
a
s
the
sam
e
h
a
sh val
ue
b
u
t different p
r
efi
x
with the text, so the
num
b
e
r
of compa
r
e
d
patterns
will
decrea
s
e fu
rther. But wh
en pattern
set becom
es
bigge
r, jumpi
n
g
cha
r
a
c
ter n
u
m
ber
will be
closer to 1. In this case, correlative
judging alg
o
rithm will lo
se
efficien
cy an
d algo
rithm b
e
com
e
s
more
compl
e
x. Fu
rther m
o
re, Wu
-Man
ber
a
l
gorithm
doe
s not
make u
s
e of
more u
s
eful i
n
formatio
n in pattern
set.
3. Multi-Pattern Matchin
g
Base
d on Adde
d Infor
m
ation
The wh
ole id
eal of the algorithm advan
ced in this p
a
per is that ea
ch pattern in pattern
set is not been looked a
s
string. We look each byte
in
the pattern as a number from 0 to
255. So
it overcom
e
s the limitation on lang
ua
ge. Then pat
tern matchin
g
is tran
sformed into nu
mber
disp
osi
ng. Pattern set is so
rted ba
sed o
n
the su
m of the sho
r
te
st length of pattern and the ind
e
x
is built for e
a
ch p
a
ttern.
To increa
se
sea
r
ching
sp
eed, it adopt
s AVL tree t
o
store inde
x.
Becau
s
e the
maintenan
ce of AVL tree is relatively
indepen
dent
and doe
s n
o
t take matching
time, it can be ignored. It is impo
rtant that we
firstly judge
wheth
e
r
the co
ntent
s of the sh
ort
e
st
continuous patterns from
text
entrance are present in patte
rn set. It involves colligate
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Multi-pattern Matchin
g
Met
hod
s Base
d o
n
Num
e
ri
cal
Com
putation
(Lu Jun
)
1499
informatio
n a
bout multi-bytes, so matchi
ng collis
io
n p
r
oba
bility is redu
ced. O
n
ly whe
n
the su
m
of the shorte
st pattern len
g
th is the sa
me and
colli
sion is elimina
t
ed will accurate matchin
g
be
further a
c
hiev
ed. Con
s
eq
u
ently, matching times are
redu
ce
d gre
a
t
ly.
In the pattern set relate
to
this algo
rith
m, it is obvious that the
bigger
the length of the shorte
st pattern is, the better
matc
hing effic
i
enc
y
is
.
3.1. Pretrea
t
ment of Pa
ttern Set
At first, pattern set n
eed
s to be p
r
etre
ated, t
hat is, o
b
taining the
shorte
st pattern lengt
h
(minle
n) in th
e pattern
set.
To ea
ch p
a
ttern in thi
s
p
a
ttern set, the
sum of mi
nle
n
length
bytes is
comp
uted a
s
12
m
i
n
......
lin
mm
m
,
then the pattern set is sorted b
a
se
d on the su
m. To
some patte
rn
s, if the sum is the same
, they
can be sorte
d
based on the value of the bytes
again. It is sh
own a
s
Tabl
e
1.
Table 1. The
sorte
d
pattern set ba
sed o
n
the length o
f
the shorte
st pattern
Pattern
The sum of the
foregoing 4 b
y
tes
123
4567
89
1
0
1
1
1
2
1
3
acaapz
y
390
a
c
a
a
p
z
y
abcdekg
394
a
b
c
d
e
k
g
cbdaefjlaa 394
c
b
d
a
e
f
j
l
a
a
dacbfjoire 394
d
a
c
b
f
j
o
i
r
e
ldifjglk
415
l
d
i
f j
g
l
k
jdhkjw
e
416
j d
h
k
j w
e
slkdjg
ioejgr
430
s
l k
d
j
g
i
o
e
j
g
r
klkrfhkdfksd
l 436
k
l
k
r
f
h
k
d
f
k
s
d
l
a
y
zxabk
460
a
y
z
x a
b
k
xy
za
460
x
y
z
a
In the pattern set of T
abl
e 1, the sho
r
test pattern l
ength minl
en
is 4
(xyza)
and the
longe
st pattern length maxl
en is 1
3
(klkrf
hkdf
ksdl).
Th
e pattern
set i
s
so
rted b
a
se
d on the sum
o
f
the sho
r
test
pattern. Whe
n
the pattern
is ma
tche
d, the sum of
the minlen length bytes is
comp
uted
fro
m
the text ma
tching
entran
c
e a
nd
com
p
are
s
with the
sum i
n
the
so
rted p
a
ttern
set.
The pattern
set is filtrate
d. This colli
si
on (e.g.39
4
)
odd
s
(
1/
(
25
6*minle
n
))
i
s
less than the
colli
sion o
d
d
s
(
1/2
5
6
)
b
a
se
d on the
comp
ari
s
o
n
with si
ngle b
y
te. So matching efficie
n
cy is
prom
oted. It is app
are
n
t that the bigge
r of the s
hort
e
st pattern le
ngth is, the
more the p
a
ttern
numbe
r is, an
d the more o
b
vious the a
d
v
antage of this algo
rithm is.
To compa
r
e
easily, the in
d
e
x sho
u
ld b
e
built on the
sorted
pattern.
It is sh
own in
figure
2.
Figure 2. The
index on the sorte
d
pattern set
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NIKA
Vol. 11, No
. 3, March 20
13 : 1497 – 1
505
1500
3.2. Built AV
L Tree For T
h
e Indexe
s
To qui
ckly locate in mat
c
hi
ng ba
se
d on t
he ind
e
x, the index Tabl
e i
n
Figu
re 2
sh
ould b
e
transfo
rme
d
t
o
AVL tree,
so avera
ge
se
arching tim
e
s become the
l
east. Fig
u
re
3
sho
w
s the A
V
L
tree ba
sed o
n
the index table.
416
39
4
436
390
415
430
460
Figure 3. AVL tree
The nod
e’s d
a
ta stru
cture in AVL tree is
as follo
ws:
stru
ct node
{ int data;
struct node *llink;
st
ruct
nod
e *rlin
k;
struct so
rtn
ode *sli
nk;
int number;
}
In the node’
s data stru
ctu
r
e, data is the
va
lue of the node in AVL
tree, su
ch a
s
394.
Llink
point
s to the left nod
e and
rlin
k p
o
i
nts to the
rig
h
t node. Slin
k points to
the
first no
de in t
he
pattern
sub
-
set that the sum of the fo
regoi
ng mi
nl
en len
g
th bytes in
so
rted
pattern T
able
is
equal to th
e
data value.
Numbe
r
is th
e
numbe
r of
n
o
des i
n
the p
a
ttern sub-set.
It is 1 at lea
s
t,
while the n
u
m
ber of no
de
s in the pattern sub
-
set with the sum 39
4 is 3.
3.3. Collision Eliminating
Those patterns with the same foreg
o
in
g minlen leng
th bytes will
collid
e. To reduce the
comp
ari
s
o
n
ti
mes of the collision patte
rns in matchi
ng, the patte
rn set need
s to
be disposed
further. Of co
urse, the dispositio
n doe
s not occupy
matc
hing time, bec
ause it is
c
o
mpleted in
pretreatment.
The main i
d
ea is to
sea
r
ch the by
te
positio
ns
with the mo
st d
i
fference in t
h
e
colli
sion patt
e
rn
s and let
the text compares di
re
ctly with the
corre
s
p
ondin
g
positio
n of the
pattern. Th
us com
pari
ng ti
mes
are
re
du
ced. In t
he
previous
exam
ple, the patte
rn
set with i
n
dex
394 ha
s thre
e patterns. If
these pattern
s are sto
r
ed in array (suffix is 0~maxlen-1
)
, we sea
r
ch
the position
with the bigg
est differen
c
e degree.
Differen
c
e de
gree is the nu
mber of different
bytes with the same suffix in pattern set. It is
apparen
t that the biggest differen
c
e
degree is le
ss
than or eq
ual
to the pattern
numbe
r (3
) o
f
collis
io
n pattern sub
-
set. It is sho
w
n in fi
gure 4.
Figure 4. Coll
ide sketch map
In Figure
4, the suffix with the bigge
st di
fference degree i
s
0,3,5,6. We se
lect the
smalle
st 0 and write it down. So we need to add a new field for pattern se
t—colli
sion field,
who
s
e initial
value is max
length. The fi
rst collisi
on field with the
max length is modified a
s
this
suffix in the collision
pattern se
t. It is shown in figure
5.
0
1 2
3
4
5
6 7
8 9
10
11
12
a b
c
d e
k
g
c
b d
a
e
f
i
l
a a
d a
c
b f
j
o
i
r
e
394
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Multi-pattern Matchin
g
Met
hod
s Base
d o
n
Num
e
ri
cal
Com
putation
(Lu Jun
)
1501
Figure 5. Modify collision field
To advance matchin
g
efficien
cy in the future, each
pattern nee
ds an additio
nal field
.
This field is u
s
ed to store byte number
of this
pattern. So the data stru
cture of
the pattern node
is
as
follows
:
st
ru
ct
st
rin
g
{ int num;
int collision;
c
har *s
;
}
In multi-patte
rn matching, the continu
e
ma
xlen lengt
h bytes from
the text entrance is
adopte
d
as
matchin
g
win
dow. The
su
m of minlen l
ength bytes o
f
the matche
d text is computed
firstly, such a
s
394. Th
en
we sea
r
ch the pattern
sub
-
set by AVL tree. When th
e value 0 in the
first pattern
collision field i
s
found, the
byte at
0 position in each
pattern is
ch
ecked in turn
. If
corre
s
p
ondin
g
byte is ma
tched i
n
cert
ain patte
rn,
pre
c
isi
on m
a
tching
ca
n b
e
don
e
with
this
pattern. If co
rre
sp
ondi
ng
byte is n
o
t matche
d in
each patte
rn,
matchi
ng i
s
lost at thi
s
text
entran
c
e, an
d
the windo
w slides to the n
e
xt byte.
It is possibl
e that multiple colli
sion
s hap
pen.
If the in
dex of another pattern (ab
c
dfkil
)
i
s
also
394, its
positio
n is a
d
ded to the
se
con
d
of
the p
a
ttern sub
-
se
t. Here, the bi
gge
st differen
c
e
degree of the collisio
n pattern sub-set is still 3,
less than the num
ber (4
) of patterns. After the
first colli
sion
che
c
king, it is found that the fi
rst pattern still colli
d
e
s with the seco
nd pattern,
becau
se the
context with
0 suffix is ‘a’. The next
coll
ision p
o
int ne
eds to be fou
nd. This po
sit
i
on
is suffix 4. These t
w
o p
a
tterns can be
di
stingui
sh
ed b
y
storing
this
suffix to the seco
nd p
a
ttern’s
colli
sion field.
It is shown in
figure 6.
Figure 6. Multiple collidi
n
g
In mult-pattern matching, if the corre
spo
nding
byte in the text is matche
d with the third
or the fo
urth
pattern
by 0
suffix and th
e
numb
e
r
of the matched
pa
ttern is
1, pre
c
isi
on m
a
tchi
ng
with this patt
e
rn
hap
pen
s
dire
ctly. If th
e byte with
0
suffix in the
text is also ‘a
’, there a
r
e t
w
o
matchin
g
patt
e
rn
s, then co
llision h
app
e
n
s. Search
fo
r the next pa
ttern (othe
r
than the patte
rn
colli
sion
who
s
e field value
0) wh
ose col
lision field
with value 4, na
mely, the fourth byte cont
ent
in the text’s current byte st
ring
comp
are
s
with
the fo
urth byte con
t
ent in these
two patterns.
If
the corre
s
po
nding byte in a certain
pattern is
th
e same a
s
the byte in the text, precision
matchin
g
with
this pattern happe
ns. If it is diffe
rent fro
m
the corre
spondi
ng byte in each pattern,
matc
hing is
los
t. It c
an als
o
work
with further c
o
llis
i
ons.
colli
sion
field
0
1 2 3
4 5
6 7
8 9
10
11
12
0
a
b c
d
e k g
13
c
b d a
e
f
i
l
a a
13 d
a
c
b
f
j
o
i
r
e
nu
m
colli
sion
field
0
1
2
3
4
5
6
7
8
9
10 11
12
7
0
a
b
c
d
e
k
g
8 4
a
b
c
d
f
k
i
l
10 13
c
b
d
a
e
f
i
l
a
a
10
13
d
a
c
b
f
j
o
i
r
e
394
394
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NIKA
Vol. 11, No
. 3, March 20
13 : 1497 – 1
505
1502
4. The Multi-Patte
rn Matc
hing Metho
d
Based o
n
Stitched Infor
m
ation
The first few bytes of each pattern are
accum
u
lated
in
the multiple patterns m
a
tchin
g
algorith
m
ba
sed on
add
ed i
n
formatio
n. T
o
red
u
ce
the
colli
sion
pro
b
ability and th
en to redu
ce t
h
e
times of pattern mat
c
hing
, these accu
mulated valu
e are compa
r
ed. Wh
en th
e added valu
es
collid
e, this
method
will e
liminate the collision fu
rt
he
r by increa
si
ng colli
sio
n
field. After further
analysi
s
of the method, it is found
that this metho
d
can be imp
r
ov
ed.
In applicatio
n
s
, the pattern
which n
eed
s to match may have the same h
ead o
r
tail o
f
informatio
n. For exampl
e, it needs
to search for the
name
s
of bo
oks in the library or the na
mes
of papers i
n
the paper database,
"based on ..." or
"... methods"
are the
com
m
on form
s.
This
mean
s that the informatio
n accumulati
on met
hod a
gain
s
t the head or tail information still has
highe
r colli
si
on frequ
en
cy, although the collisi
on o
dds
smalle
r than the tradi
tional metho
d
of
non-num
eri
c
a
l
matching.
In orde
r to av
oid focusin
g
on the he
ad
or tail of the
same
or
simil
a
r info
rmatio
n, it can
use th
e meth
od that extra
c
t so
me byte
s of e
a
ch
pat
tern to o
b
tain
the average
informatio
n a
n
d
then to reduce the collision
odds. For example, it
can be extracted
every other k bytes. In Table
2, if the length of the shortest
pattern in the pattern
set is 8 ( minlen = 8), then the cha
r
a
c
ter
data in o
dd p
o
sition
of ea
ch pattern are
accumul
a
t
ed,
this me
an
s that it is extra
c
ted eve
r
y ot
her
byte(k=1
). This extra
c
tin
g
method re
flects
the ev
en inform
ation of each p
a
ttern and t
he
characteri
stics of different
patterns better, so the collisi
on odds will redu
ce further.
Table 2. The
sorte
d
pattern set ba
sed o
n
the sum of
even inform
ation
Pattern
The sum of the f
o
regoing
4 odd b
y
tes
0
1
2
3
4
5
6
7
8
9
10 11 12
cbdaefjmaa 406
c
b
d
a
e
f
j
m
aa
dacbfjoire 406
d
a
c
b
f
j
o
i
re
abcdekggh
408
a
b
c
d
e
k
g
g
h
ldifjglk 412
l
d
i
f
j
g
l
k
xy
zabcde
418
x
y
z
a
b
c
d
e
slkdjg
ioejgr 422
s
l
k
d
j
g
i
o
ej
g
r
acaapz
y
ijr 423
a
c
a
a
p
z
y
i
j r
klkrfhkdfksd
l 426
k
l
k
r
f
h
k
d
fk
s
d
l
jdhkjw
elm 434
j
d
h
k
j
w
e
l
m
a
y
zxabkn
449
a
y
z
x
a
b
k
n
In Table 2, only two patterns
with the sum
of the first four odd b
y
tes (406
) are same.
These two pa
tterns
can be
disting
u
is
hed
by the collisi
o
n field further.
Furtherm
o
re, it is significant to
reduce th
e collisi
on odds of the initial part of the
cal
c
ulatio
n. If the cha
r
a
c
te
r stitch
ed met
hod is
u
s
e
d
instea
d of usi
ng add
ed me
thod, it will avoid
the colli
sion i
n
whi
c
h the a
dded
cha
r
a
c
t
e
rs
are
different but the a
dded valu
e is the sam
e
, even
disting
u
ish the situation that t
he chara
c
ters are sa
me but just the location i
s
cha
nge
d.
For
example, to the cha
r
a
c
ter string “cb
d
a
e
fjmaa”
in the Table 2, whether u
s
ing
the method of
addin
g
the first few contin
uou
s cha
r
a
c
ters o
r
the method of adding the first few odd charact
e
rs,
it will both be collided by other strin
g
s. It needs to
take an ulteri
or step to solve the problem by
usin
g colli
sio
n
field. Acco
rding to the m
e
thod of
spli
ced ch
ara
c
te
r, it could exp
and calculati
o
n
nume
r
ical sp
ace, and red
u
ce the colli
sion odds.
To
the string “cbdaefjma
a
” , the proce
s
s o
f
spli
cing fro
m
the first four o
dd ch
ara
c
te
rs (bafm) i
s
sh
o
w
n in figure 7
.
Figure 7. Spliced info
rmati
o
n
The value o
b
tained fro
m
the even po
sition ch
ara
c
te
r spli
cing meth
od ca
n distin
guish th
e
pattern
s that the cha
r
a
c
ters are sa
me b
u
t their
po
sitions a
r
e
cha
n
ged. It can no
t be achi
eved
by
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Multi-pattern Matchin
g
Met
hod
s Base
d o
n
Num
e
ri
cal
Com
putation
(Lu Jun
)
1503
the adde
d method. To the
first four odd
characte
rs
(bafm) in the foreg
o
ing exa
m
ple, if only the
positio
n is
ch
ange
d into
“mabf”, the
re
sult of a
c
cum
u
lation
will be
still 406,
but
its spli
cin
g
va
lue
is different. It is sh
own in figure 8.
Figure 8. Spliced info
rmati
on after the p
o
sition i
s
ch
a
nged
To the string
s in the Tabl
e 2, the splici
ng
value of each st
ring is
cal
c
ulate
d
ou
t. Then
,
Table 3 is a
sorted Ta
ble b
a
se
d on the splicin
g value.
Table 3. The
sorte
d
pattern set ba
sed o
n
the spli
cing
value of even
information
pattern
The
splicing
value of the
foregoing 4 o
dd
by
t
e
s
0
1
2
3
4
5
6
7
8 9
10
11
12
dacbfjoire 163383972
1
d
a
c
b
f
j
o
i
r e
cbdaefjmaa 165055038
1
c
b
d
a
e
f
j
m
a a
abcdekggh 165074826
3
a
b
c
d
e
k
g
g
h
acaapz
y
ijr 166733271
3
a
c
a
a
p
z
y
i
j r
ldifjglk 168443274
7
l
d
i
f
j
g
l
k
jdhkjw
elm 168476452
4
j
d
h
k
j
w
e
l
m
slkdjg
ioejgr 181851940
7
s
l
k
d
j
g
i
o
e j
g
r
klkrfhkdfksd
l 181943715
6
k
l
k
r
f
h
k
d
f k
s
d
l
xy
zabcde
203642557
3
x
y
z
a
b
c
d
e
a
y
zxabkn
203793265
4
a
y
z
x
a
b
k
n
It can be se
en from Tabl
e 3, the compared va
lu
e is stitche
d
by the first
four odd
characters, which
contai
ns
combi
natorial
information of rela
ted character positions. The collision
odd
s can be
up to 1/(2
32
) , which is far l
e
ss than the colli
sion od
ds 1 / (256 * minlen) ba
se
d on
informatio
n
accumul
a
tion
. Hen
c
e, th
e multi-p
a
ttern mat
c
hin
g
spe
ed is
a
d
vanced. In
this
example, the
colli
sion i
s
no
t occurred
by usin
g t
he
spl
i
cing valu
e m
e
thod. It is n
o
t necessa
ry to
reduce
colli
si
ons by collisi
on field.
In summary
, the Multi-pattern matchi
ng method
s based on the spliced informatio
n
have lowe
r
colli
sion od
d
s
than the Multi-pa
ttern
matching m
e
thod
s base
d
on the ad
ded
informatio
n. Beside
s, the colli
sion odd
s of splicing
informatio
n red
u
ce ra
pidly wi
th the increa
sing
numbe
r of
b
y
tes. It supp
ose
s
that th
e
length of
si
mple d
a
ta type which sy
stem ca
n ha
n
d
led
each time is not more tha
n
8 bytes, th
en
the collisi
on odd
s between add
ed in
formation me
thod
and spliced in
formation met
hod are co
mp
ared in T
able
4.
Table 4. The
comp
ari
s
o
n
o
f
collisio
n odd
s between a
d
ded informati
on method a
n
d
spli
ced
informatio
n method
method
added informatio
n method
spliced information method
b
y
te num
ber
value space
collisi
on odds
value space
collision odds
1 b
y
te
2
8
1/2
8
2
8
1/2
8
2 b
y
tes
2*2
8
1/
(
2*2
8
)
2
16
1/2
16
3 b
y
tes
3*2
8
1/
(
3*2
8
)
2
24
1/2
24
4 b
y
tes
4*2
8
1/
(
4*2
8
)
2
32
1/2
32
5 b
y
tes
5*2
8
1/
(
5*2
8
)
2
40
1/2
40
6 b
y
tes
6*2
8
1/
(
6*2
8
)
2
48
1/2
48
7 b
y
tes
7*2
8
1/
(
7*2
8
)
2
56
1/2
56
8 b
y
tes
8*2
8
1/
(
8*2
8
)
2
64
1/2
64
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NIKA
Vol. 11, No
. 3, March 20
13 : 1497 – 1
505
1504
Figure 9 indicates the com
pari
s
on b
e
tween two
meth
ods ba
se
d on
the numeri
c
a
l
space.
The differen
c
e betwee
n
the two metho
d
s re
ache
s t
he expone
ntial level, so it
is easi
e
r to draw
figure by
usi
n
g
Log cal
c
ul
a
t
ion
to
proce
ss
th
e
nu
merical spa
c
e of
the
two
m
e
th
ods, whi
c
h can
evaluate the
value of the exponent spa
c
e.
Figure 9. The
compa
r
i
s
on
of value spa
c
e bet
we
en ad
ded informati
on and
spli
ce
d informatio
n
It can be
se
e
n
from Fig
u
re
9 that the rel
a
t
ed num
eri
c
al sp
ace of the spli
ce
d info
rmation
method a
r
e
greate
r
tha
n
that of the added info
rma
t
ion method
by increa
sing
the insp
ecti
n
g
bytes, which can g
r
eatly redu
ce colli
sio
n
odds an
d increa
se the su
ccessf
ul probability of initial
matc
hing.
In s
u
mmary, multi-pattern matc
hi
ng
s
hou
ld
fo
llo
w
th
es
e
r
u
le
s
:
(1) T
he collisi
on pro
bability
in the previo
us ste
p
s
sho
u
ld ke
ep a
s
low a
s
po
ssi
bl
e
(2) T
he sortin
g rule
s take
n by different st
eps
sho
u
ld keep a
s
differe
nt as po
ssi
ble
(3) Th
e bytes data involved in the operation sh
o
u
ld
reflect the ch
ara
c
teri
stics
of each
pattern.
5. Conclusio
n
Two m
u
ltiple
pattern
s m
a
tchin
g
met
hod
s ba
se
d
on num
eri
c
al
comp
uta
t
ion are
advan
ced i
n
this p
ape
r: on
e is b
a
sed o
n
the add
ed in
formation
met
hod, an
d the
other i
s
spli
ced
informatio
n
method. Multi
p
le patterns
matchin
g
met
hod
s ba
se
d on num
eri
c
al
comp
utation
can
overcome th
e langu
age
probl
em
s of pattern mat
c
hin
g
algo
rit
h
m. To re
du
ce
collisi
on,
the
information of multiple bytes in mat
c
hed text is
integrated. To el
iminate colli
si
on furthe
r, the
colli
sion field
s
ca
n be sel
e
cted ba
se
d on the cha
r
a
c
teri
stics of given pattern
set. This pa
per
summ
ari
z
e
s
the rul
e
s follo
wed by different step
s in
multi-pattern
matchin
g
pro
c
e
ss. It can
be
analyzed that
the multi-p
a
ttern mat
c
hin
g
method
s b
a
s
ed
on the
splice
d
inform
ation have l
o
wer
colli
sion
odd
s than the
mult
i-pattern mat
c
hing m
e
thod
s based
on the
adde
d info
rm
ation, thereb
y
multi-pattern
matchin
g
m
e
thod ba
se
d
on the sp
li
ced info
rmati
on incre
a
ses the su
ccessful
possibility of the initial matching.
Ackn
o
w
l
e
dg
ment
This pap
er is suppo
rted b
y
the
Natural Scien
c
e Fou
ndation of He
ilongjian
g
Pro
v
ince of
chin
a und
er Grant
No.
F2011
38, th
e Scientific
Re
sea
r
ch Fu
nd of Heil
on
gjiang Provin
cial
Educatio
n De
partme
n
t under Gra
n
t No.
1252
1395 an
d the Youth
Scien
c
e Fun
d
of Heilongji
ang
University un
der G
r
a
n
t No.QL20
104
4. We a
r
e g
r
a
t
eful to all those who m
ade
con
s
tru
c
tive
comm
ent
s.
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Multi-pattern Matchin
g
Met
hod
s Base
d o
n
Num
e
ri
cal
Com
putation
(Lu Jun
)
1505
Referen
ces
[1]
Knuth DE, Morris H, Pratt VR.
Fast Pattern
Matchi
ng in Strings.
SIAM J
Com
p
. 19
77; 6
:
323-35
0.
[2]
Bo
yer RS, Mo
o
r
e JS. A F
a
st String Se
archi
n
g
Algorit
hm.
Co
mmu
n
icati
ons
of the ACM
. 19
77; 20: 7
62-
772.
[3]
Z
H
ANG Na, HOU Z
heng-fe
n
g
.
A F
a
st Improved BM Al
gori
t
hm
for Pattern Matching
in Strings
. Journal
of Hefei Un
iver
sit
y
of T
e
chnol
og
y. 200
6; 29(
7): 834 - 83
8
[4]
YANG Wei-
w
e
i, LIAO X
i
ang.
Improv
ed Pat
t
ern Matchi
ng
Algorit
h
m
of BM.
Computer
Appl
icatio
ns.
200
6; 26(2): 31
8 -319
[5]
PANG Shan-c
hen, W
A
NG Shu-d
ong. JIAN
G Chang-
ju
n,
Impr
ove
d
BM-a
ltorith
m
for Pat
t
ern Matchin
g
in String
. Com
puter App
lic
ation. 200
4; 24(1
2
): 11-13
[6]
Aho V, Corasick MJ.
Efficient String Matchi
n
g
: An Aid to Bi
blio
gra
phic S
e
arch.
Commu
ni
cations of th
e
ACM.197
5; 18:
333-3
40.
[7]
W
U
Sun, Man
ber U.
A Fast Algorit
h
m
for Multi-Pattern S
earch
ing
. T
e
chnical
Re
port TR. Univers
i
t
y
o
f
Arizon
a at T
u
scon. 199
4; 94-
17
[8]
HU Pei-
hu
a, W
A
HG Y
ong-c
hen
g, LIU Gon
g
-she
n.
A Multiple P
a
ttern M
a
tchin
g
Alg
o
rit
h
m B
a
se
d o
n
Sequ
enti
a
l Bin
a
ry T
r
ee
. Computer Scie
nce.
200
2; 29(1
1
): 65-68.
[9]
LI Wei, GUAN X
i
a
o
-
ho
ng
, TAH
G
We
n
-ro
ng
.
A F
a
st Matchin
g
Alg
o
rith
m Bas
ed o
n
a
Specia
l Cross
Data Li
nk T
abl
e
.
Journa
l of chin
a instit
ute of
commu
n
ic
atio
n
. 2004; 2
5
(10)
: 38-44
[10]
YANG Dong
h
ong,
Xu ke, C
U
I Yong.
I
m
pr
oved W
u
-M
an
ber Mu
ltipl
e
P
a
tterns Matchi
ng Al
gorit
h
m
. J
T
s
inghua Un
iv. 2006; 4
6
(4): 5
55-5
5
8
[11]
SUN Xiao-s
h
a
n
, W
A
NG Qiang, GUAN Yi, WANG
Xia
o
-l
ong
. An Improved W
u
-Manb
er Multipl
e
-p
atter
n
Matchin
g
Algor
ithm and Its Applicati
on.
Jo
urn
a
l of Chi
nes
e Informatio
n
Pro
c
essin
g
. 200
6; 20(2): 47-
52
[12]
HORSPOOL
RN.
Practical F
a
st Searchi
ng
in
Strings
. Soft
w
a
re-Practic
e an
d E
x
per
ie
nce.
198
0; 10(
6):
501-
506.
Evaluation Warning : The document was created with Spire.PDF for Python.