TELKOMNIKA, Vol. 12, No. 4, December 2014, pp. 1132~1141
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v12i4.388

Received September 8, 2014; Revised November 20, 2014; Accepted December 1, 2014
Review of Local Descriptor in RGB-D Object Recognition

Ema Rachmawati*1, Iping Supriana2, Masayu Leylia Khodra3
School of Electrical Engineering & Informatics, Institut Teknologi Bandung
*Corresponding author, e-mail: ema.rachmawati22@students.itb.ac.id1, iping@stei.itb.ac.id2, masayu@stei.itb.ac.id3
Abstract
The emergence of the RGB-D (Red-Green-Blue-Depth) sensor, which is capable of providing depth and RGB images, gives hope to the computer vision community. Moreover, the use of local features has increased over the last few years and has shown impressive results, especially in the field of object recognition. This article attempts to provide a survey of the recent technical achievements in this area of research. We review the use of local descriptors as the feature representation extracted from RGB-D images, in instance- and category-level object recognition. We also highlight the involvement of depth images and how they can be combined with RGB images in constructing a local descriptor. Three different approaches are used to incorporate depth images into a compact feature representation: the classical (distribution-based) approach, the kernel trick, and feature learning. In this article, we show that the involvement of depth data successfully improves the accuracy of object recognition.

Keywords: RGB-D images, local descriptor, object recognition, depth images
1. Introduction
Object recognition is an important problem in computer science, which has attracted the interest of researchers in the fields of computer vision, machine learning and robotics [1]. The core of building object recognition systems is to extract meaningful representations (features) from high-dimensional observations such as images, videos and 3D point clouds [2]. Satisfactory results have been achieved using a variety of methods, applications and standard benchmark datasets. Nevertheless, recognition of daily objects in a scene image is still an open problem. The major challenges in a visual object recognition system are divided into two groups, related to system robustness and to computational complexity and scalability. Belonging to the first group is the challenge of handling intra-class variations in appearance (different appearances among objects of the same category) and inter-class variations. Instances of the same object category can generate different images, caused by a variety of variables that influence illumination, object pose, camera viewpoint, partial occlusion and background clutter. The challenges belonging to the second group include the very large number of objects from different categories, high-dimensional descriptors, and the difficulty of obtaining labelled training samples without any ambiguity, etc. [3].
To address these two challenges, [3] argues that three aspects are involved, namely modelling appearance, localization strategies and supervised classification. The focus of researchers has been on developing techniques and algorithms in these three aspects in order to improve visual object recognition system performance. Among them, modelling appearance is the most important aspect [3]. Appearance modelling is focused on the selection of features that can handle various types of intra-class variations and can capture the discriminative aspects of the different categories. Furthermore, [4] also stated that "the next step in the evolution of object recognition algorithms will require radical and bold steps forward in terms of the object representations, as well as the learning and inference algorithms used".
The emergence of RGB-D sensors (Microsoft Kinect, Asus Xtion, and PrimeSense), which are relatively cheap, promises to improve performance in object recognition. Such a sensor is capable of providing a depth value for each pixel, so the image information is abundant. An RGB-D sensor has an RGB camera together with an infrared camera and projector, so it can capture colour images and the depth of each pixel in the image. These two factors are very helpful for the image processing field, which was always dependent on the colour channels of the image [5],[6]. By using the depth channel for foreground segmentation or as complementary information to image intensity, much object recognition research using RGB-D images has achieved very significant results compared to using only the RGB camera, as can be seen in [1],[2],[7]–[10].
In general, the ways to represent image features are divided into two groups, i.e. global and local [11]. In the representation of local features [12], a number of features in a region surrounding the target object are extracted to represent the object, so that the object can be recognized under partial occlusion.
Until now, we have found four other survey-like papers that introduce local descriptors [13]–[16]. Zhang et al. [13] classified feature detectors and feature descriptors and their implementation in computer vision problems; that paper did not discuss the involvement of depth data in a feature descriptor. Paper [14] compared the performance of descriptors computed for local interest regions; the descriptors were computed on greyscale images and did not consider an object's colour. Paper [15] compared the descriptors available in PCL (Point Cloud Library) [17], explaining how they work, and made a comparative evaluation on the RGB-D Object dataset [7]. The major difference between this article and [13],[14],[16] is that [13],[14],[16] explain local descriptors from RGB images and their implementation in computer vision, while this article intends to give insights into how researchers exploit RGB and depth images in constructing local descriptors in an RGB-D object recognition system, especially research that uses the RGB-D Object dataset [7]. The papers reviewed in this article are categorized into three approaches according to the technique used in representing features, that is, the classical (distribution-based) technique, the kernel method, and feature learning. We show that unsupervised feature learning in constructing feature representations offers a great opportunity for further development with regard to the depth image from RGB-D images, in order to capture better shape features. The rest of this article is organized as follows. We summarize the RGB-D Object Dataset in Section 2; describe the local descriptor in Section 3; and summarize and analyse the use of some local descriptors in RGB-D based object recognition in Sections 4 and 5. This survey concludes in Section 6.
2. RGB-D Object Dataset
The RGB-D Object Dataset [7] is similar to the 3D Object Category Dataset presented by Savarese et al. [18], which contains 8 object categories, 10 objects in each category, and 24 distinct views of each object. But the RGB-D Object Dataset is on a larger scale, with RGB and depth video sequences of 300 common everyday objects from multiple view angles, totalling 250,000 RGB-D images. RANSAC plane fitting [19] was used to segment objects from the video sequences. Objects are grouped into 51 categories using WordNet hypernym-hyponym relations and are a subset of the categories in ImageNet [20]. This dataset does not only consist of textured objects such as soda cans, cereal boxes or bags of food, but also consists of textureless objects such as bowls, cups of coffee, fruit, and vegetables. Objects contained in the dataset are commonly found in homes and offices, where personal robots are expected to operate. Objects are arranged in a tree hierarchy, with the number of instances of each object category found in each leaf node ranging from 3 to 14 instances per category.
3. Local Descriptor
In the representation of local features, a number of features in the region that surrounds the target object are necessary for the object to be recognized under partial occlusion [11]. This is achieved through the following steps: (1) finding a distinctive keypoint; (2) defining the region around the keypoint; (3) extracting and normalizing the content of the region; (4) building a local descriptor of the normalized region; and (5) local descriptor matching.
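As an illustration, the following is a minimal sketch of this five-step pipeline using OpenCV's SIFT implementation; the image file names are hypothetical placeholders, steps (1)–(4) are handled internally by detectAndCompute, and step (5) is done with a brute-force matcher.

```python
# Minimal sketch of the five-step local-descriptor pipeline with OpenCV SIFT.
# The image file names are hypothetical placeholders.
import cv2

img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("target.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
# Steps (1)-(4): detect distinctive keypoints and build a 128-D descriptor
# from the normalized region around each keypoint.
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Step (5): local descriptor matching, with Lowe's ratio test to discard
# ambiguous correspondences.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} confident matches out of {len(matches)}")
```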
Zhang et al. [3] classify the description of visual features into three groups, namely the pixel level, the patch level and the region level.
At the pixel level, features are calculated for each pixel separately. The popular description in this group is the grey-scale value that indicates the intensity of a pixel, along with the colour vector. At the patch level, a patch (the support region or neighbourhood of a point) is a small local sub-window that surrounds some point of interest in the image plane or scale pyramid; patches can be obtained by sparse sampling using a keypoint detector [13],[21]–[23] or by dense sampling on a regular grid. Patches, which are typically small in size, make a patch-level descriptor also known as a local feature descriptor. Some popular patch-level descriptors include SIFT [24], SURF [25] and filter-bank responses (Gaussian functions, Gabor functions, wavelets).
The patch size is often too small to accommodate part or all of the object, which means a greater region is needed to capture the more relevant visual cues. A region is a group of interconnected pixels in an image. A region can be a segment with a regular or irregular shape; the region can even be the whole image. Descriptions at the region level are usually developed with the purpose of capturing the most discriminating visual properties of the target category (or a component of target categories) while maintaining robustness in dealing with intra-class variations. Based on these objectives, modern categorization systems adopt histogram-based representations at the region level, such as BoF (Bag-of-Features) and HOG (Histograms of Oriented Gradients), which are usually built on contrast-based local features such as gradients, which are invariant to lighting or colour variations. Shape cues, such as contour or edge fragments, shapelets etc., are also often captured and described at the region level for object recognition. Colour features are sometimes used as a category cue, because each category has a relatively constant colour. Some descriptors at the region level are BoF [26], HOG [27], GIST [28]–[30], and shape features [31].
4. Local Descriptor in RGB-D based Object Recognition System
A feature descriptor is built from a number of input images. In the classical approach, features are extracted from local image patches around detected interest points or on a fixed grid, using a powerful method such as SIFT, SURF or textons. Then a learning algorithm, usually a machine learning technique, is applied to those feature vectors in order to classify them into some predefined categories (see Figure 1). Those feature descriptors have been successfully used in many applications; however, they tend to be difficult to design and cannot be easily adapted when additional information is available. Therefore, [32] conducted an experiment to generalize features based on orientation histograms to a broader class of so-called kernel descriptors. In constructing a kernel function, one can incorporate knowledge that humans already have about the specific problem domain. Kernel methods can operate in a high-dimensional feature space by simply computing the inner products between the images of all pairs of data in the feature space [33].
Figure 1. Common Pipeline in Feature Representation
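To make the match-kernel computation concrete, here is a minimal sketch of a Gaussian match kernel between two patches, where each patch is assumed to be given as an array of per-pixel attribute vectors (e.g., normalized gradients); it illustrates the general idea of averaging a kernel over all attribute pairs, not the exact formulation of [32].

```python
# Minimal sketch of a Gaussian match kernel between two image patches.
# Each patch is an (n_pixels, d) array of per-pixel attribute vectors
# (e.g., normalized gradient vectors). This illustrates the general
# match-kernel idea -- averaging a kernel over all attribute pairs --
# rather than the exact formulation of [32].
import numpy as np

def gaussian_match_kernel(P, Q, gamma=1.0):
    # Squared Euclidean distances between every attribute pair.
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
    # Average the Gaussian kernel over all pairs of pixel attributes.
    return np.exp(-gamma * d2).mean()

rng = np.random.default_rng(0)
patch_a = rng.normal(size=(64, 2))   # e.g., an 8x8 patch, 2-D gradient per pixel
patch_b = rng.normal(size=(64, 2))
print(gaussian_match_kernel(patch_a, patch_b))
```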
The performance of machine learning techniques relies heavily on the choice of feature representation for the application domain. Thus, most of the effort in deploying machine learning algorithms lies in the design of the pre-processing and data transformations that produce data representations able to support effective machine learning. This feature engineering process requires human intelligence and prior knowledge to overcome the weaknesses of the learning algorithm, which is unable to extract and classify discriminative information from the data. Representation learning seeks to learn representations of the data, making it easier to extract useful information when building a classifier or other predictors [34]–[36]. Various methods to learn low-level features from raw data (feature learning) have been produced by the machine learning community, e.g. Deep Belief Networks [37], deep Boltzmann machines [38], convolutional deep belief networks [39] etc. Various researches that implement feature learning have also demonstrated impressive accuracy. Coates et al. [40] proved that good image features can be learned efficiently using standard unsupervised learning techniques (see Figure 2). However, those applications are still somewhat limited to 2D images, typically in grey-scale. [1],[9] showed very good results on RGB-D object recognition using an unsupervised feature learning method to build the feature representation.
To the best of our knowledge, six papers [1],[2],[7]–[10] have proposed a new feature descriptor for RGB-D object recognition using the RGB-D Object Dataset. They can be categorized into three groups based on current trends in machine learning: the kernel-trick approach, the feature learning approach and the distribution-based approach. A summary of the feature representation approaches can be seen in Table 1.
Figure 2. Feature Learning in Feature Representation
4.1. Kernel Descriptor
Lai et al. [7] built a large-scale, hierarchical, multi-view dataset, namely the RGB-D Object Dataset, for the purposes of object recognition and detection. In addition, [7] also introduced object recognition and detection techniques based on RGB-D, combining colour and depth information. Feature extraction methods commonly used on RGB images were also implemented here, i.e. spin images [41] to extract shape features and SIFT [24] to extract visual features. Shape feature extraction is generated from the 3D location coordinates of each depth pixel. Spin images are computed from a set of 3D coordinates of a random sample. Each spin image is centered on a 3D coordinate and stores the spatial distribution of its neighbouring points as a 2-dimensional histogram of size 16 × 16, which is invariant to rotation.
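For illustration, a minimal sketch of a spin image for one oriented basis point of a point cloud, following the general construction of Johnson & Hebert [41]; the support size, bin count and synthetic cloud are assumptions for the example.

```python
# Minimal sketch of a spin image for one oriented point of a point cloud,
# assuming `points` is an (N, 3) array, `p` the basis point and `n` its
# unit surface normal. Each neighbour is mapped to (alpha, beta) =
# (radial distance from the normal axis, signed height along the normal)
# and accumulated in a 2-D histogram.
import numpy as np

def spin_image(points, p, n, bins=16, support=0.1):
    d = points - p
    beta = d @ n                                   # height along the normal
    alpha = np.sqrt(np.maximum((d ** 2).sum(1) - beta ** 2, 0.0))
    keep = (alpha < support) & (np.abs(beta) < support)
    hist, _, _ = np.histogram2d(
        alpha[keep], beta[keep], bins=bins,
        range=[[0, support], [-support, support]])
    return hist  # 16 x 16, rotation-invariant about the normal axis

rng = np.random.default_rng(0)
cloud = rng.uniform(-0.05, 0.05, size=(500, 3))
print(spin_image(cloud, cloud[0], np.array([0.0, 0.0, 1.0])).shape)
```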
Spin images are used to compute EMK features [42] using a random Fourier set. EMK (Efficient Match Kernel) features estimate a Gaussian kernel between local features and provide a continuous similarity value. Spatial information is incorporated by dividing the space into a 3 × 3 × 3 grid and computing a 1000-dimensional EMK feature for each cell. One hundred principal components are then taken by PCA (Principal Component Analysis) from the EMK features in each cell. The width, depth and height of the 3D bounding box are also added to the shape feature, yielding a 2703-dimensional shape descriptor.
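The EMK-plus-PCA step can be sketched with scikit-learn's RBFSampler, which implements random Fourier features for approximating a Gaussian kernel; the feature dimensions and random data below are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of the EMK + PCA idea: approximate a Gaussian kernel over
# local features with random Fourier features (scikit-learn's RBFSampler),
# average-pool them into a fixed-length cell descriptor, then keep 100
# principal components. Illustrative only; not the authors' exact pipeline.
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
local_feats = rng.normal(size=(200, 153))      # e.g., 200 spin images per cell

rff = RBFSampler(gamma=1.0, n_components=1000, random_state=0)
z = rff.fit_transform(local_feats)             # random Fourier feature maps
cell_descriptor = z.mean(axis=0)               # set-level (match-kernel) pooling

# With one such 1000-D descriptor per image cell, PCA keeps 100 components.
cells = rng.normal(size=(500, 1000))           # stand-in for many cells
pca = PCA(n_components=100).fit(cells)
print(pca.transform(cells[:1]).shape)          # (1, 100)
```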
The visual features are extracted from the RGB data. SIFT features are extracted on an 8 × 8 grid. Texton histogram features [43] are extracted to obtain texture information, using oriented Gaussian filter responses; the texton vocabulary is built from a set of images from LabelMe [44]. A colour histogram, and the mean and standard deviation of each colour channel, are added as visual features as well. Recognition of object categories and object instances is performed using SVMs (linear kernel [45] and Gaussian kernel [46]) and Random Forests [47].
The experimental results showed that, overall, visual features are more useful than shape features for both category-level and instance-level recognition; however, the shape features are relatively more useful for category-level recognition. From this research we can conclude that the combination of shape features and visual features produces high performance in category-level recognition with any classification method. A special note was given to the alternating-contiguous-frames technique, in which using only visual features can produce high accuracy. With the leave-sequence-out technique, the combination of the visual and shape features can significantly improve the accuracy, but not as much as with alternating contiguous frames.
Table 1. Summary of Feature Representation Approaches

| Approach | Paper | Feature extracted from RGB image | Feature extracted from depth image | Descriptor explanation | Accuracy on RGB-D Object Dataset: Instance (%) | Category (%) |
|---|---|---|---|---|---|---|
| Kernel-trick | Kernel Descriptor [7] | SIFT, texton histogram, colour histogram, mean, standard deviation | Spin image (using 3D location), 3D bounding box (width, depth, height) | Using EMK to generate a fixed-length feature vector and performing PCA on EMK features | Depth: 46.2; RGB: 60.7; RGB+Depth: 74.8 | Depth: 64.7 ± 2.2; RGB: 74.5 ± 3.1; RGB+Depth: 83.8 ± 3.5 |
| Kernel-trick | Depth Kernel Descriptor [8] | Colour, gradient, LBP | Edge feature: aggregation of all distance attribute pairs; size feature: distance between each point and the reference point of the point cloud; shape feature: kernel spin and kernel PCA | Building kernel descriptors from RGB and depth images; using pyramid EMK to integrate spatial information | Depth: 54.3; RGB: 78.6; RGB+Depth: 84.5 | Depth: 78.8 ± 2.7; RGB: 77.7 ± 1.9; RGB+Depth: 86.2 ± 2.1 |
| Kernel-trick | Hierarchical Kernel Descriptor [2] | Colour, gradient, LBP | Same as [8] | Defining kernel descriptors over kernel descriptors; spatial information considered by integrating the center position of each patch | Depth: 46.8; RGB: 79.3; RGB+Depth: 82.4 | Depth: 75.7 ± 2.6; RGB: 76.1 ± 2.2; RGB+Depth: 84.1 ± 2.2 |
| Feature learning | Convolutional K-Means [1] | Position coordinate | Position coordinate | Interest points detected using SURF; features learned using Convolutional K-Means (unsupervised learning) | RGB+Depth: 90.4 | RGB+Depth: 86.4 ± 2.3 |
| Feature learning | Unsupervised Feature Learning using HMP [9] | Grey-scale intensity, RGB values | Depth values, 3D surface normals | Features learned using HMP (unsupervised learning via K-SVD) | Depth: 51.7; RGB: 92.1; RGB+Depth: 92.8 | Depth: 81.2 ± 2.3; RGB: 82.4 ± 3.1; RGB+Depth: 87.5 ± 2.9 |
| Classic | Histogram of Oriented Normal Vectors [10] | N/A | Histogram of tangent plane orientation | Modifying zenith & azimuth angles | N/A | RGB+Depth: 91.2 ± 2.5 |
4.2. Hierarchical Kernel Descriptor
Bo et al. [2] attempted to improve the accuracy of the object recognition performed by [7] by using kernel descriptors [32] in a hierarchical way. The kernel descriptor [32] is very effective when used in conjunction with EMK and a non-linear SVM, and is thus not suitable for large data. Therefore, [2] applied kernel descriptors recursively to generate features, changing the pixel attributes into patch-level features and adding depth information to the descriptor. The kernel descriptors used on the RGB image to represent object features consist of the gradient match kernel (based on pixel gradient attributes), the colour kernel (based on pixel intensity attributes), and the shape kernel (based on local binary pattern attributes). The principle of the kernel descriptor is adapted to the depth image by treating the depth image as a greyscale image. The gradient and shape kernel descriptors can then be extracted directly, while the colour kernel descriptor is extracted by first multiplying the depth values by the square root of s, where s is the number of pixels in the object mask. Features are constructed from 2-layer hierarchical kernel descriptors: (1) first layer: the same as the kernel descriptor from image patches of size 16 × 16; (2) second layer: 1000 basis vectors are used for the Gaussian kernel.
4.3. Depth Kernel Descriptor
Bo et al. [8] developed another technique to improve the accuracy of RGB-D image-based object recognition, namely the depth kernel descriptor. Bo et al. [8] extract five depth kernel descriptors to represent recognition cues, including size, 3D shape, and the (depth) edges of objects, within one framework. The idea derives from the use of kernel descriptors on RGB images [32], in which discretizing pixel attributes is not necessary. Similarity between image patches is calculated with a kernel function, the match kernel, which computes the average similarity between all pairs of pixel attributes in two image patches. The depth image is first converted to a 3D point cloud by mapping each pixel to the corresponding 3D coordinate vector.
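A minimal sketch of this pixel-to-point-cloud conversion with the standard pinhole back-projection; the camera intrinsics below are illustrative values, not the sensor's actual calibration.

```python
# Minimal sketch of converting a depth image to a 3D point cloud with the
# standard pinhole back-projection X = (u - cx) * d / fx, Y = (v - cy) * d / fy,
# Z = d. The intrinsics below are illustrative, not a real calibration.
import numpy as np

def depth_to_cloud(depth, fx=570.3, fy=570.3, cx=320.0, cy=240.0):
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                 # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.dstack((x, y, z)).reshape(-1, 3)  # (h*w, 3) point cloud

depth = np.full((480, 640), 1.5)              # a flat 1.5 m plane, for example
print(depth_to_cloud(depth).shape)            # (307200, 3)
```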
The use of kernel descriptors that convert pixel attributes into patch features makes it easy to generate various features from recognition cues. The gradient and local binary pattern kernel descriptors [32] are extracted from the depth image; both are representations of edge features. Gradient and local binary pattern kernel features are extracted from 16 × 16 image/depth patches with 8-pixel spacing. The gradient computation is the same as that used in SIFT. The PCA dimension was set to 50 for one descriptor and 200 for the other. The size descriptor, kernel PCA descriptors (shape descriptors), and the spin kernel descriptor (shape descriptor) were extracted from 3D point clouds. For the size kernel, no more than 200 3D point coordinates are taken for each interest point. As for the kernel PCA and spin kernels, the distance from the local region to the interest point was set at 4 cm, and the number of neighbours is at most 200 point coordinates.
Objects were modelled as a set of local kernel descriptors. Aggregating local kernel descriptors into object-level features was conducted using the EMK pyramid [42],[48]. A local kernel descriptor is mapped into a low-dimensional feature space, and object-level features are then built by taking the average of the resulting feature vectors. Object recognition accuracy increased significantly by implementing the five descriptors. In addition, [8] also demonstrated that the performance of kernel features exceeded the performance of 3D spin image features. From the results of the experiments conducted, [8] found that the depth features are worse than the RGB features in instance-level recognition. This is because different instances in the same category can have almost similar shapes, so combining depth features with RGB features can improve the recognition accuracy. In category recognition, the performance of the depth kernel descriptor is quite comparable to the image kernel descriptor, which indicates that depth information is as important as visual information for category recognition.
4.4. Convolutional K-Means Descriptor
Blum et al. [1] improved the accuracy of RGB-D image-based object recognition by proposing an algorithm that is able to automatically learn image features, in which colour and depth are encoded in a compact representation. Blum et al. [1] introduce a new descriptor, namely the convolutional k-means descriptor, which automatically learns the responses of a number of features in the neighbourhood of each detected interest point. The phases in the construction of the convolutional k-means descriptor can be described as follows (a sketch of the learning phase is given after the list):
1. Learning feature responses:
   a. Unsupervised learning [40]: learning a set of feature responses from a number of input vectors.
   b. Normalization of all patches by subtracting the mean and dividing by the standard deviation; a PCA whitening transformation [49] is then performed on the image patches.
   c. Clustering the image patches using k-means.
2. Interest point detection, using SURF [25] to extract SURF corners.
3. Descriptor extraction.
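As referenced above, a minimal sketch of the learning phase (step 1) with scikit-learn: per-patch normalization, PCA whitening and k-means clustering; the patch source, patch size and cluster count are illustrative assumptions.

```python
# Minimal sketch of step 1 (learning feature responses): per-patch
# normalization, PCA whitening, and k-means clustering of image patches,
# using scikit-learn. Patch source and sizes are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
patches = rng.normal(size=(10000, 81))        # e.g., 10k flattened 9x9 patches

# 1b. Normalize each patch (zero mean, unit variance), then PCA-whiten.
patches = (patches - patches.mean(1, keepdims=True)) / (
    patches.std(1, keepdims=True) + 1e-8)
white = PCA(whiten=True).fit_transform(patches)

# 1c. Cluster the whitened patches; the centroids act as learned filters
# whose responses form the convolutional k-means descriptor.
kmeans = KMeans(n_clusters=128, n_init=10, random_state=0).fit(white)
print(kmeans.cluster_centers_.shape)          # (128, 81)
```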
4.5. Hierarchical Matching Pursuit for Depth Data
Bo et al. [9] tried to improve the accuracy of the object recognition algorithm by adapting two-layer HMP (Hierarchical Matching Pursuit) [50]. HMP is adapted for RGB-D images by learning dictionaries and encoding features using all the RGB-D data channels (greyscale, RGB, depth, and surface normals). HMP uses sparse coding to perform unsupervised learning of a hierarchical feature representation from the RGB-D data. HMP builds dictionaries from colour and depth image patches using K-SVD [51] to represent objects as a sparse combination of codewords. Furthermore, hierarchical features are built using orthogonal matching pursuit and spatial pyramid pooling. So that HMP can be used for RGB-D images, the following steps should be done (a sketch of the first step follows the list):
1. Learning features on the colour and depth images based on the concept of sparse coding. Sparse coding performs dictionary learning, that is, the representation of data as a linear (and sparse) combination of dictionary entries. The data entries are the pixel values of image patches of size 16 × 16.
2. HMP builds hierarchical features from the dictionaries resulting from (1) by applying the orthogonal matching pursuit encoder recursively and performing spatial pyramid max pooling on the sparse codes at each layer of the hierarchical HMP.
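A minimal sketch of step 1 under stated assumptions: K-SVD itself is not available in scikit-learn, so MiniBatchDictionaryLearning stands in for the dictionary-learning step, with orthogonal matching pursuit as the sparse encoder; the patch data and dictionary size are illustrative.

```python
# Minimal sketch of step 1: dictionary learning over image patches with a
# sparse (OMP) encoding. K-SVD itself is not in scikit-learn, so
# MiniBatchDictionaryLearning stands in for the dictionary-learning step,
# with orthogonal matching pursuit as the encoder.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
patches = rng.normal(size=(5000, 256))        # flattened 16x16 patches

dico = MiniBatchDictionaryLearning(
    n_components=75,                          # dictionary size (illustrative)
    transform_algorithm="omp",                # orthogonal matching pursuit
    transform_n_nonzero_coefs=5,              # sparsity level
    random_state=0,
).fit(patches)

codes = dico.transform(patches)               # sparse codes, shape (5000, 75)
print((codes != 0).sum(axis=1).max())         # at most 5 nonzeros per patch
```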
In instance-level recognition, features learned from colour images successfully improve performance compared to features learned from grey-scale images. In addition, the features of the first layer are better (fine-grained); in contrast, for category-level recognition the features of the second layer are better (coarse-grained). Based on the experimental results, learning dictionaries separately for each colour channel produces better accuracy than learning them jointly.
4.6. Histogram of Oriented Normal Vectors
HONV (histogram of oriented normal vectors) was designed by [10] to capture the characteristics of the 3D geometry from the depth image. Without relying on texture, the object is expected to be recognized by taking this 3D surface into account. To reduce noise in the depth image, a Gaussian filter is used at pre-processing time. HONV is a histogram-based feature, like HOG [27] and LBP. The object surface is assumed to carry object category information, because the object surface can be described by the tangent plane orientation (i.e. the normal vector at each surface coordinate). The characteristics of the 3D geometry can thus be represented as a local distribution of the orientation of the normal vectors. Tang et al. [10] derived a formula showing that the normal vector can be represented as an ordered pair of azimuth and zenith angles, which can be easily calculated from the gradient of the depth image. HONV is the concatenation of local histograms of azimuth and zenith angles, so it can be used as a feature in object detection/classification.
The normal vector at position p = (x, y) is the cross product of two tangent vectors on the tangent plane. Through a derivation, [10] obtain the formula of the normal vector at pixel (x, y, d(x, y)). Spherical coordinates are used to encode the orientation information as zenith and azimuth angles. The phases of computing HONV features are as follows: (i) divide the detection window into m × n cells; the orientation of the normal vectors in each cell is computed and binned into histograms, so a feature vector (of dimension i × j) is formed for each cell, with i as the representation of the zenith angle and j as the representation of the azimuth angle, with i = j = 8; (ii) the final feature is obtained by concatenating the HONV features of all cells.
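A minimal sketch of a per-cell HONV under stated assumptions: the normal at each pixel is taken as n = (-∂d/∂x, -∂d/∂y, 1) from the depth gradient, converted to azimuth and zenith angles and accumulated into an 8 × 8 histogram per cell; the cell layout and synthetic depth data are illustrative.

```python
# Minimal sketch of a per-cell HONV: normals from the depth gradient
# (n = (-dz/dx, -dz/dy, 1)), converted to azimuth/zenith angles and
# accumulated in an 8 x 8 2-D histogram per cell. Cell size and depth
# data are illustrative.
import numpy as np

def honv_cell(depth_cell, bins=8):
    gy, gx = np.gradient(depth_cell)
    azimuth = np.arctan2(-gy, -gx)                           # in (-pi, pi]
    zenith = np.arccos(1.0 / np.sqrt(gx**2 + gy**2 + 1.0))   # in [0, pi/2)
    hist, _, _ = np.histogram2d(
        zenith.ravel(), azimuth.ravel(), bins=bins,
        range=[[0, np.pi / 2], [-np.pi, np.pi]])
    return hist.ravel()                                      # 64-D cell feature

rng = np.random.default_rng(0)
depth = rng.normal(1.5, 0.01, size=(32, 32))
cells = [honv_cell(depth[r:r+16, c:c+16])                    # concatenate cell
         for r in (0, 16) for c in (0, 16)]                  # histograms (ii)
feature = np.concatenate(cells)
print(feature.shape)                                         # (256,)
```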
5. Analysis
An image is a data source with special characteristics: each pixel in the image represents a measurement. Besides having high dimensionality, the vector representation of images typically exhibits strong correlation between a pixel and its neighbours. Kernel methods have proved successful in many areas of computer vision, mainly because of their interpretability and flexibility [33]. In the kernel method, a feature descriptor is constructed by comparing pixel orientations or colour intensities in a kernel representation. In the kernel representation, we simply compute the inner products between all pairs of data in the feature space [52]. Kernels are typically designed to capture one aspect of the data, i.e. texture, colour or edges etc. Hence, [2],[7],[8] design and use new kernel methods in order to capture all aspects of the image to describe an object. Their experimental results showed that their kernel approach is successful in representing features from RGB and depth images; there is significant improvement in the accuracy of instance- and category-level recognition using their kernel trick, as can be seen in Table 1.
Hand-designed features such as SIFT and HOG only capture low-level edge information. Although it has proven difficult to design features that effectively capture mid-level cues (e.g. edge intersections) or high-level representations (e.g. object parts), they support many successful object recognition approaches. In addition, recent developments in deep learning have shown how hierarchies of features can be learned in an unsupervised way directly from data. The use of deep learning has proved successful in lowering the state-of-the-art error rate on the ImageNet 1000-class object recognition benchmark [53]. In this paper we show that unsupervised feature learning has a high potential for building a feature descriptor that is more discriminative than the kernel method [1],[9]. Unlike the kernel method, which often requires a nonlinear classifier in the classification process, the feature descriptor generated through feature learning can typically be learned easily with a linear classifier [1],[9]. This approach enables learning meaningful features from RGB as well as depth data automatically. It can be seen from the experimental results (Table 1) that the accuracy of instance-level object recognition achieves a considerable margin compared to the use of the kernel trick [2],[7],[8]. The accuracy of category-level object recognition also increases, although not as much as the increase in instance-level recognition. These results are extremely encouraging, indicating that current recognition systems can be significantly improved without having to design features carefully and manually. This work opens up many possibilities for learning rich, expressive features from raw RGB-D data.
Unlike the five other papers reviewed here, Tang et al. [10] did not recognize objects at instance level in their experiment, but focused on exploiting depth images to capture shape features for object category recognition. The local surface of an object was captured through histograms of azimuth and zenith angles to describe its 3D shape. This idea achieved state-of-the-art object category recognition on the RGB-D dataset, as can be seen in Table 1.
6. Conclusion
We have presented a survey highlighting the current technical achievements of local descriptors in object recognition based on RGB-D images, as well as their influence on the accuracy of object recognition. This article covers the exploration of local descriptors on depth images combined with RGB images, as conducted by several researches. From various studies it appears that the presence of the depth image has a positive effect on object recognition: the extraction of local features from depth images can be used to help recognize objects. In addition, the combination of the visual features and shape features of RGB and depth images in a single descriptor has proved to be capable of producing an object recognition system with very high accuracy on the RGB-D Object dataset. Of the three approaches described in this article, the accuracy of instance recognition involving depth features shows more significant improvement using the feature learning approach than using the kernel method, whereas the classical approach using the normal vector distribution achieves the highest accuracy in category recognition.
References
[1] Blum M, Springenberg JT, Wulfing J, Riedmiller M. A learned feature descriptor for object recognition in RGB-D data. IEEE International Conference on Robotics and Automation (ICRA). 2012: 1298–1303.
[2] Bo L, Lai K, Ren X, Fox D. Object recognition with hierarchical kernel descriptors. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2011: 1729–1736.
[3] Zhang X, Yang Y-H, Han Z, Wang H, Gao C. Object Class Detection: A Survey. ACM Comput Surv. 2013; 46(1): 10:1–10:53.
[4] Andreopoulos A, Tsotsos JK. 50 Years of object recognition: Directions forward. Computer Vision and Image Understanding. 2013; 117(8): 827–891.
[5] Cruz L, Lucio D, Velho L. Kinect and RGBD Images: Challenges and Applications. SIBGRAPI Conference on Graphics, Patterns and Images Tutorials. 2012: 36–49.
[6] Liu H, Philipose M, Sun M-T. Automatic objects segmentation with RGB-D cameras. Journal of Visual Communication and Image Representation. 2014; 25(4): 709–718.
[7] Lai K, Bo L, Ren X, Fox D. A large-scale hierarchical multi-view RGB-D object dataset. IEEE International Conference on Robotics and Automation (ICRA). 2011: 1817–1824.
[8] Bo L, Ren X, Fox D. Depth kernel descriptors for object recognition. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2011: 821–826.
[9] Bo L, Ren X, Fox D. Unsupervised Feature Learning for RGB-D Based Object Recognition. International Symposium on Experimental Robotics (ISER). 2012.
[10] Tang S, Wang X, Lv X, Han T, Keller J, He Z, et al. Histogram of Oriented Normal Vectors for Object Recognition with a Depth Sensor. In: Lee K, Matsushita Y, Rehg J, Hu Z, editors. Computer Vision – ACCV 2012, vol. 7725. Springer Berlin Heidelberg. 2013: 525–538.
[11] Grauman K, Leibe B. Visual Object Recognition. Synthesis Lectures on Artificial Intelligence and Machine Learning. 2011; 5: 1–181.
[12] Zhang J, Marszałek M, Lazebnik S, Schmid C. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. International Journal of Computer Vision. 2006; 73(2): 213–238.
[13] Li J, Allinson NM. A comprehensive review of current local features for computer vision. Neurocomputing. 2008; 71(10-12): 1771–1787.
[14] Mikolajczyk K, Schmid C. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005; 27(10): 1615–1630.
[15] Mikolajczyk K, Schmid C. Scale & Affine Invariant Interest Point Detectors. International Journal of Computer Vision. 2004; 60(1): 63–86.
[16] Tian DP. A Review on Image Feature Extraction and Representation Techniques. International Journal of Multimedia and Ubiquitous Engineering. 2013; 8(4): 385–396.
[17] Aldoma A, Marton Z-C, Tombari F, Wohlkinger W, Potthast C, Zeisl B, et al. Tutorial: Point Cloud Library: Three-Dimensional Object Recognition and 6 DOF Pose Estimation. IEEE Robotics Automation Magazine. 2012; 19(3): 80–91.
[18] Savarese S, Fei-Fei L. 3D generic object categorization, localization and pose estimation. IEEE 11th International Conference on Computer Vision (ICCV). 2007: 1–8.
[19] Fischler MA, Bolles RC. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM. 1981; 24(6): 381–395.
[20] Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2009: 248–255.
[21] Tuytelaars T, Mikolajczyk K. Local Invariant Feature Detectors: A Survey. Found Trends Comput Graph Vis. 2008; 3(3): 177–280.
[22] Miksik O, Mikolajczyk K. Evaluation of local detectors and descriptors for fast feature matching. 21st International Conference on Pattern Recognition (ICPR). 2012: 2681–2684.
[23] Schmid C, Mohr R, Bauckhage C. Evaluation of interest point detectors. International Journal of Computer Vision. 2000; 37(2): 151–172.
[24] Lowe DG. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision. 2004; 60(2): 91–110.
[25] Bay H, Ess A, Tuytelaars T, Van Gool L. Speeded-Up Robust Features (SURF). Comput Vis Image Underst. 2008; 110(3): 346–359.
[26] Hauptmann AG. Representations of Keypoint-Based Semantic Concept Detection: A Comprehensive Study. IEEE Transactions on Multimedia. 2010; 12(1): 42–53.
[27] Dalal N, Triggs B. Histograms of oriented gradients for human detection. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2005; 1: 886–893.
[28] Torralba A. Contextual Priming for Object Detection. International Journal of Computer Vision. 2003; 53(2): 169–191.
[29] Oliva A, Torralba A. The role of context in object recognition. Trends in Cognitive Sciences. 2007; 11(12): 520–527.
[30] Naji D, Fakir F, Bencharef O, Bouikhalene B, Razouk A. Indexing Of Three Dimensions Objects Using GIST Zernike & PCA Descriptors. IAES International Journal of Artificial Intelligence (IJ-AI). 2013; 2(1): 1–6.
[31] Belongie S, Malik J, Puzicha J. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002; 24(4): 509–522.
[32] Bo L, Ren X, Fox D. Kernel Descriptors for Visual Recognition. Advances in Neural Information Processing Systems. 2010.
[33] Lampert CH. Kernel Methods in Computer Vision. Found Trends Comput Graph Vis. 2009; 4(3): 193–285.
[34] Erhan D, Bengio Y, Courville A. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research. 2010; 11: 625–660.
[35] Bengio Y. Learning Deep Architectures for AI. Found Trends Mach Learn. 2009; 2(1): 1–127.
[36] Bengio Y, Courville A, Vincent P. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2013; 35(8): 1798–1828.
[37] Hinton GE, Osindero S, Teh Y-W. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006; 18(7): 1527–1554.
[38] Salakhutdinov R, Hinton G. Deep Boltzmann Machines. Proceedings of the International Conference on Artificial Intelligence and Statistics. 2009; 5: 448–455.
[39] Lee H, Grosse R, Ranganath R, Ng AY. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th Annual International Conference on Machine Learning (ICML). 2009: 1–8.
[40] Coates A, Lee H, Ng AY. An analysis of single-layer networks in unsupervised feature learning. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011: 215–223.
[41] Johnson AE, Hebert M. Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. IEEE Trans Pattern Anal Mach Intell. 1999; 21(5): 433–449.
[42] Bo L, Sminchisescu C. Efficient Match Kernel between Sets of Features for Visual Recognition. In: Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A, editors. Advances in Neural Information Processing Systems 22. Curran Associates, Inc. 2009: 135–143.
[43] Leung T, Malik J. Representing and Recognizing the Visual Appearance of Materials Using Three-dimensional Textons. Int J Comput Vision. 2001; 43(1): 29–44.
[44] Russell BC, Torralba A, Murphy KP, Freeman WT. LabelMe: A Database and Web-Based Tool for Image Annotation. Int J Comput Vision. 2008; 77(1-3): 157–173.
[45] Chang C-C, Lin C-J. LIBSVM: A Library for Support Vector Machines. ACM Trans Intell Syst Technol. 2011; 2(3): 27:1–27:27.
[46] Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research. 2008; 9: 1871–1874.
[47] Breiman L. Random Forests. Mach Learn. 2001; 45(1): 5–32.
[48] Grauman K. The Pyramid Match Kernel: Efficient Learning with Sets of Features. Journal of Machine Learning Research. 2007; 8: 725–760.
[49] Hyvärinen A, Oja E. Independent component analysis: algorithms and applications. Neural Networks. 2000; 13(4-5): 411–430.
[50] Bo L, Ren X, Fox D. Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms. In: Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, editors. Advances in Neural Information Processing Systems 24. Curran Associates, Inc. 2011: 2115–2123.
[51] Aharon M, Elad M, Bruckstein A. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing. 2006; 54(11): 4311–4322.
[52] Wahyuningrum R, Damayanti F. Efficient Kernel-based 2DPCA for Smile Stages Recognition. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2012; 10(1): 113–118.
[53] Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Networks. In: Bartlett P, Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in Neural Information Processing Systems 25. 2012: 1106–1114.