Indonesian Journal of Electrical Engineering and Computer Science
Vol. 37, No. 3, March 2025, pp. 1797∼1803
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v37.i3.pp1797-1803
Quantitation of new arbitrary view dynamic human action recognition framework

Anh-Dung Ho¹, Huong-Giang Doan²
¹Department of Information Technology, East Asia University of Technology, Ha Noi, Viet Nam
²Faculty of Control and Automation, Electric Power University, Ha Noi, Viet Nam
Article Info

Article history:
Received Jun 21, 2024
Revised Oct 3, 2024
Accepted Oct 7, 2024

Keywords:
Arbitrary view recognition
Convolution neural network
Deep learning
Generative adversarial networks
Human activity recognition
Multiple view recognition
ABSTRACT

Dynamic action recognition has attracted many researchers due to its applications. Nevertheless, it remains a challenging problem because the camera setups in the training phase often differ from those in the testing phase, and/or arbitrary view actions are captured from multiple camera viewpoints. Some recent dynamic gesture approaches focus on multi-view action recognition, but they do not handle novel viewpoints. In this research, we propose a novel end-to-end framework for dynamic gesture recognition from an unknown viewpoint. It consists of three main components: (i) a synthetic video generator with a generative adversarial network (GAN)-based architecture named the ArVi-MoCoGAN model; (ii) a feature extractor, evaluated and compared with various 3D CNN backbones; and (iii) a channel and spatial attention module. The ArVi-MoCoGAN generates synthetic videos at multiple fixed viewpoints from a real dynamic gesture captured at an arbitrary viewpoint. These synthetic videos are then processed in the next component by various three-dimensional (3D) convolutional neural network (CNN) models. The resulting feature vectors are processed in the final part to focus on the attention features of dynamic actions. Our proposed framework is compared with the SOTA approaches in accuracy, extensively discussed and evaluated on four standard dynamic action datasets. The experimental results of our proposed method are higher than recent solutions, by 0.01% to 9.59%, for arbitrary view action recognition.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Huong-Giang Doan
Faculty of Control and Automation, Electric Power University
235 Hoang Quoc Viet, Ha Noi, Viet Nam
Email: giangdth@epu.edu.vn
1. INTRODUCTION

Human activity recognition (HAR) has been an attractive field in computer vision for the past 40 years [1]–[3]. Nevertheless, it still faces many challenges because of limited data, various viewpoints, different scales, illumination conditions, complex backgrounds, and various modalities. To improve the efficiency of action recognition, some researchers try to enlarge the training data using generative adversarial network (GAN) models such as [4]–[6]. Synthetic human action images, similar to the training videos, are created by the generator of the GAN models. Doan and Nguyen [7] generated hand images in multiple views with blender-based and hand glove-based methods. Although this dataset is diverse in viewpoints and samples, it only provides static hand gestures.
Another approach that could ameliorate action recognition results utilizes multi-view cameras. Tran et al. [1] used pairwise-covariance discriminant analysis of multi-view data to improve the robustness of HAR. This research finds a transformation of multi-view actions into a common space; a new action can then be projected into the learned common space. However, this approach is composed of discrete blocks, and it is difficult to deploy as an end-to-end solution. Nguyen and Nguyen [8] proposed a dynamic action recognition method with a ResNet18 backbone, a (2+1)D architecture, a cross-view attention (CVA) module, and an augmentation strategy. This work improved action recognition from image sequences captured at multiple viewpoints, but it also requires many cameras in both the training and testing phases. An end-to-end HAR solution at an arbitrary viewpoint is necessary to deploy a real HAR application because of its simpler testing environment setup. Zhang et al. [9] built a view-invariant transfer dictionary and classifier for novel-view action recognition; two-dimensional (2D) videos are projected into a view-invariant sparse representation. However, dictionary-learning projection is a linear algorithm, which is quite a limitation. Gedamu et al. [10] proposed a method to recognize actions in a certain view, but this solution relies on skeleton images and recognizes static actions.
Inspired by Tran et al. [5], Doan et al. [11] proposed an end-to-end framework for arbitrary view dynamic action recognition that combines the ArVi-MoCoGAN model, a 3D convolutional (C3D) block, and an attention module. Both the generator and the discriminator of ArVi-MoCoGAN are used in the testing phase to create multi-view synthetic actions, which could increase the computational complexity of the system. Furthermore, in that research, C3D is used as the 3D feature extractor of the multi-view synthetic videos. The extracted features are then fed into the attention module to vote for channel attention and to create the final feature vector before passing through the soft-max layer. This attention module does not consider spatial features.
In this work, we propose a new framework for arbitrary view HAR that deals not only with channel attention but also with spatial attention. In addition, this work investigates and compares various three-dimensional convolutional neural network (3D CNN) extractors. In general, our research makes two contributions: (i) we propose a new arbitrary view gesture recognition method; and (ii) we investigate the arbitrary view HAR framework with various 3D CNN extractor backbones.
The remainder of this paper is organized as follows: section 2 explains our proposed framework; the experimental results are then analyzed and discussed in section 3; finally, section 4 presents the conclusion as well as future work.
2. PROPOSED METHOD

Our proposed dynamic action recognition method for unknown viewpoints is illustrated in Figure 1 and consists of four cascaded main blocks: (i) generation of synthetic videos from a real video with the ArVi-MoCoGAN model of [11]; (ii) feature extraction from the synthetic videos using various 3D CNN models; (iii) channel/viewpoint and spatial attention with the convolutional block attention module (CBAM) [12]; and (iv) classification. Our framework is explained in sections 2.1 to 2.4. In addition, section 2.5 presents the multi-view datasets, protocol, and setup parameters for the entire experiment.
Figure 1. Framework of arbitrary view dynamic action recognition
2.1. Generation of the synthetic fixed videos

A common space is first created by the ArVi-MoCoGAN model, which is trained on the fixed real videos (M fixed cameras are set up and capture the training dataset). A novel real video is then projected into this common space to generate M synthetic videos at the M fixed viewpoints. This architecture is similar to the ArVi-MoCoGAN model presented in our previous research [11]. Given a real dynamic action at an unknown view $Z^{r}_{V_i} = [I^{(1)}_{V_i}, \ldots, I^{(N)}_{V_i}]$, where $N$ is the number of frames in the arbitrary real video, the outputs of the trained ArVi-MoCoGAN model are the M synthetic videos $\{Z^{S}_{V_j} = [I^{S,(1)}_{V_j}, \ldots, I^{S,(N)}_{V_j}],\ j = 1, \ldots, M\}$ at the M fixed viewpoints, as illustrated in (1).
$$
M_{ArVi\text{-}MoCoGAN}(Z^{r}_{V_i}) = \{ Z^{S}_{V_j},\ j = 1, \ldots, M \} =
\begin{cases}
Z^{S}_{V_1} = [I^{S,(1)}_{V_1}, \ldots, I^{S,(N)}_{V_1}] \\
Z^{S}_{V_2} = [I^{S,(1)}_{V_2}, \ldots, I^{S,(N)}_{V_2}] \\
\quad \vdots \\
Z^{S}_{V_M} = [I^{S,(1)}_{V_M}, \ldots, I^{S,(N)}_{V_M}]
\end{cases} \qquad (1)
$$
where M equals 5, 4, 7, and 3, corresponding to the number of fixed viewpoints of the MICA Ges, IXMAS, MuHAVi, and NUMA datasets, respectively. Doan et al. [11] used both the synthetic videos and the predicted view probabilities of a novel real video on the fixed views for the HAR phase. In this work, only the M synthetic videos produced by the ArVi-MoCoGAN model are used as inputs of the 3D CNN feature extractors in the next step.
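As a minimal sketch of this generation step (the `ArViMoCoGANGenerator` wrapper, its interface, and all shapes below are illustrative assumptions, not the released model of [11]), mapping one arbitrary-view clip to M fixed-view synthetic clips could look like:

```python
import torch

# Hypothetical wrapper around the trained ArVi-MoCoGAN generator of [11];
# the real model may expose a different interface.
class ArViMoCoGANGenerator(torch.nn.Module):
    def __init__(self, num_views: int):
        super().__init__()
        self.num_views = num_views
        # placeholder layer standing in for the frozen GAN generator
        self.net = torch.nn.Conv3d(3, 3, kernel_size=3, padding=1)

    @torch.no_grad()
    def forward(self, real_clip: torch.Tensor) -> torch.Tensor:
        # real_clip: (1, 3, N, H, W), the arbitrary-view action Z^r_Vi
        # returns:   (M, 3, N, H, W), one synthetic clip per fixed view V_1..V_M
        return torch.stack([self.net(real_clip)[0] for _ in range(self.num_views)])

M = 5                                        # e.g. MICA Ges has 5 fixed viewpoints
generator = ArViMoCoGANGenerator(M).eval()
z_real = torch.randn(1, 3, 16, 224, 224)     # N = 16 frames at 224 x 224
z_synth = generator(z_real)                  # synthetic videos {Z^S_Vj}, j = 1..M
print(z_synth.shape)                         # torch.Size([5, 3, 16, 224, 224])
```

Only the frozen generator is needed at this point, which is why the discriminator does not appear in the sketch.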
2.2. 3D CNN feature extraction

Doan et al. [11] only used the C3D network for feature extraction. In this research, four state-of-the-art (SOTA) 3D CNN models are used to compare the efficiency of our end-to-end HAR model: C3D [13], ResNet50-3D [14], RNN [15], [16], and spatial-temporal attention [17]. These models are deployed as follows:
− The C3D network was introduced in [13] with a visual geometry group (VGG) backbone and has become an efficient spatial-temporal 3D CNN method for action recognition. In this work, the 2048-D feature vector is extracted from the FC7 layer. We use the network with batch normalization after every Conv layer, pre-trained on Sports-1M and fine-tuned on the Kinetics dataset [18].
− ResNet50-3D is built with a ResNet50 backbone and 3D Conv layers. The spatial-temporal feature vector is taken after the global average pooling layer, and its dimension is 2048-D. The Kinetics pre-trained weights are applied as in [18].
− ResNet50-TP (ResNet50 with temporal attention) also uses a ResNet50 backbone together with TP. The output feature vector of the ResNet50-TP model is 2048-D. The Kinetics pre-trained weights are applied as in [18].
− A recurrent neural network (RNN) architecture [19] combined with the InceptionNetV3 model [20] is applied as the dynamic action feature extractor. The dimension of its feature vector is 512-D. The model is also fine-tuned on the Kinetics dataset [18] over all layers of this feature extractor.
In this paper, the 3D CNN models are used as spatial-temporal feature extractors. The inputs of the 3D CNN extractors are the fixed multi-view synthetic videos $\{Z^{S}_{V_j} \mid j = 1, \ldots, M\}$, which are the outputs of the ArVi-MoCoGAN model (section 2.1). The outputs of the 3D CNN extractor are the feature vectors $\{F^{3DCNN}_{V_j} \in \Re^{1 \times K} \mid j = 1, \ldots, M\}$, as illustrated in (2).
$$
M_{3DCNN}(Z^{S}_{V_j},\ j = 1, \ldots, M) =
\begin{cases}
M^{V_1}_{3DCNN}(Z^{S}_{V_1}) = F^{3DCNN}_{V_1}[1 \times K] = [F^{(1)}_{V_1}, \ldots, F^{(K)}_{V_1}] \\
M^{V_2}_{3DCNN}(Z^{S}_{V_2}) = F^{3DCNN}_{V_2}[1 \times K] = [F^{(1)}_{V_2}, \ldots, F^{(K)}_{V_2}] \\
\quad \vdots \\
M^{V_M}_{3DCNN}(Z^{S}_{V_M}) = F^{3DCNN}_{V_M}[1 \times K] = [F^{(1)}_{V_M}, \ldots, F^{(K)}_{V_M}]
\end{cases} \qquad (2)
$$
where K equals 512 for the RNN feature extractor and 2048 for the C3D, ResNet50-3D, and ResNet50-TP feature extractors.
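The per-view feature extraction can be sketched as follows; torchvision's r3d_18 stands in for the backbones above (an assumption for illustration only), and its 512-D pooled output plays the role of K here:

```python
import torch
from torchvision.models.video import r3d_18

# Stand-in 3D CNN extractor (the paper evaluates C3D, ResNet50-3D, ResNet50-TP
# and an RNN extractor; r3d_18 is used here only because it ships with torchvision).
backbone = r3d_18(weights=None)
backbone.fc = torch.nn.Identity()            # expose the pooled 512-D feature
backbone.eval()

M, N = 5, 16                                 # M fixed views, N frames per clip
z_synth = torch.randn(M, 3, N, 112, 112)     # synthetic clips {Z^S_Vj} from section 2.1

with torch.no_grad():
    # one K-D feature vector per view; here K = 512 (K = 2048 for the C3D and
    # ResNet50 extractors described above)
    features = backbone(z_synth)             # shape (M, K) = (5, 512)
print(features.shape)
```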
2.3. Attention module

The feature vectors of the synthetic videos $\{Z^{S}_{V_j} \mid j = 1, \ldots, M\}$ on the multiple viewpoints, $F^{3DCNN} = \{M_{3DCNN}(Z^{S}_{V_j}) \mid j = 1, \ldots, M\} = \{F^{3DCNN}_{V_j} \in \Re^{1 \times K} \mid j = 1, \ldots, M\}$, are normalized and composed into $F \in \Re^{M \times 1 \times K}$, as illustrated in (3).
$$
F = \left[ F^{3DCNN}_{V_1}[1 \times K], \ldots, F^{3DCNN}_{V_M}[1 \times K] \right] =
\begin{bmatrix}
F^{(1)}_{V_1} & F^{(1)}_{V_2} & \cdots & F^{(1)}_{V_M} \\
F^{(2)}_{V_1} & F^{(2)}_{V_2} & \cdots & F^{(2)}_{V_M} \\
\vdots & \vdots & \ddots & \vdots \\
F^{(K)}_{V_1} & F^{(K)}_{V_2} & \cdots & F^{(K)}_{V_M}
\end{bmatrix} \qquad (3)
$$
The input of this module is $F \in \Re^{M \times 1 \times K}$, where each feature vector element $F^{3DCNN}_{V_j} \in \Re^{1 \times K}$ of $F$ is considered as one channel of the channel attention module. The channel attention module infers a 1-D channel attention map $a_c \in \Re^{M \times 1 \times 1} = [a^{(1)}_c, a^{(2)}_c, \ldots, a^{(M)}_c]$. The output of the channel attention part, $F_c \in \Re^{1 \times K}$, is then calculated as illustrated in (4).
$$
F_c = a_c \otimes F = \frac{\sum_{j=1}^{M} \left( a^{(j)}_c + 1 \right) \ast F^{3DCNN}_{V_j}}{M} \qquad (4)
$$
where $\otimes$ denotes element-wise multiplication. Each feature vector is paired with its corresponding attention value, which is broadcast along the spatial dimension before being combined. $F_c \in \Re^{1 \times K}$ is then passed through the spatial attention module, where a spatial attention map $a_s \in \Re^{1 \times 1 \times K}$ is computed. It is used to calculate the output of the CBAM module, $F_{CBAM} = F_s \in \Re^{1 \times K}$, as illustrated in (5).
$$
F_{CBAM} = F_s = a_s \otimes F_c = a_s \otimes a_c \otimes F \qquad (5)
$$
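A simplified sketch of this attention step following (3)–(5) is given below; the internal pooling and MLP sizes are assumptions and do not reproduce the exact CBAM configuration of [12]:

```python
import torch
import torch.nn as nn

class ViewChannelSpatialAttention(nn.Module):
    """Sketch of the attention in (3)-(5): each of the M per-view feature
    vectors is treated as one channel, then a per-dimension (spatial) attention
    is applied to the fused K-D vector. MLP sizes are illustrative assumptions."""

    def __init__(self, num_views: int, feat_dim: int, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(num_views, max(num_views // 2, 1)), nn.ReLU(),
            nn.Linear(max(num_views // 2, 1), num_views))
        self.spatial_mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // reduction), nn.ReLU(),
            nn.Linear(feat_dim // reduction, feat_dim))

    def forward(self, F: torch.Tensor) -> torch.Tensor:
        # F: (M, K) stacked per-view features, as in (3)
        # channel attention a_c over the M views (each view pooled to a scalar)
        a_c = torch.sigmoid(self.channel_mlp(F.mean(dim=1)))       # (M,)
        # (4): fuse the views, F_c = (1/M) * sum_j (a_c[j] + 1) * F_j
        F_c = ((a_c + 1).unsqueeze(1) * F).mean(dim=0)             # (K,)
        # (5): spatial attention a_s over the K feature dimensions
        a_s = torch.sigmoid(self.spatial_mlp(F_c))                 # (K,)
        return a_s * F_c                                           # F_CBAM, (K,)

attn = ViewChannelSpatialAttention(num_views=5, feat_dim=2048)
f_cbam = attn(torch.randn(5, 2048))                                # -> shape (2048,)
```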
2.4. Classification

The output feature vector $F_{CBAM}$ of the CBAM module, as shown in (5), is flattened and passed through a fully connected layer before a Softmax layer performs the classification. In this research, the Softmax cross-entropy loss function is applied to train and test the entire network. Given an arbitrary view dynamic action $(Z_{V_i},\ i \neq j)$, its predicted result is $\bar{p}_i$ and its ground truth is $p_i$. Thus, the loss function is calculated as illustrated in (6).
$$
L_{softmax} = -\frac{1}{K} \sum_{i=1}^{K} p_i \log \bar{p}_i \qquad (6)
$$
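A minimal sketch of the classification head and loss (the class count and layer sizes are assumed; `nn.CrossEntropyLoss` folds the softmax and the log term of (6) into one call):

```python
import torch
import torch.nn as nn

K, num_classes = 2048, 6                  # feature size and an assumed class count
classifier = nn.Linear(K, num_classes)    # FC layer producing class logits
criterion = nn.CrossEntropyLoss()         # softmax cross-entropy, cf. (6)

f_cbam = torch.randn(4, K)                # a batch of attended feature vectors
labels = torch.randint(0, num_classes, (4,))
loss = criterion(classifier(f_cbam), labels)
loss.backward()                           # gradients flow into the FC layer (and any unfrozen extractor)
```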
2.5. Datasets, protocol and setup parameters

Dataset: in this research, four benchmark datasets are used, consisting of MICA Ges [21], IXMAS [3], MuHAVi [22], and NUMA [6], which contain 1,500, 1,584, 3,038, and 1,475 videos, respectively. They are multi-view dynamic action datasets, described in detail in [11].
Protocol: the arbitrary view evaluation protocol of [11] is used to test our framework on a single dataset. Each view is held out in turn and treated as the arbitrary viewpoint, while the remaining views are used as the fixed viewpoints. Testing follows this leave-one-view-out protocol over all viewpoints $(V_j,\ j = 1, \ldots, M)$ to obtain the final result.
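The protocol can be sketched as a simple loop, where `train_and_evaluate` is a hypothetical helper and the final score is taken here as the average over the M splits:

```python
# Leave-one-view-out sketch: each viewpoint plays the arbitrary test view once,
# the remaining M-1 views act as the fixed training views, and the final score
# is taken here as the average over the M splits.
def leave_one_view_out(views, train_and_evaluate):
    scores = []
    for held_out in views:
        train_views = [v for v in views if v != held_out]
        scores.append(train_and_evaluate(train_views, held_out))   # hypothetical helper
    return sum(scores) / len(scores)

# Example with the 5 MICA Ges viewpoints and a dummy evaluator.
accuracy = leave_one_view_out(
    views=[1, 2, 3, 4, 5],
    train_and_evaluate=lambda train_views, test_view: 0.9)
```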
Setup parameters: our model is deployed in two stages. First, the generator and discriminator of the ArVi-MoCoGAN model are trained. Then, all layers of the ArVi-MoCoGAN generator are used and frozen while training the arbitrary view dynamic action recognition framework shown in Figure 1. The learning rate is $5 \times 10^{-5}$, the optimizer is Adam, the batch size is 32 images, the loss function is cross-entropy, and the input image size is 224 × 224 pixels. Quantitative results are compared in section 3.
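The stated hyper-parameters map to a PyTorch setup roughly as follows; the model, data pipeline, and generator handle are placeholders, and only the learning rate, optimizer, batch size, loss, and input size come from the text:

```python
from torch import nn, optim
from torchvision import transforms

# Hyper-parameters stated above; everything else is a placeholder.
learning_rate = 5e-5
batch_size = 32
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),           # input image size 224 x 224 pixels
    transforms.ToTensor(),
])

model = nn.Linear(2048, 6)                   # placeholder for the HAR network of Figure 1
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()            # cross-entropy loss

# Stage two: the trained ArVi-MoCoGAN generator is kept frozen, e.g.
# for p in generator.parameters():
#     p.requires_grad = False
```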
3. EXPERIMENTAL RESULTS

The evaluation schemes are written in Python on the PyTorch deep learning framework and run on a workstation with an 11 GB NVIDIA GPU. The experiments are conducted to address the following questions: (i) the accuracy of our arbitrary view action recognition framework using various 3D CNN backbones; (ii) the parameters of the various arbitrary view action models; and (iii) a comparison of our best action recognition models with SOTA HAR methods.
3.1. Arbitrary view gesture recognition with various 3D CNN feature extractors

In this section, our novel view action recognition framework is evaluated with different 3D CNN backbones: C3D, ResNet50-3D, RNN, and ResNet50-TP. It is tested on various benchmark multi-view action datasets, consisting of the MICA Ges, NUMA, IXMAS, and MuHAVi datasets. The results are presented in Figure 2. It is evident that our arbitrary view action recognition framework with the C3D backbone obtains the best accuracy on the MICA Ges, NUMA, IXMAS, and MuHAVi datasets, at 96.6%, 92.79%, 93.01%, and 99.05%, respectively. These results are far higher than those of the remaining 3D CNN feature extractors (ResNet50-3D, RNN, and ResNet50-TP). The RNN feature extractor has the lowest accuracy, at 72.54% on MICA Ges and 46.02% on NUMA, while the ResNet50-TP backbone achieves the lowest accuracy on IXMAS (65.5%) and MuHAVi (80%). Thus, the arbitrary view action recognition framework is considered and compared on other factors in the following sections.
Figure 2. The accuracy of arbitrary view gesture recognition with various 3D CNN feature extractors
3.2. Summary of the arbitrary view dynamic action recognition models

This section summarizes the params, FLOPs, time cost, and model size of the various arbitrary view dynamic HAR models with different 3D CNN backbones, trained on the MICA Ges dataset, as illustrated in Table 1. Params is the number of trained model parameters. FLOPs is the number of floating point operations required by the trained model. Time cost is the total time the model takes to process an input from start to finish. Model size refers to the size of the file that stores the model trained on a certain dataset. The parameter calculation of the arbitrary view HAR system (Figure 1) is divided into two parts: the ArVi-MoCoGAN model (second row of Table 1) and the 3D CNN plus CBAM models (the remaining rows of Table 1). This table highlights the following issues:
− The ArVi-MoCoGAN model has 10.08 (M) params, 90.75 (G) FLOPs, 0.218 (s) time cost, and a 40.39 (MB) model size. These results indicate that the number of params and the model size of the ArVi-MoCoGAN model are smaller, but its time cost and FLOPs are higher, than those of the remaining parts of the end-to-end HAR framework.
− Comparing the 3D CNN extractor plus CBAM combinations, it is apparent that C3D + CBAM (third row of Table 1) has the smallest time cost, at only 0.105 (s), which is clearly smaller than the 0.177 (s) of ResNet50-3D + CBAM, the 0.211 (s) of ResNet50-TP + CBAM, and the 0.171 (s) of RNN + CBAM. Its params, FLOPs, and model size, however, are larger than those of the remaining 3D CNN backbones, at 73.37 (M), 38.70 (G), and 293.52 (MB), respectively. Although the C3D extractor has high params, FLOPs, and model size, it gives the smallest time cost and the best HAR accuracy (section 3.1). Thus, this is a trade-off worth considering for a real application.
As a result, our end-to-end arbitrary view HAR system using C3D has a total time cost of 0.323 (s), equivalent to about 3 fps, while obtaining the best accuracy, 96.6%, on the MICA Ges dataset. These results are acceptable for deploying a real application.
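For reference, the params, model size, and time cost of the kind reported in Table 1 can be measured along these lines (the network is a placeholder, and FLOP counting, which usually relies on an external profiler, is omitted):

```python
import time
import torch
from torchvision.models.video import r3d_18

model = r3d_18(weights=None).eval()          # placeholder network, not the paper's models

# Params (M): number of model parameters, in millions.
params_m = sum(p.numel() for p in model.parameters()) / 1e6

# Model size (MB): rough size of the stored float32 weights.
size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 2**20

# Time cost (s): wall-clock time of one forward pass on a dummy clip.
clip = torch.randn(1, 3, 16, 112, 112)
start = time.perf_counter()
with torch.no_grad():
    model(clip)
time_cost = time.perf_counter() - start

print(f"{params_m:.2f} M params, {size_mb:.2f} MB, {time_cost:.3f} s")
```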
3.3. Comparison with the SOTA arbitrary view gesture recognition

In this section, we compare our best accuracy results with some SOTA methods on four benchmark datasets, as illustrated in Table 2. A glance at Table 2 shows that our method obtains higher accuracy on three of the published datasets than recent HAR methods: the 93.01% on IXMAS is higher than the 87.25% of
[11], the 79.4% of [23], and the 79.9% of [24]. On the MICA Ges dataset, our method reaches 96.6%, which is better than [8], [11], and [25] by 3.72% to 7.89% in accuracy. On the MuHAVi dataset, our approach also obtains the highest accuracy, at 99.05%, which is 0.78% higher than [11] and 5.45% higher than [23]. On the NUMA dataset, our accuracy reaches 92.79%, which is slightly lower than [11] and [25], by 1.72% and 1.02%, while it is clearly better than the remaining methods [8], [26]–[28], by 0.01% to 9.59%. This result once again indicates that our proposed solution is more efficient than recent methods in dynamic action recognition accuracy.
Table 1. Parameters of an arbitrary view dynamic action recognition model trained on the MICA Ges dataset

                        Params (M)   FLOPs (G)   Time cost (s)   Model size (MB)
ArVi-MoCoGAN            10.08        90.75       0.218           40.39
C3D + CBAM              73.37        38.70       0.105           293.52
ResNet50-3D + CBAM      55.43        10.15       0.177           222.04
ResNet50-TP + CBAM      23.55        17.39       0.211           94.51
RNN + CBAM              28.78        17.47       0.171           115.38
Table 2. Comparison of arbitrary view action recognition accuracy (%) using SOTA methods

                             IXMAS   MICA Ges   MuHAVi   NUMA
WLE [24]                     79.9    -          -        -
SAM [26]                     -       -          -        83.2
TSN [29]                     -       -          -        90.3
DA-Net [27]                  -       -          -        92.1
Multi-Br TSN-GRU [25]        -       88.71      -        93.81
R34(2+1)D with CVA [8]       -       91.71      -        92.78
DA + ELM + aug [23]          79.4    -          93.6     -
ViewCon + MOCOv2 [28]        -       -          -        91.7
ArVi-MoCoGAN + C3D [11]      87.25   92.88      98.27    94.51
Our                          93.01   96.60      99.05    92.79
4. CONCLUSION

In this research, a new arbitrary view HAR framework is proposed which combines cascaded blocks including an ArVi-MoCoGAN network, 3D CNN feature extractors, and a CBAM unit. Our method is deployed and evaluated with various 3D CNN models: C3D, ResNet50-3D, ResNet50-TP, and RNN. Our experiments are conducted on different benchmark datasets and show that using the C3D backbone obtains the best accuracy. In addition, our proposed framework achieves higher efficiency than SOTA novel view action recognition methods on most benchmark datasets, by up to 9.59%.
REFERENCES
[1] H.-N. Tran, H.-Q. Nguyen, H.-G. Doan, T.-H. Tran, T.-L. Le, and H. Vu, "Pairwise-covariance multi-view discriminant analysis for robust cross-view human action recognition," IEEE Access, vol. 9, pp. 76097–76111, 2021, doi: 10.1109/ACCESS.2021.3082142.
[2] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, "Hand gesture recognition with 3D convolutional neural networks," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 1–7, doi: 10.1109/CVPRW.2015.7301342.
[3] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Computer Vision and Image Understanding, vol. 104, no. 2–3, pp. 249–257, 2006, doi: 10.1016/j.cviu.2006.07.013.
[4] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, "MoCoGAN: decomposing motion and content for video generation," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 1526–1535, doi: 10.1109/CVPR.2018.00165.
[5] T. H. Tran, V. D. Bach, and H. G. Doan, "vi-MoCoGAN: a variant of MoCoGAN for video generation of human hand gestures under different viewpoints," in Proceedings of the Pattern Recognition: ACPR, 2020, vol. 1180 CCIS, pp. 110–123, doi: 10.1007/978-981-15-3651-9_11.
[6] L. Wang, Z. Ding, Z. Tao, Y. Liu, and Y. Fu, "Generative multi-view human action recognition," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2019, pp. 6211–6220, doi: 10.1109/ICCV.2019.00631.
[7] H.-G. Doan and N.-T. Nguyen, "New blender-based augmentation method with quantitative evaluation of CNNs for hand gesture recognition," Indonesian Journal of Electrical Engineering and Computer Science (IJEECS), vol. 30, no. 2, pp. 796–806, May 2023, doi: 10.11591/ijeecs.v30.i2.pp796-806.
[8] H.-T. Nguyen and T.-O. Nguyen, "Attention-based network for effective action recognition from multi-view video," Procedia Computer Science, vol. 192, pp. 971–980, 2021, doi: 10.1016/j.procs.2021.08.100.
[9] J. Zhang, H. P. H. Shum, J. Han, and L. Shao, "Action recognition from arbitrary views using transferable dictionary learning," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4709–4723, Oct. 2018, doi: 10.1109/TIP.2018.2836323.
[10] K. Gedamu, Y. Ji, Y. Yang, L. Gao, and H. T. Shen, "Arbitrary-view human action recognition via novel-view action generation," Pattern Recognition, vol. 118, p. 108043, Oct. 2021, doi: 10.1016/j.patcog.2021.108043.
[11] H.-G. Doan, H.-Q. Luong, and T. T. T. Pham, "An end-to-end model of ArVi-MoCoGAN and C3D with attention unit for arbitrary-view dynamic gesture recognition," International Journal of Advanced Computer Science and Applications, vol. 15, no. 3, 2024, doi: 10.14569/IJACSA.2024.01503122.
[12] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, "CBAM: convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19, doi: 10.1007/978-3-030-01234-2_1.
[13] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in 2015 IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 4489–4497, doi: 10.1109/ICCV.2015.510.
[14] K. Hara, H. Kataoka, and Y. Satoh, "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 6546–6555, doi: 10.1109/CVPR.2018.00685.
[15] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997, doi: 10.1109/78.650093.
[16] J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition," in Computer Vision–ECCV 2016: 14th European Conference, 2016, pp. 816–833, doi: 10.1007/978-3-319-46487-9_50.
[17] J. Wang and X. Wen, "A spatio-temporal attention convolution block for action recognition," Journal of Physics: Conference Series, vol. 1651, no. 1, p. 012193, Nov. 2020, doi: 10.1088/1742-6596/1651/1/012193.
[18] W. Kay et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017, [Online]. Available: http://arxiv.org/abs/1705.06950.
[19] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv preprint arXiv:1402.1128, 2014, [Online]. Available: http://arxiv.org/abs/1402.1128.
[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 2818–2826, doi: 10.1109/CVPR.2016.308.
[21] H. G. Doan et al., "Multi-view discriminant analysis for dynamic hand gesture recognition," in Pattern Recognition. ACPR 2019. Communications in Computer and Information Science, 2020, pp. 196–210, doi: 10.1007/978-981-15-3651-9_18.
[22] F. Murtaza, M. H. Yousaf, and S. A. Velastin, "Multi-view human action recognition using 2D motion templates based on MHIs and their HOG description," IET Computer Vision, vol. 10, no. 7, pp. 758–767, Oct. 2016, doi: 10.1049/iet-cvi.2015.0416.
[23] N. Nida, M. H. Yousaf, A. Irtaza, and S. A. Velastin, "Video augmentation technique for human action recognition using genetic algorithm," ETRI Journal, vol. 44, no. 2, pp. 327–338, Apr. 2022, doi: 10.4218/etrij.2019-0510.
[24] J. Liu, M. Shah, B. Kuipers, and S. Savarese, "Cross-view action recognition via view knowledge transfer," in Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2011, pp. 3209–3216, doi: 10.1109/CVPR.2011.5995729.
[25] A.-V. Bui and T.-O. Nguyen, "Multi-view human action recognition based on TSN architecture integrated with GRU," Procedia Computer Science, vol. 176, pp. 948–955, 2020, doi: 10.1016/j.procs.2020.09.090.
[26] S. Mambou, O. Krejcar, K. Kuca, and A. Selamat, "Novel cross-view human action model recognition based on the powerful view-invariant features technique," Future Internet, vol. 10, no. 9, pp. 1–17, 2018, doi: 10.3390/fi10090089.
[27] D. Wang, W. Ouyang, W. Li, and D. Xu, "Dividing and aggregating network for multi-view action recognition," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 457–473, doi: 10.1007/978-3-030-01240-3_28.
[28] K. Shah, A. Shah, C. P. Lau, C. M. de Melo, and R. Chellappa, "Multi-view action recognition using contrastive learning," in 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Jan. 2023, pp. 3370–3380, doi: 10.1109/WACV56688.2023.00338.
[29] L. Wang et al., "Temporal segment networks: towards good practices for deep action recognition," in European Conference on Computer Vision (ECCV), 2016, pp. 20–36.
BIOGRAPHIES OF AUTHORS

Anh-Dung Ho received the B.E. degree in Applied Mathematics and Informatics in 2001 and the M.E. in Computer Science in 2007, both from Hanoi University of Science and Technology, Ha Noi, Vietnam. He can be contacted at email: dungha@eaut.edu.vn.
Huong-Giang Doan received the B.E. degree in Instrumentation and Industrial Informatics in 2003, the M.E. in Instrumentation and Automatic Control System in 2006, and the Ph.D. in Control Engineering and Automation in 2017, all from Hanoi University of Science and Technology, Ha Noi, Vietnam. She can be contacted at email: giangdth@epu.edu.vn.