Indonesian Journal of Electrical Engineering and Computer Science
Vol. 37, No. 3, March 2025, pp. 1616–1625
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v37.i3.pp1616-1625
Leveraging 3D convolutional networks for effective video feature extraction in video summarization
Bhakti Deepak Kadam1,2, Ashwini Mangesh Deshpande1
1Department of Electronics and Telecommunication Engineering, MKSSS's Cummins College of Engineering for Women, Pune, India
2Department of Electronics and Telecommunication Engineering, SCTR's Pune Institute of Computer Technology, Pune, India
Article Info
Article history:
Received Apr 9, 2024
Revised Sep 13, 2024
Accepted Sep 30, 2024

Keywords:
3D convolution
Deep neural networks
Feature representation
Pretrained networks
Video summarization
ABSTRACT
Video feature extraction is pivotal in video processing, as it encompasses the extraction of pertinent information from video data. This process enables a more streamlined representation, analysis, and comprehension of video content. Given its advantages, feature extraction has become a crucial step in numerous video understanding tasks. This study investigates the generation of video representations utilizing three-dimensional (3D) convolutional neural networks (CNNs) for the task of video summarization. The feature vectors are extracted from the video sequences using pretrained two-dimensional (2D) networks such as GoogleNet and ResNet, along with 3D networks like the 3D Convolutional Network (C3D) and the Two-Stream Inflated 3D Convolutional Network (I3D). To assess the effectiveness of video representations, F1-scores are computed with the generated 2D and 3D video representations for chosen generic and query-focused video summarization techniques. The experimental results show that using feature vectors from 3D networks improves F1-scores, highlighting the effectiveness of 3D networks in video representation. It is demonstrated that 3D networks, unlike 2D ones, incorporate the time dimension to capture spatiotemporal features, providing better temporal processing and offering comprehensive video representation.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Bhakti Deepak Kadam
Department of Electronics and Telecommunication Engineering
MKSSS's Cummins College of Engineering for Women
Pune, Maharashtra, India
Email: bhakti.kadam@cumminscollege.in
1. INTRODUCTION
Video feature extraction is a foundational aspect of computer vision, designed to overcome the challenges presented by the intricate and extensive nature of video data. Its purpose is to facilitate analyses that are not only more efficient but also more interpretable and accurate. Effective comprehension of videos demands representation at multiple levels, necessitating an appropriate video representation. Effective video processing relies on video feature extraction for the following reasons: (i) reducing the complexity of video data for more manageable analysis, (ii) simplifying the interpretation of underlying information for both humans and deep learning models, (iii) condensing meaningful information while retaining essential characteristics, (iv) enhancing the model's ability to generalize patterns for accurate predictions, and (v) efficient utilization of computational resources by focusing on relevant aspects.
Video feature extraction involves selecting and/or combining variables to generate feature vectors. Feature vectors of a video are numerical representations that capture various attributes and characteristics of the video content in a structured format. These vectors can encapsulate various visual, spatial, temporal, motion, and audio attributes of the video content. These vectors facilitate effective analysis, understanding, and processing of videos in numerous applications. Feature extraction efficiently reduces the volume of data that needs processing while maintaining accuracy in representing videos. The combination of video segmentation and feature extraction is instrumental in mitigating computational overhead by streamlining the preprocessing across all frames in the video. When analyzing videos for computer vision tasks, a variety of features play a pivotal role. The different video features utilized for video understanding are illustrated in Figure 1. These diverse features contribute to a holistic comprehension of video content, facilitating various applications within the field of computer vision [1].
The video feature types shown in Figure 1 are:
− Spatial features: extracted from still video frames. The spatial features distinguish between static and dynamic contexts encompassing the position of objects. They explore the object's relative spatial area and semantic relationships with other objects in each video frame.
− Temporal features (optical flow): involves the extraction of action or object movement. Optical flow represents the motion pattern of an object across consecutive frames, requiring the estimation of per-pixel motion between them.
− Textual features: focuses on the detection and recognition of textual content within video frames, categorizing texts as scene text or caption text [2].
− Trajectory extraction: a dense representation of video obtained through optical flow algorithms allows for the extraction of dense trajectories [3]. These trajectories provide valuable information characterizing the appearance and motion of objects.
− Content-based feature extraction: involves the extraction of feature vectors based on the video's context and tailored to the specific task at hand, such as video summarization or video captioning.

Figure 1. Types of video features
Features can be extracted using both classical methods that rely on local, hand-crafted features and advanced techniques involving deep neural networks, as detailed in section 2. This research explores the utilization of three-dimensional convolutional neural networks (3D CNNs) to enhance video representations. A comparative analysis is conducted on conventional video summarization and query-focused video summarization techniques. Our contributions are as follows: i) the video features are extracted utilizing pretrained 3D CNNs (C3D and inflated 3D (I3D)); ii) specific baseline algorithms for generic and query-focused video summarization are chosen and the F1-scores for the generated video representations are computed; and iii) the performance is evaluated by comparing pretrained two-dimensional (2D) and 3D convolutional networks for feature extraction in terms of the calculated F1-scores.
Given its numerous advantages, feature extraction stands as a fundamental process in many research applications, such as:
− Video classification: the task of assigning one or more global labels to the video. The proper extraction of features from the input video leads to the prediction of accurate frame labels that describe the entire video [4].
− Action recognition: action recognition in videos aims to infer the actions of one or more persons in the video. Spatial and long-range temporal feature extraction is necessary for human activity or action recognition [5].
− Video understanding: the task of recognition and localization of different actions or events occurring in the video. As the localization is in both spatial and temporal dimensions, this task requires spatiotemporal feature extraction [6].
− Video captioning: the task of generating automatic captions for a video. This leads to efficient information retrieval from the video in the form of text. As captioning is the textual description of the video, it needs extraction of more complex features [7].
− Simultaneous localization and mapping (SLAM): a method used for autonomous vehicles that develops a map and localizes the vehicle in the same map [8]. In SLAM, spatial and motion features need to be extracted and matched for localization and obstacle detection.
− Video summarization: a process of generating a temporally condensed version of the input video. Video representations at multiple levels are necessary for spatiotemporal modelling due to the long durations of videos [9]–[11].
These applications make video feature extraction a valuable research topic for study. The structure of the paper is as follows: section 2 discusses the related work on video feature extraction. Section 3 elaborates on the use of 3D convolutional networks employed for extracting video features. The experimental results and analysis are discussed in section 4, and section 5 provides conclusions.
2. RELATED WORK
This section explores the existing video feature extraction techniques employed in summarization methodologies in the literature. Video summarization and feature extraction represent longstanding research areas in computer vision. Video feature vectors can be extracted using classical vision techniques focusing on hand-crafted features as well as deep neural networks [1]. The classification of feature extraction techniques is provided in Figure 2. With advancements in deep learning, video feature extraction has also leveraged these technologies. Key trends propelling the field forward include the integration of multimodal information, the development of self-supervised learning techniques, and the exploration of novel architectures such as transformers. The deep learning based feature extraction techniques have outperformed the classical vision techniques. These models are effectively utilized in various research domains [1]. The merits of deep learning based techniques include:
− Extraction of complex and abstract features by feature engineering: feature engineering deals with the extraction of features from natural data. The spatiotemporal models utilize state-of-the-art feature engineering models to extract more complex features from videos.
− Feature extraction for unstructured data: deep neural networks can handle unstructured data better than hand-crafted features by training on various abstract features.
− Unsupervised feature learning: the process of labelling the available data is expensive and time-consuming. This process is more challenging when it is extended to videos. The traditional techniques do not perform well on unsupervised data, but spatiotemporal models can be efficiently used with unlabelled data.
− High-quality results: the semantic relationships between objects and their motion patterns are also explored while extracting the features using modern machine vision techniques. This leads to improvement in the quality of results in different computer vision tasks.
Most of the summarization methods employ 2D CNNs such as GoogleNet and the Residual Network (ResNet) to extract video features. GoogleNet, also known as Inception V1, was presented in 2014 [12], while ResNet was introduced in 2015 [13]. In video summarization frameworks, GoogleNet and ResNet pretrained on the ImageNet dataset [14] are widely employed for feature extraction from input video sequences. 2D CNNs face several challenges in video feature extraction due to their limitations in handling temporal information:
− Lack of temporal awareness: 2D CNNs process each frame independently, missing temporal relationships crucial for understanding motion and events.
− Handling motion: they struggle with dynamic content and the complexity of integrating optical flow.
− Spatiotemporal features: they capture only spatial features, lacking the rich spatiotemporal context needed for tasks like action recognition.
− Multi-modal integration: combining visual features with audio and text is challenging without inherent temporal modeling.
This study proposes the use of 3D CNNs for video feature extraction to overcome these limitations.
The video feature extraction techniques shown in Figure 2 fall into two groups.
Classical vision techniques:
− Histogram of gradient (HOG) and histogram of optical flow (HOF): these encoding techniques are used to extract features for the most common tasks of object detection and activity recognition.
− Space–time interest points (STIP): the space–time interest points can be used to model spatiotemporal features and dynamic motion patterns.
− Scale-invariant feature transform (SIFT): the local still-image spatial cues in videos can be captured by using classical image-based descriptors such as SIFT.
− Dense trajectory: the dense trajectory approach can be used to model local spatial cues and global motion cues.
Deep learning based techniques:
− Convolutional neural network (CNN): 2D and 3D CNNs are effectively used for extracting spatiotemporal features and short-term motion cues from raw video data.
− Recurrent neural network (RNN): RNNs are also used to extract short-term and long-term motion patterns from videos.
− Long short-term memory (LSTM): long-term motion cues can be modelled using LSTMs.
− Generative adversarial network (GAN): spatiotemporal features can be extracted from videos using GANs.
− Regularized feature fusion models: the feature scores extracted from different network layers and levels are combined using feature fusion techniques.

Figure 2. Classification of video feature extraction techniques
3. METHOD
This section presents the merits of utilizing 3D convolution for video comprehension, along with the application of 3D CNNs to capture features from video sequences in summarization algorithms. A video consists of many segments, with each segment comprising shots, and these shots are composed of sequences of frames. For a comprehensive understanding of the videos, it is necessary to learn feature representations at different levels. To extract feature vectors at different levels, the video is divided into small, non-intersecting shots. After segmenting the video, features are extracted using pretrained 3D convolutional networks. Figure 3 illustrates the extraction of video features using a pretrained 3D CNN [15].
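As a concrete illustration of this step, the following minimal sketch extracts one feature vector per non-intersecting shot with a pretrained 3D CNN. It is not the paper's exact pipeline: torchvision does not ship C3D or I3D weights, so the Kinetics-400-pretrained r3d_18 model is used as a stand-in, and the shot list, tensor sizes, and helper name are illustrative assumptions.

```python
# Minimal sketch: one feature vector per shot from a pretrained 3D CNN.
# Assumption: torchvision's Kinetics-400 r3d_18 stands in for C3D/I3D,
# which are the networks used in the paper but are not bundled with torchvision.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

model = r3d_18(weights="DEFAULT")  # Kinetics-400 pretrained weights
model.fc = nn.Identity()           # keep the 512-d penultimate representation
model.eval()

def extract_shot_features(shots):
    """shots: list of tensors shaped (3, T, 112, 112), one per non-intersecting shot."""
    feats = []
    with torch.no_grad():
        for clip in shots:
            feats.append(model(clip.unsqueeze(0)).squeeze(0))  # (512,) per shot
    return torch.stack(feats)      # (num_shots, 512)

# Four dummy 16-frame shots; a real pipeline would decode and normalize frames first.
dummy_shots = [torch.rand(3, 16, 112, 112) for _ in range(4)]
print(extract_shot_features(dummy_shots).shape)  # torch.Size([4, 512])
```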
3.1. 2D and 3D convolution
The fundamental difference between 2D and 3D convolution lies in the dimensionality of the input data that each processes. Generally, 2D convolution is employed on two-dimensional data, like images. This convolutional process entails moving a 2D kernel/filter across the input image, conducting element-wise multiplications, and subsequently aggregating the results. The convolution and pooling are performed spatially in 2D CNNs [15]. As a result, it does not model the temporal information. Figure 4 illustrates the distinction between 2D and 3D convolution. When 2D convolution is employed on an image, it yields another image as shown in Figure 4(a). Similarly, applying 2D convolution to multiple images (treating them as distinct channels) also produces an image as the output, as indicated by Figure 4(b). 3D convolution is designed for three-dimensional data, such as video sequences or volumetric data. The 3D kernel is not only applied across height and width but also extends to the depth dimension (or time, in the context of videos). This convolutional process traverses the complete volume of the input data. The convolution and pooling are performed spatiotemporally in 3D CNNs. As a result, it preserves the temporal information, outputting a volume as shown in Figure 4(c).
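A short sketch of this difference in output shapes, assuming a single 16-frame RGB clip with illustrative tensor sizes, is given below.

```python
# Sketch of the shape difference: 2D convolution on a frame vs 3D convolution on a clip.
import torch
import torch.nn as nn

clip = torch.rand(1, 3, 16, 112, 112)   # (batch, channels, time, height, width)

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)

# 2D convolution sees a single frame: the output is again an image (Figure 4(a)).
frame_out = conv2d(clip[:, :, 0])        # (1, 64, 112, 112)
# 3D convolution also slides along time, so the temporal axis survives (Figure 4(c)).
volume_out = conv3d(clip)                # (1, 64, 16, 112, 112)
print(frame_out.shape, volume_out.shape)
```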
The advantages of 3D convolution over 2D convolution are prominent in tasks involving spatiotemporal data, such as video processing. The benefits of 3D CNNs include:
− Spatial-temporal features: it integrates spatial and temporal features simultaneously for a comprehensive data representation, crucial for video sequences.
− Temporal information capture: it effectively captures temporal information by considering the time dimension, which is essential for video analysis and action recognition.
− Natural extension for video analysis: it extends CNN capabilities for video understanding by inherently considering the temporal dimension.
− Unified framework for video processing: it provides a unified approach for processing both spatial and temporal dimensions, simplifying the architecture compared to separate 2D and 1D processing units.
− Volumetric understanding: it enables the modelling of volumetric data, offering comprehensive spatial and temporal understanding, beneficial for 3D medical imaging and other volumetric data tasks.
Figure 3. Extracting video features using a 3D CNN

Figure 4. Comparison between 2D and 3D convolution [15]: (a) 2D convolution with an image, (b) 2D convolution with multiple frames, and (c) 3D convolution on a video
3.2. 3D CNNs: C3D and I3D
In the video summarization methods proposed in the literature, GoogleNet is the most frequently chosen deep network for feature vector extraction. GoogleNet [12] is a 2D CNN pretrained on the ImageNet dataset [14]. Recently, the 3D CNNs C3D and I3D have also been employed for video feature extraction. Spatiotemporal feature extraction using a 3D CNN was proposed by researchers in 2015 [15]. C3D is a deep 3D CNN with a homogeneous architecture containing 3 × 3 × 3 convolutional kernels followed by 2 × 2 × 2 pooling at each layer. The C3D model offers generic feature extraction. It provides a compact representation of video segments, generating a 4096-element vector from a 16-frame input. The model's homogeneous architecture, featuring small 3 × 3 × 3 kernel sizes, ensures fast and efficient inference, enabling optimized implementations on embedded platforms.
The I3D model, introduced by researchers in 2017, is a two-stream Inflated 3D ConvNet that extends 2D CNN principles into the 3D domain [16]. By inflating 2D filters and pooling kernels into 3D, the I3D model aims to capture spatiotemporal features from videos, leveraging successful architectures and parameters from ImageNet. Key features include the adaptation of 2D filters to 3D, the expansion of the receptive field in space and time, and the use of two 3D streams for enhanced performance.
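A minimal sketch of this inflation step is given below; the function name and tensor sizes are illustrative assumptions, while the repeat-and-rescale rule follows the bootstrapping idea described for I3D in [16].

```python
# Sketch of I3D-style filter inflation: a 2D kernel is repeated along time and rescaled,
# so a temporally constant clip produces the same activations as the original 2D model.
import torch

def inflate_2d_filter(w2d: torch.Tensor, t: int) -> torch.Tensor:
    """w2d: (out_ch, in_ch, k, k) 2D kernel -> (out_ch, in_ch, t, k, k) 3D kernel."""
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t

w2d = torch.rand(64, 3, 7, 7)      # e.g. an ImageNet-pretrained first-layer filter
w3d = inflate_2d_filter(w2d, t=7)  # inflated along the temporal axis
print(w3d.shape)                   # torch.Size([64, 3, 7, 7, 7])
```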
4. RESULTS AND DISCUSSION
This section discusses the various baseline summarization methods selected for the study, the experimentation conducted, and the results obtained from these experiments. It provides an overview of the summarization techniques, the experimental setup including the datasets, and the evaluation metric employed to assess the performance of the summarization methods, along with the performance comparisons.
4.1. Summarization methods
The effectiveness of 3D CNNs in video feature extraction is demonstrated through the examination of two summarization frameworks: conventional video summarization and query-focused video summarization. An overview of the summarization methods is provided.
4.1.1. Conventional video summarization methods under consideration
Conventional or generic video summarization involves generating a concise video summary by automatically selecting keyframes or keyshots representing the most important content necessary for understanding the video. This type of summarization is generally content-driven, relying on the visual information within the video to determine what should be included in the summary. The methods under consideration are:
− Diversity-representativeness reward deep summarization network (DR-DSN): a deep summarization network [17] proposed for estimating the likelihood of individual video frames and generating the video summary.
− Video attention summarization network (VASNet): a summarization method [9] combining soft self-attention and a two-layer regressor network.
− Positional encoding with global and local multi-head attention for summarization (PGL-SUM): integration of positional encoding with global and local multi-head attention [18] for calculating the importance scores of frames.
− Summarization generative adversarial network with attention autoencoder (SUM-GAN-AAE): a supervised summarization technique leveraging the combination of adversarial learning with an attention mechanism [19] for summarizing videos.
− Concentrated attention summarization (CA-SUM): a summarization network employing concentrated attention [11], considering the uniqueness and diversity of video frames.
− Deep summarization network with reinforcement learning (DSR-RL): a recurrent summarization network [20] incorporating a self-attention mechanism and reinforcement learning.
4.1.2. Query-focused video summarization methods under consideration
Query-focused video summarization generates the video summary based on specific input queries by the user. This type of summarization is context-driven, relying on viewer queries, making it more personalized than conventional summarization. The methods under consideration are:
− Three-player adversarial network (TPAN): a generative adversarial network with three players [21] operating on three sets of query-conditioned summaries to generate query-focused video summaries.
− Mapping network (MapNet): a mapping network [10] that investigates the correlation between video shots and queries.
− Hierarchical variational network (HVN): a novel architecture, the hierarchical variational network [22], designed to capture long-range temporal dependencies based on queries with its multi-level variational block.
− Query-relevant segment representation module with global attention module (QSRM-GAM): a two-stage approach, consisting of a query-relevant segment representation module and a global attention module [23], proposed for video summarization taking into account user interests.
− Convolutional hierarchical attention network (CHAN): the convolutional hierarchical attention network [24], a pioneering model employing local and global self-attention for query-focused video summarization.
4.2. Experimental setup
An experiment is carried out to extract the spatiotemporal features from video sequences using the pretrained C3D [15] and I3D [16] networks. The C3D network is trained on the Sports-1M dataset [25]. Motion features are obtained in the RGB and flow formats. RGB features are extracted from video frames utilizing the I3D model [16], pretrained on the Kinetics 400 dataset [26], in conjunction with PWC-Net [27]. Flow features, on the other hand, are extracted using the I3D network with recurrent all-pairs field transforms (RAFT) [28]. All experiments are conducted on a computer equipped with an NVIDIA RTX 3060 GPU.
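As an illustration of the flow-extraction step, the sketch below estimates per-pixel motion between two consecutive frames with a pretrained RAFT model. It uses torchvision's RAFT implementation as a stand-in for the network cited in [28], and the frame tensors and shapes are illustrative assumptions rather than the paper's exact preprocessing.

```python
# Minimal sketch: per-pixel motion (optical flow) between consecutive frames with RAFT.
# Assumption: torchvision's raft_large stands in for the RAFT network of [28];
# the resulting flow maps would then feed the I3D flow stream. Frame tensors are dummies.
import torch
from torchvision.models.optical_flow import raft_large

model = raft_large(weights="DEFAULT").eval()

# Two consecutive frames, scaled to [-1, 1]; height and width should be multiples of 8.
frame1 = torch.rand(1, 3, 224, 224) * 2 - 1
frame2 = torch.rand(1, 3, 224, 224) * 2 - 1

with torch.no_grad():
    flow_predictions = model(frame1, frame2)  # list of iteratively refined flow fields
flow = flow_predictions[-1]                   # (1, 2, 224, 224): (dx, dy) per pixel
print(flow.shape)
```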
4.2.1. Datasets
TVSum [29] and SumMe [30] are the publicly available benchmark datasets employed for generic video summarization. The SumMe dataset [30] comprises 25 videos spanning various genres like sports, holidays, and cooking. In contrast, the TVSum dataset [29] includes 50 YouTube videos across 10 categories such as documentary, educational, and egocentric. Both datasets come with multiple user annotations, including user-selected keyframes and shot-level importance scores. For query-based video summarization, the benchmark dataset used is the query-focused video summarization (QFVS) dataset [31]. The QFVS dataset [31] includes four egocentric consumer-grade videos recorded in uncontrolled everyday scenarios, each lasting 3 to 5 hours and featuring a diverse range of events. For each video and query pair, the dataset includes four query-based summaries, consisting of one oracle summary and three user-generated summaries.
4.2.2. Evaluation metric
Video representations for sequences in the above-mentioned datasets are generated using the C3D, I3D (RGB), and I3D (flow) networks, and F1-scores are computed. The F1-score assesses the similarity between the ground truth summary (user summary) and the generated machine summary [32]. It is the harmonic mean of precision and recall. This metric is the most commonly used approach for measuring the performance of summarization frameworks.
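A minimal sketch of this computation is shown below, assuming the machine and user summaries are binary per-frame keyframe indicators of equal length; the variable names are illustrative.

```python
# Minimal sketch of the F1-score between a machine summary and a user summary,
# both given as binary per-frame keyframe indicators of equal length.
def summary_f1(machine, user):
    overlap = sum(m and u for m, u in zip(machine, user))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(machine)  # fraction of selected frames that are correct
    recall = overlap / sum(user)        # fraction of ground-truth frames recovered
    return 2 * precision * recall / (precision + recall)

# 10-frame example: 4 frames picked by the machine, 5 by the user, 3 in common.
machine = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
user    = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
print(round(summary_f1(machine, user), 3))  # precision 0.75, recall 0.6 -> F1 = 0.667
```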
4.3. Results
The experimental results are presented in Tables 1 and 2. The results provide a comparative analysis of the performance of various video representation techniques for the conventional and query-focused summarization methodologies under study.
Table 1. Comparative analysis of feature extraction techniques in generic summarization methods assessed on the TVSum and SumMe datasets (F1-scores)

Method            | GoogleNet      | C3D            | I3D (RGB)      | I3D (Flow)
                  | SumMe | TVSum  | SumMe | TVSum  | SumMe | TVSum  | SumMe | TVSum
DR-DSN [17]       | 42.1  | 58.1   | 55.8  | 65.7   | 55.3  | 65.4   | 55.5  | 65.6
VASNet [9]        | 49.7  | 61.4   | 51.4  | 63.8   | 60.2  | 62.6   | 60.8  | 62.3
PGL-SUM [18]      | 57.1  | 62.7   | 55.7  | 65.3   | 60.3  | 64.8   | 60.1  | 64.2
SUM-GAN-AAE [19]  | 48.9  | 58.3   | 52.3  | 62.7   | 52.1  | 62.6   | 51.7  | 62.3
CA-SUM [11]       | 51.1  | 61.4   | 61.4  | 66.4   | 61.9  | 65.1   | 60.2  | 64.9
DSR-RL [20]       | 50.3  | 61.4   | 54.8  | 64.5   | 51.7  | 63.4   | 55.3  | 62.8
Table 2. Comparative analysis of feature extraction techniques in query-focused summarization methods assessed on the QFVS dataset

Method          | Features        | Result (F1-score)
TPAN [21]       | ResNet152 + C3D | 46.05
MapNet [10]     | ResNet152 + C3D | 47.20
CHAN [24]       | ResNet          | 46.94
HVN [22]        | C3D             | 48.87
QSRM-GAM [23]   | I3D             | 49.20
CHAN [24]       | C3D             | 51.43
CHAN [24]       | I3D             | 50.78
4.3.1. Results with conventional video summarization methods
Table 1 provides the performance comparison of the conventional summarization frameworks in terms of F1-scores. The F1-scores reported for GoogleNet are retrieved from the corresponding papers. Table 1 and Figure 5 indicate that the F1-scores for the C3D and I3D video representations show a significant improvement over GoogleNet. Additionally, Figures 5(a) and 5(b) depict a rising trend in F1-scores for the SumMe and TVSum datasets, respectively. The results show that using 3D CNNs for video feature extraction has enhanced the F1-scores on the SumMe dataset, with improvements of 32.5% for DR-DSN, 3.4% for VASNet, 6.9% for SUM-GAN-AAE, 20.1% for CA-SUM, and 8.9% for DSR-RL. Similarly, on the TVSum dataset, the F1-scores have increased by 13% for DR-DSN, 3.9% for VASNet, 4.1% for PGL-SUM, 7.5% for SUM-GAN-AAE, 8.1% for CA-SUM, and 5% for DSR-RL.
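These percentages appear to correspond to the relative gain of the C3D representation over the GoogleNet baseline, computed as (F1_C3D − F1_GoogleNet) / F1_GoogleNet; for example, DR-DSN on SumMe rises from 42.1 to 55.8, giving (55.8 − 42.1) / 42.1 ≈ 32.5%.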
Figure 5. Performance comparison of feature extraction techniques assessed on the (a) SumMe and (b) TVSum datasets
4.3.2. Results with query-focused video summarization methods
Table 2 provides the comparative analysis of the query-based summarization frameworks in terms of F1-scores. The majority of summarization methods utilize ResNet for extracting video features. ResNet [13] is a 2D CNN pretrained on the ImageNet dataset. The F1-scores presented in Table 2 indicate that video representations obtained with C3D result in enhanced F1-scores.
4.4. Discussion
Previous studies have shown that GoogleNet and ResNet are well-established models that excel at extracting spatial features from individual video frames; GoogleNet, for instance, does so through inception modules composed of multiple parallel convolutional filters of varying sizes. These models effectively capture intricate spatial details within each frame, making them highly suitable for image-based tasks. However, their focus on spatial features alone limits their ability to fully capture the temporal dynamics inherent in video sequences. 3D CNNs extend the capabilities of conventional 2D convolutions by integrating the time dimension, allowing for the simultaneous analysis of both spatial and temporal aspects of video data. This integration is crucial for tasks involving video sequences, where understanding motion and changes over time is as important
as recognizing spatial features within individual frames. This study investigates the use of 3D CNNs for effective video feature extraction in summarization. It is demonstrated that, by capturing spatiotemporal features, 3D CNNs offer a more comprehensive representation of video content, resulting in notable improvements in performance for video summarization. Although 3D CNNs have advanced video feature extraction, there is scope for further research and development. Future research in video feature extraction can explore several promising directions, such as hybrid models that combine the spatial strengths of 2D CNNs with the temporal capabilities of 3D CNNs, and multi-modal video analysis.
5. CONCLUSION
In this paper, a comparative investigation of video feature extraction using 3D CNNs is conducted, focusing on their application in generic and query-specific video summarization. The study examines classical and deep learning based feature extraction techniques, highlighting the advantages of deep learning approaches. The majority of existing video summarization techniques commonly rely on 2D CNNs, such as GoogleNet and ResNet, for feature extraction. It is demonstrated that 3D CNNs, such as C3D and I3D, are more effective for video feature extraction in both generic and query-specific video summarization compared to traditional 2D CNNs. By evaluating F1-scores for various summarization methods, it is concluded that 3D CNNs significantly improve performance due to their ability to capture both spatial and temporal features. This underscores the superiority of 3D CNNs in providing a more comprehensive understanding of video content, marking a notable advancement in video feature extraction and summarization techniques.
REFERENCES
[1] M. Suresha, S. Kuppa, and D. S. Raghukumar, "A study on deep learning spatiotemporal models and feature extraction techniques for video understanding," International Journal of Multimedia Information Retrieval, vol. 9, no. 2, pp. 81–101, 2020, doi: 10.1007/s13735-019-00190-x.
[2] A. Mirza, O. Zeshan, M. Atif, and I. Siddiqi, "Detection and recognition of cursive text from video frames," Eurasip Journal on Image and Video Processing, vol. 2020, no. 1, pp. 1–19, 2020, doi: 10.1186/s13640-020-00523-5.
[3] H. Wang, A. Kläser, C. Schmid, and C. L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60–79, 2013, doi: 10.1007/s11263-012-0594-8.
[4] Y. Xian, B. Korbar, M. Douze, L. Torresani, B. Schiele, and Z. Akata, "Generalized few-shot video classification with video retrieval and feature generation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 8949–8961, 2021, doi: 10.1109/TPAMI.2021.3120550.
[5] Q. Wu, Q. Huang, and X. Li, "Multimodal human action recognition based on spatio-temporal action representation recognition model," Multimedia Tools and Applications, vol. 82, no. 11, pp. 16409–16430, 2023, doi: 10.1007/s11042-022-14193-0.
[6] C. Liu, X. Wu, and Y. Jia, "A hierarchical video description for complex activity understanding," International Journal of Computer Vision, vol. 118, no. 2, pp. 240–255, 2016, doi: 10.1007/s11263-016-0897-2.
[7] S. Sah, T. Nguyen, and R. Ptucha, "Understanding temporal structure for video captioning," Pattern Analysis and Applications, vol. 23, no. 1, pp. 147–159, 2020, doi: 10.1007/s10044-018-00770-3.
[8] R. Liu et al., "Exploiting radio fingerprints for simultaneous localization and mapping," IEEE Pervasive Computing, vol. 22, no. 3, pp. 38–46, 2023, doi: 10.1109/MPRV.2023.3274770.
[9] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, "Summarizing videos with attention," in 14th Asian Conference on Computer Vision, 2019, vol. 11367 LNCS, pp. 39–54, doi: 10.1007/978-3-030-21074-8_4.
[10] Y. Zhang, M. Kampffmeyer, X. Zhao, and M. Tan, "Deep reinforcement learning for query-conditioned video summarization," Applied Sciences (Switzerland), vol. 9, no. 4, p. 750, 2019, doi: 10.3390/app9040750.
[11] E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras, "Summarizing videos using concentrated attention and considering the uniqueness and diversity of the video frames," in ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022, pp. 407–415, doi: 10.1145/3512527.3531404.
[12] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Jun. 2015, pp. 1–9, doi: 10.1109/CVPR.2015.7298594.
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[14] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015, doi: 10.1007/s11263-015-0816-y.
[15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497, doi: 10.1109/ICCV.2015.510.
[16] J. Carreira and A. Zisserman, "Quo vadis, action recognition? a new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jul. 2017, pp. 6299–6308, doi: 10.1109/CVPR.2017.502.
[17] K. Zhou, Y. Qiao, and T. Xiang, "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, vol. 32, no. 1, pp. 7582–7589, doi: 10.1609/aaai.v32i1.12255.
[18] E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras, "Combining global and local attention with positional encoding for video summarization," in Proceedings - 23rd IEEE International Symposium on Multimedia (ISM), 2021, pp. 226–234, doi: 10.1109/ISM52913.2021.00045.
[19] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, "Unsupervised video summarization via attention-driven adversarial learning," in MultiMedia Modeling: 26th International Conference (MMM), 2020, vol. 11961 LNCS, pp. 492–504, doi: 10.1007/978-3-030-37731-1_40.
[20] A. Phaphuangwittayakul, Y. Guo, F. Ying, W. Xu, and Z. Zheng, "Self-attention recurrent summarization network with reinforcement learning for video summarization task," in Proceedings - IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1–6, doi: 10.1109/ICME51207.2021.9428142.
[21] Y. Zhang, M. Kampffmeyer, X. Liang, M. Tan, and E. P. Xing, "Query-conditioned three-player adversarial network for video summarization," arXiv preprint arXiv:1807.06677, 2019, doi: 10.48550/arXiv.1807.06677.
[22] P. Jiang and Y. Han, "Hierarchical variational network for user-diversified & query-focused video summarization," in Proceedings of the 2019 ACM International Conference on Multimedia Retrieval (ICMR), 2019, pp. 202–206, doi: 10.1145/3323873.3325040.
[23] S. Nalla, M. Agrawal, V. Kaushal, G. Ramakrishnan, and R. Iyer, "Watch hours in minutes: summarizing videos with user intent," in European Conference on Computer Vision, 2020, vol. 12539 LNCS, pp. 714–730, doi: 10.1007/978-3-030-68238-5_47.
[24] S. Xiao, Z. Zhao, Z. Zhang, X. Yan, and M. Yang, "Convolutional hierarchical attention network for query-focused video summarization," in AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020, vol. 34, no. 07, pp. 12426–12433, doi: 10.1609/aaai.v34i07.6929.
[25] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1725–1732.
[26] W. Kay et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[27] D. Sun, X. Yang, M. Y. Liu, and J. Kautz, "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943, doi: 10.1109/CVPR.2018.00931.
[28] Z. Teed and J. Deng, "RAFT: recurrent all-pairs field transforms for optical flow," in Computer Vision – ECCV 2020: 16th European Conference, 2020, pp. 402–419, doi: 10.1007/978-3-030-58536-5_24.
[29] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: summarizing web videos using titles," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015, pp. 5179–5187, doi: 10.1109/CVPR.2015.7299154.
[30] M. Gygli, H. Grabner, H. Riemenschneider, and L. V. Gool, "Creating summaries from user videos," in European Conference on Computer Vision, 2014, vol. 8695 LNCS, no. PART 7, pp. 505–520, doi: 10.1007/978-3-319-10584-0_33.
[31] A. Sharghi, J. S. Laurel, and B. Gong, "Query-focused video summarization: dataset, evaluation, and a memory network based approach," in Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2127–2136, doi: 10.1109/CVPR.2017.229.
[32] M. Otani, Y. Nakashima, E. Rahtu, and J. Heikkila, "Rethinking the evaluation of video summaries," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7596–7604, doi: 10.1109/CVPR.2019.00778.
BIOGRAPHIES OF AUTHORS

Bhakti Deepak Kadam is a research scholar at MKSSS's Cummins College of Engineering for Women, Pune. She received her BE degree in E&TC and M.Tech degree in electronics engineering from Savitribai Phule Pune University in 2011 and 2014, respectively. Her current research interests include video processing, computer vision, and deep learning. She can be contacted at email: bhakti.kadam@cumminscollege.in.

Ashwini Mangesh Deshpande is an associate professor in the Electronics and Telecommunication Department at MKSSS's Cummins College of Engineering for Women, Pune, India. Her research interests include image and video processing, computer vision, deep learning, and satellite image processing. She has 40 research papers in reputed journals and conferences. She has acted as a PI in various Government-funded projects. She can be contacted at email: ashwini.deshpande@cumminscollege.in.