Indonesian Journal of Electrical Engineering and Computer Science
Vol. 37, No. 3, March 2025, pp. 1616-1625
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v37.i3.pp1616-1625

Leveraging 3D convolutional networks for effective video feature extraction in video summarization

Bhakti Deepak Kadam 1,2, Ashwini Mangesh Deshpande 1
1 Department of Electronics and Telecommunication Engineering, MKSSS's Cummins College of Engineering for Women, Pune, India
2 Department of Electronics and Telecommunication Engineering, SCTR's Pune Institute of Computer Technology, Pune, India

Article Info

Article history:
Received Apr 9, 2024
Revised Sep 13, 2024
Accepted Sep 30, 2024

Keywords:
3D convolution
Deep neural networks
Feature representation
Pretrained networks
Video summarization

ABSTRACT

Video feature extraction is pivotal in video processing, as it encompasses the extraction of pertinent information from video data. This process enables a more streamlined representation, analysis, and comprehension of video content. Given its advantages, feature extraction has become a crucial step in numerous video understanding tasks. This study investigates the generation of video representations utilizing three-dimensional (3D) convolutional neural networks (CNNs) for the task of video summarization. The feature vectors are extracted from the video sequences using pretrained two-dimensional (2D) networks such as GoogleNet and ResNet, along with 3D networks like the 3D Convolutional Network (C3D) and the Two-Stream Inflated 3D Convolutional Network (I3D). To assess the effectiveness of the video representations, F1-scores are computed with the generated 2D and 3D video representations for chosen generic and query-focused video summarization techniques. The experimental results show that using feature vectors from 3D networks improves F1-scores, highlighting the effectiveness of 3D networks in video representation. It is demonstrated that 3D networks, unlike 2D ones, incorporate the time dimension to capture spatiotemporal features, providing better temporal processing and offering a more comprehensive video representation.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Bhakti Deepak Kadam
Department of Electronics and Telecommunication Engineering
MKSSS's Cummins College of Engineering for Women
Pune, Maharashtra, India
Email: bhakti.kadam@cumminscollege.in

1. INTRODUCTION

Video feature extraction is a foundational aspect of computer vision, designed to overcome the challenges presented by the intricate and extensive nature of video data. Its purpose is to facilitate analyses that are not only more efficient but also more interpretable and accurate. Effective comprehension of videos demands representation at multiple levels, necessitating an appropriate video representation. Effective video processing relies on video feature extraction for the following reasons: (i) reducing the complexity of video data for more manageable analysis, (ii) simplifying the interpretation of underlying information for both humans and deep learning models, (iii) condensing meaningful information while retaining essential characteristics, (iv) enhancing the model's ability to generalize patterns for accurate predictions, and (v) efficient utilization of computational resources by focusing on relevant aspects.
Video feature extraction involves selecting and/or combining variables to generate feature vectors. Feature vectors of a video are numerical representations that capture, in a structured format, various visual, spatial, temporal, motion, and audio attributes of the video content. These vectors facilitate effective analysis, understanding, and processing of videos in numerous applications. Feature extraction efficiently reduces the volume of data that needs processing while maintaining accuracy in representing videos.

The combination of video segmentation and feature extraction is instrumental in mitigating computational overhead by streamlining the preprocessing across all frames of the video. When analyzing videos for computer vision tasks, a variety of features play a pivotal role. The different video features utilized for video understanding are illustrated in Figure 1. These diverse features contribute to a holistic comprehension of video content, facilitating various applications within the field of computer vision [1].

Types of video features (Figure 1):
- Spatial features: extracted from still video frames. Spatial features distinguish between static and dynamic contexts, encompassing the position of objects. They explore the object's relative spatial area and semantic relationships with other objects in each video frame.
- Temporal features (optical flow): involve the extraction of action or object movement. Optical flow represents the motion pattern of an object across consecutive frames, requiring the estimation of per-pixel motion between them.
- Textual features: focus on the detection and recognition of textual content within video frames, categorizing texts as scene text or caption text [2].
- Trajectory extraction: a dense representation of the video obtained through optical flow algorithms allows for the extraction of dense trajectories [3]. These trajectories provide valuable information characterizing the appearance and motion of objects.
- Content-based feature extraction: involves the extraction of feature vectors based on the video's context, tailored to the specific task at hand such as video summarization or video captioning.

Figure 1. Types of video features

Features can be extracted using both classical methods that rely on local, hand-crafted features and advanced techniques involving deep neural networks, as detailed in section 2. This research explores the utilization of three-dimensional convolutional neural networks (3D CNNs) to enhance video representations. A comparative analysis is conducted on conventional video summarization and query-focused video summarization techniques. Our contributions are as follows: i) the video features are extracted utilizing pretrained 3D CNNs (C3D and inflated 3D (I3D)); ii) specific baseline algorithms for generic and query-focused video summarization are chosen and the F1-scores for the generated video representations are computed; and iii) the performance is evaluated by comparing pretrained two-dimensional (2D) and 3D convolutional networks for feature extraction in terms of the calculated F1-scores.
Given its numerous advantages, feature extraction stands as a fundamental process in many research applications, such as:
- Video classification: the task of assigning one or more global labels to the video. The proper extraction of features from the input video leads to the prediction of accurate frame labels that describe the entire video [4].
- Action recognition: action recognition in videos aims to infer the actions of one or more persons in the video. Spatial and long-range temporal feature extraction is necessary for human activity or action recognition [5].
- Video understanding: the task of recognizing and localizing different actions or events occurring in the video. As the localization is in both spatial and temporal dimensions, this task requires spatiotemporal feature extraction [6].
- Video captioning: the task of generating automatic captions for a video. This enables efficient information retrieval from the video in the form of text. As captioning is the textual description of the video, it needs the extraction of more complex features [7].
- Simultaneous localization and mapping (SLAM): a method used for autonomous vehicles that builds a map and localizes the vehicle in that same map [8]. In SLAM, spatial and motion features need to be extracted and matched for localization and obstacle detection.
- Video summarization: the process of generating a temporally condensed version of the input video. Video representations at multiple levels are necessary for spatiotemporal modelling due to the long durations of videos [9]-[11].

These applications make video feature extraction a valuable research topic for study. The structure of the paper is as follows: section 2 discusses the related work on video feature extraction. Section 3 elaborates on the 3D convolutional networks employed for extracting video features. The experimental results and analysis are discussed in section 4, and section 5 provides conclusions.

2. RELATED WORK

This section explores the existing video feature extraction techniques employed in summarization methodologies in the literature. Video summarization and feature extraction represent longstanding research areas in computer vision. Video feature vectors can be extracted using classical vision techniques focusing on hand-crafted features as well as deep neural networks [1]. The classification of feature extraction techniques is provided in Figure 2.

With advancements in deep learning, video feature extraction has also leveraged these technologies. Key trends propelling the field forward include the integration of multimodal information, the development of self-supervised learning techniques, and the exploration of novel architectures such as transformers. Deep learning based feature extraction techniques have outperformed classical vision techniques, and these models are effectively utilized in various research domains [1]. The merits of deep learning based techniques include:
- Extraction of complex and abstract features by feature engineering: feature engineering deals with the extraction of features from natural data. Spatiotemporal models utilize state-of-the-art feature engineering to extract more complex features from videos.
- Feature extraction for unstructured data: deep neural networks can handle unstructured data better than hand-crafted features by training on various abstract features.
- Unsupervised feature learning: the process of labelling the available data is expensive and time-consuming, and becomes even more challenging when extended to videos. Traditional techniques do not perform well on unlabelled data, but spatiotemporal models can be used efficiently with such data.
- High-quality results: the semantic relationships between objects and their motion patterns are also explored while extracting features using modern machine vision techniques. This leads to improvement in the quality of results in different computer vision tasks.

Most of the summarization methods employ 2D CNNs such as GoogleNet and Residual Network (ResNet) to extract video features. GoogleNet, also known as Inception V1, was presented in 2014 [12]. ResNet, the residual learning architecture, was introduced in 2015 [13].
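To make this typical 2D pipeline concrete, the following minimal Python sketch (illustrative only, not taken from the cited works; the exact preprocessing used in those papers may differ) extracts one feature vector per frame with an ImageNet-pretrained GoogLeNet from torchvision, whose pooled output is a 1024-dimensional vector:

```python
# Illustrative sketch: frame-level feature extraction with a pretrained 2D CNN,
# as commonly used by the summarization methods discussed in this paper.
import torch
import torchvision.models as models
import torchvision.transforms as T

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

backbone = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()   # drop the classifier, keep the pooled 1024-d features
backbone.eval()

@torch.no_grad()
def frame_features(frames):
    """frames: list of HxWx3 uint8 RGB arrays -> (num_frames, 1024) tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)          # one 1024-d feature vector per frame
```

Swapping the backbone for a ResNet (with its final fully connected layer likewise replaced by an identity) yields 2048-dimensional frame features of the kind used by several query-focused methods in section 4.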
In video summarization frameworks, GoogleNet and ResNet pretrained on the ImageNet dataset [14] are widely employed for feature extraction from input video sequences. However, 2D CNNs face several challenges in video feature extraction due to their limitations in handling temporal information:
- Lack of temporal awareness: 2D CNNs process each frame independently, missing the temporal relationships crucial for understanding motion and events.
- Handling motion: they struggle with dynamic content and the complexity of integrating optical flow.
- Spatiotemporal features: they capture only spatial features, lacking the rich spatiotemporal context needed for tasks like action recognition.
- Multi-modal integration: combining visual features with audio and text is challenging without inherent temporal modeling.

This study proposes the use of 3D CNNs for video feature extraction to overcome these limitations.
Video feature extraction techniques (Figure 2):
- Classical vision techniques:
  - Histogram of Oriented Gradients (HOG) and Histogram of Optical Flow (HOF): encoding techniques used to extract features for the common tasks of object detection and activity recognition.
  - Space-Time Interest Points (STIP): space-time interest points can be used to model spatiotemporal features and dynamic motion patterns.
  - Scale-Invariant Feature Transform (SIFT): local still-image spatial cues in videos can be captured using classical image-based descriptors such as SIFT.
  - Dense trajectory: the dense trajectory approach can be used to model local spatial cues and global motion cues.
- Deep learning based techniques:
  - Convolutional Neural Network (CNN): 2D and 3D CNNs are effectively used for extracting spatiotemporal features and short-term motion cues from raw video data.
  - Recurrent Neural Network (RNN): RNNs are also used to extract short-term and long-term motion patterns from videos.
  - Long Short-Term Memory (LSTM): long-term motion cues can be modelled using LSTMs.
  - Generative Adversarial Network (GAN): spatiotemporal features can also be extracted from videos using GANs.
  - Regularized feature fusion models: the feature scores extracted from different network layers and levels are combined using feature fusion techniques.

Figure 2. Classification of video feature extraction techniques

3. METHOD

This section presents the merits of utilizing 3D convolution for video comprehension, along with the application of 3D CNNs to capture features from video sequences in summarization algorithms. A video consists of many segments, with each segment comprising shots, and these shots are composed of sequences of frames. For a comprehensive understanding of videos, it is necessary to learn feature representations at different levels. To extract feature vectors at different levels, the video is divided into small, non-intersecting shots. After segmenting the video, features are extracted using pretrained 3D convolutional networks. Figure 3 illustrates the extraction of video features using a pretrained 3D CNN [15].

3.1. 2D and 3D convolution

The fundamental difference between 2D and 3D convolution lies in the dimensionality of the input data that each processes. Generally, 2D convolution is employed on two-dimensional data, like images. This convolutional process entails moving a 2D kernel/filter across the input image, conducting element-wise multiplications, and subsequently aggregating the results. Convolution and pooling are performed spatially in 2D CNNs [15]; as a result, they do not model temporal information. Figure 4 illustrates the distinction between 2D and 3D convolution. When 2D convolution is applied to an image, it yields another image, as shown in Figure 4(a). Similarly, applying 2D convolution to multiple images (treating them as distinct channels) also produces an image as the output, as indicated by Figure 4(b). 3D convolution is designed for three-dimensional data, such as video sequences or volumetric data. The 3D kernel is applied not only across height and width but also along the depth dimension (or time, in the context of videos). This convolutional process traverses the complete volume of the input data. Convolution and pooling are performed spatiotemporally in 3D CNNs.
As a result, 3D convolution preserves the temporal information, outputting a volume as shown in Figure 4(c).
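To make the dimensional contrast concrete, the following short PyTorch sketch (illustrative, not from the paper) applies a 2D and a 3D convolution to the same 16-frame clip: the 2D layer absorbs the frames as extra input channels and returns a flat feature map, while the 3D layer retains the temporal axis.

```python
# Illustrative sketch: output-shape behaviour of 2D vs. 3D convolution on a clip.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, time, height, width)

# 2D convolution: the 16 frames are folded into input channels, so the output
# is a single feature map with no temporal axis left.
conv2d = nn.Conv2d(in_channels=3 * 16, out_channels=64, kernel_size=3, padding=1)
out2d = conv2d(clip.reshape(1, 3 * 16, 112, 112))
print(out2d.shape)                        # torch.Size([1, 64, 112, 112])

# 3D convolution: a 3x3x3 kernel also slides along time, so the output keeps a
# temporal dimension and remains a spatiotemporal volume.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
out3d = conv3d(clip)
print(out3d.shape)                        # torch.Size([1, 64, 16, 112, 112])
```

The volume produced in the 3D case is exactly the kind of spatiotemporal representation that the 3 × 3 × 3 kernels of C3D, described in section 3.2, operate on layer after layer.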
The advantages of 3D convolution over 2D convolution are prominent in tasks involving spatiotemporal data, such as video processing. The benefits of 3D CNNs include:
- Spatial-temporal features: integrating spatial and temporal features simultaneously for a comprehensive data representation, crucial for video sequences.
- Temporal information capture: effectively capturing temporal information by considering the time dimension, which is essential for video analysis and action recognition.
- Natural extension for video analysis: extending CNN capabilities to video understanding by inherently considering the temporal dimension.
- Unified framework for video processing: providing a unified approach for processing both spatial and temporal dimensions, simplifying the architecture compared to separate 2D and 1D processing units.
- Volumetric understanding: enabling the modelling of volumetric data and offering comprehensive spatial and temporal understanding, beneficial for 3D medical imaging and other volumetric data tasks.

Figure 3. Extracting video features using a 3D CNN

Figure 4. Comparison between 2D and 3D convolution [15]: (a) 2D convolution with an image, (b) 2D convolution with multiple frames, and (c) 3D convolution on a video
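The pipeline of Figure 3 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: torchvision's R3D-18 (Kinetics-400 weights) is used as a readily available stand-in for C3D/I3D and yields a 512-dimensional pooled feature per 16-frame shot, whereas C3D's fully connected layer would produce the 4096-element vector described in section 3.2. The normalization constants below are the Kinetics statistics used by torchvision's video models.

```python
# Illustrative sketch of Figure 3: split a video into non-overlapping 16-frame
# shots and extract one spatiotemporal feature vector per shot with a pretrained
# 3D CNN (R3D-18 here as a stand-in for C3D/I3D).
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model.fc = torch.nn.Identity()              # keep the pooled spatiotemporal feature
model.eval()

KINETICS_MEAN = torch.tensor([0.43216, 0.394666, 0.37645]).view(3, 1, 1, 1)
KINETICS_STD = torch.tensor([0.22803, 0.22145, 0.216989]).view(3, 1, 1, 1)

@torch.no_grad()
def shot_features(video, clip_len=16, size=112):
    """video: (T, H, W, 3) uint8 tensor -> (num_shots, 512) feature tensor."""
    feats = []
    for start in range(0, video.shape[0] - clip_len + 1, clip_len):
        # (16, H, W, 3) -> (3, 16, H, W), scaled to [0, 1]
        shot = video[start:start + clip_len].permute(3, 0, 1, 2).float() / 255.0
        # resize each frame to size x size and normalize per channel
        shot = torch.nn.functional.interpolate(
            shot, size=(size, size), mode="bilinear", align_corners=False)
        shot = (shot - KINETICS_MEAN) / KINETICS_STD
        feats.append(model(shot.unsqueeze(0)))   # one 512-d vector per 16-frame shot
    return torch.cat(feats, dim=0)
```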
3.2. 3D CNNs: C3D and I3D

In the video summarization methods proposed in the literature, GoogleNet is the most frequently chosen deep network for feature vector extraction. GoogleNet [12] is a 2D CNN pretrained on the ImageNet dataset [14]. Recently, 3D CNNs such as C3D and I3D have also been employed for video feature extraction. Spatiotemporal feature extraction using a 3D CNN was proposed by researchers in 2015 [15]. C3D is a deep 3D CNN with a homogeneous architecture containing 3 × 3 × 3 convolutional kernels followed by 2 × 2 × 2 pooling at each layer. The C3D model offers generic feature extraction and provides a compact representation of video segments, generating a 4096-element vector from a 16-frame input. The model's homogeneous architecture, featuring small 3 × 3 × 3 kernels, ensures fast and efficient inference, enabling optimized implementations on embedded platforms. The I3D model, introduced by researchers in 2017, is a two-stream Inflated 3D ConvNet that extends 2D CNN principles into the 3D domain [16]. By inflating 2D filters and pooling kernels into 3D, the I3D model aims to capture spatiotemporal features from videos, leveraging successful architectures and parameters from ImageNet. Key features include the adaptation of 2D filters to 3D, expansion of the receptive field in space and time, and the use of two 3D streams for enhanced performance.

4. RESULTS AND DISCUSSION

This section discusses the various baseline summarization methods selected for the study, the experimentation conducted, and the results obtained from these experiments. It provides an overview of the summarization techniques, the experimental setup including the datasets, and the evaluation metric employed to assess the performance of the summarization methods, along with the performance comparisons.

4.1. Summarization methods

The effectiveness of 3D CNNs in video feature extraction is demonstrated through the examination of two summarization frameworks: conventional video summarization and query-focused video summarization. An overview of the summarization methods is provided below.

4.1.1. Conventional video summarization methods under consideration

Conventional or generic video summarization involves generating a concise video summary by automatically selecting keyframes or keyshots representing the most important content necessary for understanding the video. This type of summarization is generally content-driven, relying on the visual information within the video to determine what should be included in the summary. The methods under consideration are:
- Diversity-representativeness reward deep summarization network (DR-DSN): a deep summarization network [17] proposed for estimating the likelihood of individual video frames and generating the video summary.
- Video attention summarization network (VASNet): a summarization method [9] combining soft self-attention and a two-layer regressor network.
- Positional encoding with global and local multi-head attention for summarization (PGL-SUM): an integration of positional encoding with global and local multi-head attention [18] for calculating the importance scores of frames.
- Summarization generative adversarial network with attention autoencoder (SUM-GAN-AAE): an unsupervised summarization technique combining adversarial learning with an attention mechanism [19] for summarizing videos.
- Concentrated attention summarization (CA-SUM): a summarization network [11] employing concentrated attention and considering the uniqueness and diversity of video frames.
- Deep summarization network with reinforcement learning (DSR-RL): a recurrent summarization network [20] incorporating a self-attention mechanism and reinforcement learning.

4.1.2. Query-focused video summarization methods under consideration

Query-focused video summarization generates the video summary based on specific input queries by the user. This type of summarization is context-driven, relying on viewer queries, which makes it more personalized than conventional summarization. The methods under consideration are:
- Three-player adversarial network (TPAN): a generative adversarial network with three players [21] operating on three sets of query-conditioned summaries to generate query-focused video summaries.
- Mapping network (MapNet): a mapping network [10] that investigates the correlation between video shots and queries.
- Hierarchical variational network (HVN): a novel architecture [22] designed to capture long-range temporal dependencies based on queries with its multi-level variational block.
- Query-relevant segment representation module with global attention module (QSRM-GAM): a two-stage approach consisting of a query-relevant segment representation module and a global attention module [23], proposed for video summarization that takes user interests into account.
- Convolutional hierarchical attention network (CHAN): a pioneering model [24] employing local and global self-attention for query-focused video summarization.

4.2. Experimental setup

An experiment is carried out to extract spatiotemporal features from video sequences using pretrained C3D [15] and I3D [16] networks. The C3D network is trained on the Sports-1M dataset [25]. Motion features are obtained in the RGB and flow formats. RGB features are extracted from video frames utilizing the I3D model [16], pretrained on the Kinetics-400 dataset [26], in conjunction with PWC-Net [27]. Flow features, on the other hand, are extracted using the I3D network with recurrent all-pairs field transforms (RAFT) [28]. All experiments are conducted on a computer equipped with an NVIDIA RTX 3060 GPU.

4.2.1. Datasets

TVSum [29] and SumMe [30] are the publicly available benchmark datasets employed for generic video summarization. The SumMe dataset [30] comprises 25 videos spanning various genres like sports, holidays, and cooking. In contrast, the TVSum dataset [29] includes 50 YouTube videos across 10 categories such as documentary, educational, and egocentric. Both datasets come with multiple user annotations, including user-selected keyframes and shot-level importance scores.

For query-based video summarization, the benchmark dataset used is the query-focused video summarization (QFVS) dataset [31]. The QFVS dataset includes four egocentric consumer-grade videos recorded in uncontrolled everyday scenarios, each lasting 3 to 5 hours and featuring a diverse range of events. For each video and query pair, the dataset includes four query-based summaries, consisting of one oracle summary and three user-generated summaries.

4.2.2. Evaluation metric

Video representations for sequences in the above-mentioned datasets are generated using the C3D, I3D (RGB), and I3D (flow) networks, and F1-scores are computed. The F1-score assesses the similarity between the ground truth summary (user summary) and the generated machine summary [32]. It is the harmonic mean of precision and recall and is the most commonly used metric for measuring the performance of summarization frameworks.

4.3. Results

The experimental results are presented in Tables 1 and 2. The results provide a comparative analysis of the performance of various video representation techniques for the conventional and query-focused summarization methodologies under study.
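Before turning to the tables, the evaluation metric of section 4.2.2 can be sketched as a simple keyframe-overlap computation. This is a minimal illustration only: the full benchmark protocol [32] additionally involves shot-level matching and per-dataset aggregation over multiple user summaries.

```python
# Illustrative sketch of the F1-score between a machine summary and a user
# summary, both given as per-frame 0/1 selection masks of equal length.
import numpy as np

def summary_f1(machine, user):
    """machine, user: binary arrays of length num_frames -> F1-score in percent."""
    machine = np.asarray(machine, dtype=bool)
    user = np.asarray(user, dtype=bool)
    overlap = np.sum(machine & user)        # frames selected by both summaries
    if overlap == 0:
        return 0.0
    precision = overlap / machine.sum()     # fraction of selected frames that match
    recall = overlap / user.sum()           # fraction of user keyframes recovered
    return 100.0 * 2 * precision * recall / (precision + recall)

# Example: two of the three selected frames agree with the user summary,
# so precision and recall are both 2/3 and the F1-score is about 66.7.
print(summary_f1([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```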
Table 1. Comparative analysis of feature extraction techniques in generic summarization methods assessed on the SumMe and TVSum datasets (F1-scores)

Method              GoogleNet        C3D              I3D (RGB)        I3D (Flow)
                    SumMe  TVSum     SumMe  TVSum     SumMe  TVSum     SumMe  TVSum
DR-DSN [17]         42.1   58.1      55.8   65.7      55.3   65.4      55.5   65.6
VASNet [9]          49.7   61.4      51.4   63.8      60.2   62.6      60.8   62.3
PGL-SUM [18]        57.1   62.7      55.7   65.3      60.3   64.8      60.1   64.2
SUM-GAN-AAE [19]    48.9   58.3      52.3   62.7      52.1   62.6      51.7   62.3
CA-SUM [11]         51.1   61.4      61.4   66.4      61.9   65.1      60.2   64.9
DSR-RL [20]         50.3   61.4      54.8   64.5      51.7   63.4      55.3   62.8
Table 2. Comparative analysis of feature extraction techniques in query-focused summarization methods assessed on the QFVS dataset

Method          Features           Result (F1-score)
TPAN [21]       ResNet152 + C3D    46.05
MapNet [10]     ResNet152 + C3D    47.20
CHAN [24]       ResNet             46.94
HVN [22]        C3D                48.87
QSRM-GAM [23]   I3D                49.20
CHAN [24]       C3D                51.43
CHAN [24]       I3D                50.78

4.3.1. Results with conventional video summarization methods

Table 1 provides the performance comparison of the conventional summarization frameworks in terms of F1-scores. The F1-scores reported for GoogleNet are retrieved from the corresponding papers. Table 1 and Figure 5 indicate that the F1-scores for the C3D and I3D video representations show a significant improvement over GoogleNet. Additionally, Figures 5(a) and 5(b) depict a rising trend in F1-scores for the SumMe and TVSum datasets, respectively. The results show that using 3D CNNs for video feature extraction has enhanced the F1-scores on the SumMe dataset, with improvements of 32.5% for DR-DSN, 3.4% for VASNet, 6.9% for SUM-GAN-AAE, 20.1% for CA-SUM, and 8.9% for DSR-RL. Similarly, on the TVSum dataset, the F1-scores have increased by 13% for DR-DSN, 3.9% for VASNet, 4.1% for PGL-SUM, 7.5% for SUM-GAN-AAE, 8.1% for CA-SUM, and 5% for DSR-RL.

Figure 5. Performance comparison of feature extraction techniques assessed on the (a) SumMe and (b) TVSum datasets

4.3.2. Results with query-focused video summarization methods

Table 2 provides the comparative analysis of the query-based summarization frameworks in terms of F1-scores. The majority of these summarization methods utilize ResNet for extracting video features. ResNet [13] is a 2D CNN pretrained on the ImageNet dataset. The F1-scores presented in Table 2 indicate that video representations obtained with C3D result in enhanced F1-scores.

4.4. Discussion

Previous studies have shown that GoogleNet, with its inception modules composed of multiple parallel convolutional filters of varying sizes, and ResNet, with its residual connections, are well-established models that excel at extracting spatial features from individual video frames. These models effectively capture intricate spatial details within each frame, making them highly suitable for image-based tasks. However, their focus on spatial features alone limits their ability to fully capture the temporal dynamics inherent in video sequences. 3D CNNs extend the capabilities of conventional 2D convolutions by integrating the time dimension, allowing for the simultaneous analysis of both spatial and temporal aspects of video data. This integration is crucial for tasks involving video sequences, where understanding motion and changes over time is as important
as recognizing spatial features within individual frames. This study investigates the use of 3D CNNs for effective video feature extraction in summarization. It is demonstrated that by capturing spatiotemporal features, 3D CNNs offer a more comprehensive representation of video content, resulting in notable improvements in performance for video summarization. Although 3D CNNs have advanced video feature extraction, there is scope for further research and development. Future research in video feature extraction can explore several promising directions, such as hybrid models that combine the spatial strengths of 2D CNNs with the temporal capabilities of 3D CNNs, and multi-modal video analysis.

5. CONCLUSION

In this paper, a comparative investigation of video feature extraction using 3D CNNs, focusing on their applications in generic and query-specific video summarization, is conducted. This study examines the classical and deep learning based feature extraction techniques, highlighting the advantages of deep learning approaches. The majority of existing video summarization techniques commonly rely on 2D CNNs, such as GoogleNet and ResNet, for feature extraction. It is demonstrated that 3D CNNs, such as C3D and I3D, are more effective for video feature extraction in both generic and query-specific video summarization compared to traditional 2D CNNs. By evaluating F1-scores for various summarization methods, it is concluded that 3D CNNs significantly improve performance due to their ability to capture both spatial and temporal features. This underscores the superiority of 3D CNNs in providing a more comprehensive understanding of video content, marking a notable advancement in video feature extraction and summarization techniques.

REFERENCES

[1] M. Suresha, S. Kuppa, and D. S. Raghukumar, "A study on deep learning spatiotemporal models and feature extraction techniques for video understanding," International Journal of Multimedia Information Retrieval, vol. 9, no. 2, pp. 81–101, 2020, doi: 10.1007/s13735-019-00190-x.
[2] A. Mirza, O. Zeshan, M. Atif, and I. Siddiqi, "Detection and recognition of cursive text from video frames," Eurasip Journal on Image and Video Processing, vol. 2020, no. 1, pp. 1–19, 2020, doi: 10.1186/s13640-020-00523-5.
[3] H. Wang, A. Kläser, C. Schmid, and C. L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60–79, 2013, doi: 10.1007/s11263-012-0594-8.
[4] Y. Xian, B. Korbar, M. Douze, L. Torresani, B. Schiele, and Z. Akata, "Generalized few-shot video classification with video retrieval and feature generation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 8949–8961, 2021, doi: 10.1109/TPAMI.2021.3120550.
[5] Q. Wu, Q. Huang, and X. Li, "Multimodal human action recognition based on spatio-temporal action representation recognition model," Multimedia Tools and Applications, vol. 82, no. 11, pp. 16409–16430, 2023, doi: 10.1007/s11042-022-14193-0.
[6] C. Liu, X. Wu, and Y. Jia, "A hierarchical video description for complex activity understanding," International Journal of Computer Vision, vol. 118, no. 2, pp. 240–255, 2016, doi: 10.1007/s11263-016-0897-2.
[7] S. Sah, T. Nguyen, and R. Ptucha, "Understanding temporal structure for video captioning," Pattern Analysis and Applications, vol. 23, no. 1, pp.
147–159, 2020, doi: 10.1007/s10044-018-00770-3.
[8] R. Liu et al., "Exploiting radio fingerprints for simultaneous localization and mapping," IEEE Pervasive Computing, vol. 22, no. 3, pp. 38–46, 2023, doi: 10.1109/MPRV.2023.3274770.
[9] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, "Summarizing videos with attention," in 14th Asian Conference on Computer Vision, 2019, vol. 11367 LNCS, pp. 39–54, doi: 10.1007/978-3-030-21074-8_4.
[10] Y. Zhang, M. Kampffmeyer, X. Zhao, and M. Tan, "Deep reinforcement learning for query-conditioned video summarization," Applied Sciences (Switzerland), vol. 9, no. 4, p. 750, 2019, doi: 10.3390/app9040750.
[11] E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras, "Summarizing videos using concentrated attention and considering the uniqueness and diversity of the video frames," in ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022, pp. 407–415, doi: 10.1145/3512527.3531404.
[12] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Jun. 2015, pp. 1–9, doi: 10.1109/CVPR.2015.7298594.
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[14] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015, doi: 10.1007/s11263-015-0816-y.
[15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497, doi: 10.1109/ICCV.2015.510.
[16] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jul. 2017, pp. 6299–6308, doi: 10.1109/CVPR.2017.502.
[17] K. Zhou, Y. Qiao, and T. Xiang, "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, vol. 32, no. 1, pp. 7582–7589, doi: 10.1609/aaai.v32i1.12255.
[18] E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras, "Combining global and local attention with positional encoding for video summarization," in Proceedings - 23rd IEEE International Symposium on Multimedia (ISM), 2021, pp. 226–234, doi: 10.1109/ISM52913.2021.00045.
[19] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, "Unsupervised video summarization via attention-driven adversarial learning," in MultiMedia Modeling: 26th International Conference (MMM), 2020, vol. 11961 LNCS, pp. 492–504, doi: 10.1007/978-3-030-37731-1_40.
[20] A. Phaphuangwittayakul, Y. Guo, F. Ying, W. Xu, and Z. Zheng, "Self-attention recurrent summarization network with reinforcement learning for video summarization task," in Proceedings - IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1–6, doi: 10.1109/ICME51207.2021.9428142.
[21] Y. Zhang, M. Kampffmeyer, X. Liang, M. Tan, and E. P. Xing, "Query-conditioned three-player adversarial network for video summarization," arXiv preprint arXiv:1807.06677, 2019, doi: 10.48550/arXiv.1807.06677.
[22] P. Jiang and Y. Han, "Hierarchical variational network for user-diversified & query-focused video summarization," in Proceedings of the 2019 ACM International Conference on Multimedia Retrieval (ICMR), 2019, pp. 202–206, doi: 10.1145/3323873.3325040.
[23] S. Nalla, M. Agrawal, V. Kaushal, G. Ramakrishnan, and R. Iyer, "Watch hours in minutes: summarizing videos with user intent," in European Conference on Computer Vision, 2020, vol. 12539 LNCS, pp. 714–730, doi: 10.1007/978-3-030-68238-5_47.
[24] S. Xiao, Z. Zhao, Z. Zhang, X. Yan, and M. Yang, "Convolutional hierarchical attention network for query-focused video summarization," in AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020, vol. 34, no. 07, pp. 12426–12433, doi: 10.1609/aaai.v34i07.6929.
[25] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1725–1732.
[26] W. Kay et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[27] D. Sun, X. Yang, M. Y. Liu, and J. Kautz, "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943, doi: 10.1109/CVPR.2018.00931.
[28] Z. Teed and J. Deng, "RAFT: recurrent all-pairs field transforms for optical flow," in Computer Vision - ECCV 2020: 16th European Conference, 2020, pp. 402–419, doi: 10.1007/978-3-030-58536-5_24.
[29] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: summarizing web videos using titles," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015, pp. 5179–5187, doi: 10.1109/CVPR.2015.7299154.
[30] M. Gygli, H. Grabner, H. Riemenschneider, and L. V. Gool, "Creating summaries from user videos," in European Conference on Computer Vision, 2014, vol. 8695 LNCS, part 7, pp. 505–520, doi: 10.1007/978-3-319-10584-0_33.
[31] A. Sharghi, J. S. Laurel, and B.
Gong, "Query-focused video summarization: dataset, evaluation, and a memory network based approach," in Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2127–2136, doi: 10.1109/CVPR.2017.229.
[32] M. Otani, Y. Nakashima, E. Rahtu, and J. Heikkila, "Rethinking the evaluation of video summaries," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7596–7604, doi: 10.1109/CVPR.2019.00778.

BIOGRAPHIES OF AUTHORS

Bhakti Deepak Kadam is a research scholar at MKSSS's Cummins College of Engineering for Women, Pune. She received her BE degree in E&TC and M.Tech degree in electronics engineering from Savitribai Phule Pune University in 2011 and 2014, respectively. Her current research interests include video processing, computer vision, and deep learning. She can be contacted at email: bhakti.kadam@cumminscollege.in.

Ashwini Mangesh Deshpande is an associate professor in the Electronics and Telecommunication Department at MKSSS's Cummins College of Engineering for Women, Pune, India. Her research interests include image and video processing, computer vision, deep learning, and satellite image processing. She has 40 research papers in reputed journals and conferences. She has acted as a PI in various Government-funded projects. She can be contacted at email: ashwini.deshpande@cumminscollege.in.