Indonesian Journal of Electrical Engineering and Computer Science
Vol. 37, No. 3, March 2025, pp. 1797-1803
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v37.i3.pp1797-1803

Quantitation of new arbitrary view dynamic human action recognition framework

Anh-Dung Ho (1), Huong-Giang Doan (2)
(1) Department of Information Technology, East Asia University of Technology, Ha Noi, Viet Nam
(2) Faculty of Control and Automation, Electric Power University, Ha Noi, Viet Nam

Article Info
Article history:
Received Jun 21, 2024
Revised Oct 3, 2024
Accepted Oct 7, 2024

Keywords:
Arbitrary view recognition
Convolutional neural network
Deep learning
Generative adversarial networks
Human activity recognition
Multiple view recognition

ABSTRACT
Dynamic action recognition has attracted many researchers due to its applications. Nevertheless, it remains a challenging problem because the camera setups in the training phase differ from those in the testing phase, and/or arbitrary view actions are captured from multiple camera viewpoints. Some recent dynamic gesture approaches address multi-view action recognition, but they do not handle novel viewpoints. In this research, we propose a novel end-to-end framework for dynamic gesture recognition from an unknown viewpoint. It consists of three main components: (i) a synthetic video generator with a generative adversarial network (GAN)-based architecture named ArVi-MoCoGAN; (ii) a feature extractor that is evaluated and compared with various 3D CNN backbones; and (iii) a channel and spatial attention module. ArVi-MoCoGAN generates synthetic videos at multiple fixed viewpoints from a real dynamic gesture captured at an arbitrary viewpoint. Features are then extracted from these synthetic videos by various three-dimensional (3D) convolutional neural network (CNN) models. The resulting feature vectors are processed in the final component, which focuses on the attention features of dynamic actions. Our proposed framework is compared with SOTA approaches in terms of accuracy and is extensively discussed and evaluated on four standard dynamic action datasets. The experimental results of our proposed method exceed those of recent solutions by 0.01% to 9.59% for arbitrary view action recognition.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Huong-Giang Doan
Faculty of Control and Automation, Electric Power University
235 Hoang Quoc Viet, Ha Noi, Viet Nam
Email: giangdth@epu.edu.vn

1. INTRODUCTION
Human activity recognition (HAR) has been an attractive field in computer vision for the past 40 years [1]-[3]. Nevertheless, it still faces many challenges because of limited data, various viewpoints, different scales, illumination conditions, complex backgrounds, and various modalities. To improve action recognition results, some researchers increase the amount of data using generative adversarial network (GAN) models such as [4]-[6]. Synthetic human action images are created by the generator of a GAN model and are similar to the training videos. Doan and Nguyen [7] generated hand images in multiple views with blender-based and hand-glove-based methods. Although this dataset is diverse in viewpoints and samples, it only provides static hand gestures.
Another approach that can improve action recognition results is to use multi-view cameras. Tran et al. [1] used the discriminant of the pairwise covariance of multi-view data to improve the robustness of HAR. This research finds a transformation of multi-view actions into a common space; a new action can then be projected into the learned common space. However, this approach is composed of discrete blocks and is difficult to deploy as an end-to-end solution. Nguyen and Nguyen [8] proposed a dynamic action recognition method with a residual network (ResNet)18 backbone, a (2+1)D architecture, a cross view attention (CVA) module and an augmentation strategy. This work improved action recognition using image sequences from multiple viewpoints, but it also requires many cameras in both the training and testing phases.

An end-to-end HAR solution at an arbitrary viewpoint is necessary to deploy a real HAR application because of its simpler testing environment setup. Zhang et al. [9] composed a view-invariant transfer dictionary and classifier for novel-view action recognition; two-dimensional (2D) videos are projected into a view-invariant sparse representation. Dictionary learning projection is a linear algorithm, which is quite a limitation. Gedamu et al. [10] proposed a method to recognize actions in a certain view, but this solution uses skeleton images and recognizes static actions. Stimulated by Tran et al. [5], Doan et al. [11] proposed an end-to-end framework for arbitrary view dynamic action recognition that combines the ArVi-MoCoGAN model, a 3D convolutional (C3D) block and an attention module. Both the generator and the discriminator of ArVi-MoCoGAN are used in the testing phase to create multi-view synthetic actions, which increases the computational complexity of the system. Furthermore, in that research, C3D is used as the 3D feature extractor of the multi-view synthetic videos. The extracted features are then used as inputs of the attention module to vote channel attention and create the final feature vector before the softmax layer. That attention module does not consider spatial features.

In this work we propose a new framework for arbitrary view HAR that exploits not only channel attention but also spatial attention. In addition, this work investigates and compares various three-dimensional convolutional neural network (3D CNN) extractors. In general, our research makes two contributions: (i) we propose a new arbitrary view gesture recognition method; and (ii) we investigate the arbitrary view HAR framework with various 3D CNN extractor backbones. The remainder of this paper is organized as follows: section 2 explains our proposed framework; the experimental results are analyzed and discussed in section 3; finally, section 4 presents the conclusion and future work.

2. PROPOSED METHOD
Our proposed dynamic action recognition method for unknown viewpoints is illustrated in Figure 1 and consists of four cascaded main blocks: (i) generation of synthetic videos from a real video with the ArVi-MoCoGAN model in [11]; (ii) feature extraction from the synthetic videos using various 3D CNN models; (iii) channel/viewpoint and spatial attention with the convolutional block attention module (CBAM) [12]; and (iv) classification.
Our framework is explained in sections 2.1 to 2.4. In addition, section 2.5 presents the multi-view datasets, the evaluation protocol and the setup parameters for the experiments.

Figure 1. Framework of arbitrary view dynamic action recognition
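To make the composition in Figure 1 concrete, the following is a minimal PyTorch-style sketch of how the four blocks could be chained. The module names (ArVi-MoCoGAN generator, 3D CNN extractor, CBAM) and the tensor shapes are our assumptions for illustration only; this is not the authors' released code.

```python
import torch
import torch.nn as nn

class ArbitraryViewHAR(nn.Module):
    """Sketch of the four-block pipeline in Figure 1 (hypothetical module names)."""
    def __init__(self, generator, extractor3d, cbam, num_classes, feat_dim=2048, num_views=5):
        super().__init__()
        self.generator = generator      # frozen ArVi-MoCoGAN generator: real video -> M fixed-view videos
        self.extractor3d = extractor3d  # shared 3D CNN backbone (C3D, ResNet50-3D, ...)
        self.cbam = cbam                # channel + spatial attention over the M view features
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.num_views = num_views

    def forward(self, real_video):                  # real_video: (B, C, N, H, W)
        with torch.no_grad():                       # the generator is frozen during HAR training
            synth = self.generator(real_video)      # assumed shape (B, M, C, N, H, W)
        feats = torch.stack(
            [self.extractor3d(synth[:, j]) for j in range(self.num_views)], dim=1
        )                                           # (B, M, K): one K-D vector per fixed view
        fused = self.cbam(feats)                    # (B, K): attended feature, Eq. (5)
        return self.classifier(fused)               # (B, num_classes) logits for the softmax layer
```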
2.1. Generation of the synthetic fixed videos
A common space is first created by the ArVi-MoCoGAN model, which is trained on the fixed real videos (M fixed cameras are set up and capture the training dataset). A novel real video is then projected into this common space to generate M synthetic videos at the M fixed viewpoints. This architecture is the same as the ArVi-MoCoGAN model presented in our previous research [11]. Given a real dynamic action at an unknown view $Z^{r}_{V_i} = [I^{(1)}_{V_i}, \dots, I^{(N)}_{V_i}]$, where $N$ is the number of frames of the arbitrary real video, the outputs of the trained ArVi-MoCoGAN model are the M synthetic videos $\{Z^{S}_{V_j} = [I^{S,(1)}_{V_j}, \dots, I^{S,(N)}_{V_j}],\ j = 1, \dots, M\}$ at the M fixed viewpoints, as illustrated in (1):

$$\mathcal{M}_{ArVi\text{-}MoCoGAN}(Z^{r}_{V_i}) = \{Z^{S}_{V_j},\ j = 1, \dots, M\} =
\begin{cases}
Z^{S}_{V_1} = [I^{S,(1)}_{V_1}, \dots, I^{S,(N)}_{V_1}] \\
Z^{S}_{V_2} = [I^{S,(1)}_{V_2}, \dots, I^{S,(N)}_{V_2}] \\
\dots \\
Z^{S}_{V_M} = [I^{S,(1)}_{V_M}, \dots, I^{S,(N)}_{V_M}]
\end{cases} \quad (1)$$

where M equals 5, 4, 7, and 3, corresponding to the number of fixed viewpoints of the MICA Ges, IXMAS, MuHAVi and NUMA datasets, respectively. Doan et al. [11] used both the synthetic videos and the view prediction probabilities of a novel real video on the fixed views for the HAR phase. In this work, only the M synthetic videos of the ArVi-MoCoGAN model are used as inputs of the 3D CNN feature extractors in the next step.

2.2. 3D CNN feature extraction
Doan et al. [11] only used the C3D network for feature extraction. In this research, four state-of-the-art (SOTA) 3D CNN models are used to compare the efficiency of our end-to-end HAR model: C3D [13], ResNet50-3D [14], RNN [15], [16], and spatial-temporal attention [17]. These models are deployed as follows:
- The C3D network is introduced in [13] with a visual geometry group (VGG) backbone and has become an efficient spatial-temporal 3D CNN method for action recognition. In this work, the 2048-D feature vector is extracted from the FC7 layer. We use the network with batch normalization after every Conv layer, pre-trained on Sports-1M and fine-tuned on the Kinetics dataset [18].
- ResNet50-3D is built with a ResNet50 backbone and 3D Conv layers. The spatial-temporal feature vector is taken after the global average pooling layer and has a dimension of 2048-D. The Kinetics pre-trained weights are applied as in [18].
- ResNet50-TP (ResNet50 with temporal attention) also uses the ResNet50 backbone and TP. The output feature vector of the ResNet50-TP model is 2048-D. The Kinetics pre-trained weights are applied as in [18].
- The recurrent neural network (RNN) architecture [19] combined with the InceptionNetV3 model [20] is applied as a dynamic action feature extractor. The dimension of the feature vector is 512-D. The model is fine-tuned on the Kinetics dataset [18] over all layers of this feature extractor.
In this paper, the 3D CNN model is used as a spatial-temporal feature extractor. The inputs of the 3D CNN extractors are the fixed multi-view synthetic videos $\{Z^{S}_{V_j} \mid j = 1, \dots, M\}$, which are the outputs of the ArVi-MoCoGAN model (section 2.1). The outputs of the 3D CNN extractor are the feature vectors $\{F^{3DCNN}_{V_j[1 \times K]} \mid j = 1, \dots, M\}$ as illustrated in (2):
$$\mathcal{M}_{3DCNN}(Z^{S}_{V_j},\ j = 1, \dots, M) =
\begin{cases}
\mathcal{M}^{V_1}_{3DCNN}(Z^{S}_{V_1}) = F^{3DCNN}_{V_1[1 \times K]} = [F^{(1)}_{V_1}, \dots, F^{(K)}_{V_1}] \\
\mathcal{M}^{V_2}_{3DCNN}(Z^{S}_{V_2}) = F^{3DCNN}_{V_2[1 \times K]} = [F^{(1)}_{V_2}, \dots, F^{(K)}_{V_2}] \\
\dots \\
\mathcal{M}^{V_M}_{3DCNN}(Z^{S}_{V_M}) = F^{3DCNN}_{V_M[1 \times K]} = [F^{(1)}_{V_M}, \dots, F^{(K)}_{V_M}]
\end{cases} \quad (2)$$

where K equals 512 for the RNN feature extractor and 2048 for the C3D, ResNet50-3D, and ResNet50-TP feature extractors.
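As an illustration of the per-view extraction in (2), the sketch below builds a 3D CNN feature extractor that maps each fixed-view synthetic video to a single K-D vector. It uses torchvision's r3d_18 only because it ships with the library; the paper's actual backbones (C3D, ResNet50-3D, ResNet50-TP, RNN with InceptionNetV3) and their exact heads are not reproduced here, so treat the backbone choice and output dimension as assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # stand-in 3D CNN; the paper uses C3D / ResNet50-3D / ResNet50-TP / RNN

class ViewFeatureExtractor(nn.Module):
    """Maps a fixed-view synthetic video Z^S_Vj to a K-D vector F_Vj, as in Eq. (2)."""
    def __init__(self):
        super().__init__()
        backbone = r3d_18(weights="KINETICS400_V1")  # Kinetics-pretrained (torchvision >= 0.13)
        # Drop the classification head; keep stem, residual stages and global average pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.out_dim = backbone.fc.in_features       # 512 for r3d_18 (2048 for the paper's ResNet50-3D / C3D)

    def forward(self, video):                        # video: (B, C, N, H, W)
        f = self.features(video)                     # (B, out_dim, 1, 1, 1) after global average pooling
        return f.flatten(1)                          # (B, out_dim): the feature vector F^{3DCNN}_Vj

# Sketch of usage on the M synthetic views produced by the ArVi-MoCoGAN generator:
# feats = torch.stack([extractor(synth[:, j]) for j in range(M)], dim=1)  # (B, M, K)
```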
2.3. Attention module
The feature vectors of the synthetic videos $\{Z^{S}_{V_j} \mid j = 1, \dots, M\}$ at the multiple viewpoints, $F^{3DCNN} = \{\mathcal{M}_{3DCNN}(Z^{S}_{V_j}) \mid j = 1, \dots, M\} = \{F^{3DCNN}_{V_j[1 \times K]} \mid j = 1, \dots, M\}$, are normalized and composed into $F_{[M \times 1 \times K]}$ as illustrated in (3):

$$F = [F^{3DCNN}_{V_1[1 \times K]}, \dots, F^{3DCNN}_{V_M[1 \times K]}] =
\begin{bmatrix}
F^{(1)}_{V_1} & F^{(1)}_{V_2} & \dots & F^{(1)}_{V_M} \\
F^{(2)}_{V_1} & F^{(2)}_{V_2} & \dots & F^{(2)}_{V_M} \\
\dots & \dots & \dots & \dots \\
F^{(K)}_{V_1} & F^{(K)}_{V_2} & \dots & F^{(K)}_{V_M}
\end{bmatrix} \quad (3)$$

The input of this module is $F_{[M \times 1 \times K]}$, where each feature vector element $F^{3DCNN}_{V_j[1 \times K]}$ of $F$ is considered as a channel of the channel attention module. The channel attention module infers a 1-D channel attention map $a_{c[M \times 1 \times 1]} = [a^{(1)}_{c}, a^{(2)}_{c}, \dots, a^{(M)}_{c}]$. The output of the channel attention part, $F_{c[1 \times K]}$, is calculated as illustrated in (4):

$$F_c = a_c \otimes F = \frac{\sum_{j=1}^{M} \left(a^{(j)}_{c} + 1\right) F^{3DCNN}_{V_j}}{M} \quad (4)$$

where $\otimes$ denotes element-wise multiplication. Each feature vector is paired with its attention value, which is copied along the spatial dimension and combined with the vector itself. $F_{c[1 \times K]}$ is then passed to the spatial attention module, which computes a spatial attention map $a_{s[1 \times 1 \times K]}$. This map is used to calculate the output of the CBAM module, $F_{CBAM} = F_{s[1 \times K]}$, as illustrated in (5):

$$F_{CBAM} = F_s = a_s \otimes F_c = a_s \otimes a_c \otimes F \quad (5)$$

2.4. Classification
The output feature vector $F_{CBAM}$ of the CBAM module in (5) is flattened and passed through a fully connected layer followed by a softmax layer for classification. The softmax cross-entropy loss is used to train and test the entire network. Given an arbitrary view dynamic action $Z_{V_i}$ ($i \neq j$), its predicted distribution is $\bar{p}_i$ and its ground truth is $p_i$. The loss function is calculated as illustrated in (6):

$$L_{softmax} = -\frac{1}{K} \sum_{i=1}^{K} p_i \log \bar{p}_i \quad (6)$$

2.5. Datasets, protocol and setup parameters
Datasets: four benchmark datasets are used, namely MICA Ges [21], IXMAS [3], MuHAVi [22], and NUMA [6], which contain 1,500, 1,584, 3,038, and 1,475 videos, respectively. They are multi-view dynamic action datasets and are described in detail in [11].
Protocol: the arbitrary view evaluation protocol in [11] is used to test our framework on each dataset. Each view is in turn held out and treated as the arbitrary viewpoint, while the remaining views are used as the fixed viewpoints. Testing follows this leave-one-view-out protocol over all viewpoints $V_j$, $j = 1, \dots, M$, to obtain the final result.
Setup parameters: our model is trained in two stages. First, the generator and discriminator of the ArVi-MoCoGAN model are trained. Then, all layers of the ArVi-MoCoGAN generator are frozen while training the arbitrary view dynamic action recognition framework shown in Figure 1. The learning rate is $5 \times 10^{-5}$, the optimizer is Adam, the batch size is 32, the loss function is cross-entropy, and the input image size is 224 x 224 pixels. Quantitative results are compared in section 3.

3. EXPERIMENTAL RESULTS
The evaluation schemes are written in Python with the PyTorch deep learning framework and run on a workstation with an NVIDIA GPU with 11 GB of memory.
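Since the framework is implemented in PyTorch, the channel (view) and spatial attention of (3)-(5) can be sketched as below. The paper does not specify the internal layers of its CBAM adaptation, so the shared-MLP structure, the use of average and max pooling, and the reduction ratio r are assumptions borrowed from the original CBAM design [12].

```python
import torch
import torch.nn as nn

class ViewCBAM(nn.Module):
    """Minimal sketch of the channel (view) + spatial attention in Eqs. (3)-(5).
    Layer structure and reduction ratio r are illustrative assumptions."""
    def __init__(self, num_views, feat_dim, r=16):
        super().__init__()
        # Channel/view attention: score the M view descriptors with a small shared MLP.
        self.view_mlp = nn.Sequential(
            nn.Linear(num_views, num_views), nn.ReLU(), nn.Linear(num_views, num_views)
        )
        # Spatial attention over the K positions of the fused feature vector.
        self.spatial_mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // r), nn.ReLU(), nn.Linear(feat_dim // r, feat_dim)
        )

    def forward(self, F):                                        # F: (B, M, K), Eq. (3)
        avg = F.mean(dim=2)                                      # (B, M) average-pooled view descriptors
        mx, _ = F.max(dim=2)                                     # (B, M) max-pooled view descriptors
        a_c = torch.sigmoid(self.view_mlp(avg) + self.view_mlp(mx))   # channel attention a_c: (B, M)
        # Eq. (4): weight each view by (a_c + 1) and average over the M views.
        F_c = ((a_c + 1).unsqueeze(-1) * F).mean(dim=1)          # (B, K)
        # Eq. (5): spatial attention a_s over the K positions, applied element-wise.
        a_s = torch.sigmoid(self.spatial_mlp(F_c))               # (B, K)
        return a_s * F_c                                         # (B, K) = F_CBAM
```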
The experiments are conducted to address the following questions: (i) the accuracy of our arbitrary view action recognition framework with various 3D CNN backbones; (ii) the parameters of the various arbitrary view action models; and (iii) a comparison of our best action recognition models with SOTA HAR methods.
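All experiments follow the leave-one-view-out protocol of section 2.5. A minimal sketch of this evaluation loop is given below; the callables split_by_view, train_framework and evaluate are hypothetical placeholders standing in for the dataset and training utilities, not released code.

```python
def leave_one_view_out(dataset, num_views, build_model, split_by_view, train_framework, evaluate):
    """Leave-one-view-out protocol (section 2.5); the passed-in callables are hypothetical."""
    accuracies = []
    for held_out in range(num_views):
        # The held-out camera acts as the arbitrary/novel viewpoint;
        # the remaining cameras are the fixed viewpoints used for training.
        train_set, test_set = split_by_view(dataset, arbitrary_view=held_out)
        model = build_model()
        train_framework(model, train_set)    # Adam, lr=5e-5, batch 32, cross-entropy, 224x224 inputs
        accuracies.append(evaluate(model, test_set))
    return sum(accuracies) / num_views       # averaged accuracy reported for the dataset
```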
3.1. Arbitrary view gesture recognition with various 3D CNN feature extractors
In this section, our novel view action recognition framework is evaluated with different 3D CNN backbones: C3D, ResNet50-3D, RNN, and ResNet50-TP. It is tested on the benchmark multi-view action datasets MICA Ges, NUMA, IXMAS, and MuHAVi. The results are presented in Figure 2. Our arbitrary view action recognition framework with the C3D backbone obtains the best accuracy on the MICA Ges, NUMA, IXMAS, and MuHAVi datasets at 96.6%, 92.79%, 93.01%, and 99.05%, respectively. These results are far higher than those of the remaining 3D CNN feature extractors (ResNet50-3D, RNN, and ResNet50-TP). The RNN feature extractor has the lowest accuracy at 72.54% on MICA Ges and 46.02% on NUMA, while the ResNet50-TP backbone achieves the lowest accuracy at 65.5% on IXMAS and 80% on MuHAVi. The arbitrary view action recognition framework is therefore compared on other factors in the following sections.

Figure 2. Accuracy of arbitrary view gesture recognition with various 3D CNN feature extractors

3.2. Summary of the arbitrary view dynamic action recognition models
This section summarizes the params, FLOPs, time cost and model size of the arbitrary view dynamic HAR models with different 3D CNN backbones trained on the MICA Ges dataset, as shown in Table 1. Params is the number of trained model parameters; FLOPs is the number of floating point operations required by the trained model; time cost is the total processing time from start to end; model size is the size of the container storing the trained model for a given dataset. The parameter calculation of the arbitrary view HAR system (Figure 1) is divided into two parts: the ArVi-MoCoGAN model (second row of Table 1) and the 3D CNN plus CBAM models (third to seventh rows of Table 1). The table highlights the following points:
- The ArVi-MoCoGAN model has 10.08 M params, 90.75 G FLOPs, 0.218 s time cost, and 40.39 MB model size. Its parameter count and model size are smaller, but its time cost and FLOPs are higher, than those of the remaining parts of the end-to-end HAR framework.
- Comparing the 3D CNN extractors combined with CBAM, C3D + CBAM (third row of Table 1) has the smallest time cost at only 0.105 s, which is clearly smaller than 0.177 s for ResNet50-3D + CBAM, 0.211 s for ResNet50-TP + CBAM and 0.171 s for RNN + CBAM, while its params, FLOPs and model size are larger than those of the remaining 3D CNN backbones at 73.37 M, 38.70 G, and 293.52 MB, respectively. Although the C3D extractor has high params, FLOPs and model size, it has the smallest time cost and the best HAR accuracy (section 3.1), which makes it a worthwhile trade-off for a real application.
As a result, our end-to-end arbitrary view HAR system using C3D has a total time cost of 0.323 s, equivalent to about 3 fps, while it obtains the best accuracy of 96.6% on the MICA Ges dataset. These results are acceptable for deploying a real application.
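The paper does not state how the params, FLOPs and time-cost figures in Table 1 were measured. One common way to obtain comparable numbers in PyTorch is a profiler such as thop, as in the hypothetical sketch below; the input clip shape is an assumption.

```python
import time
import torch
from thop import profile  # common FLOPs/params profiler; an assumption, the paper does not name its tooling

def profile_model(model, input_shape=(1, 3, 16, 224, 224)):
    """Return params (M), FLOPs (G) and single-clip inference time (s) for a 3D CNN model."""
    x = torch.randn(*input_shape)                 # one 16-frame clip at 224x224 (assumed clip length)
    flops, params = profile(model, inputs=(x,))   # multiply-accumulate count and parameter count
    start = time.time()
    with torch.no_grad():
        model(x)                                  # single forward pass to estimate time cost
    return params / 1e6, flops / 1e9, time.time() - start
```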
3.3. Comparison with the SOTA arbitrary view gesture recognition
In this section, we compare our best accuracy results with some SOTA methods on four benchmark datasets, as shown in Table 2. Table 2 shows that our method obtains higher accuracy on three published datasets than recent HAR methods: 93.01% on IXMAS is larger than 87.25% in
[11], 79.4% in [23] and 79.9% in [24]. On the MICA Ges dataset, our method reaches 96.6%, which is better than [8], [11], [25] by 3.72% to 7.89% in accuracy. On the MuHAVi dataset, our approach also obtains the largest accuracy at 99.05%, which is 0.78% higher than [11] and 5.45% higher than [23]. On the NUMA dataset, our accuracy reaches 92.79%, which is slightly smaller than [11] and [25] by 1.72% and 1.02%, while it is clearly better than the remaining methods in [8], [26]-[28] by 0.01% to 9.59%. This result once again indicates that our proposed solution is more efficient than recent methods in dynamic action recognition accuracy.

Table 1. Parameters of the arbitrary view dynamic action recognition models trained on the MICA Ges dataset
                         Params (M)   FLOPs (G)   Time cost (s)   Model size (MB)
ArVi-MoCoGAN             10.08        90.75       0.218           40.39
C3D + CBAM               73.37        38.70       0.105           293.52
ResNet50-3D + CBAM       55.43        10.15       0.177           222.04
ResNet50-TP + CBAM       23.55        17.39       0.211           94.51
RNN + CBAM               28.78        17.47       0.171           115.38

Table 2. Comparison of arbitrary view action recognition accuracy (%) with SOTA methods
                              IXMAS    MICA Ges   MuHAVi   NUMA
WLE [24]                      79.9     -          -        -
SAM [26]                      -        -          -        83.2
TSN [29]                      -        -          -        90.3
DA-Net [27]                   -        -          -        92.1
Multi-Br TSN-GRU [25]         -        88.71      -        93.81
R34(2+1)D with CVA [8]        -        91.71      -        92.78
DA + ELM + aug [23]           79.4     -          93.6     -
ViewCon + MOCOv2 [28]         -        -          -        91.7
ArVi-MoCoGAN + C3D [11]       87.25    92.88      98.27    94.51
Ours                          93.01    96.60      99.05    92.79

4. CONCLUSION
In this research, a new arbitrary view HAR framework is proposed that combines a cascade of blocks: an ArVi-MoCoGAN network, 3D CNN feature extractors and a CBAM unit. Our method is deployed and evaluated with various 3D CNN models: C3D, ResNet50-3D, ResNet50-TP, and RNN. Experiments on different benchmark datasets show that the C3D backbone obtains the best accuracy. In addition, our proposed framework achieves higher efficiency than SOTA novel view action recognition methods on most benchmark datasets, by up to 9.59%.

REFERENCES
[1] H.-N. Tran, H.-Q. Nguyen, H.-G. Doan, T.-H. Tran, T.-L. Le, and H. Vu, "Pairwise-covariance multi-view discriminant analysis for robust cross-view human action recognition," IEEE Access, vol. 9, pp. 76097-76111, 2021, doi: 10.1109/ACCESS.2021.3082142.
[2] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, "Hand gesture recognition with 3D convolutional neural networks," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 1-7, doi: 10.1109/CVPRW.2015.7301342.
[3] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 249-257, 2006, doi: 10.1016/j.cviu.2006.07.013.
[4] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, "MoCoGAN: decomposing motion and content for video generation," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 1526-1535, doi: 10.1109/CVPR.2018.00165.
[5] T. H. Tran, V. D. Bach, and H. G. Doan, "vi-MoCoGAN: a variant of MoCoGAN for video generation of human hand gestures under different viewpoints," in Proceedings of the Pattern Recognition: ACPR, 2020, vol. 1180 CCIS, pp. 110-123, doi: 10.1007/978-981-15-3651-9_11.
[6] L. Wang, Z. Ding, Z. Tao, Y. Liu, and Y.
Fu, "Generative multi-view human action recognition," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2019, pp. 6211-6220, doi: 10.1109/ICCV.2019.00631.
[7] H.-G. Doan and N.-T. Nguyen, "New blender-based augmentation method with quantitative evaluation of CNNs for hand gesture recognition," Indonesian Journal of Electrical Engineering and Computer Science (IJEECS), vol. 30, no. 2, pp. 796-806, May 2023, doi: 10.11591/ijeecs.v30.i2.pp796-806.
[8] H.-T. Nguyen and T.-O. Nguyen, "Attention-based network for effective action recognition from multi-view video," Procedia Computer Science, vol. 192, pp. 971-980, 2021, doi: 10.1016/j.procs.2021.08.100.
[9] J. Zhang, H. P. H. Shum, J. Han, and L. Shao, "Action recognition from arbitrary views using transferable dictionary learning," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4709-4723, Oct. 2018, doi: 10.1109/TIP.2018.2836323.
[10] K. Gedamu, Y. Ji, Y. Yang, L. Gao, and H. T. Shen, "Arbitrary-view human action recognition via novel-view action generation," Pattern Recognition, vol. 118, p. 108043, Oct. 2021, doi: 10.1016/j.patcog.2021.108043.
[11] H.-G. Doan, H.-Q. Luong, and T. T. T. Pham, "An end-to-end model of ArVi-MoCoGAN and C3D with attention unit for arbitrary-view dynamic gesture recognition," International Journal of Advanced Computer Science and Applications, vol. 15, no. 3, 2024, doi: 10.14569/IJACSA.2024.01503122.
[12] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, "CBAM: convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3-19, doi: 10.1007/978-3-030-01234-2_1.
[13] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in 2015 IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 4489-4497, doi: 10.1109/ICCV.2015.510.
[14] K. Hara, H. Kataoka, and Y. Satoh, "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 6546-6555, doi: 10.1109/CVPR.2018.00685.
[15] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997, doi: 10.1109/78.650093.
[16] J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition," in Computer Vision - ECCV 2016: 14th European Conference, 2016, pp. 816-833, doi: 10.1007/978-3-319-46487-9_50.
[17] J. Wang and X. Wen, "A spatio-temporal attention convolution block for action recognition," Journal of Physics: Conference Series, vol. 1651, no. 1, p. 012193, Nov. 2020, doi: 10.1088/1742-6596/1651/1/012193.
[18] W. Kay et al., "The Kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017. [Online]. Available: http://arxiv.org/abs/1705.06950.
[19] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv preprint arXiv:1402.1128, 2014. [Online]. Available: http://arxiv.org/abs/1402.1128.
[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 2818-2826, doi: 10.1109/CVPR.2016.308.
[21] H. G. Doan et al., "Multi-view discriminant analysis for dynamic hand gesture recognition," in Pattern Recognition. ACPR 2019. Communications in Computer and Information Science, 2020, pp. 196-210, doi: 10.1007/978-981-15-3651-9_18.
[22] F. Murtaza, M. H. Yousaf, and S. A. Velastin, "Multi-view human action recognition using 2D motion templates based on MHIs and their HOG description," IET Computer Vision, vol. 10, no. 7, pp. 758-767, Oct. 2016, doi: 10.1049/iet-cvi.2015.0416.
[23] N. Nida, M. H. Yousaf, A. Irtaza, and S. A. Velastin, "Video augmentation technique for human action recognition using genetic algorithm," ETRI Journal, vol. 44, no. 2, pp. 327-338, Apr. 2022, doi: 10.4218/etrij.2019-0510.
[24] J. Liu, M. Shah, B. Kuipers, and S.
Savarese, "Cross-view action recognition via view knowledge transfer," in Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2011, pp. 3209-3216, doi: 10.1109/CVPR.2011.5995729.
[25] A.-V. Bui and T.-O. Nguyen, "Multi-view human action recognition based on TSN architecture integrated with GRU," Procedia Computer Science, vol. 176, pp. 948-955, 2020, doi: 10.1016/j.procs.2020.09.090.
[26] S. Mambou, O. Krejcar, K. Kuca, and A. Selamat, "Novel cross-view human action model recognition based on the powerful view-invariant features technique," Future Internet, vol. 10, no. 9, pp. 1-17, 2018, doi: 10.3390/10090089.
[27] D. Wang, W. Ouyang, W. Li, and D. Xu, "Dividing and aggregating network for multi-view action recognition," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 457-473, doi: 10.1007/978-3-030-01240-3_28.
[28] K. Shah, A. Shah, C. P. Lau, C. M. de Melo, and R. Chellappa, "Multi-view action recognition using contrastive learning," in 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Jan. 2023, pp. 3370-3380, doi: 10.1109/WACV56688.2023.00338.
[29] L. Wang et al., "Temporal segment networks: towards good practices for deep action recognition," in European Conference on Computer Vision (ECCV), 2016, pp. 20-36.

BIOGRAPHIES OF AUTHORS
Anh-Dung Ho received a B.E. degree in Applied Mathematics and Informatics in 2001 and an M.E. in Computer Science in 2007, both from Hanoi University of Science and Technology, Ha Noi, Viet Nam. He can be contacted at email: dungha@eaut.edu.vn.
Huong-Giang Doan received a B.E. degree in Instrumentation and Industrial Informatics in 2003, an M.E. in Instrumentation and Automatic Control Systems in 2006, and a Ph.D. in Control Engineering and Automation in 2017, all from Hanoi University of Science and Technology, Ha Noi, Viet Nam. She can be contacted at email: giangdth@epu.edu.vn.