Indonesian Journal of Electrical Engineering and Computer Science
Vol. 16, No. 2, November 2019, pp. 827-834
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v16i2.pp827-834

An optimization of facial feature point detection program by using several types of convolutional neural network

Shyota Shindo 1, Takaaki Goto 2, Tadaaki Kirishima 3, Kensei Tsuchida 4
1,3,4 Toyo University, 2100 Kujirai, Kawagoe, Saitama, Japan
2 Ryutsu Keizai University, 3-2-1 Shin-Matsudo, Matsudo, Chiba, Japan

Article Info
Article history:
Received Jan 17, 2019
Revised Apr 7, 2019
Accepted May 10, 2019

Keywords:
Facial feature point detection
Neural network
Convolutional neural network

ABSTRACT
Detection of facial feature points is an important technique used for biometric authentication and facial expression estimation. A facial feature point is a local point indicating both ends of the eyes, the holes of the nose, and the end points of the mouth in a face image. Much research on facial feature point detection has been done so far, and the accuracy of facial organ point detection is improving through approaches using Convolutional Neural Networks (CNNs). However, a CNN not only takes time to train but also becomes a complicated model, so it is necessary to improve learning time and detection accuracy. In this research, detection accuracy and learning speed are improved by increasing the number of convolution layers.

Copyright © 2019 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Takaaki Goto, Ryutsu Keizai University,
3-2-1 Shin-Matsudo, Matsudo, Chiba, Japan.
Email: tg@gotolab.net

1. INTRODUCTION
A facial feature point is a local point indicating a place such as an eye corner or a mouth corner in a facial image.
The detection of facial feature points is applied to important technologies such as facial expression estimation and biometric authentication using face images. Many detection methods have been proposed so far, but with the advent of Convolutional Neural Networks (CNNs) in recent years, much research on CNN-based detection methods has been conducted, and detection with higher accuracy is expected [1]. However, CNN training takes time: if the network is deep and the amount of training data is large, the learning time becomes huge. Methods for speeding up learning include using a high-performance GPU and hardware measures such as adding main memory. Software-side methods also exist [2, 3, 4]. Among them, a few methods [5] devise pre-processing for the input data. In this paper, we aim to improve the preprocessing of the input data and speed up the learning of a facial feature point detection program using a CNN. The CNN was implemented in Python with reference to the program of Yamashita et al. [6]. We propose a method to reduce the number of layers of the CNN by applying a Laplacian filter in preprocessing, thereby reducing the image features.

2. RELATED WORKS
Facial feature points can be obtained by various methods, such as CNNs and classical image processing. As a conventional method, Cootes et al.'s Active Appearance Model (AAM) is available [7]. In this method, the average Shape is obtained by using the coordinate points of the face images of all the training data,

Journal homepage: http://iaescore.com/journals/index.php/ijeecs
and the average face is obtained by using the pixel values. Principal component analysis is performed using the coordinate points of this Shape and the pixel values within the Shape, and the amount of change is obtained. Appearance can show the features of the frontal face, and Shape can express the orientation and shape of the face. By combining the two, it becomes possible to create a face image that can respond to changes in face orientation and shape. That is, in order to obtain the facial feature points of an input face image, Shape and Appearance are updated by the gradient descent method. However, this method cannot deal with unknown image data, resulting in low accuracy. As an image processing method, Vukadinovic et al. independently detect each feature point using a Gabor filter (a filter that extracts the direction of lines in an image) [8]. A recently used method is the CNN. Since winning the object recognition category of ILSVRC in 2012, CNNs have received great attention. As a facial feature point detection method using a CNN, there is a method by Kimura et al. [9]. In this method, facial organ points are detected by learning input values as 100×100 grayscale images and teaching data as the coordinate values of the facial organ points in the images. This makes it possible to cope with unknown image data. There is also a method of creating an optimum mini-batch in CNN mini-batch learning [10]. Minagawa et al. do not use a CNN but a DNN for detection [11]. In this proposed method, points within a certain range of an existing correct feature point are marked as learning samples in the image, and a transfer vector representing the relative position from the feature point to each learning sample is used.
However, with this approach, it is necessary to use a separate DNN for each organ, and the accuracy is reduced. Conventional methods either detect each feature independently by image processing or use a CNN. Although there is a method using a DNN, complicated processing is involved. In this research, we aim to propose a new method which is more accurate and faster than the conventional methods. In addition, we compare execution time and detection accuracy with detection by an effective CNN for facial feature point detection. In [4], the authors search for an object in the image, clip it out, and make it an input value to a CNN, which judges whether it is a face or not (R-CNN); that paper is research on accelerating R-CNN. Our method seeks facial organ points in image data from which the face has already been cut out. [1] proposes the face authentication system that Facebook made; it differs from our research in that the face orientation is detected and an affine transformation is performed. In [6], the proposed method regresses the coordinate values of facial organ points. [12] shows that performance improves when facial organ point regression and classification of attributes such as wearing glasses are learned simultaneously (TCDCN). As in the model of [13], there is research that applied a Laplacian filter in preprocessing; however, this is a technique for reducing the variability of the input pattern for a face detection program and does not mention learning speed. [14] describes facial feature point detection using a Gabor filter; its method differs from this research. In paper [15], face recognition is used in an alarm system; a method of extracting a feature quantity by a histogram is used for face recognition. In paper [16], the Enhanced Local Binary Pattern (EnLBP) is applied to compress the image, which is stored in a database.
The authors of that paper have proposed a method to recognize faces by comparing the saved images with EnLBP-processed versions of input images.

3. RESEARCH METHOD
Python is used as the programming language for the machine learning conducted in this research. Many machine learning support libraries are provided for Python, but we do not use them, so that the comparison of execution times does not become ambiguous. Table 1 shows the PC environment in which machine learning was performed. In addition, Python has CUDA libraries that allow calculation on a GPU, but in this research we have not used the GPU at all.

Table 1. Environment
Item     Spec
CPU      Intel(R) Celeron(R) 2957U, 1.40 GHz, Multi-Core
Memory   4.00 GB
First of all, the CNN with the structure of Figure 1, which is the baseline in this research, was implemented.

Figure 1. The structure of CNN [10]

3.1. Implementation of Facial Feature Point Detection by CNN
As shown in Figure 1, the CNN hierarchically applies convolution layers and pooling layers after the input image, and then passes through a fully connected layer. For the structure and activation function of the CNN in this study, we refer to [10]; the convolution layers and pooling layers are as shown in Figure 1. Because the CNN performs supervised learning, it uses the squared error between the output value and the teacher data as the loss function. Furthermore, it transmits the error back to the input layer by the error backpropagation method and updates the weights by the gradient descent method. We use the Labeled Faces in the Wild (LFW) data set. The LFW data set cuts out an image with the detection range from the forehead to the lower jaw of the face image and annotates the image after normalizing the cut-out image to 100×100. The coordinate values and the clipping range are publicly available. The annotation contains a total of 10 points: 4 at the outer and inner corners of both eyes, 2 at the bottom of the nose, and 4 at both ends of the lips and above and below them. Since the learning amount is not enough with only the LFW data set, 1500 images are subjected to data augmentation to increase the number of training images to 20000. The data augmentation performed is noise addition, translation of 5 pixels up, down, left and right, averaging by a mean filter, and sharpening by a sharpening filter. As learning methods, batch learning, stochastic gradient descent, and mini-batch learning can be used; in this research, mini-batch learning is adopted.
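The data augmentation described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the noise level, the wrap-around border handling of the translations, and the 3×3 filter sizes are all assumptions.

```python
import numpy as np

def augment(img, rng):
    """Return augmented variants of a grayscale face image (values 0-255):
    additive noise, +/-5 pixel translations, mean-filter smoothing, and
    sharpening, as listed in the paper."""
    out = []
    # Additive Gaussian noise (sigma = 8 is an arbitrary choice here).
    out.append(np.clip(img + rng.normal(0, 8, img.shape), 0, 255))
    # Translations of 5 pixels up/down/left/right (wrap-around borders;
    # the paper does not specify how borders are handled).
    for shift, axis in [(5, 0), (-5, 0), (5, 1), (-5, 1)]:
        out.append(np.roll(img, shift, axis=axis))
    # 3x3 mean filter implemented as an average of shifted copies.
    mean = sum(np.roll(np.roll(img, dy, 0), dx, 1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
    out.append(mean)
    # Unsharp-mask style sharpening: original plus its high-frequency part.
    out.append(np.clip(2 * img - mean, 0, 255))
    return out

rng = np.random.default_rng(0)
face = rng.integers(0, 256, (100, 100)).astype(float)
variants = augment(face, rng)
print(len(variants))  # 7 variants per input image
```

Applied to all 1500 base images, a handful of such variants per image is consistent with the roughly 20000 training images mentioned in the text.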
A mini-batch is created by random selection from the images obtained by data augmentation. In addition, we use the same mini-batches in every epoch of learning, without shuffling the order. Each mini-batch contains 20 images, and the error accumulated during the gradient descent method is divided by the number of mini-batches. The learning rate is 1e-4 (0.0001), the number of epochs is 20, and updating is done 20,000 times. Next, the structure of the CNN will be explained. First, as the input value, an image is clipped to the range given in the LFW data set, normalized to a size of 100×100, and converted to grayscale. Next, convolution layers and pooling layers are arranged alternately, three of each. The filter size of each convolution layer is 9×9, and the stride of the convolution operation is 1 pixel. The activation function is Maxout over two adjacent feature maps. The numbers of filters are 16, 32, and 64, halved by Maxout to 8, 16, and 32 feature maps. Each pooling layer performs max-pooling with filter size 2×2. Finally, the fully connected layer and the output layer each consist of one layer, and the feature maps from the last pooling layer are flattened into one dimension as input. In this paper, the number of input values is 1152, and the number of output values is 20. The number of outputs is the pair of coordinate values x and y for each annotated point, i.e. twice the number of feature points. The output value is between 0 and 1, and the activation function is a linear combination. Since the output is between 0 and 1, the teacher data are divided by 1000, and at detection time the coordinates are obtained by multiplying the output values by 1000. Also, to prevent overfitting, Dropout is applied in all fully connected layers. The probability of Dropout is 50%, independent for each image in the mini-batch.
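The Maxout activation over two adjacent feature maps, which halves the channel count and produces the 1152 values entering the fully connected layer, can be sketched as follows; the exact pairing of maps is an assumption.

```python
import numpy as np

def maxout_pairs(fmaps):
    """Element-wise maximum over adjacent pairs of feature maps:
    (2k, H, W) -> (k, H, W), e.g. 16 convolution outputs -> 8 maps."""
    assert fmaps.shape[0] % 2 == 0
    return np.maximum(fmaps[0::2], fmaps[1::2])

# 16 toy feature maps of size 4x4.
x = np.arange(16 * 4 * 4, dtype=float).reshape(16, 4, 4)
y = maxout_pairs(x)
print(y.shape)  # (8, 4, 4)

# The final pooling output of 32 maps of size 6x6 flattens to the
# 1152 fully connected inputs mentioned in the text.
print(32 * 6 * 6)  # 1152
```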
The structure of the CNN is shown in Table 2.
Table 2. Details of CNN
Layer                   Size of images   Filters, number of weights   Activation function
Input layer             100×100
Convolutional layer 1   92×92            16                           Maxout
Pooling layer 1         46×46
Zero padding            48×48
Convolutional layer 2   40×40            8 → 32                       Maxout
Pooling layer 2         20×20
Convolutional layer 3   12×12            16 → 64                      Maxout
Pooling layer 3         6×6
Fully connected layer   1152
Output layer                                                          Linear combination

As a result, the execution time was about 61 hours. However, with only 20 epochs, the error converged only to about 0.012. Detection accuracy on the test data was 96%. Since learning was carried out for only 20 epochs this time, the accuracy deteriorated considerably; the convergence of the error was quite slow, but it gradually became smaller, so if 300 thousand updates were done, it is expected that the error could converge to almost 0. For the neural network devised in this research, the first goal is to make the convergence of errors faster and better than this result. The second goal is a network that keeps converging to a smaller error beyond 20 epochs.

Figure 2. A result of facial feature detection by CNN (Photographic images are obtained from [17])

3.2. Outline of proposed neural network
We propose a neural network which learns images with a DNN or CNN after applying the Laplacian filter. The Laplacian filter is a kind of filter used in image processing, and it makes it possible to extract from an image only the portions where the difference in luminance values is drastic. By utilizing this property, the amount of information in the input image is reduced, and the structure of the neural network is simplified.
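A minimal sketch of this preprocessing step, assuming the standard 8-neighbour 3×3 Laplacian kernel and the zero padding described below (100×100 padded to 102×102 so the filtered output keeps the input size):

```python
import numpy as np

# 8-neighbour 3x3 Laplacian kernel; it sums to zero, so flat regions
# give no response and only luminance edges remain.
LAPLACIAN = np.array([[1,  1, 1],
                      [1, -8, 1],
                      [1,  1, 1]], dtype=float)

def laplacian_filter(img):
    """Zero-pad the image by one pixel on each side (100x100 -> 102x102)
    and apply the 3x3 kernel so the output keeps the input size."""
    padded = np.pad(img, 1)          # constant zero padding
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

img = np.ones((100, 100))
edges = laplacian_filter(img)
print(edges.shape)    # (100, 100)
print(edges[50, 50])  # 0.0 -- the flat interior gives no response
```

Since the kernel is symmetric, correlation and convolution coincide here; only the image borders of a flat image respond, because the zero padding creates an artificial luminance step.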
There are two reasons why we decided to use the Laplacian filter: (1) with the Laplacian filter it is not necessary to consider differences in skin color due to race, and (2) facial feature points such as the corners of the eyes and the corners of the mouth can be discriminated by the human eye from the outline alone, so we assumed that machine learning can also discriminate them in the same way. In order to use images processed by the Laplacian filter for the learning of the neural network, all input images were processed with the Laplacian filter. The coefficients of the filter are as follows:

             | 1   1   1 |
    Filter = | 1  -8   1 |
             | 1   1   1 |

Instead of processing the input image resized to 100×100 as it is, the input image is zero-padded to a size of 102×102, and then the filtering is performed. Figure 3 shows input images after applying the Laplacian filter.
Figure 3. Input images after applying the Laplacian filter (Photographic images are obtained from [17])

Learning was done with neural networks using these images. As preprocessing, the filtered input image is reduced by two pooling layers and flattened into a one-dimensional input vector. The structure of the neural network is shown in Figure 4.

Figure 4. Framework of the beta type neural network (pooling layer 1: 100×100 → 50×50; pooling layer 2: 50×50 → 25×25; followed by the neural network)

4. RESULTS AND DISCUSSION
We implemented and evaluated the beta type neural network with various configurations.

4.1. Beta type 5-layer DNN
We implemented and verified 3-layer, 4-layer and 5-layer DNNs. Table 3 shows the beta type 5-layer DNN, which finally converged best. Learning was conducted with three intermediate layers in order to further improve discrimination power. Also, although both Maxout and Sigmoid had been used as activation functions up to that point, Maxout gives better error convergence even when layers are added, so the activation function is fixed to Maxout. The composition of each layer is summarized in Table 3. The initial values range from 1e-3 to -1e-3, and the learning rate is 1e-2.
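The input-side reduction shown in Figure 4, two 2×2 max-poolings taking the 100×100 filtered image down to the 25×25 = 625 input units listed in Table 3, can be sketched as:

```python
import numpy as np

def max_pool_2x2(img):
    """2x2 max pooling with stride 2: halves each spatial dimension."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# 100x100 -> 50x50 -> 25x25 = 625 values feeding the DNN input layer.
img = np.random.default_rng(0).random((100, 100))
reduced = max_pool_2x2(max_pool_2x2(img))
print(reduced.shape, reduced.size)  # (25, 25) 625
```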
Table 3. Framework of beta type 5-layer DNN
Layer                  Size of images, number of units   Activation function   Probability of Dropout
Pooling layer 1        100×100
Pooling layer 2        50×50
Input layer            625
Intermediate layer 1   600                               Maxout                50%
Intermediate layer 2   500                               Maxout                25%
Intermediate layer 3   400                               Maxout                10%
Output layer           20                                Linear combination

The execution time was about 2 hours when the number of units was at its minimum and about 4 hours at its maximum. The error did not decrease below roughly 0.01, and sometimes it was as large as 0.02. As a result of increasing the number of intermediate layers, error convergence improved considerably. However, since the error grows for some input images, we found that the discrimination power is still weak. Therefore, we added convolution layers and considered learning with a CNN, and ran another experiment detecting with a CNN. As the input value, the input image resized to 50×50 is used. Then, convolution layers and pooling layers are repeated several times before passing to the fully connected layers. The structure of the CNN is shown in Figure 5.

Figure 5. The structure of the beta type CNN (50×50 input → convolutional layers → pooling layers → fully connected layers)

4.2. Beta type CNN with 2-layer convolution layer
This time, the input image is passed directly to the convolution layer without being reduced by a pooling layer. The number of outputs of the convolution layers increases gradually. The fully connected layers use the previous DNN. The filter size of the convolution layers is 11×11, the initial values range from 1e-3 to -1e-3, and the learning rate is 5e-3. The composition of each layer is summarized in Table 4.
Table 4. Framework of beta type CNN with 2-layer convolution layer
Layer                            Channels × height × width   Activation function   Probability of Dropout
Input layer                      50×50
Convolution layer 1              50 × 40 × 40                Maxout
Pooling layer 1                  25 × 20 × 20
Convolution layer 2              40 × 10 × 10                Maxout
Pooling layer 2                  20 × 5 × 5
Input of fully connected layer   500                                               50%
Intermediate layer 1             300                         Maxout                25%
Intermediate layer 2             200                         Maxout                10%
Output layer                     20                          Linear combination
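The per-layer Dropout probabilities listed in Tables 3 and 4 can be sketched as follows. Whether the original implementation rescales the surviving activations at training time (inverted dropout, as here) or scales at test time instead is an assumption.

```python
import numpy as np

def dropout(units, p_drop, rng, train=True):
    """Inverted dropout: randomly zero units with probability p_drop and
    rescale the survivors by 1/(1 - p_drop) so the expected activation
    is unchanged; at test time the layer is left untouched."""
    if not train or p_drop == 0.0:
        return units
    mask = rng.random(units.shape) >= p_drop
    return units * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones(1000)
out = dropout(h, 0.5, rng)
# Each unit is either dropped (0) or kept and rescaled (1 / 0.5 = 2).
print(sorted(set(out.tolist())))  # [0.0, 2.0]
```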
The execution time was about 6 hours when the number of convolution outputs and intermediate-layer units was at the minimum and about 16 hours at the maximum, and the error converged to 0.02. The more convolution layers are added, the better the error convergence. From this result, it was found that error convergence improved considerably when a layer was added to the CNN rather than to the DNN. However, it also turned out that the execution time increased significantly. From this, it is expected that the beta type CNN can be sped up considerably: with two convolution layers, if we increase the number of convolution outputs further, it will be a faster and more accurate detector than the baseline CNN. However, in order to raise the accuracy without increasing the execution time any more, learning was performed by further reducing the feature amount of the input image. The reduction sets small pixel values of the input image to 0. Even if small pixel values are removed, the corners of the eyes can still be recognized by the human eye, so learning was carried out with three patterns, setting pixel values of 10 or less, 20 or less, and 50 or less to 0. The execution converges to about 0.02 when values of 10 or less are set to 0; when values of 20 or less are set to 0, the error converged to about 0.013; and when values of 50 or less are set to 0, the error converged to about 0.03. From this result, very good convergence was obtained by setting values of 20 or less to 0. With this method, the error had already converged to about 0.013 by 10 epochs and did not become smaller thereafter.
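The pixel-value reduction described above can be sketched as:

```python
import numpy as np

def suppress_small_pixels(img, threshold):
    """Set every pixel value at or below `threshold` to 0; thresholds of
    10, 20 and 50 were tried in the text, and 20 converged best."""
    out = img.copy()
    out[out <= threshold] = 0
    return out

img = np.array([[5.0, 15.0, 25.0],
                [60.0, 20.0, 100.0]])
print(suppress_small_pixels(img, 20))
# [[  0.   0.  25.]
#  [ 60.   0. 100.]]
```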
Table 5. Result of learning time and convergence of error for each NN
Network                 Maximum learning time   Convergence of error
Beta type NN            10 m                    Failed
Beta type 3-layer DNN   1 h 30 m                Failed
Beta type 4-layer DNN   3 h                     Failed
Beta type 5-layer DNN   4 h                     Almost success
Beta type CNN           22 h                    Success

As shown in Table 5, the beta type NN failed to learn with all patterns. Of the beta type DNNs, the 3-layer and 4-layer ones failed to learn. The 5-layer learning results were quite good, but not good enough. The beta type CNN succeeded in learning. Compared with the baseline CNN not preprocessed with the Laplacian filter, the learning time of the beta type CNN with preprocessing was about 2.8 times faster, and the detection accuracy was 1% lower, as shown in Table 6.

Table 6. The accuracy comparison
                Baseline CNN   Beta type CNN
Learning time   61 hours       22 hours
Accuracy        96%            95%

5. CONCLUSION
In this research, facial feature points were detected by a CNN implemented in the programming language Python. Then, we attempted to create a neural network with a faster learning time and higher precision than this CNN. Input images are processed by a Laplacian filter, and learning is then done with a neural network. The key idea is that the image filtering is performed first; the learning time decreased drastically compared with the neural network without the Laplacian preprocessing. For future work, it is conceivable to add improvements such as tuning the Laplacian filter or increasing the number of convolution layers of the CNN.

REFERENCES
[1] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. 1701-1708.
[2] D. Triantafyllidou and A. Tefas, "A fast deep convolutional neural network for face detection in big visual data," in Advances in Big Data, P. Angelov, Y. Manolopoulos, L. Iliadis, A. Roy, and M. Vellasco, Eds. Springer International Publishing, 2017, pp. 61-70.
[3] D. Triantafyllidou, P. Nousi, and A. Tefas, "Fast deep convolutional face detection in the wild exploiting hard sample mining," Big Data Research, vol. 11, pp. 65-76, 2018, selected papers from the 2nd INNS Conference on Big Data: Big Data and Neural Networks. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2214579617300096
[4] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497, 2015. [Online]. Available: http://arxiv.org/abs/1506.01497
[5] Q. Gao, P. Forster, K. R. Mobus, and G. S. Moschytz, "Fingerprint recognition using CNNs: fingerprint preprocessing," in ISCAS 2001. The 2001 IEEE International Symposium on Circuits and Systems (Cat. No.01CH37196), vol. 3, May 2001, pp. 433-436.
[6] T. Yamashita, T. Watasue, Y. Yamauchi, and H. Fujiyoshi, "Facial point detection using convolutional neural network transferred from a heterogeneous task," in 2015 IEEE International Conference on Image Processing (ICIP), Sep. 2015, pp. 2725-2729.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Computer Vision - ECCV'98, H. Burkhardt and B. Neumann, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 484-498.
[8] D. Vukadinovic and M. Pantic, "Fully automatic facial feature point detection using Gabor feature based boosted classifiers," in 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 2, Oct 2005, pp. 1692-1698.
[9] M. Kimura, H. Fukui, T. Yamashita, Y. Yamauchi, and H. Fujiyoshi, "Facial point detection based on deep convolutional neural network with optimal mini-batch," in Technical Report of IEICE, Technical Committee on Cloud Network Robotics (CNR), vol. 114, no. 455. The Institute of Electronics, Information and Communication Engineers, Feb. 2015, pp. 87-88 (in Japanese). [Online]. Available: https://ci.nii.ac.jp/naid/110010014760/
[10] T. Yamashita, M. Kimura, H. Fukui, Y. Yamauchi, and H. Fujiyoshi, "Optimal mini-batch procedure for facial point detection based on a deep convolutional neural network," in The 21st Symposium on Sensing via Image Information, 2015 (in Japanese).
[11] Y. Minagawa, M. Abe, and Q. Zhao, "Automatic face feature extraction based on neural networks," in SICE Tohoku 284, vol. 284, no. 2, Nov. 2013, pp. 1-4 (in Japanese).
[12] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, "Learning deep representation for face alignment with auxiliary attributes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 5, pp. 918-930, May 2016.
[13] C. Garcia and M. Delakis, "Convolutional face finder: a neural architecture for fast and robust face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408-1423, Nov 2004.
[14] K. Sudhakar and P. Nithyanandam, "An accurate facial component detection using Gabor filter," Bulletin of Electrical Engineering and Informatics, vol. 6, no. 3, pp. 287-294, September 2017.
[15] Ri Cerd Ng, Kian Ming Lim, Chin Poo Lee, and Siti Fatimah Abdul Razak, "Surveillance system with motion and face detection using histograms of oriented gradients," Indonesian Journal of Electrical Engineering and Computer Science, vol. 14, no. 2, pp. 869-876, May 2019.
[16] Srinivasa Perumal Ramalingam, Nadesh R. K., and SenthilKumar N. C., "Robust face recognition using enhanced local binary pattern," Bulletin of Electrical Engineering and Informatics, vol. 7, no. 1, pp. 96-101, March 2018.
[17] LFW, "Labeled Faces in the Wild (LFW)," http://vis-www.cs.umass.edu/lfw/.