International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 6, December 2018, pp. 4693-4704
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i6.pp4693-4704

Deep Belief Networks for Recognizing Handwriting Captured by Leap Motion Controller

Abas Setiawan 1,2 and Reza Pulungan 1
1 Dept. of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Indonesia
2 Dept. of Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Semarang, Indonesia

Article Info
Article history:
Received December 29, 2017
Revised May 30, 2018
Accepted July 13, 2018

Keywords:
Handwriting
Leap Motion
Deep belief network
Accuracy
Resilient backpropagation

ABSTRACT
Leap Motion controller is an input device that can track hand and finger positions quickly and precisely. In some gaming environments, a need may arise to capture letters written in the air with Leap Motion, which currently cannot be done directly. In this paper, we propose an approach to capture and recognize which letter has been drawn by the user with Leap Motion. This approach is based on Deep Belief Networks (DBN) with Resilient Backpropagation (Rprop) fine-tuning. To assess the performance of our proposed approach, we conduct experiments involving 30,000 samples of handwritten capital letters, 8,000 of which are to be recognized. Our experiments indicate that DBN with Rprop achieves an accuracy of 99.71%, which is better than DBN with Backpropagation or Multi-Layer Perceptron (MLP), either with Backpropagation or with Rprop. Our experiments also show that Rprop makes the process of fine-tuning significantly faster and results in much more accurate recognition than ordinary Backpropagation.
The time needed to recognize a letter is in the order of 5,000 microseconds, which is excellent even for an online gaming experience.

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Reza Pulungan
Department of Computer Science and Electronics, Universitas Gadjah Mada, Indonesia
Gedung KPTU FMIPA UGM, Sekip Utara, Bulaksumur, Sleman, Yogyakarta, 55281, Indonesia
+6282136261216
Email: pulungan@ugm.ac.id

1. INTRODUCTION
Computer-aided handwriting recognition can be performed either on-line or off-line. Typically, on-line handwriting recognition is based on specific input devices, such as a real-time capturing digital pen, a finger-touch screen, or mouse movement. In the recent decade, such input devices have been complemented by computer vision technology that enables recognition of natural hand or body gestures, as seen in Kinect-based handwriting recognition [1]. Huang et al. [2] introduce digit recognition on Kinect by using Multiple Segment and Scaled Coding with 94.6% accuracy. Other methods are proposed by using a Depth-Skin Background Mixture Model for hand segmentation [3] and Dynamic Time Warping combined with Support Vector Machines (SVM) for digit recognition [1]. Kinect has the ability to recognize body shape or hand movements, but it faces difficulties in obtaining precise finger positions compared to the Leap Motion controller. The Leap Motion controller (or simply Leap Motion) is a new computer vision device that can track the position of fingers more precisely. The 3D finger data can be obtained at over 100 frames per second [4], making it possible to efficiently track the position of each finger in detail. However, Leap Motion still faces some limitations: it cannot directly recognize a specific sequence of finger strokes (movements) performed when drawing a letter or a digit in the air.
Several algorithms have been proposed to recognize these sequences of strokes and obtain a better accuracy. Vikram et al. develop handwriting recognition based on Leap Motion by using the Dynamic Time Warping algorithm [4]. Agarwal et al. [5] present segmentation and recognition of 3D handwritten text captured by Leap Motion. A Hidden Markov Model (HMM) is used to train segmented words and achieves an accuracy of 77.6%. Another algorithm for rapid recognition of hand graphical gestures is SVM. SVM is used to recognize 100 testing samples with an average recognition rate of 82.4% [6].

Journal Homepage: http://iaescore.com/journals/index.php/IJECE
Research shows that algorithms with shallow architectures (i.e., Decision Tree [7], Gaussian Mixture Models (GMM), HMM, Conditional Random Fields (CRF), Multi-Layer Perceptron (MLP) [8], and SVM [9]) are effective in solving many simple or well-constrained problems. However, these algorithms may face difficulties in complicated real-world applications involving natural signals, such as human speech, natural sound and language, natural images, and visual scenes [10], because of their modeling limitations. A deep architecture algorithm able to recognize handwritten digits captured by Leap Motion is available, namely Convolutional Neural Networks (CNN). However, it is hard to train a deep MLP (and hence also a CNN) without GPU parallelization [11]. CNN has been used to simulate a 3D digit recognition model that can be projected as 2D in a 3D environment on Leap Motion tracking data [12]. This method achieves an accuracy of 92.4%. In this paper, we propose to use Deep Belief Networks (DBN) to solve this recognition problem. DBN is a deep architecture algorithm and consists of two phases: greedy layer-wise unsupervised pre-training and discriminative fine-tuning [13, 14, 15]. Pre-training, initialized by stacked Restricted Boltzmann Machines, can generate a better result compared to a purely random configuration of the MLP. After the pre-training phase, the DBN can be treated simply as an MLP and then trained with Backpropagation to enhance its weights [14, 15]. Alternatively, Resilient Backpropagation can be used to train the network instead of ordinary Backpropagation.

2. PRELIMINARIES
2.1. Leap Motion controller
Leap Motion is an input device that allows users to recognize and track hands, fingers, and finger-like tools. Leap Motion is developed by Michael Buckwald and David Holz at Leap Motion, Inc.
Leap Motion is able to track 3D finger data precisely and obtains them at over 100 frames per second [4]. Besides, the Leap Motion device is also cheaper compared to other similar computer-vision devices. The dimension of the device is 1.2 × 3 × 7.6 cm, and it uses optical sensors and infrared light, as shown in Figure 1 [16]. Leap Motion's sensors can detect hands or fingers across the upward y-axis and possess a 150° field of view with an effective range of 25 to 600 millimeters [17].

Figure 1. Detail of the Leap Motion's sensors [16]

To develop applications that use Leap Motion as an input device, we can make use of an available API that can be accessed via various programming languages. Leap Motion API version 2.0 [17] introduces a New Skeletal Tracking Model that provides additional information for arms and fingers and overall tracking enhancement. In order to use Leap Motion, it must first be connected to an existing USB port and positioned beside the monitor at a distance within the cable's range. The Leap Motion device will detect gestures and the position as well as the movement of hands or fingers as the user positions her hands in its range. To retrieve the desired hand tracking data, the Leap Motion API provides an update set or a frame of data. Every single frame object from the API represents a frame that contains a list of tracking entities called motion tracking data. Motion tracking data, such as hands, fingers, and tools, can also be accessed through the API.

2.2. Restricted Boltzmann machines
A Boltzmann machine is a stochastic recurrent neural network with binary stochastic units. There are two types of layers in Boltzmann machines: the visible layer and the hidden layer. Visible units or neurons on the visible layer are connected to other units and receive information from the environment. Visible units also bring the input signal from the environment and send it to other units during training.
Hidden units, on the other hand, operate freely and extract features from the input signal. Interconnections of units in Boltzmann machines can occur even within a single layer. Unfortunately, learning in Boltzmann machines is often impractical, and furthermore they also face
scalability issues. In order to make learning in Boltzmann machines easier, restrictions on connectivity must be enforced. The resulting model is called a Restricted Boltzmann Machine (RBM) [18]. In an RBM, there is no connection between units of the same layer. An RBM is composed of a visible layer and a hidden layer of stochastic neurons that are connected by symmetrical weights (i.e., there is no weight toward itself). Hidden units in an RBM are considered feature detectors.

2.3. Deep belief networks
A deep belief network is a generative model consisting of multiple stochastic layers. A stochastic layer has several latent variables. Latent variables typically have binary values and are called hidden units or feature detectors [13]. Each DBN layer is an RBM, and in order to create many layers, several RBMs are stacked on top of each other to construct a DBN. Hinton et al. [13] propose an idea to learn one layer at a time and restrict the connectivity of binary stochastic units in order to make learning more efficient and simpler [18]. DBN training begins with RBM training: starting from input samples and then sequentially progressing toward the next RBM training. Generated patterns formed by the top RBM can deliver a reversed propagation to the input layer by using conditional probability, as in a belief network or sigmoid belief network. This training procedure is called the DBN's greedy layer-wise training [13]. A DBN can work with or without supervision. Although a DBN can be used in supervised or discriminative mode, greedy layer-wise training is basically an unsupervised training of each layer of stacked RBMs. It is called a pre-training procedure [19, 20, 21].
The purpose of the pre-training procedure on each layer's training is to put the parameters of all layers in a region of parameter ranges that contains a good generalization of a local optimum obtainable by local gradient descent [20, 21]. In an RBM, the joint distribution p(v, h; w) of visible units v and hidden units h with parameters w is defined by the energy function E(v, h; w). For Bernoulli-visible units that lead to Bernoulli-hidden units, the energy function is defined by:

E(v, h; w) = −∑_{i=1}^{I} ∑_{j=1}^{J} w_ij v_i h_j − ∑_{i=1}^{I} b_i v_i − ∑_{j=1}^{J} a_j h_j,

where w_ij represents the symmetrical interaction between visible unit v_i and hidden unit h_j, b_i and a_j are their respective bias terms, and I and J are the numbers of visible and hidden units, respectively. Therefore, the conditional probabilities can be calculated by:

p(h_j = 1 | v; w) = φ(∑_{i=1}^{I} v_i w_ij + a_j), and   (1)

p(v_i = 1 | h; w) = φ(∑_{j=1}^{J} h_j w_ij + b_i),   (2)

where φ(x) = 1/(1 + exp(−x)). A greedy procedure on the stacked RBMs is used to carry out maximum likelihood learning. Equation (3) is the learning rule for updating the weights of each RBM, following the gradient of the log-likelihood:

Δw_ij = E_data(v_i h_j) − E_model(v_i h_j),   (3)

where E_data(v_i h_j) is the expectation observed on the training data and E_model(v_i h_j) is the expectation under the distribution defined by the model. Calculating E_model(v_i h_j) is intractable, so the Contrastive Divergence (CD) algorithm is used to solve this problem [13]. The CD algorithm performs Gibbs sampling to replace E_model(v_i h_j) in one or more steps. CD-1, namely the CD algorithm that performs only one step of Gibbs sampling on the data, is used to train the model. The following are the steps of CD-1. First, input sample x and the hidden units are used to compute p(h_j = 1 | v; w).
Second, units are activated by comparing the sample result from the previous step with random values to obtain the binary state of each unit. Third, those binary units are reconstructed by calculating p(v_i = 1 | h; w). Fourth, the output probability of the reconstruction is calculated, and the reconstruction result becomes an input. Fifth, after all collections of statistics from the previous steps are obtained, the weights of the RBM are updated. After completing the pre-training procedure in the DBN, the next phase is to perform discriminative fine-tuning with Backpropagation [13, 14]. Fine-tuning with Backpropagation is carried out by adding a final output layer, which is a labeled class excluded from the pre-training procedure.
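The CD-1 steps above can be sketched as follows. The paper's implementation is in C#; this is a minimal NumPy sketch assuming Bernoulli visible and hidden units, mini-batches of binary row vectors, and a plain gradient step (momentum and weight decay, introduced later in Section 3.3.2, are omitted); all function and variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    # phi(x) = 1 / (1 + exp(-x)), used for both conditional probabilities
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, rng, lr=0.5):
    """One CD-1 weight update for a Bernoulli-Bernoulli RBM.

    v0 : (batch, I) binary visible data
    W  : (I, J) weights; a : (J,) hidden biases; b : (I,) visible biases
    """
    # First: p(h_j = 1 | v, w) for the data
    ph0 = sigmoid(v0 @ W + a)
    # Second: stochastic binary hidden states, by comparing with random values
    h0 = (ph0 > rng.random(ph0.shape)).astype(float)
    # Third: reconstruct the visible units from p(v_i = 1 | h, w)
    pv1 = sigmoid(h0 @ W.T + b)
    # Fourth: hidden probabilities of the reconstruction
    ph1 = sigmoid(pv1 @ W + a)
    # Fifth: gradient <v h>_data - <v h>_recon, averaged over the batch
    grad = (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    return W + lr * grad
```

One CD-1 epoch applies this update once per mini-batch of the training data.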
3. PROPOSED APPROACH
Figure 2 depicts the architecture of our proposed approach for in-air handwriting recognition using Leap Motion. The proposed approach is developed based on the idea laid out in [22]. Data acquisition is a process of collecting every letter sample by using Leap Motion. A captured letter sample can be used for recognition, or else it can be stored in a dataset. The letter dataset is split into two parts, namely: training and testing datasets. The training dataset will be incorporated into a learning system by using several proposed algorithms; this basically constitutes the training process. After the training process has been completed, several model algorithms will be constructed. These model algorithms are then used for testing (by using the testing dataset) in order to obtain and compare their accuracy. The best model algorithm is then selected for further recognition of letter input data captured by Leap Motion.

Figure 2. Our proposed approach

3.1. Data acquisition
Data acquisition is a mechanism to obtain handwritten sample data by using Leap Motion. A sample data is a sequence of information about the movement of the hand (inputs only use one hand and one finger) in the air as captured by Leap Motion. The motion of the finger draws a letter stroke, which is then displayed on the screen of the simulation application integrated with the Leap Motion API. Figure 3 depicts the flow for creating one sample data using the existing API.
Figure 3. Data acquisition workflow

Leap Motion initialization. Initialization is performed to connect the simulation application to the API. In the simulation application, the user is given information about what letter should be drawn with the Leap Motion. As the user starts drawing the letter in the air, we can access the captured Frame object of the Leap Motion through the API.

2D mapping. Through the Frame we can access the Finger object, and hence the tip positions of all fingers can be obtained through the API. However, we do not need the positions of all fingers, but only the dominant finger, as when pointing with the forefinger. The dominant finger can be obtained via a property of the Finger object, namely Frontmost. In addition, in order to obtain the position of the dominant finger's tip with minimal jitter, a property named StabilizedTipPosition can be used. Because of 2D mapping, the z-position of the forefinger produced by StabilizedTipPosition is ignored, and the x- and y-positions are then denoted by x_pos and y_pos, respectively.

Normalizing stabilized position. The InteractionBox property of Frame is an object in the API to ensure that the user's finger-hand always stays in the device's visibility range. While the finger-hand is inside the InteractionBox, movements of the fingers will be detected in 3D coordinates, as shown in Figure 4 (left). When the application is mapped to 2D coordinates, the InteractionBox is as shown in Figure 4 (right).
Figure 4. Field-of-view of the (left) 3D and the (right) 2D InteractionBox [17]

The InteractionBox simplifies the mapping of every point of the finger's tip position by normalizing the range of captured hands from left to right or top to bottom to values between 0 and 1, relative to the application screen. If w_app and h_app are the width and height, respectively, of the 2D field in the application screen, the InteractionBox provides the normalized position of the final stroke relative to the application's coordinates (x_app, y_app), which is given by:

x_app = NP(x_pos) · w_app and y_app = (1 − NP(y_pos)) · h_app,

where NP is a normalization function in InteractionBox called normalizePoint.

Touch emulation. From the positions of all strokes representing a single letter, a sequence of strokes S = (S_1, S_2, S_3, S_4, ..., S_m) is formed, where each S_i = (x_app, y_app) for 1 ≤ i ≤ m, and m is the number of strokes required to draw the letter. Strokes can be in two states: hovering and touching. Not all strokes are used; those that are in the hovering state can simply be ignored, as illustrated in Figure 5 (left).

Figure 5. Touch emulation: the difference between hovering and touching [17]

The state is determined by a normalized distance in the range of +1 to −1. First, the finger enters the hovering state at a normalized distance of +1, and the distance decreases toward 0 as the finger nears the touching surface. When the finger penetrates the surface, the normalized distance decreases toward the maximum of −1, which indicates that it is now in the touching state. Stroke points are only recorded when the finger's tip is in the touching state.
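The mapping and touch-emulation rules above can be sketched as follows. This is a plain-Python stand-in for the paper's C# code; the normalized point (the output of normalizePoint) and the touch distance are assumed to be supplied by the Leap Motion API, and the function names are illustrative.

```python
def to_app_coords(np_x, np_y, w_app, h_app):
    """Map a normalized InteractionBox point (components in [0, 1]) to
    application screen coordinates: x_app = NP(x_pos) * w_app and
    y_app = (1 - NP(y_pos)) * h_app; the y component is flipped because
    screen coordinates grow downward."""
    return np_x * w_app, (1.0 - np_y) * h_app

def is_touching(touch_distance):
    """Hovering: normalized distance in (0, +1]; touching: in [-1, 0].
    Stroke points are recorded only while touching."""
    return touch_distance <= 0.0
```

For example, a fingertip normalized to (0.5, 0.25) on a 336 × 336 screen maps to (168.0, 252.0), and a touch distance of −0.3 means the point is recorded.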
Image matrix creation. The tracing in the previous process is related to the tile-based editor of the simulation application. Tile-based editors are often developed in game technology, specifically to create the game environment [23, 24]. In this research, all tiles are of the same width and height, namely 12 pixels, and the tile-map is a 28 × 28 tile matrix. If the tile-map is considered a representation of an image, each tile will have an intensity value and an index position on the tile-map. Each tile indicates a pixel of the image drawn by the user in the air. Therefore, in order to select a tile that corresponds to the user's creation, a fixed stroke point position with the hovering or touching state is used. Figure 6 provides an illustration of how the selection of the tile-based position for each stroke captured by Leap Motion is done. In Figure 6, m_t is the map-top distance, m_l is the map-left distance, T_s is the size of a tile, and (x_t, y_t) is the selected index of a tile. The sequence of strokes S will be converted into an indexed (vector) sequence of strokes V = (V_1, V_2, V_3, V_4, ..., V_n), with V_i = (x_i, y_i) for 1 ≤ i ≤ n, where n is the vector size of the 28 × 28 image matrix, namely 784. Each V_i has an intensity value of 0 or 1, determined by:

V_i = 1 if d > 0, and 0 otherwise,

where d is the touch distance, which defines whether the finger is in the hovering or touching state. The coordinates (x_t, y_t) of each stroke point are then defined by:

x_t = ⌊(x_app − m_l) / T_s⌋ and y_t = ⌊(y_app − m_t) / T_s⌋.
Figure 6. Selecting an indexed tile on the tile-map (example: x_app = 218, y_app = 158, m_l = m_t = 50 px, T_s = 12 px, giving (x_t, y_t) = (14, 9) in the 28 × 28 matrix)

In the end, this phase basically converts an image (tile) matrix obtained by the Leap Motion into a vector with binary elements representing the image.
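The tile selection and flattening just described can be sketched as follows. This is a Python stand-in for the paper's C# code; row-major flattening of the 28 × 28 tile-map into the 784-element vector is an assumption, and the function names are illustrative.

```python
def tile_index(x_app, y_app, m_l=50, m_t=50, t_s=12):
    """x_t = floor((x_app - m_l) / T_s), y_t = floor((y_app - m_t) / T_s)."""
    return int((x_app - m_l) // t_s), int((y_app - m_t) // t_s)

def mark_tile(vector, x_t, y_t, width=28):
    """Set tile (x_t, y_t) in the flattened 784-element binary vector
    (row-major: element index y_t * 28 + x_t)."""
    vector[y_t * width + x_t] = 1

# Worked example from Figure 6: the stroke point (218, 158) with
# m_l = m_t = 50 px and T_s = 12 px selects tile (14, 9).
```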
Image matrix dilation. Sometimes, when the user writes a letter in the air by using Leap Motion too quickly, the handwritten letter becomes unclear. To clarify the handwritten letter, a dilation operation can be applied to the image matrix. Dilation simply takes several specific neighboring points of a point and gives them an intensity value of 1, as depicted in Figure 7.

Figure 7. Image matrix dilation for letter "D"

3.2. Data, dataset, and data preparation
Each instance of data acquisition creates a single sample data. A single sample data is represented by a 784-size vector, which encodes the 28 × 28 matrix produced in image matrix creation. The data is used for two purposes: as an addition to the dataset collection used as training or testing dataset with labeled data, and in the recognition process, where the label will be searched out based on the previously trained learning algorithm. In this research, 22,000 training samples and 8,000 testing samples are collected from 10 respondents. The respondents are undergraduate students at Universitas Dian Nuswantoro, Semarang, Indonesia, aged 20-21 years old. Each of them produces around 3,000 handwritings with Leap Motion, captured within one month. The training and testing datasets are restricted to upper-case letters from A to Z. Because the learning task is supervised, the dataset used for the DBN is in the form [(V_1, L_1), (V_2, L_2), ..., (V_k, L_k)]. For each letter sample, the input data is a vector representing the sequence of strokes V_i for 1 ≤ i ≤ k, whose label is L_i, and k is the number of samples in the dataset. All elements of input vector V_i are of binary value.
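The dilation operation described above (Figure 7) can be sketched as follows. The paper dilates "several specific neighboring points"; for illustration, this sketch uses the full 3 × 3 neighborhood, which is an assumption, as the exact neighbor set is not stated.

```python
def dilate(img):
    """Binary dilation on a tile-map: every tile in the 3x3 window
    around each 'on' tile also gets intensity 1 (clipped at borders)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] == 1:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out
```

A single 'on' tile in the middle of the map thus becomes a 3 × 3 block, thickening thin strokes drawn too quickly.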
Encoding the label L_i, which has a value from "A" to "Z", such that L_i ∈ {0, 1, 2, ..., 25}, cannot be fed directly into a DBN. In order to map L_i to an element of a vector, a one-hot encoding mechanism is used. In one-hot encoding, a label L_i containing the value "A" is encoded by [1, 0, 0, ..., 0], "B" is encoded by [0, 1, 0, ..., 0], etc.

3.3. Learning system
This research makes use of Deep Belief Networks (DBN) in a supervised fashion. The DBN basically learns the model F_w : {V} → {L}, which maps an input vector V into a labeled vector L with parameter weights w. Figure 8(a) shows the architecture of a DBN with two hidden layers, h1 and h2. Two main processes exist in learning this DBN model: firstly, the process of greedy layer-by-layer training of each stacked RBM by using the Contrastive Divergence algorithm, which is called pre-training; secondly, the process of fine-tuning the entire network using Resilient Backpropagation, or Rprop.

3.3.1. RBM initialization and construction
As described earlier, a DBN consists of stacked RBMs [13]; therefore, each RBM layer first needs to be initialized. Figure 8(b) shows a DBN constructed by two layers of RBMs. The following is the pre-training process,
Figure 8. (a) A DBN model with Rprop, (b) Constructing two RBMs in a DBN, (c) Fine-tuning the DBN with Rprop

which uses an unsupervised learning algorithm on the DBN to construct the stacked RBMs:
Step 0: In the first RBM, visible layer v1 gets direct input from sample V.
Step 1: Train the RBM's visible (v1) and hidden (h1) layers to adjust weights w1.
Step 2: After the training is completed, use the pattern activity of the first RBM's hidden layer h1 as input to visible layer v2 to train the second RBM.
Step 3: By copying h1 as v2 in the second RBM, train visible layer v2 and hidden layer h2 to adjust weights w2.
Note that in this process, class labels of the data are not involved.

3.3.2. Training each RBM layer
Every RBM is trained by using Contrastive Divergence with one step of Gibbs sampling (CD-1). CD-1 follows the gradient of a difference function called the Kullback-Leibler divergence. Here are the steps of CD-1:
Step 1: While the maximum epoch is not reached, do steps 2 to 7 repeatedly.
Step 2: Use data sample V as visible layer v.
Step 3: Update the hidden layer by calculating p(h_j = 1 | v; w) (cf. Equation (1)), then activate the stochastic binary neurons h_j on the hidden layer by sampling from the probability p(h_j = 1 | v; w).
Step 4: Reconstruct visible neurons v1_i by calculating p(v_i = 1 | h; w) (cf. Equation (2)), using the values of the stochastic binary neurons calculated before as input.
Step 5: Calculate the output probability of h1_j from the previous reconstruction step, namely p(h1_j = 1 | v1; w).
Step 6: Calculate the weight change (cf. Equation (3)) by using the learning rate ε as follows:

Δw_ij = ε(⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_1),

where ⟨v_i h_j⟩_0 = E_data(v_i h_j) and ⟨v_i h_j⟩_1 = E_model(v_i h_j).
Step 7: Update the new weights by adding momentum α and weight decay wd as follows:

w_grad = ⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_1, and
Δw_ij(t) = α · Δw_ij(t − 1) + ε · (w_grad − wd · w_ij(t − 1)).

The learning rate is a parameter that influences the learning speed of the algorithm in reaching convergence. Momentum is a parameter used to prevent the system from being trapped in local minima. Weight decay works by adding an "additional term" to the normal gradient; the additional term is the derivative of a function that penalizes excessive weights [25]. The process of updating the weights of the RBM follows the update rules of gradient descent. In the case of large-scale datasets, batch gradient descent requires quite expensive computation, since it takes only one step for the entire training data. Therefore, we use a more efficient learning mode that learns the dataset in predetermined batches. This mode is called mini-batch gradient descent [25].

3.3.3. Fine-tuning with Rprop
Because the DBN is to be used in supervised mode (after the pre-training process is completed), which involves label classes, the weights have to be fine-tuned. Fine-tuning is conducted by training the network with the Backpropagation algorithm [13, 14]. In this research, besides fine-tuning with Backpropagation, we also apply the Resilient Backpropagation (Rprop) algorithm, as shown in Figure 8(c).

Theoretically, the Rprop algorithm is more efficient in terms of learning speed compared to ordinary Backpropagation. The difference between Rprop and Backpropagation lies in the weight update, where Rprop makes use of the sign of the last gradient update [26]. In the Rprop algorithm, we use the weights and biases obtained from the pre-training process, namely w_ij = w1 (from input to the first hidden layer) and w_jk = w2 (from the first hidden layer to the second hidden layer). The weights and bias connections between the second hidden layer and the output layer use random initialization in the interval (−1, 1). We initialize the maximum epoch and set the values Δ_0 = 0.0125, Δ_max = 50.0, Δ_min = 10^−6, η+ = 1.2, and η− = 0.5.

4. RESULT AND ANALYSIS
A learning algorithm naturally minimizes the cost function in order to obtain an optimal result. When the algorithm is being trained on some training dataset, we are basically creating a model algorithm that closely reflects the available data. Afterwards, the obtained model algorithm can be applied to the testing dataset to assess the performance of the algorithm.

4.1. Experimental setup
All methods and algorithms in our proposed approach are implemented in C#. All experiments are conducted using this implementation on a computer with a Core(TM) i7-2600 CPU @ 3.40 GHz (8 CPUs), 3.7 GHz. Learning is performed with parallel computation on a logical core for each CPU. For our experiments, we perform data acquisition that produces 30,000 samples of capital letters, partitioned into two groups: 22,000 samples as the training dataset and 8,000 samples as the testing dataset.
There are about 843 samples for each letter from A to Z in the training dataset, whereas in the testing dataset, each letter has about 307 samples. Every single data in the training dataset will be used as an input for the learning algorithm. In order to assess the performance of our proposed approach, we not only conduct experiments on DBN with Rprop fine-tuning, but also on three other approaches for the purpose of comparison. The three approaches are MLP with Backpropagation (Bp), MLP with Rprop, and DBN with Backpropagation. Table 1 shows the experimental configuration we use for each of those approaches.

Table 1. Experimental configurations

Dataset | Architecture | Pre-training | Fine-tuning
Letter | 784-400-100-26 (MLP) | - | Bp (Ep.: 300)
Letter | 784-400-100-26 (MLP) | - | Rprop (Ep.: 300)
Letter | 784-400-100-26 (DBN) | CD-1 (Ep.: 300) | Bp (Ep.: 300)
Letter | 784-400-100-26 (DBN) | CD-1 (Ep.: 300) | Rprop (Ep.: 300)
Digit (MNIST) | 400-400-100-10 (MLP) | - | Bp (Ep.: 500)
Digit (MNIST) | 400-400-100-10 (MLP) | - | Rprop (Ep.: 500)
Digit (MNIST) | 400-400-100-10 (DBN) | CD-1 (Ep.: 300) | Bp (Ep.: 500)
Digit (MNIST) | 400-400-100-10 (DBN) | CD-1 (Ep.: 300) | Rprop (Ep.: 500)

Table 1 shows that the neural network architecture we use when dealing with the letter dataset is 784-400-100-26. In this architecture, the input is a 784-sized vector, which is a flattening of the 28 × 28 tile-map matrix for each letter sample. Both MLP and DBN use 400 and 100 neurons in the first and second hidden layers, respectively. The size of the output vector is adjusted to the number of letter labels, namely 26 classes, with the one-hot encoding mechanism. Besides using the letter dataset captured by our own data acquisition mechanism, we also make use of the digit dataset from the MNIST database [27]. The MNIST digit dataset consists of 60,000 training samples and 10,000 testing samples. Each sample in the MNIST digit dataset is a grayscale image. To use them, it is necessary first to perform preprocessing by transforming the grayscale into binary images.
Afterwards, width normalization is carried out by resizing the original 28×28 pixels to 20×20 pixels. The 20×20 tile-map matrix is then flattened into an input vector of size 400, as shown in the last four rows of Table 1. Similar to the letter configuration, in this digit configuration both MLP and DBN use 400 and 100 neurons in the first and second hidden layers, respectively. Because the digit labels range from 0 to 9, the size of the output vector is 10, again with a one-hot encoding.
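The MNIST preprocessing described above (binarization, width normalization from 28×28 to 20×20, and flattening to a 400-dimensional input vector) can be sketched as follows, together with the one-hot label encoding. The binarization threshold and the nearest-neighbor downsampling are our illustrative assumptions; the paper does not specify them:

```python
import numpy as np

def preprocess_digit(image_28x28, threshold=128):
    """Binarize a 28x28 grayscale digit and normalize it to a flat 400-vector.

    `threshold` and nearest-neighbor resampling are illustrative choices;
    the original pipeline may differ.
    """
    binary = (np.asarray(image_28x28) >= threshold).astype(np.float32)
    # Nearest-neighbor downsampling from 28x28 to 20x20.
    idx = np.arange(20) * 28 // 20
    small = binary[np.ix_(idx, idx)]
    return small.reshape(400)  # input vector for the 400-400-100-10 network

def one_hot(label, num_classes=10):
    """Encode a digit label 0..9 as a 10-dimensional one-hot target vector."""
    target = np.zeros(num_classes, dtype=np.float32)
    target[label] = 1.0
    return target

x = preprocess_digit(np.full((28, 28), 200))  # an all-"ink" dummy image
t = one_hot(3)
```

The same flatten-and-one-hot scheme applies to the letter dataset, only with a 28×28 input (784-vector) and 26 output classes.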
Both for the letter and the MNIST digit datasets, we use the same configuration in the pre-training phase of the DBN. There are two RBM architectures in the letter configuration: the first RBM is 784-400 and the second RBM is 400-100. In the MNIST digit configuration, the first RBM architecture is 400-400 and the second is 400-100. There is no pre-training phase in the MLP. Each RBM in the letter and MNIST digit configurations is trained with the CD-1 algorithm. Based on our experiments, we select the same initialization throughout, namely learning rate = 0.5, momentum = 0.5, and weight decay = 0.001. The maximum epoch for each RBM is set to 300. The batch size for each RBM is set to 100, which means that the 22,000 training samples in the dataset are processed 100 at a time.

Backpropagation, for fine-tuning the DBN or training the MLP, uses learning rate = 0.05, momentum = 0.5, and batch size = 100. Unlike Backpropagation, Rprop needs neither a learning rate nor a momentum, but instead the constants Δ_0, Δ_max, Δ_min, η^+, and η^- (cf. Section 3.3.3.). The epoch limit for Rprop is set to 300 for the letter and 500 for the MNIST digit configurations.

4.2. Testing and recognition

Table 2 shows the accuracy and the computational time required to recognize each letter (from the testing dataset) by the four different methods. Recall that there are about 307 samples for each letter in the testing dataset.

Table 2. Accuracy and computational times for the letter (testing) dataset

Lab   MLP-Bp                      MLP-Rprop                   DBN-Bp                      DBN-Rprop
      Acc.    t (μs)    stdev     Acc.    t (μs)    stdev     Acc.    t (μs)    stdev     Acc.    t (μs)    stdev
A     99.35   4,572.2   1,081.2   100     5,305.1   1,718.6   100     5,117.5   1,392.5   100     3,278.2   811.7
B     92.86   5,740.0   1,534.4   99.68   4,622.1   918.5     99.03   5,369.4   1,416.5   100     5,235.6   2,074.0
C     94.81   3,727.2   912.2     100     5,337.8   1,117.6   99.35   5,126.9   1,684.3   100     3,922.0   2,140.9
D     92.88   2,972.8   628.1     100     4,991.5   1,229.1   98.71   4,124.2   996.1     100     5,668.1   1,935.6
E     92.51   3,151.1   751.4     100     4,428.3   968.1     99.35   3,718.4   941.1     100     5,212.0   1,156.5
F     97.73   3,265.4   724.0     99.68   5,219.6   1,224.0   99.03   4,776.6   1,213.6   99.35   4,602.7   1,556.9
G     77.67   4,862.7   1,100.5   100     5,068.3   1,262.4   98.38   3,558.6   958.8     99.68   5,459.3   1,452.2
H     90.29   4,893.7   1,163.0   99.68   4,205.8   935.3     99.68   3,139.1   819.7     99.68   4,549.9   1,634.3
I     90.58   5,055.4   1,231.4   97.40   3,573.4   1,095.7   95.45   3,714.9   1,205.8   98.38   5,856.4   3,309.5
J     96.74   5,439.8   1,390.0   100     4,720.8   1,233.2   100     5,064.3   1,357.8   100     4,439.5   1,304.9
K     94.48   4,619.5   1,006.0   99.03   4,766.1   1,270.2   98.70   5,416.3   1,433.2   100     5,325.4   1,096.8
L     99.68   3,705.6   913.6     100     4,446.1   1,392.5   100     5,260.5   1,268.8   100     5,545.3   1,287.1
M     91.53   3,833.7   907.5     99.67   5,013.1   1,151.6   99.02   4,462.5   1,136.8   99.67   3,833.9   896.3
N     98.38   4,287.0   1,097.0   99.35   4,819.3   1,019.2   99.35   3,699.3   914.0     99.35   4,909.1   863.1
O     95.45   3,338.1   816.7     99.03   5,237.3   1,000.0   98.38   4,846.7   1,167.2   99.35   4,849.6   1,563.8
P     95.44   5,606.4   1,390.2   99.67   5,486.6   1,502.8   99.67   3,633.0   880.7     100     5,707.3   2,559.3
Q     91.53   4,486.3   903.3     99.67   5,036.0   1,219.2   99.35   3,387.2   788.2     99.67   4,680.0   1,310.3
R     91.86   4,513.4   1,194.6   99.67   5,515.9   1,335.5   95.44   5,289.7   1,185.9   99.67   4,567.4   1,063.3
S     93.16   4,300.1   1,176.0   99.35   3,967.6   890.0     99.35   5,619.3   1,385.8   100     3,560.6   970.2
T     93.18   5,536.6   1,321.9   100     4,180.4   1,104.3   99.03   4,618.1   1,199.2   100     5,156.8   1,118.2
U     94.16   4,861.5   1,147.3   100     3,639.5   864.8     100     5,545.0   1,445.5   100     4,873.8   1,197.9
V     99.03   4,755.6   1,467.6   100     5,169.7   1,482.1   100     4,198.5   1,255.0   100     3,876.2   1,317.2
W     95.44   4,117.0   1,094.3   99.35   4,782.4   1,781.9   100     4,723.2   1,347.3   100     4,161.0   773.4
X     95.11   4,788.1   1,129.6   99.35   4,025.6   1,605.4   100     3,967.8   893.0     100     4,158.1   1,044.3
Y     98.05   3,699.5   986.6     99.35   5,022.5   1,330.0   99.67   3,239.5   777.4     99.67   5,238.1   1,067.9
Z     77.20   4,996.6   1,105.0   99.35   4,571.7   1,473.3   98.37   4,002.9   973.7     98.05   5,230.8   1,210.8
All   93.43   4,427.9   1,083.6   99.59   4,736.6   1,235.6   99.05   4,446.9   1,155.3   99.71   4,765.3   1,412.2

The first four columns of the first row of Table 2 indicate that out of the 307 samples of the letter A, 99.35% (Acc.) are recognized correctly, and the computational time spent until a letter is successfully or unsuccessfully recognized is on average 4,572.2 microseconds (t), with a standard deviation of 1,081.2 (stdev). The last row of the table provides the overall statistics of accuracy and computational time for the whole letter testing dataset produced by the four different methods. Table 3 shows similar results for the recognition of the MNIST digit testing dataset.
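The per-label statistics reported in Tables 2 and 3 (accuracy, mean recognition time, and its standard deviation) amount to a simple aggregation over per-sample results. The record format below is a hypothetical illustration, not the paper's actual logging format, and the choice of population standard deviation is our assumption:

```python
import statistics
from collections import defaultdict

def per_label_stats(records):
    """Aggregate (label, correct, time_us) records into per-label statistics.

    Returns {label: (accuracy_percent, mean_time_us, stdev_time_us)}.
    Uses the population standard deviation (an assumption on our part).
    """
    by_label = defaultdict(list)
    for label, correct, time_us in records:
        by_label[label].append((correct, time_us))
    stats = {}
    for label, rows in by_label.items():
        times = [t for _, t in rows]
        acc = 100.0 * sum(1 for c, _ in rows if c) / len(rows)
        stats[label] = (acc, statistics.mean(times), statistics.pstdev(times))
    return stats

# Hypothetical usage: three recognition attempts for the letter 'A'.
demo = [('A', True, 4000.0), ('A', True, 5000.0), ('A', False, 4500.0)]
summary = per_label_stats(demo)
```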
Table 3. Accuracy and computational times for the MNIST digit (testing) dataset

Lab   MLP-Bp                      MLP-Rprop                   DBN-Bp                      DBN-Rprop
      Acc.    t (μs)    stdev     Acc.    t (μs)    stdev     Acc.    t (μs)    stdev     Acc.    t (μs)    stdev
1     97.62   3,012.4   575.6     98.15   3,062.5   688.6     97.71   2,747.4   595.3     98.59   2,844.7   1,019.6
2     86.92   3,028.2   890.5     94.67   2,268.8   562.4     88.76   2,703.7   482.7     95.83   3,017.7   721.4
3     75.54   2,294.3   673.0     95.84   2,783.3   354.4     91.98   2,623.2   802.8     96.14   2,425.2   589.1
4     88.70   3,059.5   699.7     95.52   2,400.4   621.4     88.90   2,616.5   307.7     96.74   2,525.1   819.4
5     90.92   3,041.4   763.8     96.08   2,795.0   377.6     85.76   2,720.6   295.3     95.52   2,452.3   742.4
6     89.56   3,211.7   795.0     96.45   2,774.5   331.8     95.51   2,656.7   834.6     95.93   2,408.7   556.1
7     88.62   2,811.0   736.3     94.94   2,732.5   661.6     90.66   2,759.4   589.3     95.04   3,069.1   777.3
8     85.01   2,920.0   710.8     94.25   2,002.9   506.3     88.19   2,327.9   594.7     95.69   2,429.5   567.5
9     88.01   1,740.0   556.3     93.56   2,835.8   629.6     87.22   2,804.7   790.1     95.44   3,017.0   665.7
0     95.61   3,031.5   561.9     98.06   2,404.7   641.0     95.41   2,773.7   819.0     98.06   2,805.1   379.9
All   88.65   2,815.0   696.3     95.75   2,606.0   537.5     91.01   2,673.4   611.1     96.30   2,699.4   683.9

Table 4 summarizes the results for recognizing both the letter and the MNIST digit testing datasets. The best accuracy (column Acc.) on both testing datasets is achieved by DBN with Rprop fine-tuning, namely 99.71% and 96.30%, respectively. This is better than MLP with Rprop fine-tuning, which achieves 99.59% and 95.75%, respectively. These two approaches are evidently far superior to both DBN and MLP trained with Backpropagation. Fine-tuning with Rprop thus contributes significantly to the accuracy of both MLP and DBN.

Table 4. Summary of accuracy and computational times

Algorithm (Dataset)    Acc.      Epoch   Training      Recognition
                                         t (hours)     t (μs)    stdev
MLP-Bp (Letter)        93.43%    300     > 5           4,427.9   1,083.6
MLP-Rprop (Letter)     99.59%    35      > 1           4,736.6   1,235.6
DBN-Bp (Letter)        99.05%    300     > 10          4,446.9   1,155.3
DBN-Rprop (Letter)     99.71%    35      > 6           4,765.3   1,412.2
MLP-Bp (MNIST)         88.65%    500     > 12          2,815.0   696.3
MLP-Rprop (MNIST)      95.75%    70      > 2           2,606.0   537.5
DBN-Bp (MNIST)         91.01%    500     > 24          2,673.4   611.1
DBN-Rprop (MNIST)      96.30%    70      > 12          2,699.4   683.9

The last two columns of Table 4 (t and stdev) show that the recognition times of the four approaches are comparable. It seems that using Rprop results in slightly longer recognition times for both MLP and DBN. Out of the four approaches, the recognition time of DBN fine-tuned with Rprop is the longest, namely 4,765.3 microseconds on average with a standard deviation of 1,412.2 for the letter dataset. For the MNIST digit dataset, the recognition times of the four approaches are much closer to each other, and are slightly more than half of those for the letter dataset. This is because the resulting neural network for the digit dataset has smaller input and output vectors, and is thus quicker to evaluate. The time needed to recognize a letter or a digit is below ten milliseconds, which is excellent even for an online gaming experience [28].

4.3. Training and fine-tuning

Prior to recognition, the neural networks, both DBN and MLP, must first be trained or fine-tuned using the training dataset. The fourth column of Table 4 provides the computational time spent by the four different methods to complete training and pre-training (if any). Because MLP has no pre-training phase, its training time is much shorter than that of DBN. The time spent by DBN to complete pre-training is around 5 hours for the letter dataset and 10 hours for the MNIST digit dataset.
It is evident that this pre-training phase takes most of the training time of DBN, and that Rprop is very helpful in reducing the training time outside of pre-training. In Figure 9 and Figure 10, we show how the accuracy of each method changes as the number of epochs increases during training. To produce these figures, right after each epoch of the training process (on the training datasets), each method is paused, and the resulting neural network is used to recognize the letters or digits in the testing datasets. We observe these changes for up to 300 and 500 epochs for the letter and digit datasets, respectively.
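The per-epoch evaluation used to produce Figures 9 and 10 can be sketched as a training loop that pauses after every epoch to measure test accuracy. Here `train_one_epoch` and `predict` are hypothetical placeholders for the actual Rprop/Backpropagation and recognition routines:

```python
def accuracy_per_epoch(model, train_data, test_data, max_epochs,
                       train_one_epoch, predict):
    """Track test accuracy after every training epoch.

    `train_one_epoch(model, train_data)` runs one epoch of training or
    fine-tuning; `predict(model, x)` returns the predicted class label.
    Both are placeholders for the real learning routines.
    """
    history = []
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model, train_data)   # one pass over the training data
        correct = sum(1 for x, y in test_data if predict(model, x) == y)
        history.append((epoch, 100.0 * correct / len(test_data)))
    return history

# Toy usage: a "model" that memorizes one more test sample per epoch.
model = {"known": 0}
test = [(i, i) for i in range(4)]
def step(m, _): m["known"] += 1
def guess(m, x): return x if x < m["known"] else -1
curve = accuracy_per_epoch(model, None, test, 4, step, guess)
```

The resulting (epoch, accuracy) pairs are exactly the kind of curve plotted in Figures 9 and 10.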