International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 5, October 2020, pp. 5479-5486
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i5.pp5479-5486

Benchmarking open source deep learning frameworks

Ghadeer Al-Bdour, Raffi Al-Qurran, Mahmoud Al-Ayyoub, Ali Shatnawi
Jordan University of Science and Technology, Jordan

Article Info
Article history: Received Jul 23, 2019; Revised Apr 25, 2020; Accepted May 6, 2020
Keywords: CNTK, Performance comparison, TensorFlow, Theano

ABSTRACT
Deep Learning (DL) is one of the hottest fields in machine learning. To foster the growth of DL, several open source frameworks have appeared, providing implementations of the most common DL algorithms. These frameworks vary in the algorithms they support and in the quality of their implementations. The purpose of this work is to provide a qualitative and quantitative comparison among three such frameworks: TensorFlow, Theano and CNTK. To ensure that our study is as comprehensive as possible, we consider multiple benchmark datasets from different fields (image processing, NLP, etc.) and measure the performance of the frameworks' implementations of different DL algorithms. For most of our experiments, we find that CNTK's implementations are superior to the other ones under consideration.

Copyright (c) 2020 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author: Mahmoud Al-Ayyoub, Jordan University of Science and Technology, Irbid, Jordan. Email: maalshbool@just.edu.jo

1. INTRODUCTION
Deep learning (DL) is the hottest trend in machine learning (ML). Although the theoretical concepts behind DL are not new, it has enjoyed a surge of interest over the past decade due to many factors.
One example is that DL approaches have significantly outperformed state-of-the-art (SOTA) approaches in many tasks across different fields such as image processing, computer vision, speech processing and natural language processing (NLP). Moreover, the scientific community (from both academia and industry) has quickly and massively adopted DL. Open source implementations of successful DL algorithms quickly appeared on code sharing websites, and were subsequently used by many researchers in different fields. Several DL frameworks exist, such as TensorFlow, Theano, CNTK, Caffe and PyTorch, each with different features and characteristics. Furthermore, each framework utilizes different techniques to optimize its code. Although the same algorithm may be implemented in several frameworks, the performance of the different implementations can vary greatly. A researcher or practitioner looking to use such an algorithm thus faces a difficult choice, since the number of different implementations is high while the effort invested by the research community in scientifically comparing these implementations is limited.

In this work, we aim at providing qualitative and quantitative comparisons between three popular open source DL frameworks: TensorFlow, Theano and CNTK. These frameworks support multi-core CPUs as well as multiple GPUs. All of them import cuDNN, a DL library from NVIDIA that provides highly tuned implementations of standard routines such as forward and backward convolution, normalization, pooling and activation layers. We compare these frameworks by training different neural network (NN) architectures on five different standard benchmark datasets for various tasks in image processing, computer vision and NLP. Despite their importance, comparative studies like ours that focus on performance issues are rare.
Limited efforts have been dedicated to conducting comparative studies between SOTA DL frameworks running on different hardware platforms (CPU and GPU) to highlight the advantages and limitations of each framework for different deep NN architectures. These efforts include papers [1-9] as well as online blogs (https://github.com/soumith/convnet-benchmarks). Due to space constraints, we do not discuss the details of these works here; interested readers are referred to earlier versions of this work [10, 11] for such details. However, we do note that the comparisons in previous studies focused only on processing time. None of those comparative studies dealt with CPU and GPU utilization or memory consumption. This work covers these metrics to determine which of the considered frameworks achieves the best performance. Finally, and most importantly, our comparisons involve more datasets from more fields than previous studies.

Journal homepage: http://ijece.iaescore.com/index.php/IJECE

The rest of this paper is organized as follows. Section 2 discusses the frameworks, the way they were used to train on the datasets and a brief comparison between them. The methodology we follow is discussed in Section 3. Experimental results and the discussion are detailed in Section 4. The work is concluded with final thoughts presented in Section 5.

2. DEEP LEARNING FRAMEWORKS
The frameworks considered in this comparative study are CNTK, TensorFlow and Theano. Moreover, we use Keras on top of these frameworks, as discussed later. All of these frameworks provide flexible APIs and configuration options for performance optimization. Software versions of the frameworks are shown in Table 1 and their properties are shown in Table 2.

Table 1. Frameworks used for this comparative study
Framework  | Major Version | Github Commit ID
CNTK       | 2.0           | 7436a00
TensorFlow | 1.2.0         | 49961e5
Theano     | 0.10.0.dev1   | 8a1af5b
Keras      | 2.0.5         | 78f26df

Table 2. Properties of the considered frameworks
Property           | CNTK | TensorFlow | Theano                        | Keras
Core               | C++  | C++        | Python                        | Python
CPU                | X    | X          | X                             | X
Multi-threaded CPU | X    | Eigen      | BLAS, conv2D, limited OpenMP  | X
GPU                | X    | X          | X                             | X
Multi-GPU          | X    | X          | X (experimental version)      | X
NVIDIA cuDNN       | X    | X          | X                             | X

2.1.
Microsoft Cognitive Toolkit (CNTK)

CNTK is an open source DL framework developed by Microsoft Research [12] for training and testing many types of NN across multiple GPUs or servers. CNTK supports different DL architectures such as Feedforward, Convolutional, Recurrent, Long Short-Term Memory (LSTM) and Sequence-to-Sequence NN. In CNTK, a Computational Network learns any function by converting it to a directed graph, where leaf nodes hold input values or learnable parameters while the other nodes represent matrix operations applied to their children. This gives CNTK an advantage: it can automatically derive the gradients of all the computations required to learn the parameters. In CNTK, users specify their networks using a configuration file that contains information about the network type, where to find the input data and the way to optimize the parameters [13]. The CNTK interface supports APIs for several languages such as Python, C++ and C# on both GPU (CUDA) and CPU platforms. According to its developers (https://docs.microsoft.com/en-us/cognitive-toolkit/cntk-evaluation-overview), CNTK was written in C++ in an efficient way: it removes duplicated computations in the forward and backward passes, uses minimal memory and reduces memory reallocation by reusing buffers.

2.2. Theano

Theano is an open source Python library developed at the MILA lab at the University of Montreal as a compiler for mathematical expressions that lets users and developers optimize and evaluate their expressions using NumPy's syntax (a Python library that supports large, multi-dimensional arrays) [3, 14]. Theano automatically optimizes the selection of computations, translates them into other languages such as C++ or CUDA (for the GPU) and then compiles them into Python
modules that run efficiently on CPUs or GPUs. Theano's development started in 2008, and it has built a larger research community and ecosystem than many DL libraries. Several software packages have been developed on top of Theano, providing a higher-level user interface that makes it easier to express and train different deep learning architectures; examples include Pylearn2, Lasagne and Keras.

2.3. TensorFlow

TensorFlow is an open source framework developed by the Google Brain Team [15]. It uses a single dataflow graph, expressing all numerical computations, to achieve excellent performance. TensorFlow constructs large computation graphs where each node represents a mathematical operation, while the edges represent the communication between nodes. This dataflow graph makes the communication between sub-computations explicit, which makes it possible to execute independent computations in parallel or to use multiple devices to execute partitioned computations [15]. TensorFlow programmers define large computation graphs from basic operators, then distribute the execution of these graphs across a heterogeneous distributed system (computation can be deployed to one or more CPUs or GPUs on different hardware platforms such as desktops, servers or even mobile devices). The flexible architecture of TensorFlow allows developers and users to experiment with and train a wide variety of NN models; it is used for deploying ML systems into production in different fields including speech recognition, NLP, computer vision, robotics and computational drug discovery. TensorFlow offers APIs in several languages such as Python, C++ and Java for constructing and executing a graph; the Python API is the most complete and the easiest to use (https://www.tensorflow.org/).

2.4.
Keras

Keras is an open source DL library developed in Python. It runs on top of the CNTK, Theano or TensorFlow frameworks. Keras was started by Google engineer Chollet in 2015 as part of the research project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System). Keras is designed to allow fast expression of deep NNs and easy, fast prototyping (through modularity and extensibility) [16].

3. METHODOLOGY

The goal of this experimental study is to compare the aforementioned frameworks (Theano, TensorFlow and CNTK) by using them to train Convolutional NN (CNN) and Recurrent NN (RNN) models on standard benchmark datasets for classical problems in image processing (MNIST, CIFAR-10 and Self-Driving Car) and NLP (Penn TreeBank and IMDB). Specifically, we aim at comparing the resources consumed by each framework to reach a certain accuracy level on each problem. Thus, we experiment with different epoch counts in order to make sure the accuracies of all frameworks are close to each other. Each framework's performance is evaluated using running time, memory consumption and CPU and GPU utilization. We use a laptop with an Intel Core i7-6700HQ CPU @ 2.60 GHz (4 cores), 16 GB RAM, a 64-bit operating system (Windows 10) and an NVIDIA GeForce GTX 960M graphics card with PCI Express 3.0 bus support, equipped with 4 GB of GDDR5 memory and 640 CUDA cores.

3.1. Benchmark datasets

In this subsection, we discuss the datasets used in our experiments.

3.1.1. MNIST

The MNIST (Mixed National Institute of Standards and Technology) dataset of handwritten digits is widely used in ML [17]. It has 60,000 training images and 10,000 testing images. Each image is 28×28 pixels and is flattened into a 784-value vector. The label of each image is a number between 0 and 9 representing the digit appearing in the image.

3.1.2. CIFAR-10

The CIFAR-10 dataset is one of the datasets collected by Krizhevsky et al.
[18, 19]. It consists of 60,000 32×32 color images evenly distributed over ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. There are 50,000 training images and 10,000 test images. The classes are completely mutually exclusive, i.e., there is no overlap between them. For instance, the "Automobile" class includes sedans, SUVs, etc., while the "Truck" class includes only big trucks; to avoid overlap, neither of these two classes includes pickup trucks.
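The image shapes described above determine the input layers of the CNNs used later in Section 3.2. As a quick illustration (using synthetic arrays rather than the actual datasets), an MNIST image flattens from 28×28 into a 784-value vector, while a CIFAR-10 image keeps its 32×32×3 layout:

```python
import numpy as np

# Synthetic stand-ins for one MNIST image (grayscale 28x28)
# and one CIFAR-10 image (color 32x32, 3 channels).
mnist_image = np.zeros((28, 28), dtype=np.uint8)
cifar_image = np.zeros((32, 32, 3), dtype=np.uint8)

# MNIST images are flattened into 784-value vectors.
flat = mnist_image.reshape(-1)
print(flat.shape)         # (784,)

# CIFAR-10 images keep their spatial layout as CNN input.
print(cifar_image.shape)  # (32, 32, 3)
```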
3.1.3. Penn TreeBank

In 1993, Marcus et al. [20] wrote a paper on constructing a large annotated corpus of English called the Penn TreeBank (PTB). They reviewed their experience constructing one large annotated corpus consisting of over 4.5 million words of American English. It was annotated for part-of-speech (POS) tag information and, for half of the corpus, for skeletal syntactic structure. The dataset is large and diverse. It includes the Brown Corpus (retagged) and the Wall Street Journal Corpus, as well as Department of Energy abstracts, Dow Jones Newswire stories, Department of Agriculture bulletins, Library of America texts, MUC-3 messages, IBM Manual sentences, WBUR radio transcripts and ATIS sentences.

3.1.4. IMDB

The IMDB dataset [21] is another dataset to which we apply a CNN; it is drawn from an online database of information regarding films, TV programs and video games. It consists of 25,000 reviews labeled by the sentiment (positive/negative) of each review. The reviews have been preprocessed and encoded as integers in the form of sequences of word indexes. Words are indexed by overall frequency in the dataset, so that index i encodes the i-th most frequent word in the data, allowing quick filtering operations.

3.1.5. Self-Driving Car

This dataset uses Udacity's Self-Driving Car simulator as a testbed for training an autonomous car; work on autonomous driving started in the 1980s with Carnegie Mellon University's Navlab and ALV projects [22]. The training phase starts with activating the simulator, which is an executable application. A user initiates the data-collection service, and the collected data are saved locally on the computer as images so that the framework can take these images and train on them.
Training is done by distinguishing the edges in the images taken by the three cameras mounted on the front of the car in the simulator. After the training phase is done, the testing phase begins by taking the model file that is generated whenever the performance in an epoch is better than the previous best. Finally, the last generated file is executed in order to make the car drive autonomously and observe the testing results.

3.2. Network architectures

A CNN is used for the MNIST, CIFAR-10, IMDB and Self-Driving Car datasets, with a different network architecture for each dataset. The architecture of each CNN is shown in [23]. For the MNIST and CIFAR-10 datasets, two convolutional layers with the ReLU activation function are used after the input layer. The activation function is used to reduce the training time and to prevent vanishing gradients. After each CNN layer, a max-pooling layer is added in order to down-sample the input and to reduce overfitting. In the max-pooling layer, the stride value with which the filter is slid must be specified: when the stride is x, the filter (window) is moved x pixels at a time, producing spatially smaller output volumes. After each max-pooling layer, dropout is used to reduce overfitting by forcing the model to learn many independent representations of the same data through randomly disabling neurons during the learning phase. For the Self-Driving Car dataset, the CNN has the same components as the ones used with the MNIST and CIFAR-10 datasets, but with a deeper model that consists of five convolutional layers with the Exponential Linear Unit (ELU) activation function. The convolutional layers are used for feature extraction, while the fully connected layers are used for predicting the steering angle (final output).
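The effect of the window size and stride on the spatial output size follows the standard formula out = floor((in − window) / stride) + 1. This helper is purely illustrative; the exact layer shapes of each model are given in [23]:

```python
def pooled_size(in_size: int, window: int, stride: int) -> int:
    """Spatial output size of a pooling (or valid convolution) layer:
    the window is slid `stride` pixels at a time across the input."""
    return (in_size - window) // stride + 1

# A 2x2 max-pooling with stride 2 halves a 28x28 MNIST feature map.
print(pooled_size(28, window=2, stride=2))  # 14

# A larger stride shrinks the output volume faster.
print(pooled_size(32, window=2, stride=4))  # 8
```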
Dropout is again used to avoid overfitting and, finally, the ELU activation function mitigates the vanishing gradient problem. In the IMDB dataset, the movie reviews are composed of word sequences of different lengths. The words are encoded by mapping the reviews to sequences of word embeddings, where each word is mapped to a vector of real numbers. The network architecture consists of an embedding layer, followed by a 1D convolutional layer (suited for temporal data), followed by a global max-pooling operation. Because the sequences have different lengths, they are padded to the size of the largest sequence.

The other NN type we consider is an RNN with LSTM cells. One of the most popular uses of LSTM is for text analysis tasks such as the ones associated with the Penn TreeBank (PTB) dataset. We adopt word-level prediction experiments on PTB, which consists of 929k training words, 73k validation words and 82k test words, with a 10k-word vocabulary. We trained models of two sizes (small and medium) using the same architecture presented in [24]. To evaluate language models on PTB, a special metric called perplexity is used: better prediction accuracy corresponds to a lower perplexity value. Perplexity is the inverse of the probability, so minimizing perplexity is the same as maximizing probability. The goal of applying the PTB dataset is to fit a probabilistic model which
assigns probabilities to sentences. This is done by predicting the next word in a text given the history of previously seen words. LSTM cells are the core of the model; they process one word at a time and compute the probabilities of the possible values for the next word in the sentence. The memory state of the network is initialized with a vector of zeros and updated after each word is read. In the small model, two hidden layers (with 200 LSTM units per layer) are used with the tanh activation function, and the weights are initialized to 0.1. We trained it with an initial learning rate of one for the first four epochs, after which the learning rate is decayed by a factor of two after each epoch, for a total of 13 training epochs. The batch size is 20 and the network is unrolled for 20 steps. This model's architecture is shown in [23].

4. RESULTS AND DISCUSSION

In this section we discuss the results of our experiments. For each model on each dataset, Table 3 shows the CPU and GPU processing times, while Tables 4-8 show the utilization levels of the CPU, the GPU and their memories. For the image classification datasets (MNIST and CIFAR-10), one can observe the superiority of CNTK over TensorFlow and Theano in terms of GPU and multi-threaded CPU performance; however, on CIFAR-10 with 8, 16 and 32 CPU threads, TensorFlow was faster than CNTK. Theano, on the other hand, proved more time consuming than the other frameworks. Moving to the sentiment analysis dataset (IMDB), CPU multithreading was not performed because the CNTK model is written in Python, in which multithreading is not supported. Without CPU multithreading (the CPU uses the default number of physical cores, with one thread per core), TensorFlow came out ahead in both the CPU and GPU environments. The results for the text analysis dataset (Penn TreeBank) show the superiority of TensorFlow over CNTK and Theano, for the CPU with 8 threads as well as for the GPU. Moving on to the video analysis dataset (Self-Driving Car), TensorFlow again came out ahead in both the CPU and GPU environments, while CNTK proved more time consuming than the other two frameworks.

The processing times clearly show the advantage of the GPU over the CPU for training CNNs and RNNs. The advantage of a fast GPU is more significant when training complex models on larger data, as in the Self-Driving Car dataset. The CPU results show that the best performance occurred when the number of threads equals the number of logical CPU cores, with each thread owning a single core. Our laptop's CPU supports 8 hardware threads; thus, on each dataset, the best processing time was achieved using 8 threads.

We measured the utilization metrics of each framework to help explain why a given framework underperforms. We notice the poor performance of Theano on most datasets compared to CNTK and TensorFlow, which could be attributed to its low CPU utilization relative to the other frameworks. CNTK outperformed both TensorFlow and Theano when training on the MNIST and CIFAR-10 datasets. This is most likely due to its use of the BrainScript format (https://docs.microsoft.com/en-us/cognitive-toolkit/BrainScript-Network-Builder), a custom network description language that makes CNTK more flexible for NN customization. On the other hand, TensorFlow uses Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page), a C++ template library for linear algebra (a BLAS library) covering matrices, vectors, numerical solvers and related algorithms; this helps TensorFlow outperform CNTK and Theano on RNNs.

In addition to processing time, we also report the utilization levels of the CPU, the GPU and their memories for each model on each framework under consideration. These results are shown in Tables 4-8. The utilization levels of both the CPU and the GPU are high for all models. The only surprising numbers are the CPU utilization figures for Theano, which were very low. The tables also show that the utilization levels are rather small for both types of memory; this applies to all models on all frameworks. However, the tables also show that, in most cases, CNTK had the lowest memory utilization while TensorFlow had the highest. Surprisingly, the situation is almost reversed for the video analysis dataset (Self-Driving Car), where CNTK had the highest utilization and Theano the lowest. Another unexpected finding of these experiments is that the IMDB models generally needed the largest portions of memory.

Comparing our work to previous works [1, 2], we note the following. Bahrampour et al. [1] based their comparative study on three main aspects: speed, hardware utilization and extensibility. They used three NN types (CNN, AutoEncoder (AE) and LSTM) to train the MNIST, ImageNet [25] and IMDB datasets on the Caffe, Neon, TensorFlow, Theano and Torch frameworks. They came up with the following results. Training on the CPU, Torch performed the best followed by Theano, while Neon had the worst
performance. Moreover, Theano and Torch were the best in terms of extensibility, TensorFlow and Theano were very flexible, and Caffe was the easiest to benchmark. When training on the GPU, Torch was the best for larger convolutional and fully connected networks (FCNs), followed by Neon; for smaller networks, Theano was the best. For LSTM, Theano's results were the best, while TensorFlow's performance was not competitive compared with the other studied frameworks. On the other hand, Shi et al. [2] based their comparative study on two metrics: processing time and convergence rate. The NN types used were fully connected NNs, CNNs and RNNs, trained on the ImageNet, MNIST and CIFAR-10 datasets using the Caffe, CNTK, MXNet, TensorFlow and Torch frameworks. TensorFlow's results were the best on the CPU. On a single GPU, Caffe, CNTK and Torch performed better than MXNet and TensorFlow on FCNs; for small CNNs, Caffe and CNTK achieved good performance; and for RNNs (LSTM), CNTK was the fastest (5-10x faster than the other frameworks). With multi-GPU implementations, all frameworks achieved higher throughput and faster convergence compared with their single-GPU implementations.

Table 3. Processing time for each dataset (measured in seconds unless stated otherwise); in the Env column, CPU (x) denotes a CPU run with x threads

MNIST
Env      | CNTK  | TensorFlow | Theano
CPU (1)  | 847   | 5130       | 3560
CPU (2)  | 630   | 3180       | 2500
CPU (4)  | 574   | 2070       | 2260
CPU (8)  | 560   | 1740       | 2060
CPU (16) | 567   | 1920       | 2050
CPU (32) | 588   | 2010       | 2050
GPU      | 66.67 | 328.93     | 377.86

CIFAR-10
Env      | CNTK  | TensorFlow | Theano
CPU (1)  | 20196 | 25905      | 26700
CPU (2)  | 14520 | 16610      | 18700
CPU (4)  | 13662 | 11550      | 17250
CPU (8)  | 11484 | 9955       | 15800
CPU (16) | 11550 | 10340      | 15850
CPU (32) | 11649 | 10835      | 15750
GPU      | 926   | 2166.4     | 2386.1

IMDB
Env      | CNTK | TensorFlow | Theano
CPU (1)  | -    | 1244       | 538
CPU (2)  | -    | 642        | 412
CPU (4)  | -    | 390        | 380
CPU (8)  | 486  | 290        | 368
CPU (16) | -    | 249        | 368
CPU (32) | -    | 302        | 384
GPU      | 73.1 | 62.4       | 220

Self-Driving Car
Env      | CNTK       | TensorFlow | Theano
CPU (1)  | -          | 33.3 hours | 50 hours
CPU (2)  | -          | 19.8 hours | 44.2 hours
CPU (4)  | -          | 15 hours   | 42.6 hours
CPU (8)  | 47.6 hours | 14.1 hours | 43.5 hours
CPU (16) | -          | 16.4 hours | 43.5 hours
CPU (32) | -          | 16.4 hours | 43.5 hours
GPU      | 8.7 hours  | 6 hours    | 6.8 hours

PTB
Env      | CNTK | TensorFlow | Theano
CPU (1)  | -    | 40560      | 27066
CPU (2)  | -    | 26819      | 23244
CPU (4)  | -    | 18733      | 21541
CPU (8)  | 4290 | 16407      | 21450
CPU (16) | -    | 16848      | 21476
CPU (32) | -    | 18369      | 21541
GPU      | 2106 | 1342.28    | 1630
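Section 3.2 evaluates the PTB language models by perplexity (reported in Table 8 below). As a minimal sketch of how perplexity relates to the per-word probabilities a model assigns (the probability values here are made up purely for illustration), perplexity is the exponential of the average negative log-probability:

```python
import math

def perplexity(word_probs):
    """Perplexity of a language model over a test sequence:
    exp of the average negative log-probability the model
    assigns to each observed word. Lower is better."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model assigning every word probability 0.25 has perplexity 4,
# i.e. it is as uncertain as a uniform choice among 4 words.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 10))  # 4.0

# Higher probabilities on the observed words -> lower perplexity.
print(round(perplexity([0.5, 0.5, 0.5, 0.5]), 10))      # 2.0
```

This also makes concrete why minimizing perplexity is equivalent to maximizing the probability the model assigns to the test text.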
Table 4. Performance metrics of all models on the MNIST dataset
Metric   | Env | CNTK  | TensorFlow | Theano
Accuracy | CPU | 99.27 | 99.14      | 99.10
Accuracy | GPU | 99.26 | 99.11      | 99.17
CPU%     | -   | 99.6  | 92.2       | 14.7
GPU%     | -   | 92    | 77         | 95
Memory%  | CPU | 1.7   | 2.2        | 2.1
Memory%  | GPU | 3.6   | 5.2        | 4.9
Epochs#  | CPU | 7     | 15         | 10
Epochs#  | GPU | 7     | 15         | 10

Table 5. Performance metrics of all models on the CIFAR-10 dataset
Metric   | Env | CNTK  | TensorFlow | Theano
Accuracy | CPU | 82.68 | 82.26      | 82.29
Accuracy | GPU | 82.57 | 82.33      | 82.30
CPU%     | -   | 99.8  | 87.3       | 15.3
GPU%     | -   | 97    | 73         | 94.5
Memory%  | CPU | 3.2   | 5.3        | 5.1
Memory%  | GPU | 4.5   | 7.4        | 7.5
Epochs#  | CPU | 33    | 55         | 50
Epochs#  | GPU | 33    | 55         | 50

Table 6. Performance metrics of all models on the IMDB dataset
Metric   | Env | CNTK  | TensorFlow | Theano
Accuracy | CPU | 88.87 | 88.68      | 88.72
Accuracy | GPU | 88.93 | 88.83      | 88.48
CPU%     | -   | 94.8  | 92.2       | 14.6
GPU%     | -   | 76    | 76         | 88
Memory%  | CPU | 5.7   | 6.6        | 5.1
Memory%  | GPU | 6.6   | 9.1        | 7.6
Epochs#  | CPU | 2     | 3          | 2
Epochs#  | GPU | 2     | 3          | 2

Table 7. Performance metrics of all models on the Self-Driving Car dataset
Metric   | Env | CNTK  | TensorFlow | Theano
Accuracy | CPU | 99.93 | 99.96      | 99.71
Accuracy | GPU | 99.97 | 99.97      | 99.73
CPU%     | -   | 93.2  | 85         | 22
GPU%     | -   | 32.4  | 34         | 31
Memory%  | CPU | 5.3   | 4.3        | 3.2
Memory%  | GPU | 6.6   | 6.2        | 5.3
Epochs#  | CPU | 10    | 10         | 10
Epochs#  | GPU | 10    | 10         | 10

Table 8. Performance metrics of all models on the Penn TreeBank dataset
Metric     | Env | CNTK  | TensorFlow | Theano
Perplexity | CPU | 113.7 | 114.79     | 114.57
Perplexity | GPU | 113.2 | 113.21     | 113.3
CPU%       | -   | 94    | 91.8       | 18.3
GPU%       | -   | 76.6  | 77         | 81
Memory%    | CPU | 1.3   | 2.3        | 4.1
Memory%    | GPU | 2.2   | 4.5        | 5.4
Epochs#    | CPU | 13    | 13         | 13
Epochs#    | GPU | 13    | 13         | 13

5. CONCLUSIONS AND FUTURE WORK

In this paper, we have provided a qualitative and quantitative comparison between three of the most popular and most comprehensive DL frameworks (namely, Microsoft's CNTK, Google's TensorFlow and the University of Montreal's Theano). The main goal of this work was to help end users make an informed decision
about the best DL framework that suits their needs and resources. To ensure that our study is as comprehensive as possible, we have used multiple benchmark datasets, namely MNIST, CIFAR-10, Self-Driving Car and IMDB, which were trained via multilayer CNN architectures, and the Penn TreeBank dataset, which was trained via an RNN architecture. We have run our experiments on a laptop with the Windows 10 operating system and measured performance and utilization of the multi-threaded CPU, the GPU and memory. For most of our experiments, we find that CNTK's implementations are superior to the other ones under consideration.

REFERENCES
[1] S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah, "Comparative study of deep learning software frameworks," arXiv preprint arXiv:1511.06435, 2015.
[2] S. Shi, et al., "Benchmarking state-of-the-art deep learning software tools," arXiv preprint arXiv:1608.07249, 2016.
[3] R. Al-Rfou et al., "Theano: A Python framework for fast computation of mathematical expressions," arXiv preprint arXiv:1605.02688, 2016.
[4] P. Goldsborough, "A tour of TensorFlow," arXiv preprint arXiv:1610.01178, 2016.
[5] V. Kovalev et al., "Deep learning with Theano, Torch, Caffe, TensorFlow, and Deeplearning4j: Which one is the best in speed and accuracy?," Pattern Recognition and Information Processing (PRIP), 2016.
[6] F. Bastien, et al., "Theano: new features and speed improvements," arXiv preprint arXiv:1211.5590, 2012.
[7] W. Ding, R. Wang, F. Mao, and G. Taylor, "Theano-based large-scale visual recognition with multiple GPUs," arXiv preprint arXiv:1412.2302, 2014.
[8] W. Dai and D. Berleant, "Benchmarking contemporary deep learning hardware and frameworks: A survey of qualitative metrics," International Conference on Cognitive Machine Intelligence, pp. 148-155, 2019.
[9] C. Coleman et al., "DAWNBench: An end-to-end deep learning benchmark and competition," 31st Conference on Neural Information Processing Systems, vol. 100, no. 101, 2017.
[10] A. Shatnawi et al., "A comparative study of open source deep learning frameworks," 9th International Conference on Information and Communication Systems, pp. 72-77, 2018.
[11] G. Al-Bdour, R. Al-Qurran, M. Al-Ayyoub, and A. Shatnawi, "A detailed comparative study of open source deep learning frameworks," arXiv preprint arXiv:1903.00102, 2019.
[12] D. Yu et al., "An introduction to computational networks and the computational network toolkit," Microsoft, Tech. Rep. MSR-TR-2014-112, 2014.
[13] D. Yu, K. Yao, and Y. Zhang, "The computational network toolkit [best of the web]," IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 123-126, 2015.
[14] J. Bergstra et al., "Theano: A CPU and GPU math compiler in Python," Proc. 9th Python in Science Conference, vol. 1, pp. 3-10, 2010.
[15] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 265-283, 2016.
[16] F. Chollet et al., "Keras," 2015.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[19] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," University of Toronto, Tech. Rep., 2009.
[20] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of English: The Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313-330, 1993.
[21] A. Maas et al., "Learning word vectors for sentiment analysis," Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.
[22] R. Wallace et al., "First results in robot road-following," IJCAI, pp. 1089-1095, 1985.
[23] G. Al-Bdour, "Comparative study between deep learning frameworks using multiple benchmark datasets," Master's thesis, Jordan University of Science and Technology, 2017.
[24] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
[25] J. Deng, et al., "ImageNet: A large-scale hierarchical image database," IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.