International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 5, October 2018, pp. 4042-4046
ISSN: 2088-8708

Multi-modal Asian Conversation Mobile Video Dataset for Recognition Task

Dewi Suryani, Valentino Ekaputra, and Andry Chowanda
Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia 11480

Article Info

Article history:
Received December 6, 2017
Revised July 15, 2018
Accepted August 12, 2018

Keywords:
Multi-modal dataset
Asian conversation dataset
Mobile video
Recognition task
Facial features expression
Emotion recognition

ABSTRACT

Images, audio, and videos have long been used by researchers to develop tasks in human facial recognition and emotion detection. Most available datasets focus on either static expressions, short videos of an emotion changing from neutral to its peak, or differences in sound for detecting a person's current emotion. Moreover, the common datasets were collected and processed in the United States (US) or Europe, and only a few datasets originated in Asia. In this paper, we present our effort to create a unique dataset that fills the gap left by currently available datasets. At the time of writing, our dataset contains 10 full-HD (1920x1080) video clips with annotated JSON files, totaling 100 minutes in duration and 13 GB in size. We believe this dataset will be useful as training and benchmark data for a variety of research topics in human facial and emotion recognition.

Copyright (c) 2018 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Dewi Suryani, Valentino Ekaputra, Andry Chowanda
Computer Science Department, School of Computer Science, Bina Nusantara University
Jl. K. H. Syahdan No. 9, Palmerah, Jakarta, Indonesia 11480
(+6221) 534 5830, ext. 2188
{dsuryani, vekaputra, achowanda}@binus.edu

1. INTRODUCTION

Nowadays, mobile devices such as smartphones are equipped with high-quality cameras and capable processors that enable people to record high-definition (HD) video. People who want to record good-quality video no longer need to buy an expensive digital or DSLR camera. Based on observations by Ofcom [1] in 2016, people in the USA spent on average approximately 87 hours a month browsing on a smartphone, compared to 34 hours on a laptop or desktop. This signifies that people tend to use a smartphone for most of their activities, and that the features on their smartphones, including video recording, are more than sufficient for their daily needs. Statista [2] also reported that 71% of 44,761 respondents use their smartphone to take photos and videos, the second most common smartphone activity after accessing the internet. This indicates that, compared to a few years before, the cameras in their smartphones are already satisfactory for taking images and recording video.

Despite the increasing amount of time people spend on their smartphones, there is no publicly available dataset of Asian facial features captured using a mobile smartphone camera. Generally, existing datasets consist only of still images, and most video datasets cover only Western facial features.
As the literature suggests, although facial expressions are universal across all races of humans, the perception of emotions from facial expression cues differs considerably from one culture to another. Moreover, most existing datasets focus only on the facial features in the video, and none captures the facial features of a person talking with others in the wild. Hence, by using smartphones, we can capture a natural conversation between two interlocutors in the wild. Such datasets allow computers to learn emotion recognition as well as facial expression feature extraction and classification.

In this work, we aim to address this gap by presenting a mobile video dataset that contains
videos of Asian people having a natural conversation with each other. Our dataset was obtained by recording a natural conversation between two people inside a controlled room with adequate lighting, with the mobile camera held in a steady position. In order to collect a variety of facial features, we provided several topics for the interlocutors to choose from, mostly general topics such as food, lecturers, etc.

The rest of this paper is organized as follows. Section 2 lists the existing publicly available datasets and explains how they differ from ours. Our data-collection method and the characteristics of the data are explained in Section 3. Potential applications of the dataset are described in Section 4. Finally, the conclusion is provided in Section 5.

2. RELATED WORKS

Several datasets have been created for many kinds of recognition tasks, especially facial expression analysis and recognition [3, 4, 5]. However, only a few datasets contain Asian people and were recorded using a smartphone. The extended Cohn-Kanade database (CK+) [6] is composed of 593 recordings of posed and non-posed sequences. It was recorded under controlled conditions of light and head motion, with sequences ranging between 9 and 60 frames. Each sequence represents a single changing facial expression that starts with a neutral expression and ends with a peak expression; the transitions between expressions are not included. There is also the NRC-IIT database [7], which contains pairs of short low-resolution MPEG-1-encoded video clips. Each clip shows the face of a user sitting in front of a monitor and exhibiting a wide range of facial expressions and orientations, as captured by a USB webcam mounted on the computer. Every clip is about 15 seconds long, has a capture rate of 20 fps, and is compressed with the AVI Intel codec at a 481 kbps bit rate. On the other hand, the Cohn-Kanade DFAT-504 dataset [8] consists of 100 university students ranging in age from 18 to 30 years; 65% were female, 15% were African-American, and 3% were Asian or Latino. Students were instructed by an experimenter to perform a series of 23 facial expressions, beginning and ending each display with a neutral face. Image sequences from neutral to target display were digitized into 640-by-480-pixel arrays with 8-bit precision for grayscale values. Similarly, the MMI database [9] contains a large collection of FACS-coded facial videos; it consists of 1395 manually AU-coded video sequences, the majority of which are posed and recorded in laboratory settings.

Most of the datasets mentioned above focus only on the image part of the video without recording the audio, which makes several expressions in them quite unnatural. These datasets were also mostly captured with a dedicated camera or webcam, and the duration of each recording is too short for real-life applications, which combine different aspects and the context of the topic with the facial expressions in the video. In summary, our dataset differs in the following points: (i) it is recorded using the camera of a mobile device; (ii) it contains long-duration videos of natural conversation; (iii) it includes full-HD videos; and (iv) it includes audio for matching expressions with context.
3. PROPOSED METHOD

In this paper, we propose a new mobile video dataset that can be used as benchmark data for several recognition tasks as well as a dataset for machine learning tasks. We start by explaining how we collected the dataset for this research. The aim of this research is to provide a publicly available dataset of Asian (specifically Indonesian) facial features, expressions, and conversations. Thus, we recruited twenty volunteers, mainly Indonesian students (aged 19-21), to participate in this research. The participants were given a list of possible topics to discuss during a 10-minute recording. In one recording session, two interlocutors sat facing each other across a table and started their conversation when the researcher gave them the signal. To record the conversation, the researchers set up two smartphones with identical camera specifications: two Xiaomi Mi 4i devices with 13 MP, f/2.0 cameras, recording video at a full-HD setting (1920x1080 pixels, 30 fps). The smartphones were placed in a steady position in front of each interlocutor. The recorded video is depicted in Figure 1, where Figure 1a shows the first interlocutor involved in a conversation on a selected topic and Figure 1b shows the other interlocutor.

Figure 1. Example of a recorded video of a conversation between two interlocutors: (a) first interlocutor; (b) second interlocutor.

During the conversation, the volunteers were encouraged to behave as if it were a normal, natural conversation in order to obtain the most natural dataset possible. After 10 minutes, they were reminded to stop the conversation, and the video was saved as an MP4 file in the MPEG-4 format on the device before being exported to a computer. Each file is named in the format "CONVERSATIONID_CAMERAID", where CONVERSATIONID identifies the session and CAMERAID identifies the device used. After the data was completely recorded, we annotated the videos for three different facial expressions, i.e., sad, neutral, and happy. Using the Visual Object Tagging Tool (VoTT) [10] provided by Microsoft, we obtained the annotation data in JSON format, as shown in Figure 2. Furthermore, Figure 3 shows an example video frame during the annotation process.

Figure 2. Part of an annotated JSON example.

Figure 3. Example video frame during the annotation process; the red square denotes the annotated facial area.
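As an illustration of how the naming scheme and the annotations above can be consumed together, below is a minimal Python sketch that loads one annotated recording. It assumes a VoTT v1-style export, i.e., a top-level "frames" object mapping frame indices to lists of tagged rectangles with x1/y1/x2/y2 coordinates and a "tags" list; the exact schema should be verified against the actual export, since it differs between VoTT versions, and the file name in the usage example is hypothetical.

```python
import json
import os
from collections import Counter

def parse_recording(json_path):
    """Parse one annotated recording into a flat list of labeled face boxes.

    Assumes a VoTT v1-style export: a top-level "frames" object mapping
    frame indices to lists of tagged rectangles (x1, y1, x2, y2, tags).
    """
    # Recover session and device IDs from the "CONVERSATIONID_CAMERAID"
    # naming scheme described above.
    stem = os.path.splitext(os.path.basename(json_path))[0]
    conversation_id, camera_id = stem.split("_", maxsplit=1)

    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)

    samples = []
    for frame_idx, regions in data.get("frames", {}).items():
        for region in regions:
            if not region.get("tags"):
                continue  # skip regions without an expression label
            samples.append({
                "conversation": conversation_id,
                "camera": camera_id,
                "frame": int(frame_idx),
                "box": (region["x1"], region["y1"], region["x2"], region["y2"]),
                "label": region["tags"][0],  # one of: sad, neutral, happy
            })
    return samples

if __name__ == "__main__":
    samples = parse_recording("01_A.json")  # hypothetical file name
    print(len(samples), "annotated faces")
    print(Counter(s["label"] for s in samples))  # label distribution
```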
4. POTENTIAL APPLICATIONS

Several potential applications can take advantage of the availability of this dataset, such as facial recognition and emotion detection.

Facial Recognition. Our dataset provides additional data for increasing the accuracy of facial recognition tasks. Most research in this field uses the CK+ dataset, as done by Bartlett et al. [11], Cohen et al. [12], and Cohn et al. [13]. Unfortunately, the currently used datasets mostly contain people from Western regions, which can strongly affect recognition of Asian faces.

Emotion Detection. Most other research on emotion detection uses only a single modality, such as visual-only or audio-only datasets [14]. Our dataset provides multi-modal information, i.e., images and audio that are linked together. Moreover, the commonly used datasets consist of posed expressions that are not based on authentic emotions [15].

Virtual Humans or Intelligent Virtual Agents. With this dataset, we can learn a natural conversation between two interlocutors and implement it in a virtual human [16, 17]. The dataset provides several features to be learned: emotion recognition from audio (i.e., voice) and video (e.g., facial expressions), natural language processing, and conversation.

Psychology or Social Science Study. With this dataset, researchers in psychology or social studies can also analyze and observe human behavior during interaction. An ethnography study can likewise be applied to analyze or observe human behavior (specifically of Indonesian people) through the video.
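To make the facial recognition and emotion detection use cases above concrete, the following sketch pairs the parsed annotations with the corresponding MP4 file and exports cropped face images grouped by expression label, a layout that image classification pipelines can consume directly. It relies on OpenCV and on the hypothetical parse_recording helper from the previous sketch, and it assumes that the annotation frame indices match OpenCV's zero-based frame numbering, which should be verified against the actual export.

```python
import os
import cv2  # pip install opencv-python

def export_face_crops(video_path, samples, out_dir):
    """Write one cropped face image per annotated frame, grouped into
    per-label sub-directories (sad/, neutral/, happy/)."""
    cap = cv2.VideoCapture(video_path)
    try:
        for i, s in enumerate(samples):
            # Seek to the annotated frame; assumes annotation frame indices
            # match OpenCV's zero-based frame numbering.
            cap.set(cv2.CAP_PROP_POS_FRAMES, s["frame"])
            ok, frame = cap.read()
            if not ok:
                continue  # frame index out of range or decode error
            x1, y1, x2, y2 = map(int, s["box"])
            crop = frame[y1:y2, x1:x2]
            if crop.size == 0:
                continue  # degenerate box
            label_dir = os.path.join(out_dir, s["label"])
            os.makedirs(label_dir, exist_ok=True)
            cv2.imwrite(os.path.join(label_dir, f"{i:06d}.png"), crop)
    finally:
        cap.release()

# Hypothetical usage with the file names from the previous sketch:
# samples = parse_recording("01_A.json")
# export_face_crops("01_A.mp4", samples, "crops/")
```

Since each clip also carries its audio track, the same annotation timestamps could in principle be paired with audio segments (e.g., extracted with a tool such as ffmpeg) for multi-modal experiments.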
5. CONCLUSION

This paper presents a conversation video dataset containing videos of real conversations performed by pairs of volunteers, recorded using mobile device cameras, along with JSON annotation data. In the future, we intend to collect more data for the dataset by recruiting volunteers who are more diverse in age, gender, and occupation. We believe that this dataset will be useful for several applications that require training on images, audio, or videos. We also plan to record videos under different lighting conditions in order to observe the influence of lighting.

REFERENCES
[1] Ofcom. (2016, Dec.) The communications market report: International. [Online]. Available: https://www.ofcom.org.uk/research-and-data/multi-sector-research/cmr/cmr16/international
[2] Statista. (2016, Sep.) Weekly smartphone activities among adult users in the United States as of August 2016. [Online]. Available: https://www.statista.com/statistics/187128/leading-us-smartphone-activities/
[3] H. K. Palo and M. N. Mohanty, "Classification of emotional speech of children using probabilistic neural network," International Journal of Electrical and Computer Engineering, vol. 5, no. 2, p. 311, 2015.
[4] F. E. Gunawan and K. Idananta, "Predicting the level of emotion by means of Indonesian speech signal," Telkomnika, vol. 15, no. 2, 2017.
[5] F. Z. Salmam, A. Madani, and M. Kissi, "Emotion recognition from facial expression based on fiducial points detection and using neural network," International Journal of Electrical and Computer Engineering (IJECE), vol. 8, no. 1, 2017.
[6] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010, pp. 94-101.
[7] D. O. Gorodnichy, "Video-based framework for face recognition in video," in Computer and Robot Vision, 2005. Proceedings. The 2nd Canadian Conference on. IEEE, 2005, pp. 330-338.
[8] T. Kanade, J. F. Cohn, and Y. Tian, "Comprehensive database for facial expression analysis," in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on. IEEE, 2000, pp. 46-53.
[9] M. Valstar and M. Pantic, "Induced disgust, happiness and surprise: an addition to the MMI facial expression database," in Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, 2010, p. 65.
[10] Microsoft. (2017, Sep.) VoTT: Visual object tagging tool. [Online]. Available: https://github.com/Microsoft/VoTT
[11] M. S. Bartlett, G. Littlewort, I. Fasel, and J. R. Movellan, "Real time face detection and facial expression recognition: Development and applications to human computer interaction," in Computer Vision and Pattern Recognition Workshop, 2003. CVPRW'03. Conference on, vol. 5. IEEE, 2003, pp. 53-53.
[12] I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang, "Facial expression recognition from video sequences: temporal and static modeling," Computer Vision and Image Understanding, vol. 91, no. 1, pp. 160-187, 2003.
[13] J. F. Cohn, L. I. Reed, Z. Ambadar, J. Xiao, and T. Moriyama, "Automatic analysis and recognition of brow actions and head motion in spontaneous facial behavior," in Systems, Man and Cybernetics, 2004 IEEE International Conference on, vol. 1. IEEE, 2004, pp. 610-616.
[14] L. C. De Silva, T. Miyasato, and R. Nakatsu, "Facial emotion recognition using multi-modal information," in Information, Communications and Signal Processing, 1997. ICICS., Proceedings of 1997 International Conference on, vol. 1. IEEE, 1997, pp. 397-401.
[15] Y. Sun, N. Sebe, M. S. Lew, and T. Gevers, "Authentic emotion detection in real-time video," in International Workshop on Computer Vision in Human-Computer Interaction. Springer, 2004, pp. 94-104.
[16] A. Chowanda, P. Blanchfield, M. Flintham, and M. Valstar, "ERiSA: Building emotionally realistic social game-agents companions," in International Conference on Intelligent Virtual Agents. Springer, 2014, pp. 134-143.
[17] ——, "Play smile game with ERiSA," in IVA 2015, Fifteenth International Conference on Intelligent Virtual Agents, 2015.

BIOGRAPHIES OF AUTHORS

Dewi Suryani is a Computer Science lecturer at Bina Nusantara University, Indonesia. She obtained her Bachelor's degree in Computer Science from Bina Nusantara University in 2014 and recently graduated with a Master of Engineering from the Sirindhorn Thai-German Graduate School of Engineering (TGGS), King Mongkut's University of Technology North Bangkok (KMUTNB), Thailand. Her research interests include computer science, image processing, and handwriting recognition.

Valentino Ekaputra is a lecturer at Bina Nusantara University. He obtained his Bachelor's and Master's degrees in Computer Science from Binus University in 2016. He has not yet settled on a specific research focus.

Andry Chowanda is a Computer Science lecturer at Bina Nusantara University, Indonesia, and currently a PhD student at the University of Nottingham. His research is in agent architecture, mainly on how to model an agent that can sense and perceive its environment and react to the perceived data, in addition to building a social relationship with the user over time.