TELKOMNIKA, Vol. 17, No. 1, February 2019, pp. 275-281
ISSN: 1693-6930, accredited First Grade by Kemenristekdikti, Decree No: 21/E/KPT/2018
DOI: 10.12928/TELKOMNIKA.v17i1.11613

Implementation of web scraping on GitHub task monitoring system

Rolly Maulana Awangga*1, Syafrial Fachri Pane2, and Restiyana Dwi Astuti3
Applied Bachelor Program of Informatics Engineering, Politeknik Pos Indonesia, Indonesia
*Corresponding author, e-mail: awangga@poltekpos.ac.id

Abstract
The increasingly sophisticated evolution of information technology has also influenced the field of education. One implementation of information technology in education is e-learning, or electronic learning. The GitHub social network can serve as an e-learning medium for studying software development because GitHub provides access control. However, the number of contributors committing changes to a repository lengthens the process of calculating the predetermined assessment parameters. Based on this issue, this research aims to build a page capable of integrating information from GitHub repository pages. The integration of information is carried out using web scraping technology. With a web page that integrates repository, collaborator, commit, and issue information from GitHub, the lecturer no longer needs to calculate manually how often each participant contributes to a task.

Keywords: task management, web scraping, GitHub, Python, e-learning.

Copyright (c) 2019 Universitas Ahmad Dahlan. All rights reserved.

1. Introduction
The traditional learning paradigm, in which the teacher is the source of everything, is called teacher-centered learning. It has since shifted to a student-centered paradigm, in which students are required to actively elaborate on the information they have obtained and to sharpen their collaboration skills by solving problems creatively and skillfully [1]. Accordingly, students must be able to learn independently, so learning activities need an appropriate strategy in order for the learning process to run effectively and efficiently. An independent learning process improves student independence, so that students do not depend on the attendance of the teacher or the teacher's description of the material. In the student-centered paradigm, the teacher acts as a facilitator who designs the learning process. Supporting student-centered learning activities therefore requires a medium that helps and facilitates collaboration [2] and generates a stimulus for learners to expand, deepen, and apply the information they have received in class [3].

Along with the rapid development of information and communication technology, educational technology is also affected. The application of technology in education indirectly affects learning methods and is expected to help students. One product of the integration of information technology into education is e-learning [4]. The implementation of e-learning as a medium of communication and learning has recently been developing in educational institutions, especially at the college level [4].
Generally, universities that apply e-learning systems use them as a supplement to the subject material presented regularly in the classroom. GitHub can serve as an e-learning medium in software development because GitHub provides access control, source control, collaboration, and transparency features such as bug tracking, feature requests, task management, and issues [5]. Teachers take advantage of GitHub's collaboration and transparency features to create, reuse, and combine lessons, to encourage contributions from students, and to monitor their activities on given

Received July 30, 2018; Revised October 10, 2018; Accepted November 13, 2018
tasks [6]. The level of student activeness in contributing to a project can be seen from the students' regular commits and changes to a repository. The implementation of web scraping in a task monitoring system integrated with GitHub can help students and teachers obtain information on repositories, collaborators, commits, and issues, which shortens the process of calculating the level of student activeness according to the contributions made within a certain period.

2. Related Works

A learning method based on student-centered learning combines collaborative methods using learning media such as audio tapes, overhead transparencies (OHT), or GitHub. Joseph Feliciano, Margaret-Anne Storey, Yiyun Zhao, Weiliang Wang, and Alexey Zagalsky describe how GitHub emerges as a collaboration platform for education; their research aims to understand how GitHub's social and collaborative features may support (or may inhibit) the experiences of students and teachers [7]. They find that students benefit from GitHub's transparency and open workflow; however, some students worry because GitHub is not inherently a learning medium [6]. Web scraping as a data retrieval technique has been applied in previous research for various needs and objects. Leo Rizky Julian and Friska Natalia used web scraping to compare data from five different online computer stores, so that users can save on the cost of purchasing computer components [8]. The comparison features are based on the principle that consumers want to buy goods not only at the lowest price but also of the best quality. Other research was conducted by K. Sundaramoorthy, R. Durga, and S.
Nagadarshini from Agni College of Technology; the background of their research is simplifying the categorization of news from various portals, using a bot that dynamically extracts URLs at specific intervals [9]. Combining the related work above, this research implements web scraping with e-learning through GitHub. Web scraping harvests selected information that becomes the criterion of a task assessment, and this method makes it easier to check the task.

3. Methodology
3.1. Web Scraping

Web scraping, also called web crawling or web spidering, is programmatically going over a collection of web pages and extracting data [10]; the method excels at gathering and processing large amounts of data [11]. It is employed to extract very large amounts of data from websites, whereby the data is extracted and saved to a local file, to a database table, or to a spreadsheet file. Web scraping services automate this process instead of manually copying the data from GitHub or any other website. The layout of GitHub is described using the Hypertext Markup Language (HTML). An HTML document mostly consists of four types of elements: document structure, inline, reciprocal, and block elements. The most common abstract model for HTML documents is a tree; an example of an HTML document modelled as a tree is shown in Figure 1.

3.1.1. Web Scraping Steps

Grabbing data from each link using the BeautifulSoup4 module on Python 3 requires a few stages; how many stages are needed depends on the link or the web structure. The first step is to determine the pages that will be used as information sources. The list of web page addresses used in this study is shown in Table 1.
The second step is to extract the information from the source page using the web scraping technique. Generally, there are two stages to take data automatically from a webpage, as follows:
Figure 1. An example of an HTML document modelled as a tree

Table 1. The List of URLs

  URL                  Specific URL
  https://github.com/  https://github.com/bukuinformatika/sig
                       https://github.com/bukuinformatika/sig/issues
                       https://github.com/bukuinformatika/sig/commits?author=awangga

1. Learning and identifying the HTML document of the website from which the information will be taken, i.e., the HTML that flanks the information of interest.
2. Searching for navigation mechanisms on the website so that the retrieval can be imitated by the web scraper application to be created. In this stage, BeautifulSoup4 extracts multiple types of data (text, links, and more), as shown in Figure 2.

Figure 2. Graph of web scraping

3.1.2. Data Extraction

This step is the act of retrieving data that already exists on a website and converting it into a format suitable for analysis. Web pages are rendered by the browser. BeautifulSoup4 is essentially a set of wrapper functions that make it simple to select common HTML or XML elements.
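As a minimal sketch of this extraction stage, BeautifulSoup4 can pull both text and links out of a parsed document tree. The HTML snippet and tag names below are illustrative assumptions, not taken from GitHub's actual markup:

```python
from bs4 import BeautifulSoup

# A small, hypothetical HTML document standing in for a fetched page.
html = """
<html><body>
  <div class="listing">
    <a href="/repo-one">repo-one</a>
    <a href="/repo-two">repo-two</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract link targets and their visible text from the document tree.
links = [(a["href"], a.get_text()) for a in soup.find_all("a")]
print(links)  # [('/repo-one', 'repo-one'), ('/repo-two', 'repo-two')]
```

The same `find_all` and attribute-access calls work against a real page once its HTML has been fetched.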
3.1.3. Transformation

After the cleaned data is obtained from parsing and cleaning, the data serialization module serializes it accordingly. In this final module, the result is transformed into a spreadsheet document, and the lecturer uses the data for student task assessment.

4. Result and Analysis
4.1. Requirement Analysis

The first phase in the waterfall model is requirement analysis. In this phase, the author analyzes the requirements for the application and determines its features. The requirement analysis phase obtains information about how the application should be made through observation and reference studies.

4.2. Design

In the design phase, unified modeling language (UML) diagrams such as the use case diagram, activity diagram, and sequence diagram were designed. This activity illustrates the process that runs in the application and the relations between its entities. For the front end, the author used the Flask micro web framework, written in Python and based on the Werkzeug toolkit and the Jinja2 template engine [12].

4.2.1. The XML DOM Tree Generation

XML parsing is taking in XML code and extracting relevant information [13], such as the title of the repository, fork count, issue count, contribution count, commit link, author name, date, and commit count. XML parsing is the process of taking raw XML code, reading it, and generating a DOM tree object structure from it. BeautifulSoup4 is a set of wrapper functions used to select XML and HTML elements. It is a class used to parse XML files directly, a DOM-based tool in which the parser makes a single sequential pass through the file to parse it [14]. The parser does not save any of the tags or the contents inside the tags.
This leads to very fast parsing, because the XML file contents are not changed by the parser and the parser makes only one pass through the file. The BeautifulSoup4 class constructs a DOM (Document Object Model) object, meaning that the entire contents of the XML file are stored in memory. The DOM is a convention used in HTML, XHTML, and XML for representing and interacting with objects [14]. The elements in an XML document may have attributes. Even though DOM parsing is slower, it allows making changes to the XML file contents. BeautifulSoup4 uses two kinds of objects to perform XML parsing: the BeautifulSoup object, which holds the entire XML file's content in a tree-like structure, and the tag object, which contains a number of attributes and methods that make it easy to manipulate the XML file.

4.3. Algorithm for Scraping the Content from XML Using BeautifulSoup4

1. Import the necessary libraries for scraping, such as BeautifulSoup4, to parse the data returned from the website.
2. Fetch the contents of the URL using the urllib2 library and save it in a variable.
3. For each link do:
   (a) Parse the XML in the page variable and store it in BeautifulSoup4 format.
   (b) For each data item in the item tag, scrape the title of the repository, fork count, issue count, contribution count, commit link, author name, date, and commit count.
4. Save the scraped content into Microsoft Excel or spreadsheet format.
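Steps 1-3 of the algorithm above can be sketched as follows. The `<item>` markup and field names are illustrative assumptions, and a static XML string stands in for a page fetched over the network so the sketch stays self-contained:

```python
from bs4 import BeautifulSoup

# Static XML standing in for the fetched page contents (step 2);
# the <item> structure and field names are hypothetical.
xml = """
<items>
  <item><title>sig</title><forks>4</forks><issues>2</issues></item>
  <item><title>docs</title><forks>1</forks><issues>0</issues></item>
</items>
"""

# Step 3(a): parse the contents into a BeautifulSoup tree.
soup = BeautifulSoup(xml, "html.parser")

# Step 3(b): scrape the chosen fields from each item tag.
rows = [[item.find("title").text,
         item.find("forks").text,
         item.find("issues").text]
        for item in soup.find_all("item")]

print(rows)  # [['sig', '4', '2'], ['docs', '1', '0']]
```

Step 4, serializing the collected rows to a spreadsheet format, is covered by the transformation module described in Section 3.1.3.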
Listing 1. Scraping Commit URL

    def parseCommitData():
        global commit_url, commit_data
        resp = ses.get(commit_url)
        soup = BeautifulSoup(resp.text, "html.parser")
        div_ele = soup.find("div", {"class": "commits-listing"})
        li_l = div_ele.find_all("li", {"class": "commits-list-item"})
        for li in li_l:
            try:
                lst = []
                auth = parseText(li.find("a", {"class": "commit-author"}).text)
                dt = parseText(li.find("relative-time").text)
                tit_ele = li.find("a", {"class": "message"})
                cmt_title = ""
                if tit_ele:
                    cmt_title = parseText(tit_ele.text)
                lst.append(len(commit_data) + 1)
                lst.append(auth)
                lst.append(dt)
                lst.append(cmt_title)
                commit_data.append(lst)
            except Exception:
                print("error")
        print(commit_data)

Listing 1 shows how the BeautifulSoup4 module parses the HTML document and crawls the website (commit data); the code follows the overall pattern of the scraper and handles exceptions so that it remains coherent.

4.3.1. Result of the Research

The result of this research is a web-based application for a task monitoring system integrated with GitHub, in which the lecturer can get information on the repository, collaborators, commits, and issues, as shown in Figure 3. This shortens the process of calculating the level of student activeness according to the contributions made within a certain period.

Figure 3. The result of data extraction
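The transformation step of Section 3.1.3 can then serialize the collected commit rows into a spreadsheet-friendly format. The sketch below uses CSV as the spreadsheet format; the row layout and the file name `commit_data.csv` are assumptions for illustration, not the exact output format of the system:

```python
import csv

# Hypothetical rows in the shape collected by the commit scraper:
# [index, author, date, commit title]
commit_data = [
    [1, "awangga", "Oct 10, 2018", "fix typo"],
    [2, "restiyana", "Oct 11, 2018", "add module"],
]

# Serialize the collected commit data to a CSV file that a
# spreadsheet application can open directly.
with open("commit_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["no", "author", "date", "title"])
    writer.writerows(commit_data)

with open("commit_data.csv") as f:
    print(f.read().strip())
```

The lecturer can then open the resulting file in a spreadsheet application to assess student contributions.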
4.4. Maintenance

The maintenance phase is the last phase of the waterfall SDLC method [15]. In this phase, the author maintains the application. If there are changes to the design structure of GitHub, the scraping process that previously used the same pattern will fail to obtain data. This failure can be detected when the amount of data obtained by the scraping process decreases, or when the scraping pattern cannot obtain data at all. If either condition happens, the author needs to reanalyze every stage of data retrieval to discover which stage has to be changed.

4.5. Response Time

Response time is crucial to how fast the system is to use, e.g., as a web service. Response time is the total amount of time it takes to respond to a request for service; it is the sum of the service time and the wait time [16]. The service time is the time it takes to do the requested work, while the wait time is how long the request had to wait in a queue before being serviced. For this research, the response time for each URL is shown in Table 2 below; running all the URLs at the same time took 23.2 seconds to respond.

Table 2. Response Time

  URL                                                            Response time
  https://github.com/bukuinformatika/sig                         8.1 s
  https://github.com/bukuinformatika/sig/issues                  7.4 s
  https://github.com/bukuinformatika/sig/commits?author=awangga  7.7 s

5. Conclusion

After performing the analysis of the implementation of web scraping in a task monitoring system integrated with GitHub, it can be concluded that the built application answers the problems discussed in the previous chapters. Our work shows that the design of the system facilitates the collection of tasks using the GitHub social network, making documentation and the collection of tasks more structured.
This research showed that the lecturer can get information from GitHub (repository details, collaborator details, commit counts, and detailed issues of the repository), which shortens the process of calculating the level of student activeness according to the contributions made within a certain period in one course.

References
[1] M. J. Hannafin, "Student-centered learning," in Encyclopedia of the Sciences of Learning. Springer, 2012, pp. 3211-3214.
[2] S. Armiati and R. Awangga, "SQL collaborative learning framework based on SOA," in Journal of Physics: Conference Series, vol. 1007, no. 1. IOP Publishing, 2018, p. 012035.
[3] D. H. Jonassen and M. A. Easter, "Conceptual change and student-centered learning environments," Theoretical Foundations of Learning Environments, pp. 95-113, 2012.
[4] N. Dabbagh and A. Kitsantas, "Personal learning environments, social media, and self-regulated learning: A natural formula for connecting formal and informal learning," The Internet and Higher Education, vol. 15, no. 1, pp. 3-8, 2012.
[5] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, "The promises and perils of mining GitHub," in Proceedings of the 11th Working Conference on Mining Software Repositories. ACM, 2014, pp. 92-101.
[6] J. Feliciano, M.-A. Storey, and A. Zagalsky, "Student experiences using GitHub in software engineering courses: a case study," in Proceedings of the 38th International Conference on Software Engineering Companion. ACM, 2016, pp. 422-431.
[7] A. Zagalsky, J. Feliciano, M.-A. Storey, Y. Zhao, and W. Wang, "The emergence of GitHub as a collaborative platform for education," in Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 2015, pp. 1906-1917.
[8] L. R. Julian and F. Natalia, "The use of web scraping in computer parts and assembly price comparison," in New Media (CONMEDIA), 2015 3rd International Conference on. IEEE, 2015, pp. 1-6.
[9] K. Sundaramoorthy, R. Durga, and S. Nagadarshini, "NewsOne: an aggregation system for news using web scraping method," in Technical Advancements in Computers and Communications (ICTACC), 2017 International Conference on. IEEE, 2017, pp. 136-140.
[10] S. K. Malik and S. Rizvi, "Information extraction using web usage mining, web scraping and semantic annotation," in Computational Intelligence and Communication Networks (CICN), 2011 International Conference on. IEEE, 2011, pp. 465-469.
[11] R. Mitchell, Web Scraping with Python: Collecting Data from the Modern Web. O'Reilly Media, Inc., 2015.
[12] M. Grinberg, Flask Web Development: Developing Web Applications with Python. O'Reilly Media, Inc., 2014.
[13] G. Wilcock, "Pipelines, templates and transformations: XML for natural language generation," in Proceedings of the 1st NLP and XML Workshop, 2001, pp. 1-8.
[14] V. G. Nair, Getting Started with Beautiful Soup. Packt Publishing Ltd, 2014.
[15] M. Mahalakshmi and M. Sundararajan, "Traditional SDLC vs Scrum methodology: a comparative study," International Journal of Emerging Technology and Advanced Engineering, vol. 3, no. 6, pp. 192-196, 2013.
[16] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley & Sons, 1990.