Indonesian Journal of Electrical Engineering and Computer Science
Vol. 17, No. 2, September 2020, pp. 1602-1609
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v17i2.pp1602-1609

A computing model for trend analysis in stock data stream classification

Abdul Razak M. S (1), Nirmala C. R (2)
(1) Department of Computer Science and Engineering, Bapuji Institute of Engineering and Technology, India
(2) Visvesvaraya Technological University, India

Article Info
Article history: Received Sep 28, 2019; Revised Mar 13, 2020; Accepted Apr 8, 2020
Keywords: Classification, Data stream, Stock trading, Trend analysis

ABSTRACT
For several decades, many statistical and scientific efforts have been made toward better analysis and prediction of stock trading. The field nevertheless remains open, offering scientists new avenues to rethink the problem and discover new inferences by adopting the latest technologies. In this regard, we apply classification techniques to a stock data stream, through feature extraction, for trend analysis. The proposed work uses k-means to cluster samples into two clusters (stocks in trend as one cluster and stocks not in trend as the other). The trend analysis is based on density estimation of the stocks with respect to sectors. A histogram, a well-known data representation method, is used to represent the sector which is in trend. The work has been implemented and tested on live NSE (India) data using Python and its related tools.

Copyright © 202x Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author: Abdul Razak M. S, Department of Computer Science and Engineering, Bapuji Institute of Engineering and Technology, Davangere, India. Email: msabdulrazak@gmail.com

1. INTRODUCTION
Data stream analysis has opened up new avenues and opportunities for computer science and engineering scientists.
A data stream is data recorded with respect to time, so it can be regarded as a signal. The signal may be continuous or discrete, and all the parameters that apply to signals hold good for data streams. Stock trading and its transactions generate a large amount of data over time and hence can be regarded as a data stream. Data stream classification [1] is an area that enables researchers to identify or extract new features through any acceptable scientific process. Classification techniques applied to stock data may provide certain inferences; one such inference is the trend of a stock. The Indian stock market has identified eleven major sectors [2] to categorize stocks. Trend analysis [3] is the process of estimating which entity is in trend, or has grabbed attention, among the participating entities. Some stocks may be in trend for several reasons, such as season, price, need, dependency, availability of alternatives, a price drop in jewellery, the currency market and so on.

Data stream analysis [4] is one of the most challenging processes in internet applications. Because the data are continuously available and updated, efficient techniques to process them and declare inferences are required. The stock market [5] produces an enormous amount of data in the repository. Analysing and managing such abundant data and producing acceptable results is one of the biggest challenges [6] for computer scientists, because the behaviour of the system varies as new data are added to the repository. A classification model may provide certain revolutionary inferences, and classification on a data stream may open up many questions about the performance of the classification model. Classification models begin with feature extraction, that is, identifying the properties which define the samples for the analysis.

Journal homepage: http://ijeecs.iaescore.com
The process of feature extraction on data streams, and the classification of these features, has created a number of avenues in the research domain. In this paper, we mainly focus on estimating the sector (a collection of stocks with similar properties) which is in trend for a given time period through classification techniques. Features are extracted from live data obtained from the NSE server and then clustered. The experiments were conducted using Anaconda [7] and Jupyter tools, with the nsepy [8] Python package providing the live data.

2. REVIEW OF LITERATURE
Since the beginning of stock trading, many experiments have been carried out to derive useful inferences. The authors of [9] discuss the volatility of the Kuala Lumpur Composite Index (KLCI) using stochastic volatility (SV) models and generalized autoregressive conditional heteroscedasticity (GARCH) models. The model results show slight differences in Root Mean Squared Error and Moving Average Envelope. In total, 971 daily observations of the KLCI closing price index, from 2nd January 2008 to 10th November 2016 and excluding public holidays, were used. The SV model is found to be the best based on the lowest RMSE and MAE values. The study in [10] examines the financial market for five different companies from Malaysia, namely CIMB, Sime Darby, Axiata, Maybank and Petronas, using machine learning algorithms. Two types of experiments were conducted based on the type of data. The first experiment used textual data, 6368 financial news articles, classified as positive or negative using SVM. The second experiment used numeric historical data, 5321 records, to predict whether the stock price would go up or down using the Random Forest algorithm. The authors of [11] propose an embedded streaming SVM classification architecture for continuous data processing.
Paper [12] presents a dynamic way of selecting the number of clusters in the K-means clustering algorithm. The proposed algorithm is applied to cluster the iris dataset, and its performance is measured using inter-cluster distance and sum of squared error and compared with the general K-means algorithm. The authors of [13] propose a classification model for complex question answering. The authors of [14] present the effects of news in online social media on the purchase of pharmaceutical stocks. The experiment is conducted using Nifty Pharma index data, and a sentiment analysis model is developed for pharma stock prediction. The sentiment analysis model achieved an accuracy of 70.59% in predicting daily stock movement. The author of [15] presents analysis and prediction of US real-time stock data from Yahoo Finance using big data analytics. A machine learning model is developed to predict the future crude oil price using United States Oil Fund (USO) data. The model identifies the best features for better oil price prediction. Paper [16] presents an analysis of one year of US stock market data based on a network approach. The paper addresses the correlation of one stock with other stocks and also identifies the key players in the market based on their number of dependencies. The authors of [17] address stock selection using both technical and fundamental information. A framework is designed to make class predictions for the industrial sector of the Australian stock market. The stock selection and trading strategy outperformed the Australian stock index. The accuracy of classification models such as decision tree, CHAID tree and neural network is compared.

3. METHODOLOGY
Figure 1 depicts the methodology of the proposed research work.

3.1. Data collection
Classification can yield very useful inferences, but these inferences mainly depend on the applicable data collected from the environment.
This work requires a data stream, that is, live data which is continuous and may fall within some range. The range of data from start time to end time decides the stock and its trend in the market. nsepy is the Python package used to access live NSE India stock trading data. This package provides the parameters of each stock, namely totalTradedVolume, totalTradedValue, Open, Close, dayHigh, dayLow and so on.

3.2. Feature extraction
The proposed methodology treats the stock data, which is recorded with respect to time, as a discrete signal. Hence all the features applicable to discrete signals are candidate features in the proposed work. Of these many features, only those that suit the analysis and give better results are used; this process is called feature selection [18]. To achieve better analysis, the type of data and its importance matter in the classification and conclusion. The importance and type can be determined from stock trading experience or through computational analysis. Stock data have several parameters, namely totalTradedVolume, totalTradedValue, Open, Close, dayHigh, dayLow and so on. These parameters are used for further feature extraction.

3.2.1. Standard deviation
Since the model operates on data recorded with respect to time, it is essential to estimate the standard deviation [19] among the members. This feature measures the amount of fluctuation among the members of the data stream. A stock data stream is a sequence of values with respect to time or date. This paper considers five properties returned by the get_history function of the nsepy package of Python: Open, Close, Low, High and Volume. For each stock and each property, the standard deviation is estimated.

3.2.2. Kurtosis
"Kurtosis is a measure of the combined weight of a distribution's tails relative to the center of the distribution". This measure may indicate a rise in the distribution if it turns positive [20]. Figure 2 depicts the positive and negative nature of the measure along with the distribution pattern.

Figure 1. Proposed methodology for trend analysis in stock data stream classification

Figure 2. Tailed and centered distribution
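The two statistical features of Sections 3.2.1 and 3.2.2 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the population standard deviation and the excess (Fisher) kurtosis, since the paper does not say which estimators were used, and the function names and sample values are hypothetical.

```python
import numpy as np


def std_and_kurtosis(values):
    """Return (population standard deviation, excess kurtosis) of a 1-D series."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    if variance == 0.0:
        # A constant series has no fluctuation; kurtosis is undefined, so report 0.
        return 0.0, 0.0
    fourth_moment = np.mean(centered ** 4)
    # Excess kurtosis (assumption): 0 for a normal distribution,
    # positive for heavier tails, negative for lighter tails.
    excess_kurtosis = fourth_moment / variance ** 2 - 3.0
    return float(np.sqrt(variance)), float(excess_kurtosis)


def extract_std_kurt(history):
    """Map {property: values} to {"<property>_std": ..., "<property>_kurt": ...}."""
    features = {}
    for name, values in history.items():
        std, kurt = std_and_kurtosis(values)
        features[f"{name}_std"] = std
        features[f"{name}_kurt"] = kurt
    return features


# Hypothetical daily values for one stock (illustrative numbers, not NSE data).
sample_history = {
    "Open": [100.0, 102.0, 101.0, 105.0, 107.0],
    "Close": [101.0, 101.5, 104.0, 106.0, 108.0],
}
features = extract_std_kurt(sample_history)
print(features)
```

In a real run, the dictionary passed to `extract_std_kurt` would hold the Open, Close, Low, High and Volume columns fetched for one stock over the chosen period.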
3.2.3. Augmented Dickey-Fuller test [21]
This feature applies to non-stationary, time-variant systems. Stock market data series can be regarded as non-stationary because the mean and variance of the system vary over time. This model is well suited to stock market data for analysing the stock trend. This paper uses the unit-root-with-drift form of the Dickey-Fuller test. The unit root, or stationarity, of the distribution can be estimated using (1):

$$\Delta Y_t = \beta_1 + \beta_2 t + \delta Y_{t-1} + \sum_{i=1}^{M} \alpha_i \, \Delta Y_{t-i} + \varepsilon_t \tag{1}$$

where $\Delta Y_t = Y_t - Y_{t-1}$, $\beta_1$ is the drift, $\beta_2$ is the coefficient of the deterministic time trend, $M$ is the number of lagged difference terms and $\varepsilon_t$ is the error term.

Figure 3 depicts the data stream which is in trend and the same with drift. This paper uses only the deterministic time trend coefficient as a feature for further classification.

Figure 3. Trend analysis with drift

3.3. Dimensionality reduction using PCA
The proposed methodology extracts around twelve features, that is, three features from each property (Open, Close, High, Low, Volume) of the stock data. Not all of these features necessarily play an important role in the classification. Hence, dimensionality reduction [22] is used to reduce the number of features. This may reduce the complexity of the classification and may improve the process with more meaningful inferences [23]. Principal Component Analysis (PCA) is one of the readily available algorithms for dimensionality reduction. The proposed work reduces the twelve features to three features.

3.4. Clustering using K-means clustering algorithm
The proposed work groups the available stock features into two clusters using k-means clustering [24], as shown in Figure 4. One cluster contains the stocks which are in trend and the other contains those which are not.

3.5. Trend analysis
This is the final phase of the methodology, which considers all the samples from cluster 2 (cluster 2 is assumed to be the trend cluster; it contains all the samples whose features have given trend coefficients). The distance from the origin to the centroid of a cluster determines which cluster contains the stocks in trend: the greater the distance, the stronger the trend. This assumption is based on trial and error. A histogram is then applied to the stock category (stock indices) of the samples. As per the survey, there are eleven stock indices in the Indian stock market. Figure 5 depicts a sample histogram, which shows that the banking sector index is in trend compared to all other sectors. The histogram [25] clearly depicts the status of the sectors in the market. This status indicates that sector number 5 is in trend. This trend may change as per the market transactions.
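For the Augmented Dickey-Fuller feature of Section 3.2.3, the deterministic time trend coefficient can be obtained by fitting the drift-and-trend regression with ordinary least squares. The sketch below is one possible way to do this under stated assumptions: the paper gives neither the lag order nor the estimation routine, so the function name, the `n_lags` parameter and the synthetic series are all hypothetical.

```python
import numpy as np


def adf_trend_coefficient(series, n_lags=1):
    """Fit dY_t = b1 + b2*t + d*Y_{t-1} + sum_i a_i*dY_{t-i} + e_t by OLS; return b2."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)  # dy[j] = y[j + 1] - y[j]
    rows = np.arange(n_lags, len(dy))  # positions in dy with n_lags predecessors
    n_params = 3 + n_lags
    if len(rows) <= n_params:
        raise ValueError("series too short for the requested number of lags")
    columns = [
        np.ones(len(rows)),        # drift term (b1)
        (rows + 1).astype(float),  # time index t (b2)
        y[rows],                   # lagged level Y_{t-1} (d)
    ]
    for i in range(1, n_lags + 1):
        columns.append(dy[rows - i])  # lagged difference dY_{t-i} (a_i)
    design = np.column_stack(columns)
    coefficients, *_ = np.linalg.lstsq(design, dy[rows], rcond=None)
    return float(coefficients[1])


# Synthetic series (assumption): a linear trend plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(200)
rising = 0.5 * t + rng.normal(0.0, 1.0, size=200)
falling = -0.5 * t + rng.normal(0.0, 1.0, size=200)
print(adf_trend_coefficient(rising), adf_trend_coefficient(falling))
```

A positive coefficient suggests an upward deterministic trend and a negative one a downward trend. A statistics library routine that reports the full test statistic and p-value could replace this hand-written regression; only the trend coefficient is kept here because that is the feature the methodology uses.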
Figure 4. Samples categorized into two clusters

Figure 5. Histogram based on sector indices of samples

4. RESULTS AND DISCUSSIONS
As per NSE (National Stock Exchange), the eleven sector indices are Nifty Auto Index, Nifty Bank Index, Nifty Financial Services Index, Nifty FMCG Index, Nifty IT Index, Nifty Media Index, Nifty Pharma Index, Nifty Private Bank Index, Nifty PSU Bank Index, Nifty Realty Index and Nifty500 Industry Indices. In the proposed work, fifteen stocks have been considered in each sector index for the trend analysis. Figure 6 shows the feature values extracted from the selected parameters (Open, Close, Low, High and Volume) within a given period.

Figure 6. Feature values of selected stocks from sectors

Figure 7 shows the results of PCA (dimensionality reduction) applied to the features shown in Figure 6. Here the standard deviations, kurtosis values and ADF coefficients of the open, close, high and volume properties are reduced into single columns, namely the std, kurt and adf columns respectively. PCA reduces the complexity of the classification process by extracting the necessary features. K-means then clusters the given samples into two clusters. The last column of Figure 7 indicates cluster 1 by 0 and cluster 2 by 1. Figure 8 shows the histogram of the Index column of Figure 7, considering only cluster 2 samples.
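The PCA and K-means steps described above can be sketched with NumPy alone. This is an illustrative stand-in rather than the implementation behind Figure 7: the paper does not name the library it used, and the feature standardization, the farthest-first centroid initialization and the synthetic feature matrix are assumptions made here.

```python
import numpy as np


def pca_reduce(features, n_components=3):
    """Project standardized features onto the leading principal components."""
    x = np.asarray(features, dtype=float)
    x = x - x.mean(axis=0)
    scale = x.std(axis=0)
    scale[scale == 0] = 1.0  # leave constant columns untouched
    x = x / scale
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:n_components].T


def _nearest(points, centroids):
    """Index of the closest centroid for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(points, k=2, n_iter=100):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    pts = np.asarray(points, dtype=float)
    chosen = [pts[0]]
    for _ in range(1, k):
        gaps = np.linalg.norm(pts[:, None, :] - np.array(chosen)[None, :, :], axis=2)
        chosen.append(pts[gaps.min(axis=1).argmax()])
    centroids = np.array(chosen)
    for _ in range(n_iter):
        labels = _nearest(pts, centroids)
        updated = np.array([
            pts[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return _nearest(pts, centroids), centroids


# Synthetic 12-feature samples (assumption): two well-separated groups of stocks.
rng = np.random.default_rng(42)
quiet = rng.normal(0.0, 0.1, size=(6, 12))
active = rng.normal(10.0, 0.1, size=(6, 12))
reduced = pca_reduce(np.vstack([quiet, active]), n_components=3)
labels, centroids = kmeans(reduced, k=2)
print(labels)
```

Each row of `reduced` plays the role of one stock's three reduced features, and `labels` corresponds to the 0/1 cluster column described for Figure 7.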
Figure 7. PCA and K-means clustering results

Figure 8. Histogram on sector indices considering only cluster 2 samples

The histogram clearly shows that the second sector, that is, the banking sector, had stocks in trend during the given period of time in the market. Classification models certainly improve the efficiency of processes which involve large amounts of data. The extracted features, namely standard deviation, kurtosis and the Dickey-Fuller test, have yielded a result which is acceptable as per the statistics.
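The final step, choosing the trend cluster by centroid distance from the origin (Section 3.5) and counting its samples per sector, can be sketched as below. The function name, the sample labels, the centroids and the sector names in the example are hypothetical and are not taken from Figures 7 or 8.

```python
from collections import Counter

import numpy as np


def trending_sector_counts(labels, centroids, sector_of_sample):
    """Pick the cluster whose centroid is farthest from the origin and count its sectors."""
    centroid_distances = np.linalg.norm(np.asarray(centroids, dtype=float), axis=1)
    trend_cluster = int(np.argmax(centroid_distances))
    counts = Counter(
        sector
        for sector, label in zip(sector_of_sample, labels)
        if label == trend_cluster
    )
    return trend_cluster, counts


# Hypothetical clustering output for five stocks (illustrative values only).
example_labels = [0, 1, 1, 1, 0]
example_centroids = [[0.1, 0.0, 0.0], [2.0, 1.0, 0.0]]
example_sectors = ["Auto", "Bank", "Bank", "IT", "IT"]

trend_cluster, counts = trending_sector_counts(
    example_labels, example_centroids, example_sectors
)
print(trend_cluster, counts.most_common())
```

The counts are the bar heights of the sector histogram; the sector with the largest count is the one reported as being in trend, and any plotting library can draw the bars.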
5. CONCLUSION
The stock market offers tremendous opportunities for businessmen, manufacturers, investors and even data analysts to study the behaviour of the environment and society. In this regard, this paper has analysed the stock data stream to estimate the trending sector index in the market, based on feature extraction and an unsupervised clustering (K-means) technique. The work has been implemented, and its results demonstrated, by fetching the stock data stream from the server through the nsepy package of Python. The proposed work has not considered any performance analysis of the model; this can be taken up as a new proposal through big data analytics.

ACKNOWLEDGMENT
The authors acknowledge and thank the Management, Director and Principal of Bapuji Institute of Engineering and Technology, Davangere, for providing the opportunity and platform to conduct the experiment and produce meaningful results.

REFERENCES
[1] Nguyen, Hai-Long, Yew-Kwong Woon, and Wee-Keong Ng, "A survey on data stream clustering and classification," Knowledge and Information Systems, vol. 45, no. 3, pp. 535-569, 2015.
[2] [Online] Available: https://www.niftyindices.com/indices/equity/sectoral-indices.
[3] [Online] Available: https://www.investopedia.com/terms/t/trendanalysis.asp.
[4] William McKnight, "Chapter Eight - Data Stream Processing: When Storing the Data Happens Later," Editor(s): William McKnight, Information Management, Morgan Kaufmann, pp. 78-85, 2014.
[5] Sachdeva, Akshay, et al., "An Effective Time Series Analysis for Equity Market Prediction Using Deep Learning Model," International Conference on Data Science and Communication, 2019.
[6] Shanmugam, D. B., et al., "Data Stream Clustering Challenges and Management System," Journal of Computational and Theoretical Nanoscience, vol. 16, no. 5-6, pp. 2393-2397, 2019.
[7] Raschka, Sebastian, and Vahid Mirjalili, "Python machine learning," Packt Publishing Ltd, 2017.
[8] [Online] Available: https://nsepy.readthedocs.io/en/latest/.
[9] Ezatul Akma Abdullah and Siti Meriam Zahari, "Modelling volatility of Kuala Lumpur composite index (KLCI) using SV and GARCH models," Indonesian Journal of Electrical Engineering and Computer Science, vol. 13, no. 3, pp. 1087-1094, 2019.
[10] Puteri Hasya Damia Abd Samad and Sofianita Mutalib, "Analytics of stock market prices based on machine learning algorithm," Indonesian Journal of Electrical Engineering and Computer Science, vol. 16, no. 2, pp. 1050-1058, 2019.
[11] J. Sirkunan, J. Tang, N. Shaikh-Husin, and M. Marsono, "A streaming multi-class support vector machine classification architecture for embedded systems," Indonesian Journal of Electrical Engineering and Computer Science, vol. 16, no. 2, pp. 1286-1296, 2019.
[12] Md. Zakir Hossain and Md. Nasim Akhtarn, "A dynamic K-means clustering for data mining," Indonesian Journal of Electrical Engineering and Computer Science, vol. 13, no. 2, pp. 521-526, 2019.
[13] Reddy, A., and Madhavi, K., "Hierarchy based firefly optimized K-means clustering for complex question answering," Indonesian Journal of Electrical Engineering and Computer Science, vol. 17, no. 1, pp. 264-272, 2020.
[14] Dev Shah, Haruna Isah, and Farhana Zulkernine, "Predicting the Effects of News Sentiments on the Stock Market," IEEE International Conference on Big Data (Big Data), 2018.
[15] Zhihao PENG, "Stocks Analysis and Prediction Using Big Data Analytics," International Conference on Intelligent Transportation, Big Data and Smart City, 2019.
[16] Susan George and Manoj Changat, "Network approach for stock market data mining and portfolio analysis," International Conference on Networks and Advances in Computational Technologies, 2017.
[17] Hargreaves, Carol, and Yi Hao, "Does the use of technical and fundamental analysis improve stock choice: A data mining approach applied to the Australian stock market," International Conference on Statistics in Science, Business and Engineering (ICSSBE), 2012.
[18] S. Visalakshi and V. Radha, "A literature review of feature selection techniques and applications: Review of feature selection in data mining," IEEE International Conference on Computational Intelligence and Computing Research, pp. 1-6, 2014.
[19] Altman, D. G., and J. M. Bland, "Standard deviations and standard errors," BMJ, vol. 331, no. 7521, 2005.
[20] Mardia, Kanti V., "Measures of multivariate skewness and kurtosis with applications," Biometrika, vol. 57, no. 3, pp. 519-530, 1970.
[21] Harris, Richard I. D., "Testing for unit roots using the augmented Dickey-Fuller test: Some issues relating to the size, power and the lag structure of the test," Economics Letters, vol. 38, no. 4, pp. 381-386, 1992.
[22] Lotlikar, Rohit, and Ravi Kothari, "Adaptive linear dimensionality reduction for classification," Pattern Recognition, vol. 33, no. 2, pp. 185-194, 2000.
[23] Cao, L. J., et al., "A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine," Neurocomputing, vol. 55, no. 1-2, pp. 321-336, 2003.
[24] Jain, Anil K., "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, 2010.
[25] Pizer, Stephen M., et al., "Adaptive histogram equalization and its variations," Computer Vision, Graphics, and Image Processing, vol. 39, no. 3, pp. 355-368, 1987.