TELKOMNIKA, Vol. 16, No. 2, April 2018, pp. 834-842
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/telkomnika.v16.i2.7669

Data Cleaning Service for Data Warehouse: An Experimental Comparative Study on Local Data

Arif Bramantoro
Faculty of Computing and Information Technology in Rabigh
King Abdulaziz University, Jeddah, Saudi Arabia
e-mail: asoegihad@kau.edu.sa

Abstract
A data warehouse is a collective entity of data from various data sources. Data in a warehouse are prone to several complications and irregularities. Data cleaning is a non-trivial service for ensuring data quality: it involves identifying errors, removing them, and improving the quality of the data. One of its common methods is duplicate elimination. This research focuses on duplicate elimination services applied to local data. It first surveys data quality, focusing on quality problems, cleaning methodology, and the stages and services involved in a data warehouse environment. It then compares services through experiments on local data covering different cases, such as different spelling with different pronunciation, misspellings, name abbreviation, honorific prefixes, common nicknames, split names, and exact match. All services are evaluated against the proposed quality-of-service metrics, such as performance, capability to process large numbers of records, platform support, data heterogeneity, and price, so that in the future these services can reliably handle big data in a data warehouse.

Keywords: Data Cleaning Service, Data Warehouse, Data Quality, Local Data

Copyright © 2018 Universitas Ahmad Dahlan. All rights reserved.

1. Introduction
A data warehouse is a relational database for querying and analysis through further processing. It is populated from transactions in other source systems. An integrated data warehouse combines files, sources, and other records; several services, such as data cleaning and data integration, are used within the enterprise to ensure good data. A subject-oriented data warehouse is a subject-centric model involving several subjects, such as vendor, product, sales, and customer [1].
A good data warehouse must focus on proper analysis and cleaning of data rather than on daily service transactions and operations. This kind of model is required by most enterprises. The model must be simple, related to the data cleaning objective, and should exclude data that are not required for transaction and decision-making operations. The non-volatile nature of the data is important for these operations: the data should be physically separated from the application. This separation helps with recovery and with other time-consuming mechanisms, such as loading and retrieval of data [2]. Time variance refers to the period of time represented by the data stored in the warehouse; it is the element of time. Decisions taken during this process are very important, because the generated trend reports are significant elements in data warehousing, and a proper decision support system works successfully with such reports. Several commercial applications, such as customer relationship and business applications, utilize data cleaning services. A proper scheme is involved during the development of a data warehouse.
Questions and analysis are settled in the design stage. Meaningful access to relevant data is required, together with the generated values. Extraction from the source is important, and it must therefore be very clean, without any unrelated sources. Data cleaning is a service provided in any big enterprise: input data are rechecked and tested before they are allocated to a specific data warehouse, and the loaded data are kept separate from technical specifications and processes [3].

Received October 25, 2017; Revised February 13, 2018; Accepted March 14, 2018
Several automatic executions are also performed to eliminate errors in the data. Incomplete data hinder processing and make corrections more complicated. A service called backflushing is used to recheck the cleaned data frequently. Data are first installed into the model from other sources. A monitoring service supports the recovery of data at different levels, from huge to small quantities. The amount of load is also related to the warehouse process; hence, cautious steps are worthwhile to process the data smoothly.
ETL is the process of Extract, Transform, and Load: extraction of the relevant data is followed by its transformation and, finally, by loading the data into the warehouse. Extract is the method by which the data warehouse obtains data from the allocated sources. The consolidation of these sources takes place in a separate system allocated to each level of processing. This extraction step carries the data to the next level, called transformation [4].
Transform is the mechanism that converts the extracted data from its previous form and places it in the data warehouse without introducing errors. The data source needs proper manipulation by all methods: a set of functions extracts data into the warehouse without modifying the existing data, and the technical and other constraints are validated against the requirements. Several transformations are involved in the process, such as selecting only particular information and assigning it to specific functions. Encoding data with values is a concrete data cleaning service that occurs automatically. Other forms of data cleaning service encode a result into a new value or combine data values from two different methods. The form of the data can be a simple or a complex method. The data path may fail or succeed; both outcomes are handled by routines in a specific program. For example, the model can be a translated code in the extracted data.
Load is the process of placing the handled data, which is as important as targeting the right range of information. Some data can be overwritten with other updated or non-updated data. The selection of the design is also important, together with a proper understanding of the available choices related to time and business requirements. A complex system model allows several changes to be updated and uploaded. The overall quality of data in the data warehouse environment is validated by utilizing the ETL mechanism [5].
The objective of this paper is to identify the causes of data quality problems. A particular methodology and a set of experiments are adopted to address them, with the expectation of contributing to better data quality in data warehouses. Data cleaning and duplicate elimination services are appropriate methods to improve data quality. Experiments are conducted to provide results and comparisons between de-duplication services. The research approach presented here is novel in its level of implementation: it utilizes a service-oriented approach as evident support for the conducted survey.
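To make the ETL flow described above concrete, the following is a minimal sketch of a single extract-transform-load pass, not the pipeline used in this study; the file name, column names, and SQLite target are illustrative assumptions.

import csv
import sqlite3

def extract(path):
    """Extract: read raw customer records from a source file (assumed CSV)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: normalize fields and drop records that fail validation."""
    clean = []
    for row in rows:
        name = " ".join(row.get("name", "").split()).title()  # collapse whitespace
        city = row.get("city", "").strip().title()
        if name:  # reject records with no usable name
            clean.append((name, city))
    return clean

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed records into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customer (name TEXT, city TEXT)")
    con.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("customers.csv")))

Note that the transform step rejects invalid records before loading, mirroring the point above that cleaning must happen before data reach the warehouse.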
In this research, only the data cleansing service is considered as the main task. The rest of the data quality services, such as completeness and historical reputation services, remain as future work toward composing more services in a service-oriented system.

2. Data Quality in Data Warehouse
Data warehousing is a promising industry for several government organizations and private institutions, and it involves storing confidential data with internal security implications. With the enormous amount of data, the organization's responsibility becomes critical when it comes to security concerns [6]. Assuring data quality is a primary objective at every management level, since the potential for data quality problems and irregularities keeps increasing. Organizations adopt data warehouses to improve the relationships between customers, clients, and management; improving efficiency for the entire organization is therefore required. Data quality is defined as the measured performance, or the loss, of data in an organization [2]. The purpose of data quality measurement is to identify data missing from the system.
The quality of the data attained for the data warehouse model assures the inputs on the client side. However, users differ from one another: the data must be simple, consistent, and fully understandable. The abundance of data increases the burden on the system side.
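To make the measurement of data quality concrete, the sketch below computes two simple dimension metrics: completeness (how much data is missing) and uniqueness (how much is duplicated). The metric definitions and field names are illustrative assumptions, not a standard taken from the works cited here.

def completeness(rows, field):
    """Fraction of records whose given field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field, "").strip())
    return filled / len(rows) if rows else 0.0

def uniqueness(rows, field):
    """Fraction of distinct values in a field; 1.0 means no duplicates."""
    values = [r.get(field, "").strip().lower() for r in rows if r.get(field)]
    return len(set(values)) / len(values) if values else 0.0

records = [
    {"name": "Ahmed Ali", "city": "Jeddah"},
    {"name": "ahmed ali", "city": ""},
    {"name": "Sara Omar", "city": "Riyadh"},
]
print(f"completeness(city) = {completeness(records, 'city'):.2f}")  # 0.67
print(f"uniqueness(name)   = {uniqueness(records, 'name'):.2f}")    # 0.67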
The quality of the data is critical, together with the identification of irregularities. The key data quality dimensions and their metrics are important to understand for effective quality improvement. Data quality matters because of how the data warehouse system is used, and it is measured in each phase of operations. Metrics are selected to ensure the measurement and analysis of data quality; their selection is critical because the final result directly affects the customer relationship. Quantifying data is important to save cost and improve market standing in a competitive economy [7]. In this paper, data quality and quality-of-service metrics are combined to improve confidence in the data quality process.
With advancing technology and enormous data inputs in industry, authorities need to improve the quality of enterprise data. Enterprises face several problems in maintaining and sustaining their quality of service in delivering a project. Data quality types are classified into intrinsic, contextual, representational, and accessible. Performance is the quality-standard factor in any enterprise trademark. Data should be added statically rather than dynamically in order to efficiently avoid irregularities while monitoring the quality-standard improvement process. The consumer must be carefully considered whenever data are sent by the client to the enterprise [8]. A common data quality framework includes a loop of activities that weighs their cost and benefit, as illustrated in Figure 1.

Figure 1. Data Quality Process Loop

3. Classification of Data Quality
A modern data quality improvement approach requires a real-time scenario, with a preference for avoiding purely operational and analytical models [9]. The correctness and assurance of data quality are measured during and after the improvement. Data quality issues need to be handled when designing services for a data warehouse, so that the services themselves are free of quality problems. Problems caused by poor data are examined to derive a proper procedure; inaccurate information supplied by the customer is another cause of declining quality. Unlike the conventional approach, the modern approach must also consider several proximity and time-variance issues. The source of data in a modern data warehouse is related to data quality improvement, since fields are often filled in unstructured forms. These issues represent improvements and advancements of modern data warehouse research over the conventional methods of Inmon [10] and Kimball [11].
According to the Data Warehouse Institute [12], data quality improvement includes correcting defective data to ensure that a minimum data quality standard is achieved. Data are required to be flawless, without any irregularities, and must meet the standard requirements of the compatible application. The quality of data required by a user differs from that required by the organization. Strict rules are used to avoid improper data processing, and validation is applied at the level where data are protected with PIN numbers or passwords.
Frequent data errors are considered a common phenomenon; however, the model developed for data quality in a data warehouse adapts regularly to all changes. Hence, high-quality data can be used in operations, decision-making processes, and modeling.
In addition, quality information indicates which data model is needed by the data warehouse.
The probability of errors that can lead to a decline in data quality must be accounted for in records and protocol distribution across a network. The technical calculations and the requirement protocols proposed by the enterprise have to be fulfilled to achieve data quality. An assured mechanism for developing these protocols can benefit the enterprise by providing large-scale data quality management. The goal is to meet market standards rather than to adopt low-cost protocols that may lead to the failure of the suggested model. Many organizations and government agencies maintain huge database collections [13]. The importance of data quality becomes a big concern for achieving the results and experiments required by a client. If this is not taken seriously, several complications may arise from a failure in data quality, affecting the customer relationship model at any level of enterprise processing.
Effective risk management is needed for a system to learn from its deficiencies. Protocols must be designed to cope with the risk and deliver results of the required standard. Policy makers must decide on the risk strategies needed to reach the desired data quality standards. Further management of the risk mitigation protocols for data quality improvement, and the formulation of the desired policy, play a major role based on the data quality requirements. The agent for the risk mitigation approach is assigned after several levels of testing, since it will play a major role at the enterprise working level. Decisions are derived from the risk mitigation policy for data quality. The management of any enterprise should pay attention to the Lloyd's approach [14] to data quality modeling and risk mitigation standards.

4. Duplicate Elimination Test Bed
Duplicate elimination is one of the important concrete services in a data cleaning service composition. The main objective of the data cleaning service is to maintain data quality. It is a service-oriented method to remove duplicated data, which may be entered by users more than once. The general idea is a matching process that identifies duplicated data; one important aspect when searching for duplicates of the same record is the ambiguity of the data.
Several experiments were conducted to validate duplicate elimination, with several services used during the matching process. However, only a few of them give the desired results. The duplicate information is displayed and recorded in a table, together with an indication of its percentage. The effective services are generally the ones chosen to successfully improve data quality standards. The aim of this research is to compare duplicate elimination services and find out which ones perform better. The comparison is generally based on two parameters. The first parameter is the time taken to detect the errors in the data that alter the system and environment; additional time is required to improve the quality of data in the processing system of any functionality.
The second parameter is the memory consumption, which determines the effectiveness with respect to data quality. The services required for the experiments are available from the following service providers: WinPure Clean and Match (referred to as WinPure), DoubleTake3 Dedupe & Merge (referred to as DoubleTake), WizSame (referred to by the same name), and Dedupe Express (referred to as DQGlobal).
Before the services are compared, the experimental test bed needs to be developed on real local data from Saudi Arabia. For the first experiment, eight duplicates were manually selected from the data set taken from the data warehouse and then examined by the duplicate detection services, as shown in Figure 2. It is important to note that highly private data are kept protected. Owing to page limitations, only the duplicate data detected by the DoubleTake service are presented in this paper, illustrated in Figure 3, which shows seven duplicate records. In this figure, the DoubleTake service provides some information that may differ from other services, such as the number of suppressed records and the rate of records per hour. Hence, this research standardizes the service output as the number of duplicate records to ease the comparison. Quality of service is included in the performance analysis as well.
Figure 2. First Experiment Dataset

Figure 3. Duplicate Data Detected by DoubleTake

Owing to page limitations, only one experimental test bed is presented in this paper. A summary of all five experiments is presented in Table 1.

Table 1. Summary of Five Experiments (percentage of duplicates detected)

                WinPure   DoubleTake   WizSame   DQGlobal
Experiment 1      50%        88%         75%       88%
Experiment 2      25%        75%         67%       33%
Experiment 3      50%        90%         90%       80%
Experiment 4      88%        50%         75%       63%
Experiment 5      17%       100%         92%       83%
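The services above are closed-source, so their matchers cannot be shown here, but the core of any duplicate detection step is pairwise similarity scoring over normalized fields, with the output standardized as a number of duplicate records, as in the experiments above. The following is a minimal sketch under that assumption, using Python's standard difflib; the threshold and sample names are illustrative, not the services' proprietary algorithms.

from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two case-normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def count_duplicates(records, threshold=0.85):
    """Count record pairs whose similarity meets or exceeds the threshold."""
    pairs = [
        (a, b) for a, b in combinations(records, 2)
        if similarity(a, b) >= threshold
    ]
    return len(pairs), pairs

names = ["Abdullah Alqahtani", "Abdullah Al-Qahtani",  # transliteration variant
         "Sara Omar", "Sarah Omar",                    # spelling variant
         "Khalid Hassan"]
count, found = count_duplicates(names)
print(f"{count} duplicate pair(s) detected")
for a, b in found:
    print(f"  {a!r} ~ {b!r} ({similarity(a, b):.2f})")

The threshold trades missed duplicates against false matches; the per-type comparisons in the next section probe exactly where such scoring breaks down.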
5. Comparisons Between Services in Duplicate Detection
This comparison uses a finer granularity based on the previous experimental test bed. Each service processes the same set of records, so the detection capability of all services can be judged fairly. All records used in the comparison are grouped by duplicate type, and the comparisons are made per predefined duplication type. There are seven duplication types, as follows (a normalization sketch covering several of these cases appears at the end of this section):
1. Comparison based on different spelling and pronunciation. The duplicated records examined in this comparison, and the examination results from running the four services, are illustrated in Figure 4. Because several languages coexist in Saudi Arabia, names inconsistently transliterated from another language are not uncommon. It is interesting to note that the WinPure service is unable to detect any records with different spelling and pronunciation.

Figure 4. Different Spelling and Pronunciation Duplicated Records and Examination Results

2. Comparison based on misspellings. The duplicated records examined in this comparison, and the examination results from running the four services, are illustrated in Figure 5. This comparison yields a lower percentage of detected records than the previous one, from which it can be inferred that misspelling cases have more variants in the records. The remaining comparisons are not shown as figures owing to page limitations.
3. Comparison based on name abbreviation. The duplicated records were examined in this comparison along with the examination results from running the four services. Records with name abbreviations are handled more accurately by the services, except for the WinPure service.
4. Comparison based on honorific prefixes. The duplicated records were examined in this comparison along with the examination results from running the four services. It is interesting to note that the DQGlobal service underperforms in this experiment.
5. Comparison based on common nicknames. The duplicated records were examined in this comparison along with the examination results from running the four services. In this comparison, the DoubleTake service is unable to detect any duplicates.
Figure 5. Misspellings Duplicated Records and Examination Results

6. Comparison based on split names. The duplicated records were examined in this comparison along with the examination results from running the four services. In this comparison, the WinPure service underperforms again.
7. Comparison based on exact match. The duplicated records were examined in this comparison along with the examination results from running the four services. The exact-match feature is important for cases that need specific handling, such as investigating internal mistakes in data warehousing.
A complete comparison summary for the seven duplicate types is illustrated in Figure 6. It can be inferred that the WizSame service has the highest reliability across all comparison criteria among the services, although it does not achieve peak performance in terms of the number of detected records.

Figure 6. Seven Type Duplicates Examination Results
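Several of the duplicate types above reduce to a shared canonical form before matching: honorific prefixes can be stripped, common nicknames expanded, and split names re-joined. The sketch below illustrates the idea; the honorific and nickname tables are small illustrative assumptions, not the rule sets used by the commercial services.

HONORIFICS = {"mr", "mrs", "dr", "prof", "eng", "sheikh"}   # illustrative set
NICKNAMES = {"mo": "mohammed", "abdul": "abdullah"}          # illustrative map

def canonical(name: str) -> str:
    """Normalize a name so several duplicate types collapse to one form."""
    tokens = name.lower().replace(",", " ").split()          # re-joins split names
    tokens = [t.rstrip(".") for t in tokens]
    tokens = [t for t in tokens if t not in HONORIFICS]      # drops honorific prefixes
    tokens = [NICKNAMES.get(t, t) for t in tokens]           # expands nicknames
    return " ".join(tokens)

# All four variants collapse to the same canonical key:
variants = ["Dr. Mohammed Saleh", "Mo Saleh", "mohammed, saleh", "MOHAMMED SALEH"]
assert len({canonical(v) for v in variants}) == 1
print(canonical(variants[0]))  # -> "mohammed saleh"

Records whose canonical keys collide, or nearly collide, can then be passed to a pairwise matcher such as the one sketched in the previous section.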
6. Quality of Data Cleaning Services
In addition to evaluating the quality of the data, the data cleaning service itself needs to be assessed based on its quality of service. Several quality-of-service metrics are taken into account in this paper, such as performance, capability to process large numbers of records, data heterogeneity, and price.
The performance of each service is broken down into two metrics: processing time and memory. Time is an important factor in most algorithm comparisons and is calculated over the processed records; 1000 records are considered sufficient for this comparison, so the time each service spends processing 1000 records is measured. The result of this record manipulation depends on the system environment; therefore, the comparison between the services is conducted in the same environment, kept consistent across all four services. Modifying the environment may affect the overall performance of the services.
The result of the processing-time evaluation is presented in Figure 7(a). It shows that WizSame and WinPure used less CPU time to process 1000 records, while DQGlobal took the most time. Figure 7(b) presents the comparison of memory utilization among the examined services. In this evaluation, both DoubleTake and WizSame performed optimally, while DQGlobal and WinPure consumed more memory when processing 1000 records.

Figure 7. Time Spent and Memory Utilization on the Processing of 1000 Records

The capability of each service to process records is an important metric for a data cleaning service. This comparison describes how many records each service can process for duplicate removal: WinPure was able to process up to 250,000 records; DoubleTake up to twenty million records; WizSame up to one million records; and DQGlobal up to one million records.
Each service also differs in the data formats it can process, which is considered here as a data heterogeneity metric. WinPure can process plain text files, MS Excel, MS Access, dBase, and MS SQL Server. DoubleTake can process MS Excel, MS Access, dBase, plain text files, ODBC, FoxPro, MS SQL Server, DB2, and Oracle. WizSame can process dBase, MS SQL, MS Access, Oracle, plain text files, ODBC, and OLE DB. DQGlobal can process MS Access, Paradox, MS Excel, DBF, Lotus, FoxPro, and plain text files. This comparison shows that DoubleTake handles more data formats than the other services, with WizSame second in the number of formats supported for duplicate removal.
Price is another quality-of-service metric considered in this paper; a service with a high price is not feasible for particular users. At the time of writing, the license prices span a wide range: WinPure costs $949.00, DoubleTake costs $5,900.00, WizSame costs $2,495.00, and DQGlobal costs $3,850.00. This comparison implies that DoubleTake is the most expensive of the services. However, since all these applications are wrapped as services, the cost is minimized by paying only for the executed services.
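Both performance metrics used in this comparison, processing time and memory, can be reproduced for any wrapped service with standard instrumentation. A minimal sketch using Python's time and tracemalloc modules follows; dedupe_run is a hypothetical stand-in for whichever wrapped cleaning service is being measured, not an API of the tools above.

import time
import tracemalloc

def measure(dedupe_run, records):
    """Measure wall-clock time and peak memory of one de-duplication run.

    dedupe_run is a hypothetical stand-in for the wrapped cleaning service.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = dedupe_run(records)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak

# Example with a trivial stand-in service and 1000 synthetic records:
records = [f"customer-{i % 900}" for i in range(1000)]  # ~100 duplicates
dupes, secs, peak_bytes = measure(lambda rs: len(rs) - len(set(rs)), records)
print(f"{dupes} duplicates, {secs:.4f} s, peak {peak_bytes / 1024:.1f} KiB")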
7. Conclusion
In data warehousing, the data cleaning service plays an important role in many domains. If the data are not clean and full of anomalies, the resulting data cause many issues, such as data integration and query errors. To get the best form of the extracted data, it is important to clean the data as an initial step, and data redundancy should be removed to maintain data integrity. This research provides an overview of the quality of data used in data warehousing and analyzes, practices, and experiments with the concept of data quality on real local data. Hence, this research makes two contributions. First, it surveyed data quality in the data warehouse environment along with data integrity analysis. Second, it compared the services that can remove data duplication through real experiments. The experiments were conducted with performance measures so that it could be determined which service is more effective for the removal of data duplication. The comparison serves as an aid for users in selecting the best services for their needs, especially within the scope of Saudi Arabia.

Acknowledgement
This work was supported by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, Saudi Arabia. The author therefore gratefully acknowledges the DSR technical and financial support. The author also thanks Mshari AlTuraifi for conducting the experiments in Saudi Arabia.

References
[1] B. Moustaid and M. Fakir, "Implementation of business intelligence for sales management," IAES International Journal of Artificial Intelligence (IJ-AI), vol. 5, no. 1, pp. 22-34, 2016.
[2] I. Khliad, "Data warehouse design and implementation based on quality requirements," International Journal of Advances in Engineering and Technology, pp. 642-651, 2014.
[3] L. Robert, "Data quality in healthcare data warehouse environments," 34th Hawaii International Conference on System Sciences, pp. 1-9, 2001.
[4] G. Shankaranarayanan, "Towards implementing total data quality management in a data warehouse," Journal of Information Technology Management, vol. 16, no. 1, pp. 21-30, 2005.
[5] A. Amine, R. A. Daoud, and B. Bouikhalene, "Efficiency comparison and evaluation between two ETL extraction tools," Indonesian Journal of Electrical Engineering and Computer Science, vol. 3, no. 1, pp. 174-181, 2016.
[6] R. Archana, R. S. Hegadi, and T. Manjunath, "A big data security using data masking methods," Indonesian Journal of Electrical Engineering and Computer Science, vol. 7, no. 2, pp. 449-456, 2017.
[7] H. Marcus, K. Mathias, and K. Bernard, "How to measure data quality: a metric approach," Twenty-Eighth International Conference on Information Systems, Montreal, pp. 1-15, 2007.
[8] H. Frederik, Z. Dennis, and L. Anders, "The cost of poor quality," Journal of Industrial Engineering and Management, pp. 163-193, 2011.
[9] K. Rahul, "Data quality in data warehouse: problems and solution," IOSR Journal of Computer Engineering (IOSR-JCE), ISSN 2278-0661, vol. 16, no. 1, pp. 18-24, 2014.
[10] B. Inmon, "Data warehousing 2.0: architecture for the next generation of data warehousing," Tech. Rep., 2010.
[11] R. Kimball, M. Ross, J. Mundy, and W. Thornthwaite, The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence, Remastered Collection. John Wiley & Sons, 2015.
[12] United States Department of the Interior CIO, "Data quality management guide," Tech. Rep., 2008.
[13] Q. Sun and Q. Xu, "Research on collaborative mechanism of data warehouse in sharing platform," Indonesian Journal of Electrical Engineering and Computer Science, vol. 12, no. 2, pp. 1100-1108, 2014.
[14] Lloyd's, "Solvency II, Section 4: Statistical quality standards," Tech. Rep., 2010.