TELK OMNIKA , V ol. 15, No . 3, September 2017, pp . 1040 1047 ISSN: 1693-6930, accredited A b y DIKTI, Decree No: 58/DIKTI/K ep/2013 DOI: 10.12928/telk omnika.v15.i3.6513 1040 Application Pr ofiling and Mapping on NoC-based MPSoC Em ulation Platf orm on Reconfigurab le Logic Jia W ei T ang * , Y uan W en Hau ** , Nasir Shaikh-Husin * , and Muhammad Nadzir Mar sono * * F aculty of Electr ical Engineer ing, Univ ersiti T eknologi Mala ysia, Johor , Mala ysia ** IJN-UTM Cardio v ascular Engineer ing Centre , F aculty of Biosciences and Medical Engineer ing, Un iv ersiti T eknologi Mala ysia, Johor , Mala ysia * Corresponding author , e-mail: jwtang2@liv e .utm.m y Abstract In netw or k-on-chip (NoC) based m ulti-processor system-on-chip (MPSoC) de v elopment, applica- tion profiling is one of the most cr ucial step dur ing design time to search and e xplore optimal mapping. Con v entional mapping e xplor ation methodologies analyse application-specific g r aphs b y estimating its r un- time beha viour using analytical or sim ulation models . Ho w e v er , t he f or mer does not replicate the actual application r un-time perf or mance while the latter requires significant amount of time f or e xplor ation. T o map applications on a specific MPSoC platf or m, the application beha viour on cycle-accur ate em ulated platf or m should be considered f or obtaining better mappin g quality . This paper proposes an application mapping methodology that utiliz es a MPSoC prototyped in Field-Prog r ammab le Gate Arr a y (FPGA). Applications are implemented on homogeneous MPSoC cores and their costs are analysed and profiled on the platf or m in ter m of e x ecution time , intr a-core comm unication and inter-core comm unication dela ys . These metr ics are utiliz ed in analytical e v aluation of the application mapping. The prop osed analytical-based mapping is demonstr ated against the e xhaustiv e br ute f orce method. Results sho w that the proposed method is ab le to produce quality mappings compared to the g round tr uth solutions b ut in shor ter e v alu ation time . K e yw or ds: Embedded system, application mapping, m ulti-processor system-on-chips (MPSoCs), through- put constr aint. Cop yright c 2017 Univer sitas Ahmad Dahlan. All rights reser ved. 1. Intr oduction Multi-Processor System-on-Chips (MPSoCs) is commonly used in moder n embedded systems such as smar tphones and tab lets as w ell as in other div erse fields such as m ultimedia and netw or king applications to perf or m time-constr ained tasks . Mapping of time cr itical applica- tions is a cr ucial step in design time to pro vide the highest possib le system perf or mance . The prob lem of mapping applications onto MPSoC is one of the biggest challenge embedded system prog r ammers f ace no w ada ys [1–3]. An application is commonly specified as a data-flo w g r aph consisting of tasks that comm unicate betw een them, on the other hand, an MPSoC platf or m is descr ibed b y the n umber of cores , the type of cores (in heterogeneous platf or m), and their com- m unication topology or infr astr ucture . The design space e xplor ation process tak es the application g r aph as input and assigns all tasks to processing cores on the desired MPSoC platf or m to find the best perf or mance mapping. Methods f or e v aluating each mapping perf or mance in design space can be categor iz ed into three types [4]: estimation from analytical model, measurement from sim ulation model, and measurement from implementation prototype . Analytical estimation is the f astest e v alua- tion method b ut with limited accur acy as it does not sufficiently replicate the system beha viour . Sim ulation-based measurement e xists betw een tw o e xtremes , i.e . slo w with high accur acy and f ast b ut less accur ate . On the contr ar y , prototype-based e v aluation pro vides the highest accur acy b ut requires long de v elopment time . Receiv ed F ebr uar y 27, 2017; Re vised J une 28, 2017; Accepted J uly 21, 2017 Evaluation Warning : The document was created with Spire.PDF for Python.
TELK OMNIKA ISSN: 1693-6930 1041 In most cases , a cer tain application is desired to be e x ecuted on a specific MPSoC plat- f or m. As the task e x ecution and comm unication dela ys v ar y f or diff erent MPSoC platf or ms , map- ping that solely depends on the costs f or e x ecuting application specification beha viour ma y not be sufficient to produce high quality mapping when consider ing the MPSoC ph ysical interconnect f abr ic. The beha viour on the platf or m m ust be considered and profiled thoroughly to aid applica- tion mapping e xplor ation process . Hence , an application profiled in a prototype-based platf or m can impro v e profiling p recision o v er the analytical-based mapping while ha ving f ast e v aluation speed. This paper presents a application mapping methodology that utiliz es field-prog r ammab le gate arr a y (FPGA) em ulation as the e v aluation platf or m to optimiz e the application mapping f or the highest throughput. Applications are analysed and profiled on the em ulation platf or m to estimate their implemented costs in ter m of e x ecution, intr a-core comm unication and inter-core comm u- nication dela ys . The profiled application costs are used in impro ving the accur acy of analytical- based mapping. The proposed methodology is demonstr ated b y mapping se v er al test applications on a 3 3 meshed netw or k-on-chip (NoC) based MPSoC prototyped in FPGA. The rest of this paper is organiz ed as f ollo ws . Section 2. presents the related w or ks in application mapping. Section 3. discusses the proposed throughput a w are application mapping methodology including the e x ecution tr ace analysis , platf or m par ameters e xtr action and mapping e xplor ation steps . Section 4. presents the result of the proposed mapping approach on diff erent test cases and Section 5. concludes this paper . 2. Related w orks T ask mapping technique can be classified into design time map ping, r un-time mapping and h ybr id mapping. In design time mapping [5–9], the mapping process is perf or med off-line to f acilitate better decision in using system resources . Design time mapping is suitab le f or static w or kload scenar ios ha ving a predefined task g r aph and st atic platf or m (i.e ., both applications and platf or m do not change in time). Run-time mapping [10–13] on the contr ar y , is ab le to handle dynamism of r un-time w or kload such as inser tion or remo ving ne w applications into or from the system, as w ell as the dynamic platf or m beha viour due to r un-time f aults . Ho w e v er , r un-time task mapping is perf or med using limited on-line processing resource while making a quic k decision without sufficient mapping e xplor ation, thus ma y produce lo w-quality mapping. In addition, h ybr id mapping has also been proposed in ma n y diff erent w or ks [14–19] b y combining design time and r un-time techniques . Hybr id mapping e xplores best mapping of tasks dur ing design time f or diff erent scenar ios (constr aints conditions) while chooses the most suitab le mapping at r un-time . Ho w e v er , these con v entional approaches in application mapping assume a specific w ell-defined prog r am beha viour , often specifie d in task g r aph and does not consider platf or m dependent par ameters , such as the changes in comm unication dela y and comm unication v olume dur ing dynamic mapping e xplor ation. Most mapping methodologie s in liter atures e .g. [5–1 3] assumed a specific prog r am e x e- cution while the comm unication cost w as considered constant across platf or m. As pr actical be- ha viour of applications ma y diff er compared to its specified task g r aph, there are diff erent methods in liter atures e xtr acting application r un-time metr ics pr ior to the mapping e xplor ation. T r ace anal- ysis technique w as used in [20] to find the e x ecution beha vior f or each mapping and e xamine their diff erences . Singh et al. [21] used the application beha viour from tr ace analysis to optimally map m ultiple ap plications on the same MPSoC . T o our best kno wledge , there is no w or k that profiles and maps application based on e x ecution, intr a-core and inter-core comm unicat ion on an em ulation platf or m. 3. Pr oposed Application Mapping Methodology The flo w of the proposed application mapping ap proach is depicted in Figure 1. Unlik e con v entional mapping methodology which solely depends on the costs specified in application task g r aph, the proposed approach considers the applica tion pr actical beha viour to better estimate Application Profiling and Mapping on NoC-based MPSoC ... (Jia W ei T ang) Evaluation Warning : The document was created with Spire.PDF for Python.
1042 ISSN: 1693-6930 the mapping perf or mance . The proposed methodology f or application mapping are as f ollo ws: 1. Application Pr ofiling The predefined application are e x ecuted on the targeted platf or m to measure its pr actical implementation costs . Application are profiled in ter m of e x ecution and comm unication costs in time dela y (cycles). 2. Application Mapping Application mapping design space e xplor ation is perf or med to op- timiz e o v er all throughput. In order to e v aluate each mapping perf or mance , e x ecution tr aces are analysed. An analytical model is f or m ulated based on the profiled application costs are used to estimate the mapping perf or mance dur ing the design space e xplor ation. A p p lic at ion   P r o f ilin g App lic a tio n t 0 t 1 t 2 t 3 t 4 M P S o C   in  F P GA Applic a tion   S pe c if ic a tion P la tf or S p e c if ic a tion M a p p in D e s ign   S p ac e   E xp lor a t io n A pplic a tion  P r o f ile s   E xe c ution  T r a c e s   A na lys is e xe c ution,   x intr a - c or e   c o mm. ,   c a inte r - c or e   c omm. ,   c e e mula te C ur r e nt  M a ppin g C o r e   1 : C o r e   2 : C o r e   3 : t i m e e xe c.   d e l a y x i c ij co m m.   d e l a y A na lytic   E va lua tion B e s M a p ping? ye s Ne xt M a pp ing no Op tima M a ppin g Figure 1. Proposed application mapping using application profiled in FPGA-em ulated MPSoC . 3.1. Application and Platf orm Model T ask g r aph of an application can be char acter iz ed b y a directed g r aph (DG), G = ( T ; E ) , where T = f t 1 ; t 2 ; :::; t n g is the set of tasks in the application and E = f e 1 ; e 2 ; :::; e n g is the set of directed edges representing dependencies of tasks as sho wn in Figure 2. Each task, t i contains an x i denoting the w orst-case e x ecution time f or the task and is assumed to remain fix ed dur ing the e x ecution. Each edge , e k includes v ij which represents the comm unication v olume betw een tasks t i and t j . As designing the entire NoC-based MPSoC platf or m is out of scope of this w or k, the emplo y ed platf or m is gener ated using ProNoC tool [22]. The o v er vie w of the NoC-based MPSoC platf or m used in this w or k is depicted in Figure 3. Figure 3a illustr ates the architecture f or each core (tile) in the MPSoC platf or m. Distr ib uted memor y architecture is implemented, where each core consists of a processor , a local memor y , and a netw or k interf ace (NI). Mapped tasks in each core are stored and e x ecuted from the local me mor y , while cores inter-comm unicate with other cores via netw or k interf aces . All cores are homogeneous and connected to each other through the NoC interconnect f abr ic. TELK OMNIKA V ol. 15, No . 3, September 2017 : 1040 1047 Evaluation Warning : The document was created with Spire.PDF for Python.
TELK OMNIKA ISSN: 1693-6930 1043 t 0 t 1 t 2 t 3 t 4 t 5 e 1 e 0 e 2 e 3 e 4 e 5 e 6 Figure 2. Example of an application model containing 6 tasks . Pr o ce s s o r M e m o r y N et w o r k I nt er f ac e I nt e r c o nnec t C o r e  0 .. .. . Pr oc es s o r M em o r y N e t w o r k I n t er f a ce C o r e  k (a) Each core consists of a processor , local memor y , and netw or k interf ace in MPSoC platf or m model C o re   0 C o re   1 C o re   2 Cor e   3 Cor e   4 C o re   5 Cor e   6 C o re   7 C o re   8 R R R R R R R R R (b) 3 3 mesh based NoC intercon- nect. Figure 3. MPSoC platf or m model f or profiling, mapping and em ulating application in FPGA. Mapping of an application is the process of assigning its tasks into the a v aila b le cores on MPSoC platf or m. Cooper ativ e (non pre-emptiv e) m ulti-tasking is emplo y ed in each core , where one core can e x ecute more than one task. Comm unicating tasks use message passing interf ace (MPI) protocol to send and receiv e data and control pac k ets . Hence , comm unication betw een tasks that are mapped to the same core also require message cop ying dela y that is considered as intr a-core comm unication. On the other hand, inter-core comm unication is defined as the comm unication betw een tasks that are mapped in diff erent cores . 3.2. Application Pr ofiling Application profiling is the process to e xtr act pr actical task e x ecution and comm unication dela ys of an application when em ulated in FPGA. The aim of perf or ming application profiling is to measure the eff ect of task e x ecution and comm unication on the o v er all system throughput. Ex ecution time , x i of each task t i is defined as the time required to complete a task after it receiv es complete data (or tok en) dur ing one e x ecution per iod. T ask e x ecution time of an application can be easily obtained b y captur ing the time diff erence betw een the star ting and ending of a task’ s e x ecution. As the e x ecution time on an y homogeneous core is equal regardless of the mapping, the maxim um e x ecution dela ys of all tasks can be recorded b y mapping all tasks into one core . On the contr ar y , comm unication time , c ij is divided into tw o types represented b y a tw o- tuple < ca; c e > , with ca and ce denote the intr a-core and inter-core dela ys , respectiv ely . As application as specified in a task g r aph contains only comm unication v olume in each connecting task, profiling each connecting core has to be done to measure the pr actical comm unication time . In order to profile each intr a-core comm unication dela y , the tw o connecting tasks are isolated and e x ecuted in the same core . The intr a-core comm unication dela y is measured as the throughput diff erence betw een e x ecuting both tasks independently without comm unication and with com- m unication. On the contr ar y , both tasks are mapped in neighbour ing cores to profile inter-core comm unication dela y . After profiling, the application g r aph is re-constr ucted with profiled spec- ification as illustr ated in Figure 4. Although the actual dela ys of the applications ma y diff er f or diff erent mappings (ma ybe due to the m ulti-task o v erhead or comm unication infr astr ucture de- Application Profiling and Mapping on NoC-based MPSoC ... (Jia W ei T ang) Evaluation Warning : The document was created with Spire.PDF for Python.
1044 ISSN: 1693-6930 la y such as contention), the profiled costs can be used as the basis to estimate the application r un-time beha viour . x 0 x 1 c 0 1 = < c a,  ce > 0 1 t 0 t 1 e 01 profil ing Figure 4. Application g r aph re-constr uction after profiling e x ecution and comm unication time on em ulated FPGA. 3.3. Mapping Design Space Exploration After application profiling, analytical-based mapping utiliz es the profiled costs to optimiz e the applicat ion throughput. Ex ecution tr aces of each mapping is analysed to find the estimation of application r un-time beha viour . Each mapping is e v aluated using analytical e xpression to cal- culate the throughput using the profiled costs . Mapping e xplor ation process is iter ativ e until the best mapping is obtained. 3.3.1. Ex ecution T races Anal ysis Ex ecution tr aces of a mapped applications are analysed to obtain the application through- put. T ask par allelism and tempor al par allelism are considered in the e x ecution tr aces analysis . The f or mer is the ability to allo w par allel e x ecution of diff erent tasks while the latter allo ws par allel e x ecution of diff erent tempor al data in diff erent cores at the same time . F or instance , Figure 5 depicts the e x ecution tr aces of an application ha ving 6 tasks , as sho wn in Figure 2, mapped onto a 4-core system. All 4 cores are comm unicating with each other while e x ecuting their assigned tasks in par allel as soon as their data are a v ailab le . T ask par allelism occurs f or task t 1 and t 2 , taking x 1 and x 2 dela ys to be e x ecuted in core 1 and core 2, respectiv ely . Data par allelism is demonstr ated through the e x ecution of diff erent tempor al data (e .g data 1 and data 2) at diff erent time instances . As core 4 requires the longest time to compute one result, the per iod (in v erse of throughput) is deter mined b y the combination of e x ecution dela y and comm unication dela y . C o r e   1 : C o r e   2 : C o r e   3 : C o r e   4 : x 0 c 0 1 x 1 c 0 2 x 2 c 0 2 c 1 3 c 2 3 c 1 3 c 2 3 c 2 4 x 3 c 3 5 c 2 4 x 4 c 3 5 x 5 x 0 c 0 1 x 1 c 0 2 x 2 c 0 2 c 1 3 c 2 3 c 1 3 c 2 3 c 2 4 x 3 c 3 5 c 2 4 x 4 c 3 5 x 5 x 0 c 0 1 x 1 c 0 2 x 2 c 0 2 c 1 3 c 2 3 c 1 3 c 2 3 c 2 4 x 3 c 3 5 c 2 4 x 4 c 3 5 e 5 e xe c. d e la y  fo r  ta s k i d a ta  1 d a ta  2 d a ta  3 x i co m m . d e l a y fo r  ta sk i to  ta sk  j c i j t i m e p e r i o d p e r i o d p e ri o d Figure 5. Ex ecution tr aces of a 6-tasks applications r unning on 4 cores system. Application throughput is reciprocal of one per iod (i.e , the a v er age time needed to com- plete one iter ation of the application). Each core utiliz es the time per iod to e x ecute its assigned task as w ell as to comm unicate with other cores . Since the tasks in each application can be e x- ecuted in pipeline and task-par allel in diff erent cores , the o v er all throughput is deter mined b y the slo w est core (i.e , the core that requires the longest time to complete e x ecutions and comm unica- tions of all its mapped tasks). TELK OMNIKA V ol. 15, No . 3, September 2017 : 1040 1047 Evaluation Warning : The document was created with Spire.PDF for Python.
TELK OMNIKA ISSN: 1693-6930 1045 3.3.2. Anal ytical Ev aluation Giv en a g r aph representation of an application with N t tasks , mapping e xplor ation in this w or k is the process of assigning all tasks into a v ailab le N c cores to maximiz e the system throughput. Based on the e x ecution tr ace ana lys i s , throughput of a mapped application is the reciprocal of per iod. The per iod is defined as the maxim um total required time (among all cores) f or e x ecution and comm unication, and can be mathematically f or m ulated in Equation 1. The total required time in each core is calculated b y the summation of the profiled e x ecution dela ys of all individual assigned tasks as w ell as all the profiled intr a-core and inter-core comm unication dela ys associated to it. 1 thr oug hput = per iod = max 1 k N C ( X k + C k ) = max 1 k N C ( N T X i =0 x i :mt ik + N T X i =0 N T X j =0 ca ij :ma ij k + N T X i =0 N T X j =0 ce ij :me ij k ) (1) where i and j are task identifiers while k is the core identifier . X k denotes the total e x ecution dela ys of mapped tasks in core k , which is the summation of the profiled e x ecution dela y x i if task t i is mapped to core k ( mt ik = 1 ). C k denotes the total comm unication dela ys associated to core k , which is the summation of all profiled intr a-core comm unication dela y ca ij if intr a-core comm u- nication of task t i and t j e xist in core k ( ma ij k = 1 ), and all profiled inter-core comm unication dela y ce ij if inter-core comm unication of task t i and t j is associated to core k ( me ij k = 1 ). 4. Results and Discussion The proposed analytical-based mapping is demonstr ated b y emplo ying the e xhaustiv e br ute f orce method to map se v er al r andom application g r aphs and find the mapping that produces the best application throughput. The best mapping obtained b y the proposed analytical-based e xhaustiv e (AE) are compared with em ulation-based e xhaustiv e (EE) solutions (i.e ., g round tr uth) which is obtained b y e x ecuting all mappings on a FPGA em ulation system. Figure 6a sho ws the throughput f or all possib le mapping f or the proposed analytical- based applicat ion mapping and em ulation-based mapping. The proposed analytical-based map- ping produces results with similar trend with the g round tr uth solution. Figure 6b depicts the compar ison betw een em ulation-based throughput and analytical-based solution. The proposed analytical-based mapping throughput is almost similar with em ulation-based throughput as the points scatter nearb y the str aight line of g r adient one . T ab le 1 illustr ates the compar ison betw een proposed analytical-based and em ulation- based e xhaustiv e (g round tr uth) f or application of diff erent r andom g r aphs . The best throughput (BT) is obtained b y em ulating the best mapping produced b y both methods on FPGA. The e v al- uation time (ET) is the time tak en to perf or m the e xhaustiv e e v aluation on all possib le mappings while profiling time is the time tak en to profile the tasks based on the proposed method. Results illustr ate that the e v aluation time f or both e xhaustiv e mapping increases e xponentially with the n umber of possib le mappings . The proposed analytical-based method is ab le to e v aluates all mappings in m uch shor ter time despite requir ing e xtr a profiling time . Ho w e v er , the profiling time does not increase with n umber of mappings as it corresponds to the n umber of profiles required, that is the total n umber of tasks and edges in application g r aphs . 5. Conc lusion Efficient mapping algor ithm produces high quality solution in f ast e v aluation time . This paper presents a application mapping methodology that utiliz es FPGA em ula tion as an e v aluation platf or m to optimiz e the application throughput. Applications are analysed and profiled in the Application Profiling and Mapping on NoC-based MPSoC ... (Jia W ei T ang) Evaluation Warning : The document was created with Spire.PDF for Python.
1046 ISSN: 1693-6930 0 50 100 150 200 250 300 100000 200000 300000 400000 500000 mappings (sorted with EE) 1/throughput (cycles) AE EE (a) Throughput of all possib le mapping f or the proposed analytical-based e xhaustiv e (AE) and em ulation-based e x- haustiv e (EE) mapping (sor ted based on EE). 100000 200000 300000 400000 500000 150000 200000 250000 300000 350000 400000 450000 500000 EE 1/throughput (cycles) AE 1/throughput (cycles) (b) Em ulation-based v ersus analytical-based e xhaustiv e throughput. Figure 6. Throughput compar ison betw een em ulation-based e xhaustiv e (EE) and proposed analytical-based e xhaustiv e (AE) mapping in design space e xplor ation of a 7-task application. T ab le 1. Compar ison in throughput and e v aluation time betw een proposed analytical-based (AE) and em ulation-based e xhaustiv e (EE) f or application of diff erent r andom g r aphs Applications No . of EE Proposed AE T asks BT ET BT PT ET r andom1(r7) 5 87 607.61 s 87 252.21 s 0.03 s r andom2(r8) 6 401 2912.33 s 401 296.10 s 0.15 s r andom4(r2) 7 106 15745.43 s 106 309.34 s 0.96 s r andom5(r4) 7 875 16143.52 s 875 276.82 s 0.91 s BT : Best Throughput ET : Ev aluation Time PT : Profiling Time em ulated en vironment to obtain their pr actical implemented costs in ter m of e x ecution, intr a-core comm unication and inter-core comm unication dela ys . Application mapping utiliz es the profiled costs find optimal mapping through e x ecution tr ace analysis and analytical e v aluation. Results sho w that the proposed analytical-based e xhaustiv e based on the pro filing can produce similar best throughput compared to the g round tr uth solutions b ut in shor ter e v aluation time . Ac kno wledg ement This w or k is suppor ted in par t b y the Collabor ativ e Research in Engineer ing, Science & T echnology (CREST) g r ant P17C114 (UTM v ote no . 4B176) and Univ ersiti T eknologi Mala ysia Matching g r ant (UTM v ote no . 00M75). Ref erences [1] R. Marculescu, U . Y . Og r as , L.-S . P eh, N. E. Jerger , and Y . Hosk ote , “Outstanding research prob lems in noc design: system, microarchitecture , and circuit perspectiv es , IEEE T r ansac- tions on Computer-Aided Design of Integ r ated Circuits and Systems , v ol. 28, no . 1, pp . 3–21, 2009. [2] A. K. Singh, M. Shafique , A. K umar , and J . Henk el, “Mapping on m ulti/man y-core systems: sur v e y of current and emerging trends , in Proceedings of the 50th Ann ual Design A utomation Conf erence . A CM, 2013, p . 1. [3] P . K. Sahu and S . Chattopadh y a y , “A sur v e y on application mapping str ategies f or netw or k- on-chip design, Jour nal of Systems Architecture , v ol. 59, no . 1, pp . 60–76, 2013. [4] R. Piscitelli and A. D . Pimentel, “Design space pr uning through h ybr id analysis in system- le v el design space e xplor ation, in Design, A uto mation & T est in Europe Conf erence & Exhi- TELK OMNIKA V ol. 15, No . 3, September 2017 : 1040 1047 Evaluation Warning : The document was created with Spire.PDF for Python.
TELK OMNIKA ISSN: 1693-6930 1047 bition (D A TE), 2012 . IEEE, 2012, pp . 781–786. [5] J . Hu and R. Marculescu, “Energy-and perf or mance-a w are mapping f or regular noc archi- tectures , IEEE T r ansactions on computer-aided design of integ r ated circuits and systems , v ol. 24, no . 4, pp . 551–562, 2005. [6] K. Sr iniv asan, K. S . Chatha, and G. K onje v od, “Linear-prog r amming-based techniques f or synthesis of netw or k-on-chip architectures , IEEE T r ansactions on V er y Large Scale Integ r a- tion (VLSI) Systems , v ol. 14, no . 4, pp . 407–420, 2006. [7] S . T osun, “Cluster-based application mapping method f or netw or k-on-chip , Adv ances in En- gineer ing Softw are , v ol. 42, no . 10, pp . 868–874, 2011. [8] P . K. Sahu, T . Shah, K. Manna, and S . Chattopadh y a y , “Application mapping onto mesh- based netw or k-on-chip using discrete par ticle s w ar m optimization, IEEE T r ansactions on V er y Large Scale Integ r ation (VLSI) Systems , v ol. 22, no . 2, pp . 300–312, 2014. [9] Y . Z. T ei, Y . W . Hau, N. Shaikh-Husin, and M. N. Marsono , “Netw or k par titioning domain kno wledge m ultiobjectiv e application mapping f or large-scale netw or k-on-chip , Applied Com- putational Intelligence and Soft Computing , v ol. 2014, p . 9, 2014. [10] E. L. de Souza Car v alho , N. L. V . Calazans , and F . G. Mor aes , “Dynamic task mapping f or mpsocs , IEEE Design & T est of Computers , v ol. 27, no . 5, pp . 26–35, 2010. [11] A. K. Singh, T . Sr ikanthan , A. K umar , and W . Jigang, “Comm unication-a w are heur istics f or r un-time task mapping on noc-based mpsoc platf or ms , Jour nal of Systems Architecture , v ol. 56, no . 7, pp . 242–255, 2010. [12] M. F attah, A.-M. Rahmani, T . C . Xu, A. Kandur i, P . Liljeberg, J . Plosila, and H. T enhunen, “Mix ed-cr iticality r un-time task mapping f or noc-based man y-core systems , in 2014 22nd Euromicro Inter national Conf erence on P ar allel, Distr ib uted and Netw or k-Based Processing (PDP) . IEEE, 2014, pp . 458–465. [13] T . D . Ngo , K. J . Mar tin, and J .-P . Diguet, “Mo v e based algor ithm f or r untime mapping of dataflo w actors on heterogeneous mpsocs , Jour nal of Signal Processing Systems , pp . 1– 18, 2015. [14] S . Stuijk, M. Geilen, and T . Basten, “A predictab le m ultiprocessor design flo w f or streaming applications with dynamic beha viour , in 2010 13th Euromicro Conf erence on Digital System Design: Architectures , Methods and T ools (DSD) . IEEE, 2010, pp . 548–555. [15] L. Schor , I. Baciv aro v , D . Rai, H. Y ang, S .-H. Kang, and L. Thiele , “Scenar io-based design flo w f or mapping streaming applications onto on-chip man y-core systems , in Proceedings of the 2012 inter national conf erence on Compilers , architectures and synthesis f or embedded systems . A CM, 2012, pp . 71–80. [16] C . Lee , S . Kim, and S . Ha, “Efficient r un-time resource management of a man ycore accel- er ator f or stream-based applications , in 2013 IEEE 11th symposium on Embedded systems f or real-time m ultimedia (ESTIMedia) . IEEE, 2013, pp . 51–60. [17] W . Quan and A. D . Pimentel, “A scenar io-based r un-time task mapping algor ithm f o r mpsocs , in Proceedings of the 50th Ann ual Design A utomation Conf erence . A CM, 2013, p . 131. [18] A. K. Singh, A. K umar , and T . Sr ikanthan, “Acceler ating throughput-a w are r untime mapping f or heterogeneous mpsocs , A CM T r ansactions on Design A utomation of Electronic Systems (T OD AES) , v ol. 18, no . 1, p . 9, 2013. [19] W . Quan and A. D . Pimentel, “A h ybr id task mapping algor ithm f or heterogeneous mpsocs , A CM T r ansactions on Embedded Computing Systems (TECS) , v ol. 14, no . 1, p . 14, 2015. [20] A. Goens and J . Castr illon, “Analysis of process tr aces f or mapping dynamic kpn applications to mpsocs , in IFIP Inter national Embedded Systems Symposium (IESS) , 2015. [21] A. K. Singh, M. Shafique , A. K umar , and J . Henk el, “Resource and throughput a w are e x- ecution tr ace analysis f or efficient r un-time mapping on mpsocs , IEEE T r ansactions on Computer-Aided Design of Integ r ated Circuits and Systems , v ol. 35, no . 1, pp . 72–85, 2016. [22] ProNoC. (2016) Prototype NoC based MPSoC . Opencore . [Online]. A v ailab le: http://opencores .org/project,an-fpga-implementation -of-lo w-latency- noc-based-mpsoc Application Profiling and Mapping on NoC-based MPSoC ... (Jia W ei T ang) Evaluation Warning : The document was created with Spire.PDF for Python.