International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 5, October 2018, pp. 3021-3037
ISSN: 2088-8708

Network Function Modeling and Performance Estimation

Mario Baldi and Amedeo Sapio
Department of Control and Computer Engineering, Politecnico di Torino, Italy

Article history:
Received July 20, 2017
Revised March 13, 2018
Accepted April 7, 2018

Keywords:
Network Function Modeling
Network Functions Virtualization
Performance Estimation

ABSTRACT

This work introduces a methodology for the modeling of network functions that focuses on the identification of recurring execution patterns as basic building blocks and aims at providing a platform-independent representation. By mapping each modeling building block onto specific hardware, the performance of a network function can be estimated in terms of the maximum throughput that it can achieve on that execution platform. Once the basic modeling building blocks have been mapped, the estimate can be computed automatically for any modeled network function. Experimental results on several sample network functions show that, although our approach cannot be very accurate without taking traffic characteristics into consideration, it is very valuable for those applications where even loose estimates are key. One such example is orchestration in network functions virtualization (NFV) platforms, as well as in general virtualization platforms where virtual machine placement is also based on the performance of the network services offered to them.
Being able to automatically estimate the performance of a virtualized network function (VNF) on different execution hardware enables optimal placement of the VNFs themselves, as well as of the virtual hosts they serve, while efficiently utilizing the available resources.

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Amedeo Sapio
Department of Control and Computer Engineering
Politecnico di Torino, Italy
amedeo.sapio@polito.it

1. INTRODUCTION

For a few years now, software network appliances have been increasingly deployed. Initially, their appeal stemmed from their lower cost, shorter time-to-market, and ease of upgrade when compared to purposely designed hardware devices. These features are particularly advantageous in the case of appliances, a.k.a. middleboxes, operating on relatively recent, higher-layer protocols that are usually more complex and possibly still evolving. More recently, with the overwhelming success and diffusion of cloud computing and virtualization, software appliances became a natural means to ensure that network functionalities have the same flexibility and mobility as the virtual machines (VMs) they offer services to. In this context, implementing in software even less complex, more stable network functionalities is valuable. This trend led to embracing Software Defined Networking (SDN) and Network Functions Virtualization (NFV): the former as a hybrid hardware/software approach to ensure high performance for lower-layer packet forwarding while retaining a high degree of flexibility and programmability; the latter as a virtualization solution targeting the execution of software network functions in isolated VMs sharing a pool of hosts, rather than on dedicated hardware (i.e., appliances).
Such a solution enables virtual network appliances (i.e., VMs executing network functions) to be provisioned, allocated a different amount of resources, and possibly moved across data centers in little time, which is key to ensuring that the network can keep up with the flexibility in the provisioning and deployment of virtual hosts in today's virtualized data centers. Additional flexibility is offered when coupling NFV with SDN, as network traffic can be steered through a chain of Virtualized Network Functions (VNFs) in order to provide aggregated services. With inputs from the industry, the NFV approach was standardized by the European Telecommunications Standards Institute (ETSI) in 2013 [1].

Journal Homepage: http://iaescore.com/journals/index.php/IJECE
DOI: 10.11591/ijece.v8i5.pp3021-3037

Figure 1. NF modeling and performance estimation approach.

The flexibility provided by NFV requires the ability to effectively assign compute nodes to VNFs and to allocate the most appropriate amount of resources, such as CPU quota, RAM, and virtual interfaces. In the ETSI standard, the component in charge of taking such decisions is called the orchestrator, and it can also dynamically modify the amount of resources assigned to a running VNF when needed. The orchestrator can also request the migration of a VNF when the compute node currently executing it is no longer capable of fulfilling the VNF's performance requirements. These tasks require the orchestrator to be able to estimate the performance of VNFs according to the amount of resources they can use. Such an estimate must take into account the nature of the traffic manipulation performed by the VNF at hand, some specifics of its implementation, and the expected amount of traffic it operates on. A good estimate is key to ensuring higher resource usage efficiency and avoiding adjustments at runtime.

This work proposes a unified modeling approach applicable to any VNF, independently of the platform it is running on. By mapping a VNF model onto specific hardware, it is possible to predict the maximum amount of traffic that the VNF can sustain with the required performance. The proposed modeling approach relies on the identification of the most significant operations performed by the VNF on the most common packets. These operations are described in a hardware-independent notation to ensure that the model is valid for any execution platform. The mapping of the model onto a target hardware architecture (required in order to determine the actual performance) can be automated, hence allowing the approach to be easily applied to each available hardware platform in order to choose the most suitable one for execution. Even though the proposed modeling approach has been defined with NFV in mind, it can be applied to non-virtualized network functions (NFs), whether implemented in software or hardware, provided that the implementation and the characteristics of the underlying hardware are known.
The availability of a unified modeling approach for VNFs and NFs is instrumental to the integration of middleboxes in an NFV infrastructure [2], which is important in a transition phase and for specific applications where a dedicated or specialized hardware platform is necessary for a specific NF to satisfy performance requirements.

The modeling approach is introduced in Section 2. and illustrated in Section 3. by applying it to various network functions. In order to validate the proposed models, Section 4. compares the estimated performance with actual measurements of software network functions running on a general-purpose hardware platform. After discussing related work in Section 5., Section 6. concludes the paper.

2. METHODOLOGY

The proposed modeling approach is based on the definition of a set of processing steps, here called Elementary Operations (EOs), that are common throughout various NF implementations. This stems from the observation that, generally, most NFs perform a rather small set of operations when processing the average packet, namely, well-defined alterations of packet header fields coupled with data structure lookups. An EO is informally defined as the longest sequence of elementary steps (e.g., CPU instructions or ASIC transactions) that is common among the processing tasks of multiple NFs. As a consequence, an EO has variable granularity, ranging from a simple I/O or memory load operation to a whole IP checksum computation. On the other hand, EOs are defined so that each can potentially be used in multiple NF models.

An NF is modeled as a sequence of EOs that represent the actions performed for the vast majority of packets. Since we are interested in performance estimation, we ignore operations that affect only a small number of packets (i.e., less than 1%), since these tasks have a negligible impact on performance, even when they are more complex and resource intensive than the most common ones.
Accordingly, exceptions, such as failures, configuration changes, etc., are not considered. It is important to highlight that NF models produced with this approach are hardware independent, which ensures that they can be applied when NFs are deployed on different execution platforms.

In order to estimate the performance of an NF on a specific hardware platform, each EO must be mapped onto the hardware components involved in its execution and their features. This mapping makes it possible to take into consideration the limits of the involved hardware components and to gather a set of parameters that affect performance (e.g., clock frequency). Moreover, the load incurred by each component when executing each EO must be estimated, whether through actual experiments or based on nominal hardware specifications. The data collected during such mapping are specific to the EOs and the hardware platform, but not to a particular NF. Hence, they can be applied to estimate the performance of any NF modeled in terms of EOs. Specifically, the performance of each individual EO involved in the NF model is computed and composed considering the cumulative load that all EOs impose on the hardware components of the execution platform, while heeding all of the applicable constraints. Figure 1 summarizes the steps and intermediate outputs of the proposed approach.

Table 1. List of sample EOs

  EO                        Parameters      Description
1 I/O_mem - mem_I/O         hdr, data       Packet copy between I/O and (cache) memory
2 parse - deparse           b               Parse or encapsulate a data field
3 increase - decrease       b               Increase/decrease a field
4 sum                       b               Sum 2 operands
5 checksum - inc_checksum   b               Compute IP checksum
6 array_access              es, max         Direct access to a byte array in memory
7 ht_lookup                 N, HE, max, p   Simple hash table lookup
8 lpm_lookup                b, es           Longest prefix match lookup
9 ct_insertion              N, HE, max, p   Cache table insertion

Table 1 presents a list of sample EOs that we identified when modeling a number of NFs. Such a list is by no means meant to be exhaustive; rather, it should be incrementally extended whenever it turns out that a new NF being considered cannot be described in terms of previously identified EOs.
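The composition step described above can be made concrete with a short sketch. The code below is our own illustration, not taken from the paper: the function name and the component labels are invented for the example. It bounds the sustainable packet rate of a modeled NF by the most loaded hardware component.

```python
# Illustrative sketch: an NF model is a list of EOs, each mapped to a
# per-packet load on one or more hardware components (e.g., CPU cycles,
# DRAM accesses). The sustainable packet rate is bounded by the component
# whose capacity is exhausted first.

def estimate_max_rate(nf_model, component_capacity):
    """nf_model: list of dicts {component: load_per_packet};
    component_capacity: dict {component: operations_per_second}.
    Returns the maximum packets/second the platform can sustain."""
    totals = {}
    for eo_load in nf_model:
        for component, load in eo_load.items():
            totals[component] = totals.get(component, 0.0) + load
    # Each component bounds the rate by capacity / per-packet load.
    return min(component_capacity[c] / load
               for c, load in totals.items() if load > 0)

# Example with made-up numbers: a 3 GHz core and a memory subsystem
# sustaining 1e8 accesses per second.
model = [{"cpu_cycles": 150, "dram_accesses": 1},   # e.g., a lookup EO
         {"cpu_cycles": 63}]                        # e.g., header-rewrite EOs
print(estimate_max_rate(model, {"cpu_cycles": 3e9, "dram_accesses": 1e8}))
```

Here the CPU is the bottleneck (213 cycles per packet against one DRAM access), so the estimate reduces to clock rate divided by the cumulative cycle count.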
When defining an EO, it is important to identify the parameters related to traffic characteristics that significantly affect its execution and resource consumption.

2.1. Elementary Operations

A succinct description of the EOs listed in Table 1 is provided below.

1. Packet copy between I/O and memory: A packet is copied from/to an I/O buffer to/from memory or CPU cache. hdr is the number of bytes that are preferably stored in the fastest cache memory, while data bytes can be kept in lower-level cache or main memory. The parameters have been chosen taking into consideration that some NPUs provide a manual cache that can be explicitly loaded with the data that need fast access. General-purpose CPUs may have assembly instructions (e.g., PREFETCHh) to explicitly influence the cache logic.

2. Parse or encapsulate a data field: A data field of b bytes stored in memory is parsed. A parsing operation is necessary before performing any computation on a field (it corresponds to loading a processor register). The dual operation, i.e., deparse, implies storing back into memory a properly constructed sequence of fields.

3. Increase/decrease a field: Increase/decrease the numerical value contained in a field of b bytes. The field to increase/decrease must have already been parsed.

4. Sum 2 operands: Two operands of b bytes are added.
5. Compute IP checksum: The standard IP checksum computation is performed on b bytes. When only some bytes change in the relevant data, the checksum can be computed incrementally from the previous correct value [3]. In this case, the previous value of the checksum must be parsed beforehand, and b is the number of changed bytes for which the checksum must be incrementally computed.

6. Direct access to a byte array in memory: This EO performs a direct access to an element of an array in memory using an index. Each array entry has size es, while the array has at most max entries.

7. Simple hash table lookup: A simple lookup in a direct hash table is performed. The hash key consists of N components and each entry has size equal to HE. The table has at most max entries and the collision probability is p.

8. Longest Prefix Match lookup: This EO selects an entry from a table based on the Longest Prefix Match (LPM). This lookup algorithm selects the most specific of the matching entries in a table (i.e., the one where the largest number of leading bits of the key match those in the table entry). The parameter b represents the average number of bytes of the matching prefix, while es is the entry size.

9. Cache table insertion: Save in a hash table an entry with the current timestamp, or update the timestamp if the entry is already present. This EO has the same parameters as the simple hash table lookup operation; the performance of both EOs depends on the hash table characteristics.

For the sake of simplicity (and without affecting the validity of the approach, as shown by the results in Section 4.), in modeling NFs by means of EOs we assume that the number of processor registers is larger than the number of packet fields that must be processed simultaneously. Therefore, there is no competition for processor registers.

2.2. Mapping to Hardware

We now proceed to map the described EOs onto a specific hardware platform: a server with 2 Intel Xeon E5-2690 v2 CPUs (Ivy Bridge architecture with ten physical cores at 3 GHz), 64 GB of DDR3 RAM and one Intel 82599ES network card with 2x10 Gbps Ethernet ports. Figure 2 provides a schematic representation of the platform's main components and their constraints using the template proposed in [4]. Using the CPU reference manual [5], it is possible to determine the operations required for the execution of each EO in Table 1 and estimate the achievable performance.

Figure 2. Hardware architecture description.

1. I/O_mem(hdr, data) - mem_I/O(hdr, data)
The considered CPU provides a DMA-like mechanism to move data from the I/O buffers to the shared L3 cache and vice versa. Intel DPDK drivers [6] with Data Direct I/O Technology (DDIO) leverage this capability to move packets to/from the L3 cache without CPU intervention, improving packet processing speed. The portion of each packet that must be processed (hdr) is then moved from the L3 cache into the L1/L2 cache by the CPU. This operation requires 31 clock cycles to access the L3 cache, around 5 cycles to write an L1/L2 cache line and 9 cycles to write back an L3 cache line [7]. On the whole, the execution of this EO requires:

31 + [5 | 9] × ⌈hdr / 64 B⌉ clock cycles

provided that hdr is less than the total amount of L1 and L2 cache, as is reasonable for modern systems and common packet sizes. The multiplier is 5 for I/O_mem and 9 for mem_I/O.

2. parse(b) - deparse(b)
Loading a 64-bit register requires 5 clock cycles if the data is in L1 cache or 12 clock cycles if it is in L2 cache; otherwise, an additional L3 cache or DRAM memory access is required to retrieve a 64-byte line and store it in L1 or L2, respectively (the reverse operation has the same cost):

5 × ⌈b / 8 B⌉ clock cycles {+ ⌈b / 64 B⌉ L3 or DRAM accesses}
or
12 × ⌈b / 8 B⌉ clock cycles {+ ⌈b / 64 B⌉ L3 or DRAM accesses}

3. increase(b) - decrease(b)
Whether a processor includes an increase (decrease) instruction or one for adding (subtracting) a constant value to a 64-bit register, this EO requires 1 clock cycle to complete. However, thanks to pipelining, up to 3 independent such instructions can be executed during 1 clock cycle:

⌈0.33 × ⌈b / 8 B⌉⌉ clock cycles

4. sum(b)
On the considered architecture, the execution of this EO is equivalent to the increase(b) EO. Please note that this is not necessarily the case on every architecture.

5. checksum(b) - inc_checksum(b)
Figure 3 shows sample assembly code to compute a checksum on an Intel x86-64 processor. Assuming that the data on which the checksum is computed is not in L1/L2 cache, according to the Intel documentation [5], the execution of this code requires:

7 × ⌈b / 2⌉ + 8 clock cycles + ⌈b / 64 B⌉ L3 or DRAM accesses

Register ECX: number of bytes b
Register EDX: pointer to the buffer
Register EBX: checksum

CHECKSUM_LOOP:
    XOR EAX, EAX            ;EAX=0
    MOV AX, WORD PTR [EDX]  ;AX <- next word
    ADD EBX, EAX            ;add to checksum
    SUB ECX, 2              ;update number of bytes
    ADD EDX, 2              ;update buffer
    CMP ECX, 1              ;check if ended
    JG CHECKSUM_LOOP
    MOV EAX, EBX            ;EAX=EBX=checksum
    SHR EAX, 16             ;EAX=checksum>>16, EAX is the carry
    AND EBX, 0xffff         ;EBX=checksum&0xffff
    ADD EAX, EBX            ;EAX=(checksum>>16)+(checksum&0xffff)
    MOV EBX, EAX            ;EBX=checksum
    SHR EBX, 16             ;EBX=checksum>>16
    ADD EAX, EBX            ;checksum+=(checksum>>16)
    MOV checksum, EAX       ;checksum=EAX

Figure 3. Sample Intel x86 assembly code for checksum computation.

6. array_access(es, max)
A direct array access needs to execute an ADD instruction (1 clock cycle) to compute the index and a LOAD instruction, resulting in a direct memory access and as many clock cycles as the number of CPU registers required to load the selected array element:

1 + ⌈es / 8 B⌉ clock cycles + ⌈es / 64 B⌉ DRAM accesses

7. ht_lookup(N, HE, max, p)
We assume that a simple hash table lookup is implemented according to the pseudo-code described in [4] and shown in Figure 4 for ease of reference. Considering that the hash entry needs to be loaded from memory into L1 cache, a simple hash table lookup would require approximately:

⌈(4N + 106 + 5 × ⌈HE / 8 B⌉ + 5 × ⌈HE / 32 B⌉) × (1 + p)⌉ clock cycles

and

⌈⌈HE / 64 B⌉ × (1 + p)⌉ L3 or DRAM accesses

Otherwise, if the entry is already in the L1/L2 cache, the memory accesses and cache store operations are not required. Notice that in order for the whole table to fit in cache, its size should be limited by:

max × HE ≤ 32 KB + 256 KB = 288 KB

8. lpm_lookup(b, es)
There are several different algorithms for finding the longest matching rule. Here we consider the DIR-24-8 algorithm [8], which in most cases (when the entry matches up to 24 bits) is able to find the first matching rule with only one memory access. This speed comes at the cost of space, because of the redundant storage of rules; however, the very fast lookup this algorithm provides heavily outweighs this space cost. With the DIR-24-8 algorithm, the longest prefix match requires the equivalent of an array_access(es, 16M) operation if b ≤ 3, otherwise an additional memory access is required, corresponding to an array_access(es, 255).

9. ct_insertion(N, HE, max, p)
This EO corresponds to a lookup in a hash table followed by either the insertion of a new entry or the update of the timestamp in an existing one.
The two operations have approximately the same cost; the pseudo-code in Figure 5 shows the operations required to update the timestamp of an entry. As a result, the cache table insertion algorithm would require approximately:

⌈(4N + 129 + 7 × ⌈HE / 8 B⌉ + 5 × ⌈HE / 32 B⌉) × (1 + p)⌉ clock cycles

and

2 × ⌈⌈HE / 64 B⌉ × (1 + p)⌉ L3 or DRAM accesses

Register $1-N: key components
Register $HL: hash length
Register $HP: hash array pointer
Register $HE: hash entry size
Register $Z: result

Pseudo code:
# hash key calculation
eor $tmp, $tmp
for i in 1 ... N
    eor $tmp, $i
# key is available in $tmp

# calculate hash index from key
udiv $tmp2, $tmp, $HL
mls $tmp2, $tmp2, $HL, $tmp
# index is available in $tmp2

# index -> hash entry pointer
mul $tmp, $tmp2, $HE
add $tmp, $HP
# entry pointer available in $tmp
<prefetch entry to L1 memory>
# pointer to L1 entry -> $tmp2

# hash key check (entry vs. key)
for i in 1 ... N
    ldr $Z, [$tmp2], #4
    # check keys
    cmp $i, $Z
    bne collision
# no jump means matching keys
# pointer to data available in $Z

Figure 4. Hash table lookup pseudo-code.

3. MODELING USE CASES

This section demonstrates the application of the modeling approach described in Section 2.. EOs are used to describe the operation of simple network functions, such as L2 switches, and a more complex case, a Broadband Network Gateway (BNG). The model is used to estimate the performance of each use case on the hardware platform presented in Section 2.2.. The accuracy of the estimation is evaluated in Section 4. based on real measurements obtained through a range of experiments.

3.1. L2 Switch

First we model an Ethernet switch with a static forwarding table. In this case, the output port is selected through a simple lookup in the table using the destination MAC address. Afterwards, we consider a more general case where the forwarding table is populated using the backward learning algorithm. Finally, we model an MPLS switch, which selects the output interface according to the MPLS label in the packet.

3.1.1. Basic Forwarding

For each packet, the switch selects the output interface where the packet must be forwarded; such an interface is retrieved from a hash table using as a key the destination MAC address extracted from the packet. More in detail, when a network interface receives a packet, it stores it in an I/O buffer. In order to access the Ethernet header, the CPU/NPU must first copy the packet into cache or main memory (possibly with the help of a Direct Memory Access module).
Since the switch operates only on the Ethernet header, together with the identifiers of the ingress and egress ports through which the packet is received and forwarded, the corresponding 30 bytes (18 + 6 + 6 bytes)¹ are copied into the fastest cache, while the rest of the packet (up to 1500 bytes) can be kept in L3 cache or main memory.

¹ In this paper we consider that interfaces are identified by their Ethernet address. Different implementations can use a different identifier, which leads to a minor variation in the model.

Register $HE: updated hash entry
Register $HT: pointer to previous L1 entry
Register $HS: hash entry size

Pseudo code:
for i in 1 ... $HS/8
    mov [$HT], $HE
    add $HT, #8
#update timestamp
rdtsc
mov [$HT], EDX
add $HT, #2
mov [$HT], EAX
<store updated entry>

Figure 5. Entry update pseudo-code for cache table insertion.

I/O_mem(30,ps)
parse(6)
ht_lookup(1,12,2M,0)
deparse(6)
mem_I/O(30,ps)

(a) Basic forwarding switch model.

I/O_mem(30,ps)
parse(8)
ht_lookup(1,14,2M,0)
parse(12)
ct_insertion(2,14,2M,0)
deparse(6)
mem_I/O(30,ps)

(b) Learning switch model.

I/O_mem(34,ps-4)
parse(3)
ht_lookup(1,12,1M,0)
parse(1)
decrease(1)
deparse(10)
mem_I/O(34,ps-4)

(c) MPLS switch model.

Figure 6. Models of different L2 switches.

To ensure generality, we consider that an incoming packet cannot be copied directly from one I/O buffer to another; instead, it must first be copied into (cache) memory. The switch must then read the destination MAC address (6 bytes) prior to using it to access the hash table to get the appropriate output interface. The hash table has one key (the destination MAC) and consists of 12-byte entries composed of the key and the output interface MAC address. A common number of entries in a typical switch implementation is 2M, which gives an idea, when mapping the model to specific hardware, of whether the hash table can be fully stored in cache under generic traffic conditions. The new output port must be stored in the data structure in L3 cache or main memory (which, as previously explained, has the same cost as parsing 6 bytes), before moving the packet to the buffer of the selected output I/O device. The resulting model expressing the above steps in terms of EOs is summarized in Figure 6a, where ps is the Ethernet payload size. The model assumes that the collision probability of the hash is negligible (i.e., the hash table is sufficiently sparse).
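The EO sequence of Figure 6a can also be written down directly as data, which is what makes the estimation step automatable. The encoding below is our own sketch (the EO names follow Table 1; the tuple representation and function name are hypothetical):

```python
# The basic forwarding switch model of Figure 6a as a list of
# (EO name, parameters) pairs; ps is the Ethernet payload size.

def basic_forwarding_model(ps):
    return [
        ("I/O_mem",   {"hdr": 30, "data": ps}),  # packet into (cache) memory
        ("parse",     {"b": 6}),                 # destination MAC address
        ("ht_lookup", {"N": 1, "HE": 12, "max": 2_000_000, "p": 0}),
        ("deparse",   {"b": 6}),                 # store the output port
        ("mem_I/O",   {"hdr": 30, "data": ps}),  # packet to the output buffer
    ]

for eo, params in basic_forwarding_model(1500):
    print(eo, params)
```

Given such a list and the per-EO cost expressions of Section 2.2., the total per-packet cost on a target platform can be accumulated mechanically.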
Applying the mapping of EOs presented in Section 2.2. to the Ethernet switch model, we can estimate that forwarding a packet, regardless of the packet size (thanks to DDIO), requires:

213 clock cycles + 1 DRAM access

As a consequence, a single core of an Intel Xeon E5-2690 v2 operating at 3.6 GHz can process 17.31 Mpps, while the DDR3 memory can support 111.08 Mpps. The memory throughput is estimated considering that each packet requires a 12-byte memory access to read the hash table entry, which has a latency of:

((CAS latency × 2) + 3) / data rate

If we consider minimum-size (64-byte) packets (i.e., an unrealistic, worst-case scenario), a single core can process 11.36 Gbps.
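The arithmetic behind the core-bound figures can be sketched as follows. The cycle count is the one derived above; the assumption that a 64-byte frame occupies 84 bytes on the wire (frame plus Ethernet preamble and inter-frame gap) is ours, and the exact Mpps value depends on the clock frequency assumed:

```python
# Core-bound throughput: clock rate divided by per-packet cycle count.
# 213 cycles per packet, 3.6 GHz clock (assumptions as in the text above).

def core_bound_pps(cycles_per_pkt, clock_hz=3.6e9):
    return clock_hz / cycles_per_pkt

pps = core_bound_pps(213)
# For the bitrate, assume each 64-byte frame takes 84 bytes on the wire
# (frame + 20 bytes of preamble and inter-frame gap): our assumption.
gbps = pps * 84 * 8 / 1e9
print(round(pps / 1e6, 2), "Mpps;", round(gbps, 2), "Gbps")  # ~16.9 Mpps, ~11.36 Gbps
```

The memory-side bound is obtained analogously, dividing the sustainable DRAM access rate by the number of DRAM accesses per packet (one, in this model).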
3.1.2. Learning Switch

We here consider an Ethernet switch with VLAN support, in which case the key used for lookups in the forwarding table consists of the destination MAC address and the VLAN ID (2 bytes). Hence, 8 bytes must be parsed from the header (destination address and VLAN ID) of each packet in order to obtain the lookup key, and entries in the forwarding table are 14 bytes long (destination address and VLAN ID as key and output interface as value). Since the switch applies backward learning, for each packet the source MAC address and source port are used to update the forwarding table. The switch must therefore also parse the source MAC address, read from memory the source port (added to packets stored in memory), and either add an entry to the forwarding table or just update the timestamp of an existing one. The resulting model is shown in Figure 6b.

When mapped to our hardware architecture, forwarding a packet requires an estimated:

352 clock cycles + 2 DRAM accesses

hence the maximum throughput reachable by a single core is reduced to 10.47 Mpps, while the DDR3 memory can support 55.54 Mpps. This translates to a maximum throughput of 6.87 Gbps for 64-byte packets.

3.1.3. MPLS Switch

An MPLS switch is a simple, yet currently widely deployed, network function. For each packet, the switch swaps a single MPLS label and forwards the packet on an Ethernet network towards the next hop. The new label and the next hop are retrieved from a hash table whose key is the label extracted from the packet. Since the MPLS switch modifies the label in the MPLS header, in addition to associating to it the output port, the MPLS header (4 bytes) is also preferably copied into the L1/L2 cache, while the rest of the packet can be kept in L3 cache or main memory.
The switch must then extract the MPLS label (20 bits, i.e., 3 bytes when parsed) prior to using it to access the hash table to get the new label and the next hop. The hash table has one key (the label) and consists of 12-byte entries:

- Input label (key) - 3 bytes
- Output label - 3 bytes
- Next hop Ethernet address - 6 bytes

The maximum number of entries in the hash table is, in the worst case, 1M = 2^20, and we consider the collision probability to be negligible. In the most general case, each entry, referred to in the MPLS standard documents as a Next Hop Label Forwarding Entry (NHLFE), could hold more than one label in case of multiple label operations. For the sake of simplicity, we model only a single label operation: the swapping of a label, which is the most frequent operation in common MPLS switch deployment scenarios. The switch must also decrease the Time-To-Live (TTL) contained in the MPLS header, which requires parsing the corresponding field, followed by a decrease operation on the 1-byte field. The new (outgoing) MPLS header and output port must be stored in main memory (encapsulation of 10 bytes) and moved to the buffer of the output I/O device. The resulting model is summarized in Figure 6c.

As we map this model to the considered hardware platform, we can conclude that the estimated forwarding cost for an MPLS switch is:

224 clock cycles + 1 DRAM access

corresponding to a maximum per-core throughput of 16.45 Mpps, while the memory can provide the same throughput as for the basic forwarding switch. The maximum bitrate considering 64-byte packets is 10.8 Gbps.

3.2. Broadband Network Gateway

A Broadband Network Gateway (BNG) is the first IP point in the network for DSL and cable modem subscribers, connecting them to the broadband IP network. The primary task of a BNG is to aggregate traffic from various subscriber sessions from an access network and route it to the core network of the service provider.
Moreover, a BNG carries out additional vital tasks for Network Service Providers (NSPs), such as managing subscribers' sessions, performing accounting, and enforcing operator policies. Hence, a BNG represents a more complex use case for the application of the proposed modeling approach.
Figure 7. Packet formats.

In our modeling effort we refer to the software implementation of a BNG included in the Intel Data Plane Performance Demonstrators (DPPD) [9]. This is an open-source, highly optimized software BNG specifically intended for performance analysis. In this implementation, the traffic in the access network between the Customer Premise Equipment (CPE) and the BNG is encapsulated using Ethernet QinQ frames, while the traffic between the BNG and the Carrier-grade NAT (CGNAT) in the core MPLS network is encapsulated using GRE (Generic Routing Encapsulation). In this scenario, packets received from the access network and packets received from the core network are processed differently by the BNG, so 2 separate models are required for the 2 directions. The two different formats of packets forwarded in the access and in the core network are illustrated in Figure 7.

Packets from CPEs are matched against 2 different tables: (i) a hash table that, given the QinQ tag, provides the corresponding GRE key (up to 16M entries of 7 bytes) and (ii) an LPM routing table that, given the destination IP address, returns the output port, the IP address of the remote GRE tunnel endpoint, the next hop MAC address and the MPLS label (this table can contain up to 8K routes). Packets from the core network are instead matched against only one hash table that, given the GRE key and the inner destination IP address, provides the QinQ tag, the destination MAC address and the output port.
The BNG supports up to 64K CPEs, thus this table can contain up to 64K entries of 23 bytes. The QinQ tag and the GRE key are used to track the subscriber (e.g., for accounting), while the tunnel endpoint (i.e., the CGNAT) is selected according to the destination of the packet. The resulting models for both directions are summarized in Figure 8.

When processing packets from the access network, the MAC with QinQ and IP headers are loaded preferably into L1/L2 cache, so that the QinQ header can be parsed. The extracted QinQ tag is used for the lookup in table (i), while the destination IP address is parsed and used for the LPM lookup in table (ii). These 2 lookups provide the output GRE key, destination IP and MAC addresses, MPLS tag and output port that are used in the encapsulation of the output packet. The TTL (Time To Live) of the internal IP packet is decremented, and thus the checksum must be incrementally updated starting from its current value. The new packet format also requires the computation of the GRE checksum and of the external IP packet's Total Length field and header checksum. Moreover, backward learning is used to populate the table used to process packets from the core network. Hence, an additional ct_insertion operation is required, after parsing the source port, MAC and IP addresses. The final packet is formed with the encapsulation of 70 bytes, corresponding to the new Ethernet, MPLS, external IP, GRE and inner IP headers, and is then sent to the output I/O buffer.

Packets from the core network require a parse operation for the GRE key and the inner destination IP before using them for a hash table lookup to get the QinQ tag, the destination MAC address and the output port. In this case, the TTL of the inner IP packet is also decremented and the checksum incrementally updated.
The new outgoing packet must then be stored in memory or cache (encapsulation of 42 bytes) and moved to the buffer of the output I/O device.

Mapping these models to the considered hardware platform, we can conclude that the estimated cost to process a 64-byte packet from the access network is:

717 clock cycles + 6 DRAM accesses

corresponding to a maximum per-core throughput of 5.14 Mpps (3.37 Gbps), while the DDR3 memory can support 12.11 Mpps (7.95 Gbps). The estimated cost to process a 64-byte packet from the core network is:

274 clock cycles + 1 DRAM access
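The same single-core arithmetic can be applied to both BNG directions. A minimal sketch (cycle counts from the text; the 3.6 GHz clock is the one assumed for the L2 switch estimates, and the function name is ours):

```python
# Core-bound packet rate for the two BNG directions, in Mpps.
# Cycle counts per 64-byte packet are taken from the estimates above.

def core_bound_mpps(cycles_per_pkt, clock_hz=3.6e9):
    # packets per second a single core can sustain, in millions
    return clock_hz / cycles_per_pkt / 1e6

for direction, cycles in [("access -> core", 717), ("core -> access", 274)]:
    print(f"{direction}: {core_bound_mpps(cycles):.2f} Mpps (core bound)")
```

As with the switch models, the final estimate for each direction is the minimum of the core-side bound and the memory-side bound.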