Inter national J our nal of Electrical and Computer Engineering (IJECE) V ol. 7, No. 3, June 2017, pp. 1430 1435 ISSN: 2088-8708 1430       I ns t it u t e  o f  A d v a nce d  Eng ine e r i ng  a nd  S cie nce   w     w     w       i                       l       c       m     VLSI Design of a F ast Pipelined 8x8 Discr ete Cosine T ransf orm Nurulnajah Mohd Zabidi and Ab Al-Hadi Ab Rahman F aculty of Electrical Engineering, Uni v ersiti T eknologi Malaysia, Malaysia Article Inf o Article history: Recei v ed Jan 4, 2017 Re vised Mar 18, 2017 Accepted Apr 2, 2017 K eyw ord: DCT VLSI Pipeline V ideo Coding ABSTRA CT This paper presents a V ery Lar ge Scale Inte grated (VLSI) design and implementation of a fix ed-point 8x8 multiplierless Discrete Cosine T ransform (DCT) using the ISO/IEC 23002- 2 algorithm. The standard DCT algorithm, which is mainly used in image and video com- pression technology , consists of only adders, subtractors, and shifters, therefore making it ef ficient for hardw are implementation. The VLSI implementation of the algorithm gi v en in this paper further enhance s the performance of the transform unit. Furthermore, circuit pipelining has been applied to the base design of the DCT , which significantly impro v es the performance by reducing the longest path in the non-pipeline design. The DCT has been implemented using semi-custom VLSI design methodology using the TSMC 0.13um pro- cess technology . Results sho w that our DCT designs can run up to around 1.7 Gig a pix els/s, which is well abo v e the timing required for real-time ultra-high definition 8K video. Copyright c 2017 Institute of Advanced Engineering and Science . All rights r eserved. Corresponding A uthor: Ab Al-Hadi Ab Rahman F aculty of Electrical Engineering, Uni v ersiti T eknologi Malaysia 81310 Skudai, Johor Bahru, Malaysia Phone: +6019-7338914 Email: hadi@fk e.utm.my 1. INTR ODUCTION Image compres sion is a process to reduce data redundancies in the image in order to increase the storage capacity . Discrete Cosine T ransform (DCT) is widely used for lossy compr ession. Most image or video coding standards such as Joint Photographic Ex pe rt Group (JPEG) and Mo ving Picture Expert Group (MPEG) use DCT as a standard transform coding sche me. The latest High Ef ficienc y V ideo Coding (HEVC) which is finalized in 2013 also utilizes the DCT as the main transform unit [1]. DCT basis function is the Cosine, where multiplication and addition are the main arithmetic operations in v olv ed. Man y DCT -based research has been conducted in the past fe w years, which has produced dif ferent kind of DCT Algorithms, such as Arai DCT scheme, W ang F actorization, Lee DCT for po wer of tw o block length, Loef fler algorithm, and Feig-W inograd f actorization ([2]). These Algorithms ha v e been used in practical appl ications. In recent image processing technology , v arious hardw are implementation of DCT are using Arai DCT scheme [3]. It uses only v e multiplications and twenty-nine addition, which is less arithmetic operations if compared to other stated algorithms. F or the MPEG technology , the International Standards Or g anization (ISO) released an optimized fix ed- point multiplierless v ersion of the DCT algorithm, suitable for image and video compression. The standard which is called the ISO/IEC 23002-2 is described and implemented in the present w ork [4]. VLSI design of DCT can be found in numerous articles, with an o v ervie w gi v en in [5]. F or comparison purposes, we ha v e analyzed three similar designs. The w ork by Mandayak e et al in [6] presents a VLSI architecture of the DCT using the Arai DCT scheme. It pr o pos es a f ast algorithm by reducing the number of inte ger channels. The design is implemented using 45nm technology . The w ork by W ahid et al [7] proposes an area ef ficient fix ed point DCT architecture implemented in 0.18 um CMOS technology . Another interesting w ork is by Fu et al [8], where a lo w po wer implementation is proposed based on algebraic inte ger encoding technique. This w ork also utilizes 0.18um CMOS technology . Performance results for these w orks are gi v en in the results section. The present paper on the other hand, describes the semi-custom V ery Lar ge Scale Inte gration (VLSI) design of the IS O/IEC 23002-2 DCT algorithm using TSMC 0.13um technology , similar to the design methodology used in J ournal Homepage: http://iaesjournal.com/online/inde x.php/IJECE       I ns t it u t e  o f  A d v a nce d  Eng ine e r i ng  a nd  S cie nce   w     w     w       i                       l       c       m     DOI:  10.11591/ijece.v7i3.pp1430-1435 Evaluation Warning : The document was created with Spire.PDF for Python.
IJECE ISSN: 2088-8708 1431 [9]. The design has also been optimized by applying circuit pipelining. 1.1. DCT Theor etical Backgr ound The basic equation of N-point one-dimensional Discrete Cosine T ransform (1D-DCT) as defined in [10]: v ( k ) = ( k ) ( N 1) X n =0 u ( n ) cos (2 n + 1) k 2 N ; 0 k N 1 (1) where u(n) denotes the input data, v(k) denotes the output data, whereby (0) = r 1 N ; ( k ) = r 2 N ; 1 k N 1 (2) 1D-DCT equation for N=8 , can also be written in matrix form as follo ws: 2 6 6 6 6 6 6 6 6 6 6 4 v (0) v (1) v (2) v (3) v (4) v (5) v (6) v (7) 3 7 7 7 7 7 7 7 7 7 7 5 = 2 6 6 6 6 6 6 6 6 6 6 4 c 4 c 4 c 4 c 4 c 4 c 4 c 4 c 4 c 1 c 3 c 5 c 7 c 7 c 5 c 3 c 1 c 2 c 6 c 6 c 2 c 2 c 6 c 6 c 2 c 3 c 7 c 1 c 5 c 5 c 1 c 7 c 3 c 4 c 4 c 4 c 4 c 4 c 4 c 4 c 4 c 5 c 1 c 7 c 3 c 3 c 7 c 1 c 5 c 6 c 2 c 2 c 6 c 6 c 2 c 2 c 6 c 7 c 5 c 3 c 1 c 1 c 3 c 5 c 7 3 7 7 7 7 7 7 7 7 7 7 5 2 6 6 6 6 6 6 6 6 6 6 4 u (0) u (1) u (2) u (3) u (4) u (5) u (6) u (7) 3 7 7 7 7 7 7 7 7 7 7 5 (3) where c i = cos i 16 ; i = (2 n + 1) k (4) Equation (3) sho ws that the matrix computes in column wise to produce 1D-DCT . Computing it further in ro w wise produces the 2D-DCT . Considering the intensi v e computations required in DCT , man y ef ficient algorithms are proposed and reported in literature as mentioned, including the ISO/IEC 23002-2 used in the preset w ork. The algorithm is gi v en in Figure 1. The inputs are ind0 to ind7, where each input is 32-bit wide. Se v eral v ariables are defined for intermediate operations, where finally the results are stored in the outputs outd0 to outd7. VLSI Design of a F ast Pipelined 8x8 Discr ete Cosine T r ansform (Nurulnajah Mohd Zabidi) Evaluation Warning : The document was created with Spire.PDF for Python.
1432 ISSN: 2088-8708 Figure 1. The ISO/IEC 23002-2 1D-DCT algorithm One of the c ritical components in the DCT algorithm is the PMUL operation, gi v en in Figure 2. The PMUL performs Polynomial Multiplication. Equation (5) sho ws an e xample of polynomial equation, where the highest de gree of polynomial in this equation is 11. The e xponential could be eliminated by replacing with shift right operations. Equation (6) sho ws the con v ersion results of all the e xponential in equation (5) into arithmetic shift right. y = ( y 3 y 7 ) + (( y 3 y 7 ) y 11 ) 1 (5) y = ( y 3 y 7) + (( y 3 y 7) y 11)1 (6) Figure 2. PMUL components used in the ISO/IEC 23002-2 1D-DCT algorithm 2. PIPELINE CONCEPT The main concept in circuit pipelining is to split the job process into smaller stages which will help to enhance the performance by reducing the combinatorial critical path. Figure 3 sho ws the dif ference between non pipeline and pipeline structure in terms of combinational logic circuit. Non-pipeline circuit is made up of a combinational logic, an input and an output. Pipeline circuit is made up of a combinational logic that has been partitioned into smaller portion and then connected by re gisters. Essentially , pipelining allo ws the design to run at a higher operating frequenc y at a ne glible cost in latenc y caused by initializing the pipeline stages. IJECE V ol. 7, No. 3, June 2017: 1430 1435 Evaluation Warning : The document was created with Spire.PDF for Python.
IJECE ISSN: 2088-8708 1433 Figure 3. Non-pipeline and pipeline concept 3. PR OPOSED DCT HARD W ARE ARCHITECTURE The ISO/IEC 23002-2 algorithm presented in Figure 1 is translated into a hardw are architecture via a dataflo w graph (DFG), sho wn in Figure 4. There are a total of 54 intermediate v ariables, which translates into wires for non- pipeline implementation. Similar to the input and output widths, wires are set at 32-bits each. The algorithm also utilizes 27 subtractors, 19 adders, and 24 shifters. From the DFG, it can be seen that the longest path is 7 arithmetic operators. Therefore, the design can be split for pipelining to reduce the path length. Essentially , the DFG can be partitioned into stages, and at each partition bounday , re gisters are added to store and hold intermediate data. It should be noted that a v ery small initial delay is e xpected to fill in the pipeline re gisters. Optimum paritioning boundary can be found using some of the proposed algorithms in [11]. Figure 4. Data Flo w Graph of Non-pipeline DCT VLSI Design of a F ast Pipelined 8x8 Discr ete Cosine T r ansform (Nurulnajah Mohd Zabidi) Evaluation Warning : The document was created with Spire.PDF for Python.
1434 ISSN: 2088-8708 4. RESUL T AND AN AL YSIS In this w ork, Mentor Graphics ED A tools with VLSI process technology TSMC 0.13um CMOS are used to design, implement and v alidate the DCT . Some of the tools used include Pyxis for schematic and ph ysical design; Modelsim for high-le v el simulation; and Eldonet and EZ w a v e for lo w-le v el SPICE simulation. Semi-custom VLSI design methodology is used, whereby the arithmetic operators are instantiated manually from a standard cell library to form a complete DCT . The simulation results obtained from Modelsim at the R TL le v el is compared to the one obtained from SPICE simulation to ensure correct results. Here, we sho w results for non-pipeline and a 2-stage pipeline implementation. The follo wing is the number of components used for the DCT . There are a total of 19 Adders, 27 subtractors and 24 shifters (all 32-bits) that ha v e been used in both Non-pipeline DCT and pipeline DCT . As for re gsiters, 16 32-bit and 42 32-bit re gisters ha v e been used in Non-pipeline and Pipeline DCT respecti v ely . The pipeline design uses roughly 2.6 times more re gisters compared to non-pipeline. As for the total number of transistors, non-pipeline DCT consists of 74240, while pipeline DCT consists of 85120, with increase of roughly 14% more resource due to the pipeline re gisters. By simulation for tw o dif ferent test patterns called pattern1 and pattern 2 (deri v ed from the F oreman QCIF video frame), the critical data path (cpd) for Non-pipel ine DCT is found to be 7.97ns for input pattern 1, and 6.59ns for input pattern 2. As for the Pipeline DCT , it is found that the cpd is 4.61ns for input pattern 1 and 3.83ns for input pattern 2. From this , it can be estimated that the non-pipeline DCT can run at the maximum speed of 125MHz, and the pipeline DCT at 217MHz. In terms of throughput, maximum output rate achie v ed are 1 Gig a pix els/s and 1.7 Gig a pix els/s respecti v ely for non-pipeli ne and pipeline DCTs. The graph in Figure 5 sho ws the plot of maximum operating frequenc y vs throughput. Both of these designs can support the speed requirements of well abo v e 8K UHD resolution at real-time. It should be noted ho we v er that there are man y other f actors and components that may af fec t the speed of this DCT when inte grated into a complete video codec. Figure 5. Maximum frequenc y (Fmax) vs Throughput in Mpix els/s T able 1 sho ws the comparison to three similar w orks in literature: Madanayak e et al [6] with 45nm CMOS, W ahid et al [7] with 0.18um CMOS, and Fu et al [8] with 0.18 um CMOS technologies. In terms of throughput, It can be seen that our designs are superior to [7, 8], b ut trails the design from [6]. This w ork sho ws higher performance due to the smaller feature technology used. Ho we v er , our design has sho wn to meet the requirements of state-of-the-art video coding resolution e v en when using a lar ger technology . T able 1. Comparison with similar w orks in literature Madan yak e W ahid Fu et al Non-pipeline Pipeline et al [6] et al [7] [8] DCT DCT CMOS T echnology (nm) 45 180 180 130 130 Operating frequenc y (MHz) 946 194.7 75 116 250 throughput (Mp/s) 7568 194.7 75 1000 1739 IJECE V ol. 7, No. 3, June 2017: 1430 1435 Evaluation Warning : The document was created with Spire.PDF for Python.
IJECE ISSN: 2088-8708 1435 5. CONCLUSION The objecti v e of the present paper is to present results for implementing a f ast VLSI pipelined multiplierless fix ed-point 8x8 DCT . The algorithm used is the ISO/IEC 23002-2 for the image and video compr ession technology . The design methodology be gins with modeling the algorithm using dataflo w graphs in order to analyze for pipelining. The results sho w that the implementations are feasible to be applied in the latest ultra-high definition 8K video, e v en when using the relati v ely lar ge 0.13um technology . W e ha v e implemented and compared the DCT using tw o architectures, non-pipeline and a 2-stage pipeline. Simulation results v alidated the e xpectation that throughput can be impro v ed by almost a f actor of tw o, from around 1 Gig a pix els/s t o 1.8 Gig a pix els/s, at a small cost of roughly 14% more resource (i.e. pipeline re gisters). Future w ork aims to e xtend the methodology for implementing dif ferent DCT dimensions for other applications. A CKNO WLEDGEMENT The authors w ould lik e to thank the Malaysia Ministry of Education for pro viding the funds for the w ork in this paper (V ote no. 4F659). REFERENCES [1] V . Sze, M. Budag a vi, G.J. Sulli v an, “High Ef ficienc y V ideo Coding, Springer , 2014. [2] He yne, B. Gotze, J., A Lo w Po wer and High Quality Implementation of The Discrete Cosine T ransformation,“ Adv ances in Radio Science , 2007. [3] U. S. Potluri, A. Madanayak e, R. J. Cintra, F . M. Bayer , S. K ulasek era, and A. Edirisuriya, Impro v ed 8-point approximate DCT for image and video compression requiring only 14 additions, IEEE T ransactions on Circuits and Systems I: Re gular P apers , v ol. 6, no. 6, pp. 1727-1740, June 2014. [4] ISO/IEC, Information technology - MPEG video technologies - P art 2: Fix ed-point 8x8 in v erse discrete cosine transform and discrete cosine transform, International Standard , Dec 2014. [5] A. Madanayak e et al., ”Lo w-Po wer VLSI Architectures for DCT/D WT : Precision vs Approximation for HD V ideo, Biomedical, and Smart Antenna Applications, IEEE Circuits and Systems Mag azine , v ol. 15, no. 1, pp. 25-47, 2015. [6] A. Edirisuriya, A. Madanayak e, R. Cintra, V . Dimitro v and N. Rajapaksha, ”A single-channel architecture for al- gebraic inte ger -based 8x8 2-D DCT computation”, IEEE T ransacti o ns on Circuits Systems for V ideo T echnology , v ol. 23, no. 12, pp. 2083-2089, Dec. 2013. [7] K. A. W ahid, M. Martuza, M. Das and C. McCrosk y , ”Ef ficient hardw are implementa tion of 8x8 inte ger cosine transforms for multiple video codecs”, Journal of Real-T ime Processing , pp. 1-8, July 2011. [8] M. Fu, G. A. Jullien, V . S. Dimitro v and M. Ahmadi, ”A lo w-po wer DCT IP core based on 2D algebraic inte ger encoding”, Proceedings of the International Symposium on Circuits Systems (ISCAS) , v ol. 2, pp. 765-768. [9] A. A. H. Ab-Rahman and I. Kamisian and A. Z. Sha’ameri, VLSI design and implementation of adapti v e channel equalizer , 2008 International Conference on Computer and Communication Engineering , pp. 1121-1124, May 2008. [10] Rachmad V idya W icaksana Putra, Rella Mareta, Nurfitri Anbarsanti , T rio Adiono A Ne w R TL Design Approach for a DCT/IDCT -Based Image Compression Archi tecture using the mCBE Algorithm, Journal of ICT Research and Applications , V ol. 6, No. 2, pp 131-150, 2012 [11] A. Prihozh y and E. Bezati and A. A. H. Ab Rahman and M. Matta v elli Synthesis and Optimization of Pipelines for HW Implementations of Dataflo w Programs, IEEE T ransactions on Computer -Aided Design of Inte grated Circuits and Systems , V ol. 34, no. 10, pp 1613-1626, 2015. VLSI Design of a F ast Pipelined 8x8 Discr ete Cosine T r ansform (Nurulnajah Mohd Zabidi) Evaluation Warning : The document was created with Spire.PDF for Python.