Inter
national
J
our
nal
of
Electrical
and
Computer
Engineering
(IJECE)
V
ol.
7,
No.
3,
June
2017,
pp.
1430
–
1435
ISSN:
2088-8708
1430
I
ns
t
it
u
t
e
o
f
A
d
v
a
nce
d
Eng
ine
e
r
i
ng
a
nd
S
cie
nce
w
w
w
.
i
a
e
s
j
o
u
r
n
a
l
.
c
o
m
VLSI
Design
of
a
F
ast
Pipelined
8x8
Discr
ete
Cosine
T
ransf
orm
Nurulnajah
Mohd
Zabidi
and
Ab
Al-Hadi
Ab
Rahman
F
aculty
of
Electrical
Engineering,
Uni
v
ersiti
T
eknologi
Malaysia,
Malaysia
Article
Inf
o
Article
history:
Recei
v
ed
Jan
4,
2017
Re
vised
Mar
18,
2017
Accepted
Apr
2,
2017
K
eyw
ord:
DCT
VLSI
Pipeline
V
ideo
Coding
ABSTRA
CT
This
paper
presents
a
V
ery
Lar
ge
Scale
Inte
grated
(VLSI)
design
and
implementation
of
a
fix
ed-point
8x8
multiplierless
Discrete
Cosine
T
ransform
(DCT)
using
the
ISO/IEC
23002-
2
algorithm.
The
standard
DCT
algorithm,
which
is
mainly
used
in
image
and
video
com-
pression
technology
,
consists
of
only
adders,
subtractors,
and
shifters,
therefore
making
it
ef
ficient
for
hardw
are
implementation.
The
VLSI
implementation
of
the
algorithm
gi
v
en
in
this
paper
further
enhance
s
the
performance
of
the
transform
unit.
Furthermore,
circuit
pipelining
has
been
applied
to
the
base
design
of
the
DCT
,
which
significantly
impro
v
es
the
performance
by
reducing
the
longest
path
in
the
non-pipeline
design.
The
DCT
has
been
implemented
using
semi-custom
VLSI
design
methodology
using
the
TSMC
0.13um
pro-
cess
technology
.
Results
sho
w
that
our
DCT
designs
can
run
up
to
around
1.7
Gig
a
pix
els/s,
which
is
well
abo
v
e
the
timing
required
for
real-time
ultra-high
definition
8K
video.
Copyright
c
2017
Institute
of
Advanced
Engineering
and
Science
.
All
rights
r
eserved.
Corresponding
A
uthor:
Ab
Al-Hadi
Ab
Rahman
F
aculty
of
Electrical
Engineering,
Uni
v
ersiti
T
eknologi
Malaysia
81310
Skudai,
Johor
Bahru,
Malaysia
Phone:
+6019-7338914
Email:
hadi@fk
e.utm.my
1.
INTR
ODUCTION
Image
compres
sion
is
a
process
to
reduce
data
redundancies
in
the
image
in
order
to
increase
the
storage
capacity
.
Discrete
Cosine
T
ransform
(DCT)
is
widely
used
for
lossy
compr
ession.
Most
image
or
video
coding
standards
such
as
Joint
Photographic
Ex
pe
rt
Group
(JPEG)
and
Mo
ving
Picture
Expert
Group
(MPEG)
use
DCT
as
a
standard
transform
coding
sche
me.
The
latest
High
Ef
ficienc
y
V
ideo
Coding
(HEVC)
which
is
finalized
in
2013
also
utilizes
the
DCT
as
the
main
transform
unit
[1].
DCT
basis
function
is
the
Cosine,
where
multiplication
and
addition
are
the
main
arithmetic
operations
in
v
olv
ed.
Man
y
DCT
-based
research
has
been
conducted
in
the
past
fe
w
years,
which
has
produced
dif
ferent
kind
of
DCT
Algorithms,
such
as
Arai
DCT
scheme,
W
ang
F
actorization,
Lee
DCT
for
po
wer
of
tw
o
block
length,
Loef
fler
algorithm,
and
Feig-W
inograd
f
actorization
([2]).
These
Algorithms
ha
v
e
been
used
in
practical
appl
ications.
In
recent
image
processing
technology
,
v
arious
hardw
are
implementation
of
DCT
are
using
Arai
DCT
scheme
[3].
It
uses
only
fi
v
e
multiplications
and
twenty-nine
addition,
which
is
less
arithmetic
operations
if
compared
to
other
stated
algorithms.
F
or
the
MPEG
technology
,
the
International
Standards
Or
g
anization
(ISO)
released
an
optimized
fix
ed-
point
multiplierless
v
ersion
of
the
DCT
algorithm,
suitable
for
image
and
video
compression.
The
standard
which
is
called
the
ISO/IEC
23002-2
is
described
and
implemented
in
the
present
w
ork
[4].
VLSI
design
of
DCT
can
be
found
in
numerous
articles,
with
an
o
v
ervie
w
gi
v
en
in
[5].
F
or
comparison
purposes,
we
ha
v
e
analyzed
three
similar
designs.
The
w
ork
by
Mandayak
e
et
al
in
[6]
presents
a
VLSI
architecture
of
the
DCT
using
the
Arai
DCT
scheme.
It
pr
o
pos
es
a
f
ast
algorithm
by
reducing
the
number
of
inte
ger
channels.
The
design
is
implemented
using
45nm
technology
.
The
w
ork
by
W
ahid
et
al
[7]
proposes
an
area
ef
ficient
fix
ed
point
DCT
architecture
implemented
in
0.18
um
CMOS
technology
.
Another
interesting
w
ork
is
by
Fu
et
al
[8],
where
a
lo
w
po
wer
implementation
is
proposed
based
on
algebraic
inte
ger
encoding
technique.
This
w
ork
also
utilizes
0.18um
CMOS
technology
.
Performance
results
for
these
w
orks
are
gi
v
en
in
the
results
section.
The
present
paper
on
the
other
hand,
describes
the
semi-custom
V
ery
Lar
ge
Scale
Inte
gration
(VLSI)
design
of
the
IS
O/IEC
23002-2
DCT
algorithm
using
TSMC
0.13um
technology
,
similar
to
the
design
methodology
used
in
J
ournal
Homepage:
http://iaesjournal.com/online/inde
x.php/IJECE
I
ns
t
it
u
t
e
o
f
A
d
v
a
nce
d
Eng
ine
e
r
i
ng
a
nd
S
cie
nce
w
w
w
.
i
a
e
s
j
o
u
r
n
a
l
.
c
o
m
,
DOI:
10.11591/ijece.v7i3.pp1430-1435
Evaluation Warning : The document was created with Spire.PDF for Python.
IJECE
ISSN:
2088-8708
1431
[9].
The
design
has
also
been
optimized
by
applying
circuit
pipelining.
1.1.
DCT
Theor
etical
Backgr
ound
The
basic
equation
of
N-point
one-dimensional
Discrete
Cosine
T
ransform
(1D-DCT)
as
defined
in
[10]:
v
(
k
)
=
(
k
)
(
N
1)
X
n
=0
u
(
n
)
cos
(2
n
+
1)
k
2
N
;
0
k
N
1
(1)
where
u(n)
denotes
the
input
data,
v(k)
denotes
the
output
data,
whereby
(0)
=
r
1
N
;
(
k
)
=
r
2
N
;
1
k
N
1
(2)
1D-DCT
equation
for
N=8
,
can
also
be
written
in
matrix
form
as
follo
ws:
2
6
6
6
6
6
6
6
6
6
6
4
v
(0)
v
(1)
v
(2)
v
(3)
v
(4)
v
(5)
v
(6)
v
(7)
3
7
7
7
7
7
7
7
7
7
7
5
=
2
6
6
6
6
6
6
6
6
6
6
4
c
4
c
4
c
4
c
4
c
4
c
4
c
4
c
4
c
1
c
3
c
5
c
7
c
7
c
5
c
3
c
1
c
2
c
6
c
6
c
2
c
2
c
6
c
6
c
2
c
3
c
7
c
1
c
5
c
5
c
1
c
7
c
3
c
4
c
4
c
4
c
4
c
4
c
4
c
4
c
4
c
5
c
1
c
7
c
3
c
3
c
7
c
1
c
5
c
6
c
2
c
2
c
6
c
6
c
2
c
2
c
6
c
7
c
5
c
3
c
1
c
1
c
3
c
5
c
7
3
7
7
7
7
7
7
7
7
7
7
5
2
6
6
6
6
6
6
6
6
6
6
4
u
(0)
u
(1)
u
(2)
u
(3)
u
(4)
u
(5)
u
(6)
u
(7)
3
7
7
7
7
7
7
7
7
7
7
5
(3)
where
c
i
=
cos
i
16
;
i
=
(2
n
+
1)
k
(4)
Equation
(3)
sho
ws
that
the
matrix
computes
in
column
wise
to
produce
1D-DCT
.
Computing
it
further
in
ro
w
wise
produces
the
2D-DCT
.
Considering
the
intensi
v
e
computations
required
in
DCT
,
man
y
ef
ficient
algorithms
are
proposed
and
reported
in
literature
as
mentioned,
including
the
ISO/IEC
23002-2
used
in
the
preset
w
ork.
The
algorithm
is
gi
v
en
in
Figure
1.
The
inputs
are
ind0
to
ind7,
where
each
input
is
32-bit
wide.
Se
v
eral
v
ariables
are
defined
for
intermediate
operations,
where
finally
the
results
are
stored
in
the
outputs
outd0
to
outd7.
VLSI
Design
of
a
F
ast
Pipelined
8x8
Discr
ete
Cosine
T
r
ansform
(Nurulnajah
Mohd
Zabidi)
Evaluation Warning : The document was created with Spire.PDF for Python.
1432
ISSN:
2088-8708
Figure
1.
The
ISO/IEC
23002-2
1D-DCT
algorithm
One
of
the
c
ritical
components
in
the
DCT
algorithm
is
the
PMUL
operation,
gi
v
en
in
Figure
2.
The
PMUL
performs
Polynomial
Multiplication.
Equation
(5)
sho
ws
an
e
xample
of
polynomial
equation,
where
the
highest
de
gree
of
polynomial
in
this
equation
is
11.
The
e
xponential
could
be
eliminated
by
replacing
with
shift
right
operations.
Equation
(6)
sho
ws
the
con
v
ersion
results
of
all
the
e
xponential
in
equation
(5)
into
arithmetic
shift
right.
y
=
(
y
3
y
7
)
+
((
y
3
y
7
)
y
11
)
1
(5)
y
=
(
y
3
y
7)
+
((
y
3
y
7)
y
11)1
(6)
Figure
2.
PMUL
components
used
in
the
ISO/IEC
23002-2
1D-DCT
algorithm
2.
PIPELINE
CONCEPT
The
main
concept
in
circuit
pipelining
is
to
split
the
job
process
into
smaller
stages
which
will
help
to
enhance
the
performance
by
reducing
the
combinatorial
critical
path.
Figure
3
sho
ws
the
dif
ference
between
non
pipeline
and
pipeline
structure
in
terms
of
combinational
logic
circuit.
Non-pipeline
circuit
is
made
up
of
a
combinational
logic,
an
input
and
an
output.
Pipeline
circuit
is
made
up
of
a
combinational
logic
that
has
been
partitioned
into
smaller
portion
and
then
connected
by
re
gisters.
Essentially
,
pipelining
allo
ws
the
design
to
run
at
a
higher
operating
frequenc
y
at
a
ne
glible
cost
in
latenc
y
caused
by
initializing
the
pipeline
stages.
IJECE
V
ol.
7,
No.
3,
June
2017:
1430
–
1435
Evaluation Warning : The document was created with Spire.PDF for Python.
IJECE
ISSN:
2088-8708
1433
Figure
3.
Non-pipeline
and
pipeline
concept
3.
PR
OPOSED
DCT
HARD
W
ARE
ARCHITECTURE
The
ISO/IEC
23002-2
algorithm
presented
in
Figure
1
is
translated
into
a
hardw
are
architecture
via
a
dataflo
w
graph
(DFG),
sho
wn
in
Figure
4.
There
are
a
total
of
54
intermediate
v
ariables,
which
translates
into
wires
for
non-
pipeline
implementation.
Similar
to
the
input
and
output
widths,
wires
are
set
at
32-bits
each.
The
algorithm
also
utilizes
27
subtractors,
19
adders,
and
24
shifters.
From
the
DFG,
it
can
be
seen
that
the
longest
path
is
7
arithmetic
operators.
Therefore,
the
design
can
be
split
for
pipelining
to
reduce
the
path
length.
Essentially
,
the
DFG
can
be
partitioned
into
stages,
and
at
each
partition
bounday
,
re
gisters
are
added
to
store
and
hold
intermediate
data.
It
should
be
noted
that
a
v
ery
small
initial
delay
is
e
xpected
to
fill
in
the
pipeline
re
gisters.
Optimum
paritioning
boundary
can
be
found
using
some
of
the
proposed
algorithms
in
[11].
Figure
4.
Data
Flo
w
Graph
of
Non-pipeline
DCT
VLSI
Design
of
a
F
ast
Pipelined
8x8
Discr
ete
Cosine
T
r
ansform
(Nurulnajah
Mohd
Zabidi)
Evaluation Warning : The document was created with Spire.PDF for Python.
1434
ISSN:
2088-8708
4.
RESUL
T
AND
AN
AL
YSIS
In
this
w
ork,
Mentor
Graphics
ED
A
tools
with
VLSI
process
technology
TSMC
0.13um
CMOS
are
used
to
design,
implement
and
v
alidate
the
DCT
.
Some
of
the
tools
used
include
Pyxis
for
schematic
and
ph
ysical
design;
Modelsim
for
high-le
v
el
simulation;
and
Eldonet
and
EZ
w
a
v
e
for
lo
w-le
v
el
SPICE
simulation.
Semi-custom
VLSI
design
methodology
is
used,
whereby
the
arithmetic
operators
are
instantiated
manually
from
a
standard
cell
library
to
form
a
complete
DCT
.
The
simulation
results
obtained
from
Modelsim
at
the
R
TL
le
v
el
is
compared
to
the
one
obtained
from
SPICE
simulation
to
ensure
correct
results.
Here,
we
sho
w
results
for
non-pipeline
and
a
2-stage
pipeline
implementation.
The
follo
wing
is
the
number
of
components
used
for
the
DCT
.
There
are
a
total
of
19
Adders,
27
subtractors
and
24
shifters
(all
32-bits)
that
ha
v
e
been
used
in
both
Non-pipeline
DCT
and
pipeline
DCT
.
As
for
re
gsiters,
16
32-bit
and
42
32-bit
re
gisters
ha
v
e
been
used
in
Non-pipeline
and
Pipeline
DCT
respecti
v
ely
.
The
pipeline
design
uses
roughly
2.6
times
more
re
gisters
compared
to
non-pipeline.
As
for
the
total
number
of
transistors,
non-pipeline
DCT
consists
of
74240,
while
pipeline
DCT
consists
of
85120,
with
increase
of
roughly
14%
more
resource
due
to
the
pipeline
re
gisters.
By
simulation
for
tw
o
dif
ferent
test
patterns
called
pattern1
and
pattern
2
(deri
v
ed
from
the
F
oreman
QCIF
video
frame),
the
critical
data
path
(cpd)
for
Non-pipel
ine
DCT
is
found
to
be
7.97ns
for
input
pattern
1,
and
6.59ns
for
input
pattern
2.
As
for
the
Pipeline
DCT
,
it
is
found
that
the
cpd
is
4.61ns
for
input
pattern
1
and
3.83ns
for
input
pattern
2.
From
this
,
it
can
be
estimated
that
the
non-pipeline
DCT
can
run
at
the
maximum
speed
of
125MHz,
and
the
pipeline
DCT
at
217MHz.
In
terms
of
throughput,
maximum
output
rate
achie
v
ed
are
1
Gig
a
pix
els/s
and
1.7
Gig
a
pix
els/s
respecti
v
ely
for
non-pipeli
ne
and
pipeline
DCTs.
The
graph
in
Figure
5
sho
ws
the
plot
of
maximum
operating
frequenc
y
vs
throughput.
Both
of
these
designs
can
support
the
speed
requirements
of
well
abo
v
e
8K
UHD
resolution
at
real-time.
It
should
be
noted
ho
we
v
er
that
there
are
man
y
other
f
actors
and
components
that
may
af
fec
t
the
speed
of
this
DCT
when
inte
grated
into
a
complete
video
codec.
Figure
5.
Maximum
frequenc
y
(Fmax)
vs
Throughput
in
Mpix
els/s
T
able
1
sho
ws
the
comparison
to
three
similar
w
orks
in
literature:
Madanayak
e
et
al
[6]
with
45nm
CMOS,
W
ahid
et
al
[7]
with
0.18um
CMOS,
and
Fu
et
al
[8]
with
0.18
um
CMOS
technologies.
In
terms
of
throughput,
It
can
be
seen
that
our
designs
are
superior
to
[7,
8],
b
ut
trails
the
design
from
[6].
This
w
ork
sho
ws
higher
performance
due
to
the
smaller
feature
technology
used.
Ho
we
v
er
,
our
design
has
sho
wn
to
meet
the
requirements
of
state-of-the-art
video
coding
resolution
e
v
en
when
using
a
lar
ger
technology
.
T
able
1.
Comparison
with
similar
w
orks
in
literature
Madan
yak
e
W
ahid
Fu
et
al
Non-pipeline
Pipeline
et
al
[6]
et
al
[7]
[8]
DCT
DCT
CMOS
T
echnology
(nm)
45
180
180
130
130
Operating
frequenc
y
(MHz)
946
194.7
75
116
250
throughput
(Mp/s)
7568
194.7
75
1000
1739
IJECE
V
ol.
7,
No.
3,
June
2017:
1430
–
1435
Evaluation Warning : The document was created with Spire.PDF for Python.
IJECE
ISSN:
2088-8708
1435
5.
CONCLUSION
The
objecti
v
e
of
the
present
paper
is
to
present
results
for
implementing
a
f
ast
VLSI
pipelined
multiplierless
fix
ed-point
8x8
DCT
.
The
algorithm
used
is
the
ISO/IEC
23002-2
for
the
image
and
video
compr
ession
technology
.
The
design
methodology
be
gins
with
modeling
the
algorithm
using
dataflo
w
graphs
in
order
to
analyze
for
pipelining.
The
results
sho
w
that
the
implementations
are
feasible
to
be
applied
in
the
latest
ultra-high
definition
8K
video,
e
v
en
when
using
the
relati
v
ely
lar
ge
0.13um
technology
.
W
e
ha
v
e
implemented
and
compared
the
DCT
using
tw
o
architectures,
non-pipeline
and
a
2-stage
pipeline.
Simulation
results
v
alidated
the
e
xpectation
that
throughput
can
be
impro
v
ed
by
almost
a
f
actor
of
tw
o,
from
around
1
Gig
a
pix
els/s
t
o
1.8
Gig
a
pix
els/s,
at
a
small
cost
of
roughly
14%
more
resource
(i.e.
pipeline
re
gisters).
Future
w
ork
aims
to
e
xtend
the
methodology
for
implementing
dif
ferent
DCT
dimensions
for
other
applications.
A
CKNO
WLEDGEMENT
The
authors
w
ould
lik
e
to
thank
the
Malaysia
Ministry
of
Education
for
pro
viding
the
funds
for
the
w
ork
in
this
paper
(V
ote
no.
4F659).
REFERENCES
[1]
V
.
Sze,
M.
Budag
a
vi,
G.J.
Sulli
v
an,
“High
Ef
ficienc
y
V
ideo
Coding,
”
Springer
,
2014.
[2]
He
yne,
B.
Gotze,
J.,
A
Lo
w
Po
wer
and
High
Quality
Implementation
of
The
Discrete
Cosine
T
ransformation,“
Adv
ances
in
Radio
Science
,
2007.
[3]
U.
S.
Potluri,
A.
Madanayak
e,
R.
J.
Cintra,
F
.
M.
Bayer
,
S.
K
ulasek
era,
and
A.
Edirisuriya,
Impro
v
ed
8-point
approximate
DCT
for
image
and
video
compression
requiring
only
14
additions,
IEEE
T
ransactions
on
Circuits
and
Systems
I:
Re
gular
P
apers
,
v
ol.
6,
no.
6,
pp.
1727-1740,
June
2014.
[4]
ISO/IEC,
Information
technology
-
MPEG
video
technologies
-
P
art
2:
Fix
ed-point
8x8
in
v
erse
discrete
cosine
transform
and
discrete
cosine
transform,
International
Standard
,
Dec
2014.
[5]
A.
Madanayak
e
et
al.,
”Lo
w-Po
wer
VLSI
Architectures
for
DCT/D
WT
:
Precision
vs
Approximation
for
HD
V
ideo,
Biomedical,
and
Smart
Antenna
Applications,
”
IEEE
Circuits
and
Systems
Mag
azine
,
v
ol.
15,
no.
1,
pp.
25-47,
2015.
[6]
A.
Edirisuriya,
A.
Madanayak
e,
R.
Cintra,
V
.
Dimitro
v
and
N.
Rajapaksha,
”A
single-channel
architecture
for
al-
gebraic
inte
ger
-based
8x8
2-D
DCT
computation”,
IEEE
T
ransacti
o
ns
on
Circuits
Systems
for
V
ideo
T
echnology
,
v
ol.
23,
no.
12,
pp.
2083-2089,
Dec.
2013.
[7]
K.
A.
W
ahid,
M.
Martuza,
M.
Das
and
C.
McCrosk
y
,
”Ef
ficient
hardw
are
implementa
tion
of
8x8
inte
ger
cosine
transforms
for
multiple
video
codecs”,
Journal
of
Real-T
ime
Processing
,
pp.
1-8,
July
2011.
[8]
M.
Fu,
G.
A.
Jullien,
V
.
S.
Dimitro
v
and
M.
Ahmadi,
”A
lo
w-po
wer
DCT
IP
core
based
on
2D
algebraic
inte
ger
encoding”,
Proceedings
of
the
International
Symposium
on
Circuits
Systems
(ISCAS)
,
v
ol.
2,
pp.
765-768.
[9]
A.
A.
H.
Ab-Rahman
and
I.
Kamisian
and
A.
Z.
Sha’ameri,
VLSI
design
and
implementation
of
adapti
v
e
channel
equalizer
,
2008
International
Conference
on
Computer
and
Communication
Engineering
,
pp.
1121-1124,
May
2008.
[10]
Rachmad
V
idya
W
icaksana
Putra,
Rella
Mareta,
Nurfitri
Anbarsanti
,
T
rio
Adiono
A
Ne
w
R
TL
Design
Approach
for
a
DCT/IDCT
-Based
Image
Compression
Archi
tecture
using
the
mCBE
Algorithm,
Journal
of
ICT
Research
and
Applications
,
V
ol.
6,
No.
2,
pp
131-150,
2012
[11]
A.
Prihozh
y
and
E.
Bezati
and
A.
A.
H.
Ab
Rahman
and
M.
Matta
v
elli
Synthesis
and
Optimization
of
Pipelines
for
HW
Implementations
of
Dataflo
w
Programs,
IEEE
T
ransactions
on
Computer
-Aided
Design
of
Inte
grated
Circuits
and
Systems
,
V
ol.
34,
no.
10,
pp
1613-1626,
2015.
VLSI
Design
of
a
F
ast
Pipelined
8x8
Discr
ete
Cosine
T
r
ansform
(Nurulnajah
Mohd
Zabidi)
Evaluation Warning : The document was created with Spire.PDF for Python.