Indonesian Journal of Electrical Engineering and Computer Science
Vol. 20, No. 2, September 2020, pp. 854-862
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v20i2.pp854-862
Throughput maximization for full-duplex two-way relay with finite buffers

Betene Anyugu Francis Lin
Department of Electronic Engineering, Shanghai Jiao Tong University, China
Article Info

Article history:
Received Feb 11, 2020
Revised Apr 14, 2020
Accepted Apr 29, 2020

Keywords:
Buffer-aided relaying
Full-duplex
Q-learning
Reinforcement learning
Throughput

ABSTRACT
Optimal queueing control of multi-hop networks remains a challenging problem, especially in two-way relaying systems, even in the most straightforward scenarios. In this paper, we explore two-way relaying with a full-duplex decode-and-forward relay that has two finite buffers. Principally, we propose a novel concept based on multi-agent reinforcement learning (that maximizes the cumulative network throughput) built on the combination of the buffer states and the lossy links; a decision is generated as to whether the relay can transmit, receive, or even simultaneously receive and transmit information. Towards this objective, based on the queue state transitions and the lossy links, an analytic Markov decision process is proposed to analyze this scheme, and the throughput and queueing delay are derived. Our numerical results reveal exciting insights. First, artificial intelligence based on reinforcement learning is optimal when the length of the buffer is greater than a certain threshold. Second, we demonstrate that reinforcement learning can boost transmission efficiency and prevent buffer overflow.
Copyright © 2020 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Betene Anyugu Francis Lin,
Department of Electrical and Computer Engineering,
Shanghai Jiao Tong University, Shanghai, China.
Email: francislin@sjtu.edu.cn
1. INTRODUCTION
With the significant evolution of computers in terms of speed, it has become easy to implement new algorithms that sufficiently boost their performance, and artificial intelligence (AI) has become the keyword that defines the future and everything it holds. Not only has artificial intelligence taken over traditional methods of computing, but it has also changed the way industries perform. Recently, AI algorithms have fascinated researchers and have been applied successfully to solve problems in engineering, such as visual perception, speech recognition, decision-making, and translation between languages.
Our research aims to design a new concept of artificial intelligence based on the full-duplex two-way relaying system with reinforcement learning, such that the achievable data rate/throughput is maximized and the bottleneck problem of buffer overflow is remedied.
Two-way relaying, which was first studied by Shannon in [1], adds a relay to boost the exchange of information between two nodes and has been extensively analyzed and implemented to overcome the incessant demand for speed. In [2-10], the capacity of the two-way relay channel (TWRC) in half-duplex and full-duplex (FD) modes has been extensively investigated, and the average throughput and delays of the proposed protocols were evaluated.
With the benefit of AI, reinforcement learning (RL) is used to increase the performance of cooperative wireless networks in half-duplex mode. In [11], the authors introduced the framework of communication systems and the jamming modes commonly used in communication; then the basic principle of the quality-learning (Q-learning) algorithm is briefly introduced. However, they did not analyze the throughput and system delay.

Journal homepage: http://ijeecs.iaescore.com
In [12], the authors examined the problem of relay node selection in cooperative networks, in which one relay node can be used by multiple source-destination transmission pairs, and all transmission pairs share the same set of relay nodes. However, the throughput was not optimized.
In [13], the authors investigate the power control problem in a cooperative network with multiple wireless transmitters, multiple amplify-and-forward relays, and one destination. The goal is to maximize the long-term system throughput by fully exploiting multi-user diversity in the network. Nevertheless, the power control problem in a cooperative network for the two-way relay was not investigated. Furthermore, the delay performance of the network, which is crucial for evaluating the system, was not investigated.
In [14, 15], the authors present a novel deep reinforcement learning-based joint spectrum sensing and power control algorithm for downlink communications in a cognitive small cell. Regrettably, the authors did not investigate the outage probability of the whole system.
In [16], the authors propose an iterative optimal power allocation (OPA) method to maximize the derived end-to-end sum rate, based on the Lagrangian and Newton-Raphson algorithms under a total power constraint. The authors used reinforcement learning for two-way relaying systems. However, the system delay was not investigated.
However, in [17], the authors investigated a practical two-way relay protocol with an enhanced transmission scheduling, which takes joint consideration of the finite relay buffers, signalling overheads, and lossy links. In this regard, to improve the performance of the system, full-duplex two-way relaying with buffers is proposed due to its throughput merits.
The authors in [18] propose a novel adaptive protocol (that maximises the cumulative network throughput) based on the combination of the buffer states, the lossy links, and the outage probability; a decision is generated as to whether the relay can transmit, receive, or even simultaneously receive and transmit information. Similarly, they investigated the queue states and the transition matrix; an analytic Markov chain model was proposed to analyse this scheme, and the throughput and queueing delay were derived.
From the available literature, we are the first to have proposed a concept of the full-duplex two-way relay (FD-TWR) with reinforcement learning such that the achievable data rate/throughput is maximized.
The contribution of this paper can be summarized as follows. We propose a reinforcement learning approach for the full-duplex two-way relay based on multiple agents. As far as we know, we are the first to introduce the use of reinforcement learning for the full-duplex two-way relay, and we not only take into consideration the instantaneous qualities of the involved links but also consider the states of the queues at the buffers. We present a general framework for obtaining the average throughput and the system delay based on a Markov decision process over the queue states at the buffers.
2. RESEARCH METHOD
2.1. Channel models
Throughout this paper, we assume that the source and the destination always transmit data. We investigate the maximum rate of the three-node full-duplex two-way relay (FD-TWR) with finite buffers, denoted Qa and Qb, operating in the decode-and-forward mode. There is no direct link between node A and node B. We consider the buffer-aided FD-TWR in which two users, named A and B, exchange data via the relay R, which is equipped with two finite buffers; see Figure 1. On another note, we denote T as the state in which the relay transmits data, R as the state in which the relay receives data, and TR as the state in which the relay transmits and receives data at the same time. In parallel, we denote E as the state in which the buffers are empty, F as the state in which the buffers are full, and FE as the state in which the buffers are neither empty nor full.
We denote the channel coefficients h_AR, h_BR, and h_RR, representing the channel from source A to the relay R, the channel from source B to the relay R, and the self-interference, respectively. Namely, the relay can transmit and receive data at the same time.
Moreover, due to the co-channel transmission and imperfect interference cancellation, we assume that the FD-TWR undergoes more severe self-interference. Thus, in essence, one of the byproducts of this work is, unlike in [19], to assume that user A and user B have enough information to send to the relay in some time slots and not enough information in other time slots. In particular, we assume that the buffers can also be empty or full in any time slot.
Finally, for the theoretical approach, we assume that the full-duplex relay is perfect, which means that h_RR = 0 and that h_AR and h_BR are binary dependent variables.
Figure 1. System model
2.2. Reinforcement learning for multi-agent systems
In this section, unlike [11-15], where unsupervised and supervised learning are used, we implement the system model based on reinforcement learning, as shown in Figure 2.
Figure 2. Reinforcement learning cycle
The current state of the environment observed by the agent is defined as s_t, and the action of the agent is defined as a_t. The reward (or penalty) received for action a_t reflects how good the previous action a_{t-1} was in the environment state s_{t-1}. The model of single-agent RL is a Markov decision process. It is defined as:

$$ f : S \times A \to [0, 1], \qquad \rho : S \times A \to \mathbb{R} \qquad (1) $$
where S is the finite set of environment states, A is the finite set of agent actions, f is the state transition probability function, and ρ is the reward function. For a dynamic environment, the transition probability function f_t and the reward function ρ_t become time-dependent, where s_{t-1} and s_t are the environment states at times t-1 and t, and a_t is the agent action at time t.
The behavior of the agent is described by the policy, which specifies how the agent chooses its actions in a given state. The policy may be either static or dynamic. In the static case, π : S → A, the policy is stationary. In the dynamic case, π_t : S × A → [0, 1], the policy is time-varying.
The path planning is to find a policy such that the return R is maximized for the state s:

$$ R^{\pi}(s) = E\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{k+1} \,\middle|\, s_0 = s, \pi \right\} \qquad (2) $$
where γ ∈ [0, 1) is the discount factor. This expectation is taken over the probabilistic state transitions under the policy π. R^π also represents the reward accumulated by the agent. The task of the agent is, therefore, to maximize its long-term performance, which is obtained by computing the optimal state-action value function, called the Q-function.
The Q-function, Q^π : S × A → ℝ, is given by

$$ Q^{\pi}(s, a) = E\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{k+1} \,\middle|\, s_0 = s, a_0 = a, \pi \right\} \qquad (3) $$
The optimal Q-function is defined as

$$ Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a) \qquad (4) $$
Q*(s, a) can be written as

$$ \begin{aligned} V^{*}(s) &:= \max_{a \in A(s)} Q^{*}(s, a) \\ &= \max_{a} E\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right] \\ &= \max_{a} E\left[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_t = s, a_t = a \right] \\ &= \max_{a} E\left[ r_{t+1} + \gamma V^{*}(s_{t+1}) \,\middle|\, s_t = s, a_t = a \right] \\ &= \max_{a} \sum_{s' \in S} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right] \end{aligned} \qquad (5) $$
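The last line of (5) is the Bellman optimality equation, which can be solved by repeated backups. The following sketch runs value iteration on a small tabular MDP; the three states, two actions, transition probabilities, and rewards are illustrative placeholders rather than the relay model itself:

```python
import numpy as np

# Toy MDP (illustrative placeholders, not the relay model):
# P[a, s, s'] = transition probability, R[a, s, s'] = reward.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # action 0
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.1, 0.0, 0.9]],   # action 1
])
R = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]],
])
gamma = 0.8  # discount factor, as in (2)

V = np.zeros(3)
for _ in range(500):
    # Bellman backup: V*(s) = max_a sum_s' P(s'|s,a)[R(s,a,s') + gamma V*(s')]
    Q = np.einsum('ast,ast->as', P, R) + gamma * np.einsum('ast,t->as', P, V)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print(V)  # optimal state values
```

Because the backup is a γ-contraction, the sweep converges geometrically to the unique fixed point of (5).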
In the multi-agent case, the state transitions are the result of the joint actions of all agents:

$$ \rho : S \times A \times S \to \mathbb{R}, \qquad f : S \times A \times S \to [0, 1], \qquad A = A_1 \times A_2 \times \cdots \times A_n \qquad (6) $$
The joint policy is denoted π, with joint actions a_k = [a_{1,k}^T, ..., a_{n,k}^T]^T, a_k ∈ A, a_{i,k} ∈ A_i. The return of each agent depends on the joint policy:

$$ R_i^{\pi}(x) = E\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{k+1} \,\middle|\, s_0 = s, \pi \right\} \qquad (7) $$
The Q-function of each agent depends on the joint action and the joint policy, Q_i^π : S × A → ℝ:

$$ Q_i^{\pi}(s, a) = E\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{k+1} \,\middle|\, s_0 = s, a_0 = a, \pi \right\} \qquad (8) $$
The transition diagram is shown in Figure 3.
Figure 3. Transition diagram
2.3. Throughput-delay analysis
In this section, we present a general framework for the throughput maximization. For simplicity, the Markov decision process is reducible, since it contains isolated states. We can represent all of these mechanics in a reward table. In this table, the rows represent the states and the columns represent the actions. The values of this matrix represent the rewards; the value (-1) indicates that a specific action is not available. We assume that the probabilities of moving from one state to another can be represented as p_1 and p_2. Therefore, for simplicity, we assume that p_1 = p_2. Consequently, we can represent all of these mechanics in the reward table R:
$$ R = \begin{bmatrix} -1 & -1 & -1 & -1 & 0 & -1 \\ -1 & -1 & -1 & 0 & -1 & 100 \\ -1 & -1 & -1 & 0 & -1 & -1 \\ -1 & 0 & 0 & -1 & 0 & -1 \\ 0 & -1 & -1 & 0 & -1 & 100 \\ -1 & 0 & -1 & -1 & 0 & 100 \end{bmatrix} \qquad (9) $$
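The reward table in (9) maps directly to an array in code. A minimal sketch, assuming the 0-indexed state numbering of the matrix rows:

```python
import numpy as np

# Reward table R from (9): rows are states, columns are actions.
# -1 marks an unavailable action; 100 is the reward for reaching the goal state.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

def available_actions(state):
    """Return the actions allowed in `state`, i.e. those not marked -1."""
    return np.where(R[state] >= 0)[0]

# Row "5" allows actions 1, 4, and 5, with rewards 0, 0, and 100.
print(available_actions(5), R[5, available_actions(5)])
```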
We have three possible actions: 1, 4, or 5, with rewards 0, 0, and 100, respectively. These are simply all the non-negative values from row "5", and we are interested in the one with the most significant value. Then, we select the biggest Q value among those possible actions, by selecting Q(5,1), Q(5,4), and Q(5,5).
$$ Q = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 100\,p_1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \qquad (10) $$
By selecting the action "1", our next state will have the following possible actions:
$$ Q = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 100\,p_1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 80 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \qquad (11) $$
$$ Q = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 400\,p_1 \\ 0 & 0 & 0 & 0 & 320\,p_1 & 100\,p_1 \\ 0 & 0 & 0 & 0 & 320\,p_1 & 0 \\ 0 & 400\,p_1 & 256\,p_1 & 0 & 0 & 500\,p_1 \\ 320\,p_1 & 0 & 0 & 0 & 320\,p_1 & 500\,p_1 \\ 0 & 400\,p_1 & 0 & 0 & 0 & 0 \end{bmatrix} \qquad (12) $$
$$ Q(3, 1) = R(3, 1) + 0.8 \cdot \max\left[ Q(1, 3),\, Q(1, 5) \right] p_1 = 0 + 0.8 \cdot 100\,p_1 = 80\,p_1 \qquad (13) $$
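The update in (13) is the tabular Q-learning rule Q(s,a) = R(s,a) + γ·max_a' Q(s',a') with γ = 0.8. A minimal training loop over the reward table of (9) can be sketched as follows; the episode count, the random-start exploration, and the convention that the chosen action index is also the next state are assumptions carried over from this style of example (with p_1 = 1 so the probability factor drops out):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.8   # discount factor used in (13)
goal = 5      # state rewarded with 100 in the table (9)

# Reward table from (9)
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

Q = np.zeros((6, 6))
for _ in range(2000):                     # number of episodes is illustrative
    s = rng.integers(0, 6)                # random initial state
    while True:
        actions = np.where(R[s] >= 0)[0]  # allowed actions in state s
        a = rng.choice(actions)           # pure random exploration
        s_next = a                        # here the action index is the next state
        # Q-learning update as in (13): Q(s,a) = R(s,a) + gamma * max_a' Q(s',a')
        Q[s, a] = R[s, a] + gamma * Q[s_next].max()
        if s_next == goal:
            break
        s = s_next

# Normalized to percent of the largest entry, cf. the converged matrix (14)
print(np.round(100 * Q / Q.max()).astype(int))
```

After enough episodes the entries settle at the fixed point of the update (largest entry 500, i.e. 100/(1 - γ)), and the normalized matrix matches the converged table.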
After a large number of episodes (100,000), our Q matrix can be considered to have converged; in this case, Q will look like this:
$$ Q = \begin{bmatrix} 0 & 0 & 0 & 0 & 80\,p_1 & 0 \\ 0 & 0 & 0 & 64\,p_1 & 0 & 100\,p_1 \\ 0 & 0 & 0 & 64\,p_1 & 0 & 0 \\ 0 & 80\,p_1 & 51\,p_1 & 0 & 80\,p_1 & 0 \\ 64\,p_1 & 0 & 0 & 64\,p_1 & 0 & 100\,p_1 \\ 0 & 80\,p_1 & 0 & 0 & 80\,p_1 & 100\,p_1 \end{bmatrix} \qquad (14) $$
Then, we define the throughput η as the amount of data of both A and B successfully delivered per second, which can be given by
$$ \eta = \sum_{1}^{Q_a} \sum_{1}^{Q_b} Q^{\pi}(s, a)\, T_{Q_a Q_b} = \sum_{1}^{Q_a} \sum_{1}^{Q_b} \max_{\pi} Q^{\pi}(s, a)\, T_{Q_a Q_b} = \sum_{1}^{Q_a} \sum_{1}^{Q_b} \max_{a} \sum_{s' \in S} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right] T_{Q_a Q_b} \qquad (15) $$
where T is the fixed transmission rate of the nodes. After that, we denote Q̄ as the average length of the buffers; then we have
$$ \bar{Q} = \sum_{a=1}^{Q_a} \sum_{b=1}^{Q_b} n_{a,b}\, \pi_{a,b} = \sum_{k=1}^{n} k p^{k} = \frac{p \left( n p^{n+1} - (n+1) p^{n} + 1 \right)}{(p-1)^{2}}. \qquad (16) $$
The last equality in (16) can be proved by induction as follows.
When n = 1, we have

$$ p = \frac{p \left( p^{1+1} - (1+1) p + 1 \right)}{(p-1)^{2}} = \frac{p \left( p^{2} - 2p + 1 \right)}{(p-1)^{2}}. $$
Assume that it holds for n, n ≥ 2; then for n + 1, we have
$$ \begin{aligned} \bar{Q} &= (n+1) p^{n+1} + \frac{p \left( n p^{n+1} - (n+1) p^{n} + 1 \right)}{(p-1)^{2}} \\ &= \frac{(n+1) p^{n+1} (p-1)^{2} + p \left( n p^{n+1} - (n+1) p^{n} + 1 \right)}{(p-1)^{2}} \\ &= \frac{p \left( (n+1) p^{n+2} - (n+2) p^{n+1} + 1 \right)}{(p-1)^{2}} \\ &= \sum_{k=1}^{n+1} k p^{k}. \end{aligned} \qquad (17) $$
Thus, (16) has been proved.
Finally, the average packet delay T can be properly defined according to Little's law [25], i.e.,

$$ T = \frac{\bar{Q}}{\eta}. \qquad (18) $$
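Both the closed form in (16) and the delay expression in (18) are easy to sanity-check numerically; the values of p, n, and the throughput below are illustrative, not the simulation parameters:

```python
def qbar_closed(p, n):
    """Closed form of (16): p(n p^(n+1) - (n+1) p^n + 1) / (p-1)^2."""
    return p * (n * p**(n + 1) - (n + 1) * p**n + 1) / (p - 1)**2

def qbar_direct(p, n):
    """Direct evaluation of sum_{k=1}^{n} k * p**k."""
    return sum(k * p**k for k in range(1, n + 1))

# The closed form agrees with the direct sum for p != 1.
for p in (0.3, 0.7, 1.5):
    for n in (1, 5, 40):
        diff = abs(qbar_closed(p, n) - qbar_direct(p, n))
        assert diff <= 1e-9 * max(1.0, abs(qbar_direct(p, n)))

# Little's law (18): average packet delay = average queue length / throughput.
eta = 1.375                 # throughput (illustrative, cf. Figure 5)
T_delay = qbar_closed(0.7, 40) / eta
print(T_delay)
```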
3. RESULTS AND DISCUSSION
In this section, we evaluate the performance of the buffer-aided FD-TWR in terms of the throughput and the system delay according to the formulas derived in the above section. For the simulations, the size of the buffers (N) is finite, and we adopt N = 40 in all simulations. T_{Qa Qb} is a variable rate, T = 5 is a fixed rate, and p_1 = p_2 = 1.
Figure 4 compares the system delay of the buffer-aided FD-TWR with Q-learning and without Q-learning versus the size of the buffers. It reveals that when the size of the buffers increases, the system delay is reduced, and the average packet delay of the full-duplex relay with Q-learning is smaller than that of the relay without Q-learning. This means that, with Q-learning, beyond a certain threshold value (15) the size of the buffer no longer affects the performance of the system; this is due to the reinforcement learning algorithm.
Figure 5 compares the throughput of the buffer-aided FD-TWR with Q-learning and without Q-learning versus the size of the buffers. The results show that when the buffer size exceeds 5, the throughput is constant for both curves; this is due to the optimization of the full-duplex relay, which can receive and send at the same time. Moreover, we observe that the two curves are correlated. Both curves are steady, and the throughput of the buffer-aided FD-TWR with Q-learning overtakes that of the one without Q-learning.
[Figure: plot of system delay versus buffer size (0-40), with and without Q-learning; data cursors at buffer size 15 mark delays of 21.75 and 2.398.]

Figure 4. System delay versus buffer size
[Figure: plot of throughput (Mbps) versus buffer size (0-40), with and without Q-learning; data cursors at buffer size 5 mark 1.375 Mbps and 0.6768 Mbps.]

Figure 5. Throughput versus buffer size
4. CONCLUSION
In this paper, we studied the throughput and the system delay in FD-TWR networks using a reinforcement learning scheme. Specifically, we considered a buffer-aided relay that has finite-length buffers. Based on the Markov decision process, we characterized a reinforcement learning approach that can maximize the aggregate network throughput by taking into consideration the constraint of the buffer state. Moreover, we compared the performance of the proposed transmission policy for the FD-TWR with the reinforcement learning approach and without the reinforcement learning scheme, especially when the full-duplex relay has finite buffers. The numerical results showed that buffering relay techniques improve the capacity of relay networks in slow fading environments, and that the reinforcement learning scheme significantly improves the throughput of the system by queuing or dequeuing the buffers rapidly according to the buffer states and the lossy links. Furthermore, the reinforcement learning scheme does not need a large buffer size to maximize the average throughput and minimize the system delay. It also remedies the bottleneck problem of buffer overflow.
ACKNOWLEDGMENTS
This research was supported by Project 1236130208, funded by the CSC (China Scholarship Council).
REFERENCES
[1] C. E. Shannon, et al., "Two-way communication channels," Proc. 4th Berkeley Symp. Math. Stat. Prob., vol. 1, pp. 611-644, 1961.
[2] G. Kramer, et al., "Cooperative communications," Foundations and Trends in Networking, vol. 1, no. 3-4, pp. 271-425, 2007.
[3] P. Popovski and H. Yomo, "Physical network coding in two-way wireless relay channels," IEEE International Conference on Communications, pp. 707-712, 2007.
[4] L. Ong, "The half-duplex Gaussian two-way relay channel with direct links," IEEE International Symposium on Information Theory, pp. 1891-1895, 2015.
[5] L. Ding, M. Tao, F. Yang, and W. Zhang, "Joint scheduling and relay selection in one- and two-way relay networks with buffering," IEEE International Conference on Communications, pp. 1-5, 2009.
[6] H. Liu, P. Popovski, E. De Carvalho, and Y. Zhao, "Sum-rate optimization in a two-way relay network with buffering," IEEE Communications Letters, vol. 17, no. 1, pp. 95-98, 2013.
[7] Y. Gu and S. Aissa, "Interference aided energy harvesting in decode-and-forward relaying systems," IEEE International Conference on Communications, pp. 5378-5382, 2014.
[8] N. Zlatanov, D. Hranilovic, and J. S. Evans, "Buffer-aided relaying improves throughput of full-duplex relay networks with fixed-rate transmissions," IEEE Communications Letters, vol. 20, no. 12, pp. 2446-2449, 2016.
[9] K. T. Phan and T. Le-Ngoc, "Power allocation for buffer-aided full-duplex relaying with imperfect self-interference cancelation and statistical delay constraint," IEEE Access, vol. 4, pp. 3961-3974, 2016.
[10] M. M. Razlighi and N. Zlatanov, "Buffer-aided relaying for the two-hop full-duplex relay channel with self-interference," IEEE Transactions on Wireless Communications, vol. 17, no. 1, pp. 477-491, 2018.
[11] Z. Zhang, Q. Wu, B. Zhang, and J. Peng, "Intelligent anti-jamming relay communication system based on reinforcement learning," 2nd International Conference on Communication Engineering and Technology, pp. 52-56, 2019.
[12] Z. Chen, T. Lin, and C. Wu, "Decentralized learning-based relay assignment for cooperative communications," IEEE Transactions on Vehicular Technology, vol. 65, no. 2, pp. 813-826, 2015.
[13] F. Shams, G. Bacci, and M. Luise, "Energy efficient power control for multiple-relay cooperative networks using Q-learning," IEEE Transactions on Wireless Communications, vol. 14, no. 3, pp. 1567-1580, 2014.
[14] X. Meng, H. Inaltekin, and B. Krongold, "Deep reinforcement learning-based power control in full-duplex cognitive radio networks," IEEE Global Communications Conference (GLOBECOM), pp. 1-7, 2018.
[15] X. Qiu, T. Jiang, and N. Wang, "Safeguarding multiuser communication using full-duplex jamming and Q-learning algorithm," IET Communications, vol. 12, no. 15, pp. 1805-1811, 2018.
[16] K. Chang and Y. Choi, "Performance evaluation of in-band full-duplex system using one-time-slot two-way relay," IEEE Systems Journal, 2019.
[17] S. Shi, S. Li, and J. Tian, "Markov modeling for practical two-way relay with finite relay buffer," IEEE Communications Letters, vol. 20, no. 4, pp. 768-771, 2016.
[18] B. A. F. Lin, X. Ye, and S. Hao, "Adaptive protocol for full-duplex two-way systems with the buffer-aided relaying," IET Communications, vol. 13, no. 1, pp. 54-58, 2018.
[19] V. Jamali, N. Zlatanov, and R. Schober, "Bidirectional buffer-aided relay networks with fixed rate transmission, part II: Delay-constrained case," IEEE Transactions on Wireless Communications, vol. 14, no. 3, pp. 1339-1355, 2015.
[20] A. Shafeeq and K. Hareesha, "Dynamic clustering of data with modified k-means algorithm," Proceedings of the 2012 Conference on Information and Computer Networks, pp. 221-225, 2012.
[21] N. Kamari, I. Musirin, Z. Hamid, and A. A. Ibrahim, "Optimal tuning of SVC-PI controller using whale optimization algorithm for angle stability improvement," Indonesian Journal of Electrical Engineering and Computer Science, vol. 12, no. 2, pp. 612-619, 2018.
[22] M. M. Saufi, M. A. Zamanhuri, N. Mohammad, and Z. Ibrahim, "Deep learning for Roman handwritten character recognition," International Journal of Electrical Engineering and Computer Science, vol. 12, no. 2, pp. 455-460, 2018.
[23] A. H. Basori, A. Tenriawaru, and A. B. F. Mansur, "Intelligent avatar on e-learning using facial expression and haptic," TELKOMNIKA Telecommunication, Computing, Electronics and Control, vol. 9, no. 1, 2011.
[24] R. Luo, W. Liao, and Y. Pi, "Discriminative supervised neighborhood preserving embedding feature extraction for hyperspectral-image classification," TELKOMNIKA Telecommunication, Computing, Electronics and Control, vol. 10, no. 5, pp. 1051-1056, 2012.
[25] J. D. Little, "OR FORUM: Little's law as viewed on its 50th anniversary," Operations Research, vol. 59, no. 3, pp. 536-549, 2011.
BIOGRAPHIES OF AUTHORS

Francis BETENE is a professional research assistant at REMI-Ultra-electronics, with a Master of Science from Hohai University, China (2013). He obtained a PhD degree in wireless technologies from Shanghai Jiao Tong University (China) in 2019. His research lies in the fields of electronics, digital systems, wireless communications, signal processing, and artificial intelligence. Recently, he has tackled the application of reinforcement learning to such systems. He is affiliated with IEEE as a reviewer member. He has served as an invited reviewer for IJECE, the IAES journals, and other scientific publications.