IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 9, No. 2, June 2020, pp. 183-192
ISSN: 2252-8938, DOI: 10.11591/ijai.v9i2.pp183-192
GS-OPT: A new fast stochastic algorithm for solving the non-convex optimization problem

Xuan Bui¹, Nhung Duong², Trung Hoang³
¹,³ Thuyloi University, 175 Tay Son, Vietnam
² Thai Nguyen University of Information and Communication Technology, Vietnam
Article Info

Article history:
Received Oct 28, 2019
Revised Dec 16, 2019
Accepted Feb 19, 2020

Keywords:
Bayesian learning
Non-convex optimization
Posterior inference
Stochastic optimization
Topic models
ABSTRACT

Non-convex optimization plays an important role in machine learning. However, the theoretical understanding of non-convex optimization remains rather limited. Studying efficient algorithms for non-convex optimization has attracted a great deal of attention from researchers around the world, but these problems are usually NP-hard to solve. In this paper, we propose a new algorithm, namely GS-OPT (General Stochastic OPTimization), which is effective for solving non-convex problems. Our idea is to combine two stochastic bounds of the objective function, both constructed from a common discrete probability distribution, namely the Bernoulli distribution. We consider GS-OPT carefully on both the theoretical and the experimental side. We also apply GS-OPT to the posterior inference problem in latent Dirichlet allocation. Empirical results show that our approach is often more efficient than previous ones.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Xuan Bui, Thuyloi University, 175 Tay Son, Dong Da, Hanoi, Vietnam.
Email: xuanbtt@tlu.edu.vn
1. INTRODUCTION

In machine learning, many problems lead to non-convex optimization. In this paper, we focus on optimization problems in machine learning of the form

x* = arg min_{x ∈ Ω} f(x)    (1)

where the objective function f(x) is smooth (possibly non-convex) on a compact domain Ω. If f(x) is convex, problem (1) is easy to solve by applying convex optimization methods such as gradient descent (GD), Newton's method, or stochastic GD [1].
But, in practice, non-convex models have several advantages over convex ones. For example, deep neural networks, which have been widely used in computer vision and data mining, give rise to highly non-convex optimization problems. In these cases, solving problem (1) is more difficult than in the convex case, because non-convex objectives usually have a multimodal structure, and common convex optimization methods may get trapped in poor local optima.
In this paper, we focus on proposing a new optimization method for solving problem (1) in which the objective function f(x) is non-convex and smooth. A growing number of studies consider the non-convex problem (1), for instance stochastic variance reduced gradient (SVRG) [2] and proximal SVRG (Prox-SVRG) [3]. SVRG and Prox-SVRG can be used to solve non-convex finite-sum problems, but they may not converge to the global optimum for non-convex functions.
The concave-convex procedure (CCCP) [4] is widely used for non-convex problems. It transforms the non-convex problem into a sum of a convex function and a concave one, and then linearizes the concave function. However, the complexity of a single loop of CCCP is much higher than that of SGD, because CCCP solves a quadratic program at each iteration. The graduated optimization algorithm (GOA) [5] is also a popular global search algorithm for non-convex problems, but directly calculating its gradient is costly.
In [6], the authors propose a new optimization algorithm named GradOpt. Experimental results in [6] show that GradOpt can quickly yield a much better solution than mini-batch SGD. However, [7], which proposes two algorithms (SVRG-GOA and PSVRG-GOA) for solving the non-convex problem, shows that GradOpt has some shortcomings: it converges slowly, partly due to the decrease of the step-size, and its application range is limited. The value of the objective function may become trapped around a value larger than the global minimum, because the smoothing parameter shrinks only slightly after several iterations.
Many well-known stochastic optimization algorithms, such as Adagrad [8], RMSProp [9], Adam [10], Adadelta [11], RSAG [12], Natasha2 [13], and NEON2 [14], have also been proposed for solving optimization problems in machine learning. The big challenges for non-convex optimization algorithms in machine learning are: Can a local/global optimum be found? Is it possible to get rid of saddle points, and how can they be escaped efficiently? Can the optimal solution be found in acceptable time on large data?
Finding an optimum of a non-convex optimization problem is NP-hard in the worst case [15]. Despite these intractability results, non-convex optimization is the main algorithmic technique behind many state-of-the-art machine learning and deep learning results. In light of this background, we state the main contributions of our paper:
a. Using the Bernoulli distribution and two stochastic approximation sequences, we develop GS-OPT for solving a wide class of non-convex problems, and we show that it usually performs better than previous algorithms.
b. Applying GS-OPT to the posterior inference problem in topic models, we obtain two methods for learning topic models: ML-GSOPT and Online-GSOPT. In addition, GS-OPT is very flexible, so we can adapt it to solve many non-convex models in machine learning.
Organization: This paper is structured as follows. In Section 2, a new algorithm for solving the non-convex optimization problem is proposed in detail. In Section 3, we apply GS-OPT to posterior inference in latent Dirichlet allocation and design two methods for learning LDA. In Section 4, we report results on two large datasets: New York Times and PubMed. Finally, we conclude the paper in Section 5.
Notation: Throughout the paper, we use the following conventions and notations. Boldface denotes vectors or matrices. x_i denotes the i-th element of vector x, and A_{ij} denotes the element at row i and column j of matrix A. The unit simplex in the n-dimensional Euclidean space is denoted as Δ_n = {x ∈ R^n : x ≥ 0, Σ_{k=1}^{n} x_k = 1}, and its interior is denoted as int Δ_n. We will work with text collections over V dimensions (the dictionary size). Each document d is represented as a frequency vector d = (d_1, ..., d_V)^T, where d_j is the frequency of term j in d. Denote by n_d the length of d, i.e., n_d = Σ_j d_j. The inner product of vectors u and v is denoted ⟨u, v⟩. I(x) is the indicator function, which returns 1 if x is true and 0 otherwise.
2. PROPOSED STOCHASTIC OPTIMIZATION ALGORITHM

We consider the optimization problem of the form

x* = arg min_{x ∈ Ω} [f(x) = g(x) + h(x)]    (2)

where the non-convex objective function f(x) consists of two components g(x) and h(x). Numerous models in machine learning fall into the framework of problem (2). For example, in Bayesian learning, we usually have to solve the maximum a posteriori (MAP) estimation problem:
x* = arg max_x [log P(D | x) + log P(x)]    (3)

where P(D | x) denotes the likelihood of the observed variable D, and P(x) denotes the prior of the hidden variable x.
Problem (3) can be rewritten in the form

x* = arg min_x [−log P(D | x) − log P(x)]    (4)
We also notice that problem (4) is an instance of problem (2) with g(x) = −log P(D | x) and h(x) = −log P(x).
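For intuition, consider a toy instance of this split (our illustration, not an example from the paper): MAP estimation of a coin bias x ∈ (0, 1) from a set D of coin flips under a Beta(a, b) prior. A minimal Python sketch of the two components is:

import numpy as np

# Toy instance (ours) of the split f = g + h in problem (2): MAP estimation
# of a coin bias x in (0, 1), given observed head/tail counts and a Beta prior.
def g(x, heads=7, tails=3):
    # negative log-likelihood: -log P(D | x)
    return -(heads * np.log(x) + tails * np.log(1.0 - x))

def h(x, a=2.0, b=2.0):
    # negative log-prior (up to an additive constant): -log P(x)
    return -((a - 1.0) * np.log(x) + (b - 1.0) * np.log(1.0 - x))

f = lambda x: g(x) + h(x)   # the objective of problem (2) for this toy model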
To solve problem (3), by changing the learning rate in the OFW algorithm [16] and considering the theoretical aspects carefully, OPE [17] was proposed for solving the MAP estimation problem in many probabilistic models. Compared with CCCP [4] and SMM [18], OPE has many preferable properties. First, the convergence rate of CCCP and SMM is unknown for non-convex problems. Second, while each iteration of SMM requires solving a convex problem, each iteration of CCCP has to solve a system of non-linear equations, which is expensive and non-trivial in many cases. Each iteration of OPE, in contrast, requires solving a linear program, which is significantly easier than a non-linear problem. Therefore, OPE promises to be much more efficient than CCCP and SMM. The convergence rate of OPE is also significantly faster than that of PMD [19] and HAMCMC [20]. In this section, we point out further important characteristics of OPE, some of which were investigated in [17].
In general, optimization theory has encountered many difficulties in solving non-convex optimization problems, and careful studies have shown that many methods are good in theory but inapplicable in practice. Therefore, instead of directly solving the non-convex problem with the true objective function f(x), OPE constructs a sequence of stochastic functions F_t(x) that approximates the objective of interest by choosing uniformly from {g(x), h(x)} at each iteration t. It is guaranteed that F_t converges to f as t → ∞. OPE is a stochastic optimization algorithm: it is straightforward to implement, computationally efficient, and suitable for problems that are large in terms of data and/or parameters. [17] showed, both experimentally and theoretically, the effectiveness of OPE when applied to posterior inference in the latent Dirichlet allocation model.
The main idea of OPE is thus to construct a stochastic sequence F_t(x) that approximates f(x) using the uniform distribution, so that (2) becomes easy to solve. Although OPE is better than the earlier methods, we want to explore a new stochastic optimization algorithm that solves problem (2) more efficiently. We identify the following limitations: the uniform distribution is too simple and therefore unsuitable for many problems, and using a single approximation of the true objective is less effective than using two approximation bounds.
After identifying these drawbacks of OPE, we make some improvements to obtain a new algorithm, GS-OPT. It makes sense that two stochastic sequences approximating the objective function f(x) are better than one. Using the Bernoulli distribution, we therefore construct two sequences that both converge to f(x): one begins with g(x) and is called {L_t}; the other begins with h(x) and is called {U_t}.
We rescale g and h according to the Bernoulli parameter p ∈ (0, 1): G(x) := g(x)/p and H(x) := h(x)/(1 − p). Note that a pick that equals G(x) with probability p and H(x) with probability 1 − p has expectation p G(x) + (1 − p) H(x) = g(x) + h(x) = f(x), so each pick is an unbiased estimate of f. We set f_{l_1} := G(x), and for t = 2, 3, ... we pick f_{l_t} as a Bernoulli variable with P(f_{l_t} = G(x)) = p and P(f_{l_t} = H(x)) = 1 − p. Then we set L_t := (1/t) Σ_{h=1}^{t} f_{l_h}.
Similarly, we set f_{u_1} := H(x), and for t = 2, 3, ... we pick f_{u_t} from {G(x), H(x)} following the Bernoulli distribution with P(f_{u_t} = G(x)) = p and P(f_{u_t} = H(x)) = 1 − p. Then we set U_t := (1/t) Σ_{h=1}^{t} f_{u_h}. With this construction, the two sequences {L_t} and {U_t} both converge to f as t → +∞.
Using both stochastic sequences {L_t} and {U_t} at each iteration gives us more information about the objective function f(x), so we get more chances to reach a minimum of f(x). We approximate the true objective function f(x) by F_t(x), a linear combination of U_t and L_t with a suitable parameter ν ∈ (0, 1):

F_t(x) := ν U_t(x) + (1 − ν) L_t(x)
The use of both bounds is stochastic and helps reduce the possibility of getting stuck at a local stationary point; it is an efficient approach for escaping saddle points in non-convex optimization. Our new variant therefore seems more appropriate than OPE. Although GS-OPT aims at increasing randomness, it works differently from OPE. While OPE constructs only one sequence of functions F_t, at each iteration t GS-OPT constructs three sequences, {L_t}, {U_t}, and {F_t}, where {F_t} depends on {U_t} and {L_t}. The structure of the main sequence F_t is thus actually changed. Details of GS-OPT are presented in Algorithm 1.
The uniform distribution is a special case of the Bernoulli distribution with parameter p = 0.5, so OPE is not flexible across datasets. GS-OPT adapts well to different datasets, as we will show in our experiments. In the rest of this section, we show that GS-OPT preserves the key advantage of OPE, namely the guarantees on quality and convergence rate.
Algorithm 1 GS-OPT: A new General Stochastic OPTimization algorithm for solving the non-convex problem

Input: Bernoulli parameter p ∈ (0, 1) and linear combination parameter ν ∈ (0, 1)
Output: x* that minimizes f(x) = g(x) + h(x) on Ω
1: Initialize x_1 arbitrarily in Ω
2: Set G(x) := g(x)/p; H(x) := h(x)/(1 − p)
3: f_{l_1} := G(x); f_{u_1} := H(x)
4: for t = 2, 3, ..., ∞ do
5:   Pick f_{l_t} as a Bernoulli variable where P(f_{l_t} = G(x)) = p, P(f_{l_t} = H(x)) = 1 − p
6:   L_t := (1/t) Σ_{h=1}^{t} f_{l_h}
7:   Pick f_{u_t} as a Bernoulli variable where P(f_{u_t} = G(x)) = p, P(f_{u_t} = H(x)) = 1 − p
8:   U_t := (1/t) Σ_{h=1}^{t} f_{u_h}
9:   F_t := ν U_t + (1 − ν) L_t
10:  a_t := arg min_{x ∈ Ω} ⟨F'_t(x_t), x⟩
11:  x_{t+1} := x_t + (a_t − x_t)/t
12: end for
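For concreteness, the following Python sketch (our illustration, not released code) shows one way Algorithm 1 could be implemented when Ω is the unit simplex, so that the linear subproblem in step 10 has a closed-form solution at a vertex. Here grad_g and grad_h are assumed callables returning the gradients of g and h; since every pick f_{l_h} or f_{u_h} is the function G or the function H, the averages L_t and U_t can be tracked through counts of G-picks.

import numpy as np

def gs_opt(grad_g, grad_h, dim, p=0.5, nu=0.5, T=100, rng=None):
    # Sketch of GS-OPT (Algorithm 1) on the unit simplex.
    rng = np.random.default_rng() if rng is None else rng
    x = np.full(dim, 1.0 / dim)      # step 1: start in the simplex interior
    cl, cu = 1, 0                    # counts of G-picks: f_l1 := G, f_u1 := H
    for t in range(2, T + 1):
        cl += rng.random() < p       # steps 5-6: Bernoulli pick for L_t
        cu += rng.random() < p       # steps 7-8: Bernoulli pick for U_t
        gG = grad_g(x) / p           # gradient of G := g / p
        gH = grad_h(x) / (1.0 - p)   # gradient of H := h / (1 - p)
        grad_L = (cl * gG + (t - cl) * gH) / t        # L_t'(x_t)
        grad_U = (cu * gG + (t - cu) * gH) / t        # U_t'(x_t)
        grad_F = nu * grad_U + (1.0 - nu) * grad_L    # step 9: F_t'(x_t)
        a = np.zeros(dim)            # step 10: minimizing <F_t'(x_t), x> over
        a[np.argmin(grad_F)] = 1.0   # the simplex selects a vertex
        x = x + (a - x) / t          # step 11
    return x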
Theorem 1 (Convergence of the GS-OPT algorithm). Consider the objective function f(x) in equation (2), the linear combination parameter ν ∈ (0, 1), and the Bernoulli parameter p ∈ (0, 1). For GS-OPT, with probability one, F_t(x) converges to f(x) as t → +∞ for any x ∈ Ω, and x_t converges to a local minimum/stationary point of f(x) at a rate of O(1/t).

The proof of Theorem 1 is similar to that in [17].
The objective function f(x) is non-convex, and the choice of criterion for the convergence analysis matters in non-convex optimization. For unconstrained problems, the gradient norm ‖∇f(x)‖ is typically used to measure convergence, because ‖∇f(x)‖ → 0 captures convergence to a stationary point. However, this criterion cannot be used for constrained problems. Instead, we use the "Frank-Wolfe gap" criterion [21], i.e., the quantity max_{a ∈ Ω} ⟨∇f(x), x − a⟩, which vanishes exactly at stationary points of the constrained problem.
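For the simplex-constrained problems considered later, the Frank-Wolfe gap is cheap to evaluate; a minimal sketch (ours) is:

import numpy as np

def frank_wolfe_gap(grad, x):
    # Frank-Wolfe gap max_{a in simplex} <grad f(x), x - a> over the unit
    # simplex: the maximizing a is a vertex, so the gap reduces to
    # <grad, x> - min_i grad_i. It is zero exactly at stationary points.
    return float(np.dot(grad, x) - grad.min())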
Let a_t and b_t = t − a_t be the numbers of times that we have already picked G(x) and H(x), respectively, after t iterations when constructing the sequence {L_t}. We have a_t ∼ B(t, p), with E(a_t) = tp and D(a_t) = tp(1 − p). Then S_t = a_t − tp → N(0, tp(1 − p)) as t → ∞, so S_t/t → 0 as t → ∞ with probability 1. We have

L_t − f = (S_t/t)(G − H),    L'_t − f' = (S_t/t)(G' − H')

Thus L_t → f as t → +∞ with probability 1. Similarly, U_t → f as t → +∞ with probability 1. In addition, F_t = ν U_t + (1 − ν) L_t implies

F_t − f = ν (U_t − f) + (1 − ν)(L_t − f).
Since U_t and L_t tend to f(x) as t → +∞ with probability 1, we conclude that the sequence F_t(x) → f(x) as t → +∞ with probability 1.
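As a quick numerical sanity check of this argument, one can simulate the Bernoulli picks for scalar g and h and watch L_t, U_t, and F_t concentrate around f = g + h. The constants below (g = 2, h = 3, p = 0.4, ν = 0.3) are arbitrary choices of ours:

import numpy as np

rng = np.random.default_rng(0)
g, h, p, nu = 2.0, 3.0, 0.4, 0.3
f = g + h                                # the true objective value
G, H = g / p, h / (1.0 - p)              # rescaled components
sum_l, sum_u = G, H                      # f_l1 := G, f_u1 := H
for t in range(2, 100001):
    sum_l += G if rng.random() < p else H
    sum_u += G if rng.random() < p else H
    if t in (100, 10000, 100000):
        L, U = sum_l / t, sum_u / t
        F = nu * U + (1.0 - nu) * L
        print(f"t={t:6d}  L_t={L:.3f}  U_t={U:.3f}  F_t={F:.3f}  f={f}")
# The printed values concentrate around f = 5, matching the unbiasedness
# E[f_l] = p*G + (1-p)*H = f and the law of large numbers.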
We will demonstrate the efficiency of the GS-OPT algorithm through experiments in the next section, where we apply GS-OPT to the posterior inference problem in topic models.
3. APPLYING GS-OPT FOR THE MAP PROBLEM IN TOPIC MODELS

Latent Dirichlet allocation (LDA) [22] is a generative model for modeling text and discrete data. It assumes that a corpus is composed of K topics β = (β_1, ..., β_K), each of which is a sample from a V-dimensional Dirichlet distribution, Dirichlet(η).
Each document d is a mixture of those topics and is assumed to arise from the following generative process: draw θ_d ∼ Dirichlet(α); for the n-th word of d, draw a topic index z_{dn} | θ_d ∼ Multinomial(θ_d) and a word w_{dn} | z_{dn} ∼ Multinomial(β_{z_{dn}}). Each topic mixture θ_d = (θ_{d1}, ..., θ_{dK}) represents the contributions of topics to document d, while β_{kj} shows the contribution of term j to topic k. Note that θ_d ∈ Δ_K and β_k ∈ Δ_V for all k. Both θ_d and z_d are unobserved variables, local to each document, as shown in Figure 1.
According to [23], the task of Bayesian inference (learning) given a corpus C = {d_1, ..., d_M} is to estimate the posterior distribution p(z, θ, β | C, α, η) over the latent topic indices z = {z_1, ..., z_M}, the topic mixtures θ = {θ_1, ..., θ_M}, and the topics β = (β_1, ..., β_K). The problem of posterior inference for each document d, given a model {β, α}, is to estimate the full joint distribution p(z_d, θ_d, d | β, α). Direct estimation of this distribution is intractable.
Hence existing approaches use different schemes such as variational Bayes (VB) [22], collapsed variational Bayes (CVB) [23], CVB0 [24], and collapsed Gibbs sampling (CGS) [25, 26]. VB, CVB, and CVB0 try to estimate the distribution by maximizing a lower bound of the likelihood p(d | β, α), whereas CGS tries to estimate p(z_d | d, β, α).
The efficiency of LDA in practice is determined by the efficiency of the inference method being employed. However, none of the mentioned methods has a theoretical guarantee on quality or convergence rate.
We consider the MAP estimation of the topic mixture for a given document d:

θ* = arg max_{θ ∈ Δ_K} P(θ, d | β, α) = arg max_{θ ∈ Δ_K} P(d | θ, β) P(θ | α)    (5)
Problem (5) is equivalent to the following:

θ* = arg max_{θ ∈ Δ_K} [Σ_j d_j log Σ_{k=1}^{K} θ_k β_{kj} + (α − 1) Σ_{k=1}^{K} log θ_k]    (6)
And we rewrite (6) in the form

θ* = arg min_{θ ∈ Δ_K} [(−Σ_j d_j log Σ_{k=1}^{K} θ_k β_{kj}) + (1 − α) Σ_{k=1}^{K} log θ_k]    (7)
We find that (7) is a non-convex optimization problem when α < 1. This optimization problem is usually non-convex and NP-hard in practice [27].
We denote

g(θ) := −Σ_j d_j log Σ_{k=1}^{K} θ_k β_{kj},    h(θ) := (1 − α) Σ_{k=1}^{K} log θ_k

so that the objective function is f(θ) = g(θ) + h(θ). We see that problem (7) is an instance of (2), so we can use the GS-OPT algorithm to solve problem (6).
Figure 1. Latent Dirichlet allocation model
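To make the reduction concrete, a possible implementation (ours, not the authors' code) of the gradients of g and h for problem (7) is sketched below; beta is the K × V topic matrix, d is the length-V term-frequency vector of one document, and alpha < 1:

import numpy as np

def make_lda_map_parts(d, beta, alpha):
    # Gradients of the components of problem (7): f(theta) = g(theta) + h(theta).
    def grad_g(theta):
        # g(theta) = -sum_j d_j log sum_k theta_k beta_kj
        mix = beta.T @ theta              # entry j is sum_k theta_k beta_kj
        return -(beta @ (d / mix))
    def grad_h(theta):
        # h(theta) = (1 - alpha) * sum_k log theta_k
        return (1.0 - alpha) / theta
    return grad_g, grad_h

Feeding grad_g and grad_h into a GS-OPT implementation over the simplex Δ_K, such as the sketch in Section 2, then performs the per-document MAP inference.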
We have seen many attractive properties of GS-OPT that other methods do not have. We further show the simplicity of using GS-OPT to design fast learning algorithms for topic models. More specifically, based on the two learning algorithms for LDA, ML-OPE and Online-OPE [17], we design two algorithms: ML-GSOPT, which enables us to learn LDA from either large corpora or data streams, and Online-GSOPT, which learns LDA from large corpora in an online fashion. These algorithms employ GS-OPT to do MAP inference for individual documents, and an online or streaming scheme to infer the global variables (topics). Details of ML-GSOPT and Online-GSOPT are presented in Algorithm 2 and Algorithm 3.
Algorithm 2 ML-GSOPT for learning LDA from massive/streaming data

Input: data sequence, K, α, τ > 0, κ ∈ (0.5, 1]
Output: β
1: Initialize β_0 randomly in Δ_V
2: for t = 1, 2, ..., ∞ do
3:   Pick a set C_t of documents
4:   Do inference by GS-OPT for each d ∈ C_t to get θ_d, given β_{t−1}
5:   Compute the intermediate topics β̂_t as β̂_{t,kj} ∝ Σ_{d ∈ C_t} d_j θ_{dk}
6:   Set the step-size: ρ_t = (t + τ)^{−κ}
7:   Update the topics: β_t := (1 − ρ_t) β_{t−1} + ρ_t β̂_t
8: end for
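The streaming update in steps 5-7 amounts to a few array operations; a minimal sketch (ours), assuming the inferred mixtures are stacked as a |C_t| × K matrix thetas and the documents as a |C_t| × V term-frequency matrix docs, is:

import numpy as np

def ml_gsopt_update(beta_prev, docs, thetas, t, tau=1.0, kappa=0.9):
    # One ML-GSOPT outer step (steps 5-7 of Algorithm 2).
    beta_hat = thetas.T @ docs                       # step 5: sum_d d_j theta_dk
    beta_hat /= beta_hat.sum(axis=1, keepdims=True)  # normalize rows onto Delta_V
    rho_t = (t + tau) ** (-kappa)                    # step 6: step-size
    return (1.0 - rho_t) * beta_prev + rho_t * beta_hat  # step 7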
Algorithm 3 Online-GSOPT for learning LDA from massive data

Input: Training data C with D documents, K, α, η, τ > 0, κ ∈ (0.5, 1]
Output: λ
1: Initialize λ_0 randomly
2: for t = 1, 2, ..., ∞ do
3:   Sample a set C_t consisting of S documents
4:   Use GS-OPT to do posterior inference for each document d ∈ C_t, given the global variable β_{t−1} ∝ λ_{t−1} from the last step, to get the topic mixture θ_d; then compute φ_d as φ_{djk} ∝ θ_{dk} β_{kj}
5:   For each k ∈ {1, 2, ..., K}, form an intermediate global variable λ̂_k for C_t by λ̂_{kj} = η + (D/S) Σ_{d ∈ C_t} d_j φ_{djk}
6:   Update the global variable by λ_t := (1 − ρ_t) λ_{t−1} + ρ_t λ̂, where ρ_t = (t + τ)^{−κ}
7: end for
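Similarly, a sketch (ours, under the same assumptions, with lam the K × V variational parameter and phi normalized over topics) of the global update in steps 4-6 of Algorithm 3 is:

import numpy as np

def online_gsopt_update(lam_prev, docs, thetas, t, D, eta, tau=1.0, kappa=0.9):
    # One Online-GSOPT outer step (steps 4-6 of Algorithm 3).
    S = docs.shape[0]                                # mini-batch size
    beta = lam_prev / lam_prev.sum(axis=1, keepdims=True)  # beta proportional to lambda
    lam_hat = np.full_like(lam_prev, eta)            # step 5: start from eta
    for s in range(S):
        phi = thetas[s][:, None] * beta              # phi_kj proportional to theta_k beta_kj
        phi /= phi.sum(axis=0, keepdims=True)        # normalize over topics k
        lam_hat += (D / S) * docs[s][None, :] * phi  # accumulate d_j phi_djk
    rho_t = (t + tau) ** (-kappa)                    # step 6: step-size
    return (1.0 - rho_t) * lam_prev + rho_t * lam_hat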
4. EMPIRICAL EVALUATION

This section is devoted to investigating the practical behavior of GS-OPT, and how useful it is when employed to design two new algorithms for learning topic models at large scale. To this end, we take the following methods, datasets, and performance measures into consideration.
4.1. Datasets:
We used two large corpora: the PubMed dataset, which consists of 330,000 articles from PubMed Central, and the New York Times dataset, which consists of 300,000 news articles. The datasets were taken from http://archive.ics.uci.edu/ml/datasets. For each dataset, we use 10,000 documents as the test set.
4.2. Parameter settings:
To compare our methods with the others, most of the free parameters are set as in [17].
a. Model parameters: the number of topics K = 100, the hyper-parameter α = 1/K, and the topic Dirichlet parameter η = 1/K. These values are commonly used in topic models.
b. Inference parameters: the number of iterations is chosen as T = 50.
c. Learning parameters: κ = 0.9 and τ = 1, which adapted best for the existing inference methods.
We run many experiments with two scenarios: (1) choosing the Bernoulli parameter p ∈ {0.30, 0.35, ..., 0.65, 0.70} with mini-batch size |C_t| = 25,000; (2) choosing the Bernoulli parameter p ∈ {0.1, 0.2, ..., 0.9} with mini-batch size |C_t| = 5,000. In the GS-OPT experiments, we choose the linear combination parameter ν = 0.3 on New York Times and ν = 0.1 on PubMed. Many more experiments could be done to examine the effect of the parameter ν ∈ (0, 1) in GS-OPT.
4.3. Performance measures:
We used Log Predictive Probability (LPP) and Normalized Pointwise Mutual Information (NPMI) to evaluate the learning methods. NPMI [28] evaluates the semantic quality of an individual topic; from extensive experiments, [28] found that NPMI agrees well with human evaluation of the interpretability of topic models. Predictive probability [26] measures the predictability and generalization of a model on new data.
4.4. Evaluation results:
4.4.1. Inference methods: variational Bayes (VB) [22], collapsed variational Bayes (CVB, CVB0) [24], collapsed Gibbs sampling (CGS) [26], OPE [17], and GS-OPT. CVB0 and CGS have been observed to work best in several previous studies [24, 26]; therefore, they can be considered the state-of-the-art inference methods.
4.4.2. Large-scale learning methods: ML-GSOPT, Online-GSOPT, ML-OPE, Online-OPE [17], Online-CGS [26], Online-CVB [24], and Online-VB [29]. To reduce the effect of randomness, the learning methods are run five times on each dataset and we report their average results.
By changing the variables and the bound functions, we obtain GS-OPT, which is more effective than OPE. GS-OPT has a parameter ν ∈ (0, 1) used when constructing the linear combination F_t of U_t and L_t, so the experimental results of GS-OPT are good or bad depending on how ν is chosen.
We run experiments for learning LDA with the GS-OPT algorithm on the two datasets, choosing the Bernoulli parameter p ∈ {0.30, 0.35, ..., 0.65, 0.70} with mini-batch size |C_t| = 25,000. [17] shows that OPE is better than previous methods; thus, we compare our method with OPE via the LPP and NPMI measures on the two datasets. Details of the experimental results for this case are shown in Figure 2.
Figure 2. Results of GS-OPT with different values of p with mini-batch size |C_t| = 25,000. Higher is better. (a) ML-GSOPT, (b) Online-GSOPT
Through our experiments, we find that using the Bernoulli distribution and two bounds of the objective function in GS-OPT gives better results than OPE. We also see that GS-OPT depends on the chosen Bernoulli parameter p.
We made further experiments by dividing the data into smaller mini-batches, such as |C_t| = 5,000, and choosing the Bernoulli parameter p over a wider range, such as p ∈ {0.1, 0.2, ..., 0.8, 0.9}, with the parameter ν = 0.3 on New York Times and ν = 0.1 on PubMed. Details of the experimental results for this case are shown in Figure 3.
We find that GS-OPT gives different results depending on the chosen Bernoulli parameter p and parameter ν. We also find that using mini-batch size |C_t| = 5,000 is better than using mini-batch size |C_t| = 25,000; that is, LPP and NPMI with |C_t| = 5,000 are higher than with |C_t| = 25,000.
In addition, we find that Online-GSOPT is better than Online-VB, Online-CVB, and Online-CGS on the two datasets under both the LPP and NPMI measures. Details of these results are shown in Figure 4. This illustrates the contribution of the prior/likelihood to solving the inference problem.
Figure 3. Results of GS-OPT with different values of p with mini-batch size |C_t| = 5,000. Higher is better. (a) ML-GSOPT, (b) Online-GSOPT
Figure 4. Performance of different learning methods as more documents are seen. Higher is better. Online-GSOPT is better than Online-OPE, Online-VB, Online-CVB, and Online-CGS. We choose mini-batch size |C_t| = 5,000
5. CONCLUSION

In this paper, we propose GS-OPT, a new algorithm for efficiently solving non-convex optimization problems. Using the Bernoulli distribution and stochastic approximations, GS-OPT deals well with the posterior inference problem in topic models. The Bernoulli parameter p in GS-OPT can be seen as a regularization parameter that helps the model be more efficient and avoid over-fitting. By carefully exploiting GS-OPT in topic models, we arrive at two efficient methods for learning LDA from large corpora. As a result, they are good candidates for dealing with text streams and big data. In addition, GS-OPT is flexible, so we can apply it to many more non-convex problems in machine learning.
ACKNOWLEDGEMENT

This work was supported by the University of Information and Communication Technology (ICTU) under project T2019-07-06.
REFERENCES
[1] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Review, vol. 60, no. 2, pp. 223-311, 2018.
[2] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," Advances in Neural Information Processing Systems, pp. 315-323, 2013.
[3] L. Xiao and T. Zhang, "A proximal stochastic gradient method with progressive variance reduction," SIAM Journal on Optimization, vol. 24, no. 4, pp. 2057-2075, 2014.
[4] A. L. Yuille and A. Rangarajan, "The concave-convex procedure," Neural Computation, vol. 15, no. 4, pp. 915-936, 2003.
[5] A. Blake and A. Zisserman, "Visual reconstruction," MIT Press, 1987.
[6] E. Hazan, K. Y. Levy, and S. Shalev-Shwartz, "On graduated optimization for stochastic non-convex problems," International Conference on Machine Learning, pp. 1833-1841, 2016.
[7] X. Chen, S. Liu, R. Sun, and M. Hong, "On the convergence of a class of adam-type algorithms for non-convex optimization," arXiv preprint arXiv:1808.02941, 2018.
[8] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121-2159, 2011.
[9] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26-31, 2012.
[10] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," Proc. 3rd Int. Conf. Learn. Representations, 2014.
[11] M. D. Zeiler, "Adadelta: an adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[12] S. Ghadimi and G. Lan, "Accelerated gradient methods for nonconvex nonlinear and stochastic programming," Math. Program., vol. 156, no. 1-2, pp. 59-99, 2016.
[13] Z. Allen-Zhu, "Natasha 2: Faster non-convex optimization than SGD," Advances in Neural Information Processing Systems, Curran Associates, Inc., pp. 2680-2691, 2018.
[14] Z. Allen-Zhu and Y. Li, "Neon2: Finding local minima via first-order oracles," Advances in Neural Information Processing Systems, pp. 3720-3730, 2018.
[15] C. J. Hillar and L.-H. Lim, "Most tensor problems are NP-hard," Journal of the ACM (JACM), vol. 60, no. 6, p. 45, 2013.
[16] E. Hazan and S. Kale, "Projection-free online learning," Proceedings of the Annual International Conference on Machine Learning, 2012.
[17] K. Than and T. Doan, "Guaranteed inference in topic models," arXiv preprint arXiv:1512.03308, 2015.
[18] J. Mairal, "Stochastic majorization-minimization algorithms for large-scale optimization," Advances in Neural Information Processing Systems, pp. 2283-2291, 2013.
[19] B. Dai, N. He, H. Dai, and L. Song, "Provable Bayesian inference via particle mirror descent," Artificial Intelligence and Statistics, pp. 985-994, 2016.
[20] U. Simsekli, R. Badeau, T. Cemgil, and G. Richard, "Stochastic quasi-Newton Langevin Monte Carlo," International Conference on Machine Learning (ICML), 2016.
[21] S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola, "Stochastic Frank-Wolfe methods for nonconvex optimization," Proceedings of the 54th Annual Allerton Conference on Communication, Control, and Computing, pp. 1244-1251, 2016.
[22] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[23] Y. W. Teh, K. Kurihara, and M. Welling, "Collapsed variational inference for HDP," Proceedings of Advances in Neural Information Processing Systems, pp. 1481-1488, 2007.
[24] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, "On smoothing and inference for topic models," Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp. 27-34, 2009.
[25] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, pp. 5228-5235, 2004.
[26] M. Hoffman, D. M. Blei, and D. M. Mimno, "Sparse stochastic inference for latent dirichlet allocation," Proceedings of the 29th International Conference on Machine Learning (ICML-12), ACM, pp. 1599-1606, 2012.
[27] D. Sontag and D. Roy, "Complexity of inference in latent dirichlet allocation," Proceedings of Advances in Neural Information Processing Systems, 2011.
[28] J. H. Lau, D. Newman, and T. Baldwin, "Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality," Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 530-539, 2014.
[29] M. D. Hoffman, D. M. Blei, C. Wang, and J. W. Paisley, "Stochastic variational inference," Journal of Machine Learning Research, vol. 14, no. 1, pp. 1303-1347, 2013.