IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 9, No. 2, June 2020, pp. 183-192
ISSN: 2252-8938, DOI: 10.11591/ijai.v9i2.pp183-192

GS-OPT: A new fast stochastic algorithm for solving the non-convex optimization problem

Xuan Bui (1), Nhung Duong (2), Trung Hoang (3)
(1),(3) Thuyloi University, 175 Tay Son, Vietnam
(2) Thai Nguyen University of Information and Communication Technology, Vietnam

Article Info
Article history: Received Oct 28, 2019; Revised Dec 16, 2019; Accepted Feb 19, 2020
Keywords: Bayesian learning; Non-convex optimization; Posterior inference; Stochastic optimization; Topic models

ABSTRACT
Non-convex optimization plays an important role in machine learning. However, the theoretical understanding of non-convex optimization remains rather limited. Studying efficient algorithms for non-convex optimization has attracted a great deal of attention from researchers around the world, but these problems are usually NP-hard to solve. In this paper, we propose a new algorithm, namely GS-OPT (General Stochastic OPTimization), which is effective for solving non-convex problems. Our idea is to combine two stochastic bounds of the objective function, both built from a common discrete probability distribution, namely the Bernoulli distribution. We study GS-OPT carefully on both the theoretical and experimental sides. We also apply GS-OPT to the posterior inference problem in latent Dirichlet allocation. Empirical results show that our approach is often more efficient than previous ones.

This is an open access article under the CC BY-SA license.

Corresponding Author: Xuan Bui, Thuyloi University, 175 Tay Son, Dong Da, Hanoi, Vietnam. Email: xuanbtt@tlu.edu.vn

1. INTRODUCTION
In machine learning, many problems lead to non-convex optimization. In this paper, we focus on optimization problems of the form

$$\mathbf{x}^* = \arg\min_{\mathbf{x} \in \Omega} f(\mathbf{x}) \qquad (1)$$

where the objective function $f(\mathbf{x})$ is smooth (possibly non-convex) on the compact domain $\Omega$.

To the best of our knowledge, if $f(\mathbf{x})$ is convex, problem (1) is easy to solve by applying convex optimization methods such as gradient descent (GD), Newton's method, or stochastic GD [1]. In practice, however, non-convex models have several advantages over convex ones. For example, deep neural networks, which are widely used in computer vision and data mining, are highly non-convex. In these cases, solving problem (1) is more difficult than in the convex case because non-convex optimization usually admits a multimodal structure, and common convex optimization methods may get trapped in poor local optima. In this paper, we propose a new optimization method for solving problem (1) in which the objective function $f(\mathbf{x})$ is non-convex and smooth.

More and more studies consider solving the non-convex problem (1), such as stochastic variance reduced gradient (SVRG) [2] and proximal SVRG (Prox-SVRG) [3]. SVRG and Prox-SVRG can be used to
solve non-convex finite-sum problems, but we find that they may not converge to the global optimum for non-convex functions. The concave-convex procedure (CCCP) [4] is widely used for non-convex problems: it transforms the non-convex problem into a sum of a convex function and a concave one, and then linearizes the concave function. However, the complexity of a single loop of CCCP is much higher than that of SGD, because CCCP solves a quadratic program at each iteration. The graduated optimization algorithm (GOA) [5] is also a popular global search algorithm for non-convex problems, but directly calculating its gradient is costly. In [6], the authors propose a new optimization method named GradOpt. Experimental results in [6] show that GradOpt can quickly yield a much better solution than mini-batch SGD. However, [7], which proposes two algorithms, SVRG-GOA and PSVRG-GOA, for solving the non-convex problem, shows that GradOpt has some shortcomings: it converges slowly, partly due to the decrease of the step size, and its range of application is limited. The value of the objective function may be trapped around a value larger than the global minimum because the smoothing parameter shrinks only slightly after several iterations. Many well-known stochastic optimization algorithms, such as Adagrad [8], RMSProp [9], Adam [10], Adadelta [11], RSAG [12], Natasha2 [13], and NEON2 [14], have also been proposed for solving optimization problems in machine learning.

The big challenges for non-convex optimization algorithms in machine learning are: Can a local/global optimum be found? Is it possible to get rid of saddle points, and how can they be escaped efficiently? Can an optimal solution be found in acceptable time on large data? Finding an optimum of a non-convex optimization problem is NP-hard in the worst case [15]. Despite these intractability results, non-convex optimization is the main algorithmic technique behind many state-of-the-art machine learning and deep learning results.

In light of this background, we state the main contributions of our paper:
a. Using the Bernoulli distribution and two stochastic approximation sequences, we develop GS-OPT for solving a wide class of non-convex problems, and we show that it usually performs better than previous algorithms.
b. Applying GS-OPT to the posterior inference problem in topic models, we obtain two methods for learning topic models: ML-GSOPT and Online-GSOPT. In addition, GS-OPT is very flexible, so we can adapt it to solve many non-convex models in machine learning.

Organization: This paper is structured as follows. In Section 2, the new algorithm for solving the non-convex optimization problem is presented in detail. In Section 3, we apply GS-OPT to posterior inference in latent Dirichlet allocation and design two methods for learning LDA. In Section 4, we report results on two large datasets: New York Times and PubMed. Finally, we conclude the paper in Section 5.

Notation: Throughout the paper, we use the following conventions and notations. Bold faces denote vectors or matrices. $x_i$ denotes the $i$-th element of vector $\mathbf{x}$, and $A_{ij}$ denotes the element at row $i$ and column $j$ of matrix $\mathbf{A}$. The unit simplex in the $n$-dimensional Euclidean space is denoted as $\Delta_n = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{x} \ge 0, \sum_{k=1}^{n} x_k = 1\}$, and its interior is denoted as $\overline{\Delta}_n$.
We work with text collections over a vocabulary of $V$ terms (dictionary size). Each document $d$ is represented as a frequency vector $\mathbf{d} = (d_1, \ldots, d_V)^T$, where $d_j$ is the frequency of term $j$ in $d$. Denote by $n_d$ the length of $d$, i.e., $n_d = \sum_j d_j$. The inner product of vectors $\mathbf{u}$ and $\mathbf{v}$ is denoted as $\langle \mathbf{u}, \mathbf{v} \rangle$. $\mathbb{I}(x)$ is the indicator function, which returns 1 if $x$ is true and 0 otherwise.

2. PROPOSED STOCHASTIC OPTIMIZATION ALGORITHM
We consider the optimization problem of the form

$$\mathbf{x}^* = \arg\min_{\mathbf{x} \in \Omega} \left[ f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x}) \right] \qquad (2)$$

where the non-convex objective function $f(\mathbf{x})$ consists of two components $g(\mathbf{x})$ and $h(\mathbf{x})$. Numerous machine learning models fall into the framework of problem (2). For example, in Bayesian learning we usually have to solve the maximum a posteriori (MAP) estimation problem:

$$\mathbf{x}^* = \arg\max_{\mathbf{x}} \left[ \log P(D \mid \mathbf{x}) + \log P(\mathbf{x}) \right] \qquad (3)$$

where $P(D \mid \mathbf{x})$ denotes the likelihood of the observed variable $D$ and $P(\mathbf{x})$ denotes the prior of the hidden variable $\mathbf{x}$. Problem (3) can be rewritten as
$$\mathbf{x}^* = \arg\min_{\mathbf{x}} \left[ -\log P(D \mid \mathbf{x}) - \log P(\mathbf{x}) \right] \qquad (4)$$

We notice that problem (4) is an instance of problem (2) with $g(\mathbf{x}) = -\log P(D \mid \mathbf{x})$ and $h(\mathbf{x}) = -\log P(\mathbf{x})$.

To solve problem (3), by changing the learning rate in the OFW algorithm [16] and carefully studying the theoretical aspects, OPE [17] was proposed for solving the MAP estimation problem in many probabilistic models. Compared with CCCP [4] and SMM [18], OPE has many preferable properties. First, the convergence rate of CCCP and SMM is unknown for non-convex problems. Second, while each iteration of SMM requires solving a convex problem, each iteration of CCCP has to solve a nonlinear equation system, which is expensive and non-trivial in many cases. Each iteration of OPE, in contrast, requires solving a linear program, which is significantly easier than a nonlinear problem. Therefore, OPE promises to be much more efficient than CCCP and SMM. The convergence rate of OPE is also significantly faster than that of PMD [19] and HAMCMC [20]. In this section, we point out further important characteristics of OPE, some of which were investigated in [17].

In general, optimization theory has encountered many difficulties in solving non-convex optimization problems; many methods are good in theory but inapplicable in practice. Therefore, instead of directly solving the non-convex problem with the true objective function $f(\mathbf{x})$, OPE constructs a sequence of stochastic functions $F_t(\mathbf{x})$ that approximates the objective of interest by alternately choosing uniformly from $\{g(\mathbf{x}), h(\mathbf{x})\}$ at each iteration $t$; it is guaranteed that $F_t$ converges to $f$ as $t \to \infty$. OPE is a stochastic optimization algorithm that is straightforward to implement, computationally efficient, and suitable for problems that are large in terms of data and/or parameters. [17] showed experimentally and theoretically the effectiveness of OPE when applied to posterior inference in the latent Dirichlet allocation model.

The main idea of OPE is thus to construct a stochastic sequence $F_t(\mathbf{x})$ that approximates $f(\mathbf{x})$ using the uniform distribution, so that (2) becomes easy to solve. Although OPE improves on earlier methods, we want to explore a new stochastic optimization algorithm that solves problem (2) more efficiently. We identify the following limitations: the uniform distribution is too simple and therefore not suitable for many problems, and using a single approximating function in place of the true objective is less effective than using two approximating bounds.

Addressing these drawbacks of OPE, we make some improvements and obtain a new algorithm, GS-OPT. It stands to reason that two stochastic sequences approximating the objective function $f(\mathbf{x})$ are better than one. Using the Bernoulli distribution, we construct two sequences that both converge to $f(\mathbf{x})$: one begins with $g(\mathbf{x})$ and is called $\{L_t\}$, the other begins with $h(\mathbf{x})$ and is called $\{U_t\}$. We rescale $g$ and $h$ according to the Bernoulli parameter $p \in (0,1)$: $G(\mathbf{x}) := g(\mathbf{x})/p$ and $H(\mathbf{x}) := h(\mathbf{x})/(1-p)$. We set $f_1^l := G(\mathbf{x})$ and pick $f_t^l$ as a Bernoulli variable with parameter $p \in (0,1)$, where $P(f_t^l = G(\mathbf{x})) = p$ and $P(f_t^l = H(\mathbf{x})) = 1-p$ for $t = 2, 3, \ldots$ Then we set $L_t := \frac{1}{t} \sum_{h=1}^{t} f_h^l$.
Similarly, we set $f_1^u := H(\mathbf{x})$ and pick $f_t^u$ from $\{G(\mathbf{x}), H(\mathbf{x})\}$ according to a Bernoulli distribution with parameter $p \in (0,1)$, where $P(f_t^u = G(\mathbf{x})) = p$ and $P(f_t^u = H(\mathbf{x})) = 1-p$ for $t = 2, 3, \ldots$ Then we set $U_t := \frac{1}{t} \sum_{h=1}^{t} f_h^u$.

With this construction, both sequences $\{L_t\}$ and $\{U_t\}$ converge to $f$ as $t \to +\infty$. Using the two stochastic sequences $\{L_t\}$ and $\{U_t\}$ at each iteration gives us more information about the objective function $f(\mathbf{x})$, so we have more chances to reach a minimum of $f(\mathbf{x})$. We approximate the true objective function $f(\mathbf{x})$ by $F_t(\mathbf{x})$, a linear combination of $U_t$ and $L_t$ with a suitable parameter $\gamma \in (0,1)$:

$$F_t(\mathbf{x}) := \gamma U_t(\mathbf{x}) + (1 - \gamma) L_t(\mathbf{x})$$

The use of both bounds is stochastic and helps reduce the possibility of getting stuck at a local stationary point; this is an efficient approach for escaping saddle points in non-convex optimization. Hence, our new variant seems more appropriate than OPE. Although GS-OPT aims at increasing randomness, it works differently from OPE: while OPE constructs only one sequence of functions $F_t$, at each iteration $t$ GS-OPT constructs three sequences $\{L_t\}$, $\{U_t\}$, and $\{F_t\}$, where $\{F_t\}$ depends on $\{U_t\}$ and $\{L_t\}$, so the structure of the main sequence $F_t$ is actually changed. Details of GS-OPT are presented in Algorithm 1.

The uniform distribution is a special case of the Bernoulli distribution with parameter $p = 0.5$, so OPE is not flexible across datasets, whereas GS-OPT adapts well to different datasets, as we show in our experiments. In the rest of this section, we show that GS-OPT preserves the key advantage of OPE, namely the guarantee on quality and convergence rate.
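To make the construction concrete before the formal statement in Algorithm 1, the following is a minimal Python sketch of GS-OPT on a simplex-constrained domain. It assumes user-supplied gradient oracles grad_g and grad_h for g and h (the function and argument names here are ours, not from the paper), and it exploits the fact that the linear subproblem of a Frank-Wolfe step over the unit simplex is solved by a vertex.

```python
import numpy as np

def gs_opt(grad_g, grad_h, dim, p=0.5, gamma=0.5, n_iters=100, rng=None):
    """Illustrative sketch of GS-OPT on the unit simplex (assumed domain).

    grad_g, grad_h : callables returning the gradients of g and h at a point.
    p              : Bernoulli parameter used to build the two stochastic bounds.
    gamma          : weight in the combination F_t = gamma * U_t + (1 - gamma) * L_t.
    """
    rng = rng or np.random.default_rng(0)
    x = np.full(dim, 1.0 / dim)        # start at an interior point of the simplex

    # Counters of how many times G = g/p and H = h/(1-p) have entered each bound.
    # The lower sequence starts from G, the upper one from H, as in the text.
    cnt_L = np.array([1, 0])           # [picks of G, picks of H] for {L_t}
    cnt_U = np.array([0, 1])           # [picks of G, picks of H] for {U_t}

    for t in range(2, n_iters + 1):
        # Draw one Bernoulli term for each sequence.
        cnt_L[0 if rng.random() < p else 1] += 1
        cnt_U[0 if rng.random() < p else 1] += 1

        # Gradients of the averaged bounds L_t and U_t at the current iterate.
        gG, gH = grad_g(x) / p, grad_h(x) / (1.0 - p)
        grad_L = (cnt_L[0] * gG + cnt_L[1] * gH) / t
        grad_U = (cnt_U[0] * gG + cnt_U[1] * gH) / t
        grad_F = gamma * grad_U + (1.0 - gamma) * grad_L

        # Frank-Wolfe step: the linear minimizer over the simplex is a vertex.
        a = np.zeros(dim)
        a[np.argmin(grad_F)] = 1.0
        x = x + (a - x) / t
    return x
```

The vertex selection in the sketch corresponds to step 10 of Algorithm 1 when the feasible set is the unit simplex, which is exactly the setting used for topic-model inference in Section 3.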
Algorithm 1 GS-OPT: A new General Stochastic OPTimization algorithm for solving the non-convex problem
Input: Bernoulli parameter $p \in (0,1)$ and linear combination parameter $\gamma \in (0,1)$
Output: $\mathbf{x}^*$ that minimizes $f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})$ on $\Omega$
1: Initialize $\mathbf{x}_1$ arbitrarily in $\Omega$
2: Set $G(\mathbf{x}) := g(\mathbf{x})/p$, $H(\mathbf{x}) := h(\mathbf{x})/(1-p)$
3: $f_1^l := G(\mathbf{x})$; $f_1^u := H(\mathbf{x})$
4: for $t = 2, 3, \ldots$ do
5:   Pick $f_t^l$ as a Bernoulli variable where $P(f_t^l = G(\mathbf{x})) = p$, $P(f_t^l = H(\mathbf{x})) = 1-p$
6:   $L_t := \frac{1}{t} \sum_{h=1}^{t} f_h^l$
7:   Pick $f_t^u$ as a Bernoulli variable where $P(f_t^u = G(\mathbf{x})) = p$, $P(f_t^u = H(\mathbf{x})) = 1-p$
8:   $U_t := \frac{1}{t} \sum_{h=1}^{t} f_h^u$
9:   $F_t := \gamma U_t + (1-\gamma) L_t$
10:  $\mathbf{a}_t := \arg\min_{\mathbf{x} \in \Omega} \langle F'_t(\mathbf{x}_t), \mathbf{x} \rangle$
11:  $\mathbf{x}_{t+1} := \mathbf{x}_t + \frac{\mathbf{a}_t - \mathbf{x}_t}{t}$
12: end for

Theorem 1 (Convergence of the GS-OPT algorithm). Consider the objective function $f(\mathbf{x})$ in (2), the linear combination parameter $\gamma \in (0,1)$, and the Bernoulli parameter $p \in (0,1)$. For GS-OPT, with probability one, $F_t(\mathbf{x})$ converges to $f(\mathbf{x})$ as $t \to +\infty$ for any $\mathbf{x} \in \Omega$, and $\mathbf{x}_t$ converges to a local minimum/stationary point of $f(\mathbf{x})$ at a rate of $O(1/t)$.

The proof of Theorem 1 is similar to that in [17]. The objective function $f(\mathbf{x})$ is non-convex, and the criterion used for convergence analysis matters in non-convex optimization. For unconstrained problems, the gradient norm $\|\nabla f(\mathbf{x})\|$ is typically used to measure convergence, because $\|\nabla f(\mathbf{x})\| \to 0$ captures convergence to a stationary point. However, this criterion cannot be used for constrained problems; instead, we use the "Frank-Wolfe gap" criterion [21].

Let $a_t$ and $b_t = t - a_t$ be the numbers of times that $G(\mathbf{x})$ and $H(\mathbf{x})$, respectively, have been picked after $t$ iterations to construct the sequence $\{L_t\}$. We have $a_t \sim B(t, p)$ with $E(a_t) = tp$ and $D(a_t) = tp(1-p)$. Then $S_t = a_t - tp \to N(0, tp(1-p))$ as $t \to \infty$, so $S_t/t \to 0$ as $t \to \infty$ with probability 1. We have

$$L_t - f = \frac{S_t}{t}(G - H), \qquad L'_t - f' = \frac{S_t}{t}(G' - H')$$

Thus $L_t \to f$ as $t \to +\infty$ with probability 1, and similarly $U_t \to f$ as $t \to +\infty$ with probability 1. In addition, $F_t = \gamma U_t + (1-\gamma) L_t$ implies $F_t - f = \gamma (U_t - f) + (1-\gamma)(L_t - f)$. Since $U_t$ and $L_t$ tend to $f(\mathbf{x})$ as $t \to +\infty$ with probability 1, we conclude that $F_t(\mathbf{x}) \to f(\mathbf{x})$ as $t \to +\infty$ with probability 1.

We demonstrate the efficiency of GS-OPT through experiments in the next section, where we apply it to the posterior inference problem in topic models.

3. APPLYING GS-OPT TO THE MAP PROBLEM IN TOPIC MODELS
Latent Dirichlet allocation (LDA) [22] is a generative model for text and discrete data. It assumes that a corpus is composed of $K$ topics $\boldsymbol{\beta} = (\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K)$, each of which is a sample from the $V$-dimensional Dirichlet distribution $Dirichlet(\eta)$. Each document $d$ is a mixture of those topics and is assumed to arise from the following generative process: draw $\boldsymbol{\theta}_d \mid \alpha \sim Dirichlet(\alpha)$; for the $n$-th word of $d$, draw a topic index $z_{dn} \mid \boldsymbol{\theta}_d \sim Multinomial(\boldsymbol{\theta}_d)$ and a word $w_{dn} \mid z_{dn}, \boldsymbol{\beta} \sim Multinomial(\boldsymbol{\beta}_{z_{dn}})$. Each topic mixture $\boldsymbol{\theta}_d = (\theta_{d1}, \ldots, \theta_{dK})$ represents the contributions of the topics to document $d$, while $\beta_{kj}$ shows the contribution of term $j$ to topic $k$. Note that $\boldsymbol{\theta}_d \in \Delta_K$ and $\boldsymbol{\beta}_k \in \Delta_V$ for all $k$. Both $\boldsymbol{\theta}_d$ and $\mathbf{z}_d$ are unobserved variables, local to each document, as shown in Figure 1.
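As a concrete illustration of this generative process, the following short numpy sketch draws one synthetic document; the function name sample_document and its arguments are ours and are only meant to mirror the description above.

```python
import numpy as np

def sample_document(beta, alpha, n_words, rng=None):
    """Draw one document from the LDA generative process (illustrative sketch).

    beta    : K x V matrix of topics; each row is a distribution over the vocabulary.
    alpha   : Dirichlet hyper-parameter of the topic mixture theta_d.
    n_words : number of words to generate for the document.
    """
    rng = rng or np.random.default_rng(0)
    K, V = beta.shape
    theta = rng.dirichlet(np.full(K, alpha))              # theta_d ~ Dirichlet(alpha)
    z = rng.choice(K, size=n_words, p=theta)              # z_dn ~ Multinomial(theta_d)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])   # w_dn ~ Multinomial(beta_{z_dn})
    return theta, z, w
```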
According to [23], the task of Bayesian inference (learning), given a corpus $\mathcal{C} = \{d_1, \ldots, d_M\}$, is to estimate the posterior distribution $p(\mathbf{z}, \boldsymbol{\theta}, \boldsymbol{\beta} \mid \mathcal{C}, \alpha, \eta)$ over the latent topic indices $\mathbf{z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_M\}$, the topic mixtures $\boldsymbol{\theta} = \{\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_M\}$, and the topics $\boldsymbol{\beta} = (\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K)$. The problem of posterior inference for each document $d$, given a model $\{\boldsymbol{\beta}, \alpha\}$, is to estimate the full joint distribution $p(\mathbf{z}_d, \boldsymbol{\theta}_d, d \mid \boldsymbol{\beta}, \alpha)$. Direct estimation of this distribution is intractable, so existing approaches use different schemes such as variational Bayes (VB) [22], collapsed variational Bayes (CVB) [23], CVB0 [24], and collapsed Gibbs sampling (CGS) [25, 26].
We find that VB, CVB, and CVB0 try to estimate the distribution by maximizing a lower bound of the likelihood $p(d \mid \boldsymbol{\beta}, \alpha)$, whereas CGS tries to estimate $p(\mathbf{z}_d \mid d, \boldsymbol{\beta}, \alpha)$. The efficiency of LDA in practice is determined by the efficiency of the inference method being employed. However, none of the mentioned methods has a theoretical guarantee on quality or convergence rate.

We consider the MAP estimation of the topic mixture for a given document $d$:

$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta} \in \Delta_K} P(\boldsymbol{\theta}, d \mid \boldsymbol{\beta}, \alpha) = \arg\max_{\boldsymbol{\theta} \in \Delta_K} P(d \mid \boldsymbol{\theta}, \boldsymbol{\beta})\, P(\boldsymbol{\theta} \mid \alpha) \qquad (5)$$

Problem (5) is equivalent to the following:

$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta} \in \Delta_K} \sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + (\alpha - 1) \sum_{k=1}^{K} \log \theta_k \qquad (6)$$

and we rewrite (6) as

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta} \in \Delta_K} \Big[ \Big( -\sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} \Big) + (1 - \alpha) \sum_{k=1}^{K} \log \theta_k \Big] \qquad (7)$$

Problem (7) is non-convex when $\alpha < 1$; such optimization problems are usually non-convex and NP-hard in practice [27]. We denote

$$g(\boldsymbol{\theta}) := -\sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}, \qquad h(\boldsymbol{\theta}) := (1 - \alpha) \sum_{k=1}^{K} \log \theta_k$$

so that the objective function is $f(\boldsymbol{\theta}) = -\sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + (1 - \alpha) \sum_{k=1}^{K} \log \theta_k = g(\boldsymbol{\theta}) + h(\boldsymbol{\theta})$. Problem (7) is thus an instance of (2), so we can use the GS-OPT algorithm to solve problem (6).

Figure 1. Latent Dirichlet allocation model

GS-OPT has many attractive properties that other methods do not have. We further show the simplicity of using GS-OPT to design fast learning algorithms for topic models. More specifically, based on the two learning algorithms for LDA, ML-OPE and Online-OPE [17], we design two algorithms: ML-GSOPT, which enables us to learn LDA from either large corpora or data streams, and Online-GSOPT, which learns LDA from large corpora in an online fashion. These algorithms employ GS-OPT to do MAP inference for individual documents, and use an online or streaming scheme to infer the global variables (topics). Details of ML-GSOPT and Online-GSOPT are presented in Algorithm 2 and Algorithm 3; the gradient oracles used in their per-document inference step are sketched after Algorithm 2.

Algorithm 2 ML-GSOPT for learning LDA from massive/streaming data
Input: data sequence, $K$, $\alpha$, $\tau > 0$, $\kappa \in (0.5, 1]$
Output: $\boldsymbol{\beta}$
1: Initialize $\boldsymbol{\beta}^0$ randomly in $\Delta_V$
2: for $t = 1, 2, \ldots$ do
3:   Pick a set $C_t$ of documents
4:   Do inference by GS-OPT for each $d \in C_t$ to get $\boldsymbol{\theta}_d$, given $\boldsymbol{\beta}^{t-1}$
5:   Compute the intermediate topics $\hat{\boldsymbol{\beta}}^t$ as $\hat{\beta}^t_{kj} \propto \sum_{d \in C_t} d_j \theta_{dk}$
6:   Set the step size $\rho_t = (t + \tau)^{-\kappa}$
7:   Update the topics: $\boldsymbol{\beta}^t := (1 - \rho_t) \boldsymbol{\beta}^{t-1} + \rho_t \hat{\boldsymbol{\beta}}^t$
8: end for
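For concreteness, the per-document inference step (line 4 of Algorithm 2, and the analogous step in Algorithm 3 below) amounts to running GS-OPT on problem (7). The following hedged sketch builds the two gradient oracles that step needs; make_oracles is our own illustrative name, and the document d is assumed to be a dense numpy term-frequency vector.

```python
import numpy as np

def make_oracles(d, beta, alpha):
    """Gradient oracles for problem (7): f(theta) = g(theta) + h(theta).

    d     : length-V term-frequency vector of the document (d_j = count of term j).
    beta  : K x V topic matrix, rows summing to one.
    alpha : Dirichlet hyper-parameter (alpha < 1 makes the problem non-convex).
    """
    nz = d > 0                          # only terms present in the document contribute
    d_nz, beta_nz = d[nz], beta[:, nz]

    def grad_g(theta):
        # g(theta) = -sum_j d_j * log(sum_k theta_k * beta_kj)
        mix = theta @ beta_nz           # sum_k theta_k * beta_kj for each term j
        return -beta_nz @ (d_nz / mix)

    def grad_h(theta):
        # h(theta) = (1 - alpha) * sum_k log(theta_k)
        return (1.0 - alpha) / theta

    return grad_g, grad_h
```

Plugging these oracles into the gs_opt sketch from Section 2 (with dim = K, and p and gamma set as in the experiments below) would return the topic mixture theta_d for one document; this is only a sketch of how the pieces fit together, not the authors' reference implementation.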
Algorithm 3 Online-GSOPT for learning LDA from massive data
Input: training data $\mathcal{C}$ with $D$ documents, $K$, $\alpha$, $\eta$, $\tau > 0$, $\kappa \in (0.5, 1]$
Output: $\boldsymbol{\lambda}$
1: Initialize $\boldsymbol{\lambda}^0$ randomly
2: for $t = 1, 2, \ldots$ do
3:   Sample a set $C_t$ consisting of $S$ documents
4:   Use GS-OPT to do posterior inference for each document $d \in C_t$, given the global variable $\boldsymbol{\beta}^{t-1} \propto \boldsymbol{\lambda}^{t-1}$ from the last step, to get the topic mixture $\boldsymbol{\theta}_d$. Then compute $\boldsymbol{\phi}_d$ as $\phi_{djk} \propto \theta_{dk} \beta_{kj}$
5:   For each $k \in \{1, 2, \ldots, K\}$, form an intermediate global variable $\hat{\boldsymbol{\lambda}}_k$ for $C_t$ by $\hat{\lambda}_{kj} = \eta + \frac{D}{S} \sum_{d \in C_t} d_j \phi_{djk}$
6:   Update the global variable by $\boldsymbol{\lambda}^t := (1 - \rho_t) \boldsymbol{\lambda}^{t-1} + \rho_t \hat{\boldsymbol{\lambda}}$, where $\rho_t = (t + \tau)^{-\kappa}$
7: end for

4. EMPIRICAL EVALUATION
This section is devoted to investigating the practical behavior of GS-OPT and how useful it is when employed to design two new algorithms for learning topic models at large scale. To this end, we consider the following methods, datasets, and performance measures.

4.1. Datasets:
We used two large corpora: the PubMed dataset, consisting of 330,000 articles from PubMed Central, and the New York Times dataset, consisting of 300,000 news articles. The datasets were taken from http://archive.ics.uci.edu/ml/datasets. For each dataset, we use 10,000 documents as the test set.

4.2. Parameter settings:
To compare our methods with existing ones, most of the free parameters are set as in [17].
a. Model parameters: the number of topics $K = 100$, the hyper-parameter $\alpha = \frac{1}{K}$, and the topic Dirichlet parameter $\eta = \frac{1}{K}$. These values are commonly used in topic models.
b. Inference parameters: the number of iterations is $T = 50$.
c. Learning parameters: $\kappa = 0.9$ and $\tau = 1$, which adapt best for the existing inference methods.
We run experiments with two scenarios: (1) choosing the Bernoulli parameter $p \in \{0.30, 0.35, \ldots, 0.65, 0.70\}$ with mini-batch size $|C_t| = 25{,}000$; (2) choosing the Bernoulli parameter $p \in \{0.1, 0.2, \ldots, 0.9\}$ with mini-batch size $|C_t| = 5{,}000$. We run GS-OPT with the linear combination parameter $\gamma = 0.3$ on New York Times and $\gamma = 0.1$ on PubMed. Many more experiments could be done to examine the effect of the parameter $\gamma \in (0,1)$ in GS-OPT.

4.3. Performance measures:
We used Log Predictive Probability (LPP) and Normalized Pointwise Mutual Information (NPMI) to evaluate the learning methods. NPMI [28] evaluates the semantic quality of an individual topic; from extensive experiments, [28] found that NPMI agrees well with human evaluation of the interpretability of topic models. Predictive probability [26] measures the predictability and generalization of a model on new data.

4.4. Evaluation results:
4.4.1. Inference methods:
Variational Bayes (VB) [22], collapsed variational Bayes (CVB, CVB0) [24], collapsed Gibbs sampling (CGS) [26], OPE [17], and GS-OPT. CVB0 and CGS have been observed to work best in several previous studies [24, 26], so they can be considered state-of-the-art inference methods.

4.4.2. Large-scale learning methods:
ML-GSOPT, Online-GSOPT, ML-OPE, Online-OPE [17], Online-CGS [26], Online-CVB [24], and Online-VB [29]. To avoid the effect of randomness, the learning methods are run five times on each dataset and their average results are reported. By changing the variables and the bound functions, we obtain GS-OPT, which is more effective than OPE.
GS-OPT has a parameter $\gamma \in (0,1)$ used to construct the linear combination $F_t$ of $U_t$ and $L_t$,
so the experimental results of GS-OPT are better or worse depending on how $\gamma$ is chosen. We run experiments for learning LDA with the GS-OPT algorithm on the two datasets, choosing the Bernoulli parameter $p \in \{0.30, 0.35, \ldots, 0.65, 0.70\}$ with mini-batch size $|C_t| = 25{,}000$. [17] shows that OPE is better than previous methods, so we compare our method with OPE via the LPP and NPMI measures on the two datasets. Details of the experimental results for this case are shown in Figure 2.

Figure 2. Results of GS-OPT with different values of $p$ and mini-batch size $|C_t| = 25{,}000$. Higher is better. (a) ML-GSOPT, (b) Online-GSOPT

From these experiments, we find that using the Bernoulli distribution and two bounds of the objective function in GS-OPT can give better results than OPE. We also see that the performance of GS-OPT depends on the chosen Bernoulli parameter $p$. We made further experiments by dividing the data into smaller mini-batches, $|C_t| = 5{,}000$, choosing the Bernoulli parameter over a wider range, $p \in \{0.1, 0.2, \ldots, 0.8, 0.9\}$, and setting $\gamma = 0.3$ on New York Times and $\gamma = 0.1$ on PubMed. Details of the experimental results for this case are shown in Figure 3. We find that GS-OPT gives different results depending on the chosen Bernoulli parameter $p$ and parameter $\gamma$. We also find that using mini-batch size $|C_t| = 5{,}000$ is better than using mini-batch size $|C_t| = 25{,}000$: LPP and NPMI with mini-batch size $|C_t| = 5{,}000$ are higher than with $|C_t| = 25{,}000$. In addition, Online-GSOPT is better than Online-VB, Online-CVB, and Online-CGS on both datasets under the LPP and NPMI measures. Details of these results are shown in Figure 4. This illustrates the contribution of the prior/likelihood to solving the inference problem.
Figure 3. Results of GS-OPT with different values of $p$ and mini-batch size $|C_t| = 5{,}000$. Higher is better. (a) ML-GSOPT, (b) Online-GSOPT

Figure 4. Performance of different learning methods as more documents are seen. Higher is better. Online-GSOPT is better than Online-OPE, Online-VB, Online-CVB, and Online-CGS. We choose mini-batch size $|C_t| = 5{,}000$
5. CONCLUSION
In this paper, we propose GS-OPT, a new algorithm that efficiently solves non-convex optimization problems. Using the Bernoulli distribution and stochastic approximations, GS-OPT deals well with the posterior inference problem in topic models. The Bernoulli parameter $p$ in GS-OPT can be seen as a regularization parameter that helps the model to be more efficient and to avoid over-fitting. By exploiting GS-OPT carefully in topic models, we arrive at two efficient methods for learning LDA from large corpora; as a result, they are good candidates for dealing with text streams and big data. In addition, GS-OPT is flexible, so we can apply it to many more non-convex problems in machine learning.

ACKNOWLEDGEMENT
This work was supported by the University of Information and Communication Technology (ICTU) under project T2019-07-06.

REFERENCES
[1] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Review, vol. 60, no. 2, pp. 223-311, 2018.
[2] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," Advances in Neural Information Processing Systems, pp. 315-323, 2013.
[3] L. Xiao and T. Zhang, "A proximal stochastic gradient method with progressive variance reduction," SIAM Journal on Optimization, vol. 24, no. 4, pp. 2057-2075, 2014.
[4] A. L. Yuille and A. Rangarajan, "The concave-convex procedure," Neural Computation, vol. 15, no. 4, pp. 915-936, 2003.
[5] A. Blake and A. Zisserman, Visual Reconstruction, MIT Press, 1987.
[6] E. Hazan, K. Y. Levy, and S. Shalev-Shwartz, "On graduated optimization for stochastic non-convex problems," International Conference on Machine Learning, pp. 1833-1841, 2016.
[7] X. Chen, S. Liu, R. Sun, and M. Hong, "On the convergence of a class of adam-type algorithms for non-convex optimization," arXiv preprint arXiv:1808.02941, 2018.
[8] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121-2159, 2011.
[9] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26-31, 2012.
[10] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," Proc. 3rd Int. Conf. Learn. Representations, 2014.
[11] M. D. Zeiler, "Adadelta: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[12] S. Ghadimi and G. Lan, "Accelerated gradient methods for nonconvex nonlinear and stochastic programming," Math. Program., vol. 156, no. 1-2, pp. 59-99, 2016.
[13] Z. Allen-Zhu, "Natasha 2: Faster non-convex optimization than SGD," Advances in Neural Information Processing Systems, pp. 2680-2691, 2018.
[14] Z. Allen-Zhu and Y. Li, "Neon2: Finding local minima via first-order oracles," Advances in Neural Information Processing Systems, pp. 3720-3730, 2018.
[15] C. J. Hillar and L.-H. Lim, "Most tensor problems are NP-hard," Journal of the ACM (JACM), vol. 60, no. 6, p. 45, 2013.
[16] E. Hazan and S. Kale, "Projection-free online learning," Proceedings of the Annual International Conference on Machine Learning, 2012.
[17] K. Than and T. Doan, "Guaranteed inference in topic models," arXiv preprint arXiv:1512.03308, 2015.
[18] J. Mairal, "Stochastic majorization-minimization algorithms for large-scale optimization," Advances in Neural Information Processing Systems, pp. 2283-2291, 2013.
[19] B. Dai, N. He, H. Dai, and L. Song, "Provable Bayesian inference via particle mirror descent," Artificial Intelligence and Statistics, pp. 985-994, 2016.
[20] U. Simsekli, R. Badeau, T. Cemgil, and G. Richard, "Stochastic quasi-Newton Langevin Monte Carlo," International Conference on Machine Learning (ICML), 2016.
[21] S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola, "Stochastic Frank-Wolfe methods for nonconvex optimization," Proceedings of the 54th Annual Allerton Conference on Communication, Control, and Computing, pp. 1244-1251, 2016.
[22] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[23] Y. W. Teh, K. Kurihara, and M. Welling, "Collapsed variational inference for HDP," Proceedings of Advances in Neural Information Processing Systems, pp. 1481-1488, 2007.
[24] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, "On smoothing and inference for topic models," Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp. 27-34, 2009.
[25] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, pp. 5228-5235, 2004.
[26] M. Hoffman, D. M. Blei, and D. M. Mimno, "Sparse stochastic inference for latent Dirichlet allocation," Proceedings of the 29th International Conference on Machine Learning (ICML-12), ACM, pp. 1599-1606, 2012.
[27] D. Sontag and D. Roy, "Complexity of inference in latent Dirichlet allocation," Proceedings of Advances in Neural Information Processing Systems, 2011.
[28] J. H. Lau, D. Newman, and T. Baldwin, "Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality," Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 530-539, 2014.
[29] M. D. Hoffman, D. M. Blei, C. Wang, and J. W. Paisley, "Stochastic variational inference," Journal of Machine Learning Research, vol. 14, no. 1, pp. 1303-1347, 2013.