IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 10, No. 3, September 2021, pp. 636~648
ISSN: 2252-8938, DOI: 10.11591/ijai.v10.i3.pp636-648
Journal homepage: http://ijai.iaescore.com

A two-phase plagiarism detection system based on multi-layer long short-term memory networks

Nguyen Van Son1, Le Thanh Huong2, Nguyen Chi Thanh3
1,3Institute of Information Technology, MIST, Vietnam
2School of Information and Computer Science Technology, Hanoi University of Science and Technology, Hanoi, Vietnam

Article Info
Article history: Received Aug 25, 2020; Revised May 15, 2021; Accepted May 26, 2021

ABSTRACT
Finding plagiarism strings between two given documents is the main task of the plagiarism detection problem. Traditional approaches based on string matching are not very useful in cases of semantically similar plagiarism. Deep learning approaches solve this problem by measuring the semantic similarity between pairs of sentences. However, these approaches still face the following challenges. First, they cannot handle cases where only part of a sentence belongs to a plagiarism passage.
Second, measuring sentence similarity without considering the context of the surrounding sentences decreases accuracy. To solve the above problems, this paper proposes a two-phase plagiarism detection system based on a multi-layer long short-term memory network model and a feature extraction technique: (i) a passage-phase to recognize plagiarism passages, and (ii) a word-phase to determine the exact plagiarism strings. Our experimental results on the PAN 2014 corpus reached 94.26% F-measure, higher than existing research in this field.

Keywords: Deep learning; Feature extraction; Multi-layer long short-term memory; Plagiarism detection; Two-phase

This is an open access article under the CC BY-SA license.

Corresponding Author:
Nguyen Van Son
Institute of Information Technology, MIST, Vietnam
Tel: (+84) 904236683
Email: sonnv78@gmail.com

1. INTRODUCTION
Plagiarism is defined as the reuse of another person's ideas, processes, results, or words without explicitly acknowledging the source [1]. Plagiarism detection is the task of automatically retrieving strings in a suspicious document that are reused from another document.
Plagiarism methods are divided into two main types, literal plagiarism and intelligent plagiarism, based on the plagiarist's behavior [2]. Literal plagiarism is a common case in which plagiarists do not spend much time hiding the academic crime they committed. For example, they copy and paste text from the internet. Intelligent plagiarism is severe academic dishonesty wherein plagiarists try to deceive readers by changing others' contributions to appear as their own. Intelligent plagiarists try to hide, obfuscate, and change the original work in various intelligent ways, including text manipulation, translation, and idea adoption.

Over the past two decades, automatic plagiarism detection has received significant attention from the research community. The two main tasks of automatic plagiarism detection are source retrieval and text alignment. In the source retrieval task, given a suspicious document and a web search engine, the task is to retrieve all source documents from which text has been reused. In the text alignment subtask, given a pair of documents (a suspicious document and a source one), the task is to identify contiguous maximal-length passages of reused text.
Most existing works on text alignment focus on supervised and unsupervised approaches. Several unsupervised approaches use character-based methods (e.g., [1], [3], [4]) that apply string matching or approximate string matching with measures such as Hamming or Levenshtein distances to compute the similarity between two strings within a sliding window. Instead of comparing strings as in character-based methods, vector-based methods (e.g., [5], [6]) represent input texts as vectors of tokens and measure the distance between these vectors using similarity coefficients such as Jaccard, Cosine, Euclidean, or Manhattan distances. Based on the intuition that similar documents would have similar syntactical structures, some research works (e.g., [7], [8]) used syntactic information at the first stage of measuring sentential similarity. The main limitation of these unsupervised approaches is that they cannot deal with intelligent plagiarism, in which the same content can be expressed by different words and in different orders.
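To make the limitation concrete, the following is a minimal sketch of one character-based measure (Levenshtein distance) and one token-based coefficient (Jaccard). The function names and the example sentences are illustrative assumptions, not taken from the cited works; the point is only that a paraphrase with almost no shared tokens scores close to zero.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient between two token collections, treated as sets."""
    sa, sb = set(tokens_a), set(tokens_b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# A near-copy keeps a high token overlap ...
near_copy = jaccard("the cat sat".split(), "the cat slept".split())
# ... while a paraphrase (hypothetical example) shares almost no tokens.
paraphrase = jaccard("the physician examined the patient".split(),
                     "a doctor checked on the sick person".split())
```

Under these assumptions the paraphrase pair shares only the token "the", so a surface-level measure treats it as unrelated even though the meaning is the same.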
Research on intelligent plagiarism (e.g., [9]-[11]) often concentrates on finding the similarity between pairs of sentences. Gharavi et al. [9] proposed a plagiarism detection method for the Persian language that represents each sentence by a semantic embedding vector and then compares these vectors using the cosine similarity. Cherroun et al. [10] proposed a two-phase system using a supervised learning approach to detect plagiarism in Arabic. The first phase produced a representation vector for each sentence by combining different features, including word embedding, word alignment, term frequency weighting, and part-of-speech tagging. The second phase used lexical, syntactic, and semantic features in three machine learning models (support vector machine (SVM), decision trees (DT), and random forests (RF)) to improve the accuracy of the first-phase results. However, their approach did not deal with obfuscated plagiarism cases in which a passage is inserted in the middle of a sentence. Altheneyan et al. [11] presented two systems (PlagLinSVM and PlagRbfSVM) using the SVM classifier with lexical, syntactic, and semantic features to detect plagiarized sentences. Their approach applied two plagiarism detection levels: paragraph and sentence. The paragraph level detects similar paragraphs in the two input documents based on the number of common unigrams and bigrams of these paragraphs. The sentence level aligns sentences in the resulting paragraph pairs based on the number of common unigrams between the two sentences. If the score of a sentence pair is higher than a predefined threshold, the SVM classifier is applied to determine whether the two sentences are similar or not. Finally, plagiarism passages are created by connecting adjacent sentences that were copied from the source documents.

Previous intelligent plagiarism approaches are limited to finding copied paragraphs based on sentence units, assuming that people only copy or rewrite whole sentences. However, existing cases of plagiarism are more complicated than that.
When comparing the plagiarism strings and the source one, we found that they can differ in: (i) the number of sentences; (ii) the sentence length; and (iii) the order in which the text appears. These situations are not yet resolved in existing research on plagiarism detection.

Recently, deep learning approaches have proven to be efficient in solving many tasks of natural language processing. However, as far as we know, the largest training corpus for the plagiarism detection task is still very small for the training phase. Therefore, in this paper, we propose a plagiarism detection system that takes advantage of hand-crafted feature vectors and a long short-term memory (LSTM) network model [12] to deal with the problems mentioned above. The system includes two main phases:
- passage-phase to figure out plagiarism passages in suspicious and source documents.
- word-phase to remove redundant parts from plagiarism passages to obtain the exact plagiarism strings.
The main contributions of this work are:
- We proposed new features at both the passage and word level to improve the accuracy in detecting similar strings between two documents.
  These features are: (i) maximize passage similarity, maximize passage intersection, and passage importance at the passage-phase; and (ii) word similarity, average word similarity, and sentence-based similarity at the word-phase.
- We proposed a two-phase plagiarism detection system based on a multi-layer LSTM network model using our proposed features to solve both literal and intelligent plagiarism problems.
The rest of the article is organized as follows: our proposed method is introduced in section 2. In section 3, we describe our experiments and analyze the results. Finally, our conclusions and future research directions are presented in section 4.

2. PROPOSED METHOD
The problem of finding similar strings between two documents is stated as in [13]:
Definition 1: Given two documents $d$ and $d'$, the goal is to detect a set of passage pairs, $P$, such that:

$$P = \{\langle p_d^i, p_{d'}^j \rangle \mid p_d^i, p_{d'}^j : p_d^i \in d \wedge p_{d'}^j \in d' \wedge |p_d^i \approx p_{d'}^j| > \alpha\} \tag{1}$$

in which $p_d^i$ is a string from $d$; $p_{d'}^j$ is a string from $d'$; $p_d^i \approx p_{d'}^j$ indicates the similarity between $p_d^i$ and $p_{d'}^j$; and $\alpha$ is a threshold that is used to determine whether two strings are similar enough to be considered as plagiarism.

PAN (plagiarism analysis, authorship identification, and near-duplicate detection), the series of shared-task competitions for plagiarism detection, has defined four types of plagiarism.
a. No obfuscation: Create plagiarism cases by copying a paragraph from the source document and inserting it into the suspicious one.
b. Random obfuscation: Create plagiarism cases by inserting, deleting, and changing the order of words in a paragraph of the source, and inserting it into the suspicious document.
c. Translation obfuscation: Create plagiarism cases by translating a paragraph more than once through several languages and back to the original language using different machine translation tools, then inserting the translated paragraph into the suspicious document.
d. Summary obfuscation: Create plagiarism cases by summarizing the source paragraph and inserting it into the suspicious document.
This paper aims at solving plagiarism cases belonging to all four types above. Our proposed system's workflow is shown in Figure 1 and includes three steps.
- Pre-processing: This step splits input documents into sentences, removes stop words and special characters, and combines short sentences into one.
- Passage-phase: After the pre-processing step, we use a context window sliding over the source and suspicious documents to create candidate passages. We extract features from these passages and generate an input feature matrix corresponding to these features. This matrix is fed into a binary classifier of the candidate selection module to obtain pairs of plagiarism passages.
- Word-phase: The pairs of plagiarism passages are used as the input for the word-phase. The purpose of this phase is to determine the exact plagiarism strings from the input passages. A binary classifier at the word level is used to perform this task.

Figure 1. Overview of the proposed system's workflow for plagiarism detection

2.1. Pre-processing
The input documents are split into sentences using the sentence tokenizer tool from the NLTK library. Then stop words are removed from these sentences. Some specific cases can affect the accuracy of plagiarism selection. These cases are:
- The input documents contain numbers that are written incorrectly, such as '8. 39' or '7 p. m.'. In this case, the sentence splitter incorrectly segments the text into sentences at the dot (.) character.
- After removing stop words, there are some short sentences containing no tokens or only one or two tokens. For example, the two sentences "Can you feel the burn?" and "Who we are?" are left with two words and no words, respectively, after removing stop words and punctuation characters. Since the similarities of short sentences do not carry much meaning, we combine the short sentences with surrounding sentences and compare the similarity between the passages after combining.
Therefore, to deal with the problems mentioned above, we first apply the sentence splitter and then remove stop words, numbers, and special characters from the sentences.
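The pre-processing steps of this subsection can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a naive regular-expression splitter stands in for the NLTK sentence tokenizer, `STOP_WORDS` is a tiny hypothetical list, and attaching a trailing short sentence to the previous one is our own choice. The three-token merge rule and the window of three sentences are the values this subsection reports.

```python
import re

# Tiny illustrative stop-word list (assumption); the paper relies on NLTK.
STOP_WORDS = {"a", "an", "the", "can", "you", "who", "we", "are",
              "is", "of", "to", "in", "and"}


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (NLTK stand-in)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def clean(sentence):
    """Lower-case, keep alphabetic tokens only, then drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def merge_short(token_lists, min_len=3):
    """Combine sentences with fewer than min_len tokens with the next sentence."""
    merged, carry = [], []
    for tokens in token_lists:
        carry = carry + tokens
        if len(carry) >= min_len:
            merged.append(carry)
            carry = []
    if carry:  # trailing short sentence: attach to the previous one (assumption)
        if merged:
            merged[-1] = merged[-1] + carry
        else:
            merged.append(carry)
    return merged


def sliding_passages(sentences, w=3):
    """Candidate passages: every run of w consecutive sentences."""
    if not sentences:
        return []
    if len(sentences) <= w:
        return [sentences]
    return [sentences[i:i + w] for i in range(len(sentences) - w + 1)]


text = ("Can you feel the burn? Who we are? "
        "Plagiarism detection compares suspicious documents. "
        "It aligns reused passages carefully.")
cleaned = [clean(s) for s in split_sentences(text)]
extended = merge_short(cleaned)
```

With this toy stop-word list, the two short example sentences are reduced to two tokens and zero tokens, and both end up merged into the sentence that follows them.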
After cleaning the text, sentences with fewer than three words are combined with the next sentence to create extended sentences. To the best of our knowledge, this combination step allows us to efficiently manage the passage's length after pairing and to avoid creating passages that are too long. We use a window of size w (sentences) sliding over both the suspicious and source documents to generate candidate plagiarism passages, which are used as inputs of the passage-phase. The optimal window size for the PAN datasets is three sentences.

2.2. Passage-phase
The input of this phase is the set of candidate plagiarism passages, each passage consisting of three consecutive sentences from the suspicious or source documents. In this phase, each passage is encoded as a semantic embedding vector. The semantic similarity between two passages is calculated based on the distance between these vectors. We use SBERT to encode passages, since it is shown in [14] that SBERT is better than other methods (e.g., Word2Vec [15], GloVe [16], fastText [17], InferSent [18], or Universal Sentence Encoder [19]) in various domains. Features representing each passage are derived from these passage vectors. They are then used as inputs for the binary classification at the passage level to detect whether two passages are similar or not.

2.2.1. Passage-phase feature extraction
Given the set of all candidate passages in the suspicious document $U = (u_1, u_2, \ldots, u_n)$ and the set of all candidate passages in the source document $V = (v_1, v_2, \ldots, v_m)$, each passage $u_i$ and $v_j$ is represented as a passage embedding vector. We propose the following features for this phase:

- Maximize passage similarity
This feature is used to determine the maximum similarity of a passage vector $u_i$ against a set of passage vectors $V$. Let $sim(u_i, v_j)$ be the similarity between two passage vectors $u_i$ and $v_j$, where $u_i \in U$, $v_j \in V$. Let $psim_{u_i,V}$ be the maximum passage similarity of the passage vector $u_i$ against the set of passage vectors $V$. It is calculated as:

$$psim_{u_i,V} = \max_{v_j \in V} sim(u_i, v_j) \tag{2}$$

The maximize passage similarity feature vector of all passage vectors in the pair of suspicious and source documents is determined by (3):

$$psim(U,V) = (psim_{u_1,V}, psim_{u_2,V}, \ldots, psim_{u_n,V}, psim_{v_1,U}, psim_{v_2,U}, \ldots, psim_{v_m,U}) \tag{3}$$

- Maximize passage intersection
To determine the maximum intersection value of a passage $u_i$ with a set of passages $V$, we split passages into words, find the intersection words of each passage pair $(u_i, v_j)$, with $u_i \in U$, $v_j \in V$, and take the maximum length of this intersection.
This value is calculated as in (4):

$$pinter_{u_i,V} = \max_{v_j \in V} |u_i \cap v_j| \tag{4}$$

The maximize passage intersection feature vector of all passages in the pair of suspicious and source documents is determined by (5):

$$pinter(U,V) = (pinter_{u_1,V}, pinter_{u_2,V}, \ldots, pinter_{u_n,V}, pinter_{v_1,U}, pinter_{v_2,U}, \ldots, pinter_{v_m,U}) \tag{5}$$

- Passage importance
Term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting scheme and is considered one of the most appropriate. TF-IDF is employed to get rid of terms with lower weights from documents and helps to increase the retrieval effectiveness. It is a numerical statistic that tells us how important a word is to a document in a collection or a corpus, and it is mostly used as a weighting factor in various processes used for information retrieval and text mining. To determine similar passages, we put forward the idea of term frequency-inverse sentence frequency (TF-ISF) [20]. We treat each passage as a document and each document as a corpus, then calculate the values of $TF(w,U)$, $TF(u_i,U)$, and $ISF(u_i,U)$, in which $w$ is a term in a passage $u_i$ and $U$ is the document containing $u_i$.
Given that $|u_i|$ is the total number of words in the passage $u_i$, $TF(u_i,U)$ is computed as:

$$TF(u_i,U) = \frac{\sum_{w \in u_i} TF(w,U)}{|u_i|} \tag{6}$$

$ISF(u_i,U)$ is computed by (7):

$$ISF(u_i,U) = \frac{\sum_{w \in u_i} ISF(w,U)}{|u_i|} \tag{7}$$

The passage importance of the passage $u_i$ in the document $U$ is determined by (8):

$$pimp_{u_i,U} = TF(u_i,U) \times ISF(u_i,U) \tag{8}$$

The passage importance feature vector of all passages in the pair of suspicious and source documents is determined by (9):

$$pimp(U,V) = (pimp_{u_1,U}, pimp_{u_2,U}, \ldots, pimp_{u_n,U}, pimp_{v_1,V}, pimp_{v_2,V}, \ldots, pimp_{v_m,V}) \tag{9}$$

- The feature matrix for the passage-phase
After extracting and creating the three feature vectors $psim(U,V)$, $pinter(U,V)$, and $pimp(U,V)$, we combine them into a two-dimensional matrix of size $(n+m) \times 3$, where $n+m$ is the total number of passages from the suspicious and source documents. The feature matrix for all passages in the pair of suspicious and source documents is determined as in (10). It is used as the input for the multi-layer LSTM network model, described in section 2.2.2.

$$f_{passage} = \begin{pmatrix} psim_1 & pinter_1 & pimp_1 \\ psim_2 & pinter_2 & pimp_2 \\ \vdots & \vdots & \vdots \\ psim_{n+m} & pinter_{n+m} & pimp_{n+m} \end{pmatrix} \tag{10}$$

2.2.2. Plagiarism passage selection
We build our binary classifier using a multi-layer LSTM network model, which is used to predict the probability of being a plagiarism passage in the pair of suspicious and source documents.
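The classifier's input is the feature matrix of (10). As a rough sketch of how such a matrix could be assembled, the code below computes the three passage-phase features in plain Python. Several points are assumptions rather than the paper's definitions: cosine similarity is used for the passage similarity, the term-level TF is a relative frequency within the document, the term-level ISF is log(N / number of passages containing the term), and the embedding vectors are passed in as plain lists where the paper's system would use SBERT outputs.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def psim(vecs_a, vecs_b):
    """Eq. (2): best similarity of each passage in A against all of B."""
    return [max(cosine(a, b) for b in vecs_b) for a in vecs_a]


def pinter(toks_a, toks_b):
    """Eq. (4): largest word overlap of each passage in A with any passage in B."""
    return [max(len(set(a) & set(b)) for b in toks_b) for a in toks_a]


def pimp(passages):
    """Eq. (6)-(8): TF-ISF importance, with assumed term-level TF and ISF."""
    n = len(passages)
    counts = Counter(w for p in passages for w in p)
    total = sum(counts.values())
    sf = Counter(w for p in passages for w in set(p))  # passages containing w
    scores = []
    for p in passages:
        if not p:
            scores.append(0.0)
            continue
        tf = sum(counts[w] / total for w in p) / len(p)
        isf = sum(math.log(n / sf[w]) for w in p) / len(p)
        scores.append(tf * isf)
    return scores


def feature_matrix(u_tok, u_vec, v_tok, v_vec):
    """Eq. (10): one row (psim, pinter, pimp) per passage, U rows first."""
    col_sim = psim(u_vec, v_vec) + psim(v_vec, u_vec)
    col_int = pinter(u_tok, v_tok) + pinter(v_tok, u_tok)
    col_imp = pimp(u_tok) + pimp(v_tok)
    return [list(row) for row in zip(col_sim, col_int, col_imp)]


# Toy data: two suspicious passages, one source passage, 2-d stand-in vectors.
U_tok = [["voyager", "space", "probe"], ["candy", "bar", "cost"]]
V_tok = [["voyager", "space", "probe", "planets"]]
U_vec = [[1.0, 0.0], [0.0, 1.0]]
V_vec = [[1.0, 0.0]]
matrix = feature_matrix(U_tok, U_vec, V_tok, V_vec)
```

The result has n+m rows and three columns, so it can be reshaped to (batch_size, 1, 3) before being passed to a recurrent model.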
Figure 2 shows the structure of our model at the passage-phase. At this phase, we generate the input vectors by reshaping the feature matrix $f_{passage}$ into a three-dimensional matrix of shape (batch_size, time_steps, seq_len) and feed them into the model. The parameters used in the LSTM model are: (i) batch_size equals the number of passages; (ii) time_steps equals 1; (iii) seq_len equals the number of features (seq_len=3).

Figure 2. The architecture of the multi-layer LSTM model at the passage-phase

The output of the sigmoid activation function is always in the range (0, 1). This function is applied to the output of all units in the last hidden LSTM layer. Let $y = (y_1, y_2, \ldots, y_{n+m})$ be the output of the binary classification model ($0 < y_i < 1$), where $n+m$ is the number of passages in the pair of suspicious and source documents. As Figure 3 shows, the final output of the model is a vector of 0s and 1s, with value 1 for every $y_i$ higher than a threshold θ and value 0 for the rest.
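The thresholding step can be sketched as below. The default value θ = 0.5 is an assumption, since no value is stated at this point in the text. The second helper finds the longest run of consecutive 1s, which is one plausible reading of how the positive labels are grouped into a passage; both function names are hypothetical.

```python
def binarize(probs, theta=0.5):
    """Map each probability to 1 if it exceeds the threshold, else 0."""
    return [1 if p > theta else 0 for p in probs]


def longest_run_of_ones(labels):
    """Return (start, end) of the longest run of 1s, end exclusive, or None."""
    best = None
    start = None
    # A trailing 0 is appended so that a run reaching the end is closed too.
    for i, value in enumerate(list(labels) + [0]):
        if value == 1 and start is None:
            start = i
        elif value != 1 and start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


labels = binarize([0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.6])
run = longest_run_of_ones(labels)
```

The returned index pair can be used to slice the list of candidate passages of one document.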
Figure 3. The output of the model at the passage-phase

Plagiarism passages are generated by selecting the sentences that correspond to the longest runs of consecutive 1s in the output of the model. When observing and analyzing the plagiarism passages obtained, we found that most plagiarism passages contain entire sentences. However, a plagiarism paragraph can contain several redundant words at its two ends, as in the following example from the PAN 2014 corpus. In this example, the underlined text (the span shared by both paragraphs, from "is perhaps the most productive" to "interstellar probe") is inside the plagiarism paragraph, whereas the rest is redundant.

The suspicious plagiarism paragraph:

The capsule was designed for entry into the Martian atmosphere, descent to the surface, impact survival, and surface lifetimes of as much as six months and contained the power, guidance, control communications, and data handling systems necessary to complete its mission. is perhaps the most productive space probe yet deployed, visiting four planets and their moons, including two primary visits to previously unexplored planets, with powerful cameras and a multitude of scientific instruments, at a fraction of the money later spent on specialized probes such as the and the probe. Along with , and Voyager 2 is an . Voyager 2 Galileo spacecraft Cassini-Huygens [2] [3] Pioneer 10 Pioneer 11 Voyager 1 New Horizons interstellar probe resident per year, or roughly half the cost of one candy bar each year since project inception.

The source plagiarism paragraph:

Voyager 2 unmanned interplanetary space probe Voyager program Voyager 1 Voyager 2 ecliptic Solar System Uranus Neptune gravity assist Saturn Voyager 2 Titan Planetary Grand Tour [1] is perhaps the most productive space probe yet deployed, visiting four planets and their moons, including two primary visits to previously unexplored planets, with powerful cameras and a multitude of scientific instruments, at a fraction of the money later spent on specialized probes such as the and the probe. Along with , , and Voyager 2 is an . Voyager 2 Galileo spacecraft Cassini-Huygens [2] [3] Pioneer 10 Pioneer 11 Voyager 1 New Horizons interstellar probe Contents Titan 3E Centaur was originally planned to be, part of the.

To solve this problem, we extend each pair of plagiarism passages from the suspicious and source documents by adding k sentences to the left and right of both passages. The extended passages are used as the input of the word-phase, which finds the exact plagiarism strings by removing redundant text from the extended plagiarism passages. The word-phase is introduced next.

2.3. Word-phase

To remove the redundant text at the two ends of the extended plagiarism passages, we need to identify semantically related segments based on consecutive words of high similarity. To capture the meaning of a word, we place that word in a window of size 3, with one word on its left and one word on its right. The text inside this window is used as the input of SBERT to create word feature vectors.

2.3.1. Word-level feature extraction

In this phase, three features are proposed based on the cosine similarity between the word and the sentence containing that word.
The word similarity feature is a vector that contains the maximum similarity value of each word. The maximum similarity of a word in the suspicious passage is the maximum of its similarities with every word in the source passage, and vice versa. The average word similarity and sentence based similarity features handle cases where the similarity value of a word differs greatly from those of the surrounding words. The average word similarity feature is a vector in which each item is the average of the word similarity values within the sentence. The sentence based similarity feature is a vector in which each item is the maximum of the sentence similarities of the sentence containing that word. The word-phase features are defined in detail as follows.

Given the extended suspicious passage $P = (p_1, p_2, \ldots, p_n)$ and the extended source passage $Q = (q_1, q_2, \ldots, q_m)$, each word $p_i$ and $q_j$ is represented by a word embedding vector.

- Word similarity

Let $sim(p_i, q_j)$ be the cosine similarity between the two word vectors $p_i$ and $q_j$. The word similarity feature between P and Q is a vector computed as in (11).
$$\mathit{wsim}(P,Q) = \Big( \max_{q_j \in Q} \mathit{sim}(p_1, q_j),\ \max_{q_j \in Q} \mathit{sim}(p_2, q_j),\ \ldots,\ \max_{p_j \in P} \mathit{sim}(q_m, p_j) \Big) \tag{11}$$

- Average word similarity

Let $w_i$ (with $i = 1, \ldots, n+m$) be the i-th word in the pair of suspicious and source passages, let d be the sentence such that $w_i \in d$, and let $|d|$ be the total number of words in the sentence d. Let $avg(w_i)$ denote the average similarity of word $w_i$ in the sentence d, and let $wsim(i)$ be the value of the i-th item in the word similarity feature vector. Then $avg(w_i)$ is computed as:

$$\mathit{avg}(w_i) = \frac{\sum_{w_k \in d} \mathit{wsim}(k)}{|d|} \tag{12}$$

The average word similarity feature between two passages P and Q is a vector determined by the following formula:

$$\mathit{wavg}(P,Q) = \big( \mathit{avg}(p_1), \mathit{avg}(p_2), \ldots, \mathit{avg}(p_n), \mathit{avg}(q_1), \mathit{avg}(q_2), \ldots, \mathit{avg}(q_m) \big) \tag{13}$$

- Sentence based similarity

We reuse the maximize passage similarity feature (as described in the passage-phase), with a sentence playing the role of the passage. Let $U = (u_1, u_2, \ldots, u_k)$ and $V = (v_1, v_2, \ldots, v_s)$ be the sets of sentences in the suspicious and source passages, respectively. Let $sim\_sent(p_i)$ denote the sentence based similarity of word $p_i$ in the sentence $u_j$. It is computed as:

$$\mathit{sim\_sent}(p_i) = \max_{v_l \in V} \mathit{sim}(u_j, v_l), \quad p_i \in u_j \tag{14}$$

The sentence based similarity feature between two passages P and Q is a vector determined by the following formula:

$$\mathit{wsent}(P,Q) = \big( \mathit{sim\_sent}(p_1), \ldots, \mathit{sim\_sent}(p_n), \mathit{sim\_sent}(q_1), \ldots, \mathit{sim\_sent}(q_m) \big) \tag{15}$$

- The feature matrix for the word-phase

After computing the three feature vectors wsim(P,Q), wavg(P,Q), and wsent(P,Q), we combine them into a two-dimensional matrix of size $(n+m) \times 3$:

$$f_{word} = \begin{pmatrix} \max_{q_j \in Q} \mathit{sim}(p_1, q_j) & \mathit{avg}(p_1) & \mathit{sim\_sent}(p_1) \\ \max_{q_j \in Q} \mathit{sim}(p_2, q_j) & \mathit{avg}(p_2) & \mathit{sim\_sent}(p_2) \\ \vdots & \vdots & \vdots \\ \max_{p_j \in P} \mathit{sim}(q_m, p_j) & \mathit{avg}(q_m) & \mathit{sim\_sent}(q_m) \end{pmatrix} \tag{16}$$

The feature matrix of all the extended plagiarism passages is determined by (16). This feature matrix is used as the input for the multi-layer LSTM model, described in section 2.3.2.

2.3.2. Plagiarism string selection

In this section, we conduct two processing steps: (i) select plagiarism sentences and (ii) remove redundant text.
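As a brief illustration of the input to these two steps, the word-phase feature matrix of (16) can be sketched in NumPy as below. This is a hedged sketch under stated assumptions, not the authors' code: the word and sentence embeddings are taken as precomputed arrays (random vectors stand in for SBERT output), `p_sent` and `q_sent` are hypothetical integer arrays giving the sentence index of each word, and all function names are illustrative.

```python
import numpy as np

def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def sentence_mean(values: np.ndarray, sent_ids: np.ndarray) -> np.ndarray:
    """Replace each value by the mean of the values in the same sentence, as in (12)."""
    out = np.empty_like(values, dtype=float)
    for s in np.unique(sent_ids):
        mask = sent_ids == s
        out[mask] = values[mask].mean()
    return out

def word_feature_matrix(P, Q, p_sent, q_sent, U, V):
    """Build the (n+m) x 3 matrix [wsim, wavg, wsent] of (16)."""
    n = len(P)
    word_sim = cosine_matrix(P, Q)                       # shape (n, m)
    # (11): best match of each suspicious word in Q, then of each source word in P.
    wsim = np.concatenate([word_sim.max(axis=1), word_sim.max(axis=0)])
    # (13): per-sentence average of the word similarity values.
    wavg = np.concatenate([sentence_mean(wsim[:n], p_sent),
                           sentence_mean(wsim[n:], q_sent)])
    # (15): each word inherits the best sentence-level match of its own sentence.
    sent_sim = cosine_matrix(U, V)                       # shape (k, s)
    wsent = np.concatenate([sent_sim.max(axis=1)[p_sent],
                            sent_sim.max(axis=0)[q_sent]])
    return np.stack([wsim, wavg, wsent], axis=1)

# Toy data: 5 suspicious words in 2 sentences, 3 source words in 1 sentence.
rng = np.random.default_rng(0)
P, Q = rng.normal(size=(5, 4)), rng.normal(size=(3, 4))
U, V = rng.normal(size=(2, 4)), rng.normal(size=(1, 4))
p_sent, q_sent = np.array([0, 0, 0, 1, 1]), np.array([0, 0, 0])
f_word = word_feature_matrix(P, Q, p_sent, q_sent, U, V)
```

Each row of `f_word` corresponds to one word, with the suspicious words first and the source words after them, matching the row order of (16).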
The details of each step are described as follows.

- Select plagiarism sentences

To select the exact plagiarism sentences from the extended plagiarism passages, we use a multi-layer LSTM model whose input is taken from the feature matrix f_word, as shown in Figure 4. The parameters used in this model are: (i) batch_size equals the number of words; (ii) time_steps equals 1; (iii) seq_len equals the number of features (seq_len = 3). In Figure 4, $p_i$ and $q_j$ denote the i-th and j-th words in the pair of extended plagiarism passages, $y\_pred = (y_1, y_2, \ldots, y_{n+m})$ is the output of the binary classification model ($0 < y_i < 1$), and $n+m$ is the total number of words in the pair of passages. The predicted mean value of a sentence u is computed as in (17):

$$\mathit{y\_pred\_sent} = \mathit{mean}(\mathit{y\_pred}) = \frac{\sum_{w_i \in u} y_i}{|u|} \tag{17}$$

where $w_i$ is a word in the sentence u. After computing the values y_pred_sent for all sentences, we create a vector whose size is the total number of sentences in the pair of plagiarism passages. If the value of y_pred_sent of a sentence is higher than a threshold β, the element corresponding to that sentence is 1; otherwise, it is 0. We select the longest runs of 1s as the plagiarism sentences.

Figure 4. The architecture of the multi-layer LSTM model at the word-phase

- Remove redundant text

To obtain the exact plagiarism strings, we examine the leftmost plagiarism sentence and the rightmost one, in which the difference between max_threshold and min_threshold is higher than $t_1$ ($t_1 = 0.4$). The max_threshold and min_threshold of a sentence u are determined by (18) and (19):

$$\mathit{max\_threshold} = \max_{w_i \in u} y_i \tag{18}$$

$$\mathit{min\_threshold} = \min_{w_i \in u} y_i \tag{19}$$

where $w_i$ is a word in the sentence u. Such sentences have one part inside and the remaining part outside the plagiarism passage.
The outside part is on the left (orient = 1) if the sentence is on the left of the plagiarism sentences, or on the right (orient = 2) if the sentence is on the right of the plagiarism sentences. If the result of the previous step contains only one sentence, the outside part lies at both ends (orient = 3) of the sentence. Analyzing the output vector y_pred of the LSTM model, we find that the predicted value $y_i$ of the inside words is much higher than the predicted value $y_j$ of the outside ones. Algorithm 1 is used to cut off the redundant text from these sentences. The idea of this algorithm is as follows: given a threshold α, find the longest stretch of text at the outer end of the leftmost sentence and of the rightmost one in which every word has a predicted value y_pred < α, and cut it off. We define the left and right positions as the first and last words of the exact plagiarism strings, respectively. The algorithm receives the following parameters as inputs:

- y_d: the predicted vector of the sentence, $y\_d = (y\_d_1, y\_d_2, \ldots, y\_d_t)$, where t is the number of words in the sentence.
- orient: determines whether the intersection position is on the left (orient = 1), on the right (orient = 2), or on both sides (orient = 3) of the boundary sentences.

Algorithm 1: Intersection position determination
Input: y_d, orient
1:  # orient = 1: left; orient = 2: right; orient = 3: both
2:  pos_left = 0; pos_right = length(y_d) - 1
3:  α = min(y_d) + (max(y_d) - min(y_d))/2
4:  if orient = 1 or orient = 3 then
5:    for i = 0 to length(y_d) - 1 do
6:      if y_d[i] > α then
7:        pos_left = i
8:        break
9:  if orient = 2 or orient = 3 then
10:   for i = length(y_d) - 1 downto 0 do
11:     if y_d[i] > α then
12:       pos_right = i
13:       break
Output: pos_left, pos_right

We initialize the left and right positions with the first and last points, respectively (line 2). The threshold α is the average of the maximum and minimum values of the y_d vector (line 3). We determine the left (line 4) and right (line 9) positions based on the orient value. For each direction, we scan the points (line 5 and line 10) and take the first point whose predicted value is higher than the threshold α (line 7 and line 12). These points are the results of the algorithm.

3. EXPERIMENT RESULTS AND DISCUSSION

In our experiment, we use the PAN 2013 text alignment training corpus [21] for training the system. This corpus is also the training corpus used in the PAN 2014 competition. The PAN 2013 corpus consists of 1000 no obfuscation, 1000 random obfuscation, 1000 translation obfuscation, and 1185 summary obfuscation pairs of documents.
Normally, this corpus is too small for training a deep learning model. Through our experiments, we show that our approach of combining hand-crafted features with the LSTM model is a good solution to this problem. To compare the performance of our system with state-of-the-art research on this task, we used the PAN 2014 text alignment test corpus [22] for evaluating the system.

3.1. Evaluation metrics

Our system was evaluated using a tool provided by PAN to measure system performance. The four measures used in PAN are macro-averaged Precision, Recall, Plagdet, and Granularity. The formulas to compute these values are as follows. Let S and R be the set of all plagiarism cases and the set of all plagiarism detections made by the system, and let s and r be a plagiarism case and a detection, respectively. Here $s \sqcap r$ denotes the text shared by s and r when r detects s, and is empty otherwise. The macro-averaged precision and recall are defined by:

$$\mathit{prec}(S,R) = \frac{1}{|R|} \sum_{r \in R} \frac{\left| \bigcup_{s \in S} (s \sqcap r) \right|}{|r|} \tag{20}$$

$$\mathit{rec}(S,R) = \frac{1}{|S|} \sum_{s \in S} \frac{\left| \bigcup_{r \in R} (s \sqcap r) \right|}{|s|} \tag{21}$$

The detection granularity of R under S indicates whether each plagiarism case $s \in S$ is detected as a whole or in several pieces.
It is calculated as:

$$\mathit{gran}(S,R) = \frac{1}{|S_R|} \sum_{s \in S_R} |R_s| \tag{22}$$

where $S_R \subseteq S$ are the cases detected by detections in R, and $R_s \subseteq R$ are the detections of a given s. Plagdet is the overall score of the system, which is calculated as:

$$\mathit{plagdet}(S,R) = \frac{2 \times \mathit{prec} \times \mathit{rec}}{\mathit{prec} + \mathit{rec}} \times \frac{1}{\log_2 \big( 1 + \mathit{gran}(S,R) \big)} \tag{23}$$

3.2. Experimental results and analysis

Several tests were carried out to choose the best configuration for our system. We performed experiments phase by phase to optimize the parameters of the system. Feature vectors extracted from pairs of documents in the PAN 2013 training corpus are passed to the multi-layer LSTM model during the training process. We chose binary_crossentropy as the loss function since the model is a binary classification model. The threshold θ, which is used to select sentences in the passage-phase, is set to 0.1. To choose the value k (mentioned in section 2.2.2) for extending plagiarism passages, we initialize k to 1 and keep increasing it until the system reaches the highest recall value. Experiments showed that the value of k depends on the length of the plagiarism passages, as shown in Table 1.
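The length-dependent choice of k and the resulting passage extension can be sketched as below. This is only an illustration: the length thresholds are read from Table 1, while the function names, the inclusive sentence-index convention, and the clamping to the document boundaries are assumptions not stated in the paper.

```python
def choose_k(num_sentences: int) -> int:
    """Pick the extension size k from the passage length, following Table 1."""
    if num_sentences >= 6:
        return 1
    if num_sentences >= 3:
        return 2
    if num_sentences >= 2:
        return 3
    return 4  # single-sentence passage

def extend_span(start: int, end: int, total_sentences: int) -> tuple[int, int]:
    """Extend an inclusive sentence span [start, end] by k sentences on each side.

    Clamping to [0, total_sentences - 1] is an assumption for passages
    that lie near the beginning or end of a document.
    """
    k = choose_k(end - start + 1)
    return max(0, start - k), min(total_sentences - 1, end + k)

# A one-sentence passage at the start of a 10-sentence document.
short_span = extend_span(0, 0, 10)
# A six-sentence passage near the end of a 12-sentence document.
long_span = extend_span(5, 10, 12)
```

Shorter passages receive a larger k, so a single detected sentence is widened more than a long passage.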
At the word-phase, instead of using thresholds to identify each word, we apply the threshold β (β = 0.1) to y_pred_sent. The LSTM output is converted into an array whose size is equal to the number of sentences. The value of an element of this array is 1 if y_pred_sent is higher than β, and 0 otherwise. We then select a continuous string with the highest predicted value. Table 2 shows the accuracy and loss values in the LSTM training phase with the four datasets in PAN 2013. To evaluate the effectiveness of our proposed features, we carried out experiments using only some of the features instead of all of them, with pairs of documents from the PAN 2014 test corpus as input. Figure 5 shows the effect of these features at the word-phase on the system output. The three pairs of panels, Figures 5(a) to 5(f), show the prediction results y_pred and the final results using 1, 2, and 3 features, respectively. In these figures, the blue line shows the predicted result; the red line shows the average predicted value by sentence. The green line separates the suspicious and source passages; the black line shows the range of the selected plagiarism passages.
The evaluation results showed that all the proposed features are useful and work well for both literal plagiarism and intelligent plagiarism.

Table 1. The dynamic parameter k for extending passages

     Plagiarism passage length   k
1    ≥6 sentences                1
2    ≥3 sentences                2
3    ≥2 sentences                3
4    1 sentence                  4

Table 2. Accuracy and loss values of the training phase

PAN 2013 training corpus   Sentence level          Word level
                           Accuracy    Loss        Accuracy    Loss
None Obfuscation           0.9925      0.0068      0.9808      0.0161
Random Obfuscation         0.9727      0.0814      0.9303      0.1909
Translate Obfuscation      0.9707      0.0748      0.9443      0.1229
Summary Obfuscation        -           -           0.9201      0.2096

Figure 5. Effects of selecting different features at the word-phase on the plagiarism passage: (a) using one feature, wsim(P,Q); (b) output result when using wsim(P,Q); (c) using two features, wsim(P,Q) and wavg(P,Q); (d) output result when using wsim(P,Q) and wavg(P,Q); (e) using three features, wsim(P,Q), wavg(P,Q), and wsent(P,Q); and (f) output result when using wsim(P,Q), wavg(P,Q), and wsent(P,Q)