Word embedding for contextual similarity using cosine similarity

Indonesian Journal of Electrical Engineering and Computer Science

Abstract

Perspectives on technology often share similarities in certain contexts, such as information systems and informatics engineering. The opinion data come from the Quora application, limited to posts from the last five years. This research aims to implement IndoBERT, a variant of the bidirectional encoder representations from transformers (BERT) model optimized for the Indonesian language, to classify opinions on information system (IS) and information technology (IT) topics. The dataset consists of 414 original records, which grow to 828 after augmentation with the synonym replacement method. Data augmentation evaluates model performance by substituting synonyms and rearranging text while preserving meaning and structure. The approach labels each opinion text based on the cosine similarity of token embeddings from the IndoBERT model; IndoBERT is then applied to classify the reviews. The experimental results show that using IndoBERT to classify IS and IT topics by contextual similarity achieves 90% accuracy based on the confusion matrix. These results demonstrate the strong potential of transformer-based language models such as IndoBERT for analyzing comments and related topics in Indonesian.
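The cosine-similarity labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the reference-vector labeling scheme and the function names are assumptions, and the toy NumPy vectors stand in for embeddings that would in practice come from an IndoBERT encoder.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_by_similarity(text_vec: np.ndarray,
                        is_ref_vec: np.ndarray,
                        it_ref_vec: np.ndarray) -> str:
    """Assign the topic label whose reference embedding is most similar.

    In the paper's setting, text_vec and the reference vectors would be
    token embeddings produced by IndoBERT; here they are placeholders.
    """
    sim_is = cosine_similarity(text_vec, is_ref_vec)
    sim_it = cosine_similarity(text_vec, it_ref_vec)
    return "IS" if sim_is >= sim_it else "IT"

# Toy example: a vector closer to the IS reference gets the IS label.
is_ref = np.array([1.0, 0.0])
it_ref = np.array([0.0, 1.0])
print(label_by_similarity(np.array([0.9, 0.2]), is_ref, it_ref))
```

In a real pipeline, each opinion text would first be tokenized and encoded (for example with a pretrained IndoBERT checkpoint), and the resulting embeddings would replace the toy vectors before the similarity comparison.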
