Javanese and Sundanese speech recognition using Whisper

Computer Science and Information Technologies

Abstract

Automatic speech recognition (ASR) technology is essential for advancing human-computer interaction, particularly in a linguistically diverse country like Indonesia, where approximately 700 native languages are spoken, including widely used languages such as Javanese and Sundanese. This study fine-tunes the pre-trained Whisper Small model, an end-to-end transformer pretrained on 680,000 hours of multilingual speech, to improve ASR performance for these low-resource languages. The primary goal is to increase transcription accuracy and reliability for Javanese and Sundanese, which have historically had limited ASR resources. Approximately 100 hours of speech were selected from OpenSLR, covering both read and conversational prompts. The data exhibited dialectal variation, ambient noise, and incomplete demographic metadata, necessitating normalization and fixed-length padding. Models were evaluated using the word error rate (WER) metric. Unlike approaches that combine separate acoustic encoders with external language models, Whisper's unified architecture streamlines adaptation for low-resource settings. Evaluated on held-out test sets, the fine-tuned models achieved WERs of 14.97% for Javanese and 2.03% for Sundanese, substantially outperforming baseline systems. These results demonstrate Whisper's effectiveness in low-resource ASR and highlight its potential to enhance transcription accuracy, support language preservation, and broaden digital access for underrepresented speech communities.
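
For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the study's actual scoring tool or text normalization, so this should not be read as the authors' implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and not necessarily the normalization used in the study.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens, not data from the study: one substitution and one
    # deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 2.03% corresponds to roughly two word-level errors per hundred reference words. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.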
