Abstract

Automatic Speech Recognition (ASR) systems have proven useful for everyday tasks when paired with systems such as personal AI assistants. Many industries require ASR models trained for their own domains. Music on Demand (MoD) over Interactive Voice Response (IVR) is one such industry, in which the user interacts with a dialogue system to play music using voice commands alone. Because systems trained on public datasets are generic rather than domain-specific, a domain-adapted model is expected to perform well in this setting. To train an ASR system for MoD, we experiment with a classical HMM-based approach and with DeepSpeech2 on the VoxForge dataset. We then fine-tune the DeepSpeech2 model on MoD data. With very limited data and little fine-tuning, we achieve a Word Error Rate (WER) of 14.727%.