Automatic Speech Recognition (ASR) systems transcribe speech to text. Their practical applications range from dictation tools that make communication easier for people with hearing and motor impairments to low-cost indexing and search of audiovisual content. As a building block in larger machine learning systems, ASR plays a crucial role in many commercial products, such as digital voice assistants.
Many modern ASR systems are implemented as (almost) purely data-driven, end-to-end Deep Learning models. These systems show impressive results in many domains, comparable to or even surpassing human performance. Unfortunately, these techniques often struggle when tasked with transcribing low-resource languages, especially in real-life situations. Despite the term “end-to-end”, they end up relying heavily on both an external language model and a large beam search to achieve decent results.
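For concreteness, the external language model typically enters through a log-linear combination of scores during or after beam search; a common shallow-fusion style formulation (notation ours, not tied to any particular system) is

\[
\hat{y} \;=\; \arg\max_{y \in \text{beam}} \Big( \log p_{\text{ASR}}(y \mid x) \;+\; \lambda \, \log p_{\text{LM}}(y) \Big),
\]

where x is the input audio, y a candidate transcript, and λ a tunable interpolation weight.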
Pre-trained attention models such as BERT (Bidirectional Encoder Representations from Transformers) have advanced the state of the art across many natural language processing tasks in the past few years. Several ways of integrating BERT-like models into speech recognition systems have been proposed. However, research so far has been limited to high-resource domains.
Turning our attention to low-resource domains, we introduce a data-efficient fine-tuning strategy for BERT. By teaching BERT to disambiguate good and bad transcripts, we train it to exploit conversational context when rescoring beam search results. We show how this improves performance over a robust baseline system in two distinct, specialized domains: formal parliamentary debates and customer service calls. These domains are low-resource both in terms of language (Norwegian) and of speech and linguistic characteristics. We also investigate adding a diversity bonus to the beam search so that it produces a richer variety of candidate transcripts for rescoring.
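As a rough illustration of the rescoring setup (not the exact fine-tuning recipe), the sketch below assumes a BERT-style model with a single-output classification head, loaded through the HuggingFace transformers API from a hypothetical fine-tuned checkpoint; it scores each (conversational context, candidate transcript) pair and combines that score with the original beam-search score before re-ranking.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint: a BERT model fine-tuned to separate good
# from bad transcripts, with a single regression-style output.
MODEL_NAME = "path/to/fine-tuned-rescoring-bert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def rescore(context, nbest, asr_scores, weight=0.5):
    """Re-rank beam-search candidates by combining the ASR score with a
    BERT score computed over the (conversational context, candidate) pair."""
    rescored = []
    with torch.no_grad():
        for hyp, asr_score in zip(nbest, asr_scores):
            # Encode context and candidate as a sentence pair.
            inputs = tokenizer(context, hyp, return_tensors="pt", truncation=True)
            bert_score = model(**inputs).logits.squeeze().item()
            rescored.append((hyp, asr_score + weight * bert_score))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy usage: a preceding utterance serves as context for three candidates.
context = "The chair opens the debate on the budget proposal."
hypotheses = ["the meeting is adjourned",
              "the meeting is a journey",
              "them eating is adjourned"]
print(rescore(context, hypotheses, asr_scores=[-3.2, -3.1, -3.4])[0][0])
```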