The effects of corpus size and homogeneity on language model quality

Rose, Tony; Haddock, Nick and Tucker, Roger. 1997. 'The effects of corpus size and homogeneity on language model quality'. In: Fifth Workshop on Very Large Corpora. Beijing, China. [Conference or Workshop Item]

[img]
Preview
Text
w97-0118.pdf - Accepted Version
Available under License Creative Commons Attribution Non-commercial.

Download (1MB) | Preview

Abstract or Description

Generic speech recognition systems typically use language models that are trained to cope with a broad variety of input. However, many recognition applications are more constrained, often to a specific topic or domain. In cases such as these, a knowledge of the particular topic can be used to advantage. This report describes the development of a number of techniques for augmenting domain-specific language models with data from a more general source.
Two investigations are discussed. The first concerns the problem of acquiring a suitable sample of the domain-specific language data from which to train the models. The issue here is essentially one of
quality, since it is shown that not all domain-specific corpora are equal. Moreover, they can display significantly different characteristics that affect the quality of any language models built therefrom. These characteristics are defined using a number of statistical measures, and their significance for language modelling is discussed. The second investigation concerns the empirical development and evaluation of a set of language models for the task of email speech-u>-text dictation. The issue here is essentially one of quantity, since it is shown that effective language models can be built from very modestly sized corpora, providing the training data matches the target appfication. Evaluations show that a language model trained on only 2 million words can perform better than one trained on a corpus of over 100 times that size.

Item Type:

Conference or Workshop Item (Paper)

Departments, Centres and Research Units:

Computing

Dates:

DateEvent
1997Accepted

Event Location:

Beijing, China

Item ID:

30109

Date Deposited:

04 Jun 2021 13:26

Last Modified:

10 Jun 2021 03:24

URI:

https://research.gold.ac.uk/id/eprint/30109

View statistics for this item...

Edit Record Edit Record (login required)