An Internet agent for language model construction

Wyard, Peter and Rose, Tony. 1997. 'An Internet agent for language model construction'. In: Recent Advances in Natural Language Processing. Bulgaria. [Conference or Workshop Item]

Text - Accepted Version
Available under License Creative Commons Attribution.

Abstract or Description

A software agent is described that takes a seed (reference) corpus specified by the user, searches the Internet for documents sufficiently similar to it (as judged by a set of similarity metrics operating at several levels of the text), and augments the seed corpus with these documents. The size of the corpus, and ideally the quality of the language model derived from it, thus increases progressively. The seed corpus may be quite small, for example a collection of transcripts from the application domain that can be gathered with minimal effort. Preliminary results are given for the perplexity of language models constructed using this approach. The method potentially has applications well beyond speech recognition, in corpus-based language processing in general and in document retrieval.
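
The loop described in the abstract (compare candidate documents with the seed corpus, keep the similar ones, then measure perplexity) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the abstract does not name its similarity metrics, so a single bag-of-words cosine similarity stands in for them; the names `tokenize`, `cosine_similarity`, `augment` and `unigram_perplexity` and the `threshold` value of 0.3 are hypothetical; candidate documents are passed in as strings rather than fetched from the Internet; and the language model is a simple add-one-smoothed unigram model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def augment(seed_docs, candidates, threshold=0.3):
    """Return the seed documents plus every sufficiently similar candidate.

    Assumption: similarity is always measured against the original seed
    counts; an agent that grows the corpus progressively might instead
    update the reference after each accepted document.
    """
    seed_counts = Counter(t for d in seed_docs for t in tokenize(d))
    corpus = list(seed_docs)
    for doc in candidates:
        if cosine_similarity(seed_counts, Counter(tokenize(doc))) >= threshold:
            corpus.append(doc)
    return corpus


def unigram_perplexity(train_docs, test_doc):
    """Perplexity of test_doc under an add-one-smoothed unigram model.

    Simplification: the vocabulary includes the test tokens, so unseen
    words receive a small non-zero probability.
    """
    counts = Counter(t for d in train_docs for t in tokenize(d))
    test_tokens = tokenize(test_doc)
    if not test_tokens:
        raise ValueError("test_doc contains no tokens")
    vocab_size = len(set(counts) | set(test_tokens))
    total = sum(counts.values())
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in test_tokens
    )
    return math.exp(-log_prob / len(test_tokens))


if __name__ == "__main__":
    seed = ["please book a flight to london", "i want a flight to paris"]
    found = ["book a cheap flight to rome", "the stock market fell sharply today"]
    corpus = augment(seed, found)
    print(len(corpus), "documents in the augmented corpus")
    print("perplexity:", unigram_perplexity(corpus, "book a flight to madrid"))
```

With toy data this small, the perplexity figure says little about model quality; the sketch only shows where a similarity filter and a perplexity measurement would sit in such a pipeline.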

Item Type:

Conference or Workshop Item (Paper)

Date Deposited:

04 Jun 2021 13:23

Last Modified:

10 Jun 2021 03:23

