The signal processing may be performed in a standard way: a digitizing frequency of 22,050 Hz, a 256-point FFT, and 26 mel-scale cepstrum coefficients obtained for each 11.6-ms segment of the speech signal (256 samples at 22,050 Hz). Consecutive time segments overlap by 50%. Some of the most commonly used connectionist models for speech recognition are MLP, SOM, time-delay networks, and recurrent networks. These models are discussed and illustrated below. Their use depends on the type of recognition performed, for example, whole-word recognition or subword recognition such as phoneme recognition.
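The framing arithmetic above can be checked with a minimal sketch. It covers only the segmentation step (frame length and 50% overlap), not the FFT or the mel-scale cepstrum computation. The function names are hypothetical, and dropping an incomplete final frame is an assumption made here for simplicity; the text does not say how the end of the signal is handled.

```python
SAMPLE_RATE_HZ = 22_050   # digitizing frequency from the text
FRAME_LENGTH = 256        # samples per frame, matching the 256-point FFT
OVERLAP = 0.5             # 50% overlap between consecutive frames


def frame_duration_ms(frame_length=FRAME_LENGTH, sample_rate=SAMPLE_RATE_HZ):
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_length / sample_rate


def split_into_frames(signal, frame_length=FRAME_LENGTH, overlap=OVERLAP):
    """Split a sequence of samples into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption; zero-padding is another common choice).
    """
    hop = int(frame_length * (1.0 - overlap))  # 128 samples for 50% overlap
    if hop <= 0:
        raise ValueError("overlap must be smaller than 1")
    last_start = len(signal) - frame_length
    return [signal[start:start + frame_length]
            for start in range(0, last_start + 1, hop)]


if __name__ == "__main__":
    one_second = [0.0] * SAMPLE_RATE_HZ  # one second of silence as a stand-in
    frames = split_into_frames(one_second)
    print(f"frame duration: {frame_duration_ms():.1f} ms")
    print(f"frames in one second: {len(frames)}")
```

With these settings a frame lasts about 11.6 ms and a new frame starts every 128 samples (about 5.8 ms), so each frame would yield one vector of 26 coefficients for the recognizer.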
Phoneme recognition is a difficult problem for three reasons: the variation in the pronunciation of phonemes; the time alignment problem (phonemes are not pronounced in isolation); and the coarticulation effect, that is, the frequency characteristics of an allophonic realization of the same phoneme may differ depending on the context of the phoneme in different spoken words (see chapter 1).
There are two approaches to using MLP for the task of phoneme recognition: (1) using one large MLP whose outputs cover all the possible phonemes, and (2) using many small networks, each specialized to recognize a single phoneme or a small group of phonemes (e.g., vowels, consonants, fricatives, plosives). A