Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans and to artificial processes of natural language processing.
Speech segmentation is an important subproblem of speech recognition, and cannot be adequately solved in isolation. As in most natural language processing problems, one must take into account context, grammar, and semantics, and even so the result is often a probabilistic division rather than a categorical one.
The lowest level of speech segmentation is the breakup and classification of the sound signal into a string of phones. The difficulty of this problem is compounded by the phenomenon of co-articulation of speech sounds, where one sound may be modified in various ways by the adjacent sounds: it may blend smoothly with them, fuse with them, split, or even disappear. This phenomenon may happen between adjacent words just as easily as within a single word.
The notion that speech is produced like writing, as a sequence of distinct vowels and consonants, is a relic of our alphabetic heritage. In fact, the way we produce vowels depends on the surrounding consonants, and the way we produce consonants depends on the surrounding vowels. For example, when we say 'kit', the [k] is farther forward than when we say 'caught'. Likewise, the vowel in 'kick' is phonetically different from the vowel in 'kit', though we normally do not hear this. In addition, there are language-specific changes that occur in casual speech and make it quite different from its spelling. For example, in English, the phrase 'hit you' could often be more appropriately spelled 'hitcha'. Therefore, even with the best algorithms, the result of phonetic segmentation will usually be very distant from the standard written language. For this reason, the lexical and syntactic parsing of spoken text normally requires specialized algorithms, distinct from those used for parsing written text.
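The 'hit you' case can be illustrated with a small sketch. It assumes ARPAbet-like phone symbols and a hypothetical two-entry rule table, `COALESCE`, that fuses a word-final /t/ or /d/ with a following word-initial /j/. The function name and the rule table are inventions for this illustration, and other casual-speech effects, such as the reduced final vowel of 'hitcha', are not modeled.

```python
# Toy illustration of sound fusion across a word boundary.
# Phone symbols are ARPAbet-like; the rule table is an assumption of this sketch.
COALESCE = {("T", "Y"): "CH", ("D", "Y"): "JH"}

def casual_phones(words):
    """Concatenate per-word phone lists, fusing boundary pairs listed in COALESCE."""
    out = []
    for phones in words:
        phones = list(phones)
        # If the last phone so far and the first phone of this word form a
        # fusable pair, replace both with the single fused phone.
        if out and phones and (out[-1], phones[0]) in COALESCE:
            out[-1] = COALESCE[(out[-1], phones[0])]
            phones = phones[1:]
        out.extend(phones)
    return out

hit = ["HH", "IH", "T"]
you = ["Y", "UW"]
print(casual_phones([hit, you]))  # ['HH', 'IH', 'CH', 'UW']
```

After the fusion, the phone string no longer contains any segment that marks where 'hit' ends and 'you' begins, which is why the word boundary cannot simply be read off the phonetic segmentation.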
Statistical models can be used to segment and align recorded speech to words or phones. Applications include automatic lip-synch timing for cartoon animation, follow-the-bouncing-ball video subtitling, and linguistic research. Automatic segmentation and alignment software is commercially available.
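One common way to align a known transcript to audio is dynamic programming over per-frame phone scores. The following is a minimal sketch of that idea, not a description of any particular product. It assumes that some acoustic model has already produced, for each audio frame, a log-probability for each phone; the numbers in `frames` are invented, and `forced_align` and `lp` are hypothetical names.

```python
import math

def forced_align(frame_logp, phones):
    """Assign every frame to one phone of `phones`, in order, maximizing total log-probability.

    frame_logp: list of dicts, one per frame, mapping phone -> log-probability.
    phones: the known phone sequence (the transcript).
    Returns a list of (phone, start_frame, end_frame_exclusive).
    """
    T, N = len(frame_logp), len(phones)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phone")
    NEG = -math.inf
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = frame_logp[0][phones[0]]
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]                        # remain in phone j
            move = score[t - 1][j - 1] if j > 0 else NEG  # advance from phone j-1
            best, prev = (stay, j) if stay >= move else (move, j - 1)
            if best > NEG:
                score[t][j] = best + frame_logp[t][phones[j]]
                back[t][j] = prev
    # Trace back from the last phone at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    # Collapse the frame-level path into (phone, start, end) segments.
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[t - 1]:
            segments.append((phones[path[start]], start, t))
            start = t
    return segments

def lp(**probs):
    """Helper for the example: convert probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}

# Invented frame scores for a six-frame recording of "cat" (K AE T).
frames = [
    lp(K=0.8, AE=0.1, T=0.1),
    lp(K=0.7, AE=0.2, T=0.1),
    lp(K=0.1, AE=0.8, T=0.1),
    lp(K=0.1, AE=0.7, T=0.2),
    lp(K=0.1, AE=0.6, T=0.3),
    lp(K=0.1, AE=0.1, T=0.8),
]
print(forced_align(frames, ["K", "AE", "T"]))
# [('K', 0, 2), ('AE', 2, 5), ('T', 5, 6)]
```

Multiplying the frame indices by the frame duration would give the timings needed for applications such as lip-synch or subtitle highlighting.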
In all natural languages, the meaning of a complex spoken sentence (which often has never been heard or uttered before) can be understood only by decomposing it into smaller "lexical segments" (roughly, the words of the language), associating a meaning to each segment, and then combining those meanings according to the grammar rules of the language. The recognition of each lexical segment in turn requires its decomposition into a sequence of discrete "phonetic segments" and mapping each segment to one element of a finite set of elementary sounds (roughly, the phonemes of the language); the meaning then can be found by standard table lookup algorithms.
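The two-step recognition of a single lexical segment described above can be sketched as follows. The tables `PHONEME_OF` and `LEXICON` and the function `recognize` are hypothetical and tiny; they only illustrate the mapping from phonetic segments to phonemes followed by a table lookup, under the simplifying assumption that the segment boundaries are already known.

```python
# Hypothetical allophone-to-phoneme table (IPA symbols); a real inventory is far larger.
PHONEME_OF = {
    "kʰ": "k", "k": "k",
    "tʰ": "t", "t": "t", "ʔ": "t",  # in this toy table, the glottal stop counts as /t/
    "ɪ": "ɪ", "æ": "æ",
}

# Hypothetical lexicon keyed by phoneme sequence.
LEXICON = {
    ("k", "ɪ", "t"): "kit",
    ("k", "æ", "t"): "cat",
    ("t", "ɪ", "k"): "tick",
}

def recognize(phones):
    """Map each phonetic segment to a phoneme, then look the sequence up."""
    try:
        phonemes = tuple(PHONEME_OF[p] for p in phones)
    except KeyError:
        return None  # a phone outside the toy inventory
    return LEXICON.get(phonemes)

print(recognize(["kʰ", "ɪ", "ʔ"]))   # kit
print(recognize(["kʰ", "æ", "tʰ"]))  # cat
```

The lookup itself is trivial; the hard part in practice is producing the phone sequence and deciding where each lexical segment starts and ends.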
For most spoken languages, the boundaries between lexical units are surprisingly difficult to identify. One might expect that the inter-word spaces used by many written languages, like English or Spanish, would correspond to pauses in their spoken version; but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word.
Moreover, an utterance can have different meanings depending on how it is split into words. A popular example, often quoted in the field, is the phrase "How to wreck a nice beach", which sounds very similar to "How to recognize speech". As this example shows, proper lexical segmentation depends on context and semantics, which draw on the whole of human knowledge and experience, and would thus require advanced pattern recognition and artificial intelligence technologies to be implemented on a computer.
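The ambiguity can be made concrete by enumerating every way a phone string can be split into dictionary words. The sketch below is an illustration only: the "wreck a nice beach" pair is merely near-homophonous, so the example substitutes the exactly homophonous pair "I scream" and "ice cream". The ARPAbet-like symbols, the four-entry `LEXICON`, and the function `segmentations` are assumptions made for this example.

```python
from functools import lru_cache

# Hypothetical pronunciation lexicon: phone sequence -> word.
LEXICON = {
    ("AY",): "I",
    ("AY", "S"): "ice",
    ("S", "K", "R", "IY", "M"): "scream",
    ("K", "R", "IY", "M"): "cream",
}

def segmentations(phones):
    """Return every split of `phones` into consecutive words found in LEXICON."""
    phones = tuple(phones)
    max_len = max(len(key) for key in LEXICON)

    @lru_cache(maxsize=None)
    def from_pos(i):
        # All word sequences that exactly cover phones[i:].
        if i == len(phones):
            return ((),)
        results = []
        for j in range(i + 1, min(len(phones), i + max_len) + 1):
            word = LEXICON.get(phones[i:j])
            if word is not None:
                results.extend((word,) + rest for rest in from_pos(j))
        return tuple(results)

    return [list(s) for s in from_pos(0)]

print(segmentations(["AY", "S", "K", "R", "IY", "M"]))
# [['I', 'scream'], ['ice', 'cream']]
```

Both splits are valid with respect to the lexicon, so nothing in the sound sequence alone chooses between them; a system has to rank the candidates using context, for example with word-sequence probabilities.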
This problem overlaps to some extent with the problem of text segmentation that occurs in some languages which are traditionally written without inter-word spaces, like Chinese and Japanese. However, even for those languages, text segmentation is often much easier than speech segmentation, because the written language usually has little interference between adjacent words, and often contains additional clues not present in speech (such as the use of Chinese characters for word stems in Japanese).
Alexander Faaborg, Waseem Daher, José Espinosa, and Henry Lieberman. "How to wreck a nice beach you sing calm incense". International Conference on Intelligent User Interfaces (IUI 2005), San Diego (2005).
* [http://www.sprex.com/phonolyze "Phonolyze" speech segmentation software]
Wikimedia Foundation. 2010.