
The Kiel Intonation Model (KIM),
its Implementation in TTS Synthesis and its Application to the Study of Spontaneous Speech

Klaus J. Kohler, IPdS Kiel, Germany

  1. Introduction
  2. KIM - The Kiel Intonation Model
  3. Linguistic environment and symbolic input to the model
  4. Implementation of the model in the RULSYS/INFOVOX TTS system for German
  5. References
  6. Acknowledgement

This document is based on a paper presented at the ATR Workshop on Computational Modeling of Prosody for Spontaneous Speech Processing, Kyoto/Japan, 12-14 April 1995

Abstract

Following on from general considerations of requirements for prosodic modelling, this paper outlines the Kiel Intonation Model for German, incorporating stress and intonation, timing and articulatory reduction. It puts the model into a linguistic environment and sets out the prosodic notation system. This symbolization framework is also the basis for prosodic labelling of speech data. The paper sketches the implementation of the model in the RULSYS/INFOVOX TTS system for German including the generation of spontaneous speech. The categories of the model are demonstrated by audio examples of sentences and texts in KIM-based TTS synthesis. The integrated framework of prosodic research at Kiel is illustrated by an excerpt from a spontaneous dialogue recording, its prosodic label file and its TTS simulation.




1 Introduction

The modelling of prosody has to take the following points into account.

  1. Prosodic universals
    The study of prosody has grown out of dealing with individual languages, especially with English, more than with any other language. Categories and operations (e.g. prosodic rules) are - to a large extent - determined by the particular linguistic structures. What we need for a general prosodic theory, however, are independently motivated categories and operations. Candidates are pitch direction (falling, rising) and synchronization of pitch 'peaks' and 'valleys' with syllable timing, in each case independently of the functional use they may be put to in individual languages (e.g. tone or intonation), further, the prominence-lending features of pitch movement and segment duration for the functional use of sentence stress (focus), and timing factors at various levels (global speech rate, utterance-final lengthening, stress/syllable timing).

  2. A unified theory integrating segmental and prosodic aspects, as well as intonation and timing among the latter
    We have become used to thinking in and dealing with dichotomies in the study of speech: segmentals versus prosody, intonation versus timing. It is quite clear that all these levels of description form an intricate mutually conditioning network. Prosody (e.g. stress, timing, especially speech rate) provides conditioning factors for articulatory reduction; on the other hand segmental structures determine the manifestation of prosodic categories (different synchronization of pitch 'peaks' and 'valleys' in high/low or short/long vowels, curtailing of falls in pitch 'peaks' before voiceless consonants). Global utterance timing has not only an influence on individual segment durations, but also on their qualitative realization and on the manifestation of pitch patterns (upward scaling of F0 and reduction of F0 range in increased speed). Contrariwise, segmental features determine their timing in global utterance speed: vowel and consonant durations are adjusted differently, not by a uniform proportionate factor across all segments; depending on the segmental type and context, shortening in fast speech also involves assimilation and elision of articulatory gestures over and above their changes in timing.

  3. A prosodic phonology as an interlevel between syntax/semantics/pragmatics and the phonetic signal: phonetic substance - phonological form - linguistic function
    The phonetic-semantic relationship is not direct in the sense that the measured values themselves represent syntactic or semantic categories, but the link operates via formal elements that, on the one hand, are related to features of meaning, but are, on the other hand, defined by phonetic ranges. This phonetic substantiation of phonological categories is just as essential as the recognition of structure in phonetic substance. Both phonetic substance and phonetic structure (or signal measures and phonological form) are required for an adequate description of the phonetic-syntactic/semantic relationship, and consequently for prosodic modelling.

  4. Integration of prosodic modelling into a linguistic environment
    It follows from point 3 that prosodic modelling requires strong links with syntactic, semantic, and, particularly in the case of spontaneous speech, also pragmatic levels.



2 KIM - The Kiel Intonation Model

2.0 Overview

This chapter gives a brief outline of a prosody model which we have been developing at Kiel for German (for further details see Kohler 1991a,b, 1997a,b). It incorporates the following domains:

(1) lexical stress - three levels: unstressed, secondary stress in compounds, and primary stress
(2) sentence stress - four levels: reinforced, neutral, partially and completely deaccented
(3) intonation:
(a) categories of pitch 'peaks' and 'valleys' as well as their combinations at each sentence stress position
(b) types of pitch category concatenation
(c) pitch of 'pre-head' before first sentence stress: low - high
(4) synchronization of pitch 'peaks' and 'valleys' with stressed syllables: early, medial, late
(5) prosodic boundaries (degrees of cohesion) - three variables: pause duration, phrase-final segmental lengthening, scaling of F0 end points
(6) overall speech rate between the utterance beginning and successive prosodic boundaries - four degrees: slow, medium, reduced, fast
(7) downstep of successive pitch 'peaks'/'valleys' and pitch 'reset'.

A system of prosodic distinctive features is used to specify abstract symbolic, phonological categories in these domains; they enter into sets of ordered symbolic rules. The features are either graded or binary and determine the parametric value spaces activated by parametric phonetic rules following the symbolic ones. The prosodic features are attributed to phonological units, which are either segmental (vowels and consonants) or non-segmental (morphological and phrase boundaries). Attached to vowels is the fundamental distinction within the German prosodic system, viz. stress and intonation.




2.1 Stress

Within stress we have to differentiate between lexical and sentence stress. At the abstract level of phonological specifications in the lexicon, every German word has at least one vowel that has to be marked as potentially stressable, as being able to attract the feature specifications of sentence stress. Lexical stress is thus not a distinctive stress feature, it only marks a position that can attract such a feature at the sentence level, but need not.

Vowels receive combinations of the stress features <+/-FSTRESS> and <+/-DSTRESS> (referring to the association of sentence stress with the two important parameter domains of F0 and duration). In sentence-stressed words the lexically stressed vowel is <+FSTRESS,+DSTRESS>, in unstressed content words <-FSTRESS,+DSTRESS>; in unstressed function words as well as in lexically unstressed syllables that do not get sentence stress the combination is <-FSTRESS,-DSTRESS>. In partially deaccented sentence stresses <+DEACC> is added to the two positive stress features; all other vowels are <-DEACC>. Words that are to get additional emphasis receive the feature <+EMPH> in their lexical stress position, all other vowels <-EMPH>.

Whether <+DSTRESS>, responsible for longer duration, is associated with <+FSTRESS>, marking the vowel as the recipient of intonation features ('peak' and 'valley' contours), or as <-FSTRESS>, not providing the vowel with this potential, depends on the rules of grammar and context of situation in speech communication. They have to be supplied by the linguistic environment of the prosodic phonology (see Section 3). The same applies to the attribution of <+DEACC>.

To distinguish degrees of emphasis, <+EMPH> vowels are given the graded stress level feature <@STRLEV>, with @ = 1,2,...7; <-EMPH> vowels are <0STRLEV>. These vowels are made the more prominent, the higher the stress level. In F0 'peak' contours, this greater prominence is achieved by raising the F0 maximum and, if the 'peak' is nonfinal in a 'peak' series, by having a faster descent as well as by lowering the F0 minimum between 'peaks', proportionally to stress level. In the case of F0 'valley' contours, the final F0 point is raised in accordance with stress level. Emphasis is used to put words and phrases within sentences in focus, particularly when the expansion of intonation contours on certain structural elements is coupled with the deaccentuation of others.

In summary, the following sentence stress features are proposed in a prosodic phonology of German for a comprehensive contrastive categorization:
<+/-FSTRESS>
<+/-DSTRESS>
<+/-DEACC>
<+/-EMPH>
<@STRLEV>, with @ = 0, 1, ...7.
The feature pair <+/-EMPH> constitutes the link with the intonation features.
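The feature assignment described in this section can be illustrated with a minimal Python sketch. The function name, its arguments and the dictionary representation are assumptions made for illustration only; they are not part of KIM or of its TTS implementation.

```python
from typing import Dict, Union

Feature = Union[bool, int]


def stress_features(sentence_stressed: bool,
                    content_word: bool,
                    partially_deaccented: bool = False,
                    emphasis_level: int = 0) -> Dict[str, Feature]:
    """Hypothetical feature bundle for the lexically stressed vowel of a word."""
    if not 0 <= emphasis_level <= 7:
        raise ValueError("STRLEV must lie between 0 and 7")
    fstress = sentence_stressed
    # Unstressed content words keep <+DSTRESS>; unstressed function words do not.
    dstress = sentence_stressed or content_word
    # <+DEACC> is only added to the two positive stress features.
    deacc = partially_deaccented and fstress and dstress
    return {
        "FSTRESS": fstress,
        "DSTRESS": dstress,
        "DEACC": deacc,
        "EMPH": emphasis_level > 0,   # <+EMPH> vowels carry STRLEV 1..7
        "STRLEV": emphasis_level,     # <-EMPH> vowels are <0STRLEV>
    }
```

In this sketch, an unstressed content word yields <-FSTRESS,+DSTRESS> and an unstressed function word <-FSTRESS,-DSTRESS>, as stated above.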




2.2 Intonation

2.2.1 Pitch categories at sentence stresses

All vowels with 'primary' or 'secondary' (= deaccented) sentence stress, i.e. with the feature specification <+FSTRESS,+DSTRESS>, receive intonation features. These may be either 'valleys' or 'peaks', specified as <+/-VALLEY>. 'Peaks' (<-VALLEY>) may contain a unidirectional F0 fall, classified as <+TERMIN>, or rise again at the end, resulting in a (rise-)fall-rise, categorized as <-TERMIN>. 'Valleys' may have a low rise, to indicate, e.g., continuation, or a high rise, used, e.g., in questions, with the specifications <+/-QUEST>. All 'peaks' and 'valleys' may have their turning points (F0 maximum in 'peaks' or F0 minimum in 'valleys') early or later with reference to the onset of <+VOK,+FSTRESS>, categorized as <+/-EARLY>. Finally, for 'peaks', <-EARLY> may be around the stressed vowel centre or towards its end, classified by the feature opposition <+/-LATE>. The categorization of <-VALLEY> into <+EARLY> and <-EARLY>, with a further subdivision of the latter into <+/-LATE>, captures the grouping of 'late' and 'medial' vs. 'early peaks', as it showed up in perceptual experiments with stepwise 'peak' shift from left to right (Kohler 1990b, 1991b).

'Peaks' are characterized by a quick F0 rise confined to the vicinity of a sentence-stressed syllable. This rise precedes the onset of the latter, and is usually narrow in temporal extension, for an 'early peak'; it extends into the first half of the stressed nucleus in the case of a 'medial peak'. In the 'late peak', it starts after the stressed vowel onset and continues into the second half of the nucleus or beyond, the exact timing of the maximum 'peak' value depending on vowel type (duration according to quantity and quality), subsequent voiced/voiceless consonants and number of immediately following unstressed syllables. There may even be a low stretch of F0 in the stressed vowel before the rise. After the 'peak' maximum is reached the F0 descends immediately, especially on subsequent unstressed syllables. But for chains of 'peaks' see 2.2.2.

'Valleys', on the other hand, have a continuous rise, starting before the stressed-syllable nucleus (early) or inside it (non-early) and extending as far as the beginning of the following sentence-stressed syllable. If there are several unstressed syllables between two sentence stresses a 'valley' is thus realised as a more gradual F0 ascent compared with the much quicker rise for a 'late peak'. The less distance there is between stressed syllables the more difficult it becomes to distinguish between a 'valley' + 'peak' and a 'late peak' + 'peak' sequence, especially if there is no F0 dip in between the first and second stress F0 maxima, as in a hat pattern (see 2.2.2).

In summary, the elements of the intonation component of the prosodic phonology of German are given the following distinctive features for a comprehensive contrastive categorization:
<+/-TERMIN>
<+/-VALLEY>
<+/-QUEST>
<+/-EARLY>
<+/-LATE>
<+/-EMPH>
<@STRLEV>, with @ = 0, 1, ...7.
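To make the feature hierarchy concrete, the following Python sketch maps a bundle of binary intonation features to a descriptive category name. The function and the label strings are hypothetical, and the reading of <+QUEST> as the high rise is an assumption based on the description above.

```python
def pitch_category(valley: bool, early: bool, late: bool = False,
                   termin: bool = True, quest: bool = False) -> str:
    """Hypothetical mapping from binary intonation features to a category label."""
    if valley:
        # 'Valleys' only distinguish early and non-early positions.
        position = "early" if early else "non-early"
        # Assumption: <+QUEST> is the high rise, <-QUEST> the low rise.
        rise = "high-rising" if quest else "low-rising"
        return f"{position} {rise} valley"
    if early:
        position = "early"
    else:
        # <-EARLY> 'peaks' are subdivided by <+/-LATE>.
        position = "late" if late else "medial"
    # <-TERMIN> 'peaks' rise again at the end (fall-rise).
    shape = "peak" if termin else "peak with fall-rise"
    return f"{position} {shape}"
```

For example, `pitch_category(valley=False, early=False)` returns `"medial peak"`, the default 'peak' position.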




2.2.2 Pitch category concatenation

In a concatenation of pitch 'peaks' without prosodic boundaries between them (see 2.3.3), F0 may fall to a low or an intermediate level and then rise again for the next 'peak'. This fall will be effected on intervening unstressed syllables between the two 'peaks', reaching the lowest point, to start the next rise, in the vicinity of the following stressed syllable, depending on 'peak' position. If there are no unstressed syllables separating the two 'peaks', the dip can be accommodated between all 'peak' combinations, except for 'late' + 'early'/'medial', where a hat pattern is created; it combines the rise of the 'late peak' and the fall of the 'early peak' in a two-stress sequence.

This boundary case of the absence of an F0 descent between 'peaks' can also be extended to concatenations with intervening unstressed syllables. In such a hat pattern, an 'early peak' is not possible initially, and a 'late' one is excluded non-initially. If there are more than two stresses incorporated in a hat the non-initial and non-final ones are unspecified as to 'peak' position because they neither have a rise nor a fall but are simply integrated into the downstepped sequence of 'peak' maxima (see 2.3.2). In the categorization of pitch patterns they are nevertheless grouped together with 'peaks'. If in a two-stress rise-fall it is difficult to decide whether the rise represents a 'valley' or a 'late peak' in a hat pattern, the latter solution is chosen.

When prosodic boundaries intervene any sequencing of 'peaks' and/or 'valleys' is possible, but the hat pattern is then excluded since it represents a very high degree of cohesion. On the other hand, a 'late peak' with a full F0 descent marks a dissociation from a following 'peak' and will then normally be linked with a prosodic boundary, i.e. final lengthening and, usually, F0 reset afterwards.

2.2.3 Pre-head

Unstressed syllables preceding the first sentence stress in a prosodic phrase may be either low or high: they represent different types of 'pre-head'.




2.3 Phonetic variation

The acoustic manifestations of the distinctive prosodic categories vary according to segmental and prosodic context, depending on the temporal alignment of 'peaks' and 'valleys' - defined by a small number of significant F0 points - with different syllable types and sequences, on downstepping, speech rate, prosodic boundaries and, finally, articulation-induced microprosody.

2.3.1 Temporal alignment of 'peaks' and 'valleys'

Taking the default 'medial peak' as a reference, two significant F0 points are defined. The first one, TF0, is positioned at the beginning of the syllable containing the <+FSTRESS> vowel, the second, T2F0, near the vowel centre, the exact timing after voiced vowel onset depending on vowel quantity, vowel height, number of following unstressed syllables and position in the utterance. The calculation of the time point T2F0 after vowel onset is carried out on the basis of the segmental duration rules for German. They have adopted the principle proposed by Klatt (1979) for the rule synthesis of English (see also Kohler 1988), defining different classes of segments (e.g. diphthongs vs. long vs. short vowels, low vs. high vowels) by different pairs of values for intrinsic duration (Di) and for minimal duration (Dmin) and generating actual segment durations in various segmental, prosodic and syntactic contexts by the application of the following rules:

(1) <DUR> → <(Di-Dmin)*PRCNT/100+Dmin>

(2) <PRCNT> → <PRCNT*PRCNT1/100>.

In (1), PRCNT = 100 initially; the rules then change the PRCNT values successively by introducing a rule-specific PRCNT1 value into (2). This way all the factors influencing segmental durations (tempo, position in the word and sentence, stress, segmental context) can be captured in specific rules by inserting a new PRCNT1 value each time. This model assumes that all the factors affecting duration operate independently of each other and that it is only the amount exceeding the minimal duration of a segment that is adjusted by these factors. The two assumptions provide a good approximation of segment timing in languages like German and English, and certainly result in prosodically acceptable speech synthesis.
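Rules (1) and (2) can be sketched in a few lines of Python. The function name and the example values below are assumptions for illustration; the actual rule-specific PRCNT1 values of the German duration rules are not given here.

```python
from typing import Iterable


def segment_duration(d_intrinsic: float, d_min: float,
                     prcnt1_factors: Iterable[float] = ()) -> float:
    """Segment duration after applying a chain of percentage rules.

    d_intrinsic    -- intrinsic duration Di of the segment class
    d_min          -- minimal duration Dmin of the segment class
    prcnt1_factors -- rule-specific PRCNT1 values, applied in rule order
    """
    prcnt = 100.0  # initial value of PRCNT in rule (1)
    for prcnt1 in prcnt1_factors:
        prcnt = prcnt * prcnt1 / 100.0  # rule (2)
    # Rule (1): only the portion exceeding Dmin is scaled.
    return (d_intrinsic - d_min) * prcnt / 100.0 + d_min
```

With hypothetical values Di = 200 ms and Dmin = 80 ms, two rules with PRCNT1 = 80 and PRCNT1 = 50 give PRCNT = 40 and a duration of 128 ms. Because only the amount above Dmin is scaled, no chain of shortening rules can push the result below 80 ms.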

T2F0 for 'medial peaks' is now derived from the basic vowel-type related duration. The only percentage factor that enters the calculation is the one referring to speech rate; it is normally set at 100, a speeding up lowers, a slowing down increases the factor, i.e. it is essentially the intrinsic vowel duration that determines the point in time after <+VOK,+FSTRESS> onset (T2F0) where the 'medial peak' is positioned. But this has to be adjusted in the case of aspiration. On the one hand aspiration lengthens the total vowel duration, compared with vowels in non-aspirated contexts, but this increase is not as large as the total aspiration phase; on the other hand it shortens the stop closure duration compared with unaspirated cases, but again not by the total amount. So the larger part of the aspiration (AH) should be added to the vowel, but some of it attached to the plosive, and the F0 'peak' placement has to take this ambivalence into account:

(3) <+VOK,+FSTRESS> → <T2F0=((Di-Dmin)*PRCNT/100+Dmin)*0.6+TLAH*0.75>,

i.e. three quarters of the period up to the last aspiration time point are added to T2F0, shifting it further to the right by this amount.
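Rule (3) translates directly into a small Python function. The function name and the assumption that all times are in milliseconds are illustrative choices; the example values are hypothetical.

```python
def t2f0_medial_peak(d_intrinsic: float, d_min: float,
                     rate_prcnt: float = 100.0, tlah: float = 0.0) -> float:
    """Time of the 'medial peak' point T2F0 after stressed-vowel onset.

    rate_prcnt -- speech-rate percentage factor (100 = normal setting)
    tlah       -- period up to the last aspiration time point (0 if unaspirated)
    """
    # Rate-adjusted basic vowel duration, as in duration rule (1).
    basic_duration = (d_intrinsic - d_min) * rate_prcnt / 100.0 + d_min
    # Rule (3): 60 % of that duration plus three quarters of the aspiration.
    return basic_duration * 0.6 + tlah * 0.75
```

For a hypothetical vowel with Di = 200 ms and Dmin = 80 ms at the normal rate setting, the sketch places T2F0 at 120 ms without aspiration, and 30 ms later when TLAH = 40 ms.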

Sentence-final 'medial peaks' receive a third F0 point, T3F0, at 150 ms after the 'peak' maximum in a medium speech rate (see 2.3.4, 2.0(6), 3(6')); in all non-final cases, the maximum 'peak' point of one <+FSTRESS> connects with the left-base point of the next <+FSTRESS>. As the absolute F0 'peak' position is not affected by vowel duration modifications due to voiced/voiceless context, number of syllables in the word, sentence position etc., its relative position changes with vowel shortening or lengthening, moving closer towards or further away from, the end. This way the microprosodic F0 truncation before voiceless obstruents is automatically built into the rules.

Thus, in utterance-final short-vowel monosyllables, ending in voiceless consonants, an F0 'peak' fall is truncated, and has to be so to create the same perceptual 'peak' pattern as in other, non-truncating contexts. The listener obviously takes the underlying constancy in absolute F0 'peak' positioning within the same vowel type, and the difference across different vowel types, into account, disregarding contextual adjustments. He can always calculate the final F0 point that should have been reached, if the F0 contour had not been curtailed, from the F0 decline per unit of time up to the cut-off, and he can then compare this value with the likely low end of the speaker's speech voice range (about 60 - 80 Hz in a male voice, an octave higher in a female one). If the comparison is within a narrow margin the fall is terminal, otherwise it is not. Since the lower end of a speaker's F0 range is a reliable reference value, truncation of falling F0 patterns can be uniquely restored perceptually.

That no longer applies to rising F0. Here the end point is not calculable, because there is a large margin for the end point: anything up to or even above 1.5 octaves is possible. The position of the ceiling is not fixed, by contrast with the base line. That means that the intended high value of a 'valley' always has to be physically reached, it cannot be deduced from what precedes, and, moreover, it does not change with different 'valley' positions from 'early' to 'late', which is quite different in 'peaks'. There is thus a fundamental difference between 'peaks' and 'valleys' in the fixation of their offsets.

An 'early peak' has its maximum value at the <+FSTRESS> syllable onset, TF0 100 ms before, and T3F0 - in sentence-final position - in an area where the 'medial peak' has its maximum. A 'late peak' has TF0 at the same point as a 'medial peak', then an additional low F0 point T2F0 is inserted where the 'medial peak' has its centre, and the 'late' summit (T3F0) occurs 100 ms later or at the end of the last voiced segment in a non-final monosyllabic word if this distance is less than 100 ms. If there is an unstressed syllable following, the summit coincides with the unstressed vowel voice onset. In utterance-final position, a fourth F0 point, T4F0, occurs 100 ms after the summit.

'Valleys' have their left and centre F0 points at the same positions as TF0 and T2F0 in 'medial peaks'. In an 'early valley' the left point is the lowest, whereas in a 'non-early valley' it is the centre point; in both cases, the right, high point is located at the end of the last voiced segment.




2.3.2 Downstepping

Declination, i.e. the temporally fixed decline of F0, prominent in, e.g., the Dutch and Lund models, is not a feature of spontaneous speech production. It has therefore been replaced by downstepping in KIM, i.e. a structurally determined pitch lowering from sentence stress to sentence stress, independent of the time that elapses between them.

Results of interactive testing make it clear that perception orientates itself at structurally positioned and downstepped 'peaks', not at a time-based declination. The downstepping values used in KIM are 6% from 'peak' maximum to 'peak' maximum, and 18% from a 'peak' maximum to the next base. In 'valleys' both the low and the high F0 value are downstepped by 6%. Downstepping can be interrupted at any point by the feature <+EMPH> or by resetting.
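As a minimal sketch, the downstepping of an uninterrupted 'peak' series can be computed as follows. The function names and the starting value are assumptions; interruption by <+EMPH> or by resetting is deliberately left out.

```python
from typing import List


def downstepped_peaks(first_peak_hz: float, n_peaks: int,
                      peak_step: float = 0.06) -> List[float]:
    """F0 maxima of successive 'peaks', each 6 % below the preceding one."""
    peaks = []
    value = first_peak_hz
    for _ in range(n_peaks):
        peaks.append(value)
        value *= 1.0 - peak_step
    return peaks


def bases_between(peaks: List[float], base_step: float = 0.18) -> List[float]:
    """Base following each non-final 'peak', 18 % below that 'peak' maximum."""
    return [peak * (1.0 - base_step) for peak in peaks[:-1]]
```

Starting from a hypothetical first maximum of 100 Hz, three 'peaks' come out at 100, 94 and 88.36 Hz, with bases at 82 and 77.08 Hz between them. The lowering depends only on the number of sentence stresses, not on the time elapsing between them.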

2.3.3 Prosodic boundaries

One of the functions of prosody is the sequential structuring of utterances and discourse, i.e. the signalling of prosodic boundaries and - at least partially - their hierarchical organization. To decode the syntagmatic chunking of messages in accordance with the speaker's intention the listener requires signals that index degrees of cohesion or separation, respectively, between phrases, clauses, utterances and turns. The parameters that achieve this are pause duration, phrase-final segmental lengthening and scaling of F0 end points at the respective boundaries. They can be controlled by parametric rules in the prosodic model upon appropriate symbolic input.

As at this stage the linguistically and phonetically relevant categorization of these boundaries is not well understood, the modelling cannot reduce the categories in this domain to the same small number as in the other areas of prosody discussed so far, but has to allow sufficient degrees of freedom for experimentation with data modelling. For each of the three parameters three degrees are therefore recognised, controlled by digit notation in the symbolic input to the model (see Section 3). As our knowledge of prosodic boundary marking increases the degrees of freedom can be reduced by establishing constraints between the three parameters in the signalling of the necessary and sufficient number of phonologically relevant distinctions.

2.3.4 Speech rate

Speech rate changes within the same speaker not only alter the segmental durations (at varying degrees for different segment types, e.g. vowels vs. consonants) but also the positioning of T2F0 within a stressed vowel, moving it farther or less far into it in slower and faster speed, respectively (see 2.3.1). This also implies slower or faster rises and falls, which lower or raise the perceived pitch level. The faster movements in turn mean that there comes a point when the complete F0 excursions can no longer be carried out. Since the 'peak' and 'valley' maxima are the essential target values controlled by the speaker, the levelling of F0 movements in fast speech particularly affects the low values. Finally, faster speech also means greater effort, which may produce a higher level of activation at the vocal folds right away. The correlation of F0 level with speech rate perception has been shown in Kohler (1986b,c).

The control of speech rate is also coupled with articulatory reduction and elaboration (Kohler 1990a). The speech rate category therefore activates whole blocks of parametric rules that deal with F0 timing, F0 patterning, segment durations and segmental adjustments (coarticulation, reduction, reinforcement, elision). To start with, the model distinguishes four degrees, one of them - reduced - taking reduction phenomena (at an otherwise medium speech rate timing) into account, as one way of changing speed of articulation. This level will probably have to be subdivided into subcategories according to the degree of formality and spontaneity of speaking, comprising different rule modules for the respective degree. A good deal more research into spontaneous speech is necessary before an adequate categorization can be set up in the model.




2.4 Microprosody

In the generation of intonation, KIM separates two levels

  • the defining of phonology-controlled prosodic patterns by a small number of significant F0 points,
  • the output of continuous F0 contours influenced by articulation-related modifications (Kohler 1990b).

This dichotomy implies the assumption that the underlying F0 'peak' and 'valley' patterns develop independently and in a very concrete physical and physiological sense in speech production, and are modified microprosodically by output constraints in the vocal apparatus. In particular, we have to distinguish five areas of microprosodic adjustments to the basic significant point patterns discussed so far.

  1. In close vowels with 'medial peaks', the summit is raised by a factor of 1.08, compared with all other vowels.
  2. An interpolation (cosine) is carried out between the significant F0 points.
  3. After the voiceless obstruents, the F0 value at vowel onset is raised by an additive constant of 15 Hz, the increment trailing off to 0 towards T2F0.
  4. In voiced plosives all F0 values are lowered by 10 Hz, in other voiced consonants by 5 Hz.
  5. F0 is masked in voiceless stretches.
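Adjustments 2 and 3 can be illustrated with a short Python sketch. The function names are hypothetical, and the linear decay of the 15 Hz increment is an assumption: the text only states that the increment trails off to 0 towards T2F0.

```python
import math


def cosine_interpolate(t0: float, f0: float, t1: float, f1: float,
                       t: float) -> float:
    """F0 at time t on a half-cosine curve between two significant F0 points."""
    if t1 <= t0 or not t0 <= t <= t1:
        raise ValueError("need t0 < t1 and t0 <= t <= t1")
    phase = (t - t0) / (t1 - t0)
    weight = (1.0 - math.cos(math.pi * phase)) / 2.0  # runs from 0 to 1
    return f0 + (f1 - f0) * weight


def voiceless_onset_increment(t: float, vowel_onset: float, t2f0: float,
                              raise_hz: float = 15.0) -> float:
    """Additive F0 increment after a voiceless obstruent.

    Assumption: the increment decays linearly from raise_hz at vowel onset
    to 0 at T2F0.
    """
    if t < vowel_onset or t >= t2f0:
        return 0.0
    return raise_hz * (t2f0 - t) / (t2f0 - vowel_onset)
```

Halfway between two points at 100 Hz and 120 Hz the interpolated value is 110 Hz, and the curve leaves and reaches both points with zero slope, which avoids angular corners at the significant F0 points.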



3 Linguistic environment and symbolic input to the model

KIM is integrated into a pragmatic, semantic and syntactic environment. The input to the model consists of symbolic strings in phonetic notation with additional pragmatic, semantic and syntactic markers. The pragmatic and semantic markers trigger, e.g., the pragmatically or semantically conditioned use of 'peak' and 'valley' types or of sentence focus. Lexical stress position can largely be derived by rule, and syntactic structure rules mark deaccentuation and emphasis in word, phrase, clause and sentence construction. Phrasal accentuations are thus derived from the syntactic component preceding the prosodic model, and are given special symbolizations in the input strings to the model. The following 7-bit ASCII characters are used to represent the prosodic categories of 2.0 (using the corresponding numbering).

(1') Apostrophe and quotation mark ' " are put in front of the primary or secondary stress vowel; vowels without a lexical stress marker are unstressed:

R'ück#s"icht ("view to the back") vs. R'ücksicht ("consideration")

( # marks the phonetically - especially prosodically - relevant word boundary in compounds).



(2') Digits 3 2 1 0 are put in front of words that receive the reinforced, neutral, partially or completely deaccented sentence stress category, which in turn affects the manifestation of the respective lexically stressed vowel. Function words, marked by suffixed +, have 0 as their default, non-function words 2; in both cases the digit may be omitted in the symbolization:

2Max 0hat+ 0einen+ 2Brief 2geschrieben .
("Max did write a letter")
2Max 0hat+ 0einen+ 2Brief 1geschrieben .
(semantically unmarked rendering of "Max wrote a letter"; answer to question "What did Max do?")
2Max 0hat+ 0einen+ 2Brief 0geschrieben .
("Max wrote a letter, not a card"; answer to question "What did Max write?")
2Max 0hat+ 0einen+ 3Brief 0geschrieben .
(reinforcement of contrasted "Brief" in the previous example)

Degrees of reinforcement may be symbolized by digits 3 to 9.
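The sentence-stress notation of (2'), including its defaults, can be read with a small Python sketch. The names `Word`, `parse_word` and `parse_phrase` are hypothetical and the sketch is not the parser of the TTS system; it ignores all markers other than the stress digit, the function-word suffix + and prosodic punctuation.

```python
import re
from typing import List, NamedTuple

PROSODIC_PUNCTUATION = {".", ",", "?", ".,", ".?"}
# Optional sentence-stress digit, the word itself, optional function-word '+'.
_TOKEN = re.compile(r"^(\d)?(.+?)(\+)?$")


class Word(NamedTuple):
    text: str
    function_word: bool
    sentence_stress: int


def parse_word(token: str) -> Word:
    match = _TOKEN.match(token)
    if match is None:
        raise ValueError(f"not a word token: {token!r}")
    digit, text, plus = match.groups()
    function_word = plus is not None
    if digit is None:
        # Defaults: 0 for function words, 2 for all other words.
        stress = 0 if function_word else 2
    else:
        stress = int(digit)
    return Word(text, function_word, stress)


def parse_phrase(phrase: str) -> List[Word]:
    """Parse whitespace-separated tokens, skipping prosodic punctuation."""
    return [parse_word(token) for token in phrase.split()
            if token not in PROSODIC_PUNCTUATION]
```

Applied to `"Max hat+ einen+ Brief geschrieben ."`, the sketch yields the sentence-stress levels 2, 0, 0, 2, 2, i.e. the same result as the fully marked first example above.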



(3') The punctuation marks . , ? and their sequences ., .? are used phonetically for pitch 'peaks', low and high rising 'valleys', low and high fall-rises (combined 'peak' + 'valley' patterns). Therefore texts in KIM and as TTS input (see Section 4) no longer contain orthographic punctuation.

Ja . Ja , Ja ? Ja ., Ja .?

A high pre-head is symbolized by = at the beginning of a prosodic phrase:

p:= 0Wie+ 0sieht 0das 0bei+ 0Ihnen+ 0am+ 3Donnerstag 0aus.


(4') Parentheses ) ( for 'early' and 'late peak' positions in sentence-stress syllables are put before the stressed word (after the sentence-stress digit); the 'medial peak' position is regarded as the default case and remains unmarked:

Sie+ hat+ ja+ 2)gelogen .
("She's been lying." = summarizing, concluding statement)
Sie+ hat+ ja+ 2gelogen .
(=start of a new argumentation)
Sie+ hat+ ja+ 2(gelogen .
(as the preceding example but with a contradictory note)

In connection with 'valleys' there are only early and non-early positions in the sentence-stressed syllable. Either one or the other category may be taken as the unmarked default case, depending on their frequency of occurrence: early for , and late for ? .


Wie+ 2( heißt du+ ?
"What's your name?"
Wie+ 2) heißt du+ ?
Wie+ 2) heißt du+ ,
Wie+ 2( heißt du+ ,


(5') The prosodic boundary (cohesion) marker p: is put after the word at which boundary indices occur. It is preceded by two digits, the second of which refers to pause length, the first to utterance-final lengthening. In the case of pitch 'peaks', there is a third boundary-related digit to the left of these two, referring to the scaling of the F0 end point. Each of the digits may range from 0 (= absence of pause, of final lengthening or of F0 descent) through 1 (= short pause - <=200ms; default utterance-final lengthening; intermediate F0 descent) to 2 (= long pause - >=200ms; hesitation lengthening; full F0 descent). If there is no prosodic punctuation mark associated with p: default . is assumed; the other pitch categories (see ()) have to be symbolized after the prosodic boundary marker p: in KIM and in its TTS input, but before the phrasing marker PGn in PROLAB (see Example Spontaneous).

zehn 211p: minus+ zwei 100p: mal+ drei ("10 - 2 x 3")
zehn 100p: minus+ zwei 211p: mal+ drei ("(10 - 2) x 3")


(6') The digit string associated with the phrase boundary marker p: is preceded by a further digit, ranging from 0 to 3 to mark four degrees of speech rate, which include degrees of reduction or elaboration: 2 refers to medium overall speed and default reduction (and may be omitted from the symbolization as an implicit default), 1 refers to the same speed but a higher degree of reduction; for 0, degrees of reduction and speed are increased, for 3 they are both decreased from 2. The rate digit (or default) applies to the stretch of speech between the p: marker and its predecessor or the utterance beginning, respectively. In this modelling of speech rate, segment durations are not changed by uniform and proportionate up or down-scaling across the whole sequence, but vowels and consonants are dealt with separately according to sets of rules including segmental reduction, assimilation and elision.

mit+ roten gelben blauen braunen 3212p:
("with red, yellow, blue and brown ones")
mit+ roten gelben blauen braunen 2212p:
mit+ roten gelben blauen braunen 1212p:
mit+ roten gelben blauen braunen 0212p:


(7') The model does not include the category of declination over time, but incorporates the structurally determined, time-independent category of downstep from 'peak' to 'peak' and from 'valley' to 'valley'. It is set at a constant value (6% in medium and slow speeds, 4% in fast speed) and is not indicated symbolically. Pitch reset can occur at any point in the chain of 'peaks' or 'valleys' and is associated with a prosodic boundary. It is marked by + before the digit sequence at the preceding p:.

mit+ roten gelben 2110p: blauen schwarzen 2212p:
mit+ roten gelben +2110p: blauen schwarzen 2212p:
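
The effect of downstep and reset on a chain of 'peaks' can be illustrated numerically. The Python sketch below assumes that the constant downstep is applied as a percentage of the preceding peak value and that a reset returns scaling to the starting value; the text gives only the percentages, so both points are assumptions, and the function name peak_scaling is hypothetical.

```python
from typing import Iterable, List

# Downstep values from (7'), expressed as proportions.
DOWNSTEP_BY_SPEED = {"slow": 0.06, "medium": 0.06, "fast": 0.04}


def peak_scaling(n_peaks: int, speed: str = "medium",
                 resets: Iterable[int] = ()) -> List[float]:
    """Return a relative scaling factor for each successive peak.

    Assumption: each peak is lowered by the downstep proportion relative
    to the preceding one; a peak whose index is in `resets` starts again at 1.0.
    """
    step = DOWNSTEP_BY_SPEED[speed]
    reset_at = set(resets)
    factors = []
    current = 1.0
    for index in range(n_peaks):
        if index in reset_at:
            current = 1.0          # pitch reset after a boundary marked with +
        factors.append(current)
        current *= 1.0 - step      # downstep towards the next peak
    return factors


if __name__ == "__main__":
    # roten gelben | blauen schwarzen: without and with reset at the third peak
    print([round(f, 4) for f in peak_scaling(4)])
    print([round(f, 4) for f in peak_scaling(4, resets=[2])])
```

Under these assumptions the first example line yields a steadily descending chain over all four sentence stresses, while the + in the second line restores the third peak to the level of the first.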

The KIM symbolization system outlined above has been used in the TTS implementation of KIM (see Section 4), but it has also been adapted for use as an efficient prosodic labelling system (PROLAB) in the processing of recorded read and spontaneous speech data to create a labelled data bank. One of the aims of such a corpus is to enlarge the empirical basis for prosodic modelling. The different requirements of this "manual" annotation of natural speech have resulted in the following adjustments and additions to the model symbolization.




  • All prosodic labels are prefixed by &, and orthographic punctuation (without this additional marker) is kept as well because it can provide certain syntactic information and is then an important factor in the analysis of the convergence/discrepancy between syntactic and prosodic phrasing in a data base.
  • The default medial peaks of KIM are marked by ^.
  • 'Early' and 'late valleys' are indicated by ] and [.
  • Prosodic phrasing is always indicated and never implied by prosodic punctuation. On the other hand, the pitch categories are marked in all cases, and . is never implied by a prosodic boundary marker. Prosodic punctuation is put before the latter.
  • Prosodic boundary markings are separated into
    • pauses and breathing: p: h:
    • phrasing markers: only one generalized category so far, PGn
    • hesitation lengthening: z:
    • F0 descent in 'peaks': 0-2, put immediately before a phrasing marker or the word containing the next sentence stress.
    • high prehead: HP at its beginning
    • F0 reset: implied by PGn, excluded by = in PGn=2, | before stress digit 2 elsewhere.
  • Function words, marked by suffixed +, do not get a lexical stress symbolization; therefore, if a function word is to receive a sentence stress, ' ' is inserted at the appropriate syllable.
  • Disfluencies (break-offs and resumptions) are marked by +/ /+, or by +/ =/+ inside a word.
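
As a rough illustration of how these conventions separate the layers of a label file, the following Python sketch classifies the whitespace-separated tokens of a PROLAB line. It is a simplification under stated assumptions, not the Kiel labelling software: the category names and the functions classify and label_line are hypothetical, and tokens not covered by the list above are simply treated as words.

```python
from typing import List, Tuple

BOUNDARY_TOKENS = {"p:", "h:", "z:"}          # pause, breathing, hesitation lengthening
PUNCTUATION = {",", ".", "?", "!", ";", ":"}  # orthographic punctuation, kept unprefixed


def classify(token: str) -> str:
    """Assign a coarse category to one PROLAB token (illustrative only)."""
    if token.startswith("&"):
        return "prosodic"        # all prosodic labels carry the & prefix
    if token in BOUNDARY_TOKENS:
        return "boundary"
    if token in PUNCTUATION:
        return "punctuation"
    if token.startswith("+/") or token.endswith("/+"):
        return "disfluency"      # break-offs and resumptions
    if token.endswith("+"):
        return "function_word"   # suffixed + marks function words
    return "word"


def label_line(line: str) -> List[Tuple[str, str]]:
    """Return (token, category) pairs for one line of a label file."""
    return [(token, classify(token)) for token in line.split()]


if __name__ == "__main__":
    for token, kind in label_line("&0 das+ &2^ schw'ierig . &2. &PGn"):
        print(f"{token:12} {kind}")
```

Keeping orthographic punctuation and prosodic labels in separate categories is what makes the comparison of syntactic and prosodic phrasing mentioned in the first list item straightforward.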

The following orthographic transcript with prosodic annotations is an illustration of such a PROLAB label file for a turn in a spontaneous dialogue from the Kiel Corpus recorded and processed at IPDS (see IPDS, 1995f). The corresponding speech file may be listened to by clicking on the "original" button. After converting the labelled text into the TTS-compatible KIM format it can be fed into the TTS system and changed back to speech. This synthesized version of the spontaneous dialogue turn may be activated by clicking on the "synthesized" button.

Example Spontaneous

g071a004.s1h
TIS004:

&2 <ähm> &PGn &2( D'ienstag &0 würde+ &0 mir+ &0 g'ut &0. &2) p'assen , &2. &PGn
&2 <ähm> &PGn &0 das+ &2] h'eißt , &, &PGn p: &2^ Mom'ent , &1. &PGn
&2( 'allerdings &0 'erst z: &0. &PGn &2( n'achm"ittags h: . &2. &PGn
&RP &HP &0 das+ &0 wird+ &0 dann+ &2^ wahrsch'einlich &0 'n+
&0 b'ißchen &1. &2^ schw'ierig . &2. &PGn
&RM &2^ D'ienstag , &0. &|2^ m'ittwochs z: &1. &PGn &0 <äh> &PGn
p: &0 +/is=/+ &PG/ &1^ s'ieht &0 das+ &0 bei+ &0 mir+ z:
&0 +/sch=/+ &2. &PG/ &2^ schw'ierig &0 'aus . &2. &PGn
&0 da+ &0 hab' &0 ich+ &2^ tags'über &1. &2^ Term'ine . &1. &PGn
h: &2 <ähm> &PGn &RP &HP &0 wie+ &0 s'ieht &0 das+ &0 bei+
&0 Ihnen+ &0 am+ &3 D'onnerstag &0 'aus ? &2. &PGn

original
synthesized

For further details see Kohler et al. 1995.





4 Implementation of the model in the RULSYS/INFOVOX TTS system for German

KIM has been implemented in the RULSYS/INFOVOX TTS system for German. The Kiel development of this TTS system (for details see Carlson [KTH] et al. 1990, Kohler 1991c) makes use of a very simple adaptation of 7-bit ASCII to the phonetic transcription of German: (a) upper-case letters for segmental phonemes, (b) lower-case letters for allophones, (c) the characters listed in Section 3, (1')-(7'). These phonetic symbols are either derived by rule from orthographic input, or they are entered into the system directly, enclosed between a pair of # metacharacters. In the latter case the input string can be either entirely phonetic, or mixed orthographic/phonetic as illustrated by the examples in Section 3.
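
The separation of orthographic and directly entered phonetic stretches by the metacharacter # can be sketched in a few lines of Python. This is only an illustration of the input convention; the function name split_input is hypothetical, and the actual RULSYS/INFOVOX input handling is not documented here.

```python
from typing import List, Tuple


def split_input(text: str) -> List[Tuple[str, str]]:
    """Split a mixed input string into ('orthographic' | 'phonetic', stretch) pairs.

    Stretches enclosed between two # characters are taken as phonetic or
    prosodic symbols; everything else is orthographic text for the rule system.
    """
    parts = text.split("#")
    if len(parts) % 2 == 0:
        # An odd number of # characters leaves one phonetic stretch unclosed.
        raise ValueError("unbalanced # metacharacter in input")
    segments = []
    for index, part in enumerate(parts):
        if part == "":
            continue
        kind = "phonetic" if index % 2 == 1 else "orthographic"
        segments.append((kind, part))
    return segments


if __name__ == "__main__":
    for kind, stretch in split_input("der Technischen #3# Hochschule #0# Stockholm"):
        print(f"{kind:13} {stretch!r}")
```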

The greater part of the prosodic notations in (2')-(7') have to be entered as such, because the syntactic component of the system is not powerful enough to derive them by rule from orthographic input. Moreover, in many cases semantic and pragmatic rules would be required to generate the correct prosodic output. The symbolic prosody markers trigger hierarchical sets of symbolic distinctive feature rules, followed by sets of parametric F0 and duration rules in the phonetic-to-acoustic output component.

The speed control digit at the p: marking attributes a parametric rate variable to every segmental symbol and sets it to a value representing the respective category. Blocks of duration, segment and F0 rules in the phonetic module are then activated by the particular rate variable value and the appropriate calculations along the three phonetic scales are performed. This means that for a particular speed it is not only the segment durations that are adjusted across the whole chain to which the particular rate factor applies, but F0 is also raised for speeding up or lowered for slowing down, and segmental reductions or elaborations are effected simultaneously, in accordance with natural speech production. The segment durations are scaled separately for vowels and consonants and also as a function of a number of other conditioning factors (vowel height, consonant category, stress, number of syllables in the word). The digit before p: controlling phrase-final lengthening triggers a more local increase or decrease of segment durations within the set global speech rate.

The TTS implementation of KIM allows the calculation of speech timing at a hierarchy of levels from segment to segment chain to phrase to utterance, according to a Klatt type (1979) model for segment timing (see 2.3.1) with factors determined by stress, utterance position, number of syllables in the word and overall speech rate (see Kohler 1986a).
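
For orientation, the general form of a Klatt-type duration rule can be sketched as follows: each segment has an inherent and a minimum duration, and successive rules multiply a percentage that scales only the stretchable part between the two. The Python sketch shows this general scheme only; the numerical values and factor names in the example are invented for illustration and are not the KIM parameters, which are documented in Kohler (1986a).

```python
from typing import Iterable


def klatt_duration(inherent_ms: float, minimum_ms: float,
                   factors: Iterable[float]) -> float:
    """Klatt-type rule: DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100.

    PRCNT starts at 100 and is multiplied by one factor per applicable rule
    (e.g. stress, utterance position, number of syllables, speech rate).
    """
    percent = 100.0
    for factor in factors:
        percent *= factor
    return minimum_ms + (inherent_ms - minimum_ms) * percent / 100.0


if __name__ == "__main__":
    # Hypothetical vowel: inherent 200 ms, minimum 100 ms.
    print(klatt_duration(200.0, 100.0, []))          # no rule applies
    print(klatt_duration(200.0, 100.0, [0.5]))       # one shortening rule
    print(klatt_duration(200.0, 100.0, [0.5, 0.5]))  # two shortening rules combined
```

Because only the portion above the minimum duration is scaled, combined shortening rules can never compress a segment below its minimum, which is one reason why rate changes in such a model are not a uniform proportional scaling.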

The Kiel prosodic model for German is comprehensive and detailed enough for its TTS realization to generate highly intelligible and natural sounding synthetic output for continuous text. The input may be text in ordinary orthographic form which, owing to its syntactic and semantic simplicity, requires no or only very few prosodic markers, as in Example Orthographic. Alternatively, it may be supplied with systematic phonetic and prosodic annotations in the case of very intricate phrasing structures, which cannot be derived uniquely from the orthography of the complicated text, as in Example Annotated. The TTS implementation of KIM is also capable of simulating spontaneous speech on the basis of phonetic and prosodic label files that have been created for natural speech files using KIM and its symbolization system, as in Example Spontaneous. Prosodic modelling, its TTS implementation and testing, and model-driven labelling of natural speech thus form an integrated framework of prosodic research at IPDS Kiel.





Illustrations of the prosodic categories of KIM in continuous text input to the RULSYS/INFOVOX TTS system

Example Orthographic: medium speed, not reduced

Die Buttergeschichte.

Es war in Berlin zu einer Zeit, als Lebensmittel nicht genügend vorhanden waren. Vor einem Laden stand bereits um sieben Uhr eine beachtliche Menschenmenge; denn man hatte dort am Abend vorher auf einem Schild schon lesen können, daß frische Butter eingetroffen sei. Jeder wußte, daß die Butter schnell ausverkauft sein würde und daß man ganz früh kommen müsse, um noch etwas zu erhalten. Da das Geschäft erst um acht geöffnet wurde, stellten sich die Leute vor der Laden-tür in einer Reihe an. Wer später kam, mußte sich hinten anschließen.

Je näher der Zeiger auf acht kam, desto unruhiger wurden die #)# Leute. Da kam endlich ein kleiner Mann mit grauem Haar und drängte sich ziemlich rücksichtslos nach vorn. Die wartenden Menschen waren empört über solches Verhalten und forderten ihn auf, sich ebenfalls hinten anzustellen. Aber auch als mit der Polizei schon gedroht wurde, #0# ließ sich der Mann nicht beirren, sondern drängte sich weiter durch. Er bat, man solle ihn doch #9# durchlassen. Oder glaubte man, daß diese Drängelei für ihn vielleicht ein #6)# Vergnügen sei?..

Das war für die Leute nun doch zu viel.. Alle kochten bereits vor Wut, und der Mann konnte jetzt von allen Seiten Schimpfwörter hören.. Er aber zuckte resigniert mit den Schultern und bemerkte: "Nun #6)# gut. Wie Sie #)# wollen. Wenn Sie mich nicht vorlassen, dann kann ich die Tür nicht aufschließen, und Sie können meinetwegen hier stehenbleiben, bis die Butter ranzig geworden ist."

synthesized (3.22 MByte)!
butterge.zip (2.4 MByte): WAV format, 16 bit, 16 kHz, 3.22 MByte (decompressed), 1:45.591 minutes

Example Annotated

Address to a tutorial at the Konvens meeting in Vienna, 27 September 1994
In this text <;;> is shorthand for #000p:#, i.e. the "hat pattern". <,,> stands for the fall-rise (., in KIM).

Meine sehr ;; verehrten ;; Damen und Herren, #1# liebe ;; Teilnehmer am Tutorium #+211p:#, Aussprache-lexika in der signalnahen Sprachverarbeitung..

Es #2# begrüßt Sie die klare #+211p:#, etwas metallische #+210p:#, aber dennoch melodische #+211p:#, und vor allen ;; Dingen ;; rhythmische #+3210p:# synthetische Stimme des Nordens. Sie #2# basiert auf dem TTS-System #+2100p:# VOX #1# PC #+211p:#, der #0# Firma Infovox in Stockholm #+2212p:# und auf dem Softwehr Entwicklungswerkzeug Ruhlsys #+210p:#, der Technischen #3# Hochschule #0# Stockholm. Entscheidend für die Geburt dieser Stimme war aber die Entwicklung von Regeln #+210p:#, zur akustischen Wandlung orthographischer Symbolketten #+211p:#, im Institut für Phonetik #+210p:#, und digitale Sprachverarbeitung #+211p:#, der Christian ;; #'ALBR[CHTS%UNIVERZIT'[:T# zu Kiel... An #2# sich sollte #0# Professer #3# Kohler diese #1# einleitenden #1# Worte #0# sprechen. Da er aber noch ein #0# bißchen unter Zeitverschiebung #+2210p:# nach einer USA und einer Japan-reise #0# leidet, #0# braucht er #1# heute morgen #+210p:#, noch einen etwas längeren Anlauf. #0# Deshalb ist er sehr froh #0DAR'Y:Br#,, daß er diese Aufgabe #2# mir #+2210p:# einer Sprechmaschine übertragen kann....

Ehe Herr Kohler #+2100p:# mit Ihnen die Struktur #2110# und die Generierung #2110p:# von Aussprache-lexika #+212p:#, sowie ihren Einsatz in Forschung und Anwendung #+2000p:# auf verschiedenen ;; Ebenen erläutert #+212p:#, möchte ich nicht versäumen, den Organisatoren der #1# Tagung #211p:#, auch in #1(# seinem #010p:#, Namen #+011p:#, für die Einladung zur Ausrichtung des Tutoriums #+2100p:# sehr herzlich zu ;; #)# danken.. #2# Ihnen, den Teilnehmern, gebührt ebenfalls #0# Dank #210p:#, daß Sie sich #0# dafür entschieden haben.. Wir haben die Einladung #0#natürlich sehr ;; gern #1#aufgegriffen #+2210p:# #2# nicht nur weil sie Herrn Kohler die Möglichkeit #1# gibt #200p:#, #3# Wien zu #1#besuchen #+211p:#, und den Heurigen zu #1# genießen #+211p:#, sondern um vor allem die Kieler ;; Forschung #1# vorzustellen #+2210p:# und #)# Interesse an ihr zu #1# wecken....

Jetzt darf ich aber Herrn Kohler nicht länger von seiner Arbeit abhalten. Er ist #0# inzwischen aufgewacht #2111p:# und schon #3(# unruhig geworden. und er findet vor allem die Text-eingabe sehr ermüdend #+2211p:# da er nur mit zwei Fingern tippen kann. Ich ziehe mich also zurück, und wünsche Ihnen viel Vergnügen ;; beim Tutorium.

synthesized (4.31 MByte)!
ansprache.zip (3.4 MByte): WAV format, 16 bit, 16 kHz, 4.31 MByte (decompressed), 2:21.348 minutes




References

CARLSON, R., GRANSTRÖM, B. & HUNNICUTT, S. (1990): Multi-lingual text-to-speech development and applications. In: Advances in Speech, Hearing, and Language Processing (ed. W.A. Ainsworth), London: JAI Press, 269-296.

IPDS (1995f): CD-ROM#1-3: The Kiel Corpus of Read/Spontaneous Speech. Kiel.

KLATT, D.H. (1979): Synthesis by rule of segmental durations in English sentences. In: Frontiers of Speech Communication Research (eds. B. Lindblom & S. Öhman), London/New York/San Francisco: Academic Press, 287-299.

KOHLER, K.J. (1986a): Invariance and variability in speech timing: from utterance to segment in German. In: Invariance and Variability of Speech Processes (eds. J.S. Perkell & D.H. Klatt), Hillsdale, N.J.: Lawrence Erlbaum, 268-289.

KOHLER, K.J. (1986b): Parameters of speech rate perception in German words and sentences: duration, F0 movement, and F0 level. Language and Speech 29, 115-139.

KOHLER, K.J. (1986c): F0 in speech timing. Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung der Universität Kiel (AIPUK) 20, 55-97.

KOHLER, K.J. (1988): Zeitstrukturierung in der Sprachsynthese. In: Digitale Sprachverarbeitung. ITG-Tagung, Bad Nauheim (ed. A. Lacroix), Berlin/Offenbach: vde-Verlag, 165-170.

KOHLER, K.J. (1990a): Segmental reduction in connected speech in German: phonological facts and phonetic explanations. In: Speech Production and Speech Modelling (eds. W.J. Hardcastle & A. Marchal), Dordrecht/Boston/London: Kluwer Academic Publishers, 69-92.

KOHLER, K.J. (1990b): Macro and micro F0 in the synthesis of intonation. In: Papers in Laboratory Phonology I (eds. J. Kingston & M.E. Beckman), Cambridge: Cambridge University Press, 115-138.

KOHLER, K.J. (1991a): A model of German intonation. Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung der Universität Kiel (AIPUK) 25, 295-360.

KOHLER, K.J. (1991b): Terminal intonation patterns in single-accent utterances of German: Phonetics, phonology, and semantics. Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung der Universität Kiel (AIPUK) 25, 115-185.

KOHLER, K.J. (1991c): Prosody in speech synthesis: the interplay between basic research and TTS application. Journal of Phonetics 19, 121-138.

KOHLER, K.J. (1997a): Parametric control of prosodic variables by symbolic input in TTS synthesis. In: Progress in Speech Synthesis (eds. J.P.H. van Santen, R.W. Sproat, J.P. Olive, J. Hirschberg), New York: Springer, 459-475.

KOHLER, K.J. (1997b): Modelling prosody in spontaneous speech. In: Computing Prosody (eds. Y. Sagisaka, N. Campbell, N. Higuchi), New York: Springer, 187-210.

KOHLER, K.J., PÄTZOLD, M. & SIMPSON, A.P. (1995): From scenario to segment - The controlled elicitation, transcription, segmentation and labelling of spontaneous speech. Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung der Universität Kiel (AIPUK) 29.


Acknowledgement

The development of KIM was carried out with financial support from the German Research Council (DFG grants Ko 331/19-1-4) in the project "Form and function of intonation peaks in German" between 1985 and 1989. Some of the initial implementation in the RULSYS/INFOVOX TTS system was made possible by a contract with the company Infovox, Solna/Sweden in the years 1987 - 1989. Furthermore, I particularly acknowledge, with great gratitude, the continuous and extremely fruitful cooperation with Rolf Carlson and Björn Granström at KTH, Stockholm.

 

Last modified: June 2008, K. J. Kohler