The Kiel Intonation Model (KIM),
| (1) | lexical stress - three levels: unstressed, secondary stress in compounds, and primary stress |
| (2) | sentence stress - four levels: reinforced, neutral, partially and completely deaccented |
| (3) | intonation: (a) categories of pitch 'peaks' and 'valleys' as well as their combinations at each sentence stress position; (b) types of pitch category concatenation; (c) pitch of 'pre-head' before first sentence stress: low - high |
| (4) | synchronization of pitch 'peaks' and 'valleys' with stressed syllables : early, medial, late |
| (5) | prosodic boundaries (degrees of cohesion) - three variables: pause duration, phrase-final segmental lengthening, scaling of F0 end points |
| (6) | overall speech rate between the utterance beginning and successive prosodic boundaries - four degrees: slow, medium, reduced, fast |
| (7) | downstep of successive pitch 'peaks'/'valleys' and pitch 'reset'. |
A system of prosodic distinctive features is used to specify abstract symbolic, phonological categories in these domains; they enter into sets of ordered symbolic rules. The features are either graded or binary and determine the parametric value spaces activated by parametric phonetic rules following the symbolic ones. The prosodic features are attributed to phonological units, which are either segmental (vowels and consonants) or non-segmental (morphological and phrase boundaries). Attached to vowels is the fundamental distinction within the German prosodic system, viz. stress and intonation.
Within stress we have to differentiate between lexical and sentence stress. At the abstract level of phonological specifications in the lexicon, every German word has at least one vowel that has to be marked as potentially stressable, i.e. as being able to attract the feature specifications of sentence stress. Lexical stress is thus not a distinctive stress feature; it only marks a position that can attract such a feature at the sentence level, but need not do so.
Vowels receive combinations of the stress features <+/-FSTRESS> and <+/-DSTRESS> (referring to the association of sentence stress with the two important parameter domains of F0 and duration). In sentence-stressed words the lexically stressed vowel is <+FSTRESS,+DSTRESS>, in unstressed content words <-FSTRESS,+DSTRESS>; in unstressed function words as well as in lexically unstressed syllables that do not get sentence stress the combination is <-FSTRESS,-DSTRESS>. In partially deaccented sentence stresses <+DEACC> is added to the two positive stress features; all other vowels are <-DEACC>. Words that are to get additional emphasis receive the feature <+EMPH> in their lexical stress position, all other vowels <-EMPH>.
Whether <+DSTRESS>, responsible for longer duration, is associated with <+FSTRESS>, marking the vowel as the recipient of intonation features ('peak' and 'valley' contours), or as <-FSTRESS>, not providing the vowel with this potential, depends on the rules of grammar and context of situation in speech communication. They have to be supplied by the linguistic environment of the prosodic phonology (see Section 3). The same applies to the attribution of <+DEACC>.
To distinguish degrees of emphasis, <+EMPH> vowels are given the graded stress level feature <@STRLEV>, with @ = 1,2,...7; <-EMPH> vowels are <0STRLEV>. The higher the stress level, the more prominent these vowels are made. In F0 'peak' contours, this greater prominence is achieved by raising the F0 maximum and, if the 'peak' is non-final in a 'peak' series, by a faster descent as well as by lowering the F0 minimum between 'peaks', proportionally to stress level. In the case of F0 'valley' contours, the final F0 point is raised in accordance with stress level. Emphasis is used to put words and phrases within sentences in focus, particularly when the expansion of intonation contours on certain structural elements is coupled with the deaccentuation of others.
In summary, the following sentence stress features are proposed in a prosodic phonology of German for a comprehensive contrastive categorization:
<+/-FSTRESS>
<+/-DSTRESS>
<+/-DEACC>
<+/-EMPH>
<@STRLEV>, with @ = 0, 1, ...7.
The feature pair <+/-EMPH> constitutes the link with the intonation features.
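The stress feature assignments described above can be summarized in a short sketch. The function name `stress_features` and its arguments are hypothetical and not part of KIM; the sketch covers only the lexically stressed vowel of a word and assumes, for illustration, that emphasis is only assigned to sentence-stressed words.

```python
def stress_features(sentence_stressed, function_word=False,
                    partially_deaccented=False, emphasis_level=0):
    """Feature bundle for the lexically stressed vowel of one word.

    Hypothetical helper: the argument names are illustrative, the
    feature names and value ranges follow the text.
    """
    if not 0 <= emphasis_level <= 7:
        raise ValueError("STRLEV ranges from 0 to 7")
    if emphasis_level > 0 and not sentence_stressed:
        # Assumption of this sketch, not stated as a rule in the text.
        raise ValueError("emphasis is only modelled on sentence-stressed words")

    if sentence_stressed:
        fstress, dstress = True, True      # <+FSTRESS,+DSTRESS>
    elif function_word:
        fstress, dstress = False, False    # <-FSTRESS,-DSTRESS>
    else:
        fstress, dstress = False, True     # unstressed content word

    return {
        "FSTRESS": fstress,
        "DSTRESS": dstress,
        # <+DEACC> is only added on top of the two positive stress features.
        "DEACC": sentence_stressed and partially_deaccented,
        "EMPH": emphasis_level > 0,
        "STRLEV": emphasis_level,
    }


if __name__ == "__main__":
    print(stress_features(True, emphasis_level=3))
    print(stress_features(False, function_word=True))
```

Returning a plain dictionary keeps the bundle easy to inspect; a real rule system would presumably attach such bundles to vowel symbols in the input string.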
All vowels with 'primary' or 'secondary' (= deaccented) sentence stress, i.e. with the feature specification <+FSTRESS,+DSTRESS> receive intonation features, which may be either 'valleys' or 'peaks', specified as <+/-VALLEY>, and in the case of 'peaks' (<-VALLEY>), they may contain a unidirectional F0 fall, classified as <+TERMIN>, or rise again at the end, resulting in a (rise-) fall-rise, categorized as <-TERMIN>. 'Valleys' may have a low rise, to indicate, e.g., continuation, or a high rise, used, e.g., in questions, with the specifications <+/-QUEST>. All 'peaks' and 'valleys' may have their turning points (F0 maximum in 'peaks' or F0 minimum in 'valleys') early or later with reference to the onset of <+VOK,+FSTRESS>, categorized as <+/-EARLY>, and finally, for 'peaks' <-EARLY> may be around the stressed vowel centre or towards its end, classified by the feature opposition <+/-LATE>. The categorization of <-VALLEY> into <+EARLY> and <-EARLY>, with a further subdivision of the latter into <+/-LATE>, captures the grouping of 'late' and 'medial' vs. 'early peaks', as it showed up in perceptual experiments with stepwise 'peak' shift from left to right (Kohler 1990b,1991b).
'Peaks' are characterized by a quick F0 rise confined to the vicinity of a sentence-stressed syllable. This rise precedes the onset of the latter, and is usually narrow in temporal extension, for an 'early peak'; it extends into the first half of the stressed nucleus in the case of a 'medial peak'. In the 'late peak', it starts after the stressed vowel onset and continues into the second half of the nucleus or beyond, the exact timing of the maximum 'peak' value depending on vowel type (duration according to quantity and quality), subsequent voiced/voiceless consonants and number of immediately following unstressed syllables. There may even be a low stretch of F0 in the stressed vowel before the rise. After the 'peak' maximum is reached the F0 descends immediately, especially on subsequent unstressed syllables. But for chains of 'peaks' see 2.2.2.
'Valleys', on the other hand, have a continuous rise, starting before the stressed-syllable nucleus (early) or inside it (non-early) and extending as far as the beginning of the following sentence-stressed syllable. If there are several unstressed syllables between two sentence stresses a 'valley' is thus realised as a more gradual F0 ascent compared with the much quicker rise for a 'late peak'. The less distance there is between stressed syllables the more difficult it becomes to distinguish between a 'valley' + 'peak' and a 'late peak' + 'peak' sequence, especially if there is no F0 dip in between the first and second stress F0 maxima, as in a hat pattern (see 2.2.2).
In summary, the elements of the intonation
component of the prosodic phonology of German are given the following
distinctive features for a comprehensive contrastive categorization:
<+/-TERMIN>
<+/-VALLEY>
<+/-QUEST>
<+/-EARLY>
<+/-LATE>
<+/-EMPH>
<@STRLEV>, with @ = 0, 1, ...7.
In a concatenation of pitch 'peaks' without prosodic boundaries between them (see 2.3.3), F0 may fall to a low or an intermediate level and then rise again for the next 'peak'. This fall will be effected on intervening unstressed syllables between the two 'peaks', reaching the lowest point, to start the next rise, in the vicinity of the following stressed syllable, depending on 'peak' position. If there are no unstressed syllables separating the two 'peaks', the dip can be accommodated between all 'peak' combinations, except for 'late' + 'early'/'medial', where a hat pattern is created; it combines the rise of the 'late peak' and the fall of the 'early peak' in a two-stress sequence.
This boundary case of the absence of an F0 descent between 'peaks' can also be extended to concatenations with intervening unstressed syllables. In such a hat pattern, an 'early peak' is not possible initially, and a 'late' one is excluded non-initially. If there are more than two stresses incorporated in a hat, the non-initial and non-final ones are unspecified as to 'peak' position because they have neither a rise nor a fall but are simply integrated into the downstepped sequence of 'peak' maxima (see 2.3.2). In the categorization of pitch patterns they are nevertheless grouped together with 'peaks'. If in a two-stress rise-fall it is difficult to decide whether the rise represents a 'valley' or a 'late peak' in a hat pattern, the latter solution is chosen.
When prosodic boundaries intervene any sequencing of 'peaks' and/or 'valleys' is possible, but the hat pattern is then excluded since it represents a very high degree of cohesion. On the other hand, a 'late peak' with a full F0 descent marks a dissociation from a following 'peak' and will then normally be linked with a prosodic boundary, i.e. final lengthening and, usually, F0 reset afterwards.
Unstressed syllables preceding the first sentence stress in a prosodic phrase may be either low or high: they represent different types of 'pre-head'.
The acoustic manifestations of the distinctive prosodic categories vary according to segmental and prosodic context, depending on the temporal alignment of 'peaks' and 'valleys' - defined by a small number of significant F0 points - with different syllable types and sequences, on downstepping, speech rate, prosodic boundaries and, finally, articulation-induced microprosody.
Taking the default 'medial peak' as a
reference, two significant F0 points are defined. The first one, TF0,
is positioned at the beginning of the syllable containing the
<+FSTRESS> vowel, the second, T2F0, near the vowel centre, the
exact timing after voiced vowel onset depending on vowel quantity,
vowel height, number of following unstressed syllables and position in
the utterance. The calculation of the time point T2F0 after vowel onset
is carried out on the basis of the segmental duration rules for German.
They have adopted the principle proposed by Klatt (1979) for the rule
synthesis of English (see also Kohler 1988), defining different classes
of segments (e.g. diphthongs vs. long vs. short vowels, low vs. high
vowels) by different pairs of values for intrinsic duration (Di) and
for minimal duration (Dmin) and generating actual segment durations in
various segmental, prosodic and syntactic contexts by the application
of the following rules:
(1) <DUR> → <(Di-Dmin)*PRCNT/100+Dmin>
(2) <PRCNT> → <PRCNT*PRCNT1/100>.
In (1), PRCNT = 100 initially; the rules then change the PRCNT values successively by introducing a rule-specific PRCNT1 value into (2). This way all the factors influencing segmental durations (tempo, position in the word and sentence, stress, segmental context) can be captured in specific rules by inserting a new PRCNT1 value each time. This model assumes that all the factors affecting duration operate independently of each other and that it is only the amount exceeding the minimal duration of a segment that is adjusted by these factors. The two assumptions provide a good approximation of segment timing in languages like German and English, and certainly result in prosodically acceptable speech synthesis.
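A minimal sketch of rules (1) and (2) may help to make the successive percentage adjustment concrete. The function name and the numeric values in the usage example are illustrative assumptions, not values taken from the German duration rules.

```python
def segment_duration(d_intrinsic, d_min, prcnt1_factors=()):
    """Apply rules (1) and (2) to one segment.

    d_intrinsic    -- intrinsic duration Di of the segment class (ms)
    d_min          -- minimal duration Dmin of the segment class (ms)
    prcnt1_factors -- one PRCNT1 value per duration rule that applies
    """
    prcnt = 100.0                       # PRCNT = 100 initially
    for prcnt1 in prcnt1_factors:
        prcnt = prcnt * prcnt1 / 100.0  # rule (2), applied once per factor
    # Rule (1): only the part above the minimal duration is scaled.
    return (d_intrinsic - d_min) * prcnt / 100.0 + d_min


if __name__ == "__main__":
    # Hypothetical vowel class: Di = 200 ms, Dmin = 80 ms.
    print(segment_duration(200, 80))            # no rule applies
    print(segment_duration(200, 80, (50,)))     # one shortening rule
    print(segment_duration(200, 80, (50, 80)))  # two independent rules
```

Because the factors are multiplied into PRCNT one after another, their order does not matter, which reflects the assumption that the influences on duration operate independently.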
T2F0 for 'medial peaks' is now derived from
the basic vowel-type related duration. The only percentage factor that
enters the calculation is the one referring to speech rate; it is
normally set at 100, a speeding up lowers, a slowing down increases the
factor, i.e. it is essentially the intrinsic vowel duration that
determines the point in time after <+VOK,+FSTRESS> onset (T2F0)
where the 'medial peak' is positioned. But this has to be adjusted in
the case of aspiration. On the one hand aspiration lengthens the total
vowel duration, compared with vowels in non-aspirated contexts, but
this increase is not as large as the total aspiration phase; on the
other hand it shortens the stop closure duration compared with
unaspirated cases, but again not by the total amount. So the larger
part of the aspiration (AH) should be added to the vowel, but some of
it attached to the plosive, and the F0 'peak' placement has to take
this ambivalence into account:
(3) <+VOK,+FSTRESS> →
<T2F0=((Di-Dmin)*PRCNT/100+Dmin)*0.6+TLAH*0.75>,
i.e. three quarters of the period up to the last aspiration time point are added to T2F0, shifting it further to the right by this amount.
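Rule (3) can be written out as a small function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, `tlah_ms` stands for the period up to the last aspiration time point (TLAH), and the example values are invented for illustration.

```python
def t2f0_medial_peak(d_intrinsic, d_min, rate_prcnt=100.0, tlah_ms=0.0):
    """Time of T2F0 after stressed-vowel onset (ms), following rule (3).

    rate_prcnt -- speech-rate percentage; 100 is the normal setting,
                  lower values speed up, higher values slow down
    tlah_ms    -- aspiration period TLAH (0 in non-aspirated contexts)
    """
    # Vowel-type related duration, scaled by the rate factor only.
    basic_duration = (d_intrinsic - d_min) * rate_prcnt / 100.0 + d_min
    # The peak sits at 60% of that duration, shifted right by three
    # quarters of the aspiration period.
    return basic_duration * 0.6 + tlah_ms * 0.75


if __name__ == "__main__":
    # Hypothetical long vowel: Di = 200 ms, Dmin = 80 ms.
    print(t2f0_medial_peak(200, 80))              # unaspirated context
    print(t2f0_medial_peak(200, 80, tlah_ms=40))  # after an aspirated plosive
```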
Sentence-final 'medial peaks' receive a third F0 point, T3F0, at 150 ms after the 'peak' maximum at a medium speech rate (see 2.3.4, 2.0(6), 3(6')); in all non-final cases, the maximum 'peak' point of one <+FSTRESS> connects with the left-base point of the next <+FSTRESS>. As the absolute F0 'peak' position is not affected by vowel duration modifications due to voiced/voiceless context, number of syllables in the word, sentence position etc., its relative position changes with vowel shortening or lengthening, moving closer towards, or further away from, the end. This way the microprosodic F0 truncation before voiceless obstruents is automatically built into the rules.
Thus, in utterance-final short-vowel monosyllables, ending in voiceless consonants, an F0 'peak' fall is truncated, and has to be so to create the same perceptual 'peak' pattern as in other, non-truncating contexts. The listener obviously takes the underlying constancy in absolute F0 'peak' positioning within the same vowel type, and the difference across different vowel types, into account, disregarding contextual adjustments. He can always calculate the final F0 point that should have been reached, if the F0 contour had not been curtailed, from the F0 decline per unit of time up to the cut-off, and he can then compare this value with the likely low end of the speaker's speech voice range (about 60 - 80 Hz in a male voice, an octave higher in a female one). If the comparison is within a narrow margin the fall is terminal, otherwise it is not. Since the lower end of a speaker's F0 range is a reliable reference value, truncation of falling F0 patterns can be uniquely restored perceptually.
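The perceptual restoration of a truncated fall can be illustrated with a rough sketch. Several things here are assumptions rather than part of the model: the extrapolation is linear in Hz, the projected end point is taken at 150 ms after the 'peak' maximum (borrowed from the T3F0 timing of sentence-final 'medial peaks'), the floor of 70 Hz is picked from the 60 - 80 Hz range given for a male voice, and the 10 Hz margin is a hypothetical stand-in for the "narrow margin".

```python
def projected_fall_end(f0_peak_hz, f0_cutoff_hz, t_cutoff_ms, t_end_ms=150.0):
    """Extrapolate a truncated fall linearly to its assumed end point.

    t_cutoff_ms -- time after the 'peak' maximum at which voicing stops
    t_end_ms    -- time after the maximum at which the fall would have ended
    """
    slope = (f0_cutoff_hz - f0_peak_hz) / t_cutoff_ms  # Hz per ms, negative
    return f0_peak_hz + slope * t_end_ms


def is_terminal_fall(f0_end_hz, floor_hz=70.0, margin_hz=10.0):
    """A fall counts as terminal if it would have ended near the voice floor."""
    return f0_end_hz - floor_hz <= margin_hz


if __name__ == "__main__":
    steep = projected_fall_end(130.0, 100.0, 60.0)
    shallow = projected_fall_end(130.0, 118.0, 60.0)
    print(steep, is_terminal_fall(steep))
    print(shallow, is_terminal_fall(shallow))
```

No comparable function can be written for rises, since, as the next paragraph explains, the upper end point has no fixed reference value.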
That no longer applies to rising F0. Here the end point is not calculable, because there is a large margin for the end point, anything up to or even above 1.5 octaves is possible. The position of the ceiling is not fixed, by contrast with the base line. That means that the intended high value of a 'valley' always has to be physically reached, it cannot be deduced from what precedes, and, moreover, it does not change with different 'valley' positions from 'early' to 'late', which is quite different in 'peaks'. There is thus a fundamental difference between 'peaks' and 'valleys' in the fixation of their offsets.
An 'early peak' has its maximum value at the <+FSTRESS> syllable onset, TF0 100 ms before, and T3F0 - in sentence-final position - in an area where the 'medial peak' has its maximum. A 'late peak' has TF0 at the same point as a 'medial peak', then an additional low F0 point T2F0 is inserted where the 'medial peak' has its centre, and the 'late' summit (T3F0) occurs 100 ms later or at the end of the last voiced segment in a non-final monosyllabic word if this distance is less than 100 ms. If there is an unstressed syllable following, the summit coincides with the unstressed vowel voice onset. In utterance-final position, a fourth F0 point, T4F0, occurs 100 ms after the summit.
'Valleys' have their left and centre F0 points at the same positions as TF0 and T2F0 in 'medial peaks'. In an 'early valley' the left point is the lowest, whereas in a 'non-early valley' it is the centre point; in both cases, the right, high point is located at the end of the last voiced segment.
Declination, i.e. the temporally fixed decline of F0, prominent in, e.g., the Dutch and Lund models, is not a feature of spontaneous speech production. It has therefore been replaced by downstepping in KIM, i.e. a structurally determined pitch lowering from sentence stress to sentence stress, independent of the time that elapses between them.
Results of interactive testing make it clear that perception is oriented towards structurally positioned and downstepped 'peaks', not towards a time-based declination. The downstepping values used in KIM are 6% from 'peak' maximum to 'peak' maximum, and 18% from a 'peak' maximum to the next base. In 'valleys' both the low and the high F0 values are downstepped by 6%. Downstepping can be interrupted at any point by the feature <+EMPH> or by resetting.
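The downstepping values can be turned into a short calculation. The sketch below assumes that the percentages are applied multiplicatively to F0 values in Hz and that a reset returns to the first 'peak' value; neither detail is specified in the text, and the function names are hypothetical.

```python
def peak_maxima(first_peak_hz, n_stresses, step=0.06, reset_at=()):
    """F0 maxima for a chain of 'peaks', downstepped stress by stress.

    step     -- downstep per sentence stress (6% by default)
    reset_at -- indices of stresses at which pitch reset occurs
                (assumption: reset returns to the first 'peak' value)
    """
    maxima = []
    current = first_peak_hz
    for i in range(n_stresses):
        if i == 0 or i in reset_at:
            current = first_peak_hz
        else:
            current = current * (1.0 - step)
        maxima.append(current)
    return maxima


def base_after_peak(peak_hz, drop=0.18):
    """F0 base following a 'peak' maximum (18% below it by default)."""
    return peak_hz * (1.0 - drop)


if __name__ == "__main__":
    chain = peak_maxima(120.0, 4)
    print([round(f0, 1) for f0 in chain])
    print(round(base_after_peak(chain[0]), 1))
    print([round(f0, 1) for f0 in peak_maxima(120.0, 4, reset_at=(2,))])
```

The calculation depends only on the number of sentence stresses, not on the time elapsed between them, which is the point of replacing declination by downstep.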
One of the functions of prosody is the sequential structuring of utterances and discourse, i.e. the signalling of prosodic boundaries and - at least partially - their hierarchical organization. To decode the syntagmatic chunking of messages in accordance with the speaker's intention the listener requires signals that index degrees of cohesion or separation, respectively, between phrases, clauses, utterances and turns. The parameters that achieve this are pause duration, phrase-final segmental lengthening and scaling of F0 end points at the respective boundaries. They can be controlled by parametric rules in the prosodic model upon appropriate symbolic input.
As the linguistically and phonetically relevant categorization of these boundaries is not well understood at this stage, the modelling cannot reduce the categories in this domain to the same small number as in the other areas of prosody discussed so far, but has to allow sufficient degrees of freedom for experimentation with data modelling. For each of the three parameters, three degrees are therefore recognised, controlled by digit notation in the symbolic input to the model (see Section 3). As our knowledge of prosodic boundary marking increases, the degrees of freedom can be reduced by establishing constraints between the three parameters in the signalling of the necessary and sufficient number of phonologically relevant distinctions.
Speech rate changes within the same speaker not only alter the segmental durations (to varying degrees for different segment types, e.g. vowels vs. consonants) but also the positioning of T2F0 within a stressed vowel, moving it farther or less far into it in slower and faster speech, respectively (see 2.3.1). This also implies slower or faster rises and falls, which lower or raise the perceived pitch level. The faster movements in turn mean that there comes a point when the complete F0 excursions can no longer be carried out. Since the 'peak' and 'valley' maxima are the essential target values controlled by the speaker, the levelling of F0 movements in fast speech particularly affects the low values. Finally, faster speech also means greater effort, which may produce a higher level of activation at the vocal folds right away. The correlation of F0 level with speech rate perception has been shown in Kohler (1986b,c).
The control of speech rate is also coupled with articulatory reduction and elaboration (Kohler 1990a). The speech rate category therefore activates whole blocks of parametric rules that deal with F0 timing, F0 patterning, segment durations and segmental adjustments (coarticulation, reduction, reinforcement, elision). To start with, the model distinguishes four degrees, one of them - reduced - taking reduction phenomena (at an otherwise medium speech rate timing) into account, as one way of changing speed of articulation. This level will probably have to be subdivided into subcategories according to the degree of formality and spontaneity of speaking, comprising different rule modules for the respective degree. A good deal more research into spontaneous speech is necessary before an adequate categorization can be set up in the model.
In the generation of intonation, KIM separates two levels
This dichotomy implies the assumption that the underlying F0 'peak' and 'valley' patterns develop independently and in a very concrete physical and physiological sense in speech production, and are modified microprosodically by output constraints in the vocal apparatus. In particular, we have to distinguish five areas of microprosodic adjustments to the basic significant point patterns discussed so far.
KIM is integrated into a pragmatic, semantic and syntactic environment. The input to the model consists of symbolic strings in phonetic notation with additional pragmatic, semantic and syntactic markers. The pragmatic and semantic markers trigger, e.g., the pragmatically or semantically conditioned use of 'peak' and 'valley' types or of sentence focus. Lexical stress position can largely be derived by rule, and syntactic structure rules mark deaccentuation and emphasis in word, phrase, clause and sentence construction. Phrasal accentuations are thus derived from the syntactic component preceding the prosodic model, and are given special symbolizations in the input strings to the model. The following 7-bit ASCII characters are used to represent the prosodic categories of 2.0 (using the corresponding numbering):
| (1') | Apostrophe and quotation mark (' and ") are put in front of the primary and secondary stress vowel, respectively; vowels without a lexical stress marker are unstressed: R'ück#s"icht ("view to the back") vs. R'ücksicht ("consideration") (# marks the phonetically - especially prosodically - relevant word boundary in compounds). |
| (2') | Digits 3 2 1 0 are put in front of words that receive the reinforced, neutral, partially or completely deaccented sentence stress category, which in turn affects the manifestation of the respective lexically stressed vowel. Function words, marked by suffixed +, have 0 as their default, non-function words 2; in both cases the digit may be omitted in the symbolization: 2Max 0hat+ 0einen+ 2Brief 2geschrieben . ("Max did write a letter") 2Max 0hat+ 0einen+ 2Brief 1geschrieben . (semantically unmarked rendering of "Max wrote a letter"; answer to question "What did Max do?") 2Max 0hat+ 0einen+ 2Brief 0geschrieben . ("Max wrote a letter, not a card"; answer to question "What did Max write?") 2Max 0hat+ 0einen+ 3Brief 0geschrieben . (reinforcement of contrasted "Brief" in the previous example) Degrees of reinforcement may be symbolized by digits 3 to 9. |
| (3') | The punctuation marks . , ? and their sequences ., .? are used phonetically for pitch 'peaks', low and high rising 'valleys', low and high fall-rises (combined 'peak' + 'valley' patterns). Therefore texts in KIM and as TTS input (see Section 4) no longer contain orthographic punctuation. Examples: Ja . Ja , Ja ? Ja ., Ja .? A high pre-head is symbolized by = at the beginning of a prosodic phrase: p:= 0Wie+ 0sieht 0das 0bei+ 0Ihnen+ 0am+ 3Donnerstag 0aus. |
| (4') | Parentheses ) ( for 'early' and 'late peak' positions in sentence-stress syllables are put before the stressed word (after the sentence-stress digit); the 'medial peak' position is regarded as the default case and remains unmarked: Sie+ hat+ ja+ 2)gelogen . ("She's been lying." = summarizing, concluding statement) Sie+ hat+ ja+ 2gelogen . (= start of a new argumentation) Sie+ hat+ ja+ 2(gelogen . (as the preceding example but with a contradictory note) In connection with 'valleys' there are only early and non-early positions in the sentence-stressed syllable. Either one or the other category may be taken as the unmarked default case, depending on their frequency of occurrence: early for , and late for ? . Wie+ 2( heißt du+ ? "What's your name?" Wie+ 2) heißt du+ ? Wie+ 2) heißt du+ , Wie+ 2( heißt du+ , |
| (5') | The prosodic boundary (cohesion) marker p: is put after the word at which boundary indices occur. It is preceded by two digits, the second of which refers to pause length, the first to utterance-final lengthening. In the case of pitch 'peaks', there is a third boundary-related digit to the left of these two, referring to the scaling of the F0 end point. Each of the digits may range from 0 (= absence of pause, of final lengthening or of F0 descent) through 1 (= short pause - <=200ms; default utterance-final lengthening; intermediate F0 descent) to 2 (= long pause - >=200ms; hesitation lengthening; full F0 descent). If there is no prosodic punctuation mark associated with p: default . is assumed; the other pitch categories (see ()) have to be symbolized after the prosodic boundary marker p: in KIM and in its TTS input, but before the phrasing marker PGn in PROLAB (see Example Spontaneous). zehn 211p: minus+ zwei 100p: mal+ drei ("10 - 2 x 3") zehn 100p: minus+ zwei 211p: mal+ drei ("(10 - 2) x 3") |
| (6') | The digit string associated with the phrase boundary marker p: is preceded by a further digit, ranging from 0 to 3 to mark four degrees of speech rate, which include degrees of reduction or elaboration: 2 refers to medium overall speed and default reduction (and may be omitted from the symbolization as an implicit default), 1 refers to the same speed but a higher degree of reduction; for 0, degrees of reduction and speed are increased, for 3 they are both decreased from 2. The rate digit (or default) applies to the stretch of speech between the p: marker and its predecessor or the utterance beginning, respectively. In this modelling of speech rate, segment durations are not changed by uniform and proportionate up or down-scaling across the whole sequence, but vowels and consonants are dealt with separately according to sets of rules including segmental reduction, assimilation and elision. mit+ roten gelben blauen braunen 3212p: ("with red, yellow, blue and brown ones") mit+ roten gelben blauen braunen 2212p: mit+ roten gelben blauen braunen 1212p: mit+ roten gelben blauen braunen 0212p: |
| (7') | The model does not include the category of declination over time, but incorporates the structurally determined, time-independent category of downstep from 'peak' to 'peak' and from 'valley' to 'valley'. It is set at a constant value (6% in medium and slow speeds, 4% in fast speed) and is not indicated symbolically. Pitch reset can occur at any point in the chain of 'peaks' or 'valleys' and is associated with a prosodic boundary. It is marked by + before the digit sequence at the preceding p:. mit+ roten gelben 2110p: blauen schwarzen 2212p: mit+ roten gelben +2110p: blauen schwarzen 2212p: |

The KIM symbolization system outlined above has been used in the TTS implementation of KIM (see Section 4), but it has also been adapted for use as an efficient prosodic labelling system (PROLAB) in the processing of recorded read and spontaneous speech data to create a labelled data bank. One of the aims of such a corpus is to enlarge the empirical basis for prosodic modelling. The different requirements in this "manual" annotation of natural speech have resulted in the following adjustments and additions of the model symbolization.
The
following orthographic transcript with prosodic annotations is an
illustration of such a PROLAB label file for a turn in a spontaneous
dialogue from the Kiel Corpus recorded and processed at IPDS (see IPDS,
1995f). The corresponding speech file may be listened to by clicking on
the "original" button. After converting the labelled text into the
TTS-compatible KIM format it can be fed into the TTS system and changed
back to speech. This synthesized version of the spontaneous dialogue
turn may be activated by clicking on the "synthesized" button.
&2
<ähm> &PGn &2( D'ienstag &0würde+
&0 mir+ &0 g'ut &0. &2)
p'assen , &2. &PGn
&2 <ähm> &PGn &0 das+ &2]
h'eißt , &, &PGn p: &2^ Mom'ent , &1.
&PGn
&2( 'allerdings &0 'erst z: &0. &PGn
&2( n'achm"ittags h: . &2. &PGn
&RP &HP &0 das+ &0 wird+ &0
dann+ &2^ wahrsch'einlich &0 'n+
&0 b'ißchen &1. &2^ schw'ierig .
&2. &PGn
&RM &2^ D'ienstag , &0. &|2^ m'ittwochs z:
&1. &PGn &0 <äh> &PGn
p: &0 +/is=/+ &PG/ &1^ s'ieht &0
das+ &0 bei+ &0 mir+ z:
&0 +/sch=/+ &2. &PG/ &2^ schw'ierig &0
'aus . &2. &PGn
&0 da+ &0 hab' &0 ich+ &2^
tags'über &1. &2^ Term'ine . &1. &PGn
h: &2 <ähm> &PGn &RP &HP &0
wie+ &0 s'ieht &0 das+ &0 bei+
&0 Ihnen+ &0 am+ &3
D'onnerstag &0 'aus ? &2. &PGn
| original | |
| synthesized |
For further details see Kohler et al. 1995.
KIM has been implemented in the RULSYS/INFOVOX TTS for German. The Kiel development of this TTS system (for details see Carlson [KTH] et al. 1990, Kohler 1991c) makes use of a very simple adaptation of 7-bit ASCII to the phonetic transcription of German: (a) upper-case letters for segmental phonemes, (b) lower-case ones for allophones, (c) the characters listed in 3 (1')-(7'). These phonetic symbols are either derived by rule from orthographic input, or they are entered into the system directly, enclosed between # metacharacters. In the latter case the input string can be either entirely phonetic, or mixed orthographic/phonetic as illustrated by the examples in Section 3.
The greater part of the prosodic notations in (2')-(7') have to be entered as such, because the syntactic component of the system is not powerful enough to derive them by rule from orthographic input. Moreover, in many cases semantic and pragmatic rules would be required to generate the correct prosodic output. The symbolic prosody markers trigger hierarchical sets of symbolic distinctive feature rules, followed by sets of parametric F0 and duration rules in the phonetic-to-acoustic output component.
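How the word-level markers of (2') and (4') might be read from an input string can be sketched as follows. This is not the RULSYS rule formalism; the function name, the regular expression and the returned dictionary are hypothetical, and the sketch only handles a marker written directly in front of a word (digit first, then an optional parenthesis, then the word, then an optional function-word +).

```python
import re

# Optional stress digit, optional 'peak' position parenthesis, the word
# itself, optional function-word suffix.
_WORD = re.compile(r"^(\d?)([()]?)([^\d()+\s]+)(\+?)$")

_PEAK_POSITION = {")": "early", "(": "late", "": "medial"}


def parse_word(token):
    """Split one annotated word into its prosodic markers."""
    match = _WORD.match(token)
    if match is None:
        raise ValueError(f"not an annotated word: {token!r}")
    digit, paren, word, plus = match.groups()
    is_function_word = plus == "+"
    if digit:
        stress = int(digit)
    else:
        # Defaults from (2'): 0 for function words, 2 for all others.
        stress = 0 if is_function_word else 2
    return {
        "word": word,
        "function_word": is_function_word,
        "sentence_stress": stress,
        "peak_position": _PEAK_POSITION[paren],  # medial is unmarked
    }


if __name__ == "__main__":
    for token in "Sie+ hat+ ja+ 2)gelogen".split():
        print(parse_word(token))
```

The 'peak' position is only meaningful for words that actually carry a sentence stress; the sketch reports the unmarked default for all other words.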
The speed control digit at the p: marking attributes a parametric rate variable to every segmental symbol and sets it to a value representing the respective category. Blocks of duration, segment and F0 rules in the phonetic module are then activated by the particular rate variable value and the appropriate calculations along the three phonetic scales are performed. This means that for a particular speed it is not only the segment durations that are adjusted across the whole chain to which the particular rate factor applies, but F0 is also raised for speeding up or lowered for slowing down, and segmental reductions or elaborations are effected simultaneously, in accordance with natural speech production. The segment durations are scaled separately for vowels and consonants and also as a function of a number of other conditioning factors (vowel height, consonant category, stress, number of syllables in the word). The digit before p: controlling phrase-final lengthening triggers a more local increase or decrease of segment durations within the set global speech rate.
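The digit string in front of p: can be decoded with a few lines of code. The sketch below follows the digit order given in Section 3 (5')-(7'): reading from the right, pause, phrase-final lengthening, F0 end point scaling and speech rate, with an optional + for pitch reset. It assumes that a three-digit string always includes the F0 scaling digit and that the rate digit defaults to 2; the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

_MARKER = re.compile(r"^(\+?)(\d{2,4})p:$")


@dataclass
class Boundary:
    reset: bool              # '+' before the digits: pitch reset
    rate: int                # 0..3, default 2 (medium speed)
    f0_end: Optional[int]    # 0..2, scaling of the F0 end point ('peaks')
    lengthening: int         # 0..2, phrase-final lengthening
    pause: int               # 0..2, pause duration


def parse_boundary(token):
    """Decode a boundary marker such as '211p:' or '+2110p:'."""
    match = _MARKER.match(token)
    if match is None:
        raise ValueError(f"not a boundary marker: {token!r}")
    digits = [int(ch) for ch in match.group(2)]
    pause = digits[-1]
    lengthening = digits[-2]
    f0_end = digits[-3] if len(digits) >= 3 else None
    rate = digits[-4] if len(digits) == 4 else 2
    return Boundary(match.group(1) == "+", rate, f0_end, lengthening, pause)


if __name__ == "__main__":
    for token in ("211p:", "100p:", "3212p:", "+2110p:"):
        print(token, parse_boundary(token))
```

In a rule system of the kind described here, the decoded rate value would then select the blocks of duration, segment and F0 rules, while the lengthening digit would act more locally at the boundary.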
The TTS implementation of KIM allows the calculation of speech timing at a hierarchy of levels from segment to segment chain to phrase to utterance, according to a Klatt type (1979) model for segment timing (see 2.3.1) with factors determined by stress, utterance position, number of syllables in the word and overall speech rate (see Kohler 1986a).
The Kiel prosodic model for German is comprehensive and detailed enough for its TTS realization to generate highly intelligible and natural sounding synthetic output for continuous text. This may be text in ordinary orthographic form which, due to its syntactic and semantic simplicity, requires no or only very few prosodic markers, as in Example Orthographic. Alternatively, it may be supplied with systematic phonetic and prosodic annotations in the case of very intricate phrasing structures, which cannot be derived uniquely from the orthography of the complicated text, as in Example Annotated. The TTS implementation of KIM is also capable of simulating spontaneous speech on the basis of phonetic and prosodic label files that have been created for natural speech files using KIM and its symbolization system, as in Example Spontaneous. Prosodic modelling, its TTS implementation and testing, and model-driven labelling of natural speech thus form an integrated programme of prosodic research at IPDS Kiel.
Es war
in Berlin zu einer Zeit, als Lebensmittel nicht genügend vorhanden
waren. Vor einem Laden stand bereits um sieben Uhr eine beachtliche
Menschenmenge; denn man hatte dort am Abend vorher auf einem Schild
schon lesen können, daß frische Butter eingetroffen sei.
Jeder wußte, daß die Butter schnell ausverkauft sein
würde und daß man ganz früh kommen
müsse, um noch etwas zu erhalten.
Da das Geschäft erst um acht geöffnet wurde, stellten sich
die Leute vor der Laden-tür in einer Reihe an. Wer später
kam, mußte sich hinten anschließen.
Je näher der Zeiger auf acht kam, desto unruhiger wurden die #)#
Leute. Da kam endlich ein kleiner Mann mit grauem Haar und drängte
sich ziemlich rücksichtslos nach vorn. Die wartenden Menschen
waren empört über solches Verhalten
und forderten ihn auf, sich ebenfalls hinten anzustellen. Aber auch als
mit der Polizei schon gedroht wurde, #0# ließ sich der Mann nicht
beirren, sondern drängte sich weiter durch. Er bat, man
solle ihn doch #9# durchlassen. Oder glaubte man, daß diese
Drängelei für ihn vielleicht ein #6)# Vergnügen sei?..
Das war für die Leute nun doch zu viel.. Alle kochten bereits vor
Wut, und der Mann konnte jetzt von allen Seiten Schimpfwörter
hören.. Er aber zuckte
resigniert mit den Schultern und bemerkte: "Nun #6)# gut. Wie Sie #)#
wollen. Wenn Sie mich nicht vorlassen, dann
kann ich die Tür nicht aufschließen, und Sie können
meinetwegen hier stehenbleiben, bis die Butter
ranzig geworden ist."
| synthesized (3.22 MByte) |
| butterge.zip (2.4 MByte): WAV format, 16 bit, 16 kHz, 3.22 MByte (decompressed), 1:45.591 minutes |
Address to a tutorial at the Konvens meeting in Vienna, 27 September 1994
In this text <;;> is shorthand for #000p:#, i.e. the
"hat pattern". <,,> stands for the fall-rise (., in KIM).
Meine
sehr ;; verehrten
;; Damen und Herren, #1# liebe ;; Teilnehmer
am Tutorium #+211p:#, Aussprache-lexika in der signalnahen
Sprachverarbeitung..
Es
#2# begrüßt Sie die klare #+211p:#, etwas metallische
#+210p:#,
aber dennoch melodische #+211p:#, und vor allen ;; Dingen ;;
rhythmische #+3210p:# synthetische Stimme des Nordens. Sie #2# basiert
auf dem
TTS-System #+2100p:# VOX #1# PC #+211p:#, der #0# Firma Infovox in
Stockholm
#+2212p:#
und auf dem Softwehr Entwicklungswerkzeug Ruhlsys #+210p:#, der
Technischen
#3# Hochschule #0# Stockholm. Entscheidend für die Geburt dieser
Stimme war aber die Entwicklung von Regeln #+210p:#, zur akustischen
Wandlung orthographischer Symbolketten #+211p:#, im Institut für
Phonetik
#+210p:#, und digitale Sprachverarbeitung #+211p:#, der Christian ;;
#'ALBR[CHTS%UNIVERZIT'[:T#
zu Kiel... An #2# sich sollte #0# Professer #3# Kohler diese #1#
einleitenden #1# Worte #0# sprechen. Da er aber noch ein #0#
bißchen unter
Zeitverschiebung #+2210p:# nach einer USA und einer Japan-reise #0#
leidet, #0# braucht er #1# heute morgen #+210p:#, noch einen etwas
längeren Anlauf. #0# Deshalb ist er sehr froh #0DAR'Y:Br#,,
daß er diese Aufgabe #2#
mir #+2210p:# einer Sprechmaschine übertragen kann....
Ehe Herr Kohler #+2100p:# mit Ihnen die Struktur #2110# und die
Generierung #2110p:# von Aussprache-lexika #+212p:#, sowie ihren
Einsatz in Forschung und Anwendung #+2000p:# auf verschiedenen ;;
Ebenen erläutert #+212p:#,
möchte ich nicht versäumen, den Organisatoren der #1# Tagung
#211p:#, auch in #1(# seinem #010p:#, Namen #+011p:#, für die
Einladung zur
Ausrichtung des Tutoriums #+2100p:# sehr herzlich zu ;; #)# danken..
#2# Ihnen, den Teilnehmern, gebührt ebenfalls #0# Dank #210p:#,
daß Sie sich #0# dafür entschieden haben.. Wir haben die
Einladung
#0#natürlich sehr ;; gern #1#aufgegriffen #+2210p:# #2# nicht nur
weil sie Herrn
Kohler die Möglichkeit #1# gibt #200p:#, #3# Wien zu #1#besuchen
#+211p:#, und den Heurigen zu #1# genießen #+211p:#, sondern um
vor allem
die Kieler ;; Forschung #1# vorzustellen #+2210p:# und #)# Interesse an
ihr zu #1# wecken....
Jetzt darf ich aber Herrn Kohler nicht länger von seiner Arbeit
abhalten. Er ist #0# inzwischen aufgewacht #2111p:# und
schon #3(# unruhig geworden. und er findet vor allem die Text-eingabe
sehr
ermüdend #+2211p:# da er nur mit zwei Fingern tippen kann. Ich
ziehe mich also
zurück, und wünsche Ihnen viel Vergnügen ;; beim
Tutorium.
| synthesized (4.31 MByte) |
| ansprache.zip (3.4 MByte): WAV format, 16 bit, 16 kHz, 4.31 MByte (decompressed), 2:21.348 minutes |
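The two shorthand symbols used in the tutorial address can be expanded mechanically before the text is passed on. The following is a minimal sketch, assuming that the expansion is a plain string substitution and that the KIM fall-rise symbol is the two-character sequence `.,` as stated in the note preceding the text. The names `SHORTHAND` and `expand_shorthand` are hypothetical.

```python
# Shorthand symbols of the tutorial address and their assumed expansions.
SHORTHAND = {
    ";;": "#000p:#",  # the "hat pattern"
    ",,": ".,",       # fall-rise in KIM notation
}


def expand_shorthand(text):
    """Replace each shorthand symbol by its full marker."""
    for short, full in SHORTHAND.items():
        text = text.replace(short, full)
    return text
```

For instance, `expand_shorthand("sehr ;; verehrten")` yields `sehr #000p:# verehrten`. Single commas and semicolons in the text are left untouched, since only the doubled symbols are matched.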
CARLSON, R., GRANSTRÖM, B. & HUNNICUTT, S. (1990): Multi-lingual text-to-speech development and applications. In: Advances in Speech, Hearing, and Language Processing (ed. W.A. Ainsworth), London: JAI Press, 269-296.
IPDS (1995f): CD-ROM#1-3: The Kiel Corpus of Read/Spontaneous Speech. Kiel.
KLATT, D.H. (1979): Synthesis by rule of segmental durations in English sentences. In: Frontiers of Speech Communication Research (eds. B. Lindblom & S. Öhman), London/New York/San Francisco: Academic Press, 287-299.
KOHLER, K.J. (1986a): Invariance and variability in speech timing: from utterance to segment in German. In: Invariance and Variability of Speech Processes (eds. J.S. Perkell & D.H. Klatt), Hillsdale, N.J.: Lawrence Erlbaum, 268-289.
KOHLER, K.J. (1986b): Parameters of speech rate perception in German words and sentences: duration, F0 movement, and F0 level. Language and Speech 29, 115-139.
KOHLER, K.J. (1986c): F0 in speech timing. Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung der Universität Kiel (AIPUK) 20, 55-97.
KOHLER, K.J. (1988): Zeitstrukturierung in der Sprachsynthese. In: Digitale Sprachverarbeitung. ITG-Tagung, Bad Nauheim. (ed. A. Lacroix), Berlin/Offenbach: vde-Verlag, 165-170.
KOHLER, K.J. (1990a): Segmental reduction in connected speech in German: phonological facts and phonetic explanations. In: Speech Production and Speech Modelling (eds. W.J. Hardcastle & A. Marchal), Dordrecht/Boston/London: Kluwer Academic Publishers, 69-92.
KOHLER, K.J. (1990b): Macro and micro F0 in the synthesis of intonation. In: Papers in Laboratory Phonology I (eds. J. Kingston & M.E. Beckman), Cambridge: Cambridge University Press, 115-138.
KOHLER, K.J. (1991a): A model of German intonation. Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung der Universität Kiel (AIPUK) 25, 295-360.
KOHLER, K.J. (1991b): Terminal intonation patterns in single-accent utterances of German: Phonetics, phonology, and semantics. Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung der Universität Kiel (AIPUK) 25, 115-185.
KOHLER, K.J. (1991c): Prosody in speech synthesis: the interplay between basic research and TTS application. Journal of Phonetics 19, 121-138.
KOHLER, K.J. (1997a): Parametric control of prosodic variables by symbolic input in TTS synthesis. In: Progress in Speech Synthesis (eds. J.P.H. van Santen, R.W. Sproat, J.P. Olive, J. Hirschberg), New York: Springer, 459-475.
KOHLER, K.J. (1997b): Modelling prosody in spontaneous speech. In: Computing Prosody (eds. Y. Sagisaka, N. Campbell, N. Higuchi), New York: Springer, 187-210.
KOHLER, K.J., PÄTZOLD, M. & SIMPSON, A.P. (1995): From scenario to segment - The controlled elicitation, transcription, segmentation and labelling of spontaneous speech. Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung der Universität Kiel (AIPUK) 29.
The development of KIM was carried out with financial support from the German Research Council (DFG grants Ko 331/19-1-4) in the project "Form and function of intonation peaks in German" between 1985 and 1989. Some of the initial implementation in the RULSYS/INFOVOX TTS system was made possible by a contract with the company Infovox, Solna, Sweden, in the years 1987-1989. Furthermore, I gratefully acknowledge the continuous and extremely fruitful cooperation with Rolf Carlson and Björn Granström at KTH, Stockholm.
Last changes: June 2008, K. J. Kohler