Label Assisted Copy Synthesis


The automatic generation of control signals to drive a formant synthesizer offers an excellent method of validating phonological models by observing their phonetic output. The task is made all the more challenging by the high quality of the speech that a formant synthesizer such as Klatt's (1980) model can produce when provided with appropriate control signals.

Copy synthesis of natural utterances is one of the most interesting and enlightening methods of arriving at these control signals. However, two serious problems arise when mapping the results of an acoustic analysis onto the control parameters of the Klatt formant synthesizer:

  1. There is a discrepancy between the information delivered by the acoustic analysis of an utterance and the rich variety of synthesizer parameters which can be used to model the acoustic signal.
  2. Parametric information about more complex products of the vocal tract is usually not available in the analysis. Voiced fricatives are an example of this. A voiced fricative such as [z] emerges from the analysis either as a voiceless fricative (no F0 found) or as a frictionless approximant (F0 found). Although the former may be the most appropriate analytical outcome for a synthetic utterance, neither allows the original fricative to be modelled.

Label Assisted Copy Synthesis (LACS) is a knowledge-based solution to the problems outlined above. The mapping of acoustic analysis onto synthesizer control parameters is carried out using information from annotations of the utterances being synthesized. At any point in the mapping process a decision can be made using the linguistic information provided by time-aligned labels. Using a large labelled corpus such as The Kiel Corpus allows copy synthesis of a number of different female and male voices carrying out different linguistic tasks.
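The label-driven decision described above can be sketched as follows. This is a minimal illustration, not the LACS implementation: the `Label` and `Frame` types, the function names, the SAMPA-like symbol set and the three class names are all assumptions introduced here. It shows only how a time-aligned label can override what the F0 analysis alone would suggest for a segment such as [z].

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Label:
    start: float   # segment start in seconds
    end: float     # segment end in seconds
    symbol: str    # segment symbol (SAMPA-like notation is an assumption)


@dataclass
class Frame:
    time: float    # analysis frame time in seconds
    f0: float      # F0 in Hz; 0.0 means the analysis found no voicing


# Hypothetical symbol set; a real label inventory would come from the corpus.
VOICED_FRICATIVES = {"z", "v", "Z"}


def label_at(labels: List[Label], t: float) -> Optional[Label]:
    """Return the label whose interval contains time t (labels sorted by start)."""
    starts = [lab.start for lab in labels]
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < labels[i].end:
        return labels[i]
    return None


def source_class(frame: Frame, label: Optional[Label]) -> str:
    """Classify a frame using the label first and the F0 analysis second."""
    if label is not None and label.symbol in VOICED_FRICATIVES:
        # The label says voicing and friction co-occur, whatever F0 tracking returned.
        return "voiced_fricative"
    return "voiced" if frame.f0 > 0 else "voiceless"


labels = [Label(0.0, 0.1, "a"), Label(0.1, 0.2, "z")]
frame = Frame(time=0.15, f0=0.0)               # F0 analysis returned voicelessness
print(source_class(frame, label_at(labels, frame.time)))
```

Without the label, the frame in this example would be treated as voiceless; with it, the mapping can ask for both a voicing and a frication source. Supplying an F0 value for such frames (for instance by interpolation) is deliberately left out of the sketch.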

Modelling glottal activity is one of the ways in which label information can be used to exploit fully the parameters which the Klatt synthesizer provides. The diagrams below illustrate how the different correlates of h can be modelled. In either case it is only through the combination of label and analytical information that we can control the source parameters for voicing and aspiration and decide whether to use the formant information to excite the cascade or parallel branch of the synthesizer.
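As a rough sketch of how label and analysis might be combined for h, the rule table below selects source settings from the label plus the F0 result. The parameter names AV (amplitude of voicing), AH (amplitude of aspiration) and SW (cascade/parallel switch) follow Klatt (1980); everything else is an assumption made for illustration, including the function name, the dB values, the encoding of SW and, in particular, which branch is paired with which case. None of these settings are taken from LACS itself.

```python
from typing import Dict, Optional

# Assumed encoding of the Klatt (1980) cascade/parallel switch SW.
CASCADE, PARALLEL = 0, 1

# Placeholder rules, keyed by whether the F0 analysis found voicing.
# The values and branch choices are illustrative only, not the authors' settings.
H_RULES: Dict[bool, Dict[str, int]] = {
    True:  {"AV": 50, "AH": 45, "SW": CASCADE},   # F0 found: voicing plus aspiration
    False: {"AV": 0,  "AH": 55, "SW": PARALLEL},  # no F0: aspiration only
}


def source_settings(symbol: str, f0_found: bool,
                    rules: Dict[bool, Dict[str, int]] = H_RULES
                    ) -> Optional[Dict[str, int]]:
    """Return source settings for an h-labelled frame, or None for other labels."""
    if symbol != "h":
        return None
    return dict(rules[f0_found])  # copy so callers cannot alter the rule table


print(source_settings("h", f0_found=True))
print(source_settings("h", f0_found=False))
```

Keeping the rules in a table rather than in code makes it easy to replace the placeholder values with settings derived from listening tests or from the corpus.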

Here are some examples for the ear, comparing the original utterances with their copy-synthetic counterparts. The first illustrates the 'reconstruction' of creak at the onset of ein when the F0 analysis has returned voicelessness. In the second example, note in particular the voiced alveolar friction in the word Konserven. This portion of the signal, which the F0 analysis returns as voiced, would otherwise have been synthesized as something akin to [ð].

Über die Felder weht ein Wind. Original LACS
Hier gibt es Konserven. Original LACS
Gib mir bitte die Butter. Original LACS
Wer möchte noch Milch? Original LACS
Überquere die Straße vorsichtig! Original LACS
Da möchte ich gerne mit. Original LACS
Die Kartoffeln gehören zum Mittagessen. Original LACS
Dazu essen wir den Salat. Original LACS
Danach tut eine Wanderung gut. Original LACS
Manche Obstbäume blühen prächtig. Original LACS
Am Zaun steht eine Regentonne. Original LACS
Der gelbe Küchenofen sorgt für Wärme. Original LACS
Die Rinder sind noch auf der Weide. Original LACS
Die Fahrt war ja mächtig kurz. Original LACS


