The Kiel Corpus

General Information

The Kiel Corpus is a growing collection of read and spontaneous German speech that has been collected and segmentally labelled at the IPDS since 1990. At present, the Kiel Corpus available on CD-ROM comprises over four hours of labelled read speech on The Kiel Corpus of Read Speech Vol. I, as well as four hours of labelled spontaneous speech on The Kiel Corpus of Spontaneous Speech Vol. I, Vol. II and Vol. III.

Download a free sample!

The sample is a zipped tar file containing ten signal files from the Berlin sentence corpus on The Kiel Corpus of Read Speech Vol. I as well as five turns from The Kiel Corpus of Spontaneous Speech Vol. II. The files are in ESPS/waves+ format or MS RIFF WAVE format. The zip files are about 2 MB in size.

Segmentation and labelling

Labelling of the Kiel Corpus begins with a canonical phonemic transcription of the utterance. A list of labels is created from the transcription. Each element of the list is prefixed with one of the following:

## for word-initial labels
$ for word-internal labels
$# for word-internal, compound-initial labels
# for word-external labels, i.e. pauses, breathing, sentence punctuation.
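
The prefix scheme above can be read mechanically. The following is a minimal Python sketch, not part of the corpus tools: the function name split_prefix and the position strings are chosen here for illustration only. The one point that needs care is that the two-character prefixes must be tested before the one-character ones.

```python
# Prefixes from the list above, longest first, so that "$#" is not
# mistaken for "$" and "##" is not mistaken for "#".
PREFIXES = [
    ("##", "word-initial"),
    ("$#", "word-internal, compound-initial"),
    ("$", "word-internal"),
    ("#", "word-external"),
]


def split_prefix(label):
    """Split a label into (position, body), e.g. '##b' -> ('word-initial', 'b').

    Hypothetical helper for illustration; the corpus does not define it.
    """
    for prefix, position in PREFIXES:
        if label.startswith(prefix):
            return position, label[len(prefix):]
    raise ValueError(f"label has no known prefix: {label!r}")


if __name__ == "__main__":
    for label in ["##b", "$t", "$n"]:
        print(label, split_prefix(label))
```

Keeping the prefixes in an ordered list rather than a dictionary makes the longest-first matching order explicit.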

The labels are aligned temporally with the signal. Each label is aligned with the beginning of that portion of the signal it is considered to be chiefly responsible for. Where necessary the labels are modified. Here are some examples of possible modifications:

Before   After    Meaning
##b      ##%b     The delimitation of a portion of signal is uncertain; in this case the closure of the plosive could not be ascertained.
$t       $t-      Phonetic correlates of the phonological element are absent. Commonly used to show the absence of any stop or plosive element following fricatives, or to show the absence of a vocalic portion in /@n/ or /@l/ sequences.
$n       $n-m     The signal portion is more accurately represented with another label from the inventory. Most commonly used to represent
(none)   $-p      A label from the inventory is inserted to label a signal portion not necessarily foreseen in the canonical transcription. Often used to indicate the presence of epenthetic stops.
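
The modification patterns above can be told apart by the position of the hyphen and the presence of the % mark. The following Python sketch illustrates this under stated assumptions: the function name describe_modification and the kind names are invented here, the label prefix is assumed to have been removed already, the % mark is assumed to come directly after the prefix as in ##%b, and symbols from the inventory are assumed not to contain a hyphen themselves.

```python
def describe_modification(body):
    """Classify a label body (prefix already removed) by its modification.

    Returns a tuple (kind, canonical, realized, uncertain).
    Hypothetical helper for illustration; the corpus does not define it.
    """
    # A leading "%" marks an uncertain delimitation, as in "##%b".
    uncertain = body.startswith("%")
    if uncertain:
        body = body[1:]

    # Split at the first hyphen: canonical symbol on the left,
    # realized symbol on the right.
    canonical, hyphen, realized = body.partition("-")

    if not hyphen:
        # No hyphen: the canonical label stands unchanged, e.g. "t".
        return ("unmodified", canonical, canonical, uncertain)
    if canonical and not realized:
        # Trailing hyphen: correlates are absent, e.g. "t-".
        return ("absent", canonical, None, uncertain)
    if canonical and realized:
        # Both sides filled: another label fits better, e.g. "n-m".
        return ("replaced", canonical, realized, uncertain)
    if realized:
        # Leading hyphen: a label is inserted, e.g. "-p".
        return ("inserted", None, realized, uncertain)
    raise ValueError("a bare hyphen is not a valid label body")


if __name__ == "__main__":
    for body in ["%b", "t-", "n-m", "-p"]:
        print(body, describe_modification(body))
```

Returning both the canonical and the realized symbol keeps the original transcription recoverable from every modified label.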

A number of other phonetic features are also annotated using the "insertion" hyphen:

$-~ indicates presence of nasality when a nasal is no longer temporally delimitable
$-q indicates the presence of junctural creak or creaky voice. The symbol q is also used to 'replace' plosive symbols (e.g. $t-q, $p-q) to show glottalized correlates often found adjacent to nasals and laterals.
$-h indicates plosive release phase (and aspiration).
$-MA is used to signal the presence of the correlates of a label which has been marked as absent, e.g. $i:- in a realization of vielleicht, in which phonetic correlates of the first vowel are cotemporal with the labiodental friction and part of the lateral, but in which no temporally discrete vocalic portion is present.
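
The four feature annotations above can be kept in a small lookup table. This Python sketch is an illustration only: the names INSERTED_FEATURES and inserted_feature are invented here, and allowing prefixes other than $ in front of the hyphen is an assumption, since the examples above show only $.

```python
# Feature symbols that follow the "insertion" hyphen, with descriptions
# paraphrased from the list above.
INSERTED_FEATURES = {
    "~": "nasality",
    "q": "junctural creak or creaky voice",
    "h": "plosive release phase (and aspiration)",
    "MA": "correlates present for a label marked as absent",
}

# Assumption: any of the four label prefixes may precede the hyphen.
KNOWN_PREFIXES = {"##", "$#", "$", "#"}


def inserted_feature(label):
    """Return the feature description for labels such as '$-q', else None.

    Hypothetical helper for illustration; the corpus does not define it.
    """
    prefix, hyphen, symbol = label.partition("-")
    if hyphen and prefix in KNOWN_PREFIXES:
        return INSERTED_FEATURES.get(symbol)
    return None


if __name__ == "__main__":
    for label in ["$-~", "$-q", "$-h", "$-MA", "$t-q", "$-p"]:
        print(label, inserted_feature(label))
```

In this sketch a replacement such as $t-q and an inserted segment such as $-p both yield None, because neither is a pure feature annotation.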

The Kiel Corpus of Read Speech Vol. I
The Kiel Corpus of Spontaneous Speech Vol. I
The Kiel Corpus of Spontaneous Speech Vol. II
The Kiel Corpus of Spontaneous Speech Vol. III