28th Apr 2016 10:30 – 28th Apr 2016 15:00
Linguists are increasingly interested in generalization and variation in speech and language. Studying these requires looking at large datasets across dialects and languages and generalizing from what has been annotated, so researchers increasingly draw on large and varied sources of data. But addressing linguistic questions across a range of dialects, or even languages, is difficult. Corpora of spoken language typically have annotations in formats specific to each corpus, and methodology designed for one format does not always carry over to new formats or to larger datasets.

This workshop will present a new set of speech corpus tools, the Montreal Corpus Tools (MCT), currently being developed by researchers at McGill, Canada. MCT enables users to search and extract specific phonetic measures from one or more spoken language corpora simultaneously, without needing access to the raw data. Analyses can be performed in the time domain (e.g., durations of phones across speakers) or on the acoustics (e.g., speaker vowel spaces in various linguistic contexts).

The workshop will be highly hands-on. Participants may attend to find out more about the tools for analyses they intend to carry out in future, and/or they may bring their own data to analyze during the workshop itself. Participants will be encouraged to install Montreal Corpus Tools on their computers prior to the workshop and to bring a dataset that they would like to analyze, though sample data will be available for those wanting to work through a demonstration.

The workshop will be useful for those working on spoken language in digital corpora, for phonetic, sociolinguistic and/or other linguistic analyses. It could also be useful for those working on automated procedures for spoken language in the digital humanities.


Workshop structure:

  • 10.30-11.30:    welcome, coffee, setup (ensure software loaded correctly on laptops) 
  • 11.30-12.30:    presentation of the software with examples 
  • 12.30-1.30:      lunch 
  • 1.30-3.00:        opportunity to work through some examples/students’ own data 

The presenters will be available in the hour before the workshop to assist in setting up the software and to troubleshoot any issues that arise. The workshop will begin with an overview of the interface and basic functionality, followed by a couple of example analyses that highlight more advanced functionality. Following the presentation, there will be time for participants to use Speech Corpus Tools with their own datasets. The presenters will again be on hand to help and answer any questions that come up. Slides and walkthroughs of the demos from the workshop will be provided to participants as reference material.


Technical details

Montreal Corpus Tools requires the following basic format for corpora to be parsed and queried:

  • a collection of audio files of speech, with associated time-aligned orthographic transcriptions
  • the transcriptions must include words (e.g. orthographic units) and phones (sounds which make up words), with beginning and end times notated in some way, and aligned with the sound files.
  • the most commonly used format is the output of a forced aligner, e.g. FAVE, LaBB-CAT, MAUS. MCT can currently also handle the Buckeye Corpus and TIMIT. Support for other formats is planned.
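
To make the required format more concrete, the following is a minimal Python sketch of the kind of information a corpus needs to carry: an audio file, plus words and phones with begin and end times. It is not MCT's own API or file format. The names Interval, Utterance, phones_in_word and mean_phone_duration_by_speaker are hypothetical and chosen purely for illustration, as is the assumption that each recording has a single speaker.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Interval:
    """A labelled stretch of audio (hypothetical; times in seconds)."""
    label: str
    begin: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.begin


@dataclass
class Utterance:
    """One audio file with its time-aligned words and phones (hypothetical)."""
    speaker: str
    audio_path: str
    words: List[Interval]
    phones: List[Interval]


def phones_in_word(utt: Utterance, word: Interval) -> List[Interval]:
    """Return the phones whose time span lies inside the given word."""
    return [p for p in utt.phones
            if p.begin >= word.begin and p.end <= word.end]


def mean_phone_duration_by_speaker(utterances: List[Utterance],
                                   phone_label: str) -> Dict[str, float]:
    """Average duration of one phone label, grouped by speaker."""
    durations = defaultdict(list)
    for utt in utterances:
        for p in utt.phones:
            if p.label == phone_label:
                durations[utt.speaker].append(p.duration)
    return {speaker: mean(ds) for speaker, ds in durations.items()}
```

Given a list of Utterance objects built from forced-aligner output, calling mean_phone_duration_by_speaker(utterances, "ae") would give a simple time-domain measure of the sort described above (durations of a phone across speakers). Participants checking their own data before the workshop mainly need to confirm that both word-level and phone-level timings are present.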



Michael McAuliffe, Postdoctoral Research Fellow, McGill University; PhD, University of British Columbia, "Attention and salience in lexically-guided perceptual learning".

Morgan Sonderegger (McGill, Canada; Ph.D. U. Chicago Linguistics & Comp Sci, 2012)

Local Host: Jane Stuart-Smith, English Language/Glasgow University Laboratory of Phonetics (GULP).



