The speech group has developed a C/C++ system for large-vocabulary continuous speech recognition. The system is language-independent, but it is particularly useful for languages such as Finnish, Estonian, or Turkish, in which words consist of several morphemes. To test the system, contact Mikko Kurimo or try the web demo.

Tools for language modeling

TheanoLM is a neural network language modeling toolkit implemented using Theano, a Python library for evaluating mathematical expressions.

The VariKN language modeling toolkit can be used to create long-span n-gram language models. A direct link to the code is available here.

Morfessor can be used to decompose words into statistical morphemes. A direct link to the code is available here.

Maximum entropy language models: this SRILM extension can be used to train and apply maximum entropy (MaxEnt) language models with the SRILM toolkit.

Speech data

The Finnish Parliament corpus contains 2269 hours of transcribed Finnish Parliament sessions from 2008 to 2016, aligned using AaltoASR. The data is available from the Language Bank of Finland.

The DSP corpus contains spontaneous conversations recorded and transcribed by over 200 students of Aalto University. The data is available for research from the Language Bank of Finland.

Isolated Finnish words: a corpus of 59 speakers, each speaking about 260 words, collected at Helsinki University of Technology in 1999. A direct link to the data is available here.



Last updated: 29.10.2017.