VoxForge scripts for Kaldi

Some weeks ago there was a question on the Kaldi mailing list about the possibility of creating a Kaldi recipe using VoxForge's data. For those not familiar with it, VoxForge is a project whose goal is to collect speech data for various languages, which can then be used to train acoustic models for automatic speech recognition. The project was founded and is maintained, to the best of my knowledge, by Ken MacLean, and thrives thanks to the generous contributions of a great number of volunteers, who record sample utterances using the Java applet available on the website, or submit pre-recorded data. As far as I know this is the largest body of free (in both of the usual senses of the word) speech data readily available for acoustic model training. It seemed like a good idea to develop a Kaldi recipe that can be used by people who want to try the toolkit, but don't have access to the commercial corpora. My previous recipe, based on freely available features for a subset of the RM data, can also be used for that purpose, but it has somewhat limited functionality. This post describes the data preparation steps specific to VoxForge's data.


Decoding graph construction in Kaldi:
A visual walkthrough

I got bitten recently by an issue in a Kaldi decoding cascade I was working on. I was getting more than 40% WER, while the language and acoustic models I was using suggested the decoder should have been able to do much better than that. After a lot of head-scratching I found that the reason for the sub-optimal performance was that I had simply failed to add self-loops to the lexicon transducer (L). These are needed in order to pass through the special "#0" symbol, used in Kaldi's recipes to make the grammar transducer (G) determinizable. The effect of this omission was that the back-off arcs in the bigram G I was using were effectively cut off, leading to a highly non-stochastic LG cascade with a very spiky distribution over the allowed word sequences, and hence the high WER. After adding the self-loops the WER went down to 17%, without changing anything else.
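For reference, the missing step looks roughly like this in a standard Kaldi recipe (the data/lang paths and symbol-table names below follow the common recipe layout and may differ in your setup):

    # Look up the integer IDs of the "#0" disambiguation symbol on the
    # phone side and on the word side.
    phone_disambig_symbol=`grep '#0' data/lang/phones.txt | awk '{print $2}'`
    word_disambig_symbol=`grep '#0' data/lang/words.txt | awk '{print $2}'`

    # Add self-loops to L that consume "#0" on the input (phone) side and
    # emit "#0" on the output (word) side, so that G's back-off arcs
    # survive the LG composition.
    fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" \
        < data/lang/L.fst > data/lang/L_disambig.fst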
At that point I realized that I didn't have detailed knowledge of the decoding graph construction process, and decided to take a closer look. One problem here is that the actual large-vocabulary cascades are hard to observe directly. In my experience GraphViz requires exorbitant resources in terms of CPU time and memory, even for graphs orders of magnitude smaller than those used in LVCSR. Even if that weren't a problem, the fully optimized HCLG WFST is, in my humble opinion, beyond human ability to comprehend (at least beyond my abilities). We can easily build a small-scale version of the graph, however, and I think this is mostly OK, because using small-scale models to design and test various constructs is a widely used and proven method in engineering and science. This blog entry is not meant to be a replacement for the famous hbka.pdf (a.k.a. The Holy Book of WFSTs) or the excellent Kaldi decoding-graph creation recipe by Dan Povey; rather, this post can hopefully serve as a complement to them.
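To give an idea of what the small-scale approach buys us: a toy transducer can be rendered directly with OpenFst's fstdraw plus GraphViz (the symbol-table file names below are just an example):

    # Render a small lexicon transducer as a PDF. For LVCSR-sized graphs
    # this would take forever, but for toy graphs it is nearly instant.
    fstdraw --isymbols=phones.txt --osymbols=words.txt --portrait L.fst \
        | dot -Tpdf > L.pdf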


Poor man's Kaldi recipe

Kaldi is a relatively new addition to the open source speech recognition toolkits, officially released about a year ago. It's based on the WFST paradigm and is mostly oriented toward the research community. There are many things I like about Kaldi. First of all, I like the overall "openness" of the project: the source code is distributed under the very permissive Apache license, and even the $\LaTeX$ sources for the papers about Kaldi are in the repo. The design decision to use relatively small, single-purpose tools that can be pipelined Unix-style makes the code very clean and easy to follow. The project is very actively developed and supports experimental options like neural net based acoustic features and GPGPU acceleration, although I haven't had the chance to play with these yet. Last but not least, Kaldi comes with extensive documentation.
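As a small illustration of that Unix-style pipelining (the file names here are hypothetical), two Kaldi feature tools can be chained like this:

    # Compute MFCCs for the recordings listed in wav.scp, append delta and
    # delta-delta coefficients, and write the result to a Kaldi archive.
    compute-mfcc-feats scp:wav.scp ark:- \
        | add-deltas ark:- ark:feats.ark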
There is also a big and growing number of recipes for the most widely used speech databases. My problem, however, was that I am not an academic, but just a coder who likes to play from time to time with technologies that look interesting and promising. I wanted to get an idea of how Kaldi works, but I don't have access to those expensive datasets. That's why I started to search for a way to use a publicly available database. The best option I have found so far is a subset of the RM1 dataset, freely available from CMU. In fact the data only includes pre-extracted feature vectors, and not the original audio. The details of feature extraction in Sphinx are almost certainly different from those in Kaldi, but as long as the features used in training and decoding match, we should be OK. Kaldi already has a recipe for RM, so modifying it to use CMU's subset was a rather trivial exercise. This post describes how to use this modified recipe, in case other people want to try it.
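Assuming the CMU features have already been converted into Kaldi archives (the modified recipe produces the usual feats.scp/feats.ark pair; the exact paths below are illustrative), the feature-extraction stage is simply skipped, and you can sanity-check that the converted features are readable by Kaldi's tools:

    # Dump the beginning of the first feature matrix in text form, to
    # verify the converted archive is valid.
    copy-feats scp:data/train/feats.scp ark,t:- | head -n 20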
Update: As of Feb 27, 2012, a slightly modified version of this recipe is part of the official Kaldi distribution.