<b>VoxForge scripts for Kaldi</b><br />
Some weeks ago there was a question on Kaldi's mailing list about the possibility of creating a Kaldi recipe using <a href="http://www.voxforge.org">VoxForge's</a> data. For those not familiar with it, VoxForge is a project whose goal is to collect speech data for various languages, which can then be used to train acoustic models for automatic speech recognition. The project was founded and is maintained, to the best of my knowledge, by Ken MacLean, and thrives thanks to the generous contributions of a great number of volunteers, who record sample utterances using the Java applet available on the website or submit pre-recorded data. As far as I know this is the largest body of free (in both of the usual senses of the word) speech data readily available for acoustic model training. It seemed like a good idea to develop a Kaldi recipe that can be used by people who want to try the toolkit but don't have access to the commercial corpora. My <a href="http://vpanayotov.blogspot.com/2012/02/poor-mans-kaldi-recipe-setup.html">previous recipe</a>, based on freely available features for a subset of the RM data, can also be used for that purpose, but it has somewhat limited functionality. This post describes the data preparation steps specific to VoxForge's data.<br />
<b>Prerequisites</b><br />
As usual, the following instructions assume you are using a Linux operating system. The scripts try to install some of the external tools they need, but also assume that certain tools and libraries are already present. These are things that should either already be available on your system or can easily be installed through its package manager. I will try to enumerate the main dependencies when describing the respective parts of the recipe that make use of them.<br />
If you don't have Kaldi installed you need to download and install it by following the steps in the <a href="http://kaldi.sourceforge.net/install.html">documentation</a>. If you already have it installed, make sure you have a recent version, which includes this recipe. It can be found in <i>egs/voxforge/s5</i> under Kaldi's installation directory. From this point on all paths in this blog entry will be relative to this directory, unless otherwise noted.<br />
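For reference, getting the code and moving to the recipe directory might look roughly like this (a sketch only; treat the exact checkout directory and revision as illustrative and follow the official installation instructions for building):<br />
<pre class="brush: bash; gutter: false"># rough sketch -- adjust the paths/revision to match your setup
svn co https://kaldi.svn.sourceforge.net/svnroot/kaldi/trunk kaldi-trunk
cd kaldi-trunk/egs/voxforge/s5
</pre><br />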
<br />
<b>Downloading the data</b><br />
Before doing anything else you should set the <i>DATA_ROOT</i> variable in <i>./path.sh</i> to point to a directory residing on a drive which has enough free space to store several tens of gigabytes of data (about 25GB should be enough with the default recipe configuration at the time of writing). <br />
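For example, the relevant line in <i>path.sh</i> could end up looking something like this (the path itself is just an illustration; whether the variable needs to be exported depends on your version of the scripts):<br />
<pre class="brush: bash; gutter: false"># in path.sh: point DATA_ROOT to a drive with enough free space (illustrative path)
export DATA_ROOT="/media/secondary/voxforge"
</pre><br />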
You can use the <i>./getdata.sh</i> script to download VoxForge's data. Like many other parts of the recipe it hasn't been extensively tested, but hopefully it will work. It assumes you have <i>wget</i> installed on your system, downloads the 16kHz versions of the data archives to $\$${DATA_ROOT}/tgz and extracts them to $\$${DATA_ROOT}/extracted. If you want to free several gigabytes by deleting the archives after the extraction finishes, you can add the "--deltgz true" parameter:<br />
<pre class="brush: bash; gutter: false">./getdata.sh --deltgz true
</pre><br />
<b>Configuration and starting the recipe</b><br />
It's usually recommended to run Kaldi recipes by hand, by copy/pasting the commands from the respective <i>run.sh</i> scripts. If you do this, be sure to source <i>path.sh</i> first in order to set up the paths and the <i>LC_ALL=C</i> variable. If this environment variable is not set you may run into sometimes hard-to-diagnose problems, due to the different sort orders used in other locales. Or, if you are like me and prefer to just start <i>/bin/bash run.sh</i> to execute all steps automatically, you can do that too, but you may want to modify several variables first. If you happen to have a cluster of machines with Sun Grid Engine installed you may want to modify the train and decode commands in <i>cmd.sh</i> to match your configuration. A related parameter in <i>run.sh</i> is <i>njobs</i>, which defines the maximum number of parallel processes to be executed. The current default is 2, which is perhaps too low even for a relatively new home desktop machine.<br />
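If you go the copy/paste route, the preamble before running any commands could look roughly like this (a sketch; check your version of the scripts for the exact variable names):<br />
<pre class="brush: bash; gutter: false"># sketch: prepare the shell before copy/pasting commands from run.sh
. ./path.sh        # set up the Kaldi-related paths
export LC_ALL=C    # avoid hard-to-diagnose problems caused by locale-specific sort order
njobs=4            # bump up the number of parallel jobs on a multi-core machine
</pre><br />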
There are several parameters toward the start of the <i>run.sh</i> script that I will explain when discussing the parts of the recipe they affect.<br />
The recipe is structured according to the currently recommended ("s5") Kaldi script style.<br />
The main sub-directories used are:<br />
<ul><li><b>local/</b> - hosts scripts that are specific to each recipe. This is mostly data normalization stuff, which takes the information from the particular speech database and transforms it into the files/data structures that the subsequent steps expect. Most of the work of using new data with Kaldi, including this recipe, involves writing and modifying scripts that fall into this category.</li>
<li><b>conf/</b> - small configuration files, specifying things like e.g. feature extraction parameters and decoding beams.</li>
<li><b>steps/</b> - scripts implementing various acoustic model training methods, mostly through calling Kaldi's binary tools and the scripts in "utils/"</li>
<li><b>utils/</b> - scripts performing small, low-level tasks, like adding disambiguation symbols to lexicons, converting between symbolic and integer representations of words etc.</li>
<li><b>data/</b> - this is where the various metadata produced at recipe run time is stored. Most of it is the result of the work of the scripts in local/</li>
<li><b>exp/</b> - The most important output of the recipe goes there. This includes acoustic models and recognition results.<br />
</li>
<li><b>tools/</b> - a non-standard directory, specific to this recipe, that I am using to put various external tools into (more about these later).</li>
</ul><i>Note:</i> Keep in mind that <i>steps/</i> and <i>utils/</i> are shared between all "s5" scripts and are just symlinks pointing to <i>egs/wsj/s5</i>. That means you should be careful if you make changes there, as this may affect the other "s5" recipes.<br />
<br />
<b>Data subset selection</b><br />
The recipe has the option to train and test on a subset of VoxForge's data. For (almost) every submission directory there is an <i>etc/README</i> file with meta-information like the pronunciation dialect and the gender of the speaker. For English there are many pronunciation dialects, but from an (admittedly very limited) sample of the data I've got the impression that some of the recordings with the dialect set to, say, "European English" or "Indian English" sound distinctively non-native. So you have the option to select a subset of the dialects if you like. In my limited experience it <b><i>seems</i></b> that the speech tagged with "American English" has a relatively lower percentage of non-native speech intermixed. The default for the recipe is currently set to produce an acoustic model with good coverage over (presumably) native English speakers:<br />
<pre class="brush: bash; gutter: false">dialects="((American)|(British)|(Australia)|(Zealand))"
</pre>The <i>local/voxforge_select.sh</i> script creates symbolic links to the matching submission subdirectories in $\$${DATA_ROOT}/selected and the subsequent steps work just on this data.<br />
If you want to select all of VoxForge's English speech you should perhaps set this to:<br />
<pre class="brush: bash; gutter: false">dialects="English"
</pre>I don't have experience with the non-English audio at VoxForge, so I can't comment on how the scripts should be modified to work with it.<br />
<br />
<b>Mapping anonymous speakers to unique IDs</b><br />
VoxForge allows for anonymous speaker registration and speech submission. This is not ideal from the viewpoint of Kaldi's scripts, because they perform various <a href="http://kaldi.sourceforge.net/transform.html">speaker-dependent transforms</a>. The "anonymous" speech is recorded under different environment/channel conditions (microphones used, background noise etc.), by speakers that may be both male and female and have different accents. So instead of lumping all this data together I decided to give each speaker a unique identity. The script <i>local/voxforge_fix_data.sh</i> renames all "anonymous" speakers to "anonDDDD" (D is a decimal digit), based on the submission date. This is not entirely precise of course, because it may give two different IDs to the same speaker who made recordings on two different dates, and also give the same ID to two or more different "anonymous" speakers who happened to submit speech on the same date. <br />
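Conceptually the renaming boils down to something like the sketch below. This is <i>not</i> the actual script - it just illustrates the idea, and it assumes the submission directories follow a "speaker-date-suffix" naming scheme:<br />
<pre class="brush: bash; gutter: false"># simplified sketch of the idea behind local/voxforge_fix_data.sh:
# assign one "anonDDDD" identity per distinct submission date
i=0
for date in $(ls ${DATA_ROOT}/selected | grep '^anonymous-' | cut -d- -f2 | sort -u); do
  printf -v spk 'anon%04d' $i
  echo "anonymous submissions from $date -> speaker $spk"
  i=$((i+1))
done
</pre><br />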
<br />
<b>Train/test set splitting and normalizing the metadata</b><br />
The next step is to split the data into train and test sets and to produce the relevant transcription and speaker-dependent information. These steps are performed by a rather convoluted and not particularly efficient script called <i>local/voxforge_data_prep.sh</i>. The number of speakers to be assigned to the test set is defined in <i>run.sh</i>. The actual test set speakers are chosen at random. This is probably not an ideal arrangement, because the speakers will be different each time <i>voxforge_data_prep.sh</i> is started, and the WER for the test set will probably be slightly different too. I think this is not very important in this case, however, because VoxForge still doesn't have predefined high-quality train and test sets, nor a test-time language model. There is something <a href="http://www.dev.voxforge.org/projects/SpeechCorpus/browser/Trunk/Prompts">here</a>, but there are speakers for which there are utterances in both sets, and I wanted the sets to be disjoint.<br />
The script assumes that you have the <a href="http://flac.sourceforge.net/">FLAC codec</a> installed on your machine, because some of the files are encoded in this lossless compression format. One solution would be to convert the files beforehand into <i>WAV</i> format, but the recipe instead uses a nifty feature of Kaldi called <a href="http://kaldi.sourceforge.net/io.html">extended filenames</a> to convert the audio on demand. For example in <i>data/local/train_wav.scp</i>, which contains the list of files to be used for feature extraction, there are lines like:<br />
<pre>benkay_20090111-ar_ar-09 flac -c -d --silent
/media/secondary/voxforge/selected/benkay-20090111-ar/flac/ar-09.flac |
</pre>This means, in effect, that when <i>ar-09.flac</i> needs to be converted into MFCC features, <i>flac</i> is invoked first to decode the file, and the decoded .wav-format stream is passed to <i>compute-mfcc-feats</i> through a Unix pipe.<br />
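The feature extraction step then simply reads this scp file; the invocation is roughly of the following form (see <i>steps/make_mfcc.sh</i> for the real thing - the config and output paths here are placeholders):<br />
<pre class="brush: bash; gutter: false"># rough form of the MFCC extraction; the piped "extended filename" entries
# in the scp file are decoded on the fly
compute-mfcc-feats --config=conf/mfcc.conf \
  scp:data/local/train_wav.scp ark:/tmp/train_mfcc.ark
</pre><br />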
It seems there are missing or improperly formatted transcripts for some speakers; these are ignored by the data preparation script. You can see the corresponding errors in the <i>exp/data_prep/make_trans.log</i> file.<br />
<br />
<b>Building the language model</b><br />
I decided to use the <a href="http://code.google.com/p/mitlm/">MITLM</a> toolkit to estimate the test-time language model. <a href="http://sourceforge.net/projects/irstlm/">IRSTLM</a> is installed by default under $\$$KALDI_ROOT/tools by Kaldi's installation scripts, but I had mixed experience with this toolkit before and decided to try a different tool this time. A script called <i>local/voxforge_prepare_lm.sh</i> installs <i>MITLM</i> in <i>tools/mitlm-svn</i> and then trains a language model on the train set. The installation of <i>MITLM</i> assumes you have svn, GNU autotools, C++ and Fortran compilers, as well as the Boost C++ libraries installed. The order of the LM is given by a variable named <i>lm_order</i> in <i>run.sh</i>.<br />
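For reference, estimating an n-gram LM with MITLM's <i>estimate-ngram</i> tool is essentially a one-liner along these lines (the exact options and file names used by <i>local/voxforge_prepare_lm.sh</i> may differ; the file names below are placeholders):<br />
<pre class="brush: bash; gutter: false"># rough equivalent of the LM estimation step
estimate-ngram -order ${lm_order} -text lm_train.txt -write-lm lm.arpa
</pre><br />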
<br />
<b>Preparing the dictionary</b><br />
The script used for this task is called <i>local/voxforge_prepare_dict.sh</i>. It first downloads the CMU pronunciation dictionary and prepares a list of the words that are found in the train set but not in <i>cmudict</i>. Pronunciations for these words are automatically generated using <a href="http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html">Sequitur G2P</a>, which is installed under <i>tools/g2p</i>. The installation assumes you have NumPy, SWIG and a C++ compiler on your system. Because the training of Sequitur models takes a lot of time, this script downloads and uses a pre-built model trained on cmudict instead. <br />
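Applying the pre-built Sequitur model to the list of missing words then looks roughly like this (the model and file names are placeholders):<br />
<pre class="brush: bash; gutter: false"># sketch: generate pronunciations for the words missing from cmudict
g2p.py --model g2p_model --apply oov_words.txt > oov_lexicon.txt
</pre><br />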
<br />
<b>Decoding</b><br />
These were the most important steps specific to this recipe. Most of the rest is just borrowed from the WSJ and RM scripts.<br />
The decoding results for some of the steps are as follows:<br />
<pre class="brush: plain; gutter: false">exp/mono/decode/wer_10
%WER 64.96 [ 5718 / 8803, 320 ins, 1615 del, 3783 sub ]
%SER 96.29 [ 935 / 971 ]
exp/tri2a/decode/wer_13
%WER 44.63 [ 3929 / 8803, 412 ins, 824 del, 2693 sub ]
%SER 87.95 [ 854 / 971 ]
exp/tri2b_mmi/decode_it4/wer_12
%WER 38.94 [ 3428 / 8803, 401 ins, 753 del, 2274 sub ]
%SER 81.77 [ 794 / 971 ]
</pre><br />
These are obtained using monophone (mono), maximum-likelihood trained triphone (tri2a) and discriminatively trained triphone (tri2b_mmi) acoustic models. Now, I know the results are not very inspiring and may even look somewhat disheartening, but keep in mind they were produced using a <i>very</i> poor language model - a bigram estimated just on the quite small corpus of training set transcripts. Another factor contributing to the relatively poor results is that typically about 2-3% of the words (depending on the random test set selection) are found only in the test set, and are thus considered out-of-vocabulary. Not only do they have no chance of being recognized themselves, but they also reduce the chance for the words surrounding them to be correctly decoded.<br />
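If you are curious about the OOV rate of your particular random split, a quick-and-dirty check along the following lines should give a ballpark figure (the <i>text</i> files are the standard Kaldi transcription files; the exact locations are indicative only):<br />
<pre class="brush: bash; gutter: false"># rough OOV check: count test-set word tokens missing from the training vocabulary
cut -d' ' -f2- data/test/text | tr ' ' '\n' | sort > test_tokens.txt
cut -d' ' -f2- data/train/text | tr ' ' '\n' | sort -u > train_vocab.txt
comm -23 test_tokens.txt train_vocab.txt | wc -l   # number of OOV tokens
wc -l < test_tokens.txt                            # total test tokens, for the percentage
</pre><br />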
One thing that I haven't had the time to try yet is to change the number of states and Gaussians in the acoustic model. As I've mentioned, the training and decoding commands in <i>run.sh</i> were copied from the RM recipe. My guess is that if the number of states and PDFs is increased somewhat, this will lower the WER by at least 2-3%. This may be done by tweaking the relevant lines in <i>run.sh</i>, e.g.<br />
<pre class="brush: bash; gutter: false"># train tri2a [delta+delta-deltas]
steps/train_deltas.sh --cmd "$train_cmd" 1800 9000 \
data/train data/lang exp/tri1_ali exp/tri2a || exit 1;
</pre>In the above, 1800 is the number of PDFs and 9000 is the total number of Gaussians in the system.<br />
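So one of the first things to try would be simply bumping these numbers up, e.g. something like this (untested, just to illustrate the kind of change meant here):<br />
<pre class="brush: bash; gutter: false"># untested illustration: a somewhat larger tri2a model
steps/train_deltas.sh --cmd "$train_cmd" 2500 15000 \
 data/train data/lang exp/tri1_ali exp/tri2a || exit 1;
</pre><br />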
There is also something else worth mentioning here. As you may have noticed already, I am using a relatively small number of speakers (40) for testing. This is mostly because the set of prompts used by the audio submission applet is limited, so there are many utterances duplicated across speakers. Because the test utterances are excluded when training the LM, we don't want too many test set speakers, since that would also mean less text for training the language model and thus even worse performance.<br />
Just to check everything is OK with the acoustic model, I ran some tests using language models trained on both the train and the test set (<b>WARNING:</b> this is considered to be "cheating" and a <i>very</i> bad practice - please don't do this at home!). The results were about 3% WER with a trigram and about 17% with a bigram model, using a discriminatively trained AM.<br />
<br />
<b>Decoding graph construction in Kaldi: A visual walkthrough</b><br />
I got bitten recently by an issue in a Kaldi decoding cascade I was working on. I was getting more than 40% WER, while the language and acoustic models I was using suggested the decoder should have been able to do much better than that. After a lot of head-scratching I found that the reason for the sub-optimal performance was that I had simply failed to add self-loops to the lexicon transducer (<b>L</b>). These are needed in order to pass through the special "#0" symbol used in Kaldi's recipes to make the grammar transducer (<b>G</b>) determinizable. The effect of this omission was that the back-off arcs in the bigram <strong>G</strong> I was using were effectively cut off, leading to a highly non-stochastic <b>LG</b> cascade with a very spiky distribution over the allowed word sequences, and hence the higher WER. After adding the self-loops the WER went down to 17% without changing anything else.<br />
At that point I realized that I didn't have detailed knowledge of the decoding graph construction process and decided to take a closer look. One problem here is that the actual large-vocabulary cascades are hard to observe directly. In my experience GraphViz requires exorbitant resources in terms of CPU time and memory even for graphs orders of magnitude smaller than those used in LVCSR. Even if that weren't a problem, the full optimized <b>HCLG</b> WFST is in my humble opinion beyond human abilities to comprehend (at least beyond my abilities). We can easily build a small-scale version of the graph, however, and I think this is mostly OK, because using small-scale models to design and test various constructs is a widely used and proven method in engineering and science. This blog entry is not meant to be a replacement for the famous <a href="http://www.cs.nyu.edu/~mohri/pub/hbka.pdf">hbka.pdf</a> (a.k.a. The Holy Book of WFSTs) or the excellent <a href="http://kaldi.sourceforge.net/graph_recipe_test.html">Kaldi decoding-graph creation recipe</a> by Dan Povey, to which this post can hopefully serve as a complement.<br />
<strong>The setup</strong><br />
<br />
For our purposes we will be using very tiny grammars and a lexicon. I decided to build the cascades using both a unigram and a bigram <b>G</b>, in order to be able to observe the evolution of the graphs in slightly different settings. The "corpus" used to create the language models is given below (click to expand):<br />
<pre class="brush: plain; gutter: false; collapse:false">K. Cay
K. ache
Cay
</pre><br />
The corresponding unigram model is:<br />
<pre class="brush: plain; gutter: false; collapse:false">\data\
ngram 1=5
\1-grams:
-0.4259687 </s>
-99 <s>
-0.60206 Cay
-0.60206 K.
-0.9030899 ache
\end\
</pre><br />
Bigram model:<br />
<pre class="brush: plain; gutter: false; collapse:false">\data\
ngram 1=5
ngram 2=6
\1-grams:
-0.4259687 </s>
-99 <s> -0.30103
-0.60206 Cay -0.2730013
-0.60206 K. -0.2730013
-0.9030899 ache -0.09691
\2-grams:
-0.60206 <s> Cay
-0.30103 <s> K.
-0.1760913 Cay </s>
-0.4771213 K. Cay
-0.4771213 K. ache
-0.30103 ache </s>
\end\
</pre><br />
The lexicon contains just three words, two of which are homophones:<br />
<pre class="brush: plain; gutter: false; collapse:false">ache ey k
Cay k ey
K. k ey
</pre>The pronunciations use only two phonemes in total, because I wanted the graphs to be as simple as possible, and the context-dependency transducer <b>C</b>, for example, becomes somewhat involved if more phones are used.<br />
The script used for generating the graphs along with pre-rendered .pdf images can be found <a href="http://www.danielpovey.com/files/blog/files/kaldi-graph-demo.tar.bz2">here</a>. In order to run "mkgraphs.sh" you need to set a "KALDI_ROOT" environment variable to point to the root directory of a Kaldi installation.<br />
<br />
<strong>Grammar transducer (G)</strong><br />
<br />
As in the aforementioned decoding graph recipe from Kaldi's documentation, the steps my demo script performs in order to produce a grammar FST are summarized by the following command (the OOV removal step is omitted, since there are no out-of-vocabulary words in this demo):<br />
<pre class="brush: bash">cat lm.arpa | \
  grep -v '<s> <s>' | \
  grep -v '</s> <s>' | \
  grep -v '</s> </s>' | \
  arpa2fst - |                                     # (step 1)
  fstprint | \
  eps2disambig.pl |                                # (step 2)
  s2eps.pl |                                       # (step 3)
  fstcompile --isymbols=words.txt \
    --osymbols=words.txt \
    --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon > G.fst                             # (step 4)
</pre><br />
Let's examine each step of this command and its effect on a bigram LM (the unigram case is analogous, but simpler). First, some "illegal" combinations of the special language model start/end tokens are filtered out, because they can make the <b>G</b> FST non-determinizable. The result is then passed to arpa2fst, which produces a (binary) G graph (the first slide below).<br />
<a class="color-inline-group1" href="#bglm-raw" title="Bigram LM graph as produced by arpa2fst (step 1)"><br />
<center><img height="20%" src="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/bigram%20lm/bglm-thumb.png" width="80%" /><br />
</center></a> <br />
<a class="color-inline-group1" href="#bglm-eps2disambig" title="Bigram LM after eps2disambig.pl (step 2)"></a><br />
<a class="color-inline-group1" href="#bglm-s2eps" title="Bigram LM after s2eps.pl (step 3)"></a><br />
<a class="color-inline-group1" href="#bglm-final" title="The final bigram LM after epsilon removal (step 4)"></a><br />
One thing to keep in mind here is that the weights of the FSTs are calculated by negating the natural logarithm of the probabilities, while the quantities given in the <a href="http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html">ARPA file format</a> are base-10 logarithms of probabilities. The WFST produced by arpa2fst is really straightforward, but let's look a little closer. There is a start node representing the start of an utterance (node 0), separate nodes for each of our "ache", "Cay", "K." vocabulary words (nodes 6, 4 and 5 respectively), a back-off node (1) and a final node (2) for the end of an utterance. Let's, for the sake of the example, trace a path through the graph corresponding to a bigram - say the "<s> ache" bigram. Because there is no such bigram in our toy corpus we are forced to take the route through the back-off node, i.e. the arcs 0-3, 3-1, 1-6. The weight of the first arc is 0 (i.e. 1 when converted to probability), the weight of the 3-1 arc is 0.69315, which corresponds to the back-off probability for "<s>" ($-\ln(10^{-0.30103})$), and the weight 2.0794 of the 1-6 arc corresponds to the unigram probability of "ache" ($-\ln(10^{-0.9030899})$).<br />
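In general, an ARPA log-probability $l$ (a base-10 logarithm) becomes the FST arc weight $w = -\ln(10^{l}) = -l \cdot \ln 10 \approx -2.30259\, l$. For the "<s>" back-off weight, for example, this gives $0.30103 \times 2.30259 \approx 0.69315$, which is exactly the weight of the 3-1 arc mentioned above.<br />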
In step 2 of the above slides the <b>eps2disambig.pl</b> script converts all epsilon input labels (these are used on the back-off arcs) to the special symbol "#0", needed to keep the graph determinizable. This symbol should be present in the word symbol table too.<br />
Step 3 replaces the special language model "<s>" and "</s>" symbols with epsilons.<br />
Step 4 is epsilon removal which simplifies the graph.<br />
<br />
The word symbol table can be seen below:<br />
<pre class="brush: plain; collapse: true"><eps> 0
</s> 1
<s> 2
Cay 3
K. 4
ache 5
#0 6
</pre><br />
Below the analogous graphs for the unigram LM are given:<br />
<a class="color-inline-group2" href="#unilm-raw" title="Unigram LM graph as produced by arpa2fst (step 1)"><br />
<center><img height="20%" src="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/unigram%20lm/G_uni-thumb.png" width="80%" /><br />
</center></a> <br />
<a class="color-inline-group2" href="#unilm-eps2disambig" title="Unigram LM after eps2disambig.pl (step 2)"></a><br />
<a class="color-inline-group2" href="#unilm-s2eps" title="Unigram LM after s2eps.pl (step 3)"></a><br />
<a class="color-inline-group2" href="#unilm-final" title="The final unigram LM after epsilon removal (step 4)"></a><br />
In what follows I will use cascades produced by composing with the unigram <b>G</b> only, because the graphs are smaller and the explanations of the following steps are not affected by the order of the grammar FST. If you want to have a look at the bigram versions you can use the .pdf files from the <a href="http://www.danielpovey.com/files/blog/files/kaldi-graph-demo.tar.bz2">archive</a>.<br />
<br />
<strong>Lexicon FST (L)</strong><br />
<br />
The lexicon FST preparation used in Kaldi's recipes is fairly standard. For each homophone (in our case "Cay" and "K.") a distinct auxiliary symbol is added by the script <b>add_lex_disambig.pl</b>. So the lexicon now looks like:<br />
<pre class="brush: plain; gutter: false;">ache ey k
Cay k ey #1
K. k ey #2
</pre><br />
The <b>L</b> FST is produced by the <b>make_lexicon_fst.pl</b> script, which takes four parameters: the lexicon file with disambiguation symbols, the probability of the optional silence, the symbol used to represent the silence phone, and the disambiguation symbol for silence. <br />
<pre class="brush: bash; gutter: true;">make_lexicon_fst.pl lexicon_disambig.txt 0.5 sil '#'$ndisambig | \
fstcompile --isymbols=lexgraphs/phones_disambig.txt \
--osymbols=lmgraphs/words.txt \
--keep_isymbols=false --keep_osymbols=false |\
fstaddselfloops $phone_disambig_symbol $word_disambig_symbol | \
fstarcsort --sort_type=olabel \
> lexgraphs/L_disambig.fst
</pre>The resulting graphs can be seen on the slides below:<br />
<a class="color-inline-group3" href="#lex-wo-loop" title="Lexicon FST without #0 loop"><br />
<center><img height="20%" src="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/lexicon/lexicon-thumb.png" width="80%" /><br />
</center></a> <br />
<a class="color-inline-group3" href="#lex" title="Lexicon with #0 loop added"></a><br />
The graph on the first slide is the one created by the <b>make_lexicon_fst.pl</b> script. It adds optional silence (followed by the special silence disambiguation symbol #3) at the beginning of the sentence and also after each lexicon word. On the second slide the special #0 self-loop can be seen, which is needed to pass the special symbol/word from <b>G</b> through when it's composed with <b>L</b>.<br />
<br />
The phone symbol table with disambiguation symbols is:<br />
<pre class="brush: plain; collapse: true"><eps> 0
ey 15
k 22
sil 42
#0 43
#1 44
#2 45
#3 46
</pre>There are gaps in the IDs, because I am using a real acoustic model when building the <b>H</b> transducer (see below), so I wanted the phone IDs to match those from this acoustic model.<br />
<br />
<strong>L*G composition</strong><br />
<br />
The commands implementing the composition are:<br />
<pre class="brush: bash; gutter: true;">fsttablecompose L_disambig.fst G.fst |\
fstdeterminizestar --use-log=true | \
fstminimizeencoded > LG.fst
</pre>The commands used implement <a href="http://kaldi.sourceforge.net/fst_algo.html">slightly different versions</a> of the standard FST algorithms.<br />
<a class="color-inline-group4" href="#lg" title="min(det(L*G)) cascade (unigram G)"><br />
<center><img height="20%" src="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/cascade/LG-thumb.png" width="80%" /><br />
</center></a><br />
<br />
<strong>Context-dependency transducer (C)</strong><br />
<br />
The <b>C</b> graph is normally not created explicitly in Kaldi. Instead, the <b>fstcomposecontext</b> tool is used to create the graph on demand when composing with <b>LG</b>. Here, however, we will show an explicitly created <b>C</b> graph for didactic purposes.<br />
<pre class="brush: bash; gutter: true;">fstmakecontextfst \
--read-disambig-syms=disambig_phones.list \
--write-disambig-syms=disambig_ilabels.list \
$phones $subseq_sym ilabels |\
fstarcsort --sort_type=olabel > C.fst
</pre><br />
The context dependency related graphs are given below:<br />
<a class="color-inline-square1" href="#c-fst" title="Context dependency (C) FST"><br />
<center><img height="20%" src="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/ctxdep/c-thumb.png" width="80%" /><br />
</center></a> <br />
<a class="color-inline-square1" href="#lg-subseq-loop" title="LG cascade with subsequential symbol loop added"></a><br />
<a class="color-inline-square1" href="#ilabel-map" title="ilabel transducer"></a><br />
The first slide shows the <b>C</b> transducer, created by the command above. Each state has self-loops for all auxiliary symbols introduced in <b>L</b>. The input symbols of the <b>C</b> graph are triphone IDs, which are specified using a Kaldi-specific data structure called <b>ilabel_info</b> (frankly, clabel_info would have been a more intuitive name for me, but perhaps there is a reason it's called that way). Basically this is an array of arrays, where the indices of the first dimension are the triphone IDs and the individual entries of the nested arrays are the IDs of the context-independent phones which constitute the context window for the particular triphone. For example, if there is a triphone "a/b/c" (i.e. central phone "b" with left context "a" and right context "c") with ID "10", the eleventh entry in <b>ilabel_info</b> will be an array containing the context-independent IDs of the phones "a", "b" and "c". As explained in Kaldi's documentation, for context-independent phones like "sil" there is a single ID in the respective <b>ilabel_info</b> entry. Also, for convenience, the IDs of the special "#X" symbols are negated, epsilon is represented by an empty array, and "#-1" (see below) by an array containing a single entry with value 0. There are a couple of special symbols used in the graph. The "#-1" symbol is used as the input symbol on the outgoing arcs of the start (0) node. The standard recipe described in the Mohri et al. paper mentioned above uses epsilon in this place, but <a href="http://kaldi.sourceforge.net/graph_recipe_test.html">Kaldi's docs</a> say this would have led to non-determinizable graphs if there are words with empty pronunciations in the lexicon. The second special symbol is "$\$$", which is used as the output symbol on the inbound arcs of the final states. It is beneficial if <b>C</b> is made output-deterministic, because this leads to more efficient composition with <b>LG</b>. In order to achieve this output determinism, the output (context-independent) symbols appear ahead (by N-P-1 positions in general) with respect to the input (context-dependent) symbols. So we "run out" of output symbols when we still have N-P-1 (or 1 in the most common triphone case) input symbols to flush. This is exactly the purpose of "$\$$" - in effect it is a sort of placeholder, used as the output symbol on the arcs entering the final (end-of-utterance) state of <b>C</b>. We could use epsilon instead of "$\$$", but this would have led to inefficient composition, because dead-end paths would have been explored for each triphone, not only at the end of an utterance.<br />
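To make the <b>ilabel_info</b> structure a bit more concrete, for our toy phone set it might look roughly like this (purely illustrative - the actual entries, their order and the triphone IDs depend on the phone table and on how the graph was built):<br />
<pre class="brush: plain; gutter: false">ilabel_info[0] = [ ]           # epsilon
ilabel_info[1] = [ 0 ]         # the special "#-1" symbol
ilabel_info[2] = [ -43 ]       # "#0" (disambiguation symbol IDs are stored negated)
ilabel_info[3] = [ 42 ]        # "sil" (context-independent phone, single ID)
ilabel_info[4] = [ 15 22 15 ]  # the triphone "ey/k/ey" (phone IDs: ey=15, k=22)
ilabel_info[5] = [ 0 15 22 ]   # "<eps>/ey/k", i.e. "ey" with no left context
</pre><br />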
<br />
The additional "$\$$" symbols, however, should be accounted for ("consumed") in some way when <b>C</b> is composed with <b>LG</b>. The Kaldi tool <b>fstaddsubsequentialloop</b> links a special new final state with a "$\$:\epsilon$" self-loop to each final state in <b>LG</b>, as you can see on the second slide. The self-loops are added for convenience, because in general there are N-P-1 "$\$$"-s to be consumed.<br />
<br />
The third slide shows a transducer that can be composed with <b>CLG</b> in order to optimize the cascade. It does this by mapping all triphones corresponding to the same HMM model to the triphone ID of a randomly chosen member of that set. For example, it is not surprising that all instances of "sil" in all possible triphone contexts are mapped to the same triphone ID, because "sil" is context-independent and is represented by the same HMM model. Another such example is "<eps>/ey/<eps>:ey/ey/<eps>", i.e. "ey" with left context "ey" and no right context (end of utterance) is mapped to "ey" with both left and right context equal to "<eps>" (effectively meaning an utterance consisting of a single "ey" phone). To see why this is so you can use Kaldi's <b>draw-tree</b> tool, which will show you that the PDFs in the HMMs for "<eps>/ey/<eps>" and "ey/ey/<eps>" are the same.<br />
<br />
<strong>CLG cascade</strong><br />
<br />
The first slide below shows the unoptimized version of the (unigram) <b>CLG</b>, and the second shows the graph with the physical-to-logical triphone FST (see above) composed from the left.<br />
<a class="color-inline-larger1" href="#clg" title="CLG FST"><br />
<center><img height="20%" src="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/cascade/CLG-thumb.png" width="80%" /><br />
</center></a> <br />
<a class="color-inline-larger1" href="#clg-optimized" title="Reduced CLG (with physical-to-logical triphone mapping applied)"></a><br />
As a result of the physical-to-logical mapping, the states in the unigram <b>CLG</b> cascade were reduced from 24 to 17 and the arcs from 38 to 26.<br />
<br />
<strong>H transducer</strong><br />
<br />
The <b>H</b> graph maps from transition IDs to context-dependent phones. A transition ID uniquely identifies a phone, a PDF, an HMM state and an outgoing arc within a context-dependent phone model. <b>H</b> transducers in fact look very similar to <b>L</b> transducers.<br />
<a class="color-inline-square2" href="#Ha-fst" title="Ha FST"><br />
<center><img height="20%" src="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/hmm/Ha-thumb.png" width="80%" /><br />
</center></a><br />
There is a start/end state with an arc going into an HMM state chain - one chain for each context-dependent <b>physical</b> phone (note that there is no chain with "ey/ey/<eps>" as an output symbol, for example). Also there are self-loops for each auxiliary symbol used at the <b>C</b> level (e.g. "#-1:#-1").<br />
The input labels for this WFST are created with a <a href="https://bitbucket.org/vdp/kaldi-rm1-mod/src/4fb1791d1210/cxx/fstmaketidsyms">simple tool</a> I wrote previously. They encode the information contained within a transition ID as four underscore-separated fields - the phone, the HMM state index within the triphone, the PDF ID and the index of the outgoing arc from the node in the second field (in this order). For example "k_1_739_1" in this notation means that this is the transition ID associated with state "1" (i.e. the second state) of the monophone "k", having a PDF ID of 739 (this is in general different for the different context-dependent versions of "k"), and the outgoing arc from this HMM state which has ID "1". <br />
Note that the HMM self-loops are not included in this transducer (thus the graph is actually called <b>Ha</b>) and are only added after the whole <b>HCLG</b> cascade is composed.<br />
<br />
<strong>HCLG cascade</strong><br />
<br />
The command used for creating the full cascade (without the HMM-level self-loops) is:<br />
<pre class="brush: bash; gutter: true;">fsttablecompose Ha.fst CLG_reduced.fst | \
fstdeterminizestar --use-log=true | \
fstrmsymbols disambig_tstate.list | \
fstrmepslocal | fstminimizeencoded \
> HCLGa.fst
</pre>i.e. the <b>Ha</b> and <b>CLG</b> (with the physical-to-logical mapping applied) transducers are composed and determinized, the auxiliary symbols are replaced with epsilons (after all composition/determinization steps have finished they are not needed anymore), and the result is minimized.<br />
<a class="color-inline-square3" href="#HCLGa" title="HCLGa transducer (without HMM self-loops)"><br />
<center><img height="20%" src="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/cascade/HCLGa-thumb.png" width="80%" /><br />
</center></a><br />
<a class="color-inline-square3" href="#HCLG" title="HCLG trasducer (with HMM self-loops and reordering)"></a><br />
The input labels are again in the format described above and are produced by <b>fstmaketidsyms</b>.<br />
<br />
The graph from the second slide is created using:<br />
<pre class="brush: bash; gutter: true;">add-self-loops --self-loop-scale=0.1 \
--reorder=true $model < HCLGa.fst > HCLG.fst
</pre>It adds self-loops, adjusting their probabilities using the "self-loop-scale" parameter (see Kaldi's documentation), and also reorders the transitions. This reordering makes decoding faster by avoiding calculating the same acoustic score twice (in the typical Bakis left-to-right topologies) for each feature frame.<br />
<br />
And basically that's all. Happy decoding!<br />
<br />
<!-- Invisible section to host ColorBox inlines --><br />
<div style="display: none;"><br />
<!-- Bigram LM graphs --> <br />
<div id="bglm-raw" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/bigram%20lm/G_bi_raw.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><div id="bglm-eps2disambig" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/bigram%20lm/G_bi_eps2disambig.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><div id="bglm-s2eps" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/bigram%20lm/G_bi_s2eps.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><div id="bglm-final" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/bigram%20lm/G_bi.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><br />
<!-- Unigram LM graphs --> <br />
<div id="unilm-raw" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/unigram%20lm/G_uni_raw.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><div id="unilm-eps2disambig" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/unigram%20lm/G_uni_eps2disambig.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><div id="unilm-s2eps" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/unigram%20lm/G_uni_s2eps.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><div id="unilm-final" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/unigram%20lm/G_uni.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><br />
<!-- Lexicon graphs --><br />
<div id="lex-wo-loop" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/lexicon/L_disambig_wo_loop.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><div id="lex" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/lexicon/L_disambig.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><br />
<!-- LG graphs --> <br />
<div id="lg" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/cascade/LG_uni.svg" height="300" type="image/svg+xml" width="1000"></object> <br />
</div><br />
<!-- Context dependency related graphs --> <br />
<div id="c-fst" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/ctxdep/C.svg" height="800" type="image/svg+xml" width="1180"></object> <br />
</div><div id="lg-subseq-loop" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/ctxdep/LG_uni_subseq.svg" height="800" type="image/svg+xml" width="1180"></object> <br />
</div><div id="ilabel-map" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/ctxdep/ilabel_map.svg" height="2000" type="image/svg+xml" width="1000"></object> <br />
</div><br />
<!-- CLG graphs --> <br />
<div id="clg" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/cascade/CLG_uni.svg" height="700" type="image/svg+xml" width="2500"></object> <br />
</div><div id="clg-optimized" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/cascade/CLG_uni_reduced.svg" height="700" type="image/svg+xml" width="2500"></object> <br />
</div><br />
<!-- H graphs --><br />
<div id="Ha-fst" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/hmm/Ha.svg" height="2000" type="image/svg+xml" width="3160"></object><br />
</div><br />
<!-- HCLG graphs --><br />
<div id="HCLGa" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/cascade/HCLGa_uni.svg" height="2000" type="image/svg+xml" width="4000"></object> <br />
</div><div id="HCLG" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/kaldi-decoding-graph/cascade/HCLG_uni.svg" height="2000" type="image/svg+xml" width="4000"></object> <br />
</div><br />
</div><br />
<br />
<b>Poor man's Kaldi recipe</b><br />
<a href="http://kaldi.sourceforge.net/">Kaldi</a> is a relatively new addition to the open source speech recognition toolkits, officially released about a year ago. It's based on the <a href="http://google.com/search?q=mohri+speech+recognition+with+weighted+finite-state+transducers+filetype:pdf">WFST</a> paradigm and is mostly oriented toward the research community. There are many things I like about Kaldi. First of all I like the overall "openness" of the project: the source code is distributed under the very permissive Apache license, and even the $\LaTeX$ sources for the papers about Kaldi are in the repo. The design decision to use relatively small, single-purpose tools that can be pipelined Unix-style makes the code very clean and easy to follow. The project is very actively developed and supports experimental options like neural-net-based acoustic features and GPGPU acceleration, although I haven't had the chance to play with these yet. Last but not least, Kaldi comes with extensive documentation. <br />
There is also a big and growing number of recipes working with the most widely used speech databases. My problem, however, was that I am not an academic, but just a coder who likes to play, from time to time, with technologies that look interesting and promising. I wanted to get an idea of how Kaldi works, but I don't have access to these expensive datasets. That's why I started to search for a way to use a publicly available database. The best option I have found so far is a subset of the RM1 dataset, freely available from CMU. In fact the data only includes pre-extracted feature vectors, and not the original audio. The details of feature extraction in Sphinx are almost certainly different from those in Kaldi, but as long as the features used in training and decoding match, we should be OK. Kaldi already has a recipe for RM, so modifying it to use CMU's subset was a rather trivial exercise. This post describes how to use this modified recipe in case other people want to try it. <br />
<b>Update:</b> <i>As of Feb 27, 2012 a slightly modified version of this recipe is part of the official Kaldi distribution.</i> <br />
<strong>Setup Kaldi</strong> <br />
This tutorial will assume that you don't yet have Kaldi installed and that you are using a GNU/Linux OS. First of all we need to obtain the toolkit's source code: <pre class="brush: plain; gutter: false">svn co https://kaldi.svn.sourceforge.net/svnroot/kaldi/trunk@762 kaldi
</pre>I am explicitly using revision 762 here, simply because this is the version of Kaldi I have tested my recipe against. Then the recipe itself can be obtained from <a href="https://github.com/vdp/kaldi-rm1-mod">GitHub</a> or <a href="https://bitbucket.org/vdp/kaldi-rm1-mod">BitBucket</a> (I really like their free private Git repos). <pre class="brush: plain; gutter: false">git clone git://github.com/vdp/kaldi-rm1-mod.git
OR
git clone git@bitbucket.org:vdp/kaldi-rm1-mod.git
</pre>The recipe assumes that it is stored in a sibling directory of the directory which contains Kaldi. Kaldi needs to be patched in order to add some tools that I wrote: <pre class="brush: plain">cd kaldi
patch -p1 < ../kaldi-rm1-mod/cxx/diff-r762.patch
</pre>Only one of these tools is essential; the others are only used to visualize some data structures. I will say more about this later. Kaldi can then be built in the usual way (just follow the INSTALL file). <br />
<br />
<strong>Setup RM1 data</strong> <br />
Download the <a href="http://www.speech.cs.cmu.edu/databases/rm1/rm1_cepstra.tar.gz">cepstra files</a> from <a href="http://www.speech.cs.cmu.edu/databases/rm1/"> CMU's site</a> and extract the archive: <pre class="brush: plain">tar xf rm1_cepstra.tar.gz
mv rm1 rm1_feats
</pre>Then you will need to download some metadata from LDC's website: <pre class="brush: plain">wget -mk --no-parent -r -c -v -nH http://www.ldc.upenn.edu/Catalog/docs/LDC93S3B/
mv Catalog/docs/LDC93S3B ./
rm -rf Catalog
</pre>The files that we need are stored under the <i>LDC93S3B/disc_1/doc/</i> directory. The original Kaldi RM recipe uses a lexicon stored in a file called <i>pcdsril.txt</i>, but unfortunately it's not among the files distributed freely by LDC. Now, a quick web search actually reveals that it can be found in some researchers' home directories, but since I wasn't completely sure about its legal status I decided to use the dictionary that comes with CMU's data in the master branch of my recipe. <br />
<br />
<strong>Running the recipe</strong> <br />
The main script that drives the training and decoding is <i>run.sh</i> in the root directory of my recipe. The line below needs to be changed so that <i>RM1_ROOT</i> points to the directory in which the CMU and LDC data is stored: <pre class="brush: plain; gutter: false">RM1_ROOT=/home/vassil/devel/speech/datasets/rm1
</pre>The scripts assume that the aforementioned <i>rm1_feats</i> and <i>LDC93S3B</i> directories share a common root directory, and the variable <i>$RM1_ROOT</i> points to this directory. If <i>kaldi</i> and <i>kaldi-rm1-mod</i> are not siblings as explained above then you may also need to change the variable <i>$root</i> in <i>path.sh</i> to point to kaldi. When all of the above steps are done, you can cross your fingers and run: <pre class="brush: plain; gutter: false">./run.sh
</pre>It may take an hour or two for the recipe scripts to go through a subset of the stages in the original recipe. The results of the execution are stored in the <i>exp</i> subdirectory. The acoustic models and HMM state tying trees are stored under <i>exp/{mono,tri1,tri2a}</i>, and the results of the test set decoding are in <i>exp/decode_{mono,tri1,tri2a}</i>. The word error rates for the different decoding stages can be seen in the <i>wer</i> files in their respective directories (e.g. <i>exp/decode_mono/wer</i>). The WER for the version of the recipe that uses CMU's dictionary is significantly higher. This is probably at least partially due to the fact that, when preparing the dictionary, my scripts blindly remove all pronunciation variants except for the first one. The WER when using the original dictionary was about 6-something percent, and above 7% when the CMU dictionary is used. <br />
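Once the run has finished, peeking at the results is just a matter of printing those files, e.g.:<br />
<pre class="brush: plain; gutter: false">cat exp/decode_mono/wer exp/decode_tri1/wer exp/decode_tri2a/wer
</pre><br />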
<br />
<b>Extra tools (with some pictures)</b> <br />
As I already said there is only one extra tool that is needed for the recipe to run. It is called <i>pack-sphinx-feats</i> and its sole purpose is to convert the Sphinx feature files into Kaldi tables. <br />
Because WFST-based speech recognition is new to me, I wrote some other tools to help me visualize the data structures used during training and decoding, and also to get a basic idea of what programming Kaldi and OpenFST feels like. All these tools are just barely tested and should not be expected to be particularly robust or bug-free. One of these is <i>draw-ali</i>. It has a similar function to that of Kaldi's tool <i>show-alignments</i>, but instead of text it shows the alignments as a distinctly colored trace through a GraphViz diagram. Initially I had assumed that the final training and decoding graphs are input-deterministic, so I had to rewrite this tool after I found that it fails in some cases, because the auxiliary and epsilon symbol removal tools introduce non-determinism. To give you an idea of what the output of the program looks like, here are the training graph alignments for an utterance at monophone training passes 0 and 10 (click on the picture below if you have a modern browser supporting SVG graphics (i.e. not IE8)): <br />
<a class="color-inline-group1" href="#train-mono-ali-0" title="Monophone alignment at step 0"><center><img height="20%" src="http://www.danielpovey.com/files/blog/pictures/poor-mans-kaldi/train-ali-mono/mono-train-ali-thumb.png" width="80%" /></center></a> <a class="color-inline-group1" href="#train-mono-ali-10" title="Monophone alignment at step 10"></a> The input labels have the format (x RepCount)CIPhone_State_PDF_OutArc, i.e. they show how many times a self-loop arc is traversed in this alignment, the identity of the phone, the index of the HMM state within the phone model, the PDF associated with this (possibly context-dependent) state, and the index of the output arc leaving the HMM state. You can see in the graphs above that at the initial (0) step of training the alignment is uniform (i.e. the self-loops have roughly the same repetition count), but after the re-alignment at step 10, when the decoder has a much better 'idea' about the acoustic model, this symmetry is broken. <br />
Another, somewhat similar, tool is <i>draw-tree</i>, which can be used to produce GraphViz-based pictures of the trees used for HMM state tying, and also a colored 'trace' through such a tree showing how a particular HMM state is mapped to a PDF. This is the only somewhat intrusive tool, in the sense that it requires a new method to be added to the <i>EventMap</i> interface and its descendants in order to make tree traversal possible. Click on the picture below to view (warning: these pictures, especially the context-dependent one, are even more unwieldy than the previous ones): <br />
<a class="color-inline-group2" href="#tree-mono" title="Monophone Tree"><center><img height="20%" src="http://www.danielpovey.com/files/blog/pictures/poor-mans-kaldi/phone-trees/tree-thumb.png" width="80%" /></center></a> <a class="color-inline-group2" href="#tree-tri1" title="Triphone Tree"></a> At first I wanted to render the <i>SplitEventMap's</i> "yes sets" as tooltips rather than labels, to be shown on mouse hover, but I found that only Firefox works satisfactorily with SVG tooltips. <!-- Invisible section to host ColorBox inlines --> <div style="display: none;"> <!-- Training alignments --> <div id="train-mono-ali-0" style="padding:10px; background:#fff;" title="Test SVG"><object data="http://www.danielpovey.com/files/blog/pictures/poor-mans-kaldi/train-ali-mono/trn_adg04_st1350_ali_0.svg" height="1500" type="image/svg+xml" width="15000"></object> </div><div id="train-mono-ali-10" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/poor-mans-kaldi/train-ali-mono/trn_adg04_st1350_ali_10.svg" height="1500" type="image/svg+xml" width="15000"></object> </div><!-- HMM state tying trees --> <div id="tree-mono" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/poor-mans-kaldi/phone-trees/tree-mono.svg" height="900" type="image/svg+xml" width="13000"></object> </div><div id="tree-tri1" style="padding:10px; background:#fff;"><object data="http://www.danielpovey.com/files/blog/pictures/poor-mans-kaldi/phone-trees/tree-tri1.svg" height="2500" type="image/svg+xml" width="25000"></object> </div></div>