Chapter 4
Methodology and Implementation
This chapter describes the methodology and implementation of the speaker-independent speech recognizer for the Sinhala language and of the Android mobile application for voice dialing. The research has two phases. The first is to build a speaker-independent Sinhala speech recognizer that recognizes digits spoken in Sinhala. The second is to build an Android application that integrates the trained speech recognizer. The chapter covers the tools, algorithms, theoretical aspects, models and file structures used throughout the research.
4.1 Research phase 1: Build the speaker-independent Sinhala speech recognizer for recognizing the digits.
This section describes, step by step, the development of the speaker-independent Sinhala speech recognizer. It covers the creation of the phonetic dictionary, the language model, the grammar file, the acoustic speech database and the trained acoustic model.
4.1.1 Data preparation
The system is a Sinhala speech recognition voice dialer. Since no existing speech database was available for this task, the speech had to be collected from scratch to develop the system.
Data collection
The first stage of any speech recognizer is the collection of sound signals. The database should contain recordings from a sufficiently large variety of speakers. The size of the database depends on the task at hand. For this application only a small number of words was considered: the research targets only the written Sinhala vocabulary that applies to voice dialing. Altogether twelve words were considered, the ten digits and the two initial calling words “amatanna” and “katakaranna”. The database has two parts, a training part and a testing part. Usually about one tenth of the full speech data is used for testing. In this research 3000 speech samples were used for training and 150 speech samples were used for testing.
Speech database
A speech database was set up before data collection began. It holds Sinhala speech samples taken from a variety of people of different ages. Since no database relevant to voice dialing had been published for the Sinhala language, the speech had to be collected from native Sinhala speakers.
Prompt sheet
To create the speech database, the first step was to prepare a prompt sheet listing the sentences for all the recordings. It contains 100 distinct sentences, built by generating the numbers randomly. Fifty sentences start with the word “amatanna” and the other fifty start with the word “katakaranna”. The prompt sheet used for this research is given in Appendix A.
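The construction of such a prompt sheet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the English words in `STAND_IN_DIGITS` are hypothetical stand-ins for the ten Sinhala digit words, and ten digits per sentence is an assumed length; the actual sheet is the one in Appendix A.

```python
import random

CALL_WORDS = ["amatanna", "katakaranna"]  # the two calling words named in the text
# Hypothetical stand-ins for the ten Sinhala digit words used on the real sheet.
STAND_IN_DIGITS = ["zero", "one", "two", "three", "four",
                   "five", "six", "seven", "eight", "nine"]


def make_prompts(per_call_word=50, digits_per_prompt=10, seed=0):
    """Return distinct prompts, `per_call_word` of them for each calling word."""
    rng = random.Random(seed)  # fixed seed so the sheet can be regenerated
    prompts = []
    seen = set()
    for call_word in CALL_WORDS:
        count = 0
        while count < per_call_word:
            digits = tuple(rng.choice(STAND_IN_DIGITS)
                           for _ in range(digits_per_prompt))
            if digits in seen:  # keep every sentence different from the others
                continue
            seen.add(digits)
            prompts.append(" ".join((call_word,) + digits))
            count += 1
    return prompts


if __name__ == "__main__":
    for line in make_prompts()[:3]:
        print(line)
```

With the default arguments the function returns 100 sentences, half beginning with each calling word.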
Recording
The sentences in the prompt sheet were recorded by thirty (30) native speakers, since this is a speaker-independent application. The speakers were selected by age and divided into eight age groups. Each group contained four people, two females and two males, except for one group that contained only two people, one female and one male. Each speaker was given 100 sentences to speak, so 3000 speech samples were recorded for training altogether. The details of the speakers, such as gender and age, can be found in Appendix A. If a recording contained an error due to background noise or filler sounds, the speaker was asked to repeat it until a correct sound signal was obtained. Since the proposed system is a discrete system, the speakers had to make a short pause at the start and end of each recording and between the words. Speech was recorded at night in a quiet room using a condenser recording microphone. The recordings were made at a sampling rate of 44.1 kHz on a mono channel and saved in *.wav format.
Sampling frequency and format of speech audio files
Speech recordings were saved in the MS WAV file format. The “Praat” software was used to convert the signals from the 44.1 kHz sampling frequency to 16 kHz, since the training samples must have a sampling frequency of 16 kHz. The recordings were about 11 seconds long on average. Each utterance should begin and end with a silence that does not exceed 0.2 seconds, so “Praat” was also used to edit all 3000 sound signals accordingly.
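Before training, it can be useful to confirm that every converted file really is 16 kHz mono. The sketch below uses only Python's standard `wave` module; the function name `check_wav` and its return shape are illustrative choices, not part of Praat or of the Sphinx tools.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return (problems, duration_seconds) for one WAV file.

    `problems` is an empty list when the file matches the expected format.
    """
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)
    problems = []
    if rate != expected_rate:
        problems.append(f"sampling rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems, duration
```

Running this over the `wav` directory would flag any recording that was left at 44.1 kHz, and the returned duration gives a quick check on the recording lengths.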
4.1.2 Pronunciation dictionary
The pronunciation dictionary was written by hand, since the voice dialing system uses very few words: only 12 words from the Sinhala vocabulary. It was created with reference to the International Phonetic Alphabet for the Sinhala language and to dictionaries previously created for CMU Sphinx. The acoustic phones, however, were taken mostly from a study of the different databases provided through the Carnegie Mellon University Sphinx forum (CMU Sphinx Forum).
Two dictionaries were implemented for this system: one for the speech utterances and one for filler sounds. The filler sounds cover the silences at the beginning, in the middle and at the end of the speech utterances. Both dictionaries can be found in Appendix A. They are referred to as the language dictionary and the filler dictionary.
4.1.3 Creating the grammar file
The grammar file was also created by hand, since the system uses very few words. It was written in JSGF (JSpeech Grammar Format). The grammar file can be found in Appendix A.
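For readers unfamiliar with JSGF, the following Python sketch generates a small grammar of the kind such a system might use: one calling word followed by one or more digits. The rule names `<dial>` and `<digit>` and the overall rule structure are assumptions made for illustration; the grammar actually used in this research is the one in Appendix A.

```python
def build_jsgf(grammar_name, call_words, digit_words):
    """Build JSGF text: a calling word followed by one or more digit words."""
    calls = " | ".join(call_words)
    digits = " | ".join(digit_words)
    lines = [
        "#JSGF V1.0;",                            # JSGF header line
        f"grammar {grammar_name};",               # grammar name declaration
        f"public <dial> = ({calls}) <digit>+;",   # top-level (public) rule
        f"<digit> = {digits};",                   # alternatives for one digit
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # English stand-ins replace the Sinhala digit words here.
    print(build_jsgf("digits", ["amatanna", "katakaranna"], ["zero", "one", "two"]))
```

Writing the grammar as alternatives plus a repetition operator keeps the file short even though it accepts digit strings of any length.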
4.1.4 Building the language model
A language model restricts the word search. It defines which words can follow the previously recognized words and so narrows the matching process by excluding words that are not probable. N-gram language models are the most common language models in use today. An n-gram model can be viewed as a finite-state language model that stores statistics of word sequences. The more successfully the language model restricts the search space, the better the accuracy that can be obtained, because the model can then predict the next word well. A language model usually restricts the word search to the words included in its vocabulary.
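To make the idea of "statistics of word sequences" concrete, here is a minimal bigram model in Python. It uses plain maximum-likelihood estimates with no smoothing or back-off, so it is a simplification of what a language modeling toolkit computes, and the function names are illustrative only.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count word bigrams, with <s> and </s> marking sentence boundaries."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_prob(counts, prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 if prev was never seen."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[cur] / total if total else 0.0


if __name__ == "__main__":
    # English stand-ins replace the Sinhala digit words here.
    model = train_bigrams(["amatanna one two", "katakaranna one one"])
    print(bigram_prob(model, "<s>", "amatanna"))
    print(bigram_prob(model, "one", "amatanna"))
```

A word pair that never occurred in the reference text gets probability zero, which is how the model removes improbable words from the search.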
The language model was built using the cmuclmtk software. First, the reference text was created; this text (svd.txt) can be found in Appendix A. It was written in a specific format, in which the sentences are delimited by <s> and </s> tags.
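The reference text format can be illustrated with a short Python helper that wraps each sentence in the start and end tags, one sentence per line. The function name is hypothetical and the example sentence uses English stand-ins for the Sinhala digit words.

```python
def to_reference_lines(sentences):
    """Wrap each sentence in <s> ... </s>, normalizing internal whitespace."""
    lines = []
    for sentence in sentences:
        words = sentence.split()  # also trims leading and trailing spaces
        lines.append("<s> " + " ".join(words) + " </s>")
    return lines


if __name__ == "__main__":
    for line in to_reference_lines(["amatanna one two three"]):
        print(line)
```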
Then the vocabulary file was generated by giving the following command.
text2wfreq < svd.txt | wfreq2vocab > svd.vocab
Then the generated vocabulary file was edited to remove unwanted words (numbers and misspellings). Any misspellings found were fixed in the input reference text. The generated vocabulary file (svd.vocab) can be found in Appendix A.
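Conceptually, this step counts word frequencies in the reference text and keeps the list of distinct words. The Python sketch below mirrors that idea and the manual clean-up of numeric tokens; it does not reproduce the exact output format of the cmuclmtk tools, and the function name is an illustrative assumption.

```python
import re
from collections import Counter


def build_vocab(reference_lines):
    """Return (sorted vocabulary, word frequencies) for the reference text.

    Tokens that contain a numeral are dropped from the vocabulary, in the
    spirit of the manual clean-up described in the text.
    """
    freq = Counter(word for line in reference_lines for word in line.split())
    vocab = sorted(word for word in freq if not re.search(r"\d", word))
    return vocab, freq


if __name__ == "__main__":
    vocab, freq = build_vocab(["<s> amatanna one two </s>",
                               "<s> katakaranna one 7 </s>"])
    print(vocab)
```

Inspecting the frequency table is also a quick way to spot misspellings, since a misspelled word typically appears only once.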
Then the ARPA format language model was generated using these commands.
text2idngram -vocab svd.vocab -idngram svd.idngram < svd.txt
idngram2lm -vocab_type 0 -idngram svd.idngram -vocab svd.vocab -arpa svd.arpa
Finally, the CMU binary form of the language model (DMP file) was generated using the following command.
sphinx_lm_convert -i svd.arpa -o svd.lm.DMP
The final output, svd.lm.DMP, is a binary file containing the language model needed for the training process.
4.1.5 Acoustic model
Before the acoustic model was created, the following file structure was arranged as described in the CMU Sphinx toolkit guide. The speech database is named “svd” (Sinhala Voice Dial). The contents of these files are given in Appendix A.
- svd.dic - Phonetic dictionary
- svd.phone - Phoneset file
- svd.lm.DMP - Language model
- svd.filler - List of fillers
- svd_train.fileids - List of files for training
- svd_train.transcription - Transcription for training
- svd_test.fileids - List of files for testing
- svd_test.transcription - Transcription for testing
All these files were placed in one directory named “etc”. The WAV speech samples were placed in another directory named “wav”. These two directories were placed in a directory named after the database (svd). Before training starts, a parent directory must contain the “svd” directory together with the required packages, “pocketsphinx”, “sphinxbase” and “sphinxtrain”. All the packages and the “svd” directory were therefore put into one directory, from which the training process was started.
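Because the training scripts depend on this layout and on exact file names, a small check before training can save time. The following Python sketch reports what is missing from a database directory; it is a convenience written for this description, not part of the Sphinx tools, and the function name is hypothetical.

```python
from pathlib import Path

# File names expected inside etc/, with {db} standing for the database name.
REQUIRED_ETC_FILES = [
    "{db}.dic", "{db}.phone", "{db}.lm.DMP", "{db}.filler",
    "{db}_train.fileids", "{db}_train.transcription",
    "{db}_test.fileids", "{db}_test.transcription",
]


def missing_paths(db_dir, db_name="svd"):
    """List the directories and files missing from the database directory."""
    db_dir = Path(db_dir)
    missing = [d for d in ("etc", "wav") if not (db_dir / d).is_dir()]
    for template in REQUIRED_ETC_FILES:
        name = template.format(db=db_name)
        if not (db_dir / "etc" / name).is_file():
            missing.append(f"etc/{name}")
    return missing


if __name__ == "__main__":
    for item in missing_paths("."):
        print("missing:", item)
```

Run from inside the “svd” directory, an empty result means the structure matches the list above.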
Setting up the training scripts
The training scripts were run from a command-line terminal. Before starting, the working directory was changed to the “svd” database directory and the following command was run.
python ../sphinxtrain/scripts/sphinxtrain -t svd setup
This command copied all the required configuration files into the etc subdirectory of the database directory and prepared the database for training. The two configuration files created were feat.params and sphinx_train.cfg. Both are given in Appendix A.
Set up the database
These values were filled in at configuration time. The experiment name is used to name the model files and log files in the database.
$CFG_DB_NAME = "svd";
$CFG_EXPTNAME = "$CFG_DB_NAME";
Set up the format of database audio
Since the database contains speech utterances in the ‘wav’ format, recorded as MS WAV, the extension and the type were set to “wav” and “mswav” respectively.
$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'wav';
$CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw
Configure Path to files
This was done automatically, provided the running directory had the right file structure. The files must be named exactly as expected. The paths were assigned to the variables used in the main model training.
$CFG_DICTIONARY = "$CFG_LIST_DIR/$CFG_DB_NAME.dic";
$CFG_RAWPHONEFILE = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";
$CFG_FILLERDICT = "$CFG_LIST_DIR/$CFG_DB_NAME.filler";
$CFG_LISTOFFILES = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.fileids";
$CFG_TRANSCRIPTFILE = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.transcription";
$CFG_FEATPARAMS = "$CFG_LIST_DIR/feat.params";
Configure model type and model parameters
PocketSphinx supports both continuous and semi-continuous model types. In this research the continuous type was associated with continuous speech recognition and the semi-continuous type with discrete speech recognition. Since this application uses discrete speech, semi-continuous model training was used.
#$CFG_HMM_TYPE = '.cont.'; # Sphinx 4, Pocketsphinx
$CFG_HMM_TYPE = '.semi.'; # PocketSphinx
$CFG_FINAL_NUM_DENSITIES = 8;
# Number of tied states (senones) to create in decision-tree clustering
$CFG_N_TIED_STATES = 1000;
The last value sets the number of senones used to train the model. With more senones, the model can discriminate sounds more precisely. With too many senones, however, the model may not be general enough to recognize unseen speech, and the word error rate on unseen data can then be much higher.
The approximate number of senones and number of densities is provided in the table below.
Vocabulary | Hours in db | Senones | Densities | Example
20 | 5 | 200 | 8 | Tidigits Digits Recognition
100 | 20 | 2000 | 8 | RM1 Command and Control
5000 | 30 | 4000 | 16 | WSJ1 5k Small Dictation
20000 | 80 | 4000 | 32 | WSJ1 20k Big Dictation
60000 | 200 | 6000 | 16 | HUB4 Broadcast News
60000 | 2000 | 12000 | 64 | Fisher Rich Telephone Transcription
Configure sound feature parameters
By default, Sphinx expects sound files sampled at 16 thousand samples per second (16 kHz). If this is the case, the etc/feat.params file is generated automatically with the recommended values. The recommended values are:
# Feature extraction parameters
$CFG_WAVFILE_SRATE = 16000.0;
$CFG_NUM_FILT = 40; # For wideband speech it’s 40, for telephone 8khz reasonable value is 31
$CFG_LO_FILT = 133.3334; # For telephone 8kHz speech value is 200
$CFG_HI_FILT = 6855.4976; # For telephone 8kHz speech value is 3500
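To give a feel for what these three values control, the Python sketch below spaces the edges of 40 filters evenly on a mel scale between the lower and upper frequency limits. It assumes the common textbook mel formula, mel = 2595 log10(1 + f/700); the exact filter construction inside Sphinx may differ, so the numbers are indicative only.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mel (common textbook formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_edges(num_filt=40, lo_filt=133.3334, hi_filt=6855.4976):
    """Return num_filt + 2 edge frequencies in Hz, evenly spaced in mel.

    Triangular filter k would span edges[k] .. edges[k + 2].
    """
    lo_mel, hi_mel = hz_to_mel(lo_filt), hz_to_mel(hi_filt)
    step = (hi_mel - lo_mel) / (num_filt + 1)
    return [mel_to_hz(lo_mel + i * step) for i in range(num_filt + 2)]


if __name__ == "__main__":
    edges = filter_edges()
    print(len(edges), round(edges[0], 1), round(edges[-1], 1))
```

Note that the upper limit stays below 8 kHz, half of the 16 kHz sampling rate, which is the highest frequency such recordings can represent.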
Configure decoding parameters
The following were properly configured in the etc/sphinx_train.cfg file.
$DEC_CFG_DICTIONARY = "$DEC_CFG_BASE_DIR/etc/$DEC_CFG_DB_NAME.dic";
$DEC_CFG_FILLERDICT = "$DEC_CFG_BASE_DIR/etc/$DEC_CFG_DB_NAME.filler";
$DEC_CFG_LISTOFFILES = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.fileids";
$DEC_CFG_TRANSCRIPTFILE = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.transcription";
$DEC_CFG_RESULT_DIR = "$DEC_CFG_BASE_DIR/result";
# These variables, used by the decoder, have to be user defined, and
# may affect the decoder output
$DEC_CFG_LANGUAGEMODEL_DIR = "$DEC_CFG_BASE_DIR/etc";
$DEC_CFG_LANGUAGEMODEL = "$DEC_CFG_LANGUAGEMODEL_DIR/${CFG_DB_NAME}.lm.DMP";
Training
After all these paths and parameters had been set in the configuration file as described above, training was started by running the following command.
python ../sphinxtrain/scripts/sphinxtrain run
The scripts launched jobs on the machine, which took a few minutes to run.
Acoustic Model
After the training process, the acoustic model was located at the following path in the database directory.
model_parameters/svd.cd_semi_200
Only this folder is needed for the speech recognition tasks.
4.1.6 Testing Results
150 speech samples were used as testing data. The alignment results were available after the training process, at the following path in the database directory.
result/svd.align
4.1.7 Parameters to be optimized
Word error rate
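The WER was given as a percentage. It was calculated according to the following equation, given here in its standard form, where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words in the reference transcription.
\[ \mathrm{WER} = \frac{S + D + I}{N} \times 100\% \]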
Accuracy
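Accuracy was also given as a percentage and is the counterpart of the WER. It was calculated using the following equation, given here in the form used in the CMU Sphinx documentation, with S, D and N as defined for the WER. When there are no insertion errors, it equals 100% minus the WER.
\[ \mathrm{Accuracy} = \frac{N - D - S}{N} \times 100\% \]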
To obtain an optimal recognition system, the WER should be minimized and the accuracy maximized. The parameters of the configuration file were adjusted repeatedly until the recognition system reached its minimum WER with a high accuracy rate.
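Both measures can be computed from a word-level alignment of the reference and the recognized text. The Python sketch below is an illustration, not the scoring script used by the Sphinx tools: it finds a minimum-edit alignment by dynamic programming and assumes the usual definitions WER = (S + D + I) / N and accuracy = (N - D - S) / N, with a non-empty reference.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total errors, S, D, I) for ref[:i] versus hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            err, sub, dele, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (err, sub, dele, ins)          # match, no cost
            else:
                diagonal = (err + 1, sub + 1, dele, ins)  # substitution
            err, sub, dele, ins = table[i - 1][j]
            deletion = (err + 1, sub, dele + 1, ins)
            err, sub, dele, ins = table[i][j - 1]
            insertion = (err + 1, sub, dele, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion)
    return table[-1][-1][1:]


def wer_and_accuracy(reference, hypothesis):
    """Return (WER, accuracy) as percentages; the reference must not be empty."""
    sub, dele, ins = edit_counts(reference, hypothesis)
    n = len(reference.split())
    return 100.0 * (sub + dele + ins) / n, 100.0 * (n - dele - sub) / n


if __name__ == "__main__":
    # English stand-ins replace the Sinhala digit words here.
    print(wer_and_accuracy("amatanna one two three", "amatanna one three three"))
```

In the example, one of four reference words is substituted, so the WER is 25% and the accuracy 75%.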
4.2 Research phase 2: Build the voice dialing mobile application.
This section describes the implementation of the voice dialer as an Android mobile application. The application was developed in the Java programming language using the Eclipse IDE, and it was tested on both the emulator and an actual device. The application can recognize digits spoken by any speaker and dial the recognized number. This requires the trained acoustic model, the pronunciation dictionary, the language model and the grammar files. Speech recognition is performed with these models on the mobile device itself, using the pocketsphinx library, a library written in C for embedded speech recognition, used here on the Android platform.
The implementation and integration of the necessary components are discussed step by step in this section.
Resource Files
The resource files were added to the assets/ directory of the project. Their physical paths were then given to pocketsphinx to make them available to it.
After adding them, the Assets directory contained the following resource files.
Dictionary
- svd.dic
- svd.dic.md5
Grammar
- digits.gram
- digits.gram.md5
- menu.gram
- menu.gram.md5
Language model
- svd.lm.DMP
- svd.lm.DMP.md5
Acoustic Model
- feat.params
- feat.params.md5
- mdef
- mdef.md5
- means
- means.md5
- mixture_weights
- mixture_weights.md5
- noisedict
- noisedict.md5
- transition_matrices
- transition_matrices.md5
- variances
- variances.md5
Assets.lst
models/dict/svd.dic
models/grammar/digits.gram
models/grammar/menu.gram
models/hmm/en-us-semi/feat.params
models/hmm/en-us-semi/mdef
models/hmm/en-us-semi/means
models/hmm/en-us-semi/mixture_weights
models/hmm/en-us-semi/noisedict
models/hmm/en-us-semi/sendump
models/hmm/en-us-semi/transition_matrices
models/hmm/en-us-semi/variances
models/lm/svd.lm.DMP
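The .md5 files listed above sit next to each resource file, presumably so that the application can tell whether a copied asset has changed. In the project these files would normally come from the build tooling. The Python sketch below only illustrates the idea: it assumes each .md5 file holds the hexadecimal MD5 digest of its neighbour and that the asset list holds one relative path per line, and the file name `assets.lst` is used here as an assumption.

```python
import hashlib
from pathlib import Path


def write_md5_sidecars(assets_dir):
    """Write <file>.md5 next to every asset and an assets.lst index.

    Returns the list of relative asset paths that were indexed.
    """
    assets_dir = Path(assets_dir)
    listed = []
    # sorted() materializes the listing before any new files are written.
    for path in sorted(assets_dir.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == ".md5" or path.name == "assets.lst":
            continue  # skip files generated by an earlier run
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        path.with_name(path.name + ".md5").write_text(digest)
        listed.append(path.relative_to(assets_dir).as_posix())
    (assets_dir / "assets.lst").write_text("\n".join(listed) + "\n")
    return listed
```

Running the function twice gives the same index, since the generated files are skipped on later runs.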
Setup the Recognizer
First, the recognizer is set up by adding the resource files. The model parameters obtained from the training process were added to the application as the HMM. The recognition process depends mainly on these resource files. Since the grammar files and the language model were also added as assets, they can be used in the recognition process alongside the HMM: utterances can be recognized using either the grammar files or the language model. The whole process is coded in the Java programming language.
4.3 Architecture of the developed Speech Recognition System