Mixed Language Speech Recognition without Explicit Identification of Language

Use of mixed language in day-to-day speech is becoming common and is accepted as being syntactically correct. However, machine recognition of mixed language speech is a challenge for a conventional speech recognition engine. There are studies on how to enable recognition of mixed language speech. At one end of the spectrum is the use of acoustic models covering the complete phone set of the mixed language; at the other end is the use of a language identification module followed by language-dependent speech recognition engines. Each of these has its own implications. In this paper, we approach the problem of mixed language speech recognition using available resources and show that, by suitably constructing an appropriate pronunciation dictionary and modifying the language model to use mixed language, one can achieve good recognition accuracy on spoken mixed language.


Introduction
Mixed language, also termed code switching in the literature, arises through the fusion of two or more, usually distinct, source languages, normally in situations of thorough bilingualism, so that it is not possible to classify the resulting language as belonging to either of the language families that were its source [17], [1], [2]. With urbanisation and the geographic shift of people, the ability to converse in many languages is becoming common. A very large number of people, especially urban youth, use mixed language in everyday conversation without actually being aware of it. Though mixed language is defined as a mixture of two distinct languages in equal proportion, without indicating which language is mixed into which, at least in the Indian context the non-native language (generally English words) is mixed into the native language. As shown in Fig. 1, the native language (Hindi) is the primary language and the non-native English language is the secondary language. The primary language can be defined as that language in the mixed language which is spoken in the majority. One can observe that the words uttered in the secondary language are very often keywords, foreign words, or phrases which are used colloquially. Consequently, the rate of language change or shift is very frequent in mixed language. Thus, recognition of mixed language speech requires, in our opinion, an entirely different approach.
Fig. 1. An example mixed language (Hindi-English) sentence: मै अपने account से किसी दूसरे bank के account मे पैसा कैसे Transfer कर सकता हूँ? ("How can I transfer money from my account to an account in another bank?")

Consider a call centre in a metropolitan city which has to cater to people speaking different languages. This requires all the agents in the call centre to be able to communicate in multiple languages, which is very unlikely. A possible solution is to ascertain the language of the caller and then, based on the language, direct the caller to an agent who can converse in that language expertly. In a similar vein, in a speech-enabled application, having identified the language of the caller, a language-specific speech recognition engine can be employed to cater to the caller. Clearly, this kind of system cannot work when people use mixed language speech, even if one knew the mix of languages in use, because the language shift is very frequent. Recently there has been increased interest in mixed language recognition (for example [3], [4], [19]); however, the work has been restricted to a mix of Mandarin and Taiwanese. Mixed language speech recognition is in its nascent stages of research and, to the best of our knowledge, there is no work reported in the literature for an India-specific language mix.
There are two major distinct frameworks for building mixed language automatic speech recognition (ML-ASR), namely the multi-pass and the one-pass frameworks. In a multi-pass ML-ASR, the exact instant in the spoken speech at which a language switch happens is determined and the language of the speech segment is identified. Once the language of the speech segment is known, the corresponding language-dependent automatic speech recognition (ASR) engine is used to recognize the speech segment. Note that a typical ASR is language specific and uses an acoustic model (AM), a language model (LM) and a pronunciation lexicon (PL) built for that language to recognize spoken speech. The AM, LM and PL are constructed from language-specific speech and text corpora through a training process. In the one-pass approach, an ASR is built (namely, AM, LM and PL) which encompasses both the languages in the mixed language, enabling ML-ASR on mixed language speech. The one-pass approach is simpler than the multi-pass approach because there is no need to (a) specifically identify the language or (b) employ several language-specific ASRs. However, the one-pass approach to ML-ASR poses a problem in the form of the need to collect a sufficient amount of mixed language speech corpus (audio and the associated text transcription) to build the mixed language acoustic model and the ML language model required for ML-ASR. In this paper, we hypothesize that one can use available resources (for example, the acoustic models of one of the languages in the mixed language) and carefully construct the LM and PL to perform ML-ASR. We conducted several experiments on mixed language speech where the primary language is Hindi and the secondary language is English. It should be noted that the approach is independent of the language mix, in the sense that any other Indian language can take the place of Hindi with an appropriate mapping of the phones in that Indian language to the English phones.
The rest of the paper is organised as follows. A short review of the multi-pass and one-pass frameworks for multilingual speech is given in Section 2, followed by a discussion of the mixed language database used in our experiments and of our approach to performing mixed language ASR in Section 3. In Section 4 we discuss experimental results, and we conclude in Section 5.

Existing Approaches
Recognition of mixed language speech is still in its initial stages of research. There are two approaches reported in the literature: the multi-pass framework [4] and the one-pass framework [3]. Multilingual speech recognition is another area of research which is closely related to ML-ASR. In multilingual speech recognition, the spoken speech is not a mix of two languages, unlike in ML-ASR; the main challenge is that one does not know a priori the identity of the language. So the first task in multilingual ASR is to identify the language. This problem of identifying the language is well addressed in the literature [5]. Language identification using LPC-based acoustic features was proposed by Cimarusti et al. [5], who were able to identify eight different languages with reasonable success. In another work, Foil [7] used prosodic features for language identification, and Navratil et al. [10] successfully used phonotactic-acoustic features. Later, Yan [9] applied a combination of acoustic, phonotactic and prosodic information for language identification. Nakagawa [8] compared four different methods of identifying languages and concluded that a continuous hidden Markov model (HMM) based method works best. Many recognizers, such as the Gaussian Mixture Model (GMM), single-language phone recognition followed by language modelling (PRLM), parallel PRLM (PPRLM), GMM tokenization [6] and the Gaussian Mixture Bi-gram Model (GMBM) [11], have also been studied in the literature for multilingual speech recognition.
In order to use the multilingual approaches in mixed language speech recognition, one needs to identify the exact time instants at which switching from one language to another occurs, and follow this up with language identification. Automatic segmentation of different languages within a speech utterance has been addressed by Wu et al. [4], who use the Bayesian information criterion (BIC) on Delta-MFCC (Mel Frequency Cepstral Coefficient) features. In another related work, Chi-Jiun et al. [12] use a statistical approach to segment and identify the languages in a speech utterance; they use a maximum a posteriori (MAP) estimate to find the segment boundaries at which to do language identification. Mixed language speech recognition using the multi-pass framework can be realised using the following steps (see Fig. 2). The mixed language speech input is divided into segments based on the identification of the instants at which language changes occur. The language of each segment is then identified using a language identification module, and a language-dependent ASR is used to recognize that particular segment of speech. The recognition performance of the multi-pass approach depends on (a) the performance of the language boundary detection block, (b) the performance of the language identification block and (c) the actual performance of the language-specific ASR. Clearly, poor performance by any one of the three blocks affects the overall performance of the multi-pass ML-ASR system. The one-pass framework [3] avoids this drawback of the multi-pass system by building a PL, AM and LM that encompass both the languages in the mixed language. The acoustic model for the mixed language is an AM trained on the combined phoneme set of the languages in the mixed language. The advantage of this approach (shown in Fig. 3) is that it does not depend on the language boundary detection block or the language identification block. It is similar to a language-specific ASR, except that the AM, LM and PL are built for the mixed language.
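The multi-pass flow described above (segmentation at language-switch points, language identification, then routing each segment to a language-specific ASR) can be sketched as follows. This is a minimal illustration of the control flow only; the component functions are hypothetical stand-ins, not an actual implementation.

```python
# Sketch of the multi-pass ML-ASR pipeline. The three components
# (boundary detection, language identification, per-language recognizers)
# are passed in as callables; all of them are hypothetical placeholders.

def multi_pass_asr(utterance, detect_language_boundaries,
                   identify_language, recognizers):
    """Segment the utterance at language-switch points, identify the
    language of each segment, and route it to the matching ASR."""
    hypotheses = []
    for segment in detect_language_boundaries(utterance):
        lang = identify_language(segment)   # e.g. "hindi" or "english"
        asr = recognizers[lang]             # language-specific ASR engine
        hypotheses.append(asr(segment))
    # The final transcript concatenates the per-segment results; note that
    # an error in boundary detection or LID propagates to the output.
    return " ".join(hypotheses)
```

The structure makes the dependency on all three blocks explicit: a wrong segment boundary or a wrong language label sends the segment to the wrong recognizer, which is exactly the weakness the one-pass framework avoids.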
Note that this approach needs a mixed language speech and text corpus, which generally is not available. Clearly, the existing approaches cannot be used directly for ML-ASR. In our approach, we used the one-pass framework; however, we used the AM of a single language (which was readily available) instead of undertaking the Herculean task of collecting and transcribing a speech corpus to build an AM for the complete phone set encompassing both the languages. We did, however, build a small mixed language corpus to (a) construct the language model to handle mixed language recognition [16] and (b) test our approach.

Proposed Approach
We have worked on a specific language mix, namely Hindi-English, whose usage is very common in the Indian subcontinent. Specifically, Hindi, being the native language, is spoken for the majority of the time compared to the non-native language, English. In our corpus, a little more than two thirds of the total spoken words were in Hindi and the rest, namely one third, were either English words or proper nouns. Overall, our corpus consisted of 46 different speakers (with sufficient gender and age variability) from different metros in India. Each of the speakers uttered three to five different sentences, which had a mix of Hindi-English, of which at least one sentence uttered by the speaker was elicited speech. The elicited speech gave an indication of the actual mix of the language as spoken in everyday conversation. In all, there were 213 unique spoken sentences consisting of 1946 words. All the experimental results reported in this paper are based on these word utterances. During data collection, the speakers were supplied a speaker sheet (in Hindi script) and were asked to call from a quiet environment; the recording was done using a telephony card, specifically a Dialogic CTI card. The speech was recorded at 11 kHz and 8 bits per sample using a home-grown data collection application.
Our approach retains the framework of the one-pass method, with the use of an appropriate PL. The use of a modified PL enables us to (a) avoid building an AM for the mixed language (note that a mixed language speech corpus is difficult to collect) and (b) perform recognition with the ASR of one of the languages. We used the public domain speech recognition engine Sphinx [15], with the HUB4 (English phones) AM in one set of experiments; in another set of experiments we used the readily available Hindi ASR [20] AM (Hindi phones). The reasons for using these AMs instead of an AM for the mixed language were that (a) these AMs were readily available for use and (b) building acoustic models for the mixed language was too cumbersome, requiring in-the-field collection of a large amount of speech corpus to which we did not have access. It should be noted that the Hindi ASR has 59 phonemes while English has only 39 phonemes. When using English acoustic models, we approximate those phonemes (mainly occurring in Hindi words) which are not in English by replacing the Hindi phoneme with a combination of two or more English phonemes [13]. The PL that supports the ASR is constructed in the usual way, using the CMU language toolkit [14], for all the English words in the corpus. However, all the Hindi words are first transliterated into English, and the pronunciation of each transliterated word is obtained using [14] or approximate phoneme mapping (APM).
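The phoneme substitution underlying APM can be sketched as a simple table lookup: each Hindi phoneme with no direct English counterpart is rewritten as one or more English phonemes, while shared phonemes pass through unchanged. Only the DH → "DH HH" pair below is taken from the paper; the other table entries are illustrative assumptions, not the actual mapping of [13].

```python
# Sketch of approximate phoneme mapping (APM). Each Hindi phoneme absent
# from the English (CMU) phone set is replaced by one or more English
# phonemes. Only the DH -> DH HH entry is from the paper; the remaining
# entries are hypothetical placeholders for illustration.

APM_TABLE = {
    "DH": ["DH", "HH"],   # aspirated dental 'dh', as given in the paper
    "T:": ["T"],          # hypothetical mapping for a retroflex stop
    "EI": ["EY"],         # hypothetical long-vowel mapping
}

def apm(hindi_phonemes):
    """Map a Hindi phoneme sequence to an approximate English one;
    phonemes common to both inventories pass through unchanged."""
    english = []
    for p in hindi_phonemes:
        english.extend(APM_TABLE.get(p, [p]))
    return english
```

With such a table, any word first transliterated and rendered in the 59-phoneme Hindi inventory can be expressed over the 39-phoneme English inventory expected by the HUB4 acoustic model.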

Results and Discussion
We conducted, in all, ten different experiments to evaluate the performance of our approach for ML-ASR. In the first set of experiments (nine in all) we used the English AM, while in the last experiment we used the Hindi language AM.
In all our experiments we used the Sphinx ASR [15] and the well-known n-gram LM created from the mixed language speech corpus that we collected (Section 3). What differed across the experiments was the manner of construction of the PL. The distribution of Hindi, English and proper noun words in the corpus was 62%, 28% and 10% respectively. For the first set of eight experiments, done using the English AM, we used two different methods of PL construction for the three different types of words, namely English words, Hindi words and proper nouns. The first method of PL creation is based on the CMU toolkit [14] and the second method is based on approximate phoneme mapping (APM). In the APM method of lexicon creation, a word is first transliterated and the equivalent Hindi phonemes are generated; each of these Hindi phonemes is then replaced by one or more equivalent English phonemes. For example, the Hindi word मत्स्यगंधा (Matsyagandha) is represented using the CMU toolkit as M AE TH S A Y A H G A H N D (see Fig. 5(a)), while the equivalent pronunciation representation using the Hindi phoneme set is M A T A S Y A G A N DH A (see Fig. 5(b)). Using APM, the same word is represented as shown in Fig. 5(c). Note that in APM, a Hindi phoneme is replaced by one or more equivalent English phonemes; for example, the phone DH, occurring only in Hindi, is substituted by the phones "DH HH" in English (see Fig. 5). Similarly, the English word "Identification" (आइडेंटिफिकेशन) can be transliterated as "aidentiphikation"; the equivalent pronunciation using the Hindi phoneme set is EI D E N: T: I PH I K EI SH A N A (see Fig. 5(b)), and using APM it is represented as AY D EH N T IY F IY K EY SH AH N (see Fig. 5(c)). In the ninth experiment, we represented every word in the PL using both alternative phonetic representations (see Fig. 5(d)), namely CMU and APM. Table 1 shows the experiment number and the method used to construct the PL.
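The dual-representation lexicon used in the ninth experiment can be sketched in the standard CMU-Sphinx dictionary convention, where an alternate pronunciation of a word is listed under WORD(2). The two pronunciation strings below are copied from the "Identification" example above; the helper function itself is a minimal sketch, not the paper's actual tooling.

```python
# Sketch of building a pronunciation lexicon in which every word carries
# two alternative pronunciations (as in Expt 9). The WORD / WORD(2) labels
# follow the standard CMU-Sphinx dictionary convention for alternates.

def build_lexicon(entries):
    """entries: {word: [pron1, pron2, ...]} -> list of dictionary lines."""
    lines = []
    for word, prons in sorted(entries.items()):
        for i, pron in enumerate(prons):
            label = word if i == 0 else f"{word}({i + 1})"
            lines.append(f"{label}\t{pron}")
    return lines

# The two pronunciations of "Identification" given in the paper:
# the APM rendering and the Hindi-phoneme-set rendering.
lexicon = build_lexicon({
    "IDENTIFICATION": [
        "AY D EH N T IY F IY K EY SH AH N",   # APM (English phones)
        "EI D E N: T: I PH I K EI SH A N A",  # Hindi phoneme set
    ],
})
```

In a real Sphinx dictionary, of course, every phone in a line must belong to the phone set of the acoustic model in use, so a deployed lexicon would hold the CMU-toolkit and APM variants (both over English phones) rather than the Hindi-phoneme rendering shown here for illustration.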
For example, in "Expt 6" APM was used to construct the pronunciations of the English and Hindi words, while CMU was used for the proper nouns. Pronunciation using the CMU toolkit is denoted as CMU, while approximate phoneme mapping is denoted as APM. The word error rates are shown separately for the Train dataset (Table 2) and the Test dataset (averaged over three rounds of cross validation; Table 3). In the case of the Train dataset, the textual data used for constructing the LM is the same as the corresponding speech data used for recognition, while in the case of the Test dataset the text data used for constructing the LM was not part of the speech data used for recognition; in that sense, the data used for LM construction and that used for recognition were complementary sets. It can be seen that the word accuracies of the ML-ASR for Hindi words and proper nouns are higher when the PL for the Hindi words and proper nouns is built using the approximated English phones (Expt 3, Expt 5 and Expt 9) than when the PL is built using the CMU toolkit (Expt 1, Expt 2, Expt 4, Expt 6, Expt 7 and Expt 8). We can conclude that representing non-English words (in this case, Hindi words and proper nouns) using approximate English phonemes decreases the WER. The overall WER is lowest when the English words are represented using the CMU toolkit and the Hindi words and proper nouns are represented using APM (Expt 3 and Expt 9) in the PL. Also note that the performance on English words in Expt 1 and Expt 8 is far poorer than in all the other experiments; this can be attributed to the imperfect representation of the Hindi words (or proper nouns) in Expt 1 and Expt 8, resulting in misrecognition of the English words preceding or succeeding Hindi words or proper nouns (we used a 3-gram representation of the mixed language in the LM). In the last experiment (Hindi-ASR), we used the Hindi AM (16 kHz) [20]. This AM consists of 59 phonemes, and the PL was constructed using the Hindi phone set; the English words in the lexicon were constructed by transliterating the English words.
As the Hindi phoneme set (59 phonemes) is a superset of the English phone set (39 phonemes), and the majority of the words spoken in the mixed language are Hindi, a decrease in WER can be observed in the Hindi-ASR experiment compared to experiments Expt 1 to Expt 8. The accuracy (100 - WER), namely the percentage of words correctly recognized, is higher with Hindi-ASR (60.23%) than in all the other experiments except Expt 9 (68.43%). Further, we observe an improvement in the recognition of Hindi words and proper nouns when the Hindi AM is used. As expected, the WER is better for the training set (Table 2) than for the test set (Table 3) in all the experiments.
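The accuracies quoted above follow the usual convention accuracy = 100 - WER, where WER is computed as the word-level edit distance between the reference transcription and the recognizer's hypothesis, normalised by the reference length. A minimal sketch of that computation:

```python
# Sketch of word error rate (WER) computation via word-level Levenshtein
# distance (substitutions, insertions, deletions), the metric behind the
# accuracy figures quoted above (accuracy = 100 - WER).

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                           # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                           # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

For example, against the reference "a b c d", the hypothesis "a x c" incurs one substitution and one deletion, giving a WER of 50% and hence an accuracy of 50%.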

Conclusions
Mixed language automatic speech recognition (ML-ASR) is gaining increasing attention because of the widespread use of mixed language in everyday conversation and, more importantly, because of its acceptance in society. The best approach to building an ASR to recognize mixed language would be to treat the mixed language as a language in itself and build the AM, LM and PL as is done for a language-specific ASR. This, however, would involve the expensive and time-consuming task of collecting a large amount of mixed language speech and text corpus and using this corpus to build the AM, LM and PL for the mixed language; note that a separate speech and text corpus would have to be collected for each mixed language pair. In this paper we have shown a usable, novel approach to enabling mixed language speech recognition by making use of available resources (English acoustic models and Hindi acoustic models, but not English-Hindi mixed acoustic models) and (a) carefully constructing a PL for the mixed language words and (b) constructing an LM from a small mixed language text corpus. The advantages of our approach are that (a) there is no need to segment the speech and identify the language, which in most conversational speech is very difficult because in mixed speech the switch from one language to another is very fast, and (b) it does not require one to collect an extensive speech corpus to construct the acoustic models for mixed language recognition. It should be noted that this approach can be applied with any other Indian language taking the place of Hindi; this would only require an appropriate mapping of the phones in that language to the English phone set.