GMM-Based Emotional Voice Conversion Using Spectrum and Prosody Features

We propose Gaussian mixture model (GMM)-based emotional voice conversion using spectrum and prosody features. In recent years, speech recognition and synthesis techniques have matured, and emotional voice conversion is now needed to synthesize more expressive voices. Conventional emotional conversion transforms neutral prosody into emotional prosody using a huge speech corpus. In this paper, we convert a neutral voice to an emotional voice using GMMs. GMM-based spectrum conversion is widely used to modify non-linguistic information such as voice characteristics while keeping linguistic information unchanged. Because conventional methods convert either prosody or voice quality (spectrum) but not both, some emotions are not converted well. Our method uses both prosody and voice quality to convert a neutral voice to an emotional voice, and it obtains more expressive voices than conventional methods that convert prosody or spectrum alone.


Introduction
In recent years, speech synthesis techniques have been well developed, with applications such as text-reading systems, speech-oriented guidance systems, and singing-voice synthesis [1][2][3]. However, these systems synthesize only the linguistic information and cannot handle human emotion.
The conventional approach to emotional speech synthesis replaces prosody using a huge speech corpus, which requires enormous time and effort to convert emotional prosody. Mori et al. [4] proposed an F0 synthesis method that uses a subspace constraint in prosody. In this method, principal component analysis is adopted to reduce the dimensionality of the prosodic components, which also allows new speech similar to the training samples to be generated. Wo et al. [5] proposed hierarchical prosody conversion, in which the pitch contour of the source speech is decomposed into a hierarchical prosodic structure consisting of sentence, prosodic word, and subsyllable levels. Veaux et al. [6] proposed an F0 conversion system based on a Gaussian mixture model (GMM). A GMM is used to map the prosodic features between neutral and expressive speech, and the converted F0 contour is generated under dynamic feature constraints. However, these methods do not include conversion of voice quality (spectrum), and hence some emotions are not converted well.
A GMM is widely used in spectrum conversion to modify non-linguistic information such as voice characteristics while keeping linguistic information unchanged [7][8][9]. Toda et al. applied this method to articulatory speech synthesis [10] and to a speaking-aid system for laryngectomees [11].
In this paper, we propose an emotional voice conversion method that converts both voice quality and prosody. Voice quality is synthesized by spectrum conversion using the GMM and the maximum likelihood method [7][8][9]. Prosody is converted using GMM-based F0 conversion. Our results show that emotions are synthesized more effectively when both F0 and spectrum are converted.
The rest of this paper is organized as follows: In Sec. 2, GMM-based voice conversion is introduced; our proposed method is developed in Sec. 3; the experimental results are described in Sec. 4; and the final section is devoted to our conclusions.

GMM-Based Voice Conversion
These parameters are estimated using the EM algorithm.
The likelihood function is given by [equation missing in source], where m = {m_1, m_2, ..., m_T} is a mixture component sequence. The m-th conditional probability distribution is given by [equation missing in source], where W is a 2KT-dimensional square matrix [8]. Hence we seek [equation missing in source]. Introducing the following approximation, [equation missing in source], we obtain the suboptimum mixture component sequence m̂ and the converted static feature vector ŷ as follows: [equations missing in source].
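For orientation, the maximum-likelihood conversion that this section follows is usually written as below in the GMM voice conversion literature cited as [7][8][9]. The notation is an assumption: X_t and Y_t are the source and target static-plus-dynamic features at frame t, λ denotes the GMM parameters, and E and D are conditional means and covariances. This is a sketch of the standard formulation, not a recovery of the paper's own numbered equations, so equation numbers are deliberately omitted.

```latex
\begin{align*}
% Joint-density GMM over stacked source/target features
P(Z_t \mid \lambda) &= \sum_{m=1}^{M} w_m\,
    \mathcal{N}\!\left(Z_t;\ \mu_m^{(Z)}, \Sigma_m^{(Z)}\right),
    \qquad Z_t = \left[X_t^{\top},\, Y_t^{\top}\right]^{\top} \\
% Likelihood of the target sequence given the source sequence
P(Y \mid X, \lambda) &= \sum_{\text{all } m} P(m \mid X, \lambda)\, P(Y \mid X, m, \lambda),
    \qquad m = \{m_1, m_2, \ldots, m_T\} \\
% Conditional distribution of one frame for mixture component m_t
P(Y_t \mid X_t, m_t, \lambda) &= \mathcal{N}\!\left(Y_t;\ E_{m_t,t},\ D_{m_t}\right) \\
E_{m,t} &= \mu_m^{(Y)} + \Sigma_m^{(YX)} \left(\Sigma_m^{(XX)}\right)^{-1}
    \left(X_t - \mu_m^{(X)}\right) \\
D_m &= \Sigma_m^{(YY)} - \Sigma_m^{(YX)} \left(\Sigma_m^{(XX)}\right)^{-1} \Sigma_m^{(XY)} \\
% Objective: the static sequence y whose static+dynamic expansion Wy is most likely
\hat{y} &= \arg\max_{y}\ P(W y \mid X, \lambda) \\
% Approximation with a single (suboptimum) mixture component sequence
P(Y \mid X, \lambda) &\simeq P(\hat{m} \mid X, \lambda)\, P(Y \mid X, \hat{m}, \lambda),
    \qquad \hat{m} = \arg\max_{m}\ P(m \mid X, \lambda) \\
\hat{y} &= \left(W^{\top} D_{\hat{m}}^{-1} W\right)^{-1} W^{\top} D_{\hat{m}}^{-1} E_{\hat{m}}
\end{align*}
```

In the last line, E and D with subscript m̂ denote the frame-wise conditional means and covariances stacked over the whole utterance for the selected component sequence.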

Prosody Conversion [6]
Prosody conversion is performed by applying the conversion method described in Sec. 2.1 to F0.
The target static feature vector of the i-th syllable is represented as [equation missing in source]. The dynamic feature is calculated from the static feature, and X_i represents the static and dynamic features of the i-th syllable. The source vector and the target vector are augmented as [equation missing in source]. Hence, we obtain ŷ from Eq. (9).

GMM-Based Emotional Conversion
In this paper, both the spectrum envelope and the basic frequency (fundamental frequency, F0) extracted from a neutral voice are converted to those of emotional voices. The target emotions are "Anger", "Sadness", and "Joy", and a GMM is constructed for each emotion. Our system has two phases: the training phase and the conversion phase.
The outline of the training phase is shown in Fig. 1. The neutral and emotional voices contain the same word and are spoken by the same speaker. The spectrum envelope, basic frequency, and aperiodic component are extracted from these two voices using the STRAIGHT analysis method [12][13][14][15]. The aperiodic component is not used in the training phase.
The outline of the conversion method is shown in Fig. 2. The extracted basic frequency is divided into syllables and converted using the F0 GMM trained in Fig. 1. The spectrum envelope is converted using the spectrum GMM. The emotional voice is synthesized from the converted F0, the converted spectrum envelope, and the source speaker's aperiodic component using the STRAIGHT synthesis method.

Spectrum Conversion
The spectrum envelope, which is extracted using the STRAIGHT analysis method, is converted using the GMM described in Sec. 2.1. Because the source and target utterances differ in duration, their spectrum sequences must be time-aligned by the DP-matching algorithm. To reduce the dimension of the envelope, static features are represented by its first 12 DCT coefficients in Eq. (12).
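The DP-matching step can be illustrated with a minimal dynamic time warping sketch. It assumes Euclidean frame distance and unit-weight diagonal, vertical, and horizontal steps; the paper does not state its local constraints, so these choices and the function name `dtw_path` are illustrative assumptions rather than the authors' implementation.

```python
import math


def dtw_path(source, target):
    """Align two non-empty feature sequences.

    Returns a list of (source_index, target_index) pairs.
    Assumptions (not from the paper): Euclidean frame distance and
    unit-weight diagonal, vertical, and horizontal steps.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = minimum accumulated distance aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the last frame pair to the first frame pair.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while i > 1 or j > 1:
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return path
```

Each pair in the returned path identifies one neutral frame and one emotional frame that can be stacked into a joint vector for GMM training.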
Dynamic features are defined as follows: [equation missing in source]. These features are modelled using Eq. (1).
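As an illustration of how dynamic features are typically appended to static ones, the sketch below uses a central difference. This delta definition is an assumption, since the paper's own definition is missing from this text, and `append_delta` is a hypothetical helper name.

```python
def append_delta(static_frames):
    """Return frames of [static, delta] features.

    Assumption (the paper's own definition is missing from this text):
    delta[t] = 0.5 * (x[t + 1] - x[t - 1]), with the first and last
    frames repeated at the sequence edges.
    """
    last = len(static_frames) - 1
    augmented = []
    for t, frame in enumerate(static_frames):
        prev_frame = static_frames[max(t - 1, 0)]
        next_frame = static_frames[min(t + 1, last)]
        delta = [0.5 * (nxt - prv) for nxt, prv in zip(next_frame, prev_frame)]
        augmented.append(list(frame) + delta)
    return augmented
```

With 12 static DCT coefficients per frame, each augmented frame has 24 dimensions under this assumption.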

Figure 2. Conversion process
The basic frequency, which is extracted using STRAIGHT, is also converted using the GMM in Sec. 2.2. Fig. 3 shows how to extract the prosody feature from the Japanese word "AMAGAERUWA". The basic frequency cannot be converted frame by frame because the basic frequency of a single frame is only a one-dimensional value. Therefore, the word is divided into syllables to obtain the prosody feature. In this paper, the contour of a syllable is represented by its first 5 DCT coefficients. When the contour length is defined as L, the coefficients are normalized by [equation missing in source].
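The syllable-level prosody feature can be sketched as a truncated DCT of the F0 contour. The normalization used here, dividing each coefficient by the contour length L, is an assumption, because the paper's normalization formula is missing from this text; `contour_to_dct` is a hypothetical helper name.

```python
import math


def contour_to_dct(contour, n_coeffs=5):
    """Return the first n_coeffs DCT-II coefficients of an F0 contour.

    Assumption (the paper's formula is missing from this text): each
    coefficient is divided by the contour length L, so that contours of
    different durations yield comparable fixed-size vectors.
    """
    length = len(contour)
    coeffs = []
    for k in range(n_coeffs):
        total = sum(
            value * math.cos(math.pi * (i + 0.5) * k / length)
            for i, value in enumerate(contour)
        )
        coeffs.append(total / length)
    return coeffs
```

Under this normalization the zeroth coefficient is the mean F0 of the syllable, and the remaining coefficients describe the shape of the contour, so syllables of any length map to a 5-dimensional vector.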

Experiments
In our experiment, neutral words are converted to emotional words. We performed five types of experiments, as shown in Table 1. In experiments (a) and (b), only the spectrum envelope or only the F0 is converted, respectively. To provide a reference for the effectiveness of conversion, the neutral spectrum envelope or F0 is replaced with the emotional one from the target voice in experiments (d) and (e), respectively.

Experimental Conditions
The "Keio University Japanese Emotional Speech Database" was used in our experiment. A male Japanese speaker with acting experience recorded each of 20 words in 47 emotions. We used four categories from the database: "Neutral", "Anger", "Joy", and "Sadness".
The speech data was recorded directly onto a hard disk drive through a microphone connected to a computer in a soundproof room. Waveforms were digitized with 16-kHz sampling and 16-bit quantization.
In our experiment, we converted "Neutral" to "Anger", "Joy", and "Sadness". Training and conversion were conducted separately for each emotion. All 20 recorded words were used as training data, and we converted the same 20 words. The number of GMM mixtures was set to 64 for both spectrum and F0 conversion.

Experimental Results
We performed a subjective emotional classification test. The listeners were 10 Japanese speakers. Each listener classified every converted voice as one of "Neutral", "Anger", "Joy", or "Sadness".
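The classification rate reported for each condition is the share of listener responses that fall on a given label. A minimal sketch of that tally follows; the function name `classification_rates` is hypothetical, and any responses passed to it in an example are made-up inputs, not data from the paper's tables.

```python
from collections import Counter

EMOTIONS = ("Neutral", "Anger", "Joy", "Sadness")


def classification_rates(responses):
    """Return the percentage of responses assigned to each emotion label.

    `responses` is a non-empty list of labels chosen by listeners for
    voices converted toward a single target emotion.
    """
    counts = Counter(responses)
    total = len(responses)
    return {emotion: 100.0 * counts[emotion] / total for emotion in EMOTIONS}
```

The rate for the target emotion is the value that a results table would report as the correct classification rate, and the remaining entries show which labels the errors fall on.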
Table 2 gives the results of subjective emotional classification for the recorded words. A classification rate of 100% was obtained for all emotions; hence, the corpus is sufficient for recognizing emotion. The classification results for the converted voices are shown in Table 3. Results of spectrum conversion only are shown in Table 3-(a). Almost half of the listeners classified "Anger" correctly. However, the other emotions tended to be classified as "Neutral". Hence, spectral conversion alone is insufficient for emotional conversion. Table 3-(b) shows the results of converting the basic frequency only. A classification rate of 80% was obtained for "Sadness". Therefore, "Sadness" can be expressed by basic frequency conversion alone. However, "Anger" tends to be classified as "Neutral", and "Joy" is not classified well. Hence, conversion of the basic frequency alone is also insufficient for emotional conversion.
The results of our proposed method are shown in Table 3-(c). The classification rate of "Sadness" did not increase in comparison with Table 3-(b). Hence, spectrum conversion did not contribute to the conversion of "Sadness". The classification rates of "Anger" and "Joy" greatly increased in comparison with Table 3-(a) and 3-(b). Hence, converting both spectrum and F0 is highly effective for "Anger" and "Joy".

Discussion
We performed three types of conversion (spectrum only, F0 only, and both) for three target emotions. In spectrum conversion, about half of the listeners classified "Anger" correctly. The other two emotions tended to be classified as "Neutral". Table 3-(d) shows the results obtained by replacing only the neutral spectrum with the emotional spectrum of the target voice. The classification rates of "Sadness" and "Joy" in Table 3-(d) were almost the same as those in Table 3-(a). A classification rate of 65% was obtained for "Anger", which is close to the rate for "Anger" in Table 3-(a). Hence, spectrum conversion has a significant influence on synthesizing "Anger". Moreover, the experimental results show that converting the spectrum envelope alone is insufficient for emotional conversion.
In F0 conversion, "Sadness" obtained a high classification rate; however, the other two emotions did not. Table 3-(e) shows the results obtained by replacing only the neutral F0 with the emotional F0 of the target voice. "Sadness" in Table 3-(e) obtained a classification rate of almost 100%. Therefore, F0 conversion has a significant influence on synthesizing "Sadness". Also, the results in Table 3-(b) are similar to those in Table 3-(e). Hence, F0 conversion is sufficiently accurate for emotional conversion.
The results obtained using our proposed method, which combines F0 and spectrum conversion, are shown in Table 3-(c). The classification rates of "Anger" and "Joy" increased over those in Table 3-(a) and (b). "Anger" obtained a classification rate of 65%, the same as in Table 3-(d). Our method seems to recover the classification rates of emotion obtained by converting F0. "Joy" obtained 45% in Table 3-(c), which is higher than the rates in Table 3-(d) and (e). These rates show the effectiveness of our proposed method.

Conclusions
We proposed an emotional voice conversion method that converts both spectrum and prosody. Experimental results show that converting both spectrum and prosody is effective in synthesizing "Anger" and "Joy". "Sadness" could be synthesized using prosody conversion only, and spectrum conversion had no additional effect.