User:Kapsberger/speech-analysis

When you speak, air passes through your glottis, vocal cords and mouth. Voiced sounds such as vowels make the vocal cords vibrate periodically, typically at 60 Hz or higher for men and up to 300 Hz for women and children. The mouth acts as a filter that shapes the sound into a specific phone. As you speak, the air molecules in front of you vibrate as well, their displacement being proportional to the loudness of your speech. This loudness, or amplitude, of the sound is measured in decibels.

The source-filter model suggests that if we know the nature of the sound produced at the vocal cords (the source) and the shape of the articulatory instrument (the mouth, or filter), then we can recognise or produce any given speech sound.

We will assume that the vocal cords either vibrate periodically or produce random white noise. In the case of a periodic signal, we can analyse a single period T of a given waveform (recorded speech). In the case of white noise, we can characterise the signal by its mean and variance.
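As a quick numerical sketch of the white-noise case (the sample rate, seed and noise variance below are illustrative assumptions, not measured values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unvoiced excitation modelled as white noise: 1 second at 8 kHz,
# zero mean and standard deviation 0.1 (variance 0.01), chosen arbitrarily.
noise = rng.normal(loc=0.0, scale=0.1, size=8000)

# The two statistics that summarise the segment.
mean = noise.mean()       # close to 0
variance = noise.var()    # close to 0.01
print(mean, variance)
```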

Periodic Signals: Fourier Analysis
Fourier Analysis is the method usually employed to describe a periodic signal. In a nutshell, the Fourier method says that any given periodic signal can be reduced to a sum of sinusoids with appropriate amplitudes and phases, whose frequencies are integer multiples of the fundamental frequency. The fundamental frequency F0 reflects the rate of cycling of the vocal folds in the larynx during voiced sounds. The remaining frequencies ($$F_n = nF_0\,$$ for $$n = 1, 2, \dots\,$$) are called harmonics. The interest of Fourier Analysis is that the human ear performs a frequency-based analysis of sounds too: each hair cell in the cochlea responds best to a particular frequency; we therefore have a method that closely matches the function of its human counterpart.

The result of the Fourier analysis is a graph called a spectrum, which shows amplitude as a function of frequency: for each harmonic, the graph displays the amplitude of that harmonic as given by the Fourier equation. We say that the analysis allows us to move from the time domain of the waveform to the frequency domain of the spectrum.
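The move from time domain to frequency domain can be sketched numerically; the fundamental frequency, harmonic amplitudes and sample rate below are illustrative assumptions:

```python
import numpy as np

# One second of a synthetic voiced signal: F0 = 100 Hz plus two
# harmonics at 200 Hz and 300 Hz, sampled at 8 kHz (all illustrative).
fs = 8000
t = np.arange(fs) / fs
x = (1.0 * np.sin(2 * np.pi * 100 * t)
     + 0.5 * np.sin(2 * np.pi * 200 * t)
     + 0.25 * np.sin(2 * np.pi * 300 * t))

# Time domain -> frequency domain.
spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), 1 / fs)
magnitude = np.abs(spectrum) / len(x)   # amplitude per harmonic

# The largest peak sits at the fundamental frequency F0.
peak = freqs[np.argmax(magnitude)]
print(peak)  # ~100 Hz
```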

Now, in order to understand how we plot the spectrum, we need to look at the ways we can write the Fourier series mathematically. We said that any periodic signal can be created by summing sinusoids:

$$x(t) = a_0 \cos(0 \cdot \omega_0 t + \phi_0) + a_1 \cos(1 \cdot \omega_0 t + \phi_1) + ... + a_n \cos(n \cdot \omega_0 t + \phi_n)\,$$

where $$\omega_0\,$$ is the angular frequency of the fundamental, with $$\omega_0 = 2\pi F_0\,$$, so that the harmonic $$F_n = nF_0\,$$ has angular frequency $$n\omega_0\,$$
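A minimal sketch of this sum, with arbitrarily chosen amplitudes and phases for the first few harmonics:

```python
import numpy as np

fs = 8000                       # sample rate (illustrative)
t = np.arange(160) / fs         # 20 ms of signal
F0 = 100                        # fundamental frequency (illustrative)
w0 = 2 * np.pi * F0

# Arbitrary amplitudes a_n and phases phi_n for n = 0..3.
a = [0.0, 1.0, 0.5, 0.25]
phi = [0.0, 0.0, np.pi / 4, np.pi / 2]

# x(t) = sum_n a_n * cos(n * w0 * t + phi_n)
x = sum(a[n] * np.cos(n * w0 * t + phi[n]) for n in range(len(a)))

# The sum is periodic with period 1/F0 = 10 ms (80 samples here).
period = int(fs / F0)
print(np.allclose(x[:period], x[period:2 * period]))  # True
```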

This can be written in a simpler way by using Euler's Formula, which states that:

$$e^{j\theta} = \cos\theta + j \sin\theta \,$$.

Using Euler's formula, we can write:

$$A \cos(\omega t + \phi) + A j\sin(\omega t + \phi) = Ae^{j(\omega t + \phi)} = Ae^{j\phi}e^{j\omega t} = Xe^{j\omega t}\,$$

with $$X = Ae^{j\phi}\,$$.
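Both identities are easy to check numerically; the values of A, φ, ω and t below are arbitrary:

```python
import cmath
import math

# Euler's formula: e^{j*theta} = cos(theta) + j*sin(theta)
theta = 0.7
assert cmath.isclose(cmath.exp(1j * theta),
                     complex(math.cos(theta), math.sin(theta)))

# Folding the phase into the coefficient: A*e^{j(wt+phi)} = X * e^{jwt}
# with X = A*e^{j*phi}.
A, phi = 2.0, 0.3
w, t = 2 * math.pi * 100, 0.005
full = A * cmath.exp(1j * (w * t + phi))
X = A * cmath.exp(1j * phi)
assert cmath.isclose(full, X * cmath.exp(1j * w * t))
print("both identities hold")
```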

This is an important result because $$e^{j\omega t}\,$$ is free of phase information. As we will see below, it is important to be able to separate the phase from the frequency content.

We can finally rewrite the Fourier series as

$$x(t) = \sum_{k=-\infty}^{+\infty} a_k e^{jk\omega_0 t}\,$$ where $$a_k = A_k e^{j\phi_k}\,$$.

With the help of these notations, it is now easy to see that the frequency domain can either be plotted in Cartesian form, showing a real and an imaginary spectrum, or in polar form, split into a magnitude and a phase spectrum. Interestingly, it can be shown that all of these spectra are affected by a change in phase except the magnitude spectrum. We say that the magnitude spectrum best describes the essential characteristics of the phone under consideration. Indeed, the human ear is fairly insensitive to phase information: the sound produced by the same source at time t and at time t+1 sounds the same. This gives us the important result that two waveforms sound the same to a listener, regardless of their original phase and shape, as long as their magnitude spectra are the same.
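This can be verified numerically: two waveforms built from the same harmonic amplitudes but different phases have different shapes yet identical magnitude spectra (the frequencies and phase offsets below are arbitrary):

```python
import numpy as np

fs = 8000
t = np.arange(800) / fs   # 0.1 s, an exact number of cycles at 100/200 Hz

# Same amplitudes (1.0 and 0.5), different phases.
x1 = np.cos(2 * np.pi * 100 * t) + 0.5 * np.cos(2 * np.pi * 200 * t)
x2 = (np.cos(2 * np.pi * 100 * t + 1.0)
      + 0.5 * np.cos(2 * np.pi * 200 * t - 0.5))

m1 = np.abs(np.fft.rfft(x1))   # magnitude spectrum of x1
m2 = np.abs(np.fft.rfft(x2))   # magnitude spectrum of x2

print(np.allclose(x1, x2))     # False -- the waveform shapes differ
print(np.allclose(m1, m2))     # True  -- the magnitude spectra match
```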

Aperiodic Signals: Fourier Transform
When dealing with aperiodic signals, it does not make sense to talk about harmonics: the magnitude and the phase of the signal are continuous functions of frequency. Instead of Fourier Analysis, we then use the Fourier Transform.

We first need to invoke the quasi-stationarity assumption, which states that over a very short period, speech can be considered stationary. We divide the waveform into small segments ranging from 10 ms to 20 ms and consider the speech in each of these segments to have unchanged features for the duration of the segment. We then construct a periodic signal by repeating the same segment to infinity. It is important to note here that the signal should be symmetric with respect to the origin. The newly created periodic signal can now be analysed as before through Fourier Analysis.
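A minimal sketch of the segmentation step, assuming non-overlapping fixed-length frames (real analysis systems typically use overlapping, windowed frames):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20):
    """Split a waveform into consecutive fixed-length frames.

    Under the quasi-stationarity assumption each 10-20 ms frame is
    treated as stationary. Non-overlapping frames and the trailing
    truncation are simplifying choices for this sketch.
    """
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

fs = 8000
x = np.random.randn(fs)        # 1 s of noise as a stand-in for speech
frames = frame_signal(x, fs)   # 20 ms frames -> 160 samples each
print(frames.shape)            # (50, 160)
```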

Source-Filter Model
The magnitude spectrum obtained through Fourier Analysis or the Fourier Transform is regarded as the result of multiplying the effect of the source, i.e. the random noise or periodic impulses coming through the vocal cords, by the effect of the vocal tract response, i.e. the shape of the mouth cavity. The vocal tract response is characterised by the overall shape of the magnitude spectrum. The peaks in that spectrum represent what are called the formants of the sound, that is, the defining frequencies of that particular phone. So the air passing through the vocal cords is the source of the model, while the shape of the oral cavity is the filter.
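A sketch of the model in the frequency domain, multiplying the magnitude spectrum of a periodic impulse source by a made-up single-resonance filter response (the formant frequency and bandwidth are invented for illustration):

```python
import numpy as np

fs = 8000
n = 1024
freqs = np.fft.rfftfreq(n, 1 / fs)

# Source: a periodic impulse train at F0 = 100 Hz (voiced excitation).
F0 = 100
source = np.zeros(n)
source[:: fs // F0] = 1.0           # one impulse every 80 samples
source_mag = np.abs(np.fft.rfft(source))

# Filter: a hypothetical vocal-tract magnitude response with one
# resonance ("formant") near 500 Hz, 200 Hz wide -- purely illustrative.
formant = 500.0
filter_mag = 1.0 / (1.0 + ((freqs - formant) / 200.0) ** 2)

# Source-filter model: output magnitude = source magnitude * filter response.
output_mag = source_mag * filter_mag

# The strongest harmonic in the output sits near the formant.
print(freqs[np.argmax(output_mag)])  # ~500 Hz
```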

Observable Frequencies
It should be mentioned at this stage that real signals contain frequencies up to infinity but that the human ear can only perceive sounds up to about 20 kHz. In addition, the maximum frequency that can be represented in a digital signal with sample period T is half the sample rate 1/T, i.e. 1/(2T). This frequency is called the Nyquist frequency. Keeping the frequency content of the signal below the Nyquist frequency allows us to avoid the effects of aliasing.

In order to understand where the Nyquist frequency comes from, it is necessary to look at the sampling process. Sampling is the step where the continuous-time signal is converted into a digital signal. The original speech signal is sampled at regular intervals with period T: the value recorded at time t becomes the sampled value for that period. At this point, the y-axis still ranges over continuous values. In a second step, the values on the y-axis are also quantised, i.e. rounded to the nearest value on a scale, the number of values depending on the number of bits allocated to the information. Taking the spectrum of the sampled signal results in a periodic graph with period 1/T.
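The quantisation step can be sketched as rounding each sample onto a uniform grid of 2^n levels (the bit depth, sample rate and test tone below are illustrative):

```python
import numpy as np

def quantize(x, n_bits):
    """Round samples in [-1, 1) to a uniform grid of 2**n_bits levels."""
    levels = 2 ** n_bits
    step = 2.0 / levels
    return np.clip(np.round(x / step) * step, -1.0, 1.0 - step)

fs = 8000
t = np.arange(80) / fs                 # 10 ms sampled with period T = 1/fs
x = 0.8 * np.sin(2 * np.pi * 440 * t)  # an illustrative 440 Hz tone

x8 = quantize(x, 8)                    # 8 bits -> 256 levels
max_error = np.max(np.abs(x - x8))
print(max_error <= 1.0 / 256)          # error bounded by half a step
```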

It can be observed that the shape of the spectrum is symmetric with respect to multiples of the point $$f_c = \frac{1}{2T}\,$$. If the highest frequency present in the signal, $$f_{max}\,$$, satisfies $$f_{max} < \frac{1}{2T}\,$$, then the shape of each period of the spectrum is clearly repeated, with a magnitude of zero at the point $$f_c\,$$. If however $$f_{max} > \frac{1}{2T}\,$$, then the periods overlap, resulting in an incorrect spectrum at the point $$f_c\,$$ and in its vicinity.
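Aliasing can be demonstrated directly: a tone above the Nyquist frequency reappears folded down into the observable range (the tone and sample rate are chosen for illustration):

```python
import numpy as np

fs = 1000                 # sample rate -> Nyquist frequency 500 Hz
t = np.arange(fs) / fs    # 1 second of samples

# A 700 Hz tone exceeds the 500 Hz Nyquist frequency, so it folds
# down to |fs - 700| = 300 Hz in the sampled signal's spectrum.
x = np.sin(2 * np.pi * 700 * t)
mag = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / fs)

print(freqs[np.argmax(mag)])  # ~300 Hz, the aliased frequency
```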