Audio Classification of Paralyzed Speech (Part I)
Introduction:
In the market today there are various speech recognition tools that can translate your speech to words and sentences, but no framework exists that can decipher personalized paralyzed speech to text. There are several complex neurological diseases, such as PSP, CBD, and MSA, that take away regular verbal communication.
In this part, we'll explore the science behind human speech and how Machine Learning can be used for audio phoneme classification.
Part II: a simple Flask app built with the CNN model to record and understand speech using a Bag of Words.
Human Speech and Sound:
- The frequency of the glottal pulse comes from the vibration of the vocal cords (vocal folds) in the larynx.
- Depending on how you shape your vocal tract, you get different speech signals from similar glottal pulses.
- The vocal tract (oral cavity and mouth) finally shapes the spectral envelope of the sound, which carries the personalized phonemes (vowels/consonants).
- Resource: https://www.youtube.com/watch?v=4_SH2nfbQZ8
We can say:
Speech = glottal pulse convolved with the vocal tract's impulse response (equivalently, their spectra multiply in the frequency domain).
But to capture the identity of speech (phonemes, timbre), we need the spectral envelope part, not the glottal pulse.
Good so far. But how do we extract the spectral envelope from the speech log-spectrum?
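To make that concrete, here is a minimal NumPy sketch of the classic cepstral trick: in the log-magnitude spectrum the convolution becomes a sum, and keeping only the low "quefrency" coefficients of its inverse FFT leaves the smooth spectral envelope (the file name and lifter length are illustrative assumptions):

import numpy as np
import librosa

signal, sr = librosa.load("letter.wav", sr=22050)  # illustrative clip, > 2048 samples
frame = signal[:2048] * np.hanning(2048)           # one windowed frame

log_spectrum = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
cepstrum = np.fft.irfft(log_spectrum)              # real cepstrum of the frame

# Low quefrencies ~ vocal tract (envelope); high quefrencies ~ glottal excitation
lifter = 30                                        # illustrative cut-off
cepstrum[lifter:-lifter] = 0
spectral_envelope = np.fft.rfft(cepstrum).real     # smoothed log-magnitude spectrum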
MFCCs (Mel Frequency Cepstral Coefficients) come to the rescue. MFCCs model human perceptual sensitivity at different frequencies by converting conventional frequency to the Mel scale, and are therefore well suited to speech recognition tasks (they capture the frequency range at which humans speak and hear).
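As a quick illustration of that perceptual warping, the HTK mel formula m = 2595 * log10(1 + f/700) compresses high frequencies much more than low ones; librosa exposes it through hz_to_mel:

import librosa

# The same 3000 Hz step spans far fewer mels at high frequencies than at low ones
for f in [1000, 4000, 7000]:
    print(f, "Hz ->", round(float(librosa.hz_to_mel(f, htk=True)), 1), "mel")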
The Mel frequency scale provides a clear distinction between vocal tract phoneme responses. The diagram below shows the letters 'G' and 'E' spoken by a neuro patient, with the energy of 13 MFCC frequency bands.
import librosa

n_fft = 2048
hop_length = 256
sr = 22050

# Load the letter clips; "E-2-0-1.wav" is the clip shown above, the 'G' file name is illustrative
g_signal, _ = librosa.load("G-2-0-1.wav", sr=sr)
e_signal, _ = librosa.load("E-2-0-1.wav", sr=sr)
g_MFCCs = librosa.feature.mfcc(y=g_signal, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mfcc=13)
e_MFCCs = librosa.feature.mfcc(y=e_signal, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mfcc=13)
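To reproduce diagrams like the ones above, the MFCC matrices can be plotted with librosa's display helper (a minimal sketch, continuing from the variables defined above):

import matplotlib.pyplot as plt
import librosa.display

fig, ax = plt.subplots()
img = librosa.display.specshow(e_MFCCs, sr=sr, hop_length=hop_length,
                               x_axis="time", ax=ax)
ax.set(title="13 MFCCs for the letter 'E'", ylabel="MFCC band")
fig.colorbar(img, ax=ax)
plt.show()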
Now we can train a Convolutional Neural Network (CNN) on these MFCC "images" to learn the sounds.
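A minimal Keras sketch of such a CNN is shown below; the frame count and layer sizes are illustrative assumptions rather than the exact architecture used here, and the input shape must match the MFCC matrix size of your clips:

from tensorflow.keras import layers, models

n_mfcc, n_frames = 13, 44   # MFCC bands x time frames per clip (44 is illustrative)
num_classes = 26            # letters A-Z

model = models.Sequential([
    layers.Input(shape=(n_mfcc, n_frames, 1)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation="softmax"),
])
model.summary()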
Approach:
Data Gathering:
- Record each letter A-Z a minimum of 100 times. The recordings were made in .mp3 format.
- Use a simple microphone or a mobile device to record (a Python recording sketch follows this list).
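If you would rather capture the takes directly from Python than from a phone app, a minimal sketch with the sounddevice and soundfile packages (an assumption, not part of the original workflow) could look like this:

import sounddevice as sd
import soundfile as sf

sr = 22050        # sample rate, matching the analysis settings used elsewhere
duration = 2.0    # seconds per take

# Record one mono take and save it; the file name pattern is illustrative
take = sd.rec(int(duration * sr), samplerate=sr, channels=1)
sd.wait()
sf.write("A-0.wav", take, sr)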
Data Preparation:
- Convert the .mp3 files to .wav format (see the sketch after this list).
- Reduce the noise in the recordings and remove signal below 15 dB. This is a very important step, as the final recordings must be as noise-free as possible.
- Create equal-size audio chunks for each letter.
- Create multiple files with audio clips of the same length.
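A possible sketch of these preparation steps using the pydub package (which requires ffmpeg); the file names are illustrative, and the 15 dB cut is interpreted here as 15 dB below the clip's average loudness:

from pydub import AudioSegment
from pydub.silence import split_on_silence

# 1) Convert .mp3 to .wav
audio = AudioSegment.from_mp3("A-raw.mp3")
audio.export("A-raw.wav", format="wav")

# 2) Strip quiet passages (15 dB below the clip's average loudness)
voiced = split_on_silence(audio, min_silence_len=300,
                          silence_thresh=audio.dBFS - 15)
cleaned = sum(voiced, AudioSegment.empty())

# 3) Cut into equal-size chunks (pydub slices in milliseconds);
#    the last chunk may be shorter and can be padded or dropped
chunk_ms = 1000
for i, start in enumerate(range(0, len(cleaned), chunk_ms)):
    cleaned[start:start + chunk_ms].export(f"A-{i}.wav", format="wav")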
Training:
I extracted 39 MFCC features from each audio segment of the individual letters. 39 features gave me a good range of mel bands, with different hop lengths tried for windowing.
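A sketch of that feature-extraction loop, assuming the prepared .wav chunks are organized one folder per letter (the directory layout and hop length are assumptions):

import os
import librosa

def extract_features(path, sr=22050, n_mfcc=39, n_fft=2048, hop_length=512):
    signal, _ = librosa.load(path, sr=sr)
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                 n_fft=n_fft, hop_length=hop_length)
    return mfccs   # shape (39, frames); equal-length chunks give a fixed frame count

X, y = [], []
for label, letter in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    folder = os.path.join("chunks", letter)            # assumed layout: chunks/A/*.wav
    for name in sorted(os.listdir(folder)):
        X.append(extract_features(os.path.join(folder, name)))
        y.append(label)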
The training and validation accuracy curves show a variance gap; more data points would reduce it. The dropout, batch size, and learning rate hyperparameters need to be tuned properly.
Finally, the optimal trained model is saved for deployment.
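A sketch of training and saving, reusing the model, X, and y built in the earlier sketches (the learning rate, batch size, and epoch count are illustrative starting points, not the tuned values; the model's input shape must match the 39-band MFCC matrices):

import numpy as np
from tensorflow.keras.optimizers import Adam

X_arr = np.array(X)[..., np.newaxis]       # (samples, 39, frames, 1)
y_arr = np.array(y)

model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(X_arr, y_arr,
                    validation_split=0.2,  # watch the train/validation accuracy gap
                    batch_size=32,
                    epochs=50)

model.save("letter_cnn.h5")                # loaded later by the Flask app in Part II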
Part II: Deploying in a Flask Application.