[Column] Artificial Intelligence IV
Voice and AI (transform music to data)
- speech and music
- cocktail party effect
- roughly 85-1100 Hz: fundamental frequency range of the human voice
- Steps from human voice to MP3
- VOICE > SENSOR > VOLTAGE
- SAMPLING (a computer cannot store a continuous signal): DISCRETE in time
- QUANTIZATION: DISCRETE in amplitude
- ENCODING
- TIME SERIES
- WAVEFORM
- SAMPLING RATE (commonly 44100 Hz for MP3 and CD audio; by the Nyquist criterion this captures frequencies up to 22050 Hz, just above the roughly 20 kHz upper limit of human hearing)
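The sampling, quantization and encoding steps above can be sketched in a few lines of Python. This is a minimal illustration, not an MP3 encoder: the function names are made up for this example, the input is a synthetic 440 Hz sine wave rather than a real microphone voltage, and the encoding step writes uncompressed 16-bit PCM (MP3 would add lossy compression on top, which is not shown).

```python
import math
import struct

def sample_sine(freq_hz, sample_rate_hz, duration_s):
    """Sampling: evaluate a continuous sine wave at discrete time steps."""
    n_samples = round(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def quantize(samples, bits=16):
    """Quantization: map amplitudes in [-1, 1] to signed integer levels."""
    max_level = 2 ** (bits - 1) - 1  # 32767 for 16 bits
    return [int(round(s * max_level)) for s in samples]

def encode_pcm16(levels):
    """Encoding: pack integer levels as little-endian 16-bit PCM bytes."""
    return struct.pack("<%dh" % len(levels), *levels)

# 10 ms of a 440 Hz tone at 44100 Hz (assumed values for illustration)
samples = sample_sine(440.0, 44100, 0.01)  # time series of floats
levels = quantize(samples)                 # discrete amplitudes
pcm = encode_pcm16(levels)                 # 2 bytes per sample
```

The list `samples` is the time series; plotting it against the sample index gives the waveform.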
Characteristics of Voice
- frequency spectrum (x: frequency; y: amplitude), usually plotted on a log scale
- Amplitude (Loudness)
- Frequency (Pitch)
- Timbre (related to harmonics)
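As a rough illustration of how amplitude, pitch and harmonics show up in a frequency spectrum, the sketch below computes a naive discrete Fourier transform of a synthetic signal. The signal (a 200 Hz fundamental plus a weaker 400 Hz harmonic), the 8000 Hz sample rate and the helper name `dft_magnitudes` are assumptions chosen for the example; taking the strongest peak as the pitch only works here because the fundamental is the loudest component.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive DFT: amplitude of each frequency bin up to half the sample rate."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc) * 2 / n)  # scale so a unit sine gives 1.0
    return mags

sample_rate = 8000
n = 400
signal = [math.sin(2 * math.pi * 200 * t / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 400 * t / sample_rate)
          for t in range(n)]

mags = dft_magnitudes(signal)
bin_hz = sample_rate / n                   # 20 Hz per bin
peak_bin = max(range(len(mags)), key=mags.__getitem__)
pitch_hz = peak_bin * bin_hz               # frequency of the strongest peak
harmonic_amplitude = mags[int(400 / bin_hz)]
```

The relative strength of the harmonics (here a single one at half the amplitude of the fundamental) is what the notes refer to as timbre.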
Classic Voice Features - MFCC
- MFCC - Mel-Frequency Cepstral Coefficients
- Formant
- Vowel
- Steps of using MFCC
- 26-dimensional vector (one log energy per mel filter)
- Mel frequency: mel(f) = 1125 * ln(1 + f/700)
- fine resolution at low frequencies, coarse at high frequencies (similar to human hearing)
- Calculate its cepstrum
- the cepstrum step (typically a discrete cosine transform of the log filter energies) reduces the 26 dimensions to 13.
- window width (length of each analysis frame)
- window step (hop between successive frames)
- the result is a 13-dimensional MFCC feature vector per frame
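The main pieces of the MFCC steps above can be sketched as follows. This is a simplified outline under stated assumptions, not a complete MFCC implementation: the helper names are invented for this example, the frame width and step (400 and 160 samples, i.e. 25 ms and 10 ms at an assumed 16000 Hz sample rate) are common but not prescribed values, and the 26 log filter energies are placeholder numbers instead of energies measured from a real spectrum.

```python
import math

def hz_to_mel(f_hz):
    """Mel scale as given in the notes: mel(f) = 1125 * ln(1 + f/700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

def filter_centres_hz(n_filters=26, low_hz=0.0, high_hz=8000.0):
    """Filter centre frequencies, equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_filters)]

def frame_signal(samples, width, step):
    """Cut the signal into overlapping windows of `width`, hopping by `step`."""
    return [samples[s:s + width]
            for s in range(0, len(samples) - width + 1, step)]

def cepstrum(log_energies, n_keep=13):
    """DCT-II of the log filter energies, keeping the first n_keep values."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * k * (i + 0.5) / n)
                for i, e in enumerate(log_energies))
            for k in range(n_keep)]

centres = filter_centres_hz()                      # 26 centre frequencies
frames = frame_signal(list(range(1000)), 400, 160) # dummy signal, 4 frames
log_energies = [math.log(1.0 + i) for i in range(26)]  # placeholder values
mfcc = cepstrum(log_energies)                      # 13 coefficients
```

Because the centres are evenly spaced in mel rather than in Hz, neighbouring filters are close together at low frequencies and far apart at high frequencies, which matches the resolution behaviour described in the list.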
- Deep learning applied (feature extraction and feature classification)
- convolutional layer (extracts progressively more specific features)
- pooling layer (reduces dimensionality)
- fully connected layer (combines the convolutional features through inner products)
- softmax layer (outputs class probabilities)
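To make the role of each layer concrete, here is a toy forward pass through the four layer types in plain Python. Everything in it is an assumption for illustration: the input is a made-up one-dimensional feature sequence, the kernel, weights and biases are hand-picked rather than learned, and a ReLU activation is added after the convolution, which the notes do not mention. A real model would use a deep learning framework and trained parameters.

```python
import math

def conv1d(signal, kernel):
    """Convolutional layer: slide the kernel along the input (no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    """Activation (an added assumption): keep positive responses only."""
    return [max(0.0, x) for x in xs]

def max_pool(xs, size=2):
    """Pooling layer: keep the maximum of each non-overlapping block."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

def fully_connected(xs, weights, biases):
    """Fully connected layer: one inner product per output unit."""
    return [sum(w * x for w, x in zip(row, xs)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Softmax layer: turn scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

features = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
kernel = [1.0, 0.0, -1.0]                       # hand-picked edge detector
weights = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]  # hypothetical, 2 classes
biases = [0.0, 0.1]

conv = relu(conv1d(features, kernel))  # 7 feature responses
pooled = max_pool(conv, 2)             # reduced to 3 values
probs = softmax(fully_connected(pooled, weights, biases))
```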
- voice assistant (input without typing, meeting transcription)
- acoustic model
- language model
- automatic music search (fuzzy search) using a sliding-window scan
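One simple way to picture a window scan for fuzzy music search is sketched below: a short query is slid along a longer track and the offset with the smallest mean squared difference wins. This is only an assumed interpretation of the idea, not a description of any particular product. The function name `window_scan` is invented, and the track and query are made-up pitch sequences (MIDI-style note numbers), with the query deliberately slightly off to mimic an imperfect hummed input.

```python
def window_scan(track, query):
    """Slide the query over the track; return (best_offset, best_distance)."""
    best_offset, best_dist = -1, float("inf")
    for offset in range(len(track) - len(query) + 1):
        window = track[offset:offset + len(query)]
        # Mean squared difference: small values mean a close (fuzzy) match
        dist = sum((a - b) ** 2 for a, b in zip(window, query)) / len(query)
        if dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, best_dist

track = [60, 62, 64, 65, 67, 69, 71, 72, 71, 69]  # hypothetical melody
query = [65.2, 66.8, 69.1]                        # slightly off-pitch query
offset, dist = window_scan(track, query)
```

The match is fuzzy in the sense that the query never has to equal the track exactly; it only has to be closer to one window than to all the others.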