[Column] Artificial Intelligence IV

- August 12, 2018

Voice and AI (transform music to data)

speech and music
cocktail party effect
85-1100 Hz: approximate fundamental frequency range of the human voice
Steps from Human voice to MP3

VOICE > SENSOR > VOLTAGE
SAMPLING (a computer cannot store a continuous signal) : DISCRETE in time
QUANTIZATION : DISCRETE in amplitude
ENCODING
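
The sampling, quantization, and encoding steps above can be sketched in a few lines. This is only an illustration under stated assumptions: a 440 Hz sine wave stands in for the sensor voltage, the bit depth is assumed to be 16, and the result is raw PCM bytes. The lossy MP3 compression stage itself is not shown, and all function names are hypothetical.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s):
    """Sampling: evaluate a continuous sine wave at discrete time steps."""
    n_samples = round(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def quantize(samples, bits=16):
    """Quantization: map amplitudes in [-1.0, 1.0] to signed integer levels."""
    max_level = 2 ** (bits - 1) - 1  # 32767 for 16 bits (assumed bit depth)
    return [int(round(s * max_level)) for s in samples]

def encode_pcm16(levels):
    """Encoding: pack each 16-bit level as two little-endian bytes (raw PCM)."""
    return b"".join(level.to_bytes(2, "little", signed=True) for level in levels)

# A 440 Hz tone stands in for the voltage coming from the sensor.
samples = sample_sine(freq_hz=440.0, sample_rate_hz=44100, duration_s=0.01)
levels = quantize(samples)
pcm = encode_pcm16(levels)
print(len(samples), "samples,", len(pcm), "bytes")
```

Each sample becomes one integer level and each level becomes two bytes, so the byte count is twice the sample count.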

TIME SERIES
WAVEFORM
SAMPLING RATE (44100 Hz is typical for MP3 and CD audio; human hearing reaches only about 20 kHz, and the sampling rate must be more than twice the highest frequency to be captured)

Characteristics of Voice

frequency spectrum (x - frequency ; y - amplitude), plotted with logarithmic coordinates
Amplitude (Loudness)
Frequency (Pitch)
Timbre (related to harmonics)
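
To make these characteristics concrete, here is a minimal sketch that computes a magnitude spectrum with a naive discrete Fourier transform. The test signal (a 200 Hz fundamental plus a weaker 400 Hz harmonic), the 8000 Hz sampling rate, and the function name are assumptions chosen for illustration; a real pipeline would use an FFT library instead.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive DFT magnitudes for bins 0 .. n/2 - 1.

    Scaled by 2/n so a sine of amplitude A shows up as A
    (the DC bin k=0 would be doubled by this scaling).
    """
    n = len(signal)
    mags = []
    for k in range(n // 2):
        acc = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc) * 2 / n)
    return mags

SAMPLE_RATE = 8000   # Hz (assumed)
N = 400              # number of samples, so each bin is 8000 / 400 = 20 Hz wide

# Fundamental at 200 Hz (sets the pitch) plus a harmonic at 400 Hz (shapes the timbre).
signal = [1.0 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 400 * t / SAMPLE_RATE)
          for t in range(N)]

mags = dft_magnitudes(signal)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
pitch_hz = peak_bin * SAMPLE_RATE / N
print("strongest component:", pitch_hz, "Hz")
```

The height of the strongest peak corresponds to loudness, its position to pitch, and the relative strength of the harmonic peaks to timbre.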

Classic Voice Features - MFCC

MFCC - Mel-Frequency Cepstral Coefficients

Formant
Vowel

Steps of using MFCC

26-dimensional vector

Mel frequency: mel(f) = 1125 * ln(1 + f/700)
fine resolution at low frequencies, coarse resolution at high frequencies (similar to human hearing)
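
The mel formula above can be checked numerically. The sketch below implements it together with its inverse and places 26 filter centres evenly on the mel scale; the 0-8000 Hz range and the helper names are assumptions, not something the column specifies.

```python
import math

def hz_to_mel(f_hz):
    """mel(f) = 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)

def mel_filter_centers(n_filters, low_hz, high_hz):
    """Centre frequencies (Hz) of filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_filters)]

# The same 100 Hz step covers far more mels at low frequency than at high frequency.
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(8100) - hz_to_mel(8000)
print(round(low_step, 1), "mel vs", round(high_step, 1), "mel")

# 26 filters over an assumed 0-8000 Hz range: dense at the bottom, sparse at the top.
centers = mel_filter_centers(26, 0.0, 8000.0)
```

Because the centres are evenly spaced in mel, the gap between neighbouring filters in Hz grows with frequency, which is where the coarse high-frequency resolution comes from.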

Calculate Its Cepstrum

the purpose of the cepstral step is to reduce the 26 dimensions to 13.
window width (length of each analysis frame)
window pitch (step between successive frames)
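
Window width and window pitch determine how the waveform is cut into overlapping frames before features are computed. A minimal sketch follows, assuming a 16000 Hz sampling rate, a 25 ms width, and a 10 ms pitch; these are common choices but are not stated in the column, and the function name is hypothetical.

```python
def frame_signal(signal, frame_len, frame_step):
    """Cut a signal into overlapping frames; a trailing partial frame is dropped."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += frame_step
    return frames

SAMPLE_RATE = 16000                       # Hz (assumed)
frame_len = SAMPLE_RATE * 25 // 1000      # window width: 25 ms -> 400 samples
frame_step = SAMPLE_RATE * 10 // 1000     # window pitch: 10 ms -> 160 samples

signal = list(range(SAMPLE_RATE))         # one second of placeholder samples
frames = frame_signal(signal, frame_len, frame_step)
print(len(frames), "frames of", len(frames[0]), "samples")
```

Since the pitch is smaller than the width, neighbouring frames overlap, and each frame later yields one feature vector.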

finally, obtain 13-dimensional MFCC features for each window
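
The 26-to-13 reduction is commonly done by taking the logarithm of the 26 mel filter-bank energies, applying a discrete cosine transform (DCT-II), and keeping the first 13 coefficients. The sketch below assumes that standard recipe; the energies are made-up numbers and the function names are hypothetical.

```python
import math

def dct_ii(values):
    """Unnormalised DCT-II of a list of numbers."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]

def cepstral_coefficients(filterbank_energies, n_keep=13):
    """Log of the filter-bank energies, then DCT, keeping the first n_keep terms."""
    log_energies = [math.log(e) for e in filterbank_energies]
    return dct_ii(log_energies)[:n_keep]

# 26 made-up positive energies standing in for one frame's mel filter-bank output.
energies = [1.0 + 0.5 * math.cos(i / 3.0) for i in range(26)]
mfcc = cepstral_coefficients(energies)
print(len(energies), "->", len(mfcc), "dimensions")
```

The low-order coefficients kept here describe the smooth overall shape of the spectrum, which is why dropping the remaining 13 loses little information about the vocal tract.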

Deep Learning Applied (feature extraction and feature classification)

convolutional layer (extracts more specific, local features)
pooling layer (reduces dimensionality)
fully connected layer (combines the convolutional features via inner products)
softmax layer (outputs class probabilities)
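
The four layer types above can be traced in a toy forward pass. This is a sketch only: the input features, kernel, weights, and biases are invented numbers rather than trained values, and a real network would have many channels and layers and would normally be built with a deep learning framework.

```python
import math

def conv1d(x, kernel):
    """Slide the kernel over x (cross-correlation, as in most deep learning libraries)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def max_pool(x, size):
    """Keep the maximum of each non-overlapping block of `size` values."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def fully_connected(x, weights, biases):
    """One inner product per output unit, plus a bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Turn scores into probabilities that sum to 1 (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical input features and untrained, hand-picked parameters.
features = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 0.5, 2.5, 1.0]
kernel = [1.0, 0.0, -1.0]
weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
biases = [0.0, 0.1]

conv_out = conv1d(features, kernel)     # 9 inputs -> 7 values
pooled = max_pool(conv_out, 2)          # 7 values -> 3 values
probs = softmax(fully_connected(pooled, weights, biases))
print(probs)
```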

Application - SPEECH RECOGNITION

voice assistant (input without typing, meeting transcription)
acoustic model
language model
automatic music search (fuzzy search) - sliding-window scan
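
One way to picture the window scan for fuzzy music search: slide a window the length of the query across the track's feature sequence and keep the offset with the smallest distance. The sketch below assumes one scalar feature per frame and Euclidean distance, which the column does not specify; a real system would compare per-frame feature vectors such as MFCCs. All names and numbers are hypothetical.

```python
def window_distance(window, query):
    """Euclidean distance between a window of the track and the query."""
    return sum((a - b) ** 2 for a, b in zip(window, query)) ** 0.5

def best_match(track, query):
    """Scan every window position and return (offset, distance) of the closest one."""
    best_offset, best_dist = -1, float("inf")
    for offset in range(len(track) - len(query) + 1):
        dist = window_distance(track[offset:offset + len(query)], query)
        if dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, best_dist

# Made-up feature sequence for a track, and a noisy excerpt of it as the query.
track = [0.1, 0.3, 0.2, 0.9, 1.4, 1.1, 0.4, 0.2, 0.1]
query = [1.0, 1.5, 1.0]

offset, dist = best_match(track, query)
print("best match at offset", offset)
```

The match is fuzzy because the query does not need to equal any window exactly; the closest window wins even when the recording is noisy.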
