Encoding residual (LPC on speech signal)
- From: Carlos Moreno <moreno_at_mochima_dot_com@xxxxxx>
- Date: Wed, 31 Aug 2005 14:30:26 -0400
Hi,
As part of my research project, I'm working on vector-quantization applied to an LPC-like encoding scheme.
For each frame of speech (I tentatively put a frame length of 128 samples -- at 8kHz sampling rate), I compute the optimal prediction filter (using the autocorrelation method), and then apply that filter to the signal (well, I apply 1-P to the signal, where P(z) is the optimal prediction filter).
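A minimal sketch of that analysis step, assuming the Levinson-Durbin recursion is used to solve the autocorrelation-method equations (the post does not say which solver is used). The function names `autocorrelation`, `levinson_durbin` and `lpc_residual` are illustrative, no analysis window is applied, and `history` stands for the last `order` samples of the previous frame:

```python
def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..order].

    The prediction is x_hat[n] = sum_k a[k] * x[n - k].
    Returns (a, prediction_error_energy); a[0] is unused and left at 0.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predictable frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)
    return a, err


def lpc_residual(frame, a, history):
    """Apply 1 - P(z): e[n] = x[n] - sum_k a[k] * x[n - k].

    history must hold exactly len(a) - 1 samples preceding the frame.
    """
    order = len(a) - 1
    x = list(history) + list(frame)
    return [x[n] - sum(a[k] * x[n - k] for k in range(1, order + 1))
            for n in range(order, len(x))]
```

For a 128-sample frame and a 15-coefficient predictor this would be called as `a, err = levinson_durbin(autocorrelation(frame, 15), 15)` followed by `lpc_residual(frame, a, history)`.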
So, I end up with the residual, and I want to encode this. The final goal is to do VQ on that residual.
I'm trying to obtain a parameterization of that frame that I can use as the vector to be vector-quantized. That way I have a lower-dimensional vector: encoding the entire frame with, say, 10 or 20 values is much better than treating the whole frame as a 128-dimensional vector.
My preliminary idea works amazingly well *visually*, but when it comes to listening to the transcoded speech, the method fails miserably.
I notice that the residual consists of a few peaks plus noise, so I figured: I'll encode the amplitude and positions of up to 4 peaks (most frames have 1 peak, many have 2 or 3, and many of them have no peaks at all), and then, for the rest, I'll determine the best fitting 10th-order polynomial to represent the data.
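The post does not say how the peaks are detected, so the following is only one possible sketch: it assumes a peak is any sample whose magnitude exceeds a multiple of the frame RMS, and keeps the largest few. The name `pick_peaks` and the `threshold_factor` value are hypothetical:

```python
def pick_peaks(residual, max_peaks=4, threshold_factor=3.0):
    """Return up to max_peaks (position, amplitude) pairs, ordered by position.

    A sample counts as a peak when |value| > threshold_factor * RMS of the
    frame (an assumed criterion, not the one used in the post).
    """
    rms = (sum(v * v for v in residual) / len(residual)) ** 0.5
    threshold = threshold_factor * rms
    candidates = [(abs(v), n) for n, v in enumerate(residual)
                  if abs(v) > threshold]
    candidates.sort(reverse=True)  # largest magnitude first
    chosen = sorted(n for _, n in candidates[:max_peaks])
    return [(n, residual[n]) for n in chosen]
```

A frame with no sample above the threshold yields an empty list, which matches the "no peaks at all" case.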
Then, the rest should be almost pure "whitish" noise, only that the amplitude of that noise varies over the frame. So, what I do is that instead of just finding the variance (a constant, average value for the entire frame), I compute the best-fitting 10th-order polynomial for the square values of the remaining (0-mean) signal.
To reconstruct, I take the mean (the value of the best-fitting polynomial for the data) and add a zero-mean pseudo-random Gaussian number whose variance is given by the polynomial describing the "amplitude" of the noise as a function of the position in the frame.
Then, if there are peaks, I replace the samples at the corresponding positions with the encoded peak values.
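The reconstruction just described can be sketched as follows. The details are assumptions: coefficients are stored highest power first, the sample index is mapped to [-1, 1] before evaluating the polynomials, and a variance polynomial that dips below zero is clamped to zero. `polyval` and `reconstruct_frame` are illustrative names:

```python
import random


def polyval(coeffs, t):
    """Evaluate a polynomial by Horner's rule; coeffs are highest power first."""
    acc = 0.0
    for c in coeffs:
        acc = acc * t + c
    return acc


def reconstruct_frame(mean_coeffs, var_coeffs, peaks, frame_len=128, rng=None):
    """Rebuild a residual frame from its parameters.

    mean_coeffs: polynomial fitted to the residual with the peaks removed
    var_coeffs:  polynomial fitted to the squared zero-mean remainder
    peaks:       list of (position, amplitude) pairs
    """
    rng = rng or random.Random()
    out = []
    for n in range(frame_len):
        t = 2.0 * n / (frame_len - 1) - 1.0  # assumed index mapping to [-1, 1]
        mean = polyval(mean_coeffs, t)
        var = max(polyval(var_coeffs, t), 0.0)  # fitted variance can go negative
        out.append(mean + rng.gauss(0.0, var ** 0.5))
    for pos, amp in peaks:
        out[pos] = amp  # peaks overwrite the noisy samples
    return out
```

Only the noise statistics are kept this way: the individual noise samples of the reconstructed frame are unrelated to those of the original residual.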
As I said, when you see the actual residual frames and the reconstructed versions from the parameterization described above, the similarity is quite amazing (again, *visually*).
I put an example of a frame (the red curve is the reconstructed, or "transcoded" frame) if you want to take a quick look:
http://www.mochima.com/tmp/residual.png
Now, when you pass this frame (well, the sequence of reconstructed frames) through the reconstruction filter (reconstruct the prediction from the output, then add it to the residual and push this value onto the output stream), the resulting output is a disaster -- it is absolutely and completely intelligible, but it sounds noisy and scratchy.
No, it is not a bug in the implementation (I'm actually quite positively certain of that -- really). If I keep the same frame and pass it through the reconstruction filter, I obtain the exact same audio stream (sample-by-sample the exact same values).
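That round-trip check can be sketched like this, assuming the predictor convention x_hat[n] = sum_k a[k] * x[n - k] with `a[0]` unused and zero initial history. In floating point the match is to within rounding; bit-exact equality would need integer or otherwise exact arithmetic. The function names are illustrative:

```python
def lpc_analysis(signal, a):
    """Residual e[n] = x[n] - sum_k a[k] * x[n - k], zero initial history."""
    order = len(a) - 1
    x = [0.0] * order + list(signal)
    return [x[n] - sum(a[k] * x[n - k] for k in range(1, order + 1))
            for n in range(order, len(x))]


def lpc_synthesis(residual, a, history=None):
    """Reconstruction filter: y[n] = e[n] + sum_k a[k] * y[n - k].

    The prediction is rebuilt from the output already produced, added to
    the residual, and the result is pushed onto the output stream.
    """
    order = len(a) - 1
    y = list(history) if history is not None else [0.0] * order
    for e in residual:
        pred = sum(a[k] * y[-k] for k in range(1, order + 1))
        y.append(pred + e)
    return y[order:]


# Round trip on a short made-up signal with a stable 2nd-order predictor.
a = [0.0, 1.2, -0.5]
signal = [1.0, -2.0, 3.5, 0.25, -1.0, 4.0]
rebuilt = lpc_synthesis(lpc_analysis(signal, a), a)
max_error = max(abs(u - v) for u, v in zip(signal, rebuilt))
```

Because the synthesis filter is recursive, any change to the residual is shaped by 1/(1 - P(z)) before it reaches the output.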
If I add noise to the frame, I still hear exactly the same (with very high quality). I can even add white noise with HALF the power of the residual frame, and it still sounds *MUCH* better than when replacing the frame with the parameterized frame.
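A sketch of that noise experiment, assuming "power" means the mean squared value of the residual frame and that the noise is zero-mean Gaussian. `mean_power` and `add_white_noise` are illustrative names:

```python
import random


def mean_power(x):
    """Mean squared value of a sequence."""
    return sum(v * v for v in x) / len(x)


def add_white_noise(frame, power_ratio=0.5, rng=None):
    """Add zero-mean Gaussian noise whose power is power_ratio * frame power."""
    rng = rng or random.Random()
    sigma = (power_ratio * mean_power(frame)) ** 0.5
    return [v + rng.gauss(0.0, sigma) for v in frame]
```

A `power_ratio` of 0.5 corresponds to a signal-to-noise ratio of about 3 dB on the residual, and the original residual samples are still present underneath the noise.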
So, I guess the part that I'm missing is: what exactly is the valuable information in that residual frame? I thought it would be the positions and amplitude of the pulses for frames with "voiced" speech (vowels, M, N, L) and white noise for frames with "unvoiced" speech (S, SH, F, etc.).
After all, the simplified model of the speech production system is an acoustic filter (throat, vocal and nasal cavities) fed by a train of pulses (produced by the vocal cords).
Sure, the vocal cords don't produce perfect impulses (as in perfect "Dirac deltas"), but when you model the acoustic filter (with a prediction filter of 15 coefficients), usually that includes the "filtering" process for the pulses of air produced by the vocal cords.
Any comments? I know it was quite a long post, but I'm hoping some kind soul will have the patience to go through it and share some thoughts.
Thanks,
Carlos