Encoding residual (LPC on speech signal)




Hi,

As part of my research project, I'm working on vector-quantization
applied to an LPC-like encoding scheme.

For each frame of speech (I tentatively put a frame length of 128
samples -- at 8kHz sampling rate), I compute the optimal prediction
filter (using the autocorrelation method), and then apply that filter
to the signal (well, I apply 1-P to the signal, where P(z) is the
optimal prediction filter).
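
A minimal sketch of this analysis step in Python with NumPy, assuming no analysis window and zero signal history at the start of the frame. The function names `lpc_autocorr` and `lpc_residual` are hypothetical and not taken from the actual implementation.

```python
import numpy as np

def lpc_autocorr(frame, order):
    """Predictor coefficients p such that x_hat[n] = sum_k p[k] * x[n-1-k],
    obtained with the autocorrelation method (Levinson-Durbin recursion)."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order (no window applied: an assumption).
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    p = np.zeros(order)
    err = r[0]
    for i in range(order):
        if err <= 0.0:
            break  # silent or perfectly predictable frame
        k = (r[i + 1] - np.dot(p[:i], r[i:0:-1])) / err  # reflection coefficient
        prev = p[:i].copy()
        p[:i] = prev - k * prev[::-1]
        p[i] = k
        err *= 1.0 - k * k
    return p

def lpc_residual(x, p):
    """Apply 1 - P(z): e[n] = x[n] - sum_k p[k] * x[n-1-k]."""
    x = np.asarray(x, dtype=float)
    order = len(p)
    padded = np.concatenate([np.zeros(order), x])  # zero history (assumption)
    e = np.empty(len(x))
    for n in range(len(x)):
        past = padded[n:n + order][::-1]  # x[n-1], x[n-2], ..., x[n-order]
        e[n] = x[n] - np.dot(p, past)
    return e
```

For a 128-sample frame this would be called as `p = lpc_autocorr(frame, 15)` followed by `e = lpc_residual(frame, p)`.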

So, I end up with the residual, and I want to encode this.  The
final goal is to do VQ on that residual.

I'm trying to obtain a parameterization of that frame, so that
I can use it as the vector to quantize.  That way, I have a
lower-dimensional vector: encoding the entire frame with, say,
10 or 20 values is much better than treating the whole frame as
a 128-dimensional vector.

My preliminary idea works amazingly well *visually*, but when
it comes to listening to the transcoded speech, the method fails
miserably.

I notice that the residual consists of a few peaks plus noise,
so I figured:  I'll encode the amplitude and positions of up to
4 peaks (most frames have 1 peak, many have 2 or 3, and many of
them have no peaks at all), and then, for the rest, I'll determine
the best fitting 10th-order polynomial to represent the data.

Then, the rest should be almost pure "whitish" noise, except that
the amplitude of that noise varies over the frame.  So, instead of
just finding the variance (a single constant value for the entire
frame), I compute the best-fitting 10th-order polynomial for the
squared values of the remaining (zero-mean) signal.

To reconstruct, I just add the mean (the value of the best-fitting
polynomial for the data) to a zero-mean pseudo-random Gaussian
number whose variance is given by the second polynomial (the one
that gives me the "amplitude" of the noise as a function of the
position in the frame).

Then, if there are peaks, I replace the samples at the peak
positions with the stored peak amplitudes.
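
The parameterization and reconstruction described above can be sketched as follows in Python with NumPy. Several details are assumptions made only for illustration: the peak-picking rule (a threshold of `peak_factor` times the median magnitude), the use of a Chebyshev basis for the 10th-order polynomial fits (chosen for numerical conditioning), and the names `parameterize` and `reconstruct`.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def parameterize(res, max_peaks=4, degree=10, peak_factor=4.0):
    res = np.asarray(res, dtype=float)
    n = len(res)
    t = np.linspace(-1.0, 1.0, n)  # frame position mapped to [-1, 1]
    # Hypothetical peak rule: magnitude well above the typical sample.
    scale = np.median(np.abs(res)) + 1e-12
    cand = np.flatnonzero(np.abs(res) > peak_factor * scale)
    # Keep at most max_peaks candidates, largest magnitude first.
    strongest = cand[np.argsort(-np.abs(res[cand]))][:max_peaks]
    peak_pos = np.sort(strongest)
    peak_amp = res[peak_pos]
    keep = np.ones(n, dtype=bool)
    keep[peak_pos] = False  # exclude the peaks from both fits
    # Polynomial for the local mean of the remaining samples.
    mean_coef = cheb.chebfit(t[keep], res[keep], degree)
    zero_mean = res[keep] - cheb.chebval(t[keep], mean_coef)
    # Polynomial for the squared values, i.e. the local variance.
    var_coef = cheb.chebfit(t[keep], zero_mean ** 2, degree)
    return {"peak_pos": peak_pos, "peak_amp": peak_amp,
            "mean_coef": mean_coef, "var_coef": var_coef}

def reconstruct(params, n, rng):
    t = np.linspace(-1.0, 1.0, n)
    mean = cheb.chebval(t, params["mean_coef"])
    # A fitted polynomial can dip below zero, so clip before the sqrt.
    var = np.clip(cheb.chebval(t, params["var_coef"]), 0.0, None)
    out = mean + np.sqrt(var) * rng.standard_normal(n)
    out[params["peak_pos"]] = params["peak_amp"]  # put the peaks back
    return out
```

Note that a 10th-order polynomial has 11 coefficients, so with up to 4 peak positions and 4 amplitudes this sketch stores up to 30 values per frame.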

As I said, when you see the actual residual frames and the
reconstructed versions from the parameterization described
above, the similarity is quite amazing (again, *visually*).

I put an example of a frame (the red curve is the reconstructed,
or "transcoded" frame) if you want to take a quick look:

http://www.mochima.com/tmp/residual.png


Now, when you pass this frame (well, the sequence of reconstructed
frames) through the reconstruction filter (reconstruct the
prediction from the output, then add it to the residual and push
this value onto the output stream), the resulting output is a
disaster -- it is absolutely and completely intelligible, but it
sounds noisy and scratchy.

No, it is not a bug in the implementation (I'm actually quite
positively certain of that -- really).  If I keep the original
residual frame and pass it through the reconstruction filter, I
obtain the exact same audio stream as the input (sample-by-sample
the exact same values).
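
A minimal sketch of that round-trip check in Python with NumPy, assuming zero filter history at the start of the stream. The names `analyze` and `synthesize` are hypothetical. In floating point the match is to within rounding error, not necessarily bit-exact.

```python
import numpy as np

def analyze(x, p):
    """Residual e[n] = x[n] - sum_k p[k] * x[n-1-k] (zero initial history)."""
    x = np.asarray(x, dtype=float)
    order = len(p)
    padded = np.concatenate([np.zeros(order), x])
    return np.array([x[n] - np.dot(p, padded[n:n + order][::-1])
                     for n in range(len(x))])

def synthesize(e, p):
    """Reconstruction filter 1 / (1 - P(z)): predict from past *output*,
    add the residual, and push the result onto the output stream."""
    order = len(p)
    y = np.zeros(order + len(e))  # first `order` entries are the zero history
    for n in range(len(e)):
        prediction = np.dot(p, y[n:n + order][::-1])  # y[n-1], ..., y[n-order]
        y[n + order] = e[n] + prediction
    return y[order:]
```

With the unmodified residual, `synthesize(analyze(x, p), p)` should return `x`. Any modified residual goes through the same `synthesize` call.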

If I add noise to the frame, it still sounds exactly the same (with
very high quality).  I can even add white noise with HALF the
power of the residual frame, and it still sounds *MUCH* better
than when replacing the frame with the parameterized frame.
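
The noise experiment above can be sketched as follows in Python with NumPy. The helper name `add_white_noise` is hypothetical, and the noise is assumed to be Gaussian and rescaled so that its measured power is exactly the requested fraction of the residual power.

```python
import numpy as np

def add_white_noise(residual, power_ratio, rng):
    """Return (noisy_residual, noise), where the noise power equals
    power_ratio times the mean power of the residual frame."""
    residual = np.asarray(residual, dtype=float)
    noise = rng.standard_normal(len(residual))
    target_power = power_ratio * np.mean(residual ** 2)
    # Rescale so the measured noise power hits the target exactly.
    noise *= np.sqrt(target_power / np.mean(noise ** 2))
    return residual + noise, noise
```

A `power_ratio` of 0.5 corresponds to a signal-to-noise ratio of about 3 dB on the residual (10*log10(2)).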

So, I guess the part that I'm missing is:  what exactly is the
valuable information in that residual frame?  I thought it would
be the positions and amplitude of the pulses for frames with
"voiced" speech (vowels, M, N, L) and white noise for frames
with "unvoiced" speech (S, SH, F, etc.).

After all, the simplified model of the speech production system
is an acoustic filter (throat, vocal and nasal cavities) fed by
a train of pulses (produced by the vocal cords).

Sure, the vocal cords don't produce perfect impulses (as in
perfect "Dirac's deltas"), but when you model the acoustic
filter (with a prediction filter of 15 coefficients), usually
that includes the "filtering" process for the pulses of air
produced by the vocal cords.


Any comments?  I know it was quite a long post, but I'm hoping
some kind soul will have the patience to go through it and share
some thoughts.

Thanks,

Carlos