A short introduction to microphone array signal processing, directly taken from the book Microphone Array Signal Processing.
Stationary Signal Filtering
Here we briefly discuss two filtering techniques for stationary signals (i.e., signals whose distribution is independent of time).
Wiener Filter
Consider a zero-mean clean speech signal $x(k)$, contaminated by a zero-mean noise process $v(k)$, so that the observed signal $y(k)$ at time $k$ is:

$$y(k) = x(k) + v(k).$$
Assuming all signals are stationary, the objective is to retrieve $x(k)$ from $y(k)$. The Wiener filter uses a finite impulse response (FIR) filter $h$ of length $L$ (the length of the observation window), defined as:

$$h = [h_0, h_1, \ldots, h_{L-1}]^T.$$
This filter is applied to the observed signal $y(k)$ with the goal of removing $v(k)$ from $y(k)$; the error of this filter is therefore:

$$e(k) = x(k) - h^T \mathbf{y}(k), \quad \text{with} \quad \mathbf{y}(k) = [y(k), y(k-1), \ldots, y(k-L+1)]^T.$$
After defining the error, a criterion can be formulated, e.g., the mean-square error (MSE):

$$J(h) = E[e^2(k)].$$
Therefore, using the matrix derivatives from the Matrix Cookbook, we can find $\hat{h}$ by setting the gradient of the criterion to zero:

$$\frac{\partial J(h)}{\partial h} = 2 R_{yy} h - 2 r_{yx} = 0,$$

with $R_{yy} = E[\mathbf{y}(k)\mathbf{y}^T(k)]$ and $r_{yx} = E[\mathbf{y}(k)\,x(k)]$.
We obtain the required estimate for $\hat{h}$:

$$\hat{h} = R_{yy}^{-1} r_{yx}.$$
But here lies a problem: since in general the original signal $x(k)$ is unobservable, this formula cannot be used directly. However, one can refactor the expression $r_{yx}$ as:

$$r_{yx} = E[\mathbf{y}(k)(y(k) - v(k))] = r_{yy} - r_{vv},$$

where $r_{yy} = E[\mathbf{y}(k)\,y(k)]$ and $r_{vv} = E[\mathbf{v}(k)\,v(k)]$ (using that $x(k)$ and $v(k)$ are uncorrelated, so $E[\mathbf{y}(k)\,v(k)] = E[\mathbf{v}(k)\,v(k)]$).
With this simplification, the whole optimization problem only depends on the (possibly unknown) noise $v(k)$ and the observation $y(k)$.
Therefore, the optimal estimate in the Wiener sense is:

$$\hat{h}_W = R_{yy}^{-1} (r_{yy} - r_{vv}).$$
If we consider a particular filter $h_1$ that does no noise reduction:

$$h_1 = [1, 0, \ldots, 0]^T,$$
the corresponding MSE is then:

$$J(h_1) = E[(x(k) - y(k))^2] = E[v^2(k)] = \sigma_v^2.$$
The filter $h_1$ can also be written as $h_1 = R_{yy}^{-1} r_{yy}$, since $r_{yy}$ is the first column of $R_{yy}$ and therefore $R_{yy}^{-1} r_{yy} = [1, 0, \ldots, 0]^T$.
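This identity is easy to verify numerically. A minimal sketch (the covariance matrix here is built from random data purely for illustration; any invertible sample covariance works):

```python
import numpy as np

rng = np.random.default_rng(3)
L = 8  # filter length (an assumed value)

# Build a valid sample covariance matrix from random data
y = rng.standard_normal((1000, L))
R_yy = y.T @ y / 1000.0

r_yy = R_yy[:, 0]                 # r_yy is the first column of R_yy
h1 = np.linalg.solve(R_yy, r_yy)  # h1 = R_yy^{-1} r_yy

print(np.allclose(h1, np.eye(L)[0]))  # h1 equals [1, 0, ..., 0]^T
```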
Substituting $r_{yx} = r_{yy} - r_{vv}$ and using the identity above leads to the following form of the filter:

$$\hat{h}_W = h_1 - R_{yy}^{-1} r_{vv}.$$
The optimal estimate produced by this filter is therefore:

$$\hat{x}(k) = \hat{h}_W^T \mathbf{y}(k) = y(k) - r_{vv}^T R_{yy}^{-1} \mathbf{y}(k).$$
Even though this filter can improve the signal-to-noise ratio (SNR), it does not maximize it.
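The whole procedure can be sketched in a few lines of Python. The snippet below applies the Wiener filter, written as $h_1 - R_{yy}^{-1} r_{vv}$, to a synthetic observation; the AR(1) "speech" signal, the white noise model, and the chosen lengths are illustrative assumptions, not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 32     # filter length (assumed)
N = 20000  # number of samples (assumed)

# Synthetic stationary "speech": an AR(1) process, plus uncorrelated white noise
x = np.zeros(N)
for k in range(1, N):
    x[k] = 0.9 * x[k - 1] + rng.standard_normal()
v = rng.standard_normal(N)
y = x + v  # observed signal y(k) = x(k) + v(k)

# Delay-line matrix: row k holds [y(k), y(k-1), ..., y(k-L+1)]
Y = np.stack([y[L - 1 + np.arange(N - L + 1) - i] for i in range(L)], axis=1)

R_yy = Y.T @ Y / Y.shape[0]            # sample estimate of R_yy
r_vv = np.zeros(L); r_vv[0] = 1.0      # white unit-variance noise: r_vv = [1, 0, ..., 0]^T

h1 = np.zeros(L); h1[0] = 1.0          # identity (no noise reduction) filter
h_w = h1 - np.linalg.solve(R_yy, r_vv) # Wiener filter h_W = h1 - R_yy^{-1} r_vv

x_hat = Y @ h_w                        # filtered estimate of x(k)
x_ref = x[L - 1:]
mse_in = np.mean((y[L - 1:] - x_ref) ** 2)  # MSE of the raw observation (~sigma_v^2)
mse_out = np.mean((x_hat - x_ref) ** 2)     # MSE after Wiener filtering
print(mse_out < mse_in)
```

The filter reduces the MSE relative to the unfiltered observation, consistent with the derivation above, but the residual error is nonzero.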
Frost Filter
The Wiener filter briefly described above is, even though mathematically concise, in many cases not applicable to real-world applications, since the error $e(k)$ is defined with respect to the reference $x(k)$, which often does not exist.
Therefore, the criterion becomes:

$$J(h) = E[(h^T \mathbf{y}(k))^2] = h^T R_{yy} h.$$
Minimization of this criterion leads to the trivial solution $h = 0$. Fortunately, in many applications the filter $h$ is constrained by linear constraints of the following form:

$$C^T h = f,$$

where $C$ is an $L \times M$ constraint matrix and $f$ is an $M \times 1$ response vector.
The constrained optimization problem can then be solved using Lagrange multipliers, giving:

$$\hat{h}_F = R_{yy}^{-1} C \left(C^T R_{yy}^{-1} C\right)^{-1} f.$$
It is important that $C$ has full rank and $R_{yy}$ is invertible.
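A minimal sketch of the constrained solution, using a single distortionless constraint on the first tap as an assumed example (the covariance matrix is again estimated from synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 16  # filter length (assumed)

# Sample covariance of the observed signal (synthetic data for illustration)
y = rng.standard_normal((5000, L))
R_yy = y.T @ y / y.shape[0]

# Constraint set C^T h = f: here a distortionless constraint on the first tap
C = np.zeros((L, 1)); C[0, 0] = 1.0
f = np.array([1.0])

# Frost solution via Lagrange multipliers: h = R_yy^{-1} C (C^T R_yy^{-1} C)^{-1} f
RinvC = np.linalg.solve(R_yy, C)
h = RinvC @ np.linalg.solve(C.T @ RinvC, f)

print(np.allclose(C.T @ h, f))  # the constraint is satisfied exactly
```

Among all filters satisfying $C^T h = f$, this one minimizes the output power $h^T R_{yy} h$.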
Both of these filters, however, can only be used for stationary signals (i.e., signals whose distribution does not change over time), which is a strict constraint rarely met in real-world scenarios.
Non-Stationary Signal Filtering
In the case of non-stationary signals, a possible modelling approach was introduced by Rudolf E. Kálmán. This approach models the observed signal as follows:

$$y(k) = h_1^T \mathbf{x}(k) + v(k),$$
where $h_1$ is a one-hot encoded vector (zeros with a single one) that selects the current sample. In this model, the vector $\mathbf{x}(k)$, also referred to as the state vector, changes its values depending on the time step $k$. The Kalman filter is similar in formulation and implementation to a hidden Markov model, and its optimization is done in two steps (prediction and update), similar to an HMM.
We first define the speech signal model:

$$\mathbf{x}(k) = A\,\mathbf{x}(k-1) + h_1 w(k),$$

where $w(k)$ is the process (excitation) noise.
With the transition matrix $A$ defined as:

$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_{L-1} & a_L \\ 1 & 0 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix},$$

where the $a_i$ are the linear prediction (AR) coefficients of the speech signal.
Therefore, the objective is to find the optimal linear MSE estimate of $x(k)$ given the two model equations:
- State Equation
- Observation Equation
- Initialization
- Computation
As can be seen, the Kalman filter is defined recursively and starts from an initial time step. The matrix $R_{ee}$ (the state estimation error covariance) needs to be initialized beforehand. Unfortunately, the initialization of this matrix is essential to the filter's performance; thus, much research has been done to provide feasible starting estimates.
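The predict/update recursion can be sketched for the simplest scalar case, a first-order AR state; the model parameters, the noise variances, and the initial error covariance below are all assumed values for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000
a, q, r = 0.95, 0.1, 1.0  # AR coefficient, process noise var, observation noise var (assumed)

# Simulate state and observations: x(k) = a x(k-1) + w(k),  y(k) = x(k) + v(k)
x = np.zeros(N); y = np.zeros(N)
for k in range(1, N):
    x[k] = a * x[k - 1] + np.sqrt(q) * rng.standard_normal()
    y[k] = x[k] + np.sqrt(r) * rng.standard_normal()

# Kalman recursion: predict, then update
x_hat = np.zeros(N)
P = 1.0  # initial error covariance -- an assumed starting value
for k in range(1, N):
    # Prediction step
    x_pred = a * x_hat[k - 1]
    P_pred = a * P * a + q
    # Update step
    K = P_pred / (P_pred + r)             # Kalman gain
    x_hat[k] = x_pred + K * (y[k] - x_pred)
    P = (1.0 - K) * P_pred

mse_raw = np.mean((y - x) ** 2)     # error of the raw observation
mse_kal = np.mean((x_hat - x) ** 2) # error of the Kalman estimate
print(mse_kal < mse_raw)
```

Note how the error covariance $P$ is propagated and shrunk at every step: a poor initial value is gradually forgotten here, but in the higher-dimensional speech model the initialization of the error covariance matters considerably more, as noted above.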