Empirical Copula Density Estimation

Introduction

In this post, I’d like to discuss empirical copula density estimation from data.  We will introduce the concept of copulas, the copula density estimation problem, and how to solve it.  We will also detail an implementation of the proposed solution in Matlab.  All code used in this post is taken from the Matlab copula toolbox I am currently developing.

Copula Background

Copulas are functions which capture the dependence structure of joint distribution functions.  They can also be seen as multivariate probability distribution functions with marginal distributions that are uniform.  These equivalent descriptions of copulas can be expressed mathematically through Sklar’s theorem:

C(F_{X_1}(x_1), \dots, F_{X_d}(x_d)) = F_{X_1, \dots, X_d}(x_1, \dots, x_d) \ \ \ \ \ \ \ \ \ \ \ [1]

If we define u_i = F_{X_i}(x_i), then Sklar’s theorem can be restated as

C(u_1, \dots, u_d) = F_{X_1, \dots, X_d}(F_{X_1}^{-1}(u_1), \dots, F_{X_d}^{-1}(u_d))\ \ \ \ \ \ \ \ \ \ \ [2]

Equations [1] and [2] are equivalent and show that any joint distribution F_{X_1, \dots, X_d}(x_1, \dots, x_d) can be expressed with the joint distribution’s copula function C and the marginal distributions F_{X_i}(x_i).  The perceptive reader may notice that the previous statement implied a uniqueness between a joint distribution F_{X_1, \dots, X_d}(x_1, \dots, x_d) and its copula C.  To be precise, Sklar’s theorem guarantees that if the marginal distributions F_{X_i} are continuous for all i, then C is unique.  In this post, we will assume that X_i for all i are continuous random variables.
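Since this post is concerned with densities, it is also worth writing down the density form of Sklar’s theorem, obtained by differentiating Equation [1]: the joint density factors into the copula density c and the product of the marginal densities f_{X_i},

f_{X_1, \dots, X_d}(x_1, \dots, x_d) = c(F_{X_1}(x_1), \dots, F_{X_d}(x_d)) \prod_{i=1}^d f_{X_i}(x_i)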

Problem Description

Although Sklar’s theorem, shown above in Equations [1] and [2], guarantees a unique copula C for any joint distribution F_{X_1, \dots, X_d}(x_1, \dots, x_d) (provided the marginals F_{X_i} are continuous), it does not describe how to find this unique copula.  In fact, copula model selection (i.e. which copula to use) is a significant area of research.  In general, there are two approaches to copula model selection:

  1. Choose a copula among existing copula families such as Gaussian, t, or Archimedean copulas which fit the data well, using some metric to decide between the copula families.
  2. Calculate an empirical copula.

The advantages and disadvantages of each approach are outside the scope of this post.  Rather, we will focus on the second approach, which is calculating an empirical copula, and more specifically, an empirical copula density.  This can be considered a nonparametric approach.

The empirical copula can be thought of as an estimate of the underlying dependency structure of the data, unconstrained by any a-priori model of how the variables interact with each other.  In this light, it is similar to a kernel density estimate, because the copula function (and copula density function) can be thought of as a distribution function (and density function) itself.

The Empirical Copula Density Estimation Problem

Before describing the copula density estimation problem, it is useful to summarize the problem statement mathematically.  Suppose we have a multivariate dataset \mathbf{X} \in \mathbb{R}^{n \times d}, where each column X_i is a “feature,” and there are d features and n sample points per feature.  We would like to estimate the copula density c(\mathbf{u}) = \frac{\partial^d C(\mathbf{u})}{\partial u_1 \cdots \partial u_d}, which captures the dependency between the columns of the dataset \mathbf{X}.  How would we go about approaching this problem?

The Solution

There are many solutions to the problem posed above.  We discuss two approaches here, and outline why the naive approach may not be a good idea for estimating copula densities.

The Naive Approach

The naive approach entails calculating the empirical copula function and then performing differentiation:

\hat{c}(\mathbf{u}) = \frac{\partial^d \hat{C}(\mathbf{u})}{\partial u_1 \cdots \partial u_d} = \frac{1}{n} \frac{\partial^d}{\partial u_1 \cdots \partial u_d} \sum_{i=1}^n \mathbf{I}(\tilde{U}_1^i \leq u_1, \dots,\tilde{U}_d^i \leq u_d)\ \ \ \ \ \ \ \ \ \ \ [3]

(\tilde{U}_1^i, \dots,\tilde{U}_d^i) = (F_1^n(X_1^i),\dots,F_d^n(X_d^i))\ \ \ \ \ \ \ \ \ \ \ \ \ [4]

where Equation [4] is an expression for the pseudo copula observations.  While this is tempting, discontinuities in data can lead to differentiability issues.  Additionally, even if discontinuities are not present in the data, the copula density derived by numerical differentiation of the empirical copula function will be spiky.  The spiky nature of the numerical derivative is simply a consequence of the algorithm that is used for differentiation.  We will see an example of the effect of numerical differentiation on an empirical copula below.

Of course, more sophisticated techniques such as second order approximations can be implemented and they may yield acceptable solutions.  However, there exists a more natural approach to estimating the copula density from data directly.

Direct Density Estimation

An alternative to the approach described above is to estimate the density directly from data using kernel methods.  For copula density estimation, it is preferred to use the Beta kernel, because both the Beta kernel and the copula have a bounded support on the interval [0,1], which eliminates boundary bias.  Additionally, the Beta kernel can be viewed as an adaptive kernel density estimation approach, because it changes shape with respect to the input, even if the kernel density estimation bandwidth, h, remains the same.  For more details, please see References 1 and 2 below.

The formula then for multivariate copula density estimation is:

\hat{c}_\beta(\mathbf{u}) = \frac{1}{N} \sum_{n=1}^N \prod_{d=1}^D \beta(F_d(x_d(n)), \frac{u_d}{h} + 1, \frac{1-u_d}{h} + 1)\ \ \ \ \ \ \ \ \ \ \ [5]

where h is the kernel bandwidth parameter, \beta is the probability density function of the Beta distribution, and F_d(x_d(n)) is the n^{th} pseudo observation in dimension d.  Equation [5] is nothing more than a multivariate kernel density estimate of the data, using Beta kernels.  However, it has the advantages stated above: 1) it eliminates boundary bias, 2) it is an adaptive approach to density estimation, and 3) as shown in the references below, the variance of the estimator decreases as the sample size increases.
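To make Equation [5] concrete, here is a minimal sketch of the bivariate case in Matlab, assuming the pseudo observations U (an N-by-2 matrix with entries in [0,1]) have already been computed.  The toolbox function empcopulapdf, used later in this post, implements this idea more completely; the loop below is only illustrative.

h = 0.05;                          % kernel bandwidth
K = 25;                            % grid points per dimension
[u1, u2] = meshgrid(linspace(0,1,K));
N = size(U,1);
c_hat = zeros(K,K);
for n = 1:N
    % Beta kernels evaluated at the n-th pseudo observation, with shape
    % parameters (u/h + 1, (1-u)/h + 1) that vary over the grid
    k1 = betapdf(U(n,1), u1/h + 1, (1-u1)/h + 1);
    k2 = betapdf(U(n,2), u2/h + 1, (1-u2)/h + 1);
    c_hat = c_hat + k1.*k2;        % product kernel, summed over samples
end
c_hat = c_hat/N;                   % Equation [5]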

Implementation

All copula operations depend upon the u_i rather than the x_i.  The u_i’s are termed pseudo observations, and there are two ways of estimating them.  We discuss the pros and cons of the different ways to estimate the pseudo observations before diving into the details of estimating the copula function and the copula density directly.

Generating Pseudo Observations

There are at least two ways to generate pseudo observations:

  1. The rank based approach: U_i = \frac{R_i}{n+1}, where R_i denotes the ranks of the samples in dimension i.
  2. The empirical CDF approach: U_i = \hat{F}_{X_i}(x_i[m]), where \hat{F}_{X_i} is an estimate of the marginal CDF of dimension i.

The Matlab function pseudoobs has the ability to generate pseudo observations using both methods described above.  Briefly, the rank based method (option 1 above) can be implemented as:

U = tiedrank(X)/(M+1); % scale by M+1 to mitigate boundary errors

In the code above, each column is ranked and scaled to achieve a uniform distribution from the observed data.  To implement the empirical CDF approach (option 2 above), we do the following

numEcdfPts = 100;   % number of points in the domain of each estimated CDF
[M, N] = size(X);   % M samples, N dimensions
U = zeros(size(X));
for nn=1:N
    domain = linspace(min(X(:,nn)),max(X(:,nn)),numEcdfPts);
    FX = ksdensity(X(:,nn), domain, 'function', 'cdf')';   % kernel estimate of the marginal CDF
    empInfoObj = rvEmpiricalInfo(domain, [], FX);           % toolbox object wrapping the estimated CDF
    for mm=1:M
        U(mm,nn) = empInfoObj.queryDistribution(X(mm,nn));  % evaluate the CDF at each sample
    end
end

It can be seen that the ECDF method requires estimation of each marginal empirical density function, and then querying that distribution for each of the samples.  From a computational perspective, the rank approach is faster due to vectorization and modern sorting algorithms being lightning quick!

Both are valid approaches to generating pseudo observations, but if we do not know the marginal distributions a-priori (which is usually the case), the rank based approach  actually produces a U_i that is more uniformly distributed than the empirical CDF approach!  The rank based approach is able to do this because it evenly distributes the original observations into the unit cube (by virtue of the rank function).  To show this in Matlab, we can generate samples from a joint distribution and compute the pseudo observations using both methods, and compare the distributions of the computed pseudo observations.  To generate samples from a joint distribution, we can utilize the copularnd function as follows:

M = 1000;
alpha = 5;
u = copularnd('Frank', alpha, M);
mu = 2; a = 3; b = 2;
x = [expinv(u(:,1),mu), gaminv(u(:,2),a,b)];

After generating the dependent random variables, we compute the pseudo observations using the pseudoobs function with both the rank algorithm and the ECDF algorithm as follows:

numECDFPts = 100;
U_ecdf = pseudoobs(x, 'ecdf', numECDFPts);
U_rank = pseudoobs(x);  % default option for pseudoobs is rank

Finally, we compute an empirical density via a histogram for the two different pseudo observations.  The results are shown below.

Histograms of the pseudo observations: rank based vs. empirical CDF based

A consequence of “more” uniformity is that rank based pseudo observations actually produce lower variance estimates of the empirical copula.  For intuition on this, please refer to Reference [2].  We show this empirically with a Monte-Carlo simulation: we generate samples from a joint distribution and compare the variance of the estimated copula density when the pseudo observations are computed with the ECDF method versus the rank method.  We only demonstrate this for a single point in the unit hypercube, but the results apply to the entire space \mathbf{I}^d.  The results are shown below:

Variance comparison of copula density estimates using ECDF vs. rank based pseudo observations

It is clear from the figure that the variance is lower when using pseudo observations from the rank based estimator.  The proof and intuition of using the rank based algorithm for generating pseudo observations is given in Reference [2], and the full simulation to generate the figures above is located here.
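A condensed sketch of such a comparison is shown below.  It assumes the toolbox functions pseudoobs and empcopulapdf behave as described in this post, and in particular that empcopulapdf returns its estimate on a K-by-K grid; the linked simulation is more thorough.

numMC = 100; M = 500; K = 25; h = 0.05; numECDFPts = 100;
c_rank = zeros(1,numMC);
c_ecdf = zeros(1,numMC);
for mc = 1:numMC
    u = copularnd('Frank', 5, M);
    x = [expinv(u(:,1),2), gaminv(u(:,2),3,2)];             % impose non-uniform margins
    c_r = empcopulapdf(pseudoobs(x), h, K, 'betak');         % rank based pseudo observations
    c_e = empcopulapdf(pseudoobs(x,'ecdf',numECDFPts), h, K, 'betak');
    c_rank(mc) = c_r(round(K/2), round(K/2));                % density estimate at one grid point
    c_ecdf(mc) = c_e(round(K/2), round(K/2));
end
fprintf('var(rank) = %g, var(ecdf) = %g\n', var(c_rank), var(c_ecdf));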

Estimation via Empirical Copula Function and Gradients

Having discussed how to generate pseudo observations, let us tackle the problem at hand of estimating the empirical copula density.  We first calculate the empirical copula by

\hat{C}(\mathbf{u}) = \frac{1}{n} \sum_{i=1}^n \mathbf{I}(\tilde{U}_1^i \leq u_1, \dots,\tilde{U}_d^i \leq u_d)\ \ \ \ \ \ \ \ \ \ \ [6]

where U_i are the pseudo observations.  The function empcopulacdf implements Equation [6], and is included in the copula Matlab toolbox.  Using this, the pseudoobs function, and gradient, we can easily estimate a copula density from samples as follows:

K = 25;                                   % number of grid points per dimension
U = copularnd('Frank', 5, 1000);          % samples are already on [0,1]^2, so no pseudoobs call is needed here
C_est = empcopulacdf(U, K, 'deheuvels');  % empirical copula (Equation [6]) on a K-by-K grid
[cc] = gradient(C_est);                   % numerical derivative along one grid dimension
[~,c_est_grad] = gradient(cc);            % derivative along the other dimension -> copula density (up to grid scaling)

Direct Empirical Copula Density Estimation

To estimate the density directly, we use the empcopulapdf function which implements Equation [5] above.  This function has a similar call signature to empcopulacdf, with an additional bandwidth argument h.  Using the same data source as above, we can generate a copula density estimate as follows:

h = 0.05;
c_direct = empcopulapdf(U, h, K, 'betak');

Comparison

Let us now compare these two methods of density estimation by plotting the results against the true copula density for the Frank copula with a dependency parameter of 5.  The figure below shows three plots of the Frank copula density.  The plot on the left is the true copula density, the plot in the middle shows the density derived by first estimating the empirical copula and then numerically differentiating it, and the plot on the right shows the empirical copula density generated from the outlined Beta kernel approach.

True Frank copula density (left), numerical differentiation of the empirical copula (middle), Beta kernel estimate (right)

The source for generating these diagrams is located in test_empcopulacdf.m.

Conclusion

In this post, we have discussed two different methods for estimating a copula density directly from data.  From the limited experiments conducted above, two recommendations can be given:

  1. Use a rank based approach to generating pseudo observations.
  2. If the empirical copula density is what is required, estimate it directly from data using the Beta kernel approach outlined above, as opposed to computing the empirical copula function and numerically differentiating it.

In the next post, we’ll discuss the mathematics behind generating random samples from an empirical copula, using the empcopularnd function.

References

  1. Beta kernel estimators for density functions
  2. The Estimation of Copulas: Theory and Practice

QPSK Burst Receiver Synchronization

Introduction

In this post, I’d like to discuss the signal acquisition algorithms that need to be performed in order to properly receive and demodulate a bursty QPSK signal in a wireless multipath environment.

Background

In the wireless communications world, receiver design is traditionally more complex than transmitter design.  This is mainly due to the receiver needing to perform large amounts of pre-processing in order to synchronize itself with respect to time and frequency to the transmitter, as well as account for wireless channel effects such as multipath.  Modern wireless communications standards (4G and beyond) have attempted to put more intelligence into the transmitter in order to simplify receiver design, but the fact remains that the receiver still needs to account for these effects.

In this post, we will consider the scenario where we have a burst mode QPSK transmitter and receiver and detail the acquisition algorithms that a receiver would need to perform in order to properly demodulate the signal.  Acquisition in a wireless receiver typically refers to time synchronization, frequency synchronization, and accounting for wireless channel effects such as multipath.  Let’s discuss each of these effects in detail.

Time Synchronization

Time synchronization is the process of aligning the receiver sample clock with the transmitter sample clock.  It is well known in digital communication theory that sampling at the optimal point gives the receiver the highest probability of making a correct decision on a given symbol.  Intuitively, sampling at the correct moment in time gives the receiver the maximum energy point at which to threshold and make a decision.  We illustrate this concept with a simplified example in the diagram below.  Suppose that we would like to determine the peak-to-peak voltage of a sinusoid.  If the sample clock is aligned properly, sampling the signal captures the maximum energy (shown by the blue arrow), while a misaligned clock (shown by the red arrow) results in reduced energy in the sample.

Clock Alignment example

Frequency Synchronization

Frequency synchronization refers to ensuring that the receiver is locked and tracking the frequency drift of the transmitter, which can be caused by many real-world effects such as LO drift, Doppler, etc.  Again, as is the case with time synchronization, lack of frequency synchronization adversely affects the receiver.  A typical symptom of lack of frequency synchronization is a rotating constellation.  The effect of a rotating constellation for a receiver is that the receiver will incorrectly map the symbols to bits, because they are perceived to be in different locations on the constellation plane.  The figure below shows where actual QPSK points are expected (in the blue dots), and where they would end up with a frequency offset (shown by the X’s).  To be clear, a phase mismatch between the transmitter and receiver would result in a static rotation of the constellation, but a center frequency offset would result in a rotating constellation, with the rotation speed proportional to the frequency offset.

Effect of lack of frequency synchronization [1]

An additional complication in a wireless receiver is that the time and frequency corrections also need to be tracked for the entire duration of the transmission. In the continuous signaling case (and in some burst cases), this is typically accomplished through the use of a Phase Lock Loop (PLL) and Frequency Lock Loop (FLL) working in conjunction with each other. The PLL/FLL approach works well for continuous signals because once initial synchronization is achieved, the feedback nature of the loops allows the receiver to track the variations. In burst situations, however, the transmission toggles between on and off at a duty cycle that may or may not be deterministic. If the PLL and FLL need more data to synchronize than the initial preamble which is typically present in burst communications, data will be lost. Considering the bursty nature of the signal, we have chosen to use an adaptive filter to perform the bulk of the initial synchronization. It is important to note that we detail just one method of performing acquisition, and that this is not the only approach (or even necessarily the best).

At a high level, this approach views acquisition as an FIR filtering problem. The premise is that there is an FIR filter (optimal in the MMSE sense) that can perform time and frequency synchronization, as well as account for wireless channel effects. The question then becomes, what do the FIR filter taps need to be in order to perform acquisition? Another question is, can the taps remain constant, or do they need to change over time? The answers to these questions can be found by reviewing some wireless communication theory.

Theory

A wireless channel can be approximately modeled as a linear system, meaning that mathematically it has an impulse response. The impulse response of a wireless channel at a certain time T_0 may look like the diagram below, which shows three multipath components.

Example of a multipath channel impulse response

The channel impulse response shown has three multipath components, which means that the receiver will see three copies of the transmitted signal, overlapped on top of each other at different amplitudes.  This is the effect of multipath, and needs to be accounted for if a receiver is to properly demodulate a signal.

If the channel impulse response (such as the one shown above) is known a-priori to the receiver, then its effects can be removed by applying the appropriate filter.  From linear systems theory, we know that if we can model the channel with a certain impulse response, then the received signal is simply the convolution of the transmitted signal with that impulse response.  Therefore, if the channel response is known beforehand, the receiver can apply another convolution operation with the appropriate taps such that the channel effects are removed.  The complication is that, typically, receivers don’t know the channel impulse response.
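As a toy illustration of this idea (and not the method used later in this post), the sketch below convolves QPSK symbols with a hypothetical, known 3-tap channel and then removes the channel with a naive zero-forcing inverse computed in the frequency domain.  This ignores noise enhancement, which is one reason practical receivers prefer an MMSE-style solution such as the Wiener filter derived below.

h_ch = [1 0.5 0.2];                                            % example known channel impulse response
tx = (sign(randn(1,1000)) + 1j*sign(randn(1,1000)))/sqrt(2);   % random QPSK symbols
rx = conv(tx, h_ch);                                           % multipath: overlapped, scaled copies of tx
Nfft = 4096;
eq = ifft(fft(rx, Nfft)./fft(h_ch, Nfft));                     % zero-forcing: divide out the channel
eq = eq(1:length(tx));                                         % trim back to the original length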

It is important to note that the impulse response of a wireless channel can (and often does) vary with time.  Therefore, in the strict sense, a *time-varying* impulse response would be the most accurate model for the channel.  However, the time-varying nature of the channel is often discretized into time steps of \tau, where \tau is the coherence time.  Coherence time is a measure of approximately how long the channel remains constant (i.e. how long the impulse response is constant).  From this, it is clear that in the general case different FIR filter taps will be needed to process each burst, as the channel could have changed between burst n and burst n+1.

The Solution

Therefore, we need some system which can effectively compute the channel impulse response and derive the taps independently for each burst. Using an adaptive filter is one way to accomplish this task. The block diagram of a general adaptive filter is shown below.

A general block diagram for an adaptive filter.

As seen in the diagram, an adaptive filter takes an input x[k] and a desired signal d[k], and derives filter weights W[k] such that the error between the output of the filter, y[k], and the desired output d[k] is minimized.  In order to use the adaptive filter in our system, we need to detect the preamble, perform the adaptation process over the duration of the preamble to determine the optimal weights, and then filter the remaining portion of the burst with the derived weights.  The input signal is our burst, and the desired signal is the preamble (assumed to be known a-priori when the transmitter and receiver decide on a communications protocol to use).  The adaptation process therefore derives FIR filter weights such that, if the received preamble were filtered with them, the reconstructed preamble would be as close as possible to the original preamble in the MMSE sense.  As noted above, adaptation stops after the preamble, so a key assumption of this design is that the burst length (in time) is shorter than the coherence time of the channel.

The adaptive filter also has the effect of taking care of synchronization. Intuitively, this makes sense because the FIR filter is adapting to the received signal, which is corrupted by all of these effects, including sample clock mismatch (which contributes to time synchronization), LO mismatch (which contributes to frequency synchronization), and channel effects such as multipath, while the reference signal (the desired signal in adaptive filtering parlance) is the properly synchronized, noise-free preamble.
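For concreteness, a bare-bones complex LMS adaptation over the preamble might look like the sketch below.  The variable names preambleRx (the received preamble samples) and d (the known preamble symbols) are hypothetical, and the implementation described later in this post uses a closed-form Wiener solution rather than iterative adaptation.

numTaps = 8;                               % equalizer length (example value)
mu = 0.01;                                 % LMS step size (example value)
w = zeros(numTaps,1);
for k = numTaps:length(preambleRx)
    xk = reshape(preambleRx(k:-1:k-numTaps+1), [], 1);  % most recent numTaps input samples
    yk = w'*xk;                            % filter output
    ek = d(k) - yk;                        % error against the known preamble symbol
    w  = w + mu*conj(ek)*xk;               % complex LMS weight update
end
% after the preamble, w is frozen and used to filter the rest of the burst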

Implementation in SDR

From an implementation perspective, an important point that was glossed over previously is when the adaptive filter begins the adaptation process.  It is imperative that the adaptive filter begin adaptation at the exact sample that the preamble begins in the received signal.  If adaptation did not start at the exact sample of the received preamble, the adaptive filter would not generate the correct weights because it is not aligned to the preamble. The reason this must be pointed out when considering implementation becomes clear when we examine the typical signal processing chain for a burst receiver in SDR.

High Level Approach

The first step in a burst processing system is typically a burst detector, which outputs the burst along with some samples before the burst begins and some samples after it ends. The goal of the burst detector is to prevent the rest of the signal processing chain from processing data that doesn’t contain any information.  The input to and output of a burst detector is shown in the figure below, where it can be seen that the input has a lot of extraneous zeros before and after the burst that are trimmed off by the burst detector.

Burst Detection Block

As stated above, it is imperative for the adaptive filter to know exactly at which sample it should begin adaptation.  After burst detection, it then makes sense to determine the exact start of the preamble.  One way to do this is to cross-correlate the received signal with the known preamble; the peak of the cross-correlation yields the exact beginning of the preamble.  However, depending on the correlation properties of the preamble, the cross-correlation may not work properly if the signal’s center frequency offset (CFO) is not within some tolerance.  Therefore, a safe first step is to perform coarse CFO correction before the cross-correlation.  Finally, after precise detection of the start of the preamble, filtering can be performed to equalize the signal.  The processing chain that we now have to perform can be summarized in the following diagram.

Typical Burst Receiver Processing Flow

Let us now explore how to build each one of these blocks.

Burst Detection

Burst detection is typically implemented as a simple energy threshold mechanism, where the input signal’s energy is calculated.  When it rises above a certain threshold that can be determined statically or dynamically, samples are captured and tagged as a burst.  Some Matlab code which implements a burst detector is shown below.

maSize = 32;  % moving-average window length (example value)
inputSigPower = (abs(x).^2)/2;  % instantaneous power of the complex input x

% perform moving average of the signal power
kernel = ones(1,maSize)/maSize;
inputSigPowerSmoothed = conv(inputSigPower, kernel);
inputSigPowerSmoothed = inputSigPowerSmoothed(1:length(inputSigPower));

thresh = 3;
state = 0;
bursts = {};
burstsIdx = 1;
for ii=1:length(inputSigPowerSmoothed)
    if(state==0)
        if(inputSigPowerSmoothed(ii)>thresh)
            burstStartIdx = ii;
            state = 1;
        end
    else
        if(inputSigPowerSmoothed(ii)<thresh)
            burstEndIdx = ii;
            state = 0;
            
            % record the burst samples (from the original signal x) into a vector
            bursts{burstsIdx} = x(burstStartIdx:burstEndIdx);
            burstsIdx = burstsIdx + 1;
        end
    end
end

Coarse CFO Estimation and Correction

The next step is to perform coarse CFO estimation and correction. It is known that the information content of an M-PSK signal can be removed by raising the signal to the Mth power. Consequently, because this is a QPSK signal, we can remove the information content by taking the 4th power of the signal. Taking the FFT of the 4th power of the signal reveals the coarse CFO. After detection, the effect of the CFO can be removed by multiplying the signal by a complex exponential.

x4 = x.^4; % raise QPSK to the 4th power to strip the modulation and expose a tone at 4x the CFO
X4 = fftshift(abs(fft(x4)).^2);
f = linspace(-Fs/2,Fs/2,length(x4));
[~, maxIdx] = max(X4);
cfoEstimate = f(maxIdx)/4; % divide by 4 to undo the 4th power operation

to = (0:length(x)-1)/Fs;
freqCorrVector = exp(-j*2*pi*cfoEstimate*to);
y = x.*freqCorrVector;

Determining exact start of Preamble in the received sequence

As discussed above, the next step is to perform a cross-correlation to determine the exact start sample of the preamble.  The code sample below shows how to determine the exact start of the preamble using a cross-correlation.

preCrossCorr_cmplx = xcorr(preSyms',eqBurst');  % preSyms is the known preamble symbols, eqBurst is the received symbols
preCrossCorr = abs(preCrossCorr_cmplx);
[maxVal,maxIdx] = max(preCrossCorr);
preambleIdxStart = length(eqBurst) - maxIdx + 1;

eqIn = eqBurst(preambleIdxStart:end);  % extract symbols which will be used for equalization (received preamble symbols only)

After performing all of the pre-processing steps, the final step is to perform the adaptive filtering. The adaptive filtering consists of decimating the input signal to rate match what the equalizer is expecting.  After rate matching, the optimal filter must be determined.  Finally, the optimal filter is applied, followed by a simple first order PLL to “clean up” the output constellation.  Let us discuss each of these operations in detail.

Rate-Matching

The first step in the adaptive filtering process is to rate-match the data being equalized and the equalizer.  By rate-matching, we mean that the equalizer should be designed so that it knows how many samples per symbol to expect at its input.  Typically, receiver sampling rates are set to a minimum of 2 samples per symbol.  The choice is whether to design the equalizer such that it accepts the number of samples per symbol at which the input signal is sampled, or to decimate/interpolate the signal to the number of samples per symbol required by the equalization algorithm.  In our specific implementation, we assume that the receiver samples the signal at 2 samples per symbol and that the equalizer is designed to operate at 1 sample per symbol. Therefore, we decimate the input signal to accomplish the rate matching.

eqIn_div2 = eqIn(1:2:end); % decimate

Determining the Optimal Filter

In order to determine the optimal filter, we essentially run the adaptive filter shown above. It should be cautioned, however, that if the training sequence is short, the adaptive filter may not converge to the optimal solution within the number of available training samples (this is due to many factors, including the learning rate, the amount of noise in the signal, etc.). In order to overcome this, we can apply the closed form solution to the adaptive filtering problem, which reduces to determining the optimal Wiener filter [2].  To determine the optimal Wiener filter, we first create an autocorrelation matrix of the input [3].  This is done by computing the autocorrelation of the input sequence, and then creating a Toeplitz matrix from the autocorrelation sequence.  The code below shows how to create the autocorrelation matrix.

[~, d_n] = genPreamble();          % generate the known preamble sequence
x_n = eqInput(1:length(d_n));      % eqInput is the rate-matched received burst (eqIn_div2 above)
x_n = x_n(:);

% generate the input correlation matrix
X = fft(x_n,2^nextpow2(2*size(x_n,1)-1));
X_magSq = abs(X).^2;
rxx_ifft = ifft(X_magSq);
m = length(x_n);
rxx = rxx_ifft./m; % Biased autocorrelation estimate

% creates: http://en.wikipedia.org/wiki/Autocorrelation_matrix
toeplitzMatCol = rxx(1:m);
toeplitzMatRow = conj(rxx(1:m));
R = toeplitz(toeplitzMatCol,toeplitzMatRow);

After creating the autocorrelation matrix, we then create what we call the P vector. The P vector is the cross-correlation of the received preamble sequence with the reference preamble sequence. Intuitively, it is a representation of how close the received preamble sequence is to the reference preamble sequence. With the P vector and the R matrix, we can set up the filter weight problem as a linear algebra problem. We have \mathbf{Rw} = \mathbf{P} (the Wiener-Hopf equations), and we solve for \mathbf{w}, which is the optimal weight vector. The code below shows how to accomplish these steps.

xc = xcorr(d_n, x_n);
P_row = xc(1:m);
P = P_row(:);

% solve the optimal weights problem
w = R\P;

Finally, we scale the weight vector so that the maximum filter tap has magnitude 1, which keeps the energy of the optimal filter’s output comparable to the energy of its input. The code below shows the filter scaling operation:

wOpt = w./max(abs(w));

Filtering

After determining the filter coefficients, the next step is to filter the actual signal. Filtering can be performed either with an FFT based filter, or a convolution filter. An FFT based filter implementation is shown in the code below:

wOpt = wOpt(:).';  % ensure a row vector so it can be zero-padded alongside eqIn_div2
fftSize = length(eqIn_div2)+length(wOpt)-1;
eqIn_div2_ext = [eqIn_div2 zeros(1,fftSize-length(eqIn_div2))];
wOpt_ext = [wOpt zeros(1,fftSize-length(wOpt))];
whFilt_unscaled = ifft(fft(eqIn_div2_ext).*fft(wOpt_ext));  % linear convolution via zero-padded FFTs

whFilt_mean = mean(abs(whFilt_unscaled));
whFilt = whFilt_unscaled./whFilt_mean;

PLL (Clean up)

Finally, in order to clean up any residual phase offsets, a simple first order PLL is implemented. The code for the PLL is shown below:

function [ y ] = qpskFirstOrderPLL( x, alpha )

phiHat = 0;
y = zeros(1,length(x));
for ii=1:length(x)
    y(ii) = x(ii)*exp(-j*phiHat);

    % demodulating circuit: hard decision to the nearest QPSK constellation point
    if(real(y(ii))>=0 && imag(y(ii))>=0)
        % 1 + 1j
        xHat = exp(1j*pi/4);
    elseif(real(y(ii))>=0 && imag(y(ii))<0)
        % 1 - 1j
        xHat = exp(1j*7*pi/4);
    elseif(real(y(ii))<0 && imag(y(ii))<0)
        % -1 - 1j
        xHat = exp(1j*5*pi/4);
    else
        % -1 + 1j
        xHat = exp(1j*3*pi/4);
    end

    phiHatT = angle(conj(xHat)*y(ii));
    phiHat = phiHatT*alpha + phiHat;
end

end

Conclusion

In this post, we have discussed how to implement an equalizer for a bursty QPSK signal. Matlab code samples were shown, showcasing a possible implementation of the burst synchronizer. The complete, working Matlab and C++ implementation of this synchronizer (in the GNURadio framework) is located here: https://github.com/gr-vt/gr-burst/.  More specifically, the Matlab files to look at for the synchronizer are:

  1. https://github.com/gr-vt/gr-burst/blob/master/matlab/qpskSyncBurst.m
  2. https://github.com/gr-vt/gr-burst/blob/master/matlab/qpskBurstCFOCorrect.m
  3. https://github.com/gr-vt/gr-burst/blob/master/matlab/weiner_filter_equalize.m
  4. https://github.com/gr-vt/gr-burst/blob/master/matlab/qpskFirstOrderPLL.m

For a complete C++ implementation, please see:

  1. https://github.com/gr-vt/gr-burst/blob/master/include/burst/synchronizer_v4.h
  2. https://github.com/gr-vt/gr-burst/blob/master/lib/synchronizer_v4_impl.h
  3. https://github.com/gr-vt/gr-burst/blob/master/lib/synchronizer_v4_impl.cc

For an example of how to integrate this block into a full-fledged communication system, please see http://oshearesearch.com/2015/04/02/burstpskmodem/

Here is a talk I gave on this at Cyberspectrum meeting in Arlington:

References

[1] https://commons.wikimedia.org/wiki/File:QPSK_Freq_Error.svg

[2] https://en.wikipedia.org/wiki/Wiener_filter

[3] https://en.wikipedia.org/wiki/Autocorrelation_matrix

Happy climbing!


PCA, Statistical Whitening, and their relation to ICA

Introduction

In this post, I’d like to solidify the concepts of PCA and Statistical Whitening, as well as explain how they relate to ICA.  While studying ICA and being a newcomer to the topic, I found that I was confused by how these three important techniques are related to each other.  I hope that this post clarifies that for others who may have similar questions.  We’ll start by reviewing bases from linear algebra, then discuss PCA and the intuition behind the algorithm.  Next, we will discuss statistical whitening and how it relates to PCA.  Finally, we’ll look at ICA and how it relates to both.

Bases (Linear Algebra)

Before we dive into PCA, Whitening, and ICA, it is helpful to review some basic linear algebra concepts about bases.  A basis (plural bases), in terms of linear algebra, is a set of linearly independent vectors that can represent any vector in a vector space.  One basis for data in \mathbb{R}^2 (2 dimensions) is \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.  This basis is also called the standard basis.  You can visualize the basis as the axes of the coordinate system in which the data exists.  We use bases to represent data in the dimension that the basis covers, and the same data point can be represented in multiple bases.  For example, suppose we have data in \mathbb{R}^2 and two bases to represent it in: the standard basis \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} and another basis \begin{bmatrix} 3 & {-4} \\ 1 & 2 \end{bmatrix}.  Now, suppose that we have a vector \mathbf{v}, as shown in the figure below.  In the figure, the standard coordinates are shown with black axes and a yellow grid, while the \beta-coordinates are shown with blue axes and a cyan grid.  It is easy to see that in the standard basis, we can represent \mathbf{v} by (2,3).  That same vector can be represented in the \beta basis as (1.6,0.7) by applying the change of basis matrix to the standard coordinates.  For further details on basis transformation, please refer to this excellent tutorial.
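A quick way to verify the numbers in this example is to solve the change of basis equation directly in Matlab:

B = [3 -4; 1 2];    % columns are the beta-basis vectors
v = [2; 3];         % v expressed in the standard basis
c = B\v;            % coordinates of v in the beta basis -> [1.6; 0.7]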

The vector v shown in both the standard basis and the beta basis

PCA

PCA, which stands for Principal Component Analysis, is a statistical procedure which, at a very high level, can be viewed as an algorithm for finding a more “meaningful” basis for the data to be analyzed.  For PCA, meaningfulness is measured in terms of variance and covariance.  A basis vector with higher variance is considered more meaningful than a basis vector with lower variance.  The reason this is important is that when we collect multi-dimensional real world data, it is typically corrupted by noise.  Additionally, several dimensions (features) of the multi-dimensional data may be redundant.  PCA is an algorithm used to filter out the noise as well as reduce the redundancy in the data set.

Now that we understand what PCA is supposed to do, let’s examine why variance is part of the measure of meaningfulness of the data.  Variance is a measure of the spread of a set of numbers about the expected value.  A higher variance corresponds directly to more spread, which can be interpreted from an information theoretic standpoint as more informative, because the data is inherently harder to predict if it is more spread out (has more entropy).  Another way to think about variance is through the concept of Signal-to-Noise Ratio (SNR).  SNR is defined as the ratio of signal power to noise power, which turns out to be the ratio of the variance of the signal to the variance of the noise (because variance is a measure of power).  The higher the SNR, the less noisy, or the “cleaner,” the signal.  If we assume that in general the data we observe has a reasonable SNR (which is a requirement if we want to do anything with the data to begin with), then high SNR corresponds to high signal variance, due to how SNR is defined as a ratio of variances.

The other metric that was defined was covariance.  Covariance is a measure of linear dependency between two random variables.  If the covariance between two variables was high, this means that they are strongly linearly related.  Therefore, having a low covariance implies less (linear) redundancy in the data set.  As an aside, I emphasize linearity of the dependency metric here because it is extremely important.  People often use the words correlation and dependency synonymously, but that is not statistically accurate except in the Gaussian case!  We will explore this topic in more detail in another post, but for now let us continue our discussion of PCA.

Now that we have defined what a more meaningful basis is, how would we design an algorithm to find one?  That is, what mathematical operation(s) can we perform on a data set so that we find directions where the variance is high and the covariance between different dimensions of the data is low?  Recall that the variance and covariance information of a multidimensional data set is captured by the covariance matrix.  The diagonal entries of a covariance matrix represent the variances of the individual dimensions, and the off-diagonal terms represent the covariances between these variables.  Intuitively, what we want is a transformation such that the covariance matrix of the transformed data has diagonal terms in descending order (so that the most informative directions come first) and off-diagonal terms as close to zero as possible.  Essentially, we’d like a transform such that the covariance matrix of the transformed data is diagonal.  It is important to note here that a diagonal covariance matrix means that the features of the data are no longer correlated.  Therefore, another way to state the algorithmic goal of PCA is to find a decorrelation matrix, which in our context rotates the data from a naive basis to a more meaningful one.

So how do we find a decorrelation matrix? Suppose our (zero-mean) data is in a matrix \mathbf{X} \in \mathbb{R}^{d \times n}, where d is the number of features and n is the number of data points captured for each feature.  The covariance matrix of \mathbf{X} is then calculated as \Sigma_{X} = E[\mathbf{XX^T}].  We now desire a transform matrix \mathbf{P} such that the covariance of \mathbf{PX} is diagonal.  Armed with this intuition (and skipping a few steps), we find that \mathbf{P} = \mathbf{E^T}, where \mathbf{E} is the matrix of eigenvectors of the covariance matrix \Sigma_X.
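A minimal sketch of this procedure in Matlab, following the convention above (features as rows of X, zero-mean data), might look like the following; the example data is arbitrary.

d = 3; n = 1000;
X = [2 0 0; 0.5 1 0; 0 0.2 0.1]*randn(d, n);   % correlated example data, features as rows
X = bsxfun(@minus, X, mean(X, 2));             % remove the mean of each feature
SigmaX = (X*X')/n;                             % d-by-d covariance matrix
[E, D] = eig(SigmaX);
[vars, idx] = sort(diag(D), 'descend');        % order components by descending variance
E = E(:, idx);
Y = E'*X;                                      % rotate the data into the PCA basis
% cov(Y') is approximately diagonal, with the variances in descending order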

Statistical Whitening Transform

The statistical whitening transform is a set of operations performed on a multivariate data set which results in a decorrelated data set with unit variance in every dimension.  Recall that PCA also results in a decorrelated data set, but the variances are not uniform.  In fact, the variances of a data set transformed by PCA are in descending order (recall that this was a requirement of PCA).  Intuitively, then, we can achieve statistical whitening by first performing PCA and then multiplying by another matrix such that the diagonals of the resultant data set’s covariance matrix are unity.  Even though it wasn’t shown mathematically above, the diagonals of the covariance matrix after PCA correspond to the eigenvalues of \Sigma_X, which are stored in a diagonal matrix \mathbf{D}.  It can be shown that left multiplying by \mathbf{D}^{-1/2} has this effect.  Therefore, statistical whitening of a dataset in matrix \mathbf{X} can be achieved as follows: \mathbf{Y} = \mathbf{D}^{-1/2} \mathbf{E}^T \mathbf{X}, where \mathbf{Y} is the whitened data.
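Continuing the PCA sketch above, whitening just scales each principal component by the inverse square root of its variance:

Z = diag(1./sqrt(vars))*Y;      % equivalently, D^(-1/2) * E' * X
% cov(Z') is approximately the identity matrix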

Whitening is also sometimes referred to as sphering, because of the resultant shape of the data.  In the figure below, the left sub-figure is the joint density of two uncorrelated Gaussian random variables, the middle is the joint density of two correlated Gaussian random variables with differing variances, and the sub-figure on the right shows the output of the whitening transform applied to the middle data set.

Sphering: uncorrelated Gaussian density (left), correlated Gaussian density (middle), whitened version of the middle data set (right)

ICA

Now that we understand PCA and its relation to statistical whitening, let us explore ICA and see how these concepts relate to each other.  ICA, which stands for Independent Component Analysis, is a statistical procedure which attempts to separate sources that have been mixed together, by exploiting statistical independence.  One interpretation of ICA is similar to PCA, in the sense that it tries to extract the most “meaningful” basis for representing the multivariate data set.  However, instead of using variance and covariance as the measure of meaningfulness of a basis, as was done in PCA, it uses statistical independence.

We won’t delve into the details of ICA here; rather, the goal is to explain how PCA and statistical whitening fit into the ICA algorithm.  However, we do need to discuss some of the math behind ICA.  Mathematically, ICA can be expressed by the equation \mathbf{x} = \mathbf{As}, where \mathbf{x} is the observed vector of data, and we try to recover \mathbf{s} without knowledge of \mathbf{A}.  For this to be a tractable problem, several assumptions need to be made.  One is that the sources, represented by \mathbf{s}, are statistically independent.  Another important assumption is that the sources are non-Gaussian (although that will not be germane to our discussion here).  Finally, the sources are all assumed to be of unit variance.  In practice, although data may not be present in nature with unit variance, it can always be scaled to achieve unit variance, so this isn’t really an “assumption.”  With this background, let us proceed mathematically.

Because the sources were assumed to be independent and unit variance, we know that the covariance matrix of the sources should be the identity matrix, i.e. E[\mathbf{s}\mathbf{s^T}] = \mathbf{I}.  Now, computing the covariance matrix of the observed data, we can make substitutions as follows: E[\mathbf{x}\mathbf{x^T}] = E[\mathbf{As}\left(\mathbf{As}\right)^T] = E[\mathbf{As}\mathbf{s^TA^T}] = \mathbf{A}E[\mathbf{s}\mathbf{s^T}]\mathbf{A^T} = \mathbf{AA^T}.  Recall from linear algebra that any matrix can be decomposed using the Singular Value Decomposition (SVD).  Let us suppose that \mathbf{A} = \mathbf{U}\Sigma \mathbf{V^T}, which is the SVD of \mathbf{A}.  Substituting into our previous expression and simplifying, we get that E[\mathbf{x}\mathbf{x^T}] = \mathbf{U}\Sigma^2 \mathbf{U^T}.  We also know that E[\mathbf{x}\mathbf{x^T}] = \mathbf{EDE^T} (through eigenvalue decomposition).  Therefore, we can set up the following relation: \mathbf{EDE^T} = \mathbf{U}\Sigma^2 \mathbf{U^T}.  This implies that \mathbf{E} = \mathbf{U} and \mathbf{D} = \Sigma^2, which implies that \Sigma^{-1} = \mathbf{D}^{-1/2}.

At this point, it is good to take a step back and realize that, in essence, what we are trying to solve for is the unknown mixing matrix \mathbf{A}.  If we knew \mathbf{A}, then we could compute its inverse and plug it into our original equation to recover \mathbf{s}.  From the SVD of \mathbf{A}, we can write \mathbf{A}^{-1} = \mathbf{V}\Sigma^{-1} \mathbf{U^T}.  Substituting the relations we found above, we can show that \mathbf{A}^{-1} = \mathbf{VD^{-1/2}} \mathbf{E^T}.  From our discussion of PCA and statistical whitening, we know that \mathbf{E}^T is the decorrelation matrix (i.e. the PCA rotation) and that \mathbf{D}^{-1/2} is the whitening matrix.  Since these are the first two operations applied to the observed vector \mathbf{x}, we can interpret the first two steps of ICA as performing PCA and then statistical whitening.  The final step is to estimate \mathbf{V} such that the output sources are statistically independent.
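As a small numerical check of these relations (with a hypothetical mixing matrix and sources, which in practice are of course unknown), the sketch below whitens mixed observations and illustrates that only an orthogonal rotation \mathbf{V} remains to be found, which is the job of an ICA algorithm such as FastICA:

n = 5000;
s = [sign(randn(1,n)); sqrt(3)*(2*rand(1,n)-1)];   % independent, unit-variance, non-Gaussian sources
A = [1 2; -0.5 1.5];                               % mixing matrix (known here only for the check)
x = A*s;                                           % observed mixtures
[E, D] = eig(cov(x'));                             % eigendecomposition of the observation covariance
z = diag(1./sqrt(diag(D)))*E'*x;                   % PCA rotation followed by whitening
% cov(z') is approximately the identity, and z = V'*s for some orthogonal V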

Conclusion

In this blog post, we have shown how PCA, Statistical Whitening, and ICA are related to each other.  I hope this has been helpful and clarifying to others studying this field.  Some excellent further resources are:

  1. https://shlens.wordpress.com/tutorials/   (More in depth tutorials on both PCA and ICA)
  2. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf  (Very in depth tutorial on ICA)

What this blog is all about

Hi everybody, and welcome to my blog!

This blog is all about sharing technical knowledge that I have found helpful in my own schooling and career and think other people will benefit from. I’m motivated to do this because I have gained an immense amount of knowledge by reading other people’s blogs regarding technical concepts that are difficult to understand, and hope that whatever I post will be of use to somebody out there.

I’ll also be sharing my thoughts and views on life, philosophy, society, and science.  I find all of these topics fascinating and would like to articulate my thoughts to whoever may be interested.  I am also hoping that these posts will inspire thoughtful discussions about real issues, that in my opinion, we should be thinking about (and more importantly, doing something about!) as a society.

I welcome any comments and suggestions, and hope that you all enjoy!

“Alpinism is the art of suffering”
– Wojciech “Voytek” Kurtyka