forum.alglib.net

ALGLIB forum

All times are UTC


 Post subject: PCA when M >> N
PostPosted: Wed Jul 24, 2013 4:52 pm 
Joined: Wed Jul 24, 2013 4:46 pm
Posts: 4
Hello all,

I have a matrix A of size M x N where M is much greater than N (typical sizes are M = 15,000 and N = 20). We want to apply PCA to this matrix, where there are more variables than observations. (I am aware that the results are not completely reliable, but in our experiments with R they made some sense.)

In R, when such a scenario is presented, the eigenvalues are a vector of length N and the eigenvectors form an N x N matrix.
But in alglib, though the code uses QR decomposition and bidiagonal reduction to find the N eigenvalues, memory is allocated for an M x M matrix. We are running out of memory because of this (as far as I have debugged).

Is there any way we can solve this problem without creating these huge matrices, which would not be useful at all?
Thanks
Santhosh


 Post subject: Re: PCA when M >> N
PostPosted: Sat Jul 27, 2013 10:18 pm 
Joined: Thu Sep 02, 2010 10:43 am
Posts: 9
This is hard because the underlying covariance matrix has size M x M when M is larger than N.

I hope there is a solution, as I'm dealing with this problem right now too.

I use it for an Active Shape Model, where one 3D dataset is concatenated into a single vector in the format xxxyyyzzz, so a fairly ordinary object with 3000 points becomes a 9000-dimensional vector, and this crashes with alglib 3.7.0.

Sergey mentioned a workaround for Windows here:

http://bugs.alglib.net/view.php?id=439


 Post subject: Re: PCA when M >> N
PostPosted: Sat Jul 27, 2013 10:29 pm 
Joined: Wed Jul 24, 2013 4:46 pm
Posts: 4
Actually, I was able to solve this myself. The solution is pretty simple.

Instead of depending on the existing pcabuildbasis module, do the PCA yourself using SVD.

Decompose the matrix A (M x N) using SVD: http://www.alglib.net/translator/man/manual.cpp.html#unit_svd

The number of significant (non-zero) principal components is at most min(M,N), and the corresponding singular values are returned in w. To get the standard deviations of the principal components, divide them by sqrt(min(M,N)-1).

The eigenvectors will be in either U or V, depending on the orientation. I have written the code but do not have it right now; I could post it on Monday. :-)


 Post subject: Re: PCA when M >> N
PostPosted: Sun Jul 28, 2013 2:03 am 
Joined: Thu Sep 02, 2010 10:43 am
Posts: 9
yeah, that would be nice :)

@ Sergey

There seems to be a bug in linalg.cpp.

Some or all methods allocate an "ae_matrix _a" object and copy the incoming source matrix "a" into it, but never use it.

In my tests I got 650 MB of contiguous memory, and with a copy of that the program crashes under 32 bit.

It works fine with smaller data, so this is purely a memory limitation: malloc returns a null pointer for a second object of that size.


 Post subject: Re: PCA when M >> N
PostPosted: Mon Jul 29, 2013 3:12 pm 
Joined: Wed Jul 24, 2013 4:46 pm
Posts: 4
Hey snowflake, here is the code that does PCA for huge matrices. The results match R well.

// The method takes a matrix whose columns represent variables and rows represent observations.
// Set transpose to true if it is the other way around.
// Keep in mind that the number of principal components and principal values is limited by min(numRows, numColumns).
// Please be aware that dataMatrix is modified in this process: it is centered and scaled in place.
// number of eigenvalues: min(nrow, ncol)
// the rotation matrix should be of size: numVariables (ncol, or nrow if transposed) x min(ncol, nrow)
// retx = x * rotation (according to the transpose flag)

int numPcs;
alglib::real_2d_array ptInput;
int nrow, ncol;
if(transpose){
    if(center){
        centerRows(x, numRows, stride);
    }
    if(scale){
        scaleRows(x, numRows, stride);
    }
}else{
    if(center){
        centerColumns(x, numRows, stride);
    }
    if(scale){
        scaleColumns(x, numRows, stride);
    }
}
ptInput.setcontent(numRows, stride, x);
nrow = numRows;
ncol = stride;
numPcs = nrow < ncol ? nrow : ncol;

alglib::real_1d_array w;
alglib::real_2d_array u, vt;
alglib::rmatrixsvd(ptInput, nrow, ncol, 1, 1, 1, w, u, vt);
// The number of observations is ncol when transposed, nrow otherwise;
// the singular values are divided by sqrt(#observations - 1), as in R's prcomp.
double denom = transpose ? sqrt(ncol - 1) : sqrt(nrow - 1);
for(int i = 0; i < numPcs; i++){
    eigV[i] = w[i] / denom;
}

int count = 0;
if(transpose){
    // variables are rows: the principal axes are the left singular vectors (u)
    for(int i = 0; i < nrow; i++){
        for(int j = 0; j < numPcs; j++){
            rotation[count] = u[i][j];
            count++;
        }
    }
}
else{
    // variables are columns: the principal axes are the right singular vectors (rows of vt)
    for(int i = 0; i < ncol; i++){
        for(int j = 0; j < numPcs; j++){
            rotation[count] = vt[j][i];
            count++;
        }
    }
}

if(transpose){
    // the rotation matrix is nrow x numPcs
    for(int i = 0; i < ncol; i++){
        for(int j = 0; j < numPcs; j++){
            double sum = 0;
            for(int k = 0; k < nrow; k++){
                sum += x[k*ncol + i] * rotation[k*numPcs + j];
            }
            retx[i*numPcs + j] = sum;
        }
    }
}else{
    // the rotation matrix is of size ncol x numPcs
    for(int i = 0; i < nrow; i++){
        for(int j = 0; j < numPcs; j++){
            double sum = 0;
            for(int k = 0; k < ncol; k++){
                sum += x[i*ncol + k] * rotation[k*numPcs + j];
            }
            retx[i*numPcs + j] = sum;
        }
    }
}

Utility functions which center and scale the matrix are:
static void centerColumns(double *a, int numRows, int stride){
    for(int i = 0; i < stride; i++){
        double colMean = 0;
        for(int j = 0; j < numRows; j++){
            colMean += a[j*stride + i];
        }
        colMean /= numRows;
        for(int j = 0; j < numRows; j++){
            a[j*stride + i] -= colMean;
        }
    }
}

static void scaleColumns(double *a, int numRows, int stride){
    for(int i = 0; i < stride; i++){
        double sd = 0;
        for(int j = 0; j < numRows; j++){
            sd += a[j*stride + i] * a[j*stride + i];
        }
        sd = sqrt(sd/(numRows - 1));
        for(int j = 0; j < numRows; j++){
            a[j*stride + i] /= sd;
        }
    }
}

static void centerRows(double *a, int numRows, int stride){
    for(int i = 0; i < numRows; i++){
        double rowMean = 0;
        for(int j = 0; j < stride; j++){
            rowMean += a[i*stride + j];
        }
        rowMean /= stride;
        for(int j = 0; j < stride; j++){
            a[i*stride + j] -= rowMean;
        }
    }
}

static void scaleRows(double *a, int numRows, int stride){
    for(int i = 0; i < numRows; i++){
        double sd = 0;
        for(int j = 0; j < stride; j++){
            sd += a[i*stride + j] * a[i*stride + j];
        }
        sd = sqrt(sd/(stride - 1));
        for(int j = 0; j < stride; j++){
            a[i*stride + j] /= sd;
        }
    }
}


 Post subject: Re: PCA when M >> N
PostPosted: Mon Jul 29, 2013 8:36 pm 
Joined: Thu Sep 02, 2010 10:43 am
Posts: 9
Hi santhu,

thanks for the code :)

Here is one thing I wonder about:

Why are you centering/scaling over all values?

It should be done per single row/column, depending on the transposition.

You are centering/scaling over all elements of the data, but it should be two separate values for x and y.

And the rmatrixsvd routine doesn't handle rectangular matrices economically, as far as I found out.

In the original pcabuildbasis routine the matrix grows to a square max(cols, rows) x max(cols, rows) matrix, depending on which one is bigger.

It seems the focus was on the full U and VT matrices, not on the minimized versions, i.e. with uneeded and vtneeded equal to 1 as in your code.

Is your code working, so that you get correctly sized matrices back?

I changed both pcabuildbasis and rmatrixsvd to get what I want.

Do you have a maximum size that your SVD version can handle? My changes crash somewhere between 9000x3 and 9000x6 matrices on 32 bit.

My alglib version is 3.7.0.


 Post subject: Re: PCA when M >> N
PostPosted: Mon Jul 29, 2013 8:44 pm 
Joined: Wed Jul 24, 2013 4:46 pm
Posts: 4
Hi Snowflake,

The centering and scaling is done per single row or column, depending on the transposition.

I checked my code against R and it works perfectly well. My matrices were 40,000 x 24 and there was no crash at all. :-)

Hope it helps,
Santhu


 Post subject: Re: PCA when M >> N
PostPosted: Sun Aug 04, 2013 12:23 am 
Joined: Thu Sep 02, 2010 10:43 am
Posts: 9
Hi santhu,

Yeah, I should have checked it better; now I see that your scaling/centering is right ;)

I'm still working on the rmatrixsvd routine and I'm still not sure if it's right.

Are you developing under 64 bit or 32 bit?

thanks for your help :)


 Post subject: Re: PCA when M >> N
PostPosted: Thu May 15, 2014 5:36 pm 
Joined: Thu May 15, 2014 5:26 pm
Posts: 1
My PCA looks like this. It doesn't use NVARS*NVARS memory, so if you have a massive number of dimensions it will be much lighter. Note that you should prescale/demean beforehand.
Xr is the reduced output; odim is the number of output dimensions. It's a pretty direct translation of the description on Wikipedia =). It would be nice if this were an option in alglib, since it is way more efficient if you don't care about the basis vectors. (Edit: ... and I'm posting in a dead thread.)
Code:
   
void pca(const alglib::real_2d_array& X, alglib::real_2d_array& Xr, int odim)
{
    alglib::real_2d_array U;
    alglib::real_2d_array VT;
    alglib::real_1d_array W;

    std::cout << "svd: " << X.rows() << "x" << X.cols() << "...";
    // uneeded = 1: only the first min(rows, cols) columns of U are computed;
    // vtneeded = 0: VT is never formed, so no NVARS x NVARS allocation.
    alglib::rmatrixsvd(X, X.rows(), X.cols(), 1, 0, 2, W, U, VT);
    std::cout << "  done" << std::endl;

    std::cout << "odims = " << odim << "...";
    Xr.setlength(X.rows(), odim);
    for(int rr = 0; rr < X.rows(); rr++) {
        for(int cc = 0; cc < odim; cc++) {
            Xr(rr, cc) = U(rr, cc) * W[cc];
        }
    }
    std::cout << "  Done" << std::endl;
}





