forum.alglib.net

ALGLIB forum

All times are UTC


 Post subject: PCA when M >> N
PostPosted: Wed Jul 24, 2013 4:52 pm 
Joined: Wed Jul 24, 2013 4:46 pm
Posts: 4
Hello all,

I have a matrix A of size M x N where M is much greater than N (typical sizes are M = 15,000 and N = 20). We want to apply PCA to this matrix, where there are more variables than observations. (I am aware that the results are not completely reliable, but in our experiments with R they made some sense.)

In R, when such a scenario is presented, the eigenvalues are a vector of length N and the eigenvectors form an N x N matrix.
But in alglib, though the code uses QR decomposition and bidiagonal reduction to find the N eigenvalues, memory is allocated for an M x M matrix. We are running out of memory because of this (as far as I have debugged).

Is there any way we can solve this problem without creating these huge matrices, which would not be useful at all?
Thanks
Santhosh


 Post subject: Re: PCA when M >> N
PostPosted: Sat Jul 27, 2013 10:18 pm 
Joined: Thu Sep 02, 2010 10:43 am
Posts: 9
This is hard because the underlying covariance matrix has size M x M when M is larger than N.

I hope there is a solution, as I'm dealing with this problem right now too.

I use it for an Active Shape Model, where one 3D dataset is concatenated into a single vector in the format xxxyyyzzz, so a fairly ordinary object with 3000 points becomes a 9000-dimensional vector, and this crashes with alglib 3.7.0.

Sergey mentioned a workaround for Windows here:

http://bugs.alglib.net/view.php?id=439


 Post subject: Re: PCA when M >> N
PostPosted: Sat Jul 27, 2013 10:29 pm 
Joined: Wed Jul 24, 2013 4:46 pm
Posts: 4
Actually, I was able to solve this myself. The solution is pretty simple.

Instead of depending on the existing pcabuildbasis module, do the PCA yourself using SVD.

Decompose the matrix A (M x N) using SVD: http://www.alglib.net/translator/man/manual.cpp.html#unit_svd

The number of significant (non-zero) principal components is at most min(M,N), and the corresponding singular values are returned in w. To get the standard deviations of the principal components, divide them by sqrt(min(M,N)-1).

The eigenvectors will be in either U or V, depending on the orientation. I have written the code but do not have it right now; I could post it on Monday. :-)


 Post subject: Re: PCA when M >> N
PostPosted: Sun Jul 28, 2013 2:03 am 
Joined: Thu Sep 02, 2010 10:43 am
Posts: 9
yeah, that would be nice :)

@ Sergey

There seems to be a bug in linalg.cpp.

Some or all methods allocate an "ae_matrix _a" object and copy the incoming source matrix "a" into it, but never use it.

In my tests I got 650 MB of contiguous memory, and with a copy of that the program crashes under 32 bit.

It works fine with smaller data, so this is purely a memory limitation: malloc returns a null pointer for a second object of that size.


 Post subject: Re: PCA when M >> N
PostPosted: Mon Jul 29, 2013 3:12 pm 
Joined: Wed Jul 24, 2013 4:46 pm
Posts: 4
Hey snowflake, here is the code that does PCA for huge matrices. The results match R well.

// The method takes a matrix whose columns represent variables and rows represent observations.
// Set transpose to true if it is the other way around.
// Keep in mind that the number of principal components and principal values is limited by min(numRows, numColumns).
// Please be aware that dataMatrix is modified in this process: it is centered and scaled in place.
// number of eigenvalues: min(nrow, ncol)
// the rotation matrix should be of size: numVariables (ncol, or nrow if transposed) x min(ncol, nrow)
// retx = x * rotation (according to the transpose flag)

int numPcs;
alglib::real_2d_array ptInput;
int nrow, ncol;
if(transpose){
    if(center){
        centerRows(x, numRows, stride);
    }
    if(scale){
        scaleRows(x, numRows, stride);
    }
}else{
    if(center){
        centerColumns(x, numRows, stride);
    }
    if(scale){
        scaleColumns(x, numRows, stride);
    }
}
ptInput.setcontent(numRows, stride, x);
nrow = numRows;
ncol = stride;
numPcs = nrow < ncol ? nrow : ncol;

alglib::real_1d_array w;
alglib::real_2d_array u, vt;
alglib::rmatrixsvd(ptInput, nrow, ncol, 1, 1, 1, w, u, vt);
// The number of observations is ncol when transposed, nrow otherwise;
// the singular values are divided by sqrt(#observations - 1), as in R's prcomp.
double denom = transpose ? sqrt(ncol - 1) : sqrt(nrow - 1);
for(int i = 0; i < numPcs; i++){
    eigV[i] = w[i] / denom;
}

int count = 0;
if(transpose){
    // variables are rows: the principal axes are the left singular vectors (u)
    for(int i = 0; i < nrow; i++){
        for(int j = 0; j < numPcs; j++){
            rotation[count] = u[i][j];
            count++;
        }
    }
}
else{
    // variables are columns: the principal axes are the right singular vectors (rows of vt)
    for(int i = 0; i < ncol; i++){
        for(int j = 0; j < numPcs; j++){
            rotation[count] = vt[j][i];
            count++;
        }
    }
}

if(transpose){
    // the rotation matrix is nrow x numPcs
    for(int i = 0; i < ncol; i++){
        for(int j = 0; j < numPcs; j++){
            double sum = 0;
            for(int k = 0; k < nrow; k++){
                sum += x[k*ncol + i] * rotation[k*numPcs + j];
            }
            retx[i*numPcs + j] = sum;
        }
    }
}else{
    // the rotation matrix is of size ncol x numPcs
    for(int i = 0; i < nrow; i++){
        for(int j = 0; j < numPcs; j++){
            double sum = 0;
            for(int k = 0; k < ncol; k++){
                sum += x[i*ncol + k] * rotation[k*numPcs + j];
            }
            retx[i*numPcs + j] = sum;
        }
    }
}

Utility functions which center and scale the matrix are:
static void centerColumns(double *a, int numRows, int stride){
    for(int i = 0; i < stride; i++){
        double colMean = 0;
        for(int j = 0; j < numRows; j++){
            colMean += a[j*stride + i];
        }
        colMean /= numRows;
        for(int j = 0; j < numRows; j++){
            a[j*stride + i] -= colMean;
        }
    }
}

static void scaleColumns(double *a, int numRows, int stride){
    for(int i = 0; i < stride; i++){
        double sd = 0;
        for(int j = 0; j < numRows; j++){
            sd += a[j*stride + i] * a[j*stride + i];
        }
        sd = sqrt(sd/(numRows - 1));
        for(int j = 0; j < numRows; j++){
            a[j*stride + i] /= sd;
        }
    }
}

static void centerRows(double *a, int numRows, int stride){
    for(int i = 0; i < numRows; i++){
        double rowMean = 0;
        for(int j = 0; j < stride; j++){
            rowMean += a[i*stride + j];
        }
        rowMean /= stride;
        for(int j = 0; j < stride; j++){
            a[i*stride + j] -= rowMean;
        }
    }
}

static void scaleRows(double *a, int numRows, int stride){
    for(int i = 0; i < numRows; i++){
        double sd = 0;
        for(int j = 0; j < stride; j++){
            sd += a[i*stride + j] * a[i*stride + j];
        }
        sd = sqrt(sd/(stride - 1));
        for(int j = 0; j < stride; j++){
            a[i*stride + j] /= sd;
        }
    }
}


 Post subject: Re: PCA when M >> N
PostPosted: Mon Jul 29, 2013 8:36 pm 
Joined: Thu Sep 02, 2010 10:43 am
Posts: 9
Hi santhu,

thanks for the code :)

Here is one thing I wonder about:

Why are you centering/scaling over all values?

It should be done per single row/column, depending on the transposition.

You are centering/scaling over all elements of the data, but it should be two separate values for x and y.

And the rmatrixsvd routine doesn't handle rectangular matrices economically, as far as I found out.

In the original pcabuildbasis routine the matrix grows to a square max(cols, rows) x max(cols, rows) matrix, depending on which one is bigger.

It seems the focus was on the full U and VT matrices, not on the minimized versions, i.e. with uneeded and vtneeded equal to 1 as in your code.

Is your code working, so that you get correctly sized matrices back?

I changed both pcabuildbasis and rmatrixsvd to get what I want.

Do you have a maximum size that your SVD version can handle? My changes crash somewhere between 9000x3 and 9000x6 matrices on 32 bit.

My alglib version is 3.7.0.


 Post subject: Re: PCA when M >> N
PostPosted: Mon Jul 29, 2013 8:44 pm 
Joined: Wed Jul 24, 2013 4:46 pm
Posts: 4
Hi Snowflake,

The centering and scaling is done per single row or column, depending on the transposition.

I checked my code against R and it works perfectly well. My matrices were 40,000 x 24 and there was no crash at all. :-)

Hope it helps,
Santhu


 Post subject: Re: PCA when M >> N
PostPosted: Sun Aug 04, 2013 12:23 am 
Joined: Thu Sep 02, 2010 10:43 am
Posts: 9
Hi santhu,

Yeah, I should have checked it better; now I see that your scaling/centering is right ;)

I'm still working on the rmatrixsvd routine and I'm still not sure if it's right.

Are you developing under 64 bit or 32 bit?

thanks for your help :)


 Post subject: Re: PCA when M >> N
PostPosted: Thu May 15, 2014 5:36 pm 
Joined: Thu May 15, 2014 5:26 pm
Posts: 1
My PCA looks like this. It doesn't use NVARS*NVARS memory, so if you have a massive number of dimensions it will be much lighter. Note that you should prescale/demean beforehand.
Xr is the reduced output; odim is the number of output dimensions. It's a pretty direct translation of the description on Wikipedia =). It would be nice if this were an option in alglib, since it is way more efficient if you don't care about the basis vectors. (Edit: ... and I'm posting in a dead thread.)
Code:
   
void pca(const alglib::real_2d_array& X, alglib::real_2d_array& Xr, int odim)
{
    alglib::real_2d_array U;
    alglib::real_2d_array VT;
    alglib::real_1d_array W;

    std::cout << "svd: " << X.rows() << "x" << X.cols() << "...";
    // uneeded = 1: only the first min(rows, cols) columns of U are computed;
    // vtneeded = 0: VT is never formed, so no NVARS x NVARS allocation.
    alglib::rmatrixsvd(X, X.rows(), X.cols(), 1, 0, 2, W, U, VT);
    std::cout << "  done" << std::endl;

    std::cout << "odims = " << odim << "...";
    Xr.setlength(X.rows(), odim);
    for(int rr = 0; rr < X.rows(); rr++) {
        for(int cc = 0; cc < odim; cc++) {
            Xr(rr, cc) = U(rr, cc) * W[cc];
        }
    }
    std::cout << "  Done" << std::endl;
}





