forum.alglib.net :: View topic - PCA when M >> N

forum.alglib.net http://forum.alglib.net/

PCA when M >> N http://forum.alglib.net/viewtopic.php?f=2&t=893	Page 1 of 1

Author:	santhu [ Wed Jul 24, 2013 4:52 pm ]
Post subject:	PCA when M >> N
Hello all, I have a matrix A of size M X N where M is greater than N. [Typical sizes are M = 15,000 while N =20]. We want to apply PCA on this matrix where variables are more than observations.[ I am aware that the results are not completely reliable but in our experiments with R, they made some sense]. In R when such a scenario is presented, the eigen values would mostly be a vector of N and eigen values are N X N. But in alglib, though the code uses QR-decomposition and bidagional reduction to find the N eigen values, the memory allocation happens for M X M matrix. We are running out of memory because of this. [As far as I have debugged]. Is there any way we can solve this problem without creating these huge matrices which would not be useful at all. Thanks Santhosh

Author:	snowflake [ Sat Jul 27, 2013 10:18 pm ]
Post subject:	Re: PCA when M >> N
this is hard due to the fact that the needed underlying covariance matrix has MxM size when M is larger than N i hope that there is a solution, as i'm dealing with this problem right now too i use it for an Active Shape Model, where one dataset in 3D is concatenated in 1 vector in the format xxxyyyzzz, so a relatively normal object with 3000 points gets a 9000 dimensional vector and it crashes with alglib 3.7.0. Sergey mentioned a workaround for windows here http://bugs.alglib.net/view.php?id=439

Author:	santhu [ Sat Jul 27, 2013 10:29 pm ]
Post subject:	Re: PCA when M >> N
Actually I was able to do this myself. The solution is pretty simple. Instead of depending upon the existing module pcabuildbasis, do the PCA yourself using svd. Decompose the matrix A[MXN) using SVD. http://www.alglib.net/translator/man/manual.cpp.html#unit_svd The number of significant[non-zero] principal components would always be min(M,N) which would be a part of w. You need to divide them by 1/sqrt(min(M,N)-1). For the eigen vectors, they would be either in U or V. I have written the code but do not have it right now. I could post that on monday. :-)

Author:	snowflake [ Sun Jul 28, 2013 2:03 am ]
Post subject:	Re: PCA when M >> N
yeah, that would be nice :) @ Sergey There seems to be a bug in linalg.cpp. Some or all methods allocate a "ae_matrix _a" object and copy the ingoing source matrix "a" into it, but never use it. As for my tests, i got 650 MB continous memory and with a copy of this the program crashes under 32 bit. Works fine with smaller data so this is only memory limitation. malloc returns null pointer for a second object of that size.

Author:	santhu [ Mon Jul 29, 2013 3:12 pm ]
Post subject:	Re: PCA when M >> N
Hey snowflake, Here is the code that does PCA for huge matrices .. The results match R well ... // The method inputs a matrix whose coloumns represent variables and rows represent observations. // set transpose to be true if it isnt so. // Do keep in mind that only the number of Principal components and Principal values would be limited by the min(numRows, numColoums). // please be aware that dataMatrix would be changed in this process. It would be centered, scaled. // #eigen values : is the min(nrow, ncol) // the rotation matrix should be of size : #numVariables( ncol or nrow if transposed)) X min(ncol, nrow) // retx = x * rotation. [ according to the transpose] int numPcs; alglib::real_2d_array ptInput; int nrow, ncol; if(transpose){ if(center){ centerRows(x,numRows,stride); } if(scale){ scaleRows(x,numRows, stride); } }else{ if(center){ centerColumns(x,numRows,stride); } if(scale){ scaleColumns(x,numRows, stride); } } ptInput.setcontent(numRows, stride, x); nrow = numRows; ncol = stride; numPcs = nrow <ncol? nrow : ncol; real_1d_array w; real_2d_array u,vt; alglib::rmatrixsvd(ptInput,nrow,ncol,1,1,1,w,u,vt); double denom = sqrt(ncol-1); for(int i=0; i< numPcs; i++){ eigV[i] = w[i]/denom; } int count = 0; if(transpose){ for(int i=0; i< nrow; i++){ for(int j=0; j< numPcs ; j++) { rotation[count] = u[i][j]; count++; } } } else{ for(int i=0; i< ncol; i++){ for(int j=0; j< numPcs ; j++) { rotation[count] = vt[j][i]; count++; } } } if(transpose){ // the rotation matrix is nrow X numPcs. for(int i =0; i< ncol ; i++){ for(int j=0; j< numPcs ; j++){ double sum =0; for(int k =0; k< nrow ; k++){ sum += x[ kncol + i] rotation[knumPcs + j]; } retx[inumPcs + j] = sum; } } }else{ // the rotation matrix is of size ncol X numPcs. for(int i=0; i< nrow; i++){ for(int j=0; j<numPcs ; j++){ double sum=0; for(int k=0; k < ncol ; k++){ sum += x[incol + k]rotation[ kncol + j]; } retx[inumPcs + j] = sum; } } } Utillity functions which center and scale the matrix are : static void centerColumns(double a, int numRows, int stride){ for(int i=0; i< stride ; i++) { double colMean = 0; for(int j = 0; j< numRows ; j++) { colMean += a[jstride + i]; } colMean = colMean/numRows; for(int j = 0; j< numRows ; j++) { a[jstride + i] -= colMean; } } } static void scaleColumns(double a, int numRows, int stride){ for(int i=0; i< stride ; i++) { double std = 0; for(int j = 0; j< numRows ; j++) { std += a[jstride + i]a[jstride + i]; } std = sqrt(std/(numRows-1)); for(int j = 0; j< numRows ; j++) { a[jstride + i] /= std; } } } static void centerRows(double a, int numRows, int stride){ for(int i=0; i< numRows ; i++) { double rowMean = 0; for(int j = 0; j< stride ; j++) { rowMean += a[istride + j]; } rowMean = rowMean/stride; for(int j = 0; j< stride ; j++) { a[istride + j] -= rowMean; } } } static void scaleRows(double a, int numRows, int stride){ for(int i=0; i< numRows ; i++) { double std = 0; for(int j = 0; j< stride ; j++) { std += a[istride + j]a[istride + j]; } std = sqrt(std/(stride-1)); for(int j = 0; j< stride ; j++) { a[istride + j] /= std; } } }

Author:	snowflake [ Mon Jul 29, 2013 8:36 pm ]
Post subject:	Re: PCA when M >> N
Hi santhu, thanks for the code :) here is a thing i wonder about: Why are you centering / scaling over all values ? It should be according too single rows / cols depending on transposition. You are centering / scaling over all elements of data, but it should be 2 separate values for x and y. And the rmatrixsvd routine isn't rectangular, as far as i found out. In the original pcabuildbasis routine the matrix grows to a square max(cols, rows) matrix, depending which one is bigger. Seems the focus was too far for full U and VT matrix, but not for the minimized versions as with uneeded and vtneeded equal 1 as in your code. Is your code working, so that you get right sized matrices back ? I changed both, pcabuildbasis and rmatrixsvd to get what i want. Do you have a max size you could proceed with your svd version ? My changes are crashing somewhere between 9000x3 and 9000x6 matrices on 32 bit. My alglib version is 3.7.0.

Author:	santhu [ Mon Jul 29, 2013 8:44 pm ]
Post subject:	Re: PCA when M >> N
Hi Snowflake, The centering and scaling is per a single row or column depending upon transposition. I checked my code against R and it works perfectly well. My matrixes were 40,000 X 24 and there was no crash at all. :-) Hope it helps, Santhu

Author:	snowflake [ Sun Aug 04, 2013 12:23 am ]
Post subject:	Re: PCA when M >> N
Hi santhu, yeah, i should have ckecked it better, now i see that your scaling / centering is right ;) i'm still working on the rmatrixsvd routine and i'm still not sure if it's right. are you developing under 64 bit or 32 bit ? thanks for your help :)

Author:	yiuin [ Thu May 15, 2014 5:36 pm ]
Post subject:	Re: PCA when M >> N
My PCA looks like this, it doesn't use NVARSNVARS memory, much less so if you have a massive number of dimensions, it will be much lighter. Note you should prescale/demean. Xr is the reduced output, odim is the number of output dimensions. Its a pretty direct translation of the description on wikipedia =). It would be nice if this was an option in alglib, since it is wayyyy more efficient if you don't care about the basis vectors. (Edit ... and I'm posting in a dead thread) Code:* void pca(const alglib::real_2d_array& X, alglib::real_2d_array& Xr, int odim) { alglib::real_2d_array U; alglib::real_2d_array VT; alglib::real_1d_array W; const double VARTHRESH = 1E-12; std::cout << "svd: " << X.rows() << "x" << X.cols() << "..."; alglib::rmatrixsvd(X, X.rows(), X.cols(), 1, 0, 2, W, U, VT); std::cout << " done" << endl; std::cout << "odims = " << odim << "..."; Xr.setlength(X.rows(), odim); for(int rr=0; rr<X.rows(); rr++) { for(int cc=0; cc<odim; cc++) { Xr(rr,cc) = U(rr, cc)*W[cc]; } } std::cout << " Done" << endl; }

Page 1 of 1	All times are UTC
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group http://www.phpbb.com/