forum.alglib.net
http://forum.alglib.net/

Principal Component Analysis
http://forum.alglib.net/viewtopic.php?f=2&t=26

Author:  moex [ Tue Jun 15, 2010 5:39 am ]
Post subject:  Principal Component Analysis

Hello all,
I'm very interested in the ALGLIB library, as it provides a lot of useful
functionality for matrix manipulation and statistical analysis. I'm
especially interested in PCA (Principal Component Analysis), which ALGLIB
already provides. I followed some examples from my readings, and
unfortunately the output differs slightly from the original correct
output. I also have some questions about some of the parameters passed
to the "pcabuildbasis(ref dSet, npoints, nvars, ref info, ref s2, ref
v)" function.

The first parameter passed to the "pcabuildbasis()" function, "dSet", is
a two-dimensional array of x and y values that holds the artificial data
from my example:

X Y
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3
2.3 2.7
2 1.6
1 1.1
1.5 1.6
1.1 0.9

Should I pass the array as dSet[2, 10] or dSet[10, 2]? In other words,
should I pass a column matrix or a row matrix?

Below are the steps I followed in my example:

1. Create the original data set dSet[2, 10]

2. Find the mean of X and Y (xMean, yMean)

X Mean = 1.81
Y Mean = 1.91

3. Subtract xMean from each instance of X and yMean from each
instance of Y (mean-adjust the data)

Mean Adjusted Dataset
0.69 , 0.49
-1.31 , -1.21
0.39 , 0.99
0.0899999999999999 , 0.29
1.29 , 1.09
0.49 , 0.79
0.19 , -0.31
-0.81 , -0.81
-0.31 , -0.31
-0.71 , -1.01

4. Call the pcabuildbasis() function and pass it all required variables,
including dSet (the info variable returns 1 - correct)

5. The variance matrix s2

Variance Matrix
1.28402771217278
0.0490833989383274

6. The eigenvector matrix (Basis)

Eigenvectors
-0.677873398528012 , 0.735178655544408
-0.735178655544408 , -0.677873398528012

7. I multiply the mean-adjusted matrix by the eigenvector matrix to get the final data (a small sketch of this step follows the final dataset below)

8. Final data

Final Dataset
-0.827970186201088 , 0.175115307046915
1.77758032528043 , -0.14285722654428
-0.992197494414889 , -0.384374988880413
-0.2742104159754 , -0.130417206574127
-1.67580141864454 , 0.209498461256753
-0.912949103158808 , -0.17528244362037
0.0991094374984439 , 0.349824698097121
1.14457216379866 , -0.0464172581832809
0.43804613676245 , -0.017764629675083
1.22382055505474 , 0.162675287076762
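
For step 7, here is a minimal C++ sketch of the projection, in the notation used later in this thread; the names meanAdjusted and finalData are placeholders, not my actual code, and v is the basis returned by pcabuildbasis() with one basis vector per column:

Code:
// Sketch only: meanAdjusted is the 10x2 mean-adjusted dataset and v is
// the 2x2 basis from pcabuildbasis() (both hypothetical 0-based
// ap::real_2d_array variables). Plain loops instead of BLAS, for
// clarity: finalData = meanAdjusted * v.
ap::real_2d_array finalData;
finalData.setbounds(0, 9, 0, 1);
for(int i = 0; i <= 9; i++)
    for(int j = 0; j <= 1; j++)
    {
        finalData(i, j) = 0.0;
        for(int k = 0; k <= 1; k++)
            finalData(i, j) += meanAdjusted(i, k) * v(k, j);
    }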

These were the steps I followed. The PCA algorithm works great and is
promising, except that some of the data differs in sign but not in
magnitude (which is good). For example, my readings show the
eigenvectors as

-0.735178655544408 , -0.677873398528012
0.677873398528012 , -0.735178655544408

but pcabuildbasis() in my program shows

-0.677873398528012 , 0.735178655544408
-0.735178655544408 , -0.677873398528012

Moreover, the eigenvalue (variance) matrix in my readings shows

0.0490833989383274
1.28402771217278

and in my program it shows

1.28402771217278
0.0490833989383274

Both these matrices affect the rest of my calculations to get the final data.
Are the steps I followed correct, or do I need to do something else or
adjust something?
I also thank all those responsible for putting
together the beautiful ALGLIB library. Thank you. :D

PS: I attached a picture of my program's output.

Attachments:
pca.jpg (my PCA program output)

Author:  Sergey.Bochkanov [ Tue Jun 15, 2010 3:41 pm ]
Post subject:  Re: Principal Component Analysis

Hello!

moex wrote:
Should I pass the array as dSet[2, 10] or dSet[10, 2]? In other words,
should I pass a column matrix or a row matrix?

You should pass it as a matrix whose first index is in [0..9] and second index is in [0..1] - one row per point, one column per variable. Depending on terminology, it can be called a column matrix or a row matrix :)
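
For illustration, a minimal C++ sketch of this layout, assuming the ALGLIB 2.x interface with ap::real_2d_array and pcabuildbasis() from the pca unit (check the headers of your version for the exact signature):

Code:
// Sketch: one row per point, one column per variable.
ap::real_2d_array dSet;
dSet.setbounds(0, 9, 0, 1);                  // 10 points, 2 variables
double xs[10] = {2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1};
double ys[10] = {2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9};
for(int i = 0; i < 10; i++)
{
    dSet(i, 0) = xs[i];                      // first index: point
    dSet(i, 1) = ys[i];                      // second index: variable
}

int info;
ap::real_1d_array s2;                        // variances, descending
ap::real_2d_array v;                         // basis vectors, one per column
pcabuildbasis(dSet, 10, 2, info, s2, v);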


moex wrote:
Below are the steps I followed in my example:

1. Create the original data set dSet[2, 10]

2. Find the mean of X and Y (xMean, yMean)

......

3. Subtract xMean from each instance of X and yMean from each
instance of Y (mean-adjust the data)

4. Call the pcabuildbasis() function and pass it all required variables,
including dSet (the info variable returns 1 - correct)

You don't have to do steps (2)-(3), because the subroutine automatically determines the mean values and centers the data. So (2)-(3) make no difference - but they do no harm either.


moex wrote:
.......some of the data differs in sign but not in
magnitude (which is good). For example, my readings show the
eigenvectors as

-0.735178655544408 , -0.677873398528012
0.677873398528012 , -0.735178655544408

but pcabuildbasis() in my program shows

-0.677873398528012 , 0.735178655544408
-0.735178655544408 , -0.677873398528012

Moreover, the eigenvalue (variance) matrix in my readings shows

0.0490833989383274
1.28402771217278

and in my program it shows

1.28402771217278
0.0490833989383274

It's just a notation question. ALGLIB puts the most important eigenvectors first; your book puts them at the end. There is no such thing as a "right order of eigenvectors" - anyone is free to sort them as they wish. Also, eigenvectors have no preferred sign: the most important eigenvector can be (-0.67, -0.73)^T or (+0.67, +0.73)^T, and you can't say that one is "right" and the other "wrong". They determine the same subspace.
Again, it is just a question of convenience and notation.
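
If you want a deterministic sign just for comparing against your book, you could post-process the basis. An illustrative sketch - this is not something pcabuildbasis() does or requires - that flips each column of v so that its largest-magnitude component is positive:

Code:
// Illustrative post-processing only: enforce a sign convention by
// flipping each basis vector (column of v) whose largest-magnitude
// component is negative. Uses fabs() from <cmath>.
for(int j = v.getlowbound(1); j <= v.gethighbound(1); j++)
{
    int imax = v.getlowbound(0);
    for(int i = imax+1; i <= v.gethighbound(0); i++)
        if( fabs(v(i, j)) > fabs(v(imax, j)) )
            imax = i;
    if( v(imax, j) < 0 )
        for(int i = v.getlowbound(0); i <= v.gethighbound(0); i++)
            v(i, j) = -v(i, j);
}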

So I see no mistakes in your use of pcabuildbasis().

Good luck!

Author:  snowflake [ Fri Sep 03, 2010 6:13 am ]
Post subject:  Re: Principal Component Analysis

Thanks a lot for the example values :)

I tried to recode it for myself and got a bunch of ugly problems with a wrongly initialized input matrix ;)

But now it works fine and I get the same values as above, so everything is fine :)

Is there a ready-made way to do the multiplication of the mean-adjusted matrix with the eigenvectors?

I found matrixmatrixmultiply, but I find it kind of confusing to set the boundaries for the destination matrix myself.

And as I've seen now, the first highbound of the centered (mean-adjusted) matrix is 1 higher than half of my input points - is that right? Or is this the upper-triangular diagonal matrix? If so, everything is fine ;)

Here is my current source for calculating the final data:

Code:
ap::real_2d_array a = _mean_adjusted_input_data;
int ai1 = _mean_adjusted_input_data.getlowbound(0);
int ai2 = _mean_adjusted_input_data.gethighbound(0);
int aj1 = _mean_adjusted_input_data.getlowbound(1);
int aj2 = _mean_adjusted_input_data.gethighbound(1);
bool transa = false;
ap::real_2d_array b = _eigenvectors;
int bi1 = _eigenvectors.getlowbound(0);
int bi2 = _eigenvectors.gethighbound(0);
int bj1 = _eigenvectors.getlowbound(1);
int bj2 = _eigenvectors.gethighbound(1);
bool transb = false;
double alpha = 1.0;
ap::real_2d_array c;
int ci1 = ap::minint(ai1, bj1);
int ci2 = ap::maxint(ai2, bj2);
int cj1 = ap::minint(aj1, bi1);
int cj2 = ap::maxint(aj2, bi2);
double beta = 1.0;
ap::real_1d_array work;
work.setbounds(1, ap::maxint(ai2, aj2));

matrixmatrixmultiply(a, ai1, ai2, aj1, aj2, transa, b, bi1, bi2, bj1, bj2, transb, alpha, c, ci1, ci2, cj1, cj2, beta, work);


The bounds are for an original data set of 220 points with 2 axes:

a: 0, 110, 0, 1
b: 0, 1, 0, 1

so the bounds for c are 0, 1, 0, 110, where the final data should fit in.

Does anyone have a hint? :)

Author:  Sergey.Bochkanov [ Fri Sep 03, 2010 6:37 am ]
Post subject:  Re: Principal Component Analysis

snowflake wrote:
Is there a ready-made way to do the multiplication of the mean-adjusted matrix with the eigenvectors?
I found matrixmatrixmultiply, but I find it kind of confusing to set the boundaries for the destination matrix myself.

No, in the current version you have to do it yourself. But I recommend using the functions from the ablas.cpp unit, which provides a better-optimized BLAS. I also recommend switching to the 3.0 branch of ALGLIB because a) it is mostly compatible with the 2.x branch you seem to be using, and b) the 2.x branch will not be updated anymore.

Actually, all these BLAS functions are intended to be used internally, for when you want to take part of matrix A and part of B, multiply them, and store the result somewhere in the middle of C. Maybe in the next release, which is scheduled for September, I will implement a "user-friendly" BLAS.


snowflake wrote:
And as I've seen now, the first highbound of the centered (mean-adjusted) matrix is 1 higher than half of my input points - is that right? Or is this the upper-triangular diagonal matrix? If so, everything is fine ;)

......

The bounds are for an original data set of 220 points with 2 axes:
a: 0, 110, 0, 1
b: 0, 1, 0, 1
so the bounds for c are 0, 1, 0, 110, where the final data should fit in.

Hmm, I don't get it. Why are you talking about half of the points? If you have 220 points, you should have a[0..219, 0..1]. Where did you get 111 from? :)

As for your code, there are several errors:
* you should allocate C - it isn't allocated automatically by the BLAS functions
* you should set beta to 0.0
* you should set ci1 to ai1, ci2 to ai2, and so on. The size of C is the same as the size of A.

Author:  snowflake [ Fri Sep 03, 2010 1:35 pm ]
Post subject:  Re: Principal Component Analysis

Sergey.Bochkanov wrote:
No, in the current version you have to do it yourself. But I recommend using the functions from the ablas.cpp unit, which provides a better-optimized BLAS. I also recommend switching to the 3.0 branch of ALGLIB because a) it is mostly compatible with the 2.x branch you seem to be using, and b) the 2.x branch will not be updated anymore.

Actually, all these BLAS functions are intended to be used internally, for when you want to take part of matrix A and part of B, multiply them, and store the result somewhere in the middle of C. Maybe in the next release, which is scheduled for September, I will implement a "user-friendly" BLAS.


OK, thanks for your fast response :)

Is the 3.0 branch with rc1 stable and tested for the PCA? For my future use there will be a lot of input to the PCA (probably 400-500 2D polygons as a current guess, each of which has to be "aligned" on the PC axes), so this is important ;)

Yeah, sorry, I forgot: I'm currently using version 2.6.0, but as I've just started, it shouldn't be that big a deal to change to 3.0 :) The PCA and its dependencies are the only source files I need from ALGLIB.

As I have a (0, 110, 0, 1)-bounded array with the mean-adjusted data and a (0, 1, 0, 1)-bounded array with the eigenvectors, I would take the whole matrices to get the final transformed data :)

Good to hear about possibly more user-friendly BLAS routines in September, as that fits well within my project scope :D

Sergey.Bochkanov wrote:
Hmm, I don't get it. Why are you talking about half of the points? If you have 220 points, you should have a[0..219, 0..1]. Where did you get 111 from? :)


Sorry for that, I should start my day with coffee like normal people instead of surfing the web :D

As there are just 111 points, everything is fine. I confused myself with the 222 effective values storing the 111 2D points :lol:

Sergey.Bochkanov wrote:
As for your code, there are several errors:
* you should allocate C - it isn't allocated automatically by the BLAS functions
* you should set beta to 0.0
* you should set ci1 to ai1, ci2 to ai2, and so on. The size of C is the same as the size of A.


OK, it was too early for debugging this morning ;)

I just finished transforming into PCA space, but my points are mirrored vertically - where does this come from?

The polygon which the points form looks good; the maximum variance on the x axis and the second-largest variance on the y axis are right too.

Code:
ap::real_2d_array a = _mean_adjusted_input_data;
// rows (dimension 0) index the points, columns (dimension 1) the variables
int ai1 = _mean_adjusted_input_data.getlowbound(0);
int ai2 = _mean_adjusted_input_data.gethighbound(0);
int aj1 = _mean_adjusted_input_data.getlowbound(1);
int aj2 = _mean_adjusted_input_data.gethighbound(1);
bool transa = false;

ap::real_2d_array b = _eigenvectors;
int bi1 = _eigenvectors.getlowbound(0);
int bi2 = _eigenvectors.gethighbound(0);
int bj1 = _eigenvectors.getlowbound(1);
int bj2 = _eigenvectors.gethighbound(1);
bool transb = false;
double alpha = 1.0;

// C has the same size as A and is overwritten, so beta = 0.0
ap::real_2d_array c;
c.setbounds(ai1, ai2, aj1, aj2);
double beta = 0.0;
ap::real_1d_array work;
work.setbounds(1, ap::maxint(ai2, aj2));

matrixmatrixmultiply(a, ai1, ai2, aj1, aj2, transa, b, bi1, bi2, bj1, bj2, transb, alpha, c, ai1, ai2, aj1, aj2, beta, work);

_final_data = c;

QPolygonF transformedPolygon;
QPointF point;

// collect the transformed points row by row
for (int i = c.getlowbound(0); i <= c.gethighbound(0); i++)
{
   ap::raw_vector<double> transformedVector = c.getrow(i, c.getlowbound(1), c.gethighbound(1));
   double * transformedVectorData = transformedVector.GetData();

   if (NUMBER_OF_AXES == transformedVector.GetLength())
   {
      point = QPointF(transformedVectorData[0], transformedVectorData[1]);
      transformedPolygon.append(point);
   }
}

Author:  Sergey.Bochkanov [ Fri Sep 03, 2010 2:17 pm ]
Post subject:  Re: Principal Component Analysis

snowflake wrote:
Is the 3.0 branch with rc1 stable and tested for the PCA? For my future use there will be a lot of input to the PCA (probably 400-500 2D polygons as a current guess, each of which has to be "aligned" on the PC axes), so this is important ;)

Yeah, sorry, I forgot: I'm currently using version 2.6.0, but as I've just started, it shouldn't be that big a deal to change to 3.0 :) The PCA and its dependencies are the only source files I need from ALGLIB.

As for PCA, rc1 is stable. Actually, you don't have to do very much to convert: different file names, an additional alglib namespace - you just have to compile it and see where the compiler starts to object. If it compiles, it will work.

However, there are two major changes which will influence your program:
1) functions from blas.h have become internal and you can't use them; you have to use the newer BLAS interface (search the reference manual for info on the ablas subpackage)
2) the new BLAS uses pointers to work with vectors; the raw_vector class is gone because its performance was so depressingly low...
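
For illustration, the projection step might look roughly like this with the 3.x interface - a sketch assuming rmatrixgemm() from the ablas subpackage (see the reference manual for the exact signature); a, v, npoints and nvars stand in for your own variables:

Code:
// Sketch, assuming the 3.x rmatrixgemm() from the ablas subpackage.
// Computes C = A*V: A is the npoints x nvars mean-adjusted data, V the
// nvars x nvars basis; the integer pairs after each matrix are
// row/column offsets, and the trailing 0 means "no transposition".
#include "linalg.h"

void project(const alglib::real_2d_array &a, const alglib::real_2d_array &v,
             alglib::real_2d_array &c, alglib::ae_int_t npoints, alglib::ae_int_t nvars)
{
    c.setlength(npoints, nvars);             // C must be preallocated
    alglib::rmatrixgemm(npoints, nvars, nvars, 1.0,
                        a, 0, 0, 0,          // A, offsets, op type
                        v, 0, 0, 0,          // B (= V), offsets, op type
                        0.0,                 // beta = 0: overwrite C
                        c, 0, 0);            // C, offsets
}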


snowflake wrote:
I just finished transforming into PCA space, but my points are mirrored vertically - where does this come from?

Principal vectors are determined only up to sign, so (a,b) and (-a,-b) are equally good and have the same chance of appearing in the PCA output.
You have to check the determinant of the PCA matrix and flip (multiply by -1) one of the coordinates if it is negative.
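
For the 2D case, a minimal sketch of that check (v being the basis matrix; for 3D you would use the 3x3 determinant instead):

Code:
// Sketch for the 2D case: a negative determinant means the basis
// contains a reflection; flipping the sign of one basis vector (here
// the second column of v) turns it into a pure rotation.
double det = v(0, 0)*v(1, 1) - v(0, 1)*v(1, 0);
if( det < 0 )
{
    v(0, 1) = -v(0, 1);
    v(1, 1) = -v(1, 1);
}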

Author:  snowflake [ Sun Sep 05, 2010 9:48 pm ]
Post subject:  Re: Principal Component Analysis

OK, thanks for your response :)

I'll test it on Tuesday. As I've seen now, I also have some matrix functions available from Qt, and since I'm already using it for viewing purposes I could use it for matrix purposes too. As it seems to be an easier approach for me than ALGLIB (no offence, Sergey ;) ), I will probably try this first :)

Author:  Sergey.Bochkanov [ Mon Sep 06, 2010 7:33 am ]
Post subject:  Re: Principal Component Analysis

Yes, that's one of the weak points of ALGLIB. It has many high-level functions, but lacks a low-level, easy-to-use matrix algebra interface.

Author:  snowflake [ Wed Sep 07, 2011 6:38 pm ]
Post subject:  Re: Principal Component Analysis

Hi,

It's me again :)

I'm still working with version 2.6.0 and I have a weird problem: tests with a simple cube of 8 points look very fine, but when I apply the PCA to a heavy 3D dataset of 3000 points, the result looks sheared (a different rotation for one or more of the axes).

Here is a picture:

http://imageshack.us/f/683/abcxe.jpg/

On the left is the original; on the right is the PCA-transformed version. The yellow axis is interpreted as the first axis, green as the second, and red as the third, so apart from the shearing the result looks good.

What is a nice format to upload it here for testing?

Currently it's an STL (stereolithography) file, but I can crop it down to a .csv file if that helps.

Here is a picture of what my code puts out for a cube, also read from an STL file and passed through the same PCA adaptation:

http://imageshack.us/f/88/bcdz.jpg/

The first axis cuts through the side planes at the center of the side face, and the second and third cut diagonally right through the edges.

Data tested for datasets other than the above tooth shape looks really fine after PCA, so I don't see the error.

Although the data is triangulated in the above screenshots, I only input a list of unique vertices to the PCA.

I got similar-looking results when I passed the same point twice (or more often than other points) in the list, but after fixing that error I still got the above problem.
