forum.alglib.net
http://forum.alglib.net/

Random Forest Variable Importance
http://forum.alglib.net/viewtopic.php?f=2&t=35

Author:  alexgleith [ Mon Jul 05, 2010 12:13 am ]
Post subject:  Random Forest Variable Importance

Hi all

I would like to know if you can get access to the variable importance from the decision forest? I know that in the R implementation of RF, that you end up with a "weight" of each variable for the classification task?

Another thing, since RF is good at predicting missing variables, could there be an option to fill in missing variables (I have some NODATA values in my variables)?

Finally, I tried using some extreme values for the missing values, -9999 was used, as well as double.NaN and this resulted in some strange results in the classification task... If I have two classes (0 and 1) and the result of the classification task is 0.999 I am assuming that this means that the row is very likely to be a member of the 1 group. If it has a value of 0.1 then it would be more likely to be a member of the 0 class right? So, why would there be values of 77, for example?

Cheers,

Alex

Author:  Sergey.Bochkanov [ Mon Jul 05, 2010 1:43 pm ]
Post subject:  Re: Random Forest Variable Importance

alexgleith wrote:
I would like to know if you can get access to the variable importance from the decision forest? I know that in the R implementation of RF, that you end up with a "weight" of each variable for the classification task?

ALGLIB doesn't support this feature. It was planned, but other tasks with higher priority appeared.

alexgleith wrote:
Another thing, since RF is good at predicting missing variables, could there be an option to fill in missing variables (I have some NODATA values in my variables)?

Unfortunately, no. In fact, decision forests and neural networks are unlikely to see any significant development in the coming year.

alexgleith wrote:
So, why would there be values of 77, for example?

77? Errr... it should be impossible to get such a value when solving a classification problem :) Please show me the code which leads to such output so I can trace this bug.

Author:  alexgleith [ Tue Jul 06, 2010 12:55 am ]
Post subject:  Re: Random Forest Variable Importance

Thanks for your response. It is a little disappointing that there are some missing aspects of the random forest algorithm, but I understand completely that it is rather complex and that ALGLIB is a much larger project than just this class. If only I had the time to learn the algorithm and help out!

Sergey.Bochkanov wrote:
77? Errr... it should be impossible to get such a value when solving a classification problem :) Please show me the code which leads to such output so I can trace this bug.


I can give you some code, although it is in the form of an addin for another program, you should be able to see how it works.

The relevant sections are as follows:

This builds the array (a double[,] array) where the first column is the classification (a double value, either 1 or 0) and the other columns are double values between 0 and 1. In this case I have a fudge where a missing value (NODATA) is replaced with a value of 0.5. I had tried -9999, as I mentioned, since the documentation recommends creating another class for the NODATA values...

Code:
                //build an array to hold the parameters for the random forest operation.
                double[,] array = new double[inputVSG.FeatureTable.Count, nParameters + 1];

                //populate the array.
                for (int i = 1; i < inputVSG.FeatureTable.Count - 1; i++) {

                    array[i - 1, 0] = inputVSG.FeatureTable[i].classification;
                    for (int param = 0; param < nParameters; param++) {
                        try {
                            array[i - 1, param + 1] = inputParamListList[param][i];
                        } catch (NoDataException) {
                            array[i - 1, param + 1] = .5;
                        }
                    }
                }


After this I carry out the operation:

nParameters = (around 7), nClasses = 1, nTrees = (between 50 and 200), rValue = (between .3 and .6)

Code:
                //Create the random forest.
                int number = 0;
                dforest.decisionforest df = new dforest.decisionforest();
                dforest.dfreport dfreport = new dforest.dfreport();
                dforest.dfbuildrandomdecisionforest(ref array, inputVSG.FeatureTable.Count, nParameters, nClasses, nTrees, rValue, ref number, ref df, ref dfreport);


Then I check the TEST dataset using the following:

Code:
                    for (int param = 0; param < nParameters; param++) {
                        try {
                            outputAttributesArray[param] = outputParamListList[param][row.RowIndex];
                        } catch (NoDataException) {
                            outputAttributesArray[param] = .5;
                        }
                    }
                    dforest.dfprocess(ref df, ref outputAttributesArray, ref result);

                    row.rank = result[0];


This results in a value of between 0 and 1 usually. But when I had the NODATA values set at -9999 there were some strange values resulting. Let me know if you need any more information. I can also give you my dataset if you like.

Cheers,

Alex

Attachments:
File comment: C# class addin for Eonfusion
AlexLeithRForest.txt [11.98 KiB]

Author:  alexgleith [ Wed Jul 07, 2010 10:02 pm ]
Post subject:  Re: Random Forest Variable Importance

Just bump.

I hope that you can track this down, let me know if I am doing something wrong!

Author:  alexgleith [ Mon Jul 12, 2010 8:51 am ]
Post subject:  Re: Random Forest Variable Importance

bump again

Author:  Sergey.Bochkanov [ Tue Jul 13, 2010 12:01 pm ]
Post subject:  Re: Random Forest Variable Importance

Hello!

I've had a hard time lately, so I can only answer your question now. Hope it helps.

First, and most important, you store your variables in the wrong order. Class values must be in the last column, and they must be from 0 to NClasses-1. See http://www.alglib.net/dataanalysis/gene ... ciples.php for more info.

Second, you should encode missing values as described at http://www.alglib.net/dataanalysis/gene ... ciples.php (same link as above).

Third, you should check the Info output variable (number in your case) because it contains the error code.
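
To make the points above concrete, here is a minimal sketch of a dataset builder that puts the class label in the last column and checks that it lies in 0..NClasses-1. The names DatasetLayout and Build are hypothetical and not part of ALGLIB. The missing-value handling (a fill value plus an extra 0/1 "is missing" column per variable) is an assumption about one reasonable encoding; the linked page is the authority on what ALGLIB actually recommends.

```csharp
using System;

static class DatasetLayout
{
    // Hypothetical helper, not an ALGLIB function.
    // Builds a training matrix with the features first and the class label
    // in the LAST column. A missing feature (null) is replaced by 'fill' and
    // flagged in an extra 0/1 indicator column (assumed encoding).
    public static double[,] Build(double?[][] features, int[] labels, int nClasses, double fill)
    {
        int nPoints = features.Length;
        int nVars = features[0].Length;

        // Two columns per variable (value, missing flag) plus one class column.
        var xy = new double[nPoints, 2 * nVars + 1];

        for (int i = 0; i < nPoints; i++)
        {
            // Class values must be integers from 0 to nClasses - 1.
            if (labels[i] < 0 || labels[i] >= nClasses)
                throw new ArgumentException("class label must be in 0..nClasses-1");

            for (int j = 0; j < nVars; j++)
            {
                double? v = features[i][j];
                xy[i, 2 * j] = v ?? fill;                   // value, or fill if missing
                xy[i, 2 * j + 1] = v.HasValue ? 0.0 : 1.0;  // 1.0 marks a missing value
            }

            xy[i, 2 * nVars] = labels[i];                   // class label goes last
        }
        return xy;
    }
}
```

With a matrix like this, the call from the earlier post would take 2 * nParameters as the number of variables, and the number of rows passed must match the rows actually filled. For two classes (0 and 1), nClasses should presumably be 2 rather than 1: as far as I can tell from the ALGLIB documentation, NClasses = 1 selects a regression task, which would explain outputs that are not probabilities. After the call, check the Info value (number in the code above) before using the forest; a negative value indicates an error.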

Author:  monsterer [ Thu Nov 29, 2018 1:00 pm ]
Post subject:  Re: Random Forest Variable Importance

Hi,

I'd really like to know how to implement this feature as well; variable importance is really useful for tuning.

I understand you are busy, but if I were to do this myself, where should I be looking in the code?

Cheers, Paul.

All times are UTC