Random Forest Variable Importance

alexgleith · **Joined:** Mon Jul 05, 2010 12:03 am **Posts:** 4

Hi all

I would like to know whether it is possible to access the variable importance from the decision forest. I know that the R implementation of RF gives you a "weight" for each variable in the classification task.

Another thing: since RF is good at predicting missing values, could there be an option to fill in missing values (I have some NODATA values in my variables)?

Finally, I tried using some extreme values for the missing values (-9999, as well as double.NaN), and this produced some strange results in the classification task. If I have two classes (0 and 1) and the result of the classification task is 0.999, I am assuming this means that the row is very likely to be a member of class 1. If it has a value of 0.1, then it would be more likely to be a member of class 0, right? So, why would there be values of 77, for example?

Cheers,

Alex

Sergey.Bochkanov · **Joined:** Fri May 07, 2010 7:06 am **Posts:** 927

alexgleith wrote:

I would like to know whether it is possible to access the variable importance from the decision forest. I know that the R implementation of RF gives you a "weight" for each variable in the classification task.

ALGLIB doesn't support this feature. It was planned, but other tasks with higher priority appeared.

alexgleith wrote:

Another thing: since RF is good at predicting missing values, could there be an option to fill in missing values (I have some NODATA values in my variables)?

Unfortunately, no. In fact, decision forests and neural networks are unlikely to see any significant development in the coming year.

alexgleith wrote:

So, why would there be values of 77, for example?

77? Errr... it should be impossible to get such a value when solving a classification problem :) Please show me the code that leads to this output so I can trace the bug.

alexgleith · **Joined:** Mon Jul 05, 2010 12:03 am **Posts:** 4

Thanks for your response. It is a little disappointing that there are some missing aspects of the random forest algorithm, but I understand completely that it is rather complex and that ALGLIB is a much larger project than just this class. If only I had the time to learn the algorithm and help out!

Sergey.Bochkanov wrote:

77? Errr... it should be impossible to get such a value when solving a classification problem :) Please show me the code that leads to this output so I can trace the bug.

I can give you some code. Although it is in the form of an add-in for another program, you should be able to see how it works.

The relevant sections are as follows:

This builds the array (a double[,] array) where the first column is the classification (a double value, either 1 or 0) and the other columns are double values between 0 and 1. In this case I have a fudge: if there is a missing value (NODATA), it is replaced with a value of 0.5. I had also tried -9999, as I mentioned, since the documentation recommends creating another class for the NODATA values...

Code:

                //build an array to hold the parameters for the random forest operation.
                double[,] array = new double[inputVSG.FeatureTable.Count, nParameters + 1];

                //populate the array.
                for (int i = 1; i < inputVSG.FeatureTable.Count - 1; i++) {

                    array[i - 1, 0] = inputVSG.FeatureTable[i].classification;
                    for (int param = 0; param < nParameters; param++) {
                        try {
                            array[i - 1, param + 1] = inputParamListList[param][i];
                        } catch (NoDataException) {
                            array[i - 1, param + 1] = .5;
                        }
                    }
                }

After this I carry out the operation:

nParameters = (around 7), nClasses = 1, nTrees = (between 50 and 200), rValue = (between .3 and .6)

Code:

                //Create the random forest.
                int number = 0;
                dforest.decisionforest df = new dforest.decisionforest();
                dforest.dfreport dfreport = new dforest.dfreport();
                dforest.dfbuildrandomdecisionforest(ref array, inputVSG.FeatureTable.Count, nParameters, nClasses, nTrees, rValue, ref number, ref df, ref dfreport);

Then I check the TEST dataset using the following:

Code:

                    for (int param = 0; param < nParameters; param++) {
                        try {
                            outputAttributesArray[param] = outputParamListList[param][row.RowIndex];
                        } catch (NoDataException) {
                            outputAttributesArray[param] = .5;
                        }
                    }
                    dforest.dfprocess(ref df, ref outputAttributesArray, ref result);

                    row.rank = result[0];

This usually results in a value between 0 and 1. But when I had the NODATA values set to -9999, I got some strange values. Let me know if you need any more information. I can also give you my dataset if you like.

Cheers,

Alex

alexgleith · **Joined:** Mon Jul 05, 2010 12:03 am **Posts:** 4

Just bump.

I hope that you can track this down, let me know if I am doing something wrong!

alexgleith · **Joined:** Mon Jul 05, 2010 12:03 am **Posts:** 4

bump again

Sergey.Bochkanov · **Joined:** Fri May 07, 2010 7:06 am **Posts:** 927

Hello!

I've had a hard time lately, so I can only answer your question now. Hope it helps.

First, and most important, you store your variables in the wrong order. Class values must be in the last column, and they must be from 0 to NClasses-1. See http://www.alglib.net/dataanalysis/gene ... ciples.php for more info.

Second, you should encode missing values as described at http://www.alglib.net/dataanalysis/gene ... ciples.php (same link as above).

Third, you should check the Info output variable (number in your case) because it contains the error code.
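The points above can be sketched as a small dataset builder. This is a minimal illustration, not ALGLIB code: the class name `ForestDataset`, the method `Build`, and the use of `double?` for NODATA are all hypothetical. The "value plus 0/1 missing flag" encoding is one common convention and is an assumption here; the linked page may prescribe something different. The sketch puts the class label in the last column and rejects labels outside 0..NClasses-1 before anything reaches the library.

Code:

```csharp
using System;

public static class ForestDataset
{
    // Builds an [nPoints, 2 * nVars + 1] matrix.
    // Each input variable becomes two columns: the value (0.0 if missing)
    // and a 0/1 "is missing" flag. This encoding is an assumption, not
    // something confirmed by the thread.
    // The class label goes in the LAST column, as the reply above requires.
    public static double[,] Build(double?[][] features, int[] labels, int nClasses)
    {
        if (features.Length != labels.Length)
            throw new ArgumentException("features and labels must have the same number of rows");

        int nPoints = features.Length;
        int nVars = nPoints > 0 ? features[0].Length : 0;
        var xy = new double[nPoints, 2 * nVars + 1];

        for (int i = 0; i < nPoints; i++)
        {
            // Labels must lie in 0..nClasses-1.
            if (labels[i] < 0 || labels[i] >= nClasses)
                throw new ArgumentOutOfRangeException(
                    nameof(labels), $"row {i}: label {labels[i]} is outside 0..{nClasses - 1}");

            for (int j = 0; j < nVars; j++)
            {
                double? v = features[i][j];
                xy[i, 2 * j] = v ?? 0.0;                   // value, or 0.0 when missing
                xy[i, 2 * j + 1] = v.HasValue ? 0.0 : 1.0; // missing-value flag
            }

            xy[i, 2 * nVars] = labels[i];                  // class label in the last column
        }

        return xy;
    }
}
```

With a matrix like this, the call from the earlier post would pass the number of rows actually filled as the point count, 2 * nVars as the variable count, and nClasses = 2 for a two-class problem; in ALGLIB, NClasses = 1 denotes a regression task, which is consistent with outputs outside the 0 to 1 range. After the call, the Info value (number) should be checked before the forest is used; treating any non-positive value as a failure is an assumption here. For a classification forest, the result vector holds one posterior probability per class, so the probability of class 1 would be result[1] rather than result[0].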

monsterer · **Joined:** Thu Nov 29, 2018 12:58 pm **Posts:** 1

Hi,

I'd really like to know how to implement this feature as well; variable importance is really useful for tuning.

I understand you are busy, but if I were to do this myself, where should I be looking in the code?

Cheers, Paul.
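For readers with the same question: one way to estimate variable importance without changing the library is permutation importance, computed around an already trained model. The sketch below is a generic illustration under stated assumptions, not an ALGLIB feature. `PermutationImportance`, `Compute`, and the `predict` delegate are hypothetical names; `predict` is assumed to wrap a call such as the dfprocess call shown earlier in the thread and to return one probability per class. It measures the drop in accuracy on a held-out set when one column is shuffled, which differs from the out-of-bag variant used by R's randomForest.

Code:

```csharp
using System;
using System.Linq;

public static class PermutationImportance
{
    // x: held-out feature rows, y: true class labels (0..NClasses-1).
    // predict: hypothetical wrapper around the trained forest that returns
    // per-class probabilities for a single feature vector.
    // Returns, for each variable, baseline accuracy minus accuracy after
    // shuffling that variable's column.
    public static double[] Compute(
        double[][] x, int[] y, Func<double[], double[]> predict, int seed = 0)
    {
        if (x.Length == 0 || x.Length != y.Length)
            throw new ArgumentException("x and y must be non-empty and of equal length");

        var rng = new Random(seed);
        int n = x.Length, nVars = x[0].Length;
        double baseline = Accuracy(x, y, predict);
        var importance = new double[nVars];

        for (int j = 0; j < nVars; j++)
        {
            // Deep-copy the rows, then shuffle column j only (Fisher-Yates).
            double[][] shuffled = x.Select(row => (double[])row.Clone()).ToArray();
            for (int i = n - 1; i > 0; i--)
            {
                int k = rng.Next(i + 1);
                double tmp = shuffled[i][j];
                shuffled[i][j] = shuffled[k][j];
                shuffled[k][j] = tmp;
            }

            importance[j] = baseline - Accuracy(shuffled, y, predict);
        }

        return importance;
    }

    // Fraction of rows whose most probable class equals the true label.
    private static double Accuracy(double[][] x, int[] y, Func<double[], double[]> predict)
    {
        int correct = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double[] p = predict(x[i]);
            int best = Array.IndexOf(p, p.Max());
            if (best == y[i]) correct++;
        }
        return (double)correct / x.Length;
    }
}
```

A variable the model never uses gets an importance of exactly zero, because shuffling it cannot change any prediction. Averaging over several seeds would reduce the noise from a single shuffle.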

forum.alglib.net
