forum.alglib.net

ALGLIB forum

All times are UTC


 Post subject: Random Forest Variable Importance
PostPosted: Mon Jul 05, 2010 12:13 am 

Joined: Mon Jul 05, 2010 12:03 am
Posts: 4
Hi all

Is it possible to get the variable importance from the decision forest? I know that the R implementation of RF gives you a "weight" for each variable in the classification task.

Another thing: since RF is good at predicting missing values, could there be an option to fill them in? (I have some NODATA values in my variables.)

Finally, I tried using extreme values for the missing values (-9999, as well as double.NaN), and this gave some strange results in the classification task. If I have two classes (0 and 1) and the result of the classification task is 0.999, I assume this means the row is very likely to be a member of class 1. If the value is 0.1, the row is more likely to be a member of class 0, right? So, why would there be values of 77, for example?

Cheers,

Alex


 Post subject: Re: Random Forest Variable Importance
PostPosted: Mon Jul 05, 2010 1:43 pm 
Site Admin

Joined: Fri May 07, 2010 7:06 am
Posts: 903
alexgleith wrote:
Is it possible to get the variable importance from the decision forest? I know that the R implementation of RF gives you a "weight" for each variable in the classification task.

ALGLIB doesn't support this feature. It was planned, but other tasks with higher priority appeared.

alexgleith wrote:
Another thing: since RF is good at predicting missing values, could there be an option to fill them in? (I have some NODATA values in my variables.)

Unfortunately, no. In fact, decision forests and neural networks are unlikely to see any significant development in the coming year.

alexgleith wrote:
So, why would there be values of 77, for example?

77? Errr... it should be impossible to get such a value when solving a classification problem :) Please show me the code that leads to this output so I can trace the bug.


 Post subject: Re: Random Forest Variable Importance
PostPosted: Tue Jul 06, 2010 12:55 am 

Joined: Mon Jul 05, 2010 12:03 am
Posts: 4
Thanks for your response. It is a little disappointing that there are some missing aspects of the random forest algorithm, but I understand completely that it is rather complex and that ALGLIB is a much larger project than just this class. If only I had the time to learn the algorithm and help out!

Sergey.Bochkanov wrote:
77? Errr... it should be impossible to get such a value when solving a classification problem :) Please show me the code that leads to this output so I can trace the bug.


I can give you some code. Although it is in the form of an addin for another program, you should be able to see how it works.

The relevant sections are as follows:

This builds the array (a double[,] array) where the first column is the classification (a double value, either 1 or 0) and the other columns are double values between 0 and 1. In this case I have a fudge: if there is a missing value (NODATA), it is replaced with a value of 0.5. I had tried -9999, as I mentioned, since the documentation recommends creating another class for the NODATA values...

Code:
                //build an array to hold the parameters for the random forest operation.
                double[,] array = new double[inputVSG.FeatureTable.Count, nParameters + 1];

                //populate the array.
                for (int i = 1; i < inputVSG.FeatureTable.Count - 1; i++) {

                    array[i - 1, 0] = inputVSG.FeatureTable[i].classification;
                    for (int param = 0; param < nParameters; param++) {
                        try {
                            array[i - 1, param + 1] = inputParamListList[param][i];
                        } catch (NoDataException) {
                            array[i - 1, param + 1] = .5;
                        }
                    }
                }


After this I carry out the operation:

The parameters are: nParameters is around 7, nClasses = 1, nTrees is between 50 and 200, and rValue is between 0.3 and 0.6.

Code:
                //Create the random forest.
                int number = 0;
                dforest.decisionforest df = new dforest.decisionforest();
                dforest.dfreport dfreport = new dforest.dfreport();
                dforest.dfbuildrandomdecisionforest(ref array, inputVSG.FeatureTable.Count, nParameters, nClasses, nTrees, rValue, ref number, ref df, ref dfreport);


Then I check the TEST dataset using the following:

Code:
                    for (int param = 0; param < nParameters; param++) {
                        try {
                            outputAttributesArray[param] = outputParamListList[param][row.RowIndex];
                        } catch (NoDataException) {
                            outputAttributesArray[param] = .5;
                        }
                    }
                    dforest.dfprocess(ref df, ref outputAttributesArray, ref result);

                    row.rank = result[0];


This usually results in a value between 0 and 1. But when I had the NODATA values set to -9999, there were some strange values in the results. Let me know if you need any more information. I can also give you my dataset if you like.

Cheers,

Alex


Attachments:
File comment: C# class addin for Eonfusion
AlexLeithRForest.txt [11.98 KiB]
 Post subject: Re: Random Forest Variable Importance
PostPosted: Wed Jul 07, 2010 10:02 pm 

Joined: Mon Jul 05, 2010 12:03 am
Posts: 4
Just bump.

I hope that you can track this down, let me know if I am doing something wrong!


 Post subject: Re: Random Forest Variable Importance
PostPosted: Mon Jul 12, 2010 8:51 am 

Joined: Mon Jul 05, 2010 12:03 am
Posts: 4
bump again


 Post subject: Re: Random Forest Variable Importance
PostPosted: Tue Jul 13, 2010 12:01 pm 
Site Admin

Joined: Fri May 07, 2010 7:06 am
Posts: 903
Hello!

I've had a hard time lately, so I can only answer your question now. Hope it helps.

First, and most important, you store your variables in the wrong order. Class values must be in the last column, and they must be from 0 to NClasses-1. See http://www.alglib.net/dataanalysis/gene ... ciples.php for more info.

Second, you should check the Info output variable (number in your case) because it contains the error code.

Third, you should encode missing values as described at http://www.alglib.net/dataanalysis/gene ... ciples.php (same link as above).
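
The column layout described in this reply can be sketched roughly as follows. This is only a minimal illustration and not ALGLIB code: `TrainingLayout` and `BuildTrainingMatrix` are hypothetical helper names, and the only rules taken from the reply are that the inputs come first, the class label sits in the last column, and labels run from 0 to NClasses-1.

```csharp
using System;

// Hypothetical helper (not part of ALGLIB): arranges samples so that the
// class label sits in the LAST column of the training matrix.
public static class TrainingLayout
{
    // features[i] holds the inputs of sample i; labels[i] is its class.
    public static double[,] BuildTrainingMatrix(double[][] features, int[] labels, int nClasses)
    {
        if (features.Length != labels.Length)
            throw new ArgumentException("features and labels must have the same length");
        if (features.Length == 0)
            throw new ArgumentException("at least one sample is required");

        int nPoints = features.Length;
        int nVars = features[0].Length;
        var xy = new double[nPoints, nVars + 1];

        for (int i = 0; i < nPoints; i++)   // start at 0 so no row is skipped
        {
            if (features[i].Length != nVars)
                throw new ArgumentException("row " + i + " has the wrong number of variables");
            if (labels[i] < 0 || labels[i] >= nClasses)
                throw new ArgumentOutOfRangeException("labels", "class of row " + i + " must be in 0..nClasses-1");

            for (int j = 0; j < nVars; j++)
                xy[i, j] = features[i][j];  // inputs occupy columns 0..nVars-1
            xy[i, nVars] = labels[i];       // class label goes last
        }
        return xy;
    }
}
```

A matrix built this way would be passed as the first argument of the dfbuildrandomdecisionforest call shown earlier in the thread, and the `number` output would be inspected before the forest is used. For labels 0 and 1 the class count would presumably be 2 rather than 1. ALGLIB's documentation appears to treat a class count of 1 as a regression task, which may explain outputs outside the 0 to 1 range, but this is worth verifying.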


 Post subject: Re: Random Forest Variable Importance
PostPosted: Thu Nov 29, 2018 1:00 pm 
Offline

Joined: Thu Nov 29, 2018 12:58 pm
Posts: 1
Hi,

I'd really like to know how to implement this feature as well; variable importance is really useful for tuning.

I understand you are busy, but if I were to do this myself, where should I be looking in the code?

Cheers, Paul.
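
One model-agnostic way to approximate variable importance without touching the forest internals is permutation importance: shuffle one input column at a time and measure how much the error rate grows. The sketch below is not ALGLIB code and is not necessarily the measure that R reports. `PermutationImportance`, `Compute`, and the `predict` delegate are hypothetical names, and `predict` is assumed to wrap whatever call returns the predicted class for one row.

```csharp
using System;

// Hypothetical, model-agnostic sketch: not part of ALGLIB.
public static class PermutationImportance
{
    // x: rows of inputs; y: true class per row; predict: row -> predicted class.
    // Returns, per variable, the increase in error rate after shuffling that column.
    public static double[] Compute(double[][] x, int[] y, Func<double[], int> predict, int seed)
    {
        int n = x.Length, nVars = x[0].Length;
        var rng = new Random(seed);
        double baseline = ErrorRate(x, y, predict);
        var importance = new double[nVars];

        for (int v = 0; v < nVars; v++)
        {
            // Copy the data so the caller's rows stay untouched.
            var shuffled = new double[n][];
            for (int i = 0; i < n; i++)
                shuffled[i] = (double[])x[i].Clone();

            // Fisher-Yates shuffle of column v only.
            for (int i = n - 1; i > 0; i--)
            {
                int k = rng.Next(i + 1);
                double tmp = shuffled[i][v];
                shuffled[i][v] = shuffled[k][v];
                shuffled[k][v] = tmp;
            }
            importance[v] = ErrorRate(shuffled, y, predict) - baseline;
        }
        return importance;
    }

    // Fraction of rows whose predicted class differs from the true class.
    private static double ErrorRate(double[][] x, int[] y, Func<double[], int> predict)
    {
        int wrong = 0;
        for (int i = 0; i < x.Length; i++)
            if (predict(x[i]) != y[i]) wrong++;
        return (double)wrong / x.Length;
    }
}
```

The scores are more trustworthy when computed on held-out rows rather than the training set, and averaging over several seeds reduces the noise from a single shuffle.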

