Hello!
HotPotato wrote:
1. input counts - what is the approximate optimal number of inputs that a decision tree should work on?
There is no clear limit on the number of inputs. Decision forests can ignore redundant inputs, and you can monitor the generalization error to see whether extra inputs hurt. It is still better not to feed the forest trash data, but a forest is more trash-resistant than a neural network or another non-ensemble model. I suppose that in most applications 100 inputs will be handled without problems.
HotPotato wrote:
2. variable results - even though decision forests are groups of decision trees, I have still noticed variable results in the models being returned after training, even with the same datasets.
Decision forests are randomized, so some amount of randomness will always be present. If the forest's results are too variable, increase the number of trees - it should make results more stable.
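A toy simulation (an assumption, not ALGLIB internals) illustrates why more trees stabilize the result: if each tree's prediction is the true value plus independent noise, the averaged forest prediction varies less and less as trees are added, roughly as 1/sqrt(n_trees).

```python
import random

def forest_prediction_std(n_trees, n_repeats=2000, noise=1.0, seed=0):
    """Run-to-run standard deviation of an averaged ensemble prediction.

    Toy model: each "tree" predicts the true value 0.0 plus independent
    Gaussian noise; the "forest" averages the trees. We repeat training
    many times and measure how much the forest's output wobbles.
    """
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        trees = [rng.gauss(0.0, noise) for _ in range(n_trees)]
        preds.append(sum(trees) / n_trees)
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return var ** 0.5

# Averaging more trees shrinks the run-to-run variability.
small_forest_std = forest_prediction_std(n_trees=5)
large_forest_std = forest_prediction_std(n_trees=100)
```

With 100 trees instead of 5, the spread of the forest's output across retrainings drops by roughly a factor of sqrt(20).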
HotPotato wrote:
3. data normalisation - is there a certain normalisation technique that decision trees will respond to better than others?
Hard to tell. They are definitely not sensitive to scaling/shifting of the inputs - that's because a decision forest does not calculate linear combinations of inputs; it only compares each input against split thresholds, and those thresholds simply move along with any rescaling of the data.
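Here is a minimal sketch of that invariance using a single regression stump (one split, two leaf means) written from scratch - an illustration of the tree mechanism, not ALGLIB code. Training on rescaled inputs moves the threshold by the same amount, so the predictions are unchanged.

```python
def best_stump(xs, ys):
    """Fit a one-split regression stump on a single input.

    Tries every midpoint between consecutive sorted xs as a threshold,
    keeps the split with the smallest squared error, and returns a
    predictor that outputs the left or right leaf mean.
    """
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for k in range(1, len(xs)):
        thr = (xs[order[k - 1]] + xs[order[k]]) / 2.0
        left = [ys[i] for i in order[:k]]
        right = [ys[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x < thr else rm

xs = [0.1, 0.4, 0.9, 1.5, 2.0, 3.2]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

stump = best_stump(xs, ys)
# Same data, but the input is scaled by 1000 and shifted by 5.
scaled_stump = best_stump([1000.0 * x + 5.0 for x in xs], ys)

# The learned threshold moves with the scale, so predictions agree.
predictions_agree = all(stump(x) == scaled_stump(1000.0 * x + 5.0) for x in xs)
```

Contrast this with a neural network, where scaling an input by 1000 changes the linear combinations it computes and usually requires normalisation.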
HotPotato wrote:
4. Extracting results - is there a way to extract which inputs are more related to the outputs than others? I can see many weights being held in the internal model of the decision forests, but I'm not sure how to approach this.
Again, hard to tell. You can't decide which inputs are more important by judging from the weights stored by the decision forest. Some authors propose randomly permuting one particular input and training a new forest on the modified sample. Comparing the generalization errors should tell you which inputs are more important :) I should say that I don't like this idea, but it is the only solution I've heard of.
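The permute-and-retrain scheme can be sketched like this. Since ALGLIB's forest API isn't shown here, the sketch stands in a 1-nearest-neighbour regressor for the forest - any model with a train/predict interface would do. The target depends only on input 0, so permuting input 0 (and retraining) should hurt much more than permuting input 1.

```python
import random

def fit_1nn(train_x, train_y):
    """'Train' a 1-nearest-neighbour regressor (stand-in for a forest)."""
    def predict(row):
        best_i = min(range(len(train_x)),
                     key=lambda i: sum((a - b) ** 2
                                       for a, b in zip(train_x[i], row)))
        return train_y[best_i]
    return predict

def mse(model, xs, ys):
    return sum((model(r) - y) ** 2 for r, y in zip(xs, ys)) / len(ys)

def permuted_retrain_error(train_x, train_y, test_x, test_y, feature, seed=0):
    """Randomly permute one input column, retrain, return test error."""
    rng = random.Random(seed)
    col = [row[feature] for row in train_x]
    rng.shuffle(col)
    shuffled = [list(row) for row in train_x]
    for row, v in zip(shuffled, col):
        row[feature] = v
    return mse(fit_1nn(shuffled, train_y), test_x, test_y)

rng = random.Random(42)
train_x = [[rng.random(), rng.random()] for _ in range(200)]
train_y = [x0 for x0, _ in train_x]          # target depends only on input 0
test_x = [[rng.random(), rng.random()] for _ in range(50)]
test_y = [x0 for x0, _ in test_x]

baseline = mse(fit_1nn(train_x, train_y), test_x, test_y)
err_input0 = permuted_retrain_error(train_x, train_y, test_x, test_y, 0)
err_input1 = permuted_retrain_error(train_x, train_y, test_x, test_y, 1)
```

The large jump in error for input 0 marks it as important; input 1's error stays near the baseline. (A later refinement by Breiman permutes the input on out-of-bag data without retraining, which is cheaper.)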
Hope this helps, and good luck with ALGLIB :)