I have 1100 samples and 2000 binary variables, and I want to determine which variables are most important and significant in relation to my continuous response variable (which I converted using LabelEncoder). I don't need a model for prediction; I just want to know the important variables. I was recommended to use a decision tree, which would split the samples into two groups by the most important variable first.

I created the decision tree, but I'm still not sure which variables are most important. I assume the first split is on the most important variable, but what about the two resulting leaves, which are then split further based on two more variables? Which of those is more important? If my first variable splits the 1100 samples into 1050 true and 50 false, would the variable that splits the 1050 samples be more important than the variable that splits the 50 samples?

I'm new to decision trees, so I may be misunderstanding the entire concept. I'm also having trouble understanding what condition splits the values into true and false, and what that means. Is it simply the binary value of that variable that splits them up? Also, this may be off topic, but I don't understand what 'gini' means in many of the boxes.
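For context, this is a rough sketch of the setup I'm describing, assuming scikit-learn and pandas; the file name `data.csv` and the column name `response` are just placeholders, not my real data:

```python
# Rough sketch of my setup (file and column names are placeholders).
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("data.csv")                      # 1100 rows
X = df.drop(columns=["response"])                 # the 2000 binary variables
y = LabelEncoder().fit_transform(df["response"])  # encoded response variable

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
```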
- Possible duplicate of Best model for variable selection with big data? – PV8, Jul 12, 2019 at 10:51
- I know you want to have an answer to this question, but please don't post the same question twice: stackoverflow.com/questions/56977952/… – PV8, Jul 12, 2019 at 10:51
- I felt that I got an answer to my original question, but I had a different question about how decision trees work. I only repeated my problem so that the context would make sense. – Chase Lewis, Jul 12, 2019 at 11:00
1 Answer
Basic decision trees use the Gini index or information gain to decide which variables are the most important, and they put that variable (or those variables) right at the top of the tree. Have you tried printing your tree with Graphviz? You'll get a diagram where each box shows the split condition, its 'gini' value, and the number of samples reaching that node.
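For example, something along these lines (a rough sketch, assuming a fitted scikit-learn `DecisionTreeClassifier` called `clf` and a feature DataFrame `X`; the output file name is a placeholder):

```python
# Sketch: render the fitted tree with Graphviz and rank per-variable importances.
import pandas as pd
import graphviz
from sklearn.tree import export_graphviz

# Export the fitted tree to DOT format and render it as an image.
dot = export_graphviz(clf, out_file=None,
                      feature_names=X.columns,
                      filled=True, rounded=True)
graphviz.Source(dot).render("tree", format="png")  # writes tree.png

# feature_importances_ sums the gini decrease of every split a variable is
# used in, weighted by the fraction of samples reaching that split, so it
# ranks all variables directly rather than only showing the root split.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))
```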
- Yes, I was able to make a visual of my tree. With your example above, I guess my question would be: how do I differentiate between Pclass <= 2.25 and Pclass <= 2.85? Are they both equally strong variables? Or would the one with more samples remaining be stronger? – Chase Lewis, Jul 12, 2019 at 11:03
- @ChaseLewis This was a classification model where the dependent variable was 'Survived'. We wanted to see which explanatory variable included in the model (new_sex, Pclass, etc.) was most important for the survival of a passenger. The most important variable was 'new_sex', with a gini value of 0.47. The splits to the left and the right are for whether a passenger was female or male, based on the <= 0.6 threshold assigned by the model (male = 1, female = 0). Pclass was then the next most important variable affecting survival, and so on down the tree for a max_depth of 10. – Jul 12, 2019 at 14:24
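As a concrete illustration of what the 'gini' number in each box means (the class proportions below are only an example, not taken from the data above):

```python
# Gini impurity of a node: 1 - sum(p_k ** 2) over the class proportions
# p_k of the samples that reach that node.
def gini(proportions):
    return 1 - sum(p ** 2 for p in proportions)

print(gini([0.38, 0.62]))  # ~0.47: a mixed node, worth splitting further
print(gini([1.0, 0.0]))    # 0.0: a pure node, nothing left to split
```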