Comprehending large decision tree diagram for variable selection?

Question

I have 1100 samples and 2000 binary variables, and I want to determine which variables are most important and significant in relation to my continuous response variable (which I converted using LabelEncoder). I don't need a predictive model; I just want to know the important variables. It was recommended that I use a decision tree, which would split the samples into two groups using the most important variables first.

I created the decision tree, but I'm still not sure which variables are most important. I assume the first split uses the most important variable, but what about the two child nodes, which are then split further on two more variables? Which of those is more important? If my first variable splits the 1100 samples into 1050 true and 50 false, would the variable that splits the 1050 samples be more important than the one that splits the 50 samples?

I'm new to decision trees, so I may be misunderstanding the entire concept. I'm having trouble understanding what condition splits the values into true and false, and what this means. Is it simply the binary value of that variable that splits them? Also, this may be off topic, but I don't understand what 'gini' means in many of the boxes.

Possible duplicate of Best model for variable selection with big data? — PV8, Commented Jul 12, 2019 at 10:51
I know you want an answer to this question, but please don't post the same question twice: stackoverflow.com/questions/56977952/… — PV8, Commented Jul 12, 2019 at 10:51
I felt that I got an answer to my original question, but I had a different question about how decision trees work. I only repeated my problem so that the context would make sense. — Chase Lewis, Commented Jul 12, 2019 at 11:00

The Orchestrator · Accepted Answer · 2019-07-12 10:58:09Z

Basic decision trees use the Gini index or information gain to decide which variables are most important, and they place those variables at the top of the tree. Have you tried printing your tree with Graphviz? You'll get something like this (rendered tree diagram, not reproduced here).
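If the diagram is too large to read, the same information can be pulled out of the fitted tree directly. The following is a minimal sketch, assuming scikit-learn and a classification target as in the Titanic example; the data is synthetic and the names `var_0` ... `var_4` are made up for illustration. It prints a text version of the splits and ranks the variables by the tree's impurity-based `feature_importances_`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption, not the asker's data):
# 200 samples, 5 binary variables; the label depends only on var_0.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0].copy()
feature_names = [f"var_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Text rendering of the splits, as an alternative to a large Graphviz image.
print(export_text(tree, feature_names=feature_names))

# Impurity-based importances: one score per variable, summing to 1
# whenever the tree makes at least one split.
importances = dict(zip(feature_names, tree.feature_importances_))
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

These scores aggregate over every node where a variable is used, so they are usually easier to compare than positions in the diagram. For a genuinely continuous response, `DecisionTreeRegressor` exposes the same attribute.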


Yes, I was able to make a visual of my tree. With your example above, I guess my question would be: how do I differentiate between Pclass <= 2.25 and Pclass <= 2.85? Are they both equally strong variables, or would the one with more samples remaining be stronger?
– Chase Lewis
Commented Jul 12, 2019 at 11:03
@ChaseLewis - This was a classification model in which the dependent variable was 'Survived'. We wanted to see which explanatory variable (new_sex, Pclass, etc.) was most important to a passenger's survival. The most important variable was 'new_sex', with a gini value of 0.47. The left and right branches separate female from male passengers using the <= 0.6 threshold chosen by the model (Male = 1, Female = 0). Pclass was the next most important variable for survival, and so on down the tree to a max_depth of 10.
– The Orchestrator
Commented Jul 12, 2019 at 14:24
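On the question of whether the split on 1050 samples outweighs the split on 50: the 'gini' value in each box is the Gini impurity of the samples that reach that node, and a split's contribution is commonly measured as the impurity decrease weighted by the share of all samples reaching the node. The sketch below works this out in plain Python. The per-class counts are hypothetical, chosen only to match the 1050/50 sizes from the question.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the fraction of all
    samples that reach the parent node."""
    n_parent = sum(parent)
    n_left, n_right = sum(left), sum(right)
    drop = (gini(parent)
            - (n_left / n_parent) * gini(left)
            - (n_right / n_parent) * gini(right))
    return (n_parent / n_total) * drop


N_TOTAL = 1100  # all samples at the root, as in the question

# Hypothetical class counts [class_0, class_1] for the two child nodes.
# Big node (1050 samples): a moderately useful split.
big = weighted_decrease([525, 525], [400, 125], [125, 400], N_TOTAL)
# Small node (50 samples): a perfect split.
small = weighted_decrease([25, 25], [25, 0], [0, 25], N_TOTAL)

print(f"split on the 1050-sample node: {big:.4f}")
print(f"split on the 50-sample node:   {small:.4f}")
```

With these made-up counts, the imperfect split on the large node contributes more than the perfect split on the small one, because far more samples pass through it. Sample count alone does not decide it, though: a split on a large node that barely reduces impurity can still contribute less.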
