The document summarizes an analysis of factors that affect health insurance costs and compares machine learning algorithms for predicting costs. It explores how age, sex, BMI, family size, region, and smoking affect costs using a dataset of over 1,300 insurance records. It tests decision trees, support vector machines, KNN, naive Bayes, neural networks, and XGBoost for prediction and finds that KNN performs best with an accuracy of 94.69%. Smoking is identified as the strongest predictor of costs.
beneficiary getting charged the most?
2. Which machine learning algorithm is best at predicting the charges?

About Our Dataset
● We are using the insurance.csv dataset found on Kaggle: https://www.kaggle.com/mirichoi0218/insurance
● There are 1,338 records
● There are 7 attributes: Age, Sex, BMI, Number of Children, Region, Smoking (yes or no), and Charges (how much the person was charged for insurance)
● Our defined target variable is “charges”

For added complexity: We included an additional dataset that gave health attributes about specific states. We linked this to our dataset by assigning the states to the already existing regions. We made the addition because we wanted to see whether there was more evidence of a relationship between health insurance costs and region beyond the attributes given in our main dataset.

Experiments

Decision Tree/Pruned Decision Tree
What is it? A decision tree is a specific type of flow chart used to visualize the decision-making process by mapping out different courses of action, as well as their potential outcomes.

Support Vector Machines
What is it? SVMs work by trying to divide up all the data points using the kernel trick. This draws a line (called a “hyperplane”), trying to maximize the distance between the different classes of points as much as possible.
Kernels Used:
● Linear
● Radial
● Polynomial

KNN
What is it? KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).

Naive Bayes
What is it? It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Multiclass Artificial Neural Network
What is it? In multi-class classification, the neural network has the same number of output nodes as the number of classes. Each output node belongs to some class and outputs a score for that class.
Multi-Class Classification (3 classes): Scores from the last layer are passed through a softmax layer.

Multiclass XGBoost
What is it? XGBoost is a decision-tree-based ensemble machine learning algorithm that uses an extreme gradient boosting framework. The “extreme” version uses the same gradient boosting approach as the original, with an added focus on speed and performance.

Comparison of the Results

Conclusion:
Being a smoker had the highest information gain in determining the health insurance cost.
The machine learning algorithm best at predicting charges is the KNN model. The machine learning algorithm worst at predicting charges is Naive Bayes.
KNN had the highest classification accuracy, at 94.69%.
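
Since KNN came out on top, here is a minimal sketch of the voting procedure described in the KNN section: measure the distance from a query to every example, keep the K closest, and pick the most frequent label. This is not the code used in the experiments. The function name `knn_predict`, the toy records, and the "low"/"high" charge bands are all made up for illustration.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training example.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k examples closest to the query.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Vote for the most frequent label among those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy records: (age, BMI, smoker flag). Not taken from insurance.csv.
train_points = [
    (25, 22.0, 0),
    (30, 24.5, 0),
    (45, 31.0, 1),
    (52, 29.5, 1),
    (38, 27.0, 0),
]
# Hypothetical charge bands used as class labels.
train_labels = ["low", "low", "high", "high", "low"]

prediction = knn_predict(train_points, train_labels, query=(48, 30.0, 1), k=3)
print(prediction)
```

In this sketch the features are used unscaled, so age dominates the distance. On real data, features are usually rescaled to comparable ranges first, since KNN is sensitive to the units of each attribute.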