Classification task with 6 different algorithms using Python

Six classification algorithms are used to predict death from heart failure: Random Forest, Logistic Regression, KNN, Decision Tree, SVM, and Naive Bayes. The goal is to find the best one.

Designed in CanvaPro

Introduction

In this blog post, I will use 6 different classification algorithms to predict death from heart failure.

Here are the algorithms I will use:

  • Random Forest
  • Logistic Regression
  • KNN
  • Decision Tree
  • SVM
  • Naive Bayes

After that, I will compare the results using:

  • Precision
  • Accuracy
  • Recall
  • F1 score

This will be longer than my other blog posts, but after reading it, you will have a solid understanding of machine learning classification algorithms and evaluation metrics.

If you’d like to learn more about machine learning terms, here’s my blog post, Machine Learning AZ Briefly Explained.

Now let’s start with the data.

Data exploration

The dataset comes from the UCI Machine Learning Repository, an open-source website where you can find many other datasets, categorized by task (regression, classification), attribute type (categorical, numerical), and more.

It is also a good starting point if you are looking for free resources to download datasets.

This dataset contains the medical records of 299 patients who had heart failure, described by 13 clinical features:

  • Age: age of the patient (years)
  • Anemia: decrease in red blood cells or hemoglobin (Boolean)
  • Hypertension: whether the patient has high blood pressure (Boolean)
  • Creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
  • Diabetes: whether the patient has diabetes (Boolean)
  • Ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
  • Platelets: platelets in the blood (kiloplatelets/mL)
  • Gender: female or male (binary)
  • Serum creatinine: level of serum creatinine in the blood (mg/dL)
  • Serum sodium: level of serum sodium in the blood (mEq/L)
  • Smoking: whether the patient smokes (Boolean)
  • Time: follow-up period (days)
  • [target] Death event: whether the patient died during the follow-up period (Boolean)

After loading the data, let’s take a first look at it.

Photo by author

Before implementing a machine learning algorithm, you need to be sure of the data types and check whether each column contains any null values.

Photo by author
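Since the full code is only available in the PDF, here is a minimal sketch of these first steps (not the author’s exact code). It assumes the UCI file name heart_failure_clinical_records_dataset.csv; adjust the path to wherever you saved the data.

```python
import pandas as pd

# Assumed file name from the UCI repository; change it to your own path.
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

print(df.head())  # first five rows of the data
df.info()         # column data types and non-null counts
```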

Sometimes a dataset is sorted along a specific column, so the first rows may not be representative. That’s why I’ll also look at a random sample of rows.

However, if you want to see the source code of this project, subscribe here and I will send you the code with detailed explanations as a PDF.

Now, let’s continue. Here are 5 random sample rows from the dataset. Note that if you run the code yourself, the rows will be different because the sampling function returns rows at random.

Photo by author.
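A one-line sketch of this step, reusing the df loaded above:

```python
# Five random rows; the output changes on every run unless you fix a seed,
# e.g. df.sample(5, random_state=42).
print(df.sample(5))
```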

Now let’s look at the value counts of the high blood pressure column. I already know how many distinct values this column has (2), but checking still makes me more confident about the data.

Photo by author

Yes, it looks like we have 105 patients who have hypertension and 194 patients who don’t.

Let’s also look at the value counts of the smoking column.

Photo by author
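Both counts come from the same pandas call. A small sketch, assuming the UCI column names high_blood_pressure and smoking:

```python
# Count how many patients fall into each category of the Boolean columns.
print(df["high_blood_pressure"].value_counts())
print(df["smoking"].value_counts())
```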

I think that’s enough with data exploration.

Let’s do some data visualization.

Of course, this part can be extended depending on the needs of your project.

Here’s a blog post that includes examples of data analysis with Python, specifically using the Pandas library.

Data visualization

Plotting the distribution of each feature lets you examine how it is distributed, decide whether to transform or drop it, and detect outliers.

Photo by author – Distribution graph

Of course, this graph is only an overview. If you want to look closer to identify outliers, you need to plot each feature separately.

Photo by author
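Roughly, the plots above can be produced like this (a sketch, not the PDF code; the box-plot column is just an example):

```python
import matplotlib.pyplot as plt

# Histograms of every numeric feature in one overview figure.
df.hist(figsize=(12, 10), bins=20)
plt.tight_layout()
plt.show()

# A box plot of a single feature to look for outliers more closely.
df.boxplot(column="creatinine_phosphokinase")
plt.show()
```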

By the way, Matplotlib and Seaborn are very useful data visualization libraries. If you want to know more about them, here is my article, About Data Visualization for Machine Learning with Python.

Now, let’s come to the feature selection part.

Feature selection

PCA

Well, let’s choose our features.

By doing PCA, we can find how many components are needed to explain a given percentage of the variance in the data.

Here, it seems that about 8 components are enough to explain 80% of the dataset’s variance.

PCA – Image by the author
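A possible way to get that number, assuming the target column is named DEATH_EVENT as in the UCI CSV; scaling before PCA is my own choice here, not necessarily the author’s:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale the features, fit PCA, and find how many components explain 80% of the variance.
X = df.drop(columns=["DEATH_EVENT"])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components needed for 80%:", np.argmax(cumulative >= 0.80) + 1)
```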

Correlation graph

Highly correlated features can hurt the performance of our model, so after the PCA, let’s draw a correlation map to find and remove correlated features.

Correlation map – Image by author

Here, you can see the high correlation between sex and smoking.

The main purpose of this article is to compare the results of classification algorithms, so I will not drop either of them, but you can in your own model.
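The correlation map itself is a one-liner with Seaborn; a sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Annotated heatmap of the pairwise feature correlations.
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```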

Model building

Now it’s time to build your machine learning model. To do this, first we need to split the data.

Train-test split

Evaluating your model’s performance on data it has never seen is an important part of machine learning. To do this, we typically split the data 80/20 into training and test sets.

Another technique used to evaluate machine learning models is cross-validation, which helps you select the best model among your candidates. The held-out portion is sometimes called a dev set; for more information, you can watch Andrew Ng’s videos, which are very informative.
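Here is a sketch of both ideas with scikit-learn; the 80/20 ratio and random_state are illustrative, and cross-validation is shown with logistic regression only as an example:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X = df.drop(columns=["DEATH_EVENT"])
y = df["DEATH_EVENT"]

# 80/20 split; stratify keeps the class balance similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5-fold cross-validation on the training set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("mean CV accuracy:", scores.mean())
```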

Now let’s come to the model evaluation metrics.

Model evaluation metrics

Now let’s find out the classification model evaluation metrics.

Precision

If you predict positive, what percentage of those predictions is correct?

Recall

Of all the actual positives, how many did the model find (the true positive rate)?

F1 score

Harmonic mean of recall and precision

To learn more about these terms, here is my post: Machine Learning AZ Briefly Explained.

Here are the formulas for precision, recall, and F1 score.

Precision formula – Image by author
Recall formula – Image by author
F1 score formula – Image by author
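In code, scikit-learn already implements all four metrics. The small helper below is my own sketch (not the PDF code); it is reused in the model examples that follow:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# precision = TP / (TP + FP)
# recall    = TP / (TP + FN)
# f1        = 2 * precision * recall / (precision + recall)
def evaluate(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```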

Random Forest Classifier

Our first classification algorithm is random forest.

After applying this algorithm, here are the results.

If you want to see the source code, please subscribe for free here.

I will send you the PDF, which contains the code with an explanation.

Random Forest Evaluation Score – Author’s photo
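In the meantime, here is a rough sketch of this step, reusing the train/test split and the evaluate helper from above; the hyperparameters are illustrative, not necessarily the ones behind the scores shown:

```python
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest on the training set and score it on the held-out test set.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(evaluate(y_test, rf.predict(X_test)))
```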

Now, let’s continue.

Logistic Regression

Here is another classification algorithm: logistic regression.

Logistic regression uses the sigmoid function to perform binary classification.

Author’s photo — Sigmoid function
Logistic Regression Prediction Score – Author’s photo
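A sketch of the sigmoid and of the model itself, again reusing the objects defined earlier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Maps any real number into (0, 1); logistic regression applies this to a
    # linear combination of the features to turn it into a probability.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print(evaluate(y_test, log_reg.predict(X_test)))
```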

This one shows higher accuracy and precision.

Let’s continue the search for the best model.

KNN

OK, now let’s apply K-Nearest Neighbor and see the results.

But when applying KNN, you need to select “K”, the number of neighbors the algorithm will use.

To do this, using a loop seems to be the best way.

Finding the Best Score – Author’s photo

Now, it looks like k=2 has the best accuracy, but to eliminate human intervention, let’s find the best k using code.

Best K Score – Author’s photo
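A sketch of that search; note that picking k on the test set, as here, is a simplification, and cross-validation would be the stricter choice:

```python
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Try k = 1..20 and keep the value with the best test accuracy.
best_k, best_acc = 1, 0.0
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test))
    if acc > best_acc:
        best_k, best_acc = k, acc

# Refit with the best k and report all four metrics.
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, evaluate(y_test, knn.predict(X_test)))
```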

After choosing k=2, here are the scores. It seems that KNN does not work as well here. Perhaps we need to remove the correlated features or normalize the data first; of course, these preprocessing steps may change the results.

KNN Evaluation Score – Author’s photo

Great, let’s continue.

Decision Tree

Now it’s time to apply the decision tree. But to do this, we first need to find the best tree depth.

So when applying our model, it is important to test different depths.

Finding optimal depth for accuracy – Author’s photo

And to pick the best depth from those results, let’s continue to automate.

Depth for best accuracy – Author’s photo
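A sketch of the depth search, following the same pattern as the KNN loop:

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Try maximum depths from 1 to 10 and keep the best-performing one.
best_depth, best_acc = 1, 0.0
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    acc = accuracy_score(y_test, tree.predict(X_test))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Refit with the best depth and report all four metrics.
tree = DecisionTreeClassifier(max_depth=best_depth, random_state=42).fit(X_train, y_train)
print(best_depth, evaluate(y_test, tree.predict(X_test)))
```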

Well, now we have found the best-performing depth. Let’s look at the scores.

Decision Tree Evaluation Score – Author’s photo

Excellent, let’s continue.

Support vector machines

Now, to implement the SVM algorithm, we need to select a kernel type. The kernel affects our results, so we iterate over the options to find the kernel that gives the model with the best F1 score.

Finding the best kernel type – Image by author
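A sketch of that kernel search; the kernel list is the standard set offered by scikit-learn’s SVC:

```python
from sklearn.metrics import f1_score
from sklearn.svm import SVC

# Compare the common kernel types by F1 score on the test set.
best_kernel, best_f1 = "linear", 0.0
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    svm = SVC(kernel=kernel).fit(X_train, y_train)
    score = f1_score(y_test, svm.predict(X_test))
    if score > best_f1:
        best_kernel, best_f1 = kernel, score

# Refit with the winning kernel and report all four metrics.
svm = SVC(kernel=best_kernel).fit(X_train, y_train)
print(best_kernel, evaluate(y_test, svm.predict(X_test)))
```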

Well, we will use the linear kernel.

Let’s find the precision, accuracy, recall, and F1 score with the linear kernel.

SVM evaluation score – Author’s photo

Naive Bayes

Now, Naive Bayes will be our final model.

Do you know why naive Bayes is called naive?

Because the algorithm assumes that the input variables are independent of each other. Of course, this assumption rarely holds with real-life data, and that is what makes the algorithm “naive”.

OK, let’s continue.
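A sketch with the Gaussian variant, which is the usual choice for continuous features:

```python
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes: assumes the features are independent and normally distributed.
nb = GaussianNB().fit(X_train, y_train)
print(evaluate(y_test, nb.predict(X_test)))
```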

Naive Bayes Evaluation Score – Author’s photo.

Prediction dictionary

Now that we have completed the model search, let’s store all of the results in a data frame, which will allow us to evaluate them together.

After that, let’s find the model with the highest precision.
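One way to do this, reusing the fitted models and the evaluate helper from the sketches above:

```python
import pandas as pd

# Collect the metric dictionaries of every model into one table (models as rows).
results = pd.DataFrame({
    "Random Forest": evaluate(y_test, rf.predict(X_test)),
    "Logistic Regression": evaluate(y_test, log_reg.predict(X_test)),
    "KNN": evaluate(y_test, knn.predict(X_test)),
    "Decision Tree": evaluate(y_test, tree.predict(X_test)),
    "SVM": evaluate(y_test, svm.predict(X_test)),
    "Naive Bayes": evaluate(y_test, nb.predict(X_test)),
}).T

# For each metric, print the best model and its score.
for metric in ["precision", "accuracy", "recall", "f1"]:
    print(metric, results[metric].idxmax(), round(results[metric].max(), 3))
```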

The model with the highest precision

Model with the highest precision – Author’s photo

The model with the highest accuracy

Model with the highest accuracy – Figure by the author

The model with the highest recall

Model with highest recall – Author’s photo

The model with the highest F1 score

Model with highest F1 score – Author’s photo

Conclusion

The metric that matters may vary depending on the needs of your project. You might look for the most precise model, the most accurate one, or the one with the highest recall.

This way, you can find the best model, which will meet the needs of your project.

If you would like me to send you the source code PDF with explanations for free, please subscribe here.

Thanks for reading my article!

I like to send 1 or 2 e-mails per week. If you also want a free NumPy cheat sheet, here is the link for you!

If you are not yet a member of Medium and are interested in reading, here is my referral link.

“Machine intelligence is the last invention that humanity will ever need to make.” Nick Bostrom