Brain Stroke Prediction


Model Overview

Use Case Summary



Problem Statement

Visualize the relationships between various healthy and unhealthy habits and heart strokes, and thereby predict whether a person will have a brain stroke or not, using the best model with hypertuned parameters. We have all seen various apps and websites claim to act as a doctor on the basis of their models; we are trying to build something similar here. This model can also be a first step towards early awareness of this killer.


 Attribute Information
1) gender: Male(1), Female(0), Other(2)
2) age: age of the patient
3) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
4) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
5) ever_married: Yes(1), No(0)
6) work_type: Private(2), Self-employed(3), Govt_job(0), children(4), Never_worked(1)
7) Residence_type: Rural(0), Urban(1)
8) avg_glucose_level: average glucose level in blood
9) bmi: body mass index
10) smoking_status: formerly smoked(1), never smoked(2), smokes(3) and Unknown(0)



Output Description
Given input parameters such as gender, age, existing diseases, and smoking status, the model predicts whether the person will have a brain stroke or not:
1 if the patient had a stroke, 0 if not.

How to Use the Model?
1) Choose inputs with correct values.
2) Click on Predict.
3) Your output is ready.
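
Equivalently, in code: a hypothetical sketch using the rf model and scaler trained later in this report, with illustrative feature values encoded as in the Attribute Information table above.

import pandas as pd

# One patient, categorical fields already label-encoded (values are illustrative)
patient = pd.DataFrame([{
    'gender': 1, 'age': 67, 'hypertension': 0, 'heart_disease': 1,
    'ever_married': 1, 'work_type': 2, 'Residence_type': 1,
    'avg_glucose_level': 228.69, 'bmi': 36.6, 'smoking_status': 1,
}])
print(rf.predict(scaler.transform(patient))[0])  # 1 = stroke, 0 = no stroke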

Let's look at the code...

What is Brain Stroke?
A stroke occurs when the blood supply to part of your brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die in minutes. A stroke is a medical emergency, and prompt treatment is crucial. Early action can reduce brain damage and other complications.


Symptoms : If you or someone you're with may be having a stroke, pay particular attention to the time the symptoms began. Some treatment options are most effective when given soon after a stroke begins. Signs and symptoms of stroke include:

  • Trouble speaking and understanding what others are saying. You may experience confusion, slur your words or have difficulty understanding speech.

  • Paralysis or numbness of the face, arm or leg. You may develop sudden numbness, weakness or paralysis in your face, arm or leg. This often affects just one side of your body. Try to raise both your arms over your head at the same time. If one arm begins to fall, you may be having a stroke. Also, one side of your mouth may droop when you try to smile.

  • Problems seeing in one or both eyes. You may suddenly have blurred or blackened vision in one or both eyes, or you may see double.

  • Headache. A sudden, severe headache, which may be accompanied by vomiting, dizziness or altered consciousness, may indicate that you're having a stroke.

  • Trouble walking. You may stumble or lose your balance. You may also have sudden dizziness or a loss of coordination.


    # pandas
    import pandas as pd

    # numpy
    import numpy as np

    # matplotlib
    import matplotlib.pyplot as plt

    # seaborn
    import seaborn as sns
    sns.set_theme(style="darkgrid")

    # sklearn
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import (f1_score, confusion_matrix,
                                 classification_report, roc_auc_score)



Read Data From CSV.


stroke_df = pd.read_csv('healthcare-dataset-stroke-data.csv')
stroke_df.head()





Identify Categorical and Numerical Features


categorical_vars = list()
numerical_vars = list()

for i in stroke_df.columns:
    if stroke_df[i].dtype == 'object':
        categorical_vars.append(i)
    else:
        numerical_vars.append(i)

Inference :
The object-dtype columns (gender, ever_married, work_type, Residence_type, smoking_status) go into categorical_vars; the remaining columns are numerical.

If you observe, the id column is not required for our prediction, so we drop it.


stroke_df.drop('id',axis=1,inplace=True)
stroke_df.head()




 



Check for NULL Values


stroke_df.isnull().sum()


Inference :
There are no null values in the dataset except in the bmi column.


 

print("Total Rows In BMI column :",len(stroke_df.bmi))
print("Total null values present in bmi column :",stroke_df.bmi.isnull().sum())​


Handling Missing Values
There are many ways to handle missing values.
One option is to delete the rows that contain null values, but this can lose a lot of information.
Another option is to replace null values with the mean or median.
The second method is effective when the feature is numeric and continuous, and the good news is that our bmi column fits this condition perfectly.
So we use the second method.
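
A minimal sketch of the second method, assuming mean imputation (the median variant works the same way via .median()):

stroke_df['bmi'] = stroke_df['bmi'].fillna(stroke_df['bmi'].mean())
print("Nulls remaining in bmi :", stroke_df.bmi.isnull().sum())  # expect 0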


EDA : 


sns.countplot(x='stroke',data=stroke_df)
plt.title("Countplot for Stroke",{'fontsize':20});


Inference :
Based on the distribution of the stroke feature, we can say that the dataset is imbalanced.
We have far more records of patients who had no stroke than of patients who had a stroke.
Let's handle the imbalanced data later.
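
A quick way to quantify this imbalance:

print(stroke_df.stroke.value_counts())  # records per class: no stroke (0) vs stroke (1)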


fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharey=True)
fig.suptitle('Distribution Countplots for Some Features')

sns.countplot(ax=axes[0][0], x=stroke_df['hypertension'], palette="viridis")

sns.countplot(ax=axes[0][1], x=stroke_df['work_type'], palette="rocket");

sns.countplot(ax=axes[1][0], x=stroke_df['ever_married'], palette="husl");

sns.countplot(ax=axes[1][1], x=stroke_df['Residence_type'], palette="husl");





Distribution based on Stroke Patients


plt.figure(figsize=(7,8))
smoke_counts = stroke_df.smoking_status[stroke_df.stroke == 1].value_counts()
plt.pie(x=smoke_counts,
        autopct='%1.1f%%',
        shadow=True, colors=['plum','lightpink','lawngreen','cyan']);
# taking the labels from value_counts keeps them aligned with the slices
plt.legend(smoke_counts.index, bbox_to_anchor=(1.05,1.025), loc="upper left");
plt.title("Patients who had a stroke, by smoking status", {'fontsize':20});

plt.figure(figsize=(7,8))
work_counts = stroke_df.work_type[stroke_df.stroke == 1].value_counts()
plt.pie(x=work_counts,
        explode=(0, 0, 0, 0.2),
        autopct='%1.1f%%',
        shadow=True, colors=['royalblue','darkorange','springgreen','lightcyan']);
plt.legend(work_counts.index, bbox_to_anchor=(1.05,1.025), loc="upper left");
plt.title("Patients who had a stroke, by work type", {'fontsize':20});


Inference : Based on this distribution, people whose work type is Private account for a much larger share of stroke cases than those in Govt_job.







X = stroke_df.drop('stroke',axis=1)
y = stroke_df.stroke

X.age = round(X.age)

Convert Categorical Variables into numeric using Label Encoder.


encoder = LabelEncoder()

objList = X.select_dtypes(include="object").columns
for feat in objList:
    X[feat] = encoder.fit_transform(X[feat])
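
To see which integer each category receives (LabelEncoder assigns codes alphabetically, which is where the numbers in the Attribute Information section come from), a small illustrative sketch fitting a fresh encoder per column:

encodings = {}
for feat in objList:
    le = LabelEncoder()
    le.fit(stroke_df[feat])
    encodings[feat] = {cls: code for code, cls in enumerate(le.classes_)}
print(encodings['work_type'])
# {'Govt_job': 0, 'Never_worked': 1, 'Private': 2, 'Self-employed': 3, 'children': 4}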




Handling Imbalance Data
SMOTE (Synthetic Minority Over-sampling Technique) works in 4 simple steps:

  • Choose a sample from the minority class as the input vector.

  • Find its k nearest neighbors (k_neighbors is specified as an argument in the SMOTE() function).

  • Choose one of these neighbors and place a synthetic point anywhere on the line joining the point under consideration and its chosen neighbor.

  • Repeat the steps until data is balanced.




from imblearn.over_sampling import SMOTE
smote = SMOTE()

x_smote, y_smote = smote.fit_resample(X, y)
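
A quick sanity check that SMOTE balanced the classes:

print(y_smote.value_counts())  # both classes should now have equal counts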



Splitting Data
To get a reliable estimate of performance, divide the data into training and testing sets: the model is fit on the training points and then evaluated on the held-out test points.


X_train,X_test,y_train,y_test = train_test_split(x_smote,y_smote,test_size=0.28)






Models
Feature Scaling : 


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only: reuse the training-set statistics

Logistic Regression


log_reg = LogisticRegression()
log_reg.fit(X_train_scaled,y_train)
log_reg.score(X_test_scaled,y_test)

Output : 0.8005875872199779

Random Forest


rf = RandomForestClassifier()
rf.fit(X_train_scaled,y_train)
rf.score(X_test_scaled,y_test)

Output : 0.9309585016525891

Classification Report 


rf_pred = rf.predict(X_test_scaled)
log_pred = log_reg.predict(X_test_scaled)

print("Classifiaction Report for Random Forest")
print(classification_report(y_test,rf_pred))
print("******************************************************")
print("Classification Report for Logistic Regression")
print(classification_report(y_test,log_pred))





Confusion Matrix

Random Forest

class_names = [0, 1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

cnf_matrix = confusion_matrix(y_test, rf_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="Blues", fmt='g')
ax.xaxis.set_label_position('top')
plt.tight_layout()
plt.title('Heat Map for Random Forest', {'fontsize':20})
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()


Logistic Regression

class_names = [0, 1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

cnf_matrix = confusion_matrix(y_test, log_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap='Blues', fmt='g')
ax.xaxis.set_label_position('top')
plt.tight_layout()
plt.title('Heat Map for Logistic Regression', {'fontsize':20})
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()


Roc Curves

pred_prob1 = log_reg.predict_proba(X_test_scaled)
pred_prob2 = rf.predict_proba(X_test_scaled)
from sklearn.metrics import roc_curve

# roc curve for models
fpr1, tpr1, thresh1 = roc_curve(y_test, pred_prob1[:,1], pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y_test, pred_prob2[:,1], pos_label=1)

# roc curve for tpr = fpr
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)

plt.style.use('seaborn-v0_8')  # on older matplotlib versions this style was named 'seaborn'
# plot roc curves
plt.plot(fpr1, tpr1, linestyle='--',color='orange', label='Logistic Regression')
plt.plot(fpr2, tpr2, linestyle='--',color='green', label='Random Forest')
plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive rate')

plt.legend(loc='best')
plt.savefig('ROC',dpi=300)
plt.show();

(F1 score for Logistic Regression: 0.80; for Random Forest: 0.92)
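
A minimal sketch of how these scores can be computed with the metrics imported at the top:

print("F1 (Logistic Regression) :", f1_score(y_test, log_pred))
print("F1 (Random Forest)       :", f1_score(y_test, rf_pred))
print("ROC AUC (Logistic Regression) :", roc_auc_score(y_test, pred_prob1[:, 1]))
print("ROC AUC (Random Forest)       :", roc_auc_score(y_test, pred_prob2[:, 1]))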






Feature Importance For Random Forest Model


plt.figure(figsize=(9,7))
feature_imp1 = rf.feature_importances_
sns.barplot(x=feature_imp1, y=X.columns)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features For Random Forest ",{'fontsize':25})
plt.show();
feature_dict = {k:v for (k,v) in zip(X.columns,feature_imp1)}
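
feature_dict maps each feature to its importance score; sorting it lists the most influential features first:

for feat, imp in sorted(feature_dict.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feat}: {imp:.3f}")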


Conclusion :
We started by reading the data and separating the features into categorical and numerical ones. After that we dealt with the missing values in the bmi feature.
Then we performed EDA on the features and concluded that the data is imbalanced, i.e. there are more negative-class examples than positive-class ones.
After visualization we handled the imbalanced data with SMOTE.
Then we moved to the most important part, model building. Before training we split the data into train data (for fitting) and test data (for evaluation) and performed feature scaling.
Random Forest and Logistic Regression models were tried.
To check which model performs best, we plotted ROC curves along with classification reports and confusion matrices.
Random Forest won the race.

