Obesity Level Classification Using Eating Habits and Physical Condition

🧠 Project Overview

This project uses supervised machine learning to classify individuals into obesity categories based on their demographic data, eating habits, and physical activity levels. The dataset consists of 2,111 observations and 17 columns (16 predictors plus the target), capturing both behavioral and physiological variables.

The target variable, NObeyesdad, represents seven distinct obesity classes:

Insufficient Weight
Normal Weight
Overweight Level I
Overweight Level II
Obesity Type I
Obesity Type II
Obesity Type III

Dataset Source: UCI Machine Learning Repository

📊 Dataset Summary

📌 Rows: 2,111
📌 Features: 16 predictors + 1 target (NObeyesdad)
📌 Composition: 77% synthetic records (generated with SMOTE), 23% real survey responses
📌 Goal: Predict obesity level using lifestyle and physiological variables

📁 Feature Categories

🥗 Eating Habits:
- FAVC (Frequent consumption of high-calorie food)
- FCVC (Frequency of vegetable consumption)
- NCP (Number of daily meals)
- CAEC (Food consumption between meals)
- CH2O (Daily water intake)
- CALC (Alcohol consumption)
🏃‍♂️ Physical Activity & Lifestyle:
- FAF (Physical activity frequency)
- SCC (Calorie consumption monitoring)
- TUE (Technology usage per day)
- MTRANS (Transportation mode)
- SMOKE (Smoking habit)
👤 Demographics:
- Gender
- Age
- Height
- Weight
- family_history_with_overweight (Family history of overweight)
🎯 Target Variable:
- NObeyesdad (Multi-class label indicating obesity category)

Each row in the dataset corresponds to one individual, capturing their nutritional behavior, lifestyle factors, and physical condition. These features provide a comprehensive view of potential predictors for obesity and are used to train classification models.

📦 Import Required Packages and Dataset

%load_ext dotenv
%dotenv 

# Import standard libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import os
import sys

# Append the source directory (SRC_DIR from the .env file) to the Python path
sys.path.append(os.getenv('SRC_DIR'))

📥 Load the Dataset into a DataFrame

# Load the dataset
from DataManager import get_data

obesity_df = get_data('../data/ObesityDataSet_raw_and_data_sinthetic.csv')
obesity_df.shape

Output: (2111, 17)

🔍 Display Basic Information

obesity_df.head()

	Gender	Age	Height	Weight	family_history_with_overweight	FAVC	FCVC	NCP	CAEC	SMOKE	CH2O	SCC	FAF	TUE	CALC	MTRANS	NObeyesdad
0	Female	21.0	1.62	64.0	yes	no	2.0	3.0	Sometimes	no	2.0	no	0.0	1.0	no	Public_Transportation	Normal_Weight
1	Female	21.0	1.52	56.0	yes	no	3.0	3.0	Sometimes	yes	3.0	yes	3.0	0.0	Sometimes	Public_Transportation	Normal_Weight
2	Male	23.0	1.80	77.0	yes	no	2.0	3.0	Sometimes	no	2.0	no	2.0	1.0	Frequently	Public_Transportation	Normal_Weight
3	Male	27.0	1.80	87.0	no	no	3.0	3.0	Sometimes	no	2.0	no	2.0	0.0	Frequently	Walking	Overweight_Level_I
4	Male	22.0	1.78	89.8	no	no	2.0	1.0	Sometimes	no	2.0	no	0.0	0.0	Sometimes	Public_Transportation	Overweight_Level_II

📝 Rename Columns for Clarity

# Rename columns for improved readability
columnName = {
    'family_history_with_overweight': 'Family_History',
    'FAVC': 'High_Cal_Foods_Frequently',
    'FCVC': 'Freq_Veg',
    'NCP': 'Num_Meals',
    'CAEC': 'Snacking',
    'SMOKE': 'Smoke',
    'CH2O': 'Water_Intake',
    'SCC': 'Calorie_Monitoring',
    'FAF': 'Phys_Activity',
    'TUE': 'Tech_Use',
    'CALC': 'Freq_Alcohol',
    'MTRANS': 'Transportation',
    'NObeyesdad': 'Obesity_Level',
}


obesity_df_renamed = obesity_df.rename(columns=columnName)

ℹ️ Dataset Info After Renaming

obesity_df_renamed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Gender                     2111 non-null   object 
 1   Age                        2111 non-null   float64
 2   Height                     2111 non-null   float64
 3   Weight                     2111 non-null   float64
 4   Family_History             2111 non-null   object 
 5   High_Cal_Foods_Frequently  2111 non-null   object 
 6   Freq_Veg                   2111 non-null   float64
 7   Num_Meals                  2111 non-null   float64
 8   Snacking                   2111 non-null   object 
 9   Smoke                      2111 non-null   object 
 10  Water_Intake               2111 non-null   float64
 11  Calorie_Monitoring         2111 non-null   object 
 12  Phys_Activity              2111 non-null   float64
 13  Tech_Use                   2111 non-null   float64
 14  Freq_Alcohol               2111 non-null   object 
 15  Transportation             2111 non-null   object 
 16  Obesity_Level              2111 non-null   object 
dtypes: float64(8), object(9)
memory usage: 280.5+ KB

📊 Summary Statistics

obesity_df_renamed.describe()

	Age	Height	Weight	Freq_Veg	Num_Meals	Water_Intake	Phys_Activity	Tech_Use
count	2111.000000	2111.000000	2111.000000	2111.000000	2111.000000	2111.000000	2111.000000	2111.000000
mean	24.312600	1.701677	86.586058	2.419043	2.685628	2.008011	1.010298	0.657866
std	6.345968	0.093305	26.191172	0.533927	0.778039	0.612953	0.850592	0.608927
min	14.000000	1.450000	39.000000	1.000000	1.000000	1.000000	0.000000	0.000000
25%	19.947192	1.630000	65.473343	2.000000	2.658738	1.584812	0.124505	0.000000
50%	22.777890	1.700499	83.000000	2.385502	3.000000	2.000000	1.000000	0.625350
75%	26.000000	1.768464	107.430682	3.000000	3.000000	2.477420	1.666678	1.000000
max	61.000000	1.980000	173.000000	3.000000	4.000000	3.000000	3.000000	2.000000

🧹 Data Preprocessing

In this section, we perform essential data quality checks to prepare the dataset for analysis:

- Check for Missing or Null Values: confirm that no values are missing from the dataset.
- Check for Uniqueness in Categories: verify that the categorical variables contain consistent and non-redundant entries.
- Check for Duplicates: identify and remove duplicate records to ensure data integrity.
- Check for Outliers: detect outliers in the numerical features using statistical and visual methods, and assess their impact.
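
As a rough illustration of the first three checks, the sketch below runs them on a small made-up DataFrame. The `quality_report` helper and the toy values are hypothetical and are not part of the project code.

```python
import pandas as pd

# Toy frame standing in for the obesity data (values are made up for illustration).
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Male"],
    "Age": [21.0, 23.0, 23.0, None],
    "Snacking": ["Sometimes", "no", "no", "Always"],
})

def quality_report(df, categorical_columns):
    """Summarise missing values, category levels, and duplicate rows."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "category_levels": {
            col: sorted(df[col].dropna().unique()) for col in categorical_columns
        },
        "duplicate_rows": int(df.duplicated().sum()),
    }

report = quality_report(toy, ["Gender", "Snacking"])
print(report)
```

On the toy frame this reports one missing Age value and one duplicate row; the sections below apply the same checks to the real data.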

📌 Identify Numerical vs Categorical Columns

num_col = obesity_df_renamed.select_dtypes(include=['float64', 'int64']).columns
cat_col = obesity_df_renamed.select_dtypes(include=['object']).columns

✅ Missing or Null Values

# Find missing values
obesity_df_renamed.isna().sum()

Gender                       0
Age                          0
Height                       0
Weight                       0
Family_History               0
High_Cal_Foods_Frequently    0
Freq_Veg                     0
Num_Meals                    0
Snacking                     0
Smoke                        0
Water_Intake                 0
Calorie_Monitoring           0
Phys_Activity                0
Tech_Use                     0
Freq_Alcohol                 0
Transportation               0
Obesity_Level                0
dtype: int64

Observation:

✅ No missing or null values were found in the dataset. No imputation is required.

obesity_df_renamed.isnull().sum() # check for null values

Gender                       0
Age                          0
Height                       0
Weight                       0
Family_History               0
High_Cal_Foods_Frequently    0
Freq_Veg                     0
Num_Meals                    0
Snacking                     0
Smoke                        0
Water_Intake                 0
Calorie_Monitoring           0
Phys_Activity                0
Tech_Use                     0
Freq_Alcohol                 0
Transportation               0
Obesity_Level                0
dtype: int64

isnull() is an alias of isna(), so this second check confirms the same result: the data contains no null values.

🧾 Uniqueness Check for Categorical Variables

unique_values_per_column = {col: obesity_df_renamed[col].unique() for col in cat_col}

# Display unique values for each column
for col, unique_vals in unique_values_per_column.items():
    print(f"Unique values in '{col}': {unique_vals}")

Unique values in 'Gender': ['Female' 'Male']
Unique values in 'Family_History': ['yes' 'no']
Unique values in 'High_Cal_Foods_Frequently': ['no' 'yes']
Unique values in 'Snacking': ['Sometimes' 'Frequently' 'Always' 'no']
Unique values in 'Smoke': ['no' 'yes']
Unique values in 'Calorie_Monitoring': ['no' 'yes']
Unique values in 'Freq_Alcohol': ['no' 'Sometimes' 'Frequently' 'Always']
Unique values in 'Transportation': ['Public_Transportation' 'Walking' 'Automobile' 'Motorbike' 'Bike']
Unique values in 'Obesity_Level': ['Normal_Weight' 'Overweight_Level_I' 'Overweight_Level_II'
 'Obesity_Type_I' 'Insufficient_Weight' 'Obesity_Type_II'
 'Obesity_Type_III']

Observation:

✅ All categorical variables contain consistent and interpretable values. No cleaning required.

🔁 Duplicate Rows

obesity_df_renamed.duplicated().sum()

np.int64(24)

🟡 24 duplicate rows detected

# Show duplicated values if needed

#obesity_df_renamed[obesity_df_renamed.duplicated(keep=False)]

Remove duplicates

# remove duplicates from the dataset
from DataManager import drop_duplicates

data = drop_duplicates(obesity_df_renamed)
data.shape

Observation:

✅ The 24 duplicate rows were removed, leaving 2,087 rows.
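
The body of `DataManager.drop_duplicates` is not shown in this post. A minimal sketch of what such a helper might look like, assuming it simply wraps the pandas method and returns a new frame, is:

```python
import pandas as pd

def drop_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for DataManager.drop_duplicates.

    Keeps the first occurrence of each repeated row and returns a new frame
    with a fresh index, leaving the input untouched.
    """
    return df.drop_duplicates(keep="first").reset_index(drop=True)

# Made-up rows: the first two are identical.
toy = pd.DataFrame({"Age": [21.0, 21.0, 23.0], "Gender": ["Female", "Female", "Male"]})
deduped = drop_duplicates(toy)
print(len(toy), "->", len(deduped))
```

Returning a copy rather than modifying in place keeps the renamed frame available for comparison, which matches how the notebook keeps both `obesity_df_renamed` and `data`.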

🚨 Outlier Detection (IQR Method)

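def find_outliers_IQR(df, numeric_col):
    """Return the rows whose value in numeric_col lies outside the 1.5 * IQR fences."""
    q1 = df[numeric_col].quantile(0.25)
    q3 = df[numeric_col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[numeric_col] < lower) | (df[numeric_col] > upper)]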

numeric_columns = num_col
for column in numeric_columns:
    outliers = find_outliers_IQR(data, column)
    
    if outliers.empty:
        print(f"No outliers detected in column {column}")
    else:
        print(f"{len(outliers)} outliers detected in column {column}")
        print("Max outlier value:", str(outliers[column].max()))
        print("Min outlier value:", str(outliers[column].min()))

Observation:

✅ Detected outliers in:

Age: 167 values

Height: 1 value

Weight: 1 value

Num_Meals: 577 values
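
To see where these counts come from, the fences can be worked out by hand. The sketch below uses the quartiles from the summary statistics table above; those were computed before duplicate removal, so the fences on `data` may differ slightly. The `iqr_fences` helper is introduced here only for illustration.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences used by the IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# 25% and 75% quantiles taken from the describe() output shown earlier.
quartiles = {"Age": (19.947192, 26.0), "Num_Meals": (2.658738, 3.0)}

for name, (q1, q3) in quartiles.items():
    low, high = iqr_fences(q1, q3)
    print(f"{name}: values below {low:.2f} or above {high:.2f} are flagged")
```

For Num_Meals the fences land at roughly 2.15 and 3.51, so meal counts of 1, 2, and 4 all fall outside them. That is why so many rows are flagged even though the values themselves are perfectly plausible.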

📦 Boxplots for Visual Outlier Analysis

We visualize only the columns in which outliers were detected.

for column in num_col:
    outliers = find_outliers_IQR(data, column)
    if not outliers.empty:  # Only plot if outliers are present
        plt.figure(figsize=(8, 6))
        plt.boxplot(data[column].dropna(), vert=False, patch_artist=True)
        plt.title(f"Box Plot for {column} (Outliers Detected)")
        plt.xlabel(column)
        plt.grid(True)
        plt.show()

[Figure: box plots of the numerical columns with detected outliers]

Outlier Interpretation

Age is right-skewed. We’ll retain outliers and revisit during feature engineering.

Num_Meals behaves like a categorical variable. We’ll retain all values as they may be informative.

Height and Weight outliers are minimal and reasonable, so we retain them.

📊 Distribution of Age

# Age distribution across different obesity levels (NObeyesdad)

plt.figure(figsize=(10,8))
nobeyesdad_order = ['Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I','Overweight_Level_II','Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III']  # Adjust according to your dataset
sns.boxplot(x='Obesity_Level', y='Age', data=data, palette='Set3', order=nobeyesdad_order)
plt.title('Age Distribution across Obesity Levels (Obesity_Level)')
plt.xlabel('Obesity Level (Obesity_Level)')
plt.ylabel('Age')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

[Figure: Age distribution across obesity levels (box plot)]

The box plot above and the histogram below show that Age is right-skewed: the dataset predominantly represents younger age groups. This could bias the model's decisions. For now, we retain the outliers and observe how the skew in Age affects model performance; depending on the results, we will decide during the feature engineering phase whether to keep or remove them.

sns.histplot(data['Age'], kde=True, bins=20, color='skyblue')
plt.title('Age Distribution')

[Figure: histogram of Age with KDE overlay]

Observation:

✅ Age values range from 14 to 61. We’ll retain all entries for now.

📈 Target Class Distribution

# Define the order of obesity levels based on their frequency
nobeyesdad_order = data['Obesity_Level'].value_counts().index

plt.figure(figsize=(10,8))
sns.countplot(x='Obesity_Level', data=data, palette='Set3', order=nobeyesdad_order)
plt.title('Distribution of Obesity Levels (Obesity_Level)')
plt.xlabel('Obesity Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

[Figure: count plot of obesity levels]

Observation:

✅ Obesity Type I is the most frequent category.

🔍 Pairplot of Numerical Features

sns.pairplot(data[['Age', 'Height', 'Weight', 'Freq_Veg', 'Water_Intake', 'Phys_Activity', 'Obesity_Level']], hue='Obesity_Level', palette='Set3')
plt.show()

[Figure: pairplot of numerical features colored by obesity level]

Observation:

✅ Height and Weight show strong correlation. Consider engineering BMI.

Other features, such as Frequency of Vegetable Consumption (Freq_Veg), Water Intake (Water_Intake), and Physical Activity Frequency (Phys_Activity), do not exhibit a clear linear relationship with other variables. However, their values cluster around a few discrete levels, which suggests they behave more like categorical features despite being stored as floats. This should be taken into account in the preprocessing and modeling stages so that these features are treated appropriately.
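
One simple way to quantify this clustering is to measure how many values in a column are whole numbers. The helper below is a hypothetical sketch, and the sample values are made up to resemble a survey-scale column with a few interpolated entries.

```python
def share_of_whole_numbers(values, tol=1e-9):
    """Return the fraction of values that are (numerically) whole numbers."""
    values = list(values)
    whole = sum(1 for v in values if abs(v - round(v)) <= tol)
    return whole / len(values)

# Made-up sample: mostly whole-number answers plus one interpolated value.
sample = [2.0, 3.0, 2.0, 3.0, 2.385502, 1.0]
share = share_of_whole_numbers(sample)
print(f"{share:.0%} of the sample values are whole numbers")
```

Applied to a real column, for example `share_of_whole_numbers(data['Freq_Veg'])`, a high share would support treating the feature as ordinal or categorical.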

📊 Categorical Feature Distribution by Target

# Plot categorical variables against Obesity_Level
binary_palette = {"yes": "#8dd3c7", "no": "#fb8072"}
# Define the desired order of obesity levels
obesity_level_order = [
    "Insufficient_Weight", 
    "Normal_Weight", 
    "Overweight_Level_I", 
    "Overweight_Level_II", 
    "Obesity_Type_I", 
    "Obesity_Type_II", 
    "Obesity_Type_III"
]
# Plot every categorical predictor (the target itself is excluded)
plot_cols = [col for col in cat_col if col != "Obesity_Level"]
plt.figure(figsize=(12, 8 * len(plot_cols)))
for counter, var in enumerate(plot_cols, start=1):
    plt.subplot(len(plot_cols), 1, counter)
    plt.title(f"{var} vs Obesity_Level")

    # Binary yes/no variables get the two-color palette; the rest use Set3
    if var in ["Family_History", "High_Cal_Foods_Frequently", "Smoke", "Calorie_Monitoring"]:
        palette = binary_palette
    else:
        palette = "Set3"

    # Use the de-duplicated frame so the counts match the rest of the analysis
    sns.countplot(
        x="Obesity_Level",
        hue=var,
        data=data,
        edgecolor="black",
        palette=palette,
        order=obesity_level_order
    )
plt.tight_layout()
plt.show()

[Figure: count plots of each categorical variable vs. Obesity_Level]

🧠 Insights from Categorical Analysis

Gender: Slight imbalance in Obesity II & III (more males).
Family_History: Clear link with higher obesity.
FAVC: High-calorie food consumption increases with obesity.
Snacking: “Sometimes” snacking is more frequent in overweight/obese groups.
SMOKE: No significant pattern with obesity.
SCC (Calorie Monitoring): Lower monitoring in higher obesity levels.
CALC (Alcohol): No strong pattern observed.
MTRANS: Walkers tend to be in lower obesity levels.

🔗 Correlation Matrix of Numerical Features

We analyze the correlation between numerical features using the Pearson correlation coefficient.

# Create a heatmap for the correlation matrix using only numerical columns
numerical_columns = obesity_df_renamed.select_dtypes(include=['float64', 'int64']).columns
numerical_correlation_matrix = obesity_df_renamed[numerical_columns].corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(numerical_correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Variables')
plt.show()

[Figure: correlation matrix heatmap of numerical variables]

Observation:

✅ Strong positive correlation observed between Height and Weight. 💡 We will define a new feature: BMI = Weight / Height² in the next section.
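
For readers who want to see what `DataFrame.corr()` computes by default, here is a small hand-rolled Pearson coefficient. The `pearson_r` helper is illustrative only, and the inputs are the Height and Weight values of the five rows printed by `obesity_df.head()`, so the result says nothing about the full dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Height (m) and Weight (kg) from the first five rows shown earlier.
heights = [1.62, 1.52, 1.80, 1.80, 1.78]
weights = [64.0, 56.0, 77.0, 87.0, 89.8]

r = pearson_r(heights, weights)
print(f"Pearson r on the five sample rows: {r:.2f}")
```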

📐 Feature Engineering: Body Mass Index (BMI) Calculation

We introduce a new feature BMI (Body Mass Index) to better capture the interaction between weight and height, and explore its impact on obesity classification.

# Define new BMI feature
data['BMI'] = data['Weight'] / (data['Height'] ** 2)
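
As a quick sanity check of the formula, the sketch below computes BMI for the first row of the dataset (Weight 64.0, Height 1.62). It assumes Weight is recorded in kilograms and Height in metres, which the value ranges suggest; the `bmi` function is introduced here only for illustration.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# First row of the dataset: Weight = 64.0, Height = 1.62, labelled Normal_Weight.
first_row_bmi = bmi(64.0, 1.62)
print(round(first_row_bmi, 2))  # 24.39
```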

📊 Exploratory Analysis on BMI

📈 BMI Distribution Across Obesity Levels

plt.figure(figsize=(10,8))
nobeyesdad_order = ['Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I','Overweight_Level_II','Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III']  # Adjust according to your dataset
sns.boxplot(x='Obesity_Level', y='BMI', data=data, palette='Set3', order=nobeyesdad_order)
plt.title('BMI Distribution across Obesity Levels (Obesity_Level)')
plt.xlabel('Obesity Level (Obesity_Level)')
plt.ylabel('BMI')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

[Figure: BMI distribution across obesity levels (box plot)]

Observation:

✅ BMI increases consistently with obesity severity. Variability also rises at higher obesity levels, while normal and insufficient weight groups are more tightly distributed.

📈 BMI Distribution by Gender

# BMI distribution by gender

plt.figure(figsize=(10,8))
sns.boxplot(x='Gender', y='BMI', data=data, palette='Set3')
plt.title('BMI Distribution by Gender')
plt.xlabel('Gender')
plt.ylabel('BMI')
plt.tight_layout()
plt.show()

[Figure: BMI distribution by gender (box plot)]

Observation:

✅ The plot shows that females, on average, tend to have a higher and more varied BMI compared to males. The broader spread of BMI values in females suggests more diversity in body composition within this group.

plt.figure(figsize=(15,6))
ax = sns.countplot(x = "Gender", hue = "Obesity_Level",hue_order=nobeyesdad_order, data = data, palette='Set3')

for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'),
               (p.get_x() + p.get_width() / 2, p.get_height()),
               ha = 'center', va = 'center',
               xytext = (0,10),
               textcoords = 'offset points')

plt.title("Distribution of Obesity_Level across Gender")
plt.xlabel('')
plt.ylabel('')

plt.show()

[Figure: distribution of Obesity_Level across gender (count plot)]

Observation:

✅ Females exhibit a wider and higher range of BMI values than males, indicating greater variability in body composition.

📈 BMI Distribution by Age

plt.figure(figsize=(10,8))
sns.scatterplot(x='Age', y='BMI', data=data, hue_order= nobeyesdad_order,hue='Obesity_Level', palette='Set3')

plt.title('BMI Distribution by Age')
plt.xlabel('Age')
plt.ylabel('BMI')
plt.tight_layout()
plt.show()

[Figure: BMI vs. Age scatter plot colored by obesity level]

Observation:

✅ Obesity type III is more commonly observed in individuals under the age of 30.

# Mean age per obesity level, sorted from oldest to youngest
age_mean = data.groupby('Obesity_Level')['Age'].mean().reset_index()
age_mean_sorted = age_mean.sort_values(by='Age', ascending=False)
level_order = age_mean_sorted['Obesity_Level']

fig, axes = plt.subplots(1, 2, figsize=(15, 6))
sns.barplot(y='Obesity_Level', x='Age', ax=axes[0], data=age_mean, order=level_order, palette='Set3')

# Annotate each bar with its mean age
for index, age in enumerate(age_mean_sorted['Age']):
    axes[0].text(age, index, f" {age:.2f}", va='center', ha='left', fontsize=10)

axes[0].set_xlabel('Age')
axes[0].set_ylabel('')
axes[0].tick_params(axis='x', rotation=45)

sns.violinplot(y="Obesity_Level", x="Age", data=data, ax=axes[1], order=level_order, palette='Set3')
axes[1].set_ylabel('')
axes[1].set_yticklabels([])

fig.suptitle('Average Age by Obesity_Level', fontsize=16)
plt.show()

[Figure: average Age by obesity level (bar plot and violin plot)]

Observation:

✅ On average, people in the Normal Weight and Insufficient Weight groups appear to be younger than the rest.

📈 BMI Distribution by Family History with Overweight

# Plot BMI distribution by Family History with Overweight across Obesity Levels
binary_palette = {"yes": "#8dd3c7", "no": "#fb8072"}

plt.figure(figsize=(12, 8))
sns.boxplot(
    x='Obesity_Level',
    y='BMI',
    hue ='Family_History',
    data=data,
    palette= binary_palette,
    order=obesity_level_order
)
plt.title("BMI Distribution by Family History with Overweight Across Obesity Levels", fontsize=16)
plt.xlabel("Obesity Level")
plt.ylabel("BMI")
plt.xticks(rotation=45)
plt.legend(title="Family History", loc='upper left')
plt.tight_layout()
plt.show()

[Figure: BMI distribution by family history across obesity levels]

Observation:

✅ Individuals with higher BMI more often report a family history of overweight.

✅ No cases of Obesity Type III were observed in individuals without a family history of overweight, which suggests that family history is strongly associated with Obesity Type III in this dataset.

📈 BMI Distribution by Frequent Consumption of High-Calorie Food (FAVC)

# Plot BMI distribution by frequent high-calorie food consumption across Obesity Levels
binary_palette = {"yes": "#8dd3c7", "no": "#fb8072"}

plt.figure(figsize=(12, 8))
sns.boxplot(
    x='Obesity_Level',
    y='BMI',
    hue ='High_Cal_Foods_Frequently',
    data=data,
    palette= binary_palette,
    order=obesity_level_order
)
plt.title("BMI Distribution by Frequent Consumption of High-Calorie Food (FAVC)", fontsize=16)
plt.xlabel("Obesity Level")
plt.ylabel("BMI")
plt.xticks(rotation=45)
plt.legend(title="High_Cal_Foods_Frequently", loc='upper left')
plt.tight_layout()
plt.show()

[Figure: BMI distribution by frequent high-calorie food consumption]

Observation:

✅ There is only one case of Obesity Type III where frequent consumption of high-calorie food is not present, suggesting that frequent consumption of high-calorie food likely plays a key role in the development of Obesity Type III. For other types of obesity and normal weight individuals, the distribution of high-calorie food consumption appears to be similar across all levels.

📈 BMI Distribution by Number of Meals (NCP)

# BMI distribution by Number of Meals (NCP) across different obesity levels (NObeyesdad)

plt.figure(figsize=(10,8))
sns.scatterplot(x='Num_Meals', y='BMI', data=data, hue='Obesity_Level', palette='Set3')

plt.title('BMI Distribution by Number of Meals (NCP)')
plt.xlabel('Number of Meals (NCP)')
plt.ylabel('BMI')
plt.tight_layout()
plt.show()

[Figure: BMI vs. number of meals scatter plot]

Observation:

✅ Individuals with Obesity Type III mostly report 3 meals per day, whereas normal-weight individuals report 1, 3, or 4 meals.

📈 BMI Distribution by Smoking Habit (SMOKE)

# Plot BMI distribution by Smoking across Obesity Levels
binary_palette = {"yes": "#8dd3c7", "no": "#fb8072"}

plt.figure(figsize=(12, 8))
sns.boxplot(
    x='Obesity_Level',
    y='BMI',
    hue ='Smoke',
    data=data,
    palette= binary_palette,
    order=obesity_level_order
)
plt.title("BMI Distribution by Smoking Habit ", fontsize=16)
plt.xlabel("Obesity Level")
plt.ylabel("BMI")
plt.xticks(rotation=45)
plt.legend(title="Smoke", loc='upper left')
plt.tight_layout()
plt.show()

[Figure: BMI distribution by smoking habit]

Observation:

✅ There is insufficient evidence to conclude that smoking has a significant effect on obesity levels. The primary observation is that individuals in Obesity Type III are predominantly non-smokers.

def show_pie_chart(df, column_name):
    """Show a pie chart and a count plot of one categorical column side by side."""
    # Frequency of each value, sorted in descending order
    counts_sorted = df[column_name].value_counts().sort_values(ascending=False)

    fig, axes = plt.subplots(1, 2, figsize=(16, 8))
    fig.suptitle(column_name)
    plt.subplots_adjust(wspace=0.5)

    # Use the same sorted series for values and labels so the slices stay aligned
    axes[0].pie(counts_sorted.values, labels=counts_sorted.index, autopct='%1.1f%%',
                startangle=140, colors=sns.color_palette("pastel"))

    sns.countplot(data=df, y=column_name, ax=axes[1], order=counts_sorted.index,
                  palette="pastel", hue=column_name, legend=False, width=0.8)
    axes[1].set_ylabel('')
    axes[1].set_xlabel('')

    # Annotate each bar with its count
    for i, v in enumerate(counts_sorted.values):
        axes[1].text(v + 0.1, i, str(v), ha='left', va='center', color='black', fontweight='bold')
    plt.show()

📈 Distribution of Transportation Mode (MTRANS)

# Create pie chart for Transportation (MTRANS)
show_pie_chart(data, 'Transportation')

[Figure: pie chart and count plot of Transportation]

Observation:

✅ Roughly 97% use some form of vehicle, while only ~2.7% prefer walking or cycling. That’s concerning!

📦 Export Cleaned Dataset

data.to_csv('clean_data.csv', index=False)

📌 Summary

Obesity Type I has the highest number of individuals.
Most individuals have a family history of obesity.
~2.7% prefer walking/biking; the rest use vehicles — a concerning imbalance.
More females are classified as obese compared to males.
Strong positive correlation observed between Weight and Height.
Outliers present in Age distribution.

🤖 Obesity Estimation - Feature Engineering & ML Models

We apply multiple classification models to identify the best-performing model for predicting obesity levels.

🧠 Models Considered

Decision Tree
Random Forest
K-Nearest Neighbors (KNN)
XGBoost (XGBClassifier)

🧭 Workflow Steps

Train-Test Split (Stratified)
One-Hot Encoding (categorical features)
Standard Scaling (numerical features)
Label Encoding (target variable)
GridSearchCV + 5-Fold CV for performance tuning
Feature Importance Analysis
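
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's actual code: the toy columns, the label rule, and the tiny parameter grid are all made up, and the preprocessing is wrapped in a scikit-learn pipeline so that it is refit inside each cross-validation fold, which is a design choice of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: two numerical columns and one categorical column.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Weight": rng.normal(80, 20, n),
    "Height": rng.normal(1.7, 0.1, n),
    "Smoke": rng.choice(["yes", "no"], n),
})
# Made-up three-class target derived from Weight, purely for illustration.
labels = pd.cut(toy["Weight"], bins=[-np.inf, 70, 90, np.inf],
                labels=["low", "mid", "high"]).astype(str)

# 1. Stratified train-test split
X_train, X_test, y_train_raw, y_test_raw = train_test_split(
    toy, labels, stratify=labels, test_size=0.2, random_state=42)

# 2. Label-encode the target (fit on the training labels only)
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train_raw)
y_test = encoder.transform(y_test_raw)

# 3. Scale numerical columns and one-hot encode categorical ones
preprocess = make_column_transformer(
    (StandardScaler(), ["Weight", "Height"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Smoke"]),
)

# 4. Grid search with 5-fold stratified cross-validation
pipeline = make_pipeline(preprocess, DecisionTreeClassifier(random_state=42))
param_grid = {"decisiontreeclassifier__max_depth": [2, 4, 8]}
search = GridSearchCV(
    pipeline, param_grid, scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Swapping `DecisionTreeClassifier` for another estimator only requires changing the final pipeline step and the prefix of the parameter names in `param_grid`.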

🔍 Load and Preprocess Data

import pandas as pd

clean_data_df = pd.read_csv(r'../data/clean_data.csv')

clean_data_df.info()
clean_data_df.drop('BMI', axis='columns', inplace=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087 entries, 0 to 2086
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Gender                     2087 non-null   object 
 1   Age                        2087 non-null   float64
 2   Height                     2087 non-null   float64
 3   Weight                     2087 non-null   float64
 4   Family_History             2087 non-null   object 
 5   High_Cal_Foods_Frequently  2087 non-null   object 
 6   Freq_Veg                   2087 non-null   float64
 7   Num_Meals                  2087 non-null   float64
 8   Snacking                   2087 non-null   object 
 9   Smoke                      2087 non-null   object 
 10  Water_Intake               2087 non-null   float64
 11  Calorie_Monitoring         2087 non-null   object 
 12  Phys_Activity              2087 non-null   float64
 13  Tech_Use                   2087 non-null   float64
 14  Freq_Alcohol               2087 non-null   object 
 15  Transportation             2087 non-null   object 
 16  Obesity_Level              2087 non-null   object 
 17  BMI                        2087 non-null   float64
dtypes: float64(9), object(9)
memory usage: 293.6+ KB

🧾 Identify Feature Types

cat_cols = ['Gender', 'Family_History', 'High_Cal_Foods_Frequently', 'Snacking','Smoke', 'Calorie_Monitoring', 'Freq_Alcohol', 'Transportation']

num_cols = ['Age', 'Height', 'Weight', 'Freq_Veg', 'Num_Meals','Water_Intake', 'Phys_Activity', 'Tech_Use']

🎯 Define Feature Matrix `X` and Target `y`

X = clean_data_df.drop('Obesity_Level',axis=1)  
y = clean_data_df['Obesity_Level'] 

X.shape, y.shape

((2087, 16), (2087,))

🔄 Stratified Train-Test Split

A random split can leave the training and test sets with different class proportions, especially on small datasets. Stratified splitting avoids this: the split is generated so that each subset keeps approximately the same percentage of every target class as the full dataset.

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,stratify=y,test_size=0.2,random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1669, 16), (418, 16), (1669,), (418,))
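
The effect of `stratify` is easiest to see on a deliberately imbalanced toy target. In the sketch below the labels and their 80/20 ratio are made up for illustration; only the class names are borrowed from the dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy target with an 80/20 class imbalance (100 samples in total).
y = ["Normal_Weight"] * 80 + ["Obesity_Type_I"] * 20
X = list(range(len(y)))  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

Both subsets keep the 80/20 ratio of the full target, which is the property the stratified split above relies on.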

⚙️ Preprocessing: One-Hot Encoding + Standard Scaling

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

Sscaler = StandardScaler().set_output(transform="pandas")

# Scale the numerical columns and one-hot encode the categorical ones
transformer = make_column_transformer(
    (Sscaler, num_cols),
    (OneHotEncoder(handle_unknown='ignore'), cat_cols),
    verbose=True,
    verbose_feature_names_out=True,
    remainder='drop',
)

transformer

ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                 ['Age', 'Height', 'Weight', 'Freq_Veg',
                                  'Num_Meals', 'Water_Intake', 'Phys_Activity',
                                  'Tech_Use']),
                                ('onehotencoder',
                                 OneHotEncoder(handle_unknown='ignore'),
                                 ['Gender', 'Family_History',
                                  'High_Cal_Foods_Frequently', 'Snacking',
                                  'Smoke', 'Calorie_Monitoring', 'Freq_Alcohol',
                                  'Transportation'])],
                  verbose=True)

🔄 X_train Encoding

# Fit the transformer on the training set only, then transform it
transformed = transformer.fit_transform(X_train)
# Wrap the result in a DataFrame with the generated feature names
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# The transformed array has lost the original index, so restore it
transformed_df.index = X_train.index
# Join the transformed columns onto the original frame
X_train = pd.concat([X_train, transformed_df], axis=1)
# Drop the original categorical columns
X_train.drop(cat_cols, axis=1, inplace=True)
# Drop the original numerical columns
X_train.drop(num_cols, axis=1, inplace=True)
# Check the result
print(X_train.columns)
print(X_train.shape)

[ColumnTransformer]  (1 of 2) Processing standardscaler, total=   0.0s
[ColumnTransformer] . (2 of 2) Processing onehotencoder, total=   0.0s
Index(['standardscaler__Age', 'standardscaler__Height',
       'standardscaler__Weight', 'standardscaler__Freq_Veg',
       'standardscaler__Num_Meals', 'standardscaler__Water_Intake',
       'standardscaler__Phys_Activity', 'standardscaler__Tech_Use',
       'onehotencoder__Gender_Female', 'onehotencoder__Gender_Male',
       'onehotencoder__Family_History_no', 'onehotencoder__Family_History_yes',
       'onehotencoder__High_Cal_Foods_Frequently_no',
       'onehotencoder__High_Cal_Foods_Frequently_yes',
       'onehotencoder__Snacking_Always', 'onehotencoder__Snacking_Frequently',
       'onehotencoder__Snacking_Sometimes', 'onehotencoder__Snacking_no',
       'onehotencoder__Smoke_no', 'onehotencoder__Smoke_yes',
       'onehotencoder__Calorie_Monitoring_no',
       'onehotencoder__Calorie_Monitoring_yes',
       'onehotencoder__Freq_Alcohol_Always',
       'onehotencoder__Freq_Alcohol_Frequently',
       'onehotencoder__Freq_Alcohol_Sometimes',
       'onehotencoder__Freq_Alcohol_no',
       'onehotencoder__Transportation_Automobile',
       'onehotencoder__Transportation_Bike',
       'onehotencoder__Transportation_Motorbike',
       'onehotencoder__Transportation_Public_Transportation',
       'onehotencoder__Transportation_Walking'],
      dtype='object')
(1669, 31)

🔄 X_test Encoding

# Transform the test set with the transformer fitted on the training set
transformed = transformer.transform(X_test)
# Wrap the result in a DataFrame with the generated feature names
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# The transformed array has lost the original index, so restore it
transformed_df.index = X_test.index
# Join the transformed columns onto the original frame
X_test = pd.concat([X_test, transformed_df], axis=1)
# Drop the original categorical columns
X_test.drop(cat_cols, axis=1, inplace=True)
# Drop the original numerical columns
X_test.drop(num_cols, axis=1, inplace=True)
# Check the result
X_test.head()

	standardscaler__Age	standardscaler__Height	standardscaler__Weight	standardscaler__Freq_Veg	standardscaler__Num_Meals	standardscaler__Water_Intake	standardscaler__Phys_Activity	standardscaler__Tech_Use	onehotencoder__Gender_Female	onehotencoder__Gender_Male	...	onehotencoder__Calorie_Monitoring_yes	onehotencoder__Freq_Alcohol_Sometimes	onehotencoder__Freq_Alcohol_no	onehotencoder__Transportation_Automobile	onehotencoder__Transportation_Public_Transportation
1153	-0.690071	-1.202029	-0.537291	1.084280	1.512030	-0.000656	0.376859	0.550747	1.0	0.0	...	0.0	1.0	0.0	0.0	1.0
132	0.899810	0.736866	0.848928	1.084280	0.391801	-1.644954	1.182627	-1.095082	0.0	1.0	...	0.0	1.0	0.0	1.0	0.0
1923	-0.587828	0.401573	1.579131	1.084280	0.391801	-0.334381	0.499952	0.495086	1.0	0.0	...	0.0	1.0	0.0	0.0	1.0
846	-1.165664	-1.047678	-0.831968	1.008997	-2.224232	-0.000656	-0.324357	1.117030	1.0	0.0	...	1.0	0.0	1.0	0.0	1.0
1246	0.821665	1.347473	0.839435	-0.801583	0.231677	0.588066	-0.155934	1.895629	0.0	1.0	...	0.0	1.0	0.0	1.0	0.0

5 rows × 31 columns

print(X_train.columns, X_train.shape)

Index(['standardscaler__Age', 'standardscaler__Height',
       'standardscaler__Weight', 'standardscaler__Freq_Veg',
       'standardscaler__Num_Meals', 'standardscaler__Water_Intake',
       'standardscaler__Phys_Activity', 'standardscaler__Tech_Use',
       'onehotencoder__Gender_Female', 'onehotencoder__Gender_Male',
       'onehotencoder__Family_History_no', 'onehotencoder__Family_History_yes',
       'onehotencoder__High_Cal_Foods_Frequently_no',
       'onehotencoder__High_Cal_Foods_Frequently_yes',
       'onehotencoder__Snacking_Always', 'onehotencoder__Snacking_Frequently',
       'onehotencoder__Snacking_Sometimes', 'onehotencoder__Snacking_no',
       'onehotencoder__Smoke_no', 'onehotencoder__Smoke_yes',
       'onehotencoder__Calorie_Monitoring_no',
       'onehotencoder__Calorie_Monitoring_yes',
       'onehotencoder__Freq_Alcohol_Always',
       'onehotencoder__Freq_Alcohol_Frequently',
       'onehotencoder__Freq_Alcohol_Sometimes',
       'onehotencoder__Freq_Alcohol_no',
       'onehotencoder__Transportation_Automobile',
       'onehotencoder__Transportation_Bike',
       'onehotencoder__Transportation_Motorbike',
       'onehotencoder__Transportation_Public_Transportation',
       'onehotencoder__Transportation_Walking'],
      dtype='object') (1669, 31)

# Setting new feature names

X_train.columns = ['Age', 'Height', 'Weight', 'Freq_Veg', 'Num_Meals', 'Water_Intake',
       'Phys_Activity', 'Tech_Use', 'Gender_Female',
       'Gender_Male', 'Family_History_no',
       'Family_History_yes',
       'High_Cal_Foods_Frequently_no',
       'High_Cal_Foods_Frequently_yes',
       'Snacking_Always', 'Snacking_Frequently',
       'Snacking_Sometimes', 'Snacking_no',
       'Smoke_no', 'Smoke_yes',
       'Calorie_Monitoring_no',
       'Calorie_Monitoring_yes',
       'Freq_Alcohol_Always',
       'Freq_Alcohol_Frequently',
       'Freq_Alcohol_Sometimes',
       'Freq_Alcohol_no',
       'Transportation_Automobile',
       'Transportation_Bike',
       'Transportation_Motorbike',
       'Transportation_Public_Transportation',
       'Transportation_Walking']

X_test.columns = ['Age', 'Height', 'Weight', 'Freq_Veg', 'Num_Meals', 'Water_Intake',
       'Phys_Activity', 'Tech_Use', 'Gender_Female',
       'Gender_Male', 'Family_History_no',
       'Family_History_yes',
       'High_Cal_Foods_Frequently_no',
       'High_Cal_Foods_Frequently_yes',
       'Snacking_Always', 'Snacking_Frequently',
       'Snacking_Sometimes', 'Snacking_no',
       'Smoke_no', 'Smoke_yes',
       'Calorie_Monitoring_no',
       'Calorie_Monitoring_yes',
       'Freq_Alcohol_Always',
       'Freq_Alcohol_Frequently',
       'Freq_Alcohol_Sometimes',
       'Freq_Alcohol_no',
       'Transportation_Automobile',
       'Transportation_Bike',
       'Transportation_Motorbike',
       'Transportation_Public_Transportation',
       'Transportation_Walking']
# After renaming the columns

X_train.head()

	Age	Height	Weight	Freq_Veg	Num_Meals	Water_Intake	Phys_Activity	Tech_Use	Gender_Female	Gender_Male	...	Freq_Alcohol_Sometimes	Freq_Alcohol_no	Transportation_Automobile	Transportation_Public_Transportation
1549	0.248290	0.705537	1.045026	-0.375885	0.391801	0.253102	0.308538	-0.452916	0.0	1.0	...	1.0	0.0	0.0	1.0
1574	1.008501	-0.608067	0.505277	1.016483	-0.755149	-1.644954	0.823971	-0.073903	0.0	1.0	...	0.0	1.0	0.0	1.0
1155	3.702329	0.457500	-0.078269	0.207948	0.391801	-1.403908	-0.827757	-1.095082	0.0	1.0	...	0.0	1.0	1.0	0.0
610	-0.208354	0.416482	-1.265285	-0.345947	0.391801	-0.252848	1.731352	0.245475	1.0	0.0	...	0.0	1.0	0.0	1.0
906	-0.523216	-1.012212	-0.725998	-0.801583	0.596256	1.643643	0.204484	-0.952112	1.0	0.0	...	1.0	0.0	0.0	1.0

5 rows × 31 columns

🧬 Apply Label Encoder

🔡 Encoding y_train

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)

#y_train.head(10) , y_train_encoded[0:10]

🔡 Encoding y_test

# Reuse the encoder fitted on y_train so both splits share one label mapping
y_test_encoded = le.transform(y_test)
y_test.head(10) , y_test_encoded[0:10]

(1153    Overweight_Level_II
        Obesity_Type_I
     Obesity_Type_III
    Overweight_Level_I
       Obesity_Type_I
       Obesity_Type_I
       Obesity_Type_I
  Overweight_Level_II
      Obesity_Type_III
      Obesity_Type_II
 Name: Obesity_Level, dtype: object,
 array([6, 2, 4, 5, 2, 2, 2, 6, 4, 3]))
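LabelEncoder assigns integer codes in alphabetical order of the class names, which is why Obesity_Type_I maps to 2 and Overweight_Level_II to 6 in the output above. An encoder fitted on the training labels can encode any other split with the same mapping, whereas a second encoder fitted on a split that is missing a class would assign different codes. A minimal sketch with made-up labels (the variable names are ours):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, not the project's actual split
train_labels = ["Normal_Weight", "Obesity_Type_I", "Overweight_Level_II", "Insufficient_Weight"]
test_labels = ["Obesity_Type_I", "Normal_Weight"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_labels)  # learn the mapping on train only
test_codes = encoder.transform(test_labels)        # reuse it unchanged on test

# classes_ is sorted alphabetically; the position of a name is its integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# Decode integer codes back to readable class names
decoded = encoder.inverse_transform(test_codes)
print(list(decoded))
```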

🤖 Classifier Models

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

🧰 Models in Consideration

models={'RandomForest':RandomForestClassifier(),
        'DecisionTree':DecisionTreeClassifier(),
        'KNeighbors':KNeighborsClassifier(),
        'xgbc': XGBClassifier()}

🧪 Scoring for Measuring Model Performance

from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
scoring = {
    'accuracy': 'accuracy',
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted')}

🧾 Model Evaluation

# %load ../models/Models_eval.py

from sklearn.model_selection import GridSearchCV

def grid_search_cv_eval(X,Y, models, param_grid, scorings,cross_validation):
    """Run GridSearchCV for each model and return the best estimators and their scores."""

    best_models = {}
    result = {}
    print(models)
    for model in models:
        print(result)
        print(f"\nRunning GridSearch for {model}...")
        gsv = GridSearchCV(
            estimator=models[model],
            param_grid=param_grid[model],
            cv=cross_validation,
            scoring=scorings,
            refit='accuracy'  # Primary metric for model selection
        )
        gsv.fit(X, Y)
        best_models[model] = gsv.best_estimator_
        best_index = gsv.best_index_
        print(f'Best parameters for {model}: {gsv.best_params_}')
        print(f'Best accuracy: {gsv.cv_results_["mean_test_accuracy"][best_index]:.4f}')
        print(f'Best precision: {gsv.cv_results_["mean_test_precision"][best_index]:.4f}')
        print(f'Best recall: {gsv.cv_results_["mean_test_recall"][best_index]:.4f}')
        result[model] = {"parameter":gsv.best_params_,"accuracy":gsv.cv_results_["mean_test_accuracy"][best_index], "precision": gsv.cv_results_["mean_test_precision"][best_index],"recall": gsv.cv_results_["mean_test_recall"][best_index]}

    return best_models, result


from sklearn.model_selection import cross_val_score

# Define the function to compare models with default parameters
def evaluate_models(X, Y, models, scorings, cross_validation):
    result = {}
    
    # Loop through models and evaluate each one
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        
        # Evaluate using the defined scoring metrics
        model_scores = {}
        for score_name, scorer in scorings.items():
            score = cross_val_score(model, X, Y, cv=cross_validation, scoring=scorer)
            model_scores[score_name] = score.mean()
        
        # Store results
        result[model_name] = model_scores
        
        # Print the results
        print(f"Accuracy: {model_scores['accuracy']:.4f}")
        print(f"Precision: {model_scores['precision']:.4f}")
        print(f"Recall: {model_scores['recall']:.4f}")
        print(f"F1 Score: {model_scores['f1']:.4f}")
    
    return result

result = evaluate_models(X_train, y_train_encoded, models, scoring, 5)

Evaluating RandomForest...
Accuracy: 0.9323
Precision: 0.9389
Recall: 0.9341
F1 Score: 0.9352

Evaluating DecisionTree...
Accuracy: 0.9191
Precision: 0.9230
Recall: 0.9167
F1 Score: 0.9180

Evaluating KNeighbors...
Accuracy: 0.8113
Precision: 0.8105
Recall: 0.8113
F1 Score: 0.7941

Evaluating xgbc...
Accuracy: 0.9629
Precision: 0.9633
Recall: 0.9629
F1 Score: 0.9628
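For a multi-class problem, weighted recall is mathematically equal to accuracy, yet RandomForest and DecisionTree report slightly different values for the two. A likely reason is that evaluate_models runs a separate cross-validation per metric, so models without a fixed random_state are refit with different randomness each time. The sketch below, on synthetic data and with illustrative names, scores all metrics from a single cross-validation run using scikit-learn's cross_validate:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for X_train / y_train_encoded
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=3, random_state=0,
)

# Built-in scorer names for the weighted multi-class metrics
metric_names = ["accuracy", "precision_weighted", "recall_weighted", "f1_weighted"]

# One call fits each fold once and scores every metric on the same predictions
cv_out = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X_demo, y_demo, cv=5, scoring=metric_names,
)
summary = {name: cv_out[f"test_{name}"].mean() for name in metric_names}
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Because every metric is computed on the same fitted folds, accuracy and weighted recall agree here, and the model is trained once per fold instead of once per fold and metric.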

📊 Models Performance Visuals

result_df = pd.DataFrame(result).T

📈 Accuracy Comparison

# Plot Accuracy
import matplotlib.pyplot as plt 
import seaborn as sns 
plt.figure(figsize=(10, 6))
sns.barplot(x=result_df.index, y='accuracy', data=result_df, palette="Blues_d")
plt.title('Model Comparison: Accuracy', fontsize=16)
plt.ylabel('Accuracy', fontsize=14)
plt.xlabel('Model', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

[Figure: bar chart "Model Comparison: Accuracy"]

🎯 Precision Comparison

# Plot Precision
plt.figure(figsize=(10, 6))
sns.barplot(x=result_df.index, y='precision', data=result_df, palette="Greens_d")
plt.title('Model Comparison: Precision', fontsize=16)
plt.ylabel('Precision', fontsize=14)
plt.xlabel('Model', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

[Figure: bar chart "Model Comparison: Precision"]

🔁 Recall Comparison

# Plot Recall
plt.figure(figsize=(10, 6))
sns.barplot(x=result_df.index, y='recall', data=result_df, palette="Oranges_d")
plt.title('Model Comparison: Recall', fontsize=16)
plt.ylabel('Recall', fontsize=14)
plt.xlabel('Model', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

[Figure: bar chart "Model Comparison: Recall"]

Observation:

✅ The plots above show that the RandomForest and xgbc models give the strongest results. We tune their hyperparameters next to find the most suitable model.

📋 Creating a Comparison Table for the Models

models = ['RandomForest', 'DecisionTree', 'KNeighbors', 'xgbc']

result_df['f1_score'] = 2 * (result_df['precision'] * result_df['recall']) / (result_df['precision'] + result_df['recall'])

# Display the updated table with F1 score
display(result_df)

	accuracy	precision	recall	f1	f1_score
RandomForest	0.932303	0.938914	0.934096	0.935160	0.936499
DecisionTree	0.919117	0.923018	0.916723	0.917965	0.919860
KNeighbors	0.811273	0.810466	0.811273	0.794125	0.810869
xgbc	0.962854	0.963345	0.962854	0.962758	0.963100

Observation:

✅ The xgbc model outperforms the others, with the highest accuracy, precision, recall, and F1 score (each about 96.3%). The RandomForest model also performs strongly, with an accuracy of 93.2%, making it a competitive alternative. The DecisionTree model is slightly lower at 91.9%, while the KNeighbors model trails clearly at 81.1%. Based on these findings, we carry RandomForest and XGBoost forward for further comparison.
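The table carries two F1 columns that do not match exactly. The f1 column is the weighted F1 averaged over the cross-validation folds, while f1_score is recomputed afterwards as the harmonic mean of the averaged precision and averaged recall. These are different quantities in general. The sketch below uses hypothetical per-fold scores, chosen only to make the averaging effect visible:

```python
# Hypothetical per-fold scores, chosen only to make the effect visible
fold_precision = [0.9, 0.6]
fold_recall = [0.6, 0.9]

def harmonic_f1(p, r):
    """F1 as the harmonic mean of one precision/recall pair."""
    return 2 * p * r / (p + r)

# Option 1: compute F1 within each fold, then average
mean_of_f1 = sum(
    harmonic_f1(p, r) for p, r in zip(fold_precision, fold_recall)
) / len(fold_precision)

# Option 2: average precision and recall first, then take the harmonic mean
mean_p = sum(fold_precision) / len(fold_precision)
mean_r = sum(fold_recall) / len(fold_recall)
f1_of_means = harmonic_f1(mean_p, mean_r)

print(round(mean_of_f1, 4), round(f1_of_means, 4))
```

When reporting a single F1 figure, the cross-validated f1 column is the safer choice, since it is computed from the actual predictions in each fold.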

⚙️ Hyperparameter Tuning: Random Forest & XGBoost

models={'RandomForest_hyper_tuned':RandomForestClassifier(),
        'xgbc_hyper_tuned': XGBClassifier()}

# Hyperparameter grids for tuning models
param_grids = {
    # Random Forest Hyperparameter Grid
    'RandomForest_hyper_tuned': {
        'n_estimators': [50, 100, 200, 400],               # Number of trees in the forest
        'max_depth': [None, 10, 20, 50],                   # Maximum depth of each tree
        'min_samples_split': [2, 5, 10, 15]                # Minimum number of samples to split a node
    },
    # XGBoostClassifier Hyperparameter Grid
    'xgbc_hyper_tuned': {
        'n_estimators': [50, 100, 200, 400],               # Number of boosting rounds
        'max_depth': [3, 5, 7, 10],                        # Maximum depth of each tree
        'learning_rate': [0.001, 0.01, 0.1, 0.3],          # Step size shrinkage
        'objective': ['multi:softmax'],                    # Multi-class classification
        'verbosity': [0],                                  # Silence output
        'nthread': [-1],                                   # Use all available threads
        'random_state': [42]                               # Ensure reproducibility
    }
}
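Before launching the search it helps to know how many model fits the grids imply, since every combination is trained once per cross-validation fold. A small sketch of that arithmetic, using the same grids as above and a hypothetical helper named count_fits:

```python
import math

# Same option lists as param_grids above
rf_grid = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [None, 10, 20, 50],
    "min_samples_split": [2, 5, 10, 15],
}
xgb_grid = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [3, 5, 7, 10],
    "learning_rate": [0.001, 0.01, 0.1, 0.3],
    "objective": ["multi:softmax"],
    "verbosity": [0],
    "nthread": [-1],
    "random_state": [42],
}

def count_fits(grid, n_folds):
    """Return (number of candidate settings, number of cross-validation fits)."""
    n_candidates = math.prod(len(options) for options in grid.values())
    return n_candidates, n_candidates * n_folds

print(count_fits(rf_grid, 5))
print(count_fits(xgb_grid, 5))
```

Keys with a single option, such as objective or random_state, do not enlarge the grid. With 5 folds each grid leads to 320 fits, plus one final refit of the best candidate.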

best_models_hyper_tuned, result_hyper_tuned = grid_search_cv_eval(X_train, y_train_encoded, models, param_grids, scoring, cross_validation=5)
best_models_hyper_tuned, result_hyper_tuned

{'RandomForest_hyper_tuned': RandomForestClassifier(), 'xgbc_hyper_tuned': XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)}
{}

Running GridSearch for RandomForest_hyper_tuned...
Best parameters for RandomForest_hyper_tuned: {'max_depth': 50, 'min_samples_split': 5, 'n_estimators': 400}
Best accuracy: 0.9371
Best precision: 0.9424
Best recall: 0.9371
{'RandomForest_hyper_tuned': {'parameter': {'max_depth': 50, 'min_samples_split': 5, 'n_estimators': 400}, 'accuracy': 0.9370915826005646, 'precision': 0.9423907933895954, 'recall': 0.9370915826005646}}

Running GridSearch for xgbc_hyper_tuned...
Best parameters for xgbc_hyper_tuned: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 400, 'nthread': -1, 'objective': 'multi:softmax', 'random_state': 42, 'verbosity': 0}
Best accuracy: 0.9664
Best precision: 0.9668
Best recall: 0.9664


({'RandomForest_hyper_tuned': RandomForestClassifier(max_depth=50, min_samples_split=5, n_estimators=400),
  'xgbc_hyper_tuned': XGBClassifier(base_score=None, booster=None, callbacks=None,
                colsample_bylevel=None, colsample_bynode=None,
                colsample_bytree=None, device=None, early_stopping_rounds=None,
                enable_categorical=False, eval_metric=None, feature_types=None,
                gamma=None, grow_policy=None, importance_type=None,
                interaction_constraints=None, learning_rate=0.1, max_bin=None,
                max_cat_threshold=None, max_cat_to_onehot=None,
                max_delta_step=None, max_depth=5, max_leaves=None,
                min_child_weight=None, missing=nan, monotone_constraints=None,
                multi_strategy=None, n_estimators=400, n_jobs=None, nthread=-1,
                num_parallel_tree=None, ...)},
 {'RandomForest_hyper_tuned': {'parameter': {'max_depth': 50,
    'min_samples_split': 5,
    'n_estimators': 400},
   'accuracy': 0.9370915826005646,
   'precision': 0.9423907933895954,
   'recall': 0.9370915826005646},
  'xgbc_hyper_tuned': {'parameter': {'learning_rate': 0.1,
    'max_depth': 5,
    'n_estimators': 400,
    'nthread': -1,
    'objective': 'multi:softmax',
    'random_state': 42,
    'verbosity': 0},
   'accuracy': 0.9664472856089622,
   'precision': 0.9668094661210735,
   'recall': 0.9664472856089622}})

📉 Confusion Matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_pred = best_models_hyper_tuned['xgbc_hyper_tuned'].predict(X_test)

cm = confusion_matrix(y_test_encoded, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

[Figure: confusion matrix heatmap for the tuned XGBoost model]
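The heatmap can also be read numerically. In scikit-learn's convention the rows of the confusion matrix are the actual classes and the columns the predicted ones, so per-class recall and precision follow from the row and column sums. A minimal sketch on a hypothetical 3-class matrix (not the project's results):

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are actual, columns are predicted
cm_demo = np.array([
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
])

true_positives = np.diag(cm_demo)
recall_per_class = true_positives / cm_demo.sum(axis=1)     # divide by actual totals (rows)
precision_per_class = true_positives / cm_demo.sum(axis=0)  # divide by predicted totals (columns)
accuracy = true_positives.sum() / cm_demo.sum()

print(recall_per_class.round(3))
print(precision_per_class.round(3))
print(round(float(accuracy), 3))
```

Applied to cm from the cell above, the same three lines would show which obesity classes the model confuses most often.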

🌲 Feature Importance (Random Forest)

from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import plotly.express as px

from matplotlib import pyplot as plt
from sklearn.model_selection import cross_val_score

palette = ['#008080','#FF6347', '#E50000', '#D2691E'] # Creating color palette for plots

randomForest_model = best_models_hyper_tuned['RandomForest_hyper_tuned']
# Refit on the encoded labels, matching what the grid search was trained on
randomForest_model = randomForest_model.fit(X_train, y_train_encoded)

fimp = pd.Series(data=randomForest_model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
plt.figure(figsize=(17,13))
plt.title("Feature importance")
ax = sns.barplot(y=fimp.index, x=fimp.values, palette=palette, orient='h')

[Figure: Random Forest feature importance bar chart]

🎯 Classic Feature Attributions (XGBoost)

Here we try out the global feature importance calculations that come with XGBoost.

xgboot_model = best_models_hyper_tuned.get('xgbc_hyper_tuned')
xgboot_model

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=5, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=400, n_jobs=None, nthread=-1,
              num_parallel_tree=None, ...)

import xgboost
xgboost.plot_importance(xgboot_model)
#plt.title("xgboost plot_importance(model)")
plt.show()

[Figure: xgboost.plot_importance(model) with the default importance type]

xgboost.plot_importance(xgboot_model, importance_type="cover")
plt.title('xgboost.plot_importance(model, importance_type="cover")')
plt.show()

[Figure: xgboost.plot_importance(model, importance_type="cover")]

xgboost.plot_importance(xgboot_model, importance_type="gain")
plt.title('xgboost.plot_importance(model, importance_type="gain")')
plt.show()

[Figure: xgboost.plot_importance(model, importance_type="gain")]

🧠 SHAP Explainability Setup

import shap

# print the JS visualization code to the notebook
shap.initjs()

🧠 SHAP Explainability – TreeExplainer (XGBoost)

explainer = shap.TreeExplainer(xgboot_model)
shap_values = explainer.shap_values(X_train)

📌 SHAP Summary Plot (All Features)

shap.summary_plot(shap_values, X_train, plot_type="bar")

[Figure: SHAP summary bar plot for the XGBoost model]

# LabelEncoder assigns integer codes in alphabetical order of the class names
class_mapping = {
    'Insufficient_Weight': 0,
    'Normal_Weight': 1,
    'Obesity_Type_I': 2,
    'Obesity_Type_II': 3,
    'Obesity_Type_III': 4,
    'Overweight_Level_I': 5,
    'Overweight_Level_II': 6
}

🧠 Feature Importance Insights

Top Features:

Weight is the most dominant feature, influencing predictions across all classes substantially.
Height and Freq_Veg also show strong predictive power across multiple obesity levels.

Lower Features:

Features like Transportation_Walking, Snacking_Always, and Calorie_Monitoring_no contribute very little to predictions and may not add significant predictive power.

Class-Specific Observations:

Weight and Height influence multiple classes, especially Class 0 (Insufficient_Weight) and Class 4 (Obesity_Type_III).
Some features (e.g., Transportation_Public_Transportation) may have niche relevance to specific classes.
Overall, feature contribution patterns are fairly consistent across obesity levels.

🛠️ Feature Engineering (Without Height & Weight)

To assess the true impact of lifestyle and behavioral factors on obesity prediction, we retrain the XGBoost model after removing Weight and Height from the dataset.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Drop Weight and Height
dataset_cleaned = clean_data_df.drop(columns=['Weight', 'Height'])

#  Separate features and target variable
X_new = dataset_cleaned.drop(columns=['Obesity_Level'])
y_new = dataset_cleaned['Obesity_Level']

# Encode categorical target variable
le = LabelEncoder()
y_encoded_new = le.fit_transform(y_new)

# Split data into training and testing sets
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_encoded_new, stratify=y_encoded_new, test_size=0.2, random_state=42)

# One-Hot Encode Categorical Features
categorical_features = ['Gender', 'Family_History', 'High_Cal_Foods_Frequently', 'Snacking', 
                        'Smoke', 'Calorie_Monitoring', 'Freq_Alcohol', 'Transportation']
X_train_new = pd.get_dummies(X_train_new, columns=categorical_features, drop_first=True)
X_test_new = pd.get_dummies(X_test_new, columns=categorical_features, drop_first=True)

# Ensure train and test datasets have the same columns
X_test_new = X_test_new.reindex(columns=X_train_new.columns, fill_value=0)

# Apply Min-Max Scaling to Numerical Features
num_cols = ['Age', 'Freq_Veg', 'Num_Meals', 'Water_Intake', 'Phys_Activity', 'Tech_Use']
scaler = MinMaxScaler()

X_train_new[num_cols] = scaler.fit_transform(X_train_new[num_cols])
X_test_new[num_cols] = scaler.transform(X_test_new[num_cols])
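The reindex step matters because pd.get_dummies only creates columns for the categories it sees in the frame it is given. If a category is absent from the test split, the encoded test frame ends up with fewer columns than the training frame. A minimal sketch with toy data (the frame names are ours):

```python
import pandas as pd

# Toy splits: the test split happens to contain no 'Bike' rows
train_demo = pd.DataFrame({"Transportation": ["Walking", "Bike", "Automobile"]})
test_demo = pd.DataFrame({"Transportation": ["Walking", "Automobile"]})

train_enc = pd.get_dummies(train_demo, columns=["Transportation"], drop_first=True)
test_enc = pd.get_dummies(test_demo, columns=["Transportation"], drop_first=True)

print(list(train_enc.columns))
print(list(test_enc.columns))

# Align the test columns to the training layout; missing dummies are filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

One caveat with drop_first=True: if the test split lacks the alphabetically first category, get_dummies drops a different column there than in training, and reindexing then silently encodes those rows as the dropped baseline. Fitting a OneHotEncoder on the training data and reusing it for the test data, as done earlier in this post, avoids that problem.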

xgb_params = {
    'learning_rate': 0.3,
    'n_estimators': 200,
    'nthread': -1,
    'objective': 'multi:softmax',
    'random_state': 42,
    'verbosity': 0
}

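# Alternative: build a fresh model from xgb_params instead of the tuned settings
# xgb_model = XGBClassifier(**xgb_params)
# Clone the tuned model so refitting here does not overwrite xgboot_model
from sklearn.base import clone
xgb_model = clone(xgboot_model)
xgb_model.fit(X_train_new, y_train_new)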

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=5, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=400, n_jobs=None, nthread=-1,
              num_parallel_tree=None, ...)

# Predictions and Performance Metrics
y_pred = xgb_model.predict(X_test_new)
accuracy = accuracy_score(y_test_new, y_pred)
precision = precision_score(y_test_new, y_pred, average='weighted')
recall = recall_score(y_test_new, y_pred, average='weighted')
f1 = f1_score(y_test_new, y_pred, average='weighted')

# Display results
xgb_metrics = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1 Score': f1
}
xgb_metrics

{'Accuracy': 0.8397129186602871,
 'Precision': 0.8416882871552138,
 'Recall': 0.8397129186602871,
 'F1 Score': 0.8396322920498374}

Observation:

✅ All performance metrics dropped, as expected, but at least we now have a more realistic perspective on the real drivers of obesity levels.

Let’s also try the Random Forest

from sklearn.ensemble import RandomForestClassifier

# Step 1: Train Random Forest Model
# rf_model = RandomForestClassifier(
#     n_estimators=200,
#     max_depth=None,
#     random_state=42,
#     n_jobs=-1  # Utilize all available cores
# )
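# Clone the tuned model so refitting here does not overwrite randomForest_model
from sklearn.base import clone
rf_model = clone(randomForest_model)
rf_model.fit(X_train_new, y_train_new)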

# Step 2: Predictions and Performance Metrics
y_pred_rf = rf_model.predict(X_test_new)
rf_accuracy = accuracy_score(y_test_new, y_pred_rf)
rf_precision = precision_score(y_test_new, y_pred_rf, average='weighted')
rf_recall = recall_score(y_test_new, y_pred_rf, average='weighted')
rf_f1 = f1_score(y_test_new, y_pred_rf, average='weighted')

# Display Random Forest metrics
rf_metrics = {
    'Accuracy': rf_accuracy,
    'Precision': rf_precision,
    'Recall': rf_recall,
    'F1 Score': rf_f1
}

rf_metrics

{'Accuracy': 0.8373205741626795,
 'Precision': 0.8409240886853,
 'Recall': 0.8373205741626795,
 'F1 Score': 0.8368984104506949}

🔍 Comparison of XGBoost and Random Forest with their performance metrics

# Combine the metrics of both models, trained without Weight and Height, into one DataFrame
comparison_df = pd.DataFrame([rf_metrics, xgb_metrics], index=['Random Forest_EWH', 'XGBoost_EWH'])

# Display the comparison table
comparison_df

	Accuracy	Precision	Recall	F1 Score
Random Forest_EWH	0.837321	0.840924	0.837321	0.836898
XGBoost_EWH	0.839713	0.841688	0.839713	0.839632

From this comparison, XGBoost is marginally ahead on every metric, but the gap is only about 0.2 percentage points (accuracy 0.8397 vs 0.8373), so the two models perform essentially the same. We will proceed with the Random Forest model’s output for feature selection.

📌 Feature Importance with Random Forest, XGBoost, and SHAP

# Compute Random Forest feature importance
rf_feature_importance = rf_model.feature_importances_

# Create a DataFrame for feature importance, sorted for better visualization
rf_feature_importance_df = pd.DataFrame({
    'Feature': X_train_new.columns,
    'Importance': rf_feature_importance
}).sort_values(by='Importance', ascending=False)

# Plot Feature Importance
plt.figure(figsize=(12, 8))
plt.barh(rf_feature_importance_df['Feature'], rf_feature_importance_df['Importance'], color='skyblue')
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()  # Highest importance at the top
plt.tight_layout()
plt.show()

🌲 Random Forest Feature Importance

[Figure: Random Forest feature importance without Weight and Height]

from xgboost import plot_importance

# Built-in plot with 'gain'; pass an explicit Axes so the figure size is applied
fig, ax = plt.subplots(figsize=(12, 8))
plot_importance(xgb_model, importance_type='gain', max_num_features=10, ax=ax)
ax.set_title('XGBoost Feature Importance (Gain)')
plt.show()

# Custom plot with 'gain'
feature_importance = xgb_model.get_booster().get_score(importance_type='gain')
importance_df = pd.DataFrame(list(feature_importance.items()), columns=['Feature', 'Importance']).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(12, 8))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.xlabel('Feature Importance (Gain)')
plt.ylabel('Features')
plt.title('XGBoost Feature Importance (Gain)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

⚡ XGBoost Feature Importance

[Figure: XGBoost feature importance (gain) without Weight and Height]

# SHAP for Random Forest
explainer_rf = shap.TreeExplainer(rf_model)
shap_values_rf = explainer_rf.shap_values(X_train_new)

# SHAP for XGBoost
explainer_xgb = shap.TreeExplainer(xgb_model)
shap_values_xgb = explainer_xgb.shap_values(X_train_new)

🔍 SHAP Values for Multi-Class Classification

Refer: https://medium.com/biased-algorithms/shap-values-for-multiclass-classification-2a1b93f69c63

Because this is a multi-class classification problem, a single SHAP explanation is not enough to understand the model. Following the Medium post above, we compute SHAP values for each individual class to better understand feature importance.

# Visualize the SHAP Summary Plot for RF

shap.summary_plot(shap_values_rf, X_train_new, plot_type="bar")

[Figure: SHAP summary bar plot for the Random Forest model]

import xgboost as xgb

# Create DMatrix
dtrain = xgb.DMatrix(X_train_new, label=y_train_new)
explainer_xgb = shap.TreeExplainer(xgb_model)
shap_values_xgb = explainer_xgb.shap_values(dtrain)

#the summary plot
shap.summary_plot(shap_values_xgb, X_train_new, plot_type="bar")

[Figure: SHAP summary bar plot for the XGBoost model]
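The stacked bar plots show per-class contributions visually. To rank features for one class numerically, the SHAP values can be reduced to a mean absolute value per class and feature. To our knowledge, depending on the shap version, a multi-class tree explainer returns either a list with one array per class or a single 3-D array, so the hypothetical helper below accepts both. It is a sketch on random numbers, not on the project's SHAP output:

```python
import numpy as np

def mean_abs_shap_per_class(shap_values):
    """Return an (n_classes, n_features) array of mean |SHAP| values.

    Accepts either a list of (n_samples, n_features) arrays, one per class,
    or a single (n_samples, n_features, n_classes) array.
    """
    if isinstance(shap_values, list):
        stacked = np.stack(shap_values, axis=0)  # (classes, samples, features)
    else:
        stacked = np.moveaxis(np.asarray(shap_values), -1, 0)
    return np.abs(stacked).mean(axis=1)

# Random numbers standing in for real SHAP output: 50 samples, 4 features, 3 classes
rng = np.random.default_rng(0)
fake_shap_3d = rng.normal(size=(50, 4, 3))
fake_shap_list = [fake_shap_3d[:, :, k] for k in range(3)]

per_class = mean_abs_shap_per_class(fake_shap_3d)
print(per_class.shape)

# Rank features for one class, most influential first
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names
class_index = 0
order = np.argsort(per_class[class_index])[::-1]
print([feature_names[i] for i in order])
```

With the real output, shap_values_rf or shap_values_xgb would take the place of the random array and X_train_new.columns the place of feature_names. The row index corresponds to the integer code assigned by the label encoder.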