Obesity Level Classification Using Eating Habits and Physical Condition

🧠 Project Overview

This project leverages supervised machine learning to classify individuals into obesity categories based on their demographic data, eating habits, and physical activity levels. The dataset consists of 2,111 observations and 17 columns (16 predictors plus the target), capturing both behavioral and physiological variables.

The target variable, NObeyesdad, represents seven distinct obesity classes:

  • Insufficient Weight
  • Normal Weight
  • Overweight Level I
  • Overweight Level II
  • Obesity Type I
  • Obesity Type II
  • Obesity Type III

Dataset Source: UCI Machine Learning Repository


📊 Dataset Summary

  • 📌 Rows: 2,111
  • 📌 Features: 16 predictors + 1 target (NObeyesdad)
  • 📌 Balance: 77% synthetic (via SMOTE), 23% survey-based real data
  • 📌 Goal: Predict obesity level using lifestyle and physiological variables

📁 Feature Categories

  • 🥗 Eating Habits:
    • FAVC (Frequent consumption of high-calorie food)
    • FCVC (Frequency of vegetable consumption)
    • NCP (Number of daily meals)
    • CAEC (Food consumption between meals)
    • CH2O (Daily water intake)
    • CALC (Alcohol consumption)
  • 🏃‍♂️ Physical Activity & Lifestyle:
    • FAF (Physical activity frequency)
    • SCC (Calorie consumption monitoring)
    • TUE (Technology usage per day)
    • MTRANS (Transportation mode)
    • SMOKE (Smoking habit)
  • 👤 Demographics:
    • Gender
    • Age
    • Height
    • Weight
    • family_history_with_overweight
  • 🎯 Target Variable:
    • NObeyesdad (Multi-class label indicating obesity category)

Each row in the dataset corresponds to one individual, capturing their nutritional behavior, lifestyle factors, and physical condition. These features provide a comprehensive view of potential predictors for obesity and are used to train classification models.

📦 Import Required Packages and Dataset

%load_ext dotenv
%dotenv 

# Import standard libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import os
import sys

# Append the source directory (from the SRC_DIR environment variable) to the Python path
sys.path.append(os.getenv('SRC_DIR'))

📥 Load the Dataset into a DataFrame

# Load the dataset
from DataManager import get_data

obesity_df = get_data('../data/ObesityDataSet_raw_and_data_sinthetic.csv')
obesity_df.shape

Output: (2111, 17)


🔍 Display Basic Information

obesity_df.head()
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
0 Female 21.0 1.62 64.0 yes no 2.0 3.0 Sometimes no 2.0 no 0.0 1.0 no Public_Transportation Normal_Weight
1 Female 21.0 1.52 56.0 yes no 3.0 3.0 Sometimes yes 3.0 yes 3.0 0.0 Sometimes Public_Transportation Normal_Weight
2 Male 23.0 1.80 77.0 yes no 2.0 3.0 Sometimes no 2.0 no 2.0 1.0 Frequently Public_Transportation Normal_Weight
3 Male 27.0 1.80 87.0 no no 3.0 3.0 Sometimes no 2.0 no 2.0 0.0 Frequently Walking Overweight_Level_I
4 Male 22.0 1.78 89.8 no no 2.0 1.0 Sometimes no 2.0 no 0.0 0.0 Sometimes Public_Transportation Overweight_Level_II

📝 Rename Columns for Clarity

# Rename columns for improved readability
columnName = {
    'family_history_with_overweight': 'Family_History',
    'FAVC': 'High_Cal_Foods_Frequently',
    'FCVC': 'Freq_Veg',
    'NCP': 'Num_Meals',
    'CAEC': 'Snacking',
    'SMOKE': 'Smoke',
    'CH2O': 'Water_Intake',
    'SCC': 'Calorie_Monitoring',
    'FAF': 'Phys_Activity',
    'TUE': 'Tech_Use',
    'CALC': 'Freq_Alcohol',
    'MTRANS': 'Transportation',
    'NObeyesdad': 'Obesity_Level',
}


obesity_df_renamed = obesity_df.rename(columns=columnName)

ℹ️ Dataset Info After Renaming

obesity_df_renamed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Gender                     2111 non-null   object 
 1   Age                        2111 non-null   float64
 2   Height                     2111 non-null   float64
 3   Weight                     2111 non-null   float64
 4   Family_History             2111 non-null   object 
 5   High_Cal_Foods_Frequently  2111 non-null   object 
 6   Freq_Veg                   2111 non-null   float64
 7   Num_Meals                  2111 non-null   float64
 8   Snacking                   2111 non-null   object 
 9   Smoke                      2111 non-null   object 
 10  Water_Intake               2111 non-null   float64
 11  Calorie_Monitoring         2111 non-null   object 
 12  Phys_Activity              2111 non-null   float64
 13  Tech_Use                   2111 non-null   float64
 14  Freq_Alcohol               2111 non-null   object 
 15  Transportation             2111 non-null   object 
 16  Obesity_Level              2111 non-null   object 
dtypes: float64(8), object(9)
memory usage: 280.5+ KB

📊 Summary Statistics

obesity_df_renamed.describe()
               Age       Height       Weight     Freq_Veg    Num_Meals  Water_Intake  Phys_Activity     Tech_Use
count  2111.000000  2111.000000  2111.000000  2111.000000  2111.000000   2111.000000    2111.000000  2111.000000
mean     24.312600     1.701677    86.586058     2.419043     2.685628      2.008011       1.010298     0.657866
std       6.345968     0.093305    26.191172     0.533927     0.778039      0.612953       0.850592     0.608927
min      14.000000     1.450000    39.000000     1.000000     1.000000      1.000000       0.000000     0.000000
25%      19.947192     1.630000    65.473343     2.000000     2.658738      1.584812       0.124505     0.000000
50%      22.777890     1.700499    83.000000     2.385502     3.000000      2.000000       1.000000     0.625350
75%      26.000000     1.768464   107.430682     3.000000     3.000000      2.477420       1.666678     1.000000
max      61.000000     1.980000   173.000000     3.000000     4.000000      3.000000       3.000000     2.000000

🧹 Data Preprocessing

In this section, we perform essential data quality checks to prepare the dataset for analysis:

  1. Check for Missing or Null Values
    Confirm that no values are missing from the dataset.

  2. Check for Uniqueness in Categories
    Verify the categorical variables contain consistent and non-redundant entries.

  3. Check for Duplicates
    Identify and remove duplicate records to ensure data integrity.

  4. Check for Outliers
    Detect and assess the impact of outliers on numerical features using statistical and visual methods.

📌 Identify Numerical vs Categorical Columns

num_col = obesity_df_renamed.select_dtypes(include=['float64', 'int64']).columns
cat_col = obesity_df_renamed.select_dtypes(include=['object']).columns

✅ Missing or Null Values

# Find missing values
obesity_df_renamed.isna().sum()
Gender                       0
Age                          0
Height                       0
Weight                       0
Family_History               0
High_Cal_Foods_Frequently    0
Freq_Veg                     0
Num_Meals                    0
Snacking                     0
Smoke                        0
Water_Intake                 0
Calorie_Monitoring           0
Phys_Activity                0
Tech_Use                     0
Freq_Alcohol                 0
Transportation               0
Obesity_Level                0
dtype: int64

Observation:

✅ No missing or null values were found in the dataset. No imputation is required.

obesity_df_renamed.isnull().sum() # check for null values
Gender                       0
Age                          0
Height                       0
Weight                       0
Family_History               0
High_Cal_Foods_Frequently    0
Freq_Veg                     0
Num_Meals                    0
Snacking                     0
Smoke                        0
Water_Intake                 0
Calorie_Monitoring           0
Phys_Activity                0
Tech_Use                     0
Freq_Alcohol                 0
Transportation               0
Obesity_Level                0
dtype: int64

Because isnull() is an alias of isna(), this check returns the same result: the data contains no null values.

🧾 Uniqueness Check for Categorical Variables

unique_values_per_column = {col: obesity_df_renamed[col].unique() for col in cat_col}

# Display unique values for each column
for col, unique_vals in unique_values_per_column.items():
    print(f"Unique values in '{col}': {unique_vals}")
Unique values in 'Gender': ['Female' 'Male']
Unique values in 'Family_History': ['yes' 'no']
Unique values in 'High_Cal_Foods_Frequently': ['no' 'yes']
Unique values in 'Snacking': ['Sometimes' 'Frequently' 'Always' 'no']
Unique values in 'Smoke': ['no' 'yes']
Unique values in 'Calorie_Monitoring': ['no' 'yes']
Unique values in 'Freq_Alcohol': ['no' 'Sometimes' 'Frequently' 'Always']
Unique values in 'Transportation': ['Public_Transportation' 'Walking' 'Automobile' 'Motorbike' 'Bike']
Unique values in 'Obesity_Level': ['Normal_Weight' 'Overweight_Level_I' 'Overweight_Level_II'
 'Obesity_Type_I' 'Insufficient_Weight' 'Obesity_Type_II'
 'Obesity_Type_III']

Observation:

✅ All categorical variables contain consistent and interpretable values. No cleaning required.

🔁 Duplicate Rows

obesity_df_renamed.duplicated().sum()
np.int64(24)

🟡 24 duplicate rows detected

# Show duplicated values if needed

#obesity_df_renamed[obesity_df_renamed.duplicated(keep=False)]

Remove duplicates

# remove duplicates from the dataset
from DataManager import drop_duplicates

data = drop_duplicates(obesity_df_renamed)
data.shape

Observation:

✅ Duplicates removed successfully. Of the original 2,111 rows, 2,087 remain.
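
The `drop_duplicates` helper imported from `DataManager` is not shown in this README. As a rough sketch of what such a helper might do, assuming it simply wraps pandas' `DataFrame.drop_duplicates` and resets the index (the name `drop_duplicates_sketch` and the toy frame are illustrative, not the project's code):

```python
import pandas as pd


def drop_duplicates_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for DataManager.drop_duplicates."""
    # Keep the first occurrence of each fully identical row.
    return df.drop_duplicates(keep="first").reset_index(drop=True)


# Toy data: the third row repeats the first one exactly.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Male"],
        "Age": [21.0, 23.0, 21.0, 27.0],
        "Weight": [64.0, 77.0, 64.0, 87.0],
    }
)

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = drop_duplicates_sketch(toy)
print(n_duplicates, deduped.shape)
```

Whether the real helper resets the index is an assumption here; the later steps only rely on the duplicate rows being gone.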

🚨 Outlier Detection (IQR Method)

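def find_outliers_IQR(df, numeric_col):
    """Return the rows of df whose numeric_col value lies more than 1.5 * IQR outside the quartiles."""
    q1 = df[numeric_col].quantile(0.25)
    q3 = df[numeric_col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[numeric_col] < lower) | (df[numeric_col] > upper)]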

numeric_columns = num_col
for column in numeric_columns:
    outliers = find_outliers_IQR(data, column)
    
    if outliers.empty:
        print(f"No outliers detected in column {column}")
    else:
        print(f"{len(outliers)} outliers detected in column {column}")
        print("Max outlier value:", str(outliers[column].max()))
        print("Min outlier value:", str(outliers[column].min()))

Observation:

✅ Detected outliers in:

Age: 167 values

Height: 1 value

Weight: 1 value

Num_Meals: 577 values
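
To make the IQR rule concrete, here is a small worked example on made-up ages (the values are illustrative and not taken from the dataset). It computes the quartiles, the two fences, and the values that fall outside them:

```python
import pandas as pd

# Ten made-up ages; the two largest are far from the rest.
ages = pd.Series([18, 19, 20, 21, 22, 23, 24, 26, 40, 61], dtype=float)

q1 = ages.quantile(0.25)   # first quartile
q3 = ages.quantile(0.75)   # third quartile
iqr = q3 - q1              # interquartile range

lower = q1 - 1.5 * iqr     # lower fence
upper = q3 + 1.5 * iqr     # upper fence
outliers = ages[(ages < lower) | (ages > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print(outliers.tolist())
```

With a right-skewed column, only the upper fence is typically crossed, which is consistent with the Age outliers sitting at the high end of the range.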

📦 Boxplots for Visual Outlier Analysis

We visualize only the columns in which outliers were detected.

for column in num_col:
    outliers = find_outliers_IQR(data, column)
    if not outliers.empty:  # Only plot if outliers are present
        plt.figure(figsize=(8, 6))
        plt.boxplot(data[column].dropna(), vert=False, patch_artist=True)
        plt.title(f"Box Plot for {column} (Outliers Detected)")
        plt.xlabel(column)
        plt.grid(True)
        plt.show()


Outlier Interpretation

Age is right-skewed. We’ll retain outliers and revisit during feature engineering.

Num_Meals behaves like a categorical variable. We’ll retain all values as they may be informative.

Height and Weight outliers are minimal and reasonable, so we retain them.

📊 Distribution of Age

# Age distribution across different obesity levels (NObeyesdad)

plt.figure(figsize=(10,8))
nobeyesdad_order = ['Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I','Overweight_Level_II','Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III']  # Adjust according to your dataset
sns.boxplot(x='Obesity_Level', y='Age', data=data, palette='Set3', order=nobeyesdad_order)
plt.title('Age Distribution across Obesity Levels (Obesity_Level)')
plt.xlabel('Obesity Level (Obesity_Level)')
plt.ylabel('Age')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

The histogram below confirms that Age is right-skewed: the dataset predominantly represents younger age groups, which could bias the models toward them. For now, we retain the outliers and observe how the skewness of Age affects model performance. Depending on the results, we will decide during feature engineering whether to keep or remove them.

sns.histplot(data['Age'], kde=True, bins=20, color='skyblue')
plt.title('Age Distribution')
plt.show()

Observation:

✅ Age values range from 14 to 61. We’ll retain all entries for now.

📈 Target Class Distribution

# Define the order of obesity levels based on their frequency
nobeyesdad_order = data['Obesity_Level'].value_counts().index

plt.figure(figsize=(10,8))
sns.countplot(x='Obesity_Level', data=data, palette='Set3', order=nobeyesdad_order)
plt.title('Distribution of Obesity Levels (Obesity_Level)')
plt.xlabel('Obesity Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


Observation:

✅ Obesity Type I is the most frequent category.

🔍 Pairplot of Numerical Features

sns.pairplot(data[['Age', 'Height', 'Weight', 'Freq_Veg', 'Water_Intake', 'Phys_Activity', 'Obesity_Level']], hue='Obesity_Level', palette='Set3')
plt.show()


Observation:

✅ Height and Weight show strong correlation. Consider engineering BMI.

Other features, such as Frequency of Vegetable Consumption (FCVC), Water Intake (CH2O), and Physical Activity Frequency (FAF), do not exhibit a clear linear relationship with other variables. However, their values cluster around a few discrete levels, which suggests they may behave more like categorical features despite being stored as numbers. This should be taken into account in the preprocessing and modeling stages so that these features are treated appropriately.
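
One way to quantify how "categorical" a numerical column is would be to measure the share of its entries that are whole numbers. The sketch below does this on a toy frame; the helper name `share_integer_valued` and the values are illustrative assumptions, not part of the project code:

```python
import numpy as np
import pandas as pd


def share_integer_valued(series: pd.Series) -> float:
    """Fraction of entries that are (numerically) whole numbers."""
    return float(np.isclose(series, series.round()).mean())


# Toy values: Freq_Veg mostly sits on integer levels, Height does not.
toy = pd.DataFrame(
    {
        "Freq_Veg": [2.0, 3.0, 2.0, 3.0, 2.385502, 1.0],
        "Height": [1.62, 1.52, 1.80, 1.80, 1.78, 1.70],
    }
)

shares = {col: share_integer_valued(toy[col]) for col in toy.columns}
print(shares)
```

A column with a high share is a candidate for ordinal or categorical treatment; the threshold for "high" is a judgment call.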

📊 Categorical Feature Distribution by Target

# Plot categorical variables against Obesity_Level
binary_palette = {"yes": "#8dd3c7", "no": "#fb8072"}
# Define the desired order of obesity levels
obesity_level_order = [
    "Insufficient_Weight", 
    "Normal_Weight", 
    "Overweight_Level_I", 
    "Overweight_Level_II", 
    "Obesity_Type_I", 
    "Obesity_Type_II", 
    "Obesity_Type_III"
]
# Plot every categorical predictor; the target itself is excluded
plot_cols = [col for col in cat_col if col != "Obesity_Level"]
binary_cols = ["Family_History", "High_Cal_Foods_Frequently", "Smoke", "Calorie_Monitoring"]

fig, axes = plt.subplots(len(plot_cols), 1, figsize=(12, 8 * len(plot_cols)))
for ax, var in zip(axes, plot_cols):
    # Use the yes/no palette for binary variables and "Set3" for the others
    palette = binary_palette if var in binary_cols else "Set3"
    sns.countplot(
        x="Obesity_Level",
        hue=var,
        data=obesity_df_renamed,
        edgecolor="black",
        palette=palette,
        order=obesity_level_order,
        ax=ax,
    )
    ax.set_title(f"{var} vs Obesity_Level")
plt.tight_layout()
plt.show()



🧠 Insights from Categorical Analysis

  • Gender: Slight imbalance in Obesity II & III (more males).

  • Family_History: Clear link with higher obesity.

  • FAVC: High-calorie food consumption increases with obesity.

  • Snacking: “Sometimes” snacking is more frequent in overweight/obese groups.

  • SMOKE: No significant pattern with obesity.

  • SCC (Calorie Monitoring): Lower monitoring in higher obesity levels.

  • CALC (Alcohol): No strong pattern observed.

  • MTRANS: Walkers tend to be in lower obesity levels.

🔗 Correlation Matrix of Numerical Features

We analyze the correlation between numerical features using the Pearson correlation coefficient.

# Create a heatmap for the correlation matrix using only numerical columns
numerical_columns = obesity_df_renamed.select_dtypes(include=['float64', 'int64']).columns
numerical_correlation_matrix = obesity_df_renamed[numerical_columns].corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(numerical_correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Variables')
plt.show()


Observation:

✅ Strong positive correlation observed between Height and Weight. 💡 We will define a new feature: BMI = Weight / Height² in the next section.
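
As a sanity check on what the heatmap reports, the Pearson coefficient can be computed by hand and compared with pandas. The sketch below uses only the five rows shown by `head()` earlier, so the resulting value illustrates the formula and is not the dataset-wide correlation:

```python
import numpy as np
import pandas as pd

# Height and Weight of the first five rows shown earlier.
toy = pd.DataFrame(
    {
        "Height": [1.62, 1.52, 1.80, 1.80, 1.78],
        "Weight": [64.0, 56.0, 77.0, 87.0, 89.8],
    }
)

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean.
h = toy["Height"] - toy["Height"].mean()
w = toy["Weight"] - toy["Weight"].mean()
r_manual = float((h * w).sum() / np.sqrt((h ** 2).sum() * (w ** 2).sum()))

# pandas uses the Pearson method by default.
r_pandas = float(toy["Height"].corr(toy["Weight"]))

print(round(r_manual, 4), round(r_pandas, 4))
```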

📐 Feature Engineering: Body Mass Index (BMI) Calculation

We introduce a new feature BMI (Body Mass Index) to better capture the interaction between weight and height, and explore its impact on obesity classification.

# Define new BMI feature
data['BMI'] = data['Weight'] / (data['Height'] ** 2)
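
As a worked example of the formula, the sketch below computes BMI for the five rows shown by `head()` earlier. It assumes Height is in metres and Weight in kilograms, which the value ranges suggest:

```python
import pandas as pd

# First five rows of the dataset, as displayed earlier.
head = pd.DataFrame(
    {
        "Height": [1.62, 1.52, 1.80, 1.80, 1.78],   # assumed metres
        "Weight": [64.0, 56.0, 77.0, 87.0, 89.8],   # assumed kilograms
        "Obesity_Level": [
            "Normal_Weight",
            "Normal_Weight",
            "Normal_Weight",
            "Overweight_Level_I",
            "Overweight_Level_II",
        ],
    }
)

# BMI = weight / height^2
head["BMI"] = head["Weight"] / head["Height"] ** 2
print(head.round({"BMI": 2}))
```

In these rows the three Normal_Weight individuals come out at a BMI of roughly 24, while the two overweight rows are near 27 and 28, so the engineered feature orders them the same way the label does.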

📊 Exploratory Analysis on BMI

📈 BMI Distribution Across Obesity Levels


plt.figure(figsize=(10,8))
nobeyesdad_order = ['Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I','Overweight_Level_II','Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III']  # Adjust according to your dataset
sns.boxplot(x='Obesity_Level', y='BMI', data=data, palette='Set3', order=nobeyesdad_order)
plt.title('BMI Distribution across Obesity Levels (Obesity_Level)')
plt.xlabel('Obesity Level (Obesity_Level)')
plt.ylabel('BMI')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

alt text

Observation:

✅ BMI increases consistently with obesity severity. Variability also rises at higher obesity levels, while normal and insufficient weight groups are more tightly distributed.

📈 BMI Distribution by Gender

# BMI distribution by gender

plt.figure(figsize=(10,8))
sns.boxplot(x='Gender', y='BMI', data=data, palette='Set3')
plt.title('BMI Distribution by Gender')
plt.xlabel('Gender')
plt.ylabel('BMI')
plt.tight_layout()
plt.show()

Observation:

✅ The plot shows that females, on average, tend to have a higher and more varied BMI compared to males. The broader spread of BMI values in females suggests more diversity in body composition within this group.

plt.figure(figsize=(15,6))
ax = sns.countplot(x = "Gender", hue = "Obesity_Level",hue_order=nobeyesdad_order, data = data, palette='Set3')

for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'),
               (p.get_x() + p.get_width() / 2, p.get_height()),
               ha = 'center', va = 'center',
               xytext = (0,10),
               textcoords = 'offset points')

plt.title("Distribution of Obesity_Level across Gender")
plt.xlabel('')
plt.ylabel('')

plt.show()


Observation:

✅ Females exhibit a wider and higher range of BMI values than males, indicating greater variability in body composition.

📈 BMI Distribution by Age

plt.figure(figsize=(10,8))
sns.scatterplot(x='Age', y='BMI', data=data, hue_order= nobeyesdad_order,hue='Obesity_Level', palette='Set3')

plt.title('BMI Distribution by Age')
plt.xlabel('Age')
plt.ylabel('BMI')
plt.tight_layout()
plt.show()

Observation:

✅ Obesity Type III is more commonly observed in individuals under the age of 30.

# Mean age per obesity level, sorted from oldest to youngest
age_mean = data.groupby('Obesity_Level')['Age'].mean().reset_index()
age_mean_sorted = age_mean.sort_values(by='Age', ascending=False)

fig, axes = plt.subplots(1, 2, figsize=(15, 6))
sns.barplot(y='Obesity_Level', x='Age', ax=axes[0], data=age_mean, order=age_mean_sorted['Obesity_Level'], palette='Set3')

# Annotate each bar with its mean age
for index, (_, row) in enumerate(age_mean_sorted.iterrows()):
    axes[0].text(row['Age'], index, f"{row['Age']: .2f}", va='center', ha='left', fontsize=10)

axes[0].set_xlabel('Age')
axes[0].set_ylabel('')
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=45)

# Full age distribution per obesity level, in the same order as the bars
sns.violinplot(y="Obesity_Level", x="Age", data=data, ax=axes[1], order=age_mean_sorted['Obesity_Level'], palette='Set3')
axes[1].set_ylabel('')
axes[1].set_yticklabels([])

fig.suptitle('Average Age by Obesity_Level', fontsize=16)
plt.show()

Observation:

✅ On average, individuals in the Normal Weight and Insufficient Weight groups are younger than those in the other groups.

📈 BMI Distribution by Family History with Overweight

# Plot BMI distribution by Family History with Overweight across Obesity Levels
binary_palette = {"yes": "#8dd3c7", "no": "#fb8072"}

plt.figure(figsize=(12, 8))
sns.boxplot(
    x='Obesity_Level',
    y='BMI',
    hue ='Family_History',
    data=data,
    palette= binary_palette,
    order=obesity_level_order
)
plt.title("BMI Distribution by Family History with Overweight Across Obesity Levels", fontsize=16)
plt.xlabel("Obesity Level")
plt.ylabel("BMI")
plt.xticks(rotation=45)
plt.legend(title="Family History", loc='upper left')
plt.tight_layout()
plt.show()

Observation:

✅ There is strong evidence that individuals with higher BMI levels tend to have a family history of overweight.

✅ No cases of Obesity Type III were observed in individuals without a family history of overweight, suggesting that family history plays a significant role, at least in the development of Obesity Type III.

📈 BMI Distribution by Frequent Consumption of High-Calorie Food (FAVC)

# Plot BMI distribution by frequent consumption of high-calorie food across Obesity Levels
binary_palette = {"yes": "#8dd3c7", "no": "#fb8072"}

plt.figure(figsize=(12, 8))
sns.boxplot(
    x='Obesity_Level',
    y='BMI',
    hue ='High_Cal_Foods_Frequently',
    data=data,
    palette= binary_palette,
    order=obesity_level_order
)
plt.title("BMI Distribution by Frequent Consumption of High-Calorie Food (FAVC)", fontsize=16)
plt.xlabel("Obesity Level")
plt.ylabel("BMI")
plt.xticks(rotation=45)
plt.legend(title="High_Cal_Foods_Frequently", loc='upper left')
plt.tight_layout()
plt.show()


Observation:

✅ There is only one case of Obesity Type III where frequent consumption of high-calorie food is not present, suggesting that frequent consumption of high-calorie food likely plays a key role in the development of Obesity Type III. For other types of obesity and normal weight individuals, the distribution of high-calorie food consumption appears to be similar across all levels.

📈 BMI Distribution by Number of Meals (NCP)

# BMI distribution by Number of Meals (NCP) across different obesity levels (NObeyesdad)

plt.figure(figsize=(10,8))
# Use the deduplicated frame `data`: the BMI column was added to it, not to obesity_df_renamed
sns.scatterplot(x='Num_Meals', y='BMI', data=data, hue='Obesity_Level', palette='Set3')

plt.title('BMI Distribution by Number of Meals (NCP)')
plt.xlabel('Number of Meals (NCP)')
plt.ylabel('BMI')
plt.tight_layout()
plt.show()

Observation:

✅ Individuals with Obesity Type III mostly consume 3 meals a day, whereas normal-weight individuals consume 1, 3, or 4 meals.

📈 BMI Distribution by Smoking Habit (SMOKE)

# Plot BMI distribution by Smoking across Obesity Levels
binary_palette = {"yes": "#8dd3c7", "no": "#fb8072"}

plt.figure(figsize=(12, 8))
sns.boxplot(
    x='Obesity_Level',
    y='BMI',
    hue ='Smoke',
    data=data,
    palette= binary_palette,
    order=obesity_level_order
)
plt.title("BMI Distribution by Smoking Habit ", fontsize=16)
plt.xlabel("Obesity Level")
plt.ylabel("BMI")
plt.xticks(rotation=45)
plt.legend(title="smoking", loc='upper left')
plt.tight_layout()
plt.show()

Observation:

✅ There is insufficient evidence to conclude that smoking has a significant effect on obesity levels. The primary observation is that individuals with Obesity Type III are predominantly non-smokers.

def show_pie_chart(df, column_name):
    """Show a pie chart and a horizontal count plot of a categorical column side by side."""
    # Convert string values to categorical (note: this modifies df in place)
    df[column_name] = df[column_name].astype('category')

    # Frequency of each category, most frequent first
    counts_sorted = df[column_name].value_counts().sort_values(ascending=False)

    fig, axes = plt.subplots(1, 2, figsize=(16, 8))
    fig.suptitle(column_name)  # one title for the whole figure
    plt.subplots_adjust(wspace=0.5)

    # Pie values and labels come from the same sorted series so they cannot get out of sync
    axes[0].pie(counts_sorted.values, labels=counts_sorted.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette("pastel"))

    sns.countplot(data=df, y=column_name, ax=axes[1], order=counts_sorted.index, palette="pastel", hue=column_name, legend=False, width=0.8)
    axes[1].set_ylabel('')
    axes[1].set_xlabel('')

    # Label each bar with its count
    for i, v in enumerate(counts_sorted.values):
        axes[1].text(v + 0.1, i, str(v), ha='left', va='center', color='black', fontweight='bold')
    plt.show()

📈 Distribution of Transportation Mode (MTRANS)

# Create pie chart for MTRANS
show_pie_chart(data, 'Transportation')

Observation:

✅ Roughly 97% use some form of vehicle, while only ~2.7% prefer walking or cycling. That’s concerning!


📦 Export Cleaned Dataset

data.to_csv('clean_data.csv', index=False)

📌 Summary

  1. Obesity Type I has the highest number of individuals.
  2. Most individuals have a family history of obesity.
  3. Only ~2.7% prefer walking or biking; the rest use vehicles, a concerning imbalance.
  4. More females are classified as obese compared to males.
  5. Strong positive correlation observed between Weight and Height.
  6. Outliers present in Age distribution.

🤖 Obesity Estimation - Feature Engineering & ML Models

We apply multiple classification models to identify the best-performing model for predicting obesity levels.

🧠 Models Considered

  1. Decision Tree
  2. Random Forest
  3. K-Nearest Neighbors (KNN)
  4. XGBoost (XGBClassifier)

🧭 Workflow Steps

  • Train-Test Split (Stratified)
  • One-Hot Encoding (categorical features)
  • Standard Scaling (numerical features)
  • Label Encoding (target variable)
  • GridSearchCV + 5-Fold CV for performance tuning
  • Feature Importance Analysis
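
The workflow steps above can be sketched end to end on synthetic data. This sketch wraps the preprocessing and the model in a scikit-learn Pipeline, which is a choice made for this illustration; the project itself applies the transformer manually. The toy columns, the three made-up classes, and the small `max_depth` grid are all assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (values are made up).
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame(
    {
        "Weight": rng.normal(80, 20, n),
        "Height": rng.normal(1.7, 0.1, n),
        "Gender": rng.choice(["Female", "Male"], n),
    }
)
# Made-up three-class target derived from BMI so the model has some signal.
bmi = toy["Weight"] / toy["Height"] ** 2
labels = pd.cut(
    bmi, bins=[-np.inf, 25, 30, np.inf], labels=["Normal", "Overweight", "Obese"]
).astype(str)

# Label-encode the target, then make a stratified train-test split.
y = LabelEncoder().fit_transform(labels)
X_train, X_test, y_train, y_test = train_test_split(
    toy, y, stratify=y, test_size=0.2, random_state=42
)

# Scale numerical columns and one-hot encode categorical ones.
preprocess = make_column_transformer(
    (StandardScaler(), ["Weight", "Height"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
)
pipeline = make_pipeline(preprocess, DecisionTreeClassifier(random_state=42))

# Grid search with 5-fold cross-validation on the training set only.
search = GridSearchCV(
    pipeline,
    param_grid={"decisiontreeclassifier__max_depth": [3, 5, None]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Keeping the preprocessing inside the pipeline means each cross-validation fold fits the scaler and encoder on its own training part, so no statistics leak from the validation fold.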

🔍 Load and Preprocess Data

import pandas as pd

clean_data_df = pd.read_csv(r'../data/clean_data.csv')

clean_data_df.info()
clean_data_df.drop('BMI', axis='columns', inplace=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087 entries, 0 to 2086
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Gender                     2087 non-null   object 
 1   Age                        2087 non-null   float64
 2   Height                     2087 non-null   float64
 3   Weight                     2087 non-null   float64
 4   Family_History             2087 non-null   object 
 5   High_Cal_Foods_Frequently  2087 non-null   object 
 6   Freq_Veg                   2087 non-null   float64
 7   Num_Meals                  2087 non-null   float64
 8   Snacking                   2087 non-null   object 
 9   Smoke                      2087 non-null   object 
 10  Water_Intake               2087 non-null   float64
 11  Calorie_Monitoring         2087 non-null   object 
 12  Phys_Activity              2087 non-null   float64
 13  Tech_Use                   2087 non-null   float64
 14  Freq_Alcohol               2087 non-null   object 
 15  Transportation             2087 non-null   object 
 16  Obesity_Level              2087 non-null   object 
 17  BMI                        2087 non-null   float64
dtypes: float64(9), object(9)
memory usage: 293.6+ KB

🧾 Identify Feature Types

cat_cols = ['Gender', 'Family_History', 'High_Cal_Foods_Frequently', 'Snacking','Smoke', 'Calorie_Monitoring', 'Freq_Alcohol', 'Transportation']

num_cols = ['Age', 'Height', 'Weight', 'Freq_Veg', 'Num_Meals','Water_Intake', 'Phys_Activity', 'Tech_Use']

🎯 Define Feature Matrix X and Target y

X = clean_data_df.drop('Obesity_Level',axis=1)  
y = clean_data_df['Obesity_Level'] 

X.shape, y.shape
((2087, 16), (2087,))

🔄 Stratified Train-Test Split

A purely random train/test split can leave the classes of the target variable in different proportions in each subset, especially with small datasets.

A stratified split avoids this: it keeps the class percentages in the training and test sets as close as possible to those in the full dataset.

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,stratify=y,test_size=0.2,random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1669, 16), (418, 16), (1669,), (418,))
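
To see what `stratify=y` does, the sketch below splits a deliberately imbalanced toy target (70/20/10, made up for illustration) and compares the class shares in the two subsets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 70% A, 20% B, 10% C.
y_toy = pd.Series(["A"] * 70 + ["B"] * 20 + ["C"] * 10)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# Class shares in each subset should match the full data.
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(train_share.to_dict(), test_share.to_dict())
```

Without `stratify`, the 20-row test set could easily end up with very few or no rows of class C, which would make per-class metrics unreliable.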

⚙️ Preprocessing: One-Hot Encoding + Standard Scaling

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

Sscaler = StandardScaler().set_output(transform="pandas")

# Scale the numerical columns and one-hot encode the categorical ones
transformer = make_column_transformer(
    (Sscaler, num_cols),
    (OneHotEncoder(handle_unknown='ignore'), cat_cols),
    verbose=True,
    verbose_feature_names_out=True,
    remainder='drop',
)

transformer
ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                 ['Age', 'Height', 'Weight', 'Freq_Veg',
                                  'Num_Meals', 'Water_Intake', 'Phys_Activity',
                                  'Tech_Use']),
                                ('onehotencoder',
                                 OneHotEncoder(handle_unknown='ignore'),
                                 ['Gender', 'Family_History',
                                  'High_Cal_Foods_Frequently', 'Snacking',
                                  'Smoke', 'Calorie_Monitoring', 'Freq_Alcohol',
                                  'Transportation'])],
                  verbose=True)

🔄 X_train Encoding

# Fit the transformer on the training data only, then transform it
transformed = transformer.fit_transform(X_train)
# Rebuild a DataFrame with the generated feature names and the original row index
X_train = pd.DataFrame(transformed, columns=transformer.get_feature_names_out(), index=X_train.index)
# Check the result
print(X_train.columns)
print(X_train.shape)
[ColumnTransformer]  (1 of 2) Processing standardscaler, total=   0.0s
[ColumnTransformer] . (2 of 2) Processing onehotencoder, total=   0.0s
Index(['standardscaler__Age', 'standardscaler__Height',
       'standardscaler__Weight', 'standardscaler__Freq_Veg',
       'standardscaler__Num_Meals', 'standardscaler__Water_Intake',
       'standardscaler__Phys_Activity', 'standardscaler__Tech_Use',
       'onehotencoder__Gender_Female', 'onehotencoder__Gender_Male',
       'onehotencoder__Family_History_no', 'onehotencoder__Family_History_yes',
       'onehotencoder__High_Cal_Foods_Frequently_no',
       'onehotencoder__High_Cal_Foods_Frequently_yes',
       'onehotencoder__Snacking_Always', 'onehotencoder__Snacking_Frequently',
       'onehotencoder__Snacking_Sometimes', 'onehotencoder__Snacking_no',
       'onehotencoder__Smoke_no', 'onehotencoder__Smoke_yes',
       'onehotencoder__Calorie_Monitoring_no',
       'onehotencoder__Calorie_Monitoring_yes',
       'onehotencoder__Freq_Alcohol_Always',
       'onehotencoder__Freq_Alcohol_Frequently',
       'onehotencoder__Freq_Alcohol_Sometimes',
       'onehotencoder__Freq_Alcohol_no',
       'onehotencoder__Transportation_Automobile',
       'onehotencoder__Transportation_Bike',
       'onehotencoder__Transportation_Motorbike',
       'onehotencoder__Transportation_Public_Transportation',
       'onehotencoder__Transportation_Walking'],
      dtype='object')
(1669, 31)

🔄 X_test Encoding

# Transforming
transformed = transformer.transform(X_test)
# Transformating back
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# One-hot encoding removed an index. Let's put it back:
transformed_df.index = X_test.index
# Joining tables
X_test = pd.concat([X_test, transformed_df], axis=1)
# Dropping old categorical columns
X_test.drop(cat_cols, axis=1, inplace=True)
# Dropping old num columns
X_test.drop(num_cols, axis=1, inplace=True)
# Checking result
X_test.head()
standardscaler__Age standardscaler__Height standardscaler__Weight standardscaler__Freq_Veg standardscaler__Num_Meals standardscaler__Water_Intake standardscaler__Phys_Activity standardscaler__Tech_Use onehotencoder__Gender_Female onehotencoder__Gender_Male ... onehotencoder__Calorie_Monitoring_yes onehotencoder__Freq_Alcohol_Always onehotencoder__Freq_Alcohol_Frequently onehotencoder__Freq_Alcohol_Sometimes onehotencoder__Freq_Alcohol_no onehotencoder__Transportation_Automobile onehotencoder__Transportation_Bike onehotencoder__Transportation_Motorbike onehotencoder__Transportation_Public_Transportation onehotencoder__Transportation_Walking
1153 -0.690071 -1.202029 -0.537291 1.084280 1.512030 -0.000656 0.376859 0.550747 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
132 0.899810 0.736866 0.848928 1.084280 0.391801 -1.644954 1.182627 -1.095082 0.0 1.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0
1923 -0.587828 0.401573 1.579131 1.084280 0.391801 -0.334381 0.499952 0.495086 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
846 -1.165664 -1.047678 -0.831968 1.008997 -2.224232 -0.000656 -0.324357 1.117030 1.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
1246 0.821665 1.347473 0.839435 -0.801583 0.231677 0.588066 -0.155934 1.895629 0.0 1.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0

5 rows × 31 columns

print(X_train.columns, X_train.shape)
Index(['standardscaler__Age', 'standardscaler__Height',
       'standardscaler__Weight', 'standardscaler__Freq_Veg',
       'standardscaler__Num_Meals', 'standardscaler__Water_Intake',
       'standardscaler__Phys_Activity', 'standardscaler__Tech_Use',
       'onehotencoder__Gender_Female', 'onehotencoder__Gender_Male',
       'onehotencoder__Family_History_no', 'onehotencoder__Family_History_yes',
       'onehotencoder__High_Cal_Foods_Frequently_no',
       'onehotencoder__High_Cal_Foods_Frequently_yes',
       'onehotencoder__Snacking_Always', 'onehotencoder__Snacking_Frequently',
       'onehotencoder__Snacking_Sometimes', 'onehotencoder__Snacking_no',
       'onehotencoder__Smoke_no', 'onehotencoder__Smoke_yes',
       'onehotencoder__Calorie_Monitoring_no',
       'onehotencoder__Calorie_Monitoring_yes',
       'onehotencoder__Freq_Alcohol_Always',
       'onehotencoder__Freq_Alcohol_Frequently',
       'onehotencoder__Freq_Alcohol_Sometimes',
       'onehotencoder__Freq_Alcohol_no',
       'onehotencoder__Transportation_Automobile',
       'onehotencoder__Transportation_Bike',
       'onehotencoder__Transportation_Motorbike',
       'onehotencoder__Transportation_Public_Transportation',
       'onehotencoder__Transportation_Walking'],
      dtype='object') (1669, 31)
# Setting new feature names

X_train.columns = ['Age', 'Height', 'Weight', 'Freq_Veg', 'Num_Meals', 'Water_Intake',
       'Phys_Activity', 'Tech_Use', 'Gender_Female',
       'Gender_Male', 'Family_History_no',
       'Family_History_yes',
       'High_Cal_Foods_Frequently_no',
       'High_Cal_Foods_Frequently_yes',
       'Snacking_Always', 'Snacking_Frequently',
       'Snacking_Sometimes', 'Snacking_no',
       'Smoke_no', 'Smoke_yes',
       'Calorie_Monitoring_no',
       'Calorie_Monitoring_yes',
       'Freq_Alcohol_Always',
       'Freq_Alcohol_Frequently',
       'Freq_Alcohol_Sometimes',
       'Freq_Alcohol_no',
       'Transportation_Automobile',
       'Transportation_Bike',
       'Transportation_Motorbike',
       'Transportation_Public_Transportation',
       'Transportation_Walking']

X_test.columns = ['Age', 'Height', 'Weight', 'Freq_Veg', 'Num_Meals', 'Water_Intake',
       'Phys_Activity', 'Tech_Use', 'Gender_Female',
       'Gender_Male', 'Family_History_no',
       'Family_History_yes',
       'High_Cal_Foods_Frequently_no',
       'High_Cal_Foods_Frequently_yes',
       'Snacking_Always', 'Snacking_Frequently',
       'Snacking_Sometimes', 'Snacking_no',
       'Smoke_no', 'Smoke_yes',
       'Calorie_Monitoring_no',
       'Calorie_Monitoring_yes',
       'Freq_Alcohol_Always',
       'Freq_Alcohol_Frequently',
       'Freq_Alcohol_Sometimes',
       'Freq_Alcohol_no',
       'Transportation_Automobile',
       'Transportation_Bike',
       'Transportation_Motorbike',
       'Transportation_Public_Transportation',
       'Transportation_Walking']
# After renaming the columns

X_train.head()

Age Height Weight Freq_Veg Num_Meals Water_Intake Phys_Activity Tech_Use Gender_Female Gender_Male ... Calorie_Monitoring_yes Freq_Alcohol_Always Freq_Alcohol_Frequently Freq_Alcohol_Sometimes Freq_Alcohol_no Transportation_Automobile Transportation_Bike Transportation_Motorbike Transportation_Public_Transportation Transportation_Walking
1549 0.248290 0.705537 1.045026 -0.375885 0.391801 0.253102 0.308538 -0.452916 0.0 1.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1574 1.008501 -0.608067 0.505277 1.016483 -0.755149 -1.644954 0.823971 -0.073903 0.0 1.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
1155 3.702329 0.457500 -0.078269 0.207948 0.391801 -1.403908 -0.827757 -1.095082 0.0 1.0 ... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0
610 -0.208354 0.416482 -1.265285 -0.345947 0.391801 -0.252848 1.731352 0.245475 1.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
906 -0.523216 -1.012212 -0.725998 -0.801583 0.596256 1.643643 0.204484 -0.952112 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

5 rows × 31 columns

🧬 Apply Label Encoder

🔡 Encoding y_train

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)

#y_train.head(10) , y_train_encoded[0:10]

🔡 Encoding y_test

# Reuse the encoder fitted on y_train so both splits share one label mapping
y_test_encoded = le.transform(y_test)
y_test.head(10) , y_test_encoded[0:10]
(1153    Overweight_Level_II
 132          Obesity_Type_I
 1923       Obesity_Type_III
 846      Overweight_Level_I
 1246         Obesity_Type_I
 1236         Obesity_Type_I
 1372         Obesity_Type_I
 1010    Overweight_Level_II
 337        Obesity_Type_III
 1524        Obesity_Type_II
 Name: Obesity_Level, dtype: object,
 array([6, 2, 4, 5, 2, 2, 2, 6, 4, 3]))
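
The integer codes in the output follow from how `LabelEncoder` works: it sorts the class names and numbers them in that order, so the codes are alphabetical rather than ordered by severity. The sketch below reproduces the mapping from the seven class names alone; `toy_encoder` is an illustrative name.

```python
from sklearn.preprocessing import LabelEncoder

class_names = [
    "Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
    "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III",
]

toy_encoder = LabelEncoder().fit(class_names)

# classes_ is sorted, and each class's position is its integer code
code_of = {name: int(code) for code, name in enumerate(toy_encoder.classes_)}
for name, code in code_of.items():
    print(code, name)

# inverse_transform maps codes back to names, e.g. for labelling plots
decoded = toy_encoder.inverse_transform([6, 2, 4])
```

Because of this ordering, code 2 stands for Obesity_Type_I rather than Overweight_Level_I, which matters when reading any plot that labels classes by number.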

🤖 Classifier Models

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

🧰 Models in Consideration

models={'RandomForest':RandomForestClassifier(),
        'DecisionTree':DecisionTreeClassifier(),
        'KNeighbors':KNeighborsClassifier(),
        'xgbc': XGBClassifier()}

🧪 Scoring for Measuring Model Performance

from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
scoring = {
    'accuracy': 'accuracy',
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted')}
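
One property of this scoring setup is worth knowing: with `average='weighted'`, recall is weighted by each class's support, which makes it algebraically identical to accuracy. The toy labels below are made up purely to illustrate the identity.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for a 3-class problem
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2]

acc = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")
weighted_precision = precision_score(y_true, y_pred, average="weighted")

# Weighted recall = sum_c (n_c / N) * (TP_c / n_c) = sum_c TP_c / N = accuracy
print(acc, weighted_recall, weighted_precision)
```

Matching accuracy and recall values in a results table are therefore expected when both are measured on the same predictions.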

🧾 Model Evaluation

# %load ../models/Models_eval.py

from sklearn.model_selection import GridSearchCV

def grid_search_cv_eval(X, Y, models, param_grid, scorings, cross_validation):
    """Evaluate each model with GridSearchCV and return the best estimators and their scores."""

    best_models = {}
    result = {}
    print(models)
    for model in models:
        print(result)
        print(f"\nRunning GridSearch for {model}...")
        gsv = GridSearchCV(
            estimator=models[model],
            param_grid=param_grid[model],
            cv=cross_validation,
            scoring=scorings,
            refit='accuracy'  # Primary metric for model selection
        )
        gsv.fit(X, Y)
        best_models[model] = gsv.best_estimator_
        best_index = gsv.best_index_
        print(f'Best parameters for {model}: {gsv.best_params_}')
        print(f'Best accuracy: {gsv.cv_results_["mean_test_accuracy"][best_index]:.4f}')
        print(f'Best precision: {gsv.cv_results_["mean_test_precision"][best_index]:.4f}')
        print(f'Best recall: {gsv.cv_results_["mean_test_recall"][best_index]:.4f}')
        result[model] = {"parameter":gsv.best_params_,"accuracy":gsv.cv_results_["mean_test_accuracy"][best_index], "precision": gsv.cv_results_["mean_test_precision"][best_index],"recall": gsv.cv_results_["mean_test_recall"][best_index]}

    return best_models, result


from sklearn.model_selection import cross_val_score

# Define the function to compare models with default parameters
def evaluate_models(X, Y, models, scorings, cross_validation):
    result = {}
    
    # Loop through models and evaluate each one
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        
        # Evaluate using the defined scoring metrics
        model_scores = {}
        for score_name, scorer in scorings.items():
            score = cross_val_score(model, X, Y, cv=cross_validation, scoring=scorer)
            model_scores[score_name] = score.mean()
        
        # Store results
        result[model_name] = model_scores
        
        # Print the results
        print(f"Accuracy: {model_scores['accuracy']:.4f}")
        print(f"Precision: {model_scores['precision']:.4f}")
        print(f"Recall: {model_scores['recall']:.4f}")
        print(f"F1 Score: {model_scores['f1']:.4f}")
    
    return result

result = evaluate_models(X_train, y_train_encoded, models, scoring, 5)
Evaluating RandomForest...
Accuracy: 0.9323
Precision: 0.9389
Recall: 0.9341
F1 Score: 0.9352

Evaluating DecisionTree...
Accuracy: 0.9191
Precision: 0.9230
Recall: 0.9167
F1 Score: 0.9180

Evaluating KNeighbors...
Accuracy: 0.8113
Precision: 0.8105
Recall: 0.8113
F1 Score: 0.7941

Evaluating xgbc...
Accuracy: 0.9629
Precision: 0.9633
Recall: 0.9629
F1 Score: 0.9628
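
`evaluate_models` reruns cross-validation once per metric, so an unseeded model such as RandomForest is refit with different randomness each time. That is a likely reason its accuracy (0.9323) and weighted recall (0.9341) differ above even though the two quantities are mathematically equal. A possible alternative, sketched here on synthetic data, is scikit-learn's `cross_validate`, which scores all metrics on the same fitted folds. The function name `evaluate_models_single_pass` and the toy dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier


def evaluate_models_single_pass(X, y, models, scorings, cross_validation):
    """Cross-validate each model once and average every metric over the folds."""
    result = {}
    for model_name, model in models.items():
        cv_out = cross_validate(model, X, y, cv=cross_validation, scoring=scorings)
        result[model_name] = {
            metric: cv_out[f"test_{metric}"].mean() for metric in scorings
        }
    return result


toy_scoring = {
    "accuracy": "accuracy",
    "precision": make_scorer(precision_score, average="weighted", zero_division=0),
    "recall": make_scorer(recall_score, average="weighted"),
    "f1": make_scorer(f1_score, average="weighted"),
}

# Synthetic 3-class data standing in for X_train / y_train_encoded
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
toy_models = {"DecisionTree": DecisionTreeClassifier(random_state=0)}

toy_result = evaluate_models_single_pass(X_toy, y_toy, toy_models, toy_scoring, 5)
print(toy_result["DecisionTree"])
```

This also fits each model once per fold instead of once per fold and metric, which shortens the run by roughly the number of metrics.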

📊 Models Performance Visuals

result_df = pd.DataFrame(result).T

📈 Accuracy Comparison

# Plot Accuracy
import matplotlib.pyplot as plt 
import seaborn as sns 
plt.figure(figsize=(10, 6))
sns.barplot(x=result_df.index, y='accuracy', data=result_df, palette="Blues_d")
plt.title('Model Comparison: Accuracy', fontsize=16)
plt.ylabel('Accuracy', fontsize=14)
plt.xlabel('Model', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

[Figure: Model Comparison: Accuracy (bar chart)]

🎯 Precision Comparison

# Plot Precision
plt.figure(figsize=(10, 6))
sns.barplot(x=result_df.index, y='precision', data=result_df, palette="Greens_d")
plt.title('Model Comparison: Precision', fontsize=16)
plt.ylabel('Precision', fontsize=14)
plt.xlabel('Model', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

[Figure: Model Comparison: Precision (bar chart)]

🔁 Recall Comparison

# Plot Recall
plt.figure(figsize=(10, 6))
sns.barplot(x=result_df.index, y='recall', data=result_df, palette="Oranges_d")
plt.title('Model Comparison: Recall', fontsize=16)
plt.ylabel('Recall', fontsize=14)
plt.xlabel('Model', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

[Figure: Model Comparison: Recall (bar chart)]

Observation:

✅ The plots above show that the RandomForest and xgbc models give the strongest results. We will tune their hyperparameters next to find the most suitable model.

📋 Creating a Comparison Table for the Models

models = ['RandomForest', 'DecisionTree', 'KNeighbors', 'xgbc']

result_df['f1_score'] = 2 * (result_df['precision'] * result_df['recall']) / (result_df['precision'] + result_df['recall'])

# Display the updated table with F1 score
display(result_df)
              accuracy  precision    recall        f1  f1_score
RandomForest  0.932303   0.938914  0.934096  0.935160  0.936499
DecisionTree  0.919117   0.923018  0.916723  0.917965  0.919860
KNeighbors    0.811273   0.810466  0.811273  0.794125  0.810869
xgbc          0.962854   0.963345  0.962854  0.962758  0.963100
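
The table carries two F1 columns because `f1_score` is the harmonic mean of the averaged precision and recall, whereas `f1` is the weighted F1 averaged over folds. The two are not the same quantity in general, as this small made-up example suggests.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for a 3-class problem
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2]

p = precision_score(y_true, y_pred, average="weighted")
r = recall_score(y_true, y_pred, average="weighted")

harmonic_of_averages = 2 * p * r / (p + r)                  # how the f1_score column is built
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # what the f1 column reports

print(round(harmonic_of_averages, 4), round(weighted_f1, 4))
```

When the two columns disagree, as they do most visibly for KNeighbors, the `f1` column is the one that reflects per-class F1 scores.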

Observation:

✅ The results show that the xgbc model outperforms the others, with the highest accuracy (96.3%), precision (96.3%), recall (96.3%), and F1 score (96.3%). The RandomForest model also performs strongly, with an accuracy of 93.2%, making it a competitive alternative. The DecisionTree model is slightly lower at 91.9%, while the KNeighbors model trails well behind at 81.1%. Based on these findings, we carry RandomForest and XGBoost forward for further comparison to refine our model selection.

⚙️ Hyperparameter Tuning: Random Forest & XGBoost

models={'RandomForest_hyper_tuned':RandomForestClassifier(),
        'xgbc_hyper_tuned': XGBClassifier()}

# Hyperparameter grids for tuning models
param_grids = {
    # Random Forest Hyperparameter Grid
    'RandomForest_hyper_tuned': {
        'n_estimators': [50, 100, 200, 400],               # Number of trees in the forest
        'max_depth': [None, 10, 20, 50],                   # Maximum depth of each tree
        'min_samples_split': [2, 5, 10, 15]                # Minimum number of samples to split a node
    },
    # XGBoostClassifier Hyperparameter Grid
    'xgbc_hyper_tuned': {
        'n_estimators': [50, 100, 200, 400],               # Number of boosting rounds
        'max_depth': [3, 5, 7, 10],                        # Maximum depth of each tree
        'learning_rate': [0.001, 0.01, 0.1, 0.3],          # Step size shrinkage
        'objective': ['multi:softmax'],                    # Multi-class classification
        'verbosity': [0],                                  # Silence output
        'nthread': [-1],                                   # Use all available threads
        'random_state': [42]                               # Ensure reproducibility
    }
}
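
Before launching the search it can help to estimate its cost. `ParameterGrid` from scikit-learn enumerates the same combinations `GridSearchCV` will try, so its length times the number of folds gives the number of cross-validation fits. The grid below copies the Random Forest entry above; `n_folds` mirrors the `cross_validation=5` argument passed to the search.

```python
from sklearn.model_selection import ParameterGrid

rf_grid = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [None, 10, 20, 50],
    "min_samples_split": [2, 5, 10, 15],
}
n_folds = 5

n_candidates = len(ParameterGrid(rf_grid))  # 4 * 4 * 4 combinations
n_fits = n_candidates * n_folds             # one fit per candidate per fold
print(n_candidates, n_fits)
```

The XGBoost grid has the same number of candidates, since its four single-valued entries do not multiply the count.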

best_models_hyper_tuned, result_hyper_tuned = grid_search_cv_eval(X_train, y_train_encoded, models, param_grids, scoring, cross_validation=5)
best_models_hyper_tuned, result_hyper_tuned
{'RandomForest_hyper_tuned': RandomForestClassifier(), 'xgbc_hyper_tuned': XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)}
{}

Running GridSearch for RandomForest_hyper_tuned...
Best parameters for RandomForest_hyper_tuned: {'max_depth': 50, 'min_samples_split': 5, 'n_estimators': 400}
Best accuracy: 0.9371
Best precision: 0.9424
Best recall: 0.9371
{'RandomForest_hyper_tuned': {'parameter': {'max_depth': 50, 'min_samples_split': 5, 'n_estimators': 400}, 'accuracy': 0.9370915826005646, 'precision': 0.9423907933895954, 'recall': 0.9370915826005646}}

Running GridSearch for xgbc_hyper_tuned...
Best parameters for xgbc_hyper_tuned: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 400, 'nthread': -1, 'objective': 'multi:softmax', 'random_state': 42, 'verbosity': 0}
Best accuracy: 0.9664
Best precision: 0.9668
Best recall: 0.9664


({'RandomForest_hyper_tuned': RandomForestClassifier(max_depth=50, min_samples_split=5, n_estimators=400),
  'xgbc_hyper_tuned': XGBClassifier(base_score=None, booster=None, callbacks=None,
                colsample_bylevel=None, colsample_bynode=None,
                colsample_bytree=None, device=None, early_stopping_rounds=None,
                enable_categorical=False, eval_metric=None, feature_types=None,
                gamma=None, grow_policy=None, importance_type=None,
                interaction_constraints=None, learning_rate=0.1, max_bin=None,
                max_cat_threshold=None, max_cat_to_onehot=None,
                max_delta_step=None, max_depth=5, max_leaves=None,
                min_child_weight=None, missing=nan, monotone_constraints=None,
                multi_strategy=None, n_estimators=400, n_jobs=None, nthread=-1,
                num_parallel_tree=None, ...)},
 {'RandomForest_hyper_tuned': {'parameter': {'max_depth': 50,
    'min_samples_split': 5,
    'n_estimators': 400},
   'accuracy': 0.9370915826005646,
   'precision': 0.9423907933895954,
   'recall': 0.9370915826005646},
  'xgbc_hyper_tuned': {'parameter': {'learning_rate': 0.1,
    'max_depth': 5,
    'n_estimators': 400,
    'nthread': -1,
    'objective': 'multi:softmax',
    'random_state': 42,
    'verbosity': 0},
   'accuracy': 0.9664472856089622,
   'precision': 0.9668094661210735,
   'recall': 0.9664472856089622}})

📉 Confusion Matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_pred = best_models_hyper_tuned['xgbc_hyper_tuned'].predict(X_test)

cm = confusion_matrix(y_test_encoded, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

[Figure: Confusion Matrix (heatmap of actual vs. predicted classes)]
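
Raw counts in a confusion matrix can be hard to compare when classes differ in size. A small sketch, using made-up labels rather than the project's predictions, derives per-class recall from the matrix by dividing the diagonal by the row sums.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical encoded labels for a 3-class problem
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 2])

cm = confusion_matrix(y_true, y_pred)

# Rows are actual classes, columns are predicted classes
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(per_class_recall)
```

Applied to the real matrix, the same division would show which obesity levels the model confuses most often.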

🌲 Feature Importance (Random Forest)

from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import plotly.express as px

from matplotlib import pyplot as plt
from sklearn.model_selection import cross_val_score

palette = ['#008080','#FF6347', '#E50000', '#D2691E'] # Creating color palette for plots

randomForest_model = best_models_hyper_tuned['RandomForest_hyper_tuned']
# Refit on the encoded labels, matching what the grid search was trained on
randomForest_model = randomForest_model.fit(X_train, y_train_encoded)

fimp = pd.Series(data=randomForest_model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
plt.figure(figsize=(17,13))
plt.title("Feature importance")
ax = sns.barplot(y=fimp.index, x=fimp.values, palette=palette, orient='h')

[Figure: Feature importance (Random Forest, horizontal bar chart)]

🎯 Classic Feature Attributions (XGBoost)

Here we try out the global feature importance calculations that come with XGBoost.

xgboot_model = best_models_hyper_tuned.get('xgbc_hyper_tuned')
xgboot_model
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=5, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=400, n_jobs=None, nthread=-1,
              num_parallel_tree=None, ...)
import xgboost
xgboost.plot_importance(xgboot_model)
#plt.title("xgboost plot_importance(model)")
plt.show()

[Figure: xgboost.plot_importance(model), default importance type]

xgboost.plot_importance(xgboot_model, importance_type="cover")
plt.title('xgboost.plot_importance(model, importance_type="cover")')
plt.show()

[Figure: xgboost.plot_importance(model, importance_type="cover")]

xgboost.plot_importance(xgboot_model, importance_type="gain")
plt.title('xgboost.plot_importance(model, importance_type="gain")')
plt.show()

[Figure: xgboost.plot_importance(model, importance_type="gain")]

🧠 SHAP Explainability Setup


import shap

# print the JS visualization code to the notebook
shap.initjs()

🧠 SHAP Explainability – TreeExplainer (XGBoost)

explainer = shap.TreeExplainer(xgboot_model)
shap_values = explainer.shap_values(X_train)

📌 SHAP Summary Plot (All Features)

shap.summary_plot(shap_values, X_train, plot_type="bar")

[Figure: SHAP summary plot (bar), all features, XGBoost]

# LabelEncoder assigns codes in alphabetical order of the class names
class_mapping = {
    'Insufficient_Weight': 0,
    'Normal_Weight': 1,
    'Obesity_Type_I': 2,
    'Obesity_Type_II': 3,
    'Obesity_Type_III': 4,
    'Overweight_Level_I': 5,
    'Overweight_Level_II': 6
}

🧠 Feature Importance Insights

Top Features:

  • Weight is the most dominant feature, influencing predictions across all classes substantially.
  • Height and Freq_Veg also show strong predictive power across multiple obesity levels.

Lower Features:

  • Features like Transportation_Walking, Snacking_Always, and Calorie_Monitoring_no contribute very little to predictions and may not add significant predictive power.

Class-Specific Observations:

  • Weight and Height influence multiple classes, especially Class 0 (Insufficient_Weight) and Class 4 (Obesity_Type_III).
  • Some features (e.g., Transportation_Public_Transportation) may have niche relevance to specific classes.
  • Overall, feature contribution patterns are fairly consistent across obesity levels.

🛠️ Feature Engineering (Without Height & Weight)

To assess the true impact of lifestyle and behavioral factors on obesity prediction, we retrain the XGBoost model after removing Weight and Height from the dataset.
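
One likely reason Weight and Height dominate is that body mass index (BMI), the standard basis for weight categories, is computed from exactly these two variables: weight in kilograms divided by the square of height in metres. This is background knowledge rather than something stated in this notebook, and the helper `bmi` and the sample values are illustrative.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical individuals (weight in kg, height in m)
samples = [(50.0, 1.75), (70.0, 1.75), (110.0, 1.75)]
for weight, height in samples:
    print(f"{weight:>6.1f} kg, {height:.2f} m -> BMI {bmi(weight, height):.1f}")
```

With both inputs available, a tree model can approximate this ratio closely, which may leave little for the behavioral features to explain.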

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Drop Weight and Height
dataset_cleaned = clean_data_df.drop(columns=['Weight', 'Height'])

#  Separate features and target variable
X_new = dataset_cleaned.drop(columns=['Obesity_Level'])
y_new = dataset_cleaned['Obesity_Level']

# Encode categorical target variable
le = LabelEncoder()
y_encoded_new = le.fit_transform(y_new)

# Split data into training and testing sets
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_encoded_new, stratify=y_encoded_new, test_size=0.2, random_state=42)

# One-Hot Encode Categorical Features
categorical_features = ['Gender', 'Family_History', 'High_Cal_Foods_Frequently', 'Snacking', 
                        'Smoke', 'Calorie_Monitoring', 'Freq_Alcohol', 'Transportation']
X_train_new = pd.get_dummies(X_train_new, columns=categorical_features, drop_first=True)
X_test_new = pd.get_dummies(X_test_new, columns=categorical_features, drop_first=True)

# Ensure train and test datasets have the same columns
X_test_new = X_test_new.reindex(columns=X_train_new.columns, fill_value=0)

# Apply Min-Max Scaling to Numerical Features
num_cols = ['Age', 'Freq_Veg', 'Num_Meals', 'Water_Intake', 'Phys_Activity', 'Tech_Use']
scaler = MinMaxScaler()

X_train_new[num_cols] = scaler.fit_transform(X_train_new[num_cols])
X_test_new[num_cols] = scaler.transform(X_test_new[num_cols])
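
The `reindex` step matters because `pd.get_dummies` only creates columns for categories that occur in the frame it is given. The toy frames below (made-up values) show a category missing from the test split and how reindexing against the training columns restores it as an all-zero column.

```python
import pandas as pd

# Hypothetical splits: 'Bike' never occurs in the test rows
toy_train = pd.DataFrame({"Transportation": ["Automobile", "Bike", "Walking"]})
toy_test = pd.DataFrame({"Transportation": ["Automobile", "Walking"]})

train_enc = pd.get_dummies(toy_train, columns=["Transportation"], drop_first=True)
test_enc = pd.get_dummies(toy_test, columns=["Transportation"], drop_first=True)
print(list(test_enc.columns))  # the Bike column is absent before alignment

# Align test columns to the training layout; missing dummies become 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

A caveat with `drop_first=True`: the dropped category is the first one present in each frame, so if the test split lacked the training split's first category, a different level would be dropped there. Fitting a `OneHotEncoder` on the training data, as the earlier `transformer` does, avoids that edge case.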

xgb_params = {
    'learning_rate': 0.3,
    'n_estimators': 200,
    'nthread': -1,
    'objective': 'multi:softmax',
    'random_state': 42,
    'verbosity': 0
}

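# Note: xgb_params above is not used; XGBClassifier(**xgb_params) would apply it.
# Instead, the tuned model from the grid search is reused and refit on the reduced features.
xgb_model = xgboot_model
xgb_model.fit(X_train_new, y_train_new)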
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=5, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=400, n_jobs=None, nthread=-1,
              num_parallel_tree=None, ...)
# Predictions and Performance Metrics
y_pred = xgb_model.predict(X_test_new)
accuracy = accuracy_score(y_test_new, y_pred)
precision = precision_score(y_test_new, y_pred, average='weighted')
recall = recall_score(y_test_new, y_pred, average='weighted')
f1 = f1_score(y_test_new, y_pred, average='weighted')

# Display results
xgb_metrics = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1 Score': f1
}
xgb_metrics
{'Accuracy': 0.8397129186602871,
 'Precision': 0.8416882871552138,
 'Recall': 0.8397129186602871,
 'F1 Score': 0.8396322920498374}

Observation:

✅ All performance metrics dropped, as expected, but we now have a more realistic perspective on the lifestyle factors behind obesity levels.

Let’s also try the Random Forest

from sklearn.ensemble import RandomForestClassifier

# Step 1: Train Random Forest Model
# rf_model = RandomForestClassifier(
#     n_estimators=200,
#     max_depth=None,
#     random_state=42,
#     n_jobs=-1  # Utilize all available cores
# )
rf_model = randomForest_model
rf_model.fit(X_train_new, y_train_new)

# Step 2: Predictions and Performance Metrics
y_pred_rf = rf_model.predict(X_test_new)
rf_accuracy = accuracy_score(y_test_new, y_pred_rf)
rf_precision = precision_score(y_test_new, y_pred_rf, average='weighted')
rf_recall = recall_score(y_test_new, y_pred_rf, average='weighted')
rf_f1 = f1_score(y_test_new, y_pred_rf, average='weighted')

# Display Random Forest Metrics and Feature Importance
rf_metrics = {
    'Accuracy': rf_accuracy,
    'Precision': rf_precision,
    'Recall': rf_recall,
    'F1 Score': rf_f1
}

rf_metrics

{'Accuracy': 0.8373205741626795,
 'Precision': 0.8409240886853,
 'Recall': 0.8373205741626795,
 'F1 Score': 0.8368984104506949}

🔍 Comparison of XGBoost and Random Forest with their performance metrics

# Combine the metrics of both models (trained without Weight and Height) into a DataFrame for comparison
comparison_df = pd.DataFrame([rf_metrics, xgb_metrics], index=['Random Forest_EWH', 'XGBoost_EWH'])

# Display the comparison table
comparison_df
                   Accuracy  Precision    Recall  F1 Score
Random Forest_EWH  0.837321   0.840924  0.837321  0.836898
XGBoost_EWH        0.839713   0.841688  0.839713  0.839632

From this comparison, we observe that the two models perform almost identically, with XGBoost marginally ahead on every metric (accuracy 0.8397 vs. 0.8373). Since the difference is negligible, we will proceed with using the Random Forest model’s output for feature selection.
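
To put the gap between the two models in perspective, the accuracies can be converted back into counts of correctly classified test rows. The test-set size of 418 is an assumption inferred from the 1,669 training rows and `test_size=0.2`; it is not printed in this section.

```python
# Accuracies reported in the comparison table above
reported_accuracy = {
    "Random Forest_EWH": 0.8373205741626795,
    "XGBoost_EWH": 0.8397129186602871,
}

n_test = 418  # assumed test-set size (inferred, not shown in the notebook output)

correct = {name: round(acc * n_test) for name, acc in reported_accuracy.items()}
gap = correct["XGBoost_EWH"] - correct["Random Forest_EWH"]
print(correct, gap)
```

Under that assumption the two models differ by a single test row, so the ranking could easily flip with a different random seed.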

📌 Feature Importance with Random Forest, XGBoost, and SHAP

# Compute Random Forest feature importance
rf_feature_importance = rf_model.feature_importances_

# Create a DataFrame for feature importance, sorted for better visualization
rf_feature_importance_df = pd.DataFrame({
    'Feature': X_train_new.columns,
    'Importance': rf_feature_importance
}).sort_values(by='Importance', ascending=False)

# Plot Feature Importance
plt.figure(figsize=(12, 8))
plt.barh(rf_feature_importance_df['Feature'], rf_feature_importance_df['Importance'], color='skyblue')
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()  # Highest importance at the top
plt.tight_layout()
plt.show()

🌲 Random Forest Feature Importance

[Figure: Random Forest Feature Importance (horizontal bar chart)]

from xgboost import plot_importance

# Built-in plot with 'gain'; pass our own axes so the figure size is applied
fig, ax = plt.subplots(figsize=(12, 8))
plot_importance(xgb_model, importance_type='gain', max_num_features=10, ax=ax)
plt.title('XGBoost Feature Importance (Gain)')
plt.show()

# Custom plot with 'gain'
feature_importance = xgb_model.get_booster().get_score(importance_type='gain')
importance_df = pd.DataFrame(list(feature_importance.items()), columns=['Feature', 'Importance']).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(12, 8))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.xlabel('Feature Importance (Gain)')
plt.ylabel('Features')
plt.title('XGBoost Feature Importance (Gain)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

⚡ XGBoost Feature Importance

[Figure: XGBoost Feature Importance (Gain), built-in plot]

[Figure: XGBoost Feature Importance (Gain), custom plot]

# SHAP for Random Forest
explainer_rf = shap.TreeExplainer(rf_model)
shap_values_rf = explainer_rf.shap_values(X_train_new)

# SHAP for XGBoost
explainer_xgb = shap.TreeExplainer(xgb_model)
shap_values_xgb = explainer_xgb.shap_values(X_train_new)

🔍 SHAP Values for Multi-Class Classification

Refer: https://medium.com/biased-algorithms/shap-values-for-multiclass-classification-2a1b93f69c63

Since this is a multi-class classification problem, a single SHAP explanation is not enough to understand the model. Following the Medium post above, we compute SHAP values for each individual class to better understand feature importance.

# Visualize the SHAP Summary Plot for RF

shap.summary_plot(shap_values_rf, X_train_new, plot_type="bar")

[Figure: SHAP summary plot (bar), Random Forest]

import xgboost as xgb

# Create DMatrix
dtrain = xgb.DMatrix(X_train_new, label=y_train_new)
explainer_xgb = shap.TreeExplainer(xgb_model)
shap_values_xgb = explainer_xgb.shap_values(dtrain)

#the summary plot
shap.summary_plot(shap_values_xgb, X_train_new, plot_type="bar")

[Figure: SHAP summary plot (bar), XGBoost]

🧠 SHAP Class-wise Contributions

🎯 Class 0 (Insufficient Weight)

Top contributing features:

  • Age: Significant impact for Class 0 (blue bar is prominent for this feature).
  • Freq_Veg: Positive contribution, showing a noticeable influence on predictions for this class.
  • Water_Intake: Moderate contribution, likely affecting predictions toward Class 0.

🎯 Class 1 (Normal Weight)

Top contributing features:

  • Freq_Veg: Strong influence on predictions for Class 1 (pink bar is prominent).
  • Age: Also contributes notably to predictions for this class.
  • Gender_Male: Adds some effect, as seen by the visible pink bar for Class 1.

🎯 Class 2 (Obesity Type I)

Top contributing features:

  • Age: Strong effect, indicated by a visible green bar for this class.
  • Gender_Male: Noticeable contribution to Class 2 predictions.
  • Freq_Veg: Moderately affects predictions.

🎯 Class 3 (Obesity Type II)

Top contributing features:

  • Age: Significant contribution for Class 3 (orange bar is prominent).
  • Gender_Male: Adds noticeable effect to Class 3 predictions.
  • Freq_Veg: Also contributes moderately.

🎯 Class 4 (Obesity Type III)

Top contributing features:

  • Age: The strongest feature for this class, with a visible purple bar.
  • Water_Intake: Moderate influence for Class 4 predictions.
  • Gender_Male: Somewhat affects predictions for Class 4.

🎯 Class 5 (Overweight Level I)

Top contributing features:

  • Age: Significant effect, shown by a brown bar for this class.
  • Tech_Use: Noticeable contribution to Class 5 predictions.
  • Freq_Veg: Moderate influence for this class.

🎯 Class 6 (Overweight Level II)

Top contributing features:

  • Age: Strongest influence, indicated by a teal bar.
  • Freq_Veg: Noticeable contribution to Class 6 predictions.
  • Water_Intake: Adds a moderate effect.
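
The per-class rankings above were read off the stacked bar plots. They can also be tabulated directly, since each bar segment represents the mean absolute SHAP value of one feature for one class. Depending on the installed shap version, `shap_values` for a multi-class tree model may be a list of per-class arrays or one 3-D array, so the hypothetical helper below accepts both. It is demonstrated on random numbers, not on the project's SHAP values.

```python
import numpy as np
import pandas as pd


def per_class_importance(shap_values, feature_names):
    """Mean |SHAP| per feature (rows) and class (columns)."""
    if isinstance(shap_values, list):
        # list of (n_samples, n_features) arrays, one per class
        stacked = np.stack(shap_values, axis=-1)
    else:
        # single array of shape (n_samples, n_features, n_classes)
        stacked = np.asarray(shap_values)
    mean_abs = np.abs(stacked).mean(axis=0)
    columns = [f"Class {k}" for k in range(mean_abs.shape[1])]
    return pd.DataFrame(mean_abs, index=list(feature_names), columns=columns)


# Random stand-in for SHAP output: 20 samples, 3 features, 7 classes
rng = np.random.default_rng(0)
fake_shap = rng.normal(size=(20, 3, 7))
table = per_class_importance(fake_shap, ["Age", "Freq_Veg", "Water_Intake"])

# Top feature per class, analogous to the longest bar segment of each colour
print(table.idxmax(axis=0))
```

With the real values, `per_class_importance(shap_values_rf, X_train_new.columns)` would give a table whose columns can be sorted individually to list the top features for each class.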

🔧 Feature Selection Based on SHAP

In this section, we select the key features and rerun the model on them for comparison. From the previous SHAP visualizations, we observed that the top contributors are:

  • Age
  • Frequency of Vegetables (Freq_Veg)
  • Gender
  • Water Intake
  • Physical Activity
  • Tech Use
  • Number of Meals (Num_Meals)
  • Family History (Yes)

# Select the specified features
selected_features = ['Age', 'Freq_Veg', 'Gender_Male', 'Water_Intake', 'Phys_Activity', 'Tech_Use', 'Num_Meals', 'Family_History_yes']

X_train_selected = X_train_new[selected_features]
X_test_selected = X_test_new[selected_features]

# Initialize and train the RandomForestClassifier with selected features
# Note: these settings differ from the grid-search optimum found earlier
# (max_depth=50, min_samples_split=5, n_estimators=400)
rf_selected = RandomForestClassifier(n_estimators=200, max_depth=20)
# Use the labels from the same split as X_train_new / X_test_new
rf_selected.fit(X_train_selected, y_train_new)

# Make predictions on the test set
y_pred_selected = rf_selected.predict(X_test_selected)

# Evaluate the model with the same weighted metrics as before
from sklearn.metrics import accuracy_score
accuracy_selected = accuracy_score(y_test_new, y_pred_selected)
precision_selected = precision_score(y_test_new, y_pred_selected, average='weighted')
recall_selected = recall_score(y_test_new, y_pred_selected, average='weighted')
f1_selected = f1_score(y_test_new, y_pred_selected, average='weighted')
print(f"Accuracy with selected features: {accuracy_selected}")

# Display Random Forest Metrics and Feature Importance
rf_metrics_selected = {
    'Accuracy': accuracy_selected,
    'Precision': precision_selected,
    'Recall': recall_selected,
    'F1 Score': f1_selected
}

rf_metrics_selected 
Accuracy with selected features: 0.8014354066985646

{'Accuracy': 0.8014354066985646,
 'Precision': 0.7997618835092666,
 'Recall': 0.8014354066985646,
 'F1 Score': 0.7975809440301639}
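
Recall and Accuracy are identical in the output above. This is expected with average='weighted': each class's recall is weighted by its share of the samples, so the class sizes cancel and the sum reduces to the overall fraction of correct predictions. The sketch below checks this on a small made-up label set; the helper functions and the labels are illustrative and do not come from the notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, weighted by each class's share of the samples."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum((support[c] / n) * (correct[c] / support[c]) for c in support)


# Made-up labels for three classes (illustrative only)
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(acc, rec)
```

Weighted precision and F1 do not simplify this way, which is why they differ from accuracy in the reported metrics.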

📊 Model Performance Comparison

# Combine metrics into a DataFrame for comparison
comparison_models = pd.DataFrame([rf_metrics, rf_metrics_selected], index=['Random Forest_EWH', 'Random Forest Selected'])

# Display the comparison table
comparison_models
                        Accuracy  Precision    Recall  F1 Score
Random Forest_EWH       0.837321   0.840924  0.837321  0.836898
Random Forest Selected  0.801435   0.799762  0.801435  0.797581
rf_metrics_selected_df=pd.DataFrame([ rf_metrics_selected], index=[ 'Random Forest Selected'])

From the table above, the model trained on the 8 selected features reaches 80.1% accuracy, compared with 83.7% for the model trained on all 14 features (excluding Weight and Height), a drop of about 3.6 percentage points. The reduced feature set therefore retains most of the predictive power while simplifying the model.

Big Comparison: Whole Feature Set vs. Excluding Weight and Height vs. Selected Features

#result_comp= result_df.drop(columns=['parameter'])
result_comp = result_df.rename(columns={
    'accuracy': 'Accuracy',
    'precision': 'Precision',
    'recall': 'Recall',
    'f1_score': 'F1 Score'
})
# Combine the three DataFrames row-wise
combined_df = pd.concat([ result_comp, rf_metrics_selected_df, comparison_df], axis=0)

# Print the combined DataFrame
print(combined_df)
                        Accuracy  Precision    Recall        f1  F1 Score
RandomForest            0.932303   0.938914  0.934096  0.935160  0.936499
DecisionTree            0.919117   0.923018  0.916723  0.917965  0.919860
KNeighbors              0.811273   0.810466  0.811273  0.794125  0.810869
xgbc                    0.962854   0.963345  0.962854  0.962758  0.963100
Random Forest Selected  0.801435   0.799762  0.801435       NaN  0.797581
Random Forest_EWH       0.837321   0.840924  0.837321       NaN  0.836898
XGBoost_EWH             0.839713   0.841688  0.839713       NaN  0.839632

# Create the DataFrame from the values printed in combined_df above
data = {
    'Model': [
        'RandomForest', 'DecisionTree', 'KNeighbors', 'xgbc',
        'Random Forest Selected', 'Random Forest_EWH', 'XGBoost_EWH',
    ],
    'Accuracy': [0.932303, 0.919117, 0.811273, 0.962854, 0.801435, 0.837321, 0.839713],
    'Precision': [0.938914, 0.923018, 0.810466, 0.963345, 0.799762, 0.840924, 0.841688],
    'Recall': [0.934096, 0.916723, 0.811273, 0.962854, 0.801435, 0.837321, 0.839713],
    'F1 Score': [0.936499, 0.919860, 0.810869, 0.963100, 0.797581, 0.836898, 0.839632]
}

df = pd.DataFrame(data)

# Rename the models for clarity
df['Model'] = df['Model'].replace({
    'RandomForest': 'Random Forest all features',
    'DecisionTree': 'Decision Tree all features',
    'KNeighbors': 'KNeighbors all features',
    'xgbc': 'XGBoost all features'
})

# Sort by Accuracy in descending order
df = df.sort_values(by='Accuracy', ascending=False)

# Highlight the two rows of interest in light red (#FFCCCB)
def highlight_rows(row):
    if row['Model'] in ('XGBoost all features', 'Random Forest_EWH'):
        return ['background-color: #FFCCCB'] * len(row)
    else:
        return [''] * len(row)

# Apply highlights
styled_table = df.style.apply(highlight_rows, axis=1)

# Display the styled table
styled_table

   Model                       Accuracy  Precision    Recall  F1 Score
3  XGBoost all features        0.962854   0.963345  0.962854  0.963100
0  Random Forest all features  0.932303   0.938914  0.934096  0.936499
1  Decision Tree all features  0.919117   0.923018  0.916723  0.919860
6  XGBoost_EWH                 0.839713   0.841688  0.839713  0.839632
5  Random Forest_EWH           0.837321   0.840924  0.837321  0.836898
2  KNeighbors all features     0.811273   0.810466  0.811273  0.810869
4  Random Forest Selected      0.801435   0.799762  0.801435  0.797581
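
To make the cost of each feature reduction explicit, the accuracy values from the table can be compared directly. The sketch below uses the three Random Forest accuracies reported above; the helper `drop_in_points` is a hypothetical name introduced only for this illustration.

```python
# Accuracy values copied from the comparison table above
accuracy = {
    "Random Forest all features": 0.932303,
    "Random Forest_EWH": 0.837321,
    "Random Forest Selected": 0.801435,
}


def drop_in_points(baseline, reduced):
    """Accuracy lost when moving from baseline to reduced, in percentage points."""
    return round((accuracy[baseline] - accuracy[reduced]) * 100, 1)


# Cost of removing Weight and Height from the full feature set
drop_without_weight_height = drop_in_points(
    "Random Forest all features", "Random Forest_EWH"
)
# Additional cost of keeping only the 8 SHAP-selected features
drop_from_selection = drop_in_points("Random Forest_EWH", "Random Forest Selected")

print(drop_without_weight_height)  # 9.5
print(drop_from_selection)         # 3.6
```

Most of the accuracy loss comes from removing Weight and Height; narrowing the remaining features to 8 costs comparatively little.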

🧾 Conclusion

Based on the results and observations from the above tables and SHAP visualizations, here is the overall analysis and conclusion:

1. XGBoost Performance

  • Best Overall Performance: XGBoost (XGBoost all features) consistently outperformed all other models across all metrics, achieving the highest:
    • Accuracy: 96.3%
    • Precision: 96.3%
    • Recall: 96.3%
    • F1 Score: 96.3%
  • Conclusion: XGBoost is the most robust and reliable model for this dataset when using all features. It should be considered as the primary model for deployment or further analysis.

2. Random Forest Models

2.1. Baseline Random Forest (Random Forest all features)

  • Achieved strong performance, second only to XGBoost, with:
    • Accuracy: 93.2%
    • Precision: 93.9%
    • Recall: 93.4%
    • F1 Score: 93.6%
  • Conclusion: Random Forest all features is a strong alternative to XGBoost, offering a balance between accuracy and simplicity.

2.2. Random Forest EWH (Excluding Weight and Height)

Note: Weight and Height were excluded from this model to reduce potential bias in the dataset before final feature selection.

  • Accuracy fell by about 9.5 percentage points compared to the full-feature model, but the model still performed reasonably well:
    • Accuracy: 83.7%
    • Precision: 84.1%
    • Recall: 83.7%
    • F1 Score: 83.7%
  • Conclusion: Random Forest EWH shows that Weight and Height are critical features. This is expected, since obesity level is highly correlated with both. They were excluded to reveal the real impact of the other features.

3. Decision Tree and K-Nearest Neighbors

3.1. Decision Tree

  • Performed moderately well with:
    • Accuracy: 91.9%
    • Precision: 92.3%
    • Recall: 91.7%
    • F1 Score: 91.9%
  • Conclusion: The Decision Tree is a simpler model with relatively high performance but lags behind Random Forest and XGBoost. It could be a good choice if interpretability is a priority.

3.2. K-Nearest Neighbors

  • Showed the weakest performance among the all-feature models with:
    • Accuracy: 81.1%
    • Precision: 81.0%
    • Recall: 81.1%
    • F1 Score: 81.1%
  • Conclusion: K-Nearest Neighbors is not ideal for this dataset due to its lower performance compared to other models.

4. Feature Importance Analysis

  • Critical Features:
    • SHAP visualizations revealed that Age, Freq_Veg, Gender, Water Intake, and Physical Activity are among the most influential features across models.
  • Conclusion: These findings highlight the importance of these key features for accurate predictions. Models without them may lack robustness.

5. Feature Contributions to Higher Obesity Classes (Class 4, 5, 6)

  • Class 4 (Obesity Type I):
    • Top contributing features: Weight, Age, Water Intake.
    • These features positively influence predictions for this class, with Weight being the most impactful.
  • Class 5 (Obesity Type II):
    • Top contributing features: Weight, Age, and Tech_Use.
    • Weight has the strongest positive contribution, followed by Age.
  • Class 6 (Obesity Type III):
    • Top contributing features: Weight, Age, and Freq_Veg.
    • The strong influence of Weight and Age is consistent across the higher obesity classes.

Conclusion: Weight and Age are the dominant features contributing to the prediction of higher obesity classes. Water Intake and Freq_Veg also play a significant role in differentiating these classes.


6. Feature Subset Analysis

  • Reducing the feature set to the 8 most important features from the SHAP results lowered accuracy by about 3.6 percentage points relative to the Random Forest EWH model (80.1% vs. 83.7%), but the reduction simplifies the model significantly.

    Here is the selected feature list (note that Weight and Height were already dropped to reduce bias in the model before the final selection):

    • Age,
    • Frequency of Vegetables (Freq_Veg),
    • Gender,
    • Water Intake,
    • Physical Activity,
    • Tech Use,
    • Number of Meals (Num_Meals),
    • Family History (Yes).
  • Conclusion: A reduced feature set can be a viable option for faster inference and simpler deployment without substantial loss of accuracy relative to the model that excludes Weight and Height.


Final Recommendation

  1. Best Model: Deploy the XGBoost all features model for the highest accuracy and performance.
  2. Alternative Model: Use Random Forest all features if interpretability and slightly simpler training processes are preferred.
  3. Feature Optimization: Ensure the inclusion of key features such as Weight, Height, Age, and Freq_Veg. Avoid removing these features unless computational or data collection constraints require it.
  4. Future Considerations: Conduct further experiments to fine-tune XGBoost hyperparameters and evaluate its performance on unseen test data or under real-world conditions.
