Introduction¶

Food and Nutrition¶

Nutrition Image

Food and nutrition are the way that we get fuel, providing energy for our bodies. We need to replace nutrients in our bodies with a new supply every day. Water is an important component of nutrition. Fats, proteins, and carbohydrates are all required.Nutrition is the science that interprets the nutrients and other substances in food in relation to maintenance, growth, reproduction, health and disease of an organism. It includes ingestion, absorption, assimilation, biosynthesis, catabolism and excretion.

Knowing and eating mindfully is not only essential for a healthy gut but also for peace of mind. Also,A diet filled with vegetables, fruits and whole grains could help prevent major conditions such as stroke, diabetes and heart disease.More often than not, we like to gorge on our favourite foods which are not exactly the best for our bodies.While it is okay for such binges to occur occasionally, such diets can be extremely harmful if the person does not strike a balance with healthy foods.

This notebook analyses the most common available foods and the nutritional facts in them.

Data Extraction¶

In [1]:

import urllib.request
url = "https://en.wikipedia.org/wiki/Table_of_food_nutrients"
page = urllib.request.urlopen(url)

In [2]:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page,"lxml")

In [3]:

food=[]
measure=[]
grams=[]
calories=[]
protein=[]
carb=[]
fiber=[]
fat=[]
sat_fat=[]
for row in soup.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==9:
        #print(cells)
        food.append(cells[0].find(text=True))
        #print(cells[0].find(text=True))
        measure.append(cells[1].find(text=True))
        grams.append(cells[2].find(text=True))
        calories.append(cells[3].find(text=True))
        protein.append(cells[4].find(text=True))
        carb.append(cells[5].find(text=True))
        fiber.append(cells[6].find(text=True))
        fat.append(cells[7].find(text=True))
        sat_fat.append(cells[8].find(text=True))

In [5]:

import pandas as pd
df=pd.DataFrame(food,columns=['Food'])
df['Measure']=measure
df['Grams']=grams
df['Calories']=calories
df['Protein']=protein
df['Fat']=fat
df['Sat.Fat']=sat_fat
df['Fiber']=fiber
df['Carbs']=carb

df.head()

Out[5]:

	Food	Measure	Grams	Calories	Protein	Fat	Sat.Fat	Fiber	Carbs
0	Cows' milk	1 qt.	976	660	32	40	36	0	48
1	skim	1 qt.	984	360	36	t	t	0	52
2	Buttermilk	1 cup	246	127	9	5	4	0	13
3	Evaporated, undiluted	1 cup	252	345	16	20	18	0	24
4	Fortified milk	6 cups	1,419	1,373	89	42	23	1.4	119

df.to_excel("output.xlsx")¶

After scraping, manually assigning categories(it is easier that way) for easy accessibility.

The dataset was then uploaded on Kaggle.

Data Cleaning¶

Data cleaning is always the first step in any data science project. Although the data here seems clean, some minor alterations are required.

In [1]:

import pandas as pd
import numpy as np 
import plotly.express as px
import seaborn as sns
import plotly.offline as py
import plotly.graph_objects as go

In [2]:

nutrients=pd.read_csv("/kaggle/input/nutrition-details-for-most-common-foods/nutrients_csvfile.csv")
nutrients.head()

Out[2]:

	Food	Measure	Grams	Calories	Protein	Fat	Sat.Fat	Fiber	Carbs	Category
0	Cows' milk	1 qt.	976	660	32	40	36	0	48	Dairy products
1	Milk skim	1 qt.	984	360	36	t	t	0	52	Dairy products
2	Buttermilk	1 cup	246	127	9	5	4	0	13	Dairy products
3	Evaporated, undiluted	1 cup	252	345	16	20	18	0	24	Dairy products
4	Fortified milk	6 cups	1,419	1,373	89	42	23	1.4	119	Dairy products

First things first, the t's in the data denote miniscule amounts so we might as well replace them by 0.

In [3]:

nutrients=nutrients.replace("t",0)
nutrients=nutrients.replace("t'",0)

nutrients.head()

Out[3]:

	Food	Measure	Grams	Calories	Protein	Fat	Sat.Fat	Fiber	Carbs	Category
0	Cows' milk	1 qt.	976	660	32	40	36	0	48	Dairy products
1	Milk skim	1 qt.	984	360	36	0	0	0	52	Dairy products
2	Buttermilk	1 cup	246	127	9	5	4	0	13	Dairy products
3	Evaporated, undiluted	1 cup	252	345	16	20	18	0	24	Dairy products
4	Fortified milk	6 cups	1,419	1,373	89	42	23	1.4	119	Dairy products

Now, we need to remove all the expressions like commas from the dataset so as to convert the numerical data to the respective integer or float variables.

In [4]:

nutrients=nutrients.replace(",","", regex=True)
nutrients['Fiber']=nutrients['Fiber'].replace("a","", regex=True)
nutrients['Calories'][91]=(8+44)/2

Now, let us convert grams, calories, protein, fat, saturated fat, fiber and carbs datatypes to int.

In [5]:

nutrients['Grams']=pd.to_numeric(nutrients['Grams'])
nutrients['Calories']=pd.to_numeric(nutrients['Calories'])
nutrients['Protein']=pd.to_numeric(nutrients['Protein'])
nutrients['Fat']=pd.to_numeric(nutrients['Fat'])
nutrients['Sat.Fat']=pd.to_numeric(nutrients['Sat.Fat'])
nutrients['Fiber']=pd.to_numeric(nutrients['Fiber'])
nutrients['Carbs']=pd.to_numeric(nutrients['Carbs'])

In [6]:

nutrients.dtypes

Out[6]:

Food         object
Measure      object
Grams         int64
Calories    float64
Protein       int64
Fat           int64
Sat.Fat     float64
Fiber       float64
Carbs       float64
Category     object
dtype: object

Nice, all our data is in desired datatypes.

Quick last checks on data quality¶

In [7]:

print(nutrients.isnull().any())
print('-'*245)
print(nutrients.describe())
print('-'*245)

Food        False
Measure     False
Grams       False
Calories     True
Protein     False
Fat         False
Sat.Fat      True
Fiber        True
Carbs       False
Category    False
dtype: bool
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
             Grams     Calories     Protein         Fat     Sat.Fat  \
count   335.000000   334.000000  335.000000  335.000000  333.000000   
mean    143.211940   188.802395    8.573134    8.540299    6.438438   
std     138.668626   184.453018   17.733722   19.797871   18.517656   
min      11.000000     0.000000   -1.000000    0.000000    0.000000   
25%      60.000000    75.000000    1.000000    0.000000    0.000000   
50%     108.000000   131.000000    3.000000    1.000000    0.000000   
75%     200.000000   250.000000   12.000000   10.000000    8.000000   
max    1419.000000  1373.000000  232.000000  233.000000  234.000000   

            Fiber       Carbs  
count  334.000000  335.000000  
mean     2.376078   24.982388  
std     16.078272   35.833106  
min      0.000000    0.000000  
25%      0.000000    3.000000  
50%      0.200000   14.000000  
75%      1.000000   30.500000  
max    235.000000  236.000000  
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

There's a null value in the fiber column, lets drop that row entirely.

In [8]:

nutrients=nutrients.dropna()
nutrients.shape

Out[8]:

(331, 10)

Data Visualization and Analysis¶

Let's start the analysis by plotting the features with one another. This will not only provide us the distribution of features with one another but also give a quick quantitative feel of the data.

In [9]:

# Plotting the KDEplots

import matplotlib.pyplot as plt


f, axes = plt.subplots(2, 3, figsize=(10, 10), sharex=True, sharey=True)

s = np.linspace(0, 3, 10)
cmap = sns.cubehelix_palette(start=0.0, light=1, as_cmap=True)

sns.kdeplot(nutrients['Carbs'],nutrients['Protein'],cmap=cmap,shade=True, ax=axes[0,0])
axes[0,0].set(xlim=(-10, 50), ylim=(-30, 70), title = 'Carbs and Protein')

cmap = sns.cubehelix_palette(start=0.25, light=1, as_cmap=True)

sns.kdeplot(nutrients['Fat'],nutrients['Carbs'], ax=axes[0,1])
axes[0,1].set(xlim=(-10, 50), ylim=(-30, 70), title = 'Carbs and Fat')

cmap = sns.cubehelix_palette(start=0.33, light=1, as_cmap=True)

sns.kdeplot(nutrients['Carbs'],nutrients['Fiber'], ax=axes[0,2])
axes[0,2].set(xlim=(-10, 50), ylim=(-30, 70), title = 'Carbs and Fat')

cmap = sns.cubehelix_palette(start=0.45, light=1, as_cmap=True)

sns.kdeplot(nutrients['Fiber'],nutrients['Fat'], ax=axes[1,0])
axes[1,0].set(xlim=(-10, 50), ylim=(-30, 70), title = 'Fiber and Fat')

cmap = sns.cubehelix_palette(start=0.56, light=1, as_cmap=True)

sns.kdeplot(nutrients['Fat'],nutrients['Sat.Fat'], ax=axes[1,1])
axes[1,1].set(xlim=(-10, 50), ylim=(-30, 70), title = 'Sat. Fat and Fat')

cmap = sns.cubehelix_palette(start=0.68, light=1, as_cmap=True)

sns.kdeplot(nutrients['Carbs'],nutrients['Calories'], ax=axes[1,2])
axes[1,2].set(xlim=(-10, 100), ylim=(-30, 70), title = 'Calories and Carbs')

f.tight_layout()

Let's dive into individual metrics¶

What is the most protein rich food in the category of vegetables and grains?

In [10]:

alls=['Vegetables A-E',
 'Vegetables F-P',
 'Vegetables R-Z','Breads cereals fastfoodgrains','Seeds and Nuts']

prot= nutrients[nutrients['Category'].isin(alls)]

protein_rich= prot.sort_values(by='Protein', ascending= False)
top_20=protein_rich.head(20)
fig = px.bar(top_20, x='Food', y='Protein', color='Protein', title=' Top 10 protein rich foods')
fig.show()

Therefore, from the category of Grains, Vegetables and Seeds, whole wheat has the most protein content followed by white bread. Soybeans are also in the top 20s. Also, Almonds rank no. 1 in the Seeds category.🌱

Foods to stay away from:¶

What food has the most calories?

In [11]:

cals= nutrients.sort_values(by='Calories', ascending= False)
top_20_cals=cals.head(20)
fig = px.bar(top_20, x='Food', y='Calories' , color='Calories',title=' Top 10 calorie rich foods')
fig.show()

Fortified milk has the most calories, followed by white bread. Also, notice how whole wheat has the most proteins but has almost equal amount of calories. Lard is fat source with most calories and 1/2 cup of ice-creams tops the charts in the dessert category.

Fat Content:¶

Normally, fat sources are often looked down upon. But, a certain amount of fat is required for a healthy gut. Let's look at some fatty foods.

In [12]:

fats= nutrients.sort_values(by='Fat', ascending= False)
top_20_fat=fats.head(20)
fig = px.bar(top_20_fat, x='Food', y='Calories', color='Calories', title=' Fat Content and Calories')
fig.show()

Therefore, Oysters and Butter have the largest combination of calories and fats, followed by lard.

Analysing categories¶

Grouping the data into categories can give us the total count of all metrics and thus we can analyse the categories.

In [13]:

category_dist=nutrients.groupby(['Category']).sum()
category_dist

Out[13]:

	Grams	Calories	Protein	Fat	Sat.Fat	Fiber	Carbs
Category
Breads cereals fastfoodgrains	5253	11921.0	403	207	99.0	115.91	2059.0
Dairy products	7412	8434.0	503	396	322.0	4.40	651.0
Desserts sweets	2958	6608.0	78	163	150.0	20.50	1184.0
DrinksAlcohol Beverages	3284	1112.0	0	0	0.0	0.00	167.0
Fats Oils Shortenings	695	3629.0	234	631	536.0	234.00	239.0
Fish Seafood	1807	2757.0	588	338	252.0	235.00	263.0
Fruits A-F	3844	3328.0	29	20	12.0	33.50	812.0
Fruits G-P	5412	4054.0	28	25	21.0	21.10	1009.0
Fruits R-Z	1973	1228.0	7	1	0.0	17.40	330.0
Jams Jellies	422	1345.0	0	0	0.0	8.00	345.0
Meat Poultry	2724	7529.0	546	520	427.0	0.00	57.3
Seeds and Nuts	682	4089.0	120	368	232.0	18.60	140.0
Soups	2495	1191.0	59	41	43.0	4.00	155.0
Vegetables A-E	3520	1804.0	101	9	6.0	36.30	356.0
Vegetables F-P	1725	711.0	40	2	0.0	16.90	142.0
Vegetables R-Z	3360	2694.0	98	76	44.0	26.20	447.0

In [14]:

category_dist=nutrients.groupby(['Category']).sum()
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=2, cols=3,
    specs=[[{"type": "domain"},{"type": "domain"},{"type": "domain"}],[{"type": "domain"},{"type": "domain"},{"type": "domain"}]])

fig.add_trace(go.Pie(values=category_dist['Calories'].values, title='CALORIES', labels=category_dist.index,marker=dict(colors=['#100b','#f00560'], line=dict(color='#FFFFFF', width=2.5))),
              row=1, col=1)

fig.add_trace(go.Pie(values=category_dist['Fat'].values,title='FAT', labels=category_dist.index,marker=dict(colors=['#100b','#f00560'], line=dict(color='#FFFFFF', width=2.5))),
              row=1, col=2)

fig.add_trace(go.Pie(values=category_dist['Protein'].values,title='PROTEIN', labels=category_dist.index,marker=dict(colors=['#100b','#f00560'], line=dict(color='#FFFFFF', width=2.5))),
              row=1, col=3)

fig.add_trace(go.Pie(values=category_dist['Fiber'].values,title='FIBER', labels=category_dist.index,marker=dict(colors=['#100b','#f00560'], line=dict(color='#FFFFFF', width=2.5))),
              row=2, col=1)

fig.add_trace(go.Pie(values=category_dist['Sat.Fat'].values,title='SAT.FAT', labels=category_dist.index,marker=dict(colors=['#100b','#f00560'], line=dict(color='#FFFFFF', width=2.5))),
              row=2, col=2)

fig.add_trace(go.Pie(values=category_dist['Carbs'].values,title='CARBS', labels=category_dist.index,marker=dict(colors=['#100b','#f00560'], line=dict(color='#FFFFFF', width=2.5))),
              row=2, col=3)
fig.update_layout(title_text="Category wise distribution of all metrics",height=700, width=1000)

fig.show()

Some inferences from the above pie charts:¶

It is clear that breads, grains and cereals have the highest amount of Carbs and Calories.
Largest percentage of protein is in seafood (God bless the vegetarians!)
Surprisingly, same amount of fiber content is present in Fats and Seafood.
Seeds and nuts have about 14% fat content.
Fruits do not have a large percentage in any of the categories except carbs, they have about 10% carbohydrates.
Dairy products (15%) have more saturated fat content than seafood (11.8%).

Analyzing the Drinks, Alcohol, Beverages and Desserts¶

Since it is clear that meat/ seafood have an abundance of protein, let us find the protein rich foods.

In [15]:

drinks= nutrients[nutrients['Category'].isin(['Fish Seafood','Desserts sweets'])]
drinks_top=drinks.sort_values(by='Calories', ascending= False)
drinks_top=drinks_top.head(10)

fig = go.Figure(go.Funnelarea(values=drinks_top['Calories'].values, text=drinks_top['Food'],
                              title = { "text": "Desserts with high calorie percentages"},
               marker = {"colors": ["deepskyblue", "lightsalmon", "tan", "teal", "silver","deepskyblue", "lightsalmon", "tan", "teal", "silver"],
                "line": {"color": ["wheat", "wheat", "blue", "wheat", "wheat","wheat", "wheat", "blue", "wheat", "wheat"]}}))



fig.show()

So, pudding has the most amount of calories followed by chocolate fudge.

In [16]:

drinks_fatty=drinks.sort_values(by='Fat', ascending= False)
drinks_fatty=drinks_fatty.head(10)

fig = go.Figure(go.Funnelarea(values=drinks_fatty['Fat'].values, text=drinks_fatty['Food'],
                              title = { "text": "Desserts with high fat percentage"},
               marker = {"colors": ["blue", "purple", "pink", "teal", "silver","yellow", "lightsalmon", "tan", "teal", "silver"],
                "line": {"color": ["wheat", "wheat", "blue", "wheat", "wheat","wheat", "wheat", "blue", "wheat", "wheat"]}}))
fig.show()

Pies and fudges have the highest percentage of fat as well.

Analyzing meat, poultry , seafood.¶

In [17]:

meat= nutrients[nutrients['Category'].isin(['Fish Seafood','Meat Poultry'])]
meats_top=drinks.sort_values(by='Protein', ascending= False)
meats_top=meats_top.head(10)

fig = go.Figure(go.Pie(values=meats_top['Protein'].values, text=meats_top['Food'],
                              title = { "text": "Desserts with high calorie percentages"},
               marker = {"colors": ["maroon", "salmon", "tan", "gold", "silver","deepskyblue", "lightsalmon", "tan", "teal", "silver"],
                "line": {"color": ["wheat", "wheat", "blue", "wheat", "wheat","wheat", "wheat", "blue", "wheat", "wheat"]}}))
fig.show()

Oysters have a large amount of proteins, after them the flatfish flounders have about 6.59% protein.

Seafood and meat always is known for having good fat content. Let's find out the fattiest of the fishes.🐟¶

In [18]:

top_10_fattest= meat.sort_values(by='Fat', ascending=False)
top_10_fattest=top_10_fattest.head(10)
fig = go.Figure(data=[go.Scatter(
    x=top_10_fattest['Food'], y=top_10_fattest['Fat'],
    mode='markers',
    marker_size=[200,180,160,140,120, 100 ,80 , 60 ,40,20])
])
fig.update_layout(title='Meat/Seafood with high Fat Content')
fig.show()

So, only have high protein as well as high fat percentage. Pork sausages are the second highest followed by Roast beef. Also, no type of fish is present in the top 10 fattiest meats list. So, fishes tend to have less fat, I suppose.

Lastly, let us find the meat with most fiber¶

In [19]:

top_10_fibrous= meat.sort_values(by='Fiber', ascending=False)
top_10_fibrous=top_10_fibrous.head(10)
top_10_fibrous

Out[19]:

	Food	Measure	Grams	Calories	Protein	Fat	Sat.Fat	Fiber	Carbs	Category
82	Oysters	6-8 med.	230	231.0	232	233	234.0	235.0	236.0	Fish Seafood
43	Bacon	2 slices	16	95.0	4	8	7.0	0.0	1.0	Meat Poultry
78	Halibut	3 1/2 oz.	100	182.0	26	8	0.0	0.0	0.0	Fish Seafood
69	Turkey	3 1/2 oz.	100	265.0	27	15	0.0	0.0	0.0	Meat Poultry
70	Veal	3 oz.	85	185.0	23	9	8.0	0.0	0.0	Meat Poultry
71	Roast	3 oz.	85	305.0	13	14	13.0	0.0	0.0	Meat Poultry
72	Clams	3 oz.	85	87.0	12	1	0.0	0.0	2.0	Fish Seafood
73	Cod	3 1/2 oz.	100	170.0	28	5	0.0	0.0	0.0	Fish Seafood
74	Crab meat	3 oz.	85	90.0	14	2	0.0	0.0	1.0	Fish Seafood
75	Fish sticks fried	5	112	200.0	19	10	5.0	0.0	8.0	Fish Seafood

Bacon, Halibut, Turkey and veal top the charts in terms of Fiber content.

Introducing 3D Scatter Plots¶

3D scatter plots are used to plot data points on three axes in the attempt to show the relationship between three variables. Each row in the data table is represented by a marker whose position depends on its values in the columns set on the X, Y, and Z axes. Basically, Plotting some data on the z-axis of a normal x-y scatter plot like the previous figure.

They are interesting and though may not provide much inferences, are visually appealing to look at.

In [20]:

trace1 = go.Scatter3d(
    x=nutrients['Category'].values,
    y=nutrients['Food'].values,
    z=nutrients['Fat'].values,
    text=nutrients['Food'].values,
    mode='markers',
    marker=dict(
        sizemode='diameter',
         sizeref=750,
        color = nutrients['Fat'].values,
        colorscale = 'Portland',
        colorbar = dict(title = 'Total Fat (% Daily Value)'),
        line=dict(color='rgb(255, 255, 255)')
    )
)
data=[trace1]
layout=dict(height=800, width=800, title='3D Scatter Plot of Fatty foods (% Daily Value)')
fig=dict(data=data, layout=layout)
py.iplot(fig, filename='3DBubble')

In [21]:

trace1 = go.Scatter3d(
    x=nutrients['Category'].values,
    y=nutrients['Food'].values,
    z=nutrients['Carbs'].values,
    text=nutrients['Food'].values,
    mode='markers',
    marker=dict(
        sizemode='diameter',
         sizeref=750,
        color = nutrients['Carbs'].values,
        colorscale = 'Portland',
        colorbar = dict(title = 'Total Fat (% Daily Value)'),
        line=dict(color='rgb(255, 255, 255)')
    )
)
data=[trace1]
layout=dict(height=800, width=800, title='3D Scatter Plot of Carbohydrate rich food')
fig=dict(data=data, layout=layout)
py.iplot(fig, filename='3DBubble')

Food group with the most calorie content¶

In [22]:

sns.set_style("whitegrid")
plt.figure(figsize=(22,10))
#plt.figure()

ax = sns.boxenplot(x="Category", y='Calories', data=nutrients, color='#eeeeee', palette="tab10")

# Add transparency to colors
for patch in ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .9))
    
#ax = sns.stripplot(x='Category', y='Cholesterol (% Daily Value)', data=menu, color="orange", jitter=0.5, size=5,alpha=0.15)
#
plt.title("Total Calorie Content \n", loc="center",size=32,color='#be0c0c',alpha=0.6)
plt.xlabel('Category',color='#34495E',fontsize=20) 
plt.ylabel('Total Fat (% Daily Value)',color='#34495E',fontsize=20)
plt.xticks(size=16,color='#008abc',rotation=90, wrap=True)  
plt.yticks(size=15,color='#006600')
#plt.text(2.5, 1, 'Courtesy: https://seaborn.pydata.org/examples/grouped_boxplot.html', fontsize=13,alpha=0.2)
#plt.ylim(0,200)
#plt.legend(loc="upper right",fontsize=14,ncol=5,title='Category',title_fontsize=22,framealpha=0.99)
plt.show()

Food Nutrition Extraction,Analysis + EDA