A picture is worth a thousand words, and a good data visualization can reveal a thousand insights or more. Visualizing the data helps us understand how the data can be useful to us, and data visualizations in Python are simple to create and easy to interpret.
In machine learning and data science, I find data visualization even more important than writing the code and predicting the results, because it tells us so much about the data.
A machine learning engineer or a data scientist can follow the code, but a manager will often understand the work only through its visualizations.
There are many kinds of plots for data visualization in Python, so let's learn almost all of them. I promise to give you each plot along with the Python code for it.
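In this data visualization in Python tutorial, you will learn the bar plot, count plot, scatter plot, line plot, violin plot, box plot, pair plot, distplot, histogram, heatmap, and subplots.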
You need Jupyter Notebook, the Spyder IDE, or any other environment for visualization in Python. My favorite is Jupyter Notebook, download here.
Seaborn and Matplotlib are the libraries used for data visualization in this tutorial. If you install the Anaconda distribution, they come preinstalled. OK, let's rock the data with visualizations.
Here I explore all the plots with the Titanic dataset, download here.
1. Bar plot in python
A bar plot helps us explore two or three features of our data at once. It needs at least one numerical feature. Let's plot one with the Titanic data.
import matplotlib.pyplot as plt
import seaborn as sns
# df is the Titanic dataset, already loaded as a pandas DataFrame
plt.figure(figsize=(12,7))
sns.barplot(x='Sex',y='Survived',data=df,hue='Pclass')
plt.savefig('barplot.png')
plt.show()
The output is shown below.
x – feature on the x-axis, y – feature on the y-axis, hue – feature used to color the bars, data – the dataset.
plt.figure(figsize=(width,height)) : sets the figure size in inches.
plt.savefig('barplot.png') : saves the figure or plot as an image.
The colors represent Pclass (the passenger class: 1, 2, or 3). The height of each bar is the mean of Survived, that is, the fraction of passengers who survived the Titanic disaster, for each sex and passenger class. The thin line on top of each bar is the error bar that seaborn draws by default (a confidence interval).
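To make the bar heights concrete, here is a minimal sketch of the number each bar stands for: the mean of Survived within each (Sex, Pclass) group. It uses only the standard library, and the rows are made up for illustration (they are not real Titanic records); the function name survival_rate_by_group is my own.

```python
from collections import defaultdict

# Hypothetical rows for illustration only (not real Titanic records):
# each tuple is (Sex, Pclass, Survived)
rows = [
    ("female", 1, 1), ("female", 1, 1), ("female", 3, 1), ("female", 3, 0),
    ("male", 1, 1), ("male", 1, 0), ("male", 3, 0), ("male", 3, 0),
]

def survival_rate_by_group(rows):
    """Return {(sex, pclass): mean of Survived}, i.e. one bar height per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of Survived, row count]
    for sex, pclass, survived in rows:
        totals[(sex, pclass)][0] += survived
        totals[(sex, pclass)][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}

rates = survival_rate_by_group(rows)
for group in sorted(rates):
    print(group, rates[group])
```

Because Survived is 0 or 1, its mean is the survival rate, which is why the y-axis of the bar plot runs from 0 to 1.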
2. Count plot in python
One of my favorite plots is the count plot. It is very easy to read: it counts the rows in each category and draws the counts as bars, like below. You can also make a count plot with a single feature.
plt.figure(figsize=(12,7))
sns.countplot(x='Survived',data=df,hue='Sex')
plt.savefig('countplot.png')
plt.show()
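The counting itself is simple. As a rough sketch with the standard library, the bars of a count plot with hue correspond to how often each (Survived, Sex) pair occurs. The pairs below are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical (Survived, Sex) pairs for illustration only
pairs = [
    (0, "male"), (0, "male"), (0, "female"),
    (1, "female"), (1, "female"), (1, "male"),
]

# One count per distinct pair: this is the height of each bar
counts = Counter(pairs)

for (survived, sex), n in sorted(counts.items()):
    print(f"Survived={survived}, Sex={sex}: {n}")
```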
3. Scatter plot in python
A scatter plot shows how the data points are distributed. It is also useful for identifying outliers in our data.
import numpy as np
print(df.shape)  # this dataset has 891 rows
x = np.arange(len(df))  # row index, used as the x-axis
plt.figure(figsize=(12,7))
plt.title('Age vs Survived')  # plot title
sns.scatterplot(x=x,y='Age',hue='Survived',data=df)
plt.savefig('scatter.png')
plt.show()
Each point is one passenger, plotted by row index on the x-axis and Age on the y-axis. The colors represent the Survived class.
4. Line plot in python
A line plot connects the values of a feature in order, so it is handy for showing a single feature.
plt.figure(figsize=(12,7))
plt.title('Fare plot')
plt.plot(sorted(df['Fare']),color='g')  # g - green
plt.xlabel('Index')
plt.ylabel('Fare')
plt.savefig('lineplot.png')
plt.show()
Here I sorted all the fare values to get a cleaner picture. The line rises from the lowest fare to the highest fare in the plot.
5. Violin plot in python
Violin plots and box plots are some of the most beautiful plots in seaborn. They can be used to understand the quartiles of the data distribution, and the box plot also marks the outliers in the data.
plt.figure(figsize=(12,7))
plt.title('Age vs Survived vs Sex')
sns.violinplot(x='Sex',y='Age',data=df,hue='Survived')
plt.savefig('violinplot.png')
plt.show()
The violin plot includes a small box plot inside it. The white dot represents the median of the data. Here we are plotting 'Age', 'Sex', and 'Survived', and we get two groups of violins based on the Sex feature. To understand it better, look at the box plot below.
6. Box plot in python
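The box plot is the most useful plot for understanding the quartiles of a feature. The box spans the first quartile (Q1) to the third quartile (Q3), the line inside the box is the median, the whiskers reach to the furthest points within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn as outliers.
plt.figure(figsize=(12,7))
plt.title('Age vs Survived vs Sex')
sns.boxplot(x='Sex',y='Age',hue='Survived',data=df)
plt.savefig('boxplot.png')
plt.show()
It shows the distribution of Age for each sex, split by whether the passengers survived, in the Titanic dataset.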
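As a rough sketch of what the box and whiskers encode, the snippet below computes the quartiles and the 1.5 x IQR fences with the standard library. The ages are made up for illustration, the function name box_plot_summary is my own, and I am assuming the "inclusive" quantile method is close to what the plotting libraries use, so exact values may differ slightly from a real plot.

```python
import statistics

# Hypothetical ages for illustration only (not real Titanic values)
ages = [2, 19, 22, 24, 26, 28, 30, 35, 38, 71]

def box_plot_summary(values, whisker=1.5):
    """Return quartiles, IQR fences, and the values outside the fences."""
    # Three cut points that split the sorted data into four parts
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the height of the box
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = [v for v in sorted(values) if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "low_fence": low_fence, "high_fence": high_fence,
        "outliers": outliers,
    }

summary = box_plot_summary(ages)
for name, value in summary.items():
    print(name, value)
```

With these sample ages, the very young and very old passengers fall outside the fences, which is how they would show up as separate points beyond the whiskers.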
7. Pair plot in python
A pair plot shows the relationship between every pair of features. It makes it easy to see which features are more useful for predicting class labels or recommending things.
Caution: don't plot all the features, because it takes a long time to draw all the plots. If you have 8 features in your dataset, you get 8 x 8 = 64 plots. A pair plot only uses numerical features.
Here I dropped the 'PassengerId' and 'Parch' features and drew a pair plot.
g = sns.pairplot(df.drop(['PassengerId','Parch'],axis=1), hue="Survived", palette="husl")
# pairplot builds its own figure, so the title is set on the whole figure
g.fig.suptitle('Age vs Survived vs Fare vs Pclass', y=1.02)
plt.savefig('pairplot.png')
plt.show()
To read the pair plot, look at each panel by its row and column. The rows and columns whose panels separate the Survived classes most clearly correspond to the features that are most important for predicting the class label.
8. Distplot in python
Distplot is used to understand the distribution of a feature in a dataset. It draws a histogram together with a smooth density curve.
plt.figure(figsize=(12,7))
# distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is its replacement
sns.distplot(df['Age'].dropna(),color='g')
plt.title('Age Distribution')
plt.savefig('distplot.png')
plt.show()
Here I plotted the Age feature. The plot shows how the Age values are distributed: we can read off the approximate minimum and maximum ages and see where most of the values are concentrated.
9. Histplot in python
Histplot is simply a histogram. It groups the data into bins (each bin is a range of values) and counts the values in each bin.
plt.figure(figsize=(12,7))
plt.hist(df['Fare'],color='m')
plt.title('Fare Distribution')
plt.savefig('histplot.png')
plt.show()
The histogram is easy to interpret. Here most fares fall between 0 and 50 dollars, so that is where the peak is. Where there are fewer data points, the bars are shorter.
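To show what binning means, here is a minimal sketch that splits a range into equal-width bins and counts the values in each one, using only the standard library. The fares are invented for illustration, the function name histogram_counts is my own, and I use 4 bins only to keep the output short.

```python
# Hypothetical fares for illustration only (not real Titanic values)
fares = [5, 7, 8, 8, 10, 13, 21, 26, 30, 52, 71, 100]

def histogram_counts(values, bins):
    """Count values in equal-width bins between min(values) and max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        i = int((v - low) / width)
        # The maximum value lands exactly on the last edge, so it is
        # placed in the last bin instead of a bin that does not exist
        i = min(i, bins - 1)
        counts[i] += 1
    edges = [low + k * width for k in range(bins + 1)]
    return counts, edges

counts, edges = histogram_counts(fares, bins=4)
for k, n in enumerate(counts):
    print(f"{edges[k]:.2f} to {edges[k + 1]:.2f}: {n}")
```

Most of the sample fares land in the first bin, which is the same kind of peak at low fares that the histogram above shows.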
10. Heatmap in python
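The heatmap is one of the most important plots for finding the features that matter most for predicting the class label. I love to plot the correlation between features using seaborn.
# correlation is only defined for numeric columns, so select those first
corr = df.drop('PassengerId',axis=1).select_dtypes('number').corr()
plt.figure(figsize=(12,7))
sns.heatmap(corr,annot=True)
plt.title('Correlation between features and Class label')
plt.savefig('heatmap.png')
plt.show()
The correlation ranges from -1 to 1. Values near 1 mean a strong positive linear relationship, values near -1 mean a strong negative one, and values near 0 mean little linear relationship. In the above heatmap, the diagonal is 1 because every feature is perfectly correlated with itself. A negatively correlated feature is not useless: what matters is how far the value is from 0. Features whose correlation with the class label is close to 0, whether positive or negative, are the weaker candidates for predicting it.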
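Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of the DataFrame corr() call. As a small sketch of that formula with the standard library, the snippet below computes it for two made-up columns (the values are for illustration only, not real Titanic data, and the function name pearson is my own).

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns for illustration only
pclass = [1, 1, 2, 3, 3]
survived = [1, 1, 1, 0, 0]

r_self = pearson(pclass, pclass)      # a feature with itself: the diagonal
r_label = pearson(pclass, survived)   # a feature with the class label

print(round(r_self, 3), round(r_label, 3))
```

In this toy data the correlation with the label is strongly negative, which illustrates that a negative value can still carry a lot of information about the class.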
11. Subplots in python
Subplots are used to display multiple plots in one figure using features of the dataset.
index = np.arange(len(df))  # row index, used as x for the scatter plot
fig = plt.figure(figsize=(16,8))

plt.subplot(2, 2, 1)
plt.hist(df['Age'].dropna(),color='m')  # drop missing ages before binning
plt.title('Age Histogram')
plt.xlabel('Age')

plt.subplot(2, 2, 2)
plt.plot(sorted(df['Fare']),color='g')
plt.title('Fare Plot')
plt.xlabel('index')
plt.ylabel('Fare')

plt.subplot(2, 2, 3)
sns.countplot(x='Survived',data=df,hue='Sex')
plt.title('Survived Vs Sex')

plt.subplot(2, 2, 4)
plt.title('Age vs Survived')
sns.scatterplot(x=index,y='Age',hue='Survived',data=df)
plt.xlabel('index')
plt.ylabel('Age')

plt.show()
In the above figure, we have 4 subplots. We can draw multiple plots in one figure using plt.subplot().
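The third argument of plt.subplot(rows, columns, index) is a 1-based position that fills the grid row by row. As a small sketch of that numbering (the helper subplot_position is my own and is not part of Matplotlib), the mapping from index to grid cell looks like this:

```python
def subplot_position(nrows, ncols, index):
    """Map a 1-based subplot index to a 0-based (row, col), filling row by row."""
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must be between 1 and nrows * ncols")
    # divmod gives (row, col) for the zero-based position
    return divmod(index - 1, ncols)

# The 2 x 2 grid used above: 1 and 2 are the top row, 3 and 4 the bottom row
for k in range(1, 5):
    print(k, subplot_position(2, 2, k))
```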
I hope you love this data visualization in Python tutorial. Please let us know what you think in the comments, and do subscribe to our newsletter.