Exploratory Data Analysis - Vital approach in Data Science

Data science is a field that uses various mathematical measures, processes, and algorithms to extract knowledge and insights from the available data.

Analytics can be defined as Analysis (findings) + Metric (measurement). So we will be performing some kind of measurements on the findings to get meaningful insights. In short, we are detectives and need to find if there is something fishy or not 🕵🏻‍♂. So lets put our detective cap on 🎩

Concepts of data science are used to perform 4 types of analytics.

Descriptive Analytics: This type of analytics answers the question "What happened in the past?"
Diagnostic Analytics: This type of analytics answers the question "Why something has happened in the past?"
Predictive Analytics: This type of analytics predicts "What is most likely to happen in the future?"
Prescriptive Analytics: This type of analytics recommends some business solutions to countermeasure the problem.

In this article, the focus is on "Descriptive Analytics". We will be using Python and a few data visualisation libraries.

Matplotlib (Data Visualization)
Seaborn (Data Visualization)
Pandas (Data structure for analysis)
Jupyter Notebook (Interactive development environment)

Once you have these tools, you can just do wonders. Let's dive into practical example. We will use the Uber Drives data set. This data set contains the trip details of a particular Uber driver for the year 2016.

Open up the terminal and fire up Jupyter notebook

jupyter notebook

Lets import the uber drives data set

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

uber_data = pd.read_csv('Uber Drives 2016.csv')

Head and tail:

Check the first 10 rows from the data set

uber_data.head(10)

Check the last 10 rows from the data set

uber_data.tail(10)

Shape of the data set:

To check the shape of the data set

uber_data.shape

This gives a result of (1156, 7), indicating that the data set has 1156 rows and 7 columns.

I'm not satisfied with only this much information about the data set. Let's dig more on it.

Information of data set:

uber_data.info()

info() gives detailed information of the data frame. We can get to know the data types and if there are any null values present. Check out the START_DATE* column, it says 1156 non-null object. Now check the PURPOSE* column, it says 653 non-null objects. This indicates that not all values are present in our data set. Also, it gives the data type of each column.
If a column has non-numerical values, then it's data type will be object. If the values are numeric, it might be float or int.

Presence of missing values:

One of the most important steps in any type of data analysis project is to check for the presence of missing/null values. Let's check if we have any null values on our data set. It's better if we know a count of such types of values in each column. Pandas have a handy method for that.

uber_data.isna().sum()

Image 1.4: Display null values count in each column

Here we see that the PURPOSE* column has 503 missing values. The end date, category, start and stop each have 1 missing values.

There are various strategies to impute the null values, which is out of scope for this section of the blog. For now, let's just drop all the rows which has null values in them.

uber_data.dropna(inplace=True)

When inplace=True, the rows are deleted and the changes are made on the original uber_data object. If inplace=False, then a new object is returned with the deleted rows and the original object will still have rows with null values.

Five point summary:

The five-number summary / five-point summary provides a concise summary of the distribution of the observations. Reporting the five numbers avoids the need to decide on the most appropriate summary statistic. It gives the information about the location from the median, spread from the quantiles, and the minimum to the maximum range of the observations.

It consists of 5 most important sample percentiles:

1. The sample minimum

2. Lower / First quantile (25%)

3. The median / Second quantile (50%)

4. Upper / Third quantile (75%)

5. The sample maximum

uber_data.describe()

But wait... The summary is given only for MILES* column🤨, why? 🤔
Its because the rest of the columns have data type as object. The describe() method only works for numerical columns.

Unique values and unique counts:

To check the count of unique values in any column, use the .nunique() method on the column. Let's check the count of unique values for STOP* column.

uber_data['STOP*'].nunique()

Try the .unique() method on the START* column to check all unique values in that column.

Group By:

Let's check the most popular starting point for the driver. Pandas have a handy group by method, which works the same as the group by in SQL. Grouping by the start point and taking their count in ascending order should give the popular starting point for the driver.

grouped_data=uber_data.groupby('START*')
                      .count()
                      .sort_values(ascending=False, by='STOP*')['START_DATE*']

pd.DataFrame(grouped_data).rename(columns={'START_DATE*': 'Trip Counts'}).head()

Let us now check the start and stop location that has the most number of trips made by the driver. We have a few of the start and stop location as unknown, let's drop them.

df = uber_data[(uber_data['START*'] != 'Unknown Location') | (uber_data['STOP*'] != 'Unknown Location')]
df.groupby(['START*','STOP*']).size().sort_values(ascending=False).head(5)

We see that the driver has made more trips from Cary to Morrisville

Let's do one last analysis on the start and stop point. We can find the highest miles traveled by the driver between the start and stop pairs.

df.groupby(['START*','STOP*']).sum().sort_values(ascending=False, by='MILES*').head()

Image 1.8: Highest distance made by driver between start and stop pair

We see that from Cary to Durham, the driver has traveled a total of 312 miles followed by Latta to Jacksonville.

Derivative columns:

Let's move on to the START_DATE and END_DATE column. As we checked in the info() method, this has the string value, while we can see that it is representing DateTime. We can use the strptime method to convert it from string to DateTime object. Here we can make use of apply method of pandas by passing a lambda function.

df.loc[:,'START_DATE*'] = df['START_DATE*'].apply(lambda x: pd.datetime.strptime(x, '%m/%d/%Y %H:%M'))

df.loc[:,'END_DATE*'] = df['END_DATE*'].apply(lambda x: pd.datetime.strptime(x, '%m/%d/%Y %H:%M'))

If we check the info() method on the data frame, we see that START_DATE and END_DATE have types as datetime.

Sometimes, a column as a whole will not give that many insights. At such times we can explore if there is a chance to split that column into multiple columns, or if such columns can be combined to make a new column. The newly created columns will be called derivative columns.
Let us create the Time_Diff column by subtracting START_DATE from END_DATE. The result of this column will be HH:MM: SS format. It would be more meaningful if the difference would be in minutes always. We can make use of pandas Timedelta to make this happen.

df['Time_Diff'] = df['END_DATE*'] - df['START_DATE*']

df['Time_Diff] = df['Time_Diff'].apply(lambda x: pd.Timedelta.to_pytimedelta(x).days/(24*60) + pd.Timedelta.to_pytimedelta(x).seconds/60)

df.head()

Image 1.9: Creating a derivative column Time_Diff

Also, now we can split the dataset according to different months. Further, for each of the months, we can check the trip that was made each day. We can use the day method of pandas Series to get the day of the month. For each month lets create a day column which has this value. This will help us to understand the day-wise spread of the data. Let us group the 12 new data sets by the day column.
We will shortly see how all these will be useful.

# Creating dataframes according to months
Jan = df[pd.to_datetime(df['START_DATE*']).dt.month == 1]
Jan.loc['day'] = pd.to_datetime(Jan['START_DATE*']).dt.day

Feb = df[pd.to_datetime(df['START_DATE*']).dt.month == 2]
Feb.loc['day'] = pd.to_datetime(Feb['START_DATE*']).dt.day

Mar = df[pd.to_datetime(df['START_DATE*']).dt.month == 3]
Mar.loc['day'] = pd.to_datetime(Mar['START_DATE*']).dt.day

Apr = df[pd.to_datetime(df['START_DATE*']).dt.month == 4]
Apr.loc['day'] = pd.to_datetime(Apr['START_DATE*']).dt.day

May = df[pd.to_datetime(df['START_DATE*']).dt.month == 5]
May.loc['day'] = pd.to_datetime(May['START_DATE*']).dt.day

Jun = df[pd.to_datetime(df['START_DATE*']).dt.month == 6]
Jun.loc['day'] = pd.to_datetime(Jun['START_DATE*']).dt.day

Jul = df[pd.to_datetime(df['START_DATE*']).dt.month == 7]
Jul.loc['day'] = pd.to_datetime(Jul['START_DATE*']).dt.day

Aug = df[pd.to_datetime(df['START_DATE*']).dt.month == 8]
Aug.loc['day'] = pd.to_datetime(Aug['START_DATE*']).dt.day

Sep = df[pd.to_datetime(df['START_DATE*']).dt.month == 9]
Sep.loc['day'] = pd.to_datetime(Sep['START_DATE*']).dt.day

Oct = df[pd.to_datetime(df['START_DATE*']).dt.month == 10]
Oct.loc['day'] = pd.to_datetime(Oct['START_DATE*']).dt.day

Nov = df[pd.to_datetime(df['START_DATE*']).dt.month == 11]
Nov.loc['day'] = pd.to_datetime(Nov['START_DATE*']).dt.day

Dec = df[pd.to_datetime(df['START_DATE*']).dt.month == 12]
Dec.loc['day'] = pd.to_datetime(Dec['START_DATE*']).dt.day


# Group per day
Jan_Group = Jan.groupby(['day']).sum()
Feb_Group = Feb.groupby(['day']).sum()
Mar_Group = Mar.groupby(['day']).sum()
Apr_Group = Apr.groupby(['day']).sum()
May_Group = May.groupby(['day']).sum()
Jun_Group = Jun.groupby(['day']).sum()
Jul_Group = Jul.groupby(['day']).sum()
Aug_Group = Aug.groupby(['day']).sum()
Sep_Group = Sep.groupby(['day']).sum()
Oct_Group = Oct.groupby(['day']).sum()
Nov_Group = Nov.groupby(['day']).sum()
Dec_Group = Dec.groupby(['day']).sum()

A picture is worth a thousand words. Let's plot what we have done till now and check if we can make sense from all the splitting, grouping and aggregating stuff. We can use seaborn or matplotlib for plotting. For simplicity purpose, I have used the plot method of pandas

plt.figure(figsize=(18,5))
Jan_Group.plot(kind='bar')

*Image 1.10: Average distance and time taken to travel per day*

But hold on 😕, we have only MILES and TIME_DIFF, what happened to other columns? And why the X-axis is having day 🤔

The first one is pretty simple, when we did a groupby() and sum(), it is applicable only for the columns that have some numeric values. Hence the other columns are not present.

As for the second question, when we did a groupby(['day']), the column day will become the index for the data frame. If no explicit x and y values are given to the plot() method, the index will be treated as x axis and multiple plots will be made on the same graph with other columns.

In our case, day has become the index and is present in the x-axis. The other columns present are MILES and TIME_DIFF, hence we see the blue and orange bars.

On observing this plot, we can understand how the miles and time difference are varying on each day for January. We can also make deductions and assumptions

Concatenating data frames and understanding axis:

Our next task will be to create a data frame for the sum of miles covered for each month and plotting it.

We have total miles for each day in the data frame from Jan_Group till Dec_group. We can concatenate it by making the days as the row index, and the column headers will represent the name of the month.

We are replacing null values with 0 because few months have 30 days and few have 31, the extra days for those specific months will be replaced with nil. For example, Jan has 31 days. The 28th, 29th, 30th, and 31st will be nil for Feb

miles_day_frame = pd.concat([Jan_Group['MILES*'], Feb_Group['MILES*'], Mar_Group['MILES*'], Apr_Group['MILES*'], May_Group['MILES*'],
                             Jun_Group['MILES*'], Jul_Group['MILES*'], Aug_Group['MILES*'], Sep_Group['MILES*'], Oct_Group['MILES*'],
                             Nov_Group['MILES*'], Dec_Group['MILES*']], axis=1)
miles_day_frame.columns = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
miles_day_frame.fillna(0, inplace=True)
miles_day_frame.head()

Note here the axis=1. If the axis value is 0, it represents the row and the axis value 1 will represent columns. Since we are concatenating data frames along the column, the axis value is 1.

miles_day_frame.sum().plot(kind='bar')

Let us create another derivative column called range in the data frame df. This will be derived from the MILES* column.

If the miles are less than or equal to 5, the range will be <=5. If the miles are between 6 to 10, the range will be 6-10. If the miles are between 11 to 20, the range will be 11-20. If the miles are between 21 to 35, the range will be 21-35. If miles are greater than 35, then the range will be >35.

def apply_range(col):
    if col['MILES*'] <= 5:
        col['range'] = '<=5'
    elif col['MILES*'] <= 10:
        col['range'] = '6-10'
    elif col['MILES*'] <= 20:
        col['range'] = '11-20'
    elif col['MILES*'] <= 35:
        col['range'] = '21-35'
    else:
        col['range'] = '>35'
    return col

df['range'] = ''
df = df.apply(lambda x: apply_range(x), axis=1)

Note here the axis=1. We are adding a new column, and hence it will be a column operation. So the axis will be 1.

plot_data = df.groupby('range').count().sort_values(by='START_DATE*', ascending="False")
sns.countplot(x=df['range'])

Business conclusions:

With all this analysis, we can answer a few important questions that might help the business:

How many miles were earned per category and purpose?

To answer this question, we have Category* and Purpose* of travel. Let us check the distribution of category w.r.t. the purpose.

category_purpose = df.groupby(['CATEGORY*','PURPOSE*']).sum().reset_index()
category_purpose

Image 1.12: Category and purpose of travel

sns.catplot('CATEGORY*',y='MILES*', hue='PURPOSE*', data=category_purpose, kind='bar', height=8)

Image 1.13: Plot for different categories with hue for purpose

From the graph it is clearly seen that the main contributors for miles are in Business category i.e. Meetings and Customer Visit

2. What is the percentage of business miles vs personal?

For this, we can do something like

category_purpose.groupby('CATEGORY*').sum()/df['MILES*'].sum()

Image 1.14: Miles for different category

Here we see that, 96.94% was obtained by Business category. The rest was obtained by Personal category.

3. How much time was spent on drives per category and purpose?

category_purpose.groupby(['CATEGORY*','PURPOSE*']).sum()['Time_Diff']

Image 1.15: Time spent for purpose per category

In the Business category, the Meeting purpose had the most time for traveling.

By doing this kind of analysis, we get to know the history. We can make some suggestions and important business decisions based on these observations. Hence the exploratory data analysis is the very first and one of the most important steps in any data science project.

Your thoughts, feedback or suggestions are more than welcome and we would be happy to hear from you.

You can connect with me on

Credits:

Photo by Franki Chamaki on Unsplash
Uber Drives Dataset - https://www.kaggle.com/zusmani/uberdrives

Exploratory Data Analysis - Vital approach in Data Science

Automate your Deployment with GitHub Actions

Integrate Third-Party API with Serverless Architecture