GoBike System Data

By: Phuong Tran 06/16/2020

Introduction

Investigation Overview

The goal of this project is to explore the bike trip data and understand which factors affect trip duration.

Dataset Overview

The dataset is Bay Wheels’ trip data, which is available for public use. It contains trip information and riders’ subscription status. The data was taken from Lyft’s website and covers the year 2017.

## Import all
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

What is the structure of your dataset?

There are 13 variables (columns) and 519,700 observations (rows) in this dataset. There are two user types: Subscriber and Customer.

What is/are the main feature(s) of interest in your dataset?

Information about the start and end stations (station ID, station name, longitude and latitude), bike information (bike ID), and riders’ trip information such as trip duration in seconds and user type.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

In my opinion, start time and user type are the factors that affect trip duration the most.

## Load data set in and inspect data
df = pd.read_csv('tripdata.csv')

df.head()
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type
0 80110 2017-12-31 16:57:39.6540 2018-01-01 15:12:50.2450 74 Laguna St at Hayes St 37.776435 -122.426244 43 San Francisco Public Library (Grove St at Hyde... 37.778768 -122.415929 96 Customer
1 78800 2017-12-31 15:56:34.8420 2018-01-01 13:49:55.6170 284 Yerba Buena Center for the Arts (Howard St at ... 37.784872 -122.400876 96 Dolores St at 15th St 37.766210 -122.426614 88 Customer
2 45768 2017-12-31 22:45:48.4110 2018-01-01 11:28:36.8830 245 Downtown Berkeley BART 37.870348 -122.267764 245 Downtown Berkeley BART 37.870348 -122.267764 1094 Customer
3 62172 2017-12-31 17:31:10.6360 2018-01-01 10:47:23.5310 60 8th St at Ringold St 37.774520 -122.409449 5 Powell St BART Station (Market St at 5th St) 37.783899 -122.408445 2831 Customer
4 43603 2017-12-31 14:23:14.0010 2018-01-01 02:29:57.5710 239 Bancroft Way at Telegraph Ave 37.868813 -122.258764 247 Fulton St at Bancroft Way 37.867789 -122.265896 3167 Subscriber

Data Wrangling

Assess:

  1. Convert duration from seconds to minutes and rename the column to duration_min.
  2. Convert start time and end time to pandas datetime format for easier access to the day or month.
  3. There are no duplicated rows.
  4. There are no invalid values (a duration of 0 or less would not make sense).
df.shape
(519700, 13)
df.isnull().sum()
## There are no NaNs
duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 519700 entries, 0 to 519699
Data columns (total 13 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             519700 non-null  int64  
 1   start_time               519700 non-null  object 
 2   end_time                 519700 non-null  object 
 3   start_station_id         519700 non-null  int64  
 4   start_station_name       519700 non-null  object 
 5   start_station_latitude   519700 non-null  float64
 6   start_station_longitude  519700 non-null  float64
 7   end_station_id           519700 non-null  int64  
 8   end_station_name         519700 non-null  object 
 9   end_station_latitude     519700 non-null  float64
 10  end_station_longitude    519700 non-null  float64
 11  bike_id                  519700 non-null  int64  
 12  user_type                519700 non-null  object 
dtypes: float64(4), int64(4), object(5)
memory usage: 51.5+ MB
## Nothing less than or equal to 0 so it's good
df[df['duration_sec']<=0]
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type
## Making sure there are only 2 values in user type
df['user_type'].value_counts()
Subscriber    409230
Customer      110470
Name: user_type, dtype: int64
df.duplicated().sum()
0
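The checks above (nulls, duplicates, non-positive durations) could be bundled into one small helper. This is only a sketch on a tiny made-up table; the function name `basic_quality_report` and the sample values are illustrative and not part of the original notebook.

```python
import pandas as pd

def basic_quality_report(frame, duration_col="duration_sec"):
    """Count common data-quality problems in a trip table (hypothetical helper)."""
    return {
        "null_cells": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "non_positive_durations": int((frame[duration_col] <= 0).sum()),
    }

# Made-up sample with one problem of each kind, not the real data
toy = pd.DataFrame({
    "duration_sec": [600, 600, -5, 1200],
    "user_type": ["Subscriber", "Subscriber", "Customer", None],
})

report = basic_quality_report(toy)
print(report)
```

On the real data every count is expected to be zero, matching the outputs shown above.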
# Convert start time and end time into pandas datetime format
# (infer_datetime_format is deprecated since pandas 2.0 and no longer needed)
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
## Adding more start columns
df['start_hour'] = df['start_time'].dt.hour
df['start_month'] = df['start_time'].dt.month
df['start_day'] = df['start_time'].dt.weekday

def to_stringday(day):
    if day==0:
        return 'Mon'
    elif day==1:
        return 'Tues'
    elif day==2:
        return 'Wed'
    elif day==3:
        return 'Thur'
    elif day==4:
        return 'Fri'
    elif day==5:
        return 'Sat'
    else:
        return 'Sun'
    
df['start_day'] = df['start_day'].apply(to_stringday)
## Start and end time are no longer needed
df.drop(columns=['start_time', 'end_time'], inplace=True)
df.head()
duration_sec start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type start_hour start_month start_day
0 80110 74 Laguna St at Hayes St 37.776435 -122.426244 43 San Francisco Public Library (Grove St at Hyde... 37.778768 -122.415929 96 Customer 16 12 Sun
1 78800 284 Yerba Buena Center for the Arts (Howard St at ... 37.784872 -122.400876 96 Dolores St at 15th St 37.766210 -122.426614 88 Customer 15 12 Sun
2 45768 245 Downtown Berkeley BART 37.870348 -122.267764 245 Downtown Berkeley BART 37.870348 -122.267764 1094 Customer 22 12 Sun
3 62172 60 8th St at Ringold St 37.774520 -122.409449 5 Powell St BART Station (Market St at 5th St) 37.783899 -122.408445 2831 Customer 17 12 Sun
4 43603 239 Bancroft Way at Telegraph Ave 37.868813 -122.258764 247 Fulton St at Bancroft Way 37.867789 -122.265896 3167 Subscriber 14 12 Sun
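Step 1 of the assessment list (seconds to minutes) is not applied in the cells shown here, and the later plots keep using duration_sec. A minimal sketch of that conversion, on a toy frame built from the first three durations shown in the head() output, might look like this. Both columns are kept so that code relying on duration_sec would still run.

```python
import pandas as pd

# First three durations from the head() output above, used as a toy sample
toy = pd.DataFrame({"duration_sec": [80110, 78800, 45768]})

# 60 seconds per minute; the result is a float column
toy["duration_min"] = toy["duration_sec"] / 60

print(toy.round(1))
```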

Univariate

1. Duration

Most trips last around 700 seconds (~12 minutes).

# 1. duration
# Log-spaced bin edges from 5**2 = 25 s to 5**5 = 3125 s
bins = 5 ** np.arange(2, 5.0 + 0.1, 0.1)
plt.hist(df['duration_sec'], bins=bins, color='pink')
plt.xlim(0, 3200)
plt.xlabel('Duration (seconds)')
plt.title('Trip duration histogram', fontweight='bold', fontsize=16)
plt.ylabel('Number of trips')
plt.show()

[Figure: Trip duration histogram]
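The bin formula used for the histogram produces edges that grow by a constant factor rather than a constant width. The sketch below reproduces those edges with NumPy and counts a handful of made-up durations into them; the sample values are illustrative only.

```python
import numpy as np

# Same formula as the histogram cell: each edge is 5**0.1 times the previous one
edges = 5 ** np.arange(2, 5.0 + 0.1, 0.1)
ratios = edges[1:] / edges[:-1]

# Made-up durations in seconds, not the real data
durations = np.array([120, 300, 650, 700, 720, 900, 1500, 3000])
counts, _ = np.histogram(durations, bins=edges)

print(f"first edge: {edges[0]:.0f} s, last edge: {edges[-1]:.0f} s")
print(f"constant ratio between edges: {ratios[0]:.3f}")
print(f"trips counted: {counts.sum()}")
```

Because the plot keeps a linear x-axis, these bins look wider toward the right. Calling `plt.xscale('log')` would be one way to make them appear equally wide.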

Month

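June has the fewest bike rentals and October has the most. As the weather gets cooler or hotter, people seem less willing to ride a bike, especially in June (the start of summer). The dataset contains no trips from January to May, so June’s low count may also partly reflect incomplete coverage.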

# 2. month
# https://seaborn.pydata.org/generated/seaborn.countplot.html
# Pass the column by keyword: recent seaborn versions require x= to be named
sns.countplot(x='start_month', data=df)
plt.title("Number of trips per month in 2017", fontweight='bold', fontsize=16)
plt.ylabel('Number of bike trips')
plt.xlabel('Month')
plt.show()

[Figure: Number of trips per month in 2017]

Day of the week

Unexpectedly, there are more trips on weekdays than on weekends. Maybe people in the observed area mostly ride their bikes to work?

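# 3. weekday
# Fix the bar order explicitly so days run Monday to Sunday
day_order = ['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun']
sns.countplot(x='start_day', data=df, order=day_order)
plt.title("Number of trips per day of the week in 2017", fontweight='bold', fontsize=16)
plt.ylabel('Number of bike trips')
plt.xlabel('Day')
plt.show()

[Figure: Number of trips per day of the week in 2017]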
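Because start_day holds plain strings, group-bys and sorts on it do not follow the Monday to Sunday order on their own. One possible alternative to the if/elif function is to index a list of labels with the weekday number and store the result as an ordered categorical. The sketch below uses three made-up timestamps; `DAY_ORDER` is a name introduced here for illustration.

```python
import pandas as pd

DAY_ORDER = ["Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"]

# Made-up timestamps: 2017-12-31 was a Sunday, 2018-01-01 a Monday
times = pd.to_datetime(pd.Series([
    "2017-12-31 16:57:39",
    "2018-01-01 08:00:00",
    "2018-01-03 09:30:00",
]))

# dt.weekday is 0 for Monday through 6 for Sunday, so it indexes DAY_ORDER directly
labels = times.dt.weekday.map(lambda d: DAY_ORDER[d])
start_day = pd.Categorical(labels, categories=DAY_ORDER, ordered=True)

# Sorting by index follows the category order, and unused days show up with 0
counts = pd.Series(start_day).value_counts().sort_index()
print(counts)
```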

Hour distribution

The boxplot shows no outliers, so there are no extreme values. If there were any, we would have to trim them. Start hours are centered around 2 pm. Who takes a bike trip at midnight?

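## 4. Hour distribution
# Pass the column by keyword: recent seaborn versions require x= to be named
sns.boxplot(x=df['start_hour'], color='blue')
plt.title('Start hour distribution', fontweight='bold', fontsize=16)
plt.xlabel('Hour')
plt.show()

# No outliers, which is good

[Figure: Start hour distribution]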
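A boxplot flags outliers with the 1.5 × IQR rule by default, so "no outliers" means no start hour falls outside those fences. A minimal sketch of that rule follows; the helper name `tukey_fences` and the sample hours are made up for illustration.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return the (low, high) limits beyond which a boxplot marks outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up start hours, not the real data
hours = np.array([0, 7, 8, 8, 9, 12, 14, 16, 17, 17, 18, 23])

low, high = tukey_fences(hours)
outliers = hours[(hours < low) | (hours > high)]

print(f"fences: {low} to {high}, outliers found: {outliers.size}")
```

Since hours are bounded between 0 and 23, values can only fall outside the fences when the middle half of the data is concentrated in a narrow band.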

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The trip duration distribution is right-skewed, with most trips lasting around 10 to 13 minutes. There are also some trips of about 3,000 seconds, nearly one hour.

As for the month, June has the fewest bike trips. In my opinion, June is the start of summer and people tend to go out less because of the heat. The peak is in October, and the distribution is fairly symmetric around it: as the weather gets colder or hotter, the number of trips decreases slightly, except for June.

Overall, there are fewer bike trips on weekends than on weekdays. Tuesday and Wednesday have approximately the same number of trips. The number decreases as the weekend approaches and rises again on the first day of the week (Monday).

As for the start hour distribution, no outliers appear on the boxplot, so I’m not worried about extreme values. However, the minimum and maximum values are 0 and 23. Who rents a bike in the middle of the night? (Maybe visitors who want to explore the city?)

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

At this point, there are no unusual distributions except for the right skew in trip duration.

Data inspection: checked for nulls and duplicates, and made sure the formats are right.

Data cleaning: converted start and end times from strings to pandas datetime format so that I could extract parts of the date more easily, and dropped some unused columns.

Why? To make the data cleaner and to avoid misinterpretation when plotting and drawing conclusions.

Bivariate

## Overall correlation check, to pick out the variables with the highest correlation

plt.figure(figsize=(20,10))
sns.heatmap(df[['duration_sec', 'start_hour', 'start_month']].corr(), annot=True)
plt.title("Correlation between variables", fontsize=24, fontweight='bold')
plt.show()

[Figure: Correlation between variables]

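## Start hour vs Trip duration
# Plot start_hour on the x-axis so the data matches the title and axis label
plt.scatter(df['start_hour'], df['duration_sec'])
plt.title('Start hour vs Trip duration')
plt.xlabel('Hour')
plt.ylabel('Trip duration (seconds)')
plt.show()

[Figure: Start hour vs Trip duration]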

## Trip duration by user type
sns.catplot(x="user_type", y="duration_sec", data=df)
plt.title('Trip duration based on user type', fontweight='bold', fontsize=16)
plt.xlabel('User Type')
plt.ylabel('Trip duration (seconds)')
plt.show()

[Figure: Trip duration based on user type]

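## Number of trips per month, split by user type
# Pass columns by keyword: recent seaborn versions require x= and hue= to be named
sns.countplot(x='start_month', hue='user_type', data=df)
plt.title('Number of bike trips per month by user type', fontweight='bold', fontsize=16)
plt.xlabel('Month')
plt.ylabel('Number of trips')
plt.show()

[Figure: Number of bike trips per month by user type]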

# Fix the facet order and the bar order explicitly; seaborn warns when countplot is mapped without `order`
facetgrid = sns.FacetGrid(data=df, col='user_type', col_order=['Customer', 'Subscriber'], col_wrap=2, height=6, aspect=2, sharey=False)
facetgrid.map(sns.countplot, 'start_hour', order=list(range(24)))
facetgrid.axes[0].set_title('Bike trip hour for customer', fontweight='bold', fontsize=24)
facetgrid.axes[1].set_title('Bike trip hour for subscriber', fontweight='bold', fontsize=24)
facetgrid.axes[0].set_xlabel('Hours of a day', fontsize=22)
facetgrid.axes[1].set_xlabel('Hours of a day', fontsize=22)
facetgrid.axes[0].set_ylabel('Number of trips', fontsize=22)
facetgrid.axes[1].set_ylabel('Number of trips', fontsize=22)
plt.show()

[Figure: Bike trip hour for customer and for subscriber]

sub = dict(df[df['user_type']=='Subscriber']['start_station_name'].value_counts())
cus = dict(df[df['user_type']=='Customer']['start_station_name'].value_counts())

y_sub=[]
y_cus=[]

x = df['start_station_name'].value_counts().index[0:15]

for i in x:
    y_sub.append(sub[i])
    y_cus.append(cus[i])
    
dummy_df = pd.DataFrame({'customer': y_cus, 'subscriber':y_sub}, index=x)
dummy_df
                                                          customer  subscriber
San Francisco Ferry Building (Harry Bridges Plaza)            5210        9977
The Embarcadero at Sansome St                                 5864        7800
San Francisco Caltrain (Townsend St at 4th St)                1046       11500
San Francisco Caltrain Station 2 (Townsend St at 4th St)       985       11070
Market St at 10th St                                          1847       10113
Montgomery St BART Station (Market St at 2nd St)              1750        9584
Berry St at 4th St                                            1439        9517
Powell St BART Station (Market St at 4th St)                  3174        6968
Howard St at Beale St                                          660        9266
Steuart St at Market St                                       1628        7719
Powell St BART Station (Market St at 5th St)                  2087        5900
Embarcadero BART Station (Beale St at Market St)              1014        6635
2nd St at Townsend St - Coming Soon                           1038        5567
3rd St at Townsend St                                         1089        5325
Townsend St at 7th St                                          489        5734
ax = dummy_df.plot.barh()
plt.title("User Type and Top 15 Stations", fontsize=18, fontweight='bold')
plt.xlabel('Number of trips taken from this station')
plt.show()

[Figure: User Type and Top 15 Stations]
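The station-by-user-type table above is built with two dictionaries and a loop. As a possible alternative, `pd.crosstab` can produce the same kind of table in one step. The sketch below uses a made-up six-row sample with placeholder station names "A", "B" and "C".

```python
import pandas as pd

# Made-up sample, not the real data
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "C"],
    "user_type": ["Subscriber", "Customer", "Subscriber",
                  "Subscriber", "Customer", "Subscriber"],
})

# Busiest stations overall, most trips first
top_n = 2
top_stations = toy["start_station_name"].value_counts().index[:top_n]

# One row per station, one column per user type, restricted to the top stations
table = pd.crosstab(toy["start_station_name"], toy["user_type"])
table = table.reindex(top_stations, fill_value=0)
print(table)
```

One difference worth noting: a station with no trips for one user type gets a 0 here, whereas the dictionary lookup in the loop would raise a KeyError in that case.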

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

User type and trip duration: For both user types, the points become sparser as trip duration increases. However, customers tend to take longer trips than subscribers: from around 20,000 seconds the subscriber points thin out, whereas the customer points remain dense across almost the whole range of durations.

Number of trips per month and user type: Subscribers tend to have shorter trips but make more of them. In every month, subscribers make double (triple in some months) the number of trips that customers make. What is more, the monthly distribution differs between the two user types. Subscriber trips peak in October, while customer trips peak in September. Subscriber trips decrease more when it gets colder or hotter, whereas customer trips change only slightly, except in June and July.

Trip start hour and user type: Subscribers have a bimodal distribution with peaks at 8 and 16 o’clock. Both distributions are left-skewed, but the customer distribution is more sharply skewed than the subscriber one.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There was no clear relationship between the top 15 stations and the number of trips taken from each station.

It’s hard to compute a specific correlation because user type is categorical.
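One possible way around the categorical user type, since it has only two values, is to encode it as 0/1 and take the ordinary Pearson correlation with duration (for a binary variable this is the point-biserial correlation). The sketch below uses made-up numbers and is not a result from the real data.

```python
import pandas as pd

# Made-up sample, not the real data
toy = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Customer",
                  "Subscriber", "Subscriber", "Subscriber"],
    "duration_sec": [1800, 2400, 3000, 500, 600, 700],
})

# Encode the two categories as 1 (Customer) and 0 (Subscriber)
is_customer = (toy["user_type"] == "Customer").astype(int)

# Pearson correlation between the 0/1 indicator and trip duration
r = is_customer.corr(toy["duration_sec"])
print(f"correlation between being a customer and duration: {r:.2f}")
```

A positive value means customers tend to have longer trips in the sample. With a distribution as skewed as trip duration, a rank-based measure or a comparison of medians might be more robust.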

Multivariate

## Average trip duration per month, split by user type

plt.figure(figsize = [6,4])
sns.pointplot(ci=None, x = df['start_month'], y = df['duration_sec'], hue = df['user_type'])
plt.title('Trip duration in each month of the year', fontweight='bold', fontsize=16)
plt.ylabel('Average Trip Duration (seconds)')
plt.xlabel('Month')
plt.show()

[Figure: Trip duration in each month of the year]

## Average trip duration per hour of the day, split by user type

plt.figure(figsize = [6,4])
sns.pointplot(ci=None, x = df['start_hour'], y = df['duration_sec'], hue = df['user_type'])
plt.title('Trip duration in each hour of the day', fontweight='bold', fontsize=16)
plt.ylabel('Average Trip Duration (seconds)')
plt.xlabel('Hour')
plt.show()

[Figure: Trip duration in each hour of the day]

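## Number of trips per hour, split by user type
# Pass columns by keyword: recent seaborn versions require x= and hue= to be named
sns.countplot(x='start_hour', hue='user_type', data=df)
plt.title('Number of bike trips per hour by user type', fontweight='bold', fontsize=16)
plt.xlabel('Hour')
plt.ylabel('Number of trips')
plt.show()

[Figure: Number of bike trips per hour by user type]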

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

When I first looked at user type and trip duration, I thought that customers made more bike trips than subscribers. However, it turns out that customers have a higher average trip duration than subscribers but make fewer trips.
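The comparison of trip count and average duration per user type can also be read straight from a grouped summary. This is a sketch on made-up values; the column names `trips` and `mean_duration` are introduced here for illustration.

```python
import pandas as pd

# Made-up sample: more subscriber trips, but longer customer trips
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 2,
    "duration_sec": [500, 600, 700, 600, 2000, 3000],
})

# Named aggregation: one row per user type with trip count and mean duration
per_type = toy.groupby("user_type")["duration_sec"].agg(
    trips="count",
    mean_duration="mean",
)
print(per_type)
```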

Average trip duration by month for each user type also supports the above conclusion. Apart from an exception around 3 am in the hourly plot, subscribers’ average trip duration remains stable, while customers’ fluctuates.

The number of trips by hour of the day is bimodal for subscribers, with two roughly normal-shaped peaks at 8 am and 5 pm. Meanwhile, for customers, the number of trips is spread out evenly over “working hours” and starts to decrease when night comes. For both user types, there are almost no trips from 2 am to 4 am.
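The peak hours described here can be checked numerically from the same counts that the countplot draws. The sketch below builds an hour-by-user-type table from a made-up sample and reads off the busiest hours.

```python
import pandas as pd

# Made-up sample: subscribers cluster at 8 and 17, customers around midday
toy = pd.DataFrame({
    "start_hour": [8, 8, 8, 17, 17, 12, 13, 13, 14],
    "user_type": ["Subscriber"] * 5 + ["Customer"] * 4,
})

# Rows are hours, columns are user types, cells are trip counts
hourly = (
    toy.groupby(["user_type", "start_hour"])
    .size()
    .unstack("user_type", fill_value=0)
)

subscriber_peaks = hourly["Subscriber"].nlargest(2).index.tolist()
customer_peak = hourly["Customer"].idxmax()
print(subscriber_peaks, customer_peak)
```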

There are no bike trips from January to May in the investigated year. In the remaining months, subscribers make double (almost triple in October, November and December) the number of trips that customers make.

Were there any interesting or surprising interactions between features?

Before I started this project, I made a lot of assumptions about the relationships between the variables. After some processing, things became clearer, and I can see the following connections between these factors and user type:

Subscribers have a lower average trip duration but make more trips.

People in the investigated area prefer to ride when the weather is cool, as there are fewer bike trips when the weather gets hotter or colder.

Subscribers ride on a regular schedule, as shown by the bimodal shape of their start hours, while customers ride occasionally, as their start hours are widely distributed.

df.to_csv('df_final.csv', index=False)