#importing packages needed, and configuring matplotlib to render plots inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

#setting up matplotlib to render plots larger and inline, and seaborn's grid style
%matplotlib inline
sb.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (12,8)
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')
sb.set(rc={"figure.dpi":100, 'savefig.dpi':300})

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")


#loading the prosper loan dataset into a pandas DataFrame
loan_df = pd.read_csv('prosperLoanData.csv')


#a quick overview of what the dataset looks like
loan_df.head()


#getting some info on the DataFrame(especially the columns)
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss                        84853 non-null   float64
 12  EstimatedReturn                      84853 non-null   float64
 13  ProsperRating (numeric)              84853 non-null   float64
 14  ProsperRating (Alpha)                84853 non-null   object 
 15  ProsperScore                         84853 non-null   float64
 16  ListingCategory (numeric)            113937 non-null  int64  
 17  BorrowerState                        108422 non-null  object 
 18  Occupation                           110349 non-null  object 
 19  EmploymentStatus                     111682 non-null  object 
 20  EmploymentStatusDuration             106312 non-null  float64
 21  IsBorrowerHomeowner                  113937 non-null  bool   
 22  CurrentlyInGroup                     113937 non-null  bool   
 23  GroupKey                             13341 non-null   object 
 24  DateCreditPulled                     113937 non-null  object 
 25  CreditScoreRangeLower                113346 non-null  float64
 26  CreditScoreRangeUpper                113346 non-null  float64
 27  FirstRecordedCreditLine              113240 non-null  object 
 28  CurrentCreditLines                   106333 non-null  float64
 29  OpenCreditLines                      106333 non-null  float64
 30  TotalCreditLinespast7years           113240 non-null  float64
 31  OpenRevolvingAccounts                113937 non-null  int64  
 32  OpenRevolvingMonthlyPayment          113937 non-null  float64
 33  InquiriesLast6Months                 113240 non-null  float64
 34  TotalInquiries                       112778 non-null  float64
 35  CurrentDelinquencies                 113240 non-null  float64
 36  AmountDelinquent                     106315 non-null  float64
 37  DelinquenciesLast7Years              112947 non-null  float64
 38  PublicRecordsLast10Years             113240 non-null  float64
 39  PublicRecordsLast12Months            106333 non-null  float64
 40  RevolvingCreditBalance               106333 non-null  float64
 41  BankcardUtilization                  106333 non-null  float64
 42  AvailableBankcardCredit              106393 non-null  float64
 43  TotalTrades                          106393 non-null  float64
 44  TradesNeverDelinquent (percentage)   106393 non-null  float64
 45  TradesOpenedLast6Months              106393 non-null  float64
 46  DebtToIncomeRatio                    105383 non-null  float64
 47  IncomeRange                          113937 non-null  object 
 48  IncomeVerifiable                     113937 non-null  bool   
 49  StatedMonthlyIncome                  113937 non-null  float64
 50  LoanKey                              113937 non-null  object 
 51  TotalProsperLoans                    22085 non-null   float64
 52  TotalProsperPaymentsBilled           22085 non-null   float64
 53  OnTimeProsperPayments                22085 non-null   float64
 54  ProsperPaymentsLessThanOneMonthLate  22085 non-null   float64
 55  ProsperPaymentsOneMonthPlusLate      22085 non-null   float64
 56  ProsperPrincipalBorrowed             22085 non-null   float64
 57  ProsperPrincipalOutstanding          22085 non-null   float64
 58  ScorexChangeAtTimeOfListing          18928 non-null   float64
 59  LoanCurrentDaysDelinquent            113937 non-null  int64  
 60  LoanFirstDefaultedCycleNumber        16952 non-null   float64
 61  LoanMonthsSinceOrigination           113937 non-null  int64  
 62  LoanNumber                           113937 non-null  int64  
 63  LoanOriginalAmount                   113937 non-null  int64  
 64  LoanOriginationDate                  113937 non-null  object 
 65  LoanOriginationQuarter               113937 non-null  object 
 66  MemberKey                            113937 non-null  object 
 67  MonthlyLoanPayment                   113937 non-null  float64
 68  LP_CustomerPayments                  113937 non-null  float64
 69  LP_CustomerPrincipalPayments         113937 non-null  float64
 70  LP_InterestandFees                   113937 non-null  float64
 71  LP_ServiceFees                       113937 non-null  float64
 72  LP_CollectionFees                    113937 non-null  float64
 73  LP_GrossPrincipalLoss                113937 non-null  float64
 74  LP_NetPrincipalLoss                  113937 non-null  float64
 75  LP_NonPrincipalRecoverypayments      113937 non-null  float64
 76  PercentFunded                        113937 non-null  float64
 77  Recommendations                      113937 non-null  int64  
 78  InvestmentFromFriendsCount           113937 non-null  int64  
 79  InvestmentFromFriendsAmount          113937 non-null  float64
 80  Investors                            113937 non-null  int64  
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB


#getting some summary statistics on the statistical variables
print(loan_df.describe())

       ListingNumber           Term    BorrowerAPR   BorrowerRate  \
count   1.139370e+05  113937.000000  113912.000000  113937.000000   
mean    6.278857e+05      40.830248       0.218828       0.192764   
std     3.280762e+05      10.436212       0.080364       0.074818   
min     4.000000e+00      12.000000       0.006530       0.000000   
25%     4.009190e+05      36.000000       0.156290       0.134000   
50%     6.005540e+05      36.000000       0.209760       0.184000   
75%     8.926340e+05      36.000000       0.283810       0.250000   
max     1.255725e+06      60.000000       0.512290       0.497500   

         LenderYield  EstimatedEffectiveYield  EstimatedLoss  EstimatedReturn  \
count  113937.000000             84853.000000   84853.000000     84853.000000   
mean        0.182701                 0.168661       0.080306         0.096068   
std         0.074516                 0.068467       0.046764         0.030403   
min        -0.010000                -0.182700       0.004900        -0.182700   
25%         0.124200                 0.115670       0.042400         0.074080   
50%         0.173000                 0.161500       0.072400         0.091700   
75%         0.240000                 0.224300       0.112000         0.116600   
max         0.492500                 0.319900       0.366000         0.283700   

       ProsperRating (numeric)  ProsperScore  ...  LP_ServiceFees  \
count             84853.000000  84853.000000  ...   113937.000000   
mean                  4.072243      5.950067  ...      -54.725641   
std                   1.673227      2.376501  ...       60.675425   
min                   1.000000      1.000000  ...     -664.870000   
25%                   3.000000      4.000000  ...      -73.180000   
50%                   4.000000      6.000000  ...      -34.440000   
75%                   5.000000      8.000000  ...      -13.920000   
max                   7.000000     11.000000  ...       32.060000   

       LP_CollectionFees  LP_GrossPrincipalLoss  LP_NetPrincipalLoss  \
count      113937.000000          113937.000000        113937.000000   
mean          -14.242698             700.446342           681.420499   
std           109.232758            2388.513831          2357.167068   
min         -9274.750000             -94.200000          -954.550000   
25%             0.000000               0.000000             0.000000   
50%             0.000000               0.000000             0.000000   
75%             0.000000               0.000000             0.000000   
max             0.000000           25000.000000         25000.000000   

       LP_NonPrincipalRecoverypayments  PercentFunded  Recommendations  \
count                    113937.000000  113937.000000    113937.000000   
mean                         25.142686       0.998584         0.048027   
std                         275.657937       0.017919         0.332353   
min                           0.000000       0.700000         0.000000   
25%                           0.000000       1.000000         0.000000   
50%                           0.000000       1.000000         0.000000   
75%                           0.000000       1.000000         0.000000   
max                       21117.900000       1.012500        39.000000   

       InvestmentFromFriendsCount  InvestmentFromFriendsAmount      Investors  
count               113937.000000                113937.000000  113937.000000  
mean                     0.023460                    16.550751      80.475228  
std                      0.232412                   294.545422     103.239020  
min                      0.000000                     0.000000       1.000000  
25%                      0.000000                     0.000000       2.000000  
50%                      0.000000                     0.000000      44.000000  
75%                      0.000000                     0.000000     115.000000  
max                     33.000000                 25000.000000    1189.000000  

[8 rows x 61 columns]


#finding out the shape of the DataFrame
loan_df.shape

(113937, 81)


#plotting a histograms to get a high level understanding of the distribution of the variables in the DataFrame
loan_df.hist(figsize=(30, 30), color="r");


#summary statistics of the 'BorrowerRate' column
loan_df['BorrowerRate'].describe()

count    113937.000000
mean          0.192764
std           0.074818
min           0.000000
25%           0.134000
50%           0.184000
75%           0.250000
max           0.497500
Name: BorrowerRate, dtype: float64


#plotting a histogram of BorrowerRate
binsize = np.arange(0,loan_df['BorrowerRate'].max()+0.1, 0.1)
plt.hist(data = loan_df, x = 'BorrowerRate', bins = binsize)
plt.title('A histogram of the borrower rate variable', fontsize = 18)
plt.xlabel('Borrower rate', fontsize = 14)
plt.show()


#creating a function for plotting histograms
def histogram_plot(dataset, column, bins, ticks, limits):
    binsize = bins
    bins = np.arange(0, dataset[column].max()+binsize, binsize)
    plt.figure(figsize = [8, 5])
    plt.hist(data = dataset, x = column, bins = bins, color = 'g')
    plt.xlim(limits)
    plt.xlabel(column, fontsize = 14)
    plt.xticks(ticks)
    plt.ylabel('Count', fontsize = 14)
    plt.title(f'A histogram of the {column}', fontsize = 18)
    plt.show()


#plotting a histogram of the Employment duration of the borrowers
histogram_plot(loan_df, 'EmploymentStatusDuration', 30,
               [30,60,90,120,150,180,210,240,270,300,330,360,390,420,450,480,510,540], [0, 540])


#renaming some columns of interest to follow the standard python convention
loan_df.rename(columns ={'ListingCategory (numeric)': 'ListingCategory',
                         'ProsperRating (numeric)':'ProsperRating',
                         'ProsperRating (Alpha)':'ProsperRating',
                         'TradesNeverDelinquent (percentage)':'TradesNeverDelinquent'}, inplace=True)


#getting the value counts in the 'ListCategory' column
loan_df['ListingCategory'].value_counts()

1     58308
0     16965
7     10494
2      7433
3      7189
6      2572
4      2395
13     1996
15     1522
18      885
14      876
20      771
19      768
5       756
16      304
11      217
8       199
10       91
9        85
12       59
17       52
Name: ListingCategory, dtype: int64


#plotting to see the distribution of the 'ListCategory' column
ylabels = ['Not Available', 'Debt Consolidation', 'Home Improvement', 'Business', 'Personal Loan', 
'Student Use', 'Auto', 'Other', 'Baby&Adoption', 'Boat', 'Cosmetic Procedure',
'Engagement Ring', 'Green Loans','Household Expenses', 'Large Purchases', 'Medical/Dental',
'Motorcycle', 'RV', 'Taxes', 'Vacation','Wedding Loans']
ax = sb.countplot(data=loan_df, y='ListingCategory', color = 'purple')
ax.set_title('The different listing catergories', fontsize = 18)
ax.set_yticklabels(ylabels, fontsize = 14)
plt.xscale('log')
plt.xticks([500, 1e3, 2e3, 5e3, 1e4, 2e4, 5e4], [500, '1k', '2k', '5k', '10k', '20k', '50k'])
plt.show()


#defining a function for plotting a simpler histogram
def mini_hist(dataset, column, title, x_label, y_label):
    #plt.figure(figsize = [8,5])
    plt.hist(data = dataset, x = column, color = 'y')
    plt.title(title, fontsize = 18)
    plt.xlabel(x_label, fontsize = 14)
    plt.ylabel(y_label, fontsize = 14)
    plt.xticks(rotation = 20)


#converting the datatype of the 'DateCreditPulled' column to datetime.
loan_df['DateCreditPulled'] = pd.to_datetime(loan_df['DateCreditPulled'])


#plotting a histogram to show the distribution of credits pulled over the years
mini_hist(loan_df, 'DateCreditPulled', 'The number of credits pulled by year', 'Year', 'Count')


#plotting to show the number of borrowers by income range
mini_hist(loan_df, 'IncomeRange', 'The number of loans in each income range', 'Income range', 'count')


#getting the value counts in the 'IsBorrowerHomeowner' column
loan_df['IsBorrowerHomeowner'].value_counts()

True     57478
False    56459
Name: IsBorrowerHomeowner, dtype: int64


#visualizing the number of borrowers that are home owners vs those that aren't.
ax = sb.countplot(data=loan_df, x='IsBorrowerHomeowner', color = 'blue')
ax.set_title('The number of borrowers that do/do not own a home')
xlabels = ['Doesn\'t own a home', 'Owns a home']
ax.set_xticklabels(xlabels)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width()  / 2,rect.get_height()+ 0.75,rect.get_height(),
             horizontalalignment='center', fontsize = 11)
plt.show()


#getting the value counts in the 'EmploymentStatus' column
loan_df['EmploymentStatus'].value_counts()

Employed         67322
Full-time        26355
Self-employed     6134
Not available     5347
Other             3806
Part-time         1088
Not employed       835
Retired            795
Name: EmploymentStatus, dtype: int64


#plotting the employment status column to see its distribution
employment_status = loan_df['EmploymentStatus'].value_counts()
plt.pie(employment_status,  labels = employment_status.index,
        startangle = 90, counterclock = False, rotatelabels=True);


#plotting to see the proportion of borrowers that are employed
ax = sb.countplot(data=loan_df, x='EmploymentStatus', color = 'blue')
ax.set_title('The number of employed borrowers', fontsize = 18)
plt.xticks(rotation = 40)
plt.show()


#plotting to see the proportion of borrowers that are employed
ax = sb.countplot(data=loan_df, x='EmploymentStatus', color = 'blue')
ax.set_title('The number of employed borrowers', fontsize = 18)
plt.xticks(rotation = 40)
plt.yscale('log')
plt.yticks([500, 1e3, 2e3, 5e3, 1e4, 2e4, 5e4], [500, '1k', '2k', '5k', '10k', '20k', '50k'])
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width()  / 2,rect.get_height()+ 0.75,rect.get_height(),
             horizontalalignment='center', fontsize = 11)

plt.show()


#plotting to see the distribution of the loan status column
ax = sb.countplot(data=loan_df, y='LoanStatus', color = 'blue')
ax.set_title('A loan status countplot', fontsize = 18)
plt.xscale('log')
plt.xticks([500, 1e3, 2e3, 5e3, 1e4, 2e4, 5e4], [500, '1k', '2k', '5k', '10k', '20k', '50k'])
plt.show()


#LoanStatus EmploymentStatus
plt.figure(figsize=(25,10))
plt.xticks(rotation=20,  fontsize = 14)
sb.countplot(data = loan_df, hue = 'LoanStatus', x = 'EmploymentStatus', palette = 'pastel');
plt.yscale('log')


#a function to get the proportions of loan status by employment status
def variable_proportions(column, DataFrame, hue):
    TotalValues_forEachCategory = DataFrame.groupby([column])[hue].count()
    valuesFor_LoanStatus = DataFrame.groupby([column, hue])[hue].count()
    proportions = valuesFor_LoanStatus / TotalValues_forEachCategory
    return proportions


#a function to plot the proportions loan status by the employment status
def proportions_plot(proportions, column_names):
    # Proportions come from get_propotion function.
    proportions.unstack().plot(kind='bar');
    plt.ylabel('Proportions')
    plt.title(f'The relationship between {column_names} and loan status', fontsize=15, )


#calling both functions on the employment status column
proportions = variable_proportions('EmploymentStatus', loan_df, 'LoanStatus')
proportions_plot(proportions, 'EmploymentStatus')


#creating a new column for loans status where the value is either 'Completed' or 'Defaulted'
new = loan_df.query('LoanStatus == "Completed" | LoanStatus == "Defaulted"')
loan_df['newStatus'] = new['LoanStatus']


#calling the functions for plotting the proportions of loan status values on the new column
proportions = variable_proportions('EmploymentStatus', loan_df, 'newStatus')
proportions_plot(proportions, 'EmploymentStatus')


#converting the datatype of the values in the 'ListingCategory' column from integar to str
loan_df['ListingCategory'] = loan_df['ListingCategory'].astype('str')


#testing the code
loan_df['ListingCategory'].describe()

count     113937
unique        21
top            1
freq       58308
Name: ListingCategory, dtype: object


#getting the order to use for the x ticks labels
loan_df['ListingCategory'].value_counts()

1     58308
0     16965
7     10494
2      7433
3      7189
6      2572
4      2395
13     1996
15     1522
18      885
14      876
20      771
19      768
5       756
16      304
11      217
8       199
10       91
9        85
12       59
17       52
Name: ListingCategory, dtype: int64


#facet grid plots of the relationship between listing category and employment status

#facet grid countplot
g = sb.FacetGrid(data = loan_df, col = 'EmploymentStatus' , col_wrap=3, sharey=False)
g.map(sb.countplot,'ListingCategory', order=loan_df['ListingCategory'].value_counts().index, color='purple')
g.fig.set_size_inches(15,20)
g.set_xticklabels(['1-Debt Consolidation','0-Not Available','7-Other', '2-Home Improvement', '3-Business', '6-Auto', 
                   '4-Personal Loan', '13-Household Expenses','15-Medical/Dental', '18-Taxes', '14-Large Purchases',
                   '20-Wedding Loans', '19-Vacation', '5-Student Use', '16-Motorcycle', '11-Engagement Ring', 
                   '8-Baby&Adoption', '10-Cosmetic Procedure',  '9-Boat', '12-Green Loans', '17-RV'])
#rotating the x tick labels
for axes in g.axes.flat:
    axes.set_xticklabels(axes.get_xticklabels(), rotation=90)

plt.show()


#plotting the relationship between loan amount and borrower APR
plt.scatter(x = loan_df['LoanOriginalAmount'], y = loan_df['BorrowerAPR'], color = 'b', marker = 'x', alpha = 0.4)
plt.xlabel('Loan amount')
plt.ylabel('BorrowerAPR')
plt.title('The relationshhip between loan amount and APR', fontsize = 18)
plt.show()


#plotting the relationship between loan amount and loan status
sb.boxplot(x=loan_df['LoanOriginalAmount'] , y = loan_df['LoanStatus'], color = 'green')
plt.title('The relationship between loan amount and loan status', fontsize = 18)
plt.show()


#creating a point plot
ax = sb.pointplot(data = loan_df, x = 'LoanOriginalAmount', y = 'LoanStatus', hue = 'Recommendations',
                  dodge = 0.3, linestyles = "", palette='Greens')
plt.show()


#creating a new column for listing category where the value is 
#1,2 or 3(representing debt consolidation, home improvement and business respectively)
listings = loan_df.query('ListingCategory == "1" | ListingCategory == "2"| ListingCategory == "3" ')
loan_df['topListings'] = listings['ListingCategory']


#renaming the values for semantics
replacements = {'1':'debt consolidation', '2':'home improvement', '3':'business'}
loan_df['topListings'] = loan_df['topListings'].replace(replacements)


#plotting a clustered bar chart to answer my question
ax = sb.barplot(data = loan_df, y = 'BorrowerAPR', x = 'newStatus', hue = 'topListings')
ax.legend(loc = 8, ncol = 3, framealpha = 1, title = 'Listing category')
plt.title('The relationship between the loan status, APR and listing category')
plt.show()

Part I - Prosper Loan Exploration¶

by Ovy Evbodi¶

Introduction¶

Preliminary Wrangling¶

Importing packages¶

Loading the dataset¶

A quick overview of the dataset¶

What is the structure of your dataset?¶

What is/are the main feature(s) of interest in your dataset?¶

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

Univariate Exploration¶

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

Bivariate Exploration¶

Question¶

Question¶

Question¶

Question¶

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

Multivariate Exploration¶

Question¶

Question¶

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

Were there any interesting or surprising interactions between features?¶

Conclusions¶

	ListingKey	ListingNumber	ListingCreationDate	CreditGrade	Term	LoanStatus	ClosedDate	BorrowerAPR	BorrowerRate	LenderYield	...	LP_ServiceFees	PercentFunded	Investors
0	1021339766868145413AB3B	193129	2007-08-26 19:09:29.263000000	C	36	Completed	2009-08-14 00:00:00	0.16516	0.1580	0.1380	...	-133.18	1.0	258
1	10273602499503308B223C1	1209647	2014-02-27 08:28:07.900000000	NaN	36	Current	NaN	0.12016	0.0920	0.0820	...	0.00	1.0	1
2	0EE9337825851032864889A	81716	2007-01-05 15:00:47.090000000	HR	36	Completed	2009-12-17 00:00:00	0.28269	0.2750	0.2400	...	-24.20	1.0	41
3	0EF5356002482715299901A	658116	2012-10-22 11:02:35.010000000	NaN	36	Current	NaN	0.12528	0.0974	0.0874	...	-108.01	1.0	158
4	0F023589499656230C5E3E2	909464	2013-09-14 18:38:39.097000000	NaN	36	Current	NaN	0.24614	0.2085	0.1985	...	-60.27	1.0	20