Since my Data Science Bootcamp ended a little over a month ago, I've spent most of my time going over what I had learned and refreshing my skills. Much of that time has been spent going deeper into areas that we only covered briefly. For example, I have been practicing with Keras, SQL, and time series tools like Prophet. In the last few days, I decided to go deeper into something that I used frequently in my projects, but not to the full extent of what it is designed for. This powerful resource is called a pipeline and it's available (like most tools I have come to love) through the Scikit-learn libraries. Scikit-learn's Pipeline is used to automate machine learning workflows: it chains a linear sequence of data transforms into a single process that can be fit and evaluated in one step. In addition to Pipelines, I am also going to tackle something completely new to me: ColumnTransformers. A ColumnTransformer can be paired with a Pipeline to simplify the workflow. Let's start by importing our essential libraries and reading in our data.

import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import joblib
data = pd.read_csv('data/brazil_cities.csv', sep=';', decimal=',')
data.head(3)
CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL IBGE_POP ... Pu_Bank Pr_Assets Pu_Assets Cars Motorcycles Wheeled_tractor UBER MAC WAL-MART POST_OFFICES
0 São Paulo SP 1 11253503.0 11133776.0 119727.0 3576148.0 3548433.0 27715.0 10463636.0 ... 8.0 1.947077e+13 2.893261e+12 5740995.0 1134570.0 3236.0 1.0 130.0 7.0 225.0
1 Osasco SP 0 666740.0 664447.0 2293.0 202009.0 202009.0 NaN 616068.0 ... 2.0 6.732330e+12 1.321699e+10 283641.0 73477.0 174.0 NaN 7.0 1.0 10.0
2 Rio De Janeiro RJ 1 6320446.0 6264915.0 55531.0 2147235.0 2147235.0 NaN 5426838.0 ... 5.0 2.283445e+12 9.738864e+11 2039930.0 363486.0 289.0 1.0 68.0 1.0 120.0

3 rows × 81 columns

data.shape
(5576, 81)

I've decided to use data on Brazilian cities. For the purpose of this blog, I am going to only look at 8 columns which relate to: State, Capital, Population, Life Expectancy, Education Index, GDP, Number of Companies, and Tourism Category.

keep = ['STATE', 'CAPITAL', 'IBGE_RES_POP', 'IDHM_Longevidade', 'IDHM_Educacao', 'GDP', 'COMP_TOT', 'CATEGORIA_TUR']
df = data[keep]
df = df.rename(columns={'STATE': 'state', 'CAPITAL': 'capital', 'IBGE_RES_POP': 'population',
                'IDHM_Longevidade': 'life_expectancy', 'IDHM_Educacao': 'education_index',
                'GDP': 'gdp', 'COMP_TOT': 'num_companies',
                'CATEGORIA_TUR': 'tourism_category',})
#Renaming the columns with more descriptive names
df.dtypes
state                object
capital               int64
population          float64
life_expectancy     float64
education_index     float64
gdp                 float64
num_companies       float64
tourism_category     object
dtype: object
df.isna().sum()
state                  0
capital                0
population             8
life_expectancy        8
education_index        8
gdp                    3
num_companies          3
tourism_category    2288
dtype: int64
df = df.dropna(subset=['population']) # Drop all NaN in target column
df.corr()
capital population life_expectancy education_index gdp num_companies
capital 1.000000 0.567143 0.050703 0.113596 0.420889 0.479255
population 0.567143 1.000000 0.081922 0.137562 0.942853 0.960272
life_expectancy 0.050703 0.081922 1.000000 0.704590 0.071787 0.091642
education_index 0.113596 0.137562 0.704590 1.000000 0.105869 0.135122
gdp 0.420889 0.942853 0.071787 0.105869 1.000000 0.946001
num_companies 0.479255 0.960272 0.091642 0.135122 0.946001 1.000000
sns.heatmap(df.corr(), annot=True, cmap="Reds")
<matplotlib.axes._subplots.AxesSubplot at 0x10ba1ccf8>

png

Train Test Split

After some basic EDA, it's time to split the data into training and testing sets. The y (target) is going to be the population, and the rest of the columns will be the features I am using to predict the population of a city in Brazil.

y = df['population']
features = [col for col in df if col != 'population']
X = df[features]
X.head(3)
state capital life_expectancy education_index gdp num_companies tourism_category
0 SP 1 0.855 0.725 6.870359e+08 530446.0 A
1 SP 0 0.840 0.718 7.440269e+07 15315.0 B
2 RJ 1 0.845 0.719 3.294314e+08 190038.0 A
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(4454, 7)
(1114, 7)
(4454,)
(1114,)

At this point in my workflow, I would normally import DataFrameMapper from sklearn-pandas and start to fix each column individually. However, this is where I am going to explore a tool that is new to me, Scikit-learn's ColumnTransformer, which can simultaneously transform several columns that need the same transformation process.

Classify Columns into Categorical and Numerical

Columns do not necessarily need to be separated into these two types, but in this case it makes for a better example. It also illustrates when you would use DataFrameMapper and when you would use this process instead: if you have several columns that each need very different types of transformation, I would use DataFrameMapper; if you have a more homogeneous group, where several columns need to be transformed in a similar way, I would use this process. It makes for a cleaner workflow.

After performing basic EDA, it became apparent that the columns State, Capital, and Tourism Category can be treated as categorical; they will need to be one hot encoded, and since there are missing values, we will also need an imputer. The columns Life Expectancy, Education Index, GDP, and Number of Companies can be treated as numerical; they will need to be scaled, as well as have their missing values imputed.

cat_cols = ['state', 'capital', 'tourism_category']
num_cols = ['life_expectancy', 'education_index', 'gdp', 'num_companies']

Build a Pipeline for Categorical Columns

Now I am going to start building pipelines for these two types of columns. Let's start with our Categorical columns.

cat_cols #remind us what our categorical columns are
['state', 'capital', 'tourism_category']
X_train_cat = X_train[cat_cols] #pulling out the cat columns from our X_train
X_train_cat.head(3)
state capital tourism_category
200 SP 0 C
3517 SP 0 NaN
1907 PR 0 NaN
X_train_cat.isna().sum() 
state                  0
capital                0
tourism_category    1836
dtype: int64

Seeing that there are 1,836 missing values, and not wanting to lose the valuable information in the rest of the columns, we decide to impute those missing values.

si = SimpleImputer(strategy='constant', fill_value='other') #This will fill all missing values as 'other'

si.fit(X_train_cat)
X_train_cat_si = si.transform(X_train_cat)

After imputing the missing values, we will now need to one hot encode the columns.

ohe = OneHotEncoder(sparse=False)

X_train_cat_ohe = ohe.fit_transform(X_train_cat_si)

Now that we have imputed all missing values and one hot encoded the categorical columns, we can place these two steps into a pipeline. We do this by defining them as steps and then passing them to the pipeline. The steps are a list of tuples that each contain two values: the first is the name you are giving the process, and the second is the instantiated transformer (for example, si is the SimpleImputer and ohe is the OneHotEncoder). These steps are then placed in the pipeline, and the pipeline is fit on X_train and y_train.

steps = [
    ('impute', si),
    ('ohe', ohe)
]

cat_pipe = Pipeline(steps=steps)
cat_pipe.fit(X_train_cat, y_train)
Pipeline(memory=None,
     steps=[('impute', SimpleImputer(copy=True, fill_value='other', missing_values=nan,
       strategy='constant', verbose=0)), ('ohe', OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=False))])

Build a Pipeline for Numerical Columns

num_cols
['life_expectancy', 'education_index', 'gdp', 'num_companies']
X_train_num = X_train[num_cols] #pulling out the numerical columns from our X_train
X_train_num.head(3)
life_expectancy education_index gdp num_companies
200 0.863 0.687 2530729.24 2464.0
3517 0.855 0.616 177756.32 213.0
1907 0.843 0.676 351801.06 731.0

The numerical columns need to be imputed and scaled. We could apply the SimpleImputer and then the StandardScaler as two separate transformations, like we did for the categorical columns, but we can skip those individual steps and place them directly into the pipeline. We simply instantiate the two transformers we need, define them as steps, and place those steps into the pipeline.

si = SimpleImputer(strategy='mean')
ss = StandardScaler()

steps = [
    ('impute', si),
    ('ss', ss)
]

num_pipe = Pipeline(steps=steps)
num_pipe.fit(X_train_num, y_train)
Pipeline(memory=None,
     steps=[('impute', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0)), ('ss', StandardScaler(copy=True, with_mean=True, with_std=True))])

Use ColumnTransformer to Concatenate all Data Together

The ColumnTransformer can now take both pipelines and concatenate their outputs, transforming the entirety of X_train in one step.

A ColumnTransformer takes an array of tuples with three values. The first value is the name we give the process, the second value is the pipeline created, and the third value is the list of columns that the pipeline was created to transform.

transformers = [
    ('cat', cat_pipe, cat_cols),
    ('num', num_pipe, num_cols)
]

ct = ColumnTransformer(transformers=transformers)
X_train_trans = ct.fit_transform(X_train)
X_train_trans.shape
(4454, 39)

Create a Final Pipeline with a Machine Learning Model

Now we can take the transformer and a machine learning model, such as Linear Regression, and create a final pipeline that takes in all the raw data, transforms the data through the individual pipelines, concatenates the data with the transformer, and then predicts.

lr = LinearRegression()

final_steps = [
    ('transformer', ct),
    ('model', lr)
]

pipe = Pipeline(steps=final_steps)
pipe.fit(X_train, y_train)
Pipeline(memory=None,
     steps=[('transformer', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('cat', Pipeline(memory=None,
     steps=[('impute', SimpleImputer(copy=True, fill_value='other', missing_values=nan,
       strategy='constant', ve...('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))])
pipe.score(X_train, y_train)
0.9648703985990568

Now let's see if the pipeline will work on the test data. The test data was placed aside when we did our train test split. It has not been used for any of the processes above.

y_pred = pipe.predict(X_test)
y_pred
array([ 2080.,  5952., 11360., ...,  5632., 30720., 15680.])
pipe.score(X_test, y_test)
0.7800829672957542

Concluding Remarks

There are many benefits to using a Scikit-learn pipeline. By enforcing the order of the steps, it makes a workflow much easier to read and understand. Pipelines are especially useful if you are working with datasets that are continuously being added to, since the same pipeline can be used to transform the new data in exactly the same way.
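
Since joblib was imported at the top, here is a minimal sketch (the file name is just an example) of persisting the fitted pipeline and reloading it to score new data in exactly the same way:

joblib.dump(pipe, 'population_pipeline.joblib')  # save the fitted pipeline to disk (hypothetical file name)

loaded_pipe = joblib.load('population_pipeline.joblib')  # reload it later, or in another script
new_predictions = loaded_pipe.predict(X_test)  # X_test stands in here for newly collected data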

While I am a big fan of DataFrameMapper, which can also be placed into a pipeline, I often found myself repeating myself when using it, as many columns need the exact same transformations. The process above makes that much simpler, easier, and less time consuming. While I find this new process advantageous, I should note that there are still many advantages to using DataFrameMapper. The main one that comes to mind is that DataFrameMapper allows you to keep the annotations and labels that you assigned to your pandas DataFrame. The results of my pipeline above, on the other hand, are reduced to an array/matrix. This makes DataFrameMapper's output much easier to comprehend. Both processes have their benefits, and I envision using both in different scenarios.

Finally, pipelines can also be passed to a GridSearch to find the best models and hyper-parameters. As we can see from our train and test scores, our current model is quite overfit; a GridSearch could help find a model that is less overfit and improve our test score.
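
As a rough sketch of what that could look like (the grid values below are purely illustrative), the step names defined above let us reach into the nested pipelines with scikit-learn's double-underscore syntax:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'transformer__num__impute__strategy': ['mean', 'median'],  # the numeric imputer inside the ColumnTransformer
    'model__fit_intercept': [True, False]  # LinearRegression hyper-parameter
}

# Note: with OneHotEncoder's default handle_unknown='error', very rare categories
# could cause errors across CV folds; this is only a sketch of the mechanics.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)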

Last week, I finished my Data Science boot camp at Bitmaker General Assembly in Toronto. For the final two weeks, I spent my time working on my final project, known as the Capstone Project. This blog will be an overview of this project, from its inception to my presentation last Thursday. For the project, I decided to make a beer recommender system. There were various reasons why I chose this project. First, I've been bartending for the last three years, I've become fairly knowledgeable about beer, and I have become a human recommender system in the process. Most customers are not very well-informed about the different styles of beers, and a recommender system could help people figure out which beers they would most prefer. Second, we only spent a couple of hours during the 12-week course on recommender systems, so I thought it would be a challenge to dig deeper into the area. Third, it became quite obvious while I was researching my various project ideas that it was almost impossible to find examples of recommender systems online that were not based on the MovieLens dataset. This sealed the deal for me, knowing that I would have to wrangle my own data and model without the benefit of a bevy of online examples.

Step One: Web Scraping

The first thing I needed to do was find my beer data. The first place I found was the website Brewery DB, an online database developed for brewers and web developers to power their apps. After purchasing an API key, I spent 4 days requesting beers from the website. The first obstacle I came across was that unless I purchased the most expensive API key, I did not have the ability to refine my scraping to breweries located in Ontario. So although I scraped 20,000+ beers from the site, after filtering down to beers from Ontario and Quebec, macro-breweries, and well-known US breweries with beers readily available in Toronto, I was only left with about 530 beers. Because of this, I decided to supplement my data by scraping the Top Rated 100 beers in Ontario list from the Beer Advocate website.

beer_adv

After scraping these two sites and concatenating the information I wrangled, I was left with 629 beers. The next thing I needed to do was create a 'user base'. To create this database of users, I relied on two methods. The first was to create and disseminate online a basic beer form comprised of 20 beers, representative of both different styles and micro and macro breweries.

beer_form beer_form2

Users were asked to rate the 20 beers on a scale from 1 to 10. In addition to this simple form, I was able to convince 14 of my friends (myself being number 15) to complete the entire list of 629 beers.

users

The final user/beer database had:

        -86 Beer Forms filled out
        -15 "Super-Users'
        -338 Beers (after I deleted beers with no ratings)

The hope was that, while the 86 regular users were only able to rate 20 beers, the super-users would bridge this sparsity: similarities would be found between regular users and super-users, so that beers the regular users had not tried could be recommended to them based on super-user ratings.

Step Two: Modeling

models

There are three types of Recommender Systems:

Content Based. Broadly speaking, this system is based on the premise that if you like one item, you will like another similar item. In terms of beer, if you like Guinness, you will most likely like other stouts, especially ones poured with a creamer.

Collaborative Based. This system focuses on the relationship between users and items. Strictly speaking, similar people like similar items. If persons A and B like Corona, Coors Light, and Heineken, and person A loves Sawdust City Little Norway, a collaborative-based recommender system would recommend Little Norway to person B.

Hybrid Systems. This system combines Content and Collaborative systems. For example, if person C likes imperial stouts and IPAs, and person D also likes imperial stouts but hates IPAs, a hybrid system would recommend other imperial stouts that person C liked to person D.

spotlight

Model 1: Spotlight Recommender Model (Collaborative)

For the first model I used the Spotlight recommender system, developed by Maciej Kula. Spotlight provides frameworks for sequential models as well as factorization models, both implicit and explicit. Due to the small dataset, the Spotlight model was not able to make accurate recommendations for users. The precision at k was .18 after tuning parameters. The recommendations seemed to be skewed by the data compiling decisions I had made: for non-super-users, it would only recommend the 20 beers that were part of the Beer Form.

spotlightcode

Model 2: KNN Model (Collaborative)

A KNN model is a non-parametric model that makes inferences for new samples by measuring the distance between the target beer and the beers closest to it. This model relies on user-item ratings.

knn

A regular KNN model tries to classify a new data point by measuring the distance between the new point and all other points and returning the K nearest neighbours. In a recommender system, the model is not trying to classify the new point; it is trying to find items with similar user-item interactions. The KNN model seemed to work quite well. The function below takes in a beer name and recommends 10 beers with similar user-item ratings.

knn
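
As a rough, self-contained sketch of the idea (not the project's actual code; the beers and ratings below are made up), the same kind of lookup can be built with scikit-learn's NearestNeighbors on a beer-by-user ratings matrix:

import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy ratings matrix: rows = beers, columns = users, NaN = not rated.
ratings = pd.DataFrame(
    {'user_1': [8, 7, None, 3], 'user_2': [9, None, 8, 2], 'user_3': [2, 3, 2, 9]},
    index=['Jutsu', 'Double Jutsu', 'Witchshark', 'Heineken'])

beer_vectors = ratings.fillna(0)  # unrated -> 0 so distances can be computed
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(beer_vectors.values)

def recommend_similar_beers(beer_name, n=3):
    """Return the n beers whose user-rating patterns are closest to beer_name."""
    idx = beer_vectors.index.get_loc(beer_name)
    distances, indices = knn.kneighbors(beer_vectors.iloc[[idx]].values, n_neighbors=n + 1)
    return [beer_vectors.index[i] for i in indices.flatten() if i != idx][:n]

print(recommend_similar_beers('Double Jutsu'))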

Using Bellwood's Double Jutsu as an example, the KNN model returned the following 10 beers as recommendations (shown below). This is rather impressive when taking into account that this model does not take styles into consideration. In addition to its regular IPA version, Jutsu, it also returned Bellwood's Witchshark, Farmageddon, and Roman Candle, three of their higher percentage beers.

fsbf

Model 3: Cosine Similarity Model (Content)

cosine

Cosine Similarity measures the similarity between two non-zero vectors. For example, two vectors that share the same orientation have a cosine similarity of 1. This method recommended beer based on similar styles. Due to the relatively small dataset, I personally mapped over 100 styles down to 31.

cc
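
As a minimal sketch of the content-based idea (the beers and styles below are hypothetical, not the project's mapping), each beer's style can be one hot encoded and then compared with cosine similarity:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical beers and styles for illustration only.
beers = pd.DataFrame({'style': ['Euro Lager', 'Euro Lager', 'IPA', 'Stout']},
                     index=['Heineken', 'Stella Artois', 'Jutsu', 'Guinness'])

style_matrix = pd.get_dummies(beers['style']).astype(float)  # rows = beers, columns = one-hot styles
similarity = pd.DataFrame(cosine_similarity(style_matrix),
                          index=beers.index, columns=beers.index)

# Beers most similar to Heineken, excluding Heineken itself.
print(similarity['Heineken'].drop('Heineken').sort_values(ascending=False))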

This model is, in theory, very simple: it returns beers that are similar in style. For example, if you ask for beers similar to Heineken, the function returns these beers:

rbf

Model 4: Kristi's Hybrid Model (oxybeeron.py)

My hybrid model (which is connected to the Flask app) provides 10 beer recommendations based on the collaboration of the three models above. All three models returned beer recommendations, to varying degrees of accuracy. Because of the small dataset, the Cosine Similarity model was the most accurate, so for the Flask app I decided to weight this model more heavily. However, anything that was recommended by Model 1 (Spotlight) or Model 2 (KNN), the two models based on the collaborative method, was automatically recommended, and the final recommendation spots were then filled out by Model 3 (Cosine Similarity). The Flask app allows the 'online' user to choose their preferences from a list of 10 beers, and the hybrid model then recommends beers based on those choices.

The following are two snippets of the code used to create the final recommendations. To see the entire code used for the project, please visit My GitHub

function1

function2

Step Three: Flask App

The final requirement of the project was to serve the app locally, and to try to make the app more interesting than just a white page with input boxes.

I decided to create four different templates, so that if the user did not know any of the beer options, they could refresh the page and get new options. As you can see below, the final Flask app takes 'inputs' from the site, and the two functions shown above provide 10 recommended beers based on the input.

flask_code

The final app looked like this:

flask_app_page

results

Step Four: Taking it further!

The more I developed this project, the more I came to realise that the sparseness of the data hindered what I could do with my modeling. While my final hybrid model recommended accurate beers using the three models (Spotlight, KNN, and Cosine Similarity), I began to realise that these three models could also be used in tandem for different configurations of data. What started as an attempt to recommend beer for my super-users as a thank you for their participation turned into another Super-User Model (also located on My GitHub). This hybrid model is an example of how the collaboration of these three models can be adapted to different datasets. It does not work on the entirety of the user base, but for those who filled out the entire Google Spreadsheet (my 'Super Users'), it is effective. The model relies on the top recommendations provided by Spotlight. It then fetches all the beers that a given user rated 8 or above, and uses the cosine similarity and KNN model functions to list content similarity and user-item interaction similarity to those specific beers. The final function takes in a user id number and then recommends the beers that are present both in the Spotlight recommendations and in the combination of the KNN and cosine similarity recommendations. This model effectively recommends beer based on similar user preferences, but also considers the user's preferences for style.

sf

sf2

Final Thoughts

The oxybeeron app was a nice introduction to recommender systems. It relied more heavily on, and was more effective with, the more basic models, like KNN and Cosine Similarity. This was simply because there was not enough data for Spotlight to do its magic. However, the combination of these three models provided a basic blueprint for a hybrid model that could be quite effective.

The effectiveness of the Cosine Similarity Model with only 31 beer styles makes me optimistic about its potential on a larger dataset, where the number of beer styles could be more expansive. For example, instead of "IPAs", it could include specific IPAs (Double IPAs, Session IPAs, Vermont-Style IPAs, Black IPAs, hazy IPAs, etc.).

Overall, it became painfully obvious that it's hard to find information and examples online for recommender systems that are not based on the MovieLens dataset. I think that for Spotlight to work to the best of its ability, my dataset would need to be much larger than 101 users and 338 beers. However, the improvement in its ability to recommend beer for my super users is promising. My Super-User Model shows that in the interim between a cold start situation and a 100K+ dataset (aka MovieLens), Spotlight can be quite effective if it is combined with other recommendation algorithms. Using these three models, in tandem, resulted in the ability to make solid recommendations for users in a smaller network.

Most Data Science questions deal with studying populations. A population is a set of similar items or events that are of interest for a question or an experiment. Since the task of measuring an entire population is frequently too expensive and impractical, we take samples, and make inferences about the whole population based on the statistics we find in the sample. In statistical inference we have four key concepts:

            -Samples
            -Statistics
            -Parameters
            -Populations

Generally speaking, statistics describe samples and parameters describe populations. Statistical inference is how we move from statistics taken on a sample to parameters about the whole population. For Frequentists, the two main ways we can generalize from a sample to a population are through Confidence Intervals and Hypothesis Tests. For Bayesians, probability is assigned to a hypothesis, where a prior probability is updated to a posterior probability with relevant data. This blog post will outline and compare these two theories in light of Data Science.

Frequentist Probability

Frequentists have three mathematical concepts that their experiments rely on - the null hypothesis, the associated p value, and the confidence interval. Frequentists have a null hypothesis (known as H0) and an alternate hypothesis (H1). For example, a scientist might run a drug trial where the subjects are split into two groups: those administered the drug and those administered a placebo. The null hypothesis would be the assumption that there is no difference between the two groups, and this would be the stance of the scientist. After this, the scientist would set a reasonable threshold for rejecting the null hypothesis. This threshold - α (alpha), a real number between 0 and 1 - is the significance level. The p value is the probability of observing a deviation from the null hypothesis which is greater than or equal to the one you observed; another way of thinking about it is that the p value is the probability of the data, given that the null hypothesis is true. The scientist will reject the null hypothesis if the p value is below α and not reject otherwise. Alpha is generally set to 0.05, which is considered standard practice.

The best way to explain the p value and the null hypothesis is with a drug efficacy example. Let's say that we've just created a new sleeping pill. We've randomly selected 100 people who have problems sleeping; 50 will be administered the sleeping pill, and 50 will be given a placebo. The group given the sleeping pill is the "experiment" group and the group given the placebo is the "control" group. In this situation, the null hypothesis is that there will be no difference in sleeping patterns - hours slept - between the two groups. The alternate hypothesis is that there will be a difference in hours slept between the two groups. The level of significance (α) will be .05; thus, we will reject the null hypothesis if the p value is below .05.

Let's simulate two groups - control and experiment - and their average hours slept per night during the trial.

import numpy as np

control = np.random.randint(2, 10, size=50)
print(control)
experiment = np.random.randint(4, 10, size=50)
print(experiment)
[8 5 7 7 4 5 6 2 6 3 3 8 6 6 7 5 8 3 6 7 4 2 8 3 3 4 9 4 5 3 8 8 5 2 9 3 8
 8 9 2 9 8 7 9 2 3 8 9 8 5]
[6 4 9 9 6 9 8 5 8 4 7 9 7 9 9 5 5 5 4 6 7 6 9 7 6 8 8 5 8 4 4 8 5 7 4 9 5
 5 9 6 5 7 4 8 8 4 9 5 5 7]
print(np.mean(control))
print(np.mean(experiment))
print(np.mean(experiment) - np.mean(control))
5.74
6.52
0.7799999999999994

The measure of difference is 0.78. What is the probability that we observed this data, assuming that the null hypothesis is true?

import scipy.stats as stats

stats.ttest_ind(experiment, control)
Ttest_indResult(statistic=1.8573147670624264, pvalue=0.06626937134706777)
t_stat, p_value = stats.ttest_ind(experiment, control)

The p value is .066, which is above our α of .05, so we fail to reject the null hypothesis; we cannot conclude that there is a difference in sleeping patterns - hours slept - between the experiment and control groups.

The third important key in Frequentist statistics is the confidence interval. Simply put, a confidence interval is a range of values in which the actual value is likely to fall. It represents the accuracy or precision of a given estimate. A confidence interval can be calculated as follows:

calcu1ation1
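
For reference, the interval computed in the code below is the standard two-sided formula:

$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

where $z_{\alpha/2} = 1.96$ for a 95% confidence interval.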

Let's simulate a population and a poll as an example. Say we live in a state, we are working for a newspaper, and we are trying to determine whether a certain proposition will pass. Unfortunately, we don't have the time, or the money, to run a poll that will reach out and question all 1 million voters. If we were to run an experiment, we would poll a sample of the population, and then use that sample to make inferences about the whole population. This is how a frequentist would tackle this situation. First, I will generate the state's population.

import pandas as pd
import numpy as np

np.random.seed(42)
population = np.random.binomial(n= 1, p= 0.60, size = 1_000_000) #simulating a population where 60% would vote yes.
population[0:10]
array([1, 0, 0, 1, 1, 1, 1, 0, 0, 0])
sample = np.random.choice(population, size = 500, replace = False)

np.mean(sample)
0.554
sample2 = np.random.choice(population, size = 500, replace = False)

np.mean(sample2)
0.614

In this scenario, the population is 1,000,000, the sample is 500 people, the statistics would be the percentage of people in the sample voting yes on the proposition, and the parameter is the true percentage of people voting yes on the proposition.

We at the newspaper do not know that 60% of the population would vote for this proposition to pass, but by looking at a sample of 500 voters, we can make educated inferences about how the entire population may vote. This is the Central Limit Theorem in action: a probability theorem which states that when independent random variables are added, their sum tends toward a normal distribution, even if the original variables are not normally distributed. It is this theorem that allows conclusions to be made about the whole population from a sample of it. Of the first 500 citizens who were polled, 55.4% responded that they would vote for the proposition to pass. A frequentist would then calculate the margin of error, creating the confidence interval by calculating the mean and the standard deviation and combining this with the associated Z score (1.96). The area under the standard normal distribution between -1.96 and +1.96 is 95%. The confidence interval is a set of likely values for the parameter of interest.

196

sample_mean = np.mean(sample)
sigma = np.std(sample)
n = len(sample)
sample_mean - 1.96 * sigma / (n ** 0.5)
lower = round(sample_mean - 1.96 * sigma / (n ** 0.5), 4)
higher = round(sample_mean + 1.96 * sigma / (n ** 0.5), 4)
f'I am 95% confident that the true population percentage of people who will vote yes on this proposition is between {lower} and {higher}'
'I am 95% confident that the true population percentage of people who will vote yes on this proposition is between 0.5104 and 0.5976'


Frequentist statistics allow us to make inferences about the greater population by using a smaller sample of the population and the statistics observed from that sample. The main critiques of Frequentist theory, however, are that the p value and the confidence interval depend on the sample size, and that Frequentist theory does not perform well on sparse data sets. Moreover, confidence intervals are not probability distributions. These flaws in Frequentist theory are remedied in Bayesian statistics.

BAYESIAN PROBABILITY

Bayesian statistics successfully apply probabilities to statistical problems. In addition to this, they provide the ability to update probabilities with the introduction of new data. The approach includes three components:

            -The Prior
            -The Likelihood
            -The Posterior

The Prior is our belief about parameters based on previous experience. This takes into account both previously compiled data and our own beliefs based on experience. The Likelihood is the specific observed data. The Posterior distribution combines the prior and the likelihood: it is proportional to the Likelihood multiplied by the Prior. The Posterior distribution is calculated by the following:

calculation
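
In other words, by Bayes' theorem the posterior is the likelihood times the prior, normalized by the evidence:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)} \propto P(D \mid \theta)\,P(\theta)$$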

For the sake of comparison with Frequentist theory, you could say that the Bayes factor is the equivalent of the p value. The Bayes factor is the ratio of the probability of one hypothesis to the probability of another hypothesis - for instance, in our sleeping pill example, the ratio of the probability of improved sleeping in the control group vs. that of the experiment group.

The Highest Density Interval (also known as the Credible Interval) is the Bayesian equivalent of the Confidence Interval. The Credible Interval is formed by the posterior distribution. While the Confidence Interval comes from a procedure that produces intervals, 95% of which contain the true parameter, the Credible Interval is a single interval that has a 95% probability of containing the true parameter. The Credible Interval is independent of intentions and sample size. This is good for Data Science, because if we have few data points, we can use strong priors.
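
As a small sketch of how the Credible Interval falls directly out of the posterior (using the conjugate Beta-Binomial setup from the polling example further below, where the prior is Beta(200, 800) and the data are 300 yes votes out of 500 polled), the 95% interval can be read straight off the posterior distribution:

from scipy.stats import beta

alpha_post = 200 + 300          # prior alpha + observed successes
beta_post = 800 + (500 - 300)   # prior beta + observed failures
lower, upper = beta(alpha_post, beta_post).interval(0.95)
print(f'95% credible interval: {lower:.3f} to {upper:.3f}')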

Now let's look at the same polling example from above, performed with Bayesian inference. First we will create our prior. In this case, let's pretend that, based on other statistics we've seen and previous voting habits in this state, we have a very strong prior belief that 20% of the state will vote yes on the proposition.

import matplotlib.pyplot as plt
from scipy.stats import beta
from scipy.stats import binom


alpha_prior = 200
beta_prior = 800

x = np.linspace(0, 1, 500)
prior = beta(alpha_prior, beta_prior)

plt.figure(figsize=(4,4))
plt.plot(x, prior.pdf(x), lw=2);
plt.title('Prior Belief', fontsize=20)
plt.xlabel('Values of P', fontsize=16)
plt.show();

png

Now we will incorporate our Likelihood and calculate the Posterior based on the Prior and the Likelihood. The Likelihood, calculated by n_trials and n_successes, will be equal to the data that we "collected" above in the Frequentist example.

plt.figure(figsize=(10, 8))

x = np.linspace(0,1,500)
a = 200
b = 800
n_trials = 500
n_successes = 300 #copying the data from the sample used above

prior = beta(a, b).pdf(x)
likelihood = binom(n_trials, x).pmf(k=n_successes)
posterior = prior * likelihood

plt.vlines([(a - 1) / (a + b - 2), n_successes / n_trials, (a + n_successes - 1) / (a + b + n_trials - 2)],
               ymin = 0,
               ymax = max(max(prior),max(likelihood), max(posterior)), 
               linestyles = 'dashed',
               colors = ['tab:red', 'tab:blue', 'tab:purple'])
plt.yticks([])


plt.title("Prior, Likelihood, and Posterior Modes", fontsize = 18)
plt.xlabel("Values of P", fontsize = 16)
plt.plot(x, prior, c = 'tab:red', label="Prior")
plt.plot(x, likelihood, c = 'tab:blue', label="Likelihood")
plt.plot(x, posterior, c = 'tab:purple', label='Posterior')
plt.legend(loc = 1, fontsize=14);   

png

As you can see in the graph above, we have set our prior (a=200, b=800) very strong, so that even though the actual data collected showed around 60% voting yes on the proposition, the strength of our prior has pulled the posterior mean to around 35%. If our prior beliefs were not strong, then we would set our alpha and beta much lower. In the following model, I will set them at 2 and 8 to show the effect.

plt.figure(figsize=(10, 8))

x = np.linspace(0,1,500)
a = 2
b = 8
n_trials = 500
n_successes = 300 #copying the data from the sample used above

prior = beta(a, b).pdf(x)
likelihood = binom(n_trials, x).pmf(k=n_successes)
posterior = prior * likelihood

plt.vlines([(a - 1) / (a + b - 2), n_successes / n_trials, (a + n_successes - 1) / (a + b + n_trials - 2)],
               ymin = 0,
               ymax = max(max(prior),max(likelihood), max(posterior)), 
               linestyles = 'dashed',
               colors = ['tab:red', 'tab:blue', 'tab:purple'])
plt.yticks([])


plt.title("Prior, Likelihood, and Posterior Modes", fontsize = 18)
plt.xlabel("Values of P", fontsize = 16)
plt.plot(x, prior, c = 'tab:red', label="Prior")
plt.plot(x, likelihood, c = 'tab:blue', label="Likelihood")
plt.plot(x, posterior, c = 'tab:purple', label='Posterior')
plt.legend(loc = 1, fontsize=14);   

png

As you can see, with a weaker prior, the posterior is pulled to the left toward the prior, but only slightly; the posterior is being dominated by the data. A perfect example of when you would set a weak prior is the sleeping pill clinical trial from the first example above. This makes the point that in Bayesian statistics, a strong prior is not necessary to draw a valuable conclusion (although we could specify a weak prior based on expert analysis in the field). In the sleeping pill example, the p value was 0.066, and this was interpreted as insufficient evidence that the sleeping pill worked. There was clearly a difference observed between the two groups, but it was not judged to be a sufficient difference in the standard Frequentist approach. Instead of testing whether two groups are different, Bayesians attempt to measure the difference between the two groups. A topic for a later blog!!!

CONCLUSION

While Frequentists would say that Bayesians are wrong to incorporate prior beliefs, the decision not to include your prior beliefs is already a judgment call about the world. Including priors results in more accurate distributions, because you are not including probabilities for data points that you know from common sense are not possible. Furthermore, including priors allows us to incorporate expert opinions and specific knowledge into our models. When our model makes a prediction, it provides a distribution of likely answers. These predictions are highly interpretable and easier to understand than p values and Confidence Intervals. Finally, if we do not have strong prior beliefs about certain areas, we can always set our priors to be weak. Our posterior will then be dominated by our data, and not our prior.

Boosting is an ensemble technique where new models are created to correct the errors of past models. Each subsequent tree learns from the errors of its predecessor, as trees are added sequentially until no further improvements can be made. Gradient boosting is an approach where new models are fit to the residuals (errors) of prior models and then added together to make the final prediction. The base learners in gradient boosting are weak learners, but each contributes vital information to the final model. Two popular implementations are XGBoost and CatBoost, open-source software libraries that provide a gradient boosting framework.
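
As a minimal sketch of the idea (toy data and a squared-error loss, not either library's implementation), each new tree is fit to the residuals of the current ensemble and added with a learning rate:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, 200)

learning_rate = 0.1
prediction = np.zeros_like(y_toy)  # start from a constant (zero) prediction
trees = []
for _ in range(50):
    residuals = y_toy - prediction  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)  # fit a weak learner to the residuals
    prediction += learning_rate * tree.predict(X_toy)  # add its (damped) contribution
    trees.append(tree)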

To compare these two programs, I will load a dataset called "Skyserver", and I will attempt to predict whether an image is a Star, Galaxy, or Quasar. I will begin with XGBoost.

import pandas as pd 
import numpy as np
from numpy import loadtxt
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn_pandas import DataFrameMapper, FunctionTransformer
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.metrics import mean_squared_error

import catboost as cb
from catboost import CatBoostClassifier, cv, Pool
import shap

import xgboost as xgb
from xgboost import plot_tree, XGBClassifier, plot_importance


sky = pd.read_csv('Skyserver.csv')

The first thing I do with a dataset is some basic EDA. This includes checking whether there are null values: sky.isnull().sum(). The Skyserver dataset is very clean and there are no null values. However, even if there were null values, XGBoost has a built-in algorithm to handle them: it automatically learns the best imputation (default direction) for missing values during training. This is one key advantage of using XGBoost.

Before I am ready to model, I will prepare my dataset by labeling the 3 classes in the target column - 'star', 'galaxy', 'qso'. Then I will drop unnecessary columns and assign my y (target) and X (features).

sky['class'] = sky['class'].map({'STAR': 0, 'GALAXY': 1, 'QSO': 2})
sky = sky.drop(columns=['objid', 'rerun', 'mjd']) 
#drop 'objid' because irrelevant to model and 'rerun' because it has the same value for each row.
y = sky['class']
X = sky.drop('class', axis=1)
data_dmatrix = xgb.DMatrix(data=X, label=y) 
#DMatrix is an internal data structure used in XGBoost. It is recommended as it is optimal for memory and speed.
/anaconda3/lib/python3.6/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
/anaconda3/lib/python3.6/site-packages/xgboost/core.py:588: FutureWarning: Series.base is deprecated and will be removed in a future version
  data.base is not None and isinstance(data, np.ndarray) \

Having completed basic EDA, assigned our X and y, and created a DMatrix, we are ready to split our data into training and testing sets and use DataFrameMapper, StandardScaler, and LabelBinarizer to prepare our data. We do not need to worry about imputing at this point because, in addition to the dataset being clean of missing values, XGBoost has a built-in way of dealing with missing values, as stated above.

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

Obviously, a key difference between XGBoost and CatBoost is how each deals with categorical columns. I'll discuss CatBoost later, but for now, as is shown below, categorical columns need to be one hot encoded for XGBoost, using LabelBinarizer. Knowing that creating dummy columns could, in many instances, expand datasets to unimaginable sizes, the XGBoost developers introduced a built-in method to deal with this, called Automatic Sparse Data Optimization. The developers chose to handle this in the same way that they handled missing values: by making the algorithm aware of the sparsity pattern in the dataset. By only visiting the present values, and not the missing values and zeroes, it exploits the sparsity and learns the best directions to handle missing values. It handles all cases of sparsity in a unified way.

mapper = DataFrameMapper([
    (['ra'], StandardScaler()),
    (['dec'], StandardScaler()),
    (['u'], StandardScaler()),
    (['g'], StandardScaler()),
    (['r'], StandardScaler()),
    (['i'], StandardScaler()),
    (['z'], StandardScaler()),
    (['run'], StandardScaler()),
    ('camcol', LabelBinarizer()),
    ('field', LabelBinarizer()),
    (['specobjid'], StandardScaler()),
    (['redshift'], StandardScaler()),
    ('plate', LabelBinarizer()),
    ('fiberid', LabelBinarizer()),
], default=None, df_out=True)

Z_train = mapper.fit_transform(X_train)
Z_val = mapper.transform(X_val)

import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings("ignore", category=DataConversionWarning)

Now it's time to model, predict, and score!

model1 = xgb.XGBClassifier(
    n_estimators = 500, #How many trees.
    num_class = 3, #Identifying there are 3 classes in our target
    objective = 'reg:logistic', #Objective refers to which model you want to use.
    eval_set='Accuracy', #Eval_set refers to how you want your model to be evaluated.
    learning_rate=0.1, #Learning_rate has to do with how much you want each tree to learn.
    early_stopping_rounds=10, #stop after ten iterations if no improvements have been made.            
    max_depth=3, #Smaller trees are more interpretable
    silent=False) #Silent deals with whether we want information for each iteration to be printed out.
model1.fit(Z_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, early_stopping_rounds=10, eval_set='Accuracy',
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=500, n_jobs=1,
       nthread=None, num_class=3, objective='multi:softprob',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=False, subsample=1)
preds = model1.predict(Z_val)
mean_squared_error(y_val, preds)
0.01

Another key feature is that there is a built-in cross validation. This runs a cross validation at each iteration of the boosting process.

params = {'colsample_bytree': '0.3', #the fraction of columns to be randomly sampled for each tree.
        'learning_rate': '0.2', 
        'max_depth': '5', #the maximum depth of a tree
        'alpha': '10'} #L1 regularization weight - Lasso

cross_val_results = xgb.cv(dtrain=data_dmatrix, params=params) #this is where the dmatrix comes in handy.
cross_val_results.head()
train-rmse-mean train-rmse-std test-rmse-mean test-rmse-std
0 0.615842 0.001887 0.617878 0.001618
1 0.514400 0.023372 0.517281 0.024252
2 0.456278 0.013461 0.460037 0.015424
3 0.420923 0.006550 0.426109 0.009830
4 0.385381 0.030245 0.391525 0.033427
print((cross_val_results["test-rmse-mean"]).tail(1))
9    0.281167
Name: test-rmse-mean, dtype: float64
xgreg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=15)

A benefit of using boosting is that it provides you with feature importance: a score which indicates how valuable a feature was in the construction of the boosted decision trees. The more a feature is used to make key decisions in the trees, the higher its importance. Importance is calculated as the amount that each split point improves the overall tree, weighted by the number of observations it is responsible for.

fig, ax = plt.subplots(figsize=(18, 18))
plot_importance(model1, ax=ax)
plt.show()

png

As can be seen in the chart, redshift is by far the most important feature.

XGBoost also provides the ability to peek into your model and inspect a tree.

fig, ax = plt.subplots(figsize=(20, 20))
plot_tree(xgreg, num_trees=4, ax=ax)
plt.show()

png

Above you can see how a decision tree on this data looks. You can see that it is asymmetrical, which is an important point of comparison with CatBoost's trees. One interesting aspect of this tree is that you can see the algorithm has chosen to send all 'missing' values in the 'yes' direction.

According to XGBoost's documentation, website, and papers by its creator, Tianqi Chen, in addition to the features described above, XGBoost's power also comes from its 'Block structures for parallel learning', 'Cache Awareness', and 'Out of Core Computing'. Data is sorted and stored in memory units called blocks. This allows the data to easily be re-used in later iterations, instead of having to be computed again. The data in each block is stored in a compressed column format, and with one scan of the block, the statistics for the split candidates can be found. Cache awareness refers to internal buffers where gradient statistics can be stored. Out of core computing allows optimized disk usage when dealing with datasets that are too large to fit in memory.

The one area where XGBoost is lacking is in feature engineering and hyper-parameter tuning. Chen and the XGBoost team, however, believe this is fine, as these areas are deeply integrated within coding programs. After reading the documentation and several data science blogs on boosting systems, I see that there is the ability to grid search to find the best parameters. However, considering these boosting frameworks are designed for extremely large datasets, grid searching one of these files would be computationally expensive and time consuming (running a basic cross-validation took almost 15 minutes). In order to improve a model, hyper-parameter tuning is necessary, but it's hard to determine which parameters should be tuned and what the ideal value of each should be.
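
As a rough sketch of what such a search might look like (the grid values are purely illustrative; on a large dataset this is exactly the expensive step described above):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.2],
    'n_estimators': [100, 500]
}

search = GridSearchCV(xgb.XGBClassifier(), param_grid=param_grid, cv=3, scoring='accuracy')
search.fit(Z_train, y_train)
print(search.best_params_)
print(search.best_score_)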

CatBoost

CatBoost is another open-source library for gradient boosting on decision trees. CatBoost's claim to fame is its categorical feature support: instead of having to take care of one-hot-encoding while pre-processing the data, the CatBoost algorithm handles categorical columns while training.

I'm now going to use DataFrameMapper and StandardScaler to prepare the rest of the data. Following this, I will pass the categorical features, while fitting the model.

mapper2 = DataFrameMapper([
    (['ra'], StandardScaler()),
    (['dec'], StandardScaler()),
    (['u'], StandardScaler()),
    (['g'], StandardScaler()),
    (['r'], StandardScaler()),
    (['i'], StandardScaler()),
    (['z'], StandardScaler()),
    (['run'], StandardScaler()),
    (['specobjid'], StandardScaler()),
    (['redshift'], StandardScaler()),
], default=None, df_out=True)

Z_train2 = mapper2.fit_transform(X_train)
Z_val2 = mapper2.transform(X_val)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
# categorical_features = list(range(0, X.shape[1]))
categorical_features = np.where(Z_train2.dtypes == int)[0]
print(categorical_features)
[10 11 12 13]
model2 = CatBoostClassifier(
    iterations=100,
    learning_rate=0.5,
    early_stopping_rounds = 10,
    loss_function = 'MultiClass', #For 2-class classification, we should use "Logloss" or "Entropy"
    custom_metric='AUC',
    use_best_model=True,
    verbose=5 #information is printed out at every fifth iteration
)
model2.fit(
    Z_train2, y_train,
    cat_features=categorical_features,
    eval_set=(Z_val2, y_val),
    plot=False #This should be True. But the printout is not compatible with this blog framework
)
0:  learn: -0.4231271   test: -0.4268821    best: -0.4268821 (0)    total: 108ms    remaining: 10.7s
5:  learn: -0.0695255   test: -0.0717988    best: -0.0717988 (5)    total: 447ms    remaining: 7s
10: learn: -0.0461552   test: -0.0538007    best: -0.0538007 (10)   total: 947ms    remaining: 7.67s
15: learn: -0.0407958   test: -0.0510322    best: -0.0508512 (14)   total: 1.46s    remaining: 7.68s
20: learn: -0.0350664   test: -0.0461989    best: -0.0461989 (20)   total: 1.99s    remaining: 7.48s
25: learn: -0.0315475   test: -0.0459406    best: -0.0459406 (25)   total: 2.47s    remaining: 7.03s
30: learn: -0.0272609   test: -0.0449078    best: -0.0449078 (30)   total: 2.96s    remaining: 6.6s
35: learn: -0.0247762   test: -0.0422198    best: -0.0421584 (34)   total: 3.52s    remaining: 6.26s
40: learn: -0.0233293   test: -0.0422688    best: -0.0421584 (34)   total: 4.09s    remaining: 5.88s
Stopped by overfitting detector  (10 iterations wait)

bestTest = -0.04215835272
bestIteration = 34

Shrink model to first 35 iterations.





<catboost.core.CatBoostClassifier at 0x1c1e62d588>

newplot1

A great feature of CatBoost is that it prints out training information at the interval specified in your params (above, verbose=5).

model2.tree_count_
35
model2.predict_proba
print(model2.predict_proba(data=Z_val2))
[[4.01704757e-04 9.80671060e-01 1.89272355e-02]
 [3.08763482e-04 9.96511522e-01 3.17971463e-03]
 [3.89049775e-04 9.98622853e-01 9.88097567e-04]
 ...
 [9.97135061e-01 2.57287349e-03 2.92065282e-04]
 [9.96680305e-01 2.99639694e-03 3.23297602e-04]
 [2.87351792e-04 9.95101706e-01 4.61094214e-03]]
predictions2 = model2.predict(data=Z_val2)

Like XGBoost, CatBoost also includes an early stopping round. In my model, I set this parameter to 10, which means that if no improvement has been made after 10 trees, the model will stop running. This helps prevent overfitting. CatBoost, like XGBoost, also has a built-in cross-validation.

params = {'loss_function': 'MultiClass',
          'custom_loss': 'AUC'
         }

cross_val_results2 = cv(
            params = params,
            pool = Pool(Z_train2, y_train, cat_features=categorical_features),
            fold_count = 5, #number of folds to split the data into.
            inverted = False,
            shuffle = True, #data is randomly shuffled before splitting.
            partition_random_seed = 0,
            stratified = False,
            verbose = False)
cross_val_results2.head()
iterations test-MultiClass-mean test-MultiClass-std train-MultiClass-mean train-MultiClass-std test-AUC:class=0-mean test-AUC:class=0-std test-AUC:class=1-mean test-AUC:class=1-std test-AUC:class=2-mean test-AUC:class=2-std
0 0 -1.042713 0.000341 -1.042747 0.000093 0.992025 0.003681 0.985580 0.003899 0.982720 0.005111
1 1 -0.992463 0.001261 -0.992504 0.001639 0.997906 0.001791 0.992262 0.002721 0.984138 0.005245
2 2 -0.946050 0.000323 -0.945929 0.000795 0.997925 0.001205 0.993606 0.002700 0.988726 0.003737
3 3 -0.903349 0.000100 -0.903072 0.000699 0.998220 0.001502 0.993857 0.003076 0.988547 0.003624
4 4 -0.863164 0.000607 -0.863020 0.000325 0.998217 0.001357 0.994078 0.002699 0.988690 0.003612

Also similar to XGBoost, CatBoost provides the user with feature importance. By calling 'get_feature_importance', the user can see what the model learned about the features and which of them have the greatest influence on the decision trees.

fstrs = model2.get_feature_importance(prettified=True)
{feature_name : value for  feature_name, value in fstrs}
{'redshift': 78.69840467429411,
 'i': 8.742787892612705,
 'u': 3.487814704869871,
 'specobjid': 2.0436652601559193,
 'plate': 1.8955707345778037,
 'z': 1.42498154722369,
 'field': 0.9011905403991105,
 'run': 0.7856686087514119,
 'fiberid': 0.48419622387041755,
 'dec': 0.40877000998928603,
 'g': 0.3113040809109803,
 'ra': 0.3112717293491661,
 'r': 0.25950045766926433,
 'camcol': 0.24487353532627085}

While XGBoost's plots can be made with matplotlib, CatBoost's equivalent is provided by SHAP.

explainer = shap.TreeExplainer(model2)
shap_values = explainer.shap_values(Pool(Z_train2, y_train, cat_features=categorical_features))
/anaconda3/lib/python3.6/site-packages/catboost/core.py:1697: UserWarning: 'fstr_type' parameter will be deprecated soon, use 'type' parameter instead
  warnings.warn("'fstr_type' parameter will be deprecated soon, use 'type' parameter instead")
The model has complex ctrs, so the SHAP values will be calculated approximately.
explainer.expected_value

[0.10654885020023455, 1.5818176696303439, -1.6883665198535525]

shap.summary_plot(shap_values, Z_train2, plot_type="bar")

png

Above we've built a basic model and discovered that redshift is the feature with the greatest effect on the decision trees. One of the best features of CatBoost is that we can now fine-tune parameters in order to speed up the model.

fast_model = CatBoostClassifier(
    random_seed=62,
    iterations=150, #The default is 1000 trees. Lowering the number of trees can speed up the model. 
    learning_rate=0.01, #Typically, if you lower the number of trees, you should raise the learning rate.
    boosting_type='Plain', #The default is Ordered. This prevents overfitting but is expensive in computation.
    bootstrap_type='Bernoulli', #The default is Bayesian. Bernoulli is faster.
    subsample=0.5, #less than 1 is optimal for speed.
    rsm=0.5, #random subspace method speeds up the training
    leaf_estimation_iterations=5, #for smaller datasets this should be set at 1 or 5.
    max_ctr_complexity=1)
#We could also set one_hot_max_size to larger than the default 2. The default 2 means that any category that has 2
#or fewer values will be one hot encoded. The rest will be computed by CatBoost's algorithm. This is more computationally
#expensive.
fast_model.fit(
    Z_train2, y_train,
    cat_features=categorical_features,
    verbose=False,
    plot=False
)
<catboost.core.CatBoostClassifier at 0x1c1e615080>

newplot2

In addition to the above parameters that were adjusted to speed up the model's performance, there are also more parameters that can be used to fine-tune the model.

tuned_model = CatBoostClassifier(
    iterations=500,
    depth = 6, # how deep trees will go. Usually best between 1 and 10
    learning_rate=0.03,
    l2_leaf_reg=3, #regularizer value for Ridge regression
    bagging_temperature=1,
    one_hot_max_size=2, #2 is the default value. 
    leaf_estimation_method='Newton',
    loss_function='MultiClass'
)
tuned_model.fit(
    Z_train2, y_train,
    cat_features=categorical_features,
    verbose=False,
    eval_set=(Z_val2, y_val),
    plot=False
)
<catboost.core.CatBoostClassifier at 0x1c400a10b8>

[CatBoost training plot]

final_model = CatBoostClassifier(
    random_seed=63,
    iterations=int(tuned_model.tree_count_),
)
final_model.fit(
    Z_train2, y_train,
    cat_features=categorical_features,
    verbose=100
)
Learning rate set to 0.055965
0:  learn: 0.5431275    total: 61.7ms   remaining: 30.8s
100:    learn: 0.0057702    total: 6.14s    remaining: 24.3s
200:    learn: 0.0032198    total: 11.3s    remaining: 16.8s
300:    learn: 0.0021840    total: 16.3s    remaining: 10.8s
400:    learn: 0.0016674    total: 21.1s    remaining: 5.21s
499:    learn: 0.0013561    total: 25.5s    remaining: 0us





<catboost.core.CatBoostClassifier at 0x1c1e2a59e8>
print(final_model.get_best_score())
{'learn': {'Logloss': 0.0013560782767138484}}

The greatest benefit of CatBoost is its ability to handle categorical features. With other boosting software, we need to preprocess the data ourselves. This could be done using LabelBinarizer (as shown above in the XGBoost section). Although DataFrameMapper makes this task more manageable, in some cases it may lead to datasets that are unimaginably large. In addition to combination features (combining columns), which are converted to number values (these combination features are not used in the first split, but all will be used in later splits), CatBoost has a special formula to convert categorical features into numbers. With CatBoost and its categorical features, the data is shuffled and a mean is calculated for every object on its historical data.

ctr_i = (countInClass + prior) / (totalCount + 1)

CountInClass is how many times the label value was equal to i for objects with the current categorical feature value.

TotalCount is the total number of objects already visited that have the same feature value as the one being visited.

Prior is a constant defined by the starting parameters.
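To make the formula concrete, here is a small made-up calculation (the numbers are illustrative and not taken from the SkyServer data):

# Suppose the current categorical value has been seen 7 times in the objects
# visited so far, 3 of those with the label value in question, and the prior
# (set by the starting parameters) is 0.05.
count_in_class = 3
total_count = 7
prior = 0.05

ctr = (count_in_class + prior) / (total_count + 1)
print(round(ctr, 3)) # 0.381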

When it comes to missing values, CatBoost also deals with these during training. The default setting is 'Min', which means that missing values are processed as the minimum value for the feature. It is also guaranteed that a split that separates all missing values from all other values is considered when selecting trees.
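This behaviour is controlled through the nan_mode parameter. A minimal sketch (shown for illustration, not tuned for this dataset):

# 'Min' (the default) treats missing values as smaller than every observed value,
# 'Max' treats them as larger, and 'Forbidden' raises an error if NaNs are present.
nan_model = CatBoostClassifier(nan_mode='Min')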

Conclusion

In addition to categorical features, another main difference between XGBoost and CatBoost is the type of decision trees used. XGBoost uses asymmetric trees (which could be viewed above) while CatBoost uses oblivious trees as base predictors. An asymmetric tree makes splits that will learn the most. Oblivious trees simply means that the same feature is used for splitting across all intermediate nodes within the same level of the tree.

Overall, I find it hard to choose the best boosting system, because both do a great job and offer great features to work with. This can be seen in their use in winning Kaggle competition entries. However, I find CatBoost's ability to accept categorical features directly, instead of requiring one-hot encoding, novel. It saves time in preparing the data to be modeled, and although more computationally expensive at first, I think with larger datasets, where one-hot encoding could swell the dataset to enormous sizes, it would be more effective. I like the fact that it provides both options when you're fitting a model. The default is 2, which means that if a categorical column has two or fewer categories it will be one-hot encoded, and if it has more than 2, CatBoost will run its built-in calculation for categorical columns. Either way, categorical columns are processed internally when training the model. Thus, even if you prefer categorical columns to be one-hot encoded, you can set the parameters accordingly and not have to preprocess these columns.

It's hard to choose between the two libraries because I think they both excel with larger datasets than the one I used. However, with my computer and the SkyServer dataset, CatBoost outperformed XGBoost in speed, and I found the ability to fine-tune and speed up the model more user-friendly and intuitive. I also found the instant plotting more engaging and the shap add-on more visually appealing.

pd.get_dummies is a pandas function that transforms a column containing categorical variables into a series of new columns composed of 1s and 0s (true and false). The significance of this process is that in order for any model or algorithm to process categorical features, they must first be transformed into a numerical representation. Why not assign each category a number between 1 and 10 and keep them in one column? Because that would impose an order and a magnitude on the categories that don't actually exist, and an algorithm would misinterpret those values. Multiple columns composed of 1s and 0s are preferred over a single numeric column because no category ends up weighted improperly.
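To see why, here is a tiny sketch with a made-up colour column (hypothetical data, not part of the hockey dataset used below):

import pandas as pd

colours = pd.DataFrame({'colour': ['red', 'blue', 'green', 'blue']})

# A single numeric code implies an order and a magnitude that don't exist:
colours['colour_code'] = colours['colour'].map({'red': 1, 'blue': 2, 'green': 3})

# Dummy columns avoid that: each category gets its own 0/1 indicator.
pd.get_dummies(colours['colour'])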

To show how pd.get_dummies works, I will create a dictionary of hockey players and transform it into a DataFrame:

from IPython.display import HTML
import pandas as pd
hockey_players = ({

    'name': 'Sidney Crosby',
    'age': '31',
    'team': 'Pittsburgh Penguins',
    'position': 'C',
    'nationality': 'Canadian',
    'hart': 'yes',
    'stanley_cup': 'yes'
},
{
    'name': 'Nikita Kucherov',
    'age': '25',
    'team': 'Tampa Bay Lightning',
    'position': 'RW',
    'nationality': 'Russian',
    'hart': 'no',
    'stanley_cup': 'no'
},
{
    'name': 'Connor McDavid',
    'age': '22',
    'team': 'Edmonton Oilers',
    'position': "C",
    'nationality': 'Canadian',
    'hart': 'yes',
    'stanley_cup': 'no'
},
{
    'name': 'Mikko Rantanen',
    'age': '22',
    'team': "Colorado Avalanche", 
    'position': 'RW',
    'nationality': 'Finnish',
    'hart': 'no',
    'stanley_cup': 'no'
},
{
    'name': 'Alex Ovechkin',
    'age': '33',
    'team': 'Washington Capitals',
    'position': 'LW',
    'nationality': 'Russian',
    'hart': 'yes',
    'stanley_cup': 'yes'
},
{
    'name': 'Braden Holtby',
    'age': '29',
    'team': 'Washington Capitals',
    'position': 'G',
    'nationality': 'Canadian',
    'hart': 'no',
    'stanley_cup': 'yes'
},
{
    'name': 'Brent Burns',
    'age': '34',
    'team': 'San Jose Sharks',
    'position': 'D',
    'nationality': 'Canadian',
    'hart': 'no',
    'stanley_cup': 'no'
},
{
    'name': 'John Tavares',
    'age': '28',
    'team': 'Toronto Maple Leafs',
    'position': 'C',
    'nationality': 'Canadian',
    'hart': 'no',
    'stanley_cup': 'no'
}, 
{
    'name': 'Mark Scheiele',
    'age': '26',
    'team': 'Winnipeg Jets',
    'position': 'C',
    'nationality': 'Canadian',
    'hart': 'no',
    'stanley_cup': 'no'
},
{
    'name': 'Carey Price',
    'age': '31',
    'team': 'Montreal Canadians',
    'position': 'G',
    'nationality': 'Canadian',
    'hart': 'yes',
    'stanley_cup': 'no'
},
{
    'name': 'Morgan Rielly',
    'age': '25',
    'team': 'Toronto Maple Leafs',
    'position': 'D',
    'nationality': 'American',
    'hart': 'no',
    'stanley_cup': 'no'

},
{
    'name': 'Nathan MacKinnon',
    'age': '23',
    'team': 'Colorado Avalanche',
    'position': 'C',
    'nationality': 'Canadian',
    'hart': 'no',
    'stanley_cup': 'no'
},
{
    'name': 'Evgeni Malkin',
    'age': '32',
    'team': 'Pittsburgh Penguins',
    'position': 'C',
    'nationality': 'Russian',
    'hart': 'yes',
    'stanley_cup': 'yes'
})
hockey_players = pd.DataFrame(hockey_players)
hockey_players = hockey_players[['name', 'age', 'nationality', 'team', 'position', 'stanley_cup', 'hart']]
hockey_players.head()

HTML(hockey_players.head(5).to_html(classes="table table-stripped table-hover"))
name age nationality team position stanley_cup hart
0 Sidney Crosby 31 Canadian Pittsburgh Penguins C yes yes
1 Nikita Kucherov 25 Russian Tampa Bay Lightning RW no no
2 Connor McDavid 22 Canadian Edmonton Oilers C no yes
3 Mikko Rantanen 22 Finnish Colorado Avalanche RW no no
4 Alex Ovechkin 33 Russian Washington Capitals LW yes yes

Above is a simple dataset consisting of 13 National Hockey League players. For each player, we are provided with:

    -name 
    -age 
    -nationality 
    -team 
    -position 
    -whether he has won a Stanley Cup 
    -whether he has won the Hart trophy.

For the purpose of this blog, let's imagine that this is a much larger dataset consisting of all NHL players. As I mentioned above, in order to compare categorical variables, we need to transform them into 1s and 0s.

If a column is categorical and based on two values, for example 'yes' and 'no', this can be as simple as:

hockey_players['stanley_cup'] = hockey_players['stanley_cup'].map({'yes': 1, 'no': 0})
                            OR
hockey_players['hart'] = hockey_players['hart'].replace(['yes', 'no'], [1, 0])
hockey_players.head()
name age nationality team position stanley_cup hart
0 Sidney Crosby 31 Canadian Pittsburgh Penguins C 1 1
1 Nikita Kucherov 25 Russian Tampa Bay Lightning RW 0 0
2 Connor McDavid 22 Canadian Edmonton Oilers C 0 1
3 Mikko Rantanen 22 Finnish Colorado Avalanche RW 0 0
4 Alex Ovechkin 33 Russian Washington Capitals LW 1 1

When a column is more complex than that, this is when someone would reach for pd.get_dummies.

Say we thought that the nationality of a player was indicative of performance. In order to use this categorical value in a model, we would need to transform these values into 1s and 0s.

Now I am going to show how to use pd.get_dummies to transform the nationality column into multiple columns that correspond with the nationality of each player.

dummy = pd.get_dummies(hockey_players['nationality'])
dummy.head()
American Canadian Finnish Russian
0 0 1 0 0
1 0 0 0 1
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1

The function provided us with 4 columns, each representing 1 of the 4 nationalities in the dataset. In order to compare this information to the original dataset, we will concatenate it onto the original DataFrame. While we are doing that, we will drop the original nationality column, as well as the name column, since each player is already identified by an id number.

hockey_players = pd.concat([hockey_players, dummy], axis=1)
hockey_players = hockey_players.drop(columns=['nationality'])
hockey_players = hockey_players.drop(columns=['name'])
hockey_players.head()
age team position stanley_cup hart American Canadian Finnish Russian
0 31 Pittsburgh Penguins C 1 1 0 1 0 0
1 25 Tampa Bay Lightning RW 0 0 0 0 0 1
2 22 Edmonton Oilers C 0 1 0 1 0 0
3 22 Colorado Avalanche RW 0 0 0 0 1 0
4 33 Washington Capitals LW 1 1 0 0 0 1

At first sight, this seems like a very useful tool provided by pandas. We can now compare the nationality of each player to the values in the other columns.

However, there is one very large downside to using pd.get_dummies, which I will demonstrate below.

Imagine that we were attempting to see whether all these factors could be used to make a model that predicted whether a player would win the Hart trophy. First we will identify our features and our target.

features = [cols for cols in hockey_players if cols != 'hart']

X = hockey_players[features]
y = hockey_players['hart']

Since the dataset is clean (no missing values), the next thing we will do is split the data into training and testing sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

Now we have two sets of data:

-One to train our model on (X_train)
-One to test our final model on (X_test)

Next we look at our X_train and we see that we will need to use pd.get_dummies on the categorical columns for position and team.

X_train_with_dummy = pd.get_dummies(X_train, columns=['team', 'position'], drop_first=True)
X_train_with_dummy.head()
age stanley_cup American Canadian Finnish Russian team_Montreal Canadians team_Pittsburgh Penguins team_Tampa Bay Lightning team_Toronto Maple Leafs team_Washington Capitals position_D position_G position_LW position_RW
1 25 0 0 0 0 1 0 0 1 0 0 0 0 0 1
10 25 0 1 0 0 0 0 0 0 1 0 1 0 0 0
5 29 1 0 1 0 0 0 0 0 0 1 0 1 0 0
9 31 0 0 1 0 0 1 0 0 0 0 0 1 0 0
11 23 0 0 1 0 0 0 0 0 0 0 0 0 0 0

Next, we will do the same thing for the testing data.

X_test_with_dummy = pd.get_dummies(X_test, columns=['team', 'position'], drop_first=True)

Now with our training and testing data hot-encoded, we are ready to model! Let's instantiate the Logistic Regression and fit and score our model.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train_with_dummy, y_train)
model.score(X_test_with_dummy, y_test)
/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)



---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-13-7057dac4b7c1> in <module>
      4 
      5 model.fit(X_train_with_dummy, y_train)
----> 6 model.score(X_test_with_dummy, y_test)


/anaconda3/lib/python3.6/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    286         """
    287         from .metrics import accuracy_score
--> 288         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    289 
    290


/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py in predict(self, X)
    279             Predicted class label per sample.
    280         """
--> 281         scores = self.decision_function(X)
    282         if len(scores.shape) == 1:
    283             indices = (scores > 0).astype(np.int)


/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py in decision_function(self, X)
    260         if X.shape[1] != n_features:
    261             raise ValueError("X has %d features per sample; expecting %d"
--> 262                              % (X.shape[1], n_features))
    263 
    264         scores = safe_sparse_dot(X, self.coef_.T,


ValueError: X has 9 features per sample; expecting 15

VALUE ERROR!

When we used pd.get_dummies to make dummy columns that were hot-encoded, we overlooked the fact that the testing data was much smaller and did not contain examples of every categorical value. Although we performed pd.get_dummies on the testing data, just as we did on the training data, there was no way for the transformed testing data to know that it was missing certain columns that were present in the original dataset and still existed in the training data.

Observe the shapes of our two data sets:

X_train_with_dummy.shape
(10, 15)
X_test_with_dummy.shape
(3, 9)
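One quick patch (not the approach taken in the rest of this post) would be to align the test dummies to the training columns, filling the columns the test set never produced with 0:

X_test_aligned = X_test_with_dummy.reindex(columns=X_train_with_dummy.columns, fill_value=0)
X_test_aligned.shape # now (3, 15), matching the training columns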

Some may argue that this could easily be avoided by using pd.get_dummies before splitting the data into training and testing sets. However, every good Data Scientist knows the importance of using train_test_split before tampering with the dataset, so that no information from the test set leaks into the training process.

These two ideas seem at odds, because they are! What to do?

Luckily, SkLearn provides us with a better and more advanced tool which anticipates this dilemma. Because of this, SkLearn's LabelBinarizer should be used instead of pd.get_dummies.

from sklearn.preprocessing import LabelBinarizer

features = [cols for cols in hockey_players if cols != 'hart']

X = hockey_players[features]
y = hockey_players['hart']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
X_train.head()
age team position stanley_cup American Canadian Finnish Russian
12 32 Pittsburgh Penguins C 1 0 0 0 1
10 25 Toronto Maple Leafs D 0 1 0 0 0
3 22 Colorado Avalanche RW 0 0 0 1 0
11 23 Colorado Avalanche C 0 0 1 0 0
2 22 Edmonton Oilers C 0 0 1 0 0
lb = LabelBinarizer()

lb.fit(X_train['position'])
lb.classes_      #This is where the magic happens!
lb.transform(X_train['position'])

X_train = X_train.join(
    pd.DataFrame(lb.fit_transform(X_train['position']),
    columns=lb.classes_,
    index=X_train.index))

X_train
age team position stanley_cup American Canadian Finnish Russian C D G LW RW
12 32 Pittsburgh Penguins C 1 0 0 0 1 1 0 0 0 0
10 25 Toronto Maple Leafs D 0 1 0 0 0 0 1 0 0 0
3 22 Colorado Avalanche RW 0 0 0 1 0 0 0 0 0 1
11 23 Colorado Avalanche C 0 0 1 0 0 1 0 0 0 0
2 22 Edmonton Oilers C 0 0 1 0 0 1 0 0 0 0
5 29 Washington Capitals G 1 0 1 0 0 0 0 1 0 0
6 34 San Jose Sharks D 0 0 1 0 0 0 1 0 0 0
0 31 Pittsburgh Penguins C 1 0 1 0 0 1 0 0 0 0
1 25 Tampa Bay Lightning RW 0 0 0 0 1 0 0 0 0 1
4 33 Washington Capitals LW 1 0 0 0 1 0 0 0 1 0

The built-in attribute 'classes_' is the saviour here. By fitting on the training data and then labeling the columns based on 'classes_', the existence of all 5 positions is remembered. Simply put, the attribute 'classes_' holds the label for each class. Now we will transform our test data.

X_test = X_test.join(
    pd.DataFrame(lb.transform(X_test['position']),
    columns=lb.classes_,
    index=X_test.index))
X_test
age team position stanley_cup American Canadian Finnish Russian C D G LW RW
7 28 Toronto Maple Leafs C 0 0 1 0 0 1 0 0 0 0
8 26 Winnipeg Jets C 0 0 1 0 0 1 0 0 0 0
9 31 Montreal Canadians G 0 0 1 0 0 0 0 1 0 0

As you can see, even though none of the players in the testing data played wing, instantiating LabelBinarizer and utilizing the attribute 'classes_' ensured that these columns were preserved in the testing data.
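For reference, this is roughly what the stored attribute looks like on this data (an illustration; the exact dtype may differ by version):

lb.classes_
# array(['C', 'D', 'G', 'LW', 'RW'], dtype='<U2')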

Now let's perform LabelBinarizer on the team column as well!

lb = LabelBinarizer()

lb.fit(X_train['team'])
lb.classes_     
lb.transform(X_train['team'])

X_train = X_train.join(
    pd.DataFrame(lb.fit_transform(X_train['team']),
    columns=lb.classes_,
    index=X_train.index))

X_train.head()
age team position stanley_cup American Canadian Finnish Russian C D G LW RW Colorado Avalanche Edmonton Oilers Pittsburgh Penguins San Jose Sharks Tampa Bay Lightning Toronto Maple Leafs Washington Capitals
12 32 Pittsburgh Penguins C 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0
10 25 Toronto Maple Leafs D 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
3 22 Colorado Avalanche RW 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0
11 23 Colorado Avalanche C 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0
2 22 Edmonton Oilers C 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0
X_test = X_test.join(
    pd.DataFrame(lb.transform(X_test['team']),
    columns=lb.classes_,    #Remembering the classes created from the original categorical column.
    index=X_test.index))

X_test = X_test.drop(columns=['team', 'position'])
X_train = X_train.drop(columns=['team', 'position'])

Now let's see if we can model our data...

model = LogisticRegression()

model.fit(X_train, y_train)
model.score(X_test, y_test)
/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)





0.6666666666666666

SUCCESS THEY'RE THE SAME SIZE!!!

LabelBinarizer is better than pd.get_dummies because when you fit it on your training data it stores a classes_ attribute, which remembers the different categories that were created when you hot-encoded your column into multiple columns. When you transform your testing data, the classes stored during fitting are reused, so even if the test data has no samples of a particular category, the corresponding column is still created (it just contains no 1s).
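As a side note, newer scikit-learn code often reaches for OneHotEncoder with handle_unknown='ignore' to get the same "remember the training categories" behaviour in one object. A minimal sketch, assuming train_df and test_df are hypothetical frames that still contain the raw 'team' and 'position' string columns:

from sklearn.preprocessing import OneHotEncoder

# train_df / test_df are hypothetical frames with the raw string columns.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train_df[['team', 'position']]) #learn the categories from the training data only
train_encoded = ohe.transform(train_df[['team', 'position']]) #one column per learned category
test_encoded = ohe.transform(test_df[['team', 'position']]) #same columns; unseen categories become all zeros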


The Functionality of Functions

Coming from an academic background in History, Philosophy, and Law, learning Data Science can be overwhelming, especially when you haven't taken a Computer Science or Math course in TWELVE years. The biggest issues I have had in the first three weeks deal with syntax. Learning Python is the equivalent of learning a new language. The one area I have continued to struggle with is how to write new functions. While this is far from the hardest task thrown at us, it is essential to most tasks a Data Scientist faces daily.

To start, writing a basic function is very simple.

def print_my_name(first_name, last_name):
    return (last_name, first_name)
print_my_name('kristi', 'gourlay')
('gourlay', 'kristi')

OR

def adder(number1, number2):
    return number1 + number2
adder(2, 3)
5

They get slightly more difficult when you're asked to count something or create a list.

Consider the following function that counts the vowels in a provided word.

def vowel_counter(word):
    vowels = 'aeiou'
    counter = 0
    for char in word:
        if char in vowels:
            counter += 1
            return counter
vowel_counter('elephant')
1

WAIT! That's not right!

Lesson One: Make sure the return is outside the loop! This placement of the return call had me pulling my hair out for the entire first week. I could not understand why my functions, that looked just like my classmates' functions, were not working.

def vowel_counter(word):
    vowels = 'aeiou'
    counter = 0
    for char in word:
        if char in vowels:
            counter += 1

    return counter
vowel_counter('elephant')
3

While writing a simple function is easy, the next step is learning to write a function that can be used on data or a dictionary. I struggled with this in our first lab.

Below I have created a list of dictionaries of tv shows:

tvshows = [
{
    'name': 'curb your enthusiasm',
    'category': 'comedy',
    'network': 'hbo'
},
{
    'name': 'the wire',
    'category': 'drama',
    'network': 'hbo'
},
{
    'name': 'shameless',
    'category': 'dramedy',
    'network': 'showtime'
},
{
    'name': 'the sopranos',
    'category': 'drama',
    'network': 'hbo'
},
{
    'name': 'game of thrones',
    'category': 'drama',
    'network': 'hbo'
},
{
    'name': 'house of cards',
    'category': 'drama',
    'network': 'netflix'
},
{
    'name': 'kimmy schmidt',
    'category': 'comedy',
    'network': 'netflix'
}]

print(tvshows)
[{'name': 'curb your enthusiasm', 'category': 'comedy', 'network': 'hbo'}, {'name': 'the wire', 'category': 'drama', 'network': 'hbo'}, {'name': 'shameless', 'category': 'dramedy', 'network': 'showtime'}, {'name': 'the sopranos', 'category': 'drama', 'network': 'hbo'}, {'name': 'game of thrones', 'category': 'drama', 'network': 'hbo'}, {'name': 'house of cards', 'category': 'drama', 'network': 'netflix'}, {'name': 'kimmy schmidt', 'category': 'comedy', 'network': 'netflix'}]

For example, I want to make a function that iterates through the list and returns all the shows that are from a specific network.

The first step is to create a 'for loop' OR use a 'list comprehension' to write what you want to get back.

In this instance, I would like to see all the tv show names that are on the HBO network.

Name the function and pass it two arguments: 1) what you are going to iterate through, and 2) what you are looking for.

def tv_net(dictionary, network='hbo'):

    return [tv['name'] for tv in dictionary if tv['network'] == network]

tv_net(tvshows)
['curb your enthusiasm', 'the wire', 'the sopranos', 'game of thrones']

Since I set the second argument default to hbo, the function automatically looks for shows from the HBO network. Setting a default argument makes life easier, and helps you remember what type of argument you will be looking for. If I wanted to change the network argument, it's as simple as passing a different network through.

tv_net(tvshows, network='netflix')
['house of cards', 'kimmy schmidt']
tv_net(tvshows, network='showtime')
['shameless']

The important thing I learned from this process was the fact that you can then take the function you made and use it on similar data later in your code. For example, say I was then provided with a dictionary of prime time tv shows.

primetime = [
{
    'name': 'this is us',
    'category': 'drama',
    'network': 'nbc'
},
{
    'name': 'family guy',
    'category': 'comedy',
    'network': 'fox'
},
{
    'name': 'the good place',
    'category': 'comedy',
    'network': 'nbc'
},
{
    'name': 'the office',
    'category': 'comedy',
    'network': 'nbc'
}
]
tv_net(primetime, network='nbc')
['this is us', 'the good place', 'the office']
tv_net(primetime, 'fox')
['family guy']

The importance of using functions to iterate through data and dictionaries resurfaces in EDA. If we are looking at a dataset and need to change something about it, and we may need to make similar changes later in our code, it's practical to create a function. To show this, I will use the tv show sets from above. (First I will need to place them in a DataFrame.)

import pandas as pd

tvshows = pd.DataFrame(tvshows)
primetime = pd.DataFrame(primetime)
tvshows
category name network
0 comedy curb your enthusiasm hbo
1 drama the wire hbo
2 dramedy shameless showtime
3 drama the sopranos hbo
4 drama game of thrones hbo
5 drama house of cards netflix
6 comedy kimmy schmidt netflix

When we look at the dataframe for tvshows, we notice that the category and the name columns need to be swapped. We could run a simple list command, but since we know we have other similar data, it may be easier to create a function so that we don't have to repeat these verbose commands later.

def swap_col(dataframe):
    cols = list(dataframe.columns)
    a, b = cols.index('category'), cols.index('name')
    cols[b], cols[a] = cols[a], cols[b]
    dataframe = dataframe[cols]
    return dataframe
swap_col(tvshows)
name category network
0 curb your enthusiasm comedy hbo
1 the wire drama hbo
2 shameless dramedy showtime
3 the sopranos drama hbo
4 game of thrones drama hbo
5 house of cards drama netflix
6 kimmy schmidt comedy netflix

Now we look at our other dataframe and we see the same issue that needs to be fixed.

primetime
category name network
0 drama this is us nbc
1 comedy family guy fox
2 comedy the good place nbc
3 comedy the office nbc
swap_col(primetime)
name category network
0 this is us drama nbc
1 family guy comedy fox
2 the good place comedy nbc
3 the office comedy nbc

The ability to make a function to fix mistakes in your data becomes a useful skill to have. It will be a time saver, and that's one of the reasons why learning functions is quite important for Data Science.

One last point.

I've shown how functions become important when both analyzing and fixing your data. Sometimes you might need your function to include multiple steps, and the sheer volume may seem daunting. The number one tip I have taken away in my first three weeks learning Python is to break things down. If you take everything step by step and work your way up to the return you need, the task becomes much easier.

It's easier to do many smaller pieces of the whole, than to tackle the whole.

STEP ONE: DON'T GET OVERWHELMED!

STEP TWO: SIMPLY WRITE DOWN WHAT THE FUNCTION NEEDS TO DO/RETURN

STEP THREE: WRITE DOWN IN ORDER WHAT YOU NEED TO DO TO GET THE DESIRED RETURN

The perfect example of breaking a function down was presented to me in our first project. We were asked to analyze SAT and ACT data. At one point in the assignment, we were asked to design a function that returns the standard deviation for a column. This question took me 10x longer than any other question because I kept thinking about the final product. Eventually, I realised I needed to follow the three steps laid out above.

After calming myself down, I answered STEP TWO. The function needs to return the standard deviation of a column in the SAT/ACT dataset.

STEP THREE: How do you calculate standard deviation?

  1. Calculate the mean

    mean = sum(column) / len(column)

  2. For each number, subtract the mean and square the results

    ((n - (sum(column) / len(column)))**2)
    
  3. Add each squared difference to a list (START A LIST before the math: new_list = [])

    new_list.append((n - (sum(column) / len(column)))**2)

  4. Divide the sum of that list by (n - 1) and take the square root (this gives the sample standard deviation)

    math.sqrt(sum(new_list) / (len(column) - 1))

Start with the first thing you need to do and slowly add what you need next. The result is:

import math

def calc_std(column):
    new_list = []
    for n in column:
        new_list.append((n - (sum(column) / len(column)))**2)
    return math.sqrt(((sum(new_list))) / (len(column) - 1))
column = [4, 5, 8, 9, 3, 4, 7, 9]
calc_std(column)
2.416461403433896

Writing functions was something that I seemed to struggle with more than others in my cohort. However, my focus on improving my ability to write functions allowed me to highlight other areas in my work that I should keep an eye on. If your code is not working, read the error message for details; if you still cannot figure out what's wrong, double-check these few things: * Placement * Accuracy * Brackets!

Placement: Make sure your return is located outside the loop. But also make sure that lists and counters are located in the right place as well. If needed for the function, place the lists right below the function call.

Accuracy: The number of times my code was not working because I misspelt a word in one place! Spelling will always be important, not just in academia.

Brackets: If in doubt, look for a missing bracket somewhere. I once spent an hour on a function that wasn't working, only to realize I just needed square brackets around the argument I was submitting.

Attention to detail!

