Logistic regression is a model that shows the probability of an event occurring from the input of one or more independent variables. In most cases, logistic regression produces only two outputs, resulting in a binary outcome.1
In this lesson you will build a logistic regresion model using demographic information from a customer base to determine if they will buy a new product launch.
Our company is a hardware store. They have a department that sells gas and charcoal barbeques.
The company is considering a new product launch of a pellet smoker type grill and wonders how its customers will react and buy one.
Day 1: Working with the dataset
The dataset consists of an Excel spreadsheet that contains information about the store's customers.
Click on the link to see the information and save it to your working directory.
Call the file "grill2.xlsx"
Hardware store customers dataset
Let's look at the dataset information.
The first column is a customer id number. The second is gender. The number 1 represents male and the number 2 is female.
The age column was coded in the following manner.
20-30 = 1
31-40 = 2
41-50 = 3
51-60 = 4
61-70 = 5
greater the 70 = 6
The salary was coded in the following way.
Less than $20,000 = 1
$21,000 - $30,000 = 2
$31,000 - $40,000 = 3
$41,000 - $50,000 = 4
$51,000 - $60,000 = 5
$61,000 - $70,000 = 6
$71,000 - $80,000 = 7
$81,000 - $90,000 = 8
$91,000 - $100,000 = 9
Greater than $100,000 = 10
Education levels were recorded as follows:
No high school = 1
High school graduate = 2
One year of college = 3
Two years of college = 4
Four year degree = 5
Advanced degrees = 6
Our hardware store wants us to predict if our customers will buy our new product: a Traeger Ironwood 885 grill. This make and model is very high end and retails at $1,599.00. A few years ago our company started selling Weber grills and they found that the customers had higher incomes, were more highly educated and tended to be middle-aged males that purchased most of the Weber grills.
We are going to assume that Weber buyers will be a good demographic for our Traeger predictions.
I used this customer database, the Excel spreadsheet.
A "1" indicates purchase and a "0" no purchase. I assigned these numbers based on past experience wih the Weber example.
In supervised learning, algorithms learn from labeled data. After understanding the data, the algorithm determines which label should be given to new data by associating patterns to the unlabeled new data.
A classifier is a type of machine learning algorithm that assigns a label to a data input. Classifier algorithms use labeled data and statistical methods to produce predictions about data input classifications.
It's called regression but performs classification based on the regression and it classifies the dependent variable into either of the classes.2
Day 2: Getting the Python Code
Open a new Python project and paste the contents into the the first frame.
Run the code in the first frame.
This code imports libraries from Python
Paste the contents into the the second frame.
Run the code in the first two frames.
These lines read in the Excel spreasheet and assign the contents to the variable dataset.
Then the first lines of the dataset are displayed.
Your file location will differ from mine. It needs to be where you saved your spreadsheet.
"grill2.xlsx" is the file name.
Create a third frame and paste the contents above into it.
The first line of code assigns the independent variables to the letter X.
The second line of code assigns the values in the dependent variable to y
Create a forth frame and paste the contents above into it.
This first line imports the library train_test_split
The next line trains the algorithum, creates a X_test set that contains 25% of the dataset.
The random_state statement assures us that the random seed will not be regenerated each time we run the program.
Create a fifth frame and paste the contents above into it.
These logistic regression lines of the model do the training and make the predictions.
Create a new frame and key in: print(X_test).
Save and run each frame starting with frame 1.
Let's examine the results of the last frame.
The above 30 items, which is 25%, represent the test set that will be trained on the remaining 75% of the dataset.
Let's look at the first entry in the X_test data. It is the eighth record in the original spreadsheet Id 6849. The 2 indicates that the record is a female. The 3 indicates the age group is 41 to 50. The 7 means that the customer is 71- 80 years. The 3 indicates that she completed one year of college.
Now let's get what prediction label determined for our X_test data. Key in this line in a new frame:
print(y_pred)
Here are the result after running all frames.
[0 0 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0]
Day 3: Matching the X_test data with the predictions.
Fifteen customers would purchase this newly launched product and another fifteen would not.
Let's examine a little more closely, the customers who would purchase this expensive grill.
The first customer in our random sample that was predicted to buy had the following numbers.
[ 1 4 7 5]1
1 is for male, 2 is for female
Age
20-30 = 1
31-40 = 2
41-50 = 3
51-60 = 4
61-70 = 5
greater the 70 = 6
The salary was coded in the following way.
Less than $20,000 = 1
$21,000 - $30,000 = 2
$31,000 - $40,000 = 3
$41,000 - $50,000 = 4
$51,000 - $60,000 = 5
$61,000 - $70,000 = 6
$71,000 - $80,000 = 7
$81,000 - $90,000 = 8
$91,000 - $100,000 = 9
Greater than $100,000 = 10
Education levels were recorded as follows:
No high school = 1
High school graduate = 2
One year of college = 3
Two years of college = 4
Four year degree = 5
Advanced degrees = 6
This customer is male, between 51 and 60 years of age, makes between $71,00 to $80,000 per year, and has a four year college degree.
If we look at our original spreadsheet for these numbers, we can see it belongs to either customer 68495,68516 or 68519.
:Day 4: checking the accuracy of our model
Next we should see how well our model performed making predictions on the unknown test set.
There are a number of metrics to evaluate our classification method.
Some of the most used ones are confusion matrix, F1, recall, precision and accuracy.
Get the following lines of code to evaluate our model.
Create a new frame and paste the contents above into it.
Save and run all frames to see the metrics and how they evaluated our classification model.
Let's examine the results.
The confusion matrix shows true positive, true negatives, false positives and false negatives.
There are 12 true negatives. True negatives are those output labels, "0" that are actually false and the model also predicted them as false.
True positives are those labels that are actually true and also predicted as true by our model. There were 15 of them.
There were three false negatives. False negatives are labels that are actually true but predicted as false by the model.
The were not any false positives. False positives are labels that are actually false, but the model predicted them as true.
The confusion matrix shows that our model did a good job: only three labels were misclassified.
A classification report is a performance evaluation metric in machine learning. It is used to show the precision, recall, F1 Score, and support of your trained classification model.
Precision is defined as the ratio of true positives to the sum of true and false positives.
Recall Recall is defined as the ratio of true positives to the sum of true positives and false negatives.
F1 Score The F1 is the weighted harmonic mean of precision and recall. The closer the value of the F1 score is to 1.0, the better the expected performance of the model is.
Support Support is the number of actual occurrences of the class in the dataset. It doesn’t vary between models, it just diagnoses the performance evaluation process.3
It would appear that our model did a very good job.
Day 5: Visualizing our model
We can create logistic regression curves to help us to see how education, salary and age can increase the probability that a customer will buy our product.
Create a new frame and paste the contents above into it.
Save and run all frames to see the graph of our classification model.
The graph shows us that more educated people are likely to purchase our product.
To graph Salary, change the X =data variable to "EstSalary" or "Age"
What conclusions can you draw after graphing the data from our model?
Day 6: What are we going to do with all of this information
If we compare our X_test results with the spreadsheet, we can find the customer id's and we can determine which customers to target and sell our new grill.
For example, our X_test data shows that the first one to purchase was record 1,4,7,5,1. The third one in X_test. Now if we search through the information in the spreadheet, we can find that customer id 68495 matches.
For all outcomes that resulted in a purchase ,"1" Continue this matching process until all X_test items are matched with a customer Id.
We know that these customers are a 25% sample of the entire dataset, however, we should use them a test target market for our new barbeque.
If our advertising campaign is successful, we can expand it include the other 75% that indicated they would buy our newly lanuched product.