Day 1 Machine Learning and Linear Regression

There are a number of statistical tools to help entrepreneurs plan and evaluate the progress of their business enterprises.

Machine learning is a part of artificial intelligence. Instead of programming a computer explicitly to make a prediction, using if-then-else statements, case statements and loops, a computer system can examine data and learn from that data.

Patterns are extracted from the data. Using these patterns, predictions can be made.

There are two kinds of machine learning.

Supervised learning is machine learning that learns from data that has been labeled.

Unsupervised machine learning does not require labeled data.

"Supervised Learning and Unsupervised Learning are two types of Machine Learning. Supervised Learning is the Machine Learning task of learning a function that maps an input to an output based on example input-output pairs.

Unsupervised Learning is the Machine Learning task of inferring a function to describe hidden structure from unlabeled data"

Reference: www.differencebetween.com/difference -between-supervised -and-vs-un…

Linear regression analysis can be used for things like break-even analysis, fraud detection, and creating targeted advertising.

This lesson focuses on supervised machine learning using Python for linear regression analysis. We are going to use it to determine the relationship between hits on our website and the revenue generated.

Think of linear regression as a scatter graph, with dots plotting data on a graph. If the dots form a somewhat straight line, then there is a relationship between the variables.

In other words, one variable has a direct effect on the other. In simple linear regression we only work with two variables.

Day 2 Installing libraries

Before starting this unit, I recommend that you study Python Graphics and Predictive Analytics, found on our web site. You need to have installed Python3 and Anaconda on your system. In addition, for this lesson, you will need to install Scikit-Learn.

There are numerous resources in books and on the web to tell you how to install these items. If you are a student at a school, the instructor or lab specialist can probably install these libraries for you.

Day 3 - Project overview

Our Marketing and IT Departments have designed a web site for our company. We sell fine jewelry. Our web site can be found at:

Midas Touch

Our Marketing Department and IT Department managers believe that the revenue increases with the number of hits on our shopping cart page.

A hit, in the context of web services, is a page request that seeks to access a record on a web server.

Hits are a method of monitoring the traffic on a specific website. The more hits (or requests), the more traffic is thought to be using the page.

Hits, views and bytes can be found in the statistics section of your account on your web host's URL. Hits are recorded by the day, month and hour based on services provided by your host.

Our managers believe that there is a direct positive correlation between hits and revenue. They have collected data on a spreadsheet showing that data.

Midas Touch Spreadsheet

This spreadsheet shows the hits and revenue, day by day, for the past month. If we look at the numbers, we can see that the more hits the more revenue.

What we want to find out is how much each hit increases our revenue and to see if a linear relationship occurs with this data.

Excel has a way to save data that will be helpful to us when using Python. It is called comma separated values (csv). The comma is known as the delimiter and it separates the values.

If we were to open a csv file in notepad it would look like the example below.

Put the above data onto the clipboard and paste it into Notepad++. Save the file in your working directory and call it MidasTouchHits.csv

Python has a method to read this data file into the program, which saves us the time of creating a dataset from the numbers by hand.
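As a rough illustration of reading comma separated values, the sketch below loads a small in-memory sample with pandas instead of the real file. The three data rows are only illustrative (the real MidasTouchHits.csv has 30 rows), and the column names Hits and Revenue are taken from the program listing later in this lesson.

```python
import io
import pandas as pd

# Illustrative sample only; the real MidasTouchHits.csv has 30 data rows.
csv_text = """Hits,Revenue
10,5000
12,6000
25,12175
"""

# pd.read_csv accepts a file name or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())   # the first rows (up to five)
print(dataset.shape)    # (number of rows, number of columns)
```

With the real file in your working directory, you would pass the file name instead of the StringIO object.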

Before reading in the file, we need to import some libraries into our Python program. The libraries include numpy, pandas and matplotlib.

NumPy is a Python package whose name stands for "Numerical Python". It is the core library for scientific computing in Python. It contains a powerful n-dimensional array object and provides tools for integrating C and C++ code. It is also useful for linear algebra and random number generation. A NumPy array can also be used as an efficient multi-dimensional container for generic data.

NumPy Array: A NumPy array is a powerful N-dimensional array object, arranged in rows and columns. We can initialize NumPy arrays from nested Python lists and access their elements.
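A short sketch of building an array from a nested list and reading elements back. The numbers are arbitrary and only there to show the indexing.

```python
import numpy as np

# Build a 2-D array (2 rows, 3 columns) from a nested Python list.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.shape)    # (2, 3)
print(a[0, 1])    # row 0, column 1 -> 2
print(a[:, 2])    # every row, column 2 -> [3 6]
print(a.mean())   # 3.5
```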

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.¹

matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+.²

I used Spyder, which is an integrated development environment (IDE) and source code editor written in Python for data analysis.

Day 4: Read over Spyder Tutorial

Spyder Tutorial

Day 5: Importing Libraries, Data File, Printing Dataset Head, Shape and Description.

Let's get started.

The following base code came from Samuel Burns' excellent book entitled Python Machine Learning

Load Spyder from the Anaconda package. Your screen should look like the picture below.

Now let's begin to enter the code in the editor section, the part of the program where line numbers appear. In my example, it is the top half. Other configurations of Spyder show the editor on the left and the console on the right.

Key in "import numpy as np"
Key in "import pandas as pd"
Key in "import matplotlib.pyplot as plt"
Key in "import sys"
Key in "sys._stdout_ = sys.stdout"

Key in "dataset = pd.read_csv('MidasTouchHits.csv')"
Key in "print(dataset.head())"
Key in "print(dataset.shape)"
Key in "print(dataset.describe())"

First we need to import the necessary libraries.

Next we need to read in the file and print the head of the dataset.

Printing dataset.head() shows the first five rows of the dataset.

Printing the shape of the dataset shows the number of rows and columns contained in the dataset file, (30, 2).

The describe() line prints the statistical details of the dataset for hits and revenue: count, mean, standard deviation, min, max and so on.

The count shows the number of items in the file, 30.

The mean is the average, taken by adding up all Hits and dividing by 30 and adding up all the revenue and dividing by 30.

Min displays the lowest number of hits and revenue.

Max displays the highest number of hits and revenue.

In statistics, the standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data values.

A low standard deviation indicates that the data points tend to be close to the mean of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.

The standard deviation of a random variable, statistical population, data set, or probability distribution is the square root of its variance.²
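To make the link between variance and standard deviation concrete, here is a small sketch with made-up values. It is worth knowing that NumPy's std() divides by n by default, while the std row printed by pandas describe() divides by n - 1.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])  # hypothetical data

mean = values.mean()
squared_deviations = (values - mean) ** 2

# Population variance divides by n; sample variance divides by n - 1.
pop_var = squared_deviations.sum() / len(values)
samp_var = squared_deviations.sum() / (len(values) - 1)

# The standard deviation is the square root of the variance.
pop_std = np.sqrt(pop_var)
samp_std = np.sqrt(samp_var)

print(pop_std, np.std(values))           # population form (ddof=0)
print(samp_std, np.std(values, ddof=1))  # sample form (ddof=1), as in describe()
```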

After entering the code, save the file using a .py extension.

Run the file using the F5 key to see the results, which are listed below.

You can also execute the program line by line using the F9 key.

	Hits	Revenue
count	30.000000	30.000000
mean	22.233333	11218.733333
std	6.595104	3302.096345
min	10.000000	5000.000000
25%	16.500000	8775.000000
50%	21.500000	10900.000000
75%	28.750000	14531.250000
max	32.000000	16000.000000

Day 6: Plotting the Data, Graph Titles, Data Preparation

Standard deviation is, roughly, the typical distance of the numbers from the mean.

For numeric data, the result index will include count, mean, std, min and max, as well as the lower, 50th and upper percentiles.

By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median. Median is the middle data value of an ordered data set.
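The percentiles can also be computed directly. This sketch uses a handful of hypothetical hit counts and checks that the 50th percentile matches the median.

```python
import numpy as np

hits = np.array([10, 14, 18, 21, 22, 27, 30, 32])  # hypothetical values

# 25th, 50th and 75th percentiles, the same rows describe() reports.
q25, q50, q75 = np.percentile(hits, [25, 50, 75])
median = np.median(hits)

print(q25, q50, q75)
print(median)  # same value as the 50th percentile
```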

Now it is time to create a scatter graph of our data showing hits and revenue.

The code for this is listed below. Key it in or put it on the clipboard and paste it into your Spyder program. Save under the same name.

The lines above are pretty self-explanatory. The x axis is for Hits and the Y axis is for the Revenue. Each hit and corresponding revenue are plotted.

For example, 25 hits is plotted at 12,175 revenue. The style is the character that is printed on the scatter graph. You can use a number of characters, such as -, +, x or v.

The title is printed at the top and each axis is labeled. The X axis, the horizontal one, shows the number of hits and the vertical axis shows the revenue produced by the number of hits.

The last line, "plt.show()" displays the graph.

After running the program, your graph should look like the image below.

X = dataset.iloc[:, :-1].values

y = dataset.iloc[:, 1].values

Add these two lines to your program. They subdivide the dataset into attributes and labels. The attributes are the independent variables, the hits. The labels are the dependent variables whose values we want to predict, the revenue.
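The effect of the two iloc lines is easiest to see from the shapes they produce. The sketch below uses a three-row stand-in for the dataset; the values are illustrative only.

```python
import pandas as pd

# Small stand-in for the dataset, with the same two columns.
dataset = pd.DataFrame({"Hits": [12, 18, 25],
                        "Revenue": [6000, 9000, 12175]})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
y = dataset.iloc[:, 1].values    # all rows, column 1 (Revenue)

print(X.shape)  # (3, 1): a 2-D array with one attribute column
print(y.shape)  # (3,): a 1-D array of labels
```

X stays two-dimensional because scikit-learn expects one column per attribute, even when there is only one.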

Day 7: Preparing the Data and Making Predictions

Add these lines to your program. We are preparing the data by subdividing it into labels and attributes. The attributes are the independent variable, hits. The label forms the dependent variable, revenue in our example.

Now we need to divide the data into two sets. We will call one training and the other test. The Python Scikit-Learn library has a function called "train_test_split()" that we can use to create the two data sets.

Add the following lines of code to your project.

In algebra, the term linearity is used to describe the relationship between two or more variables.

When we plot the variables, with hits (the independent variable) on the X axis and revenue (the dependent variable) on the Y axis, the points form a somewhat straight line. This is what linear regression is all about. Looking at our graph, you can see that the relationship between hits and revenue does approximate a straight line.

There are many different possibilities for the line based on the data.

The linear regression algorithm finds, among all possible lines, the one with the smallest total squared error. The line is described by its slope and intercept.

The intercept (often labeled the constant) is the expected mean value of Y, the revenue, when X, the number of hits, equals 0. In a regression equation with one predictor, X, if X sometimes equals 0, the intercept is simply the expected mean value of Y at that value.

Looking at it in another way, the intercept is the value at which the fitted line crosses the y axis.

The intercept in our example does not give very much meaningful information.

The slope of a regression line represents the rate of change in y, revenue as x, hits change. Because y is dependent on x , the slope describes the predicted values of y given x .
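For a single predictor, the slope and intercept of the least-squares line can be written out by hand. The sketch below does so on made-up numbers and compares the result with NumPy's polyfit; it is an illustration of the idea, not the code scikit-learn runs internally.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])                # hypothetical hits
y = np.array([520.0, 1010.0, 1490.0, 2020.0, 2480.0])  # hypothetical revenue

x_mean, y_mean = x.mean(), y.mean()

# Slope: how much y changes, on average, for a one-unit change in x.
slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
# Intercept: where the fitted line crosses the y axis (x = 0).
intercept = y_mean - slope * x_mean

# np.polyfit with degree 1 fits the same least-squares line.
poly_slope, poly_intercept = np.polyfit(x, y, 1)

print(slope, intercept)
print(poly_slope, poly_intercept)
```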

In our scenario, our data shows 498.712 as the linear regressor coefficient. This means that for every additional hit, the predicted revenue increases by 498.712.

The test_size = .20. Items are picked from the list of 30 randomly. Twenty percent of 30 is 6, which sets the size of the test data array. The remaining 80% is used as the training set.
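The split sizes can be checked with a quick sketch. The 30 rows here are placeholder numbers rather than the real data; random_state=0 is the value used in the complete program listing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(-1, 1)  # 30 placeholder rows, one attribute
y = np.arange(30) * 500           # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))  # 24 6
```

Setting random_state makes the random selection repeatable, so the same six rows land in the test set on every run.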

The fit() method trains the model on the training data. The predict() method then estimates the revenue for the test data. The actual and predicted amounts appear below.

Linear regressor intercept = 145.87691073577298
Linear regressor coefficient = [498.71206382]

   Actual     Predicted
0    6000   6130.421677
1    9000   9122.694059
2   15490  15605.950889
3   13150  13112.390570
4   15100  15107.238825
5    7510   7626.557868

We have created a linear regression model and can make predictions based on the data we set aside as the training set.

We have predicted values for revenue from the input values in the X_test series.

We can see that our model predicted 6130.421677 compared to our actual data of 6000 for 12 hits, with each hit worth 498.71.
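The prediction for 12 hits can be reproduced by hand from the intercept and coefficient printed by the program: predicted revenue = intercept + coefficient x hits.

```python
# Values copied from the program output.
intercept = 145.87691073577298
coefficient = 498.71206382

hits = 12
predicted = intercept + coefficient * hits

print(round(predicted, 6))  # should agree with the 6130.421677 in the table
```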

Our model predicted 15605.950889 compared to 15490.

Our model is not perfect, but it is pretty close.

The graph below shows what a linear regression line looks like.

The Regression Line is the line that best fits the data, such that the overall distance from the line to the points (variable values) plotted on a graph is the smallest.

In other words, the line that minimizes the squared deviations of the predictions is called the regression line.

Day 8: Seeing How Accurate Our Model is

How well did the algorithm perform on our dataset?

There are three evaluation metrics that we can use. We will look at just one, RMSE (Root Mean Squared Error), which is the square root of the average squared vertical distance between each point and the line. A good RMSE should be less than 180.
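As a sketch of what RMSE does, the calculation can be done by hand on the six actual and predicted values printed earlier in this lesson. This is the same arithmetic that np.sqrt(metrics.mean_squared_error(...)) performs in the program.

```python
import numpy as np

# Actual and predicted revenue for the six test rows shown earlier.
actual = np.array([6000, 9000, 15490, 13150, 15100, 7510])
predicted = np.array([6130.421677, 9122.694059, 15605.950889,
                      13112.390570, 15107.238825, 7626.557868])

errors = actual - predicted            # vertical distance of each point
rmse = np.sqrt(np.mean(errors ** 2))   # square, average, then square root

print(rmse)
```

For these six rows the result comes out at roughly 100, which is under the threshold of 180.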

Add the following lines to your project.

Save and execute the program to see what RMSE is equal to.

How accurate is our model using the MidasTouchHits.csv dataset?

We could try other datasets for different months to see if we obtain different results. In all probability, December, May and June might produce different results, due to the seasonality, engagements, weddings.

What we do know is that, according to our model, each hit corresponds to about $498 of revenue, so we can predict future revenues based on the number of hits to our web site.

Below is a complete listing of the program.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys._stdout_ = sys.stdout
dataset = pd.read_csv('MidasTouchHits.csv')
print(dataset.head())
print(dataset.shape)
#dataset.head()
print(dataset.describe())
dataset.plot (x='Hits', y= 'Revenue', style= 'o')
plt.title('Web page hits vs Revenue Generated')
plt.xlabel('Hits')
plt.ylabel('Revenue')
plt.show()
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
from sklearn.linear_model import LinearRegression
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)
print('Linear regressor intercept = ',linear_regressor.intercept_)
print('Linear regressor coefficient = ',linear_regressor.coef_)
print()
pred_y = linear_regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': pred_y})
print(df)
print()
from sklearn import metrics
print('RMSE',np.sqrt(metrics.mean_squared_error(y_test, pred_y)))

Day 9: Multiple Linear Regression

Linear regression as we just studied has only two variables. This is helpful for learning purposes, but most real-world applications have multiple variables, each with varying effects on the dependent variable.

The steps to learn how to work with multiple linear regression are similar to what we just learned. The evaluation process is different, however: we need to see which variable has the most effect on the dependent variable and how each variable affects the others.

We are going to perform multiple linear regression analysis for a food truck company called Inside Slider. See the Starting your own business link on our web page.

Starting your own business

We are a new business that has only been operating for about one month. We are going to analyze our sales data to see how to increase our revenue. We will examine past sales data. The independent variables are the location of our truck, the time of day of the sale and the number of meals purchased. The dependent variable is revenue.

We want to see what effect each independent variable has on revenue, the dependent variable.

Our break-even analysis shows us that we need to make $26,561 per month just to break even. That breaks down to about $900 a day in revenue, or $27,000 per month.

We are going to use Python and multiple linear regression coding to solve our problem.

First we need to get the sales data for the month for Inside Slider. It is in comma separated format already, to make it easy to work with. Copy the file to the clipboard and save it in your working directory with a .csv extension.

Location,Num_Meals,Time,Revenue
1,3,1130,36.63
1,2,1130,25.00
1,2,1145,24.84
1,4,1145,48.84
1,3,1150,36.63
1,1,1200,12.41
1,2,1215,25.02
1,2,1215,23.57
1,3,1230,36.63
1,1,1230,12.07
1,2,1245,25.00
1,4,1245,50.00
1,3,1300,36.63
1,2,1305,24.82
1,4,1325,48.84
1,3,1330,36.63
1,1,1330,12.21
1,2,1345,24.84
1,4,1345,52.00
1,3,1400,36.63
1,1,1400,12.07
1,2,1415,25.00
1,4,1415,51.00
1,3,1430,36.63
1,1,1430,12.41
1,2,1445,25.02
1,4,1445,48.84
1,3,1500,36.63
1,2,1500,24.82
1,4,1500,48.84
2,3,1230,36.63
2,1,1230,12.41
2,2,1245,25.02
2,4,1245,48.84
2,3,1250,36.63
2,1,1300,12.41
2,2,1315,25.02
2,4,1315,48.84
2,3,1330,36.63
2,1,1330,12.41
2,2,1345,24.82
2,4,1345,48.84
3,3,1130,36.63
3,1,1130,12.41
3,2,1145,24.82
3,4,1145,48.84
3,3,1150,36.63
3,1,1200,12.41
3,2,1215,24.82
3,4,1215,48.84
3,3,1230,36.63
3,1,1230,12.41
3,2,1245,24.82
3,4,1245,48.84
3,3,1300,36.63
3,2,1305,24.82
3,4,1305,48.84
3,3,1305,36.63
3,1,1330,12.41
3,2,1330,24.82
3,4,1335,48.84
3,3,1400,36.63
3,1,1400,12.41
3,2,1415,24.82
3,4,1415,48.84
3,3,1430,36.63
3,1,1430,12.41
3,2,1445,24.82
3,4,1445,48.84
3,3,1500,36.63
3,2,1505,24.82
3,4,1525,48.84
3,2,1530,24.82
3,1,1530,12.41
3,2,1545,24.82
3,4,1545,48.84
3,3,1500,36.63
3,1,1500,12.41
3,2,1500,24.82
3,4,1500,48.84
3,3,1500,36.63
3,1,1500,12.41
3,2,1500,24.82
3,4,1500,48.84
3,5,1500,60.35
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
4,1,1130,13.00
4,2,1135,25.00
4,1,1140,13.75
4,5,1141,60.35
4,2,1143,24.82
4,4,1300,50.00
4,1,1310,14.00
5,1,1130,12.25
5,2,1141,25.50
5,1,1145,12.25
5,1,1151,12.25
5,1,1151,12.25
5,1,1152,12.25
5,1,1153,12.25
5,1,1154,12.25
5,1,1155,12.25
5,1,1156,12.25
5,1,1157,12.25
5,1,1158,12.25
5,1,1159,12.25
5,1,1200,12.25
5,2,1200,24.82
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
6,1,1130,12.21
6,1,1135,12.21
6,2,1140,25.00
6,2,1142,25.00
6,3,1145,36.63
6,3,1147,36.63
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,5,1149,60.25
6,2,1149,24.82
6,5,1149,60.35
6,5,1149,60.35
6,5,1149,60.35
6,5,1149,60.35
6,1,1230,12.21
6,1,1235,12.21
6,2,1240,25.00
6,2,1242,25.00
6,3,1245,36.63
6,3,1247,36.63
6,5,1249,60.35
6,1,1400,12.21
7,3,1130,36.63
7,2,1130,24.84
7,2,1145,25.02
7,4,1145,48.57
7,3,1150,36.63
7,1,1200,12.41
7,2,1215,25.02
7,2,1215,23.57
7,3,1230,36.63
7,1,1230,12.07
7,2,1245,24.84
7,4,1245,48.57
7,3,1300,36.63
7,2,1305,24.82
7,4,1325,48.84
7,3,1330,36.63
7,1,1330,12.21
7,2,1345,24.84
7,4,1345,48.84
7,3,1400,36.63
7,1,1400,12.07
7,2,1415,24.84
7,4,1415,48.84
7,3,1430,36.63
7,1,1430,12.41
7,2,1445,25.02
7,4,1445,48.84
7,3,1500,36.63
7,2,1505,24.82
7,4,1525,48.84
1,3,1130,36.63
1,2,1130,25.00
1,2,1145,24.84
1,4,1145,48.84
1,3,1150,36.63
1,1,1200,12.41
1,2,1215,25.02
1,2,1215,23.57
1,3,1230,36.63
1,1,1230,12.07
1,2,1245,25.00
1,4,1245,50.00
1,3,1300,36.63
1,2,1305,24.82
1,4,1325,48.84
1,3,1330,36.63
1,1,1330,12.21
1,2,1345,24.84
1,4,1345,52.00
1,3,1400,36.63
1,1,1400,12.07
1,2,1415,25.00
1,4,1415,51.00
1,3,1430,36.63
1,1,1430,12.41
1,2,1445,25.02
1,4,1445,48.84
1,3,1500,36.63
1,2,1500,24.82
1,4,1500,48.84
2,3,1230,36.63
2,1,1230,12.41
2,2,1245,25.02
2,4,1245,48.84
2,3,1250,36.63
2,1,1300,12.41
2,2,1315,25.02
2,4,1315,48.84
2,3,1330,36.63
2,1,1330,12.41
2,2,1345,24.82
2,4,1345,48.84
3,3,1130,36.63
3,1,1130,12.41
3,2,1145,24.82
3,4,1145,48.84
3,3,1150,36.63
3,1,1200,12.41
3,2,1215,24.82
3,4,1215,48.84
3,3,1230,36.63
3,1,1230,12.41
3,2,1245,24.82
3,4,1245,48.84
3,3,1300,36.63
3,2,1305,24.82
3,4,1305,48.84
3,3,1305,36.63
3,1,1330,12.41
3,2,1330,24.82
3,4,1335,48.84
3,3,1400,36.63
3,1,1400,12.41
3,2,1415,24.82
3,4,1415,48.84
3,3,1430,36.63
3,1,1430,12.41
3,2,1445,24.82
3,4,1445,48.84
3,3,1500,36.63
3,2,1505,24.82
3,4,1525,48.84
3,2,1530,24.82
3,1,1530,12.41
3,2,1545,24.82
3,4,1545,48.84
3,3,1500,36.63
3,1,1500,12.41
3,2,1500,24.82
3,4,1500,48.84
3,3,1500,36.63
3,1,1500,12.41
3,2,1500,24.82
3,4,1500,48.84
3,5,1500,60.35
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
4,1,1130,13.00
4,2,1135,25.00
4,1,1140,13.75
4,5,1141,60.35
4,2,1143,24.82
4,4,1300,50.00
4,1,1310,14.00
5,1,1130,12.25
5,2,1141,25.50
5,1,1145,12.25
5,1,1151,12.25
5,1,1151,12.25
5,1,1152,12.25
5,1,1153,12.25
5,1,1154,12.25
5,1,1155,12.25
5,1,1156,12.25
5,1,1157,12.25
5,1,1158,12.25
5,1,1159,12.25
5,1,1200,12.25
5,2,1200,24.82
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
6,1,1130,12.21
6,1,1135,12.21
6,2,1140,25.00
6,2,1142,25.00
6,3,1145,36.63
6,3,1147,36.63
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,5,1149,60.25
6,2,1149,24.82
6,5,1149,60.35
6,5,1149,60.35
6,5,1149,60.35
6,5,1149,60.35
6,1,1230,12.21
6,1,1235,12.21
6,2,1240,25.00
6,2,1242,25.00
6,3,1245,36.63
6,3,1247,36.63
6,5,1249,60.35
6,1,1400,12.21
7,3,1130,36.63
7,2,1130,24.84
7,2,1145,25.02
7,4,1145,48.57
7,3,1150,36.63
7,1,1200,12.41
7,2,1215,25.02
7,2,1215,23.57
7,3,1230,36.63
7,1,1230,12.07
7,2,1245,24.84
7,4,1245,48.57
7,3,1300,36.63
7,2,1305,24.82
7,4,1325,48.84
7,3,1330,36.63
7,1,1330,12.21
7,2,1345,24.84
7,4,1345,48.84
7,3,1400,36.63
7,1,1400,12.07
7,2,1415,24.84
7,4,1415,48.84
7,3,1430,36.63
7,1,1430,12.41
7,2,1445,25.02
7,4,1445,48.84
7,3,1500,36.63
7,2,1505,24.82
7,4,1525,48.84
1,3,1130,36.63
1,2,1130,25.00
1,2,1145,24.84
1,4,1145,48.84
1,3,1150,36.63
1,1,1200,12.41
1,2,1215,25.02
1,2,1215,23.57
1,3,1230,36.63
1,1,1230,12.07
1,2,1245,25.00
1,4,1245,50.00
1,3,1300,36.63
1,2,1305,24.82
1,4,1325,48.84
1,3,1330,36.63
1,1,1330,12.21
1,2,1345,24.84
1,4,1345,52.00
1,3,1400,36.63
1,1,1400,12.07
1,2,1415,25.00
1,4,1415,51.00
1,3,1430,36.63
1,1,1430,12.41
1,2,1445,25.02
1,4,1445,48.84
1,3,1500,36.63
1,2,1500,24.82
1,4,1500,48.84
2,3,1230,36.63
2,1,1230,12.41
2,2,1245,25.02
2,4,1245,48.84
2,3,1250,36.63
2,1,1300,12.41
2,2,1315,25.02
2,4,1315,48.84
2,3,1330,36.63
2,1,1330,12.41
2,2,1345,24.82
2,4,1345,48.84
3,3,1130,36.63
3,1,1130,12.41
3,2,1145,24.82
3,4,1145,48.84
3,3,1150,36.63
3,1,1200,12.41
3,2,1215,24.82
3,4,1215,48.84
3,3,1230,36.63
3,1,1230,12.41
3,2,1245,24.82
3,4,1245,48.84
3,3,1300,36.63
3,2,1305,24.82
3,4,1305,48.84
3,3,1305,36.63
3,1,1330,12.41
3,2,1330,24.82
3,4,1335,48.84
3,3,1400,36.63
3,1,1400,12.41
3,2,1415,24.82
3,4,1415,48.84
3,3,1430,36.63
3,1,1430,12.41
3,2,1445,24.82
3,4,1445,48.84
3,3,1500,36.63
3,2,1505,24.82
3,4,1525,48.84
3,2,1530,24.82
3,1,1530,12.41
3,2,1545,24.82
3,4,1545,48.84
3,3,1500,36.63
3,1,1500,12.41
3,2,1500,24.82
3,4,1500,48.84
3,3,1500,36.63
3,1,1500,12.41
3,2,1500,24.82
3,4,1500,48.84
3,5,1500,60.35
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
4,1,1130,13.00
4,2,1135,25.00
4,1,1140,13.75
4,5,1141,60.35
4,2,1143,24.82
4,4,1300,50.00
4,1,1310,14.00
5,1,1130,12.25
5,2,1141,25.50
5,1,1145,12.25
5,1,1151,12.25
5,1,1151,12.25
5,1,1152,12.25
5,1,1153,12.25
5,1,1154,12.25
5,1,1155,12.25
5,1,1156,12.25
5,1,1157,12.25
5,1,1158,12.25
5,1,1159,12.25
5,1,1200,12.25
5,2,1200,24.82
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
6,1,1130,12.21
6,1,1135,12.21
6,2,1140,25.00
6,2,1142,25.00
6,3,1145,36.63
6,3,1147,36.63
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,5,1149,60.25
6,2,1149,24.82
6,5,1149,60.35
6,5,1149,60.35
6,5,1149,60.35
6,5,1149,60.35
6,1,1230,12.21
6,1,1235,12.21
6,2,1240,25.00
6,2,1242,25.00
6,3,1245,36.63
6,3,1247,36.63
6,5,1249,60.35
6,1,1400,12.21
7,3,1130,36.63
7,2,1130,24.84
7,2,1145,25.02
7,4,1145,48.57
7,3,1150,36.63
7,1,1200,12.41
7,2,1215,25.02
7,2,1215,23.57
7,3,1230,36.63
7,1,1230,12.07
7,2,1245,24.84
7,4,1245,48.57
7,3,1300,36.63
7,2,1305,24.82
7,4,1325,48.84
7,3,1330,36.63
7,1,1330,12.21
7,2,1345,24.84
7,4,1345,48.84
7,3,1400,36.63
7,1,1400,12.07
7,2,1415,24.84
7,4,1415,48.84
7,3,1430,36.63
7,1,1430,12.41
7,2,1445,25.02
7,4,1445,48.84
7,3,1500,36.63
7,2,1505,24.82
7,4,1525,48.84
1,3,1130,36.63
1,2,1130,25.00
1,2,1145,24.84
1,4,1145,48.84
1,3,1150,36.63
1,1,1200,12.41
1,2,1215,25.02
1,2,1215,23.57
1,3,1230,36.63
1,1,1230,12.07
1,2,1245,25.00
1,4,1245,50.00
1,3,1300,36.63
1,2,1305,24.82
1,4,1325,48.84
1,3,1330,36.63
1,1,1330,12.21
1,2,1345,24.84
1,4,1345,52.00
1,3,1400,36.63
1,1,1400,12.07
1,2,1415,25.00
1,4,1415,51.00
1,3,1430,36.63
1,1,1430,12.41
1,2,1445,25.02
1,4,1445,48.84
1,3,1500,36.63
1,2,1500,24.82
1,4,1500,48.84
2,3,1230,36.63
2,1,1230,12.41
2,2,1245,25.02
2,4,1245,48.84
2,3,1250,36.63
2,1,1300,12.41
2,2,1315,25.02
2,4,1315,48.84
2,3,1330,36.63
2,1,1330,12.41
2,2,1345,24.82
2,4,1345,48.84
3,3,1130,36.63
3,1,1130,12.41
3,2,1145,24.82
3,4,1145,48.84
3,3,1150,36.63
3,1,1200,12.41
3,2,1215,24.82
3,4,1215,48.84
3,3,1230,36.63
3,1,1230,12.41
3,2,1245,24.82
3,4,1245,48.84
3,3,1300,36.63
3,2,1305,24.82
3,4,1305,48.84
3,3,1305,36.63
3,1,1330,12.41
3,2,1330,24.82
3,4,1335,48.84
3,3,1400,36.63
3,1,1400,12.41
3,2,1415,24.82
3,4,1415,48.84
3,3,1430,36.63
3,1,1430,12.41
3,2,1445,24.82
3,4,1445,48.84
3,3,1500,36.63
3,2,1505,24.82
3,4,1525,48.84
3,2,1530,24.82
3,1,1530,12.41
3,2,1545,24.82
3,4,1545,48.84
3,3,1500,36.63
3,1,1500,12.41
3,2,1500,24.82
3,4,1500,48.84
3,3,1500,36.63
3,1,1500,12.41
3,2,1500,24.82
3,4,1500,48.84
3,5,1500,60.35
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
3,1,1500,12.21
4,1,1130,13.00
4,2,1135,25.00
4,1,1140,13.75
4,5,1141,60.35
4,2,1143,24.82
4,4,1300,50.00
4,1,1310,14.00
5,1,1130,12.25
5,2,1141,25.50
5,1,1145,12.25
5,1,1151,12.25
5,1,1151,12.25
5,1,1152,12.25
5,1,1153,12.25
5,1,1154,12.25
5,1,1155,12.25
5,1,1156,12.25
5,1,1157,12.25
5,1,1158,12.25
5,1,1159,12.25
5,1,1200,12.25
5,2,1200,24.82
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
5,3,1225,36.63
6,1,1130,12.21
6,1,1135,12.21
6,2,1140,25.00
6,2,1142,25.00
6,3,1145,36.63
6,3,1147,36.63
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,2,1149,24.82
6,5,1149,60.25
6,2,1149,24.82
6,5,1149,60.35
6,5,1149,60.35
6,5,1149,60.35
6,5,1149,60.35
6,1,1230,12.21
6,1,1235,12.21
6,2,1240,25.00
6,2,1242,25.00
6,3,1245,36.63
6,3,1247,36.63
6,5,1249,60.35
6,1,1400,12.21
7,3,1130,36.63
7,2,1130,24.84
7,2,1145,25.02
7,4,1145,48.57
7,3,1150,36.63
7,1,1200,12.41
7,2,1215,25.02
7,2,1215,23.57
7,3,1230,36.63
7,1,1230,12.07
7,2,1245,24.84
7,4,1245,48.57
7,3,1300,36.63
7,2,1305,24.82
7,4,1325,48.84
7,3,1330,36.63
7,1,1330,12.21
7,2,1345,24.84
7,4,1345,48.84
7,3,1400,36.63
7,1,1400,12.07
7,2,1415,24.84
7,4,1415,48.84
7,3,1430,36.63
7,1,1430,12.41
7,2,1445,25.02
7,4,1445,48.84
7,3,1500,36.63
7,2,1505,24.82
7,4,1525,48.84

Let's see what this data represents. It was compiled from the monthly sales slips for Inside Slider. The first number indicates the location of the food truck.

There are seven locations, one for each day of the week.

Location 1 is Stearns's Wharf on Sunday
Location 2 is East Beach on Monday
Location 3 is West Beach on Tuesday
Location 4 is the Breakwater on Wednesday
Location 5 is the Mesa on Thursday
Location 6 is Leadbetter Beach on Friday
Location 7 is Carpinteria Beach on Saturday.

The next number refers to the number of meals or guests on a given ticket.

The third number represents the time of day. I used military time. The food truck operates from 11:30 am until 3:00 pm, or 1130 to 1500 in military time.

The last number represents the total of the bill.

We are trying to determine how to increase our revenue. We are looking at location, time of day and number of guests as the independent variables. We want to see which of these variables has the most impact on increasing our revenue, the dependent variable.

We plan to use Python code to create a multiple linear regression model to find our answer.

The code is very similar to the previous example on linear regression.
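As a hedged sketch of what a multiple linear regression looks like in code, the block below fits a model on sixteen rows copied from the sales data above. This is not the lesson's full program: it skips the train/test split for brevity, and the variable names are only suggestions.

```python
import io
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sixteen rows copied from the Inside Slider sales data (a small sample).
csv_text = """Location,Num_Meals,Time,Revenue
1,3,1130,36.63
1,2,1130,25.00
1,4,1145,48.84
1,1,1200,12.41
2,3,1230,36.63
2,1,1230,12.41
3,2,1145,24.82
3,4,1145,48.84
4,1,1130,13.00
4,5,1141,60.35
5,1,1130,12.25
5,2,1141,25.50
6,3,1145,36.63
6,5,1149,60.25
7,2,1130,24.84
7,4,1145,48.57
"""
dataset = pd.read_csv(io.StringIO(csv_text))

features = ["Location", "Num_Meals", "Time"]  # independent variables
X = dataset[features].values
y = dataset["Revenue"].values                 # dependent variable

model = LinearRegression()
model.fit(X, y)

# One coefficient per independent variable, in the same order as features.
for name, coef in zip(features, model.coef_):
    print(name, round(coef, 3))
print("Intercept", round(model.intercept_, 3))
```

Each coefficient estimates how much revenue changes when that variable goes up by one unit while the others stay the same.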

Put the code on the clipboard. Save the file in your working directory. Make sure that you use a .py extension.

Run the code using the F5 key or run it line by line using the F9 key.

The first lines that are produced are the first five lines of the dataset.

The next output is much more informative. It details:

The count. This is the number of individual sales slips in the file.
The next row of numbers represents the mean of the location, time, number of meals and revenue.
Remember that the mean is just a simple average. All the numbers are added up and divided by the total number of items.
The mean of the location is not very helpful.
The mean of the number of meals is helpful. It shows 2.319, so an average order is for about 2 customers.
The mean or average time of day that lunches are sold is 1307 or seven minutes after 1 pm.
The average revenue is $28.47.
The standard deviation is a measure of how spread out the numbers are, the average distance from the mean.
The standard deviation for the number of meals is 1.19 and the mean is 2.31.
The standard deviation for the revenue is 14.49 and the mean is 28.47.
The min number tells us that the minimum location was 1, that the minimum number of meals ordered was 1 and that the smallest order was $12.07.
The max number tells us that the max location was 7, the max number of meals on a ticket was 5, the latest time that meals were served was 1545 and the largest ticket total was $60.35.
The numbers on the 50% line are the median numbers, the ones in the middle: location 3, number of meals 2, time 1300 and revenue $24.82.
The coefficients are the numbers we are really looking for. They tell us which independent variable has the most effect on revenue, our dependent variable.
Each coefficient measures how much revenue changes when one independent variable increases by one unit while the others are held constant. Do not confuse these with correlation coefficients, which range between -1 and +1. A correlation of zero means there is no linear correlation. A plus one or a minus one means an exact correlation.
The results of our Python program show that time of day and location have almost no effect on our revenue, but that the number of meals does, with a coefficient of 12.145.
The next set of numbers show us what Python has predicted based on our actual sales.
Look at each number: actual and predicted. They are all pretty close. Remember that 156 items were randomly selected, 20% of 776. The first column shows the numbers of those sales slips.
The RMSE, root mean squared error, number is used to determine how accurate our sample was. The closer to zero the better. As you can see we did very well.
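Correlation coefficients and regression coefficients are easy to mix up, so here is a small sketch that computes both for the same numbers. The five ticket totals are values that appear in the sales data, one for each meal count.

```python
import numpy as np

meals = np.array([1, 2, 3, 4, 5])
revenue = np.array([12.21, 24.82, 36.63, 48.84, 60.35])  # totals seen in the data

# Correlation coefficient: always between -1 and +1.
r = np.corrcoef(meals, revenue)[0, 1]

# Regression coefficient (slope): in dollars per meal, not limited to -1..+1.
slope, intercept = np.polyfit(meals, revenue, 1)

print(round(r, 4))
print(round(slope, 2))
```

The correlation says how closely the points follow a straight line. The slope says how many dollars each additional meal adds.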

Now that we have our results, what changes to our business model could we consider? Here is a list that I came up with. You should make your own list.

Increase revenue by getting not just one customer to come to the food truck, but two or more. Start a promotion at local nearby businesses to create an incentive for more than one employee to have lunch at the food truck.
Create a promotion that gives a discount to a customer who buys one meal for themselves and takes another one or two meals back to the office or job site.
Add a catering component to your business model, which would result in multiple meals and thus increase revenue.
Look at the raw data in depth. Are there locations where the volume is low? Check out the breakwater, location 4. You only made seven sales. You know that there are many restaurants on the breakwater, so maybe that is why your sales are so poor.
Which is the location that produces the most revenue? Consider going there on Wednesday, the day that you go to the breakwater.
A buy one get one free lunch introduction might be an option.

Day 10: K-Nearest Neighbors Algorithm: gift basket company

Now we will examine data collected from an online survey of customers and use Python and the KNN algorithm to determine our target market and market segments.

The algorithm is a good one and is used for applications like economics, forecasting and genetics. It is a supervised algorithm since it is given a dataset made up of labeled training observations. It classifies a new record by finding the k training records most similar to it and taking the most common label among them. Our goal is to take the independent variables and use them to determine the generation of those individuals.
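
The idea behind KNN can be sketched in a few lines of plain Python. This is only an illustration, not the scikit-learn implementation used in the lesson: the function name knn_predict, the two-column records and their labels are all made up for the example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, new_point, k=3):
    """Label new_point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical respondents: (education code, income code)
train_points = [(1, 1), (1, 2), (2, 1), (4, 6), (5, 6), (4, 7)]
train_labels = ["GenZ", "GenZ", "GenZ", "Baby_Boomer", "Baby_Boomer", "Baby_Boomer"]

print(knn_predict(train_points, train_labels, (2, 2)))   # GenZ
print(knn_predict(train_points, train_labels, (5, 7)))   # Baby_Boomer
```

With k=3 the three closest training records decide the label, which is why records that sit near a group of GenZ respondents are labeled GenZ.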

The generations are as follows :

Generation Z individuals were born between 1996 and 2014; they are 13 to 23 years old as of this year, 2019.
Generation Y, also called Millennials, were born between 1980 and 1995, making them 24 to 39 years old as of 2019.
Generation X folks were born between 1961 and 1981 and are 38 to 58 years old based on 2019.
The baby boomers were born between 1946 and 1964 making them 55 to 73 years old based on the year 2019.
Traditionalists were born between 1922 and 1945 making them between the ages of 74 and 97 years old based on the year 2019.

Python code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.stdout_ = sys.stdout
url = "http://janetbelch.com/Survey.csv"
#Assign column names to the dataset
names = ['gender', 'geo_loc', 'marital', 'education', 'occup', 'income', 'race', 'generation']
#Load the data set from the url into a pandas dataframe
dataset = pd.read_csv(url, names=names)
print(dataset.head())
print(dataset.shape)
#Separate the features (X) from the generation label (y)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 7].values
#Hold out 20% of the rows as test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
#Scale the features using the mean and standard deviation of the training data
from sklearn.preprocessing import StandardScaler
feature_scaler = StandardScaler()
feature_scaler.fit(X_train)
X_train = feature_scaler.transform(X_train)
X_test = feature_scaler.transform(X_test)
#Fit the KNN classifier with 5 neighbors and predict the test data
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)
pred_y = knn_classifier.predict(X_test)
#Display the confusion matrix and the classification report
from sklearn.metrics import confusion_matrix, classification_report
print()
print(confusion_matrix(y_test, pred_y))
print()
print(classification_report(y_test, pred_y))

Put the Python code on the clipboard and paste it into the Spyder editor. The code came from Samuel Burns in his Python Machine Learning textbook. I created the dataset and changed the column headings to make it work for this example. Make sure to save the file in your working directory with a .py extension.

Run the code either line by line using the F9 key or all at once using the F5 key. Make sure that you have Internet access, since the completed file is located on our web site.

Let's analyze the results. The first five lines of the dataset are printed out, followed by the shape, which gives the number of rows and columns of the dataset: 100 rows and 8 columns.

The test sample of 20 is 20% of the 100 rows of the dataset. Remember that each row represents one customer's responses to the questionnaire.

The confusion matrix is displayed next.

The confusion matrix is read from the top left-hand corner to the bottom right-hand corner. The numbers on that diagonal are the records that were classified correctly.

Seven Baby Boomers were correctly identified in the sample.

Five Generation X's were correctly identified in our sample.

Three Generation Y's were correctly identified.

Four Generation Z's were correctly identified.

One Traditionalist was misclassified in our test data (as Generation X).
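
To see how the diagonal turns into an overall accuracy figure, here is a small sketch using the confusion matrix printed for this run. The class order in labels is an assumption based on the order in which the classification report lists the generations.

```python
# Confusion matrix from the run shown in this lesson.
# Rows are actual classes, columns are predicted classes (scikit-learn's convention).
labels = ["Baby_Boomer", "GenX", "GenY", "GenZ", "Traditionalist"]
matrix = [
    [7, 0, 0, 0, 0],
    [0, 5, 0, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 4, 0],
    [0, 1, 0, 0, 0],
]

# Correct predictions sit on the diagonal; everything else is a mistake
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

print(correct, total, accuracy)   # 19 20 0.95
```

Nineteen of the twenty test records are on the diagonal, which is the 0.95 shown on the micro avg line of the classification report.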

The classification report is displayed next and it summarizes the confusion matrix.

From the predictions, it is apparent that Baby Boomers and Generation X are our main target market segments and we should focus our marketing efforts on those two generations.

The results show that the KNN algorithm did a pretty good job in classifying the 20 records: the weighted averages were all above 0.90.

Precision is the share of records assigned to a class that really belong to it. Recall is the share of records in a class that the model found. The F1 score combines the two into a single number (their harmonic mean), so all three are measures of the test's accuracy.
     gender  geo_loc  marital  ...  income  race generation
1.0     1.0      3.0      1.0  ...     1.0   4.0       GenZ
1.0     1.0      3.0      1.0  ...     1.0   6.0       GenZ
1.0     1.0      3.0      1.0  ...     1.0   4.0       GenZ
1.0     1.0      3.0      1.0  ...     1.0   6.0       GenZ
1.0     2.0      3.0      1.0  ...     1.0   6.0       GenZ
[5 rows x 8 columns]
(100, 8)
print(confusion_matrix(y_test, pred_y))
[[7 0 0 0 0]
 [0 5 0 0 0]
 [0 0 3 0 0]
 [0 0 0 4 0]
 [0 1 0 0 0]]
print()
print(classification_report(y_test, pred_y))
                precision    recall  f1-score   support
   Baby_Boomer       1.00      1.00      1.00         7
          GenX       0.83      1.00      0.91         5
          GenY       1.00      1.00      1.00         3
          GenZ       1.00      1.00      1.00         4
Traditionalist       0.00      0.00      0.00         1
     micro avg       0.95      0.95      0.95        20
     macro avg       0.77      0.80      0.78        20
  weighted avg       0.91      0.95      0.93        20
These numbers show how well the model, after learning from the training data, predicted the generation of each respondent in the test data. The sample of test data is 20 respondents.

Looking at the support column, you can see 7 or 35% are Baby Boomers. Five or 25% belong to Generation X. Three or 15% belong to Generation Y. Four or 20% belong to Generation Z and one, or 5%, belongs to the Traditionalist Generation.
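
The precision, recall and F1 columns of the report can be recomputed by hand from the confusion matrix. The sketch below does so for the run shown in this lesson; the helper name class_metrics is made up for the example, and the class order is assumed to match the order of the rows in the report.

```python
# Rows are actual classes, columns are predicted classes
labels = ["Baby_Boomer", "GenX", "GenY", "GenZ", "Traditionalist"]
matrix = [
    [7, 0, 0, 0, 0],
    [0, 5, 0, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 4, 0],
    [0, 1, 0, 0, 0],
]

def class_metrics(matrix, index):
    """Return (precision, recall, f1) for the class at the given row/column index."""
    true_positive = matrix[index][index]
    predicted_as_class = sum(row[index] for row in matrix)   # column total
    actually_in_class = sum(matrix[index])                   # row total
    precision = true_positive / predicted_as_class if predicted_as_class else 0.0
    recall = true_positive / actually_in_class if actually_in_class else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for index, label in enumerate(labels):
    precision, recall, f1 = class_metrics(matrix, index)
    print(f"{label:15s} {precision:.2f} {recall:.2f} {f1:.2f}")
```

For GenX this gives a precision of 5/6 (0.83), because six records were labeled GenX but only five really were, and a recall of 5/5 (1.00), which matches the report.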

Based on the numbers from the classification report, the model did a good job.

Run the code multiple times and the results will change somewhat, since different test data is obtained each time the program is executed.
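
If you would rather get the same split on every run, for example while comparing notes with someone else, train_test_split accepts a random_state parameter. The sketch below uses made-up stand-in data, and the value 42 is an arbitrary choice; any fixed integer works.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 respondents with one feature each
X = [[i] for i in range(20)]
y = ["GenX" if i % 2 == 0 else "Baby_Boomer" for i in range(20)]

# The same random_state gives the same split every time
first = train_test_split(X, y, test_size=0.20, random_state=42)
second = train_test_split(X, y, test_size=0.20, random_state=42)

X_test_first, X_test_second = first[1], second[1]
print(X_test_first == X_test_second)   # True
print(len(X_test_first))               # 4
```

Leaving random_state out, as the lesson's code does, keeps the split random so that each run tests the model on different records.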

The information obtained gives the marketing professional information about the segments of their market.

Specialized campaigns can be directed at each segment.

It makes little sense to direct any campaign at the Traditionalists. The one that appeared in the test data was misclassified anyway.

Run the Python program numerous times and see how the results vary somewhat.

Here is some information I found about the characteristics of the generation X.

They were born between 1961 and 1981 making them 38 to 58 years old as of 2019.
They represent 22.9% of the United States population, which is approximately comparable to Generation Z and the Millennials.
They are called the middle child.
Many were latch key kids.
Many are divorced.
They have or had career driven parents.
They are into labels and brand names.
They have large amounts of credit card debt.
During their school years they grew up with no computers in their classrooms or at home.
They look at the world and ask "What's in it for me?"
They make about seven career changes.
They are called the MTV generation.
They like punk and heavy metal music.
They are late to marry and quick to divorce.

Here is some information about the baby boomers generation.

They were born after World War II ended, between 1946 and 1964.

They prefer to communicate using the telephone.

They represent a significant spike in births as the soldiers returned from war.

They are the 60's and 70's generations.

There are two segments: the leading edge ones, 1946 to 1955, and the late bloomers, 1956 to 1964.

They are the first generation with two-income households.

Divorce rates were higher than all previous generations.

They are the first generation to get television sets.

Here is how you might market products to Generation X. Remember in our simulation, our company sells flowers and gift baskets.

Use the Internet. Ninety-five percent of Generation Xers have a Facebook account.

They are loyal to brands, so reward returning customers.

They like videos, so create a video about your products and put it on your web page, or feature it in your company's Facebook account, or attach it to an email.

Personalize the product. Create custom orders of baskets and flowers based on their preferences.

Since they are family oriented, show images of a family enjoying a gift basket together around a special event.

Day 11: Neural Networks

Neural networks are a form of machine learning that tries to imitate the way neural networks work in the human body.

According to the Massachusetts Institute of Technology, understanding how the brain recognizes objects is a central challenge for understanding human vision, and for designing artificial vision systems.

No computer system comes close to human vision, but a new study by neuroscientists suggests that the brain learns to solve the problem of object recognition through its vast experience in the natural world.

Humans, based on their past experiences, can recognize the identifying patterns of an object, for example a car, a dog, or a cat.

Artificial neural networks try to imitate the way the human nervous system learns.

We will use an expanded dataset on target markets to compare neural networks to the KNN algorithm. The original survey only included the first eight questions. More questions were added to the dataset and questionnaire for this exercise.

Click on this link to see the questionnaire that the dataset is based on. Customer Survey

The additional questions are looking into customer buying habits. The first eight are demographic questions.

By looking at the questionnaire, I assigned the class for each row, i.e. the generation, based on all of the factors, not just age.

The data is not real; it is made up for educational purposes.

Iterative learning process

A key feature of neural networks is an iterative learning process in which records (rows) are presented to the network one at a time, and the weights associated with the input values are adjusted each time. After all cases are presented, the process is often repeated. During this learning phase, the network trains by adjusting the weights to predict the correct class label of input samples.
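
The iterative process can be sketched with a single artificial neuron (a perceptron). This is far simpler than the multi-layer network scikit-learn builds, and the records, learning rate and number of passes below are made-up values chosen only to show the weights being adjusted one record at a time.

```python
# Hypothetical training records: two inputs and a class label (0 or 1)
records = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0), ((1.0, 1.0), 1)]

weights = [0.0, 0.0]
bias = 0.0
learning_rate = 0.1

def predict(inputs):
    """Output 1 when the weighted sum of the inputs exceeds zero, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

# Present the records one at a time and repeat for several passes
for epoch in range(20):
    errors = 0
    for inputs, label in records:
        error = label - predict(inputs)
        if error != 0:
            errors += 1
            # Nudge each weight in the direction that reduces the error
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    if errors == 0:
        break   # every record is classified correctly

print([predict(inputs) for inputs, _ in records])   # [0, 0, 0, 1]
```

After a handful of passes the weights stop changing because every record is predicted correctly, which is the point at which training ends in this small example.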

We train the network on the training portion of the data and then check whether it recognizes the same characteristics in the test records. Individual items in the dataset include age, income, occupation and education, as well as technical competence, multitasking ability, work-life balance, retirement age, team orientation, online purchases and brand loyalty.

This information was gathered through a survey of our customers. The responses were entered into an Excel spreadsheet as numbers formatted to two decimal places, the headings were removed after the data was entered, and the sheet was saved in csv format and uploaded to our web site in a file called GenXandGenYandBoomers.csv.

Day 12: Python Code: Neural Networks

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.stdout_ = sys.stdout
generations_url = "http://janetbelch.com/GenXandGenYandBoomers.csv"
#Assign column names to the dataset
generation_names = ['Age', 'gender', 'geo_loc', 'marital', 'education', 'occup', 'income', 'race', 'Competence', 'Multi-tasking', 'Work-Life', 'Retirement', 'Social-Media', 'Team-Oriented', 'On-Line-Purchases', 'Brand-Loyality','generation']
#Load the data set from the url into a pandas dataframe
dataset = pd.read_csv(generations_url, names=generation_names)
#pd.options.display.max_columns = 10
print(dataset.to_string())
print(dataset.shape)
print(dataset.head())
#Features are the first 16 columns; the label is the text column (generation)
X = dataset.iloc[:,0:16]
y = dataset.select_dtypes(include=[object])
y.head(78)
y.generation.unique()
#Convert the generation names into integer class codes
from sklearn import preprocessing
lab = preprocessing.LabelEncoder()
y = y.apply(lab.fit_transform)
y.generation.unique()
#Hold out 20% of the rows as test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
#Scale the features using the mean and standard deviation of the training data
from sklearn.preprocessing import StandardScaler
feature_scaler = StandardScaler()
feature_scaler.fit(X_train)
X_train = feature_scaler.transform(X_train)
X_test = feature_scaler.transform(X_test)
#Neural network with three hidden layers of 10 neurons each
from sklearn.neural_network import MLPClassifier
mlp_classifier = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000)
mlp_classifier.fit(X_train,y_train.values.ravel())
predictions = mlp_classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test,predictions))
print(y_test.to_string())

Put the above code on the clipboard and paste it into the Spyder editor. Save it in your working directory with a .py extension.

Now run the program, first line by line using F9 key to check for any errors.

The code came from a book I purchased by Samuel Burns entitled Python Machine Learning. I made a few modifications, added a few lines and used a dataset that I created.

Day 13: Analyzing the results of neural networks

The shape is printed, showing 79 rows and 17 columns.
The whole dataset is displayed.
The confusion matrix shows how correct our classification model is by creating a table of actual and predicted values.
Each time you run the program these numbers are liable to change.

Read the confusion matrix from the top-left to the bottom-right corner. The numbers on that diagonal are the correct predictions for each row.

Each row corresponds to a generation: GenY, GenX and Baby Boomers, (0,1,2).
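
Which number stands for which generation is decided by the LabelEncoder, which assigns codes in sorted order of the label text. It is worth checking the mapping rather than assuming it. The sketch below uses made-up label spellings (the real CSV may spell them differently) and reads the mapping from the encoder's classes_ attribute.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label spellings; the real CSV may spell them differently
generations = ["GenX", "GenY", "Baby_Boomer", "GenX"]

encoder = LabelEncoder()
codes = encoder.fit_transform(generations)

# classes_ lists the original labels in the order of their integer codes
for code, name in enumerate(encoder.classes_):
    print(code, name)
```

In the lesson's program the same check can be made by printing lab.classes_ after the line that applies lab.fit_transform.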

The model correctly predicted 1 GenY, 11 GenX and 4 Baby Boomers on one of the runs.

The other numbers in the matrix represent the number of instances where the model did not predict the correct answer.

There were no prediction errors for Generation Y, Generation X or Baby Boomers.

Each time you run the code, you will most likely get a different confusion matrix and classification report, since the records are split into training and test data at random and the network starts from random weights.

Let's examine the data from the confusion matrix on this run.
[[14  0]
 [ 0  2]]
              precision    recall  f1-score   support
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00         2
   micro avg       1.00      1.00      1.00        16
   macro avg       1.00      1.00      1.00        16
weighted avg       1.00      1.00      1.00        16
    generation
52           1
67           1
47           1
51           1
64           1
30           1
61           1
70           1
22           1
11           2
53           1
24           1
40           1
20           1
6            2
68           1
- Remember Class 0 is Generation Y, Class 1 is Generation X and Class 2 is the Baby Boomers.
- No Generation Y records appeared in the test data on this run.
- There were 14 correct predictions for Generation X, which is 87.5% (14/16).
- There were 2 Baby Boomers predicted, which is 12.5% (2/16).
- This is a great breakdown of the segments of our overall target market, and promotional materials can be designed specifically for each segment.
- You can see that using this model, Generation X is your main target market.
- The classification report is displayed next, which summarizes the confusion matrix.
- The last information displayed is the test data, showing each record and its actual class.
Below is another run of the program giving slightly different results.
[[ 1  0  0]
 [ 0 11  0]
 [ 0  0  4]]
              precision    recall  f1-score   support
           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00        11
           2       1.00      1.00      1.00         4
   micro avg       1.00      1.00      1.00        16
   macro avg       1.00      1.00      1.00        16
weighted avg       1.00      1.00      1.00        16
    generation
2            2
7            2
48           1
26           1
39           1
72           1
46           1
29           1
43           1
8            2
77           0
19           1
20           1
10           2
27           1
56           1
Our model performed perfectly.
The last line, print(y_test.to_string()), prints out the entire test dataset: all 16 test records and the actual generation of each. Because every prediction was correct on this run, these match what the model determined.

Looking at the y_test dataset, you can see the individual records and the generations. This list shows in detail what the confusion matrix says. In the run above, Class 0, which is Generation Y, contains 1 member. Class 1, which is Generation X, has 11 members, and Class 2, which is the Baby Boomers, contains 4 members. If you add up these numbers you will see that they total 16, which is the total sample size of the test data.
[[ 1  0  0]
 [ 0 14  0]
 [ 0  0  1]]
              precision    recall  f1-score   support
           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00         1
   micro avg       1.00      1.00      1.00        16
   macro avg       1.00      1.00      1.00        16
weighted avg       1.00      1.00      1.00        16
    generation
20           1
26           1
60           1
63           1
48           1
49           1
25           1
43           1
74           1
69           1
51           1
46           1
16           1
12           2
77           0
23           1

Above is another run of the neural network program

Day 14: Random Forest Classification problem target market survey

The Random Forest algorithm is a type of supervised machine learning algorithm that combines the predictions of many decision trees to get a more powerful prediction model.
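
The voting idea behind a random forest can be sketched with three hand-written "trees". This is only an illustration of the classic majority-vote step: a real forest grows each tree automatically from a random sample of the rows and columns, and scikit-learn's implementation averages the trees' class probabilities rather than counting hard votes. The rule functions and the record below are made up for the example.

```python
from collections import Counter

# Three hypothetical "trees", each a simple rule on one survey answer
def tree_by_age(record):
    return "Baby_Boomer" if record["age"] >= 4 else "GenX"

def tree_by_income(record):
    return "Baby_Boomer" if record["income"] >= 7 else "GenX"

def tree_by_occupation(record):
    return "Baby_Boomer" if record["occup"] == 2 else "GenX"

forest = [tree_by_age, tree_by_income, tree_by_occupation]

def forest_predict(record):
    """Return the class that most trees vote for."""
    votes = Counter(tree(record) for tree in forest)
    return votes.most_common(1)[0][0]

record = {"age": 4, "income": 6, "occup": 2}
print(forest_predict(record))   # Baby_Boomer (two of the three trees agree)
```

A single tree can be wrong, as the income rule is here, but the combined vote still lands on the right class.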

In the box below is the code you will use for the problem. The dataset is the eight-question survey used in the KNN problem, but you will need to make a copy of it and save it in your working directory.
import numpy as np
import pandas as pd
import sys
sys.__stdout__= sys.stdout
#Load the survey from the working directory; the first row holds the column names
dataset = pd.read_csv("Survey.csv")
print(dataset.head(20))
print(dataset.tail(20))
#The first eight columns are the features; the ninth is the generation label
X = dataset.iloc[: , 0:8].values
y = dataset.iloc[: , 8].values
#Hold out 20% of the rows as test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#Scale the features using the mean and standard deviation of the training data
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Build a forest of 20 decision trees and predict the test data
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
pred_y = classifier.predict(X_test)
#Display the confusion matrix, the classification report and the accuracy
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score
print(confusion_matrix(y_test,pred_y))
print(classification_report(y_test,pred_y))
print(accuracy_score(y_test, pred_y))
Put the code on the clipboard and paste it into the Spyder editor. Save your file in your working directory with a .py extension.

Here is the dataset. Copy it to the clipboard and save it in your working directory. Save the file as "Survey.csv". Note, this is the eight-question survey file.

Age,Gender,Geoloc,Marital,Education,Occup,Income,Race,Generation 1,1,3,1,1,11,1,4,GenZ 1,1,3,1,1,11,1,6,GenZ 1,1,3,1,1,11,1,4,GenZ 1,1,3,1,1,11,1,6,GenZ 1,2,3,1,2,11,1,6,GenZ 1,1,3,1,1,11,1,4,GenZ 1,1,3,1,1,11,1,6,GenZ 1,2,3,1,2,11,1,6,GenZ 1,1,3,1,1,11,1,4,GenZ 1,1,3,1,1,11,1,6,GenZ 1,2,3,1,2,11,1,6,GenZ 1,2,3,1,2,11,1,6,GenZ 1,1,3,1,1,11,1,4,GenZ 1,1,3,1,1,11,1,6,GenZ 1,2,3,1,2,11,1,6,GenZ 1,1,3,1,1,11,1,4,GenZ 1,1,3,1,1,11,1,6,GenZ 1,2,3,1,2,11,1,6,GenZ 2,1,2,1,4,1,5,2,GenY 2,1,2,1,4,1,5,2,GenY 2,1,2,1,4,1,5,2,GenY 2,1,2,1,4,1,5,2,GenY 2,1,2,1,4,1,5,2,GenY 2,1,2,1,4,1,5,2,GenY 3,1,2,1,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,2,5,2,3,4,6,4,GenX 3,1,2,1,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,1,2,1,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,1,2,1,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,1,2,1,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,1,2,1,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 3,1,3,2,4,3,5,4,GenX 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 
4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,5,2,4,2,7,4,Baby_Boomer 4,2,3,2,5,2,6,4,Baby_Boomer 4,2,3,2,5,2,6,4,Baby_Boomer 4,2,3,2,5,2,6,4,Baby_Boomer 4,2,3,2,5,2,6,4,Baby_Boomer 4,2,3,2,5,2,6,4,Baby_Boomer 4,2,3,2,5,2,6,4,Baby_Boomer 5,1,1,2,5,2,6,4,Traditionalist 5,1,1,2,5,2,6,6,Traditionalist 5,1,1,2,5,2,6,5,Traditionalist 5,1,1,2,5,2,6,3,Traditionalist 5,1,1,2,5,2,6,2,Traditionalist
Below is one of the runs of the program. Run it multiple times to see how your results vary.
import numpy as np
import pandas as pd
import sys
sys.__stdout__= sys.stdout
dataset = pd.read_csv("Survey.csv")
print(dataset.head(20))
    Age  Gender  Geoloc  Marital  ...  Occup  Income  Race Generation
0     1       1       3        1  ...     11       1     4       GenZ
1     1       1       3        1  ...     11       1     6       GenZ
2     1       1       3        1  ...     11       1     4       GenZ
3     1       1       3        1  ...     11       1     6       GenZ
4     1       2       3        1  ...     11       1     6       GenZ
5     1       1       3        1  ...     11       1     4       GenZ
6     1       1       3        1  ...     11       1     6       GenZ
7     1       2       3        1  ...     11       1     6       GenZ
8     1       1       3        1  ...     11       1     4       GenZ
9     1       1       3        1  ...     11       1     6       GenZ
10    1       2       3        1  ...     11       1     6       GenZ
11    1       2       3        1  ...     11       1     6       GenZ
12    1       1       3        1  ...     11       1     4       GenZ
13    1       1       3        1  ...     11       1     6       GenZ
14    1       2       3        1  ...     11       1     6       GenZ
15    1       1       3        1  ...     11       1     4       GenZ
16    1       1       3        1  ...     11       1     6       GenZ
17    1       2       3        1  ...     11       1     6       GenZ
18    2       1       2        1  ...      1       5     2       GenY
19    2       1       2        1  ...      1       5     2       GenY
[20 rows x 9 columns]
print(dataset.tail(20))
    Age  Gender  Geoloc  ...  Income  Race      Generation
80    4       2       5  ...       7     4     Baby_Boomer
81    4       2       5  ...       7     4     Baby_Boomer
82    4       2       5  ...       7     4     Baby_Boomer
83    4       2       5  ...       7     4     Baby_Boomer
84    4       2       5  ...       7     4     Baby_Boomer
85    4       2       5  ...       7     4     Baby_Boomer
86    4       2       5  ...       7     4     Baby_Boomer
87    4       2       5  ...       7     4     Baby_Boomer
88    4       2       5  ...       7     4     Baby_Boomer
89    4       2       3  ...       6     4     Baby_Boomer
90    4       2       3  ...       6     4     Baby_Boomer
91    4       2       3  ...       6     4     Baby_Boomer
92    4       2       3  ...       6     4     Baby_Boomer
93    4       2       3  ...       6     4     Baby_Boomer
94    4       2       3  ...       6     4     Baby_Boomer
95    5       1       1  ...       6     4  Traditionalist
96    5       1       1  ...       6     6  Traditionalist
97    5       1       1  ...       6     5  Traditionalist
98    5       1       1  ...       6     3  Traditionalist
99    5       1       1  ...       6     2  Traditionalist
[20 rows x 9 columns]
X = dataset.iloc[: , 0:8].values
y = dataset.iloc[: , 8].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()
X_train = sc.fit_transform(X_train)
C:\Users\jerrybelch\Anaconda3\lib\site-packages\sklearn\utils\validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
warnings.warn(msg, DataConversionWarning)
C:\Users\jerrybelch\Anaconda3\lib\site-packages\sklearn\utils\validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
warnings.warn(msg, DataConversionWarning)
X_test = sc.transform(X_test)
C:\Users\jerrybelch\Anaconda3\lib\site-packages\sklearn\utils\validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
warnings.warn(msg, DataConversionWarning)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
Out[31]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
    max_depth=None, max_features='auto', max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
    oob_score=False, random_state=0, verbose=0, warm_start=False)
pred_y = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score
print(confusion_matrix(y_test,pred_y))
[[ 3  0  0  0]
 [ 0 12  0  0]
 [ 0  0  3  0]
 [ 0  0  0  2]]
print(classification_report(y_test,pred_y))
                precision    recall  f1-score   support
   Baby_Boomer       1.00      1.00      1.00         3
          GenX       1.00      1.00      1.00        12
          GenZ       1.00      1.00      1.00         3
Traditionalist       1.00      1.00      1.00         2
     micro avg       1.00      1.00      1.00        20
     macro avg       1.00      1.00      1.00        20
  weighted avg       1.00      1.00      1.00        20
print(accuracy_score(y_test, pred_y))
1.0

As you can see this is a good model. The test data contained 3 Baby Boomers, 12 members from Generation X, 3 members from Generation Z and two Traditionalists. Accuracy is 100%.

How do we interpret the results?

The intent is to train the model well enough that whenever we have new input data (X) we can predict the output variable (y) for that given set of data.

Day 15: K-Means Clustering

K-Means Clustering is a concept that falls under Unsupervised Learning. This algorithm can be used to find groups within unlabeled data.

Clustering is a widely used first step in exploratory data mining.
First we will look at an example of this form of unsupervised machine learning.

Next you will complete a DataFrame for a two-dimensional dataset.

Then we will find the centroids for 2 clusters, and then for 3 clusters.

A graphical user interface (GUI) will be used to display the results.

You will analyze the results to make marketing decisions.
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
#Heights in inches and weights in lbs; the first height pairs with the first weight, and so on
Data = {'Height': [62,64,65,62,63,66,67,68,69,71,69,72,75,74,71,76,62,66,72,70,65,74,68,76,62,70,73,67,70,66,72,62,62,69,71,72,67,67,63,75,76,69,70,68,70,76],
        'Weight': [128,131,142,135,140,150,145,158,160,160,160,170,165,176,175,176,135,140,170,161,137,180,155,187,140,157,165,142,161,151,170,140,150,176,184,190,150,168,153,205,210,200,200,180,158,205]
       }
df = DataFrame(Data,columns=['Height','Weight'])
print (df)
print(df.shape)
#Group the observations into 4 clusters and get the center of each cluster
kmeans = KMeans(n_clusters=4).fit(df)
centroids = kmeans.cluster_centers_
print(centroids)
#Plot each observation colored by its cluster, then the centroids in red
plt.scatter(df['Height'], df['Weight'], c= kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()

Put the above code on your clipboard and paste it into Spyder. This example is about height and weight. We all know that there is a correlation between height and weight.

Save your code into your working directory with a .py extension and run it.

Below is what it should look like when you execute the code.

You can see that individual 0 is 62 inches tall and weighs 128 lbs. Individual 1 is 64 inches tall and weighs 131 lbs.

If you look at the data statement in the code, it is read vertically: the first height goes with the first weight, the second height with the second weight, and so on.
    Height  Weight
0       62     128
1       64     131
2       65     142
3       62     135
4       63     140
5       66     150
6       67     145
7       68     158
8       69     160
9       71     160
10      69     160
11      72     170
12      75     165
13      74     176
14      71     175
15      76     176
16      62     135
17      66     140
18      72     170
19      70     161
20      65     137
21      74     180
22      68     155
23      76     187
24      62     140
25      70     157
26      73     165
27      67     142
28      70     161
29      66     151
30      72     170
31      62     140
32      62     150
33      69     176
34      71     184
35      72     190
36      67     150
37      67     168
38      63     153
39      75     205
40      76     210
41      69     200
42      70     200
43      68     180
44      70     158
45      76     205
[[ 68.5625      157.125     ]
 [ 73.          201.66666667]
 [ 63.91666667  137.91666667]
 [ 71.83333333  176.        ]]

The first column represents the individuals. They are assigned a number in the array starting with 0.

The next column represents the height of each individual.

The third column is the weight of each.

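The centroid of each cluster is listed at the end, one row per cluster: the mean height, then the mean weight.

In mathematics and physics, the centroid or geometric center of a plane figure is the arithmetic mean position of all the points in the figure. For a cluster, it is the mean position of all the points in the cluster.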

The center of each cluster (in red) represents the mean of all the observations that belong to that cluster.

You may also see, the observations that belong to a given cluster are closer to the center of that cluster, in comparison to the centers of other clusters.
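
These two ideas, the centroid as a mean and each observation belonging to its nearest centroid, can be sketched without scikit-learn. The two groups below are hand-picked from the lesson's height and weight pairs purely for illustration; they are not the clusters K-Means actually finds, and the function names are made up for the example.

```python
import math

def centroid(points):
    """Mean position of a list of (height, weight) points."""
    count = len(points)
    return (sum(p[0] for p in points) / count, sum(p[1] for p in points) / count)

def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (straight-line distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Two hypothetical groups built from a few of the lesson's height/weight pairs
lighter = [(62, 128), (64, 131), (62, 135)]
heavier = [(75, 205), (76, 210), (76, 205)]

centroids = [centroid(lighter), centroid(heavier)]
print(centroids)
print(nearest_centroid((63, 140), centroids))   # 0, the lighter group
```

K-Means repeats these two steps, assigning every point to its nearest centroid and then recomputing each centroid as the mean of its points, until the assignments stop changing.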

Each cluster is a different color. In a run with three clusters there are two blues and one yellow. The red dots mark the mean of each cluster.
- The cluster centered at 73 and 197 is shown in the darker blue.
- The cluster centered at 64 and 141 is shown in the lighter blue.
- The cluster centered at 71 and 176 is shown by the yellow dots.
What do the clusters represent: small frame, large frame or body types, ectomorph, mesomorph and endomorph? Are some men and some women? Does age play a factor perhaps?

Find the line of the code that sets the number of clusters (n_clusters). Try 2 or 3 and look at the results.

Remember k-means clustering is an exploratory technique that should help us develop other models using different algorithms.

Day 16: Creating a new data set using the existing code.

Now it is your turn to modify the program. This new model contains the number of sales made to 46 of our customers (numbered 0 to 45) and the amounts of those sales. That will be our dataset.

We will configure it to be a two-dimensional array. Instead of height and weight use Monthly Purchases and Expenditures as the column headings. Use the table below to obtain the data.
- Customer 0 made 1 monthly purchase for $50
- Customer 1 made 2 monthly purchases for $35
- Customer 2 made 7 monthly purchases for $800
- Customer 3 made 1 monthly purchase for $500
- Customer 4 made 6 monthly purchases for $300
- Customer 5 made 12 monthly purchases for $750
- Customer 6 made 3 monthly purchases for $100
- Customer 7 made 10 monthly purchases for $450
- Customer 8 made 16 monthly purchases for $850
- Customer 9 made 2 monthly purchases for $25
- Customer 10 made 6 monthly purchases for $300
- Customer 11 made 9 monthly purchases for $200
- Customer 12 made 6 monthly purchases for $500
- Customer 13 made 2 monthly purchases for $100
- Customer 14 made 4 monthly purchases for $87
- Customer 15 made 9 monthly purchases for $650
- Customer 16 made 4 monthly purchases for $75
- Customer 17 made 6 monthly purchases for $150
- Customer 18 made 2 monthly purchases for $350
- Customer 19 made 6 monthly purchases for $300
- Customer 20 made 6 monthly purchases for $300
- Customer 21 made 12 monthly purchases for $35
- Customer 22 made 1 monthly purchase for $35
- Customer 23 made 15 monthly purchases for $600
- Customer 24 made 4 monthly purchases for $100
- Customer 25 made 18 monthly purchases for $200
- Customer 26 made 25 monthly purchases for $800
- Customer 27 made 3 monthly purchases for $99
- Customer 28 made 5 monthly purchases for $175
- Customer 29 made 8 monthly purchases for $200
- Customer 30 made 9 monthly purchases for $150
- Customer 31 made 8 monthly purchases for $900
- Customer 32 made 3 monthly purchases for $800
- Customer 33 made 4 monthly purchases for $650
- Customer 34 made 2 monthly purchases for $75
- Customer 35 made 3 monthly purchases for $150
- Customer 36 made 4 monthly purchases for $200
- Customer 37 made 15 monthly purchases for $500
- Customer 38 made 6 monthly purchases for $90
- Customer 39 made 8 monthly purchases for $300
- Customer 40 made 15 monthly purchases for $750
- Customer 41 made 14 monthly purchases for $800
- Customer 42 made 12 monthly purchases for $600
- Customer 43 made 10 monthly purchases for $900
- Customer 44 made 12 monthly purchases for $150
- Customer 45 made 8 monthly purchases for $650
The bulk of the code is the same. All you will be changing is the data and column headings.

Be very careful entering the data, watching the placement of commas, brackets [ ] and braces { }.

Change all column headings. Replace Height and Weight with Monthly Purchases and Expenditures.

Save and execute code.
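
One way to reduce typing mistakes is to enter each customer as a pair and let Python build the two columns. The sketch below does this for customers 0 through 4 from the list above only; the variable name customers is made up for the example, and you would extend the list to all of the customers yourself.

```python
# (monthly purchases, expenditures) for customers 0 through 4 from the list above
customers = [(1, 50), (2, 35), (7, 800), (1, 500), (6, 300)]

Data = {
    'Monthly Purchases': [purchases for purchases, _ in customers],
    'Expenditures': [amount for _, amount in customers],
}

# Both columns must have one entry per customer, or DataFrame() will raise an error
assert len(Data['Monthly Purchases']) == len(Data['Expenditures']) == len(customers)
print(Data)
```

The resulting Data dictionary can be passed to DataFrame(Data, columns=['Monthly Purchases', 'Expenditures']) in place of the height and weight data, with the same two column names used in the plt.scatter lines.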

    Monthly Purchases  Expenditures
0                   1            50
1                   2            35
2                   7           800
3                   1           500
4                   6           300
5                  12           750
6                   3           100
7                  10           450
8                  16           850
9                   2           255
10                 18           700
11                  9           200
12                  6           500
13                  2           100
14                  4            87
15                  9           650
16                  4            75
17                  6           150
18                  2           350
19                  7            65
20                  5           250
21                 12            35
22                  1            35
23                 15           600
24                  4           100
25                 18           200
26                 25           800
27                  3            99
28                  5           175
29                  8           200
30                  9           150
31                  8           900
32                  3           800
33                  4           650
34                  2            75
35                  3           150
36                  4           200
37                 15           500
38                  6            90
39                  8           300
40                 15           750
41                 14           800
42                 12           600
43                 10           900
44                 12           150
45                  8           650
(46, 2)
[[  4.76470588  90.94117647]
 [  8.88888889 566.66666667]
 [ 12.8        805.        ]
 [  6.7        243.        ]]

Day 17: Analyzing the results

Answer the following questions:

How many big spenders do you have, the light blue dots?
What is the range of expenditures of this group?
What is the average or mean amount spent for this group?
Looking at the dark blue dots, how many sales, range and mean?
Looking at the yellow dots, how many sales, range and mean?
The medium blue dots are the ones that spend the least. How many fall into this category, and what are the average sale and range of sales?
The light blue dots represent the top spenders. Typically 10-20 percent of total buyers produce 50-80 percent of a company's profit. Once this cluster has been identified, you should consider up-selling and cross-selling techniques that will further contribute to the profitability of the company.
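
The count, range and mean for each cluster can also be worked out in code instead of by reading the plot. The sketch below uses made-up expenditures and cluster labels for eight customers; in the real program the labels would come from kmeans.labels_ and the amounts from the Expenditures column.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical expenditures and cluster labels for eight customers.
# In the real program the labels come from kmeans.labels_.
expenditures = [50, 35, 800, 500, 300, 750, 100, 450]
cluster_labels = [0, 0, 2, 1, 3, 2, 0, 1]

# Collect the amounts spent by the customers in each cluster
groups = defaultdict(list)
for amount, label in zip(expenditures, cluster_labels):
    groups[label].append(amount)

# For each cluster: number of customers, smallest, largest and mean amount
summary = {}
for label in sorted(groups):
    amounts = groups[label]
    summary[label] = (len(amounts), min(amounts), max(amounts), mean(amounts))
    print(label, summary[label])
```

The cluster with the highest mean is the group of big spenders the questions above ask about.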

According to Wikipedia, Upselling is a sales technique whereby a seller induces the customer to purchase more expensive items, upgrades or other add-ons in an attempt to make a more profitable sale. While it usually involves marketing more profitable services or products, it can be simply exposing the customer to other options that were perhaps not considered.

Cross-selling is the action or practice of selling an additional product or service to an existing customer. In practice, businesses define cross-selling in many different ways. Elements that might influence the definition might include the size of the business, the industry sector it operates within and the financial motivations of those required to define the term.

You will want to see what they purchased to offer suggestions for other related products.