Not far from Phoenix, Arizona, are the Superstition Mountains.

The Superstition Mountains are a range of mountains located east of the Phoenix metropolitan area. They are anchored by Superstition Mountain, a large mountain that is a popular recreation destination for residents of the Phoenix, Arizona, area.

Goldfield is an abandoned ghost town in the Superstition Mountains that has been refurbished as a tourist attraction. The town is made up of numerous shops, attractions, and rides.

Day 1: Touring website and customer survey

Take a few minutes to look at the website to get a feel for what is being promoted and available for tour customers.

The tour company is wondering what the most effective way is to promote the tour to the town and increase visitors and sales.

They have decided to conduct some marketing research. They want to have a survey to determine what would motivate people to book a tour to Goldfield: travel magazines, Internet advertising, social media, friend recommendations, television ads or radio ads.

As an incentive, prospective customers who take the survey will get a 25% discount on their tour cost.

Currently the tour company advertises to some degree in all of the above-mentioned media.

What marketing executives are interested in is which advertising method will give them the most bookings for the money invested.

Click on the survey button to see what the survey form looks like.

The cascading style sheet (CSS) formatting code is embedded in the HTML form.

There are no media queries included, because it is all one column and should work with any size screen.

The code for the survey form is listed below for you to examine.

To make it work for you, just change the mailTo information to one of your email accounts in the opening form tag.

You will need to remove the comment tags to make the comment section work.

Here is what a response to the questionnaire might look like.

A total of 220 surveys were taken and entered into an Excel spreadsheet.

Below is the spreadsheet containing the responses.

This spreadsheet will be our source of information for the different machine learning algorithms that we will be working with.

Goldfield Excel Spreadsheet

We are going to use three machine learning algorithms to evaluate the survey data and determine the most effective way to advertise our tours to the Goldfield ghost town in the Superstition Mountains.

Machine learning is a branch of artificial intelligence that provides systems with the ability to learn from experience without being programmed explicitly.¹

The applications access data and learn from it by themselves.

The algorithm extracts patterns from the data and then makes predictions based on these patterns.

The goal of machine learning is to allow computers to learn without the help of human beings.

Machine learning can be either supervised or unsupervised.

As the name suggests, the Supervised Learning definition in Machine Learning is like having a supervisor while a machine learns to carry out tasks. In the process, we basically train the machine with some data that is already labelled correctly. Post this, some new sets of data are given to the machine, expecting it to generate the correct outcome based on its previous analysis of the labelled data.²

Our approach to solving the question about the best way to advertise our tour to Goldfield uses supervised learning.

The survey provides the inputs, how much influence each medium had, and the output is whether the respondent booked a tour or not.

All three algorithms are classification algorithms, in that they predict a yes or no as to whether a tour will be booked.

Regression algorithms, by contrast, predict a number.

The three we will use are supervised learning classification algorithms.

Support Vector Machine Classifier (SVM)
K-Nearest Neighbor Classifier (KNN)
Random Forest Classifier (RF)

You will need to have installed Python and Anaconda to use this tutorial.
Installation of programs

Here are some additional pictures of Goldfield.

The Superstition Mountain tour is a four-hour bus tour to the ghost town of Goldfield on the Apache Trail.

Customers are picked up in Phoenix, Scottsdale and Tempe.

The tour includes lunch, a train ride, a mine tour and the gunfight reenactment.

Day 2: Support Vector Algorithm

Support Vector Classification is the first algorithm that we will use on our dataset of surveys.

We will load the dataset in the Excel spreadsheet directly into our Jupyter Notebook.

The support vector machine is a classification as well as a regression algorithm.

It works by finding the boundary that best separates the classes, minimizing the errors between the actual values and the predictions.

We will work with the code one cell at a time.

By working a cell at a time we can more easily debug our code.

First we will import the Python libraries needed for our project.

Start a new Python project in your Jupyter notebook.

Cell 1

Paste this code in your first cell.

Save your project.

Select View and toggle line numbers. This will make it easier to identify lines.

Click on the link for the Goldfield Excel Worksheet containing the survey data and save it to your computer in the directory you are working in.

The libraries and algorithms are:

Pandas: Needed to read in the spreadsheet and store it in a Pandas dataframe.
NumPy: A general-purpose array-processing package.
Math: Provides us access to some common math functions and constants.
sklearn: Provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python.
matplotlib: Needed for data visualization (graphs).
pygal: A Python module that is mainly used to build SVG (Scalable Vector Graphics) graphs and charts.
seaborn: A Python data visualization library based on matplotlib.
sys: Provides access to some variables used by the interpreter to manipulate the Python runtime environment.

Now we need to get the dataset from the Excel spreadsheet.

Cell 2

Paste this code in your second cell.

Save your project.

Run the first two cells to make sure you are loading the dataset.

The variable df is assigned to the dataset.

Cell 3

Paste this code in your third cell.

Save your project.

Run the first three cells to make sure you are loading the dataset, displaying the shape, rows and columns, of the dataset.

Here is what you can expect to see when you run these cells.

The dataset has 220 rows and 7 columns.

Cell 4

Paste this code in your fourth cell.

Save your project.

Run the first four cells to make sure you are loading the dataset, displaying the shape, rows and columns, of the dataset.

This cell prints out the head and tail of the dataset.
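The code for Cells 2 through 4 is not reproduced in the text, so here is a minimal sketch of what loading the spreadsheet and inspecting it might look like. The file name goldfield.xlsx, the column names, and the Booked column are assumptions made for illustration, and a synthetic frame stands in for the real survey so the snippet runs without the file.

```python
import numpy as np
import pandas as pd

# Hypothetical file name; use the name you saved the worksheet under.
SURVEY_FILE = "goldfield.xlsx"

def load_survey(path=SURVEY_FILE):
    """Read the survey spreadsheet into a DataFrame (needs an Excel engine such as openpyxl)."""
    return pd.read_excel(path)

# Synthetic stand-in for the real survey: six media scores from 0 to 5
# plus a y/n booking column. Column names are assumed, not taken from the file.
MEDIA = ["Travel", "Internet", "Social Media", "Friend", "TV ad", "Radio ad"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 6, size=(220, len(MEDIA))), columns=MEDIA)
df["Booked"] = np.where(df["Internet"] + df["Social Media"] >= 5, "y", "n")

print(df.shape)   # (rows, columns)
print(df.head())  # first five records
print(df.tail())  # last five records
```

With the real worksheet you would call load_survey() instead of building the synthetic frame.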

Cell 5

Paste this code in your fifth cell.

Save your project.

This cell does not produce any output.

Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.

Cell 6

Paste this code in your sixth cell.

Save your project.

This cell does not produce any output.

Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.

Cell 7

Paste this code in your seventh cell.

Save your project.

This cell does not produce any output.

The line of code creates two groups of data: a training set and a test set.

The test size is 20% of the total number of records, 220, which gives 44 test records.
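A minimal sketch of what Cells 5 through 7 might contain follows: separate the six media scores from the booking answer, then hold out 20% of the records for testing. The column names and the synthetic data are assumptions standing in for the real spreadsheet.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (assumed column names, made-up values).
MEDIA = ["Travel", "Internet", "Social Media", "Friend", "TV ad", "Radio ad"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 6, size=(220, len(MEDIA))), columns=MEDIA)
df["Booked"] = np.where(df["Internet"] + df["Social Media"] >= 5, "y", "n")

X = df[MEDIA]      # independent variables: the six media scores
y = df["Booked"]   # dependent variable: booked a tour (y) or not (n)

# Hold out 20% of the 220 records as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(len(X_train), len(X_test))  # prints: 176 44
```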

Cell 8

Paste this code in your eighth cell.

Save your project.

Support vector machines are widely used supervised learning methods that can be used for regression, classification, and anomaly detection problems.

The SVM-based classifier is called the SVC (Support Vector Classifier), and we can use it in classification problems.

SVC(kernel='linear') is the output.

Cell 9

Paste this code in your ninth cell.

Save your project.

This cell does not produce any output.

It assigns the results from the test data to the variable pred_y.

It then prints out the test results.
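As a rough sketch of Cells 8 and 9, the classifier can be created, fitted to the training data, and asked for predictions on the test data. The data below is synthetic and the column names are assumed; only SVC(kernel='linear') and the variable name pred_y come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the survey data (assumed column names, made-up values).
MEDIA = ["Travel", "Internet", "Social Media", "Friend", "TV ad", "Radio ad"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 6, size=(220, len(MEDIA))), columns=MEDIA)
df["Booked"] = np.where(df["Internet"] + df["Social Media"] >= 5, "y", "n")

X_train, X_test, y_train, y_test = train_test_split(
    df[MEDIA], df["Booked"], test_size=0.2, random_state=0
)

# Create a support vector classifier with a linear kernel and train it.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# Predict a y or n for each record in the held-out test set.
pred_y = clf.predict(X_test)
print(pred_y)
```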

Predicted test results

Cell 10

Paste this code in your tenth cell.

Save your project.

Run your code. This cell should give you similar results.

Each time you rerun the program, the results will change slightly because a different random sample is used.

To get the same results each time you run your program, add random_state=0 to the code.

Remember that this represents a 20% sample taken randomly.

You can use any integer for random_state.

random_state is basically used to reproduce the same split every time the program is run.

If you do not use a random_state in train_test_split, every time you make the split you might get a different set of train and test data points, which will not help you in debugging if you run into an issue.
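The effect of random_state can be checked directly. The sketch below splits a stand-in frame of 220 numbered respondents (made up for illustration) three times and compares which respondents land in the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: one row per respondent, numbered 1 to 220.
df = pd.DataFrame({"respondent": range(1, 221)})

# Same seed twice, then a different seed.
a_train, a_test = train_test_split(df, test_size=0.2, random_state=0)
b_train, b_test = train_test_split(df, test_size=0.2, random_state=0)
c_train, c_test = train_test_split(df, test_size=0.2, random_state=7)

# True when both splits picked the same test rows in the same order.
print(a_test.index.equals(b_test.index))  # same seed
print(a_test.index.equals(c_test.index))  # different seed
```

The first comparison shows that a fixed seed repeats the split exactly, and the second shows that a different integer selects a different 20% sample.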

The image below is a textual representation of a confusion matrix. It evaluates the results of the model.

It indicates that out of a random sample of 44, 36 booked a trip (true positives) and 7 did not book a trip (true negatives).

Thirty-six did book a trip, and both the actual data and the model agree on this (true positives).

One response was a false negative: one that is actually true but predicted as false by the model.

There were no false positives: those actually false but predicted as true by the model.

False negatives and false positives affect the accuracy. These are incorrect predictions by the model.
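To make the four categories concrete, the sketch below rebuilds label lists from the counts quoted above (36 true positives, 7 true negatives, 1 false negative, 0 false positives) and passes them to scikit-learn's confusion_matrix. The order of the individual responses is made up; only the counts come from the text.

```python
from sklearn.metrics import confusion_matrix

# 36 booked and predicted y, 7 did not book and predicted n,
# 1 booked but predicted n (the false negative).
actual = ["y"] * 36 + ["n"] * 7 + ["y"] * 1
predicted = ["y"] * 36 + ["n"] * 7 + ["n"] * 1

# Rows are actual classes, columns are predicted classes, in the order n, y.
cm = confusion_matrix(actual, predicted, labels=["n", "y"])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```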

There are a number of metrics to see how well your model performed in making predictions on the unknown test set.

There are different formulas for some metrics based on support: 36 yes bookings and 7 no bookings.

Precision
Recall
F1 score
Accuracy

If you want to see how to calculate these metrics, here are the formulas.

Plug in the numbers from your Python SVM program to see if they match.
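As a worked example, the standard definitions of these metrics can be applied to the counts quoted for the SVM run (36 true positives, 7 true negatives, 0 false positives, 1 false negative), treating "y" as the positive class. Your own counts may differ if you use a different random sample.

```python
# Counts taken from the confusion matrix described in the text.
tp, tn, fp, fn = 36, 7, 0, 1

precision = tp / (tp + fp)   # of those predicted y, how many really booked
recall = tp / (tp + fn)      # of those who really booked, how many were predicted y
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all predictions that were right

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# prints: precision=1.000 recall=0.973 f1=0.986 accuracy=0.977
```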

Cell 11

Paste this code in your eleventh cell.

Save your project.

This is a graphical representation of the confusion matrix for this project.

Cell 12

Paste this code in your twelfth cell.

Save your project.

Run it to see if you get the description of the dataset.

By describing the data set, you can get some more useful information.

There are 220 records in the dataset.
The mean or average is given for Travel, Internet, Social Media, Friend, TV ad, Radio ad.
The standard deviation is given for each of the above.
The 50% number represents the median score for each.

As you can see by looking at the mean score for each, the Internet and Social Media averages are considerably above the others, indicating that more people thought these two media were more important than the others.
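A minimal sketch of describing the dataset follows. It uses synthetic scores with assumed column names, so the numbers will not match the real survey, but the rows of the summary (count, mean, std, 50%) are read the same way.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey scores (assumed column names, made-up values).
MEDIA = ["Travel", "Internet", "Social Media", "Friend", "TV ad", "Radio ad"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 6, size=(220, len(MEDIA))), columns=MEDIA)

# describe() returns count, mean, std, min, quartiles and max for each numeric column.
summary = df[MEDIA].describe()
print(summary.loc[["count", "mean", "std", "50%"]])

# The 50% row is the median of each column.
print(df[MEDIA].median())
```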

Cell 13

Paste this code in your thirteenth cell.

Save your project.

When you run this cell it should look like this.

Using actual data, you can do a graphical interpretation of how respondents rated this advertising medium. 107 indicated that ads in travel magazines had no influence on their decision to book a trip.

Sixty, 51% indicated that travel magazines had very little influence.

Five, 2% indicated some influence.

Twenty, 9% indicated that travel magazines had an average amount of influence.

Two, 0% indicated that travel magazines had a good amount of influence on their decision to book or not book a trip.

Twenty, 9% indicated that travel magazines had a large amount of influence on their buying preferences.

As a result of just looking at this data, you can see that travel magazines are not a good fit for promoting Superstition Mountain bus tours.

Please note, the graph does not represent the predicted data.
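The counts behind a graph like this can be tabulated with value_counts. The sketch below is an illustration on synthetic scores with an assumed column name, so its numbers differ from the ones quoted above; it shows how the number and percentage of respondents at each rating level could be computed before charting them.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Travel column: 220 ratings from 0 (no influence) to 5.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Travel": rng.integers(0, 6, size=220)})

# Count how many respondents chose each rating, ordered from 0 to 5.
counts = df["Travel"].value_counts().sort_index()

# Convert the counts to percentages of all respondents.
percent = (counts / counts.sum() * 100).round(1)

table = pd.DataFrame({"responses": counts, "percent": percent})
print(table)
```

The same table can be built for each of the other media columns by changing the column name.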

Day 3: Additional Graphs

Using the data from the previous cell, create graphs for each of the other forms of advertising media used in our study.

Day 4: KNN Classifier

We are now going to look at our spreadsheet data from the survey and run it through another Python algorithm.

The kNN algorithm is a supervised machine learning model. That means it predicts a target variable using one or multiple independent variables.

We have six independent variables: Travel, Internet, Social Media, Friends, TV and radio.

We have one dependent variable: book a trip or not book a trip.

The booking variable is dependent upon one or more of the independent variables.

The K-nearest neighbor (K-NN) algorithm basically creates an imaginary boundary to classify the data.

The algorithm does not assume any relationship between the independent variables.

When new data points come in, the algorithm assigns each one to the class of its nearest neighbors, which places it on one side of that boundary.
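The idea of voting among the nearest neighbors can be sketched in a few lines. This is a toy illustration written from scratch, not the scikit-learn classifier used in the tutorial; the six points and their labels are made up, with each point standing for a pair of survey scores.

```python
from collections import Counter

import numpy as np

def knn_predict(train_points, train_labels, new_point, k=3):
    """Label new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point.
    distances = np.linalg.norm(np.asarray(train_points) - np.asarray(new_point), axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors wins.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data: low scores did not book, high scores booked.
points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 4], [4, 5]]
labels = ["n", "n", "n", "y", "y", "y"]

print(knn_predict(points, labels, [4, 4]))  # three nearest are all "y"
print(knn_predict(points, labels, [1, 1]))  # three nearest are all "n"
```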

Let's look at the code and compare the results to the SVM model.

Much of the code is very similar.

Cell 1

Start a new Jupyter Notebook Python project.

Paste this code in your first cell.

Save your project.

Cell 2

Paste this code in your second cell.

Save your project.

There is no output from this cell. It reads in the dataset from an Excel spreadsheet and assigns it to a variable called df.

Cell 3

Paste this code in your third cell.

Save your project.

Run the code of the first three cells and you should see the size of the dataset, (220,7).

Cell 4

Paste this code in your fourth cell.

Save your project.

Run the code of the first four cells and you should see the head and tail of the dataset.

Cell 5

Paste this code in your fifth cell.

Save your project.

Run the code of this cell and you should see the dataset described.

Cell 6

Paste this code in your sixth cell.

Save your project.

The code in this cell assigns the values of all the independent variables to the variable X.

Cell 7

Paste this code in your seventh cell.

Save your project.

The code in this cell assigns the values of the dependent variable to the variable y.

Cell 8

Paste this code in your eighth cell.

Save your project.

The code in this cell assigns the training and test variables returned by train_test_split and sets the size of the random test sample to 20%.

Cell 9

Paste this code in your ninth cell.

Save your project.

The code in this cell creates a variable for the classifier.

Cell 10

Paste this code in your tenth cell.

Save your project.

Cell 11

Paste this code in your eleventh cell.

Save your project.

Predictions are made from X_test and then printed out.
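A minimal sketch of the classifier, fit, and prediction steps follows. The data is synthetic with assumed column names, and the value n_neighbors=5 is an assumption, since the text does not state which K was used; random_state=0 is added only to make the sketch repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the survey data (assumed column names, made-up values).
MEDIA = ["Travel", "Internet", "Social Media", "Friend", "TV ad", "Radio ad"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 6, size=(220, len(MEDIA))), columns=MEDIA)
df["Booked"] = np.where(df["Internet"] + df["Social Media"] >= 5, "y", "n")

X = df[MEDIA]
y = df["Booked"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Assumed K of 5; the tutorial's own value is not given in the text.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# One y or n prediction per test record.
y_pred = classifier.predict(X_test)
print(X_test)
print(y_pred)
```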

The output from this cell looks like the information contained in the file below.

KNN Predictions

The predictions are a 20% sample, randomly selected from the dataset.

There are 44 in the test file.

The predictions are y and n and are listed at the bottom of the file.

Respondents 152 and 74, based on the model predictions, would book a trip (Y).

If you look at the actual data from the spreadsheet, it would confirm this prediction.

Respondent 71 would not book a trip based on the predictions.

Actual data also confirms this prediction.

For example, our model predicted that 37 of the 44 outcomes were Y, booked a trip (Total Positives).

Our model predicted that 7 would not book a trip (Total Negatives).

Our model predicted that 2, #90 and #101, would not book a trip when the actual data indicated that they would book a trip (False Negatives).

False negatives and false positives indicate an error in our model. But it does not seem very serious based on the number of occurrences (2).

Cell 12

Paste this code in your twelfth cell.

Save your project.

Confusion Matrix.

Cell 13

Paste this code in your thirteenth cell.

Save your project.

This cell gives you a graphical representation of the confusion matrix.

Analysis of Results

The df.describe() function provides some helpful information.


Looking at the mean values for each advertising media, you could make some decisions as to how best to advertise your site.

Internet advertising had a mean of 3.67 out of 5 possible choices.

Social Media had a mean of 2.45.

Radio has the lowest mean of 0.76.

It seems that it would be better to use our advertising money on Internet and social media rather than TV, radio or travel magazines.

Looking at the confusion matrix, we can see that 35 out of the 44 were TP, those that would book a trip based on our predictions.

Thirty-five represents almost 80% (35/44).

Seven came up True Negative in our projections, indicating they would not book a trip (16%).

Precision, Recall and F1 scores were all good.

Accuracy was at 95%.

K value and error rate

Day 5: Random Forest Classification

Random Forest is a robust machine learning algorithm that can be used for a variety of tasks including regression and classification. It is an ensemble method, meaning that a random forest model is made up of a large number of small decision trees, called estimators, which each produce their own predictions. The random forest model combines the predictions of the estimators to produce a more accurate prediction.³

Cell 1

Paste this code in your first cell.

Save your project.

This cell imports the needed libraries.

Cell 2

Paste this code in your second cell.

Save your project.

This cell reads in the spreadsheet file.

Cell 3

Paste this code in your third cell.

Save your project.

This cell shows the shape of the dataset.

Cell 4

Paste this code in your fourth cell.

Save your project.

This cell prints out the head and tail of the dataset.

Cell 5

Paste this code in your fifth cell.

Save your project.

This cell describes the dataset.

Cell 6

Paste this code in your sixth cell.

Save your project.

This cell sets the feature and target columns, imports train_test_split, splits off a random test sample of 20%, and assigns the random forest classifier to clf.

Cell 7

Paste this code in your seventh cell.

Save your project.

This cell's code prints out the predicted values.
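A rough sketch of Cells 6 and 7 might look like the following. The data is synthetic with assumed column names, and n_estimators=100 with random_state=0 on the classifier are assumptions; only the 20% test size, random_state=0 on the split, and the name clf come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (assumed column names, made-up values).
MEDIA = ["Travel", "Internet", "Social Media", "Friend", "TV ad", "Radio ad"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 6, size=(220, len(MEDIA))), columns=MEDIA)
df["Booked"] = np.where(df["Internet"] + df["Social Media"] >= 5, "y", "n")

X = df[MEDIA]
y = df["Booked"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A forest of 100 small decision trees (assumed size), each voting on y or n.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Print the test records followed by the predicted y and n values.
y_pred = clf.predict(X_test)
print(X_test)
print(y_pred)
```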

Click on the link to see the predictions.

Random Forest Predictions

The data below depends on the fact that random_state was set to 0.

If you change the number to another integer, you will get a different random sample of predicted values.

Let's look at respondent 153. It is the eighth one down from the top of the list.

If you look at the yes and no display at the bottom of the printout, you can count over 8.

This is the prediction for someone who would answer the way respondent 153 did.

['y' 'y' 'n' 'y' 'y' 'y' 'y' 'n' 'n' 'y' 'n' 'n' 'n' 'y' 'y' 'y' 'y' 'y' 'y' 'y' 'y' 'y' 'y' 'y' 'y' 'y' 'y' 'n' 'y' 'y' 'y' 'y' 'y' 'y' 'y' 'n' 'y' 'y' 'y' 'y' 'y' 'y' 'y' 'y']

The answers for 153 are 0,1,0,0,0,0.

Remember how the survey works. Zero means that a given advertising media had no influence on the potential buyer.

This person would not be influenced by travel magazines, social media, friend recommendations, TV and radio ads.

They would not book a trip.

They indicated that they heard about our tour through an Internet advertisement (the 1 response represents the Internet box checked on the survey data), but that it did not influence them to book a trip.

According to the other responses, travel magazines, social media, friend recommendations, radio and TV advertising would have no influence on them.

Let's look at respondents 152 and 74. They are the eighth and nineteenth y and n responses at the bottom of the page.

The prediction indicates someone who responded this way would book a tour.

Both predictions had a 5 for the Internet advertising influence.

Examine the two lines below.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

This is the line that generates random test file data.

Without the random_state=0 part of the line, the predictions will change each time you run the program.

Remember the # makes the program ignore the line.

random_state=0 gives you the same result each time you run the program.

Change random_state to 7 and run your program. You will get different predicted values.

Cell 8

Paste this code in your eighth cell.

Save your project.

This cell's code prints out the confusion matrix.

Cell 9

Paste this code in your ninth cell.

Save your project.

This cell's code prints out the classification report.

Classification Report

Precision on the 'n' is 88%
Precision for the 'y' is 100%
Accuracy is 98%

Our model has done a good job.
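The figures above can be reproduced with scikit-learn's classification_report. The label lists below are an assumption: 7 correct "n" predictions, 36 correct "y" predictions, and one "y" respondent predicted as "n" is one breakdown of 44 test records that is consistent with the quoted precision and accuracy, but the actual confusion matrix is not shown in the text.

```python
from sklearn.metrics import classification_report

# Assumed breakdown of the 44 test records (order of records is made up).
actual = ["y"] * 36 + ["n"] * 7 + ["y"] * 1
predicted = ["y"] * 36 + ["n"] * 7 + ["n"] * 1

# Text version, as printed in a notebook cell.
print(classification_report(actual, predicted))

# Dictionary version, convenient for picking out single numbers.
report = classification_report(actual, predicted, output_dict=True)
print(round(report["n"]["precision"], 2))  # 7 of the 8 "n" predictions were right
print(round(report["y"]["precision"], 2))  # all 36 "y" predictions were right
print(round(report["accuracy"], 2))        # 43 of 44 predictions were right
```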

Cell 10

Paste this code in your tenth cell.

Save your project.

This cell's code prints out a single prediction and the most important features.

Single result and most important features.

Let's look at what this cell tells us.

If we enter a set of survey answers in the line "prediction = clf.predict", the result tells us whether a person with those answers would book a trip or not. With these numbers for the survey questions, this person would not book a trip.

The last four lines give us the information that we are looking for: Which advertising media has the most influence on booking a tour?

Num	Adv Med	Score
1	Internet	.407
2	Social Med	.292
5	Friend	.136
3	TV	.094
4	Radio	.043
0	Travel	.024

According to our model's predictions, Internet advertising has the most influence on our customers when deciding to book a tour, followed by social media.

The other media types: TV, radio, and travel magazines have a minimal effect.

Depending on the costs involved, the tour company should invest in Internet advertising and in using social media to promote their business.
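A single prediction and a ranking of feature importances can be sketched as follows. The data is synthetic, built so that bookings depend on the Internet and Social Media scores, and the column names and forest settings are assumptions; the scores it prints are therefore not the ones in the table above. The sample answers 0,1,0,0,0,0 are the ones quoted for respondent 153.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the survey data (assumed column names, made-up values).
MEDIA = ["Travel", "Internet", "Social Media", "Friend", "TV ad", "Radio ad"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 6, size=(220, len(MEDIA))), columns=MEDIA)
df["Booked"] = np.where(df["Internet"] + df["Social Media"] >= 5, "y", "n")

# For brevity the forest is fitted on all rows rather than a training split.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(df[MEDIA], df["Booked"])

# One set of survey answers, in the same column order as MEDIA.
sample = pd.DataFrame([[0, 1, 0, 0, 0, 0]], columns=MEDIA)
prediction = clf.predict(sample)
print(prediction[0])

# Importance of each medium to the forest's decisions, largest first.
ranked = pd.Series(clf.feature_importances_, index=MEDIA).sort_values(ascending=False)
print(ranked.round(3))
```

The importances always sum to 1, so each score can be read as that medium's share of the model's decision making.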

Cell 11

Paste this code in your eleventh cell.

Save your project.

This cell's code prints out a graphical representation of the most important features.

Day 6: Comparing Classification Algorithms

Which one's results should we use to decide our advertising mix?

First let's look at accuracy for each.

I tried different random states from 0 to 10 for each application.

By changing the random state integer, a different random file is generated for the test data.

Below is a chart summarizing the results.

Accuracy of the Classification Algorithms with Different Random States

Integer	SVM	KNN	RF
0	98%	98%	98%
1	95%	100%	100%
2	95%	98%	98%
3	93%	98%	98%
4	98%	98%	98%
5	95%	95%	95%
6	86%	93%	93%
7	93%	93%	93%
8	100%	100%	98%
9	95%	95%	95%
10	95%	100%	98%
Average	.9482	.9709	.9673
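A comparison like the one in the table could be produced with a loop over the random states. The sketch below is only an illustration: it runs on synthetic data with assumed column names, uses assumed settings for KNN and Random Forest, and will not reproduce the percentages above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the survey data (assumed column names, made-up values).
MEDIA = ["Travel", "Internet", "Social Media", "Friend", "TV ad", "Radio ad"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 6, size=(220, len(MEDIA))), columns=MEDIA)
df["Booked"] = np.where(df["Internet"] + df["Social Media"] >= 5, "y", "n")
X, y = df[MEDIA], df["Booked"]

# Factories so that each random state gets a freshly created model.
models = {
    "SVM": lambda: SVC(kernel="linear"),
    "KNN": lambda: KNeighborsClassifier(n_neighbors=5),    # assumed K
    "RF": lambda: RandomForestClassifier(random_state=0),  # assumed settings
}

rows = []
for seed in range(11):  # random states 0 through 10
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    row = {"Integer": seed}
    for name, make_model in models.items():
        model = make_model().fit(X_train, y_train)
        row[name] = accuracy_score(y_test, model.predict(X_test))
    rows.append(row)

results = pd.DataFrame(rows).set_index("Integer")
print(results.round(2))
print(results.mean().round(4))  # the Average row
```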

As you can see, the accuracy is excellent for each of the algorithms.

The one with the highest accuracy is the KNN algorithm. It is not significantly higher than Random Forest.

You will probably want to use Random Forest since you can determine features of importance.

When run with the same random seed, all three are trained and tested on the same split of the same dataset, so their accuracy figures can be compared directly.

Feature importance is not defined for the KNN and SVM Classification algorithms. It can be done, but it is a lot of work.

There is no easy way to compute the features responsible for booking a trip when using the KNN or SVM models.

The Marketing Department would most likely make a presentation to upper management.

You would want to provide them with all the necessary materials.

Individual completed surveys
Spreadsheet summarizing the result of the surveys
Predicted results from Random Forest
Accuracy number
Confusion matrix numbers and graph with explanation of true positives, true negatives, false positives and false negatives
Most important features percentages
Special features graph
Suggestions as to Internet and social media advertising

Internet Advertising

Tutorial on How to Determine Advertising Effectiveness Using Machine Learning Algorithms:

Random Forest, Nearest Neighbor, Support Vector Machine
