Day 1

First we need to get a copy of the worksheet for this assignment. Download it and print it out.

In this lesson we will work with the Random Forest algorithm. It is a powerful machine learning algorithm.

It would be helpful to go over the decision tree example on our web site.

Decision trees: this example will help you understand the principles behind random forests.

The decision tree is the basic building block of a random forest.

A decision tree is used to help a person make a prediction by asking a series of questions. Each question can have only two possible responses, such as yes or no.

The example on our web site deals with what type of business to start: a food truck, restaurant or a yogurt shop.

The information that we have to work with to make our decision is the initial investment, failure rates and estimated profits and losses.

To arrive at the best choice for a business, we used a series of questions, with each question narrowing our possible values until we were confident enough to make a single prediction: the kind of business we should invest our money in.

There is a 60% chance of success in the food truck business and a failure rate of 40%. Average revenue for successful food truck businesses is $90,500 per year. Unsuccessful food truck businesses lose about $45,000.

In the restaurant business the success rate is 52% and the failure rate is 48%. Revenue for successful restaurants is $600,000 per year, and unsuccessful ones lose about $120,000 per year.

Yogurt shops have about a 50% success rate and a 50% failure rate. Successful shops bring in about $400,000 and unsuccessful ones lose about $75,000.

The expected value is calculated in such a way that it includes all possible outcomes for a decision. Each outcome is weighted by its probability, and a loss is entered as a negative number.

expectedValueFoodTruck = (.60*90500) + (.40*-45000) = $36,300

The expected value reflects the average gain from investing in a food truck business: $36,300.

expectedValueRestaurant = (.52*600000) + (.48*-120000) = $254,400

expectedValueYogurtShop = (.50*400000) + (.50*-75000) = $162,500

After all the math is done, you can see that investing in a restaurant is the best business investment idea, with an expected value of $254,400.
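The expected value arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name expected_value and the dictionary layout are illustrative choices, and the figures are the ones quoted in this lesson, with each loss entered as a positive amount that the function subtracts.

```python
# Success probability, yearly gain if successful, yearly loss if unsuccessful
# (figures quoted in the lesson; the layout of this dictionary is an assumption).
businesses = {
    "food truck": (0.60, 90_500, 45_000),
    "restaurant": (0.52, 600_000, 120_000),
    "yogurt shop": (0.50, 400_000, 75_000),
}


def expected_value(p_success, gain, loss):
    """Weight the gain by the success rate and the loss by the failure rate."""
    return p_success * gain - (1 - p_success) * loss


expected = {name: expected_value(*figures) for name, figures in businesses.items()}

for name, value in expected.items():
    print(f"{name}: ${value:,.0f}")

# The business with the highest expected value
best = max(expected, key=expected.get)
print("Highest expected value:", best)
```

Changing any success rate or dollar figure in the dictionary shows how sensitive the ranking is to the estimates.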

Our prediction to invest in a restaurant rather than a food truck or yogurt shop is probably wrong. There are too many factors to take into account. Estimates of success and failure rates may be too high or too low. Our estimates of gains and losses may also be too high or too low.

"Much as humans learn from examples, the decision tree also learns through experience, except it does not have any previous knowledge it can incorporate into the problem. Before training, we are much ‘smarter’ than the tree in terms of our ability to make a reasonable estimate. However, after enough training with quality data, the decision tree will far surpass our prediction abilities. Keep in mind the decision tree does not have any conceptual understanding of the problem even after training. From the model’s ‘perspective’, it is simply receiving numbers as inputs and outputting different numbers that agree with those it saw during training. In other words, the tree has learned how to map a set of features to targets with no knowledge of anything about investing."¹

Individual predictions vary. If, however, we take hundreds or thousands of individual predictions, some high and some low, and average them together, we will get a much more accurate prediction. That is the basic idea behind a random forest model.
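A small simulation can illustrate why averaging helps. This is a sketch under simplifying assumptions: the spread of 50,000 is a made-up number, and each estimate is treated as independent of the others, which is more favorable than a real forest, where the trees are trained on overlapping data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VALUE = 254_400  # the quantity being estimated (the restaurant figure from above)
NOISE = 50_000        # hypothetical spread of a single estimate

# 1,000 individual estimates, some too high and some too low
estimates = [random.gauss(TRUE_VALUE, NOISE) for _ in range(1000)]

# How far off a single estimate is on average
typical_single_error = statistics.mean(abs(e - TRUE_VALUE) for e in estimates)

# How far off the average of all estimates is
error_of_average = abs(statistics.mean(estimates) - TRUE_VALUE)

print(f"Typical error of one estimate:  {typical_single_error:,.0f}")
print(f"Error of the averaged estimate: {error_of_average:,.0f}")
```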

Our problem is known as a classification problem, where the target is a discrete class label such as food truck, restaurant or yogurt shop. In that case, the random forest takes a majority vote of its trees to choose the predicted class.
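The majority vote can be sketched in plain Python. The seven votes below are hypothetical and the function name majority_vote is an illustrative choice; the point is only to show how many individual answers are combined into one.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label chosen by the largest number of trees."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical predictions from seven individual trees
tree_votes = [
    "restaurant", "food truck", "restaurant", "yogurt shop",
    "restaurant", "food truck", "restaurant",
]

winner = majority_vote(tree_votes)
print("Forest prediction:", winner)
```

Note that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch shows the idea, not that library's exact procedure.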

Day 2

Now let us look at another problem. Let's suppose that we are a company that remodels kitchens. We do everything from a $20,000 job to ones that cost over $100,000.

It takes anywhere from 3 to 16 weeks to complete the remodeling jobs.

We keep track to see if we are on schedule and on budget. We also keep track of the bid: if it was awarded to us or if we did not get the contract.

We keep track of our jobs on the following spreadsheet.

Kitchen Remodel

Download this spreadsheet and print it out, as we will need to create a dataset in Python using this information.

As you can see, we keep track of the cost, the weeks needed to complete the job, whether we guaranteed that we would be on schedule, and whether we guaranteed that we would not go over budget.

We have coded some of the fields to reflect true or false. A 0 means false and a 1 means true.

For example, looking at index 3, our $45,000 kitchen was estimated to take 4 weeks to complete. We did not guarantee completion within 4 weeks. We were, however, able to guarantee that we would be on budget.
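As a sketch of how the true/false coding works, the job at index 3 can be written as a one-row pandas data frame and its two guarantee fields converted to 0 and 1. The column names follow the printout used later in this lesson; storing the guarantees as True/False first is an assumption made only for illustration.

```python
import pandas as pd

# The job at index 3: $45,000, 4 weeks, no schedule guarantee, budget guarantee given
jobs = pd.DataFrame(
    {"Cost": [45000], "Weeks": [4], "onSchedule": [False], "onBudget": [True]},
    index=[3],
)

# Code the two true/false fields as 1 (true) and 0 (false)
jobs = jobs.astype({"onSchedule": int, "onBudget": int})

print(jobs)
```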

Our Python model using the Random Forest algorithm will attempt to analyze this data, determine the most important factor in awarding contracts, and give us a tool for predicting the success of future bids.

Code for this assignment came from an article entitled Data to Fish Example of Random Forest in Python.³

I modified it to use in our kitchen remodel assignment.

Load Python and Jupyter notebook for this problem. Open a new file.

Use the Copy Text button to put the above Python code on the clipboard.

Paste it into your Jupyter Notebook.

Save it.

Copy the information from the spreadsheet just below the code you just copied.

Use the Copy Text button to put the above Python code on the clipboard.

Paste it into your Jupyter Notebook.

Save it.

In Jupyter Notebook, click Insert, then Insert Cell Below, to get a new cell for the next piece of code.

Use the Copy Text button to put the above Python code on the clipboard.

Paste it into your Jupyter Notebook.

Save it.

Day 3

Now run your code, either cell by cell or by choosing Kernel, then Restart & Run All.

You should get the entire data frame printed out on your screen.

Use the Copy Text button to put the above Python code on the clipboard.

Paste it into your Jupyter Notebook in a new cell.

Save it.

This code is where the training and test sets for X and y are created. The test set represents 25% of all cases, and those cases are chosen at random.

Run this cell just to check the syntax.
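The 25% split can be sketched with scikit-learn's train_test_split. The data frame below is a placeholder: it has 40 rows to match the size of the bid spreadsheet, but its values are made up, and the column name Awarded is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data: 40 rows standing in for the 40 bids (values are not real)
df = pd.DataFrame({
    "Cost": range(40),
    "Awarded": [i % 2 for i in range(40)],
})

X = df[["Cost"]]   # features
y = df["Awarded"]  # target

# Hold out 25% of the rows, chosen at random, for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print("Training rows:", len(X_train))
print("Test rows:", len(X_test))
```

The random_state argument only fixes which rows are picked so that the split is the same on every run.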

Use the Copy Text button to put the above Python code on the clipboard.

Paste it into your Jupyter Notebook in a new cell.

Save it.

When you run all the cells, here is the output you will get. Let's examine these predictions and see what they mean.

Index	Cost	Weeks	onSchedule	onBudget
22	6500	6	1	1
20	10000	9	1	1
25	28500	4	1	1
4	35600	4	0	1
10	4100	4	1	1
15	78200	6	1	1
28	67250	6	1	1
11	5500	9	0	0
18	53500	4	1	1
29	87500	8	1	0

[1 1 1 1 1 1 1 1 1 0]

Ten items from the spreadsheet were chosen at random.
Ten is 25% of the 40 items on the contract bid spreadsheet.
These ten are shown with their cost, weeks, on schedule and on budget values.
The model's predictions, which feed the confusion matrix, are shown below the ten items as an array in the same order: 1 means a contract is predicted and 0 means no contract is predicted.
Compare each prediction with the actual data in the printout of the entire data frame to see which bids were really awarded.
For example, items 22, 20, 25, 4, 10, 15, 28 and 11 in the actual data were awarded contracts, and the model predicted a contract for each of them.
Item 18 in the actual data did not receive a contract.
The model predicted that item 18 would receive a contract. This is called a false positive.
Item 29 in the actual data did receive a contract, but the prediction was that it would not. This is a false negative.

Day 4

Now let's see a graphical representation of this confusion matrix.

Use the Copy Text button to put the above Python code on the clipboard.

Paste it into your Jupyter Notebook in a new cell.

Save it.

In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).²

0,0 is the first row first column: True Negative = actual and predictions were both false
0,1 is the first row second column: False Positive = actual was false and prediction was true
1,0 is the second row first column: False Negative = actual was true and prediction was false
1,1 is the second row second column: True Positive = actual and predictions were both true.
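The four positions can be checked against the ten test items from Day 3. This sketch uses scikit-learn's confusion_matrix, which puts actual values in the rows and predictions in the columns; the two lists are written out by hand from the lesson's results, and the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Test items in order: 22, 20, 25, 4, 10, 15, 28, 11, 18, 29
y_actual = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]     # 1 = contract awarded
y_predicted = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # the model's predictions

matrix = confusion_matrix(y_actual, y_predicted)

# Flatten row by row: [0,0], [0,1], [1,0], [1,1]
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print("True Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)
```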

Now let's look at the actual and predictions in a graphical form.

The bottom right box, the pink one, contains the number of True Positives, 8. That means 8 bids were actually awarded and were also predicted to be awarded.

The top left box, the black one, contains the number of True Negatives, 0. A true negative means that both the actual and the prediction were false.

The top right box, the orchid-colored one, contains the number of False Positives, 1, which means the actual was false but the prediction was true.

The bottom left box, the purple one, contains the number of False Negatives, 1, which means the actual was true but the prediction was false.

True Positives: index numbers 22, 20, 25, 4, 10, 15, 28, 11
True Negatives: none
False Positives: index number 18
False Negatives: index number 29
Reading the confusion matrix from top left to bottom right: actual and predictions match.
Reading from top right to bottom left: actual and predictions do not match.

To find the accuracy of our model and have a place to enter a bid, add these lines in a new cell in your notebook.

Use the Copy Text button to put the above Python code on the clipboard.

Paste it into your Jupyter Notebook in a new cell.

Save it.

If we do the arithmetic, our model was 80% accurate.
There were eight correct predictions and 2 incorrect predictions:
Eight divided by ten equals 80%
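The same arithmetic can be done with scikit-learn's accuracy_score. In this sketch the actual and predicted values for the ten test items are typed in by hand from the lesson's results, and the variable names are illustrative.

```python
from sklearn.metrics import accuracy_score

# Test items in order: 22, 20, 25, 4, 10, 15, 28, 11, 18, 29
y_actual = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
y_predicted = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# Accuracy is the number of correct predictions divided by the number of predictions
correct = sum(a == p for a, p in zip(y_actual, y_predicted))
accuracy = accuracy_score(y_actual, y_predicted)

print("Correct predictions:", correct, "out of", len(y_actual))
print(f"Accuracy: {accuracy:.0%}")
```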

Now let's look at the prediction capabilities of our model.

How can we use our model to fine-tune our bids and get more bids accepted?

Find the line that states "prediction = clf.predict([[87500,8,1,0]])". Our model rejected this bid when in fact it was accepted.

Enter each of the following sets of numbers in this line in turn. After each change, save, run the cell, and record the result on your worksheet.

87500,6,1,0 reduced weeks by 2, on schedule and on budget unchanged
87500,8,1,1 weeks unchanged, on schedule unchanged, changed on budget to true
87500,7,1,0 reduced weeks by 1, on schedule true and on budget false
87500,7,1,1 reduced weeks by 1, both on schedule and on budget true
87500,6,0,0 reduced weeks by 2, did not commit to being on schedule or on budget
87000,8,1,0 reduced cost by $500, weeks unchanged, on schedule true and on budget false
87000,8,1,1 reduced cost by $500, weeks unchanged, both on schedule and on budget true
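Instead of editing the line seven times, the bids can also be tried in one loop. The sketch below is an illustration only: it trains on just the ten test rows printed on Day 3 (with their actual outcomes), because the full 40-row spreadsheet is not reproduced here, so its predictions will not necessarily match those of your own model. The list name awarded and the fixed random_state are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

columns = ["Cost", "Weeks", "onSchedule", "onBudget"]

# The ten rows printed on Day 3 (items 22, 20, 25, 4, 10, 15, 28, 11, 18, 29)
rows = [
    [6500, 6, 1, 1], [10000, 9, 1, 1], [28500, 4, 1, 1], [35600, 4, 0, 1],
    [4100, 4, 1, 1], [78200, 6, 1, 1], [67250, 6, 1, 1], [5500, 9, 0, 0],
    [53500, 4, 1, 1], [87500, 8, 1, 0],
]
awarded = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]  # actual outcomes, 1 = contract awarded

X = pd.DataFrame(rows, columns=columns)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, awarded)

# The seven bid variations from the worksheet
bids = [
    [87500, 6, 1, 0], [87500, 8, 1, 1], [87500, 7, 1, 0], [87500, 7, 1, 1],
    [87500, 6, 0, 0], [87000, 8, 1, 0], [87000, 8, 1, 1],
]
candidates = pd.DataFrame(bids, columns=columns)
predictions = clf.predict(candidates)

for bid, outcome in zip(bids, predictions):
    print(bid, "->", "awarded" if outcome == 1 else "not awarded")
```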

What did you learn from entering the numbers above? Record your answer on the worksheet.

Now let's add some code so we can see which feature is most important in deciding whether a bid is accepted or rejected.

Use the Copy Text button to put the above Python code on the clipboard.

Paste it into your Jupyter Notebook in a new cell.

Save it.

The red bar represents the cost of the remodel.
The green bar represents the job completion in weeks.
The tan bar shows the on schedule variable.
The blue bar reflects the on budget variable's importance.
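The numbers behind such a bar chart come from the fitted model's feature_importances_ attribute. The sketch below prints them instead of plotting them. It is trained only on the ten test rows printed on Day 3, since the full spreadsheet is not reproduced here, so the values will differ from those in your own graph. The list name awarded and the fixed random_state are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

columns = ["Cost", "Weeks", "onSchedule", "onBudget"]

# The ten rows printed on Day 3 and their actual outcomes (1 = contract awarded)
rows = [
    [6500, 6, 1, 1], [10000, 9, 1, 1], [28500, 4, 1, 1], [35600, 4, 0, 1],
    [4100, 4, 1, 1], [78200, 6, 1, 1], [67250, 6, 1, 1], [5500, 9, 0, 0],
    [53500, 4, 1, 1], [87500, 8, 1, 0],
]
awarded = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

X = pd.DataFrame(rows, columns=columns)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, awarded)

# One importance score per feature; the scores add up to 1
importances = pd.Series(clf.feature_importances_, index=columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

A higher score means the trees relied on that feature more often, or more effectively, when separating awarded bids from rejected ones.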

What does the graph tell you? List values for each on your worksheet.

What conclusions can you draw?

Day 5

Prediction Model for Human Resources Department

Start a new Python project using Jupyter.

This lesson creates a prediction model using Random Forest.

Twice a year, the HR department conducts employee evaluations.

The criteria are listed below, and scores are given from 1 to 5. Five is the best score and 1 is the worst score.

Attendance
Punctuality
Ability to meet deadlines
Communication skills
Honesty
Problem solving ability
Expertise in their field
Creativity
Delegation of tasks
Mentoring new employees

There are 10 employees, and the dataset represents their past evaluations. Each employee's scores on the ten criteria are listed in the dataset.

For example, employee 1 scored 1 on attendance, punctuality (OnTime) and meeting deadlines; 2 on communication, honesty, problem solving, expertise, creativity and delegation; and 1 on mentoring. Employee 1 did not receive a raise during this past evaluation.

Employee 2 scored 3 on attendance, punctuality and meeting deadlines, and 2 on communication, honesty, problem solving, expertise, creativity, delegation and mentoring. Employee 2 did not get a raise either.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# The next three imports are not needed for this project and stay commented out.
#import seaborn as sn
#from sklearn import metrics
#import matplotlib.pyplot as plt

# Past evaluations: one list per criterion, one position per employee (1 = worst, 5 = best).
# Raise: 1 = the employee received a raise, 0 = no raise.
applicant = {'Attendance': [1,3,4,5,2,5,4,5,1,5],
                 'OnTime': [1,3,4,5,3,5,4,5,2,5],
              'Deadlines': [1,3,4,5,2,5,4,3,1,4],
          'Communication': [2,2,4,5,2,5,4,5,1,5],
                'Honesty': [2,2,4,5,2,5,4,5,1,4],
         'ProblemSolving': [2,2,4,5,3,5,4,5,2,5],
              'Expertise': [2,2,4,5,3,5,4,5,1,5],
             'Creativity': [2,2,5,5,2,5,5,4,1,5],
             'Delegation': [2,2,4,5,3,5,4,5,1,4],
              'Mentoring': [1,2,4,5,2,5,3,5,1,5],
                  'Raise': [0,0,1,1,0,1,1,1,0,1]}

df = pd.DataFrame(applicant, columns=['Attendance', 'OnTime', 'Deadlines','Communication','Honesty','ProblemSolving','Expertise','Creativity','Delegation','Mentoring', 'Raise'])

# Features (the ten criteria) and target (Raise)
X = df[['Attendance', 'OnTime', 'Deadlines','Communication','Honesty','ProblemSolving','Expertise','Creativity','Delegation','Mentoring']]
y = df['Raise']

# Hold out 25% of the employees, chosen at random, for testing
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

# Train a forest of 100 trees, then predict the held-out employees
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Predict the outcome of one new evaluation (scores in the same order as the columns of X)
prediction = clf.predict([[3,3,2,4,4,3,5,2,1,2]])
print('Predicted Result: ', prediction)

Use the Copy Text button to put the above Python code on the clipboard.

Paste it into your Jupyter Notebook in a new cell.

Save it.

The code above is all of the code for the project. Break it up into separate cells, one for each piece of code, just as you did before.

Save it and run it.

According to the model, should this evaluation earn the employee a raise?

Try each set of numbers listed below in this line in your Python program: "prediction = clf.predict([[3,3,2,4,4,3,5,2,1,2]])"

5,5,5,5,5,5,4,3,4,4
3,3,4,5,5,3,3,4,5,3
2,2,3,4,3,4,3,5,1,3
4,5,5,4,3,4,4,5,3,5
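The four evaluations can also be tried in a single run. This sketch rebuilds the lesson's dataset so that it is self-contained. Passing the evaluations as a data frame with the same column names, and fixing random_state on the classifier so that repeated runs agree, are choices made for this illustration. Your own results may differ, because the lesson's classifier is not seeded.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

features = ['Attendance', 'OnTime', 'Deadlines', 'Communication', 'Honesty',
            'ProblemSolving', 'Expertise', 'Creativity', 'Delegation', 'Mentoring']

# Past evaluations copied from the lesson's dataset
df = pd.DataFrame({
    'Attendance':     [1, 3, 4, 5, 2, 5, 4, 5, 1, 5],
    'OnTime':         [1, 3, 4, 5, 3, 5, 4, 5, 2, 5],
    'Deadlines':      [1, 3, 4, 5, 2, 5, 4, 3, 1, 4],
    'Communication':  [2, 2, 4, 5, 2, 5, 4, 5, 1, 5],
    'Honesty':        [2, 2, 4, 5, 2, 5, 4, 5, 1, 4],
    'ProblemSolving': [2, 2, 4, 5, 3, 5, 4, 5, 2, 5],
    'Expertise':      [2, 2, 4, 5, 3, 5, 4, 5, 1, 5],
    'Creativity':     [2, 2, 5, 5, 2, 5, 5, 4, 1, 5],
    'Delegation':     [2, 2, 4, 5, 3, 5, 4, 5, 1, 4],
    'Mentoring':      [1, 2, 4, 5, 2, 5, 3, 5, 1, 5],
    'Raise':          [0, 0, 1, 1, 0, 1, 1, 1, 0, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['Raise'], test_size=0.25, random_state=0
)

# random_state on the classifier is an added assumption for repeatable output
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# The four evaluations from the worksheet, in the same column order as the features
evaluations = pd.DataFrame([
    [5, 5, 5, 5, 5, 5, 4, 3, 4, 4],
    [3, 3, 4, 5, 5, 3, 3, 4, 5, 3],
    [2, 2, 3, 4, 3, 4, 3, 5, 1, 3],
    [4, 5, 5, 4, 3, 4, 4, 5, 3, 5],
], columns=features)

predictions = clf.predict(evaluations)

for scores, outcome in zip(evaluations.values.tolist(), predictions):
    print(scores, "->", "raise" if outcome == 1 else "no raise")
```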

Record your answers on the worksheet.