Day 9: Multiple Linear Regression
The linear regression we just studied has only two variables: one independent and one dependent. This is helpful for learning purposes, but real-world applications usually have multiple independent variables, each with a varying effect on the dependent variable.
The steps for working with multiple linear regression are similar to what we just learned. The evaluation process is different, however: we need to see which independent variable has the most effect on the dependent variable and how the independent variables affect one another.
We are going to perform multiple linear regression analysis for a food truck company called Inside Slider. See the Starting your own business link on our web page.
Starting your own business
We are a new business that has only been operating for about one month. We are going to analyze our past sales data to see how to increase our revenue. The variables are the location of our truck, the time of day of the sale, the number of meals purchased, and the revenue.
We want to see what effect each independent variable (location, time of day, and number of meals) has on revenue, the dependent variable.
Our break-even analysis shows that we need to make $26,561 per month just to break even. Rounded up, that is about $900 a day in revenue, or $27,000 per month.
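Here is a quick check of that arithmetic in Python. It is only a sketch: the 30-day month is our assumption, and the variable names are made up for this example.

```python
# Break-even figure from the text; the 30-day month is an assumption.
monthly_break_even = 26_561
days_per_month = 30

daily_break_even = monthly_break_even / days_per_month
print(f"Exact daily break-even: ${daily_break_even:,.2f}")

# The text rounds the daily figure up to a simpler target.
rounded_daily_target = 900
monthly_at_target = rounded_daily_target * days_per_month
print(f"Monthly revenue at ${rounded_daily_target} a day: ${monthly_at_target:,}")
```

The exact daily figure comes out a little under $900, which is why $900 a day works as a comfortable round target.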
We are going to use Python and multiple linear regression to solve our problem.
First we need to get the sales data for the month for Inside Slider. It is already in comma-separated format, to make it easy to work with. Copy the file to the clipboard and save it in your working directory with a .csv extension.
Let's see what this data represents. It was compiled from the monthly sales slips for Inside Slider. The first number indicates the location of the food truck.
There are seven locations, one for each day of the week.
- Location 1 is Stearns Wharf on Sunday
- Location 2 is East Beach on Monday
- Location 3 is West Beach on Tuesday
- Location 4 is the Breakwater on Wednesday
- Location 5 is the Mesa on Thursday
- Location 6 is Leadbetter Beach on Friday
- Location 7 is Carpinteria Beach on Saturday.
The next number refers to the number of meals or guests on a given ticket.
The third number represents the time of day in military (24-hour) time. The food truck operates from 11:30 am until 3:00 pm, or 1130 to 1500 in military time.
The last number represents the total of the bill.
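To make the layout concrete, here is a small sketch that reads a few rows in the order just described: location, number of meals, time of day, and bill total. The three rows and the field names are made up for illustration; they are not taken from the real Inside Slider file.

```python
import csv
import io

# Made-up rows in the described order: location, meals, time, total.
sample_csv = """1,2,1215,24.14
4,1,1300,12.07
7,5,1430,60.35
"""

# Day of the week for each location number, from the list above.
DAY_FOR_LOCATION = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
                    5: "Thursday", 6: "Friday", 7: "Saturday"}

tickets = []
for location, meals, time_of_day, total in csv.reader(io.StringIO(sample_csv)):
    tickets.append({
        "location": int(location),
        "meals": int(meals),
        "time": int(time_of_day),   # military time, e.g. 1215
        "revenue": float(total),    # bill total in dollars
    })

for ticket in tickets:
    day = DAY_FOR_LOCATION[ticket["location"]]
    print(f"{day}: {ticket['meals']} meal(s) at {ticket['time']} for ${ticket['revenue']:.2f}")
```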
We are trying to determine how to increase our revenue. We are looking at location, time of day and number of guests as the independent variables. We want to see which of these variables has the most impact on increasing our revenue, the dependent variable.
We plan to use Python code to create a multiple linear regression model to find our answer.
The code is very similar to the previous example on linear regression.
Put the code on the clipboard. Save the file in your working directory. Make sure that you use a .py extension.
Run the code using the F5 key, or run it line by line using the F9 key.
The first output produced is the first five lines of the dataset.
The next output is much more informative. It details:
The count. This is the number of individual sales slips in the file.
The next row of numbers represents the mean of the location, time, number of meals and revenue.
Remember that the mean is just a simple average. All the numbers are added up and divided by the total number of items.
The mean of the location is not very helpful.
The mean of the number of meals is helpful. It shows 2.319, so an average order is for about 2 customers.
The mean or average time of day that lunches are sold is 1307 or seven minutes after 1 pm.
The average revenue is $28.47.
The standard deviation is a measure of how spread out the numbers are: roughly the typical distance from the mean.
The standard deviation for the number of meals is 1.19 and the mean is 2.31.
The standard deviation for the revenue is 14.49 and the mean is 28.47.
The min row tells us that the minimum location was 1, the minimum number of meals ordered was 1, and the smallest order was $12.07.
The max row tells us that the max location was 7, the max number of meals on a ticket was 5, the latest time that meals were served was 1545, and the largest ticket total was $60.35.
The numbers on the 50% line are the medians, the values in the middle: location 3, number of meals 2, time 1300, and revenue $24.82.
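The same kinds of summary numbers can be computed by hand. The sketch below uses Python's built-in statistics module on five made-up ticket totals, not the real sales data. It uses the sample standard deviation, which is what a pandas describe() summary reports; that the course code produces its table this way is our assumption.

```python
import statistics

# Hypothetical ticket totals in dollars (not the real Inside Slider data).
revenue = [12.07, 24.14, 24.14, 36.21, 60.35]

summary = {
    "count": len(revenue),
    "mean": statistics.mean(revenue),    # add everything up, divide by the count
    "std": statistics.stdev(revenue),    # sample standard deviation
    "min": min(revenue),
    "50%": statistics.median(revenue),   # the middle value
    "max": max(revenue),
}

for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```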
The coefficients are the numbers we are really looking for. They tell us which independent variable has the most effect on revenue, our dependent variable.
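Each regression coefficient estimates how much revenue changes when that independent variable increases by one unit while the other variables are held constant. Unlike a correlation coefficient, which ranges between -1 and +1, a regression coefficient is measured in the units of the data and can take any value. A coefficient near zero means the variable has almost no linear effect on revenue.
The results of our Python program show that time of day and location have almost no effect on our revenue, but that number of meals does, with a coefficient of 12.145. In other words, each additional meal on a ticket adds roughly $12 in revenue.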
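To see how a result like that can come about, here is a minimal pure-Python sketch of a multiple linear regression fit. It is not the course script. The data is randomly generated under our own assumption that each meal brings in about $12 and that location and time do not matter, and the function name fit_linear_regression is made up for this example.

```python
import random

def fit_linear_regression(rows, targets):
    """Least-squares fit; returns [intercept, coefficient_1, coefficient_2, ...]."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept column
    n = len(design[0])
    # Augmented normal equations: [X^T X | X^T y]
    system = [
        [sum(x[i] * x[j] for x in design) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(design, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(system[r][col]))
        system[col], system[pivot] = system[pivot], system[col]
        for r in range(n):
            if r != col:
                factor = system[r][col] / system[col][col]
                system[r] = [a - factor * b for a, b in zip(system[r], system[col])]
    return [system[i][n] / system[i][i] for i in range(n)]

# Synthetic tickets: revenue depends only on the number of meals, plus noise.
rng = random.Random(0)
rows, revenue = [], []
for _ in range(200):
    location = rng.randint(1, 7)
    meals = rng.randint(1, 5)
    time_of_day = rng.randint(11, 14) * 100 + rng.randint(0, 59)  # military time
    rows.append((location, meals, time_of_day))
    revenue.append(12.0 * meals + rng.gauss(0, 1.0))

intercept, b_location, b_meals, b_time = fit_linear_regression(rows, revenue)
print(f"location: {b_location:.3f}  meals: {b_meals:.3f}  time: {b_time:.4f}")
```

On data built this way, the meals coefficient lands close to 12 while the location and time coefficients stay close to zero, which is the same pattern our real results show.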
The next set of numbers show us what Python has predicted based on our actual sales.
Look at each number: actual and predicted. They are all pretty close. Remember that 156 items were randomly selected, about 20% of 776. The first column shows those sales slip numbers.
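The 156 figure can be reproduced with a short sketch. Twenty percent of 776 is 155.2, and rounding up gives 156; that the split rounds up is our assumption, chosen because it matches the number in the output. The seed and variable names are also just for illustration.

```python
import math
import random

total_slips = 776
test_fraction = 0.20

# 776 * 0.20 = 155.2, rounded up to a whole number of slips.
test_size = math.ceil(total_slips * test_fraction)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
slip_numbers = list(range(total_slips))
test_slips = set(rng.sample(slip_numbers, test_size))
train_slips = [slip for slip in slip_numbers if slip not in test_slips]

print(f"test slips: {len(test_slips)}, training slips: {len(train_slips)}")
```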
The RMSE, root mean squared error, measures how far the predictions are from the actual values on average, in the same units as revenue. The closer to zero the better. As you can see, we did very well.
Now that we have our results, what changes to our business model could we consider? Here is a list that I came up with. You should make your own list.
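Here is a small sketch of how RMSE is calculated. The actual and predicted values are made up so that every prediction is off by exactly one dollar; they are not our real results.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square the errors, average them, take the square root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ticket totals: every prediction misses by one dollar.
actual = [24.14, 12.07, 36.21, 60.35]
predicted = [25.14, 11.07, 37.21, 59.35]

error = rmse(actual, predicted)
print(f"RMSE: {error:.2f} dollars")
```

Because the result is in dollars, it is easiest to judge against the average ticket: an RMSE of a dollar or two on an average bill of about $28 would be a good fit.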
Increase revenue by getting not just one customer to come to the food truck, but two or more. Start a promotion at nearby businesses to create an incentive for more than one employee to have lunch at the food truck.
Create a promotion that gives a discount to a customer who buys one meal for themselves and takes another one or two meals back to the office or job site.
Add a catering component to your business model, which would result in multiple meals and thus increase revenue.
Look at the raw data in depth. Are there locations where the volume is low? Check out the Breakwater, location 4. You only made seven sales. You know that there are many restaurants on the Breakwater; maybe that is why your sales are so poor.
Which is the location that produces the most revenue? Consider going there on Wednesday, the day that you go to the Breakwater.
A buy-one-get-one-free lunch introduction might be an option.
Day 10: K-Nearest Neighbors Algorithm: gift basket company
Now we will examine data collected from an online survey of customers and use Python and the KNN algorithm to determine our target market and market segments.
The algorithm is a good one and is used in fields such as economics, forecasting and genetics. It is a supervised algorithm, since it is given a dataset made up of labeled training observations. Our goal is to take the independent variables and use them to determine the generation of those individuals.
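The idea behind KNN fits in a few lines: find the k training records closest to a new record and let them vote on its class. The sketch below is not Samuel Burns's code. The two features (age and hours online per day), the eight training records and the function name knn_predict are all made up for illustration.

```python
import math
from collections import Counter

def knn_predict(training_rows, training_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training rows."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(training_rows, training_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical survey records: (age, hours online per day).
training_rows = [(20, 6), (22, 5), (30, 4), (33, 4), (45, 2), (50, 2), (60, 1), (65, 1)]
training_labels = ["Generation Z", "Generation Z", "Generation Y", "Generation Y",
                   "Generation X", "Generation X", "Baby Boomer", "Baby Boomer"]

print(knn_predict(training_rows, training_labels, (48, 2)))
print(knn_predict(training_rows, training_labels, (21, 6)))
```

In practice the features are usually scaled first so that a wide-ranging column such as age does not dominate the distance.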
The generations are as follows:
Generation Z individuals were born between 1996 and 2014, making them 5 to 23 years old as of 2019.
Generation Y, also called Millennials, were born between 1980 and 1995, making them 24 to 39 years old as of 2019.
Generation X folks were born between 1961 and 1981 and are 38 to 58 years old as of 2019.
The baby boomers were born between 1946 and 1964 making them 55 to 73 years old based on the year 2019.
Traditionalists were born between 1922 and 1945 making them between the ages of 74 and 97 years old based on the year 2019.
Python code
Put the Python code on the clipboard and paste it into the Spyder editor. The code came from Samuel Burns in his Python Machine Learning textbook. I created the dataset and changed the column headings to make it work for this example. Make sure to save the file in your working directory with a .py extension.
Run the code either line by line using the F9 key or all at once using the F5 key. Make sure that you have Internet access, since the completed file is located on our web site.
Let's analyze the results. First, the first five lines of the dataset are printed, followed by the shape, which gives the number of rows and columns in the dataset: 100 rows and 8 columns.
The sample of 20 is 20% of the 100 rows of the dataset. Remember that each row represents one customer's responses to the questionnaire.
The confusion matrix is displayed next.
The confusion matrix is read along the diagonal from the top left-hand corner to the bottom right-hand corner; the diagonal holds the correctly classified records.
Seven Baby Boomers were correctly identified in the sample.
Five Generation X's were correctly identified in our sample.
Three Generation Y's were correctly identified.
Four Generation Z's were correctly identified.
One Traditionalist was misclassified in our test data.
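Here is a sketch of how a confusion matrix like this is built from the actual and predicted labels. The counts match the results described above, but the output does not tell us here which generation the misclassified Traditionalist was assigned to, so placing that record under Baby Boomer is purely our assumption.

```python
from collections import Counter

generations = ["Baby Boomer", "Generation X", "Generation Y",
               "Generation Z", "Traditionalist"]

# The 20 test records, grouped by their true generation.
actual = (["Baby Boomer"] * 7 + ["Generation X"] * 5 + ["Generation Y"] * 3
          + ["Generation Z"] * 4 + ["Traditionalist"])
# Assumption: the misclassified Traditionalist was predicted as a Baby Boomer.
predicted = actual[:-1] + ["Baby Boomer"]

# Rows are the actual generation, columns are the predicted generation.
pair_counts = Counter(zip(actual, predicted))
matrix = [[pair_counts[(a, p)] for p in generations] for a in generations]
diagonal = [matrix[i][i] for i in range(len(generations))]

for name, row in zip(generations, matrix):
    print(f"{name:15s} {row}")
print("correct on the diagonal:", sum(diagonal), "of", len(actual))
```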
From the predictions, it is apparent that Baby Boomers and Generation X are our main target market segments and we should focus our marketing efforts on those two generations.
The results show that the KNN algorithm did a pretty good job of classifying the 20 records. The weighted averages were all in the 90 percent range.
The F1 score combines precision and recall into one number (their harmonic mean). Precision is the share of records predicted to be in a class that really belong to it; recall is the share of records in a class that the model found.
These numbers show how well the model, which learned from the training data, classified the test data. The sample of test data is 20 respondents.
Looking at the support column, you can see 7 or 35% are Baby Boomers. Five or 25% belong to Generation X. Three or 15% belong to Generation Y. Four or 20% belong to Generation Z and one belongs to the Traditionalist Generation - 5%.
Based on the numbers from the classification report, the model did a good job.
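The numbers in a classification report can be worked out by hand from the counts. The sketch below does that for our 20 test records. It assumes, for illustration only, that the misclassified Traditionalist was predicted as a Baby Boomer, so the values it prints are not copied from the real report.

```python
# How many test records truly belong to each generation (the support column).
support = {"Baby Boomer": 7, "Generation X": 5, "Generation Y": 3,
           "Generation Z": 4, "Traditionalist": 1}
# How many of those were classified correctly.
correct = {"Baby Boomer": 7, "Generation X": 5, "Generation Y": 3,
           "Generation Z": 4, "Traditionalist": 0}
# Assumption: the misclassified Traditionalist was predicted as a Baby Boomer.
predicted_total = {"Baby Boomer": 8, "Generation X": 5, "Generation Y": 3,
                   "Generation Z": 4, "Traditionalist": 0}

def safe_divide(numerator, denominator):
    """Return 0.0 instead of failing when a class was never predicted."""
    return numerator / denominator if denominator else 0.0

report = {}
for name in support:
    precision = safe_divide(correct[name], predicted_total[name])
    recall = safe_divide(correct[name], support[name])
    f1 = safe_divide(2 * precision * recall, precision + recall)
    report[name] = {"precision": precision, "recall": recall, "f1": f1}

# Weighted average: each class counts in proportion to its support.
total = sum(support.values())
weighted = {
    metric: sum(report[name][metric] * support[name] for name in support) / total
    for metric in ("precision", "recall", "f1")
}

for metric, value in weighted.items():
    print(f"weighted {metric}: {value:.3f}")
```

Under that assumption all three weighted averages come out above 0.90, in line with the 90 percent range seen in our results.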
Run the code multiple times and the results will change somewhat, since different test data is obtained each time the program is executed.
The information obtained gives the marketing professional information about the segments of their market.
Specialized campaigns can be directed at each segment.
It makes little sense to direct any campaign at the Traditionalists. The only one in the test sample was misclassified anyway.
Run the Python program numerous times and see how the results vary somewhat.
Here is some information I found about the characteristics of Generation X.
- They were born between 1961 and 1981 making them 38 to 58 years old as of 2019.
- They represent 22.9% of the United States population, which is approximately comparable to Generation Z and Millennials.
- They are called the middle child.
- Many were latch key kids
- Many are divorced.
- They have or had career driven parents.
- They are into labels and brand names.
- They have large amounts of credit card debt.
- During their school years they grew up with no computers in their classrooms or at home.
- They look at the world and ask "What's in it for me?"
- They make about seven career changes.
- They are called the MTV generation.
- They like punk and heavy metal music.
- They are late to marry and quick to divorce.
Here is some information about the Baby Boomer generation.
- They were born right after World War II ended, 1946-1964.
- They prefer to communicate using the telephone.
- They represent a significant spike in births as the soldiers returned from war.
- They are the 60's and 70's generations.
- There are two segments: the leading edge ones, 1946 to 1955, and the late bloomers, 1956 to 1964.
- They are the first generation with two-income households.
- Divorce rates were higher than all previous generations.
- They are the first generation to get television sets.
Here is how you might market products to Generation X. Remember in our simulation, our company sells flowers and gift baskets.
Use the Internet. Ninety-five percent of Generation Xers have a Facebook account.
They are loyal to brands, so reward returning customers.
They like videos, so create a video about your products and put it on your web page, or feature it in your company's Facebook account, or attach it to an email.
Personalize the product. Create custom orders of baskets and flowers based on their preferences.
Since they are family oriented, show images of a family enjoying a gift basket together around a special event.
Day 11: Neural Networks
Neural networks are a form of machine learning that tries to imitate the way neural networks work in the human body.
According to the Massachusetts Institute of Technology, understanding how the brain recognizes objects is a central challenge for understanding human vision, and for designing artificial vision systems.
No computer system comes close to human vision, but a new study by neuroscientists suggests that the brain learns to solve the problem of object recognition through its vast experience in the natural world.
Humans, based on their past experiences, can recognize the identifying patterns of an object such as a car, a dog, or a cat.
Artificial intelligence neural networks try to imitate human learning, working in a way similar to the human nervous system.
We will use an expanded dataset on target markets to compare neural networks to the KNN algorithm. The original survey only included the first eight questions. More questions were added to the dataset and questionnaire for this exercise.
Click on this link to see the questionnaire that the dataset is based on. Customer Survey
The additional questions are looking into customer buying habits. The first eight are demographic questions.
By looking at the questionnaire, I assigned the Class for each row, i.e., the generation, based on all of the factors, not just age.
The data is not real; it is made up for educational purposes.
Iterative learning process
A key feature of neural networks is an iterative learning process in which records (rows) are presented to the network one at a time, and the weights associated with the input values are adjusted each time. After all cases are presented, the process is often repeated. During this learning phase, the network trains by adjusting the weights to predict the correct class label of input samples.
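The loop described above can be sketched with a single artificial neuron, often called a perceptron. This is far simpler than a full neural network and is not the course code. The four records, the two yes/no inputs and the learning rate are all made up for illustration.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs is above zero, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

# Hypothetical records: (shops online, loyal to brands) -> 1 if in the target class.
records = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights = [0.0, 0.0]
bias = 0.0
learning_rate = 0.5

# Present the records one at a time, adjust the weights after each mistake,
# and repeat the whole pass until a pass produces no mistakes.
for epoch in range(10):
    mistakes = 0
    for inputs, target in records:
        error = target - predict(weights, bias, inputs)
        if error != 0:
            mistakes += 1
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    if mistakes == 0:
        break

predictions = [predict(weights, bias, inputs) for inputs, _ in records]
print("weights:", weights, "bias:", bias, "predictions:", predictions)
```

A real neural network stacks many of these units in layers and uses a smoother update rule, but the rhythm is the same: present a record, compare the output to the correct class, nudge the weights, and repeat.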
We are trying to train the model on the training portion of the dataset so that it can recognize the same characteristics in the test data. Individual items in the dataset include age, income, occupation and education, as well as technical competence, multitasking ability, work-life balance, retirement age, team orientation, online purchases and brand loyalty.
This information was gathered through a survey of our customers. The responses were entered into an Excel spreadsheet using numbers to represent them, the numbers were formatted to two decimal places, the headings were removed after the data was entered, and the file was saved in csv format and uploaded to our web site in a file called GenXand GenY.csv
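The same preparation steps can also be done in Python instead of Excel. The sketch below formats made-up coded responses to two decimal places and writes them as comma-separated text with no heading row. It writes to an in-memory buffer rather than a real file, and the two response rows are hypothetical.

```python
import csv
import io

# Hypothetical coded survey responses, one list per customer.
responses = [
    [45, 3, 2, 1],
    [27, 2, 4, 3],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row in responses:
    # Two decimal places for every value; no heading row is written.
    writer.writerow(f"{value:.2f}" for value in row)

csv_text = buffer.getvalue()
print(csv_text)
```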