Linear Regression for Marketing Research



Day 9: Multiple Linear Regression

Simple linear regression, as we just studied, uses only two variables: one independent and one dependent. This is helpful for learning purposes, but most real-world applications involve several independent variables, each with a different effect on the dependent variable.

The steps for working with multiple linear regression are similar to what we just learned. The evaluation process is different, however: we need to see which variable has the most effect on the dependent variable and how the independent variables affect one another.

We are going to perform multiple linear regression analysis for a food truck company called Inside Slider. See the Starting your own business link on our web page.


Starting your own business

We are a new business that has only been operating for about one month. We are going to analyze our past sales data to see how to increase our revenue. The variables are the location of our truck, the time of day of the sale, the number of meals purchased, and revenue.

We want to see what effect each independent variable has on revenue, the dependent variable.

Our break-even analysis shows that we need to make $26,561 per month just to break even. That works out to about $900 a day in revenue, or roughly $27,000 per month.
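The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the 30-day month is an assumption, since the text does not say how many days the truck operates.

```python
# Break-even arithmetic from the text.
monthly_break_even = 26_561   # dollars per month, from the break-even analysis
days_per_month = 30           # assumption: the truck operates every day of a 30-day month

# Exact daily revenue needed to break even.
daily_needed = monthly_break_even / days_per_month
print(f"Exact daily break-even: ${daily_needed:,.2f}")

# The text rounds the daily goal up to $900.
daily_target = 900
monthly_at_target = daily_target * days_per_month
print(f"Monthly revenue at ${daily_target} a day: ${monthly_at_target:,}")
```

Rounding the daily goal up to $900 leaves a small cushion above the exact break-even amount.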

We are going to use Python and multiple linear regression to solve our problem.

First we need to get the sales data for the month for Inside Slider. It is already in comma-separated format to make it easy to work with. Copy the file to the clipboard and save it in your working directory with a .csv extension.

Let's see what this data represents. It was compiled from the monthly sales slips for Inside Slider. The first number indicates the location of the food truck.

There are seven locations, one for each day of the week.

  1. Location 1 is Stearns Wharf on Sunday
  2. Location 2 is East Beach on Monday
  3. Location 3 is West Beach on Tuesday
  4. Location 4 is the Breakwater on Wednesday
  5. Location 5 is the Mesa on Thursday
  6. Location 6 is Leadbetter Beach on Friday
  7. Location 7 is Carpinteria Beach on Saturday

The next number refers to the number of meals or guests on a given ticket.

The third number represents the time of day in military (24-hour) time. The food truck operates from 11:30 am until 3:00 pm, or 1130 to 1500 in military time.

The last number represents the total of the bill.
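To make the row layout concrete, here is a minimal sketch that reads rows in the order just described. The three sample rows, the field names and the helper functions are all made up for illustration; they are not taken from the real Inside Slider file.

```python
import csv
import io

# Three made-up tickets in the layout described above:
# location, number of meals, time of day (military), bill total.
sample_csv = """1,2,1145,18.50
3,1,1230,9.25
7,4,1410,37.00
"""

def parse_tickets(text):
    """Turn headerless CSV text into a list of dicts with numeric values."""
    tickets = []
    for location, meals, time, total in csv.reader(io.StringIO(text)):
        tickets.append({
            "location": int(location),
            "meals": int(meals),
            "time": int(time),
            "total": float(total),
        })
    return tickets

def to_minutes(military_time):
    """Convert a military time such as 1245 to minutes after midnight."""
    return (military_time // 100) * 60 + military_time % 100

def minutes_since_opening(military_time, opening=1130):
    """Minutes elapsed since the truck opened at 11:30."""
    return to_minutes(military_time) - to_minutes(opening)

tickets = parse_tickets(sample_csv)
for ticket in tickets:
    print(ticket, minutes_since_opening(ticket["time"]))
```

Converting military time to minutes is a design choice worth considering, because military time is not evenly spaced: the step from 1159 to 1200 is one minute, not 41.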

We are trying to determine how to increase our revenue. We are looking at location, time of day and number of guests as the independent variables. We want to see which of these variables has the most impact on increasing our revenue, the dependent variable.

We plan to use Python code to create a multiple linear regression model to find our answer.
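Before looking at the course code, here is a rough sketch of what a multiple linear regression does. It is not the course code. It uses NumPy's least-squares solver on made-up tickets, and the "true" coefficients exist only to generate the toy revenue figures. Location is treated as a plain number for simplicity.

```python
import numpy as np

# Made-up tickets: location code (1-7), meals per ticket, minutes after opening.
rng = np.random.default_rng(0)
n = 200
location = rng.integers(1, 8, size=n)
meals = rng.integers(1, 6, size=n)
minutes = rng.integers(0, 211, size=n)

# Assumed relationship used only to generate toy revenue, plus random noise.
revenue = (2.0 + 0.5 * location + 9.0 * meals + 0.01 * minutes
           + rng.normal(0, 1.0, size=n))

# Design matrix: a column of ones for the intercept, then one column per variable.
X = np.column_stack([np.ones(n), location, meals, minutes])

# Least-squares fit: finds the coefficients that minimize the squared error.
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)
for name, value in zip(["intercept", "location", "meals", "minutes"], coef):
    print(f"{name:>9}: {value:.3f}")

# R-squared: the share of the variation in revenue explained by the model.
predicted = X @ coef
ss_res = np.sum((revenue - predicted) ** 2)
ss_tot = np.sum((revenue - revenue.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.3f}")
```

Each coefficient estimates how much revenue changes when that variable increases by one unit while the others are held constant. Comparing them is how we judge which variable has the most effect.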


The code is very similar to the previous example on linear regression.

Put the code on the clipboard. Save the file in your working directory. Make sure that you use a .py extension.

Run the code using the F5 key or run it line by line using the F9 key.

The first output produced is the first five lines of the dataset.

The next output is much more informative. It details:

Now that we have our results, what changes to our business model could we consider? Here is a list that I came up with. You should make your own list.


Day 10: K-Nearest Neighbors Algorithm: gift basket company

Now we will examine data collected from an online survey of customers and use Python and the KNN algorithm to determine our target market and market segments.

The algorithm is a good one and is used in fields such as economics, forecasting and genetics. It is a supervised algorithm, since it is given a dataset made up of labeled training observations. Our goal is to take the independent variables and use them to determine the generation of those individuals.

The generations are as follows:
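The idea behind KNN can be sketched in a few lines of plain Python: find the k training rows closest to a new respondent and let them vote on the label. The two features (age and income in thousands) and all of the data below are hypothetical, and this is not the code from the textbook.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest neighbors."""
    # Pair each training row's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: (age, income in thousands of dollars).
train_rows = [
    (62, 70), (65, 80), (58, 75),   # Baby Boomers
    (45, 90), (48, 85), (50, 95),   # Generation X
    (30, 50), (28, 45), (33, 55),   # Generation Y
]
train_labels = (["Baby Boomer"] * 3 + ["Generation X"] * 3
                + ["Generation Y"] * 3)

print(knn_predict(train_rows, train_labels, (47, 88)))
print(knn_predict(train_rows, train_labels, (29, 48)))
```

In practice the features are usually rescaled first, so that a variable with large numbers does not dominate the distance calculation.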


Python code


Put the Python code on the clipboard and paste it into the Spyder editor. The code came from Samuel Burns in his Python Machine Learning textbook. I created the dataset and changed the column headings to make it work for this example. Make sure to save the file in your working directory with a .py extension.

Run the code either line by line using the F9 key or all at once using the F5 key. Make sure that you have Internet access, since the completed file is located on our web site.


Let's analyze the results. The first five lines of the dataset are printed out, followed by the shape, which gives the number of rows and columns in the dataset: 100 rows and 8 columns.

The test sample of 20 is 20% of the 100 rows of the dataset. Remember that each row represents one customer's responses to the questionnaire.
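The 80/20 division can be sketched as follows. The helper name split_rows and the stand-in customer list are assumptions for illustration; the course code may use a library function for this step instead.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=None):
    """Shuffle the rows and return (training rows, test rows)."""
    rng = random.Random(seed)      # seed=None gives a different split each run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 100 survey respondents.
customers = list(range(100))

train, test = split_rows(customers, test_fraction=0.2, seed=42)
print(len(train), "training rows,", len(test), "test rows")
```

Leaving the seed unset produces a different test sample on every run, which is why results can vary from one execution to the next.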

The confusion matrix is displayed next.

The confusion matrix is read along the diagonal, from the top left-hand corner to the bottom right-hand corner.

  • Seven Baby Boomers were correctly identified in our sample.

  • Five Generation X's were correctly identified in our sample.

  • Three Generation Y's were correctly identified.

  • Four Generation Z's were correctly identified.

  • One Traditionalist was misclassified in our test data.
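The counts above can be reproduced with a small hand-built confusion matrix. This is a sketch under two assumptions: rows are the actual generations and columns the predicted ones, and the one Traditionalist was predicted to be a Baby Boomer. The text does not say which label that respondent received.

```python
labels = ["Baby Boomer", "Generation X", "Generation Y",
          "Generation Z", "Traditionalist"]

# Actual generations of the 20 test respondents, per the counts in the text.
actual = (["Baby Boomer"] * 7 + ["Generation X"] * 5 + ["Generation Y"] * 3
          + ["Generation Z"] * 4 + ["Traditionalist"])

# Assumption: every prediction is right except the Traditionalist,
# who is (hypothetically) predicted to be a Baby Boomer.
predicted = actual[:-1] + ["Baby Boomer"]

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>14}: {row}")

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(actual)
print(f"{correct} of {len(actual)} correct, accuracy {accuracy:.2f}")
```

Any count that falls off the diagonal is a misclassification, which is why the matrix is read from corner to corner.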

The classification report is displayed next, and it summarizes the confusion matrix.

From the predictions, it is apparent that Baby Boomers and Generation X are our main target market segments, and we should focus our marketing efforts on those two generations.

The results show that the KNN algorithm did a pretty good job of classifying the 20 records. The weighted averages were in the 90 percent range.

The F1 score is a measure of a test's accuracy that combines precision and recall (it is their harmonic mean). Precision is the share of predictions for a class that were correct, and recall is the share of actual members of a class that the model found.

These numbers show how well the model, having learned from the training data, classified the test data. The sample of test data is 20 respondents.
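Here is a small sketch of how precision, recall and the F1 score are calculated for a single class. The counts in the example are illustrative only: they assume seven respondents correctly predicted as Baby Boomers, one respondent wrongly predicted as a Baby Boomer, and no Baby Boomers missed.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class from its raw counts."""
    predicted_total = true_pos + false_pos   # everything predicted as this class
    actual_total = true_pos + false_neg      # everything that really is this class
    precision = true_pos / predicted_total if predicted_total else 0.0
    recall = true_pos / actual_total if actual_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts (assumed, not read from the real report).
precision, recall, f1 = precision_recall_f1(true_pos=7, false_pos=1, false_neg=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A class with no correct predictions at all gets zeros across the board, which is what the guard clauses handle.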

Looking at the support column, you can see that seven, or 35%, are Baby Boomers. Five, or 25%, belong to Generation X. Three, or 15%, belong to Generation Y. Four, or 20%, belong to Generation Z, and one, or 5%, belongs to the Traditionalist generation.
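The percentages follow directly from the support counts, as this short sketch shows. The counts are the ones quoted in the text.

```python
# Support counts from the classification report, as quoted in the text.
support = {
    "Baby Boomer": 7,
    "Generation X": 5,
    "Generation Y": 3,
    "Generation Z": 4,
    "Traditionalist": 1,
}

total = sum(support.values())

# Each generation's share of the test sample, as a percentage.
shares = {generation: 100 * count / total
          for generation, count in support.items()}

for generation, share in shares.items():
    print(f"{generation:>14}: {support[generation]:>2} of {total} = {share:.0f}%")
```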

Based on the numbers from the classification report, the model did a good job.

Run the code multiple times and the results will change somewhat, since different test data is obtained each time the program is executed.

The results give the marketing professional insight into the segments of their market.

Specialized campaigns can be directed at each segment.

It makes little sense to direct any campaign at the Traditionalists. The one Traditionalist in the test data was misclassified anyway.

Run the Python program numerous times and see how the results vary somewhat.

Here is some information I found about the characteristics of Generation X.

  • They were born between 1961 and 1981, making them 38 to 58 years old as of 2019.
  • They represent 22.9% of the United States population, which is approximately comparable to Generation Z and the Millennials.
  • They are called the middle child.
  • Many were latchkey kids.
  • Many are divorced.
  • They have or had career-driven parents.
  • They are into labels and brand names.
  • They have large amounts of credit card debt.
  • During their school years they grew up with no computers in their classrooms or at home.
  • They look at the world and ask, "What's in it for me?"
  • They make about seven career changes.
  • They are called the MTV generation.
  • They like punk and heavy metal music.
  • They are late to marry and quick to divorce.

Here is some information about the Baby Boomer generation.

  • They were born right after World War II ended, from 1946 to 1964.

  • They prefer to communicate using the telephone.

  • They represent a significant spike in births as the soldiers returned from war.

  • They are the 60's and 70's generations.

  • There are two segments: the leading-edge ones, 1946 to 1955, and the late bloomers, 1956 to 1964.

  • They are the first generation with two-income households.

  • Their divorce rates were higher than those of all previous generations.

  • They are the first generation to get television sets.

Here is how you might market products to Generation X. Remember that in our simulation, our company sells flowers and gift baskets.

  • Use the Internet. Ninety-five percent of Generation Xers have a Facebook account.

  • They are loyal to brands, so reward returning customers.

  • They like videos, so create a video about your products and put it on your web page, or feature it in your company's Facebook account, or attach it to an email.

  • Personalize the product. Create custom orders of baskets and flowers based on their preferences.

  • Since they are family oriented, show images of a family enjoying a gift basket together around a special event.


Day 11: Neural Networks

Neural networks are a form of machine learning that tries to imitate the way neural networks work in the human body.

According to the Massachusetts Institute of Technology, understanding how the brain recognizes objects is a central challenge for understanding human vision and for designing artificial vision systems.

No computer system comes close to human vision, but a new study by neuroscientists suggests that the brain learns to solve the problem of object recognition through its vast experience in the natural world.

Humans, based on their past experiences, can recognize the identifying patterns of an object, for example a car, a dog, or a cat.

Artificial neural networks try to imitate human learning in a way similar to the human nervous system.

We will use an expanded dataset on target markets to compare neural networks to the KNN algorithm. The original survey only included the first eight questions. More questions were added to the dataset and questionnaire for this exercise.

Click on this link to see the questionnaire that the dataset is based on: Customer Survey

The additional questions are looking into customer buying habits. The first eight are demographic questions.

By looking at the questionnaire, I assigned the Class for each row, i.e., the generation, based on all of the factors, not just age.

The data is not real; it is made up for educational purposes.

Iterative learning process

A key feature of neural networks is an iterative learning process in which records (rows) are presented to the network one at a time, and the weights associated with the input values are adjusted each time. After all cases are presented, the process is often repeated. During this learning phase, the network trains by adjusting the weights to predict the correct class label of input samples.
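The loop described above can be illustrated with a single artificial neuron (a perceptron), which is much simpler than a full neural network but adjusts its weights in the same record-by-record fashion. The two features, the toy data and the label coding (1 for Generation Y, 0 for Generation X) are all assumptions made for this sketch.

```python
def predict(weights, bias, row):
    """Return 1 if the weighted sum of the inputs is positive, else 0."""
    activation = bias + sum(w * x for w, x in zip(weights, row))
    return 1 if activation > 0 else 0

def train_perceptron(rows, labels, epochs=20, learning_rate=0.1):
    """Present records one at a time and nudge the weights after each one."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):                    # repeat after all records are seen
        for row, target in zip(rows, labels):  # one record at a time
            error = target - predict(weights, bias, row)
            # Adjust each weight in proportion to the error and its input value.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, row)]
            bias += learning_rate * error
    return weights, bias

# Hypothetical rows: (age scaled to 0-1, technical competence scaled to 0-1).
rows = [(0.2, 0.9), (0.3, 0.8), (0.25, 0.7),   # labeled 1 (Generation Y)
        (0.8, 0.3), (0.7, 0.2), (0.9, 0.4)]    # labeled 0 (Generation X)
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_perceptron(rows, labels)
print("weights:", weights, "bias:", bias)
print("predictions:", [predict(weights, bias, row) for row in rows])
```

When a prediction is already correct the error is zero and the weights stay put, so the adjustments shrink away once the neuron has learned to separate the two classes.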

We are trying to teach the network the characteristics of the larger dataset and then check what it has learned against the test dataset. Individual items in the dataset include age, income, occupation and education, as well as technical competence, multitasking ability, work-life balance, retirement age, team orientation, online purchases and brand loyalty.

This information was gathered through our survey of our customers. The responses were entered into an Excel spreadsheet using numbers to represent the answers, and the numbers were formatted to two decimal places. After the data was entered, the headings were removed, and the file was saved in csv format and uploaded to our web site in a file called GenXand GenY.csv