Day 1: Introduction/Comma Separated Values file

You need to install Python, Anaconda3 and Jupyter Notebook on your machine for this project.

Our company owns 100 department stores: twenty each in California, Arizona, New Mexico, Texas and Oklahoma. Profits are down. You suspect that some stores are doing well, while others are not.

We would like our model to predict whether to close some of our department stores. We will look at the following factors/variables:

profit margin in a decimal format
cash flow in dollars
number of new customers
customer satisfaction on a scale from 1 to 5 from our internal surveys. Five is the highest score. The number is an average.
employee performance reviews on a scale from 1 to 10. Ten is the highest. The number represents an average of all personnel in customer service.

The SaveStore response is decided by the person creating the model and records whether the store should be saved. A "y" response means that the store should stay open. An "n" response means that the store should be closed.

We have last year's historical data on the five variables. We want to use that data to make predictions about which stores we need to close.

We are going to use Python's prediction tools to analyze this data.

Our data set is in a csv file.

A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables.¹

Our data set is in a comma separated file. It is laid out in rows, one for each store. Each row starts with the store's name and is followed by the data that pertains to that store. Each value is separated by a comma.

These files can be created in Notepad or another simple text editor.

A Comma Separated Values (CSV) file is a plain text file that stores data by delimiting data entries with commas. CSV files are often used when data needs to be compatible with many different programs. CSVs can be opened in text editors, spreadsheet programs like Excel, or other specialized applications.
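As a small illustration of the format, the sketch below parses the header and the first two store rows with Python's built-in csv module. The variable names are chosen for this example only, and the rows are embedded as text so it runs without the file.

```python
import csv
import io

# Header plus the first two rows of the store file, embedded as text
# so the example runs without a file on disk.
text = """StoreNum,Income,CashFlow,NewCust,CustSatisf,EmpPerf,SaveStore
ca1,0.08,50000,600,5,4,y
ca2,0.03,20000,120,2,1,n
"""

# DictReader uses the first line as column names and splits each
# following line on commas.
rows = list(csv.DictReader(io.StringIO(text)))

for row in rows:
    print(row["StoreNum"], row["Income"], row["SaveStore"])
```

Every value comes back as text at this stage. Converting the columns to numbers is one of the things pandas does automatically later on.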

To open in Excel, just double click on the file name.

Here is what our file looks like. The first column holds the store's name: ca1 is store one in California, az is Arizona, nm is New Mexico, tx is Texas and ok is Oklahoma.

StoreNum,Income,CashFlow,NewCust,CustSatisf,EmpPerf,SaveStore
ca1,0.08,50000,600,5,4,y
ca2,0.03,20000,120,2,1,n
ca3,0.01,10000,36,3,3,y
ca4,0.09,55000,300,4,7,y
ca5,0.06,10000,189,4,5,y
ca6,0.06,20000,420,4,6,y
ca7,0.09,35000,420,4,8,y
ca8,0.09,45000,729,5,9,y
ca9,0.01,500,756,2,1,n
ca10,0.05,3000,144,3,6,n
ca11,0.09,60000,1200,5,5,y
ca12,0.01,500,300,1,2,n
ca13,0.04,2000,720,3,7,y
ca14,0.07,2000,600,3,5,y
ca15,0.02,500,240,1,2,n
ca16,0.03,5000,36,2,5,y
ca17,0.08,2000,1200,5,5,y
ca18,0.04,100,350,5,8,y
ca19,0.01,500,38,2,1,n
ca20,0.02,500,12,1,1,n
az1,0.08,50000,600,5,10,y
az2,0.03,20000,102,3,1,n
az3,0.01,10000,32,1,1,n
az4,0.09,55000,250,5,7,y
az5,0.06,14000,150,4,5,y
az6,0.06,22000,350,2,6,y
az7,0.09,45000,350,3,8,y
az8,0.09,45000,700,5,9,y
az9,0.01,300,30,1,1,n
az10,0.05,3000,120,3,1,n
az11,0.09,60000,1500,5,10,y
az12,0.01,500,20,1,1,n
az13,0.04,2000,60,3,7,y
az14,0.07,2400,50,3,5,y
az15,0.03,50,20,1,2,n
az16,0.03,5000,30,2,5,y
az17,0.08,2100,1000,5,10,y
az18,0.04,200,350,4,8,y
az19,0.02,500,30,1,1,n
az20,0.02,500,10,1,1,n
nm1,0.06,50000,500,5,10,y
nm2,0.02,200,120,2,1,n
nm3,0.01,100,30,1,1,n
nm4,0.09,55000,250,4,9,y
nm5,0.06,14000,150,4,9,y
nm6,0.06,22000,650,2,6,y
nm7,0.09,45000,450,3,9,y
nm8,0.09,55000,750,4,9,y
nm9,0.01,200,30,1,1,n
nm10,0.05,3000,120,3,2,n
nm11,0.08,60000,1500,5,10,y
nm12,0.01,500,20,1,1,n
nm13,0.05,2000,60,3,7,y
nm14,0.07,2400,450,3,5,y
nm15,0.03,50,20,1,2,n
nm16,0.04,5000,30,2,5,y
nm17,0.08,2300,1000,5,10,y
nm18,0.03,200,350,4,1,n
nm19,0.02,500,30,1,2,n
nm20,0.02,500,10,1,1,n
tx1,0.09,50000,500,5,10,y
tx2,0.02,200,120,2,1,n
tx3,0.01,100,30,1,1,n
tx4,0.09,55000,250,4,9,y
tx5,0.06,14000,150,4,9,y
tx6,0.06,22000,650,4,6,y
tx7,0.12,45000,450,3,9,y
tx8,0.09,55000,750,4,9,y
tx9,0.01,200,30,2,1,n
tx10,0.05,3000,120,2,1,n
tx11,0.08,60000,1500,5,10,y
tx12,0.01,500,20,1,1,n
tx13,0.05,2000,60,3,7,y
tx14,0.12,2400,450,3,5,y
tx15,0.03,50,20,1,2,n
tx16,0.04,5000,30,2,5,y
tx17,0.08,2300,1000,5,10,y
tx18,0.03,200,350,4,1,n
tx19,0.02,500,30,2,2,n
tx20,0.02,500,10,1,3,n
ok1,0.13,50000,500,5,10,y
ok2,0.04,200,120,2,1,n
ok3,0.01,100,30,1,3,y
ok4,0.095,55000,257,4,9,y
ok5,0.063,14000,156,4,7,y
ok6,0.06,2000,850,2,6,y
ok7,0.12,45000,450,3,9,y
ok8,0.09,55000,750,4,9,y
ok9,0.015,200,36,2,1,n
ok10,0.059,3000,128,3,1,n
ok11,0.08,60000,1509,5,10,y
ok12,0.01,500,0,1,1,n
ok13,0.05,2000,68,3,7,y
ok14,0.125,2400,450,3,5,y
ok15,0.03,50,25,2,2,n
ok16,0.04,5000,39,2,5,y
ok17,0.085,2600,2000,5,10,y
ok18,0.03,200,350,4,2,n
ok19,0.02,500,33,1,2,n
ok20,0.02,500,13,1,1,n

Open Notepad and paste the contents into the text editor.

Save the file in your working folder. Call it "storeClosures.csv".

All of the data is fabricated. I do not believe that there is a department store chain covering just California, Arizona, New Mexico, Oklahoma and Texas.

The first line contains the headings for each column: StoreNum, Income, CashFlow, NewCust, CustSatisf, EmpPerf, and SaveStore.

The data is separated by commas. The first piece of data is the store ID, followed by net profit expressed as a decimal (the Income column), cash flow, the number of new customers, customer satisfaction (1-5), and the average employee evaluation (1-10) for employees who have contact with customers. The last value is the SaveStore response.

The person in charge of data preparation adds the 'y' and 'n' responses. I added this last variable for each store to record which stores I believe should stay open and which should be closed.

Before explaining how I came up with this variable, it helps to discuss how the dataset was created and what each variable means.

Net profit percentage

The example below shows how the percentage of net profit was calculated.

Sales		333,572.76
Beginning Inventory	100,000.00
Purchases	120,858.42
Mdse avail for sale	220,858.42
Ending Inventory	50,502.92
Cost of Goods Sold		170,355.50
Gross Profit		163,217.26
Total Expenses		144,206.29
Net Profit		19,010.97

Percent net profit 19,010.97/333,572.76 = .05699
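The same arithmetic can be checked in a few lines of Python. This is only a sketch that starts from the Sales, Cost of Goods Sold and Total Expenses figures in the example; the variable names are chosen for illustration.

```python
# Figures taken from the income statement example.
sales = 333_572.76
cost_of_goods_sold = 170_355.50
total_expenses = 144_206.29

# Gross profit is what is left of sales after the cost of the goods.
gross_profit = sales - cost_of_goods_sold

# Net profit is what is left after operating expenses as well.
net_profit = gross_profit - total_expenses

# Net profit as a fraction of sales, the value stored in the Income column.
net_profit_pct = net_profit / sales

print(f"Gross profit: {gross_profit:,.2f}")
print(f"Net profit:   {net_profit:,.2f}")
print(f"Percent net profit: {net_profit_pct:.5f}")
```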

Looking at the dataset, the first variable to consider is the one after the name of the store.

For example, the net profit percentage for ca1 is .08.

You can see the percent profit ranges from .01 to .13 for our 100 stores. It is obvious that a store that generates 13% profit is a much better performing store than one that earns 5%.

Cash Flow

The next variable to consider is Cash Flow.

The cash flow can be computed by taking the initial investment from the owners plus any loans obtained plus revenue generated, minus expenditures for capital assets and expenses.

A positive level of cash flow must be maintained for an entity to remain in business.

Store ca1 had a $50,000 ending cash balance whereas store ok20 only had $500. Obviously store ca1 has a better chance of staying open.

Number of new customers

The third variable in the dataset is the number of new customers.

A store that can add more new customers has a better chance of survival than one that adds only a few. More customers increase revenue.

For example store ok20 added only 13 new customers.

Customer Satisfaction

Each store gave a survey to its customers to find out how satisfied they were with the store. Scores ranged from 1 to 5, with five being the best. The numbers in the dataset represent an average of all customers' scores.

For example store az12's customer satisfaction rating was only 1, meaning the customers of that store were not at all satisfied with the store.

Employee Performance Reviews

The last factor or variable to consider is the average of the employee performance evaluations. The better the employees of an organization are, the more successful the store will be.

Scores on the employee evaluations ranged from 1 - poor to 10 - excellent.

For example store 11 in Oklahoma achieved a 10.

Determining which stores to keep open

Now it is time for me to make 100 evaluations as to which stores I think should remain open. I considered all factors and added either a y or n response.

I looked at averages for all variables. My intention was to give any store with below-average numbers an "n".

The 'y' and 'n' responses are necessary to have a supervised model.

Average Net profit: .05202
Average Cash Flow: 15933
Average Number of new customers: 349.68
Average Customer satisfaction: 2.92
Average Employee performance: 4.94
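Averages like these can be computed with pandas. The sketch below is an illustration only: it embeds the first five rows of the dataset so that it runs on its own, so its averages differ from the full-file figures. With the whole file you would load storeClosures.csv instead.

```python
import io
import pandas as pd

# First five rows of the dataset, embedded so the sketch runs on its own.
sample = io.StringIO("""StoreNum,Income,CashFlow,NewCust,CustSatisf,EmpPerf,SaveStore
ca1,0.08,50000,600,5,4,y
ca2,0.03,20000,120,2,1,n
ca3,0.01,10000,36,3,3,y
ca4,0.09,55000,300,4,7,y
ca5,0.06,10000,189,4,5,y
""")
df = pd.read_csv(sample)

numeric_cols = ["Income", "CashFlow", "NewCust", "CustSatisf", "EmpPerf"]

# Column averages (for the five sample rows only).
averages = df[numeric_cols].mean()
print(averages)

# For each store, count how many variables fall below the column average.
below_avg_count = (df[numeric_cols] < averages).sum(axis=1)
print(pd.DataFrame({"StoreNum": df["StoreNum"], "BelowAvg": below_avg_count}))
```

A count like BelowAvg is one possible way to support the y/n judgment. The final decision in this dataset was still made by hand.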

Day 2: Machine Learning

Machine learning is a branch of artificial intelligence that enables computer programs to automatically learn and improve from experience. Machine learning algorithms learn from datasets and then, based on the patterns identified in those datasets, make predictions on unseen data.¹

There are two main types of machine learning: supervised and unsupervised. Supervised machine learning algorithms are those where the input dataset and the corresponding output or true prediction are available, and the algorithms try to find the relationship between the inputs and outputs.

Our data set is a supervised one, since the output (the y/n SaveStore column) is included in the dataset.

With unsupervised machine learning algorithms, however, the true labels for the output are not known, and the algorithms try to find similar patterns in the data. Clustering algorithms are a typical example of unsupervised learning.¹

Supervised problems are either regression or classification problems.

Regression algorithms predict a continuous value, for example the price of a house or a blood pressure reading.

Classification problems are those where you have to predict a discrete value. For example: is a tumor benign or malignant, or should a student pass or fail? Our problem is a supervised classification problem: should a store be closed or remain open based on the criteria we give the algorithm?

Classification is a subcategory of supervised learning where the goal is to predict the categorical class labels (discrete, unordered values, group membership) of new instances based on past observations.

In supervised learning, you create a function (or model) by using labeled training data that consists of input data and a wanted output. The supervision comes in the form of the wanted output, which in turn lets you adjust the function based on the actual output it produces. When trained, you can apply this function to new observations to produce an output (prediction or classification) that ideally responds correctly.²

The following video looks at the Random Forest algorithm that we will be using to determine whether to keep a store open or recommend closure.

Day 3: Importing needed libraries, Importing the dataset and printing it out

Python Libraries

Copy code to the clipboard.
Click on Start.
Click on Anaconda 3 64 bit
Select Jupyter Notebook Anaconda 3.
Select New.
Select Python 3.
Click on the first frame.
Press CTRL+V to paste the code into the cell.
Click on File and save it as storeClosures.ipynb.
Click Run to run first frame.
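
The code for this first cell is not reproduced here. Judging from the calls used in later cells (train_test_split, RandomForestClassifier and a DataFrame named df), it most likely contains imports along these lines. Treat this as an assumption, not the original cell.

```python
# Assumed contents of the first cell: the libraries that the later
# cells rely on. The original cell may import more than this.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Printing a version number is a quick check that the import worked.
print("pandas version:", pd.__version__)
```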

Getting the dataset and pasting code into Python

In Python, click on the + sign to insert a cell below.

Paste your code in that cell.

You will have to modify the folder location for your files.

The two backslashes (\\) in the file path are very important in the code. Make sure that you use them.
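
The code for the second cell is not shown either. The following is a minimal sketch of a cell that loads the file and prints the shape, the data and a description, assuming the file is read with pandas. The folder in the comment is hypothetical, and a few rows are embedded so the example runs without the file.

```python
import io
import pandas as pd

# In your notebook, point read_csv at your own folder. On Windows each
# backslash is doubled because a single backslash starts an escape
# sequence inside a Python string. Hypothetical path:
#   df = pd.read_csv("C:\\Users\\yourname\\Documents\\storeClosures.csv")

# Stand-in for the file: the header and the first three rows.
sample = io.StringIO("""StoreNum,Income,CashFlow,NewCust,CustSatisf,EmpPerf,SaveStore
ca1,0.08,50000,600,5,4,y
ca2,0.03,20000,120,2,1,n
ca3,0.01,10000,36,3,3,y
""")
df = pd.read_csv(sample)

print("Shape ", df.shape)  # (rows, columns)
print(df)                  # the whole DataFrame
print(df.info())           # info() prints its report, then returns None
```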

Run your code for the first two cells.

Your screen should look like this.

Shape  (100, 7)
   StoreNum  Income  CashFlow  NewCust  CustSatisf  EmpPerf SaveStore
0       ca1   0.080     50000      600           5        4         y
1       ca2   0.030     20000      120           2        1         n
2       ca3   0.010     10000       36           3        3         y
3       ca4   0.090     55000      300           4        7         y
4       ca5   0.060     10000      189           4        5         y
5       ca6   0.060     20000      420           4        6         y
6       ca7   0.090     35000      420           4        8         y
7       ca8   0.090     45000      729           5        9         y
8       ca9   0.010       500      756           2        1         n
9      ca10   0.050      3000      144           3        6         n
10     ca11   0.090     60000     1200           5        5         y
11     ca12   0.010       500      300           1        2         n
12     ca13   0.040      2000      720           3        7         y
13     ca14   0.070      2000      600           3        5         y
14     ca15   0.020       500      240           1        2         n
15     ca16   0.030      5000       36           2        5         y
16     ca17   0.080      2000     1200           5        5         y
17     ca18   0.040       100      350           5        8         y
18     ca19   0.010       500       38           2        1         n
19     ca20   0.020       500       12           1        1         n
20      az1   0.080     50000      600           5       10         y
21      az2   0.030     20000      102           3        1         n
22      az3   0.010     10000       32           1        1         n
23      az4   0.090     55000      250           5        7         y
24      az5   0.060     14000      150           4        5         y
25      az6   0.060     22000      350           2        6         y
26      az7   0.090     45000      350           3        8         y
27      az8   0.090     45000      700           5        9         y
28      az9   0.010       300       30           1        1         n
29     az10   0.050      3000      120           3        1         n
30     az11   0.090     60000     1500           5       10         y
31     az12   0.010       500       20           1        1         n
32     az13   0.040      2000       60           3        7         y
33     az14   0.070      2400       50           3        5         y
34     az15   0.030        50       20           1        2         n
35     az16   0.030      5000       30           2        5         y
36     az17   0.080      2100     1000           5       10         y
37     az18   0.040       200      350           4        8         y
38     az19   0.020       500       30           1        1         n
39     az20   0.020       500       10           1        1         n
40      nm1   0.060     50000      500           5       10         y
41      nm2   0.020       200      120           2        1         n
42      nm3   0.010       100       30           1        1         n
43      nm4   0.090     55000      250           4        9         y
44      nm5   0.060     14000      150           4        9         y
45      nm6   0.060     22000      650           2        6         y
46      nm7   0.090     45000      450           3        9         y
47      nm8   0.090     55000      750           4        9         y
48      nm9   0.010       200       30           1        1         n
49     nm10   0.050      3000      120           3        2         n
50     nm11   0.080     60000     1500           5       10         y
51     nm12   0.010       500       20           1        1         n
52     nm13   0.050      2000       60           3        7         y
53     nm14   0.070      2400      450           3        5         y
54     nm15   0.030        50       20           1        2         n
55     nm16   0.040      5000       30           2        5         y
56     nm17   0.080      2300     1000           5       10         y
57     nm18   0.030       200      350           4        1         n
58     nm19   0.020       500       30           1        2         n
59     nm20   0.020       500       10           1        1         n
60      tx1   0.090     50000      500           5       10         y
61      tx2   0.020       200      120           2        1         n
62      tx3   0.010       100       30           1        1         n
63      tx4   0.090     55000      250           4        9         y
64      tx5   0.060     14000      150           4        9         y
65      tx6   0.060     22000      650           4        6         y
66      tx7   0.120     45000      450           3        9         y
67      tx8   0.090     55000      750           4        9         y
68      tx9   0.010       200       30           2        1         n
69     tx10   0.050      3000      120           2        1         n
70     tx11   0.080     60000     1500           5       10         y
71     tx12   0.010       500       20           1        1         n
72     tx13   0.050      2000       60           3        7         y
73     tx14   0.120      2400      450           3        5         y
74     tx15   0.030        50       20           1        2         n
75     tx16   0.040      5000       30           2        5         y
76     tx17   0.080      2300     1000           5       10         y
77     tx18   0.030       200      350           4        1         n
78     tx19   0.020       500       30           2        2         n
79     tx20   0.020       500       10           1        3         n
80      ok1   0.130     50000      500           5       10         y
81      ok2   0.040       200      120           2        1         n
82      ok3   0.010       100       30           1        3         y
83      ok4   0.095     55000      257           4        9         y
84      ok5   0.063     14000      156           4        7         y
85      ok6   0.060      2000      850           2        6         y
86      ok7   0.120     45000      450           3        9         y
87      ok8   0.090     55000      750           4        9         y
88      ok9   0.015       200       36           2        1         n
89     ok10   0.059      3000      128           3        1         n
90     ok11   0.080     60000     1509           5       10         y
91     ok12   0.010       500        0           1        1         n
92     ok13   0.050      2000       68           3        7         y
93     ok14   0.125      2400      450           3        5         y
94     ok15   0.030        50       25           2        2         n
95     ok16   0.040      5000       39           2        5         y
96     ok17   0.085      2600     2000           5       10         y
97     ok18   0.030       200      350           4        2         n
98     ok19   0.020       500       33           1        2         n
99     ok20   0.020       500       13           1        1         n
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   StoreNum    100 non-null    object 
 1   Income      100 non-null    float64
 2   CashFlow    100 non-null    int64  
 3   NewCust     100 non-null    int64  
 4   CustSatisf  100 non-null    int64  
 5   EmpPerf     100 non-null    int64  
 6   SaveStore   100 non-null    object 
dtypes: float64(1), int64(4), object(2)
memory usage: 5.6+ KB
None

What does this part of the program show us? First the shape of the dataframe is displayed: (100, 7). We have 100 entries in the file, arranged in 7 different columns.

Next the entire dataframe is printed out.

df.info() prints a description of the dataset: the column names, the number of non-null values in each column and each column's data type.

The last line prints Original DataFrame.

Day 4: Splitting the data into features and labels

x = df[['Income', 'CashFlow', 'NewCust', 'CustSatisf', 'EmpPerf']]
y = df['SaveStore']

In Python, click on the + sign to insert a cell below.

Paste your code in that cell.

The features are the input data: income, cash flow, new customers, customer satisfaction and employee performance numbers.

The labels are the yes and no responses as to store closures.
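
To see what the split produces, the sketch below applies the same two selections to four embedded sample rows and prints the result. The sample data stands in for the full file and is an assumption of this example.

```python
import io
import pandas as pd

# Four rows of the dataset, embedded so the sketch runs on its own.
sample = io.StringIO("""StoreNum,Income,CashFlow,NewCust,CustSatisf,EmpPerf,SaveStore
ca1,0.08,50000,600,5,4,y
ca2,0.03,20000,120,2,1,n
ca3,0.01,10000,36,3,3,y
ca4,0.09,55000,300,4,7,y
""")
df = pd.read_csv(sample)

# Features: the five numeric input columns. StoreNum is left out
# because a store's name carries no information about its performance.
x = df[['Income', 'CashFlow', 'NewCust', 'CustSatisf', 'EmpPerf']]

# Labels: the y/n answer the model should learn to predict.
y = df['SaveStore']

print(x.shape)           # (rows, feature columns)
print(y.value_counts())  # how many y and n labels
```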

Save your program.

Day 4: Training the algorithm to solve the classification problem

# Hold out 20% of the stores as a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)
# Build and train a forest of 20 decision trees
clf = RandomForestClassifier(n_estimators=20)
clf.fit(x_train, y_train)
# Predict y/n for the held-out stores
y_pred = clf.predict(x_test)
print('Predicted Values')
print(x_test)  # test dataset without the actual outcome
print(y_pred)  # predicted values

In Python, click on the + sign to insert a cell below.

Paste your code in that cell.

Save and Run your program. Here is what the output looks like.

Predicted Values
    Income  CashFlow  NewCust  CustSatisf  EmpPerf
26   0.090     45000      350           3        8
86   0.120     45000      450           3        9
2    0.010     10000       36           3        3
55   0.040      5000       30           2        5
75   0.040      5000       30           2        5
93   0.125      2400      450           3        5
16   0.080      2000     1200           5        5
73   0.120      2400      450           3        5
54   0.030        50       20           1        2
95   0.040      5000       39           2        5
53   0.070      2400      450           3        5
92   0.050      2000       68           3        7
78   0.020       500       30           2        2
13   0.070      2000      600           3        5
7    0.090     45000      729           5        9
30   0.090     60000     1500           5       10
22   0.010     10000       32           1        1
24   0.060     14000      150           4        5
33   0.070      2400       50           3        5
8    0.010       500      756           2        1
['y' 'y' 'n' 'y' 'y' 'y' 'y' 'y' 'n' 'y' 'y' 'y' 'n' 'y' 'y' 'y' 'n' 'y' 'y' 'n']

Random Forest’s ensemble of trees outputs either the mode (for classification) or the mean (for regression) of the individual trees’ predictions.

This method allows for more accurate and stable results by relying on a multitude of trees rather than a single decision tree.

The reasoning behind the Random Forest model is that individual decision trees perform much better as a group than they do alone.

When using Random Forest for classification, each tree gives a classification or a “vote.”

The forest chooses the classification with the majority of the “votes.”

The n_estimators parameter determines the number of trees in the forest. One hundred is the default. We set ours to 20.

The test_size parameter determines the share of the data held out as a random test sample. In our case it is 0.20, which is 20 of the 100 stores in the dataset.

That is why there are 20 predictions.

The random_state=0 setting makes train_test_split use the same set of random numbers each time it is run, so the same 20 stores are selected for testing.

If you remove that piece of code, you will get a different set of test stores each time you run the program.
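
The effect of test_size and random_state can be tried on a small scale. The sketch below is an illustration that uses only the twenty California rows, embedded so it runs on its own. Passing random_state to the classifier as well is an addition made for this sketch and is not part of the code above.

```python
import io
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# The twenty California rows of the dataset.
sample = io.StringIO("""StoreNum,Income,CashFlow,NewCust,CustSatisf,EmpPerf,SaveStore
ca1,0.08,50000,600,5,4,y
ca2,0.03,20000,120,2,1,n
ca3,0.01,10000,36,3,3,y
ca4,0.09,55000,300,4,7,y
ca5,0.06,10000,189,4,5,y
ca6,0.06,20000,420,4,6,y
ca7,0.09,35000,420,4,8,y
ca8,0.09,45000,729,5,9,y
ca9,0.01,500,756,2,1,n
ca10,0.05,3000,144,3,6,n
ca11,0.09,60000,1200,5,5,y
ca12,0.01,500,300,1,2,n
ca13,0.04,2000,720,3,7,y
ca14,0.07,2000,600,3,5,y
ca15,0.02,500,240,1,2,n
ca16,0.03,5000,36,2,5,y
ca17,0.08,2000,1200,5,5,y
ca18,0.04,100,350,5,8,y
ca19,0.01,500,38,2,1,n
ca20,0.02,500,12,1,1,n
""")
df = pd.read_csv(sample)
x = df[['Income', 'CashFlow', 'NewCust', 'CustSatisf', 'EmpPerf']]
y = df['SaveStore']

# test_size=0.20 holds out 20% of the rows (4 of these 20) for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=0)

# Splitting again with the same random_state picks the same rows.
_, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.20, random_state=0)
same_rows = list(x_test.index) == list(x_test_again.index)

# random_state here is this sketch's addition: it fixes the randomness
# inside the forest too, so the predictions repeat from run to run.
clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

print(len(x_train), len(x_test), same_rows)
print(y_pred)
```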

Analysis of the data

The predictions appear at the bottom of the list. They are the y's and n's.

The first prediction is for row 26, which is store az7. Our model says that, based on the data, that store should remain open. The original data agrees with the prediction. This would be a true positive on the confusion matrix.

The last prediction is an 'n' for row 8, which is store ca9. Both the actual value and the prediction are "n" for this store. This would be a true negative on the confusion matrix.

Day 5: Accuracy of the model

A confusion matrix is a table showing the performance of our model.

The true negatives (TN) are the ones where the actual value and the prediction were both false ("n"). In the random sample, there were 4 of these.

The true positives (TP) are the ones where the actual value and the prediction were both true ("y"). In our random sample, there were 15.

The false positives (FP) are the ones where the actual value was false and the prediction was true. Our model did not generate any of these.

The false negatives (FN) are the ones where the actual value was true and the prediction was false. We had one of these in our random sample.

Let's see if we can find this false negative.

I looked through the data set and the predictions. For array item 2, store ca3, the actual value was "y" and the prediction was "n".

When counting the y and n values at the bottom of the prediction list, start counting with 0.

The accuracy for our model is calculated as (TN + TP)/(TN + TP + FP + FN), or (4 + 15)/(4 + 15 + 0 + 1) = .95.
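
The four counts and the accuracy can be reproduced in plain Python. In this sketch the actual labels were looked up in the dataset by row number, in the same order as the 20 test rows printed on Day 4, and the predictions are copied from that output. The variable names are chosen for illustration.

```python
# Actual SaveStore values for the 20 test rows, in printed order
# (rows 26, 86, 2, 55, 75, 93, 16, 73, 54, 95, 53, 92, 78, 13, 7,
# 30, 22, 24, 33, 8).
actual = "y y y y y y y y n y y y n y y y n y y n".split()

# The model's predictions, copied from the Day 4 output.
predicted = "y y n y y y y y n y y y n y y y n y y n".split()

pairs = list(zip(actual, predicted))
tp = sum(a == "y" and p == "y" for a, p in pairs)  # true positives
tn = sum(a == "n" and p == "n" for a, p in pairs)  # true negatives
fp = sum(a == "n" and p == "y" for a, p in pairs)  # false positives
fn = sum(a == "y" and p == "n" for a, p in pairs)  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Positions (counting from 0) where prediction and actual disagree.
mismatches = [i for i, (a, p) in enumerate(pairs) if a != p]

print(tp, tn, fp, fn, accuracy, mismatches)
```

The single mismatch is at position 2, which is the false negative for store ca3.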

There is a line in our code that allows you to enter one store's data and have it predict whether to keep the store open or close it.

prediction = clf.predict([[.01,100,1,1,1]])

If you enter these numbers, you will find that this store should be closed.

Try some different numbers. This line is especially useful.
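
A self-contained way to experiment with this is sketched below. It trains on ten embedded sample rows, not the full file, so its answers may differ from the full model. Wrapping the new store in a DataFrame with the training column names is a choice made for this sketch; it avoids the feature-name warning scikit-learn issues when a plain list is passed to a model trained on a DataFrame.

```python
import io
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Ten rows of the dataset, embedded so the sketch runs on its own.
sample = io.StringIO("""StoreNum,Income,CashFlow,NewCust,CustSatisf,EmpPerf,SaveStore
ca1,0.08,50000,600,5,4,y
ca2,0.03,20000,120,2,1,n
ca3,0.01,10000,36,3,3,y
ca4,0.09,55000,300,4,7,y
ca5,0.06,10000,189,4,5,y
ca6,0.06,20000,420,4,6,y
ca7,0.09,35000,420,4,8,y
ca8,0.09,45000,729,5,9,y
ca9,0.01,500,756,2,1,n
ca10,0.05,3000,144,3,6,n
""")
df = pd.read_csv(sample)
cols = ['Income', 'CashFlow', 'NewCust', 'CustSatisf', 'EmpPerf']

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(df[cols], df['SaveStore'])

# One hypothetical store, in the same column order as the training data.
new_store = pd.DataFrame([[0.01, 100, 1, 1, 1]], columns=cols)
prediction = clf.predict(new_store)
print(prediction)
```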

The last section of code tells us which of the factors are the most important when deciding store closures.

The output shows the following importance scores.

4 0.566881 Employee Performance
0 0.187552 Net Profit
1 0.166781 Cash Flow
2 0.042852 New Customers
3 0.035935 Customer Satisfaction
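
The code that produced this table is not shown. One way to build a similar table, sketched here under the assumption that it comes from the fitted forest's feature_importances_ attribute, is to pair each importance with its column name and sort. The ten embedded rows are a stand-in for the full file, so the scores will not match the ones above.

```python
import io
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Ten rows of the dataset, embedded so the sketch runs on its own.
sample = io.StringIO("""StoreNum,Income,CashFlow,NewCust,CustSatisf,EmpPerf,SaveStore
ca1,0.08,50000,600,5,4,y
ca2,0.03,20000,120,2,1,n
ca3,0.01,10000,36,3,3,y
ca4,0.09,55000,300,4,7,y
ca5,0.06,10000,189,4,5,y
ca6,0.06,20000,420,4,6,y
ca7,0.09,35000,420,4,8,y
ca8,0.09,45000,729,5,9,y
ca9,0.01,500,756,2,1,n
ca10,0.05,3000,144,3,6,n
""")
df = pd.read_csv(sample)
cols = ['Income', 'CashFlow', 'NewCust', 'CustSatisf', 'EmpPerf']

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(df[cols], df['SaveStore'])

# feature_importances_ holds one score per input column, in column
# order; the scores are normalized so that they add up to 1.
importances = pd.DataFrame({
    "importance": clf.feature_importances_,
    "feature": cols,
}).sort_values("importance", ascending=False)

print(importances)
```

The index on the left of the printed table is the column's position in the feature list, which is why the output above starts with 4 for employee performance.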

A working predictive model needs to make predictions for data that it hasn't seen yet.

We are assuming that next year's results will follow the patterns the model found in last year's data.

As you can see from the results, the biggest factor behind an "n" response for SaveStore was poor employee evaluations, with an importance of .566881.

Customer satisfaction, array number 3, had the least importance: 0.035935.

Number of new customers was also not very important, with a score of 0.042852.

Net profit and cash flow scores fell in the middle.

Action Plan

As a result of our research, we conclude that failing stores, those with an "n" in the SaveStore column, are failing mainly due to poor employee performance.

First find out which department in the failing stores has the problem with employee performance evaluations. It could be the stores' management teams, sales force, etc.

We should recommend to corporate top management that employees with below-average numbers (under 4.94 out of 10) either receive major in-service training or be terminated and replaced with new hires.

The numbers for net profit and cash flow come from accounting data.

Both net profit and cash flow can be improved by increasing sales at these stores.

Sales
- Cost of goods sold
= Gross Profit
- Total operating expenses
= Net profit

Sales can be increased by in-store displays of profitable merchandise. Advertising can be increased. Customer loyalty programs could be initiated. The sales force could be retrained.

Expenses can also be reduced. Examine cost of goods sold, insurance, utilities and labor costs.