Logistic regression model
Simmons Company (Simmons) conducted a pilot study using a random sample of customers who have the company’s credit card and customers who do not have the company’s credit card. Simmons sent the catalog to each of the 100 customers selected. At the end of a test period, Simmons noted whether each customer had used her or his coupon. The sample data for the first 10 catalog recipients are shown in Table 7.
Use the sample data provided in Table 7 to answer the following questions:
- Provide descriptive statistics of the data divided according to the divisions in the values of the categorical variable.
- Develop a logistic regression model in which the probability of using a coupon is the dependent variable.
a. State the hypotheses on the coefficients, justify the formulation of these hypotheses and interpret the results. Use α = 0.05. Include all phases of assessment of the model and do not forget to check multicollinearity.
- Apply the estimated regression equation developed in part b to predict the probability of using a coupon for a combination of values not given in the above table for the independent variables.
To my readers: this post is intended only for students who are struggling with statistics, and it is not one of my regular posts. If it is not relevant to you, please feel free to ignore it. Thank you for your understanding.
We are using the R programming language to calculate the descriptive statistics of the given data. First, we read the table presented above into R as "activity6data" using the function below.
# activity6data <- read.csv("data.csv", header = TRUE, sep = ",")
Once the data is imported, the function below produces the descriptive statistics.
#summary(activity6data)
And the descriptive data is presented below.
Standard deviation
The descriptive data cover the three variables. The minimum value of annual spending is 1.182, and the minimum value of both Simmons card and coupon is 0. The means are 30.044, 0.5, and 0.2 for annual spending, Simmons card, and coupon respectively. The 1st quartile, median, 3rd quartile, and maximum value of each of the three variables are presented in the table. The standard deviation of annual spending is 1.599, of Simmons card 0.5270, and of coupon 0.4216. The standard deviations were calculated with the function below.
#sapply(activity6data, sd)
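The assignment also asks for descriptive statistics split by the values of the categorical variable. A minimal sketch of one way to do that in base R, assuming simmonscard is the grouping variable; the data frame demo_data and its values are made up for illustration and are not the Table 7 data:

```r
# Hypothetical stand-in for activity6data; these values are invented for
# illustration only and are NOT the Table 7 data.
demo_data <- data.frame(
  annualspending = c(2.1, 3.4, 1.8, 4.0, 2.6, 5.2),
  simmonscard    = c(1, 0, 1, 0, 1, 0),
  coupon         = c(0, 1, 0, 0, 1, 1)
)

# Mean and standard deviation of annual spending within each card group
group_means <- tapply(demo_data$annualspending, demo_data$simmonscard, mean)
group_sds   <- tapply(demo_data$annualspending, demo_data$simmonscard, sd)

# Share of coupon users within each card group (mean of a 0/1 variable)
coupon_share <- tapply(demo_data$coupon, demo_data$simmonscard, mean)

print(group_means)
print(group_sds)
print(coupon_share)
```

With the real data, replacing demo_data by activity6data should give the same kind of per-group summary.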
1. Logistic Regression
To build the logistic regression, we treat coupon use as the dependent variable, which depends on annual spending and on having the Simmons card. Coupon use has only two possible outcomes, yes or no, which is why logistic regression is used rather than ordinary linear regression. In other words, whether each of the first 10 customers uses the coupon is modeled as a function of their annual spending and of whether they have the Simmons card. So, we are testing whether a relationship exists between those two independent variables and the dependent variable.
Therefore, we set up the hypotheses as below.
Null hypothesis (H0): no relationship exists between annual spending or possession of the Simmons card and use of the coupon
Or, H0: β1 = β2 = 0
Alternative hypothesis (H1): a relationship exists between annual spending or possession of the Simmons card and use of the coupon
Or, H1: at least one of β1, β2 ≠ 0
Now, we fit the regression model with the function below, where "mymodel" is the name of the model.
# mymodel <- glm(coupon ~ annualspending + simmonscard, data = activity6data, family = "binomial")
Next, by using the below function, R gives us the summary of our regression model.
#summary(mymodel)
The result is
The table shows that the estimated intercept is 16.155, the coefficient of annual spending is -6.665, and the coefficient of Simmons card is -5.309. Furthermore, the standard error of the intercept is 14.770, while the z value of the intercept is 1.094. The p-values of annual spending and Simmons card are 0.271 and 0.353 respectively; both exceed 0.05, so neither coefficient is individually significant at the 5% level. The table also shows the null deviance on 9 degrees of freedom and the residual deviance on 7 degrees of freedom, and a noticeable figure is the number of Fisher scoring iterations, which is 9.
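As a quick check on how the z value in the summary is obtained, it is the estimate divided by its standard error. A worked sketch for the intercept, using the two figures quoted above (the hat notation is an assumption of this sketch):

```latex
z = \frac{\hat{\beta}_0}{\mathrm{SE}(\hat{\beta}_0)} = \frac{16.155}{14.770} \approx 1.094
```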
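Therefore, the estimated regression model, where Y is the log-odds of using a coupon, X1 is annual spending, and X2 is the Simmons card indicator, is
Y = β0 + β1X1 + β2X2
Or, Y = 16.155 - 6.665*X1 - 5.309*X2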
But, as this is a logistic regression model, Y is on the log-odds scale, so coupon use is expressed as a probability. That is, what is the probability, in percent, that a person uses a coupon? It is calculated below. To calculate the probability, we use the function below in R.
# prediction_1 <- predict(mymodel, activity6data, type = "response")
# head(prediction_1)
This function is based on the formula below.
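For a binomial model with the default logit link, type = "response" returns the linear predictor passed through the logistic (inverse logit) function. Written out with the coefficients from the regression equation (the symbol for the predicted probability is an assumption of this sketch):

```latex
\hat{p} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}
        = \frac{1}{1 + e^{-(16.155 - 6.665\,x_1 - 5.309\,x_2)}}
```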
And, here is the result.
The probability table gives the estimated probability that each person uses the coupon. For example, there is a 1.214981% chance that the first respondent is going to use the coupon. Similarly, multiplying each result by 100 gives the percentage chance that each respondent will use the coupon.
Finally, we test the overall fit of the model against the intercept-only model by using the function below.
# with(mymodel, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
That gives us a result of 0.04079971, which is a p-value, and it is lower than our chosen significance level of 0.05. Therefore, we reject the null hypothesis. That means there is a relationship between annual spending and possession of the Simmons card, taken together, and use of the coupon.
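The R expression above is a likelihood-ratio (deviance) test. As a sketch of what it computes, with the null and residual deviances written as D with subscripts (notation assumed here) and the degrees of freedom 9 and 7 taken from the model summary:

```latex
G = D_{\text{null}} - D_{\text{res}}, \qquad
\mathrm{df} = 9 - 7 = 2, \qquad
p\text{-value} = P\left(\chi^2_{2} > G\right)
```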
Part 3.
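We use randomly chosen annual spending values, shown in the table below, together with the Simmons card indicator. With the regression equation Y = 16.155 - 6.665*X1 - 5.309*X2, the value of Y for each of the 10 respondents is negative. Because Y is a log-odds and not a percentage, a negative value corresponds to a predicted probability below 0.5, and close to zero when Y is strongly negative. Therefore, none of these customers is predicted to use the coupon.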
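The step from the regression equation to a probability can be sketched in base R. This is a minimal illustration that reuses the coefficients from the equation above; the data frame new_customers and its values are hypothetical and are not taken from the post's table:

```r
# Coefficients as reported in the regression equation above
b0 <- 16.155   # intercept
b1 <- -6.665   # annual spending
b2 <- -5.309   # Simmons card (1 = has card, 0 = no card)

# Hypothetical customers, invented for illustration only
new_customers <- data.frame(
  annualspending = c(2.0, 2.0, 4.5),
  simmonscard    = c(0, 1, 1)
)

# Linear predictor (log-odds), then the logistic transform to a probability
log_odds <- b0 + b1 * new_customers$annualspending + b2 * new_customers$simmonscard
prob     <- plogis(log_odds)   # same as 1 / (1 + exp(-log_odds))

print(cbind(new_customers, log_odds, prob))
```

A negative log-odds still maps to a probability strictly between 0 and 1. With the fitted model at hand, predict(mymodel, newdata = new_customers, type = "response") should give matching values up to rounding of the coefficients.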
Thank you so much.