Diagnostic Analysis of the Factors Influencing an Ecommerce Store’s Annual Revenue Using Regression Analysis
This is a diagnostic analysis was to determine how an ecommerce store can influence the amount of money each customer spends with them annually.
To do this, we will utilize the regression analysis method to determine the relationship between each of the independent variables attributed to each customer, and how that influenced their purchasing powers. Once this relationship has been determined, the ecommerce store can then use it to determine and forecast revenue growth over a new period once it has made significant efforts towards improving the value of the independent variables.
Data Summary
The data used for the project is a collection of the information of the customers that have visited an ecommerce store over a period of time. The dataset was downloaded from Kaggle. The goal is to find a model that best determines the amount each customer can spend with the ecommerce store, once all other factors are held constant.
Here is the data dictionary of all the variables within the dataset:
Email: The email address of the customer
Address: The house/postal address of the customers
Name: The name of the customer
Avg. Session Length (AL): The average amount of time a customer uses per session when visiting the store
Time on App (TA): This describes the average time a user spends on the mobile application of the store
Time on Website (TW): This determines the average time a customer spends on the store’s website
Length of Membership (LM): This describes how long each customer has been visiting the store
Yearly Amount Spend (YS): This is the average of the annual amount each customer spends annually across the specified period.
Note: For the sake of this analysis, we are going to be working with just four variables; Avg. Session Length, Time on App, Time on Website, Length of Membership, and Yearly Amount Spend.
Data Exploration
Since we are not utilizing a high dimensional dataset, the normal practice is to plot scatterplots to determine the possible relationship between each of the independent variables and the dependent variable (Yearly Amount Spend).
Judging from the scatterplots above, I can categorically say that Avg. Session Length, Length of Membership and Time on App all have a linear relationship with the Yearly Amount Spend. But Time on Website does not show any relationship from visual interpretation.
To further confirm the relationship between these independent variables and the dependent variable, we plotted a correlation matrix as shown below. This helped us further discover which of the variables had a correlated influence on the independent variable (YS).
Model Estimation
From the data exploration results, we can assume that these relationships follow a linear pattern with YS. As such, we will model a relationship among them using a linear regression model.
From our results, we could see that even though some of the other statistical indicators showed positive results, there was a variable that seemed to be statistically insignificant within the model. This is because its P-value was above 0.05. that is, the variable TW with p-value of 0.326.
The next action will be for use to remove the insignificant variable and then fit another model to understand the dynamics of the relationship again.
From above we could see that the same characteristics of the previous model applies to this new one as well. The only difference is in the number of variables involved.
Multiple R Squared is 0.9843. this means up to 98% of the YS, the amount customers spend annually can be determined by the combination of the independent variables.
P-value is almost zero. This shows that our model is statistically significant, and as such can be used to predict future values of annual spend.
The F-Statistic is in excess of one million. This means the joint effect of all the independent variables (TA, LM and AL) is highly significant on the final amount spent annually. It also described that there is a high variation among the independent variables than one’d have expected to see.
Diagnostic Analysis
For the diagnostic analysis, we investigate whether the model built followed a normal distribution to ensure that the model is statistically significant. If it doesn’t, we’d have to transform the variables and then build another model.
Residuals vs Fitted plot is used to determine the non-linearity, constant variance, and the spread of the residuals. The first figure on chart showed that the residuals bounce randomly around. A clear indication that our assumption for linearity before building the model was correct.
Histogram is used to measure if the residuals follow a normal distribution. As could be seen from the 2nd figure in display shown above, the residuals do follow a normal distribution.
Normal Q-QPlot also shows whether residuals follow a normal distribution. These residuals as scatters are almost forming a single line as shown in the 3rd and 4th figures in display shown above. As such, the model is correct.
Shapiro-Wilk Normality Test also tests whether the data follows a normal distribution. The p-value here as shown below is greater than 0.05, as such the test for normality holds true.
Judging from the above results, we do not need to transform the data anymore.
Conclusion
We could see that for the major factors that influence the amount a customer spends with this ecommerce store annually is dependent greatly on the how long the customer has been visiting the store, the amount of time spent on the application, and avg. time spent per session during shopping. This relationship can be quantified like this: YS = 25.72AL + 38.75TA + 61.56LM — 1035.34
This implies that the store should pay more attention ton improving these three metrics; avg. length of time spent per shopping session, time spent on shopping application, and increasing the life time value of its customers. If these are done well, a significant increase will be seen in its annual revenues.