Sample Size in Multiple Regression

In this article, I would like to share the sample size requirement for multiple regression analysis. First, what is multiple regression? Multiple regression is a powerful tool in the Analyze stage, used mainly when a continuous output is analyzed against continuous Xs. In the oil and gas industry, for example, most process output and input data are continuous, e.g. process output (Kerosene Yield %) versus Temperature (range from 200 Celsius to 300 Celsius), Pressure (range from 40kBar ~ 60kBar), etc. Multiple regression is required not only because the process output and input data are continuous, but also because we are investigating the relation between one process output and multiple inputs. Below is an example of how the data look in Minitab:

The Minitab worksheet above shows Y as the process output, together with 15 process inputs. In this case we are trying to identify which of the 15 inputs are significant and what the relation is between the significant inputs and the process output (Y); in other words, we are establishing Y=f(x). Before we can proceed with multiple regression analysis, we need to collect data, and the question that arises is “How many data points do we need?”. Too few data points leave too few degrees of freedom for Minitab to calculate the p-value for each factor. The p-values matter because they tell us which factors are significant to the process output, as in the regression output for this example.
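To make the degrees-of-freedom point concrete, here is a minimal Python sketch. It assumes an ordinary least squares model with an intercept, for which the residual degrees of freedom are n - k - 1 (n observations, k inputs). The function names are illustrative only and are not part of Minitab.

```python
def residual_degrees_of_freedom(n_observations: int, n_inputs: int) -> int:
    """Residual df for a regression with an intercept: n - k - 1."""
    return n_observations - n_inputs - 1


def can_estimate_p_values(n_observations: int, n_inputs: int) -> bool:
    """p-values need at least one residual degree of freedom."""
    return residual_degrees_of_freedom(n_observations, n_inputs) >= 1


# 15 inputs, as in the example in this article
print(residual_degrees_of_freedom(16, 15))   # 0  -> no p-values can be computed
print(residual_degrees_of_freedom(104, 15))  # 88 -> plenty of room
```

Having just enough rows to obtain p-values is only a mathematical floor; it does not mean the estimates will be reliable.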

From the p-values we can see that some of the inputs are not significant, so we remove the insignificant inputs one at a time, starting with the highest p-value and refitting the model after each removal. The final model keeps only the significant factors, which are X1, X2, X4, X7, X8, X10, X11, X12 and X13.
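The one-at-a-time removal described above can be sketched as a small loop in Python. This is only an outline under stated assumptions: `fit_p_values` is a hypothetical callable that refits the regression on the given inputs and returns their p-values (in practice it would wrap whatever statistics package you use), and the 0.05 significance threshold is an assumption, not a value stated in this article. The p-values in the demo are made up for illustration.

```python
from typing import Callable, Dict, List


def backward_eliminate(
    inputs: List[str],
    fit_p_values: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,  # assumed significance threshold
) -> List[str]:
    """Drop the input with the highest p-value until all are <= alpha."""
    remaining = list(inputs)
    while remaining:
        p_values = fit_p_values(remaining)  # refit on the current inputs
        worst = max(remaining, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break  # every remaining input is significant
        remaining.remove(worst)
    return remaining


def demo_fit(names: List[str]) -> Dict[str, float]:
    """Stand-in for a real regression fit; returns made-up p-values."""
    made_up = {"X1": 0.001, "X2": 0.030, "X3": 0.620, "X4": 0.240}
    return {name: made_up[name] for name in names}


print(backward_eliminate(["X1", "X2", "X3", "X4"], demo_fit))  # ['X1', 'X2']
```

In a real analysis the p-values change after every refit, which is why only one input is removed per pass; the stand-in above ignores that for brevity.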

In this example, 104 data points were collected for the process output and inputs on a daily basis over a 6-month period. A simple rule for the minimum number of data points to collect is 5 times the number of inputs under study. In this case there are 15 inputs, so the minimum number of data points is 5 x 15 = 75. Alternatively, you can refer to http://www.danielsoper.com/statcalc/calc01.aspx for the minimum sample size required for multiple regression analysis.
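The 5-times rule is simple enough to check by hand, but a short Python sketch makes the arithmetic explicit. The function name and the `rows_per_input` parameter are illustrative choices, not taken from any tool mentioned in this article.

```python
def minimum_rows_rule_of_thumb(n_inputs: int, rows_per_input: int = 5) -> int:
    """Minimum number of data points: 5 times the number of inputs."""
    return rows_per_input * n_inputs


required = minimum_rows_rule_of_thumb(15)  # 15 inputs in the example
collected = 104                            # data points in the example

print(required)               # 75
print(collected >= required)  # True
```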

Ideally, the more data you collect the better, but too much data can also cause trouble because it captures both short-term and long-term effects. You therefore need to clean the data, especially the data that reflect long-term effects, that is, SPECIAL CAUSE data. Tools such as the I-MR Chart, Run Chart or Box Plot, or a combination of any two of them (for example Box Plot together with I-MR Chart), can be used to ensure that your data set contains only data that reflect COMMON CAUSE variation in your process. This matters because during process optimization we intend to minimize the COMMON CAUSE effect and eliminate the SPECIAL CAUSE effect.
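As a rough illustration of the screening step, the sketch below computes the control limits of an Individuals chart (the "I" part of an I-MR Chart) and flags points outside them as candidates for special cause. It uses the standard constant 2.66 (3 divided by d2 = 1.128 for a moving range of two points) and applies only the simplest rule, a point beyond the limits. The yield readings are made up for illustration, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def individuals_chart_limits(values: List[float]) -> Tuple[float, float, float]:
    """Return (lower limit, center line, upper limit) for an Individuals chart."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    mr_bar = mean(moving_ranges)  # average moving range
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar


def flag_special_cause(values: List[float]) -> List[int]:
    """Indexes of points outside the control limits."""
    lower, _, upper = individuals_chart_limits(values)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Made-up daily yield readings with one obvious upset
readings = [50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 58.0, 50.1, 49.7, 50.0]
print(flag_special_cause(readings))  # [6]
```

A flagged point is a prompt to investigate, not an automatic deletion; a data point should only be excluded once a special cause has actually been identified for it.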

To conclude, there are two ways to determine the minimum data requirement, as mentioned above: the 5-times rule and the online calculator. In practice, most practitioners base their multiple regression analysis on 3 ~ 6 months of data collected on a daily or hourly basis.

