Project Report A Simulation Program Generated The Data (Statistics Project Sample)
For the first project, there are two parts: A and B. You should use a total of three files. Two of the files are for part A, and one file is for part B.
For part A, each file will contain a column for subject ID and a column for either the dependent variable value or the independent variable value. First, you are expected to sort the two files by subject ID and merge them. You should not simply use “cut and paste” to merge your data. Second, you are expected to deal with missing data. Your report should contain the number of subject IDs that had at least one independent variable value or dependent variable value. It should also include the number of subject IDs that had an independent variable value. There are a number of missing data procedures, and a statistical package often has imputation algorithms built in. For example, R has 5 different algorithms available. You may choose any algorithm except listwise deletion. Specify your choice in your report. Often, the choice of imputation method has little effect on the results if the fraction of missing data is 30% or less. Then, you should use the statistical package of your choice to find the fitted linear model.
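The merge and missing-data steps described above could be sketched as follows. This is only an illustration under stated assumptions: the file contents, the column names `id`, `iv`, and `dv`, and the choice of mean imputation are all hypothetical, and any imputation method other than listwise deletion would satisfy the assignment.

```python
import csv
import io
from statistics import mean

# Hypothetical file contents; an empty field marks a missing value.
IV_CSV = "id,iv\n3,2.0\n1,1.0\n4,\n5,5.0\n"
DV_CSV = "id,dv\n2,4.1\n1,2.1\n3,\n5,9.9\n"

def read_column(text, name):
    """Return {subject_id: value or None} for one two-column file."""
    out = {}
    for row in csv.DictReader(io.StringIO(text)):
        raw = row[name].strip()
        out[int(row["id"])] = float(raw) if raw else None
    return out

def merge_by_id(iv, dv):
    """Outer-join the two files on subject ID, sorted by ID."""
    ids = sorted(set(iv) | set(dv))
    return [(i, iv.get(i), dv.get(i)) for i in ids]

def impute_mean(values):
    """Replace None with the mean of the observed values (assumed method)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

iv = read_column(IV_CSV, "iv")
dv = read_column(DV_CSV, "dv")
merged = merge_by_id(iv, dv)

# Counts the report asks for.
n_any = sum(1 for _, x, y in merged if x is not None or y is not None)
n_iv = sum(1 for _, x, _ in merged if x is not None)
n_complete = sum(1 for _, x, y in merged if x is not None and y is not None)

# Keep only IDs with at least one value, then impute the gaps.
rows = [r for r in merged if r[1] is not None or r[2] is not None]
xs = impute_mean([r[1] for r in rows])
ys = impute_mean([r[2] for r in rows])

print("IDs with at least one value:", n_any)
print("IDs with an IV value:", n_iv)
print("IDs with both values:", n_complete)
```

Merging on the ID key rather than by row position is what keeps the two columns aligned when an ID is absent from one file.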
The data file for part B will contain one line for each subject ID. The line will contain the subject ID, the value of the independent variable, and the value of the dependent variable. A transformation of either IV or DV or both may be required. You should read the text for suggestions on fitting a model. A lack of fit (LOF) test should be applied if there are repeated values in the data sets. It is your group’s responsibility to find repeated (or near repeated) independent variable values. That is, your group should bin near repeated data into one level. For example, suppose that x_1 = 1.01, x_2 = 1.02, x_3 = 1.03 and y_1 = 2, y_2 = 3, y_3 = 4. While there are no exactly repeated x values, your group could bin these points into one group of nearly repeated points. That is, choose the average x-value as the value of x after binning. Then your binned data would be x_1 = 1.02, x_2 = 1.02, x_3 = 1.02 and y_1 = 2, y_2 = 3, y_3 = 4. Then perform a LOF test on the data set after binning all near repeated values.
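A minimal sketch of the binning and lack of fit computation is given below. The tolerance `tol`, the function names, and the sample data beyond the three points quoted above are assumptions made for illustration. The sketch returns only the F statistic and its degrees of freedom; the p-value would come from an F table or a statistical package.

```python
from collections import defaultdict
from statistics import mean

def bin_near_repeats(xs, tol):
    """Group x values whose gap to the previous sorted value is <= tol
    and replace every member of a group by the group's average x."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    groups, current = [], [order[0]]
    for i in order[1:]:
        if xs[i] - xs[current[-1]] <= tol:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    binned = list(xs)
    for g in groups:
        avg = mean(xs[i] for i in g)
        for i in g:
            binned[i] = avg
    return binned

def fit_line(xs, ys):
    """Least-squares intercept and slope."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def lack_of_fit(xs, ys):
    """Return (F, df_lof, df_pure_error) for the straight-line model."""
    b0, b1 = fit_line(xs, ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    levels = defaultdict(list)
    for x, y in zip(xs, ys):
        levels[x].append(y)
    # Pure error: variation of y around its mean within each x level.
    ss_pe = sum(sum((y - mean(g)) ** 2 for y in g) for g in levels.values())
    df_pe = len(xs) - len(levels)
    df_lof = len(levels) - 2
    ss_lof = sse - ss_pe
    return (ss_lof / df_lof) / (ss_pe / df_pe), df_lof, df_pe

# First three points are the example from the text; the rest are made up.
xs = [1.01, 1.02, 1.03, 2.00, 2.02, 3.00, 3.02, 4.00]
ys = [2, 3, 4, 5, 6, 9, 8, 12]

binned = bin_near_repeats(xs, tol=0.05)
f_stat, df_lof, df_pe = lack_of_fit(binned, ys)
print("binned x:", binned)
print("LOF F =", f_stat, "on", df_lof, "and", df_pe, "df")
```

The test needs at least three distinct x levels after binning and at least one level with more than one observation; otherwise one of the degrees of freedom is zero.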
You must submit a one-page report on Problem A and a one-page report on Problem B. Each report should have four sections. The introduction should contain a statement of the problem and the objective of the paper. This part is easy: your problem is to recover the function that was used to generate the dependent variable value based on the value of the independent variable. The data you receive will be generated by a simulation program. The second section should describe your methodology. Specifically, it should state how the files were merged, which program was used to perform the statistical analysis, whether you used linear regression and additional procedures such as a lack of fit test, how much data was missing, and the procedure used for dealing with missing data. The third section should contain your results: the fraction of the variation of the dependent variable that was explained, the analysis of variance table, the fitted function, confidence intervals for the slope, and the test of the null hypothesis that the slope is zero. The fourth section should be conclusions and discussion. This section should focus on “big picture” issues. Was there an association between the variables? How important was it? That is, what was the r-squared value? What is your fitted function? You may submit a longer appendix of computer work and programs.
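The quantities requested for the results section can be computed by hand for a simple linear regression. The sketch below is illustrative only: the data and function names are hypothetical, and the critical value `t_crit` is supplied by the user from a t table (3.182 is the two-sided 95% value for 3 degrees of freedom).

```python
from math import sqrt
from statistics import mean

def simple_regression(xs, ys):
    """Fit y = intercept + slope * x and return the report quantities."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ss_total = sum((y - ybar) ** 2 for y in ys)
    ss_resid = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_reg = ss_total - ss_resid
    ms_resid = ss_resid / (n - 2)
    se_slope = sqrt(ms_resid / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": ss_reg / ss_total,
        # ANOVA rows as (sum of squares, degrees of freedom).
        "anova": {
            "regression": (ss_reg, 1),
            "residual": (ss_resid, n - 2),
            "total": (ss_total, n - 1),
        },
        "F": ss_reg / ms_resid,
        "se_slope": se_slope,
        "t_slope": slope / se_slope,  # tests H0: slope = 0
    }

def slope_ci(result, t_crit):
    """Confidence interval for the slope given a t critical value."""
    half = t_crit * result["se_slope"]
    return result["slope"] - half, result["slope"] + half

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = simple_regression(xs, ys)
ci_low, ci_high = slope_ci(fit, t_crit=3.182)  # df = n - 2 = 3
print("fitted function: y =", fit["intercept"], "+", fit["slope"], "* x")
print("R^2 =", fit["r_squared"], " F =", fit["F"], " t =", fit["t_slope"])
print("95% CI for slope:", (ci_low, ci_high))
```

In simple linear regression the F statistic equals the square of the slope's t statistic, so the ANOVA test and the slope test give the same conclusion.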
Simply submitting your computer output is not acceptable and will receive a grade of 0. You must submit a formal report to begin to get non-zero credit.
Here is a sample report. Keep in mind that this is just a general idea of what the first project should look like. You must not copy it and submit it as your own report with the numbers changed. Such activity is plagiarism, and you will receive a grade of 0.
The objective is to find the model describing the data in Problem A. A simulation program using an unknown linear function was used to generate the data.
In order to solve problem A, we used the statistics package SPSS and the Microsoft Excel spreadsheet program. The original data files were supplied as two data sheets in Excel. One data sheet had the ID of an observation and its associated independent variable value, and the other had the ID and associated dependent variable value. The independent variable data file had a total of 710 independent variable values with ID# ranging from 1 to 729. The dependent variable data file had a total of 690 dependent variable values with ID# ranging from 1 to 730. We first sorted the data in both files in ascending ID# order and then used Excel to merge the files. We next used listwise deletion to remove 40 entries that were missing either the independent variable value or the dependent variable value. Finally, we merged the two files into one file with three columns: ID, IV and DV. There were 670 entries with both values, with ID# ranging from 1 to 729. The data was then imported into SPSS. We assumed a linear regression model for our data, but in order to find a better fit, we also transformed the dependent variable into DV^2 and Sqrt(DV) and the independent variable into IV^2, Sqrt(IV), 1/IV, and ln(IV).
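One way to screen the transformations listed above is to compute r-squared for every pairing of a transformed DV with a transformed IV. The sketch below is an assumption about how such a screen might be coded, not the procedure the report's authors ran in SPSS; the data are made up, and all values must be positive for the square root, reciprocal, and logarithm to be defined.

```python
from math import log, sqrt
from statistics import mean

def r_squared(xs, ys):
    """Squared Pearson correlation, equal to R^2 for a straight-line fit."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Transformations named in the report, keyed by the report's own labels.
IV_TRANSFORMS = {
    "IV": lambda v: v,
    "IV^2": lambda v: v * v,
    "Sqrt(IV)": sqrt,
    "1/IV": lambda v: 1 / v,
    "ln(IV)": log,
}
DV_TRANSFORMS = {
    "DV": lambda v: v,
    "DV^2": lambda v: v * v,
    "Sqrt(DV)": sqrt,
}

def compare_transforms(iv, dv):
    """Return {(dv_label, iv_label): R^2} for every pairing."""
    table = {}
    for d_name, f in DV_TRANSFORMS.items():
        for i_name, g in IV_TRANSFORMS.items():
            table[(d_name, i_name)] = r_squared(
                [g(x) for x in iv], [f(y) for y in dv]
            )
    return table

# Hypothetical data generated from DV = 3 + 2 * ln(IV), without noise.
iv = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dv = [3 + 2 * log(x) for x in iv]

table = compare_transforms(iv, dv)
best = max(table, key=table.get)
print("best pairing:", best, "R^2 =", table[best])
```

R-squared values computed on differently transformed DVs are measured on different response scales, so a screen like this is best treated as a starting point to be checked against residual plots.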
The fitted function for the model Y = B0 + B1 X was DV = 20.966 IV + 2123.719, which explained 99.9% of the variance. The 95% confidence interval for the slope was [20.914, 21.019]. The 95% confidence interval for the intercept was [2068.988, 2178.450]. The analysis of variance table is shown below, and the association between the independent variable and dependent variable was highly significant (p=0.000).
Analysis of Variance Table
DV regressed on IV
Model          Sum of Squares     Df    Mean Square        F            Sig.
1 Regression   25021381100.435    1     25021381100.435    617186.738   .000b
  Residual     27081402.664       668   40541.022
  Total        25048462503.099    669
a. Dependent Variable: DV
b. Predictors: (Constant), IV
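The entries of the table above can be cross-checked with a few lines of arithmetic. The sketch uses only the numbers reported in this sample report. The slope standard error is inferred from the identity t^2 = F for simple linear regression, and the interval uses a normal quantile as an approximation to the t quantile with 668 degrees of freedom; both steps are assumptions of this check, not SPSS output.

```python
from math import sqrt
from statistics import NormalDist

# Values copied from the analysis of variance table.
ss_reg, df_reg = 25021381100.435, 1
ss_res, df_res = 27081402.664, 668
ss_tot, df_tot = 25048462503.099, 669

ms_reg = ss_reg / df_reg      # mean square for regression
ms_res = ss_res / df_res      # mean square for residual
f_stat = ms_reg / ms_res      # F statistic
r_squared = ss_reg / ss_tot   # fraction of variance explained

# Slope reported in the text; its standard error follows from t^2 = F.
slope = 20.966
se_slope = abs(slope) / sqrt(f_stat)

# Approximate 95% interval (normal quantile stands in for t with 668 df).
z = NormalDist().inv_cdf(0.975)
ci_low, ci_high = slope - z * se_slope, slope + z * se_slope

print("MS residual:", ms_res)
print("F:", f_stat)
print("R^2:", r_squared)
print("approximate 95% CI for slope:", (ci_low, ci_high))
```

The sums of squares add up to the total, and the recomputed interval agrees with the reported [20.914, 21.019] to about three decimal places, so the table and the text are mutually consistent.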
For problem A, the association between the independent variable and the dependent variable was highly significant (p=0.000), with 99.9% of the dependent variable variation explained. The plot of residuals versus predicted values confirmed the validity of this model.
Note: For question B, please report the transformation you performed and give the model in its transformed form.
End of Report
Name of Student
The objective is to come up with the function that was used to generate the dependent variable value based on the value of the independent variable in Problem A. A simulation program generated the data.
Both Microsoft Excel and IBM SPSS Statistics 24 were used in Part A. Part A relied on two Excel data sheets containing the original data. One sheet held the ID of each observation and its dependent variable value; the other held the ID and its independent variable value. The independent variable sheet contained 939 values with ID# ranging from 1 to 1000, and 61 values were missing. The dependent variable sheet contained 949 values with 51 missing, and its ID# also ranged from 1 to 1000. Both data files had already been sorted in ascending order of ID#. Microsoft Excel was then used to merge the two files, and the combined data were exported to SPSS. Pairwise deletion was then used to deal with the 112 missing entries across the independent and dependent variables (Weaver and Maxwell, 2014). A total of 888 entries had both dependent and independent variable values, with ID# ranging from 1 to 999. A linear regression model, Y = α + βX, was assumed in finding the function that best fits the data.
For the regression model Y = α + βX, the SPSS output gave the fitted function y = 27426.478 - 0.186x, which explained 81.2% of the variance (coefficient of determination, R^2). The 95% confidence interval for the intercept was (27418.677, 27434.278). Tables 1 and 2 below indicate that the negative relationship between the dependent variable y and the independent variable x was highly significant (p=0.000).
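Analysis of Variance Table
Y regressed on X
[Model summary table; surviving column headers: Adjusted R Square, Std. Error of the Estimate; values not preserved]
a. Predictors: (Constant), x
b. Dependent Variable: y
[Analysis of variance table; surviving column header: Sum of Squares; values not preserved]
a. Dependent Variable: y
b. Predictors: (Constant), x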
Coefficients (a)
                 Unstandardized Coefficients    Standardized Coefficients                       95.0% Confidence Interval for B
Model            B            Std. Error        Beta                        t          Sig.     Lower Bound    Upper Bound
1 (Constant)     27426.478    3.975                                         6900.355   .000     27418.677      27434.278
  x              -.186        .003              -.901                       -61.909    .000     -.192          -.181
a. Dependent Variable: y
Part A indicates a strong relation...