BA240 Team Project: Regression Analysis To Find A Linear Regression Model (Statistics Project Sample)
BA240 Team Project
Find at least THREE quantitative variables: one dependent variable and at least two independent variables. Each variable should contain more than 50 observations in order to satisfy the Normality assumption. Use Regression Analysis to find the linear regression model that best describes the relationship between the variables you chose.
What kind of variables usually work:
• Factors (e.g., Total Cholesterol Level vs. age and weight )
What kind of variables usually don't work:
• Qualitative Data (e.g., Location, Color, Model)
• Mutually Exclusive Variables
• Partial vs. Total (e.g., Perimeter = 2*L + 2*W, so you can't use Perimeter as the dependent variable and L and W as independent variables)
Examples of variable sets that work:
• Total Cholesterol Level vs. Age and Weight
• Field goals percentage vs. Total years played and Average minutes on play per game
• Homicide rate per 100,000 population vs. Population size, % of families with yearly incomes <$5,000, and Unemployment rate
You may not choose the following data sets:
1) Car price vs. engine size, horsepower, mileage, weight, etc.
2) Housing price vs. square footage, age, #bedrooms, etc.
3) Low Birth Weight vs. Mother's Age and Weight at Last Menstrual Period
4) A dataset chosen by another group. First group to get dataset approved gets exclusive use
5) Any other data set used as an example
Tips and Notes
1) When you are doing INDIVIDUAL variable summary statistics (in Part II Data), DO NOT delete "outliers"; you need to keep every single value (unless it's missing). For the Scatter Plots, you also need to use the original data.
2) A Scatter Plot is between TWO variables ONLY (specifically, Y vs. X). So, you need to plot your dependent variable Y against EACH independent variable; that is, you need to produce as many scatter plots as you have independent variables. Remember to add a trend line and show the equation and R-square on each scatter plot. You can calculate the correlation as the square root of R-square; remember to check the SIGN, which matches the sign of the trend line's slope.
3) Parameters are the y-intercept and slopes, so you need to find these estimates (b0, b1, b2, …) in the output and then form your regression equation (the least squares line). When you have multiple independent variables, you need to be careful with the interpretations: interpret the slopes for the independent variables INDIVIDUALLY, which means ONE at a time. When you interpret one of the slopes, remember to hold the others FIXED.
4) Compare your models (original model, model after first outlier removal, and model after second outlier removal) in Part IV (Discussion). Which model is better? Keep in mind: Adjusted R-square is NOT the ONLY factor in making the decision. You also need to check whether or not the model and/or parameters are significant (look at the p-values).
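Tip 2 can be checked with a short sketch. The Python below fits a simple least-squares trend line, computes R-square, and recovers the correlation as the square root of R-square carrying the sign of the slope. The function name and the data are hypothetical, made up only for illustration; they are not part of the assignment.

```python
import math

def signed_correlation(x, y):
    """Return (slope, r_square, r) for a simple least-squares trend line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    r_square = sxy ** 2 / (sxx * syy)
    # R-square alone loses the direction, so the sign is taken from the slope.
    r = math.copysign(math.sqrt(r_square), slope)
    return slope, r_square, r

# Hypothetical data: y tends to fall as x rises, so r should be negative.
x = [1, 2, 3, 4, 5]
y = [9.8, 8.1, 6.2, 3.9, 2.1]
slope, r_square, r = signed_correlation(x, y)
print(f"slope={slope:.3f}, R-square={r_square:.3f}, r={r:.3f}")
```

With a downward-sloping trend line the sketch returns a negative r even though R-square is positive, which is why the sign has to be checked separately.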
Regression Analysis to Find a Linear Regression Model
Date Name Course Name Instructor
The study evaluated the relationship between win percentage in the NBA and two offensive measures: points scored and field goal percentage (FG %). With the availability of NBA statistics, it has become easier to track both overall team efficiency and individual performance. This study analyzes team performance. The assumption is that scoring more points will most likely result in winning, but a team with a weak defense is still likely to struggle to win. Offensive effort is indicated by the points scored and the field goals made.
Data
Data was collected from the website NBA.com. The dependent variable was the Win %, the percentage of games that a team won in an 82-game season. The independent variables were the average points scored in the season and the field goal percentage, FG % (the ratio of field goals made to field goals attempted). A field goal is worth two or three points. The data was collected for the 2015/2016 and 2016/2017 seasons and combined to give 60 observations, as there are thirty teams in the NBA. NBA.com is the official site of the league; its data is corroborated by other sports news outlets, and other sources also rely on it. Only regular-season data was collected for the analysis.
The average win rate was 0.49333 for the 60 observations, while the average points were 104.13 and the average field goal percentage was 45.47 (NBA, 2017). These are the means across all teams for the two seasons, including the worst- and best-performing teams.
Correlation coefficient - r (points) = 0.5538
The value of r is 0.5538. This is a moderate positive correlation, which means that high points tend to go with high Win % (and vice versa).
Correlation coefficient - r (FG %) = 0.6489
The value of r is 0.6489. This is a moderate positive correlation, which means that high FG % tends to go with high Win % (and vice versa).
The multiple correlation coefficient (Multiple R) for points and FG % together is 0.673848.
Modeling
Original model
i. Excel outputs and interpretations for model significance
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.673847726
R Square            0.454070757
Adjusted R Square   0.434915345
Standard Error      0.11460668
Observations        60
The absolute value of the coefficient of correlation in the sample is 0.6738, while the coefficient of determination is 0.454. The y-intercept is -2.711052394, while the coefficient of Pts is -0.008574619 and that of FG% is 0.050981729. These three estimates form the linear regression equation used to predict the dependent variable (the win percentage): Win % = -2.711052394 - 0.008574619(Pts) + 0.050981729(FG%).
ii. Excel outputs and interpretations for parameter estimates
                P-value
Intercept       3.01195E-07
X Variable 1    0.068382462
X Variable 2    0.000238405
The P-value for the intercept is <0.0001, and the result is significant at p<0.05.
The P-value for PTS is 0.068382462, which is not significant at p<0.05.
The P-value for FG% is 0.000238405, which is significant at p<0.05.
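The significance check applied to each parameter can be sketched in a few lines of Python. The p-values are the ones reported in the Excel output; the 0.05 threshold is the one used in the text, and the function and variable names are hypothetical, chosen only for this illustration.

```python
ALPHA = 0.05  # significance level used in the text

# P-values as reported in the Excel output (X Variable 1 = PTS, X Variable 2 = FG%).
p_values = {
    "Intercept": 3.01195e-07,
    "PTS": 0.068382462,
    "FG%": 0.000238405,
}

def is_significant(p_value, alpha=ALPHA):
    """A parameter is treated as significant when its p-value is below alpha."""
    return p_value < alpha

for name, p in p_values.items():
    verdict = "significant" if is_significant(p) else "not significant"
    print(f"{name}: p = {p:.6g} -> {verdict} at alpha = {ALPHA}")
```

Under this rule the intercept and FG% pass the test, while PTS does not once FG% is already in the model.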
iii. Excel output and interpretation for model goodness of fit
The adjusted R-square is 0.434915345. This measure accounts for the degrees of freedom, which R-square does not, so it is lower than R-square. It indicates that 43.5% of the variation in Win % can be explained by the model.
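As a check on the degrees-of-freedom adjustment, the standard formula for adjusted R-square can be applied to the reported R Square with n = 60 observations and k = 2 independent variables. The sketch below is an illustration with a hypothetical function name, not part of the Excel output.

```python
def adjusted_r_square(r_square, n_obs, n_predictors):
    """Adjusted R-square = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values from the summary output: R Square, 60 observations, 2 predictors.
adj = adjusted_r_square(0.454070757, n_obs=60, n_predictors=2)
print(f"Adjusted R-square: {adj:.4f}")
```

The result should agree with the Adjusted R Square in the summary output, and it is always at or below R-square when there is at least one predictor.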
Normal and extreme outliers
Interquartile range (IQR) = 0.5915 - 0.396 = 0.1955
Upper bound = 0.5915 + 1.5(0.1955) = 0.88475
Lower bound = 0.396 - 1.5(0.1955) = 0.10275
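The bound calculation can be reproduced with a small sketch. It uses the quartiles quoted above (Q1 = 0.396, Q3 = 0.5915) and the 1.5 × IQR rule; the 3 × IQR multiplier for "extreme" outliers is a common convention assumed here, not something stated in the report, and the function names are hypothetical.

```python
def iqr_fences(q1, q3, multiplier=1.5):
    """Return (lower, upper) fences: Q1 - m*IQR and Q3 + m*IQR."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def is_outlier(value, lower, upper):
    """A value is flagged when it falls outside the fences."""
    return value < lower or value > upper

# Quartiles of Win % as quoted in the text.
q1, q3 = 0.396, 0.5915
lower, upper = iqr_fences(q1, q3)
# Assumed convention: 3 * IQR fences mark "extreme" outliers.
extreme_lower, extreme_upper = iqr_fences(q1, q3, multiplier=3.0)

win_rate = 0.89  # the first record in the data set
print(f"1.5*IQR fences: ({lower:.5f}, {upper:.5f})")
print("outlier:", is_outlier(win_rate, lower, upper))
print("extreme outlier:", is_outlier(win_rate, extreme_lower, extreme_upper))
```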
So the outlier is the first record, where the win rate is 0.89, which exceeds the upper bound.
Model after outlier removal
i. Excel outputs and interpretations for model signif...