Pages:
3 pages/≈825 words
Sources:
3 Sources
Style:
APA
Subject:
IT & Computer Science
Type:
Coursework
Language:
English (U.S.)
Document:
MS Word
Topic:

Data Understanding and Preparation According to CRISP-DM Methodology

Coursework Instructions:

Prompt: In Milestone Two, you will begin performance on the analytic plan. You will write the Data Understanding and perform the Data Preparation.
While you must reflect on your prior coursework, your submission must consist only of DAT 690 coursework to avoid self-plagiarism. Make sure to include the following critical elements in your paper.
Critical Elements
CRISP-DM Data Understanding Phase: Select data; discuss the types of data in the data set and rationale for inclusion or exclusion
CRISP-DM Data Understanding Phase: Describe descriptive statistics analysis and correlation analysis overall results
CRISP-DM Data Understanding Phase: Discuss descriptive statistics analysis results for selected variables
CRISP-DM Data Preparation Phase: Describe the data steps for preparing and cleaning data, clean data, and data cleaning report
CRISP-DM Data Preparation Phase: Construct data, derived attributes*, generated records; integrate merged data; format/reformat data
Articulation of Response: Submission has no major errors related to citations, grammar, spelling, syntax, or organization

Coursework Sample Content Preview:

Employee Attrition
Author
Affiliation
Course
Instructor
Due Date
Employee Attrition
Data Understanding
Context
GE is keen to retain its key employees because the company has a high churn rate. Losing an employee is estimated to cost GE 80 percent of that employee’s annual income. GE invests heavily in its employees in order to stay competitive in the market, spending long hours and significant financial resources on training and upskilling. GE therefore seeks to build a high-accuracy model that can predict which employees are most likely to churn. The model should be able to evaluate an employee’s profile and give a real-time prediction of the likelihood that the employee will churn. Identifying employees who are likely to churn can help management intervene before losing them. As part of the project lifecycle, we will undertake the understanding and preparation of data according to the CRISP-DM methodology.
Data types
GE’s human resource data had a total of 1,270 rows and 35 columns. Of the 35 columns, 9 were categorical and 26 were numeric. The categorical columns were: 'Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', and 'OverTime'. The numeric columns were: 'Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', and 'YearsWithCurrManager'. Attrition is the dependent variable, with only two categorical values: “yes” and “no”. It distinguishes employees who churned from those who did not. The remaining columns are the independent variables. There were no missing values in the dataset.
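The column split and the missing-value check described above can be sketched in plain Python. This is only an illustration under stated assumptions: the three sample rows and their values are hypothetical stand-ins for the real 1,270-row data set, and the helper names `split_columns` and `count_missing` are invented for this example.

```python
# Hypothetical sample rows; the real data set has 1,270 rows and 35 columns.
rows = [
    {"Age": 30, "Attrition": "Yes", "Department": "Sales",
     "MonthlyIncome": 4000, "OverTime": "Yes"},
    {"Age": 45, "Attrition": "No", "Department": "Human Resources",
     "MonthlyIncome": 6500, "OverTime": "No"},
    {"Age": 38, "Attrition": "No", "Department": "Sales",
     "MonthlyIncome": 5200, "OverTime": "No"},
]


def split_columns(rows):
    """Return (categorical, numeric) column names based on value types."""
    categorical, numeric = [], []
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] is not None]
        is_numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        (numeric if is_numeric else categorical).append(name)
    return categorical, numeric


def count_missing(rows):
    """Count None or empty-string values in each column."""
    return {
        name: sum(1 for row in rows if row[name] is None or row[name] == "")
        for name in rows[0]
    }


categorical, numeric = split_columns(rows)
missing = count_missing(rows)

# 'Attrition' is the dependent variable; every other column is a feature.
target = "Attrition"
features = [name for name in rows[0] if name != target]

print("categorical:", categorical)
print("numeric:", numeric)
print("missing values per column:", missing)
```

Inferring the type from the values rather than hard-coding the lists keeps the split reproducible if columns are later added or dropped.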
Descriptive Statistics
We computed descriptive statistics for the numeric variables; a snippet of the output is attached below. The descriptive statistics reveal no values that were out of range, so we found no reason to conclude that the data contain outliers. However, some variables do not appear to add value to our dataset (Smart Vision Europe, n.d.). These variables include 'EmployeeCount', 'EmployeeNumber', and 'StandardHours'. The descriptive statistics revealed nothing meaningful about these variables. 'EmployeeCount' had a mean, maximum, and minimum of 1 (and therefore a standard deviation of 0), suggesting that its presence in the dataset is insignificant. 'EmployeeNumber' serves as a unique identifier of employees in the dataset, so apart from that, it has no sign...
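One way to flag such columns is sketched below using only the Python standard library. The values are hypothetical (including the figure used for 'StandardHours'), and `describe` and `flag_uninformative` are illustrative helper names, not part of the original analysis. A column whose minimum equals its maximum is constant and has a standard deviation of 0; a column in which every value is distinct is treated as a candidate identifier.

```python
import statistics

# Hypothetical values standing in for four of the numeric columns.
columns = {
    "Age": [41, 49, 37, 41],
    "EmployeeCount": [1, 1, 1, 1],
    "EmployeeNumber": [1, 2, 4, 5],
    "StandardHours": [80, 80, 80, 80],  # assumed value, for illustration only
}


def describe(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def flag_uninformative(columns):
    """Flag constant columns and columns where every value is distinct."""
    flagged = {}
    for name, values in columns.items():
        if min(values) == max(values):
            flagged[name] = "constant"
        elif len(set(values)) == len(values):
            # Heuristic only: a continuous measurement could also be all-distinct.
            flagged[name] = "identifier"
    return flagged


summary = {name: describe(values) for name, values in columns.items()}
flagged = flag_uninformative(columns)

for name, reason in flagged.items():
    print(f"{name}: {reason}, std={summary[name]['std']:.2f}")
```

The all-distinct rule is a heuristic, so a flagged column should still be reviewed before it is dropped.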