University
Exploratory Data Analysis - EDA
Scorecard model
Your Name
Course Name and Number
Professor Name
The Due Date
Word Count: 2570
Exploratory Data Analysis - EDA
Introduction
Exploratory data analysis (EDA) is used to investigate a dataset and summarize its main characteristics, typically with data visualization methods (Camizuli and Carranza, 2018). It helps determine how best to manipulate the data source to obtain the answers that are needed, making it easier to identify patterns, test hypotheses, spot anomalies, and check assumptions (Jebb, Parrigon and Woo, 2017).
EDA also helps the analyst understand the variables in the dataset and the relationships between them. Most importantly, it helps identify the statistical techniques best suited to analysing the dataset. The HMEQ dataset reports the delinquency status and characteristics of 5,960 home equity loans (Gunnink and Burrough, 2019). The dataset is explored below, with an explanation of each step.
The first step is to import the relevant libraries:
# Import the required libraries.
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline
sns.set(color_codes=True)
Load the dataset into the notebook and use the head function to display its first few rows:
from google.colab import files
import io
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['Assignmentdataset__HMEQ.csv']))
df.head()
Checking the data types is important because some variables might have the wrong data type. The data types for the HMEQ dataset are shown below:
# Check the data type
df.dtypes
BAD int64
LOAN int64
MORTDUE float64
VALUE float64
REASON object
JOB object
YOJ float64
DEROG float64
DELINQ float64
CLAGE float64
NINQ float64
CLNO float64
DEBTINC float64
dtype: object
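If a column does turn out to have an unsuitable type, it can be converted with astype. The sketch below is only an illustration under stated assumptions: the small frame named toy borrows four HMEQ column names, but its values are placeholders, and the choice of target types (category for text labels, nullable Int64 for counts) is one reasonable option rather than a step taken in this analysis.

```python
import pandas as pd

# Toy rows that only mimic HMEQ column names; the values are placeholders.
toy = pd.DataFrame({
    "BAD": [1, 0, 0],
    "REASON": ["HomeImp", "DebtCon", "HomeImp"],
    "JOB": ["Other", "Office", "Mgr"],
    "DEROG": [0.0, 1.0, None],
})

# Text columns with a small set of repeated labels can be stored as categoricals.
for col in ["REASON", "JOB"]:
    toy[col] = toy[col].astype("category")

# A count column with gaps is read as float64; the nullable Int64 dtype
# keeps the values as integers while preserving the missing entry.
toy["DEROG"] = toy["DEROG"].astype("Int64")

print(toy.dtypes)
```

Whether such a conversion is worthwhile depends on the later modelling steps; many plotting and summary functions work equally well on the original float64 and object columns.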
Missing Values and Duplicate Handling
To handle duplicates, first check the shape of the dataset and print the number of duplicated rows (Dung and Phuong, 2019). To confirm that the duplicates have been removed, use count, which reports the number of non-null values in each column, before and after removing them.
# count the number of rows
df.count()
# Drop the duplicates
df = df.drop_duplicates()
df.head()
# Count the number of rows after the duplicates have been removed
df.count()
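The shape check and the duplicate count mentioned in the text can be sketched as follows. This is a minimal example on a hypothetical four-row frame named toy (the values are illustrative and not taken from HMEQ); the variable names rows_before, n_duplicates and rows_after are likewise introduced here only for illustration.

```python
import pandas as pd

# Toy frame containing one repeated row; values are illustrative only.
toy = pd.DataFrame({
    "LOAN": [1100, 1300, 1300, 1500],
    "JOB": ["Other", "Office", "Office", "Mgr"],
})

rows_before = toy.shape[0]                    # total rows before cleaning
n_duplicates = int(toy.duplicated().sum())    # rows identical to an earlier row

deduped = toy.drop_duplicates()
rows_after = deduped.shape[0]                 # total rows after cleaning

print(f"rows before: {rows_before}, duplicates: {n_duplicates}, rows after: {rows_after}")
```

Because shape counts every row regardless of missing values, the difference between the two row counts should equal the number of duplicates, which is a more direct check than comparing per-column count output.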
The next step is to identify and handle the missing values in the dataset. Missing values can be handled differently depending on how many there are (Ezzine and Benhlima, 2018). If only a few values are missing, the affected rows can be dropped; if many are missing, a better approach is to replace them with the mean of the corresponding column. Use isnull together with sum and print to display the number of missing values in each column.
# Display the null values
print(df.isnull().sum())
BAD 0
LOAN 0
MORTDUE 518
VALUE 112
REASON 252
JOB 279
YOJ 515
DEROG 708
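The mean replacement described above might look like the sketch below. It runs on a hypothetical frame named toy with made-up values, not on the HMEQ data. Filling the text column with its most frequent value is an assumption added here, since a mean cannot be computed for columns such as REASON and JOB; the source does not state how those columns were treated.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column; values are illustrative only.
toy = pd.DataFrame({
    "VALUE": [100.0, np.nan, 300.0, 200.0],
    "JOB": ["Other", None, "Office", "Other"],
})

# Share of missing values per column, useful for choosing between drop and impute.
missing_share = toy.isnull().mean()

# Numeric columns: replace gaps with the column mean.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Non-numeric columns: replace gaps with the most frequent value (an assumption).
for col in toy.select_dtypes(exclude="number").columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy.isnull().sum())
```

When only a handful of rows are affected, toy.dropna() would be the simpler alternative that the text mentions.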