Automobile data set - EDA & Visualizations


Aveshnee Iyer

27 Jun 2021 • 6 min read

symboling: -3, -2, -1, 0, 1, 2, 3.
normalized-losses: continuous from 65 to 256.
make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
fuel-type: diesel, gas.
aspiration: std, turbo.
num-of-doors: four, two.
body-style: hardtop, wagon, sedan, hatchback, convertible.
drive-wheels: 4wd, fwd, rwd.
engine-location: front, rear.
wheel-base: continuous from 86.6 to 120.9.
length: continuous from 141.1 to 208.1.
width: continuous from 60.3 to 72.3.
height: continuous from 47.8 to 59.8.
curb-weight: continuous from 1488 to 4066.
engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
num-of-cylinders: eight, five, four, six, three, twelve, two.
engine-size: continuous from 61 to 326.
fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
bore: continuous from 2.54 to 3.94.
stroke: continuous from 2.07 to 4.17.
compression-ratio: continuous from 7 to 23.
horsepower: continuous from 48 to 288.
peak-rpm: continuous from 4150 to 6600.
city-mpg: continuous from 13 to 49.
highway-mpg: continuous from 16 to 54.
price: continuous from 5118 to 45400.
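As a minimal sketch of how these attributes map onto a dataframe, the snippet below reads comma-separated rows without a header line using pandas. The `COLUMNS` list simply repeats the attribute names above. The two sample rows, the headerless layout, and the use of `io.StringIO` in place of a real file path are assumptions made for illustration.

```python
import io

import pandas as pd

# Attribute names, in the order listed above.
COLUMNS = [
    "symboling", "normalized-losses", "make", "fuel-type", "aspiration",
    "num-of-doors", "body-style", "drive-wheels", "engine-location",
    "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
    "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
    "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg",
    "price",
]

# Two illustrative rows in the assumed layout (not the full data set).
SAMPLE = (
    "3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,"
    "48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495\n"
    "3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,"
    "48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,16500\n"
)

# header=None because the rows carry no column names of their own.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)

print(df.shape)
print(df[["make", "normalized-losses", "price"]])
```

With a real file, the `io.StringIO(SAMPLE)` argument would be replaced by the file path.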

4. This data set consists of three sections:

The specification of an auto in terms of various characteristics.
Its assigned insurance risk rating.
Its normalized losses in use as compared to other cars.

5. Summary statistics of the dataframe

According to the count, there appear to be no missing values. On closer inspection, however, the missing entries have been recorded as a question mark (?), so they are counted as ordinary values and do not show up as missing.
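The effect can be reproduced on a toy dataframe. This is a small sketch with made-up values, not the article's data: the null count reports zero because "?" is an ordinary string, while an explicit comparison finds the placeholders.

```python
import pandas as pd

# Toy frame (illustrative values) with "?" standing in for missing entries.
df = pd.DataFrame({
    "normalized-losses": ["?", "164", "?", "158"],
    "price": ["13495", "?", "13950", "17450"],
})

# "?" is a normal string, so pandas does not treat it as null.
null_counts = df.isnull().sum()

# Counting the placeholder explicitly reveals the hidden gaps.
placeholder_counts = (df == "?").sum()

print(null_counts)
print(placeholder_counts)
```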

In a normal distribution, about 68% of values lie within one standard deviation of the mean and about 95% lie within two standard deviations. The standard deviation of curb-weight is quite high, which suggests that vehicle weights vary widely.
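The 68% and 95% figures can be checked empirically. The sketch below draws a large simulated normal sample with NumPy (the seed and sample size are arbitrary choices, and the data is synthetic rather than taken from the automobile data set) and measures the share of values within one and two standard deviations of the mean.

```python
import numpy as np

# Synthetic, normally distributed sample (not the automobile data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean = sample.mean()
std = sample.std()

# Share of values within one and two standard deviations of the mean.
within_one = np.mean(np.abs(sample - mean) <= std)
within_two = np.mean(np.abs(sample - mean) <= 2 * std)

print(f"within 1 sd: {within_one:.3f}")
print(f"within 2 sd: {within_two:.3f}")
```

Real columns such as curb-weight are not guaranteed to be normally distributed, so the same check on them may give different shares.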

The upper quartile (sometimes called Q3) is the value that separates the third and fourth quarters of the data. It can also be thought of as the median of the upper half of the numbers. The upper quartile is also called the 75th percentile; it splits the lowest 75% of the data from the highest 25%.
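As a small worked example on made-up numbers, pandas reports the upper quartile through `Series.quantile(0.75)`, and the same figure appears as the `75%` row of `describe()`.

```python
import pandas as pd

# Illustrative values only.
values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])

# Upper quartile (75th percentile), using pandas' default linear interpolation.
q3 = values.quantile(0.75)

# The same number is reported in the summary statistics.
q3_from_describe = values.describe()["75%"]

# Share of observations at or below the upper quartile.
share_at_or_below = (values <= q3).mean()

print(q3, q3_from_describe, round(share_at_or_below, 3))
```

On a sample this small the share at or below Q3 is only approximately 75%, and other quartile conventions (such as the median of the upper half) can give slightly different values.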

Minimum values: the only negative minimum is in symboling, where negative numbers are valid risk ratings. No other column has negative values.

MISSING DATA

1. Visualised missing values:

There appear to be no missing values in the visualization.

2. Actual missing values:

On investigation, the missing values turn out to have been recorded as a question mark (?), which is why they do not show in the missing-value counts above.

To handle them, replace each question mark (?) with a null value, count the missing values per column, and then fill them in with the column's mean or median.

The following columns have missing values and need to be cleaned:

normalized-losses – 41
price – 4
horsepower – 2
bore – 4
stroke – 4
peak-rpm – 2
num-of-doors – 2
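A per-column tally like the one above can be produced by first converting the placeholder to a real null. The sketch below uses a toy dataframe with made-up values; only the "?" convention comes from the data set.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values) using "?" as the missing-value placeholder.
df = pd.DataFrame({
    "make": ["audi", "bmw", "mazda"],
    "bore": ["3.19", "?", "3.03"],
    "num-of-doors": ["four", "two", "?"],
})

# Turn every "?" into NaN so pandas recognises it as missing.
cleaned = df.replace("?", np.nan)

# Count missing values per column and keep only the affected columns.
missing = cleaned.isnull().sum()
print(missing[missing > 0])
```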

3. Process applied for missing data:

a) Normalized-losses:

# Cleaning the NORMALISED LOSSES field

# Find out number of records having 'NaN' value for normalized losses

# Setting the missing value to mean of normalized losses and convert the datatype to integer
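The original code for this step is not shown, so the following is only a sketch of what the comments describe, on made-up values: coerce "?" to NaN, count the gaps, fill them with the column mean, and convert to integer. Rounding before the integer conversion is an added assumption.

```python
import pandas as pd

# Illustrative values; "?" marks a missing normalized-losses entry.
raw = pd.Series(["?", "164", "164", "?", "158"], name="normalized-losses")

# Non-numeric entries such as "?" become NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Number of records with a NaN value.
n_missing = int(numeric.isna().sum())

# Fill with the mean of the known values, then convert to integer.
filled = numeric.fillna(numeric.mean()).round().astype(int)

print(n_missing)
print(filled.tolist())
```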

b) Price:

# Find out the number of values which are not numeric using Boolean

# List out the values which are not numeric

#Setting
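The first two comments above can be sketched as follows on made-up values. The snippet assumes prices are stored as digit-only strings, which is what `str.isnumeric()` tests for; a value with a decimal point would also be flagged. The final step is cut off in the source, so it is not reconstructed here.

```python
import pandas as pd

# Illustrative values; "?" marks a missing price.
price = pd.Series(["13495", "?", "16500", "?"], name="price")

# Boolean mask: True where the string consists only of digits.
is_numeric = price.str.isnumeric()

# Number of values which are not numeric.
n_non_numeric = int((~is_numeric).sum())

# List out the distinct non-numeric values.
non_numeric_values = price[~is_numeric].unique().tolist()

print(n_non_numeric)
print(non_numeric_values)
```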

c) Horsepower:

# Cleaning the HORSEPOWER

# Check for non-numeric values, replace them with the mean value and convert the datatype to integer

# Checking the outlier of horsepower

# Excluding the Outlier data for horsepower
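A possible shape for these steps is sketched below on made-up values. The article does not say how outliers were identified, so the 1.5 × IQR rule used here is an assumption, as are the variable names.

```python
import pandas as pd

# Illustrative values; "?" marks a missing horsepower entry.
df = pd.DataFrame({"horsepower": ["111", "?", "102", "115", "110", "288"]})

# Coerce to numbers, fill gaps with the mean, convert to integer.
hp = pd.to_numeric(df["horsepower"], errors="coerce")
df["horsepower"] = hp.fillna(hp.mean()).round().astype(int)

# Assumed outlier rule: values beyond 1.5 * IQR from the quartiles.
q1 = df["horsepower"].quantile(0.25)
q3 = df["horsepower"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (df["horsepower"] < lower) | (df["horsepower"] > upper)
outliers = df[is_outlier]

# Exclude the outlier rows.
df_no_outliers = df[~is_outlier]

print(outliers["horsepower"].tolist())
print(df_no_outliers["horsepower"].tolist())
```

A box plot of the column is a common visual alternative for the checking step.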

d) Bore

# Cleaning BORE

# Find out the number of invalid values

# Replace non-numeric values with null and convert the datatype

e) Stroke

# Cleaning the STROKE

# Replace non-numeric values with null and convert the datatype

f) Peak RPM

# Cleaning the PEAK RPM

# Convert the non-numeric data to null and convert the datatype
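Bore, stroke and peak-rpm are described with the same treatment, so one loop can cover all three. This is a sketch on made-up values, using `pd.to_numeric` with `errors="coerce"` as one way to turn non-numeric entries into nulls; the original code is not shown.

```python
import pandas as pd

# Toy frame (illustrative values) with "?" placeholders.
df = pd.DataFrame({
    "bore": ["3.47", "?", "3.19"],
    "stroke": ["2.68", "3.40", "?"],
    "peak-rpm": ["5000", "?", "5500"],
})

invalid_counts = {}
for col in ["bore", "stroke", "peak-rpm"]:
    # Non-numeric entries become NaN; the column becomes float.
    converted = pd.to_numeric(df[col], errors="coerce")
    # Number of invalid (non-numeric) values found in this column.
    invalid_counts[col] = int(converted.isna().sum())
    df[col] = converted

print(invalid_counts)
print(df.dtypes)
```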

g) Number of doors

# Cleaning the num-of-doors data

# Remove the records that have the value '?'
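Because num-of-doors is categorical, a mean cannot be used and the affected rows are dropped instead. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy frame (illustrative values); "?" marks an unknown door count.
df = pd.DataFrame({
    "make": ["dodge", "mazda", "audi", "bmw"],
    "num-of-doors": ["?", "four", "two", "?"],
})

# Keep only the records whose num-of-doors is known.
df_clean = df[df["num-of-doors"] != "?"].reset_index(drop=True)

print(df_clean)
```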

DATA STORIES AND VISUALIZATIONS

1. Vehicle make frequency:

Toyota is by far the most frequent make in the data set, at almost 40%. Nissan is the second most frequent. Mercedes is the least frequent, probably because it is a more niche brand.
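Frequencies like these can be computed with `value_counts`. The sketch below uses a made-up list of makes, so its numbers are purely illustrative and do not reproduce the figures quoted above.

```python
import pandas as pd

# Made-up sample of makes, for illustration only.
makes = pd.Series(["toyota"] * 3 + ["nissan"] * 2 + ["audi"], name="make")

# Absolute frequency per make, most frequent first.
counts = makes.value_counts()

# Relative frequency as a percentage of all records.
shares = makes.value_counts(normalize=True).mul(100).round(1)

print(counts)
print(shares)
```

Calling `counts.plot(kind="bar")` would draw the corresponding bar chart when matplotlib is installed.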