
Last Updated Sep 26, 2025

Outliers & Cleaning Data Simplified Revision Notes

Revision notes with simplified explanations to understand Outliers & Cleaning Data quickly and effectively.


2.3.1 Outliers & Cleaning Data

Outliers are data points that are significantly different from the rest of the data. They can either be much higher or much lower than the other values in the data set. Outliers can affect statistical analyses by skewing results, so it's important to identify and handle them appropriately.

Identifying Outliers

Box Plots:

One of the simplest ways to identify outliers is by using a box plot. In a box plot:

  • Any data point outside of the whiskers (more than 1.5 times the interquartile range above Q3 or below Q1) is considered an outlier.

Z-Scores:

The Z-score measures how many standard deviations a data point is from the mean. Data points with a Z-score greater than 3 or less than -3 are often considered outliers.
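As a rough sketch, the Z-score check could be written in Python as follows. The function names `z_scores` and `z_outliers` are illustrative, the population standard deviation is assumed, and the sample values are only there to show the idea.

```python
from statistics import mean, pstdev


def z_scores(data):
    """Return the Z-score of each value, using the population standard deviation."""
    centre = mean(data)
    spread = pstdev(data)  # assumption: population (not sample) standard deviation
    return [(x - centre) / spread for x in data]


def z_outliers(data, threshold=3.0):
    """Return the values whose Z-score is more than `threshold` away from zero."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]


values = [2, 5, 7, 8, 10, 12, 15, 18, 20, 100]  # illustrative values
print(z_outliers(values, threshold=2.0))
print(z_outliers(values, threshold=3.0))
```

In a data set this small, the value 100 has a Z-score of about 2.94, so it is flagged with a threshold of 2 but not with a threshold of 3. The choice of threshold matters.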

Interquartile Range (IQR) Method:

  • Calculate the IQR: $\text{IQR} = Q3 - Q1$
  • Find the lower bound: $Q1 - 1.5 \times \text{IQR}$
  • Find the upper bound: $Q3 + 1.5 \times \text{IQR}$
  • Any data point outside this range is considered an outlier.

Example: Consider the data set:

2, 5, 7, 8, 10, 12, 15, 18, 20, 100

  • Q1 (Lower Quartile) = 6.5
  • Q3 (Upper Quartile) = 17
  • IQR = $17 - 6.5 = 10.5$
  • Lower Bound = $6.5 - (1.5 \times 10.5) = -9.25$ (no outliers on the lower side)
  • Upper Bound = $17 + (1.5 \times 10.5) = 32.75$

Here, 100 is an outlier because it is greater than 32.75. Quartile values can differ slightly between calculation methods, but 100 lies far above the upper bound under any of them.
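The bounds in this example can be checked with a short Python sketch. The helper name `iqr_fences` is illustrative, and the quartiles are passed in as given in the example rather than recomputed, because quartile conventions vary.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) bounds: q1 - k * IQR and q3 + k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [2, 5, 7, 8, 10, 12, 15, 18, 20, 100]

# Quartiles taken from the worked example, not recomputed here.
lower, upper = iqr_fences(q1=6.5, q3=17)
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)
```

Keeping the quartiles as inputs means the same helper works with whichever quartile method a calculator or exam board uses.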

Cleaning Data

Cleaning data involves addressing issues in the data, such as outliers, missing values, or errors, to ensure accurate analysis.

Steps for Cleaning Data:

  1. Identify and Handle Outliers
  2. Handle Missing Data
  3. Correct Errors
  4. Standardise and Normalise Data

  1. Identify and Handle Outliers:
  • Remove Outliers: If an outlier is due to an error or doesn't belong to the dataset context, it may be removed.
  • Transform Data: Sometimes, transforming the data (e.g., using logarithms) can reduce the impact of outliers.
  • Use Robust Statistics: Instead of the mean, use the median, which is less sensitive to outliers.
  2. Handle Missing Data:
  • Remove Missing Data: If the data set is large, you can remove rows or columns with missing values.
  • Impute Missing Data: Replace missing data with a reasonable estimate, such as the mean, median, or mode of the remaining data.
  • Use Algorithms: Advanced methods like k-nearest neighbours or regression models can predict and fill in missing values.
  3. Correct Errors:
  • Typographical Errors: Correct any data entry errors, like typos or incorrect values.
  • Consistency Checks: Ensure data is consistent across the data set. For instance, if a person's age is entered as 200, it's likely an error.
  4. Standardise and Normalise Data:
  • Standardisation: Adjust data to have a mean of 0 and a standard deviation of 1, useful for methods that are sensitive to the scale of the variables.
  • Normalisation: Scale data to a range, usually between 0 and 1, which is useful when comparing different data sets.
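Some of these steps can be sketched in a few lines of Python. This is a minimal illustration, not a full cleaning pipeline: the function names and the `raw` data are hypothetical, missing values are assumed to be stored as `None`, and the population standard deviation is assumed for standardisation.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace missing entries (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def standardise(values):
    """Rescale so the result has mean 0 and (population) standard deviation 1."""
    centre, spread = mean(values), pstdev(values)
    return [(v - centre) / spread for v in values]


def normalise(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


raw = [4, None, 10, 6, 8]    # hypothetical data with one missing value
filled = impute_median(raw)  # the known values 4, 6, 8, 10 have median 7
print(filled)
print(standardise(filled))
print(normalise(filled))
```

The median is used for imputation here because it is less sensitive to outliers than the mean. The mean or mode could be substituted where appropriate.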

Example: Cleaning Data

Suppose you have the following data set of test scores:

85, 90, 95, 100, 110, 700


Step 1: Identify Outliers:

The score 700 is an outlier.


Step 2: Handle the Outlier:

Investigate the cause. If it's a data entry error, correct or remove it.


Step 3: Handle Missing Data:

If you had a missing value in the test scores, you might replace it with the mean score or use another method.


Step 4: Check for Consistency:

Ensure all scores are within the expected range (0 to 100). Here, 110 also falls outside this range and should be checked.
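The range check in this example, and the effect of the suspect values on the mean and median, can be sketched in Python. Setting the flagged scores aside is only one possible way of handling them. In practice each one should be investigated first.

```python
from statistics import mean, median

scores = [85, 90, 95, 100, 110, 700]

# Consistency check: flag anything outside the expected 0-100 range.
suspect = [s for s in scores if not 0 <= s <= 100]
valid = [s for s in scores if 0 <= s <= 100]

print(suspect)                       # values to investigate
print(mean(scores), median(scores))  # the mean is pulled up by 700, the median much less so
print(mean(valid), median(valid))    # summary with the suspect values set aside
```

Comparing the two summaries shows why the median is described as a robust statistic: it barely moves when the extreme score is included, while the mean changes a great deal.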

Cleaning your data ensures that your analysis is based on accurate and reliable data, leading to more trustworthy results.

In the worked examples below, an outlier is an item of data that lies more than:

  • 2 standard deviations from the mean, or
  • 1.5 interquartile ranges from the median.


Example: Cleaning Data with Standard Deviation

Let's go through a detailed example to understand how to formally identify outliers using standard deviation.

Question: Using standard deviation, formally identify any outliers in the following set: 1.3, 2.4, 6.7, 2.8, 3.9, 0.1


Step 1: Calculate the Mean and Standard Deviation

Using a calculator, we can find the mean and standard deviation of the data set:

  • Mean ($\bar{x}$) = 2.867
  • Standard deviation ($\sigma$) = 2.085

Step 2: Determine the Outlier Boundaries

Outliers are defined as data points that lie outside two standard deviations from the mean.

We calculate the boundaries for the outliers using the formula: $\bar{x} \pm 2\sigma$

Substitute the values of the mean ($\bar{x}$) and standard deviation ($\sigma$):

$\bar{x} \pm 2\sigma = 2.867 \pm 2 \times 2.085 = 2.867 \pm 4.17$

Thus, the boundaries for outliers are $-1.303$ and $7.037$.


Step 3: Identify the Outliers

Any data points that fall outside the range $[-1.303, 7.037]$ are considered outliers.

Checking the data set:

1.3, 2.4, 6.7, 2.8, 3.9, 0.1

All of these values lie within the range $[-1.303, 7.037]$, so there are no outliers in this data set.


Explanation:

Since no data points lie outside the boundaries of $[-1.303, 7.037]$, we conclude that this data set has no outliers.
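The same calculation can be reproduced with Python's standard library. This is a sketch that assumes the population standard deviation, which matches the value of 2.085 quoted in Step 1.

```python
from statistics import mean, pstdev

data = [1.3, 2.4, 6.7, 2.8, 3.9, 0.1]

x_bar = mean(data)
sigma = pstdev(data)  # population standard deviation

# Boundaries: two standard deviations either side of the mean.
lower = x_bar - 2 * sigma
upper = x_bar + 2 * sigma
outliers = [x for x in data if x < lower or x > upper]

print(round(x_bar, 3), round(sigma, 3))
print(round(lower, 3), round(upper, 3))
print(outliers)
```

Because the code works with unrounded values, the boundaries may differ from the hand calculation in the last decimal place. The conclusion is unaffected.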



Example: Identifying Outliers using the IQR

Let's go through a detailed example to understand how to identify outliers using the Interquartile Range (IQR).

Question: Using the same data set as before, identify any outliers using the IQR method: 1.3, 2.4, 6.7, 2.8, 3.9, 0.1


Step 1: Calculate the Median and IQR

The median and quartiles can be found using a calculator:

  • Median ($\text{Med}$) = 2.6
  • Lower quartile ($Q_1$) = 1.3
  • Upper quartile ($Q_3$) = 3.9

Thus, the IQR (Interquartile Range) is:

$\text{IQR} = Q_3 - Q_1 = 3.9 - 1.3 = 2.6$

Step 2: Determine the Outlier Boundaries

In this example, outliers are defined as any data points that lie more than 1.5 times the IQR from the median. (The quartile-based rule described earlier, which measures 1.5 times the IQR above $Q_3$ or below $Q_1$, would instead give the boundaries $[-2.6, 7.8]$ and flag no outliers here.)

We calculate the boundaries for outliers using the formula:

$\text{Med} \pm 1.5 \times \text{IQR}$

Substitute the values:

$\text{Outlier boundaries} = 2.6 \pm 1.5 \times 2.6 = 2.6 \pm 3.9$

Thus, the outlier boundaries are:

$[-1.3, 6.5]$


Step 3: Identify the Outliers

Any data points that fall outside the range $[-1.3, 6.5]$ are considered outliers.

Checking the data set:

1.3, 2.4, 6.7, 2.8, 3.9, 0.1

The value 6.7 lies outside this range (greater than 6.5), so 6.7 is an outlier.


Explanation:

Since 6.7 lies outside the interval $[-1.3, 6.5]$, we can conclude that 6.7 is an outlier in this data set.
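A short Python sketch of this example follows. The `quartiles` helper is illustrative: it assumes the quartiles are the medians of the lower and upper halves of the sorted data, which reproduces the values in Step 1, and a calculator using a different convention may give slightly different quartiles.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3), with Q1 and Q3 as the medians of the lower and upper halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]  # leaves out the middle value when the count is odd
    return median(lower_half), median(ordered), median(upper_half)


data = [1.3, 2.4, 6.7, 2.8, 3.9, 0.1]
q1, med, q3 = quartiles(data)
iqr = q3 - q1

# Rule used in this example: more than 1.5 x IQR away from the median.
lower = med - 1.5 * iqr
upper = med + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(q1, med, q3, outliers)
```

Replacing `med` with `q1` in the lower bound and `q3` in the upper bound gives the quartile-based bounds from the IQR method described earlier in these notes.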
