Data Cleaning With Microsoft Excel.

Modu Daniel
4 min readSep 28, 2022

--

This article is meant to explain how i used Microsoft Excel to convert dirty structured data into clean data so insights can be generated easily.
The project is guided by my scope of work which i would share below. The scope of work is an outline of all the activities i did in my data cleaning project including the deliverables, milestones, timeline. I intend to use this article as the final aspect of the scope of work, Report.

Following the scope of work, the first thing i had to do was collect the data I wanted to clean and being specific, the data i needed was an healthcare dataset which i got from ForesightBI ( name of dataset: Hospital data with mixed numbers and characters).
The scope of work showed the specificity of the project and what the project did not include.

After the data collection, i could then start cleaning the data. This was quite a simple dataset to clean as i didn’t use more of complex functions but i had to do more of observations to identify some mistakes in the data entry.
Before i started the cleaning, I saved the unclean data as two different documents with different names so incase i made a mistake in my cleaning and I needed the dataset in its default form, i would just refer to my backup.

To start my cleaning, I added filter buttons to all the column labels.
Select all the column headers > home tab > Editing ribbon > sort and filter > click filter.

Next, I selected all the data and removed duplicates. To do that, you select all the data > click data tab > click Remove duplicates on the data tools ribbon > select all columns > ok.

After removing duplicates, i started checking each columns individually for errors using the filter button at the top.
The first column (Description) took the most time as the mistake were in the wordings and measurement values and required careful observation. For example, Under the description;
- 5% Dextrose Saline 500mls was a mistake during data entry. the intended description was 5% Dextrose Saline 500ml. (same thing for the 5% Dextrose Water 500ml).
- Another mistake was the use of abbreviations and then full words when representing some drugs. Like Inj Vit. K and Inj. Vitamin K meant the same thing so the error just allowed for unnecessary multiple entries.
- Under the description, some outliers were also found like name of pharmacist, signature and phone number which were out of context as the column was meant for names of drugs.
- other mistakes were in spellings. like the Syr Brocholite (correct: Broncholyte), hand glove and hand gloves e.t.c.

For other columns that contained numerical data, the mistakes were just interchanging numbers with letters. for instance, 1 was replaced with i, 0 was replaced with O, S was replaced with 5. I corrected those minor errors and the data was clean and ready for use.

Recall, the project just focused on data cleaning not using the cleaned data for analysis.

Thank you for reading.

--

--

Modu Daniel
Modu Daniel

Written by Modu Daniel

Data 📈 | Numbers 1️⃣ | Medicine⚕️ | Redamancy ❣️ | 80/20 💡. E-mail: modudharniel16@gmail.com

Responses (1)