{"id":178,"date":"2023-09-07T16:07:25","date_gmt":"2023-09-07T16:07:25","guid":{"rendered":"https:\/\/editor.mediahack.co.za\/databites\/?p=178"},"modified":"2025-11-17T17:41:35","modified_gmt":"2025-11-17T17:41:35","slug":"hands-on-5-reasons-to-switch-to-openrefine-to-clean-data","status":"publish","type":"post","link":"https:\/\/outliereditor.co.za\/index.php\/2023\/09\/07\/hands-on-5-reasons-to-switch-to-openrefine-to-clean-data\/","title":{"rendered":"5 reasons to switch to OpenRefine to clean data"},"content":{"rendered":"\n<p id=\"6452\"><strong>By Gemma Gatticchi<\/strong><\/p>\n\n\n\n<p id=\"6452\">I know this article is about\u00a0<a href=\"https:\/\/openrefine.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">OpenRefine<\/a>\u00a0(previously\u00a0<a href=\"https:\/\/code.google.com\/archive\/p\/google-refine\/\" target=\"_blank\" rel=\"noreferrer noopener\">Google Refine<\/a>), but in order to get there, I need to start with Google Sheets.<\/p>\n\n\n\n<p id=\"59da\">For years I used Google Sheets to work with my data, mostly because I believed that, besides Microsoft Excel, it was my only option. I mostly built spreadsheets using Google Sheets, so it made perfect sense to use the same program to edit, filter and compare my data.<\/p>\n\n\n\n<p id=\"797d\">That was until I learnt about OpenRefine. The OpenRefine&nbsp;<a href=\"https:\/\/openrefine.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">website<\/a>&nbsp;describes the application as &#8216;a powerful tool for working with messy data: cleaning it; transforming it from one format into another, and extending it with web services and external data&#8217;. And that is exactly what it does, leaping past some of the basic limitations of both Excel and Google Sheets.<\/p>\n\n\n\n<p id=\"a1da\">Here are my top 5 reasons to start using OpenRefine. You&#8217;ll never look back:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Consolidating different spreadsheets<\/h2>\n\n\n\n<p>Your data isn\u2019t always in the same spreadsheet. I was working on a project about grade 12 school results and had split the data into different spreadsheets, according to the provinces the schools were in. <\/p>\n\n\n\n<p>If I wanted to consolidate the dataset in Google Sheets, I would probably have had to use the tedious copy-and-paste method, which leaves room for plenty of errors. My dataset was more than 50,000 rows long and I knew that might cause problems.<\/p>\n\n\n\n<p>I was pleasantly surprised when I realised this could simply be done in OpenRefine, as long as the column headings were identical. I did this by:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Downloading the different datasets in the same format, saving them all into one zipped folder<\/li>\n\n\n\n<li>Creating a new project by importing the zipped folder into OpenRefine, where I found all my consolidated data in one place. (Although OpenRefine does not explicitly limit the number of spreadsheets you can add to a zipped folder, keep an eye on the size of the files. I have zipped together and uploaded as many as nine spreadsheets at once without any issues.)<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">2. It&#8217;s a data clean-up master<\/h2>\n\n\n\n<p>OpenRefine\u2019s <strong>&#8216;Cluster and edit&#8217; tool<\/strong> is one of my favourites! To use it, click the arrow next to your heading of choice, select &#8216;Edit cells&#8217; from the dropdown menu and then select &#8216;Cluster and edit&#8217;. This tool groups certain rows that likely have the same name, but have minor differences in spelling, spacing or capitalisation. <\/p>\n\n\n\n<p>The application suggests options of what the spelling could look like. You can choose the one you prefer, or you can change one cluster\u2019s spelling by typing in your own option.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/1.webp\" alt=\"\" class=\"wp-image-179\"\/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/2.webp\" alt=\"\" class=\"wp-image-180\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">3. Mistakes aren\u2019t set in stone<\/h2>\n\n\n\n<p>OpenRefine keeps track of every move you make, much like the Google Sheets &#8216;version history&#8217; tool. When you\u2019re working with a big dataset in Google Sheets and want to restore an earlier version of your document, it can take several minutes to load, and may even cause your system to crash. In OpenRefine, however, each edit is tracked in the left-hand panel under the <strong>&#8216;Undo \/ Redo&#8217; section<\/strong> and you can easily and quickly switch to a previous version of your work.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/3.webp\" alt=\"\" class=\"wp-image-181\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">4. Filtering is a breeze<\/h2>\n\n\n\n<p>Unlike Excel and Google Sheets, you do not have to enable the filtering feature when you use OpenRefine. The filtering option is automatically on when you create your project. <\/p>\n\n\n\n<p>OpenRefine also offers different methods of filtering your data:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The&nbsp;<strong>text facet&nbsp;<\/strong>option automatically groups all similar values under a specific column and gives you the number of entries under each value<\/li>\n\n\n\n<li>The <strong>text filter&nbsp;<\/strong>allows you to filter your data for specific words, numbers or phrases that you can type in yourself<\/li>\n<\/ul>\n\n\n\n<p>All the filtering occurs comfortably on the left-hand side of the screen, where you can easily select and deselect values. This also makes it easy to filter for multiple columns or variables at once.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/4.webp\" alt=\"\" class=\"wp-image-182\"\/><figcaption class=\"wp-element-caption\">Text facet<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/5.webp\" alt=\"\" class=\"wp-image-183\"\/><figcaption class=\"wp-element-caption\"><br>Text filter<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">5. OpenRefine works offline<\/h2>\n\n\n\n<p>If you usually use Google Sheets, then you\u2019ll be familiar with the frustration that comes with a weak internet connection. The quality of your connection directly affects the ability to work. This is another reason why OpenRefine is a great option. Once downloaded, OpenRefine runs like a desktop application. While it opens in your web browser, it works entirely offline.<\/p>\n\n\n\n<p id=\"95d7\">Excel and Google Sheets are among the most popular options for working with data, but there is no need to feel chained to these programs when better options exist.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Notebook<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/openrefine.org\/download.html\">Download OpenRefine<\/a> \u2013 it&#8217;s free<\/li>\n\n\n\n<li>Watch: The Outlier&#8217;s three-part <a href=\"https:\/\/www.youtube.com\/@OutlierAfrica\">OpenRefine series on YouTube<\/a><\/li>\n\n\n\n<li>Learn: Find out about our <a href=\"https:\/\/learning.theoutlier.co.za\/\">training courses<\/a><\/li>\n\n\n\n<li>Sign up to <a href=\"https:\/\/theoutlier.co.za\/newsletter\" data-type=\"link\" data-id=\"https:\/\/theoutlier.co.za\/newsletter\">The Outlier newsletter<\/a> for more on data communication training, tips and data-driven insights<\/li>\n<\/ul>\n\n\n\n<p><em>Note: This post was updated on 25 November 2024. Questions or comments? Email us on <a href=\"mailto:hello@theoutlier.co.za\">hello@theoutlier.co.za<\/a><\/em><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>It has a pretty steep learning curve but it&#8217;s definitely worth the effort to learn OpenRefine if you need to clean large amounts of very messy data.<\/p>\n","protected":false},"author":3,"featured_media":86677,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[447,448,1387],"tags":[473,474,449,477,480,485],"newsletter-post":[],"site":[],"class_list":["post-178","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-databites","category-how-to","category-the-outlier","tag-cleaning-data","tag-data-wrangling","tag-how-to","tag-messy-data","tag-openrefine","tag-working-with-data"],"acf":{"big_number":"","big_number_caption":"","big_number_link":"","big_number_background":"","big_number_text_colour":"#000000","big_number_icon":false,"big_number_wide":"yes","featured_chart":false,"flourish_chart_id":"","flourish_sub_title":"","flourish_chart_width":"medium","is_newsletter_post":"No","post_style":"bc","show_on_front":"Yes","link_through":"Yes","chart_url":"","background_colour":"#0089AA","text_colour":"#FFFFFF"},"_links":{"self":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/178","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/comments?post=178"}],"version-history":[{"count":3,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/178\/revisions"}],"predecessor-version":[{"id":89814,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/178\/revisions\/89814"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/media\/86677"}],"wp:attachment":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/media?parent=178"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/categories?post=178"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/tags?post=178"},{"taxonomy":"newsletter-post","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/newsletter-post?post=178"},{"taxonomy":"site","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/site?post=178"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}