{"id":89888,"date":"2024-12-05T20:59:42","date_gmt":"2024-12-05T20:59:42","guid":{"rendered":"https:\/\/outliereditor.co.za\/?p=89888"},"modified":"2025-11-17T17:38:23","modified_gmt":"2025-11-17T17:38:23","slug":"openrefine-part-3-fixing-inconsistencies-with-the-cluster-and-edit-tool","status":"publish","type":"post","link":"https:\/\/outliereditor.co.za\/index.php\/2024\/12\/05\/openrefine-part-3-fixing-inconsistencies-with-the-cluster-and-edit-tool\/","title":{"rendered":"OpenRefine Part 3: Fixing inconsistencies with the cluster and edit tool"},"content":{"rendered":"\n<p>This is the third and final part of our series on OpenRefine, The Outlier team\u2019s favourite tool for cleaning large datasets. OpenRefine is a game-changer when it comes to managing messy data, especially for larger datasets where spreadsheet programs often struggle to keep up.<\/p>\n\n\n\n<p>Here\u2019s what Laura Grant, our managing editor, has to say about it:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>It took Alastair two years to convince me to use OpenRefine\u2026 My conversion happened when I had a CSV with more than 26,000 school names that was full of typos. The facets and clustering that I dismissed as too complicated saved me hours of time. Since then, I\u2019ve tackled lists of government invoices where information was manually entered by multiple people who added their own twist to spellings of company names and products bought. Now, I can\u2019t understand why I resisted using it for so long.<\/p>\n<\/blockquote>\n\n\n\n<p>If you\u2019re still on the fence about using OpenRefine or feel intimidated by its features, this tutorial will walk you through two of its most powerful tools: <strong>clustering<\/strong> and <strong>editing<\/strong>. These tools will help you clean and standardise your dataset, saving hours of work. Once you give it a go, we can practically guarantee that you won&#8217;t look back.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Watch the video<\/h2>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"OpenRefine Part 3: Clustering and editing\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/bnrPUr1lb-8?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>Let&#8217;s get started!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Correcting data using text facets<\/h2>\n\n\n\n<p>A facet works a lot like a filter. Faceted browsing allows you to explore and filter data dynamically. Clicking on the dropdown arrow in the column header, select <strong>Facet &gt; Text facet<\/strong>. This will display all the unique values in the column \u2013 and will help you identify errors.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"358\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/1-OR-p3-e1733430556704.png\" alt=\"\" class=\"wp-image-89889\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/1-OR-p3-e1733430556704.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/1-OR-p3-e1733430556704-300x107.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/1-OR-p3-e1733430556704-768x275.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p>For instance, we can see that someone has mistakenly input 2017 instead of 2022. Click on the 2017 in the facet box in the left panel. This will isolate the row(s). Hover over the incorrect cell. Click <strong>Edit<\/strong>, change it to 2022, and click <strong>Apply<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"222\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/2-OR-p3-e1733430656919.png\" alt=\"\" class=\"wp-image-89890\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/2-OR-p3-e1733430656919.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/2-OR-p3-e1733430656919-300x67.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/2-OR-p3-e1733430656919-768x170.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p>To go back to the full dataset, remove the filtering or facet by clicking on the x on the facet box in the left panel.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Using text filters to edit specific entries<\/h2>\n\n\n\n<p>Text filters are helpful when you know there\u2019s a typo \u2013 a misspelling or formatting error \u2013 you want to correct.<\/p>\n\n\n\n<p>To apply a text filter, select the column where the typo might exist. Click on the dropdown arrow and choose <strong>Text filter<\/strong>.<\/p>\n\n\n\n<p>Now find and fix typos. In our example, the province name \u2018Free State\u2019 has been entered as \u2018free state\u2019 in some instances. Type the correct formatting or spelling into the text filter, and press <strong>Apply<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"401\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-FreeState-Pt3-1024x401.png\" alt=\"\" class=\"wp-image-89894\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-FreeState-Pt3-1024x401.png 1024w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-FreeState-Pt3-300x118.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-FreeState-Pt3-768x301.png 768w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-FreeState-Pt3-1536x602.png 1536w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-FreeState-Pt3.png 1888w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To edit one occurrence: Hovering over the cell. Click <strong>Edit<\/strong>, change it to \u2018Free State\u2019, and click <strong>Apply<\/strong>.<\/li>\n\n\n\n<li>To edit <strong>all matching cells at once:<\/strong> Use the option to <strong>Apply to all identical cells<\/strong> and make batch corrections.<\/li>\n<\/ul>\n\n\n\n<p>Close the filtering by clicking the x in the lefthand panel.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Replacing values in bulk<\/h2>\n\n\n\n<p>It\u2019s pretty straightforward to standardise inconsistencies such as mixed abbreviations. In our example, we want to replace \u2018SS\u2019 in the centre names to \u2018Secondary School\u2019. <\/p>\n\n\n\n<p>Here\u2019s how:<\/p>\n\n\n\n<p>Click the column\u2019s dropdown arrow. Select <strong>Edit cells<\/strong> > <strong>Transform<\/strong> > <strong>Replace.<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"563\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/3-OR-p3.png\" alt=\"\" class=\"wp-image-89892\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/3-OR-p3.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/3-OR-p3-300x169.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/3-OR-p3-768x432.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p>To replace all occurances of \u2018SS\u2019 with \u2018Secondary School&#8217;, enter &#8216;SS&#8217; as the value to find and <strong>&#8216;Secondary School&#8217;<\/strong> as the replacement text. I&#8217;m using all caps here as the dataset uses caps.<\/p>\n\n\n\n<p>Check the boxes for <strong>whole word<\/strong> and make sure <strong>case insensitive<\/strong> is NOT checked. Click <strong>OK<\/strong> to apply the changes across all rows.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"563\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/4-OR-p3-1.png\" alt=\"\" class=\"wp-image-89895\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/4-OR-p3-1.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/4-OR-p3-1-300x169.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/4-OR-p3-1-768x432.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Clustering to detect and fix similar entries<\/h2>\n\n\n\n<p><strong>What is clustering?<\/strong> Inconsistencies in your data \u2013 such as variations in spelling, typos or different formats for the same information \u2013 can wreak havoc on your analysis. But mastering the clustering feature will keep you on the straight and narrow.<\/p>\n\n\n\n<p>Clustering is a tool that identifies variations in similar data entries, allowing you to group and merge them into a single, consistent format. It\u2019s particularly useful for:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Correcting typos and spelling variations. Quickly spot and standardise inconsistencies.<\/li>\n\n\n\n<li>Cleaning up data entered manually. Minimise errors from human input.<\/li>\n\n\n\n<li>Saving<strong> <\/strong>time. Automate tasks that would otherwise require hours of manual editing.<\/li>\n<\/ul>\n\n\n\n<p>In our example dataset that includes province names, there are entries like \u2018Free State\u2019, \u2018Free States\u2019, \u2018free state\u2019 and even \u2018FSS\u2019. Clustering can detect these as related entries and help you standardise them in just a few clicks.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to use clustering<\/h2>\n\n\n\n<p>First, <strong>cluster your data<\/strong>: Click the dropdown arrow in the relevant column (eg, province). Select <strong>Edit cells<\/strong> > <strong>Cluster and edit<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"563\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/5-OR-p3.png\" alt=\"\" class=\"wp-image-89896\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/5-OR-p3.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/5-OR-p3-300x169.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/5-OR-p3-768x432.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p><strong>Choose a clustering method<\/strong>: In the clustering dialogue, you\u2019ll see options for:<\/p>\n\n\n\n<p>&#8211; <strong>Key collision<\/strong>. These are groups that are likely to have been typed differently but are intended to be the same \u2013 eg FS vs FSS.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"563\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/6-OR-p3.png\" alt=\"\" class=\"wp-image-89897\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/6-OR-p3.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/6-OR-p3-300x169.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/6-OR-p3-768x432.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p>&#8211; <strong>Nearest neighbor<\/strong>. This focuses on similar-sounding or visually similar entries \u2013 eg Free State vs Free States.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"232\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/7-OR-p3-e1733432054690.png\" alt=\"\" class=\"wp-image-89898\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/7-OR-p3-e1733432054690.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/7-OR-p3-e1733432054690-300x70.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/7-OR-p3-e1733432054690-768x178.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p>For most datasets, the <strong>Key collision<\/strong> method with the <strong>Metaphone 3<\/strong> keying function works best. It\u2019s considered the most advanced option for identifying related entries.<\/p>\n\n\n\n<p>Now review suggestions, such as FS vs FSS. Select the correct value (eg, FS) and click <strong>Merge selected and close<\/strong>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>Merge selected and close<\/strong> when you are sure that the clusters currently displayed are sufficient for your cleaning task.<\/li>\n\n\n\n<li>Use <strong>Merge selected and re-cluster<\/strong> if the dataset has many related or cascading inconsistencies that might not all appear in the initial clustering.<\/li>\n<\/ul>\n\n\n\n<p><strong>Verify the changes<\/strong> by using a <strong>text filter<\/strong> to confirm that the incorrect value no longer exists \u2013 eg filter for &#8216;FSS&#8217; to check that all instances have been replaced with \u2018FS\u2019.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Exporting your cleaned dataset<\/h2>\n\n\n\n<p>Once you\u2019ve finished editing, export your dataset. Click <strong>Export<\/strong> in the top-right corner. Select your preferred format, such as CSV. The file will download automatically and can typically be found in your downloads folder.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The full series<\/h2>\n\n\n\n<p><strong>Part 1:<\/strong>&nbsp;<a href=\"https:\/\/theoutlier.co.za\/how-to\/2024-11-27\/89798\/openrefine-part-1-installing-and-merging-datasets\/\">Installing OpenRefine and merging datasets<\/a>&nbsp;|&nbsp;<a href=\"https:\/\/youtu.be\/zjNE6kzaKBw?si=evt5V83_wC2tjnru\">WATCH<\/a><\/p>\n\n\n\n<p><strong>Part 2:<\/strong>&nbsp;<a href=\"https:\/\/theoutlier.co.za\/how-to\/2024-11-28\/89816\/openrefine-part-2-removing-duplicates-and-using-version-history\/\">Removing duplicates and tracking changes<\/a> |&nbsp;<a href=\"https:\/\/youtu.be\/nPr6Puvdh5M?si=I3PgHj5ibm57x1-Z\">WATCH<\/a><\/p>\n\n\n\n<p><strong>Part 3:<\/strong>&nbsp;Fixing inconsistencies with the cluster and edit tool | <a href=\"https:\/\/youtu.be\/nPr6Puvdh5M?si=7PGrTMUGLUwLSEUj\">WATCH<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Notebook<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/openrefine.org\/\">OpenRefine\u2019s official website<\/a>&nbsp;and&nbsp;<a href=\"https:\/\/docs.openrefine.org\/\">documentation<\/a><\/li>\n\n\n\n<li>Read more:&nbsp;<a href=\"https:\/\/theoutlier.co.za\/databites\/2023-09-07\/178\/hands-on-5-reasons-to-switch-to-openrefine-to-clean-data\/\">5 reasons to switch to OpenRefine to clean data<\/a><\/li>\n\n\n\n<li>Subscribe to&nbsp;<a href=\"https:\/\/www.youtube.com\/@OutlierAfrica\">The Outlier\u2019s YouTube channel<\/a>&nbsp;to be notified of new updates<\/li>\n\n\n\n<li>Sign up to\u00a0<a href=\"https:\/\/theoutlier.co.za\/newsletter\">The Outlier\u2019s newsletter<\/a>\u00a0for weekly tools, tips and data-driven insights<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>If you\u2019re still on the fence about using OpenRefine or feel intimidated by its features, this tutorial will walk you through two of its most powerful tools: clustering and editing. These tools will help you clean and standardise your dataset, saving hours of work.<\/p>\n","protected":false},"author":7,"featured_media":89899,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[448,1387],"tags":[482,1206,1295,1304,477,480,1297,479,1296,485],"newsletter-post":[],"site":[],"class_list":["post-89888","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-how-to","category-the-outlier","tag-data-cleaning","tag-data-tools","tag-faceting","tag-large-datasets","tag-messy-data","tag-openrefine","tag-step-by-step","tag-tips","tag-tutorial","tag-working-with-data"],"acf":{"big_number":"","big_number_caption":"","big_number_link":"","big_number_background":"","big_number_text_colour":"#000000","big_number_icon":false,"big_number_wide":"yes","featured_chart":{"ID":89899,"id":89899,"title":"OpenRefine-machine","filename":"OpenRefine-machine.webp","filesize":218950,"url":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-machine.webp","link":"https:\/\/outliereditor.co.za\/index.php\/2024\/12\/05\/openrefine-part-3-fixing-inconsistencies-with-the-cluster-and-edit-tool\/openrefine-machine\/","alt":"","author":"7","description":"","caption":"","name":"openrefine-machine","status":"inherit","uploaded_to":89888,"date":"2024-12-05 20:57:04","modified":"2024-12-05 20:57:04","menu_order":0,"mime_type":"image\/webp","type":"image","subtype":"webp","icon":"https:\/\/outliereditor.co.za\/wp-includes\/images\/media\/default.png","width":1024,"height":1024,"sizes":{"thumbnail":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-machine-150x150.webp","thumbnail-width":150,"thumbnail-height":150,"medium":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-machine-300x300.webp","medium-width":300,"medium-height":300,"medium_large":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-machine-768x768.webp","medium_large-width":768,"medium_large-height":768,"large":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-machine.webp","large-width":1024,"large-height":1024,"1536x1536":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-machine.webp","1536x1536-width":1024,"1536x1536-height":1024,"2048x2048":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/12\/OpenRefine-machine.webp","2048x2048-width":1024,"2048x2048-height":1024}},"flourish_chart_id":"","flourish_sub_title":"","flourish_chart_width":"large","is_newsletter_post":"No","post_style":"ch","show_on_front":"Yes","link_through":"Yes","chart_url":"","background_colour":"#0089AA","text_colour":"#FFFFFF"},"_links":{"self":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/89888","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/users\/7"}],"replies":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/comments?post=89888"}],"version-history":[{"count":10,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/89888\/revisions"}],"predecessor-version":[{"id":89909,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/89888\/revisions\/89909"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/media\/89899"}],"wp:attachment":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/media?parent=89888"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/categories?post=89888"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/tags?post=89888"},{"taxonomy":"newsletter-post","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/newsletter-post?post=89888"},{"taxonomy":"site","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/site?post=89888"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}