{"id":89816,"date":"2024-11-28T13:22:19","date_gmt":"2024-11-28T13:22:19","guid":{"rendered":"https:\/\/outliereditor.co.za\/?p=89816"},"modified":"2025-11-17T17:38:23","modified_gmt":"2025-11-17T17:38:23","slug":"openrefine-part-2-removing-duplicates-and-using-version-history","status":"publish","type":"post","link":"https:\/\/outliereditor.co.za\/index.php\/2024\/11\/28\/openrefine-part-2-removing-duplicates-and-using-version-history\/","title":{"rendered":"OpenRefine Part 2: Removing duplicates and using version history"},"content":{"rendered":"\n<p>We\u2019re getting our hands dirty with OpenRefine \u2013 one of The Outlier\u2019s favourite tools when working with large datasets. This powerful open-source program is ideal for cleaning messy data.<\/p>\n\n\n\n<p>In this post, we\u2019re focusing on two essential features of OpenRefine: removing duplicates and using version history to keep track of your changes. First, though, we need to explain <strong>faceting<\/strong>, a powerful feature of OpenRefine.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Watch the video<\/h2>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"OpenRefine Part 2: Removing duplicates and version history\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/nPr6Puvdh5M?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">What is faceting?<\/h2>\n\n\n\n<p>Faceted browsing allows you to explore and filter data dynamically. You can navigate large datasets by applying filters and performing operations on specific subsets. 
It works by categorising values in a specific column, making it easier to identify patterns or spot inconsistencies across large datasets.<\/p>\n\n\n\n<p>For example, a facet can display the distribution of values in a column, like the number of entries by year in a date column or the frequency of different text values. This grouping can quickly reveal trends or highlight errors in your data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Navigating menus<\/h2>\n\n\n\n<p>OpenRefine displays your dataset in a grid-like view, similar to Excel or Google Sheets, with column headings at the top and rows of data beneath them. You\u2019ll also notice a downward arrow next to each column heading. Click an arrow to open the menu of options for manipulating the data in that column.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"401\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Menus-OpenRefine.png\" alt=\"\" class=\"wp-image-89824\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Menus-OpenRefine.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Menus-OpenRefine-300x120.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Menus-OpenRefine-768x308.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Removing duplicates<\/h2>\n\n\n\n<p>Duplicate entries can skew your data, for example by inflating counts and totals. Here&#8217;s how to get rid of them:<\/p>\n\n\n\n<p>1. Click the downward arrow next to the heading of the column where you suspect duplicates might exist.<\/p>\n\n\n\n<p>2. From the dropdown menu, select <strong>Facet<\/strong>, then choose <strong>Customized facets<\/strong> and click on <strong>Duplicates facet<\/strong>. OpenRefine will display two options: <strong>True<\/strong> (for rows with duplicates) and <strong>False<\/strong> (for rows without duplicates). 
The numbers next to each option show how many rows hold a value that appears more than once and how many hold a value that appears only once.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"465\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Rows-OpenRefine.png\" alt=\"\" class=\"wp-image-89825\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Rows-OpenRefine.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Rows-OpenRefine-300x140.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Rows-OpenRefine-768x357.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p>3. Click on <strong>True<\/strong> to view the rows with duplicate entries. (This view shows every copy of a repeated value, including the one copy you want to keep.)<\/p>\n\n\n\n<p>4. Once you&#8217;ve identified the duplicates, mark the rows you wish to remove by clicking the <strong>star<\/strong> at the start of each row.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"230\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/star-OpenRefine.png\" alt=\"\" class=\"wp-image-89826\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/star-OpenRefine.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/star-OpenRefine-300x69.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/star-OpenRefine-768x177.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p>5. To remove the duplicates, exit the facet view and click on the dropdown arrow next to <strong>All<\/strong>. 
Choose <strong>Facet<\/strong>, then select <strong>Facet by star<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"438\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Facetbystar-OpenRefine.png\" alt=\"\" class=\"wp-image-89827\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Facetbystar-OpenRefine.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Facetbystar-OpenRefine-300x131.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Facetbystar-OpenRefine-768x336.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p>6. OpenRefine will display the rows you\u2019ve starred. You can now delete the duplicates by selecting <strong>Edit rows<\/strong> from the dropdown, then clicking <strong>Remove matching rows<\/strong>. With that, your duplicates are gone!<\/p>\n\n\n\n<p><strong>Pro tip:<\/strong> Keep track by keeping an eye on the number of total rows, displayed in brackets in the top strip.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"390\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/remove-rows-OpenRefine.png\" alt=\"\" class=\"wp-image-89828\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/remove-rows-OpenRefine.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/remove-rows-OpenRefine-300x117.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/remove-rows-OpenRefine-768x300.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Renaming columns<\/h2>\n\n\n\n<p>You may want to rename columns to make your data clearer or more meaningful. 
Here\u2019s how to do that:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Click the downward arrow next to the column you want to rename.<\/li>\n\n\n\n<li>Choose <strong>Edit columns<\/strong>, then select <strong>Rename this column<\/strong>. Enter a new name that makes the column easy to understand.<\/li>\n\n\n\n<li>Repeat this process for other columns to keep names consistent and clear throughout your dataset.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"413\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Rename-columns-Openrefine.png\" alt=\"\" class=\"wp-image-89829\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Rename-columns-Openrefine.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Rename-columns-Openrefine-300x124.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Rename-columns-Openrefine-768x317.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Version history: Track your changes<\/h2>\n\n\n\n<p>One of OpenRefine\u2019s most valuable features is its version history. 
This feature helps you track all the changes you\u2019ve made and allows you to easily undo or redo any steps if needed.<\/p>\n\n\n\n<p>Here\u2019s how to access it:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In the top left-hand corner, click on the <strong>Undo\/Redo<\/strong> tab.<\/li>\n\n\n\n<li>A list of all the changes you&#8217;ve made will appear, showing every action taken in your OpenRefine project.<\/li>\n\n\n\n<li>To undo a change, click on the step just before it in the list, and OpenRefine will restore your data to the state it was in after that step. Later steps stay in the list, greyed out, and you can redo them by clicking on them.<\/li>\n<\/ol>\n\n\n\n<p>This makes it easy to experiment with your dataset without worrying about losing your original data or previous steps.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"437\" src=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Undo-OpenRefine-1.png\" alt=\"\" class=\"wp-image-89833\" srcset=\"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Undo-OpenRefine-1.png 1000w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Undo-OpenRefine-1-300x131.png 300w, https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/Undo-OpenRefine-1-768x336.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">The full series<\/h2>\n\n\n\n<p><strong>Part 1:<\/strong> <a href=\"https:\/\/theoutlier.co.za\/how-to\/2024-11-27\/89798\/openrefine-part-1-installing-and-merging-datasets\/\">Installing OpenRefine and merging datasets<\/a> | <a href=\"https:\/\/youtu.be\/zjNE6kzaKBw?si=evt5V83_wC2tjnru\">WATCH<\/a><\/p>\n\n\n\n<p><strong>Part 2:<\/strong> Removing duplicates and tracking changes | <a href=\"https:\/\/youtu.be\/nPr6Puvdh5M?si=I3PgHj5ibm57x1-Z\">WATCH<\/a><\/p>\n\n\n\n<p><strong>Part 3:<\/strong> <a 
href=\"https:\/\/theoutlier.co.za\/how-to\/2024-12-05\/89888\/openrefine-part-3-fixing-inconsistencies-with-the-cluster-and-edit-tool\">Fixing inconsistencies with the cluster and edit tool<\/a> | <a href=\"https:\/\/youtu.be\/bnrPUr1lb-8?si=mfREv1JNc1WcQbju\">WATCH<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Notebook<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/openrefine.org\/\">OpenRefine\u2019s official website<\/a> and <a href=\"https:\/\/docs.openrefine.org\/\">documentation<\/a><\/li>\n\n\n\n<li>Read more: <a href=\"https:\/\/theoutlier.co.za\/databites\/2023-09-07\/178\/hands-on-5-reasons-to-switch-to-openrefine-to-clean-data\/\">5 reasons to switch to OpenRefine to clean data<\/a><\/li>\n\n\n\n<li>Subscribe to <a href=\"https:\/\/www.youtube.com\/@OutlierAfrica\">The Outlier\u2019s YouTube channel<\/a> to be notified of new updates<\/li>\n\n\n\n<li>Sign up to <a href=\"https:\/\/theoutlier.co.za\/newsletter\">The Outlier\u2019s newsletter<\/a> for weekly tools, tips and data-driven insights<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>OpenRefine is one of The Outlier\u2019s favourite tools when working with large datasets. This powerful open-source program is ideal for cleaning messy data. 
In this post, we focus on two essential features: removing duplicates and using version history to keep track of your changes.<\/p>\n","protected":false},"author":7,"featured_media":89835,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[448,1387],"tags":[473,1206,1295,480,1297,479,1296,485],"newsletter-post":[],"site":[],"class_list":["post-89816","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-how-to","category-the-outlier","tag-cleaning-data","tag-data-tools","tag-faceting","tag-openrefine","tag-step-by-step","tag-tips","tag-tutorial","tag-working-with-data"],"acf":{"big_number":"","big_number_caption":"","big_number_link":"","big_number_background":"","big_number_text_colour":"#000000","big_number_icon":false,"big_number_wide":"yes","featured_chart":{"ID":89835,"id":89835,"title":"OpenRefine-DallE-clean","filename":"OpenRefine-DallE-clean.png","filesize":284180,"url":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/OpenRefine-DallE-clean.png","link":"https:\/\/outliereditor.co.za\/index.php\/2024\/11\/28\/openrefine-part-2-removing-duplicates-and-using-version-history\/openrefine-dalle-clean\/","alt":"","author":"7","description":"","caption":"","name":"openrefine-dalle-clean","status":"inherit","uploaded_to":89816,"date":"2024-11-28 13:21:48","modified":"2024-12-30 
15:09:09","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/outliereditor.co.za\/wp-includes\/images\/media\/default.png","width":700,"height":700,"sizes":{"thumbnail":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/OpenRefine-DallE-clean-150x150.png","thumbnail-width":150,"thumbnail-height":150,"medium":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/OpenRefine-DallE-clean-300x300.png","medium-width":300,"medium-height":300,"medium_large":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/OpenRefine-DallE-clean.png","medium_large-width":700,"medium_large-height":700,"large":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/OpenRefine-DallE-clean.png","large-width":700,"large-height":700,"1536x1536":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/OpenRefine-DallE-clean.png","1536x1536-width":700,"1536x1536-height":700,"2048x2048":"https:\/\/outliereditor.co.za\/wp-content\/uploads\/2024\/11\/OpenRefine-DallE-clean.png","2048x2048-width":700,"2048x2048-height":700}},"flourish_chart_id":"","flourish_sub_title":"","flourish_chart_width":"medium","is_newsletter_post":"No","post_style":"ch","show_on_front":"Yes","link_through":"Yes","chart_url":"","background_colour":"#0089AA","text_colour":"#FFFFFF"},"_links":{"self":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/89816","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/users\/7"}],"replies":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/comments?post=89816"}],"version-history":[{"count":7,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/89816\/revisions"}],"prede
cessor-version":[{"id":89912,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/89816\/revisions\/89912"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/media\/89835"}],"wp:attachment":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/media?parent=89816"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/categories?post=89816"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/tags?post=89816"},{"taxonomy":"newsletter-post","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/newsletter-post?post=89816"},{"taxonomy":"site","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/site?post=89816"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}