{"id":185,"date":"2023-09-03T14:33:13","date_gmt":"2023-09-03T14:33:13","guid":{"rendered":"https:\/\/editor.mediahack.co.za\/databites\/?p=185"},"modified":"2025-11-17T17:41:35","modified_gmt":"2025-11-17T17:41:35","slug":"a-simple-guide-to-scraping-data-from-pdfs","status":"publish","type":"post","link":"https:\/\/outliereditor.co.za\/index.php\/2023\/09\/03\/a-simple-guide-to-scraping-data-from-pdfs\/","title":{"rendered":"Simple guide to scraping data from PDFs"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\" id=\"a311\">Papers, PDFs and poorly scanned documents. This is the not-so-glamorous way most data journalism projects begin. Luckily, we don\u2019t have to manually type out the data into a new spreadsheet to work with it. Instead, at The Outlier, we often use&nbsp;<a href=\"https:\/\/get.adobe.com\/reader\/\" target=\"_blank\" rel=\"noreferrer noopener\">Adobe Acrobat DC<\/a>&nbsp;to do all the hard work for us.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"322f\">In this example, we\u2019re going to scrape South Africa\u2019s&nbsp;<a href=\"https:\/\/www.education.gov.za\/Portals\/0\/Documents\/Reports\/2021NSCReports\/School%20Performance%20Report.pdf?ver=2022-01-31-130221-553\" target=\"_blank\" rel=\"noreferrer noopener\">School Performance Report<\/a>&nbsp;\u2014 a report that contains data about all the schools in the country and how their grade 12 cohort performed in the final exams over three years. The entire 226-page document comes as a PDF. But our aim is to extract all of the school subject data for one particular province, the Northern Cape, and all five of its districts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"a29a\">Let\u2019s get to it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If your document is a hard copy, scan it onto your desktop to create a readable soft copy. 
Upload the PDF and open it with Adobe Acrobat.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/1-1-1024x706.webp\" alt=\"\" class=\"wp-image-186\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Once the file is open, select the \u2018Organize Pages\u2019 option in the right-hand panel. This lets you choose the pages you want to extract into a workable copy.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/2-1-1024x488.webp\" alt=\"\" class=\"wp-image-187\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Scroll down to the pages you want. Hold down Ctrl (Cmd on a Mac) and click to select multiple pages at once. Now select \u2018Extract\u2019. This option will open a new window with only the pages you selected.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/3-1-1024x440.webp\" alt=\"\" class=\"wp-image-188\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now select \u2018Export PDF\u2019 from the right-hand panel.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/4-1-1024x484.webp\" alt=\"\" class=\"wp-image-189\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A new window will open asking which format you would like to export your data in. 
Usually, I select \u2018Spreadsheet\u2019 and \u2018Microsoft Excel Workbook\u2019, and then \u2018Export\u2019.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/5-1-1024x442.webp\" alt=\"\" class=\"wp-image-190\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Choose the download destination for your scraped data, and then open that file. The Excel workbook will have four tabs of data, one for each of the four pages you extracted. (If you extracted 80 pages, this file would have 80 tabs.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Technically, you are done scraping your data at this stage. But having the data spread across different tabs makes it difficult to work with as a whole. This is when I open the Excel file in Google Sheets to make the next set of changes. Of course, you can do the same thing directly in Excel.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"1232\">Carefully copy and paste the data from each of the other tabs under the data in the first tab, leaving out the repeated column headings. Once that\u2019s done, delete the other tabs. Now you should have one tab with all your scraped district data under the same column headings.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img decoding=\"async\" src=\"https:\/\/editor.mediahack.co.za\/databites\/wp-content\/uploads\/sites\/3\/2023\/08\/6-1024x515.webp\" alt=\"\" class=\"wp-image-191\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"72cb\">Voila! 
You now have your data in a format that\u2019s easy to filter and sort.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Notebook<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Follow our&nbsp;<a href=\"https:\/\/www.tiktok.com\/@mediahackza?is_from_webapp=1&amp;sender_device=pc\" target=\"_blank\" rel=\"noreferrer noopener\">TikTok micro-training journey<\/a>&nbsp;where we show you the tips and tricks we use when working with Google Sheets.<\/li>\n\n\n\n<li>Learn with The Outlier Learning.&nbsp;Find out more about our training courses&nbsp;<a href=\"https:\/\/www.theoutlier.co.za\/learning\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/li>\n\n\n\n<li><a target=\"_blank\" href=\"https:\/\/newsletters.theoutlier.co.za\/\" rel=\"noreferrer noopener\">Subscribe to The Outlier<\/a>, a fortnightly newsletter that delivers data-driven insights.<\/li>\n\n\n\n<li><a href=\"https:\/\/newsletters.theoutlier.co.za\/\" target=\"_blank\" rel=\"noreferrer noopener\">Sign up to the DataBites newsletter<\/a>&nbsp;for more tips, tech and data storytelling techniques.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Papers, PDFs and poorly scanned documents are the way most data journalism projects begin. 
But instead of typing up a new spreadsheet, here&#8217;s how to use\u00a0Adobe Acrobat DC\u00a0to do all the hard work for you.<\/p>\n","protected":false},"author":3,"featured_media":86681,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[447,448,1387],"tags":[481,482,359,483,475,454,449,484,478,485],"newsletter-post":[],"site":[],"class_list":["post-185","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-databites","category-how-to","category-the-outlier","tag-adobe-acrobat","tag-data-cleaning","tag-data-journalism","tag-data-scraping","tag-excel","tag-google-sheets","tag-how-to","tag-pdf","tag-spreadsheets","tag-working-with-data"],"acf":{"big_number":null,"big_number_caption":null,"big_number_link":null,"big_number_background":null,"big_number_text_colour":null,"big_number_icon":null,"big_number_wide":null,"featured_chart":null,"flourish_chart_id":null,"flourish_sub_title":null,"flourish_chart_width":null,"is_newsletter_post":null,"post_style":null,"show_on_front":null,"link_through":null,"chart_url":null,"background_colour":null,"text_colour":null},"_links":{"self":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/185","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/comments?post=185"}],"version-history":[{"count":1,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/185\/revisions"}],"predecessor-version":[{"id":86682,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/posts\/185\/revisions\/86682"}],"wp:featu
redmedia":[{"embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/media\/86681"}],"wp:attachment":[{"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/media?parent=185"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/categories?post=185"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/tags?post=185"},{"taxonomy":"newsletter-post","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/newsletter-post?post=185"},{"taxonomy":"site","embeddable":true,"href":"https:\/\/outliereditor.co.za\/index.php\/wp-json\/wp\/v2\/site?post=185"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}