Simple guide to scraping data from PDFs

Papers, PDFs and poorly scanned documents. This is the not-so-glamorous way most data journalism projects begin. Luckily, we don’t have to manually type out the data into a new spreadsheet to work with it. Instead, at The Outlier, we often use Adobe Acrobat DC to do all the hard work for us.

In this example, we’re going to scrape South Africa’s School Performance Report — a report that contains data about all the schools in the country and how their grade 12 cohort performed in the final exams over three years. The entire 226-page document comes as a PDF. But our aim is to extract all of the school subject data for one particular province, the Northern Cape, and all five of its districts.

Let’s get to it.

If your document is a hard copy, scan it and save it to your computer as a PDF to create a readable soft copy. Then open the PDF with Adobe Acrobat.

Once opened, select the ‘Organize Pages’ option in the right-hand panel. This will allow you to select the pages you want to extract into a workable copy.

Scroll down to the pages you want to select. Hold down CTRL on your keyboard (Command on a Mac) to select multiple pages at once. Now select ‘Extract’. This option will open a new window with only the pages you selected.

Now select ‘Export PDF’ from the right-hand panel.

A new window will open asking which format you would like to export your data to. Usually, I select ‘Spreadsheet’ and ‘Microsoft Excel Workbook’, and then ‘Export’.

Choose the download destination for your scraped data, and then open that file. The Excel spreadsheet will have one tab for each page you extracted. In this example it has four tabs of data because we extracted four pages. (If you extracted 80 pages, this file would have 80 tabs.)

At this stage, technically you are done scraping your data. But having the data on different tabs makes it difficult to work with as a whole. This is when I open the Excel file in Google Sheets to make the next set of changes. Of course, you can do the same thing directly in Excel.

Carefully copy and paste the data from each of the other tabs underneath the data in the first tab, leaving out the column headings. Once that’s done, delete the other tabs. Now you should have one tab with all your scraped district data under the same column headings.
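If you have dozens of tabs, the copy-and-paste step can also be scripted. What follows is only a minimal sketch in Python, not part of the Acrobat workflow above. It assumes every tab has already been read into a list of rows, with the column headings as the first row. The function name combine_sheets and the sample schools and numbers are made up for illustration.

```python
def combine_sheets(sheets):
    """Stack the rows of several tabs under the first tab's column headings.

    `sheets` is a list of tabs. Each tab is a list of rows, each row is a
    list of cell values, and the first row of every tab is assumed to hold
    the column headings.
    """
    sheets = [sheet for sheet in sheets if sheet]  # ignore empty tabs
    if not sheets:
        return []
    header = sheets[0][0]
    combined = [header]
    for number, sheet in enumerate(sheets, start=1):
        # Stop early if a tab does not line up with the first one, the
        # scripted version of pasting "carefully".
        if sheet[0] != header:
            raise ValueError(f"Tab {number} has different column headings")
        combined.extend(sheet[1:])  # everything except the headings
    return combined


# Made-up sample data standing in for two exported tabs.
page_1 = [
    ["School", "Subject", "Wrote", "Passed"],
    ["School A", "Mathematics", 40, 31],
]
page_2 = [
    ["School", "Subject", "Wrote", "Passed"],
    ["School B", "Mathematics", 25, 20],
    ["School B", "Physical Sciences", 18, 12],
]

combined = combine_sheets([page_1, page_2])
print(len(combined) - 1, "data rows")
```

The heading check is a design choice: exported tabs do not always line up perfectly, so it is safer for the script to stop than to quietly stack mismatched columns. To read a real workbook you would need a spreadsheet library. With pandas, for example, pandas.read_excel(path, sheet_name=None) returns a dictionary with one table per tab.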

Voilà! You now have a format that’s easy to filter and sort through.
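To show what filtering and sorting looks like outside a spreadsheet, here is a small sketch in Python. The column names (District, School, Pass rate), the district and school names, and the numbers are all hypothetical and are not taken from the report. The function filter_and_sort is a name I made up for this example.

```python
def filter_and_sort(rows, district, sort_by):
    """Keep the rows for one district, sorted from highest to lowest."""
    selected = [row for row in rows if row["District"] == district]
    return sorted(selected, key=lambda row: row[sort_by], reverse=True)


# Hypothetical rows: one dictionary per school, keyed by column heading.
rows = [
    {"District": "District X", "School": "School A", "Pass rate": 77.5},
    {"District": "District Y", "School": "School B", "Pass rate": 91.0},
    {"District": "District X", "School": "School C", "Pass rate": 84.2},
]

top_first = filter_and_sort(rows, "District X", "Pass rate")
for row in top_first:
    print(row["School"], row["Pass rate"])
```

This does the same job as a filter on the district column followed by a descending sort in Google Sheets or Excel.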
