Scraping data to create a custom dataset
Finding datasets
There are tons of places to find free and publicly available datasets that you can use for work or personal projects. Some of the sources we often turn to include Kaggle, Google Dataset Search, Statista, Our World in Data, and Open Data Sources. But what do you do when you can't find the information you need, or can only find part of it? We ran into this issue when exploring countries with the most national parks. We couldn't find a dataset with everything we were after: the number of national parks in each country, the names of those parks, and the GPS coordinates of each park so we could plot them on a map. We had access to an International Union for Conservation of Nature (IUCN) dataset, but it was not clear whether the parks in it were national parks or another category of protected area, such as a reserve. Luckily, Wikipedia has all this information; the challenge was that it was spread across different pages. We turned to our good friend Andrew Dang, a data engineer, to help us build this dataset. He scraped Wikipedia and created the spreadsheet we needed. Read on to find out how he did it.
Scraping data
While doing some research, we realized that Wikipedia had the information we needed, but it was scattered across various pages and organized in different formats. We looked for an existing dataset that pulled it all together, but we could not find one. The solution we landed on was to scrape the data from Wikipedia using Python. The web scraping code and data files can be found on GitHub.
Methodology
The data was scraped from Wikipedia using Python. The BeautifulSoup library was used to parse the HTML of each webpage.
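To give a sense of the setup, here is a minimal sketch of how a page can be downloaded and parsed with BeautifulSoup. The URL and the `get_soup` helper are illustrative, not taken from the actual repository:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative starting point; the actual script on GitHub may organize this differently.
MAIN_URL = "https://en.wikipedia.org/wiki/List_of_national_parks"

def get_soup(url: str) -> BeautifulSoup:
    """Download a page and return its parsed HTML."""
    response = requests.get(url, headers={"User-Agent": "national-parks-scraper"})
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

main_soup = get_soup(MAIN_URL)
```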
The scraping process began with a Wikipedia page (we will call this the main page from here on) that contains several tables listing the number of national parks for each country. Each country entry in these tables also linked to that country's own Wikipedia page. By scraping the tables on the main page, we aimed to get the name of each country, the number of national parks in that country, and the URL for that country.
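Continuing from the sketch above, here is a simplified version of that step. It assumes the main-page tables use Wikipedia's usual `wikitable` markup, with the country link in the first cell and the park count in the second; the real code handles more variations:

```python
from urllib.parse import urljoin

countries = {}

# Wikipedia list tables usually carry the "wikitable" class.
for table in main_soup.find_all("table", class_="wikitable"):
    for row in table.find_all("tr")[1:]:               # skip the header row
        cells = row.find_all(["td", "th"])
        if len(cells) < 2:
            continue
        link = cells[0].find("a")                      # assumption: country link in the first cell
        if link is None or not link.get("href"):
            continue
        count_text = cells[1].get_text(strip=True)     # assumption: park count in the second cell
        countries[link.get_text(strip=True)] = {
            "url": urljoin("https://en.wikipedia.org", link["href"]),
            "listed_count": int(count_text) if count_text.isdigit() else None,
        }
```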
Once we had the URL for each country, we scraped its contents. By scraping the country page, we aimed to get the name and URL of each national park found in that country. The country pages did not have a consistent structure, which required many if-then statements to account for the differences when looking for the data we were interested in. In most cases, however, the names and URLs of the national parks were organized in one or more tables or unordered lists, and a table or unordered list could often be found directly after an HTML heading containing the text "National Park". By using BeautifulSoup to look for specific HTML tags, attributes, and elements on each country's page, we were able to get the name and URL of the national parks for most countries. Additional code was written to collect national park names and URLs for countries that did not conform to this structure.
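A rough sketch of that lookup, covering only the common layout (the heading text, tag names, and link filter here are assumptions, and the real code handles many more edge cases):

```python
def parks_from_country_page(country_soup):
    """Return {park name: park URL} for one country page (common layout only)."""
    parks = {}
    # Find a heading whose text mentions "national park" ...
    heading = None
    for h in country_soup.find_all(["h2", "h3"]):
        if "national park" in h.get_text(" ", strip=True).lower():
            heading = h
            break
    if heading is None:
        return parks
    # ... then take the first table or unordered list that follows it.
    listing = heading.find_next(["table", "ul"])
    if listing is None:
        return parks
    for link in listing.find_all("a"):
        href = link.get("href", "")
        if href.startswith("/wiki/"):   # keep article links, skip citations and external links
            parks[link.get_text(strip=True)] = "https://en.wikipedia.org" + href
    return parks

# Example usage, reusing the earlier sketches (country name is illustrative):
# canada_parks = parks_from_country_page(get_soup(countries["Canada"]["url"]))
```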
After obtaining the URL of each national park, we scraped its contents to get the latitude and longitude of the park. The coordinates were listed in degrees minutes seconds on the webpages and were converted to decimal degrees using the `dms2dec` library. The geographic coordinates of each national park can be found in both degrees minutes seconds and decimal degrees in the `national_parks.csv` file.
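The conversion itself is simple arithmetic: degrees plus minutes/60 plus seconds/3600, negated for the southern and western hemispheres. The post used `dms2dec` for this step; the hand-rolled version below just shows what that conversion does (the regex, the comment about where coordinates live on the page, and the example value are all illustrative):

```python
import re

# On many park pages the displayed coordinates sit in the infobox, e.g. in spans
# with classes like "latitude" / "longitude" (an assumption; the real pages vary):
# lat_dms = park_soup.find("span", class_="latitude").get_text()

def dms_to_decimal(dms: str) -> float:
    """Convert a coordinate like 51°10′44″N to decimal degrees."""
    match = re.match(
        r"""(?P<deg>[\d.]+)[°\s]+
            (?:(?P<min>[\d.]+)[′'\s]+)?
            (?:(?P<sec>[\d.]+)[″"\s]*)?
            (?P<hem>[NSEW])""",
        dms.strip(),
        re.VERBOSE,
    )
    if match is None:
        raise ValueError(f"Unrecognised coordinate: {dms!r}")
    decimal = (
        float(match["deg"])
        + float(match["min"] or 0) / 60
        + float(match["sec"] or 0) / 3600
    )
    # South and west hemispheres get a negative sign.
    return -decimal if match["hem"] in "SW" else decimal

print(dms_to_decimal("51°10′44″N"))   # 51.178888...
```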
The scraped data was organized in a nested dictionary, where each country name was a key. The value for each key was another dictionary. This inner dictionary stored the URL for the country, the number of parks listed on the main Wikipedia page for the country, the number of parks we were able to find coordinates for, and finally, yet another dictionary that stored the name and URL of each national park found in that country.
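The structure looked roughly like this. The key names and the values shown are illustrative placeholders, not the exact names or figures used in the repository:

```python
scraped_results = {
    "Canada": {                                   # country name as the key
        "url": "https://en.wikipedia.org/wiki/List_of_national_parks_of_Canada",
        "listed_count": 37,                       # parks listed on the main page (placeholder)
        "found_count": 37,                        # parks we found coordinates for (placeholder)
        "parks": {
            "Banff National Park": {
                "url": "https://en.wikipedia.org/wiki/Banff_National_Park",
                "lat_dms": "51°10′44″N",          # coordinate values here are placeholders
                "lon_dms": "115°33′07″W",
                "lat_dec": 51.178889,
                "lon_dec": -115.551944,
            },
            # ... one entry per park in the country
        },
    },
    # ... one entry per country
}
```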
The scraped results were ultimately converted into a Pandas DataFrame and exported as a CSV file (saved as `national_parks.csv`). Each record contains a country name, a national park name, a national park URL, and the latitude and longitude in both degrees minutes seconds and decimal degrees. The records were built by looping through each national park in each country within the scraped results dictionary and appending the relevant data to a list with each iteration. Many of the park and country names had additional text, which was removed to clean the dataset. A similar process was used to create the `missing_coordinates.csv` and `summary_table.csv` files.
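A condensed sketch of that conversion, continuing from the dictionary above (the column names are assumptions; the actual CSV headers may differ):

```python
import pandas as pd

records = []
for country, info in scraped_results.items():
    for park_name, park in info["parks"].items():
        records.append(
            {
                "country": country,
                "national_park": park_name,
                "park_url": park["url"],
                "latitude_dms": park["lat_dms"],
                "longitude_dms": park["lon_dms"],
                "latitude_dec": park["lat_dec"],
                "longitude_dec": park["lon_dec"],
            }
        )

national_parks = pd.DataFrame(records)
national_parks.to_csv("national_parks.csv", index=False)
```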
Limitations
The main Wikipedia page lists a total of 3,257 national parks worldwide. We were able to find coordinates for 2,836 of them, and those parks are included in this dataset. Of the parks missing coordinates, 90 are missing because their country had no URL on the main page, 333 because the park itself had no URL, and another 37 because no coordinates were present on the park's page.
If a country did not have a URL on the main Wikipedia page, then it was not possible to get the names and coordinates of that country's national parks. Likewise, if a national park did not have its own URL, its coordinates usually could not be scraped. Occasionally, however, the coordinates were present in a table on the country page itself, and in that scenario they were still scraped, so the absence of a national park URL does not always mean the absence of geographic coordinates. In most cases, though, not having a national park URL meant that we scraped fewer national parks for a country than the number listed on the main Wikipedia page.
Upon investigation, it appeared that some country pages listed parks designated as national parks under their own national definition rather than the IUCN definition. Other pages listed decommissioned national parks, and the scraper did not account for these scenarios. Still other country pages listed protected areas, such as conservation areas, that do not carry the national park designation. While an attempt was made to filter out the non-national parks, it was not always successful, and a handful of records in the dataset do not fall under the IUCN definition of a national park. The scenarios described above resulted in some countries having more national parks scraped than what was listed on the main Wikipedia page.
Validating the data
We checked our dataset against the figures available on Wikipedia. We weren't able to find all the information we were looking for using this method. Why is it important to validate your data? Well, when presenting data in a table, graph, map, or whatever your choice of data visualization may be, it is important that the data is accurate and complete. Without checking, we run the risk of coming to incorrect conclusions about the data and potentially misleading the audience.
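One simple way to run this kind of check is to compare the number of parks we scraped per country against the number listed on the main page. The sketch below assumes `summary_table.csv` has `country` and `listed_count` columns, which may not match the actual file:

```python
import pandas as pd

parks = pd.read_csv("national_parks.csv")
summary = pd.read_csv("summary_table.csv")

# Parks we actually scraped per country ...
scraped_counts = (
    parks.groupby("country")["national_park"]
    .count()
    .rename("scraped_count")
    .reset_index()
)

# ... compared against the number listed on the main Wikipedia page.
check = summary.merge(scraped_counts, on="country", how="left")
mismatches = check[check["listed_count"] != check["scraped_count"]]
print(mismatches)
```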