Parse HTML, populate DataFrame and export to CSV
This is a simple Python script that reads an HTML table and parses the cells. It then populates a pandas DataFrame object with this 2D array and exports the values into a CSV file.
For our example, we will use the HTML table from the American Hospital Directory, which lists hospital statistics by state.
Steps
Save this Python script as get_requests.py (or any name you like) and run it. It works on both Python 2.7 and 3.7.
import re

import requests
import pandas as pd

# Constants
user_agent = 'Mozilla/5.0 (compatible; NoBot/1.1)'
url = 'https://www.ahd.com/state_statistics.html'
# One table row: the state link followed by five right-aligned cells
regex = re.compile(
    r'<td><a href="states/hospital_.+">(.+?)</a></td>\s*'
    r'<td align="right">(.+?)</td>\s*'
    r'<td align="right">(.+?)</td>\s*'
    r'<td align="right">(.+?)</td>\s*'
    r'<td align="right">(.+?)</td>\s*'
    r'<td align="right">(.+?)</td>'
)
csv_filename = 'ahd.csv'

# Vars
states = []
number_hospitals = []
staffed_beds = []
total_discharges = []
patient_days = []
gross_patient_revenue = []

# Get page
page = requests.get(url, headers={'user-agent': user_agent})
html = page.text

# Parse HTML: collapse runs of whitespace, then pull out each table row
html = re.sub(r'\s{2,}', ' ', html)
trs = re.findall(r'<tr>(.+?)</tr>', str(html))
for tr in trs:
    tds = regex.search(tr)
    if tds:
        states.append(tds.group(1))
        number_hospitals.append(tds.group(2))
        staffed_beds.append(tds.group(3))
        total_discharges.append(tds.group(4))
        patient_days.append(tds.group(5))
        gross_patient_revenue.append(tds.group(6))

dictionary = {
    'State': states,
    'Number Hospitals': number_hospitals,
    'Staffed Beds': staffed_beds,
    'Total Discharges': total_discharges,
    'Patient Days': patient_days,
    'Gross Patient Revenue': gross_patient_revenue
}
# An explicit list keeps the column order stable on Python 2.7,
# where dict ordering is arbitrary
columns = ['State', 'Number Hospitals', 'Staffed Beds',
           'Total Discharges', 'Patient Days', 'Gross Patient Revenue']

# Create dataframe
ahd = pd.DataFrame(dictionary, columns=columns)
ahd.to_csv(csv_filename)
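To see what the regular expression in the script captures, here is a minimal sketch that runs the same kind of pattern against a single made-up row. The sample row and the name row_pattern are my own assumptions for illustration. The real page's markup may differ, so treat this as a way to test the idea offline rather than a copy of the site's HTML.

```python
import re

# Hypothetical row, written to mirror the pattern used in the script;
# the live page's markup may not look exactly like this.
sample_row = (
    '<td><a href="states/hospital_XX.html">Example State</a></td> '
    '<td align="right">10</td> <td align="right">2,000</td> '
    '<td align="right">30,000</td> <td align="right">400,000</td> '
    '<td align="right">$5,000,000</td>'
)

# State link first, then five right-aligned cells separated by optional whitespace
row_pattern = re.compile(
    r'<td><a href="states/hospital_.+">(.+?)</a></td>\s*'
    + r'<td align="right">(.+?)</td>\s*' * 4
    + r'<td align="right">(.+?)</td>'
)

match = row_pattern.search(sample_row)
# groups() returns the six captured cells as a tuple of strings
fields = match.groups() if match else ()
print(fields)
```

Each captured group lines up with one of the lists in the script (state, number of hospitals, staffed beds, and so on). If the site changes its markup, the search returns no match and the row is skipped silently, so an empty CSV is a hint that the pattern needs updating.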
You will need the Python modules requests and pandas to run this. Let's install them:
pip install requests
pip install pandas
Now that the dependencies are installed, run the script:
python get_requests.py
The output will be a CSV file named ahd.csv.
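One thing to keep in mind: the script stores every cell exactly as it was captured, which means the values in the CSV are text. Assuming the numeric cells look something like 1,234 or $5,678,901 (an assumption on my part, not checked against the live page), a small helper along these lines could turn them into integers. The name to_int is hypothetical and not part of the script.

```python
def to_int(cell):
    """Convert a text cell such as '1,234' or '$5,678,901' to an int.

    Assumes the only non-digit characters are commas, a dollar sign,
    and surrounding whitespace. Returns None for an empty cell.
    """
    cleaned = cell.replace(',', '').replace('$', '').strip()
    return int(cleaned) if cleaned else None


print(to_int('1,234'))
print(to_int('$5,678,901'))
```

You could apply it to each list before building the DataFrame, for example staffed_beds = [to_int(v) for v in staffed_beds]. If a cell holds anything other than digits, commas, and a dollar sign, int() raises a ValueError, which at least makes unexpected data visible.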
Please let me know if you run into any issues. Thanks for reading this post.
Created on 21 December 2018