Parse HTML, populate DataFrame and export to CSV

Published on December 21, 2018

This is a simple Python script that reads an HTML table and parses the cells. It then populates a pandas DataFrame object with this 2D array and exports the values into a CSV file.

For our example, we will use this HTML table from American Hospital Directory which has hospital statistics by state.

Steps

Download this Python script as get_requests.py or anything and run it. It works on either Python 2.7 or 3.7.

import re
import requests
import pandas as pd

# Constants
user_agent = 'Mozilla/5.0 (compatible; NoBot/1.1)'
url = 'https://www.ahd.com/state_statistics.html'
regex = re.compile('<td><a href="states/hospital_.+">(.+?)</a></td>s*
<td align="right">(.+?)</td>s*<td align="right">(.+?)</td>s*<td align="right">(.+?)</td>s*
<td align="right">(.+?)</td>s*<td align="right">(.+?)</td>')
csv_filename = 'ahd.csv'

# Vars
states = []
number_hospitals = []
staffed_beds = []
total_discharges = []
patient_days = []
gross_patient_revenue = []

# Get page
page = requests.get(url, headers={'user-agent': user_agent})
html = page.text

# Parse HTML
html = re.sub(r's{2,}', ' ', html)
trs = re.findall(r'<tr>(.+?)</tr>', str(html))
for tr in trs:
    tds = regex.search(tr)
    if tds:
        states.append(tds.group(1))
        number_hospitals.append(tds.group(2))
        staffed_beds.append(tds.group(3))
        total_discharges.append(tds.group(4))
        patient_days.append(tds.group(5))
        gross_patient_revenue.append(tds.group(6))

dictionary = {
    'State': states,
    'Number Hospitals': number_hospitals,
    'Staffed Beds': staffed_beds,
    'Total Discharges': total_discharges,
    'Patient Days': patient_days,
    'Gross Patient Revenue': gross_patient_revenue
}
columns = dictionary.keys()

# Create dataframe
ahd = pd.DataFrame(dictionary, columns=columns)
ahd.to_csv(csv_filename)

You will need the Python modules requests and pandas to run this. Let's install them,

pip install requests
pip install pandas

Now that the dependencies are installed, run the script:

python get_requests.py

The output will be a CSV file named ahd.csv.

Please let me know if you run into any issues. Thanks for reading this post.

Related Posts

If you have any questions, please contact me at arulbOsutkNiqlzziyties@gNqmaizl.bkcom. You can also post questions in our Facebook group. Thank you.

Disclaimer: Our website is supported by our users. We sometimes earn affiliate links when you click through the affiliate links on our website.

Published on December 21, 2018