arul's code
free code snippets

Parse HTML, populate DataFrame and export to CSV

This is a simple Python script that reads an HTML table and parses the cells. It then populates a pandas DataFrame object with this 2D array and exports the values into a CSV file.

For our example, we will use this HTML table from American Hospital Directory which has hospital statistics by state.

Steps

Download this Python script as get_requests.py or anything and run it. It works on either Python 2.7 or 3.7.


import re
import requests
import pandas as pd

# Constants
user_agent = 'Mozilla/5.0 (compatible; NoBot/1.1)'
url = 'https://www.ahd.com/state_statistics.html'
regex = re.compile('(.+?)\s*\
(.+?)\s*(.+?)\s*(.+?)\s*\
(.+?)\s*(.+?)')
csv_filename = 'ahd.csv'

# Vars
states = []
number_hospitals = []
staffed_beds = []
total_discharges = []
patient_days = []
gross_patient_revenue = []

# Get page
page = requests.get(url, headers={'user-agent': user_agent})
html = page.text

# Parse HTML
html = re.sub(r'\s{2,}', ' ', html)
trs = re.findall(r'(.+?)', str(html))
for tr in trs:
    tds = regex.search(tr)
    if tds:
        states.append(tds.group(1))
        number_hospitals.append(tds.group(2))
        staffed_beds.append(tds.group(3))
        total_discharges.append(tds.group(4))
        patient_days.append(tds.group(5))
        gross_patient_revenue.append(tds.group(6))

dictionary = {
    'State': states,
    'Number Hospitals': number_hospitals,
    'Staffed Beds': staffed_beds,
    'Total Discharges': total_discharges,
    'Patient Days': patient_days,
    'Gross Patient Revenue': gross_patient_revenue
}
columns = dictionary.keys()

# Create dataframe
ahd = pd.DataFrame(dictionary, columns=columns)
ahd.to_csv(csv_filename)

You will need the Python modules requests and pandas to run this. Let's install them,

  pip install requests
  pip install pandas

Now that the dependencies are installed, run the script:

  python get_requests.py

The output will be a CSV file named ahd.csv.

Please let me know if you run into any issues. Thanks for reading this post.

Share this

Last Updated: Posted on 21 Dec 2018