lots of tech info here

Parse HTML, populate DataFrame and export to CSV

Published on 21 December 2018 | No Comments

This is a simple Python script that reads an HTML table and parses the cells. It then populates a pandas DataFrame object with this 2D array and exports the values into a CSV file.

For our example, we will use this HTML table from American Hospital Directory which has hospital statistics by state.

Steps

Download this Python script as get_requests.py or anything and run it. It works on either Python 2.7 or 3.7.


import re
import requests
import pandas as pd

# Constants
user_agent = 'Mozilla/5.0 (compatible; NoBot/1.1)'
url = 'https://www.ahd.com/state_statistics.html'
regex = re.compile('(.+?)s*
(.+?)s*(.+?)s*(.+?)s*
(.+?)s*(.+?)')
csv_filename = 'ahd.csv'

# Vars
states = []
number_hospitals = []
staffed_beds = []
total_discharges = []
patient_days = []
gross_patient_revenue = []

# Get page
page = requests.get(url, headers={'user-agent': user_agent})
html = page.text

# Parse HTML
html = re.sub(r's{2,}', ' ', html)
trs = re.findall(r'(.+?)', str(html))
for tr in trs:
    tds = regex.search(tr)
    if tds:
        states.append(tds.group(1))
        number_hospitals.append(tds.group(2))
        staffed_beds.append(tds.group(3))
        total_discharges.append(tds.group(4))
        patient_days.append(tds.group(5))
        gross_patient_revenue.append(tds.group(6))

dictionary = {
    'State': states,
    'Number Hospitals': number_hospitals,
    'Staffed Beds': staffed_beds,
    'Total Discharges': total_discharges,
    'Patient Days': patient_days,
    'Gross Patient Revenue': gross_patient_revenue
}
columns = dictionary.keys()

# Create dataframe
ahd = pd.DataFrame(dictionary, columns=columns)
ahd.to_csv(csv_filename)

You will need the Python modules requests and pandas to run this. Let's install them,

pip install requests
pip install pandas

Now that the dependencies are installed, run the script:

python get_requests.py

The output will be a CSV file named ahd.csv.

Please let me know if you run into any issues. Thanks for reading this post.

Created on 21 December 2018

If you liked this article, follow us on Facebook and Twitter (@aruljohn).

Share this with Others

Subscribe

Tags