If you need datasets with millions of rows, you do not always have to search the Internet for them; you can generate your own. This article presents a Python program that generates large datasets of fake personal records. The generated dataset has a very limited use case, but you can modify the program to produce a dataset that suits your needs.
CSV and JSON Dataset samples
This program generates datasets in CSV or JSON formats, depending on what you specify at the command line.
It does not use Pandas or NumPy. Rather, it uses the following standard library modules: csv, json and random.
CSV sample:
Name,Date of Birth,SSN,Address,City,State,Zip,Phone
Oliver Williams,1946-9-5,912-77-5300,93024 Center Drive,Savannah,OK,05851,137-433-2456
Levi Miller,1945-6-27,492-20-8246,75828 Maple Avenue,Newport,SD,55040,440-705-7588
Levi Smith,1932-12-29,821-34-5935,87796 Maple Avenue,Missoula,TX,97185,270-581-7332
Charlotte Wilson,1938-5-2,765-60-2025,54586 Maple Lane,Spokane,MN,42398,242-806-5936
Amelia Smith,2007-10-9,777-32-7264,88098 Market Court,Eugene,SD,91283,469-628-7220
JSON sample:
[
  {
    "name": "Noah Johnson",
    "ssn": "483-92-9955",
    "address": "6921 Pine Court",
    "city": "Springfield",
    "state": "SC",
    "zip": "26978",
    "date_of_birth": "1982-1-18",
    "phone": "590-833-9448"
  },
  {
    "name": "Oliver Davis",
    "ssn": "486-21-4837",
    "address": "2957 Washington Boulevard",
    "city": "Springfield",
    "state": "RI",
    "zip": "26293",
    "date_of_birth": "2004-6-23",
    "phone": "370-236-7597"
  },
  {
    "name": "Sophia Williams",
    "ssn": "806-53-7969",
    "address": "2168 Church Drive",
    "city": "Newport",
    "state": "MT",
    "zip": "40305",
    "date_of_birth": "1962-7-29",
    "phone": "640-657-9322"
  },
How authentic is the data in the generated dataset?
The data is randomly generated and fake. The first and last names are picked at random from short lists of popular names in the USA. The SSN values and birthdays are all randomly generated.
The road names in the addresses are picked from a hardcoded list.
The address, city, state and zip code are chosen independently of each other, so they do not resolve to valid geographical locations. If you are looking for valid address/city/state/zip code combinations to add to a database, you will not find them here.
How to run the program
Download the Python program generate_millions_dataset.py.
This program takes two parameters to run:
First Parameter
-n NUMBER: NUMBER is the number of millions of rows in the dataset. It accepts values from 1 through 100. The highest value, 100, generates 100 million records. The program rejects anything higher. Raising the limit in the code may work, but because all records are built in memory before being written, it can crash your computer depending on its configuration.
-n 2 will generate 2 million records.
Second Parameter
-f FORMAT: FORMAT is the output file format; it takes either csv or json as value.
-f csv will generate a CSV file.
-f json will generate a JSON file.
To generate a CSV file with 2 million rows:
$ python generate_millions_dataset.py -n 2 -f csv
>> Generating csv dataset with 2 million rows
Total Time: 22.45485 seconds.
$ ls -al 2_million_people.csv
-rw-r--r-- 1 asjohn staff 170208480 Jan 1 17:48 2_million_people.csv
The generated file is named NUM_million_people.csv (or NUM_million_people.json for JSON output), where NUM is the number of millions you entered for -n. In this case, the generated dataset is 2_million_people.csv.
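Because the generated file can be well over a hundred megabytes, opening it in an editor may be slow. A small sketch for peeking at the first few rows with the standard csv module follows. The helper name preview_csv is hypothetical, and the commented usage assumes the file from the 2 million row run above exists.

```python
import csv
from itertools import islice


def preview_csv(path, n=5):
    """Return the header plus the first n data rows without reading the whole file."""
    with open(path, newline='') as f:
        # islice stops after n + 1 rows, so the rest of the file is never read
        return list(islice(csv.reader(f), n + 1))


# Example usage (assumes the generated file is in the current directory):
# for row in preview_csv('2_million_people.csv'):
#     print(row)
```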
How the code works
There are several variables and functions we will use. The road names, road types, cities, states and names are stored as strings of words separated by single spaces, which split() turns into lists. To pick a random item from a list, we will use random.choice(). To generate random numbers, we will use random.randint(). For all this, we will first import the random module.
road_names = 'Main Church High Elm Park Walnut Washington Chestnut Broad Maple Oak Maple Center Pine River Market Washington Water Union'.split()
road_types = 'Road Way Street Avenue Boulevard Lane Drive Terrace Place Court Plaza Square'.split()
cities = 'Savannah Eugene Jackson Spokane Florence Morro Missoula Flagstaff Covington Newport Springfield'.split()
states = 'MA MI MN MS MO MT NE NV NH OH OK OR PA RI SC SD TN TX UT VT VA WA WV'.split()
first_names = 'Olivia Noah Emma Liam Amelia Oliver Sophia Elijah Charlotte Mateo Ava Lucas Isabella Levi'.split()
last_names = 'Smith Johnson Williams Brown Jones Miller Davis Garcia Rodriguez Wilson'.split()
Random Road names
Road names are stored in road_names. We chain the string with the split() method to get a list.
road_names = 'Main Church High Elm Park Walnut Washington Chestnut Broad Maple Oak Maple Center Pine River Market Washington Water Union'.split()
To get a random road name, we call random.choice(road_names). Remember to import random first.
import random
random.choice(road_names)
Random Road types
Road types are stored in road_types. The list below is only a small sample of the road types used in the USA, not the complete set.
As with the road names, we chain the string with the split() method to get a list.
road_types = 'Road Way Street Avenue Boulevard Lane Drive Terrace Place Court Plaza Square'.split()
To get a random road type, we call random.choice(road_types).
random.choice(road_types)
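The complete program combines a random house number with a road name and a road type to build a street address. A minimal sketch of that step is shown here, using shortened lists for brevity; the function name random_address is only illustrative.

```python
import random

# Shortened lists for illustration; the article's full lists work the same way.
road_names = 'Main Church High Elm Park'.split()
road_types = 'Road Way Street Avenue'.split()


def random_address():
    """Return a fake street address: house number, road name, road type."""
    number = random.randint(100, 99999)
    return f'{number} {random.choice(road_names)} {random.choice(road_types)}'


print(random_address())
```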
Random City names
I picked a set of random city names. The following code snippet returns a random city name from the list.
cities = 'Savannah Eugene Jackson Spokane Florence Morro Missoula Flagstaff Covington Newport Springfield'.split()
To get a random city name:
random.choice(cities)
Random State names (2-letter)
I picked a set of random 2-letter state names. This does not include all 50 states. The following code snippet returns a random state name from the list.
states = 'MA MI MN MS MO MT NE NV NH OH OK OR PA RI SC SD TN TX UT VT VA WA WV'.split()
To get a random state name:
random.choice(states)
Random Zip codes
We will generate random zip codes. Zip codes take the 5-digit format XXXXX.
We can generate a random number between 100 and 99999 like this:
random.randint(100, 99999)
But if the generated number has fewer than 5 digits, we need to pad it with leading 0s so that we end up with a 5-digit string. For that, we use the zfill() method.
f'{str(random.randint(100, 99999)).zfill(5)}'
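An equivalent way to pad, shown here only as an alternative sketch, is the format specifier 05d, which zero-pads the integer inside the f-string without calling str() or zfill(). The helper name random_zip is hypothetical.

```python
import random


def random_zip():
    """Return a 5-character zip-like string, zero-padded on the left."""
    return f'{random.randint(100, 99999):05d}'


# Both padding approaches give the same result for the same number.
assert f'{501:05d}' == str(501).zfill(5) == '00501'

print(random_zip())
```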
Fun Fact: The lowest numerical value for zip code is 00501 for Holtsville, NY. The highest numerical value for zip code is 99950 in Ketchikan, AK.
PS: To look up where an SSN was issued, you can use our SSN lookup utility by entering the first 3 digits. PPS: You should NEVER enter your full SSN anywhere except on trusted websites.
Random People Names
First names
I selected the top first names from this list of top baby names.
first_names = 'Olivia Noah Emma Liam Amelia Oliver Sophia Elijah Charlotte Mateo Ava Lucas Isabella Levi'.split()
Last names
The list of last names was randomly taken from among the most common last names.
last_names = 'Smith Johnson Williams Brown Jones Miller Davis Garcia Rodriguez Wilson'.split()
Random Full names
To get a random full name, that is, a random first name followed by a random last name:
f'{random.choice(first_names)} {random.choice(last_names)}'
Random Birth dates
We will generate random birth dates from 1919 through 2023.
To create a random birth date in the format YYYY-M-D (the month and day are not zero-padded), we use random.randint().
f'{random.randint(1919, 2023)}-{random.randint(1, 12)}-{random.randint(1, 31)}'
This approach is not error-free: because the day is drawn from 1 to 31 regardless of the month, you can get an invalid date like 2023-2-31. If you want perfectly valid dates, you can add more conditions.
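One way to avoid invalid dates, sketched here with the standard datetime module instead of extra conditions, is to pick a random day offset between the first and last allowed dates. The function name random_dob and the zero-padded ISO output are choices made for this example, not part of the original program.

```python
import random
from datetime import date, timedelta

START = date(1919, 1, 1)
END = date(2023, 12, 31)


def random_dob():
    """Return a valid random date between START and END as YYYY-MM-DD."""
    span_days = (END - START).days
    day = START + timedelta(days=random.randint(0, span_days))
    return day.isoformat()


print(random_dob())
```

Every value produced this way is a real calendar date, since it is derived from a date object and not from three independent random numbers.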
Random SSN
We will generate random social security numbers of the format XXX-XX-XXXX, which is the standard SSN format.
To create a random SSN of that format:
f'{random.randint(100, 999)}-{random.randint(10, 99)}-{random.randint(1000, 9999)}'
Complete Python program
This is the complete Python script to generate the dataset.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Generate a dataset with millions of rows
Coded by Arul John in 2024
"""
import argparse
import random
import csv
import json
import time

# Vars
road_names = 'Main Church High Elm Park Walnut Washington Chestnut Broad Maple Oak Maple Center Pine River Market Washington Water Union'.split()
road_types = 'Road Way Street Avenue Boulevard Lane Drive Terrace Place Court Plaza Square'.split()
cities = 'Savannah Eugene Jackson Spokane Florence Morro Missoula Flagstaff Covington Newport Springfield'.split()
states = 'MA MI MN MS MO MT NE NV NH OH OK OR PA RI SC SD TN TX UT VT VA WA WV'.split()
first_names = 'Olivia Noah Emma Liam Amelia Oliver Sophia Elijah Charlotte Mateo Ava Lucas Isabella Levi'.split()
last_names = 'Smith Johnson Williams Brown Jones Miller Davis Garcia Rodriguez Wilson'.split()
one_million = 1000000
# Header order must match the order in which the fields are generated below
headers = ['Name', 'Date of Birth', 'SSN', 'Address', 'City', 'State', 'Zip', 'Phone']

def generate_name():
    return f'{random.choice(first_names)} {random.choice(last_names)}'

def generate_ssn():
    return f'{random.randint(100, 999)}-{random.randint(10, 99)}-{random.randint(1000, 9999)}'

def generate_dob():
    return f'{random.randint(1919, 2023)}-{random.randint(1, 12)}-{random.randint(1, 31)}'

def generate_phone():
    return f'{random.randint(100, 999)}-{random.randint(100, 999)}-{random.randint(1000, 9999)}'

def generate_address():
    return f'{random.randint(100,99999)} {random.choice(road_names)} {random.choice(road_types)}'

def generate_city():
    return f'{random.choice(cities)}'

def generate_state():
    return f'{random.choice(states)}'

def generate_zipcode():
    return f'{str(random.randint(100, 99999)).zfill(5)}'

# Create dataset with <num_millions> * 1,000,000 records
def generate_dataset(num_millions, format):
    # csv or json?
    if format == 'json':
        # 'Date of Birth' becomes 'date_of_birth', and so on
        json_keys = [h.replace(' ', '_').lower() for h in headers]
        people = [dict(zip(json_keys, (generate_name(), generate_dob(), generate_ssn(), generate_address(), generate_city(), generate_state(), generate_zipcode(), generate_phone())))
                  for i in range(num_millions * one_million)]
        output_filename = f'{num_millions}_million_people.json'
        with open(output_filename, 'w') as f:
            json.dump(people, f, indent=2, ensure_ascii=False)
    elif format == 'csv':
        people = [(generate_name(), generate_dob(), generate_ssn(), generate_address(), generate_city(), generate_state(), generate_zipcode(), generate_phone())
                  for i in range(num_millions * one_million)]
        output_filename = f'{num_millions}_million_people.csv'
        with open(output_filename, 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(headers)
            writer.writerows(people)

# Main function
if __name__ == '__main__':
    start_time = time.perf_counter()
    parser = argparse.ArgumentParser(description='Dataset Generator')
    parser.add_argument('-n', '--number', type=int, required=True, choices=list(range(1,101)), help='Number of records, in millions')
    parser.add_argument('-f', '--format', type=str, required=True, choices=['csv', 'json'], help='Create CSV or JSON dataset')
    args = parser.parse_args()
    print(f'>> Generating {args.format} dataset with {args.number} million rows')
    generate_dataset(args.number, args.format)
    end_time = time.perf_counter()
    print(f'Total Time: {round(end_time - start_time, 5)} seconds.')
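The program above builds the whole list of records in memory before writing it, which is likely why very large values of -n can exhaust RAM. A possible variation, sketched below under the assumption that only CSV output is needed, passes a generator to writerows() so rows are produced and written one at a time. The names generate_rows and write_csv_streaming, the reduced two-column record and the output path are all hypothetical and kept small for illustration.

```python
import csv
import os
import random
import tempfile

first_names = 'Olivia Noah Emma Liam'.split()
last_names = 'Smith Johnson Williams Brown'.split()


def generate_rows(count):
    """Yield one (name, zip) record at a time instead of building a list."""
    for _ in range(count):
        name = f'{random.choice(first_names)} {random.choice(last_names)}'
        zipcode = f'{random.randint(100, 99999):05d}'
        yield (name, zipcode)


def write_csv_streaming(filename, count):
    """Write `count` rows to `filename` without holding them all in memory."""
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Name', 'Zip'])
        # writerows() accepts any iterable, so the generator is consumed lazily
        writer.writerows(generate_rows(count))


# Hypothetical output location in the system temp directory
output_path = os.path.join(tempfile.gettempdir(), 'sample_people.csv')
write_csv_streaming(output_path, 1000)
```

The same idea does not carry over directly to json.dump(), which expects the complete list; a streaming JSON variant would need a different output layout.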
Conclusion
This program is just a sample with a very limited use case. If you want different fields, different constraints, or values read from an external file or database, you will have to make the changes yourself.