4. In-class problem-solving!#
4.1. Approaching programming problems#
Write a function to convert numbers from written English to numeric form. The function should be able to convert numbers zero through one trillion (and negatives?).
4.1.1. Strategy#
Understand the question.
Write out some cases: inputs and expected outputs. Include standard and non-standard cases.
Look for patterns. Patterns suggest reusable elements (functions, loops, etc.).
Break the problem down into tasks and work incrementally, testing frequently.
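One way to make the second step concrete is to write the cases down as data before any conversion code exists. The sketch below is only illustrative: `cases`, `run_cases`, and the deliberately incomplete `stub` are hypothetical names, and the expected values are simply what the problem statement implies.

```python
# (input text, expected value) pairs: standard and non-standard cases
cases = [
    ('zero', 0),
    ('fifteen', 15),
    ('twenty-five', 25),
    ('seven hundred thirty-two', 732),
    ('nineteen thousand', 19000),
    ('negative twelve', -12),
    ('one trillion', 1000000000000),
]


def run_cases(func, cases):
    """Return (text, expected, actual) for every case that func gets wrong."""
    failures = []
    for text, expected in cases:
        actual = func(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


# Hypothetical stand-in for a first, incomplete attempt at the converter
def stub(text):
    return {'zero': 0, 'fifteen': 15}.get(text)


failures = run_cases(stub, cases)
print(f'{len(failures)} of {len(cases)} cases fail')
```

Rerunning the same list after each incremental change shows which cases a new piece of code has fixed and which it has broken.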
# write a function that converts text numbers to numeric for 0-999
def word2num(text):
    text2dig = {
        'zero': 0,
        'one': 1,
        'two': 2,
        'three': 3,
        'four': 4,
        'five': 5,
        'six': 6,
        'seven': 7,
        'eight': 8,
        'nine': 9,
        'ten': 10,
        'eleven': 11,
        'twelve': 12,
        'thirteen': 13,
        'fourteen': 14,
        'fifteen': 15,
        'sixteen': 16,
        'seventeen': 17,
        'eighteen': 18,
        'nineteen': 19,
        'twenty': 20,
        'thirty': 30,
        'forty': 40,
        'fifty': 50,
        'sixty': 60,
        'seventy': 70,
        'eighty': 80,
        'ninety': 90,
    }
    # drop commas, split hyphenated words, then break the text into words
    wordlist = text.replace(',', '').replace('-', ' ').split(' ')
    num = 0
    multiplier = 1
    # read right to left so 'hundred' sets the multiplier for the word before it
    for word in wordlist[::-1]:
        if word == 'hundred':
            multiplier = 100
        if word in text2dig:
            num += text2dig[word] * multiplier
    return num
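def word2num_trillion(text):
    ordermag = {
        'thousand': 1000,
        'million': 1000000,
        'billion': 1000000000,
        'trillion': 1000000000000,
    }
    num = 0
    multiplier = 1
    # peel off groups from the right: split at each order word, smallest first
    for order in ['thousand', 'million', 'billion', 'trillion']:
        text_list = text.split(order)
        if len(text_list) == 1:
            continue
        text = text_list[0]      # everything left of the order word
        R_text = text_list[1]    # the 0-999 group to its right
        num += word2num(R_text) * multiplier
        multiplier = ordermag[order]
        # print(f'Order {order}:\t{text} _____ {R_text}')
    # what remains is the leftmost group, scaled by the largest order found
    num += word2num(text) * multiplier
    if text.split(' ')[0] == 'negative':
        num = num * -1
    return num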
test1 = word2num_trillion('negative one hundred nineteen billion, three hundred one million, twelve')
test2 = word2num_trillion('nine hundred ninety-nine billion, nine hundred ninety-nine million, nine hundred ninety-nine thousand, nine hundred ninety-nine')
test3 = word2num_trillion('twenty-five')
test4 = word2num_trillion('nineteen thousand')
# test1 = word2num('seven hundred thirty-two')
# test2 = word2num('fifteen')
# test3 = word2num('twenty-five')
print(f'Test 1: {test1}')
print(f'Test 2: {test2}')
print(f'Test 3: {test3}')
print(f'Test 4: {test4}')
Test 1: -119301000012
Test 2: 999999999999
Test 3: 25
Test 4: 19000
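For comparison, a single left-to-right pass is another way to approach the same problem. This is a sketch rather than the approach used above: `words_to_int`, `SMALL`, and `SCALES` are names invented here, and accepting 'minus' and skipping 'and' are assumptions about which inputs should be allowed.

```python
ONES = ('zero one two three four five six seven eight nine ten eleven twelve '
        'thirteen fourteen fifteen sixteen seventeen eighteen nineteen').split()
TENS = 'twenty thirty forty fifty sixty seventy eighty ninety'.split()

# word -> value for everything below one hundred
SMALL = {word: value for value, word in enumerate(ONES)}
SMALL.update({word: 10 * (i + 2) for i, word in enumerate(TENS)})

SCALES = {'thousand': 10**3, 'million': 10**6,
          'billion': 10**9, 'trillion': 10**12}


def words_to_int(text):
    """Convert written-out English numbers to an int in one pass."""
    tokens = text.lower().replace(',', ' ').replace('-', ' ').split()
    sign = 1
    # Assumption: a leading 'negative' or 'minus' flips the sign
    if tokens and tokens[0] in ('negative', 'minus'):
        sign = -1
        tokens = tokens[1:]
    total = 0   # sum of the groups already closed by a scale word
    group = 0   # value of the 0-999 group currently being read
    for token in tokens:
        if token == 'and':
            continue
        if token in SMALL:
            group += SMALL[token]
        elif token == 'hundred':
            group *= 100
        elif token in SCALES:
            total += group * SCALES[token]
            group = 0
        else:
            raise ValueError(f'unrecognized word: {token!r}')
    return sign * (total + group)


print(words_to_int('negative one hundred nineteen billion, three hundred one million, twelve'))
print(words_to_int('one trillion'))
```

Raising an error on unknown words, instead of ignoring them, is a design choice made here so that typos such as 'fourty' are reported rather than silently dropped.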
4.2. Playing with Data#
4.2.1. Text data#
import requests
import csv
url = "https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/potter.txt"
response = requests.get(url)
potter_text = response.text
print(potter_text[:1000])
Scene:
A neighbourhood on a street called Privet Drive.
An owl, sitting on the street sign flies off to reveal a mysterious appearing old man walking through a forest near the street. He stops at the start of the street and takes out a mechanical device and zaps all the light out of the lampposts.
He puts away the device and a cat meows. The man, ALBUS DUMBLEDORE, looks down at the cat, which is a tabby and is sitting on a brick ledge.
Dumbledore: I should have known that you would be here...Professor McGonagall.
The cat meows, sniffs out and the camera pans back to a wall. The cats shadow is seen progressing into a human. There are footsteps and MINERVA MCGONAGALL is revealed.
McGonagall: Good evening, Professor Dumbledore. Are the rumours true, Albus?
Dumbledore: I'm afraid so, Professor. The good, and the bad.
McGonagall: And the boy?
Dumbledore: Hagrid is bringing him.
McGonagall: Do you think it wise to trust Hagrid with something as important as this
Data story ideas:
frequency of specific keywords (e.g. Harry): histogram or bar graph, word cloud
number of characters and how often each appears: bar graph
number of lines for each character and their length (number of words per line): histogram
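As a starting point for the first two ideas, the counts can be gathered with `collections.Counter` before any plotting. The sketch below runs on a few lines copied from the excerpt printed above; `count_speaker_lines` and `keyword_count` are hypothetical helper names, and treating any short `Name: speech` line as dialogue is an assumption about how the transcript is formatted.

```python
import re
from collections import Counter

sample = """Scene:
A neighbourhood on a street called Privet Drive.
Dumbledore: I should have known that you would be here...Professor McGonagall.
McGonagall: Good evening, Professor Dumbledore. Are the rumours true, Albus?
Dumbledore: I'm afraid so, Professor. The good, and the bad.
McGonagall: And the boy?
Dumbledore: Hagrid is bringing him.
"""


def count_speaker_lines(script):
    """Count lines of dialogue per speaker."""
    counts = Counter()
    for line in script.splitlines():
        speaker, sep, speech = line.partition(':')
        # Assumption: dialogue looks like 'Name: words', with a name of at
        # most three words; 'Scene:' has nothing after the colon and is skipped
        if sep and speech.strip() and len(speaker.split()) <= 3:
            counts[speaker.strip()] += 1
    return counts


def keyword_count(script, keyword):
    """Count case-insensitive whole-word occurrences of keyword."""
    words = re.findall(r"[a-z']+", script.lower())
    return Counter(words)[keyword.lower()]


print(count_speaker_lines(sample))
print(keyword_count(sample, 'Hagrid'))
```

The same functions could be called on `potter_text`; the resulting counters map directly onto the bar graphs suggested above.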
4.2.2. Election Data#
data_url = "https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/countypres_2000-2024.csv"
csv_response = requests.get(data_url)
lines = csv_response.text.splitlines()
reader = csv.DictReader(lines)
elec_dict = {header: [] for header in reader.fieldnames}
for row in reader:
    for header in reader.fieldnames:
        elec_dict[header].append(row[header])
elec_dict.keys()
dict_keys(['year', 'state', 'state_po', 'county_name', 'county_fips', 'office', 'candidate', 'party', 'candidatevotes', 'totalvotes', 'version', 'mode'])
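# convert each column to floats where possible
for key in elec_dict.keys():
    try:
        elec_dict[key] = [float(v) for v in elec_dict[key]]
    except ValueError:
        # non-numeric column (e.g. state, candidate): leave it as strings
        pass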
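To show what the column-wise dictionary makes possible, here is a small sketch that repeats the same steps on four rows taken from the table for this dataset and then totals votes by party. The names `sample_csv`, `columns`, and `votes_by_party` are made up for this example, and only a subset of the real columns is included.

```python
import csv
from collections import defaultdict

# A few rows from the election data, trimmed to five columns
sample_csv = """year,county_name,candidate,party,candidatevotes
2000,AUTAUGA,AL GORE,DEMOCRAT,4942
2000,AUTAUGA,GEORGE W. BUSH,REPUBLICAN,11993
2000,AUTAUGA,RALPH NADER,GREEN,160
2000,BALDWIN,AL GORE,DEMOCRAT,13997
"""

# Build one list per column, as above
reader = csv.DictReader(sample_csv.splitlines())
columns = {header: [] for header in reader.fieldnames}
for row in reader:
    for header in reader.fieldnames:
        columns[header].append(row[header])

# Convert numeric columns to floats; text columns stay as strings
for header, values in columns.items():
    try:
        columns[header] = [float(v) for v in values]
    except ValueError:
        pass

# Parallel lists can be walked together with zip
votes_by_party = defaultdict(float)
for party, votes in zip(columns['party'], columns['candidatevotes']):
    votes_by_party[party] += votes

print(dict(votes_by_party))
```

Because every column is a list of the same length, `zip` keeps the entries of each row aligned without needing the original row dictionaries.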
import pandas as pd
elec_df = pd.read_csv(data_url)
elec_df
| | year | state | state_po | county_name | county_fips | office | candidate | party | candidatevotes | totalvotes | version | mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | ALABAMA | AL | AUTAUGA | 1001.0 | US PRESIDENT | AL GORE | DEMOCRAT | 4942 | 17208 | 20250821 | TOTAL |
| 1 | 2000 | ALABAMA | AL | AUTAUGA | 1001.0 | US PRESIDENT | GEORGE W. BUSH | REPUBLICAN | 11993 | 17208 | 20250821 | TOTAL |
| 2 | 2000 | ALABAMA | AL | AUTAUGA | 1001.0 | US PRESIDENT | OTHER | OTHER | 113 | 17208 | 20250821 | TOTAL |
| 3 | 2000 | ALABAMA | AL | AUTAUGA | 1001.0 | US PRESIDENT | RALPH NADER | GREEN | 160 | 17208 | 20250821 | TOTAL |
| 4 | 2000 | ALABAMA | AL | BALDWIN | 1003.0 | US PRESIDENT | AL GORE | DEMOCRAT | 13997 | 56480 | 20250821 | TOTAL |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94404 | 2024 | WYOMING | WY | WESTON | 56045.0 | US PRESIDENT | DONALD J TRUMP | REPUBLICAN | 3069 | 3512 | 20250821 | NaN |
| 94405 | 2024 | WYOMING | WY | WESTON | 56045.0 | US PRESIDENT | KAMALA D HARRIS | DEMOCRAT | 378 | 3512 | 20250821 | NaN |
| 94406 | 2024 | WYOMING | WY | WESTON | 56045.0 | US PRESIDENT | OTHER | OTHER | 18 | 3512 | 20250821 | NaN |
| 94407 | 2024 | WYOMING | WY | WESTON | 56045.0 | US PRESIDENT | OVERVOTES | NaN | 1 | 3512 | 20250821 | NaN |
| 94408 | 2024 | WYOMING | WY | WESTON | 56045.0 | US PRESIDENT | UNDERVOTES | NaN | 20 | 3512 | 20250821 | NaN |
94409 rows × 12 columns
Data story ideas:
candidate with most votes in each county (for a given year)
line graph of how total votes per party change over time
snapshot of the results from a single election
trends in party by national/state/county
demographic data (e.g. education, median income, population density, racial demographics) vs votes for either party (or lean)
percentage of voter turnout for every county vs party lean
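The first two ideas could be approached with a pandas `groupby`. The sketch below uses a tiny hand-built frame (`sample_df`, a hypothetical name) holding rows copied from the table above, so it runs without downloading anything. It assumes each candidate appears once per county and year; if the `mode` column splits a county's votes across several rows, those rows would likely need to be summed first.

```python
import pandas as pd

rows = [
    (2000, 'ALABAMA', 'AUTAUGA', 'AL GORE', 'DEMOCRAT', 4942),
    (2000, 'ALABAMA', 'AUTAUGA', 'GEORGE W. BUSH', 'REPUBLICAN', 11993),
    (2000, 'ALABAMA', 'AUTAUGA', 'OTHER', 'OTHER', 113),
    (2000, 'ALABAMA', 'AUTAUGA', 'RALPH NADER', 'GREEN', 160),
    (2024, 'WYOMING', 'WESTON', 'DONALD J TRUMP', 'REPUBLICAN', 3069),
    (2024, 'WYOMING', 'WESTON', 'KAMALA D HARRIS', 'DEMOCRAT', 378),
    (2024, 'WYOMING', 'WESTON', 'OTHER', 'OTHER', 18),
]
sample_df = pd.DataFrame(
    rows,
    columns=['year', 'state', 'county_name', 'candidate', 'party', 'candidatevotes'],
)

# Row label of the top vote-getter within each (year, state, county) group
top_idx = sample_df.groupby(['year', 'state', 'county_name'])['candidatevotes'].idxmax()
winners = sample_df.loc[top_idx, ['year', 'state', 'county_name', 'candidate', 'candidatevotes']]

# Total votes per party per year: one row per year, one column per party
party_totals = (
    sample_df.groupby(['year', 'party'])['candidatevotes']
    .sum()
    .unstack(fill_value=0)
)

print(winners)
print(party_totals)
```

Applied to `elec_df`, `party_totals` would have one row per election year, which is the shape a line graph of party totals over time needs.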