4. In class problem-solving!

4.1. Approaching programming problems

Write a function that converts numbers from written English to numeric form. The function should handle numbers from zero through one trillion (and negatives?).

4.1.1. Strategy

Understand the question.
Write out some cases: inputs and expected outputs. Include standard and non-standard cases.
Look for patterns. Patterns suggest reusable elements (functions, loops, etc.).
Break the problem down into tasks and work incrementally, testing frequently.
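
The second step, writing out cases before writing any code, can be made concrete as a small table of inputs and expected outputs plus a helper that reports mismatches. This is only an illustrative sketch: `cases`, `check`, and the deliberately incomplete `digits_only` converter are hypothetical names, not part of the in-class solution.

```python
# Hypothetical case table: (input text, expected number).
# Standard cases first, then non-standard ones (hyphens, larger orders, negatives).
cases = [
    ('zero', 0),
    ('seven', 7),
    ('twenty-five', 25),
    ('nine hundred ninety-nine', 999),
    ('nineteen thousand', 19000),
    ('negative twelve', -12),
]

def check(convert, cases):
    """Return (text, expected, actual) for every case that convert gets wrong."""
    failures = []
    for text, expected in cases:
        try:
            actual = convert(text)
        except Exception as err:   # a crash counts as a failure too
            actual = repr(err)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

def digits_only(text):
    """Stand-in converter (an assumption for this demo): handles 'zero'..'nine' only."""
    words = ['zero', 'one', 'two', 'three', 'four',
             'five', 'six', 'seven', 'eight', 'nine']
    return words.index(text)

failures = check(digits_only, cases)
print(f'{len(cases) - len(failures)} of {len(cases)} cases pass')
for text, expected, actual in failures:
    print(f'  {text!r}: expected {expected}, got {actual}')
```

Rerunning the same table after each incremental change shows which cases a new piece of code fixes and which it breaks.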

# write a function that converts text numbers to numeric for 0-999

def word2num(text):
    text2dig = {
        'zero' : 0,
        'one' : 1,
        'two' : 2,
        'three' : 3,
        'four' : 4,
        'five' : 5,
        'six' : 6,
        'seven' : 7,
        'eight' : 8,
        'nine' : 9,
        'ten' : 10,
        'eleven' : 11,
        'twelve' : 12,
        'thirteen' : 13,
        'fourteen' : 14,
        'fifteen' : 15,
        'sixteen' : 16,
        'seventeen' : 17,
        'eighteen' : 18,
        'nineteen' : 19,
        'twenty' : 20,
        'thirty': 30,
        'forty': 40,
        'fifty': 50,
        'sixty': 60,
        'seventy': 70,
        'eighty': 80,
        'ninety': 90,
    }
    
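    # Drop commas, treat hyphens as spaces, then split into individual words
    wordlist = text.replace(',', '').replace('-', ' ').split()
    
    num = 0
    multiplier = 1
    # Read right to left: every word to the left of 'hundred' counts hundreds
    for word in wordlist[::-1]:
        if word == 'hundred':
            multiplier = 100
        elif word in text2dig:
            num += text2dig[word] * multiplier
    
    # Unrecognized words (e.g. 'negative') are skipped
    return num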


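def word2num_trillion(text):
    # Order-of-magnitude words, smallest first; the loop below relies on this order
    ordermag = {
        'thousand' : 1000,
        'million' : 1000000,
        'billion' : 1000000000,
        'trillion' : 1000000000000,
    }
    
    num = 0
    multiplier = 1
    
    for order in ordermag:
        # Split the text at this order word, if it appears
        text_list = text.split(order)
        
        if len(text_list) == 1:
            continue
        
        text = text_list[0]
        R_text = text_list[1]
        
        # R_text is the 0-999 group that belongs to the previous (smaller) order
        num += word2num(R_text) * multiplier
        multiplier = ordermag[order]
        
        # print(f'Order {order}:\t{text} _____ {R_text}')
     
    # Whatever text is left belongs to the largest order found
    num += word2num(text) * multiplier   
    
    if text.lstrip().startswith('negative'):
        num = num * -1

    return num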

test1 = word2num_trillion('negative one hundred nineteen billion, three hundred one million, twelve')
test2 = word2num_trillion('nine hundred ninety-nine billion, nine hundred ninety-nine million, nine hundred ninety-nine thousand, nine hundred ninety-nine')
test3 = word2num_trillion('twenty-five')
test4 = word2num_trillion('nineteen thousand') 

# test1 = word2num('seven hundred thirty-two')
# test2 = word2num('fifteen')
# test3 = word2num('twenty-five')

print(f'Test 1: {test1}')
print(f'Test 2: {test2}')
print(f'Test 3: {test3}')
print(f'Test 4: {test4}')

Test 1: -119301000012
Test 2: 999999999999
Test 3: 25
Test 4: 19000
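
The solution above splits the text on each order word. The same pattern (every group of three digits is read the same way) can also be handled in a single left-to-right pass with a running group total. The sketch below is an alternative, not the in-class method. The names `words_to_int`, `SMALL`, and `SCALES` are made up here, and raising `ValueError` on an unknown word is a design choice of this sketch.

```python
ONES = ('zero one two three four five six seven eight nine ten eleven twelve '
        'thirteen fourteen fifteen sixteen seventeen eighteen nineteen').split()
TENS = 'twenty thirty forty fifty sixty seventy eighty ninety'.split()

# Words worth 0-19, plus the multiples of ten from 20 to 90
SMALL = {word: value for value, word in enumerate(ONES)}
SMALL.update({word: 10 * (i + 2) for i, word in enumerate(TENS)})

SCALES = {'thousand': 10**3, 'million': 10**6,
          'billion': 10**9, 'trillion': 10**12}

def words_to_int(text):
    """Convert written English (zero through the trillions) to an int."""
    words = text.lower().replace(',', ' ').replace('-', ' ').split()

    sign = 1
    if words and words[0] == 'negative':
        sign = -1
        words = words[1:]

    total = 0   # sum of the groups already closed by a scale word
    group = 0   # value of the current 0-999 group
    for word in words:
        if word in SMALL:
            group += SMALL[word]
        elif word == 'hundred':
            group *= 100
        elif word in SCALES:
            total += group * SCALES[word]
            group = 0
        else:
            raise ValueError(f'unrecognized word: {word!r}')

    return sign * (total + group)

print(words_to_int('twenty-five'))
print(words_to_int('nineteen thousand'))
print(words_to_int('one trillion'))
```

Because unknown words raise an error instead of being skipped, a misspelled input fails loudly rather than returning a wrong number.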

4.2. Playing with Data

4.2.1. Text data

import requests
import csv

url = "https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/potter.txt"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
potter_text = response.text

print(potter_text[:1000])

Scene:

A neighbourhood on a street called Privet Drive.

An owl, sitting on the street sign flies off to reveal a mysterious appearing old man walking through a forest near the street. He stops at the start of the street and takes out a mechanical device and zaps all the light out of the lampposts.

He puts away the device and a cat meows. The man, ALBUS DUMBLEDORE, looks down at the cat, which is a tabby and is sitting on a brick ledge.

Dumbledore: I should have known that you would be here...Professor McGonagall.

The cat meows, sniffs out and the camera pans back to a wall. The cats shadow is seen progressing into a human. There are footsteps and MINERVA MCGONAGALL is revealed.

McGonagall: Good evening, Professor Dumbledore. Are the rumours true, Albus?

Dumbledore: I'm afraid so, Professor. The good, and the bad.

McGonagall: And the boy?

Dumbledore: Hagrid is bringing him.

McGonagall: Do you think it wise to trust Hagrid with something as important as this

Data story ideas:

frequency of specific keywords (e.g. Harry)
- histogram or bar graph
- word cloud
number of characters and how often each one appears
- bar graph
number of lines spoken by each character and the length of those lines (number of words per line)
- histogram
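
As a starting point for the first and third ideas, the sketch below counts one keyword and collects the word count of each spoken line per character. It runs on a few lines copied from the excerpt above so that it works offline. The helper names are hypothetical, and the pattern assumes a spoken line always looks like `Name: words...`, which may not hold for every line of the full script.

```python
import re
from collections import Counter

# A few lines copied from the script excerpt, so the example needs no download
sample = """Dumbledore: I should have known that you would be here...Professor McGonagall.

The cat meows, sniffs out and the camera pans back to a wall.

McGonagall: Good evening, Professor Dumbledore. Are the rumours true, Albus?

Dumbledore: I'm afraid so, Professor. The good, and the bad.

McGonagall: And the boy?

Dumbledore: Hagrid is bringing him.
"""

# Assumption: a spoken line is "Name: words...", with text after the colon
SPEAKER_LINE = re.compile(r"^([A-Z][A-Za-z.' ]*):\s+(.+)$")

def keyword_count(text, keyword):
    """Count case-insensitive occurrences of keyword as a whole word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)[keyword.lower()]

def words_per_line_by_speaker(text):
    """Map each speaker to a list with the word count of each of their lines."""
    speeches = {}
    for line in text.splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if match:
            speaker, speech = match.groups()
            speeches.setdefault(speaker, []).append(len(speech.split()))
    return speeches

print(keyword_count(sample, 'Professor'))
for speaker, lengths in words_per_line_by_speaker(sample).items():
    print(f'{speaker}: {len(lengths)} lines, words per line {lengths}')
```

The per-speaker lists can be passed directly to a plotting library to draw the bar graph or histogram.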

4.2.2. Election Data

data_url = "https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/countypres_2000-2024.csv"
csv_response = requests.get(data_url)
lines = csv_response.text.splitlines()

# Build a dictionary of columns: each header maps to a list of that column's values
reader = csv.DictReader(lines)
elec_dict = {header: [] for header in reader.fieldnames}
for row in reader:
    for header in reader.fieldnames:
        elec_dict[header].append(row[header])

elec_dict.keys()

dict_keys(['year', 'state', 'state_po', 'county_name', 'county_fips', 'office', 'candidate', 'party', 'candidatevotes', 'totalvotes', 'version', 'mode'])

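# Convert a column to floats when every value is numeric; otherwise leave it as strings
for key in elec_dict.keys():
    try:
        elec_dict[key] = [float(v) for v in elec_dict[key]]
    except (ValueError, TypeError):
        pass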

import pandas as pd

elec_df = pd.read_csv(data_url)
elec_df

	year	state	state_po	county_name	county_fips	office	candidate	party	candidatevotes	totalvotes	version	mode
0	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	AL GORE	DEMOCRAT	4942	17208	20250821	TOTAL
1	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	GEORGE W. BUSH	REPUBLICAN	11993	17208	20250821	TOTAL
2	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	OTHER	OTHER	113	17208	20250821	TOTAL
3	2000	ALABAMA	AL	AUTAUGA	1001.0	US PRESIDENT	RALPH NADER	GREEN	160	17208	20250821	TOTAL
4	2000	ALABAMA	AL	BALDWIN	1003.0	US PRESIDENT	AL GORE	DEMOCRAT	13997	56480	20250821	TOTAL
...	...	...	...	...	...	...	...	...	...	...	...	...
94404	2024	WYOMING	WY	WESTON	56045.0	US PRESIDENT	DONALD J TRUMP	REPUBLICAN	3069	3512	20250821	NaN
94405	2024	WYOMING	WY	WESTON	56045.0	US PRESIDENT	KAMALA D HARRIS	DEMOCRAT	378	3512	20250821	NaN
94406	2024	WYOMING	WY	WESTON	56045.0	US PRESIDENT	OTHER	OTHER	18	3512	20250821	NaN
94407	2024	WYOMING	WY	WESTON	56045.0	US PRESIDENT	OVERVOTES	NaN	1	3512	20250821	NaN
94408	2024	WYOMING	WY	WESTON	56045.0	US PRESIDENT	UNDERVOTES	NaN	20	3512	20250821	NaN

94409 rows × 12 columns

Data story ideas:

- candidate with most votes in each county (for a given year)
- line graph of how total votes per party change over time
- snapshot of the results from a single election
- trends in party by national/state/county
- demographic data (e.g. education, median income, population density, racial demographics) vs votes for either party (or lean)
- percentage of voter turnout for every county vs party lean
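
The first idea, the candidate with the most votes in each county for a given year, can be sketched with the standard library alone. The rows below are copied from the table output above, with only a subset of the columns. The function name `county_winners` is hypothetical. On the full file, rows such as OVERVOTES and UNDERVOTES would probably need to be filtered out first.

```python
import csv
import io

# Rows copied from the table output above (subset of columns), so this runs offline
sample_csv = """year,state,county_name,candidate,party,candidatevotes
2000,ALABAMA,AUTAUGA,AL GORE,DEMOCRAT,4942
2000,ALABAMA,AUTAUGA,GEORGE W. BUSH,REPUBLICAN,11993
2000,ALABAMA,AUTAUGA,OTHER,OTHER,113
2000,ALABAMA,AUTAUGA,RALPH NADER,GREEN,160
2024,WYOMING,WESTON,DONALD J TRUMP,REPUBLICAN,3069
2024,WYOMING,WESTON,KAMALA D HARRIS,DEMOCRAT,378
2024,WYOMING,WESTON,OTHER,OTHER,18
"""

def county_winners(rows, year):
    """Map (state, county) to (candidate, votes) for the top vote-getter in year."""
    winners = {}
    for row in rows:
        if int(row['year']) != year:
            continue
        key = (row['state'], row['county_name'])
        votes = int(row['candidatevotes'])
        # Keep this row if the county is new or this candidate has more votes
        if key not in winners or votes > winners[key][1]:
            winners[key] = (row['candidate'], votes)
    return winners

rows = list(csv.DictReader(io.StringIO(sample_csv)))
winners_2000 = county_winners(rows, 2000)
for (state, county), (candidate, votes) in winners_2000.items():
    print(f'{county}, {state}: {candidate} ({votes} votes)')
```

Keying on the `(state, county_name)` pair rather than the county name alone avoids merging counties that share a name across states.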