4. In-class problem-solving!

4.1. Approaching programming problems

Write a function to convert numbers from written English to numeric form. The function should be able to convert numbers zero through one trillion (and negatives?).

4.1.1. Strategy

  • understand the question

  • write out some cases: inputs and expected outputs. Include standard and non-standard cases.

  • look for patterns. Patterns suggest reusable elements (functions, loops, etc).

  • break the problem down into tasks and work incrementally, testing frequently.
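
As a hedged illustration of the "write out some cases" step, the cases can be recorded as data before any conversion code exists. The names `CASES`, `run_cases`, and `only_small_numbers` are hypothetical helpers for this sketch, not part of the class solution, and the expected values were worked out by hand.

```python
# Hand-written cases: (input text, expected number).
# Standard cases first, then less typical ones (zero, commas, negatives).
CASES = [
    ("seven", 7),
    ("fifteen", 15),
    ("twenty-five", 25),
    ("seven hundred thirty-two", 732),
    ("zero", 0),
    ("nineteen thousand", 19000),
    ("one million, twelve", 1000012),
    ("negative forty", -40),
]


def run_cases(convert, cases=CASES):
    """Run `convert` on every case and return the list of failures.

    Each failure is a tuple (text, expected, actual). An empty list means
    every case passed.
    """
    failures = []
    for text, expected in cases:
        try:
            actual = convert(text)
        except Exception as err:  # a crash counts as a failure too
            actual = err
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


# A deliberately incomplete converter, only here to show the harness working.
def only_small_numbers(text):
    return {"zero": 0, "seven": 7, "fifteen": 15}[text]


for text, expected, actual in run_cases(only_small_numbers):
    print(f"{text!r}: expected {expected}, got {actual!r}")
```

Running the same list after each small change is one way to "work incrementally, testing frequently": the list of failures should shrink as the function grows.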

# write a function that converts text numbers to numeric for 0-999

def word2num(text):
    text2dig = {
        'zero' : 0,
        'one' : 1,
        'two' : 2,
        'three' : 3,
        'four' : 4,
        'five' : 5,
        'six' : 6,
        'seven' : 7,
        'eight' : 8,
        'nine' : 9,
        'ten' : 10,
        'eleven' : 11,
        'twelve' : 12,
        'thirteen' : 13,
        'fourteen' : 14,
        'fifteen' : 15,
        'sixteen' : 16,
        'seventeen' : 17,
        'eighteen' : 18,
        'nineteen' : 19,
        'twenty' : 20,
        'thirty': 30,
        'forty': 40,
        'fifty': 50,
        'sixty': 60,
        'seventy': 70,
        'eighty': 80,
        'ninety': 90,
    }
    
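    # drop commas, treat hyphens as spaces, then split into individual words
    wordlist = text.replace(',', '').replace('-', ' ').split()
    
    num = 0
    multiplier = 1
    # read the words right to left: ones and tens come first, and once
    # 'hundred' is seen, the remaining number word counts hundreds.
    # Words that are not in text2dig (e.g. 'negative') are ignored here.
    for word in wordlist[::-1]:
        if word == 'hundred':
            multiplier = 100
            
        if word in text2dig:
            num += text2dig[word] * multiplier
    
    return num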


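def word2num_trillion(text):
    # orders of magnitude, smallest first, so the text is peeled apart from the right
    ordermag = {
        'thousand' : 1000,
        'million' : 1000000,
        'billion' : 1000000000,
        'trillion' : 1000000000000
    }
    
    num = 0
    multiplier = 1
    
    for order in ordermag:
        text_list = text.split(order)
        
        # this order of magnitude does not appear in the text
        if len(text_list) == 1:
            continue
        
        # left of the keyword: still to be processed; right: a 0-999 group
        text = text_list[0]
        R_text = text_list[1]
        
        # the group on the right belongs to the previous (smaller) multiplier
        num += word2num(R_text) * multiplier
        multiplier = ordermag[order]
        
        # print(f'Order {order}:\t{text} _____ {R_text}')
     
    # whatever remains on the left belongs to the largest order found
    num += word2num(text) * multiplier
    
    if text.split(' ')[0] == 'negative':
        num = num * -1

    return num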

test1 = word2num_trillion('negative one hundred nineteen billion, three hundred one million, twelve')
test2 = word2num_trillion('nine hundred ninety-nine billion, nine hundred ninety-nine million, nine hundred ninety-nine thousand, nine hundred ninety-nine')
test3 = word2num_trillion('twenty-five')
test4 = word2num_trillion('nineteen thousand') 

# test1 = word2num('seven hundred thirty-two')
# test2 = word2num('fifteen')
# test3 = word2num('twenty-five')

print(f'Test 1: {test1}')
print(f'Test 2: {test2}')
print(f'Test 3: {test3}')
print(f'Test 4: {test4}')
Test 1: -119301000012
Test 2: 999999999999
Test 3: 25
Test 4: 19000
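
The same pattern (groups of 0-999 scaled by thousand, million, ...) can also be organised as a single left-to-right pass. The sketch below is an alternative offered under assumptions, not the solution developed in class: `words_to_int`, `SMALL`, and `SCALES` are hypothetical names, and unlike the functions above it raises an error on words it does not recognise instead of ignoring them.

```python
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9, "trillion": 10**12}


def words_to_int(text):
    """Convert written English (e.g. 'twenty-five') to an int in one pass."""
    words = text.lower().replace(",", " ").replace("-", " ").split()

    sign = 1
    if words and words[0] == "negative":
        sign = -1
        words = words[1:]

    total = 0  # value of the groups already closed by a scale word
    group = 0  # value of the group currently being read (0-999)
    for word in words:
        if word in SMALL:
            group += SMALL[word]
        elif word == "hundred":
            group *= 100
        elif word in SCALES:
            total += group * SCALES[word]
            group = 0
        elif word == "and":
            continue  # tolerate 'one hundred and one'
        else:
            raise ValueError(f"unrecognised word: {word!r}")
    return sign * (total + group)


print(words_to_int("negative one hundred nineteen billion, three hundred one million, twelve"))
print(words_to_int("one trillion"))
```

Keeping `total` and `group` separate is the design choice that matters here: "hundred" only multiplies the group being read, while a scale word closes that group and adds it to the total.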

4.2. Playing with Data

4.2.1. Text data

import requests
import csv

url = "https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/potter.txt"
response = requests.get(url)
potter_text = response.text
print(potter_text[:1000])
Scene:

A neighbourhood on a street called Privet Drive.



An owl, sitting on the street sign flies off to reveal a mysterious appearing old man walking through a forest near the street. He stops at the start of the street and takes out a mechanical device and zaps all the light out of the lampposts.



He puts away the device and a cat meows. The man, ALBUS DUMBLEDORE, looks down at the cat, which is a tabby and is sitting on a brick ledge.



Dumbledore: I should have known that you would be here...Professor McGonagall.



The cat meows, sniffs out and the camera pans back to a wall. The cats shadow is seen progressing into a human. There are footsteps and MINERVA MCGONAGALL is revealed.







McGonagall: Good evening, Professor Dumbledore. Are the rumours true, Albus?



Dumbledore: I'm afraid so, Professor. The good, and the bad.





McGonagall: And the boy?

Dumbledore: Hagrid is bringing him.

McGonagall: Do you think it wise to trust Hagrid with something as important as this

Data story ideas:

  • frequency of specific keywords (e.g. Harry)

    • histogram or maybe bar graph

    • word cloud

  • number of characters and the frequency of their appearances

    • bar graph

  • number of lines for each character and their length (number of words per line)

    • histogram
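
A minimal sketch of how the counts behind these ideas might be computed, using a few spoken lines copied from the excerpt printed above rather than the full download. The helper names (`lines_by_speaker`, `keyword_count`) and the pattern `SPEAKER_LINE` are assumptions made for illustration. The pattern assumes a spoken line starts with a single capitalised name, a colon, and some text, so narration and a bare "Scene:" line are skipped; the full script may need a looser rule.

```python
import re
from collections import Counter

# A short excerpt in the same "Speaker: line" layout as the printed text.
sample_text = """Dumbledore: I should have known that you would be here...Professor McGonagall.

McGonagall: Good evening, Professor Dumbledore. Are the rumours true, Albus?

Dumbledore: I'm afraid so, Professor. The good, and the bad.

McGonagall: And the boy?

Dumbledore: Hagrid is bringing him.
"""

# Assumed layout of a spoken line: one capitalised name, a colon, then text.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Za-z]+): (.+)$")


def lines_by_speaker(text):
    """Return {speaker: [spoken line, ...]} for every line matching SPEAKER_LINE."""
    spoken = {}
    for raw in text.splitlines():
        match = SPEAKER_LINE.match(raw.strip())
        if match:
            speaker, line = match.groups()
            spoken.setdefault(speaker, []).append(line)
    return spoken


def keyword_count(text, keyword):
    """Count whole-word, case-insensitive occurrences of `keyword`."""
    return len(re.findall(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE))


spoken = lines_by_speaker(sample_text)

# number of lines per character (input for a bar graph)
line_counts = Counter({name: len(lines) for name, lines in spoken.items()})

# words per line for each character (input for a histogram)
words_per_line = {name: [len(line.split()) for line in lines]
                  for name, lines in spoken.items()}

print(line_counts.most_common())
print(words_per_line)
print(keyword_count(sample_text, "Professor"))
```

To try it on the whole script, pass `potter_text` in place of `sample_text`.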

4.2.2. Election Data

data_url = "https://raw.githubusercontent.com/GettysburgDataScience/datasets/refs/heads/main/countypres_2000-2024.csv"
csv_response = requests.get(data_url)
lines = csv_response.text.splitlines()

reader = csv.DictReader(lines)
elec_dict = {header: [] for header in reader.fieldnames}
for row in reader:
    for header in reader.fieldnames:
        elec_dict[header].append(row[header])     
elec_dict.keys()
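dict_keys(['year', 'state', 'state_po', 'county_name', 'county_fips', 'office', 'candidate', 'party', 'candidatevotes', 'totalvotes', 'version', 'mode'])

# Convert a column to floats only if every value in it parses as a number;
# columns with any non-numeric value (such as state names) stay as strings.
for key in elec_dict.keys():
    try:
        elec_dict[key] = [float(v) for v in elec_dict[key]]
    except (ValueError, TypeError):
        pass
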
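
The read-then-convert steps above can be tried without a download. The sketch below repeats them on a tiny inline CSV whose three rows are taken from the DataFrame output shown in this section. `read_columns` and `numeric_columns` are hypothetical helper names, and `io.StringIO` stands in for the downloaded text.

```python
import csv
import io

# Header plus three rows, with only a subset of the real columns.
sample_csv = """year,state,candidate,candidatevotes,totalvotes
2000,ALABAMA,AL GORE,4942,17208
2000,ALABAMA,GEORGE W. BUSH,11993,17208
2000,ALABAMA,RALPH NADER,160,17208
"""


def read_columns(csv_text):
    """Parse CSV text into {header: [values, ...]}, every value a string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {header: [] for header in reader.fieldnames}
    for row in reader:
        for header in reader.fieldnames:
            columns[header].append(row[header])
    return columns


def numeric_columns(columns):
    """Convert a column to floats only if every value in it parses."""
    converted = {}
    for header, values in columns.items():
        try:
            converted[header] = [float(v) for v in values]
        except (ValueError, TypeError):
            converted[header] = values  # keep text columns untouched
    return converted


cols = numeric_columns(read_columns(sample_csv))
for header, values in cols.items():
    print(header, type(values[0]).__name__)
```

The conversion is all-or-nothing per column: a single value that does not parse leaves the whole column as strings, which is one reason to let pandas handle type inference for larger files.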
import pandas as pd

elec_df = pd.read_csv(data_url)
elec_df
       year    state  state_po  county_name  county_fips        office        candidate       party  candidatevotes  totalvotes   version   mode
    0  2000  ALABAMA        AL      AUTAUGA       1001.0  US PRESIDENT          AL GORE    DEMOCRAT            4942       17208  20250821  TOTAL
    1  2000  ALABAMA        AL      AUTAUGA       1001.0  US PRESIDENT   GEORGE W. BUSH  REPUBLICAN           11993       17208  20250821  TOTAL
    2  2000  ALABAMA        AL      AUTAUGA       1001.0  US PRESIDENT            OTHER       OTHER             113       17208  20250821  TOTAL
    3  2000  ALABAMA        AL      AUTAUGA       1001.0  US PRESIDENT      RALPH NADER       GREEN             160       17208  20250821  TOTAL
    4  2000  ALABAMA        AL      BALDWIN       1003.0  US PRESIDENT          AL GORE    DEMOCRAT           13997       56480  20250821  TOTAL
  ...   ...      ...       ...          ...          ...           ...              ...         ...             ...         ...       ...    ...
94404  2024  WYOMING        WY       WESTON      56045.0  US PRESIDENT   DONALD J TRUMP  REPUBLICAN            3069        3512  20250821    NaN
94405  2024  WYOMING        WY       WESTON      56045.0  US PRESIDENT  KAMALA D HARRIS    DEMOCRAT             378        3512  20250821    NaN
94406  2024  WYOMING        WY       WESTON      56045.0  US PRESIDENT            OTHER       OTHER              18        3512  20250821    NaN
94407  2024  WYOMING        WY       WESTON      56045.0  US PRESIDENT        OVERVOTES         NaN               1        3512  20250821    NaN
94408  2024  WYOMING        WY       WESTON      56045.0  US PRESIDENT       UNDERVOTES         NaN              20        3512  20250821    NaN

94409 rows × 12 columns

Data story ideas:

  • candidate with most votes in each county (for a given year)

  • line graph of how total votes per party change over time

  • snapshot of the results from a single election

  • trends in party by national/state/county

  • demographic data (e.g. education, median income, population density, racial demographics) vs votes for either party (or lean)

  • percentage of voter turnout for every county vs party lean
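
As a starting point for the first two ideas, here is a hedged sketch on a small hand-built frame. Its rows are the Autauga (2000) and Weston (2024) candidate rows from the table above, with only a subset of the columns. `sample_df` and `county_winners` are hypothetical names. The sketch assumes one row per candidate per county and year; if the full file splits a candidate's votes across several `mode` rows, they would need to be summed first.

```python
import pandas as pd

cols = ["year", "state", "county_name", "candidate", "party",
        "candidatevotes", "totalvotes"]
rows = [
    (2000, "ALABAMA", "AUTAUGA", "AL GORE", "DEMOCRAT", 4942, 17208),
    (2000, "ALABAMA", "AUTAUGA", "GEORGE W. BUSH", "REPUBLICAN", 11993, 17208),
    (2000, "ALABAMA", "AUTAUGA", "OTHER", "OTHER", 113, 17208),
    (2000, "ALABAMA", "AUTAUGA", "RALPH NADER", "GREEN", 160, 17208),
    (2024, "WYOMING", "WESTON", "DONALD J TRUMP", "REPUBLICAN", 3069, 3512),
    (2024, "WYOMING", "WESTON", "KAMALA D HARRIS", "DEMOCRAT", 378, 3512),
    (2024, "WYOMING", "WESTON", "OTHER", "OTHER", 18, 3512),
]
sample_df = pd.DataFrame(rows, columns=cols)


def county_winners(df, year):
    """Return one row per county: the candidate with the most votes that year."""
    one_year = df[df["year"] == year]
    # index label of the highest-vote row within each (state, county) group
    top_rows = one_year.groupby(["state", "county_name"])["candidatevotes"].idxmax()
    winners = one_year.loc[top_rows].copy()
    winners["share"] = winners["candidatevotes"] / winners["totalvotes"]
    return winners.reset_index(drop=True)


# total votes per party per year (input for a line graph over time)
party_totals = sample_df.groupby(["year", "party"])["candidatevotes"].sum()

print(county_winners(sample_df, 2000))
print(party_totals)
```

On the full data, `county_winners(elec_df, 2000)` would be the analogous call, subject to the assumption about `mode` noted above.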