TDM 10200: Project 6 — 2024

Context: Functions also help us to put our work into bite-size pieces that are easier to understand. The basic idea is similar to functions from R or from other languages and tools.

Scope: functions

Reading and Resources

Datasets

/anvil/projects/tdm/data/election/

We added seven new videos to help you with Project 6.

Questions

Question 1 (2 points)

When you first read the project, you might think, WOW!, this project is long, but it isn’t actually very long at all. Dr Ward just wanted to make sure that students understand the way to think about functions, namely: You need to first make sure that you understand what to do, and then check it a few times, and then (afterwards) write the function, and (finally) verify the function. [It is usually too hard to write the function without (first) understanding the work itself.]

Dr Ward broke all of this down into steps, and the sentences look long, but don’t be scared. Dr Ward is just breaking things down into little steps for you, so that the work is easier to do.

Functions should have simple inputs and should create helpful outputs. That way, a person can use the function to get good things done, without having to remember the details. For the election data, it is hard to remember where the files are located, and what the column names should be, etc. This question will create a useful function that only requires the user to start with 1 year as input, and it returns a dataframe as the output. That dataframe contains all of the election data for that year.

  1. First (without a function!) start with a variable called myyear. It can be an election year, and it helps if you try this with a few of the early 1980’s elections years, so that they data is not-too-big. For instance, try myyear with each of the years 1980, 1984, and 1988 (one at a time, not a list). Once you set the value of myyear, then make a variable for the path to the data from that year, for instance, /anvil/projects/tdm/data/election/itcont1984.txt

  2. Read in the data from the year stored in myyear. Please note that the data does not have column headers built-in, so you need to add the column headers, as in the "tip" below. Please also note that the elements in the data set has | between the data elements, not commas (this is why the file name ends in .txt not .csv in this case). So you need to use the parameter delimiter='|' in your read_csv function.

  3. Look at the head of the data set, after you read it in. Does it look correct? Does everything make sense? Try this for each of the years, 1980, then 1984, then 1988, one at a time, storing each of these values into the variable myyear and then read in the data for that year, and check the head of the data each time. This is good practice for making sure that your work is designed properly.

  4. Now that you are sure that your work is OK, make a function called read_election_year that takes one parameter called myyear as the input, and returns a data frame that contains the data from that election year. Make sure to document your function with a docstring to explain how it works.

    This might be helpful for reading in your data!

    mycolumnnames=["CMTE_ID","AMNDT_IND","RPT_TP","TRANSACTION_PGI","IMAGE_NUM","TRANSACTION_TP","ENTITY_TP","NAME","CITY","STATE","ZIP_CODE","EMPLOYER","OCCUPATION","TRANSACTION_DT","TRANSACTION_AMT","OTHER_ID","TRAN_ID","FILE_NUM","MEMO_CD","MEMO_TEXT","SUB_ID"]
    
    mydictionarytypes = dtype_dict = {"CMTE_ID": str, "AMNDT_IND": str, "RPT_TP": str, "TRANSACTION_PGI": str, "IMAGE_NUM": str, "TRANSACTION_TP": str, "ENTITY_TP": str, "NAME": str, "CITY": str, "STATE": str, "ZIP_CODE": str, "EMPLOYER": str, "OCCUPATION": str, "TRANSACTION_DT": str, "TRANSACTION_AMT": float, "OTHER_ID": str, "TRAN_ID": str, "FILE_NUM": str, "MEMO_CD": str, "MEMO_TEXT": str, "SUB_ID": int}
    
    myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", delimiter='|', names=mycolumnnames, dtype=mydictionarytypes)
    
    myDF['TRANSACTION_DT'] = pd.to_datetime(myDF['TRANSACTION_DT'], format="%m%d%Y")

    It might also help to have 2 cores for this project. You might be able to do it with 1 core, but it is probably easier for you with 2 cores.

Question 2 (2 points)

  1. First (without a function!) start with a variable called myyear, such as 1980, and find the number of (unique) committees that appear in the CMTE_ID column in that year. Then do the same for the year 1984, and then do this again for 1988. Print your results for each of these three years in separate cells.

  2. Now that you have part 2a working well, put your work from question 2a into a function. Namely, create a function called committees_function that accepts a year as input, and returns the number of (unique) committees that appear in the CMTE_ID column in that year. Use the function designed in Question 1 to help you accomplish this work.

  3. Test your function for each of the years 1980, 1984, and 1988. How many (unique) committees appear in each of these 3 individual years? The output from this question should show, for each year, how many (unique) committees appear in the data for each of those 3 years. The output for each of these 3 years should agree with your output from question 2a.

Do not add the results from the three years. Instead, use three separate cells to show the three separate years of output.

Question 3 (2 points)

The goal of this question is to find the top 5 states in a given year, according to the total (sum) of the values in the TRANSACTION_AMT column.

  1. First (without a function!) start with a variable called myyear, such as 1980, and find the total (sum) of the values from the TRANSACTION_AMT column for each state in the data set. You only need to print the top 5 results (i.e., the top 5 states and the total of the transaction amounts from those states) for 1980. Then do this again for 1984, and then do this again for 1988.

  2. Now that you have your work from Question 3a working well, build a function called top_five_states. This function should take 1 year as input, and should return the top 5 states and the total (sum) of the values for each of the 5 states, from the TRANSACTION_AMT column (for that state).

Question 4 (2 points)

The goal of this question is to identify the top 5 employers, according to the total (sum) of the values from the TRANSACTION_AMT column for each employer.

  1. First find the top 5 employers in each year 1980, 1984, and 1988, and print the top 5 for each of those years. Do this before you make a function.

  2. Once that is working, then build a function called top_employers that returns the top 5 employers in each year 1980, 1984, and 1988. Your results from question 4b should agree with your results from question 4a.

Question 5 (2 points)

  1. Experiment with the election data for the same 3 years as above (1980, 1984, 1988). Identify something that you find interesting each of those 3 years before you build a function.

  2. Wrap your interesting working into a function, and make sure that it matches your work from question 5a, for each of the 3 years.

Project 06 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project06.ipynb.

  • Python file with code and comments for the assignment

    • firstname-lastname-project06.py

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.