Tuesday, October 2, 2012

Data Aggregation and Decomposition

Data Aggregation and Decomposition



Standards
  • Common Core Standards-High School Statistics and Probability:
    • Understand statistics as a process for making inferences about population parameters based on a random sample from that population
    • Use data from a sample survey to estimate a population mean or proportion; develop a margin of error through the use of simulation models for random sampling

Overview
  • Students will:
    • Learn how to use/analyze data to draw conclusions about the world around them.
  • Students will improve their computational thinking by:
    • Collecting/aggregating data onto a spreadsheet.
    • Identifying patterns in their data.
    • Decomposing the data into specified groups for analysis and further pattern recognition.
    • Modifying an algorithm written in Python to facilitate the analysis of their data.  

Prerequisites
  • Students should be familiar with:

Materials

  • Python Starter Files (includes sample data file)
  • Class data spreadsheet (the instructor must collect data about the class prior to the lesson)



Suggested Time Frame


Terminology
  • Aggregation: assembly, mixture, combination, blend, composite of data collected from different sources or at different times.
  • Decomposition: to reduce data to parts of the whole to expose a different perspective or to analyze the data at a different level.  
  • Pattern recognition: the ability to find commonality in form, format, structure, scheme, method or plan.

Activity: Grouping Similar Classmates' Data
Teacher Notes:
The instructor must collect the following data and aggregate it into a spreadsheet prior to teaching this lesson: student names, genders, who ate breakfast, how many hours each student slept, each student’s height in inches, approximately how many hours of TV each student watches per week, approximately how many hours of exercise each student gets each week, and a ranking of how each person is feeling on a particular day (please see the Example Data Sheet below).

Although the class data may be manually aggregated into an excel spreadsheet, instructors with gmail accounts may facilitate this process by creating a form in gmail.  They can send this form to each student, and their responses will be instantly collected in a Google docs spreadsheet, identical to the Example Data Sheet below.  Each student must have a gmail account in order to access and complete the form.


  1. Spreadsheets are frequently used to aggregate data so that we can analyze it and extract information from it.  Use your classes’ data spreadsheet to complete the tasks below.
  2. Let's first count up all the females in your class:
    1. On your electronic spreadsheet under the female or male? column, highlight all of the 'f' responses in a color of your choice.  Use “Conditional Formatting”  (in Google Docs this tool is called “Change colors with rules”) instead of highlighting each ‘f’ individually.

Student Question:
  • How many females are on the list? A: Solution to all questions depend of your class data.


Teacher Note:
After students have had an opportunity to complete (a), (b), and (c) individually or in pairs, ask them who was able to use the “Conditional Formatting” tool.  Project the spreadsheet and demonstrate how this feature of a spreadsheet works, and what benefits it provides.


  1. Next, let's find out how many females slept for over 8 hours last night by highlighting each box in the "How many hours did you sleep last night?" column. Note: we recommend using a different color from the one you used for highlighting all females.
Student Questions:
(answers depend on class data)
  1. How many people on the list are both female and slept for more than 8 hours last night
  2. Finally, let's find out how many females slept for over than 8 hours last night AND watch less than 9 hours of television a week. 
  3. Highlight each number less than 9 in the column that displays each person’s hours of sleep to determine how many people on the list met all 3 criteria (female, slept over 8 hours, and watched less than 9 hours of TV a week)?









Activity: Using Python to Decompose Data

Teacher Notes:
Students should realize that when looking through a long list to identify data that satisfies several conditions, it is helpful to do more than just highlight.  We need to decompose and re-aggregate the data according to the specified conditions.  Once we learn how to read data into Python, we can use a Python program to do the work for us, quickly and accurately.  

Walk the class through Exercise 3, based on the generic Example Data Sheet below to step students through the process.  They can work as pairs to do the same thing for their own class data.  You can find this same data on the document labeled Example Data Sheet in the materials section of this lesson.


  1. Highlighting data can help us to visualize patterns and trends, but sometimes it is useful to go a step further and decompose the data.  Two ways by which we might decompose data are sorting and regrouping  Examine the sample data below:

Example Data Sheet

  1. We can program Python to read in this data.  Python uses ‘regular expressions’ (re) to help it understand the data. You can think of a regular expression as a ‘code’ that tells Python the form/type of information that it will read in. Here are three important regular expressions that we will use in this lesson:
    1. (\w+):  tells Python to expect letters (words)
    2. (\d+): tells Python to expect numbers (digits)
    3. \s : tells Python to expect a white space (space bar, tab, ...)

  1. Starting with Row 2 (Sam), read across the sheet.  Every time we see a cell that contain letters we will write (\w+) underneath it, and every time we see a cell with digits we will represent it with (\d+).  
  2. Complete the chart below, based on rows two and three of the Example Data Sheet from the previous page (answers are in bold):
Row 2
Sam
m
y
8
68
10
5
9
Regular Expression
(\w+)
(\w+)
(\w+)
(\d+)
(\d+)
(\d+)
(\d+)
(\d+)
Row 3
Jeremy
m
y
12
64
7
6
8
Regular Expression
(\w+)
(\w+)
(\w+)
(\d+)
(\d+)
(\d+)
(\d+)
(\d+)


  1. This same pattern of (\w+)(\w+)(\w+)(\d+) ... is found in every row of this chart .  However we are missing one important piece of the pattern, \s.  
  2. We hit the tab key every time we move one cell to the right.  Since tab is represented with \s, we need to separate each regular expression with \s.  
  3. With this in mind, rewrite the regular expression for Row 2:
Row 2
Sam
tab
m
tab
y
tab
8
tab
64
tab
10
tab
5
tab
9
RE
(\w+)
\s
(\w+)
\s
(\w+)
\s
(\d+)
\s
(\d+)
\s
(\d+
\s
(\d+
\s
(\d+)


  1. Use this pattern to complete the blank line of a Python program that will search for this pattern in the data contained on our spreadsheet.

Program 1
import re

count = 0
data = open('ExampleData.txt','r')#make sure you type the name of your spreadsheet exactly as you saved it

datasearch=re.findall('Replace this with the regular expression', data.read())
for column in datasearch:
  name = column[0]
  gender = column[1]
  breakfast = column[2]
  sleep = float(column[3])
  height = float(column[4])
  tv = float(column[5])
  exercise = float(column[6])
  if breakfast == 'y' and gender == 'f' and tv < 9:#change this line based #on what information you want
    print name
    count += 1
print '\ntotal:',count


  1. Now that Python knows what file to open and what pattern to look for in the data that the file contains, we assign variables to hold this data.
  2. Python always begins counting with zero, and the 0th column in the spreadsheet contains names, so we chose the variable name to hold all the names in column 0.  The next column contains the genders, so we chose the variable gender, hold all of the genders stored in column 1 and so on.
Teacher Note:
Project and run the program, using the data from the Example Data Sheet, so that students can see the results.  Be sure to download this document (found in the materials section of this lesson) and save it as a .txt file in the same folder in which you chose to save the Python program above.

Student Questions:
(questions are based on the sample data)
  1. How many people fit all 3 criterion, e.g. ate breakfast, are female, and watched less than 9 hours of television? A: 2
  2. What are their names? A: Mayra and Mikayla

Teacher Note:
Before student set to work using Python to analyze their own class data, carefully go over the four bullet points below.  If the columns of their class data are in the exact same order as the columns in the Example Data Sheet, they can ignore the second two bullet points.  However if any of the columns are interchanged or missing, students will need to rewrite their regular expression and rename the variables to match their spreadsheets.  

Instructors who collected the class data using a gmail form should delete the “timestamp” column that is automatically generated before distributing the spreadsheet to students.  If you do not wish to delete this column, the regular expression from the previous exercise must be modified to account for it.


  1. Now you are ready to analyze your own classes’ data.  Paste the code above into the Python editor and follow the steps below.
  2. Save the program in the same folder in which your spreadsheet is saved.
  3. Make sure the name of the file you tell Python to open in the third line of code,
data = open('ExampleDataSheet.txt','r'), is identical to the name of your class data spreadsheet.

  1. Make sure your pattern of regular expressions matches the pattern found in your classes’ data spreadsheet, (\w+)\s...
  2. Make sure the column data matches up with the correct variable names.  If your class data had different topics/headers than the sample data, you may want to choose new variable names that better describe the information they will hold.
  3. Run your program to find out how many people in your class (of either gender) did not eat breakfast AND slept less than 7 hours last night.  
  4. Now change your program so that it searches for people that did not eat breakfast OR  slept less than 7 hours last night.   
Student Question:
  • Explain why the results are different when we use or verses when we use and. A:When we use ‘and’ the only people who fit the criteria are those who did both.  When we use ‘or’, anyone who met either one (or both) of the criteria get counted.

Teacher Note:
After students have had the chance to complete all of Exercise 4, lead a discussion about the operators and vs. or.  Students should have noticed that the same people who did not eat breakfast and slept less than seven hours, are also on the list of people who did not eat breakfast or slept less than seven hours.  However there are several more people that meet this new criterion as well.  Ask students to stop and think about explain why this is, and explain their reasoning.
















Activity: Designing an Experiment

Teacher Note:
Up until this point, students have modified their Python program based on what they have been instructed to do.  They are now ready to make up their own criteria about what topics on the class data most affect how people are feeling on a given day.  They can use modify their program to test out their hypothesis against their class data.  Let them know that we can assume anyone feeling better than a 6 or 7 can be considered as feeling good, and anyone who gave themselves less than 6 is probably not feeling very well.


  1. Python helped us to decompose data and re-aggregate it according to whatever criteria we need. Now let's use this to design an experiment!
  2. Use your class survey and Python to figure out what criterion most strongly influences how a person is feeling.  
    1. Do you think gender has a strong influence on how a given person is feeling today?
  3. Change your program so that it checks for the number of females that ranked their feeling greater than 6, and then do the same for the number of males that ranked their feeling greater than 6.  
    1. How many females feel higher than 6?
    2. How many males feel higher than 6?
  4. Some people believe that the more TV a person watches, the better they feel.  They also believe that if a person sleeps more hours, they will feel better. Run your program to test of both these predictions.
    1. Record the number of people in your class who watch over 7 hours of TV a week and feel better than 6.
    2. Record the number of people in your class who slept over 8 hours last night and feel better than 6.
    3. Record the number of people who watch over 7 hours of TV a week and slept over 8 hours last night and feel better than 6.
  5. Make your own prediction. What criteria on your class data sheet do you think most effects how someone feels?
  6. Design your own experiment to find out what criteria have the strongest influence on how a person is feeling today. Describe what you will check for in your program. Make at least 3 predictions and run Python to test each of them. Based on the data you collect, write up any conclusions you can make on what criteria do and/or do not effect how a person feels today:


Activity: Calculating Percentages

Teacher Note:
As is, the program on the following page  reads in data from a spreadsheet called ExampleDataSheet.txt, using the original regular expression that we wrote in Exercise 3.  If your class’s data spreadsheet is saved under another name or has different headers/questions, students may need to change lines four and five of the code below before running the program.


  1. We can modify a couple lines at the end of our program to perform some basic statistical analysis.
  2. Run the program below to calculate the percent of the class that is female:










Program 2
import re

total = 0
female = 0

#Type the name of your spreadsheet exactly as you saved it
data = open('ExampleDataSheet.txt','r')

#Change the regular expression to match the columns on your #spreadsheet.
datasearch = re.findall('(\w+)\s(\w+)\s(\w+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)', data.read())for column in datasearch:
   name = column[0] 
   gender = column[1] 
   breakfast = column[2]
   sleep = float(column[3])
   height = float(column[4])
   tv = float(column[5])
   exercise = float(column[6]) 
   feeling = float(column[7])
   total += 1.0   
  if gender == 'f':#change this line for the information you want
   print name
   female += 1
   percent_female = (female/total)*100
   rounded = round(percent_female, 2)
print '\ntotal:',total
print 'total females:', female
print 'percent female:', rounded,'%'


  1. What is percent of the people represented in the spreadsheet are female?
  2. Modify the program to calculate the following:
    1. The percent of people who are female and who ate breakfast this morning.
    2. The percent of people who are male and who exercise more than 6 hours a week.
    3. The percent of people who ate breakfast and slept more than 6 hours last night.
  3. To answer these questions, you had to aggregate or decompose data! The data you used was found all around you - it is ubiquitous.


More lessons and examples can be found at Google’s Exploring Computational Thinking website.

Except as otherwise noted, the content of this lesson is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License.

No comments:

Post a Comment