Data Collections Chapter 11 (skip 11.3, 11.4, 11.5 & 11.6.3) Adapted from the online slides provided by John Zelle (http://mcsp.wartburg.edu/zelle/python/ppics2/index.html)
Objectives • To understand the use of lists (arrays) to represent a collection of related data. • To be familiar with the functions and methods available for manipulating Python lists. • To be able to write programs that use lists to manage a collection of information. • To understand the use of Python dictionaries for storing nonsequential collections.
Example Problem: Simple Statistics • Many programs deal with large collections of similar information: • Words in a document • Students in a course • Data from an experiment • Customers of a business • Graphics objects drawn on the screen • Cards in a deck
Sample Problem: Simple Statistics • Let’s review some code we wrote in chapter 8:
    # average4.py
    #    A program to average a set of numbers
    #    Illustrates sentinel loop using empty string as sentinel

    def main():
        sum = 0.0
        count = 0
        xStr = input("Enter a number (<Enter> to quit) >> ")
        while xStr != "":
            x = eval(xStr)
            sum = sum + x
            count = count + 1
            xStr = input("Enter a number (<Enter> to quit) >> ")
        print("\nThe average of the numbers is", sum / count)

    main()
Sample Problem: Simple Statistics • This program allows the user to enter a sequence of numbers, but the program itself doesn’t keep track of the numbers that were entered; it only keeps a running total. • Suppose we want to extend the program to compute not only the mean, but also the median and standard deviation.
Sample Problem: Simple Statistics • The median is the data value that splits the data into equal-sized parts. • For the data 2, 4, 6, 9, 13, the median is 6, since there are two values greater than 6 and two values that are smaller. • One way to determine the median is to store all the numbers, sort them, and identify the middle value.
Sample Problem: Simple Statistics • The standard deviation is a measure of how spread out the data is relative to the mean. • If the data is tightly clustered around the mean, then the standard deviation is small. If the data is more spread out, the standard deviation is larger. • The standard deviation is a yardstick for expressing how exceptional a data value is.
Sample Problem: Simple Statistics • The standard deviation is $s = \sqrt{\dfrac{\sum (\bar{x} - x_i)^2}{n - 1}}$ • Here $\bar{x}$ is the mean, $x_i$ represents the ith data value and $n$ is the number of data values. • The expression $(\bar{x} - x_i)^2$ is the square of the “deviation” of an individual item from the mean.
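Sample Problem: Simple Statistics • The numerator is the sum of these squared “deviations” across all the data. • Suppose our data was 2, 4, 6, 9, and 13. • The mean is 6.8 • The numerator of the fraction is $(6.8 - 2)^2 + (6.8 - 4)^2 + (6.8 - 6)^2 + (6.8 - 9)^2 + (6.8 - 13)^2 = 74.8$ • So $s = \sqrt{74.8 / (5 - 1)} = \sqrt{18.7} \approx 4.32$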
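The arithmetic for this small data set can be checked with a few lines of Python. This is only a sketch for the five sample values; the variable names are illustrative and do not come from the chapter.

```python
from math import sqrt

data = [2, 4, 6, 9, 13]
n = len(data)
xbar = sum(data) / n                    # mean: 34 / 5

# Sum of the squared deviations from the mean (the numerator)
numerator = sum((xbar - x) ** 2 for x in data)

# The sample standard deviation divides by n - 1
s = sqrt(numerator / (n - 1))

print(xbar)
print(round(numerator, 2))
print(round(s, 2))
```

Note that every individual value is needed a second time, after the mean is known, which is why a running total alone is not enough.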
Sample Problem: Simple Statistics • As you can see, calculating the standard deviation requires not only the mean (which can’t be calculated until all the data is entered), but also each individual data element! • We need some way to remember these values as they are entered.
Applying Lists • We need a way to store and manipulate an entire collection of numbers. • We can’t just use a bunch of variables, because we don’t know how many numbers there will be. • What do we need? Some way of combining an entire collection of values into one object.
Lists and Arrays • Suppose the sequence is stored in a variable s. We could write a loop to calculate the sum of the n items in the sequence like this:
    sum = 0
    for i in range(n):
        sum = sum + s[i]
• Almost all computer languages have a sequence structure like this, sometimes called an array.
Lists and Arrays • A list or array is a sequence of items where the entire sequence is referred to by a single name (e.g. s) and individual items can be selected by indexing (e.g. s[i]). • In other programming languages, arrays are generally a fixed size, meaning that when you create the array, you have to specify how many items it can hold. • Arrays are generally also homogeneous, meaning they can hold only one data type.
Lists and Arrays • Python lists are dynamic. They can grow and shrink on demand. • Python lists are also heterogeneous: a single list can hold arbitrary data types. • Python lists are mutable sequences of arbitrary objects.
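A short sketch of what "dynamic" and "heterogeneous" mean in practice. The values and names below are made up for illustration.

```python
# A single list can hold values of different types at the same time
items = [42, "spam", 3.14, [1, 2]]

# Lists grow on demand...
items.append(True)
grown_length = len(items)

# ...and shrink on demand
last = items.pop()              # removes and returns the last item
shrunk_length = len(items)

# Items can be replaced in place because lists are mutable
items[0] = "forty-two"
print(items)
```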
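List Operations • Lists support the usual sequence operations: concatenation (+), repetition (*), indexing, slicing, len(), iteration with for, and the membership check (in). • Except for the membership check, we’ve used these operations before on strings. • The membership operation can be used to see if a certain value appears anywhere in a sequence.
    >>> mylist = [1,2,3,4]
    >>> 3 in mylist
    True
    >>> if 4 in mylist:
    ...     print("Yes")
    ...
    Yes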
List Operations • The summing example from earlier can be written like this:
    sum = 0
    for x in s:
        sum = sum + x
• Unlike strings, lists are mutable:
    >>> mylist = [1,2,3,4]
    >>> mylist[3]
    4
    >>> mylist[3] = "Hello"
    >>> mylist
    [1, 2, 3, 'Hello']
    >>> mylist[2] = 7
    >>> mylist
    [1, 2, 7, 'Hello']
List Operations • A list of identical items can be created using the repetition operator. This command produces a list containing 50 zeroes:
    zeroes = [0] * 50
List Operations • Lists are often built up one piece at a time using append.
    nums = []
    x = eval(input('Enter a number: '))
    while x >= 0:
        nums.append(x)
        x = eval(input('Enter a number: '))
• Here, nums is being used as an accumulator, starting out empty, and each time through the loop a new value is tacked on.
List Operations
    >>> lst = [3, 1, 4, 1, 5, 9]
    >>> lst.append(2)
    >>> lst
    [3, 1, 4, 1, 5, 9, 2]
    >>> lst.sort()
    >>> lst
    [1, 1, 2, 3, 4, 5, 9]
    >>> lst.reverse()
    >>> lst
    [9, 5, 4, 3, 2, 1, 1]
    >>> lst.index(4)
    2
    >>> lst.insert(4, "Hello")
    >>> lst
    [9, 5, 4, 3, 'Hello', 2, 1, 1]
    >>> lst.count(1)
    2
    >>> lst.remove(1)
    >>> lst
    [9, 5, 4, 3, 'Hello', 2, 1]
    >>> lst.pop(3)
    3
    >>> lst
    [9, 5, 4, 'Hello', 2, 1]
List Operations • Most of these methods don’t return a value; they change the contents of the list in some way. • Lists can grow by appending new items, and shrink when items are deleted. Individual items or entire slices can be removed from a list using the del statement.
List Operations
    >>> myList = [34, 26, 0, 10]
    >>> del myList[1]
    >>> myList
    [34, 0, 10]
    >>> del myList[1:3]
    >>> myList
    [34]
• del isn’t a list method, but a built-in statement that can be used on list items.
List Operations • Basic list principles • A list is a sequence of items stored as a single object. • Items in a list can be accessed by indexing, and sublists can be accessed by slicing. • Lists are mutable; individual items or entire slices can be replaced through assignment statements.
List Operations • Lists support a number of convenient and frequently used methods. • Lists will grow and shrink as needed.
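The principles above can be seen together in one short sketch: indexing, slicing, replacing an item, replacing a slice, and deleting a slice. The list contents are arbitrary sample values.

```python
nums = [10, 20, 30, 40, 50]

first = nums[0]             # indexing
middle = nums[1:4]          # slicing produces a new list

nums[0] = 11                # replace a single item
nums[1:3] = [21, 31, 32]    # replace a two-item slice with three items
after_assign = list(nums)   # snapshot: the list grew by one item

del nums[-2:]               # remove the last two items; the list shrinks
print(first, middle, after_assign, nums)
```

A slice assignment does not need to keep the length the same, which is one way a list grows or shrinks "as needed".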
Statistics with Lists • One way we can solve our statistics problem is to store the data in lists. • We could then write a series of functions that take a list of numbers and calculate the mean, standard deviation, and median. • Let’s rewrite our earlier program to use lists to find the mean.
Statistics with Lists • Let’s write a function called getNumbers that gets numbers from the user. • We’ll implement the sentinel loop to get the numbers. • An initially empty list is used as an accumulator to collect the numbers. • The list is returned once all values have been entered.
Statistics with Lists
    def getNumbers():
        nums = []     # start with an empty list

        # sentinel loop to get numbers
        xStr = input("Enter a number (<Enter> to quit) >> ")
        while xStr != "":
            x = eval(xStr)
            nums.append(x)   # add this value to the list
            xStr = input("Enter a number (<Enter> to quit) >> ")
        return nums
• Using this code, we can get a list of numbers from the user with a single line of code:
    data = getNumbers()
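Because eval will evaluate any expression the user types, a more defensive variant might convert the text with float() instead. The following is a sketch under that assumption, not the chapter's version; the name get_numbers and the read parameter are hypothetical, added so the loop can be tried without a keyboard.

```python
def get_numbers(read=input):
    """Collect numbers until the user enters an empty line.

    `read` is a hypothetical hook (it defaults to input) so the
    function can be exercised with canned responses.
    """
    nums = []
    prompt = "Enter a number (<Enter> to quit) >> "
    text = read(prompt)
    while text != "":
        try:
            nums.append(float(text))    # float() only accepts numeric text
        except ValueError:
            print("Not a number, ignored:", text)
        text = read(prompt)
    return nums


# Non-interactive demonstration with canned responses
canned = iter(["3", "4.5", "oops", "-2", ""])
demo = get_numbers(lambda prompt: next(canned))
print(demo)
```

The accumulator pattern is unchanged: the list starts empty and each valid entry is appended.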
Statistics with Lists • Now we need a function that will calculate the mean of the numbers in a list. • Input: a list of numbers • Output: the mean of the input list
    def mean(nums):
        sum = 0.0
        for num in nums:
            sum = sum + num
        return sum / len(nums)
Statistics with Lists • The next function to tackle is the standard deviation. • In order to determine the standard deviation, we need to know the mean. • Should we recalculate the mean inside of stdDev? • Should the mean be passed as a parameter to stdDev?
Statistics with Lists • Recalculating the mean inside of stdDev is inefficient if the data set is large. • Since our program is outputting both the mean and the standard deviation, let’s compute the mean and pass it to stdDev as a parameter.
Statistics with Lists
    from math import sqrt

    def stdDev(nums, xbar):
        sumDevSq = 0.0
        for num in nums:
            dev = xbar - num
            sumDevSq = sumDevSq + dev * dev
        return sqrt(sumDevSq/(len(nums)-1))
• The summation from the formula is accomplished with a loop and accumulator. • sumDevSq stores the running sum of the squares of the deviations.
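One way to gain confidence in stdDev is to compare it with statistics.stdev from the standard library, which also computes the sample standard deviation (dividing by n - 1). The sketch below repeats the function so it runs on its own; the sample data is the set used earlier in the slides.

```python
from math import sqrt
import statistics

def stdDev(nums, xbar):
    sumDevSq = 0.0
    for num in nums:
        dev = xbar - num
        sumDevSq = sumDevSq + dev * dev
    return sqrt(sumDevSq / (len(nums) - 1))

data = [2, 4, 6, 9, 13]
xbar = sum(data) / len(data)

ours = stdDev(data, xbar)
library = statistics.stdev(data)
print(ours, library)
```

Because the divisor is len(nums) - 1, this version needs at least two data values; with a single value it raises ZeroDivisionError.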
Statistics with Lists • We don’t have a formula to calculate the median. We’ll need to come up with an algorithm to pick out the middle value. • First, we need to arrange the numbers in ascending order. • Second, the middle value in the list is the median. • If the list has an even length, the median is the average of the middle two values.
Statistics with Lists • Pseudocode:
    sort the numbers into ascending order
    if the size of the data is odd:
        median = the middle value
    else:
        median = the average of the two middle values
    return median
Statistics with Lists
    def median(nums):
        nums.sort()
        size = len(nums)
        midPos = size // 2
        if size % 2 == 0:
            median = (nums[midPos] + nums[midPos-1]) / 2
        else:
            median = nums[midPos]
        return median
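Since lists are mutable, nums.sort() reorders the caller's list as a side effect. If that matters, one alternative is to sort a copy with sorted(). This is a sketch of that variation; the name median_sorted_copy is hypothetical.

```python
def median_sorted_copy(nums):
    """Return the median without reordering the caller's list."""
    ordered = sorted(nums)          # sorted() builds a new list
    size = len(ordered)
    mid = size // 2
    if size % 2 == 0:
        # even length: average the two middle values
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]

data = [9, 2, 13, 4, 6]
m_odd = median_sorted_copy(data)
m_even = median_sorted_copy([9, 2, 13, 4])
print(m_odd, m_even, data)
```

The trade-off is an extra copy of the data in exchange for leaving the original order intact.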
Statistics with Lists • With these functions, the main program is pretty simple!
    def main():
        print("This program computes mean, median and standard deviation.")

        data = getNumbers()
        xbar = mean(data)
        std = stdDev(data, xbar)
        med = median(data)

        print("\nThe mean is", xbar)
        print("The standard deviation is", std)
        print("The median is", med)
Tuples • Tuples are similar to lists but are immutable (their contents can’t be changed). • Parentheses are used to write tuples instead of square brackets. • When you know a collection’s contents won’t change, prefer a tuple, which is more efficient; otherwise use a list.
Tuples examples
    >>> a = (1,2,3)
    >>> a
    (1, 2, 3)
    >>> type(a)
    <class 'tuple'>
    >>> a[1]
    2
    >>> a[1:2]
    (2,)
    >>> a[0:2]
    (1, 2)
    >>> for x in a:
    ...     print(x)
    ...
    1
    2
    3
    >>> a[1] = 4
    Traceback (most recent call last):
      File "<pyshell#22>", line 1, in <module>
        a[1] = 4
    TypeError: 'tuple' object does not support item assignment
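Because a tuple cannot be modified, "changing" one really means building a new tuple from pieces of the old one. A small sketch, with made-up values:

```python
point = (3, 4)

# Unpacking assigns each item to its own variable
x, y = point

# Item assignment fails on a tuple
try:
    point[0] = 10
except TypeError as err:
    message = str(err)

# Build a new tuple instead; the original is untouched
moved = (10,) + point[1:]
print(x, y, moved, point)
```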
Non-Sequential Collections • Python provides another built-in data type for collections, called a dictionary. • Not all programming languages have dictionaries, while almost all have arrays or lists.
Dictionary Basics • Typically, when we retrieve information from a sequential collection, we look it up by its position, or index, in the collection. • Say you want to retrieve data about students or employees based on social security numbers and not by the index of the student or the employee.
Dictionary Basics • The combination of social security number with other data is known as a key-value pair. • We access the value (the student information) associated with a particular key (the social security number). • It is easy to think of many key-value pairs: usernames & passwords, names & phone numbers, etc.
Dictionary Basics • A collection that allows us to look up data with arbitrary keys is called a mapping. • Python dictionaries are mappings. • Some other languages call them hashes or associative arrays.
Dictionary Basics • A dictionary can be created in Python by listing key-value pairs inside curly brackets:
    >>> passwd = {"guido":"superprogrammer", "turing":"genius", "bill":"monopoly"}
• Keys and values are joined with ‘:’, and commas are used to separate pairs.
Dictionary Basics • The main use of a dictionary is to look up a value associated with a particular key, using indexing notation:
    >>> passwd["guido"]
    'superprogrammer'
    >>> passwd["bill"]
    'monopoly'
• <dictionary>[<key>] returns the object associated with the given key.
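Indexing with a key that is not in the dictionary raises KeyError, so lookups are often guarded with the in operator or the get method. A brief sketch; the key "ada" is a made-up example of a missing user.

```python
passwd = {"guido": "superprogrammer", "turing": "genius", "bill": "monopoly"}

found = passwd["guido"]             # normal lookup by key

# Indexing with an unknown key raises KeyError
try:
    passwd["ada"]
    missing_raises = False
except KeyError:
    missing_raises = True

has_ada = "ada" in passwd           # membership test checks the keys
fallback = passwd.get("ada", "<no such user>")   # default instead of an error
print(found, missing_raises, has_ada, fallback)
```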
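Dictionary Basics • Dictionaries are mutable. The value associated with a key can be changed with assignment.
    >>> passwd["bill"] = "bluescreen"
    >>> passwd
    {'turing': 'genius', 'bill': 'bluescreen', 'guido': 'superprogrammer'}
• Did you notice the dictionary did not print out in the same order it was entered? • In the Python version used for this session, dictionaries were unordered. Since Python 3.7, a dictionary remembers the order in which keys were inserted, but items are still looked up by key, not by position.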
Dictionary Basics • Python stores dictionaries in a way that makes key lookup very efficient. • A technique called hashing is used for this. • If you want to keep a collection of items in a certain order, use a list! • But lists won’t allow you to access an item through a key; you can only access it through an index.
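In current Python versions (3.7 and later) the built-in dict keeps keys in insertion order, which differs from the older behaviour these slides describe. The sketch below assumes such a version; when a specific order such as alphabetical is needed, the keys can be sorted into a list.

```python
passwd = {"guido": "superprogrammer", "turing": "genius", "bill": "monopoly"}

# Updating an existing key changes the value but not the key's position
passwd["bill"] = "bluescreen"

insertion_order = list(passwd)      # keys in the order they were added
alphabetical = sorted(passwd)       # a separate, sorted list of keys
print(insertion_order, alphabetical)
```

Either way, a dictionary has no positional index: passwd[0] looks for a key equal to 0, not the first entry.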
Dictionary Basics • Dictionaries are mutable collections that implement a mapping from keys to values. • Keys can be of any immutable (hashable) type, such as strings, ints, floats, and tuples. • Values can be of any type, including lists and programmer-defined classes.
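A quick sketch of which objects work as keys. Python requires keys to be hashable, which in practice excludes mutable containers such as lists. The dictionary name and contents are invented for illustration.

```python
grades = {}
grades["alice"] = [90, 85]              # string key, list value
grades[42] = "an int key"
grades[(2024, "fall")] = "a tuple key"  # tuples of hashable items are fine

# A list is mutable and therefore not hashable
try:
    grades[[2024, "fall"]] = "a list key"
except TypeError as err:
    error_text = str(err)

print(len(grades), error_text)
```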
Dictionary Operations • Python dictionaries support several built-in operations. • Dictionaries can be extended (data added after creation) by adding new entries.
    >>> passwd["newuser"] = "ImANewbie"
    >>> passwd
    {'turing': 'genius', 'bill': 'bluescreen', 'newuser': 'ImANewbie', 'guido': 'superprogrammer'}
Dictionary Operations • A common way to build a dictionary is to start with an empty collection and add the key-value pairs one at a time. • Suppose usernames and passwords were stored in a file called “passwords”, where each line of the file contains a username and password separated by a space.
Dictionary Operations
    passwd = {}
    infile = open("passwords", "r")
    for line in infile:
        # "pass" is a reserved word in Python, so the password is stored in pw
        user, pw = line.split()
        passwd[user] = pw
    infile.close()
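The same loop can be wrapped in a function and tried against a throwaway file. This is a sketch, not the chapter's code: the function name load_passwords is hypothetical, the with statement is swapped in for the explicit close(), and the file contents are sample data written to a temporary directory.

```python
import os
import tempfile

def load_passwords(path):
    """Build a username -> password dictionary from a text file."""
    passwd = {}
    with open(path, "r") as infile:     # the file is closed automatically
        for line in infile:
            if not line.strip():        # skip blank lines
                continue
            user, pw = line.split()     # one space-separated pair per line
            passwd[user] = pw
    return passwd

# Demonstration with a temporary file holding sample entries
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "passwords")
    with open(path, "w") as out:
        out.write("guido superprogrammer\nturing genius\nbill monopoly\n")
    table = load_passwords(path)

print(table)
```

A line with more or fewer than two fields would make the unpacking fail with ValueError, so real input may need extra checking.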