Python Collections

Python has a variety of built-in data structures that we will use in this course—lists, tuples, sets, and dictionaries. All of these are similar in that they represent collections of objects, but they each have their own strengths and weaknesses relative to each other for different tasks. They are also all objects, and so they have their own member functions that we may use to access or manipulate the data they contain. All four of the data structures mentioned above are heterogeneous which means they can hold a collection of different data types. This in contrast to a Java array, for instance, which must declare the type it can hold. For instance a Java array of integers can only hold integer values while a Python list could hold an integer, a floating point number, and a string or any other object we would care to place in the collection.

A good reference on all the information for each collection below is available in the official Python documentation here.

Lists

A list in Python can be created in multiple ways. Two ways allow us to create an empty list to which we can add items later. A third way allows us to create a list that is initialized to hold prescribed values. This code snippet illustrates all three ways:

The initialization of another_empty_list above is said to be more pythonic since it’s the typical way to program such a statement in Python. The previous line initializing empty_list would be less typical in a program, but it does illustrate to us that list is a class while the objects are instances of that class. The list class is part of Python, so it is a built-in class. Since it’s a class, it has member functions (these are called instance methods in Java) that we can use. One of the most common is list.append() which lets us add to the end of the list:

Finally, the code below illustrates a key property of a list—random access by index. Python lists act like Java arrays in that each item has a position in the sequence. The first item is at index 0, the second at index 1, and so on.

The previous example illustrates that we can access using negative indices: -1 is the last item, -2 the next to last, and so on. Even though list is a class, we determine the length of a list using the built-in len() function and supplying the list as an argument. We do not use a member function in the list class to determine the length. Access by index is one of the key traits of a list, so whenever we are holding a collection of data and we care that an item is first, second, third … we would choose list as our container.

Traversing a list

It is common in computing to need to do something to every item in a list. To accomplish these repeated (once per element) operations, we typically use iteration–a for loop–in programming. Iterating over these collections in Python is similar to using enhanced for loops in Java. A dummy variable acts as a placeholder for each value in the collection in turn. We are free to choose its name. The dummy variable is not an index; it is a value in the collection. Short variable names like i are appropriate for an index but not a value. item or value are appropriate generic names, or a variable name that describes the contents of the collection would be appropriate. The only times single letter variable names would be appropriate would be when implementing mathematical operations or formulas, like adding up elements, on a collection. Then names like i, j, or k would be appropriate for int values or x, y, or z may be appropriate for float.

Since a collection is a group of values, collections names may often be plural. For instance, documents is a good name for a list where each item in the list is a document in the corpus. In that case, the singular document may be a good dummy variable name.

Python provides the enumerate() method for cases when we need the value in the list as well as the index of that value. In this case, we get to name two dummy variables. The first is the index, and i is an appropriate dummy variable name in this case. The second is the value in the list, and we should name it as described above.

What is happening in the previous example is that the enumerate method is yielding (i.e. returning each iteration) a 2 values which are simultaneously assigned to i, name for each iteration of the loop. Also note that since i is an index, it’s an appropriate variable name in this case. These multiple assignments to several values simultaneously is an example of tuple unpacking (see more info here). We haven't seen a tuple yet, but it's a collection very similar to list. Here the two values i, name are like a two element collection, but we name each item in the collection as a dummy variable instead of naming the entire collection. It will turn out that tuple objects also support the object_name[index] syntax, so we could also write that for loop as shown below. The first way above is the correct (Pythonic) way to write the loop. The way below is not an example to follow. It's an antipattern that simply illustrates a concept.

Rules of thumb for for loops

If you only need the index for accessing the value at that index, then you don't need the index!

A lot of students like to write code like this which is like for (int i = 0; i < l.length; i++) in Java

l = [2,4,6]
for i in range(3): # i is the index. First 0, then 1, then 2
    print(l[i]) # we only use the index to get the ith item in the loop

when they should write it as

l = [2,4,6]
for val in l:
    print(val) # we didn't need the index, just the value

If you need the value and the index, use enumerate!

l = [2,4,6]
for i,val in enumerate(l):
    print(f"The value at index {i} is {val}")



Calculating from a list

I've already mentioned that we should focus our thinking on the data that's flowing through our program. If we know the code patterns for manipulating that data, then it's easier to be able to produce or come up with the code needed to achieve our goals. By focusing on the data, we're able to articulate what those goals are: "I need to convert this corpus into an inverted index", "I need to tokenize this document", ... are possible goals when writing a search engine. We can transform those more specifically to data-drive code goals like "For every document in the corpus, I need to add its tokens to the index", "I need to convert this document stored in a string into a list of tokens", ..., and it become easier to produce code to do what we need.

When our data is in a list, it's very common to need to convert that list into:

  1. a single value (this is called aggregation or a reduction). For instance, computing the sum of a list converts a list of numbers into a single value, the sum.
  2. another list. If we have a list of tokens and need a list of the position of each token in a document, we are computing a new list from each value in an original list.

Both of these things can be achieved with the accumulator pattern. The gist of it is to

  1. initialize a new, empty value in a variable
  2. traverse the list (for loop) and update the new variable each iteration

If we're summing items, the initial, empty value should be 0, and then we can add each item in term to our variable (accumulating the sum as we go along by keeping track of the partial sums). If we're creating a new list based on an element-wise computation on an original list, we can start with an empty list and add to it (appending) at each iteration. Both are illustrated below.

Compute the sum of a list

Create a new list where each item is twice the value in an original list

Tuples

As everyone knows, a crime against humanity was committed and Shemp replaced Curly. (Let us not speak of the so-called Curly Joe stooge). If we try to replace Curly's name with Shemp in our dummy_variables objects, we aren't allowed to. This is because tuples are immutable--the can't be changed. Lists in contrast can be changed and are called a mutable data structure as a consequence. Mutability is the primary difference between lists and tuples. They both support integer indices, and they support some of the same methods. Some methods, like append() mutate the list's value, so we woudln't expect a tuple to support that method, and it doesn't.

Finally, note that the antipattern dummy variable dummy_variables requires me to know that the value index 0 in the tuple is an index while the value at index 1 is a name. By naming the elements separately with two dummy variabls, our code is more readable, and it's easier to write/code since we're not likely to accidentally confuse the two variables whereas we would be likely to accidentally type a [1] instead of a [0] or vice versa.

Sets

A set and a tuple are similar to a list in many ways, but each also has at least one key difference. We've already seen mutability as the key distinction for a tuple. A set represents a mathematical set, so it cannot hold duplicate values and has no particular ordering. This is in contrast to a list which can hold duplicate values and is ordered.

In the above example we see that duplicates cannot exist in a set and that we also cannot access items by index in a set. We use a set when we specifically need the property of no duplicate values. We can compute all the normal set operations:

Traversing sets and tuples

Sets and tuples can be traversed exactly like lists. A dummy variable in a for loop can take on each value in the collection in turn. Tuples have an ordering, so a tuple would be traversed in order of index. Sets do not have an order, to the items in a set occur in no particular order. If we need to traverse items in a certain order, we would need to sort the set’s values, for instance.

Here is a similar example using a set.

Once again, one thing worth noting is that the ordering of sets is not preserved. This is not a guarantee that the set class provides the programmer.