Python has a variety of built-in data structures that we will use in this course—lists, tuples, sets, and dictionaries. All of these are similar in that they represent collections of objects, but they each have their own strengths and weaknesses relative to each other for different tasks. They are also all objects, and so they have their own member functions that we may use to access or manipulate the data they contain. All four of the data structures mentioned above are heterogeneous which means they can hold a collection of different data types. This in contrast to a Java array, for instance, which must declare the type it can hold. For instance a Java array of integers can only hold integer values while a Python list could hold an integer, a floating point number, and a string or any other object we would care to place in the collection.
A good reference on all the information for each collection below is available in the official Python documentation here.
A list
in Python can be created in multiple ways. Two ways allow us to create an empty list to which we can add items later. A third way allows us to create a list that is initialized to hold prescribed values. This code snippet illustrates all three ways:
empty_list = list() # we call the list() constructor here
another_empty_list = [] # this is a more typical way to create an empty list
list_with_values = [1, 2.3, "informatics"] # list is heterogenous. Comma-separated list of values to initialize
The initialization of another_empty_list
above is said to be more pythonic since it’s the typical way to program such a statement in Python. The previous line initializing empty_list
would be less typical in a program, but it does illustrate to us that list is a class while the objects are instances of that class. The list
class is part of Python, so it is a built-in class. Since it’s a class, it has member functions (these are called instance methods in Java) that we can use. One of the most common is list.append()
which lets us add to the end of the list:
another_empty_list = [] # this is a more typical way to create an empty list
list_with_values = [1, 2.3, "informatics"] # list is heterogenous. Comma-separated list of values to initialize
another_empty_list.append(1) # another_empty_list now equals [1]
list_with_values.append("i427") # now holds [1,2.3,"informatics","i427"]
Finally, the code below illustrates a key property of a list—random access by index. Python lists act like Java arrays in that each item has a position in the sequence. The first item is at index 0, the second at index 1, and so on.
list_with_values = [1, 2.3, "informatics"] # list is heterogenous. Comma-separated → list of values to initialize
print(list_with_values[0])
print(list_with_values[1])
print(list_with_values[2])
print(list_with_values[-1])
print(list_with_values[-2])
print(len(list_with_values))
1 2.3 informatics informatics 2.3 3
The previous example illustrates that we can access using negative indices: -1 is the last item, -2 the next to last, and so on. Even though list is a class, we determine the length of a list using the built-in len()
function and supplying the list as an argument. We do not use a member function in the list class to determine the length. Access by index is one of the key traits of a list, so whenever we are holding a collection of data and we care that an item is first, second, third … we would choose list as our container.
It is common in computing to need to do something to every item in a list. To accomplish these repeated (once per element) operations, we typically use iteration–a for
loop–in programming. Iterating over these collections in Python is similar to using enhanced for loops in Java. A dummy variable acts as a placeholder for each value in the collection in turn. We are free to choose its name. The dummy variable is not an index; it is a value in the collection. Short variable names like i
are appropriate for an index but not a value. item
or value
are appropriate generic names, or a variable name that describes the contents of the collection would be appropriate. The only times single letter variable names would be appropriate would be when implementing mathematical operations or formulas, like adding up elements, on a collection. Then names like i
, j
, or k
would be appropriate for int
values or x
, y
, or z
may be appropriate for float
.
Since a collection is a group of values, collections names may often be plural. For instance, documents
is a good name for a list where each item in the list is a document in the corpus. In that case, the singular document
may be a good dummy variable name.
list_of_names = ["Larry", "Moe", "Curly"]
for item in list_of_names: # generic dummy variables
print(item)
Larry Moe Curly
for name in list_of_names: # more meanginful dummy name
print(name)
Larry Moe Curly
names = ["Larry", "Moe", "Curly"]
for name in names: # more concise but equally readable
print(name)
Larry Moe Curly
Python provides the enumerate()
method for cases when we need the value in the list as well as the index of that value. In this case, we get to name two dummy variables. The first is the index, and i
is an appropriate dummy variable name in this case. The second is the value in the list, and we should name it as described above.
list_of_names = ["Larry","Moe","Curly"]
for i,name in enumerate(list_of_names):
print(i)
print(name)
0 Larry 1 Moe 2 Curly
What is happening in the previous example is that the enumerate
method is yielding (i.e. returning each iteration) a 2 values which are simultaneously assigned to i, name
for each iteration of the loop. Also note that since i
is an index, it’s an appropriate variable name in this case. These multiple assignments to several values simultaneously is an example of tuple unpacking (see more info here). We haven't seen a tuple yet, but it's a collection very similar to list. Here the two values i, name
are like a two element collection, but we name each item in the collection as a dummy variable instead of naming the entire collection. It will turn out that tuple objects also support the object_name[index]
syntax, so we could also write that for loop as shown below. The first way above is the correct (Pythonic) way to write the loop. The way below is not an example to follow. It's an antipattern that simply illustrates a concept.
for dummy_variables in enumerate(list_of_names):
print(dummy_variables[0])
print(dummy_variables[1])
print(type(dummy_variables))
0 Larry 1 Moe 2 Curly <class 'tuple'>
for
loops¶If you only need the index for accessing the value at that index, then you don't need the index!
A lot of students like to write code like this which is like for (int i = 0; i < l.length; i++)
in Java
l = [2,4,6]
for i in range(3): # i is the index. First 0, then 1, then 2
print(l[i]) # we only use the index to get the ith item in the loop
when they should write it as
l = [2,4,6]
for val in l:
print(val) # we didn't need the index, just the value
If you need the value and the index, use enumerate!
l = [2,4,6]
for i,val in enumerate(l):
print(f"The value at index {i} is {val}")
I've already mentioned that we should focus our thinking on the data that's flowing through our program. If we know the code patterns for manipulating that data, then it's easier to be able to produce or come up with the code needed to achieve our goals. By focusing on the data, we're able to articulate what those goals are: "I need to convert this corpus into an inverted index", "I need to tokenize this document", ... are possible goals when writing a search engine. We can transform those more specifically to data-drive code goals like "For every document in the corpus, I need to add its tokens to the index", "I need to convert this document stored in a string into a list of tokens", ..., and it become easier to produce code to do what we need.
When our data is in a list, it's very common to need to convert that list into:
Both of these things can be achieved with the accumulator pattern. The gist of it is to
If we're summing items, the initial, empty value should be 0, and then we can add each item in term to our variable (accumulating the sum as we go along by keeping track of the partial sums). If we're creating a new list based on an element-wise computation on an original list, we can start with an empty list and add to it (appending) at each iteration. Both are illustrated below.
l = [4, 5, 6]
list_sum = 0
for value in l:
list_sum += value
print(list_sum)
15
double_l = []
for value in l:
double_l.append(2 * value)
print(double_l)
[8, 10, 12]
As everyone knows, a crime against humanity was committed and Shemp replaced Curly. (Let us not speak of the so-called Curly Joe stooge). If we try to replace Curly's name with Shemp in our dummy_variables
objects, we aren't allowed to. This is because tuples are immutable--the can't be changed. Lists in contrast can be changed and are called a mutable data structure as a consequence. Mutability is the primary difference between lists and tuples. They both support integer indices, and they support some of the same methods. Some methods, like append()
mutate the list's value, so we woudln't expect a tuple to support that method, and it doesn't.
dummy_variables.append('Shemp')
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-15-74aa3b5df8cd> in <module> ----> 1 dummy_variables.append('Shemp') AttributeError: 'tuple' object has no attribute 'append'
Finally, note that the antipattern dummy variable dummy_variables
requires me to know that the value index 0 in the tuple is an index while the value at index 1 is a name. By naming the elements separately with two dummy variabls, our code is more readable, and it's easier to write/code since we're not likely to accidentally confuse the two variables whereas we would be likely to accidentally type a [1]
instead of a [0]
or vice versa.
A set
and a tuple
are similar to a list
in many ways, but each also has at least one key difference. We've already seen mutability as the key distinction for a tuple. A set
represents a mathematical set, so it cannot hold duplicate values and has no particular ordering. This is in contrast to a list which can hold duplicate values and is ordered.
empty_set = set()
duplicate_list = [1,1,2]
set_with_values = {1,1,2}
print(duplicate_list)
[1, 1, 2]
print(set_with_values)
{1, 2}
set_with_values[0]
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-18-376cf05d2412> in <module> ----> 1 set_with_values[0] TypeError: 'set' object is not subscriptable
In the above example we see that duplicates cannot exist in a set and that we also cannot access items by index in a set. We use a set when we specifically need the property of no duplicate values. We can compute all the normal set operations:
one_set = {1,3,4,5}
another_set = {4,5,8}
one_set | another_set # this is union
{1, 3, 4, 5, 8}
one_set.union(another_set) # another way
{1, 3, 4, 5, 8}
one_set & another_set # this is intersection
{4, 5}
one_set.intersection(another_set) # intersection another way
{4, 5}
one_set - another_set # set difference
{1, 3}
one_set.difference(another_set) # set difference another way
{1, 3}
Sets and tuples can be traversed exactly like lists. A dummy variable in a for
loop can take on each value in the collection in turn. Tuples have an ordering, so a tuple would be traversed in order of index. Sets do not have an order, to the items in a set occur in no particular order. If we need to traverse items in a certain order, we would need to sort the set’s values, for instance.
tuple_of_days = ("Friday","Saturday","Sunday")
for day in tuple_of_days:
print(day)
Friday Saturday Sunday
Here is a similar example using a set.
set_of_days = {"Friday","Saturday","Sunday"}
for day in set_of_days:
print(day)
Saturday Sunday Friday
Once again, one thing worth noting is that the ordering of sets is not preserved. This is not a guarantee that the set class provides the programmer.