We've seen the definitions of two important concepts before, but now we have to begin putting them into practice:

  1. Tokenizing: Parsing a document in a corpus to split it up into smaller parts (tokens) for individual processing. The simplest approach is to split a document up into words, and that's what we'll do here. More sophisticated ways of tokenizing a document may also attempt to recognize phrases (groups of adjacent words) in addition to individual words. The output of tokenizing a document is its tokens; in the examples we just saw, a token can be a word or a phrase.
  2. Normalizing: Every token should be put in a standard form so we can compare tokens to each other. The two tokens 'He' and 'he' likely represent the same word; the first one just happened to begin a sentence while the second one did not. We can normalize the tokens in this example by converting them to all lowercase characters. We'll see that other kinds of normalization may be needed depending on how we tokenize.

Any search engine must make a choice about which tokens are needed and consequently how to normalize them. Different engines or needs may lead to different choices. If searching by phrase is not going to be an option, then having phrases as tokens is not helpful. If searching by proper names is important, then converting all tokens to lowercase may not be appropriate unless a method other than uppercase is used to indicate a proper name.

For our web search engine, we'll restrict the potential search terms (remember that's the lexicon or vocabulary) to single words, so our tokens will be single words. We won't distinguish between proper nouns and regular nouns for our search engine.

First, let's load a document (web page) into a string, as we've already seen how to do. We'll then strip the markup to get the text of the document, and this text will be our starting point for normalizing and tokenizing.

Below is another way to achieve what we saw last time. Here's what we saw before:

import urllib.request

url_to_read = "https://ada.ius.edu/~cjkimmer/teaching/i427.html"
response = urllib.request.urlopen(url_to_read) # request the page
html_document = response.read() # read the response from the server
response.close() # close the connection ourselves

And here's a different way to achieve the same thing:
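Here is a minimal sketch of that approach, using the with statement explained below (the URL is the same one as before):

import urllib.request

url_to_read = "https://ada.ius.edu/~cjkimmer/teaching/i427.html"
with urllib.request.urlopen(url_to_read) as response: # the connection is closed for us
    html_document = response.read() # read the response from the server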

In this example, the with statement is used. with is common in Python when resources like files or streams are being used. with defines a block or scope in which a particular variable is live. The variable name is defined after the as keyword, and the part between with and as is the value used to initialize that variable. The advantage of defining a scope for the lifetime of the response variable is that Python can tell when the variable is "finished" and can automatically clean up the resource. In this case, cleaning up means closing the connection, so Python calls .close() automatically for us. Although we have to remember more syntax with this approach, we ultimately have a little less to worry about. Either form we have seen is fine to use; they are equivalent, but this new form is a little more modern since the programmer doesn't have to manage the resource as explicitly. Bugs and memory leaks creep into code when programmers have to micromanage resources and memory and then inevitably forget to do so at some point.

Normalization

Now we have the document from our corpus stored in a string. If normalizing means converting to lowercase, then normalizing is easy. A str object in Python is an instance of the str class, and the str class includes many helpful member functions; see the Python documentation for a list of them. One of the functions we can find in that list is .lower(), which converts a str value to lowercase.
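For example, a quick check in a cell shows how a single token gets normalized:

'He'.lower() # evaluates to 'he'
'He'.lower() == 'he' # evaluates to True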

If our definition of normalizing is simply to convert to lowercase, then we have solved the problem! With such simple normalization, it doesn't matter whether we normalize before tokenizing or vice versa. You can probably convince yourself that it wouldn't matter if you convert all the words to lowercase before splitting the document up into words or split the document up into words and then convert each one to lowercase. With other normalization approaches, it may be necessary to normalize after tokenizing.

Anyway, here's how to normalize the entire document:
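A minimal sketch of that step, assuming the stripped document text is stored in a variable named text (the variable name is just a placeholder for illustration):

normalized_text = text.lower() # convert the whole document string to lowercase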

The value is not shown here in the notes to save space. It is expected that you are writing and evaluating your own code and are verifying it works as intended as you go through these notes.

Tokenization

Today we will see a "naïve" way to tokenize. This method is not suitable for our production code later! It simply illustrates the concept of tokenization and introduces some programming topics we need to consider.

First, as I will always suggest for I427, think about the data. The starting point for tokenization is text stored in a string. The end point for tokenization is a collection of tokens--presumably a list, since that's the only collection class we've focused on so far. Tokenizing a document transforms a string into a list. Each token is a word, so each token is presumably a str. Tokenization transforms a str into a list of str. This is how the data are transformed by the code, and note that the description of the data boils down to two things: what information the data hold (document, token, collection of tokens) and what type the data are (str, list of str).

If we look back over the list of str methods in the Python documentation, we can see a method .split() that converts a str into a list of str. Moreover, by default, what it does is "split based on whitespace". It will do what we need for now. Since it's a member function, it acts on the data stored in the str. If we invoke it like:

"my test string".split()

then the input value is "my test string" and the entire expression above evaluates to the result, the list

['my', 'test', 'string']

Try it out in Jupyter to see. Likewise, if we have a value stored in a variable and invoke it like:

doc = "another test string"
tokens = doc.split()

then the input string is the value of doc, doc.split() evaluates to the result (['another', 'test', 'string']), and that result is stored in the variable tokens.
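To connect this back to the data description above, you can check the types in a cell (a str goes in, a list of str comes out):

type(doc) # <class 'str'>
type(tokens) # <class 'list'>
type(tokens[0]) # <class 'str'>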

An aside on functions

So what we've seen is that .split() performs a simple version of tokenization (we'll see why it's too simple in a little bit...): the member function operates on input data (the value of the instance of the str class) and returns an output value (the list of tokens). Operating on input data and returning output data is what functions do. When we take a data-driven approach to programming and recognize that tokenization transforms a string into a collection of strings, we are heading towards recognizing that we should implement tokenization as a function. Soon we will devote significant energy in the course to reviewing what functions are and learning and practicing how to write them in Python. You've seen functions in Python in I110, and you've seen methods in Java in I210 and I211. Methods and functions are names for the same concept; different programming languages or programmers tend to emphasize different names for them. Subroutine and procedure are two other synonyms. I prefer function because it emphasizes the similarity with mathematical functions:

output = f(input)

where f is the function. Python actually makes a distinction between what a function is and what a method is, and we'll talk about that soon...
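As a preview of where we're heading, here is a minimal sketch of tokenization written as a function (the name tokenize is just a placeholder, and the body uses the naïve .split() approach for now):

def tokenize(document):
    """Return the list of tokens in a document (naïve version)."""
    return document.split()

tokens = tokenize("another test string") # ['another', 'test', 'string']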

Naïve Tokenization in a Nutshell

Use .split() for tokenization for now. We'll see a better way soon...
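Putting normalization and naïve tokenization together on the document text (again assuming the stripped text is stored in a variable named text):

tokens = text.lower().split() # normalize to lowercase, then tokenize naïvely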

The problem with naïve tokenization

If we scroll through the list of tokens from our document, we see some tokens that aren't words. Items like '-', 'w', 'get?', and '3/19' appear in the list, among other things. If we want our tokens to be words or potential search terms, then we have a problem. One of the issues has to do with the presence of punctuation, and another issue has to do with the fact that we might need to verify a token is a word before we "admit it" to our list of tokens. We will have to deal with these issues as we go forward.
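A small made-up snippet (not from the actual page) shows how .split() happily produces tokens like these:

"What score did we get? 3/19 - w".split()
# evaluates to ['What', 'score', 'did', 'we', 'get?', '3/19', '-', 'w']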

Quiz question

This question appears on a quiz in Canvas.

  1. For the URL "enter77.ius.edu/~cjkimmer/i427/token_test.html", what are the tokens in the document according to our naïve tokenization approach using .split()?