We've seen the definitions of two important concepts before, but now we have to begin putting them into practice:
Any search engine must make a choice about which tokens are needed and consequently how to normalize them. Different engines or needs may lead to different choices. If searching by phrase is not going to be an option, then having phrases as tokens is not helpful. If searching by proper names is important, then converting all tokens to lowercase may not be appropriate unless a method other than uppercase is used to indicate a proper name.
For our web search engine, we'll restrict the potential search terms (remember that's the lexicon or vocabulary) to single words, so our tokens will be single words. We won't distinguish between proper nouns and regular nouns for our search engine.
First, let's load a document (a web page) into a string as we've done before. We'll then strip the markup to get the text of the document, and this text will be our starting point for normalizing and tokenizing.
import urllib.request # ability to connect to a web server & get HTML
import bs4 # parses the HTML
Below is another way to achieve what we saw last time. First, here's the version we saw before:
url_to_read = "https://ada.ius.edu/~cjkimmer/teaching/i427.html"
response = urllib.request.urlopen(url_to_read) # request
html_document = response.read() # read response from server
response.close()
And here's a different way to achieve the same thing:
url_to_read = "https://ada.ius.edu/~cjkimmer/teaching/i427.html"
with urllib.request.urlopen(url_to_read) as response:
    html_document = response.read() # read response from server
In this example, the with statement is used. with is common in Python when resources like files or streams are being used. with defines a block or scope in which a particular variable is live. The variable name is defined after the as keyword, and the part between with and as is the value used to initialize that variable. The advantage of defining a scope for the lifetime of the response variable is that Python can tell when the variable is "finished", and it can automatically clean up the resource. In this case, cleaning up means closing the connection, so Python calls .close() automatically for us. Although we have to remember more syntax with this approach, we ultimately have a little less to worry about. Either form we have seen is fine to use; they are equivalent, but this new form is a little more modern since the programmer doesn't have to manage the resource as explicitly. Bugs or memory leaks creep into code when programmers have to micromanage resources and memory and then inevitably forget to do so sometimes.
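For comparison, the same pattern works with ordinary files. In the sketch below, the filename example.txt is just a placeholder; the point is that Python closes the file for us as soon as the block ends, just as it closes the network connection above:

# "example.txt" is a hypothetical filename used only to illustrate the with statement
with open("example.txt") as infile:
    file_contents = infile.read() # infile is "live" only inside this block
# here the file has already been closed automatically; no .close() needed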
soup = bs4.BeautifulSoup(html_document, "lxml") # parse the HTML
text_content = soup.get_text() # extract just the text, stripping the markup
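If you'd like to spot-check that the markup really was stripped, one quick (optional) way is to look at a small slice of the resulting string:

print(text_content[:200]) # show just the first 200 characters of the extracted text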
Now we have the document from our corpus stored in a string. If normalizing means converting to lowercase, then normalizing is easy. A str object in Python is an instance of the str class, and the str class includes many helpful member functions. See here for a list of them. One of the functions we can find in that list is .lower(), which converts a str value to lowercase.
If our definition of normalizing is simply to convert to lowercase, then we have solved the problem! With such simple normalization, it doesn't matter whether we normalize before tokenizing or vice versa. You can probably convince yourself that converting all the words to lowercase before splitting the document up into words gives the same result as splitting the document up into words and then converting each one to lowercase. With other normalization approaches, it may be necessary to normalize after tokenizing.
Anyway, here's how to normalize the entire document:
normalized_text = text_content.lower()
The value is not shown here in the notes to save space. It is expected that you are writing and evaluating your own code and are verifying it works as intended as you go through these notes.
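If you want to convince yourself of the claim above that the order of lowercasing and splitting doesn't matter here, a quick check on a small made-up string is enough (the string below is just an example):

sample = "The Quick Brown Fox" # a tiny made-up example
lower_then_split = sample.lower().split() # normalize first, then tokenize
split_then_lower = [word.lower() for word in sample.split()] # tokenize first, then normalize
print(lower_then_split == split_then_lower) # True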
Today we will see a "naïve" way to tokenize. This method is not suitable for our production code later! It is meant simply to illustrate the concept of tokenization and to introduce some programming topics we need to consider.
First, as I will always suggest for I427, think about the data. The starting point for tokenization is text stored in a string. The end point for tokenization is a collection of tokens, presumably a list since that's the only collection class we've focused on so far. Tokenizing a document transforms a string into a list. Each token is a word, so each token is presumably a str. Tokenization transforms a str into a list of str. This is how the data are transformed by the code, and note that the description of the data boils down to two things: what information the data hold (document, token, collection of tokens) and what type the data are (str, list of str).
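To make that description of the data concrete, here is a tiny made-up example showing the types involved before and after tokenization:

document = "an example document" # a document is a str
tokens = ["an", "example", "document"] # its tokens form a list of str
print(type(document)) # <class 'str'>
print(type(tokens)) # <class 'list'>
print(type(tokens[0])) # <class 'str'>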
If we look back over the list of str methods here, we can see that a method .split() converts a str into a list of str. Moreover, by default, what it does is "split based on whitespace". It will do what we need for now. Since it's a member function, it acts on the data stored in the str. If we invoke it like:
"my test string".split()
then the input value is "my test string" and the entire expression above evaluates to the result, the list ['my', 'test', 'string'].
Try it out in Jupyter to see. Likewise, if we have a value stored in a variable and invoke it like:
doc = "another test string"
tokens = doc.split()
then the input string is the value of doc, doc.split() evaluates to the result (['another', 'test', 'string']), and that result is stored in the variable tokens.
doc = "another test string"
tokens = doc.split()
print(tokens)
['another', 'test', 'string']
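It's worth seeing what "split based on whitespace" means in practice: runs of spaces, tabs, and newlines all count as a single separator, and no empty strings end up in the result. Here's a small made-up example:

messy = "  several   words\tseparated by\nassorted whitespace  "
print(messy.split())
['several', 'words', 'separated', 'by', 'assorted', 'whitespace']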
So what we've seen is that .split() performs a simple version of tokenization (we'll see why it's too simple in a little bit...), and the member function operates on input data (the value of the instance of the str class) and returns an output value (the list of tokens). Operating on input data and returning output data is what functions do. When we take a data-driven approach to programming and recognize that tokenization transforms a string into a collection of strings, we are heading towards recognizing that we should implement tokenization as a function. Soon we will devote significant energy in the course to reviewing what functions are and to learning and practicing how to write them in Python. You've seen functions in Python in I110, and you've seen methods in Java in I210 and I211. Methods and functions are synonyms for the same concept. Different programming languages or programmers tend to emphasize different names for them. Subroutine and procedure are two other synonyms. I prefer function because it emphasizes the similarity with mathematical functions:

output = f(input)

where f is the function. Python actually makes a distinction between what a function is and what a method is, and we'll talk about that soon...
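As a preview of where we're heading (we'll study functions properly soon), a first draft of such a function might look like the sketch below. The name tokenize is our own choice, and the body is just today's naïve approach, so don't treat it as the final version:

def tokenize(document):
    """Naively tokenize a document: normalize to lowercase, then split on whitespace."""
    normalized = document.lower()
    tokens = normalized.split()
    return tokens

tokens = tokenize("Another Test String")
print(tokens)
['another', 'test', 'string']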
Use .split() for tokenization for now. We'll see a better way soon...
normalized_text.split()
['christopher', 'j.', 'kimmer', 'christopher', 'j.', 'kimmer,', 'ph.d.', 'iu', 'southeast', 'informatics', 'iscicl', 'informatics', 'and', 'scientific', 'computing', 'laboratory', 'home', 'teaching', 'research', 'calendar', 'links', 'contact', 'i427', 'tentative', 'class', 'schedule', 'spring', '2014', 'tentative', 'means', 'that', 'this', 'schedule', 'may', 'change', 'due', 'to', 'weather,', 'classroom', 'pace,', 'etc.', 'date', 'class', 'topic', 'reading,', 'assignments,', 'etc.', 'm', '1/13', 'syllabus,', 'information', 'retrieval,', 'intro', 'to', 'python', 'interpreter', 'w', '1/15', 'python', 'lists', 'm', '1/20', 'no', 'class', '-', 'mlk', 'day', 'holiday', 'w', '1/22', 'python', 'dictionaries', 'm', '1/27', 'serialization,', 'list', 'comprehensions', '-', 'hw', '1', 'due', 'w', '1/29', 'ipython', 'setup,', 'mysqldb', 'm', '2/3', 'no', 'class', '-', 'kimmer', 'sick', 'w', '2/5', 'twitter', 'api', 'introduction', 'm', '2/10', 'twitter', 'api', '-', 'what', 'can', 'we', 'get?', '-', 'hw', '2', 'due', 'w', '2/12', 'in', 'class:', 'twitter', 'wrapper', 'class', '-', 'hw', '3', 'due', 'm', '2/17', 'twittir', '-', 'term-document', 'incidence', '&', 'inverted', 'indices', 'in', 'class', '1', 'hand-in', 'w', '2/19', 'in', 'class:', 'inverted', 'index', 'm', '2/24', 'ranking', '&', 'stemming', 'tweets', 'in', 'class', '2', 'hand-in', 'w', '2/26', 'how', 'similar', 'are', 'two', 'twitter', 'users?', 'm', '3/3', '...continued.', 'w', '3/5', 'scraping', 'web', 'pages', 'read', 'chapter', '5', 'in', 'mining', 'the', 'social', 'web', 'project', '1', 'due', 'm', '3/10', 'storing', 'web', 'pages', 'in', 'a', 'database', 'w', '3/12', 'in', 'class', 'scraping', 'm', '3/17', 'crawling', 'the', 'web', '-', 'bfs', 'w', '3/19', 'crawling', 'the', 'web', '-', 'real-world', 'issues', 'm', '3/31', 'scoring', '&', 'ranking', 'documents', '-', 'tf,', 'df,', 'idf,', 'tf-idf,', 'and', 'd', 'w', '4/2', 'the', 'vector', 'space', 'model', 'hw', '5', 'due', 'fri.', '4/4', 'at', '11', 'am', 'm', '4/7', 'test', '1', '-', 'python', 'practical', 'w', '4/9', 'biwords', 'and', 'query', 'processing', 'm', '4/14', 'implementation', 'details', 'and', 'catch-up,', 'simple', 'voting', 'w', '4/16', 'hits', 'project', '2', 'due', '11', 'am', 'm', '4/21', 'page', 'rank', 'part', '1', 'w', '4/23', 'finish', 'up', 'page', 'rank', 'hw', '6', 'due', '11', 'am', 'm', '4/28', 'search', 'engine', 'presentations', 'hw', '7', 'due', '11', 'am', 'm', '5/5', '11', 'am', 'test', '2', 'project', '3', 'due', 'courses', 'i101:', 'introduction', 'to', 'informatics', 'i110:', 'programming', 'concepts', 'i111:', 'introduction', 'to', 'databases', 'i300:', 'human', 'computer', 'interaction', 'i308:', 'information', 'representation', 'i427:', 'search', 'informatics', 'i441:', 'interaction', 'design', 'practice', 'i494:', 'capstone', 'i', 'i495:', 'capstone', 'ii', 'cjkimmer', '-at-', 'ius.edu', 'valid', 'xhtml', '|', 'css']
If we scroll through the list of tokens from our document, we see some tokens that aren't words. Items like '-', 'w', 'get?', and '3/19' appear in the list, among other things. If we want our tokens to be words or potential search terms, then we have a problem. One of the issues has to do with the presence of punctuation, and another issue has to do with the fact that we might need to verify a token is a word before we "admit it" to our list of tokens. We will have to deal with these issues as we go forward.
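As a taste of how the punctuation issue might eventually be handled (this is only a sketch, not necessarily the approach we'll settle on), the str class also has a .strip() method that removes given characters from the ends of a string, and the standard library's string module provides a ready-made string of punctuation characters:

import string # string.punctuation contains common punctuation characters

token = 'get?'
cleaned = token.strip(string.punctuation) # strips punctuation from the ends only
print(cleaned)
get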
This question appears on a quiz in Canvas.
.split()?