We are focusing on web search this semester in I427 although there are other kinds of search engines, too. There are search engines for legal cases and precedents, PDFs and other documents, medical records, products, ... The concepts are the same for any search engine, but the implementation details will differ depending on what kind of documents comprise the corpus. For web search, the corpus is a collection of web pages. For simplicity we will assume that the web pages in our corpus are either HTML or text documents, although many other kinds of content are also served to web browsers.
If we are to write a web search engine, we need to be able to retrieve the HTML or text content of a web page. Actually retrieving the content from an external web server is not the search engine's responsibility. A web crawler is a separate piece of software that visits web pages and archives them for the search engine. The web crawler may store the pages in persistent storage like a database, and the search engine can access the HTML from the persistent store.
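To make that division of labor concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and the two helper functions are hypothetical choices made for illustration, and the in-memory database stands in for whatever persistent store a real crawler would use.

```python
import sqlite3

# Hypothetical schema: one row per crawled page, keyed by its URL.
# ":memory:" keeps the database in RAM; a real crawler would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT)")


def archive_page(url, html):
    """Crawler side: save (or overwrite) the HTML stored for a URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
    )
    conn.commit()


def load_page(url):
    """Search-engine side: return the stored HTML, or None if never crawled."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None


archive_page("http://example.com/", "<html><body>Hello</body></html>")
print(load_page("http://example.com/"))
print(load_page("http://example.com/missing"))
```

The search engine never talks to the web server in this arrangement; it only ever calls something like load_page.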
A web page is identified by its URL, and given that unique URL, we will see how to retrieve the content in a Python program. In essence our Python code will act like a web browser and request a web server to respond with the content corresponding to the URL. Web browsers render the content once a complete web page (HTML, CSS, JavaScript (JS), images, ...) has been loaded, but our program does not have to worry about doing that. In I427, we will assume that the HTML content returned to us in the response is the complete web page. Since we're not searching images or CSS, we don't need to retrieve those. We'll cover enough detail to see how to retrieve these items if we wished, but we won't need to here.
I427 is an intro to Information Retrieval and search engines, and most topics we see will just be skimming the surface. This material is no exception. Many webpages today can't be rendered from the initial HTML response. JavaScript in the document typically loads other content asynchronously after the initial request, particularly if modern frameworks are used for the web page's interface. To load a page in this fashion, our Python program would also have to include a running JS engine. Rather than incorporate a JS engine, most web crawlers use a headless browser that acts just like a regular browser except that it doesn't render for display (hence "headless"). The headless browser still includes the JS engine and all the other components of a full browser. I do this kind of web crawling and scraping for some other projects and use a normal, full install of Google Chrome for this purpose. It has a headless mode that can be activated with a command line argument. Talk to me outside of the regular flow of class if you're interested in this aspect or want to try it out with your web crawler this semester in I427.
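For the curious, here is a rough sketch of what driving headless Chrome from Python could look like. It is not required for I427. The function names are hypothetical, and the executable name "google-chrome" is an assumption that varies by operating system and install; --headless and --dump-dom are Chrome command line switches that run without a window and print the page's DOM after scripts have run.

```python
import shutil
import subprocess


def headless_command(url, browser="google-chrome"):
    """Build the command line; the executable name is an assumption."""
    return [browser, "--headless", "--dump-dom", url]


def fetch_rendered_html(url, browser="google-chrome", timeout=30):
    """Run the browser and return the DOM it prints. Needs Chrome installed."""
    if shutil.which(browser) is None:
        raise FileNotFoundError(f"{browser} was not found on the PATH")
    result = subprocess.run(
        headless_command(url, browser),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


# Only build and show the command here; nothing is launched.
print(headless_command("https://example.com/"))
```

Calling fetch_rendered_html would actually launch the browser, so it is left uncalled in this sketch.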
Once we have the HTML, we're not out of the woods yet. The search engine needs to get the actual text content of the document, so we have to extract that content from all the other markup and non-content (links, images, JavaScript, ...). To extract the content, we will have to parse the HTML. If we're crawling the web, which we will do at the end of this course, then we also have to extract all the links from the page in order to see if there are any new pages we should also retrieve to add to our corpus. In this section, we'll see how to parse the HTML. Later in the semester, we'll also see how to extract links for our web crawler.
A popular tech interview question is "What happens when you type a URL into a browser bar?". Here's an overview assuming an HTTP connection (HTTPS is more complicated because of key exchange and other issues):
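To get a feel for the request side of that exchange, here is a rough sketch that splits a URL into its parts and builds the plain-text HTTP GET request a browser would send. The function names are hypothetical, the example.com URL is a placeholder, and the sketch ignores HTTPS, redirects, non-default ports in the Host header, and response parsing.

```python
import socket
from urllib.parse import urlsplit


def build_get_request(url):
    """Split the URL and build a minimal HTTP/1.1 GET request as text."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # plain HTTP defaults to port 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"                       # a blank line ends the headers
    )
    return host, port, request


def send_request(host, port, request):
    """Open a TCP connection, send the request, and read the whole reply.

    Not called below because it needs network access.
    """
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:            # empty bytes means the server closed
                break
            chunks.append(chunk)
    return b"".join(chunks)


host, port, request = build_get_request("http://example.com/index.html")
print(host, port)
print(request)
```

The socket.create_connection call hides the DNS lookup and TCP handshake; the reply from send_request would contain a status line, headers, and then the HTML.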
If you dreaded programming all of this stuff while reading, you had the correct reaction. We don't have time to do that, and we shouldn't have to since we don't really have that expertise. We need to use other people's code for this, and in Python, as in Java, doing that usually boils down to importing code. External code that we import into our programs lives in modules in Python. Eventually we'll see how to write modules, but today we just need to use them.
The general syntax to import a module is
import module_name
We'll also see a variant today that looks like
import very_long_module_name_we_dont_wish_to_type_again as short_name
import module_name_that_is_not_the_typically_used_one as typical_name
With either of these variants, we will have brought new code into our program via this module. If a module we import has a syntax error or a runtime error as code executes during the import, then we will see an error message. If we see no error message, then the import was successful, and we can use code from the module.
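Here is a small sketch of both variants using two standard-library modules (json and datetime), chosen only because they ship with Python. The module name in the failing import is made up on purpose so that we can see what an unsuccessful import looks like.

```python
import json              # plain import: use it as json.something
import datetime as dt    # aliased import: use it as dt.something

print(json.dumps({"course": "I427"}))
print(dt.date(2014, 1, 13).year)

# A failed import raises an exception instead of passing silently.
import_failed = False
try:
    import module_that_does_not_exist_i427  # hypothetical, not a real module
except ImportError as err:
    import_failed = True
    print("import failed:", err)
```

If no exception is raised, the import worked and the module's contents are available under the name we gave it.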
If you Google around, you'll also see a from module_name import * syntax. Don't use this in your code! It's an antipattern. We'll see why that usage is not recommended below, after a little more background.
There are two main modules we'll need for reading in a webpage. The urllib.request module simply requests a page from a web server and then receives the response (remember our Python program acts like a web browser that doesn't render the page for us). The Beautiful Soup module (version 4, imported as bs4, is the latest) will let us parse the markup.
import urllib.request # ability to connect to web server & get HTML
import bs4 # parses the HTML
Assuming we know the URL and have it in our program, it's very easy to get the HTML for a document!
url_to_read = "https://ada.ius.edu/~cjkimmer/teaching/i427.html"
response = urllib.request.urlopen(url_to_read) # request
html_document = response.read() # read response from server
response.close()
We have all the HTML markup now stored in html_document and can do with it as we please.
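Requests to real servers can fail (bad URL, server down, no network), so a slightly more defensive version may be useful later. The sketch below wraps the same urlopen call in a with block, which closes the response for us, and catches the errors urllib raises. The function name fetch_bytes and the timeout value are hypothetical choices. The demo uses a data: URL, which urlopen can read without any network access, purely so the example runs anywhere.

```python
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=10):
    """Return the raw response body for url, or None if the request fails."""
    try:
        # The with block closes the response even if read() raises.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, ValueError) as err:
        # URLError covers network and HTTP errors; ValueError covers
        # strings that are not URLs at all.
        print("could not fetch", url, "-", err)
        return None


# A data: URL carries its content inline, so no server is contacted.
demo_url = "data:text/html;charset=utf-8,%3Cp%3EHello%3C/p%3E"
print(fetch_bytes(demo_url))
print(fetch_bytes("not a url"))
```

With a real http:// or https:// URL the function behaves the same way as the three lines above, except that a failure produces None instead of a traceback.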
print(html_document)
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml">\n <head>\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n <title>\n Christopher J. Kimmer\n </title>\n <link href="../css/style.css" rel="stylesheet" type="text/css" />\n </head>\n <body>\n <div id="wrapper">\n <div id="content">\n <div id="mainimg">\n <div id="logo">\n <h1>\n Christopher J. Kimmer, Ph.D.\n </h1>\n <h2>\n IU Southeast Informatics\n </h2>\n </div>\n <h3>\n iSciCL\n </h3>\n <h4>\n Informatics and Scientific Computing Laboratory\n </h4>\n </div>\n <div id="navheader">\n <ul>\n <li style="border-left: 0px">\n <a href="../index.html">\n Home\n </a>\n </li>\n <li id="focused"> \n <a href="./teaching.html">\n Teaching\n </a>\n </li>\n <li>\n <a href="../research/research.html">\n Research\n </a>\n </li>\n <li>\n <a href="../calendar.html">\n Calendar\n </a>\n </li>\n <li>\n <a href="../links.html">\n Links\n </a>\n </li>\n <li style="width:124px;">\n <a href="../contact.html">\n Contact\n </a>\n </li>\n </ul>\n </div>\n <div id="contentarea">\n <div id="leftbar">\n <!-- lemon tree -->\n <h2>I427 Tentative Class Schedule \n </h2>\n <h3> Spring 2014</h3>\n <p><em>\n Tentative means that this schedule may change due to weather, classroom pace, etc.</em>\n </p>\n <br />\n <table>\n\t<tr>\n\t <td>Date\n\t </td>\n\t <td>\n\t\tClass Topic\n\t <br/>\n\t\tReading, Assignments, etc.\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 1/13\n\t </td>\n\t <td>Syllabus, Information Retrieval, Intro to Python Interpreter\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 1/15\n\t </td>\n\t <td>Python Lists\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 1/20\n\t </td>\n\t <td> NO CLASS - MLK DAY HOLIDAY\n\t </td>\n\t <td>\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 1/22\n\t </td>\n\t <td>Python Dictionaries\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 1/27\n\t </td>\n\t 
<td>Serialization, List Comprehensions - HW 1 due\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 1/29\n\t </td>\n\t <td>IPython Setup, MySQLdb \n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 2/3\n\t </td>\n\t <td>No Class - Kimmer Sick\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 2/5\n\t </td>\n\t <td>Twitter API Introduction \n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 2/10\n\t </td>\n\t <td>Twitter API - What can we get? - HW 2 due\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 2/12\n\t </td>\n\t <td>In Class: Twitter wrapper Class - HW 3 due\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 2/17\n\t </td>\n\t <td>TwittIR - Term-Document Incidence & Inverted Indices\n\t <br/>\n\t In Class 1 Hand-in\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 2/19\n\t </td>\n\t <td>In Class: Inverted Index \n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 2/24\n\t </td>\n\t <td>Ranking & Stemming Tweets\n\t <br/>\n\t In Class 2 Hand-in\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 2/26\n\t </td>\n\t <td>How similar are two Twitter Users?\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 3/3\n\t </td>\n\t <td>...continued.\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 3/5\n\t </td>\n\t <td>Scraping Web Pages\n\t <em>Read Chapter 5 in Mining the Social Web</em>\n\t <strong>Project 1 Due</strong>\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 3/10\n\t </td>\n\t <td>Storing Web Pages in a Database\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 3/12\n\t </td>\n\t <td>In Class Scraping\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 3/17\n\t </td>\n\t <td>Crawling the Web - BFS\n\t </td>\n\t</tr>\t\n\t<tr class="day2">\n\t <td>W 3/19\n\t </td>\n\t <td>Crawling the Web - Real-World Issues\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 3/31\n\t </td>\n\t <td>Scoring & Ranking Documents - tf, df, idf, tf-idf, and d\n\t </td>\n\t</tr>\t \n\t<tr class="day2">\n\t <td>W 4/2\n\t </td>\n\t <td>\n\t The Vector Space Model\n\t <br/>\n\t <strong>HW 5 
due Fri. 4/4 at 11 AM</strong>\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 4/7\n\t </td>\n\t <td>Test 1 - Python Practical\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 4/9\n\t </td>\n\t <td>Biwords and query processing\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 4/14\n </td>\n <td>Implementation Details and Catch-up, Simple Voting\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 4/16\n\t </td>\n\t <td>HITS <br/>\n\t <strong>Project 2 due 11 AM</strong>\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 4/21\n\t </td><td>\nPage Rank Part 1\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>W 4/23\n\t </td>\n\t <td>Finish Up Page Rank <br/>\n\t <strong>HW 6 due 11 AM</strong>\n\t </td>\n\t</tr>\n\t<tr class="day1">\n\t <td>M 4/28\n\t </td>\n\t <td>Search Engine Presentations <br/>\n\t <strong>HW 7 due 11 AM</strong>\n\t </td>\n\t</tr>\n\t<tr class="day2">\n\t <td>M 5/5\n\t </td>\n\t <td>11 AM Test 2 <br/>\n\t <strong>Project 3 Due</strong>\n\t </td>\n\t</tr>\n </table>\n <br />\n <!-- lemon tree -->\n </div>\n <div id="rightbar">\n <h2 class="colortext">\n Courses\n </h2>\n <p>\n <a href="i101.html">\n I101: Introduction to Informatics\n </a>\n </p>\n <p>\n <a href="i110.html">\n I110: Programming Concepts\n </a>\n </p>\n <p>\n <a href="i111.html">\n I111: Introduction to Databases\n </a>\n </p>\n <p>\n <a href="i300.html">\n I300: Human Computer Interaction\n </a>\n </p>\n <p>\n <a href="i308.html">\n I308: Information Representation\n </a>\n </p>\n <p>\n <a href="i427.html">\n I427: Search Informatics\n </a>\n </p>\n <p>\n <a href="i441.html">\n I441: Interaction Design Practice\n </a>\n </p>\n <p>\n <a href="i494.html">\n I494: Capstone I\n </a>\n </p>\n <p>\n <a href="i495.html">\n I495: Capstone II\n </a>\n </p>\n </div>\n </div>\n <div id="bottom">\n <div id="email">\n <div id="emailtext">\n cjkimmer -at- ius.edu\n </div>\n </div>\n <div id="validtext">\n <p>\n Valid\n <a href="http://validator.w3.org/check?uri=referer">XHTML</a>\n |\n <a 
href="http://jigsaw.w3.org/css-validator/check/referer">CSS</a>\n </p>\n </div>\n </div>\n </div>\n </div>\n </body>\n</html>\n'
That b in front of the quotes in the output indicates that we didn't actually get a string back when we read the response from urllib.request.urlopen. Instead we got back a bytes object (hence the b in front of the quotes). The Python documentation for this module linked to above indicates that to be the case, so it's just something interesting to note here since we have previously only seen strings without any qualifier in front of the quotation marks. This page appears to look fine as it is, but for pages with special characters the output may look wrong. In either case, we can convert the byte string to a normal string in two ways. One way is to decode it using knowledge of its character set, and another way is to construct or instantiate a new str object. The two approaches are equivalent; in essence a byte string may need to be decoded.
type(html_document)
bytes
# now it's a real string
html_document.decode('utf-8')[:256] # just show the first few characters. we don't need all the output
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml">\n <head>\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n <title>\n'
# call the str() constructor instead, specifying the character set
str(html_document,'utf-8')[:256]
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml">\n <head>\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n <title>\n'
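The page above is plain ASCII, so it hides why the character set matters. Here is a small sketch with a made-up byte string containing one accented character. It shows that the two conversions agree, and what happens when the wrong character set is assumed.

```python
# "caf\u00e9" is the word "cafe" with an accented e; encode it to get bytes.
raw = "caf\u00e9".encode("utf-8")
print(raw)

# Both conversions from the notes give the same string.
decoded = raw.decode("utf-8")
constructed = str(raw, "utf-8")
print(decoded == constructed)

# Decoding with the wrong character set "works" but produces garbage.
wrong = raw.decode("latin-1")
print(wrong)

# ASCII cannot represent these bytes at all, so decoding raises an error...
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)

# ...unless we ask for undecodable bytes to be replaced with U+FFFD.
lenient = raw.decode("ascii", errors="replace")
print(lenient)
```

The errors="replace" option is handy in a crawler, where one badly encoded page should not stop the whole run.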
Notice how the urllib.request module above is used. The bs4 module will be used in the same way below, too. In general, after import module_name you get to use the contents (variables, objects, functions, classes, ...) of module_name by name, provided you put the name of the module and then a dot before the name from the module. If there's a variable x in module_name, then you refer to x in your program as module_name.x after you've used import module_name. urllib.request provides a function called urlopen, and so we use it via urllib.request.urlopen() in our code. That's all there is to modules! Anything from the bs4 module will be easy to spot below. Just look for bs4. in front to spot variables, functions, or classes from that module.
I warned you not to use the evil from module_name import ... syntax because that variant doesn't require the module_name. prefix in your code. You can use the imported contents without indicating in your code that they came from that module. That sounds great in principle, but it's a bad practice because it makes it difficult to reason about where variables, objects, functions, classes, ... came from when it's not obvious. It also makes it more likely for the name of something you import to cover up something of the same name in your code. These things will happen as your programs increase in size! This evil, bad syntax is very common in code examples online, so you will come across it. It's common online because it leads to shorter code, and that's generally a bonus in code snippets that are illustrating a point other than using namespaces correctly.
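Here is a small sketch of the "cover up" problem using two standard-library modules, math and cmath, which both happen to define a function named sqrt. They are used only as a convenient pair of modules with a clashing name.

```python
import math
import cmath

# With plain imports, the prefix says exactly which sqrt we mean.
print(math.sqrt(4))     # real-number square root
print(cmath.sqrt(-4))   # complex square root, fine with negative input

# With from-imports, the second import quietly replaces the first.
from cmath import sqrt
from math import sqrt   # sqrt now means math.sqrt; no warning is shown

try:
    sqrt(-4)            # math.sqrt rejects negative numbers
    shadowed = False
except ValueError:
    shadowed = True
print(shadowed)
```

With from module_name import * the same replacement happens, except that the name sqrt never even appears in the import lines, which makes the bug much harder to track down.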
Beautiful Soup is pretty well documented. I will show you how to use Beautiful Soup for extracting data from markup. You don't have to use it. There are alternatives (nltk), but I will not support them.
HTML is a sloppy standard, and a lot of web pages have errors or are, at the least, not what we would consider well-formed markup (in the I308, XML sense). That makes parsing them tough because you have to deal with tags not closing, etc. Beautiful Soup has solved that problem for us, so we can use it at a higher level of abstraction. The soup part of its name refers to, I believe, how sloppy HTML is just a mess of tags and potential errors that are too important to give up on (not render) when there are mistakes that can be ignored or handled forgivingly.
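As a small illustration of that forgiveness, the sketch below feeds Beautiful Soup a made-up snippet with unclosed p and b tags and still gets clean text back. The snippet is invented for this example, and "html.parser" (Python's built-in parser, one of the parser names bs4 accepts) is used here only so that nothing beyond bs4 itself needs to be installed. Removing script and style tags first is a precaution, since whether their contents show up in the extracted text can depend on the Beautiful Soup version.

```python
import bs4

# Deliberately sloppy markup: the <p> and <b> tags are never closed properly.
sloppy_html = """<html><head><title>Demo</title>
<style>p { color: red; }</style></head>
<body><p>First paragraph<p>Second <b>bold</p>
<script>var x = 1;</script></body></html>"""

soup = bs4.BeautifulSoup(sloppy_html, "html.parser")

# Drop tags whose contents are code or styling rather than page text.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

# Join the remaining pieces of text with single spaces.
text = soup.get_text(separator=" ", strip=True)
print(text)
```

No exception is raised for the missing closing tags; the parser simply makes its best guess at the intended structure.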
The starting point for anything you do with this module is to construct a soup object. Once you have the soup object, you can invoke its methods to do whatever is needed. The first argument can be a string (or a bytes object like the one we read above) containing the HTML, which is appropriate for us when we're reading HTML as we are above. The second argument is the HTML parser to use. There are other options than "lxml" below, but they're not really worth getting into. If you leave out this second argument you get a warning message, so it's best not to leave it out, just in case you run across an HTML document that might give the default parser fits.
soup = bs4.BeautifulSoup(html_document, "lxml")
text_content = soup.get_text()
print(text_content) # this is something that we can tokenize & normalize before adding to index
Christopher J. Kimmer Christopher J. Kimmer, Ph.D. IU Southeast Informatics iSciCL Informatics and Scientific Computing Laboratory Home Teaching Research Calendar Links Contact I427 Tentative Class Schedule Spring 2014 Tentative means that this schedule may change due to weather, classroom pace, etc. Date Class Topic Reading, Assignments, etc. M 1/13 Syllabus, Information Retrieval, Intro to Python Interpreter W 1/15 Python Lists M 1/20 NO CLASS - MLK DAY HOLIDAY W 1/22 Python Dictionaries M 1/27 Serialization, List Comprehensions - HW 1 due W 1/29 IPython Setup, MySQLdb M 2/3 No Class - Kimmer Sick W 2/5 Twitter API Introduction M 2/10 Twitter API - What can we get? - HW 2 due W 2/12 In Class: Twitter wrapper Class - HW 3 due M 2/17 TwittIR - Term-Document Incidence & Inverted Indices In Class 1 Hand-in W 2/19 In Class: Inverted Index M 2/24 Ranking & Stemming Tweets In Class 2 Hand-in W 2/26 How similar are two Twitter Users? M 3/3 ...continued. W 3/5 Scraping Web Pages Read Chapter 5 in Mining the Social Web Project 1 Due M 3/10 Storing Web Pages in a Database W 3/12 In Class Scraping M 3/17 Crawling the Web - BFS W 3/19 Crawling the Web - Real-World Issues M 3/31 Scoring & Ranking Documents - tf, df, idf, tf-idf, and d W 4/2 The Vector Space Model HW 5 due Fri. 4/4 at 11 AM M 4/7 Test 1 - Python Practical W 4/9 Biwords and query processing M 4/14 Implementation Details and Catch-up, Simple Voting W 4/16 HITS Project 2 due 11 AM M 4/21 Page Rank Part 1 W 4/23 Finish Up Page Rank HW 6 due 11 AM M 4/28 Search Engine Presentations HW 7 due 11 AM M 5/5 11 AM Test 2 Project 3 Due Courses I101: Introduction to Informatics I110: Programming Concepts I111: Introduction to Databases I300: Human Computer Interaction I308: Information Representation I427: Search Informatics I441: Interaction Design Practice I494: Capstone I I495: Capstone II cjkimmer -at- ius.edu Valid XHTML | CSS
This is the same as a quiz question that will appear on Canvas. I'll use it to check that you can write enough Python code to read an HTML page and extract the content.
soup.get_text()?
There are other quiz questions that cover the basic syntax and ideas of modules (this set of notes) and lists (next set of notes).
It's already been mentioned above that the corpus would be housed in a persistent store like a database. We only covered dealing with a single webpage--one document in a corpus. Fortunately, it was mentioned that this part of building a search engine would be the web crawler's responsibility. When we write a web crawler at the end of the semester, we'll know how to store the document's content in a database, and we'll know how to visit multiple pages. So this missing piece will get solved towards the end of the semester.
One problem that will need to be dealt with before then, though, is that sometimes we will need to maintain a collection of documents in memory in our program. For instance, when our search engine determines that, say, 12 documents match the search terms, we will need to manage a collection of those 12 matching documents in memory. We will need to learn how to manage collections of objects in our Python program. In Java, array types would be the first data structure for a collection most students encounter. Python does not have a built-in array type, but it has a list type which is similar in some key ways to Java arrays. Learning to work with lists in Python so that we can manage collections of data or objects in our code is the next topic...
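As a small preview, here is a sketch of a list holding the documents that matched a search. The file names are hypothetical stand-ins for real documents, and the variable name is made up.

```python
# Hypothetical search results: each string stands in for one matching document.
matching_documents = ["i427.html", "i308.html", "i110.html"]

# Unlike a Java array, a Python list can grow after it is created.
matching_documents.append("i101.html")

print(len(matching_documents))   # how many documents matched
print(matching_documents[0])     # indexing starts at 0, as in Java

# Loop over the results, numbering them from 1 for display.
for rank, document in enumerate(matching_documents, start=1):
    print(rank, document)
```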