Exercise 1: Change the socket program socket1.py to prompt the user for the URL so it can read any web page. You can use split('/') to break the URL into its component parts so you can extract the host name for the socket connect call. Add error checking using try and except to handle the condition where the user enters an improperly formatted or non-existent URL.
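The split('/') hint can be sketched as follows. This is a minimal illustration, not a full solution: the function name host_from_url and the sample URL are assumptions chosen for the example, and the format check is deliberately simple.

```python
def host_from_url(url):
    """Return the host name from a URL such as 'http://host/path' (hypothetical helper)."""
    parts = url.split('/')
    # 'http://data.pr4e.org/romeo.txt'.split('/') gives
    # ['http:', '', 'data.pr4e.org', 'romeo.txt'], so the host is at index 2.
    if len(parts) < 3 or not parts[0].endswith(':') or parts[1] != '' or parts[2] == '':
        raise ValueError('improperly formatted URL: ' + url)
    return parts[2]


try:
    # Sample URL used only for illustration.
    print(host_from_url('http://data.pr4e.org/romeo.txt'))
except ValueError as err:
    print('Could not parse URL:', err)
```

A non-existent host would not be caught here; it would surface later as an exception from the socket connect call, which can be handled with a second try and except.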
Exercise 2: Change your socket program so that it counts the number of characters it has received and stops displaying any text after it has shown 3000 characters. The program should retrieve the entire document, count the total number of characters, and display that count at the end of the document.
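One way to think about the counting logic, independent of how the data arrives, is sketched below. It assumes the document comes in as a sequence of already-decoded string chunks (for example, one per recv call); the name limit_display and its parameters are hypothetical.

```python
def limit_display(chunks, limit=3000):
    """Return (text_to_show, total_count) for an iterable of string chunks.

    Every chunk is counted, but only the first `limit` characters are kept
    for display.
    """
    shown = 0
    total = 0
    pieces = []
    for chunk in chunks:
        total += len(chunk)
        if shown < limit:
            # Take only as much of this chunk as still fits under the limit.
            piece = chunk[:limit - shown]
            pieces.append(piece)
            shown += len(piece)
    return ''.join(pieces), total


text, total = limit_display(['a' * 2000, 'b' * 2000, 'c' * 500])
print(len(text), total)
```

Note that the loop keeps reading after the display limit is reached, since the exercise asks for the total size of the whole document.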
Exercise 3: Use urllib to replicate the previous exercise of (1) retrieving the document from a URL, (2) displaying up to 3000 characters, and (3) counting the overall number of characters in the document. Don't worry about the headers for this exercise; simply show the first 3000 characters of the document contents.
Exercise 4: Change the urllinks.py program to extract and count paragraph (p) tags from the retrieved HTML document and display the count of the paragraphs as the output of your program. Do not display the paragraph text, only count them. Test your program on several small web pages as well as some larger web pages.
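The counting step can be tried out on a literal HTML string before wiring it into a program that retrieves pages. The sketch below uses the standard library's html.parser module as a stand-in, which is an assumption made to keep the example self-contained; it is not necessarily the parsing approach urllinks.py takes. The names ParagraphCounter and count_paragraphs are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Counts opening <p> tags; the paragraph text itself is ignored."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <P> is counted as well.
        if tag == 'p':
            self.count += 1


def count_paragraphs(html):
    parser = ParagraphCounter()
    parser.feed(html)
    return parser.count


sample = '<html><body><p>One</p><P class="x">Two</P><pre>not a paragraph</pre></body></html>'
print(count_paragraphs(sample))
```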
Exercise 5: (Advanced) Change the socket program so that it only shows data after the headers and a blank line have been received. Remember that recv is receiving characters (newlines and all), not lines.
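Because recv returns arbitrary pieces of the response, the blank line that ends the headers can be split across two calls. A minimal sketch of one way to handle this is to accumulate the pieces and then search for the separator. It assumes an HTTP response whose headers end with the byte sequence \r\n\r\n; the function name split_headers is hypothetical, and the chunks list stands in for successive recv results.

```python
def split_headers(chunks):
    """Join byte chunks and return (headers, body), split at the first blank line."""
    data = b''
    for chunk in chunks:
        data += chunk
    # In HTTP the headers end with an empty line, i.e. two CRLF pairs in a row.
    pos = data.find(b'\r\n\r\n')
    if pos == -1:
        # No blank line seen: treat everything as headers.
        return data, b''
    return data[:pos], data[pos + 4:]


# The separator is deliberately split across two chunks here.
chunks = [b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r', b'\n\r\nHello', b' world\n']
headers, body = split_headers(chunks)
print(body.decode())
```

This version buffers the whole response before splitting, which is simple but not the only option; a program could instead start printing as soon as the separator has been seen.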
- The XML format is described in the next chapter.