Engineering LibreTexts

4.3: Word Histogram


    You should attempt the previous exercises before you go on. You can download my solution from http://thinkpython.com/code/analyze_book.py. You will also need http://thinkpython.com/code/emma.txt.

    Here is a program that reads a file and builds a histogram of the words in the file:

    import string
    
    def process_file(filename):
        hist = dict()
        # The with statement closes the file when the loop finishes.
        with open(filename) as fp:
            for line in fp:
                process_line(line, hist)
        return hist
    
    def process_line(line, hist):
        line = line.replace('-', ' ')
        
        for word in line.split():
            word = word.strip(string.punctuation + string.whitespace)
            word = word.lower()
    
            hist[word] = hist.get(word, 0) + 1
    
    hist = process_file('emma.txt')
    

    This program reads emma.txt, which contains the text of Emma by Jane Austen.

    process_file loops through the lines of the file, passing them one at a time to process_line. The histogram hist is being used as an accumulator.
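
To make the accumulator idea concrete, here is a minimal sketch that uses a list of strings in place of a file. The names add_line and counts are hypothetical and chosen only for this illustration. The point is that the helper function returns nothing; it modifies the dictionary it is given, so the counts build up across calls.

```python
def add_line(line, counts):
    # Modifies counts in place; there is no return value.
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1

counts = {}
same_object = counts            # a second name for the same dictionary

for line in ["a b", "b c"]:     # stands in for the lines of a file
    add_line(line, counts)

# counts is now {'a': 1, 'b': 2, 'c': 1}, and it is still the same
# dictionary object that was created before the loop.
```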

    process_line uses the string method replace to replace hyphens with spaces before using split to break the line into a list of strings. It traverses the list of words and uses strip and lower to remove punctuation and convert to lower case. (Saying that strings are “converted” is shorthand; remember that strings are immutable, so methods like strip and lower return new strings.)
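
The steps described above can be traced on a single line of text. This is only a sketch: the sample line is made up for illustration and does not come from emma.txt.

```python
import string

line = "Well-known, she said; 'quite so!'\n"

# Hyphens become spaces, so "Well-known" splits into two words.
line = line.replace('-', ' ')
words = line.split()
# words is ['Well', 'known,', 'she', 'said;', "'quite", "so!'"]

cleaned = []
for word in words:
    # strip removes punctuation and whitespace from both ends only.
    stripped = word.strip(string.punctuation + string.whitespace)
    cleaned.append(stripped.lower())
# cleaned is ['well', 'known', 'she', 'said', 'quite', 'so']

# Strings are immutable: strip and lower return new strings
# and leave the original unchanged.
original = "Emma,"
result = original.strip(string.punctuation).lower()
# original is still "Emma,"; result is "emma"
```

Note that strip only works on the ends of a string, so punctuation inside a word, such as an apostrophe, is left alone.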

    Finally, process_line updates the histogram by creating a new item or incrementing an existing one.
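
The update relies on the dictionary method get, which returns a default value when the key is missing. As a rough illustration, the sketch below builds the same histogram twice, once with get and once with an explicit test. The word list and the names hist_get and hist_if are invented for this example.

```python
words = ['emma', 'was', 'emma']

# Version 1: get supplies 0 for a word that is not yet a key.
hist_get = {}
for word in words:
    hist_get[word] = hist_get.get(word, 0) + 1

# Version 2: the same logic written out with an explicit test.
hist_if = {}
for word in words:
    if word in hist_if:
        hist_if[word] += 1      # increment an existing item
    else:
        hist_if[word] = 1       # create a new item

# Both versions produce {'emma': 2, 'was': 1}.
```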

    To count the total number of words in the file, we can add up the frequencies in the histogram:

    def total_words(hist):
        return sum(hist.values())
    

    The number of different words is just the number of items in the dictionary:

    def different_words(hist):
        return len(hist)
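
As a quick check of the two functions above, here is a sketch that applies them to a tiny hand-made histogram. The dictionary sample is an assumption made up for this example; the function definitions are repeated so the snippet runs on its own.

```python
def total_words(hist):
    return sum(hist.values())

def different_words(hist):
    return len(hist)

# A hand-made histogram: 'the' appears 3 times, 'of' twice, 'emma' once.
sample = {'the': 3, 'of': 2, 'emma': 1}

total = total_words(sample)         # 3 + 2 + 1 = 6
distinct = different_words(sample)  # three keys, so 3
```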
    

    Here is some code to print the results:

    print('Total number of words:', total_words(hist))
    print('Number of different words:', different_words(hist))
    

    And the results:

    Total number of words: 161080
    Number of different words: 7214
    

    This page titled 4.3: Word Histogram is shared under a CC BY-NC-SA 3.0 license and was authored, remixed, and/or curated by Allen B. Downey (Green Tea Press) .
