12.8: Reading binary files using urllib
- Page ID
Sometimes you want to retrieve a non-text (or binary) file such as an image or video file. The data in these files is generally not useful to print out, but you can easily make a copy of a URL to a local file on your hard disk using
The pattern is to open the URL and use
read to download the entire contents of the document into a string variable (
img) then write that information to a local file as follows:
import urllib.request, urllib.parse, urllib.error img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read() fhand = open('cover3.jpg', 'wb') fhand.write(img) fhand.close() # Code: http://www.py4e.com/code3/curl1.py
This program reads all of the data in at once across the network and stores it in the variable
img in the main memory of your computer, then opens the file
cover.jpg and writes the data out to your disk. This will work if the size of the file is less than the size of the memory of your computer.
However if this is a large audio or video file, this program may crash or at least run extremely slowly when your computer runs out of memory. In order to avoid running out of memory, we retrieve the data in blocks (or buffers) and then write each block to your disk before retrieving the next block. This way the program can read any size file without using up all of the memory you have in your computer.
Code 12.8.1 (Python)
%%python3 import urllib.request, urllib.parse, urllib.error img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg') fhand = open('cover3.jpg', 'wb') size = 0 while True: info = img.read(100000) if len(info) < 1: break size = size + len(info) fhand.write(info) print(size, 'characters copied.') fhand.close() # Code: http://www.py4e.com/code3/curl2.py
In this example, we read only 100,000 characters at a time and then write those characters to the
cover.jpg file before retrieving the next 100,000 characters of data from the web.
This program runs as follows:
python curl2.py 568248 characters copied.
If you have a Unix or Macintosh computer, you probably have a command built in to your operating system that performs this operation as follows:
curl -O http://www.py4e.com/cover.jpg
curl is short for "copy URL" and so these two examples are cleverly named
curl2.py on www.py4e.com/code3 as they implement similar functionality to the
curl command. There is also a
curl3.py sample program that does this task a little more effectively, in case you actually want to use this pattern in a program you are writing.