Most operating systems provide a command-line interface, also known as a shell. Shells usually provide commands to navigate the file system and launch applications. For example, in Unix you can change directories with cd, display the contents of a directory with ls, and launch a web browser by typing its name.
Any program that you can launch from the shell can also be launched from Python using a pipe. A pipe is an object that represents a running program.
For example, the Unix command ls -l normally displays the contents of the current directory (in long format). You can launch it with os.popen:
>>> import os
>>> cmd = 'ls -l'
>>> fp = os.popen(cmd)
The argument is a string that contains a shell command. The return value is an object that behaves just like an open file. You can read the output from the ls process one line at a time with readline or get the whole thing at once with read:
>>> res = fp.read()
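To make the difference between readline and read concrete, here is a minimal sketch. The helper name read_first_and_rest is hypothetical, and the echo command is only a stand-in that assumes a Unix-style shell; any command that prints several lines would do. The code uses print as a function, so it should run under Python 2 or 3.

```python
import os

def read_first_and_rest(cmd):
    """Run cmd in a shell; return (first line, remaining output)."""
    fp = os.popen(cmd)
    first = fp.readline()   # one line, trailing newline included
    rest = fp.read()        # whatever output is left after that line
    fp.close()
    return first, rest

# Hypothetical command that prints two lines (assumes a Unix-style shell).
first, rest = read_first_and_rest('echo first && echo second')
print(repr(first))
print(repr(rest))
```

After readline has consumed the first line, read returns only what remains, just as it would for an ordinary file.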
When you are done, you close the pipe like a file:
>>> stat = fp.close()
>>> print stat
None
The return value is the final status of the ls process; None means that it ended normally (with no errors).
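To see a status other than None, you can run a command that fails on purpose. The sketch below assumes a Unix-style shell, where exit 3 ends the shell with exit code 3; the exact number that close returns is an encoding of that code and depends on the platform, so the only safe test is whether it is None.

```python
import os

ok = os.popen('exit 0').close()       # the command succeeded
failed = os.popen('exit 3').close()   # the command exited with code 3

print(ok)       # None
print(failed)   # some value other than None; the encoding varies by platform
```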
As another example, most Unix systems provide a command called md5sum that reads the contents of a file and computes a “checksum.” You can read about MD5 at http://en.Wikipedia.org/wiki/Md5. This command provides an efficient way to check whether two files have the same contents. The probability that different contents yield the same checksum is very small (that is, unlikely to happen before the universe collapses).
You can use a pipe to run md5sum from Python and get the result:
>>> filename = 'book.tex'
>>> cmd = 'md5sum ' + filename
>>> fp = os.popen(cmd)
>>> res = fp.read()
>>> stat = fp.close()
>>> print res
1e0033f0ed0656636de0d75144ba32e0 book.tex
>>> print stat
None
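The result contains the file name as well as the checksum, so to compare two files you probably want only the first field. Here is one way that might look; the function names parse_md5sum_output and checksum are hypothetical, and the sketch assumes md5sum is installed and that the file name contains no spaces or other characters the shell treats specially.

```python
import os

def parse_md5sum_output(res):
    """Return just the checksum from a line of md5sum output."""
    # md5sum prints the checksum, some whitespace, then the file name.
    return res.split()[0]

def checksum(filename):
    """Run md5sum on filename through a pipe and return the checksum.

    Assumes md5sum is on the PATH and filename needs no shell quoting.
    """
    fp = os.popen('md5sum ' + filename)
    res = fp.read()
    stat = fp.close()
    if stat is not None:
        raise IOError('md5sum failed on ' + filename)
    return parse_md5sum_output(res)

# Parsing the sample output shown above:
print(parse_md5sum_output('1e0033f0ed0656636de0d75144ba32e0 book.tex\n'))
```

Raising an exception when the status is not None is a design choice, not the only option; returning None for unreadable files would work too.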
In a large collection of MP3 files, there may be more than one copy of the same song, stored in different directories or with different file names. The goal of this exercise is to search for duplicates.
- Write a program that searches a directory and all of its subdirectories, recursively, and returns a list of complete paths for all files with a given suffix (like .mp3). Hint: os.path provides several useful functions for manipulating file and path names.
- To recognize duplicates, you can use md5sum to compute a “checksum” for each file. If two files have the same checksum, they probably have the same contents.
- To double-check, you can compare the contents of the two files directly with a Unix command.
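If you want a starting point, the steps above could be organized roughly as follows. This is only one possible outline, not the official solution: the function names are hypothetical, it uses os.walk for the traversal (a recursive function built on os.listdir and os.path.join would work as well), and the checksum is passed in as a function so you can plug in whatever you write around md5sum.

```python
import os

def find_files(dirname, suffix):
    """Return complete paths of all files under dirname ending in suffix."""
    paths = []
    for root, dirs, files in os.walk(dirname):
        for name in files:
            if name.endswith(suffix):
                paths.append(os.path.join(root, name))
    return paths

def group_by_checksum(paths, checksum):
    """Map each checksum to the list of paths that produced it.

    checksum is any function that takes a path and returns a string.
    """
    groups = {}
    for path in paths:
        groups.setdefault(checksum(path), []).append(path)
    return groups

def duplicates(groups):
    """Keep only the groups that contain more than one path."""
    return [paths for paths in groups.values() if len(paths) > 1]
```

Each list returned by duplicates holds files that share a checksum, which makes them candidates for the double-check in the last step.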
popen is deprecated now, which means we are supposed to stop using it and start using the subprocess module. But for simple cases, I find subprocess more complicated than necessary. So I am going to keep using popen until they take it away.