
What it means to "know Python"

Since Adam Barr replied to my post on his book, I'd like to elaborate a little on what I said.

Adam wrote,

[F]or me, "knowing" Python means you understand how slices work, the difference between a list and a tuple, the syntax for defining a dictionary, that indenting thing you do for blocks, and all that. It's not about knowing that there is a sort() function.

In Python, reinventing sort and split is like a C programmer starting a project by writing his own malloc. It just isn't something you see very often. Similarly, I just don't think you can credibly argue that a C programmer who doesn't know how to use malloc really knows C. At some level, libraries do matter.

On the other hand, I wouldn't claim that you must know all eleventy jillion methods that the Java library exposes in one way or another to say you know Java.

What is the middle ground here?

I think the answer is something along the lines of, "you have to get enough practice actually using the language to be able to write idiomatic code." That's necessarily going to involve picking up some library knowledge along the way.
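
To make the contrast concrete, here is a small sketch of what "reinventing sort and split" looks like next to the idiomatic version. It is an illustration only: the bubble_sort helper and the sample sentence are made up for this example and are not from Adam's book.

```python
# Hypothetical example: bubble_sort and the sample sentence are illustrative only.

def bubble_sort(items):
    """Hand-rolled sort: works, but reinvents what sorted() already does."""
    items = list(items)  # copy so the caller's list is left untouched
    for i in range(len(items)):
        for j in range(len(items) - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items

sentence = "the quick brown fox jumps"
words = sentence.split()       # idiomatic: no hand-written tokenizer
by_hand = bubble_sort(words)   # the reinvented wheel
idiomatic = sorted(words)      # idiomatic: the built-in sort
```

Both approaches give the same result; the difference is that the second is what someone who has actually used the language would reach for first.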

This made me think. What are the most commonly used Python modules? I decided to scan the Python Cookbook's code base and find out. This is a fairly large sample (over 2000 recipes), and it has the further attraction that most of the scripts are reasonably standalone, so they don't import lots of non-standard modules. The downside is that some of the code dates back at least to the very ancient Python 1.5.

Across 2000+ source files and almost 4000 imports of stdlib modules, the frequency counts of imported modules are listed below, with a marker at each cumulative-percentage threshold.

Is this a reasonable list? I obviously think I qualify as knowing Python well enough to blog about it. Of the modules above the 80% line, _winreg, win32con, and win32api are platform-specific; new is deprecated; string isn't officially deprecated but should be; and __future__ isn't really a module per se. I believe I've used all of the rest except xmlrpclib at some point, although my line of comfort-without-docs would be only about the 60% mark. I think anyone who programs professionally will quickly come to know well at least the modules up to the 50% line.

Module            Imports
sys               473
os                302
    -- 24% cumulative --
time              210
re                145
    -- 35% cumulative --
string            140
random            103
threading         66
socket            57
os.path           52
types             50
Tkinter           47
    -- 50% cumulative --
math              43
win32com.client   42
__future__        41
traceback         40
itertools         38
doctest           37
urllib            35
cStringIO         33
struct            32
    -- 60% cumulative --
win32api          31
getopt            29
thread            29
ctypes            28
StringIO          28
inspect           26
win32con          25
copy              25
cPickle           25
operator          24
datetime          23
cgi               22
    -- 70% cumulative --
Queue             22
urllib2           20
md5               20
base64            20
xmlrpclib         19
sets              19
optparse          19
logging           18
weakref           18
shutil            17
unittest          17
pprint            16
urlparse          15
getpass           15
httplib           15
pickle            15
_winreg           14
UserDict          13
signal            13
    -- 80% cumulative --

For those interested, a tarball of the recipes I scanned is here, so you don't need to scrape the Cookbook site yourself. The import scanning code is simple enough:

import os, re, compiler
from collections import defaultdict

# define an AST visitor that only cares about "import" and "from [x import y]" nodes
count_by_module = defaultdict(lambda: 0)
class ImportVisitor:
    def visitImport(self, t):
        for m in t.names:
            if not isinstance(m, basestring):
                m = m[0] # strip off "as" part
            count_by_module[m] += 1
    def visitFrom(self, t):
        count_by_module[t.modname] += 1

# parse
for fname in os.listdir('recipes'):
    try:
        ast = compiler.parseFile('recipes/%s' % fname)
    except SyntaxError:
        continue
    compiler.walk(ast, ImportVisitor())
    print 'parsed ' + fname

# some raw stats, for posterity
counts = count_by_module.items()
total = sum(n for module, n in counts)
print '%d/%d total/unique imports' % (total, len(counts))

# strip out non-stdlib modules
for module in count_by_module.keys():
    try:
        __import__(module)
    except (ImportError, ValueError):
        del count_by_module[module]
        
# post-stripped stats
counts = count_by_module.items()
total = sum(n for module, n in counts)
print '%d/%d total/unique imports in stdlib' % (total, len(counts))
counts.sort(key=lambda (module, n): n)

# results
subtotal = 0
for module, n in reversed(counts):
    subtotal += n
    print '%s\t%d' % (module, n)
    print '%f' % (float(subtotal) / total)
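
The script above is Python 2 and relies on the compiler module, which no longer exists in Python 3. For anyone who wants to repeat the experiment on a newer interpreter, the counting step can be sketched with the standard ast module instead. This is a rough equivalent under my own assumptions, not the code that produced the numbers above: the count_imports name and the sample source string are invented for illustration, and the stdlib-filtering step is left out.

```python
import ast
from collections import Counter

def count_imports(source):
    """Count module names in the import / from-import statements of one source string."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import os, re as regex" yields one alias per module; ignore the "as" part
            for alias in node.names:
                counts[alias.name] += 1
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x"
            if node.module is not None:
                counts[node.module] += 1
    return counts

# Hypothetical sample: the docstring contains a line that merely looks like an import.
sample = '''"""Docstring mentioning
import os_sys_log
which is not a real import."""
import os, re as regex
from collections import defaultdict
'''

result = count_imports(sample)
print(result)
```

Because the source is parsed rather than scanned line by line, the "import os_sys_log" text inside the docstring is a string constant, not an import node, so it is never counted. As in the original script, a file that raises SyntaxError from ast.parse would need to be skipped by the caller.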

Comments

Anonymous said…
It appears that some people like to re-invent wheels or demonstrate their knowledge of algorithms.
Funny I was once asked to implement sort in my language of choice during a job interview. I said I'd use python and my implementation looked something like this "sort()".
The interviewer then revealed that he really wanted me to implement a sorting algorithm. I told him I hadn't done that since first year CS classes but that I'd implement a bubble sort. I honestly couldn't remember quicksort, since I haven't needed it for years, I pushed it out of my brain to make room for stuff I actually use. (I also said, if I really needed to implement a sort in real [working] conditions I would do it completely differently). Ah useless interview questions....
Anonymous said…
very interesting post. i've always wondered what the most popular modules were. now i have a much better idea! thanks.
Sergey Shepelev said…
Why compile when you could use a simple line.lstrip().startswith('import') or a simple regexp?
Jonathan Ellis said…
There is no way to tell with a regexp whether you are inside a multiline string. So doing it right is actually easier than hacking a half-assed parser together. :)
Sergey Shepelev said…
You're afraid of code like

"""Module does useful importing as in
import os_sys_log
and never fails"""

?
James Thiele said…
Clicking on link to tarball gives:
Not Found
The requested URL /group/utahpythonjellis/recipes.tar.bz2 was not found on this server.
Jonathan Ellis said…
Sorry, this article is nearly two years old and utahpython has moved on (it's hosted on google groups now).
