I am trying to add some readability indexes to a script. For a start, ARI (Automated Readability Index) seems like a good choice: it only needs character, word, and sentence counts.
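For reference, the formula boils down to those three counts. Here is a minimal sketch of it as a standalone function (the name and signature are mine, just for illustration), using the same constants that appear in the script below:

def ari(chars, words, sentences):
    # Automated Readability Index from raw character, word, and sentence counts
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43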
At first, using regular expressions seemed like the logical direction, but after digging around, NLTK looks like a good tool to use. So, this is the code I have:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

import nltk.data
from nltk import wordpunct_tokenize

text = '''There are two ways of constructing a software design: One way is to
make it so simple that there are obviously no deficiencies and the other way
is to make it so complicated that there are no obvious deficiencies.'''
# C.A.R. Hoare, The 1980 ACM Turing Award Lecture

# split into words by punctuations
# remove punctuations and all '-' words
RE = re.compile('[0-9a-z-]', re.I)
words = filter(lambda w: RE.search(w) and w.replace('-', ''),
               wordpunct_tokenize(text))
wordc = len(words)
charc = sum(len(w) for w in words)

sent = nltk.data.load('tokenizers/punkt/english.pickle')
sents = sent.tokenize(text)
sentc = len(sents)

print words
print charc, wordc, sentc
print 4.71 * charc / wordc + 0.5 * wordc / sentc - 21.43
['There', 'are', 'two', 'ways', 'of', 'constructing', 'a', 'software', 'design', 'One', 'way', 'is', 'to', 'make', 'it', 'so', 'simple', 'that', 'there', 'are', 'obviously', 'no', 'deficiencies', 'and', 'the', 'other', 'way', 'is', 'to', 'make', 'it', 'so', 'complicated', 'that', 'there', 'are', 'no', 'obvious', 'deficiencies']
173 39 1
18.9630769231
It uses training data to tokenize sentences[1]. If you see an error like:
Traceback (most recent call last):
  File "./test.py", line 13, in <module>
    sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
  File "/usr/lib64/python2.7/site-packages/nltk/data.py", line 594, in load
    resource_val = pickle.load(_open(resource_url))
  File "/usr/lib64/python2.7/site-packages/nltk/data.py", line 673, in _open
    return find(path).open()
  File "/usr/lib64/python2.7/site-packages/nltk/data.py", line 455, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource 'tokenizers/punkt/english.pickle' not found.  Please use the
  NLTK Downloader to obtain the resource:  >>> nltk.download().
  Searched in:
    - '/home/livibetter/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
Just run and type:
python
import nltk
nltk.download()
d punkt
to download the data file.
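Alternatively, the downloader accepts a package identifier directly, so a non-interactive one-liner from the shell should do the same job:

python -c "import nltk; nltk.download('punkt')"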
The result isn't consistent with other calculators you can find on the Internet. In fact, each calculator most likely produces a different value for the index, because of differences in what is treated as a sentence divider and all the other small details of how characters and words are counted.
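To illustrate how much those details matter, here is a quick sketch (just plugging the counts printed above back into the formula) of how the score would shift if a tokenizer decided the quote was two sentences instead of one:

charc, wordc = 173, 39
print(4.71 * charc / wordc + 0.5 * wordc / 1 - 21.43)  # 1 sentence: ~18.96
print(4.71 * charc / wordc + 0.5 * wordc / 2 - 21.43)  # 2 sentences: ~9.21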
There is actually some old contributed code for NLTK which probably has everything you need: all the different indexes and calculation methods. I don't see it in the current code base yet, though; the latest code raises NotImplementedError.
Unfortunately, NLTK doesn't support Python 3, and the script I am writing will be Python 3 only; I don't plan to make it compatible with Python 2.X. There seemed to be a branch for Python 3 support, but it is gone. So, after all this, I might have to fall back to regular expressions. Well, that's not too bad, actually.
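As a rough sketch of what that regex-only fallback might look like in Python 3 (the word and sentence patterns here are naive assumptions of mine, not anything from NLTK, so expect the counts to drift on messier text):

import re

def ari_from_text(text):
    # words: runs of letters/digits, optionally joined by hyphens
    words = re.findall(r'[0-9A-Za-z]+(?:-[0-9A-Za-z]+)*', text)
    # sentences: naive split on ., ! or ? followed by whitespace or end of text
    sentences = [s for s in re.split(r'[.!?]+(?:\s+|$)', text) if s.strip()]
    charc = sum(len(w) for w in words)
    return 4.71 * charc / len(words) + 0.5 * len(words) / len(sentences) - 21.43

print(ari_from_text('There are two ways of constructing a software design: '
                    'One way is to make it so simple that there are obviously '
                    'no deficiencies and the other way is to make it so '
                    'complicated that there are no obvious deficiencies.'))
# ~18.96 for this quote, matching the NLTK-based run above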
[1] http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html is gone.