Bookshrink
Find the Essence


How does the program work?

— The algorithm cleans up the input text so that it can be analyzed.


— Then, it finds the frequency of each word in the cleaned-up text.


— Each word is assigned a score based on a simple TF-IDF analysis.


— A score is then calculated for each sentence, based on the scores of the words within it.


— The sentence scores are then normalized by length, so that longer sentences aren't favored and shorter sentences aren't punished.


— The sentences are sorted by their scores.


— Finally, depending on what type of output is asked for, the program spits out the results. (A rough sketch of the whole pipeline in code appears below.)

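For anyone curious about the details, here is a rough, self-contained Python sketch of the steps above. It is not the actual Bookshrink source (which uses NLTK for tokenization); the regex-based splitting, the summarize() helper, and the particular TF-IDF variant used here (treating each sentence as a "document" when computing the IDF term) are assumptions made for illustration.

import math
import re
from collections import Counter

def summarize(text, n=5):
    """Rank sentences by a simple TF-IDF-style score and return the top n.

    A rough sketch of the pipeline, not the actual Bookshrink code; the
    tokenization and the exact TF-IDF variant here are assumptions.
    """
    # 1. Clean up the input: split into sentences, then into lowercase words.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    # 2. Count how often each word appears in the whole text (term frequency).
    tf = Counter(word for words in tokenized for word in words)

    # 3. TF-IDF-style word scores: overall frequency, discounted by how many
    #    sentences the word shows up in (each sentence plays the role of a
    #    "document" for the IDF term).
    n_sents = len(sentences)
    df = Counter(word for words in tokenized for word in set(words))
    word_score = {w: tf[w] * math.log(n_sents / df[w]) for w in tf}

    # 4. Score each sentence from the scores of the words inside it,
    # 5. ...normalized by sentence length so long sentences aren't favored.
    def sentence_score(words):
        if not words:
            return 0.0
        return sum(word_score.get(w, 0.0) for w in words) / len(words)

    # 6. Sort the sentences by score, and 7. return the top-ranked ones.
    ranked = sorted(zip(sentences, tokenized),
                    key=lambda pair: sentence_score(pair[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:n]]

if __name__ == "__main__":
    sample = ("The quick brown fox jumps over the lazy dog. "
              "Foxes are quick and clever animals. "
              "Dogs, on the other hand, are loyal. "
              "The fox in this story is both quick and lazy.")
    for sentence in summarize(sample, n=2):
        print(sentence)
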
What does this do?

The program tries to pick out the sentences of an input text that are most representative of the text as a whole; that is, it tries to find the essence of the text.


Where can I get texts?

Project Gutenberg is an excellent resource for full books in the public domain.


Try pasting the text of one of its books into the input box above.


Who made this?

Peter Downs


With what?

Python, web.py, NLTK, jQuery, 1140.css, sexybuttons, vim, and Adobe Photoshop.

Why?

I'm interested in computational linguistics. It's interesting to consider what exactly makes a sentence important, and whether it's even possible to find an objective measure of 'meaningfulness'.


Want to learn more?

Check out the code on GitHub!