```
def test_run_gen_rule1():
    """
    If a live cell has less than 2 live neighbours, it dies due to
    loneliness.
    """
    grid = [[1, 0, 0],
            [0, 0, 0],
            [0, 0, 0]]
    new_grid = life.run_generation(grid)
    assert new_grid[0][0] == 0
```

`0` implies a dead cell and `1` a live one.
This is simple enough. It asserts that, given such a grid, the cell in the top left corner should die. The implementation of `run_generation` that made this pass simply set every cell in the new grid to dead.
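The snippet itself is missing from this excerpt, but from the discussion that follows (every cell is unconditionally set to dead), it was presumably a minimal sketch along these lines:

```python
def run_generation(grid):
    # Return a new grid with every cell dead. This is enough to
    # satisfy rule 1: any cell that should die does die.
    return [[0 for _ in row] for row in grid]
```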
So, this makes the test green and we marked it as such, but here's where things get interesting. My claim was that it's not just a broken implementation making the test pass but a complete implementation which perfectly implements rule 1. And that's where the confusion started.
The questions that came up were things like "Where are we checking that it has fewer than 2 live neighbours?", "Won't it kill the cell even if it has 2 neighbours?" etc.
### Logical implications
There's a notation in classical logic called logical implication. The symbol used is a right arrow (→). It's often read as "implies" so `p → q` is "p implies q". It's also sometimes read as "if p, then q". What does this mean? Logical connective operators like `and`, `or` or `→` are often described using a [truth table](https://en.wikipedia.org/wiki/Truth_table). This is a list of all permutations of the inputs and the resulting outputs. So, a truth table for `a and b` would be something like this
|a |b |a and b|
|-----|-----|-------|
|True |True |True |
|True |False|False |
|False|True |False |
|False|False|False |
This tells us that `a and b` is `False` in all cases except when both `a` and `b` are `True`. This is a precise and complete way of saying what `a and b` means. Now, consider the truth table of `→`.
|a |b |a → b|
|-----|-----|-----|
|True |True |True |
|True |False|False|
|False|True |True |
|False|False|True |
This tells us that the only way in which `a → b` can be `False` is if `a` is `True` *and* `b` is `False`.
The expression is equivalent to `b or (not a)`: either `b` should be `True` or `a` should be `False`. This means
- If `b` is True, then we can be sure that the statement is `True` *regardless of the value of `a`*.
- If `a` is False, then we can be sure that the statement is `True` *regardless of the value of `b`*. We say that the statement in this case is [vacuously true](https://en.wikipedia.org/wiki/Vacuous_truth).
If we can guarantee either of these, we can be sure of our implementation.
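A quick sketch in Python makes the equivalence with `b or (not a)` concrete:

```python
def implies(a, b):
    # p → q is logically equivalent to (not p) or q
    return (not a) or b

# Reproduce the truth table for → from above
for a in (True, False):
    for b in (True, False):
        print(a, b, implies(a, b))
```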
I'll stop here lest you accuse me of going all [Smullyan](https://en.wikipedia.org/wiki/Raymond_Smullyan) on you.
### Our implementation
So, if we have a rule that's of the form `a → b`, we just have to make sure this truth table is satisfied. Now let's look at the rule we're trying to implement.
> Any live cell with fewer than two live neighbours dies.
or stated in the form of a logical implication,
> If a live cell has fewer than two live neighbours, it should die.
or better
> "a live cell has fewer than two live neighbours" → "it dies".
Now, we're going to try to implement this. A direct translation of the rule would check that the cell is alive and has fewer than two live neighbours, and then kill it.
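Such a direct translation might look roughly like this (the helper `live_neighbours` is my own invention for illustration):

```python
def live_neighbours(grid, r, c):
    # Count live cells in the 3x3 neighbourhood around (r, c),
    # excluding the cell itself.
    return sum(grid[i][j]
               for i in range(max(0, r - 1), min(len(grid), r + 2))
               for j in range(max(0, c - 1), min(len(grid[i]), c + 2))
               if (i, j) != (r, c))

def run_generation(grid):
    new_grid = [row[:] for row in grid]
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            # "a live cell has fewer than two live neighbours" → "it dies"
            if cell == 1 and live_neighbours(grid, r, c) < 2:
                new_grid[r][c] = 0
    return new_grid
```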
But this is a logical implication `a → b` where `a` is "a live cell has fewer than two live neighbours" and `b` is "it dies".
We can implement `a → b` by just making sure that `b` is `True`. Then, as per the truth table, the statement will be true (i.e. our code will be correct) regardless of the value of `a`.
How do we do that? We simply `kill(cell)`; since we start with every cell set to dead (`0`), `b` is always `True`.
Let's look at this implementation. Can it ever violate the truth table? The only way it can is if we let `b` become `False` while `a` is `True`. But we set everything to dead explicitly, so `b` is never `False` and hence the statement is always `True`.
## Questions
So, let's answer some questions.
- Where are we checking that it has fewer than 2 live neighbours?
*We don't need to. If `b` is `True`, then the statement is `True` regardless of the value of `a`.*
- Won't it kill the cell even if it has 2 neighbours?
*That's a situation where `a` is `False` and `b` is `True`. If `a` is `False`, then the statement is `True`. This is because we care only about the result **if** `a` is `True`. Otherwise, `b` can be anything.*
- You're just ignoring the if part and making this unconditional. You can't do that to every if condition.
*Actually, I can. If this were the only rule, it would be enough but there are other rules too.*
`grep -iw chapter moby-dick.txt | wc -l`

and we get `172`. So we know that it has (roughly) 172 chapters. The `-i` option to grep makes the search case insensitive (we match `Chapter` and `chapter`). The `-w` restricts the pattern to word boundaries. So, we won't match things like `chapters`.
Next, we try to get the number of pages in the book. A typical paperback book, which is the kind I'd get if I bought a paper copy of Moby Dick, has approximately 350 words on a page (35 lines per page and 10 words per line). I know this because I actually counted them in 10 books. We can get the page count by dividing the total word count by 350.
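The exact command is missing from this excerpt; it was presumably something like this (assuming the text is in `moby-dick.txt`):

```shell
expr $(cat moby-dick.txt | wc -w) / 350
```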
[expr](http://unixhelp.ed.ac.uk/CGI/man-cgi?expr) is an underappreciated command-line calculator that you can use in a pipeline. The `$(` and `)` is command substitution: the snippet inside the brackets is run and its output substituted in its place. In this case, we count the words in the file and divide that by 350. The output is `595`. That's around 3 pages per chapter on average.
The next thing we try to get is the length of the sentences. This is useful to approximate the reading grade of the book. The [Flesch-Kincaid](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_test) tests use these (among other things) to calculate the reading level of a book. It's also fair to say that technical books usually keep sentence lengths somewhat low (although code snippets can ruin our estimations). Children's books have shorter sentences. The sentences we usually speak during conversation are about 20 words long. To do this, first we run the book through `tr '\n' ' '`. This changes all newlines to spaces so the whole book fits on a single line. Then we pipe that through `tr '.' '\n'`, which converts it to a single sentence per line. We then count the words per such "line" using `awk '{print NF}'` and pipe that through `sort -n | uniq -c | sort -n`, which gives us a frequency count per sentence length in increasing order. The last few lines will tell us what the lengths of most of the sentences are.
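Put together, the pipeline described above would be something like:

```shell
cat moby-dick.txt | tr '\n' ' ' | tr '.' '\n' | awk '{print NF}' | sort -n | uniq -c | sort -n
```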
The last 20 lines of this give me

    153 27
    158 24
    158 25
    159 12
    162 13
    163 26
    164 20
    166 11
    168 22
    168 23
    173 19
    176 14
    178 21
    179 17
    179 18
    179 8
    186 15
    194 9
    197 16
    230 2

The first column is the number of sentences and the second column the length of the sentence. Summing column one from this gives us `3490`.
The total sentence count is `7385`, so the last 20 lengths account for a little less than half of all the sentences.
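The counting commands themselves are missing from this excerpt; they were presumably along these lines:

```shell
# Sentences covered by the 20 most common lengths
cat moby-dick.txt | tr '\n' ' ' | tr '.' '\n' | awk '{print NF}' | sort -n | uniq -c | sort -n | tail -20 | awk '{sum += $1} END {print sum}'

# Total number of sentences
cat moby-dick.txt | tr '\n' ' ' | tr '.' '\n' | wc -l
```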
Sorting the last 20 by sentence length gives us some more insight into the lengths.
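The exact command is missing from this excerpt; presumably it was the same pipeline with an extra numeric sort on the second column:

```shell
cat moby-dick.txt | tr '\n' ' ' | tr '.' '\n' | awk '{print NF}' | sort -n | uniq -c | sort -n | tail -20 | sort -k2 -n
```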

    230 2
    179 8
    194 9
    166 11
    159 12
    162 13
    176 14
    186 15
    197 16
    179 17
    179 18
    173 19
    164 20
    178 21
    168 22
    168 23
    158 24
    158 25
    163 26
    153 27

So the 20 most common sentence lengths are all 27 words or fewer. That's fairly conversational. However, the maximum sentence length is 394 and there are even two sentences that are 224 words long. This makes it quite unlikely that this is a children's book or a technical book. We can go a step further and drop the `tail -20` to get the full frequency distribution,
and then plot that using `gnuplot` to get something like this
![Sentence lengths](/img/lengths.png)
The next thing we can try to approximate is when the book was written. We can extract all the years in the book by first converting the text into one word per line (changing all "non word" characters into newlines) and then looking for numbers that look like years. This gives us quite a list. Sticking a `wc -l` at the end gives us the number of matches (in our case, 30). We can sum the years and divide by the number of matches to get an average.
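The commands themselves are missing from this excerpt; reconstructions from the description might look like this (the year pattern `1[0-9][0-9][0-9]` is my own guess):

```shell
# All four-digit numbers that look like years
cat moby-dick.txt | tr -cs 'A-Za-z0-9' '\n' | grep -x '1[0-9][0-9][0-9]'

# Their average
cat moby-dick.txt | tr -cs 'A-Za-z0-9' '\n' | grep -x '1[0-9][0-9][0-9]' | awk '{sum += $1} END {print int(sum / NR)}'
```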
This is hairy but actually runs over a two-hundred-thousand-word text file in about 0.03 seconds on my computer. That's much less than the time needed to write a real program to do this. I get `1796`. It's likely that the book was written a little after this date (unless it's futuristic speculative fiction of some kind), so let's say the early 19th century.
So far, we have a non-technical book for older audiences. It's approximately 600 pages spread across 170 chapters, written in the early 19th century. Let's go on.
We can do a frequency analysis of the words. First, lower-case everything and get one word per line.
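The missing snippet was presumably something like:

```shell
cat moby-dick.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n'
```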
Then pipe that through `sort | uniq -c | sort -n | tail -20` to get the most common 20 words in the book. The results are disappointing.

    1382 this
    1515 all
    1593 for
    1627 was
    1690 is
    1692 with
    1720 as
    1737 s
    1805 but
    1876 he
    2114 i
    2495 his
    2496 it
    3045 that
    4077 in
    4539 to
    4636 a
    6325 and
    6469 of
    14175 the

All these words don't give us any information about the content of the book. We can filter for larger words by sticking a `grep .....` before the first sort, looking only at words of 5 or more letters. This gives us

    252 queequeg
    257 stubb
    262 again
    268 after
    280 white
    282 seemed
    292 great
    295 before
    303 those
    308 about
    312 still
    327 captain
    382 though
    394 these
    410 other
    426 would
    604 their
    625 which
    854 there
    1150 whale

in about 1.6 seconds. You can see themes here. *Whale* is obviously something important. *Captain* makes the book either military or nautical. The whale suggests the latter. *Great* and *white* are not significant in themselves but with whale, they give you *great white whale* which is good. The *captain* is important in the story. There are also two words which you can't find in the dictionary - *queequeg* and *stubb*. It's likely that these are characters in the story. By changing the lengths of the words we filter, we get out some more stuff from the text. By using `....`, we get *ahab* and *ship*. Using `......`, we get *whaling*, *pequod* and *starbuck*. We can adjust the lengths like this and we get these words that are not in the dictionary - Ahab, Queequeg, Stubb, Pequod, Starbuck and Tashtego. We get these words that are in the dictionary - ship, white, whale, captain, whaling, harpooneers, harpooneer, leviathan, Nantucket.
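For reference, the whole pipeline with the length filter would be something like this (the filter shown keeps words of 5 letters or more):

```shell
cat moby-dick.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | grep '.....' | sort | uniq -c | sort -n | tail -20
```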
So, we can judge that this is a non-technical book for older audiences. It's approximately 600 pages spread across 170 chapters, written in the early 19th century. It tells a story set on a whaling ship. The captain is an important character, and they are hunting a white whale. It's likely an American story (since Nantucket has a history of whaling). The main characters in the story are Ahab, Queequeg and so on.
Now you can head over to the [Wikipedia page of Moby Dick](https://en.wikipedia.org/wiki/Moby_dick) and see how close we got.
It's possible to squeeze out more information from the text. We can, for example, get bigrams from it with this (try it).
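The bigram snippet is missing from this excerpt; one way to do it with the same tools (the temporary file `words.txt` is my own invention) is to pair the word list with itself shifted by one:

```shell
cat moby-dick.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' > words.txt
tail -n +2 words.txt | paste -d ' ' words.txt - | sort | uniq -c | sort -n | tail -20
```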
I think these tools are not advertised enough in the modern developer community, and that's a loss. I'm planning to put together a [course that teaches these skills](http://nibrahim.net.in/2013/08/03/unix_command_line_course.html) which I'm going to hold in Bangalore later this year. You should sign up for my [trainings list](https://lists.hcoop.net/listinfo/trainings) or follow me on [twitter](https://twitter.com/noufalibrahim) if you want to know when I'm ready.
This course is also going to be part of a larger boot-camp-style program which I'm doing under the Lycӕum banner, so stay tuned.