January 22nd, 2012

Blacking Out Your Website

On January 18, 2012 a number of websites “went dark” to protest the United States House of Representatives Stop Online Piracy Act and Senate Protect IP Act, more commonly known as SOPA and PIPA. Like many things, there are many way to go about “blacking out” one’s website.

Wikipedia used JavaScript and CSS to overlay their message on every page, so users saw the content of the page for a second or so and then it was completely obscured. BoingBoing returned a HTTP 503 “service unavailable” for every URL on their domain. Mozilla returned a 503 for their main URL. Imgur and Craigslist simply changed the content on their main pages and continued to return HTTP 200s. reddit redirected every URL on their domain to their protest message via a HTTP 302 “temporary redirect”. So which, if any of these is “correct”?

In this case, the HTTP 503 is the best option because it conveys two important bits of information:

  1. The server is unable to handle the request
  2. The situation is temporary

The HTTP 302 is dangerous because it conveys two bits of information, one which is correct and the other which is a lie:

  1. The requested resource was found, but under a different URI
  2. The situation is temporary

The difference likely doesn’t matter much to a person using a web browser, as in the worst case it might cache the 302 response. If the redirect changes on the server a user re-requests the URI, the cached response might be returned. At this point if the user thinks the page isn’t valid e.g. seeing a January 18th SOPA protest page on January 19th, it’s easy enough for the user to reload the page and retrieve the new redirect target.

However, a web crawlers like Googlebot, DuckDuckGo Bot etc. will store the contents of the page for consideration in their respective search index and may not visit a given URI very often. So if one were to configure their site to issue a 302 or 200 with alternative content during the time a web crawler visited the URI, it could render one’s page virtually inaccessible via a search as the bot would think the temporary content fetched was valid and not retry again until it is scheduled to visit the URI again. In contrast, returning 503 will likely cause the bot to break out of it’s normal scheduling cycle for a given URL such that it retries it sooner rather than later.

So next time you decide to stage a blackout, remember there are clients other than humans with web browsers accessing your site!

January 9th, 2012

AI Class

Towards the end of last year I signed up for the “basic” track of the Stanford Engineering Introduction To Artificial Intelligence class, mainly out of curiosity, but also because the instructors (Sebastian Thrun and Peter Norvig) and I share the same employer. I intended to follow along at a leisurely pace, unsure if I would be able to commit the time necessary to watch all the lectures and go through the homework problems. However, during the first week I found the homework assignments unavailable to “basic” track participants until after they were due (this was later changed) so I switched to the “advanced” track and made a concerned effort to adhere to the class schedule.

It’s similar in style to the Khan Academy, and it definitely took my understanding of Bayes’ Rule, Hidden Markov Models, etc. to another level. Sure there are tons of books and tutorials out there, but I feel there’s a limit to what one can learn without practice. One form of practice is to create a toy programming project, the main downside of which is never knowing if you’re doing it “right”. Contests ala Netflix Prize are another which at least provides better/worse feedback. The main benefit of online courses though is one can practice solving problems and know whether or not an answer is correct.

I didn’t participate in any of the discussions, though I did find the group website useful for clarification on a few of the problem statements. Even so, there were still a couple of questions throughout the homework and exams which I misinterpreted, leading to an incorrect answer; however, the largest cause of incorrect answers seemed to be my inability to punch numbers into a calculator correctly…I guess I haven’t gotten any better at that since I was last in school :S

An unexpected realization was how much I value flexibility in scheduling. There were definitely a few weeks I had to scramble in order to make it through the videos and homework before the deadline. Funny how the addition of one more deadline per week caused me so much grief. I was interested in the material, but some of it took quite a bit of time to get through (especially once I slowed down after rushing through a couple of homework assignments and doing poorly).

While one could certainly write programs to solve some of the quiz/homework/exam problems, all of them were solvable with pen and paper. The lack of problems requiring writing code was one of the top complaints. They held an AI competition; I didn’t participate as I feared that would take up too much of my time. However, towards the end of the class, the instructors provided an optional quiz which required writing some code: Optional Natural Language Processing Problems. This seemed like a much less time consuming set of questions and I would also get instant feedback.

The first problem was easily solved with brute force. My basic idea was to try every possible rotation and see which one yielded the most words which appear in an English dictionary.

#!/usr/bin/python

import sys
import string

# Problem text, all lowercase, no punctuation.
SENTENCE = 'esp qtcde nzyqpcpynp zy esp ezatn zq lcetqtntlw tyepwwtrpynp hld spwo le olcexzfes nzwwprp ty estd jplc'

# Parse the system's English dictionary into a set.
def WordsToSet():
  word_set = set()
  for w in open('/usr/share/dict/words'):
    word_set.add(w.strip().lower())
  return word_set

def p1(word_set):
  # Setup character position lookup table.
  chars = string.ascii_lowercase
  num_chars = len(chars)
  letter_positions = dict([(c, i) for i, c in enumerate(chars)])

  # Track the rotation that yields the most real words.
  max_word_count = 0
  best_i = 1
  sentences = {}

  # Try each rotation.
  for i in xrange(1, 25):
    temp_word_count = 0
    decoded = []

    # Build up a decoded sentence.
    for char in SENTENCE:
      if char == ' ':
        decoded.append(char)
      else:
        decoded.append(chars[(letter_positions[char.lower()] - i) % num_chars])
    sentences[i] = ''.join(decoded)

    # See how many words in the sentence exist in the word set.
    for word in sentences[i].split(' '):
      if word in word_set:
        temp_word_count += 1
    if temp_word_count > max_word_count:
      best_i = i
      max_word_count = temp_word_count

  # Output our best guess.
  print best_i, sentences[best_i]

def main(argv):
  word_set = WordsToSet()
  p1(word_set)

if __name__ == '__main__':
  main(sys.argv)

The second problem proved to be a bit more difficult. It couldn’t be brute forced since there were 18 “strips” of characters and thus 18! possible combinations (6,402,373,705,728,000 so if one could generate 1 every millisecond, it’d take 200,000+ years and that doesn’t even factor in the evaluation of whether the combination is gibberish). To make life easier, I first got rid of all the punctuation, lowercased all the characters, and converted the “strips” of characters into a Python list of lists. My idea was to pick the first strip, try the remaining 17 to see which one “fit” best before or after strip 0, then try the remaining 16 on the joined strips, etc. For best “fit” I would try to maximize the number of legitimate words which could be composed by the left or right most series of characters (whitespace delimited).

Initially it seemed like counting a partial word match with fewer than 3 characters wouldn’t be useful e.g. “er” appears in tens of thousands of words. That wasn’t good enough to keep my program on track though, as it would get a few strips aligned correcting and then veer off track creating sequences of characters that were found in many words, but ended up being gibberish in this example. So I had to augment the solution to detect when a sequence of characters should appear at the beginning of a word (whitespace before the sequence, but not after) or when a sequence of characters should be matched exactly (whitespace before and after the sequence). In this case, I didn’t have to detect whether a sequence of characters should appear at the end of a word (whitespace after the sequence, but not before), but it would be trivial to add.

Another necessary feature of my model was the ability to short-circuit if more than 1 row of characters would result in zero possible words. At first I had the threshold set at more than 0, but “shannon”, a last name as it turns out, is not in the dictionary. The code below takes about 90 seconds to solve the problem and could easily be optimized to reduce the number of loops or lookups in the dictionary, but I think those changes obscure the heart of the algorithm. Below is the code as well as a copy of the output generated by the program as it solved the problem.

#!/usr/bin/python

import sys
import string

STRIPS = [
  ['de',' ',' f','cl','nf','ed','au',' i','ti',' ','ma','ha','or','nn','ou',' s','on','nd','on'],
  ['ry',' ','is','th','is',' b','eo','as',' ',' ','f ','wh',' o','ic',' t',' ',' ','he','h '],
  ['ab',' ','la','pr','od','ge','ob',' m','an',' ','s ','is','el','ti','ng','il','d ','ua','c '],
  ['he',' ','ea','of','ho',' m',' t','et','ha',' ',' t','od','ds','e ','ki',' c','t ','ng','br'],
  ['wo','m ','to','yo','hi','ve','u ',' t','ob',' ','pr','d ','s ','us',' s','ul','le','ol','e '],
  [' t','ca',' t','wi',' m','d ','th',' a','ma','l ','he',' p','at','ap','it','he','ti','le','er'],
  ['ry','d ','un','th',' ','io','eo','n ','is',' ','bl','f ','pu','co','ic',' o','he','at','mm'],
  ['hi',' ',' ','in',' ',' ',' t',' ',' ',' ',' ','ye',' ','ar',' ','s ',' ',' ',' ']
]

def WordsToSet():
  word_set = set()
  for w in open('/usr/share/dict/words'):
    word_set.add(w.strip().lower())
  return word_set

# Determine whether word is, starts with or contins prospective_word.
def WordMatch(word, prospective_word, exact, startswith):
  if exact and prospective_word == word:
    return 1
  elif startswith and word.startswith(prospective_word):
    return 1
  elif not exact and not startswith and prospective_word in word:
    return 1
  return 0

def ProspectiveWord(i, j, row, prefix):
  # Construct a new result based on the current result row, the ith row and,
  # jth column of STRIPS and whether the new set of characters should be
  # positioned before or after the current result row.
  if prefix:
    prospective_result = STRIPS[i][j] + row
  else:
    prospective_result = row + STRIPS[i][j]

  words = prospective_result.split(' ')
  prospective_word = ''
  exact = False
  startswith = False
  if prefix:
    if words[0] == '':  # There was a leading space.
      prospective_word = words[1]
      if len(words) > 2:  # And a space after prospective_word.
        exact = True
      else:
        startswith = True
    else:
      prospective_word = words[0]
  else:
    if words[-1] == '':  # There was a trailing space.
      prospective_word = words[-2]
      if len(words) > 2:  # And a space before prospective_word.
        exact = True
    else:
      prospective_word = words[-1]
      if len(words) > 1:
        startswith = True
  return prospective_word, exact, startswith

def p2(word_set):
  # Set the first column of STRIPS as the current result.
  result = [row[0] for row in STRIPS]
  # The order of j's, not used to solve the problem, but useful to print at the
  # end. i.e. the order of the STRIPS.
  strip_order = [0]
  # Keep track of the unused j's i.e. columns of STRIPS
  remaining_strips = set([j for j in xrange(1, len(STRIPS[0]))])

  while len(remaining_strips) > 0:
    # Make one pass placing column j of STRIPS before the current result and
    # then another pass placing it after.
    for prefix in [True, False]:
      # Keep track of the best j and highest possible probablistic word count.
      best_j = 0
      best_word_count = 0
      for j in remaining_strips:  # Columns
        word_count = 0
        # Keep track of the number of rows without probable words. Rows which are
        # not checked i.e. without a prospective word of length > 2 or where
        # STRIPS[i][j] is a space (doesn't add any new information) do not count.
        rows_without_words = 0
        for i, row in enumerate(result):  # Rows
          # More than one row without probable words means this column is likely
          # in the wrong position.
          if rows_without_words > 1:
            break
          if STRIPS[i][j] == ' ':  # No new information, continue to next row.
            continue
          prospective_word, exact, startswith = ProspectiveWord(i, j, row, prefix)
          # Too short a word/word fragment, continue to next row.
          if len(prospective_word) < 3:
            continue

          # Total up all the prospective word matches and update the state.
          temp_word_count = 0
          for word in word_set:
            temp_word_count += WordMatch(word, prospective_word, exact, startswith)

          if temp_word_count == 0:
            rows_without_words += 1
          else:
            word_count += temp_word_count

        # More than one row without probable words means this column is likely
        # in the wrong position.
        if rows_without_words > 1:
          continue

        if word_count > best_word_count:
          best_word_count = word_count
          best_j = j

      if best_j != 0:
        if prefix:
          strip_order.insert(0, best_j)
        else:
          strip_order.append(best_j)
        for i in xrange(len(result)):
          if prefix:
            result[i] = STRIPS[i][best_j] + result[i]
          else:
            result[i] = result[i] + STRIPS[i][best_j]
        remaining_strips.remove(best_j)
        print 'strip_order: ', strip_order
        for row in result:
          print row
          print ' '

def main(argv):
  word_set = WordsToSet()
  p2(word_set)

if __name__ == '__main__':
  main(sys.argv)

Here’s what the output looks like so you can see it building up the solution:

strip_order: [6, 0]
aude
eory
obab
the
u wo
th t
eory
thi

strip_order: [6, 0, 15]
aude s
eory
obabil
the c
u woul
th the
eory o
this

strip_order: [3, 6, 0, 15]
claude s
theory
probabil
of the c
you woul
with the
theory o
in this

strip_order: [3, 6, 0, 15, 11]
claude sha
theory wh
probabilis
of the cod
you would
with the p
theory of
in this ye

strip_order: [3, 6, 0, 15, 11, 13]
claude shann
theory whic
probabilisti
of the code
you would us
with the pap
theory of co
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18]
claude shannon
theory which
probabilistic
of the code br
you would use
with the paper
theory of comm
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2]
claude shannon f
theory which is
probabilistic la
of the code brea
you would use to
with the paper t
theory of commun
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14]
claude shannon fou
theory which is t
probabilistic lang
of the code breaki
you would use to s
with the paper tit
theory of communic
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14, 17]
claude shannon found
theory which is the
probabilistic langua
of the code breaking
you would use to sol
with the paper title
theory of communicat
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14, 17, 5]
claude shannon founded
theory which is the b
probabilistic language
of the code breaking m
you would use to solve
with the paper titled
theory of communicatio
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14, 17, 5, 7]
claude shannon founded i
theory which is the bas
probabilistic language m
of the code breaking met
you would use to solve t
with the paper titled a
theory of communication
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14, 17, 5, 7, 4]
claude shannon founded inf
theory which is the basis
probabilistic language mod
of the code breaking metho
you would use to solve thi
with the paper titled a m
theory of communication
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14, 17, 5, 7, 4, 12]
claude shannon founded infor
theory which is the basis o
probabilistic language model
of the code breaking methods
you would use to solve this
with the paper titled a mat
theory of communication pu
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14, 17, 5, 7, 4, 12, 10]
claude shannon founded informa
theory which is the basis of
probabilistic language models
of the code breaking methods t
you would use to solve this pr
with the paper titled a mathe
theory of communication publ
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14, 17, 5, 7, 4, 12, 10, 8]
claude shannon founded informati
theory which is the basis of
probabilistic language models an
of the code breaking methods tha
you would use to solve this prob
with the paper titled a mathema
theory of communication publis
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14, 17, 5, 7, 4, 12, 10, 8, 16]

theory which is the basis of
probabilistic language models and
of the code breaking methods that
you would use to solve this proble
with the paper titled a mathemati
theory of communication publishe
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14, 17, 5, 7, 4, 12, 10, 8, 16, 1]
claude shannon founded information
theory which is the basis of
probabilistic language models and
of the code breaking methods that
you would use to solve this problem
with the paper titled a mathematica
theory of communication published
in this year

strip_order: [3, 6, 0, 15, 11, 13, 18, 2, 14, 17, 5, 7, 4, 12, 10, 8, 16, 1, 9]
claude shannon founded information
theory which is the basis of
probabilistic language models and
of the code breaking methods that
you would use to solve this problem
with the paper titled a mathematical
theory of communication published
in this year

November 2nd, 2011

The Other 99% of Halloween Candy Distributors

Here’s another 1% vs. 99% that I believe has been overlooked in the media; I posit each year 1% of the population gets to give out over 20% of the candy (and 10% of the population gets to give out more than 70% of the candy) on Halloween. In cities across America, there has been a growing trend where parents drive their children to a specific street on which to go trick-or-treating because a number of houses in that area will go all out in terms of decorations and “atmosphere” e.g. building haunted houses, ransacking the local Spirit store for the latest in animatronic ghouls, etc. Houses on these streets will give out hundreds, if not thousands of pieces of candy and the streets will be overrun with kids and parents alike. I do not live on one of those streets and just like last Halloween, our visitation rate this year was abysmal.

6:23p bee musketeer
6:56p butterfly and skeleton
7:11p airplane
7:30p spider girl?
7:35p fireman
7:40p 2 cowboys
7:41p 3 teenagers, no costumes
8:42p 3 teenagers (different ones), no costumes

I left our lit Halloween decorations along the 8ft path to our front door out until 10p and our porch light on until 11p. One can walk to 2 elementary, 1 middle and 1 high school within 0.7mi of my house; I don’t live on a busy street and there is a sidewalk. Many of the kids don’t even like to go to the mega-trick-or-treat zones; my wife works at an afterschool program and they complain it’s too crowded and easy to get separated from their friends; however, the parents feel it’s “the thing to do”. Our neighbors also seem to think the number of trick-or-treaters dropped from last year and I’m sure they have a bunch of excess candy just as we do. There’s a clear market inefficiency.For the first month+ of the Occupy Wall Street protests, I managed to turn a blind eye, blissfully occupying myself with baseball pennant chases, painting close to 1,000 sqft of outdoor fencing, Stanford’s Online Class – Introduction to Artificial Intelligence, etc. One evening, while walking back from the grocery store I expressed interest in seeing what one of these Occupy camps looked like and my wife mentioned if we walked down Martin Luther King Jr. Way (in Berkeley, CA) we’d pass by Occupy Berkeley. There wasn’t much activity, granted it was a Thursday night, just a handful of tents and some signs.

The next day I took a stroll by Occupy San Francisco along the Embarcadero, and it looked as if a number of homeless people decided to hang out together. I have nothing against homeless people, but I don’t think the many of the Occupy San Francisco population were ‘protesters’. Perhaps homelessness is a by-product of the state of affaires the Occupy Wall Street movement wants addressed; but I imagine most of the people I saw at Occupy San Francisco were homeless first and protestors second, if at all. I passed on visiting the Occupy Oakland camp due to the violence. Meanwhile the topic of “the other 99%” kept coming up more and more in conversation, such that I could no longer ignore it.

Based on the IRS Adjusted Gross Income Statistics my yearly income has put me in the top quintile (20%) ever since I graduated from college. This is not a result of me being especially brilliant or even lucky as if I were either of those I would probably have dropped out of college and joined the 1% during the dot-com bubble instead of catching the crash at the end. The two dominant factors are likely that I live in one of the most expensive areas of the country (San Francisco Bay Area) and work in an industry (computer software) which has higher than average salaries. In the last few yea
rs I’ve probably moved up into the top 10% and routinely interact with people who are currently in the top 1%, or have been in the past (you can check yourself for the current year via the Wall Street Journal’s “what percent are you?” calculator).

So in a sense, the current system works for me i.e. I can work within its constraints to do most of what I want to do. Note I do not aspire to have a Gulfstream 500 (I haven’t even ever flown first-class), houses in 5 different countries, or even retire early at this point. However, I realize the system has also failed many people, myself included to a degree as I am reluctantly a landlord due the the mortgage lending situation swinging wildly from giving loans to anyone who can breath to only giving loans to people with nearly perfect credit and a whole bunch of assets. Of course it could be much worse, I could not have a place to live or the ability to feed myself, but at the same time I’m contributing (proportionally) a great deal financially back into the system via taxes, charitable donations, etc. And I’m willing to contribute more because I know that not all work is valued appropriately and infrastructure is important to ev
eryone’s well-being whether they use it directly or not.

It frustrates me the more I think about the Occupy* movement because while I agree with their general sentiment that what has happened with the economy recently sucks and should not have happened, I fail to see how they’re changing anything for the better. I suppose the current system of mass congregation to Halloween trick-or-treat zones works for the people on those streets who enjoy handing out disproportionate amounts of candy, but it doesn’t work for me. Maybe I should have tried occupying some of their front porches trying highlight the inequalities.

Instead, next year I will attempt to increase the demand in my area. I’m thinking about either stepping up to handing out full size (or perhaps king size?) candy bars and/or fancier Ghirardelli chocolate; perhaps word will spread that our street is worth visiting. Another thought I had was to put up neon orange signs, much like is done for a garage sale, advertising the bounty that awaits those willing to broaden their horizons. Perhaps our neighbors would be willing to join us in increasing the Halloween value of our block.

As silly as it sounds within the context of handing out Halloween candy, the general ideas i.e. increasing one’s value to society, making people aware of it and attempting to get others to join your cause seem like the only legitimate ways to change the situation. I wish the Occupy* movement would focus their efforts on something more productive than civil disobedience.