Open Sesame. Sesame is a smart door lock from CandyHouse. It uses Bluetooth Low Energy (BLE) to communicate wirelessly with smartphone apps. We are going to walk through its BLE protocol and write a script to control it. The protocol was reverse engineered from the Android app, so this is not full protocol documentation; I only reversed enough of it to lock and unlock the door.

BLE Services

The device exposes two BLE services that we can discover:

  • Operations: Normal operation service 00001523-1212-EFDE-1523-785FEABCD123
  • DFU: Device Firmware Upgrade service

The DFU service is for firmware upgrades, while the operations service is where all the fun happens: it exposes a few characteristics that we can use to read and write data.

  • Command: 00001524-1212-EFDE-1523-785FEABCD123
  • Status: 00001526-1212-EFDE-1523-785FEABCD123

The packet format

Before we can send anything to the lock, we first need to understand its packet format.

field:  HMAC   macData   md5(userID)   S/N   OP   payload
bytes:  32     6         16            4     1    optional

Where:

  • macData is the manufacturerData you can read from the BLE advertisement packet: it is the Sesame's MAC address with 3 bytes of zeroes prepended, which you will need to strip
  • userID is your email address
  • S/N is a number read from the Status characteristic
  • OP is a number indicating operations: LOCK = 1, UNLOCK = 2
  • payload is not needed for locking and unlocking

The HMAC

HMAC is a standard way to authenticate a message; Sesame uses SHA-256 as the hash function. The HMAC key can be a bit hard to extract. I believe (which means I traced the reversed code, but I have not verified my assumption) it comes from a password that can be retrieved by logging into their XMPP server and asking it for the user profile. However, that value then needs to be decrypted with a key hard-coded in the app. I was too lazy to go through all of this, so I wrote an Xposed module to extract the key from the app: I hooked the SecretKeySpec constructor and waited for it to be initialized with the HMAC password.

Reading the Status

The serial number is an incrementing, rolling-over counter that we need to read from the device and include in the packet. It is located at bytes 6 ~ 10 of the Status characteristic's response, and you need to add one to it before using it. Byte 14 is somewhat interesting as well: it is the error code of the last command. You can find the list of error codes in the example code.

Wrapping it up

Pun intended. Before you can send out the constructed packet, you will need to break it down into a series of packets of at most 20 bytes. The first byte is PT, indicating which part of the series this packet is, followed by up to 19 bytes of the original payload.

  • PT indicates a packet's position in the series: 01 for the first packet, 02 for the middle ones and 04 for the final one

With that ready, you can simply write the wrapped packet to the Command characteristic. It's a write-and-ack endpoint.

Notes & Trivia

  • The reversing was done with apktool, JADX and Android Studio
  • Interestingly, the Sesame app uses the XMPP protocol to talk to its cloud counterpart
  • Early versions of the app contain a lot of Taiwanese internet memes. #TaiwanNo1 #56不能亡

Example Code

Here's an example snippet for unlocking a Sesame. Have fun!

const crypto = require('crypto');
const noble = require('noble');

const userId = 'REDACTED';
const deviceId = 'REDACTED';
const password = 'REDACTED';

const CODE_UNLOCK = 2;
const serviceOperationUuid = '000015231212efde1523785feabcd123';
const characteristicCommandUuid = '000015241212efde1523785feabcd123';
const characteristicStatusUuid = '000015261212efde1523785feabcd123';

console.log('==> waiting on adapter state change');

noble.on('stateChange', (state) => {
  console.log('==> adapter state change', state);
  if (state === 'poweredOn') {
    console.log('==> start scanning', [serviceOperationUuid]);
    noble.startScanning();
  } else {
    noble.stopScanning();
  }
});

noble.on('discover', (peripheral) => {
  if (peripheral.id !== deviceId) {
    console.log('peripheral discovered; id mismatch:', peripheral.id);
  } else {
    noble.stopScanning();
    connect(peripheral);
  }
});

function connect(peripheral) {
  console.log('==> connecting to', peripheral.id);
  peripheral.connect((error) => {
    if (error) {
      console.log('==> Failed to connect:', error);
    } else {
      console.log('==> connected');
      discoverService(peripheral);
    }
  });
}

function discoverService(peripheral) {
  console.log('==> discovering services');
  peripheral.once('servicesDiscover', (services) => {
    const opServices = services.filter((s) => s.uuid === serviceOperationUuid);
    if (opServices.length !== 1) {
      throw new Error('unexpected number of operation services');
    }

    discoverCharacteristic(peripheral, opServices[0]);
  });
  peripheral.discoverServices();
}

function discoverCharacteristic(peripheral, opService) {
  console.log('==> discovering characteristics');
  opService.once('characteristicsDiscover', (characteristics) => {
    const charStatus = characteristics.filter((c) => c.uuid === characteristicStatusUuid);
    const charCmd = characteristics.filter((c) => c.uuid === characteristicCommandUuid);

    if (charStatus.length !== 1 || charCmd.length !== 1) {
      throw new Error('unexpected number of command/status characteristics');
    }

    unlock(peripheral, charStatus[0], charCmd[0]);
  });
  opService.discoverCharacteristics();
}

function unlock(peripheral, charStatus, charCmd) {
  console.log('==> reading serial number');
  charStatus.on('data', (data) => {
    const sn = data.slice(6, 10).readUInt32LE(0) + 1; // serial number at bytes 6~10, plus one
    const err = data.readUInt8(14); // error code of the last command
    const errMsg = [ // codes start at -1 (Timeout), hence the err + 1 below
      "Timeout",
      "Unsupported",
      "Success",
      "Operating",
      "ErrorDeviceMac",
      "ErrorUserId",
      "ErrorNumber",
      "ErrorSignature",
      "ErrorLevel",
      "ErrorPermission",
      "ErrorLength",
      "ErrorUnknownCmd",
      "ErrorBusy",
      "ErrorEncryption",
      "ErrorFormat",
      "ErrorBattery",
      "ErrorNotSend"
    ];
    console.log('status update [sn=' + sn + ', err=' + errMsg[err + 1] + ']');
  });
  charStatus.subscribe();
  charStatus.read((error, data) => {
    if (error) { console.log(error); process.exit(-1); }
    if (data) {
      const macData = peripheral.advertisement.manufacturerData;
      const sn = data.slice(6, 10).readUInt32LE(0) + 1;
      const payload = _sign(CODE_UNLOCK, '', password, macData.slice(3), userId, sn);
      console.log('==> unlocking', sn);
      write(charCmd, payload);
      setTimeout(() => process.exit(0), 500);
    }
  });
}

function _sign(code, payload, password, macData, userId, nonce) {
  const hmac = crypto.createHmac('sha256', Buffer.from(password, 'hex'));
  const hash = crypto.createHash('md5');
  hash.update(userId);
  const buf = Buffer.alloc(payload.length + 59);
  macData.copy(buf, 32); /* len = 6 */
  hash.digest().copy(buf, 38); /* len = 16 */
  buf.writeUInt32LE(nonce, 54); /* len = 4 */
  buf.writeUInt8(code, 58); /* len = 1 */
  Buffer.from(payload).copy(buf, 59);
  hmac.update(buf.slice(32));
  hmac.digest().copy(buf, 0);
  return buf;
}

function write(char, payload) {
  const writes = [];
  for (let i = 0; i < payload.length; i += 19) {
    const sz = Math.min(payload.length - i, 19);
    const buf = Buffer.alloc(sz + 1);
    if (sz < 19) {
      buf.writeUInt8(4, 0); // final part
    } else if (i === 0) {
      buf.writeUInt8(1, 0); // first part
    } else {
      buf.writeUInt8(2, 0); // middle part
    }

    payload.copy(buf, 1, i, i + 19);
    console.log('<== writing:', buf.toString('hex').toUpperCase());
    char.write(buf, false);
  }
}

Zipkin is Twitter's open-source implementation of Google's distributed tracing system, Dapper. It's a great tool for people who want to understand the bottlenecks in their multi-service system. The only downside is that I found its documentation isn't quite clear about the tracing format, so I decided to write a blog post that gives an overview of the system and its communication protocol.

Before we continue, I would suggest taking a glance at the paper. It gives you some background knowledge and the assumptions behind Zipkin. I will try to include the relevant points in this post, but you may find it easier if you read the paper first.

Overview

Zipkin splits the roles of a tracing system into four parts: a collector, a query service, a database and a web interface. Zipkin is a passive tracing system, which means the applications are responsible for sending tracing information to Zipkin. Zipkin itself does not actively listen to traffic on the network, nor does it ping the applications for statistics.

Architecture

{% img /images/posts/zipkin-arch.png 650 "Zipkin Architecture" %}

A typical Zipkin deployment looks like the figure above. The recommended database is Cassandra. The protocol between the applications and the Zipkin collector is Zipkin/Scribe/Thrift (read: Zipkin on Scribe on Thrift). If you want scalability, the Zipkin project recommends setting up a full Scribe environment: you can run multiple copies of the Zipkin collector and configure your server-local Scribe receiver to route Zipkin messages to the cluster for load balancing. For testing or low-workload environments, you can point your application directly at the Zipkin collector, as it supports the Scribe protocol as well.

Tracing

Zipkin treats every user-initiated request as a trace. Each trace contains several spans, and each span corresponds to an RPC call. In each span, you can have several annotations. There are four annotations a span must have in order to construct a full view of an RPC call (in chronological order): cs, sr, ss, cr, in which c stands for the client, s stands for the server, and the second letter stands for send or receive. Please note that these annotations do not all have to be present in one span. You can send Zipkin two spans with the same spanId carrying (cs, cr) and (sr, ss) respectively; this is a useful property, since it lets the client and the server do their logging separately. Each of these annotations also has a timestamp denoting when the event happened and a host denoting on which host it happened.

If you take a look at Zipkin's Thrift definition, you will also see that a span carries a list of binary annotations. These are a special kind of annotation that lets you tag request-specific information onto the trace, for example the HTTP request URI, the SQL query or the HTTP response code.

ID propagation

In the last section, we talked about traces and spans. Each trace is identified by a globally unique traceId. Each span is identified by the traceId it belongs to and an in-trace unique spanId. You may also specify a parentSpanId to represent another RPC call made during the parent span's duration. The spans should form an acyclic tree structure.

Now think about how a server handles a request. Let's say we have an nginx server as the frontend, an application server and a database server. When nginx gets a request, it needs to generate a traceId and two spans. The first span denotes the user requesting nginx; by convention for root spans, it has spanId = traceId and parentSpanId = 0. The second span is generated when nginx initiates the connection to the upstream: it gets a new spanId, its parentSpanId set to the first span's id, and it reuses the same traceId.

nginx then needs to pass the second span's traceId and spanId to the upstream. Fortunately, there is a convention for HTTP: Zipkin uses HTTP headers to carry this information. nginx sets X-B3-TraceId, X-B3-SpanId and X-B3-ParentSpanId for the upstream to pick up, and the same process repeats at each layer, as sketched below. If you're using other protocols, you might need to come up with your own creative solution; for example, you may use an SQL comment to carry the ids in database queries.
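
To make the propagation concrete, here is a minimal sketch of a frontend in Node.js using only the built-in http module. The B3 header names are the real convention; everything else (the upstream host, the ports, the id helper) is a made-up placeholder for illustration, not code from Zipkin or nginx.

const http = require('http');
const crypto = require('crypto');

// B3 ids are 64-bit, usually written as 16 hex characters.
const newId = () => crypto.randomBytes(8).toString('hex');

http.createServer((req, res) => {
  // Reuse the trace if a caller already started one; otherwise start a root span.
  const traceId = req.headers['x-b3-traceid'] || newId();
  const spanId = req.headers['x-b3-spanid'] || traceId; // root span: spanId == traceId
  const childSpanId = newId(); // a fresh span for our call to the upstream

  http.get({
    host: 'upstream.example', // hypothetical application server
    port: 3000,
    path: req.url,
    headers: {
      'X-B3-TraceId': traceId,
      'X-B3-SpanId': childSpanId,
      'X-B3-ParentSpanId': spanId
    }
    // Pipe the body straight back; a real proxy would also copy status and headers.
  }, (upstreamRes) => upstreamRes.pipe(res));
}).listen(8080);

The upstream applies the same logic: it treats the incoming X-B3-SpanId as its own span id and reports against the incoming X-B3-TraceId.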

In Practice

You should have enough knowledge of Zipkin to get started by now. Let's see how these pieces are used in code.

Setup

Before we dig into code, you need to deploy Zipkin first. You can download the Zipkin code and set it up yourself. To make things easier, I packaged Zipkin into Docker images, enabling one-liner deployment. Check out docker-zipkin if you're interested.

Communications

We've talked about how Zipkin processes traces and spans. Here we will use an example to show what gets transferred to the collector under the hood. We will reuse the earlier example: an nginx frontend and an application server. Note that you will not see any Trace object below, since a trace is a virtual entity that exists only as the traceId in spans. In the following examples, we use JSON to denote objects since it's easier to write. Also, in real-world Zipkin communication, spans are encapsulated in Scribe messages. They look like this:

{ category: "Zipkin", message: <Span-serialized-as-Binary-Thrift-Struct> }

When a user's request hits nginx, nginx sends a span to the collector.

{
  traceId: 1,      // randomly generated globally unique ID
  spanId:  1,      // root span shares spanId with traceId
  parentSpanId: 0, // root span does not have a parent
  name: "GET",     // RPC method name
  annotations: [
    {
      timestamp: "10", // a UNIX timestamp in **milliseconds**
      value: "sr",
      host: {
        ipv4: 0xC0A80101, // IP address, but as an Integer
        port: 80,
        service_name: "nginx"
      }
    }
  ],
  binaryAnnotations: [ // optional; useful for storing metadata
    {
      key: "http.uri",
      value: "/book/1990", // would be store as byte[]
      annotationType: "String",
      host: {
        ipv4: 0xC0A80101,
        port: 80,
        service_name: "nginx"
      }
    }
  ]
}

nginx would then figure out that it needs to contact the upstream application server to serve the content. Before it initiates the connection, it sends another span to the collector.

{
  traceId: 1,       // all spans in this request share the same traceId
  spanId:  2,       // note that a new ID is being used
  parentSpanId: 1,  // the user <-> nginx span is now our parent
  name: "GET Book", // RPC method name
  annotations: [
    {
      timestamp: "12",
      value: "cs",
      host: {
        ipv4: 0xC0A80101,
        port: 80,
        service_name: "nginx"
      }
    }
  ],
  binaryAnnotations: []
}

The application server receives the request and, just like nginx, sends a server receive span to the collector.

{
  traceId: 1,
  spanId:  2,
  parentSpanId: 1,
  name: "GET Book",
  annotations: [
    {
      timestamp: "14",
      value: "sr",
      host: {
        ipv4: 0xC0A80102,
        port: 3000,
        service_name: "thin"
      }
    }
  ],
  binaryAnnotations: []
}

After the request has been processed, the application server sends a server send to the collector.

{
  traceId: 1,
  spanId:  2,
  parentSpanId: 1,
  name: "GET Book",
  annotations: [
    {
      timestamp: "18",
      value: "ss",
      host: {
        ipv4: 0xC0A80102,
        port: 3000,
        service_name: "thin"
      }
    }
  ],
  binaryAnnotations: []
}

nginx now receives the response from the upstream and sends a cr to the collector. It also sends an ss on the root span before it proxies the response back to the user.

// client receive from upstream
{
  traceId: 1,
  spanId:  2,
  parentSpanId: 1,
  name: "GET Book",
  annotations: [
    {
      timestamp: "20",
      value: "cr",
      host: {
        ipv4: 0xC0A80101,
        port: 80,
        service_name: "nginx"
      }
    }
  ],
  binaryAnnotations: []
}

// server send to the user
{
  traceId: 1,
  spanId:  1,
  parentSpanId: 0,
  name: "/book/1990",
  annotations: [
    {
      timestamp: "21",
      value: "ss",
      host: {
        ipv4: 0xC0A80101,
        port: 80,
        service_name: "nginx"
      }
    }
  ],
  binaryAnnotations: [
    {
      key: "http.responseCode",
      value: "200",
      annotationType: "int16",
      host: {
        ipv4: 0xC0A80101,
        port: 80,
        service_name: "nginx"
      }
    }
  ]
}

Send trace to Zipkin

Scala

Let's talk about Zipkin's native language, Scala, first. The Zipkin project publishes a client library based on Scrooge and Finagle. To use the library, you will need the following dependencies (shown in Gradle script format).

repositories {
  mavenCentral()
  maven { url "http://maven.twttr.com/" }
}

dependencies {
  compile 'org.apache.thrift:libthrift:0.9.1'
  compile 'com.twitter:finagle-core_2.10:6.12.1'
  compile 'com.twitter:zipkin-scrooge:1.0.0'
  compile 'com.twitter:util-core_2.10:6.12.1'
  compile 'com.twitter:util-app_2.10:6.12.1'
}

For a code example, Twitter already has a great one on GitHub. Please check out zipkin-test.

Java

For Java, I would not recommend using the Finagle Java support just yet (or maybe I'm too dumb to figure it out :( ). Fortunately, there is a Zipkin client implementation in Java called Brave. The dependencies you're looking for are listed below.

repositories {
  mavenCentral()
  maven { url "http://maven.twttr.com/" }
}

dependencies {
  compile 'com.github.kristofa:brave-impl:2.1.1'
  compile 'com.github.kristofa:brave-zipkin-spancollector:2.1.1'
}

Brave provides an awesome ZipkinSpanCollector class which automagically handles queueing and threading for you.

Conclusion

Phew, we can finally conclude this long blog post. These are basically the places where I got lost when I tried to understand Zipkin and to extend other services, such as nginx and MySQL, to report traces back to it. I hope these notes help you get hands-on with Zipkin faster. Zipkin has more features than we covered here, so please take a look at the doc directory too. Have fun!

Gradle is a new build system that Android is currently promoting. It can be used to build Android projects by adding the Gradle Android plugin. It is also possible that Android will move to a Gradle-only build system, ditching the old Ant way of building things. Android Studio only supports Gradle projects.

Quickstart

To join the new family, we need to write a new build.gradle file in the project root. A general example of build.gradle looks like the snippet below. Please note that the system is still in active development, so version numbers could change very soon. I'm using Gradle 1.7, plugin 0.5.6 and Android Build Tools 17, and my target SDK version is 15.

buildscript {
  repositories {
    mavenCentral()
  }

  dependencies {
    classpath 'com.android.tools.build:gradle:0.5.6'
  }
}

apply plugin: 'android'

android {
  compileSdkVersion 15
  buildToolsVersion "17"

  sourceSets {
    main {
      manifest.srcFile 'AndroidManifest.xml'
      java.srcDirs = ['src']
      res.srcDirs = ['res']
    }
  }
}

This simple Gradle file should be able to build most standard Android projects. But what if you have special needs? Below, I'm going to talk about how to add support for AndroidAnnotations and token replacement.

Android Annotation

AndroidAnnotations is a set of very useful annotations that you can use to make your code shorter and easier to read. The framework uses an annotation processor to scan your source files and generate helper classes. To integrate it, we need to do two things:

  • Add processor flags to Java compiler
  • Add androidannotation-api dependency

Manipulating Compile Task

To add extra flags to the compiler, we need to use the interface the Android Gradle plugin exposes for accessing the relevant tasks. The plugin organizes tasks into different “variants”; the basic variants are debug and release. To iterate through all variants, we can use android.applicationVariants.all.

// Code based on https://github.com/excilys/androidannotations/issues/676
def getSourceSetName(variant) {
  return new File(variant.dirName).getName();
}

android.applicationVariants.all { variant ->

  def aptOutputDir = project.file("build/source/apt")
  def aptOutput = new File(aptOutputDir, variant.dirName)

  android.sourceSets[getSourceSetName(variant)].java.srcDirs += aptOutput.getPath()

  variant.javaCompile.options.compilerArgs += [
    '-processorpath', configurations.apt.getAsPath(),
    '-s', aptOutput
  ]

  variant.javaCompile.source = variant.javaCompile.source.filter { p ->
    return !p.getPath().startsWith(aptOutputDir.getPath())
  }

  variant.javaCompile.doFirst {
    aptOutput.mkdirs()
  }

}

The configurations.apt is still missing at this point. We use this configuration to hold the path to the AndroidAnnotations processor, and we let Gradle's dependency management resolve the path for us.

configurations {
  apt
}

dependencies {
  apt files('compile-libs/androidannotations-2.7.1.jar')
}

Add dependencies

For managing dependencies, Gradle uses a dependencies block. The following snippet shows different ways to include dependencies: fileTree can be used to collect files in a directory, a plain string is resolved using repositories (Maven Central by default), and project is used to reference dependent local projects. We will talk about multi-project build support later.

dependencies {
  compile fileTree(dir: 'libs', include: '*.jar', exclude: 'android-support-v4.jar')
  compile 'com.android.support:support-v4:18.0.+'
  compile project(':extra:actionbarsherlock')
  compile project(':extra:ViewPagerIndicator')
}

Say you put the AndroidAnnotations API jar into the libs directory; the build system should pick it up now.

Token Replacement

Sometimes we would like a way to reference the build version or git revision from Java code. To accomplish this, we collect that information and put it into a Java source file that gets picked up later during compilation.

Let's say you have an AppBuild.java that looks like the snippet below, and we put it in the compile-libs directory.

package idv.Zero.example;

public class AppBuild {
  public static final String GIT_REV = "@git-rev@";
}

Note that @git-rev@ is the token we will replace with the current git revision each time we build the project. Now that we have the template, we need to wire it into the build system. The way I do this is by adding a generation task per variant and making the corresponding compile task depend on it.

import org.apache.tools.ant.filters.ReplaceTokens

android.applicationVariants.all { variant ->
  def taskName = "generateAppBuild${variant.dirName.capitalize()}"

  task (taskName, type: Copy) {
    def todir = "build/source/AppBuild/${variant.dirName}/cc/hypo/pieceroids"
    new File(todir).mkdirs()

    from project.file('compile-libs/AppBuild.java')
    into todir
    outputs.upToDateWhen { false }

    def proc = "git rev-parse --verify HEAD".execute()
    proc.waitFor()

    filter(ReplaceTokens, tokens:['git-rev': proc.in.text.trim()])
  }
  tasks["compile${variant.dirName.capitalize()}"].dependsOn(taskName)
}

What this code snippet does is register a Copy task named generateAppBuild${variant}. Note that the outputs.upToDateWhen line causes the task to always run; otherwise it would be skipped whenever the source file has not changed, which is not what we want. You'll also need to import the ReplaceTokens filter from our good old friend, Ant.

Multi-Projects

It is very common to have sub-projects that need to be built together. Previously, I showed that you can put compile project(':path:to:project') in the dependencies block to include one at compile time. However, you need some extra setup to make it work: an extra file called settings.gradle with these lines.

include ":extra:actionbarsherlock"
include ":extra:ViewPagerIndicator"

This declares two sub-projects, located at ${project.root}/extra/actionbarsherlock and ${project.root}/extra/ViewPagerIndicator. Gradle will look for a build.gradle inside those directories and set up the project dependencies automatically.

Conclusion

In this article, I showed how to migrate your Android project to the Gradle build system. Additionally, I showed a way to integrate AndroidAnnotations and build-information generation. I hope this is enough to get you started with Gradle. Here is the final build.gradle you should have after all these extra pieces.

import org.apache.tools.ant.filters.ReplaceTokens

buildscript {
  repositories {
    mavenCentral()
  }

  dependencies {
    classpath 'com.android.tools.build:gradle:0.5.6'
  }
}

apply plugin: 'android'

configurations {
  apt
}

dependencies {
  compile fileTree(dir: 'libs', include: '*.jar', exclude: 'android-support-v4.jar')
  compile 'com.android.support:support-v4:18.0.+'
  compile project(':extra:actionbarsherlock')
  compile project(':extra:ViewPagerIndicator')

  apt files('compile-libs/androidannotations-2.7.1.jar')
}

def getSourceSetName(variant) {
    return new File(variant.dirName).getName();
}

android.applicationVariants.all { variant ->
  def aptOutputDir = project.file("build/source/apt")
  def aptOutput = new File(aptOutputDir, variant.dirName)

  android.sourceSets[getSourceSetName(variant)].java.srcDirs += aptOutput.getPath()

  variant.javaCompile.options.compilerArgs += [
    '-processorpath', configurations.apt.getAsPath(),
    '-s', aptOutput
  ]

  variant.javaCompile.source = variant.javaCompile.source.filter { p ->
    return !p.getPath().startsWith(aptOutputDir.getPath())
  }

  variant.javaCompile.doFirst {
    aptOutput.mkdirs()
  }

  android.sourceSets[getSourceSetName(variant)].java.srcDirs += "build/source/AppBuild/${variant.dirName}"

  def taskName = "generateAppBuild${variant.dirName.capitalize()}"
  task (taskName, type: Copy) {
    def todir = "build/source/AppBuild/${variant.dirName}/cc/hypo/pieceroids"
    new File(todir).mkdirs()

    from project.file('compile-libs/AppBuild.java')
    into todir
    outputs.upToDateWhen { false }

    def proc = "git rev-parse --verify HEAD".execute()
    proc.waitFor()

    filter(ReplaceTokens, tokens:['git-rev': proc.in.text.trim()])
  }
  tasks["compile${variant.dirName.capitalize()}"].dependsOn(taskName)
}

android {
  compileSdkVersion 15
  buildToolsVersion "17"

  sourceSets {
    main {
      manifest.srcFile 'AndroidManifest.xml'
      java.srcDirs = ['src']
      res.srcDirs = ['res']
    }
  }
}

I was asked to write a hangman AI as a challenge last week. Since I was asked not to leak the details, I will not post my solution here. However, hangman itself is a well-known problem, and I would like to share some of the ideas I used to build my solver.

Preface

There is a hangman page that provides detailed information about the game. However, I would like to describe it in a shorter, more programmer-friendly way.

Input

You will receive a serialized object containing the following information:

  • state: The state is a string that reveals your correctly guessed characters. For the unrevealed characters, a _ is placed in their positions. For example, if the answer is “apple” and you guessed p, the state string will be _pp__. Please note that the state can contain multiple words, i.e., a sentence.
  • remaining_guesses: How many times you can make a wrong guess before the game is over.
  • status: One of {WIN, ONGOING, LOSE}. The game starts as ONGOING. If you reveal all characters before using up your guesses, you WIN the game. If you make too many wrong guesses, you LOSE the game.

Output

For each game, you will start with a fully unrevealed state. You must keep giving out a character as a guess until the status is WIN or LOSE.

Thoughts

Before we talk about anything else, I need to clarify one thing: I am not familiar with NLP (Natural Language Processing) techniques. I might use some dumb ways to solve this problem, but, well, it's at least where my intuition led me.

Initial Guess

If the state is entirely unrevealed, I make an initial guess. The initial guess uses the vowels, the famous a-e-i-o-u. However, I use a little variation here: instead of guessing the vowels in that order, I sort them by the frequency of each character in the dictionary, so I guess e-i-o-u-a. The initial guessing stops as soon as a character hits; then we go into word-attack mode. A sketch of deriving that order is shown below.
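
The order can be derived from the dictionary itself. A toy sketch in JavaScript (the words array is a stand-in for a real dictionary, not my actual word list):

// Order the vowels by how often they appear in the dictionary.
const words = ['apple', 'orange', 'melon']; // stand-in dictionary
const counts = {};
for (const w of words)
  for (const c of w)
    if ('aeiou'.includes(c)) counts[c] = (counts[c] || 0) + 1;

const vowelOrder = Object.keys(counts).sort((a, b) => counts[b] - counts[a]);
console.log(vowelOrder); // the order depends on the dictionary; mine gave e-i-o-u-a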

Word Attack

Since I don't have any knowledge of the language model, I launch a word-based attack. In my implementation, I only attack one word at a time. It is very important to choose the right word to attack: if we pick the wrong word to guess, the probability of producing a correct guess decreases. First, I assume that, in most cases, the word under attack is included in my dictionary. If we don't make this assumption, it is hard to optimize anything, since we would have to assume we have no knowledge of the context.

Which Word?

The best word to attack, in my opinion, is the word with the most characters solved and the longest possible length. To decide, the formula I use is:

$$ f_{cost}(w_i) = \frac{f_{unsolved}(w_i)}{f_{strlen}(w_i)^2} $$

The best word to attack has the smallest cost under this formula. It first calculates the fraction of unsolved characters in the word, so the word with the largest portion of its characters revealed gets the smallest cost. The squared length in the denominator then favors long words. To see why length matters, say we have the word a_: it is 50% solved, and finishing the second character takes only one move. That sounds good, but it does not generate any side effects. Say instead we have co__ec__e__ and you already know the full word is correctness: you can now confidently send the answers [r, t, n, s], and you may reveal characters in other words in the process, increasing the overall probability of making right guesses.
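
Here is the cost function transcribed into JavaScript (a sketch; the state strings below are hypothetical):

// Cost of attacking a word: fraction of unsolved characters, divided by
// the squared word length. Lower cost = better target; '_' marks unsolved.
function cost(state) {
  const unsolved = [...state].filter((c) => c === '_').length / state.length;
  return unsolved / (state.length * state.length);
}

// Pick the word state with the smallest cost to attack first.
const pickTarget = (states) =>
  states.slice().sort((a, b) => cost(a) - cost(b))[0];

console.log(pickTarget(['a_', 'co__ec__e__'])); // 'co__ec__e__'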

The n-grams

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. – http://en.wikipedia.org/wiki/N-gram

Since I'm not an NLP guy, I will just quote the definition of n-gram from Wikipedia. In my solver, I use two types of n-grams: a bigram table and a unigram table. Before calculating them, I filter the dictionary down to a smaller version by eliminating words using the current state and the already guessed characters. For example, if the state of the word I'm attacking is a__le and I haven't guessed [p, z], I generate the regular expression a[pz][pz]le and apply it to my dictionary. Then I calculate the n-gram tables in this reduced context, as sketched below.
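
A sketch of the filtering step in JavaScript (the dictionary is a toy one; azzle is not a real word and is only there to exercise the regex):

// Reduce the dictionary using the current word state and the letters that
// have not been guessed yet. 'a__le' with unguessed [p, z] becomes /^a[pz][pz]le$/.
function reduceDictionary(dict, state, unguessed) {
  const cls = '[' + unguessed.join('') + ']';
  const re = new RegExp('^' + state.replace(/_/g, cls) + '$');
  return dict.filter((w) => re.test(w));
}

console.log(reduceDictionary(['apple', 'ankle', 'azzle'], 'a__le', ['p', 'z']));
// [ 'apple', 'azzle' ]  ('ankle' is out: n and k are not candidate letters)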

Unigram table

My usage of the unigram table is very simple: I merely calculate the frequency at which each character shows up in my dictionary.

Bigram table

The bigram table I build is based on the connection between two consecutive characters. I split each word into bigrams using the following ruby code:

def bigram_split(str)
  "^#{str}$".split('').each_cons(2).to_a
end

>> bigram_split("hello")
=> [["^", "h"], ["h", "e"], ["e", "l"], ["l", "l"], ["l", "o"], ["o", "$"]]

I collect these bigrams and calculate each one's frequency across the whole bigram space of the current dictionary, which gives the table used by the attack below.
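
For completeness, here is the table-building step sketched in JavaScript rather than ruby, with bigrams stored as two-character string keys (a representation I chose for the sketch, not necessarily the one in my solver):

// Build a bigram frequency table with '^'/'$' as word boundaries,
// mirroring the ruby bigram_split above.
function bigramTable(dict) {
  const table = {};
  for (const word of dict) {
    const chars = ('^' + word + '$').split('');
    for (let i = 0; i < chars.length - 1; i++) {
      const pair = chars[i] + chars[i + 1];
      table[pair] = (table[pair] || 0) + 1;
    }
  }
  return table;
}

console.log(bigramTable(['hello'])); // { '^h': 1, he: 1, el: 1, ll: 1, lo: 1, 'o$': 1 }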

The Attack

Now that everything we need for the attack is prepared, we can start to work on a guess. The first step is to acquire bigrams from the word state. The only bigrams we need are those with exactly one of the two characters unsolved: in a more direct way, what we need is [*, '_'] or ['_', *] but not ['_', '_']. For those partially unsolved bigrams, I use the bigram table I just calculated to get a count of how frequently a character shows up after or before the known character. I then sum the counts over all bigrams in the word state and pick the character with the highest total as the guess. It's that simple; a sketch follows.
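
A sketch of the scoring step, consuming the bigramTable above (guessed is the list of characters already tried; all the names here are mine, not from the original solver):

// Score candidate characters for a word state using the bigram table and
// return the highest-scoring one. Only bigrams with exactly one '_' count.
function bestGuess(state, bigrams, guessed) {
  const chars = ('^' + state + '$').split('');
  const scores = {};
  for (let i = 0; i < chars.length - 1; i++) {
    const a = chars[i], b = chars[i + 1];
    if ((a === '_') === (b === '_')) continue; // need exactly one unsolved side
    for (const [pair, n] of Object.entries(bigrams)) {
      const knownSideMatches = a === '_' ? pair[1] === b : pair[0] === a;
      const c = a === '_' ? pair[0] : pair[1]; // the candidate character
      if (knownSideMatches && c !== '^' && c !== '$' && !guessed.includes(c))
        scores[c] = (scores[c] || 0) + n;
    }
  }
  return Object.keys(scores).sort((x, y) => scores[y] - scores[x])[0];
}

// e.g. bestGuess('a__le', bigramTable(reducedDict), ['a', 'l', 'e'])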

Falling back

Well, sometimes you just don't have any clue. In that case, my program first falls back to attacking the second-best word, and so on. If all words have been tried and we still have no clue, I just fire a random guess using the unigram table. It really is a wild guess, though; the success rate of guesses made with the unigram table is not high.

Future Work

There are many ideas that could improve my solution. For example, implementing word-level bigrams to further weight the probabilities of character bigrams could be useful, but it would require more research into the NLP field. I think it is a really fun challenge. If you have some spare time, try it!

Applying for a PhD degree can be troublesome. There are agencies out there that help people deal with the process, but is it really necessary to pay lots of money for it? My answer is simply no. All you need is the right set of tools to track the progress. I will share my experience applying for both the PhD and the F-1 visa (the latter part only applies to students in Taiwan; your experience may vary). I started working on my application very late, which resulted in a rushed application process. I hope that by sharing my experience, you can apply for PhD programs more easily. :)

What do you need?

Before you begin, there are some prerequisites you need to take care of. Here is a short list:

  • The TOEFL examination
  • The GRE examination
  • Résumé
  • Statement of Purpose (and/or Personal Statement)
  • Financial support document
  • Transcripts for your BS and MS degree

Please note that this may not be a complete list for every school. Some schools may ask for additional documents; for example, some may require you to take the GRE Subject test as well. If you want to apply for the Fall semester, the deadline is usually in December of the previous year: for Fall 2012, the application deadline would be around December 2011 to January 2012. I would suggest taking care of all the examinations before November and leaving a whole month for getting your application ready, as you will need to wait for the delivery of every document and letter of recommendation (LoR).

TOEFL/GRE

You usually need to have your examination scores delivered to the school by the application deadline. Some schools may extend it by one to two weeks for test scores. Since ETS can sometimes be very unproductive, I would suggest finishing the tests before November. Oh, and please take note of the TOEFL/GRE institution codes of your target schools: you can request a few free score reports on the spot after taking the tests to save money. :P

I don't really have tips for preparing, as I wasn't very well prepared myself. However, there are some insights I can share. If you go to PTT, Taiwan's largest electronic bulletin board system, popular among university students, and read the test-taking experiences shared on the TOEFL/GRE boards, you will see many people complaining about the noise during the speaking section. Here is one thing I can tell you: do your tests quickly. If you're quick enough, you can be the first one to answer the speaking questions, before everyone else, and then relax and do the rest of your test as well as you can.

If you have friends who live in the States, talk with them for practice and you will have no problem with the TOEFL. The GRE, however, is a lot different. Though the verbal section of the GRE may be a pain in the ass, the quantitative section is usually easy for Taiwanese students; we all have that miserable experience of preparing for the college entrance exam, right? You will definitely need some time to memorize advanced vocabulary for the GRE. Another thing you may want to practice is the AWA. The AWA is like writing a small thesis on a given topic. There is no right or wrong; however, you will need to express your opinion systematically and defend your perspective with solid reasons. My tips? Write, write, and read. Find some time to write things down in English, preferably long blog posts or diaries. Read articles from well-known journals or magazines. Once you get used to the American way of thinking, you should have no problem with the AWA.

One last thing to do before taking the tests: spend a day reading some example test questions. The objective is not to memorize all the questions; what you need is to get used to the different kinds of questions and know how to rush to an answer in no time.

Résumé

I think you should have your résumé prepared at all times. Trust me, you will not want to write down your whole life in a day; it would be an error-prone and boring process. I would suggest writing in Markdown syntax (or any plain-text formatting syntax you prefer). Writing things down in plain text helps you organize them more quickly, and if you use Markdown, you can easily convert it into a web page and put it online for display. I would also suggest doing a separate print version using LaTeX, as LaTeX-based templates give a professional look and are far easier to maintain than those made using Word, Pages or even InDesign.

You can find my résumé here. Make sure you include these sections: basic personal information, education history, work experience, publications and some projects you have been involved in. You may want to include as much as possible to show your strengths, but do not make it more than two pages. Professors and your future employers simply do not have time to read a long résumé. Keep it clean and simple, and highlight the most important things in your life! (Update: if you're preparing a résumé for a job application, make it one page.)

SoP and PS

SoP (Statement of Purpose) and PS (Personal Statement) were quite confusing to me. The purpose of the two documents is essentially the same: presenting your background and personality. I distinguish the two by applying a more personal touch to the personal statement: you may want to use one or two paragraphs to describe a personal experience that led you to your current study, while focusing your SoP solely on what you want to do in the future. I think it should fit in a page or two; I doubt professors have much time to read, so make it simple and clear.

Financial Document

This should be the easiest part. You just deposit money into an account and apply for a “Certificate of Deposit” (存款證明 in Chinese). For the amount, please ask your school for details; an average number would be USD $20,000~30,000. Some schools may not request it until they are preparing your I-20. If that is the case and you have a scholarship or assistantship offer, you can ask your department or graduate office for the exact amount you will need to prepare.

Some Tips

During the process, I would suggest using spreadsheet software to manage the application progress across all the different schools; Google Docs would be a good choice. Make sure you print (as PDF, to save paper) and archive everything, and track every letter of recommendation, every TOEFL/GRE score delivery and every result. If you do this, you will always have a dashboard for all those different applications. Likewise, since you will need to log into many systems during the application, I would suggest using a password manager like 1Password or KeePass to track every password you use, because every system has its own password requirements. It helps a lot.

Conclusions

You should have everything you need to know by now. For the things I didn't mention above, check your desired school's website; you might find help on the internet as well. Try the StudyAbroad board on PTT if you're a Taiwanese student. I hope this experience can be of help, and I hope you will be admitted by your favorite school. ;)