
Artem Yankov

Mindmapping AIMA

“Artificial Intelligence: A Modern Approach” (commonly abbreviated as AIMA) is a very famous introductory book on AI written by Stuart Russell and Peter Norvig. The problem with such books, though, is that they are big and filled with concepts you may have never heard of before. And unless you have an exceptional memory, you will start forgetting the details pretty soon after you read it for the first time. The same goes for any new knowledge or skill. Practicing helps a lot to keep new stuff in working memory, but another thing that helps new knowledge settle is giving the information a good structure. A good technique for that is mindmapping. The idea is to identify the key concepts and terms, determine the relationships between them, and plot this on a graph. It is easier to keep such an image in memory, even if only the most important parts of it. Or you can just open it up to refresh your memory and figure out which parts may need a review.

So here is my AIMA mindmap. The full-sized version is here.



Monty Hall Problem

There’s one nice old mind-bending puzzle that I really like: the Monty Hall Problem. It is amazing because of how counterintuitive the answer is. You won’t believe it’s true until you triple-check it and read the proof, and even after that you will likely keep thinking that there should be a catch somewhere. So here is the problem as quoted from Wikipedia:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

So is it better to switch, to stick with your first choice, or does it not matter? If you are like me and the majority of other people, the most intuitive answer will be: switching doors won’t make a difference, and the chance of winning stays the same. Right?

The truth is that switching gives you a 2/3 chance of winning, while sticking with your choice gives only 1/3. Why? When you pick one of 3 doors, your chance of getting the car is 1/3, while your chance of getting a goat is 2/3. The chance that the car is behind one of the 2 doors you didn’t choose is therefore also 2/3. When one of those 2 doors is opened for you and you know there’s a goat behind it, the chance that the car is behind the remaining closed door is still 2/3. Still doesn’t sound right? Let’s run a programming experiment and generate 1000 choices for every scenario.

import pandas as pd
import matplotlib.pyplot as plt
from random import randrange

def boxes():
    # three doors: a car behind one, goats behind the other two
    b = ["goat"] * 3
    b[randrange(0, 3)] = "car"
    return b

def gen_choices(num, switch=True):
    results = []
    for _ in range(num):
        b = boxes()
        choice1 = randrange(0, 3)
        # the two doors the player didn't pick
        c = [i for i in range(3) if i != choice1]

        # the host opens one of them, always revealing a goat, so
        # switching lands on whichever of the two stays closed
        choice2 = c[0] if b[c[0]] == "car" else c[1]
        prize = b[choice2] if switch else b[choice1]

        results.append(prize)

    return results


fig, axes = plt.subplots(nrows=1, ncols=2)
fig.set_figheight(5)
fig.set_figwidth(8)

# generate choices with switching
switch = pd.Series(gen_choices(1000, True)).value_counts()
switch.plot(ax=axes[0], kind='bar', title="switched", color='y')

# generate choices without switching
noswitch = pd.Series(gen_choices(1000, False)).value_counts()
noswitch.plot(ax=axes[1], kind='bar', title="didn't switch", color='r')
plt.show()

Yup. Switching wins. Isn’t it amazing?


WebTCP: Making TCP connections from Browser

The problem

There is no simple way to create TCP sockets in JavaScript on the browser side. Although solutions like WebSockets let you create something that resembles a socket, you can only use them to connect to servers that support WebSockets, not to arbitrary servers that know nothing about HTTP.

Why bother

If creating such connections were possible, we could connect to any external server from the browser and keep all the logic in client-side JavaScript, without needing to implement a backend app.

For instance, it would be possible to create (or just port from node.js) client libraries for things like Memcache, Redis, MySQL, Riak, RabbitMQ or any other server.

While in many situations such usage would be questionable and insecure, there are cases where it could be quite useful:

  • using server-side cache from JS
  • using pub/sub servers to deliver notifications to browsers (Redis, RabbitMQ, Apache Kafka, etc.)
  • making HTTP requests to any server, bypassing same-origin policies :|

Solution

As an experiment I implemented this small library: WebTCP. Here is how it works.

Read on →

Sharding Redis #2. Rebalancing Your Cluster.

The fact is that when it comes to sharding, Redis is not the best tool you could have. Although Redis cluster was partly implemented on an unstable branch a long time ago, other prioritized work apparently keeps antirez from finishing it.

So if you are sitting and wondering what is going to happen when you can’t fit all your data into a single Redis server, there are some almost undocumented workarounds. I wrote briefly about this already here, but let’s take another look.

The Ruby Redis client has a Distributed module. It is simple.

require "redis"
require "redis/distributed"

redis = Redis::Distributed.new ["redis://host_1.com:6379", "redis://host_2.com:6379"]

Now you can use redis in the usual way.

redis.set("key1", "value1")
redis.set("key2", "value2")

redis.get("key1") # => value1
redis.get("key2") # => value2

What’s different here from using a regular client is that Redis::Distributed creates a hash ring out of the Redis nodes passed in at the beginning. It uses crc32 to calculate a hash for each node based on the node’s Redis URL and its number. When performing an operation on a key, it calculates a hash for the key and maps it to the appropriate Redis node. It works so that keys are mapped almost evenly across nodes: for a cluster of two nodes, if you create 100 lists, approximately 50 will be mapped to the first node and the other 50 to the second.
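To make the routing idea concrete, here is a deliberately simplified Ruby sketch that routes keys by crc32 modulo the node count. The real Redis::Distributed builds a proper hash ring with many replicated points per node, so treat this as an illustration rather than the actual algorithm.

require "zlib"

# Simplified key routing: hash the key, pick a node.
# (Redis::Distributed actually places many virtual points per node
# on a ring, which smooths the distribution.)
NODES = ["redis://host_1.com:6379", "redis://host_2.com:6379"]

def node_for(key)
  NODES[Zlib.crc32(key) % NODES.size]
end

puts node_for("key1")
puts node_for("key2")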

To find out which node a given key maps to, simply use the node_for method.

redis.node_for("key1")

Usually there’s no need to do that, because the whole process is transparent to the developer. Until you need to add another node.

Okay, let’s add another node

Let’s say your two-node redis cluster is out of capacity and you need to add a 3rd node.

redis.add_node("redis://host_3.com:6379")

Simple enough, but what happens to the keys that were stored before adding the 3rd node? They remain in their old places. But because hashing depends on the number of nodes in the cluster, the hashes for some keys will change, and those keys will now map to different nodes. So when you try to get key1, the request can go to a different node and you won’t get the value that was stored before. Not fun if you care about the old values.

As an attempt to solve this problem I wrote redis-migrator. Redis-migrator takes a list of nodes for your old cluster and a list of nodes for your new cluster, determines which keys’ routes have changed, and then moves those keys to their new nodes.

So it solves the previous problem like this:

require 'redis_migrator'

# a list of redis-urls for an old cluster
old_redis_hosts = ["redis://host1.com:6379", "redis://host2.com:6379"]

# a list of redis-urls for a new cluster
new_redis_hosts = ["redis://host1.com:6379", "redis://host2.com:6379", "redis://host3.com:6379"]

migrator = Redis::Migrator.new(old_redis_hosts, new_redis_hosts)
migrator.run

To make the migration process faster, instead of doing sequential writes I used Redis pipelining, which allows sending a bunch of operations without waiting for each reply and gathering the responses afterwards. Check out migrator_benchmark.rb to run the benchmarks.
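If you haven’t used pipelining before, here is a minimal sketch with the redis-rb client. Note that the block-argument form comes from newer redis-rb versions; in the versions around when this was written you called the commands directly on the client inside the block.

require "redis"

redis = Redis.new(url: "redis://host_1.com:6379")

# All SETs are queued and flushed together; replies are
# collected once the block finishes, saving round trips.
redis.pipelined do |pipeline|
  1000.times { |i| pipeline.set("key#{i}", "value#{i}") }
end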

Although this tool still lacks good error handling, I believe it can give you a good start.


Look Ma, no URL Calls!

What if we never had to make any URL calls from our JavaScript to access our backend APIs? Instead, if we had a model User with the class method create(name), then in JS we would just say user.create("Harry Potter"), that would run the method on the server and return the result to the JS client, and it would look as if the function were defined in JS. Oh, I know you know that this is called RPC, that it has existed for thousands of years, got wiped out by REST and is generally considered bad practice. But why? Wouldn’t it be more convenient than dealing with URLs and controllers? A web application, just like any other distributed application, consists of a server and many clients talking to each other. Passing messages and making remote procedure calls seems way more natural than hitting URL endpoints. URLs are understandable to people, but why make software talk this language?

To test this idea I made a simple lib: https://github.com/yankov/nourl. You define a class on the backend side and it looks like this:

class User
  include Nourl::RPCable

  allow_rpc_for :get

  def self.get(name)
    User.find_by_name(name)
  end
end

Then you run your server, and in your JS you can write this:

var settings = {
  rpcUrl: "http://0.0.0.0/rpc", // host and port where your server is running
  transport: "ajax",
  require: ["user"]             // list all classes that you want to access
}

nourl.run(settings, function(){

   user.get('john', function(result) {
      console.log(result);
   });

});

Nourl automatically creates stubs for your model’s classes and methods, so you can just call a method from JS, pass arguments and get the result back from your backend server. The point is that all the RPC and Ajax logic is hidden, and everything looks as if the model were already implemented in JavaScript.
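For the curious, the server-side dispatch for something like this can be tiny. Below is a hypothetical sketch; the payload shape and the allowed_methods whitelist are mine for illustration, not nourl’s actual internals.

require "json"

# Hypothetical dispatcher: parse an RPC payload, check the method
# against a whitelist, call it and serialize the result.
def handle_rpc(payload, allowed_methods)
  req = JSON.parse(payload)
  unless (allowed_methods[req["class"]] || []).include?(req["method"])
    raise "RPC not allowed"
  end
  klass = Object.const_get(req["class"])
  { "result" => klass.public_send(req["method"], *req["args"]) }.to_json
end

# handle_rpc('{"class":"User","method":"get","args":["john"]}',
#            "User" => ["get"])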

Especially taking into account that WebSockets are probably going to be used more and more, I don’t yet see why this pattern wouldn’t work.

Some unnecessary text that tries to make the post look cooler

A known fact: sometimes we keep doing things the old way just because we are used to it, even if it’s inconvenient. Handling this inconvenience quickly becomes part of the standard routine, which builds up our comfort zone. It usually takes time to break the old layer of habits and adapt to new ideas. This is why people don’t care about craigslist’s UI even though it doesn’t conform to any modern standards. Try to optimize it and people will rage (for a while).

So I was just trying to find things we do in web development that seem odd or obsolete, but that we got used to and accepted as standards. One such thing is REST, and using URLs for accessing APIs in general. Well, saying that REST is bad would probably be too arrogant, but there could just be other ways to go.

Remember, there was stuff like SOAP, RPC, CORBA and RMI, probably more familiar to people from the world of enterprise distributed software. For a bunch of reasons it never took off in web development: it was too cumbersome to deal with and not very natural for HTTP. But since we are moving on, web applications look more and more like desktop applications, and WebSockets are going to be widespread very soon, we probably have to find better ways for clients to talk to servers than hitting URL endpoints.


How to Find Facebook Users on Match.com by Using Face Recognition Tools

One day I was drinking coffee with a friend and he told me a story about how he got in trouble with his girlfriend because she found his old profile on the popular dating site match.com. Allegedly someone among their mutual friends bumped into it and sent her a “friendly notification”. He was unable to prove to her that it was just ‘an old thing’ he forgot to delete, and their story ended there: they broke up pretty fast after the incident. Probably their relationship wasn’t that strong anyway, but the story struck me with a couple of thoughts. What are the odds of bumping into the profile of someone you know on a dating website, and how easily could such privacy be violated if someone had a direct intention?

I remembered that there were some open-source face detection and recognition libraries available and thought that it’s probably possible to write a tool that would crawl photos on dating sites and try to recognize a particular person in them. Then I ran into face.com, a platform that provides a RESTful API for detecting, tagging and recognizing faces in pictures. I recalled that story again and told it to my friends; we all laughed and agreed that such a tool would be creepy, but I did put it on my list of ideas for hacking.

So, guess what, let’s go creepy and run a small experiment to see how easy it would be. We’ll write a tool that takes tagged photos of a Facebook user and tries to find his or her profile on match.com.

Let’s split the task into smaller problems.

  • How to send authorized search requests to match.com
  • How to get URLs of profile images
  • How to make parsing fast by running requests asynchronously
  • How to use face.com API

How to parse match.com

Sending authorized requests

To get profile pictures we need search output. If you go to match.com you’ll see that it doesn’t let you browse search results without registration. So the very first step is to create an account on match.com. After that, go to the search form, choose parameters like sex, age and a zipcode, and click “Search now”. Now we see the results with the profile pictures, but let’s try to get them from our tool. Copy the URL of the request from the browser and write a simple Ruby script.
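The full script comes after the jump, but the first step looks roughly like this. The search URL and the cookie value below are placeholders; copy the real ones from your browser’s network inspector after logging in.

require "net/http"
require "uri"

# Placeholder URL and cookie -- use the ones from your own session.
uri = URI("http://www.match.com/search/searchSubmit.aspx?by=radius")
req = Net::HTTP::Get.new(uri)
req["Cookie"] = "PASTE_YOUR_SESSION_COOKIE_HERE"

res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
puts res.code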

Read on →

8 Ideas for a Weekend Hacking

My current strategy is to try hacking on one new technology every week and create something simple. The primary goal is to learn new stuff and have fun. So here are some ideas I came up with while brainstorming.

1. Eventmachine & web crawling

Ever wondered how to write a fast distributed web crawler? If you code in Ruby, you may use EventMachine. It’s an event-processing library for Ruby that implements the Reactor pattern and provides you with non-blocking IO. Normally, if you send an HTTP request, it blocks the entire process until the response is received. Even threads won’t help much, and the crawling speed will suck.

Here are some libraries that work on top of EventMachine and help you make requests asynchronously. A minimal example follows the list.

Asynchronous HTTP client: em-http-request

Asynchronous Redis client: em-redis

Non-blocking DNS resolution: em-resolv-replace
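Here is a minimal sketch with em-http-request, assuming the gem is installed: the request returns immediately, the reactor keeps spinning, and the callback fires when the response arrives.

require "eventmachine"
require "em-http-request"

EventMachine.run do
  # Non-blocking: .get returns right away, the reactor keeps running.
  http = EventMachine::HttpRequest.new("http://example.com/").get

  http.callback do
    puts http.response_header.status
    EventMachine.stop
  end
  http.errback { EventMachine.stop }
end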

2. Brainwave sensing headsets

Currently, I’m aware of two affordable brainwave sensing headsets on the market:

neurosky.com

emotiv.com

Such headsets measure brainwave impulses from the forehead and generate interesting data based on them. What’s interesting is that there are a number of amazing applications developed for such devices: from games you control with your mind, where you move objects with the power of thought, to research tools for serious neurohackers. You can develop your own applications using languages such as C++, C#, .NET or Java. Unfortunately, I haven’t found any support for Ruby or Python yet, but that’s not going to stop anyone, right? ;)

3. Face recognition API

Face.com looks like a great service that lets you detect, recognize and tag faces through its REST API. You can combine different sources and, for instance, find Facebook friends in Flickr photos or a private photo album. As a dirty idea for a weekend hack: spend a day and write a script that finds your girlfriend’s profile on match.com ;)

4. Bigdata: Hadoop, HBase, Hive and Pig

Technologies related to Bigdata are very interesting to hack on. Setting up a Hadoop cluster, writing mappers and reducers to do some badass calculations and process huge amounts of data can fill up evenings with real geek fun. Try to figure out how HBase and Hypertable, which are modeled after Google’s BigTable, work. Then set up Hive or Pig and learn how to run complex SQL-like queries without writing mappers in Java.

A couple of previous posts related to it:

Hadoop 1.0 + MongoDB: the Beginning

How to Set Up a Hadoop Cluster with Mongo Support on EC2

5. Amazon AWS

A lot of startups build their services on Amazon AWS: dropbox, heroku, engine yard, mongohq, etc. Nowadays you don’t need your own datacenter to build something like dropbox. So one day I wondered how complicated it would really be to build something like dropbox. Basically, it’s a simple client that connects to a bucket on Amazon S3 and lets you upload files by putting them into a specific folder on your computer. Sounds like a one-evening deal for a simple version.
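The core of such a client fits in a few lines. Here is a sketch with the aws-sdk-s3 gem (a much newer SDK than what existed back then; the bucket name and paths are made up):

require "aws-sdk-s3"

# Push one file from the watched folder to a bucket; a real client
# would watch the folder for changes and sync them.
s3 = Aws::S3::Resource.new(region: "us-east-1")
object = s3.bucket("my-dropbox-clone").object("docs/notes.txt")
object.upload_file("/Users/me/MyDropbox/docs/notes.txt")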

7. Mobile development frameworks

The fastest ways to develop mobile applications:

phonegap.com

appcelerator.com

rhomobile.com

All three frameworks let you write applications for a bunch of different platforms: iOS, Android, Symbian, etc. PhoneGap and Appcelerator leverage HTML5, CSS and JavaScript, whereas Rhodes uses a Ruby MVC framework. So it’s now easy to develop your first mobile app for all the main platforms without knowing Java or Objective-C.

8. Websockets, comet and socket.io

Ever wondered how real-time notifications work on Facebook? Or in any other web application that involves real-time interaction between users in the browser: chats, games, etc.? There are a number of techniques that can be used to let a server push a message to a browser (or rather to simulate it): xhr-polling, jsonp-polling, WebSockets or Flash sockets. Some libraries to check out:

socket.io

faye – publish-subscribe messaging system. Very easy to use. The server is node.js based, and there is also a Ruby server in the faye gem (a minimal sketch follows this list).

juggernaut – uses socket.io, node.js and redis.
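As a taste of faye’s Ruby side, a minimal pub/sub server mounts as a Rack app. This is a sketch assuming the faye gem; save it as config.ru and start it with rackup.

require "faye"

# Bayeux pub/sub endpoint at /faye; browsers subscribe with faye's
# JS client and the server pushes messages to them.
run Faye::RackAdapter.new(mount: "/faye", timeout: 25)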


How to Set Up a Hadoop Cluster with Mongo Support on EC2

In the previous post I described how to set up hadoop on your local machine and make it work with MongoDB. That’s good enough for development and testing, but if you want to crunch any serious numbers you have to run hadoop as a cluster. So let’s figure out how to run a hadoop cluster with mongodb support on Amazon EC2.

This is a step-by-step guide that shows you how to:

  • Create your own AMI with custom settings (hadoop and mongo-hadoop installed)
  • Launch a hadoop cluster on EC2
  • Add more nodes to the cluster
  • Run the jobs

So let’s hack.

Setting up an AWS account.

1. Create an Amazon AWS account if you don’t have one already.

2. Download the Amazon command line tools to your local machine. Unpack them, for example, into the ~/ec2-api-tools folder.

3. In your home folder, edit .bash_profile (or .bashrc if you’re not using OS X) and add the following lines:

export EC2_HOME=~/ec2-api-tools

export PATH=$PATH:$EC2_HOME/bin

Read on →

Hadoop 1.0 + MongoDB: the Beginning

I’ve been playing with Hadoop and MongoDB for a couple of months and noticed that there’s a lack of information describing how to actually make them work together. In the next few posts I’ll try to cover this, starting from the basics.

In this post I’ll explain how to set up hadoop with the mongo-hadoop library on your localhost. I did it on OS X, but the same approach should work for Linux as well.

Installing Hadoop

1. Download hadoop 1.0

wget http://apache.mirrors.pair.com//hadoop/common/hadoop-1.0.0/hadoop-1.0.0.tar.gz

2. Unpack it somewhere in your home directory:

tar xzvf hadoop-1.0.0.tar.gz

3. Set JAVA_HOME

cd hadoop-1.0.0

vim conf/hadoop-env.sh

And for OS X add the following line:

export JAVA_HOME=$(/usr/libexec/java_home)

If you are using Linux, change it to the actual path of your Java installation.
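For example, on Ubuntu with the Sun JDK installed it might look something like this (the exact path depends on your distribution and JDK):

export JAVA_HOME=/usr/lib/jvm/java-6-sun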

4. Edit config files

Read on →

Sharding Redis is Easy.

While Redis doesn’t have built-in support for clustering yet, there’s a pretty easy solution if you are using the Ruby client. It’s not documented and you can only find out about it by reading the tests, but this client supports consistent hashing across multiple Redis nodes out of the box.

Here’s how you use it:

require "rubygems"
require "redis"
require "redis/distributed"

r = Redis::Distributed.new %w[redis://localhost:6379 redis://localhost:6378]

# show node for key "foo"
p r.node_for("foo")

# set the value of "foo"
r.set("foo", "value")

I fired up two instances of Redis on different ports just as an example, but there can be any number of Redis instances hosted on different machines. So what happens next? When you try to write or read a key, the client calculates a hash for the key and maps it to a specific Redis node. That way, the same key is always routed to the same Redis node.

There are also more features, like adding nodes, which you can find out about by reading the distributed*.rb tests here.

More reading if you are interested in scaling Redis:

Redis Memory Usage
Redis Sharding at Craigslist