Experimental Coding

Artem Yankov

Monty Hall Problem

There’s one nice old mind-bending puzzle that I really like: the Monty Hall Problem. It is amazing because of how counterintuitive the answer is. You won’t believe it’s true until you triple-check it and read the proof, and even after that you will likely keep thinking there must be a catch somewhere. So here is the problem as quoted from Wikipedia:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

So is it better to switch your choice, to stick with your first one, or does it not matter? If you are like me and the majority of other people, the most intuitive answer is: switching doors won’t make a difference and the chance of winning stays the same. Right?

The truth is that switching gives you a 2/3 chance of winning, while sticking with your choice gives only 1/3. Why? When you pick one of 3 doors, your chance of getting the car is 1/3, while your chance of getting a goat is 2/3. The chance that the car is behind one of the 2 doors you didn’t choose is therefore also 2/3. When the host opens one of those 2 doors and shows you a goat, that reveal tells you nothing new about your original door, so the whole 2/3 probability collapses onto the one remaining closed door. Still doesn’t sound right? Let’s run a programming experiment and generate 1000 choices for every scenario.

import pandas as pd
import matplotlib.pyplot as plt
from random import randrange

def boxes():
    # three doors, a car behind a random one, goats behind the rest
    b = ["goat"] * 3
    b[randrange(0, 3)] = "car"
    return b

def gen_choices(num, switch=True):
    results = []
    for _ in range(num):
        b = boxes()
        choice1 = randrange(0, 3)
        c = list(range(3))
        c.remove(choice1)

        # the host opens a goat door among the two we didn't pick,
        # so switching means taking the remaining closed door
        choice2 = c[0] if b[c[0]] == "car" else c[1]
        prize = b[choice2] if switch else b[choice1]

        results.append(prize)

    return results


fig, axes = plt.subplots(nrows=1, ncols=2)
fig.set_figheight(5)
fig.set_figwidth(8)

# generate choices with switching
switch = pd.Series(gen_choices(1000, True)).value_counts()
switch.plot(ax=axes[0], kind='bar', title="switched", color='y')

# generate choices without switching
noswitch = pd.Series(gen_choices(1000, False)).value_counts()
noswitch.plot(ax=axes[1], kind='bar', title="didn't switch", color='r')
plt.show()

Yup. Switching is winning. Isn’t it amazing?


WebTCP: Making TCP connections from Browser

The problem

There is no simple way to create TCP sockets in Javascript on the browser side. Although solutions like WebSockets let you create something that resembles a socket, you can only use them to connect to servers that support WebSockets, not to arbitrary servers that know nothing about HTTP.

Why bother

If creating such connections were possible, we could connect to any external server from the browser and keep all the logic in client-side Javascript without needing to implement a backend app.

For instance, it would be possible to create (or just port from node.js) client libraries for things like Memcache, Redis, MySQL, Riak, RabbitMQ or any other server.

While in many situations such usage would be questionable and insecure, there are cases when it could be quite useful:

  • using server-side cache from JS
  • using pub/sub servers to deliver notifications to browsers. (Redis, RabbitMQ, Apache Kafka, etc)
  • making HTTP requests to any server, bypassing same-origin policies :|

Solution

As an experiment I implemented this small library: WebTCP. Here is how it works.

It is impossible to make the browser initiate raw TCP connections to a server, but it is possible to use a proxy (a “bridge” would be a better name) that receives connection requests from the browser, creates real sockets and then relays responses back to the browser. So on the client side we have fake socket objects that are mapped to real socket connections on the proxy side.

How client connects to the bridge

This is where WebSockets, or something like Socket.IO or SockJS, come in handy. The client can use WebSockets (or fall back to xhr/jsonp-polling or whatever is supported) to talk to the bridge, and the bridge talks to TCP servers using real socket connections. I decided to use SockJS, although Socket.IO is fine too. Here is what the entire thing looks like:

[diagram: browser → SockJS bridge → TCP servers]

How is it different from having a backend app

The difference is in where the logic for handling different servers’ protocols lives. The normal way is to handle it on the server side. So, for instance, if the browser wants to get something out of Memcache, it makes a request to some backend app that knows how to talk to Memcache. If you suddenly want to do something with Redis, you need to modify the backend code and restart the server to add support for it.

On the other hand, if the browser can operate at the socket level, such protocol logic can be implemented on the client side. There can be just a bridge (or easily a cluster of bridges behind HAProxy) that knows nothing about the servers’ protocols and simply passes data back and forth between sockets. Whatever you want to connect to from the browser, just include a JS client library in the page; there is no need to touch the bridge at all.

Examples

Here are some examples for sockets, HTTP, a memcache client and a redis client. I ported the memcache client easily from the node.js version just by using node-browserify and replacing the net library with a WebTCP connection. The redis client for node.js relies on a C library, so I had to implement a client from scratch, but luckily the Redis protocol is pretty simple.

Socket example

//First create a SockJS tunnel. 
//Use whatever port and address your WebTCP server is on.
var net = new WebTCP('localhost', 9999)

//Now you can create sockets like this
var socket = net.createSocket("127.0.0.1", 1337)

// To send data to socket 
socket.write("hi")

// On connection callback
socket.on('connect', function(){
  console.log('connected');
})

// This gets called every time new data for this socket is received
socket.on('data', function(data) {
  console.log("received: " + data);
});

socket.on('end', function(data) {
  console.log("socket is closed ");
});

It’s also possible to specify advanced options when creating a socket connection

options = {
  encoding: "utf-8",
  timeout: 0,
  noDelay: true, // disable/enable Nagle algorithm
  keepAlive: false, //default is false
  initialDelay: 0 // for keepAlive. default is 0
}

And then pass those options when creating the socket:

var socket = net.createSocket("127.0.0.1", 1337, options)

HTTP example

//Create a http client
var client = net.createHTTPClient();

// GET request
client.get({ host: 'news.ycombinator.com', port: 80 }, function(res) {
  console.log(res);
});

// POST request
client.post({ host: 'news.ycombinator.com', port: 80 }, { param: 1 }, function(res) {
  console.log(res);
});

Redis example

<script src="../lib/client/webtcp-0.0.1.min.js"></script>
<script src="../lib/client/redis.js"></script>
<script>
// Redis client example
var net = new WebTCP('127.0.0.1', 9999);

var redis = new Redis(net, "127.0.0.1", 6379);

redis.send("set a 1", function(res) {
  console.log(res);
});

redis.send("incr a", function(res) {
  console.log(res);
});

redis.send("incr a", function(res) {
  console.log(res);
});

redis.send("subscribe ch1", function(res) {
  console.log(res);
});

</script>

Memcache example

<script src="../lib/client/webtcp-0.0.1.min.js"></script>
<script src="../lib/client/memcache.js"></script>
<script>

var tcp = new WebTCP('127.0.0.1', 9999);

var client = new memcache.Client(11211, "127.0.0.1");

client.connect();

client.on('connect', function(){
   console.log('connected')
});

client.on('close', function(){
   console.log('closed')
});

client.set('foo', 'some value', function(error, result){
  console.log(result);
});

client.get('foo', function(error, result){
   console.log(result);
});

client.version(function(error, result){
  console.log(result);
});

</script>

That’s about it. Although this project still needs more thought to figure out good use cases, I had a really fun time playing with it.


Sharding Redis #2. Rebalancing Your Cluster.

The fact is that when it comes to sharding, Redis is not the best tool you could have. Although redis cluster was partly implemented on an unstable branch a long time ago, other higher-priority work apparently keeps antirez from finishing it.

So if you are sitting and wondering what is going to happen when you can’t fit all this data into a single redis server, there are some barely documented workarounds. I wrote briefly about this already here, but let’s take another look.

The Ruby redis client has a Distributed module. It is simple to use.

require "redis"
require "redis/distributed"

redis =  Redis::Distributed.new ["redis://host_1.com:6379", "redis://host_2.com:6379"]

Now you can use redis in the usual way.

redis.set("key1", "value1")
redis.set("key2", "value2")

redis.get("key1") # => value1
redis.get("key2") # => value2

What’s different here from using a regular client is that Redis::Distributed builds a hash ring out of the redis nodes passed to it in the beginning. It uses crc32 to calculate hashes for each node based on the node’s redis url and its number. When performing an operation on a key, it calculates a hash for the key and maps it to the appropriate redis node. Keys end up mapped almost evenly across nodes: for a cluster of two nodes, if you create 100 lists, approximately 50 will be mapped to the first node and another 50 to the second.
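
If you want to see that near-even split for yourself, here is a small check, using the same two example hosts as above, that counts how many of 100 keys node_for routes to each node:

require "redis"
require "redis/distributed"

redis = Redis::Distributed.new ["redis://host_1.com:6379", "redis://host_2.com:6379"]

# count how many of 100 keys the hash ring routes to each node
counts = Hash.new(0)
100.times { |i| counts[redis.node_for("list#{i}")] += 1 }

p counts.values # => something close to [50, 50]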

To know which node a given key maps to, simply use the node_for method.

redis.node_for("key1")

Usually there’s no need to do that, because the whole process is transparent to the developer. Until you need to add another node.

Okay, let’s add another node

Let’s say your two-node redis cluster is out of capacity and you need to add a 3rd node.

redis.add_node("redis://host_3.com:6379")

Simple enough, but what happens to the keys that were stored before adding the 3rd node? They remain in their old places. But because the hashing depends on the number of nodes in the cluster, the hashes for some keys change and those keys now map to different nodes. So when you try to get key1, the request may go to a different node and you won’t get the value that was stored before. Not fun if you care about the old values.
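
A quick way to see the problem is to ask node_for about the same key before and after adding the node. A small sketch, again with the example hosts from above:

redis = Redis::Distributed.new ["redis://host_1.com:6379", "redis://host_2.com:6379"]
redis.set("key1", "value1")
node_before = redis.node_for("key1")

redis.add_node("redis://host_3.com:6379")
node_after = redis.node_for("key1")

# if the ring now routes key1 to a different node, the old value is effectively lost
p node_before == node_after # => false when key1 got remapped
p redis.get("key1")         # => nil in that case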

As an attempt to solve this problem I wrote redis-migrator. Redis-migrator takes a list of nodes for your old cluster and a list of nodes for your new cluster, determines which keys’ routes have changed, and then moves those keys to their new nodes.

So it solves the previous problem like this:

require 'redis_migrator'

# a list of redis-urls for an old cluster
old_redis_hosts = ["redis://host1.com:6379", "redis://host2.com:6379"]

# a list of redis-urls for a new cluster
new_redis_hosts = ["redis://host1.com:6379", "redis://host2.com:6379", "redis://host3.com:6379"]

migrator = Redis::Migrator.new(old_redis_hosts, new_redis_hosts)
migrator.run

To make the migration process faster, instead of doing sequential writes I used redis pipelining, which allows sending a bunch of operations without waiting for each reply and then gathering the responses afterwards. Check out migrator_benchmark.rb to run the benchmarks.
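
For reference, this is roughly what pipelining looks like with the ruby client; a generic sketch, not redis-migrator’s actual code. The commands are buffered, flushed in one batch and the replies are gathered afterwards:

require "redis"

redis = Redis.new

# 1000 SET commands go out together instead of paying a round trip for each write
replies = redis.pipelined do |pipe|
  1000.times { |i| pipe.set("key#{i}", "value#{i}") }
end

p replies.size # => 1000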

Although this tool still lacks good error handling, I believe it can give you a good start.


Look Ma, no URL Calls!

What if we never had to make any URL calls from our javascript to access our backend APIs? Instead, if we had a model User with the class method create(name), then in JS we would just say user.create("Harry Potter"), that would run the method on the server and return the result to the JS client, and it would look as if the function were defined in JS. Oh, I know you know that this is called RPC, has existed for thousands of years, got wiped out by REST and is generally considered bad practice. But why? Wouldn’t it be more convenient than dealing with URLs and controllers? A web application, just like any other distributed application, consists of a server and many clients that talk to each other. Passing messages and making remote procedure calls seems way more natural than hitting URL endpoints. URLs are understandable for people, but why make software talk this language?

To test this idea I made a simple lib https://github.com/yankov/nourl. You define a class on your backend side and it looks like this:

class User
  include Nourl::RPCable

  allow_rpc_for :get

  def self.get(name)
    User.find_by_name(name)
  end
end

Then you run your server, and in your JS you can write this:

var settings = {
  rpcUrl: "http://0.0.0.0/rpc", //host and port where your server is running
  transport: "ajax",
  require: ["user"]                 // list all classes that you wanna access to
}

nourl.run(settings, function(){

   user.get('john', function(result) {
      console.log(result);
   });

});

Nourl automatically creates stubs for your models’ classes and methods, so you can just call a method from JS, pass arguments and get the result back from your backend server. The point is that all the RPC and Ajax logic is hidden and everything looks as if the model were implemented in javascript.
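
Conceptually, the server side of such a scheme boils down to a tiny dispatcher. The sketch below is only an illustration of the idea, not Nourl’s actual implementation: it assumes the client posts a JSON payload with made-up "class", "method" and "args" fields, and a whitelist (here called ALLOWED, in the spirit of allow_rpc_for) decides which class methods are callable.

require "json"

# whitelist of class methods that may be called remotely
ALLOWED = { "user" => ["get"] }

# takes the raw JSON body of an RPC request and returns the result of
# calling the whitelisted class method with the given arguments
def dispatch(raw_body)
  call = JSON.parse(raw_body) # e.g. {"class":"user","method":"get","args":["john"]}
  klass, meth, args = call.values_at("class", "method", "args")

  raise "method not allowed" unless ALLOWED.fetch(klass, []).include?(meth)

  Object.const_get(klass.capitalize).public_send(meth, *args)
end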

Especially taking into account that websockets are probably going to be used more and more, I don’t yet see why this pattern wouldn’t work.

Some unnecessary text that tries to make the post look cooler

A known fact: sometimes we keep doing things the old way just because we are used to it, even if it’s inconvenient. Handling this inconvenience quickly becomes part of the standard routine, which builds up our comfort zone. It usually takes time to break the old layer of habits and adapt to new ideas. This is why people don’t care about craigslist’s UI although it doesn’t conform to any modern standards. Try to optimize it and people will rage (for a while).

So I was just trying to find things we do in web development that seem odd or obsolete, but that we have gotten used to and accepted as standards. One such thing is REST and using URLs for accessing APIs in general. Saying that REST is bad would probably be too arrogant, but there could simply be other ways to go.

Remember, there was stuff like SOAP, RPC, CORBA and RMI, probably more familiar to people from the world of enterprise distributed software development. For a bunch of reasons it didn’t take off in web development: it was too cumbersome to deal with and not very natural for HTTP. But as we move on, web applications look more and more like desktop applications and websockets are going to be widespread very soon, so we should probably find better ways for clients to talk to servers than hitting URL endpoints.


How to Find Facebook Users on Match.com by Using Face Recognition Tools

One day I was drinking coffee with my friend and he told me a story of how he got in trouble with his girlfriend because she found his old profile on the popular dating site match.com. Allegedly one of their mutual friends bumped into it and sent her a “friendly notification”. He was unable to prove to her that it was just ‘an old thing’ he forgot to delete, and their story ended there: they broke up pretty fast after the incident. Probably their relationship wasn’t that strong anyway, but the story left me with a couple of thoughts. What are the odds of bumping into the profile of someone you know on a dating website, and how easily could such privacy be violated if someone had a direct intention?

I remembered that there were some open-source face detection and recognition libraries available and thought it was probably possible to write a tool that would crawl photos on dating sites and try to recognize a particular person in them. Then I ran into face.com, a platform that provides a RESTful API for detecting, tagging and recognizing faces in pictures. I recalled that story again and told it to my friends; we all laughed and agreed that such a tool would be creepy, but I did put it on my list of ideas for hacking.

So, guess what, let’s go creepy and run a small experiment to see how easy that would be. To do that we’ll write a tool that will take tagged photos of a Facebook user and try to find his/her profile on match.com.

Let’s split the task into smaller problems.

  • How to send authorized search requests to match.com
  • How to get URLs of profile images
  • How to make parsing fast by running requests asynchronously
  • How to use face.com API

How to parse match.com

Sending authorized requests

To get profile pictures we need the search output. If you go to match.com you’ll find that it doesn’t allow you to browse search results without registration, so the very first step is to create an account there. After that, go to the search form, choose the parameters like sex, age and zipcode, and click “Search now”. Now we see the results with the profile pictures, but let’s try to get the same thing from our tool. Copy the URL of the request from the browser and write a simple Ruby script.

require 'rubygems'
require 'open-uri'

response = open(YOUR_URL_OF_REQUEST)

p response.read

Run this and you won’t see the results, but that’s what we expected, right? The Ruby script sends a request without a session cookie and therefore hits the sign-in form. So we need to pass the session cookie to send requests as a signed-in user. If you’re already logged in, open the list of cookies for match.com in your browser and you’ll see the bunch of shit it stores. Save yourself some time, ’cause I already figured out that their session cookie is called SECU. Copy the value of this cookie and update the script.

response = open(YOUR_URL_OF_REQUEST, "cookie" => "SECU=VALUE_OF_THE_COOKIE")

If you run it, you’ll see a different response. Search through it and you’ll find something like Welcome, your_name, which means we sent the request as an authorized user.

Getting URL of profile images

Now, how do we get the URLs of profile images? Let’s analyze the HTML structure of the search results page. Use the Web Inspector in Chrome or Safari, or Firebug if you use Firefox. Point at a profile image and you’ll see that the HTML code for it looks something like this:

<img class="profilePic" src="http://sthumbnails.match.com/sthumbnails/03/06/95230303242.jpeg" style="border-width:0px;">

All profile pictures on the page have the class “profilePic”. Awesome. But those are very small images which would be hard to use for recognition; we need the bigger ones. Let’s click on someone’s profile, find the big version of the thumbnail from the search result and see what the link for it looks like:

http://pictures.match.com/pictures/03/06/95230303242.jpeg

Boom! Looks like we’ve found a pattern. The image has the same name; the only difference is in part of the path: we should replace sthumbnails.match.com/sthumbnails with pictures.match.com/pictures to get the big image for a thumbnail. That way, by parsing only the search results pages, we can get URLs for the big profile images without additionally requesting each profile page. Ok, let’s do it.

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# search url copied from the browser (pn=1 means the first results page)
url = "http://www.match.com/search/searchSubmit.aspx?by=radius&lid=226&cl=1" \
      "&gc=2&tr=1&lage=27&uage=29&ua=29&pc=94121&dist=10&po=1&oln=0&do=2" \
      "&q=woman,men,27,29,1915822078&st=quicksearch&pn=1&rn=4"

response = open(url, "cookie" => "SECU=VALUE_OF_THE_COOKIE")
doc = Nokogiri::HTML(response.read)

doc.xpath("//img[@class='profilePic']/..").each do |link|
  img_src = link.xpath("img/@src").to_s
  img_src.gsub!('sthumbnails.match.com/sthumbnails', 'pictures.match.com/pictures')
  puts img_src
end

That will print the URLs of big profile pictures from the first results page. For parsing I’m using the Nokogiri gem and a little XPath here. Easy.

Getting images from the first page is not enough, we gotta get them from all pages, so let’s run a loop and change the page number param in the URL. Examine the URL structure of the search request and you’ll see a parameter called ‘pn’ where the page number is passed.

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# can be different for your specific search
PAGES = 140

PAGES.times do |page_num|

  url = "http://www.match.com/search/searchSubmit.aspx?by=radius&lid=226&cl=1" \
        "&gc=2&tr=1&lage=27&uage=29&ua=29&pc=94121&dist=10&po=1&oln=0&do=2" \
        "&q=woman,men,27,29,1915822078&st=quicksearch&pn=#{page_num}&rn=4"

  response = open(url, "cookie" => "SECU=VALUE_OF_THE_COOKIE")
  doc = Nokogiri::HTML(response.read)

  doc.xpath("//img[@class='profilePic']/..").each do |link|
    img_src = link.xpath("img/@src").to_s
    img_src.gsub!('sthumbnails.match.com/sthumbnails', 'pictures.match.com/pictures')
    puts img_src
  end

end

Note the &pn=#{page_num} part. The rest of the URL should be your own, as copied from your browser.

Making parser work fast

Although that would get part of the job done, the solution doesn’t scale. The HTTP requests here run sequentially and take too much time to complete if you want to crawl a lot of pages. What we need to do is run the HTTP requests asynchronously. There are a number of ways to achieve that (no, using threads is not one of them), but I’d suggest using EventMachine. EventMachine gives you the ability to run IO operations asynchronously without blocking the process. There’s em-http-request, an asynchronous HTTP client that works on top of EventMachine and fits our purpose perfectly. So let’s rewrite our small program using EventMachine and see what happens.

require 'nokogiri'
require 'eventmachine'
require 'em-http-request'
require 'em-redis'

PAGES = 140

EM.run {

  @redis = EM::Protocols::Redis.connect

  PAGES.times do |page_num|

    url = "http://www.match.com/search/searchSubmit.aspx?by=radius&lid=226&cl=1" \
          "&gc=2&tr=1&lage=27&uage=29&ua=29&pc=94121&dist=10&po=1&oln=0&do=2" \
          "&q=woman,men,27,29,1915822078&st=quicksearch&pn=#{page_num}&rn=4"

    http = EM::HttpRequest.new(URI.escape(url)).get :head => {'cookie' => "SECU=YOUR_SESSION_COOKIE;"}

    # fires when the response for this page is received
    http.callback {

      p "parsing page #{page_num + 1}"
      doc = Nokogiri::HTML(http.response)

      doc.xpath("//img[@class='profilePic']/..").each do |link|
        img_src = link.xpath("img/@src").to_s
        img_src.gsub!('sthumbnails.match.com/sthumbnails', 'pictures.match.com/pictures')
        # map the profile image url to the profile url in a redis hash
        @redis.hset("people", img_src, link['href'])
      end

    }

  end
}

There are a few things I have to explain here. EventMachine runs an event loop and everything should be passed to it as a block. When you run EM::HttpRequest.new(URI.escape(url)).get, instead of blocking the process and waiting for the response it returns immediately, and when the response is received EventMachine calls the http.callback block, which is where I put our parsing logic. Also, you may have noticed that instead of printing the URLs to the screen, I save them in a Redis hash where each profile image is associated with its profile URL. We’ll be accessing this hash from the recognition tool later. Note that I’m using an asynchronous version of the Redis client for EventMachine here: em-redis. If you run this script you’ll see that it works way, way faster than its synchronous version. Now that we have the profile pictures, let’s get to using the face.com API to recognize faces in them.

How to use Face.com API

First we’ll need to register at face.com and get some credentials to be able to use their API.

Go there and sign up. Save the API KEY and API SECRET that you’ll be given after registration is complete.

To recognize faces you first have to “train” their app by feeding it some images of the people you want to find. There are a number of ways to do that. You can pass URLs of images, detect faces in them, tag them and then pass URLs of other images for recognition. Or you can pass a Twitter or Facebook user, and they will automatically take the available tagged photos of this user to train on and “remember” them.

Getting fb_oauth_token

So let’s go the Facebook way. To be able to access the tagged photos of your friends, we need an fb_oauth_token for a Facebook app that has permission to access your profile and your friends’ images. You can either register your own Facebook app, or (faster) grant permission to face.com’s app. In any case we just need a value of fb_oauth_token that we’re going to use later in our script.

Doing that is a bit tricky. Go to http://developers.face.com/tools/. Choose the method faces.recognize and a Facebook connect button will show up. Click it and grant their app the requested permissions. Then click the “call method” button, ignore whatever appears in the response body, but check out the REST URL at the top of it. You’ll see an fb_oauth_token parameter at the end of the URL. Copy and save its value.

Finding profiles!

The good news is that there’s a ruby gem for face.com that works. I assume you have been following this post and have the URLs of profile images stored in a hash named ‘people’ in Redis. Here’s a script that’ll do the rest of the job:

require 'redis'
require 'face'

# fb user_id of the user that needs to be found on match.com
FB_UID = '11111@facebook.com'

# show all profiles with the given confidence of recognition
ACCURACY = 30

redis = Redis.new
pictures =  redis.hgetall("people")

# recognize user
client = Face.get_client(:api_key => FACECOM_APIKEY,
                         :api_secret => FACECOM_APISECRET)

# note: in fb_user you pass YOUR fb user id, not the id of the user you are looking for
client.facebook_credentials = { :fb_user => YOUR_FB_USER_ID,
                                :fb_oauth_token =>YOUR_FB_OAUTH_TOKEN }

#train pictures
response = client.faces_train(:uids => [FB_UID] )

pictures.keys.each_slice(20) do |pictures_chunk|

  response = client.faces_recognize(:urls => pictures_chunk, :uids => [FB_UID])

  response["photos"].each do |photo|
    photo["tags"].each{|tag|
      next if tag.nil? || tag['uids'].empty?
      if tag['uids'][0]['confidence'].to_i > ACCURACY
        p "Profile found, #{pictures[photo['url']]}, confidence #{tag['uids'][0]['confidence']}"
      end
    }
  end

end

That’s it. It will display a list of profiles where the confidence of recognition was more than 30%. You can change this number to see more accurate results.

It was a lot of fun to run this test against my own photos, get results with a high confidence number and see people who allegedly look like me (well, sometimes they do). I ran the test for some friends (with their consent!) and was able to find a couple of them too. No details or pictures can be revealed here, for obvious reasons :)

Conclusion about privacy: if you have photos associated with your profiles on different websites, and especially photos with your face tagged, there’s definitely a way to find those profiles just by using your photos.


8 Ideas for a Weekend Hacking

My current strategy is just to try hacking on one new technology every week and create something simple. The primary goal is to learn new stuff and have fun. So here are some ideas I came up with during brainstorming.

1. Eventmachine & web crawling

Ever wondered how to write a fast distributed web crawler? If you code in Ruby, you can use EventMachine. It’s an event-processing library for Ruby that implements the Reactor pattern and gives you non-blocking IO. Normally, if you send an HTTP request it blocks the entire process until the response is received. Even if you try to use threads it won’t help much and the crawling speed will suck. A minimal example follows the library list below.

Here are some libraries that work on top of EventMachine and help you make requests asynchronously.

Asynchronous HTTP client: em-http-request
Asynchronous Redis client: em-redis
Non-blocking DNS resolution: em-resolv-replace
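
As a taste of what this looks like, here is a minimal sketch that fetches a single page asynchronously with em-http-request (example.com is just a stand-in URL):

require 'eventmachine'
require 'em-http-request'

EM.run {
  # the request returns immediately; the callback fires when the response arrives
  http = EM::HttpRequest.new('http://example.com/').get

  http.callback {
    puts "got #{http.response.length} bytes"
    EM.stop
  }

  http.errback {
    puts "request failed"
    EM.stop
  }
}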

2. Brainwave sensing headsets

Currently, I’m aware of two affordable brainwave sensing headsets on the market:

neurosky.com
emotiv.com

Such headsets measure brainwave impulses from the forehead and generate interesting data based on them. What’s interesting is the number of amazing applications developed for such devices: from games you control with your mind, where you can move objects with the power of thought, to research tools for serious neurohackers. You can develop your own applications using languages such as C++, C#, .NET or Java. Unfortunately I haven’t found any support for Ruby or Python yet, but that’s not going to stop anyone, right? ;)

3. Face recognition API

Face.com looks like a great service that lets you detect, recognize and tag faces through a REST API. You can combine different sources and, for instance, find Facebook friends in Flickr photos or a private photo album. As a dirty idea for weekend hacking: spend a day and write a script that finds your girlfriend’s profile on match.com ;)

4. Bigdata: Hadoop, HBase, Hive and Pig

Technologies related to Big Data are very interesting to hack on. Setting up a hadoop cluster and writing mappers and reducers to do some badass calculations and process huge amounts of data can fill up your evenings with real geek fun. Try to figure out how HBase and Hypertable, which are modeled after Google’s BigTable, work. Then set up Hive or Pig and learn how to run complex SQL-like queries without writing mappers in Java.

A couple of previous posts related to it:

Hadoop 1.0 + MongoDB: the Beginning
How to Set Up a Hadoop Cluster with Mongo Support on EC2

5. Amazon AWS

A lot of startups build their services on top of Amazon AWS: dropbox, heroku, engine yard, mongohq, etc. Nowadays you don’t need your own datacenter to build something like dropbox. So one day I wondered how complicated it would really be to build something like it. Basically it’s a simple client that connects to a bucket on Amazon S3 and uploads the files you put into a specific folder on your computer. Sounds like a one-evening deal for a simple version (a rough sketch of the core upload step is below).
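
For the core piece, here is a minimal sketch of the upload step using the aws-sdk-s3 gem. The bucket name, region and watched folder are made-up placeholders, and a real client would watch the folder for changes instead of scanning it once:

require 'aws-sdk-s3'

WATCHED_DIR = File.expand_path('~/MyBox')  # hypothetical local folder to sync
BUCKET      = 'my-dropbox-clone-bucket'    # hypothetical bucket name

s3 = Aws::S3::Resource.new(region: 'us-east-1')

# upload every file currently sitting in the watched folder, keyed by its filename
Dir.glob(File.join(WATCHED_DIR, '*')).each do |path|
  next unless File.file?(path)
  s3.bucket(BUCKET).object(File.basename(path)).upload_file(path)
  puts "uploaded #{File.basename(path)}"
end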

7. Mobile development frameworks

The fastest ways to develop mobile applications

phonegap.com
appcelerator.com
rhomobile.com

All three frameworks let you write applications for a bunch of different platforms: iOS, Android, Symbian, etc. PhoneGap and Appcelerator leverage HTML5, CSS and Javascript, whereas Rhodes uses an MVC Ruby framework. So it’s now easy to develop your first mobile app for all the main platforms without knowing Java or Objective-C.

8. Websockets, comet and socket.io

Ever wondered how real-time notifications work on Facebook? Or in any other web application that involves real-time interaction between users in the browser: chats, games, etc.? There are a number of techniques that can be used to let a server push (or rather simulate pushing) a message to a browser, such as xhr-polling, jsonp-polling, websockets or flash sockets. Some libraries to check out:

socket.io
faye – publish-subscribe messaging system. Very easy to use. Server is node.js based.
juggernaut – uses socket.io, node.js and redis.


How to Set Up a Hadoop Cluster with Mongo Support on EC2

In the previous post I described how to set up hadoop on your local machine and make it work with MongoDB. That’s good enough for development and testing, but if you want to crunch any serious numbers you have to run hadoop as a cluster. So let’s figure out how to run a hadoop cluster with mongodb support on Amazon EC2.

This is a step by step guide that should show you how to:

  • Create your own AMI with the custom settings (installed hadoop and mongo-hadoop)
  • Launch a hadoop cluster on EC2
  • Add more nodes to the cluster
  • Run the jobs

So let’s hack.

Setting up an AWS account

1. Create an Amazon AWS account if you don’t have one yet.

2. Download the Amazon command line tools to your local machine. Unpack them, for example, into the ~/ec2-api-tools folder.

3. In your home folder edit .bash_profile (or .bashrc if you’re not using OS X) and add the following lines:

export EC2_HOME=~/ec2-api-tools
export PATH=$PATH:$EC2_HOME/bin

4. Create an X.509 certificate and a private key in your AWS Management Console. Check out this doc to see how to do that. It’s a pretty straightforward process; in the end you should download two .pem files. Put them in the ~/.ec2 folder. Those are your keys for accessing amazon aws services.

After you’re done with this, edit your .bash_profile and add the following lines:

export EC2_PRIVATE_KEY=~/.ec2/name_of_your_private_key_file.pem
export EC2_CERT=~/.ec2/name_of_your_certificate_file.pem

5. Create a private ssh key pair

To be able to log in to your instances you need to create a separate keypair. From the command line run:

ec2-add-keypair gsg-keypair

This will output a private key. Save it to the file id_rsa-gsg-keypair and put it in the ~/.ec2 folder with the rest of your key files. Then change the permissions:

chmod 600 ~/.ec2/id_rsa-gsg-keypair

And add the key to authentication agent:

ssh-add ~/.ec2/id_rsa-gsg-keypair

6. Create a rule to allow login to your amazon instances

Go to the AWS Management Console, click the EC2 tab, then choose Network and Security from the menu in the left sidebar. Add a new rule, choose “SSH rule” from the list and apply the changes.

Hadoop settings

Luckily, hadoop has built-in tools to help you deploy a cluster on EC2.

1. Set your amazon security credentials

Go to your hadoop folder and edit file src/contrib/ec2/bin/hadoop-ec2-env.sh

Fill out the following lines:

Your Amazon Account Number

AWS_ACCOUNT_ID=""

Your Amazon AWS access key

AWS_ACCESS_KEY_ID=""

Your Amazon AWS secret access key

AWS_SECRET_ACCESS_KEY=""

You can find your amazon account number by logging into your AWS account. It’s going to be in the top-right corner below your name. To get the access and secret keys, go to My Account –> Security Credentials and click the “Access keys” tab.

2. Setting a private key

In the same file hadoop-ec2-env.sh find this:

KEY_NAME=gsg-keypair

If you named your key id_rsa-gsg-keypair then leave this field without changes.

3. Set the hadoop version

In the line HADOOP_VERSION= set your hadoop version. If you were following the previous post to set up hadoop then it’s going to be:

HADOOP_VERSION=1.0.0

4. Set S3 bucket

This bucket will be used to store the amazon image that you are going to create later. First you have to create the bucket: go to the AWS Management Console, click the S3 tab and click the “Create a bucket” button. Choose a name, for example ‘my-hadoop-images’, and save the bucket. Then edit hadoop-ec2-env.sh and update the following line:

S3_BUCKET=my-hadoop-images

5. Choose the amazon instance type

AWS provides a number of instance types depending on your requirements; you can read more about that later. Let’s stick with m1.small for now. Uncomment the following line:

INSTANCE_TYPE="m1.small"

6. Edit Java settings

We keep editing the same hadoop-ec2-env.sh file. Set the Java version to:

JAVA_VERSION=1.6.0_30

Then find a block after # SUPPORTED_ARCHITECTURES = [i386, x86_64] and edit JAVA_BINARY_URLs.

For i386:

JAVA_BINARY_URL=http://download.oracle.com/otn-pub/java/jdk/6u30-b12/jdk-6u30-linux-i586.bin

For x86_64 (large and xlarge instance type):

JAVA_BINARY_URL=http://download.oracle.com/otn-pub/java/jdk/6u30-b12/jdk-6u30-linux-x64.bin

If by the time you are reading this post those URLs do not work or if you want to get another Java version you can find URLs for download here.

Create an Amazon Machine Image (AMI)

Additionally, our image should have the mongo-hadoop driver, the mongo java driver and our mapreduce jobs installed. First you need to upload your mongo-hadoop core and mapreduce jobs somewhere accessible from the internet. The easiest way is just to upload those files to the same S3 bucket you created earlier for hadoop images. Where do you get these files? In the previous post I showed how to compile mongo-hadoop. Your mongo-hadoop-core lives in $mongo_hadoop_dir/core/target/ and is called something like mongo-hadoop-core.jar. Upload it to your S3 bucket, and the URL for it is going to look something like http://my-hadoop-images.s3.amazonaws.com/mongo-hadoop-core.jar. Your mapreduce jobs are also packaged as a jar file. You can create and compile your own mapreduce jobs, but for now let’s use the treasury_yield example we compiled in the previous post. This file should be in $your_mongo_hadoop_dir/examples/treasury_yield/target/ and is called something like mongo-hadoop-treasury_yield.jar. Upload this file to your S3 bucket too.

When that’s done, edit the create-hadoop-image-remote file in your $hadoop_dir/src/contrib/ec2/bin/image/ folder. After the block # Configure Hadoop insert the following:

#Install MONGO-HADOOP
cd /usr/local/hadoop-$HADOOP_VERSION/lib

#Copy mongo java driver
wget -nv https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar

#Copy mongo-hadoop core
wget -nv http://your_url_here/mongo-hadoop-core.jar

#Copy MapReduce jobs
wget -nv http://your_url_here/mongo-hadoop-treasury_yield.jar

That will download the files we need when the image is configured.

Create an image

Finally, to create the image, run the following command in your hadoop folder:

bin/hadoop-ec2 create-image

If everything goes well, that should bundle a new image and upload it to your S3 bucket. Now you are ready to run the cluster.

Launch the cluster

Phew. Once the long setup process is complete, launching a new hadoop cluster is very easy. The format of the command is this:

bin/hadoop-ec2 launch-cluster <name_of_the_cluster> <number_of_slaves>

So to run a hadoop cluster named “mongohadoop” consisting of one master and one slave node, you would run the following command:

bin/hadoop-ec2 launch-cluster mongohadoop 1

In the output you’ll see the hostname of your master instance. Give it a few minutes to boot and then you can open

http://<your_master_hostname>:50030

in your browser and you should see the hadoop administration page. The number of nodes shown should be 1 (it counts only slave nodes).

Later if you want to add, for example, two more slaves you would do it like this:

bin/hadoop-ec2 launch-slaves mongohadoop 2

To login to your master instance:

bin/hadoop-ec2 login mongohadoop

Run the jobs

To run a job you have to login to the master instance and then use the command you learned in the previous post:

bin/hadoop jar mongo-hadoop/core/target/mongo-hadoop-core.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig -conf path_to_jobs_config/mongo-treasury_yield.xml

For your own mapreduce jobs you would change com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig and mongo-treasury_yield.xml accordingly of course.

If you want to run the treasury_yield example in your hadoop cluster you have to do a few more things:

  1. Have a MongoDB server running and accessible from the internet.
  2. Import yield_historical_in.json there like you did for your local MongoDB in the previous post.
  3. On the hadoop master instance upload mongo-treasury_yield.xml and put it, for example, in /usr/local/hadoop-1.0.0/ folder.
  4. Edit mongo-treasury_yield.xml and change mongo.input.uri and mongo.output.uri. Put the hostname of your database instead of 127.0.0.1 and add login and password if needed.
  5. Then run the job:

bin/hadoop jar mongo-hadoop/core/target/mongo-hadoop-core.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig -conf /usr/local/hadoop-1.0.0/mongo-treasury_yield.xml

On your hadoop administration page verify that the job is running.

Done! Now you have a hadoop cluster running on EC2 and you know how to make it work with MongoDB. To write your own mappers and reducers, check out how the treasury_yield example is made, change it the way you want, recompile it and upload it to your hadoop cluster (or create an image that includes the file).


Hadoop 1.0 + MongoDB: the Beginning

I’ve been playing with Hadoop and MongoDB for a couple of months and noticed that there’s a lack of information describing how to actually make them work together. In the next few posts I’ll try to cover this, starting from the basics.

In this post I’ll explain how to set up hadoop with the mongo-hadoop library on your localhost. I did it on OS X but the same approach should work on Linux as well.

Installing Hadoop

1. Download hadoop 1.0

wget http://apache.mirrors.pair.com//hadoop/common/hadoop-1.0.0/hadoop-1.0.0.tar.gz

2. Unpack it somewhere in your home directory:

tar xzvf hadoop-1.0.0.tar.gz

3. Set JAVA_HOME

cd hadoop-1.0.0

vim conf/hadoop-env.sh

And for OS X add the following line:

export JAVA_HOME=$(/usr/libexec/java_home)

If you are using Linux change it to the actual path to your Java binary.

4. Edit config files

conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
     <name>dfs.replication</name>
     <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

5. Set up the permissions

In order to make hadoop work, the user who will run the jobs should be able to ssh to localhost. On OS X go to Settings –> Sharing and check the “Remote Login” box; you can specify the name of the user there. Also, if you don’t want to enter the password each time you start hadoop or run jobs, you have to add your ssh key to ~/.ssh/authorized_keys.

If you don’t have an ssh key you can generate one with the following command:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa_for_hadoop

And then:

cat ~/.ssh/id_dsa_for_hadoop.pub >> ~/.ssh/authorized_keys

6. Format namenode

In your hadoop home directory run:

bin/hadoop namenode -format

7. Start hadoop

bin/start-all.sh

If everything went well, this should start a bunch of stuff: namenode, jobtracker, secondarynamenode, datanode and tasktracker.

Verify everything works by opening http://localhost:50030 in your browser.

Installing mongo-hadoop library

Now this is a fun and tricky part. This library is fairly young and sometimes requires some additional hacking.

1. Download mongo-hadoop

Go to your hadoop home folder and run:

git clone https://github.com/mongodb/mongo-hadoop

2. Compile it

In mongo-hadoop folder use maven to compile the package:

mvn -U package

3. Copy compiled files to hadoop-1.0.0/lib

cp core/target/mongo-hadoop-core-1.0-SNAPSHOT.jar ../lib

cp examples/treasury_yield/target/mongo-hadoop-treasury_yield-example-1.0-SNAPSHOT.jar ../lib

4. Download mongo java driver

Download https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar and put it in hadoop-1.0.0/lib folder.

For now use only version 2.7.3 of the java driver, since mongo-hadoop was compiled against it.

5. Restart hadoop

bin/stop-all.sh
bin/start-all.sh

Running examples

I assume you already have mongo server up and running on your localhost.

Let’s verify that our system works by using the MongoTreasuryYield example from the mongo-hadoop package.

1. Import initial data

In mongo-hadoop folder run:

mongoimport --db demo --collection yield_historical.in --type json --file examples/treasury_yield/src/main/resources/yield_historical_in.json

Make sure the path to the json file is correct. Verify the data was imported from the mongo console:

use demo
db.yield_historical.in.count()

You should get 5193.

2. Run the example!

From hadoop folder run this long command:

bin/hadoop jar mongo-hadoop/core/target/mongo-hadoop-core-1.0-SNAPSHOT.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig -conf mongo-hadoop/examples/treasury_yield/src/main/resources/mongo-treasury_yield.xml

Now open http://localhost:50030 in your browser and you should see a running job there. After it’s done verify you got the results in mongo console:

use demo
db.yield_historical.in.find()

Done!
Now your local hadoop system is ready for further hacking. In the next post I’ll cover how to launch your own hadoop cluster on Amazon EC2.

Links

Launching Hadoop on EC2
Hbase/Hadoop on OS X
mongo-hadoop library


Sharding Redis is Easy.

While Redis doesn’t have built-in support for clustering yet, there’s a pretty easy solution if you are using the ruby client. It’s not documented and you can only find it out from reading the tests, but this client supports consistent hashing and multiple Redis nodes out of the box.

Here’s how you use it:

require 'rubygems'
require 'redis'
require 'redis/distributed'

r = Redis::Distributed.new %w[redis://localhost:6379 redis://localhost:6378]

# show node for key "foo"
p r.node_for("foo")

# set the value of "foo"
r.set("foo", "value")

I fired up two instances of Redis on different ports just as an example, but there can be any number of Redis instances hosted on different machines. So what happens next? When you try to write or read a key, the client calculates a hash for the key and maps it to a specific Redis node. That way the same key is always routed to the same redis node.

There are also more features, like adding nodes, which you can find out about by reading the distributed*.rb tests here.
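
For example, adding a node to the ring at runtime looks like this (the url is a placeholder for wherever your new instance lives):

# add a third redis instance to the hash ring
r.add_node("redis://localhost:6377")

# keys written from now on are distributed across all three nodes
p r.node_for("foo")

Keep in mind that adding a node changes the routing for some existing keys, which is exactly the rebalancing problem covered in the post above.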

More reading if you are interested in scaling Redis:

Redis Memory Usage
Redis Sharding at Craigslist


Essentials to Learn Ruby from Zero to Advanced.

Here is my idea on how to start learning Ruby from scratch.

  1. Ruby in Twenty Minutes (free)
    This short 4-page intro will give you an idea of the syntax and the general feel of the language.

  2. Programming Ruby (free)
    Next stop is an excellent book, “Programming Ruby”. You will learn about classes, objects, iterators, blocks and exceptions. You can read it up to the “Basic Input and Output” chapter and postpone the rest for later.

  3. Ruby Koans (free)
    After you are done with “Programming Ruby” I bet many things will remain unclear and many will be forgotten almost immediately without good practice. This is when you get to “Ruby Koans”. It’s an awesome collection of tests that help you learn ruby by reading and fixing real code. Seriously, don’t miss this step; it is really helpful and you will learn a lot.

  4. Eloquent Ruby
    Russ Olsen is a great author. His books read like good literature; you never get bored. “Eloquent Ruby” teaches you good style, walks you through advanced techniques, goes deep inside the ruby object model, and gives you a first understanding of metaprogramming and DSLs.

  5. Metaprogramming Ruby
    You can’t be a serious Ruby developer without a good understanding of what metaprogramming is for. Great reading; in particular, it will help uncover many “magic” tricks in frameworks such as Rails that were unclear before. You begin to understand how stuff works in frameworks and libraries.

  6. Design Patterns in Ruby
    Back to Russ Olsen. This book will help to refresh some of your knowledge about fundamental stuff from computer science: design patterns and how to use them.

  7. RSpec Book
    Now that you know how to code, you have to learn how to design applications, organize the whole development process and test your code. This is the essential read about BDD, RSpec and Cucumber.