
There's a lot of data out there, and paying attention to it in order to make decisions is a good idea. But where do you begin?
I began by browsing through some of the preview chapters (don't think they're up anymore) of Speech and Language Processing. I didn't get far. I also found some of Norvig's reviews on Amazon, one of which pertained to this gem (the sell: "But if someone told me I had to make a million bucks in one year, and I could only refer to one book to do it, I'd grab a copy of this book and start a web text-processing company."). After previewing it online, I knew that sans a whip at my back, it wouldn't help me get any better.
What I really wanted was something that was written like this, but with some examples.
I ordered a copy of Toby Segaran's Programming Collective Intelligence after browsing the table of contents and reading some reviews. In 330 pages, the author covers building a link recommendation engine, building a search engine, stochastic optimization (wheeee), spam filters, and genetic programming (list not exhaustive).
What makes the above fun is that in each case you're working with data from sites you're probably very familiar with already (del.icio.us, Kayak, eBay, Facebook).
Still, much of the book is code, so to get the most out of it, you should really try the examples. You can, of course, download the source code to the book. The examples are written in Python, which is concise and readable.
I decided to comprehend the book by writing the examples in Ruby as I went along. While I've flipped through a few of the chapters already, I've only actually worked through Chapter 2 (Chapter 1 is intro stuff).
It took me an embarrassing amount of time to work through Chapter 2, which surprised me. Ruby and Python are similar syntactically, but a few weird bugs in my translation tied me up.
One thing Segaran makes much use of is Python's list comprehensions, which I once attempted to dazzle (nay distract!) my Google interviewers with, to no avail.
Not having list comprehensions in Ruby made things a little rough, but there are good methods in Enumerable that come to the rescue.
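A minimal sketch of the kind of translation involved, using made-up data rather than anything from the book: a filter-and-transform comprehension becomes select followed by map, and a dictionary built from a comprehension becomes an inject over a hash.

```ruby
# Python: [x * x for x in nums if x % 2 == 0]
nums = [1, 2, 3, 4, 5, 6]
even_squares = nums.select { |x| x % 2 == 0 }.map { |x| x * x }

# Python: dict([(name, len(name)) for name in names])
# The names below are placeholder data for illustration only.
names = ["lisa", "gene", "toby"]
lengths = names.inject({}) do |hash, name|
  hash[name] = name.length
  hash # inject needs the accumulator returned from the block
end

puts even_squares.inspect
puts lengths.inspect
```

The chained version reads left to right instead of inside out, which takes some getting used to when translating line by line.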
I also didn't have the pydelicious library for writing the del.icio.us recommendation engine. It wasn't very hard to implement the necessary functions though. I didn't spend any time looking for a Ruby version, if there is one.
The link to the MovieLens dataset didn't work; it seems to have moved.
I also had some trouble with passing a function as a parameter. I know there are blocks, procs, and beats, but the solution wasn't apparent. I settled on passing a symbol and just picking the right scoring method based on its value.
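For what it's worth, here is a rough sketch of both routes: dispatching on a symbol with send, and passing a callable (a Method object or a lambda) instead. The scoring functions and their names are hypothetical stand-ins, not the book's similarity measures.

```ruby
# Two toy scoring functions; placeholders, not the book's implementations.
def overlap_score(a, b)
  (a & b).length
end

def jaccard_score(a, b)
  union = (a | b).length
  union.zero? ? 0.0 : (a & b).length.to_f / union
end

# Option 1: pass a symbol and dispatch on it with send.
def rank_by_symbol(target, others, scorer = :overlap_score)
  others.sort_by { |other| -send(scorer, target, other) }
end

# Option 2: pass anything that responds to call.
def rank_by_callable(target, others, scorer)
  others.sort_by { |other| -scorer.call(target, other) }
end

target = [1, 2, 3, 4]
others = [[1, 2], [1, 2, 3], [9]]

by_symbol = rank_by_symbol(target, others, :jaccard_score)
by_method = rank_by_callable(target, others, method(:overlap_score))
by_lambda = rank_by_callable(target, others, lambda { |a, b| (a & b).length })

puts by_symbol.inspect
puts by_method.inspect
puts by_lambda.inspect
```

The symbol version keeps the call sites short; the callable version lets the caller supply a scoring function that was never defined as a named method.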
Interested brethren, you can grab the Ruby code for Chapter 2 <-- there.
After testing at home and resolving various warnings, I finally upgraded Rankforest to Rails 2.0.2. Rankforest allows authors and publishers to keep track of the sales rank of their books. The site has about 2,000 users, and many keep up to date with rankings through RSS feeds or by exporting rankings every couple of days.
I was interested to see how the latest version of Rails would perform, especially after reading the posts mentioned here.
I allowed the site to run for a full week before running rawk on the new logs. I already had many megabytes of logs from the site running on Rails 1.2.6.
Here's a slim rundown, focusing on the average time required to complete a request.
The detail view - This view is the dashboard for a given item. It contains a product image, a chart showing sales ranks, and some running averages. Each log contained 1500-3000 sample requests.
Before: 0.45s
After: 0.42s
Hourly RSS feed - The hourly RSS feed gives authors a sales rank update every hour. I also had a couple thousand requests for this page in each log.
Before: 0.07s
After: 0.04s
Daily RSS feed - The daily RSS feed has a sales rank for the past 10 days.
Before: 0.40s
After: 0.30s
Cached charts - This is how long it takes to render a cached sales rank chart.
Before: 0.71s
After: 0.50s
Items listing - This view shows all of the items Rankforest is tracking and allows the user to paginate through them. I only had a few hundred requests for this. I think the difference must be due to updating the plugin I use for pagination. Prior to seeing these results I debated moving to will_paginate, but this is fine for now.
Before: 1.40s
After: 0.18s
Item compare - This is a view that allows a user to compare their book to similar items on Amazon.com. I had a couple hundred requests in each log.
Before: 0.11s
After: 0.15s
The collection view - Logged in users can view all of their books and sort them using various criteria.
Before: 0.05s
After: 0.09s
The last two areas of the site showed tiny slowdowns for some reason, but the gains in the more heavily-trafficked portions of the site outweigh the decrease.
The top line of the rawk script output shows the average for all requests. Here's how that worked out (86K requests under Rails 2.0.2 and 58K under 1.2.6).
Before: 0.17s
After: 0.07s
Taking 60 divided by the average request time, that's going from about 352 requests per minute to 857 for a single process working through requests back to back, which sounds substantial. In addition to updating the pagination plugin, I think a lot of the performance gain stems from the move from SqlSessionStore to the new cookie-based sessions.
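A quick check of that arithmetic, assuming the throughput numbers were derived by dividing 60 seconds by the average request time (which gives requests per minute for one process serving requests serially). The helper name is made up for this sketch.

```ruby
# Average seconds per request, taken from the overall rawk averages above.
averages = { "Rails 1.2.6" => 0.17, "Rails 2.0.2" => 0.07 }

# Hypothetical helper: requests one serial process could finish in a minute.
def requests_per_minute(average_seconds)
  (60.0 / average_seconds).floor
end

averages.each do |version, seconds|
  per_second = 1.0 / seconds
  puts format("%s: %.1f req/sec, %d req/min", version, per_second, requests_per_minute(seconds))
end
```

Either way you slice it, the ratio between the two versions is the same: a bit more than a 2.4x improvement in average throughput.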
The migration was mostly painless. I didn't have to change much code, the VPS hasn't changed, and now Rankforest can serve up even more information than before.
There's an interesting question over at subWindow about accessing the scope of the calling object in a block.
If access to local variables is desired, you can yield a Binding object from within the method. Just doing this would require writing things like eval "baz", context, where context is the Binding object returned from a call to binding. I used method_missing on Binding to make things a little more readable in the calling code.
To get at instance variables, it seems instance_variable_get is the way to go. The effect is much like calling instance_eval on the string that's passed in, which is very similar to what's happening below.
class BindingTest
  attr_accessor :accessible_baz

  def bar
    baz = "qux"
    foonum = 23
    @instance_baz = "instance_baz"
    @accessible_baz = "accessible_baz"
    yield(binding)
  end
end

# Extends the binding object so unknown methods are evaluated as names
# inside the binding.
class Binding
  def method_missing( method, *args )
    # Kernel.eval is spelled out so the call can't resolve to Binding#eval
    Kernel.eval method.to_s, self
  end
end

binder = BindingTest.new
puts binder.bar { |context| context.foonum.to_s }
puts binder.bar { |context| context.baz }
puts binder.bar { |context| context.accessible_baz }
# the following line fails; only locals (and methods on self) work
#puts binder.bar { |context| puts context.instance_baz }
puts binder.instance_variables
You could also use instance_variables and local_variables to return lists of those respective variables and evaluate them against the binding.
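A minimal sketch of that idea, assuming a class shaped like BindingTest above. BindingDemo and snapshot are hypothetical names, and this version leaves Binding itself untouched.

```ruby
class BindingDemo
  def bar
    baz = "qux"
    foonum = 23
    @instance_baz = "instance_baz"
    yield(binding)
  end
end

# Collects every local and instance variable visible in the given binding
# into a hash of name => value.
def snapshot(context)
  names = eval("local_variables + instance_variables", context)
  names.each_with_object({}) do |name, values|
    values[name.to_s] = eval(name.to_s, context)
  end
end

vars = BindingDemo.new.bar { |context| snapshot(context) }
puts vars.inspect

# instance_variable_get needs the object itself, which the binding can supply.
direct = BindingDemo.new.bar do |context|
  eval("self", context).instance_variable_get("@instance_baz")
end
puts direct
```

Because the instance variable names keep their @ prefix, evaluating them against the binding works where the method_missing trick did not.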
I haven't actually needed anything like this (and I'm sure it's open to abuse), but it's an interesting question that provides an opportunity to explore some of Ruby's reflection facilities.
I was experiencing extremely slow gem updates on my Slicehost VPS. gem would max out the memory usage and sit there, doing a bulk update on the index but never actually updating.
After a few attempts at a server software upgrade, I found success with the following commands.
gem1.8 update --system --source http://segment7.net/
gem1.8 update --no-rdoc --no-ri
I don't think the first command really helped, but I mention it because I did run it tonight. The second command issues the update, but without downloading the rubygems docs, etc. This sped things up, though the whole process still took about 30 minutes.
Using C# 3.0's new extension methods, it's now possible to implement Map, which is pretty awesome.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace FunctionalMap
{
    static class Program
    {
        // Lazily applies f to each element of s
        static IEnumerable<U> Map<T, U>(this IEnumerable<T> s, Func<T, U> f)
        {
            foreach (var item in s)
                yield return f(item);
        }

        static int SumOfSquares(IEnumerable<int> nums)
        {
            return nums.Map(delegate(int x) { return x*x; }).Sum();
            //return nums.Map(x => x * x).Sum(); // <== same as above
        }

        static void Main(string[] args)
        {
            int[] xs = new int[] { 1, 2, 3, 4, 5 };
            Console.WriteLine(Program.SumOfSquares(xs));
            Console.ReadLine();
            return;
        }
    }
}
The Sum() function is a built-in extension, while Map is one you'd add yourself. The this in front of the first parameter signifies an extension method.
In Ruby, the situation is the opposite: map (alias for collect) is built-in, but sum isn't.
sum = 0
xs = [1,2,3,4,5]
xs.map { |n| n*n }.each { |n| sum += n }
puts sum
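If the outer accumulator bothers you, inject can carry the running total instead. And the rough Ruby analogue of an extension method is simply reopening Enumerable. Here's a sketch; the sum_of_squares method name is one I made up for illustration.

```ruby
xs = [1, 2, 3, 4, 5]

# inject threads the running total through the block, so no outer variable.
total = xs.inject(0) { |acc, n| acc + n * n }
puts total

# Reopening Enumerable adds the method to arrays, ranges, and anything
# else that mixes the module in.
module Enumerable
  def sum_of_squares
    inject(0) { |acc, n| acc + n * n }
  end
end

puts xs.sum_of_squares
puts (1..5).sum_of_squares
```

Unlike a C# extension method, this changes Enumerable for the whole program, so a distinctive method name is a good idea.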
The C# version requires more syntax but it's a welcome addition that I hope we end up using at work soon.