Simon Reavely's Blog

Monday, March 11, 2013

I don't see how newsle.com is going to survive the lawyers

Newsle has great potential, but it's also home to some disturbing defamation of character.

I just signed up for newsle.com, which makes an interesting promise:
http://www.crunchbase.com/company/newsle

"Never miss an important story about a friend, professional contact, or public figure you care about.

Newsle is a web application that allows you to follow real news about your Facebook friends, LinkedIn contacts, and favorite public figures."

The problem is with this part:

"We’ve developed sophisticated algorithms to disambiguate between people with the same name, to evaluate the importance of news articles, and to optimize your newsfeed as you rate and interact with articles."

The problem is that their algorithms don't work.

For example, there is an article about a school friend of mine beating up his girlfriend. Trouble is, while the part of England matches and the name matches, it's not him. My friend is in his late 30s and the person in the article was in his 20s. Even if the age had matched, there is still a huge risk of false positives, where stories get attached to people without any proof that it's the same person.

So, I see big legal woes ahead for Newsle, with pressure (at least) to:

a) Avoid stories that are very negative, such as those involving bankruptcy or criminal or civil proceedings

b) Provide a way for friends to dispute stories about people

c) Indicate their level of confidence on stories so people can judge when to trust them.

Friday, May 13, 2011

HBase scalability for binary data - I wonder if Cassandra would have the same issue?

One interesting "learning" around HBase is that it's not really a good idea to use it for storing tons of binary data, e.g. photos, map tiles, audio files, etc.
http://www.quora.com/Apache-Hadoop/Is-HBase-appropriate-for-indexed-blob-storage-in-HDFS
http://www.quora.com/Apache-Hadoop/How-would-HBase-compare-to-Facebooks-Haystack-for-photo-storage
...the message and experience here is to store the metadata in HBase but keep the actual binary data outside HBase. I'm hearing others echo this lesson.

There are clearly issues in how partitioning in HBase works, in the way it spreads the workload across nodes and rebalances itself. Interestingly, Apache Cassandra has two partitioners out of the box: random and ordered. As I understand it, the HBase partitioner is closer to the ordered version, so trying these same use cases on Cassandra with a random partitioner might be an interesting compare and contrast.
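To make the ordered vs. random contrast concrete, here is a toy sketch in bash. It is not how HBase or Cassandra actually place data; the node count, the key names and both function names are made up for illustration. The "ordered" version picks a node from the key's first character, the "random" version hashes the whole key with cksum.

```bash
#!/bin/bash
# Toy illustration only: real HBase regions and Cassandra tokens are far more involved.
NODES=3   # hypothetical cluster size

# "Ordered" placement: keys are split into contiguous ranges by first character,
# so keys that sort next to each other land on the same node.
ordered_node() {
  case "$1" in
    [abcdefgh]*)  echo 0 ;;
    [ijklmnopq]*) echo 1 ;;
    *)            echo 2 ;;
  esac
}

# "Random" placement: hash the whole key and take it modulo the node count,
# so the node no longer depends on where the key sorts.
random_node() {
  local crc
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( crc % NODES ))
}

for key in photo_0001 photo_0002 photo_0003 tile_0001; do
  echo "$key ordered=$(ordered_node "$key") random=$(random_node "$key")"
done
```

With the ordered version every photo_* key ends up on the same node, which is the kind of hot spot a sequentially keyed blob store can run into. The hashed version decouples placement from sort order, at the cost of losing cheap range scans.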

My working assumption is that it's also safer to store binary data outside Cassandra if you want consistent, predictable response times, and to rely instead on highly available (i.e. replicated) storage that is really good at binary data that is write light and read heavy.

I'm interested to hear from anyone using Apache Cassandra who is storing large amounts of binary data (upwards of 10s of TBs).

S.

Monday, November 29, 2010

Getting status out of JBoss

A question recently arose on how to monitor JBoss from the load balancer. Since we can't use JMX or the jmx-console from the load balancer (our usual method), we needed some HTTP endpoints that were easy to use out of the box (i.e. we didn't want to write our own servlet/JSP page). Two good candidates were:

http://hostname/jboss/status

and (if you are using web services)

http://hostname/jbossws/services
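If you want to try the same kind of check by hand before configuring the load balancer, something like the following bash sketch should do. It assumes curl is installed; the function names, the 5 second timeout and the "any 2xx means up" rule are my own choices for illustration, not anything JBoss or a particular load balancer prescribes.

```bash
#!/bin/bash
# Hypothetical load-balancer style probe; hostname and timeout are placeholders.

# Treat any 2xx HTTP status as "up" and everything else as "down".
is_healthy() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *)           return 1 ;;
  esac
}

# Print only the HTTP status code for a URL. When the connection fails
# curl reports the code as 000, which is_healthy treats as "down".
http_status() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
}

# Report UP or DOWN for a single URL.
check_url() {
  if is_healthy "$(http_status "$1")"; then
    echo "UP   $1"
  else
    echo "DOWN $1"
  fi
}
```

Usage would be along the lines of: check_url "http://hostname/jboss/status"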

OK, that's it for now!

Tuesday, May 18, 2010

Book review - Pulling Strings with Puppet; configuration mgmt made easy

Puppet is an amazing tool for keeping everything on a cluster in sync. I/we use it for Apache Cassandra, Hadoop and our own internal software distribution.
It's a short book, but given the light level of documentation around the Puppet open source project I found it useful for getting started with automating machine administration. However, now that I am using Puppet more and more, I realize that what holds the book back is that it is so thin it lacks complete examples. This is really a critical flaw. Maybe if it came with source code on a website this hole could be plugged. In the end I got a lot out of online tutorials and then used this book for reference/reminders. I've now mostly moved past the book and use the reference guides on the reductivelabs/puppetlabs website:
http://docs.puppetlabs.com/references/stable/configuration.html
http://docs.puppetlabs.com/references/latest/type.html#package

Another annoyance is the lack of an index, so I recommend the ebook.
Final comments:
- Puppet is based on Ruby, so a light understanding of Ruby does help (especially if you need to patch Puppet).
- I did have to patch the version of Puppet on my Red Hat 4 box, which I got from the EPEL yum repo, since it was failing on templates and causing a SEGV in Ruby. See http://projects.reductivelabs.com/issues/2604

Book review - Hadoop: The Definitive Guide by Tom White

I really enjoyed the book "Hadoop: The Definitive Guide" by Tom White.
It has everything you need to:
a) Get started running your own cluster and writing your own MR jobs
b) Understand how to administer the cluster
c) Troubleshoot your programs
d) Learn about really important side projects like Pig, Hive, Zookeeper and HBase (of which I think Hive is the most amazing)

One thing I wish I'd done is go through the Cloudera online tutorials BEFORE reading this book. If I'd done that (instead of doing so afterwards) I think I'd have got through certain sections of the book much quicker; basically I would have 'got it' sooner. See http://www.cloudera.com/resources/?type=Training

After reading the book I organized a little geek meet where I covered a synopsis of Hadoop, Pig and Hive with the development team. I also introduced them to the Cloudera training virtual machine, which is just an amazing resource for learning Hadoop et al. It also introduced me to some cool tools like Sqoop (http://www.cloudera.com/developers/downloads/sqoop/), which reads tables out of an RDBMS like MySQL or Oracle and auto-populates Hadoop and/or Hive...very useful!

Friday, March 26, 2010

Enabling ssh-agent for password-less ssh login on KDE/Gnome

So one of the things that had been bothering me was ssh'ing into remote machines with keys that had passwords. I wanted to use ssh-agent so that I would not have to type in my password every time. Trouble was that I couldn't figure out how to set it up on my KDE desktop so that the ssh-agent would be active every time I opened a new shell. Everything I'd previously read talked about executing the command:

ssh-agent bash

...but this only starts the agent for that one shell and any child processes of that shell. Consequently, every shell you open has its own ssh-agent and you have to do an ssh-add in each shell, typing in your password each time.

Well, here is how to do it:

start-ssh-agent script

#!/bin/bash
if [ -f ~/.ssh/ssh-agent.env ]; then
  # Agent already started, nothing to do.
  # ":" is bash's built-in no-op; it keeps the "then" branch a valid statement.
  :
else
  ssh-agent > ~/.ssh/ssh-agent.env
  #we need to delete the echo from the sourced script since some
  #commands like scp and ssh hate it when .bashrc echos stuff out
  sed -e '/echo/d' ~/.ssh/ssh-agent.env > ~/.ssh/ssh-agent2.env
  mv ~/.ssh/ssh-agent2.env ~/.ssh/ssh-agent.env
  . ~/.ssh/ssh-agent.env
  #echo "Agent started"
  ssh-add
fi

Basically this script executes ssh-agent, captures the output that specifies the environment variables and writes them to a file for future reference from future shells. It then executes ssh-add to prompt you to enter the passwords for the private keys.

stop-ssh-agent script

#!/bin/bash
if [ -f ~/.ssh/ssh-agent.env ]; then
  . ~/.ssh/ssh-agent.env > /dev/null
  kill $SSH_AGENT_PID
  rm ~/.ssh/ssh-agent.env
  echo "Agent stopped"
else
  echo "Agent is not running"
fi

Then in your ~/.bashrc file you add the following:

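if [ -f ~/.ssh/ssh-agent.env ]; then
   . ~/.ssh/ssh-agent.env
else
   ~/bin/start-ssh-agent
   # start-ssh-agent runs as a child process, so source the file it just
   # wrote to get the agent's environment vars into this shell as well
   [ -f ~/.ssh/ssh-agent.env ] && . ~/.ssh/ssh-agent.env
fi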

...this basically means...
if the ssh-agent.env file exists
  source it so that the environment vars point to the ssh-agent process running.
else
  run the script to start the ssh-agent and prompt for the passwords for any keys

This is not perfect, and you need to be careful if you are doing agent forwarding into the box, but for most general cases this works.
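One case the scripts above don't handle: after a reboot or a killed agent, ~/.ssh/ssh-agent.env still exists but points at a process that is gone, so new shells source dead settings. A possible guard, as a rough sketch (agent_env_is_live is a made-up helper name, and checking the PID with kill -0 is only one way to test this; it can be fooled if the PID has been reused by another process):

```bash
#!/bin/bash
# Sketch of a staleness check; the default path matches the scripts above.
ENV_FILE="${ENV_FILE:-$HOME/.ssh/ssh-agent.env}"

# Succeeds only if the env file exists and the PID recorded in it is still running.
agent_env_is_live() {
  local file="$1" pid
  [ -f "$file" ] || return 1
  # Read the PID in a subshell so the caller's environment is left untouched.
  pid=$( unset SSH_AGENT_PID; . "$file" >/dev/null 2>&1; echo "$SSH_AGENT_PID" )
  [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

if ! agent_env_is_live "$ENV_FILE"; then
  # Left over from a reboot or a killed agent: remove it so the next
  # check in ~/.bashrc falls through to start-ssh-agent.
  rm -f "$ENV_FILE"
fi
```

Placed in ~/.bashrc before the existing if block, this would clear a stale file so the start script runs again.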

Sunday, February 21, 2010

Good and bad: NUFC promotion to Premiership

Over the past year my Newcastle United RSS feed has been very different from the year before. We've only lost 4 times away and the results have often been 3-0, 4-1, etc. It's been a joy to read the news. However, that good feeling I get on a Monday morning is going to change with promotion back into the Premiership. Honestly, I've got mixed feelings about promotion now that I am so used to hearing good news, when the most we can seem to hope for is a solid mid-table performance in the Premier League.

Tuesday, December 01, 2009

Book Review - Wicket In Action by Manning

I just finished reading Wicket in Action by Manning. The book is well laid out; I particularly liked how the simple example web site (cheesr) is grown through the book in line with the topics of each chapter. In addition to the stuff that you expect (such as how to work with/customize components, models, etc.) there is also good coverage of important topics like I18N, testing, integration with frameworks like Hibernate and Spring, and integration with JavaScript engines (other than the Wicket JavaScript engine).

Regarding Wicket itself: I really like this framework (we use it in my current team and it produces nice UIs that are pretty easy to maintain and change). Why do I like Wicket? Well for the following reasons:

- It's Java based, and since I'm strongest in Java it suits me.
- There is a nice separation between the UI in HTML/CSS/JS and the Java code that backs it. This clear separation between the presentation/design aspect and the coding is useful because it separates along the common skill groups. JavaFX (can you say "designer/developer workflow") may change my opinion on this, but right now I see advantages over, say, the JSP approach.
- Testing is well covered (with WicketTester) over and above just using something like Selenium (also mentioned in the book).
- The AJAX support seems solid and flexible. It even leaves you open to using other third-party JavaScript frameworks for your fancy UI components. In particular, the support for request/response queues and for falling back to full page refreshes is attractive.

I suspect (but cannot confirm) that Swing programmers will really like Wicket.

BTW...I love Manning books. The fact that you can get a free ebook when you purchase the print copy is excellent, and in general everything I read from the publisher is superb.

Checkpoint VPN Client Tray Icon Disappears...how to get back

One of the issues I have is that the Checkpoint VPN Client Tray Icon (the yellow key) sometimes disappears from my Windows XP taskbar. If I try to re-run Checkpoint it says it's already started. I just found out how to get it back without restarting my laptop: kill the SR_GUI.exe process. Things will automatically restart and the tray icon reappears.
BTW...I hate Checkpoint VPN...compared to Cisco VPN it's an unstable, horrific piece of software. I guess you get what you pay for.