Sunday, September 28, 2014

I was accepted into the Microsoft BizSpark program

Since I am winding down my consulting business this year (that means that I am limiting myself to a maximum of about 10 hours a week working for consulting customers) I have spent a lot of time getting better at developing in Haskell, reviewing what I hopefully already know about machine learning, and taking classes. In other words, I want to work on my own stuff :-)

I have had an idea for starting a small business and a while ago I applied to the Microsoft BizSpark program. I was accepted into the program a few days ago. Using my own business idea as my yardstick, Microsoft is taking long-term bets with BizSpark. It costs them money and resources to support the development of new business ideas, but the long tail is many years of selling infrastructure services. Even though there is not much lock-in using Microsoft Azure, I am absolutely personally committed to using Azure long term if my idea works: Microsoft is providing up to $150/month of free Azure services for up to three years and it seems like really bad form to not reward them with long tail business if things work out. If you have a web-based business idea that you want to pursue, I would suggest giving BizSpark a serious look.

I am planning on just using Linux servers on Azure, and it has been really easy to configure an Ubuntu server, hook up domains, etc. So far I am only using a single server for development and test deployments. I am used to doing everything on the command line but the Azure dev dashboard is useful to get a quick view of resource use and configuration. I am just using a small A-series 1 core VPS with 1.75 GB of RAM for development right now but I am pleased by how fast large builds run. It would be interesting to see relative performance of "1 core" VPS systems from many providers.

Azure offers some nice "Amazon-like" add-ons for monitoring and setting up clusters for horizontal scaling. While it is definitely less expensive (except for labor costs) to run your own servers, I am a huge fan of (almost) no-admin PaaS services like Heroku, IBM's BlueMix, Google AppEngine, etc. and basic cloud infrastructure providers like Amazon (AWS), Google (Compute Engine) and Microsoft (Azure). I would like the large infrastructure providers to make a healthy profit, and I expect that they will!

Wednesday, September 24, 2014

I pushed an NLP demo to IBM's PaaS service BlueMix

The demo processes news stories to summarize them and map entities found in the text to DBPedia URIs. The Ruby code is similar in functionality to the open source Haskell NLP code on my github account.

Some background: I have been helping a customer integrate the IBM Watson AI platform into his system. I noticed on Hacker News this morning that IBM's PaaS service BlueMix will very soon offer a sandbox for IBM Watson services. I signed up for BlueMix to have an opportunity to get more experience using IBM Watson.

I just spent an hour putting together a quick NLP demo that uses my own entity detection code and the Ruby classification gem which supports pretty good summarization. Give it a try :-)

2014/09/29 update: I stopped this quick demo I put together - it is simple and was just to experiment with BlueMix. A better demo is my site.

BlueMix is built using Cloud Foundry so if you are already familiar with the Cloud Foundry command line tools then you will find the development cycle very familiar.

Wednesday, September 17, 2014

Setting up "Heroku like" git push deploys on a VPS is so easy

I was reading about Docker closing a $40M series C round this morning. While containerization is extremely useful at large scale, I think that the vast majority of individual developers and small teams write many web applications that don't need to scale beyond a beefed up VPS or single physical server.

For a good developer experience it is difficult to beat a slightly expensive but convenient PaaS like Heroku. However, if you have many small web app projects and experiments then hosting on a PaaS and paying $30-$50/month per application can add up, year after year. If you need failover and scalability, then paying for a PaaS or implementing a more failsafe system on AWS makes sense. For experimental projects that don't need close to 100% uptime, I set up a .git/hooks/post-commit git hook like this:

ssh 'bash -s' <

I have my DNS set up for (this is not a real domain, I am using it as an example) and all other domains for my example/experimental web apps point to the IP address of my large VPS. My files look like this:

rsync -e "ssh" -avz --delete --delete-excluded  \
   --exclude-from=/Users/mark/Code/mywebapps/ \

In my rsync_exclude file I specify that my .git folder should not be copied to the server:

The file that gets remotely executed on my server looks like this:

#! /bin/bash

ps aux | grep -e '' | grep -v grep | awk '{print $2}' | xargs -i kill {}
(cd; lein deps; nohup lein trampoline run prod > out.log&)

This is the pattern I use for running Clojure web apps. Running Ruby/Sinatra, Haskell, and Java apps is similar.
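
The process-stopping pipeline in the remote script can be tried safely on canned input before pointing it at a live server. The following is a minimal sketch, not the script from this post: the function name pids_matching, the pattern myapp.core, and the sample process lines are all made up for illustration. It also refuses an empty pattern, since grep -e '' matches every line.

```shell
#!/bin/bash
# Print the PID column of every `ps aux`-style line on stdin that matches
# a pattern. pids_matching is a hypothetical helper name.
pids_matching() {
  local pattern="$1"
  # An empty pattern would match every process, so reject it.
  [ -n "$pattern" ] || { echo "pids_matching: empty pattern" >&2; return 1; }
  # Same three steps as the pipeline above: select lines, drop the grep
  # process itself, keep column 2 (the PID in `ps aux` output).
  grep -e "$pattern" | grep -v grep | awk '{print $2}'
}

# Made-up sample in `ps aux` column order (USER PID ...).
sample='mark  4242  0.5  9.1 java -cp ... myapp.core prod
mark  4300  0.0  0.1 grep -e myapp.core
mark  5151  0.1  2.0 nginx: worker process'

printf '%s\n' "$sample" | pids_matching 'myapp.core'
```

On a real server the input would come from ps aux and the output would be piped to kill, for example with ps aux | pids_matching 'myapp.core' | xargs -r kill (the -r flag is a GNU xargs extension that skips running kill when no PID was found).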

Since I tend to run many small experiments on a single large VPS, I use entries like the following in my /etc/rc.local file to restart all applications if I reboot the VPS:

(cd /home/mark/ ; su mark -c 'nohup lein trampoline run prod > out.log&') &

I use an account on the server that does not have root or sudo privileges so my web apps use non-privileged ports and I use nginx as a proxy. In my nginx.conf file, I have entries like the following to map non-privileged ports to virtual domain names:

 server {
    listen       80;
    location / {
      proxy_pass http://localhost:7070;
      proxy_redirect off;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
    error_page 500 502 503 504  /error.html;
    location = /error.html {
              root  /etc/nginx;
    }
 }

In this example, the application is running on the non-privileged port 7070 and is accessed through its virtual domain name. On my laptop, just doing a git push has the new version of my app running on my server in a few seconds.
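
Pulling the pieces of this post together, a deploy hook along these lines might be sketched as follows. This is an assumption-laden illustration, not the original hook: the host vps.example, the directory names, the restart.sh file name, and the DRY_RUN switch are all hypothetical. By default the sketch only prints the two commands it would run.

```shell
#!/bin/bash
# Sketch of a .git/hooks/post-commit deploy hook. Every host name and path
# below is a hypothetical placeholder, not a value from the original post.
REMOTE="deploy@vps.example"          # hypothetical SSH user@host
APP_DIR="$HOME/Code/myapp"           # hypothetical local project directory
REMOTE_DIR="myapp"                   # hypothetical directory under the remote home
EXCLUDES="$APP_DIR/rsync_exclude"    # patterns to skip, one per line, e.g. .git
RESTART="$APP_DIR/restart.sh"        # hypothetical local script run on the server
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print the commands; 0 = run them

deploy() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "rsync -e ssh -avz --delete --delete-excluded --exclude-from=$EXCLUDES $APP_DIR/ $REMOTE:$REMOTE_DIR/"
    echo "ssh $REMOTE 'bash -s' < $RESTART"
  else
    # Copy the working tree to the server, removing files deleted locally.
    rsync -e ssh -avz --delete --delete-excluded \
      --exclude-from="$EXCLUDES" "$APP_DIR/" "$REMOTE:$REMOTE_DIR/"
    # Feed the local restart script to a remote bash over stdin.
    ssh "$REMOTE" 'bash -s' < "$RESTART"
  fi
}

deploy
```

Setting DRY_RUN=0 would make the hook actually copy files and restart the app, so it is worth reading the printed commands first. Feeding the restart script over stdin with bash -s means nothing has to be installed on the server beforehand.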

Sunday, September 14, 2014

Changed license on my Haskell NLP code and comments on IBM Watson AI system

When I added licensing information on the github repository for my Haskell NLP experiments I specified the AGPL v3 license. I just changed it to GPL v3 so now it can be used as a web service without affecting the rest of any system that you use it for. I also did some code cleanup this morning. In addition to the natural language processing code, this repository also contains some example SPARQL client code and my Open Calais client library that you might find useful.

Some news about IBM Watson: their developer web site now has more documentation and example code available without needing to register to become an IBM Watson Partner.

I am helping a long term customer use IBM Watson as a web service over the next several months so I registered as a partner and have been enjoying reading all of the documentation on training an instance for a specific application, the REST APIs, etc. Good stuff, and I think IBM may grow a huge business around Watson.

Saturday, September 13, 2014

I am open sourcing my Haskell NLP experiments

I just switched the GitHub repository for my NLP experiments to be a public repository. Git pull requests will be appreciated! The code and data are released under the AGPL version 3 license - if you improve the code I want you to share the improvements with me and others :-)

This is just experimental code but hopefully some people may find it useful. My latest changes involve trying to use DBPedia URIs as identifiers for entities detected in text. Simple stuff, but it is a start.