Tuesday, January 26, 2016

Great talk on Spark

I just listened to an ACM sponsored talk Making Big Data Processing Simple With Spark by Matei Zaharias. You may need to be an ACM member to watch the webinar. I first joined ACM in the mid 1970s - recommended.

For handling huge datasets Spark is evolutionary or revolutionary depending on your point of view. A bit of personal history before I talk specifically about Spark:

In the late 1980s I was an architect and developer on a multinational project to use seismic data from 38 data collection stations to detect atomic bomb tests. All of our data handling software was custom; if we had Spark, or even Hadoop, we would have saved a ton of effort. Similarly, in the 1990s I was tech lead on a fraud detection system that used massive real time telephone records data sets. Modern infrastructure would have saved a lot of time and money.

My first serious use of map reduce was processing large Twitter data sets at Compass Labs. We used Hadoop on Amazon ElasticMapreduce. Later when I worked as a contractor at Google, in addition to using map reduce, I was introduced to realtime interactive tools like Dremel that made it easy to interactively use large data sets.

With Spark, everyone gets to interactively work with massive datasets! I think that Spark is evolutionary in that it builds on and plugs into existing work like the Hadoop File Sytem and supports familiar map reduce style operations. I think that it is revolutionary in the memory based distributed architecture and application programming model. Spark was designed based on limitations of map reduce systems like Hadoop that while providing easy to use programming models, have ineffiencies in data access. With Spark, you have an easy to use programming model, more efficiency, and built in interactivity. I have examples of using Spark in my last book Power Java. You can experiment with Spark on your laptop and only worry about accessing a cluster when you need to scale.

Saturday, January 23, 2016

Simple Haskell: using a sqlite3 database

I have been using Lisp languages professionally since the early 1980s. While I now use Java, Ruby, and Clojure for much of my work, I have been slowly been getting up to speed using Haskell over the last 5 years. My difficulties using Haskell are caused almost 100% when I need write impure Haskell code. This occasional discomfort is made up for by the fun and productivity of writing pure Haskell code. Using haskell-mode in Emacs I get the same happy feeling writing pure Haskell code that I used to get using Common Lisp, Scheme, and Clojure - and with the advantages of a strongly typed language!

I like to mock up test data and write the pure code first and then write impure code that needs to access the web, RDF data stores, relational databases, file IO, etc. For me, as a student of Haskell, this is the easiest way to write Haskell programs.

About 15 years ago, in one of my Java artificial intelligence books I wrote an example program that provides a natural language processing (NLP) interface to relational databases. I have decided that I would like to do the same, but in Haskell, and take advantage of what I have learned in the last 15 years. Writing the code to convert natural language queries into SQL queries is pure Haskell code (given mockup data for database metadata and sample table data, and test NLP queries) and I am enjoying working on that. Eventually I will need to write some impure code that accesses the popular databases. To make the initial development as easy as possible (a good idea since I may never totally finish this side project) I have decided that I will use sqlite and the sqlite-simple library. For the first proof of concept/prototype, I don't expect to need much impure code. A good thing!

This reminds me of a comment Erik Meijer made when he was teaching the edX functional programming class. He said that as developers we can think of pure Haskell code a being islands and impure code that has to maintain state and interact with the world as the ocean containing the islands. I like this metaphor!

I write little code snippets (or sometimes mini-projects) to experiment with nonpure Haskell code and the following listing, derived from the sqlite-simple library, contains the small experiments with the functionality that I need for now. I thought it was worth sharing in case this saves anyone else some time:

{-# LANGUAGE OverloadedStrings #-}
import Database.SQLite.Simple

   Create sqlite database:
     sqlite3 test.db "create table test (id integer primary key, str text);"

   This is derived from the example at github.com/nurpax/sqlite-simple

main :: IO ()
main = do
  conn <- open "test.db"
  -- start by getting table names in database:
    r <- query_ conn "SELECT name FROM sqlite_master WHERE type='table'" :: IO [(Only String)]
    print "Table names in database test.db:"
    mapM_ (print . fromOnly) r
  -- get the metadata for table test in test.db:
    r <- query_ conn "SELECT sql FROM sqlite_master WHERE type='table' and name='test'" :: IO [(Only String)]
    print "SQL to create table 'test' in database test.db:"
    mapM_ (print . fromOnly) r
  -- add a row to table 'test' and print out the rows in table 'test':
    execute conn "INSERT INTO test (str) VALUES (?)"
      (Only ("test string 2" :: String))
    r2 <- query_ conn "SELECT * from test" :: IO [(Int, String)]
    print "number of rows in table 'test':"
    print (length r2)
    print "rows in table 'test':"
    mapM_ print  r2
  close conn

Just to make this example complete, here is my stack.yaml file:

resolver: lts-4.0
packages: - '.'
extra-deps: []
flags: {}
And here is my sqlite.cabal file:
name:                sqlite
synopsis:            Experiment with sqlite-simple
description:         Derived from example in github.com/nurpax/sqlite-simple
homepage:            https://github.com/mark-watson?tab=repositories
license:             Apache2
license-file:        LICENSE
author:              Mark Watson
maintainer:          markw@markwatson.com
copyright:           2016 Mark Watson
category:            Web
build-type:          Simple
-- extra-source-files:
cabal-version:       >=1.10

executable test1
  hs-source-dirs:      .
  main-is:             test1.hs
  ghc-options:         -threaded -rtsopts -with-rtsopts=-N
  build-depends:       base
                     , sqlite-simple
  default-language:    Haskell2010

Here is a build and sample run (assuming that the sqlite database test.db has been created as per the comments in the first source listing):

✗ stack build
✗ stack exec test1
"Table names in database test.db:"
"SQL to create table 'test' in database test.db:"
"number of rows in table 'test':"
"rows in table 'test':"
(1,"test string 2")
(2,"test string 2")
(3,"test string 2")

I would like to thank Janne Hellsten for maintaining the sqlite-simple library and I would also like to thank the developers of stack. Using stack has solved most of my build issues with Haskell. Thanks!

Thursday, January 14, 2016

I will not vote for Hillary Clinton. I reject the "lesser of two evils" argument.

I believe that Hillary Clinton is in the pocket of Wall Street, a lacky by any definition. I also believe that she is, as Ralph Nader says, a poster child for the military industrial complex. I also don't like her close ties to agribusiness giant Monsanto and her advocacy for the industry's genetically modified crops.

I believe that our two party system is broken, almost never giving us a choice that matches the preferences of the electorate. Corporate news corporations favor Clinton over Bernie Sanders in subtle and unfair ways, basing so much of their slanted (as directed to the financial interests of the network owners) discussion in terms assuming Hillary Clinton will be the Democratic candidate and pushing the false narrative that Bernie Sanders has no chance of winning the general election.

Some of my friends who are Democrats believe that it is a mistake to not vote for whatever Democratic toadie the establishment runs. What if a Republican wins? Oh NOoos! The sky will fall.

I believe that the sky will fall on our representative democracy if people don't stand up to the political establishment and the corporations that their preferred candidates represent.