Publicising your Slack Bot through Slack Apps + the Add to Slack button

In my previous blog posts on Slack bots, I spoke about making bots both as a simple script and using incoming webhooks. We are now well versed in how to code up a Slack bot and tailor it to our needs.

Now that our Slack bot is made, how do we get it out to everyone?

Slack offers an amazing feature for exactly this: the Add to Slack button. Using this button, other teams can add your bot to their own team with a single click. It is one of the best ways of publicising a bot, because the button can be placed anywhere: your website, your README.md on GitHub, a blog post and so on.

Add to Slack, however, works with OAuth, so that only authorised teams can install your bot. Also, to distribute your bot, you will have to package it as a Slack app. So, let's get started!

First off, we’ll make a Slack app:

1. Log in to your team on Slack.
2. Go to the Apps page here and click on “Create an App”.

3. Fill out the relevant details for your app. In particular, fill out the redirect_uri (it isn't compulsory at this stage, but you'll have to set it eventually) since it is needed for OAuth when other teams install your bot. Once the form is filled, click on Add App.

4. Go to the main page of your app, and under Bot Integrations, add your bot (I kept the name @susi; if you pick another name, you'll have to change the bot name in the code accordingly).

5. Go to App Credentials, and note down the client_id and the client_secret for reference. We need them for OAuth.

Don’t worry, we’ll handle the redirect_uri in a short while!

So the flow of this goes as follows:

1. When someone on another team clicks on the "Add to Slack" button, they are led to a page where they have to confirm that a bot is being added to their team, and then click on "Authorize".

2. When they click on Authorize, Slack generates a code, appends it to the redirect_uri as a GET parameter, and redirects them there.

3. Your redirect_uri needs to take in this code and then send the client_id, client_secret and the code as GET parameters to https://slack.com/api/oauth.access, so that your OAuth request is verified.

4. Once the request is verified and the parameters match, the bot is successfully deployed onto the team. Additionally, a JSON is returned specifying the access_token for the bot you just deployed, as well as the incoming webhook URL (in case your code uses an incoming webhook). You then use this very access token and webhook URL to control the deployed bot; a trimmed example of the response is sketched below.
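The field names below follow the Slack OAuth documentation; all values here are placeholders:

{
    "ok": true,
    "access_token": "xoxp-XXXXXXXXXX-XXXXXXXXXX",
    "scope": "bot,incoming-webhook",
    "team_name": "Example Team",
    "team_id": "TXXXXXXXX",
    "incoming_webhook": {
        "url": "https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXX",
        "channel": "#general"
    },
    "bot": {
        "bot_user_id": "UXXXXXXXX",
        "bot_access_token": "xoxb-XXXXXXXXXX-XXXXXXXXXX"
    }
}

The bot.bot_access_token and incoming_webhook.url fields here are exactly what the code later in this post picks up.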

Let’s get started on implementing this then.

1. We first go to the Slack Button page. At the bottom of the page, under the section "Add to Slack Button", there is a box containing a custom embed snippet so that you can add the button to your website etc. (where people will click on it). There are three checkboxes there; check whichever ones your bot needs.


2. Once you have selected the scopes you need, you can embed the generated snippet into your website / README file. That's half the job done!
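The embed code Slack generates looks roughly like this (a sketch only: your client_id and scope list will differ, so copy the exact markup from the button page):

<a href="https://slack.com/oauth/authorize?scope=bot,incoming-webhook&client_id=YOUR_CLIENT_ID">
    <img alt="Add to Slack" height="40" width="139" src="https://platform.slack-edge.com/img/add_to_slack.png">
</a>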

Now let's dive into the code. We need to take in the code that's sent as a GET parameter to our redirect_uri. Once we have it, we send a GET request to https://slack.com/api/oauth.access with the client_id, client_secret and this code. If the bot is approved, we pick up the webhook URL / bot token and use them for the deployed bot so that it runs properly.

Here, the redirect_uri I'll use is the Slack deployment URL I have on Heroku (http://asksusisunode.herokuapp.com). I'll just create a route in Express, named '/slackbot', and get started from there. The entire process starts when you get the code as a GET parameter on the redirect_uri. So do the following:

1. Go to your Apps page on Slack and, under App Credentials, add http://yourherokuurl.com/slackbot (or whatever URL you have) as the redirect_uri. I used http://asksusisunode.herokuapp.com/slackbot as the redirect_uri.

2. Let’s dive into the code now. Below is the final code that handles the Add to Slack button:


'use strict';
/* global require, process, console */

var express = require('express');
var bodyParser = require('body-parser');
var request = require('request');
var SlackBot = require('slackbots');
var http = require("http");
var Slack = require('node-slackr');
var app = express();
var custom_slack_token;
var slack_token = process.env.SLACK_TOKEN;
var payload;
var payload_url = process.env.PAYLOAD_URL; //this is just the webhook URL
var custom_payload_url;
var slack;

var slack_code;
var client_id = process.env.CLIENT_ID;
var client_secret = process.env.CLIENT_SECRET;

app.set('port', (process.env.PORT || 5000));

app.use(bodyParser.urlencoded({extended: false}));

app.use(bodyParser.json());

app.get('/', function (req, res) {
	res.send('Susi says Hello.');
});

app.get('/slackbot', function(req, res) {
	slack_code = req.query.code; //getting the code GET parameter (req.param() is deprecated in Express 4)
	//note: this URL contains the client secret, so avoid logging it
	var queryurl = 'https://slack.com/api/oauth.access?client_id='+client_id+'&client_secret='+client_secret+'&code='+slack_code;
	request(queryurl, {json:true}, function(error, response, body) { //we get a JSON response
		if(!error && response.statusCode == 200 && body.ok){ //i.e. if the bot has been installed
			//take in the bot token and webhook URL of the newly deployed bot
			custom_slack_token = body.bot.bot_access_token;
			custom_payload_url = body.incoming_webhook.url;
			console.log(body);
			res.send('Susi has been installed to your team!');
		} else {
			res.send('Could not install');
		}
	});
});

function slackbot(){

	//ping the Heroku deployment every 30 minutes so that the dyno does not idle
	setInterval(function() {
		http.get("http://asksusisunode.herokuapp.com");
	}, 1800000);

	//if a team has installed the bot via OAuth, use its token and webhook URL instead of ours
	if (custom_slack_token && custom_payload_url){
		slack_token = custom_slack_token;
		payload_url = custom_payload_url;
	}
	var slack_bot = new SlackBot({
		token: slack_token,
		name: 'susi'
	});

	slack = new Slack(payload_url);

	slack_bot.on('message', function(data){
		var slackdata = data;
		var msg, channel, output, user;
		if(Object.keys(slackdata).length > 0){
			if('text' in slackdata && slackdata['username'] != 'susi'){
				msg = data['text'];
				channel = data['channel'];
			}
			else {
				msg = null;
				channel = null;
			}
		}
		if(msg != null && channel != null){
			var botid = '<@U1UK6DANT>'; //need to change: this id is specific to my team's bot
			if (msg.split(" ")[0] != botid){
				//do nothing: the message was not addressed to the bot
			} else {
				var apiurl = 'http://loklak.org/api/susi.json?q=' + msg;
				var payload;
				request(apiurl, function (error, response, body) {
					if (!error && response.statusCode === 200) {
						var data = JSON.parse(body);
						if(data.answers[0].actions.length == 1){
							var susiresponse = data.answers[0].actions[0].expression;
							payload = {
								text: susiresponse,
								channel: channel
							};
							slack.notify(payload);
						} else if(data.answers[0].actions.length == 2 && data.answers[0].actions[1].type == "table"){
							payload = {
								text: data.answers[0].actions[0].expression + " (" + data.answers[0].data.length + " results)",
								channel: channel
							};
							slack.notify(payload);
							for(var i = 0; i < data.answers[0].data.length; ++i){
								var response = data.answers[0].data[i];
								var ansstring = "";
								for(var resp in response){
									ansstring += (resp + ": " + response[resp] + ", ");
								}
								payload = {
									text: ansstring,
									channel: channel
								};
								slack.notify(payload);
							}
						}
					}
				});
			}
		}
	});
}

// Getting Susi up and running.
app.listen(app.get('port'), function() {
	console.log('running on port', app.get('port'));
	slackbot();
});

There's just one small shortcoming: the bot id hardcoded above won't be the same for a bot deployed onto another team, so there can be cases where you message the bot but it does not reply. We need to figure out the bot id at runtime instead, and we're in the process of fixing this. The bot will definitely be installed into your team; it's just that in some cases it won't reply and will stay "Away".
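One possible fix (a sketch only, not yet part of the code above) is to ask Slack for the deployed bot's own user id at startup, using the auth.test Web API method, and build the mention prefix from that instead of hardcoding it:

//sketch: fetch the deployed bot's user id instead of hardcoding it
function fetchBotId(token, callback) {
	request('https://slack.com/api/auth.test?token=' + token, {json: true}, function(error, response, body) {
		if (!error && response.statusCode === 200 && body.ok) {
			callback('<@' + body.user_id + '>'); //e.g. '<@U1UK6DANT>'
		} else {
			callback(null);
		}
	});
}

You would call fetchBotId(slack_token, ...) once inside slackbot() and use the returned value as botid in the message handler.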

See? It was as simple as adding another route in Express, and the awesome request package does the rest. Your bot will successfully be added to other teams as a result, and anyone can use it. 🙂

Apart from publicising using the Add to Slack button, you can additionally publicise your app on the Slack Apps directory by going here and filling out the form.

So now we know how to make a Slack bot from scratch using two different methods, and how to effectively publicise it. This is another great way by which Susi will be publicised, so that more people can use it. Amazing, right?

By the way, please go to https://github.com/fossasia/asksusi_messengers and add more bots for Susi there. We wish to add Susi to as many platforms as possible. We really value your contributions 🙂

So that’s it for today! Feedback is welcome as always 🙂 See you later!


Making Slack Chatbots using Incoming Webhooks + The Idling problem

The last time I spoke about chatbots, I spoke about the need to increase Susi's reach, how Slack is a great platform because of how it works within teams, and how to make a Slack bot yourself.

However, if you look at the code snippet I posted in that blog post, you'll see that the Slack bot I had was just a Python script, while the rest of the index.js code (which contains the Messenger and Telegram bots) is an Express application. We basically just use a package (slackbots, if you remember) which takes in your Slack token and POSTs to the Slack interface. Also, that is a custom bot: it is only usable by us right now, and we need to distribute it (which we do using Slack apps; we'll talk about that later).

Today, I’ll be describing another method of making Slackbots: using Incoming Webhooks.

An incoming webhook is a mechanism by which you don't POST directly to the Slack interface; instead, you POST to a webhook URL generated by Slack. It is a very convenient way of posting messages from external sources into Slack. Moreover, when you distribute your Slack bot, you can distribute your webhook separately, so that your reach can increase even more (we'll talk about distribution and OAuth in the next blog post). Incoming webhooks are seamlessly integrated within Slack apps, so that your Slack bot can be distributed efficiently.

So let’s get started. To create an Incoming webhook integration:

1. Go to the Incoming Webhook Integration page here.

2. Fill in the details and select the channel you wish to post to.

3. Save the webhook URL for reference. We’ll need it.

Incoming webhooks work with a payload. A payload is a JSON object which contains all the information about the message (text, emoji, files etc.). A normal payload looks like:

payload={"text":"This is a line of text.\nAnd this is another one."}
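A payload can also set the destination channel and the bot's display identity; these fields come from the incoming-webhook documentation (the values here are just placeholders):

payload={"text": "Hello from Susi!", "channel": "#general", "username": "susi", "icon_emoji": ":robot_face:"}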

Now all we need to do is POST our message, as a payload, to this URL instead of posting directly to Slack. For easily handling payloads, we use a library named node-slackr. You can install it as follows:

npm install --save node-slackr

To post a payload to the URL, we first instantiate the node-slackr object using our webhook URL:

var slack = new Slack(webhook_url);

When we have the payload ready, all we need to do to POST it to the webhook is:

slack.notify(payload);
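Putting these pieces together, here is a minimal standalone sketch (assuming WEBHOOK_URL is set in your environment to the webhook URL you saved earlier):

'use strict';
var Slack = require('node-slackr');

var webhook_url = process.env.WEBHOOK_URL; //e.g. https://hooks.slack.com/services/T000/B000/XXXX
var slack = new Slack(webhook_url);

var payload = {
	text: 'Hello from Susi!',
	channel: '#general'
};

//node-slackr POSTs the payload to the webhook and calls back with the result
slack.notify(payload, function(err, result) {
	console.log(err, result);
});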

So here's the final modified code that's used for making bots using incoming webhooks. We just make a few changes to the original bot code from my last post on Slack bots:


'use strict';
/* global require, process, console */

var express = require('express');
var bodyParser = require('body-parser');
var request = require('request');
var SlackBot = require('slackbots');
var Slack = require('node-slackr');
var http = require('http'); //needed for the keep-alive ping below
var app = express();
var slack_token = process.env.SLACK_TOKEN;
var webhook_url = process.env.WEBHOOK_URL;
var heroku_url = process.env.HEROKU_URL;
var slack;

app.set('port', (process.env.PORT || 5000));

app.use(bodyParser.urlencoded({extended: false}));

app.use(bodyParser.json());

app.get('/', function (req, res) {
	res.send('Susi says Hello.');
});

function slackbot(){

	//ping the Heroku deployment every 30 minutes so that the dyno does not idle
	setInterval(function() {
		http.get(heroku_url);
	}, 1800000);

	var slack_bot = new SlackBot({
		token: slack_token,
		name: 'susi'
	});

	slack = new Slack(webhook_url); //instantiate node-slackr with our webhook URL

	slack_bot.on('message', function(data){
		var slackdata = data;
		var msg, channel, output, user;
		if(Object.keys(slackdata).length > 0){
			if('text' in slackdata && slackdata['username'] != 'susi'){
				msg = data['text'];
				channel = data['channel'];
			}
			else {
				msg = null;
				channel = null;
			}
		}
		if(msg != null && channel != null){
			var botid = ':'; //placeholder: set this to your bot's mention prefix
			if (msg.split(" ")[0] != botid){
				//do nothing: the message was not addressed to the bot
			} else {
				var apiurl = 'http://loklak.org/api/susi.json?q=' + msg;
				var payload;
				request(apiurl, function (error, response, body) {
					if (!error && response.statusCode === 200) {
						var data = JSON.parse(body);
						if(data.answers[0].actions.length == 1){
							var susiresponse = data.answers[0].actions[0].expression;
							payload = {
								text: susiresponse,
								channel: channel
							};
							slack.notify(payload);
						} else if(data.answers[0].actions.length == 2 && data.answers[0].actions[1].type == "table"){
							payload = {
								text: data.answers[0].actions[0].expression + " (" + data.answers[0].data.length + " results)",
								channel: channel
							};
							slack.notify(payload);
							for(var i = 0; i < data.answers[0].data.length; ++i){
								var response = data.answers[0].data[i];
								var ansstring = "";
								for(var resp in response){
									ansstring += (resp + ": " + response[resp] + ", ");
								}
								payload = {
									text: ansstring,
									channel: channel
								};
								slack.notify(payload);
							}
						}
					}
				});
			}
		}
	});
}

// Getting Susi up and running.
app.listen(app.get('port'), function() {
	console.log('running on port', app.get('port'));
	slackbot();
});

All we did was store the webhook URL as an environment variable, use that, and call slack.notify. Also, I invoke the slackbot function inside app.listen's callback, so that it starts as soon as the app starts and stays alive.

But here comes another problem: we use Heroku dynos for deployment, and Heroku dynos have a sleep period of 6 hours. During those 6 hours, the bot would just be idle and would not work. We wish to circumvent this.

There are three ways of doing so. One way is to install Heroku's newrelic plugin and use it (you can read more about it here). The second way is to simply use Kaffeine, which pings your Heroku URL every 30 minutes so that the bot stays alive.

Or you can solve it programmatically as well. Look at the code snippet above and notice:


setInterval(function() {
	http.get(heroku_url);
}, 1800000);

We're basically pinging the Heroku URL (again stored as an environment variable) every 1800000 milliseconds, i.e. every 30 minutes. This is a more convenient approach to solving the idling problem, too.

So now we know how to make our bot using two different methods, and how to solve the idling problem. To come full circle, in the next blog post I will talk about distributing your bot and how people can get to know about it. Feedback is welcome as always 🙂


Monetisation of Susi using the Amazon Product API (Part 2)

In my previous blog post, I covered the semantics of the Amazon Product Advertising API and how the monetisation works. Today, let's jump into the code and its relevance to Susi.

We have seen that the Amazon Product Advertising API is a SOAP API. A query to the API goes like this:

http://webservices.amazon.com/onca/xml?Service=AWSECommerceService&AWSAccessKeyId=[AWS Access Key ID]&AssociateTag=[Associate ID]&Operation=ItemSearch&Keywords=the%20hunger%20games&SearchIndex=Books&Timestamp=[YYYY-MM-DDThh:mm:ssZ]&Signature=[Request Signature]

We supply to it the Operation (ItemSearch, ItemLookup etc.; the full list is here), the Keywords to look for (a search keyword, or an ASIN if the operation is ItemLookup, i.e. search by ID), the Timestamp, the Signature (a base64-encoded HMAC) and, of course, the tags. Now we need to implement this in a real Java program, but the SOAP nature of the API could obviously cause some inconveniences.

Thankfully, Amazon provides a request-signing code snippet which people can use directly. It takes in a URL like the one above, generates the timestamp, and signs the query with the Access Key ID, Associate Tag and the other params using the HMAC algorithm (base64-encoded). Here is the code (SignedRequestsHelper.java):


/**********************************************************************************************
 * Copyright 2009 Amazon.com, Inc. or its affiliates. All Rights Reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file 
 * except in compliance with the License. A copy of the License is located at
 *
 *       http://aws.amazon.com/apache2.0/
 *
 * or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS"
 * BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under the License. 
 *
 * ********************************************************************************************
 *
 *  Amazon Product Advertising API
 *  Signed Requests Sample Code
 *
 *  API Version: 2009-03-31
 *
 */

package org.loklak.api.amazon;

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Base64;
import java.util.Calendar;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TimeZone;
import java.util.TreeMap;

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/**
 * This class contains all the logic for signing requests to the Amazon Product
 * Advertising API.
 */
public class SignedRequestsHelper {
	/**
	 * All strings are handled as UTF-8
	 */
	private static final String UTF8_CHARSET = "UTF-8";

	/**
	 * The HMAC algorithm required by Amazon
	 */
	private static final String HMAC_SHA256_ALGORITHM = "HmacSHA256";

	/**
	 * This is the URI for the service, don't change unless you really know what
	 * you're doing.
	 */
	private static final String REQUEST_URI = "/onca/xml";

	/**
	 * The sample uses HTTP GET to fetch the response. If you changed the sample
	 * to use HTTP POST instead, change the value below to POST.
	 */
	private static final String REQUEST_METHOD = "GET";

	private String endpoint = null;
	private String awsAccessKeyId = null;
	private String awsSecretKey = null;
	private String associatetag = null;
	private SecretKeySpec secretKeySpec = null;
	private Mac mac = null;

	/**
	 * You must provide the three values below to initialize the helper.
	 * 
	 * @param endpoint
	 *            Destination for the requests.
	 * @param awsAccessKeyId
	 *            Your AWS Access Key ID
	 * @param awsSecretKey
	 *            Your AWS Secret Key
	 */
	public static SignedRequestsHelper getInstance(String endpoint, String awsAccessKeyId, String awsSecretKey,
			String associatetag) throws IllegalArgumentException, UnsupportedEncodingException,
			NoSuchAlgorithmException, InvalidKeyException {
		if (null == endpoint || endpoint.length() == 0) {
			throw new IllegalArgumentException("endpoint is null or empty");
		}
		if (null == awsAccessKeyId || awsAccessKeyId.length() == 0) {
			throw new IllegalArgumentException("awsAccessKeyId is null or empty");
		}
		if (null == awsSecretKey || awsSecretKey.length() == 0) {
			throw new IllegalArgumentException("awsSecretKey is null or empty");
		}

		if (null == associatetag || associatetag.length() == 0) {
			throw new IllegalArgumentException("associatetag is null or empty");
		}

		SignedRequestsHelper instance = new SignedRequestsHelper();
		instance.endpoint = endpoint.toLowerCase();
		instance.awsAccessKeyId = awsAccessKeyId;
		instance.awsSecretKey = awsSecretKey;
		instance.associatetag = associatetag;

		byte[] secretyKeyBytes = instance.awsSecretKey.getBytes(UTF8_CHARSET);
		instance.secretKeySpec = new SecretKeySpec(secretyKeyBytes, HMAC_SHA256_ALGORITHM);
		instance.mac = Mac.getInstance(HMAC_SHA256_ALGORITHM);
		instance.mac.init(instance.secretKeySpec);

		return instance;
	}

	/**
	 * The construct is private since we'd rather use getInstance()
	 */
	private SignedRequestsHelper() {
	}

	/**
	 * This method signs requests in hashmap form. It returns a URL that should
	 * be used to fetch the response. The URL returned should not be modified in
	 * any way, doing so will invalidate the signature and Amazon will reject
	 * the request.
	 */
	public String sign(Map params) {
		// Let's add the AWSAccessKeyId, AssociateTag and Timestamp parameters
		// to the request.
		params.put("AWSAccessKeyId", this.awsAccessKeyId);
		params.put("AssociateTag", this.associatetag);
		params.put("Timestamp", this.timestamp());

		// The parameters need to be processed in lexicographical order, so
		// we'll
		// use a TreeMap implementation for that.
		SortedMap sortedParamMap = new TreeMap(params);

		// get the canonical form of the query string
		String canonicalQS = this.canonicalize(sortedParamMap);

		// create the string upon which the signature is calculated
		String toSign = REQUEST_METHOD + "\n" + this.endpoint + "\n" + REQUEST_URI + "\n" + canonicalQS;

		// get the signature
		String hmac = this.hmac(toSign);
		String sig = this.percentEncodeRfc3986(hmac);

		// construct the URL
		String url = "http://" + this.endpoint + REQUEST_URI + "?" + canonicalQS + "&Signature=" + sig;

		return url;
	}

	/**
	 * This method signs requests in query-string form. It returns a URL that
	 * should be used to fetch the response. The URL returned should not be
	 * modified in any way, doing so will invalidate the signature and Amazon
	 * will reject the request.
	 */
	public String sign(String queryString) {
		// let's break the query string into its constituent name-value pairs
		Map params = this.createParameterMap(queryString);

		// then we can sign the request as before
		return this.sign(params);
	}

	/**
	 * Compute the HMAC.
	 * 
	 * @param stringToSign
	 *            String to compute the HMAC over.
	 * @return base64-encoded hmac value.
	 */
	private String hmac(String stringToSign) {
		String signature = null;
		byte[] data;
		byte[] rawHmac;
		try {
			data = stringToSign.getBytes(UTF8_CHARSET);
			rawHmac = mac.doFinal(data);
			signature = Base64.getEncoder().encodeToString(rawHmac);
		} catch (UnsupportedEncodingException e) {
			throw new RuntimeException(UTF8_CHARSET + " is unsupported!", e);
		}
		return signature;
	}

	/**
	 * Generate a ISO-8601 format timestamp as required by Amazon.
	 * 
	 * @return ISO-8601 format timestamp.
	 */
	private String timestamp() {
		String timestamp = null;
		Calendar cal = Calendar.getInstance();
		DateFormat dfm = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
		dfm.setTimeZone(TimeZone.getTimeZone("GMT"));
		timestamp = dfm.format(cal.getTime());
		return timestamp;
	}

	/**
	 * Canonicalize the query string as required by Amazon.
	 * 
	 * @param sortedParamMap
	 *            Parameter name-value pairs in lexicographical order.
	 * @return Canonical form of query string.
	 */
	private String canonicalize(SortedMap sortedParamMap) {
		if (sortedParamMap.isEmpty()) {
			return "";
		}

		StringBuffer buffer = new StringBuffer();
		Iterator<Map.Entry> iter = sortedParamMap.entrySet().iterator();

		while (iter.hasNext()) {
			Map.Entry kvpair = iter.next();
			buffer.append(percentEncodeRfc3986(kvpair.getKey()));
			buffer.append("=");
			buffer.append(percentEncodeRfc3986(kvpair.getValue()));
			if (iter.hasNext()) {
				buffer.append("&");
			}
		}
		String cannoical = buffer.toString();
		return cannoical;
	}

	/**
	 * Percent-encode values according the RFC 3986. The built-in Java
	 * URLEncoder does not encode according to the RFC, so we make the extra
	 * replacements.
	 * 
	 * @param s
	 *            decoded string
	 * @return encoded string per RFC 3986
	 */
	private String percentEncodeRfc3986(String s) {
		String out;
		try {
			out = URLEncoder.encode(s, UTF8_CHARSET).replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
		} catch (UnsupportedEncodingException e) {
			out = s;
		}
		return out;
	}

	/**
	 * Takes a query string, separates the constituent name-value pairs and
	 * stores them in a hashmap.
	 * 
	 * @param queryString
	 * @return
	 */
	private Map createParameterMap(String queryString) {
		Map map = new HashMap();
		String[] pairs = queryString.split("&");

		for (String pair : pairs) {
			if (pair.length() < 1) {
				continue;
			}

			String[] tokens = pair.split("=", 2);
			for (int j = 0; j < tokens.length; j++) {
				try {
					tokens[j] = URLDecoder.decode(tokens[j], UTF8_CHARSET);
				} catch (UnsupportedEncodingException e) {
				}
			}
			switch (tokens.length) {
			case 1: {
				if (pair.charAt(0) == '=') {
					map.put("", tokens[0]);
				} else {
					map.put(tokens[0], "");
				}
				break;
			}
			case 2: {
				map.put(tokens[0], tokens[1]);
				break;
			}
			default: {
				// nothing
				break;
			}
			}
		}
		return map;
	}
}

Now things become a whole lot easier: we can sign our requests using this class, making them authenticated, and get the result straight away.

Now we need to figure out what we should get from the API. My idea was to use the Large ResponseGroup by default, so that we get all the possible info (the Large ResponseGroup encapsulates all the other ResponseGroups). We should also enable searching both by ASIN and by product name so that the API is flexible and gives proper results, i.e. I had to implement both the ItemLookup and ItemSearch operations. Also, I added an option to choose your own ResponseGroup, so that you can select exactly which data, and how much of it, you want in the result.

So here is the code of the AmazonProductService, which enables Susi monetisation:


/**
 *  AmazonProductService
 *  Copyright 05.08.2016 by Shiven Mian, @shivenmian
 *
 *  This library is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU Lesser General Public
 *  License as published by the Free Software Foundation; either
 *  version 2.1 of the License, or (at your option) any later version.
 *  
 *  This library is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *  Lesser General Public License for more details.
 *  
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program in the file lgpl21.txt
 *  If not, see <http://www.gnu.org/licenses/>.
 */

package org.loklak.api.amazon;

import java.io.StringWriter;

import javax.servlet.http.HttpServletResponse;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.json.JSONObject;
import org.json.XML;
import org.loklak.data.DAO;
import org.loklak.server.APIException;
import org.loklak.server.APIHandler;
import org.loklak.server.AbstractAPIHandler;
import org.loklak.server.Authorization;
import org.loklak.server.BaseUserRole;
import org.loklak.server.Query;
import org.loklak.tools.storage.JSONObjectWithDefault;
import org.w3c.dom.Document;

public class AmazonProductService extends AbstractAPIHandler implements APIHandler {

	private static final long serialVersionUID = 2279773523424505716L;

	// set your key configuration in config.properties under the Amazon API
	// Settings field
	private static final String AWS_ACCESS_KEY_ID = DAO.getConfig("aws_access_key_id", "randomxyz");
	private static final String AWS_SECRET_KEY = DAO.getConfig("aws_secret_key", "randomxyz");
	private static final String ASSOCIATE_TAG = DAO.getConfig("aws_associate_tag", "randomxyz");

	// using the USA locale
	private static final String ENDPOINT = "webservices.amazon.com";

	@Override
	public String getAPIPath() {
		return "/cms/amazonservice.json";
	}

	@Override
	public BaseUserRole getMinimalBaseUserRole() {
		return BaseUserRole.ANONYMOUS;
	}

	@Override
	public JSONObject getDefaultPermissions(BaseUserRole baseUserRole) {
		return null;
	}

	public static JSONObject fetchResults(String requestUrl, String operation) {
		JSONObject itemlookup = new JSONObject(true);
		try {
			DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
			DocumentBuilder db = dbf.newDocumentBuilder();
			Document doc = db.parse(requestUrl);
			DOMSource domSource = new DOMSource(doc);
			StringWriter writer = new StringWriter();
			StreamResult result = new StreamResult(writer);
			TransformerFactory tf = TransformerFactory.newInstance();
			Transformer transformer = tf.newTransformer();
			transformer.transform(domSource, result);
			JSONObject xmlresult = new JSONObject(true);
			xmlresult = XML.toJSONObject(writer.toString());
			JSONObject items = xmlresult.getJSONObject(operation).getJSONObject("Items");
			if (items.getJSONObject("Request").has("Errors")) {
				itemlookup.put("status", "error");
				itemlookup.put("reason",
						items.getJSONObject("Request").getJSONObject("Errors").getJSONObject("Error").get("Message"));
				return itemlookup;
			}
			itemlookup.put("number_of_items",
					(operation.equals("ItemLookupResponse") ? "1" : (items.getJSONArray("Item").length())));
			itemlookup.put("list_of_items", items);
		} catch (Exception e) {
			itemlookup.put("status", "error");
			itemlookup.put("reason", e);
			return itemlookup;
		}
		return itemlookup;
	}

	@Override
	public JSONObject serviceImpl(Query call, HttpServletResponse response, Authorization rights,
			JSONObjectWithDefault permissions) throws APIException {
		String ITEM_ID = call.get("id", "");
		String PRODUCT_NAME = call.get("q", "");
		// compare strings with equals(), not == / !=; default to the Large ResponseGroup
		String responsegroup = (!"".equals(call.get("response_group", "")) ? call.get("response_group", "") : "Large");
		if (!("".equals(ITEM_ID))) {
			return itemLookup(ITEM_ID, responsegroup);
		} else if (!("".equals(PRODUCT_NAME))) {
			return itemSearch(PRODUCT_NAME, responsegroup);
		} else {
			return new JSONObject().put("error", "no parameters given");
		}
	}

	public JSONObject itemSearch(String query, String responsegroup) {
		JSONObject result = new JSONObject(true);
		SignedRequestsHelper helper;
		if (query.length() == 0 || "".equals(query)) {
			result.put("error", "Please specify a query to search");
			return result;
		}
		try {
			helper = SignedRequestsHelper.getInstance(ENDPOINT, AWS_ACCESS_KEY_ID, AWS_SECRET_KEY, ASSOCIATE_TAG);
		} catch (Exception e) {
			result.put("error", e.toString());
			return result;
		}
		String requestUrl = null;
		String queryString = "Service=AWSECommerceService&ResponseGroup=" + responsegroup
				+ "&Operation=ItemSearch&Keywords=" + query + "&SearchIndex=All";
		requestUrl = helper.sign(queryString);
		result = fetchResults(requestUrl, "ItemSearchResponse");
		return result;
	}

	public JSONObject itemLookup(String asin, String responsegroup) {
		SignedRequestsHelper helper;
		JSONObject result = new JSONObject(true);
		if (asin.length() == 0 || "".equals(asin)) {
			result.put("error", "Please specify an Item ID");
			return result;
		}

		try {
			helper = SignedRequestsHelper.getInstance(ENDPOINT, AWS_ACCESS_KEY_ID, AWS_SECRET_KEY, ASSOCIATE_TAG);
		} catch (Exception e) {
			result.put("error", e.toString());
			return result;
		}
		String requestUrl = null;
		String queryString = "Service=AWSECommerceService&ResponseGroup=" + responsegroup
				+ "&Operation=ItemLookup&ItemId=" + asin;
		requestUrl = helper.sign(queryString);
		result = fetchResults(requestUrl, "ItemLookupResponse");
		return result;
	}

}

As you can see in this code, I take in the parameters (either q or id, plus an optional response_group), and depending on which param is given, I decide whether to use the ItemLookup or the ItemSearch operation (only these two are really relevant for Susi as of now). The ResponseGroup is defaulted to Large, so even if you omit the response_group param, you still get all the data. What next? I just build the query, sign it using the SignedRequestsHelper (note: the associate tag and the keys are in the config file, as mentioned in my last blog post), and then parse the returned XML and return it as JSON.

We are yet to get this into Susi (in the form of questions), but that will be up soon. Susi can simply be monetised by sending the URL (which contains our associate tag) along with the result, so that a person can go to the URL; we then get hits on it, for which we get paid through the Affiliates Program. For now, we have seen how we intend the API to work. Since the Product Advertising API is huge, we can always make this service more efficient and expand it, which is a future plan too.

Feedback, as always, is welcome. 🙂


Monetisation of Susi using the Amazon Product API (Part 1)

I’ve worked with the loklak team on Susi for the past month, and we’re in that stage where we are mostly expanding the dataset we have. Susi is still nascent, mind you, but it does show a lot of promise in terms of the idea and the progress we have made.

In my past few posts, I covered OSM analysis and its integration into Susi, as well as bot integration, and I also spoke about the need for Susi to increase its reach (which was the purpose of the bot integration). For this purpose, and for the general goal of making Susi able to answer more kinds of queries, I dug around a lot of APIs and came across the Amazon Product Advertising API, which answers both the reach and the dataset questions.

Through the Amazon API, we can get a wide (really wide) range of information about the products in its database. Along with ItemSearch (search for an item by name) and ItemLookup (search for an item by its Amazon ID, known as an ASIN), there are a host of other operations: SimilarityLookup, BrowseNodes, even virtual carts (wherein you can add items to a remote virtual cart and get prices etc.). And here comes the best thing: since you use the API with your affiliate/associate tag and secret keys, if someone buys through a URL which is marked with your affiliate tag, you get paid for it.

So clearly, we can expand our dataset as well as earn an income by using this API, making it suitable for solving both the reach and the dataset problems. I will explain the usage of this API, as well as its integration into Susi, in today's and two more blog posts. Today, let's go through the structure and the operations of this API.

The Amazon API is a SOAP-based API, which means we get the information as XML. To access the API, Amazon has an authentication requirement: it gives you a set of API keys (through AWS), namely an AWS secret key and an AWS Access Key ID. In addition, you also need to apply for an associate tag through the Amazon Affiliates Program. The reason for that is that the API gives out URLs marked with your associate tag, and as mentioned above, you earn income after a sufficient number of hits on those URLs. More can be seen here.

Once we have those keys, we decide the Operation that we need to perform: ItemLookup, ItemSearch etc. (full list of operations here). Once that is done, we decide the Response Group. The Response Group defines what data that operation returns and in what form. This makes it very convenient for users to get exactly what they want, which makes it even more ideal for Susi. More about Response Groups here.
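For instance, one or more Response Groups can be named directly in the request query; ItemAttributes and Offers are two of the standard groups (this fragment is just an illustrative sketch of the parameter):

...&Operation=ItemSearch&Keywords=the%20hunger%20games&ResponseGroup=ItemAttributes,Offers&...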

So what do we do with all this data? How do we even get the response from the API? That's the fun part. Let us take an example with the ItemSearch operation: we will get the Amazon data, i.e. a list of products matching or similar to "the hunger games", for example.

Since the Amazon API is SOAP-based, we need to build up the API request URL ourselves. We first decide on a locale; the Amazon locale is determined by your AWS settings. Since I set up my AWS account with a USA locale, I use the USA endpoint of Amazon, namely webservices.amazon.com.

Once this is done we supply the following GET parameters:

1. Service (for most operations we use Service as AWSECommerceService)
2. AWS Access ID
3. AWS Associate Tag
4. Operation (in this case ItemSearch)
5. Keywords (i.e. the query, in this case "the hunger games")
6. Search Index (this is ONLY used for ItemSearch; in this case we set the Search Index to Books, matching the sample request below. It is basically which category to search in, and it's a compulsory param for ItemSearch)

7. Response Group (optional; it defaults to Small)
8. Timestamp (time of request)
9. Signature (Amazon requires requests to be signed using an HMAC, so we need to supply a signature)

My implementation of the Amazon API uses a REST wrapper which uses standard Java utilities to generate the Signature, the Timestamp etc.; the rest has to be supplied by us. For now, let's see what a sample API request for the USA locale looks like:

http://webservices.amazon.com/onca/xml?Service=AWSECommerceService&AWSAccessKeyId=[AWS Access Key ID]&AssociateTag=[Associate ID]&Operation=ItemSearch&Keywords=the%20hunger%20games&SearchIndex=Books&Timestamp=[YYYY-MM-DDThh:mm:ssZ]&Signature=[Request Signature]
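To make the Signature parameter concrete: it is an HMAC-SHA256 over the request method, host, path and the lexicographically sorted query string, base64-encoded. Here is a minimal sketch of that computation in Node (chosen since the bot posts above use it; all credential values are placeholders):

'use strict';
var crypto = require('crypto');

var secretKey = 'YOUR_AWS_SECRET_KEY'; //placeholder
var endpoint = 'webservices.amazon.com';

//parameters must be sorted lexicographically before signing
var query = [
	'AWSAccessKeyId=YOUR_ACCESS_KEY_ID',
	'AssociateTag=YOUR_ASSOCIATE_TAG',
	'Keywords=the%20hunger%20games',
	'Operation=ItemSearch',
	'SearchIndex=Books',
	'Service=AWSECommerceService',
	'Timestamp=2016-08-29T12%3A00%3A00Z'
].join('&');

//canonical string: HTTP method, host, path, then the sorted query
var toSign = 'GET\n' + endpoint + '\n/onca/xml\n' + query;
var signature = crypto.createHmac('sha256', secretKey).update(toSign).digest('base64');

//percent-encode the signature and append it as the final parameter
console.log(query + '&Signature=' + encodeURIComponent(signature));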

I then used the w3c DOM to connect to the API, get the XML and parse it, but here's what a sample XML response looks like:


<TotalResults>2849</TotalResults>
<TotalPages>285</TotalPages>
<MoreSearchResultsUrl>http://www.amazon.com/gp/redirect.html?linkCode=xm2&SubscriptionId=[AWS Access Key ID]&location=http%3A%2F%2Fwww.amazon.com%2Fgp%2Fsearch%3Fkeywords%3Dthe%2Bhunger%2Bgames%26url%3Dsearch-alias%253Dstripbooks&tag=[Associate ID]&creative=386001&camp=2025</MoreSearchResultsUrl>
<Item>
    <ASIN>0545670314</ASIN>
    <DetailPageURL>http://www.amazon.com/The-Hunger-Games-Trilogy-Mockingjay/dp/0545670314%3FSubscriptionId%3D[AWS Access Key ID]%26tag%3D[Associate ID]%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3D0545670314</DetailPageURL>
    <ItemLinks>
        <ItemLink>
            <Description>Technical Details</Description>
            <URL>http://www.amazon.com/The-Hunger-Games-Trilogy-Mockingjay/dp/tech-data/0545670314%3FSubscriptionId%3D[AWS Access Key ID]%26tag%3D[Associate ID]%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0545670314</URL>
        </ItemLink>
        <ItemLink>
            <Description>Add To Baby Registry</Description>
            <URL>http://www.amazon.com/gp/registry/baby/add-item.html%3Fasin.0%3D0545670314%26SubscriptionId%3D[AWS Access Key ID]%26tag%3D[Associate ID]%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0545670314</URL>
        </ItemLink>
        <ItemLink>
            <Description>Add To Wedding Registry</Description>
            <URL>http://www.amazon.com/gp/registry/wedding/add-item.html%3Fasin.0%3D0545670314%26SubscriptionId%3D[AWS Access Key ID]%26tag%3D[Associate ID]%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0545670314</URL>
        </ItemLink>
        <ItemLink>
            <Description>Add To Wishlist</Description>
            <URL>http://www.amazon.com/gp/registry/wishlist/add-item.html%3Fasin.0%3D0545670314%26SubscriptionId%3D[AWS Access Key ID]%26tag%3D[Associate ID]%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0545670314</URL>
        </ItemLink>
        <ItemLink>
            <Description>Tell A Friend</Description>
            <URL>http://www.amazon.com/gp/pdp/taf/0545670314%3FSubscriptionId%3D[AWS Access Key ID]%26tag%3D[Associate ID]%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0545670314</URL>
        </ItemLink>
        <ItemLink>
            <Description>All Customer Reviews</Description>
            <URL>http://www.amazon.com/review/product/0545670314%3FSubscriptionId%3D[AWS Access Key ID]%26tag%3D[Associate ID]%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0545670314</URL>
        </ItemLink>
        <ItemLink>
            <Description>All Offers</Description>
            <URL>http://www.amazon.com/gp/offer-listing/0545670314%3FSubscriptionId%3D[AWS Access Key ID]%26tag%3D[Associate ID]%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0545670314</URL>
        </ItemLink>
    </ItemLinks>
    <ItemAttributes>
        <Author>Suzanne Collins</Author>
        <Manufacturer>Scholastic Press</Manufacturer>
        <ProductGroup>Book</ProductGroup>
        <Title>The Hunger Games Trilogy: The Hunger Games / Catching Fire / Mockingjay</Title>
    </ItemAttributes>
</Item>

One thing to notice is the URLs in the XML result: see how they have the associate tag and the Access Key ID embedded in them? This is how the monetisation happens: users buy products through links carrying our associate tag.

This is obviously just scratching the surface; the API itself has a whole lot of operations. I have given the necessary brief on how the API really works. In my next two blog posts, I will speak about building the Amazon API service for loklak, and then integrating it into Susi. Feedback is welcome. 🙂


Bot integrations of Susi on Online Social Media: Slack

In my past few posts, I have explained the use of Susi in detail. We have come to see Susi as an intelligent chatbot cum search engine which answers natural-language queries and has a large dataset to support it, thanks to the various sites we scrape, the APIs we integrate, and the additional services that we make (like the TwitterAnalysisService I talked about). All of these make Susi an excellent chat service.

So now the question comes up: how do we increase its reach?

This is where bot integration comes in. Services like Messenger (Facebook), Google Hangouts, Slack, Gitter etc. have a large number of users chatting on their platforms, and in addition they offer bot users. These bot users, when messaged with relevant queries, answer them for the user. We recently saw a very interesting example of this, when the White House used FB Messenger bots for people to reach out to President Obama (link). Bots give users quick, instant replies to specific queries, and bot integrations on these big platforms help more and more people connect with the bot and its maintainers too.

That is why we believed it would be amazing if Susi were integrated into these platforms as a bot, so that people realise all the things it is able to do. Now, we need to implement this.

As Sudheesh must have spoken about, we follow a system of maintaining all the bots in one index.js file: all of the bots post to different routes, and we deploy this file along with the npm requirements in the package.json so that all of them run concurrently. Keeping this in mind, I developed the Slack bot for Susi.

I actually developed the bot both in Python and Node, but we will only be using the Node version because of the ease of deployment. For those who wish to check out the Python code, head over here. The Slack API usage remains the same though.

The main part of our bot is the Slack RTM (Real Time Messaging) API. It basically reads all the ongoing conversation and reports every message in a specified format, like:


{
    "id": 1,
    "type": "message",
    "channel": "C024BE91L",
    "text": "Hello world"
}

Other parameters are also included, like username. More info on all the parameters can be found here.

So this is the API schema. We will be using an npm package called slackbots to implement our bot. slackbots gives an easy way of interfacing with the RTM API, so that we can focus on implementing the Susi API in the bot without having to worry much about the RTM. You can read the slackbots documentation here.

For making the bot, first go here, register your bot, and get the access token. We will need this token to make authorised requests to the API. To keep it in a secure place, store it as an environment variable:

export SLACK_TOKEN=<access token>

Now comes the main code. Create a new node project using npm init. Once the package.json is created, execute the following commands:


npm install --save request
npm install --save slackbots

This installs the slackbots and request packages in our project (note that the npm package is named request, not requests). We will need request to make a connection with the Susi API at http://loklak.org/api/susi.json.

Now we are all set to use slackbots and write our code. Make a new file index.js (add this to the package.json as well). Here's the code for our Slack bot:


'use strict';
/* global require, process, console */
var request = require('request');
var SlackBot = require('slackbots');
var slack_token = process.env.SLACK_TOKEN; //accessing the slack token from environment
var slack_bot = new SlackBot({
	token: slack_token, 
	name: 'susi'
});

slack_bot.on('message', function(data){
	var slackdata = data;
	var msg, channel, output, user;
	if(Object.keys(slackdata).length > 0){
		if('text' in slackdata && slackdata['username'] != 'susi'){
			msg = data['text'];
			channel = data['channel'];
		}
		else {
			msg = null;
			channel = null;
		}
	}
	if(msg != null && channel != null){
		var botid = ':'; //placeholder: set this to your bot's mention prefix, as reported by the RTM API
		if (msg.split(" ")[0] != botid){
			//do nothing: the message was not addressed to the bot
		} else {
			var apiurl = 'http://loklak.org/api/susi.json?q=' + msg;
			request(apiurl, function (error, response, body) {
				if (!error && response.statusCode === 200) {
					var data = JSON.parse(body);
					if(data.answers[0].actions.length == 1){
						var susiresponse = data.answers[0].actions[0].expression;
						slack_bot.postMessage(channel, susiresponse);
					} else if(data.answers[0].actions.length == 2 && data.answers[0].actions[1].type == "table"){
						slack_bot.postMessage(channel, data.answers[0].actions[0].expression + " (" + data.answers[0].data.length + " results)");
						for(var i = 0; i < data.answers[0].data.length; ++i){
							var response = data.answers[0].data[i];
							var ansstring = "";
							for(var resp in response){
								ansstring += (resp + ": " + response[resp] + ", ");
							}
							slack_bot.postMessage(channel, ansstring);
						}
					}
				}
			});
		}
	}
});

Let's go over this code bit by bit. We first instantiate SlackBot using our token. Then the line slack_bot.on('message', function(data){...}) hooks into the RTM API. For each message in the conversation, we check whether its JSON is empty or not. Also, our bot should only reply when a user asks; it should not reply to its own messages (RTM continuously reads input, so even the bot's replies come through it, and we don't want the bot to react to its own replies lest we get an infinite loop). This check is done through:


if(Object.keys(slackdata).length > 0){
	if('text' in slackdata && slackdata['username'] != 'susi'){
		msg = data['text'];
		channel = data['channel'];
	}
	else {
		msg = null;
		channel = null;
	}
}

We also get the text message and the channel to post the message into.

Next, we check for an empty message. If there is a message, we check whether it starts with the bot's mention, @susi: (my bot was named susi, and the bot id came from the RTM API itself; I hardcoded it). We should only query the Susi API when the message starts with @susi. Once that check passes, we query the Susi API; the response is in data.answers[0].actions[0].expression (except when it's a table, in which case we use data.answers[0].data). Once we have what we need to send, we use SlackBot's postMessage method and post the message to the channel. That's what the rest of the code does:


if(msg != null && channel != null){
	var botid = ':'; //placeholder: set this to your bot's mention prefix, as reported by the RTM API
	if (msg.split(" ")[0] != botid){
		//do nothing: the message was not addressed to the bot
	} else {
		var apiurl = 'http://loklak.org/api/susi.json?q=' + msg;
		request(apiurl, function (error, response, body) {
			if (!error && response.statusCode === 200) {
				var data = JSON.parse(body);
				if(data.answers[0].actions.length == 1){
					var susiresponse = data.answers[0].actions[0].expression;
					slack_bot.postMessage(channel, susiresponse);
				} else if(data.answers[0].actions.length == 2 && data.answers[0].actions[1].type == "table"){
					slack_bot.postMessage(channel, data.answers[0].actions[0].expression + " (" + data.answers[0].data.length + " results)");
					for(var i = 0; i < data.answers[0].data.length; ++i){
						var response = data.answers[0].data[i];
						var ansstring = "";
						for(var resp in response){
							ansstring += (resp + ": " + response[resp] + ", ");
						}
						slack_bot.postMessage(channel, ansstring);
					}
				}
			}
		});
	}
}
});
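To see why the code reads data.answers[0].actions[0].expression, here is a trimmed sketch of what a susi.json response looks like (structure inferred from the parsing above; the values are illustrative):

{
    "answers": [{
        "actions": [{
            "type": "answer",
            "expression": "Hello, I am Susi!"
        }],
        "data": []
    }]
}

For table answers, actions has a second entry of type "table" and the result rows sit in the data array.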

This completes the bot. When you start it up from your terminal using node index.js, or deploy it, it will work perfectly.

This can now be used by a wide range of people, and everyone can see all that Susi can do. 🙂

We are still in the process of making bots. FB, Telegram and Slack bots have been made till now, and we will be making more. Feedback, as usual, is welcome. 🙂


Social Media Analysis using Loklak (Part 3)

In my last two blog posts, I spoke about the TwitterAnalysis servlet and how the data actually comes into it through scraping methods. One thing was missing from TwitterAnalysis though: Susi integration, which I'll cover in this blog post.

Given that the TwitterAnalysis servlet is basically a social media profile analyser, we can definitely get a lot of useful statistics from it. As covered earlier, we get likes, retweets, hashtag statistics, sentiment analysis, frequency charts etc. Now, to get this working with Susi, we need to build queries which can use these statistics and give the user valuable information.

First off, the serviceImpl method needs to be changed to return a SusiThought object. SusiThought is a JSONObject which processes the query (does keyword extraction etc.), uses the APIs to get an answer to the query, and returns the answer along with the count of answers (in case of a table). SusiThought is what triggers the entire Susi mechanism, so the first step of the Susi integration is to convert TwitterAnalysis to return a SusiThought object:


@Override
public JSONObject serviceImpl(Query call, HttpServletResponse response, Authorization rights,
		JSONObjectWithDefault permissions) throws APIException {
	String username = call.get("screen_name", "");
	String count = call.get("count", "");
	TwitterAnalysisService.request = call.getRequest();
	return showAnalysis(username, count);
}

public static SusiThought showAnalysis(String username, String count) {

	//rest of the code as explained in the last blog post;
	//SusiThought is a JSONObject, so we simply copy-paste the serviceImpl code here

}

Once this is done, we write up the queries in the susi_cognition.

As you may have read in my last blog post, TwitterAnalysis gives the Twitter profile analysis of a user (it's basically statistics), so we can have a lot of queries regarding this. These are the rules I implemented; they are self-explanatory on reading the example fields:


{
			"keys"   :["tweet frequency", "tweets", "month"],
			"score"  :2000,
			"example": "How many tweets did melaniatrump post in May 2016",
			"phrases":[ {"type":"pattern", "expression":"* tweet frequency of * in *"},
				{"type":"pattern", "expression":"* tweets did * post in *"},
				{"type":"pattern", "expression":"* tweets did * post in the month of *"}
			],
			"process":[ {"type": "console", "expression": "SELECT yearwise[$3$] AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ tweeted $count$ times in $3$"
			]}]
		},
		{
			"keys"   :["tweet frequency", "tweets", "post", "at"],
			"score"  :2000,
			"example": "How many tweets did melaniatrump post at 6 PM",
			"phrases":[ {"type":"pattern", "expression":"* tweet frequency of * at *"},
				{"type":"pattern", "expression":"* tweets did * post at *"}
			],
			"process":[ {"type": "console", "expression": "SELECT hourwise[$3$] AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ tweeted $count$ times at $3$"
			]}]
		},
		{
			"keys"   :["tweet frequency", "tweets", "post", "on"],
			"score"  :2000,
			"example": "How many tweets did melaniatrump post on Saturdays",
			"phrases":[ {"type":"pattern", "expression":"* tweet frequency of * on *s"},
				{"type":"pattern", "expression":"* tweets did * post on *s"},
				{"type":"pattern", "expression":"* tweet frequency of * on *"},
				{"type":"pattern", "expression":"* tweets did * post on *"}
			],
			"process":[ {"type": "console", "expression": "SELECT daywise[$3$] AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ tweeted $count$ times on $3$"
			]}]
		},
		{
			"keys"   :["tweet frequency", "chart"],
			"score"  :2000,
			"example": "Show me the yearwise tweet frequency chart of melaniatrump",
			"phrases":[ {"type":"pattern", "expression":"* the * tweet frequency chart of *"}],
			"process":[ {"type": "console", "expression": "SELECT $2$ FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"This is the $2$ frequency chart of $3$"
			]}, {"type":"table"}]
		},
		{
			"keys"   :["tweet type", "post", "a"],
			"score"  :2000,
			"example": "How many times did melaniatrump post a video",
			"phrases":[ {"type":"pattern", "expression":"* did * post a *"}],
			"process":[ {"type": "console", "expression": "SELECT $3$ AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ posted a $3$ $count$ times"
			]}]
		},
		{
			"keys"   :["tweet activity", "likes", "count"],
			"example": "How many likes does melaniatrump have in all",
			"score"  :2000,
			"phrases":[ {"type":"pattern", "expression":"* likes does * have *"}
			],
			"process":[ {"type": "console", "expression": "SELECT likes_count AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ has $count$ likes till now"
			]}]
		},
		{
			"keys"   :["tweet activity", "likes", "maximum"],
			"example": "What is the maximum number of likes that melaniatrump got",
			"score"  :2000,
			"phrases":[ {"type":"pattern", "expression":"* maximum * likes that * got"}
			],
			"process":[ {"type": "console", "expression": "SELECT max_likes FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"Here you go"
			]}, {"type": "table"}]
		},
		{
			"keys"   :["tweet activity", "likes", "average"],
			"example": "What is the average number of likes that melaniatrump gets",
			"score"  :2000,
			"phrases":[ {"type":"pattern", "expression":"* average * likes that * gets"}
			],
			"process":[ {"type": "console", "expression": "SELECT average_number_of_likes AS count FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$3$ gets $count$ likes on an average"
			]}]
		},
		{
			"keys"   :["tweet activity", "likes", "frequency"],
			"score"  :2000,
			"example": "How many times did melaniatrump get 0 likes",
			"phrases":[ {"type":"pattern", "expression":"* * have * likes"},
				{"type":"pattern", "expression":"* * get * likes"}
			],
			"process":[ {"type": "console", "expression": "SELECT likes_chart[$3$] AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ got $3$ likes, $count$ times"
			]}]
		},
		{
			"keys"   :["tweet activity", "likes", "frequency", "chart"],
			"score"  :2000,
			"example": "Show me the likes frequency chart of melaniatrump",
			"phrases":[ {"type":"pattern", "expression":"* likes frequency chart * *"}
			],
			"process":[ {"type": "console", "expression": "SELECT likes_chart FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"Here is the likes frequency chart"
			]}, {"type": "table"}]
		},
		{
			"keys"   :["tweet activity", "retweets", "count"],
			"score"  :2000,
			"example": "How many retweets does melaniatrump have in all",
			"phrases":[ {"type":"pattern", "expression":"* retweets does * have *"}
			],
			"process":[ {"type": "console", "expression": "SELECT retweets_count AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ has $count$ retweets till now"
			]}]
		},
		{
			"keys"   :["tweet activity", "retweets", "maximum"],
			"score"  :2000,
			"example": "What is the maximum number of retweets that melaniatrump got",
			"phrases":[ {"type":"pattern", "expression":"* maximum * retweets that * got"}
			],
			"process":[ {"type": "console", "expression": "SELECT max_retweets FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"Here you go"
			]}, {"type": "table"}]
		},
		{
			"keys"   :["tweet activity", "retweets", "average"],
			"score"  :2000,
			"example": "What is the average number of retweets that melaniatrump gets",
			"phrases":[ {"type":"pattern", "expression":"* average * retweets that * gets"}
			],
			"process":[ {"type": "console", "expression": "SELECT average_number_of_retweets AS count FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$3$ gets $count$ retweets on an average"
			]}]
		},
		{
			"keys"   :["tweet activity", "retweets", "frequency"],
			"score"  :2000,
			"example": "How many times did melaniatrump get 0 retweets",
			"phrases":[ {"type":"pattern", "expression":"* * have * retweets"},
				{"type":"pattern", "expression":"* * get * retweets"}
			],
			"process":[ {"type": "console", "expression": "SELECT retweets_chart[$3$] AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ got $3$ retweets, $count$ times"
			]}]
		},
		{
			"keys"   :["tweet activity", "retweets", "frequency", "chart"],
			"score"  :2000,
			"example": "Show me the retweet frequency chart of melaniatrump",
			"phrases":[ {"type":"pattern", "expression":"* retweet frequency chart * *"}
			],
			"process":[ {"type": "console", "expression": "SELECT retweets_chart FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"Here is the retweets frequency chart"
			]}, {"type": "table"}]
		},
		{
			"keys"   :["tweet activity", "hashtags", "count"],
			"score"  :2000,
			"example": "How many hashtags has melaniatrump used in all",
			"phrases":[ {"type":"pattern", "expression":"* hashtags has * used *"}
			],
			"process":[ {"type": "console", "expression": "SELECT hashtags_used_count AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ has used $count$ hashtags till now"
			]}]
		},
		{
			"keys"   :["tweet activity", "hashtags", "maximum"],
			"score"  :2000,
			"example": "What is the maximum number of hastags that melaniatrump used",
			"phrases":[ {"type":"pattern", "expression":"* maximum * hashtags that * used"}
			],
			"process":[ {"type": "console", "expression": "SELECT max_hashtags FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"Here you go"
			]}, {"type": "table"}]
		},
		{
			"keys"   :["tweet activity", "hashtags", "average"],
			"score"  :2000,
			"example": "What is the average number of hashtags that melaniatrump uses",
			"phrases":[ {"type":"pattern", "expression":"* average * hashtags that * uses"}
			],
			"process":[ {"type": "console", "expression": "SELECT average_number_of_hashtags_used AS count FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$3$ uses $count$ hashtags on an average"
			]}]
		},
		{
			"keys"   :["tweet activity", "hashtags", "frequency"],
			"score"  :2000,
			"example": "How many times did melaniatrump use 20 hashtags",
			"phrases":[ {"type":"pattern", "expression":"* * use * hashtags"}
			],
			"process":[ {"type": "console", "expression": "SELECT hashtags_chart[$3$] AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ used $3$ hashtags, $count$ times"
			]}]
		},
		{
			"keys"   :["tweet activity", "hashtags", "frequency", "chart"],
			"score"  :2000,
			"example": "Show me the hashtag frequency chart of melaniatrump",
			"phrases":[ {"type":"pattern", "expression":"* hashtag frequency chart * *"}
			],
			"process":[ {"type": "console", "expression": "SELECT hashtags_chart FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"Here is the hashtags frequency chart"
			]}, {"type": "table"}]
		},
		{
			"keys"   :["tweet content", "language", "frequency"],
			"score"  :2000,
			"example": "How many tweets did melaniatrump write in English?",
			"phrases":[ {"type":"pattern", "expression":"* * write in *"},
				{"type":"pattern", "expression":"* * post in *"},
				{"type":"pattern", "expression":"* of * were written in *"}
			],
			"process":[ {"type": "console", "expression": "SELECT languages[$3$] AS count FROM twitanalysis WHERE screen_name='$2$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"$2$ posted $count$ tweets in $3$"
			]}]
		},
		{
			"keys"   :["tweet content", "language", "analysis", "chart"],
			"score"  :2000,
			"example": "Show me the language analysis chart of melaniatrump",
			"phrases":[ {"type":"pattern", "expression":"* language analysis chart * *"}
			],
			"process":[ {"type": "console", "expression": "SELECT languages FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"Here is the language analysis chart"
			]}, {"type": "table"}]
		},
		{
			"keys"   :["tweet content", "sentiment", "analysis", "chart"],
			"score"  :2000,
			"example": "Show me the sentiment analysis chart of melaniatrump",
			"phrases":[ {"type":"pattern", "expression":"* sentiment analysis chart * *"}
			],
			"process":[ {"type": "console", "expression": "SELECT sentiments FROM twitanalysis WHERE screen_name='$3$' AND count='1000';"}],
			"actions":[ {"type": "answer", "select": "random", "phrases":[
				"Here is the sentiment analysis chart"
			]}, {"type": "table"}]
		}

(PS: no points for guessing why melaniatrump is there in the examples 😉 )

As has been explained before, I simply write an expression consisting of parameters and some hardcoded core words, and I then fetch the parameters using $x$ (where x is the parameter number). These queries can give a whole lot of statistics regarding a user's activity and the activity on their profile, so they are definitely very useful for a chatbot.
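To make this concrete, take the rule above with the pattern * did * post a * and feed it its example query, "How many times did melaniatrump post a video". The wildcards bind in order as $1$ = "How many times", $2$ = "melaniatrump" and $3$ = "video", so its process expression expands to:

SELECT video AS count FROM twitanalysis WHERE screen_name='melaniatrump' AND count='1000';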

Now, to end this, we need a way to process these queries. Enter ConsoleService. Notice that all the process['expression'] SQL queries are of the form:

SELECT <something> FROM twitanalysis WHERE screen_name = '<parameter_of_username>' AND count = '1000';

I have taken count as 1000 because, as mentioned in my last blog post, the scraper fetches and displays a maximum of 1000 results at a time, so I want to maximise the range of tweets analysed.

Converting the above generalised SQL query to a regex, we get this form:

SELECT\\h+?(.*?)\\h+?FROM\\h+?twitanalysis\\h+?WHERE\\h+?screen_name\\h??=\\h??'(.*?)'\\h+?AND\\h+?count\\h??=\\h??'(.*?)'\\h??;

Here, \\h matches a horizontal whitespace character (\\h+? for one or more, \\h?? for an optional one), and the variable parts of the query are expressed with the lazy wildcard (.*?). Since these wildcards correspond to the specific parameters (as described above) that we use in our SQL queries, we wrap each of them in a capturing group so their values can be extracted later.
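As a quick sanity check, here is a minimal standalone sketch (the class name is just for illustration) of what the three capturing groups yield for one of the rule queries listed earlier:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {
    public static void main(String[] args) {
        Pattern sql = Pattern.compile("SELECT\\h+?(.*?)\\h+?FROM\\h+?twitanalysis\\h+?WHERE\\h+?screen_name\\h??=\\h??'(.*?)'\\h+?AND\\h+?count\\h??=\\h??'(.*?)'\\h??;");
        Matcher m = sql.matcher("SELECT likes_count AS count FROM twitanalysis WHERE screen_name='melaniatrump' AND count='1000';");
        if (m.matches()) {
            System.out.println(m.group(1)); // likes_count AS count
            System.out.println(m.group(2)); // melaniatrump
            System.out.println(m.group(3)); // 1000
        }
    }
}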

Now, we need to compile this regex, and point to what needs to be done. This is done in ConsoleService.


dbAccess.put(Pattern.compile("SELECT\\h+?(.*?)\\h+?FROM\\h+?twitanalysis\\h+?WHERE\\h+?screen_name\\h??=\\h??'(.*?)'\\h+?AND\\h+?count\\h??=\\h??'(.*?)'\\h??;"), (flow, matcher) -> {
            // group(2) = screen_name, group(3) = count: run the profile analysis for that user
            SusiThought json = TwitterAnalysisService.showAnalysis(matcher.group(2), matcher.group(3));
            // group(1) = the selected columns: SusiTransfer filters the result data down to them
            SusiTransfer transfer = new SusiTransfer(matcher.group(1));
            json.setData(transfer.conclude(json.getData()));
            return json;
        });

We basically compile the regex and feed it to a BiFunction (lambda lingo for a function that takes two parameters). We take in the groups using matcher.group: as described earlier, showAnalysis takes a screen_name and a count, so we pull those out of the matcher and feed them to the static function showAnalysis inside the TwitterAnalysisService servlet. We then get back the JSON. This completes the procedure: TwitterAnalysis is now integrated with the Susi API. 🙂

In my next blog posts, I'll talk about bot integrations for Susi, including a Slack bot for Susi that I made, and then I'll move on to Susi monetisation using the Amazon API. Feedback is welcome 🙂

Social Media Analysis using Loklak (Part 3)

Social Media Analysis using Loklak (Part 2)

In the last post, I spoke about the TwitterAnalysis servlet I developed, and how we can analyse the entire Twitter profile of a user and get useful data from that servlet. This was pretty simple to do, because all I really did was parse the Search API and run some other simple commands, which resulted in a concise yet detailed profile analysis.

But there’s something that I’ve not spoken about yet, which I will in this blog post. How is the social media data collected in that form? Where does it come from and how?

Loklak, as is known, is a social media search server which scrapes social media sites and profiles. Scraping basically means reading the HTML source code of a website and extracting the information from the relevant tags. The p2p nature of loklak enables many peers to scrape simultaneously and feed tweets to a backend, while also storing them in their own backends.

Scraping is a very well-known practice, and in Java we already have easy-to-use scraping tools like Jsoup. You just need to connect to the website, specify the tags between which the information is present, and voilà. Here is an example from the EventBrite scraper we have made:


public static SusiThought crawlEventBrite(String url) {
		Document htmlPage = null;

		try {
			htmlPage = Jsoup.connect(url).get();
		} catch (Exception e) {
			e.printStackTrace();
		}

		String eventID = null;
		String eventName = null;
		String eventDescription = null;

		// TODO Fetch Event Color
		String eventColor = null;

		String imageLink = null;

		String eventLocation = null;

		String startingTime = null;
		String endingTime = null;

		String ticketURL = null;

		Elements tagSection = null;
		Elements tagSpan = null;
		String[][] tags = new String[5][2];
		String topic = null; // By default

		String closingDateTime = null;
		String schedulePublishedOn = null;
		JSONObject creator = new JSONObject();
		String email = null;

		Float latitude = null;
		Float longitude = null;

		String privacy = "public"; // By Default
		String state = "completed"; // By Default
		String eventType = "";

		String temp;
		Elements t;

		eventID = htmlPage.getElementsByTag("body").attr("data-event-id");
		eventName = htmlPage.getElementsByClass("listing-hero-body").text();
		eventDescription = htmlPage.select("div.js-xd-read-more-toggle-view.read-more__toggle-view").text();

		eventColor = null;

		imageLink = htmlPage.getElementsByTag("picture").attr("content");

		eventLocation = htmlPage.select("p.listing-map-card-street-address.text-default").text();

		temp = htmlPage.getElementsByAttributeValue("property", "event:start_time").attr("content");
		if(temp.length() >= 20){
			startingTime = htmlPage.getElementsByAttributeValue("property", "event:start_time").attr("content").substring(0,19);
		}else{
			startingTime = htmlPage.getElementsByAttributeValue("property", "event:start_time").attr("content");
		}

		temp = htmlPage.getElementsByAttributeValue("property", "event:end_time").attr("content");
		if(temp.length() >= 20){
			endingTime = htmlPage.getElementsByAttributeValue("property", "event:end_time").attr("content").substring(0,19);
		}else{
			endingTime = htmlPage.getElementsByAttributeValue("property", "event:end_time").attr("content");
		}

		ticketURL = url + "#tickets";

		// TODO Tags to be modified to fit in the format of Open Event "topic"
		tagSection = htmlPage.getElementsByAttributeValue("data-automation", "ListingsBreadcrumbs");
		tagSpan = tagSection.select("span");
		topic = "";

		int iterator = 0, k = 0;
		for (Element e : tagSpan) {
			if (iterator % 2 == 0) {
				tags[k][1] = "www.eventbrite.com"
						+ e.select("a.js-d-track-link.badge.badge--tag.l-mar-top-2").attr("href");
			} else {
				tags[k][0] = e.text();
				k++;
			}
			iterator++;
		}

		creator.put("email", "");
		creator.put("id", "1"); // By Default

		temp = htmlPage.getElementsByAttributeValue("property", "event:location:latitude").attr("content");
		if(temp.length() > 0){
			latitude = Float
				.valueOf(htmlPage.getElementsByAttributeValue("property", "event:location:latitude").attr("content"));
		}

		temp = htmlPage.getElementsByAttributeValue("property", "event:location:longitude").attr("content");
		if(temp.length() > 0){
			longitude = Float
				.valueOf(htmlPage.getElementsByAttributeValue("property", "event:location:longitude").attr("content"));
		}

		// TODO This returns: "events.event" which is not supported by Open
		// Event Generator
		// eventType = htmlPage.getElementsByAttributeValue("property",
		// "og:type").attr("content");

		String organizerName = null;
		String organizerLink = null;
		String organizerProfileLink = null;
		String organizerWebsite = null;
		String organizerContactInfo = null;
		String organizerDescription = null;
		String organizerFacebookFeedLink = null;
		String organizerTwitterFeedLink = null;
		String organizerFacebookAccountLink = null;
		String organizerTwitterAccountLink = null;

		temp = htmlPage.select("a.js-d-scroll-to.listing-organizer-name.text-default").text();
		if(temp.length() >= 5){
			organizerName = htmlPage.select("a.js-d-scroll-to.listing-organizer-name.text-default").text().substring(4);
		}else{
			organizerName = "";
		}
		organizerLink = url + "#listing-organizer";
		organizerProfileLink = htmlPage
				.getElementsByAttributeValue("class", "js-follow js-follow-target follow-me fx--fade-in is-hidden")
				.attr("href");
		organizerContactInfo = url + "#lightbox_contact";

		Document orgProfilePage = null;

		try {
			orgProfilePage = Jsoup.connect(organizerProfileLink).get();
		} catch (Exception e) {
			e.printStackTrace();
		}

		if(orgProfilePage != null){

			// Jsoup's select / getElementsByAttributeValue return an empty Elements
			// collection (never null) when nothing matches, so check isEmpty() here
			t = orgProfilePage.getElementsByAttributeValue("class", "l-pad-vert-1 organizer-website");
			if(!t.isEmpty()){
				organizerWebsite = t.text();
			}else{
				organizerWebsite = "";
			}

			t = orgProfilePage.select("div.js-long-text.organizer-description");
			if(!t.isEmpty()){
				organizerDescription = t.text();
			}else{
				organizerDescription = "";
			}

			organizerFacebookFeedLink = organizerProfileLink + "#facebook_feed";
			organizerTwitterFeedLink = organizerProfileLink + "#twitter_feed";

			t = orgProfilePage.getElementsByAttributeValue("class", "fb-page");
			if(!t.isEmpty()){
				organizerFacebookAccountLink = t.attr("data-href");
			}else{
				organizerFacebookAccountLink = "";
			}

			t = orgProfilePage.getElementsByAttributeValue("class", "twitter-timeline");
			if(!t.isEmpty()){
				organizerTwitterAccountLink = t.attr("href");
			}else{
				organizerTwitterAccountLink = "";
			}

		}

		

		JSONArray socialLinks = new JSONArray();

		JSONObject fb = new JSONObject();
		fb.put("id", "1");
		fb.put("name", "Facebook");
		fb.put("link", organizerFacebookAccountLink);
		socialLinks.put(fb);

		JSONObject tw = new JSONObject();
		tw.put("id", "2");
		tw.put("name", "Twitter");
		tw.put("link", organizerTwitterAccountLink);
		socialLinks.put(tw);

		JSONArray jsonArray = new JSONArray();

		JSONObject event = new JSONObject();
		event.put("event_url", url);
		event.put("id", eventID);
		event.put("name", eventName);
		event.put("description", eventDescription);
		event.put("color", eventColor);
		event.put("background_url", imageLink);
		event.put("closing_datetime", closingDateTime);
		event.put("creator", creator);
		event.put("email", email);
		event.put("location_name", eventLocation);
		event.put("latitude", latitude);
		event.put("longitude", longitude);
		event.put("start_time", startingTime);
		event.put("end_time", endingTime);
		event.put("logo", imageLink);
		event.put("organizer_description", organizerDescription);
		event.put("organizer_name", organizerName);
		event.put("privacy", privacy);
		event.put("schedule_published_on", schedulePublishedOn);
		event.put("state", state);
		event.put("type", eventType);
		event.put("ticket_url", ticketURL);
		event.put("social_links", socialLinks);
		event.put("topic", topic);
		jsonArray.put(event);

		JSONObject org = new JSONObject();
		org.put("organizer_name", organizerName);
		org.put("organizer_link", organizerLink);
		org.put("organizer_profile_link", organizerProfileLink);
		org.put("organizer_website", organizerWebsite);
		org.put("organizer_contact_info", organizerContactInfo);
		org.put("organizer_description", organizerDescription);
		org.put("organizer_facebook_feed_link", organizerFacebookFeedLink);
		org.put("organizer_twitter_feed_link", organizerTwitterFeedLink);
		org.put("organizer_facebook_account_link", organizerFacebookAccountLink);
		org.put("organizer_twitter_account_link", organizerTwitterAccountLink);
		jsonArray.put(org);

		JSONArray microlocations = new JSONArray();
		jsonArray.put(new JSONObject().put("microlocations", microlocations));

		JSONArray customForms = new JSONArray();
		jsonArray.put(new JSONObject().put("customForms", customForms));

		JSONArray sessionTypes = new JSONArray();
		jsonArray.put(new JSONObject().put("sessionTypes", sessionTypes));

		JSONArray sessions = new JSONArray();
		jsonArray.put(new JSONObject().put("sessions", sessions));

		JSONArray sponsors = new JSONArray();
		jsonArray.put(new JSONObject().put("sponsors", sponsors));

		JSONArray speakers = new JSONArray();
		jsonArray.put(new JSONObject().put("speakers", speakers));

		JSONArray tracks = new JSONArray();
		jsonArray.put(new JSONObject().put("tracks", tracks));
		SusiThought json = new SusiThought();
		json.setData(jsonArray);
		return json;

	}

As you can see, we first connect to the URL using Jsoup.connect().get(), and then use methods like getElementsByAttributeValue and getElementsByTag to extract the information.

This is one way of scraping: by using tools like Jsoup. You could also do it manually: connect to the website, use classes like BufferedReader or InputStreamReader to pull down the raw HTML, and then iterate through it to extract the information. This is the approach adopted for the TwitterScraper we have; a simplified sketch of the idea follows.
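Just to illustrate the manual approach in isolation, here is a minimal, self-contained sketch (using plain java.net and java.io instead of loklak's ClientConnection, with a hypothetical class name and target page) that pulls one value out of a page by string search:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ManualScrapeSketch {
    public static void main(String[] args) throws Exception {
        // fetch the raw HTML line by line, in the spirit of the TwitterScraper
        BufferedReader br = new BufferedReader(new InputStreamReader(
                new URL("https://example.com/").openStream(), StandardCharsets.UTF_8));
        String line;
        while ((line = br.readLine()) != null) {
            // locate a known tag by plain string search and cut the value out
            int p = line.indexOf("<title>");
            if (p >= 0) {
                int q = line.indexOf("</title>", p);
                if (q > p) System.out.println(line.substring(p + 7, q));
            }
        }
        br.close();
    }
}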

In the TwitterScraper, we first connect to the URL using ClientConnection() and then use BufferedReader to get the HTML code, as shown here.


private static String prepareSearchURL(final String query) {
        // check
        // https://twitter.com/search-advanced for a better syntax
        // https://support.twitter.com/articles/71577-how-to-use-advanced-twitter-search#
        String https_url = "";
        try {
            StringBuilder t = new StringBuilder(query.length());
            for (String s: query.replace('+', ' ').split(" ")) {
                t.append(' ');
                if (s.startsWith("since:") || s.startsWith("until:")) {
                    int u = s.indexOf('_');
                    t.append(u < 0 ? s : s.substring(0, u));
                } else {
                    t.append(s);
                }
            }
            String q = t.length() == 0 ? "*" : URLEncoder.encode(t.substring(1), "UTF-8");
            //https://twitter.com/search?f=tweets&vertical=default&q=kaffee&src=typd
            https_url = "https://twitter.com/search?f=tweets&vertical=default&q=" + q + "&src=typd";
        } catch (UnsupportedEncodingException e) {} // cannot happen: UTF-8 is always supported
        return https_url;
    }
    
    private static Timeline[] search(
            final String query,
            final Timeline.Order order,
            final boolean writeToIndex,
            final boolean writeToBackend) {
        // check
        // https://twitter.com/search-advanced for a better syntax
        // https://support.twitter.com/articles/71577-how-to-use-advanced-twitter-search#
        String https_url = prepareSearchURL(query);
        Timeline[] timelines = null;
        try {
            ClientConnection connection = new ClientConnection(https_url);
            if (connection.inputStream == null) return null;
            try {
                BufferedReader br = new BufferedReader(new InputStreamReader(connection.inputStream, StandardCharsets.UTF_8));
                timelines = search(br, order, writeToIndex, writeToBackend);
            } catch (IOException e) {
            	Log.getLog().warn(e);
            } finally {
                connection.close();
            }
        } catch (IOException e) {
            // this could mean that twitter rejected the connection (DoS protection?) or we are offline (we should be silent then)
            // Log.getLog().warn(e);
            if (timelines == null) timelines = new Timeline[]{new Timeline(order), new Timeline(order)};
        }

        // wait until all messages in the timeline are ready
        if (timelines == null) {
            // timeout occurred
            timelines = new Timeline[]{new Timeline(order), new Timeline(order)};
        }
        if (timelines != null) {
            if (timelines[0] != null) timelines[0].setScraperInfo("local");
            if (timelines[1] != null) timelines[1].setScraperInfo("local");
        }
        return timelines;
    }

If you check out the Search Servlet at /api/search.json, you will see that it accepts either plain query terms, or filters like from:username or @username to see messages from a particular user. prepareSearchURL parses this search query and converts it into a query Twitter's search can understand (for instance, since:/until: filters carrying a time suffix get the part after the underscore stripped, since Twitter only accepts plain dates there), and we then use Twitter's Advanced Search to run it. In the Timeline[] search method, we use a BufferedReader to fetch the HTML of the search result and store the parsed tweets in Timeline objects for further use.
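For instance, assuming the prepareSearchURL method above is in scope, a hypothetical call would behave like this (note how the time suffix after the underscore is stripped from the since: filter):

String url = prepareSearchURL("from:melaniatrump since:2016-08-01_12:00");
// url is now:
// https://twitter.com/search?f=tweets&vertical=default&q=from%3Amelaniatrump+since%3A2016-08-01&src=typd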

Now this HTML has to be processed: we need to inspect the tags and work with them. This is achieved here:


private static Timeline[] search(
            final BufferedReader br,
            final Timeline.Order order,
            final boolean writeToIndex,
            final boolean writeToBackend) throws IOException {
        Timeline timelineReady = new Timeline(order);
        Timeline timelineWorking = new Timeline(order);
        String input;
        Map<String, prop> props = new HashMap<String, prop>();
        Set<String> images = new LinkedHashSet<String>();
        Set<String> videos = new LinkedHashSet<String>();
        String place_id = "", place_name = "";
        boolean parsing_favourite = false, parsing_retweet = false;
        int line = 0; // first line is 1, according to emacs which numbers the first line also as 1
        boolean debuglog = false;
        while ((input = br.readLine()) != null){
            line++;
            input = input.trim();
            if (input.length() == 0) continue;
            
            // debug
            //if (debuglog) System.out.println(line + ": " + input);            
            //if (input.indexOf("ProfileTweet-actionCount") > 0) System.out.println(input);

            // parse
            int p;
            if ((p = input.indexOf("=\"account-group")) > 0) {
                props.put("userid", new prop(input, p, "data-user-id"));
                continue;
            }
            if ((p = input.indexOf("class=\"avatar")) > 0) {
                props.put("useravatarurl", new prop(input, p, "src"));
                continue;
            }
            if ((p = input.indexOf("class=\"fullname")) > 0) {
                props.put("userfullname", new prop(input, p, null));
                continue;
            }
            if ((p = input.indexOf("class=\"username")) > 0) {
                props.put("usernickname", new prop(input, p, null));
                continue;
            }
            if ((p = input.indexOf("class=\"tweet-timestamp")) > 0) {
                props.put("tweetstatusurl", new prop(input, 0, "href"));
                props.put("tweettimename", new prop(input, p, "title"));
                // don't continue here because "class=\"_timestamp" is in the same line 
            }
            if ((p = input.indexOf("class=\"_timestamp")) > 0) {
                props.put("tweettimems", new prop(input, p, "data-time-ms"));
                continue;
            }
            if ((p = input.indexOf("class=\"ProfileTweet-action--retweet")) > 0) {
                parsing_retweet = true;
                continue;
            }
            if ((p = input.indexOf("class=\"ProfileTweet-action--favorite")) > 0) {
                parsing_favourite = true;
                continue;
            }
            if ((p = input.indexOf("class=\"TweetTextSize")) > 0) {
                // read until closing p tag to account for new lines in tweets
                while (input.lastIndexOf("

") == -1){ input = input + ' ' + br.readLine(); } prop tweettext = new prop(input, p, null); props.put("tweettext", tweettext); continue; } if ((p = input.indexOf("class=\"ProfileTweet-actionCount")) > 0) { if (parsing_retweet) { prop tweetretweetcount = new prop(input, p, "data-tweet-stat-count"); props.put("tweetretweetcount", tweetretweetcount); parsing_retweet = false; } if (parsing_favourite) { props.put("tweetfavouritecount", new prop(input, p, "data-tweet-stat-count")); parsing_favourite = false; } continue; } // get images if ((p = input.indexOf("class=\"media media-thumbnail twitter-timeline-link media-forward is-preview")) > 0 || (p = input.indexOf("class=\"multi-photo")) > 0) { images.add(new prop(input, p, "data-resolved-url-large").value); continue; } // we have two opportunities to get video thumbnails == more images; images in the presence of video content should be treated as thumbnail for the video if ((p = input.indexOf("class=\"animated-gif-thumbnail\"")) > 0) { images.add(new prop(input, 0, "src").value); continue; } if ((p = input.indexOf("class=\"animated-gif\"")) > 0) { images.add(new prop(input, p, "poster").value); continue; } if ((p = input.indexOf("= 0 && input.indexOf("type=\"video/") > p) { videos.add(new prop(input, p, "video-src").value); continue; } if ((p = input.indexOf("class=\"Tweet-geo")) > 0) { prop place_name_prop = new prop(input, p, "title"); place_name = place_name_prop.value; continue; } if ((p = input.indexOf("class=\"ProfileTweet-actionButton u-linkClean js-nav js-geo-pivot-link")) > 0) { prop place_id_prop = new prop(input, p, "data-place-id"); place_id = place_id_prop.value; continue; } if (props.size() == 10 || (debuglog && props.size() > 4 && input.indexOf("stream-item") > 0 /* li class="js-stream-item" starts a new tweet */)) { // the tweet is complete, evaluate the result if (debuglog) System.out.println("*** line " + line + " propss.size() = " + props.size()); prop userid = props.get("userid"); if (userid == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value userid"); continue;} prop usernickname = props.get("usernickname"); if (usernickname == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value usernickname"); continue;} prop useravatarurl = props.get("useravatarurl"); if (useravatarurl == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value useravatarurl"); continue;} prop userfullname = props.get("userfullname"); if (userfullname == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value userfullname"); continue;} UserEntry user = new UserEntry( userid.value, usernickname.value, useravatarurl.value, MessageEntry.html2utf8(userfullname.value) ); ArrayList imgs = new ArrayList(images.size()); imgs.addAll(images); ArrayList vids = new ArrayList(videos.size()); vids.addAll(videos); prop tweettimems = props.get("tweettimems"); if (tweettimems == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value tweettimems"); continue;} prop tweetretweetcount = props.get("tweetretweetcount"); if (tweetretweetcount == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value tweetretweetcount"); continue;} prop tweetfavouritecount = props.get("tweetfavouritecount"); if (tweetfavouritecount == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value tweetfavouritecount"); continue;} TwitterTweet tweet = new TwitterTweet( user.getScreenName(), Long.parseLong(tweettimems.value), 
props.get("tweettimename").value, props.get("tweetstatusurl").value, props.get("tweettext").value, Long.parseLong(tweetretweetcount.value), Long.parseLong(tweetfavouritecount.value), imgs, vids, place_name, place_id, user, writeToIndex, writeToBackend ); if (!DAO.messages.existsCache(tweet.getIdStr())) { // checking against the exist cache is incomplete. A false negative would just cause that a tweet is // indexed again. if (tweet.willBeTimeConsuming()) { executor.execute(tweet); //new Thread(tweet).start(); // because the executor may run the thread in the current thread it could be possible that the result is here already if (tweet.isReady()) { timelineReady.add(tweet, user); //DAO.log("SCRAPERTEST: messageINIT is ready"); } else { timelineWorking.add(tweet, user); //DAO.log("SCRAPERTEST: messageINIT unshortening"); } } else { // no additional thread needed, run the postprocessing in the current thread tweet.run(); timelineReady.add(tweet, user); } } images.clear(); props.clear(); continue; } } //for (prop p: props.values()) System.out.println(p); br.close(); return new Timeline[]{timelineReady, timelineWorking}; }

I suggest you go to Twitter’s Advanced Search page and search for some terms, and once you have the page loaded, check out its HTML, because we need to work with the tags.

This code is mostly self-explanatory. Once you have the HTML result, it is fairly easy to find by inspection which tags contain the data we need. We iterate through the HTML with the while loop and then check the tags: for example, images in the search result were stored inside a <div class="media media-thumbnail twitter-timeline-link media-forward is-preview"> tag, so we use indexOf to locate those tags and pick out the images. This is done for all the data we need (username, timestamp, likes count, retweets count, mentions count etc), every single thing that the Search Servlet of loklak shows. The idea behind the attribute extraction is sketched below.
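Roughly, the prop helper used throughout cuts an attribute's value out of the matched HTML line; a hypothetical, simplified stand-in (not loklak's actual prop class) could look like this:

public class AttributeSketch {
    static String attributeValue(String input, String attribute) {
        int a = input.indexOf(attribute + "=\"");
        if (a < 0) return null;
        int start = a + attribute.length() + 2; // skip past attr="
        int end = input.indexOf('"', start);    // closing quote
        return end > start ? input.substring(start, end) : null;
    }

    public static void main(String[] args) {
        String line = "<div class=\"account-group\" data-user-id=\"12345\">";
        System.out.println(attributeValue(line, "data-user-id")); // 12345
    }
}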

So this is how the social media data is scraped: we have covered scraping both with tools and manually, which are the most widely used methods anyway. In my next posts, I will talk about the rules for the TwitterAnalysis servlet, then social media chat bots, and how Susi is integrated into them (especially in FB Messenger and Slack). Feedback is welcome 🙂

Social Media Analysis using Loklak (Part 2)

Social Media Analysis using Loklak (Part 1)

So now, in the past 7 posts, I have covered how loklak_depot started off, the beginnings, and how we got loklak_depot onto the World Wide Web so that people got to know more about it. Now, I shall go a bit deeper into what we at loklak are trying to do with all the information we are getting.

Twitter gives a lot of information about how people live their lives and about a variety of topics (sports, business, finance, food etc). It is human information, and such information can be taken as a fairly reliable (though not completely, since there are so many users) source for a variety of data analyses. We could do a host of things with the data we get from tweets, and a lot of things have in fact been done already, like stock market predictions, sentiment analysis etc.

One of these uses is analysing social media profiles: give it a Twitter (or any online social media) profile, and the program gives you a complete analysis of the posts you make and the people you connect with. Another use is Susi, our NLP-based search engine, on which a couple of blog posts have already been made.

So while I was spotting issues with the Search API and checking out the others, I had an idea to make a small application to which we supply a Twitter username and which gives back a profile analysis for that username. The app has been built and works well. I have only built the servlet because I planned on integrating it into Susi, and using it with Susi would actually give a lot of interesting answers.

The app / servlet has the following features I decided to implement:

1. User’s Activity Frequency: yearly, hourly and daily statistics etc
2. User’s Activity Type: how many times did user upload photo / video / status etc
3. Activity on User’s Activity: Number of likes / retweets user did, hashtags analysis etc
4. Analysis of User’s content: Language and Sentiment analysis

So here is the entire code for the program. After this, I will explain how I implemented the parts step by step:


public class TwitterAnalysisService extends AbstractAPIHandler implements APIHandler {

	private static final long serialVersionUID = -3753965521858525803L;

	private static HttpServletRequest request;

	@Override
	public String getAPIPath() {
		return "/api/twitanalysis.json";
	}

	@Override
	public BaseUserRole getMinimalBaseUserRole() {
		return BaseUserRole.ANONYMOUS;
	}

	@Override
	public JSONObject getDefaultPermissions(BaseUserRole baseUserRole) {
		return null;
	}

	@Override
	public JSONObject serviceImpl(Query call, HttpServletResponse response, Authorization rights,
			JSONObjectWithDefault permissions) throws APIException {
		String username = call.get("screen_name", "");
		String count = call.get("count", "");
		TwitterAnalysisService.request = call.getRequest();
		return showAnalysis(username, count);
	}

	public static SusiThought showAnalysis(String username, String count) {

		SusiThought json = new SusiThought();
		JSONArray finalresultarray = new JSONArray();
		JSONObject finalresult = new JSONObject(true);
		String siteurl = request.getRequestURL().toString();
		String baseurl = siteurl.substring(0, siteurl.length() - request.getRequestURI().length())
				+ request.getContextPath();

		// compare string contents, not references: count != "" would be wrong here
		String searchurl = baseurl + "/api/search.json?q=from%3A" + username + (!count.isEmpty() ? ("&count=" + count) : "");
		byte[] searchbyte;
		try {
			searchbyte = ClientConnection.download(searchurl);
		} catch (IOException e) {
			return json.setData(new JSONArray().put(new JSONObject().put("Error", "Can't contact server")));
		}
		String searchstr = UTF8.String(searchbyte);
		JSONObject searchresult = new JSONObject(searchstr);

		JSONArray tweets = searchresult.getJSONArray("statuses");
		if (tweets.length() == 0) {
			finalresult.put("error", "Invalid username " + username + " or no tweets");
			finalresultarray.put(finalresult);
			json.setData(finalresultarray);
			return json;
		}
		finalresult.put("username", username);
		finalresult.put("items_per_page", searchresult.getJSONObject("search_metadata").getString("itemsPerPage"));
		finalresult.put("tweets_analysed", searchresult.getJSONObject("search_metadata").getString("count"));

		// main loop
		JSONObject activityFreq = new JSONObject(true);
		JSONObject activityType = new JSONObject(true);
		int imgCount = 0, audioCount = 0, videoCount = 0, linksCount = 0, likesCount = 0, retweetCount = 0,
				hashtagCount = 0;
		int maxLikes = 0, maxRetweets = 0, maxHashtags = 0;
		String maxLikeslink, maxRetweetslink, maxHashtagslink;
		maxLikeslink = maxRetweetslink = maxHashtagslink = tweets.getJSONObject(0).getString("link");
		List<String> tweetDate = new ArrayList<String>();
		List<String> tweetHour = new ArrayList<String>();
		List<String> tweetDay = new ArrayList<String>();
		List<Integer> likesList = new ArrayList<Integer>();
		List<Integer> retweetsList = new ArrayList<Integer>();
		List<Integer> hashtagsList = new ArrayList<Integer>();
		List<String> languageList = new ArrayList<String>();
		List<String> sentimentList = new ArrayList<String>();
		Calendar calendar = Calendar.getInstance();

		for (int i = 0; i < tweets.length(); i++) {
			JSONObject status = tweets.getJSONObject(i);
			String[] datearr = status.getString("created_at").split("T")[0].split("-");
			calendar.set(Integer.parseInt(datearr[0]), Integer.parseInt(datearr[1]) - 1, Integer.parseInt(datearr[2]));
			Date date = new Date(calendar.getTimeInMillis());
			tweetDate.add(new SimpleDateFormat("MMMM yyyy").format(date));
			tweetDay.add(new SimpleDateFormat("EEEE", Locale.ENGLISH).format(date)); // day
			String times = status.getString("created_at").split("T")[1];
			String hour = times.substring(0, times.length() - 5).split(":")[0];
			tweetHour.add(hour); // hour
			imgCount += status.getInt("images_count");
			audioCount += status.getInt("audio_count");
			videoCount += status.getInt("videos_count");
			linksCount += status.getInt("links_count");
			likesList.add(status.getInt("favourites_count"));
			retweetsList.add(status.getInt("retweet_count"));
			hashtagsList.add(status.getInt("hashtags_count"));
			if (status.has("classifier_emotion")) {
				sentimentList.add(status.getString("classifier_emotion"));
			} else {
				sentimentList.add("neutral");
			}
			if (status.has("classifier_language")) {
				languageList.add(status.getString("classifier_language"));
			} else {
				languageList.add("no_text");
			}
			if (maxLikes < status.getInt("favourites_count")) {
				maxLikes = status.getInt("favourites_count");
				maxLikeslink = status.getString("link");
			}
			if (maxRetweets < status.getInt("retweet_count")) {
				maxRetweets = status.getInt("retweet_count");
				maxRetweetslink = status.getString("link");
			}
			if (maxHashtags < status.getInt("hashtags_count")) {
				maxHashtags = status.getInt("hashtags_count");
				maxHashtagslink = status.getString("link");
			}
			likesCount += status.getInt("favourites_count");
			retweetCount += status.getInt("retweet_count");
			hashtagCount += status.getInt("hashtags_count");
		}
		activityType.put("posted_image", imgCount);
		activityType.put("posted_audio", audioCount);
		activityType.put("posted_video", videoCount);
		activityType.put("posted_link", linksCount);
		activityType.put("posted_story",
				Integer.parseInt(searchresult.getJSONObject("search_metadata").getString("count"))
						- (imgCount + audioCount + videoCount + linksCount));

		JSONObject yearlyact = new JSONObject(true);
		JSONObject hourlyact = new JSONObject(true);
		JSONObject dailyact = new JSONObject(true);
		Set<String> yearset = new HashSet<String>(tweetDate);
		Set<String> hourset = new HashSet<String>(tweetHour);
		Set<String> dayset = new HashSet<String>(tweetDay);

		for (String s : yearset) {
			yearlyact.put(s, Collections.frequency(tweetDate, s));
		}

		for (String s : hourset) {
			hourlyact.put(s, Collections.frequency(tweetHour, s));
		}

		for (String s : dayset) {
			dailyact.put(s, Collections.frequency(tweetDay, s));
		}

		activityFreq.put("yearwise", yearlyact);
		activityFreq.put("hourwise", hourlyact);
		activityFreq.put("daywise", dailyact);
		finalresult.put("tweet_frequency", activityFreq);
		finalresult.put("tweet_type", activityType);

		// activity on my tweets

		JSONObject activityOnTweets = new JSONObject(true);
		JSONObject activityCharts = new JSONObject(true);
		JSONObject likesChart = new JSONObject(true);
		JSONObject retweetChart = new JSONObject(true);
		JSONObject hashtagsChart = new JSONObject(true);

		Set<Integer> likesSet = new HashSet<Integer>(likesList);
		Set<Integer> retweetSet = new HashSet<Integer>(retweetsList);
		Set<Integer> hashtagSet = new HashSet<Integer>(hashtagsList);

		for (Integer i : likesSet) {
			likesChart.put(i.toString(), Collections.frequency(likesList, i));
		}

		for (Integer i : retweetSet) {
			retweetChart.put(i.toString(), Collections.frequency(retweetsList, i));
		}

		for (Integer i : hashtagSet) {
			hashtagsChart.put(i.toString(), Collections.frequency(hashtagsList, i));
		}

		activityOnTweets.put("likes_count", likesCount);
		activityOnTweets.put("max_likes",
				new JSONObject(true).put("number", maxLikes).put("link_to_tweet", maxLikeslink));
		activityOnTweets.put("average_number_of_likes",
				(likesCount / (Integer.parseInt(searchresult.getJSONObject("search_metadata").getString("count")))));

		activityOnTweets.put("retweets_count", retweetCount);
		activityOnTweets.put("max_retweets",
				new JSONObject(true).put("number", maxRetweets).put("link_to_tweet", maxRetweetslink));
		activityOnTweets.put("average_number_of_retweets",
				(retweetCount / (Integer.parseInt(searchresult.getJSONObject("search_metadata").getString("count")))));

		activityOnTweets.put("hashtags_used_count", hashtagCount);
		activityOnTweets.put("max_hashtags",
				new JSONObject(true).put("number", maxHashtags).put("link_to_tweet", maxHashtagslink));
		activityOnTweets.put("average_number_of_hashtags_used",
				(hashtagCount / (Integer.parseInt(searchresult.getJSONObject("search_metadata").getString("count")))));

		activityCharts.put("likes", likesChart);
		activityCharts.put("retweets", retweetChart);
		activityCharts.put("hashtags", hashtagsChart);
		activityOnTweets.put("frequency_charts", activityCharts);
		finalresult.put("activity_on_my_tweets", activityOnTweets);

		// content analysis
		JSONObject contentAnalysis = new JSONObject(true);
		JSONObject languageAnalysis = new JSONObject(true);
		JSONObject sentimentAnalysis = new JSONObject(true);
		Set<String> languageSet = new HashSet<String>(languageList), sentimentSet = new HashSet<String>(sentimentList);

		for (String s : languageSet) {
			languageAnalysis.put(s, Collections.frequency(languageList, s));
		}

		for (String s : sentimentSet) {
			sentimentAnalysis.put(s, Collections.frequency(sentimentList, s));
		}
		contentAnalysis.put("language_analysis", languageAnalysis);
		contentAnalysis.put("sentiment_analysis", sentimentAnalysis);
		finalresult.put("content_analysis", contentAnalysis);
		finalresultarray.put(finalresult);
		json.setData(finalresultarray);
		return json;
	}
}

The code first fetches the Search API data for the username (it downloads the response into a byte array and then converts it into a JSONObject). It returns a SusiThought object so that it can be integrated into Susi. Once that is done, we are all set to analyse the data.
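For example, a request of the following form triggers the whole analysis (host and port here are just an assumed local loklak setup; adjust them to your deployment):

http://localhost:9000/api/twitanalysis.json?screen_name=melaniatrump&count=1000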

Let us go through the code feature-by-feature and I’ll explain what parts of the code implement those features:

Activity Frequency

For this, I initialised three ArrayLists, tweetDate, tweetHour and tweetDay, which hold the date, hour and day extracted from each tweet's timestamp. The extraction is done using a Calendar instance (the created_at timestamp is parsed and converted into a Date, from which the month, day and hour are taken). Once the values are stored in the lists, I use Collections.frequency from java.util to calculate how often each value occurs, iterating over a HashSet of the values so that each unique value is counted exactly once (a minimal sketch of this counting pattern follows the screenshot). So on running, the Activity Frequency of my username looks like this:

Activity Frequency
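Here is that counting pattern in isolation, as a minimal standalone sketch (hypothetical class name, toy data):

import java.util.*;

public class FrequencySketch {
    public static void main(String[] args) {
        // one label per tweet, e.g. the weekday extracted from its timestamp
        List<String> tweetDay = Arrays.asList("Monday", "Friday", "Monday", "Monday");
        Map<String, Integer> dailyact = new HashMap<String, Integer>();
        // iterate over unique labels only, then count occurrences in the full list
        for (String day : new HashSet<String>(tweetDay)) {
            dailyact.put(day, Collections.frequency(tweetDay, day));
        }
        System.out.println(dailyact); // e.g. {Monday=3, Friday=1}
    }
}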

Activity Type

For this, I extracted the activity types from the Search API and summed up every category (imgCount += status.getInt("images_count"); etc.), then directly added them to the JSONObject. Easy.

Activity Type

Activity on User’s Activity

For this, we analyse the stats on likes, retweets and hashtags. I again create three ArrayLists for them (they store the per-tweet counts), and in the main loop I track the maximum number of likes, retweets and hashtags along with a link to the tweet that achieved each maximum, and sum up the totals. Then I again use Collections to build a frequency chart, calculate the averages, and add them all into a JSONObject.

Activity on User's Activity

Content Analysis

This was again short: we extract the emotion and language from the Search API, count the occurrences, and put them in the JSONObject like we have done with everything else.

Content Analysis

This completes the TwitterAnalysis servlet: it gives you the entire profile analysis of a Twitter Profile, all powered by loklak.

This was relatively simple to do. In my next blog post, I will discuss a bit about how loklak scrapes this data and makes it available, and how I integrated this application into Susi, so that we can get interesting profile data from a chatbot effortlessly. As always, feedback is welcome 🙂

Social Media Analysis using Loklak (Part 1)