Keeping alive the messenger architecture on free Heroku dynos

Heroku is a cloud application platform, a PaaS that hosts applications written in various programming languages and frameworks. The entire architecture behind the messengers is written in Node.js and deployed to Heroku as its production environment for one main reason: Heroku ships every deployment with a verified, signed SSL certificate. This is needed for the Facebook Messenger integration of Susi and also lets Telegram trigger webhooks over SSL.

A major limitation of the dynos (app deployments) on the free plan is that a deployment automatically goes to sleep after a short period of inactivity, wakes up only when one of its endpoints is hit, and may be awake for at most 18 hours a day. So the goal is to make maximum use of the available resources and keep the app awake only when it is actually needed. With Facebook Messenger this resolves itself, because incoming event webhooks hit the server and wake it up if it is asleep. But what about Slack? The Slack server does not send a notification event to our server when a user mentions Susi, since Susi is just like any other user that can be added to a channel.

We took multiple approaches to fix this. The first is to make the server ping itself at a fixed interval so that it stays awake, which is accomplished with this fragment of code:

setInterval(function() {
	http.get(heroku_deploy_url);
}, 1800000); // pings the deployment URL every 30 minutes

where heroku_deploy_url is read from an environment variable that the user sets to the URL of the deployment.
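
A minimal sketch of how this could be wired up, assuming the URL is exposed through a config var named HEROKU_DEPLOY_URL (that exact variable name is a deployment choice, not something fixed by the original code):

var http = require('http');
var heroku_deploy_url = process.env.HEROKU_DEPLOY_URL; // e.g. http://your-app.herokuapp.com/

if (heroku_deploy_url) {
	setInterval(function() {
		http.get(heroku_deploy_url); // a plain GET is enough to keep the dyno awake
	}, 1800000); // every 30 minutes
}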

Another option is to use New Relic and its availability monitoring to keep sending requests at a fixed interval and thereby keep the server alive. The add-on is set up with the Heroku toolbelt:

heroku addons:add newrelic:standard
heroku addons:open newrelic

then adding the following Ruby rake task and setting the PING_URL with heroku config:add PING_URL=http://longweburl.herokuapp.com

desc "Pings PING_URL to keep a dyno alive"
    task :dyno_ping do
      require "net/http"

      if ENV['PING_URL']
        uri = URI(ENV['PING_URL'])
        Net::HTTP.get_response(uri)
      end
    end

and then scheduling the task with the Heroku Scheduler add-on so that it runs periodically:

heroku addons:add scheduler:standard
heroku addons:open scheduler
rake dyno_ping

The last option is to set up a cron job on the existing loklak.net server that queries the Heroku instance periodically, so that the dyno quota is not exhausted while still providing as much uptime as needed. The best option, however, would be to upgrade to a hobby plan and purchase a dyno to host the resource.


Setting up Susi for access from Telegram Messenger

Telegram is one of the popular communication applications in the open source community and was one of the first apps to ship end-to-end encryption, soon followed by WhatsApp and other messengers. Telegram is used by a large number of people, so we at loklak put on our thinking hats and asked: why not make Susi's capabilities available to Telegram users as well? That is how the Telegram integration of the Ask Susi messengers project was born.

Consuming the Susi API from Telegram is fairly straightforward, because BotFather makes it easy to create bots. The first step is to log into Telegram with your user account, search for BotFather and start a chat with it. BotFather asks a few questions and then provides the required access token. Save this token; that is all that is needed to make the Susi-powered bot available.

[Screenshots: creating the bot through a conversation with BotFather]

This sets up the bot and provides the token. The next step is to use the token and define how the bot responds. The token is kept in an environment variable:

var telegramToken = process.env.TELEGRAM_ACCESS_TOKEN;
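
The handlers shown below match the API of the node-telegram-bot-api package; assuming that is the client in use (the post itself does not name it), the bot object they attach to would be created roughly like this, using long polling so that no public webhook needs to be registered:

var TelegramBot = require('node-telegram-bot-api');
var request = require('request'); // used further down to call the Susi API

var bot = new TelegramBot(telegramToken, { polling: true }); // poll Telegram for new messages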

and then building the response logic around what should happen when a message event is received from Telegram. The standard entry point is the /start message that a user sends when opening the bot:

bot.onText(/\/start/, function (msg, match) {
	var fromId = msg.from.id;
	var resp = 'Hi, I am Susi, You can ask me anything !';
	bot.sendMessage(fromId, resp);
});

This greets the user the first time the bot is used. After that, every message event is read, processed by Susi, and the response is returned:

bot.on('message', function (msg) {
	var chatId = msg.chat.id;
	var queryUrl = 'http://loklak.org/api/susi.json?q=' + encodeURIComponent(msg.text); // encodeURIComponent so characters such as & or # in the message do not break the query
	var message = '';
	// Wait until done and reply
	if (msg.text !== '/start') {
		request({
			url: queryUrl,
			json: true
		}, function (error, response, body) {
			if (!error && response.statusCode === 200) {
				message = body.answers[0].actions[0].expression;
				bot.sendMessage(chatId, message);
			} else {
				message = 'Oops, Looks like Susi is taking a break, She will be back soon';
				bot.sendMessage(chatId, message);
			}
		});
	}
});

With that, Susi's capabilities are now available to all those users on Telegram.


Rolling out Freifunk Router data IOT to Loklak

Freifunk is a non-commercial initiative for free wireless networks. Its vision is to distribute free networks, democratize the communication infrastructure and promote local social structures. There are a lot of routers on the Freifunk network spread across Germany and a few other countries. Previously an IOT system was set up to push data about each of the available Freifunk nodes to loklak.

This time we are stretching it a little further: each collected node is packaged into an object and returned to the user as JSON, so that the information can be used for visualizations or other tasks. This is done with a fetch servlet, and the data looks somewhat like this:

"communities": {
    "aachen": {
      "name": "Freifunk Aachen",
      "url": "http://www.Freifunk-Aachen.de",
      "meta": "Freifunk Regio Aachen"
    }...,
}

"allTheRouters": [
    {
      "id": "60e327366bfe",
      "lat": "50.564485",
      "long": "6.359705",
      "name": "ffac-FeWo-Zum-Sonnenschein-2",
      "community": "aachen",
      "status": "online",
      "clients": 1
    },
    ...
]

The complete JSON dumps can be read by querying the Freifunk network and used to populate loklak with the locations stored on the router network. This information can be harvested every 24 hours to fetch updates for the entire network and refresh the results accordingly.

All of this data is made available at /api/freifunkfetch.json, and the JSON is read from the source URL as follows:

	private static String readAll(Reader rd) throws IOException {
		StringBuilder sb = new StringBuilder();
		int cp;
		while ((cp = rd.read()) != -1) {
			sb.append((char) cp);
		}
		return sb.toString();
	}

	public static JSONObject readJsonFromUrl(String url) throws IOException, JSONException {
		InputStream is = new URL(url).openStream();
		try {
			BufferedReader rd = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
			String jsonText = readAll(rd);
			JSONObject json = new JSONObject(jsonText);
			return json;
		} finally {
			is.close();
		}
	}
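
Once the servlet is running, the combined output can be consumed like any other loklak endpoint. A small Node.js sketch, assuming a local loklak server on port 9000 (the same base URL used in the scraper examples later in these posts):

var http = require('http');

http.get('http://localhost:9000/api/freifunkfetch.json', function (res) {
	var body = '';
	res.on('data', function (chunk) { body += chunk; });
	res.on('end', function () {
		var freifunk = JSON.parse(body);
		// "communities" and "allTheRouters" are the top-level keys shown above
		console.log(Object.keys(freifunk.communities).length + ' communities');
	});
});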

In this way, the data of every Freifunk node is available on the loklak server and is being harvested, adding one more IOT service to the loklak server and harvester.


Improved documentation for Loklak repos

It's the final week of GSoC 2016 and all the projects are nearing completion. Since one of the plugins from FOSSASIA (https://wordpress.org/plugins/tweets-widget/) is already in the WordPress directory, I took this opportunity to write some documentation for the other plugins and the plugin maintenance repo.

The documentation now describes in detail the complete Heroku deployment procedure, both directly from GitHub and using the Heroku toolbelt with Git (see this).

[Screenshot: Heroku deployment section of the documentation]

Docs for updating to a newer version of WordPress have also been added.

[Screenshot: WordPress update instructions]

Screenshots and relevant documentation regarding the Loklak API were added to several plugins.

[Screenshot: readme.txt of https://github.com/fossasia/wp-recent-tweet]

Some screenshots of the plugin (wp-recent-tweet) were also added to the readme.

[Screenshot: plugin screenshots in the wp-recent-tweet readme]


First Loklak plugin added to WordPress directory

Recently the first Loklak plugin was accepted into the WordPress directory. We are planning to add more and more plugins with loklak integration, and this is an awesome start. The Tweets Widget plugin is a minimalist tweet feed which can render tweets from the Loklak.org API or the Twitter API v1.1.

The widget uses the Twitter Auth WordPress API (wp-twitter-api) by timwhitlock and Loklak's PHP API for rendering tweets. Both APIs are added as submodules.

Tweets Widget shows a tweet feed on your WordPress website. You can add it with a shortcode or simply drop it in as a widget using WordPress's drag and drop feature.

It is compatible with both the Twitter and the Loklak API; the settings page lets you select one of the two modes for your current feed (as shown below).

[Screenshot: settings page for selecting the feed mode]

After you provide the default settings, you have to supply a feed title, a handle, the number of tweets and a few more custom options.

[Screenshot: widget configuration options]

See the screenshot of a sample feed below.

[Screenshot: sample tweet feed]

It uses the loklak search API (the original code sample was shown as an image; see the sketch below).

[Screenshot: loklak search API code sample]
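
Since that code sample only survives as an image, here is a rough JavaScript sketch of the kind of request involved; the search endpoint and the q parameter are the standard loklak search API, but the exact query and the fields the widget actually uses are illustrative:

var request = require('request');

request({
	url: 'http://api.loklak.org/api/search.json?q=' + encodeURIComponent('from:fossasia'),
	json: true
}, function (error, response, body) {
	if (!error && response.statusCode === 200) {
		// the response carries the matching tweets in a Twitter-like "statuses" array
		console.log(body.statuses.length + ' tweets fetched');
	}
});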

The main widget code can be seen here.


Scraping and Feeding IOT data sets into Loklak – Part 2

As the integrations for the IOT services got under way, there were challenges, especially with scraping multiple pages at once, as was the case with the NOAA alerts and weather information of the US government. To scrape the live updates, which happen every 5 minutes, the process needed to be simplified so that it does not require running a complex web scraper every time and burning precious cycles. What is really interesting about the website is that the data of any given page can be modeled as XML. Using this, and leveraging the XML and other data conversion logic implemented previously for such tasks, I started digging deeper into how the website works and realized that appending &y=0 to an alerts URL results in XML output. Here is an example of how this works:
https://alerts.weather.gov/cap/wwaatmget.php?x=AKC013&y=0
and
https://alerts.weather.gov/cap/wwaatmget.php?x=AKC013

[Screenshot: HTML alerts page for county AKC013]

The equivalent XML (returned when &y=0 is appended):

[Screenshot: XML from the NOAA source]

Extracting this poses two different challenges: how to efficiently retrieve the list of counties, and how to construct the alert URL for each county. Perl to the rescue!

sub process_statelist {
    my $html = `wget -O- -q https://alerts.weather.gov/`;
    $html =~ [email protected]*summary="Table [email protected]@s;
    $html =~ [email protected]*\s*@@s;
    $html =~ [email protected]\s*.*@@s;
    $html =~ [email protected]\s*@@s;
    %seen = ();

    while ( $html =~ [email protected]/(\w+?)\.php\?x=1">([^<]+)@sg ) {
        my $code = $1;
        my $name = $2;
        $name =~ s/'/\\'/g;
        $name =~ [email protected]\[email protected] @g;
        if (!exists($seen{$code})) {
            push @states_entries, $name;
            push @states_entryValues, $code;
        }
        $seen{$code} = 1;
    }
    open STATE, ">", "states.xml";
    print STATE <<EOF1;




    
EOF1
    foreach my $entry (@states_entries) {
        my $temp = $entry;
        $temp =~ s/'/\\'/g;
        $temp = escapeHTML($temp);
        print STATE "        $temp\n";
    }
    print STATE <<EOF2;
    
    
EOF2
    foreach my $entryValue (@states_entryValues) {
        my $temp = $entryValue;
        print STATE "        $temp\n";
    }
    print STATE <<EOF3;
    


EOF3
    close STATE;
    print "Wrote states.xml.\n";
}

This makes a request to the website and constructs the list of all the states in the USA. Now it's time to collect each state's counties.

sub process_state {
    my $state = shift @_;
    if ( $state !~ /^[a-z]+$/ ) {
        print "Invalid state code: $state (skipped)\n";
        return;
    }

    my $html = `wget -O- -q https://alerts.weather.gov/cap/${state}.php?x=3`;

    my @entries     = ();
    my @entryValues = ();

    $html =~ [email protected].*@@s;
    while ( $html =~
[email protected]\s*?]+>\s*?]+>\s*?]+>\s*?\s*?]+>\s*?]+>([^<]+)\s*?\s*?]+>([^<]+)\s*?\s*@mg
      )
    {
        push @entries,     $2;
        push @entryValues, $1;
    }
    my $unittype = "Entire State";
    if ($state =~ /^mz/) {
        $unittype = "Entire Marine Zone";
    }
    if ($state eq "dc") {
        $unittype = "Entire District";
    }
    if (grep { $_ eq $state } qw(as gu mp um vi) ) {
        $unittype = "Entire Territory";
    }
    if ($state eq "us") {
        $unittype = "Entire Country";
    }
    if ($state eq "mzus") {
        $unittype = "All Marine Zones";
    }
    print COUNTIES <<EOF1;
    
        $unittype
EOF1
    foreach my $entry (@entries) {
        my $temp = $entry;
        $temp =~ s/'/\\'/g;
        $temp = escapeHTML($temp);
        print COUNTIES "        $temp\n";
    }
    print COUNTIES <<EOF2;
    
    
        https://alerts.weather.gov/cap/$state.php?x=0
EOF2
    foreach my $entryValue (@entryValues) {
        my $temp = $entryValue;
        $temp =~ s/'/\\'/g;
        $temp = escapeHTML($temp);
        print COUNTIES "        https://alerts.weather.gov/cap/wwaatmget.php?x=$temp&y=0\n";
    }
    print COUNTIES <<EOF3;
    
EOF3
    print "Processed counties from $state.\n";

}

And voilà, we now have a complete mapping between every single county and the alert URL for that county. The NOAA scraper and parser were quite a challenge, but the data is now available in real time from the loklak server. The information can be passed through the XML parser exposed as a service at /api/xml2json.json, so developers can receive it in the format they need.
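
As a small illustration of the URL scheme, here is a Node.js sketch that builds and fetches the alert feed for one county; the county code and the &y=0 switch are taken directly from the example URLs above:

var https = require('https');

var county = 'AKC013'; // county code from the example above
var alertUrl = 'https://alerts.weather.gov/cap/wwaatmget.php?x=' + county + '&y=0'; // &y=0 returns XML

https.get(alertUrl, function (res) {
	var xml = '';
	res.on('data', function (chunk) { xml += chunk; });
	res.on('end', function () {
		// the XML can now be handed to a converter such as loklak's /api/xml2json.json service
		console.log(xml.substring(0, 200));
	});
});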


Generic Scraper – All about the Article API

The generic scraper now uses algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page. It uses the boilerpipe Java library, which provides algorithms to detect the main article content.

Why not JSoup? 

The traditional approach of extracting content from the DOM is not advisable for a generic scraper. I tried scraping WordPress blog posts, where certain common HTML tags are shared across many blog hosting platforms, but when I tried Medium blogs, the scraper (written using JSoup) failed. That approach is specific to each web link and it is a very tedious job to cover every possible link.

Why Boilerpipe? 

Boilerpipe is an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.

The following are the available boilerpipe extractor types:

  • DefaultExtractor – A full-text extractor, but not as good as ArticleExtractor.
  • ArticleExtractor – A full-text extractor specialized in extracting articles. It has higher accuracy than DefaultExtractor.
  • ArticleSentencesExtractor
  • KeepEverythingExtractor – Gets everything. We can use this for extracting the title and description.
  • KeepEverythingWithMinKWordsExtractor
  • LargestContentExtractor – Like DefaultExtractor, but keeps only the largest content block.
  • NumWordsRulesExtractor
  • CanolaExtractor

These can be used as extractor keys according to the requirement. As of now only the ArticleExtractor is implemented; the other extractor types will follow.

It intelligently removes unwanted HTML tags and even irrelevant text from the web page. It extracts the content very fast, in milliseconds, with minimal input, does not require global or site-level information, and is usually quite accurate.

Benefits:

  • Much smarter than regular expressions.
  • Provides several extraction methods.
  • Returns text in a variety of formats.
  • Helps to avoid the manual process of finding content patterns on the source site.
  • Helps to remove boilerplate like headers, footers, menus and advertisements.

The output of the extraction can be HTML, text or JSON. The possible output formats are listed below.

  • Html (default): outputs the whole HTML document.
  • htmlFragment: outputs only those HTML fragments that are regarded as main content.
  • Text: outputs the extracted main content as plain text.
  • Json: outputs the extracted main content as JSON.
  • Debug: outputs debug information to understand how boilerpipe internally represents a document.

Here’s the Java code which imports the boilerpipe package and does the article extraction:

/**
 *  GenericScraper
 *  Copyright 16.06.2016 by Damini Satya, @daminisatya
 *
 *  This library is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU Lesser General Public
 *  License as published by the Free Software Foundation; either
 *  version 2.1 of the License, or (at your option) any later version.
 *  
 *  This library is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *  Lesser General Public License for more details.
 *  
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program in the file lgpl21.txt
 *  If not, see <http://www.gnu.org/licenses/>.
 */

package org.loklak.api.search;

import java.io.IOException;
import java.io.PrintWriter;
import java.net.URLEncoder;
import java.util.*;
import java.io.*;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.eclipse.jetty.util.log.Log;
import org.json.JSONArray;
import org.json.JSONObject;
import org.loklak.data.DAO;
import org.loklak.http.ClientConnection;
import org.loklak.http.RemoteAccess;
import org.loklak.server.Query;
import org.loklak.tools.CharacterCoding;
import org.loklak.tools.UTF8;

import java.net.URL;
import java.net.MalformedURLException;

import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.BoilerpipeProcessingException;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GenericScraper extends HttpServlet {

	private static final long serialVersionUID = 4653635987712691127L;

	/**
     * PrintJSON
     * @param response
     * @param JSONObject genericScraperData
     */
	public void printJSON(HttpServletResponse response, JSONObject genericScraperData) throws ServletException, IOException {
		response.setCharacterEncoding("UTF-8");
		PrintWriter sos = response.getWriter();
		sos.print(genericScraperData.toString(2));
		sos.println();
	}

	/**
     * Article API
     * @param URL
     * @param JSONObject genericScraperData
     * @return genericScraperData
     */
	public JSONObject articleAPI (String url, JSONObject genericScraperData) throws MalformedURLException{
        URL qurl = new URL(url);
        String data = "";

        try {
            data = ArticleExtractor.INSTANCE.getText(qurl);
            genericScraperData.put("query", qurl);
            genericScraperData.put("data", data);
            genericScraperData.put("NLP", "true");
        }
        catch (Exception e) {
            if ("".equals(data)) {
                try 
                {
                    Document htmlPage = Jsoup.connect(url).get();
                    data = htmlPage.text();
                    genericScraperData.put("query", qurl);
                    genericScraperData.put("data", data);
                    genericScraperData.put("NLP", "false");
                }
                catch (Exception ex) {
                    // neither boilerpipe nor Jsoup could extract anything for this URL
                }
            }
        }

        return genericScraperData;
    }

	@Override
	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		doGet(request, response);
	}

	@Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {

        Query post = RemoteAccess.evaluate(request);

        String url = post.get("url", "");
        String type = post.get("type", "");

        JSONObject genericScraperData = new JSONObject(true);
        if ("article".equals(type)) {
            genericScraperData = articleAPI(url, genericScraperData);
            genericScraperData.put("type", type);
            printJSON(response, genericScraperData);
        } else {
            genericScraperData.put("error", "Please mention type of scraper:");
            printJSON(response, genericScraperData); 
        } 
    } 
}

Try this sample query

http://localhost:9000/api/genericscraper.json?url=http://stackoverflow.com/questions/15655012/how-final-keyword-works&type=article

And this is the sample output.

[Screenshot: sample JSON output]
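
Since the sample output is only shown as a screenshot, the response has roughly the following shape, based on the fields the servlet sets above (the data value is truncated and illustrative):

{
  "query": "http://stackoverflow.com/questions/15655012/how-final-keyword-works",
  "data": "... extracted article text ...",
  "NLP": "true",
  "type": "article"
}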

That’s all folks. Updates on further improvements will follow.


Scraping and Feeding IOT datasets into Loklak – Part 1

There is a lot of open data available online, be it government open data portals or other portals holding information in various formats. Many data portals and IOT devices serve XML, CSV and JSON, so type conversion was previously integrated into Loklak as a very simple method call that converts each source format into the desired destination format. This also makes it easy for other parts of the code to reuse these components. For example, to convert from XML to JSONML, all that is required is a well structured XML document, after which a method call like the one below performs the conversion:

XML.toJSONObject(xmlDataString);

With all the required data conversion logic in place, it was time to start scraping and fetching information from the targeted data sources and IOT devices. In the previous weeks, support for tracking GPS datasets from different GPS devices and rendering their locations was completed, so this time we moved on to the earthquake datasets published by the government. The data is classified by duration and magnitude, with the following fixed values for each:

duration : hour, day, week, month
magnitude: significant, 1.0, 2.5, 4.5

So different sets of queries can be constructed, which roughly translate as follows:
1. Data of significant earthquakes in the last hour
2. Data of significant earthquakes in the last day
3. Data of significant earthquakes in the last week
4. Data of significant earthquakes in the last month
5. Data of earthquakes below magnitude 1.0 in the last hour
6. Data of earthquakes below magnitude 1.0 in the last day
7. Data of earthquakes below magnitude 1.0 in the last week
8. Data of earthquakes below magnitude 1.0 in the last month

Other queries can be constructed in the same way. All of this data is real-time and refreshes at 5 minute intervals, enabling the loklak server to harvest it for use with Susi or to provide it to researchers, scientists and data visualization experts for their own work.
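
The post does not name the exact upstream source, but the duration and magnitude values above match the naming of the USGS GeoJSON summary feeds; under that assumption, the feed URL for any of the queries above can be built like this:

// builds a feed URL from the duration and magnitude values listed above
// (assumes the USGS GeoJSON summary feeds, whose file names follow this scheme)
function earthquakeFeedUrl(magnitude, duration) {
	return 'https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/'
		+ magnitude + '_' + duration + '.geojson';
}

console.log(earthquakeFeedUrl('significant', 'hour')); // significant earthquakes in the last hour
console.log(earthquakeFeedUrl('2.5', 'day'));          // magnitude 2.5 earthquakes in the last day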

Once this stream was implemented, it was time to look at similar data structures and integrate them. Most IOT devices that send out weather related information use a similar structure, so the next target was to integrate Yahi, the haze index; these devices in Singapore monitor air quality. The data looks like this:

[
  {
    "hardwareid": "hardwareid",
    "centerLat": "1.3132409",
    "centerLong": "103.8878271"
  },
  {
    "hardwareid": "48ff73065067555017271587",
    "centerLat": "1.348852",
    "centerLong": "103.926314"
  },
  {
    "hardwareid": "53ff6f066667574829482467",
    "centerLat": "1.3734711",
    "centerLong": "103.9950669"
  },
  {
    "hardwareid": "53ff72065075535141071387",
    "centerLat": "1.3028249",
    "centerLong": "103.762174"
  },
  {
    "hardwareid": "55ff6a065075555332151787",
    "centerLat": "1.2982054",
    "centerLong": "103.8335754"
  },
  {
    "hardwareid": "55ff6b065075555351381887",
    "centerLat": "1.296721",
    "centerLong": "103.787217",
    "lastUpdate": "2015-10-14T16:00:25.550Z"
  },
  {
    "hardwareid": "55ff6b065075555340221787",
    "centerLat": "1.3444644",
    "centerLong": "103.7046901",
    "lastUpdate": "2016-05-19T16:43:03.704Z"
  },
  {
    "hardwareid": "53ff72065075535133531587",
    "centerLat": "1.324921",
    "centerLong": "103.838749",
    "lastUpdate": "2015-11-29T01:45:44.985Z"
  },
  {
    "hardwareid": "53ff72065075535122521387",
    "centerLat": "1.317937",
    "centerLong": "103.911654",
    "lastUpdate": "2015-12-04T09:23:48.912Z"
  },
  {
    "hardwareid": "53ff75065075535117181487",
    "centerLat": "1.372952",
    "centerLong": "103.856987",
    "lastUpdate": "2015-01-22T02:06:23.470Z"
  },
  {
    "hardwareid": "55ff71065075555323451487",
    "centerLat": "1.3132409",
    "fillColor": "green",
    "centerLong": "103.8878271",
    "lastUpdate": "2016-08-21T13:39:01.047Z"
  },
  {
    "hardwareid": "53ff7b065075535156261587",
    "centerLat": "1.289199",
    "fillColor": "blue",
    "centerLong": "103.848112",
    "lastUpdate": "2016-08-21T13:39:06.981Z"
  },
  {
    "hardwareid": "55ff6c065075555332381787",
    "centerLat": "1.2854769",
    "centerLong": "103.8481097",
    "lastUpdate": "2015-03-19T02:31:18.738Z"
  },
  {
    "hardwareid": "55ff70065075555333491887",
    "centerLat": "1.308429",
    "centerLong": "103.796707",
    "lastUpdate": "2015-03-31T00:48:49.772Z"
  },
  {
    "hardwareid": "55ff6d065075555312471787",
    "centerLat": "1.4399071",
    "centerLong": "103.8030919",
    "lastUpdate": "2015-11-15T04:04:41.907Z"
  },
  {
    "hardwareid": "53ff6a065075535139311587",
    "centerLat": "1.310398",
    "fillColor": "green",
    "centerLong": "103.862517",
    "lastUpdate": "2016-08-21T13:38:56.147Z"
  }
]

There is more that has happened with IOT and more interesting scenarios came up along the way; I will detail those in the next follow-up blog post on IOT.
