Generic Scraper – All about the Article API

Generic Scraper now uses algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page. It uses boilerpipe, a Java library which provides algorithms to detect the main content of a blog or article page.

Why not JSoup? 

A traditional DOM-extraction approach is not advisable for such generic scrapers. When I scraped WordPress blog posts, certain common HTML tags were shared across many blog hosting platforms; but when I tried Medium blogs, the scraper (which was written using JSoup) failed. That approach is specific to each web link, and covering every possible web link is a very tedious job.

Why Boilerpipe? 

Boilerpipe is an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.

The following are the available boilerpipe extractor types:

  • DefaultExtractor – A quite generic full-text extractor, but not as good as ArticleExtractor.
  • ArticleExtractor – A full-text extractor which is specialized on extracting articles. It has higher accuracy than DefaultExtractor.
  • ArticleSentencesExtractor – Like ArticleExtractor, but tuned towards extracting full sentences.
  • KeepEverythingExtractor – Keeps everything. We can use this for extracting the title and description.
  • KeepEverythingWithMinKWordsExtractor – Keeps all text blocks with at least k words.
  • LargestContentExtractor – Like DefaultExtractor, but keeps only the largest content block.
  • NumWordsRulesExtractor – A generic extractor based solely on the number of words per block.
  • CanolaExtractor – A full-text extractor trained on a corpus (Canola).

These can be used as extractor keys according to the requirement. As of now, only the ArticleExtractor is implemented; the other extractor types will follow. The sketch below illustrates how such keys could be dispatched.
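
Each boilerpipe extractor is exposed as a singleton, so dispatching on an extractor key could look like the following sketch. This mapping is illustrative only (the key names are made up, and the current servlet always uses ArticleExtractor); KeepEverythingWithMinKWordsExtractor is omitted since it needs a k parameter.

import de.l3s.boilerpipe.extractors.*;

public class ExtractorPicker {
    // Illustrative dispatch from an extractor key to a boilerpipe singleton.
    public static ExtractorBase forKey(String key) {
        switch (key) {
            case "default":        return DefaultExtractor.INSTANCE;
            case "article":        return ArticleExtractor.INSTANCE;
            case "sentences":      return ArticleSentencesExtractor.INSTANCE;
            case "everything":     return KeepEverythingExtractor.INSTANCE;
            case "largestcontent": return LargestContentExtractor.INSTANCE;
            case "numwords":       return NumWordsRulesExtractor.INSTANCE;
            case "canola":         return CanolaExtractor.INSTANCE;
            default:               return ArticleExtractor.INSTANCE;
        }
    }
}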

It is an excellent tool that intelligently removes unwanted HTML tags and even irrelevant text from a web page. It extracts the content fast, in milliseconds, with a minimal requirement of inputs: it does not require global or site-level information, and is usually quite accurate.

Benefits:

  • Much smarter than regular expressions.
  • Provides several extraction methods.
  • Returns text in a variety of formats.
  • Helps to avoid the manual process of finding content patterns on each source site.
  • Helps to remove boilerplate like headers, footers, menus and advertisements.

The output of the extraction can be HTML, text or JSON. Given below is the list of output formats.

  • Html (default) : To output the whole HTML document.
  • htmlFragment : To output only those HTML fragments that are regarded as main content.
  • Text : To output the extracted main content as plain text.
  • Json : To output the extracted main content as plain JSON.
  • Debug : To output debug information to understand how boilerpipe internally represents a document.

Here’s the Java code which imports the boilerpipe package and does the article extraction.


/**
 *  GenericScraper
 *  Copyright 16.06.2016 by Damini Satya, @daminisatya
 *
 *  This library is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU Lesser General Public
 *  License as published by the Free Software Foundation; either
 *  version 2.1 of the License, or (at your option) any later version.
 *  
 *  This library is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *  Lesser General Public License for more details.
 *  
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program in the file lgpl21.txt
 *  If not, see <http://www.gnu.org/licenses/>.
 */

package org.loklak.api.search;

import java.io.IOException;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URL;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.json.JSONObject;
import org.loklak.http.RemoteAccess;
import org.loklak.server.Query;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class GenericScraper extends HttpServlet {

	private static final long serialVersionUID = 4653635987712691127L;

	/**
	 * Print the JSON response.
	 * @param response
	 * @param genericScraperData
	 */
	public void printJSON(HttpServletResponse response, JSONObject genericScraperData) throws ServletException, IOException {
		response.setCharacterEncoding("UTF-8");
		PrintWriter sos = response.getWriter();
		sos.print(genericScraperData.toString(2));
		sos.println();
	}

	/**
	 * Article API
	 * @param url
	 * @param genericScraperData
	 * @return genericScraperData
	 */
	public JSONObject articleAPI (String url, JSONObject genericScraperData) throws MalformedURLException{
        URL qurl = new URL(url);
        String data = "";

        try {
            data = ArticleExtractor.INSTANCE.getText(qurl);
            genericScraperData.put("query", qurl);
            genericScraperData.put("data", data);
            genericScraperData.put("NLP", "true");
        }
        catch (Exception e) {
            if ("".equals(data)) {
                try 
                {
                    Document htmlPage = Jsoup.connect(url).get();
                    data = htmlPage.text();
                    genericScraperData.put("query", qurl);
                    genericScraperData.put("data", data);
                    genericScraperData.put("NLP", "false");
                }
                catch (Exception ex) {
                    // both boilerpipe and the JSoup fallback failed; leave the data field empty
                }
            }
        }

        return genericScraperData;
    }

	@Override
	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		doGet(request, response);
	}

	@Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {

        Query post = RemoteAccess.evaluate(request);

        String url = post.get("url", "");
        String type = post.get("type", "");

        JSONObject genericScraperData = new JSONObject(true);
        if ("article".equals(type)) {
            genericScraperData = articleAPI(url, genericScraperData);
            genericScraperData.put("type", type);
            printJSON(response, genericScraperData);
        } else {
            genericScraperData.put("error", "Please mention the type of scraper");
            printJSON(response, genericScraperData); 
        } 
    } 
}

Try this sample query

http://localhost:9000/api/genericscraper.json?url=http://stackoverflow.com/questions/15655012/how-final-keyword-works&type=article

And this is the sample output.

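Reconstructed from the servlet code above (the screenshot showed the JSON response), the output has roughly this shape; the data value is abridged:

{
  "query": "http://stackoverflow.com/questions/15655012/how-final-keyword-works",
  "data": "How does the final keyword in Java work? ... (extracted article text) ...",
  "NLP": "true",
  "type": "article"
}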

That’s all, folks! I will post updates on further improvements.



Time across seven seas…

It has been rightly said:

Time is of your own making
Its clock ticks in your head.
The moment you stop thought
Time too stops dead.


Hence to keep up with evolving times, Loklak has now introduced a new service for “time”.

The recently developed API provides the current time and day at the location queried by the user.

The /api/locationwisetime.json API scrapes the results from timeanddate.com using our favourite JSoup, which provides a very convenient API for fetching a URL and extracting and parsing the HTML data.

In case of multiple locations with the same name, the countries are also provided, along with the corresponding day and time, all wrapped up as a JSONObject.

A sample query could then be something like: http://loklak.org/api/locationwisetime.json?query=london

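The response wraps the matching locations in a data array of location/time pairs, as in this example (times are whatever was current when the page was scraped):

{
  "data": [
    {
      "location": "Canada - Ontario - London",
      "time": "Sun 9:33 PM"
    },
    {
      "location": "South Africa - East London",
      "time": "Mon 3:33 AM"
    }
  ]
}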


When implemented as a console service, this API can be used along with our dear SUSI by utilising API endpoints like: http://loklak.org/api/console.json?q=SELECT * FROM locationwisetime WHERE query='berlin';


LocationWiseTimeService.java for reference:


/**
 *  Location Wise Time
 *  timeanddate.com scraper
 *  Copyright 27.07.2016 by Jigyasa Grover, @jig08
 *
 *  This library is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU Lesser General Public
 *  License as published by the Free Software Foundation; either
 *  version 2.1 of the License, or (at your option) any later version.
 *  
 *  This library is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *  Lesser General Public License for more details.
 *  
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program in the file lgpl21.txt
 *  If not, see <http://www.gnu.org/licenses/>.
 */

package org.loklak.api.search;

import java.io.IOException;

import javax.servlet.http.HttpServletResponse;

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.loklak.server.APIException;
import org.loklak.server.APIHandler;
import org.loklak.server.AbstractAPIHandler;
import org.loklak.server.Authorization;
import org.loklak.server.BaseUserRole;
import org.loklak.server.Query;
import org.loklak.susi.SusiThought;
import org.loklak.tools.storage.JSONObjectWithDefault;

public class LocationWiseTimeService extends AbstractAPIHandler implements APIHandler {

	private static final long serialVersionUID = -1495493690406247295L;

	@Override
	public String getAPIPath() {
		return "/api/locationwisetime.json";
	}

	@Override
	public BaseUserRole getMinimalBaseUserRole() {
		return BaseUserRole.ANONYMOUS;

	}

	@Override
	public JSONObject getDefaultPermissions(BaseUserRole baseUserRole) {
		return null;
	}

	@Override
	public JSONObject serviceImpl(Query call, HttpServletResponse response, Authorization rights,
			JSONObjectWithDefault permissions) throws APIException {
		String query = call.get("query", "");
		return locationWiseTime(query);
	}

	public static SusiThought locationWiseTime(String query) {
		
		Document html = null;

		JSONArray arr = new JSONArray();

		try {
			html = Jsoup.connect("http://www.timeanddate.com/worldclock/results.html?query=" + query).get();
		} catch (IOException e) {
			e.printStackTrace();
			// return an empty result if the page could not be fetched
			SusiThought empty = new SusiThought();
			empty.setData(arr);
			return empty;
		}

		// location names and their times alternate in successive <td> cells
		Elements locations = html.select("td");
		int i = 0;
		for (Element e : locations) {
			if (i % 2 == 0) {
				JSONObject obj = new JSONObject();
				String l = e.getElementsByTag("a").text();
				obj.put("location", l);
				String t = e.nextElementSibling().text();
				obj.put("time", t);
				arr.put(obj);
			}
			i++;
		}
		
		SusiThought json = new SusiThought();
		json.setData(arr);
		return json;
	}

}


Hope this helps, and worth the “time” 😛

Feel free to ask questions regarding the above code snippet, shall be happy to assist.

Feedback and Suggestions welcome 🙂


Architectural design for supporting susi on multiple messaging services

Susi has been evolving and learning more every single day, leveraging the billion+ tweets that the loklak server has indexed. The next important step would be to hook up Susi’s capabilities in a fashion that the world can easily use: a best friend powered by the data that is scraped from every single platform available. With this in mind, we first dug deep into Facebook Messenger, potentially exposing Susi’s capabilities to more than a billion people on the planet. But as we scale and move to other agents like Telegram, Slack etc., we need some architectural changes to minimize code duplication as well as the number of resources that we consume. In this blog post, I’ll walk you all through the design decisions for the architecture planned to expose Susi to the world.

This is a detailed architecture for running all the different messaging services that we wish to support in the near future. Chat apps and messengers are becoming very important, many a time the very first app one opens on a smartphone. It is very important that the data in Loklak is made sense of for the people out there, and Susi is a great step in that process: using the Twitter data, and the data from other scrapers and sources, so that information can be given to the people querying for it. Running a lot of services is really simple when we set up each of them on a separate server, but running the same code on multiple servers just to cater to one single messenger-like platform? Nah, not a great idea.

Almost all of the messenger platforms, be it Facebook Messenger, Telegram, Slack or anything else, work the same way: event driven, using webhooks. The idea here is to have multiple such webhooks, and to create validation endpoints for the platforms that verify the server with a GET request, the way Facebook does before subscribing. At the same time, many of them need SSL certificates so that the service can be set up; this part is simplified by Heroku hosting and the default SSL that it provides for every application URL.
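
As a minimal sketch of such a validation endpoint (the hub.* parameter names are the ones Facebook sends; the servlet itself and the verify token are illustrative, the real bots being Node.js apps):

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class WebhookVerification extends HttpServlet {

    // a value we choose, and also configure in the messenger platform's app settings
    private static final String VERIFY_TOKEN = "my-secret-token"; // illustrative

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException {
        // Facebook sends hub.mode, hub.verify_token and hub.challenge;
        // echoing the challenge back proves that we own the endpoint.
        if ("subscribe".equals(request.getParameter("hub.mode"))
                && VERIFY_TOKEN.equals(request.getParameter("hub.verify_token"))) {
            response.getWriter().print(request.getParameter("hub.challenge"));
        } else {
            response.sendError(403, "verification failed");
        }
    }
}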

All the services residing in the same server host/application can share a common query library, i.e. one that makes the requests to /api/susi.json and returns the corresponding JSON, or just the answer entry, which is available at body.answers[0].actions[0].expression. A more modular architecture can be targeted during the cleanup by sorting each of these services into its own folder and routing from index.js. In such a system, index.js behaves as a proxy layer forwarding requests to the corresponding service agent, rather than everything scanning through the entire index.js file as it does now. So the application structure over time would look like this, with a sketch of the shared query helper after the tree:

Messenger architecture diagram, shown here as the planned directory layout:

|- Common\QueryBuilder.js (Common Library to be used across)
|- Facebook/
|--------\facebook.js
|--------\supportFiles.js
|- Slack/
|- Telegram/
|- Susi's Chat Interface/
|- Other Services ...,
|- index.js (Route to required agent)
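
And a minimal sketch of the shared query helper, written in Java for consistency with the rest of this blog (the bots themselves are Node.js; the host and the q parameter are assumptions about /api/susi.json):

import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

import org.json.JSONObject;

public class SusiQuery {

    // ask susi.json a question and return the answer expression
    public static String ask(String host, String question) throws Exception {
        String url = host + "/api/susi.json?q=" + URLEncoder.encode(question, "UTF-8");
        try (Scanner in = new Scanner(new URL(url).openStream(), "UTF-8")) {
            JSONObject body = new JSONObject(in.useDelimiter("\\A").next());
            // the answer entry lives at body.answers[0].actions[0].expression
            return body.getJSONArray("answers").getJSONObject(0)
                       .getJSONArray("actions").getJSONObject(0)
                       .getString("expression");
        }
    }
}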

Push & Pull : Scraped Data into Index and back

With many scrapers being integrated into the Loklak server, it is only natural that the load on the server increases when a multitude of requests has to be served every millisecond.

Initially, when Loklak only harvested tweets from Twitter, Elasticsearch was implemented along with a Data Access Object (DAO) to handle the indexing.

The JSON object(s) pushed into the index were of the form statuses, and had to be in a specific format to be shoved into and retrieved easily from the index.

Sample:


{
  "statuses": [
    {
      "id_str": "yourmessageid_1234",
      "screen_name": "testuser",
      "created_at": "2016-07-22T07:53:24.000Z",
      "text": "The rain is spain stays always in the plain",
      "source_type": "GENERIC",
      "place_name": "Georgia, USA",
      "location_point": [
        3.058579854228782,
        50.63296878274201
      ],
      "location_radius": 0,
      "user": {
        "user_id": "youruserid_5678",
        "name": "Mr. Bob"
      }
    }
  ]
}

But with the inclusion of many other scrapers, like GitHub, WordPress, Eventbrite etc., and RSS readers, it was a bit cumbersome to use the exact same format as Twitter, because not all fields matched.

For example, the location-wise time scraper (seen earlier in this blog) returns:


{
  "data": [
    {
      "location": "Canada - Ontario - London",
      "time": "Sun 9:33 PM"
    },
    {
      "location": "South Africa - East London",
      "time": "Mon 3:33 AM"
    }
  ]
}

Hence, Scott suggested implementing a DAO wrapper which would enable us to use the same schema as the Twitter index to push and pull data.

The DAO wrapper was implemented as a GenericJSONBuilder, which adds the remaining data fields, other than the text itself, into the same schema by serializing them into the text field (see buildFieldJSON below).

Peeking into the code:


package org.loklak.data;

import javafx.util.Pair;
import org.loklak.objects.MessageEntry;
import org.loklak.objects.QueryEntry;
import org.loklak.objects.SourceType;
import org.loklak.objects.UserEntry;

import java.net.MalformedURLException;
import java.util.*;

/**
 * The json below is the minimum json:
 * {
 *   "statuses": [
 *     {
 *       "id_str": "yourmessageid_1234",
 *       "screen_name": "testuser",
 *       "created_at": "2016-07-22T07:53:24.000Z",
 *       "text": "The rain is spain stays always in the plain",
 *       "source_type": "GENERIC",
 *       "place_name": "Georgia, USA",
 *       "location_point": [3.058579854228782,50.63296878274201],
 *       "location_radius": 0,
 *       "user": {
 *         "user_id": "youruserid_5678",
 *         "name": "Mr. Bob"
 *       }
 *     }
 *   ]
 * }
 */
public class DAOWrapper {
    public static final class GenericJSONBuilder{
        private String id_str = null;
        private String screen_name = "unknown";
        private Date created_at = null;
        private String text = "";
        private String place_name = "unknown";
        private String user_name = "[email protected]";
        private String user_id = "unknown";
        private String image = null;
        private double lng = 0.0;
        private double lat = 0.0;
        private int loc_radius = 0;
        private ArrayList<String> extras = new ArrayList<String>();


        /**
         * Not required
         * @param author
         * @param domain
         * @return
         */
        public GenericJSONBuilder setAuthor(String author, String domain){
            user_name = author + "@" + domain;
            screen_name = author;
            return this;
        }

        /**
         * Not required
         * @param user_id_
         * @return
         */
        public GenericJSONBuilder setUserid(String user_id_){
            user_id = user_id_;
            return this;
        }

        /**
         * Not required
         * @param id_str_
         * @return
         */
        public GenericJSONBuilder setIDstr(String id_str_){
            id_str = id_str_;
            return this;
        }

        /**
         * Not required
         * @param createdTime
         * @return
         */
        public GenericJSONBuilder setCreatedTime(Date createdTime){
            created_at = createdTime;
            return this;
        }

        /**
         * Required
         * This is the text field. You can use JSON style in this field
         * @param text_
         * @return
         */
        public GenericJSONBuilder addText(String text_){
            text = text + text_;
            return this;
        }

        /**
         * Not required
         * @param name
         * @return
         */
        public GenericJSONBuilder setPlaceName(String name){
            place_name = name;
            return this;
        }

        /**
         * Not required
         * @param longtitude
         * @param latitude
         * @return
         */
        public GenericJSONBuilder setCoordinate(double longtitude, double latitude){
            lng = longtitude;
            lat = latitude;
            return this;
        }

        /**
         * Not required
         * @param radius
         * @return
         */
        public GenericJSONBuilder setCoordinateRadius(int radius){
            loc_radius = radius;
            return this;
        }


        /**
         * Not required
         * @param key
         * @param value
         * @return
         */
        public GenericJSONBuilder addField(String key, String value){
            String pair_string = "\"" + key + "\": \"" + value + "\"";
            extras.add(pair_string);
            return this;
        }

        private String buildFieldJSON(){
            String extra_json = "";
            for(String e:extras){
                extra_json =  extra_json + e + ",";
            }
            if(extra_json.length() > 2) extra_json = "{" + extra_json.substring(0, extra_json.length() -1) + "}";
            return extra_json;
        }

        /**
         * Not required
         * @param link_
         * @return
         */
        public GenericJSONBuilder setImage(String link_){
            image = link_;
            return this;
        }

        public void persist(){
            try{
                //building message entry
                MessageEntry message = new MessageEntry();

                /**
                 * Use hash of text if id of message is not set
                 */
                if(id_str == null)
                    id_str = String.valueOf(text.hashCode());

                message.setIdStr(id_str);

                /**
                 * Get current time if not set
                 */
                if(created_at == null)
                    created_at = new Date();
                message.setCreatedAt(created_at);


                /**
                 * Append the field as JSON text
                 */
                message.setText(text + buildFieldJSON());

                double[] locPoint = new double[2];
                locPoint[0] = lng;
                locPoint[1] = lat;

                message.setLocationPoint(locPoint);

                message.setLocationRadius(loc_radius);

                message.setPlaceName(place_name, QueryEntry.PlaceContext.ABOUT);
                message.setSourceType(SourceType.GENERIC);

                /**
                 * Insert if there is a image field
                 */
                if(image != null) message.setImages(image);

                //building user
                UserEntry user = new UserEntry(user_id, screen_name, "", user_name);

                //build message and user wrapper
                DAO.MessageWrapper wrapper = new DAO.MessageWrapper(message,user, true);

                DAO.writeMessage(wrapper);
            } catch (MalformedURLException e){
                // a URL field (e.g. the image link) was malformed; the message is skipped
            }
        }
    }

    public static GenericJSONBuilder builder(){
        return new GenericJSONBuilder();
    }

    public static void insert(Insertable msg){

        GenericJSONBuilder bd = builder()
        .setAuthor(msg.getUsername(), msg.getDomain())
        .addText(msg.getText())
        .setUserid(msg.getUserID());

        /**
         * Insert the extra fields
         */
        List<Pair<String, String>> fields = msg.getExtraField();
        for(Pair<String, String> field:fields){
            bd.addField(field.getKey(), field.getValue());
        }

        // finally write the built message into the index
        bd.persist();
    }
}

The DAOWrapper was then used with other scrapers to push the data into the index as:


...
DAOWrapper.builder()
    .addText(json.toString())
    .setUserid("profile_" + profile)
    .persist();
...

Here, addText(...) can be used several times to append text to the object, but each set...(...) method should be used only once, and persist() should also be called only once, as this is the method which finally pushes the message into the index. Note that builder() returns a fresh GenericJSONBuilder on each call, so all the calls must be chained on (or stored in) a single builder instance.
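
Putting the builder methods together, a fuller usage could look like this; all method names are the ones defined above, and the values are made up:

DAOWrapper.builder()
    .setAuthor("jane", "example.org")   // becomes user_name "jane@example.org", screen_name "jane"
    .setUserid("profile_jane")
    .addText("A scraped blog post ...")
    .setPlaceName("Berlin")
    .setCoordinate(13.4050, 52.5200)    // longitude, latitude
    .addField("source", "wordpress")    // appended to the text as a JSON field
    .persist();                         // builds the MessageEntry and writes it to the index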

Now, when a scraper receives a request to scrape a given HTML page, it first checks whether the data already exists in the index, with the help of a unique user ID string. This saves the time and effort of scraping the page all over again; instead, the saved instance is simply returned.

The check is done something like this:


if(DAO.existUser("profile_"+profile)){
    /*
     *  Return existing JSON Data
    */
}else{
    /*
     *  Scrape the HTML Page addressed by the given URL
    */
}

This pushing into and pulling from the index should considerably reduce the load on the Loklak server.

Feel free to ask questions regarding the above.

Feedback and Suggestions welcome 🙂


The Making of the Console Service

SUSI, our very own personal digital assistant, has been up and running, giving quirky answers.

But behind all this are rules which train our cute bot, helping her decide what answers to provide after parsing the question asked by the user.

The questions could range from formal and informal greetings, and general queries about name, weather, date or time, to specific ones like details about some random GitHub profile, tweets and replies from Twitter or Weibo, election or football score predictions, or simply asking her to read an RSS feed or a WordPress blog for you.

The rules for her training are written after the specific service is implemented, which helps her fetch the particular website/social network in question, scrape data out of it, and present it to her operator.

And to help us expand the scope and ability of this naive being, it would be helpful if users could extend her rule set. For this, console services are required for sites which do not provide access to information without OAuth.

To begin with, let us see how a console service can be made.

We start with a SampleService class, which basically includes the rudimentary scraper or data-fetching code, and is defined in the package org.loklak.api.search.
It is made by extending the AbstractAPIHandler class, which itself extends the javax.servlet.http.HttpServlet class.
The SampleService class further implements the APIHandler interface.

A placeholder for SampleService class can be as:


package org.loklak.api.search;

/**
* import statements
**/

public class SampleService extends AbstractAPIHandler 
    implements APIHandler{

    private static final long serialVersionUID = 2142441326498450416L;
    /**
     * serialVersionUID could be 
     * auto-generated by the IDE used
    **/

    @Override
    public String getAPIPath() {
        return "/api/service.json";
        /**
         *Choose API path for the service in question
        **/
    }

    @Override
    public BaseUserRole getMinimalBaseUserRole() {
        return BaseUserRole.ANONYMOUS;
    }

    @Override
    public JSONObject getDefaultPermissions(BaseUserRole baseUserRole) {
        return null;
    }

    @Override
    public JSONObject serviceImpl(Query call, HttpServletResponse response, 
        Authorization rights, JSONObjectWithDefault permissions) 
        throws APIException {

        String url = call.get("url", "");
        /**
         *This would extract the argument that will be supplied
         * to the "url" parameter in the "call"
        **/
        return crawlerForService(url);

    }

    public static SusiThought crawlerForService(String url) { // static, so the console skill below can call it directly
        JSONArray arr = new JSONArray();
        
        /**
         * Crawler code or any other function which
         * returns a JSON Array **arr** goes in here 
        **/

        SusiThought json = new SusiThought();
        json.setData(arr);
        return json;
    }

}


The JSONArray in the key function crawlerForService is wrapped up in a SusiThought, which is nothing but a piece of data that can be remembered. The structure of the thought can be modeled as a table, which may be created using the retrieval of information from elsewhere, applied to the current argument.
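
Concretely, a minimal sketch (the row values are illustrative): SusiThought extends JSONObject, and setData(...) places the table rows under a data key.

JSONArray rows = new JSONArray()
        .put(new JSONObject().put("location", "London").put("time", "Sun 9:33 PM"));

SusiThought thought = new SusiThought();
thought.setData(rows);
// serializes to something like {"data":[{"location":"London","time":"Sun 9:33 PM"}]}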

Now, to implement it as a console service, we include it in the ConsoleService class, which is defined in the same package org.loklak.api.search, similarly extends the AbstractAPIHandler class, and implements the APIHandler interface.

Here, dbAccess is a static variable of the type SusiSkills, where a skill is defined as the ability to inspire, i.e. to create thoughts from perception. The data structure of a skill set is a mapping from perception patterns to lambda expressions which induce thoughts.


package org.loklak.api.search;

/**
 * import statements go here
**/

public class ConsoleService extends AbstractAPIHandler 
    implements APIHandler {

    private static final long serialVersionUID = 8578478303032749879L;
    /**
     * serialVersionUID could be 
     * auto-generated by the IDE used
    **/

    @Override
    public BaseUserRole getMinimalBaseUserRole() { 
        return BaseUserRole.ANONYMOUS; 
    }

    @Override
    public JSONObject getDefaultPermissions(BaseUserRole baseUserRole) {
        return null;
    }

    public String getAPIPath() {
        return "/api/console.json";
    }

    public final static SusiSkills dbAccess = new SusiSkills();
    static {

        /**
         * Other "skills" are defined here
         * by "putting" them in "dbAccess"
        **/

        dbAccess.put(
            Pattern.compile("SELECT\\h+?(.*?)\\h+?FROM\\h+?sampleservice\\h+?WHERE\\h+?url\\h??=\\h??'(.*?)'\\h??;"),
            (flow, matcher) -> {
                /**
                 * SusiThought-s are fetched from the Services
                 * implemented as above
                **/
                SusiThought json = SampleService.crawlerForService(matcher.group(2));
                SusiTransfer transfer = new SusiTransfer(matcher.group(1));
                json.setData(transfer.conclude(json.getData()));
                return json;
            });
    }

    @Override
    public JSONObject serviceImpl(Query post, HttpServletResponse response, 
        Authorization rights, final JSONObjectWithDefault permissions) 
        throws APIException {

            String q = post.get("q", "");
            /**
             *This would extract the argument that will be supplied
             * to the "q" parameter in the "post" query
            **/
            

            return dbAccess.inspire(q);
        }

}


Now that the console service is made, an API endpoint for the same can correspond to: http://localhost:9000/api/console.json?q=SELECT * FROM sampleservice WHERE url = ' … ';

The above can serve as a placeholder for creating a console service, which shall enable SUSI to widen her horizons and become more intelligent.

So go ahead, make SUSI rules using it, and you are done!

If any aid is required in making SUSI Rules, stay tuned for the next post.

Come, contribute to Loklak and SUSI!


Spin-Off: Loklak fuels Open Event

Continuing with the Loklak & Open Event partnership (check out Loklak fuels Open Event), we can now, in a few clicks, easily create our very own web app for an event, with details imported from eventbrite.com, powered by Loklak.

The scraping of data using JSoup, the Java HTML parser, was explained in the previous post of this series.

Next, a console service was implemented as the single point for information retrieval from various social networks and websites (a post on it is coming soon 😉 ), especially for SUSI (our very own personal digital assistant, a cute one indeed!).

The JSONArray result of the EventBriteCrawler was set in a SusiThought, which is nothing but a piece of data that can be remembered. The structure of the thought can be modelled as a table, which may be created using the retrieval of information from elsewhere, applied to the current argument.


/** Defining SusiThought as a class 
 * which extends JSONObject
 */

public class SusiThought extends JSONObject {

/* details coming soon.... */

}

/** Modifications in EventBriteCrawler
 *  Returning SusiThought instead of 
 * a simple JSONObject/JSONArray.
 */
public static SusiThought crawlEventBrite(String url) {
    ...
    ...    
    SusiThought json = new SusiThought();
    json.setData(jsonArray);
    return json;
}


The API endpoint was thus created.
A sample is: http://loklak.org/api/console.json?q=SELECT * FROM eventbrite WHERE url='https://www.eventbrite.fr/e/billets-europeade-2016-concert-de-musique-vocale-25592599153';



The files generated were next imported into the Open Event web app generator, using simple steps.



It’s amazing to see how a great visual platform is provided to edit the details parsed from the plain JSONObject and deploy the personalized web app!


Tadaa!
We have our very own event web app, with all the information imported from eventbrite.com, in a single (well, very few 😛) click(s)!

With this, we conclude the Loklak – Open Event – EventBrite series.

Stay tuned for detailed post on SUSI and Console Services 🙂


Loklak fuels Open Event

First, a bit of general background…

The FOSSASIA Open Event project aims to make it easier for events, conferences and tech summits to easily create web and mobile (currently Android-only) micro apps. The project comprises a data schema for easily storing event details; a server and web front-end that are used to view, modify and update this data easily by the event organizers; a mobile-friendly web-app client to show the event data to attendees; and an Android app template which is used to generate specific apps for each event.

And Eventbrite is the world’s largest self-service ticketing platform. It allows anyone to create, share and find events: music festivals, marathons, conferences, hackathons, air guitar contests, political rallies, fundraisers, gaming competitions etc.

Kaboom!

Loklak now has a dedicated Eventbrite scraper API which takes in the URL of the event listing on eventbrite.com and outputs the JSON files required by the Open Event generator, viz. events.json, organizer.json, user.json, microlocations.json, sessions.json, session_types.json, tracks.json, sponsors.json, speakers.json, social_links.json and custom_forms.json (details: Open Event Server API documentation).

What do we do differently than using the Eventbrite API? No authentication tokens are required. This gels in perfectly with the Loklak missive.

To achieve this, I have simply parsed the HTML pages using my favourite JSoup, the Java HTML parser library, because it provides a very convenient API for fetching a URL and extracting and parsing all varieties of HTML.

The API call format is: http://loklak.org/api/eventbritecrawler.json?url=https://www.eventbrite.com/[event-name-and-id]

And in return we get all the details from the Eventbrite page as a JSONObject, which also gets stored in separate files in a zipped folder [userHome + "/Downloads/EventBriteInfo"].

Example:

Event URL: https://www.eventbrite.de/e/global-health-security-focus-africa-tickets-25740798421

API Call: 
http://loklak.org/api/eventbritecrawler.json?url=https://www.eventbrite.de/e/global-health-security-focus-africa-tickets-25740798421

Output: JSON object on screen, and events.json, organizer.json, user.json, microlocations.json, sessions.json, session_types.json, tracks.json, sponsors.json, speakers.json, social_links.json and custom_forms.json files written out in a zipped folder locally.



For reference, the code is:

/**
 *  Eventbrite.com Crawler v2.0
 *  Copyright 19.06.2016 by Jigyasa Grover, @jig08
 *
 *  This library is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU Lesser General Public
 *  License as published by the Free Software Foundation; either
 *  version 2.1 of the License, or (at your option) any later version.
 *  
 *  This library is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *  Lesser General Public License for more details.
 *  
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program in the file lgpl21.txt
 *  If not, see http://www.gnu.org/licenses/.
 */

package org.loklak.api.search;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.loklak.http.RemoteAccess;
import org.loklak.server.Query;

public class EventbriteCrawler extends HttpServlet {

	private static final long serialVersionUID = 5216519528576842483L;

	@Override
	protected void doPost(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
		doGet(request, response);
	}

	@Override
	protected void doGet(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
		Query post = RemoteAccess.evaluate(request);

		// manage DoS
		if (post.isDoS_blackout()) {
			response.sendError(503, "your request frequency is too high");
			return;
		}

		String url = post.get("url", "");

		Document htmlPage = null;

		try {
			htmlPage = Jsoup.connect(url).get();
		} catch (Exception e) {
			e.printStackTrace();
		}

		if (htmlPage == null) {
			// cannot continue without the event page
			response.sendError(502, "could not fetch the event page");
			return;
		}

		String eventID = null;
		String eventName = null;
		String eventDescription = null;

		// TODO Fetch Event Color
		String eventColor = null;

		String imageLink = null;

		String eventLocation = null;

		String startingTime = null;
		String endingTime = null;

		String ticketURL = null;

		Elements tagSection = null;
		Elements tagSpan = null;
		String[][] tags = new String[5][2];
		String topic = null; // By default

		String closingDateTime = null;
		String schedulePublishedOn = null;
		JSONObject creator = new JSONObject();
		String email = null;

		Float latitude = null;
		Float longitude = null;

		String privacy = "public"; // By Default
		String state = "completed"; // By Default
		String eventType = "";

		eventID = htmlPage.getElementsByTag("body").attr("data-event-id");
		eventName = htmlPage.getElementsByClass("listing-hero-body").text();
		eventDescription = htmlPage.select("div.js-xd-read-more-toggle-view.read-more__toggle-view").text();

		eventColor = null;

		imageLink = htmlPage.getElementsByTag("picture").attr("content");

		eventLocation = htmlPage.select("p.listing-map-card-street-address.text-default").text();
		startingTime = htmlPage.getElementsByAttributeValue("property", "event:start_time").attr("content").substring(0,
				19);
		endingTime = htmlPage.getElementsByAttributeValue("property", "event:end_time").attr("content").substring(0,
				19);

		ticketURL = url + "#tickets";

		// TODO Tags to be modified to fit in the format of Open Event "topic"
		tagSection = htmlPage.getElementsByAttributeValue("data-automation", "ListingsBreadcrumbs");
		tagSpan = tagSection.select("span");
		topic = "";

		int iterator = 0, k = 0;
		for (Element e : tagSpan) {
			if (iterator % 2 == 0) {
				tags[k][1] = "www.eventbrite.com"
						+ e.select("a.js-d-track-link.badge.badge--tag.l-mar-top-2").attr("href");
			} else {
				tags[k][0] = e.text();
				k++;
			}
			iterator++;
		}

		creator.put("email", "");
		creator.put("id", "1"); // By Default

		latitude = Float
				.valueOf(htmlPage.getElementsByAttributeValue("property", "event:location:latitude").attr("content"));
		longitude = Float
				.valueOf(htmlPage.getElementsByAttributeValue("property", "event:location:longitude").attr("content"));

		// TODO This returns: "events.event" which is not supported by Open
		// Event Generator
		// eventType = htmlPage.getElementsByAttributeValue("property",
		// "og:type").attr("content");

		String organizerName = null;
		String organizerLink = null;
		String organizerProfileLink = null;
		String organizerWebsite = null;
		String organizerContactInfo = null;
		String organizerDescription = null;
		String organizerFacebookFeedLink = null;
		String organizerTwitterFeedLink = null;
		String organizerFacebookAccountLink = null;
		String organizerTwitterAccountLink = null;

		organizerName = htmlPage.select("a.js-d-scroll-to.listing-organizer-name.text-default").text().substring(4);
		organizerLink = url + "#listing-organizer";
		organizerProfileLink = htmlPage
				.getElementsByAttributeValue("class", "js-follow js-follow-target follow-me fx--fade-in is-hidden")
				.attr("href");
		organizerContactInfo = url + "#lightbox_contact";

		Document orgProfilePage = null;

		try {
			orgProfilePage = Jsoup.connect(organizerProfileLink).get();
		} catch (Exception e) {
			e.printStackTrace();
		}

		organizerWebsite = orgProfilePage.getElementsByAttributeValue("class", "l-pad-vert-1 organizer-website").text();
		organizerDescription = orgProfilePage.select("div.js-long-text.organizer-description").text();
		organizerFacebookFeedLink = organizerProfileLink + "#facebook_feed";
		organizerTwitterFeedLink = organizerProfileLink + "#twitter_feed";
		organizerFacebookAccountLink = orgProfilePage.getElementsByAttributeValue("class", "fb-page").attr("data-href");
		organizerTwitterAccountLink = orgProfilePage.getElementsByAttributeValue("class", "twitter-timeline")
				.attr("href");

		JSONArray socialLinks = new JSONArray();

		JSONObject fb = new JSONObject();
		fb.put("id", "1");
		fb.put("name", "Facebook");
		fb.put("link", organizerFacebookAccountLink);
		socialLinks.put(fb);

		JSONObject tw = new JSONObject();
		tw.put("id", "2");
		tw.put("name", "Twitter");
		tw.put("link", organizerTwitterAccountLink);
		socialLinks.put(tw);

		JSONArray jsonArray = new JSONArray();

		JSONObject event = new JSONObject();
		event.put("event_url", url);
		event.put("id", eventID);
		event.put("name", eventName);
		event.put("description", eventDescription);
		event.put("color", eventColor);
		event.put("background_url", imageLink);
		event.put("closing_datetime", closingDateTime);
		event.put("creator", creator);
		event.put("email", email);
		event.put("location_name", eventLocation);
		event.put("latitude", latitude);
		event.put("longitude", longitude);
		event.put("start_time", startingTime);
		event.put("end_time", endingTime);
		event.put("logo", imageLink);
		event.put("organizer_description", organizerDescription);
		event.put("organizer_name", organizerName);
		event.put("privacy", privacy);
		event.put("schedule_published_on", schedulePublishedOn);
		event.put("state", state);
		event.put("type", eventType);
		event.put("ticket_url", ticketURL);
		event.put("social_links", socialLinks);
		event.put("topic", topic);
		jsonArray.put(event);

		JSONObject org = new JSONObject();
		org.put("organizer_name", organizerName);
		org.put("organizer_link", organizerLink);
		org.put("organizer_profile_link", organizerProfileLink);
		org.put("organizer_website", organizerWebsite);
		org.put("organizer_contact_info", organizerContactInfo);
		org.put("organizer_description", organizerDescription);
		org.put("organizer_facebook_feed_link", organizerFacebookFeedLink);
		org.put("organizer_twitter_feed_link", organizerTwitterFeedLink);
		org.put("organizer_facebook_account_link", organizerFacebookAccountLink);
		org.put("organizer_twitter_account_link", organizerTwitterAccountLink);
		jsonArray.put(org);

		JSONArray microlocations = new JSONArray();
		jsonArray.put(microlocations);

		JSONArray customForms = new JSONArray();
		jsonArray.put(customForms);

		JSONArray sessionTypes = new JSONArray();
		jsonArray.put(sessionTypes);

		JSONArray sessions = new JSONArray();
		jsonArray.put(sessions);

		JSONArray sponsors = new JSONArray();
		jsonArray.put(sponsors);

		JSONArray speakers = new JSONArray();
		jsonArray.put(speakers);

		JSONArray tracks = new JSONArray();
		jsonArray.put(tracks);

		JSONObject eventBriteResult = new JSONObject();
		eventBriteResult.put("Event Brite Event Details", jsonArray);

		// print JSON
		response.setCharacterEncoding("UTF-8");
		PrintWriter sos = response.getWriter();
		sos.print(eventBriteResult.toString(2));
		sos.println();

		String userHome = System.getProperty("user.home");
		String path = userHome + "/Downloads/EventBriteInfo";

		new File(path).mkdir();

		try (FileWriter file = new FileWriter(path + "/event.json")) {
			file.write(event.toString());
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try (FileWriter file = new FileWriter(path + "/org.json")) {
			file.write(org.toString());
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try (FileWriter file = new FileWriter(path + "/social_links.json")) {
			file.write(socialLinks.toString());
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try (FileWriter file = new FileWriter(path + "/microlocations.json")) {
			file.write(microlocations.toString());
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try (FileWriter file = new FileWriter(path + "/custom_forms.json")) {
			file.write(customForms.toString());
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try (FileWriter file = new FileWriter(path + "/session_types.json")) {
			file.write(sessionTypes.toString());
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try (FileWriter file = new FileWriter(path + "/sessions.json")) {
			file.write(sessions.toString());
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try (FileWriter file = new FileWriter(path + "/sponsors.json")) {
			file.write(sponsors.toString());
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try (FileWriter file = new FileWriter(path + "/speakers.json")) {
			file.write(speakers.toString());
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try (FileWriter file = new FileWriter(path + "/tracks.json")) {
			file.write(tracks.toString());
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try {
			// write the archive as a file; passing the Downloads directory itself would fail
			zipFolder(path, userHome + "/Downloads/EventBriteInfo.zip");
		} catch (Exception e1) {
			e1.printStackTrace();
		}

	}

	static public void zipFolder(String srcFolder, String destZipFile) throws Exception {
		ZipOutputStream zip = null;
		FileOutputStream fileWriter = null;
		fileWriter = new FileOutputStream(destZipFile);
		zip = new ZipOutputStream(fileWriter);
		addFolderToZip("", srcFolder, zip);
		zip.flush();
		zip.close();
	}

	static private void addFileToZip(String path, String srcFile, ZipOutputStream zip) throws Exception {
		File folder = new File(srcFile);
		if (folder.isDirectory()) {
			addFolderToZip(path, srcFile, zip);
		} else {
			byte[] buf = new byte[1024];
			int len;
			FileInputStream in = new FileInputStream(srcFile);
			zip.putNextEntry(new ZipEntry(path + "/" + folder.getName()));
			while ((len = in.read(buf)) > 0) {
				zip.write(buf, 0, len);
			}
			in.close();
		}
	}

	static private void addFolderToZip(String path, String srcFolder, ZipOutputStream zip) throws Exception {
		File folder = new File(srcFolder);

		for (String fileName : folder.list()) {
			if (path.equals("")) {
				addFileToZip(folder.getName(), srcFolder + "/" + fileName, zip);
			} else {
				addFileToZip(path + "/" + folder.getName(), srcFolder + "/" + fileName, zip);
			}
		}
	}

}

Check out https://github.com/loklak/loklak_server for more…



Feel free to ask questions regarding the above code snippet.

Also, stay tuned for the next part of this post, which shall cover using the scraped information for Open Event.

Feedback and Suggestions welcome 🙂


Now get wordpress blog updates with Loklak !

Loklak shall soon be spoiling its users!

Next, it will be bringing in tiny tweet-like cards showing the blog posts (title, publishing date, author and content) from a given WordPress blog URL.

This feature is certain to expand the realm of Loklak’s missive of building a comprehensive and extensive social network dispensing useful information.


In order to implement this feature, I have again made use of JSoup, the Java HTML parser library, as it provides a very convenient API for fetching a URL and extracting and parsing the HTML data.

The information is scraped using JSoup after the corresponding URL, in the format "https://[username].wordpress.com/", is passed as an argument to the function scrapeWordpress(String blogURL){..}, which returns a JSONObject as the result.

A look at the code snippet:

/**
 *  WordPress Blog Scraper
 *  By Jigyasa Grover, @jig08
 **/

package org.loklak.harvester;

import java.io.IOException;

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WordPressBlogScraper {
	public static void main(String args[]){
		
		String blogURL = "https://loklaknet.wordpress.com/";
		scrapeWordpress(blogURL);		
	}
	
	public static JSONObject scrapeWordpress(String blogURL) {
		
		Document blogHTML = null;
		
		Elements articles = null;
		Elements articleList_title = null;
		Elements articleList_content = null;
		Elements articleList_dateTime = null;
		Elements articleList_author = null;

		String[][] blogPosts = new String[100][4];
		
		//blogPosts[][0] = Blog Title
		//blogPosts[][1] = Posted On
		//blogPosts[][2] = Author
		//blogPosts[][3] = Blog Content
		
		Integer numberOfBlogs = 0;
		Integer iterator = 0;
		
		try {
			blogHTML = Jsoup.connect(blogURL).get();
		} catch (IOException e) {
			e.printStackTrace();
			// return an empty result if the blog could not be fetched
			return new JSONObject();
		}
			
			articles = blogHTML.getElementsByTag("article");
			
			iterator = 0;
			for(Element article : articles){
				
				articleList_title = article.getElementsByClass("entry-title");				
				for(Element blogs : articleList_title){
					blogPosts[iterator][0] = blogs.text().toString();
				}
				
				articleList_dateTime = article.getElementsByClass("posted-on");				
				for(Element blogs : articleList_dateTime){
					blogPosts[iterator][1] = blogs.text().toString();
				}
				
				articleList_author = article.getElementsByClass("byline");				
				for(Element blogs : articleList_author){
					blogPosts[iterator][2] = blogs.text().toString();
				}
				
				articleList_content = article.getElementsByClass("entry-content");				
				for(Element blogs : articleList_content){
					blogPosts[iterator][3] = blogs.text().toString();
				}
				
				iterator++;
				
			}
			
			numberOfBlogs = iterator;
			
			JSONArray blog = new JSONArray();
			
			for(int k = 0; k<numberOfBlogs; k++){
				JSONObject blogpost = new JSONObject();
				blogpost.put("blog_url", blogURL);
				blogpost.put("title", blogPosts[k][0]);
				blogpost.put("posted_on", blogPosts[k][1]);
				blogpost.put("author", blogPosts[k][2]);
				blogpost.put("content", blogPosts[k][3]);
				blog.put(blogpost);
			}			
			
			JSONObject final_blog_info = new JSONObject();
			
			final_blog_info.put("Wordpress blog: " + blogURL, blog);			

			System.out.println(final_blog_info);
			
			return final_blog_info;
		
	}
}


In this, simply an HTTP connection is established and text is extracted with element.text() from inside specific tags, using identifiers like classes or IDs. The tags from which the information was to be extracted were identified by exploring the web page’s HTML source code.
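
For instance, a minimal sketch (entry-title is the class used in the scraper above, and the page is assumed to contain at least one such element):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// fetch the page and read the text inside a known class
Document doc = Jsoup.connect("https://loklaknet.wordpress.com/").get();
String firstTitle = doc.getElementsByClass("entry-title").first().text();
System.out.println(firstTitle);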

The result thus obtained is in the form of a JSON object:

{
  "Wordpress blog: https://loklaknet.wordpress.com/": [
    {
      "posted_on": "June 19, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "shivenmian",
      "title": "loklak_depot – The Beginning: Accounts (Part 3)",
      "content": "So this is my third post in this five part series on loklak_depo... As always, feedback is duly welcome."
    },
    {
      "posted_on": "June 19, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "sopankhosla",
      "title": "Creating a Loklak App!",
      "content": "Hello everyone! Today I will be shifting from course a...ore info refer to the full documentation here. Happy Coding!!!"
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "leonmakk",
      "title": "Loklak Walls Manual Moderation – tweet storage",
      "content": "Loklak walls are going to....Stay tuned for more updates on this new feature of loklak walls!"
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "Robert",
      "title": "Under the hood: Authentication (login)",
      "content": "In the second post of .....key login is ready."
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "jigyasa",
      "title": "Loklak gives some hackernews now !",
      "content": "It's been befittingly said … Also, Stay tuned for more posts on data crawling and parsing for Loklak. Feedback and Suggestions welcome"
    },
    {
      "posted_on": "June 16, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "Damini",
      "title": "Does tweets have emotions?",
      "content": "Tweets do intend some kind o...t of features: classify(feat1,…,featN) = argmax(P(cat)*PROD(P(featI|cat)"
    },
    {
      "posted_on": "June 15, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "sudheesh001",
      "title": "Dockerize the loklak server and publish docker images to IBM Containers on Bluemix Cloud",
      "content": "Docker is an open source...nd to create and deploy instantly as well as scale on demand."
    }
  ]
}


The next step now would include "writeToBackend"-ing and then parsing the JSONObject as desired.

Feel free to ask questions regarding the above code snippet, shall be happy to assist.

Feedback and Suggestions welcome 🙂
