Generic Scraper – All about the Article API

The generic scraper now uses algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page. It uses the boilerpipe Java library, which provides algorithms to detect and extract the main article content.

Why not JSoup? 

The traditional approach of extracting specific DOM contents is not suitable for such a generic scraper. I tried scraping WordPress blog posts first: certain common HTML tags are used across many blog hosting platforms, but when I tried the same for Medium blogs, the scraper (which was written using JSoup) failed. This approach is specific to each web link, and covering every possible site this way is a very tedious job.

Why Boilerpipe? 

Boilerpipe is an excellent Java library for boilerplate removal and full-text extraction from HTML pages.

The following are the available boilerpipe extractor types:

  • DefaultExtractor – A quite generic full-text extractor, but not as good as ArticleExtractor.
  • ArticleExtractor – A full-text extractor specialized in extracting articles. It usually has higher accuracy than DefaultExtractor.
  • ArticleSentencesExtractor – A full-text extractor tuned towards extracting sentences from news articles.
  • KeepEverythingExtractor – Gets everything. We can use this for extracting the title and description.
  • KeepEverythingWithMinKWordsExtractor – Keeps everything, but only text blocks with at least k words.
  • LargestContentExtractor – Like DefaultExtractor, but it keeps only the largest content block of the page.
  • NumWordsRulesExtractor – A generic full-text extractor based solely on the number of words per block.
  • CanolaExtractor – A full-text extractor trained on the krdWrd Canola corpus.

These can be used as extractor keys according to the requirement. As of now, only the ArticleExtractor is implemented; the other extractor types will follow.

It intelligently removes unwanted HTML tags and even irrelevant text from the web page. It extracts the content within milliseconds with a minimum of inputs, does not require global or site-level information, and is usually quite accurate.

Benefits:

  • Much smarter than regular expressions.
  • Provides several extraction methods.
  • Returns text in a variety of formats.
  • Avoids the manual process of finding content patterns on the source site.
  • Removes boilerplate like headers, footers, menus and advertisements.

The output of the extraction can be HTML, text or JSON. The available output formats are listed below.

  • Html (default) : To output the whole HTML document.
  • htmlFragment : To output only those HTML fragments that are regarded as main content.
  • Text : To output the extracted main content as plain text.
  • Json : To output the extracted main content as plain JSON.
  • Debug : To output debug information to understand how boilerpipe internally represents a document.

Here’s the Java code which imports the boilerpipe package and does the article extraction.

 

/**
 *  GenericScraper
 *  Copyright 16.06.2016 by Damini Satya, @daminisatya
 *
 *  This library is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU Lesser General Public
 *  License as published by the Free Software Foundation; either
 *  version 2.1 of the License, or (at your option) any later version.
 *  
 *  This library is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *  Lesser General Public License for more details.
 *  
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program in the file lgpl21.txt
 *  If not, see <http://www.gnu.org/licenses/>.
 */

package org.loklak.api.search;

import java.io.IOException;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URL;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.json.JSONObject;
import org.loklak.http.RemoteAccess;
import org.loklak.server.Query;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class GenericScraper extends HttpServlet {

	private static final long serialVersionUID = 4653635987712691127L;

	/**
     * Print a JSON object to the HTTP response.
     * @param response the servlet response to write to
     * @param genericScraperData the JSON object to print
     */
	public void printJSON(HttpServletResponse response, JSONObject genericScraperData) throws ServletException, IOException {
		response.setCharacterEncoding("UTF-8");
		PrintWriter sos = response.getWriter();
		sos.print(genericScraperData.toString(2));
		sos.println();
	}

	/**
     * Article API
     * @param url the URL to extract the article from
     * @param genericScraperData the JSON object to fill
     * @return genericScraperData with the extracted article text
     */
	public JSONObject articleAPI (String url, JSONObject genericScraperData) throws MalformedURLException{
        URL qurl = new URL(url);
        String data = "";

        try {
            data = ArticleExtractor.INSTANCE.getText(qurl);
            genericScraperData.put("query", qurl);
            genericScraperData.put("data", data);
            genericScraperData.put("NLP", "true");
        }
        catch (Exception e) {
            // Boilerpipe could not process the page; fall back to plain Jsoup text extraction.
            if ("".equals(data)) {
                try {
                    Document htmlPage = Jsoup.connect(url).get();
                    data = htmlPage.text();
                    genericScraperData.put("query", qurl);
                    genericScraperData.put("data", data);
                    genericScraperData.put("NLP", "false");
                }
                catch (Exception ex) {
                    // Neither extractor worked; return the object without extracted data.
                }
            }
        }

        return genericScraperData;
    }

	@Override
	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		doGet(request, response);
	}

	@Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {

        Query post = RemoteAccess.evaluate(request);

        String url = post.get("url", "");
        String type = post.get("type", "");

        JSONObject genericScraperData = new JSONObject(true);
        if ("article".equals(type)) {
            genericScraperData = articleAPI(url, genericScraperData);
            genericScraperData.put("type", type);
            printJSON(response, genericScraperData);
        } else {
            genericScraperData.put("error", "Please specify the type of scraper, e.g. type=article");
            printJSON(response, genericScraperData);
        }
    } 
}

Try this sample query:

http://localhost:9000/api/genericscraper.json?url=http://stackoverflow.com/questions/15655012/how-final-keyword-works&type=article

And this is the sample output.

[Screenshot: the JSON response for the sample query]
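Going by the servlet code above, the response for this query has the following shape (the data value is shortened here):

{
  "query": "http://stackoverflow.com/questions/15655012/how-final-keyword-works",
  "data": "... the extracted article text ...",
  "NLP": "true",
  "type": "article"
}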

That’s all, folks. Updates on further improvements will follow.

 


How Susi linkifies the link?

Susi’s responses contain links to OpenStreetMap as of now. Previously, the UI was not able to handle these links gracefully.

[Screenshot: Susi response with a plain, unclickable link]

The links were not linkified properly: if I wanted to check the contents behind a link, I had to copy the link text and open it manually, which is very bad UX. This blog post explains how Susi linkifies links gracefully without using any third-party plugins or JS libraries.

When I started fixing this, I searched for an existing JS library which could do the job. I came across the Linkify JS library for adding hyperlinks to text and avoiding plain URL displays, but found that the library is pretty huge and quite convoluted to integrate. Instead of using it, I solved this issue with simple regular expressions.

How did I implement this?

I made a generic check for the linkification. Most URLs start with one of the following protocols.

  • http://
  • https://
  • ftp://
  • www.
  • mailto:

The following regular expression matches URLs starting with http://, https://, or ftp://:

/(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gim;

The following regular expression matches URLs starting with “www.” (without // before it, or it would re-link the ones handled above):

/(^|[^\/])(www\.[\S]+(\b|$))/gim;

The following regular expression changes email addresses into mailto: links:

/(([a-zA-Z0-9\-\_\.])+@[a-zA-Z\_]+?(\.[a-zA-Z]{2,6})+)/gim;

The replacements below apply each of these patterns to its URL format and place the matched link into an href.

//URLs starting with http://, https://, or ftp://
var replacePattern1 = /(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gim;
var replacedText = inputText.replace(replacePattern1, '<a href="$1" target="_blank">Click Here!</a>');

//URLs starting with "www." (without // before it, or it'd re-link the ones done above).
var replacePattern2 = /(^|[^\/])(www\.[\S]+(\b|$))/gim;
replacedText = replacedText.replace(replacePattern2, '$1<a href="http://$2" target="_blank">$2</a>');

//Change email addresses to mailto: links.
var replacePattern3 = /(([a-zA-Z0-9\-\_\.])+@[a-zA-Z\_]+?(\.[a-zA-Z]{2,6})+)/gim;
replacedText = replacedText.replace(replacePattern3, '<a href="mailto:$1">$1</a>');

At the same time, the text is wrapped inside pre tags. Here is the final result, achieved without any libraries, simply by using three different regular expressions.
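Put together, the whole linkifier fits in one small function. Here is a sketch (the function name and the pre wrapping are assumptions based on the description above):

function linkify(inputText) {
    // URLs starting with http://, https://, or ftp://
    var replacedText = inputText.replace(
        /(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gim,
        '<a href="$1" target="_blank">Click Here!</a>');

    // URLs starting with "www." (without // before it, so the matches above are not re-linked)
    replacedText = replacedText.replace(
        /(^|[^\/])(www\.[\S]+(\b|$))/gim,
        '$1<a href="http://$2" target="_blank">$2</a>');

    // Email addresses become mailto: links
    replacedText = replacedText.replace(
        /(([a-zA-Z0-9\-\_\.])+@[a-zA-Z\_]+?(\.[a-zA-Z]{2,6})+)/gim,
        '<a href="mailto:$1">$1</a>');

    // Wrap the result in pre tags for the chat bubble
    return '<pre>' + replacedText + '</pre>';
}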

[Screenshot: Susi response with clickable links]

 


Susi supports Map tiles

The Susi chat client now supports map tiles. Try out the following location-related query, and Susi responds with an embedded map tile with a pin at the location.

Where is Singapore? 

[Screenshot: Susi’s map tile answer for “Where is Singapore?”]

You have zoom-in and zoom-out controls on the map. It also provides you with a link to OpenStreetMap where you can get the whole view of the location. Besides the map tile, it gives you information about the location’s population. Isn’t it awesome?

Implementation: Let’s get into the implementation part. There are multiple ways to display map tiles. One way is to use our own loklak services; for example, you can make a call to the /vis/map.png API for displaying the map. But the issue with this API is that we can’t dynamically zoom the map in or out on the tile, which is merely a static display. So, to make it more interesting, we used a JS library called Leaflet, which provides interactive maps.

How is the data captured? Here is the sample response coming from the server.

[Screenshot: sample JSON response from the server]

This sample data, which contains latitude, longitude, place and population, helps us draw the map tiles.

  1. expression: "Berlin is a place with a population of 3426354. Here is a map: https://www.openstreetmap.org/#map=13/52.52436820069531/13.41053001275776"
  2. type: "answer"

These two keys under the actions object provide us the answer along with the URL, which is then linkified.

How is the map captured? First, we included the following required library files.

<script src="https://npmcdn.com/leaflet@1.0.0-rc.1/dist/leaflet.js"></script>

<link rel="stylesheet" href="https://npmcdn.com/leaflet@1.0.0-rc.1/dist/leaflet.css" />

[Screenshot: including the Leaflet library files]

The JSON response is parsed, the coordinates (lat, lon) are captured, and the variable mapType is initialized.

[Screenshot: parsing the coordinates from the JSON response]
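In spirit, the parsing step can look like this (the regex and variable names here are assumptions, not the exact Susi code):

// Pull zoom, latitude and longitude out of the openstreetmap URL in the expression
var match = expression.match(/#map=(\d+)\/([\d.\-]+)\/([\d.\-]+)/);
if (match) {
    var zoom = parseInt(match[1], 10);
    var lat = parseFloat(match[2]);
    var lon = parseFloat(match[3]);
    var mapType = true; // tells the template to reserve space for a map
}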

PS: The coordinates are passed into the response to tell the HTML that ‘hey, the map is coming in the response!’ so that it prepares its space for the map. Here’s the HTML’s job (Handlebars).

[Screenshot: the Handlebars template reserving space for the map]

The extra div for the map is loaded when the template is notified about the map.

[Screenshot: the drawMap method]

When mapType is set to true, the drawMap method is called; it initializes the id for the HTML and paints the map tile onto it. The map object has attributes like the maxZoom level, the map marker and its tooltip on the map. And that’s how the map tiles are formed.
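A minimal sketch of such a drawMap method with Leaflet (the element id, zoom level and marker text are assumptions):

function drawMap(lat, lon, place, population) {
    // Bind the map to the div that the Handlebars template rendered for this answer
    var map = L.map('mapTile').setView([lat, lon], 13);
    // Paint the OpenStreetMap tiles onto it
    L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
        maxZoom: 19,
        attribution: '&copy; OpenStreetMap contributors'
    }).addTo(map);
    // Drop a marker at the location, with the place and population as its tooltip
    L.marker([lat, lon]).addTo(map)
        .bindPopup(place + ', population ' + population)
        .openPopup();
}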

 

 

 


Why RESTful error handling is important?

Error handling is an important and integral part of developing APIs and exposing them publicly. In loklak, we have APIs which return JSON responses, exposed as REST APIs. REST, or Representational State Transfer, is an architectural style for building distributed applications. Unlike SOAP, REST-based web services do not have a well-defined convention for returning error messages. In general, these are the ways in which errors are recognized or thrown.

The most common errors are HTTP error codes. In many cases, a hosted web service can return a 503 or a similar error indicating a server failure, which mostly means that the server has crashed at a particular point. The reply is generally sent back to the user as an HTTP error code in plain text, which does not conform to the reply format the user expects, often causing the client application using the API to crash as well.

In Loklak, the error codes are captured and handled properly.

Other common errors occurring within loklak are missing pages of the content being served. These are the well-known 404 errors, which hit the user like a web surfer hitting a brick wall. Loklak can now return an error document instead; this is how the 404 page came up. Here’s loklak’s 404 page.

The main challenge while building REST APIs is handling error conditions in every single API endpoint that is exposed as a service. The question is whether the error messages should be human-readable, application-specific, or machine-readable.

A correct error message should contain the right blend of all three types. Ideally, it should enable the user to identify the error and the location where it happened; a good way to indicate this is the file or the type of error that occurred. This human-readable part becomes useful especially for client-side applications using the APIs. A bare 404 status code can help from a developer’s perspective but never from a user’s; a proper message with details specific to the error is always a good addition.
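As a sketch, an error response blending the three kinds of messages could look like this (all field names here are illustrative, not loklak’s actual format):

{
  "status": 404,
  "error": "Not Found",
  "message": "No scraper is registered for the requested type.",
  "path": "/api/genericscraper.json",
  "code": "SCRAPER_TYPE_UNKNOWN"
}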

In addition to the human-readable details, it is also important to have the right debug information about where exactly the error occurred; these are application errors. In these cases, it is useful to mention the error code and a small stack trace, so that it becomes easier to debug and to find traces of repeatedly occurring errors in the service platform.

Beyond the human-readable and application-related errors, it is also important to include information that automatic code-analysis tools can use to identify the errors and take the required actions. This is really useful in cases where background tasks run throughout the service. For example, a user may request an API endpoint which requires a specific set of libraries that have been removed or were not built in the build process. The automatic tools can use these machine errors to trigger the right actions, like rebuilding or fetching library updates, so as to avoid these errors.

Loklak currently handles the error codes, gives a generic error message when using the APIs, and handles server errors and 404 pages. This system can be improved by providing error-specific messages, which is a future enhancement.


Generic Scraper – The whole new design

Previously, the generic API converted a web page’s source code into a JSON response with predefined tags. But this API needed a design model that can do much more than simply give the obvious result, so the API has been redesigned completely. The architecture is shared in this blog post. Before reading on, you may want to go here and read about the existing API.

Input – The input to the generic scraper is the URL; the code internally scrapes the source code from it, which happens here (using JSoup):

Document page = Jsoup.connect(url).get();

The page is the source code extracted from the URL (a string). The source code is expected to be in this format.

[Screenshot: the expected source code format]

But this is not the case with a few websites, where the source code is embedded inside script tags, like this.

[Screenshot: source code embedded inside script tags]

The API must be able to handle such source code and convert it into the required format. Such corner cases must be handled.

Scraping – The formats of websites differ; a blog post is completely different from a discussion post. Such variations must be handled properly, and with that in mind the code is divided internally into five main sub-APIs.

Article API – The Article API is used to extract clean article text and other data from news articles, blog posts and other text-heavy pages. Retrieve the full-text, cleaned, related images and videos, author, date, tags—automatically, from any article on any site.

Product API – The Product API automatically extracts complete data from any shopping or e-commerce product page. Retrieve full pricing information, product IDs (SKU, UPC, MPN), images, product specifications, brand and more.

Image API – The Image API identifies the primary image(s) of a submitted web page and returns comprehensive information and metadata for each image.

Discussion API – The Discussion API automatically structures and extracts entire threads or lists of reviews/comments from most discussion pages, forums, and similarly structured web pages.

Advertisements API – The Advertisement API automatically structures the ad details on the website.

Along with the above APIs, which cover most requirements, the API also supports a general response: when none of the results matches these formats, it gives a generic JSON response. The sub-APIs are lucid and must be specified by the user along with the input, in the following pattern.

/api/genericscraper.json?url=http://blog.loklak.net/convert-web-pages-into-structured-data/&type=article

The type covers the sub APIs like article, discussion, product and so on.

Re-usability – The API is designed to use the existing scraper responses which are specific to certain websites like WordPress, meetup.com, Amazon etc. Internally, the API matches such URLs and calls the page-specific scrapers for the response. This helps in reusing the existing scrapers in loklak. You can go through this doc page for more details.

Technology Stack – The scraping is done using Jsoup, a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. The API endpoint is registered and can be accessed in the given format.

http://loklak.org/api/genericscraper.json?url=""&type=""

Coming up (part two) is a deeper version of this blog post, which explains the API implementation and the targeted tags in depth. Stay tuned for further updates.

 


All About Peer Deploy App – Part One

We all know that loklak is a distributed peer-to-peer sharing system, in which you can host your own loklak peer. The advantage of having your own server is that you have the privilege of sharing your search data, which goes into the indexing and eventually results in faster search results. So how can this “peer deploy” app help loklak get more servers up? Before going into the details, let’s discuss the peers API provided by loklak.

Loklak provides a transparent view of its peers and server deploys through its peers API.

http://loklak.org/api/peers.json

This API gives you the details of the loklak peers and count of the active servers.

{
      "class": "SuggestServlet",
      "host": "169.55.12.244",
      "port.http": 9000,
      "port.https": 9443,
      "lastSeen": 1470365717753,
      "lastPath": "/api/suggest.json",
      "peername": "anonymous"
},

So you can be part of this group, host your own search engine, and share your indexed data.

There are many ways to deploy a loklak peer, but this app provides simple one-click deploy buttons.

[Screenshot: the one-click deploy buttons]

These one-click deploy buttons determine what code you are trying to deploy. If you’re not logged in or don’t have an account, you’ll go through the login flow first. For instance, Heroku uses an app.json manifest in the code repo to figure out what add-ons, config and other deployment steps are required to make the code run; this manifest is used to configure and deploy the app. A similar flow applies to the other buttons, which deploy to Docker containers, Bluemix and Scalingo.
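For a feel of what such a manifest contains, here is a minimal app.json sketch (the values are illustrative, not the actual loklak manifest):

{
  "name": "loklak",
  "description": "Distributed peer-to-peer message search server",
  "repository": "https://github.com/loklak/loklak_server",
  "keywords": ["loklak", "search", "peer"],
  "buildpacks": [{ "url": "heroku/gradle" }]
}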

This app aims at having three modules:

  1. Providing all the one click deploy buttons at one place.
  2. Display of the peers network using D3.js charts.
  3. Have a leaderboard page, counting the number of deploys per user.

This app is complete up to the first two modules; the upcoming enhancement can be done using the loklak_depot module.

Technology stack

  1. The buildpack was already available, and the buttons are embedded using the HTML tags provided by each service provider. Here is the code of the app.
  2. The app is written in AngularJS, and the force-directed graph is built using the d3.js library.
  3. The app consumes peers.json to get the data for displaying the graph (a sketch follows below).
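A sketch of that graph code, using the D3 v3 force layout (the peers field and all selectors are assumptions, not the actual app code):

d3.json('http://loklak.org/api/peers.json', function (error, json) {
    if (error) throw error;

    // One node per peer, all linked to a central hub node
    var nodes = [{ peername: 'loklak' }].concat(json.peers);
    var links = json.peers.map(function (p, i) {
        return { source: 0, target: i + 1 };
    });

    var svg = d3.select('svg');
    var force = d3.layout.force()
        .nodes(nodes).links(links)
        .size([800, 600])
        .start();

    var link = svg.selectAll('.link').data(links)
        .enter().append('line').attr('class', 'link');
    var node = svg.selectAll('.node').data(nodes)
        .enter().append('circle').attr('class', 'node')
        .attr('r', 6).call(force.drag);

    // Re-position the lines and circles on every simulation tick
    force.on('tick', function () {
        link.attr('x1', function (d) { return d.source.x; })
            .attr('y1', function (d) { return d.source.y; })
            .attr('x2', function (d) { return d.target.x; })
            .attr('y2', function (d) { return d.target.y; });
        node.attr('cx', function (d) { return d.x; })
            .attr('cy', function (d) { return d.y; });
    });
});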

Here is a screenshot of the app.

[Screenshot: the peer deploy app]

The upcoming enhancement is a leaderboard based on the number of peers deployed per user. If you are interested, you can try deploying a peer right here. Check out how simple it can be to deploy.

 


  • Deploy to Heroku
  • Deploy on Scalingo
  • Deploy to Bluemix
  • Deploy to Docker Cloud

Susi chat interface with visualizations

Susi has gained a few capabilities to visualize her responses. She can respond by sharing links, showing analytics on pie charts, and giving you a list of bulleted data. This post shows how these components are integrated into Susi.

The rules which are defined can return data in various compatible forms: links, analytics in the form of percentages, and certain lists of data. For example, in the previous blog post on adding Susi rules, we added a sample rule showing how to add types of responses to Susi. If you want more context on it, you can click here.

  • Susi taking responses from data: This type of response is in the form of a table. Susi can take the extra data under data.answers[0].data, where the type is table. Below is a sample JSON format from which the tabular data can be parsed.

[Screenshot: sample JSON with table-type data]

From the above JSON, the data under the answers object is traced for the tabulated answers. The following expression will get you the titles of the reddit articles.

[Screenshot: the expression extracting the reddit article titles]
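In code, collecting those titles is a one-liner (a sketch; the title field name is an assumption):

// Walk the table rows under the answers object and keep the titles
var titles = data.answers[0].data.map(function (row) { return row.title; });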

The above response is for the following query

What are the reddit articles about loklak

This is Susi’s response on asksusi.

[Screenshot: Susi’s tabulated response on asksusi]

  • Susi answering using pie charts: Susi rules can also be defined in such a way that the response gives out a well-formed pie chart. The data required for the pie chart is defined in the rule and can easily be interpreted using Highcharts for a clear pie chart response. Here is a sample JSON response for the following query:

Who will win the 2016 presidential election

[Screenshot: sample JSON response for the pie chart query]

The above JSON defines the data for the pie charts, giving a percentage and a relevant name for each object. This JSON is easy to interpret when defining the pie charts using highchart.js. Below is the sample code which was used to define the pie charts.

[Screenshot: the pie chart definition code]
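In spirit, the definition looks something like this (a sketch; the container id and the name/percent field names are assumptions):

// Feed the name/percentage pairs from Susi's answer into a Highcharts pie
$('#susi-piechart').highcharts({
    chart: { type: 'pie' },
    title: { text: 'Who will win the 2016 presidential election' },
    series: [{
        name: 'Chance',
        data: answerData.map(function (d) {
            return { name: d.name, y: parseFloat(d.percent) };
        })
    }]
});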

This is how the interface answers with pie charts.

[Screenshots: Susi’s pie chart responses]

  • Susi interpreting links: Susi can also interpret links from the response and linkify them accordingly.

Here is the sample code showing how Susi interprets the links from the response.

[Screenshot: the link interpretation code]

The links are linkified, and this is how Susi responds.

[Screenshot: Susi’s linkified response]

Stay tuned for more updates on Susi.

 

 


First Sprint: Susi’s chat interface

This blog post shares the details of how Susi got her custom interface. Susi is well trained with proper rules and sufficient data, but she was lacking a makeover that could attract people to chat with her. So we gave her that custom makeover: a chat interface with starter functionality. We used a technology stack that we believed would make the chat process much simpler and more flexible.

  • Handlebars – Why handlebars? This templating framework helps you reproduce the chat bubbles without much hassle. Two script blocks with embedded handlebars expressions do it all. This keeps the front-end code small without any break in the bubble template.

[Screenshot: the user-message template]

The template above displays the user’s message from the send DOM element.

[Screenshot: the Susi-response template]

The template above binds Susi’s response into the chat bubble. A request is triggered every time the user sends out a query, that is, a chat message. For example, the user says Hi Susi and Susi responds with Hello!. The response is fetched when the user types in the message, and Susi is queried using the following URL.

http://loklak.org/api/susi.json?q=Hi Susi
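Put together, the two script-block templates can be sketched like this (the ids, class names and field names are assumptions):

<script id="user-template" type="text/x-handlebars-template">
    <div class="bubble user">{{message}}</div>
</script>

<script id="susi-template" type="text/x-handlebars-template">
    <div class="bubble susi">{{answer}}</div>
</script>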

That is how the templates are embedded into the interface. Along with that, Susi is given a separate avatar or artwork, and here is one.

[Screenshot: Susi’s avatar]

  • jQuery – We used jQuery to handle the requests and to constantly query the Susi API for the answers. It handles the calls very swiftly, and the response time is very quick.

[Screenshot: the request handling code]

In the above code snippet, the request is handled and the response is queried for the answer. For example, this is the sample JSON.

[Screenshot: sample JSON response for “Hello”]

The above JSON is the response when we type “Hello” into Susi’s chat interface. We trace down to the expression for the actual answer to be provided:

data.answers[0].actions[0].expression
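A sketch of that round trip (the selectors and template id are assumptions; the endpoint and the answer path are from this post):

$('#send').on('click', function () {
    var query = $('#message').val();
    // Query the Susi API and bind the answer into a chat bubble
    $.getJSON('http://loklak.org/api/susi.json', { q: query }, function (data) {
        var answer = data.answers[0].actions[0].expression;
        var template = Handlebars.compile($('#susi-template').html());
        $('#chat').append(template({ answer: answer }));
    });
});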

The chat interface also covers other requirements like keyboard events, scroll events, etc. The chat interface code can be viewed here at asksusi. It is still under development, and a further enhancement would be to port the Telegram UI into its functionality.

 

 
