Generic Scraper – All about the Article API

The Generic Scraper now uses algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page. It relies on boilerpipe, a Java library that provides algorithms to detect and extract the main article content.

Why not JSoup? 

The traditional approach of extracting content from specific DOM elements is not advisable for a generic scraper. When I scraped WordPress blog posts, certain common HTML tags were shared across many blog hosting platforms; but when I tried Medium blogs, the scraper (written with JSoup) failed. This approach is specific to each web link, and extending it to every possible site is a very tedious job.
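
To illustrate, here is a minimal sketch of the site-specific approach. The .entry-content selector is a hypothetical example of a class found on many WordPress themes; it is not taken from the actual scraper code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupScraperSketch {

    // hypothetical selector that matches many WordPress themes,
    // but not Medium, which uses completely different markup
    private static final String WORDPRESS_CONTENT = ".entry-content";

    public static String scrape(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        // returns an empty string on any site that does not use this
        // class, which is why every new site needs its own selector
        return doc.select(WORDPRESS_CONTENT).text();
    }
}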

Why Boilerpipe? 

Boilerpipe is an excellent Java library for boilerplate removal and full-text extraction from HTML pages.
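
For instance, extracting an article takes a single call. A minimal sketch (the URL below is only a placeholder; ArticleExtractor is the same extractor the servlet further down uses):

import java.net.URL;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        // any article page works here; this URL is only a placeholder
        URL url = new URL("https://blog.fossasia.org/some-article/");
        // fetches the page and strips the boilerplate in one call
        String mainText = ArticleExtractor.INSTANCE.getText(url);
        System.out.println(mainText);
    }
}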

The following are the available boilerpipe extractor types:

  • DefaultExtractor – A full-text extractor, but not as good as ArticleExtractor.
  • ArticleExtractor – A full-text extractor specialized in extracting articles; it has higher accuracy than DefaultExtractor.
  • ArticleSentencesExtractor – Like ArticleExtractor, but tuned towards returning full sentences.
  • KeepEverythingExtractor – Gets everything; we can use this for extracting the title and description.
  • KeepEverythingWithMinKWordsExtractor – Keeps every text block that contains at least k words.
  • LargestContentExtractor – Like DefaultExtractor, but keeps only the largest content block.
  • NumWordsRulesExtractor – A quite generic full-text extractor based on the number of words per block.
  • CanolaExtractor – A full-text extractor trained on the krdwrd Canola corpus.

These can be used as extractor keys according to the requirement. As of now only the ArticleExtractor is implemented; the other extractor types will follow. A sketch of how they could be wired up by key is given below.
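
Since the extractors share the BoilerpipeExtractor interface, supporting the other types could be as simple as mapping a request parameter to an extractor instance. A sketch of that idea (the key names are hypothetical and not part of the current API):

import java.util.HashMap;
import java.util.Map;

import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.DefaultExtractor;
import de.l3s.boilerpipe.extractors.KeepEverythingExtractor;
import de.l3s.boilerpipe.extractors.LargestContentExtractor;

public class ExtractorRegistry {

    // hypothetical extractor keys; extractors that need constructor
    // arguments (e.g. KeepEverythingWithMinKWordsExtractor) are omitted
    private static final Map<String, BoilerpipeExtractor> EXTRACTORS = new HashMap<>();
    static {
        EXTRACTORS.put("default", DefaultExtractor.INSTANCE);
        EXTRACTORS.put("article", ArticleExtractor.INSTANCE);
        EXTRACTORS.put("everything", KeepEverythingExtractor.INSTANCE);
        EXTRACTORS.put("largest", LargestContentExtractor.INSTANCE);
    }

    public static String extract(String key, String html) throws Exception {
        // fall back to ArticleExtractor, the type implemented so far
        BoilerpipeExtractor extractor = EXTRACTORS.getOrDefault(key, ArticleExtractor.INSTANCE);
        return extractor.getText(html);
    }
}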

It is the best tool of the lot: it intelligently removes unwanted HTML tags and even irrelevant text from the web page. It extracts the content within milliseconds, requires minimal input, needs no global or site-level information, and is usually quite accurate.

Benefits:

  • Much smarter than regular expressions.
  • Provides several extraction methods.
  • Returns text in a variety of formats.
  • Avoids the manual process of finding content patterns on each source site.
  • Removes boilerplate such as headers, footers, menus and advertisements.

The output of the extraction can be HTML, text or JSON. Given below is the list of output formats.

  • Html (default) : Outputs the whole HTML document.
  • htmlFragment : Outputs only those HTML fragments that are regarded as main content.
  • Text : Outputs the extracted main content as plain text.
  • Json : Outputs the extracted main content as JSON.
  • Debug : Outputs debug information to understand how boilerpipe internally represents a document.
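
The Text and htmlFragment variants map directly onto the boilerpipe library itself. Below is a sketch along the lines of boilerpipe's bundled Oneliner demo; the URL is a placeholder, and HTMLHighlighter.process(URL, extractor) is the call I believe the library offers for fragment output:

import java.net.URL;

import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.HTMLHighlighter;

public class OutputFormatsDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://blog.fossasia.org/some-article/"); // placeholder

        // "Text": the extracted main content as plain text
        String text = ArticleExtractor.INSTANCE.getText(url);

        // "htmlFragment": only the HTML fragments regarded as main content
        HTMLHighlighter fragments = HTMLHighlighter.newExtractingInstance();
        String htmlFragment = fragments.process(url, ArticleExtractor.INSTANCE);

        System.out.println(text);
        System.out.println(htmlFragment);
    }
}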

Here’s the Java code, which imports the boilerpipe package and does the article extraction.

/**
 *  GenericScraper
 *  Copyright 16.06.2016 by Damini Satya, @daminisatya
 *
 *  This library is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU Lesser General Public
 *  License as published by the Free Software Foundation; either
 *  version 2.1 of the License, or (at your option) any later version.
 *  
 *  This library is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *  Lesser General Public License for more details.
 *  
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program in the file lgpl21.txt
 *  If not, see <http://www.gnu.org/licenses/>.
 */

package org.loklak.api.search;

import java.io.IOException;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URL;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.loklak.http.RemoteAccess;
import org.loklak.server.Query;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class GenericScraper extends HttpServlet {

	private static final long serialVersionUID = 4653635987712691127L;

	/**
     * Write the given JSONObject to the HTTP response
     * @param response the servlet response to write to
     * @param genericScraperData the JSON data to print
     */
	public void printJSON(HttpServletResponse response, JSONObject genericScraperData) throws ServletException, IOException {
		response.setCharacterEncoding("UTF-8");
		PrintWriter sos = response.getWriter();
		sos.print(genericScraperData.toString(2));
		sos.println();
	}

	/**
     * Article API: extract the main article text of a web page
     * @param url the URL of the page to scrape
     * @param genericScraperData the JSON object to fill
     * @return genericScraperData with query, data and NLP fields set
     */
	public JSONObject articleAPI (String url, JSONObject genericScraperData) throws MalformedURLException{
        URL qurl = new URL(url);
        String data = "";

        try {
            // try boilerpipe's ArticleExtractor first
            data = ArticleExtractor.INSTANCE.getText(qurl);
            genericScraperData.put("query", qurl);
            genericScraperData.put("data", data);
            genericScraperData.put("NLP", "true");
        }
        catch (Exception e) {
            // boilerpipe failed; fall back to the raw page text via JSoup
            if ("".equals(data)) {
                try {
                    Document htmlPage = Jsoup.connect(url).get();
                    data = htmlPage.text();
                    genericScraperData.put("query", qurl);
                    genericScraperData.put("data", data);
                    genericScraperData.put("NLP", "false");
                }
                catch (Exception ex) {
                    // both extraction attempts failed; return the object without data
                }
            }
        }

        return genericScraperData;
    }

	@Override
	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		doGet(request, response);
	}

	@Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {

        Query post = RemoteAccess.evaluate(request);

        String url = post.get("url", "");
        String type = post.get("type", "");

        JSONObject genericScraperData = new JSONObject(true);
        if ("article".equals(type)) {
            genericScraperData = articleAPI(url, genericScraperData);
            genericScraperData.put("type", type);
            printJSON(response, genericScraperData);
        } else {
            genericScraperData.put("error", "Please specify the type of scraper, e.g. type=article");
            printJSON(response, genericScraperData); 
        } 
    } 
}

Try this sample query

http://localhost:9000/api/genericscraper.json?url=http://stackoverflow.com/questions/15655012/how-final-keyword-works&type=article

And this is the sample output.

[Screenshot: the JSON response of the Article API]
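
In place of the screenshot, the response has roughly this shape (values abbreviated; the field names follow the servlet code above):

{
  "query": "http://stackoverflow.com/questions/15655012/how-final-keyword-works",
  "data": "... the extracted main text of the page ...",
  "NLP": "true",
  "type": "article"
}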

That’s all, folks. I will post updates on further improvements.
