Generic Scraper – The whole new design

Previously, the generic API converted a web page's source code into a JSON response with predefined tags. But this API needs a design that can do much more than return the obvious result, so it has been redesigned completely. This blog post describes the new architecture. Before reading on, you may want to go here and read about the existing API.

Input – The input to the generic scraper is a URL. The code fetches and parses the page source for that URL internally, using Jsoup:

Document page = Jsoup.connect(url).get();

Here page is the parsed source code of the page fetched from the URL (the URL itself is a string). The source code is expected to be in this format:

[Image: Selection_283]

But this is not the case for a few websites, where the source code is embedded inside script tags, like this:

[Image: Selection_282]

The API must be able to process such source code and convert it into the required format. Corner cases like this must be handled.

Scraping – Websites differ in format. A blog post is completely different from a discussion thread. Such variations must be handled properly, and with that in mind the code is divided internally into five main sub-APIs.

Article API – The Article API is used to extract clean article text and other data from news articles, blog posts and other text-heavy pages. Retrieve the cleaned full text, related images and videos, author, date and tags automatically, from any article on any site.

Product API – The Product API automatically extracts complete data from any shopping or e-commerce product page. Retrieve full pricing information, product IDs (SKU, UPC, MPN), images, product specifications, brand and more.

Image API – The Image API identifies the primary image(s) of a submitted web page and returns comprehensive information and metadata for each image.

Discussion API – The Discussion API automatically structures and extracts entire threads or lists of reviews/comments from most discussion pages, forums, and similarly structured web pages.

Advertisement API – The Advertisement API automatically structures the details of the ads on a web page.

Along with the above APIs, which cover most requirements, the API also supports a general response. When the page matches none of these formats, it gives a generic JSON response. The sub-APIs are straightforward to use: the user specifies one together with the input URL, in the following pattern.

/api/genericscraper.json?url=http://blog.loklak.net/convert-web-pages-into-structured-data/&type=article

The type parameter selects the sub-API, such as article, discussion, product and so on.
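To illustrate how a type value could be mapped to a sub-API with a fallback to the generic response, here is a minimal sketch. The names ScrapeTypeDemo, ScrapeType and fromParam are hypothetical and not taken from the loklak code base, and the values image and advertisement are assumed; only article, discussion and product are named in this post.

```java
import java.util.Locale;

public class ScrapeTypeDemo {

    // Hypothetical list of sub-APIs; GENERIC stands for the general response.
    public enum ScrapeType {
        ARTICLE, PRODUCT, IMAGE, DISCUSSION, ADVERTISEMENT, GENERIC;

        // Map the raw "type" query parameter to a sub-API.
        // A missing or unrecognised value falls back to GENERIC.
        public static ScrapeType fromParam(String raw) {
            if (raw == null) {
                return GENERIC;
            }
            try {
                return valueOf(raw.trim().toUpperCase(Locale.ROOT));
            } catch (IllegalArgumentException e) {
                return GENERIC;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(ScrapeType.fromParam("article"));
        System.out.println(ScrapeType.fromParam("something-else"));
    }
}
```

Falling back to GENERIC instead of rejecting the request is one possible design choice; it keeps the endpoint usable even when the caller leaves the type out.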

Reusability – The API is designed to reuse the existing scraper responses that are specific to certain websites, such as WordPress, meetups.com, Amazon etc. Internally, the API matches such URLs and calls the page-specific scrapers for the response. This helps in reusing the existing scrapers in loklak. You can go through this doc page for more details.
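The URL matching described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class ScraperRouter, the method pickScraper, the host names in the table and the scraper labels are hypothetical, and loklak's real matching logic may differ.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class ScraperRouter {

    // Hypothetical table: host suffix -> label of a page-specific scraper.
    private static final Map<String, String> SITE_SCRAPERS = new LinkedHashMap<>();
    static {
        SITE_SCRAPERS.put("wordpress.com", "wordpress");
        SITE_SCRAPERS.put("meetup.com", "meetup");
        SITE_SCRAPERS.put("amazon.com", "amazon");
    }

    // Return the label of the scraper to use, or "generic" if no site matches.
    public static String pickScraper(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return "generic"; // not a syntactically valid URI
        }
        if (host == null) {
            return "generic"; // no host part, e.g. a relative path
        }
        host = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : SITE_SCRAPERS.entrySet()) {
            String suffix = entry.getKey();
            // Match the bare domain and any of its subdomains.
            if (host.equals(suffix) || host.endsWith("." + suffix)) {
                return entry.getValue();
            }
        }
        return "generic";
    }
}
```

Matching on the host suffix rather than on the full URL string avoids false hits when a site name merely appears somewhere in the path or query.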

Technology Stack – The scraping is done using Jsoup, which is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. The API endpoint is registered and can be accessed in the given format.

http://loklak.org/api/genericscraper.json?url=""&type=""
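A client could assemble a request in this format as in the sketch below. The class ScraperRequest and the method build are hypothetical names. Percent-encoding the page URL is an assumption added here so that characters such as & or ? in the target URL do not break the query string; this post does not say whether the server requires it.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ScraperRequest {

    // Endpoint as given in this post.
    private static final String ENDPOINT = "http://loklak.org/api/genericscraper.json";

    // Build the request URL for a page and a sub-API type.
    // Both values are percent-encoded before being placed in the query string.
    public static String build(String pageUrl, String type) {
        return ENDPOINT
                + "?url=" + URLEncoder.encode(pageUrl, StandardCharsets.UTF_8)
                + "&type=" + URLEncoder.encode(type, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build(
                "http://blog.loklak.net/convert-web-pages-into-structured-data/",
                "article"));
    }
}
```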

Coming up (Part Two) is a deeper version of this blog post, which explains the API implementation and the targeted tags in depth. Stay tuned for further updates.

 
