Avoiding Nested Callbacks using RxJS in Loklak Scraper JS

Loklak Scraper JS, as suggested by the name, is a set of scrapers for social media websites written in NodeJS. One of the most common requirements while scraping is that a parent webpage provides links to related child webpages, and the data that needs to be scraped is present in both the parent webpage and the child webpages. For example, let's say we want to scrape Quora user profiles matching the search query "Siddhant". The matching-profiles webpage for this example is https://www.quora.com/search?q=Siddhant&type=profile, which is the parent webpage, and the child webpages are the links of each matched profile.

Now, a simplistic approach is to first obtain the HTML of the parent webpage and then synchronously fetch the HTML of each child webpage and parse it to get the desired data. The problem with this approach is that it is slow: the requests are made one after another, each blocking until the previous one completes.
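For illustration, here is a minimal sketch of this synchronous approach, assuming the sync-request library (used elsewhere in Loklak Scraper JS); parseProfileLinks and scrapeProfile are hypothetical helpers:

const request = require('sync-request'); // blocking HTTP requests

// fetch the parent webpage first, then each child webpage one by one
let parentHtml = request('GET', parentUrl).getBody().toString();
let childUrls = parseProfileLinks(parentHtml); // hypothetical: extract profile links

let profiles = childUrls.map(childUrl => {
  // each request blocks until the previous one has finished
  let childHtml = request('GET', childUrl).getBody().toString();
  return scrapeProfile(childHtml); // hypothetical: parse one profile page
});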

A different approach is to use request-promise-native to implement the logic asynchronously. But there are limitations with this approach. The HTML of the child webpages can only be fetched after the HTML of the parent webpage is obtained, and the number of child webpages is dynamic. So, there is a request dependency between parent and child, i.e. only when we have the data from the parent webpage can we extract data from the child webpages. The code would look like this:

request(parent_url)
   .then(data => {
       // parse parent HTML and extract child URLs ...
       request(child_url)
           .then(data => {
               // again nesting of child urls
           })
           .catch(error => {
               // handle child request error
           });
   })
   .catch(error => {
       // handle parent request error
   });


Firstly, with this approach there is callback hell. Horrible, isn't it? And then, we don't know how many nested callbacks to use, as the number of child webpages is dynamic.

The saviour: RxJS

The solution to our problem is reactive extensions in JavaScript. Using RxJS we can obtain the required data asynchronously and without callback hell!

First, the request-promise object of the parent webpage is obtained. From this promise an observable is created using Rx.Observable.fromPromise. The flatMap operator is used to parse the HTML of the parent webpage and obtain the links of the child webpages. Then the map method transforms the links into request-promise objects, which are again transformed into observables. The HTML returned by the resulting observables is parsed, and the results are accumulated using the zip operator. Finally, the accumulated data is subscribed to. This is implemented in the getScrapedData method of the Quora JS scraper.

getScrapedData(query, callback) {
  // observable from the parent webpage
  Rx.Observable.fromPromise(this.getSearchQueryPromise(query))
    .flatMap((t, i) => { // t is the HTML of the parent webpage
      // request-promise objects of the child webpages
      let profileLinkPromises = this.getProfileLinkPromises(t);
      // request-promise object to observable transformation
      let obs = profileLinkPromises.map(elem => Rx.Observable.fromPromise(elem));

      // each Quora profile is parsed
      return Rx.Observable.zip( // accumulation of data from child webpages
        ...obs,
        (...htmlOfProfiles) => { // resolved HTML of each child webpage
          let scrapedProfiles = [];
          for (let i = 0; i < htmlOfProfiles.length; i++) {
            let $ = cheerio.load(htmlOfProfiles[i]);
            scrapedProfiles.push(this.scrape($));
          }
          return scrapedProfiles; // accumulated data returned
        }
      );
    })
    .subscribe( // desired data is subscribed to
      scrapedData => callback({profiles: scrapedData}),
      error => callback(error)
    );
}
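A hypothetical usage of this method (Quora as the scraper class name is an assumption):

let quora = new Quora(); // assumed class name for the Quora scraper
quora.getScrapedData("Siddhant", result => {
  // on success: {profiles: [...]}; on failure: the error object
  console.log(result);
});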



Using NodeJS modules of Loklak Scraper in Android

Loklak Scraper JS implements scrapers for social media websites so that they can be used on other platforms, like Android, or in a native Java project. This way there is only a single source for each scraper; as a result, it is easier to update the scrapers in response to changes in the websites. This blog explains how Loklak Wok Android, a peer of Loklak Server on the Android platform, uses the Twitter JS scraper to scrape tweets.

LiquidCore is a library available for Android that can be used to run standard NodeJS modules. But the Twitter scraper can't be used directly, due to the following problems:

  • 3rd party NodeJS libraries, like cheerio and request-promise-native, are used to implement the scraper, and LiquidCore doesn't support 3rd party libraries.
  • The scrapers are written in ES6; as of now LiquidCore uses NodeJS 6.10.2, which doesn't support ES6 completely.

So, if the 3rd party NodeJS libraries can be included in our scraper code and the ES6 code can be converted to ES5, LiquidCore can easily execute the Twitter scraper.

3rd party NodeJS libraries can be bundled into the Twitter scraper using Webpack, and ES6 can be transpiled to ES5 using Babel.

The required dependencies can be installed using:

$ npm install --save-dev webpack
$ npm install --save-dev babel-core babel-loader babel-preset-es2015

Bundling and Transpiling

Webpack does the bundling based on the configuration provided in webpack.config.js, present in the root directory of the project.

var fs = require('fs');

function listScrapers() {
    var src = "./scrapers/";
    var files = {};
    // each file in ./scrapers/ becomes an entry: scraper name -> location
    fs.readdirSync(src).forEach(function(data) {
        var entryName = data.substr(0, data.indexOf("."));
        files[entryName] = src + data;
    });
    return files;
}

module.exports = {
    entry: listScrapers(),
    target: "node",
    module: {
        loaders: [
            {
                loader: "babel-loader",
                test: /\.js?$/,
                query: {
                    presets: ["es2015"]
                }
            }
        ]
    },
    output: {
        path: __dirname + '/build',
        filename: '[name].js',
        libraryTarget: 'var',
        library: '[name]'
    }
};


Now let's break down the config file. The function listScrapers returns an object with the name of a scraper as key and the relative location of the scraper as value, e.g.:

{
   twitter: "./scrapers/twitter.js",
   github: "./scrapers/github.js"
   // same goes for other scrapers
}

The parameters in module.exports, as described in the webpack documentation for multiple entry points and for using the generated output externally, are:

  • entry: Since a bundled file is required for each scraper, we provide the object returned by the listScrapers function. The multiple entry points provided generate multiple bundled files.
  • target: As the bundled files are to be used on the NodeJS platform, "node" is set here.
  • module: Using webpack, the code can be transpiled directly while bundling, so end users don't need to run a separate command for transpiling. module contains the babel configuration for transpiling.
  • output: the options here customize the compilation of webpack.
    • path: Location where the bundled files are kept after compilation; "__dirname" means the current directory, i.e. the root directory of the project.
    • filename: Name of the bundled file; "[name]" here refers to a key of the object provided in entry, i.e. a key of the object returned from listScrapers. For example, for the Twitter scraper the filename of the bundled file will be "twitter.js".
    • libraryTarget: By default, the functions or methods inside the bundled files can't be used externally, i.e. can't be imported. By providing "var", the functions in the bundled module can be accessed through a variable (see the sketch after this list).
    • library: the name of that variable; with "[name]" it matches the entry name.
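For example, with libraryTarget: 'var' and library: '[name]', loading build/twitter.js exposes a global variable named twitter. A minimal sketch of how the bundle can then be used (the query string is just an example):

// twitter is the global variable exposed by the bundled twitter.js
var twitterScraper = new twitter();
twitterScraper.getTweets("fossasia", function(data) {
    console.log(data); // scraped tweets
});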

Now, time to do the compilation work:

$ ./node_modules/.bin/webpack

The bundled files can be found in the build directory. But the generated bundled files are large, around 77,000 lines each, and large files are not encouraged for production purposes. So, the "-p" flag is used to generate bundled files fit for production, around 400 lines.

$ ./node_modules/.bin/webpack -p

Using LiquidCore to execute bundled files

The generated bundled file can be copied to the raw directory in res (the resources directory in Android). Now, events are emitted from the Activity/Fragment, and in response to those events the scraping function in the bundled JS file, present in the raw directory, is invoked; the reverse is also possible.

So, we handle some events in our JS file and send some events to the Android Activity/Fragment. The event handling and event emitting code in the JS file:

var query = "";
LiquidCore.on("queryEvent", function(msg) {
  query = msg.query;
});

LiquidCore.on("fetchTweets", function() {
  var twitterScraper = new twitter();
  twitterScraper.getTweets(query, function(data) {
    LiquidCore.emit("getTweets", {"query": query, "statuses": data});
  });
});

LiquidCore.emit('start');


First a "start" event is emitted from the JS file, which is consumed in TweetHarvestingFragment by the getScrapedTweet method using startEventListener.

EventListener startEventListener = (service, event, payload) -> {
   JSONObject jsonObject = new JSONObject();
   try {
       jsonObject.put("query", query);
       service.emit(LC_QUERY_EVENT, jsonObject); // value of LC_QUERY_EVENT is "queryEvent"
   } catch (JSONException e) {
       Log.e(LOG_TAG, e.toString());
   }
   service.emit(LC_FETCH_TWEETS_EVENT); // value of LC_FETCH_TWEETS_EVENT is "fetchTweets"
};


The startEventListener then emits "queryEvent" with a JSONObject that contains the query to search tweets for. This event is consumed in the JS file by:

var query = "";
LiquidCore.on("queryEvent", function(msg) {
  query = msg.query;
});


After "queryEvent", a "fetchTweets" event is emitted from the fragment, which is handled in the JS file by:

LiquidCore.on("fetchTweets", function() {
  var twitterScraper = new twitter(); // scraping object is created
  twitterScraper.getTweets(query, function(data) { // function that scrapes twitter
    LiquidCore.emit("getTweets", {"query": query, "statuses": data});
  });
});


Once the scraped data is obtained, it is sent back to the fragment by emitting a "getTweets" event from the JS file; {"query": query, "statuses": data} contains the scraped data. This event is consumed on the Android side by getTweetsEventListener.

EventListener getTweetsEventListener = (service, event, payload) -> { // payload contains scraped data
   Push push = mGson.fromJson(payload.toString(), Push.class);
   emitter.onNext(push);
};


LiquidCore creates a NodeJS instance to execute the bundled JS file. The NodeJS instance is called a MicroService in LiquidCore terminology. For all this event handling to work, the NodeJS instance is created with a ServiceStartListener in which all the EventListeners are added.

MicroService.ServiceStartListener serviceStartListener = (service -> {
   service.addEventListener(LC_START_EVENT, startEventListener);
   service.addEventListener(LC_GET_TWEETS_EVENT, getTweetsEventListener);
});
URI uri = URI.create("android.resource://org.loklak.android.wok/raw/twitter"); // Note .js is not used
MicroService microService = new MicroService(getActivity(), uri, serviceStartListener);
microService.start();


Scraping in JavaScript using Cheerio in Loklak

FOSSASIA recently started a new project, loklak_scraper_js. The objective of the project is to develop a single library for web scraping that can be used easily on most platforms, as maintaining the same scraping logic in different programming languages and projects is a headache and a waste of time. An obvious solution to this was writing the scrapers in JavaScript: JS is lightweight, fast, and its functions and classes can be easily used from many programming languages, e.g. via Nashorn in Java.

Cheerio is a library that is used to parse HTML. Let's look at the YouTube scraper.

Parsing HTML

Steps involved in web-scraping:

  1. The HTML source of the webpage is obtained.
  2. The HTML source is parsed, and
  3. the parsed HTML is traversed to extract the required data.

For the 2nd and 3rd steps we use cheerio.

Obtaining the HTML source of a webpage is a piece of cake; it is done by the function getHtml, where the sync-request library is used to send the "GET" request.
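A minimal sketch of what getHtml might look like; the exact function body is an assumption, but sync-request is the library actually used:

var request = require("sync-request");

function getHtml(url) {
    // send a blocking "GET" request and return the response body as a string
    return request("GET", url).getBody().toString();
}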

Parsing of the HTML is done using the load method, by passing it the obtained HTML source of the webpage, as in the getSearchMatchVideos function:

var $ = cheerio.load(htmlSourceOfWebpage);


Since the API of cheerio is similar to that of jQuery, by convention the variable referencing the cheerio object that holds the parsed HTML is named "$".

Sometimes the requirement may be to extract data from a particular HTML tag (one that contains a large number of nested child tags) rather than from the whole parsed HTML. In that case the load method can be used again, as in the getVideoDetails function, to obtain only the head tag:

var head = cheerio.load($("head").html());

The "html" method provides the HTML content of the selected tag, i.e. the <head> tag. If a parameter is passed to the html method, then the content of the selected tag (here <head>) will be replaced by the HTML of the parameter.
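For example (with illustrative values):

var headHtml = $("head").html();     // getter: returns the inner HTML of <head>
$("title").html("A new page title"); // setter: replaces the content of <title>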

Extracting data from parsed HTML

Some of the contents that we see in a webpage are dynamic; they are not part of the static HTML. When a "GET" request is sent, only the static HTML of the webpage is obtained. Inspecting elements in the browser shows that a class attribute can have a different value in the live webpage than in the static HTML obtained from the "GET" request using the getHtml function. For example, inspecting the link of one of the suggested videos shows different values of the class attribute:


[Screenshots omitted: one shows the class attribute value in the live website, the other shows it in the static HTML obtained from the "GET" request using the getHtml function.]

So, it is recommended to first check whether the attributes have the same values in both, and then proceed accordingly.
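A simple way to do such a check, reusing getHtml from above (the class name video-title is purely hypothetical):

var $ = cheerio.load(getHtml("https://www.youtube.com/watch?v=KVGRN7Z7T1A"));
// video-title stands for a class value seen in the browser's inspector
if ($(".video-title").length === 0) {
    // the class is absent from the static HTML, so a selector based on it
    // would match nothing; inspect the fetched HTML for a working selector
}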

Now, let’s dive into the actual scraping stuff.

As most of the required data is available inside meta tags in the head tag, the extractMetaAttribute function extracts the value of the content attribute of a meta tag, selected by another provided attribute and its value:

function extractMetaAttribute(cheerioObject, metaAttribute, metaAttributeValue) {
	var selector = 'meta[' + metaAttribute + '="' + metaAttributeValue + '"]';
	return cheerioObject(selector).attr("content");
}

Here "cheerioObject" will be the "head" object created above.

For example, our final JSONObject contains an og_url key-value pair. To get it we need to obtain the following HTML element:

<meta property="og:url" content="https://www.youtube.com/watch?v=KVGRN7Z7T1A">


This can be obtained by:

  1. Writing a selector for the property attribute of meta. The selector would be 'meta[property="og:url"]'.
  2. The selector is passed to the cheerioObject.
  3. Then the attr method is used to obtain the value of the content attribute.
  4. Finally, we set the obtained value of the content attribute as the value of the JSONObject's key.
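Putting these steps together (jsonObject is an assumed name for the object being assembled by the scraper):

// equivalent to: head('meta[property="og:url"]').attr("content")
jsonObject.og_url = extractMetaAttribute(head, "property", "og:url");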

Similarly og:site_name, og:url and other values can be extracted; in the final JSONObject they become the values of the keys og_site_name, og_url, and so on. Since a lot of data needs to be extracted this way, the extractMetaAttribute function generalizes it, where metaAttribute is "property" and metaAttributeValue is "og:url" in the above example.

If one parameter is provided to the attr method, it acts as a getter: the value of that attribute is returned. If two parameters are provided, the first is the name of the attribute and the second is the value to set for it; in this case attr acts as a setter.
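For example:

var tag = head('meta[property="og:site_name"]');
var siteName = tag.attr("content");     // getter: returns the attribute's value
tag.attr("content", "A new site name"); // setter: sets the attribute's value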

Now, what if the provided selector matches more than one HTML element and we need to extract data from, or perform some operation on, all of them? The answer is the each method of the cheerio object: it iterates over the matched elements and executes the function passed as a parameter on each of them. The passed function has two parameters, the index of the matched element and the matched element itself. To break out of the loop early, false is returned.

One of the use cases of the each method in the YouTube scraper is to extract the related "tags" of a video.

The selector for this would be 'meta[property="og:video:tag"]', and as the tags are inside the head tag, we can use the already created head object. Applying the each method, it becomes:

head('meta[property="og:video:tag"]').each(function(i, element) {
    // the logic goes here
});


Here, for the first iteration the value of "i" will be "0" and "element" will be

<meta property="og:video:tag" content="Iggy">


and so on. We need to obtain the value of the content attribute, so we can use the attr method as above. Finally, all the values are pushed to an array. Hence, the final code snippet with the logic:

var ary = [];
head('meta[property="og:video:tag"]').each(function(i, element) {
    ary.push(head(element).attr("content"));
});


The same functionality is implemented in the extractMetaProperties method:

function extractMetaProperties(cheerioObj, metaProperty) {
	var properties = [];
	var selector = 'meta[property="' + metaProperty + '"]';
	cheerioObj(selector).each(function(i, element) {
		properties.push(cheerioObj(element).attr("content"));
	});
	return properties;
}
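A hypothetical usage, with head as the cheerio object created earlier:

var tags = extractMetaProperties(head, "og:video:tag"); // e.g. ["Iggy", ...]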