Create Scraper in Javascript for Loklak Scraper JS

Loklak Scraper JS is the latest repository in the Loklak project. It is one of the more interesting projects because of the expected benefits of JavaScript in web scraping. It runs on the Node.js engine and is used in the Loklak Wok project as a bundled package. It has the potential to be used in other repositories and enhance them.

Scraping in Python is easy (at least for Pythonistas): one just imports the Requests library and BeautifulSoup (with lxml as a faster option), writes a few lines using Requests to fetch the webpage and a few lines of bs4 to walk through the HTML and scrape data. This sums up to less than a hundred lines of code. JavaScript isn't as easily readable (at least to me) as Python, but it has an advantage: it can easily deal with JavaScript in the pages being scraped. This is one of the motives for which the Loklak Scraper JS repository was created, and we contributed and worked on it.
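As a quick refresher, here is a minimal sketch of that Python flow (using the requests and bs4 packages plus the html5lib parser; the URL is the same one used in the scraper below):

import requests
from bs4 import BeautifulSoup

# fetch the page, then parse it and walk the HTML
url = "http://www.timeanddate.com/worldclock/results.html?query=London"
html = requests.get(url).text
soup = BeautifulSoup(html, "html5lib")
for td in soup.find_all("td"):
    print(td.get_text())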

I recently coded a JavaScript scraper in the loklak_scraper_js repository. While coding, I found its libraries similar to the ones I use in Python. Therefore, this blog shows Pythonistas how they can start scraping in JavaScript by the time they finish reading, and also contribute to Loklak Scraper JS.

First, replace the Python interpreter, the Requests library and BeautifulSoup with the Node.js interpreter, the Request library and the Cheerio JS library.

1) Node.js interpreter: Node.js is used to interpret (run) JavaScript files. This is different from Python, as it deals with the whole project instead of a single module as in Python's case. The Node version most compatible with most of the libraries is 6.0.0, whereas the latest version available (as I checked) is 8.0.0.

TIP: use `--save` with npm while installing a library, e.g. `npm install request --save`, so that the library is recorded as a dependency in package.json.

2) Request library: This is used to load the webpage to be processed, similar to Requests in Python.

The request-promise library, a wrapper around Request built on the Bluebird promise library, improves readability and makes code cleaner: instead of nesting callbacks, you chain .then() handlers on the promise it returns.

 

3) Cheerio library: A Pythonista (a rookie one) can call it the twin of the BeautifulSoup library, but it is faster and is JavaScript. Its selector implementation is nearly identical to jQuery's.

Let us code a basic JavaScript scraper. I will take the TimeAndDate scraper from loklak_scraper_js as the example here. It takes a place as input and outputs its local time.

Step#1: Fetch the HTML of the webpage with the help of the Request library.

We pass the url to the request function to fetch the webpage, and the response body is saved in the `html` variable. The scrapeTimeAndDate() function then scrapes the data from `html`.

url = "http://www.timeanddate.com/worldclock/results.html?query=London";

request(url, function(error, response, body) {

 if(error) {

    console.log("Error: " + error);

    process.exit(-1);

 }

 html = body;

 scrapeTimeAndDate()

});

 

Step#2: Scrape the important data from the HTML using Cheerio JS.

The list of locations with their dates and times is embedded in a table tag, so we will iterate through the <td> tags and extract their text.

a) Load the html into Cheerio, as we do in BeautifulSoup.

In Python

soup = BeautifulSoup(html,'html5lib')

 

In Cheerio JS

var cheerio = require("cheerio");   // npm install cheerio --save
$ = cheerio.load(html);

 

b) This line selects the tr tags inside the table tag.

var htmlTime = $("table").find('tr');

 

c) Iterate over the selected elements using the each() function. It acts like a loop in Python, iterating through the list of elements from which data will be extracted.

htmlTime.each(function (index, element) {
  // in Python, we would loop: `for element in elements:`
  tag = $(element).find("td");    // in Python: `tag = soup.find_all('td')`
  if( tag.text() != "") {
    // ...
    // EXTRACT DATA (see d) below)
    // ...
  } else {
    // go to the next td tag
    tag = tag.next();
  }
});

 

d) Extract the data.

Cheerio loads the HTML and traverses it using the DOM model, which treats the HTML document as a tree. So, go to the tag you want and scrape the data from it.

//extract location(text) enclosed in tag

location = tag.text();

//go to next tag

tag = tag.next();

//extract time(text) enclosed in tag

time = tag.text();

//save in an object, like a dictionary in Python

loc_list["location"] = location;

loc_list["time"] = time;

 

Some other useful functions:

1) $(selector, [context], [root])

Returns the elements matching the selector (any tag, class or id) inside the given root.

2) $("table").attr(name, value)

Gets the value of the attribute `name` when called with one argument, and sets it to `value` when called with two.

3) obj.html()

Returns the HTML enclosed in the tags.

For more, just drop in to the Cheerio documentation.

Step#3: Execute the scraper using the command:

node <scrapername>.js

 

Hoping that this blog is able to show how to scrape in JavaScript by finding similarities with Python.


Generating a documentation site from markup documents with Sphinx and Pandoc

Generating a fully fledged website from a set of markup documents is no easy feat, but the wonderful tool Sphinx certainly makes the task easier. Sphinx does the heavy lifting of generating a website with built-in JavaScript-based search. But sometimes it's not enough.

This week we were faced with two issues related to documentation generation on loklak_server and susi_server. First let me give you some context. Sphinx requires an index.rst file within /docs/ which it uses to generate the first page of the site. A very obvious way to fill it, which helps us avoid unnecessary duplication, is to use the include directive of reStructuredText (e.g. `.. include:: ../README.md`) to include the README file from the root of the repository.

This leads to the following two problems:

  • The include directive can only properly include a reStructuredText document, not a markdown one. Given a markdown document, it tries to parse the markdown as reStructuredText, which leads to errors.
  • Any relative links in the README break when it is included from another folder.

To fix the first issue, I used pypandoc, a thin wrapper around Pandoc. Pandoc is a wonderful command line tool which allows us to convert documents from one markup format to another. From the official Pandoc website itself,

If you need to convert files from one markup format into another, pandoc is your swiss-army knife.

pypandoc requires a working installation of Pandoc, which can be downloaded and installed automatically using a single line of code.

pypandoc.download_pandoc()

This gives us a cross-platform way to download pandoc without worrying about the current platform. Now, pypandoc leaves the installer in the current working directory after download, which is fine locally, but creates a problem when run on remote systems like Travis: the installer could get committed accidentally to the repository. To solve this, I had to take a look at the source code of pypandoc and call an internal method which pypandoc uses to determine the name of the installer. I use that method to find out the name of the file and then delete it after the installation is over. This is one of the many benefits of open-source projects: had pypandoc not been open source, I would not have been able to do that.

import os
import pypandoc

url = pypandoc.pandoc_download._get_pandoc_urls()[0][pf]
filename = url.split('/')[-1]
os.remove(filename)

Here pf is the current platform, which can be one of 'win32', 'linux', or 'darwin' (the values sys.platform reports).
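
With pandoc installed, the markdown README can then be converted to reStructuredText before Sphinx runs. A minimal sketch (the file paths are assumptions; convert_file is pypandoc's file-conversion helper):

import pypandoc

# convert the markdown README so Sphinx's include directive can process it
rst = pypandoc.convert_file('README.md', 'rst')
with open('docs/README.rst', 'w') as f:
    f.write(rst)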

Now let's take a look at our second issue. To solve it, I used regular expressions to capture relative links. Capturing links was easy: all links in reStructuredText are in the same following format.

`Title <url>`__

Similarly links in markdown are in the following format

[Title](url)

Regular expressions were the perfect candidate to solve this. To detect which links were relative and needed to be fixed, I checked which links started with the /docs/ directory, and then all I had to do was remove the /docs prefix from those links.
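
A minimal sketch of that idea (the exact expression in the project may differ; this one only handles the reStructuredText link form shown above):

import re

def fix_relative_links(text):
    # drop the /docs prefix from `Title </docs/...>`__ style links,
    # since the README is being included from inside docs/ already
    return re.sub(r'(`[^`<]*<)/docs/([^>]*)(>`__)', r'\1\2\3', text)

print(fix_relative_links('`API </docs/api.md>`__'))
# `API <api.md>`__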

A note about the loklak and susi server projects

Loklak is a server application which is able to collect messages from various sources, including Twitter.

SUSI AI is an intelligent open source personal assistant. It is capable of chat and voice interaction, and by using APIs it can perform actions such as music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, and other real-time information.


Releasing the loklak Python SDK 1.7

Python is one of the most popular languages in which many developers from the open source community and startups write their applications. What makes this happen is how easily developers can leverage a library. We noticed the same here at loklak: the data on the loklak server and the new Susi integration can each be leveraged with one line of code by developers using the library, instead of writing complex reusable components to integrate loklak into their applications.


In the v1.7 release, major changes have been made to the library SDK, including direct parsing and conversion logic from one format to another, i.e. XML => JSON / JSON => XML etc. Added to this, the ability for developers to leverage Susi's capabilities has also been integrated into the recent release. As the library matured, it now supports Python 3 and Python 2 simultaneously. It's now very simple for a developer to leverage Susi's capabilities because of the library.
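
The SDK's own conversion helpers aren't reproduced here, but as a rough standard-library sketch of the kind of XML => JSON mapping described (element names and nesting only; attributes and repeated tags are ignored):

import json
import xml.etree.ElementTree as ET

def xml_to_json(xml_string):
    # recursively map each element to its text or to a dict of its children
    def to_dict(elem):
        children = list(elem)
        if not children:
            return elem.text
        return {child.tag: to_dict(child) for child in children}
    root = ET.fromstring(xml_string)
    return json.dumps({root.tag: to_dict(root)})

print(xml_to_json('<user><name>loklak</name></user>'))
# {"user": {"name": "loklak"}}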

To install the library you can do pip install python-loklak-api; it works with both pip3 and pip2. Once the library is installed, it's very simple to make queries to loklak and to Susi with just a few lines of code. Here's an example of how it can be used, and of the modularity and robustness with which the library has been built.

>>> from loklak import Loklak
>>> from pprint import pprint
>>> l = Loklak() # Uses the domain loklak.org
>>> susi_result = l.susi('Hi I am Sudheesh')
>>> pprint(susi_result)
{'answer_date': '2016-08-20T04:56:17.371Z',
 'answer_time': 11,
 'answers': [{'actions': [{'expression': 'Hi sudheesh.', 'type': 'answer'}],
              'data': [{'0': 'i am sudheesh', '1': 'sudheesh'}],
              'metadata': {'count': 1, 'hits': 1, 'offset': 0}}],
 'client_id': 'aG9zdF8xODMuODMuMTIuNzY=',
 'count': 1,
 'query': 'Hi I am Sudheesh',
 'query_date': '2016-08-20T04:56:17.360Z',
 'session': {'identity': {'anonymous': True,
                          'name': '183.83.12.76',
                          'type': 'host'}}}

Similarly, fetching the information for a search or a user is equally easy:

>>> l.search('rio')
>>> l.user('sudheesh001')

This makes it possible for hundreds of developers and plugins written in Python to leverage this library in various frameworks like Django, Flask or Pyramid, or even run it from the command line interface. Head over to our GitHub repository to learn more and for detailed documentation.


Susi support for Loklak APIs

Here at Loklak, we are striving continuously for innovation. Continuing with this trend, we recently launched ‘Susi – The chat bot’. Please refer to this previous blog post by Damini.

Along with the chat bot, Susi query support was added to Loklak Python and PHP APIs. Susi can be queried from localhost as well as other online loklak peers.

The Susi API function was added to the Python API (as shown below). See the full implementation here.

def susi(self, query=None):
    """Hits Susi with the required query and returns back the susi response"""
    susi_application = 'api/susi.json'
    url_to_give = self.baseUrl + susi_application
    self.query = query
    if query:
        params = {}
        params['q'] = self.query
        return_to_user = requests.get(url_to_give, params=params)
        if return_to_user.status_code == 200:
            return return_to_user.json()
        else:
            return_to_user = {}
            return_to_user['error'] = ('Looks like there is a problem in susi replying.')
            return json.dumps(return_to_user)
    else:
        return_to_user = {}
        return_to_user['error'] = ('Please ask susi something.')
        return json.dumps(return_to_user)

A sample usage of the Susi API in Python could be:

from loklak import Loklak
query = "Hi I am Zeus"
l = Loklak()
result = l.susi(query)
print(result)

Susi integration with the PHP API (see below). See the full implementation here.

public function susi($query=null) {
	$this->requestURL = $this->baseUrl . '/api/susi.json';
	$this->query = $query;
	if($query) {
		$params = array('q'=>$this->query);
		$request = Requests::request($this->requestURL, array('Accept' => 'application/json'), $params);
		if ($request->status_code == 200) {
			return json_encode($request);
		}
		else {
			$request = array();
			$request['error'] = "Looks like Susi is not replying.";
			return json_encode($request);
		}
	}
	else {
		$request = array();
		$request['error'] = "Please ask Susi something.";
		return json_encode($request);
	}
}

Sample usage of Susi API in PHP:

include('loklak.php');
$loklak = new Loklak(); 
$result = $loklak->susi('Hi I am Zeus');
$susiResponse = json_decode($result);
$susiResponse = $susiResponse->body;
$susiResponse = json_decode($susiResponse, true);
var_dump($susiResponse);

Tests for the above-mentioned functions have been added to the respective API test suites.

Try Social Universe Super Intelligence!

Ask questions, interact with it. I am pretty sure that you would like it!


A low-cost laboratory for everyone: Sensor Plug-ins for ExpEYES to measure temperature, pressure, humidity, wind speed, acceleration, tilt angle and magnetic field

Working on ExpEYES over the last few months has been an amazing journey and I am grateful for the support of Mario Behling, Hong Phuc Dang and Andre Rebentisch at FOSSASIA. I had a lot of learning adventures experimenting and exploring new ideas to build sensor plug-ins for ExpEYES. There were some moments which were disappointing, and there were other moments which brought the joy of creating sensor plug-ins, add-on devices and GUI improvements for ExpEYES.

My GSoC Gallery of Sensors and Devices: here are all the sensors I played with for PSLab.

The complete list of sensor plug-ins developed is available at http://gnovi.edublogs.org/2015/08/21/gsoc-2015-with-fossasia-list-of-sensor-plug-ins-developed-for-expeyes/

Sensor Plugins for ExpEYES

The aim of my project is to develop new sensor plug-ins for ExpEYES to measure a variety of parameters like temperature, pressure, humidity, wind speed, acceleration, tilt angle, magnetic field etc., and to provide low-cost open source laboratory equipment for students and citizen scientists all over the world.

We are enhancing the scope of ExpEYES so it can be used to perform several new experiments. Developing a low-cost stand-alone data acquisition system that can be used for weather monitoring or environmental studies is another objective of our project.

I am happy to see that things have taken good shape, with additional gas sensors added which were not included in the initial plan. We have almost achieved all the objectives of the project, except for some difficulties in calibrating sensor outputs and in documentation. This issue will be solved in a couple of days.

Experimenting with different sensors in my kitchen laboratory

I started exploring and experimenting with different sensors. After doing preliminary studies, I procured analog and a few digital sensors for measuring weather parameters like temperature, relative humidity and barometric pressure. A few other sensors like a low-cost piezoelectric sensor, the ADXL-335 accelerometer, a Hall effect magnetic sensor, a gyro module etc. were also added to my kitchen laboratory. We then decided to add gas sensors for detecting carbon monoxide, LPG and methane.

With this development, ExpEYES can now be used for pollution monitoring and also in safety systems in physics and chemistry laboratories. Work on the low-cost dust sensor is in progress.

Challenges, Data Sheet, GUI programs

I had to spend a lot of time getting the sensor components, studying their data sheets, soldering and setting them up with ExpEYES, and then a little time writing GUI programs. I have been working almost 8 to 10 hours every evening after college hours (sometimes the whole night) and now things have taken good shape.

Thanks to my mentor at FOSSASIA for pushing me, sometimes with strict words. I could add many new sensor plug-ins to ExpEYES, and now I will also be working on light sensors so that the Pocket Science Lab can be used in optics. With these new sensor plug-ins, one can replace many costly devices from physics, chemistry, biology and also geology labs.

What’s next? My Plan for next steps

  • Calibration of sensor data

  • Prototyping stand-alone weather station

  • Pushing data to Loklak server

  • Work on [email protected] website

  • FOSSASIA Live CD based on Lubuntu with ExpEYES and other educational software

  • Set up documentation for possible science experiments with the sensor plug-ins and low-cost, open source apparatus
