Adding Tweet Streaming Feature in World Mood Tracker loklak App

The World Mood Tracker was added to loklak apps with the feature to display aggregated data from the emotion classifier of loklak server. The next step in the app was adding the feature to display the stream of Tweets from a country as they are discovered by loklak. With the addition of stream servlet in loklak, it was possible to utilise it in this app.

In this blog post, I will be discussing the steps taken while adding to introduce this feature in World Mood Tracker app.

Props for WorldMap component

The WorldMap component holds the view for the map displayed in the app. This is where API calls to classifier endpoint are made and results are displayed on the map. In order to display tweets on clicking a country, we need to define react props so that methods from higher level components can be called.

In order to enable props, we need to change the constructor for the component –

export default class WorldMap extends React.Component {
    constructor(props) {
        super(props);
        ...
    }
    ...
}

[SOURCE]

We can now pass the method from parent component to enable streaming and other components can close the stream by using props in them –

export default class WorldMoodTracker extends React.Component {
    ...
    showStream(countryName, countryCode) {
        /* Do something to enable streaming component */
        ...
    }
 
    render() {
        return (
             ...
                <WorldMap showStream={this.showStream}/>
             ...
        )
    }
}

[SOURCE]

Defining Actions on Clicking Country Map

As mentioned in an earlier blog post, World Mood Tracker uses Datamaps to visualize data on a map. In order to trigger a piece of code on clicking a country, we can use the “done” method of the Datamaps instance. This is where we use the props passed earlier –

done: function(datamap) {
    datamap.svg.selectAll('.datamaps-subunit').on('click', function (geography) {
        props.showStream(geography.properties.name, reverseCountryCode(geography.id));
    })
}

[SOURCE]

The name and ID for the country will be used to display name and make API call to stream endpoint respectively.

The StreamOverlay Component

The StreamOverlay components hold all the utilities to display the stream of Tweets from loklak. This component is used from its parent components whose state holds info about displaying this component –

export default class WorldMoodTracker extends React.Component {
    ...
    getStreamOverlay() {
        if (this.state.enabled) {
            return (<StreamOverlay
                show={true} channel={this.state.channel}
                country={this.state.country} onClose={this.onOverlayClose}/>);
        }
    }

    render() {
        return (
            ...
                {this.getStreamOverlay()}
            ...
        )
    }
}

[SOURCE]

The corresponding props passed are used to render the component and connect to the stream from loklak server.

Creating Overlay Modal

On clicking the map, an overlay is shown. To display this overlay, react-overlays is used. The Modal component offered by the packages provides a very simple interface to define the design and interface of the component, including style, onclose hook, etc.

import {Modal} from 'react-overlays';

<Modal aria-labelledby='modal-label'
    style={modalStyle}
    backdropStyle={backdropStyle}
    show={true}
    onHide={this.close}>
    <div style={dialogStyle()}>
        ...
    </div>
</Modal>

[SOURCE]

It must be noted that modalStyle and backdropStyle are React style objects.

Dialog Style

The dialog style is defined to provide some space at the top, clicking where, the overlay is closed. To do this, vertical height units are used –

const dialogStyle = function () {
    return {
        position: 'absolute',
        width: '100%',
        top: '5vh',
        height: '95vh',
        padding: 20
        ...
    };
};

[SOURCE]

Connecting to loklak Tweet Stream

loklak sends Server Sent Events to clients connected to it. To utilise this stream, we can use the natively supported EventSource object. Event stream is started with the render method of the StreamOverlay component –

render () {
    this.startEventSource(this.props.channel);
    ...
}

[SOURCE]

This channel is used to connect to twitter/country/<country-ID> channel on the stream and then this can be passed to EventStream constructor. On receiving a message, a list of Tweets is appended and later rendered in the view –

startEventSource(country) {
    let channel = 'twitter%2Fcountry%2F' + country;
    if (this.eventSource) {
        return;
    }
    this.eventSource = new EventSource(host + '/api/stream.json?channel=' + channel);
    this.eventSource.onmessage = (event) => {
        let json = JSON.parse(event.data);
        this.state.tweets.push(json);
        if (this.state.tweets.length > 250) {
            this.state.tweets.shift();
        }
        this.setState(this.state);
    };
}

[SOURCE]

The size of the list is restricted to 250 here, so when a newer Tweet comes in, the oldest one is chopped off. And thanks to fast DOM actions in React, the rendering doesn’t take much time.

Rendering Tweets

The Tweets are displayed as simple cards on which user can click to open it on Twitter in a new tab. It contains basic information about the Tweet – screen name and Tweet text. Images are not rendered as it would make no sense to load them when Tweets are coming at a high rate.

function getTweetHtml(json) {
    return (
        <div style={{padding: '5px', borderRadius: '3px', border: '1px solid black', margin: '10px'}}>
            <a href={json.link} target="_blank">
            <div style={{marginBottom: '5px'}}>
                <b>@{json['screen_name']}</b>
            </div>
            <div style={{overflowX: 'hidden'}}>{json['text']}</div>
            </a>
        </div>
    )
}

[SOURCE]

They are rendered using a simple map in the render method of StreamOverlay component –

<div className={styles.container} style={{'height': '100%', 'overflowY': 'auto',
    'overflowX': 'hidden', maxWidth: '100%'}}>
    {this.state.tweets.reverse().map(getTweetHtml)}
</div>

[SOURCE]

Closing Overlay

With the previous setup in place, we can now see Tweets from the loklak backend as they arrive. But the problem is that we will still be connected to the stream when we click-close the modal. Also, we would need to close the overlay from the parent component in order to stop rendering it.

We can use the onclose method for the Modal here –

close() {
    if (this.eventSource) {
        this.eventSource.close();
        this.eventSource = null;
    }
    this.props.onClose();
}

[SOURCE]

Here, props.onClose() disables rendering of StreamOverlay in the parent component.

Conclusion

In this blog post, I explained how the flow of props are used in the World Mood Tracker app to turn on and off the streaming in the overlay defined using react-overlays. This feature shows a basic setup for using the newly introduced stream API in loklak.

The motivation of such application was taken from emojitracker by mroth as mentioned in fossasia/labs.fossasia.org#136. The changes were proposed in fossasia/apps.loklak.org#315 by @singhpratyush (me).

The app can be accessed live at https://singhpratyush.github.io/world-mood-tracker/index.html.

Resources

Adding Tweet Streaming Feature in World Mood Tracker loklak App

Using Protractor for UI Tests in Angular JS for Loklak Apps Site

Loklak apps site’s home page and app details page have sections where data is dynamically loaded from external javascript and json files. Data is fetched from json files using angular js, processed and then rendered to the corresponding views by controllers. Any erroneous modification to the controller functions might cause discrepancies in the frontend. Since Loklak apps is a frontend project, any bug in the home page or details page will lead to poor UI/UX. How do we deal with this? One way is to write unit tests for the various controller functions and check their behaviours. Now how do we test the behaviours of the site. Most of the controller functions render something on the view. One thing we can do is simulate the various browser actions and test site against known, accepted behaviours with Protractor.

What is Protractor

Protractor is end to end test framework for Angular and AngularJS apps. It runs tests against our app running in browser as if a real user is interacting with our browser. It uses browser specific drivers to interact with our web application as any user would.

Using Protractor to write tests for Loklak apps site

First we need to install Protractor and its dependencies. Let us begin by creating an empty json file in the project directory using the following command.

echo {} > package.json

Next we will have to install Protractor.

The above command installs protractor and webdriver-manager. After this we need to get the necessary binaries to set up our selenium server. This can be done using the following.

./node_modules/protractor/bin/webdriver-manager update
./node_modules/protractor/bin/webdriver-manager start

Let us tidy up things a bit. We will include these commands in package.json file under scripts section so that we can shorten our commands.

Given below is the current state of package.json

{
    "scripts": {
        "start": "./node_modules/http-server/bin/http-server",
        "update-driver": "./node_modules/protractor/bin/webdriver-manager update",
        "start-driver": "./node_modules/protractor/bin/webdriver-manager start",
        "test": "./node_modules/protractor/bin/protractor conf.js"
    },
    "dependencies": {
        "http-server": "^0.10.0",
        "protractor": "^5.1.2"
    }
}

The package.json file currently holds our dependencies and scripts. It contains command for starting development server, updating webdriver and starting webdriver (mentioned just before this) and command to run test.

Next we need to include a configuration file for protractor. The configuration file should contain the test framework to be used, the address at which selenium is running and path to specs file.

// conf.js
exports.config = {
    framework: "jasmine",
    seleniumAddress: "http://localhost:4444/wd/hub",
    specs: ["tests/home-spec.js"]
};

We have set the framework as jasmine and selenium address as http://localhost:4444/wd/hub. Next we need to define our actual file. But before writing tests we need to find out what are the things that we need to test. We will mostly be testing dynamic content loaded by Javascript files. Let us define a spec. A spec is a collection of tests. We will start by testing the category name. Initially when the page loads it should be equal to All apps. Next we test the top right hand side menu which is loaded by javascript using topmenu.json file.

it("should have a category name", function() {
    expect(element(by.id("categoryName")).getText()).toEqual("All apps");
  });

  it("should have top menu", function() {
    let list = element.all(by.css(".topmenu li a"));
    expect(list.count()).toBe(5);
  });

As mentioned earlier, we are using jasmine framework for writing our specs. In the above code snippet ‘it’ describes a particular test. It takes a test description and a callback function thereby providing a very efficient way to document our tests white write the test code itself. In the first test we use expect function to check whether the category name is equal to All apps or not. Here we select the div containing the category name by its id.

Next we write a test for top menu. There should be five menu options in total for the top menu. We select all the list items that are supposed to contain the top menu items and check whether the number of such items are five or not using expect function. As it can be seen from the snippet, the process of selecting a node is almost similar to that of Jquery library.

Next we test the left hand side category list. This list is loaded by AngularJS controller from apps,json file. We should make sure the list is loaded properly and all the options are present.

it("should have a category list", function() {
    let categoryIds = ["All", "Scraper", "Search", "Visualizer", "LoklakLibraries", "InternetOfThings", "Misc"];
    let categoryNames = ["All", "Scraper", "Search", "Visualizer", "Loklak Libraries", "Internet Of Things", "Misc"];

    expect(element(by.css("#catTitle")).getText()).toBe("Categories");

    let categoryList = element.all(by.css(".category-main"));
    expect(categoryList.count()).toBe(7);

    categoryIds.forEach(function(id, index) {
      element(by.css("#" + id)).isPresent().then(function(present) {
        expect(present).toBe(true);
      });

      element(by.css("#" + id)).getText().then(function(text) {
        expect(text).toBe(categoryNames[index]);
      });
    });
  });

At first we maintain two lists of category id and category names. We begin by confirming that Category title is equal to Categories. Next we get the list of categories and iterate over them, For each category we check whether the corresponding id is present in the DOM or not. After confirming this, we match the names of the categories with the expected names. Elements.all function allows us to get a list of selected nodes.

Finally we check the click functionality of the left side menu. Expected behaviour is, on clicking a menu item, the category name should get replaced with the selected category name. For this we need to simulate the click event. Protractor allows us to do it very easily using click function.

it("category list should respond to click", function() {
    let categoryIds = ["All", "Scraper", "Search", "Visualizer", "LoklakLibraries", "InternetOfThings", "Misc"];
    let categoryNames = ["All apps", "Scraper", "Search", "Visualizer", "Loklak Libraries", "Internet Of Things", "Misc"];

    categoryIds.forEach(function(id, index) {
      element(by.id(categoryIds[index])).click().then(function() {
        browser.getCurrentUrl().then(function(url) {
          expect(url).toBe("http://127.0.0.1:8080/#/" + categoryIds[index]);
        });
        element(by.id("categoryName")).getText().then(function(text) {
          expect(text).toBe(categoryNames[index]);
        });
      });
    });
  });



Once again we maintain two lists, category id and category names. We obtain the present list of categories and iterate over them. For each category link we simulate a click event. For each click event we check two values. We check the new browser URL which should now contain the category id. Next we check the value of category name. It should be equal to the category selected.
FInally after all the tests are over we get the final report on our terminal.
In order to run the tests, use the following command.

npm test

This will start executing the tests.

Important resources

Using Protractor for UI Tests in Angular JS for Loklak Apps Site

Enhancing LoklakWordCloud app present on Loklak apps site

LoklakWordCloud app is presently hosted on loklak apps site. Before moving into the content of this blog, let us get a brief overview of the app. What does the app do? The app generates a word cloud using twitter data returned by loklak based on the query word provided by the user. The user enters a word in the input field and presses the search button. After that a word cloud is created using the content (text body, hashtags and mentioned) of the various tweets which contains the user provided query word.

In my previous post I wrote about creating the basic functional app. In this post I will be describing the next steps that have been implemented in the app.

Making the word cloud clickable

This is one of the most important and interesting features added to the app. The words in the cloud are now clickable.Whenever an user clicks on a word present in the cloud, the cloud is replaced by the word cloud of that selected word. How do we achieve this behaviour? Well, for this we use Jqcloud’s handler feature. While creating the list of objects for each word and its frequency, we also specify a handler corresponding to each of the word. The handler is supposed to handle a click event. Whenever a click event occurs, we set the value of $scope.tweet to the selected word and invoke the search function, which calls the loklak API and regenerates the word cloud.

for (var word in $scope.wordFreq) {
            $scope.wordCloudData.push({
                text: word,
                weight: $scope.wordFreq[word],
                handlers: {
                    click: function(e) {
                        $scope.tweet = e.target.textContent;
                        $scope.search();
                    }
                }
            });
        }

As it can be seen in the above snippet, handlers is simply an JavaScript object, which takes a function for the click event. In the function we pass the word selected as value of the tweet variable and call search method.

Adding filters to the app

Previously the app generated word cloud using the entire tweet content, that is, hashtags, mentions and tweet body. Thus the app was not flexible. User was not able to decide on which field he wants his word cloud to be generated. User might want to generate his  word cloud using only the hashtags or the mentions or simply the tweet body. In order to make this possible, filters have been introduced. Now we have filters for hashtags, mentions, tweet body and date.

<div class="col-md-6 tweet-filters">
              <strong>Filters</strong>
              <hr>
              <div class="filters">
                <label class="checkbox-inline"><input type="checkbox" value="" ng-model="hashtags">Hashtags</label>
                <label class="checkbox-inline"><input type="checkbox" value="" ng-model="mentions">Mentions</label>
                <label class="checkbox-inline"><input type="checkbox" value="" ng-model="tweetbody">Tweet body</label>
              </div>
              <div class="filter-all">
                <span class="select-all" ng-click="selectAll()"> Select all </span>
              </div>
            </div>

We have used checkboxes for the individual filters and have kept an option to select all the filters at once. Next we require to hook this HTML to AngularJS code to make the filters functional.

if ($scope.hashtags) {
                tweet.hashtags.forEach(function (hashtag) {
                    $scope.filteredWords.push("#" + hashtag);
                });
            }

            if ($scope.mentions) {
                tweet.mentions.forEach(function (mention) {
                    $scope.filteredWords.push("@" + mention);
                });
            }

In the above snippet, before adding the hashtags to the list of filtered words, we first make sure that the checkbox for hashtags is selected. Once we find out the the variable bound to the hashtags checkbox is true, we proceed further and add the hashtags associated with a given tweet to the list of filteredWords. The same strategy is applied for both mentions (shown in the snippet) and tweet bodies.

Adding error notification

Next, we handle certain errors to notify the users that there is problem in their input. Such cases include empty input. If user provides empty input then we notify him or her and break the search. Next we check whether From date is before To date or not. If From date is after To date then we notify the user about the problem.

if ($scope.tweet === "" || $scope.tweet === undefined) {
            $scope.error = "Please enter a valid query word";
            $scope.showError();
            return;
}

In the above snippet we check for empty or undefined input and display snackbar along with error accordingly.

if ((sinceDate !== "" && sinceDate !== undefined) && (endDate !== "" && endDate !== undefined)) {
            var date1 = new Date(sinceDate);
            var date2 = new Date(endDate);
            if (date1 > date2) {
                $scope.error = "To date should be after From date";
                $scope.showError();
                return;
            }
        }

The above snippet compares date. For comparing dates, first we fetch the values entered (via jquery date widget) into the respective input fields and then create JavaScript Date objects out of them. Finally we compare those Date objects to find out if there is any error or not.

Now it might happen that a particular search is taking a long time (perhaps due to network problem), however the user becomes impatient and tries to search again. In that case we need to inform the user that the previous search is still going on. For this purpose we use a boolean variable  to keep track whether the previous search is completed or still going on. If the previous search is going on and user tries to make a new search then we provide a proper notification and prevent the user from making further searches.

Finally we need to make sure that the user is online and has an active internet connection before the search can take place and Loklak API can be called. For this we have used navigator. We have polled the onLine property of navigator to find out whether the user is online or not. If the user is offline then we inform him that we cannot initiate a search due to internet connectivity problem.

if ($scope.isLoading === true) {
            $scope.error = "Previous search not completed. Please wait...";
            $scope.showError();
            return;
        }
        if (!navigator.onLine) {
            $scope.error = "You are currently offline. Please check your internet connection!";
            $scope.showError();
            return;
        }

Important resources

  • View the app source here.
  • View loklak apps site source here.
  • View Loklak API documentation here
  • View Jqcloud documentation here.
  • Learn more about AngularJS here.
Enhancing LoklakWordCloud app present on Loklak apps site

Developing LoklakWordCloud app for Loklak apps site

LoklakWordCloud app is an app to visualise data returned by loklak in form of a word cloud.

The app is presently hosted on Loklak apps site.

Word clouds provide a very simple, easy, yet interesting and effective way to analyse and visualise data. This app will allow users to create word cloud out of twitter data via Loklak API.

Presently the app is at its very early stage of development and more work is left to be done. The app consists of a input field where user can enter a query word and on pressing search button a word cloud will be generated using the words related to the query word entered.

Loklak API is used to fetch all the tweets which contain the query word entered by the user.

These tweets are processed to generate the word cloud.

Related issue: https://github.com/fossasia/apps.loklak.org/pull/279

Live app: http://apps.loklak.org/LoklakWordCloud/

Developing the app

The main challenge in developing this app is implementing its prime feature, that is, generating the word cloud. How do we get a dynamic word cloud which can be easily generated by the user based on the word he has entered? Well, here comes in Jqcloud. An awesome lightweight Jquery plugin for generating word clouds. All we need to do is provide list of words along with their weights.

Let us see step by step how this app (first version) works. First we require all the tweets which contain the entered word. For this we use Loklak search service. Once we get all the tweets, then we can parse the tweet body to create a list of words along with their frequency.

var url = "http://35.184.151.104/api/search.json?callback=JSON_CALLBACK&count=100&q=" + query;
        $http.jsonp(url)
            .then(function (response) {
                $scope.createWordCloudData(response.data.statuses);
                $scope.tweet = null;
            });

Once we have all the tweets, we need to extract the tweet texts and create a list of valid words. What are valid words? Well words like ‘the’, ‘is’, ‘a’, ‘for’, ‘of’, ‘then’, does not provide us with any important information and will not help us in doing any kind of analysis. So there is no use of including them in our word cloud. Such words are called stop words and we need to get rid of them. For this we are using a list of commonly used stop words. Such lists can be very easily found over the internet. Here is the list which we are using. Once we are able to extract the text from the tweets, we need to filter stop words and insert the valid words into a list.

 tweet = data[i];
            tweetWords = tweet.text.replace(", ", " ").split(" ");

            for (var j = 0; j < tweetWords.length; j++) {
                word = tweetWords[j];
                word = word.trim();
                if (word.startsWith("'") || word.startsWith('"') || word.startsWith("(") || word.startsWith("[")) {
                    word = word.substring(1);
                }
                if (word.endsWith("'") || word.endsWith('"') || word.endsWith(")") || word.endsWith("]") ||
                    word.endsWith("?") || word.endsWith(".")) {
                    word = word.substring(0, word.length - 1);
                }
                if (stopwords.indexOf(word.toLowerCase()) !== -1) {
                    continue;
                }
                if (word.startsWith("#") || word.startsWith("@")) {
                    continue;
                }
                if (word.startsWith("http") || word.startsWith("https")) {
                    continue;
                }
                $scope.filteredWords.push(word);
            }

What are we actually doing in the above snippet? We are simply iterating over each of the statuses returned by Loklak API. For each tweet, first we are splitting the text into words and then we are iterating over those words. For a given word we do a number of checks. First we check if the word begins or ends with a special character, for example quotation marks or brackets. If so we remove those character as it will cause trouble in calculating frequencies. Next we also check if the word is beginning with ‘#’ or ‘@’. If it is true, then we discard such words as we are handling hashtags and mentions separately. Finally we check whether the word is a stop word or not. If it is a stop word then we discard it. If a word passes all the checks, we add it to our list of valid words.

Once we are done with the tweet bodies, next we need to handle hashtags and mentions.

tweet.hashtags.forEach(function (hashtag) {
                $scope.filteredWords.push("#" + hashtag);
            });

            tweet.mentions.forEach(function (mention) {
                $scope.filteredWords.push("@" + mention);
            });

The above code simply iterates over the hashtags and mentions and inserts them into the filteredWords list. We have handled hashtags and mentions separately so that we can apply filters in future.

Once we are done with generating list of valid words, we need to calculate weight for each of the word. Here weight is nothing but the number of times a particular word is present in the list. We calculate this using JavaScript object. We iterate over the list of valid words. If word is not present in the object (or dictionary as you wish to call it), we create a new key by the name of that word and set its value to one. If a word is already present as a key, then we simply increment its value by one.

for (var word in $scope.wordFreq) {
            $scope.wordCloudData.push({
                text: word,
                weight: $scope.wordFreq[word]
            });
        }

The above code snippet calculates the frequency of each word by the process mentioned above.

Now we are all set to generate our word cloud. We simply use Jqcloud’s interface to configure it with the words and their respective frequencies, provide a list of color codes for a color gradient, and set autoResize to true so that our word cloud resizes itself when the screen size changes.

$scope.generateWordCloud = function() {
        if ($scope.wordCloud === null) {
            $scope.wordCloud = $('.wordcloud').jQCloud($scope.wordCloudData, {
                colors: ["#D50000", "#FF5722", "#FF9800", "#4CAF50", "#8BC34A", "#4DB6AC", "#7986CB", "#5C6BC0", "#64B5F6"],
                fontSize: {
                    from: 0.06,
                    to: 0.01
                },
                autoResize: true
            });
        } else {
            $scope.wordCloud = $(".wordcloud").jQCloud('update', $scope.wordCloudData);
        }
    }

Whenever the user searches for a new word, we simply update the existing word cloud with the cloud of the new word.

Future roadmap

  • Make the words in the cloud clickable. On clicking a word, the cloud should get replaced by the selected word’s cloud.
  • Add filters for hashtags, mentions, date.
  • Add option for exporting the cloud to an image, so that user’s can also use this app as a tool to generate word clouds as images and save them.
  • Add a loader and error notification for invalid or empty input.

Important resources

  • View the app source code here.
  • Learn more about Loklak API here.
  • Learn more about Jqcloud here.
  • Learn more about AngularJS here.
Developing LoklakWordCloud app for Loklak apps site

Create Scraper in Javascript for Loklak Scraper JS

Loklak Scraper JS is the latest repository in Loklak project. It is one of the interesting projects because of expected benefits of Javascript in web scraping. It has a Node Javascript engine and is used in Loklak Wok project as bundled package. It has potential to be used in different repositories and enhance them.

Scraping in Python is easy (at least for Pythonistas) as one needs to just import Request library and BeautifulSoup library (lxml as better option), write some lines of code using Request library to get webpage and some lines of bs4 to walk through html and scrape data. This sums up to about less than a hundred lines of coding, where as Javascript coding isn’t easily readable (at least to me) as compared to Python. But it has an advantage, it can easily deal with Javascript in the pages we are scraping. This is one of the motive, Loklak Scraper JS repository was created and we contributed and worked on it.

I recently coded a Javascript scraper in loklak_scraper_js repository. While coding, I found it’s libraries similar to the libraries, I use to code in Python. Therefore, this blog is for Pythonistas how they can start scraping in Javascript as they finish reading and also contribute to Loklak Scraper JS.

First, replace Python interpreter, Request and Beautifulsoup library with Node JS interpreter, Request and Cheerio JS library.

1) Node JS Interpreter: Node JS Interpreter is used to interpret Javascript files. This is different from Python as it deals with the project instead of a module in case of Python. The most compatible Node for most of the libraries is 6.0.0 , where as latest version available(as I checked) is 8.0.0

TIP: use `–save` with npm like here while installing a library.

2) Request Library :- This is used to load webpage to be processed. Similar to one in Python.

Request-promise library, a wrapper around Request with implementation of Bluebird library, improves readability and makes code cleaner (how?).

 

3) Cheerio Library:- A Pythonista (a rookie one) can call it twin of BeautifulSoup Library. But this is faster and is Javascript. It’s selector implementation is nearly identical to jQuery’s.

Let us code a basic Javascript scraper. I will take TimeAndDate scraper from loklak_scraper_js as example here. It inputs place and outputs its local time.

Step#1: fetching HTML from webpage with the help of Request library.

We input url to Request function to fetch the webpage and is saved to `html` variable. This scrapeTimeAndDate() function scrapes data from html

url = "http://www.timeanddate.com/worldclock/results.html?query=London";

request(url, function(error, response, body) {

 if(error) {

    console.log("Error: " + error);

    process.exit(-1);

 }

 html = body;

 scrapeTimeAndDate()

});

 

Step#2: To scrape important data from html using Cheerio JS

list of date and time of locations is embedded in table tag, So we will iterate through <td> and extract text.

  1. a) Load html to Cheerio as we do in beautifulsoup

In Python

soup = BeautifulSoup(html,'html5lib')

 

In Cheerio JS

$ = cheerio.load(html);

 

  1. b) This line finds first tr tag in table tag.

var htmlTime = $("table").find('tr');

 

  1. c) Iterate through td tags data by using each() function. This function acts as loop (in Python) iterating through list of elements in which data will be extracted.

htmlTime.each(function (index, element) {      

  // in python, we will use loop, `for element from elements:`

  tag = $(element).find("td");    // in python, `tag = soup.find_all('td')`

  if( tag.text() != "") {

    .

    .

    //EXTRACT DATA

    .

    .

  } else {

    //go to next td tag

    tag = tag.next();

  }

}

 

  1. d) To extract data

Cheerio JS loads html and uses DOM model traverse through. DOM model considers html is tree. So, go to the tag, and scrape data you want.

//extract location(text) enclosed in tag

location = tag.text();

//go to next tag

tag = tag.next();

//extract time(text) enclosed in tag

time = tag.text();

//save in dictionary like in python

loc_list["location"] = location;

loc_list["time"] = time;

 

Some other useful functions:-

1) $(selector, [context], [root])

returns object of selector(any tag) with class or id inside root

2) $(“table”).attr(name, value)

To get tag object having attribute having `value`

3) obj.html()

To get html enclosed in tags

For more just drop in here

Step#3: Execute scraper using command

node <scrapername>.js

 

Hoping that this blog is able to  how to scrape in Javascript by finding similarities with Python.

Resources:

Create Scraper in Javascript for Loklak Scraper JS

Implementing Loklak APIs in Java using Reflections

Loklak server provides a large API to play with the data scraped by it. Methods in java can be implemented to use these API endpoints. A common approach of implementing the methods for using API endpoints is to create the request URL by taking the values passed to the method, and then send GET/POST request. Creating the request URL in every method can be tiresome and in the long run maintaining the library if implemented this way will require a lot of effort. For example, assume a method is to be implemented for suggest API endpoint, which has many parameters, for creating request URL a lot of conditionals needs to be written – whether a parameter is provided or not.

Well, the methods to call API endpoints can be implemented with lesser and easy to maintain code using Reflection in Java. The post ahead elaborates the problem, the approach to solve the problem and finally solution which is implemented in loklak_jlib_api.

Let’s say, the status API endpoint needs to be implemented, a simple approach can be:

public class LoklakAPI {
    
    public static String status(String baseUrl) {
        String requestUrl = baseUrl   "/api/status.json";
        
        // GET request using requestUrl 
    }

    public static void main(String[] argv) {
        JSONObject result = status("https://api.loklak.org");
    }
}

This one is easy, isn’t it, as status API endpoint requires no parameters. But just imagine if a method implements an API endpoint that has a lot of parameters, and most of them are optional parameters. As a developer, you would like to provide methods that cover all the parameters of the API endpoint. For example, how a method would look like if it implements suggest API endpoint, the old SuggestClient implementation in loklak_jlib_api does that:

public static ResultList<QueryEntry> suggest(
           final String hostServerUrl,
           final String query,
           final String source,
           final int count,
           final String order,
           final String orderBy,
           final int timezoneOffset,
           final String since,
           final String until,
           final String selectBy,
           final int random) throws JSONException, IOException {
       ResultList<QueryEntry>  resultList = new ResultList<>();
       String suggestApiUrl = hostServerUrl
               + SUGGEST_API
               + URLEncoder.encode(query.replace(' ', '+'), ENCODING)
               + PARAM_TIMEZONE_OFFSET + timezoneOffset
               + PARAM_COUNT + count
               + PARAM_SOURCE + (source == null ? PARAM_SOURCE_VALUE : source)
               + (order == null ? "" : (PARAM_ORDER + order))
               + (orderBy == null ? "" : (PARAM_ORDER_BY + orderBy))
               + (since == null ? "" : (PARAM_SINCE + since))
               + (until == null ? "" : (PARAM_UNTIL + until))
               + (selectBy == null ? "" : (PARAM_SELECT_BY + selectBy))
               + (random < 0 ? "" : (PARAM_RANDOM + random))
               + PARAM_MINIFIED + PARAM_MINIFIED_VALUE;
 
                // GET request using suggestApiUrl
   }
}

A lot of conditionals!!! The targeted users may also get irritated if they need to provide all the parameters every time even if they don’t need them. The obvious solution to that is overloading the methods. But,  then again for each overloaded method, the same repetitive conditionals need to be written, a form of code duplication!! And what if you have to implement some 30 API endpoints and in future maintaining them, a single thought of it is scary and a nightmare for the developers.

Approach to the problem

To reduce the code size, one obvious thought that comes is to implement static methods that send GET/POST requests provided request URL and the required data. These methods are implemented in loklak_jlib_api by JsonIO and NetworkIO.

You might have noticed the method name i.e. status, suggest is also the part of the request URL. So, the common pattern that we get is:

base_url + "/api/" + methodName + ".json" + "?" + firstParameterName + "=" + firstParameterValue + "&" + secondParameterName + "=" + secondParameterValue ...

Reflection API to rescue

Using Reflection API inspection of classes, interfaces, methods, and variables can be done, yes you guessed it right if we can inspect we can get the method and parameter names. It is also possible to implement interfaces and create objects at runtime.

So, an interface is created and API endpoint methods are defined in it. Using reflection API the interface is implemented lazily at runtime. LoklakAPI interface defines all the API endpoint methods. Some of the defined methods are:

public interface LoklakAPI {
 
    @Target(ElementType.METHOD)
    @Retention(RetentionPolicy.RUNTIME)
    @interface GET {}
 
    @Target(ElementType.METHOD)
    @Retention(RetentionPolicy.RUNTIME)
    @interface POST {}
 
    @GET
    JSONObject search(String q);
 
    @GET
    JSONObject search(String q, int count);
 
    @GET
    JSONObject search(String q, int timezoneOffset, String since, String until);
 
    @GET
    JSONObject search(String q, int timezoneOffset, String since, String until, int count);
 
    @GET
    JSONObject peers();
 
    @GET
    JSONObject hello();
 
    @GET
    JSONObject status();
 
    @POST
    JSONObject push(JSONObject data);
}

Here GET and POST annotations are used to mark the methods which use GET and POST request respectively, so that appropriate method for GET and POST static method can be used JsonIOloadJson for GET request and pushJson for POST request.

A private static inner class ApiInvocationHandler is created in APIGenerator, which implements InvocationHandler – used for implementing interfaces at runtime.

private static class ApiInvocationHandler implements InvocationHandler {
 
   private String mBaseUrl;
 
   public ApiInvocationHandler(String baseUrl) {
       this.mBaseUrl = baseUrl;
   }
 
   @Override
   public Object invoke(Object o, Method method, Object[] values) throws Throwable {
       Parameter[] params = method.getParameters();
       Object[] paramValues = values;
 
       /*
       format of annotation name:
       @packageWhereAnnotationIsDeclared$AnnotationName()
       Example: @org.loklak.client.LoklakAPI$GET()
       */
       Annotation annotation = method.getAnnotations()[0];
       String annotationName = annotation.toString().toLowerCase();
 
       String apiUrl = createGetRequestUrl(mBaseUrl, method.getName(),
               params, paramValues);
       if (annotationName.contains("get")) { // GET REQUEST
           return loadJson(apiUrl);
       } else { // POST REQUEST
           JSONObject jsonObjectToPush = (JSONObject) paramValues[0];
           String postRequestUrl = createPostRequestUrl(mBaseUrl, method.getName());
           return pushJson(postRequestUrl, params[0].getName(), jsonObjectToPush);
       }
   }
}

The base URL is provided while creating the object, in the constructor. The invoke method is called whenever a defined method in the interface is called in actual code. This way an interface is implemented at runtime. The parameters of invoke method:

  • Object o – the object on which to call the method.
  • Method method – method which is called, parameters and annotations of the method are obtained using getParameters and getAnnotations methods respectively which are required to create our request url.
  • Object[] values – an array of parameter values which are passed to the method when calling it.

createGetRequestUrl is used to create the request URL for sending GET requests.

private static String createGetRequestUrl(
       String baseUrl, String methodName, Parameter[] params, Object[] paramValues) {
   String apiEndpointUrl = baseUrl.replaceAll("/+$", "") + "/api/" + methodName + ".json";
   StringBuilder url = new StringBuilder(apiEndpointUrl);
   if (params.length > 0) {
       String queryParamAndVal = "?" + params[0].getName() + "=" + paramValues[0];
       url.append(queryParamAndVal);
       for (int i = 1; i < params.length; i++) {
           String paramAndVal = "&" + params[i].getName()
                   + "=" + String.valueOf(paramValues[i]);
           url.append(paramAndVal);
       }
   }
   return url.toString();
}

Similarly, createPostRequestUrl is used to create request URL for sending POST requests.

private static String createPostRequestUrl(String baseUrl, String methodName) {
   baseUrl = baseUrl.replaceAll("/+$", "");
   return baseUrl + "/api/" + methodName + ".json";
}

Finally, Proxy.newProxyInstance is used to implement the interface at runtime. An instance of ApiInvocationHandler class is passed to the newProxyInstance method. This is all done by static method createApiMethods.

public static <T> T createApiMethods(Class<T> service, final String baseUrl) {
   ApiInvocationHandler apiInvocationHandler = new ApiInvocationHandler(baseUrl);
   return (T) Proxy.newProxyInstance(
           service.getClassLoader(), new Class<?>[]{service}, apiInvocationHandler);
}

Where:

  • Class<T> service – is the interface where API endpoint methods are defined, here LoklakAPI.class.
  • String baseUrl – web address of the server where Loklak server is hosted.

NOTE: For all this to work the library must be build using “-parameters” flag, else parameter names can’t be obtained. Since loklak_jlib_api uses maven, “-parameters” is provided in pom.xml file.

A small example:

String baseUrl = "https://api.loklak.org";
LoklakAPI loklakAPI = APIGenerator.createApiMethods(LoklakAPI.class, baseUrl);
 
JSONObject searchResult = loklakAPI.search("FOSSAsia");
Implementing Loklak APIs in Java using Reflections

Automatic Imports of Events to Open Event from online event sites with Query Server and Event Collect

One goal for the next version of the Open Event project is to allow an automatic import of events from various event listing sites. We will implement this using Open Event Import APIs and two additional modules: Query Server and Event Collect. The idea is to run the modules as micro-services or as stand-alone solutions.

Query Server
The query server is, as the name suggests, a query processor. As we are moving towards an API-centric approach for the server, query-server also has API endpoints (v1). Using this API you can get the data from the server in the mentioned format. The API itself is quite intuitive.

API to get data from query-server

GET /api/v1/search/<search-engine>/query=query&format=format

Sample Response Header

 Cache-Control: no-cache
 Connection: keep-alive
 Content-Length: 1395
 Content-Type: application/xml; charset=utf-8
 Date: Wed, 24 May 2017 08:33:42 GMT
 Server: Werkzeug/0.12.1 Python/2.7.13
 Via: 1.1 vegur

The server is built in Flask. The GitHub repository of the server contains a simple Bootstrap front-end, which is used as a testing ground for results. The query string calls the search engine result scraper scraper.py that is based on the scraper at searss. This scraper takes search engine, presently Google, Bing, DuckDuckGo and Yahoo as additional input and searches on that search engine. The output from the scraper, which can be in XML or in JSON depending on the API parameters is returned, while the search query is stored into MongoDB database with the query string indexing. This is done keeping in mind the capabilities to be added in order to use Kibana analyzing tools.

The frontend prettifies results with the help of PrismJS. The query-server will be used for initial listing of events from different search engines. This will be accessed through the following API.

The query server app can be accessed on heroku.

➢ api/list​: To provide with an initial list of events (titles and links) to be displayed on Open Event search results.

When an event is searched on Open Event, the query is passed on to query-server where a search is made by calling scraper.py with appending some details for better event hunting. Recent developments with Google include their event search feature. In the Google search app, event searches take over when Google detects that a user is looking for an event.

The feed from the scraper is parsed for events inside query server to generate a list containing Event Titles and Links. Each event in this list is then searched for in the database to check if it exists already. We will be using elastic search to achieve fuzzy searching for events in Open Event database as elastic search is planned for the API to be used.

One example of what we wish to achieve by implementing this type of search in the database follows. The user may search for

-Google Cloud Event Delhi
-Google Event, Delhi
-Google Cloud, Delhi
-google cloud delhi
-Google Cloud Onboard Delhi
-Google Delhi Cloud event

All these searches should match with “Google Cloud Onboard Event, Delhi” with good accuracy. After removing duplicates and events which already exist in the database from this list have been deleted, each event is rendered on search frontend of Open Event as a separate event. The user can click on any of these event, which will make a call to event collect.

Event Collect

The event collect project is developed as a separate module which has two parts

● Site specific scrapers
In its present state, event collect has scrapers for eventbrite and ticket-leap which, given a query, scrape eventbrite (and ticket-leap respectively) search results and downloads JSON files of each event using Loklak‘s API.
The scrapers can be developed in any form or any number of scrapers/scraping tools can be added as long as they are in alignment with the Open Event Import API’s data format. Writing tests for these against the concurrent API formats will take care of this. This part will be covered by using a json-validator​ to check against a pre-generated schema.

● REST APIs
The scrapers are exposed through a set of APIs, which will include, but not limited to,
➢ api/fetch-event : ​to scrape any event given the link and compose the data in a predefined JSON format which will be generated based on Open Event Import API. When this function is called on an event link, scrapers are invoked which collect event data such as event, meta, forms etc. This data will be validated against the generated JSON schema. The scraped JSON and directory structure for media files:
➢ api/export : to export all the JSON data containing event information into Open Event Server. As and when the scraping is complete, the data will be added into Open Event’s database as a new event.

How the Import works

The following graphic shows how the import works.




Let’s dive into the workflow. So as the diagram illustrates, the ‘search​’ functionality makes a call to api/list API endpoint provided by query-server which returns with events’ ‘Title’ and ‘Event Link’ from the parsed XML/JSON feed. This list is displayed as Open Event’s search results. Now the results having been displayed, the user can click on any of the events. When the user clicks on any event, the event is searched for in Open Event’s database. Two things happen now:

  • The event page loads if the event is found.
  • If the event does not already exist in the database, clicking on any event will

➢ Insert this event’s title and link in the database and get the event_id

➢ Make a call to api/fetch-event in event-collect which then invokes a site-specific scraper to fetch data about the event the user has chosen

➢ When the data is scraped, it is imported into Open Event database using the previously generated event_id. The page will be loaded using jquery ajax ​as and when the scraping is done.​When the imports are done, the search page refreshes with the new results. The Open Event Orga Server exposes a well documented REST API that can be used by external services to access the data.

Automatic Imports of Events to Open Event from online event sites with Query Server and Event Collect

Writing Simple Unit-Tests with JUnit

In the Loklak Server project, we use a number of automation tools like the build testing tool ‘TravisCI’, automated code reviewing tool ‘Codacy’, and ‘Gemnasium’. We are also using JUnit, a java-based unit-testing framework for writing automated Unit-Tests for java projects. It can be used to test methods to check their behaviour whenever there is any change in implementation. These unit-tests are handy and are coded specifically for the project. In the Loklak Server project it is used to test the web-scrapers. Generally JUnit is used to check if there is no change in behaviour of the methods, but in this project, it also helps in keeping check if the website code has been modified, affecting the data that is scraped.

Let’s start with basics, first by setting up, writing a simple Unit-Tests and then Test-Runners. Here we will refer how unit tests have been implemented in Loklak Server to familiarize with the JUnit Framework.

Setting-UP

Setting up JUnit with gradle is easy, You have to do just 2 things:-

1) Add JUnit dependency in build.gradle

Dependencies {

. . .

. . .<other compile groups>. . .

compile group: 'com.twitter', name: 'jsr166e', version: '1.1.0'

compile group: 'com.vividsolutions', name: 'jts', version: '1.13'

compile group: 'junit', name: 'junit', version: '4.12'

compile group: 'org.apache.logging.log4j', name: 'log4j-1.2-api', version: '2.6.2'

compile group: 'org.apache.logging.log4j', name: 'log4j-api', version: '2.6.2'

. . .

. . .

}

 

2) Add source for ‘test’ task from where tests are built (like here).

Save all tests in test directory and keep its internal directory structure identical to src directory structure. Now set the path in build.gradle so that they can be compiled.

sourceSets.test.java.srcDirs = ['test']

 

Writing Unit-Tests

In JUnit FrameWork a Unit-Test is a method that tests a particular behaviour of a section of code. Test methods are identified by annotation @Test.

Unit-Test implements methods of source files to test their behaviour. This can be done by fetching the output and comparing it with expected outputs.

The following test tests if twitter url that is created is valid or not that is to be scraped.

/**

 * This unit-test tests twitter url creation

 */

@Test

public void testPrepareSearchURL() {

String url;

String[] query = {"fossasia", "from:loklak_test",

"spacex since:2017-04-03 until:2017-04-05"};

String[] filter = {"video", "image", "video,image", "abc,video"};

String[] out_url = {

"https://twitter.com/search?f=tweets&vertical=default&q=fossasia&src=typd",

"https://twitter.com/search?f=tweets&vertical=default&q=from%3Aloklak_test&src=typd",

"and other output url strings to be matched…..."

};


// checking simple urls

for (int i = 0; i < query.length; i++) {

url = TwitterScraper.prepareSearchURL(query[i], "");


//compare urls with urls created

assertThat(out_url[i], is(url));

}


// checking urls having filters

for (int i = 0; i < filter.length; i++) {

url = TwitterScraper.prepareSearchURL(query[0], filter[i]);


//compare urls with urls created

assertThat(out_url[i+3], is(url));

}

}

 

Testing the implementation of code is useless as it will either make code more difficult to change or tests useless  . So be cautious while writing tests and keep difference between Implementation and Behaviour in mind.

This is the perfect example for a simple Unit-Test. As we see there are some points, which needs to be observed like:-

1) There is a annotation @Test .

2) Input array of query which is fed to the method TwitterScraper.prepareSearchURL() .

3) Array of urls out_url[], which are the expected urls to output.

4) asserThat() to compare the expected url (in array out_url[]) and the output url (in variable ‘url’).

NOTE: assertEquals() could also be used here, but we prefer to use assert methods to get error message that is readable (We will discuss about this some time later)

And the TestRunner

When we are working on a project, It is not feasible to run tests using gradle as they are first built  (else verified whether tests are build-ready) and then executed. gradle test shall be used only for building and testing the tests. For testing the project, one shall set-up TestRunner. It allows to run specific set of tests, one wants to run.

TestRunners are built once using gradle (with other tests) and can be run whenever you want. Also it is easy to stack up the test classes you want to run in SuiteClasses and @RunWith to run SuiteClasses with the TestRunner.

In loklak server, TestRunner runs the web-scraper tests. They are used by developers to test the changes they have made.

This is a sample TestRunner, code link here .

package org.loklak;


// Library classes imported

import org.junit.runner.RunWith;

import org.junit.runners.Suite;

// Source files to be tested

import org.loklak.harvester.TwitterScraperTest;

import org.loklak.harvester.YoutubeScraperTest;


/*

* TestRunner for harvesters

*/

@RunWith(Suite.class)

@Suite.SuiteClasses({

TwitterScraperTest.class,

YoutubeScraperTest.class

})

public class TestRunner {

}

 

You can also add TestRunners for different sections of the project. Like here it is initialized only to test harvesters.

To run the TestRunner

Add classpath of the jar file of the project and run ‘JUnitCore’ with TestRunner to get output on terminal.

java -classpath .:build/libs/<yourProject>.jar:build/classes/test org.junit.runner.JUnitCore org.loklak.TestRunner

In the project we have set up a shell script to run the tests.

Few points

1) Build the project and tests separately. Build tests only when changed as they take time to be built and executed.

2) Whenever you are done with the coding part, run the tests using TestRunner.

3) Write unit-tests whenever you add a new feature to the project to keep it up-to-date.

Now lets end up here.

So for now, Code it, Test it and Repeat.

Resources:

Writing Simple Unit-Tests with JUnit