Time across seven seas…

It has been rightly said:

Time is of your own making
Its clock ticks in your head.
The moment you stop thought
Time too stops dead.


Hence to keep up with evolving times, Loklak has now introduced a new service for “time”.

The recently developed API provides the current time and day at the location queried by the user.

The /api/locationwisetime.json API scrapes the results from timeanddate.com using our favourite JSoup, which provides a very convenient API for extracting and manipulating data, and for scraping and parsing HTML from a given URL.

In case of multiple locations with the same name, the corresponding countries are also provided, each wrapped up with its day and time as a JSONObject.

A sample query could then be something like: http://loklak.org/api/locationwisetime.json?query=london
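For a query like london, the day and time of every matching location are wrapped in a data array, along these lines (an illustrative, trimmed sample):

{
  "data": [
    {
      "location": "Canada - Ontario - London",
      "time": "Sun 9:33 PM"
    },
    {
      "location": "South Africa - East London",
      "time": "Mon 3:33 AM"
    }
  ]
}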


When implemented as a console service, this API can be used along with our dear SUSI by utilising API endpoints like: http://loklak.org/api/console.json?q=SELECT * FROM locationwisetime WHERE query='berlin';


LocationWiseTimeService.java for reference:


/**
 *  Location Wise Time
 *  timeanddate.com scraper
 *  Copyright 27.07.2016 by Jigyasa Grover, @jig08
 *
 *  This library is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU Lesser General Public
 *  License as published by the Free Software Foundation; either
 *  version 2.1 of the License, or (at your option) any later version.
 *  
 *  This library is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *  Lesser General Public License for more details.
 *  
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program in the file lgpl21.txt
 *  If not, see <http://www.gnu.org/licenses/>.
 */

package org.loklak.api.search;

import java.io.IOException;

import javax.servlet.http.HttpServletResponse;

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.loklak.server.APIException;
import org.loklak.server.APIHandler;
import org.loklak.server.AbstractAPIHandler;
import org.loklak.server.Authorization;
import org.loklak.server.BaseUserRole;
import org.loklak.server.Query;
import org.loklak.susi.SusiThought;
import org.loklak.tools.storage.JSONObjectWithDefault;

public class LocationWiseTimeService extends AbstractAPIHandler implements APIHandler {

	private static final long serialVersionUID = -1495493690406247295L;

	@Override
	public String getAPIPath() {
		return "/api/locationwisetime.json";
	}

	@Override
	public BaseUserRole getMinimalBaseUserRole() {
		return BaseUserRole.ANONYMOUS;
	}

	@Override
	public JSONObject getDefaultPermissions(BaseUserRole baseUserRole) {
		return null;
	}

	@Override
	public JSONObject serviceImpl(Query call, HttpServletResponse response, Authorization rights,
			JSONObjectWithDefault permissions) throws APIException {
		String query = call.get("query", "");
		return locationWiseTime(query);
	}

	public static SusiThought locationWiseTime(String query) {
		
		Document html;

		JSONArray arr = new JSONArray();

		try {
			html = Jsoup.connect("http://www.timeanddate.com/worldclock/results.html?query=" + query).get();
		} catch (IOException e) {
			// if timeanddate.com cannot be reached, return an empty result
			// instead of failing later with a NullPointerException
			e.printStackTrace();
			SusiThought empty = new SusiThought();
			empty.setData(arr);
			return empty;
		}

		Elements locations = html.select("td");
		int i = 0;
		for (Element e : locations) {
			if (i % 2 == 0) {
				JSONObject obj = new JSONObject();
				String l = e.getElementsByTag("a").text();
				obj.put("location", l);
				String t = e.nextElementSibling().text();
				obj.put("time", t);
				arr.put(obj);
			}
			i++;
		}
		
		SusiThought json = new SusiThought();
		json.setData(arr);
		return json;
	}

}


Hope this helps, and is worth the “time” 😛

Feel free to ask questions regarding the above code snippet; I shall be happy to assist.

Feedback and Suggestions welcome 🙂


Push & Pull : Scraped Data into Index and back

With many scrapers being integrated into the Loklak server, it is but natural that the load on the server would also increase if a multitude of requests are to be served each millisecond.

Initially, when Loklak only harvested tweets from Twitter, Elasticsearch was implemented along with a Data Access Object to handle the task of indexing.

The JSON object(s) pushed into the index were of the form statuses and had to be in a specific format to be stored and retrieved easily from the index.

Sample:


{
  "statuses": [
    {
      "id_str": "yourmessageid_1234",
      "screen_name": "testuser",
      "created_at": "2016-07-22T07:53:24.000Z",
      "text": "The rain is spain stays always in the plain",
      "source_type": "GENERIC",
      "place_name": "Georgia, USA",
      "location_point": [
        3.058579854228782,
        50.63296878274201
      ],
      "location_radius": 0,
      "user": {
        "user_id": "youruserid_5678",
        "name": "Mr. Bob",
        
      }
    }
  ]
}

But with the inclusion of many other scrapers like GitHub, WordPress, Eventbrite and RSS readers, it was a bit cumbersome to use the exact same format as that of Twitter because not all fields matched.

For example, the location-wise time service described earlier returns data like:


{
  "data": [
    {
      "location": "Canada - Ontario - London",
      "time": "Sun 9:33 PM"
    },
    {
      "location": "South Africa - East London",
      "time": "Mon 3:33 AM"
    }
  ]
}

Hence, Scott suggested the idea of implementing a DAO wrapper which would enable us to use the same schema as that of the Twitter index to push and pull data.

The DAO wrapper was implemented as a GenericJSONBuilder, which adds the remaining data fields, other than the text, into the same schema by embedding them as JSON-style key–value pairs inside the text field.

Peeking into the code:


package org.loklak.data;

import javafx.util.Pair;
import org.loklak.objects.MessageEntry;
import org.loklak.objects.QueryEntry;
import org.loklak.objects.SourceType;
import org.loklak.objects.UserEntry;

import java.net.MalformedURLException;
import java.util.*;

/**
 * The json below is the minimum json
 * {
 *   "statuses": [
 *     {
 *       "id_str": "yourmessageid_1234",
 *       "screen_name": "testuser",
 *       "created_at": "2016-07-22T07:53:24.000Z",
 *       "text": "The rain is spain stays always in the plain",
 *       "source_type": "GENERIC",
 *       "place_name": "Georgia, USA",
 *       "location_point": [3.058579854228782,50.63296878274201],
 *       "location_radius": 0,
 *       "user": {
 *         "user_id": "youruserid_5678",
 *         "name": "Mr. Bob"
 *       }
 *     }
 *   ]
 * }
 */
public class DAOWrapper {
    public static final class GenericJSONBuilder{
        private String id_str = null;
        private String screen_name = "unknown";
        private Date created_at = null;
        private String text = "";
        private String place_name = "unknown";
        private String user_name = "[email protected]";
        private String user_id = "unknown";
        private String image = null;
        private double lng = 0.0;
        private double lat = 0.0;
        private int loc_radius = 0;
        private ArrayList<String> extras = new ArrayList<>();


        /**
         * Not required
         * @param author
         * @param domain
         * @return
         */
        public GenericJSONBuilder setAuthor(String author, String domain){
            user_name = author + "@" + domain;
            screen_name = author;
            return this;
        }

        /**
         * Not required
         * @param user_id_
         * @return
         */
        public GenericJSONBuilder setUserid(String user_id_){
            user_id = user_id_;
            return this;
        }

        /**
         * Not required
         * @param id_str_
         * @return
         */
        public GenericJSONBuilder setIDstr(String id_str_){
            id_str = id_str_;
            return this;
        }

        /**
         * Not required
         * @param createdTime
         * @return
         */
        public GenericJSONBuilder setCreatedTime(Date createdTime){
            created_at = createdTime;
            return this;
        }

        /**
         * Required
         * This is the text field. You can use JSON style in this field
         * @param text_
         * @return
         */
        public GenericJSONBuilder addText(String text_){
            text = text + text_;
            return this;
        }

        /**
         * Not required
         * @param name
         * @return
         */
        public GenericJSONBuilder setPlaceName(String name){
            place_name = name;
            return this;
        }

        /**
         * Not required
         * @param longitude
         * @param latitude
         * @return
         */
        public GenericJSONBuilder setCoordinate(double longitude, double latitude){
            lng = longitude;
            lat = latitude;
            return this;
        }

        /**
         * Not required
         * @param radius
         * @return
         */
        public GenericJSONBuilder setCoordinateRadius(int radius){
            loc_radius = radius;
            return this;
        }


        /**
         * Not required
         * @param key
         * @param value
         * @return
         */
        public GenericJSONBuilder addField(String key, String value){
            String pair_string = "\"" + key + "\": \"" + value + "\"";
            extras.add(pair_string);
            return this;
        }

        private String buildFieldJSON(){
            String extra_json = "";
            for(String e:extras){
                extra_json =  extra_json + e + ",";
            }
            if(extra_json.length() > 2) extra_json = "{" + extra_json.substring(0, extra_json.length() -1) + "}";
            return extra_json;
        }

        /**
         * Not required
         * @param link_
         * @return
         */
        public GenericJSONBuilder setImage(String link_){
            image = link_;
            return this;
        }

        public void persist(){
            try{
                //building message entry
                MessageEntry message = new MessageEntry();

                /**
                 * Use hash of text if id of message is not set
                 */
                if(id_str == null)
                    id_str = String.valueOf(text.hashCode());

                message.setIdStr(id_str);

                /**
                 * Get current time if not set
                 */
                if(created_at == null)
                    created_at = new Date();
                message.setCreatedAt(created_at);


                /**
                 * Append the field as JSON text
                 */
                message.setText(text + buildFieldJSON());

                double[] locPoint = new double[2];
                locPoint[0] = lng;
                locPoint[1] = lat;

                message.setLocationPoint(locPoint);

                message.setLocationRadius(loc_radius);

                message.setPlaceName(place_name, QueryEntry.PlaceContext.ABOUT);
                message.setSourceType(SourceType.GENERIC);

                /**
                 * Insert if there is a image field
                 */
                if(image != null) message.setImages(image);

                //building user
                UserEntry user = new UserEntry(user_id, screen_name, "", user_name);

                //build message and user wrapper
                DAO.MessageWrapper wrapper = new DAO.MessageWrapper(message,user, true);

                DAO.writeMessage(wrapper);
            } catch (MalformedURLException e){
                // a malformed image URL aborts persisting this message; ignore it
            }
        }
    }

    public static GenericJSONBuilder builder(){
        return new GenericJSONBuilder();
    }

    public static void insert(Insertable msg){

        GenericJSONBuilder bd = builder()
        .setAuthor(msg.getUsername(), msg.getDomain())
        .addText(msg.getText())
        .setUserid(msg.getUserID());

        /**
         * Insert the fields
         */
        List<Pair<String, String>> fields = msg.getExtraField();
        for(Pair<String, String> field:fields){
            bd.addField(field.getKey(), field.getValue());
        }

        // finally push the built message into the index
        bd.persist();
    }
}

DAOWrapper was then used with other scrapers to push the data into the index as:


...
DAOWrapper.builder()
    .addText(json.toString())
    .setUserid("profile_" + profile)
    .persist();
...

Here, addText(...) can be used several times to append text to the object, but each set...(...) method should be used only once, and persist() must be called exactly once at the end, as it is the method which finally pushes the message into the index. Also note that since builder() returns a fresh builder on every call, all these calls must be chained on a single builder instance.

Now, when a scraper receives a request to scrape a given HTML page, a check is first made to see whether the data already exists in the index, with the help of a unique user ID string. This saves the time and effort of scraping the page all over again; instead, the saved instance is simply returned.

The check is done something like this:


if(DAO.existUser("profile_"+profile)){
    /*
     *  Return existing JSON Data
     */
}else{
    /*
     *  Scrape the HTML Page addressed by the given URL
     */
}

This pushing and pulling into the index would certainly reduce the load on the Loklak server.

Feel free to ask questions regarding the above.

Feedback and Suggestions welcome 🙂

Push & Pull : Scraped Data into Index and back

Spin-Off: Loklak fuels Open Event

Continuing with the Loklak & Open Event partnership (check out Loklak fuels Open Event), we can now, in a few clicks, easily create our very own web-app for an event, with details imported from eventbrite.com, powered by Loklak.

The scraping of data using JSoup, the Java HTML parser, was explained in the previous post of this series.

Next, a console service was implemented as the single point for information retrieval from various social networks and websites (a post on it is coming soon 😉 ), especially for SUSI (our very own personal digital assistant, a cute one indeed!).

The JSONArray result of the EventBriteCrawler was set in a SusiThought, which is nothing but a piece of data that can be remembered. The structure of the thought can be modelled as a table, which may be created by retrieving information from elsewhere for the current argument.


/** Defining SusiThought as a class 
 * which extends JSONObject
 */

public class SusiThought extends JSONObject {

/* details coming soon.... */

}

/** Modifications in EventBriteCrawler
 *  Returning SusiThought instead of 
 * a simple JSONObject/JSONArray.
 */
public static SusiThought crawlEventBrite(String url) {
    ...
    ...    
    SusiThought json = new SusiThought();
    json.setData(jsonArray);
    return json;
}
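While the detailed post on SusiThought is yet to come, here is a minimal sketch of the idea, assuming only what the usage above shows (not the actual loklak implementation): setData(...) stores the table of results under a "data" key of the underlying JSONObject.

public class SusiThought extends JSONObject {

    /**
     * Store the table of results under the "data" key.
     * (Sketch only; the real class carries more metadata.)
     */
    public SusiThought setData(JSONArray table) {
        this.put("data", table);
        return this;
    }

    public JSONArray getData() {
        return this.has("data") ? this.getJSONArray("data") : new JSONArray();
    }
}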


The API endpoint was thus created.
A sample is: http://loklak.org/api/console.json?q=SELECT * FROM eventbrite WHERE url='https://www.eventbrite.fr/e/billets-europeade-2016-concert-de-musique-vocale-25592599153';


The files generated were next imported into the Open Event web app generator, using simple steps.



It’s amazing to see how a great visual platform is provided to edit details parsed from the plain JSONObject and deploy the personalized web-app!


Tadaa !
We have our very own event web-app with all the information imported from eventbrite.com in a single (well, very few 😛 ) click(s)!

With this, we conclude the Loklak – Open Event – EventBrite series.

Stay tuned for detailed post on SUSI and Console Services 🙂


Convert web pages into structured data

Loklak provides a new API which converts web pages into structured JSON data. The genericscraper API helps you scrape any web page from a given URL and provides you with the structured JSON data. Just place the URL in the given format: http://localhost:9000/api/genericscraper.json?url=http://www.google.com

This scrapes generic data from a given web page URL; for instance, this is the output after scraping the main Google search page.

{
  "Text in Links": [
    "Images",
    "Maps",
    "Play",
    "YouTube",
    "News",
    "Gmail",
    "Drive",
    "More »",
    "Web History",
    "Settings",
    "Sign in",
    "Advanced search",
    "Language tools",
    "हिन्दी",
    "বাংলা",
    "తెలుగు",
    "मराठी",
    "தமிழ்",
    "ગુજરાતી",
    "ಕನ್ನಡ",
    "മലയാളം",
    "ਪੰਜਾਬੀ",
    "Advertising Programs",
    "Business Solutions",
    "+Google",
    "About Google",
    "Google.com",
    "Privacy",
    "Terms"
  ],
  "Image files": [],
  "source files": [],
  "Links": [
    "http://www.google.co.in/imghp?hl=en&tab=wi",
    "http://maps.google.co.in/maps?hl=en&tab=wl",
    "https://play.google.com/?hl=en&tab=w8",
    "http://www.youtube.com/?gl=IN&tab=w1",
    "http://news.google.co.in/nwshp?hl=en&tab=wn",
    "https://mail.google.com/mail/?tab=wm",
    "https://drive.google.com/?tab=wo",
    "https://www.google.co.in/intl/en/options/",
    "http://www.google.co.in/history/optout?hl=en",
    "/preferences?hl=en",
    "https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.co.in/%3Fgfe_rd%3Dcr%26ei%3DR_xpV6G9M-PA8gfis7rIDA",
    "/advanced_search?hl=en-IN&authuser=0",
    "/language_tools?hl=en-IN&authuser=0",
    "http://www.google.co.in/setprefs?sig=0_VODpnfQFFvCo-TLhn2_Kr9sRC2c%3D&hl=hi&source=homepage",
    "http://www.google.co.in/setprefs?sig=0_VODpnfQFFvCo-TLhn2_Kr9sRC2c%3D&hl=bn&source=homepage",
    "http://www.google.co.in/setprefs?sig=0_VODpnfQFFvCo-TLhn2_Kr9sRC2c%3D&hl=te&source=homepage",
    "http://www.google.co.in/setprefs?sig=0_VODpnfQFFvCo-TLhn2_Kr9sRC2c%3D&hl=mr&source=homepage",
    "http://www.google.co.in/setprefs?sig=0_VODpnfQFFvCo-TLhn2_Kr9sRC2c%3D&hl=ta&source=homepage",
    "http://www.google.co.in/setprefs?sig=0_VODpnfQFFvCo-TLhn2_Kr9sRC2c%3D&hl=gu&source=homepage",
    "http://www.google.co.in/setprefs?sig=0_VODpnfQFFvCo-TLhn2_Kr9sRC2c%3D&hl=kn&source=homepage",
    "http://www.google.co.in/setprefs?sig=0_VODpnfQFFvCo-TLhn2_Kr9sRC2c%3D&hl=ml&source=homepage",
    "http://www.google.co.in/setprefs?sig=0_VODpnfQFFvCo-TLhn2_Kr9sRC2c%3D&hl=pa&source=homepage",
    "/intl/en/ads/",
    "http://www.google.co.in/services/",
    "https://plus.google.com/104205742743787718296",
    "/intl/en/about.html",
    "http://www.google.co.in/setprefdomain?prefdom=US&sig=__SF4cV2qKAyiHu9OKv2V_rNxesko%3D",
    "/intl/en/policies/privacy/",
    "/intl/en/policies/terms/",
    "/images/branding/product/ico/googleg_lodp.ico"
  ],
  "language": "en-IN",
  "title": "Google",
  "Script Files": []
}

I wrote a generic scraper using the popular Java HTML parsing library JSoup. I scraped generic fields like the title, images, links, source files and the text inside links. After the generic scraper was ready, I registered the API endpoint as /api/genericscraper.json along with the servlet.
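Registering the endpoint follows the same servlet pattern as LocationWiseTimeService.java shown earlier (a minimal sketch; the actual GenericScraper servlet may differ in details):

@Override
public String getAPIPath() {
    return "/api/genericscraper.json";
}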


I loaded the page from the given URL, taking the value from the url parameter. I scraped the relevant tags using getElementsByTag. After storing the elements, I looped through each list and retrieved the attributes from the tags. I stored the data accordingly and pushed it into JSONArrays. After the necessary scraping, I pushed the JSON arrays into a JSONObject and pretty-printed it, as sketched below.
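A condensed sketch of that flow (illustrative only; the class and method names here are hypothetical and the field names are taken from the sample output above; the actual servlet code differs):

import java.io.IOException;

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class GenericScraperSketch {

    public static JSONObject scrape(String url) throws IOException {
        Document page = Jsoup.connect(url).get();

        // anchor tags yield both the link targets and the text in links
        JSONArray links = new JSONArray();
        JSONArray linkTexts = new JSONArray();
        for (Element a : page.getElementsByTag("a")) {
            links.put(a.attr("href"));
            linkTexts.put(a.text());
        }

        // image and script tags yield the image and script file references
        JSONArray images = new JSONArray();
        for (Element img : page.getElementsByTag("img")) {
            images.put(img.attr("src"));
        }
        JSONArray scripts = new JSONArray();
        for (Element script : page.getElementsByTag("script")) {
            if (!script.attr("src").isEmpty()) scripts.put(script.attr("src"));
        }

        // wrap the arrays and the simple fields into one JSONObject
        JSONObject result = new JSONObject();
        result.put("title", page.title());
        result.put("language", page.getElementsByTag("html").attr("lang"));
        result.put("Links", links);
        result.put("Text in Links", linkTexts);
        result.put("Image files", images);
        result.put("Script Files", scripts);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // pretty print with an indentation of two spaces
        System.out.println(scrape("http://www.google.com").toString(2));
    }
}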


There is an app consuming the above API, called WebScraper, under the Loklak apps page.



Loklak ShuoShuo: Another feather in the cap !

Work is still going on in Loklak Weibo to extract the information as desired, and there shall be another post up soon explaining the intricacies of the implementation.

Currently, an attempt was made to parse the HTML page of QQShuoShuo.com (another Chinese Twitter-like service).

Just like last time, the major challenge is to understand the Chinese annotations, especially being from a non-Chinese background. Google Translate aids in testing the retrieved data by helping me match each phrase and/or line.

I have made use of JSoup, the Java HTML parser library, which assists in extracting and manipulating data, and in scraping and parsing HTML from a given URL. JSoup is designed to deal with all varieties of HTML, hence as of now it is considered a suitable choice.


/**
 *  Shuoshuo Crawler
 *  By Jigyasa Grover, @jig08
 **/

package org.loklak.harvester;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ShuoshuoCrawler {
    public static void main(String args[]){

        try {
            Document shuoshuoHTML = Jsoup.connect("http://www.qqshuoshuo.com/").get();

            // the recommended talks are the list items inside the element with id "list2"
            Element recommendedTalkBox = shuoshuoHTML.getElementById("list2");
            Elements recommendedTalksList = recommendedTalkBox.getElementsByTag("li");

            // collect the talk texts; a list avoids the overflow risk of a fixed-size array
            List<String> recommendedTalksResult = new ArrayList<>();
            for (Element recommendedTalk : recommendedTalksList) {
                recommendedTalksResult.add(recommendedTalk.text());
            }

            System.out.println("Total Recommended Talks: " + recommendedTalksResult.size());
            for (int k = 0; k < recommendedTalksResult.size(); k++) {
                System.out.println("Recommended Talk " + k + ": " + recommendedTalksResult.get(k));
            }

        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

The QQ Recommended Talks from qqshuoshuo.com are now stored as a list of strings.

Total Recommended Talks: 10
Recommended Talk 0: 不会在意无视我的人,不会忘记帮助过我的人,不会去恨真心爱过我的人。
Recommended Talk 1: 喜欢一个人是一种感觉,不喜欢一个人却是事实。事实容易解释,感觉却难以言喻。
Recommended Talk 2: 一个人容易从别人的世界走出来却走不出自己的沙漠
Recommended Talk 3: 有什么了不起,不就是幸福在左边,我站在了右边?
Recommended Talk 4: 希望我跟你的爱,就像新闻联播一样没有大结局
Recommended Talk 5: 你会遇到别的女子和她举案齐眉,而我自会有别的男子与我白首相携。
Recommended Talk 6: 既然爱,为什么不说出口,有些东西失去了,就再也回不来了!
Recommended Talk 7: 凡事都有可能,永远别说永远。
Recommended Talk 8: 都是因为爱,而喜欢上了怀旧;都是因为你,而喜欢上了怀念。
Recommended Talk 9: 爱是老去,爱是新生,爱是一切,爱是你。

A similar approach can now be used to do the same for the Latest QQ Talk and the QQ Talk Leaderboard.
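For instance, a small generalisation of the crawler above could take the container id as a parameter (a sketch; the element ids of the other sections are assumptions to be checked against the page source):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ShuoshuoSectionCrawler {

    // extract the text of every <li> under the container with the given id
    public static List<String> extractTalks(String containerId) throws IOException {
        Document html = Jsoup.connect("http://www.qqshuoshuo.com/").get();
        List<String> talks = new ArrayList<>();
        Element box = html.getElementById(containerId);
        if (box != null) {
            for (Element li : box.getElementsByTag("li")) {
                talks.add(li.text());
            }
        }
        return talks;
    }

    public static void main(String[] args) throws IOException {
        // "list2" holds the recommended talks; other sections would use their own ids
        System.out.println(extractTalks("list2"));
    }
}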


Check out this space for upcoming details on implementing this technique to parse the entire page and get the desired results…

Feedback and Suggestions welcome.
