Social Media Analysis using Loklak (Part 2)

In the last post, I wrote about the TwitterAnalysis servlet I developed: how we can analyse the entire Twitter profile of a user and get useful data out of that servlet. This was pretty simple to do, because all I really did was parse the Search API output and a few other simple commands, which resulted in a concise yet detailed profile analysis.

But there’s something I haven’t spoken about yet, which I will in this blog post: how is the social media data collected in that form in the first place? Where does it come from, and how?

Loklak, as is known, is a social media search server that scrapes social media sites and profiles. Scraping essentially means fetching the HTML source code of a website and pulling the information out from between the relevant tags. The p2p nature of loklak lets many peers scrape concurrently and feed tweets to a common backend, while also storing them in their own backends.

Scraping is a well-known practice, and in Java we already have easy-to-use tools for it, such as Jsoup. You just connect to the website, specify the tags between which the information sits, and voilà. Here is an example from the EventBrite scraper we have made:

public static SusiThought crawlEventBrite(String url) {
		Document htmlPage = null;

		try {
			htmlPage = Jsoup.connect(url).get();
		} catch (Exception e) {
			e.printStackTrace();
		}

		String eventID = null;
		String eventName = null;
		String eventDescription = null;

		// TODO Fetch Event Color
		String eventColor = null;

		String imageLink = null;

		String eventLocation = null;

		String startingTime = null;
		String endingTime = null;

		String ticketURL = null;

		Elements tagSection = null;
		Elements tagSpan = null;
		String[][] tags = new String[5][2];
		String topic = null; // By default

		String closingDateTime = null;
		String schedulePublishedOn = null;
		JSONObject creator = new JSONObject();
		String email = null;

		Float latitude = null;
		Float longitude = null;

		String privacy = "public"; // By Default
		String state = "completed"; // By Default
		String eventType = "";

		String temp;
		Elements t;

		eventID = htmlPage.getElementsByTag("body").attr("data-event-id");
		eventName = htmlPage.getElementsByClass("listing-hero-body").text();
		eventDescription = htmlPage.select("div.js-xd-read-more-contents.l-mar-top-3").text();

		eventColor = null;

		imageLink = htmlPage.getElementsByTag("picture").attr("content");

		eventLocation = htmlPage.select("p.listing-map-card-street-address.text-default").text();

		temp = htmlPage.getElementsByAttributeValue("property", "event:start_time").attr("content");
		if (temp.length() >= 20) {
			startingTime = temp.substring(0, 19);
		} else {
			startingTime = temp;
		}

		temp = htmlPage.getElementsByAttributeValue("property", "event:end_time").attr("content");
		if (temp.length() >= 20) {
			endingTime = temp.substring(0, 19);
		} else {
			endingTime = temp;
		}

		ticketURL = url + "#tickets";

		// TODO Tags to be modified to fit in the format of Open Event "topic"
		tagSection = htmlPage.getElementsByAttributeValue("data-automation", "ListingsBreadcrumbs");
		tagSpan = tagSection.select("span");
		topic = "";

		int iterator = 0, k = 0;
		for (Element e : tagSpan) {
			if (iterator % 2 == 0) {
				tags[k][1] = "www.eventbrite.com" + e.getElementsByTag("a").attr("href");
			} else {
				tags[k][0] = e.text();
				k++;
			}
			iterator++;
		}

		creator.put("email", "");
		creator.put("id", "1"); // By Default

		temp = htmlPage.getElementsByAttributeValue("property", "event:location:latitude").attr("content");
		if (temp.length() > 0) {
			latitude = Float.valueOf(temp);
		}

		temp = htmlPage.getElementsByAttributeValue("property", "event:location:longitude").attr("content");
		if (temp.length() > 0) {
			longitude = Float.valueOf(temp);
		}

		// TODO This returns: "events.event" which is not supported by Open
		// Event Generator
		// eventType = htmlPage.getElementsByAttributeValue("property",
		// "og:type").attr("content");

		String organizerName = null;
		String organizerLink = null;
		String organizerProfileLink = null;
		String organizerWebsite = null;
		String organizerContactInfo = null;
		String organizerDescription = null;
		String organizerFacebookFeedLink = null;
		String organizerTwitterFeedLink = null;
		String organizerFacebookAccountLink = null;
		String organizerTwitterAccountLink = null;

		temp = htmlPage.select("a.js-d-scroll-to.listing-organizer-name.text-default").text();
		if (temp.length() >= 5) {
			organizerName = temp.substring(4);
		} else {
			organizerName = "";
		}
		organizerLink = url + "#listing-organizer";
		organizerProfileLink = htmlPage
				.getElementsByAttributeValue("class", "js-follow js-follow-target follow-me fx--fade-in is-hidden")
				.attr("href");
		organizerContactInfo = url + "#lightbox_contact";

		Document orgProfilePage = null;

		try {
			orgProfilePage = Jsoup.connect(organizerProfileLink).get();
		} catch (Exception e) {
			e.printStackTrace();
		}

		if (orgProfilePage != null) {

			t = orgProfilePage.getElementsByAttributeValue("class", "l-pad-vert-1 organizer-website");
			if (t != null) {
				organizerWebsite = t.text();
			} else {
				organizerWebsite = "";
			}

			t = orgProfilePage.select("div.js-long-text.organizer-description");
			if (t != null) {
				organizerDescription = t.text();
			} else {
				organizerDescription = "";
			}

			organizerFacebookFeedLink = organizerProfileLink + "#facebook_feed";
			organizerTwitterFeedLink = organizerProfileLink + "#twitter_feed";

			t = orgProfilePage.getElementsByAttributeValue("class", "fb-page");
			if (t != null) {
				organizerFacebookAccountLink = t.attr("data-href");
			} else {
				organizerFacebookAccountLink = "";
			}

			t = orgProfilePage.getElementsByAttributeValue("class", "twitter-timeline");
			if (t != null) {
				organizerTwitterAccountLink = t.attr("href");
			} else {
				organizerTwitterAccountLink = "";
			}
		}

		JSONArray socialLinks = new JSONArray();

		JSONObject fb = new JSONObject();
		fb.put("id", "1");
		fb.put("name", "Facebook");
		fb.put("link", organizerFacebookAccountLink);
		socialLinks.put(fb);

		JSONObject tw = new JSONObject();
		tw.put("id", "2");
		tw.put("name", "Twitter");
		tw.put("link", organizerTwitterAccountLink);
		socialLinks.put(tw);

		JSONArray jsonArray = new JSONArray();

		JSONObject event = new JSONObject();
		event.put("event_url", url);
		event.put("id", eventID);
		event.put("name", eventName);
		event.put("description", eventDescription);
		event.put("color", eventColor);
		event.put("background_url", imageLink);
		event.put("closing_datetime", closingDateTime);
		event.put("creator", creator);
		event.put("email", email);
		event.put("location_name", eventLocation);
		event.put("latitude", latitude);
		event.put("longitude", longitude);
		event.put("start_time", startingTime);
		event.put("end_time", endingTime);
		event.put("logo", imageLink);
		event.put("organizer_description", organizerDescription);
		event.put("organizer_name", organizerName);
		event.put("privacy", privacy);
		event.put("schedule_published_on", schedulePublishedOn);
		event.put("state", state);
		event.put("type", eventType);
		event.put("ticket_url", ticketURL);
		event.put("social_links", socialLinks);
		event.put("topic", topic);

		JSONObject org = new JSONObject();
		org.put("organizer_name", organizerName);
		org.put("organizer_link", organizerLink);
		org.put("organizer_profile_link", organizerProfileLink);
		org.put("organizer_website", organizerWebsite);
		org.put("organizer_contact_info", organizerContactInfo);
		org.put("organizer_description", organizerDescription);
		org.put("organizer_facebook_feed_link", organizerFacebookFeedLink);
		org.put("organizer_twitter_feed_link", organizerTwitterFeedLink);
		org.put("organizer_facebook_account_link", organizerFacebookAccountLink);
		org.put("organizer_twitter_account_link", organizerTwitterAccountLink);

		JSONArray microlocations = new JSONArray();
		jsonArray.put(new JSONObject().put("microlocations", microlocations));

		JSONArray customForms = new JSONArray();
		jsonArray.put(new JSONObject().put("customForms", customForms));

		JSONArray sessionTypes = new JSONArray();
		jsonArray.put(new JSONObject().put("sessionTypes", sessionTypes));

		JSONArray sessions = new JSONArray();
		jsonArray.put(new JSONObject().put("sessions", sessions));

		JSONArray sponsors = new JSONArray();
		jsonArray.put(new JSONObject().put("sponsors", sponsors));

		JSONArray speakers = new JSONArray();
		jsonArray.put(new JSONObject().put("speakers", speakers));

		JSONArray tracks = new JSONArray();
		jsonArray.put(new JSONObject().put("tracks", tracks));
		SusiThought json = new SusiThought();
		json.setData(jsonArray);
		return json;
	}

As you can see, we first connect to the URL using Jsoup.connect(url).get() and then use methods like getElementsByAttributeValue, getElementsByTag and select to extract the information.
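
To see the same pattern in isolation, here is a minimal, self-contained sketch of these Jsoup calls. The URL and selectors here are illustrative placeholders, not EventBrite’s real markup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupSketch {
    public static void main(String[] args) throws Exception {
        // download the page and parse it into a DOM
        Document page = Jsoup.connect("https://example.com/event").get();

        // CSS selector: the text inside <h1 class="title">...</h1>
        String title = page.select("h1.title").text();

        // attribute lookup: <meta property="og:image" content="...">
        String image = page.getElementsByAttributeValue("property", "og:image").attr("content");

        System.out.println(title + " | " + image);
    }
}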

This is one way of scraping: using a tool like Jsoup. You can also do it manually: connect to the website, read the raw HTML with a BufferedReader or InputStreamReader, and then iterate through it line by line to extract the information. This is the approach taken in our TwitterScraper.
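
To illustrate the manual route, here is a rough sketch, with a placeholder URL and tag rather than loklak’s actual code, that downloads a page over HttpURLConnection and scans each line by hand:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ManualScrapeSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/"); // placeholder URL
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String input;
            while ((input = br.readLine()) != null) {
                // look for the tag we care about, here the page title
                int p = input.indexOf("<title>");
                if (p >= 0) {
                    int q = input.indexOf("</title>", p);
                    if (q > p) System.out.println(input.substring(p + 7, q));
                }
            }
        }
    }
}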

In the TwitterScraper, we first connect to the URL using ClientConnection() and then use BufferedReader to get the HTML code, as shown here.

    private static String prepareSearchURL(final String query) {
        // check https://twitter.com/search-advanced for a better syntax
        String https_url = "";
        try {
            StringBuilder t = new StringBuilder(query.length());
            for (String s: query.replace('+', ' ').split(" ")) {
                t.append(' ');
                if (s.startsWith("since:") || s.startsWith("until:")) {
                    int u = s.indexOf('_');
                    t.append(u < 0 ? s : s.substring(0, u));
                } else {
                    t.append(s);
                }
            }
            String q = t.length() == 0 ? "*" : URLEncoder.encode(t.substring(1), "UTF-8");
            https_url = "https://twitter.com/search?f=tweets&vertical=default&q=" + q + "&src=typd";
        } catch (UnsupportedEncodingException e) {}
        return https_url;
    }

    private static Timeline[] search(
            final String query,
            final Timeline.Order order,
            final boolean writeToIndex,
            final boolean writeToBackend) {
        // check https://twitter.com/search-advanced for a better syntax
        String https_url = prepareSearchURL(query);
        Timeline[] timelines = null;
        try {
            ClientConnection connection = new ClientConnection(https_url);
            if (connection.inputStream == null) return null;
            try {
                BufferedReader br = new BufferedReader(new InputStreamReader(connection.inputStream, StandardCharsets.UTF_8));
                timelines = search(br, order, writeToIndex, writeToBackend);
            } catch (IOException e) {
                Log.getLog().warn(e);
            } finally {
                connection.close();
            }
        } catch (IOException e) {
            // this could mean that twitter rejected the connection (DoS protection?) or we are offline (we should be silent then)
            // Log.getLog().warn(e);
            if (timelines == null) timelines = new Timeline[]{new Timeline(order), new Timeline(order)};
        }

        // wait until all messages in the timeline are ready
        if (timelines == null) {
            // timeout occurred
            timelines = new Timeline[]{new Timeline(order), new Timeline(order)};
        }
        if (timelines != null) {
            if (timelines[0] != null) timelines[0].setScraperInfo("local");
            if (timelines[1] != null) timelines[1].setScraperInfo("local");
        }
        return timelines;
    }

If you check out the Search servlet at /api/search.json, you will see that it accepts plain query terms, but also forms like from:username or @username to see messages from a particular user. prepareSearchURL parses this search query and converts it into a term Twitter’s search understands (since Twitter’s own syntax doesn’t have this feature), and we then run it through Twitter’s Advanced Search. In the Timeline[] search method, we use a BufferedReader to read the HTML of the search result, and the parsed tweets are stored in Timeline objects for further use.
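
As a concrete illustration, the standalone snippet below mimics just the encoding step (assuming the reconstructed URL pattern shown above): a loklak query such as from:fossasia since:2016-06-01_00:00 keeps the since:/until: keywords, drops the part after the underscore, and URL-encodes the rest:

import java.net.URLEncoder;

public class SearchUrlDemo {
    public static void main(String[] args) throws Exception {
        String query = "from:fossasia since:2016-06-01_00:00";
        StringBuilder t = new StringBuilder();
        for (String s : query.replace('+', ' ').split(" ")) {
            t.append(' ');
            if (s.startsWith("since:") || s.startsWith("until:")) {
                int u = s.indexOf('_');
                t.append(u < 0 ? s : s.substring(0, u)); // cut off the time part
            } else {
                t.append(s);
            }
        }
        String q = URLEncoder.encode(t.substring(1), "UTF-8");
        System.out.println("https://twitter.com/search?f=tweets&vertical=default&q=" + q + "&src=typd");
        // prints ...q=from%3Afossasia+since%3A2016-06-01&src=typd
    }
}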

Now this HTML has to be processed: we need to find the relevant tags and work with them. That is achieved here:

    private static Timeline[] search(
            final BufferedReader br,
            final Timeline.Order order,
            final boolean writeToIndex,
            final boolean writeToBackend) throws IOException {
        Timeline timelineReady = new Timeline(order);
        Timeline timelineWorking = new Timeline(order);
        String input;
        Map<String, prop> props = new HashMap<String, prop>();
        Set<String> images = new LinkedHashSet<String>();
        Set<String> videos = new LinkedHashSet<String>();
        String place_id = "", place_name = "";
        boolean parsing_favourite = false, parsing_retweet = false;
        int line = 0; // first line is 1, according to emacs which numbers the first line also as 1
        boolean debuglog = false;
        while ((input = br.readLine()) != null){
            line++;
            input = input.trim();
            if (input.length() == 0) continue;
            // debug
            //if (debuglog) System.out.println(line + ": " + input);
            //if (input.indexOf("ProfileTweet-actionCount") > 0) System.out.println(input);

            // parse
            int p;
            if ((p = input.indexOf("=\"account-group")) > 0) {
                props.put("userid", new prop(input, p, "data-user-id"));
                continue;
            }
            if ((p = input.indexOf("class=\"avatar")) > 0) {
                props.put("useravatarurl", new prop(input, p, "src"));
                continue;
            }
            if ((p = input.indexOf("class=\"fullname")) > 0) {
                props.put("userfullname", new prop(input, p, null));
                continue;
            }
            if ((p = input.indexOf("class=\"username")) > 0) {
                props.put("usernickname", new prop(input, p, null));
                continue;
            }
            if ((p = input.indexOf("class=\"tweet-timestamp")) > 0) {
                props.put("tweetstatusurl", new prop(input, 0, "href"));
                props.put("tweettimename", new prop(input, p, "title"));
                // don't continue here because "class=\"_timestamp" is in the same line
            }
            if ((p = input.indexOf("class=\"_timestamp")) > 0) {
                props.put("tweettimems", new prop(input, p, "data-time-ms"));
                continue;
            }
            if ((p = input.indexOf("class=\"ProfileTweet-action--retweet")) > 0) {
                parsing_retweet = true;
                continue;
            }
            if ((p = input.indexOf("class=\"ProfileTweet-action--favorite")) > 0) {
                parsing_favourite = true;
                continue;
            }
            if ((p = input.indexOf("class=\"TweetTextSize")) > 0) {
                // read until closing p tag to account for new lines in tweets
                while (input.lastIndexOf("</p>") == -1){
                    input = input + ' ' + br.readLine();
                }
                prop tweettext = new prop(input, p, null);
                props.put("tweettext", tweettext);
                continue;
            }
            if ((p = input.indexOf("class=\"ProfileTweet-actionCount")) > 0) {
                if (parsing_retweet) {
                    prop tweetretweetcount = new prop(input, p, "data-tweet-stat-count");
                    props.put("tweetretweetcount", tweetretweetcount);
                    parsing_retweet = false;
                }
                if (parsing_favourite) {
                    props.put("tweetfavouritecount", new prop(input, p, "data-tweet-stat-count"));
                    parsing_favourite = false;
                }
                continue;
            }
            // get images
            if ((p = input.indexOf("class=\"media media-thumbnail twitter-timeline-link media-forward is-preview")) > 0 ||
                (p = input.indexOf("class=\"multi-photo")) > 0) {
                images.add(new prop(input, p, "data-resolved-url-large").value);
                continue;
            }
            // we have two opportunities to get video thumbnails == more images; images in the presence of video
            // content should be treated as thumbnail for the video
            if ((p = input.indexOf("class=\"animated-gif-thumbnail\"")) > 0) {
                images.add(new prop(input, 0, "src").value);
                continue;
            }
            if ((p = input.indexOf("class=\"animated-gif\"")) > 0) {
                images.add(new prop(input, p, "poster").value);
                continue;
            }
            if ((p = input.indexOf("<source video-src")) >= 0 && input.indexOf("type=\"video/") > p) {
                videos.add(new prop(input, p, "video-src").value);
                continue;
            }
            if ((p = input.indexOf("class=\"Tweet-geo")) > 0) {
                prop place_name_prop = new prop(input, p, "title");
                place_name = place_name_prop.value;
                continue;
            }
            if ((p = input.indexOf("class=\"ProfileTweet-actionButton u-linkClean js-nav js-geo-pivot-link")) > 0) {
                prop place_id_prop = new prop(input, p, "data-place-id");
                place_id = place_id_prop.value;
                continue;
            }
            if (props.size() == 10 || (debuglog && props.size() > 4 && input.indexOf("stream-item") > 0 /* li class="js-stream-item" starts a new tweet */)) {
                // the tweet is complete, evaluate the result
                if (debuglog) System.out.println("*** line " + line + " props.size() = " + props.size());
                prop userid = props.get("userid");
                if (userid == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value userid"); continue;}
                prop usernickname = props.get("usernickname");
                if (usernickname == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value usernickname"); continue;}
                prop useravatarurl = props.get("useravatarurl");
                if (useravatarurl == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value useravatarurl"); continue;}
                prop userfullname = props.get("userfullname");
                if (userfullname == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value userfullname"); continue;}
                UserEntry user = new UserEntry(
                        userid.value,
                        usernickname.value,
                        useravatarurl.value,
                        MessageEntry.html2utf8(userfullname.value)
                );
                ArrayList<String> imgs = new ArrayList<String>(images.size()); imgs.addAll(images);
                ArrayList<String> vids = new ArrayList<String>(videos.size()); vids.addAll(videos);
                prop tweettimems = props.get("tweettimems");
                if (tweettimems == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value tweettimems"); continue;}
                prop tweetretweetcount = props.get("tweetretweetcount");
                if (tweetretweetcount == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value tweetretweetcount"); continue;}
                prop tweetfavouritecount = props.get("tweetfavouritecount");
                if (tweetfavouritecount == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value tweetfavouritecount"); continue;}
                TwitterTweet tweet = new TwitterTweet(
                        user.getScreenName(),
                        Long.parseLong(tweettimems.value),
                        props.get("tweettimename").value,
                        props.get("tweetstatusurl").value,
                        props.get("tweettext").value,
                        Long.parseLong(tweetretweetcount.value),
                        Long.parseLong(tweetfavouritecount.value),
                        imgs, vids, place_name, place_id,
                        user, writeToIndex, writeToBackend
                );
                if (!DAO.messages.existsCache(tweet.getIdStr())) {
                    // checking against the exist cache is incomplete. A false negative would just cause that a tweet is
                    // indexed again.
                    if (tweet.willBeTimeConsuming()) {
                        executor.execute(tweet);
                        //new Thread(tweet).start();
                        // because the executor may run the thread in the current thread it could be possible that the result is here already
                        if (tweet.isReady()) {
                            timelineReady.add(tweet, user);
                            //DAO.log("SCRAPERTEST: messageINIT is ready");
                        } else {
                            timelineWorking.add(tweet, user);
                            //DAO.log("SCRAPERTEST: messageINIT unshortening");
                        }
                    } else {
                        // no additional thread needed, run the postprocessing in the current thread
                        timelineReady.add(tweet, user);
                    }
                }
                images.clear();
                props.clear();
                continue;
            }
        }
        //for (prop p: props.values()) System.out.println(p);
        br.close();
        return new Timeline[]{timelineReady, timelineWorking};
    }

I suggest you go to Twitter’s Advanced Search page and search for some terms, and once the page has loaded, check out its HTML, because those are the tags we work with.

This code is fairly self-explanatory. Once you have the HTML result, it is easy to see by inspection which tags surround the data we need. We iterate through the code with the while loop and then check the tags: for example, images in the search result were stored inside a <div class="media media-thumbnail twitter-timeline-link media-forward is-preview"></div> tag, so we use indexOf to locate those tags and pull out the images. The same is done for all the data we need: username, timestamp, likes count, retweets count, mentions count and so on, every single thing that loklak’s Search servlet shows.
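
The prop objects sprinkled through the parser boil down to simple string arithmetic: find the attribute name after the anchor position, then cut the value out from between the following pair of quotes. A simplified re-implementation of that idea (not the exact prop class from loklak) might look like this:

public class PropSketch {
    // extracts attr="value" from a line of HTML, starting the search at position p
    static String attributeValue(String input, int p, String attr) {
        int a = input.indexOf(attr + "=\"", p);  // locate attr=" after the anchor
        if (a < 0) return null;
        int start = a + attr.length() + 2;       // first character of the value
        int end = input.indexOf('"', start);     // closing quote
        return end < 0 ? null : input.substring(start, end);
    }

    public static void main(String[] args) {
        String input = "<div class=\"media media-thumbnail\" data-resolved-url-large=\"https://pbs.twimg.com/media/xyz.jpg\">";
        int p = input.indexOf("class=\"media media-thumbnail");
        System.out.println(attributeValue(input, p, "data-resolved-url-large"));
        // prints https://pbs.twimg.com/media/xyz.jpg
    }
}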

So this is how the social media data is scraped. We have covered scraping both with tools and manually, which are the most commonly used methods anyway. In my next posts, I will talk about the rules for the TwitterAnalysis servlet, and then about social media chat bots and how Susi is integrated into them (especially in FB Messenger and Slack). Feedback is welcome 🙂

Loklak ShuoShuo: Another feather in the cap!

Work is still going on for Loklak Weibo to extract the information as desired, and there shall be another post up soon explaining the intricacies of that implementation.

In the meantime, an attempt has been made to parse the HTML page of QQ Shuoshuo (another Chinese Twitter-like service).

Just like last time, the major challenge is understanding the Chinese annotations, especially coming from a non-Chinese background. Google Translate aids in testing the retrieved data by helping me match each phrase and line.

I have made use of Jsoup, the Java HTML parser library, which assists in extracting and manipulating data by scraping and parsing HTML from a URL. Jsoup is designed to deal with all varieties of HTML, so as of now it is considered a suitable choice.


/*
 *  Shuoshuo Crawler
 *  By Jigyasa Grover, @jig08
 */

package org.loklak.harvester;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

public class ShuoshuoCrawler {
    public static void main(String args[]){

        Document shuoshuoHTML = null;
        Element recommendedTalkBox = null;
        Elements recommendedTalksList = null;
        String recommendedTalksResult[] = new String[100];
        Integer numberOfrecommendedTalks = 0;
        Integer i = 0;

        try {
            shuoshuoHTML = Jsoup.connect("http://www.qqshuoshuo.com/").get();

            recommendedTalkBox = shuoshuoHTML.getElementById("list2");
            recommendedTalksList = recommendedTalkBox.getElementsByTag("li");

            for (Element recommendedTalks : recommendedTalksList){
                //System.out.println("\nLine: " + recommendedTalks.text());
                recommendedTalksResult[i] = recommendedTalks.text().toString();
                i++;
            }
            numberOfrecommendedTalks = i;
            System.out.println("Total Recommended Talks: " + numberOfrecommendedTalks);
            for(int k = 0; k < numberOfrecommendedTalks; k++){
                System.out.println("Recommended Talk " + k + ": " + recommendedTalksResult[k]);
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The QQ Recommended Talks are now stored as an array of Strings:

Total Recommended Talks: 10
Recommended Talk 0: 不会在意无视我的人,不会忘记帮助过我的人,不会去恨真心爱过我的人。
Recommended Talk 1: 喜欢一个人是一种感觉,不喜欢一个人却是事实。事实容易解释,感觉却难以言喻。
Recommended Talk 2: 一个人容易从别人的世界走出来却走不出自己的沙漠
Recommended Talk 3: 有什么了不起,不就是幸福在左边,我站在了右边?
Recommended Talk 4: 希望我跟你的爱,就像新闻联播一样没有大结局
Recommended Talk 5: 你会遇到别的女子和她举案齐眉,而我自会有别的男子与我白首相携。
Recommended Talk 6: 既然爱,为什么不说出口,有些东西失去了,就再也回不来了!
Recommended Talk 7: 凡事都有可能,永远别说永远。
Recommended Talk 8: 都是因为爱,而喜欢上了怀旧;都是因为你,而喜欢上了怀念。
Recommended Talk 9: 爱是老去,爱是新生,爱是一切,爱是你。

A similar approach can now be used for the Latest QQ Talks and the QQ Talks Leaderboard, as sketched below.
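
Here is a hedged sketch of how that generalization might look, assuming the other sections are also <li> lists under known element ids (the ids would have to be read off the live page; "list2" is the only one confirmed so far):

import java.util.ArrayList;
import java.util.List;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TalkSectionSketch {
    // collects the text of every <li> under the element with the given id
    static List<String> extractTalks(Document page, String sectionId) {
        List<String> talks = new ArrayList<String>();
        Element section = page.getElementById(sectionId);
        if (section != null) {
            for (Element li : section.getElementsByTag("li")) {
                talks.add(li.text());
            }
        }
        return talks;
    }
    // usage: extractTalks(shuoshuoHTML, "list2") for the recommended talks,
    // plus the ids of the "latest" and "leaderboard" boxes once identified
}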

Watch this space for upcoming details on implementing this technique to parse the entire page and get the desired results…

Feedback and suggestions are welcome.
