Social Media Analysis using Loklak (Part 2)

In the last post, I wrote about the TwitterAnalysis servlet I developed: how we can analyse the entire Twitter profile of a user and get useful data from that servlet. This was pretty simple to do, because all I really did was parse the Search API and a few other simple commands, which resulted in a concise and detailed profile analysis.

But there is something I have not spoken about yet, which I will cover in this blog post: how is the social media data collected in the first place? Where does it come from, and how?

Loklak, as is known, is a social media search server that scrapes social media sites and profiles. Scraping basically means reading the HTML source code of a website and extracting the information from the relevant tags. The p2p nature of loklak lets many peers scrape simultaneously, feed tweets to a backend, and store them in their own backends as well.

Scraping is a very well known practice, and in Java we already have easy-to-use tools like Jsoup. You just need to connect to the website, specify the tags that contain the information, and voilà. Here is an example from the EventBrite scraper we have made:


public static SusiThought crawlEventBrite(String url) {
		Document htmlPage = null;

		try {
			htmlPage = Jsoup.connect(url).get();
		} catch (Exception e) {
			e.printStackTrace();
		}

		// bail out early if the page could not be fetched; otherwise the calls below would throw a NullPointerException
		if (htmlPage == null) {
			return new SusiThought();
		}

		String eventID = null;
		String eventName = null;
		String eventDescription = null;

		// TODO Fetch Event Color
		String eventColor = null;

		String imageLink = null;

		String eventLocation = null;

		String startingTime = null;
		String endingTime = null;

		String ticketURL = null;

		Elements tagSection = null;
		Elements tagSpan = null;
		String[][] tags = new String[5][2];
		String topic = null; // By default

		String closingDateTime = null;
		String schedulePublishedOn = null;
		JSONObject creator = new JSONObject();
		String email = null;

		Float latitude = null;
		Float longitude = null;

		String privacy = "public"; // By Default
		String state = "completed"; // By Default
		String eventType = "";

		String temp;
		Elements t;

		eventID = htmlPage.getElementsByTag("body").attr("data-event-id");
		eventName = htmlPage.getElementsByClass("listing-hero-body").text();
		eventDescription = htmlPage.select("div.js-xd-read-more-toggle-view.read-more__toggle-view").text();

		eventColor = null;

		imageLink = htmlPage.getElementsByTag("picture").attr("content");

		eventLocation = htmlPage.select("p.listing-map-card-street-address.text-default").text();

		temp = htmlPage.getElementsByAttributeValue("property", "event:start_time").attr("content");
		if(temp.length() >= 20){
			startingTime = htmlPage.getElementsByAttributeValue("property", "event:start_time").attr("content").substring(0,19);
		}else{
			startingTime = htmlPage.getElementsByAttributeValue("property", "event:start_time").attr("content");
		}

		temp = htmlPage.getElementsByAttributeValue("property", "event:end_time").attr("content");
		if(temp.length() >= 20){
			endingTime = htmlPage.getElementsByAttributeValue("property", "event:end_time").attr("content").substring(0,19);
		}else{
			endingTime = htmlPage.getElementsByAttributeValue("property", "event:end_time").attr("content");
		}

		ticketURL = url + "#tickets";

		// TODO Tags to be modified to fit in the format of Open Event "topic"
		tagSection = htmlPage.getElementsByAttributeValue("data-automation", "ListingsBreadcrumbs");
		tagSpan = tagSection.select("span");
		topic = "";

		int iterator = 0, k = 0;
		for (Element e : tagSpan) {
			if (iterator % 2 == 0) {
				tags[k][1] = "www.eventbrite.com"
						+ e.select("a.js-d-track-link.badge.badge--tag.l-mar-top-2").attr("href");
			} else {
				tags[k][0] = e.text();
				k++;
			}
			iterator++;
		}

		creator.put("email", "");
		creator.put("id", "1"); // By Default

		temp = htmlPage.getElementsByAttributeValue("property", "event:location:latitude").attr("content");
		if(temp.length() > 0){
			latitude = Float
				.valueOf(htmlPage.getElementsByAttributeValue("property", "event:location:latitude").attr("content"));
		}

		temp = htmlPage.getElementsByAttributeValue("property", "event:location:longitude").attr("content");
		if(temp.length() > 0){
			longitude = Float
				.valueOf(htmlPage.getElementsByAttributeValue("property", "event:location:longitude").attr("content"));
		}

		// TODO This returns: "events.event" which is not supported by Open
		// Event Generator
		// eventType = htmlPage.getElementsByAttributeValue("property",
		// "og:type").attr("content");

		String organizerName = null;
		String organizerLink = null;
		String organizerProfileLink = null;
		String organizerWebsite = null;
		String organizerContactInfo = null;
		String organizerDescription = null;
		String organizerFacebookFeedLink = null;
		String organizerTwitterFeedLink = null;
		String organizerFacebookAccountLink = null;
		String organizerTwitterAccountLink = null;

		temp = htmlPage.select("a.js-d-scroll-to.listing-organizer-name.text-default").text();
		if(temp.length() >= 5){
			organizerName = htmlPage.select("a.js-d-scroll-to.listing-organizer-name.text-default").text().substring(4);
		}else{
			organizerName = "";
		}
		organizerLink = url + "#listing-organizer";
		organizerProfileLink = htmlPage
				.getElementsByAttributeValue("class", "js-follow js-follow-target follow-me fx--fade-in is-hidden")
				.attr("href");
		organizerContactInfo = url + "#lightbox_contact";

		Document orgProfilePage = null;

		try {
			orgProfilePage = Jsoup.connect(organizerProfileLink).get();
		} catch (Exception e) {
			e.printStackTrace();
		}

		if(orgProfilePage != null){

			t = orgProfilePage.getElementsByAttributeValue("class", "l-pad-vert-1 organizer-website");
			if (!t.isEmpty()) { // Jsoup returns an empty Elements list rather than null
				organizerWebsite = orgProfilePage.getElementsByAttributeValue("class", "l-pad-vert-1 organizer-website").text();
			}else{
				organizerWebsite = "";
			}

			t = orgProfilePage.select("div.js-long-text.organizer-description");
			if (!t.isEmpty()) {
				organizerDescription = orgProfilePage.select("div.js-long-text.organizer-description").text();
			}else{
				organizerDescription = "";
			}

			organizerFacebookFeedLink = organizerProfileLink + "#facebook_feed";
			organizerTwitterFeedLink = organizerProfileLink + "#twitter_feed";

			t = orgProfilePage.getElementsByAttributeValue("class", "fb-page");
			if (!t.isEmpty()) {
				organizerFacebookAccountLink = orgProfilePage.getElementsByAttributeValue("class", "fb-page").attr("data-href");
			}else{
				organizerFacebookAccountLink = "";
			}

			t = orgProfilePage.getElementsByAttributeValue("class", "twitter-timeline");
			if (!t.isEmpty()) {
				organizerTwitterAccountLink = orgProfilePage.getElementsByAttributeValue("class", "twitter-timeline").attr("href");
			}else{
				organizerTwitterAccountLink = "";
			}

		}

		

		JSONArray socialLinks = new JSONArray();

		JSONObject fb = new JSONObject();
		fb.put("id", "1");
		fb.put("name", "Facebook");
		fb.put("link", organizerFacebookAccountLink);
		socialLinks.put(fb);

		JSONObject tw = new JSONObject();
		tw.put("id", "2");
		tw.put("name", "Twitter");
		tw.put("link", organizerTwitterAccountLink);
		socialLinks.put(tw);

		JSONArray jsonArray = new JSONArray();

		JSONObject event = new JSONObject();
		event.put("event_url", url);
		event.put("id", eventID);
		event.put("name", eventName);
		event.put("description", eventDescription);
		event.put("color", eventColor);
		event.put("background_url", imageLink);
		event.put("closing_datetime", closingDateTime);
		event.put("creator", creator);
		event.put("email", email);
		event.put("location_name", eventLocation);
		event.put("latitude", latitude);
		event.put("longitude", longitude);
		event.put("start_time", startingTime);
		event.put("end_time", endingTime);
		event.put("logo", imageLink);
		event.put("organizer_description", organizerDescription);
		event.put("organizer_name", organizerName);
		event.put("privacy", privacy);
		event.put("schedule_published_on", schedulePublishedOn);
		event.put("state", state);
		event.put("type", eventType);
		event.put("ticket_url", ticketURL);
		event.put("social_links", socialLinks);
		event.put("topic", topic);
		jsonArray.put(event);

		JSONObject org = new JSONObject();
		org.put("organizer_name", organizerName);
		org.put("organizer_link", organizerLink);
		org.put("organizer_profile_link", organizerProfileLink);
		org.put("organizer_website", organizerWebsite);
		org.put("organizer_contact_info", organizerContactInfo);
		org.put("organizer_description", organizerDescription);
		org.put("organizer_facebook_feed_link", organizerFacebookFeedLink);
		org.put("organizer_twitter_feed_link", organizerTwitterFeedLink);
		org.put("organizer_facebook_account_link", organizerFacebookAccountLink);
		org.put("organizer_twitter_account_link", organizerTwitterAccountLink);
		jsonArray.put(org);

		JSONArray microlocations = new JSONArray();
		jsonArray.put(new JSONObject().put("microlocations", microlocations));

		JSONArray customForms = new JSONArray();
		jsonArray.put(new JSONObject().put("customForms", customForms));

		JSONArray sessionTypes = new JSONArray();
		jsonArray.put(new JSONObject().put("sessionTypes", sessionTypes));

		JSONArray sessions = new JSONArray();
		jsonArray.put(new JSONObject().put("sessions", sessions));

		JSONArray sponsors = new JSONArray();
		jsonArray.put(new JSONObject().put("sponsors", sponsors));

		JSONArray speakers = new JSONArray();
		jsonArray.put(new JSONObject().put("speakers", speakers));

		JSONArray tracks = new JSONArray();
		jsonArray.put(new JSONObject().put("tracks", tracks));
		SusiThought json = new SusiThought();
		json.setData(jsonArray);
		return json;

	}

As you can see, we first connect to the URL using Jsoup.connect(url).get() and then use methods like getElementsByAttributeValue, getElementsByTag and select to extract the information.
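
To make the basic pattern clearer, here is a minimal, self-contained sketch of the same idea; the URL and the selectors are placeholders of my own, not something taken from the loklak code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupSketch {
    public static void main(String[] args) throws Exception {
        // download the page and parse it into a DOM tree
        Document page = Jsoup.connect("https://www.example.com/some-event").get();

        // pick elements by CSS selector, tag or attribute and read their text / attribute values
        String title = page.select("h1.title").text();            // combined text of the matching elements
        String image = page.getElementsByTag("img").attr("src");  // src of the first <img> that has one

        System.out.println(title + " -> " + image);
    }
}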

This is one way of scraping: using tools like Jsoup. You could also do it manually: connect to the website, use classes like BufferedReader or InputStreamReader to read the raw HTML, and then iterate through it to extract the information. This is the method adopted for the TwitterScraper we have.
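
As a rough, minimal sketch of that manual approach (plain JDK classes only, and the URL is just an example of mine):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ManualScrapeSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.example.com/");
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                // look for a marker in the raw HTML and pull out whatever surrounds it
                if (line.contains("<title>")) {
                    System.out.println("found: " + line.trim());
                }
            }
        }
    }
}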

In the TwitterScraper, we first connect to the URL using ClientConnection and then use a BufferedReader to read the HTML, as shown here:


private static String prepareSearchURL(final String query) {
        // check
        // https://twitter.com/search-advanced for a better syntax
        // https://support.twitter.com/articles/71577-how-to-use-advanced-twitter-search#
        String https_url = "";
        try {
            StringBuilder t = new StringBuilder(query.length());
            for (String s: query.replace('+', ' ').split(" ")) {
                t.append(' ');
                if (s.startsWith("since:") || s.startsWith("until:")) {
                    int u = s.indexOf('_');
                    t.append(u < 0 ? s : s.substring(0, u));
                } else {
                    t.append(s);
                }
            }
            String q = t.length() == 0 ? "*" : URLEncoder.encode(t.substring(1), "UTF-8");
            //https://twitter.com/search?f=tweets&vertical=default&q=kaffee&src=typd
            https_url = "https://twitter.com/search?f=tweets&vertical=default&q=" + q + "&src=typd";
        } catch (UnsupportedEncodingException e) {}
        return https_url;
    }
    
    private static Timeline[] search(
            final String query,
            final Timeline.Order order,
            final boolean writeToIndex,
            final boolean writeToBackend) {
        // check
        // https://twitter.com/search-advanced for a better syntax
        // https://support.twitter.com/articles/71577-how-to-use-advanced-twitter-search#
        String https_url = prepareSearchURL(query);
        Timeline[] timelines = null;
        try {
            ClientConnection connection = new ClientConnection(https_url);
            if (connection.inputStream == null) return null;
            try {
                BufferedReader br = new BufferedReader(new InputStreamReader(connection.inputStream, StandardCharsets.UTF_8));
                timelines = search(br, order, writeToIndex, writeToBackend);
            } catch (IOException e) {
            	Log.getLog().warn(e);
            } finally {
                connection.close();
            }
        } catch (IOException e) {
            // this could mean that twitter rejected the connection (DoS protection?) or we are offline (we should be silent then)
            // Log.getLog().warn(e);
            if (timelines == null) timelines = new Timeline[]{new Timeline(order), new Timeline(order)};
        }

        // wait until all messages in the timeline are ready
        if (timelines == null) {
            // timeout occurred
            timelines = new Timeline[]{new Timeline(order), new Timeline(order)};
        }
        if (timelines != null) {
            if (timelines[0] != null) timelines[0].setScraperInfo("local");
            if (timelines[1] != null) timelines[1].setScraperInfo("local");
        }
        return timelines;
    }

If you check out the Search servlet at /api/search.json, you will see that it accepts plain query terms, and you can also use from:username or @username to see messages from a particular user. prepareSearchURL parses such a search query and converts it into a query that Twitter's search understands (Twitter's syntax does not cover everything loklak accepts), and we then use Twitter's Advanced Search to run it. In the Timeline[] search method, we use a BufferedReader to fetch the HTML of the search result and store the parsed tweets in Timeline objects for further use.
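
As a rough illustration of what prepareSearchURL produces (the query below is just an example I made up, and the exact URL depends on URLEncoder):

// a query as it might arrive from /api/search.json, with a date-and-time constraint
String query = "fossasia since:2016-06-01_00:00";

// prepareSearchURL() cuts the "_00:00" part off the since:/until: terms (Twitter only
// understands the plain date), URL-encodes the rest and builds roughly this URL:
// https://twitter.com/search?f=tweets&vertical=default&q=fossasia+since%3A2016-06-01&src=typd
String https_url = prepareSearchURL(query);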

Now this HTML has to be processed: we need to inspect the tags and extract the data from them. This is done here:


private static Timeline[] search(
            final BufferedReader br,
            final Timeline.Order order,
            final boolean writeToIndex,
            final boolean writeToBackend) throws IOException {
        Timeline timelineReady = new Timeline(order);
        Timeline timelineWorking = new Timeline(order);
        String input;
        Map<String, prop> props = new HashMap<String, prop>();
        Set<String> images = new LinkedHashSet<String>();
        Set<String> videos = new LinkedHashSet<String>();
        String place_id = "", place_name = "";
        boolean parsing_favourite = false, parsing_retweet = false;
        int line = 0; // first line is 1, according to emacs which numbers the first line also as 1
        boolean debuglog = false;
        while ((input = br.readLine()) != null){
            line++;
            input = input.trim();
            if (input.length() == 0) continue;
            
            // debug
            //if (debuglog) System.out.println(line + ": " + input);            
            //if (input.indexOf("ProfileTweet-actionCount") > 0) System.out.println(input);

            // parse
            int p;
            if ((p = input.indexOf("=\"account-group")) > 0) {
                props.put("userid", new prop(input, p, "data-user-id"));
                continue;
            }
            if ((p = input.indexOf("class=\"avatar")) > 0) {
                props.put("useravatarurl", new prop(input, p, "src"));
                continue;
            }
            if ((p = input.indexOf("class=\"fullname")) > 0) {
                props.put("userfullname", new prop(input, p, null));
                continue;
            }
            if ((p = input.indexOf("class=\"username")) > 0) {
                props.put("usernickname", new prop(input, p, null));
                continue;
            }
            if ((p = input.indexOf("class=\"tweet-timestamp")) > 0) {
                props.put("tweetstatusurl", new prop(input, 0, "href"));
                props.put("tweettimename", new prop(input, p, "title"));
                // don't continue here because "class=\"_timestamp" is in the same line 
            }
            if ((p = input.indexOf("class=\"_timestamp")) > 0) {
                props.put("tweettimems", new prop(input, p, "data-time-ms"));
                continue;
            }
            if ((p = input.indexOf("class=\"ProfileTweet-action--retweet")) > 0) {
                parsing_retweet = true;
                continue;
            }
            if ((p = input.indexOf("class=\"ProfileTweet-action--favorite")) > 0) {
                parsing_favourite = true;
                continue;
            }
            if ((p = input.indexOf("class=\"TweetTextSize")) > 0) {
                // read until closing p tag to account for new lines in tweets
                while (input.lastIndexOf("

") == -1){ input = input + ' ' + br.readLine(); } prop tweettext = new prop(input, p, null); props.put("tweettext", tweettext); continue; } if ((p = input.indexOf("class=\"ProfileTweet-actionCount")) > 0) { if (parsing_retweet) { prop tweetretweetcount = new prop(input, p, "data-tweet-stat-count"); props.put("tweetretweetcount", tweetretweetcount); parsing_retweet = false; } if (parsing_favourite) { props.put("tweetfavouritecount", new prop(input, p, "data-tweet-stat-count")); parsing_favourite = false; } continue; } // get images if ((p = input.indexOf("class=\"media media-thumbnail twitter-timeline-link media-forward is-preview")) > 0 || (p = input.indexOf("class=\"multi-photo")) > 0) { images.add(new prop(input, p, "data-resolved-url-large").value); continue; } // we have two opportunities to get video thumbnails == more images; images in the presence of video content should be treated as thumbnail for the video if ((p = input.indexOf("class=\"animated-gif-thumbnail\"")) > 0) { images.add(new prop(input, 0, "src").value); continue; } if ((p = input.indexOf("class=\"animated-gif\"")) > 0) { images.add(new prop(input, p, "poster").value); continue; } if ((p = input.indexOf("= 0 && input.indexOf("type=\"video/") > p) { videos.add(new prop(input, p, "video-src").value); continue; } if ((p = input.indexOf("class=\"Tweet-geo")) > 0) { prop place_name_prop = new prop(input, p, "title"); place_name = place_name_prop.value; continue; } if ((p = input.indexOf("class=\"ProfileTweet-actionButton u-linkClean js-nav js-geo-pivot-link")) > 0) { prop place_id_prop = new prop(input, p, "data-place-id"); place_id = place_id_prop.value; continue; } if (props.size() == 10 || (debuglog && props.size() > 4 && input.indexOf("stream-item") > 0 /* li class="js-stream-item" starts a new tweet */)) { // the tweet is complete, evaluate the result if (debuglog) System.out.println("*** line " + line + " propss.size() = " + props.size()); prop userid = props.get("userid"); if (userid == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value userid"); continue;} prop usernickname = props.get("usernickname"); if (usernickname == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value usernickname"); continue;} prop useravatarurl = props.get("useravatarurl"); if (useravatarurl == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value useravatarurl"); continue;} prop userfullname = props.get("userfullname"); if (userfullname == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value userfullname"); continue;} UserEntry user = new UserEntry( userid.value, usernickname.value, useravatarurl.value, MessageEntry.html2utf8(userfullname.value) ); ArrayList imgs = new ArrayList(images.size()); imgs.addAll(images); ArrayList vids = new ArrayList(videos.size()); vids.addAll(videos); prop tweettimems = props.get("tweettimems"); if (tweettimems == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value tweettimems"); continue;} prop tweetretweetcount = props.get("tweetretweetcount"); if (tweetretweetcount == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value tweetretweetcount"); continue;} prop tweetfavouritecount = props.get("tweetfavouritecount"); if (tweetfavouritecount == null) {if (debuglog) System.out.println("*** line " + line + " MISSING value tweetfavouritecount"); continue;} TwitterTweet tweet = new TwitterTweet( user.getScreenName(), Long.parseLong(tweettimems.value), 
props.get("tweettimename").value, props.get("tweetstatusurl").value, props.get("tweettext").value, Long.parseLong(tweetretweetcount.value), Long.parseLong(tweetfavouritecount.value), imgs, vids, place_name, place_id, user, writeToIndex, writeToBackend ); if (!DAO.messages.existsCache(tweet.getIdStr())) { // checking against the exist cache is incomplete. A false negative would just cause that a tweet is // indexed again. if (tweet.willBeTimeConsuming()) { executor.execute(tweet); //new Thread(tweet).start(); // because the executor may run the thread in the current thread it could be possible that the result is here already if (tweet.isReady()) { timelineReady.add(tweet, user); //DAO.log("SCRAPERTEST: messageINIT is ready"); } else { timelineWorking.add(tweet, user); //DAO.log("SCRAPERTEST: messageINIT unshortening"); } } else { // no additional thread needed, run the postprocessing in the current thread tweet.run(); timelineReady.add(tweet, user); } } images.clear(); props.clear(); continue; } } //for (prop p: props.values()) System.out.println(p); br.close(); return new Timeline[]{timelineReady, timelineWorking}; }

I suggest you go to Twitter's Advanced Search page, search for some terms and, once the page has loaded, inspect its HTML, because those are the tags we need to work with.

This code is largely self-explanatory. Once you have the HTML result, it is fairly easy to find by inspection which tags hold the data we need. We iterate through the HTML line by line in the while loop and check for the relevant tags: for example, images in the search result sit inside a <div class="media media-thumbnail twitter-timeline-link media-forward is-preview"> element, so we use indexOf to locate that class and pull out the image URLs. The same is done for every other piece of data we need: username, timestamp, likes count, retweets count, mentions count and so on, every single thing that the Search servlet of loklak shows.
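
The prop helper used throughout the loop is not shown in this post, but the underlying idea is plain string slicing. Here is a small sketch of how an attribute value can be pulled out of one line of HTML with indexOf and substring; the class, method and sample line are mine, not loklak's:

public class AttrSketch {

    // return the value of attr="..." inside the given html line, or null if it is not there
    static String attributeValue(String html, String attr) {
        int a = html.indexOf(attr + "=\"");
        if (a < 0) return null;
        int start = a + attr.length() + 2;      // first character of the value
        int end = html.indexOf('"', start);     // closing quote
        return end < 0 ? null : html.substring(start, end);
    }

    public static void main(String[] args) {
        String line = "<div class=\"media media-thumbnail\" data-resolved-url-large=\"https://pbs.twimg.com/pic.jpg\">";
        // prints https://pbs.twimg.com/pic.jpg
        System.out.println(attributeValue(line, "data-resolved-url-large"));
    }
}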

So this is how the social media data is scraped: we have covered scraping both with tools and manually, which are the most commonly used approaches anyway. In my next posts, I will talk about the rules for the TwitterAnalysis servlet, then about social media chat bots and how Susi is integrated into them (especially FB Messenger and Slack). Feedback is welcome 🙂
