Now get WordPress blog updates with Loklak!

Loklak will soon be spoiling its users!

Next, it will be bringing in tiny tweet-like cards showing the blog posts (title, publication date, author and content) from a given WordPress blog URL.

This feature is certain to expand the realm of Loklak’s mission of building a comprehensive and extensive social network dispensing useful information.

[Screenshot: tweet-like cards showing WordPress blog posts]

In order to implement this feature, I have again made use of JSoup, the Java HTML parser library, as it provides a very convenient API for extracting and manipulating data and for fetching and parsing HTML from a URL.

The information is scraped using JSoup after the corresponding URL, in the format "https://[username].wordpress.com/", is passed as an argument to the function scrapeWordpress(String blogURL){..}, which returns a JSONObject as the result.

A look at the code snippet:

/**
 *  WordPress Blog Scraper
 *  By Jigyasa Grover, @jig08
 **/

package org.loklak.harvester;

import java.io.IOException;

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WordPressBlogScraper {

    public static void main(String args[]) {
        String blogURL = "https://loklaknet.wordpress.com/";
        scrapeWordpress(blogURL);
    }

    public static JSONObject scrapeWordpress(String blogURL) {

        Document blogHTML = null;

        try {
            blogHTML = Jsoup.connect(blogURL).get();
        } catch (IOException e) {
            e.printStackTrace();
        }

        JSONArray blog = new JSONArray();
        JSONObject finalBlogInfo = new JSONObject();
        finalBlogInfo.put("Wordpress blog: " + blogURL, blog);

        // If the connection failed, return an empty result instead of
        // dereferencing a null Document
        if (blogHTML == null) {
            return finalBlogInfo;
        }

        Elements articles = blogHTML.getElementsByTag("article");

        for (Element article : articles) {
            JSONObject blogpost = new JSONObject();
            blogpost.put("blog_url", blogURL);
            // text() already returns a String; the elements are identified by
            // the CSS classes used in the WordPress page source
            blogpost.put("title", article.getElementsByClass("entry-title").text());
            blogpost.put("posted_on", article.getElementsByClass("posted-on").text());
            blogpost.put("author", article.getElementsByClass("byline").text());
            blogpost.put("content", article.getElementsByClass("entry-content").text());
            blog.put(blogpost);
        }

        System.out.println(finalBlogInfo);
        return finalBlogInfo;
    }
}


Here, an HTTP connection is simply established and text is extracted with element.text() from inside the specific tags, using identifiers such as classes or IDs. The tags from which the information was to be extracted were identified by exploring the web page’s HTML source code.
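The scraper above relies on JSoup for this. Just to illustrate the extract-text-by-class idea with the JDK alone, here is a minimal sketch that applies an XPath expression to a small, well-formed HTML snippet (the class name and sample markup are made up for the example; note the XPath matches the class attribute exactly, whereas JSoup's getElementsByClass also handles multi-class attributes):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class ClassTextExtractor {

    // Return the text of the first element whose class attribute equals cls
    static String textByClass(String html, String cls) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        // string(...) yields the text content of the first matching element
        return XPathFactory.newInstance().newXPath()
                .evaluate("string(//*[@class='" + cls + "'])", doc);
    }

    public static void main(String[] args) throws Exception {
        // A tiny stand-in for one <article> block of a blog page
        String html = "<article>"
                + "<h1 class='entry-title'>Hello Loklak</h1>"
                + "<span class='posted-on'>June 19, 2016</span>"
                + "</article>";
        System.out.println(textByClass(html, "entry-title"));  // Hello Loklak
        System.out.println(textByClass(html, "posted-on"));    // June 19, 2016
    }
}
```

JSoup remains the better fit for real pages, since it tolerates the malformed HTML that a strict XML parser would reject.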

The result thus obtained is in the form of a JSON object:

{
  "Wordpress blog: https://loklaknet.wordpress.com/": [
    {
      "posted_on": "June 19, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "shivenmian",
      "title": "loklak_depot – The Beginning: Accounts (Part 3)",
      "content": "So this is my third post in this five part series on loklak_depo... As always, feedback is duly welcome."
    },
    {
      "posted_on": "June 19, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "sopankhosla",
      "title": "Creating a Loklak App!",
      "content": "Hello everyone! Today I will be shifting from course a...ore info refer to the full documentation here. Happy Coding!!!"
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "leonmakk",
      "title": "Loklak Walls Manual Moderation – tweet storage",
      "content": "Loklak walls are going to....Stay tuned for more updates on this new feature of loklak walls!"
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "Robert",
      "title": "Under the hood: Authentication (login)",
      "content": "In the second post of .....key login is ready."
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "jigyasa",
      "title": "Loklak gives some hackernews now !",
      "content": "It's been befittingly said … Also, Stay tuned for more posts on data crawling and parsing for Loklak. Feedback and Suggestions welcome"
    },
    {
      "posted_on": "June 16, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "Damini",
      "title": "Does tweets have emotions?",
      "content": "Tweets do intend some kind o...t of features: classify(feat1,…,featN) = argmax(P(cat)*PROD(P(featI|cat)"
    },
    {
      "posted_on": "June 15, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "sudheesh001",
      "title": "Dockerize the loklak server and publish docker images to IBM Containers on Bluemix Cloud",
      "content": "Docker is an open source...nd to create and deploy instantly as well as scale on demand."
    }
  ]
}


The next step would be to "writeToBackend" the results and then parse the JSONObject as desired.

Feel free to ask questions regarding the above code snippet; I shall be happy to assist.

Feedback and Suggestions welcome 🙂


Loklak gives some hackernews now!

It’s been befittingly said, “Well, news is anything that’s interesting, that relates to what’s happening in the world, what’s happening in areas of the culture that would be of interest to your audience,” by Kurt Loder, the famous American journalist.

And what better than Hackernews (news.ycombinator.com) for the tech community? It helps the community by surfacing the latest and most important buzz, sorted by popularity, together with links to the sources.

[Screenshot: Hackernews front page]

Loklak next tried to include this important piece of information in its server by collecting data from this source. Instead of the usual scraping of HTML pages we had done for other sources before, this time we read the RSS stream instead.

Simply put, RSS (Really Simple Syndication) uses a family of standard web feed formats to publish frequently updated information: blog entries, news headlines, audio, video. A standard XML file format ensures compatibility with many different machines and programs. RSS feeds also benefit users who want to receive timely updates from favourite websites, or to aggregate data from many sites, without having to sign in to each of them.
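ROME does the heavy lifting later in this post; to see what the feed XML itself looks like structurally, here is a minimal JDK-only sketch that pulls the title of the first &lt;item&gt; out of an RSS 2.0 document (the sample data below mimics the shape of the Hackernews feed, it is not the live stream):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RssPeek {

    // Pull the title of the first <item> out of an RSS 2.0 document
    static String firstItemTitle(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        Element item = (Element) doc.getElementsByTagName("item").item(0);
        return item.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // An RSS 2.0 channel is a list of <item> elements, each with
        // <title>, <link>, <pubDate>, <description>, etc.
        String rss = "<rss version='2.0'><channel>"
                + "<title>Hacker News</title>"
                + "<item><title>Example story</title>"
                + "<link>http://example.com/</link></item>"
                + "</channel></rss>";
        System.out.println(firstItemTitle(rss));  // Example story
    }
}
```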

Hackernews RSS Feed can be fetched via the URL https://news.ycombinator.com/rss and looks something like…

[Screenshot: the Hackernews RSS feed XML]

In order to keep things simple, I decided to use the ROME framework to build an RSS reader for Hackernews for Loklak.

Just for a quick introduction, ROME is a Java framework for RSS and Atom feeds. It’s open source and licensed under the Apache 2.0 license. ROME includes a set of parsers and generators for the various flavours of syndication feeds, as well as converters from one format to another. The parsers can give you back Java objects that are either specific to the format you want to work with, or a generic normalized SyndFeed class that lets you work with the data without bothering about the incoming or outgoing feed type.

So, I made a function hackernewsRSSReader which returns a JSONObject containing a JSONArray "Hackernews RSS Feed", whose JSONObjects each represent a ‘news headline’ from the source.

The structure of the JSONObject result obtained is something like:

{
   "Hackernews RSS Feed":[
      {
         "Description":"SyndContentImpl.value=....",
         "Updated-Date":"null",
         "Link":"http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.116.241103",
         "RSS Feed":"https://news.ycombinator.com/rss",
         "Published-Date":"Wed Jun 15 13:30:33 EDT 2016",
         "Hash-Code":"1365366114",
         "Title":"Second Gravitational Wave Detected at LIGO",
         "URI":"http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.116.241103"
      },
     ......
      {
         "Description":"SyndContentImpl.value=....",
         "Updated-Date":"null",
         "Link":"http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-410-principles-of-autonomy-and-decision-making-fall-2010/lecture-notes/MIT16_410F10_lec20.pdf",
         "RSS Feed":"https://news.ycombinator.com/rss",
         "Published-Date":"Wed Jun 15 08:37:36 EDT 2016",
         "Hash-Code":"1649214835",
         "Title":"Intro to Hidden Markov Models (2010) [pdf]",
         "URI":"http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-410-principles-of-autonomy-and-decision-making-fall-2010/lecture-notes/MIT16_410F10_lec20.pdf"
      }
   ]
}

It includes information such as the Title, Link, Hash-Code, Published-Date, Updated-Date, URI and Description of each “news headline”.

The next step after extracting the information is to write it to the back-end, retrieve it whenever required, and, after parsing, display it in the format suited to the Loklak web client.

It requires the JDOM and ROME jars to be configured in the build path before proceeding with the implementation of the RSS reader.
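For reference, the approximate Maven coordinates of the two libraries at the time of writing look like the following; the exact versions and the way the Loklak build wires them in may differ:

```xml
<!-- ROME: RSS/Atom parsing (the com.sun.syndication packages used below) -->
<dependency>
    <groupId>rome</groupId>
    <artifactId>rome</artifactId>
    <version>1.0</version>
</dependency>
<!-- JDOM: XML handling, required by ROME -->
<dependency>
    <groupId>jdom</groupId>
    <artifactId>jdom</artifactId>
    <version>1.1</version>
</dependency>
```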

A look through the code of HackernewsRSSReader.java:

/**
 *  Hacker News RSS Reader
 *  By Jigyasa Grover, @jig08
 **/

package org.loklak.harvester;

import java.net.URL;

import org.json.JSONArray;
import org.json.JSONObject;

import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

public class HackernewsRSSReader {

    /*
     * For the Hackernews RSS, simply pass URL: https://news.ycombinator.com/rss
     * to the function to obtain the corresponding JSON
     */
    public static JSONObject hackernewsRSSReader(String url) {

        JSONArray jsonArray = new JSONArray();
        JSONObject rssFeed = new JSONObject();
        rssFeed.put("Hackernews RSS Feed", jsonArray);

        SyndFeed feed = null;
        try {
            URL feedUrl = new URL(url);
            feed = new SyndFeedInput().build(new XmlReader(feedUrl));
        } catch (Exception e) {
            e.printStackTrace();
        }

        // If fetching or parsing the feed failed, return an empty result
        // instead of dereferencing a null SyndFeed
        if (feed == null) {
            return rssFeed;
        }

        for (Object o : feed.getEntries()) {
            SyndEntry entry = (SyndEntry) o;

            JSONObject jsonObject = new JSONObject();
            jsonObject.put("RSS Feed", url);
            jsonObject.put("Title", entry.getTitle());
            jsonObject.put("Link", entry.getLink());
            jsonObject.put("URI", entry.getUri());
            jsonObject.put("Hash-Code", Integer.toString(entry.hashCode()));
            // String.valueOf(...) prints "null" for a missing date,
            // matching the sample output above
            jsonObject.put("Published-Date", String.valueOf(entry.getPublishedDate()));
            jsonObject.put("Updated-Date", String.valueOf(entry.getUpdatedDate()));
            jsonObject.put("Description", entry.getDescription().toString());

            jsonArray.put(jsonObject);
        }

        System.out.println(rssFeed);
        return rssFeed;
    }
}


Feel free to ask questions regarding the above code snippet.

Also, stay tuned for more posts on data crawling and parsing for Loklak.

Feedback and Suggestions welcome 🙂
