TopMenu and SiteMaps – Making loklak crawlable

So now we have seen how loklak_depot actually started off: with an accounts system and a lot of security fixes (the AAA system etc.). We have made the foundation of loklak_depot as simple and branched-out as possible. But before we go on to working on Q&A apps and Susi (the intelligent query system of loklak_depot), I noticed one problem.

How do the users on the WWW get to know about this?

loklak had not been made crawlable until recently. This prevented search engines from crawling loklak.org and showing its pages in their results. To improve our reach, enabling crawling became necessary.

To enable crawling, what we needed was a sitemap.xml file and a robots.txt. The sitemap specifies the URLs branching out from the main page (including the main page itself) and the robots.txt mainly specifies parts of the site which should NOT be crawled. Thus, both had to be made to enable crawling.

Talking about the main loklak.org website: if you visit the site, you will see a menu at the top which leads to the various pages (let's refer to it as the TopMenu). Once a crawler knows that these links are there, it will automatically crawl them. So it would be simple to create a static XML file which lists those links. But here's the catch.

We knew loklak.org was something all of us are working on (and updating) regularly, and so the TopMenu is also bound to change. We also did not want to keep updating the HTML files to accommodate changes in the TopMenu. So we decided to do two things:

1. Make the TopMenu dynamic, so that a small change in one place updates it on every page.
2. Generate the sitemap.xml dynamically from the TopMenu entries, so that the XML never has to be edited by hand.

For Part 1, we implemented a servlet which returns a JSON object containing the TopMenu items and their links. We then wrote an Angular function which parses this JSON and builds the TopMenu dynamically.

Here is the servlet TopMenuService.java. It’s pretty easy to understand:


public class TopMenuService extends AbstractAPIHandler implements APIHandler {
    
    private static final long serialVersionUID = 1839868262296635665L;

    @Override
    public BaseUserRole getMinimalBaseUserRole() { return BaseUserRole.ANONYMOUS; }

    @Override
    public JSONObject getDefaultPermissions(BaseUserRole baseUserRole) {
        return null;
    }

    @Override
    public String getAPIPath() {
        return "/cms/topmenu.json";
    }
    
    @Override
    public JSONObject serviceImpl(Query call, Authorization rights, final JSONObjectWithDefault permissions) {
        
        int limited_count = (int) DAO.getConfig("download.limited.count", (long) Integer.MAX_VALUE);
    
        JSONObject json = new JSONObject(true);
        JSONArray topmenu = new JSONArray()
            .put(new JSONObject().put("Home", "index.html"))
            .put(new JSONObject().put("About", "about.html"))
            .put(new JSONObject().put("Showcase", "showcase.html"))
            .put(new JSONObject().put("Architecture", "architecture.html"))
            .put(new JSONObject().put("Download", "download.html"))
            .put(new JSONObject().put("Tutorials", "tutorials.html"))
            .put(new JSONObject().put("API", "api.html"));
        if (limited_count > 0) topmenu.put(new JSONObject().put("Dumps", "dump.html"));
        topmenu.put(new JSONObject().put("Apps", "apps/applist/index.html"));
        json.put("items", topmenu);
        return json; // return the assembled menu so callers can read "items"
    }
}

As seen, in the serviceImpl method we build a JSONObject containing all the entries of the loklak.org TopMenu with their URLs, and return it. The "Dumps" entry is only added when the download.limited.count config value is greater than zero.
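
To make the shape of that response concrete, here is a minimal, self-contained sketch that builds a shortened version of the menu and prints it as JSON. It is an illustration only: the class TopMenuJsonSketch and its methods are hypothetical, it uses plain Java collections with hand-rolled serialization instead of the JSON library used by loklak, and it assumes names and links contain no characters that would need escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of loklak: shows the {"items":[{"Name":"link"}, ...]} shape.
class TopMenuJsonSketch {

    // Builds a shortened menu in display order; "Dumps" is only added when dumps are enabled.
    static Map<String, String> menu(boolean dumpsEnabled) {
        Map<String, String> items = new LinkedHashMap<>();
        items.put("Home", "index.html");
        items.put("About", "about.html");
        items.put("API", "api.html");
        if (dumpsEnabled) items.put("Dumps", "dump.html");
        items.put("Apps", "apps/applist/index.html");
        return items;
    }

    // Serializes every entry as its own single-key object inside the "items" array.
    // Assumption: no quotes or backslashes in names/links, so no escaping is performed.
    static String toJson(Map<String, String> items) {
        StringJoiner array = new StringJoiner(",", "[", "]");
        for (Map.Entry<String, String> e : items.entrySet()) {
            array.add("{\"" + e.getKey() + "\":\"" + e.getValue() + "\"}");
        }
        return "{\"items\":" + array + "}";
    }

    public static void main(String[] args) {
        System.out.println(toJson(menu(false)));
    }
}
```

Each menu entry is a one-key object, which is why the consumers of this JSON only ever need the first key of each array element.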

Next we need to change index.html and the JavaScript to use this JSON. Here they are:

JS:



angular.element(document).ready(function () {
  var navString = "";
  var winLocation = window.location.href;
  $.getJSON("/cms/topmenu.json", function(data) {
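    // Reverse the items because each one is prepended to the list below
    var navItems = data.items.reverse();
    var count = 0;
    $.each( navItems, function(index, itemData) {
      // Each item is a single-key object: { "Name": "link" }
      var name = Object.keys(itemData)[0];
      var link = itemData[name];
      // Now construct the li items
      var liItem = "<li>";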
      if (winLocation.indexOf(link) != -1 && count != 1) {
        liItem = "<li class='active'>";
        count = count + 1;
      }
      liItem += "<a href='\/"+link+"'>"+name+"</a></li>";
      liItem = $(liItem);
      $('#navbar > ul').prepend(liItem);
    });
  });
});

HTML:



<nav class="navbar navbar-inverse navbar-fixed-top">
      <div class="container-fluid">
        <div class="navbar-header">
          <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
            <span class="sr-only">Toggle navigation</span>
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
          </button>
          <a class="navbar-brand" href="#"></a>
        </div>
        <div id="navbar" class="navbar-collapse collapse">
          <ul class="nav navbar-nav navbar-right">
            <!-- This will get populated -->
          </ul>
        </div>
      </div>
    </nav>

In the Angular function, we parse the JSON and insert one list item per entry into the empty TopMenu list in the HTML, marking the entry whose link matches the current page as active. So all we need to do is change the entries in TopMenuService.java and the TopMenu will get updated.

So this is Part 1 done. Now comes the crawling part. We need to reuse TopMenuService in a second servlet, so that changing the entries in TopMenuService.java also changes the sitemap. TopMenuService is thus the central servlet: changing it should update both the sitemap and the TopMenu URLs shown above.

So I coded another servlet which parses the JSON from TopMenu and makes up a SiteMap:


public class Sitemap extends HttpServlet {

	private static final long serialVersionUID = -8475570405765656976L;
	private final String sitemaphead = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
			+ "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";

	@Override
	protected void doPost(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
		doGet(request, response);
	}

	@Override
	protected void doGet(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
		Query post = RemoteAccess.evaluate(request);
		// String siteurl = request.getRequestURL().toString();
		// String baseurl = siteurl.substring(0, siteurl.length() -
		// request.getRequestURI().length()) + request.getContextPath() + "/";
		String baseurl = "http://loklak.org/";
		JSONObject TopMenuJsonObject = new TopMenuService().serviceImpl(post, null, null);
		JSONArray sitesarr = TopMenuJsonObject.getJSONArray("items");
		response.setCharacterEncoding("UTF-8");
		PrintWriter sos = response.getWriter();
		sos.print(sitemaphead + "\n");
		for (int i = 0; i < sitesarr.length(); i++) {
			JSONObject sitesobj = sitesarr.getJSONObject(i);
			Iterator sites = sitesobj.keys();
			sos.print("<url>\n<loc>" + baseurl + sitesobj.getString(sites.next().toString()) + "/</loc>\n"
					+ "<changefreq>weekly</changefreq>\n</url>\n");
		}
		sos.print("</urlset>");
		sos.println();
		post.finalize();
	}
}

The XML adheres to the sitemap protocol defined at sitemaps.org. Basically, I took the JSON from TopMenuService, used an Iterator to get the key of each object (if you look at the JSON, you will notice I only need the values from the objects in the JSONArray), and then printed each URL using a PrintWriter.
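
The generation step can also be looked at in isolation from the servlet. The following is a minimal sketch under stated assumptions, not the code running on loklak.org: SitemapSketch, build and escapeXml are hypothetical names, the paths are passed in as a plain list instead of being read from TopMenuService, and each URL is entity-escaped as the sitemap protocol requires. Unlike the servlet above, the sketch does not append a trailing slash to each path; that is a choice made here for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of loklak: builds sitemap XML from a list of relative paths.
class SitemapSketch {

    // Escapes the five characters that must be entity-escaped inside sitemap URLs.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")   // must run first so later entities are not re-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Emits one <url> element per path, each with a weekly change frequency.
    static String build(String baseUrl, List<String> paths) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String path : paths) {
            xml.append("<url>\n<loc>").append(escapeXml(baseUrl + path)).append("</loc>\n")
               .append("<changefreq>weekly</changefreq>\n</url>\n");
        }
        xml.append("</urlset>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("http://loklak.org/", Arrays.asList("index.html", "about.html")));
    }
}
```

Keeping the XML construction in a pure function like this makes it easy to check the output without starting a servlet container.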

Since we wanted all the URLs to be crawled in the sitemap, the robots.txt looks something like:


User-agent: *
Sitemap: http://loklak.org/api/sitemap.xml
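
If some parts of a site should be kept out of the crawl later on, the same file can carry Disallow lines. As a rough sketch (RobotsTxtSketch and render are hypothetical names, and the "/private/" path in main is made up for illustration), robots.txt content could be rendered like this. An empty Disallow value means that nothing is blocked.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of loklak: renders robots.txt content for all user agents.
class RobotsTxtSketch {

    static String render(String sitemapUrl, List<String> disallowedPaths) {
        StringBuilder out = new StringBuilder("User-agent: *\n");
        if (disallowedPaths.isEmpty()) {
            out.append("Disallow:\n"); // empty value: every path may be crawled
        } else {
            for (String path : disallowedPaths) {
                out.append("Disallow: ").append(path).append('\n');
            }
        }
        out.append("Sitemap: ").append(sitemapUrl).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        // Everything crawlable, as in the post
        System.out.print(render("http://loklak.org/api/sitemap.xml", Collections.<String>emptyList()));
        // Made-up example with one blocked path
        System.out.print(render("http://loklak.org/api/sitemap.xml", Arrays.asList("/private/")));
    }
}
```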

So now we have a dynamically updating SiteMap and TopMenu, both controlled by a single JSONObject in TopMenuService.java. Easy, no?

That’s all for now. In my next post, I will be talking about the Q&A Apps I’m working on, as well as a bit about Susi. Till then, ciao! Feedback as always is appreciated 🙂
