Indexing for multiscrapers in Loklak Server

I recently added multiscraper system which can scrape data from web-scrapers like YoutubeScraper, QuoraScraper, GithubScraper, etc. As scraping is a costly task, it is important to improve it’s efficiency. One of the approach is to index data in cache. TwitterScraper uses multiple sources to optimize the efficiency.

This system uses Post message holder object to store data and PostTimeline (a specialized iterator) to iterate the data objects. This difference in data structures from TwitterScraper leads to the need of different approach to implement indexing of data to ElasticSearch (currently in review process).

These are the following changes I made while implementing ‘indexing of data’ in the project.

1) Writing of data is invoked only using PostTimeline iterator

In TwitterScraper, the data is written in message holder TwitterTweet. So all the tweets are written to index as they are created. Here, when the data is scraped, Writing of the posts is initiated. Scraping of data is considered a heavy process. This approach keeps lower resource usage in average traffic on the server.

protected Post putData(Post typeArray, String key, Timeline2 postList) {
   if(!"cache".equals(this.source)) {
       postList.writeToIndex();
   }
   return this.putData(typeArray, key, postList.toArray());
}

2) One object for holding a message

During the implementation, I kept the same message holder Post and post-iterator PostTimeline from scraping to indexing of data. This helps to keep the structure uniform. Earlier approach involves different types of message wrappers in the way. This approach cuts the processes for looping and transitioning of data structures.

3) Index a list, not a message

In TwitterScraper, as the messages are enqueued in the bulk to be indexed. But in this approach, I have enqueued the complete lists. This approach delays the indexing till the scraper is done with processing the html.

Creating the queue of postlists:

// Add post-lists to queue to be indexed
queueClients.incrementAndGet();
try {
    postQueue.put(postList);
} catch (InterruptedException e) {
DAO.severe(e);
}
queueClients.decrementAndGet();

 

Indexing of the posts in postlists:

// Start indexing of data in post-lists
for (Timeline2 postList: postBulk) {
    if (postList.size() < 1) continue;
    if(postList.dump) {
        // Dumping of data in a file
        writeMessageBulkDump(postList);
    }
    // Indexing of data to ElasticSearch
    writeMessageBulkNoDump(postList);
}

 

4) Categorizing the input parameters

While searching the index, I have divided the query parameters from scraper into 3 categories. The input parameters are added to those categories (implemented using map data structure) and thus data fetched are according to them. These categories are:

// Declaring the QueryBuilder
BoolQueryBuilder query = new BoolQueryBuilder();

 

a) Get the parameter– Get the results for the input fields in map getMap.

// Result must have these fields. Acts as AND operator
if(getMap != null) {
    for(Map.Entry<String, String> field : getMap.entrySet()) {
        query.must(QueryBuilders.termQuery(
field.getKey(), field.getValue()));
    }
}

 

b) Don’t get the parameter- Don’t get the results for the input fields in map notGetMap.

// Result must not have these fields.
if(notGetMap != null) {
    for(Map.Entry<String, String> field : notGetMap.entrySet()) {
        query.mustNot(QueryBuilders.termQuery(
                field.getKey(), field.getValue()));
    }
}

 

c) Get if possible- Get the results with the input fields if they are present in the index.

// Result may preferably also get these fields. Acts as OR operator
if(mayAlsoGetMap != null) {
    for(Map.Entry<String, String> field : mayAlsoGetMap.entrySet()) {
        query.should(QueryBuilders.termQuery(
                field.getKey(), field.getValue()));

    }
}

 

By applying these changes, the scrapers are shifted from a message indexing to list of messages indexing. This way we are keeping load on RAM low, but the aggregation of latest scraped data may be affected. So there will be a need to workaround to solve this issue while scraping itself.

References

Indexing for multiscrapers in Loklak Server

Backend Scraping in Loklak Server

Loklak Server is a peer-to-peer Distributed Scraping System. It scrapes data from websites and also maintain other sources like peers, storage and a backend server to scrape data. Maintaining different sources has it’s benefits of not engaging in costly requests to the websites, no scraping of data and no cleaning of data.

Loklak Server can maintain a secondary Loklak Server (or a backend server) tuned for storing large amount of data. This enables the primary Loklak Server fetch data in return of pushing all scraped data to the backend.

Lately there was a bug in backend search as a new feature of filtering tweets was added to scraping and indexing, but not for backend search. To fix this issue, I had backtracked the backend search codebase and fix it.

Let us discuss how Backend Search works:-

1) When query is made from search endpoint with:

a) source=all

When source is set to all. The first TwitterScraper and Messages from local search server is preferred. If the messages scraped are not enough or no output has been returned for a specific amount of time, then, backend search is initiated

b) source=backend

SearchServlet specifically scrapes directly from backend server.

2) Fetching data from Backend Server

The input parameters fetched from the client is feeded into DAO.searchBackend method. The list of backend servers fetched from config file. Now using these input parameters and backend servers, the required data is scraped and output to the client.

In DAO.searchOnOtherPeers method. the request is sent to multiple servers and they are arranged in order of better response rates. This method invokes SearchServlet.search method for sending request to the mentioned servers.

List<String> remote = getBackendPeers();
if (remote.size() > 0) {
    // condition deactivated because we need always at least one peer
    Timeline tt = searchOnOtherPeers(remote, q, filterList, order, count, timezoneOffset, where, SearchServlet.backend_hash, timeout);
    if (tt != null) tt.writeToIndex();
    return tt;
}

 

3) Creation of request url and sending requests

The request url is created according to the input parameters passed to SearchServlet.search method. Here the search url is created according to input parameters and request is sent to the respective servers to fetch the required messages.

   // URL creation
    urlstring = protocolhostportstub + "/api/search.json?q="
           + URLEncoder.encode(query.replace(' ', '+'), "UTF-8") + "&timezoneOffset="
           + timezoneOffset + "&maximumRecords=" + count + "&source="
           + (source == null ? "all" : source) + "&minified=true&shortlink=false&timeout="
           + timeout;
    if(!"".equals(filterString = String.join(", ", filterList))) {
       urlstring = urlstring + "&filter=" + filterString;
    }
    // Download data
    byte[] jsonb = ClientConnection.downloadPeer(urlstring);
    if (jsonb == null || jsonb.length == 0) throw new IOException("empty content from " + protocolhostportstub);
    String jsons = UTF8.String(jsonb);
    JSONObject json = new JSONObject(jsons);
    if (json == null || json.length() == 0) return tl;
    // Final data fetched to be returned
    JSONArray statuses = json.getJSONArray("statuses");

References

Backend Scraping in Loklak Server