Harvesting strategies in loklak are something that the end users can’t see, but they play a vital role in deciding the way in which messages are collected with loklak. One of the strategies in loklak is defined by the Kaizen Harvester, which generates queries from collected messages.
The original strategy used a simple hash queue which drops queries once it is full. This effect is not desirable as we tend to lose important queries in this process if they come up late while harvesting. To overcome this behaviour without losing important search queries, we needed to come up with new harvesting strategy(ies) that would provide a better approach for harvesting. In this blog post, I am discussing the changes made in the kaizen harvester so it can be extended to create different flavors of harvesters.
What can be different in extended harvesters?
To make the Kaizen harvester extendable, we first needed to decide that what are the parts in the original Kaizen harvester that can be changed to make the strategy different (and probably better).
Since one of the most crucial part of the Kaizen harvester was the way it stores queries to be processed, it was one of the most obvious things to change. Another thing that should be allowed to configure across various strategies was the decision of whether to go for harvesting the queries from the query list.
Query storage with KaizenQueries
To allow different methods of storing the queries, KaizenQueries class was introduced in loklak. It was configured to provide basic methods that would be required for a query storing technique to work. A query storing technique can be any data structure that we can use to store search queries for Kaizen harvester.
Also, a default type of KaizenQueries was introduced to use in the original Kaizen harvester. This allowed the same interface as the original queue which was used in the harvester.
Another constructor was introduced in Kaizen harvester which allowed setting the KaizenQueries for an instance of its derived classes. It solved the problem of providing an interface of KaizenQueries inside the Kaizen harvester which can be used by any inherited strategy –
This being added, getting new queries or adding new queries was a simple. We just need to use getQuery() and addQuery() methods without worrying about the internal implementations.
Configurable decision for harvesting
As mentioned earlier, the decision taken for harvesting should also be configurable. For this, a protected method was implemented and used in harvest() method –
The shallHarvest() method can be overridden by derived classes to allow any type of harvesting decision that they want. For example, we can configure it in such a way that it harvests if there are any queries in the queue, or to use a different probability distribution instead of linear (maybe gaussian around 1).
This blog post explained about the changes made in loklak’s Kaizen harvester which allowed other harvesting strategies to be built on its top. It discussed the two components changed and how they allowed ease of inheriting the original Kaizen harvester. These changes were proposed in PR loklak/loklak_server#1203 by @singhpratyush (me).
- Harvesting strategies in loklak server – https://github.com/loklak/loklak_server/blob/development/src/org/loklak/harvester/strategy.
- Improving Harvesting Decision for Kaizen Harvester in loklak server – http://blog.fossasia.org/improving-harvesting-decision-for-kaizen-harvester-in-loklak-server/.
- Kaizen harvester before the change – https://github.com/loklak/loklak_server/blob/a0ccbdfdd86b128824e56235c877e942cee7c325/src/org/loklak/harvester/strategy/KaizenHarvester.java.