Recommending Stack Overflow Questions with Lateral

Update: this demo has been decommissioned, but you can still read below to learn how you can build it yourself.

Stack Overflow is a programming Q&A website with over 4M users and 9M posts. It is one of many such sites, covering a variety of topics, run by StackExchange. Stack Overflow is highly successful at gamifying the answering of questions through a reputation system based on up-votes and bounties. Users can use the reputation points and badges they win to support job applications, and employers can use reputation to find the best employees. So answering many questions and earning tons of points is something that users take very seriously. But how can I find questions that I can answer?

Somewhat surprisingly, StackExchange does not offer users personalised recommendations of which of the thousands of new questions posted every day they would be best able to answer. I decided to fill this gap using Lateral’s content recommendation API, recommending questions to users based on the posts that they have interacted with in the past.

How it was done:

For full details, get in touch! I am happy to tell you everything, and it’d be great if someone built this out. In short:

  • StackExchange makes XML dumps of their data on a (roughly) monthly basis, including all the questions, answers and user interactions with the posts. So I pulled one of these data dumps and parsed it. The XML is well-formed but large (e.g. the posts XML file is 33GB). Specifically, I aggregated each question and all its answers into a single document, extracted user metadata, and recorded which users had interacted with which questions (for this I used posts and comments).
  • The user behavioural data and the Q&A text are used to build a profile of each user’s interests, using the functionality behind Lateral’s content recommender.
  • Every hour or so, the new questions from Stack Overflow are fetched using the StackExchange API via the Python SDK built by Lucas Jones. The API seems to function very well, and the SDK (while still in development) does a great job of handling throttling and pagination in a transparent manner.
  • Our recommendations of new questions to each user are refreshed.  Then we take a break.  Then we repeat all over again.
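
As a rough illustration of the first step, here is a minimal sketch of how a file of that size could be parsed as a stream instead of being loaded into memory. It is not the code used for the demo. It assumes the posts file consists of `row` elements carrying `Id`, `PostTypeId` (1 for a question, 2 for an answer), `ParentId`, `Title`, `Body` and `OwnerUserId` attributes; the function name `aggregate_posts` is made up for this example. For a real 33GB dump the two dictionaries would have to live on disk or in a database rather than in memory.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def aggregate_posts(source):
    """Stream <row> elements, grouping each question with its answers.

    Assumed attribute layout (check against the dump you download):
    PostTypeId "1" = question, "2" = answer, ParentId = question id.
    """
    documents = defaultdict(list)    # question id -> list of text pieces
    interactions = defaultdict(set)  # user id -> ids of questions touched

    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)     # first event is the start of the root
    for event, elem in context:
        if event != "end" or elem.tag != "row":
            continue
        post_type = elem.get("PostTypeId")
        if post_type == "1":
            question_id = elem.get("Id")
            text = elem.get("Title", "") + "\n" + elem.get("Body", "")
        elif post_type == "2":
            question_id = elem.get("ParentId")
            text = elem.get("Body", "")
        else:
            question_id = None       # other post types are ignored here
        if question_id is not None:
            documents[question_id].append(text)
            user_id = elem.get("OwnerUserId")
            if user_id is not None:
                interactions[user_id].add(question_id)
        root.clear()                 # drop processed rows to bound memory
    return documents, interactions


# Tiny in-memory stand-in for the real file.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Title="T" Body="Q" OwnerUserId="10"/>'
    b'<row Id="2" PostTypeId="2" ParentId="1" Body="A" OwnerUserId="11"/>'
    b'</posts>'
)
docs, seen = aggregate_posts(sample)
```

Comments could be folded in the same way by streaming the comments file and adding each commenter to `interactions` for the question the commented post belongs to.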

Some potential applications:

  • Send email alerts or push notifications to the people best able to answer a question.
  • Send personalised weekly digests of what’s been happening on Stack Overflow.
  • Turn the recommendation around the other way, and find users to solve a given problem.

Easy potential improvements:

  • We used only posts and comments. A richer view of the interests of all users (not just contributing users) could be obtained from page-view or who-upvoted-what data (not included in the data dump).
  • Take into account the time when users interacted with past posts.  For example, I could answer questions about jQuery four years ago, but that doesn’t mean that I’d want to today.
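
One possible way to act on the second improvement is to down-weight old interactions with an exponential decay. The sketch below is only an illustration under assumptions of my own: the one-year half-life, the `(topic, timestamp)` input format and the function names `decay_weight` and `weighted_interests` are all hypothetical, and this is not something the demo did.

```python
from collections import defaultdict
from datetime import datetime


def decay_weight(interaction_time, now, half_life_days=365.0):
    """Weight in (0, 1] that halves every `half_life_days` of age."""
    age_days = (now - interaction_time).total_seconds() / 86400.0
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def weighted_interests(interactions, now, half_life_days=365.0):
    """Sum decayed weights per topic.

    `interactions` is an iterable of (topic, datetime) pairs, an
    assumed format chosen for this example.
    """
    scores = defaultdict(float)
    for topic, when in interactions:
        scores[topic] += decay_weight(when, now, half_life_days)
    return dict(scores)


now = datetime(2020, 1, 1)
history = [
    ("jquery", datetime(2016, 1, 1)),  # four years old
    ("python", datetime(2019, 1, 1)),  # one year old
    ("python", datetime(2020, 1, 1)),  # today
]
interests = weighted_interests(history, now)
```

With these inputs the four-year-old jQuery interaction contributes far less to the profile than the recent Python ones, which is the behaviour the bullet above asks for. The half-life would need tuning against real data.
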
The Author
I am in charge of machine learning and data science at Lateral.