A Machine Learning Solution For Better User Experience
We have recently had an opportunity to design and develop an independent machine learning-based service for a social publishing and e-learning platform. The client needed to build a service that would deliver a recommendation system with automatic content classification. In this post I would like to share some background to the work as well as some lessons learnt.
A request like that may seem somewhat challenging at first, yet with the right tools at hand the task becomes much easier. Since the client did not impose any technological constraints on the solution, we weighed the pros and cons of a few options and eventually chose the MLlib machine learning library from the Apache Spark suite. The key driver of our decision was the need to scale up easily once the volume of data processed for learning exceeds the capabilities of a single machine – a distributed data processing framework suits that job very well. On top of that, the stack consisted of the Spring Framework, exposing our services through a secured RESTful API, and MongoDB for internal data storage. Everything was wrapped in a self-contained Docker image.
The machine learning service (ML Service) was meant to be a piece of independent software working alongside the client’s application, which was being developed concurrently by another team. Because of that, we did not have as much on-site data for generating personalized content recommendations as we would have liked. After analyzing the client’s business domain and brainstorming with their development team, we came up with a fairly simple solution. We chose to expose an API endpoint in the ML Service so that the client’s application could send information about users who reached 50% progress on any given content item. We assumed such data would let us identify the content that users found interesting enough to watch or read. The measurement of user content consumption progress was already performed by existing functionality in the client’s application, so we just needed to push the historical data to the ML Service and start recording new progress events. With the data available in this form, we decided to use the collaborative filtering approach, which enabled us to create personalized user recommendations.
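Before any model can be trained, the raw progress events have to be turned into implicit-feedback interactions. The sketch below (names and event shape are illustrative, not the production API) shows the essential transformation: each 50%-progress event becomes a binary "user is interested in item" signal, with duplicates collapsed since the events tell us *that* a user was interested, not *how much*.

```python
from collections import defaultdict

def build_interactions(progress_events):
    """Turn 50%-progress events into implicit-feedback pairs.

    Each event is a (user_id, content_id) tuple reported by the
    client's application once a user passes 50% of a content item.
    Repeated events for the same pair collapse into one positive
    interaction: we only know "interested", not a rating strength.
    """
    interactions = defaultdict(set)
    for user_id, content_id in progress_events:
        interactions[user_id].add(content_id)
    return interactions

events = [
    ("user1", "course-a"), ("user1", "course-b"),
    ("user2", "course-a"), ("user1", "course-a"),  # duplicate event
]
matrix = build_interactions(events)
print(sorted(matrix["user1"]))  # ['course-a', 'course-b']
```

In production these interactions would feed an implicit-feedback model such as MLlib's ALS rather than stay in an in-memory dictionary, but the preprocessing step is the same idea.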
In short, the collaborative filtering approach is about predicting what a user will like based on their similarity to other users, i.e. the characteristics and preferences they share. In the figure above we can see that User 1 purchased Products 1, 2, 3 and 4. Now, if we needed to create recommendations for User 3, who has so far bought Products 2 and 3, we would need to identify other users with similar preferences. Having analyzed the available data, we end up recommending Products 1 and 4 based on User 1’s shopping history – of all users, User 1’s pattern is the most similar to the one exhibited by User 3.
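The intuition above can be reproduced in a few lines. This is a simplified neighborhood-style sketch (cosine similarity between users' item sets), not the ALS matrix-factorization model that MLlib actually provides, but it makes the "find the most similar user, recommend what they have that you don't" logic concrete:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sets of purchased items."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def recommend(target, purchases):
    """Recommend items from the most similar user's basket."""
    target_items = purchases[target]
    most_similar = max(
        (u for u in purchases if u != target),
        key=lambda u: cosine(target_items, purchases[u]),
    )
    return sorted(purchases[most_similar] - target_items)

# The example from the figure: User 1 bought Products 1-4,
# User 3 bought Products 2 and 3.
purchases = {
    "user1": {"p1", "p2", "p3", "p4"},
    "user2": {"p1"},
    "user3": {"p2", "p3"},
}
print(recommend("user3", purchases))  # ['p1', 'p4']
```

User 1 shares two of User 3's purchases (similarity ≈ 0.71) while User 2 shares none, so Products 1 and 4 come out as the recommendations, matching the figure.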
The second requirement we were confronted with was to add an automatic content classification functionality, which in this case meant automatic labeling of text content. You can see a similar feature in the Medium publishing platform.
When – for example – one copies and pastes this BBC article into the Medium story editor, the latter automatically generates two thematically related labels based on the content of the story. Our client expected the ML Service to mimic such functionality. With a sufficient number of labeled (classified) text content items, the only issue left to tackle was finding a machine learning algorithm that would support such a use case. Unfortunately, at that time the client did not yet have a body of data sufficient for training the algorithm. We solved the problem by building a helper tool that used a Twitter stream to automatically generate a training set, with some logic responsible for extracting the labels and text content needed. We were thus able to generate a training set from “nothing” and – most importantly – the tool permanently scans the Twitter stream (it is doing so even right now), so each time we train the algorithm, the training is performed against a bigger training set.
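The label-and-text extraction logic can be sketched as follows. This is an illustrative reconstruction, not the production code: it treats hashtags that match a label vocabulary as training labels and strips hashtags, mentions and URLs from the body so the classifier learns from the prose alone. The label vocabulary here is a made-up example.

```python
import re

def tweet_to_training_example(tweet_text, allowed_labels):
    """Extract labels (hashtags) and clean text from a raw tweet.

    Hashtags found in our label vocabulary become training labels;
    tweets carrying none of them are discarded. All hashtags,
    mentions and URLs are removed from the text body.
    """
    hashtags = {t.lower() for t in re.findall(r"#(\w+)", tweet_text)}
    labels = hashtags & allowed_labels
    if not labels:
        return None  # tweet carries no label we care about
    text = re.sub(r"#\w+|@\w+|https?://\S+", "", tweet_text)
    return " ".join(text.split()), sorted(labels)

example = tweet_to_training_example(
    "Great intro to neural nets! #MachineLearning #python https://t.co/x",
    {"machinelearning", "python"},
)
print(example)  # ('Great intro to neural nets!', ['machinelearning', 'python'])
```

Run continuously against a filtered stream, a function like this accumulates an ever-growing labeled corpus, which is exactly what made training possible despite starting from “nothing”.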
As for the machine learning part, we used the one-vs-rest technique, whereby we build one binary classifier for each class we want to detect; in our domain, a class is a label. Text content coming from the client’s application through an API endpoint is then passed to all the classifiers, which in turn return predictions as to which labels apply to the given content item.
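The one-vs-rest architecture itself is independent of the underlying binary learner. The sketch below plugs in a deliberately trivial keyword-based learner just to keep the example self-contained and runnable; in the real service each per-label classifier would be an MLlib model trained on the Twitter-derived corpus. All names here are illustrative.

```python
def train_one_vs_rest(examples, labels, train_binary):
    """One-vs-rest: fit one binary (yes/no) classifier per label.

    `examples` is a list of (text, label_set) pairs; `train_binary`
    fits a single binary classifier from (text, is_positive) pairs.
    """
    return {
        label: train_binary([(text, label in ls) for text, ls in examples])
        for label in labels
    }

def predict(classifiers, text):
    """Run every per-label classifier; collect the labels that fire."""
    return sorted(label for label, clf in classifiers.items() if clf(text))

def train_binary(pairs):
    """Toy binary learner: keep words seen only in positive examples."""
    pos = set().union(*(set(t.lower().split()) for t, y in pairs if y))
    neg = set().union(*(set(t.lower().split()) for t, y in pairs if not y))
    keywords = pos - neg
    return lambda text: bool(keywords & set(text.lower().split()))

examples = [
    ("spark makes big data easy", {"bigdata"}),
    ("neural nets learn features", {"ml"}),
    ("spark mllib trains models", {"bigdata", "ml"}),
]
classifiers = train_one_vs_rest(examples, {"bigdata", "ml"}, train_binary)
print(predict(classifiers, "training neural models with spark"))  # ['bigdata', 'ml']
```

Because each label gets its own independent classifier, a single content item can receive several labels at once – exactly the multi-label behavior the Medium-style tagging feature requires.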
With the experience gathered, we believe that whenever you can access or generate a sufficient body of data and want to devise more intelligent interactions for your users, machine learning is perfectly suited for the purpose. There is a wide spectrum of ML-related tools – many of them open source – that you can use to deliver a better user experience on your projects. Do not hesitate to use them.