TLDR: Link to the code: LINK
Besides being known as one of the largest e-commerce platforms in Southeast Asia, Shopee regularly holds its own data science competition. This year (2020), the competition was held virtually for Indonesian citizens.
I joined this competition together with my former university mates for a fun week-long project. We are grateful to have won the first place and would love to share our solution as a reference to any future projects.
Link to competition page: https://www.kaggle.com/c/product-matching-id-ndsc-2020/overview
“Lowest Price Guaranteed” is one of Shopee’s promises to its customers that certain items sold are the cheapest among competitors. One of the fundamental components behind this is called “Product Matching”, where a machine learning model is used to automatically detect whether 2 items, based on their product information such as title / description / images, are actually the same item in real life.
Hence, the task is: Given item pairs, build a model to predict if they are the same or different products. This is categorised as a “binary classification task” in the machine learning world.
For instance, the item pair above might have different titles, images and prices. However, they are actually the same product in real life (at least according to the people who labelled the data, of course).
Each submission is evaluated and ranked based on the Macro F1 Score.

The F1 Score (without the Macro part) is defined as the harmonic mean of Precision and Recall, which means a low Precision or a low Recall results in a lower F1 Score compared to using the regular arithmetic mean (normal average).

The Macro F1 Score is defined as the arithmetic mean of the F1 Score of the positive class and the F1 Score of the negative class.

For more information regarding this metric, go to Wikipedia.
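To make the metric concrete, here is a minimal sketch in pure Python that computes the per-class F1 Score and the macro average. The toy labels at the bottom are made up for illustration, not taken from the competition data:

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 Score treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Arithmetic mean of the F1 Scores of class 1 and class 0."""
    return (f1_for_class(y_true, y_pred, 1) + f1_for_class(y_true, y_pred, 0)) / 2

# Toy example: 5 item pairs, 1 = duplicate, 0 = not duplicate.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
```

In practice the same number comes out of `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`; the hand-rolled version just makes the two averaging steps (harmonic within a class, arithmetic across classes) visible.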
The data structure is straightforward: each item pair is represented by their respective titles and images, with a label indicating whether they are duplicates (1 means duplicate, 0 means not). In fact, this format is very similar to a past Kaggle competition, Quora Question Pairs, with the difference of having image data.
The label distribution is not heavily imbalanced: 57% of the training data are marked as duplicates.
The training data is only ~30% the size of the test data. Moreover, the test data is only published 3 hours before the competition ends. This means a model which generalises well, plus a robust validation set, are needed to stay ahead of the competition.
Item titles are lit 🔥: they contain a wide variety of characters, and the vocabulary is a mix of English and Indonesian. Hence, text pre-processing is crucial in order for us to extract value from this data. Moreover, leveraging a pre-trained word embedding model trained on conventional data sources (e.g. Wikipedia articles) will not help us much here.
There is a shortcoming in the training data: some item pairs are labelled as 0 (not duplicate) when in fact they are duplicates. This shows that the labelling process is not perfect. There is not much we can do to circumvent this issue, other than hoping that the test data has a similar “bias” to the training data, and training our model to learn that “bias”.
Extracting Features from Text Data
The text data needs to be pre-processed before extracting any features out of it. This is due to our findings while exploring the data above (wide variety of characters + bilingual vocabulary). Here are the steps:
- Remove any special characters (e.g. emojis) and keep only alphanumeric characters, because they do not signify a product’s identity
- Remove stop words which are commonly used in Indonesian e-commerce, such as “grosir”, “cod”, “diskon”, “starseller”, etc. These words are usually “hype words” used by sellers to boost their sales; however, they do not help much in uncovering the identity of a product
- Add white space after each number in a product title: Those numbers may signify a product’s variant or quantity and may be a differentiating factor compared to other items
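The steps above can be sketched as a small cleaning function. The stop-word list here is just a few illustrative examples (the actual list we used was longer):

```python
import re

# Illustrative subset of Indonesian e-commerce "hype words".
ECOMMERCE_STOPWORDS = {"grosir", "cod", "diskon", "starseller", "promo"}

def preprocess_title(title: str) -> str:
    title = title.lower()
    # 1. Keep only alphanumeric characters (drops emojis and symbols).
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    # 2. Add white space around each run of digits, so "500ml" -> "500 ml".
    title = re.sub(r"(\d+)", r" \1 ", title)
    # 3. Remove e-commerce stop words.
    tokens = [t for t in title.split() if t not in ECOMMERCE_STOPWORDS]
    return " ".join(tokens)
```

For example, `preprocess_title("GROSIR!! Botol Minum 500ml 🔥 diskon")` yields `"botol minum 500 ml"`, separating the quantity from its unit and dropping the hype words.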
After pre-processing, to obtain the feature vector of each item title, we used FastText, a library that allows users to perform unsupervised learning on the entire text data set to obtain a feature vector for a given item title. Its architecture is similar to how the famous Word2Vec is trained, which is by using a continuous bag of words (CBOW) or Skip-Gram model with negative sampling. Visit Jay Alammar’s blog for a great illustration of how Word2Vec works.
One key feature of FastText is its ability to produce a feature vector for any word, even made-up ones. This is due to the fact that word vectors are built from the sub-strings of characters contained in the word. This allows us to build feature vectors even for misspelled brand names or concatenations of words. Moreover, FastText allows us to choose the dimension of the feature vector, unlike TF-IDF. After training the FastText model, we obtained a 100-dimensional feature vector for each item title.
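The sub-string mechanism can be illustrated with a short sketch. FastText represents a word by the character n-grams it contains, wrapped in boundary markers `<` and `>` (the full word is also kept as its own token); a word's vector is built from its n-gram vectors, so even unseen or misspelled words get a representation. This is a minimal n-gram extractor for illustration, not the actual library code:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams in the style of FastText's subword model."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes/suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# A misspelled brand still shares most n-grams with the correct spelling,
# so their resulting vectors end up close together.
shared = set(char_ngrams("adidas")) & set(char_ngrams("addidas"))
```

Because "adidas" and "addidas" share n-grams such as `<ad` and `das>`, a FastText model places them near each other even if the misspelling never appeared in training.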
On top of obtaining feature vectors, we added more features by calculating a “distance measure” for each feature vector pair. The intuition here is: duplicate items should have similar titles, hence the distance between their feature vectors should be small. For the “distance measures”, we covered various statistics such as Euclidean (L2) distance, Manhattan (L1) distance, Cosine distance, etc.
Finally, we calculated the Levenshtein distance between the 2 item titles. Levenshtein distance is defined as the minimum number of single-character edits (insertions, deletions or substitutions) required to change one item title into the other.
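A sketch of these pairwise features, assuming each title has already been mapped to a vector. The distance measures use NumPy, and the Levenshtein distance is a plain dynamic-programming implementation:

```python
import numpy as np

def distance_features(u, v):
    """Distance measures between two titles' feature vectors."""
    return {
        "euclidean": float(np.linalg.norm(u - v)),   # L2 distance
        "manhattan": float(np.abs(u - v).sum()),     # L1 distance
        "cosine": float(1 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))),
    }

def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

For instance, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion). In practice, the distance measures can equally be taken from `scipy.spatial.distance`.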
Extracting Features from Image Data
For extracting features from image data, we employed a similar strategy to text (representing an image as an N-dimensional feature vector). However, unlike text, where we had to train an unsupervised model ourselves, we used transfer learning from an existing model called ResNet-34 to obtain the feature vectors.
Transfer learning is defined as a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. In this case, our second task is using the vectors as features for our main model. ResNet-34 is a deep residual network trained on ImageNet data. ImageNet is a large visual database designed for use in visual object recognition research: more than 14 million images have been hand-annotated by the project to indicate what objects are pictured, and in at least one million of the images, bounding boxes are also provided.
Similar to text, on top of obtaining the feature vectors, we also added more features by calculating the “distance measure” for each feature vector pair.
After extracting the features from image + text data, we threw them into a LightGBM model. LightGBM is a gradient boosting framework that uses tree-based learning algorithms, and is used for ranking, classification and other machine learning tasks.
We were aware that the amount of training data is only ~30% of the test data and the test data is only published 3 hours before the competition ends.
We needed a model which generalises well to the given test data, and a reliable validation set as a scoring proxy before the test data was released. Hence, we employed the following strategy:
- We split the training data randomly into 70% for training and 30% for validation. We chose the model which has the best score in the validation set
- To tune our LightGBM model hyper-parameters, we used Randomised Search Cross Validation. It implements a randomised search over parameters, where each setting is sampled from a distribution over possible parameter values. Cross validation is a suitable technique (unlike a regular train-test split) for our scenario, since we have much less training data compared to test data, and CV allows us to evaluate our model hyper-parameters using the entire training data
- After obtaining the best scoring model on validation dataset, we re-trained the model with the same features and best hyper-parameters using the entire training dataset. This final model is then used to predict the test dataset and submitted to Kaggle
We learnt a lot but also had a lot of fun during this competition. Fun fact: we were still tuning our model hyper-parameters 15 minutes before the competition ended. We hope this post can be a useful reference for anyone who wants to implement similar projects in the future. So, until next time!