Ultra-Fast Sentiment Analysis with Tiny Embeddings

At Narnium, we tackle hard problems with the clever use of small, efficient, domain-specific AI models. A question we recently encountered involved classifying natural language product reviews into positive and negative categories.

We were curious to see how some publicly available, pre-trained small language models would fare on this task. So we set out to perform a fair, general, systematic, multi-dimensional evaluation of the most popular open-weight models, and compare them with respect to accuracy, training and inference speed, and on-disk size. We wanted to share with you our most important discoveries, as well as some actionable insights you can readily apply to your next machine learning project.

TL;DR

We achieved a 93.5% accuracy on the Amazon Reviews dataset using the best neural model. Meanwhile, our best static model, which can process text at 1.3 MB/s, still achieved an accuracy of 86%.

The best part is, you can do this too! We recommend using one of the following setups:

These are quite good results, especially in light of the simplicity of the approach and the small size of the language models involved. Read on for the full story!

The Approach

We used a simple two-step transfer learning method:

  1. Generate vector embeddings for the reviews, using a sentence embedding model.
  2. Train a simple downstream classifier on the resulting embeddings.
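
For illustration, the two steps boil down to something like the following Python sketch. This is not the pipeline we actually ran (our embedding step used Rust runtimes, as described below); the sentence-transformers package, the model name, and the toy data here are just stand-ins:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Toy reviews and binary sentiment labels (1 = positive, 0 = negative).
texts = ['Great box, arrived right on time!', 'Terrible value, cancelled after a month.']
labels = [1, 0]

# Step 1: generate a fixed-size vector embedding for each review.
embedder = SentenceTransformer('intfloat/multilingual-e5-small')
X = embedder.encode(texts)

# Step 2: train a simple downstream classifier on the embeddings.
clf = LogisticRegression(max_iter=1024).fit(X, labels)
print(clf.predict(X))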

We performed 10-fold cross-validation over the data: it was split into 10 disjoint, random subsets, and in each iteration ("fold") a different subset was held out for testing while the rest was used for training. This allowed us to accurately assess the predictive power of each individual model while, once the cross-validation process was complete, having used all of the data for both training and testing.

All metrics reported in this analysis were computed by averaging the test scores of the 10 folds. The cross-validation was grouped (constrained) so that reviews by any specific user never appeared simultaneously in both the training and the test set; this ensured that models couldn't simply learn individual users' style.
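
As a rough sketch (not our exact code), this grouped 10-fold cross-validation can be set up with scikit-learn along the following lines; the placeholder data and the use of stratification are our assumptions:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedGroupKFold, cross_validate

# Placeholder embeddings, labels, and reviewer IDs (stand-ins for the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
y = rng.integers(0, 2, size=1000)
user_ids = rng.integers(0, 200, size=1000)

# 10 disjoint folds; all reviews by the same user land in the same fold.
cv = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=133742)
scores = cross_validate(
    LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto'),
    X, y,
    groups=user_ids,
    cv=cv,
    scoring=['balanced_accuracy', 'matthews_corrcoef'],
)
print(scores['test_balanced_accuracy'].mean(), scores['fit_time'].mean())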

The following figure summarizes the setup:

Throughput and training/inference performance were measured on a MacBook Pro with an M3 Max CPU (10 performance + 4 efficiency cores) and 36 GB of RAM.

The Models

First, let's talk embeddings. We used some of the most popular small, publicly available text embedding models from the Hugging Face Hub:

FastText and the Apple NL framework do not require a separate runtime. For the Model2Vec static models, we used our own, improved implementation, while for the neural models, we used the Fastembed runtime, both written in Rust.

The full list of text embedding models, along with a short explanation of each, can be found in this table.

As for the classification models, we used the simplest, traditional classifiers from sklearn, initialized with the following sets of parameters:

from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV

models = [
    LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto'),
    QuadraticDiscriminantAnalysis(reg_param=0.001),
    LogisticRegressionCV(max_iter=1024, n_jobs=-1, random_state=133742),
    RandomForestClassifier(n_jobs=-1, random_state=133742),
    HistGradientBoostingClassifier(random_state=133742),
]

The Data

We used the excellent Amazon Reviews 2023 dataset in our experiments. This is a large, cleaned, labeled, open-access dataset; it contains anywhere from tens of thousands to millions of reviews (title, text body, and a 1-5 star integer rating) per category of products sold on Amazon.

To simplify our task, we grabbed the Subscription Boxes category and restricted it to unambiguously positive (4 or 5 stars) and negative (1 or 2 stars) reviews; neutral, 3-star reviews were discarded. This left us with a grand total of 16,216 reviews, about 75% of which were positive.
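
For reference, the data preparation boils down to roughly the following. The dataset and configuration names ('McAuley-Lab/Amazon-Reviews-2023', 'raw_review_Subscription_Boxes') and the 'rating' field reflect our understanding of how the dataset is published on the Hugging Face Hub, and should be treated as assumptions rather than an excerpt of our actual code:

from datasets import load_dataset

# Load the raw Subscription Boxes reviews (dataset/config names assumed here).
ds = load_dataset(
    'McAuley-Lab/Amazon-Reviews-2023',
    'raw_review_Subscription_Boxes',
    split='full',
    trust_remote_code=True,
)

# Drop neutral, 3-star reviews and binarize the rating: 1 = positive, 0 = negative.
ds = ds.filter(lambda r: r['rating'] != 3.0)
ds = ds.map(lambda r: {'label': int(r['rating'] >= 4.0)})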

As an example, a subset of the final, cleaned data frame is available here.

Raw Results

First of all, let's have a quick glance at the best results. The best value(s) of each column is/are in bold:

| embedding | estimator | Bal. Acc. | MCC | fit_time (s) | score_time (s) | embedding_throughput (B/s) |
|---|---|---|---|---|---|---|
| fastembedMultilingualE5Small | lda | 0.918 | 0.833 | **0.165** | **0.004** | 5861 |
| fastembedMultilingualE5Small | logistic_regression | 0.921 | 0.844 | 2.685 | **0.004** | 5861 |
| fastembedNomicEmbedTextV15Q | lda | 0.933 | 0.865 | 0.506 | **0.004** | 2106 |
| fastembedNomicEmbedTextV15Q | logistic_regression | **0.935** | **0.867** | 6.261 | 0.008 | 2106 |
| m2vNomicEmbedTextV15 | lda | 0.859 | 0.737 | 0.550 | **0.004** | 1319162 |
| m2vNomicEmbedTextV15 | logistic_regression | 0.870 | 0.750 | 8.119 | 0.008 | 1319162 |
| m2vPotionRetrieval32M | lda | 0.860 | 0.736 | 0.369 | 0.005 | **1748334** |
| m2vPotionRetrieval32M | logistic_regression | 0.865 | 0.743 | 4.851 | 0.008 | **1748334** |

The legend for the performance metrics is as follows:

  - Bal. Acc.: balanced accuracy, the mean of the per-class recall values.
  - MCC: Matthews correlation coefficient.
  - fit_time: time needed to train the downstream classifier, in seconds.
  - score_time: time needed to score the trained classifier on the test fold, in seconds.
  - embedding_throughput: embedding speed, in bytes of input text processed per second.

If you want to dive deep into the numbers, the complete set of timing and accuracy metrics can be found in these tables. In these tables, all sheets except the last one ("All Metrics") are sorted from best to worst. (The last sheet, containing all metrics, is sorted alphabetically by embedding and classifier.)

Some of the sheets are named according to the scheme <metric> by embedding, where <metric> is the name of an individual performance (accuracy or speed) metric. These sheets present the best values (i.e., the minimum or the maximum, as appropriate for the given metric) aggregated over the classifiers, so that for each embedding model, the best score it achieved with any classifier is displayed.

In contrast, every other sheet is named using the pattern <metric> by estimator. These sheets aggregate the best value over the embeddings in the same way, so you can use them to assess the performance of each individual downstream classifier, paired with the embedding it performed best with.

The sheet named All Metrics contains all computed performance scores, for all combinations of embedding and classifier models, without any max- or min-aggregation.
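
To make the aggregation concrete, here is a small, hypothetical pandas snippet that mirrors the <metric> by embedding logic: for each embedding, it keeps the best value achieved by any classifier (maximum for accuracy metrics, minimum for timings). The column names and shortened embedding labels are ours, not the spreadsheet's; the numbers are copied from the results table above.

import pandas as pd

# A few rows in the spirit of the results table above.
results = pd.DataFrame({
    'embedding': ['e5_small', 'e5_small', 'nomic_v15', 'nomic_v15'],
    'estimator': ['lda', 'logistic_regression', 'lda', 'logistic_regression'],
    'mcc':       [0.833, 0.844, 0.865, 0.867],
    'fit_time':  [0.165, 2.685, 0.506, 6.261],
})

# Best value per embedding: maximum for MCC, minimum for training time.
best_by_embedding = results.groupby('embedding').agg(
    best_mcc=('mcc', 'max'),
    best_fit_time=('fit_time', 'min'),
)
print(best_by_embedding)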

The throughput of the embedding models can be found in this table. Both the median and the mean throughput are presented in units of B/s (bytes per second). The median values are likely a better representation of embedding speed, because the averages may include model loading and warm-up time.

Main Take-aways

The Nomic Embed Text v1.5 embedding model is the most accurate contestant, both in its original (neural) form and among the significantly faster, distilled, static Model2Vec models. Unfortunately, in its non-distilled neural form, it is also the slowest model; Multilingual-E5-small is almost 3x faster, in exchange for a slight loss of accuracy of about 2.5 percentage points. Curiously, the Model2Vec distillation of Nomic Embed Text v1.5 is also among the smallest models, at a mere 91 MB. The only smaller model is minishlab/M2V_base_output, with its tiny, 30 MB file.

Among the static models, Potion Retrieval 32M and static-similarity-mrl-multilingual-v1 are decent alternatives, too. They are worth trying for other tasks (such as clustering and semantic similarity search). Do note, however, that static models are still somewhat less accurate than transformers.

The Apple Natural Language framework didn't perform particularly well on either front. Both its accuracy and its embedding throughput lag behind the competition: in terms of accuracy (MCC, F1 score, balanced accuracy, pseudo-R2), it can't beat the top one to three static models (depending on the metric), while being significantly slower.

As for the downstream classifier, there seems to be no reason to use anything beyond simple, linear models. Logistic regression wins on accuracy most of the time, closely followed by LDA, but LDA is much (~20x) faster to train, thanks to the lack of hyperparameter tuning.

Gradient boosting usually exhibits similar or slightly lower accuracy; however, it is an order of magnitude slower to train and evaluate. Meanwhile, QDA somewhat overfits in several cases, while random forest catastrophically overfits most of the time and is also the slowest to train.

In the above experiments, QDA was initialized with a constant regularization coefficient of 0.001, which is on the same order of magnitude as the typical Ledoit-Wolf coefficient, when computed on all embeddings of the whole dataset. Apart from this eyeballed constant value, we also tried more sophisticated approaches (e.g., average of class-wise Ledoit-Wolf shrinkage coefficients weighted by class priors). Interestingly, these more complicated approaches didn't meaningfully affect the results: sometimes, they slightly increased performance, while in some other cases, they slightly decreased the overall predictive power, so in the end, we decided to keep the constant shrinkage.
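
As a concrete sketch of that prior-weighted, class-wise variant (an illustration of the idea with placeholder data, not our exact code):

import numpy as np
from sklearn.covariance import LedoitWolf
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Placeholder embeddings and binary labels (stand-ins for the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
y = rng.integers(0, 2, size=500)

# One Ledoit-Wolf shrinkage estimate per class, averaged using the class priors.
classes, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()
shrinkages = [LedoitWolf().fit(X[y == c]).shrinkage_ for c in classes]
reg = float(np.average(shrinkages, weights=priors))

qda = QuadraticDiscriminantAnalysis(reg_param=reg).fit(X, y)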

Caveats and Limitations

We are aware of at least the following notable potential issues with our study: