Meor Amer

Jun 21, 2022

Article Recommender with Text Embedding, Classification, and Extraction

A simple demonstration of how we can stack multiple NLP models together to get an output as close as possible to our desired outcome.


Embeddings can capture the meaning of a piece of text beyond keyword-matching. In this article, we will build a simple news article recommender system that computes the embeddings of all available articles and recommends the most relevant articles based on embedding similarity.

We will also make the recommendation tighter by using text classification to recommend only articles within the same category. We will then extract a list of tags from each recommended article, which can further help readers discover new articles.

All this will be done via three Cohere API endpoints stacked together: Embed, Classify, and Generate.

We will implement the following steps:

1. Find the most similar articles to the one currently being read using embeddings.
2. Keep only articles of the same category using text classification.
3. Extract tags from these articles.
4. Show the top 5 recommended articles.

View the complete notebook here.

1. Find the most similar articles to the one currently being read using embeddings

Find the most similar articles using embeddings

Throughout this article, we'll use the BBC news article dataset as an example [Source]. This dataset consists of articles from a few categories: business, politics, tech, entertainment, and sport. Here are some example articles:

1.2 Turn articles into embeddings

The first thing we need to do is to turn each article's text into embeddings. An embedding is a list of numbers that our models use to represent a piece of text, capturing its context and meaning. We do this by calling Cohere’s Embed endpoint, which takes in texts as input and returns embeddings as output.

articles = df_inputs['Text'].tolist()

output = co.embed(
            model ='large',
            texts = articles)
embeds = output.embeddings

1.3 Pick one article and find the most similar articles

Next, we pick any one article to be the one the reader is currently reading (let's call this the target) and find other articles with the most similar embeddings (let's call these candidates) using cosine similarity.

Cosine similarity is a metric that measures how similar two sequences of numbers are (embeddings in our case), and we compute it for each target-candidate pair.
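To make the metric concrete, here is a minimal sketch of cosine similarity written in plain Python. The three-dimensional vectors are toy values chosen for illustration, not real embeddings, and the helper name `cosine` is our own.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (illustrative values, not real model output)
target = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # points the same way, only longer
orthogonal = [0.0, 1.0, 0.0]       # shares no component with the target

print(cosine(target, same_direction))  # very close to 1: same direction
print(cosine(target, orthogonal))      # 0: nothing in common
```

The score depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing embeddings.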

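import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(target, candidates):
  # cosine_similarity expects 2D inputs: (1, dim) for the target, (n, dim) for the candidates
  target = np.asarray(target).reshape(1, -1)
  candidates = np.asarray(candidates)

  # Calculate cosine similarity and flatten the (1, n) result into a list of n scores
  similarity_scores = cosine_similarity(target, candidates)
  similarity_scores = np.squeeze(similarity_scores, axis=0).tolist()

  # Sort by descending order in similarity, keeping each candidate's index
  similarity_scores = list(enumerate(similarity_scores))
  similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

  # Return (index, score) pairs
  return similarity_scores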

Using Article ID 70 as an example target article, here’s what we get:

Target:

[ID 70] aragones angered by racism fine spain coach luis aragones is furious after being fined by the spanis ...

Candidates:

1
ferguson urges henry punishment sir alex ferguson has called on the football association to punish a ...

2
benitez delight after crucial win liverpool manager rafael benitez admitted victory against deportiv ...

3
mourinho defiant on chelsea form chelsea boss jose mourinho has insisted that sir alex ferguson and ...

4
boris opposes mayor apology ken livingstone should stick to his guns and not apologise for his na ...

5
wenger signs new deal arsenal manager arsene wenger has signed a new contract to stay at the club un ...

2. Keep only articles of the same category using text classification

In the example above (Article ID 70 as the target), we see that most of the top 5 articles returned by the system are very relevant. The target is a football/soccer article, and the system duly recommended very similar articles, even though the dataset also contains articles from other sports like tennis and rugby.

However, one of them is not. The fourth recommended article is a politics article, not a sports one. Reading the text, the likely reason is that the target is about a clash between individuals (anger about a racism fine), which is also what the politics article is about (disagreement over an apology). The two articles are similar in meaning in this respect, and the embeddings capture that.

Perhaps we can enhance the system by only recommending articles of the same category. For this, let's build a news category classifier.

2.1 Build a classifier

We use Cohere’s Classify endpoint to build a news category classifier, classifying articles into five classes: Business, Politics, Tech, Entertainment, and Sport.

A typical text classification model requires hundreds or thousands of data points to train, but with this endpoint, we can build a classifier with as few as five examples per class.

To build the classifier, we need a set of examples consisting of text (news text) and labels (news category). The BBC News dataset happens to have both (columns 'Text' and 'Category'), so this time we’ll use the categories for building our examples.

To build the classifier, we will use the Text and Category columns

The Classify endpoint needs a minimum of 5 examples for each category, which we will sample randomly from the dataset. We have 5 categories, so we will have a total of 25 examples.
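The sampling step can be sketched as follows. This is a plain-Python illustration under our own assumptions: `sample_examples` is a hypothetical helper, and the rows are placeholders standing in for the dataset's Text and Category columns. How the sampled pairs are wrapped into the example format the SDK expects depends on the SDK version and is not shown here.

```python
import random
from collections import defaultdict

def sample_examples(rows, n_per_class=5, seed=0):
    """Pick n_per_class random (text, label) pairs for every label in rows."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    examples = []
    for label in sorted(by_label):
        # rng.sample raises ValueError if a category has fewer than n_per_class rows
        examples.extend(rng.sample(by_label[label], n_per_class))
    return examples

# Placeholder rows standing in for the dataset's Text and Category columns
categories = ["business", "politics", "tech", "entertainment", "sport"]
rows = [(f"{c} article {i}", c) for c in categories for i in range(20)]

examples = sample_examples(rows)
print(len(examples))  # 5 categories x 5 examples = 25
```

Articles used as examples should be kept out of any later test set so that the evaluation stays fair.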

# Get classifications via the Classify endpoint
def classify_text(text,examples):
  classifications = co.classify(
    model='medium',
    taskDescription='',
    outputIndicator='',
    inputs=[text],
    examples=examples
    )
  return classifications.classifications[0].prediction

2.2 Measure its performance

Before actually using the classifier, let's first test its performance. Here we take another 100 data points as the test dataset, and the classifier predicts the class (i.e., the news category) of each one.

from sklearn.metrics import accuracy_score

# Predicted classes
predictions = df_test['Text'].apply(classify_text, args=(examples,)).tolist()
 
# Actual classes
actual = df_test['Category'].tolist()
 
# Compute metrics on the test dataset
accuracy = accuracy_score(actual, predictions)
We get a good accuracy score of 91% (more details in the notebook), so the classifier is ready to be implemented in our recommender system.
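A single accuracy number can hide a weak category, so it can help to break the score down per class. The sketch below is a plain-Python illustration; `accuracy_by_category` is a hypothetical helper and the labels are toy values, not results from this article.

```python
from collections import defaultdict

def accuracy_by_category(actual, predictions):
    """Return overall accuracy and a per-category accuracy dict."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(actual, predictions):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    overall = sum(correct.values()) / len(actual)
    per_category = {label: correct[label] / total[label] for label in total}
    return overall, per_category

# Toy labels for illustration only (not results from the article)
actual_demo = ["sport", "sport", "tech", "politics"]
predicted_demo = ["sport", "politics", "tech", "politics"]

overall, per_category = accuracy_by_category(actual_demo, predicted_demo)
print(overall)       # 0.75
print(per_category)  # {'sport': 0.5, 'tech': 1.0, 'politics': 1.0}
```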

3. Extract tags from these articles

We now proceed to the tag extraction step. Unlike the previous two steps, this step is not about sorting or filtering articles, but rather about enriching them with more information.

We do this by prompting Cohere’s Generate endpoint with a few examples of text and its tags. We then feed in the articles from the classifier step, and the endpoint generates the corresponding tags.

There is more than one way to construct the prompt, depending on what you'd like to extract. In my case, the tags I'd like to extract are primarily the names of a person, company, or organization, and perhaps also some generic keywords. That was the idea behind the example tags I put in the prompt, which you can see on the Cohere Playground screenshot below:

Playground screenshot of the tag extraction prompt
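For readers who cannot see the screenshot, here is one way such a few-shot prompt could be assembled in code. The layout (`Text:` and `Tags:` labels, examples separated by `--`) and the example texts are our assumptions for illustration; the exact prompt used in this article is the one in the screenshot. The `--` separator is chosen to match the stop sequence passed to the endpoint.

```python
def build_prompt(examples, article_text, separator="--"):
    """Join (text, tags) examples and a new article into one few-shot prompt."""
    blocks = [f"Text: {text}\nTags: {', '.join(tags)}" for text, tags in examples]
    # The final block ends at "Tags:" so the model continues with the tags
    blocks.append(f"Text: {article_text}\nTags:")
    return f"\n{separator}\n".join(blocks)

# Hypothetical examples; the article's actual prompt appears only in the screenshot
few_shot = [
    ("acme corp names jane doe as chief executive ...", ["acme corp", "jane doe"]),
    ("city council approves new stadium plan ...", ["city council", "stadium"]),
]

complete_prompt = build_prompt(few_shot, "wenger signs new deal arsenal manager ...")
print(complete_prompt)
```

Because every example ends with the separator, the model tends to emit `--` after its tags, which is what lets a stop sequence cut the generation cleanly.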

We call the endpoint by specifying a few settings, and it will generate the corresponding extractions.

# Get extractions via the Generate endpoint
def extract_tags(complete_prompt):
  prediction = co.generate(
    model='xlarge',
    prompt=complete_prompt,
    max_tokens=30,
    temperature=0.3,
    k=0,
    p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop_sequences=["--"],
    return_likelihoods='NONE')
  return prediction.generations[0].text
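The endpoint returns raw text, so a small post-processing step is useful before displaying the tags. The sketch below assumes the generation is a comma-separated list, as in the outputs shown later in this article; `parse_tags` is a hypothetical helper, not part of the SDK.

```python
def parse_tags(generated_text, stop_sequence="--"):
    """Turn raw generated text into a clean list of unique, lowercase tags."""
    # Drop the stop sequence and anything after it, if the model emitted it
    text = generated_text.split(stop_sequence)[0]
    tags = []
    for raw in text.split(","):
        tag = raw.strip().lower()
        if tag and tag not in tags:  # skip empty strings and duplicates
            tags.append(tag)
    return tags

print(parse_tags(" arsenal, thierry henry, football association, alex ferguson\n--"))
# ['arsenal', 'thierry henry', 'football association', 'alex ferguson']
```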

4. Show the Top 5 recommended articles

Putting everything together to recommend the Top 5 articles

Let's now put everything together for our article recommender system.

First, we select the target article and compute the similarity scores against the candidate articles. Next, we filter the articles via classification. Finally, we extract the keywords from each article and show the recommendations.
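The flow described above can be sketched as a single function. This is a minimal illustration under our own assumptions, not the notebook's implementation: `recommend` is a hypothetical name, and the classifier and tag extractor are passed in as plain callables so that the sketch runs with stubs instead of API calls.

```python
def recommend(target_id, ranked, texts, classify, extract_tags, top_n=5):
    """Filter ranked candidates by the target's category and attach tags.

    ranked: list of (article_index, similarity_score), most similar first.
    classify / extract_tags: callables standing in for the API-backed helpers.
    """
    target_category = classify(texts[target_id])
    recommendations = []
    for idx, score in ranked:
        if idx == target_id:
            continue  # never recommend the article being read
        if classify(texts[idx]) != target_category:
            continue  # keep only articles in the same category
        recommendations.append((idx, score, extract_tags(texts[idx])))
        if len(recommendations) == top_n:
            break
    return recommendations

# Stub data and stub models so the sketch runs without any API calls
texts = ["football fine", "football win", "mayor apology", "football deal"]
ranked = [(0, 1.0), (1, 0.9), (2, 0.8), (3, 0.7)]
stub_classify = lambda t: "sport" if "football" in t else "politics"
stub_tags = lambda t: t.split()

result = recommend(0, ranked, texts, stub_classify, stub_tags, top_n=2)
for idx, score, tags in result:
    print(idx, score, tags)
```

Walking the ranked list in order and stopping once `top_n` items pass the filter means tags are only extracted for articles that will actually be shown, which keeps the number of generation calls small.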

Sticking with Article ID 70 as the example target article, here’s what we get:

You are reading...

Article: [ID 70] aragones angered by racism fine spain coach luis aragones is furious after being fined by the spanish football federation for his comments about thierry henry. the 66-year-old criticised his 3000 euros (£2 060) punishment even though it was far below the maximum penalty. i am not guilty nor do i ...

You might also like...

1
Article: ferguson urges henry punishment sir alex ferguson has called on the football association to punish arsenal s thierry henry for an incident involving gabriel heinze. ferguson believes henry deliberately caught heinze on the head with his knee during united s controversial win. the united boss said i...

Tags: arsenal, thierry henry, football association, alex ferguson

2
Article: benitez delight after crucial win liverpool manager rafael benitez admitted victory against deportivo la coruna was vital in their tight champions league group. jorge andrade s early own goal gave liverpool a 1-0 win. and benitez said: we started at a very high tempo and had many chances. it is a ...

Tags: liverpool, deportivo la coruna, rafael benitez

3
Article: mourinho defiant on chelsea form chelsea boss jose mourinho has insisted that sir alex ferguson and arsene wenger would swap places with him. mourinho s side were knocked out of the fa cup by newcastle last sunday before seeing barcelona secure a 2-1 champions league first-leg lead in the nou camp....

Tags: chelsea, sir alex ferguson, arsene wenger, mourinho defiant

4
Article: wenger signs new deal arsenal manager arsene wenger has signed a new contract to stay at the club until may 2008. wenger has ended speculation about his future by agreeing a long-term contract that takes him beyond the opening of arsenal s new stadium in two years. he said: signing a new contract ...

Tags: arsenal, arsene wenger, arsenal manager

5
Article: premier league planning cole date the premier league is attempting to find a mutually convenient date to investigate allegations chelsea made an illegal approach for ashley cole. both chelsea and arsenal will be asked to give evidence to a premier league commission but no deadline has been put on ...

Tags: chelsea, arsenal, premier league, cole date

Here we see how the classification and extraction steps have improved our recommendations.

First, the politics article is no longer recommended. Second, each recommended article now comes with a list of generated tags.

Let’s try a couple of other articles in business and tech and see the output.

Business (returning recommendations around German economy and global economic growth/slump):

You are reading...

Article: german business confidence slides german business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy. munich-based research institute ifo said that its confidence index fell to 95.5 in february from 97.5 in january its first decline in three months. the stu...

You might also like...

1
Article: german growth goes into reverse germany s economy shrank 0.2% in the last three months of 2004 upsetting hopes of a sustained recovery. the figures confounded hopes of a 0.2% expansion in the fourth quarter in europe s biggest economy. the federal statistics office said growth for the whole of 200...

Tags: germany, german growth, german economy

2
Article: car giant hit by mercedes slump a slump in profitability at luxury car maker mercedes has prompted a big drop in profits at parent daimlerchrysler. the german-us carmaker saw fourth quarter operating profits fall to 785m euros ($1bn) from 2.4bn euros in 2003. mercedes-benz s woes - its profits slid...

Tags: daimlerchrysler, mercedes, mercedes-benz

3
Article: bbc poll indicates economic gloom citizens in a majority of nations surveyed in a bbc world service poll believe the world economy is worsening. most respondents also said their national economy was getting worse. but when asked about their own family s financial outlook a majority in 14 countries...

Tags: world economy, bbc world service, economic gloom

4
Article: china continues rapid growth china s economy has expanded by a breakneck 9.5% during 2004 faster than predicted and well above 2003 s 9.1%. the news may mean more limits on investment and lending as beijing tries to take the economy off the boil. china has sucked in raw materials and energy to fee...

Tags: china, china s economy, rapid growth

5
Article: bank set to leave rates on hold uk interest rates are set to remain on hold at 4.75% following the latest meeting of the bank of england. the bank s rate-setting committee has put up rates five times in the past year but rates have been on hold since september amid signs of a slowdown. economic gro...

Tags: bank, interest rates, uk

Tech (returning recommendations around consumer devices):

You are reading...

Article: camera phones are must-haves four times more mobiles with cameras in them will be sold in europe by the end of 2004 than last year says a report from analysts gartner. globally the number sold will reach 159 million an increase of 104%. the report predicts that nearly 70% of all mobile phones ...

You might also like...

1
Article: lifestyle governs mobile choice faster better or funkier hardware alone is not going to help phone firms sell more handsets research suggests. instead phone firms keen to get more out of their customers should not just be pushing the technology for its own sake. consumers are far more interest...

Tags: lifestyle, mobile choice

2
Article: moving mobile improves golf swing a mobile phone that recognises and responds to movements has been launched in japan. the motion-sensitive phone - officially titled the v603sh - was developed by sharp and launched by vodafone s japanese division. devised mainly for mobile gaming users can also ac...

Tags: mobile phone, golf swing, japan

3
Article: gates opens biggest gadget fair bill gates has opened the consumer electronics show (ces) in las vegas saying that gadgets are working together more to help people manage multimedia content around the home and on the move. mr gates made no announcement about the next generation xbox games console ...

Tags: bill gates, consumer electronics show, las vegas, gadget

4
Article: broadband fuels online change fast web access is encouraging more people to express themselves online research suggests. a quarter of broadband users in britain regularly upload content and have personal sites according to a report by uk think-tank demos. it said that having an always-on fast co...

Tags: broadband, online, change

5
Article: china ripe for media explosion asia is set to drive global media growth to 2008 and beyond with china and india filling the two top spots analysts have predicted. japan south korea and singapore will also be strong players but china s demographics give it the edge a media conference in londo...

Tags: china, media explosion, india, japan, south korea, singapore

In conclusion, this example shows how we can stack multiple NLP endpoints together to get an output much closer to our desired outcome.

In practice, hosting and maintaining multiple models can quickly turn into a complex activity. By leveraging Cohere endpoints, each of these tasks is reduced to a simple API call.