Basic Information Retrieval Problem — Boolean Retrieval Model

Himanshu Bajpai
2 min read · May 13, 2021


Consider the sentences below:
1. I am a cow.
2. Cow is what I am.
3. Today is Tuesday.

Now, suppose I ask you: can you tell which sentences contain the term ‘cow’ but not ‘Tuesday’?
As humans, it is easy for us to say that the answer is sentence 1 and sentence 2.
But how do we model this problem mathematically so that a machine can solve it?

Term-Document Incidence Matrix

The term-document incidence matrix is one of the basic techniques for representing text data. To build it:
> Collect the unique words (terms) across all the documents.
> For each term and each document, fill the cell with 1 if the term occurs in the document, and 0 otherwise.

For the sentences in our problem statement, the term-document incidence matrix looks like this:

Term-Document Incidence Matrix for the sentences — 1, 2 and 3.

Note: Words are normalized, i.e. the same word (regardless of case) is counted only once, so each term gets exactly one row in the matrix no matter how many documents/sentences contain it.
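To make the construction concrete, here is a minimal Python sketch that builds the matrix for the three sentences. The helper names `tokenize` and `build_incidence_matrix` are made up for this illustration, and lowercasing plus keeping only alphabetic words is an assumed normalization, not necessarily the one used in the original notebook.

```python
import re

documents = [
    "I am a cow.",
    "Cow is what I am.",
    "Today is Tuesday.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic words (assumed normalization)."""
    return re.findall(r"[a-z]+", text.lower())

def build_incidence_matrix(docs):
    """Map each unique term to a 0/1 vector with one entry per document."""
    tokenized = [set(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*tokenized))
    return {
        term: [1 if term in doc_terms else 0 for doc_terms in tokenized]
        for term in vocabulary
    }

matrix = build_incidence_matrix(documents)
for term, row in matrix.items():
    print(f"{term:<8} {row}")
```

With these sentences the vocabulary has eight terms; the row for cow is [1, 1, 0] and the row for tuesday is [0, 0, 1].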

Boolean Retrieval Model

The Boolean retrieval model is one application of this matrix. It can answer any query that is written as a Boolean expression of terms, that is, a query in which terms are combined with the operators AND, OR, and NOT.
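Each of the three operators maps to a simple element-wise operation on the 0/1 rows of the matrix. The sketch below shows this on the rows for ‘cow’ and ‘tuesday’ from the three sentences above; the function names `bool_and`, `bool_or` and `bool_not` are hypothetical and chosen only for this example.

```python
def bool_and(u, v):
    """1 where both vectors have a 1 (document contains both terms)."""
    return [a & b for a, b in zip(u, v)]

def bool_or(u, v):
    """1 where at least one vector has a 1 (document contains either term)."""
    return [a | b for a, b in zip(u, v)]

def bool_not(u):
    """Flip every bit (document does not contain the term)."""
    return [1 - a for a in u]

# Rows of the incidence matrix for sentences 1, 2 and 3.
cow = [1, 1, 0]
tuesday = [0, 0, 1]

print(bool_or(cow, tuesday))   # cow OR tuesday
print(bool_and(cow, tuesday))  # cow AND tuesday
print(bool_not(tuesday))       # NOT tuesday
```

Because every operator returns a vector of the same shape, the results can be nested to evaluate larger expressions.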

For our query, i.e. get the sentences which contain the term ‘cow’ but not ‘tuesday’ (cow AND NOT tuesday):
> Get the vector for each term in the query, which is simply the row for that term in the term-document incidence matrix. For example, the vector for ‘cow’ is [1,1,0].
> Complement the vector of every term that is negated with NOT.
> Perform a bitwise AND between the resulting vectors.

Let’s apply the algorithm and see if we get the right answer.

  1. Cow vector = [1,1,0]
  2. Tuesday vector = [0,0,1]
  3. NOT Tuesday vector = [1,1,0]. The NOT vector is obtained by taking the complement of the original vector, that is, flipping every bit.

Perform the bitwise AND operation:
[1,1,0] & [1,1,0] => [1,1,0]

Inference from the result:
In the result of the bitwise AND operation, each position that holds a 1 corresponds to a sentence that satisfies the input query. Here the first and second positions hold a 1, so sentences 1 and 2 contain the word ‘cow’ but not ‘tuesday’ and are returned as the result of the query.
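The whole walk-through can be reproduced in a few lines of Python. This is only a sketch under the same assumptions as the worked example: terms are matched case-insensitively, and the helpers `term_vector`, `complement` and `bitwise_and` are illustrative names, not code from the original notebook.

```python
import re

documents = ["I am a cow.", "Cow is what I am.", "Today is Tuesday."]

def term_vector(term, docs):
    """Return the 0/1 incidence vector of `term` over `docs` (case-insensitive)."""
    term = term.lower()
    return [1 if term in re.findall(r"[a-z]+", doc.lower()) else 0 for doc in docs]

def complement(vector):
    """Flip every bit to get the NOT vector."""
    return [1 - bit for bit in vector]

def bitwise_and(u, v):
    """Element-wise AND of two 0/1 vectors."""
    return [a & b for a, b in zip(u, v)]

# Query: cow AND NOT tuesday
cow = term_vector("cow", documents)
not_tuesday = complement(term_vector("Tuesday", documents))
result = bitwise_and(cow, not_tuesday)

# Convert positions holding a 1 into 1-based sentence numbers.
matching = [i + 1 for i, bit in enumerate(result) if bit == 1]
print(result)
print(matching)
```

The final list comprehension is the "inference" step: it turns the result vector back into the sentence numbers that satisfy the query.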

Conclusion

The term-document incidence matrix is one of the basic mathematical models for representing text, and it can be used to answer Boolean expression queries through the Boolean retrieval model. Below are the key points to consider:

  1. It can answer any query that is a Boolean expression of terms.
  2. It views each document as a set of terms, so word order and word frequency are ignored.
  3. It gives good precision, since a document is retrieved only if it matches the query condition exactly.

The accompanying notebook is available on GitHub.

Thank you for reading this article. Gracias!
