Saturday, May 18, 2013

Using Term Document Matrix

Problem Statement:

Let's say you have a number of documents. Each document contains a number of words (we refer to them as terms). The problem is to identify the two documents that are most similar.

Given below is one possible approach.

Use of Term Document Matrix:

Each row of a Term Document Matrix (let's name it D) is a document vector, with one column for every term in the entire corpus of documents. The matrix is sparse because most documents contain only a small fraction of the terms in the corpus. The value in each cell of the matrix is the term frequency, i.e. the number of times that term occurs in that document.

docid      term1      term2     term3     term4                                term n
d1            2             1            0              0                                    3
d2            0             2            3              4                                    1
d3            1             0            4              2                                    0
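
As a rough illustration of how such a matrix could be built, here is a minimal sketch in Python. It assumes that terms are lowercase, whitespace-separated words; the function name build_term_document_matrix and the three sample documents are made up for this example and are not part of the original problem.

```python
from collections import Counter

def build_term_document_matrix(docs):
    """Return (vocab, D), where D[i][j] is how often vocab[j] occurs in docs[i]."""
    # Assumption: a "term" is a lowercase, whitespace-separated word.
    counts = [Counter(text.lower().split()) for text in docs]
    # One column per distinct term in the whole corpus, in sorted order.
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for a missing term, which gives the sparse zero cells.
    D = [[c[term] for term in vocab] for c in counts]
    return vocab, D

# Hypothetical sample corpus, for illustration only.
docs = ["apple banana apple", "banana cherry", "apple cherry cherry"]
vocab, D = build_term_document_matrix(docs)
print(vocab)
for row in D:
    print(row)
```

A real corpus would usually need more careful tokenization, and a sparse data structure rather than nested lists.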

The transpose of the same Term Document Matrix (DT) will look as follows

docid      d1     d2        d3  .............. dn
term1       2      0          1 
term2       1      2          0
term3       0      3          4

We can create a Similarity Matrix (S) by multiplying D by DT, i.e. S = D * DT.
The structure of the Similarity Matrix will be as follows

docid   d1       d2        d3 ............. dn
d1       x11     x12      x13             x1n
d2       x21     x22      x23             x2n
d3       x31     x32      x33             x3n
dn        xn1    xn2      xn3               xnn

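where xmn = the sum, over all terms, of (frequency of the term in dm) * (frequency of the term in dn). In other words, xmn is the dot product of the document vectors of dm and dn.
Intuitively, the higher the value of xmn, the more similar the documents dm and dn are.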
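
The multiplication can be sketched in plain Python as follows. This is only an illustration: it treats the five columns shown in the example table (term1 to term4 and term n) as if they were the whole vocabulary, and the helper names transpose and matmul are made up for this sketch.

```python
def transpose(M):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Multiply A (n x k) by B (k x m) using plain Python lists."""
    B_cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in B_cols]
            for row in A]

# Rows d1, d2, d3 from the example table; assumption: the five visible
# columns are the entire vocabulary.
D = [
    [2, 1, 0, 0, 3],  # d1
    [0, 2, 3, 4, 1],  # d2
    [1, 0, 4, 2, 0],  # d3
]

S = matmul(D, transpose(D))
for row in S:
    print(row)
# [14, 5, 2]
# [5, 30, 20]
# [2, 20, 21]
```

For example, the entry for d2 and d3 is 0*1 + 2*0 + 3*4 + 4*2 + 1*0 = 20. Note that S is symmetric, and each diagonal entry is the dot product of a document with itself.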

Coming back to our original problem: find which two documents are most similar.
Simply look into the Similarity Matrix and find the row and column of the highest value in the matrix, ignoring the diagonal, since each diagonal cell compares a document with itself.
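
A minimal sketch of this lookup in Python is given below. The matrix S used here holds the values that rows d1, d2 and d3 of the example table produce if the five visible columns are assumed to be the whole vocabulary; the function name most_similar_pair is made up for this illustration. The sketch only scans pairs with i < j, which skips the diagonal and avoids checking each pair twice, since S is symmetric.

```python
from itertools import combinations

def most_similar_pair(S):
    """Return the (i, j) pair, with i < j, that has the highest value in S."""
    n = len(S)
    # combinations() yields only i < j, so diagonal cells are never considered.
    return max(combinations(range(n), 2), key=lambda pair: S[pair[0]][pair[1]])

# Assumed similarity matrix for d1, d2, d3 (see the lead-in above).
S = [
    [14, 5, 2],
    [5, 30, 20],
    [2, 20, 21],
]

i, j = most_similar_pair(S)
print(f"Most similar: d{i + 1} and d{j + 1} (score {S[i][j]})")
# Most similar: d2 and d3 (score 20)
```

Without the diagonal being excluded, the highest value here would be 30, which is simply d2 compared with itself.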

Sunday, May 5, 2013

The Serving Leader

"The Serving Leader" by Ken Jennings and John Stahl-Wert was my weekend reading. The book talks about five actions that can transform teams, businesses and communities. My key takeaways from the book are:

  1. Serving leaders run to great purpose by holding out in front of their team a "reason why" that is so big that it requires and motivates everybody's best efforts
  2. They qualify to be first by putting other people first
  3. They raise the bar of expectation by being highly selective in choice of team leaders and by establishing high standards of performance 
  4. They teach serving leader principles and practices and remove obstacles to performance. These actions multiply the serving leader's impact by educating and activating tier after tier of leadership
  5. They build on strength by arranging each person in the team to contribute what he or she is best at. This improves everyone's performance and solidifies teams by aligning the strengths of many people

Some of the thoughts were counterintuitive, such as "focus on strengths over addressing weaknesses". Yet the argument that "it is foolish to pour all our energy into turning weaknesses to serviceable mediocrity" makes profound sense.

Overall a good read and fodder for introspection.