I explored a novel way of clustering documents that is language-independent. The crux is that it applies to any text in any language: the process stays the same as the data changes.
Here is the approach in bullet points:
- Build word vectors from the available data.
- Cluster the word vectors of each document into word clusters.
- Find which cluster best represents the document.
- Aggregate these clusters (similar clusters go together).
- Documents with similar clusters go together.
Now we will walk through the process and the code, starting with creating word vectors. Before that, a reminder:
"In Natural Language Processing (NLP), 90% of the work is preprocessing." (Some Researcher)
So make sure you do all the preprocessing your data needs. Since this is a proof of concept (POC), I have left the preprocessing part out.
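As a starting point, a minimal preprocessing sketch could look like the one below. It is not part of the POC code: the `preprocess` function and its `min_len` cutoff are hypothetical names and values chosen for illustration. It relies only on Unicode-aware regular expressions, so it works for space-delimited alphabetic scripts. Languages without spaces between words, or scripts that rely heavily on combining marks, would need a proper tokenizer instead.

```python
import re
import unicodedata


def preprocess(text, min_len=2):
    """Lowercase, normalise Unicode and split text into word tokens."""
    # NFKC folds compatibility forms (e.g. full-width letters) into one form.
    text = unicodedata.normalize("NFKC", text).lower()
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", text)
    # Drop very short tokens, which are often noise (assumed cutoff).
    return [t for t in tokens if len(t) >= min_len]


print(preprocess("Hello, World! 123"))  # ['hello', 'world']
```

The resulting token lists can be written out one document per line and fed to the vector-building step.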
We will use the gensim Python package to build word vectors. The function below creates the vectors and saves them in binary format.
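```python
import bz2

from gensim.models import Word2Vec


# Create word vectors
def create_vectors(filename):
    sentences = []
    # Each line of the bz2-compressed file is split on commas and whitespace.
    with bz2.open(filename, 'rt') as df:
        for l in df:
            sentences.append(" ".join(l.split(",")).split())
    # Note: gensim >= 4.0 renames the `size` parameter to `vector_size`.
    model = Word2Vec(sentences, size=25, window=5, min_count=1, workers=4)
    model.save(r"filename.vec")  # Line changed refer code on github
    return model
```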
Once the word vectors are ready, we can move on to the next step, which is a particularly exciting one. I stole it from a blog here. It’s very simple, and it works!
Let’s say we have a document with n words. We start with the first word and look up its vector in the model (the word vectors). That vector goes into a cluster. Moving forward, we assign each word either to a new cluster, with probability 1/(1+n), or to a pre-created cluster. If we decide on an existing cluster, we assign the word to the most similar one. This approach separates the words into groups that make sense. Take a look.
More examples are in the notebook here.
The function below implements the above algorithm, which is also known as the Chinese Restaurant Process.
I have put the code on GitHub. There is also an accompanying notebook that you can look at on GitHub.
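Here is a minimal sketch of the idea in plain Python. It is an illustration under stated assumptions, not the exact code from the repository: I assume n in 1/(1+n) is the number of clusters created so far, that similarity is the cosine between a word vector and the summed vector of a cluster, and the names `cosine` and `crp_cluster` as well as the toy vectors are made up for the example.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def crp_cluster(vectors, seed=None):
    """Group vectors with a CRP-style rule; returns lists of indices into `vectors`."""
    rng = random.Random(seed)
    sums = []     # running sum of the vectors in each cluster
    members = []  # indices of the vectors assigned to each cluster
    for i, vec in enumerate(vectors):
        # Assumption: n is the number of clusters created so far.
        p_new = 1.0 / (1 + len(sums))
        if not sums or rng.random() < p_new:
            # Open a new cluster seeded with this vector.
            sums.append(list(vec))
            members.append([i])
            continue
        # Otherwise join the existing cluster whose summed vector is most similar.
        best = max(range(len(sums)), key=lambda j: cosine(vec, sums[j]))
        sums[best] = [s + v for s, v in zip(sums[best], vec)]
        members[best].append(i)
    return members


# Toy 2-dimensional "word vectors", made up for illustration.
words = ["cat", "dog", "car", "bus"]
toy = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
for cluster in crp_cluster(toy, seed=0):
    print([words[i] for i in cluster])
```

Because new clusters are opened at random, different seeds can give different groupings. The probability of opening a new cluster shrinks as more clusters exist, which keeps the number of clusters from growing with every word.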
A few suggestions to improve the above algorithm:
- More Data
- Better Preprocessing
- Higher-Dimensional Vectors (we used 25 dimensions)
- A better approach to matching clusters (other than the cosine approach)
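For reference, the cosine approach from the last suggestion can be sketched roughly as below. This is an illustration, not the code from the repository. It assumes each document has already been reduced to one vector (for example, the mean vector of its most representative cluster). The `group_documents` function, the greedy grouping strategy and the `threshold` of 0.8 are hypothetical choices for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_documents(doc_vectors, threshold=0.8):
    """Greedily group documents whose representative vectors are similar."""
    groups = []  # each group holds a running mean vector and its document names
    for name, vec in doc_vectors.items():
        best, best_sim = None, threshold
        for group in groups:
            sim = cosine(vec, group["centroid"])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            # No group is similar enough, so start a new one.
            groups.append({"centroid": list(vec), "docs": [name]})
        else:
            n = len(best["docs"])
            # Update the running mean of the group's vectors.
            best["centroid"] = [(c * n + v) / (n + 1)
                                for c, v in zip(best["centroid"], vec)]
            best["docs"].append(name)
    return [group["docs"] for group in groups]


# Made-up representative vectors for three documents.
docs = {"doc_a": [1.0, 0.0], "doc_b": [0.9, 0.1], "doc_c": [0.0, 1.0]}
print(group_documents(docs))  # [['doc_a', 'doc_b'], ['doc_c']]
```

Swapping `cosine` for another similarity measure, or replacing the greedy pass with a proper clustering algorithm, would be one way to try the "better approach" suggested above.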
Note: This is a very experimental approach. A lot is unclear and undefined as of now. I will keep improving and completing the approach. Your comments are appreciated.
For any questions and inquiries, visit us at thinkitive.