Diagonal latent block model for document classification
With the increasing amount of available data, unsupervised learning has become an important tool for discovering the underlying patterns that explain data without manually labeling individual instances. Among the approaches proposed to tackle this problem, clustering is arguably the most popular. Clustering is usually based on the assumption that each group, also called a cluster, is characterized by a centroid defined in terms of all features. However, in some real-world applications dealing with high-dimensional data, this assumption may not hold. To address this, co-clustering algorithms were proposed to describe clusters by the subsets of features that are most relevant to them. The resulting latent structure of the data is composed of blocks, usually called co-clusters. In this talk, a co-clustering approach based on the mixture model framework and adapted to the problem of document classification is presented. The problem of co-clustering binary data is addressed in the latent block model framework, with diagonal constraints imposed on the resulting data partitions. We consider the Bernoulli generative mixture model and present three new methods that differ in the assumptions made about the degree of homogeneity of the diagonal blocks. The proposed models are parsimonious and make it possible to take the structure of the data matrix into account when reorganizing it into homogeneous diagonal blocks. For each of the presented models, we derive an algorithm based on Expectation-Maximization (EM)-type procedures.
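To make the setting concrete, the following is a minimal sketch (not the talk's actual algorithm) of a diagonal latent block model for binary data with hard, CEM-style alternating updates. It assumes the simplest of the parsimonious variants hinted at above: a single Bernoulli mean `a` shared by all diagonal blocks and a single mean `b` for all off-diagonal cells; the data simulation, cluster count `g`, and all variable names are illustrative choices, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a binary document-term-style matrix with g diagonal blocks:
# cells inside a diagonal block are 1 with probability `alpha`,
# cells outside with probability `eps` (all values are illustrative).
g, n, d = 3, 60, 45
z_true = rng.integers(0, g, n)          # true row (document) labels
w_true = rng.integers(0, g, d)          # true column (term) labels
alpha, eps = 0.8, 0.1
P = np.where(z_true[:, None] == w_true[None, :], alpha, eps)
X = rng.binomial(1, P)

def fit_diag_lbm(X, g, n_iter=20, seed=1):
    """Hard (classification-EM style) alternating updates for a diagonal
    latent block model with one in-block mean `a` and one out-of-block
    mean `b` -- a sketch, not the models presented in the talk."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = rng.integers(0, g, n)           # row partition
    w = rng.integers(0, g, d)           # column partition
    a, b = 0.7, 0.2                     # initial block means (assumed)
    for _ in range(n_iter):
        # Log-likelihood gain of a cell lying in a diagonal block.
        la = np.log(a) - np.log(b)
        lb = np.log(1 - a) - np.log(1 - b)
        # Row step: assign each row to the column cluster it matches best.
        W = np.eye(g)[w]                # d x g one-hot column labels
        S = X @ W                       # per-row ones count in each cluster
        cnt = W.sum(axis=0)             # column-cluster sizes
        z = np.argmax(S * la + (cnt - S) * lb, axis=1)
        # Column step: symmetric update for columns given rows.
        Z = np.eye(g)[z]
        T = X.T @ Z
        cnt_r = Z.sum(axis=0)
        w = np.argmax(T * la + (cnt_r - T) * lb, axis=1)
        # M step: re-estimate the diagonal and off-diagonal means,
        # clipped away from 0/1 to keep the logs finite.
        mask = z[:, None] == w[None, :]
        if mask.any():
            a = np.clip(X[mask].mean(), 1e-6, 1 - 1e-6)
        if (~mask).any():
            b = np.clip(X[~mask].mean(), 1e-6, 1 - 1e-6)
    return z, w, a, b

z_hat, w_hat, a_hat, b_hat = fit_diag_lbm(X, g)
```

The diagonal constraint shows up in the shared number of clusters `g` for rows and columns and in the two-parameter (`a`, `b`) block structure; the three models mentioned in the abstract differ precisely in how much such parameter sharing they assume across the diagonal blocks.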