Generalized Model Selection For Unsupervised Learning In High Dimensions Shivakumar Vaithyanathan IBM Almaden Research Center 650 Harry Road San Jose, CA 95136

[email protected] Byron Dom IBM Almaden Research Center 650 Harry Road San Jose, CA 95136

[email protected] Abstract We describe a Bayesian approach to model selection in unsupervised learning that determines both the feature set and the number of clusters. We then evaluate this scheme (based on marginal likelihood) and one based on cross-validated likelihood. For the Bayesian scheme we derive a closed-form solution of the marginal likelihood by assuming appropriate forms of the likelihood function and prior. Extensive experiments compare these approaches and all results are verified by comparison against ground truth. In these experiments the Bayesian scheme using our objective function gave better results than cross-validation. 1 Introduction Recent efforts define the model selection problem as one of estimating the number of clusters[ 10, 17]. It is easy to see, particularly in applications with large number of features, that various choices of feature subsets will reveal different structures underlying the data. It is our contention that this interplay between the feature subset and the number of clusters is essential to provide appropriate views of the data.We thus define the problem of model selection in clustering as selecting both the number of clusters and the feature subset. Towards this end we propose a unified objective function whose arguments include the both the feature space and number of clusters. We then describe two approaches to model selection using this objective function. The first approach is based on a Bayesian scheme using the marginal likelihood for model selection. The second approach is based on a scheme using cross-validated likelihood. In section 3 we apply these approaches to document clustering by making assumptions about the document generation model. Further, for the Bayesian approach we derive a closed-form solution for the marginal likelihood using this document generation model. We also describe a heuristic for initial feature selection based on the distributional clustering of terms. Section 5 describes the experiments and our approach to validate the proposed models and algorithms. Section 6 reports and discusses the results of our experiments and finally section 7 provides directions for future work. Model Selection for Unsupervised Learning in High Dimensions 971 2 Model selection in clustering Model selection approaches in clustering have primarily concentrated on determining the number of components/clusters. These attempts include Bayesian approaches [7,10], MDL approaches [15] and cross-validation techniques [17]. As noticed in [17] however, the optimal number of clusters is dependent on the feature space in which the clustering is performed. Related work has been described in [7]. 2.1 A generalized model for clustering Let D be a data-set consisting of “patterns“ {d I, , dv}, which we assume to be represented in some feature space T with dimension M. The particular problem we address is that of clustering D into groups such that its likelihood described by a probability model p(DTIQ), is maximized, where DT indicates the representation of D in feature space T and Q is the structure of the model, which consists of the number of clusters, the partitioning of the feature set (explained below) and the assignment of patterns to clusters. This model is a weighted sum of models {P(DTIQ, ~)I~ E [Rm} where ~ is the set of all parameters associated with Q . To define our model we begin by assuming that the feature space T consists of two sets: U -useful features and N - noise features. Our feature-selection problem will thus consist of partitioning T (into U and N) for a given number of clusters. Assumption 1 The feature sets represented by U and N are conditionally independent p(DTIQ,~) = P(DN I Q, ~) P(Du I Q,~) (1) where DN indicates data represented in the noise feature space and DU indicates data represented in useful feature space. Using assumption 1 and assuming that the data is independently drawn, we can rewrite equation (1) as p(DTIQ,~) = {n p(d~ I ~N). nn p(dy I ~f)} (2) 1=1 k=IJED, where V is the number of patterns in D, p(dy I ~u) is the probability of dy given the parameter vector ~f and p(d~ I ~N) is the probability of d~ given the parameter vector ~N. Note that while the explicit dependence on Q has been removed in this notation, it is implicit in the number of clusters K and the partition of T into Nand U. 2.2 Bayesian approach to model selection The objective function, represented in equation (2) is not regularized and attempts to optimize it directly may result in the set N becoming empty - resulting in overfitting. To overcome this problem we use the marginallikelihood[2]. K Assumption 2 All parameter vectors are independent. n (~) = n (~N). n n (~f) k=1 where the n( . ) denotes a Bayesian prior distribution. The marginal likelihood, using assumption 2, can be written as P(DT I Q)= IN [UP(d~ I ~N)]n(~N)d~N. DL [!lp(dY I ~f)]n(~f)d~f(3) 972 S. Vaithyanathan and B. Dom where SN, SV are integral limits appropriate to the particular parameter spaces. These will be omitted to simplify the notation. 3.0 Document clustering Document clustering algorithms typically start by representing the document as a “bag-of-words“ in which the features can number - 104to 105. Ad-hoc dimensionality reduction techniques such as stop-word removal, frequency based truncations [16] and techniques such as LSI [5] are available. Once the dimensionality has been reduced, the documents are usually clustered into an arbitrary number of clusters. 3.1 Multinomial models Several models of text generation have been studied[3]. Our choice is multinomial models using term counts as the features. This choice introduces another parameter indicating the probability of the Nand U split. This is equivalent to assuming a generation model where for each document the number of noise and useful terms are determined by a probability (}s and then the terms in a document are “drawn“ with a probability ((}n or ()~ ). 3.2 Marginal likelihood / stochastic complexity To apply our Bayesian objective function we begin by substituting multinomial models into (3) and simplifying to obtain P(D I Q) = (tN ;,tV) S[((}S)tN (1-(}s)tu]n((}s)d(}S . [Ii: II ({t. tIYEU}J] S[II((}k)t,.u]n((}f)d(}f· (4) k==1 IEDk I,U U U EV [iI ( tf J 1 S[II((}n)ti .? ] n((}N) d(}N j=1 {tj,nlnEN} nEN where ( (. .\) is the multinomial coefficient, ti,u is the number of occurrences of the feature term u in document i, tYis the total number of all useful features (terms) in document i (tY = L U ti,u, t~:, and ti,n are to be interpreted similar to above but for noise features, (n = kl(~~k)! , tNis the total number of all noise features in all patterns and tVis the total number of all useful features in all patterns. To solve (4) we still need a form for the priors {n( . )}. The Beta family is conjugate to the Binomial family [2] and we choose the Dirichlet distribution (mUltiple Beta) as the form for both n((}f) and n((}N) and the Beta distribution for n((}s). Substituting these into equation (8) and simplifying yields P (D I Q) = [ f(Ya + Yb) f(tN + Ya)f(tV + Yb) ] ? [ f(/J) II f(/Jn + tn) ] f(Ya)f(Yb) [(tV + tN + Ya + Yb) f(/J + tN) nE N f(/Jn) [ [(0 ) K r(O k + IDkl)] [K f(a) [(au + tV ] [(0 + v) D f(lDkl) ? D f(a+ tU(k) Du f(a u) (5) Model Selection for Unsupervised Learning in High Dimensions 973 where f3, and au are the hyper-parameters of the Dirichlet prior for noise and useful features respectively, f3 = L f3n , a = L au, U = L ukand ro is the “gamma“ function. neN UEU k=1 Further, Yu, Yure the hyper parameters of the Beta prior for the split probability, IDkl is the number of documents in cluster k and tU(k is computed as L tf. The results iEDk reported for our evaluation will be the negative of the log of equation (5), which (following Rissanen [14]) we refer to as Stochastic Complexity (SC). In our experiments all values of the hyper-parameters pj ,aj (Jk Ya and Y bare set equal to 1 yielding uniform priors. 3.3 Cross-Validated likelihood To compute the cross validated likelihood using multinomial models we first substitute the multinomial functional forms, using the MLE found using the training set. This results in the following equation { ,., N ,.,., U VIt!\1 ,.,, K ~ P(CVT I QP) = [(05)t“ (1-(5)1,,] IT p(evf ION) . IT IT peevy I O~i) p(q) (6) 1=1 k=IJEDk ,., ---,., where Os, ON and O~i) are the MLE of the appropriate parameter vectors. For our implementation of MCCV, following the suggestion in [17], we have used a 50% split of the training and test set. For the vCV criterion although a value of v = 10 was suggested therein, for computational reasons we have used a value of v = 5. 3.4 Feature subset selection algorithm for document clustering As noted in section 2.1, for a feature-set of size M there are a total of 2M partitions and for large M it would be computationally intractable to search through all possible partitions to find the optimal subset. In this section we propose a heuristic method to obtain a subset of tokens that are topical (indicative of underlying topics) and can be used as features in the bag-of-words model to cluster documents. 3.4.1 Distributional clustering for feature subset selection Identifying content-bearing and topical terms, is an active research area [9]. We are less concerned with modeling the exact distributions of individual terms as we are with simply identifying groups of terms that are topical. Distributional clustering (DC), apparently first proposed by Pereira et al [13], has been used for feature selection in supervised text classification [1] and clustering images in video sequences [9]. We hypothesize that function, content-bearing and topical terms have different distributions over the documents. DC helps reduce the size of the search space for feature selection from 2M to 2e, where C is the number of clusters produced by the DC algorithm. Following the suggestions in [9], we compute the following histogram for each token. The first bin consists of the number of documents with zero occurrences of the token, the second bin is the number of documents consisting of a single occurrence of the token and the third bin is the number of documents that contain more two or more occurrences of the term. The histograms are clustered using reLative entropy ~(. II .) as 974 S. Vaithyanathan and B. Dom a distance measure. For two terms with probability distributions PI (.) and P2(.), this is given by [4]: “ PI(t) ,1.(Pt(t) II P2(t)) = k PI(t) log P2(t) t (7) We use a k-means-style algorithm in which the histograms are normalized to sum to one and the sum in equation (7) is taken over the three bins corresponding to counts of 0,1, and ~ 2. During the assignment-to-clusters step of k-means we compute ,1.(pw II PCk) (where pw is the normalized histogram for term wand Pq(t) is the centroid of cluster k) and the term w is assigned to the cluster for which this is minimum [13,8]. 4.0 Experimental setup Our evaluation experiments compared the clustering results against human-labeled ground truth. The corpus used was the AP Reuters Newswire articles from the TREC-6 collection. A total of 8235 documents, from the routing track, existing in 25 classes were analyzed in our experiments. To simplify matters we disregarded multiple assignments and retained each document as a member of a single class. 4.1 Mutual information as an evaluation measure of clustering We verify our models by comparing our clustering results against pre-classified text. We force all clustering algorithms to produce exactly as many clusters as there are classes in the pre-classified text and we report the mutual information[ 4] (MI) between the cluster labels and pre-classified class labels 5.0 Results and discussions After tokenizing the documents and discarding terms that appeared in less than 3 documents we were left with 32450 unique terms. We experimented with several numbers of clusters for DC but report only the best (lowest SC) for lack of space. For each of these clusters we chose the best of 20 runs corresponding to different random starting clusters. Each of these sets includes one cluster that consists of high-frequency words and upon examination were found to contain primarily function words, which we eliminated from further consideration. The remaining non-function-word clusters were used as feature sets for the clustering algorithm. Only combinations of feature sets that produced good results were used for further document clustering runs. We initialized the EM algorithm using k-means algorithm - other initialization schemes are discussed in [11]. The feature vectors used in this k -means initialization were generated using the pivoted normal weighting suggested in [16]. All parameter vectors Of and eN were estimated using Laplace s Rule of Succession[2]. Table 1 shows the best results of the SC criterion, the vCV and MCCV using the feature subsets selected by the different combinations of distributional clusters. The feature subsets are coded as FSXP where X indicates the number of clusters in the distributional clustering and P indicates the cluster number(s) used as U. For SC and MI all results reported are averages over 3 runs of the k-means+EM combination with different initialization fo k-means. For clarity, the MI numbers reported are normalized such that the theoretical maximum is 1.0. We also show comparisons against no feature selection (NF) and LSI. Model Selection for Unsupervised Learning in High Dimensions 975 For LSI, the principal 165 eigenvectors were retained and k-means clustering was performed in the reduced dimensional space. While determining the number of clusters, for computational reasons we have limited our evaluation to only the feature subset that provided us with the highest MI, i.e., FS41-3. Feature Useful SC vCV MCCV Ml Set Features X 107 X 107 X 107 FS41-3 6,157 2.66 0.61 1.32 0.61 FS52 386 2.8 0.3 0.69 0.51 NF 32,450 2.96 1.25 2.8 0.58 LSI 324501165 NA NA NA 0.57 Table 1 Comparison Of Results Figure 1 1~1 ? . . .: I . . . ? , “ “ , . “ , Flgur.2 1~1 . ~ “ . I ? ? , . Log IQ . L . hood . “ ThIwI:_P !Ch MCCV· gII . lavl . -ood 5.3 Discussion The consistency between the MI and SC (Figure 1) is striking. The monotonic trend is more apparent at higher SC indicating that bad clusterings are more easily detected by SC while as the solution improves the differences are more subtle. Note that the best value of SC and Ml coincide. Given the assumptions made in deriving equation (5), this consistency and is encouraging. The interested reader is referred to [18] for more details. Figures 2 and 3 indicate that there is certainly a reasonable consistency between the cross-validated likelihood and the MI although not as striking as the