Workshop: Week 4
================

PRELIM 1. SciPy, NumPy
======================

We're starting to look at methods (e.g., singular value decomposition) that are too complex to be worth implementing from scratch. From now on, we'll make more use of existing software tools and libraries to perform some of the basic analyses.

In Python, many advanced analytic tools build upon two packages (or extensions): NumPy and SciPy (http://www.numpy.org and http://scipy.org/scipylib). NumPy is a package for numerical analysis. It provides an efficient N-dimensional array object, and direct support for matrix algebra operations upon it (for instance, matrix addition or multiplication as single operators). The array object (whether as a 1-d vector, a 2-d matrix, or a 3+-d tensor) is the fundamental data structure for many Python scientific and analytic packages. SciPy extends NumPy with several components useful in scientific computing, such as statistical distributions, sparse matrices, and clustering (it provides, for instance, implementations of hierarchical clustering).

Install NumPy and SciPy on your system. If you are not familiar with the packages, read through the NumPy tutorial at:

    http://wiki.scipy.org/Tentative_NumPy_Tutorial

and try at least some of the example code.

=====================================================================

Question 1: Using SciPy for Clustering
======================================

To get started with SciPy, use the function scipy.cluster.hierarchy.single() to redo Question 3 of last week's worksheet. Use scipy.cluster.hierarchy.dendrogram() to calculate the dendrogram representation, and then pylab.savefig() to store it.

You'll need to calculate the pairwise doc--doc similarities, and place them in an upper-triangular matrix; scipy.cluster.hierarchy.single() will then cluster them for you. As before, be aware of the $O(n^2)$ time and (because of the representation required by the function) space complexity here. Start with a small number of documents (say, 100), then progressively double until the running time or space requirements become uncomfortable.
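As a rough sketch of how the pieces might fit together (not a required approach): assume the documents have already been loaded as rows of a NumPy array, here called doc_vectors (the name and the cosine metric are illustrative choices). Note that the linkage functions work on distances rather than similarities, so a similarity score would need converting (e.g., distance = 1 - similarity); scipy.spatial.distance.pdist() produces the condensed pairwise-distance form that scipy.cluster.hierarchy.single() expects.

    import pylab
    from scipy.cluster import hierarchy
    from scipy.spatial import distance

    def cluster_and_plot(doc_vectors, out_fname="dendrogram.png"):
        # Pairwise cosine distances, in the condensed (flattened
        # upper-triangular) form expected by the linkage functions.
        dists = distance.pdist(doc_vectors, metric="cosine")
        # Single-link agglomerative clustering.
        linkage_matrix = hierarchy.single(dists)
        # Plot the dendrogram and save it to disk.
        pylab.figure(figsize=(10, 6))
        hierarchy.dendrogram(linkage_matrix)
        pylab.savefig(out_fname)
        return linkage_matrix

Doubling the number of documents should make the $O(n^2)$ growth in both the distance computation and the memory footprint quite visible.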
Question 2: Singular value decomposition
========================================

SciPy provides singular value decomposition through the scipy.linalg.svd function. Consider the following TDM (taken from Manning and Schütze, Chapter 18, "Intro to IR"):

              d1   d2   d3   d4   d5   d6
    ship       1    0    1    0    0    0
    boat       0    1    0    0    0    0
    ocean      1    1    0    0    0    0
    voyage     1    0    0    1    1    0
    trip       0    0    0    1    0    1

Convert this into a numpy matrix (using numpy.matrix()), then use scipy.linalg.svd to perform the singular value decomposition. Do this as:

    import numpy as np
    import scipy.linalg as la

    X = np.matrix([ ... ])  # the above TDM, encoded as a matrix
    (U, s, V_T) = la.svd(X)

Print out U, s, and V_T. Note: due to imprecision in floating-point operations, you can get extremely small values where there should be zeros. To set these to zero for, say, the U matrix, do:

    U[np.absolute(U) < 0.000001] = 0.0

Note, in passing, how numpy allows you to do this sort of operation as a single array operation, rather than having to use loops.

Compare to the answers given by Manning and Schütze (pp. 415ff). You should get the same _absolute_ values, but the signs may differ. (Why?)

Use np.shape() to check the shapes of U and V_T. Now multiply U, S, and V_T back together:

    X = U * S * V_T

Note that s just gives the singular values, i.e. the diagonal of S. To reconstitute S as a diagonal matrix, you need to do:

    S_ = np.diagflat(s)

Then, to get the shape right, you need to add on all-0 rows or columns. The following code will do this _for this particular example_, but whether you need additional rows or columns, and how many, will vary between problems:

    empty_col = np.array([[0.0] * 5]).transpose()
    S = np.concatenate((S_, empty_col), 1)

Then use np.dot() to do the matrix multiplication (one matrix pair at a time). Check that the output of U * S * V_T is the same as the original input (again, allowing for rounding error).

Question 3: Dimensionality reduction
====================================

Take your code for Question 2, but add in dimensionality reduction. Allow the user to specify a dimensionality $k$ (here, $k \le 5$; why?). Set the appropriate singular values in $s$ to zero to perform the dimensionality reduction (which ones are these?). Shrink the dimensionality to 4, 3, 2, and 1, and at each reduced dimensionality calculate:

    X_k = U * S_k * V_T

How quickly does X_k diverge from X? Does this relate to the size of the singular values?

Question 4: LSA
===============

We will use the gensim package for performing LSA (a.k.a. LSI) analysis. See:

    http://radimrehurek.com/gensim/

for a description of gensim and how to install it. Read the gensim tutorial at:

    http://radimrehurek.com/gensim/tutorial.html

specifically the sections marked "Corpora and Vector Spaces" (for understanding the representations used internally by gensim) and "Topics and Transformations" (which discusses, very briefly, how to perform the LSA/LSI transformation).

The file:

    http://codalism.com/comp90042/wksh/wk04/code/index.py

contains Python code for creating an indexed corpus in a format that gensim can work with. Running this program as:

    python index.py ../../dat/lyrl_tokens_30k.dat idx

will create index files with the prefix "idx". (Of course, you have to specify the correct path to your copy of the LYRL 30k dataset.) Then the code fragment:

    import index
    from gensim import corpora

    term_dict = corpora.Dictionary.load(index.get_dict_fname(index_tag))
    corpus = corpora.MmCorpus(index.get_corpus_fname(index_tag))

will load the corpus representation and the dictionary mapping term ids to term strings.

Extend this to perform a full LSA decomposition on the LYRL 30k collection. Extract the top 20 terms from each of the 20 topics. Compare the first 6 topics with the results given in the lecture notes. Look at the remaining 14 topics. How many of these appear coherent to you? How do you interpret the negatively-weighted terms?

Question 5: LSA querying
========================

The next section of the gensim tutorial explains how to run queries against the LSA index. Implement this on the LYRL index. Select some queries that you think are under-specified (that is, where the concepts expressed by the query terms could equally be expressed by other terms). (You could use the topics identified above to help guide you.) Run the queries, and retrieve the documents, as word lists. (The file index.py has a function lyrl_to_docs() which will load the document word lists as a 2-d array.)

How successful is LSA at "bridging the semantic gap"? Get some of the matched documents from:

    http://www.williamwebber.com/rcv1v2/?docid=46825&user=foo

(see the worksheet for Week 2 for how this service works). Do these seem like correct matches for the query?
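For Questions 4 and 5, the following is a minimal sketch of one way the gensim pieces might be strung together; it is not the required solution. It assumes the index files were built with the prefix "idx" as above; the example query string and the variable names are illustrative, the 20 topics come from Question 4, and the exact return format of print_topics() varies between gensim versions.

    import index
    from gensim import corpora, models, similarities

    index_tag = "idx"  # prefix passed to index.py (assumption)
    term_dict = corpora.Dictionary.load(index.get_dict_fname(index_tag))
    corpus = corpora.MmCorpus(index.get_corpus_fname(index_tag))

    # Question 4: LSA/LSI decomposition into 20 topics.
    lsi = models.LsiModel(corpus, id2word=term_dict, num_topics=20)
    for topic in lsi.print_topics(num_topics=20, num_words=20):
        print(topic)

    # Question 5: fold a query into the LSI space and rank the documents
    # by their similarity to it.
    sim_index = similarities.MatrixSimilarity(lsi[corpus])
    query = "your under-specified query terms here"   # choose your own queries
    query_bow = term_dict.doc2bow(query.lower().split())
    sims = sim_index[lsi[query_bow]]
    ranked = sorted(enumerate(sims), key=lambda pair: -pair[1])
    print(ranked[:10])   # (document id, similarity) for the ten best matches

The document ids in the ranking can then be looked up in the word lists returned by index.lyrl_to_docs(), or fetched from the RCV1 service above, to judge whether the matches look reasonable.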