Workshop: Week 6
PRELIMINARIES:
==============
I have kept this worksheet shorter to allow students to spend
time on their projects. The last two questions are optional
(though IMHO fun).
===============================================================
Question 1: BIM with full knowledge
===================================
In Equation 9 of Lecture 10, we estimated the weight of term $t$ as
w_t = \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t}
We also said how to estimate values for $p_t$ and $u_t$ (respectively,
the proportion of relevant and irrelevant documents that contain term
$t$). Assume we have the following values:
$N$ Number of documents
$R$ Number of relevant documents
$f_t$ Number of documents that term $t$ appears in
$r_t$ Number of relevant documents that term $t$ appears in
Derive a full formula for $w_t$ in terms of these values. What happens
to your formula if $t$ appears in every relevant document?
NOTE: for this question, you can show your derivation in either of
these forms:
1. As LaTeX code (see above). Please place it in a standalone file
"q1.tex" that compiles with "pdflatex q1.tex".
2. Hand-written on paper (neatly please!), then photographed and
compressed. (If you can't get the photo under 100k, please
just write the final formula in text, and email me the
image.)
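As a numerical sanity check for Question 1, the sketch below plugs the proportion estimates described above ($p_t = r_t / R$ and $u_t = (f_t - r_t) / (N - R)$) into the weight formula. The function name bim_weight and the example counts are illustrative assumptions, not taken from the lecture; treat it as a way to check a hand derivation rather than a replacement for one.

```python
import math

def bim_weight(N, R, f_t, r_t):
    """BIM term weight from raw counts (hypothetical helper)."""
    p_t = r_t / R                  # proportion of relevant docs containing t
    u_t = (f_t - r_t) / (N - R)    # proportion of irrelevant docs containing t
    return math.log(p_t / (1 - p_t)) + math.log((1 - u_t) / u_t)

# Made-up counts: 100 docs, 10 relevant, t in 20 docs, 5 of them relevant.
print(bim_weight(100, 10, 20, 5))

# Edge case from the question: t appears in every relevant document.
try:
    bim_weight(100, 10, 20, 10)
except ZeroDivisionError as err:
    print("r_t == R:", err)
```

Running the edge case shows where the unsmoothed estimate breaks down, which is a useful hint for the second part of the question.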
Question 2: Calculating f_{d,t} distributions
=============================================
Write a program that takes four arguments: a token data file, a
minimum collection frequency, a maximum collection frequency, and
a maximum fdt.
Then for every term in the data file that has a collection frequency
(i.e. total number of occurrences, not number of documents it
occurs in) between the minimum and maximum (inclusive), calculate
the number of documents in which that term appears with each fdt
in [0, maximum fdt]. Keep a separate count for fdt values greater
than the maximum fdt.
Then print out the fdt values for the following settings:
lyrl_tokens_30k.dat 1000 1040 15
that is, the terms in the LYRL30k dataset that have a collection
frequency between 1000 and 1040. An example line of output is:
fair 1019 [30480, 626, 93, 31, 15, 6, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
In answering this question, you may find the file:
http://www.williamwebber.com/comp90042/wksh/wk06/code/lyrl_terms.py
helpful. It contains a function, lyrl_to_term_bow, which parses
the LYRL 30k data into a list of BOW dictionaries, one per document.
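One possible shape for the counting step in Question 2 is sketched below. It assumes the documents are already available as a list of BOW dictionaries (term to in-document count), which is how lyrl_to_term_bow is described above; since that function's call signature is not shown here, the sketch runs on a small made-up list instead. The name fdt_distributions and its parameters are hypothetical.

```python
from collections import Counter

def fdt_distributions(docs, min_cf, max_cf, max_fdt):
    """Map each selected term to (collection frequency, histogram).

    histogram[k] is the number of docs with f_{d,t} == k for k <= max_fdt;
    the final slot counts docs with f_{d,t} > max_fdt.
    """
    cf = Counter()
    for bow in docs:
        cf.update(bow)                       # adds each term's in-doc count
    selected = {t: [0] * (max_fdt + 2)
                for t, c in cf.items() if min_cf <= c <= max_cf}
    for bow in docs:
        for term, f in bow.items():
            if term in selected:
                selected[term][min(f, max_fdt + 1)] += 1
    for term, hist in selected.items():
        hist[0] = len(docs) - sum(hist[1:])  # docs in which the term is absent
    return {t: (cf[t], hist) for t, hist in selected.items()}

# Toy collection (made up), in the one-BOW-dict-per-document shape.
toy_docs = [{"a": 2, "b": 1}, {"a": 1}, {"b": 5}, {}]
for term, (c, hist) in sorted(fdt_distributions(toy_docs, 3, 6, 2).items()):
    print(term, c, hist)
```

Each histogram has max_fdt + 2 slots, which matches the 17 entries in the example output line for a maximum fdt of 15.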
Question 3: Expected One-Poisson distribution
=============================================
Assume a term $t$ has a collection frequency of 1020, and that the
collection size is 31254 (the number of docs in LYRL30k). What
is the parameter $\lambda$ in the one-Poisson model for that
term?
The code:
from scipy.stats import poisson
poisson.pmf(0, lam)
gives the probability that a Poisson process with parameter
$\lambda$ (passed here as lam, since lambda is a reserved word in
Python) will result in 0 observations in the unit interval
of time. Using this function, calculate the number of docs in which
we expect $t$ to have an $f_{d,t}$ of 0, 1, 2, 3, 4, ...
Which of the terms in Question 2 have a distribution that
appears roughly to follow the expected one-Poisson distribution?
(You may use 1020 as the $c_t$ for all these terms, rather than
their actual $c_t$.)
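The calculation asked for in Question 3 could be laid out roughly as follows. This sketch assumes $\lambda$ is the mean number of occurrences per document ($c_t / N$), and it writes the Poisson probability out with the standard library so that it runs without scipy; poisson.pmf(k, lam) computes the same quantity. The helper names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def expected_doc_counts(c_t, n_docs, max_fdt):
    """Expected number of docs with f_{d,t} = 0..max_fdt under one-Poisson."""
    lam = c_t / n_docs   # assumption: lambda is mean occurrences per document
    return [n_docs * poisson_pmf(k, lam) for k in range(max_fdt + 1)]

# Values from the question: c_t = 1020, collection size 31254.
for k, n in enumerate(expected_doc_counts(1020, 31254, 5)):
    print(k, round(n, 2))
```

The printed column can then be compared by eye against the histograms produced for Question 2.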
Question 4: Formalization of Question 3 (OPTIONAL)
==================================================
A more formal way of making the visual judgment of fit is as
follows. The cumulative distribution function (CDF) of a random
variable gives the proportion of observations that are
expected to be at or below a given value. So, for instance:
>>> poisson.cdf(2, 0.5)
0.9856
says that for a Poisson RV with $\lambda = 0.5$, 98.56% of
the observations are expected to be in the range {0, 1, 2}.
One minus the CDF will give the proportion of observations
expected to be above the specified value.
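The CDF example above can be reproduced without scipy by summing the Poisson probabilities directly. This is a minimal sketch using a hypothetical helper poisson_cdf; poisson.cdf remains the more convenient choice in practice.

```python
import math

def poisson_cdf(x, lam):
    """P(X <= x) for X ~ Poisson(lam), summed term by term."""
    return sum(math.exp(-lam) * lam ** k / math.factorial(k)
               for k in range(x + 1))

below = poisson_cdf(2, 0.5)   # proportion expected in {0, 1, 2}
above = 1 - below             # proportion expected at 3 or more
print(round(below, 4), round(above, 4))
```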
Find the smallest value of $f_{d,t}$ such that we expect the term $t$
(as defined in Question 3) to have less than a 1% chance of
occurring with that $f_{d,t}$ or higher within the collection
(given the collection size). What is that $f_{d,t}$? If
we observe that a term $t$ has that $f_{d,t}$, we can say
that we are 99% confident that $t$ deviates from the
One-Poisson model. For which of the terms in Question 2
are we _not_ 99% confident that they diverge from the
One-Poisson model?
Question 5: Extension of Question 4 (VERY OPTIONAL)
===================================================
The survival function is (1 - CDF(x)); that is, it gives
us the proportion of observations we expect to be strictly
above ("survive beyond") $x$. The inverse survival
function takes a proportion $p$ and (for a discrete
RV like the Poisson) gives the smallest value $x$ for
which 1 - CDF(x) <= p; $x + 1$ is then the smallest value
that we expect to be reached or exceeded with probability
at most $p$.
The inverse survival function for the Poisson distribution
is provided by (with $\lambda$ passed as lam):
poisson.isf(p, lam)
Use this to examine _all_ the terms in the LYRL30k
collection. For which of these terms are we _not_ 99%
confident that they violate the One-Poisson model?
Would you describe these as "non-content" terms
(given the nature of the LYRL30k collection)?
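To make the inverse survival function in Question 5 concrete, here is a small standard-library sketch that finds it by linear search. It uses the $\lambda = 0.5$ example from Question 4 rather than any term from the collection, and the helper names are hypothetical. It is intended to mirror what poisson.isf returns for a discrete distribution, but that is an assumption worth checking against scipy before relying on it.

```python
import math

def poisson_sf(x, lam):
    """Survival function P(X > x) = 1 - CDF(x) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** k / math.factorial(k)
              for k in range(x + 1))
    return 1 - cdf

def poisson_isf(p, lam):
    """Smallest integer x >= 0 with P(X > x) <= p (linear search)."""
    x = 0
    while poisson_sf(x, lam) > p:
        x += 1
    return x

x = poisson_isf(0.01, 0.5)
print(x, x + 1)   # x, and the smallest count reached with probability <= 1%
```

A per-term loop over the collection would call this once for each term's own $\lambda$ and compare the result with that term's largest observed $f_{d,t}$.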