Table of Contents:
What is TF-IDF?
Preprocessing data.
Weights to title and body.
Document retrieval using TF-IDF matching score.
Document retrieval using TF-IDF cosine similarity.
Now, coming back to our TF-IDF:
TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)
Terminology
t — term (word)
d — document (set of words)
N — number of documents in the corpus
corpus — the total document set
Term Frequency
This measures the frequency of a word in a document. It depends heavily on the length of the document and the generality of the word; for example, a very common word such as “was” can appear many times in a document. If we take two documents, one with 100 words and another with 10,000 words, there is a high probability that a common word such as “was” appears more often in the 10,000-word document. But we cannot say that the longer document is more important than the shorter one. For exactly this reason, we normalize the frequency value: we divide the count by the total number of words in the document.
Recall that we eventually need to vectorize the documents. When we do so, we cannot consider only the words present in that particular document; if we did, the vector lengths would differ between documents and it would not be feasible to compute similarity. So what we do is vectorize the documents over the vocab, the list of all unique words in the corpus.
When we vectorize the documents, we check the count of each word. In the worst case, if a term does not exist in a document, its TF value will be 0; in the other extreme case, if all the words in the document are that same term, it will be 1. The final normalized TF value therefore lies in the range [0, 1], with 0 and 1 inclusive.
TF is specific to each document and word; hence we can formulate TF as follows.
tf(t,d) = count of t in d / number of words in d
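To make the formula concrete, here is a minimal sketch of tf(t, d) in Python; the function name and the example tokens are illustrative, not part of the dataset used later.
from collections import Counter

def tf(term, tokenized_doc):
    # count of the term in the document divided by the total number of words
    counts = Counter(tokenized_doc)
    return counts[term] / len(tokenized_doc)

tokenized_doc = ["the", "cat", "was", "on", "the", "mat"]
print(tf("the", tokenized_doc))   # 2/6 ≈ 0.33
print(tf("dog", tokenized_doc))   # 0.0, the term is absent from the document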
If we have already computed the TF value, and it already produces a vectorized form of the document, why not use TF alone to find the relevance between documents? Why do we need IDF?
Let me explain. Though we calculated the TF value, there are still a few problems. For example, the most common words such as “is” and “are” will have very high values, giving those words very high importance. But using such words to compute relevance produces bad results. These common words are called stop words, and although we will remove them later in the preprocessing step, measuring the importance of a word across all documents and normalizing with that value represents the documents much better.
Document Frequency
This measures how widespread a term is across the whole corpus, and it is very similar to TF. The only difference is that TF is a frequency counter for a term t within a document d, whereas DF counts in how many of the N documents the term t occurs. In other words, DF is the number of documents in which the word is present. We count an occurrence if the term appears in a document at least once; we do not need to know how many times it appears there.
df(t) = number of documents containing the term t
To keep this value in a range as well, we normalize it by dividing by the total number of documents. Our main goal is to know the informativeness of a term, and DF is the exact inverse of that; this is why we invert the DF.
Inverse Document Frequency
IDF is the inverse of the document frequency, and it measures the informativeness of a term t. When we calculate IDF, it will be very low for the most frequent words, such as stop words (because a stop word such as “is” is present in almost all documents, so N/df gives it a very low value). This finally gives us what we want: a relative weighting.
idf(t) = N/df
There are a few other problems with IDF: in the case of a large corpus, say N = 10,000, the IDF value explodes for rare terms. So, to dampen this effect, we take the log of the IDF.
At query time, when a word that is not in the vocab occurs, its df will be 0. As we cannot divide by 0, we smooth the value by adding 1 to the denominator.
idf(t) = log(N/(df + 1))
Finally, by multiplying TF and IDF, we get the TF-IDF score. There are many different variations of TF-IDF, but for now let us concentrate on this basic version.
tf-idf(t, d) = tf(t, d) * log(N/(df + 1))
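To see the whole formula in action, here is a minimal, self-contained sketch that scores every word of a tiny made-up corpus with tf-idf(t, d) = tf(t, d) * log(N/(df + 1)); the three documents are purely illustrative.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are friends".split(),
]

N = len(docs)
df = Counter()                     # document frequency of each term
for doc in docs:
    for term in set(doc):
        df[term] += 1

tf_idf = {}
for i, doc in enumerate(docs):
    counts = Counter(doc)
    for term in counts:
        tf = counts[term] / len(doc)
        idf = math.log(N / (df[term] + 1))
        tf_idf[(i, term)] = tf * idf

print(tf_idf[(0, "cat")])   # "cat" appears in only one document, so its score is positive
print(tf_idf[(0, "the")])   # "the" appears in two of the three documents: log(3/3) = 0
Note that with this particular smoothing, a word occurring in every document gets a slightly negative score, which still ranks it below more informative words.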
Implementing on a real-world dataset
Now that we have learned what TF-IDF is, let us try to find the relevance of documents that are available online.
The dataset we are going to use is an archive of a few stories; it has lots of documents spread across different folders. Download the dataset and open your notebook, Jupyter Notebook I mean.
Dataset Link: http://archives.textfiles.com/stories.zip
Step 1: Analysing Dataset
The first step in any machine learning task is to analyse the data. If we look at the dataset, at first glance we see that all the documents are in English. Each document has a different name, and there are two folders inside the archive.
Now one of the important tasks is to identify the title within the body. If we analyse the documents, the titles are aligned in different patterns, but most of them are centre-aligned. We need to figure out a way to extract the title. But before we get all pumped up and start coding, let us analyse the dataset a little more deeply.
Take a few minutes to analyse the dataset yourself. Try to explore…
Upon closer inspection, we notice that there is an index.html in each folder (including the root), which contains all the document names and their titles. So let us consider ourselves lucky: the titles are given to us, and we do not have to exhaustively extract a title from each document.
Step 2: Extracting Title & Body
There is no specific way to do this; it totally depends on the problem statement at hand and on the analysis we do on the dataset.
As we have already found that the titles and the document names are in index.html, we need to extract those names and titles. We are lucky that HTML has tags which we can use as patterns to extract the content we need.
Before we start extracting the titles and file names, since the data sits in different folders, let us first crawl the folders so that we can later read all the index.html files at once.
import os

folders = [x[0] for x in os.walk(str(os.getcwd()) + '/stories/')]
os.walk gives us the files in a directory, and os.getcwd gives us the current directory. We are going to search in the current directory + stories folder, as our data files are in the stories folder.
Always assume that you are dealing with a huge dataset; this helps in automating the code.
Now we notice that the entry for the root folder ends with an extra /, so we are going to remove it.
folders[0] = folders[0][:len(folders[0])-1]
The above code removes the last character of the 0th entry in folders, which is the root folder.
Now, let's crawl through all the index.html files to extract their titles. To do that we need to find a pattern to pull out the title. As this is HTML, our job will be a little simpler. Let's see…
We can clearly observe that each file name is enclosed between (><A HREF=") and (") and each title is between (<BR><TD>) and (\n).
We will use simple regular expressions to retrieve the names and titles. The following code gives the list of all values that match each pattern, so the names and titles variables hold the lists of all names and titles.
names = re.findall('><A HREF="(.*)">', text)
titles = re.findall('<BR><TD> (.*)\n', text)
Now that we have code to retrieve the values from an index file, we just need to iterate over all the folders and get the titles and file names from every index.html:
- read each index.html file
- extract the titles and names
- move on to the next folder
dataset = []
for i in folders:
    file = open(i + "/index.html", 'r')
    text = file.read().strip()
    file.close()
    file_name = re.findall('><A HREF="(.*)">', text)
    file_title = re.findall('<BR><TD> (.*)\n', text)
    for j in range(len(file_name)):
        dataset.append((str(i) + str(file_name[j]), file_title[j]))
This prepares the index of the dataset: a list of tuples of (file location, title). There is a small issue: the root folder's index.html also lists the subfolders and their links, and we need to remove those.
Simply use a conditional check to remove them.
# c is a flag initialised to False before the folder loop; the root index.html
# also lists the two subfolders, so drop its first two entries once
if c == False:
    file_name = file_name[2:]
    c = True
Step 3: Preprocessing
Preprocessing is one of the major steps when we are dealing with any kind of text model. During this stage we have to look at the distribution of our data, decide what techniques are needed and how deeply we should clean.
This step has no hard and fast rule and depends entirely on the problem statement. A few common preprocessing steps are converting to lowercase, removing punctuation, removing stop words, and lemmatization/stemming. For our problem statement, it seems the basic preprocessing steps will be sufficient.
Lowercase
During text processing, each sentence is split into words, and each word is treated as a token after preprocessing. Programming languages treat textual data as case-sensitive, which means that “The” is different from “the”. We humans know that both belong to the same token, but due to the character encoding they are considered different tokens. Converting to lowercase is therefore a mandatory preprocessing step. As we have all our data in a list, numpy has a method which can convert a list of lists to lowercase at once.
np.char.lower(data)
Stop words
Stop words are the most commonly occurring words, and they do not add any value to the document vector. In fact, removing them improves computation and space efficiency. The nltk library has a method to download the stop words, so instead of explicitly listing all the stop words ourselves, we can just use nltk, iterate over all the words, and remove the stop words. There are more efficient ways to do this, but I will just give a simple method.
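For reference, here is a minimal sketch of how the stop_words list used in the snippet below can be obtained with nltk; the download only needs to run once.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')                 # one-time download of the stop word lists
stop_words = stopwords.words('english')
print(stop_words[:5])                      # e.g. ['i', 'me', 'my', 'myself', 'we']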
We are going to iterate over all the words and not append a word to the new text if it is a stop word.
new_text = ""
for word in words:
    if word not in stop_words:
        new_text = new_text + " " + word
Punctuation
Punctuation marks are unnecessary symbols in our corpus documents, but we should be a little careful about what we do with them. There can be problems such as “U.S.” (“United States”) being converted to “us” after preprocessing. Hyphens should also usually be handled carefully, but for this problem statement we are just going to remove these symbols.
symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
for i in symbols:
    data = np.char.replace(data, i, ' ')
We store all our symbols in a variable and iterate over that variable, removing each symbol from the whole dataset. We are using numpy here because our data is stored in a list of lists, and numpy is our best bet.
Apostrophe
Note that there is no apostrophe in the punctuation symbols above. If we removed punctuation first, “don't” would be converted to “dont”; “don't” is a stop word, but “dont” would not be removed. So what we do is remove stop words first, then symbols, and then stop words once more, because a few words that are not stop words might still carry an apostrophe.
return np.char.replace(data, "'", "")
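To make the ordering explicit, here is a minimal, self-contained sketch using plain Python strings instead of the numpy calls above; clean and remove_stop_words are hypothetical helper names, not functions from this article's code.
from nltk.corpus import stopwords   # assumes nltk.download('stopwords') has been run

stop_words = set(stopwords.words('english'))
symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"

def remove_stop_words(text):
    # drop every token that is a stop word
    return " ".join(w for w in text.split() if w not in stop_words)

def clean(text):
    # order described above: stop words first, then symbols, then apostrophes,
    # then stop words once more
    text = text.lower()
    text = remove_stop_words(text)   # contractions like "don't" still have their apostrophe here
    for s in symbols:
        text = text.replace(s, ' ')
    text = text.replace("'", "")
    return remove_stop_words(text)

# with a recent nltk stop word list (which includes "don't"), this prints "read stories"
print(clean("Don't read the stories."))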
Single Characters
Single characters are not very useful for judging the importance of a document, and after the previous steps a few leftover single characters may just be stray symbols, so it is always good to remove them.
new_text = ""
for w in words:
    if len(w) > 1:
        new_text = new_text + " " + w
We just need to iterate over all the words and skip a word if its length is not greater than 1.
Stemming
This is the final and most important part of the preprocessing. Stemming reduces a word to its stem.
For example, “playing” and “played” are the same type of word, both indicating the action “play”. A stemmer does exactly this: it reduces a word to its stem. We are going to use the Porter stemmer, which is a rule-based stemmer. The Porter stemmer identifies and removes the suffix or affix of a word. The words produced by the stemmer need not always be meaningful words, but they will be identified as the same token by the model.
Lemmatisation
Lemmatisation is a way to reduce a word to its root form. Unlike stemming, lemmatisation makes sure that the reduced word is still a dictionary word (a word present in the language). WordNetLemmatizer can be used to lemmatise any word.
Stemming vs Lemmatization
stemming — need not produce a dictionary word; removes suffixes and affixes based on a few rules
lemmatization — always produces a dictionary word; reduces a word to its root form (see the comparison sketch below)
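As a quick illustration of the difference, here is a minimal sketch assuming nltk's PorterStemmer and WordNetLemmatizer; the example words are arbitrary.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')   # data required by WordNetLemmatizer (plus 'omw-1.4' on some nltk versions)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["playing", "played", "studies"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
# stems such as "studi" need not be dictionary words, while lemmas such as "study" always are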
A more effective way to proceed is to first lemmatise and then stem, but stemming alone is also fine for some problem statements. In this problem statement we are not going to lemmatise.
Converting Numbers
When a user gives a query such as “100 dollars” or “hundred dollars”, both search terms mean the same thing to the user, but our IR model treats them separately, since we store 100, dollar, and hundred as different tokens. So, to make our IR model a little better, we need to convert 100 to hundred. To achieve this we are going to use a library called num2words.
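A minimal sketch of that conversion, assuming the num2words package (installable with pip install num2words):
from num2words import num2words

print(num2words(100))   # 'one hundred'
print(num2words(42))    # 'forty-two'
The converted text can then be tokenized and cleaned like any other words in the pipeline.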
(To be continued. Original article: https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089)