import numpy as np
# Install module to acquire text from PDF image files.
# Uncomment the command below to install pdfminer the first time:
# pip install pdfminer.six
from pdfminer.high_level import extract_text
import os
import pandas as pd

n = 0
docs = []
for root, dirs, files in os.walk("/Users/mike/DSC478/Project/CIAUFOCD-FULL-CONVERTED"):
    for file in files:
        # Extract the raw text, then strip line breaks and OCR noise characters
        text = extract_text(os.path.join(root, file))
        text = text.replace('\n', '')
        bad_chars = [';', ':', '!', '*', '-', '.', '~', '/', ',']
        for i in bad_chars:
            text = text.replace(i, '')
        docs.append(text)
        n += 1
The code block above uses pdfminer to walk the local directory, read each PDF, and extract its readable text. Because the documents are decades old and were scanned without much care, the extraction picks up many unreadable characters and introduces its own when it misreads one. To clean up the extraction, line breaks and "bad characters" are replaced with blanks to produce a more readable string representation of each document.
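As a side note, the same character stripping can be done in a single pass with str.translate; this is just an equivalent sketch (the names here are illustrative), not a change to the pipeline above.

# Sketch: one-pass removal of the same bad characters via a translation table
cleanup_table = str.maketrans('', '', ';:!*\n-.~/,')

def clean(text):
    # Deletes every character listed in the table in one pass over the string
    return text.translate(cleanup_table)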
#Sample of an extracted PDF
docs[1]
This is a sample of what one of the PDF documents looks like after it has been extracted and cleaned by the code snippet above.
from sklearn.feature_extraction.text import CountVectorizer

documents = docs
vec = CountVectorizer()
x = vec.fit_transform(documents)
# Note: on scikit-learn >= 1.2, use vec.get_feature_names_out() instead
dt = pd.DataFrame(x.toarray(), columns=vec.get_feature_names())
dt
As you can see above, we start with 18,123 terms, many of which are meaningless. To further clean and reduce our term features, we'll begin by removing terms that start with a zero: these are string representations of integers we have no need for, and dropping only zero-leading tokens means we don't lose any dates.
# Drop every term that starts with '0' (materialize the list first so we
# aren't deleting columns from dt while iterating over them)
for name in [c for c in dt.columns if c.startswith('0')]:
    del dt[name]
dt
We are still left with 17,850 terms, many of which add little value to our analysis because they aren't real words. To cut down on these, we'll remove term features whose first two characters are the same letter repeated (almost always OCR noise) and see how much further we can reduce our feature set.
# Drop terms whose first two characters are the same letter repeated
# (e.g. 'aaa', 'lll'), which are almost always OCR artifacts
for item in [c for c in dt.columns if len(c) > 1 and c[0] == c[1]]:
    del dt[item]
dt
We were only able to reduce our term feature set by roughly 300 terms. To filter out the vast majority of the noise, we'll turn to the PyEnchant Python package, which provides a spellchecking library with an English dictionary that we can use to keep only "real English words" in our term feature set.
import enchant

d = enchant.Dict("en_US")
# Keep only terms that the en_US dictionary recognizes as real words
for item in [c for c in dt.columns if not d.check(c)]:
    del dt[item]
dt
We were able to reduce our term features by about 10,000 and, as you can see in the output above, the remaining terms are much more interpretable. This further reduction should help combat the high variance we assume is present in the dataset, which in turn should help prevent some overfitting in our models.
#checking that all PDFs were loaded
len(docs)
Due to the RAM limitations of our machines, we had to reduce the number of documents in our dataset from 2,780 to 670 in order to cluster and classify the documents successfully and efficiently.
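Had we needed all 2,780 documents, one likely workaround (a sketch, not something we ran) is to keep the document-term matrix sparse rather than densifying it with toarray(); sklearn's TfidfVectorizer returns a scipy CSR matrix that stores only the nonzero entries:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: a sparse TFxIDF matrix uses far less RAM than the dense dataframe
# because only nonzero counts are stored
tfidf_vec = TfidfVectorizer()
x_sparse = tfidf_vec.fit_transform(docs)  # scipy.sparse CSR; no .toarray()
print(x_sparse.shape, x_sparse.nnz)       # dimensions and nonzero entry count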
#Number of docs
numDocs = dt.shape[0]
#Number of terms
numTerms = dt.shape[1]
print(numTerms)
print(numDocs)
Our full dataset consists of 7,681 terms and 670 documents.
#Document frequency per term (number of documents each term appears in)
td = dt.T
DocFreq = pd.DataFrame([(td != 0).sum(1)]).T
DocFreq
DocFreq.sort_values(by = 0, ascending=False)
Sorting our document frequency dataframe surfaces the five most common and five least common terms (the head and tail of the sorted output).
# Creating a matrix with all entries = numDocs
NMatrix=np.ones(np.shape(td), dtype=float)*numDocs
np.set_printoptions(precision=2,suppress=True,linewidth=120)
print(NMatrix)
# Convert each entry into IDF values
IDF = np.log2(np.divide(NMatrix, np.array(DocFreq)))
np.set_printoptions(precision=2,suppress=True)
print(IDF)
# Computing the TFxIDF values for each document-term entry
TD_tfidf = td * IDF
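Each entry of this matrix is the standard TFxIDF weight: for term $t$ and document $d$,

$$w_{t,d} = \mathrm{tf}_{t,d} \times \log_2\frac{N}{\mathrm{df}_t}$$

where $\mathrm{tf}_{t,d}$ is the raw count of $t$ in $d$, $N$ is numDocs, and $\mathrm{df}_t$ is the document frequency computed in DocFreq above.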
#Viewing our TFxIDF matrix
pd.set_option("display.precision", 2)
TD_tfidf
#Creating a list of our terms
index = TD_tfidf.index
lst = list(index)
#DocxTerm TFxIDF matrix
DT_tfidf = TD_tfidf.T
DT_tfidf.shape
#Importing the KMeans module from sklearn to run our clustering
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, max_iter=300, verbose=1) # initialization
A cluster count of 5 was chosen after an initial run with 10 clusters did not offer much interpretability. Reducing to 5 clusters increased the explainability of each cluster, as you will see below.
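One way to make this choice more systematic (a sketch we did not run, using the same sklearn API) is to sweep several values of k and compare average silhouette scores:

from sklearn.cluster import KMeans
from sklearn import metrics

# Sketch: average silhouette score for each candidate number of clusters
for k in range(2, 11):
    candidate = KMeans(n_clusters=k, max_iter=300).fit_predict(DT_tfidf)
    print(k, round(metrics.silhouette_score(DT_tfidf, candidate), 3))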
kmeans.fit(DT_tfidf)
clusters = kmeans.predict(DT_tfidf)
labels = pd.DataFrame(clusters, columns=["Cluster"])
labels
A quick view of the document clustering is shown above. For example, the first 5 documents were clustered into Cluster 4 while the last 5 documents were clustered into Cluster 1.
centroids = pd.DataFrame(kmeans.cluster_centers_, columns = lst)
centroids.head(10)
centroidAbs = centroids.abs()
centroidAbs.sort_values(by = 4, axis=1, ascending=False)
We can sort the centroid matrix by one cluster's coordinates to interpret the terms most strongly associated with that cluster. Cluster 4 (the row we sorted by) is shown above and seems to include documents that reference words beginning with "gu". We can also see that the terms weighted farthest from this cluster's center all have a scientific subject (science, academy, space, mars), perhaps suggesting that a large majority of the CIA documents on UFOs relate to scientific information.
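A more direct way to read each cluster (a sketch built on the centroids dataframe above) is to print its highest-weighted terms:

# Sketch: the 10 highest-weighted terms in each cluster centroid
for c in range(centroids.shape[0]):
    top_terms = centroids.loc[c].sort_values(ascending=False).head(10)
    print("Cluster", c, ":", list(top_terms.index))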
def cluster_sizes(clusters):
    # clusters is an array of cluster labels for each instance in the data
    size = {}
    cluster_labels = np.unique(clusters)
    for c in cluster_labels:
        size[c] = len(DT_tfidf[clusters == c])
    return size

size = cluster_sizes(clusters)
for c in size.keys():
    print("Size of Cluster", c, "= ", size[c])
We can use the cluster_sizes() function above to view how many documents landed in each cluster. As with our earlier 10-cluster run, the large majority of our documents (92%) fall into a single cluster, Cluster 4.
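The same counts are available in one line with numpy, if we ever want to skip the helper:

# Equivalent one-liner: each cluster label paired with its document count
print(dict(zip(*np.unique(clusters, return_counts=True))))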
from sklearn import metrics
silhouettes = metrics.silhouette_samples(DT_tfidf, clusters)
print(silhouettes[:20])
print (silhouettes.mean())
Silhouette values range from -1 to 1. An average value of .78 means our documents are actually fairly well matched to their own clusters and separated from the other clusters, even though the majority of the documents were put into one cluster (Cluster 3).
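For reference, the silhouette value of a document $i$ is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the mean distance from $i$ to the other documents in its own cluster and $b(i)$ is the mean distance from $i$ to the documents in the nearest other cluster; values near 1 therefore indicate tight, well-separated assignments.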
import pylab as pl

def plot_silhouettes(data, clusters, metric='euclidean'):
    from matplotlib import cm
    from sklearn.metrics import silhouette_samples

    cluster_labels = np.unique(clusters)
    n_clusters = cluster_labels.shape[0]
    # Use the metric passed in rather than hardcoding 'euclidean'
    silhouette_vals = silhouette_samples(data, clusters, metric=metric)
    c_ax_lower, c_ax_upper = 0, 0
    cticks = []
    for i, k in enumerate(cluster_labels):
        c_silhouette_vals = silhouette_vals[clusters == k]
        c_silhouette_vals.sort()
        c_ax_upper += len(c_silhouette_vals)
        color = cm.jet(float(i) / n_clusters)
        pl.barh(range(c_ax_lower, c_ax_upper), c_silhouette_vals, height=1.0,
                edgecolor='none', color=color)
        cticks.append((c_ax_lower + c_ax_upper) / 2)
        c_ax_lower += len(c_silhouette_vals)
    silhouette_avg = np.mean(silhouette_vals)
    pl.axvline(silhouette_avg, color="red", linestyle="--")
    pl.yticks(cticks, cluster_labels)
    pl.ylabel('Cluster')
    pl.xlabel('Silhouette coefficient')
    pl.tight_layout()
    #pl.savefig('images/11_04.png', dpi=300)
    pl.show()
    return

plot_silhouettes(DT_tfidf, clusters)
We can use the function above to view the silhouette values of each cluster. Cluster 0 has the value closest to zero (its documents are not as well matched to one another), while the other four clusters all have silhouette values around the .78 average, suggesting the documents in those clusters are well matched.
We'll first build a classifier model using the TFxIDF matrix and compare this to classifying on the PCA matrix.
#Creating training and testing data (80/20 train/test split)
from sklearn import neighbors
from sklearn.model_selection import train_test_split
train, test, target_train, target_test = train_test_split(DT_tfidf, labels, test_size=.2, random_state=5)
print(test.shape)
#Using the sklearn neighbors module to classify
n_neighbors = 5
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knnclf.fit(train, np.ravel(target_train))  # ravel to pass labels as a 1-D array
knnpreds_test = knnclf.predict(test)
print(knnpreds_test)
from sklearn.metrics import classification_report
print(classification_report(target_test, knnpreds_test))
print(knnclf.score(test, target_test))
print(knnclf.score(train, target_train))
We don't like seeing a perfect accuracy score from our classifier. This is most likely because the model is highly overfit due to the amount of noise in our dataset from filler and stop words, which also affected the cluster labels: the classifier appears to be biased by the same things that influenced the clustering. We need to further reduce the term feature space to combat this.
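One concrete reduction we could try (a sketch of a step not taken above; the min_df value is an assumption) is to drop English stop words at vectorization time, which CountVectorizer supports directly:

from sklearn.feature_extraction.text import CountVectorizer

# Sketch: rebuild the document-term matrix without English stop words;
# min_df=2 (an assumed threshold) also drops terms seen in only one document
vec_nostop = CountVectorizer(stop_words='english', min_df=2)
x_nostop = vec_nostop.fit_transform(documents)
print(x_nostop.shape)  # should be well under the original 18,123 terms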
Because our term feature space is so large and noisy, we will attempt to reduce its dimensionality by computing principal components of the dataset. We'll cluster in this reduced space and then build a classification model to predict the cluster labels we find.
#Using sklearn decomposition to compute principal components.
from sklearn import decomposition
pca = decomposition.PCA(n_components=10)
DTtrans = pca.fit(DT_tfidf).transform(DT_tfidf)
print(pca.explained_variance_ratio_)
The first 4 PCs explain 83% of the variance in our data.
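We can read that figure directly from the ratios printed above:

# Cumulative explained variance across the 10 fitted components
print(np.cumsum(pca.explained_variance_ratio_))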
kmeans = KMeans(n_clusters=5, max_iter=300, verbose=1) # initialization
kmeans.fit(DTtrans)
clusters = kmeans.predict(DTtrans)
labels = pd.DataFrame(clusters, columns=["Cluster"])
labels
centroids = pd.DataFrame(kmeans.cluster_centers_)
centroids.sort_values(by = 3, axis=1, ascending=False)
We can't read as much from the cluster centroids on the PCA matrix as we could from the TFxIDF matrix, because the exact terms underlying each principal component are no longer directly visible.
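We can partially recover that interpretability (a sketch) by inspecting the term loadings in pca.components_, which map each PC back onto the original terms in lst:

# Sketch: the terms with the largest absolute loading on each principal component
for i, component in enumerate(pca.components_):
    top_idx = np.argsort(np.abs(component))[::-1][:10]
    print("PC", i, ":", [lst[j] for j in top_idx])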
def cluster_sizes(clusters):
    # clusters is an array of cluster labels for each instance in the data
    size = {}
    cluster_labels = np.unique(clusters)
    for c in cluster_labels:
        size[c] = len(DTtrans[clusters == c])
    return size

size = cluster_sizes(clusters)
for c in size.keys():
    print("Size of Cluster", c, "= ", size[c])
Clustering on the lower-dimensional space of the principal components put exactly the same number of documents into each of the 5 clusters as clustering on the TFxIDF matrix did. As previously stated, there is probably still too much noise in our term feature set.
silhouettes = metrics.silhouette_samples(DTtrans, clusters)
print(silhouettes[:20])
print(silhouettes.mean())
We get a higher average silhouette score when clustering on our PCs. Clustering on the TFxIDF matrix yielded a .78 silhouette score, so on the PCs our documents are actually better matched to their own clusters and better separated from the other clusters, even though each of the five clusters holds the same number of documents as in the TFxIDF clustering.
plot_silhouettes(DTtrans, clusters)
#Creating training and testing data
from sklearn import neighbors
from sklearn.model_selection import train_test_split
train, test, target_train, target_test = train_test_split(DTtrans, labels, test_size=.2, random_state=5)
print(test.shape)
n_neighbors = 5
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knnclf.fit(train, np.ravel(target_train))  # ravel to pass labels as a 1-D array
knnpreds_test = knnclf.predict(test)
print(classification_report(target_test, knnpreds_test))
print(knnclf.score(test, target_test))
print(knnclf.score(train, target_train))
Again, we obtain perfect accuracy from our classifier, which is not what we want to see. Our model is overfitting the data severely, and we need to reduce our feature dimensions further to fight the overfitting we continue to see.
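Before reducing further, a quick cross-validation run (a sketch, not something we executed above) would confirm whether the perfect scores persist across folds or are an artifact of this particular split:

from sklearn.model_selection import cross_val_score

# Sketch: 5-fold cross-validated accuracy of the same kNN configuration
knn = neighbors.KNeighborsClassifier(5, weights='distance')
print(cross_val_score(knn, DTtrans, np.ravel(labels), cv=5))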