Internal TopEx methods

These methods are intended for internal use only. TopEx users should only need to import the core module in order to fully use every feature of the TopEx library.

Data processing

score_phrase

score_phrase(phrase:list, score:float, include_sentiment:bool)

Gets the numerical polarity of a list of tokens.

Returns float

score_token

score_token(token:str, pos:str, doc_id:int, vocab:dict, tfidf:ndarray, max_token_scores:ndarray, tfidf_corpus:str, include_sentiment:bool)

Calculates the importance of a single token.

Returns float

get_phrase

get_phrase(sent:Series, window_size:int, vocab:dict, tfidf_corpus:str, tfidf:ndarray, max_token_scores:ndarray, include_sentiment:bool)

Finds the most expressive phrase in a sentence. This function is called in a lambda expression in core.get_phrases. Passing include_sentiment=False will weight all tokens equally, ignoring sentiment and part of speech.

Returns list
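The windowed search described above can be sketched as follows. This is a minimal illustration, not the TopEx implementation: the helper name `best_window`, the `weights` table, and the tie-breaking rule are all assumptions, and the per-token score stands in for whatever score_token computes.

```python
from typing import Callable, List, Sequence


def best_window(tokens: Sequence[str], window_size: int,
                score: Callable[[str], float]) -> List[str]:
    """Return the contiguous run of `window_size` tokens with the highest total score."""
    if len(tokens) <= window_size:
        return list(tokens)
    best_start, best_total = 0, float("-inf")
    for start in range(len(tokens) - window_size + 1):
        total = sum(score(tok) for tok in tokens[start:start + window_size])
        if total > best_total:  # strict comparison keeps the earliest window on ties
            best_start, best_total = start, total
    return list(tokens[best_start:best_start + window_size])


# Hypothetical per-token weights, standing in for TF-IDF and sentiment scores.
weights = {"sharp": 2.0, "pain": 3.0, "chest": 2.5}
sentence = ["i", "felt", "a", "sharp", "pain", "in", "my", "chest"]

phrase = best_window(sentence, 3, lambda tok: weights.get(tok, 0.0))

# Scoring every token equally (comparable in spirit to include_sentiment=False)
# makes all windows tie, so the first window is returned.
flat_phrase = best_window(sentence, 3, lambda tok: 1.0)
```

With equal weights every window scores the same, which is why a deterministic tie-breaking rule matters in a sketch like this one.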

get_vector_tfidf

get_vector_tfidf(sent:Series, dictionary:Dictionary, term_matrix:ndarray)

Creates a word vector for a given sentence using a term matrix. This function is called in a lambda expression in core.get_vectors.

Returns list

get_vector_w2v

get_vector_w2v(sent:Series, model:Word2VecKeyedVectors)

Creates a word vector for a given sentence using a Word2Vec model. This function is called in a lambda expression in core.get_vectors.

Returns list
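One common way to turn token embeddings into a single sentence vector is to average them. The sketch below assumes that approach; the source does not say how TopEx combines token vectors. The helper `average_vector` is hypothetical, and a plain dict stands in for the Word2Vec model (gensim keyed vectors support the same `in` and `[]` lookups).

```python
import numpy as np


def average_vector(tokens, vectors, dim):
    """Average the vectors of all tokens present in `vectors`.

    Tokens missing from the vocabulary are skipped; if none are known,
    a zero vector of length `dim` is returned (an assumption of this sketch).
    """
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return [0.0] * dim
    return np.mean(known, axis=0).tolist()


# Toy two-dimensional embeddings; real models use hundreds of dimensions.
toy_vectors = {
    "chest": np.array([1.0, 0.0]),
    "pain": np.array([0.0, 1.0]),
}

vec = average_vector(["chest", "pain", "today"], toy_vectors, dim=2)
empty = average_vector(["today"], toy_vectors, dim=2)
```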

w2v_pretrained

w2v_pretrained(bin_file:str)

Load a pre-trained Word2Vec model from a bin file.

Returns gensim.models.keyedvectors.Word2VecKeyedVectors

Clustering

get_cluster_assignments_hac

get_cluster_assignments_hac(linkage_matrix:ndarray, height:int)

Assigns clusters by cutting the HAC dendrogram at the specified height.

Returns list
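Cutting a dendrogram at a fixed height can be illustrated with SciPy. This sketch assumes the height is a linkage distance and uses `fcluster` with `criterion="distance"`; TopEx may interpret `height` differently or use another SciPy routine, so treat the call below as one plausible reading rather than the library's method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two tight pairs of points that are far apart from each other.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# Build the HAC linkage matrix (average linkage is an arbitrary choice here).
linkage_matrix = linkage(points, method="average", metric="euclidean")

# Points whose merge distance is at most 2.0 end up in the same cluster.
labels = fcluster(linkage_matrix, t=2.0, criterion="distance").tolist()

# Cutting above the final merge distance collapses everything into one cluster.
single = fcluster(linkage_matrix, t=20.0, criterion="distance").tolist()
```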

get_silhouette_score_hac

get_silhouette_score_hac(phrase_vecs:list, linkage_matrix:ndarray, height:int)

Assigns clusters to a list of word vectors for a given height and calculates the silhouette score of the clustering.

Returns float

get_tree_height

get_tree_height(root:ClusterNode)

Recursively finds the height of a binary tree.

Returns int
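A recursive height computation over SciPy's `ClusterNode` tree might look like the sketch below. The function name `tree_height` and the convention that a leaf has height 0 are assumptions; the TopEx function may count levels differently.

```python
import numpy as np
from scipy.cluster.hierarchy import ClusterNode, linkage, to_tree


def tree_height(node: ClusterNode) -> int:
    """Number of edges on the longest path from `node` down to a leaf."""
    if node is None or node.is_leaf():
        return 0
    return 1 + max(tree_height(node.get_left()), tree_height(node.get_right()))


# Single linkage on these 1-D points merges them one at a time,
# producing a chain-shaped tree: ((0, 1), 5), 20.
points = np.array([[0.0], [1.0], [5.0], [20.0]])
root = to_tree(linkage(points, method="single"))

height = tree_height(root)
```

For very deep, chain-like dendrograms a recursive walk can approach Python's default recursion limit, so an iterative traversal may be preferable on large inputs.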

get_optimal_height

get_optimal_height(data:DataFrame, linkage_matrix:ndarray, max_h:int, show_chart:bool=True, save_chart:bool=False, chart_file:str='HACSilhouette.png')

Clusters the top phrase vectors and plots the silhouette coefficients for a range of dendrogram heights. Returns the optimal height value (the one with the highest silhouette coefficient).

Returns int
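The height search can be sketched as a scan over candidate cuts, scoring each with scikit-learn's `silhouette_score` and keeping the best. The helper names, the candidate heights, and the handling of degenerate cuts are assumptions of this sketch, and the plotting step is omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score


def silhouette_at_height(vectors, linkage_matrix, height):
    """Cut the dendrogram at `height` and score the resulting clustering."""
    labels = fcluster(linkage_matrix, t=height, criterion="distance")
    n_clusters = len(set(labels))
    # silhouette_score is only defined for 2 .. n_samples - 1 clusters,
    # so degenerate cuts get the worst possible score (sketch convention).
    if n_clusters < 2 or n_clusters > len(vectors) - 1:
        return float("-inf")
    return float(silhouette_score(vectors, labels))


def best_height(vectors, linkage_matrix, candidate_heights):
    """Return the candidate height with the highest silhouette score, plus all scores."""
    scores = {h: silhouette_at_height(vectors, linkage_matrix, h)
              for h in candidate_heights}
    return max(scores, key=scores.get), scores


# Two well-separated groups of three points each.
vectors = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [10.0, 0.0], [10.0, 1.0], [11.0, 0.0]])
linkage_matrix = linkage(vectors, method="average", metric="euclidean")

height, scores = best_height(vectors, linkage_matrix, [0.5, 2.0, 20.0])
```

Here the lowest cut leaves every point in its own cluster and the highest cut merges everything, so only the middle height yields a valid score.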

get_clusters_hac

get_clusters_hac(data:DataFrame, dist_metric:str, height:int=None, show_dendrogram:bool=False, show_chart:bool=False)

Uses Hierarchical Agglomerative Clustering (HAC) to cluster phrase vectors.

Returns (list, np.ndarray, int)

get_silhouette_score_kmeans

get_silhouette_score_kmeans(phrase_vecs:list, k:int)

Assigns clusters to a list of word vectors for a given k and calculates the silhouette score of the clustering.

Returns float

get_optimal_k

get_optimal_k(data:DataFrame, show_chart:bool=True, save_chart:bool=False, chart_file:str='KmeansSilhouette.png')

Calculates the optimal k-value (the one with the highest silhouette coefficient). Optionally displays a chart of silhouette score by k-value or saves it to disk.

Returns int
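Choosing k by silhouette score can be sketched with scikit-learn as below. The helper `best_k`, the range of k values tried, and the fixed random seed are assumptions for illustration; charting is left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k(vectors, k_values, seed=0):
    """Fit K-means for each k and return the k with the highest silhouette score."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(vectors)
        scores[k] = float(silhouette_score(vectors, labels))
    return max(scores, key=scores.get), scores


# Three tight, well-separated groups of three points each.
offsets = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]])
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
vectors = np.vstack([center + offsets for center in centers])

k, scores = best_k(vectors, range(2, 6))
```

Silhouette scoring needs at least 2 clusters and fewer clusters than samples, so the range of k values has to stay within those bounds.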

get_cluster_assignments_kmeans

get_cluster_assignments_kmeans(phrase_vecs:list, k:int)

Assigns each phrase vector to one of k clusters using the K-means algorithm.

Returns list

get_clusters_kmeans

get_clusters_kmeans(data:DataFrame, k:int=None, show_chart:bool=False)

Uses the K-means algorithm to cluster phrase vectors.

Returns list

get_topics_from_docs

get_topics_from_docs(docs:list, topic_count:int)

Gets a list of topic_count topics for each list of tokens in docs.

Returns list

Export

df_to_disk

df_to_disk(df:DataFrame, file_name:str, mode:str='w', header:bool=True, sep='\t')

Writes a dataframe to disk as a delimited text file (tab-delimited by default).

Returns None

sentences_to_disk

sentences_to_disk(data:DataFrame, file_name:str='output/DocumentSentenceList.txt')

Writes the raw sentences to a file organized by document and sentence number.

Returns None

write_cluster

write_cluster(cluster_rows:DataFrame, file_name:str, mode:str='a', header:bool=False)

Appends the rows for a single cluster to disk.

Returns None
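The write-then-append pattern implied by the `mode` and `header` defaults of df_to_disk and write_cluster can be sketched with pandas `to_csv`. The column names, the temporary output path, and the choice of `index=False` are assumptions of this example, not details confirmed by the source.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cluster results; the real TopEx columns may differ.
clusters = pd.DataFrame({
    "cluster": [0, 0, 1],
    "phrase": ["sharp chest pain", "chest pain", "mild headache"],
})

out_path = os.path.join(tempfile.mkdtemp(), "clusters.txt")

# Write only the header row first (mode "w" truncates any existing file).
clusters.head(0).to_csv(out_path, sep="\t", mode="w", header=True, index=False)

# Append each cluster's rows without repeating the header.
for _, rows in clusters.groupby("cluster"):
    rows.to_csv(out_path, sep="\t", mode="a", header=False, index=False)

# Read the file back to confirm that it parses as one table.
round_trip = pd.read_csv(out_path, sep="\t")
```

Appending with `header=False` keeps the file a single well-formed table even when clusters are written one at a time.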

clusters_to_disk

clusters_to_disk(data:DataFrame, doc_df:DataFrame, cluster_df:DataFrame, file_name:str='output/TopicClusterResults.txt')

Writes the sentences and phrases to a file organized by cluster and document.

Returns None