TopEx core API

import_data[source]

import_data(raw_docs:DataFrame, save_results:bool=False, file_name:str=None, stop_words_file:str=None, stop_words_list:list=None, custom_stopwords_only:bool=False, ner:bool=False)

Imports and pre-processes the documents in the raw_docs DataFrame.

Document pre-processing is handled in tokenize_and_stem.

Returns (DataFrame, DataFrame)
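A minimal sketch of the input import_data expects. The "doc_name"/"text" column names below mirror the format documented for import_from_csv and are an assumption for import_data; the commented call is illustrative only:

```python
import pandas as pd

# Hypothetical raw_docs input: one row per document, mirroring the
# "doc_name"/"text" columns used by import_from_csv.
raw_docs = pd.DataFrame({
    "doc_name": ["note_001", "note_002"],
    "text": ["Patient reports chest pain.",
             "Follow-up visit for hypertension."],
})

# data_df, doc_df = import_data(raw_docs, stop_words_list=["patient"])
```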

import_from_files[source]

import_from_files(path_to_file_list:str, save_results:bool=False, file_name:str='output/DocumentSentenceList.txt', stop_words_file:str=None, stop_words_list:list=None, custom_stopwords_only:bool=False, ner:bool=False)

Imports and pre-processes the documents listed in path_to_file_list, a text file containing the paths of the files to be processed, one per line.

Returns (DataFrame, DataFrame)

import_from_csv[source]

import_from_csv(path_to_csv:str, save_results:bool=False, file_name:str='output/DocumentSentenceList.txt', stop_words_file:str=None, stop_words_list:list=None, custom_stopwords_only:bool=False, ner:bool=False)

Imports and pre-processes documents from a pipe-delimited CSV file. The file should be formatted with two columns: "doc_name" and "text".

Returns (DataFrame, DataFrame)
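A sketch of the pipe-delimited, two-column format import_from_csv expects, round-tripped through pandas (in practice you would write to a file path and pass that path to import_from_csv):

```python
import io
import pandas as pd

# Build the two-column ("doc_name", "text") table and write it pipe-delimited.
docs = pd.DataFrame({
    "doc_name": ["doc1", "doc2"],
    "text": ["First document text.", "Second document text."],
})
buf = io.StringIO()
docs.to_csv(buf, sep="|", index=False)

# Read it back the same way import_from_csv would parse it.
buf.seek(0)
roundtrip = pd.read_csv(buf, sep="|")
```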

create_tfidf[source]

create_tfidf(tfidf_corpus:str='both', doc_df:DataFrame=None, path_to_expansion_file_list:str=None, path_to_expansion_csv:str=None, expansion_df:DataFrame=None)

Creates a dense TF-IDF matrix from the tokens in some combination of the clustering corpus and/or expansion corpus. This combination is determined by tfidf_corpus, which has possible values ('both', 'clustering', 'expansion').

path_to_expansion_file_list is a path to a text file containing a list of files with sentences corresponding to known topics. Use path_to_expansion_csv if you would prefer to load all expansion documents from a single, pipe-delimited CSV file. If doc_df is passed, the input corpus will be used along with the expansion documents to generate the TF-IDF matrix.

Returns (numpy.ndarray, gensim.corpora.dictionary.Dictionary)
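To illustrate what a dense TF-IDF matrix looks like (this is a conceptual sketch using scikit-learn, not TopEx's implementation, and vocabulary_ here plays the role of Dictionary.token2id):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "chest pain and shortness of breath",
    "follow up visit for chest pain",
    "routine hypertension follow up",
]
vec = TfidfVectorizer()
tfidf_dense = vec.fit_transform(sentences).toarray()  # (n_sentences, n_terms)
vocab = vec.vocabulary_  # token -> column index, analogous to Dictionary.token2id
```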

get_phrases[source]

get_phrases(data:DataFrame, vocab:dict, tfidf:ndarray, window_size:int=6, tfidf_corpus:str='clustering', include_sentiment:bool=True)

Extracts the most expressive phrase from each sentence.

vocab should be dictionary.token2id from the output of create_tfidf. window_size is the length of the phrase extracted; if -1 is passed, all tokens will be included (IMPORTANT: this option requires aggregating vectors in the next step). When tfidf_corpus='clustering', token scores are calculated using the TF-IDF matrix; otherwise, token scores are calculated using max_token_scores (the maximum score for each token across all documents). When include_sentiment is False, sentiment and token part of speech are ignored when scoring phrases.

Returns DataFrame
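The window_size idea can be sketched as picking the contiguous run of tokens with the highest total score; best_window below is a hypothetical helper for illustration, not TopEx's scoring logic:

```python
def best_window(tokens, scores, window_size=6):
    """Pick the contiguous window of tokens with the highest total score.

    Illustrative only: window_size=-1 keeps the whole sentence, mirroring
    the documented behavior of get_phrases.
    """
    if window_size == -1 or window_size >= len(tokens):
        return tokens
    best_start = max(
        range(len(tokens) - window_size + 1),
        key=lambda i: sum(scores[i:i + window_size]),
    )
    return tokens[best_start:best_start + window_size]

tokens = ["the", "patient", "denies", "chest", "pain", "today"]
scores = [0.0, 0.6, 0.4, 0.9, 0.8, 0.1]
phrase = best_window(tokens, scores, window_size=3)
```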

get_vectors[source]

get_vectors(method:str, data:DataFrame, dictionary:Dictionary=None, tfidf:ndarray=None, dimensions:int=2, umap_neighbors:int=15, path_to_w2v_bin_file:str=None, doc_df:DataFrame=None)

Creates a word vector for each phrase in the dataframe.

Options for method are ('tfidf', 'svd', 'umap', 'pretrained', 'local'). tfidf and dictionary are output from create_tfidf. dimensions is the number of dimensions to which SVD or UMAP reduce the TF-IDF matrix. path_to_w2v_bin_file is the path to a pretrained Word2Vec .bin file.

Returns DataFrame
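As a sketch of what the 'svd' option does conceptually (a truncated SVD of the TF-IDF matrix down to `dimensions` columns; the random matrix is a stand-in for real data):

```python
import numpy as np

# Stand-in for a dense (sentences x terms) TF-IDF matrix.
rng = np.random.default_rng(0)
tfidf = rng.random((10, 40))

# Truncated SVD: keep the top `dimensions` singular directions,
# giving one low-dimensional vector per sentence.
dimensions = 2
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
vectors = U[:, :dimensions] * S[:dimensions]
```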

assign_clusters[source]

assign_clusters(data:DataFrame, method:str='kmeans', dist_metric:str='euclidean', k:int=None, height:int=None, show_chart:bool=False, show_dendrogram:bool=False)

Clusters the sentences using phrase vectors.

Options for method are ('kmeans', 'hac'). Options for dist_metric are ('cosine' or anything accepted by sklearn.metrics.pairwise_distances). k is the number of clusters for K-means clustering. height is the height at which the HAC dendrogram should be cut. When show_chart is True, the chart of silhouette scores by possible k or height is shown inline. When show_dendrogram is True, the HAC dendrogram is shown inline.

Returns (DataFrame, np.ndarray, int, int)
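A toy illustration of the 'kmeans' option on synthetic phrase vectors (scikit-learn stands in for whatever TopEx uses internally; two well-separated blobs should fall into two clusters):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of toy "phrase vectors".
rng = np.random.default_rng(0)
vecs = np.vstack([rng.normal(0, 0.1, (5, 2)),
                  rng.normal(3, 0.1, (5, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vecs)
```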

reassign_hac_clusters[source]

reassign_hac_clusters(linkage_matrix:ndarray, height:int)

Reassigns HAC clusters using a different height.
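Re-cutting a dendrogram at a new height can be sketched with SciPy's fcluster, assuming the linkage matrix follows the scipy.cluster.hierarchy format (this illustrates the idea of reusing one linkage matrix at several heights, not TopEx's code):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two separated blobs; build the linkage matrix once.
rng = np.random.default_rng(1)
vecs = np.vstack([rng.normal(0, 0.1, (5, 2)),
                  rng.normal(4, 0.1, (5, 2))])
Z = linkage(vecs, method="ward")

# Cut the same dendrogram at two different heights.
loose = fcluster(Z, t=20.0, criterion="distance")  # high cut: one cluster
tight = fcluster(Z, t=0.05, criterion="distance")  # low cut: many clusters
```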

reassign_kmeans_clusters[source]

reassign_kmeans_clusters(phrase_vecs:list, k:int)

Reassigns K-means clusters using a different number of clusters.

visualize_clustering[source]

visualize_clustering(data:DataFrame, method:str='umap', dist_metric:str='cosine', umap_neighbors:int=15, show_chart=True, save_chart=False, return_data=False, chart_file='output/cluster_visualization.html')

Visualize clustering in two dimensions.

Options for method are ('umap', 'tsne', 'mds', 'svd'). Options for dist_metric are ('cosine' or anything accepted by sklearn.metrics.pairwise_distances). When show_chart is True, the visualization is shown inline. When save_chart is True, the visualization is saved to chart_file.

Returns DataFrame
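The 2-D projection behind the visualization can be sketched with scikit-learn's MDS (a stand-in for the method='mds' option; the random matrix substitutes for real phrase vectors):

```python
import numpy as np
from sklearn.manifold import MDS

# Stand-in for higher-dimensional phrase vectors.
rng = np.random.default_rng(0)
phrase_vecs = rng.random((8, 5))

# Project to two dimensions for plotting.
coords = MDS(n_components=2, random_state=0).fit_transform(phrase_vecs)
```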

visualize_df[source]

visualize_df(vis_df:DataFrame, cluster_df:DataFrame=None, show_chart=True, save_chart=False, min_cluster_size=0, chart_file='output/cluster_visualization.html')

Visualize clustering in two dimensions. This method takes in the vis_df produced by visualize_clustering. The cluster column can be updated to dynamically try different clustering thresholds without the overhead of recomputing the (x, y) coordinates.

min_cluster_size is the minimum number of points for a cluster to be displayed. When show_chart is True, the visualization is shown inline. When save_chart is True, the visualization is saved to chart_file.

get_cluster_topics[source]

get_cluster_topics(data:DataFrame, doc_df:DataFrame=None, topics_per_cluster:int=10, save_results:bool=False, file_name:str='output/TopicClusterResults.txt')

Gets the main topics for each cluster.

topics_per_cluster is the number of main topics per cluster. When save_results is True, the resulting dataframe will be saved to file_name.

Returns DataFrame
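One simple way to think about per-cluster topics is the most frequent tokens among a cluster's phrases; the Counter-based sketch below is illustrative and not TopEx's topic extraction:

```python
from collections import Counter

# Toy phrases grouped by cluster label.
cluster_phrases = {
    0: [["chest", "pain"], ["chest", "pressure"]],
    1: [["renal", "function"], ["renal", "failure"]],
}

topics_per_cluster = 1
topics = {
    c: [w for w, _ in
        Counter(t for p in phrases for t in p).most_common(topics_per_cluster)]
    for c, phrases in cluster_phrases.items()
}
```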

recluster[source]

recluster(data:DataFrame, viz_df:DataFrame, cluster_method:str, linkage_matrix:ndarray=None, height:int=None, k:int=None, min_cluster_size:int=None, topics_per_cluster:int=10, show_chart=True)

Recomputes clusters with a new threshold using the output of a previous clustering.

Returns (DataFrame, DataFrame)

get_doc_topics[source]

get_doc_topics(doc_df:DataFrame, topics_per_doc:int=10, save_results:bool=False, file_name:str='output/TopicDocumentResults.txt')

Gets the main topics for each document.

topics_per_doc is the number of topics extracted per document. When save_results is True, the resulting dataframe will be saved to file_name.

Returns DataFrame

evaluate[source]

evaluate(data, gold_file, save_results=False, file_name='output/EvaluationResults.txt')

Evaluate precision, recall, and F1 against a gold standard dataset.

gold_file is a path to a text file containing a list of IDs and labels. When save_results is True, the resulting dataframe will be saved to file_name.

Returns DataFrame
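The comparison evaluate performs can be sketched per label with standard precision/recall/F1 definitions; prf1 below is an illustrative stand-in, and the gold/pred dictionaries are toy data:

```python
def prf1(gold, pred, label):
    """Precision, recall, and F1 for one label (illustrative only)."""
    tp = sum(1 for k in gold if gold[k] == label and pred.get(k) == label)
    fp = sum(1 for k in pred if pred[k] == label and gold.get(k) != label)
    fn = sum(1 for k in gold if gold[k] == label and pred.get(k) != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {"s1": "cardiac", "s2": "renal", "s3": "cardiac", "s4": "renal"}
pred = {"s1": "cardiac", "s2": "cardiac", "s3": "cardiac", "s4": "renal"}
p, r, f1 = prf1(gold, pred, "cardiac")
```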