import_data
[source]
import_data(raw_docs:DataFrame, save_results:bool=False, file_name:str=None, stop_words_file:str=None, stop_words_list:list=None, custom_stopwords_only:bool=False, ner:bool=False)
Imports and pre-processes the documents from the raw_docs dataframe. Document pre-processing is handled in tokenize_and_stem.
Returns (DataFrame, DataFrame)
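The expected shape of raw_docs can be sketched as follows. The column names "doc_name" and "text" are taken from the import_from_csv format documented below; assuming import_data expects the same two-column layout:

```python
import pandas as pd

# Build a raw_docs DataFrame with one row per document. The
# "doc_name"/"text" column names follow the import_from_csv
# convention; treating them as the import_data format is an assumption.
raw_docs = pd.DataFrame({
    "doc_name": ["doc1", "doc2"],
    "text": [
        "Patients reported mild headaches after the first dose.",
        "Follow-up visits showed no adverse reactions.",
    ],
})

print(raw_docs.shape)  # (2, 2)
```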
import_from_files
[source]
import_from_files(path_to_file_list:str, save_results:bool=False, file_name:str='output/DocumentSentenceList.txt', stop_words_file:str=None, stop_words_list:list=None, custom_stopwords_only:bool=False, ner:bool=False)
Imports and pre-processes a list of documents contained in path_to_file_list, a path to a text file containing a list of files to be processed, separated by line breaks.
Returns (DataFrame, DataFrame)
import_from_csv
[source]
import_from_csv(path_to_csv:str, save_results:bool=False, file_name:str='output/DocumentSentenceList.txt', stop_words_file:str=None, stop_words_list:list=None, custom_stopwords_only:bool=False, ner:bool=False)
Imports and pre-processes documents from a pipe-delimited csv file. The file should be formatted with two columns: "doc_name" and "text".
Returns (DataFrame, DataFrame)
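A minimal sketch of the pipe-delimited format, read with plain pandas (writing works the same way via to_csv(sep="|")):

```python
import io
import pandas as pd

# A pipe-delimited CSV in the two-column format expected by
# import_from_csv: "doc_name" and "text".
csv_text = (
    "doc_name|text\n"
    "doc1|First document text.\n"
    "doc2|Second document text.\n"
)

# io.StringIO stands in for a file on disk.
df = pd.read_csv(io.StringIO(csv_text), sep="|")
print(df["doc_name"].tolist())  # ['doc1', 'doc2']
```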
create_tfidf
[source]
create_tfidf(tfidf_corpus:str='both', doc_df:DataFrame=None, path_to_expansion_file_list:str=None, path_to_expansion_csv:str=None, expansion_df:DataFrame=None)
Creates a dense TF-IDF matrix from the tokens in some combination of the clustering corpus and/or expansion corpus. This combination is determined by tfidf_corpus, which has possible values ('both', 'clustering', 'expansion'). path_to_expansion_file_list is a path to a text file containing a list of files with sentences corresponding to known topics. Use path_to_expansion_csv if you would prefer to load all expansion documents from a single, pipe-delimited csv file. If doc_df is passed, the input corpus will be used along with the expansion documents to generate the TF-IDF matrix.
Returns (numpy.ndarray, gensim.corpora.dictionary.Dictionary)
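The returned pair — a dense matrix plus a token-to-column mapping like Dictionary.token2id — can be illustrated with a toy pure-Python computation. The weighting here is the textbook tf * idf; TopEx's exact smoothing and normalization may differ:

```python
import math

# Illustrative only: a dense TF-IDF matrix (list of rows) and a
# token2id-style vocabulary, mirroring the (ndarray, Dictionary)
# pair that create_tfidf returns.
docs = [["cough", "fever"], ["fever", "rash"], ["cough", "cough", "rash"]]

vocab = {}                          # token -> column id
for doc in docs:
    for tok in doc:
        vocab.setdefault(tok, len(vocab))

n_docs = len(docs)
df_counts = {t: sum(t in d for d in docs) for t in vocab}

tfidf = []
for doc in docs:
    row = [0.0] * len(vocab)
    for tok in set(doc):
        tf = doc.count(tok) / len(doc)          # term frequency
        idf = math.log(n_docs / df_counts[tok])  # inverse document frequency
        row[vocab[tok]] = tf * idf
    tfidf.append(row)

print(len(tfidf), len(tfidf[0]))  # 3 documents x 3 vocabulary terms
```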
get_phrases
[source]
get_phrases(data:DataFrame, vocab:dict, tfidf:ndarray, window_size:int=6, tfidf_corpus:str='clustering', include_sentiment:bool=True)
Extracts the most expressive phrase from each sentence. vocab should be dictionary.token2id and tfidf the matrix from the output of create_tfidf. window_size is the length of the phrase extracted; if -1 is passed, all tokens will be included (IMPORTANT: this option requires aggregating vectors in the next step). When tfidf_corpus='clustering', token scores are calculated using the TF-IDF matrix; otherwise, token scores are calculated using max_token_scores (the max score for each token across all documents). When include_sentiment is False, sentiment and token part of speech are ignored when scoring phrases.
Returns DataFrame
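The window_size mechanic can be sketched with a simplified scorer. The token scores below are invented for illustration; get_phrases derives them from the TF-IDF matrix (and, by default, part of speech and sentiment):

```python
# Simplified sketch of window-based phrase extraction: slide a
# fixed-size window over the sentence tokens and keep the window
# with the highest total token score.
def best_phrase(tokens, scores, window_size):
    if window_size == -1:           # -1 keeps every token in the sentence
        return tokens
    best, best_score = None, float("-inf")
    for i in range(max(1, len(tokens) - window_size + 1)):
        window = tokens[i:i + window_size]
        total = sum(scores.get(t, 0.0) for t in window)
        if total > best_score:
            best, best_score = window, total
    return best

tokens = ["the", "severe", "chest", "pain", "subsided", "quickly"]
scores = {"severe": 0.8, "chest": 0.9, "pain": 0.7, "subsided": 0.3}
print(best_phrase(tokens, scores, 3))  # ['severe', 'chest', 'pain']
```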
get_vectors
[source]
get_vectors(method:str, data:DataFrame, dictionary:Dictionary=None, tfidf:ndarray=None, dimensions:int=2, umap_neighbors:int=15, path_to_w2v_bin_file:str=None, doc_df:DataFrame=None)
Creates a word vector for each phrase in the dataframe. Options for method are ('tfidf', 'svd', 'umap', 'pretrained', 'local'). tfidf and dictionary are output from create_tfidf. dimensions is the number of dimensions to which SVD or UMAP reduce the TF-IDF matrix. path_to_w2v_bin_file is the path to a pretrained Word2Vec .bin file.
Returns DataFrame
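For the 'svd' option, the reduction step is conceptually similar to this numpy sketch, with random data standing in for the TF-IDF matrix (TopEx's actual implementation details may differ):

```python
import numpy as np

# Truncated SVD: keep the top `dimensions` singular vectors of the
# matrix, yielding one low-dimensional vector per phrase, as in
# get_vectors(method='svd').
rng = np.random.default_rng(0)
tfidf = rng.random((10, 50))       # 10 phrases x 50 vocabulary terms
dimensions = 2

U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
reduced = U[:, :dimensions] * S[:dimensions]   # one 2-D vector per phrase

print(reduced.shape)  # (10, 2)
```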
assign_clusters
[source]
assign_clusters(data:DataFrame, method:str='kmeans', dist_metric:str='euclidean', k:int=None, height:int=None, show_chart:bool=False, show_dendrogram:bool=False)
Clusters the sentences using phrase vectors. Options for method are ('kmeans', 'hac'). Options for dist_metric are ('cosine' or anything accepted by sklearn.metrics.pairwise_distances). k is the number of clusters for K-means clustering. height is the height at which the HAC dendrogram should be cut. When show_chart is True, the chart of silhouette scores by possible k or height is shown inline. When show_dendrogram is True, the HAC dendrogram is shown inline.
Returns (DataFrame, np.ndarray, int, int)
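The dist_metric choice matters for TF-IDF-derived vectors: cosine distance compares phrase vectors by angle rather than magnitude. A quick plain-numpy illustration (assumed equivalent to sklearn's 'cosine' metric):

```python
import numpy as np

# Cosine distance = 1 - cosine similarity: vectors pointing the same
# way have distance ~0 regardless of their length.
def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])      # same direction as a, twice the length
c = np.array([3.0, -1.0, 0.0])     # nearly orthogonal to a

print(round(cosine_distance(a, b), 6))  # 0.0
```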
reassign_hac_clusters
[source]
reassign_hac_clusters(linkage_matrix:ndarray, height:int)
Reassigns HAC clusters using a different height.
reassign_kmeans_clusters
[source]
reassign_kmeans_clusters(phrase_vecs:list, k:int)
Reassigns clusters using a different number of clusters.
visualize_clustering
[source]
visualize_clustering(data:DataFrame, method:str='umap', dist_metric:str='cosine', umap_neighbors:int=15, show_chart=True, save_chart=False, return_data=False, chart_file='output/cluster_visualization.html')
Visualize clustering in two dimensions.
Options for method
are ('umap', 'tsne', 'mds', 'svd'). Options for dist_metric
are ('cosine' or anything accepted by
sklearn.metrics.pairwise_distances). When show_chart
is True, the visualization is shown inline.
When save_chart
is True, the visualization is saved to chart_file
.
Returns DataFrame
visualize_df
[source]
visualize_df(vis_df:DataFrame, cluster_df:DataFrame=None, show_chart=True, save_chart=False, min_cluster_size=0, chart_file='output/cluster_visualization.html')
Visualize clustering in two dimensions. This method takes in the vis_df produced by visualize_clustering. The cluster column can be updated to dynamically try different clustering thresholds without the overhead of recomputing the (x, y) coordinates.
min_cluster_size
is the minimum # of points for a cluster to be displayed.
When show_chart
is True, the visualization is shown inline.
When save_chart
is True, the visualization is saved to chart_file
.
get_cluster_topics
[source]
get_cluster_topics(data:DataFrame, doc_df:DataFrame=None, topics_per_cluster:int=10, save_results:bool=False, file_name:str='output/TopicClusterResults.txt')
Gets the main topics for each cluster.
topics_per_cluster
is the number of main topics per cluster. When save_results
is True, the resulting dataframe
will be saved to file_name
.
Returns DataFrame
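The shape of the result — the topics_per_cluster most prominent tokens per cluster — can be sketched with a simple frequency count. TopEx's actual topic scoring may differ; this only shows the shape of the output:

```python
from collections import Counter

# Hypothetical per-cluster token lists; in practice these come from
# the sentences assigned to each cluster.
cluster_tokens = {
    0: ["fever", "cough", "fever", "rash", "fever"],
    1: ["fatigue", "fatigue", "headache"],
}
topics_per_cluster = 2

# Keep the topics_per_cluster most frequent tokens in each cluster.
topics = {
    c: [t for t, _ in Counter(toks).most_common(topics_per_cluster)]
    for c, toks in cluster_tokens.items()
}
print(topics)  # {0: ['fever', 'cough'], 1: ['fatigue', 'headache']}
```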
recluster
[source]
recluster(data:DataFrame, viz_df:DataFrame, cluster_method:str, linkage_matrix:ndarray=None, height:int=None, k:int=None, min_cluster_size:int=None, topics_per_cluster:int=10, show_chart=True)
Recomputes clusters with a new threshold using the output of a previous clustering.
Returns (DataFrame, DataFrame)
get_doc_topics
[source]
get_doc_topics(doc_df:DataFrame, topics_per_doc:int=10, save_results:bool=False, file_name:str='output/TopicDocumentResults.txt')
Gets the main topics for each document.
topics_per_doc
is the number of topics extracted per document. When save_results
is True, the resulting dataframe
will be saved to file_name
.
Returns DataFrame
evaluate
[source]
evaluate(data, gold_file, save_results=False, file_name='output/EvaluationResults.txt')
Evaluate precision, recall, and F1 against a gold standard dataset.
gold_file is a path to a text file containing a list of IDs and labels. When save_results is True, the resulting dataframe will be saved to file_name.
Returns DataFrame
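The reported metrics can be reproduced by hand. A minimal sketch for a single label, computed from predicted vs. gold cluster assignments (how TopEx averages across labels is not stated here and is left out):

```python
# Precision, recall, and F1 for one label.
def prf1(predicted, gold, label):
    tp = sum(p == label and g == label for p, g in zip(predicted, gold))
    fp = sum(p == label and g != label for p, g in zip(predicted, gold))
    fn = sum(p != label and g == label for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

predicted = ["A", "A", "B", "B", "A"]
gold      = ["A", "B", "B", "B", "A"]
print(prf1(predicted, gold, "A"))  # (0.6666666666666666, 1.0, 0.8)
```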