token_filter [source]

token_filter(token, stopwords:list, custom_stopwords_only:bool=False)

Checks that the given token both contains alphabetic characters and is not a stop word. Returns bool.
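The filtering behavior described above can be sketched as follows. This is an illustrative re-implementation, not TopEx's actual source: the built-in stop-word set is a stand-in, and the interpretation of custom_stopwords_only (use only the supplied list, skipping any built-in list) is an assumption.

```python
def token_filter(token: str, stopwords: list, custom_stopwords_only: bool = False) -> bool:
    """Keep a token only if it contains alphabetic characters and is not a stop word.

    Sketch only: `custom_stopwords_only` is assumed to mean that only the
    supplied `stopwords` list is consulted, not any built-in list.
    """
    builtin_stopwords = {"the", "a", "an", "and", "of"}  # stand-in for a real list
    if not any(ch.isalpha() for ch in token):
        return False  # no alphabetic characters (e.g. numbers, punctuation)
    if token.lower() in stopwords:
        return False  # caller-supplied stop word
    if not custom_stopwords_only and token.lower() in builtin_stopwords:
        return False  # built-in stop word, unless only custom ones apply
    return True
```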
normalize_entity [source]

normalize_entity(entity)

For a given entity extracted via NER, attempts to normalize it to a canonical name if one is available; otherwise returns the lemmatized entity. Returns string.
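The lookup-then-lemmatize fallback can be sketched like this. The canonical-name map and the lemmatizer below are hypothetical stand-ins: TopEx's real normalization table and lemmatization (likely via an NLP library such as spaCy) will differ.

```python
# Hypothetical canonical-name map for illustration only.
CANONICAL_NAMES = {"mi": "myocardial infarction", "htn": "hypertension"}

def normalize_entity(entity: str) -> str:
    """Return the canonical name for an entity if known, else a lemmatized form.

    The "lemmatizer" here is a toy stand-in (lowercase + strip a plural 's');
    the real implementation would use a proper NLP lemmatizer.
    """
    key = entity.lower().strip()
    if key in CANONICAL_NAMES:
        return CANONICAL_NAMES[key]  # canonical name available
    # Fallback: crude lemmatization in place of the real one.
    return key[:-1] if key.endswith("s") and len(key) > 3 else key
```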
preprocess_docs [source]

preprocess_docs(doc_df:DataFrame, save_results:bool=False, file_name:str=None, stop_words_file:str=None, stop_words_list:list=None, custom_stopwords_only:bool=False, ner:bool=False)

Imports and pre-processes the documents from the raw_docs dataframe. Document pre-processing is handled in tokenize_and_stem. path_to_file_list is a path to a text file containing a list of files to be processed, separated by line breaks. ner runs a biomedical NER pipeline over the input and clusters on extracted entities rather than tokens. Returns (DataFrame, DataFrame).
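The overall flow (tokenize each document, filter tokens, collect per-document and per-token rows) can be sketched in plain Python. This mirrors the shape of the two DataFrames the real function returns but is not TopEx's implementation: it uses whitespace tokenization instead of tokenize_and_stem and plain lists of dicts instead of pandas.

```python
def preprocess_docs(raw_docs: list, stop_words_list: list = None, ner: bool = False):
    """Illustrative sketch of the preprocessing flow.

    Tokenizes each document, drops tokens without alphabetic characters and
    stop words, and returns (doc_rows, token_rows), mirroring the two
    DataFrames returned by the real function. NER handling is omitted.
    """
    stop_words = {w.lower() for w in (stop_words_list or [])}
    doc_rows, token_rows = [], []
    for doc_id, text in enumerate(raw_docs):
        # Stand-in for tokenize_and_stem: whitespace split + filtering.
        tokens = [t.lower() for t in text.split()
                  if any(ch.isalpha() for ch in t) and t.lower() not in stop_words]
        doc_rows.append({"doc_id": doc_id, "text": text})
        token_rows.extend({"doc_id": doc_id, "token": t} for t in tokens)
    return doc_rows, token_rows
```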
get_stop_words [source]

get_stop_words(stop_words_file:str=None, stop_words_list:list=None)

Gets a list of all stop words. Returns list(string).
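A minimal sketch of combining the two optional sources, assuming the file holds one stop word per line; whether the real function also merges in a built-in default list is not stated here, so this covers only the two named parameters.

```python
def get_stop_words(stop_words_file: str = None, stop_words_list: list = None) -> list:
    """Combine stop words from an optional file (one word per line) and an
    optional list into a single deduplicated, sorted list.

    Sketch only: the real TopEx function may also include a default list.
    """
    words = set()
    if stop_words_file is not None:
        with open(stop_words_file, encoding="utf-8") as f:
            words.update(line.strip().lower() for line in f if line.strip())
    if stop_words_list is not None:
        words.update(w.lower() for w in stop_words_list)
    return sorted(words)
```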