Methods for preprocessing raw text data

token_filter[source]

token_filter(token, stopwords:list, custom_stopwords_only:bool=False)

Checks whether the given token contains alphabetic characters and is not a stopword. Returns bool
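
For illustration, a minimal sketch of the check this performs, treating `token` as a plain string; `DEFAULT_STOPWORDS` and the handling of `custom_stopwords_only` below are assumptions, not TopEx's actual implementation:

```python
DEFAULT_STOPWORDS = {"the", "a", "an", "and", "or"}  # stand-in for a built-in list

def token_filter_sketch(token: str, stopwords: list, custom_stopwords_only: bool = False) -> bool:
    # Keep only tokens that contain at least one alphabetic character.
    if not any(ch.isalpha() for ch in token):
        return False
    # Assumption: custom_stopwords_only means the built-in list is skipped and
    # only the caller-supplied stopwords are filtered out.
    active = set(stopwords) if custom_stopwords_only else DEFAULT_STOPWORDS | set(stopwords)
    return token.lower() not in active

print([t for t in ["Patient", "the", "123", "fever"] if token_filter_sketch(t, ["fever"])])
# ['Patient']
```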

normalize_entity[source]

normalize_entity(entity)

For a given entity extracted via NER, attempts to normalize to a canonical name if available, otherwise returns the lemmatized entity. Returns string
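
A hypothetical sketch of the canonical-name-or-lemma fallback described above; the canonical map and the lemma fallback are toy stand-ins for the library's actual entity linking and lemmatization:

```python
# Toy canonical-name map; TopEx's NER pipeline supplies the real linking.
CANONICAL = {"mi": "myocardial infarction", "heart attack": "myocardial infarction"}

def normalize_entity_sketch(entity: str) -> str:
    key = entity.lower().strip()
    if key in CANONICAL:
        return CANONICAL[key]                      # canonical name available
    return key[:-1] if key.endswith("s") else key  # crude lemma stand-in

print(normalize_entity_sketch("Heart Attack"))  # myocardial infarction
print(normalize_entity_sketch("fevers"))        # fever
```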

preprocess_docs[source]

preprocess_docs(doc_df:DataFrame, save_results:bool=False, file_name:str=None, stop_words_file:str=None, stop_words_list:list=None, custom_stopwords_only:bool=False, ner:bool=False)

Imports and pre-processes the documents from the doc_df dataframe. Document pre-processing is handled in tokenize_and_stem. If ner is True, a biomedical NER pipeline is run over the input and clustering is performed on the extracted entities rather than on tokens. Returns (DataFrame, DataFrame)
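
A hedged usage sketch; the import path and the doc_df column names here are assumptions to adapt to your install, not verified API facts:

```python
import pandas as pd
from topex.core import preprocess_docs  # assumption: adjust to your install

# Assumed input schema: one row per document with an id and its raw text.
doc_df = pd.DataFrame({
    "id": [0, 1],
    "text": ["Patient reports chest pain.", "No fever or chills today."],
})

# Per the docstring, a pair of DataFrames is returned.
data_df, processed_df = preprocess_docs(doc_df, ner=False)
```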

get_stop_words[source]

get_stop_words(stop_words_file:str=None, stop_words_list:list=None)

Gets a list of all stop words. Returns list(string)
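
A minimal sketch of what the signature suggests: merge stop words read from a file (one per line) with an in-memory list; the lowercasing and deduplication shown are assumptions:

```python
def get_stop_words_sketch(stop_words_file: str = None, stop_words_list: list = None) -> list:
    words = set()
    if stop_words_file is not None:
        with open(stop_words_file) as fh:  # assumed format: one stop word per line
            words.update(line.strip().lower() for line in fh if line.strip())
    if stop_words_list is not None:
        words.update(w.lower() for w in stop_words_list)
    return sorted(words)

print(get_stop_words_sketch(stop_words_list=["The", "and", "AND"]))
# ['and', 'the']
```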