Preprocessing_Pipeline_JuML
Preprocessing_Pipeline_JuML is a Julia package for preprocessing text data in NLP pipelines.
API Structure
The package provides a set of pipeline stages that can be chained together to preprocess text data. Each stage is implemented as a function that takes an NlpPipe or TokenizedNlpPipe struct as input and returns a modified object of the same type. This makes it easy to build custom preprocessing pipelines by piping together the desired stages.
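The chaining pattern described above can be sketched in plain Julia. This is only an illustration under stated assumptions: DemoPipe, lowercase_stage, and strip_punctuation_stage are hypothetical names invented for this sketch, and the package's actual NlpPipe struct and stages may be defined differently.

```julia
# Hypothetical stand-in for the package's NlpPipe (the real struct may hold other fields).
struct DemoPipe
    corpus::Vector{String}
end

# Each stage takes a pipe and returns a new pipe of the same type,
# which is what makes the stages composable with the |> operator.
lowercase_stage(p::DemoPipe) = DemoPipe(lowercase.(p.corpus))
strip_punctuation_stage(p::DemoPipe) =
    DemoPipe([replace(doc, r"[[:punct:]]" => "") for doc in p.corpus])

# Stages are chained left to right.
result = DemoPipe(["Hello, world!", "How are you?"]) |>
         lowercase_stage |>
         strip_punctuation_stage

println(result.corpus)
```

Because every stage has the same input and output type, stages can be reordered or dropped without changing the rest of the chain.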
Overview
Features and Objects
Features (for detailed explanation visit here)
- Text preprocessing: prepare textual data for machine learning tasks. Preprocessing steps include:
  - applied before tokenization:
    - expansion of contractions
    - masking of numbers
    - noise removal (punctuation, special characters, phone numbers, e-mail addresses, ...)
    - text standardization (lowercasing, removal of ambiguous characters)
  - applied after tokenization:
    - stopword removal
    - stemming
    - standardization of token encoding
- Tokenization: split text into words or characters.
- Vectorization: transform text into machine-learning-compatible vector representations:
  - one-hot encoding
  - Bag of Words (BoW)
  - Bag of N-Grams
  - Term Frequency-Inverse Document Frequency (TF-IDF)
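To make two of the vectorization schemes listed above concrete, here is a minimal sketch of Bag of Words and TF-IDF written in plain Julia. It does not call the package: the toy corpus and all variable names are assumptions, and the TF-IDF weighting shown (term count divided by document length, times log(N / df)) is one common variant that may differ from the one the package implements.

```julia
# Toy tokenized corpus (assumed input; in the package, tokenization would produce this).
docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]

# Vocabulary: sorted unique tokens, each mapped to a column index.
vocab = sort(unique(vcat(docs...)))
index = Dict(tok => i for (i, tok) in enumerate(vocab))

# Bag of Words: one row per document, one column per vocabulary entry, holding counts.
bow = zeros(Int, length(docs), length(vocab))
for (d, doc) in enumerate(docs), tok in doc
    bow[d, index[tok]] += 1
end

# TF-IDF (one common variant): tf = count / document length, idf = log(N / df).
n_docs = length(docs)
df = vec(sum(bow .> 0, dims=1))      # number of documents containing each term
tf = bow ./ length.(docs)            # each row is divided by its document length
tfidf = tf .* log.(n_docs ./ df)'    # terms that occur in every document get weight 0

println(vocab)
println(bow)
```

With this weighting, terms shared by all documents (here "sat" and "the") contribute nothing, while terms unique to one document carry the signal.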
Pipe Objects (to learn more, visit here)
NlpPipe
First struct to instantiate in a pipeline. Can be created directly from a text corpus. Can be
- used in preprocessing stages that do not require the text to be tokenized.
- transformed into a TokenizedNlpPipe by applying the tokenize function.
TokenizedNlpPipe
Struct that holds tokenized text data. Can be used for preprocessing stages that require tokenized text (e.g., stopword removal, stemming, etc.). Can be transformed into a VectorizedNlpPipe by applying any vectorization function.
VectorizedNlpPipe
Struct that holds vectorized text data (embeddings). Can be used for machine learning tasks.
Usage Example
corpus = ["Hello, world!", "How are you?"]
NlpPipe(corpus) |> remove_noise |> tokenize |> one_hot_encoding
VectorizedNlpPipe{Int64}([[1 0 … 0 0; 0 0 … 0 1], [0 1 … 0 0; 0 0 … 1 0; 0 0 … 0 0]], Dict("Hello" => 1, "How" => 2, "you" => 3, "are" => 4, "world" => 5), nothing)
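The shape of the output above (one matrix per document, one row per token, one column per vocabulary entry) can be reproduced with a short standalone sketch. The helper one_hot_rows is hypothetical and is not the package's one_hot_encoding; the token lists and vocabulary are copied from the example output, and the row-per-token layout is inferred from it.

```julia
# Tokens and vocabulary taken from the example output (assumed to be what tokenize produced).
tokens = [["Hello", "world"], ["How", "are", "you"]]
vocab = Dict("Hello" => 1, "How" => 2, "you" => 3, "are" => 4, "world" => 5)

# Hypothetical helper: builds a (tokens x vocabulary) matrix with a single 1 per row.
function one_hot_rows(doc::Vector{String}, vocab::Dict{String,Int})
    m = zeros(Int, length(doc), length(vocab))
    for (row, tok) in enumerate(doc)
        m[row, vocab[tok]] = 1
    end
    return m
end

encoded = [one_hot_rows(doc, vocab) for doc in tokens]

for m in encoded
    display(m)
end
```

Each row sums to 1, and the column holding the 1 is the token's index in the vocabulary dictionary.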