Pipe Structs
Preprocessing_Pipeline_JuML.NlpPipe — Type
NlpPipe
A simple pipeline structure for handling text data (corpus) and corresponding labels.
Fields
corpus::Vector{String}: A collection of text documents.
labels::Union{Vector{String}, Nothing}: Optional labels corresponding to each document in corpus.
Constructors
NlpPipe(corpus::Vector{String}, labels::Union{Vector{String}, Nothing})
Creates an NlpPipe instance with a given corpus and optional labels. Throws an ArgumentError if the number of documents and labels do not match.
NlpPipe(corpus::Vector{String})
Creates an NlpPipe instance with only a corpus, setting labels to nothing.
NlpPipe(corpus::String)
Creates an NlpPipe instance with a single document, storing it in a vector.
NlpPipe(previousPipe::NlpPipe; corpus::Vector{String} = previousPipe.corpus, labels::Union{Vector{String}, Nothing} = previousPipe.labels)
Creates a new NlpPipe instance based on an existing one, optionally overriding corpus and labels. Throws an ArgumentError if labels is not nothing and its length does not match the corpus length.
Example Usage
Creating a pipe from a corpus with multiple documents, including labels
julia> pipe1 = NlpPipe(["document1", "document2"], ["label1", "label2"])
NlpPipe(["document1", "document2"], ["label1", "label2"])
Creating a pipe from a corpus without labels
julia> pipe2 = NlpPipe(["document3"])
NlpPipe(["document3"], nothing)
Creating a pipe from a single string corpus
julia> pipe3 = NlpPipe("single document")
NlpPipe(["single document"], nothing)
Creating a new pipe from an existing one with modified corpus and labels
julia> NlpPipe(pipe1, corpus=["new_doc1", "new_doc2"])
NlpPipe(["new_doc1", "new_doc2"], ["label1", "label2"])
Preprocessing_Pipeline_JuML.TokenizedNlpPipe — Type
TokenizedNlpPipe
A structure for handling tokenized text data, maintaining a vocabulary and optional labels.
Fields
corpus::Vector{String}: A collection of original text documents.
tokens::Vector{Vector{String}}: Tokenized representation of each document in corpus.
vocabulary::Set{String}: A set of unique tokens derived from tokens.
labels::Union{Vector{String}, Nothing}: Optional labels corresponding to each document.
Constructors
TokenizedNlpPipe(corpus::Vector{String}, tokens::Vector{Vector{String}}, labels::Union{Vector{String}, Nothing})
Creates a TokenizedNlpPipe instance with a given corpus, tokenized documents, and optional labels. The vocabulary is automatically generated from tokens.
TokenizedNlpPipe(previousPipe::TokenizedNlpPipe; tokens::Vector{Vector{String}} = previousPipe.tokens, vocabulary::Set{String} = previousPipe.vocabulary, labels::Union{Vector{String}, Nothing} = previousPipe.labels)
Creates a new TokenizedNlpPipe instance based on an existing one, allowing modifications to tokens, vocabulary, and labels while retaining the original corpus.
Example Usage
Creating a pipe from an NlpPipe instance (usual way to do it)
julia> corpus = ["Hello world", "Julia is great"]
2-element Vector{String}:
"Hello world"
"Julia is great"
julia> tokenizedPipe = NlpPipe(corpus) |> tokenize
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), nothing)
Creating a new pipe from scratch
julia> corpus = ["Hello world", "Julia is great"]
2-element Vector{String}:
"Hello world"
"Julia is great"
julia> tokens = [["Hello", "world"], ["Julia", "is", "great"]]
2-element Vector{Vector{String}}:
["Hello", "world"]
["Julia", "is", "great"]
julia> TokenizedNlpPipe(corpus, tokens, ["greeting", "statement"])
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), ["greeting", "statement"])
Creating a new pipe from an existing one with modified tokens
julia> pipe1 = TokenizedNlpPipe(corpus, tokens, ["greeting", "statement"])
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), ["greeting", "statement"])
julia> pipe2 = TokenizedNlpPipe(pipe1; tokens=[["Hello"], ["Julia", "is"]])
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello"], ["Julia", "is"]], Set(["great", "Hello", "is", "Julia", "world"]), ["greeting", "statement"])
Preprocessing_Pipeline_JuML.VectorizedNlpPipe — Type
VectorizedNlpPipe
A structure for handling vectorized representations of tokenized text data, including a vocabulary mapping and optional labels.
Fields
tokens::Vector{Matrix{T<:Real}
: A collection of numerical representations (e.g., embeddings, one-hot encodings) for tokenized text.vocabulary::Dict{String, Int}
: A dictionary mapping words to unique integer indices.labels::Union{Vector{String}, Nothing}
: Optional labels corresponding to each document.
Example Usage
Creating a pipe from an existing TokenizedNlpPipe instance (usual way to do it)
julia> corpus = ["Hello world", "Julia is great"]
2-element Vector{String}:
"Hello world"
"Julia is great"
julia> NlpPipe(corpus) |> tokenize |> one_hot_encoding # (or any other vectorization method)
VectorizedNlpPipe{Int64}([[0 1 … 0 0; 0 0 … 0 1], [0 0 … 1 0; 0 0 … 0 0; 1 0 … 0 0]], Dict("great" => 1, "Hello" => 2, "is" => 3, "Julia" => 4, "world" => 5), nothing)
Creating a pipe from scratch
julia> tokens = [[1 2; 3 4], [5 6; 7 8]] # Example word embeddings (each document is a matrix)
2-element Vector{Matrix{Int64}}:
[1 2; 3 4]
[5 6; 7 8]
julia> vocab = Dict("hello" => 1, "world" => 2, "Julia" => 3)
Dict{String, Int64} with 3 entries:
"hello" => 1
"Julia" => 3
"world" => 2
julia> labels = ["greeting", "statement"]
2-element Vector{String}:
"greeting"
"statement"
julia> VectorizedNlpPipe(tokens, vocab, labels)
VectorizedNlpPipe{Int64}([[1 2; 3 4], [5 6; 7 8]], Dict("hello" => 1, "Julia" => 3, "world" => 2), ["greeting", "statement"])
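Accessing the fields of the pipe constructed above (a usage sketch; the field names follow the Fields section)
julia> pipe = VectorizedNlpPipe(tokens, vocab, labels);
julia> size(pipe.tokens[1])
(2, 2)
julia> pipe.vocabulary["Julia"]
3
julia> pipe.labels[1]
"greeting"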
Preprocessing before tokenization
Preprocessing_Pipeline_JuML.standardize_text — Function
standardize_text(pipe::NlpPipe) -> NlpPipe
Standardizes the text in the corpus by converting it to lowercase and replacing unusual characters with their standard counterparts.
Parameters
pipe::NlpPipe: An NlpPipe object containing a corpus and associated labels.
Returns
A new NlpPipe object with the standardized corpus and the original labels.
Example Usage
julia> NlpPipe(["Hello WORLD", "Julia is GREAT"]) |> standardize_text
NlpPipe(["hello world", "julia is great"], nothing)
Preprocessing_Pipeline_JuML.remove_noise — Function
remove_noise(pipe::NlpPipe) -> NlpPipe
Removes noise from the corpus.
Noise includes HTML tags, URLs, email addresses, file paths, special characters, and dates and times. URLs, dates, time references, file paths, and email addresses are replaced with corresponding replacement tokens.
Parameters
pipe::NlpPipe: The NlpPipe object with a corpus to remove noise from.
Returns
NlpPipe: A new pipe object with the noise removed from the corpus.
Example Usage
julia> NlpPipe(["<html>This is a test</html>"]) |> remove_noise
NlpPipe(["This is a test"], nothing)
julia> NlpPipe(["Today is 28/01/2025"]) |> remove_noise
NlpPipe(["Today is <DATE>"], nothing)
With custom replacement patterns
julia> NlpPipe(["<html>This is a test</html>"]) |> pipe -> remove_noise(pipe, replacement_patterns=[r"is a" => "🦖🫶"])
NlpPipe(["<html>This 🦖🫶 test</html>"], nothing)
Preprocessing_Pipeline_JuML.mask_numbers — Function
mask_numbers(pipe::NlpPipe; replace_with::String="<NUM>") -> NlpPipe
Replaces all numbers in the text of the given NlpPipe corpus with a specified string.
Parameters
pipe::NlpPipe: The input NlpPipe object containing the corpus to be processed.
replace_with::String: The string to replace numbers with. Defaults to "<NUM>".
Returns
A new NlpPipe object with the numbers in the corpus replaced by the specified string.
Example Usage
julia> NlpPipe(["The price is 1000€."]) |> mask_numbers
NlpPipe(["The price is <NUM>€."], nothing)
Preprocessing_Pipeline_JuML.expand_contractions — Function
expand_contractions(input::NlpPipe) -> NlpPipe
Expands common English contractions in the input text.
Parameters
input::NlpPipe: An NlpPipe object containing the corpus to expand contractions in.
Returns
A new NlpPipe object with the contractions expanded in the corpus.
Example Usage
julia> NlpPipe(["I'm happy", "I've got a cat"]) |> expand_contractions
NlpPipe(["I am happy", "I have got a cat"], nothing)
Tokenization
Preprocessing_Pipeline_JuML.tokenize — Function
tokenize(pipe::NlpPipe, level::Symbol = :word) -> TokenizedNlpPipe
Tokenizes the documents in the corpus of the given NlpPipe object. The level parameter determines the tokenization granularity.
Parameters
pipe::NlpPipe: An NlpPipe object containing a corpus of documents.
level::Symbol: The tokenization level, either :word (default) or :character.
Returns
TokenizedNlpPipe: A new pipe object with the tokenized documents.
Example Usage
julia> NlpPipe(["Hello world", "Julia is great"]) |> tokenize
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), nothing)
Preprocessing after tokenization
Preprocessing_Pipeline_JuML.remove_stop_words — Function
remove_stop_words(pipe::TokenizedNlpPipe; language::String="en", stop_words::Set{String}=Set{String}()) -> TokenizedNlpPipe
Removes predefined stop words. The stop words for a given language can be accessed via the language name or its ISO 639 code; for example, the English stop words are available as stopwords["eng"], stopwords["en"], or stopwords["English"]. Stop words are sourced from StopWords.jl.
Parameters
pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokens to be processed.
language::String = "en": The stop-word language. Defaults to English; other languages supported by StopWords.jl are possible.
stop_words::Set{String} = Set{String}(): A custom stop-word set. Defaults to the StopWords.jl set for the given language.
Returns
TokenizedNlpPipe: A new pipe object with the stop words removed from the tokens.
Example Usage
Removing stop words from a tokenized pipe (default stop words)
julia> NlpPipe(["This is a dinosaur"]) |> tokenize |> remove_stop_words |> pipe -> pipe.tokens
1-element Vector{Vector{String}}:
["This", "dinosaur"]
Using custom stop words
julia> NlpPipe(["This is a dinosaur"]) |> tokenize |> pipe -> remove_stop_words(pipe, stop_words=Set(["This", "dinosaur"])) |> pipe -> pipe.tokens
1-element Vector{Vector{String}}:
["is", "a"]
Preprocessing_Pipeline_JuML.stemming — Function
stemming(pipe::TokenizedNlpPipe; language::String="english") -> TokenizedNlpPipe
Reduces words to their stems by removing prefixes and suffixes, using the stemmers provided by SnowballStemmer.jl.
Parameters
pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokens to be processed.
language::String = "english": The stemming language. Defaults to English; other languages supported by SnowballStemmer.jl are possible.
Returns
TokenizedNlpPipe: A new pipe object with the stemmed tokens.
Example Usage
Applying stemming with the default language (English)
julia> NlpPipe(["This is a test for stemming"]) |> tokenize |> stemming
TokenizedNlpPipe(["This is a test for stemming"], [["This", "is", "a", "test", "for", "stem"]], Set(["test", "is", "This", "stem", "a", "for"]), nothing)
Preprocessing_Pipeline_JuML.standardize_encoding — Function
standardize_encoding(pipe::TokenizedNlpPipe; encoding::String = "ASCII") -> TokenizedNlpPipe
Standardizes the encoding of the tokens in the corpus.
Parameters
pipe::TokenizedNlpPipe: A TokenizedNlpPipe object containing a corpus and associated labels.
encoding::String: The target encoding, either "ASCII" or "UTF-8" (default is "ASCII").
Returns
TokenizedNlpPipe: A pipe object with the standardized tokens and the original labels.
Example Usage
Using the default encoding
julia> NlpPipe(["Hello world 😊", "Julia is great"]) |> tokenize |> standardize_encoding
TokenizedNlpPipe(["Hello world 😊", "Julia is great"], [["Hello", "world", "<UNK>"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "<UNK>", "world"]), nothing)
Using the UTF-8 encoding
julia> NlpPipe(["Hello world 😊", "Julia is great"]) |> tokenize |> pipe -> standardize_encoding(pipe, encoding="UTF-8")
TokenizedNlpPipe(["Hello world 😊", "Julia is great"], [["Hello", "world", "😊"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world", "😊"]), nothing)
Vectorization
Preprocessing_Pipeline_JuML.bag_of_words — Function
bag_of_words(pipe::TokenizedNlpPipe) -> VectorizedNlpPipe
Creates a bag-of-words encoding from the given TokenizedNlpPipe.
Parameters
pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokenized documents.
Returns
VectorizedNlpPipe: A new pipe object with the bag-of-words vectors.
Example Usage
julia> NlpPipe(["words one", "words two"]) |> tokenize |> bag_of_words
VectorizedNlpPipe{Int64}([[0 1 1], [1 0 1]], Dict("two" => 1, "one" => 2, "words" => 3), nothing)
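Stacking the per-document count vectors into one document-term matrix (a usage sketch; assumes tokens holds one 1×V count matrix per document, as shown in the output above)
julia> pipe = NlpPipe(["words one", "words two"]) |> tokenize |> bag_of_words;
julia> vcat(pipe.tokens...)
2×3 Matrix{Int64}:
 0  1  1
 1  0  1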
Preprocessing_Pipeline_JuML.bag_of_ngrams — Function
bag_of_ngrams(pipe::TokenizedNlpPipe; n::Int = 1) -> VectorizedNlpPipe
Creates a bag-of-n-grams encoding from the given TokenizedNlpPipe, with padding for shorter documents.
Parameters
pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokenized documents.
n::Int: The n-gram size. Defaults to 1.
Returns
VectorizedNlpPipe: A new pipe object with the n-gram vectors.
Example Usage
julia> NlpPipe(["words one", "words two"]) |> tokenize |> bag_of_ngrams
VectorizedNlpPipe{Int64}([[1 0 0; 0 1 0], [1 0 0; 0 0 1]], Dict("two" => 3, "one" => 2, "words" => 1), nothing)
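Using bigrams via the n keyword (a hedged sketch; the printed output is suppressed because the exact n-gram vocabulary layout depends on the implementation)
julia> NlpPipe(["words one two", "words two"]) |> tokenize |> pipe -> bag_of_ngrams(pipe, n=2);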
Preprocessing_Pipeline_JuML.tf_idf — Function
tf_idf(pipe::TokenizedNlpPipe; tf_weighting::String = "relative term frequency", idf_weighting::String="inverse document frequency") -> VectorizedNlpPipe
Computes the TF-IDF (Term Frequency-Inverse Document Frequency) representation of the tokenized documents in the given pipe.
Parameters
pipe::TokenizedNlpPipe: A pipeline containing tokenized documents.
tf_weighting::String: The term frequency weighting scheme. Options are "relative term frequency" (default) and "raw term frequency".
idf_weighting::String: The inverse document frequency weighting scheme. Options are "inverse document frequency" (default) and "smooth inverse document frequency".
Returns
VectorizedNlpPipe: A pipe object containing the TF-IDF vectorized representation of the documents.
Example Usage
julia> NlpPipe(["words one", "words two"]) |> tokenize |> tf_idf
VectorizedNlpPipe{Float64}([[0.0 0.0 0.0; 0.0 0.35 0.0], [0.0 0.0 0.0; 0.0 0.0 0.35]], Dict("two" => 3, "one" => 2, "words" => 1), nothing)
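How the 0.35 entries above arise (a worked sketch, assuming relative term frequency = count / document length and the natural logarithm in the inverse document frequency; "words" occurs in both documents, so its idf is log(2/2) = 0, which explains the zeros)
julia> tf = 1 / 2           # "one" occurs once in a two-token document
0.5
julia> idf = log(2 / 1)     # "one" appears in 1 of the 2 documents
0.6931471805599453
julia> round(tf * idf, digits=2)
0.35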
Preprocessing_Pipeline_JuML.one_hot_encoding — Function
one_hot_encoding(pipe::TokenizedNlpPipe) -> VectorizedNlpPipe
Creates a one-hot encoding from the given TokenizedNlpPipe.
Parameters
pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokenized documents.
Returns
VectorizedNlpPipe: The output pipe object containing the one-hot-encoded documents.
Example Usage
julia> NlpPipe(["words one", "words two"]) |> tokenize |> one_hot_encoding
VectorizedNlpPipe{Int64}([[0 0 1; 0 1 0], [0 0 1; 1 0 0]], Dict("two" => 1, "one" => 2, "words" => 3), nothing)
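Putting it together: the functions documented above compose with Julia's |> operator. The following end-to-end sketch uses one reasonable step ordering, not one prescribed by the package; the corpus and labels are illustrative.
julia> corpus = ["I'm testing the pipeline on 2 documents!", "Visit <b>our</b> site today."];
julia> vectorized = NlpPipe(corpus, ["first", "second"]) |>
           expand_contractions |>
           standardize_text |>
           remove_noise |>
           mask_numbers |>
           tokenize |>
           remove_stop_words |>
           stemming |>
           tf_idf;
julia> vectorized isa VectorizedNlpPipe
true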