Pipe Structs

Preprocessing_Pipeline_JuML.NlpPipe (Type)
NlpPipe

A simple pipeline structure for handling text data (corpus) and corresponding labels.

Fields

  • corpus::Vector{String}: A collection of text documents.
  • labels::Union{Vector{String}, Nothing}: Optional labels corresponding to each document in corpus.

Constructors

  • NlpPipe(corpus::Vector{String}, labels::Union{Vector{String}, Nothing}) Creates an NlpPipe instance with a given corpus and optional labels. Throws an ArgumentError if the number of documents and labels do not match.

  • NlpPipe(corpus::Vector{String}) Creates an NlpPipe instance with only a corpus, setting labels to nothing.

  • NlpPipe(corpus::String) Creates an NlpPipe instance with a single document, storing it in a vector.

  • NlpPipe(previousPipe::NlpPipe; corpus::Vector{String} = previousPipe.corpus, labels::Union{Vector{String}, Nothing} = previousPipe.labels) Creates a new NlpPipe instance based on an existing one, optionally overriding corpus and labels. Throws an ArgumentError if labels is not nothing and its length does not match the corpus length.

Example Usage


Creating a pipe from a corpus with multiple documents, including labels

julia> pipe1 = NlpPipe(["document1", "document2"], ["label1", "label2"])
NlpPipe(["document1", "document2"], ["label1", "label2"])

Creating a pipe from a corpus without labels

julia> pipe2 = NlpPipe(["document3"])
NlpPipe(["document3"], nothing)

Creating a pipe from a single string corpus

julia> pipe3 = NlpPipe("single document") 
NlpPipe(["single document"], nothing)

Creating a new pipe from an existing one with modified corpus and labels

julia> NlpPipe(pipe1, corpus=["new_doc1", "new_doc2"])
NlpPipe(["new_doc1", "new_doc2"], ["label1", "label2"])
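The length check performed by the first constructor can be sketched as follows. This is a minimal illustrative re-implementation (the struct name `SketchPipe` is hypothetical), not the package's actual code:

```julia
# Minimal sketch of the corpus/labels validation described above.
struct SketchPipe
    corpus::Vector{String}
    labels::Union{Vector{String}, Nothing}
    function SketchPipe(corpus::Vector{String},
                        labels::Union{Vector{String}, Nothing} = nothing)
        # Labels, when present, must line up one-to-one with the documents.
        if labels !== nothing && length(corpus) != length(labels)
            throw(ArgumentError("number of documents and labels must match"))
        end
        new(corpus, labels)
    end
end
```

With this sketch, `SketchPipe(["doc1", "doc2"], ["label1"])` throws an `ArgumentError`, mirroring the behaviour documented for `NlpPipe`.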
Preprocessing_Pipeline_JuML.TokenizedNlpPipe (Type)
TokenizedNlpPipe

A structure for handling tokenized text data, maintaining a vocabulary and optional labels.

Fields

  • corpus::Vector{String}: A collection of original text documents.
  • tokens::Vector{Vector{String}}: Tokenized representation of each document in corpus.
  • vocabulary::Set{String}: A set of unique tokens derived from tokens.
  • labels::Union{Vector{String}, Nothing}: Optional labels corresponding to each document.

Constructors

  • TokenizedNlpPipe(corpus::Vector{String}, tokens::Vector{Vector{String}}, labels::Union{Vector{String}, Nothing}) Creates a TokenizedNlpPipe instance with a given corpus, tokenized documents, and optional labels. The vocabulary is automatically generated from tokens.

  • TokenizedNlpPipe(previousPipe::TokenizedNlpPipe; tokens::Vector{Vector{String}} = previousPipe.tokens, vocabulary::Set{String} = previousPipe.vocabulary, labels::Union{Vector{String}, Nothing} = previousPipe.labels) Creates a new TokenizedNlpPipe instance based on an existing one, allowing modifications to tokens, vocabulary, and labels while retaining the original corpus.

Example Usage


Creating a pipe from an NlpPipe instance (usual way to do it)

julia> corpus = ["Hello world", "Julia is great"]
2-element Vector{String}:
 "Hello world"
 "Julia is great"
julia> tokenizedPipe = NlpPipe(corpus) |> tokenize
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), nothing)

Creating a new pipe from scratch

julia> corpus = ["Hello world", "Julia is great"]
2-element Vector{String}:
 "Hello world"
 "Julia is great"

julia> tokens = [["Hello", "world"], ["Julia", "is", "great"]]
2-element Vector{Vector{String}}:
 ["Hello", "world"]
 ["Julia", "is", "great"]
 
julia> TokenizedNlpPipe(corpus, tokens, ["greeting", "statement"])
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), ["greeting", "statement"])

Creating a new pipe from an existing one with modified tokens

julia> pipe1 = TokenizedNlpPipe(corpus, tokens, ["greeting", "statement"])
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), ["greeting", "statement"])

julia> pipe2 = TokenizedNlpPipe(pipe1; tokens=[["Hello"], ["Julia", "is"]])
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello"], ["Julia", "is"]], Set(["great", "Hello", "is", "Julia", "world"]), ["greeting", "statement"])
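The vocabulary is derived automatically as the set of unique tokens across all documents; conceptually (a sketch, not the package's internal code):

```julia
# The vocabulary is just the set of unique tokens across all documents.
tokens = [["Hello", "world"], ["Julia", "is", "great"]]
vocabulary = Set(Iterators.flatten(tokens))
```

Note that, as the `pipe2` example above shows, overriding `tokens` through the copy constructor does not recompute the vocabulary; pass `vocabulary` explicitly if it should shrink along with the tokens.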
Preprocessing_Pipeline_JuML.VectorizedNlpPipe (Type)
VectorizedNlpPipe

A structure for handling vectorized representations of tokenized text data, including a vocabulary mapping and optional labels.

Fields

  • tokens::Vector{Matrix{T}} where T<:Real: A collection of numerical representations (e.g., embeddings, one-hot encodings) for tokenized text.
  • vocabulary::Dict{String, Int}: A dictionary mapping words to unique integer indices.
  • labels::Union{Vector{String}, Nothing}: Optional labels corresponding to each document.

Example Usage


Creating a pipe from an existing TokenizedNlpPipe instance (usual way to do it)

julia> corpus = ["Hello world", "Julia is great"]
2-element Vector{String}:
 "Hello world"
 "Julia is great"

julia> NlpPipe(corpus) |> tokenize |> one_hot_encoding # (or any other vectorization method)
VectorizedNlpPipe{Int64}([[0 1 … 0 0; 0 0 … 0 1], [0 0 … 1 0; 0 0 … 0 0; 1 0 … 0 0]], Dict("great" => 1, "Hello" => 2, "is" => 3, "Julia" => 4, "world" => 5), nothing)

Creating a pipe from scratch

julia> tokens = [[1 2; 3 4], [5 6; 7 8]]  # Example word embeddings (each document is a matrix)
2-element Vector{Matrix{Int64}}:
 [1 2; 3 4]
 [5 6; 7 8]

julia> vocab = Dict("hello" => 1, "world" => 2, "Julia" => 3)
Dict{String, Int64} with 3 entries:
  "hello" => 1
  "Julia" => 3
  "world" => 2

julia> labels = ["greeting", "statement"]
2-element Vector{String}:
 "greeting"
 "statement"

julia> VectorizedNlpPipe(tokens, vocab, labels)
VectorizedNlpPipe{Int64}([[1 2; 3 4], [5 6; 7 8]], Dict("hello" => 1, "Julia" => 3, "world" => 2), ["greeting", "statement"])

Preprocessing before tokenization

Preprocessing_Pipeline_JuML.standardize_text (Function)
standardize_text(pipe::NlpPipe) -> NlpPipe

Standardizes the text in the corpus by converting it to lowercase and replacing unusual characters with their standard counterparts.

Parameters

  • pipe::NlpPipe: An NlpPipe object containing a corpus and associated labels.

Returns

  • A new NlpPipe object with the standardized corpus and the original labels.

Example Usage

julia> NlpPipe(["Hello WORLD", "Julia is GREAT"]) |> standardize_text
NlpPipe(["hello world", "julia is great"], nothing)
Preprocessing_Pipeline_JuML.remove_noise (Function)
remove_noise(pipe::NlpPipe) -> NlpPipe

Removes noise from the corpus.

Noise includes HTML tags, URLs, email addresses, file paths, special characters, and dates & times. URLs, dates, time references, file paths, and email addresses are replaced with corresponding replacement tokens.

Parameters

  • pipe::NlpPipe: The NlpPipe object with a corpus to remove noise from.
  • replacement_patterns (keyword): Optional custom pattern => replacement pairs that override the default noise patterns, as shown in the example below.

Returns

  • NlpPipe: A new pipe object with the noise removed from the corpus

Example Usage


julia> NlpPipe(["<html>This is a test</html>"]) |> remove_noise
NlpPipe(["This is a test"], nothing)

julia> NlpPipe(["Today is 28/01/2025"]) |> remove_noise
NlpPipe(["Today is <DATE>"], nothing)

With custom replacement patterns

julia> NlpPipe(["<html>This is a test</html>"]) |> pipe -> remove_noise(pipe, replacement_patterns=[r"is a" => "🦖🫶"])
NlpPipe(["<html>This 🦖🫶 test</html>"], nothing)
Preprocessing_Pipeline_JuML.mask_numbers (Function)
mask_numbers(pipe::NlpPipe; replace_with::String="<NUM>") -> NlpPipe

Replaces all numbers in the text of the given NlpPipe corpus with a specified string.

Parameters

  • pipe::NlpPipe: The input NlpPipe object containing the corpus to be processed.
  • replace_with::String: The string to replace numbers with. Defaults to "<NUM>".

Returns

  • A new NlpPipe object with the numbers in the corpus replaced by the specified string.

Example Usage

julia> NlpPipe(["The price is 1000€."]) |> mask_numbers
NlpPipe(["The price is <NUM>€."], nothing)
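The behaviour above can be approximated with a single regex substitution; a sketch under stated assumptions (`mask_nums` is a hypothetical helper, and the package's exact number pattern may differ):

```julia
# Replace every run of digits with the replacement token.
mask_nums(doc::String; replace_with::String = "<NUM>") =
    replace(doc, r"\d+" => replace_with)
```

For example, `mask_nums("The price is 1000€.")` returns `"The price is <NUM>€."`.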
Preprocessing_Pipeline_JuML.expand_contractions (Function)
expand_contractions(input::NlpPipe) -> NlpPipe

Expands common English contractions in the corpus of the given NlpPipe.

Parameters

  • input::NlpPipe: A NlpPipe object containing the corpus to expand contractions in.

Returns

  • A new NlpPipe object with the contractions expanded in the corpus.

Example Usage

julia> NlpPipe(["I'm happy", "I've got a cat"]) |> expand_contractions
NlpPipe(["I am happy", "I have got a cat"], nothing)
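Contraction expansion is essentially a table-driven string replacement; a sketch with a deliberately tiny, illustrative table (the package's actual table covers many more contractions, and `expand` is a hypothetical helper):

```julia
# Tiny illustrative contraction table; the real one is far more complete.
const CONTRACTIONS = Dict("I'm" => "I am", "I've" => "I have", "don't" => "do not")

# Apply every known contraction => expansion pair to the document.
expand(doc::String) = reduce((d, p) -> replace(d, p), collect(CONTRACTIONS); init = doc)
```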

Tokenization

Preprocessing_Pipeline_JuML.tokenize (Function)
tokenize(pipe::NlpPipe, level::Symbol = :word) -> TokenizedNlpPipe

Tokenizes the documents in the corpus of the given NlpPipe object. The level parameter sets the tokenization granularity.

Parameters

  • pipe::NlpPipe: An NlpPipe object containing a corpus of documents.
  • level::Symbol: The tokenization level, either :word (default) or :character.

Returns

  • TokenizedNlpPipe: A new pipe object with the tokenized documents.

Example Usage

julia> NlpPipe(["Hello world", "Julia is great"]) |> tokenize
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), nothing)
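The two levels correspond roughly to splitting on whitespace versus splitting into individual characters; a sketch of the distinction (helper names are illustrative):

```julia
word_tokens(doc::String) = String.(split(doc))     # :word level
char_tokens(doc::String) = string.(collect(doc))   # :character level
```

For example, `word_tokens("Hello world")` gives `["Hello", "world"]`, while `char_tokens("Hi")` gives `["H", "i"]`.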

Preprocessing after tokenization

Preprocessing_Pipeline_JuML.remove_stop_words (Function)
remove_stop_words(pipe::TokenizedNlpPipe; language::String="en", stop_words::Set{String}=Set{String}()) -> TokenizedNlpPipe

Removes predefined stop words from the tokenized documents. You can access the stop words for a given language using the language name or ISO 639 code.

For example, to get the stop words for English, you can use stopwords["eng"], stopwords["en"], or stopwords["English"]. Stop words sourced from StopWords.jl.

Parameters

  • pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokens to be processed.
  • language::String = "en": Defaults to English; any language supported by StopWords.jl may be given by name or ISO 639 code.
  • stop_words::Set{String} = Set{String}(): Defaults to the StopWords.jl stop-word set for the given language; pass your own set to override it.

Returns

  • TokenizedNlpPipe: A new pipe object with the stop words removed from the tokens.

Example Usage


Removing stop words from a tokenized pipe (default stop words)

julia> NlpPipe(["This is a dinosaur"]) |> tokenize |> remove_stop_words |> pipe -> pipe.tokens
1-element Vector{Vector{String}}:
 ["This", "dinosaur"]

Using custom stop words

julia> NlpPipe(["This is a dinosaur"]) |> tokenize |> pipe -> remove_stop_words(pipe, stop_words=Set(["This", "dinosaur"])) |> pipe -> pipe.tokens
1-element Vector{Vector{String}}:
 ["is", "a"]
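The core of the operation is a per-document filter against the stop-word set; a sketch (`strip_stops` is a hypothetical helper). Note that in the default example above "This" survives, which suggests matching is case-sensitive:

```julia
# Keep only the tokens that are not in the stop-word set.
strip_stops(tokens::Vector{Vector{String}}, stops::Set{String}) =
    [filter(t -> t ∉ stops, doc) for doc in tokens]
```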
Preprocessing_Pipeline_JuML.stemming (Function)
stemming(pipe::TokenizedNlpPipe; language::String="english") -> TokenizedNlpPipe

Reduces words to their roots by removing prefixes and suffixes. The stemmers are provided by SnowballStemmer.jl.

Parameters

  • pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokens to be processed.
  • language::String = "english": Defaults to English; other languages supported by the Snowball stemmer are possible.

Returns

  • TokenizedNlpPipe: A new pipe object with the stemmed tokens.

Example Usage


Applying stemming with the default language (English)

julia> NlpPipe(["This is a test for stemming"]) |> tokenize |> stemming
TokenizedNlpPipe(["This is a test for stemming"], [["This", "is", "a", "test", "for", "stem"]], Set(["test", "is", "This", "stem", "a", "for"]), nothing)
Preprocessing_Pipeline_JuML.standardize_encoding (Function)
standardize_encoding(pipe::TokenizedNlpPipe; encoding::String = "ASCII") -> TokenizedNlpPipe

Standardizes the encoding of the tokens in the corpus.

Parameters

  • pipe::TokenizedNlpPipe: A TokenizedNlpPipe object containing a corpus and associated labels.
  • encoding::String: The target encoding, either "ASCII" or "UTF-8" (default is "ASCII").

Returns

  • TokenizedNlpPipe: A pipe object with the standardized corpus and the original labels.

Example Usage


Using the default encoding

julia> NlpPipe(["Hello world 😊", "Julia is great"]) |> tokenize |> standardize_encoding
TokenizedNlpPipe(["Hello world 😊", "Julia is great"], [["Hello", "world", "<UNK>"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "<UNK>", "world"]), nothing)

Using the UTF-8 encoding

julia> NlpPipe(["Hello world 😊", "Julia is great"]) |> tokenize |> pipe -> standardize_encoding(pipe, encoding="UTF-8")
TokenizedNlpPipe(["Hello world 😊", "Julia is great"], [["Hello", "world", "😊"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world", "😊"]), nothing)
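For the ASCII target, the effect shown above amounts to replacing any token containing characters outside the target encoding with an unknown-token marker; a sketch (`to_ascii` is a hypothetical helper; the `<UNK>` default mirrors the output above):

```julia
# Replace tokens containing non-ASCII characters with the unknown-token marker.
to_ascii(tokens::Vector{Vector{String}}; unk::String = "<UNK>") =
    [[isascii(t) ? t : unk for t in doc] for doc in tokens]
```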

Vectorization

Preprocessing_Pipeline_JuML.bag_of_words (Function)
bag_of_words(pipe::TokenizedNlpPipe) -> VectorizedNlpPipe

Creates a bag-of-words encoding from the given TokenizedNlpPipe.

Parameters

  • pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokenized documents.

Returns

  • VectorizedNlpPipe: A new pipe object with the bag-of-words vectors.

Example Usage

julia> NlpPipe(["words one", "words two"]) |> tokenize |> bag_of_words
VectorizedNlpPipe{Int64}([[0 1 1], [1 0 1]], Dict("two" => 1, "one" => 2, "words" => 3), nothing)
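Judging from the output above, each document becomes a 1 × |vocabulary| row of token counts, with columns ordered by the vocabulary indices; a sketch reproducing that result (`bow` is a hypothetical helper):

```julia
# One count row per document, columns ordered by the vocabulary index.
function bow(docs::Vector{Vector{String}}, vocab::Dict{String, Int})
    order = first.(sort(collect(vocab); by = last))   # words sorted by index
    [permutedims([count(==(w), doc) for w in order]) for doc in docs]
end
```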
Preprocessing_Pipeline_JuML.bag_of_ngrams (Function)
bag_of_ngrams(pipe::TokenizedNlpPipe; n::Int = 1) -> VectorizedNlpPipe

Creates a bag-of-n-grams encoding from the given TokenizedNlpPipe, padding shorter documents.

Parameters

  • pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokenized documents.
  • n::Int: The n-gram size. Defaults to 1.

Returns

  • VectorizedNlpPipe: A new pipe object with the n-gram vectors.

Example Usage

julia> NlpPipe(["words one", "words two"]) |> tokenize |> bag_of_ngrams
VectorizedNlpPipe{Int64}([[1 0 0; 0 1 0], [1 0 0; 0 0 1]], Dict("two" => 3, "one" => 2, "words" => 1), nothing)
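For n > 1, contiguous tokens are combined into n-grams before counting; the extraction step can be sketched as follows (`ngrams` is an illustrative helper, and joining with spaces is an assumption):

```julia
# All contiguous n-token windows of a document, joined into strings.
ngrams(tokens::Vector{String}, n::Int) =
    [join(tokens[i:i+n-1], " ") for i in 1:length(tokens)-n+1]
```

With n = 1 this reduces to the tokens themselves, which is why the output above has one row per token position.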
Preprocessing_Pipeline_JuML.tf_idf (Function)
tf_idf(pipe::TokenizedNlpPipe; tf_weighting::String = "relative term frequency", idf_weighting::String="inverse document frequency") -> VectorizedNlpPipe

Compute the TF-IDF (Term Frequency-Inverse Document Frequency) representation of the tokenized documents in the given pipe.

Parameters

  • pipe::TokenizedNlpPipe: A pipeline containing tokenized documents.
  • tf_weighting::String: The term frequency weighting scheme. Options are "relative term frequency" (default) and "raw term frequency".
  • idf_weighting::String: The inverse document frequency weighting scheme. Options are "inverse document frequency" (default) and "smooth inverse document frequency".

Returns

  • VectorizedNlpPipe: A pipe object containing the TF-IDF vectorized representation of the documents.

Example Usage

julia> NlpPipe(["words one", "words two"]) |> tokenize |> tf_idf
VectorizedNlpPipe{Float64}([[0.0 0.0 0.0; 0.0 0.35 0.0], [0.0 0.0 0.0; 0.0 0.0 0.35]], Dict("two" => 3, "one" => 2, "words" => 1), nothing)
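The 0.35 entries above can be reproduced by hand with the default weightings: "one" and "two" each occur once in a two-token document and appear in one of the two documents, while "words" appears in both documents and therefore gets an inverse document frequency of zero:

```julia
tf  = 1 / 2        # relative term frequency: 1 occurrence in a 2-token document
idf = log(2 / 1)   # inverse document frequency: term appears in 1 of 2 documents
tfidf = tf * idf   # ≈ 0.347, shown rounded as 0.35 above

idf_words = log(2 / 2)   # "words" appears in both documents, so idf = 0.0
```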
Preprocessing_Pipeline_JuML.one_hot_encoding (Function)
one_hot_encoding(pipe::TokenizedNlpPipe) -> VectorizedNlpPipe

Creates a one-hot encoding from the given TokenizedNlpPipe.

Parameters

  • pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokenized documents.

Returns

  • VectorizedNlpPipe: The output pipe object containing the one-hot-encoded documents.

Example Usage

julia> NlpPipe(["words one", "words two"]) |> tokenize |> one_hot_encoding
VectorizedNlpPipe{Int64}([[0 0 1; 0 1 0], [0 0 1; 1 0 0]], Dict("two" => 1, "one" => 2, "words" => 3), nothing)
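As the output above suggests, each document becomes a (token count × vocabulary size) matrix with a single 1 per row, at the column given by the token's vocabulary index; a sketch (`one_hot` is a hypothetical helper):

```julia
# One row per token, one column per vocabulary entry, a single 1 per row.
function one_hot(doc_tokens::Vector{String}, vocab::Dict{String, Int})
    m = zeros(Int, length(doc_tokens), length(vocab))
    for (row, token) in enumerate(doc_tokens)
        m[row, vocab[token]] = 1
    end
    m
end
```

For example, `one_hot(["words", "one"], Dict("two" => 1, "one" => 2, "words" => 3))` reproduces the first matrix above.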