Pipe Structs

Preprocessing_Pipeline_JuML.NlpPipe (Type)
NlpPipe

A simple pipeline structure for handling text data (corpus) and corresponding labels.

Fields

  • corpus::Vector{String}: A collection of text documents.
  • labels::Union{Vector{String}, Nothing}: Optional labels corresponding to each document in corpus.

Constructors

  • NlpPipe(corpus::Vector{String}, labels::Union{Vector{String}, Nothing}) Creates an NlpPipe instance with a given corpus and optional labels. Throws an ArgumentError if the number of documents and labels do not match.

  • NlpPipe(corpus::Vector{String}) Creates an NlpPipe instance with only a corpus, setting labels to nothing.

  • NlpPipe(corpus::String) Creates an NlpPipe instance with a single document, storing it in a vector.

  • NlpPipe(previousPipe::NlpPipe; corpus::Vector{String} = previousPipe.corpus, labels::Union{Vector{String}, Nothing} = previousPipe.labels) Creates a new NlpPipe instance based on an existing one, optionally overriding corpus and labels. Throws an ArgumentError if labels is not nothing and its length does not match the corpus length.

Example Usage


Creating a pipe from a corpus with multiple documents, including labels

julia> pipe1 = NlpPipe(["document1", "document2"], ["label1", "label2"])
NlpPipe(["document1", "document2"], ["label1", "label2"])

Creating a pipe from a corpus without labels

julia> pipe2 = NlpPipe(["document3"])
NlpPipe(["document3"], nothing)

Creating a pipe from a single string corpus

julia> pipe3 = NlpPipe("single document") 
NlpPipe(["single document"], nothing)

Creating a new pipe from an existing one with modified corpus and labels

julia> NlpPipe(pipe1, corpus=["new_doc1", "new_doc2"])
NlpPipe(["new_doc1", "new_doc2"], ["label1", "label2"])
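The length check performed by the first constructor can be sketched as follows. This is a minimal illustrative re-implementation (the struct name `SketchPipe` is hypothetical), not the package's actual code:

```julia
# Minimal sketch of the corpus/labels validation described above.
struct SketchPipe
    corpus::Vector{String}
    labels::Union{Vector{String}, Nothing}
    function SketchPipe(corpus::Vector{String},
                        labels::Union{Vector{String}, Nothing} = nothing)
        # Labels, when present, must line up one-to-one with the documents.
        if labels !== nothing && length(corpus) != length(labels)
            throw(ArgumentError("number of documents and labels must match"))
        end
        new(corpus, labels)
    end
end
```

With this sketch, `SketchPipe(["doc1", "doc2"], ["label1"])` throws an `ArgumentError`, mirroring the behaviour documented for `NlpPipe`.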
Preprocessing_Pipeline_JuML.TokenizedNlpPipe (Type)
TokenizedNlpPipe

A structure for handling tokenized text data, maintaining a vocabulary and optional labels.

Fields

  • corpus::Vector{String}: A collection of original text documents.
  • tokens::Vector{Vector{String}}: Tokenized representation of each document in corpus.
  • vocabulary::Set{String}: A set of unique tokens derived from tokens.
  • labels::Union{Vector{String}, Nothing}: Optional labels corresponding to each document.

Constructors

  • TokenizedNlpPipe(corpus::Vector{String}, tokens::Vector{Vector{String}}, labels::Union{Vector{String}, Nothing}) Creates a TokenizedNlpPipe instance with a given corpus, tokenized documents, and optional labels. The vocabulary is automatically generated from tokens.

  • TokenizedNlpPipe(previousPipe::TokenizedNlpPipe; tokens::Vector{Vector{String}} = previousPipe.tokens, vocabulary::Set{String} = previousPipe.vocabulary, labels::Union{Vector{String}, Nothing} = previousPipe.labels) Creates a new TokenizedNlpPipe instance based on an existing one, allowing modifications to tokens, vocabulary, and labels while retaining the original corpus.

Example Usage


Creating a pipe from an NlpPipe instance (usual way to do it)

julia> corpus = ["Hello world", "Julia is great"]
2-element Vector{String}:
 "Hello world"
 "Julia is great"
julia> tokenizedPipe = NlpPipe(corpus) |> tokenize
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), nothing)

Creating a new pipe from scratch

julia> corpus = ["Hello world", "Julia is great"]
2-element Vector{String}:
 "Hello world"
 "Julia is great"

julia> tokens = [["Hello", "world"], ["Julia", "is", "great"]]
2-element Vector{Vector{String}}:
 ["Hello", "world"]
 ["Julia", "is", "great"]
 
julia> TokenizedNlpPipe(corpus, tokens, ["greeting", "statement"])
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), ["greeting", "statement"])

Creating a new pipe from an existing one with modified tokens

julia> pipe1 = TokenizedNlpPipe(corpus, tokens, ["greeting", "statement"])
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), ["greeting", "statement"])

julia> pipe2 = TokenizedNlpPipe(pipe1; tokens=[["Hello"], ["Julia", "is"]])
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello"], ["Julia", "is"]], Set(["great", "Hello", "is", "Julia", "world"]), ["greeting", "statement"])
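The vocabulary is derived automatically as the set of unique tokens across all documents; conceptually (a sketch, not the package's internal code):

```julia
# The vocabulary is just the set of unique tokens across all documents.
tokens = [["Hello", "world"], ["Julia", "is", "great"]]
vocabulary = Set(Iterators.flatten(tokens))
```

Note that, as the `pipe2` example above shows, overriding `tokens` through the copy constructor does not recompute the vocabulary; pass `vocabulary` explicitly if it should shrink along with the tokens.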
Preprocessing_Pipeline_JuML.VectorizedNlpPipe (Type)
VectorizedNlpPipe

A structure for handling vectorized representations of tokenized text data, including a vocabulary mapping and optional labels.

Fields

  • tokens::Vector{Matrix{T}} where T<:Real: A collection of numerical representations (e.g., embeddings, one-hot encodings) for tokenized text.
  • vocabulary::Dict{String, Int}: A dictionary mapping words to unique integer indices.
  • labels::Union{Vector{String}, Nothing}: Optional labels corresponding to each document.

Example Usage


Creating a pipe from an existing TokenizedNlpPipe instance (usual way to do it)

julia> corpus = ["Hello world", "Julia is great"]
2-element Vector{String}:
 "Hello world"
 "Julia is great"

julia> NlpPipe(corpus) |> tokenize |> one_hot_encoding # (or any other vectorization method)
VectorizedNlpPipe{Int64}([[0 1 … 0 0; 0 0 … 0 1], [0 0 … 1 0; 0 0 … 0 0; 1 0 … 0 0]], Dict("great" => 1, "Hello" => 2, "is" => 3, "Julia" => 4, "world" => 5), nothing)

Creating a pipe from scratch

julia> tokens = [[1 2; 3 4], [5 6; 7 8]]  # Example word embeddings (each document is a matrix)
2-element Vector{Matrix{Int64}}:
 [1 2; 3 4]
 [5 6; 7 8]

julia> vocab = Dict("hello" => 1, "world" => 2, "Julia" => 3)
Dict{String, Int64} with 3 entries:
  "hello" => 1
  "Julia" => 3
  "world" => 2

julia> labels = ["greeting", "statement"]
2-element Vector{String}:
 "greeting"
 "statement"

julia> VectorizedNlpPipe(tokens, vocab, labels)
VectorizedNlpPipe{Int64}([[1 2; 3 4], [5 6; 7 8]], Dict("hello" => 1, "Julia" => 3, "world" => 2), ["greeting", "statement"])

Preprocessing before tokenization

Preprocessing_Pipeline_JuML.standardize_text (Function)
standardize_text(pipe::NlpPipe) -> NlpPipe

Standardizes the text in the corpus by converting it to lowercase and replacing unusual characters with their standard counterparts.

Parameters

  • pipe::NlpPipe: An NlpPipe object containing a corpus and associated labels.

Returns

  • A new NlpPipe object with the standardized corpus and the original labels.

Example Usage

julia> NlpPipe(["Hello WORLD", "Julia is GREAT"]) |> standardize_text
NlpPipe(["hello world", "julia is great"], nothing)
Preprocessing_Pipeline_JuML.remove_noise (Function)
remove_noise(pipe::NlpPipe) -> NlpPipe

Removes noise from the corpus.

Noise includes HTML tags, URLs, email addresses, file paths, special characters, and dates & times. URLs, dates, time references, file paths, and email addresses are replaced with corresponding replacement tokens.

Parameters

  • pipe::NlpPipe: The NlpPipe object with a corpus to remove noise from.
  • replacement_patterns (keyword): Optional custom pattern => replacement pairs that override the default noise patterns, as shown in the example below.

Returns

  • NlpPipe: A new pipe object with the noise removed from the corpus

Example Usage


julia> NlpPipe(["<html>This is a test</html>"]) |> remove_noise
NlpPipe(["This is a test"], nothing)

julia> NlpPipe(["Today is 28/01/2025"]) |> remove_noise
NlpPipe(["Today is <DATE>"], nothing)

With custom replacement patterns

julia> NlpPipe(["<html>This is a test</html>"]) |> pipe -> remove_noise(pipe, replacement_patterns=[r"is a" => "🦖🫶"])
NlpPipe(["<html>This 🦖🫶 test</html>"], nothing)
Preprocessing_Pipeline_JuML.mask_numbers (Function)
mask_numbers(pipe::NlpPipe; replace_with::String="<NUM>") -> NlpPipe

Replaces all numbers in the text of the given NlpPipe corpus with a specified string.

Parameters

  • pipe::NlpPipe: The input NlpPipe object containing the corpus to be processed.
  • replace_with::String: The string to replace numbers with. Defaults to "<NUM>".

Returns

  • A new NlpPipe object with the numbers in the corpus replaced by the specified string.

Example Usage

julia> NlpPipe(["The price is 1000€."]) |> mask_numbers
NlpPipe(["The price is <NUM>€."], nothing)
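The behaviour above can be approximated with a single regex substitution; a sketch under stated assumptions (`mask_nums` is a hypothetical helper, and the package's exact number pattern may differ):

```julia
# Replace every run of digits with the replacement token.
mask_nums(doc::String; replace_with::String = "<NUM>") =
    replace(doc, r"\d+" => replace_with)
```

For example, `mask_nums("The price is 1000€.")` returns `"The price is <NUM>€."`.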
Preprocessing_Pipeline_JuML.expand_contractions (Function)
expand_contractions(input::NlpPipe) -> NlpPipe

Expands common English contractions in the corpus of the given NlpPipe.

Parameters

  • input::NlpPipe: A NlpPipe object containing the corpus to expand contractions in.

Returns

  • A new NlpPipe object with the contractions expanded in the corpus.

Example Usage

julia> NlpPipe(["I'm happy", "I've got a cat"]) |> expand_contractions
NlpPipe(["I am happy", "I have got a cat"], nothing)
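Contraction expansion is essentially a table-driven string replacement; a sketch with a deliberately tiny, illustrative table (the package's actual table covers many more contractions, and `expand` is a hypothetical helper):

```julia
# Tiny illustrative contraction table; the real one is far more complete.
const CONTRACTIONS = Dict("I'm" => "I am", "I've" => "I have", "don't" => "do not")

# Apply every known contraction => expansion pair to the document.
expand(doc::String) = reduce((d, p) -> replace(d, p), collect(CONTRACTIONS); init = doc)
```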

Tokenization

Preprocessing_Pipeline_JuML.tokenize (Function)
tokenize(pipe::NlpPipe, level::Symbol = :word) -> TokenizedNlpPipe

Tokenizes the documents in the corpus of the given NlpPipe object. The level parameter sets the tokenization granularity.

Parameters

  • pipe::NlpPipe: An NlpPipe object containing a corpus of documents.
  • level::Symbol: The tokenization level, either :word (default) or :character.

Returns

  • TokenizedNlpPipe: A new pipe object with the tokenized documents.

Example Usage

julia> NlpPipe(["Hello world", "Julia is great"]) |> tokenize
TokenizedNlpPipe(["Hello world", "Julia is great"], [["Hello", "world"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world"]), nothing)
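The two levels correspond roughly to splitting on whitespace versus splitting into individual characters; a sketch of the distinction (helper names are illustrative):

```julia
word_tokens(doc::String) = String.(split(doc))     # :word level
char_tokens(doc::String) = string.(collect(doc))   # :character level
```

For example, `word_tokens("Hello world")` gives `["Hello", "world"]`, while `char_tokens("Hi")` gives `["H", "i"]`.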

Preprocessing after tokenization

Preprocessing_Pipeline_JuML.remove_stop_words (Function)
remove_stop_words(pipe::TokenizedNlpPipe; language::String="en", stop_words::Set{String}=Set{String}()) -> TokenizedNlpPipe

Removes predefined stop words from the tokenized documents. You can access the stop words for a given language using the language name or ISO 639 code.

For example, to get the stop words for English, you can use stopwords["eng"], stopwords["en"], or stopwords["English"]. Stop words sourced from StopWords.jl.

Parameters

  • pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokens to be processed.
  • language::String = "en": Defaults to English; any language supported by StopWords.jl may be given by name or ISO 639 code.
  • stop_words::Set{String} = Set{String}(): Defaults to the StopWords.jl stop-word set for the given language; pass your own set to override it.

Returns

  • TokenizedNlpPipe: A new pipe object with the stop words removed from the tokens.

Example Usage


Removing stop words from a tokenized pipe (default stop words)

julia> NlpPipe(["This is a dinosaur"]) |> tokenize |> remove_stop_words |> pipe -> pipe.tokens
1-element Vector{Vector{String}}:
 ["This", "dinosaur"]

Using custom stop words

julia> NlpPipe(["This is a dinosaur"]) |> tokenize |> pipe -> remove_stop_words(pipe, stop_words=Set(["This", "dinosaur"])) |> pipe -> pipe.tokens
1-element Vector{Vector{String}}:
 ["is", "a"]
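The core of the operation is a per-document filter against the stop-word set; a sketch (`strip_stops` is a hypothetical helper). Note that in the default example above "This" survives, which suggests matching is case-sensitive:

```julia
# Keep only the tokens that are not in the stop-word set.
strip_stops(tokens::Vector{Vector{String}}, stops::Set{String}) =
    [filter(t -> t ∉ stops, doc) for doc in tokens]
```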
Preprocessing_Pipeline_JuML.stemming (Function)
stemming(pipe::TokenizedNlpPipe; language::String="english") -> TokenizedNlpPipe

Reduces words to their roots by removing prefixes and suffixes. The stemmers are provided by SnowballStemmer.jl.

Parameters

  • pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokens to be processed.
  • language::String = "english": Defaults to English; other languages supported by the Snowball stemmer are possible.

Returns

  • TokenizedNlpPipe: A new pipe object with the stemmed tokens.

Example Usage


Applying stemming with the default language (English)

julia> NlpPipe(["This is a test for stemming"]) |> tokenize |> stemming
TokenizedNlpPipe(["This is a test for stemming"], [["This", "is", "a", "test", "for", "stem"]], Set(["test", "is", "This", "stem", "a", "for"]), nothing)
Preprocessing_Pipeline_JuML.standardize_encoding (Function)
standardize_encoding(pipe::TokenizedNlpPipe; encoding::String = "ASCII") -> TokenizedNlpPipe

Standardizes the encoding of the tokens in the corpus.

Parameters

  • pipe::TokenizedNlpPipe: A TokenizedNlpPipe object containing a corpus and associated labels.
  • encoding::String: The target encoding, either "ASCII" or "UTF-8" (default is "ASCII").

Returns

  • TokenizedNlpPipe: A pipe object with the standardized corpus and the original labels.

Example Usage


Using the default encoding

julia> NlpPipe(["Hello world 😊", "Julia is great"]) |> tokenize |> standardize_encoding
TokenizedNlpPipe(["Hello world 😊", "Julia is great"], [["Hello", "world", "<UNK>"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "<UNK>", "world"]), nothing)

Using the UTF-8 encoding

julia> NlpPipe(["Hello world 😊", "Julia is great"]) |> tokenize |> pipe -> standardize_encoding(pipe, encoding="UTF-8")
TokenizedNlpPipe(["Hello world 😊", "Julia is great"], [["Hello", "world", "😊"], ["Julia", "is", "great"]], Set(["great", "Hello", "is", "Julia", "world", "😊"]), nothing)
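For the ASCII target, the effect shown above amounts to replacing any token containing characters outside the target encoding with an unknown-token marker; a sketch (`to_ascii` is a hypothetical helper; the `<UNK>` default mirrors the output above):

```julia
# Replace tokens containing non-ASCII characters with the unknown-token marker.
to_ascii(tokens::Vector{Vector{String}}; unk::String = "<UNK>") =
    [[isascii(t) ? t : unk for t in doc] for doc in tokens]
```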

Vectorization

Preprocessing_Pipeline_JuML.bag_of_words (Function)
bag_of_words(pipe::TokenizedNlpPipe) -> VectorizedNlpPipe

Creates a bag-of-words encoding from the given TokenizedNlpPipe.

Parameters

  • pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokenized documents.

Returns

  • VectorizedNlpPipe: A new pipe object with the bag-of-words vectors.

Example Usage

julia> NlpPipe(["words one", "words two"]) |> tokenize |> bag_of_words
VectorizedNlpPipe{Int64}([[0 1 1], [1 0 1]], Dict("two" => 1, "one" => 2, "words" => 3), nothing)
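Judging from the output above, each document becomes a 1 × |vocabulary| row of token counts, with columns ordered by the vocabulary indices; a sketch reproducing that result (`bow` is a hypothetical helper):

```julia
# One count row per document, columns ordered by the vocabulary index.
function bow(docs::Vector{Vector{String}}, vocab::Dict{String, Int})
    order = first.(sort(collect(vocab); by = last))   # words sorted by index
    [permutedims([count(==(w), doc) for w in order]) for doc in docs]
end
```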
Preprocessing_Pipeline_JuML.bag_of_ngrams (Function)
bag_of_ngrams(pipe::TokenizedNlpPipe; n::Int = 1) -> VectorizedNlpPipe

Creates a bag-of-n-grams encoding from the given TokenizedNlpPipe, padding shorter documents.

Parameters

  • pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokenized documents.
  • n::Int: The n-gram size. Defaults to 1.

Returns

  • VectorizedNlpPipe: A new pipe object with the n-gram vectors.

Example Usage

julia> NlpPipe(["words one", "words two"]) |> tokenize |> bag_of_ngrams
VectorizedNlpPipe{Int64}([[1 0 0; 0 1 0], [1 0 0; 0 0 1]], Dict("two" => 3, "one" => 2, "words" => 1), nothing)
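For n > 1, contiguous tokens are combined into n-grams before counting; the extraction step can be sketched as follows (`ngrams` is an illustrative helper, and joining with spaces is an assumption):

```julia
# All contiguous n-token windows of a document, joined into strings.
ngrams(tokens::Vector{String}, n::Int) =
    [join(tokens[i:i+n-1], " ") for i in 1:length(tokens)-n+1]
```

With n = 1 this reduces to the tokens themselves, which is why the output above has one row per token position.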
Preprocessing_Pipeline_JuML.tf_idf (Function)
tf_idf(pipe::TokenizedNlpPipe; tf_weighting::String = "relative term frequency", idf_weighting::String="inverse document frequency") -> VectorizedNlpPipe

Compute the TF-IDF (Term Frequency-Inverse Document Frequency) representation of the tokenized documents in the given pipe.

Parameters

  • pipe::TokenizedNlpPipe: A pipeline containing tokenized documents.
  • tf_weighting::String: The term frequency weighting scheme. Options are "relative term frequency" (default) and "raw term frequency".
  • idf_weighting::String: The inverse document frequency weighting scheme. Options are "inverse document frequency" (default) and "smooth inverse document frequency".

Returns

  • VectorizedNlpPipe: A pipe object containing the TF-IDF vectorized representation of the documents.

Example Usage

julia> NlpPipe(["words one", "words two"]) |> tokenize |> tf_idf
VectorizedNlpPipe{Float64}([[0.0 0.0 0.0; 0.0 0.35 0.0], [0.0 0.0 0.0; 0.0 0.0 0.35]], Dict("two" => 3, "one" => 2, "words" => 1), nothing)
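The 0.35 entries above can be reproduced by hand with the default weightings: "one" and "two" each occur once in a two-token document and appear in one of the two documents, while "words" appears in both documents and therefore gets an inverse document frequency of zero:

```julia
tf  = 1 / 2        # relative term frequency: 1 occurrence in a 2-token document
idf = log(2 / 1)   # inverse document frequency: term appears in 1 of 2 documents
tfidf = tf * idf   # ≈ 0.347, shown rounded as 0.35 above

idf_words = log(2 / 2)   # "words" appears in both documents, so idf = 0.0
```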
Preprocessing_Pipeline_JuML.one_hot_encoding (Function)
one_hot_encoding(pipe::TokenizedNlpPipe) -> VectorizedNlpPipe

Creates a one-hot encoding from the given TokenizedNlpPipe.

Parameters

  • pipe::TokenizedNlpPipe: The input TokenizedNlpPipe object containing the tokenized documents.

Returns

  • VectorizedNlpPipe: The output pipe object containing the one-hot-encoded documents.

Example Usage

julia> NlpPipe(["words one", "words two"]) |> tokenize |> one_hot_encoding
VectorizedNlpPipe{Int64}([[0 0 1; 0 1 0], [0 0 1; 1 0 0]], Dict("two" => 1, "one" => 2, "words" => 3), nothing)
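As the output above suggests, each document becomes a (token count × vocabulary size) matrix with a single 1 per row, at the column given by the token's vocabulary index; a sketch (`one_hot` is a hypothetical helper):

```julia
# One row per token, one column per vocabulary entry, a single 1 per row.
function one_hot(doc_tokens::Vector{String}, vocab::Dict{String, Int})
    m = zeros(Int, length(doc_tokens), length(vocab))
    for (row, token) in enumerate(doc_tokens)
        m[row, vocab[token]] = 1
    end
    m
end
```

For example, `one_hot(["words", "one"], Dict("two" => 1, "one" => 2, "words" => 3))` reproduces the first matrix above.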