Installation in the REPL
Install the package by running the following commands in the Pkg REPL (press ] at the julia> prompt to enter it):
activate --temp
add https://github.com/michellekappl/Preprocessing_Pipeline_JuML
This will install the required dependencies and make the package available in your Julia REPL.
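If you prefer to script the installation (for example in a setup file), the same two steps can be done through the Pkg API. A minimal sketch using only standard Pkg functions:
import Pkg
Pkg.activate(temp = true)   # same as `activate --temp`: a throwaway environment
Pkg.add(url = "https://github.com/michellekappl/Preprocessing_Pipeline_JuML")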
Quickstart
1. Load the package:
using Preprocessing_Pipeline_JuML
2. Prepare a test corpus by defining a set of noisy text samples for preprocessing:
test_corpus = [
    "Hello <b>world</b>! Visit http://example.com.",
    "Email me: test@example.com or call +123-456-7890.",
    "Today is 12/25/2024, time now: 10:30AM.",
    "My file is at C:\\Users\\JohnDoe\\Documents\\file.txt.",
    "Check this out: www.awesome-website.org/about-us.html!",
    "#JuliaLang is great. Follow us @JuliaNLP."
]
6-element Vector{String}:
"Hello <b>world</b>! Visit http://example.com."
"Email me: test@example.com or call +123-456-7890."
"Today is 12/25/2024, time now: 10:30AM."
"My file is at C:\\Users\\JohnDoe\\Documents\\file.txt."
"Check this out: www.awesome-website.org/about-us.html!"
"#JuliaLang is great. Follow us @JuliaNLP."
3. Build your pipeline:
pipe = NlpPipe(test_corpus) |> remove_noise |> tokenize |> tf_idf
VectorizedNlpPipe{Float64}([[0.44793986730701374 0.0 … 0.0 0.0; 0.0 0.44793986730701374 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.358351893845611 0.0; 0.0 0.0 … 0.0 0.358351893845611]], Dict("now" => 15, "<TIME>" => 16, "call" => 9, "time" => 14, "is" => 12, "1234567890" => 10, "<EMAIL>" => 7, "at" => 19, "<PATH>" => 20, "Check" => 21…), nothing)
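The |> operator simply threads the result of one stage into the next, so the chain above is equivalent to plain nested function calls. A sketch of the same pipeline written step by step (each stage returns a new pipe object, as the piped version implies):
pipe = NlpPipe(test_corpus)   # wrap the raw corpus in a pipe
pipe = remove_noise(pipe)     # masks URLs, e-mails, dates, times and paths (see the <URL>, <EMAIL>, ... tokens in the vocabulary below)
pipe = tokenize(pipe)         # split each document into tokens
pipe = tf_idf(pipe)           # vectorize the tokens with TF-IDF weights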
4. View the vectors produced by the pipeline. After tf_idf, the tokens field holds one TF-IDF weight matrix per document:
@info pipe.tokens
[ Info: [[0.44793986730701374 0.0 … 0.0 0.0; 0.0 0.44793986730701374 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.358351893845611 0.0; 0.0 0.0 … 0.0 0.358351893845611]]
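Because pipe.tokens is an ordinary Vector holding one matrix per document (one row per token, one column per vocabulary entry), it can be inspected with standard Julia functions, for example:
length(pipe.tokens)    # 6 matrices, one per document in the corpus
size(pipe.tokens[1])   # (4, 27): 4 tokens in the first sample, 27 vocabulary entries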
5. View the vocabulary generated during vectorization:
@info pipe.vocabulary
[ Info: Dict("now" => 15, "<TIME>" => 16, "call" => 9, "time" => 14, "is" => 12, "1234567890" => 10, "<EMAIL>" => 7, "at" => 19, "<PATH>" => 20, "Check" => 21, "Hello" => 1, "or" => 8, "this" => 22, "Today" => 11, "out" => 23, "<DATE>" => 13, "Follow" => 25, "us" => 26, "world" => 2, "Visit" => 3, "me" => 6, "great" => 24, "Email" => 5, "My" => 17, "file" => 18, "<URL>" => 4, "JuliaNLP" => 27)
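The vocabulary maps every token to its column index in the matrices above, so individual entries can be looked up directly:
pipe.vocabulary["Hello"]     # 1
pipe.vocabulary["<EMAIL>"]   # 7, the column used for masked e-mail addresses
length(pipe.vocabulary)      # 27 distinct tokens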
6. (optional) View the labels:
@info pipe.labels
[ Info: nothing
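The labels field is nothing here because the pipeline was built from a bare corpus without labels; a quick check:
isnothing(pipe.labels)   # true; no labels were supplied when the pipe was created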
Run the tests
To run the tests, open a Julia REPL, activate the project, and press the ] key to enter the Pkg REPL. Then run test to execute the test suite. The output should look like this:
(@v1.6) pkg> test
Test Summary:               | Pass  Total  Time
Testing:                    |   52     52  0.6s
  NlpPipe Tests             |   10     10  0.1s
  TokenizedNlpPipe Tests    |    8      8  0.1s
  Remove Stop Words Tests   |    3      3  0.1s
  Expand Contractions Tests |    3      3  0.1s
  Mask Numbers Tests        |    6      6  0.0s
  Remove Noise Tests        |    6      6  0.0s
  Standardize Text Tests    |    2      2  0.0s
  OneHotEncoding Tests      |    2      2  0.1s
  Bag of Words Tests        |    4      4  0.0s
  BagOfNGrams Tests         |    8      8  0.1s
Testing Preprocessing_Pipeline_JuML tests passed
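The suite can also be run non-interactively (for example from a script or a CI job) through the Pkg API; a minimal sketch, assuming you are in the package's root directory:
import Pkg
Pkg.activate(".")   # activate the package's own project
Pkg.test()          # equivalent to running `test` in the Pkg REPL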