Preprocessing_Pipeline_JuML


Preprocessing_Pipeline_JuML is a Julia package for preprocessing text data in NLP pipelines.

API Structure

The package provides a set of pipeline stages that can be chained together to preprocess text data. The pipeline stages are implemented as functions that take a NlpPipe or TokenizedNlpPipe struct as input and return a modified object of the same type. This makes it easy to build custom preprocessing pipelines by piping together the desired stages.
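This stage-chaining pattern can be sketched in plain Julia. The struct and stage bodies below are illustrative stand-ins, not the package's actual implementation: each stage takes a pipe struct and returns a modified copy, so stages compose naturally with Julia's `|>` operator.

```julia
# Minimal sketch of the pipeline pattern: every stage maps
# a pipe struct to a new pipe struct of the same type.
struct MiniPipe
    corpus::Vector{String}
end

# A stage that strips punctuation from every document.
remove_noise(pipe::MiniPipe) =
    MiniPipe([replace(doc, r"[[:punct:]]" => "") for doc in pipe.corpus])

# A stage that lowercases every document.
standardize(pipe::MiniPipe) =
    MiniPipe([lowercase(doc) for doc in pipe.corpus])

pipe = MiniPipe(["Hello, world!"]) |> remove_noise |> standardize
println(pipe.corpus)  # ["hello world"]
```

Because every stage has the same input and output type, any subset of stages can be reordered or dropped without changing the call pattern.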

Overview

Pipeline Diagram

Features and Objects

Features (for a detailed explanation, visit here)

  • Text preprocessing: Prepare textual data for machine learning tasks. Preprocessing steps include:
    • applied before tokenization:
      • expansion of contractions
      • masking of numbers
      • noise removal (punctuation, special characters, phone numbers, e-mail addresses, ...)
      • text standardization (lowercasing, removal of ambiguous characters)
    • applied after tokenization:
      • stopword removal
      • stemming
      • standardization of token encoding
  • Tokenization: Split text into words or characters.
  • Vectorization: Transform text into machine-learning-compatible vector representations:
    • one-hot encoding
    • Bag of Words (BoW)
    • Bag of N-Grams
    • Term Frequency-Inverse Document Frequency (TF-IDF)
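To make one of these vectorization schemes concrete, here is a self-contained Bag-of-Words sketch in plain Julia. It is illustrative only and not the package's implementation: each tokenized document becomes a vector of token counts over a shared vocabulary.

```julia
# Illustrative Bag-of-Words vectorization: build a vocabulary over all
# documents, then count each token's occurrences per document.
function bag_of_words(docs::Vector{Vector{String}})
    vocab = Dict{String,Int}()
    for doc in docs, tok in doc
        get!(vocab, tok, length(vocab) + 1)  # assign next free index
    end
    vectors = [zeros(Int, length(vocab)) for _ in docs]
    for (i, doc) in enumerate(docs), tok in doc
        vectors[i][vocab[tok]] += 1
    end
    return vectors, vocab
end

docs = [["hello", "world", "hello"], ["how", "are", "you"]]
vectors, vocab = bag_of_words(docs)
println(vectors[1])  # counts "hello" twice and "world" once
```

One-hot encoding and TF-IDF follow the same shape: the same vocabulary lookup, with counts replaced by 0/1 indicators or by frequency weights.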

Pipe Objects (to learn more, visit here)

NlpPipe The first struct to instantiate in a pipeline. It can be created directly from a text corpus and can be

  1. used in preprocessing stages that do not require the text to be tokenized, or
  2. transformed into a TokenizedNlpPipe by applying the tokenize function.

TokenizedNlpPipe Struct that holds tokenized text data. It can be used for preprocessing stages that require tokenized text (e.g., stopword removal or stemming) and can be transformed into a VectorizedNlpPipe by applying any vectorization function.

VectorizedNlpPipe Struct that holds vectorized text data (embeddings). Can be used for machine learning tasks.
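The progression between these three structs can be sketched roughly as below. This is a hedged sketch: the field names (`corpus`, `tokens`, `vocab_dict`, `labels`) are assumptions inferred from the usage example, not the package's actual definitions.

```julia
# Hypothetical field layouts for the three pipe structs
# (field names are assumptions, not the package source).
struct NlpPipe
    corpus::Vector{String}             # raw documents
end

struct TokenizedNlpPipe
    tokens::Vector{Vector{String}}     # one token list per document
end

struct VectorizedNlpPipe{T}
    tokens::Vector{Matrix{T}}          # one embedding matrix per document
    vocab_dict::Dict{String,Int}       # token -> vocabulary index
    labels::Union{Nothing,Vector}      # optional labels for ML tasks
end
```

Each preprocessing or vectorization stage moves data one step along this chain, which is why the structs can be piped together as shown in the usage example.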

Usage Example

corpus = ["Hello, world!", "How are you?"]
NlpPipe(corpus) |> remove_noise |> tokenize |> one_hot_encoding

# output:
# VectorizedNlpPipe{Int64}([[1 0 … 0 0; 0 0 … 0 1], [0 1 … 0 0; 0 0 … 1 0; 0 0 … 0 0]], Dict("Hello" => 1, "How" => 2, "you" => 3, "are" => 4, "world" => 5), nothing)