Positional Vectorizer is a scikit-learn transformer that converts text into a bag-of-words vector, using a positional ranking algorithm to compute each score

Project description

Positional Vectorizer

The Positional Vectorizer is a scikit-learn transformer designed to convert text into a bag-of-words vector, using a positional ranking algorithm to assign scores. Like scikit-learn's CountVectorizer and TfidfVectorizer, it assigns a value to each dimension; here, that value is based on the term's position in the original text.

How to use

pip install positional-vectorizer

Using it to generate text vectors

from positional_vectorizer import PositionalVectorizer

input_texts = ["my text here", "other text here"]

vectorizer = PositionalVectorizer()
vectorizer.fit(input_texts)

encoded_texts = vectorizer.transform(input_texts)

Using with scikit-learn pipeline

from positional_vectorizer import PositionalVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([
    ('vect', PositionalVectorizer(ngram_range=(1, 2))),
    ('clf', SGDClassifier(random_state=42, loss='modified_huber'))
])

pipeline.fit(X_train, y_train)

Why this new vectorizer?

Text embeddings based on bag-of-words using count, binary, or TF-IDF normalization are highly effective in most scenarios. However, in certain cases, such as Romance (Latin-derived) languages like Portuguese, the position of terms becomes crucial, and these techniques fail to capture it.

For instance, consider the importance of word position in a Portuguese classification task distinguishing between a smartphone device and a smartphone accessory. In traditional bag-of-words approaches with stop words removed, the following titles yield identical representations:

  • "xiaomi com fone de ouvido" => {"xiaomi", "fone", "ouvido"}
  • "fone de ouvido do xiaomi" => {"xiaomi", "fone", "ouvido"}

As demonstrated, the order of words significantly alters the meaning, but this meaning is not reflected in the vectorization.

One common workaround is to employ n-grams instead of single words, but this can inflate the feature dimensionality, potentially increasing the risk of overfitting.

How it works

The value in each dimension is calculated as 1 / math.log(rank + 1) (similar to the Discounted Cumulative Gain formula), where rank denotes the position of the corresponding term in the text, starting from 1.

If a term appears multiple times in the text, only its lowest rank is taken into account.
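The scoring rule can be sketched in a few lines of plain Python. This is only an illustration of the formula, assuming simple whitespace tokenization (the real vectorizer uses scikit-learn's analyzer machinery):

```python
import math

def positional_scores(text):
    """Sketch of the positional ranking: score = 1 / log(rank + 1),
    where rank is the 1-based position of the term's first occurrence."""
    scores = {}
    for rank, term in enumerate(text.split(), start=1):
        # A repeated term keeps the score of its lowest (earliest) rank
        scores.setdefault(term, 1 / math.log(rank + 1))
    return scores

print(positional_scores("fone de ouvido do xiaomi"))
# 'fone' (rank 1) scores 1/log(2), while 'xiaomi' (rank 5) scores 1/log(6),
# so swapping the word order of the two titles above now yields different vectors
```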

TODO

  • Test the common parameters of _VectorizerMixin to identify potential issues when upgrading scikit-learn. Currently, only the ngram_range and analyzer parameters are automatically tested.
  • Implement the max_features parameter.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

positional_vectorizer-0.0.9.tar.gz (5.1 kB view details)


File details

Details for the file positional_vectorizer-0.0.9.tar.gz.

File metadata

  • Download URL: positional_vectorizer-0.0.9.tar.gz
  • Upload date:
  • Size: 5.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.9.19

File hashes

Hashes for positional_vectorizer-0.0.9.tar.gz
  • SHA256: 32f321240ea74e2f9fb61ce30c4c522a93e35f666bdc5ce20cca0f6ea7b8cf15
  • MD5: 77d6d9556df84c63a94720b915d497ac
  • BLAKE2b-256: 3b70131f25eebc618decca060d0e47130c8a8856384ad7a5b41b091394375e77

