Reason this release was yanked: pre-release
Presidio analyzer
Description
The Presidio analyzer is a Python-based service for detecting PII entities in text.
During analysis, it runs a set of different PII Recognizers, each one in charge of detecting one or more PII entities using different mechanisms.
Presidio analyzer comes with a set of predefined recognizers, but can easily be extended with other types of custom recognizers. Predefined and custom recognizers leverage regex, spaCy and other types of logic to detect PII in unstructured text.
Deploy Presidio analyzer to Azure
Use the following button to deploy Presidio analyzer to your Azure subscription.
TODO: change this link to main branch once merged (#2765).
Installation
To get started with Presidio-analyzer, download the package and the en_core_web_lg spaCy model, preferably in a virtual environment like Conda.
pip install presidio-analyzer
python -m spacy download en_core_web_lg
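
To confirm the model downloaded correctly, one quick check is to load it with spaCy directly (a minimal sketch; it raises OSError if the model is missing):

import spacy

nlp = spacy.load("en_core_web_lg")  # fails if the download step was skipped
print(nlp.pipe_names)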
Getting started
Running Presidio as an HTTP server
You can run Presidio analyzer as an HTTP server, using either the Python runtime or a Docker container.
Using python runtime
cd presidio-analyzer
python app.py
curl -d '{"text":"John Smith drivers license is AC432223", "language":"en"}' -H "Content-Type: application/json" -X POST http://localhost:3000/analyze
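
The same request can also be sent from Python; a minimal sketch using the requests library, mirroring the curl call above:

import requests

response = requests.post(
    "http://localhost:3000/analyze",
    json={"text": "John Smith drivers license is AC432223", "language": "en"},
)
print(response.json())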
Using docker container
cd presidio-analyzer
docker build -t presidio-analyzer --build-arg NAME=presidio-analyzer .
docker run -p 5001:5001 presidio-analyzer
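
Once the container is running, the same request from the previous example can be sent to the mapped port (5001 here):

curl -d '{"text":"John Smith drivers license is AC432223", "language":"en"}' -H "Content-Type: application/json" -X POST http://localhost:5001/analyze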
Simple analysis script
from presidio_analyzer import AnalyzerEngine
# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()
# Call analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
entities=["PHONE_NUMBER"],
language='en')
print(results)
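
Each item in results is a RecognizerResult; assuming its standard attributes (entity_type, start, end, score), the detected spans can be inspected individually:

for result in results:
    print(result.entity_type, result.start, result.end, result.score)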
Customizing Presidio analyzer
Presidio can be extended to support new types of PII entities, and to support additional languages.
The three main modules are the AnalyzerEngine, the RecognizerRegistry, and the EntityRecognizer.
- The AnalyzerEngine is in charge of calling each requested recognizer.
- The RecognizerRegistry is in charge of providing the list of predefined and custom recognizers for analysis.
- The EntityRecognizer class can be extended to support new types of PII recognition logic.
Extending the analyzer for additional PII entities
To implement a new recognizer in code, follow these two steps:
First, create a class based on EntityRecognizer.
Second, add the new recognizer to the recognizer registry, so that the AnalyzerEngine can use it during analysis.
Simple example
For simple recognizers based on regular expressions or deny-lists, we can leverage the provided PatternRecognizer:
from presidio_analyzer import PatternRecognizer
titles_recognizer = PatternRecognizer(supported_entity="TITLE",
deny_list=["Mr.","Mrs.","Miss"])
Calling the recognizer itself:
titles_recognizer.analyze(text="Mr. Schmidt", entities=["TITLE"])
Adding to the list of recognizers:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
text="His name is Mr. Jones"
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
# Add new recognizer
registry.add_recognizer(titles_recognizer)
# Set up analyzer with our updated recognizer registry
analyzer = AnalyzerEngine(registry=registry)
results = analyzer.analyze(text=text,language="en")
print(results)
Alternatively, we can add the recognizer to the existing analyzer:
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(titles_recognizer)
results = analyzer.analyze(text=text,language="en")
print(results)
Creating a new EntityRecognizer in code
There are various types of recognizers in Presidio:
- EntityRecognizer: the base class.
- PatternRecognizer: for regex and deny-list based detection.
- LocalRecognizer: a base class for all recognizers living within the same process as the AnalyzerEngine.
- RemoteRecognizer: a base class for accessing external recognizers, such as 3rd party services or ML models served outside the main Presidio Python process.
To create a new recognizer via code:
1. Create a new Python class which implements LocalRecognizer. (LocalRecognizer implements the base EntityRecognizer class.) This class has the following functions:

   i. load: load a model / resource to be used during recognition

      def load(self)

   ii. analyze: the main function to be called for getting entities out of the new recognizer:

      def analyze(self, text, entities, nlp_artifacts)

   Notes:

   - Each recognizer has access to different NLP assets such as tokens, lemmas, and more. These are given through the nlp_artifacts parameter. Refer to the code documentation for more information.
   - The analyze method should return a list of RecognizerResult.

2. Add it to the recognizer registry using registry.add_recognizer(my_recognizer).
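
Putting these pieces together, here is a minimal sketch of a custom recognizer. The EMPLOYEE_ID entity, its pattern, and its fixed score are hypothetical, and it assumes LocalRecognizer and RecognizerResult can be imported from the package root:

import re

from presidio_analyzer import LocalRecognizer, RecognizerResult

class EmployeeIdRecognizer(LocalRecognizer):
    """Detects a hypothetical EMPLOYEE_ID entity of the form EMP-12345."""

    def __init__(self):
        super().__init__(supported_entities=["EMPLOYEE_ID"])

    def load(self):
        # Nothing to load for this sketch; a real recognizer might
        # load an ML model or other resource here.
        pass

    def analyze(self, text, entities, nlp_artifacts=None):
        # nlp_artifacts (tokens, lemmas, etc.) is unused here, but is
        # provided when the recognizer is invoked through the AnalyzerEngine.
        results = []
        for match in re.finditer(r"\bEMP-\d{5}\b", text):
            results.append(
                RecognizerResult(
                    entity_type="EMPLOYEE_ID",
                    start=match.start(),
                    end=match.end(),
                    score=0.8,  # arbitrary confidence for this sketch
                )
            )
        return results

Registering it with registry.add_recognizer(EmployeeIdRecognizer()) then makes it available to the AnalyzerEngine, as in the examples above.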
Multi-language support
Presidio supports PII detection in multiple languages. In its default configuration, it contains recognizers and models for English. To configure Presidio to detect PII in additional languages, these modules require modification:
- The NlpEngine, containing the NLP model which performs tokenization, lemmatization, Named Entity Recognition and other NLP tasks.
- PII recognizers (different EntityRecognizer objects) should be adapted or created.
While different detection mechanisms such as regular expressions are language agnostic, the context words used to increase the PII detection confidence aren't. Consider updating the list of context words for each recognizer to leverage context words in additional languages.
Configuring the NLP Engine
As its internal NLP engine, Presidio supports both spaCy and Stanza. To set up new models, follow these two steps:
1. Download the spaCy/Stanza NER models for your desired language.

   - To download a new model with spaCy:

     python -m spacy download es_core_news_md

     In this example we download the medium-size model for Spanish.

   - To download a new model with Stanza:

     import stanza
     stanza.download("en")  # where en is the language code of the model

   For the available models, follow these links: spaCy, stanza.
2. Update the models configuration in one of two ways:

   - Via code: Create an NlpEngine using the NlpEngineProvider class, and pass it to the AnalyzerEngine as input:

     from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
     from presidio_analyzer.nlp_engine import NlpEngineProvider

     # Create configuration containing engine name and models
     configuration = {
         "nlp_engine_name": "spacy",
         "models": [
             {"lang_code": "es", "model_name": "es_core_news_md"},
             {"lang_code": "en", "model_name": "en_core_web_lg"},
         ],
     }

     # Create NLP engine based on configuration
     provider = NlpEngineProvider(nlp_configuration=configuration)
     nlp_engine_with_spanish = provider.create_engine()

     # Pass the created NLP engine and supported_languages to the AnalyzerEngine
     analyzer = AnalyzerEngine(
         nlp_engine=nlp_engine_with_spanish, supported_languages=["en", "es"]
     )

     # Analyze in different languages
     results_spanish = analyzer.analyze(text="Mi nombre es David", language="es")
     print(results_spanish)

     results_english = analyzer.analyze(text="My name is David", language="en")
     print(results_english)
   - Via configuration: Set up the models which should be used in the default conf file. An example conf file:

     nlp_engine_name: spacy
     models:
       - lang_code: en
         model_name: en_core_web_lg
       - lang_code: es
         model_name: es_core_news_md

     The default conf file is read during the default initialization of the AnalyzerEngine. Alternatively, the path to a custom configuration file can be passed to the NlpEngineProvider:

     from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
     from presidio_analyzer.nlp_engine import NlpEngineProvider

     # Create NLP engine based on configuration file
     provider = NlpEngineProvider(conf_file="PATH_TO_YAML")
     nlp_engine_with_spanish = provider.create_engine()

     # Pass created NLP engine and supported_languages to the AnalyzerEngine
     analyzer = AnalyzerEngine(
         nlp_engine=nlp_engine_with_spanish, supported_languages=["en", "es"]
     )

     # Analyze in different languages
     results_spanish = analyzer.analyze(text="Mi nombre es David", language="es")
     print(results_spanish)

     results_english = analyzer.analyze(text="My name is David", language="en")
     print(results_english)
In this example we create an NlpEngine holding two spaCy models (one in English: en_core_web_lg and one in Spanish: es_core_news_md), define the supported_languages parameter accordingly, and can send requests in each of these languages.
Set up language-specific recognizers
Recognizers are language dependent, either by their logic or by the context words used while scanning the surroundings of a detected entity. As these context words are used to increase the score, they should be in the expected input language.
Consider updating the context words of existing recognizers or adding new recognizers to support new languages. Each recognizer can support one language. For example:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.predefined_recognizers import EmailRecognizer
# Setting up an English Email recognizer:
email_recognizer_en = EmailRecognizer(supported_language="en",context=["email","mail"])
# Setting up a Spanish Email recognizer
email_recognizer_es = EmailRecognizer(supported_language="es",context=["correo","electrónico"])
registry = RecognizerRegistry()
# Add recognizers to registry
registry.add_recognizer(email_recognizer_en)
registry.add_recognizer(email_recognizer_es)
# Set up analyzer with our updated recognizer registry
analyzer = AnalyzerEngine(
registry=registry,
supported_languages=["en","es"],
nlp_engine=nlp_engine_with_spanish)
analyzer.analyze(...)
Automatically install NLP models into the Docker container
When packaging the code into a Docker container, NLP models are automatically installed.
To define which models should be installed, update the conf/default.yaml file. This file is read during the docker build phase, and the models defined in it are installed automatically.
HTTP API
/analyze
Analyzes a text. Method: POST
Parameters
Name | Type | Optional | Description
---|---|---|---
text | string | no | the text to analyze
language | string | no | 2-character language code, e.g., en, de
correlation_id | string | yes | a correlation id to append to headers and traces
score_threshold | float | yes | the minimal score threshold
entities | string[] | yes | a list of entities to analyze
trace | bool | yes | whether to trace the request
remove_interpretability_response | bool | yes | whether to remove the analysis explanation from the response
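
For instance, a request restricting the entities and raising the score threshold; a sketch using the parameters from the table above (the threshold value is illustrative):

import requests

response = requests.post(
    "http://localhost:3000/analyze",
    json={
        "text": "My phone number is 212-555-5555",
        "language": "en",
        "entities": ["PHONE_NUMBER"],
        "score_threshold": 0.5,
    },
)
print(response.json())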
/recognizers
Returns a list of supported recognizers.
Method: GET
Parameters
Name | Type | Optional | Description
---|---|---|---
language | string | yes | 2-character language code, e.g., en, de
/supportedentities
Returns a list of supported entities. Method: GET
Parameters
Name | Type | Optional | Description
---|---|---|---
language | string | yes | 2-character language code, e.g., en, de
File details

Details for the file presidio_analyzer-1.11.0-py3-none-any.whl.

File metadata

- Download URL: presidio_analyzer-1.11.0-py3-none-any.whl
- Size: 50.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.56.2 CPython/3.8.7

File hashes

Algorithm | Hash digest
---|---
SHA256 | 8745b5f765e3444c1ceea7c3801de25d8f18a969d9c4ffa71320dd3d5d6b13c6
MD5 | e07bf6d902c75c2887554002aeea788e
BLAKE2b-256 | 95905187d399371b38a33c1c83d023f3c06e2b4abdb7dea710e136af56803b8e