Detect duplicated zones in (clinical) text

Duplicate Text Finder

duptextfinder is a Python library for detecting duplicated zones in text. It is primarily meant to detect copy/paste across medical documents. It should be faster than Python's built-in difflib algorithm and more robust to whitespace, newlines and other irrelevant characters.
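
For comparison, here is a small baseline using only the standard library's difflib. It illustrates the whitespace sensitivity mentioned above: a single newline in place of a space cuts the longest exact match short. The sample strings are made up for illustration.

```python
from difflib import SequenceMatcher

source = "patient has fever and cough"
target = "patient has fever\nand cough"  # same sentence, but a newline replaces one space

matcher = SequenceMatcher(None, source, target, autojunk=False)
match = matcher.find_longest_match(0, len(source), 0, len(target))

# The match stops at the newline, so only part of the copied sentence is reported
print(repr(source[match.a : match.a + match.size]))
```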

Installation

duptextfinder can be installed through pip:

pip install duptextfinder

Usage

from pathlib import Path
from duptextfinder import CharFingerprintBuilder, DuplicateFinder

# load some text files
texts = [p.read_text() for p in Path("some/dir").glob("*.txt")]

# init fingerprint and duplicate finder
fingerprintBuilder = CharFingerprintBuilder(fingerprintLength=15)
duplicateFinder = DuplicateFinder(fingerprintBuilder, minDuplicateLength=15)

# call findDuplicates() on each file
for i, text in enumerate(texts):
    doc_id = f"D{i}"  # avoid shadowing the built-in id()
    duplicates = duplicateFinder.findDuplicates(doc_id, text)
    for duplicate in duplicates:
        print(
            f"sourceDoc={duplicate.sourceDocId}, "
            f"sourceStart={duplicate.sourceSpan.start}, "
            f"sourceEnd={duplicate.sourceSpan.end}, "
            f"targetStart={duplicate.targetSpan.start}, "
            f"targetEnd={duplicate.targetSpan.end}"
        )
        duplicated_text = text[duplicate.targetSpan.start : duplicate.targetSpan.end]
        print(duplicated_text)

WordFingerprintBuilder can be used instead of CharFingerprintBuilder. For more details, refer to the docstrings of DuplicateFinder, CharFingerprintBuilder and WordFingerprintBuilder.
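
To give an intuition for what a character-level fingerprint is, here is a minimal, self-contained sketch of the general idea. It is not duptextfinder's implementation: the function names `char_fingerprints` and `find_shared_fingerprints` are hypothetical, and skipping whitespace and lowercasing are illustrative choices, assumed here only to show how such a scheme can tolerate layout changes.

```python
from collections import defaultdict


def char_fingerprints(text, length):
    """Yield (fingerprint, start_offset) for each window of `length` non-whitespace chars."""
    # Keep only non-whitespace characters, remembering their original offsets
    kept = [(i, c.lower()) for i, c in enumerate(text) if not c.isspace()]
    for k in range(len(kept) - length + 1):
        window = kept[k : k + length]
        yield "".join(c for _, c in window), window[0][0]


def find_shared_fingerprints(source, target, length):
    """Return (source_start, target_start) pairs whose fingerprints are identical."""
    index = defaultdict(list)
    for fingerprint, start in char_fingerprints(source, length):
        index[fingerprint].append(start)
    return [
        (source_start, target_start)
        for fingerprint, target_start in char_fingerprints(target, length)
        for source_start in index.get(fingerprint, [])
    ]


source = "No fever.  Chest pain since Monday."
target = "Reports chest\npain since Monday, no cough."
pairs = find_shared_fingerprints(source, target, length=10)
print(pairs[0])  # offsets where the first shared window starts in each text
```

A real duplicate finder would then merge consecutive matching windows into longer spans and discard those shorter than a minimum length. That step is omitted here.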

How to run tests

  1. Install the package in editable mode with the test and extra dependencies by running pip install -e ".[tests, ncls, intervaltree]" in the repo directory
  2. Run pytest tests/

About ncls and intervaltree

This tool can be used without any additional dependencies, but performance can be improved by using interval trees. To benefit from this, you will need to install either the ncls package or the intervaltree package.
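
A quick way to see which of these optional packages is importable in the current environment is sketched below. The helper name `available_interval_backends` is hypothetical, and how duptextfinder itself picks a backend is not described here, so refer to the docstrings for that.

```python
from importlib.util import find_spec


def available_interval_backends():
    """Return the optional interval-tree packages that can be imported."""
    candidates = ("ncls", "intervaltree")
    return [name for name in candidates if find_spec(name) is not None]


print(available_interval_backends())
```

An empty list means neither package is installed, in which case the tool still works without them.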

Download files

Source Distribution

duptextfinder-0.3.0.tar.gz (19.4 kB)

Uploaded Source

Built Distribution

duptextfinder-0.3.0-py3-none-any.whl (17.2 kB)

Uploaded Python 3

File details

Details for the file duptextfinder-0.3.0.tar.gz.

File metadata

  • Download URL: duptextfinder-0.3.0.tar.gz
  • Size: 19.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for duptextfinder-0.3.0.tar.gz
Algorithm Hash digest
SHA256 a8ff3f3128bdc56157b2d09778e9ca2d73b093fea5419947418c07f83eaba08e
MD5 d4e1d32863bedf4450d62748fed6fa7f
BLAKE2b-256 67668796a3b0156aa70584e768bd66002a219eec868c215b9c83a90e28d26c5b

File details

Details for the file duptextfinder-0.3.0-py3-none-any.whl.

File hashes

Hashes for duptextfinder-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b23072f5839a71240c43e288902024ee09a4491a64eb102619e4028f0253e37d
MD5 5fc4d746e4823e999cc8052541f37e4e
BLAKE2b-256 c650c45b26f67e3301efeb9cfd11b7509f18c794439cbfffe250392380f11e1d
