Detect duplicated zones in (clinical) text

Duplicate Text Finder

A Python library to detect duplicated zones in text, primarily meant to detect copy/paste across medical documents. It is intended to be faster than the difflib module in Python's standard library and more robust to whitespace, newlines and other irrelevant characters.
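For comparison, here is a minimal sketch of the difflib baseline mentioned above. It uses only the standard library; the helper name `difflib_duplicates` and the sample sentences are made up for illustration and are not part of duptextfinder.

```python
from difflib import SequenceMatcher


def difflib_duplicates(source, target, min_length=15):
    """Return (source_start, target_start, length) for each long common block."""
    # autojunk=False turns off difflib's heuristic that discards very frequent items
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_length
    ]


source = "Patient reports chest pain since yesterday. No fever."
target = "History: Patient reports chest pain since yesterday. Afebrile."

blocks = difflib_duplicates(source, target)
for source_start, target_start, size in blocks:
    print(repr(target[target_start:target_start + size]))
```

difflib compares the two strings character by character, so an extra space or a newline inside a pasted passage splits the match into smaller blocks.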

Usage

from pathlib import Path
from duptextfinder import CharFingerprintBuilder, DuplicateFinder

# load some text files
texts = [p.read_text() for p in Path("some/dir").glob("*.txt")]

# init fingerprint and duplicate finder
fingerprintBuilder = CharFingerprintBuilder(fingerprintLength=15)
duplicateFinder = DuplicateFinder(fingerprintBuilder, minDuplicateLength=15)

# call findDuplicates() on each file
for i, text in enumerate(texts):
    doc_id = f"D{i}"  # avoid shadowing the built-in id()
    duplicates = duplicateFinder.findDuplicates(doc_id, text)
    for duplicate in duplicates:
        print(
            f"sourceDoc={duplicate.sourceDocId}, "
            f"sourceStart={duplicate.sourceSpan.start}, "
            f"sourceEnd={duplicate.sourceSpan.end}, "
            f"targetStart={duplicate.targetSpan.start}, "
            f"targetEnd={duplicate.targetSpan.end}"
        )
        duplicated_text = text[duplicate.targetSpan.start : duplicate.targetSpan.end]
        print(duplicated_text)

WordFingerprintBuilder can be used instead of CharFingerprintBuilder. For more details, refer to the docstrings of DuplicateFinder, CharFingerprintBuilder and WordFingerprintBuilder.
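To give an intuition for what character-level fingerprinting can look like, the sketch below normalizes text, indexes fixed-length character n-grams of one document, and looks them up in another. This is a simplified illustration under stated assumptions, not the implementation used by CharFingerprintBuilder or DuplicateFinder. The names `normalize` and `shared_spans` and the sample texts are hypothetical.

```python
def normalize(text):
    """Keep lowercased alphanumerics and remember each one's original offset."""
    chars, offsets = [], []
    for i, ch in enumerate(text):
        if ch.isalnum():
            # lower() can yield more than one character, so map each back to i
            for low in ch.lower():
                chars.append(low)
                offsets.append(i)
    return "".join(chars), offsets


def shared_spans(source, target, n=15):
    """Return (start, end) spans of target whose normalized n-grams occur in source."""
    src_norm, _ = normalize(source)
    tgt_norm, tgt_offsets = normalize(target)
    src_grams = {src_norm[i:i + n] for i in range(len(src_norm) - n + 1)}

    runs = []
    run = None  # [first, last] normalized indices covered by the current run
    for i in range(len(tgt_norm) - n + 1):
        if tgt_norm[i:i + n] not in src_grams:
            continue
        last = i + n - 1
        if run is not None and i <= run[1] + 1:
            run[1] = last  # this n-gram overlaps or touches the run: extend it
        else:
            if run is not None:
                runs.append(run)
            run = [i, last]
    if run is not None:
        runs.append(run)

    # Map normalized indices back to offsets in the original target text
    return [(tgt_offsets[a], tgt_offsets[b] + 1) for a, b in runs]


source = "Patient reports chest pain since yesterday."
target = "Note:\nPatient  reports chest\npain since yesterday. Afebrile."

spans = shared_spans(source, target)
for start, end in spans:
    print(repr(target[start:end]))
```

Because whitespace and punctuation are dropped before the n-grams are built, the pasted sentence is still found even though the target contains a doubled space and a newline.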

How to run tests

  1. Install the package in editable mode with the test and optional dependencies by running pip install -e ".[tests, ncls, intervaltree]" in the repo directory
  2. Run pytest tests/

About ncls and intervaltree

This tool can be used without any additional dependencies, but performance can be improved by using interval trees. To benefit from this, you will need to install either the ncls package or the intervaltree package.
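A quick way to see whether either optional package is importable in the current environment is sketched below. It uses only the standard library; the helper name `available_interval_backends` is made up for illustration, and how duptextfinder itself picks a backend is not described here.

```python
from importlib.util import find_spec


def available_interval_backends():
    """Return the optional interval-tree packages importable in this environment."""
    candidates = ("ncls", "intervaltree")
    return [name for name in candidates if find_spec(name) is not None]


backends = available_interval_backends()
if backends:
    print("Interval tree packages found:", ", ".join(backends))
else:
    print("Neither ncls nor intervaltree is installed.")
```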
