datafog

Python SDK to scan and redact PII from files going into RAG applications

Open-source DevSecOps for Generative AI Systems.

Overview

What is DataFog?

DataFog is an open-source DevSecOps platform that lets you scan for and redact Personally Identifiable Information (PII) in your Generative AI applications.

Core Problem

image

How it works

image

Installation

DataFog can be installed via pip:

pip install datafog
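
To confirm that the install succeeded, one option is to ask Python's standard packaging metadata for the installed version. This is a minimal sketch that uses only the standard library; `installed_version` is a hypothetical helper name, not part of DataFog.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("datafog")
    print(f"datafog {found}" if found else "datafog is not installed")
```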

Getting Started

The DataFog library provides functionality for text and image processing, including PII (Personally Identifiable Information) annotation and OCR (Optical Character Recognition) capabilities.
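
To make "annotation and redaction" concrete, here is a toy sketch that masks two kinds of identifiers with regular expressions. It is an illustration only and does not show how DataFog detects PII; the `PATTERNS` table, the `redact` helper, and the assumption that a record number is exactly eight digits are all hypothetical.

```python
import re

# Hypothetical patterns for illustration only; not DataFog's detection logic.
PATTERNS = {
    "MRN": re.compile(r"\b\d{8}\b"),  # assumes record numbers are exactly 8 digits
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Date: April 10, 2024. MRN: 00987654."
print(redact(sample))  # Date: [DATE]. MRN: [MRN].
```

Fixed patterns like these cannot catch free-form identifiers such as a patient's name, which is the kind of case a dedicated PII annotation step is meant to handle.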

Usage

The Getting Started guide is also available as a standalone Colab notebook.

Text PII Annotation

To annotate PII in a given text, let's start with a set of clinical notes:

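import os
from IPython.display import Markdown, display

# Clone the gist into a local clinical_notes/ directory (notebook shell syntax)
!git clone https://gist.github.com/b43b72693226422bac5f083c941ecfdb.git clinical_notes

# Define the directory path
folder_path = 'clinical_notes/'

# List all .txt files in the directory, sorted by name
file_list = os.listdir(folder_path)
text_files = sorted([file for file in file_list if file.endswith('.txt')])

# Read the first clinical note
with open(os.path.join(folder_path, text_files[0]), 'r') as file:
    clinical_note = file.read()

display(Markdown(clinical_note))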

which looks like this:


**Date:** April 10, 2024

**Patient:** Emily Johnson, 35 years old

**MRN:** 00987654

**Chief Complaint:** "I've been experiencing severe back pain and numbness in my legs."

**History of Present Illness:** The patient is a 35-year-old who presents with a 2-month history of worsening back pain, numbness in both legs, and occasional tingling sensations. The patient reports working as a freelance writer and has been experiencing increased stress due to tight deadlines and financial struggles.

**Past Medical History:** Hypothyroidism

**Social History:**
The patient shares a small apartment with two roommates and relies on public transportation. They mention feeling overwhelmed with work and personal responsibilities, often sacrificing sleep to meet deadlines. The patient expresses concern over the high cost of healthcare and the need for affordable medication options.

**Review of Systems:** Denies fever, chest pain, or shortness of breath. Reports occasional headaches.

**Physical Examination:**
- General: Appears tired but is alert and oriented.
- Vitals: BP 128/80, HR 72, Temp 98.6°F, Resp 14/min

**Assessment/Plan:**
- Continue to monitor blood pressure and thyroid function.
- Discuss affordable medication options with a pharmacist.
- Refer to a social worker to address housing concerns and access to healthcare services.
- Encourage the patient to engage with community support groups for social support.
- Schedule a follow-up appointment in 4 weeks or sooner if symptoms worsen.

**Comments:** The patient's health concerns are compounded by socioeconomic factors, including employment status, housing stability, and access to healthcare. Addressing these social determinants of health is crucial for improving the patient's overall well-being.

We can then set up our pipeline to accept these files:

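import asyncio
from datafog import DataFog

# Assumption: a default-constructed DataFog instance is sufficient for text annotation
datafog = DataFog()
texts = [clinical_note]

async def run_text_pipeline_demo():
    # run_text_pipeline is a coroutine, so it must be awaited
    results = await datafog.run_text_pipeline(texts)
    print("Text Pipeline Results:", results)
    return results

# In a script; inside a notebook, use `results = await run_text_pipeline_demo()` instead
results = asyncio.run(run_text_pipeline_demo())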

Note: DataFog's pipeline methods are asynchronous, so call them with await from inside an async function.
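
The calling pattern can be tried without DataFog installed. The sketch below uses a hypothetical stand-in coroutine, `fake_text_pipeline`, in place of a real pipeline call; only the `asyncio` usage is meant to carry over.

```python
import asyncio
from typing import List


async def fake_text_pipeline(texts: List[str]) -> List[int]:
    """Hypothetical stand-in for an async pipeline call; returns each text's length."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O-bound work would
    return [len(t) for t in texts]


async def main() -> List[int]:
    # Inside a coroutine, the pipeline call is awaited.
    return await fake_text_pipeline(["first note", "second"])


if __name__ == "__main__":
    # In a plain script, asyncio.run() starts an event loop and runs main() to completion.
    print(asyncio.run(main()))  # [10, 6]
```

In a Jupyter or Colab cell an event loop is already running, so `await main()` at the top level of the cell is the usual equivalent.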

OCR PII Annotation

Let's use an image (which could easily be a converted or scanned PDF):

Executive Email

import asyncio
from datafog import DataFog

datafog = DataFog(operations='extract_text')
url_list = ['https://pbs.twimg.com/media/GM3-wpeWkAAP-cX.jpg']

async def run_ocr_pipeline_demo():
    # Run the OCR pipeline over the images at the given URLs
    results = await datafog.run_ocr_pipeline(url_list)
    print("OCR Pipeline Results:", results)

# In a script; inside a notebook, use `await run_ocr_pipeline_demo()` instead
asyncio.run(run_ocr_pipeline_demo())

You'll notice that we use async functions liberally throughout the SDK. Given the nature of the functions we provide and the extension of DataFog into API and other formats, async makes them easier to adapt for those uses.
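
One reason coroutines adapt well to API-style use is that several I/O-bound calls can run concurrently on one event loop. The sketch below illustrates this with a hypothetical stand-in, `fake_ocr`, that simulates latency; it is not DataFog's OCR implementation.

```python
import asyncio
from typing import List


async def fake_ocr(url: str) -> str:
    """Hypothetical stand-in for an async OCR call on one image URL."""
    await asyncio.sleep(0.01)  # simulated network and OCR latency
    return f"text from {url}"


async def process_all(urls: List[str]) -> List[str]:
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_ocr(u) for u in urls))


if __name__ == "__main__":
    print(asyncio.run(process_all(["a.jpg", "b.jpg"])))
```

Because the waits overlap, the total time stays close to that of the slowest single call rather than the sum of all calls.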

Contributing

DataFog is a community-driven open-source platform, and we've been fortunate to have a small and growing contributor base. We'd love to hear your ideas, feedback, and suggestions for improvement: anything you think would make DataFog better! Join our Discord to become part of the community.


Dev Notes

For local development:

  1. Clone the repository.
  2. Navigate to the project directory:
    cd datafog-python
    
  3. Create a new virtual environment (using .venv is recommended as it is hardcoded in the justfile):
    python -m venv .venv
    
  4. Activate the virtual environment:
    • On Windows:
      .venv\Scripts\activate
      
    • On macOS/Linux:
      source .venv/bin/activate
      
  5. Install the package in editable mode:
    pip install -e .
    
  6. Set up the project:
    just setup
    

Now, you can develop and run the project locally.

Important Actions:

  • Format the code:
    just format
    
    This runs isort to sort imports.
  • Lint the code:
    just lint
    
    This runs flake8 to check for linting errors.
  • Generate coverage report:
    just coverage-html
    
    This runs pytest and generates a coverage report in the htmlcov/ directory.

We use pre-commit to run checks locally before committing changes. Once installed, you can run:

pre-commit run --all-files

Dependencies

For OCR, we use Tesseract, which is incorporated into the build step. You can find the relevant configurations under .github/workflows/ in the following files:

  • dev-cicd.yml
  • feature-cicd.yml
  • main-cicd.yml
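
For local development outside CI, it can help to check that the Tesseract binary is available before running OCR code. A minimal sketch, assuming the executable is installed under its usual name `tesseract`; `find_tesseract` is a hypothetical helper, not part of DataFog.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the path of the `tesseract` executable on PATH, or None if absent."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    print(f"Tesseract found at {path}" if path else "Tesseract not found on PATH")
```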

Testing

  • Tested with Python 3.10

License

This software is published under the MIT license.