Introduction to privJedAI

This notebook walks through the methods available in our open-source library privJedAI and shows how to use each of them.

%pip install privjedai

Import Dataset Abt Buy Clean#

Below we load the two datasets and their ground truth. privJedAI expects the ground truth as pairs of row indices into each dataset, not as pairs of record ids. The code below therefore also shows how to convert a ground truth of id1-id2 pairs into index1-index2 pairs.

import pandas as pd

def load_dataset(path, file, sep):
    return pd.read_csv(f"../{path}/{file}.csv", sep=sep)

def load_ground_truth(path, sep, d1, d2):
    """Load the id-based ground truth and convert it to row-index pairs."""
    gt = pd.read_csv(f"../{path}/gtclean.csv", sep=sep)

    # Expose each dataset's row index next to its record id.
    df_a = d1.reset_index().rename(columns={"index": "index_A"})[["index_A", "id"]]
    df_b = d2.reset_index().rename(columns={"index": "index_B"})[["index_B", "id"]]

    # Replace the D1 ids with the row indices of the first dataset.
    gt_index = gt.merge(df_a, left_on="D1", right_on="id").drop(columns=["id", "D1"])
    gt_index.columns = ["D2", "D1"]

    # Replace the D2 ids with the row indices of the second dataset.
    gt_index = gt_index.merge(df_b, left_on="D2", right_on="id").drop(columns=["id", "D2"])
    gt_index.columns = ["D1", "D2"]

    # The datasets are returned with every column cast to str.
    d1 = d1.astype(str)
    d2 = d2.astype(str)

    return gt_index, d1, d2

DIR = "D2"
PATH = f"data/ccer/{DIR}"
FILE = 'abtclean'
FILE2 = 'buyclean'
attributes = ['name']
SEP = "|"

abt = load_dataset(PATH, FILE, SEP)
buy = load_dataset(PATH, FILE2, SEP)

gt_index, abt, buy = load_ground_truth(PATH, SEP, abt, buy)
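
The id-to-index conversion performed by `load_ground_truth` can be illustrated on toy data without pandas. This is only a minimal sketch of the idea; the ids and variable names below are made up for illustration and do not come from the Abt-Buy files.

```python
# Toy data: record ids are arbitrary, indices are row positions.
ids_a = [1001, 1002, 1005]          # ids of dataset A, in row order
ids_b = [77, 80, 91, 95]            # ids of dataset B, in row order
gt_ids = [(1001, 80), (1005, 77)]   # ground truth as (id_A, id_B) pairs

# Map every id to its row position in its own dataset.
index_of_a = {rec_id: idx for idx, rec_id in enumerate(ids_a)}
index_of_b = {rec_id: idx for idx, rec_id in enumerate(ids_b)}

# Translate each id pair into an index pair.
gt_index_pairs = [(index_of_a[a], index_of_b[b]) for a, b in gt_ids]
print(gt_index_pairs)  # [(0, 1), (2, 0)]
```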

Encode data and build bloom filters#

The parties first agree on an exact configuration, and each then encodes its data locally. The encoded data are then shared with a third party, which carries out the record linkage.

from privjedai.encoder import BloomFilterConfig, BloomEncodedData, BloomFilter

bloom_filter_configuration = {
    "size" : 512,
    "offset" : 0,
    "num_hashes" : 15,
    "hashing_type": "salted_qgrams",
    "salt" : "",
    "attributes": ['name'],
    "qgrams": 4
}
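
To give an intuition for what these parameters control, here is a minimal, standard-library sketch of q-gram Bloom filter encoding. It is not privJedAI's implementation: the double-hashing scheme with SHA-1 and SHA-256, the way the salt is prepended, and the helper names `qgrams` and `bloom_encode` are all assumptions made for illustration. Only the parameter names mirror the configuration above.

```python
import hashlib

def qgrams(text, q):
    """Return the overlapping character q-grams of `text`."""
    text = text.lower()
    if len(text) <= q:
        return [text]
    return [text[i:i + q] for i in range(len(text) - q + 1)]

def bloom_encode(text, size=512, num_hashes=15, q=4, salt=""):
    """Encode `text` as a list of `size` bits (illustrative only)."""
    bits = [0] * size
    for gram in qgrams(text, q):
        data = (salt + gram).encode("utf-8")
        # Double hashing (an assumption): derive num_hashes positions
        # from two base hash values of the salted q-gram.
        h1 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha256(data).digest(), "big")
        for i in range(num_hashes):
            bits[(h1 + i * h2) % size] = 1
    return bits

a = bloom_encode("sony bravia 40in lcd tv")
b = bloom_encode("sony bravia 40 inch lcd tv")
print(sum(a), sum(b))  # number of active bits in each filter
```

Because both parties must produce comparable filters, every value in the configuration (size, number of hashes, q, salt) has to be identical on both sides; the same input with the same settings always yields the same bits.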

Abt Owner Encodes Dataset#

config = BloomFilterConfig(**bloom_filter_configuration)
bloom_generator = BloomFilter(config)

# Each party encodes its dataset and saves the result to disk.
# The encoded datasets are then shared with the third party, which performs the matching.
encoded_d1 = bloom_generator.encode(abt)
encoded_d1.to_file("dataset_1.pkl")

Buy Owner Encodes Dataset#

bloom_generator_buy = BloomFilter(config)
encoded_d2 = bloom_generator_buy.encode(buy)
encoded_d2.to_file("dataset_2.pkl")

Trusted Third Party: Linking Phase#

# Third party loads the encoded datasets and performs the matching process.
encoded_data = BloomEncodedData.from_file("dataset_1.pkl", "dataset_2.pkl")


# The ground truth must be set explicitly for the evaluation process.
# This is done by providing the indices of the matching records in the original datasets.
encoded_data.set_ground_truth(gt_index)

Blocking with privJedAI#

privJedAI provides two implementations of Hamming LSH, BitBlocker and LSHBlocker, and a FAISS-based implementation, FAISSBlocking.
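
The general idea behind Hamming LSH blocking can be sketched with the standard library alone. This is a hedged illustration, not the code of either privJedAI blocker: it assumes that `psi` is the number of bit positions sampled per hash table and `lmbda` the number of hash tables, which the notebook does not state. The function name `hamming_lsh_blocks` and the toy filters are hypothetical.

```python
import random
from collections import defaultdict

def hamming_lsh_blocks(filters, psi=8, lmbda=24, seed=42):
    """Group records whose filters agree on `psi` sampled bit positions.

    `filters` maps a record id to a list of bits of equal length.
    """
    rng = random.Random(seed)
    size = len(next(iter(filters.values())))
    blocks = defaultdict(set)
    for table in range(lmbda):
        # Each table samples its own psi distinct bit positions.
        positions = rng.sample(range(size), psi)
        for rec_id, bits in filters.items():
            key = (table, tuple(bits[p] for p in positions))
            blocks[key].add(rec_id)
    # Keep only blocks that would generate at least one comparison.
    return {k: v for k, v in blocks.items() if len(v) > 1}

toy_filters = {
    "a1": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    "b1": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0],  # differs from a1 in one bit
    "b2": [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0],  # exact complement of a1
}
toy_blocks = hamming_lsh_blocks(toy_filters, psi=4, lmbda=6)
for key, members in sorted(toy_blocks.items()):
    print(key[0], sorted(members))
```

Filters that differ in few bits are likely to collide in at least one table, while very different filters rarely or never do. Under the assumed meaning of the parameters, a larger `psi` makes each key more selective and a larger `lmbda` gives similar records more chances to meet.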

BitBlocker#

from privjedai.blocking import BitBlocker

blocker = BitBlocker(psi=8,
                     lmbda=24,
                     seed=42)

blocks = blocker.build_blocks(encoded_data=encoded_data)

_ = blocker.evaluate(blocks)

***************************************************************************************************************************
                                         Method:  BitBlocker
***************************************************************************************************************************
Method name: BitBlocker
Parameters: 
Runtime: 0.1073 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:      0.42% 
	Recall:        85.81%
	F1-score:       0.83%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

LSHBlocker#

from privjedai.blocking import LSHBlocker

blocker = LSHBlocker(psi=8,
                     lmbda=24,
                     seed=42,
                     prune_ratio=0.8)

blocks = blocker.build_blocks(encoded_data=encoded_data)
_ = blocker.evaluate(blocks)

***************************************************************************************************************************
                                         Method:  LSHBlocker
***************************************************************************************************************************
Method name: LSHBlocker
Parameters: 
	psi: 8
	lmbda: 24
	prune_ratio: 0.8
	prune_sample: 1000
	seed: 42
Runtime: 0.2658 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:      0.09% 
	Recall:        92.20%
	F1-score:       0.19%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

FAISSBlocking#

from privjedai.blocking import FAISSBlocking

blocker = FAISSBlocking(index_type='hnsw')

blocks = blocker.build_blocks(encoded_data=encoded_data, top_k=20)
_ = blocker.evaluate(blocks)

***************************************************************************************************************************
                                         Method:  FAISS Blocking
***************************************************************************************************************************
Method name: FAISS Blocking
Parameters: 
Runtime: 0.0409 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:      4.40% 
	Recall:        88.06%
	F1-score:       8.39%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
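
The `top_k=20` argument above suggests nearest-neighbour blocking: each record keeps only its k closest records from the other dataset as candidates. The sketch below shows that idea with an exhaustive Hamming-distance search. It is an assumption-laden stand-in, since FAISS with an HNSW index performs an approximate search and its distance function may differ; `top_k_candidates` and the toy filters are hypothetical.

```python
import heapq

def hamming(x, y):
    """Hamming distance between two equally long bit lists."""
    return sum(a != b for a, b in zip(x, y))

def top_k_candidates(queries, targets, top_k):
    """For each query filter, return the indices of its top_k closest targets."""
    candidates = {}
    for qi, q in enumerate(queries):
        candidates[qi] = heapq.nsmallest(
            top_k, range(len(targets)), key=lambda ti: hamming(q, targets[ti])
        )
    return candidates

filters_1 = [[1, 1, 0, 0, 1, 0], [0, 0, 1, 1, 0, 1]]
filters_2 = [[0, 0, 1, 1, 1, 1], [1, 1, 0, 0, 1, 1], [1, 0, 0, 0, 1, 0]]
knn = top_k_candidates(filters_1, filters_2, top_k=2)
print(knn)
```

The number of candidate pairs is bounded by `top_k` times the number of query records, which is consistent with this blocker reaching a higher precision than the two LSH-based ones in the outputs above.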

Meta-blocking Techniques#

The methods above are all standard blocking techniques. privJedAI also implements meta-blocking: it takes comparison cleaning methods from entity resolution (ER) and adapts them to bitarrays. In this setting, a block is a set of adjacent active bits of a bitarray.
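
As a rough illustration of meta-blocking on bitarrays, the sketch below derives blocks from runs of adjacent active bits, weights each cross-dataset pair by the number of blocks it shares (the plain common-blocks scheme, CBS), and keeps only the highest-weighted edges. Several points are assumptions: that a block corresponds to a window of `adjacent_bits` consecutive set positions, that plain CBS is a fair stand-in for the `CN-CBS` scheme used below, and that pruning keeps a fixed number `keep` of edges. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def bit_blocks(bits, adjacent_bits=2):
    """Start positions of every window of `adjacent_bits` consecutive active bits."""
    return {i for i in range(len(bits) - adjacent_bits + 1)
            if all(bits[i:i + adjacent_bits])}

def cardinality_edge_pruning(filters_1, filters_2, adjacent_bits=2, keep=1):
    """Weight cross-dataset pairs by shared blocks (CBS) and keep the top `keep`."""
    index = defaultdict(lambda: (set(), set()))
    for i, bits in enumerate(filters_1):
        for key in bit_blocks(bits, adjacent_bits):
            index[key][0].add(i)
    for j, bits in enumerate(filters_2):
        for key in bit_blocks(bits, adjacent_bits):
            index[key][1].add(j)
    weights = defaultdict(int)
    for left, right in index.values():
        for pair in product(left, right):
            weights[pair] += 1  # one more block in common
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:keep]

toy_1 = [[1, 1, 1, 0, 0, 1, 1, 0], [0, 0, 1, 1, 0, 0, 1, 1]]
toy_2 = [[1, 1, 1, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0, 0, 0]]
print(cardinality_edge_pruning(toy_1, toy_2))  # [((0, 0), 3)]
```

Here the pair (0, 0) shares three blocks and survives, while the pair (1, 0), which shares only one, is pruned.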

from privjedai.comparison_cleaning import ( 
    WeightedEdgePruning,
    WeightedNodePruning,
    CardinalityEdgePruning,
    CardinalityNodePruning
)

cc = CardinalityEdgePruning(weighting_scheme='CN-CBS')

cc_blocks = cc.process(encoded_data, adjacent_bits=2)

_ = cc.evaluate(cc_blocks)

Total matching pairs: 233036
***************************************************************************************************************************
                                         Method:  Cardinality Edge Pruning
***************************************************************************************************************************
Method name: Cardinality Edge Pruning
Parameters: 
	Node centric: False
	Weighting scheme: CN-CBS
Runtime: 0.7610 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:      0.45% 
	Recall:        98.50%
	F1-score:       0.90%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Matching#

After filtering, we can apply a similarity function to the remaining candidate pairs and keep those that match.
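
The matcher below uses the Dice metric with a threshold of 0.6. For two Bloom filters, the Dice coefficient is twice the number of bits set in both, divided by the total number of bits set in each. The following sketch shows that computation on plain bit lists; whether privJedAI treats the threshold as inclusive is an assumption, and the helper names are hypothetical.

```python
def dice(x, y):
    """Dice coefficient of two equally long bit lists."""
    common = sum(a & b for a, b in zip(x, y))
    total = sum(x) + sum(y)
    return 2 * common / total if total else 0.0

def match(candidates, filters_1, filters_2, threshold=0.6):
    """Keep candidate (i, j) pairs whose Dice similarity reaches the threshold."""
    scored = ((i, j, dice(filters_1[i], filters_2[j])) for i, j in candidates)
    return [(i, j, s) for i, j, s in scored if s >= threshold]

f1 = [[1, 1, 0, 1, 0, 1, 0, 0]]
f2 = [[1, 1, 0, 1, 0, 0, 0, 1], [0, 0, 1, 0, 1, 0, 1, 0]]
print(match([(0, 0), (0, 1)], f1, f2))  # [(0, 0, 0.75)]
```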

from privjedai.matching import Matcher

matcher = Matcher(batch_size=10_000,
                  threshold=0.6,
                  metric='dice')

# `blocks` holds the most recently built blocks (from FAISSBlocking if the cells above were run in order).
matches = matcher.predict(encoded_data=encoded_data, blocks=blocks)

_ = matcher.evaluate(matches)

***************************************************************************************************************************
                                         Method:  Matcher
***************************************************************************************************************************
Method name: Matcher
Parameters: 
	batch_size: 10000
	threshold: 0.6
	metric: dice
	attributes: None
Runtime: 0.1109 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:      4.82% 
	Recall:        87.59%
	F1-score:       9.13%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Clustering#

To resolve possible conflicts among the matched pairs, privJedAI provides multiple clustering techniques.
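
A conflict arises when one record appears in several matched pairs although each record should have at most one partner. The sketch below resolves such conflicts with a simple greedy one-to-one assignment. Note that this is a deliberately simpler technique than the Kiraly MSM approximate clustering used in the next cell, and it is shown only to make the goal of the step concrete; the function name and the toy scores are hypothetical.

```python
def greedy_one_to_one(scored_pairs, threshold=0.1):
    """Accept pairs by descending similarity, using each record at most once."""
    used_left, used_right, accepted = set(), set(), []
    for i, j, score in sorted(scored_pairs, key=lambda p: -p[2]):
        if score < threshold:
            break  # all remaining pairs score even lower
        if i not in used_left and j not in used_right:
            accepted.append((i, j))
            used_left.add(i)
            used_right.add(j)
    return accepted

pairs = [(0, 0, 0.90), (0, 1, 0.85), (1, 1, 0.70), (2, 1, 0.65)]
print(greedy_one_to_one(pairs))  # [(0, 0), (1, 1)]
```

Dropping the conflicting pairs (0, 1) and (2, 1) is what trades a little recall for a large gain in precision, the same pattern visible between the matching and clustering outputs in this notebook.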

from privjedai.clustering import KiralyMSMApproximateClustering

clusterer = KiralyMSMApproximateClustering()

clusters = clusterer.process(matches, encoded_data=encoded_data)

_ = clusterer.evaluate(clusters)

***************************************************************************************************************************
                                         Method:  Kiraly MSM Approximate Clustering
***************************************************************************************************************************
Method name: Kiraly MSM Approximate Clustering
Parameters: 
	Similarity Threshold: 0.1
Runtime: 0.0295 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:     79.11% 
	Recall:        71.90%
	F1-score:      75.33%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
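
The F1-score in these reports is the harmonic mean of precision and recall, assuming `evaluate` follows the standard definition. As a quick check under that assumption, the clustering figures above can be reproduced by hand:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit in, same unit out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(79.11, 71.90), 2))  # 75.33
```

For the blocking steps the same formula only matches the printed values up to rounding, because the displayed precision and recall are themselves rounded to two decimals.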

Additional Features#

Concurrent matching using Ray
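
The `batch_size` and `workers` arguments used below suggest that the candidate pairs are split into batches that are scored concurrently. The sketch below illustrates that pattern with a standard-library thread pool as a stand-in for Ray; it does not use Ray's API and makes no claim about how privJedAI distributes the work. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def dice(x, y):
    """Dice coefficient of two equally long bit lists."""
    common = sum(a & b for a, b in zip(x, y))
    total = sum(x) + sum(y)
    return 2 * common / total if total else 0.0

def score_batch(batch, filters_1, filters_2, threshold):
    """Return the pairs of one batch whose similarity reaches the threshold."""
    return [(i, j) for i, j in batch
            if dice(filters_1[i], filters_2[j]) >= threshold]

def parallel_match(candidates, filters_1, filters_2,
                   threshold=0.6, batch_size=2, workers=2):
    """Split candidates into batches and score the batches concurrently."""
    batches = [candidates[k:k + batch_size]
               for k in range(0, len(candidates), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields the batch results in submission order.
        results = pool.map(
            lambda b: score_batch(b, filters_1, filters_2, threshold), batches)
        return [pair for batch_result in results for pair in batch_result]

g1 = [[1, 1, 0, 1], [0, 1, 1, 0]]
g2 = [[1, 1, 0, 1], [0, 0, 1, 0], [0, 1, 1, 1]]
cands = [(0, 0), (0, 1), (1, 1), (1, 2), (0, 2)]
print(parallel_match(cands, g1, g2))  # [(0, 0), (1, 1), (1, 2), (0, 2)]
```

Each batch is independent of the others, so the result is the same as scoring all pairs in a single loop; only the wall-clock time changes, and for small inputs the scheduling overhead can outweigh the gain.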

from privjedai.ray.matching import Matcher

matcher = Matcher(batch_size=10_000,
                  threshold=0.6,
                  metric='dice',
                  workers=10)

matches = matcher.predict(encoded_data=encoded_data, blocks=blocks)

_ = matcher.evaluate(matches)

***************************************************************************************************************************
                                         Method:  Matcher
***************************************************************************************************************************
Method name: Matcher
Parameters: 
	batch_size: 10000
	threshold: 0.6
	metric: dice
	attributes: None
Runtime: 0.3367 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:      4.82% 
	Recall:        87.69%
	F1-score:       9.14%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

GPU-Acceleration#

from privjedai.gpu.matching import Matcher
from privjedai.gpu.clustering import KiralyMSMApproximateClustering

matcher = Matcher(batch_size=10_000,
                  threshold=0.6,
                  metric='dice')

matches = matcher.predict(encoded_data=encoded_data, blocks=blocks)

_ = matcher.evaluate(matches, verbose=True)

clusterer = KiralyMSMApproximateClustering()
clusters = clusterer.process(matches, encoded_data=encoded_data)
_ = clusterer.evaluate(clusters)