Introduction to privJedAI#
This notebook walks through the methods available in our open-source library privJedAI and shows how to use each of them.
%pip install privjedai
Import Dataset Abt Buy Clean#
Below we load the two datasets and their ground truth. privJedAI expects the ground truth as pairs of row indices into each dataset, not as pairs of record ids. The loading code therefore also shows how to convert a ground truth of id1-id2 pairs into index1-index2 pairs.
import pandas as pd
def load_dataset(path, file, sep):
    return pd.read_csv(f"../{path}/{file}.csv", sep=sep)

def load_ground_truth(path, sep, d1, d2):
    gt = pd.read_csv(f"../{path}/gtclean.csv", sep=sep)
    # Expose each dataset's row position as a column next to its record id.
    df_a = d1.reset_index().rename(columns={"index": "index_A"})
    df_b = d2.reset_index().rename(columns={"index": "index_B"})
    df_a = df_a[["index_A", "id"]]
    df_b = df_b[["index_B", "id"]]
    # Replace the D1 ids with the row indices of the first dataset.
    gt_index = gt.merge(left_on='D1', right=df_a, right_on='id')
    gt_index = gt_index.drop(columns=['id', 'D1'])
    gt_index.columns = ['D2', 'D1']
    # Replace the D2 ids with the row indices of the second dataset.
    gt_index = gt_index.merge(left_on='D2', right=df_b, right_on='id')
    gt_index = gt_index.drop(columns=['id', 'D2'])
    gt_index.columns = ['D1', 'D2']
    # Cast every column of both datasets to string before returning them.
    d1 = d1.astype(str)
    d2 = d2.astype(str)
    return gt_index, d1, d2
DIR = "D2"
PATH = f"data/ccer/{DIR}"
FILE = 'abtclean'
FILE2 = 'buyclean'
attributes = ['name']
SEP = "|"
abt = load_dataset(PATH, FILE, SEP)
buy = load_dataset(PATH, FILE2, SEP)
gt_index, abt, buy = load_ground_truth(PATH, SEP, abt, buy)
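The id-to-index conversion described above can also be illustrated on toy data. The following is a minimal sketch, not the notebook's own helper: the toy frames and the names `id_to_index_1` and `id_to_index_2` are made up for illustration, and it assumes the ground truth has columns `D1` and `D2` and that every id is unique within its dataset.

```python
import pandas as pd

# Toy datasets; the "id" values are arbitrary and differ from row positions.
d1 = pd.DataFrame({"id": [101, 205, 309], "name": ["tv", "radio", "phone"]})
d2 = pd.DataFrame({"id": [11, 22], "name": ["phone x", "tv y"]})

# Ground truth expressed as id pairs (id in d1, id in d2).
gt = pd.DataFrame({"D1": [309, 101], "D2": [11, 22]})

# Lookup Series: record id -> row index of the record holding that id.
id_to_index_1 = pd.Series(d1.index, index=d1["id"])
id_to_index_2 = pd.Series(d2.index, index=d2["id"])

# Map each id column to the corresponding row indices.
gt_index = pd.DataFrame({
    "D1": gt["D1"].map(id_to_index_1),
    "D2": gt["D2"].map(id_to_index_2),
})
print(gt_index)
```

Mapping through a lookup Series gives the same kind of result as the two merges, but ids missing from a dataset become `NaN` instead of silently dropping the row, which can make a broken ground truth easier to spot.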
Encode data and build bloom filters#
Both parties agree on exactly the same configuration and then encode their data locally. The encoded data are then shared with a third party, which performs the record linkage.
from privjedai.encoder import BloomFilterConfig, BloomEncodedData, BloomFilter
bloom_filter_configuration = {
    "size": 512,
    "offset": 0,
    "num_hashes": 15,
    "hashing_type": "salted_qgrams",
    "salt": "",
    "attributes": ['name'],
    "qgrams": 4
}
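To make the configuration values more concrete, here is a minimal, self-contained sketch of q-gram Bloom filter encoding. It is not privJedAI's implementation: the helper names `qgrams` and `encode`, the use of SHA-256, and the way the salt is mixed into the hash input are all assumptions made for illustration, and the `offset` option is not modelled.

```python
import hashlib

def qgrams(text, q):
    """Return the overlapping character q-grams of text."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]

def encode(text, size=512, num_hashes=15, q=4, salt=""):
    """Set up to num_hashes bit positions per q-gram in a bit list of length size."""
    bits = [0] * size
    for gram in qgrams(text.lower(), q):
        for k in range(num_hashes):
            # Assumed scheme: hash salt, hash number and q-gram together.
            digest = hashlib.sha256(f"{salt}|{k}|{gram}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % size] = 1
    return bits

a = encode("sony bravia 40 lcd tv")
b = encode("sony bravia 40in lcd television")
print(sum(a), sum(b))  # number of active bits in each filter
```

Because both parties use the same `size`, `num_hashes`, q-gram length and salt, equal q-grams always activate the same bit positions, so similar names tend to share many active bits.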
Abt Owner Encodes Dataset#
config = BloomFilterConfig(**bloom_filter_configuration)
bloom_generator = BloomFilter(config)
## Each party encodes its dataset and saves it to disk.
## The encoded datasets are then shared with the trusted third party and used for the matching process.
encoded_d1 = bloom_generator.encode(abt)
encoded_d1.to_file("dataset_1.pkl")
Buy Owner Encodes Dataset#
bloom_generator_buy = BloomFilter(config)
encoded_d2 = bloom_generator_buy.encode(buy)
encoded_d2.to_file("dataset_2.pkl")
Trusted Third Party: Linking Phase#
# Third party loads the encoded datasets and performs the matching process.
encoded_data = BloomEncodedData.from_file("dataset_1.pkl", "dataset_2.pkl")
# Ground truth must be explicitly set for the evaluation process.
# This is done by providing the indices of the matching records in the original datasets.
encoded_data.set_ground_truth(gt_index)
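Every `evaluate` call in this notebook reports precision, recall and F1-score against this ground truth. As a rough sketch of how such scores can be computed from index pairs (the function name `evaluate_pairs` and the toy pairs are hypothetical, and privJedAI's own evaluation may differ in detail):

```python
def evaluate_pairs(candidates, ground_truth):
    """Return (precision, recall, f1) for candidate pairs against true pairs."""
    candidates = set(candidates)
    ground_truth = set(ground_truth)
    true_positives = len(candidates & ground_truth)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Pairs are (index in dataset 1, index in dataset 2).
ground_truth = {(0, 1), (1, 3), (2, 2)}
candidates = {(0, 1), (0, 2), (1, 3), (3, 0)}
precision, recall, f1 = evaluate_pairs(candidates, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For blocking, precision is expected to be low, because a blocker's job is to keep recall high while discarding most of the non-matching pairs.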
Blocking with privJedAI#
privJedAI provides two different implementations of Hamming LSH blocking and one FAISS-based implementation.
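For readers unfamiliar with Hamming LSH, the idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code behind either blocker: it assumes `psi` is the number of bit positions sampled per hash table and `lmbda` is the number of tables, and the function name `hamming_lsh_pairs` is hypothetical.

```python
import random
from collections import defaultdict

def hamming_lsh_pairs(bits_a, bits_b, psi=8, lmbda=24, seed=42):
    """Return candidate (i, j) pairs whose sampled bits agree in some table."""
    rng = random.Random(seed)
    size = len(bits_a[0])
    candidates = set()
    for _ in range(lmbda):
        # Each table uses its own random sample of psi bit positions.
        positions = rng.sample(range(size), psi)
        buckets = defaultdict(list)
        for i, bits in enumerate(bits_a):
            buckets[tuple(bits[p] for p in positions)].append(i)
        for j, bits in enumerate(bits_b):
            key = tuple(bits[p] for p in positions)
            for i in buckets.get(key, ()):
                candidates.add((i, j))
    return candidates

# Toy data: b[0] is identical to a[0], b[1] is the bitwise complement of a[1].
data_rng = random.Random(0)
bits_a = [[data_rng.randint(0, 1) for _ in range(32)] for _ in range(2)]
bits_b = [list(bits_a[0]), [1 - x for x in bits_a[1]]]
pairs = hamming_lsh_pairs(bits_a, bits_b)
print(sorted(pairs))
```

Records with a small Hamming distance agree on most sampled positions and are likely to share a bucket in at least one table. Under this reading, a larger `psi` makes each key stricter and a larger `lmbda` gives similar records more chances to collide.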
BitBlocker#
from privjedai.blocking import BitBlocker
blocker = BitBlocker(
    psi=8,
    lmbda=24,
    seed=42,
)
blocks = blocker.build_blocks(encoded_data=encoded_data)
_ = blocker.evaluate(blocks)
***************************************************************************************************************************
Method: BitBlocker
***************************************************************************************************************************
Method name: BitBlocker
Parameters:
Runtime: 0.1073 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 0.42%
Recall: 85.81%
F1-score: 0.83%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
LSHBlocker#
from privjedai.blocking import LSHBlocker
blocker = LSHBlocker(
    psi=8,
    lmbda=24,
    seed=42,
    prune_ratio=0.8,
)
blocks = blocker.build_blocks(encoded_data=encoded_data)
_ = blocker.evaluate(blocks)
***************************************************************************************************************************
Method: LSHBlocker
***************************************************************************************************************************
Method name: LSHBlocker
Parameters:
psi: 8
lmbda: 24
prune_ratio: 0.8
prune_sample: 1000
seed: 42
Runtime: 0.2658 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 0.09%
Recall: 92.20%
F1-score: 0.19%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
FAISSBlocking#
from privjedai.blocking import FAISSBlocking
blocker = FAISSBlocking(index_type='hnsw')
blocks = blocker.build_blocks(encoded_data=encoded_data, top_k=20)
_ = blocker.evaluate(blocks)
***************************************************************************************************************************
Method: FAISS Blocking
***************************************************************************************************************************
Method name: FAISS Blocking
Parameters:
Runtime: 0.0409 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 4.40%
Recall: 88.06%
F1-score: 8.39%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Meta-blocking Techniques#
The methods above are all standard blocking techniques. privJedAI also implements meta-blocking: it adapts comparison cleaning methods from entity resolution (ER) to bit arrays. In this setting, a block is a set of adjacent active bits of a bit array.
from privjedai.comparison_cleaning import (
    WeightedEdgePruning,
    WeightedNodePruning,
    CardinalityEdgePruning,
    CardinalityNodePruning
)
cc = CardinalityEdgePruning(weighting_scheme='CN-CBS')
cc_blocks = cc.process(encoded_data, adjacent_bits=2)
_ = cc.evaluate(cc_blocks)
Total matching pairs: 233036
***************************************************************************************************************************
Method: Cardinality Edge Pruning
***************************************************************************************************************************
Method name: Cardinality Edge Pruning
Parameters:
Node centric: False
Weighting scheme: CN-CBS
Runtime: 0.7610 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 0.45%
Recall: 98.50%
F1-score: 0.90%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
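The meta-blocking idea can be sketched on toy bit arrays. The sketch rests on several assumptions that the notebook does not confirm: a block is taken to be a window of `adjacent_bits` consecutive active bits, identified by its start position. The edge weight is the plain number of common blocks rather than the `CN-CBS` scheme. Cardinality edge pruning is taken to mean keeping the `top_k` heaviest edges overall. The names `block_keys` and `cardinality_edge_pruning` are hypothetical.

```python
from collections import Counter, defaultdict

def block_keys(bits, adjacent_bits=2):
    """Start positions of every window of adjacent_bits consecutive 1-bits."""
    return {
        i for i in range(len(bits) - adjacent_bits + 1)
        if all(bits[i:i + adjacent_bits])
    }

def cardinality_edge_pruning(bits_a, bits_b, adjacent_bits=2, top_k=2):
    """Keep the top_k record pairs that share the most blocks."""
    index = defaultdict(list)
    for i, bits in enumerate(bits_a):
        for key in block_keys(bits, adjacent_bits):
            index[key].append(i)
    weights = Counter()
    for j, bits in enumerate(bits_b):
        for key in block_keys(bits, adjacent_bits):
            for i in index.get(key, ()):
                weights[(i, j)] += 1  # one more common block for this pair
    return [pair for pair, _ in weights.most_common(top_k)]

bits_a = [[1, 1, 1, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1, 0, 0]]
bits_b = [[1, 1, 1, 0, 0, 0, 0, 0], [0, 0, 1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0, 1, 1]]
kept = cardinality_edge_pruning(bits_a, bits_b)
print(kept)
```

In this toy run the pair that shares only a single block is pruned, while the two pairs that share two blocks each are kept.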
Matching#
After filtering the candidate pairs, we apply a similarity function to each remaining pair and keep the pairs that pass the similarity threshold as matches.
from privjedai.matching import Matcher
import numpy as np
matcher = Matcher(
    batch_size=10_000,
    threshold=0.6,
    metric='dice',
)
matches = matcher.predict(encoded_data=encoded_data, blocks=blocks)
_ = matcher.evaluate(matches)
***************************************************************************************************************************
Method: Matcher
***************************************************************************************************************************
Method name: Matcher
Parameters:
batch_size: 10000
threshold: 0.6
metric: dice
attributes: None
Runtime: 0.1109 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 4.82%
Recall: 87.59%
F1-score: 9.13%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
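As an illustration of the `dice` metric used above, the following sketch computes the Dice coefficient of two bit arrays, 2·|A ∧ B| / (|A| + |B|), and filters candidate pairs by a threshold. The helper names `dice` and `match` are hypothetical, and the assumption that a pair is kept when its score is greater than or equal to the threshold is ours, not confirmed by the notebook.

```python
def dice(bits_a, bits_b):
    """Dice coefficient of two equal-length 0/1 lists."""
    common = sum(x & y for x, y in zip(bits_a, bits_b))
    total = sum(bits_a) + sum(bits_b)
    return 2 * common / total if total else 0.0

def match(candidates, bits_1, bits_2, threshold=0.6):
    """Return (i, j, score) for every candidate pair scoring >= threshold."""
    matches = []
    for i, j in candidates:
        score = dice(bits_1[i], bits_2[j])
        if score >= threshold:
            matches.append((i, j, score))
    return matches

bits_1 = [[1, 1, 0, 1, 0, 0, 1, 0]]
bits_2 = [[1, 1, 0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 1, 0, 0, 1]]
matches = match([(0, 0), (0, 1)], bits_1, bits_2)
print(matches)
```

Only the candidate pairs produced by blocking are scored, which is what keeps this step cheap compared with scoring every possible pair.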
Clustering#
To resolve possible conflicts among the matched pairs, privJedAI provides multiple clustering techniques.
from privjedai.clustering import KiralyMSMApproximateClustering
clusterer = KiralyMSMApproximateClustering()
clusters = clusterer.process(matches, encoded_data=encoded_data)
_ = clusterer.evaluate(clusters)
***************************************************************************************************************************
Method: Kiraly MSM Approximate Clustering
***************************************************************************************************************************
Method name: Kiraly MSM Approximate Clustering
Parameters:
Similarity Threshold: 0.1
Runtime: 0.0295 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 79.11%
Recall: 71.90%
F1-score: 75.33%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
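To show what resolving conflicts can look like, here is a deliberately simple greedy one-to-one assignment. It is a stand-in chosen for brevity, not Kiraly's algorithm and not privJedAI's implementation: it visits scored pairs from most to least similar and keeps a pair only if neither record has been used yet. The name `greedy_one_to_one` and the toy scores are hypothetical; the default threshold of 0.1 mirrors the value printed in the output above.

```python
def greedy_one_to_one(scored_pairs, threshold=0.1):
    """Keep the best-scoring pairs so that each record appears at most once."""
    used_left, used_right, kept = set(), set(), []
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    for i, j, score in ranked:
        if score < threshold:
            break  # all remaining pairs score even lower
        if i in used_left or j in used_right:
            continue  # conflict: one of the records is already matched
        used_left.add(i)
        used_right.add(j)
        kept.append((i, j))
    return kept

# (index in dataset 1, index in dataset 2, similarity)
scored = [(0, 0, 0.9), (0, 1, 0.8), (1, 1, 0.7), (2, 1, 0.95), (2, 2, 0.3)]
clusters = greedy_one_to_one(scored)
print(clusters)
```

Enforcing a one-to-one constraint of this kind is consistent with the jump in precision between the matching and clustering outputs above, since every record can end up in at most one pair.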
Additional Features#
Concurrent matching using Ray
from privjedai.ray.matching import Matcher
import numpy as np
matcher = Matcher(
    batch_size=10_000,
    threshold=0.6,
    metric='dice',
    workers=10,
)
matches = matcher.predict(encoded_data=encoded_data, blocks=blocks)
_ = matcher.evaluate(matches)
***************************************************************************************************************************
Method: Matcher
***************************************************************************************************************************
Method name: Matcher
Parameters:
batch_size: 10000
threshold: 0.6
metric: dice
attributes: None
Runtime: 0.3367 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 4.82%
Recall: 87.69%
F1-score: 9.14%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
GPU Acceleration#
from privjedai.gpu.matching import Matcher
from privjedai.gpu.clustering import KiralyMSMApproximateClustering
matcher = Matcher(
    batch_size=10_000,
    threshold=0.6,
    metric='dice',
)
matches = matcher.predict(encoded_data=encoded_data, blocks=blocks)
_ = matcher.evaluate(matches, verbose=True)
clusterer = KiralyMSMApproximateClustering()
clusters = clusterer.process(matches, encoded_data=encoded_data)
_ = clusterer.evaluate(clusters)