
Implementation of the PQMass two-sample test from Lemos et al. 2024


PQMass: Probabilistic Assessment of the Quality of Generative Models using Probability Mass Estimation


PQMass is a new sample-based method for evaluating the quality of generative models as well as assessing distribution shifts to determine if two datasets come from the same underlying distribution.

Install

To install PQMass, run the following:

pip install pqm

Usage

PQMass takes in $x$ and $y$ two datasets and determines if they come from the same underlying distribution. For instance, in the case of generative models, $x$ represents the samples generated by your model, while $y$ corresponds to the real data or test set.

Figure: an example tessellation for PQMass.

PQMass partitions the space by taking reference points from $x$ and $y$ and creating a Voronoi tessellation around those reference points. On the left is an example of a single region, which we note follows a binomial distribution: a sample is either inside or outside the region. On the right the entire space is partitioned, showing that the region counts jointly follow a multinomial distribution: a given sample can land in region $R_i$ or in any other region. This is crucial, as it allows two metrics to be defined that can be used to determine whether $x$ and $y$ come from the same underlying distribution. The first is the $\chi_{PQM}^2$ statistic:

$$\chi_{PQM}^2 \equiv \sum_{i = 1}^{n_R} \left[ \frac{\left(k({\bf x}, R_i) - \hat{N}_{x, i}\right)^2}{\hat{N}_{x, i}} + \frac{\left(k({\bf y}, R_i) - \hat{N}_{y, i}\right)^2}{\hat{N}_{y, i}} \right]$$

and the second is the $\text{p-value}(\chi_{PQM}^2)$ $$\text{p-value}(\chi_{PQM}^2) \equiv \int_{-\infty}^{\chi^2_{\rm {PQM}}} \chi^2_{n_R - 1}(z) dz$$
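
To make the counting in this statistic concrete, below is a minimal NumPy sketch of the idea: draw reference points, assign every sample to its nearest reference point (its Voronoi region), and compare the per-region counts of $x$ and $y$ with their expected values. This is an illustrative simplification, not the package's implementation; the function name and the proportional-split convention used for $\hat{N}_{x, i}$ and $\hat{N}_{y, i}$ are assumptions here.

import numpy as np

def chi2_pqm_sketch(x, y, num_refs = 100, seed = None):
    rng = np.random.default_rng(seed)
    # Pool the samples and draw the reference points from the pool
    pooled = np.concatenate([x, y], axis = 0)
    refs = pooled[rng.choice(len(pooled), size = num_refs, replace = False)]

    def region_counts(samples):
        # Assign each sample to its nearest reference point (its Voronoi region)
        dists = np.linalg.norm(samples[:, None, :] - refs[None, :, :], axis = -1)
        return np.bincount(dists.argmin(axis = 1), minlength = num_refs)

    k_x, k_y = region_counts(x), region_counts(y)
    # Expected per-region counts under the null: split each region's total
    # in proportion to the sizes of x and y (an assumed convention)
    total = k_x + k_y
    n_hat_x = total * len(x) / (len(x) + len(y))
    n_hat_y = total * len(y) / (len(x) + len(y))
    keep = total > 0  # skip empty regions to avoid division by zero
    return np.sum((k_x[keep] - n_hat_x[keep]) ** 2 / n_hat_x[keep]
                  + (k_y[keep] - n_hat_y[keep]) ** 2 / n_hat_y[keep])

# e.g. chi2_pqm_sketch(np.random.normal(size = (500, 10)), np.random.normal(size = (400, 10)))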

For the $\chi_{PQM}^2$ metric, if your two sets of samples come from the same distribution, the histogram of your $\chi_{PQM}^2$ values should follow a $\chi^2$ distribution. The degrees of freedom (DoF) equal DoF = num_refs - 1. The peak of this distribution is at DoF - 2, the mean equals DoF, and the standard deviation is sqrt(2 * DoF). If your $\chi_{PQM}^2$ values are too high ($\chi^2$ / DoF > 1), it suggests that the samples are out of distribution. Conversely, if the values are too low ($\chi^2$ / DoF < 1), it indicates potential duplication of samples between $x$ and $y$.
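
As a quick sanity check of these numbers (plain arithmetic, using the 100 reference points from the example below):

import numpy as np

num_refs = 100
dof = num_refs - 1        # degrees of freedom
peak = dof - 2            # mode of the chi^2 distribution (valid for dof > 2)
mean = dof                # mean of the chi^2 distribution
std = np.sqrt(2 * dof)    # standard deviation
print(dof, peak, mean, round(std, 2))  # 99 97 99 14.07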

If your two samples are drawn from the same distribution, then the $\text{p-value}(\chi_{PQM}^2)$ should be drawn from a uniform $\mathcal{U}(0,1)$ distribution. This means that if you get a very small value (e.g., 1e-6), you reject the null hypothesis: the two samples are not drawn from the same distribution. If you get values approximately equal to 1 every time, that suggests potential duplication of samples between $x$ and $y$.

PQMass works for any two datasets, as it measures the distribution shift between $x$ and $y$, which we show below.

Example

We are using 100 regions, so the DoF is 99: the expected peak of the $\chi^2$ distribution is 97, the mean is 99, and the standard deviation is 14.07. With this in mind, we set up our example. The p-value should lie between 0 and 1, and a significantly small p-value (e.g., $< 0.05$ or $< 0.01$) means we reject the null hypothesis, so $x$ and $y$ do not come from the same distribution.

The mean p-value should be around 0.5 to pass the null hypothesis test; any significant deviation from this indicates a failure of the null hypothesis test.

Given two datasets, $x$ and $y$, each sampled from a $\mathcal{N}(0, 1)$ in 10 dimensions, the goal is to determine whether they come from the same underlying distribution. This is the null test, since we know they come from the same distribution, but it shows how one would use PQMass to determine this.

from pqm import pqm_pvalue, pqm_chi2
import numpy as np

p = np.random.normal(size = (500, 10))
q = np.random.normal(size = (400, 10))

# To get chi^2 from PQMass
chi2_stat = pqm_chi2(p, q, re_tessellation = 1000)
print(np.mean(chi2_stat), np.std(chi2_stat)) # 98.51, 11.334

# To get pvalues from PQMass
pvalues = pqm_pvalue(p, q, re_tessellation = 1000)
print(np.mean(pvalues), np.std(pvalues)) # 0.50, 0.26

We see that both $\chi_{PQM}^2$ and $\text{p-value}(\chi_{PQM}^2)$ follow their expected distributions, indicating that $x$ and $y$ come from the same underlying distribution.

Another example, in which we do $\textit{not}$ expect $x$ and $y$ to come from the same distribution, has $x$ again sampled from a $\mathcal{N}(0, 1)$ in 10 dimensions, whereas $y$ is sampled from a $\mathcal{U}(0, 1)$ in 10 dimensions.

from pqm import pqm_pvalue, pqm_chi2
import numpy as np

p = np.random.normal(size = (500, 10))
q = np.random.uniform(size = (400, 10))

# To get chi^2 from PQMass
chi2_stat = pqm_chi2(p, q, re_tessellation = 1000)
print(np.mean(chi2_stat), np.std(chi2_stat)) # 577.29, 25.74

# To get pvalues from PQMass
pvalues = pqm_pvalue(p, q, re_tessellation = 1000)
print(np.mean(pvalues), np.std(pvalues)) # 3.53e-56, 8.436e-55

Here it is clear that both $\chi_{PQM}^2$ and $\text{p-value}(\chi_{PQM}^2)$ are far from their expected values, showing that $x$ and $y$ do $\textbf{not}$ come from the same underlying distribution.

Thus, PQMass can be used to identify whether any two datasets come from the same underlying distribution, given enough samples. We encourage users to look through the paper to see the varying experiments and use cases for PQMass!

How to Interpret Results

We have shown what to expect from PQMass when working with $\chi_{PQM}^2$ or $\text{p-value}(\chi_{PQM}^2)$; however, when working with $\chi_{PQM}^2$, there are cases in which it will return 0's. There are a couple of reasons why this could happen:

  • For generative models, 0's indicate memorization: the samples are duplicates of the data the model was trained on.
  • In non-generative-model scenarios, it is typically due to a lack of samples, especially in high dimensions; increasing the number of samples should alleviate the issue.
  • Another non-generative-model scenario in which one could get 0's is duplicate samples shared between $x$ and $y$, as the sketch below illustrates.
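
As a toy illustration of that last point (a sketch only: exact numbers vary run to run, and the re_tessellation value is arbitrary), making $y$ a literal subset of $x$ should push $\chi_{PQM}^2$ well below its expected mean and the p-values toward 1, per the interpretation above:

from pqm import pqm_pvalue, pqm_chi2
import numpy as np

p = np.random.normal(size = (500, 10))
q = p[:400].copy()  # y is a duplicate subset of x

chi2_stat = pqm_chi2(p, q, re_tessellation = 100)
pvalues = pqm_pvalue(p, q, re_tessellation = 100)
print(np.mean(chi2_stat))  # expected to fall far below the chi^2 mean of ~99
print(np.mean(pvalues))    # expected to be close to 1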

Advanced Usage

Depending on the data you are working with, we show other uses of PQMass's parameters below.

Z-Score Normalization

If you determine that you need to normalize $x$ and $y$, PQMass has a built-in z-score normalization, which you can enable by setting z_score_norm = True:

chi2_stat = pqm_chi2(p, q, re_tessellation = 1000, z_score_norm = True)
pvalues = pqm_pvalue(p, q, re_tessellation = 1000, z_score_norm = True)
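
If you prefer to normalize manually, a rough equivalent is to z-score both datasets before calling PQMass. The convention below (statistics of the combined data) is an assumption for illustration and may differ from what z_score_norm = True does internally:

from pqm import pqm_chi2
import numpy as np

p = np.random.normal(size = (500, 10))
q = np.random.normal(size = (400, 10))

# z-score with statistics of the combined data (an assumed convention)
combined = np.concatenate([p, q], axis = 0)
mu, sigma = combined.mean(axis = 0), combined.std(axis = 0)
chi2_stat = pqm_chi2((p - mu) / sigma, (q - mu) / sigma, re_tessellation = 1000)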

Modifying how reference points are selected

The default setup for selecting reference points is to take the number of regions and sample reference points from $x$ and $y$ in proportion to their respective lengths. However, you can sample the reference points only from $x$ by setting x_frac = 1.0:

chi2_stat = pqm_chi2(p, q, re_tessellation = 1000, x_frac = 1.0)
pvalues = pqm_pvalue(p, q, re_tessellation = 1000, x_frac = 1.0)

Alternatively, you can sample the reference points only from $y$ by setting x_frac = 0:

chi2_stat = pqm_chi2(p, q, re_tessellation = 1000, x_frac = 0)
pvalues = pqm_pvalue(p, q, re_tessellation = 1000, x_frac = 0)

Similarly, you can sample reference points equally from both $x$ and $y$ by setting x_frac = 0.5:

chi2_stat = pqm_chi2(p, q, re_tessellation = 1000, x_frac = 0.5)
pvalues = pqm_pvalue(p, q, re_tessellation = 1000, x_frac = 0.5)

Lastly, you can sample reference points from neither $x$ nor $y$ and instead sample them from a Gaussian by setting gauss_frac = 1.0:

chi2_stat = pqm_chi2(p, q, re_tessellation = 1000, gauss_frac = 1.0)
pvalues = pqm_pvalue(p, q, re_tessellation = 1000, gauss_frac = 1.0)

GPU Compatibility

PQMass now works on both CPU and GPU. All that is needed is to pass the device you are using via device = 'cuda' or device = 'cpu'.
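
For example (a minimal sketch assuming a CUDA-capable GPU is available; otherwise pass device = 'cpu'):

from pqm import pqm_pvalue
import numpy as np

p = np.random.normal(size = (500, 10))
q = np.random.normal(size = (400, 10))

pvalues = pqm_pvalue(p, q, re_tessellation = 1000, device = 'cuda')  # or device = 'cpu'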

Developing

If you're a developer, then:

git clone git@github.com:Ciela-Institute/PQM.git
cd PQM
git checkout -b my-new-branch
pip install -e .

But make an issue first so we can discuss implementation ideas.

