Power analysis #20

Merged
merged 62 commits into from
Dec 22, 2024

Contributor

@Michael-Howes Michael-Howes commented Oct 18, 2024

Overview: This pull request adds functions for performing power analyses with PPI. The methodology behind the power analyses is developed in Section 3 of [BHvL2024]. The pull request includes:

  • A new python module ppi/ppi_power_analysis.py implementing the power analysis.
  • A jupyter notebook examples/ppi_power_analysis.ipynb to demonstrate the power analysis with examples.
  • A test file tests/test_power_analysis.py.

Motivation: Power analyses inform design choices and are a desirable feature for applied researchers. The implemented power analysis captures the trade-off between expensive high-quality labels and cheaper machine learning predictions. It also quantifies how effective PPI is for a given dataset.

Implementation: Functions are named ppi_[estimand]_power in line with the existing PPI functions such as ppi_[estimand]_ci. The functions output a standardized dictionary containing the recommended number of labeled and unlabeled samples. The dictionary also contains other quantities related to the power analysis. The power analysis is currently implemented for mean estimation, linear regression, logistic regression and Poisson regression.
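To make the label/prediction trade-off concrete, here is a minimal sketch for the mean-estimation case. It is not the code in this pull request: the function names, the `n` and `N` dictionary keys, the per-sample cost model and the grid search are all assumptions made for illustration. It assumes a power-tuned PPI mean estimator with the variance-minimizing tuning parameter, for which the variance is (sigma^2 / n) * (1 - rho^2 * N / (n + N)), where rho is the correlation between labels and predictions.

```python
def effective_n(n: int, N: int, rho: float) -> float:
    """Labeled-only sample size whose variance matches a power-tuned PPI
    mean estimate built from n labeled and N unlabeled samples.
    Assumes the variance-minimizing tuning parameter (illustrative formula)."""
    return n * (n + N) / (n + (1.0 - rho**2) * N)


def best_pair_under_budget(budget, cost_labeled, cost_unlabeled, rho):
    """Hypothetical helper: try every feasible n, spend the remaining
    budget on unlabeled samples, and keep the pair with the largest
    effective sample size."""
    best = None
    max_n = int(budget // cost_labeled)
    for n in range(1, max_n + 1):
        N = int((budget - n * cost_labeled) // cost_unlabeled)
        candidate = {"n": n, "N": N, "effective_n": effective_n(n, N, rho)}
        if best is None or candidate["effective_n"] > best["effective_n"]:
            best = candidate
    return best


# Example: labels cost 10x as much as predictions, and predictions are good.
pair = best_pair_under_budget(
    budget=1000.0, cost_labeled=10.0, cost_unlabeled=1.0, rho=0.9
)
print(pair)
```

With rho = 0 the formula reduces to the labeled sample size n, so unlabeled samples add nothing; as rho approaches 1, each unlabeled sample is worth almost as much as a labeled one.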

Testing: Tests are included in tests/test_power_analysis.py. The following features are tested:

  • The output satisfies the budget or effective sample size constraints.
  • The output is optimal given the costs.
  • The predicted effective sample size is close to the realized effective sample size.
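The last check can be illustrated with a small simulation. This is only a sketch of the idea, not the test in tests/test_power_analysis.py: the Gaussian data model, the oracle tuning parameter, the function names and the 10% tolerance are assumptions chosen for the example.

```python
import numpy as np


def predicted_effective_n(n, N, rho):
    # Closed form for a power-tuned PPI mean estimate (illustrative assumption).
    return n * (n + N) / (n + (1.0 - rho**2) * N)


def realized_effective_n(n, N, rho, trials=4000, seed=0):
    """Simulate the estimator many times and convert its empirical
    variance into an effective sample size. Labels have unit variance."""
    rng = np.random.default_rng(seed)
    f_lab = rng.standard_normal((trials, n))    # predictions on labeled data
    f_unlab = rng.standard_normal((trials, N))  # predictions on unlabeled data
    noise = rng.standard_normal((trials, n))
    y_lab = rho * f_lab + np.sqrt(1.0 - rho**2) * noise  # Corr(Y, f) = rho
    lam = rho / (1.0 + n / N)  # oracle variance-minimizing tuning parameter
    est = y_lab.mean(axis=1) + lam * (f_unlab.mean(axis=1) - f_lab.mean(axis=1))
    return 1.0 / est.var()  # Var(Y) / Var(estimate), with Var(Y) = 1


predicted = predicted_effective_n(50, 200, 0.8)
realized = realized_effective_n(50, 200, 0.8)
print(predicted, realized)
```

In practice the tuning parameter and rho would be estimated from data, which adds some slack between the predicted and realized values; the oracle version keeps the sketch short.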

Dependencies: No new dependencies added.

Documentation: No additional documentation was added outside of the jupyter notebook (examples/power_analysis.ipynb). Let me know if you would like additional documentation.

Checklist:

  • Tested with pytest framework
  • Formatted with black
  • Documentation

Collaborator

@tijana-zrnic tijana-zrnic left a comment

Approved. Thanks!

1 + epsilon
), f"{optimal_n}, {powerful_pair['effective_n']}"

## Check if the estimated
Owner

Comment incomplete?

Owner

Fixed two errors:

First, np.concat does not exist. Made it np.concatenate.

Second, I was getting a NaN error, so I added np.nan_to_num. It is a kludge to make the notebook work. Please feel free to add a different fix if there is one, @Michael-Howes.


Owner

Actually, I reverted the second change (np.nan_to_num) because the NaN error looks like it was caused by some corrupted data on my end. Re-downloading the dataset fixed it.
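For anyone hitting the same two issues, a small sketch with made-up arrays (not the notebook's data): np.concatenate is available across NumPy releases, whereas np.concat is, to my knowledge, only an alias added in NumPy 2.0. Counting NaNs explicitly also makes a data problem visible, whereas np.nan_to_num silently replaces them.

```python
import numpy as np

# Made-up arrays standing in for the notebook's data.
a = np.array([1.0, 2.0])
b = np.array([3.0, np.nan])

# Works on old and new NumPy versions alike.
joined = np.concatenate([a, b])

# Surface NaNs instead of hiding them, so corrupted input is noticed.
nan_count = int(np.isnan(joined).sum())

# For comparison: nan_to_num replaces NaN with 0.0 by default.
masked = np.nan_to_num(joined)

print(joined, nan_count, masked)
```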

@aangelopoulos aangelopoulos merged commit 63d1782 into aangelopoulos:main Dec 22, 2024
Labels
enhancement New feature or request
4 participants