Generic MP Querying Overview
This notebook demonstrates how to use the `energy_gnome` library to query the Materials Project (MP) database via its API.
You will learn how to:
- Initialize and configure a generic materials database.
- Retrieve entries using the `MPRester` from `mp_api`.
- Manage and update local raw datasets.
- Optionally save structure files (`CIF`) for future use.
> This workflow is also compatible with specialized subclasses such as `PerovskiteDatabase` and `CathodeDatabase`. Refer to the respective notebooks for targeted use cases.
```python
from energy_gnome.dataset import MPDatabase
from pathlib import Path

# Change data_dir to reflect your project's folder structure.
# Here, we assume that there are a `notebook` and a `data` subfolder
# in the main project folder.
main_dir = Path(".").resolve().parent
data_dir = main_dir / "data"
```
Dataset creation
Database Initialization
We begin by initializing a generic MP-based database using the `MPDatabase` class.
- `name`: Defines a unique name for this database instance. Use distinct names for different projects or dataset versions to avoid accidental overwriting.
- `data_dir`: Sets the root directory where all files will be stored (e.g., raw and processed datasets, CIFs).
- `allow_raw_update()`: Enables updates to the raw data stage, allowing newly retrieved entries to be stored.
> For initializing other MP-based database types, such as `PerovskiteDatabase` or `CathodeDatabase`, consult the respective example notebooks.
Data Retrieval
This step fetches material entries from the Materials Project via its API.
To proceed, you must have an MP API key. Follow these steps:
- Register on the Materials Project.
- Copy your API key from your Materials Project account.
- Save it to a `config.yaml` file in your working directory using this format:
Updating the Raw Database
Once the data is retrieved, we compare it with the existing raw dataset (if any) and update accordingly.
This ensures:
- New materials are added.
- Existing entries are not duplicated.
- Data integrity is maintained across multiple runs.
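The three guarantees above can be pictured with a small standard-library sketch. It is only an illustration of the idea and not the `energy_gnome` implementation: the function name `merge_new_entries`, the `material_id` key, and the placeholder records are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def merge_new_entries(
    existing: List[Record],
    retrieved: List[Record],
    key: str = "material_id",  # assumed identifier field, for illustration only
) -> Tuple[List[Record], List[Record]]:
    """Return (updated dataset, newly added records).

    Records whose key is already present are skipped, so running the
    merge repeatedly on the same retrieved data adds nothing new.
    """
    seen = {rec[key] for rec in existing}
    added: List[Record] = []
    for rec in retrieved:
        if rec[key] not in seen:
            seen.add(rec[key])
            added.append(rec)
    # Existing records keep their order; new ones are appended at the end.
    return existing + added, added


# Placeholder records (hypothetical IDs and formulas, not real MP data).
existing = [
    {"material_id": "mp-0001", "formula": "A2B"},
    {"material_id": "mp-0002", "formula": "AB3"},
]
retrieved = [
    {"material_id": "mp-0002", "formula": "AB3"},  # already stored
    {"material_id": "mp-0003", "formula": "ABC"},  # new entry
]

updated, added = merge_new_entries(existing, retrieved)
# A second run with the same retrieved data should add nothing.
updated_again, added_again = merge_new_entries(updated, retrieved)
```

Keying the comparison on a single identifier keeps the check cheap even for large queries, at the cost of not detecting entries whose contents changed under an unchanged identifier.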
Saving CIF Files
You may optionally save structure files (`CIF` format) for the retrieved materials.
IMPORTANT: To save time and disk space, it’s often more efficient to skip saving CIFs at the raw stage — especially if you plan to downsample or filter the dataset later.
Instead, consider saving CIFs only after:
- The dataset has been processed or cleaned.
- You've finalized the materials you plan to use in model training or screening.
This can significantly reduce IO overhead and file clutter, particularly for large-scale MP queries.
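As a rough picture of the deferred approach, the sketch below filters first and writes structure files only for the entries that remain. It uses only the standard library; `write_selected_cifs`, the mapping from identifier to CIF text, and the placeholder contents are hypothetical and are not part of `energy_gnome`.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_selected_cifs(
    cif_by_id: Dict[str, str],
    selected_ids: Iterable[str],
    out_dir: Path,
) -> List[Path]:
    """Write one .cif file per selected identifier and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for material_id in selected_ids:
        path = out_dir / f"{material_id}.cif"
        path.write_text(cif_by_id[material_id])
        written.append(path)
    return written


# Placeholder CIF text keyed by hypothetical identifiers.
cif_by_id = {
    "mp-0001": "data_placeholder_1\n",
    "mp-0002": "data_placeholder_2\n",
    "mp-0003": "data_placeholder_3\n",
}

# Suppose processing kept only two of the three retrieved materials.
selected = ["mp-0001", "mp-0003"]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_selected_cifs(cif_by_id, selected, Path(tmp) / "cif")
    written_names = sorted(p.name for p in paths)
    n_files_on_disk = len(list((Path(tmp) / "cif").glob("*.cif")))
```

Here only two files reach the disk instead of three; for a large MP query the same pattern avoids writing files for materials that are later discarded.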