Generic MP Querying Overview

This notebook demonstrates how to use the energy_gnome library to query the Materials Project (MP) database via its API.

You will learn how to:

Initialize and configure a generic materials database.
Retrieve entries using the MPRester from mp_api.
Manage and update local raw datasets.
Optionally save structure files (CIF) for future use.

> This workflow is also compatible with specialized subclasses such as PerovskiteDatabase and CathodeDatabase. Refer to the respective notebooks for targeted use cases.

Python

from energy_gnome.dataset import MPDatabase
from pathlib import Path

# Change data_dir to reflect your project's folder structure.
# Here, we assume that there are a `notebook` and a `data` subfolder
# in the main project folder.
main_dir = Path(".").resolve().parent
data_dir = main_dir / "data"

Dataset creation

Database Initialization

We begin by initializing a generic MP-based database using the MPDatabase class.

name: Defines a unique name for this database instance. Use distinct names for different projects or dataset versions to avoid accidental overwriting.
data_dir: Sets the root directory where all files will be stored (e.g., raw and processed datasets, CIFs).
allow_raw_update(): Enables updates to the raw data stage, allowing newly retrieved entries to be stored.

> For initializing other MP-based database types, such as PerovskiteDatabase or CathodeDatabase, consult the respective example notebooks.

Python

mp_db = MPDatabase(name="mp", data_dir=data_dir)
mp_db.allow_raw_update()

print(mp_db)

Data Retrieval

This step fetches material entries from the Materials Project via its API.

To proceed, you must have an MP API key. Follow these steps:

Register on the Materials Project.
Copy your API key from here.
Save it to a config.yaml file in your working directory using this format:

YAML

MP: "<your-api-key>"

Python

db, materials = mp_db.retrieve_materials(max_framework_size=6, mute_progress_bars=False)

Updating the Raw Database

Once the data is retrieved, we compare it with the existing raw dataset (if any) and update accordingly.

This ensures:

New materials are added.
Existing entries are not duplicated.
Data integrity is maintained across multiple runs.

Python

mp_db.compare_and_update(new_db=db, stage="raw")

Saving CIF Files

You may optionally save structure files (CIF format) for the retrieved materials.

IMPORTANT: To save time and disk space, it’s often more efficient to skip saving CIFs at the raw stage — especially if you plan to downsample or filter the dataset later.

Instead, consider saving CIFs only after:

The dataset has been processed or cleaned.
You've finalized the materials you plan to use in model training or screening.

This can significantly reduce IO overhead and file clutter, particularly for large-scale MP queries.

Python

mp_db.save_cif_files(stage="raw", mute_progress_bars=False)
mp_db.save_database("raw")