Revolutionizing Scientific Discovery with AI: Inside the Science Discovery Engine

Nishan Pantha, Muthukumaran Ramasubramanian, Carson Davis, Derek Koehl

UPDATED May 27, 2025

PUBLISHED May 27, 2025

In an era of information overload, even seasoned researchers struggle to stay current with the latest data and discoveries. That’s where NASA’s Science Discovery Engine (SDE) comes in, leveraging artificial intelligence (AI) to transform how we discover, access, and engage with scientific knowledge, while also making metadata stewardship more efficient.

The SDE is a project funded by NASA’s Office of the Chief Science Data Officer at NASA Headquarters and aimed at increasing the usability and discoverability of the agency’s vast collection of science data and research. Managed by the data science organization at NASA’s Marshall Space Flight Center in Huntsville, Alabama, SDE is a collaboration between machine learning and human expertise.

Metadata Stewardship Meets AI Precision

SDE is designed to help researchers quickly locate high-quality, relevant content in an ever-expanding sea of data and information. Traditionally, this process required expert review and extensive manual curation of thousands, if not millions, of digital assets to categorize and tag them with accurate metadata.

To address this challenge, the SDE team developed sophisticated machine learning pipelines that automate many aspects of the classification process. These models categorize content quickly with domain-specific accuracy and make the metadata curation process more efficient.

Human + AI Collaboration

While AI is powerful, it is not perfect on its own. That’s why SDE uses a “human-in-the-loop” approach, with subject matter experts (SMEs) involved throughout the workflow.

The process starts with SMEs outlining classification goals and expectations, such as defining the scope of classification, performance expectations, and decision factors. This ensures that machine learning outputs align with domain expertise from day one. From there, SMEs label data, validate AI outputs, and help uncover edge cases, feeding new examples into the system to improve future predictions. This feedback loop ensures the system continuously learns and adapts to the evolving nature of scientific data and information.

Here’s how the process works:

Training: SMEs provide labeled examples to guide the model’s learning.
Validation: AI predictions are reviewed and corrected by SMEs. This step is especially crucial for tricky or ambiguous cases.
Refinement: Corrections are incorporated into future training rounds, so the model improves with each iteration.
Deployment: Once deployed, models remain in the feedback loop, allowing curators to catch errors and feed improvements back into the system.
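
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not SDE's implementation: the `FeedbackLoop` class, its method names, and the word-overlap "model" are all hypothetical stand-ins for a trained classifier and a real review interface.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    text: str
    label: str


@dataclass
class FeedbackLoop:
    """Toy human-in-the-loop cycle (hypothetical, for illustration only)."""
    training_set: list[Example] = field(default_factory=list)

    def train(self, labeled: list[Example]) -> None:
        # Training: SMEs provide labeled examples.
        self.training_set.extend(labeled)

    def predict(self, text: str) -> str:
        # Stand-in "model": return the label of the training example that
        # shares the most words with the input. A real system would call a
        # trained classifier here.
        words = set(text.lower().split())
        best = max(
            self.training_set,
            key=lambda ex: len(words & set(ex.text.lower().split())),
        )
        return best.label

    def validate(self, text: str, sme_label: str) -> bool:
        # Validation: an SME compares the prediction with the correct label.
        if self.predict(text) == sme_label:
            return True
        # Refinement: the correction becomes a new training example.
        self.training_set.append(Example(text, sme_label))
        return False


loop = FeedbackLoop()
loop.train([
    Example("gamma-ray burst detected", "gamma-ray bursts"),
    Example("black hole merger observed", "black holes"),
])

doc = "gravitational wave signal from merger"
first_pass = loop.validate(doc, "gravitational waves")   # SME corrects the model
second_pass = loop.validate(doc, "gravitational waves")  # correction has been learned
```

In this toy run, the first prediction is wrong because the model has never seen the "gravitational waves" label. Once the SME correction is stored, the second prediction matches.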

Diagram showing how the Science Discovery Engine's AI model is reviewed by subject matter experts at each stage of the training process. — *Figure: Iterative model refinement process that includes continuous communication between SMEs, stakeholders, and the LLM team.*

Diagram showing each step of the Science Discovery Engine artificial intelligence workflow. — *Figure: SDE curation workflow with AI-assisted tagging and inference pipeline.*

This cycle of feedback, learning, and improvement ensures that the SDE evolves alongside the research it helps organize, without ever losing the human touch.

Tagging the Universe at Scale

One of the most impactful applications of AI-powered curation in SDE is the Time-Domain and Multi-Messenger (TDAMM) Astronomy classifier, a domain-specific model that classifies astronomy and astrophysics information into 36 categories spanning objects, observational signals, and messengers—like black holes, gravitational waves, or gamma-ray bursts.

What makes the TDAMM classifier special isn’t just its accuracy, but its ability to understand scientific nuance. It can, for example, identify papers that discuss supernovae observed in the infrared even if they never use the term “infrared” explicitly, picking up clues like “795nm” instead. The model powers a dedicated TDAMM portal within SDE, enabling users to search and filter content by categories, accelerating how researchers explore phenomena in the cosmos.

This demonstrates how machine learning can enable deep, contextual discovery—far beyond basic keyword matching.
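
To make the gap between keyword matching and contextual cues concrete, here is a small hand-written sketch. It is not how the TDAMM classifier works (that is a learned model); it only shows the kind of signal a plain keyword search misses. The function names are hypothetical, and the band edges are approximate conventional values assumed for illustration.

```python
import re

# Approximate band edges in nanometres (an assumption for illustration;
# exact cutoffs vary between fields and instruments).
BANDS = [
    ("ultraviolet", 10.0, 380.0),
    ("visible", 380.0, 750.0),
    ("infrared", 750.0, 1_000_000.0),
]

UNIT_TO_NM = {"nm": 1.0, "um": 1_000.0}

# Matches wavelengths written like "795nm", "795 nm", or "2.2 um".
WAVELENGTH = re.compile(r"(\d+(?:\.\d+)?)\s*(nm|um)\b")


def keyword_match(text: str, term: str) -> bool:
    """Basic keyword search: true only if the term appears literally."""
    return term.lower() in text.lower()


def bands_from_wavelengths(text: str) -> set[str]:
    """Infer spectral bands from wavelengths mentioned in the text."""
    found = set()
    for value, unit in WAVELENGTH.findall(text):
        nm = float(value) * UNIT_TO_NM[unit]
        for name, low, high in BANDS:
            if low <= nm < high:
                found.add(name)
    return found


abstract = "The supernova was imaged at 795nm with a narrow-band filter."
literal_hit = keyword_match(abstract, "infrared")
inferred = bands_from_wavelengths(abstract)
```

Here the keyword search finds nothing, while the wavelength rule places the observation in the infrared. A trained model generalizes this idea to many cues at once rather than relying on one hard-coded rule.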

Technology that Powers Discovery

The infrastructure that powers SDE’s AI tagging is anchored by an inference pipeline hosted within a system called COSMOS. It uses tools like Docker and FastAPI to support real-time classification at scale.

When a user queries whether an asset belongs to a particular science domain, COSMOS forwards the request to the model, runs inference, and returns the result in real time. If the model struggles with an asset, COSMOS flags it for SME review so that expert corrections can feed back into the model.
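
The request flow described above might look roughly like the following sketch. It uses only the Python standard library; in a FastAPI deployment, a function like `classify_asset` would presumably sit behind a route handler. How COSMOS decides that the model "struggles" is not stated here, so the confidence cutoff, the names, and the toy model are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical cutoff: predictions less confident than this go to SME review.
REVIEW_THRESHOLD = 0.7


@dataclass
class InferenceResult:
    asset_id: str
    domain: str
    belongs: bool
    confidence: float
    needs_sme_review: bool


def classify_asset(
    asset_id: str,
    text: str,
    domain: str,
    model: Callable[[str, str], float],
    review_queue: list,
) -> InferenceResult:
    """Score one asset against one domain and flag uncertain results."""
    score = model(text, domain)           # probability the asset belongs
    belongs = score >= 0.5
    confidence = max(score, 1.0 - score)  # confidence in the chosen answer
    result = InferenceResult(
        asset_id, domain, belongs, confidence,
        needs_sme_review=confidence < REVIEW_THRESHOLD,
    )
    if result.needs_sme_review:
        review_queue.append(result)       # hand off to a human curator
    return result


def toy_model(text: str, domain: str) -> float:
    # Stand-in for a trained classifier: the fraction of domain words
    # that appear in the text.
    domain_words = domain.lower().split()
    hits = sum(word in text.lower() for word in domain_words)
    return hits / len(domain_words)


queue: list = []
clear = classify_asset(
    "asset-1", "Gravitational waves from a neutron star merger",
    "gravitational waves", toy_model, queue,
)
unsure = classify_asset(
    "asset-2", "Surface waves on Titan's lakes",
    "gravitational waves", toy_model, queue,
)
```

The first asset is classified confidently and returned directly. The second scores at the decision boundary, so it is also placed on the review queue for an SME.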

A diagram of the inference pipeline architecture involved in the Science Discovery Engine’s artificial intelligence tagging system. — *Figure: A schematic diagram of the inference pipeline architecture that shows the overall flow for inference.*

Looking Ahead: Indus-SDE

To expand the system’s capabilities, the team is developing Indus-SDE, a custom domain-specific language model trained on more than 500,000 scientific documents.

Once deployed, Indus-SDE will:

Assess document relevance for indexing
Generate concise, accurate titles
Enhance search precision across disciplines

While developed for NASA, the model’s architecture could serve other scientific fields—from Earth science to biomedicine.

Why It Matters

SDE is more than a search engine—it’s a leap forward in research efficiency and knowledge access. By combining machine learning with expert oversight, SDE ensures that researchers can spend less time sifting through data and information and more time making discoveries.

Advanced search capabilities provide several key benefits:

Faster Access to High-Quality Content: Automated tagging speeds up indexing and reduces manual labor.
Improved Search Precision: With better organization, scientists can more easily find what they need.
Cross-Disciplinary Support: Models like Indus-SDE can serve multiple fields, not just Earth or space science.

A New Era of Scientific Discovery

From classifying cosmic phenomena to simplifying data exploration, SDE and its machine learning-powered curation workflow mark a new chapter in how we interact with scientific information.

As it continues to evolve, SDE is poised to accelerate discovery – not just at NASA, but across the scientific community.

The age of intelligent discovery is here—and it’s only just beginning.
