An invited oral presentation on Intelligent Structure Identification of Powder X-ray Diffraction Patterns.
Hi, I'm Bin CAO 曹斌!
Creative Designer Coder Player
In Hong Kong
I am engaged in AI4CM (AI for Computational Materials) research, focusing on crystallography and spectroscopy
Infinite possibilities start with your next line of code.
Bin CAO 曹斌
PhD student & Developer Coder Player
Bin CAO is a researcher in Artificial Intelligence for Computational Materials (AI4CM), with a research focus on crystallography, spectroscopy, and data-driven materials characterization. His work integrates physics-based diffraction simulation with machine learning methodologies, particularly emphasizing spectrum-informed sequence models and graph neural representations of crystal structures.
Driven by a strong commitment to open science, he actively promotes the transparent and reproducible dissemination of research outcomes. All source code and models from his studies are openly released on GitHub and Hugging Face, fostering accessibility and collaboration within the scientific community.
He has received several distinctions, including the National Scholarship (2022), the Outstanding Young Academic Presentation Award at the China Materials Conference (CMC) 2024, and an Invited Young Talents Academic Report at CMC 2025.
- Age 27
- Born in Shuozhou (朔州), China
- CV https://bin-cao.github.io/caobin/
- 6+ Years of Experience
- 10+ Projects Completed
- 20+ Papers Published
Everything about me!
-
- 2025 July 1st - 2026 Jan 1st: Exchange Student (half a year)
- CityU, Hong Kong: Spectroscopy Crystal Characterization & Crystal Structure Prediction (CSP) via Generative Algorithms
City University of Hong Kong (CityU) (香港城市大学)
I am currently an exchange student at City University of Hong Kong (CityU) in the Department of Physics, under the supervision of Prof. Ren Yang.
During this period, I aim to work on two main tasks: developing a crystal phase identification system based on XRD data, and generating crystal structures using generative algorithms.
-
- 2023 - Present: PhD in Advanced Materials
- HKUST, Guangzhou: AI-driven X-ray Structural Characterization & Crystal Generation and Property Prediction
The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学广州)
Cao Bin is an active open-source community builder and a collaborative partner in materials science and AI.
Currently, I am pursuing my studies at HKUST(GZ) under the supervision of Prof. Zhang Tong-Yi and Prof. Weng Lutao.
Access my publications here: Google Scholar.
-
- 2023 (Mar 1st - Aug 31st): Researcher (half a year)
- Zhejiang Lab, Hangzhou: Leading the development of the transfer learning framework.
I worked at Zhejiang Lab for half a year in the Intelligent Materials Design Department, led by Prof. Zhang Tong-Yi (Chief Scientist in Materials Science).
During this period, I mainly focused on studying transfer learning and established a strong research connection with my peers.
The open-source project can be accessed here: TrAdaboost
-
- 2020 - 2023: Master of Philosophy
- SHU, Shanghai: AI-driven X-ray Structural Characterization & Active Learning Framework: Bgolearn.
I obtained a master's degree in Solid Mechanics from Shanghai University, supervised by Prof. Zhang Tong-Yi.
Shanghai University provided me with an excellent academic experience. During this period, I was awarded the National Scholarship and recognized as an Outstanding Graduate of Shanghai University.
-
- 2016 - 2020: Bachelor
- BUCT, Beijing: Mechanical Engineering & 3D Modeling and Finite Element Analysis.
Beijing University of Chemical Technology (北京化工大学)
I obtained my bachelor's degree from Beijing University of Chemical Technology (BUCT).
During my four years of study at BUCT, I made many great friends and created wonderful memories.
BUCT has a rigorous academic atmosphere and a dynamic learning environment—I highly recommend studying here.
My Projects
-
01 Crystal Property Prediction
It generates synthetic diffraction patterns that are invariant to crystallographic symmetries...
We propose PRDNet, a novel architecture that integrates graph embeddings with a learned pseudoparticle diffraction module. It generates synthetic diffraction patterns that are invariant to crystallographic symmetries.
We extensively evaluate PRDNet on multiple large-scale benchmarks, including Materials Project, JARVIS-DFT, and MatBench. Our model achieves state-of-the-art performance across a wide range of crystal property prediction tasks, demonstrating its effectiveness.
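As a loose illustration of why a diffraction-style descriptor can be insensitive to how a structure is placed in space, the sketch below evaluates the Debye scattering equation for a small cluster of atoms. This is not PRDNet's learned pseudoparticle module; it is a textbook formula applied to a finite toy cluster with unit scattering factors, and the function name `debye_intensity` is hypothetical. The intensity depends only on interatomic distances, so rotating, translating, or reordering the atoms leaves it unchanged.

```python
import numpy as np


def debye_intensity(positions, q_values, form_factors=None):
    """Debye scattering equation for a finite cluster of atoms.

    I(q) = sum_ij f_i * f_j * sin(q * r_ij) / (q * r_ij)

    Assumption: constant (q-independent) scattering factors, default 1.0.
    """
    positions = np.asarray(positions, dtype=float)
    q = np.asarray(q_values, dtype=float)
    n_atoms = len(positions)
    f = np.ones(n_atoms) if form_factors is None else np.asarray(form_factors, dtype=float)

    # Pairwise distance matrix r_ij, shape (n_atoms, n_atoms).
    diff = positions[:, None, :] - positions[None, :, :]
    r = np.linalg.norm(diff, axis=-1)

    # np.sinc(x) = sin(pi x) / (pi x), so sinc(qr / pi) = sin(qr) / (qr),
    # with the correct limit of 1 when qr == 0.
    qr = q[:, None, None] * r[None, :, :]
    weights = f[:, None] * f[None, :]
    return np.sum(weights[None, :, :] * np.sinc(qr / np.pi), axis=(1, 2))


# Toy data: six random atoms, then the same cluster rotated and shifted.
rng = np.random.default_rng(0)
atoms = rng.uniform(0.0, 5.0, size=(6, 3))
q = np.linspace(0.5, 8.0, 50)

rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
moved = atoms @ rotation.T + np.array([1.0, -2.0, 0.5])

print(np.allclose(debye_intensity(atoms, q), debye_intensity(moved, q)))
```

A periodic crystal would need a reciprocal-space treatment rather than a finite pair sum, so this only conveys the general idea of distance-based invariance.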
-
01 XRD structure identification
I am making efforts to promote end-to-end structure identification...
To achieve end-to-end intelligent structure identification, we developed a novel powder XRD simulation tool (SimXRD, ICLR 2025) that generates high-fidelity simulated XRD patterns closely aligned with experimental data.
Building on this, we also participate in the opXRD database project, striving to establish the largest experimental raw XRD database (opXRD, Adv. Intell. Discov.).
Furthermore, we introduced the first software–hardware integrated system for real-time structure identification, achieving state-of-the-art performance across diverse chemical systems (XQueryer, Natl. Sci. Rev., https://doi.org/10.1093/nsr/nwaf421). A free online platform is available at https://xqueryer.caobin.asia/
For more details, refer to the document: WPEM Manual (figshare, file=51378833).
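For readers new to powder XRD, the most basic ingredient of any pattern simulation is Bragg's law, which maps a lattice-plane spacing to a peak position. The sketch below is a minimal illustration for a cubic lattice only; it is not how SimXRD or WPEM compute patterns, and it ignores intensities, systematic absences, and peak broadening. The function name `bragg_two_theta` and the Cu K-alpha wavelength default are assumptions made for this example.

```python
import math

CU_K_ALPHA = 1.5406  # Å, Cu K-alpha1 wavelength (assumed radiation source)


def bragg_two_theta(a, hkl, wavelength=CU_K_ALPHA):
    """Return the 2-theta angle (degrees) of reflection hkl for a cubic lattice.

    a          : cubic lattice parameter in Å
    hkl        : Miller indices, not all zero
    wavelength : X-ray wavelength in Å

    Returns None when the reflection cannot satisfy Bragg's law.
    """
    h, k, l = hkl
    # Interplanar spacing of a cubic lattice: d = a / sqrt(h^2 + k^2 + l^2).
    d = a / math.sqrt(h * h + k * k + l * l)
    # Bragg's law (first order): wavelength = 2 d sin(theta).
    sin_theta = wavelength / (2.0 * d)
    if sin_theta > 1.0:
        return None
    return 2.0 * math.degrees(math.asin(sin_theta))


# Example: a few reflections of silicon (a = 5.431 Å).
for hkl in [(1, 1, 1), (2, 2, 0), (3, 1, 1)]:
    print(hkl, bragg_two_theta(5.431, hkl))
```

A full simulator additionally needs structure factors, multiplicities, and instrument-dependent peak profiles, which is where the physical fidelity of a simulated pattern comes from.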
-
02 Crystal structure generation
Generating crystals with a minimal element set and maximum symmetry...
Diffraction patterns and crystal structures are closely related concepts. Therefore, my research interest lies in crystal representation.
In this survey (https://arxiv.org/pdf/2505.16379), we provide a comprehensive overview of crystal generation, summarizing and organizing various types of materials while illustrating multiple representations of crystalline structures. We then present a detailed summary and taxonomy of current AI-driven materials generation approaches. Furthermore, we discuss commonly used evaluation metrics and highlight open-source codebases and benchmark datasets.
One of the projects we have worked on involves embedding crystals using asymmetry units (ASUs), space groups, lattice vectors, and the minimal element set to inversely generate stable, novel crystal structures (CGWGAN, JMI, 2024), achieving good results while preserving high symmetry.
In another project, we introduced powder XRD to provide additional insights from reciprocal space, enhancing the model's understanding of crystals (ASUGNN, J. Appl. Cryst.), which shows great potential.
I am currently working on deriving a universal pre-trained model for crystals and hope to share more soon!
-
03 Bayesian Opt. - Bgolearn
We launched the first Bayesian optimization/active learning framework for the materials community...
Bgolearn is the first active learning framework tailored for the materials science community. Official homepage: https://github.com/Bgolearn
Since its release, Bgolearn has gained significant traction, with over 100,000 downloads and growing adoption in both academia and industry (see: Bgolearn, Materials & Design, 2024). It provides a lightweight yet extensible Python package for Bayesian global optimization, purpose-built for accelerating materials discovery and intelligent design workflows.
The framework supports regression and classification tasks in both single- and multi-target scenarios. It implements diverse acquisition strategies and offers a modular pipeline for virtual screening, active learning, and multi-objective optimization. Bgolearn includes nine ready-to-use utility functions, making it easy to integrate into a wide range of research applications.
🔧 Key Features:
- Built-in support for regression/classification
- Single/multi-objective and multi-target design
- Seamless integration with high-throughput and virtual screening workflows
- Lightweight design with extensible modules
> 📦 PyPI: pip install Bgolearn
> 🎥 Tutorial (BiliBili): Watch here
> 🚀 Try it online: Google Colab Demo
In 2025, I led the development of BgoFace, a user interface for Bgolearn published in MGR Advances. BgoFace makes Bayesian global optimization accessible for materials research by simplifying workflows and bridging experimental-computational gaps. With an intuitive design and support for real-world constraints, it enables rapid discovery without requiring ML expertise.
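To make the idea of an acquisition strategy concrete, here is a minimal NumPy sketch of one Bayesian optimization step over a virtual candidate set, using Expected Improvement (a common acquisition function) for a minimization problem. It does not use the Bgolearn API; every function name, the RBF kernel, the length scale, and the toy objective are assumptions chosen only for illustration.

```python
from math import erf

import numpy as np


def rbf_kernel(A, B, length_scale=0.3):
    """Squared-exponential kernel between row vectors of A and B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior(X_train, y_train, X_cand, length_scale=0.3, noise=1e-6):
    """Posterior mean and standard deviation of a zero-noise-ish GP surrogate."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_cand, length_scale)
    y_mean = y_train.mean()  # constant prior mean
    mu = y_mean + K_s.T @ np.linalg.solve(K, y_train - y_mean)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mu, sigma, y_best):
    """Expected Improvement for minimization: E[max(y_best - f, 0)]."""
    z = (y_best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return (y_best - mu) * cdf + sigma * pdf


# Toy "experiments": three measured compositions and their property values.
X_train = np.array([[0.0], [0.5], [1.0]])
y_train = (X_train[:, 0] - 0.3) ** 2  # hypothetical objective to minimize

# Virtual space: candidate compositions that have not been measured yet.
X_cand = np.linspace(0.0, 1.0, 21).reshape(-1, 1)

mu, sigma = gp_posterior(X_train, y_train, X_cand)
ei = expected_improvement(mu, sigma, y_train.min())
x_next = X_cand[np.argmax(ei), 0]
print("next candidate to measure:", x_next)
```

In an active learning loop, the recommended candidate would be measured, appended to the training data, and the step repeated; a production package would also tune kernel hyperparameters rather than fixing them.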
-
TrAdaboost is an open-source project for transfer learning education.
After leading the transfer learning framework during my work at Zhejiang Lab, I decided to open-source a teaching project to introduce the fundamental concepts of transfer learning using simple models and toy data.
This project has been gaining more and more attention. Thank you! (Location: https://github.com/Bin-Cao/TrAdaboost).
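The core idea of TrAdaBoost can be shown with its weight update alone. The sketch below follows the original formulation by Dai et al. (2007) for binary classification and is not taken from the linked repository; the function name `tradaboost_weight_update` and the clipping of the error are assumptions made for this example. Misclassified source samples are down-weighted because they look unlike the target domain, while misclassified target samples are up-weighted as in AdaBoost.

```python
import numpy as np


def tradaboost_weight_update(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One TrAdaBoost weight update.

    w_src, w_tgt     : current sample weights (source / target domain)
    err_src, err_tgt : 0/1 arrays, 1 where the weak learner was wrong
    n_rounds         : total number of boosting rounds N
    """
    w_src = np.asarray(w_src, dtype=float)
    w_tgt = np.asarray(w_tgt, dtype=float)
    err_src = np.asarray(err_src, dtype=float)
    err_tgt = np.asarray(err_tgt, dtype=float)

    # Weighted error measured on the target domain only.
    eps = np.sum(w_tgt * err_tgt) / np.sum(w_tgt)
    # The algorithm requires eps < 1/2; clip to keep beta_t well defined.
    eps = float(np.clip(eps, 1e-10, 0.499))

    beta_t = eps / (1.0 - eps)  # target-domain factor, < 1
    n_src = len(w_src)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / n_rounds))  # source factor, <= 1

    new_src = w_src * beta**err_src  # shrink weights of misclassified source samples
    new_tgt = w_tgt * beta_t ** (-err_tgt)  # grow weights of misclassified target samples
    return new_src, new_tgt, beta_t


# Toy example: four source and four target samples, one mistake in each domain.
new_src, new_tgt, beta_t = tradaboost_weight_update(
    w_src=np.ones(4),
    w_tgt=np.ones(4),
    err_src=[1, 0, 0, 0],
    err_tgt=[1, 0, 0, 0],
    n_rounds=10,
)
print(new_src, new_tgt)
```

A complete implementation wraps this update in a loop that retrains a weak learner on the reweighted, combined data each round and builds the final vote from the later rounds.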
-
05 Outlier identification with the TCGPR model
A novel machine learning algorithm for outlier identification and feature selection...
I proposed TCGPR in 2022, based on the data sensitivity reflected in kernel-based Gaussian process models. It defines a factor that evaluates data consistency for pattern recognition and outlier identification (https://github.com/Bin-Cao/TCGPR).
This model has performed well in studies of materials with small data sets. By characterizing the data distributions, we can often achieve better fitting results (though it may not always work).
Following this strategy, we successfully applied the algorithm in two works: Small, 2024 (https://onlinelibrary.wiley.com/doi/10.1002/smll.202408750) and npj Comput. Mater., 2023 (https://www.nature.com/articles/s41524-023-01150-0).
Selected papers
-
Crystal property prediction, governed by quantum mechanical principles, is computationally prohibitive to solve exactly for large many-body systems using traditional density functional theory. While machine learning models have emerged as efficient approximations for large-scale applications, their performance is strongly influenced by the choice of atomic representation.
We introduce PRDNet that leverages unique reciprocal-space diffraction besides graph representations. To enhance sensitivity to elemental and environmental variations, we employ a data-driven pseudo-particle to generate a synthetic diffraction pattern. PRDNet ensures full invariance to crystallographic symmetries.
- Model type Graph model
- Category Crystal Property Prediction
- Date Sep 29, 2025
-
We developed XQueryer, an intelligent agent for simulating, recognizing, and analyzing powder X-ray diffraction (PXRD) patterns. Trained on over two million high-fidelity simulated spectra, XQueryer achieves significantly higher accuracy—28.9% better than existing AI models and traditional methods. Integrated with a powder diffractometer, it enables real-time structural analysis of crystal samples.
A freely accessible online platform is available at https://xqueryer.caobin.asia/
- Model type Sequence model
- Category Crystal Structure Identification
- Date Oct 1, 2025
-
Powder X-ray diffraction (XRD) patterns are highly effective for crystal identification and play a pivotal role in materials discovery. While machine learning (ML) has advanced the analysis of powder XRD patterns, progress has been constrained by the limited availability of training data and established benchmarks. To address this, we introduce SimXRD, the largest open-source simulated XRD pattern dataset to date, aimed at accelerating the development of crystallographic informatics.
We developed a novel XRD simulation method that incorporates comprehensive physical interactions, resulting in a high-fidelity database. SimXRD comprises 4,065,346 simulated powder XRD patterns, representing 119,569 unique crystal structures under 33 simulated conditions that reflect real-world variations. We benchmark 21 sequence models in both in-library and out-of-library scenarios and analyze the impact of class imbalance in long-tailed crystal label distributions.
- Model type Sequential model
- Category XRD structure identification
- Date Mar 2, 2025
-
Materials discovery is a fundamental driver of technological advancement with direct impact on real-world challenges. From energy systems and electronics to biomedical devices and sustainable manufacturing, novel materials enable new functionalities and improved performance.
Our survey offers a detailed and comprehensive overview of material representations, particularly focusing on crystal structures and their precise mathematical definitions. We systematically categorize and compare a wide range of existing techniques, supported by a clear development timeline. To facilitate research and practical application, we provide abundant resources, including links to open-source code and datasets. Additionally, we highlight current challenges and propose future research directions to inspire continued innovation in the field.
- Paper type Review
- Category Crystal structure generation
- Date May 23 2025
-
In 2025, I led the development of the user interface software for Bgolearn, called BgoFace, published in MGR Advances. BgoFace is a user-friendly platform designed to accelerate materials innovation through active learning. It streamlines Bayesian global optimization by tackling key challenges such as experimental–computational interoperability and algorithm accessibility. With its intuitive interface and built-in support for experimental constraints, BgoFace enables efficient materials discovery without requiring deep machine learning expertise.
Using default settings and simple button clicks, BgoFace guided six iterations of optimization, recommending four infill points per iteration via four different utility functions, for a total of 24 suggestions. The optimal sample, with the highest QY of 37.25%, was discovered in the fifth iteration, nearly doubling the initial value. The software and source code are openly available at https://github.com/Bgolearn/BgoFace
- Paper type Research article
- Category Active learning for materials design
- Date Aug 4 2025
-
A notable difficulty in applying machine learning to this domain is the lack of sufficiently sized experimental datasets, which has constrained researchers to train primarily on simulated data. However, models trained on simulated pXRD patterns showed limited generalization to experimental patterns, particularly for low-quality experimental patterns with high noise levels and elevated backgrounds.
With the Open Experimental Powder X-Ray Diffraction Database (opXRD), we provide an openly available and easily accessible dataset of labeled and unlabeled experimental powder diffractograms. Labeled opXRD data can be used to evaluate the performance of models on experimental data and unlabeled opXRD data can help improve the performance of models on experimental data, e.g. through transfer learning methods. We collected 92552 diffractograms, 2179 of them labeled, from a wide spectrum of materials classes.
- Model type Sequential model
- Category XRD open database
- Date Mar 7, 2025
-
In 2022, I open-sourced the Bgolearn framework and have since actively maintained and promoted it in collaboration with experimental researchers. My goal is to foster materials innovation through attachable and accessible machine learning tools.
Bgolearn is a lightweight and extensible Python package for Bayesian global optimization, tailored for accelerating materials discovery and design. It supports both regression and classification tasks, includes a variety of acquisition strategies, and provides a seamless pipeline for virtual screening, active learning, and multi-objective optimization.
- 📦 Official PyPI: pip install Bgolearn
- 🎥 Code tutorial (BiliBili): Watch here
- 🚀 Colab Demo: Run it online
I’m truly glad to see several groundbreaking innovations in materials science enabled by the Bgolearn framework:
2025
- Nano Letters: Self-Driving Laboratory under UHV — Link
- Small: ML-Engineered Nanozyme System for Anti-Tumor Therapy — Link
- Computational Materials Science: Mg-Ca-Zn Alloy Optimization — Link
- Measurement: Foaming Agent Optimization in EPB Shield Construction — Link
- Intelligent Computing: Metasurface Design via Bayesian Learning — Link
2024
- Model type Active learning
- Category Active learning
- Date May 1, 2024
-
In this work, we present a crystal generative framework based on Wyckoff generative adversarial network (CGWGAN) to efficiently discover novel crystals.
The CGWGAN includes three modules: a generator of crystal templates, an atom-infill module, and a crystal screening module. The generator uses a generative adversarial network (GAN) to produce crystal templates embedded with asymmetry units (ASUs), space groups, lattice vectors, and the total number of atoms within the lattice cell, ensuring that the generated templates precisely match all requirements of crystals. These templates become crystal candidates after filling in atoms of different chemical elements. These candidates are screened by M3GNet and the passed ones are subjected to density functional theory (DFT)-based calculations to finally verify their stability. As a showcase, the CGWGAN successfully discovers seven novel crystals within the Ba-Ru-O system, demonstrating its effectiveness. This work provides a knowledge-guided Artificial Intelligence generative framework for accelerating crystal discovery.
- Model type GAN
- Category Crystal generation
- Date Nov 7, 2024
My Supervisors
-
Prof. Tong-Yi Zhang is my master's and doctoral supervisor. He is an academician of the Chinese Academy of Sciences and the founding dean of the Materials Genome Institute at Shanghai University. He is also the founding director of the Materials Genome Engineering division in the Chinese Materials Research Society (CMRS). In 2022, he joined the Hong Kong University of Science and Technology (Guangzhou). His current research interests include Materials Genome Engineering, Materials Informatics, and Mechanoinformatics.
Chair professor of HKUST(GZ)
-
Prof. Lu Tao Weng is my co-supervisor at HKUST (Guangzhou). He is currently the director of the Materials Characterization and Preparation Facility (GZ), the deputy director of the Office of Lab Facilities and Safety, and an adjunct professor in the Advanced Materials Thrust at HKUST (GZ). He is also an adjunct professor in the Department of Chemical and Biological Engineering at HKUST. His research interests focus on the surface and interface analysis of advanced materials using techniques such as XPS, ToF-SIMS, dynamic SIMS, AFM, contact angle measurements, and more.
Director of MCPF, HKUST(GZ)
-
Prof. Ren Yang is my host supervisor during my visiting period at CityU (July 2025 - December 2025). He is the Head of the Department of Physics at City University of Hong Kong, a Chair Professor, the Hong Kong Global STEM Professor, and the Director of the Jockey Club “Energy and Materials Physics” STEM Laboratory. From 1999 to 2021, he worked at the Argonne National Laboratory (ANL) in the United States, serving as the Chief Beamline Scientist for the High-Energy X-ray beamline station at the Advanced Photon Source and as a Senior Laboratory Physicist. He has published more than 960 papers in scholarly journals such as Nature, Science, and Physical Review Letters.
Chair Professor of CityU
My blog & news
-
An invited oral presentation on BGOLearn: An End-to-End Active Learning Framework for Materials Optimization.
A poster presentation showcasing our published work, SimXRD-4M.
I am honored to have received two awards: the Invited Academic Talk by Promising Young Talent & the High-Level Academic Poster.
The Thirteenth International Conference on Learning Representations (ICLR 2025) will be hosted in Singapore.
Our paper SimXRD-4M has been selected for poster presentation at this international conference.
The report topic is Intelligent Phase Identification for Powder X-Ray Diffraction. For more details, see the MRS Spring Meeting.
Our system revolutionizes PXRD-based crystal identification with high-fidelity data synthesis and the state-of-the-art XQueryer model. Seamlessly integrating with diffractometers, it enables precise, AI-driven material discovery and extends to broader chemical applications.
I attended the 2024 China Materials Conference held in Guangzhou and gave a report on Intelligent Material Characterization Systems.
I was honored to receive the Outstanding Young Academic Presentation Award.
I gave a talk on intelligent characterization in the AI Lab Blueprint.
The talk mainly included some technical details and future research plans.
I was invited to attend the X-ray technology progress seminar.
I gave an oral report on WPEM: From Characterization to Structural Analysis.
I graduated from Shanghai University on June 1, 2023, as an Outstanding Graduate.
During my study period, I was awarded the 2022 Chinese National Scholarship due to my hard work.
I attended the First National Symposium on Data-Driven Computational Mechanics held in Dalian, China.
I gave an oral presentation on Symbolic Regression for Formula Discovery in Physical Formulas.
I was invited to attend the Roundtable Forum - Spirit of Qian Weichang as a Student Representative.
The forum was hosted by Academician Gu Binglin and focused on the legacy of Academician Qian Weichang.
Get in touch
I'm currently available for new collaborations, so feel free to send me a message about anything you want to run past me. You can contact me anytime, 24/7.
