An invited oral presentation on Bgolearn: An End-to-End Active Learning Framework for Materials Optimization.
A poster presentation showcasing our published work, SimXRD-4M.
Hello there! My name is Bin CAO. I am engaged in AI4CM (AI for Computational Materials) research, focusing on crystallography and spectroscopy. My research primarily includes physics-based diffraction pattern simulation and machine learning representations in spectrum-based sequence models and crystal-based graph structures.
I am passionate about open science and strongly advocate for the unrestricted dissemination of knowledge. To support this vision, I share all code from my research (on GitHub and Hugging Face) to ensure transparency and accessibility.
Spectroscopy Crystal Characterization & Crystal Structure Prediction (CSP) via Generative Algorithms
City University of Hong Kong (CityU) (香港城市大学)
I am currently an exchange student at City University of Hong Kong (CityU) in the Department of Physics, under the supervision of Prof. Ren Yang.
During this period, I aim to work on two main tasks: developing a crystal phase identification system based on XRD data, and generating crystal structures using generative algorithms.
AI-driven X-ray Structural Characterization & Crystal Generation and Property Prediction
The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学广州)
Cao Bin is an active open-source community builder and a collaborative partner in materials science and AI.
Currently, I am pursuing my studies at HKUST(GZ) under the supervision of Prof. Zhang Tong-yi and Prof. Weng Lutao.
My publications are listed here: Google Scholar.
Leading the development of the transfer learning framework.
I have been working at Zhejiang Lab in the Intelligent Materials Design Department, led by Prof. Zhang Tongyi (Chief Scientist in Materials Science), for half a year.
During this period, I mainly focused on studying transfer learning and established a strong research connection with my peers.
The open-source project is available here: TrAdaboost.
AI-driven X-ray Structural Characterization & Active Learning Framework: Bgolearn.
I obtained a master's degree in Solid Mechanics from Shanghai University, supervised by Prof. Zhang Tongyi.
Shanghai University provided me with an excellent academic experience. During this period, I was awarded the National Scholarship and recognized as an Outstanding Graduate of Shanghai University.
Mechanical Engineering & 3D Modeling and Finite Element Analysis.
Beijing University of Chemical Technology (北京化工大学)
I obtained my bachelor's degree from Beijing University of Chemical Technology (BUCT).
During my four years of study at BUCT, I made many great friends and created wonderful memories.
BUCT has a rigorous academic atmosphere and a dynamic learning environment—I highly recommend studying here.
I am making efforts to promote end-to-end structure identification...
To achieve end-to-end intelligent structure identification, we developed a novel powder XRD simulation tool (SimXRD, ICLR 2025) that generates high-fidelity simulated XRD patterns closely aligned with experimental data.
Building on this, we also participate in the opXRD database project, striving to establish the largest experimental raw XRD database (arXiv 2503.0557).
Furthermore, we proposed the first software-hardware integrated system for real-time structure identification, achieving state-of-the-art performance (XQueryer, Oriel, Seattle, USA). In our framework, detailed atomic sites are determined using a refinement strategy.
For more details, refer to the document: WPEM Manual (figshare, file=51378833).
Generating crystals with a minimal element set and maximum symmetry...
Diffraction patterns and crystal structures are closely related concepts. Therefore, my research interest lies in crystal representation.
In this survey (https://arxiv.org/pdf/2505.16379), we provide a comprehensive overview of crystal generation, summarizing and organizing various types of materials while illustrating multiple representations of crystalline structures. We then present a detailed summary and taxonomy of current AI-driven materials generation approaches. Furthermore, we discuss commonly used evaluation metrics and highlight open-source codebases and benchmark datasets.
One of the projects we have worked on involves embedding crystals using asymmetry units (ASUs), space groups, lattice vectors, and the minimal element set to inversely generate stable, novel crystal structures (CGWGAN, JMI, 2024), achieving good results while preserving high symmetry.
In another project, we introduced powder XRD to provide additional insights from reciprocal space, enhancing the model's understanding of crystals (ASUGNN, J. Appl. Cryst.), which shows great potential.
I am currently working on deriving a universal pre-trained model for crystals and hope to share more soon!
We launched a Bayesian optimization/active learning framework for the materials community...
Bgolearn is the first active learning framework designed for the materials community (homepage: https://github.com/Bgolearn).
Since its release, it has achieved over 80,000 downloads, gaining significant popularity in the application community (Bgolearn, Mat. & Design, 2024). Bgolearn includes nine utility functions that can be applied to both single and multi-target designs, in regression or classification tasks.
In collaboration with Dr. Ma, we launched a user interface for ease of use (MLMD, npj Comput. Mat., 2024).
TrAdaboost is an open-source project for transfer learning education.
After leading the transfer learning framework during my work at Zhejiang Lab, I decided to open-source a teaching project to introduce the fundamental concepts of transfer learning using simple models and toy data.
The project has been gaining steady attention. Thank you! (Repository: https://github.com/Bin-Cao/TrAdaboost).
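As a rough illustration of the kind of concept such a teaching project covers, here is a minimal sketch of one instance-reweighting step in the classic TrAdaBoost scheme (Dai et al., 2007). The function name, argument names, and the clamping of the target error are assumptions made for this sketch and are not taken from the TrAdaboost repository.

```python
import math


def tradaboost_reweight(src_w, tgt_w, src_err, tgt_err, n_rounds):
    """One TrAdaBoost reweighting step (hypothetical helper, for illustration).

    src_w, tgt_w     : current weights of source / target samples
    src_err, tgt_err : per-sample absolute errors |h(x) - y|, each in [0, 1]
    n_rounds         : total number of boosting rounds planned
    """
    n_src = len(src_w)

    # Weighted error of the current learner, measured on target data only.
    eps = sum(w * e for w, e in zip(tgt_w, tgt_err)) / sum(tgt_w)
    # Assumption: clamp eps so that beta_t stays strictly inside (0, 1).
    eps = min(max(eps, 1e-12), 0.499)

    beta_t = eps / (1.0 - eps)
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_src) / n_rounds))

    # Misclassified source samples lose weight (they look unlike the target).
    new_src = [w * beta_src ** e for w, e in zip(src_w, src_err)]
    # Misclassified target samples gain weight, as in ordinary AdaBoost.
    new_tgt = [w * beta_t ** (-e) for w, e in zip(tgt_w, tgt_err)]
    return new_src, new_tgt, beta_t


new_src, new_tgt, beta_t = tradaboost_reweight(
    src_w=[1.0, 1.0],
    tgt_w=[1.0, 1.0, 1.0, 1.0],
    src_err=[1.0, 0.0],
    tgt_err=[1.0, 0.0, 0.0, 0.0],
    n_rounds=10,
)
```

The asymmetry is the point: source samples that disagree with the learner are gradually down-weighted, while hard target samples are emphasized.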
A novel machine learning algorithm for outlier identification and feature selection...
I proposed TCGPR in 2022, based on the data sensitivity reflected in kernel-based Gaussian process models. It defines a factor that evaluates data consistency for pattern recognition and outlier identification (https://github.com/Bin-Cao/TCGPR).
This model achieved great performance in studying materials with small data sets. By characterizing the data distributions, we can often achieve better fitting results (though it may not always work).
Following this strategy, we successfully applied the algorithm in two studies: (Small, 2024: https://onlinelibrary.wiley.com/doi/10.1002/smll.202408750) and (npj Comput. Mat., 2023: https://www.nature.com/articles/s41524-023-01150-0).
Powder X-ray diffraction (XRD) patterns are highly effective for crystal identification and play a pivotal role in materials discovery. While machine learning (ML) has advanced the analysis of powder XRD patterns, progress has been constrained by the limited availability of training data and established benchmarks. To address this, we introduce SimXRD, the largest open-source simulated XRD pattern dataset to date, aimed at accelerating the development of crystallographic informatics.
We developed a novel XRD simulation method that incorporates comprehensive physical interactions, resulting in a high-fidelity database. SimXRD comprises 4,065,346 simulated powder XRD patterns, representing 119,569 unique crystal structures under 33 simulated conditions that reflect real-world variations. We benchmark 21 sequence models in both in-library and out-of-library scenarios and analyze the impact of class imbalance in long-tailed crystal label distributions.
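For readers new to powder diffraction, the sketch below shows only the most basic ingredient of any pattern simulation: converting lattice-plane spacings into peak positions with Bragg's law, λ = 2d·sin θ. It is not the SimXRD method, which models many additional physical effects. The cubic-lattice restriction, the Cu Kα1 wavelength, and all function names are assumptions chosen for this illustration.

```python
import math

CU_K_ALPHA1_ANGSTROM = 1.5406  # assumed X-ray wavelength (Cu K-alpha1)


def cubic_d_spacing(a, hkl):
    """Interplanar spacing of a cubic lattice with edge a (angstrom).

    hkl must not be (0, 0, 0), which is not a valid reflection.
    """
    h, k, l = hkl
    return a / math.sqrt(h * h + k * k + l * l)


def two_theta_deg(d, wavelength=CU_K_ALPHA1_ANGSTROM):
    """Bragg angle 2-theta in degrees, or None if the reflection is unreachable."""
    s = wavelength / (2.0 * d)
    if s > 1.0:
        return None  # no real solution of Bragg's law for this spacing
    return 2.0 * math.degrees(math.asin(s))


def peak_positions(a, reflections, wavelength=CU_K_ALPHA1_ANGSTROM):
    """Return (hkl, 2-theta) pairs sorted by increasing angle."""
    peaks = []
    for hkl in reflections:
        angle = two_theta_deg(cubic_d_spacing(a, hkl), wavelength)
        if angle is not None:
            peaks.append((hkl, angle))
    return sorted(peaks, key=lambda p: p[1])


peaks = peak_positions(4.0, [(1, 1, 0), (1, 0, 0), (6, 6, 6)])
```

A realistic simulation would further add peak intensities, profile shapes, background, and noise on top of these positions.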
Materials discovery is a fundamental driver of technological advancement with direct impact on real-world challenges. From energy systems and electronics to biomedical devices and sustainable manufacturing, novel materials enable new functionalities and improved performance.
Our survey offers a detailed and comprehensive overview of material representations, particularly focusing on crystal structures and their precise mathematical definitions. We systematically categorize and compare a wide range of existing techniques, supported by a clear development timeline. To facilitate research and practical application, we provide abundant resources, including links to open-source code and datasets. Additionally, we highlight current challenges and propose future research directions to inspire continued innovation in the field.
A notable difficulty in applying machine learning to this domain is the lack of sufficiently sized experimental datasets, which has constrained researchers to train primarily on simulated data. However, models trained on simulated pXRD patterns showed limited generalization to experimental patterns, particularly for low-quality experimental patterns with high noise levels and elevated backgrounds.
With the Open Experimental Powder X-Ray Diffraction Database (opXRD), we provide an openly available and easily accessible dataset of labeled and unlabeled experimental powder diffractograms. Labeled opXRD data can be used to evaluate the performance of models on experimental data and unlabeled opXRD data can help improve the performance of models on experimental data, e.g. through transfer learning methods. We collected 92552 diffractograms, 2179 of them labeled, from a wide spectrum of materials classes.
The active learning approach dynamically balances exploration and exploitation.
All active learning algorithms in the Bgolearn framework are open-sourced, supporting materials informatics development.
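To make the exploration-exploitation balance concrete, here is a minimal sketch of Expected Improvement, a standard utility function in Bayesian optimization. It is written for maximization and uses only the standard library. The function names and the `xi` margin parameter are assumptions for this illustration and do not reflect Bgolearn's actual API.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Expected Improvement of one candidate (maximization).

    mean, std   : surrogate-model prediction and its uncertainty
    best_so_far : best measured value in the current dataset
    xi          : optional margin; larger values favor exploration
    """
    improvement = mean - best_so_far - xi
    if std <= 0.0:
        # No uncertainty: only a certain gain counts.
        return max(improvement, 0.0)
    z = improvement / std
    # First term rewards a high predicted mean (exploitation),
    # second term rewards high uncertainty (exploration).
    return improvement * normal_cdf(z) + std * normal_pdf(z)


def pick_next_candidate(means, stds, best_so_far, xi=0.0):
    """Return the index of the candidate with the highest EI, plus all scores."""
    scores = [expected_improvement(m, s, best_so_far, xi)
              for m, s in zip(means, stds)]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Candidate 0 matches the current best with no uncertainty;
# candidate 1 has a slightly lower mean but is uncertain.
index, scores = pick_next_candidate([1.0, 0.9], [0.0, 0.5], best_so_far=1.0)
```

In this toy case the uncertain candidate is selected, because a certain prediction equal to the current best offers no expected gain.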
In this work, we present a crystal generative framework based on Wyckoff generative adversarial network (CGWGAN) to efficiently discover novel crystals.
The CGWGAN includes three modules: a generator of crystal templates, an atom-infill module, and a crystal screening module. The generator uses a generative adversarial network (GAN) to produce crystal templates embedded with asymmetry units (ASUs), space groups, lattice vectors, and the total number of atoms within the lattice cell, ensuring that the generated templates precisely match all requirements of crystals. These templates become crystal candidates after filling in atoms of different chemical elements. These candidates are screened by M3GNet and the passed ones are subjected to density functional theory (DFT)-based calculations to finally verify their stability. As a showcase, the CGWGAN successfully discovers seven novel crystals within the Ba-Ru-O system, demonstrating its effectiveness. This work provides a knowledge-guided Artificial Intelligence generative framework for accelerating crystal discovery.
Prof. Tong-Yi Zhang is my master's and doctoral supervisor. He is an academician of the Chinese Academy of Sciences and the founding dean of the Materials Genome Institute at Shanghai University. He is also the founding director of the Materials Genome Engineering division in the Chinese Materials Research Society (CMRS). In 2022, he joined the Hong Kong University of Science and Technology (Guangzhou). His current research interests include Materials Genome Engineering, Materials Informatics, and Mechanoinformatics.
Chair professor of HKUST(GZ)
Prof. Lu Tao Weng is my co-supervisor at HKUST (Guangzhou). He is currently the director of the Materials Characterization and Preparation Facility (GZ), the deputy director of the Office of Lab Facilities and Safety, and an adjunct professor in the Advanced Materials Thrust at HKUST (GZ). He is also an adjunct professor in the Department of Chemical and Biological Engineering at HKUST. His research interests focus on the surface and interface analysis of advanced materials using techniques such as XPS, ToF-SIMS, dynamic SIMS, AFM, contact angle measurements, and more.
Director of MCPF, HKUST(GZ)
Prof. Ren Yang is my host supervisor during my visiting period at CityU (July 2025 - December 2025). He is the Head of the Department of Physics at City University of Hong Kong, a Chair Professor, the Hong Kong Global STEM Professor, and the Director of the Jockey Club “Energy and Materials Physics” STEM Laboratory. From 1999 to 2021, he worked at the Argonne National Laboratory (ANL) in the United States, serving as the Chief Beamline Scientist for the High-Energy X-ray beamline station at the Advanced Photon Source and as a Senior Laboratory Physicist. He has published more than 960 papers in scholarly journals such as Nature, Science, and Physical Review Letters.
Chair Professor of CityU
The Thirteenth International Conference on Learning Representations (ICLR 2025) will be hosted in Singapore.
Our paper SimXRD-4M has been selected for poster presentation at this international conference.
The talk is on Intelligent Phase Identification for Powder X-Ray Diffraction. For more details, see the MRS Spring Meeting.
Our system revolutionizes PXRD-based crystal identification with high-fidelity data synthesis and the state-of-the-art XQueryer model. Seamlessly integrating with diffractometers, it enables precise, AI-driven material discovery and extends to broader chemical applications.
I attended the 2024 China Materials Conference held in Guangzhou and gave a report on Intelligent Material Characterization Systems.
I was honored to receive the Outstanding Young Academic Presentation Award.
I gave a talk on intelligent characterization at the AI Lab Blueprint.
The talk mainly included some technical details and future research plans.
I was invited to attend the X-ray technology progress seminar.
I gave an oral report on WPEM: From Characterization to Structural Analysis.
I graduated from Shanghai University on June 1, 2023, as an Outstanding Graduate.
During my studies, I was awarded the 2022 Chinese National Scholarship.
I attended the First National Symposium on Data-Driven Computational Mechanics held in Dalian, China.
I gave an oral presentation on Symbolic Regression for Formula Discovery in Physical Formulas.
I was invited to attend the Roundtable Forum - Spirit of Qian Weichang as a Student Representative.
The forum was hosted by Academician Gu Binglin and focused on the legacy of Academician Qian Weichang.
I'm currently available for new collaborations, so feel free to send me a message about anything you want to run past me. You can contact me anytime.