This repository contains the official implementation of GuideX, a novel method for synthetic data generation that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances for Information Extraction (IE) tasks. We achieve SOTA zero-shot NER performance by finetuning Llama-3.1-8B on our GuideX dataset.
The project uses Python 3.9.7+ and is managed through pyproject.toml.
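A quick way to confirm that your interpreter meets this requirement before creating the environment is a small version check. This is only a sketch: the `MIN_PYTHON` constant and the `python_is_supported` helper are illustrative names, not part of the GuideX code, and the minimum version is copied from the sentence above.

```python
import sys

# Minimum version as stated in this README (assumption: pyproject.toml agrees).
MIN_PYTHON = (3, 9, 7)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor, micro) version is at least `minimum`."""
    return tuple(version_info[:3]) >= tuple(minimum)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old"
    print(f"Python {found}: {status}")
```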
To generate a GuideX dataset, use the GUIDEX_pipeline.py script, which builds the dataset from a given input file. We provide a small example dataset in data/GUIDEX_gen/fineweb-edu-1k.json that you can use to test the pipeline.
Running the pipeline requires a Hugging Face token. You can get one by logging in to your Hugging Face account and going to this page.
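Because a missing token typically only surfaces once model download starts, it can help to check for it up front. The sketch below assumes the token is supplied through the `HF_TOKEN` environment variable, as in the setup steps that follow; the `get_hf_token` helper is a hypothetical name and is not part of GUIDEX_pipeline.py.

```python
import os

# Placeholder string used in the setup instructions; treated as "not set".
PLACEHOLDER = "<your_huggingface_token>"


def get_hf_token(env=os.environ):
    """Return the Hugging Face token from `env`, or raise if it is missing."""
    token = env.get("HF_TOKEN", "").strip()
    if not token or token == PLACEHOLDER:
        raise RuntimeError("HF_TOKEN is not set; set it to your Hugging Face token first.")
    return token
```

Calling `get_hf_token()` at the start of a run fails fast with a readable message instead of an authentication error partway through.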
- Clone the repository
git clone https://github.com/HiTZ/GUIDEX.git
cd GUIDEX
- Create a virtual environment and install the dependencies
python3.9 -m venv guidex_env
source guidex_env/bin/activate.csh
pip install -r requirements.txt
- Set the Hugging Face token
cd data/GUIDEX_gen
setenv HF_TOKEN "<your_huggingface_token>"
- Run the pipeline
python GUIDEX_pipeline.py --input fineweb-edu-1k.json --output guidex_data.jsonl --batch-size 32
The Llama3.1-70B-Instruct model we use to annotate the GuideX dataset is large and probably won't fit on a single GPU. You can use the run_GUIDEX_pipeline_1.slurm script to run the pipeline on a cluster; our experiments were run on a cluster with 2x A100 GPUs.
sbatch run_GUIDEX_pipeline_cluster.slurm
- Check the output's first 10 lines
cat guidex_out.jsonl | head -n 10
We provide a comprehensive Jupyter notebook that guides you through the entire process of performing Named Entity Recognition (NER) with a GuideX-finetuned model. You can find this notebook in the repository at GUIDEX/notebooks/Named Entity Recognition.ipynb.
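To inspect the pipeline output from Python rather than with `head`, the first records can be parsed directly. This is a minimal sketch that assumes only that the output is a JSON Lines file (one JSON value per line); it makes no assumption about the fields GuideX writes, and `peek_jsonl` is an illustrative helper, not part of the repository.

```python
import json
from itertools import islice


def peek_jsonl(path, n=10):
    """Parse the first `n` lines of a JSON Lines file, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in islice(f, n):
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Example usage (pass the file name you gave to --output):
# for record in peek_jsonl("guidex_data.jsonl"):
#     print(sorted(record) if isinstance(record, dict) else record)
```

Printing the sorted keys of each record is a quick way to see which fields the generated instances contain before loading the whole file.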
We are open-sourcing the following components:
- Models: GuideX fine-tuned models (link)
- Dataset: GuideX training and evaluation datasets (link)
- Code: Complete implementation of the GuideX framework
- Reproduction Guide: Step-by-step instructions to reproduce our experiments (below)
Our paper is available at guidex.com. Please cite our work if you use our code or models:
@misc{delafuente2025guidexguidedsyntheticdata,
title={GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction},
author={Neil De La Fuente and Oscar Sainz and Iker García-Ferrero and Eneko Agirre},
year={2025},
eprint={2506.00649},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.00649},
}
We welcome contributions! Please stay tuned for our contribution guidelines.
For questions and feedback, please contact Neil De La Fuente at [email protected].
Note: This repository is under active development.



