Neilus03/GUIDEX

Guided Synthetic Data Generation for Zero-Shot Information Extraction (ACL Findings 2025)

This repository contains the official implementation of GuideX, a novel method for synthetic data generation that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances for Information Extraction (IE) tasks. Fine-tuning Llama-3.1-8B on our GuideX dataset achieves state-of-the-art (SOTA) zero-shot NER performance.

🛠️ Setup

The project uses Python 3.9.7+ and is managed through pyproject.toml.

Generating a GuideX dataset

To generate a GuideX dataset, use the GUIDEX_pipeline.py script, which builds the dataset from a given input file. We provide a small example dataset, data/GUIDEX_gen/fineweb-edu-1k.json, which you can use to test the pipeline.

To generate a GuideX dataset, you need to have a Hugging Face token. You can get one by logging in to your Hugging Face account and going to this page.
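
Because the pipeline depends on the token being available in the environment, a small preflight check can catch a missing value before a long run starts. The sketch below is illustrative and not part of the repository: the function name `require_hf_token` is hypothetical, and it assumes the token is exposed through the `HF_TOKEN` environment variable used in the token step of this section.

```python
import os


def require_hf_token(env=None):
    """Return the Hugging Face token from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    # Reject an empty value and an unreplaced "<your_huggingface_token>" placeholder.
    if not token or token.startswith("<"):
        raise RuntimeError("HF_TOKEN is not set; set it before running the pipeline.")
    return token


# Example usage (hypothetical): call once before starting a long generation run.
# token = require_hf_token()
```

The check only confirms that a value is present; it does not verify that the token is valid or has access to the model.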

  1. Clone the repository
git clone https://github.com/HiTZ/GUIDEX.git
cd GUIDEX
  2. Create a virtual environment and install the dependencies
python3.9 -m venv guidex_env
source guidex_env/bin/activate.csh
pip install -r requirements.txt
  3. Set the Hugging Face token
cd data/GUIDEX_gen
setenv HF_TOKEN "<your_huggingface_token>"
  4. Run the pipeline
python GUIDEX_pipeline.py --input fineweb-edu-1k.json --output guidex_data.jsonl --batch-size 32

The Llama3.1-70B-Instruct model that we use to annotate the GuideX dataset is large and probably won't fit on a single GPU. You can use the run_GUIDEX_pipeline_1.slurm script to run the pipeline on a cluster; our experiments were run on a cluster with 2x A100 GPUs.

sbatch run_GUIDEX_pipeline_cluster.slurm
  5. Check the first 10 lines of the output
head -n 10 guidex_data.jsonl
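
To inspect the output as parsed records rather than raw text, the same check can be sketched in Python. This is a minimal example, assuming only that the output is a JSON Lines file with one JSON value per line; the helper name `peek_jsonl` is hypothetical, and nothing is assumed about the fields inside each record.

```python
import json
from itertools import islice


def peek_jsonl(path, n=10):
    """Parse and return the first n non-empty records of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not raise a parse error.
        non_empty = (line for line in handle if line.strip())
        for line in islice(non_empty, n):
            records.append(json.loads(line))
    return records


# Example usage: pass the same file name that was given to --output.
# for record in peek_jsonl("guidex_data.jsonl"):
#     print(record)
```

Reading lazily with `islice` stops after the first n records, so the whole file is never loaded into memory, which matters once the dataset grows beyond the 1k example.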

Doing NER with GuideX

We provide a comprehensive Jupyter notebook that guides you through the entire process of performing Named Entity Recognition (NER) with a GuideX-finetuned model. You can find this notebook in the repository at GUIDEX/notebooks/Named Entity Recognition.ipynb.

📊 Results

[Images: Table 2, Table 3, Table 4]

🚀 Open Source Plan

We are open-sourcing the following components:

  • Models: GuideX fine-tuned models (link)
  • Dataset: GuideX training and evaluation datasets (link)
  • Code: Complete implementation of the GuideX framework
  • Reproduction Guide: Step-by-step instructions to reproduce our experiments (below)

📚 Paper

Our paper is available at guidex.com. Please cite our work if you use our code or models:

@misc{delafuente2025guidexguidedsyntheticdata,
      title={GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction}, 
      author={Neil De La Fuente and Oscar Sainz and Iker García-Ferrero and Eneko Agirre},
      year={2025},
      eprint={2506.00649},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.00649}, 
}

🤝 Contributing

We welcome contributions! Please stay tuned for our contribution guidelines.

📧 Contact

For questions and feedback, please contact Neil De La Fuente at [email protected].


Note: This repository is under active development.
