Projects, corpora, and data

Clásicos Hispánicos

My most important project outside my work at the University is the independent eBook collection Clásicos Hispánicos. The collection publishes Spanish classics in EPUB and MOBI formats, with texts prepared by specialists and reviewed by a second specialist. We have developed our

Quijote I and II in CH

Some published texts

Coplas de Manrique

Bécquer in the XML-TEI of CH

Some contributors of CH

teiHeader of the Quijote

Corpus of the Spanish Novel from 1880-1940

As part of my work in the CLiGS research group, I have published a small corpus of Spanish novels in XML-TEI called Corpus of Spanish Novels from 1880-1940. We have published several versions in our GitHub repository: XML-TEI, plain text, linguistically annotated XML, and PDF.

This corpus is only a preview of the full corpus I am currently working on, which will be published at the end of the project.

Corpus of Spanish Novels used for Stylometry


As part of my work at CLiGS, I have also contributed to our repository of Python scripts. My main contributions concern the conversion from HTML to XML-TEI, the treatment and extraction of metadata, and the work with stylometric matrices.

Extraction of places from La Regenta using regex
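
As an illustration of this kind of regex-based extraction, here is a minimal sketch. The pattern is hypothetical and is not the one used in the CLiGS scripts: it assumes that a candidate place name is a run of capitalised words following a locative preposition (en, desde, hacia). The sample sentence is invented for the example.

```python
import re
from collections import Counter

# Hypothetical pattern (an assumption, not the project's own regex):
# a locative preposition followed by one or more capitalised words.
PLACE_PATTERN = re.compile(
    r"\b(?:en|desde|hacia)\s+"
    r"([A-ZÁÉÍÓÚÑ][a-záéíóúñü]+(?:\s+[A-ZÁÉÍÓÚÑ][a-záéíóúñü]+)*)"
)


def extract_places(text):
    """Count the candidate place names found in a text."""
    return Counter(match.group(1) for match in PLACE_PATTERN.finditer(text))


# Invented sample sentence, used only to exercise the pattern.
sample = (
    "Ana vivía en Vetusta y soñaba con viajar hacia Madrid; "
    "en Vetusta todo seguía igual."
)
places = extract_places(sample)
print(places.most_common())
```

A pattern this simple will also capture capitalised words that are not places, so the results would normally need manual review or a gazetteer filter.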


I am currently editing the Bible in Spanish chapter by chapter, marking people, places, groups, and direct speech with identifiers (specifying who is talking to whom). After editing, I extract the information and visualise it as graphs.
Everything about this project is published in the GitHub repository.

Graph based on the Genesis of the Bible
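
The step from annotated direct speech to a graph can be sketched as follows. This is only an assumption about the encoding: it supposes that direct speech is marked with the TEI element said and its who and toWhom attributes, and the identifiers (#God, #Adam, #Eve) and the sample fragment are invented for the example. The actual markup of the project may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented fragment; the element and attribute names are assumptions.
sample = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <said who="#God" toWhom="#Adam">¿Dónde estás?</said>
  <said who="#Adam" toWhom="#God">Oí tu voz.</said>
  <said who="#God" toWhom="#Adam">¿Quién te lo dijo?</said>
  <said who="#God" toWhom="#Eve">¿Qué has hecho?</said>
</div>"""


def speech_edges(xml_string):
    """Count directed (speaker, addressee) pairs in annotated speech."""
    root = ET.fromstring(xml_string)
    edges = Counter()
    for said in root.iter(TEI_NS + "said"):
        speaker = said.get("who")
        addressee = said.get("toWhom")
        if speaker and addressee:
            # Drop the leading "#" of the pointer to get the bare identifier.
            edges[(speaker.lstrip("#"), addressee.lstrip("#"))] += 1
    return edges


edges = speech_edges(sample)
for (speaker, addressee), weight in edges.items():
    print(speaker, "->", addressee, weight)
```

The resulting weighted edge list can be loaded into any graph tool for visualisation.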

Stylometry on Political Text

I have been developing a corpus of Spanish political manifestos and studying it with machine learning techniques and stylometry. You can find some results here.

Stylometry of Manifestos of 2004-2015 Spanish Elections
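
Stylometric analyses of this kind usually start from a matrix of word frequencies per document. The following is a minimal sketch of that first step, not the pipeline used for the manifestos: the function names, the tokenisation, and the toy texts are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (accents included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def frequency_matrix(documents, n_features=3):
    """Relative frequencies of the most frequent words of the whole corpus.

    documents maps a document name to its text. Returns the list of
    feature words and a dict mapping each name to its row of frequencies.
    """
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    totals = Counter()
    for counter in counts.values():
        totals.update(counter)
    features = [word for word, _ in totals.most_common(n_features)]
    matrix = {}
    for name, counter in counts.items():
        size = sum(counter.values())
        matrix[name] = [counter[word] / size for word in features]
    return features, matrix


# Toy texts, invented for the example; they are not real manifestos.
documents = {
    "party_a": "la economía y la educación y la sanidad",
    "party_b": "el empleo y la economía",
}
features, matrix = frequency_matrix(documents)
print(features)
print(matrix)
```

Each row of the matrix can then be compared with a distance measure or passed to a classifier.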

Casa de Citas

A database and website containing quotes from Spanish literature that I find interesting while reading. It supports advanced searches, including semantic filters.

Casa de Citas: a guide to quotes from Spanish literature


A database of German morphological gender containing more than 5,000 German words, intended to make this part of German grammar easier to learn for Spanish speakers.

Morphological rule for German words of Romance origin