Listicles

A natural dataset without the biases of human-elicited datasets.

This dataset, on the order of 300k items, consists of claims ("milk good for you") and reasons ("milk is rich in calcium").

This dataset can be used for weakly-supervised natural language inference. The labels would be "supports" and "neutral". Unlike in human-elicited datasets, both the language and the annotations are derived from an entirely natural setting, so the data does not contain the same artifacts that plague other such datasets.
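
As a rough illustration of the weakly-supervised setup, the sketch below turns claim/reason pairs into labelled inference examples. It rests on assumptions that are not part of the dataset description: the names `Pair`, `NLIExample` and `build_nli_examples` are hypothetical, the second sample pair is made up, and "neutral" examples are produced by pairing a claim with a reason taken from a different claim, which is only one plausible way to obtain them. The actual data format is documented in the repository.

```python
import random
from typing import List, NamedTuple


class Pair(NamedTuple):
    """One claim/reason item (hypothetical in-memory layout)."""
    claim: str
    reason: str


class NLIExample(NamedTuple):
    premise: str     # the reason
    hypothesis: str  # the claim
    label: str       # "supports" or "neutral"


def build_nli_examples(pairs: List[Pair], seed: int = 0) -> List[NLIExample]:
    """Emit one "supports" and, where possible, one "neutral" example per pair.

    Assumption: a reason that was attached to a different claim is treated
    as neutral with respect to this claim. This is weak supervision, so
    some of these labels will be noisy.
    """
    rng = random.Random(seed)
    examples = []
    for pair in pairs:
        # The reason that naturally accompanies the claim supports it.
        examples.append(NLIExample(pair.reason, pair.claim, "supports"))
        # Reasons belonging to other claims serve as neutral candidates.
        # (Quadratic scan: fine for a toy sample, too slow for 300k items.)
        others = [p.reason for p in pairs if p.claim != pair.claim]
        if others:
            examples.append(NLIExample(rng.choice(others), pair.claim, "neutral"))
    return examples


# The first pair is the example from the text; the second is invented.
SAMPLE_PAIRS = [
    Pair("milk good for you", "milk is rich in calcium"),
    Pair("cycling beats driving", "bikes never get stuck in traffic"),
]

examples = build_nli_examples(SAMPLE_PAIRS)
for ex in examples:
    print(f"{ex.label:9s} | {ex.premise} -> {ex.hypothesis}")
```

For the full dataset, the neutral candidates would more sensibly be drawn by random index with a retry on collision rather than by rebuilding a candidate list for every pair.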

For more details (and the dataset itself), visit the GitHub repository.

Look here to inspect the data and collection process.