This is a dataset of ~341k items consisting of claims and reasons, collected from listicle-style web pages (pages presenting "N reasons why X"). It can be used for weakly supervised natural language inference (NLI), with the labels "supports" and "neutral." Because the annotations come from a natural setting rather than crowdsourcing, the dataset avoids the annotation artifacts that plague other NLI datasets.
Purpose
The labels are "supports" and "neutral," programmatically detected across matched sites. The evidence and other content come from entirely natural settings, so the data does not contain the same artifacts that plague other such datasets. About 70% of the entries include supporting evidence alongside the claim and reason. Research fun fact: the original goal was to also include contradictory examples (and labels), but these were more difficult to find and confirm. That difficulty is one reason this research project was ultimately orphaned.
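As a rough illustration of the NLI use case, the sketch below maps one entry to a premise/hypothesis pair. The mapping (reason plus optional evidence as the premise, claim as the hypothesis) and the helper name `to_nli_example` are assumptions made for illustration, not the dataset's documented method. How "neutral" pairs are detected is not specified here, so only the "supports" case is shown.

```python
def to_nli_example(entry):
    """Map one entry to a premise/hypothesis pair labeled "supports".

    Assumption: the reason (plus evidence, when present) acts as the
    premise and the claim acts as the hypothesis.
    """
    parts = [entry["reason"]]
    if entry.get("evidence"):
        parts.append(entry["evidence"])
    return {
        "premise": " ".join(parts),
        "hypothesis": entry["claim"],
        "label": "supports",
    }


# Entry taken from the examples in this document (no evidence provided).
example = to_nli_example(
    {
        "claim": "You should opt for a bot.",
        "reason": "They're easy to use.",
        "evidence": None,
    }
)
```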
Format
The dataset is a CSV with the following columns:
| Column | Description |
|---|---|
| query | The query originally issued to find the web page. |
| claim | The claim implicit in the title of the page. |
| reason | One of the enumerated reasons supporting the claim. |
| evidence | Optional. Information that supports the reason. |
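A minimal sketch of reading this layout with Python's standard `csv` module follows. It assumes the file has a header row with the four column names above and that a blank `evidence` cell means no evidence was extracted. The helper name `read_entries`, the in-memory sample, and the placeholder query strings are hypothetical; the claims, reasons, and evidence are taken from the examples in this document.

```python
import csv
import io

# Column names taken from the table above.
COLUMNS = ["query", "claim", "reason", "evidence"]


def read_entries(handle):
    """Yield one dict per CSV row, mapping a blank evidence cell to None."""
    for row in csv.DictReader(handle):
        entry = {name: (row.get(name) or "").strip() for name in COLUMNS}
        # Assumption: a blank evidence cell means no evidence was extracted.
        entry["evidence"] = entry["evidence"] or None
        yield entry


# Build a small in-memory sample standing in for the real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerow(
    ["(hypothetical query)", "You should opt for a bot.", "They're easy to use.", ""]
)
writer.writerow(
    [
        "(hypothetical query)",
        "Girls should lift weights.",
        "Lifting weights reduces your risk of osteoporosis.",
        "When you consistently strength train, your bones increase in density, "
        "making them stronger than before.",
    ]
)
buffer.seek(0)

entries = list(read_entries(buffer))
```

For the real file, the same function could be passed a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded CSV.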
Examples
Below are representative entries from the dataset. Each entry shows a claim–reason pair, with evidence when available.
Claim
Girls should lift weights.
Reason
Lifting weights reduces your risk of osteoporosis.
Evidence
One of the fears women have about aging is developing osteoporosis, a condition where your bones weaken and become brittle. One amazing benefit of strength training is that it can help lower the risk of and even prevent osteoporosis. When you consistently strength train, your bones increase in density, making them stronger than before.
Claim
You should ditch refined oils.
Reason
Heating.
Evidence
Refined oils are heated and reheated in the extraction process, losing most of its valuable nutrients. Often these high temperatures result in the oils oxidising and going rancid even before you buy them! Oxidation also creates free radicals that can damage the cells of our bodies so it is best to avoid them.
Claim
Everyone should take a probiotic.
Reason
Probiotics are a newfound weapon that may assist in lowering elevated blood pressure and cholesterol levels that contribute to cardiovascular disease.
No evidence provided.
Claim
You should opt for a bot.
Reason
They're easy to use.
No evidence provided.
Collection
The root idea is that listicles (websites that present "N reasons for X") contain useful structured information. This information is not as direct as Mechanical Turk annotations, but it is relatively cheap, scalable, and generated without the framing of a crowdsourced task.
First, links to such pages are found by recursively querying the Bing API: seed queries are issued (e.g. "* reasons * is good"), and the results are then used to find further similar and branching results. Pages are then parsed to extract the claim and the reasons. About 70% of the reasons include supporting evidence; sometimes this evidence is not distinct from the reason itself.
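The recursive querying step can be pictured as a breadth-first expansion over queries. The sketch below is one possible reading, not the original collection code: it makes no real Bing API calls, the `search` callable is a hypothetical stand-in returning `(url, title)` pairs, and reusing result titles as follow-up queries is an assumption about how "branching" might work.

```python
from collections import deque


def expand_queries(seed_queries, search, max_queries=100):
    """Breadth-first query expansion.

    Returns a dict mapping each discovered URL to the query that first
    found it. `search` is any callable returning (url, title) pairs.
    """
    pending = deque(seed_queries)
    seen_queries = set(seed_queries)
    found = {}
    issued = 0
    while pending and issued < max_queries:
        query = pending.popleft()
        issued += 1
        for url, title in search(query):
            found.setdefault(url, query)
            # Assumption: a result title is reused as a follow-up query.
            follow_up = title.lower()
            if follow_up not in seen_queries:
                seen_queries.add(follow_up)
                pending.append(follow_up)
    return found


# Hypothetical in-memory index standing in for a web search API.
FAKE_INDEX = {
    "* reasons * is good": [
        ("https://example.com/a", "5 reasons tea is good"),
    ],
    "5 reasons tea is good": [
        ("https://example.com/a", "5 reasons tea is good"),
        ("https://example.com/b", "7 reasons coffee is good"),
    ],
}


def fake_search(query):
    return FAKE_INDEX.get(query, [])


pages = expand_queries(["* reasons * is good"], fake_search)
```

Recording the query that first found each URL mirrors the `query` column in the dataset, and the `max_queries` cap keeps the expansion bounded.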
Dataset Statistics
| Statistic | Count |
|---|---|
| Total entries | 341,155 |
| With evidence | 240,102 (70.4%) |
| Without evidence | 101,053 (29.6%) |
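The percentages follow directly from the counts. A quick arithmetic check, using only the figures in the table (the variable names are illustrative):

```python
# Counts taken from the statistics table.
total = 341_155
with_evidence = 240_102
without_evidence = 101_053

# The two subsets should partition the dataset.
counts_consistent = with_evidence + without_evidence == total

# Shares as percentages, rounded to one decimal place.
share_with = round(100 * with_evidence / total, 1)        # 70.4
share_without = round(100 * without_evidence / total, 1)  # 29.6
```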
For more details and to download the dataset, visit the GitHub repository.