On the Statistical Significance Testing for Natural Language Processing