Listening Tests

First, some great papers to read about the subject…

  • [PDF] M. Wester, C. Valentini-Botinhao, and G. E. Henter, “Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations,” in Proc. Conference of the International Speech Communication Association (Interspeech), 2015, pp. 3476-3480.
    [Bibtex]
    @inproceedings{WesterM2015enoughlistno,
    title = "Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations",
    publisher = "International Speech Communication Association",
    author = "Mirjam Wester and Cassia Valentini-Botinhao and {Gustav Eje} Henter",
    year = "2015",
    pages = "3476-3480",
    booktitle = "Proc. Conference of the International Speech Communication Association (Interspeech)",
    pdf = "http://www.cstr.ed.ac.uk/downloads/publications/2015/wester:listeners:IS2015.pdf"
    }
  • [PDF] S. Buchholz and J. Latorre, “Crowdsourcing Preference Tests, and How to Detect Cheating,” in Proc. Conference of the International Speech Communication Association (Interspeech), 2011, pp. 3053-3056.
    [Bibtex]
    @inproceedings{BuchholzS2011crowdsourcing,
    author = {Sabine Buchholz and Javier Latorre},
    title = {Crowdsourcing Preference Tests, and How to Detect Cheating},
    booktitle = {Proc. Conference of the International Speech Communication Association (Interspeech)},
    pages = {3053--3056},
    year = {2011},
    crossref = {DBLP:conf/interspeech/2011},
    url = {http://www.isca-speech.org/archive/interspeech_2011/i11_3053.html},
    publisher = {International Speech Communication Association},
    pdf = {https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/Speak11To12/IS110306.pdf}
    }

… and some personal experiences

Here are some opinions drawn from my own experience, along with advice for anyone interested in running listening tests. These are obviously personal remarks, not proven statements. I simply hope they help others avoid surprises in their results.

Be aware of

  • You only get one chance. Unlike a numerical, quantitative evaluation, you can NOT rerun the test again and again, fixing a bug here and there.
  • The number of answers you will receive drops as the test gets longer (roughly in inverse proportion to its length, in my experience). A 5-minute test easily gathers 40 answers, and thus usable results with reasonably narrow confidence intervals.
  • After 15 minutes, most listeners get tired and bored. Tired: they simply cannot perceive differences as accurately as at the beginning of the test [ref needed …]. Bored: they stop concentrating and finish the test as quickly as possible, while giving answers that look as if they had done the test properly.
  • People react very differently to each particular kind of artefact.
  • Experts are among the least objective listeners. Their opinions are very biased because they have often worked on a particular method. They can therefore be very sensitive to an artefact of this method, or unconsciously ignore it. In my experience, evaluations using team members only are not credible at all. IMHO, the best listeners are people who have trained ears but no deep knowledge of particular sound processing methods (e.g. musicians, sound engineers).
  • The variability of sounds (in the most general sense) is always greater than that of the sounds used in the listening test. Thus, it’s better to avoid general claims like: “Method X always provides better quality than method Y”. It makes more sense to write: “Based on the sentences used, method X provides better quality than method Y”, or “According to this test, …”
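
To get a feeling for how the number of answers affects the confidence interval of a preference test, here is a minimal Python sketch. It uses the Wilson score interval for a binomial proportion, which is my choice for illustration and only one of several possible intervals. The function name and the answer counts are hypothetical, not data from a real test.

```python
import math


def wilson_interval(preferred, n, z=1.96):
    """Wilson score interval for the proportion preferred / n.

    z = 1.96 corresponds to an approximately 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = preferred / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Hypothetical counts: about 65% of the answers prefer method X over method Y.
for preferred, n in [(7, 10), (13, 20), (26, 40), (52, 80)]:
    low, high = wilson_interval(preferred, n)
    print(f"n={n:3d}: preference for X in [{low:.2f}, {high:.2f}]")
```

If the interval still contains 0.5, the test does not separate the two methods at that confidence level, however clear the trend may look.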

Advice for designing the listening test

  • As said above, expert ears, your own included, are often biased. So be about 90% sure of the results you expect. The test should “lock in” a strong expectation. If the listening test is used to settle doubts, you will very likely face big, unpublishable surprises.
  • The description of what is to be assessed is extremely important, because the listener may not properly understand it. Sending the test to a small first group and asking for feedback is good practice to make sure the instructions are clear and precise. Keep in mind that we all work in very narrow domains, where the meaning of each technical term is as precise to us as it is unknown to others (change labs just once and you will realize how differently the terms are understood).
  • About announcements on mailing lists: keep them short. Your listeners often receive 100 emails a day; they will not stop at your announcement if there is a fairy tale to read before they can start the test.
  • About the instructions at the beginning of the test: as with the announcement, they should be the shortest text that is sufficient.

Verification/Consistency

  • A simple procedure to minimize mistakes:
    1. Run the test yourself to check that it is technically correct (it goes without saying, though …).
    2. Send the test to a small group to check if the trends follow your expectations.
    3. Send the test to bigger mailing lists to obtain non-overlapping confidence intervals.
  • Keep in mind that a sound file may not play properly in the listener’s web browser. Thus, allow the listener to discard a question.
  • Verify (as much as possible) that each listener answered the test properly. For example, when evaluating encoding/decoding methods, including the original file among the sounds to assess is a necessary practice for checking the consistency of the answers.
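
The consistency check with the original file can be sketched as follows in Python. This is only an illustration under my own assumptions: the data layout, the function name, the system names and the tolerance of one violation per listener are all hypothetical, and a suitable threshold depends on the test.

```python
def flag_inconsistent_listeners(answers, reference="original", max_violations=1):
    """Return the listeners who rated a processed sound above the original
    in more than max_violations questions.

    answers maps a listener name to a list of questions; each question is a
    dict mapping a system name to the score given by that listener.
    """
    flagged = []
    for listener, questions in answers.items():
        violations = sum(
            1
            for scores in questions
            if any(
                score > scores[reference]
                for system, score in scores.items()
                if system != reference
            )
        )
        if violations > max_violations:
            flagged.append(listener)
    return flagged


# Hypothetical scores on a 1-5 scale, three questions per listener.
answers = {
    "listener_a": [
        {"original": 5, "codec_1": 4, "codec_2": 3},
        {"original": 5, "codec_1": 5, "codec_2": 2},
        {"original": 4, "codec_1": 3, "codec_2": 3},
    ],
    "listener_b": [
        {"original": 2, "codec_1": 4, "codec_2": 5},
        {"original": 1, "codec_1": 3, "codec_2": 2},
        {"original": 3, "codec_1": 5, "codec_2": 1},
    ],
}
print(flag_inconsistent_listeners(answers))
```

A flagged listener is not necessarily cheating (the sound may simply not have played properly), so it is worth looking at the flagged answers before discarding them.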
