The workspace:

FAQ

[training video (soon)] [contract and payment info]

My goal with this project (if you want to know):

  1. Analyse connectors and their equivalents (all possible ways to express causality, condition and concession).Did you know that not all languages have the same inventory of connectors?
    e.g. Dutch may choose "omdat" or "want" for English 'because' depending on the circumstances.

  2. Detect anomalies in interpretation.Did you know that interpreters may drop connectors to save time, because they are rarely essential?

  3. Detect anomalies in OpenAI Whisper's transcription.Did you know that LLMs like Whisper may sometimes hallucinate whole paragraphs?


To achieve this, I need some help.

The Main Job

Summed up: You receive an automatic transcription. Every full word spoken in the main speech(es) should be included, nothing else.

  1. Identify the main speech(es), delete everything else, enter missing pieces.
  2. Ignore interpreters changing, it's the speech that counts. Delete interruptions, keep only the speech.
  3. If there are two complete long speeches, keep both. Otherwise keep only the longest / most complete.
  4. Delete Whisper's hallucinations and loops (text that is not in the recording).
  5. Keep interpreter mistakes and errors (see next guideline).

Initial < > and Interpreter Errors

Summed up: The first < > contains information on the speaker. If the speaker violates any rules, then annotate it behind the violating word in < >.

  1. Whisper can not use < >, so we can use them in two contexts:
  2. At the beginning of the text window to mark (within the same < >)
    • <speaker/interpreter> : this tells me if an interpreter is speaking or the original speaker. Please always do this.
    • <non_native> : this means the speaker does not sound like a native speaker. Only put this down if you are sure.
    • <wrong_language> : this means the recording is not in your working language. You do not have to do anything else here.
    • <voting_session> : this means the recording is mainly the president counting votes. You do not have to do anything else here.
    • You may also use the < > at the start to mark something you find really exceptional, but please use this very rarely. For example <speaker very thick dialect>
  3. Directly after an obvious grammatical error of the speaker/interpreter (give them the benefit of the doubt):
        In the example "questa" does not fit the gender of the following noun, <questo> is correct. Two lines down "asprendere" does not exist at all
        in standard Italian, while "prendere" does not seem a viable choice. If it is unclear from the recording what the right choice is <?> can be used.
  4. Please don't mark unusual idioms this way, just single words that are obviously wrong.
  5. You should mark words that should not be there with <->. You should mark every single word of a "false start" this way.
  6. If there is a word missing and you are 100% certain where it should be, then you can write e.g. <+the>, otherwise mark sentences with missing elements with a <+> right after a punctuation mark

Orthography and Disfluencies

  1. Correct , . ? ¿ only if you are certain. This is often hard to know with interpreters.
  2. Do not set other punctuation yourself. You do not have to correct Whisper if it does, unless you think it's dead wrong.
  3. Correct Whisper's spelling and capitalization, but keep interpreter errors.
        If whisper misses an /n/: just add it, no marking needed.
        If the interpreter misses an /n/: correct the whole word in <>.
  4. If you happen to know the proper spelling of names, then correct them, but you do not have to look these up.
  5. You do not have to enter anything below the word level. No "ehm", "aah", "hmm", "ääh", etc.
  6. You do not have to mark sub-word repeti-repetitions or stu-stu-stuttering.
  7. Please add full word repetitions, even if Whisper tends to skip them.

Thank you for your time.