Interactive Evaluation of Dialog

This website describes the DSTC9 Track on Interactive Evaluation of Dialog where participants are challenged to develop dialog systems that can converse effectively in interactive environments with real users.


The ultimate goal of dialog research is the creation of systems that can be used effectively in interactive settings by real users. Despite this, the majority of research is done on static datasets. For example, response generation is typically evaluated by producing a response for a fixed dialog context. This track is intended to move research beyond datasets and to evaluate models in interactive environments with real users. Interactive evaluation allows several valuable properties of dialog to be measured, such as consistency, adaptiveness and user-centric development. Additionally, an interactive environment allows for learning after deployment: a deployed model can be improved using real user data and both implicit and explicit feedback signals.

Through this task, we hope to take a first step in expanding dialog research beyond datasets. DialPort, a platform for interactive assessment with real users, will be used for evaluation throughout the task. We propose two sub-tasks that take participants from a fully static assessment to an interactive one. The first sub-task involves static evaluation on a dataset, while the second challenges participants to extend their models for interactive evaluation on DialPort.


The first sub-task (Static Evaluation) is currently active! The second sub-task (Interactive Evaluation) will be active on July 1st! Join our Slack, download the data/baselines, submit to the first sub-task or check out our leaderboards now!


Sub-Task 1: Static Evaluation of Dialog

Objective: build better models for dialog response generation

Dates: June 15th - Oct 5th


For both the static and interactive sub-tasks, we will continuously run both automatic and human evaluation throughout the duration of DSTC9. All results will be shown on the LEADERBOARD.

Sub-Task 2: Interactive Evaluation of Dialog

Objective: build/adapt conversation models to work effectively in an interactive setting

Dates: July 1st - Oct 5th

  • You may submit two types of systems:
    • A system trained only on the Amazon Topical-Chat Corpus training data
    • A system using any publicly available data or pre-trained models
  • When submitting, please indicate whether you used additional data/pre-trained models, as there will be separate leaderboards.
  • You must provide us with a public API endpoint for your system. This will allow us to deploy your system on DialPort.
    • You can find example code HERE
    • Please reach out to us on Slack if you are having difficulty with this.
  • Please send us your public API endpoint on Slack, and we will deploy your system to our users.
    • As long as your system is running, it will be available for users to interact with. If your API goes down, we will automatically stop presenting your system to users. Please reach out to us once you've brought it back up.
    • You may submit as many times as you’d like. We will replace the version of your system that is running, unless you request otherwise (with sufficient justification -- e.g., significant architectural differences).
  • We guarantee that we will collect 30 interactive conversations with real users for each submission (for at most 2 new submissions per 7-day period per participating team). Additional conversations may be collected depending on the total number of submissions.
  • During the final testing period, we will gather 60+ interactive conversations for the final system.
  • We will do post-hoc assessment of all the conversations
    • Using Amazon Mechanical Turk: questionnaire
    • Using automatic metrics: FED (to be released on arXiv)
  • For each collected conversation, you will be provided with:
    • Conversation logs
    • User feedback logs
    • User rating logs
    • Post-hoc quality annotation
    • FED evaluation
  • The leaderboard will show (1) the user rating, (2) the post-hoc human rating and (3) the FED scores
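
The public API endpoint requirement above can be illustrated with a minimal sketch. The request and response format expected by DialPort is not described on this page, so the JSON field names `history` and `response`, the port, and the `generate_response` placeholder are all assumptions made for illustration; the organizers' example code remains the reference for the real interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def generate_response(history):
    """Placeholder for a dialog model (hypothetical): reply given the turns so far."""
    if not history:
        return "Hello! What would you like to talk about?"
    return "You said: " + history[-1]


def handle_payload(raw_body):
    """Turn a raw JSON request body into (HTTP status, JSON reply body as bytes)."""
    try:
        payload = json.loads(raw_body.decode("utf-8"))
        history = payload["history"]  # assumed field name, not the DialPort schema
        if not isinstance(history, list) or not all(isinstance(t, str) for t in history):
            raise ValueError("history must be a list of strings")
    except (ValueError, KeyError, TypeError, UnicodeDecodeError) as err:
        return 400, json.dumps({"error": str(err)}).encode("utf-8")
    reply = {"response": generate_response(history)}  # assumed field name
    return 200, json.dumps(reply).encode("utf-8")


class ChatHandler(BaseHTTPRequestHandler):
    """Answers POST requests that carry the dialog history as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the server on all interfaces; the port is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), ChatHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Keeping the model call behind `generate_response` means a newly submitted version of a system can replace the running one without changing the endpoint itself.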


This challenge was organized as a collaboration between Carnegie Mellon University and the University of Southern California. If you have any questions, please contact us on our Slack or email us. The full list of organizers:

  • Shikib Mehri (CMU)
  • Yulan Feng (CMU)
  • Carla Gordon (USC)
  • Seyed Hossein Alavi (USC)
  • David Traum (USC)
  • Maxine Eskenazi (CMU)