My field is unusual in that the primary venues for publishing academic research are conferences, not journals. Of these venues, CHI (the ACM Conference on Human-Computer Interaction) is generally regarded as the most desirable in terms of its large audience and cachet for building one's CV for job searches and tenure cases. A typical cycle is to submit a paper in, say, September, receive reviews a month or so later, write a short rebuttal, then receive a final decision a month or so after that. As you can probably tell from the conference's title, the range of potential topics is huge. Submission volume increases each year, and the bar for acceptance gets correspondingly higher.
In recent years, there has been an argument, from some very well-known and respected members of our field, that CHI tends to accept some types of papers over others. Specifically, these researchers claim a bias against "systems" papers, where some large and sophisticated interactive system has been developed, perhaps over several years, and possibly evaluated with a set of target users. Instead, the conference reviewers and program committee have favored more incremental, controlled experiments with airtight, measurable results but perhaps less real-world impact. As a result, measures have been taken in the review process to emphasize the strength of a contribution and even to segment the program committee to focus on particular sub-disciplines of CHI work.
I don't really want to voice an opinion on whether this bias exists, whether a separate conference should be created for systems work, or how to fix the review system (which, I feel, has improved by leaps and bounds in the time since I began attending the conference). Instead, I will discuss two case studies of my own work and comment on what made them good or bad CHI papers. Sounds fun, right? Let's get started.
The type of systems work I typically submit to CHI is usually a fairly sophisticated, standalone application that has been evaluated in the field with real people. The systems typically take a team of people several months or years to get ready for field evaluations that can last several months. Let's start with my PhD project, a predictive calendar called Augur. This project involved a one-semester evaluation with almost 30 students, faculty, and staff, complete with data logging, interviews, surveys, and post-study analysis. While a paper on the system itself was accepted to a more specialized conference called UIST, several versions of a paper on the evaluation itself were rejected from CHI.
Conversely, more recent work with my colleagues at Motorola on the SocialTV system has resulted in several CHI publications. Like Augur, these studies involved deployment of a sophisticated system (in this case, an interactive TV application) to real users over a period of one or two months. Data collection also involved mixed methods, including logging, diaries, and interviews.
So what made the difference? Obviously, I can't point to the "CHI is biased against systems papers" argument because SocialTV was quite successful there (though, from what my insiders on the program committee tell me, not without quite a bit of conflict among the committee members). Here are the reasons I think are behind success/failure of these papers, and these kinds of papers in general:
- Rigor. The SocialTV project was subject to a detailed and frequently-revised study plan that covered all the research questions, data collection and analysis methods, rationale, and target participants. Each step of the study was carried out in accordance with this plan, and as a result we had the right questions to ask, the right data to address them, and the right analysis techniques to arrive at those answers. Once the time came to put the results on paper, we were able to clearly describe our methods in detail. I cannot stress enough how important it is to demonstrate rigor in these kinds of studies. A cool system will only get you so far.
- Research experience. The contrast between Augur and SocialTV is also a contrast between a graduate student's first substantial research program and the work of an experienced team with several such projects under their belts. The disparity in rigor between these two projects is largely a function of experience. In addition, an experienced group has a far larger "toolbox" of methods to draw from, as well as the skill to convey results in a manner that demonstrates their impact and supports an overall conclusion or design recommendation throughout the paper. This is a daunting gap for students to close, so the support of advisors, postdocs, and senior students becomes critical. Internships also help immensely by throwing students into teams with experienced researchers.
- Communication skills. With experience comes an increasingly intuitive knowledge of what constitutes a good CHI paper. It seems obvious, but it takes time to learn how to properly spell out the research questions and contributions of a paper, completely describe methods used and why, and relate results back to these contributions. Some professors would say that becoming a reviewer is a great way to learn this skill. Personally, I don't think that is enough. People can go through their whole careers as bad reviewers because they don't have anyone giving them feedback on the quality of their reviews. For my friends in academia, I highly recommend instituting some sort of mock PC meeting, review exchange, or other mechanism for providing feedback to students (and even faculty) on what constitutes a good review. In this way, people can then learn what goes into a good paper.
- Context. One gripe about reviews for systems papers is that they fault the number of users as too small, or the diversity of the user population as too narrow. Systems paper authors quite rightly claim that this is an unfair criterion for systems that are often expensive and labor-intensive to deploy, even for a small set of users. SocialTV was one such system. Each study involved roughly five households, because we only had enough equipment to handle that many households at a time. These studies typically focused on a particular demographic, such as young suburban US families or middle-aged couples with teenage children. What I think helped our cause at CHI was that these studies were couched as part of a larger research program involving several phases and use cases. As a result, the greater body of work on this system constituted a somewhat more generalizable contribution, with each phase building on the previous one and confirming or contrasting its findings.
Augur, by comparison, had no such context, and related work in this area focused on a different domain (corporate office workers). Without this context, I think the meaning of the contributions was less clear, and the paper suffered as a result.
- Real results. While Augur constituted an interesting technical contribution, this was not regarded as the right criterion by which to judge it for CHI. Conversely, SocialTV had a less novel technical contribution (other social television systems existed at the time), but our qualitative data demonstrated a concrete impact on those using it that went beyond the side effects of merely having the technology in their homes.
Augur had fewer effects on its users that went beyond just having access to a basic shared calendar. Part of this outcome is attributable to factors such as study design, site choice, and implementation (chalk it up to lack of experience). Another reason, a little harder to swallow, could be that the system just wasn't as desirable. While its design was grounded in real practices and supported by a community of work in this area, a system can be a hard sell if you have no compelling results to demonstrate its value.
So while I can't comment on the existence of a bias against systems papers (or qualitative research papers for that matter), I do think there are criteria that have to be met by such papers before turning to bias as an excuse.
That said, I do see some issues that may need to be navigated before one can even start addressing the points above:
- Resources. The time, manpower, and budget that went into SocialTV may be beyond the reach of many graduate students. Experience aside, students may need to organize into teams that leverage diverse strengths (implementation, study design, qual/quant analysis, etc.) in order to adequately address all of these requirements. The Alice project is a great example of how this kind of team effort can result in a stream of publications for a group of graduate students.
- "The System". By this I mean the hiring processes of institutions seeking HCI researchers. It pains me when I hear colleagues talk up a faculty candidate or upcoming interviewee because he or she has X CHI papers or Y UIST papers. It's absolutely true that systems papers take longer to generate and thus reduce the number of pubs a student will have on his or her CV upon graduation. It is up to hiring committees to emphasize impact over quantity. In addition, value must be placed on the effort required to organize a team of people to execute a successful large-scale research project, which, incidentally, is exactly what a candidate will have to do as new faculty.
- Luck. Ok, I know that this is not up to the submitter, but it's still a part of the review process, although less so than in past years. Bad reviewers and lazy PC members can still sink a paper unfairly, but mechanisms such as the rebuttal phase and hand-picked reviewers have made some progress on this front. Plus, papers can always be resubmitted the following year.
