ProPublica’s Surgeon Scorecard: Call for Peer Review

An Open Letter to Healthcare Outcomes Researchers, Journalists and Data Scientists

Thank you for taking the time to read this letter. I’d like to ask you to review some important new information.

Last week ProPublica published a major story and online database they’ve termed Surgeon Scorecard. ProPublica has promoted it as a tool for individuals to learn more about their surgeons before an operation. After looking at the Surgeon Scorecard data and methodology carefully, I’m left with serious reservations about its quality and applicability. I am requesting your help with an expert peer review.

In the project, ProPublica evaluated eight common elective surgical procedures using previously unreleased data from Medicare. Their source of information was administrative data from billing submissions. Individual surgeons were rated based on readmissions and mortality. No chart-level clinical data was analyzed for the dataset. Each surgeon was assigned a visual ranking based on performance: a low (green), medium (yellow), or high (red) “adjusted rate of complications,” with confidence intervals superimposed. My own work as a cardiac electrophysiologist (i.e. heart rhythm, pacemakers, and defibrillators) is not represented in this data. If you haven’t seen the database, take a look – enter a hospital or doctor you know and note the results.

Prior to the release of the database, ProPublica promoted the project with a video:

It’s worth a watch, as it may reflect the tone and purpose of their mission. There have been some negative reactions to this piece, and a lead reporter for the project has acknowledged this criticism.

This was a big undertaking, as you’ll see when you review it. These physician scorecards could have a major impact on the medical community, particularly if ProPublica expands its investigations beyond their currently narrow scope. For a journalist-generated project, there is some pretty heavy science involved, particularly when it comes to the methodology of the database. The background was published in a separate white paper with appendices. They indicate that they consulted with experts, many unnamed on background, to analyze and format their data.

Upon release, there has been vigorous debate about the methodology of the project, particularly on Twitter. If you search the streams of reporters @marshall_allen and @olgapierce and the hashtag #SurgeonScorecard, you’ll find many of the arguments and their responses. Vocal critics include @JohnTuckerPhD, @skepticalscalpel, @justinmclachlan and @daviesbj. Numerous blog posts have outlined these criticisms, and I’ll link to several that are worth reading:

ProPublica’s #SurgeonScorecard Should be Retracted from former journalist Justin McLachlan

ProPublica’s Surgeon Score Card: Clickbait? Or Serious Data? from urologist Benjamin Davies MD

The Problem with ProPublica’s Surgeon Scorecards from transplant surgeon Ewen Harrison

The Surgeon Scorecard is Here! (It’s Just Not Meaningful) from cardiologist Rocky Bilhartz MD MBA

After Transparency: Morbidity Hunter MD joins Cherry Picker MD from radiologist Saurabh Jha MD

The Surgeon Scorecard: Much Ado About Literally Nothing from general surgeon Jeffrey Parks MD

Here are a few high-impact tweets addressing the statistical methods:

I realize there’s a lot here to digest. Let me take a moment to summarize some major points.

– Responsible doctors agree that increasing transparency is appropriate. One of the major MD blog critics above actually wrote a book on healthcare transparency. We do not object to responsible, accurate reporting of physician performance. We recognize that it is very difficult to assess the quality of a doctor and this needs to be fixed. I have promoted my own idea of direct physician supervision. The folks criticizing this project value patient safety, and are not afraid to criticize doctors when appropriate. We are all seeking the same goals.

– Surgeon Scorecard looks at elective, low risk inpatient procedures and uses purely administrative data to score the surgeons. Only mortality and readmissions are measured. No patient level chart data is reviewed. Actual peri-operative complications and procedural success are not systematically measured. Many clinicians, including myself, have noted inaccuracies in administrative data (which is compiled without MD oversight). I think most clinicians would agree that without direct review of clinical data, it is difficult to accurately judge another doctor’s performance. To their credit, the reporters openly acknowledge these limitations.

– ProPublica applied a clinical risk adjustment to the data. However, this co-morbidity “Health Score” did not independently predict outcomes (Item 2.5 on page 11 of their methods paper). Their model did not show an increase in deaths or readmissions among the patients determined to be sickest pre-op. This makes me question the validity of their risk adjustment. If pre-op risk is not accurately assessed, the doctors who take on the most difficult cases will be unfairly penalized. Dr. Jha’s parable of Cherry Picker MD vs. Morbidity Hunter MD (linked above) speaks directly to this issue. OB/Gynecologist Dr. Jen Gunter also covers this concern well on her blog. If doctors are reluctant to take on difficult cases for fear of scorecards, needy patients could go undertreated.

– Individual surgeon data is presented with visual red/yellow/green rankings and confidence intervals. In ProPublica’s words, “A high adjusted complication rate indicates that a surgeon’s patients suffered harm more often than his or her peers.” Neither this explanatory document nor the scorecard app discusses the importance of confidence intervals in data reporting (the question is addressed only in a separate FAQ document). A surgeon may have his “dot” in the red but have confidence intervals suggesting he may actually be a high performer. I and others wonder whether consumers will be able to interpret this complex data without a more up-front discussion by the reporters. There is no visual indication of non-significance for surgeons whose CIs cross into the low- or medium-risk bands. In a Twitter exchange, journalist Reed Miller likened this to publishing a baseball batting-average leaderboard without a minimum number of at-bats. Scientist John Alan Tucker PhD covers this limitation well in his tweets.

– Procedure numbers for many of the surgeons are low, making the risk analysis difficult to interpret. Still, these doctors are “graded.” In at least one case, a doctor with zero complications was ranked in the yellow zone (as criticized by cardiology outcomes researcher Mintu Turakhia MD in his tweet cited earlier).
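To see why low case counts make these grades so hard to interpret, here is a minimal sketch of my own (not ProPublica’s hierarchical model; the case counts are hypothetical) computing 95% Wilson score intervals for low-volume surgeons:

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a rate of k complications in n cases."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, center - half), min(1.0, center + half)

# A hypothetical surgeon with zero complications in 20 Medicare cases:
lo, hi = wilson_ci(0, 20)
print(f"0/20 cases: plausible true rate 0%-{hi:.1%}")

# A hypothetical surgeon with 2 complications in 40 cases:
lo, hi = wilson_ci(2, 40)
print(f"2/40 cases: plausible true rate {lo:.1%}-{hi:.1%}")
```

Even a spotless record of 0 complications in 20 cases is statistically consistent with a true rate of roughly 16%, and 2 in 40 spans everything from near-zero to quite high. With samples this small, the data barely constrain the estimate, and a model that pulls sparse estimates toward the average can leave a zero-complication surgeon in the yellow zone.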

– Many of the outcomes tracked are entirely out of the surgeon’s control, and may better reflect non-surgeon factors such as patient post-op adherence and emergency department staff actions.

– The statistical methods are complex, and there was no independent peer review. ProPublica acknowledges the work of doctors and scientists, many unnamed, in the review of their methodology, but editorial control was entirely in ProPublica’s hands.

– There is no prospective validation that these scorecards predict surgeon performance.

– There does not appear to be a mechanism for physician verification of his or her individual report.

– ProPublica’s promotional video is difficult to describe as anything less than sensational and fear-mongering. It is far out of place with the otherwise professional tone of this project. If you haven’t watched it yet, please do so now and tell me I’m wrong.

To their credit, ProPublica has bravely taken on a critically important mission that was certain to ruffle some feathers. They have done an enormous amount of work to create this database, and their presentation is beautiful. I have been a vocal fan of ProPublica’s work. I have also been both a quoted and background source for their reporters (although not on this project).

Some have argued that it was important to get this data out for public review, despite its limitations. I respectfully disagree. I subscribe to the belief that bad data is worse than no data. Certainly the scientific literature is replete with examples that bear this out.

So is Surgeon Scorecard bad data? Strong words, but I say yes. This analysis was a great idea, but it fails to deliver on its goals. The data and methodology both have significant flaws. I say that from the perspective of a working clinician and clinical researcher with over 20 years’ experience, but I’d like to see a higher level of review. This project is as much science as it is journalism. Surgeon Scorecard should be peer reviewed and critically discussed as would any scientific outcomes study. As I suggested to ProPublica, we need to kick the tires.

This is why I’m calling on experts in healthcare outcomes, data science and journalism to review Surgeon Scorecard on methodological grounds to determine its validity, interpretability and appropriate application. This needs to be evaluated thoroughly, and at the highest level of expertise. I hope you will be willing to take a close look and let us know what you think. ProPublica has invited expert commentary by email. Please submit your comments there, and leave me a copy in the comments section of this post.

Thank you,


Edward J. Schloss MD
Medical Director, Cardiac Electrophysiology
The Christ Hospital
Cincinnati, OH