Three decades of finance, economics, and legal studies in corporate governance have been built substantially on data sets with nearly unknown provenance. A new paper sets out to correct this fatal flaw of contemporary corporate governance research by debuting a brand new resource: the Cleaning Corporate Governance database.


Data-driven research has become a foundational underpinning of many policies, regulations, precedents, and private-sector practices. From qualitative interviews to big data, today, more than ever, we rely on data to guide us in our research, policy making, and day-to-day living. Yet data-driven research is only as good as its component parts.

The field of empirical corporate governance is a poster child of the growing use of data as a key pillar of informing and shaping research. Over the last thirty years, empirical corporate governance has risen in prominence by quantifying the traditionally hard-to-quantify: text from state laws, federal regulations, and firm-level governance documents. This data is then used to measure the quality of governance. Some of the most influential studies have shown that countries with strong investor protections are more likely to have higher firm valuations, that more shareholder-friendly firms outperformed more management-friendly ones, and that there are numerous other significant real-world predicted effects of governance on firm performance.

Despite their obvious allure, these groundbreaking studies have also long had an unappreciated Achilles’ heel: three decades of finance, economics, and legal studies in corporate governance have been built substantially on data sets with nearly unknown provenance.

In our recent article, “Cleaning Corporate Governance,” forthcoming in the University of Pennsylvania Law Review, we set out to correct this fatal flaw of contemporary corporate governance research by debuting a brand new resource, the CCG database, that allows researchers to investigate, for the first time, the fidelity of foundational corporate governance findings. The database is anchored by a first-of-its-kind, open-source textual corpus representing nearly thirty years of historical charters for companies listed in the S&P 1500, a total of approximately 3,000 companies over time. Charters are the foundational organizational document of a corporation, setting its basic equity structure, its purpose, and the basic rights and responsibilities of shareholders, directors, and managers. We hand-label a significant subset of this firm-level data regarding these foundational attributes and augment it further with labeled state-level panel data that tracks sixteen statutory governance rules across 50 states (and the District of Columbia).

The CCG database unsettles some of the most beatified results in empirical corporate governance. A core example is the much-cited “G-index” first explored in the classic paper “Corporate Governance and Equity Prices” by Paul A. Gompers, Joy L. Ishii, and Andrew Metrick (GIM). Using data from a third-party provider, the G-index aggregated 24 binary corporate governance variables into a single additive index to classify firms along a spectrum from more “dictatorial” (or management-centered) to “democratic” (or shareholder-centered).
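
To make the construction concrete, here is a minimal sketch of an additive index over binary provision flags. It assumes each flag marks a provision that limits shareholder rights, so a higher count sits toward the “dictatorial” end of the spectrum. The provision names, the sample firm, and the cutoffs passed to `classify` are hypothetical placeholders chosen for illustration; they are not drawn from GIM's data or from the CCG database.

```python
from typing import Mapping


def additive_index(provisions: Mapping[str, bool]) -> int:
    """Return the number of provisions flagged as present (one point each)."""
    return sum(1 for present in provisions.values() if present)


def classify(score: int, low_cutoff: int, high_cutoff: int) -> str:
    """Bucket a score; both cutoffs are caller-supplied assumptions."""
    if score <= low_cutoff:
        return "democratic"
    if score >= high_cutoff:
        return "dictatorial"
    return "middle"


# Hypothetical firm; only three flags are shown for brevity.
example_firm = {
    "classified_board": True,
    "poison_pill": True,
    "supermajority_vote": False,
}
example_score = additive_index(example_firm)
```

Because the index is a plain sum, a single miscoded flag shifts a firm's score by one point, which is why errors in the underlying binary variables can carry through to the resulting classification.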

Deploying that index for years in the late 1990s, GIM demonstrated that a strategy of systematically investing in democratic issuers (while shorting the dictatorial ones) would have delivered an astounding 9 percent excess return on a risk-adjusted basis. Our newly constructed database, however, reveals that the data underlying this finding contains significant inaccuracies. For example, we found that the G-index is inaccurate over 82 percent of the time, and that the rate of inaccuracy grows worse in the 2000s—even as that database (and results from it) gained increasing attention among academics, regulators, and practitioners. We use the CCG to implement a conservative correction to the underlying G-index, and we show that the relationship between democratic governance and arbitrage returns diminishes significantly with the corrected data.
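
One way to picture an inaccuracy rate of this kind is as the share of firm-year observations where a legacy index value disagrees with a re-coded one. The sketch below assumes that definition, which may differ from the measure used in the paper. The function name, the firm identifiers, and all values are made up for illustration.

```python
from typing import Dict, Tuple

FirmYear = Tuple[str, int]  # (hypothetical firm identifier, fiscal year)


def mismatch_rate(legacy: Dict[FirmYear, int], corrected: Dict[FirmYear, int]) -> float:
    """Share of firm-years, present in both sources, whose index values differ."""
    shared = legacy.keys() & corrected.keys()
    if not shared:
        raise ValueError("no overlapping firm-year observations to compare")
    wrong = sum(1 for key in shared if legacy[key] != corrected[key])
    return wrong / len(shared)


# Made-up values: two of the three overlapping firm-years disagree.
legacy_scores = {("AAA", 1998): 9, ("AAA", 1999): 9, ("BBB", 1998): 6, ("CCC", 1998): 11}
corrected_scores = {("AAA", 1998): 9, ("AAA", 1999): 10, ("BBB", 1998): 7}
rate = mismatch_rate(legacy_scores, corrected_scores)
```

Restricting the comparison to firm-years found in both sources keeps missing observations from being counted as errors; whether that is the right choice depends on how coverage gaps should be treated.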

The CCG database also presents exciting opportunities for future constructive research. Its underlying textual corpus, in particular, is fertile ground for machine learning and computational text analysis methodologies. We offer a taste of such approaches by deploying some of those burgeoning methods in our paper to show, among other results, that non-Delaware charters have become longer and less readable over time and that the similarity of charters for firms in certain industries has increased over time.
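
As a rough illustration of what measuring charter length and similarity can look like, the sketch below counts words and computes cosine similarity over bag-of-words counts using only the Python standard library. This is a generic technique swapped in for illustration, not necessarily the method used in the paper, and the two charter snippets are invented.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic runs as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_count(text: str) -> int:
    """Crude length measure: number of alphabetic tokens."""
    return len(tokenize(text))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Invented snippets standing in for two charters.
charter_a = "The board of directors shall be divided into three classes."
charter_b = "The board of directors shall consist of a single class."
similarity = cosine_similarity(charter_a, charter_b)
```

Tracking `word_count` by year and averaging `cosine_similarity` across firm pairs within an industry would give simple versions of the length and similarity trends described above; readability would need a separate measure that this sketch does not attempt.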

The availability of correct, open-source data will be an invaluable resource for researchers who investigate deeper governance questions, such as whether state law matters, how governance evolves during periods of upheaval (such as the Financial Crisis), and whether common ownership of firms by large passive investors leads to anti-competitive behavior. Our database is also unique in that it allows scholars to use the underlying data to build new measures of stakeholder governance, which sets it apart from pre-existing shareholder-focused databases.

Perhaps the most important contribution of the CCG data is that its underlying corpus and all of our labeled data will become free and open-access. In sharing our data, we hope to right two important wrongs.

First, we help solve the problem of access. While the data we collected are theoretically available from states’ secretaries of state and the Securities and Exchange Commission, gathering data from either source is no walk in the park. We estimate that harvesting the Delaware firms in our sample, which constitute about 58 percent of the total dataset, from the Delaware Secretary of State’s office would cost half a million dollars in fees alone. Searching through the SEC’s online EDGAR database is theoretically free, but frustrating: it is impossible to search for only charters and bylaws, so the process of finding these documents is an exercise in excavation. Commercial databases like Westlaw and Bloomberg are slightly better (even though they also reflect EDGAR’s disorganization), but harvesting data from those sources comes with its own obstacles.

Second, we surmise that a key reason for two decades’ worth of error propagation in existing data is that lawyers have taken a back seat in assembling and utilizing quantitative data, fearing that we are unqualified for empirical work. In our absence, non-lawyers did the best they could to make judgments that required distilling overlapping state law, stock exchange listing rules, federal securities laws, and firm-level governance documents into binary variables. These complicated legal questions require legal training. Through a careful process (and with many legally trained research assistants), we have been able to create a database that, we believe, provides an accurate account of the governance provisions found in companies’ charters, as well as their interaction with the regulatory apparatus governing them. One early reviewer of this paper described the existing governance data as “mystery meat,” contrasting it with our CCG data as “the organic, meet-the-grower stuff from the farmer’s market.” We believe that’s true, and in the coming years, we invite others to help us cultivate this brood of governance data in the open range it was meant for.