Jed Gordon, Partner & Global Head, IP Transactions Practice, McDermott Will & Emery
With the increasing dependence on data as a key component of products, companies have begun to rely on third party data sources to accelerate their product development efforts. Unfortunately, appreciation of the legal framework governing the use of third party data sources has not kept pace. The environment is reminiscent of the early days of open source software, when many software developers saw open source software as a cost-free shortcut to product development, not realizing they could be introducing viral license terms into their code base. Discovery of “copyleft” open source code in commercial products has scuttled numerous M&A transactions and cost the software industry substantial sums in remediation efforts. Data-oriented companies should learn from the industry’s open source software experience to avoid similar pitfalls.
Integrating third party data into your product without a clear understanding of the terms under which the data was provided can expose a company to risks similar to those posed by incorporating open source software. Proprietary data sets could become subject to unwanted disclosure obligations. Sale of a product could result in claims of copyright infringement or breach of a license agreement. Moreover, once third party data is incorporated into a product, extracting that data or undoing its use could entail substantial costs and delay product launches.
As a starting point, in-house counsel and corporate information officers should recognize that databases are generally protected by intellectual property rights. These rights may take the form of copyright protection in some jurisdictions, while other jurisdictions offer sui generis database protections. Even if a database creator makes their database publicly available for download or access, one’s rights to use, modify, copy, and distribute those data sets to third parties are limited to the rights and restrictions included in the license associated with the initial access to the data. One common and problematic limitation on the use of data made available by academic institutions is a prohibition on using the data for commercial purposes. While this clearly would prohibit resale of the data as a product, it also would arguably prohibit incorporating the data source into a commercial product, whether directly or indirectly, such as by training a machine learning model on it. Other common limitations are prohibitions on making derivative works of a product or requirements to include attribution to the data set providers in any redistribution. Unless these license requirements are tracked, inadvertent violations can easily result.
Accordingly, a robust data source tracking process is recommended for any product that is built upon third party data sets. The process should track limitations on how the data can be used, what attributions may be required, and whether any fees or royalties may be due to a data provider in connection with its use. The process should also track a product’s compliance with these rights and obligations. As with many open source software components that are freely licensed under “copyleft” license terms, data sets that appear to be available only on problematic license terms can often also be obtained under commercial licenses that avoid these risks. By rigorously vetting third party data source license terms and product usage of such data, publicly available data sets can be a boon to data-based products and services in much the same way open source has benefited the broader software industry.
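For teams that want to make such a tracking process concrete, the following is a minimal sketch in Python of what a data source register and compliance check might look like. The schema, field names, and the `find_issues` function are hypothetical illustrations, not a standard or an established tool, and the yes/no flags are assumed to be filled in by counsel after reviewing the actual license text. The sketch records license terms; it does not interpret them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSourceRecord:
    """License terms recorded for one third party data set (hypothetical schema)."""
    name: str
    license_name: str
    commercial_use_allowed: bool
    derivatives_allowed: bool
    attribution_required: bool
    fees_or_royalties_due: bool


@dataclass
class ProductUsage:
    """How one product uses a given data set (hypothetical schema)."""
    product: str
    commercial: bool
    creates_derivative: bool
    redistributes: bool
    attribution_included: bool
    fees_paid: bool


def find_issues(source: DataSourceRecord, usage: ProductUsage) -> List[str]:
    """Compare recorded license terms with recorded usage and list mismatches."""
    issues = []
    if usage.commercial and not source.commercial_use_allowed:
        issues.append("commercial use not permitted")
    if usage.creates_derivative and not source.derivatives_allowed:
        issues.append("derivative works not permitted")
    if (source.attribution_required and usage.redistributes
            and not usage.attribution_included):
        issues.append("attribution missing")
    if source.fees_or_royalties_due and not usage.fees_paid:
        issues.append("fees or royalties outstanding")
    return issues


# Illustrative entries only; the names below are made up for this example.
source = DataSourceRecord(
    name="example-academic-corpus",
    license_name="non-commercial research license",
    commercial_use_allowed=False,
    derivatives_allowed=True,
    attribution_required=True,
    fees_or_royalties_due=False,
)
usage = ProductUsage(
    product="ExampleProduct",
    commercial=True,
    creates_derivative=False,
    redistributes=True,
    attribution_included=False,
    fees_paid=True,
)
issues = find_issues(source, usage)
for issue in issues:
    print(f"{usage.product} / {source.name}: {issue}")
```

In this sketch, each flagged mismatch would be a prompt for legal review or for seeking a commercial license, not a final determination of infringement or breach.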