
Hey Facebook… don’t tell her we told you she’s pregnant!

16 February 2021

A few weeks ago, the United States Federal Trade Commission (FTC) announced a proposed settlement with Flo Health Inc over its disclosure of sensitive health data to Facebook, Google, and other web advertising entities. 

In 2019, the Wall Street Journal revealed that the menstrual-cycle-tracking app Flo, used by millions of women worldwide, was sharing its users’ personal data – breaking its public promise to keep such data private. The day after the news broke, Flo stopped sharing these data, which struck many as a tacit admission of guilt.

A complaint was lodged with the FTC soon after, alleging seven counts of misrepresentation and one count of violating US law. Under the newly proposed settlement, Flo would have to notify its users that their data had been shared and cease engaging in deceptive data sharing. Interestingly, Flo doesn’t have to admit any wrongdoing. The FTC’s decision is now posted in the Federal Register for a public comment period ending on March 1, 2021, though lawsuits have already been filed.

[Image: Photo by ev on Unsplash]

These kinds of data breaches – both unintentional and deliberate on the part of the companies collecting our data – have become distressingly commonplace in the news. Many words have been written about clandestine data collection, access, security, and the effects of – and proposed consequences for – such violations; far less has been written about the choices made, the technical provisioning, and the ethical backbone behind the collection, transformation, storage, and mining of the data itself. The journey from collection to algorithmic incorporation or bundling for sale is long and circuitous.

A layer of the modern analytics ecosystem hidden from most of the public is data engineering: the process by which data flow into automated decision and analytics systems – how they’re moved, transformed, and integrated. At this layer, hundreds of ethical decisions are made, usually implicitly, as analytics and engineering teams implement code to meet their clients’ demands; rarely do those teams have the means to question what the code eventually produces.

These decisions have enormous implications for the individuals who produce the data. Sometimes, explicit decisions are made to share, concatenate, aggregate, or associate data that the individual might reasonably (even if not legally) think is private, which was the case with Flo Health. But just as often, decisions are implicit and retroactively unidentifiable. They are made automatically based on choice of technology, or individual capability or comfort with various engineering or statistical techniques. Through these hidden decisions, biases can become enshrined not only within individual projects or products, but within entire data management systems. 
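
To see how an implicit decision can creep in, here is a minimal, hypothetical sketch in Python; the data, field names, and cleaning step are invented for illustration, not drawn from any real system. A routine “clean the data” line quietly decides whose records make it into every downstream model or report.

```python
# Hypothetical pipeline step: drop any record with a missing field.
# If one group of patients is less likely to have, say, a recorded home
# address, that group is silently under-represented in everything built
# downstream. No one ever reviewed this line as an ethical choice.
import pandas as pd

raw = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "home_address": ["123 Main St", None, "9 Oak Ave", None],
    "outcome": [0, 1, 1, 0],
})

clean = raw.dropna()  # one innocuous-looking line of "data cleaning"

print(f"Kept {len(clean)} of {len(raw)} records.")  # Kept 2 of 4 records.
```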

In healthcare, most patients’ privacy concerns are covered by HIPAA, and research subjects’ data are generally protected by an Institutional Review Board. But the growing use of new technologies – especially where there is no legal mechanism for oversight, which covers nearly all health-related apps, Flo included – has led to ever more difficult ethical questions with no easy solutions. Data-driven companies, non-profit organizations, and government entities are often left to their own devices.

At our hospital, we needed a way for our data professionals – engineers, analysts, and scientists/statisticians – to consider the ethical dimensions of any data project that they were working on. In collaboration with Philosophy Professor David Danks of Carnegie Mellon University, we created a short ethics checklist meant to help data professionals quickly assess the major types of concerns that could arise from their daily work.

Data Ethics Checklist (download)

We used four pairs of ordinally scored questions to help assess ethical issues: one pair of overview questions, plus one pair for each of three topics: privacy, bias and equity, and transparency and measurement. Our thinking on each of the three topics, paraphrasing our recent publication in the Journal of the American Medical Informatics Association, is as follows:

Privacy: Patients and their families are largely unaware that their data can be aggregated across multiple sources and used to predict and infer seemingly unrelated information. These questions aim to help determine whether those inferences may be problematic.

Bias and Equity: The analyses of data may have differential impact on some groups, perhaps through more precise and targeted group-specific models. Biases in data collection and measurement can thus ramify into biased or inequitable data products. These questions help decide whether analyses would be biased and lead to inequitable outcomes. 

Transparency and Measurement: Patients often want to know (or have a moral right to know) why some decision, prediction, or judgment was made; at the same time, data professionals should be able to determine how their models generate output. These questions help elucidate the extent to which potentially problematic outcomes could occur.
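
To make the checklist’s structure concrete, here is a minimal, hypothetical sketch in Python of how four pairs of ordinally scored questions might be represented and tallied. The question wording, scoring scale, and flagging threshold below are illustrative only; they are not the published checklist itself.

```python
# Hypothetical representation of the checklist: four pairs of ordinally
# scored questions (0 = no concern, 1 = possible concern, 2 = clear concern).
CHECKLIST = {
    "overview": [
        "Could this project affect people beyond its stated purpose?",
        "Would the people whose data are used be surprised by this use?",
    ],
    "privacy": [
        "Does the project combine data sources in ways subjects may not expect?",
        "Could the output enable re-identification or sensitive inference?",
    ],
    "bias_equity": [
        "Could data collection or measurement differ across groups?",
        "Could the results lead to inequitable outcomes for any group?",
    ],
    "transparency_measurement": [
        "Can we explain how the model or analysis produces its output?",
        "Do the outcomes we measure reflect what actually matters to patients?",
    ],
}

def score_project(responses):
    """Sum the ordinal scores per topic; higher totals flag areas to discuss."""
    return {topic: sum(responses[topic]) for topic in CHECKLIST}

if __name__ == "__main__":
    example = {
        "overview": [1, 0],
        "privacy": [2, 1],
        "bias_equity": [0, 1],
        "transparency_measurement": [1, 1],
    }
    for topic, total in score_project(example).items():
        flag = "  <- discuss with the team" if total >= 2 else ""
        print(f"{topic}: {total}{flag}")
```

In practice, the value lies less in the arithmetic than in the conversation each question forces before any pipeline code is written.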

Through an internal questionnaire of our data professional community, we found that use of the checklist prompted thinking about ethical issues that would likely otherwise have been left implicit, or ignored entirely. As the team responsible for data use at a large healthcare institution, we believe that such ethical prompts, whether they spur action or not, are essential for future trust in data products. Our circle of direct influence may be small, but dissemination of grassroots initiatives can shift perspectives for an entire industry.

If data professionals are not provided explicit prompts to motivate reflection on the ethical implications of their standard work, it is inevitable that implicit biases will entangle themselves into data products upon which critical decisions are made (e.g., health care delivery, who to hire, legal consequences, mortgage rates). Equity can be impacted even when there are no immoral actors in the chain of decisions. 

The choices we make in building data pipelines, data warehouses, apps, dashboards, and visualizations – choices made for expedience, tech availability, or ease of coding – will invariably have consequences for society and for equity. Consciously and reflectively considering ethical matters from the outset of – and throughout – the data product development process is a critical strategy for protecting privacy, equity, and trust.

[Image: Photo by Matthew Henry on Unsplash]

It may seem that such a tool might have prevented the Flo app leak; however, the pressure to generate revenue within internet-based companies is often overwhelming, and it is clear that Flo shared private data with outside companies on purpose. While interested parties can provide feedback to the FTC on this settlement, the proposal was supported unanimously by the FTC Commissioners, and it’s doubtful that it would change much based on public comment (which tends to be rare in any case). Increasingly, the tension between privacy and profits is compelling acknowledgment of the ethical dimensions inherent in data engineering. Data professionals are perhaps best situated to explicate and act on those issues, but public oversight of some sort will also be required to ensure the profit motive doesn’t overwhelm privacy concerns. Nonetheless, as other current and recent ethics activities around major data companies show, data professionals can help lead their companies toward more ethical products and positions.

Dwight Barry and T. Eugene Day
USA