Data Lakes from a Privacy Office Perspective

The Coast Guard’s Office of Privacy Management (CG-6P) requires all program offices responsible for software, data, and Information Technology (IT) systems to complete a Privacy Threshold Analysis (PTA) describing the use case, the type of technology, the data used, the analysis to be performed, and the expected outcome. 1

The Department of Homeland Security Privacy Office then reviews and adjudicates each PTA prior to implementation. 2

With this framework in mind, it should be apparent why stipulations from CG-6P are not merely relevant to the goals and vision of a data lake and Integrated Data Environment (IDE), but on an inevitable collision course with them.

In a previous post, I stated, “The [Data Readiness Task Force] DRTF’s relationship and history with CG-6P warrants a series of posts in and of itself.” 3

Before I move on from reflections on my time assigned to the DRTF, it is worth noting the dynamic that was forming between CG-6P and the Coast Guard’s Office of Data and Analytics (CG-ODA) at CG-ODA’s inception. (Chief among the many tasks assigned to the DRTF was establishing the Coast Guard’s new Office of the Chief Data Officer, which the DRTF re-branded as CG-ODA so that the office would not be too centric to the Chief Data Officer.)

Previously, I stated:

What is relevant to the IDE’s first data connection is, the DRTF would no longer be looking to consume data as rapidly as possible. Rather, data ingestion into the IDE will take a conservative approach where only data with an immediate purpose or need is ingested into the IDE. From a privacy perspective, this approach limits data sharing (a privacy initiative) by ensuring an explicit justification and use for all data presented in the environment. 3

Data Lakes and Privacy

Populating a data lake with data may seem straightforward. Move the data into the centralized data repository (i.e., put the data in the lake). If you want, send the data through parsers for cleaning and optimization. As long as there is still compute/storage space, keep bringing in data (i.e., keep filling the lake). And if you need more compute/storage space, get more. In a simplistic application of a data lake, this is exactly what should happen, and this is exactly the common conception of a data lake.
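As a minimal sketch of that simplistic loop, consider the following. The drop zone, the lake path, and the clean() step are all hypothetical stand-ins for illustration, not any real pipeline:

```python
import shutil
from pathlib import Path

LAKE_ROOT = Path("/data/lake/raw")  # hypothetical lake location


def clean(text: str) -> str:
    """Toy 'parser': normalize whitespace as a stand-in for real cleaning."""
    return "\n".join(line.rstrip() for line in text.splitlines())


def ingest(source: Path) -> Path:
    """Copy one source file into the lake, cleaning text formats on the way."""
    destination = LAKE_ROOT / source.name
    destination.parent.mkdir(parents=True, exist_ok=True)
    if source.suffix == ".csv":
        destination.write_text(clean(source.read_text()))
    else:
        shutil.copy2(source, destination)
    return destination


# "Keep filling the lake": ingest everything that lands in the drop zone.
for file in Path("/data/incoming").glob("*"):  # hypothetical drop zone
    ingest(file)
```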

However, as an organization’s use of a data lake matures, this simplistic application may wear thin. Depending on the organization’s priorities with respect to data democratization, a single centralized environment where all users have access to all data may be problematic. For the Coast Guard, this was exactly the case.

What Makes a Data Lake Useful?

A data lake is most useful when it contains the data required to answer the problem statement posed. Thus, a maximalist approach would be to get as much data as possible into the data lake, at which point the lake is best positioned to answer the most problem statements.

However, a data lake can also be viewed as an optimization problem. An organization may be budget-constrained and only able to buy a certain amount of compute/storage space. It would then want to make the most of the space it can afford by populating the data lake with the most useful data.

In parallel with the optimization view, a data lake populated with poor data, or populated incorrectly, can become a bit of a data “swamp”. If the common opinion of a data lake is that it has become a data swamp, its effectiveness is undermined.

In an earlier post, I highlighted four metrics of a data-centric organization: 4

  1. Everyone in the organization has access to the data relevant to their team/work,
  2. There is a transparent entity responsible for data quality,
  3. Opinions are voiced only when accompanied by supporting data, and
  4. Numbers are communicated even when they communicate negative messages.

I always liked the idea of a data lake/IDE contract for the Coast Guard, because it effectively contracts out the transparent entity responsible for data quality (the second metric above). However, if the common opinion of the data lake is that it is a data swamp, then the purpose of that transparent entity is moot. Which is to say, if the data is not trusted up front, then the quality, the availability, and any insights derived from the data are all irrelevant. I mention all of this simply to reinforce the point that an overpopulated data lake can become a data swamp. And although the perception of a swamp could be driven by many different reasons, the perception alone will undermine any insights drawn from the data lake and enable skeptical decision makers to introduce doubt.

The Office of Privacy Management’s (CG-6P) Stance on Data Sharing

With respect to data democratization, CG-6P is extremely restrictive. In the established system, CG-6P requires all program offices for software, data, and IT systems to complete PTAs. PTAs position CG-6P as a bottleneck between any data and its availability to be readily and easily democratized (another post is warranted regarding the Coast Guard’s broader approach to data democratization). This bottleneck positioning suits CG-6P: as the privacy office, its preference with respect to data democratization is as conservative an approach as possible.

Thus, with respect to its data lake/IDE, the Coast Guard was strong-armed by CG-6P into only bringing in data that had an immediate use, and only allowing users with an immediate justification to see that data. This is the second most conservative way to populate a data lake, second only to not filling the lake with data at all.
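To illustrate what that conservative posture implies in practice, here is a minimal sketch of an ingestion gate that admits a dataset only if it has an adjudicated PTA and an explicit justification on file. The registry, its fields, and the datasets are hypothetical, not the IDE’s actual mechanism:

```python
from dataclasses import dataclass


@dataclass
class PTARecord:
    dataset: str
    adjudicated: bool   # reviewed and approved by the DHS Privacy Office
    justification: str  # the immediate, explicit use for the data


# Hypothetical registry of PTA outcomes.
pta_registry = {
    "vessel_positions": PTARecord("vessel_positions", True, "search-and-rescue analytics"),
    "personnel_rosters": PTARecord("personnel_rosters", False, ""),
}


def may_ingest(dataset: str) -> bool:
    """Admit a dataset only if its PTA is adjudicated and a use is stated."""
    record = pta_registry.get(dataset)
    return record is not None and record.adjudicated and bool(record.justification)


for name in ("vessel_positions", "personnel_rosters", "unknown_feed"):
    print(name, "->", "ingest" if may_ingest(name) else "reject")
```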


These views are mine and should not be construed as the views of the U.S. Coast Guard.