Data lineage and data sovereignty: from anarchy to harmony

In this special guest feature, Sashank Purighalla, CEO and Founder of BOS Framework, discusses how data sovereignty and lineage impact each other, how you can track data lineage, and what we can do to confront data anarchy. BOS Framework is a microservices and DevSecOps automation platform that enables enterprises to drive up business efficiency and fix security gaps.


This article was originally published on InsideBigData.

Digital privacy has come a long way since June 2013, when Edward Snowden and WikiLeaks got us thinking about rightful data ownership and governance. Despite an initial shift towards data democratization, several factors have combined to leave us in an age of data anarchy instead.

Firstly, the sheer volume of data produced by our digital footprints has left data management in a state of disorder, because no standard regulates who can claim direct access to the data. Uncertainty surrounding data protection, retention, and access fuels an ongoing philosophical debate between companies, countries, and legal authorities. Who owns the data? Is it the individual, an organization, a country, or should it be more of a universal asset? And how do you then govern this uncertainty?

Secondly, data lineage – the process of understanding and recording data through its lifecycle – involves tracing data back to where it started and capturing its movements across multiple systems and possibly across multiple geographies. But lineage is a challenge because data sovereignty and governance rules vary from country to country and change frequently. Data anarchy has become part of IT specialists’ everyday working lives.

Lastly, because security, development, and operations teams are often disconnected, siloed applications are built over many years by various developers without a common architecture standard. This makes it easier for attackers to find and exploit data vulnerabilities.

So, let’s discuss how data sovereignty and lineage impact each other, how you can track data lineage, and what we can do to confront data anarchy. 

The Clash of the Data Titans 

Data engineers and scientists are often seen as the yin and yang of turning raw sets of data into useful insights. However, current systems still impede their ability to work efficiently together.

Scientists want lots of data, structured in a particular way, so they can produce insightful reports and analysis.

In contrast, engineers are interested in ensuring that data is stored in formats consistent with the regulations that govern specific geographies and industries. Engineers are therefore inherently constrained by governance, while scientists want ever more data to work with.

Data is produced continually, while the laws governing countries and industries constantly evolve under political influence, so anarchy becomes a very natural outcome. Navigating the inconsistent global privacy landscape is tough for IT professionals, especially as 130 jurisdictions worldwide had adopted their own data privacy laws as of January 2021.

Regulation is in flux because every country has its own laws governing its data. For example, China’s recently announced strict personal data protection laws will require companies to pass a security assessment, and already very few foreign apps make it inside the country.

While engineers must make sure that data is stored in a way consistent with governing laws, data scientists want to explore the differences between, say, a 25-year-old woman in the United States and one in China. The governing laws prevent them from querying and comparing data across these geographies. Therefore, data governance – a set of rules and procedures organizations use to control data – and data lineage are at odds with one another.
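To make the tension concrete, here is a minimal Python sketch of how a residency rule might block a cross-border comparison. The policy table, the region codes, and the names `ALLOWED_READERS`, `Dataset`, `can_query`, and `blocked_datasets` are all hypothetical and do not describe any real law or product.

```python
from dataclasses import dataclass

# Hypothetical policy table: which analyst regions may read data stored in
# each region. Illustrative only; it does not describe any real regulation.
ALLOWED_READERS = {
    "US": {"US"},
    "CN": {"CN"},
}


@dataclass(frozen=True)
class Dataset:
    name: str
    region: str  # where the data is stored


def can_query(analyst_region: str, dataset: Dataset) -> bool:
    """Return True if the hypothetical policy lets this analyst read the dataset."""
    # Regions missing from the table are denied by default.
    return analyst_region in ALLOWED_READERS.get(dataset.region, set())


def blocked_datasets(analyst_region: str, datasets: list[Dataset]) -> list[str]:
    """Names of the datasets a cross-region comparison would not be allowed to touch."""
    return [d.name for d in datasets if not can_query(analyst_region, d)]


us_customers = Dataset("customers_us", "US")
cn_customers = Dataset("customers_cn", "CN")
blocked = blocked_datasets("US", [us_customers, cn_customers])
print(blocked)
```

In this sketch a US-based analyst can read the US dataset but not the Chinese one, so the comparison the data scientist wants cannot run as a single query.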

DevSecOps can help engineers and data scientists, but there will always be tension between data consumption and the heavy governance imposed by regulation. The key to easing this tension is building automated data platforms that can remain the source for all future analytics and applications while evolving over time.

Tracking Data Lineage and Securing Data

Democratization of data across departments within an organization is essential to help businesses derive innovative insights for growth. Some data engineers and scientists describe data lineage, or traceability, as the GPS of data. It gives them granular visibility of how data transforms and flows from source to destination while tracking errors, which translates into meaningful, cross-border insights for businesses.
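As a rough illustration of the "GPS" idea, the Python sketch below records each hop a dataset takes and walks back from a report to its original sources. The `LineageGraph` class and the dataset and step names are invented for this example; real lineage tools capture far more detail.

```python
from collections import defaultdict


class LineageGraph:
    """Records which upstream datasets each dataset was derived from."""

    def __init__(self):
        # dataset name -> list of (upstream dataset, transformation step)
        self._parents = defaultdict(list)

    def record(self, source, target, step):
        """Note that `target` was produced from `source` by `step`."""
        self._parents[target].append((source, step))

    def trace(self, dataset):
        """Walk back from `dataset`, returning (source, step, target) hops."""
        hops, stack, seen = [], [dataset], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            for source, step in self._parents.get(current, []):
                hops.append((source, step, current))
                stack.append(source)
        return hops

    def origins(self, dataset):
        """Datasets in the lineage that have no recorded upstream of their own."""
        nodes = {dataset} | {src for src, _, _ in self.trace(dataset)}
        return sorted(n for n in nodes if not self._parents.get(n))


graph = LineageGraph()
graph.record("crm_export_us", "customers_clean", "deduplicate")
graph.record("crm_export_eu", "customers_clean", "deduplicate")
graph.record("customers_clean", "churn_report", "aggregate")
print(graph.origins("churn_report"))
```

Here the report traces back to two regional exports, which is exactly the kind of information a team needs when deciding which jurisdictions' rules apply to a given output.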

However, companies mistakenly view data lineage as a single project, when in fact they need an ever-evolving mechanism, because data is constantly being produced and policed. Some companies and their data analysts currently use data virtualization software, while others implement data cataloging and then go on to design Master Data Management (MDM) solutions. A data catalog provides a rich interface for attaching business metadata to the swathes of data scattered across clouds, cloud storage, and on-premises data centers or databases.
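A data catalog can be pictured, very roughly, as a searchable index of metadata about where data lives and who owns it. The Python sketch below assumes a simple in-memory design; `CatalogEntry`, `DataCatalog`, and the example locations, owners, and tags are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    location: str  # hypothetical storage path or connection string
    owner: str     # team responsible for the dataset
    region: str    # where the data physically resides
    tags: set = field(default_factory=set)


class DataCatalog:
    """In-memory index of business metadata, keyed by dataset name."""

    def __init__(self):
        self._entries = {}

    def register(self, name, entry):
        self._entries[name] = entry

    def tag(self, name, *tags):
        """Attach one or more business tags to an already registered dataset."""
        self._entries[name].tags.update(tags)

    def find(self, tag=None, region=None):
        """Names of datasets matching every filter that was supplied."""
        return sorted(
            name
            for name, entry in self._entries.items()
            if (tag is None or tag in entry.tags)
            and (region is None or entry.region == region)
        )


catalog = DataCatalog()
catalog.register("orders", CatalogEntry("s3://example-bucket/orders", "sales-ops", "EU"))
catalog.register("customers", CatalogEntry("postgres://example/customers", "crm-team", "US"))
catalog.tag("orders", "finance")
catalog.tag("customers", "pii", "finance")
print(catalog.find(tag="pii"))
print(catalog.find(region="EU"))
```

Tagging datasets with labels such as "pii" and recording their region gives both engineers and scientists one place to check what exists and where it is stored before they query it.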

When multiple developers monitor this data and use varied tools across many different environments, a key concern is cloud security and ensuring that metadata stays within the cloud organization or project. A cloud architecture that isolates data per tenant and uses open source databases such as MySQL and PostgreSQL could also help modernize and simplify data strategies. Such solutions consolidate disconnected DevSecOps tools and undifferentiated microservices into an automation platform, empowering companies to deliver tech-enabled business outcomes.
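One simple way to picture per-tenant isolation is to give each tenant its own database. The Python sketch below uses the standard library's SQLite module purely as a stand-in for MySQL or PostgreSQL; the `TenantStore` class, the table layout, and the tenant names are assumptions made for illustration, not a description of any particular platform.

```python
import sqlite3


class TenantStore:
    """Keeps a separate in-memory SQLite database for every tenant."""

    def __init__(self):
        self._connections = {}

    def _conn(self, tenant_id):
        # Each call to connect(":memory:") creates an independent database,
        # so one tenant's rows are never visible through another's connection.
        if tenant_id not in self._connections:
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE records (key TEXT PRIMARY KEY, value TEXT)")
            self._connections[tenant_id] = conn
        return self._connections[tenant_id]

    def put(self, tenant_id, key, value):
        conn = self._conn(tenant_id)
        conn.execute(
            "INSERT OR REPLACE INTO records (key, value) VALUES (?, ?)",
            (key, value),
        )
        conn.commit()

    def get(self, tenant_id, key):
        row = self._conn(tenant_id).execute(
            "SELECT value FROM records WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None


store = TenantStore()
store.put("acme", "plan", "gold")
print(store.get("acme", "plan"))
print(store.get("globex", "plan"))
```

The second lookup returns nothing because the other tenant's database never held that key. In production, the same separation might be achieved with per-tenant schemas or databases, which is a design decision this sketch does not settle.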

Many international companies with data distributed across geographies face multiple challenges, from varying data protection laws to tracking their data lineage. But they shouldn’t despair. New self-service solutions are appearing that can be rolled out not only by developers with niche skill sets but by business professionals too.