Knowledge Scientists: Unlocking the data-driven organization

Organizations across all sectors are increasingly undergoing deep transformation and restructuring towards data-driven operations. The central role of data highlights the need for reliable and clean data. Unreliable, erroneous, and incomplete data lead to critical bottlenecks in processing pipelines and, ultimately, service failures, which are disastrous for the competitive performance of the organization. Given its central importance, those organizations which recognize and react to the need for reliable data will have the advantage in the coming decade. We argue that the technologies for reliable data are driven by distinct concerns and expertise which complement those of the data scientist and the data engineer. Those organizations which identify the central importance of meaningful, explainable, reproducible, and maintainable data will be at the forefront of the democratization of reliable data. We call the new role which must be developed to fill this critical need the Knowledge Scientist. The organizational structures, tools, methodologies and techniques to support and make possible the work of knowledge scientists are still in their infancy. As organizations not only use data but increasingly rely on data, it is time to empower the people who are central to this transformation.


The Data Driven Organization - the new secular trend

The rising importance of data in organizations over the last two decades is undeniable. Organizations across all sectors are increasingly undergoing deep transformation and restructuring towards data-driven operations. Looking back, we can identify two earlier trends which brought us to the data-driven organization.

In the 2000s, with the rise of the Web and driven by Moore’s Law, organizations of all sizes could, for the first time, easily capture and process massive data collections using commodity distributed computing platforms such as Hadoop [7] and NoSQL databases [4, 8]. Big Data became democratized.

In the 2010s, with the rise of Big Data and falling GPU prices, organizations could increasingly perform sophisticated data analytics and extract value from their data using commodity machine learning solutions such as TensorFlow [1]. Data analytics and the AI revolution became democratized.

The Big Data trend made it increasingly important for organizations to collect and harness data. As this trend matured, the organizations with a competitive advantage were the ones that identified and developed the role of Data Engineer to manipulate large amounts of data [2] and Data Steward to manage data governance and workflows [16]. The AI revolution has led to the vital need for organizations to be able to draw value from their data [5, 9]. This led to the rise of the Data Scientist as a critical role [6].

The AI revolution and the rise of cloud computing make possible the shift to the data-driven organization. In the data-driven organization, the central role of data highlights the need for reliable and clean data. If you don’t have clean and reliable data, your AI, machine learning and analytics are worthless: garbage in, garbage out. Unreliable, erroneous, and incomplete data lead to critical bottlenecks in processing pipelines and, ultimately, service failures, which are disastrous for the competitive performance of the organization.

Currently, the responsibility for making data reliable is implicitly shared between the data engineer, data stewards and data scientist roles. Given its central importance, those organizations which recognize and react to the need for reliable data will have the advantage in the coming decade.

We argue that the technologies for reliable, clean, meaningful, beautiful data are driven by distinct concerns and expertise which complement those of the data scientist and the data engineer. Those organizations which identify the central importance of meaningful, explainable, reproducible, and maintainable data will be at the forefront of the democratization of reliable data. We call the new role which must be developed to fill this critical need the Knowledge Scientist.

| Context | Trend | Organizational Need | Technology | Role |
| --- | --- | --- | --- | --- |
| Web + Moore’s Law | Big Data | Harness and collect data | Commodity distributed computing platforms (e.g., Hadoop) | Data Engineer |
| Big Data + GPU compute | AI Revolution | Draw value from data | Commodity machine learning (e.g., TensorFlow, SciPy) | Data Scientist |
| AI Revolution + cloud computing | Data-Driven Organization, Digital Transformation | Rely on data | Clean, meaningful, beautiful data technologies (e.g., knowledge graphs, data wrangling systems, data catalog platforms) | Knowledge Scientist |

Table 1: Key secular trends in recent data history.

Unpacking “rely on data”

The organizational shift we articulated above means that the products of data science are no longer just a “bonus” but are central to an organization’s value. This means that organizations must be able to rely on the data. But what does this mean in practice?

Consider the following example in an e-commerce company: a business user is trying to answer the question “How many orders were placed in a given time period, broken down by status?” As simple as this question may seem, there is a lot to unpack.

When analyzing the data, are there missing values, and if so, what should they be? What should the reference time zone be? Which currency unit should we use? What’s the assumed exchange rate? (Property 1)

Putting the data aside, what exactly is the definition of “an order”? Is it when a customer clicked “place order” on the website? Or is it when the money from the customer has been received? Or is it when the package has been delivered to the customer? Each of these is a valid definition of “an order”. (Property 2)

What is the correct source of the data? Is it data coming from the order management system? Or perhaps the accounting system? (Property 3)

Can the data be easily provided to business analysts in the tools of their choice and can it be easily fed into churn models? (Property 4)

Finally, is the business leader confident that the results given are truly up-to-date and will be reproducible in future analyses? (Property 5).
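To make the ambiguity concrete, here is a minimal sketch in Python (pandas). All column names and values are hypothetical, and status is omitted for brevity; the point is that the same business question returns three different counts depending on which event defines an order and which reference time zone is used:

```python
import pandas as pd

# Toy order records; column names (placed_at, paid_at) are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "placed_at": pd.to_datetime(
        ["2021-01-31 23:30", "2021-01-15 10:00", "2021-02-01 00:10"]).tz_localize("UTC"),
    "paid_at": pd.to_datetime(
        ["2021-02-01 02:00", "2021-01-15 10:05", "2021-02-01 00:30"]).tz_localize("UTC"),
})

def orders_in_january(event_col: str, tz: str) -> int:
    """Count 'orders' in January 2021 under a chosen event definition
    (which column marks an order) and a chosen reference time zone."""
    local = orders[event_col].dt.tz_convert(tz)
    start = pd.Timestamp("2021-01-01", tz=tz)
    end = pd.Timestamp("2021-02-01", tz=tz)
    return int(((local >= start) & (local < end)).sum())

# The same business question, three different answers:
print(orders_in_january("placed_at", "UTC"))                  # order = clicked "place order"
print(orders_in_january("paid_at", "UTC"))                    # order = money received
print(orders_in_january("placed_at", "America/Los_Angeles"))  # same definition, other time zone
```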

Being able to systematically answer these kinds of questions is fundamental for having reliable data. We summarize these questions as a series of properties that can be checked in Table 2.

  1. Reliable data is clean data. An obvious place to start is traditional notions of clean data: that it is uniform (e.g., timestamps all use the same format), that it is valid in conforming to business rules and well-defined schemas, that it is complete, that it accurately reflects reality, and that it is consistent with other data in the organization.

  2. Reliable data is grounded in shared meaning spaces. What does a column mean? Can I understand that without talking to the initial producer(s) of the data? Is the meaning adopted across the organization, and do we have a shared understanding? Can one tie the data easily to human-understandable definitions?

  3. Reliable data is data in context. Clean data with shared meaning is not enough. At first blush, data may look pristine, but if we don’t know where it comes from (its lineage), how it was sourced, and whether we have the rights to use it, it can become a massive liability. What if we didn’t obtain consent for its usage and thereby violate regulations? What if the licenses under which the data was obtained constrain its usage and we are now financially liable? What if we use data that was cleaned appropriately for one task but fails for another? We can rely on data when we know the organizational context in which it is situated, developed, and sourced.

  4. Reliable data is data accessible in a standardized format. Can data be easily imported into existing tools? Are the tools and programs available? Is it using open formats?

  5. Reliable data is maintained. An organization can only rely on data that reflects the current state of an organization’s world and that is kept up-to-date and reproducible for future analyses.

Table 2: Properties of reliable data.
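Several of these properties lend themselves to lightweight programmatic checks. The sketch below shows one way Property 1 checks might look, assuming a pandas DataFrame with hypothetical columns (`placed_at`, `status`, `order_id`, `amount`) and stand-in business rules; it is an illustration, not a complete rule set:

```python
import pandas as pd

def check_clean(orders: pd.DataFrame) -> dict:
    """Illustrative Property 1 checks; column names and rules are
    hypothetical stand-ins for an organization's own conventions."""
    return {
        # Uniform: timestamps share a single, known time zone.
        "uniform_timezones": orders["placed_at"].dt.tz is not None,
        # Valid: statuses come from an agreed vocabulary.
        "valid_status": bool(orders["status"].isin({"placed", "paid", "delivered"}).all()),
        # Complete: no missing identifiers or amounts.
        "complete": bool(orders[["order_id", "amount"]].notna().all().all()),
        # Consistent: amounts are never negative.
        "consistent_amounts": bool((orders["amount"] >= 0).all()),
    }
```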

All these properties require deeper knowledge and understanding of the data. Essentially, we can rely on data only when we acknowledge the organizational activities within which it is situated. Recent work on understanding data science practices [12, 11, 3] highlights the importance of such an acknowledgement.

Data wrangling is often said to be 80% of the work of data science [13]. Typically this is seen as boring, annoying grunt work: data janitorial work that people don’t want to do but get stuck with [10]. However, the problem is not about cleaning data by eliminating white space and replacing wrong characters. It’s about understanding the ecosystem between people, data, and tasks in an organization, and about communicating, documenting, and maintaining that knowledge. This is why it takes 80% of the effort. Our contention is that this is vital knowledge work at the center of any data-driven organization.

In typical organizations, the knowledge work to create reliable data is ad hoc, and the results and practices are not shared [14]. Furthermore, in data science teams, a data scientist or data engineer might do this knowledge work but is not equipped, trained, or incentivized to do so. Indeed, from our experience, the knowledge work (e.g., 8-hour conference calls, discussions, documentation, long Slack chats, Confluence spelunking) required to create reliable data is often not valued by managers or by employees themselves. The tasks and functions of creating reliable data are never fully articulated, and thus responsibility is diffuse or non-existent. Who should be responsible?

The role of the Knowledge Scientist

The Knowledge Scientist is responsible.

Who is a knowledge scientist?

The Knowledge Scientist is the person who builds bridges between data and business requirements, questions, and needs. Their job is to document knowledge gathered from business users, data scientists, data engineers, and their environment, with the goal of producing reliable data that can then be used effectively in a data-driven organization.

Returning to our scenario of order placement over time: the knowledge scientist gets in a room with business users and starts the discussion about the business question. During this discussion, it becomes apparent that multiple business users use the same word to mean different things. After further discussion, there may be an agreement that the correct use of the term “order” is when money has been received and the package has been delivered. With this information, the knowledge scientist can work with the data engineer to identify that the order management system should be used, and with data scientists to figure out how best to enable churn modeling.

Furthermore, during analysis of the data, it emerges that different time zones are being recorded. The knowledge scientist then goes back and has further discussions with the business users to define what constitutes an agreed meaning for a time period. Given that the company’s HQ is on the west coast, they all agree that PST is the reference time zone. We could keep going with this example, but what is important to note is that the knowledge scientist is the person driving the discussions between the business users, data scientists, and data engineers, documenting the decisions, and defining how the data corresponds to the business meaning. At the end, the knowledge scientist is able to generate reliable data that can then be passed on to the data scientist for analysis.
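Once such agreements are reached, the knowledge scientist can capture them in executable, documented form so the decisions travel with the data rather than living only in meeting notes. A minimal sketch, assuming hypothetical `paid_at` and `delivered_at` columns sourced from the order management system:

```python
import pandas as pd

# Reference time zone agreed with the business users ("PST"); recording it
# in code keeps the decision with the pipeline rather than in a chat log.
REFERENCE_TZ = "America/Los_Angeles"

def completed_orders(raw: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Apply the agreed definition: an 'order' exists only once money has
    been received AND the package has been delivered. Source: the order
    management system. `paid_at`/`delivered_at` are hypothetical columns."""
    done = raw.dropna(subset=["paid_at", "delivered_at"])
    when = done["delivered_at"].dt.tz_convert(REFERENCE_TZ)
    lo = pd.Timestamp(start, tz=REFERENCE_TZ)
    hi = pd.Timestamp(end, tz=REFERENCE_TZ)
    return done[(when >= lo) & (when < hi)]
```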

What skills does a knowledge scientist need to have?

Knowledge science work is technical work. Knowledge scientists use skills and techniques such as data modeling, data integration, knowledge representation, and ontology engineering to capture what they learn from business users. The output is a data model that represents how the business user sees the world. They can align this data model with other models derived from talking to other business users. Furthermore, when working with data engineers, the knowledge scientist is fluent in data access and transformation methods such as query and programming languages. They can transform the data provided by the data engineer and map it to the business meaning provided by the business user. They are also conversant in analytical and machine learning methods.
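As a toy illustration of this modeling output, the sketch below records a single business term together with its agreed definition, its authoritative source, and its mapping to physical columns. The structure and all names are hypothetical, a deliberately simple stand-in for a real ontology or data catalog entry:

```python
from dataclasses import dataclass, field

@dataclass
class TermDefinition:
    """One entry in a shared business glossary; a toy stand-in for the
    data models and ontologies a knowledge scientist would build."""
    term: str
    definition: str               # human-understandable, agreed with business users
    source_system: str            # where the authoritative data lives
    mapping: dict                 # business concept -> physical column
    agreed_by: list = field(default_factory=list)

# Hypothetical entry capturing the decisions from the running example.
ORDER = TermDefinition(
    term="order",
    definition="A purchase for which money has been received and the "
               "package has been delivered.",
    source_system="order management system",
    mapping={"paid": "orders.paid_at", "delivered": "orders.delivered_at"},
    agreed_by=["sales ops", "finance", "data science"],
)
```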

Knowledge science work is people work. The knowledge scientist has excellent communication skills that can be applied with both business users and data engineers. The knowledge scientist is both a “people person” and a “geek”, someone who is comfortable with the context-dependent, dynamic, and collaborative nature of making meaning from data.

Haven’t we seen this before?

Knowledge science has its roots in the knowledge engineering approaches of the 1980s and 1990s [15]. In that world, skills such as knowledge identification, knowledge elicitation, and knowledge specification were taught and used. These are lost arts in industry today, particularly in the data science context. We believe that revisiting these approaches will be a key part of developing both the instructional curriculum and the tooling needed to support the knowledge scientist.

Implications

We need to empower people and organizations to produce reliable data. This requires rethinking across the board: organizational structures, academic training, and advances in the study of knowledge. The following are some of our initial thoughts on what this means for these various actors.

For organizations, introduce the role of the knowledge scientist. A simple step is to seek out existing team members who already play this role. They might be business analysts, data scientists, product owners, data stewards, or data engineers; we have even seen sales representatives play this role. By acknowledging the role, you can elevate the importance of reliable data in your organization and focus on developing the skills outlined above. Another step is to communicate your experience with knowledge scientists: What do they bring to the organization? What infrastructure do they need? How have they impacted your data estate? Only through this communication can we effectively find the practices that empower knowledge scientists.

For tool developers, there is a paucity of tools that help knowledge scientists; now might be the time to see what you can do.

For educators, new courses and content are needed that are integrative. A knowledge scientist does not require the full depth of data engineering, machine learning, knowledge acquisition, communications, and human-computer interaction; they require pieces of all of these. One approach would be to provide pathways through the excellent content already available in these various disciplines. It is also perhaps time to revisit classic knowledge engineering and data modeling material and update it for this new area. Finally, as we have already noted, knowledge scientists exist. We see a tremendous opportunity to learn from this practical experience.

For researchers, again our call is to be integrative. We need to study the tripartite relationship between data/knowledge models, their corresponding query languages, and the people both using and producing reliable data. We need to understand how people perceive the way data is modeled and represented, and its organizational embedding. To do so, we need to work with scientists and experts across communities to design methodologies, experiments, and user studies. This requires bringing together data management expertise in theory, systems, and semantics with communities who study people (e.g., human data interaction) and those who actively use data (e.g., data journalists, political scientists, life scientists). In this integrative approach, what kinds of questions might be interesting to ask? Here are a few to get going: What is the role and function of data and knowledge modeling in organizations? Why do we keep inventing new data models? What affordances are necessary to help people create and use data models? Are new organizational roles needed for data and knowledge modeling and management?

The organizational structures, tools, methodologies and techniques to support and make possible the work of knowledge scientists are still in their infancy. Let’s make their work easier.

Unlocking the data-driven organization

We have argued that there is a fundamental role – the knowledge scientist – that has been overlooked in data-driven organizations. As organizations not only use data but increasingly rely on data, it is time to empower the people who are central to this transformation.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), pp. 265–283.
  • [2] J. Anderson (2017). Data Engineering Teams: Creating Successful Big Data Teams and Products. Big Data Institute.
  • [3] C. L. Borgman (2015). Big Data, Little Data, No Data: Scholarship in the Networked World. The MIT Press.
  • [4] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber (2006). Bigtable: A distributed storage system for structured data. In 7th Symposium on Operating Systems Design and Implementation (OSDI 2006), pp. 205–218.
  • [5] National Research Council (2013). Frontiers in Massive Data Analysis. The National Academies Press, Washington, DC.
  • [6] T. Davenport and D. Patil (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review 90, pp. 70–76, 128.
  • [7] J. Dean and S. Ghemawat (2004). MapReduce: Simplified data processing on large clusters. In 6th Symposium on Operating System Design and Implementation (OSDI 2004), pp. 137–150.
  • [8] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels (2007). Dynamo: Amazon’s highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP 2007), pp. 205–220.
  • [9] A. Y. Halevy, P. Norvig, and F. Pereira (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2), pp. 8–12.
  • [10] S. Lohr (2014). For big-data scientists, ‘janitor work’ is key hurdle to insights. The New York Times, August 2014.
  • [11] M. Muller, M. Feinberg, T. George, S. J. Jackson, B. E. John, M. B. Kery, and S. Passi (2019). Human-centered study of data science work practices. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (CHI EA ’19).
  • [12] M. Muller, I. Lange, D. Wang, D. Piorkowski, J. Tsay, Q. V. Liao, C. Dugan, and T. Erickson (2019). How data science workers work with data: Discovery, capture, curation, design, creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–15.
  • [13] A. Ruiz (2017). The 80/20 data science dilemma. InfoWorld, September 2017.
  • [14] M. Stonebraker and I. F. Ilyas (2018). Data integration: The current status and the way forward. IEEE Data Engineering Bulletin 41(2), pp. 3–9.
  • [15] R. Studer, V. R. Benjamins, and D. Fensel (1998). Knowledge engineering: Principles and methods. Data & Knowledge Engineering 25(1–2), pp. 161–197.
  • [16] M. Teperek, M. Cruz, E. Verbakel, J. Böhmer, and A. Dunning (2018). Data stewardship – addressing disciplinary data management needs. International Journal of Digital Curation 13, pp. 141–149.