Today’s CIO is challenged to turn data into wisdom at all levels of the organization: organizational data, cybersecurity data, and client data alike. This process is commonly framed by the Data-Information-Knowledge-Wisdom (DIKW) pyramid, which refers loosely to a class of models for representing structural and/or functional relationships between data, information, knowledge, and wisdom. Data as a Service (DaaS) is one way to enable this process.
Traditionally, organizations have stored data in self-contained repositories, with software specifically developed to access and present that data in human-readable form. This same software prevents real data sharing and the discovery of knowledge and wisdom. DaaS breaks that model: it shares the sources through a federated enterprise information architecture spanning a range of platforms, data publishers, and users. DaaS enables sharing by automatically identifying and adding data to a catalog through a set of ontologies and taxonomies (information models). Data gains context and meaning when dataset metadata is mapped to an information model (giving the data semantic meaning) by an auto-recommendation algorithm. A good ontology integrates major open-source, highly utilized standard ontologies. This article focuses on applying DaaS for cloud-based ingestion and redaction.
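As a rough illustration of the auto-recommendation idea, the sketch below scores ontology terms by token overlap with a dataset's column names. The ontology, its terms, and the scoring rule are all illustrative assumptions, not any specific product's algorithm.

```python
# Minimal sketch: recommend information-model terms for a dataset based on
# overlap between its column-name tokens and each term's known labels.
# The ontology and column names below are hypothetical examples.

def recommend_terms(column_names, ontology, top_n=2):
    """Score each ontology term by token overlap with the dataset's
    column names and return the best-matching terms."""
    tokens = {t for name in column_names for t in name.lower().split("_")}
    scored = []
    for term, labels in ontology.items():
        overlap = len(tokens & {l.lower() for l in labels})
        if overlap:
            scored.append((term, overlap))
    scored.sort(key=lambda pair: -pair[1])  # highest overlap first
    return [term for term, _ in scored[:top_n]]

# Hypothetical information model: term -> known labels/synonyms
ontology = {
    "Person": {"name", "person", "employee", "ssn"},
    "Location": {"address", "city", "state", "zip"},
    "Event": {"timestamp", "date", "event"},
}

columns = ["employee_name", "home_address", "hire_date"]
print(recommend_terms(columns, ontology, top_n=3))
```

A production auto-recommender would use richer signals (data profiles, embeddings, usage history), but the shape of the mapping — metadata in, ranked semantic terms out — is the same.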
There are several best practices and design principles that give your DaaS the best chance of success. If you implement DaaS, make sure data resources and capabilities are:
• Discoverable – ability to discover data and analytics across multiple instances
• Consistent – consistent use of syntax and terminology (if not necessarily naming)
• Distributed – ability to share data and analytics across multiple platforms & instances
• Standardized – analytics design patterns provided and largely adhered to
• Decentralized and Organic – no requirement for centralized approvals, only central registries
• Scalable – ability to accept and manage multiple data feeds and analytics
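To make the "discoverable" and "decentralized" principles concrete, here is a minimal sketch of a central registry with no approval workflow: publishers self-register their datasets, and users discover them by keyword. The `Catalog` class, its fields, and the sample entries are hypothetical, not a specific product's API.

```python
# Sketch of a decentralized-friendly data catalog: registration is the
# only gate (no central approval), and discovery is a keyword search
# over names and tags. All names and entries below are illustrative.

class Catalog:
    def __init__(self):
        self.entries = []

    def register(self, name, publisher, tags):
        # Publishers self-register; no centralized approval step.
        self.entries.append(
            {"name": name, "publisher": publisher, "tags": set(tags)}
        )

    def discover(self, keyword):
        kw = keyword.lower()
        return [e["name"] for e in self.entries
                if kw in e["name"].lower()
                or kw in {t.lower() for t in e["tags"]}]

catalog = Catalog()
catalog.register("incident_reports", "secops", ["cybersecurity", "incidents"])
catalog.register("customer_feedback", "crm-team", ["clients", "survey"])
print(catalog.discover("cybersecurity"))  # ['incident_reports']
```

A real registry would add versioning, access policies, and the ontology-driven tagging described above, but the key property is already visible: publishers keep ownership while the enterprise gains a single place to search.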
What problems does DaaS solve?
Large organizations are unable to discover, access, and share data across users and groups – a problem that wastes huge amounts of time and money and often results in failing to achieve critical mission goals. This problem often comes from legacy conditions:
• In a federated data environment, each data publisher has different methods of describing, storing, and accessing data, which makes sharing difficult
• Users do not know what the data is or how it can be used
• Silos do not see how the data applies to the enterprise
• Data publishers often describe the same terms using different names, inhibiting discovery, use and integration.
• Older data becomes irrelevant because its tagging is outdated and never refreshed
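The naming problem in particular — different publishers describing the same concept with different names — can be illustrated with a simple synonym table that normalizes publisher-specific field names to canonical terms. The table below is a hypothetical example, not a standard vocabulary.

```python
# Sketch: normalize publisher-specific field names to canonical terms so
# the same concept is discoverable regardless of naming conventions.
# The synonym table is an illustrative assumption.

SYNONYMS = {
    "ssn": "social_security_number",
    "soc_sec_no": "social_security_number",
    "dob": "date_of_birth",
    "birth_date": "date_of_birth",
}

def canonicalize(fields):
    """Map each field name to its canonical term, or pass it through
    lowercased if no synonym is known."""
    return [SYNONYMS.get(f.lower(), f.lower()) for f in fields]

print(canonicalize(["SSN", "Birth_Date", "email"]))
# ['social_security_number', 'date_of_birth', 'email']
```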
Beware half-DaaS solutions
Many organizations (and vendors) try to solve the sharing problem by forcing everyone onto a common platform. This approach is typically ineffective: publishers are not motivated to adopt changes, because changes add work and create new opportunities for human error. DaaS places no new burdens on your current staff, because it enables sharing from what already exists and automates integration using neural networks and natural language processing.
"Using the frameworks and the knowledge of DaaS, we have been able to process up to 5000 records a second for one of our clients"
Technologies were intentionally left out of this article because technologies change, while the why and the how remain constant. A good DaaS implementation should be able to use any technology stack; if someone tells you that DaaS depends on a specific tool, they don't understand the problem. Default to open when possible.
Use Case: Ingestion
CIOs need to ingest large, high-volume streams of data, including structured and unstructured data, images, and PDFs. Cloud-hosted open-source data processing frameworks can ingest at rates that were unachievable a few years ago. Using these frameworks and the knowledge of DaaS, we have been able to process up to 5,000 records a second for one of our clients, storing them in the right application and adding information and knowledge to a live dashboard. This streaming ingestion architecture scales easily to whatever current needs dictate using horizontal cloud scaling models, and it allows our clients to plug in any analytical algorithms they need, such as machine learning for high-throughput automated redaction.
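The "plug in any analytical algorithm" idea can be sketched as a pipeline of pluggable stages that each record passes through before storage. The stage, record shape, and in-memory store below are illustrative stand-ins; a production system would run equivalent stages on a distributed, horizontally scaled cloud framework.

```python
# Sketch of a pluggable streaming-ingestion pipeline: records flow
# through a chain of analytic stages before being stored. The stages,
# record shape, and list-based store are illustrative assumptions.

def classify(record):
    # Hypothetical analytic stage: tag records that mention PII.
    record["has_pii"] = "ssn" in record["text"].lower()
    return record

def ingest(stream, stages):
    """Run each record through every stage, then store it."""
    store = []
    for record in stream:
        for stage in stages:
            record = stage(record)
        store.append(record)  # stand-in for routing to the right backend
    return store

stream = [{"text": "routine status update"},
          {"text": "Employee SSN on file"}]
stored = ingest(stream, [classify])
print([r["has_pii"] for r in stored])  # [False, True]
```

Because stages are just functions over records, swapping in a different analytic — OCR, entity extraction, a trained redaction model — means changing the stage list, not the pipeline.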
DaaS and cloud-based ingestion allow us to implement redaction on the fly, adjusting who has access to what data through both rule-based and role-based redaction. For example, a report containing both secret and top secret information can be processed to show secret and below to the secret-cleared role, and top secret and below to the top secret role. If a piece of content is ruled confidential on the fly, it can be redacted and scrubbed from the system. Using OCR, we can redact images and pixel regions to remove classified or PII areas. Finally, DaaS and cloud-based redaction allow us to apply machine learning methods that can identify information that is sensitive only in combination, which a human reviewer may have missed.
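The role-based half of that example can be sketched as a clearance-ceiling filter: each reader role sees only content classified at or below its clearance. The classification levels and sample report below are illustrative, not a real classification scheme or data.

```python
# Sketch of role-based redaction: keep paragraphs classified at or below
# the reader's clearance, replace the rest with a redaction marker.
# The level ordering and sample report are illustrative assumptions.

LEVELS = {"unclassified": 0, "confidential": 1, "secret": 2, "top secret": 3}

def redact(report, role_clearance):
    """Return the report as seen by a reader with the given clearance."""
    ceiling = LEVELS[role_clearance]
    return [(text if LEVELS[level] <= ceiling else "[REDACTED]")
            for level, text in report]

report = [("secret", "Troop movement summary"),
          ("top secret", "Source identity"),
          ("unclassified", "Weather brief")]

print(redact(report, "secret"))
# ['Troop movement summary', '[REDACTED]', 'Weather brief']
```

Rule-based redaction composes with this naturally: a rule engine (or a trained model, as described above) assigns or raises the level on each piece of content, and the same ceiling filter then enforces it per role.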