Database Systems

Wednesday, 14 January 2026

Why trustworthy AI isn’t optional - it’s the foundation

In the rapidly evolving landscape of generative AI, data security and data privacy are not just compliance requirements; they are the bedrock of trust and innovation.

After all, who would want to waste time with a dodgy and fake AI?😑

While data security focuses on protecting AI models and datasets from breaches, tampering, or unauthorised access (through measures like encryption, access controls, and secure APIs), data privacy ensures that the data powering these models is collected, processed, and used ethically and legally. This distinction becomes particularly critical in grounding techniques, where AI models are anchored to external knowledge bases, APIs, or real-time data sources. The more an AI model is trained on accurate, well-structured data, the fewer inaccuracies and hallucinations it produces.

What is grounding?

In simple terms, grounding connects AI models to external sources of truth - such as sensors, databases, or APIs - to ensure responses are accurate, context-aware, and reliable. Without grounding, AI risks hallucinations or relying on outdated or biased data.

For example, imagine grounding an AI customer support chatbot in a SQL database of product manuals and FAQs. The chatbot queries the database in real time to provide accurate, up-to-date answers rather than inventing responses. However, if the database connection isn’t secured with encryption or the data isn’t anonymized, the grounded AI system could expose sensitive customer information or fall victim to prompt injection attacks.

And without robust security, grounded AI systems risk exposing sensitive data or being manipulated through techniques like prompt injection or data poisoning. Without privacy safeguards, they may inadvertently violate regulations like GDPR or CCPA by misusing or retaining personal data embedded in their knowledge sources.
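To make the chatbot example concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The faq table, its columns, the sample rows and the helper names (build_demo_db, retrieve_context, build_prompt) are all hypothetical, chosen only for illustration. The query is parameterised, which guards against SQL injection through the user's text; it does not by itself stop prompt injection, which needs separate controls.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory FAQ table (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE faq (product TEXT, question TEXT, answer TEXT)")
    conn.executemany(
        "INSERT INTO faq VALUES (?, ?, ?)",
        [
            ("router-x", "How do I reset the device?", "Hold the reset button for 10 seconds."),
            ("router-x", "What is the warranty period?", "Two years from purchase."),
        ],
    )
    conn.commit()
    return conn


def retrieve_context(conn: sqlite3.Connection, product: str, keyword: str, limit: int = 3) -> list[str]:
    """Fetch matching FAQ entries with a parameterised query.

    User text is bound as a parameter and never concatenated into the SQL
    string, so it cannot change the structure of the query.
    """
    rows = conn.execute(
        "SELECT question, answer FROM faq "
        "WHERE product = ? AND question LIKE ? LIMIT ?",
        (product, f"%{keyword}%", limit),
    ).fetchall()
    return [f"Q: {question}\nA: {answer}" for question, answer in rows]


def build_prompt(user_question: str, context: list[str]) -> str:
    """Assemble a grounded prompt; the model is told to rely on the context only."""
    if not context:
        return f"No reference material was found. Say you do not know.\n\nUser: {user_question}"
    joined = "\n\n".join(context)
    return f"Answer using only the reference material below.\n\n{joined}\n\nUser: {user_question}"


conn = build_demo_db()
context = retrieve_context(conn, "router-x", "reset")
print(build_prompt("How can I reset my router?", context))
```

The prompt string would then be passed to whichever model API you use; that call is left out because it depends on the provider.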

For data architects, data engineers and AI engineers, grounding introduces unique challenges. When an AI model queries external databases, APIs, or data lakes, every interaction must be both secured (e.g., using TLS 1.3, OAuth 2.0, or zero-trust architectures) and privacy-preserving (e.g., via differential privacy, federated learning, or data anonymization). For example, if you’re grounding a large language model (LLM) in a relational database or Delta Lake, you MUST ensure that:

Security: The connection is encrypted, access is role-based (e.g., RBAC in Snowflake or Azure Synapse), and queries are logged for audits.
Privacy: The underlying data is scrambled, anonymised or pseudonymised, and the model only retrieves data it’s authorised to use - aligning with the principle of least privilege. Tools like Python’s Faker (which generates synthetic replacement values) or the dynamic data masking features of SQL databases can help keep PII out of responses. Your grounding framework MUST also enforce strict data access policies.
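As an illustration of the privacy point, the sketch below prepares a retrieved row before it reaches the model: it keeps only an allow-list of columns, replaces the customer identifier with a keyed hash, and masks email addresses in free text. It uses only the Python standard library rather than Faker or database-side masking. The column names, the placeholder key and the helper names are assumptions made for the example, and the email pattern is deliberately simple rather than exhaustive.

```python
import hashlib
import hmac
import re

# Hypothetical placeholder; in practice the key would come from a secret store.
SECRET_KEY = b"replace-with-a-key-from-your-secret-store"

# Hypothetical allow-list: only these columns may ever reach the model.
ALLOWED_COLUMNS = {"customer_id", "product", "note"}

# A simple pattern for illustration; it will not catch every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a stable keyed hash (HMAC-SHA256, truncated)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}"


def mask_emails(text: str) -> str:
    """Replace anything that looks like an email address with a marker."""
    return EMAIL_RE.sub("[email removed]", text)


def prepare_row(row: dict) -> dict:
    """Drop disallowed columns, pseudonymise the id and mask emails in the note."""
    safe = {key: value for key, value in row.items() if key in ALLOWED_COLUMNS}
    if "customer_id" in safe:
        safe["customer_id"] = pseudonymise(str(safe["customer_id"]))
    if "note" in safe:
        safe["note"] = mask_emails(str(safe["note"]))
    return safe


row = {
    "customer_id": 42,
    "email": "jane@example.com",
    "product": "router-x",
    "note": "Contact jane@example.com about the refund",
}
print(prepare_row(row))
```

Because the hash is keyed and deterministic, the same customer maps to the same pseudonym across queries, which keeps conversations coherent without exposing the real identifier.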

The Future: Trustworthy Grounding in AI

The future of AI hinges on trustworthy grounding - where models don’t just perform well but also respect data sovereignty and user consent. As you design AI systems that interact with databases, data lakes, or Lakehouses, prioritise:

Privacy-by-Design: Embed consent checks into API calls and database connections, and enforce data minimisation so that only authorised data is retrieved.

Security-by-Default: Encrypt all grounding data assets, including the vector embeddings in your vector databases, and enforce strict access controls.

If you’re experimenting with grounding, start by auditing your data sources:

Are they secure?
Are they privacy-compliant?

The answers will define whether your use of AI with your database is not just smart, but also trustworthy.

Thursday, 1 January 2026

Exploring C4 Models with Structurizr DSL, VSCode, and Diagramming Tools

Introduction

As a Data Architect, I find that creating clear and effective diagrams is crucial for communicating and documenting software and data architectures. The C4 model, with its focus on abstraction-first design (a principle I firmly believe is the backbone of software engineering), immediately caught my interest. To explore this further, I recently began experimenting with C4 modeling using Structurizr DSL (DSL = Domain Specific Language), VSCode, and popular diagramming tools like PlantUML and Mermaid. I used Cairo and Graphviz in the past, but these newer tools require less tinkering. Here’s a look at my journey and the insights I gained along the way while trying the diagram-as-code approach.

Why C4 Models?

The C4 model is a powerful way to describe software systems at four distinct levels of abstraction—Context, Containers, Components, and Code—often referred to as the "4 Cs." Its simplicity, scalability, and developer-friendly approach make it a perfect fit for both new (greenfield) and existing (brownfield) projects.

Since I prefer to avoid cloud-based tools, for a richer experience and more control, I initially set up a local environment using VSCode and Docker on my trusty old but fast Ubuntu laptop. This way, I can create clear, maintainable diagrams while keeping everything offline and efficient. Looking at it again, I decided that even Docker was overkill and that VSCode alone is enough to code and diagram.

My Setup

I took a quick look at the Structurizr DSL Python wrapper, but I also skipped it—I wanted to dive straight into the native DSL syntax and see my diagrams render with minimal overhead. After all, treating diagrams as code means I can version-control them like any other project, keeping everything clean and reproducible.

While I could have spun up Structurizr Lite in a Docker container (because who doesn’t love local, self-hosted solutions?), I went lighter—just VSCode extensions to get the job done. My philosophy? Minimum viable effort for maximum results. No unnecessary layers, no cloud dependencies, just code and diagrams, the way it should be.

Diagrams written as code also integrate seamlessly with wiki platforms (like Confluence, Notion, or GitLab/GitHub Wikis) and Git repositories, allowing you to embed dynamic, version-controlled diagrams directly in your documentation.

Tools in Action

Structurizr DSL: Writing diagrams as code in the DSL in VSCode; for better previews, you can run the Structurizr server on localhost.
VSCode: With extensions for PlantUML and Mermaid, I could preview diagrams instantly in the editor.
PlantUML & Mermaid: Both tools integrated seamlessly with VSCode via extensions, though I found Mermaid’s syntax more intuitive for quick sketches and wiki integration. Mermaid has its own markup.
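For readers who have not seen diagram-as-code markup, the sketch below shows roughly what Mermaid flowchart text looks like. It is not part of my workflow, where the markup is written by hand in VSCode; here a small hypothetical Python helper (to_mermaid) emits the text from a dictionary of nodes and a list of edges, and the system names are invented for the example.

```python
def to_mermaid(nodes: dict[str, str], edges: list[tuple[str, str, str]]) -> str:
    """Render nodes and labelled edges as Mermaid flowchart markup.

    nodes maps a short identifier to a display label; each edge is a
    (source id, target id, label) tuple. Labels are assumed to be plain
    text without quotes or pipe characters.
    """
    lines = ["flowchart LR"]
    for node_id, label in nodes.items():
        lines.append(f'    {node_id}["{label}"]')
    for source, target, label in edges:
        lines.append(f"    {source} -->|{label}| {target}")
    return "\n".join(lines)


# A made-up system context: one user, one system, one external dependency.
nodes = {
    "analyst": "Data Analyst",
    "warehouse": "Data Warehouse",
    "crm": "CRM System",
}
edges = [
    ("analyst", "warehouse", "Runs reports"),
    ("crm", "warehouse", "Sends customer data"),
]
print(to_mermaid(nodes, edges))
```

The printed text can be pasted into a Mermaid preview or a wiki page that renders Mermaid, and because it is plain text it diffs cleanly in Git.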

Outcomes

I successfully created Context, Container, and Component diagrams for a sample imaginary project. The ability to generate diagrams locally ensured full control and flexibility, with no SaaS involved. Here are two examples of what I built:

Figure 1: Output from the Structurizr server running on localhost:8080 in Docker, with code on the left generating the C4 model diagram on the right

Figure 2: Output from the Mermaid VSCode extension, showing Mermaid code on the left generating the diagram on the right

Final Thoughts

I find that the C4 model and tools like PlantUML and Mermaid are game-changers for architecture documentation: they shift the process from static, manual diagrams to code-driven, version-controlled clarity. By leveraging Structurizr DSL in VSCode and pairing it with Mermaid/PlantUML, I’ve crafted a workflow that’s both flexible and precise, giving me full control over how my systems are visualized.

There’s something deeply satisfying about coding your diagrams—no more wrestling with drag-and-drop tools or misaligned Bézier curves. Just clean, maintainable DSL and instant visual feedback. I’m officially done with joining rectangles by hand; from now on, it’s code all the way.

Sunday, 23 February 2025

Building a Real-Time Weather App with Streamlit and Open-Meteo

See the app here: https://data-exchange.streamlit.app/

I recently embarked on a project to build a real-time weather application, and I wanted to share my experience using Streamlit and Open-Meteo. The goal was to create a dynamic web app that provides users with up-to-date weather information and 10-day weather forecasts, all while leveraging the convenience of cloud-based development.

Streamlit: Rapid Web App Development in the Browser

One of the most compelling aspects of Streamlit is its ability to facilitate rapid web application development directly within the browser. For this project, I utilized GitHub Codespaces, which provided a seamless development environment. This eliminated the need for complex local setups and allowed me to focus solely on coding.

Key Advantages of Using GitHub Codespaces:

Browser-Based Workflow: All development activities were performed within a web browser, streamlining the process.
Dependency Management: Installing necessary Python packages was straightforward using pip install.
Version Control: Integrating with Git enabled efficient version control with git commit and push commands.

Data Acquisition with Open-Meteo

To obtain accurate and current weather data, I employed the Open-Meteo API. This API offers a comprehensive set of weather parameters, allowing for detailed data visualization.

Visualizing Weather Data with Streamlit's Graph Capabilities

Streamlit's built-in graph visualization tools proved to be highly effective. Creating dynamic charts to represent weather data was quick and efficient. The clarity and responsiveness of these visualizations significantly enhanced the user experience.

Technical Implementation:

The application was developed using Python, leveraging Streamlit for the front-end and the requests library to interact with the Open-Meteo API. The workflow involved:

Fetching weather data from the Open-Meteo API.
Processing the data to extract relevant information.
Utilizing Streamlit's charting functions to create graphical representations.
Deploying the application via the Streamlit sharing platform.
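The first three steps of the workflow above can be sketched as follows. This is an illustration under stated assumptions, not the app's actual source: the function names, the choice of daily variables and the coordinates are mine, and the fetch uses the standard-library urllib in place of the requests library so that the block has no third-party dependency at import time. The render_chart function imports pandas and Streamlit lazily and only makes sense inside a Streamlit script.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.open-meteo.com/v1/forecast"


def build_forecast_url(latitude: float, longitude: float, days: int = 10) -> str:
    """Build an Open-Meteo forecast URL for daily max/min temperature."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "daily": "temperature_2m_max,temperature_2m_min",
        "forecast_days": days,
        "timezone": "auto",
    }
    # Keep the comma in the variable list unescaped for readability.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


def fetch_forecast(url: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON payload (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def daily_to_rows(payload: dict) -> list[dict]:
    """Turn the column-oriented 'daily' block into one dict per day."""
    daily = payload["daily"]
    return [
        {"date": day, "t_max": t_max, "t_min": t_min}
        for day, t_max, t_min in zip(
            daily["time"], daily["temperature_2m_max"], daily["temperature_2m_min"]
        )
    ]


def render_chart(rows: list[dict]) -> None:
    """Draw a line chart; intended to be called from a Streamlit script."""
    import pandas as pd  # imported lazily so the helpers above work without them
    import streamlit as st

    frame = pd.DataFrame(rows).set_index("date")
    st.line_chart(frame)
```

In a Streamlit script, the pieces would be chained as render_chart(daily_to_rows(fetch_forecast(build_forecast_url(51.5, -0.12)))), with the coordinates replaced by the location the user selects.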

Observations:

Streamlit's simplicity and ease of use allowed for rapid prototyping and development.
GitHub Codespaces provided a consistent and reliable development environment.
The Open-Meteo API provided accurate data.
Streamlit sharing made deployment very easy.

Conclusion:

This project demonstrated the power of Streamlit and Open-Meteo for building data-driven web applications. The ability to develop entirely within a browser, combined with powerful visualisation tools, made the development process efficient and enjoyable.

You can view the final app here: https://data-exchange.streamlit.app/

Saturday, 8 February 2025

GitHub Codespaces: A Fast-Track to Development with Minimal Setup

Do you like coding but hate the scaffolding and prep work?

As a developer, I often spend a considerable amount of time setting up development environments and project scaffolding before I write a single line of code. Configuring dependencies, installing tools, and making sure everything runs smoothly across different machines can be tedious. If you find this prep work time-consuming and constraining, then...

Enter GitHub Codespaces

GitHub Codespaces is a cloud-based development environment that lets you start coding instantly in your browser, without the hassle of setting up a local machine!

Whether you’re working on an open-source project, collaborating with a team, or quickly prototyping an idea, Codespaces provides a streamlined workflow with minimal scaffolding.

Why GitHub Codespaces?

Instant Development Environments
With a few clicks, you get a fully configured development environment in the cloud. No need to install dependencies manually—just launch a Codespace, and it’s ready to go.
Pre-configured for Your Project
Codespaces can use Dev Containers (.devcontainer.json) to define dependencies, extensions, and runtime settings. This means every team member gets an identical setup, reducing "works on my machine" issues.
Seamless GitHub Integration
Since Codespaces runs directly on GitHub, pushing, pulling, and collaborating on repositories is effortless. No need to clone and configure repositories locally.
Access from Anywhere
You can code from a browser, VSCode desktop, or even an iPad, making it an excellent option for developers who switch devices frequently.
Powerful Compute Resources
Codespaces provides scalable cloud infrastructure, so even resource-intensive projects can run smoothly without overloading your local machine.

A Real-World Example

Imagine you’re starting a new Streamlit project for Streamlit Community Cloud. Normally, you’d:

Install Streamlit and other packages
Set up a virtual environment
Configure dependencies
Ensure all team members have the same setup

With GitHub Codespaces, you can define everything in a requirements.txt and .devcontainer.json file and launch your environment in seconds. No more worrying about mismatched Python versions or missing dependencies—just open a browser and start coding.
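To give a feel for what such a definition contains, the sketch below builds a dev container configuration as a Python dictionary and prints it as JSON, ready to be saved as .devcontainer.json. The property names (image, postCreateCommand, forwardPorts, customizations) come from the Dev Containers specification; the project name, the image tag and the extension list are assumptions for the example and should be checked against what your project needs. Port 8501 is Streamlit's default.

```python
import json


def devcontainer_config(python_version: str = "3.11", port: int = 8501) -> dict:
    """Return a minimal dev container definition for a Streamlit project."""
    return {
        "name": "streamlit-app",  # hypothetical project name
        # Assumed image tag; pick the tag that matches your Python version.
        "image": f"mcr.microsoft.com/devcontainers/python:{python_version}",
        # Install the project's dependencies once the container is created.
        "postCreateCommand": "pip install -r requirements.txt",
        # Forward Streamlit's default port so the app opens in the browser.
        "forwardPorts": [port],
        "customizations": {"vscode": {"extensions": ["ms-python.python"]}},
    }


print(json.dumps(devcontainer_config(), indent=2))
```

The JSON file itself can of course be written by hand; generating it from Python is only a convenient way to show its shape here.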

See below how I obtained this coding environment to build a weather Streamlit app quickly and for FREE using Streamlit Community Cloud.

Everything is on one browser page: GitHub, the browser edition of VSCode, and access to a free machine on Streamlit Community Cloud, with GitHub Codespaces for development.

To see the above app visit https://click-weather.streamlit.app/

Final Thoughts

I think GitHub Codespaces is a game-changer for modern development. It eliminates the friction of setting up environments, making collaboration effortless and speeding up development cycles. If you haven’t tried it yet, spin up a Codespace for your next project; you might never go back to traditional setups on your laptop.

There is another tool I want to look at which does all the scaffolding automatically with AI. It is the IDE called 'Windsurf' from Codeium, but that's another blog post.

Friday, 27 September 2024

Understanding the Contextual Awareness of data

Image attribution: Thomas Nordwest, CC BY-SA 4.0, via Wikimedia Commons

What sources does this data come from?

How was this data collected in the first place?

Do you clearly understand the structure and format of your data?

What specific business objectives does this data help you achieve?

Are there relationships between this dataset and others, and if so, which ones?

What limitations or biases may exist in this data?

How frequently is this data updated or refreshed, and is it stale?

What context is important to consider when trying to understand this data?

What questions are you trying to answer with this data?

The above questions can help us achieve data zen, or the contextual awareness we all seek.

Contextual Awareness of data refers to the ability to understand and interpret data within its relevant situation, including the source, structure, and intended use. That situation is the 'world of interest' the data tries to describe. Contextual awareness goes beyond merely collecting data; it encompasses a comprehensive understanding of what the data represents, where it originates, and how it relates to other datasets. This awareness is crucial for organisations to derive meaningful insights, as data without context can lead to misinterpretation and ineffective decision-making.

The importance of contextual awareness cannot be overstated. Without a clear understanding of the data’s context, organisations risk making decisions based on incomplete or misleading information. For instance, data collected from disparate sources may appear accurate in isolation but may contradict one another when analysed together. By fostering contextual awareness, organisations can ensure they ask the right questions and identify relevant insights, ultimately driving strategic initiatives and operational efficiency.

Furthermore, contextual awareness enhances data quality assessment and improves the overall data governance framework. When organisations know the context of their data, that is, its semantics or meaning, they can better assess its reliability and relevance. This clarity allows for more informed decision-making processes, ensuring that data insights are actionable and aligned with business objectives. In today's data-driven landscape, cultivating contextual awareness is not just useful; it is essential for achieving a competitive edge and fostering a data-driven culture within the organisation.

Several tools and practices, such as conceptual data models, data profiling, talking to colleagues, data catalogs and ontologies, can all help you in the journey of understanding your data.
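Of these, data profiling is the easiest to try straight away. The sketch below is a minimal, standard-library-only illustration that answers two of the questions from the top of this post: how complete each column is, and how stale the dataset is. The function name, the sample rows and the column names are hypothetical, and a real profiling tool reports far more than this.

```python
from datetime import date


def profile(rows: list[dict], date_column: str, today: date) -> dict:
    """Report missing and distinct counts per column, plus data freshness.

    Empty strings and None are both treated as missing. The date column is
    assumed to hold ISO-formatted dates such as '2024-09-20'.
    """
    columns = sorted({key for row in rows for key in row})
    report = {"row_count": len(rows), "columns": {}}
    for column in columns:
        values = [row.get(column) for row in rows]
        present = [value for value in values if value not in (None, "")]
        report["columns"][column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    dates = [date.fromisoformat(row[date_column]) for row in rows if row.get(date_column)]
    report["days_since_last_update"] = (today - max(dates)).days if dates else None
    return report


# Made-up sample rows for illustration only.
rows = [
    {"id": 1, "country": "UK", "updated": "2024-09-01"},
    {"id": 2, "country": "", "updated": "2024-09-20"},
    {"id": 3, "country": "UK", "updated": "2024-09-10"},
]
print(profile(rows, "updated", date(2024, 9, 27)))
```

A missing-value count or a large staleness figure does not explain itself, which is where the other practices in the list, especially talking to colleagues, come in.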

So ask loads of questions the next time you see that data set!
