Tuesday 9 January 2024
Friday 24 November 2023
In the realm of Artificial Intelligence (AI), understanding and retaining context is pivotal for decision-making and comprehension. Vector databases are the foundational pillars for encapsulating your own data so that it can be used in conjunction with AI and LLMs. They empower these systems to absorb and retain intricate contextual information.
Sunday 16 April 2023
If you're a software developer, you know how important it is to have a development environment that is flexible, efficient, and easy to use. PyCharm is a popular IDE (Integrated Development Environment) for Python developers, but there are other options out there that may suit your needs better. One such option is Visual Studio Code, or VS Code for short.
After using PyCharm for a while, I decided to give VS Code a try, and I was pleasantly surprised by one of its features: the remote container development extension. This extension allows you to develop your code in containers, with no footprint on your local machine at all. This means that you can have a truly ephemeral solution, enabling abstraction to the maximum.
So, how does it work? First, you need to create two files: a Dockerfile and a devcontainer.json file. Both files should live in a hidden .devcontainer folder at the root of your GitHub project.
The Dockerfile is used to build the container image that will be used for development. Here's a sample Dockerfile that installs Python3, sudo, and SQLite3:
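# Base image is an assumption: the sample lists no FROM line, and apt-get implies a Debian or Ubuntu base
FROM ubuntu:22.04
RUN apt-get update -y
RUN apt-get install -y python3
RUN apt-get install -y sudo
RUN apt-get install -y sqlite3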
The devcontainer.json file is used to configure the development environment in the container. Here's a sample devcontainer.json file that sets the workspace folder to "/workspaces/alpha", installs the "ms-python.python" extension, and forwards port 8000:
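A minimal sketch of such a file, reconstructed from the description above rather than copied from the original project, might look like this. The workspace folder, the extension and the forwarded port come from the text; the "name" value and the "build" section pointing at the Dockerfile are assumptions added for illustration.

```jsonc
// .devcontainer/devcontainer.json (illustrative sketch, not the original file)
{
  // Hypothetical display name, borrowed from the workspace folder name
  "name": "alpha",
  // Assumes the Dockerfile sits next to this file inside .devcontainer
  "build": { "dockerfile": "Dockerfile" },
  // Folder that VS Code opens inside the container
  "workspaceFolder": "/workspaces/alpha",
  // Extensions installed into the container's VS Code server
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python"]
    }
  },
  // Ports made reachable from the local machine
  "forwardPorts": [8000]
}
```

The comments are fine to keep, because VS Code reads devcontainer.json as JSON with comments.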
Once you have these files ready, you can clone your GitHub code down to a Visual Studio Code container volume. Here's how to do it:
- Start Visual Studio Code
- Make sure you have the "Remote Development" extension installed and enabled
- Go to the "Remote Explorer" extension from the button menu
- Click "Clone Repository in Container Volume" at the bottom left
- In the Command Palette, choose "Clone a repository from GitHub in a Container Volume" and pick your GitHub repo.
That's it! You are now tracking your code inside a container volume. The container is built from a Dockerfile that is also tracked on GitHub, together with all the environment-specific extensions you require for development.
The VS Code remote container development extension is a powerful tool for developers who need a flexible, efficient, and easy-to-use development environment. By using containers, you can create an ephemeral solution that allows you to abstract away the complexities of development environments and focus on your code. If you're looking for a new IDE or just want to try something different, give VS Code a try with the remote container development extension.
Wednesday 12 January 2022
"A data hub is an architectural pattern that enables the mediation, sharing, and governance of data flowing from points of production in the enterprise to points of consumption in the enterprise" Ted Friedman, datanami.com
Aren't relational databases, data marts, data warehouses and, more recently, data lakes enough? Why is there a need to come up with yet another strategy and paradigm for database management?
To begin answering the above questions, I suggest we start by looking at the history of data management and figure out how data architecture arrived at a new architectural pattern like the data hub. After all, history is important. As a famous quote from Martin Luther King Jr. says, "We are not makers of history. We are made by history."
A few decades ago, businesses began using relational databases and data warehouses to store their interests in a consistent and coherent record. The relational architecture still keeps the clocks ticking with its well-understood architectural structures and relational data models. It is a sound and consistent architectural pattern based on mathematical theory, and it will continue serving data workloads. The relational architecture brilliantly serves the very specific use case of transactional workloads, where the data semantics are defined in advance, before any data is stored in any system. If implemented correctly, the relational model can become a hub of information that is centralised and easy to query. It is hard to see how the relational architecture could be the reason for a paradigm shift to something like a data hub. Most likely it is something else. Could it be cloud computing?
When the cloud came, it changed everything. The cloud brought along an unfathomable proliferation of apps and an incredible amount of raw and unorganised data. With this outlandish amount of disorganised data in the pipes, the suitability of the relational architecture for data storage had to be re-examined. Faced with a data deluge, the relational architecture couldn't scale quickly and couldn't serve the analytical workloads and the needs of the business in a reasonable time. Put simply, there was no time to understand and model data. The sheer volume of unorganised data coming from the cloud, structured and unstructured, at high speed, propelled engineers to look for a new architectural pattern.
In a data lake, structured and unstructured data is stored raw, with no questions asked. Data is no longer organised or kept in well-understood data models, and it can be stored indefinitely and in abundance. Moreover, very conveniently, the work of understanding the data and creating a data model is deferred to the future, an approach known as schema-on-read. We have to admit, the data lake is the new monolith where data is only stored, a mega data dump yard indeed. This new architectural pattern also brought with it massively parallel processing (MPP) data platforms, tools and disciplines, such as machine learning, which became the standard methods for extracting business insights from the absurd amounts of data found in a data lake. Unfortunately, the unaccounted amounts of unknown data living in a data lake didn't help us understand data better and made the life of engineers even more difficult. Does a data lake have any redundant or bad data? Are there complex data silos living in a data lake? These are still hard questions to answer, and the chaotic data lakes looked like they were missing a mediator.
Could the mediator be a "data hub"? It is an architectural pattern based on the hub and spoke architecture. A data hub, which itself is another database system, integrates and stores critical and important data and metadata for mediation, from diverse and complex transactional and analytical workloads and data sources. Once the data is stored, the data hub becomes the tool to "harmonise" and "enrich" data and then radiate it to the AI, Machine Learning and other enterprise insights and reporting systems, via its spokes.
Saturday 17 October 2020
Oracle Apex 20.2 is out and has a very interesting new feature, REST Data Source Synchronisation
Why is the REST Data Source Synchronisation feature interesting?
Oracle Apex REST Data Source Synchronisation is exciting because it lets you query REST endpoints on the internet, on a schedule or on demand, and saves the results automatically in database tables.
I think this feature will suit slow-changing data accessible through REST APIs very well. That is, if the data behind a REST endpoint is known to change only, say, a few times a day, why should we call the REST endpoint via HTTP every time we want to display data on an Apex page? Why would one want to render a page with data fetched over HTTP if that data changes only once a day? Why should we cause network traffic and keep machines busy for data which does not change often? Or maybe, by requirement, you only need to query a REST endpoint once a day and store the result somewhere for data warehousing.
Wouldn't it be better to store the data in a database table and render it from there every time a page is viewed?
This is exactly what the REST Data Source Synchronisation does. It queries the REST API endpoint and saves the JSON response as data in a database table on a schedule of your choice or on demand.
For my experiment, I used a free public REST endpoint from the London TfL API, which holds data on TfL transport disruptions, and I configured this endpoint to synchronise with my database table every day at 5am.
I even created the Oracle Apex REST Data Source inside the apex.oracle.com platform. I used the key provided by the TfL API Dev platform to make the call from there to the TfL REST endpoint, and I managed to sync the data once a day and display it on an Oracle Apex Faceted Search page and some charts.
I was able to do all this with zero coding, just pointing the Oracle Apex REST Data Source I created for the TfL API to a table and scheduling the sync to happen once a day at 5am.
To see the working app, go to this link: https://apex.oracle.com/pls/apex/databasesystems/r/tfl-dashboard/home
Screenshots of the app below