Database Smart Web Tool

1. Abstract

Database Smart Web Tool (DABS) is a web tool that can compare, visualize and recommend various database management systems (DBMS) according to user (e.g. a developer) preferences and specific usages. The tool supports filtering databases by name, supported languages, implementation languages, operating systems, license types and project types.

It’s a smart platform that provides an interactive way to find the most suitable database solutions for any project development. DABS can facilitate, in an easy and accessible way, the process through which the users can discover information regarding databases. The user-friendly web interface offers a good experience for developers, but also for non-technical people.

2. Software requirements

2.1 Non-functional requirements

No.	Description
NFR. 1	Scalability - The system shall be able to handle a large volume of databases that keeps increasing.
NFR. 2	Security - All data inside the system or its part shall be protected against malware attacks or unauthorized access.
NFR. 3	Availability - The system shall be accessible for users 100% of the time.
NFR. 4	Performance - The system shall be able to determine how fast an application responds to commands (Speed and Response Time). When a user searches for a certain database or uses certain filters, the results must load within 0.2 seconds. When the "Compare" button is pressed, the result screen (pop-up) must load within 0.1 seconds.
NFR. 5	Maintainability - The system must be capable of being maintained cost-effectively over its expected lifetime.
NFR. 6	Usability & Accessibility - The system should be able to be used by as many people as possible.
NFR. 7	Capacity - The system should be to deliver sufficient functionality required by the end users.

2.2 Functional requirements

No.	Description
Start page: filters & recommendations
FR. 1	The user should be able to enter data in the search bar.
FR. 2	The user should be able to search for a certain database by name.
FR. 3	The user should be able to filter the search based on the desired database characteristics.
FR. 4	The database filters must display information about: specific usages, paradigms, query languages, platforms and running environments and specific contexts, country of origin, open source status and implementation language.
FR. 5	In the Recommendation area, the system must display the most popular databases used or searched.
Search results, database comparison
FR. 6	The user should be able to select a maximum of 3 databases.
FR. 7	The user should be able to press the "Compare" button, for comparing the databases selected to the previous step.
FR. 8	A pop-up will be opened, showing a table with the comparison between the databases (at least 2).
FR. 9	The differences between the databases’ features will be coloured red.
View databases
FR. 10	The user is able to press on a specific database name from the table.
FR. 10	The system will automatically highlight the database name (with the "bold" option) when the user is pressing on it.
FR. 10	The system will display a pop-up with information regarding the selected database (a table, a text and a website reference).
FR. 10	The user is able to press on the website reference (ex: "Wikipedia") and will be taken to another website to read more details about the selected database.

3. Application design and interaction

The design of the application was built using Miro[8], an online whiteboard for team collaboration, and the latest version can be accessed here (beware that the actual implementation might differ from designs).

A brief demo of the application can be seen in this video.

3.1 Start page: filters & recommendations

The main key of the Start Page will be the search bar and the filters that help the users choose the characteristics they are looking for.

Besides the name input, the user can select information about: implementation programming language (C, C++, etc.), programming language supported (Java, Python, etc.), operating system (Linux, Windows, etc.), project type (Commercial, Open Source, etc.), and license (Apache v2, BSD, etc.).

On the Most popular section of the Start Page will be displayed four of the most searched or used databases. Each one will be located in a different card, with some general information, so that the people that are not sure what is more suitable for their project will have the recommendation of what they can use.

Implementation outcome:

Start page: filters & recommendations implementation — Actual implementation for Start page: filters & recommendations

3.2 Search results, databases comparison

The user can either search after a keyword or choose some features from the dropdowns and toggles presented below the search bar:

After the user presses the Search button, a table will be shown with the results regarding the database name, programming languages used for implementation, programming languages supported, licenses, operating systems and project types.

Implementation outcome:

The user has the possibility to select a maximum of 3 databases for comparing them, by pressing the "Compare" button, after which a pop-up appears. In this pop-up, the selected databases are presented in a table format with differences between their features coloured with red.

This feature is post-MVP.

3.3 View database details

The user is able to press on a database name and a pop-up will be shown with the details of the selected database. In this pop-up, there’s also a table composed of 2 columns (the first column indicates the feature name and the second column is the particular feature description). In the right part of the pop-up is a text with some details about the database selected by the user and at the bottom of the pop-up there's a reference to various pages providing more information about the database (Homepage, Wikipedia, Source code - depending on what data we have available in the backend).

Implementation outcome:

View database details implementation — Actual implementation for View database details

4. Dataset

4.1 Database management system

The relational database management system chosen is PostgreSQL[1], also known as Postgres, due to our familiarity with it over the years. It is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance.

PostgreSQL features transactions with Atomicity, Consistency, Isolation, Durability (ACID) properties, automatically updatable views, materialized views, triggers, foreign keys, and stored procedures. It is designed to handle a range of workloads, from single machines to data warehouses or Web services with many concurrent users. It is the default database for macOS Server and is also available for Windows, Linux, FreeBSD, and OpenBSD.

4.2 Provenance

The dataset was provided directly by Associate Professor of Databaseology in the Computer Science Department at Carnegie Mellon University, Andy Pavlo, who is one of the main maintainers of dbdb.io, the Database of Databases. Permission to use the dataset for this project has been granted.

The data was sent as a Django database dump archive, which contained a 257mb one-line JSON file.

4.3 Transformations

The format in which the data was received was not fit to be used by our application, so further operations were necessary in order to clean and restructure it.

Initially, we tried to format the one-line JSON file to human-friendly JSON formatting by using jq.

                cat dbdb-dump-nov14-2021.json | jq . > formatted.json

That didn’t really make much of a difference since the resulting JSON[3] file was still quite big, almost impossible to open with the usual tools (Notepad, Notepad++, Visual Studio and even vi).

Further, it was clear that the JSON format was not proper for us to understand the data structure and content. The aim is to import the data into a database so it’s easy to query by any backend. Databases go well with CSV imports, so we converted the JSON file into a CSV file using a simple Python[4] script:

                
import pandas as pd
from pathlib import Path
import json

# set path to file
p = Path(r'/content/formatted.json')

# read json
with p.open('r', encoding='utf-8') as f:
    data = json.loads(f.read())

df = pd.json_normalize(data)
# write csv
df.to_csv('/content/data.csv', index=False, encoding='utf-8')

The resulting CSV file contained about 1 million rows and 53 columns, out of which many turned out to be redundant, missing dependencies and/or not useful to us (we removed about 900.000 rows which were nginx logs from the actual dbdb.io website). Importing the CSV file to a PostgreSQL database was less straight forward than expected, but, using pgfutter and pgAdmin, it finally resulted in a new table.

4.4 Structure

The table structure after importing the CSV consisted of a table with columns such as:

                
model, pk, fields_slug, fields_label, fields_multivalued, fields_feature, fields_value, fields_name, fields_url,
fields_created, fields_modified, fields_secret_key, fields_view_count, fields_ver, fields_system,
fields_user, fields_description, fields_citations, fields_options, fields_ip_address, fields_user_agent,
fields_recommendation, fields_score, fields_creator, fields_meta, fields_is_current, fields_comment,
fields_start_year, fields_end_year, fields_history, fields_acquired_by, fields_developer, fields_logo,
fields_countries, fields_former_names, fields_tech_docs, fields_source_url, fields_wikipedia_url,
fields_description_citations, fields_start_year_citations, fields_end_year_citations,
fields_history_citations, fields_acquired_by_citations, fields_project_types, fields_derived_from,
fields_embedded, fields_inspired_by, fields_compatible_with, fields_licenses, fields_oses,
fields_publications, fields_supported_languages, fields_written_in

Working with a single table with 100.000 rows and 53 columns can be cumbersome, so, after some data exploration in the remaining dataset, we worked out a possible improved data structure:

This new structure, however, needed to be created from scratch. The tables and relationships were created directly using pgAdmin UI, but the data from the initial table was copied to the new tables using a PHP script [5] which can be found in our GitLab repository here

5. UML diagrams

In UML, use-case diagrams [7] model the behavior of a system and help to capture the requirements of the system. Use-case diagrams describe the high-level functions and scope of a system. These diagrams also identify the interactions between the system and its actors (the users).

For our Data Smart Web Tool, we identified 3 use cases:

5.1 UC1: Search a certain database

The user accesses the site.
The user types in the Search bar the key words.
The user selects from the presented drop-down lists the features for the database they want.
The system takes the selected options (from the user) to generate the proper results.
The system displays the results into a table format.

5.2 UC2: View details about a database

The user searches for a specific database.
The system displays a list with the databases.
The user presses on a database name from the generated list by system.
The system shows a pop-up with the details regarding a certain database.
The user views more details accessing the wikipedia reference from the pop-up.
The user accesses the database website from the presented reference.

5.3 UC3: Comparing minimum 2 databases

The user searches for a specific database.
The system displays a list with the databases.
The user selects a minimum of 2 databases from the generated list by system.
The system shows a pop-up with a comparison between the selected databases.
The system marks with red the differences between the databases and shows the similarities with black.

Database Smart Web Tool

1. Abstract

2. Software requirements

2.1 Non-functional requirements

2.2 Functional requirements

3. Application design and interaction

3.1 Start page: filters & recommendations

3.2 Search results, databases comparison

3.3 View database details

4. Dataset

4.1 Database management system

4.2 Provenance

4.3 Transformations

4.4 Structure

5. UML diagrams

5.1 UC1: Search a certain database

5.2 UC2: View details about a database

5.3 UC3: Comparing minimum 2 databases

6. Business Process Model and Notation (BPMN)

7. Acknowledgements

8. References