DB4ALL: reformatting the mess that the Internet has become
I always try very hard to keep my posts within the main topic of this blog, namely computers in the context of building automation and simulation. Occasionally I fail, as with today’s post.
I’d like to tell you about a software company co-founded by a friend and fellow Toastmaster of mine, David Portabella. The company’s name is DB4ALL, and they specialize in software for retrieving structured data from the web.
(Disclaimer: I am not affiliated with this company. I have had the opportunity to play with their tool, which I sincerely think is a high-quality one, but I derive no remuneration from writing this piece.)
They’ve developed ‘Webminer’, a Java library for extracting data in a structured manner from any website. Suppose, for instance, that you need a relational database with the data from the CIA World Factbook. That data, though in the public domain, is not available as a relational database; you can only get at it by clicking around on the CIA website. But with ‘Webminer’, the smart guys at DB4ALL can write a custom application that knows how to navigate such a website, ‘scrape’ and ‘normalize’ its data, and save it to a relational database for you.
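To give a rough idea of what ‘scraping’ and ‘normalizing’ mean in practice, here is a minimal sketch that uses only the standard Java library. It is not Webminer’s API, which I won’t try to reproduce here. The class name, the `country` table, the HTML layout and the sample values are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: this is NOT Webminer's API.
class ScrapeSketch {

    // One normalized row, ready to be stored in a relational table.
    record CountryRow(String name, long population) {}

    // Matches simple rows of the form <tr><td>Name</td><td>1,234</td></tr>.
    // Real pages are messier; a real scraper would use a proper HTML parser.
    private static final Pattern ROW = Pattern.compile(
        "<tr>\\s*<td>([^<]+)</td>\\s*<td>([^<]+)</td>\\s*</tr>");

    // Scrape: pull the raw cells out of the page.
    // Normalize: trim names and turn "1,234,567" into the number 1234567.
    static List<CountryRow> scrape(String html) {
        List<CountryRow> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            String name = m.group(1).trim();
            String digits = m.group(2).replaceAll("[^0-9]", "");
            if (digits.isEmpty()) {
                continue; // skip rows without a numeric value
            }
            rows.add(new CountryRow(name, Long.parseLong(digits)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample page, not real Factbook data.
        String html = "<table>"
            + "<tr><td> Exampleland </td><td>1,234,567</td></tr>"
            + "<tr><td>Samplestan</td><td>89,000</td></tr>"
            + "</table>";
        for (CountryRow row : scrape(html)) {
            // Printed for illustration; a real application would use a
            // JDBC PreparedStatement instead of building SQL strings.
            System.out.printf(
                "INSERT INTO country (name, population) VALUES ('%s', %d);%n",
                row.name(), row.population());
        }
    }
}
```

The point of the normalization step is that the page shows the population as text with thousands separators, while the database wants a plain number it can sort and sum.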
On DB4ALL’s website you will find references to the two most popular datasets that they’ve mined: the above-mentioned CIA World Factbook, and the SourceForge database of open-source projects. Having such data in relational form is invaluable for any researcher or marketing analyst. Suppose, for instance, that you want hard data on the popularity of different programming languages over time in open-source projects. Well, with these datasets you have all you need to get started.
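As a sketch of what such an analysis could look like once the data is in tabular form, the snippet below counts projects per registration year and language. I am assuming a hypothetical record layout (project name, language, year registered); the actual columns of the SourceForge dataset may well differ, and the sample projects are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: the Project fields are assumptions,
// not the real schema of DB4ALL's SourceForge dataset.
class LanguageTrendSketch {

    record Project(String name, String language, int yearRegistered) {}

    // Roughly equivalent to:
    //   SELECT year_registered, language, COUNT(*)
    //   FROM project GROUP BY year_registered, language;
    static Map<Integer, Map<String, Long>> countByYearAndLanguage(List<Project> projects) {
        Map<Integer, Map<String, Long>> counts = new TreeMap<>();
        for (Project p : projects) {
            counts.computeIfAbsent(p.yearRegistered(), year -> new TreeMap<>())
                  .merge(p.language(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample rows, for illustration only.
        List<Project> projects = List.of(
            new Project("alpha", "Java", 2001),
            new Project("beta", "C", 2001),
            new Project("gamma", "Java", 2001),
            new Project("delta", "Python", 2002));
        countByYearAndLanguage(projects).forEach((year, byLanguage) ->
            System.out.println(year + " " + byLanguage));
    }
}
```

With the data in a real relational database you would of course run the `GROUP BY` query directly; the Java version just shows that the question reduces to a simple aggregation.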
This, for instance, is a screenshot of the SourceForge dataset opened in Excel: 
All in all, if you need publicly available data from a website in relational database form, you should definitely consider using DB4ALL’s services.