Today we would like to switch gears a bit and get our feet wet with another Big Data combo: Python and Impala. The reason is that Hive has some limitations that might prove a deal-breaker for your specific solution, and Impala might be a better route to take instead.
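To give a flavor of what this combo looks like, here is a minimal sketch of querying Impala from Python using the impyla client library. The host name and table are hypothetical placeholders, not values from any real deployment:

```python
# Minimal sketch: querying Impala from Python with the impyla
# client library (pip install impyla). The host and table below
# are hypothetical placeholders.
from impala.dbapi import connect

# 21050 is Impala's default port for external clients
conn = connect(host='impala-host.example.com', port=21050)
cursor = conn.cursor()

# Run a query against a hypothetical table and fetch the results
cursor.execute('SELECT market, COUNT(*) FROM property_listings GROUP BY market')
for market, cnt in cursor.fetchall():
    print(market, cnt)

cursor.close()
conn.close()
```

Because impyla follows the standard Python DB API, code written this way stays familiar to anyone who has worked with other SQL databases from Python.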
Over the past few years, we have been hearing more and more about the wealth of data we humans generate. This has progressively grown into the idea that if you have enough of this data and can piece together some meaning from it, you can achieve everything from predicting the future to curing all human ills.
It goes without saying that over the last few decades the vast majority of institutions, companies, and firms have had to face the Big Data reality, which created an urgent need for processing platforms capable of storing and analyzing these vast amounts of data. This is why Hadoop, and later [Spark](/spark-consulting/), came into the picture in the late 2000s.
On one of our client's projects, we were confronted with high-volume data streams and a great number of reports for the real estate market. More specifically, the client faced a tough scalability problem: the property market reports generated from such a big data set took up to 3 hours to produce (for just 100 markets). Worse, this time kept increasing, as a few million new records were added to the data set each day. To resolve the problem, the client decided to invest in a new system architecture.