
How to manage the huge volume of unstructured data that a smart city generates

Data is the manna of a Smart City. Without data, we cannot monitor or manage the city; we have no information about the state of its services and infrastructure; and we cannot take any measures to improve and optimize the city's resources and make it more sustainable, efficient and livable.

In general, the problem with our smart cities is not a lack of data, at least not across the board, but how to manage the sheer volume of data that IoT sensor networks, IT systems and smart city structures generate. In addition, most of the data we can collect from available sources in the city is unstructured, and when it is not properly managed it can become overwhelming to analyze. According to some estimates, between 80% and 90% of the data generated and collected by the organizations and services that are part of a city is unstructured, and its volume is growing rapidly.

What is unstructured data?

Unstructured data is information that is not ordered according to a pre-established data model or schema and therefore does not fit neatly into the rows and columns of a traditional relational database. Text and multimedia are two common types of unstructured content, as are many of the items that companies, businesses and city users generate, such as e-mail messages, videos, photos, web pages and audio files.

Unstructured data has historically been very difficult to analyze. That is why new software tools, built on AI algorithms and machine learning, are emerging that can search through vast amounts of it and turn huge volumes of information into useful, actionable insight.

Unstructured versus structured data

Let’s talk about structured data first. It is usually stored in a relational database management system (RDBMS) and is sometimes referred to as relational data. It can be easily mapped to predefined fields, for example, fields for zip codes, phone numbers and credit card numbers. Data that conforms to the RDBMS structure is easy to search, either with human-defined queries or with specific software programs.

Unstructured data, on the other hand, does not fit into such predefined data models. It does not map onto the tables of an RDBMS, and because it comes in so many formats, it is a real challenge for the conventional software of any Smart City analysis system to ingest, process and analyze. With the right tools, simple content searches can be performed on unstructured text, but it is difficult to extract its full potential precisely because of its lack of structure. As a result, many cities, companies and industries have been unable to take advantage of this type of data, even though it is an enormous source of value: it can help us better understand residents and users of city services, customers of the city’s infrastructure, the media, and the conversations on social networks through which citizens could interact with the municipal government.
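
To make the contrast concrete, here is a minimal Python sketch. The table, the resident names and the complaint text are all invented for illustration. The structured rows can be found with a simple query on a named field, while the free text has no fields at all, so we have to fall back on pattern matching and hope the pattern is right.

```python
import re
import sqlite3

# Structured: every row fits a fixed schema, so a declarative query finds it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE residents (name TEXT, zip_code TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO residents VALUES (?, ?, ?)",
    [("Ana", "08001", "555-0101"), ("Luis", "08002", "555-0102")],
)
structured_hits = conn.execute(
    "SELECT name FROM residents WHERE zip_code = ?", ("08001",)
).fetchall()

# Unstructured: free text has no zip_code field, so we guess with a pattern.
complaint = "Streetlight broken near my flat, postcode 08001. Call me on 555-0101."
zip_codes_in_text = re.findall(r"\b\d{5}\b", complaint)

print(structured_hits)    # [('Ana',)]
print(zip_codes_in_text)  # ['08001']
```

The regular expression only works here because the text happens to contain a five-digit code; a differently worded message would need a different approach, which is exactly the difficulty described above.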

What are some examples of unstructured data?

Unstructured data can be created by people or generated by machines. For example:

  • Email: The body of an email message is unstructured and cannot be analyzed by traditional analytics tools. However, email metadata gives it some structure, which is why email is sometimes considered semi-structured data. For example, all the emails received by the municipal management services of a Smart City could tell us a lot about citizen patterns and concerns if they could be properly analyzed automatically.
  • Text files: This category includes word processing documents, spreadsheets, presentations, email and log files. For example, usage records for many “Smart” city services are often stored in this manner.
  • Social networks and websites: data from social networks that city managers use to interact with citizens, such as Twitter, LinkedIn and Facebook, and websites such as Instagram, photo-sharing sites and YouTube.
  • Mobile and communications data: All text messages, phone recordings, collaboration software content, chat and instant messaging exchanged between the city’s communication channels and citizens.
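
The email case is a good illustration of semi-structured data. As a rough sketch, using only Python's standard `email` package and an invented message (the addresses and content are placeholders), the headers can be read as named fields while the body remains free text:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message, standing in for one received by a municipal mailbox.
raw = """\
From: resident@example.org
To: city-services@example.org
Subject: Broken streetlight on Main Street
Date: Mon, 06 May 2024 09:15:00 +0000

The streetlight outside number 12 has been off for three nights.
"""

msg = message_from_string(raw)

# Structured part: headers behave like named fields.
metadata = {"from": msg["From"], "subject": msg["Subject"]}
sent_at = parsedate_to_datetime(msg["Date"])

# Unstructured part: the body is just text and needs further analysis.
body = msg.get_payload()

print(metadata["subject"], sent_at.isoformat())
print(body.strip())
```

Sorting or counting messages by sender or date is straightforward from the metadata alone; understanding what the resident is actually reporting still requires text analysis of the body.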

Here are some examples of unstructured machine-generated data:

  • Technical data: data packets sent by the city’s IoT sensor system, data from space exploration systems, seismic images, atmospheric data, etc.
  • City monitoring and surveillance data: This category includes data such as reconnaissance photos and videos generated by traffic cameras or by cameras installed at other points in the city.

Extracting value from unstructured data management

The beauty of the rapid development of the technologies we can use today to improve digital services and systems in cities is that, as data management matures, unstructured data stops being merely a storage cost and moves to the epicenter of value creation for city optimization.

Data generated by all components of the business, social and urban ecosystems is growing, which is no surprise. It is the current pace of data growth that is truly staggering. In 2010, the amount of data created, consumed and stored was two zettabytes per year, according to Statista. Firms such as IDC have been predicting explosive overall data growth in the coming years: from 64.2 ZB of data in 2020 to 175 ZB in 2025. That is nearly a threefold increase in five years.

By some estimates, less than 5% of this data is used for any purpose, and Smart Cities IT teams have minimal visibility into their data and its value. So, in general, they store it forever because it’s the safest thing to do. The end result: overspending on storage and the inability to leverage the data for new use cases and value.

Simplify Database Management

To make use of unstructured data for competitive gain, it is important to develop a management strategy that satisfies the dual need for cost-effectiveness and monetization of all information. One way to do this is to divide the work on the full volume of collected data into stages, for example, as follows:

  • Collection of unstructured data that has not been managed. At this stage, the volumes of data are large and typically distributed across on-premises and cloud databases, resulting in minimal visibility and few, if any, prospects for making good use of the content and value they bring to the entire ecosystem that generated them. In many cases, data is simply sitting in storage and is not properly managed to save money or meet the needs of the various user groups that might find value in it. Without adequate visibility into the assets this data represents, it is difficult for IT and management to plan and decide how to use it to optimize the services to which the data refers.
  • Storage-centric data management. This phase is characterized by centralizing all information to reduce and optimize storage costs, using the management capabilities of the storage vendors already in place and migrating unstructured data to cloud systems that those responsible for the city’s systems can access. This step achieves some cost savings, but does not yet decrease the complexity of the stored content.
  • Implementation of analytics systems for unstructured data. Once the data are stored and organized in a coherent way, it is important to distinguish between the data generation patterns used by the devices that have generated them, mostly IoT sensors and city information elements, in order to proceed with their analysis.

Defining the data generation pattern of each device is not particularly complex, as it is mainly determined by the service for which the device is intended. Across the range of sensors that a Smart City can use, it is possible to identify two general patterns:

  1. data coming from the generation of periodic observations;
  2. data coming from event-based observation generation.

Differentiating unstructured data by IoT device type

IoT devices programmed with the periodic observation generation pattern report a measurement containing the sensed information at a frequency that city technicians or service managers can configure. For devices permanently installed somewhere in the city, this frequency will normally be fixed, but mobile devices can be configured to report after a set time interval or distance traveled (or a combination of the two). Devices using this pattern are used, for example, for monitoring the city’s environment, for monitoring traffic intensity, or for managing parks and gardens and their automatic irrigation systems.

The other type is event-based. IoT devices used in smart parking management, for example, send event-based measurements: if a car enters the monitored parking space, an information packet is sent; if the parking space remains empty, nothing is sent. Data is therefore only reported when a change in the monitored parameter is detected.
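
The difference between the two patterns can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the class names, device identifiers and readings are hypothetical, and a real device would implement this logic in its firmware rather than in a loop like this one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    device_id: str
    timestamp: int  # seconds since the start of the simulation
    value: float


class PeriodicReporter:
    """Sends a reading every `period_s` seconds, whether or not it changed."""

    def __init__(self, device_id: str, period_s: int) -> None:
        self.device_id = device_id
        self.period_s = period_s
        self._last_sent: Optional[int] = None

    def sample(self, timestamp: int, value: float) -> Optional[Observation]:
        due = self._last_sent is None or timestamp - self._last_sent >= self.period_s
        if not due:
            return None
        self._last_sent = timestamp
        return Observation(self.device_id, timestamp, value)


class EventReporter:
    """Sends a reading only when the monitored value changes."""

    def __init__(self, device_id: str, initial_value: float) -> None:
        self.device_id = device_id
        self._last_value = initial_value

    def sample(self, timestamp: int, value: float) -> Optional[Observation]:
        if value == self._last_value:
            return None
        self._last_value = value
        return Observation(self.device_id, timestamp, value)


def run(reporter, readings):
    """Feed (timestamp, value) pairs to a reporter and keep what it sends."""
    sent = (reporter.sample(ts, value) for ts, value in readings)
    return [obs for obs in sent if obs is not None]


# In this simulation both sensors are polled every 10 seconds.
temperature = [(0, 21.0), (10, 21.0), (20, 21.1), (30, 21.1), (40, 21.3), (50, 21.3)]
occupancy = [(0, 0), (10, 0), (20, 1), (30, 1), (40, 0), (50, 0)]  # 1 = car present

periodic_sent = run(PeriodicReporter("weather-01", period_s=20), temperature)
event_sent = run(EventReporter("parking-17", initial_value=0), occupancy)

print([o.timestamp for o in periodic_sent])  # [0, 20, 40]
print([o.timestamp for o in event_sent])     # [20, 40]
```

The periodic device sends on schedule even when nothing has changed, while the parking sensor stays silent until a car arrives or leaves.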

Data for testing before the implementation of new services in the Smart City

One of the goals of central management platforms for Smart Cities is to support advanced IoT experimentation and to evaluate the performance of a new service before it goes live. We need much more data during the testing period and, since the reporting period of devices that follow the periodic observation pattern is configurable, a high reporting frequency is usually set to obtain as much data as possible from a given service or system.

In this respect, and taking into account only the needs of the service, the selected frequency may lead to oversampling: we will have many more measurements than we really need to monitor whether a city service is working properly. However, this allows for more extensive experimentation during the testing periods of the new “Smart” systems we may implement. For devices and sensors that use the event-based pattern, the number of observations depends solely on the actual usage of the service; here we cannot set a sampling rate, and the only way to generate more data is to use the service artificially and repeatedly as part of performance testing.
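
One common way to deal with oversampled periodic data once testing is over is to aggregate it into coarser windows. The sketch below is an assumption about how this could be done, not a description of any particular platform: the `downsample` helper and the readings are invented, and averaging is only one of several possible aggregation choices.

```python
from statistics import mean


def downsample(readings, window_s):
    """Average (timestamp, value) readings into fixed windows of `window_s` seconds."""
    buckets = {}
    for ts, value in readings:
        window_start = ts - ts % window_s
        buckets.setdefault(window_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Test-phase readings, one every 15 seconds (timestamp in seconds, value in °C).
raw = [
    (0, 20.0), (15, 20.5), (30, 21.0), (45, 21.5),
    (60, 22.0), (75, 22.5), (90, 23.0), (105, 23.5),
]

per_minute = downsample(raw, window_s=60)
print(per_minute)  # [(0, 20.75), (60, 22.75)]
```

Eight test-phase readings become two per-minute values, which may be all the production service needs in order to check that it is working properly.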

Large-scale IoT infrastructure data management requirements

Once we have started testing on a new service or city infrastructure, we have to make sure that the entire IoT sensor network, the devices and elements that make it work, the servers and communication channels, etc., comply with a series of characteristics in terms of data processing such as:

  • Heterogeneity: Supporting a wide range of information implies a high level of heterogeneity in terms of both the data managed and its use. For data to be useful, it needs to be well described and consistent. A lot of time and effort is often invested in analytical systems to “clean and align” data so that information packages from different sources can be integrated and used. Therefore, any IoT data management platform must homogenize this information as it comes into the system and serve it already aligned to a consistent data model.
  • Experimentation realism: Live testbeds of any new service provide a degree of realism that even the most detailed simulation cannot achieve, but it is also necessary to take advantage of the vast capabilities of simulation software to evaluate IoT solutions before putting them into service. In this sense, building a system to manage massive volumes of data is meaningless if it is not clear how an application or a user accesses the data, how well the data reflects the environment being analyzed, and whether the information it contains is valid enough for us to trust what the analysis says about the state of the service or system to be put into operation.
  • Scalability: Real-world experimentation in a limited environment to check that everything is working properly requires testing at an appropriate scale. While small-scale testbeds with networks or systems of only a few dozen devices are sufficient for a minimal simulation of a new service, there will be IoT-based Smart City systems that require an order of magnitude larger scale to ensure that the data collected and sent by these truly reflect the full functionality of what is being tested. To facilitate access to the information generated by thousands of IoT nodes, it is necessary to deploy appropriate mechanisms that can scale and allocate access to progressively growing infrastructures.
  • Interoperability: Information modeling is necessary not only to manage city data efficiently, but also to ensure that city management platforms can be extended with new IoT devices, new data sets and legacy data streams. In this sense, it is necessary to establish the means by which new providers of city infrastructure, whose deployments increase the number of physical or virtual sensors generating data for the management systems, can interoperate with the existing ones and expand the catalog of information captured by the city sensor system.
  • Metadata: Linking the information generated by IoT devices with its provider and supporting the concept of metadata is of utmost importance for obtaining a complete view of the heterogeneous data generated in the city. For the data management platform to serve the widest possible range of users, it must be possible to apply simple, common rules to the information based on the location and timestamp of a device, while also allowing links to other attributes of the information such as sensor accuracy, sensor range, etc.
  • Security: In any IoT technology deployment, the data collected is critical and valuable to the control functions, so it is essential to secure these assets against everything from data theft to invasion of privacy. Another challenge is that the security that needs to be applied is often highly contextual, depending on the device’s identity, application, location, time of operation and potentially other factors. More importantly, when data is retrieved from IoT networks, the returned data set must be filtered according to the relevant security policy. If someone has permission to view historical readings from temperature sensors in the common areas of a building, but not in other tenants’ apartments, then a query for historical values for the entire building must return only the data that person is allowed to see.
  • Ease of application development: Open APIs, conceptually simple and consistent REST interfaces, and aligned data all facilitate application development. They give Smart City IT managers a fast, efficient way to integrate data from all systems, and give developers a solid basis for creating new tools to analyze it.
  • Near real-time: A very common requirement for IoT applications in smart cities is to “get the right data to the right people at the right place at the right time”. This is a well-known paradigm, and it makes it essential to support the filtering of results at the level of the information server and storage system. In addition, the platform should provide a high degree of symmetry between data access and subscription filtering, which is very useful for certain usage patterns with city data.
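
Two of the requirements above, heterogeneity and security, lend themselves to a short Python sketch. Everything here is hypothetical: the two vendor payload formats, the `Reading` model, the zone names and the policy are invented to illustrate the idea of normalizing data on ingestion and filtering query results by permission, and they do not describe any real platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Common data model that every incoming payload is aligned to."""
    sensor_id: str
    zone: str        # location metadata, e.g. "lobby" or "apartment-3B"
    timestamp: str   # ISO 8601
    celsius: float


def normalize_vendor_a(payload: dict) -> Reading:
    # Hypothetical vendor A: short keys, temperature already in Celsius.
    return Reading(payload["id"], payload["zone"], payload["ts"], float(payload["t_c"]))


def normalize_vendor_b(payload: dict) -> Reading:
    # Hypothetical vendor B: different keys, temperature in Fahrenheit.
    celsius = (float(payload["tempF"]) - 32) * 5 / 9
    return Reading(payload["sensor"], payload["location"], payload["time"], celsius)


def filter_by_policy(readings, allowed_zones):
    """Return only the readings from zones the caller is permitted to see."""
    return [r for r in readings if r.zone in allowed_zones]


# Ingestion: heterogeneous payloads are homogenized as they enter the system.
store = [
    normalize_vendor_a({"id": "a-1", "zone": "lobby", "ts": "2024-05-06T09:00:00Z", "t_c": "21.5"}),
    normalize_vendor_b({"sensor": "b-7", "location": "apartment-3B", "time": "2024-05-06T09:00:00Z", "tempF": 68.0}),
]

# Query: a user allowed to see common areas asks for the whole building.
visible = filter_by_policy(store, allowed_zones={"lobby", "stairwell"})
print([r.sensor_id for r in visible])  # ['a-1']
```

Because every reading carries the same location metadata after normalization, a single simple rule is enough to enforce the policy regardless of which vendor produced the data.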

In summary

Big data analytics with unstructured data is a complex and expanding field. Its techniques share the general purpose of extracting as much value as possible from the parameters collected in the city, but they also require us to offer more focused and specialized solutions that address the requirements of specific services and structures. To identify meaningful data patterns for Smart City operators, users, managers and providers, and to enable the optimization of algorithms and service delivery, it is first necessary to capture the data properly, store it in a consistent and secure way, and then give those responsible for the analysis an efficient way to evaluate the significant amounts of high-quality historical data that the city continuously generates.

In the end, as we said at the beginning, data is the manna of a Smart City. Without it, the monitoring and management system of a Smart City cannot function, and how we treat this data is crucial to extracting all the value it holds. The analysis solutions that help us do so continue to develop at enormous speed and, increasingly, both users and managers of our cities realize that good analysis and treatment of the parameters collected from the city is what enables the implementation and improvement of everything that is deployed and operating in it.
