Hello everybody, I am starting to build a web server to receive the AutoPi data instead of using the default cloud. I am giving this project a try so that I can process and store the data in a format that is more suitable for my research purposes.
My main design question is whether to use a relational database (e.g. PostgreSQL) or a non-relational database (e.g. MongoDB) to store the time series data received from the car.
I would lean towards an SQL database for its reliability, but since I will not always be storing the same set of PIDs, a fixed table schema might prove problematic. A flexible non-relational database, on the other hand, would give me more room on this front, but at the cost of simplicity when later querying the data.
So, has anyone tried this before? Do you have any suggestions?
Finally, just for the sake of comparison, what kind of architecture do you guys use on the default AutoPi Cloud? I have seen that you guys use Django, so my guess would be an SQL database. If that is the case, how do you address the problem that the user can choose a variable PID set to be saved for each trip?
So many things to consider here. Are you planning on having millions, thousands or tens of AutoPis installed?
For the majority of use cases, fewer than 10, I would probably go with one of the NoSQL alternatives. They offer much more flexibility. I would also consider one of the options that supports time series data and write-behind out of the box. Both of these criteria exclude PostgreSQL AND MongoDB. My bias against MongoDB probably plays into this point: I have been bitten by MongoDB in the past, and found it difficult and expensive to scale correctly.
Also consider your own data retention policy up front. What are you going to be using this data for? What type of data needs to stick around, and for how long? For example, for tax purposes 18 months of trip summary data should be good enough, but not the raw positional/location data points per trip.
If near-realtime data handling (geofences, DTC alerts, etc.) is needed, analyze the data points as they arrive, but don't store them.
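A minimal sketch of that "analyze on arrival, don't persist" idea, using a circular geofence check; the coordinates and fence radius here are purely hypothetical:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def inside_geofence(point, fence):
    """point = (lat, lon); fence = (lat, lon, radius_m)."""
    lat, lon = point
    flat, flon, radius = fence
    return haversine_m(lat, lon, flat, flon) <= radius

# Evaluate each incoming point and raise an alert instead of persisting it.
home = (55.6761, 12.5683, 500.0)  # hypothetical fence: 500 m around a home base
print(inside_geofence((55.6770, 12.5690), home))  # ~110 m away -> True
```

The check runs in the ingress path; only the resulting alert (if any) needs to be stored.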
Again, it will depend on your needs. All I will say is: consider NOT adopting a store-everything-forever attitude. You will hit a wall pretty quickly.
Also, are you going to go home/personal hosted or “cloud” hosted? If the latter, which cloud, and what infrastructure that you don't need to manage yourself is at your disposal? For example, Azure with Cosmos DB: expensive, but stable and scalable beyond most needs.
InfluxDB is a good option here.
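For illustration, a minimal sketch of formatting OBD samples as InfluxDB line protocol (the plain-text format InfluxDB's write endpoints accept); the measurement name and tag names here are hypothetical:

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Render one point as InfluxDB line protocol:
    measurement,tag=val field=val <ns-timestamp>"""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {ts_ns}"

line = to_line_protocol(
    "obd",
    {"imei": "356938035643809", "pid": "rpm"},  # hypothetical device + PID tags
    {"value": 2350.0},
    1700000000000000000,
)
print(line)
# obd,imei=356938035643809,pid=rpm value=2350.0 1700000000000000000
```

Tagging by PID keeps the schema open: a new PID is just a new tag value, no migration required.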
I have also used Elasticsearch for time series data. Being able to leverage Kibana for dashboards out of the box was fantastic for prototyping.
If you do go relational, really consider that 3NF or beyond might not be optimal. It might be best to just store things pretty flat.
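A sketch of that flat approach, using SQLite for illustration (table and column names are my own): one row per sample rather than one column per PID, so a variable PID set per trip needs no schema change.

```python
import sqlite3

# Deliberately flat "long" table: one row per (timestamp, device, PID) sample.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE samples (
        ts    INTEGER NOT NULL,   -- unix epoch seconds
        imei  TEXT    NOT NULL,
        pid   TEXT    NOT NULL,
        value REAL    NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?, ?)",
    [
        (1700000000, "356938035643809", "rpm", 2350.0),
        (1700000000, "356938035643809", "speed", 88.0),
        (1700000001, "356938035643809", "rpm", 2410.0),
    ],
)
rows = conn.execute(
    "SELECT ts, value FROM samples WHERE pid = 'rpm' ORDER BY ts"
).fetchall()
print(rows)  # [(1700000000, 2350.0), (1700000001, 2410.0)]
```

The trade-off is that per-PID queries stay simple while cross-PID queries need self-joins or pivoting, which is exactly the simplicity-vs-flexibility tension from the original question.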
PS: I would also read up on the CAP theorem. It should help you decide on the characteristics required.
On to your second question, about architecture.
For your edge/gateway, at scale, any form of IoT traffic over TCP/IP (yes, this includes HTTP) will suck your resources (and your pocketbook) dry. UDP is the preferred method, with a software-level ACK and encryption mechanisms in place.
If you can't do UDP, then I would consider something like MQTT; just keep in mind MQTT can be hard to scale, depending on your needs.
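To make the software-level ACK idea concrete, here is a sketch of a hypothetical datagram format (the layout is entirely my own invention, not any AutoPi protocol): a sequence number and IMEI in the clear, a payload that could be encrypted, and an ACK that simply echoes the sequence number.

```python
import struct

# Hypothetical wire format: 4-byte sequence number + 8-byte IMEI (a 15-digit
# IMEI fits in a uint64), followed by the payload. The receiver replies with
# a 4-byte ACK echoing the sequence number -- the "software ACK" over UDP.
HEADER = struct.Struct("!IQ")  # network byte order: uint32 seq, uint64 imei

def pack_datagram(seq, imei, payload):
    return HEADER.pack(seq, int(imei)) + payload

def unpack_datagram(data):
    seq, imei = HEADER.unpack_from(data)
    return seq, str(imei), data[HEADER.size:]

def make_ack(seq):
    return struct.pack("!I", seq)

dgram = pack_datagram(7, "356938035643809", b'{"rpm": 2350}')
seq, imei, payload = unpack_datagram(dgram)
print(seq, imei, payload)  # 7 356938035643809 b'{"rpm": 2350}'
```

The sender retransmits a datagram until it sees the matching ACK, which gives you delivery confirmation without TCP's per-connection overhead.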
Once on the server, I would always recommend that you use something like MQTT, Kinesis or Event Hubs (depending on scale and hosting provider). This will create a layer between your raw socket ingress and the rest of your “ingress pipeline”.
Please, please avoid HTTP(S). For one, SSL and HTTP are both terribly chatty, brutally expensive, a pain to scale, and simply not necessary, even though they are the “easy” option…
So, now you have the data on the server, sitting in a queue.
Much of the ingress issue is going to be your ability to partition the traffic. For GSM/cellular devices you can almost always count on having an IMEI available, so creating, say, 16/32/… buckets based on the IMEI is a great way to do the partitioning. If you are doing application-level encryption, include the IMEI in some form of “header” and leave it unencrypted. This allows you to quickly parse out the IMEI and start partitioning your traffic. You have to treat the gateway as your most valuable resource (since it is limited by the sheer number of IO sockets), so perform the smallest amount of processing you can get away with. Partitioning in this manner also enables you to support FIFO, if that is required.
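The bucketing described above can be sketched in a few lines; the bucket count and hash choice here are arbitrary, the only property that matters is that the assignment is stable:

```python
import hashlib

NUM_PARTITIONS = 16  # e.g. one of the 16/32/... bucket counts mentioned above

def partition_for(imei: str) -> int:
    """Stable bucket assignment: the same IMEI always lands in the same
    partition, which preserves per-device FIFO ordering downstream."""
    digest = hashlib.sha256(imei.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

bucket = partition_for("356938035643809")  # hypothetical IMEI
print(bucket)
```

The gateway does nothing but this cheap hash and a hand-off to the bucket's queue, keeping per-packet work minimal.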
From here there are a few approaches you can take…
CQRS and Event Sourcing
I suspect some combination of these could be leveraged.
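Event sourcing in miniature, with hypothetical event names just to show the shape: the append-only event log is the source of truth, and current state is a fold over the events.

```python
from functools import reduce

def apply_event(state, event):
    """Pure transition function: fold one event into the current state."""
    kind, data = event
    if kind == "trip_started":
        return {**state, "trip": data, "points": 0}
    if kind == "point_received":
        return {**state, "points": state.get("points", 0) + 1}
    if kind == "trip_ended":
        return {**state, "trip": None}
    return state  # unknown events are ignored

log = [
    ("trip_started", {"id": 1}),
    ("point_received", {"rpm": 2350.0}),
    ("point_received", {"rpm": 2410.0}),
    ("trip_ended", {"id": 1}),
]
state = reduce(apply_event, log, {})
print(state)  # {'trip': None, 'points': 2}
```

On the CQRS side, the same log can feed separate read models (trip summaries, latest state) without touching the write path.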
For me, I would almost always recommend using an actor framework. I would even consider moving your gateway (UDP services) into the actors as well. Since most implementations allow you to leverage queues as the underlying infrastructure for communication, this is an easy win for communication between your gateway and your device actors.
Some domain-specific actors could be considered as well.
Depending on your design (stateful vs stateless actors) you could alleviate much of your database requirements and simply keep the “state” of much of your data in the actors themselves, using snapshotting and whatnot to manage persistence. Historical data would most likely still need to be persisted according to your needs. But a stateless actor which is responsible for a single entity (dongle/vehicle/user), and which can query your repository and respond back to your API accordingly, would be ideal here.
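To illustrate the stateful-actor idea without committing to any particular framework, here is a toy per-device actor built on asyncio queues (everything here, including the snapshot hook, is just a stand-in for what a real actor framework would provide):

```python
import asyncio

class DeviceActor:
    """Minimal actor sketch: one mailbox per device, state held in memory,
    with a snapshot method standing in for framework-managed persistence."""
    def __init__(self, imei):
        self.imei = imei
        self.state = {}                # latest value per PID
        self.mailbox = asyncio.Queue()

    async def run(self):
        while True:
            msg = await self.mailbox.get()
            if msg is None:            # poison pill: stop the actor
                break
            pid, value = msg
            self.state[pid] = value    # fold the sample into in-memory state

    def snapshot(self):
        return dict(self.state)

async def demo():
    actor = DeviceActor("356938035643809")  # hypothetical IMEI
    task = asyncio.create_task(actor.run())
    await actor.mailbox.put(("rpm", 2350.0))
    await actor.mailbox.put(("speed", 88.0))
    await actor.mailbox.put(None)
    await task
    return actor.snapshot()

print(asyncio.run(demo()))  # {'rpm': 2350.0, 'speed': 88.0}
```

Because each device's state lives in exactly one actor, the API can ask the actor for the latest values instead of hitting the database on every request.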
Finally, your API. Here is where our old friend HTTPS comes into play. Keep your footprint to a minimum. Have aggregate calls in place to do the initial data load for your clients. I would also consider other mechanisms like WebSockets and/or MQTT as near-realtime feeds for telemetry; super handy for your client-application development experience. You don't want to provide a polling mechanism for your clients. That will be hard to scale.
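A sketch of what such an aggregate initial-load call might return (the function and field names are hypothetical): everything the client needs for its first screen in one round trip, instead of several polled requests.

```python
def initial_load(device):
    """Hypothetical aggregate response: latest state plus recent trip
    summaries, bundled so the client makes a single call on startup."""
    return {
        "device": device["imei"],
        "state": device["latest_state"],       # latest value per PID
        "trips": device["trips"][-10:][::-1],  # ten most recent, newest first
    }

device = {
    "imei": "356938035643809",
    "latest_state": {"rpm": 2350.0, "speed": 88.0},
    "trips": [{"id": i, "km": 12.5 + i} for i in range(12)],
}
payload = initial_load(device)
print(len(payload["trips"]), payload["trips"][0]["id"])  # 10 11
```

After this one call, the client switches to the WebSocket/MQTT feed for updates rather than polling.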