This is a tech talk installment of our blog, brought to you by developer extraordinaire, Adam Knox.
Though BigQuery uses syntax very similar to SQL, it actually takes a significantly different tack to processing large amounts of data. BigQuery tends to brute-force things and doesn’t actually use indexes, so lots of data means lots of processors and processing time. To pull this off, Google has gone with a different way of storing data that may affect how you go about designing your queries.
When a table is created, Google uses its ColumnIO format to store your data. Each column is stored as a separate file, the exception being repeated fields. If you have a table with the columns name (string) and phoneNumbers (repeated integer), each name is stored on a single line alongside the other names, and all of the phone numbers associated with that name are stored together on a single line of the phone numbers file. Columns can also get split up if they grow too large, but that won’t have an impact on your workflow.
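As a toy illustration of the column-oriented idea (this is not Google’s actual ColumnIO format, just a sketch of the layout it implies), the table above could be stored as one sequence per column, with the repeated field kept grouped per row:

```python
# Toy sketch of column-oriented storage: each column lives in its own
# "file", so a query touching only `name` never reads the phone data.
rows = [
    {"name": "Ada", "phoneNumbers": [5551234, 5555678]},
    {"name": "Grace", "phoneNumbers": [5550000]},
]

# One sequence per column; the repeated field stays grouped per row.
name_column = [row["name"] for row in rows]
phone_column = [row["phoneNumbers"] for row in rows]

print(name_column)   # one name per "line"
print(phone_column)  # all of a row's numbers stay together
```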
Knowing this lets you run faster queries by not processing columns you don’t actually care about, and in some cases even makes formerly impossible queries possible. Queries may also become cheaper, since the system charges by the amount of data processed: if a column is not included in the query, its data is never read.
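For example (using a hypothetical contacts table with the schema above), this query reads only the name column’s file, so the phone number data contributes nothing to the bytes processed or to the bill:

```
SELECT name FROM [myproject:mydataset.contacts]
```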
If you have created a table, you can feel reasonably safe that it’s not going anywhere: once your files are created, they’re stored in three different data centres, and even in three different places within each data centre. Since BigQuery throws processors at problems, data needs to be readily accessible. The caveat to this is temporary datasets: there is an option to specify an expiry time for tables within a dataset, and tables will go away once they are older than that expiry.
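A sketch of setting that up with the bq command-line tool (the dataset name is hypothetical; the expiration is given in seconds and applies as the default to new tables created in the dataset):

```
bq mk --default_table_expiration 86400 temporary_dataset
```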
A project can store datasets, each of which can contain tables. The full name for accessing a table in BigQuery syntax is
[projectname:datasetname.tablename], however this can be shortened to
datasetname.tablename if you are accessing the table from the project you are currently working in.
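Both forms in a query (the project, dataset and table names here are hypothetical); the second only works from inside myproject:

```
SELECT name FROM [myproject:mydataset.contacts]
SELECT name FROM mydataset.contacts
```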
Breaking related data up into separate tables (e.g. by date, or after a specific event occurs) can make for more maintainable queries, since queries generally run over all rows in a table. It also means you may want to look at multiple tables at once, so there is a meta table hidden in each dataset, accessible at
[projectname:datasetname.__TABLES__], which contains information about each table in the dataset and can be used for aggregating tables for querying.
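A quick look at what the meta table exposes (the project and dataset names are hypothetical):

```
SELECT table_id, row_count, size_bytes, creation_time
FROM [myproject:mydataset.__TABLES__]
```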
TABLE_QUERY is a command that allows multiple like tables to be aggregated before running a query on the aggregate, and any column in
__TABLES__ can be used to form this aggregate. For example, if I want an idea of the response times since Jan 1, 2016 for a project, I can run the following query, which shows the minimum, median and maximum response times:
SELECT QUANTILES((protoPayload.endTime - protoPayload.startTime), 3) AS responseTimeBucketsInMilliseconds
FROM (TABLE_QUERY(appengine_logs, "table_id CONTAINS 'appengine_googleapis_com_request_log' AND creation_time > 1451606400000"))
Since only two columns, each containing integers, are used, this is still a reasonably cheap query ($0.0003), even though it accesses 28+ tables that the query below tells me hold 1.857 TB, and which would cost about $9 to access every field from.
SELECT SUM(size_bytes)/1000000000
FROM [repcore-prod:appengine_logs.__TABLES__]
WHERE table_id CONTAINS 'appengine_googleapis_com_request_log' AND creation_time > 1451606400000
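As a sanity check on that $9 figure, here is the arithmetic, assuming BigQuery’s on-demand rate of $5 per TB processed (the rate is an assumption and may differ for your account):

```python
def query_cost_usd(bytes_processed, usd_per_tb=5.0):
    # BigQuery on-demand pricing bills by bytes processed,
    # so cost scales linearly with the data a query reads.
    return bytes_processed / 1e12 * usd_per_tb

# Scanning every field of all 1.857 TB:
print(query_cost_usd(1.857e12))
```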
BigQuery is great when it comes to logging changes and analysing those changes, because you can stream new rows to a table. It is, however, a terrible data store for quick and frequent lookups of specific rows, because of the lack of indexes, and it is also not great for storing singular representations of objects that you expect to change, because rows in a table cannot be modified or removed.
In some situations, even accessing a single column is too much for BigQuery to deal with. I present to you the following unusable query, which returns listing IDs ordered by listing creation date:
SELECT lid FROM [repcore-prod:datastore.LIS] ORDER BY ct
Technically, it recently became possible for this query to succeed, but not for people at Vendasta, because it requires enabling a higher billing tier. Since there are no pre-sorted indexes, BigQuery is processing a small amount of data very intensively, so in this case they charge for the processing. The billing tier can be changed for interactive queries in the BigQuery UI by checking “allow unlimited” (if enabled) under the “options” button.
Assuming your query is as efficient as possible, there are a couple of approaches left for paring down table sizes to get around these kinds of hard limits. The first is table time decorators, and the second is hashing.
You can learn more about table decorators here. Unlike filtering down results using commands like
LIMIT/WHERE/HAVING/OMIT, table decorators will actually decrease the cost of querying a table, because the data outside the given range is never even accessed. Unfortunately, this method is next to useless at Vendasta, because even for tables like the ListingHistoryModel that we actually stream to, we still drop the whole table every day and replace it with a replica of it from NDB, meaning table time decorators will only work on the ListingHistoryModel if you only care about the current day’s entries.
Logging is the one place at Vendasta where they can be quite helpful. Due to the size of the actual log contents, computation and memory limits are easy to hit, and querying even just the log content column is expensive. With logs, you also usually only care about specific points in time, which is exactly what the time decorator helps with.
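A sketch of what that looks like (the table name is hypothetical): in legacy SQL, a range decorator takes millisecond offsets, a negative value is relative to now, and an open-ended range runs to the present. This query reads, and therefore bills for, only the data added in roughly the last hour:

```
SELECT protoPayload.resource
FROM [myproject:appengine_logs.current_log@-3600000-]
```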
Every time you run a query, it creates a new table, and sometimes creates unknown hidden tables behind the scenes. If you so choose, you can give the result table a name to keep it, or let it keep the name Google gave it and disappear after a day. If you ever run into a “Response too large to return” error, it’s because they are protecting you from yourself: it’s easy to create enormous tables (especially with cross joins). If this happens and you were expecting it to happen, you are required to name the result table and enable the “Allow Large Results” option. If you repeat a query on a table that hasn’t had new rows streamed to it, you will just get the cached results, if they are still around.