Hey there, fellow developer! Ready to dive into the world of BigQuery? If you're looking to harness the power of Google's serverless data warehouse in your Python projects, you're in the right place. We'll be using the google-cloud-bigquery package to make this integration a breeze. Let's get started!
Before we jump in, make sure you've got:

- A recent version of Python 3 installed
- A Google Cloud project with the BigQuery API enabled
- A service account with BigQuery permissions and its JSON key file downloaded
First things first, let's get that package installed:
pip install google-cloud-bigquery
Easy peasy, right?
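Want to double-check that the install took? A quick import-and-print confirms it:

# Sanity check: import the package and print its version
from google.cloud import bigquery
print(bigquery.__version__)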
Now, let's get you authenticated:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-key.json"
Pro tip: Add this to your .bashrc or .zshrc to make your life easier.
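If you'd rather not juggle environment variables, you can also hand the client credentials explicitly in Python. A minimal sketch, assuming your key file lives at service-account-key.json (swap in your real path):

from google.cloud import bigquery
from google.oauth2 import service_account

# Load credentials from the service account key file (example path)
credentials = service_account.Credentials.from_service_account_file(
    "service-account-key.json"
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)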
Time to get that client object ready:
from google.cloud import bigquery

client = bigquery.Client()
Boom! You're connected and ready to roll.
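If you want to be sure the connection actually works, a trivial query makes a cheap smoke test:

# A no-op query that confirms authentication and connectivity
row = next(iter(client.query("SELECT 1 AS ok").result()))
print(row.ok)  # prints 1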
Let's run through some essential operations:
Create a dataset:

dataset_id = "your_dataset_name"
dataset = bigquery.Dataset(f"{client.project}.{dataset_id}")
dataset = client.create_dataset(dataset)
print(f"Dataset {dataset.dataset_id} created.")
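Two optional knobs worth knowing about: you can pin the dataset to a region, and exists_ok=True keeps re-runs from blowing up. A hedged variant (the US multi-region here is just an example):

dataset = bigquery.Dataset(f"{client.project}.{dataset_id}")
dataset.location = "US"  # pick the region where your data should live
dataset = client.create_dataset(dataset, exists_ok=True)  # safe to re-run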
Create a table with a schema:

schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("age", "INTEGER"),
]
table_id = "your_table_name"
table = bigquery.Table(f"{client.project}.{dataset_id}.{table_id}", schema=schema)
table = client.create_table(table)
print(f"Table {table.table_id} created.")
Insert some rows with the streaming API:

rows_to_insert = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
]
errors = client.insert_rows_json(table, rows_to_insert)
if not errors:
    print("New rows have been added.")
else:
    print(f"Encountered errors while inserting rows: {errors}")
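Streaming inserts are at-least-once, so retries can duplicate rows. insert_rows_json accepts optional row_ids for best-effort de-duplication; here's a sketch that reuses name as the insert ID (purely illustrative, pick a genuinely unique key in real code):

# BigQuery drops repeats of the same insert ID within a short window
errors = client.insert_rows_json(
    table,
    rows_to_insert,
    row_ids=[row["name"] for row in rows_to_insert],
)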
query = f""" SELECT name, age FROM `{client.project}.{dataset_id}.{table_id}` WHERE age > 25 """ query_job = client.query(query) results = query_job.result() for row in results: print(f"{row.name}: {row.age}")
Ready to level up? Let's look at some more advanced stuff:
For better throughput when inserting lots of data, reach for the BigQuery Storage Write API:
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

write_client = bigquery_storage_v1.BigQueryWriteClient()
parent = write_client.table_path(client.project, dataset_id, table_id)

# A COMMITTED stream makes rows visible as soon as they're appended
write_stream = types.WriteStream(type_=types.WriteStream.Type.COMMITTED)
write_stream = write_client.create_write_stream(
    parent=parent, write_stream=write_stream
)
# Then use write_client.append_rows() in a loop or batch process
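The full append_rows flow needs protobuf-serialized rows, which is more than a quick snippet can do justice. If your data arrives in batches rather than as a stream, a simpler path (a plain load job, not the Storage Write API) is load_table_from_json:

# Batch-load rows via a load job; load jobs are typically free and
# skip the streaming buffer entirely
job = client.load_table_from_json(
    rows_to_insert,
    f"{client.project}.{dataset_id}.{table_id}",
)
job.result()  # wait for the load to finish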
Protect against SQL injection and improve performance:
query = """ SELECT name, age FROM `{}.{}.{}` WHERE age > @age_threshold """ job_config = bigquery.QueryJobConfig( query_parameters=[ bigquery.ScalarQueryParameter("age_threshold", "INT64", 25) ] ) query_job = client.query(query.format(client.project, dataset_id, table_id), job_config=job_config)
When dealing with big data, stream those results:
query_job = client.query(query)

for row in query_job:
    print(f"{row.name}: {row.age}")
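Iterating the job like this pages through results lazily instead of pulling everything into memory at once. If you want explicit control over chunk size, result() accepts a page_size; a sketch with 1,000-row pages:

rows = client.query(query).result(page_size=1000)
for page in rows.pages:  # fetch one page at a time
    for row in page:
        print(f"{row.name}: {row.age}")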
And there you have it! You're now equipped to integrate BigQuery into your Python projects like a pro. Remember, the BigQuery API is powerful and flexible, so don't be afraid to explore and experiment. Happy coding!
For more in-depth info, check out the official BigQuery documentation and the google-cloud-bigquery package docs.
Here's a complete script tying it all together:
from google.cloud import bigquery

# Initialize client
client = bigquery.Client()

# Create dataset
dataset_id = "example_dataset"
dataset = bigquery.Dataset(f"{client.project}.{dataset_id}")
dataset = client.create_dataset(dataset)

# Create table
schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("age", "INTEGER"),
]
table_id = "example_table"
table = bigquery.Table(f"{client.project}.{dataset_id}.{table_id}", schema=schema)
table = client.create_table(table)

# Insert data
rows_to_insert = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
]
errors = client.insert_rows_json(table, rows_to_insert)
if not errors:
    print("New rows have been added.")
else:
    print(f"Encountered errors while inserting rows: {errors}")

# Query data
query = f"""
    SELECT name, age
    FROM `{client.project}.{dataset_id}.{table_id}`
    WHERE age > @age_threshold
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("age_threshold", "INT64", 25)
    ]
)
query_job = client.query(query, job_config=job_config)

for row in query_job:
    print(f"{row.name}: {row.age}")
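When you're done experimenting, clean up so the example dataset doesn't linger. A minimal teardown for the script above:

# Delete the dataset and everything in it; not_found_ok makes re-runs safe
client.delete_dataset(
    f"{client.project}.{dataset_id}",
    delete_contents=True,
    not_found_ok=True,
)
print(f"Deleted dataset {dataset_id}.")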
Now go forth and conquer those big datasets!