ISE Blog

Using Amazon Athena to Query Large Datasets

One of the great leveling factors of cloud technology is that small companies and individuals can now do things quickly and affordably that, not long ago, would have required the infrastructure of a large IT organization. One of the most powerful yet simplest of these technologies is ad hoc querying of data, offered by Amazon Athena.

Athena is a “serverless interactive query service.” In plain English, this means we can query data we have stored in S3 in real time, without configuring database servers or Hadoop clusters and without a separate data-loading step. You pay for the queries you run, rather than an hourly charge for a standing database server, making Athena a fantastic way to explore datasets for their potential value and leverage them in new and unusual ways.

Now, I can recommend a few methods to quickly amass a large dataset, but as an example for a blog post, any of these would still take too long. Luckily, AWS has also recently announced the Registry of Open Data on AWS, a resource that allows you to search by category across more than 50 shared datasets at the time of this writing. Here I found the Global Database of Events, Language and Tone (GDELT), a corpus of events gleaned from global news coverage.

The GDELT Project

[Source: The GDELT Project]

Gluing it Together

The first step is to create a “table” from your data. This defines the columns and data types associated with the data, allowing us to query it in a standard way. AWS Glue is here to help. You can enter the columns and types associated with your data manually, or you can use a “crawler” to infer and classify the data you have. A crawler is the most common approach, as it allows for data formats that change over time.

AWS Glue

[Source: AWS Console] 

In the case of GDELT, you can use the published schema to create a table by running the following query in the Athena console (change “sampledb” to the name of a database you have created):

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.gdeltmaster (
GLOBALEVENTID BIGINT,
SQLDATE INT,
MonthYear INT,
Year INT,
FractionDate DOUBLE,
Actor1Code STRING,
Actor1Name STRING,
Actor1CountryCode STRING,
Actor1KnownGroupCode STRING,
Actor1EthnicCode STRING,
Actor1Religion1Code STRING,
Actor1Religion2Code STRING,
Actor1Type1Code STRING,
Actor1Type2Code STRING,
Actor1Type3Code STRING,
Actor2Code STRING,
Actor2Name STRING,
Actor2CountryCode STRING,
Actor2KnownGroupCode STRING,
Actor2EthnicCode STRING,
Actor2Religion1Code STRING,
Actor2Religion2Code STRING,
Actor2Type1Code STRING,
Actor2Type2Code STRING,
Actor2Type3Code STRING,
IsRootEvent INT,
EventCode STRING,
EventBaseCode STRING,
EventRootCode STRING,
QuadClass INT,
GoldsteinScale DOUBLE,
NumMentions INT,
NumSources INT,
NumArticles INT,
AvgTone DOUBLE,
Actor1Geo_Type INT,
Actor1Geo_FullName STRING,
Actor1Geo_CountryCode STRING,
Actor1Geo_ADM1Code STRING,
Actor1Geo_Lat FLOAT,
Actor1Geo_Long FLOAT,
Actor1Geo_FeatureID INT,
Actor2Geo_Type INT,
Actor2Geo_FullName STRING,
Actor2Geo_CountryCode STRING,
Actor2Geo_ADM1Code STRING,
Actor2Geo_Lat FLOAT,
Actor2Geo_Long FLOAT,
Actor2Geo_FeatureID INT,
ActionGeo_Type INT,
ActionGeo_FullName STRING,
ActionGeo_CountryCode STRING,
ActionGeo_ADM1Code STRING,
ActionGeo_Lat FLOAT,
ActionGeo_Long FLOAT,
ActionGeo_FeatureID INT,
DATEADDED INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://support.elasticmapreduce/training/datasets/gdelt';

If all goes well, after a few seconds it will tell you the query was a success.
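If you want to double-check beyond the success message, Athena supports Hive-style DDL statements for inspecting what you just created. A quick sanity check (assuming the database and table names used above) looks like this:

```sql
-- List the tables in the database, then show the new table's columns
SHOW TABLES IN sampledb;
DESCRIBE sampledb.gdeltmaster;
```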

Querying the Data

Once your table is created, you can use the Athena console to run a simple query to see how much data you have loaded:

AWS Query

[Source: AWS Console] 
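The query in the screenshot is a simple row count; in case the image doesn't come through, it looks something like this (table name assumed to match the earlier example):

```sql
-- Count the rows Athena finds under the table's S3 location
SELECT COUNT(*) FROM "sampledb"."gdeltmaster";
```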

For the GDELT dataset, you need another resource, the CAMEO codebook, to understand the event codes in the data. Say we want to know the average tone of coverage of visits by country across our dataset, answering the question, “Where is the happiest coverage of state visits in the world?” In the CAMEO codebook, a visit is code 042. So, I came up with this query:

SELECT ActionGeo_CountryCode, COUNT(*) AS count, AVG(AvgTone) AS tone
FROM "sampledb"."gdeltmaster"
WHERE EventCode = '042'
GROUP BY ActionGeo_CountryCode
ORDER BY tone DESC;

AWS Query 2

[Source: AWS Console]
The top result is country code “WF” for the Wallis and Futuna Islands with a whopping 9.12 average tone score, but with only 1 in the count column. That doesn’t seem statistically significant. However, the next is the nation of Tuvalu (“TV”) with a healthy 6.55 average tone and a count of 1111, making it a viable-looking score. Looks like a happy place to visit!
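You can bake that sanity check into the query itself by filtering out countries with too few events before ranking. The threshold of 100 below is an arbitrary choice for illustration, not a statistical rule:

```sql
-- Rank countries by average tone of visit coverage (CAMEO code 042),
-- keeping only countries with enough events to be meaningful
SELECT ActionGeo_CountryCode, COUNT(*) AS count, AVG(AvgTone) AS tone
FROM "sampledb"."gdeltmaster"
WHERE EventCode = '042'
GROUP BY ActionGeo_CountryCode
HAVING COUNT(*) > 100
ORDER BY tone DESC;
```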

Wisdom at Your Fingertips

Simply put, Athena makes it fast and easy to casually explore large datasets without deep big data expertise. If you understand data formats and SQL queries, you can load data and perform some basic analysis in a couple of minutes. Under the hood, Athena uses Presto, with Hive-compatible DDL for defining tables, and the AWS Glue catalog can be shared with an Amazon EMR Hadoop cluster to bring your Athena datasets into more extended analysis or ongoing processes. But if you just need to find some quick facts from a large set of data, Athena is a great solution.

Have you used Amazon Athena for querying data? Do you have questions about Amazon Athena, the Cloud or DevOps? Comment below and join the conversation!

Samuel Thurston, Software Engineer

Samuel Thurston is a Software Engineer and Cloud Practice Lead for ISE, architecting and implementing cloud solutions for enterprise clients. He enjoys running, yoga, and cooking, and is frequently found on the disc golf course.