import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import altair as alt
We will be working with a table of data about motion pictures, taken from the vega-datasets collection. The data includes variables such as the film name, director, genre, release date, ratings, and gross revenues. However, be careful when working with this data: the films are from unevenly sampled years, using data combined from multiple sources. If you dig in you will find issues with missing values and even some subtle errors! Nevertheless, the data should prove interesting to explore...
Let's store the URL of the JSON data file, served here from a CDN copy of the vega-datasets collection, and then read the data into a Pandas data frame so that we can inspect its contents.
movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
movies = pd.read_json(movies_url)
Now let's peek at the first 5 rows of the table to get a sense of the fields and data types...
movies.head(5)
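Since the table is known to contain missing values, a per-column count of nulls is a useful companion to `head`. The sketch below runs on a tiny made-up frame so that it works offline; the rows are invented for illustration, and `Title` is assumed to be the name of the film-name column.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; these are NOT values from movies.json.
sample = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],  # assumed column name
    "Rotten_Tomatoes_Rating": [87.0, np.nan, 42.0, np.nan],
    "IMDB_Rating": [7.4, 6.1, np.nan, 5.0],
})

# Count missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only rows where both ratings are present.
complete = sample.dropna(subset=["Rotten_Tomatoes_Rating", "IMDB_Rating"])
print(len(complete), "of", len(sample), "rows have both ratings")
```

On the real table, `movies.isna().sum()` should give the same kind of per-column summary.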
We'll start our transformation tour by binning data into discrete groups and counting records to summarize those groups. The resulting plots are known as histograms.
Let's first look at unaggregated data: a scatter plot showing movie ratings from Rotten Tomatoes versus ratings from IMDB users. We'll provide data to Altair by passing the movies data URL to the Chart constructor. (We could also pass the Pandas data frame directly to get the same result.) We can then encode the Rotten Tomatoes and IMDB ratings fields using the x and y channels:
alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q'),
    alt.Y('IMDB_Rating:Q')
).properties(
    width=800,
    height=400
)
To summarize this data, we can bin a data field to group numeric values into discrete groups. Here we bin along the x-axis by adding bin=True to the x encoding channel. The result is a set of ten bins of equal step size, each corresponding to a span of ten ratings points.
alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=True),
    alt.Y('IMDB_Rating:Q')
).properties(
    width=800,
    height=400
)
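To make the effect of binning concrete, here is a small pandas sketch of what happens to individual values: each rating is mapped to the start of the interval that contains it. The ratings are made up, and the step of 10 is an assumption chosen to match the ten-point bins described above.

```python
import numpy as np
import pandas as pd

# Made-up ratings for illustration; not taken from the movies data.
ratings = pd.Series([3, 10, 19, 47, 50, 88, 99])

step = 10  # assumed bin width, matching ten-point bins

# Each value falls in the half-open bin [bin_start, bin_start + step).
bin_start = (np.floor(ratings / step) * step).astype(int)
bin_end = bin_start + step

binned = pd.DataFrame({"rating": ratings, "bin_start": bin_start, "bin_end": bin_end})
print(binned)
```

Values that share a `bin_start` collapse onto the same x position in the binned chart, which is why the points line up in columns.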
Setting bin=True uses default binning settings, but we can exercise more control if desired. Let's instead set the maximum bin count (maxbins) to 20, which has the effect of doubling the number of bins. Now each bin corresponds to a span of five ratings points.
alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q')
).properties(
    width=800,
    height=400
)
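Note that maxbins is an upper bound rather than an exact bin count. The sketch below is not Altair's or Vega-Lite's actual code; it is a simplified illustration, assuming the step is chosen from "nice" values (a power of ten, optionally divided by 5 or 2). The helper name `nice_step` and its parameters are hypothetical.

```python
import math

def nice_step(span, maxbins, divisors=(5, 2)):
    """Hypothetical helper: pick a 'nice' bin step that yields at most maxbins bins."""
    # Start from a power of ten that is likely too fine, then coarsen as needed.
    level = math.ceil(math.log10(maxbins))
    step = 10 ** (round(math.log10(span)) - level)
    while math.ceil(span / step) > maxbins:
        step *= 10
    # Try finer subdivisions as long as the bin budget is still respected.
    for d in divisors:
        candidate = step / d
        if span / candidate <= maxbins:
            step = candidate
    return step

# A span of 100 ratings points, as assumed for a 0 to 100 scale.
print(nice_step(100, 10))
print(nice_step(100, 20))
print(nice_step(100, 30))
```

Under these assumptions a budget of 10 bins gives a step of 10 and a budget of 20 gives a step of 5, consistent with the charts above, while a budget of 30 still settles on a step of 5 because the next finer nice step would exceed the limit.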
With the data binned, let's now summarize the distribution of Rotten Tomatoes ratings. We will drop the IMDB ratings for now and instead use the y encoding channel to show an aggregate count of records, so that the vertical position of each point indicates the number of movies per Rotten Tomatoes rating bin.
Because the count aggregate tallies the total number of records in each bin regardless of the field values, we do not need to include a field name in the y encoding.
alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count()')
).properties(
    width=800,
    height=400
)
To arrive at a standard histogram, let's change the mark type from circle to bar:
alt.Chart(movies_url).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count()')
).properties(
    width=800,
    height=400
)
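The numbers behind such a histogram can be reproduced outside Altair. Here is a minimal NumPy sketch, assuming bin edges every five points from 0 to 100 and using made-up ratings rather than the movies data.

```python
import numpy as np

# Made-up ratings for illustration; not taken from the movies data.
ratings = np.array([2, 4, 7, 12, 48, 51, 52, 53, 97])

# Bin edges 0, 5, 10, ..., 100 give twenty bins of width five.
edges = np.arange(0, 101, 5)
counts, _ = np.histogram(ratings, bins=edges)

# Print only the non-empty bins with their record counts.
for start, n in zip(edges[:-1], counts):
    if n > 0:
        print(f"bin starting at {start}: {n} record(s)")
```

Each entry of `counts` corresponds to the height of one bar: the number of records that fall inside that bin.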
We can now examine the distribution of ratings more clearly: there are fewer movies at the low end and slightly more at the high end, but the distribution is roughly uniform overall. Rotten Tomatoes ratings are determined by taking "thumbs up" and "thumbs down" judgments from film critics and calculating the percentage of positive reviews. It appears this approach makes good use of the full range of rating values.
Similarly, we can create a histogram for IMDB ratings by changing the field in the x encoding channel:
alt.Chart(movies_url).mark_bar().encode(
    alt.X('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count()')
).properties(
    width=800,
    height=400
)
In contrast to the more uniform distribution we saw before, IMDB ratings exhibit a bell-shaped (though negatively skewed) distribution. IMDB ratings are formed by averaging scores (ranging from 1 to 10) provided by the site's users. We can see that this form of measurement leads to a different shape than the Rotten Tomatoes ratings. We can also see that the mode of the distribution is between 6.5 and 7: people generally enjoy watching movies, potentially explaining the positive bias!
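The two rating schemes measure different things, which a few lines of arithmetic can make concrete. The sketch below uses made-up verdicts and scores for a single hypothetical film, and it treats the IMDB-style rating as a plain mean, which is a simplification of whatever weighting the site may apply.

```python
# Made-up critic verdicts and user scores for one hypothetical film.
critic_thumbs_up = [True, True, False, True, False, True, True, True]
user_scores = [7, 8, 6, 9, 5, 7, 8, 6]

# Rotten Tomatoes-style rating: percentage of positive reviews (0 to 100).
rt_style = 100 * sum(critic_thumbs_up) / len(critic_thumbs_up)

# IMDB-style rating: average of user scores on a 1 to 10 scale (plain mean assumed).
imdb_style = sum(user_scores) / len(user_scores)

print(rt_style, imdb_style)
```

A percentage of binary verdicts can easily reach values near 0 or 100, whereas an average of many individual scores tends to be pulled toward the middle of the scale. This is one plausible reason the two histograms differ in shape.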
Now let's turn back to our scatter plot of Rotten Tomatoes and IMDB ratings. Here's what happens if we bin both axes of our original plot.
alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20))
).properties(
    width=800,
    height=400
)
Detail is lost due to overplotting, with many points drawn directly on top of each other.
To form a two-dimensional histogram we can add a count aggregate as before. As both the x and y encoding channels are already taken, we must use a different encoding channel to convey the counts. Here is the result of using circular area by adding a size encoding channel.
alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Size('count()')
).properties(
    width=800,
    height=400
)
Alternatively, we can encode counts using the color channel and change the mark type to bar. The result is a two-dimensional histogram in the form of a heatmap.
alt.Chart(movies_url).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Color('count()')
).properties(
    width=800,
    height=400
)
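The cell counts that the size or color channel encodes can also be computed directly. Here is a minimal NumPy sketch using made-up rating pairs; the bin edges (width 5 for Rotten Tomatoes, width 0.5 for IMDB) are assumptions chosen to resemble the charts above.

```python
import numpy as np

# Made-up (Rotten Tomatoes, IMDB) rating pairs; not taken from the movies data.
rt = np.array([12, 14, 55, 57, 58, 91, 93])
imdb = np.array([4.1, 4.3, 6.2, 6.4, 6.6, 8.1, 8.2])

# Assumed bin edges: width 5 on the Rotten Tomatoes axis, width 0.5 on the IMDB axis.
rt_edges = np.arange(0, 101, 5)
imdb_edges = np.arange(1, 10.5, 0.5)

counts, _, _ = np.histogram2d(rt, imdb, bins=[rt_edges, imdb_edges])

# List the non-empty cells: these counts are what a size or color channel would show.
for i, j in zip(*np.nonzero(counts)):
    print(f"RT bin from {rt_edges[i]}, IMDB bin from {imdb_edges[j]}: {int(counts[i, j])}")
```

Every non-zero entry of `counts` corresponds to one circle or one heatmap cell in the two-dimensional histograms.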
Compare the size-based and color-based 2D histograms above. Which encoding do you think should be preferred? Why? In which plot can you more precisely compare the magnitude of individual values? In which plot can you more accurately see the overall density of ratings?