Resources

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import altair as alt

The Movies Dataset

We will be working with a table of data about motion pictures, taken from the vega-datasets collection. The data includes variables such as the film name, director, genre, release date, ratings, and gross revenues. However, be careful when working with this data: the films are from unevenly sampled years, using data combined from multiple sources. If you dig in you will find issues with missing values and even some subtle errors! Nevertheless, the data should prove interesting to explore...

Let's retrieve the URL for the JSON data file from the vega_datasets package, and then read the data into a Pandas data frame so that we can inspect its contents.

movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
movies = pd.read_json(movies_url)

Now let's peek at the first 5 rows of the table to get a sense of the fields and data types...

movies.head(5)
Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes
0 The Land Girls 146083.0 146083.0 NaN 8000000.0 Jun 12 1998 R NaN Gramercy None None None None NaN 6.1 1071.0
1 First Love, Last Rites 10876.0 10876.0 NaN 300000.0 Aug 07 1998 R NaN Strand None Drama None None NaN 6.9 207.0
2 I Married a Strange Person 203134.0 203134.0 NaN 250000.0 Aug 28 1998 None NaN Lionsgate None Comedy None None NaN 6.8 865.0
3 Let's Talk About Sex 373615.0 373615.0 NaN 300000.0 Sep 11 1998 None NaN Fine Line None Comedy None None 13.0 NaN NaN
4 Slam 1009819.0 1087521.0 NaN 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction None 62.0 3.4 165.0

Histograms

We'll start our transformation tour by binning data into discrete groups and counting records to summarize those groups. The resulting plots are known as histograms.

Let's first look at unaggregated data: a scatter plot showing movie ratings from Rotten Tomatoes versus ratings from IMDB users. We'll provide data to Altair by passing the movies data URL to the Chart method. (We could also pass the Pandas data frame directly to get the same result.) We can then encode the Rotten Tomatoes and IMDB ratings fields using the x and y channels:

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q'),
    alt.Y('IMDB_Rating:Q')
).properties(
    width=800,
    height=400
)

To summarize this data, we can bin a data field to group numeric values into discrete groups. Here we bin along the x-axis by adding bin=True to the x encoding channel. The result is a set of ten bins of equal step size, each corresponding to a span of ten ratings points.

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=True),
    alt.Y('IMDB_Rating:Q')
).properties(
    width=800,
    height=400
)

Setting bin=True uses default binning settings, but we can exercise more control if desired. Let's instead set the maximum bin count (maxbins) to 20, which has the effect of doubling the number of bins. Now each bin corresponds to a span of five ratings points.

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q')
).properties(
    width=800,
    height=400
)

With the data binned, let's now summarize the distribution of Rotten Tomatoes ratings. We will drop the IMDB ratings for now and instead use the y encoding channel to show an aggregate count of records, so that the vertical position of each point indicates the number of movies per Rotten Tomatoes rating bin.

As the count aggregate counts the number of total records in each bin regardless of the field values, we do not need to include a field name in the y encoding.

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count()')
).properties(
    width=800,
    height=400
)

To arrive at a standard histogram, let's change the mark type from circle to bar:

alt.Chart(movies_url).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count()')
).properties(
    width=800,
    height=400
)

We can now examine the distribution of ratings more clearly: we can see fewer movies on the negative end, and a bit more movies on the high end, but a generally uniform distribution overall. Rotten Tomatoes ratings are determined by taking "thumbs up" and "thumbs down" judgments from film critics and calculating the percentage of positive reviews. It appears this approach does a good job of utilizing the full range of rating values.

Similarly, we can create a histogram for IMDB ratings by changing the field in the x encoding channel:

alt.Chart(movies_url).mark_bar().encode(
    alt.X('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count()')
).properties(
    width=800,
    height=400
)

In contrast to the more uniform distribution we saw before, IMDB ratings exhibit a bell-shaped (though negatively skewed) distribution. IMDB ratings are formed by averaging scores (ranging from 1 to 10) provided by the site's users. We can see that this form of measurement leads to a different shape than the Rotten Tomatoes ratings. We can also see that the mode of the distribution is between 6.5 and 7: people generally enjoy watching movies, potentially explaining the positive bias!

Now let's turn back to our scatter plot of Rotten Tomatoes and IMDB ratings. Here's what happens if we bin both axes of our original plot.

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
).properties(
    width=800,
    height=400
)

Detail is lost due to overplotting, with many points drawn directly on top of each other.

To form a two-dimensional histogram we can add a count aggregate as before. As both the x and y encoding channels are already taken, we must use a different encoding channel to convey the counts. Here is the result of using circular area by adding a size encoding channel.

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Size('count()')
).properties(
    width=800,
    height=400
)

Alternatively, we can encode counts using the color channel and change the mark type to bar. The result is a two-dimensional histogram in the form of a heatmap.

alt.Chart(movies_url).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Color('count()')
).properties(
    width=800,
    height=400
)

Compare the size and color-based 2D histograms above. Which encoding do you think should be preferred? Why? In which plot can you more precisely compare the magnitude of individual values? In which plot can you more accurately see the overall density of ratings?