Popcorn Data — Analysing Cinema Seating Patterns (Part I)
What Can Data Analytics Reveal About Your Movie Theatre Habits?
By Noel Mathew Isaac and Vanshiqa Agrawal
In Part II, we analyse the data, visualise it, and build a website for our findings.
Part I — Obtaining the Data for Analysis
Ever felt the crippling disappointment of finding out your favourite seat at the theatre has been booked?
How popular really is your favourite seat?
We wanted to find out more about movie trends in Singapore, from which seats people prefer to the way they like to watch different movies. So we created PopcornData, a website offering a glimpse of Singapore's movie trends, by scraping data, finding interesting insights, and visualising them.
On the website, you can see how people watched different movies at different halls, theatres, and timings! Some unique aspects include heat maps showing the most popular seats and animations showing the order in which seats were bought. This two-part article elaborates on how we obtained the data for the website and our analysis of the data.
Scraping the Data
To implement our idea, the first and perhaps most crucial step was to collect the data. We decided to scrape the website of Shaw Theatres, one of the biggest cinema chains in Singapore.
Starting with basic knowledge of scraping in Python, we initially tried using Python's requests library to get the site's HTML and the BeautifulSoup library to parse it, but quickly realized that the data we required was not present in the HTML we requested. This was because the website was dynamic: it requests the data from an external source using JavaScript and renders the HTML dynamically. When we request the HTML directly, the dynamic part of the website is not rendered, hence the missing data.
To fix this issue, we used Selenium, a web browser automation tool that can first render the website with the dynamic content before getting the HTML.
Problems With Selenium
Getting the Selenium driver to work and fixing minor issues with it was a big learning curve. After countless StackOverflow searches and 'giving up' multiple times, we managed to scrape through (pun intended) and get it to work.
The main issues we faced were:
- Scrolling to a specific portion of the screen to click a button so that the data would be present in the HTML.
- Figuring out how to run headless Selenium on the cloud.
- After deploying the script on Heroku, some of the data was not being scraped even though the script worked properly on the local machine. After racking our brains, we figured out that some pages loaded by Selenium were defaulting to the mobile version of the page. We fixed it by explicitly setting the screen size.
With Selenium and BeautifulSoup, we were finally able to get the data for all the available movie sessions for a particular day!
Sample movie session data:
{
"theatre":"Nex",
"hall":"nex Hall 5",
"movie":"Jumanji: The Next Level",
"date":"18 Jan 2020",
"time":"1:00 PM+",
"session_code":"P00000000000000000200104"
}
We were halfway there! Now we needed to collect the seat data for each movie slot to see which seats were occupied and when they were bought. After going through the Network tab of the website in the Developer Tools, we found that the seat data was being requested from Shaw's API.
The data could be obtained by requesting the URL https://www.shaw.sg/api/SeatingStatuses?recordcode=<session_code>, where the session code was the unique code for each movie session, which we had already scraped earlier.
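In code, this boils down to formatting the endpoint with a session code and issuing a GET request. A minimal sketch using only the standard library (the `build_seat_url` and `fetch_seat_statuses` helper names are ours, not from the original script):

```python
import json
import urllib.request

API_TEMPLATE = "https://www.shaw.sg/api/SeatingStatuses?recordcode={}"

def build_seat_url(session_code):
    """Build the seat-status endpoint URL for one movie session."""
    return API_TEMPLATE.format(session_code)

def fetch_seat_statuses(session_code):
    """Request and decode the JSON seat data for one session.

    This performs a live network call, so only run it against the real API.
    """
    with urllib.request.urlopen(build_seat_url(session_code)) as resp:
        return json.loads(resp.read().decode("utf-8"))

print(build_seat_url("P00000000000000000200104"))
# https://www.shaw.sg/api/SeatingStatuses?recordcode=P00000000000000000200104
```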
The data we got was in JSON format. We parsed it and reordered the seats in ascending order of seat purchase time to obtain an array of JSON objects, where each object contained information about each seat in the movie hall, including seat_number, seat_buy_time, and seat_status.
Sample seat data:
[
{
"seat_status":"AV",
"last_update_time":"2020-01-20 14:34:53.704117",
"seat_buy_time":"1900-01-01T00:00:00",
"seat_number":"I15",
"seat_sold_by":""
},
...,
{
"seat_status":"SO",
"last_update_time":"2020-01-20 14:34:53.705116",
"seat_buy_time":"2020-01-18T13:12:34.193",
"seat_number":"F6",
"seat_sold_by":""
}
]
- seat_number: Unique identifier for a seat in a hall
- seat_status: Indicates the availability of a seat (SO: seat occupied, AV: available)
- seat_buy_time: Time the seat was purchased by the customer
- last_update_time: Time the seat data was last scraped
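Reordering the parsed array by purchase time is straightforward because the timestamps are ISO-8601 strings, which sort correctly as plain text. A small sketch with hypothetical seats (unsold seats carry the 1900 placeholder, so they naturally sort to the front):

```python
import json

raw = """[
  {"seat_status": "AV", "seat_buy_time": "1900-01-01T00:00:00", "seat_number": "I15"},
  {"seat_status": "SO", "seat_buy_time": "2020-01-18T13:12:34.193", "seat_number": "F6"},
  {"seat_status": "SO", "seat_buy_time": "2020-01-18T12:05:10.000", "seat_number": "F5"}
]"""

seats = json.loads(raw)

# ISO-8601 timestamps sort correctly as strings, so no datetime parsing
# is needed; unsold seats keep the 1900 placeholder and sort first.
seats.sort(key=lambda seat: seat["seat_buy_time"])

print([s["seat_number"] for s in seats])  # ['I15', 'F5', 'F6']
```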
Halls have anywhere between 28 and 502 seats, and each seat corresponds to a JSON object in the array. Add to this the fact that there are upwards of 350 movie sessions in a single day, and the amount of data generated is pretty big. Storing data for a single day took about 10 MB. The movie session data was combined with the seat data and stored in a MongoDB database.
We managed to scrape all the movie data from Shaw for January 2020.
A single document in the database
{
"theatre":"Nex",
"hall":"nex Hall 5",
"movie":"Jumanji: The Next Level",
"date":"18 Jan 2020",
"time":"1:00 PM+",
"session_code":"P00000000000000000200104",
"seats":[
{
"seat_status":"AV",
"last_update_time":"2020-01-xx 14:34:53.704117",
"seat_buy_time":"1900-01-01T00:00:00",
"seat_number":"I15",
"seat_sold_by":""
},
...,
{
"seat_status":"SO",
"last_update_time":"2020-01-20 14:34:53.705116",
"seat_buy_time":"2020-01-18T13:12:34.193",
"seat_number":"F6",
"seat_sold_by":""
}
]
}
To view the full document, follow this link: https://gist.github.com/noelmathewisaac/31a9d20a674f6dd8524ed89d65183279
The complete raw data collected can be downloaded here:
Time to Get Our Hands Dirty
It was now time to get our hands dirty by cleaning the data and pulling out relevant information. Using pandas, we parsed the JSON, cleaned it, and made a DataFrame with the data to improve readability and filter it easily.
Since the seat data took a lot of memory, we could not include all of it in the DataFrame. Instead, we aggregated the seat data using Python to obtain the following:
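The idea of swapping the bulky seat arrays for lightweight aggregate columns can be sketched as follows (the two session documents here are hypothetical, shaped like the scraped data above):

```python
import pandas as pd

# Two hypothetical session documents in the shape scraped earlier.
sessions = [
    {"theatre": "Nex", "hall": "nex Hall 5", "movie": "Jumanji: The Next Level",
     "date": "18 Jan 2020", "time": "1:00 PM+",
     "seats": [{"seat_status": "SO"}, {"seat_status": "AV"}]},
    {"theatre": "Lido", "hall": "Lido Hall 1", "movie": "1917",
     "date": "18 Jan 2020", "time": "3:30 PM",
     "seats": [{"seat_status": "SO"}, {"seat_status": "SO"}]},
]

df = pd.DataFrame(sessions)

# Derive aggregate columns, then drop the memory-heavy seat arrays.
df["total_seats"] = df["seats"].apply(len)
df["sold_seats"] = df["seats"].apply(
    lambda seats: sum(s["seat_status"] == "SO" for s in seats))
df = df.drop(columns="seats")

print(df[["movie", "total_seats", "sold_seats"]])
```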
1. Total Seats: Total number of seats available for a movie session
2. Sold Seats: Number of seats sold for a movie session
3. Seat Buy Order: Two-dimensional array showing the order in which seats were bought
[['A_10', 'A_11'], ['A_12'], ['B_4', 'B_7', 'B_6', 'B_5'], ['C_8', 'C_10', 'C_9'], ['B_1', 'B_2'], ['C_6', 'C_7'], ['C_5', 'C_4'], ['B_8', 'B_10', 'B_9'], ['D_8'], ['A_15', 'A_14', 'A_13']]
Each element in the array represents the seats bought at the same time, and the order of elements represents the order in which the seats were purchased.
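One way to build this array (a sketch, with hypothetical seats already sorted by purchase time) is to group sold seats that share an identical purchase timestamp, since those were bought in a single transaction:

```python
from itertools import groupby

# Hypothetical sold seats, already sorted ascending by purchase time.
sold = [
    {"seat_number": "A_10", "seat_buy_time": "2020-01-18T13:12:34.193"},
    {"seat_number": "A_11", "seat_buy_time": "2020-01-18T13:12:34.193"},
    {"seat_number": "A_12", "seat_buy_time": "2020-01-18T14:02:11.500"},
]

# Seats sharing a purchase timestamp were bought together, so each
# group becomes one inner list of the 2-D buy-order array.
buy_order = [[s["seat_number"] for s in grp]
             for _, grp in groupby(sold, key=lambda s: s["seat_buy_time"])]

print(buy_order)  # [['A_10', 'A_11'], ['A_12']]
```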
4. Seat Distribution: Dictionary showing the number of seats that were bought together (in groups of 1, 2, 3 or more)
{
'Groups of 1': 8,
'Groups of 2': 30,
'Groups of 3': 9,
'Groups of 4': 3,
'Groups of 5': 1
}
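The distribution falls straight out of the buy-order array: count how many inner lists have each length. A sketch using the sample buy-order array shown earlier (its counts differ from the dictionary above, which comes from a different session):

```python
from collections import Counter

# Sample buy-order array (each inner list = one purchase transaction).
buy_order = [['A_10', 'A_11'], ['A_12'], ['B_4', 'B_7', 'B_6', 'B_5'],
             ['C_8', 'C_10', 'C_9'], ['B_1', 'B_2'], ['C_6', 'C_7'],
             ['C_5', 'C_4'], ['B_8', 'B_10', 'B_9'], ['D_8'],
             ['A_15', 'A_14', 'A_13']]

# Tally group sizes, then label them in the "Groups of N" style.
group_sizes = Counter(len(group) for group in buy_order)
seat_distribution = {f"Groups of {size}": count
                     for size, count in sorted(group_sizes.items())}

print(seat_distribution)
# {'Groups of 1': 2, 'Groups of 2': 4, 'Groups of 3': 3, 'Groups of 4': 1}
```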
5. Seat Frequency: Dictionary showing the number of times each seat in a hall was bought over the month
{'E_7': 4, 'E_6': 5, 'E_4': 11, 'E_5': 9, 'E_2': 2, 'E_1': 2, 'E_3': 7, 'D_7': 15, 'D_6': 17, 'C_1': 33, 'D_2': 15, 'D_1': 14, 'B_H2': 0, 'B_H1': 0, 'D_4': 45, 'D_5': 36, 'D_3': 32, 'C_3': 95, 'C_4': 94, 'A_2': 70, 'A_1': 70, 'B_2': 50, 'B_1': 47, 'C_2': 37, 'C_6': 53, 'C_5': 61, 'B_4': 35, 'B_3': 40}
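A frequency table like this amounts to tallying sold seat numbers across every session held in the hall. A minimal sketch with hypothetical sessions:

```python
from collections import Counter

# Sold seat numbers from three hypothetical sessions in the same hall.
sessions_sold_seats = [
    ["C_3", "C_4", "D_4"],
    ["C_3", "D_4", "D_5"],
    ["C_3", "A_1"],
]

# Accumulate one tally across all sessions for the month.
seat_frequency = Counter()
for sold in sessions_sold_seats:
    seat_frequency.update(sold)

print(dict(seat_frequency))
# {'C_3': 3, 'C_4': 1, 'D_4': 2, 'D_5': 1, 'A_1': 1}
```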
6. Rate of Buying: Two dictionaries, with the first dictionary showing the time left to a movie showing (in days) and the corresponding accumulated number of tickets bought in the second dictionary.
{"1917": [4.1084606481481485..., 2.566423611111111, 2.245578703703704, 2.0319560185185184, 1.9269907407407407, 1.8979513888888888....],
...}
{"1917": [1, 3, 8, 10, 11, ...],
...}
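These two parallel lists can be derived from the showtime and the sorted purchase timestamps: the first records how many days remained before the showing when each purchase happened, the second the running ticket total. A sketch with hypothetical timestamps (here each purchase is a single ticket, so the running total is simply 1, 2, 3, ...):

```python
from datetime import datetime

show_time = datetime(2020, 1, 18, 13, 0)

# Hypothetical purchase timestamps for one movie, sorted ascending.
buy_times = [
    datetime(2020, 1, 14, 10, 30),
    datetime(2020, 1, 16, 20, 15),
    datetime(2020, 1, 18, 12, 45),
]

# First list: days remaining until the showing at each purchase.
days_left = [(show_time - t).total_seconds() / 86400 for t in buy_times]

# Second list: running total of tickets sold at each of those moments.
tickets_so_far = list(range(1, len(buy_times) + 1))

print(tickets_so_far)  # [1, 2, 3]
```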
The cleaned data can be viewed here:
Finally, we were done! (with 20% of the work)
With our data scraped and cleaned, we could now get to the fun part: analysing the data to find patterns amongst the popcorn litter.
To find our analysis of the data and the interesting patterns we found, check out Part II of this article:
Source: https://towardsdatascience.com/popcorn-data-analysing-cinema-seating-patterns-part-1-a0b2a5c2c19a