9/10/2023

IMDb raw data set

IMDb (the Internet Movie Database) provides a set of files containing a core subset of its data for movies, TV shows, actors, crew, and related entities.

I needed a reasonably large data set to support some technology evaluations I was doing - and this seemed like a good fit.

In this post I will describe how I first transformed the raw IMDb data and then loaded it into a relational database (actually, two databases - H2 and MySQL).

IMDb provides the data as a set of files, together with a brief overview of their contents. They also clarify the limitations on how this data can be used:

“Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.”

The raw uncompressed data files contain approximately 6.2 million titles and approximately 9.6 million people, who I shall refer to as “talent” - meaning actors, directors, producers, and other cast & crew members.

A Warning about the IMDb Data Set

The data includes adult-content movies and videos. You can filter these titles out using the is_adult column in the title table. Do not use this data, or present it in demos, if that content may cause offense. How you censor the data is up to you - but always be respectful of your audience.

Looking at the overview, you will see that there is a fair amount of denormalization in the source files. I will re-arrange this data into structures more closely resembling a 3rd normal form relational schema - not exactly - but most of the way there.

We will end up with a data set looking like this:

[image]

To load the raw data into a relational database schema, we will perform the following steps:

- Get the scripts we will be using from GitHub.
- Run a Python script to re-arrange the data into a new set of files.
- Optional step: Run a Python script to extract a subset of the raw data.
- Load those new files into a database (we will use H2 and MySQL as examples).

H2 is a good choice if you don’t have MySQL available, and just want to get up-and-running quickly for demo or test purposes. See this post for a quick guide to installing and connecting to an H2 server and database.

This process assumes you have Python installed (I used Python 3.8).

The GitHub repository is here: relational-sample-imdb-data

Download the zip file of the repo to your PC, and unzip it. I will assume the files are in a new directory, with nothing else in it. I will assume this directory has been renamed to “imdb_test_data”.

Create a new directory “csv” in “imdb_test_data”.

Create a new directory “sampled” in “csv”.

I placed the IMDb data files into the new directory from step 1.

Open a command prompt, navigate to the imdb_test_data directory and run the following command:

    python 01_process_imdb_files.py

I did nothing to optimize/parallelize how the job runs. The script:

- Generates a new set of CSV files, containing re-arranged and normalized data.
- Places these new CSV files in the “csv” directory.

At the end of the process, you should see this:

[image]

And there will be the following new files in your “csv” directory:

[image]

You may not wish to load the entire set of data into a relational database. It is approximately 100 million rows of data, in total.

You can optionally choose to run the 02_sample_titles.py script to create a second set of CSV files, containing a smaller sample of data. The files are placed in the “sampled” directory.

To control the sample size, edit the Python script’s “sample_size” variable. Read the related comments in the file!

5. Load the Data into a Relational Database

For loading into H2

At the command prompt, navigate to the “h2” directory.

Execute the Windows batch script 01_run_data_load_script.bat.

This connects to your H2 server (which must already be running, of course!), creates the schema and tables, and loads data from the sampled CSV files. You may need to adjust the script, depending on the name and location of your H2 database.

For MySQL, navigate to the “mysql” directory and run the loader script. This is very similar to the H2 process - however, there are some additional considerations:

- You will need to edit the my_conf.cnf file to provide the relevant credentials.
- You will need to have created a user ID already in MySQL.
- MySQL expects upload files to be available in a specific location.
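To make the idea of re-arranging denormalized source data towards 3rd normal form more concrete, here is a minimal sketch in Python. It is not taken from 01_process_imdb_files.py. The column names (title_id, primary_title, genres), the comma-separated genres column, the `\N` missing-value marker, and the function name normalize_genres are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical raw extract: tab-separated, one row per title, with several
# genres packed into a single comma-separated column (denormalized).
RAW_TSV = (
    "title_id\tprimary_title\tgenres\n"
    "t1\tAlpha\tComedy,Drama\n"
    "t2\tBeta\tDrama\n"
    "t3\tGamma\t\\N\n"
)


def normalize_genres(raw_rows):
    """Split the packed genres column into genre rows and title-genre link rows."""
    genre_ids = {}     # genre name -> surrogate key for a "genre" table
    title_genres = []  # (title_id, genre_id) rows for a link table
    for row in raw_rows:
        packed = row["genres"]
        # Assumption: an empty string or \N means "no genres recorded".
        if packed in ("", r"\N"):
            continue
        for name in packed.split(","):
            # Reuse the existing key, or assign the next one for a new genre.
            genre_id = genre_ids.setdefault(name, len(genre_ids) + 1)
            title_genres.append((row["title_id"], genre_id))
    return genre_ids, title_genres


raw_rows = list(csv.DictReader(io.StringIO(RAW_TSV), delimiter="\t"))
genres, links = normalize_genres(raw_rows)
print(genres)
print(links)
```

Each genre name ends up stored once, and titles refer to it through a link row, which is the kind of structure a relational loader can then import one table at a time.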
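The post mentions two ways of cutting the data down: filtering on the is_adult column and taking a smaller sample controlled by a “sample_size” variable. The sketch below shows one way these could be combined. It is not the logic of 02_sample_titles.py; the function name sample_titles, the CSV columns other than is_adult, and the assumption that adult titles are stored as the string "1" are all hypothetical.

```python
import csv
import io
import random


def sample_titles(rows, sample_size, seed=0, drop_adult=True):
    """Return up to sample_size title rows, chosen at random."""
    # Assumption: is_adult holds the string "1" for adult titles.
    candidates = [
        r for r in rows if not (drop_adult and r["is_adult"] == "1")
    ]
    if sample_size >= len(candidates):
        return candidates
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(candidates, sample_size)


# Tiny in-memory stand-in for a title CSV file (hypothetical columns).
TITLE_CSV = """title_id,primary_title,is_adult
t1,Alpha,0
t2,Beta,1
t3,Gamma,0
t4,Delta,0
"""

rows = list(csv.DictReader(io.StringIO(TITLE_CSV)))
sample = sample_titles(rows, sample_size=2)
print([r["title_id"] for r in sample])
```

If related tables (for example, cast and crew rows) are sampled as well, they would need to be restricted to the sampled title IDs so that the loaded data stays consistent.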