{
    "componentChunkName": "component---src-templates-portofolio-post-js",
    "path": "/data-modeling-cassandra",
    "result": {"data":{"markdownRemark":{"id":"7f9eb227-9d3f-5a1d-9b81-aff6c52bd5ab","html":"<h1>🚀 Data Modeling With Apache Cassandra + Docker</h1>\n<p><figure class=\"gatsby-resp-image-figure\" style=\"\">\n    <span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 361px; \"\n    >\n      <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 103.515625%; position: relative; bottom: 0; left: 0; background-image: url('data:image/svg+xml,%3csvg%20xmlns=\\'http://www.w3.org/2000/svg\\'%20width=\\'400\\'%20height=\\'413\\'%20viewBox=\\'0%200%20400%20413\\'%20preserveAspectRatio=\\'none\\'%3e%3cpath%20d=\\'M0%20207v206h401V0H0v207m23-92v92h153V23H23v92m203-12v79h152V23H226v80M24%2040v15h150V24H24v16m203%200v15h150V24H227v16M35%2039v7h16V33l-8-1h-8v7m201-6l-1%207v6h16V33l-8-1-7%201M24%2081v25h150V56H24v25m203%200v25h150V56H227v25M24%20156v49h150v-98H24v49m203-12v37h150v-74H227v37M36%20114l2%202c2%200%201%202-1%202v1c2%200%203%202%201%202-3%200-2%202%200%202v1c-4%200-2%202%203%202h4v-14h-4c-4%200-5%200-5%202m203%200l2%202c2%200%201%202-1%202v1c2%200%203%202%201%202-3%200-2%202%200%202v1c-4%200-2%202%203%202h4v-14h-4c-4%200-5%200-5%202M36%20139l2%201%201%201-1%201c-3%200-2%202%200%202v1c-2%200-3%202%200%202%202%200%201%202-1%202-1%201%201%201%203%201h5v-13h-4c-4%200-5%200-5%202m203%200l2%201%201%201-1%201c-3%200-2%202%200%202v1c-2%200-3%202%200%202%202%200%201%202-1%202-1%201%201%201%203%201h5v-13h-4c-4%200-5%200-5%202m-167%204c0%203%200%203-2%202v-2c3-1%200-2-4-2h-6v4c0%203%200%204%202%204l1-2c0-3%201-4%204-1v1l-2%201%203%201%203-1h1l4%201c2%200%202%200%201-1-2-2-3-5-1-5v-2c-3-3-4-2-4%202m16-1l2%201-1%202-2%203h8c0%202%203%201%203-2%200-1%200-2%201-1l1%202%201%202%201-2%201-3%201%203%202%202%201-1h1c3%202%206%201%207-2%200-4-5-7-7-3h-2c-2-3-10-3-10%200h-2c-1-2-6-3-6-1m-52%2021l2%202%201%201-1%201h-1v1c3%200%202%202%200%202v1c2%200%203%202%201%202l-
2%201%205%201h4v-14h-4c-4%200-5%200-5%202m203%200l2%202v1c-2%200-3%202%200%202%202%200%201%202-1%202v1c2%200%203%202%201%202l-2%201%204%201h5v-14h-4c-4%200-5%200-5%202m-115%201c1%204%200%206-2%205v-3h-6l-3-1c-2%201-3%205-1%205v1l-1%202h13c2%202%203%200%203-6%200-4%200-5-2-5l-1%202m-88%2023l2%202c2%200%201%202-1%202v1c2%200%203%202%201%202l-2%201%202%201c2%200%201%202-1%202-1%201%201%201%203%201h5v-14h-4c-4%200-5%200-5%202m14%200v6l1%205h8c9%200%2012-2%207-5v-1l2-1c0-2-5-1-6%201l1%202v1c-1%202-2%201-2-2%200-2-1-3-3-3-3%200-4%202-2%202%201%200%201%201-1%202s-2%201-2-3-2-6-3-4m42%207c0%205%203%206%203%201l1-3v3l1%203c1%200%202-1%202-3l1-3v3c0%203%203%205%203%201h2c3%203%206%202%206-1%201-4-3-7-6-3h-2c-1-2-3-2-6-2h-5v4M23%20325v67h133v-26c0-17%200-25-1-24l-1%2024v24H89l-65%201v-50h132v-26c0-17%200-25-1-24l-1%2024v25H24v-50h132v-33H23v68m1-51v15h131v-31H24v16m11-1v7h16v-13l-8-1h-8v7m1%2075l2%202c2%200%201%202-1%202v1l2%201-2%201v1c2%200%203%202%201%202l-2%201%205%201h4v-14h-4c-4%200-5%200-5%202m37%201l-1%203c0%202%200%203-2%202v-1c2-1%200-3-5-3l-4%201-1%204c0%205%202%206%203%201%200-3%201-4%203-1l1%201-2%201c0%202%203%202%205%201h4l3%201v-2c-2-2-3-4-1-4v-4h-3m16%202c-2%201-1%202%201%202l-1%201c-2%201-2%203%200%204h7c1%202%202%201%202-2%200-2%200-3%201-2l1%202%201%203%201-3%201-3%201%203c0%203%202%204%203%202%200-2%200-2%202%200s5%201%206-2c0-6-3-8-7-4h-2c-1-2-9-2-10%200-1%201-1%201-3-1h-4m-53%2021l2%202c2%200%201%202-1%202v1c2%200%203%202%201%202l-2%201%202%201c2%200%201%202-1%202-1%201%201%201%203%201h5v-14h-4c-4%200-5%200-5%202m14%200v6l1%205h8c9%200%2011-1%208-5v-1l1-1-3-1c-2%200-2%201-2%203l-1%203-1-3c0-2-1-3-3-3-3%200-4%202-2%202%201%200%201%201-1%202s-2%201-2-3-2-6-3-4m42%207c0%205%203%206%203%201l1-3v3l1%203c1%200%202-1%202-3l1-3v3c0%203%203%205%203%201h2c3%203%206%202%206-1%201-4-3-7-6-3h-2c-1-2-3-2-6-2h-5v4\\'%20fill=\\'%23d3d3d3\\'%20fill-rule=\\'evenodd\\'/%3e%3c/svg%3e'); background-size: cover; display: block;\"\n  ></span>\n  <img\n        
class=\"gatsby-resp-image-image\"\n        alt=\"ERD project \"\n        title=\"ERD project Sparkify\"\n        src=\"/static/48030194f15d639dff8f3164ec175ec4/39d76/sparkify.png\"\n        srcset=\"/static/48030194f15d639dff8f3164ec175ec4/6f3f2/sparkify.png 256w,\n/static/48030194f15d639dff8f3164ec175ec4/39d76/sparkify.png 361w\"\n        sizes=\"(max-width: 361px) 100vw, 361px\"\n        style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n        loading=\"lazy\"\n      />\n    </span>\n    <figcaption class=\"gatsby-resp-image-figcaption\">ERD project Sparkify</figcaption>\n  </figure></p>\n<h2><strong>Overview</strong></h2>\n<p>In this project, we model data with Apache Cassandra and build an ETL pipeline using Python. <strong>Case study</strong>: A startup in Indonesia wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The startup currently collects event logs in CSV format, and the analytics team is particularly interested in understanding which songs users are listening to.</p>\n<h2><strong>Song Dataset</strong></h2>\n<p>The song dataset is a subset of the <a href=\"http://millionsongdataset.com/\">Million Song Dataset</a>.</p>\n<p>Sample record:</p>\n<div class=\"gatsby-highlight\" data-language=\"json\"><pre class=\"language-json\"><code class=\"language-json\"><span class=\"token punctuation\">{</span><span class=\"token property\">\"num_songs\"</span><span class=\"token operator\">:</span> <span class=\"token number\">1</span><span class=\"token punctuation\">,</span> <span class=\"token property\">\"artist_id\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"ARJIE2Y1187B994AB7\"</span><span class=\"token punctuation\">,</span> <span class=\"token property\">\"artist_latitude\"</span><span class=\"token operator\">:</span> <span class=\"token null keyword\">null</span><span 
class=\"token punctuation\">,</span> <span class=\"token property\">\"artist_longitude\"</span><span class=\"token operator\">:</span> <span class=\"token null keyword\">null</span><span class=\"token punctuation\">,</span> <span class=\"token property\">\"artist_location\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"\"</span><span class=\"token punctuation\">,</span> <span class=\"token property\">\"artist_name\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Line Renaud\"</span><span class=\"token punctuation\">,</span> <span class=\"token property\">\"song_id\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"SOUPIRU12A6D4FA1E1\"</span><span class=\"token punctuation\">,</span> <span class=\"token property\">\"title\"</span><span class=\"token operator\">:</span> <span class=\"token string\">\"Der Kleine Dompfaff\"</span><span class=\"token punctuation\">,</span> <span class=\"token property\">\"duration\"</span><span class=\"token operator\">:</span> <span class=\"token number\">152.92036</span><span class=\"token punctuation\">,</span> <span class=\"token property\">\"year\"</span><span class=\"token operator\">:</span> <span class=\"token number\">0</span><span class=\"token punctuation\">}</span></code></pre></div>\n<h2><strong>Log Dataset</strong></h2>\n<p>The log dataset is generated by <a href=\"https://github.com/Interana/eventsim\">Event Simulator</a>.</p>\n<p>Sample record:</p>\n<div class=\"gatsby-highlight\" data-language=\"csv\"><pre class=\"language-csv\"><code class=\"language-csv\"><span class=\"token value\">{</span><span class=\"token value\">\"artist\"</span><span class=\"token value\">: null</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"auth\"</span><span class=\"token value\">: </span><span class=\"token value\">\"Logged In\"</span><span class=\"token punctuation\">,</span><span 
class=\"token value\"> </span><span class=\"token value\">\"firstName\"</span><span class=\"token value\">: </span><span class=\"token value\">\"Walter\"</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"gender\"</span><span class=\"token value\">: </span><span class=\"token value\">\"M\"</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"itemInSession\"</span><span class=\"token value\">: 0</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"lastName\"</span><span class=\"token value\">: </span><span class=\"token value\">\"Frye\"</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"length\"</span><span class=\"token value\">: null</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"level\"</span><span class=\"token value\">: </span><span class=\"token value\">\"free\"</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"location\"</span><span class=\"token value\">: </span><span class=\"token value\">\"San Francisco-Oakland-Hayward, CA\"</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"method\"</span><span class=\"token value\">: </span><span class=\"token value\">\"GET\"</span><span class=\"token punctuation\">,</span><span class=\"token value\">\"page\"</span><span class=\"token value\">: </span><span class=\"token value\">\"Home\"</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"registration\"</span><span class=\"token value\">: 1540919166796.0</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token 
value\">\"sessionId\"</span><span class=\"token value\">: 38</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"song\"</span><span class=\"token value\">: null</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"status\"</span><span class=\"token value\">: 200</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"ts\"</span><span class=\"token value\">: 1541105830796</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"userAgent\"</span><span class=\"token value\">: </span><span class=\"token value\">\"\\\"</span><span class=\"token value\">Mozilla\\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\\/537.36 (KHTML</span><span class=\"token punctuation\">,</span><span class=\"token value\"> like Gecko) Chrome\\/36.0.1985.143 Safari\\/537.36\\</span><span class=\"token value\">\"\"</span><span class=\"token punctuation\">,</span><span class=\"token value\"> </span><span class=\"token value\">\"userId\"</span><span class=\"token value\">: </span><span class=\"token value\">\"39\"</span><span class=\"token value\">}</span></code></pre></div>\n<h2>Schema</h2>\n<h4>Fact Table</h4>\n<p><strong>songplays</strong> - records in log data associated with song plays i.e. 
records with page <code class=\"language-text\">NextSong</code></p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent</code></pre></div>\n<h4>Dimension Tables</h4>\n<p><strong>user_session</strong> - user sessions in the app</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">session_id, user_id, artist, firstname, iteminsession, lastname</code></pre></div>\n<p><strong>user_songs</strong> - songs played by users</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">song, user_id, firstname, lastname</code></pre></div>\n<p><strong>session_item</strong> - items in a session</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">session_id, iteminsession, artist, length, song</code></pre></div>\n<h2>Project Files</h2>\n<p><code class=\"language-text\">sql_queries.py</code> -> contains the queries for dropping and creating the fact and dimension tables, as well as the insert query templates.</p>\n<p><code class=\"language-text\">create_tables.py</code> -> contains the code for setting up the database. Running this file creates <strong>sparkify</strong> along with the fact and dimension tables.</p>\n<p><code class=\"language-text\">modeling-data.ipynb</code> -> a Jupyter notebook for testing. 
</p>\n<p><code class=\"language-text\">etl.py</code> -> reads and processes the files in the event_data directory</p>\n<p><code class=\"language-text\">lib.py</code> -> imports the libraries that are used</p>\n<p><code class=\"language-text\">event_datefile_new.csv</code> -> output of the ETL process</p>\n<h2>Environment</h2>\n<p>Python 3.6 or above</p>\n<p>Apache Cassandra</p>\n<p>cassandra - Cassandra database adapter for Python</p>\n<h2>How to run</h2>\n<p>Run the driver program <code class=\"language-text\">main.py</code> as below.</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">python main.py</code></pre></div>\n<p>The <code class=\"language-text\">create_tables.py</code> and <code class=\"language-text\">etl.py</code> files can also be run independently as below:</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">python create_tables.py\npython etl.py</code></pre></div>\n<h4>Reference:</h4>\n<h4><a href=\"https://github.com/datastax/python-driver\">Cassandra</a></h4>\n<p><a href=\"https://cassandra.apache.org/\">Cassandra Documentation</a></p>","excerpt":"🚀 Data Modeling With Apache Cassandra + Docker  Overview  In this project, we model data with Apache Cassandra and build an ETL pipeline using…","frontmatter":{"date":"September 10, 2021","slug":"/data-modeling-cassandra","title":"Data Modeling With Apache Cassandra + Docker","description":"Data Modeling With Apache Cassandra + Docker","featuredImage":{"childImageSharp":{"gatsbyImageData":{"layout":"fullWidth","backgroundColor":"#282828","images":{"fallback":{"src":"/static/48030194f15d639dff8f3164ec175ec4/96181/sparkify.png","srcSet":"/static/48030194f15d639dff8f3164ec175ec4/96181/sparkify.png 361w","sizes":"100vw"},"sources":[{"srcSet":"/static/48030194f15d639dff8f3164ec175ec4/bbc2e/sparkify.webp 
361w","type":"image/webp","sizes":"100vw"}]},"width":1,"height":1.033240997229917}}}}}},"pageContext":{"id":"7f9eb227-9d3f-5a1d-9b81-aff6c52bd5ab","previous":{"id":"6aa640b2-c907-535f-a5c1-642882a4b5a9","frontmatter":{"slug":"/data-visualization-postgresql-olympics","template":"portofolio-post","title":"Data Visualization and Analytics With PostgresSQL  Olympics Dataset"}},"next":{"id":"f9b5001f-1899-56ed-8b19-da018df01b46","frontmatter":{"slug":"/web-scraping-trademap-org","template":"portofolio-post","title":"Web Scraping trademap.org  with Selenium and BeautifulSoup (Bs4) "}}}},
    "staticQueryHashes": ["228695001","2744905544","4267595483"]}