dsc100-wi23

The_Data_Lifecycle

DSC 100: Introduction to Data Management

Description:

Databases are at the heart of modern commercial application development. Their use extends beyond this to many other environments and domains where large amounts of data must be stored for efficient update, retrieval, and analysis. This course provides a comprehensive introduction to data management, covering topics such as the relational data model, SQL, formal query languages, query evaluation, data storage and indexes, NoSQL databases, conceptual design, integrity constraints, design theory, normal forms, and data quality. Through a combination of lectures, demos, and hands-on exercises, students will gain a strong foundation in database management concepts and practices, with a focus on the use of SQL and the relational model. The course begins with an overview of course organization and an introduction to the relational data model, followed by lessons on SQL basics and different types of joins in SQL. In the following weeks, students will learn advanced SQL queries, formal query languages, and query evaluation. The course also covers topics related to data storage and indexing, including hard disk and file organization, index organization, and creating indexes in SQL. In the latter half of the course, students will study NoSQL databases, conceptual design, integrity constraints, and design theory, culminating in a discussion of normal forms. The course includes a discussion of data quality, including techniques for dealing with missing data and entity deduplication. Upon completing the course, students will be equipped with the skills and knowledge necessary for designing, querying, and maintaining a relational database.

Instructional team:

Instructor:

Babak Salimi, bsalimi@ucsd.edu

Course Assistants:

Baharan Khatami, skhatami@ucsd.edu

Divija Devarla, ddevarla@ucsd.edu

Pragna Elavarthi pelavarthi@ucsd.edu

Megha Agarwal meagarwal@ucsd.edu

Lectures:

Tuesdays and Thursdays at 6:30pm-7:50pm

Note: The lectures for this course will be automatically recorded and made available for students to watch at their convenience. While attending the lectures in person is not required, it is highly recommended as it provides an opportunity to ask questions and interact with the instructor and other students. The recordings are meant to be a supplement to the in-person lectures and should not be used as a replacement for attendance

Office Hours:

Babak Salimi, Wedensdays 4PM - 5PM

Baharan Khatami, Fridays 3PM - 4PM

Divija Devarla, Thursdays 9AM - 10AM

Pragna Elavarthi, Mondays 10AM - 11AM

Megha Agarwal, Tuesdays 4PM - 5 PM

Note: Office hours will be held via Zoom (link can be found on the Canvas calendar).

Piazza: link (Requires access code posted on Canvas)

Have questions? Please email Babak Salimi (bsalimi@ucsd.edu) and at least one of the TAs for questions about logistics. All other questions should be discussed on Piazza.

Course Workload

(subject to change)

Homework (60%): There will be weekly homeworks consisting of written problem-solving and programming assignments (50%) and web quizzes (10%). The purpose of the homeworks is to provide students with the opportunity to apply and practice the concepts and skills learned in the lectures. Homeworks will be more rigorous than the web quizzes and will require students to demonstrate a deeper understanding of the material. The purpose of the web quizzes is to help students reinforce their understanding of the course material by answering review questions based on the previous lectures. The following is a list of the homeworks and web quizzes in the course:

Due dates: Please note that the due dates for both the homework assignments and web quizzes can be found on the Canvas calendar. Additionally, the homework assignment due dates are also available on the course webpage calendar (below). Web quizzes are due on Mondays of the following week in which they are released.

Note: Homework assignments can be completed in teams of size 2. This means that you can work with one other student to complete the assignment and submit a single solution. It is expected that each member of the team will contribute equally to the solution and have a thorough understanding of the material. Collaboration is encouraged as it can help deepen understanding and facilitate learning, but it is important that all team members understand the work being submitted and are able to explain it if necessary. Plagiarism and cheating will not be tolerated and will result in disciplinary action.

Late Day Policy: You have up to 3 late days to use on homework assignments. These late days can be used in 24-hour chunks. You can only use one late day on each assignment. Late days are meant to be a safety net and should not be used as a convenience. It is expected that you will not need to use your late days and that you will submit all assignments on time. If you use all of your late days, it is likely that you are not managing your time effectively and will need to adjust your study habits. Please note that no excuses for not submitting assignments on time will be accepted once you have exhausted your late days.

Midterm (15%): The midterm exam will cover all topics up until Formal Query Languages. It will be held on canvas.

Final Exam (25%): The final exam will be comprehensive and will be held on canvas.

Both Midterm and Final exam will be held remote.

Exam Dates:

Extra Credit:

Calender:

(subject to change)

Week Lecture Date Course Topic Lecture Description Assignments Slides Discussion Optional Reading Lecture Motivation                
1 Jan 10 Course organization and Introduction This lecture will cover the organization and structure of the course, including the syllabus, grading policies, and course materials. We will also introduce the basic concepts of data management and the importance of databases in modern society.   Slides Demo 1: Relational Data Model   Text Book: Sec. 2.1
Blog posts: Data Management: What It Is, Importance, And Challenges
This lecture will provide an overview of the course and help students understand the goals and objectives of the class.                
  Jan 12 Relational Data Model This lecture will introduce the relational data model, including a demonstration of the SQLite database management system and an example relational database. We will also discuss the connection to the DataFrame data model and how it is used in data analysis. Homework 1 - release: SQLITE and SQL Basics Slides   Text Book: Sec. 2.2, 2.3 Blog posts: Why You Should Learn SQL, Preventing the Death of the Dataframe The relational model is a fundamental concept in data management and is the basis for many popular database systems. Understanding the principles of the relational model is essential for data scientists.                
  Jan 16         Introduction to SQL                    
2 Jan 17 SQL Basics This lecture will introduce the basic syntax and semantics of SQL, including the use of the SELECT, FROM, and WHERE clauses to retrieve data from a database. Task: Completion of Postgres Setup for Homework 3 Web Quiz 1: Data Models and Basic SQL Slide
Demo 2: SQL Basics
    SQL is a widely used language for interacting with relational databases, and having a strong understanding of SQL is important for data scientists.                
  Jan 19 Joins in SQL Part 1 This lecture will cover the various types of joins in SQL, including inner, outer, and cross joins. The use of join clauses to combine information from multiple tables will also be discussed.   Slide     Joins are an important tool for combining data from multiple tables and are used in many complex queries. Understanding how to use joins effectively is essential for working with large datasets.                
  Jan 23       Slides SQL Basics and Joins                    
3 Jan 24 Joins in SQL Part 2 This lecture will continue the discussion of joins in SQL, with a focus on practical examples and a demonstration using the SQLite database management system. Homework 2 - release: Basic SQL Queries Slide Demo 3: Joins in SQL                      
  Jan 26 Grouping and Aggregation This lecture will introduce the concepts of grouping and aggregation in SQL, and show how to write complex analytical queries using these techniques. We will also provide a demonstration of aggregate functions in SQL. Web Quiz 2: SQL Aggregates Slides
Demo 4: AGGREGATES IN SQL
Demo 5: AGGREGATES IN SQL (real data)
    Grouping and aggregation are powerful tools for analyzing and summarizing data, and are frequently used in data analysis and reporting.                
  Jan 30       Slides SQL Group and Aggregation SQL Group and Aggregation                  
4 Jan 31 Advanced SQL Queries Part 1 This lecture will cover complex nested queries and the use of next queries in the WHERE, FROM, and SELECT clauses of an SQL statement. Homework 3 - release: Advanced SQL and PostgreSQL Slide     Writing complex queries is an essential skill for working with large and complex datasets, and understanding how to do so is crucial for data scientists.                
  Feb 2 Advanced SQL Queries Part 2 This lecture will continue the discussion of advanced SQL queries, covering expression quantifiers in SQL, monotone versus non-monotone queries, and set operations in SQL. Web Quiz 3: Advanced SQL Slides
Demo 6: Nested Queries and Set Oprations In SQL
                     
  Feb 6       Slides Nested Queries                    
5 Feb 7 Formal Relational Query Languages Part 1 This lecture will introduce the basics of relational algebra, extended relational algebra, and their connection to data frames and SQL. Homework 4 - release: Relational Algebra Slides     Understanding Formal Relational languages, such as relational algebra, is crucial for understanding how relational database systems operate.                
  Feb 9 Formal Relational Query Languages Part 2 This lecture continues our discussion on the connection between relational algebra and SQL. Web Quiz 4: Relational Algebra                        
  Feb 13         Midterm Sample Questions
Part I
Part II
                   
  Feb 14 Interactive Data Analysis Using SQL                            
  Feb 16 Midterm Exam                            
  Feb 20       Slides Relational Algebra, Algebra of Bags                    
7 Feb 21 Data Storage, Indexes and Size Estimation Part 1 This lecture will cover the basics of data storage, including the structure of hard disks and file systems, as well as the use of indexes to improve the performance of database queries. We will also discuss the differences between clustered and unclustered indexes.   Slide Recording     Data storage is a critical aspect of database management, and understanding the different options and trade-offs is essential for optimizing the performance of a database system.                
  Feb 23 Data Storage, Indexes and Size Estimation Part 2 This lecture will continue the discussion of data storage and indexes, with a focus on creating indexes in SQL and the use of size estimation techniques to optimize queries.   Recording                      
  Feb 27       Slides Database indexes, Conceptual Design, ER Diagram Database indexes, Conceptual Design, ER Diagram                  
8 Feb 28 Conceptual Design The purpose of the conceptual design phase is to build a conceptual model based upon the previously identified requirements, but closer to the final physical model. A commonly-used conceptual model is called an entity-relationship model. This lecture will introduce the concepts of conceptual design and entity-relationship modeling, and show how to use these tools to design a database. Web Quiz 5: Query Evaluation and Indexes Homework 5 - release: Entity Relationship Diagrams, Conceptual Design E/R Diagrams Constraints     Conceptual design is an important step in the database design process, and understanding the principles of entity-relationship modeling is essential for creating a well-structured database.                
Mar 2 Design Theory This lecture will cover the database design process, including the importance of functional dependencies and normal forms. We will also introduce the implication problem and the closure algorithm, which are used to determine functional dependencies.   Slide     Design theory is a fundamental concept in database design, and understanding the principles of normalization and functional dependencies is essential for creating a well-structured database.                  
  Mar 6         Database Design Theory                    
9 Mar 9 Normal Forms This lecture will introduce the concept of Boyce-Codd normal form (BCNF) in database design. We will discuss the importance of BCNF in ensuring the integrity and performance of a database, and show how to apply BCNF to a design. Web Quiz 6: Conceptual Design and Design Theory Slides     Normalization is a technique used in database design to minimize redundancy and eliminate anomalies. It is an essential skill for designing effective and efficient databases.                
  Mar 13         Normal Forms                    
10 Mar 14 Data Quality This lecture will introduce the concept of data quality and its common dimensions, then it will focus on simple tricks to deal with missing data in SQL.   Slide     Data quality is an important concern in database management, and understanding how to identify and address issues of poor quality is essential for ensuring the accuracy and reliability of a database.                
  Mar 16 Final exam review             Mar 21 Final Exam            

Grading Scheme

The grading scheme is a hybrid of absolute and relative grading. The absolute cutoffs are based on your absolute total score. The relative bins are based on your position in the total score distribution of the class. The better grade among the two (absolute-based and relative-based) will be your final grade.

Grade Absolute Cutoff (>=) Relative Bin (Use strictest)
A+ 96 Highest 5%
A 93 Next 10% (5-15)
A- 90 Next 15% (15-30)
B+ 83 Next 15% (30-45)
B 75 Next 15% (45-60)
B- 70 Next 15% (60-75)
C+ 65 Next 5% (75-80)
C 60 Next 5% (80-85)
C- 55 Next 5% (85-90)
D 50 Next 5% (90-95)
F < 50 Lowest 5%

Example

Suppose the total score is 82 and the percentile is 33%. So, the relative grade is B-, while the absolute grade is B+. The final grade then is B+.

Textbook

Although a textbook is not required in the course, the following textbook is optional and recommended. Lecture slides and recorded videos would be sufficient for this class.

Database Systems: The Complete Book, by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2nd Edition. Prentice Hall. 2008.

Canvas

All weekly homework assignments and web quizzes should be turned in via Canvas.

Communication

All important announcements will be sent through Canvas.

Piazza

All questions that may be of general interest to the class should be directed to Piazza. You will get your questions answered faster on Piazza than via personal emails to the instructional team, because Piazza is monitored closely by everybody in the class, not just the course staff. You are highly encouraged to answer each other’s questions on Piazza (you will get extra credit for the number of good answers on Piazza!) and the instructional team would endorse/add to those answers.

Note: Some slides are adopted from the UW database group.