Reading CSV files

Reading single file

Start by importing scalaql:

import scalaql._

// Docs classes
import scalaql.docs.Hogwarts._
import scalaql.docs.DocUtils._

// Imports for examples
import java.nio.file.Paths

Assume you have a CSV file like the following:

val studentsPath = Paths.get("docs/src/main/resources/students.csv")
// studentsPath: java.nio.file.Path = docs/src/main/resources/students.csv

printFile(studentsPath)
// "name","age","faculty","grade","specialization","birthDay"
// "Harry","19","Gryffindor","85.1","learning","1980-07-31"
// "Ron","18","Gryffindor","66.2","eating","1980-05-01"
// "Hermione","18","Gryffindor","99.6","learning","1979-09-17"
// "Draco","18","Slytherin","85.1","trolling","1980-06-05"
// "Cedric","17","Hufflepuff","90.1","young dying","1977-10-01"

First, you defined a Query as usual:

val students =
  select[Student]
    .where(_.age >= 18)
// students: Query[From[Student], Student] = FROM(Hogwarts::Student) -> WHERE

Then you could specify to run the Query on a CSV file instead of a Scala collection:

students
  .show(truncate=false)
  .run(
    from(
      csv.read[Student].file(studentsPath)
    )
  )
// +--------+---+----------+-----+--------------+----------+
// |name    |age|faculty   |grade|specialization|birthDay  |
// +--------+---+----------+-----+--------------+----------+
// |Harry   |19 |Gryffindor|85.1 |learning      |1980-07-31|
// |Ron     |18 |Gryffindor|66.2 |eating        |1980-05-01|
// |Hermione|18 |Gryffindor|99.6 |learning      |1979-09-17|
// |Draco   |18 |Slytherin |85.1 |trolling      |1980-06-05|
// +--------+---+----------+-----+--------------+----------+
//

Reading from directory

You could also read multiple CSV files from an arbitrary nested directories using GLOB pattern.

Start with the following imports:

import scalaql._

// Docs classes
import scalaql.docs.CsvData._
import scalaql.docs.DocUtils._

// Imports for examples
import java.nio.file.Paths

Then define a query:

val enterpriseSurvey =
  select[EnterpriseSurvey]
    .where(_.year >= 2015)
// enterpriseSurvey: Query[From[EnterpriseSurvey], EnterpriseSurvey] = FROM(CsvData::EnterpriseSurvey) -> WHERE

And then read the files:

val dir = Paths.get("docs/src/main/resources/annual-enterprise-survey-2020/")
// dir: java.nio.file.Path = docs/src/main/resources/annual-enterprise-survey-2020

enterpriseSurvey
  .show(truncate=false)
  .run(
    from(
      csv
        .read[EnterpriseSurvey]
        // In this case it's required to provide correct naming. 
        .option(Naming.SnakeCase)
        .directory(dir, globPattern = "**/*.csv")
    )
  )
// +----+------------------+--------------------------------+----------+-----------------------------------------------+-----+-----------------+
// |year|industryCodeAnzsic|industryNameAnzsic              |rmeSizeGrp|variable                                       |value|unit             |
// +----+------------------+--------------------------------+----------+-----------------------------------------------+-----+-----------------+
// |2018|O                 |Public Administration and Safety|d_10-19   |Sales, government funding, grants and subsidies|78   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|d_10-19   |Total income                                   |78   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|d_10-19   |Total expenditure                              |72   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|d_10-19   |Operating profit before tax                    |6    |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|d_10-19   |Total assets                                   |39   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|d_10-19   |Fixed tangible assets                          |8    |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|e_20-49   |Activity unit                                  |48   |COUNT            |
// |2018|O                 |Public Administration and Safety|e_20-49   |Rolling mean employees                         |1560 |COUNT            |
// |2018|O                 |Public Administration and Safety|e_20-49   |Salaries and wages paid                        |51   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|e_20-49   |Sales, government funding, grants and subsidies|114  |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|e_20-49   |Total income                                   |123  |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|e_20-49   |Total expenditure                              |109  |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|e_20-49   |Operating profit before tax                    |12   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|e_20-49   |Total assets                                   |55   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|e_20-49   |Fixed tangible assets                          |13   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|f_50-99   |Activity unit                                  |12   |COUNT            |
// |2018|O                 |Public Administration and Safety|f_50-99   |Rolling mean employees                         |954  |COUNT            |
// |2018|O                 |Public Administration and Safety|f_50-99   |Salaries and wages paid                        |41   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|f_50-99   |Sales, government funding, grants and subsidies|68   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|f_50-99   |Total income                                   |69   |DOLLARS(millions)|
// |2018|O                 |Public Administration and Safety|f_50-99   |Total expenditure                              |69   |DOLLARS(millions)|
// +----+------------------+--------------------------------+----------+-----------------------------------------------+-----+-----------------+
// only showing top 20 rows
//

Reading from the Internet

With scalaql, it's also easy to read data from the Internet, by providing and URL.

This could be done as follows:

import java.net.URL

case class WebsiteUser(
  username: String, 
  identifier: String, 
  firstName: String, 
  lastName: String)

select[WebsiteUser]
  .show(truncate = false)
  .run(
    from(
      csv
        .read[WebsiteUser]
        .options(
          delimiter = ';',
          omitEmptyLines = true,
          naming = Naming.WithSpacesLowerCase
        )
        .url(new URL("https://support.staffbase.com/hc/en-us/article_attachments/360009197031/username.csv"))
    )
  )
// +---------+----------+---------+--------+
// |username |identifier|firstName|lastName|
// +---------+----------+---------+--------+
// |booker12 |9012      |Rachel   |Booker  |
// |grey07   |2070      |Laura    |Grey    |
// |johnson81|4081      |Craig    |Johnson |
// |jenkins46|9346      |Mary     |Jenkins |
// |smith79  |5079      |Jamie    |Smith   |
// +---------+----------+---------+--------+
//

Reading CSV files

Reading single file​

Reading from directory​

Reading from the Internet​

Reading single file

Reading from directory

Reading from the Internet