Skip to main content

Reading CSV files

Reading single file​

Start by importing scalaql:

import scalaql._

// Docs classes
import scalaql.docs.Hogwarts._
import scalaql.docs.DocUtils._

// Imports for examples
import java.nio.file.Paths

Assume you have a CSV file like the following:

val studentsPath = Paths.get("docs/src/main/resources/students.csv")
// studentsPath: java.nio.file.Path = docs/src/main/resources/students.csv

printFile(studentsPath)
// "name","age","faculty","grade","specialization","birthDay"
// "Harry","19","Gryffindor","85.1","learning","1980-07-31"
// "Ron","18","Gryffindor","66.2","eating","1980-05-01"
// "Hermione","18","Gryffindor","99.6","learning","1979-09-17"
// "Draco","18","Slytherin","85.1","trolling","1980-06-05"
// "Cedric","17","Hufflepuff","90.1","young dying","1977-10-01"

First, you defined a Query as usual:

val students =
select[Student]
.where(_.age >= 18)
// students: Query[From[Student], Student] = FROM(Hogwarts::Student) -> WHERE

Then you could specify to run the Query on a CSV file instead of a Scala collection:

students
.show(truncate=false)
.run(
from(
csv.read[Student].file(studentsPath)
)
)
// +--------+---+----------+-----+--------------+----------+
// |name |age|faculty |grade|specialization|birthDay |
// +--------+---+----------+-----+--------------+----------+
// |Harry |19 |Gryffindor|85.1 |learning |1980-07-31|
// |Ron |18 |Gryffindor|66.2 |eating |1980-05-01|
// |Hermione|18 |Gryffindor|99.6 |learning |1979-09-17|
// |Draco |18 |Slytherin |85.1 |trolling |1980-06-05|
// +--------+---+----------+-----+--------------+----------+
//

Reading from directory​

You could also read multiple CSV files from an arbitrary nested directories using GLOB pattern.

Start with the following imports:

import scalaql._

// Docs classes
import scalaql.docs.CsvData._
import scalaql.docs.DocUtils._

// Imports for examples
import java.nio.file.Paths

Then define a query:

val enterpriseSurvey =
select[EnterpriseSurvey]
.where(_.year >= 2015)
// enterpriseSurvey: Query[From[EnterpriseSurvey], EnterpriseSurvey] = FROM(CsvData::EnterpriseSurvey) -> WHERE

And then read the files:

val dir = Paths.get("docs/src/main/resources/annual-enterprise-survey-2020/")
// dir: java.nio.file.Path = docs/src/main/resources/annual-enterprise-survey-2020

enterpriseSurvey
.show(truncate=false)
.run(
from(
csv
.read[EnterpriseSurvey]
// In this case it's required to provide correct naming.
.option(Naming.SnakeCase)
.directory(dir, globPattern = "**/*.csv")
)
)
// +----+------------------+--------------------------------+----------+-----------------------------------------------+-----+-----------------+
// |year|industryCodeAnzsic|industryNameAnzsic |rmeSizeGrp|variable |value|unit |
// +----+------------------+--------------------------------+----------+-----------------------------------------------+-----+-----------------+
// |2018|O |Public Administration and Safety|d_10-19 |Sales, government funding, grants and subsidies|78 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|d_10-19 |Total income |78 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|d_10-19 |Total expenditure |72 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|d_10-19 |Operating profit before tax |6 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|d_10-19 |Total assets |39 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|d_10-19 |Fixed tangible assets |8 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|e_20-49 |Activity unit |48 |COUNT |
// |2018|O |Public Administration and Safety|e_20-49 |Rolling mean employees |1560 |COUNT |
// |2018|O |Public Administration and Safety|e_20-49 |Salaries and wages paid |51 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|e_20-49 |Sales, government funding, grants and subsidies|114 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|e_20-49 |Total income |123 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|e_20-49 |Total expenditure |109 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|e_20-49 |Operating profit before tax |12 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|e_20-49 |Total assets |55 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|e_20-49 |Fixed tangible assets |13 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|f_50-99 |Activity unit |12 |COUNT |
// |2018|O |Public Administration and Safety|f_50-99 |Rolling mean employees |954 |COUNT |
// |2018|O |Public Administration and Safety|f_50-99 |Salaries and wages paid |41 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|f_50-99 |Sales, government funding, grants and subsidies|68 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|f_50-99 |Total income |69 |DOLLARS(millions)|
// |2018|O |Public Administration and Safety|f_50-99 |Total expenditure |69 |DOLLARS(millions)|
// +----+------------------+--------------------------------+----------+-----------------------------------------------+-----+-----------------+
// only showing top 20 rows
//

Reading from the Internet​

With scalaql, it's also easy to read data from the Internet, by providing and URL.

This could be done as follows:

import java.net.URL

case class WebsiteUser(
username: String,
identifier: String,
firstName: String,
lastName: String)

select[WebsiteUser]
.show(truncate = false)
.run(
from(
csv
.read[WebsiteUser]
.options(
delimiter = ';',
omitEmptyLines = true,
naming = Naming.WithSpacesLowerCase
)
.url(new URL("https://support.staffbase.com/hc/en-us/article_attachments/360009197031/username.csv"))
)
)
// +---------+----------+---------+--------+
// |username |identifier|firstName|lastName|
// +---------+----------+---------+--------+
// |booker12 |9012 |Rachel |Booker |
// |grey07 |2070 |Laura |Grey |
// |johnson81|4081 |Craig |Johnson |
// |jenkins46|9346 |Mary |Jenkins |
// |smith79 |5079 |Jamie |Smith |
// +---------+----------+---------+--------+
//