
(English, auto-generated transcript) SQLMesh - Streamlining Python & SQL Transformations with Tobias Mao, Co-Founder & CTO at Tobiko Data

The guest, Toby, is the CTO of Tobiko Data and previously led analytics and experimentation teams at Airbnb and Netflix. He discusses the challenges of building a data-driven culture at large companies, noting that Netflix is extremely data-driven while Airbnb is more design-focused. The major pain points for data engineering at scale involve large amounts of data, many users, and many models. Starting a new company presents marketing challenges in explaining the product's value compared to existing solutions. DataOps aims to make data workflows more efficient through faster and safer releases. SQLMesh is introduced as a next-generation data transformation framework building on the concepts of dbt.


Hello everyone, welcome to another English episode. Today we are extremely flattered and happy to have this amazing guest here, Mat. Yeah, today we have a special guest, and I'm not going to take too much time. Toby, can you introduce yourself, please?

Hey, thanks for having me on the show. My name is Toby, and I'm the CTO of Tobiko Data, where we just launched our new product, SQLMesh. Before Tobiko Data, I led analytics at Airbnb, where I redesigned the metrics platform, Minerva, and also led the experimentation platform. Before that, I led the experimentation team at Netflix. So yeah, thanks for having me.

Yeah, just the small companies, right? I'm going to start off with this: based on your experience at amazing, gigantic companies such as Netflix and Airbnb, and I know you had other previous jobs across the data spectrum as well, what do you make of this data-driven culture? We have seen customers trying to understand a little bit about being data-driven. Do companies need to be data-driven if they would like to unlock data analytics, really analyze their data, and get insights out of it?

Yeah, so every company is different. Netflix and Airbnb both have big data cultures, but I would say they're different in terms of how much they rely on data. Netflix is extremely data-driven: every decision undergoes A/B testing, and I think that's part of the culture of the leadership team. Airbnb does a lot with data, but they're not quite as rigorous about scientifically analyzing every change.

Okay, so even though they're big and have a lot of data running through the pipeline, it's a bit of a difference in behavior, in terms of how they deal with data, right?

right Airbnb is a little bit more

designer focused they have a designer Le

culture and and Netflix is a little bit

more data data driven I see I see um and

I see. And based on that, what are the major pain points you see at big companies, or even medium-sized companies, that you've worked at or are acquainted with, regarding the data engineering job? What are the major pain points?

There are three big vectors of scale at these companies. There's big data: that's one vector, you can have a lot of data. There's a lot of users, a lot of people: as you have more people in your organization, more people making data and more people consuming data, that's another vector. And then you also have a lot of models, so just a lot of code, and that's another vector. At these kinds of companies you have all three. Netflix had huge amounts of data, lots of users, and maybe a medium number of models. Airbnb had less data, lots of users, but many, many models. So a little bit different, but both of these companies have these three vectors of scale.

Okay, so would you say these three vectors are the major pain points when you deal with massive data, massive adoption of data, or massive understanding of your customer? Because the more data you have, the more complex it gets. It seems easier, but it gets more complex to understand customer behavior, right?

Not only that, but everything gets more complex: your queries get more complex, your processes get more complex, and you have to make sure you're not breaking something someone else is doing. As you add more things, everything just gets harder to manage.

Yeah, and that's not only tied to the coding perspective but also to the infrastructure, right? All the different layers get hit by this growth of big data, people, and models. I like the way you put it; I wasn't tackling it with that approach, but I like seeing it as factors: big data, people, and models. I totally agree with that sentence, it's pretty well spot on.

Great. And talking about the experience you've been having becoming the founder of a new company: beyond the technical stuff, now you're building something and have a product, SQLMesh. Can you tell us a little bit about the challenges of this new company?

Yeah, I've never done it before; it's my first time building a company. I've built open source projects before, but a company is a whole different ball game. It's been a lot of fun, and I've been learning a lot. In terms of our strengths, our team is very engineering-heavy: we all come from FAANG companies, and we're all very senior, so that part is pretty easy for us. The hardest part is that marketing is a big part of making a startup, especially when there's a huge existing incumbent. So the biggest challenge for us is really on the marketing side: how can we sell what we're doing, how can we get people to understand and respect what we're doing, how can we get people to think about us and talk about us? That's a really big challenge for us.

No, but honestly, I think you guys started off on the right foot. First impression: I've been in the data ecosystem for quite a while, and when I first saw SQLMesh (actually, I have the tab open here because I'm doing some tests, and I must say we're going to get to that), I was impressed with how much abstraction you put over complex things like environments and DAGs, abstracting them in a way that makes it easier to treat the business problem as the first-class citizen, instead of focusing too much on the technology and forgetting about the business. I think you're approaching it the right way: think about the business, with some predefined things you take for granted because you consider them the key points to being successful with data. We saw a lot of DevOps, and then this name arrived: DataOps. Can you explain a little bit what DataOps is and how it can help data teams build better products?

Yeah, DataOps is a set of philosophies, or processes, to make people more efficient with data: shipping faster, shipping safer, and shipping more often. Especially at larger scales, when you have more people, more models, and more data, it gets harder to do these things. At some big companies, and even some little companies, whenever you make a change in your data model to add a new metric or fix business logic, it can be dangerous; you're not quite sure what's going to happen. DataOps is about making that easier, safer, and faster.

Wow. Mat, do we know of any DataOps product that is doing that? Because I don't actually know of anything that does this end to end. That's why SQLMesh caught my attention: I was like, okay, DevOps and DataOps must be pretty similar in terms of shipping faster and making things easier, with a lot of abstractions handled for you. For me, this is the very first product that targets this massive abstraction of complexity. But before you launched Tobiko, did we have anything regarding DataOps in the data engineering market?

I think DataOps has existed as a term, but I don't know of any real product that came at it the way we have. There's some work around observability; that's one part of DataOps, observing the health of your pipelines and so on, but it's only one small part. Scheduling frameworks like Dagster, Prefect, and Airflow try to solve another aspect of it, but they're pretty low-level. So I don't know if anyone before us has made anything like this.

That's why it's so groundbreaking. So without SQLMesh there are a bunch of DataOps difficulties: the learning curve, roles, the abstraction of technologies, and so on. Now I'd like to jump ahead and ask you to tell us about SQLMesh. What is the target idea of SQLMesh? Explain, for the data engineers and data people listening to us, what SQLMesh is.


So SQLMesh is a transformation framework. It allows you to write Python or SQL, and then we handle the rest: we handle the creation of the DAG, registration, and the scalability of all that. We think of ourselves as a next generation of dbt. dbt, as you're all familiar with, is a SQL transformation framework that makes it easier for analysts to write SQL and be productive. The idea for SQLMesh came about when I was at Airbnb. I had actually never used dbt, because dbt wouldn't have worked at a company like Airbnb, and I can explain why in a bit. But I saw a demo of dbt and I was just like, wow, a tool like this is very valuable; however, I wanted it to work at any company. So we built SQLMesh with that mindset: transformation frameworks are valuable, they make people more productive, but how can we do it in a way that works at any company, at any scale?

Wow, I mean, that's impressive.

Yeah. I guess let me explain why dbt wouldn't work at a company like that.

Perfect, yeah, that would be nice.

So if you just use dbt out of the box, you just do dbt run. Let's say you have a bunch of models and you want to make a change: the default approach is to do dbt run, and that's going to fully refresh everything, your whole project. Now, if you're at a company like Airbnb, you have maybe 30,000 models and terabytes or petabytes of data; you're not going to be able to refresh your whole warehouse. Certainly you can use a bunch of advanced features that almost no one uses, like state and defer, but that's very complex and error-prone. I wanted a person at these companies to be able to just make a change at any scale, and to be able to recreate a dev environment for very little cost. So the key innovation of SQLMesh is really the virtual environments. With a virtual environment, a user can make a change, we can understand the whole dependency graph of what has changed, and then change only that. We use an abstraction with views pointing to the physical layer, so that when you make a dev environment, it's complete: it has your whole warehouse, except for the changes you've made. Being able to do that quickly and cheaply is super powerful.
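As a rough sketch of the idea (all schema, table, and version names here are invented for illustration): each model change produces a new versioned physical table, and an environment is just a set of views pointing at the versions it should see, so creating a dev environment mostly means creating views rather than copying data.

```sql
-- Hypothetical illustration of virtual environments.
-- Physical layer: each model version is its own table.
--   sqlmesh_physical.events_daily__v1  (current production version)
--   sqlmesh_physical.events_daily__v2  (your dev change, newly backfilled)

-- The prod environment is a layer of views over physical tables:
CREATE VIEW prod.events_daily AS
SELECT * FROM sqlmesh_physical.events_daily__v1;

-- A dev environment reuses every unchanged model's physical table
-- and only points at new versions for the models you touched:
CREATE VIEW dev.events_daily AS
SELECT * FROM sqlmesh_physical.events_daily__v2;
```

Because the unchanged models are shared, the dev environment is complete but costs almost nothing to create.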

Yeah, it is indeed. Hopefully people can understand the complexity that lies underneath these layers and how you're making it easier for people to use. Plus, it's completely open source, which is mind-blowing.

Yeah. One more big thing with SQLMesh that I want to highlight is incremental models. Incremental models are a necessity at any company with scale; it's just unfeasible to re-backfill your whole warehouse every day. If you get views or impressions every day, you only want to process that. With dbt you can do incremental models, but they label it as a complex feature, and the reason it's complex as a dbt user is that it's up to you: you have to write an if/else macro (if incremental, do this; else, do that), and then you have to write your own subquery. Another big problem with the way dbt does incremental models is that it expects you to be able to run the whole data set fresh the first time, and that's actually not possible with a lot of the systems I've worked with: you have to batch up the incremental loads, maybe loading only one week at a time.
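The if/else pattern described here looks roughly like this in a dbt-style model (table and column names are invented for illustration):

```sql
-- Sketch of a dbt-style incremental model: the branching is up to the user.
SELECT user_id, event_date, COUNT(*) AS events
FROM {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- On incremental runs, the user writes their own subquery to find new rows.
  WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}
GROUP BY user_id, event_date
```

The first run still has to process the full history, which is the limitation Toby describes.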

And so one of the additional core things I wanted with SQLMesh was for you to be able to define an incremental model easily, and for SQLMesh to understand which intervals have run. If you're working with Airflow, for example: Airflow was designed this way, it's very partition-based, and so is SQLMesh. SQLMesh has a first-class understanding of time, and it tracks all of that. So when you define an incremental model in SQLMesh, you don't have to write any subqueries. You just write something like "select all from my table where the date is between the start and the end," and SQLMesh handles the start and the end. It keeps track of them, and from there it can parallelize everything, make sure there's no data leakage, and make sure everything is consistent.
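A minimal sketch of what such a model might look like (model, table, and column names are invented; @start_date and @end_date are, as I understand SQLMesh's model syntax, the macros it fills in with the interval being run):

```sql
MODEL (
  name analytics.events_daily,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  )
);

-- No hand-written subquery: SQLMesh substitutes the interval bounds
-- and tracks which date ranges have already been processed.
SELECT
  user_id,
  event_date,
  COUNT(*) AS events
FROM raw.events
WHERE event_date BETWEEN @start_date AND @end_date
GROUP BY user_id, event_date;
```

Because the framework owns the bounds, it can batch the initial backfill (say, a week at a time) and run missing intervals in parallel.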

Wow. So you basically have incremental as a first-class citizen of the entire process.

Exactly, yeah.

That's interesting. And as you said, I truly don't see production workloads that aren't incremental; the majority of data warehouses do incremental loads. So having this as a native, main capability is just truly outstanding.

Yeah. And people listening to us must be asking themselves... I have two questions for you. The first one is: how can I deploy this in my infrastructure, and how easy is it? And the other is something management always asks when we adopt something new: what culture changes must we be careful about when deploying, and how can we fully use the capabilities of the tool we're installing in our company?

Sure. SQLMesh is easy to use. Basically, you can just pip install it; it's all open source, and it's all free. You can get started really quickly because SQLMesh works with DuckDB as well. Now, if you want to use SQLMesh in a production setting, we support Snowflake, BigQuery, Databricks, and so on. And for the kind of enterprise solution, if you want to do it the way the top companies do it, we have first-class Airflow support: you can point your SQLMesh project at Airflow and have everything run through Airflow. Another big difference between SQLMesh and dbt is that SQLMesh treats Airflow as a first-class citizen. It's not just a black box where you do a SQLMesh run inside of a node; it actually understands your DAG and creates nodes inside Airflow, so you can manage things much better in Airflow, and you can hook up external dependencies much more easily. So that's how you get started; there's also documentation with a quick start and so on. As for your question about what cultural changes are needed:
I think there's a lot there. One part of it is that we have a CI/CD bot we plan to open source very soon, and that will make things much more robust for people; it's not just you running your project. We really want to tie your code to your data, so with the CI/CD bot we're trying to make it so that whatever is in main is a true reflection of what is actually in your data warehouse. That's one cultural shift. Another cultural shift, which I personally think is very important, is unit tests. SQLMesh has first-class unit tests: basically, you define fake data, or fixtures, in YAML, and then we run your existing models and verify them against a fixed output. These tests are very fast, and they validate business logic. They can run in your CI, so whenever someone makes a change, you can be more confident they haven't broken anything. dbt has what they call tests, but we call those audits: they're really just data quality checks, checking for nulls and so on. Those are good, and we have them as well, but they're for checking big things, making sure your data quality is good or your upstream hasn't changed; they don't validate business logic. So that's another big cultural shift we're trying to drive: the practice of actually testing data.

Yeah, and you consider this one of the main pillars of the entire SQLMesh paradigm, right? Not an additional step, but a main pillar.

Exactly, yeah.
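The YAML fixtures mentioned above might look something like this (model, table, and column names are invented, and the exact schema is a sketch based on my understanding of SQLMesh's test format):

```yaml
# Sketch of a SQLMesh-style unit test: fake inputs in, fixed output expected.
test_events_daily:
  model: analytics.events_daily
  inputs:
    raw.events:
      rows:
        - user_id: 1
          event_date: 2023-01-01
        - user_id: 1
          event_date: 2023-01-01
  outputs:
    query:
      rows:
        - user_id: 1
          event_date: 2023-01-01
          events: 2
```

Because the inputs are tiny and fixed, a test like this validates the model's business logic in milliseconds, unlike an audit that scans production data.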

I've seen a lot of posts going out about SQLMesh, and as you stated, it's a major breakthrough. It's a major thing for people to realize how cool it is to have their own dev data warehouse, to understand the difference between production and dev, to capture only the differentials, the DAG generation, and how it handles everything as a first-class citizen on Airflow. Considering other orchestration tooling besides Airflow, is there anything on the roadmap, like Dagster or something like that?

Yeah, we'd love to eventually integrate with Dagster and Prefect, but we're quite a small team, so it's a matter of priorities for us. The answer is yes, we want to support Dagster and Prefect, but we don't have the bandwidth right now.

Yeah. I also saw something pretty cool: you're building and giving the community a UI, right? We don't have this capability with dbt Core, for example; it's CLI-ish, which is not a problem for me, but data analysts, for example, would love to have a UI where they can type SQL and do all the work in one place. Can you tell us a little bit about this UI?

Yeah, so our UI is open source and free. The idea is that we wanted to make SQLMesh easy for everybody. We feel SQLMesh is powerful enough for the data engineer and the platform engineer; they're going to really love the features. But at the end of the day, the analyst just wants to run SQL and get their changes in, and the UI is there to make that an easy experience and to provide all the guardrails and safety features of SQLMesh in a way that's easy for them to interact with. So we felt the IDE was a very important part of our open source offering.

It is, it is. Truly, it is.

Yeah, great. And what can we expect as the next steps for SQLMesh this year and next? What are you thinking: this is a priority for us, we must have this feature this year or maybe next year, a new kind of deployment, whatever you're planning?

So in terms of the big features we're thinking about: my background, and a lot of the team's background, is in metrics and experimentation. The reason we did transformation and ETL first is that you need a solid transformation framework before you can do metrics, because metrics with bad data are useless. In the next six to nine months, we plan to expand SQLMesh to be the full metrics layer as well, and not just for business metrics. With my experience in experimentation, I really wanted to make sure SQLMesh computes metrics in a way that's compatible with inference and machine learning. Experimentation metrics are just much more complex than normal business metrics: you have to keep track of additional statistics, and you have to do things in a certain way. But if you can solve experimentation, then you can also solve analytics at the same time.

Okay, I see your point. So that's going to be the focus for Q1 and Q2 of this year, right?

This quarter is really about solidifying our product. We're brand new, we've only been at it for about seven months, so we want to make sure we can handle transformations well, and we'll be spending this quarter making it more robust. As an example, I just recently launched multi-repo support, so SQLMesh can work even if you have different projects in different repositories.

Wow, I didn't know that.

Oh yeah, I just added that. So if you have a really big project, you can now split it up. One of the really cool things about multi-repo projects in SQLMesh is that you can actually backfill other people's stuff, because it's a whole graph: even with multiple projects, they might depend on some of your stuff, and in order to get that full preview environment, you can't just build your own project, because then you'd be missing the downstream. So even though you don't have their project checked out, you can actually backfill it and make the world consistent.

Wow, okay. Meaning we can sync up, or line up, with the timeline, so everyone can be lined up and in charge of their downstream resources too. And honestly this is one of the biggest problems we see nowadays on the data engineering side: there are so many downstreams, so many people consuming that specific data set or that specific ETL or ELT you built, that sometimes it's pretty hard to keep up, and SQLMesh understands that, so it's pretty much magic. From what we have seen in the trenches, this is one of the key features missing from other technologies, even Airflow: you can see what is happening in the DAG, but you don't have clear visibility of the whole. SQLMesh sits on top of that, so you're able to see everything that is happening through that specific pipeline you establish in SQLMesh. Pretty cool.

I don't know if I talked about this, but one of the core things we do in order for that to actually work well is understanding the SQL and having this automatic categorization. Are you familiar with non-breaking and breaking changes?

Oh yeah, uh-huh.

Well, SQLMesh, as I said, has a first-class understanding of SQL, so we can actually understand the SQL. What SQLMesh does is parse the SQL and understand what it means. For example, let's say you have a model with many dependencies downstream, and you add a column to the model that no one else uses. This is a non-breaking change, because no one else is using your new column. If you did it naively and wanted to make things consistent, you would just backfill the whole world. But SQLMesh understands that all you've done is add a column, that you haven't changed anything else, and it can automatically categorize your change as non-breaking, so it won't backfill the downstream.

Whoa, automatically?

Automatically. But if you had a column that was one plus one and you changed it to one plus two, it would say, hey, that's a breaking change, and in order to be consistent you need to backfill yourself and anyone downstream.
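To make the two cases concrete, here is a sketch (table and column names invented) of the kind of edit that would be categorized each way:

```sql
-- Original model:
SELECT id, revenue FROM raw.orders;

-- Non-breaking edit: a new column is added and nothing downstream
-- selects it, so downstream models do not need a backfill.
SELECT id, revenue, currency FROM raw.orders;

-- Breaking edit: an existing expression changes, so this model and
-- everything downstream of it must be backfilled to stay consistent.
SELECT id, revenue * 2 AS revenue FROM raw.orders;
```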

My goodness. And this is open source! Oh my goodness. For people listening to us: it's complex to do this, right? It's pretty complex to do this type of stuff, and SQLMesh doing it transparently is just crazy. I feel like, oh my God, I have to stop everything I'm doing and fully test it, because I just got started with it and it worked well, and I was so shocked and impressed. But there's so much more to it, and the possibilities become endless. So yeah, it's amazing.

Yeah, before we go to the fourth block, I have one question about SQLMesh that we have to add here, and people must be thinking the same as myself. Toby, are there particular cases where a professional, a person, or a company comes to you with a use case that's not quite usual for SQLMesh, and you'd say: for this use case, you're not supposed to use SQLMesh? What's your recommendation on that?

SQLMesh is not really designed for low-latency, real-time applications. We support micro-batching, so if you want to run your pipelines every five minutes, we have support for that. But if you're looking for sub-second latency and building a real-time streaming application, you should not use SQLMesh.

Good, great. But yeah, five minutes... and here's a curiosity: I know Netflix is pretty much event-driven, real-time fashion, but I know they have big batch systems where they do a bunch of stuff. Across the companies you've gone through, like Airbnb and Netflix, and the customers you've been dealing with, how much would you say SQLMesh would fit, what percentage? Because I think five minutes is super decent, especially for backfilling or doing incremental loads on your data warehouse. Come on, five minutes!

So from my experience, if you can have micro-batching at five minutes, or even an hour of latency, it's fine. Basically every company has a batch computation framework; I haven't talked to many companies, or any at all yet, that do everything streaming only. It's just hard to manage, hard to backfill, hard to make changes. Even at Netflix, most of the analytics was done in batch; they only used streaming for things like real-time deployments and observability. And it was a challenge, because at Netflix you have this huge batch pipeline which is used to make real decisions, and then you have to duplicate all that logic in real time, and it gets inconsistent, because you can't do all the same things in real time, like joins. So honestly, I would avoid real time if you don't have to do it. I think a five-minute batch, even an hourly batch, is perfectly reasonable.

Yeah, I truly agree with that statement, because I see a lot of people talking about real time, and we do advocate for real time, but there are specific use cases for it. And honestly, I don't see any customer doing streaming that doesn't do batch; all of the customers initially do batch, and later they may or may not move into streaming. The reality, at least from my experience in the field, is that data analytics, data engineering, machine learning, and all the training happen on the batch side. You don't train a model in real time; you train it in batch, verify its quality, and then move to streaming. Streaming is becoming truly mainstream for some use cases, but I don't see a data warehouse being fed at millisecond latency, because you have to have data quality rules, you have to understand the data, you have to break it down into dimensions and facts, and there's a process that needs to happen gracefully, step by step. So batching at five minutes is very decent, and I think even one hour or more is still super decent for a data warehouse to keep up with the data.

Exactly.

Yeah, let's go to the fourth block and talk about Tobiko Data. Here's something I'd like to ask you: what was the main idea when you guys decided, okay, let's build this company? What are you targeting to do with your company?

We're looking to make data easy for everybody, at any scale, across those three vectors. So whether you have a lot of users, a lot of models, or a lot of data, and even if you don't, we want to be valuable. Because even if you're a small company with small data, SQLMesh is going to save you time. Maybe a full refresh only takes you five minutes, but that's still five minutes you're wasting every single time you do a full refresh. So we feel that if we can make that process easy and seamless, then everybody wins, except maybe for Snowflake, who's making a killing off of the inefficiencies of the current state of the
world.

Yeah, but you said something interesting there about small companies. Eventually these companies become medium-sized and then large, and guess what: they don't need to change their entire toolset, because if they're using SQLMesh it's transparent for them. You support the Airflow and Databricks APIs, BigQuery, and even if the customer wants to change between technologies, SQLMesh will do it transparently. So it's an easier choice for them: you don't have to change your whole chain of tools to adapt to new volumes of data, or to those three vectors that you said, right?

Right. I guess one thing that I

didn't mention, which you kind of brought up: if you want to change your stack, SQLMesh can do that, because SQLMesh is built on top of SQLGlot, which is my open-source SQL parser and transpiler. So if you write all your queries in BigQuery syntax, we can seamlessly transpile them and run them on Snowflake, or whatever. That helps with that as well.

That's insane! So it understands that you changed the technology and then adapts the SQL to run on that specific engine?

That's right.

Goodness.

And that started because, at Netflix and Airbnb, everyone was using Spark for the batch pipelines, but we used Presto and Druid for quick querying. The data scientist doesn't want to have to write the query twice, and that's really why I started SQLGlot: to make that transition seamless.

Whoa, I didn't know that as well. Yeah, wow. What a class! And what are the products that you guys have? What type of services are you offering your customers? Can you tell us a little bit: besides SQLMesh, what other products do you have, and what do you offer as a service?

So, we just started seven months ago. Our two products are SQLMesh and SQLGlot. We've got a lot of work to do on SQLMesh, so we're really going to be focusing on making that good, and on building out a cloud enterprise product for SQLMesh.

Yeah, amazing. And I also saw

that you guys started your journey pretty much trusting the community, receiving lots of comments and feedback from it. Tell me a little bit about this experience. I know you already had some open-source products, but what is the feeling of working so close to the community?

Yeah, it's

great. As a company, and personally, I really believe in open source. Even before SQLMesh, before SQLGlot, I actually made a board game website called 18xx.games, this hardcore economic train simulation game. It's open source, and that project was really successful because the community got involved: I have dozens and dozens of contributors creating many, many games. So I really believe in the open source model. We have only been at it for a couple of months, as I said, and we wanted to launch SQLMesh as fast as possible, because I wanted to make sure the community could see it, play with it, and give us feedback. I don't want to build something in isolation for two years and then come out with something that no one wants. I want people to criticize it; I want people to tell me I'm wrong. That's how we're going to grow, so community is a huge part of it. It's everything: everything the community asks, we're going to listen to. And one of our philosophies in dealing with the community is to be very responsive: if you open an issue on GitHub, we will respond very quickly and try to address it. That's

nice, that's nice. And I think, as my last question for today's episode: what do you envision next for data engineering, for the data teams spectrum? I know SQLMesh is going to play a major role; that's the intent of having SQLMesh out there, and I truly think so, honestly. So first, thank you so much for accepting to be on the podcast, because I saw SQLMesh and I understood the idea, and I said: we definitely have to open this up, not only to the English-speaking audience but also to Portuguese speakers, which is a big community. Trust me, it's big here. Sometimes it doesn't get much visibility, but I can tell you that South America is gigantic in terms of products, adoption, and community. What we usually do is speak in English, write a summary in Portuguese, translate some titles into Portuguese, and release that to our community, so they can listen to you, the creator, talk about why SQLMesh is important and the companies that have adopted it. So what would be your feeling about the next two to three years in the data spectrum? What do you think is going to

happen?

It's really tough.

Right, but what do you foresee?

I'm not very good at making these predictions.

Well, you can try now, and then after two years we can just listen again.

I think data will continue to be fragmented; I think there will be many tools for everything, because data is complicated and not every solution is going to work for every company. It's not one-size-fits-all. I think Databricks and Snowflake will continue to be big players in the space. And the next big thing that I'm looking into is really fast analytic engines, like StarRocks, ClickHouse, Doris, etc., making aggregations and joins really fast.

I don't know, man.

No, those are fine, those are fine. Don't get me wrong: I still think Snowflake and Databricks are going to be big. We have seen a lot of customers moving away from the raw cloud provider stack and moving to a data platform like Snowflake or Databricks. But also, with the SQLMesh release, I'm seeing for the very first time in the data engineering ecosystem a tool that can reduce the complexity of the tooling and focus on what needs to be done. dbt started with that, in my opinion; I think they made a pretty nice shot at reducing a little of the friction for analysts, or people who write SQL and now eventually Python, but it doesn't offer the whole, complete solution. So SQLMesh goes on top of that and offers this amount of

stuff.

A question regarding that: are you guys targeting ClickHouse on the integration side? Because that's something I'm looking for, to be honest. Is it on the roadmap for this year or next year?

ClickHouse is not on the roadmap, and people don't really do ETL in ClickHouse. However, I could see ClickHouse being part of the semantic layer: basically having SQLMesh control the ETL of your transformations, but then storing the results in ClickHouse and providing an abstraction on top of ClickHouse to get summary statistics very quickly. So that in particular I could see, but I don't really see using ClickHouse as an ETL engine. Does

that make sense?

No, yeah, it does, it does. I was looking at it more as a data warehousing system where you can just throw queries at it, not for the ETL portion. Because, for example, for customers that are open-source-only, we don't have Snowflake or Databricks SQL, and one of the good things we can install and deploy on Kubernetes is ClickHouse, because it offers both batch and real time and pretty nice ANSI SQL standard support, which Pinot or Druid don't offer; they're a little bit limited, and you would have to add Trino, for example, to do this, and then we end up with two deployments. So I think ClickHouse is going to be one of the things that gains traction, in my opinion.

So, if you're into this space, you should really check out StarRocks. From all of the benchmarks that I did working at Airbnb with my team, StarRocks just blows ClickHouse out of the water.

Really? Wow.

Yeah, it's not even close. ClickHouse doesn't have a cost-based optimizer, or it didn't want to use one, so you had to order your joins manually, and if you're joining many big tables together, join order is very important. StarRocks has a cost-based optimizer and can join things much faster. And in terms of ingest, ClickHouse couldn't ingest Airbnb's data; it didn't scale. StarRocks did it in a couple of


hours.

Oh my goodness.

Additionally, ClickHouse's SQL syntax is too obscure; it's not standard at all. For example, they have a lot of functions that need to be case-sensitive, which is really annoying, and they have a bunch of hints that you have to put in your SQL to tell ClickHouse to do the right thing. All of those reasons pushed Airbnb to basically eliminate ClickHouse from

contention.

Okay, that's pretty nice info that we didn't know. Actually, to be honest, we're usually pretty good at knowing the whole ecosystem, but I had never heard about StarRocks. That's good to know. Interesting. It's open source, right?

Yeah, it is. I think they recently moved from the Elastic license to Apache. You probably haven't heard of it because it's a Chinese company, and they were based on another Apache project called Doris.

Oh yeah, I know Doris, I know Doris! I actually tried to find someone to speak about Doris, but it turned out I couldn't find anyone on LinkedIn, alive or something like that. It's like an analytics-focused version of MySQL, right? And it works pretty fantastically.

Yeah, and StarRocks is a fork of Doris.

Oh, okay, interesting. Gotcha. Okay, good to know.

Yeah, and I think that's pretty much what we have. I just want to thank you so much, Toby, for your time; I know you have a lot of things on your plate. Hopefully we'll see more and more releases coming from SQLMesh. We're eager to do some YouTube lives, demonstrations, and workshops about SQLMesh on our end, and once we do, I'm obviously going to ping you and let you know, because I have no doubt that people are going to love it and the adoption is going to be massive, because people will understand how easy it is to use and how much complexity it takes off the data engineering burden. It just removes the burden from the data engineers. Thank you so much; I'm going to pass it over.

Yeah, what a class! I can say: what a great time. Thanks, Toby, for sharing your knowledge and bringing so much content to the table. I hope to see more about SQLMesh on the next episodes.

Me too. Thanks a lot for having me; it was great chatting with you.

Yeah, well, we'll see if it takes off. Let's see. Thank you so much, man. Thank you.
