
(English, auto-generated transcript) SQLMesh - Streamlining Python & SQL Transformations with Tobias Mao, Co-Founder & CTO at Tobiko Data

The guest, Toby, is the CTO of Tobiko Data and previously led analytics and experimentation teams at Airbnb and Netflix. He discusses the challenges of building a data-driven culture at large companies, noting that Netflix is extremely data-driven while Airbnb is more design-focused. The major pain points for data engineering at scale involve large amounts of data, many users, and many models. Starting a new company presents marketing challenges in explaining the product's value compared to existing solutions. DataOps aims to make data workflows more efficient through faster and safer releases. SQLMesh is introduced as a next-generation data transformation framework building on the concepts of dbt.


Hello everyone, welcome to another English episode. Today we are extremely flattered and happy to have this amazing guest here, Mat. Yeah, today we have a special guest, and I'm not going to take too much time. Toby, can you introduce yourself, please?

Hey, thanks for having me on the show. My name is Toby, and I'm the CTO of Tobiko Data, where we just launched our new product, SQLMesh. Before Tobiko Data, I led analytics at Airbnb, where I redesigned the metrics platform, Minerva, and also led the experimentation platform. Before that, I led the experimentation team at Netflix. So yeah, thanks for having me.

Yeah, just the small companies, right? I'm going to start off with this: based on your experience at amazing, gigantic companies such as Netflix and Airbnb, and I know you had other previous jobs across the data spectrum as well, what do you make of this data-driven culture? We have seen customers trying to understand a little bit about being data-driven. Do companies need to be data-driven if they would like to unlock data analytics, really analyze their data, and get insights out of it?

Yeah, so every company is different. Netflix and Airbnb both have big data cultures, but I would say they're different in terms of how much they rely on data. Netflix is extremely data-driven: every decision undergoes A/B testing, and I think that's part of the culture of the leadership team. Airbnb does a lot with data, but they're not quite as rigorous about scientifically analyzing every change.

Okay, so even though they're big and have a lot of data running through the pipeline, it's a bit of a difference in behavior, in terms of how they deal with data, right?

right Airbnb is a little bit more

designer focused they have a designer Le

culture and and Netflix is a little bit

more data data driven I see I see um and

I see. And based on that, what are the major pain points you see at big companies, or even medium-sized companies, that you've worked at or are acquainted with, regarding the data engineering job? What are the major pain points?

There are three big vectors of scale at these companies. There's big data: that's one vector, you can have a lot of data. There's a lot of users, a lot of people: as you have more people in your organization, more people making data and more people consuming data, that's another vector. And then you also have a lot of models, so just a lot of code, and that's another vector. At these kinds of companies you have all three. Netflix had huge amounts of data, lots of users, and maybe a medium number of models. Airbnb had less data, lots of users, but many, many models. So a little bit different, but both of these companies have these three vectors of scale.

Okay, so would you say these three vectors are the major pain points when you deal with massive data, massive adoption of data, or massive understanding of your customer? Because the more data you have, the more complex it gets. It seems easier, but it gets more complex to understand customer behavior, right?

Not only that, but everything gets more complex: your queries get more complex, your processes get more complex, and you have to make sure you're not breaking something someone else is doing. As you add more things, everything just gets harder to manage.

Yeah, and that's not only tied to the coding perspective but also to the infrastructure, right? All the different layers get hit by this growth of big data, people, and models. I like the way you put it; I wasn't tackling it with that approach, but I like seeing it as factors: big data, people, and models. I totally agree with that sentence, it's pretty well spot on.

Great. And talking about the experience you've been having becoming the founder of a new company: beyond the technical stuff, now you're building something and have a product, SQLMesh. Can you tell us a little bit about the challenges of this new company?

Yeah, I've never done it before; it's my first time building a company. I've built open source projects before, but a company is a whole different ball game. It's been a lot of fun, and I've been learning a lot. In terms of our strengths, our team is very engineering-heavy: we all come from FAANG companies, and we're all very senior, so that part is pretty easy for us. The hardest part is that marketing is a big part of making a startup, especially when there's a huge existing incumbent. So the biggest challenge for us is really on the marketing side: how can we sell what we're doing, how can we get people to understand and respect what we're doing, how can we get people to think about us and talk about us? That's a really big challenge for us.

No, but honestly, I think you guys started off on the right foot. First impression: I've been in the data ecosystem for quite a while, and when I first saw SQLMesh (actually, I have the tab open here because I'm doing some tests, and I must say we're going to get to that), I was impressed with how much abstraction you put over complex things like environments and DAGs, abstracting them in a way that makes it easier to treat the business problem as the first-class citizen, instead of focusing too much on the technology and forgetting about the business. I think you're approaching it the right way: think about the business, with some predefined things you take for granted because you consider them the key points to being successful with data. We saw a lot of DevOps, and then this name arrived: DataOps. Can you explain a little bit what DataOps is and how it can help data teams build better products?

Yeah, DataOps is a set of philosophies, or processes, to make people more efficient with data: shipping faster, shipping safer, and shipping more often. Especially at larger scales, when you have more people, more models, and more data, it gets harder to do these things. At some big companies, and even some little companies, whenever you make a change in your data model to add a new metric or fix business logic, it can be dangerous; you're not quite sure what's going to happen. DataOps is about making that easier, safer, and faster.

Wow. Mat, do we know of any DataOps product that is doing that? Because I don't actually know of anything that does this end to end. That's why SQLMesh caught my attention: I was like, okay, DevOps and DataOps must be pretty similar in terms of shipping faster and making things easier, with a lot of abstractions handled for you. For me, this is the very first product that targets this massive abstraction of complexity. But before you launched Tobiko, did we have anything regarding DataOps in the data engineering market?

I think DataOps has existed as a term, but I don't know of any real product that came at it the way we have. There's some work around observability; that's one part of DataOps, observing the health of your pipelines and so on, but it's only one small part. Scheduling frameworks like Dagster, Prefect, and Airflow try to solve another aspect of it, but they're pretty low-level. So I don't know if anyone before us has made anything like this.

That's why it's so groundbreaking. So without SQLMesh there are a bunch of DataOps difficulties: the learning curve, roles, the abstraction of technologies, and so on. Now I'd like to jump ahead and ask you to tell us about SQLMesh. What is the target idea of SQLMesh? Explain, for the data engineers and data people listening to us, what SQLMesh is.


So SQLMesh is a transformation framework. It allows you to write Python or SQL, and then we handle the rest: we handle the creation of the DAG, registration, and the scalability of all that. We think of ourselves as a next generation of dbt. dbt, as you're all familiar with, is a SQL transformation framework that makes it easier for analysts to write SQL and be productive. The idea for SQLMesh came about when I was at Airbnb. I had actually never used dbt, because dbt wouldn't have worked at a company like Airbnb, and I can explain why in a bit. But I saw a demo of dbt and I was just like, wow, a tool like this is very valuable; however, I wanted it to work at any company. So we built SQLMesh with that mindset: transformation frameworks are valuable, they make people more productive, but how can we do it in a way that works at any company, at any scale?

Wow, I mean, that's impressive.

Yeah. I guess let me explain why dbt wouldn't work at a company like that.

Perfect, yeah, that would be nice.

So if you just use dbt out of the box, you just do dbt run. Let's say you have a bunch of models and you want to make a change: the default approach is to do dbt run, and that's going to fully refresh everything, your whole project. Now, if you're at a company like Airbnb, you have maybe 30,000 models and terabytes or petabytes of data; you're not going to be able to refresh your whole warehouse. Certainly you can use a bunch of advanced features that almost no one uses, like state and defer, but that's very complex and error-prone. I wanted a person at these companies to be able to just make a change at any scale, and to be able to recreate a dev environment for very little cost. So the key innovation of SQLMesh is really the virtual environments. With a virtual environment, a user can make a change, we can understand the whole dependency graph of what has changed, and then change only that. We use an abstraction with views pointing to the physical layer, so that when you make a dev environment, it's complete: it has your whole warehouse, except for the changes you've made. Being able to do that quickly and cheaply is super powerful.
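As a rough sketch of the idea (all schema, table, and version names here are invented for illustration): each model change produces a new versioned physical table, and an environment is just a set of views pointing at the versions it should see, so creating a dev environment mostly means creating views rather than copying data.

```sql
-- Hypothetical illustration of virtual environments.
-- Physical layer: each model version is its own table.
--   sqlmesh_physical.events_daily__v1  (current production version)
--   sqlmesh_physical.events_daily__v2  (your dev change, newly backfilled)

-- The prod environment is a layer of views over physical tables:
CREATE VIEW prod.events_daily AS
SELECT * FROM sqlmesh_physical.events_daily__v1;

-- A dev environment reuses every unchanged model's physical table
-- and only points at new versions for the models you touched:
CREATE VIEW dev.events_daily AS
SELECT * FROM sqlmesh_physical.events_daily__v2;
```

Because the unchanged models are shared, the dev environment is complete but costs almost nothing to create.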

Yeah, it is indeed. Hopefully people can understand the complexity that lies underneath these layers and how you're making it easier for people to use. Plus, it's completely open source, which is mind-blowing.

Yeah. One more big thing with SQLMesh that I want to highlight is incremental models. Incremental models are a necessity at any company with scale; it's just unfeasible to re-backfill your whole warehouse every day. If you get views or impressions every day, you only want to process that. With dbt you can do incremental models, but they label it as a complex feature, and the reason it's complex as a dbt user is that it's up to you: you have to write an if/else macro (if incremental, do this; else, do that), and then you have to write your own subquery. Another big problem with the way dbt does incremental models is that it expects you to be able to run the whole data set fresh the first time, and that's actually not possible with a lot of the systems I've worked with: you have to batch up the incremental loads, maybe loading only one week at a time.
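The if/else pattern described here looks roughly like this in a dbt-style model (table and column names are invented for illustration):

```sql
-- Sketch of a dbt-style incremental model: the branching is up to the user.
SELECT user_id, event_date, COUNT(*) AS events
FROM {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- On incremental runs, the user writes their own subquery to find new rows.
  WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}
GROUP BY user_id, event_date
```

The first run still has to process the full history, which is the limitation Toby describes.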

And so one of the additional core things I wanted with SQLMesh was for you to be able to define an incremental model easily, and for SQLMesh to understand which intervals have run. If you're working with Airflow, for example: Airflow was designed this way, it's very partition-based, and so is SQLMesh. SQLMesh has a first-class understanding of time, and it tracks all of that. So when you define an incremental model in SQLMesh, you don't have to write any subqueries. You just write something like "select all from my table where the date is between the start and the end," and SQLMesh handles the start and the end. It keeps track of them, and from there it can parallelize everything, make sure there's no data leakage, and make sure everything is consistent.
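A minimal sketch of what such a model might look like (model, table, and column names are invented; @start_date and @end_date are, as I understand SQLMesh's model syntax, the macros it fills in with the interval being run):

```sql
MODEL (
  name analytics.events_daily,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  )
);

-- No hand-written subquery: SQLMesh substitutes the interval bounds
-- and tracks which date ranges have already been processed.
SELECT
  user_id,
  event_date,
  COUNT(*) AS events
FROM raw.events
WHERE event_date BETWEEN @start_date AND @end_date
GROUP BY user_id, event_date;
```

Because the framework owns the bounds, it can batch the initial backfill (say, a week at a time) and run missing intervals in parallel.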

Wow. So you basically have incremental as a first-class citizen of the entire process.

Exactly, yeah.

That's interesting. And as you said, I truly don't see production workloads that aren't incremental; the majority of data warehouses do incremental loads. So having this as a native, main capability is just truly outstanding.

Yeah. And people listening to us must be asking themselves... I have two questions for you. The first one is: how can I deploy this in my infrastructure, and how easy is it? And the other is something management always asks when we adopt something new: what culture changes must we be careful about when deploying, and how can we fully use the capabilities of the tool we're installing in our company?

Sure. SQLMesh is easy to use. Basically, you can just pip install it; it's all open source, and it's all free. You can get started really quickly because SQLMesh works with DuckDB as well. Now, if you want to use SQLMesh in a production setting, we support Snowflake, BigQuery, Databricks, and so on. And for the kind of enterprise solution, if you want to do it the way the top companies do it, we have first-class Airflow support: you can point your SQLMesh project at Airflow and have everything run through Airflow. Another big difference between SQLMesh and dbt is that SQLMesh treats Airflow as a first-class citizen. It's not just a black box where you do a SQLMesh run inside of a node; it actually understands your DAG and creates nodes inside Airflow, so you can manage things much better in Airflow, and you can hook up external dependencies much more easily. So that's how you get started; there's also documentation with a quick start and so on. As for your question about what cultural changes are needed:
I think there's a lot there. One part of it is that we have a CI/CD bot we plan to open source very soon, and that will make things much more robust for people; it's not just you running your project. We really want to tie your code to your data, so with the CI/CD bot we're trying to make it so that whatever is in main is a true reflection of what is actually in your data warehouse. That's one cultural shift. Another cultural shift, which I personally think is very important, is unit tests. SQLMesh has first-class unit tests: basically, you define fake data, or fixtures, in YAML, and then we run your existing models and verify them against a fixed output. These tests are very fast, and they validate business logic. They can run in your CI, so whenever someone makes a change, you can be more confident they haven't broken anything. dbt has what they call tests, but we call those audits: they're really just data quality checks, checking for nulls and so on. Those are good, and we have them as well, but they're for checking big things, making sure your data quality is good or your upstream hasn't changed; they don't validate business logic. So that's another big cultural shift we're trying to drive: the practice of actually testing data.

Yeah, and you consider this one of the main pillars of the entire SQLMesh paradigm, right? Not an additional step, but a main pillar.

Exactly, yeah.
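The YAML fixtures mentioned above might look something like this (model, table, and column names are invented, and the exact schema is a sketch based on my understanding of SQLMesh's test format):

```yaml
# Sketch of a SQLMesh-style unit test: fake inputs in, fixed output expected.
test_events_daily:
  model: analytics.events_daily
  inputs:
    raw.events:
      rows:
        - user_id: 1
          event_date: 2023-01-01
        - user_id: 1
          event_date: 2023-01-01
  outputs:
    query:
      rows:
        - user_id: 1
          event_date: 2023-01-01
          events: 2
```

Because the inputs are tiny and fixed, a test like this validates the model's business logic in milliseconds, unlike an audit that scans production data.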

I've seen a lot of posts going out about SQLMesh, and as you stated, it's a major breakthrough. It's a major thing for people to realize how cool it is to have their own dev data warehouse, to understand the difference between production and dev, to capture only the differentials, the DAG generation, and how it handles everything as a first-class citizen on Airflow. Considering other orchestration tooling besides Airflow, is there anything on the roadmap, like Dagster or something like that?

Yeah, we'd love to eventually integrate with Dagster and Prefect, but we're quite a small team, so it's a matter of priorities for us. The answer is yes, we want to support Dagster and Prefect, but we don't have the bandwidth right now.

Yeah. I also saw something pretty cool: you're building and giving the community a UI, right? We don't have this capability with dbt Core, for example; it's CLI-ish, which is not a problem for me, but data analysts, for example, would love to have a UI where they can type SQL and do all the work in one place. Can you tell us a little bit about this UI?

Yeah, so our UI is open source and free. The idea is that we wanted to make SQLMesh easy for everybody. We feel SQLMesh is powerful enough for the data engineer and the platform engineer; they're going to really love the features. But at the end of the day, the analyst just wants to run SQL and get their changes in, and the UI is there to make that an easy experience and to provide all the guardrails and safety features of SQLMesh in a way that's easy for them to interact with. So we felt the IDE was a very important part of our open source offering.

It is, it is. Truly, it is.

Yeah, great. And what can we expect as the next steps for SQLMesh this year and next? What are you thinking: this is a priority for us, we must have this feature this year or maybe next year, a new kind of deployment, whatever you're planning?

So in terms of the big features we're thinking about: my background, and a lot of the team's background, is in metrics and experimentation. The reason we did transformation and ETL first is that you need a solid transformation framework before you can do metrics, because metrics with bad data are useless. In the next six to nine months, we plan to expand SQLMesh to be the full metrics layer as well, and not just for business metrics. With my experience in experimentation, I really wanted to make sure SQLMesh computes metrics in a way that's compatible with inference and machine learning. Experimentation metrics are just much more complex than normal business metrics: you have to keep track of additional statistics, and you have to do things in a certain way. But if you can solve experimentation, then you can also solve analytics at the same time.

Okay, I see your point. So that's going to be the focus for Q1 and Q2 of this year, right?

This quarter is really about solidifying our product. We're brand new, we've only been at it for about seven months, so we want to make sure we can handle transformations well, and we'll be spending this quarter making it more robust. As an example, I just recently launched multi-repo support, so SQLMesh can work even if you have different projects in different repositories.

Wow, I didn't know that.

Oh yeah, I just added that. So if you have a really big project, you can now split it up. One of the really cool things about multi-repo projects in SQLMesh is that you can actually backfill other people's stuff, because it's a whole graph: even with multiple projects, they might depend on some of your stuff, and in order to get that full preview environment, you can't just build your own project, because then you'd be missing the downstream. So even though you don't have their project checked out, you can actually backfill it and make the world consistent.

Wow, okay. Meaning we can sync up, or line up, with the timeline, so everyone can be lined up and in charge of their downstream resources too. And honestly this is one of the biggest problems we see nowadays on the data engineering side: there are so many downstreams, so many people consuming that specific data set or that specific ETL or ELT you built, that sometimes it's pretty hard to keep up, and SQLMesh understands that, so it's pretty much magic. From what we have seen in the trenches, this is one of the key features missing from other technologies, even Airflow: you can see what is happening in the DAG, but you don't have clear visibility of the whole. SQLMesh sits on top of that, so you're able to see everything that is happening through that specific pipeline you establish in SQLMesh. Pretty cool.

I don't know if I talked about this, but one of the core things we do in order for that to actually work well is understanding the SQL and having this automatic categorization. Are you familiar with non-breaking and breaking changes?

Oh yeah, uh-huh.

Well, SQLMesh, as I said, has a first-class understanding of SQL, so we can actually understand the SQL. What SQLMesh does is parse the SQL and understand what it means. For example, let's say you have a model with many dependencies downstream, and you add a column to the model that no one else uses. This is a non-breaking change, because no one else is using your new column. If you did it naively and wanted to make things consistent, you would just backfill the whole world. But SQLMesh understands that all you've done is add a column, that you haven't changed anything else, and it can automatically categorize your change as non-breaking, so it won't backfill the downstream.

Whoa, automatically?

Automatically. But if you had a column that was one plus one and you changed it to one plus two, it would say, hey, that's a breaking change, and in order to be consistent you need to backfill yourself and anyone downstream.
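To make the two cases concrete, here is a sketch (table and column names invented) of the kind of edit that would be categorized each way:

```sql
-- Original model:
SELECT id, revenue FROM raw.orders;

-- Non-breaking edit: a new column is added and nothing downstream
-- selects it, so downstream models do not need a backfill.
SELECT id, revenue, currency FROM raw.orders;

-- Breaking edit: an existing expression changes, so this model and
-- everything downstream of it must be backfilled to stay consistent.
SELECT id, revenue * 2 AS revenue FROM raw.orders;
```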

My goodness. And this is open source! Oh my goodness. For people listening to us: it's complex to do this, right? It's pretty complex to do this type of stuff, and SQLMesh doing it transparently is just crazy. I feel like, oh my God, I have to stop everything I'm doing and fully test it, because I just got started with it and it worked well, and I was so shocked and impressed. But there's so much more to it, and the possibilities become endless. So yeah, it's amazing.

Yeah, before we go to the fourth block, I have one question about SQLMesh that we have to add here, and people must be thinking the same as myself. Toby, are there particular cases where a professional, a person, or a company comes to you with a use case that's not quite usual for SQLMesh, and you'd say: for this use case, you're not supposed to use SQLMesh? What's your recommendation on that?

SQLMesh is not really designed for low-latency, real-time applications. We support micro-batching, so if you want to run your pipelines every five minutes, we have support for that. But if you're looking for sub-second latency and building a real-time streaming application, you should not use SQLMesh.

Good, great. But yeah, five minutes... and here's a curiosity: I know Netflix is pretty much event-driven, real-time fashion, but I know they have big batch systems where they do a bunch of stuff. Across the companies you've gone through, like Airbnb and Netflix, and the customers you've been dealing with, how much would you say SQLMesh would fit, what percentage? Because I think five minutes is super decent, especially for backfilling or doing incremental loads on your data warehouse. Come on, five minutes!

So from my experience, if you can have micro-batching at five minutes, or even an hour of latency, it's fine. Basically every company has a batch computation framework; I haven't talked to many companies, or any at all yet, that do everything streaming only. It's just hard to manage, hard to backfill, hard to make changes. Even at Netflix, most of the analytics was done in batch; they only used streaming for things like real-time deployments and observability. And it was a challenge, because at Netflix you have this huge batch pipeline which is used to make real decisions, and then you have to duplicate all that logic in real time, and it gets inconsistent, because you can't do all the same things in real time, like joins. So honestly, I would avoid real time if you don't have to do it. I think a five-minute batch, even an hourly batch, is perfectly reasonable.

Yeah, I truly agree with that statement, because I see a lot of people talking about real time, and we do advocate for real time, but there are specific use cases for it. And honestly, I don't see any customer doing streaming that doesn't do batch; all of the customers initially do batch, and later they may or may not move into streaming. The reality, at least from my experience in the field, is that data analytics, data engineering, machine learning, and all the training happen on the batch side. You don't train a model in real time; you train it in batch, verify its quality, and then move to streaming. Streaming is becoming truly mainstream for some use cases, but I don't see a data warehouse being fed at millisecond latency, because you have to have data quality rules, you have to understand the data, you have to break it down into dimensions and facts, and there's a process that needs to happen gracefully, step by step. So batching at five minutes is very decent, and I think even one hour or more is still super decent for a data warehouse to keep up with the data.

Exactly.

Yeah, let's go to the fourth block and talk about Tobiko Data. Here's something I'd like to ask you: what was the main idea when you guys decided, okay, let's build this company? What are you targeting to do with your company?

We're looking to make data easy for everybody, at any scale, across those three vectors. So whether you have a lot of users, a lot of models, or a lot of data, and even if you don't, we want to be valuable. Because even if you're a small company with small data, SQLMesh is going to save you time. Maybe a full refresh only takes you five minutes, but that's still five minutes you're wasting every single time you do a full refresh. So we feel that if we can make that process easy and seamless, then everybody wins, except maybe for Snowflake, who's making a killing off of the inefficiencies of the current state of the
world.

Yeah, but you said something interesting there about small companies. Eventually these companies become medium-sized and then large, and guess what: they don't need to change their entire toolset, because if they're using SQLMesh it's transparent for them. You support the Airflow and Databricks APIs, BigQuery, and even if the customer wants to change between technologies, SQLMesh will do it transparently. So it's an easier choice for them: you don't have to change your whole chain of tools to adapt to new volumes of data, or to those three vectors that you said, right?

Right. I guess one thing that I

didn't mention, which you kind of brought up: if you want to change your stack, SQLMesh can do that, because SQLMesh is built on top of SQLGlot, which is my open-source SQL parser and transpiler. So if you write all your queries in BigQuery syntax, we can seamlessly transpile them and run them on Snowflake, or whatever. That helps with that as well.

That's insane! So it understands that you changed the technology and then adapts the SQL to run on that specific engine?

That's right.

Goodness.

And that started because, at Netflix and Airbnb, everyone was using Spark for the batch pipelines, but we used Presto and Druid for quick querying. The data scientist doesn't want to have to write the query twice, and that's really why I started SQLGlot: to make that transition seamless.

Whoa, I didn't know that as well. Yeah, wow. What a class! And what are the products that you guys have? What type of services are you offering your customers? Can you tell us a little bit: besides SQLMesh, what other products do you have, and what do you offer as a service?

So, we just started seven months ago. Our two products are SQLMesh and SQLGlot. We've got a lot of work to do on SQLMesh, so we're really going to be focusing on making that good, and on building out a cloud enterprise product for SQLMesh.

Yeah, amazing. And I also saw

that you guys started your journey pretty much trusting the community, receiving lots of comments and feedback from it. Tell me a little bit about this experience. I know you already had some open-source products, but what is the feeling of working so close to the community?

Yeah, it's

great. As a company, and personally, I really believe in open source. Even before SQLMesh, before SQLGlot, I actually made a board game website called 18xx.games, this hardcore economic train simulation game. It's open source, and that project was really successful because the community got involved: I have dozens and dozens of contributors creating many, many games. So I really believe in the open source model. We have only been at it for a couple of months, as I said, and we wanted to launch SQLMesh as fast as possible, because I wanted to make sure the community could see it, play with it, and give us feedback. I don't want to build something in isolation for two years and then come out with something that no one wants. I want people to criticize it; I want people to tell me I'm wrong. That's how we're going to grow, so community is a huge part of it. It's everything: everything the community asks, we're going to listen to. And one of our philosophies in dealing with the community is to be very responsive: if you open an issue on GitHub, we will respond very quickly and try to address it. That's

nice, that's nice. And I think, as my last question for today's episode: what do you envision next for data engineering, for the data teams spectrum? I know SQLMesh is going to play a major role; that's the intent of having SQLMesh out there, and I truly think so, honestly. So first, thank you so much for accepting to be on the podcast, because I saw SQLMesh and I understood the idea, and I said: we definitely have to open this up, not only to the English-speaking audience but also to Portuguese speakers, which is a big community. Trust me, it's big here. Sometimes it doesn't get much visibility, but I can tell you that South America is gigantic in terms of products, adoption, and community. What we usually do is speak in English, write a summary in Portuguese, translate some titles into Portuguese, and release that to our community, so they can listen to you, the creator, talk about why SQLMesh is important and the companies that have adopted it. So what would be your feeling about the next two to three years in the data spectrum? What do you think is going to

happen?

It's really tough.

Right, but what do you foresee?

I'm not very good at making these predictions.

Well, you can try now, and then after two years we can just listen again.

I think data will continue to be fragmented; I think there will be many tools for everything, because data is complicated and not every solution is going to work for every company. It's not one-size-fits-all. I think Databricks and Snowflake will continue to be big players in the space. And the next big thing that I'm looking into is really fast analytic engines, like StarRocks, ClickHouse, Doris, etc., making aggregations and joins really fast.

I don't know, man.

No, those are fine, those are fine. Don't get me wrong: I still think Snowflake and Databricks are going to be big. We have seen a lot of customers moving away from the raw cloud provider stack and moving to a data platform like Snowflake or Databricks. But also, with the SQLMesh release, I'm seeing for the very first time in the data engineering ecosystem a tool that can reduce the complexity of the tooling and focus on what needs to be done. dbt started with that, in my opinion; I think they made a pretty nice shot at reducing a little of the friction for analysts, or people who write SQL and now eventually Python, but it doesn't offer the whole, complete solution. So SQLMesh goes on top of that and offers this amount of

stuff.

A question regarding that: are you guys targeting ClickHouse on the integration side? Because that's something I'm looking for, to be honest. Is it on the roadmap for this year or next year?

ClickHouse is not on the roadmap, and people don't really do ETL in ClickHouse. However, I could see ClickHouse being part of the semantic layer: basically having SQLMesh control the ETL of your transformations, but then storing the results in ClickHouse and providing an abstraction on top of ClickHouse to get summary statistics very quickly. So that in particular I could see, but I don't really see using ClickHouse as an ETL engine. Does

that make sense?

No, yeah, it does, it does. I was looking at it more as a data warehousing system where you can just throw queries at it, not for the ETL portion. Because, for example, for customers that are open-source-only, we don't have Snowflake or Databricks SQL, and one of the good things we can install and deploy on Kubernetes is ClickHouse, because it offers both batch and real time and pretty nice ANSI SQL standard support, which Pinot or Druid don't offer; they're a little bit limited, and you would have to add Trino, for example, to do this, and then we end up with two deployments. So I think ClickHouse is going to be one of the things that gains traction, in my opinion.

So, if you're into this space, you should really check out StarRocks. From all of the benchmarks that I did working at Airbnb with my team, StarRocks just blows ClickHouse out of the water.

Really? Wow.

Yeah, it's not even close. ClickHouse doesn't have a cost-based optimizer, or it didn't want to use one, so you had to order your joins manually, and if you're joining many big tables together, join order is very important. StarRocks has a cost-based optimizer and can join things much faster. And in terms of ingest, ClickHouse couldn't ingest Airbnb's data; it didn't scale. StarRocks did it in a couple of


hours.

Oh my goodness.

Additionally, ClickHouse's SQL syntax is too obscure; it's not standard at all. For example, they have a lot of functions that need to be case-sensitive, which is really annoying, and they have a bunch of hints that you have to put in your SQL to tell ClickHouse to do the right thing. All of those reasons pushed Airbnb to basically eliminate ClickHouse from

contention.

Okay, that's pretty nice info that we didn't know. Actually, to be honest, we're usually pretty good at knowing the whole ecosystem, but I had never heard about StarRocks. That's good to know. Interesting. It's open source, right?

Yeah, it is. I think they recently moved from the Elastic license to Apache. You probably haven't heard of it because it's a Chinese company, and they were based on another Apache project called Doris.

Oh yeah, I know Doris, I know Doris! I actually tried to find someone to speak about Doris, but it turned out I couldn't find anyone on LinkedIn, alive or something like that. It's like an analytics-focused version of MySQL, right? And it works pretty fantastically.

Yeah, and StarRocks is a fork of Doris.

Oh, okay, interesting. Gotcha. Okay, good to know.

Yeah, and I think that's pretty much what we have. I just want to thank you so much, Toby, for your time; I know you have a lot of things on your plate. Hopefully we'll see more and more releases coming from SQLMesh. We're eager to do some YouTube lives, demonstrations, and workshops about SQLMesh on our end, and once we do, I'm obviously going to ping you and let you know, because I have no doubt that people are going to love it and the adoption is going to be massive, because people will understand how easy it is to use and how much complexity it takes off the data engineering burden. It just removes the burden from the data engineers. Thank you so much; I'm going to pass it over.

Yeah, what a class! I can say: what a great time. Thanks, Toby, for sharing your knowledge and bringing so much content to the table. I hope to see more about SQLMesh on the next episodes.

Me too. Thanks a lot for having me; it was great chatting with you.

Yeah, well, we'll see if it takes off. Let's see. Thank you so much, man. Thank you.
