Jyve: Company Background

Jyve serves the Grocery and CPG industry by connecting them with a pool of skilled labor professionals, called Jyvers, who are trained to complete jobs like restocking shelves and auditing goods for expiry. As a gig-economy platform, Jyve understands that success lies in making both its partners and Jyvers happy and that data is an invaluable tool to gain insight into its marketplace dynamics. Focusing on the needs of its partners and Jyvers represents a mission that is unique to Jyve, so when Jyve’s tech team was beginning to grow its Data team it made sense to find a solution that would allow the company to structure its resources with a focus on these Jyve-specific challenges.

Early on into our data journey at Jyve, we decided that Datacoral’s Data Engineering Platform would give data scientists like myself and data analysts the ability to quickly set up data pipelines without having to worry about the plumbing itself. We leveraged our domain-specific knowledge to build pipelines that drove reporting from various SaaS services, such as Zendesk and Fountain, as well as Jyve’s own production database. Our team was able to achieve this without having to be involved with the intricacies of managing the company’s data infrastructure stack.

Challenge

As with all marketplaces, understanding liquidity, or the efficiency with which a marketplace matches buyers (Jyve’s partnered brands and retailers) and sellers (Jyve’s contractors aka Jyvers), is essential for a successful platform, and therefore, is always continuously improved as more information becomes available. With newly warehoused mobile data in Segment, Jyve looked to better its knowledge of the dynamics in its marketplace. More specifically, the goal was to determine the geographical density of Jyvers in relation to the stores listed, so that Jyve could better serve stores that were historically less staffed and also alert nearby Jyvers as jobs opened up. This project would require a pipeline that would not only be dynamic enough to react to changes in user behavior over time, but would also be able to distill potentially millions of data points for a single user into one value, allowing downstream teams to access this without having to perform complex geo-calculations every time. Without leveraging mobile data, reporting on user behavior was reliant on self-reported addresses, brittle to edge cases in which users were looking for work in an area away from their permanent mailing address, like college students.

Solution

With aggregating millions of geo-points being too complex to handle in a SQL environment alone, Datacoral recommended its Batch Compute feature which allows users to create and run a Python User Defined Function (UDF). The feature would work as follows:

  • Jyve’s data science to provide a SQL query to fetch data from Redshift
  • This would be converted into a Pandas DataFrame and serve as the input to a Python UDF provided by Jyve
  • The UDF would return another DataFrame that would be written as a table in Redshift
Datacoral provides data infrastructure essentials so that its customers can focus on doing what they do best: applying domain-specific knowledge in creating customer-specific solutions.

With Datacoral taking care of the data infrastructure (running the UDFs in a container that would run Python), Jyve could then focus on writing the appropriate input SQL script taking into account business logic, such as removing user geo-points when users were working on the job, and applying geospatial indexing techniques for grouping coordinates in its Python script. Better yet, deploying this pipeline on Jyve’s side was as simple as writing the appropriate SQL and python scripts (examples below) and pushing the files with a few Datacoral CLI commands. This was made self-serve with clear instructions available in the documentation. With Batch Compute materialized views already compatible with existing Datacoral DAGs, querying from the distilled user location data alongside production data unlocked a myriad of opportunities for downstream users to perform complex geolocation analysis without having to wrangle the upstream mobile data, opportunities such as correlating the density of Jyvers around a kilometer radius of a store to increased spend or improving on already existing notification ranking model.

  • Input Redshift Table: app_location (Segment to Redshift)
  • Input Redshift Table: store (synced from Postgres by Datacoral)
  • Input SQL query to join the tables above:
SELECT
    app_location.id,
    app_location.user_id,
    app_location."timestamp",
    app_location.latitude,
    app_location.longitude
FROM
    app_location -- only for users with location event in last 28 days
    JOIN
         (SELECT DISTINCT
              user_id
          FROM
              app_location
          WHERE
              "timestamp" BETWEEN current_timestamp - 28 
                          AND current_timestamp) tab_user
         ON tab_user.user_id = app_location.user_id
-- only location points where the user was not at a store
    JOIN store
         ON ROUND(app_location.latitude, 4) != ROUND(store.latitude, 4) AND
            ROUND(app_location.longitude, 4) != ROUND(store.longitude, 4)
-- only location points whose time is where a user is likely to be at home
WHERE
    (date_part('hour', timezone(context_timezone, "timestamp")::TIMESTAMP) >= 21 OR
     date_part('hour', timezone(context_timezone, "timestamp")::TIMESTAMP) <= 9);
  • Input Python script for grouping geo-locations: udf.py
import pandas as pd
import numpy as np
from h3 import h3


# get geospatial index given coordinates and radius size
def create_h3_centers(lat, lon, r):

    h3_address = h3.geo_to_h3(
        lat,
        lon, r)  # lat, lng, hex resolution

    return h3_address

# UDF input requires data and params as arguments
def transform(data, params):

    data.loc[:, ['latitude', 'longitude']] = data.loc[:,
      ['latitude', 'longitude']
    ].astype(float)

    # create lookup table of unique coordinates
    h3_lookup = data[[
        'latitude', 'longitude']].drop_duplicates()

    # return center coordinate for geospatial index for every coordinate in lookup
    h3_lookup['h3_9'] = h3_lookup.apply(lambda row: create_h3_centers(
        row['latitude'], row['longitude'], 9), axis=1)

    h3_lookup['h3_coord'] = h3_lookup.apply(lambda row:
        h3.h3_to_geo(create_h3_centers(
            row['latitude'], row['longitude'], 9)), axis=1)

    h3_lookup['h3_latitude'] = h3_lookup['h3_coord'].apply(lambda x: x[0])
    h3_lookup['h3_longitude'] = h3_lookup['h3_coord'].apply(lambda x: x[1])

    # join geospatial index from lookup table to user coordinates
    data = data.merge(
        h3_lookup, on=['latitude', 'longitude'])

    # output a table containing user_id, derived home coordinates,
    # and number of geolocation events used in calculation
    user_location_df = h3_lookup[['h3_9', 'h3_latitude', 'h3_longitude']] \
        .drop_duplicates() \
        .merge(data \
            .groupby(['h3_9', 'user_id']) \
            .size() \
            .rename('events') \
            .reset_index() \
            .sort_values(by=['user_id', 'events'], ascending=[0, 0]) \
            .drop_duplicates(subset=['user_id'], keep='first') \
        )

    return user_location_df[[
        'user_id',
        'h3_latitude',
        'h3_longitude',
        'events']]

Output Redshift table: user_location

Coordinates can be bucketed by hexagonal geospatial indexes to determine density in an area, such as the location of cars as in a city. Source

Result

With the Batch Compute stack running successfully, Jyve now has a clearer lens into how its Jyvers interact on its platform. Not only does Jyve benefit from being able to accurately quantify Jyver density in its marketplace, but so do its Jyvers, as they can now be more appropriately notified when jobs become available in their area or recommended to take certifications for brands that most jobs close to them require. This, in turn, helps Jyve’s partnered customers (brands and retailers) as the platform can recommend optimal areas to expand in, where additional jobs can be met with the capacity to fill them. Datacoral’s Batch Compute transformation has certainly supported Jyve in transitioning to the next level of becoming a more data-informed marketplace that elevates its Jyvers and clients by allowing both sides to tap into nearby and available opportunities.

About Datacoral

Datacoral is a Data Engineering Platform that allows users to quickly build data pipelines in the cloud. Customers can build observable data pipelines with no-code/low-code connectors to read any data source and write to any data lake or warehouse, SQL to specify multi-level transformations in multiple data lakes and data warehouses, and connectors to publish data to applications. Datacoral offers unprecedented security and scalability through its cloud-prem software delivery model powered by serverless microservices.

Reach out to us at hello@datacoral.co or https://datacoral.com/request-info to learn more about our Batch Compute transformations and the overall data engineering platform.