String property max chars when using MLMD with PostgreSQL #194

tarilabs · 2024-03-03T15:51:59Z

When you attempt to store a string based MLMD property backed by MySQL, MEDIUMTEXT is max ~16 M chars; the indexes make use of the first 255 chars (here) and you can store "~arbitrarily long text" in those string based MLMD properties.
This is working fine.
When instead you attempt to store a string based MLMD property backed by PostgreSQL, TEXT is max ~536 M chars; however, the index by default would limit you to merely ~2690 characters because is not taking the substring approach above (here) and as a net result because of default's btree limit (here) you can store at max ~2690 chars.

To me, this seems to indicate a potential typo/overlook/misconfiguration of the PostreSQL index, because as of today can store in practice only 2k chars of text when MLMD backed by PostgreSQL, when practically MLMD backed by MySQL you could do 1000x magnitude.

Potential solutions provided in PR: #195

The text was updated successfully, but these errors were encountered:

tarilabs · 2024-03-03T16:05:32Z

Reproducer

With PostgreSQL ready, for instance:

docker run --name some-postgres \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=mypsw \
-e PGUSER=postgres \
-e PGPASSWORD=mypsw \
-p 5432:5432 \
-d "postgres:12"

Run the following Python script:

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2
from faker import Faker
fake = Faker()

def create_store():
    connection_config = metadata_store_pb2.ConnectionConfig()
    connection_config.postgresql.host = 'localhost'
    connection_config.postgresql.port = '5432'
    connection_config.postgresql.user = 'postgres'
    connection_config.postgresql.password = 'mypsw'
    connection_config.postgresql.dbname = 'pgserver'
    # connection_config.postgresql.skip_db_creation = 'false'
    # connection_config.postgresql.ssloption.sslmode = '...' # disable, allow, verify-ca, verify-full, etc.
    # connection_config.postgresql.ssloption.sslcert = '...'
    # connection_config.postgresql.ssloption.sslkey = '...'
    # connection_config.postgresql.ssloption.sslpassword = '...'
    # connection_config.postgresql.ssloption.sslrootcert = '...'
    store = metadata_store.MetadataStore(connection_config, enable_upgrade_migration=True)
    return store

def create_ctx_type(store):
    experiment_type = metadata_store_pb2.ContextType()
    experiment_type.name = "Experiment"
    experiment_type.properties["note"] = metadata_store_pb2.STRING
    experiment_type_id = store.put_context_type(experiment_type)
    results = store.get_context_types()
    print(results)
    return experiment_type_id

def working(experiment_type_id):
    my_experiment = metadata_store_pb2.Context()
    my_experiment.type_id = experiment_type_id
    my_experiment.name = "exp1"
    my_experiment.properties["note"].string_value = "My first experiment."
    [experiment_id] = store.put_contexts([my_experiment])
    print(experiment_id)
    result = store.get_contexts_by_id([experiment_id])
    print(result)

def lorem(max_chars):
    result = ""
    while len(result) < max_chars:
        result += fake.sentence() + " "
    return result[:max_chars]

def not_working(experiment_type_id):
    my_experiment = metadata_store_pb2.Context()
    my_experiment.type_id = experiment_type_id
    my_experiment.name = "exp_longnote"
    my_experiment.properties["note"].string_value = lorem(2690) # up to ~2680 works
    [experiment_id] = store.put_contexts([my_experiment])
    print(experiment_id)
    result = store.get_contexts_by_id([experiment_id])
    print(result)

if __name__ == '__main__':
    store = create_store()
    experiment_type_id = create_ctx_type(store)
    working(experiment_type_id)
    not_working(experiment_type_id)

Resulting in:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0303 16:19:42.571434 13378 postgresql_metadata_source.cc:208] Connecting to database. 
I0303 16:19:42.576880 13378 postgresql_metadata_source.cc:215] Connection to database succeed.
I0303 16:19:43.322311 13378 postgresql_metadata_source.cc:208] Connecting to database. 
I0303 16:19:43.327381 13378 postgresql_metadata_source.cc:215] Connection to database succeed.
[id: 10
name: "Experiment"
properties {
  key: "note"
  value: STRING
}
]
1
[id: 1
type_id: 10
name: "exp1"
properties {
  key: "note"
  value {
    string_value: "My first experiment."
  }
}
type: "Experiment"
create_time_since_epoch: 1709479183548
last_update_time_since_epoch: 1709479183548
]
E0303 16:19:43.555275 13378 postgresql_metadata_source.cc:128] Execution failed: ERROR:  index row size 2712 exceeds btree version 4 maximum 2704 for index "idx_context_property_string"
DETAIL:  Index row references tuple (0,2) in relation "contextproperty".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
WARNING:absl:mlmd client InternalError: PostgreSQL metadata source error: ERROR:  index row size 2712 exceeds btree version 4 maximum 2704 for index "idx_context_property_string"
DETAIL:  Index row references tuple (0,2) in relation "contextproperty".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

Traceback (most recent call last):
  File "/home/tarilabs/git/demo20231009-mlmdpg/python/test3.py", line 61, in <module>
    not_working(experiment_type_id)
  File "/home/tarilabs/git/demo20231009-mlmdpg/python/test3.py", line 52, in not_working
    [experiment_id] = store.put_contexts([my_experiment])
  File "/home/tarilabs/git/demo20231009-mlmdpg/venv/lib64/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py", line 520, in put_contexts
    self._call('PutContexts', request, response)
  File "/home/tarilabs/git/demo20231009-mlmdpg/venv/lib64/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py", line 203, in _call
    return self._call_method(method_name, request, response)
  File "/home/tarilabs/git/demo20231009-mlmdpg/venv/lib64/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py", line 224, in _call_method
    self._pywrap_cc_call(cc_method, request, response)
  File "/home/tarilabs/git/demo20231009-mlmdpg/venv/lib64/python3.10/site-packages/ml_metadata/metadata_store/metadata_store.py", line 255, in _pywrap_cc_call
    raise errors.make_exception(error_message.decode('utf-8'), status_code)
ml_metadata.errors.InternalError: PostgreSQL metadata source error: ERROR:  index row size 2712 exceeds btree version 4 maximum 2704 for index "idx_context_property_string"
DETAIL:  Index row references tuple (0,2) in relation "contextproperty".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

Demonstrating is potentially a typo/overlook/misconfiguration of the Index which is limiting the amount of chars which can be stored in the MLMD string property:

PostgreSQL metadata source error: ERROR: index row size 2712 exceeds btree version 4 maximum 2704 for index "idx_context_property_string"
DETAIL: Index row references tuple (0,2) in relation "contextproperty".
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

tarilabs mentioned this issue Mar 3, 2024

fix postgresql string property index #195

Open

tarilabs mentioned this issue Mar 8, 2024

kubeflow pipeline add support of postgresql kubeflow/pipelines#7512

Open

zijianjoy mentioned this issue Mar 8, 2024

Tracking: PostgreSQL support in KFP kubeflow/pipelines#9813

Open

29 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String property max chars when using MLMD with PostgreSQL #194

String property max chars when using MLMD with PostgreSQL #194

tarilabs commented Mar 3, 2024 •

edited

Loading

tarilabs commented Mar 3, 2024

String property max chars when using MLMD with PostgreSQL #194

String property max chars when using MLMD with PostgreSQL #194

Comments

tarilabs commented Mar 3, 2024 • edited Loading

tarilabs commented Mar 3, 2024

Reproducer

tarilabs commented Mar 3, 2024 •

edited

Loading