AWS Glue job fails with a connection timeout error

2023-09-03 10:54:13 Author: 该网名不对傻逼显示

I am new to AWS Glue. I have created a job that reads two Data Catalog tables and runs a simple SparkSQL query on them. The job fails at the transform step with this exception:

pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to glue.us-east-1.amazonaws.com:443 [blah] failed: connect timed out;'

The security group of the private network that hosts the JDBC source (Redshift) has both inbound and outbound rules configured.

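I saw another post on SO about configuring a VPC endpoint for Glue, but I did not quite understand what it should look like. Should it be an Interface endpoint for glue.us-east-1.amazonaws.com:443, or something else? I am confused.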

Update: the auto-generated PySpark script

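import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Note: SqlQuery0 and sparkSqlQuery (used below) are not shown in this excerpt.
## @params: [TempDir, JOB_NAME]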
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0")
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1"]
## @return: DataSource1
## @inputs: []
DataSource1 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1")
## @type: SqlCode
## @args: [sqlAliases = {"messages": DataSource1, "conversations": DataSource0}, sqlName = SqlQuery0, transformation_ctx = "Transform0"]
## @return: Transform0
## @inputs: [dfc = DataSource1,DataSource0]
Transform0 = sparkSqlQuery(glueContext, query = SqlQuery0, mapping = {"messages": DataSource1, "conversations": DataSource0}, transformation_ctx = "Transform0")
job.commit()

Recommended answer

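I was able to resolve this issue: a VPC endpoint is indeed required. In addition, the connection should use a private subnet that has a NAT gateway. My initial subnet had no NAT.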
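
To confirm whether a subnet can reach the Glue API at all, a quick TCP probe can help. The following is a minimal sketch, not part of the original answer: the helper name `can_reach` and the 5-second default timeout are assumptions, and it only checks that a TCP connection opens, not that the Glue API call itself would succeed.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connect timeouts, refused connections, and DNS failures.
        return False

# Example call, using the host from the error message in the question:
#   can_reach("glue.us-east-1.amazonaws.com")
```

Run from a host in the same subnet as the Glue connection, a `False` result would be consistent with the "connect timed out" error above and with a missing VPC endpoint or NAT route.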

Example VPC endpoint configuration in Terraform:

resource "aws_vpc_endpoint" "glue" {
  vpc_id            = var.vpc_id
  service_name      = var.glue_vpc_service_name
  vpc_endpoint_type = "Interface"

  security_group_ids = var.security_group_ids
  subnet_ids         = var.subnet_ids

  tags = { mytag = "mytag" }
}
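
The answer does not show the value of `var.glue_vpc_service_name`. For Glue interface endpoints, AWS names the service `com.amazonaws.<region>.glue`, so for the region in the question it would presumably be `com.amazonaws.us-east-1.glue`. As a rough Python equivalent of the Terraform resource above, the sketch below builds the request parameters for the EC2 `create_vpc_endpoint` call. The helper name `glue_endpoint_request` and the IDs in the usage comment are hypothetical placeholders.

```python
def glue_endpoint_request(region, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for EC2 create_vpc_endpoint, mirroring the Terraform resource."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Assumed service name pattern for Glue interface endpoints.
        "ServiceName": f"com.amazonaws.{region}.glue",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }

# Usage with boto3 (requires AWS credentials; all IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   ec2.create_vpc_endpoint(**glue_endpoint_request(
#       "us-east-1", "vpc-0123456789abcdef0",
#       ["subnet-0123456789abcdef0"], ["sg-0123456789abcdef0"]))
```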