1.0: (adaptive classifier-free guidance; an extra camera-free branch for the input image; higher triplane resolution)

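Summary:
- Retrains Zero123++ at larger scale, with a different camera-angle setup; adaptive classifier-free guidance (larger CFG scale for the front view and for early steps)
- Adds an extra branch for the input img with all-zero cam embeddings to fuse in its features
- Uses a linear-complexity method to raise the triplane resolution (avoid self attention on higher-resolution triplane tokens)
- (For text input: first generate an img from the text, then everything goes through img to 3D)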
intro
- Still the multiview diffusion + LRM route.
- Problems addressed:
  - multiview inconsistency,
  - reliance on known poses or views
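Method:
- Multiview Diffusion:
  - Basics: Zero123++ scaled up, with a changed angle setup
    - Still retrained from Zero123++
      - Note the camera angles of Zero123++ and InstantMesh (elevation is absolute):
        [image]
      - They include side views but no front view.
    - Scaling up: larger parameters, larger dataset
    - Angles: elevation 0; azimuth 0, 60, 120, 180, 240, 300
      - Includes the front view but no side views.
      - The claim is that elevation 0 maximizes the visible area across views. Hmm. So top and bottom views are given up?
    - Resolution: lite stays at 320×320; standard goes further, up to 512.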
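  - Adaptive Classifier-free guidance (larger CFG scale for the front view and for early steps)
    - Observation: a larger CFG scale gives better geometry but worse texture; the front view gets higher fidelity while the back views get darker
    - Hence: use a larger CFG scale for the front view and for early denoising steps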
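The adaptive guidance idea in the notes above can be sketched as a per-view, per-step scale schedule. This is only a minimal illustration: the function name `adaptive_cfg_scale`, the cosine falloff with azimuth, the linear decay over steps, the averaging of the two weights, and the scale range `s_min`/`s_max` are all assumptions, not the schedule actually used by Hunyuan3D 1.0.

```python
import math

def adaptive_cfg_scale(azimuth_deg, step, num_steps, s_min=2.0, s_max=7.0):
    """Hypothetical schedule: larger scale for views near the front (azimuth 0)
    and for early denoising steps. All constants are illustrative assumptions."""
    # View weight: 1.0 at the front view, 0.0 at the back view.
    view_w = 0.5 * (1.0 + math.cos(math.radians(azimuth_deg)))
    # Time weight: 1.0 at the first step, 0.0 at the last step.
    time_w = 1.0 - step / max(num_steps - 1, 1)
    # Average the two weights so the scale is monotone in both (assumption).
    return s_min + (s_max - s_min) * 0.5 * (view_w + time_w)

def apply_cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance combination of two noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With this sketch, the front view at the first step receives `s_max`, the back view at the last step receives `s_min`, and everything else falls in between.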
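- Sparse-View Reconstruction (LRM part)
  - Hybrid inputs: use both the input img and the generated multiview imgs (methods that work with relative angles don't actually have this problem)
    - For the input img, add a dedicated branch with unknown camera angle to fuse in its information (i.e. the camera embedding is set to all zeros)
  - SR (super-resolution)
    - Uses a linear-complexity method to raise the triplane resolution (avoid self attention on higher-resolution triplane tokens)
    - Starts at 64×64×1024; a linear layer upsamples each 1×1 token into 4×4 tokens, giving 256×256×120
  - 3D Rep: SDF + MC (Marching Cubes) + UV unwrapping (isn't this a bit primitive?? InstantMesh already uses FlexiCubes)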
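The triplane upsampling step can be sketched in NumPy as follows. It only illustrates the shape bookkeeping from the notes (one low-resolution token decoded into a 4×4 block by a per-token linear layer, with no attention at the higher resolution). The function name `upsample_triplane`, the channels-last layout, and the ordering of values inside each 4×4 block are assumptions.

```python
import numpy as np

def upsample_triplane(tokens, weight, bias, factor=4):
    """tokens: (planes, H, W, C_in); weight: (C_in, factor*factor*C_out);
    bias: (factor*factor*C_out,). Returns (planes, H*factor, W*factor, C_out)."""
    p, h, w, _ = tokens.shape
    c_out = weight.shape[1] // (factor * factor)
    # Per-token linear layer: cost grows linearly with the number of tokens.
    x = tokens @ weight + bias                     # (p, h, w, f*f*c_out)
    x = x.reshape(p, h, w, factor, factor, c_out)  # split out the f x f block
    x = x.transpose(0, 1, 3, 2, 4, 5)              # (p, h, f, w, f, c_out)
    return x.reshape(p, h * factor, w * factor, c_out)
```

With the sizes in the notes, `tokens` would have shape `(3, 64, 64, 1024)` and `weight` shape `(1024, 4 * 4 * 120)`, producing an output of shape `(3, 256, 256, 120)`.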

2.0: geometry via Hunyuan3D-DiT + texture via Hunyuan3D-Paint (albedo)

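Hunyuan3D-DiT: a standard image-conditioned DiT (Diffusion Transformer) that works in latent space. The latent is trained from point clouds (using both uniform and importance sampled points). The mesh representation is SDF + Marching Cubes.
Hunyuan3D-Paint: the inputs are the img (after delighting), multiview normal maps and multiview position maps; SR is then applied to the outputs.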
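- Double-stream Image Conditioning Reference-Net:
  - The first stream uses the VAE features directly, with the time step set to 0
  - The second stream keeps the SD weights frozen.
- Texture Baking (how do the multiview imgs become the texture of the 3D mesh?)
  - Dense-view inference: as far as I can tell, each training step randomly picks 6 of the 44 pre-set views to output and train on, so that at inference all 44 views can be generated
  - Super resolution is applied to each output multiview img individually
  - Texture inpainting (neighbor diffusion, weighted sum): a UV-space pixel (texel) with no assigned color is filled with a weighted sum over the colored neighbors of its corresponding 3D point.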
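The texture inpainting step described above can be sketched as a nearest-neighbor weighted sum in 3D. This is a brute-force illustration under stated assumptions: the function name `inpaint_texels`, the choice of `k` nearest colored neighbors, and the inverse-distance weights are hypothetical, since the notes only say "weighted sum of neighbors".

```python
import numpy as np

def inpaint_texels(positions, colors, has_color, k=4, eps=1e-8):
    """positions: (N, 3) 3D point of each texel; colors: (N, 3);
    has_color: (N,) bool mask of texels that already received a color."""
    out = np.asarray(colors, dtype=np.float64).copy()
    has_color = np.asarray(has_color, dtype=bool)
    known = np.flatnonzero(has_color)
    missing = np.flatnonzero(~has_color)
    if known.size == 0:
        return out  # nothing to propagate from
    k = min(k, known.size)
    for i in missing:
        # Distances in 3D (not in UV space) to every colored texel.
        d = np.linalg.norm(positions[known] - positions[i], axis=1)
        nearest = np.argsort(d)[:k]
        w = 1.0 / (d[nearest] + eps)  # inverse-distance weights (assumption)
        w /= w.sum()
        out[i] = w @ out[known[nearest]]
    return out
```

Measuring distance in 3D rather than in UV space matters because texels that are adjacent on the surface can sit on different UV islands.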
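Preprocessing:
- Image Delighting: trained with fully supervised learning on a large-scale dataset.
- View Selection:
  - Compute the information gain of each view and select greedily (first fix front, back, left and right, then pick the views that cover as many unseen regions as possible)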
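The greedy view selection can be sketched as a set-cover style loop. Here "information gain" is approximated by the number of newly covered surface elements, which is an assumption. The function name `select_views` and the dictionary-of-sets input format are likewise hypothetical.

```python
def select_views(visible, fixed, num_views):
    """visible: dict mapping view id -> set of surface element ids seen from it.
    fixed: views that are always kept (e.g. front, back, left, right).
    Returns the selected view ids in the order they were chosen."""
    selected = list(fixed)
    covered = set()
    for v in selected:
        covered |= visible[v]
    while len(selected) < num_views:
        candidates = [v for v in visible if v not in selected]
        if not candidates:
            break
        # Gain of a view = number of elements it sees that are still unseen.
        best = max(candidates, key=lambda v: len(visible[v] - covered))
        if not visible[best] - covered:
            break  # no remaining view adds unseen area
        selected.append(best)
        covered |= visible[best]
    return selected
```

For example, with `fixed=["front", "back", "left", "right"]`, the loop keeps adding whichever remaining view uncovers the most unseen elements until the view budget is reached or nothing new can be covered.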

2.1: Paint adds material support (PBR, Physically-Based Rendering)

The material appears to consist of two terms, metallic and roughness. (So generation produces not only albedo but these two as well.)

2.5: new shape generator LATTICE

The geometry is much more refined:
[image]

Detailed Shape Generation: LATTICE
- A diffusion model; the input is single or 4 view images
- Key points:
  - scaling up
  - also uses guidance and step distillation to reduce inference time
Texture
- extends 2.1
- inherits 3D-aware RoPE to enhance cross-view consistency
- multi (dual)-channel attention mechanism to ensure spatial alignment
  - both albedo and MR (metallic-roughness) use the albedo attention mask